## Notebook to scrape the needed data from the NSIDC 0630 database
#### Looking to take 19 GHz (6.25km) and 37 GHz (3.125km) data from 1987 - 2011
- Running this script downloads all the needed files from the directory for the 19 GHz (6.25km) and the 37 GHz (3.125km) Northern Hemisphere morning images
- **It is important to remember that not every sensor (e.g. F11, F13) is available in every year**
    - If you intend to download many years, pay attention to which sensors you need for each time period

In [1]:
%cd /Users/williamnorris/SWE_Clean/data

/Users/williamnorris/SWE_Clean/data


In [2]:
%pylab notebook
import subprocess
import shlex
import pandas as pd
import numpy as np
# datetime class, used below for strptime and for building the date range endpoints
from datetime import datetime

Populating the interactive namespace from numpy and matplotlib


**Sensor Years**

**NOTE:** The Nimbus7 sensor did not record 19H. Is there an alternative? Or are those years hosed?

- 1978: Not a full year of data
- 1979: NIMBUS7 (pre Fx classification)
- 1980: NIMBUS7 
    - Link Format: (https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0630.001/1980.01.01/NSIDC-0630-EASE2_S3.125km-NIMBUS7_SMMR-1980001-37H-M-SIR-JPL-v1.2.nc)
    
- 1981: NIMBUS7
- 1982: NIMBUS7
- 1983: NIMBUS7
- 1984: NIMBUS7
- 1985: NIMBUS7
- 1986: NIMBUS7
- 1987: NIMBUS7
- 1988: F08 
    - Link Format: (https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0630.001/1988.01.01/NSIDC-0630-EASE2_N6.25km-F08_SSMI-1988001-19H-M-SIR-CSU-v1.2.nc)

- 1989: F08
- 1990: F08
- 1991: F08, F10 (F08 ends)

    - F10 Link: (https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0630.001/1991.01.01/NSIDC-0630-EASE2_N25km-F10_SSMI-1991001-19H-M-GRD-CSU-v1.2.nc)

- 1992: F10, F11 (F11 has ~20,000 fewer spikes per year)

    - F11 Link: (https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0630.001/1992.01.01/NSIDC-0630-EASE2_N6.25km-F11_SSMI-1992001-19H-M-SIR-CSU-v1.2.nc)

- 1993: F10, F11
- 1994: F10, F11
- 1995: F10, F11
- 1996: F10, F11, F13 (F13 is best current sensor)
    - F13 Link: (https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0630.001/1996.01.01/NSIDC-0630-EASE2_N6.25km-F13_SSMI-1996001-19H-M-SIR-CSU-v1.2.nc)

- 1997: F10, F11, F13
- 1998: F11, F13, F14
- 1999: F11, F13, F14
- 2000: F11, F13, F14
- 2001: F13, F14, F15
    - F15 Link: https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0630.001/2001.01.01/NSIDC-0630-EASE2_N6.25km-F15_SSMI-2001001-19H-M-SIR-CSU-v1.2.nc

- 2002: F13, F14, F15
- 2003: F13, F14, F15, AQUA
- 2004: F13, F14, F15, AQUA
- 2005: F13, F14, F15, AQUA
- 2006: F13, F14, F15, F16, AQUA
- 2007: F13, F14, F15, F16, AQUA
- 2008: F13, F14, F15, F16, AQUA
- 2009: F13, F15, F16, F17, AQUA (F13 persists longer than F14)
- 2010: F15, F16, F17, AQUA 
- 2011: F15, F16, F17, F18, AQUA (AQUA ends)
    - F18 Link: https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0630.001/2011.01.01/NSIDC-0630-EASE2_N6.25km-F18_SSMIS-2011001-19H-M-SIR-CSU-v1.2.nc
    
- 2012: F15, F16, F17, F18
- 2013: F15, F16, F17, F18
- 2014: F15, F16, F17, F18
- 2015: F15, F16, F17, F18, F19 
- 2016: Not a full year of data

$$ 1988:F08 \longrightarrow 1991:F10 \longrightarrow 1992:F11 \longrightarrow 1996:F13 \longrightarrow 2001:F15 \longrightarrow 2011:F18 $$
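
The transition chain above can be read as a lookup table: each year maps to the sensor whose start year is the most recent one at or before it. The sketch below assumes that reading is the intended one. `TRANSITIONS` and `preferred_sensor` are hypothetical names that do not appear elsewhere in this notebook, and the result is only a default, since the download run further down uses F17 for 2012.

```python
from bisect import bisect_right

# (first year, sensor) pairs copied from the transition chain above.
TRANSITIONS = [
    (1988, 'F08'),
    (1991, 'F10'),
    (1992, 'F11'),
    (1996, 'F13'),
    (2001, 'F15'),
    (2011, 'F18'),
]


def preferred_sensor(year):
    """Return the sensor whose start year is the latest one <= year."""
    start_years = [first for first, _ in TRANSITIONS]
    # bisect_right gives the insertion point after any equal start year,
    # so subtracting 1 selects the transition that is already in effect.
    idx = bisect_right(start_years, year) - 1
    if idx < 0:
        raise ValueError('no sensor in the chain before %d' % start_years[0])
    return TRANSITIONS[idx][1]
```

Calling `preferred_sensor(1995)` walks the chain to the 1992 entry and returns `'F11'`. Years before 1988 raise a `ValueError` because the chain does not cover NIMBUS7.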

### File Path Name Notes

Things that change each year:
- Sensor
- Date range 
- SSMI vs SSMIS (year dependent) 
- resolution (37H = 3.125km, 19H = 6.25km)
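
The pieces that change can be assembled into a full link with a small helper. This is a sketch, not the notebook's own code: `build_url`, `BASE_URL`, `RESOLUTION` and `SSMIS_SENSORS` are hypothetical names, and the template only covers the Northern Hemisphere, morning-pass, SIR, CSU files shown in the Fxx link examples above. The NIMBUS7 links, and the 25km GRD link shown for F10, use different tokens and are not handled.

```python
from datetime import date

BASE_URL = 'https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0630.001/'
# Grid resolution for each channel, as listed above.
RESOLUTION = {'19H': '6.25km', '37H': '3.125km'}
# Sensors whose file names use SSMIS; every other Fxx sensor is assumed to use SSMI.
SSMIS_SENSORS = {'F16', 'F17', 'F18', 'F19'}


def build_url(sensor, day, channel):
    """Build the download link for one sensor, one date and one channel."""
    instrument = 'SSMIS' if sensor in SSMIS_SENSORS else 'SSMI'
    filename = 'NSIDC-0630-EASE2_N{res}-{sensor}_{inst}-{stamp}-{chan}-M-SIR-CSU-v1.2.nc'.format(
        res=RESOLUTION[channel],
        sensor=sensor,
        inst=instrument,
        stamp=day.strftime('%Y%j'),  # year followed by zero-padded day of year
        chan=channel,
    )
    # The directory is the date written as YYYY.MM.DD.
    return BASE_URL + day.strftime('%Y.%m.%d') + '/' + filename
```

For example, `build_url('F08', date(1988, 1, 1), '19H')` reproduces the 1988 F08 link listed in the sensor table.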

# Web scraper for all NSIDC 0630 files
### Notes: 
#### - Follow the setup instructions at [the NSIDC's website](https://nsidc.org/support/faq/what-options-are-available-bulk-downloading-data-https-earthdata-login-enabled)
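
Before starting a long download it can help to confirm that the login setup is in place. The sketch below assumes that the setup stores Earthdata Login credentials in a `.netrc` file under the host `urs.earthdata.nasa.gov`. Both the host name and the helper `has_earthdata_login` are assumptions made for illustration, so check them against the linked instructions.

```python
import netrc
import os

# Assumed Earthdata Login host; verify against the NSIDC instructions.
EARTHDATA_HOST = 'urs.earthdata.nasa.gov'


def has_earthdata_login(netrc_path=None):
    """Return True if the netrc file holds an entry usable for EARTHDATA_HOST."""
    if netrc_path is None:
        netrc_path = os.path.join(os.path.expanduser('~'), '.netrc')
    if not os.path.isfile(netrc_path):
        return False
    try:
        # authenticators() returns None when no matching (or default) entry exists.
        entry = netrc.netrc(netrc_path).authenticators(EARTHDATA_HOST)
    except netrc.NetrcParseError:
        return False
    return entry is not None
```

A `False` result here means wget would most likely be refused by the server, so it is worth checking before looping over a full year of dates.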

In [3]:
path = "/Users/williamnorris/SWE_Clean"
path19 = "/Users/williamnorris/SWE_Clean/data/19GHz_Wget"
path37 = "/Users/williamnorris/SWE_Clean/data/37GHz_Wget"

In [4]:
# Download the 19 GHz and 37 GHz morning files for one sensor over a list of dates
def scrape(sensor, dates, dest19, dest37):
    # store the path of the url before the data (the date is updated on the fly)
    file_pre = 'https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0630.001/'
    # F16 through F19 use SSMIS in the file name; all other sensors use SSMI
    instrument = 'SSMIS' if sensor in ['F16', 'F17', 'F18', 'F19'] else 'SSMI'
    post_19 = '/NSIDC-0630-EASE2_N6.25km-'+sensor+'_'+instrument+'-'
    last_19 = '-19H-M-SIR-CSU-v1.2.nc'
    post_37 = '/NSIDC-0630-EASE2_N3.125km-'+sensor+'_'+instrument+'-'
    last_37 = '-37H-M-SIR-CSU-v1.2.nc'
    # wget options shared by both downloads (cookies.txt is read from the global `path`)
    wget_opts = 'wget -nd --load-cookies '+path+'/cookies.txt --save-cookies '+path+'/cookies.txt --keep-session-cookies --no-check-certificate --auth-no-challenge=on -r --reject "index.html*" -np -e robots=off -P '
    for date in dates: 
        # convert the date string to a datetime
        temp = datetime.strptime(date, '%Y.%m.%d')
        # Store the year of the current date, convert to day of year
        year = temp.year
        temp = str(temp.timetuple().tm_yday)
        # pad front of day of year with zeroes to always be 3 char long
        temp = temp.rjust(3, '0')
        add = str(year)+temp
        # Combine constant file portions with dynamic portions 
        new_file_19 = file_pre + date + post_19 + add + last_19
        new_file_37 = file_pre + date + post_37 + add + last_37
        # Call wget on each file, saving into the destination directories passed in
        cmd = wget_opts + dest19 + ' ' + new_file_19
        cmd2 = wget_opts + dest37 + ' ' + new_file_37
        print("Downloading the 19GHz and 37GHz files for: %s" % date)
        subprocess.call(shlex.split(cmd), shell = False)
        subprocess.call(shlex.split(cmd2), shell = False)
    print('Finished')
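
The year plus zero-padded day-of-year stamp that the loop builds in several steps can also be produced with a single `strftime` call, because the `%j` format code is the day of the year padded to three digits. A minimal sketch, with `doy_stamp` as a hypothetical helper name:

```python
from datetime import datetime


def doy_stamp(date_string):
    """Turn 'YYYY.MM.DD' into 'YYYYDDD' (year followed by 3-digit day of year)."""
    parsed = datetime.strptime(date_string, '%Y.%m.%d')
    return parsed.strftime('%Y%j')
```

`doy_stamp('2012.01.01')` gives `'2012001'`, the same stamp that appears in the file names for that date.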
    

In [9]:
# Generate list of dates
start = datetime(2012, 1, 1)
end = datetime(2012, 12, 31)
dates = pd.date_range(start, end)

#convert the date range to strings formatted 'YYYY.MM.DD' to match the URL directories
dates = dates.strftime('%Y.%m.%d')

sensor = 'F17'

scrape(sensor, dates, path19, path37)

Downloading the 19GHz and 37GHz files for: 2012.01.01
Downloading the 19GHz and 37GHz files for: 2012.01.02
Downloading the 19GHz and 37GHz files for: 2012.01.03
Downloading the 19GHz and 37GHz files for: 2012.01.04
Downloading the 19GHz and 37GHz files for: 2012.01.05
Downloading the 19GHz and 37GHz files for: 2012.01.06
Downloading the 19GHz and 37GHz files for: 2012.01.07
Downloading the 19GHz and 37GHz files for: 2012.01.08
Downloading the 19GHz and 37GHz files for: 2012.01.09
Downloading the 19GHz and 37GHz files for: 2012.01.10
Downloading the 19GHz and 37GHz files for: 2012.01.11
Downloading the 19GHz and 37GHz files for: 2012.01.12
Downloading the 19GHz and 37GHz files for: 2012.01.13
Downloading the 19GHz and 37GHz files for: 2012.01.14
Downloading the 19GHz and 37GHz files for: 2012.01.15
Downloading the 19GHz and 37GHz files for: 2012.01.16
Downloading the 19GHz and 37GHz files for: 2012.01.17
Downloading the 19GHz and 37GHz files for: 2012.01.18
...

Downloading the 19GHz and 37GHz files for: 2012.06.01
Downloading the 19GHz and 37GHz files for: 2012.06.02
Downloading the 19GHz and 37GHz files for: 2012.06.03
Downloading the 19GHz and 37GHz files for: 2012.06.04
Downloading the 19GHz and 37GHz files for: 2012.06.05
Downloading the 19GHz and 37GHz files for: 2012.06.06
Downloading the 19GHz and 37GHz files for: 2012.06.07
Downloading the 19GHz and 37GHz files for: 2012.06.08
Downloading the 19GHz and 37GHz files for: 2012.06.09
Downloading the 19GHz and 37GHz files for: 2012.06.10
Downloading the 19GHz and 37GHz files for: 2012.06.11
Downloading the 19GHz and 37GHz files for: 2012.06.12
Downloading the 19GHz and 37GHz files for: 2012.06.13
Downloading the 19GHz and 37GHz files for: 2012.06.14
Downloading the 19GHz and 37GHz files for: 2012.06.15
Downloading the 19GHz and 37GHz files for: 2012.06.16
Downloading the 19GHz and 37GHz files for: 2012.06.17
Downloading the 19GHz and 37GHz files for: 2012.06.18
...

Downloading the 19GHz and 37GHz files for: 2012.10.31
Downloading the 19GHz and 37GHz files for: 2012.11.01
Downloading the 19GHz and 37GHz files for: 2012.11.02
Downloading the 19GHz and 37GHz files for: 2012.11.03
Downloading the 19GHz and 37GHz files for: 2012.11.04
Downloading the 19GHz and 37GHz files for: 2012.11.05
Downloading the 19GHz and 37GHz files for: 2012.11.06
Downloading the 19GHz and 37GHz files for: 2012.11.07
Downloading the 19GHz and 37GHz files for: 2012.11.08
Downloading the 19GHz and 37GHz files for: 2012.11.09
Downloading the 19GHz and 37GHz files for: 2012.11.10
Downloading the 19GHz and 37GHz files for: 2012.11.11
Downloading the 19GHz and 37GHz files for: 2012.11.12
Downloading the 19GHz and 37GHz files for: 2012.11.13
Downloading the 19GHz and 37GHz files for: 2012.11.14
Downloading the 19GHz and 37GHz files for: 2012.11.15
Downloading the 19GHz and 37GHz files for: 2012.11.16
Downloading the 19GHz and 37GHz files for: 2012.11.17
...
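
The log above only shows which dates were attempted, not which files actually arrived. A sketch of a follow-up check is given below, assuming that `wget -nd -P <dir>` saves each file under its remote base name directly inside `<dir>`. `missing_files` is a hypothetical helper and the file name template is the one used for the Fxx SIR files in this notebook.

```python
import os
from datetime import date, timedelta


def missing_files(dest, sensor, instrument, resolution, channel, start, end):
    """List expected file names from start to end (inclusive) not found in dest."""
    missing = []
    day = start
    while day <= end:
        name = 'NSIDC-0630-EASE2_N{0}-{1}_{2}-{3}-{4}-M-SIR-CSU-v1.2.nc'.format(
            resolution, sensor, instrument, day.strftime('%Y%j'), channel)
        if not os.path.isfile(os.path.join(dest, name)):
            missing.append(name)
        day += timedelta(days=1)
    return missing
```

For the run above, a call such as `missing_files(path19, 'F17', 'SSMIS', '6.25km', '19H', date(2012, 1, 1), date(2012, 12, 31))` would list the 19 GHz files that still need to be fetched.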

## The following is old, non-modular code 

In [None]:
# store the path of the url before the data (the date is updated on the fly)
file_pre = 'https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0630.001/'
# store the path of the url after the date
file_post = '/NSIDC-0630-EASE2_N6.25km-F08_SSMI-'
file_last = '-19H-M-SIR-CSU-v1.2.nc'


post_37 = '/NSIDC-0630-EASE2_N3.125km-F08_SSMI-'
last_37 = '-37H-M-SIR-CSU-v1.2.nc'

In [None]:
# Generate list of dates
start = datetime(1988, 1, 1)
end = datetime(1988, 12, 31)
dates = pd.date_range(start, end)

#convert time series to list of strings, translate '-' to '.' for URL
dates = dates.strftime('%Y.%m.%d')

In [None]:
# For loop to generate all the file names and download on the fly!
for date in dates: 
    # convert datetimeindex to date time
    temp = datetime.strptime(date, '%Y.%m.%d')
    # Store the year of the current date, convert to day of year
    year = temp.year
    temp = str(temp.timetuple().tm_yday)
    # pad front of day of year with zeroes to always be 3 char long
    temp = temp.rjust(3, '0')
    add = str(year)+temp
    # Combine constant file portions with dynamic portions 
    new_file = file_pre + date + file_post + add +file_last
    new_file_37 = file_pre + date + post_37 + add + last_37
    # Call wget on this file 
    cmd = 'wget -nd --load-cookies '+path+'/cookies.txt --save-cookies '+path+'/cookies.txt --keep-session-cookies --no-check-certificate --auth-no-challenge=on -r --reject "index.html*" -np -e robots=off -P '+path19+' '+new_file
    cmd2 = 'wget -nd --load-cookies '+path+'/cookies.txt --save-cookies '+path+'/cookies.txt --keep-session-cookies --no-check-certificate --auth-no-challenge=on -r --reject "index.html*" -np -e robots=off -P '+path37+' '+new_file_37
    print("Downloading the 19GHz and 37GHz files for: %s" % date)
    subprocess.call(shlex.split(cmd), shell = False)
    subprocess.call(shlex.split(cmd2), shell = False)
print('Finished')

In [None]:
# convert web scraping script to function for automation **work in progress** 
def scrape(sensor, dates, dest19, dest37):
    # store the path of the url before the data (the date is updated on the fly)
    file_pre = 'https://n5eil01u.ecs.nsidc.org/MEASURES/NSIDC-0630.001/'
    file_post = '/NSIDC-0630-EASE2_N6.25km-'+sensor+'_SSMI-'
    file_last = '-19H-M-SIR-CSU-v1.2.nc'
    
    post_37 = '/NSIDC-0630-EASE2_N3.125km-'+sensor+'_SSMI-'
    last_37 = '-37H-M-SIR-CSU-v1.2.nc'
    for date in dates: 
        # convert datetimeindex to date time
        temp = datetime.strptime(date, '%Y.%m.%d')
        # Store the year of the current date, convert to day of year
        year = temp.year
        temp = str(temp.timetuple().tm_yday)
        # pad front of day of year with zeroes to always be 3 char long
        temp = temp.rjust(3, '0')
        add = str(year)+temp
        # Combine constant file portions with dynamic portions 
        new_file = file_pre + date + file_post + add +file_last
        new_file_37 = file_pre + date + post_37 + add + last_37
        # Call wget on this file 
        cmd = 'wget -nd --load-cookies '+path+'/cookies.txt --save-cookies '+path+'/cookies.txt --keep-session-cookies --no-check-certificate --auth-no-challenge=on -r --reject "index.html*" -np -e robots=off -P '+path19+' '+new_file
        cmd2 = 'wget -nd --load-cookies '+path+'/cookies.txt --save-cookies '+path+'/cookies.txt --keep-session-cookies --no-check-certificate --auth-no-challenge=on -r --reject "index.html*" -np -e robots=off -P '+path37+' '+new_file_37
        print("Downloading the 19GHz and 37GHz files for: %s" % date)
        subprocess.call(shlex.split(cmd), shell = False)
        subprocess.call(shlex.split(cmd2), shell = False)
    print('Finished')
    

In [None]:
# Generate list of dates
start = datetime(1988, 1, 1)
end = datetime(1988, 12, 31)
dates = pd.date_range(start, end)

#convert time series to list of strings, translate '-' to '.' for URL
dates = dates.strftime('%Y.%m.%d')

sensor = 'F08'

scrape(sensor, dates, path19, path37)