## Download files using EarthPy
You can use the function **data.get_data()** from the earthpy package to download data from online sources.

In [1]:
import os
from glob import glob

import earthpy as et

In [2]:
file_url = "https://ndownloader.figshare.com/files/21894528"
et.data.get_data(url=file_url)

Downloading from https://ndownloader.figshare.com/files/21894528
Extracted output to C:\Users\user\earth-analytics\data\earthpy-downloads\avg-monthly-temp-fahr


'C:\\Users\\user\\earth-analytics\\data\\earthpy-downloads\\avg-monthly-temp-fahr'

By default, **et.data.get_data()** will download files to earth-analytics/data/earthpy-downloads under your home directory, and it will create the necessary directories if they do not already exist.

In [72]:
# Set work directory to earth-analytics
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

# Create a path to the data folder
data_folder = os.path.join('data', 'earthpy-downloads', 'avg-monthly-temp-fahr')

In [18]:
os.getcwd()

'C:\\Users\\user\\earth-analytics'

In [10]:
data_folder_D = os.path.join('C:/Users/user/Documents/Program Language/Python/earth-analytics')

## Glob in Python
glob is a powerful tool in Python to help with file management and filtering. While os helps manage and create specific paths that are friendly to whatever machine they are used on, glob helps to filter through large datasets and pull out only files that are of interest.

In [11]:
file_list = glob(data_folder)
file_list

['data\\earthpy-downloads\\avg-monthly-temp-fahr']

In [12]:
file_list_D = glob(data_folder_D)
file_list_D

['C:/Users/user/Documents/Program Language/Python/earth-analytics']

In [19]:
# Create a list containing a specific file name
glob(os.path.join(data_folder, 'San-Diego', 'San-Diego-1999-temp.csv'))

['data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-1999-temp.csv']

In [20]:
# (Method 2) Create a list containing a specific file name
glob(os.path.join(data_folder + '/San-Diego/San-Diego-1999-temp.csv'))

['data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego/San-Diego-1999-temp.csv']

### * Operator
The * is a wildcard that matches any run of characters (including none) within a single file or directory name. Use it to search for items whose names differ: whatever text varies between the names can be replaced by a *.

In [25]:
# Get a list of all files/dirs in data folder
glob(os.path.join(data_folder , '*'))

['data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma']

In [30]:
glob(os.path.join(data_folder + '/*'))

['data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma']

In [28]:
glob(os.path.join(data_folder, 'San-Diego', '*'))

['data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-1999-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv']

In [29]:
glob(os.path.join(data_folder + '/San-Diego/*'))

['data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-1999-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2003-temp.csv']

If you only want .csv files, then *.csv will return every file that ends with .csv. If you only want .csv files with the number 2 somewhere in the file name, then *2*.csv will return that list. Note that 2*.csv would only return files that start with the number 2.

In [31]:
glob(os.path.join(data_folder + '/San-Diego/*2*.csv'))

['data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2003-temp.csv']
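The three patterns described above can be compared side by side in a throwaway directory. This is only a minimal sketch: the temporary directory and the file names are made up for illustration and are not part of the downloaded dataset. `glob.escape()` is used so that any special characters in the temporary path are treated literally.

```python
import os
import tempfile
from glob import escape, glob

# Hypothetical file names, created only for this demonstration
names = ["2000-notes.csv", "San-Diego-1999-temp.csv",
         "San-Diego-2000-temp.csv", "readme.txt"]

with tempfile.TemporaryDirectory() as tmp_dir:
    for name in names:
        open(os.path.join(tmp_dir, name), "w").close()

    def match(pattern):
        """Return the sorted base names that match pattern inside tmp_dir."""
        full_pattern = os.path.join(escape(tmp_dir), pattern)
        return sorted(os.path.basename(p) for p in glob(full_pattern))

    all_csv = match("*.csv")      # every file ending in .csv
    any_two = match("*2*.csv")    # .csv files with a 2 anywhere in the name
    starts_two = match("2*.csv")  # .csv files whose name starts with 2

print(all_csv)
print(any_two)
print(starts_two)
```

Only `2000-notes.csv` starts with a 2, so it is the only name returned by the last pattern.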

### Recursive searches
If you are trying to operate on files across multiple directories, you can use more than one * in a file path. Each * stands for one directory level, so /*/* returns everything inside every folder of a directory.

In [32]:
glob(os.path.join(data_folder + '/*/*'))

['data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-1999-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-1999-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-2003-temp.csv']
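A pattern such as `/*/*` only reaches one directory level down. To search at any depth, glob also accepts `**` together with `recursive=True`. The sketch below builds a small made-up directory tree (the folder and file names are assumptions for illustration) and compares the two approaches.

```python
import os
import tempfile
from glob import escape, glob

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical layout: files at the top level, one level down, and two levels down
    nested_dir = os.path.join(tmp_dir, "Sonoma", "archive")
    os.makedirs(nested_dir)
    for path in [os.path.join(tmp_dir, "top.csv"),
                 os.path.join(tmp_dir, "Sonoma", "Sonoma-1999-temp.csv"),
                 os.path.join(nested_dir, "Sonoma-1998-temp.csv")]:
        open(path, "w").close()

    root = escape(tmp_dir)

    # '*/*' matches entries (files or folders) exactly one level below the root
    one_level = sorted(os.path.basename(p)
                       for p in glob(os.path.join(root, "*", "*")))

    # '**' with recursive=True matches .csv files at any depth, including the root
    any_depth = sorted(os.path.basename(p)
                       for p in glob(os.path.join(root, "**", "*.csv"),
                                     recursive=True))

print(one_level)
print(any_depth)
```

Note that `/*/*` returns the `archive` folder itself but not the file inside it, while the recursive pattern finds all three .csv files.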

### Sorting glob lists

In [47]:
# Get list of csv files in Sonoma directory
sonoma_files = glob(os.path.join(data_folder + '/Sonoma/*.csv'))
sonoma_files

['data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-1999-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2003-temp.csv']

In [51]:
# Sort and reverse glob list
sonoma_files.sort()
sonoma_files.reverse()
sonoma_files

['data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2003-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-1999-temp.csv']

In [48]:
# Another option to sort list
sonoma_files = sorted(glob(os.path.join(data_folder + '/Sonoma/*.csv')))
print(sonoma_files[4])

data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\Sonoma-2003-temp.csv


In [52]:
# Why sort a glob list? The item found at a given index depends on the list order
sonoma_files.sort()
sonoma_files.reverse()
print(sonoma_files[4])

data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\Sonoma-1999-temp.csv
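The Python documentation states that glob returns its results in arbitrary order, so indexing into an unsorted list is not reliable. As a small sketch, the hypothetical list below stands in for unsorted glob output and shows that `sorted(..., reverse=True)` gives the same result as sorting and then reversing.

```python
# Hypothetical file names in an arbitrary order, standing in for unsorted glob output
files = ["Sonoma-2001-temp.csv", "Sonoma-1999-temp.csv", "Sonoma-2003-temp.csv"]

oldest_first = sorted(files)                # ascending order, new list
newest_first = sorted(files, reverse=True)  # descending order, new list

# Equivalent in-place version, as used in the cells above
in_place = list(files)
in_place.sort()
in_place.reverse()

print(oldest_first[0])
print(newest_first[0])
```

Because the year is zero-padded to four digits, sorting the names as text also sorts them by year.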


### Using ranges
In addition to using * to specify which parts of a file name are important to you, you can use [ ] to specify a set or range of characters to search for. The brackets always match exactly one character in that position.

In [55]:
# Create a search for all files with 2001 to 2003 in the name by using *200 and adding [1-3]* to it
glob(os.path.join(data_folder + '/*/*200[1-3]*'))

['data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-2003-temp.csv']

In [56]:
# Notice below that the search does not work as intended: [ ] matches a single character,
# so [2001-2003] is read as "any one of the digits 0, 1, 2 or 3" and every file name matches.
glob(os.path.join(data_folder + '/*/*[2001-2003]*'))

['data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-1999-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-1999-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr\\Sonoma\\Sonoma-2003-temp.csv']
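You can check how a bracket expression is interpreted without touching the file system. glob relies on the standard library's fnmatch module for pattern matching, so `fnmatch.fnmatchcase()` can test a single name against a pattern. The sketch below uses two file names from the listing above.

```python
from fnmatch import fnmatchcase

name_1999 = "San-Diego-1999-temp.csv"
name_2002 = "San-Diego-2002-temp.csv"

# '200' followed by one character in the range 1-3
range_1999 = fnmatchcase(name_1999, "*200[1-3]*")
range_2002 = fnmatchcase(name_2002, "*200[1-3]*")

# [2001-2003] is a set of single characters: 0, 1, 2 and 3
set_1999 = fnmatchcase(name_1999, "*[2001-2003]*")

print(range_1999, range_2002, set_1999)
```

The 1999 file matches the second pattern only because its name contains the digit 1.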

### ? Operator
The ? operator functions similarly to the * operator but matches exactly one character. If one character in the file name can vary, but everything else must stay the same, then ? is a good way to replace just that one character.

In [57]:
glob(os.path.join(data_folder + '/Sonoma/*200?-temp.csv'))

['data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-2003-temp.csv']

? is not limited to one use per search and can be used to replace more than one character in a query.

In [58]:
glob(os.path.join(data_folder + '/Sonoma/*19??-temp.csv'))

['data/earthpy-downloads/avg-monthly-temp-fahr/Sonoma\\Sonoma-1999-temp.csv']
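The difference between ? and * is that ? never matches zero characters or more than one. The sketch below illustrates this with `fnmatch.filter()`, which applies the same matching rules as glob to a plain list of names. The two names with a three-digit and a five-digit number are made up for illustration.

```python
import fnmatch

# The last two names are hypothetical and only show what ? does NOT match
names = ["Sonoma-1999-temp.csv", "Sonoma-2000-temp.csv",
         "Sonoma-200-temp.csv", "Sonoma-20001-temp.csv"]

# One ? stands for exactly one character after '200'
one_char = fnmatch.filter(names, "Sonoma-200?-temp.csv")

# Four ? stand for exactly four characters
four_chars = fnmatch.filter(names, "Sonoma-????-temp.csv")

print(one_char)
print(four_chars)
```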

In [59]:
# Save a glob output to a variable
sd_data = glob(os.path.join(data_folder + '/San-Diego/*'))
sd_data.sort()
sd_data

['data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-1999-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2003-temp.csv']

## os advanced functionality
os is another very powerful tool and has additional functionality that can be useful when dealing with file paths, such as advanced parsing abilities.

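For example, **os.path.normpath()** is a great way to clean up file paths. It collapses redundant separators and up-level references (for example, A//B, A/./B and A/foo/../B all become A/B), and on Windows it also converts forward slashes to backslashes.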

In [60]:
ex_path = 'C:/Users/user/Documents'
os.path.exists(ex_path)

True

In [61]:
os.path.normpath(ex_path)

'C:\\Users\\user\\Documents'
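The result of os.path.normpath() depends on the operating system, because os.path is an alias for the ntpath module on Windows and the posixpath module on Mac or Linux. To see both behaviors on any machine, this sketch calls the two modules directly. The messy input paths are made up for illustration.

```python
import ntpath
import posixpath

# Hypothetical paths with doubled separators, '.' and '..' segments, and a trailing slash
windows_style = "C:/Users/user//Documents/../Documents/"
posix_style = "data//earthpy-downloads/./avg-monthly-temp-fahr/"

# Windows rules: forward slashes become backslashes
clean_windows = ntpath.normpath(windows_style)

# POSIX rules: forward slashes are kept
clean_posix = posixpath.normpath(posix_style)

print(clean_windows)
print(clean_posix)
```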

**os.path.commonpath()** is very useful when combined with glob. This function takes a list of file paths and returns the longest sub-path that all of them have in common. So if there were two files, one stored in *home/user/dir/dir2/example.txt* and one stored in *home/user/dir/example.txt*, then os.path.commonpath() would return *home/user/dir*, as it is the deepest directory the two files share.

In [62]:
# Print list of files
sd_data

['data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-1999-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2003-temp.csv']

In [63]:
os.path.commonpath(sd_data)

'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego'
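The two-file example described above can be checked directly. This sketch uses posixpath (the module behind os.path on Mac or Linux) so that the forward-slash paths behave the same way on any machine.

```python
import posixpath

paths = ["home/user/dir/dir2/example.txt",
         "home/user/dir/example.txt"]

# Longest sub-path shared by every path in the list
shared = posixpath.commonpath(paths)
print(shared)
```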

**os.path.basename()** returns the last component of a path. If a file path is passed in, the file name will be parsed out and returned.

In [64]:
os.path.normpath(data_folder)

'data\\earthpy-downloads\\avg-monthly-temp-fahr'

In [65]:
os.path.basename(os.path.normpath(data_folder))

'avg-monthly-temp-fahr'
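One reason to wrap the path in os.path.normpath() first is that basename returns an empty string when the path ends with a separator. A minimal sketch, using posixpath so the forward-slash path behaves the same on any machine:

```python
import posixpath

folder = "data/earthpy-downloads/avg-monthly-temp-fahr/"  # note the trailing slash

# Nothing follows the final separator, so the last component is empty
raw_name = posixpath.basename(folder)

# normpath removes the trailing separator first
clean_name = posixpath.basename(posixpath.normpath(folder))

print(repr(raw_name))
print(clean_name)
```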

**os.path.split()** will split a path into two parts and return them as a tuple:

1. the head: everything before the last part
2. the tail: the last part

The tail is the same value that os.path.basename() returns, and the head is the rest of the path that basename leaves out.

In [66]:
os.path.split(os.path.normpath(data_folder))

('data\\earthpy-downloads', 'avg-monthly-temp-fahr')

In [67]:
os.path.split(os.path.normpath(data_folder))[0]

'data\\earthpy-downloads'

In [68]:
os.path.split(os.path.normpath(data_folder))[1]

'avg-monthly-temp-fahr'
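Instead of indexing the tuple with [0] and [1], you can unpack both parts in one step. The sketch below (using posixpath so it behaves the same on any machine) also shows that joining the two parts gives back the original path.

```python
import posixpath

folder = "data/earthpy-downloads/avg-monthly-temp-fahr"

# Unpack the (head, tail) tuple into two variables
head, tail = posixpath.split(folder)

# Joining head and tail rebuilds the path that was split
rebuilt = posixpath.join(head, tail)

print(head)
print(tail)
```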

## String manipulation
**.split()** is a built-in Python string method that splits a string into a list of strings based on a separator, and it can be used in combination with **os.sep** to separate a file path into its component directories. os.sep is a value stored in os that holds the character used to separate pathname components, such as directory or file names. This is a backslash (\, displayed as '\\' in Python string output) for Windows and / for POSIX systems, such as Mac or Linux.

In [73]:
# Separate a path into parts
file_path_list = data_folder.split(os.sep)
file_path_list

['data', 'earthpy-downloads', 'avg-monthly-temp-fahr']

In [74]:
file_path_list[2]

'avg-monthly-temp-fahr'
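Splitting on os.sep only works when the path uses a single kind of separator. Several of the glob outputs above mix / and \\ in one path, and splitting such a path on the backslash alone leaves some parts joined together. One possible alternative is the standard library's pathlib module: `PureWindowsPath` treats both separators as valid. The mixed path below is taken from the style of the outputs above, and this is a sketch rather than the only way to do it.

```python
from pathlib import PureWindowsPath

# A path that mixes forward slashes and backslashes
mixed = "data/earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego"

# Splitting on the backslash alone leaves 'data/earthpy-downloads' in one piece
naive_parts = mixed.split("\\")

# PureWindowsPath accepts both separators and splits on each of them
all_parts = PureWindowsPath(mixed).parts

print(naive_parts)
print(all_parts)
```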

In addition to built-in functions, file paths can be parsed with string[start_index:end_index] like a normal string. This can help get important information from a file path, such as a date.

In [75]:
# Get list of files
sd_data

['data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-1999-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2000-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2001-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2002-temp.csv',
 'data/earthpy-downloads/avg-monthly-temp-fahr/San-Diego\\San-Diego-2003-temp.csv']

In [76]:
# Get a file name
year_path = sd_data[0]
file_name = os.path.basename(year_path)
print(file_name)

San-Diego-1999-temp.csv


In [77]:
# Parse a date from file name
year = file_name[10:14]
print(year)

1999


Notice that the range includes the first index value but not the second index value (e.g. the characters 1999 sit at index values 10 through 13, so the slice is [10:14]).
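Fixed index positions only work when every file name has the same length before the year. The Sonoma files have a shorter city name, so the same slice returns the wrong characters. One possible alternative, assuming every name follows the pattern city-year-temp.csv, is to split on the hyphen and count from the end. The helper name `year_from_name` is made up for this sketch.

```python
def year_from_name(file_name):
    """Return the year from a name shaped like '<city>-<year>-temp.csv'.

    Hypothetical helper: it assumes the year is always the second-to-last
    hyphen-separated field.
    """
    return file_name.split("-")[-2]


sd_name = "San-Diego-1999-temp.csv"
sonoma_name = "Sonoma-1999-temp.csv"

# Fixed positions work for San-Diego but not for the shorter Sonoma name
sd_slice = sd_name[10:14]
sonoma_slice = sonoma_name[10:14]

# Counting fields from the end works for both
sd_year = year_from_name(sd_name)
sonoma_year = year_from_name(sonoma_name)

print(sd_slice, sonoma_slice)
print(sd_year, sonoma_year)
```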