# Notebook 000: Download Datasources

The code in this notebook can be used to download a large portion of the `raw` data sources required for the "Predicting Crimes" analysis.

If you wish to bypass this code and simply download a copy of the fully populated `raw` data directory, you can do so by:

1. Downloading and extracting the `./raw/` data directory found at this link:
    - https://drive.google.com/file/d/1Pv5M-GmUY2Cvq92GDH3d_h7MvXFjgzID/view?usp=sharing


2. Replacing your local `raw` data sub-directory, found at `../data/raw/` in this project repository, with the extracted directory.
3. Please DO NOT commit any data files to your git history. 
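
Steps 1 and 2 above can also be scripted once the archive has been downloaded by hand. The following is a minimal sketch, assuming the download is a `.zip` archive; the `replace_raw_dir` helper and the `raw.zip` filename in the usage comment are hypothetical and are not part of this project.

```python
import os
import shutil
import tempfile
import zipfile


def replace_raw_dir(archive_path, raw_dir):
    """Extract a zip archive and use its contents to replace raw_dir."""
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(archive_path, 'r') as zipobj:
            zipobj.extractall(tmp)

        # assumption: if the archive holds a single top-level directory
        # (such as raw/), that directory is treated as the content root
        entries = os.listdir(tmp)
        source = tmp
        if len(entries) == 1 and os.path.isdir(os.path.join(tmp, entries[0])):
            source = os.path.join(tmp, entries[0])

        # remove the existing directory, then copy the extracted content
        if os.path.exists(raw_dir):
            shutil.rmtree(raw_dir)
        shutil.copytree(source, raw_dir)

    return raw_dir


# hypothetical usage, with the archive saved next to this notebook:
# replace_raw_dir('raw.zip', '../data/raw')
```

Note that this sketch deletes everything already in the target directory, including any `.gitkeep` file, before copying the extracted content.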

**PLEASE NOTE:** The code below does not cover data sources that require API calls or web scraping. Those data sources will be pulled using separate notebooks *(NOT YET COMPLETED)*.

**Overall, 44 separate data and shape files listed in the accompanying `data-inventory.csv` file are downloaded by this notebook.**

In [1]:
import os
import urllib.request
import requests
import zipfile
from pathlib import PurePath

import pandas as pd
import numpy as np

In [2]:
# set path variables
DATA_ROOT = '../data'
parent_dir = os.path.join(DATA_ROOT, 'raw')
inventory_filepath = os.path.join('../data-inventory.csv')

In [3]:
# read data inventory to dataframe
inventory_df = pd.read_csv(inventory_filepath)

In [4]:
# view summary of data inventory
print(inventory_df.info())
inventory_df.head(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 13 columns):
id                  47 non-null int64
category            47 non-null object
access              47 non-null object
source              45 non-null object
directory           47 non-null object
sub-directory       47 non-null object
filename            45 non-null object
zipfile             37 non-null float64
page-url            47 non-null object
data-url            45 non-null object
reference           33 non-null object
description         44 non-null object
access-confirmed    44 non-null object
dtypes: float64(1), int64(1), object(11)
memory usage: 4.9+ KB
None


| | id | category | access | source | directory | sub-directory | filename | zipfile | page-url | data-url | reference | description | access-confirmed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | boston property assessments | download | data | raw | property | fy19-assessments | 0.0 | https://data.boston.gov/dataset/property-asses... | https://data.boston.gov/dataset/e02c44d2-3c64-... | , https://data.boston.gov/dataset/e02c44d2-3c6... | Gives property, or parcel, ownership together ... | 2019-11-07 |


In [38]:
# define functions for performing data downloads
def make_subdir(subdir, verbose=True):
    """
    Checks for the existence of a specified sub-directory, and if it
    doesn't exist, the sub-directory is created
    
    subdir: str, relative filepath of the desired subdirectory
    verbose: boolean, default=True, if True prints summary of action taken
    
    returns: None, sub-directory written to disk at specified filepath
    """
    if os.path.exists(subdir):
        if verbose:
            print(
                'The following sub-directory already exists '\
                'and was not created: {0}'.format(subdir)
            )
    
    else:
        # makedirs also creates any missing parent directories
        os.makedirs(subdir)
        if verbose:
            print(
                'The following sub-directory was created: {0}'\
                ''.format(subdir)
            )


def download_file(subdir, url, filename, verbose=True,
                  overwrite=False, return_results=True):
    """
    Downloads a single file to the specified filepath from the given url
    
    subdir: str, the path of target directory for saving file
    url: str, the url from which the data will be downloaded
    filename: str, the desired name of the saved file
    verbose: boolean, default=True, if True prints summary of action taken
    overwrite: boolean, default=False, if True will overwrite existing local
               copy of the file, if False will only download file if its
               filepath does not already exist
    return_results: boolean, default=True, if True returns new local filename
                    and download headers
               
    returns: if return_results=True, returns local_filename and headers from
             urllib.request.urlretrieve, or (None, None) if the local file
             already existed and overwrite=False
    """
    filepath = os.path.join(subdir, filename)
    
    # check for existence of subdir and mkdir if needed
    make_subdir(subdir, verbose=verbose)
    
    exists = os.path.exists(filepath)
    
    # handle situation where filepath exists and will not be overwritten
    if exists and not overwrite:
        if verbose:
            print(
                'The following local file already exists and '\
                'was not overwritten: {0}'.format(filepath)
            )
        if return_results:
            return None, None
        return
    
    # report whether an existing file is overwritten or a new one is written
    if verbose:
        if exists:
            print(
                'Downloading and overwriting the existing '\
                'local file: {0}'.format(filepath)
            )
        else:
            print(
                'Downloading {0} data to {1}'.format(filename, filepath)
            )
    
    # write file to disk
    local_filename, headers = urllib.request.urlretrieve(
        url,
        filepath,
    )
    if return_results:
        return local_filename, headers


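def make_download_dict(inventory, parent):
    """
    Builds a nested dictionary of download targets from the data inventory
    
    inventory: pd.DataFrame, inventory rows to download, with 'sub-directory',
               'filename', 'data-url', and 'zipfile' columns
    parent: str, path of the parent directory in which files will be saved
    
    returns: dict, keyed by sub-directory and then filename, where each value
             is a dict with the 'url', target 'filepath', and 'is_zip' flag
    """
    # work on a copy so the caller's dataframe is not modified
    inventory = inventory.copy()
    subdirs = list(set(inventory['sub-directory']))
    # take the file extension(s) from the final component of each data url
    inventory['file-type'] = inventory['data-url'].apply(
        lambda x: ''.join(PurePath(x).suffixes)
    )
    
    download_dict = {
        subdir: {
            filename: {
                'url': url,
                'filepath': os.path.join(parent, subdir, ''.join([filename, suffix])),
                'is_zip': is_zip
            }
            for filename, url, suffix, is_zip in zip(
                inventory.loc[inventory['sub-directory'] == subdir]['filename'],
                inventory.loc[inventory['sub-directory'] == subdir]['data-url'],
                inventory.loc[inventory['sub-directory'] == subdir]['file-type'],
                inventory.loc[inventory['sub-directory'] == subdir]['zipfile'],
            )
        } for subdir in subdirs
    }
    
    return download_dict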


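def make_subdirs(download_dict, parent, verbose=True):
    """
    Creates the parent directory (with a .gitkeep file) and any missing
    sub-directories named by the keys of the download dictionary
    
    download_dict: dict, output of make_download_dict, keyed by sub-directory
    parent: str, path of the parent directory for the sub-directories
    verbose: boolean, default=True, if True prints summary of action taken
    
    returns: tuple, (list of all entries now in parent, list of newly
             created sub-directories)
    """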
    
    if not os.path.exists(parent):
        os.mkdir(parent)
        open(os.path.join(parent, '.gitkeep'), 'a').close()
        if verbose:
            print(
                'The {0} parent directory and accompanying .gitkeep file '\
                'have been created.'.format(parent)
            )
            print()
    
    # create list of current top-level files and directories
    existing = os.listdir(parent)

    # create each target sub-directory that doesn't already exist
    for subdir in download_dict.keys():
        if subdir not in existing:
            os.mkdir(os.path.join(parent, subdir))
    
    # save new list of files and directories, as well as the difference
    new_existing = os.listdir(parent)
    new_added = list(set(new_existing) - set(existing))
    
    # print summary results
    if verbose:
        if len(new_added) > 0:
            print('The following sub-directories were added to {}:'.format(parent))
            for subdir in new_added:
                print(subdir)
            print()
        else:
            print(
                'No directories have been created. All target directories already '\
                'exist locally\n'
            )
    
    return new_existing, new_added


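def download_datafiles(download_dict, parent, exclude_subdir='shapefile', verbose=True):
    """
    Downloads every file in the download dictionary that does not already
    exist locally, skipping the excluded sub-directory
    
    download_dict: dict, output of make_download_dict
    parent: str, path of the parent directory (currently unused)
    exclude_subdir: str, default='shapefile', sub-directory to skip
    verbose: boolean, default=True, if True prints summary of action taken
    
    returns: dict, keyed by filename, each value a one-item list holding the
             (local_filename, headers) tuple from urllib.request.urlretrieve
    """
    # compare names exactly; `not in` on a string would match substrings
    subdirs = [
        subdir for subdir in list(download_dict.keys())
        if subdir != exclude_subdir
    ]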
    downloaded = dict()
    
    for subdir in subdirs:
        for filename, download in download_dict[subdir].items():
            if not os.path.exists(download['filepath']):
                if verbose:
                    print(
                        'Downloading {0} data to {1}'.format(filename, download['filepath'])
                    )                
                downloaded[filename] = [
                    urllib.request.urlretrieve(
                        download['url'],
                        download['filepath'],
                    )
                ]
    
    if verbose:
        if len(downloaded)==0:
            print(
                'No datafiles have been downloaded. All target files already exist locally.\n'
            )
        else:
            print(
                '{0} data files have been downloaded and stored locally.\n'.format(
                    len(downloaded)
                )
            )
    
    return downloaded


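def download_shapefiles(download_dict, parent, target_subdir='shapefile', verbose=True):
    """
    Downloads each shapefile zip archive that does not already exist locally
    and extracts it to a sub-directory named after the file
    
    download_dict: dict, output of make_download_dict
    parent: str, path of the parent directory (currently unused)
    target_subdir: str, default='shapefile', sub-directory holding the
                   shapefile entries in download_dict
    verbose: boolean, default=True, if True prints summary of action taken
    
    returns: dict, keyed by filename, each value a one-item list holding the
             (local_filename, headers) tuple from urllib.request.urlretrieve
    """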
    downloaded = dict()
    
    for filename, download in download_dict[target_subdir].items():
        if not os.path.exists(download['filepath']):
            if verbose:
                print(
                    'Downloading {0} shapefile to {1}'.format(filename, download['filepath'])
                )                
            
            # download shape zipfile to directory
            downloaded[filename] = [
                urllib.request.urlretrieve(
                    download['url'],
                    download['filepath'],
                )
            ]
            
            # create target sub-directory for extracting zipfile
            shapedir = os.path.join(os.path.dirname(download['filepath']), filename)
            if not os.path.exists(shapedir):
                os.mkdir(shapedir)
            
            # extract zipfile to target sub-directory
            with zipfile.ZipFile(download['filepath'], 'r') as zipobj:

                if verbose:
                    print(
                        '\t...extracting shapefile zip archive to {0}'.format(shapedir)
                    )                

                # extract all files
                zipobj.extractall(shapedir)

    if verbose:
        if len(downloaded)==0:
            print(
                'No shapefiles have been downloaded. All target files already exist locally.\n'
            )
        else:
            print(
                '{0} shapefiles have been downloaded and extracted locally.\n'.format(
                    len(downloaded)
                )
            )
            
    return downloaded
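
A shapefile is distributed as several companion files rather than one, which is why each archive above is extracted into its own sub-directory. As a hedged sketch, the hypothetical `missing_shapefile_parts` helper below reports which of the three mandatory components (`.shp`, `.shx`, `.dbf`) are absent from an archive. The helper and the in-memory archive with made-up file names are assumptions for illustration and are not used elsewhere in this notebook.

```python
import io
import os
import zipfile

# the three mandatory components of an ESRI shapefile
REQUIRED_EXTENSIONS = {'.shp', '.shx', '.dbf'}


def missing_shapefile_parts(zip_source):
    """Return the sorted required extensions absent from a zip archive."""
    with zipfile.ZipFile(zip_source, 'r') as zipobj:
        found = {
            os.path.splitext(name)[1].lower() for name in zipobj.namelist()
        }
    return sorted(REQUIRED_EXTENSIONS - found)


# build a small in-memory archive with made-up file names to try it out
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zipobj:
    zipobj.writestr('parcels.shp', b'')
    zipobj.writestr('parcels.dbf', b'')

print(missing_shapefile_parts(buffer))  # prints ['.shx']
```

A check like this could run right after a download to flag incomplete archives before any later step tries to read them.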

## Subset data inventory into groups based on required download methods

In [7]:
# subset data inventory to include just the 'download' sources
cols = ['sub-directory', 'filename', 'zipfile', 'data-url', 'source']
download_df = inventory_df.loc[inventory_df['access']=='download'][cols]

# subset NOAA API query download information
noaa_df = inventory_df.loc[inventory_df['sub-directory']=='noaa'][cols]
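
The inventory summary above shows that `filename` and `data-url` each have fewer non-null values than there are rows, so it may be worth confirming that a subset has no missing values before building download paths from it. The following is a minimal sketch on a small made-up frame; `sample_df` and its values are hypothetical and only mimic the inventory's column names.

```python
import pandas as pd

# made-up rows that mimic the inventory's column names
sample_df = pd.DataFrame({
    'access': ['download', 'download', 'api'],
    'sub-directory': ['crime', 'boston', 'noaa'],
    'filename': ['crime-incidents', None, 'weather'],
    'data-url': [
        'https://example.com/crime.csv',
        None,
        'https://example.com/noaa',
    ],
})

cols = ['sub-directory', 'filename', 'data-url']

# boolean mask selects the rows, the column list selects the columns
is_download = sample_df['access'] == 'download'
downloads = sample_df.loc[is_download, cols]

# drop rows that cannot be downloaded because a key field is missing
complete = downloads.dropna(subset=['filename', 'data-url'])

print(len(downloads), len(complete))  # prints 2 1
```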

## Download NOAA weather data with direct API url query

In [40]:
noaa_results = download_file(
    subdir=os.path.join(parent_dir, 'noaa'),
    url=noaa_df['data-url'].values[0],
    filename=noaa_df['filename'].values[0],
    return_results=True
)

The following sub-directory already exists and was not created: ../data/raw/noaa
The following local file already exists and was not overwritten: ../data/raw/noaa/boston-daily-weather-20140101-20191107.csv
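
The output above reflects the skip-if-present behavior: a file that already exists locally is not downloaded again. As a stripped-down sketch of that pattern (not the notebook's `download_file` itself), the hypothetical `fetch_if_missing` helper below uses a local `file://` URL so that it runs without network access; the file names and contents are made up.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def fetch_if_missing(url, filepath):
    """Download url to filepath unless filepath already exists.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.exists(filepath):
        return False
    urllib.request.urlretrieve(url, filepath)
    return True


# use a local file:// url so the sketch runs without network access
workdir = tempfile.mkdtemp()
source = Path(workdir) / 'source.csv'
source.write_text('date,tavg\n2014-01-01,30\n')
target = os.path.join(workdir, 'copy.csv')

first = fetch_if_missing(source.as_uri(), target)
second = fetch_if_missing(source.as_uri(), target)
print(first, second)  # prints True False
```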


## Download files from Analyze Boston data sources

This includes both "data" sources such as .csv and .xlsx files and "shapefile" .zip sources.
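
The extension used for each saved file is taken from the source's download URL. As a rough sketch of that idea, the hypothetical `url_suffix` helper below first strips any query string with `urllib.parse.urlparse` and then reads the suffixes of the final path component. The helper and the `example.com` URLs are placeholders, not entries from `data-inventory.csv`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_suffix(url):
    """Return the file extension(s) of the last path component of a url."""
    # urlparse separates the path from any query string or fragment
    path = urlparse(url).path
    return ''.join(PurePosixPath(path).suffixes)


print(url_suffix('https://example.com/data/streetlights.csv'))
# prints .csv
print(url_suffix('https://example.com/shapes/parcels.zip?download=1'))
# prints .zip
```

A URL whose last path component has no extension yields an empty string, so the saved file would keep only the inventory's `filename`.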

In [7]:
%%time
# report cell execution time for later reference

# create download dictionary
download_dict = make_download_dict(download_df, parent_dir)

# make required sub-directories in parent directory
listdirs, added = make_subdirs(download_dict, parent_dir)

# download data files to target sub-directories
downloaded_data_confirmation = download_datafiles(download_dict, parent_dir)

# download and extract shapefiles to target sub-directories
downloaded_shape_confirmation = download_shapefiles(download_dict, parent_dir) 

The ../data/raw parent directory and accompanying .gitkeep file has been created.

The following sub-directories were added to ../data/raw:
crime
property
bpd-fio
boston
shapefile

Downloading crime-incidents data to ../data/raw/crime/crime-incidents.csv
Downloading fy19-assessments data to ../data/raw/property/fy19-assessments.csv
Downloading fy18-assessments data to ../data/raw/property/fy18-assessments.csv
Downloading fy17-assessments data to ../data/raw/property/fy17-assessments.csv
Downloading fy16-assessments data to ../data/raw/property/fy16-assessments.csv
Downloading fy15-assessments data to ../data/raw/property/fy15-assessments.csv
Downloading fy14-assessments data to ../data/raw/property/fy14-assessments.csv
Downloading fy13-assessments data to ../data/raw/property/fy13-assessments.csv
Downloading streetlights data to ../data/raw/boston/streetlights.csv
Downloading public-k12-schools data to ../data/raw/boston/public-k12-schools.csv
Downloading nonpublic-k12-schools data to 