# How to Subset the Pre-Computed Travel Cost Matrices
Calculating spatial access measures requires data on travel times or distances between origins and destinations. If you only need distances, the `access` package will calculate Euclidean distances for your projected data. If you need travel times for a specific travel mode (walking, public transit, or driving), you need to generate these so-called travel time (or travel cost) matrices from other sources. We provide pre-calculated travel cost matrices between blocks and tracts [here](https://access.readthedocs.io/en/latest/resources.html) for the 20 largest cities in the US and for the entire country.

Each dataset is a point-to-point distance matrix generated by the Center for Spatial Data Science. The matrices come as bzipped CSVs. Cities are listed according to their containing county code. The origins for each distance matrix are the population-weighted centroids that lie within each county, and the destinations are any population-weighted centroids that lie within 100 km of the buffered county.

In this notebook, we show how to download the national travel cost matrix and subset it for your area of interest. In the example below, we will subset the matrix to include only the tracts in Cook County, IL. By the end of this notebook, you should be able to:
- Download the travel matrix
- Define your area of interest using the [Census GEOID](https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html)
- Use dask to subset the travel cost matrix to your area of interest
- Use the terminal and basic bash commands to subset the travel cost matrix to your area of interest
    


## Downloading the National Travel Matrix
You can find the national driving travel cost matrix on the [`access` Read the Docs page](https://access.readthedocs.io/en/latest/resources.html). You can download the dataset either by clicking on the link shown in the image below or by running the code that follows.
<img src="screenshot_cost_website.png" style="width: 750px;">


In [11]:
import os
import requests 

cost_matrix_url = "https://uchicago.box.com/shared/static/prapz7ac7vwuz44nnab3dhe10vbg55cz.bz2"

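def download_file(url, filename):
    # Make sure the saved file carries the .bz2 extension.
    if not filename.endswith('.bz2'):
        filename = filename + '.bz2'

    # Stream the response to disk in chunks so the large archive
    # is never held in memory all at once.
    with requests.get(url, stream=True) as req:
        req.raise_for_status()
        with open(os.path.join('./', filename), 'wb') as f:
            for chunk in req.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)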
        
download_file(cost_matrix_url, 'national_cost_matrix.csv.bz2')

You should now see the file in the current directory.



In [None]:
os.listdir('./')

## Defining Area of Interest with Census GEOID
You can find more information on the Census GEOID [here](https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html). For our example, we will filter the dataset for Cook County, IL. The State ID for Illinois is `17` and the County ID for Cook County is `031`. Together, they give the prefix `17031`, so we want to keep only the rows whose origin and destination both start with `17031`.
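
The prefix logic can be sketched in a few lines of plain Python. This is only an illustration: `in_county` is a hypothetical helper, not part of the `access` package, and it assumes tract-level GEOIDs, which are 11 digits long (2 for the state, 3 for the county, 6 for the tract).

```python
# Tract-level GEOIDs are 11 digits: 2 (state) + 3 (county) + 6 (tract).
STATE_FIPS = "17"    # Illinois
COUNTY_FIPS = "031"  # Cook County

county_prefix = STATE_FIPS + COUNTY_FIPS


def in_county(geoid, prefix=county_prefix):
    """Return True if a tract GEOID falls inside the county given by prefix.

    Hypothetical helper for illustration; assumes 11-digit tract GEOIDs.
    """
    # Zero-pad to 11 digits, because a GEOID parsed as an integer loses
    # the leading zero of states such as Alabama ("01").
    return str(geoid).zfill(11).startswith(prefix)


print(county_prefix)             # 17031
print(in_county("17031010100"))  # True: a Cook County tract
print(in_county("17043840000"))  # False: a different Illinois county
```

Keeping GEOIDs as strings avoids the leading-zero problem entirely, which is why the steps below convert the origin and destination columns to `str` before filtering.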

## Use Dask to Subset the Travel Cost Matrix
Since the uncompressed CSV is too large to fit in the memory of most computers, we cannot load it in its entirety with `pandas`. Instead, we will show you how to use [`dask`](https://docs.dask.org/en/latest/delayed.html) to subset the travel cost matrix. First, we must decompress the file to extract the CSV. Note: you must install bzip2 if it's not already installed.

In [12]:
!bzip2 -dk national_cost_matrix.csv.bz2
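
If installing the `bzip2` command-line tool is not an option, the same step can probably be done with Python's standard library. The sketch below is one possible approach, not the method used to build this notebook; `decompress_bz2` is a hypothetical helper name. Like `bzip2 -dk`, it leaves the compressed file in place.

```python
import bz2
import shutil


def decompress_bz2(src, dst, chunk_size=1024 * 1024):
    """Decompress src (a .bz2 file) to dst, keeping the original file."""
    with bz2.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # copyfileobj copies in fixed-size chunks, so the decompressed
        # data is never held in memory all at once.
        shutil.copyfileobj(f_in, f_out, chunk_size)
```

For this notebook, the call would be `decompress_bz2('national_cost_matrix.csv.bz2', 'national_cost_matrix.csv')`.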

Next, we'll read in the CSV using dask dataframe's `read_csv()` function. Note: running this command does not load any data yet. Dask uses lazy evaluation, so nothing is computed until you call the `.compute()` method.

In [1]:
import dask.dataframe as dd

national_cost_matrix = dd.read_csv('./national_cost_matrix.csv')

# Filter out faulty rows (repeated header lines) that exist in the data; the file will be cleaned and replaced soon.
national_cost_matrix = national_cost_matrix[national_cost_matrix['origin'] != 'origin']

We will now convert the origin and destination columns to type `str` and use the newly converted string column to filter for `17031`.

In [2]:
national_cost_matrix['origin'] = national_cost_matrix.origin.astype(str)
national_cost_matrix['destination'] = national_cost_matrix.destination.astype(str)

cook_county_fips = '17031'

cook_cost_matrix = national_cost_matrix[national_cost_matrix['origin'].str.startswith(cook_county_fips) &
                                        national_cost_matrix['destination'].str.startswith(cook_county_fips)]

With our delayed transformations set up, we can now execute them with the `.compute()` method, which has `dask` run them in parallel. **Warning: you might need at least 8GB of memory to successfully execute this process. If you have 8GB of memory and it fails, close unused programs and try again.**

In [None]:
cook_cost_matrix = cook_cost_matrix.compute()

In [4]:
cook_cost_matrix.head()

| Unnamed: 0 | origin | destination | minutes |
|---|---|---|---|
| 162415 | 17031010100 | 17031010100 | 0.17 |
| 162416 | 17031010100 | 17031010201 | 5.7 |
| 162417 | 17031010100 | 17031010202 | 2.63 |
| 162418 | 17031010100 | 17031010300 | 3.58 |
| 162419 | 17031010100 | 17031010400 | 8.53 |


Make sure to save a copy of the subsetted data!

In [5]:
cook_cost_matrix.to_csv("cook_county_cost_matrix.csv")
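
If the `.compute()` step still runs out of memory, one fallback is to stream the compressed file row by row with the standard library, so that only one row is in memory at a time. The sketch below is an assumption-laden alternative, not part of the `access` package: `subset_cost_matrix` is a hypothetical helper, and it assumes the column names `origin`, `destination`, and `minutes` shown in the output above. It runs on a single core, so expect it to be slower than `dask`.

```python
import bz2
import csv


def subset_cost_matrix(src, dst, prefix):
    """Stream src (a bzipped CSV) and write rows within prefix to dst.

    Hypothetical helper. Returns the number of rows written.
    """
    fields = ["origin", "destination", "minutes"]  # assumed column names
    kept = 0
    with bz2.open(src, "rt", newline="") as f_in, \
            open(dst, "w", newline="") as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.DictWriter(f_out, fieldnames=fields)
        writer.writeheader()
        for row in reader:
            origin = row.get("origin") or ""
            destination = row.get("destination") or ""
            # Repeated header rows never match a numeric prefix,
            # so they are dropped along with out-of-county rows.
            if origin.startswith(prefix) and destination.startswith(prefix):
                writer.writerow({k: row.get(k) for k in fields})
                kept += 1
    return kept
```

For the Cook County example, the call would be `subset_cost_matrix('national_cost_matrix.csv.bz2', 'cook_county_cost_matrix.csv', '17031')`. Reading the `.bz2` file directly also means the uncompressed CSV never has to be written to disk.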

## Use Bash Commands to Subset the Travel Cost Matrix