# Data Download and Exploration

The cell below makes the notebook re-import your source code in `src` whenever it is edited. (By default, a module that has already been imported is not re-imported, because most modules are assumed not to change over time.) It's a good idea to include it in any exploratory notebook that uses `src` code.

In [1]:
%load_ext autoreload
%autoreload 2

The next snippet allows the notebook to import from the `src` module.  The directory structure looks like:

```
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering)
│   │                     followed by the topic of the notebook, e.g.
│   │                     01_data_collection_exploration.ipynb
│   ├── exploratory    <- Raw, flow-of-consciousness, work-in-progress notebooks
│   └── report         <- Final summary notebook(s)
│
├── src                <- Source code for use in this project
│   ├── data           <- Scripts to download and query data
│   │   ├── sql        <- SQL scripts. Naming convention is a number (for ordering)
│   │   │                 followed by the topic of the script, e.g.
│   │   │                 03_create_pums_2017_table.sql
│   │   ├── data_collection.py
│   │   └── sql_utils.py
```

So we need to go up two "pardir"s (parent directories) to import the `src` code from this notebook.  You'll want to include this code at the top of any notebook that uses the `src` code.

In [2]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
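
The same path setup can be wrapped in a small helper so that the number of parent levels is explicit. This is only a sketch: `add_project_root` and its `levels_up` and `start` parameters are hypothetical names, not part of the `src` package, and the default assumes the notebook is run from two levels below the project root (for example `notebooks/exploratory`).

```python
import os
import sys


def add_project_root(levels_up=2, start=None):
    """Hypothetical helper: append the directory `levels_up` levels above `start` to sys.path."""
    # Default to the current working directory, which is where the notebook runs.
    start = os.getcwd() if start is None else start
    # Join one os.pardir ("..") per level, then normalize to an absolute path.
    root = os.path.abspath(os.path.join(start, *([os.pardir] * levels_up)))
    # Avoid adding the same entry twice if the cell is re-run.
    if root not in sys.path:
        sys.path.append(root)
    return root


project_root = add_project_root(levels_up=2)
print(project_root)
```

Printing the returned path is a quick way to confirm that it points at the directory containing `src` before attempting the import.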

The code to download all of the data and load it into a SQL database is in the `data_collection` module, inside the `data` subpackage of `src`.  You'll only need to run `download_data_and_load_into_sql` once for the duration of the project.

In [3]:
from src.data import data_collection

The next cell may take 10 to 20 minutes to run, depending on your network connection and computer specs.

In [4]:
data_collection.download_data_and_load_into_sql()

Successfully created database and all tables

Successfully downloaded ZIP file
    https://www2.census.gov/programs-surveys/acs/data/pums/2017/5-Year/csv_pwa.zip
    
Successfully downloaded GZIP file
    https://lehd.ces.census.gov/data/lodes/LODES7/wa/wac/wa_wac_S000_JT00_2017.csv.gz
    
Successfully downloaded GZIP file
    https://lehd.ces.census.gov/data/lodes/LODES7/wa/wa_xwalk.csv.gz
    
Successfully downloaded CSV file
    https://www2.census.gov/geo/docs/maps-data/data/rel/2010_Census_Tract_to_2010_PUMA.txt
    
Successfully loaded CSV file into `pums_2017` table
        
Successfully loaded CSV file into `puma_names_2010` table
        
Successfully loaded CSV file into `wa_jobs_2017` table
        
Successfully loaded CSV file into `wa_geo_xwalk` table
        
Successfully loaded CSV file into `ct_puma_xwalk` table
        


Now it's time to explore the data!

In [6]:
import psycopg2
import pandas as pd

In [7]:
DBNAME = "opportunity_youth"

In [23]:
conn = psycopg2.connect(dbname=DBNAME)

In [34]:
pd.read_sql("SELECT * FROM pums_2017 WHERE agep BETWEEN 16 AND 24 ORDER BY agep DESC LIMIT 10;", conn)

Unnamed: 0,rt,serialno,division,sporder,puma,region,st,adjinc,pwgtp,agep,...,pwgtp71,pwgtp72,pwgtp73,pwgtp74,pwgtp75,pwgtp76,pwgtp77,pwgtp78,pwgtp79,pwgtp80
0,P,2013000015852,9,3,11503,4,53,1061971,30.0,24.0,...,9.0,8.0,27.0,28.0,30.0,32.0,33.0,47.0,33.0,42.0
1,P,2013000016357,9,1,11703,4,53,1061971,24.0,24.0,...,31.0,22.0,21.0,26.0,9.0,10.0,21.0,20.0,6.0,23.0
2,P,2013000009845,9,1,11900,4,53,1061971,6.0,24.0,...,6.0,6.0,5.0,5.0,6.0,10.0,11.0,11.0,5.0,1.0
3,P,2013000014727,9,2,11802,4,53,1061971,39.0,24.0,...,25.0,11.0,49.0,16.0,16.0,61.0,59.0,26.0,29.0,45.0
4,P,2013000005849,9,1,10100,4,53,1061971,16.0,24.0,...,25.0,16.0,14.0,14.0,31.0,5.0,28.0,16.0,5.0,5.0
5,P,2013000013052,9,1,11604,4,53,1061971,41.0,24.0,...,50.0,11.0,11.0,14.0,11.0,35.0,47.0,16.0,11.0,14.0
6,P,2013000003570,9,2,11610,4,53,1061971,15.0,24.0,...,16.0,26.0,14.0,4.0,17.0,24.0,29.0,14.0,15.0,16.0
7,P,2013000003570,9,1,11610,4,53,1061971,20.0,24.0,...,23.0,33.0,19.0,5.0,18.0,33.0,33.0,24.0,22.0,20.0
8,P,2013000001874,9,2,10703,4,53,1061971,13.0,24.0,...,12.0,22.0,24.0,13.0,15.0,12.0,13.0,22.0,12.0,12.0
9,P,2013000017377,9,2,11200,4,53,1061971,10.0,24.0,...,9.0,3.0,2.0,3.0,14.0,15.0,11.0,8.0,9.0,3.0


Notice the `LIMIT 10` above.  These tables hold a large amount of data, and **your goal is to use SQL to create your main query, not Pandas**.  Pandas can technically do everything that you need to do, but it will be much slower and less efficient.  Nevertheless, Pandas is still a useful tool for exploring the data and getting a basic sense of what you're looking at.
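
To illustrate what "doing the work in SQL" can look like, here is a minimal sketch in which the database filters and aggregates the rows and Python only receives the summary. It is not the project's main query. It uses an in-memory SQLite table as a stand-in so that it runs without the project database; the table name `pums_toy` and every row in it are made up for illustration, and only the column names `puma`, `agep`, and `pwgtp` are taken from the `pums_2017` output above.

```python
import sqlite3

# Toy stand-in for the real table: the table name and all values are invented.
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE pums_toy (puma INTEGER, agep INTEGER, pwgtp INTEGER)")
toy_conn.executemany(
    "INSERT INTO pums_toy VALUES (?, ?, ?)",
    [(11610, 24, 15), (11610, 24, 20), (11610, 40, 30), (11703, 17, 24)],
)

# The WHERE and GROUP BY run inside the database, so only one row per PUMA
# comes back instead of every matching person record.
query = """
    SELECT puma, COUNT(*) AS n_rows, SUM(pwgtp) AS weighted_total
    FROM pums_toy
    WHERE agep BETWEEN 16 AND 24
    GROUP BY puma
    ORDER BY puma;
"""
summary = toy_conn.execute(query).fetchall()
toy_conn.close()
print(summary)
```

Against the project database, a query of the same shape could be passed to `pd.read_sql(query, conn)` with `pums_2017` as the table name, so that Pandas only has to hold the aggregated result.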

In [26]:
pd.read_sql("SELECT * FROM ct_puma_xwalk LIMIT 10;", conn)

Unnamed: 0,statefp,countyfp,tractce,puma5ce
0,1,1,20100,2100
1,1,1,20200,2100
2,1,1,20300,2100
3,1,1,20400,2100
4,1,1,20500,2100
5,1,1,20600,2100
6,1,1,20700,2100
7,1,1,20801,2100
8,1,1,20802,2100
9,1,1,20900,2100


In [16]:
pd.read_sql("SELECT * FROM puma_names_2010 LIMIT 10;", conn)

Unnamed: 0,state_fips,state_name,cpuma0010,puma,geoid,gisjoin,puma_name
0,1,Alabama ...,1,100,100100,G01000100,"Lauderdale, Colbert, Franklin & Marion (Northe..."
1,1,Alabama ...,1,800,100800,G01000800,St. Clair & Blount Counties ...
2,1,Alabama ...,1,1400,101400,G01001400,"Walker, Marion (South & West), Fayette & Lamar..."
3,1,Alabama ...,1,1500,101500,G01001500,Tuscaloosa (Outer) & Pickens Counties--Northpo...
4,1,Alabama ...,1,1600,101600,G01001600,Tuscaloosa & Northport (Southeast) Cities ...
5,1,Alabama ...,1,1700,101700,G01001700,"Dallas, Bibb, Marengo, Hale, Sumter, Perry & G..."
6,1,Alabama ...,2,200,100200,G01000200,Limestone & Madison (Outer) Counties--Huntsvil...
7,1,Alabama ...,2,301,100301,G01000301,Huntsville (North) & Madison (East) Cities ...
8,1,Alabama ...,2,302,100302,G01000302,Huntsville City (Central & South) ...
9,1,Alabama ...,2,500,100500,G01000500,Marshall & Madison (Southeast) Counties--Hunts...


In [17]:
pd.read_sql("SELECT * FROM wa_geo_xwalk LIMIT 10;", conn)

Unnamed: 0,tabblk2010,st,stusps,stname,cty,ctyname,trct,trctname,bgrp,bgrpname,...,stanrcname,necta,nectname,mil,milname,stwib,stwibname,blklatdd,blklondd,createdate
0,530630112024017,53,WA,Washington ...,53063,"Spokane County, WA ...",53063011202,"112.02 (Spokane, WA) ...",530630112024,"4 (Tract 112.02, Spokane, WA) ...",...,,99999,,,,53000012,12 Spokane WIB ...,47.716671,-117.354964,2019-08-26
1,530630105031024,53,WA,Washington ...,53063,"Spokane County, WA ...",53063010503,"105.03 (Spokane, WA) ...",530630105031,"1 (Tract 105.03, Spokane, WA) ...",...,,99999,,,,53000012,12 Spokane WIB ...,47.783324,-117.402801,2019-08-26
2,530630101001010,53,WA,Washington ...,53063,"Spokane County, WA ...",53063010100,"101 (Spokane, WA) ...",530630101001,"1 (Tract 101, Spokane, WA) ...",...,,99999,,,,53000012,12 Spokane WIB ...,48.044817,-117.17036,2019-08-26
3,530630101001014,53,WA,Washington ...,53063,"Spokane County, WA ...",53063010100,"101 (Spokane, WA) ...",530630101001,"1 (Tract 101, Spokane, WA) ...",...,,99999,,,,53000012,12 Spokane WIB ...,48.015862,-117.164765,2019-08-26
4,530630101001020,53,WA,Washington ...,53063,"Spokane County, WA ...",53063010100,"101 (Spokane, WA) ...",530630101001,"1 (Tract 101, Spokane, WA) ...",...,,99999,,,,53000012,12 Spokane WIB ...,48.012248,-117.162461,2019-08-26
5,530630113004017,53,WA,Washington ...,53063,"Spokane County, WA ...",53063011300,"113 (Spokane, WA) ...",530630113004,"4 (Tract 113, Spokane, WA) ...",...,,99999,,,,53000012,12 Spokane WIB ...,47.698614,-117.262875,2019-08-26
6,530630018001012,53,WA,Washington ...,53063,"Spokane County, WA ...",53063001800,"18 (Spokane, WA) ...",530630018001,"1 (Tract 18, Spokane, WA) ...",...,,99999,,,,53000012,12 Spokane WIB ...,47.679017,-117.373731,2019-08-26
7,530630023003017,53,WA,Washington ...,53063,"Spokane County, WA ...",53063002300,"23 (Spokane, WA) ...",530630023003,"3 (Tract 23, Spokane, WA) ...",...,,99999,,,,53000012,12 Spokane WIB ...,47.666345,-117.452497,2019-08-26
8,530630018001024,53,WA,Washington ...,53063,"Spokane County, WA ...",53063001800,"18 (Spokane, WA) ...",530630018001,"1 (Tract 18, Spokane, WA) ...",...,,99999,,,,53000012,12 Spokane WIB ...,47.6752,-117.390052,2019-08-26
9,530630023003025,53,WA,Washington ...,53063,"Spokane County, WA ...",53063002300,"23 (Spokane, WA) ...",530630023003,"3 (Tract 23, Spokane, WA) ...",...,,99999,,,,53000012,12 Spokane WIB ...,47.665443,-117.449826,2019-08-26


In [18]:
pd.read_sql("SELECT * FROM wa_jobs_2017 LIMIT 10;", conn)

Unnamed: 0,w_geocode,c000,ca01,ca02,ca03,ce01,ce02,ce03,cns01,cns02,...,cfa02,cfa03,cfa04,cfa05,cfs01,cfs02,cfs03,cfs04,cfs05,createdate
0,530019501001010,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2019-08-25
1,530019501001024,1,0,1,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,2019-08-25
2,530019501001026,1,1,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,2019-08-25
3,530019501001044,1,0,1,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,2019-08-25
4,530019501001053,2,0,2,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,2019-08-25
5,530019501001061,7,1,3,3,0,5,2,7,0,...,0,0,0,0,0,0,0,0,0,2019-08-25
6,530019501001090,2,0,2,0,0,2,0,2,0,...,0,0,0,0,0,0,0,0,0,2019-08-25
7,530019501001099,1,1,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,2019-08-25
8,530019501001104,1,0,0,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,2019-08-25
9,530019501001112,4,1,2,1,0,2,2,4,0,...,0,0,0,0,0,0,0,0,0,2019-08-25


Make sure you close the DB connection when you are done using it.

In [None]:
conn.close()
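
One way to make sure a connection is closed even if a query raises an error is `contextlib.closing` from the standard library. The sketch below uses an in-memory SQLite connection as a stand-in so that it runs without the project database; the assumption is that the same pattern would apply to `psycopg2.connect(dbname=DBNAME)`. Note that psycopg2's own `with conn:` block wraps a transaction and does not close the connection.

```python
import sqlite3
from contextlib import closing

# Stand-in connection; in this notebook it would be psycopg2.connect(dbname=DBNAME).
with closing(sqlite3.connect(":memory:")) as demo_conn:
    row = demo_conn.execute("SELECT 1 + 1").fetchone()

print(row)

# Once the block exits, the connection has been closed, so further use fails.
try:
    demo_conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
print(still_open)
```

In an exploratory notebook a long-lived `conn` with an explicit `conn.close()` at the end is often more convenient; the `closing` pattern is better suited to scripts and functions in `src`.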