# 2016 Immigration data model
### Data Engineering Capstone Project

#### Project Summary
This project is an interpretation of the Udacity-provided project.
I used the immigration dataset along with the city demographics and temperature datasets to develop analytics tables that give insight into immigration trends in the US.
Using a Postgres database, the data is first ingested into staging tables, then normalized, and finally transformed into a star schema for analytics.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [105]:
# dependencies
from datetime import datetime, timedelta
import re
import pandas as pd
import psycopg2 as ps
from sqlalchemy import create_engine

pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)

### Step 1: Scope the Project and Gather Data

#### Scope 
The immigration, demographics and temperature datasets were provided by Udacity. I used pandas to read and manipulate the data and psycopg2 to do an initial load into Postgres.
Once the data was in Postgres, the ETL and analysis were done using SQL.

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 


I split the immigration dataset into a single fact_immmigration table and several dim_ dimension tables. The temperature dataset first produced one dimension table holding the raw data, which I then aggregated into state-level statistics in a second dimension table.

Before loading the data into SQL, I did some exploratory data analysis in pandas to get an idea of what DDL should define my tables.
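One way to turn that exploration into a first DDL draft is to map each pandas dtype to a Postgres column type. The sketch below is illustrative only: the `PG_TYPES` mapping, the `draft_ddl` helper and the `staging_immigration` table name are assumptions, not the tables actually used in this project, and the generated types should be reviewed by hand.

```python
import pandas as pd

# Hypothetical dtype-to-Postgres mapping; adjust to the real column semantics.
PG_TYPES = {
    "int64": "BIGINT",
    "float64": "DOUBLE PRECISION",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
}


def draft_ddl(df: pd.DataFrame, table_name: str) -> str:
    """Return a draft CREATE TABLE statement inferred from a DataFrame's dtypes."""
    columns = []
    for name, dtype in df.dtypes.items():
        # Anything not in the mapping (e.g. string/object columns) falls back to TEXT
        pg_type = PG_TYPES.get(str(dtype), "TEXT")
        columns.append(f'    "{name}" {pg_type}')
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n" + ",\n".join(columns) + "\n);"


# Tiny sample using column names from the immigration data
sample = pd.DataFrame({"cicid": [4084316.0], "i94port": ["HHW"], "count": [1]})
ddl = draft_ddl(sample, "staging_immigration")  # table name is hypothetical
print(ddl)
```

Running it on `immi_df.head()` instead of the tiny sample would give a starting point for a staging table, which can then be tightened (for example, storing `cicid` as an integer).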


#### Immigration data

In [75]:
# load immigration data into dataframe
# Note here we only load the example as demonstration (see README)
immi_df = pd.read_csv('./data/immigration_data_sample.csv')

In [78]:
# Visualize dataframe
pd.set_option('display.max_columns', 30)
display(immi_df.head())

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,20573.0,61.0,2.0,1.0,20160422,,,G,O,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,20568.0,26.0,2.0,1.0,20160423,MTR,,G,R,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,20571.0,76.0,2.0,1.0,20160407,,,G,O,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,20581.0,25.0,2.0,1.0,20160428,DOH,,G,O,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,20553.0,19.0,2.0,1.0,20160406,,,Z,K,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT


#### US city demographics data

In [55]:
# load demographics data
dem_df = pd.read_csv('./data/us-cities-demographics.csv', delimiter=';')

In [80]:
# Visualize dataframe
display(dem_df.head())

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


#### Global temperature data

In [61]:
# load temperature data into dataframe
temp_df = pd.read_csv('./data/GlobalLandTemperaturesByCity.csv')

In [84]:
# Visualize dataframe
temp_df.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


### Step 2: Explore and Assess the Data
#### Immigration data 
For the immigration data, we want to drop all invalid entries for destination and origin cities (e.g., XXX, 11, etc.) as described in I94_SAS_Labels_Descriptions.SAS.

In [106]:
# Create dictionary of valid i94port codes
# Captures the two quoted strings on a line, e.g. 'CODE' = 'DESCRIPTION'
re_obj = re.compile(r'\'(.*)\'.*\'(.*)\'')
i94port = {}
with open('./data/I94_SAS_Labels_Descriptions.SAS') as f:
    for line in f:
        match = re_obj.search(line)
        # search() returns None for lines without two quoted strings; skip them
        if match:
            i94port[match[1]] = [match[2]]
display(i94port)
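Once a dictionary of valid port codes exists, dropping invalid rows can be sketched as a simple membership filter. This is a minimal illustration, not the project's actual cleaning step: `keep_valid_ports` is a hypothetical helper, and the `valid_ports` dictionary below uses placeholder descriptions rather than values read from the SAS labels file.

```python
import pandas as pd

# Hypothetical stand-in for the parsed dictionary; descriptions are placeholders
valid_ports = {"HHW": ["description A"], "LOS": ["description B"]}


def keep_valid_ports(df: pd.DataFrame, ports: dict) -> pd.DataFrame:
    """Keep only rows whose i94port code is a key of the ports dictionary."""
    mask = df["i94port"].isin(list(ports))
    return df[mask].reset_index(drop=True)


# "XXX" stands for an invalid code and should be dropped
sample = pd.DataFrame({"cicid": [1.0, 2.0, 3.0], "i94port": ["HHW", "XXX", "LOS"]})
filtered = keep_valid_ports(sample, valid_ports)
print(filtered)
```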

#### Demographics data
For the demographics data we are interested in total populations and unique rows.
The data is filtered by:
- Remove duplicates if present
- Remove rows with NaN in total population column

The data is also checked to assert that the male plus female population, the number of veterans, the foreign-born count and the per-race count are each <= total population.

In [102]:
df = dem_df.drop_duplicates()
dem_df = df[pd.notnull(df['Total Population'])]
display(dem_df.head())

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [91]:
# Ensure no component count exceeds the total population
for idx, row in dem_df.iterrows():
    if pd.notnull(row['Male Population']):
        assert row['Male Population'] + row['Female Population'] <= row['Total Population']
    if pd.notnull(row['Number of Veterans']):
        assert row['Number of Veterans'] <= row['Total Population']
    if pd.notnull(row['Foreign-born']):
        assert row['Foreign-born'] <= row['Total Population']
    if pd.notnull(row['Count']):
        assert row['Count'] <= row['Total Population']
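The same consistency check can be sketched without a row-by-row loop by returning the offending rows instead of stopping at the first failed assertion. This is an alternative formulation under one stated assumption: comparisons involving NaN evaluate to False, so rows with missing values are treated as passing. `population_violations` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def population_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where any component count exceeds the total population."""
    total = df["Total Population"]
    # Comparisons with NaN evaluate to False, so missing values are not flagged
    bad = (
        (df["Male Population"] + df["Female Population"] > total)
        | (df["Number of Veterans"] > total)
        | (df["Foreign-born"] > total)
        | (df["Count"] > total)
    )
    return df[bad]


# Made-up rows: the second one has male + female > total
sample = pd.DataFrame({
    "Male Population": [40.0, 60.0, None],
    "Female Population": [50.0, 60.0, 10.0],
    "Total Population": [100, 100, 100],
    "Number of Veterans": [5.0, 5.0, None],
    "Foreign-born": [10.0, 10.0, 10.0],
    "Count": [20, 20, 20],
})
violations = population_violations(sample)
print(violations)
```

An empty result on `dem_df` would correspond to all of the assertions above passing.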

#### Temperature data
For the temperature data we are only interested in unique values for United States cities.
The data is filtered by:
- Country equals United States
- Remove duplicates (city, country)
- Remove rows with NaN average temperature values

In [103]:
df = temp_df[temp_df.Country == 'United States']
df = df.drop_duplicates(['City', 'Country'])
temp_df = df[pd.notnull(df.AverageTemperature)]
display(temp_df.head())

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
47555,1820-01-01,2.101,3.217,Abilene,United States,32.95N,100.53W
137066,1743-11-01,3.209,1.961,Akron,United States,40.99N,80.95W
168075,1820-01-01,-3.42,3.182,Albuquerque,United States,34.56N,107.03W
187528,1743-11-01,5.339,1.828,Alexandria,United States,39.38N,76.99W
202251,1743-11-01,3.264,1.665,Allentown,United States,40.99N,74.56W
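One detail of the filter order is worth noting: `drop_duplicates` keeps the first row per city, so if that row has a NaN temperature, the later NaN filter removes the city entirely. The sketch below shows a possible variant that removes NaN rows before deduplicating. It is an illustration, not the step used above; `first_valid_us_reading` is a hypothetical helper and "Sampleville" is a made-up city.

```python
import pandas as pd


def first_valid_us_reading(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first non-null temperature row for each United States city."""
    us = df[df["Country"] == "United States"]
    valid = us[us["AverageTemperature"].notnull()]
    # Deduplicate last, so a city whose first reading is NaN is still kept
    return valid.drop_duplicates(["City", "Country"])


# Made-up sample: Sampleville's first reading is missing
sample = pd.DataFrame({
    "dt": ["1743-11-01", "1743-12-01", "1743-11-01", "1743-11-01"],
    "AverageTemperature": [None, 4.5, 3.209, 6.068],
    "City": ["Sampleville", "Sampleville", "Akron", "Århus"],
    "Country": ["United States", "United States", "United States", "Denmark"],
})
result = first_valid_us_reading(sample)
print(result)
```

With the original order (deduplicate, then drop NaN), Sampleville would be missing from the result.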


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# load sql extension
%load_ext sql

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
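A source/count check of the kind listed above could look like the following sketch. It only relies on the standard DB-API calls (`cursor`, `execute`, `fetchone`), so the same function should also work with a psycopg2 connection; here an in-memory SQLite database stands in for Postgres so the example runs on its own. The `check_row_count` helper and the `staging_demographics` table are assumptions for illustration.

```python
import sqlite3


def check_row_count(conn, table: str, expected: int) -> bool:
    """Return True when the table holds exactly the expected number of rows."""
    cur = conn.cursor()
    # Table names cannot be bound as query parameters; pass trusted names only
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    actual = cur.fetchone()[0]
    cur.close()
    return actual == expected


# SQLite stand-in for the Postgres database; table name is hypothetical
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_demographics (city TEXT, total_population INTEGER)")
conn.executemany(
    "INSERT INTO staging_demographics VALUES (?, ?)",
    [("Silver Spring", 82463), ("Quincy", 93629)],
)
ok = check_row_count(conn, "staging_demographics", 2)
too_few = check_row_count(conn, "staging_demographics", 3)
print(ok, too_few)
```

In the pipeline, `expected` would typically be `len(dem_df)` (or the length of whichever DataFrame was loaded), compared against the matching staging table.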

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.