# Immigration Data Pipeline
### Data Engineering Capstone Project

#### Project Summary

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [2]:
# Do all imports and installs here
import pandas as pd
import os

### Step 1: Scope the Project and Gather Data



### Step 2: Explore and Assess the Data
 

In [3]:
# Read in the data here
df_immigration = pd.read_sas("/home/workspace/data/i94_apr16_sub.sas7bdat")

In [5]:
df_airport_codes = pd.read_csv("/home/workspace/data/airport-codes_csv.csv")

In [6]:
df_cityTemperature = pd.read_csv("/home/workspace/data/GlobalLandTemperaturesByCity.csv")

In [7]:
df_cityDemographics = pd.read_csv("/home/workspace/data/us-cities-demographics.csv", delimiter=';')

In [17]:
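def getColumnsWithNANs(df):
    """Return per-column null counts and the columns with more than 5% nulls."""
    columns_dict = {}   # column name -> number of null values
    drop_columns = []   # columns whose null count exceeds 5% of the row count
    threshold = 0.05 * df.shape[0]

    for col in df.columns:
        colWiseNullCount = df[col].isna().sum()
        columns_dict[col] = colWiseNullCount
        if colWiseNullCount > threshold:
            drop_columns.append(col)
    return columns_dict, drop_columns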
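
The same 5% rule can also be written without an explicit loop. The following is a minimal sketch, not the notebook's own helper: the function name `columns_over_null_threshold`, the `threshold` parameter, and the sample frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def columns_over_null_threshold(df, threshold=0.05):
    """Return (null count per column, columns whose null fraction exceeds threshold)."""
    null_counts = df.isna().sum()        # Series: column -> number of nulls
    null_fraction = df.isna().mean()     # Series: column -> fraction of nulls
    flagged = null_fraction[null_fraction > threshold].index.tolist()
    return null_counts.to_dict(), flagged


# Tiny illustrative frame with made-up values (not project data)
sample = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [None, None, 3, 4],
    "c": [1, None, 3, 4],
})
counts, flagged = columns_over_null_threshold(sample, threshold=0.3)
print(counts)
print(flagged)
```

Comparing the null fraction against the threshold is equivalent to comparing the null count against `threshold * df.shape[0]` for any frame that has at least one row.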

In [9]:
def getDetails(df, n=20):
    """Print the shape and the first n rows of the data frame."""
    print("Shape of the data frame is: ", df.shape)
    print(f"\nView top {n} rows: \n")
    print(df.head(n))

In [10]:
getDetails(df_immigration)

Shape of the data frame is:  (3096313, 28)

View top 5 rows: 

    cicid   i94yr  i94mon  i94cit  i94res i94port  arrdate  i94mode i94addr  \
0     6.0  2016.0     4.0   692.0   692.0  b'XXX'  20573.0      NaN     NaN   
1     7.0  2016.0     4.0   254.0   276.0  b'ATL'  20551.0      1.0   b'AL'   
2    15.0  2016.0     4.0   101.0   101.0  b'WAS'  20545.0      1.0   b'MI'   
3    16.0  2016.0     4.0   101.0   101.0  b'NYC'  20545.0      1.0   b'MA'   
4    17.0  2016.0     4.0   101.0   101.0  b'NYC'  20545.0      1.0   b'MA'   
5    18.0  2016.0     4.0   101.0   101.0  b'NYC'  20545.0      1.0   b'MI'   
6    19.0  2016.0     4.0   101.0   101.0  b'NYC'  20545.0      1.0   b'NJ'   
7    20.0  2016.0     4.0   101.0   101.0  b'NYC'  20545.0      1.0   b'NJ'   
8    21.0  2016.0     4.0   101.0   101.0  b'NYC'  20545.0      1.0   b'NY'   
9    22.0  2016.0     4.0   101.0   101.0  b'NYC'  20545.0      1.0   b'NY'   
10   23.0  2016.0     4.0   101.0   101.0  b'NYC'  20545.0      1.0 
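
The `b'XXX'` and `b'AL'` values in the preview are Python byte strings, which is how `pd.read_sas` returns text columns when no `encoding` is passed. A minimal sketch of decoding them after loading follows; the helper name `decode_bytes_columns`, the choice of UTF-8, and the sample frame are assumptions for illustration, and the file's real encoding has not been checked here.

```python
import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of df with bytes-valued object columns decoded to str."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        non_null = out[col].dropna()
        # Only touch columns whose non-null values are all bytes
        if not non_null.empty and non_null.map(lambda v: isinstance(v, bytes)).all():
            out[col] = out[col].str.decode(encoding)
    return out


# Small frame shaped like the first two preview rows (illustrative only)
sample = pd.DataFrame({
    "cicid": [6.0, 7.0],
    "i94port": [b"XXX", b"ATL"],
    "i94addr": [None, b"AL"],
})
decoded = decode_bytes_columns(sample)
print(decoded)
```

Passing an `encoding` argument to `pd.read_sas` at load time is another option that avoids the extra pass over the data.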

In [11]:
df_immigration.columns

Index(['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate',
       'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa', 'count',
       'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd', 'entdepu',
       'matflag', 'biryear', 'dtaddto', 'gender', 'insnum', 'airline',
       'admnum', 'fltno', 'visatype'],
      dtype='object')
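
Values such as `20545.0` in `arrdate` look like SAS date numbers, which count days from 1960-01-01. Assuming that interpretation holds for `arrdate` and `depdate` (the notebook does not state it), a conversion could be sketched as follows; `sas_days_to_datetime` is a hypothetical helper name.

```python
import pandas as pd


def sas_days_to_datetime(days):
    """Convert SAS day counts (days since 1960-01-01) to pandas datetimes."""
    return pd.to_datetime(days, unit="D", origin="1960-01-01")


# Values taken from the arrdate column in the preview, plus one missing value
arrdate = pd.Series([20573.0, 20551.0, 20545.0, None])
converted = sas_days_to_datetime(arrdate)
print(converted)
```

Under this assumption `20545.0` maps to 2016-04-01, which agrees with the April 2016 file name.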

In [19]:
col_dict, drop_col_immigration = getColumnsWithNANs(df_immigration)
print(col_dict)
print(drop_col_immigration)

{'cicid': 0, 'i94yr': 0, 'i94mon': 0, 'i94cit': 0, 'i94res': 0, 'i94port': 0, 'arrdate': 0, 'i94mode': 239, 'i94addr': 152372, 'depdate': 142457, 'i94bir': 802, 'i94visa': 0, 'count': 0, 'dtadfile': 1, 'visapost': 1881250, 'occup': 3088187, 'entdepa': 238, 'entdepd': 138429, 'entdepu': 3095921, 'matflag': 138429, 'biryear': 802, 'dtaddto': 477, 'gender': 414269, 'insnum': 2982605, 'airline': 83627, 'admnum': 0, 'fltno': 19549, 'visatype': 0}
['visapost', 'occup', 'entdepu', 'gender', 'insnum']
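
One way to act on the flagged list is to drop those columns before modelling. This is a sketch only: the notebook does not show whether the columns were actually dropped, and `drop_flagged_columns` and the sample frame are hypothetical.

```python
import pandas as pd


def drop_flagged_columns(df, columns_to_drop):
    """Return a copy of df without the listed columns; absent names are ignored."""
    return df.drop(columns=columns_to_drop, errors="ignore")


# Illustrative frame containing only some of the flagged columns
sample = pd.DataFrame({
    "cicid": [6.0, 7.0],
    "occup": [None, None],
    "gender": [None, "M"],
})
flagged = ["visapost", "occup", "entdepu", "gender", "insnum"]
cleaned = drop_flagged_columns(sample, flagged)
print(cleaned.columns.tolist())
```

Using `errors="ignore"` lets the same list be applied to subsets of the data that no longer carry every flagged column.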


In [13]:
getDetails(df_airport_codes)

Shape of the data frame is:  (55075, 12)

View top 5 rows: 

   ident           type                                name  elevation_ft  \
0    00A       heliport                   Total Rf Heliport          11.0   
1   00AA  small_airport                Aero B Ranch Airport        3435.0   
2   00AK  small_airport                        Lowell Field         450.0   
3   00AL  small_airport                        Epps Airpark         820.0   
4   00AR         closed  Newport Hospital & Clinic Heliport         237.0   
5   00AS  small_airport                      Fulton Airport        1100.0   
6   00AZ  small_airport                      Cordes Airport        3810.0   
7   00CA  small_airport             Goldstone /Gts/ Airport        3038.0   
8   00CL  small_airport                 Williams Ag Airport          87.0   
9   00CN       heliport     Kitchen Creek Helibase Heliport        3350.0   
10  00CO         closed                          Cass Field        4830.0   
11  00FA  small

In [21]:
col_dict, drop_col_airport_codes = getColumnsWithNANs(df_airport_codes)
print(col_dict)
print(drop_col_airport_codes)

{'ident': 0, 'type': 0, 'name': 0, 'elevation_ft': 7006, 'continent': 27719, 'iso_country': 247, 'iso_region': 0, 'municipality': 5676, 'gps_code': 14045, 'iata_code': 45886, 'local_code': 26389, 'coordinates': 0}
['elevation_ft', 'continent', 'municipality', 'gps_code', 'iata_code', 'local_code']
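
The high null count for `continent` may partly be a parsing effect: by default `pd.read_csv` treats the literal string `NA` as missing, and `NA` is also a plausible code for North America. Whether that explains the count above has not been verified here. The sketch below uses a made-up three-line CSV to show the difference between the default behaviour and reading with `keep_default_na=False`.

```python
import io

import pandas as pd

# Made-up CSV text for illustration; not rows from airport-codes_csv.csv
csv_text = "ident,continent,iata_code\n00A,NA,\nEGLL,EU,LHR\n"

# Default parsing: the string "NA" is converted to NaN
default_read = pd.read_csv(io.StringIO(csv_text))

# Only empty fields are treated as missing; "NA" is kept as text
literal_read = pd.read_csv(
    io.StringIO(csv_text), keep_default_na=False, na_values=[""]
)

print(default_read["continent"].isna().sum())
print(literal_read["continent"].isna().sum())
print(literal_read["continent"].tolist())
```

If the real file behaves the same way, rereading it with these options before applying the 5% rule would change which columns are flagged.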


In [15]:
getDetails(df_cityTemperature)

Shape of the data frame is:  (8599212, 7)

View top 5 rows: 

            dt  AverageTemperature  AverageTemperatureUncertainty   City  \
0   1743-11-01               6.068                          1.737  Århus   
1   1743-12-01                 NaN                            NaN  Århus   
2   1744-01-01                 NaN                            NaN  Århus   
3   1744-02-01                 NaN                            NaN  Århus   
4   1744-03-01                 NaN                            NaN  Århus   
5   1744-04-01               5.788                          3.624  Århus   
6   1744-05-01              10.644                          1.283  Århus   
7   1744-06-01              14.051                          1.347  Århus   
8   1744-07-01              16.082                          1.396  Århus   
9   1744-08-01                 NaN                            NaN  Århus   
10  1744-09-01              12.781                          1.454  Århus   
11  1744-10-01            

In [22]:
col_dict, drop_col_cityTemperature = getColumnsWithNANs(df_cityTemperature)
print(col_dict)
print(drop_col_cityTemperature)

{'dt': 0, 'AverageTemperature': 364130, 'AverageTemperatureUncertainty': 364130, 'City': 0, 'Country': 0, 'Latitude': 0, 'Longitude': 0}
[]
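
No temperature column crosses the 5% threshold, but 364130 rows still lack a reading. A possible follow-up, sketched under the assumption that rows without `AverageTemperature` are not useful downstream, is to drop those rows instead of the column. `drop_rows_missing` is a hypothetical helper.

```python
import pandas as pd


def drop_rows_missing(df, required_columns):
    """Return only the rows where every required column is non-null."""
    return df.dropna(subset=required_columns)


# Three rows modelled on the preview above
sample = pd.DataFrame({
    "dt": ["1743-11-01", "1743-12-01", "1744-04-01"],
    "AverageTemperature": [6.068, None, 5.788],
    "City": ["Århus", "Århus", "Århus"],
})
kept = drop_rows_missing(sample, ["AverageTemperature"])
print(kept)
```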


In [21]:
getDetails(df_cityDemographics)

Shape of the data frame is:  (2891, 12)

View top 5 rows: 

                City           State  Median Age  Male Population  \
0      Silver Spring        Maryland        33.8          40601.0   
1             Quincy   Massachusetts        41.0          44129.0   
2             Hoover         Alabama        38.5          38040.0   
3   Rancho Cucamonga      California        34.5          88127.0   
4             Newark      New Jersey        34.6         138040.0   
5             Peoria        Illinois        33.1          56229.0   
6           Avondale         Arizona        29.1          38712.0   
7        West Covina      California        39.8          51629.0   
8           O'Fallon        Missouri        36.0          41762.0   
9         High Point  North Carolina        35.5          51751.0   
10            Folsom      California        40.9          41051.0   
11            Folsom      California        40.9          41051.0   
12      Philadelphia    Pennsylvania       

In [23]:
col_dict, drop_col_cityDemographics = getColumnsWithNANs(df_cityDemographics)
print(col_dict)
print(drop_col_cityDemographics)

{'City': 0, 'State': 0, 'Median Age': 0, 'Male Population': 3, 'Female Population': 3, 'Total Population': 0, 'Number of Veterans': 13, 'Foreign-born': 13, 'Average Household Size': 16, 'State Code': 0, 'Race': 0, 'Count': 0}
[]
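
The preview shows Folsom twice with identical values in the visible columns. One plausible reading, assumed here and not confirmed by the notebook, is that each city has one row per `Race` value with its own `Count`. Under that assumption the table could be reshaped to one row per city, as sketched below with made-up group labels and counts.

```python
import pandas as pd


def pivot_race_counts(df):
    """Return one row per (City, State) with one column per Race holding Count."""
    wide = df.pivot_table(
        index=["City", "State"],
        columns="Race",
        values="Count",
        aggfunc="sum",
        fill_value=0,
    )
    wide.columns.name = None  # drop the "Race" label left on the column axis
    return wide.reset_index()


# Hypothetical rows: group labels and counts are invented for illustration
sample = pd.DataFrame({
    "City": ["Folsom", "Folsom", "Quincy"],
    "State": ["California", "California", "Massachusetts"],
    "Race": ["Group A", "Group B", "Group A"],
    "Count": [100, 20, 50],
})
wide = pivot_race_counts(sample)
print(wide)
```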


### Step 3: Define the Data Model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model

- Refer to the notebook 'DECapstoneProjectNotebook_1.ipynb'.

#### 4.2 Data Quality Checks
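
The notebook does not list the checks that were run. As a minimal sketch of what checks at this step might look like, the hypothetical helper below verifies that a table is non-empty and that its key columns contain no nulls or duplicates. The function name, its parameters, and the choice of `cicid` as a key are assumptions.

```python
import pandas as pd


def run_quality_checks(df, key_columns, table_name="table"):
    """Raise ValueError on an empty table, null keys, or duplicate keys."""
    if df.empty:
        raise ValueError(f"{table_name}: no rows loaded")
    null_key_rows = int(df[key_columns].isna().any(axis=1).sum())
    if null_key_rows:
        raise ValueError(f"{table_name}: {null_key_rows} rows with null key values")
    duplicate_rows = int(df.duplicated(subset=key_columns).sum())
    if duplicate_rows:
        raise ValueError(f"{table_name}: {duplicate_rows} duplicate key rows")
    return {"table": table_name, "rows": len(df)}


# Illustrative frame using cicid values from the immigration preview
good = pd.DataFrame({"cicid": [6.0, 7.0, 15.0]})
report = run_quality_checks(good, ["cicid"], table_name="immigration_sample")
print(report)
```

Raising an exception, as opposed to returning a flag, makes a failed check stop the pipeline run instead of passing silently.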


#### 4.3 Data dictionary 
- Refer to I94_SAS_Labels_Descriptions.SAS.

### Step 5: Complete Project Write Up
