# Project Title
### Data Engineering Capstone Project

#### Project Summary


This project combines four data sets: immigration data, airport codes, US city demographics, and global temperature data.
The goal is to build an ETL pipeline that produces an analytics database for deriving correlations and trends across these sources.
One use case for this analytics database is to find immigration patterns to the US.
For instance, one could examine whether the average temperature of a migrant's country of residence correlates with their choice of US state, or which countries account for most of the immigrants.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [4]:
# Do all imports and installs here
! pip install -U numpy
! pip install missingno

Collecting numpy
  Downloading https://files.pythonhosted.org/packages/45/b2/6c7545bb7a38754d63048c7696804a0d947328125d81bf12beaa692c3ae3/numpy-1.19.5-cp36-cp36m-manylinux1_x86_64.whl (13.4MB)
    100% |████████████████████████████████| 13.4MB 3.6MB/s eta 0:00:01
tensorflow 1.3.0 requires tensorflow-tensorboard<0.2.0,>=0.1.0, which is not installed.
Installing collected packages: numpy
  Found existing installation: numpy 1.12.1
    Uninstalling numpy-1.12.1:
      Successfully uninstalled numpy-1.12.1
Successfully installed numpy-1.19.5
Collecting missingno
  Downloading https://files.pythonhosted.org/packages/87/22/cd5cf999af21c2f97486622c551ac3d07361ced8125121e907f588ff5f24/missingno-0.5.2-py3-none-any.whl
Installing collected packages: missingno
Successfully installed missingno-0.5.2


## Import Required Libraries

In [5]:
# Do all imports and installs here
import pandas as pd
import matplotlib.pyplot as plt
import os
import configparser
import datetime as dt
from pyspark.sql.functions import isnan, when, count, col, udf, dayofmonth, dayofweek, month, year, weekofyear, avg, monotonically_increasing_id
from pyspark.sql.types import *
import requests
requests.packages.urllib3.disable_warnings()
from pyspark.sql.functions import year, month, dayofmonth, weekofyear, date_format
from pyspark.sql import SparkSession, SQLContext, GroupedData, HiveContext
from pyspark.sql.functions import *
from pyspark.sql.functions import date_add as d_add
from pyspark.sql.types import DoubleType, StringType, IntegerType, FloatType
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql import Row
import datetime, time
import utilHelper as util
#import create_tables as ct

### Create a Spark Session

In [2]:
def create_spark_session():
    """Create (or reuse) a Spark session with the hadoop-aws package for S3 access."""
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
        .config("spark.python.worker.memory", "15g") \
        .getOrCreate()
    return spark

spark = create_spark_session()

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

In this project I will gather the immigration data provided in the project resources together with data on global temperatures, airports, and US city demographics.
I will load this data into staging dataframes, clean the raw data, and write it back to CSV files.
I will then run the ETL process on a Spark cluster: the clean files will be loaded and transformed into a star schema of fact and dimension tables, which will be written to parquet files.
Analysts can then use the star schema for data analytics, correlation studies, and ad-hoc reporting.

#### Describe and Gather Data 
Describe the data sets you're using. Where did they come from? What type of information is included?

##### 1. Global Temperatures Data

In [3]:
gtemp_df = pd.read_csv('../../data2/GlobalLandTemperaturesByCity.csv')
gtemp_df.head()

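index |dt         |AverageTemperature |AverageTemperatureUncertainty |City  |Country |Latitude |Longitude
:-----|:----------|:------------------|:-----------------------------|:-----|:-------|:--------|:--------
0     |1743-11-01 |6.068              |1.737                         |Århus |Denmark |57.05N   |10.33E
1     |1743-12-01 |NaN                |NaN                           |Århus |Denmark |57.05N   |10.33E
2     |1744-01-01 |NaN                |NaN                           |Århus |Denmark |57.05N   |10.33E
3     |1744-02-01 |NaN                |NaN                           |Århus |Denmark |57.05N   |10.33E
4     |1744-03-01 |NaN                |NaN                           |Århus |Denmark |57.05N   |10.33E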


In [4]:
gtemp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8599212 entries, 0 to 8599211
Data columns (total 7 columns):
dt                               object
AverageTemperature               float64
AverageTemperatureUncertainty    float64
City                             object
Country                          object
Latitude                         object
Longitude                        object
dtypes: float64(2), object(5)
memory usage: 459.2+ MB


##### Data Dictionary

Feature                       |Description
:-----------------------------|:-----------
dt                            |Date
AverageTemperature            |Average temperature in celsius
AverageTemperatureUncertainty |95% confidence interval around average temperature
City                          |Name of city
Country                       |Name of country
Latitude                      |Latitude of city
Longitude                     |Longitude of city
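
The `Latitude` and `Longitude` columns are stored as strings with a hemisphere suffix (for example `57.05N`), so they cannot be used numerically as loaded. The following is a minimal sketch of one way to convert them to signed floats; the helper name `parse_coordinate` is hypothetical and not part of the notebook, and it assumes the suffix is always a single letter from N, S, E, W.

```python
def parse_coordinate(value):
    """Convert a string such as '57.05N' or '10.33E' to a signed float.

    South and West become negative; anything unparseable returns None.
    """
    if not isinstance(value, str) or len(value) < 2:
        return None
    number, hemisphere = value[:-1], value[-1].upper()
    if hemisphere not in "NSEW":
        return None
    try:
        magnitude = float(number)
    except ValueError:
        return None
    return -magnitude if hemisphere in "SW" else magnitude


# Values taken from the first row of the preview above (Århus, Denmark)
latitude = parse_coordinate("57.05N")
longitude = parse_coordinate("10.33E")
print(latitude, longitude)
```

With pandas, a helper like this could be applied per column via `Series.map`.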

##### 2. US Cities Demographics Data

In [5]:
us_demographics = 'us-cities-demographics.csv'
demo_df = spark.read.csv(us_demographics, inferSchema=True, header=True, sep=';')

In [6]:
demo_df.limit(10).toPandas()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129,49500,93629,4147,32935,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040,46799,84839,4819,8229,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127,87105,175232,5821,33878,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040,143873,281913,5829,86253,2.73,NJ,White,76402
5,Peoria,Illinois,33.1,56229,62432,118661,6634,7517,2.4,IL,American Indian and Alaska Native,1343
6,Avondale,Arizona,29.1,38712,41971,80683,4815,8355,3.18,AZ,Black or African-American,11592
7,West Covina,California,39.8,51629,56860,108489,3800,37038,3.56,CA,Asian,32716
8,O'Fallon,Missouri,36.0,41762,43270,85032,5783,3269,2.77,MO,Hispanic or Latino,2583
9,High Point,North Carolina,35.5,51751,58077,109828,5204,16315,2.65,NC,Asian,11060


In [7]:
demo_df.limit(10).toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
City                      10 non-null object
State                     10 non-null object
Median Age                10 non-null float64
Male Population           10 non-null int32
Female Population         10 non-null int32
Total Population          10 non-null int32
Number of Veterans        10 non-null int32
Foreign-born              10 non-null int32
Average Household Size    10 non-null float64
State Code                10 non-null object
Race                      10 non-null object
Count                     10 non-null int32
dtypes: float64(2), int32(6), object(4)
memory usage: 800.0+ bytes


##### Data Dictionary

Feature                       |Description
:-----------------------------|:-----------
City |City Name
State |US State
Median Age |Median population age
Male Population |Male population
Female Population |Female population
Total Population |Total population
Number of Veterans |Number of veterans living in the city
Foreign-born |Number of residents born outside the United States
Average Household Size |Average number of people per household
State Code |Code of the state
Race |Race class
Count |Number of residents of the given race in the city
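
Because the population columns overlap in meaning, they allow a simple consistency check: the male and female counts should add up to the total. The sketch below runs that check in plain Python on three rows copied from the preview above; the function name `population_mismatches` is hypothetical, and the check assumes the three columns are complete for every row.

```python
# Three rows copied from the preview above: (city, male, female, total)
rows = [
    ("Silver Spring", 40601, 41862, 82463),
    ("Quincy", 44129, 49500, 93629),
    ("Hoover", 38040, 46799, 84839),
]


def population_mismatches(records):
    """Return the cities whose male + female counts differ from the total."""
    return [city for city, male, female, total in records
            if male + female != total]


mismatches = population_mismatches(rows)
print(mismatches)
```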

##### 3. Airport Codes Data

In [8]:
airport_codes_file = 'airport-codes_csv.csv'
air_df = pd.read_csv(airport_codes_file)

In [9]:
pd.set_option('display.max_colwidth',100) 

In [10]:
air_df.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"



##### Data Dictionary

Feature       |Description
:-------------|:-----------
ident         |Unique identifier
type          |Airport type
name          |Airport name
elevation_ft  |Airport elevation in feet
continent     |Continent
iso_country   |ISO Code of the airport's country
iso_region    |ISO Code for the airport's region
municipality  |City/Municipality where the airport is located
gps_code      |Airport GPS Code
iata_code     |Airport IATA Code
local_code    |Airport local code
coordinates   |Airport coordinates as a "longitude, latitude" string
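
Two of these columns pack more than one value into a string: `coordinates` holds two numbers, and `iso_region` prefixes the region with the country code. The sketch below shows one way to split them. The helper names are hypothetical, and the longitude-first order is inferred from the sample rows (Bensalem, PA lies near 40°N, 75°W), so it is worth verifying against the source before relying on it.

```python
def split_coordinates(coordinates):
    """Split a 'longitude, latitude' string into two floats."""
    longitude, latitude = (float(part) for part in coordinates.split(","))
    return longitude, latitude


def region_to_state(iso_region):
    """Return the part after the country prefix, e.g. 'US-PA' -> 'PA'."""
    _country, _, region = iso_region.partition("-")
    return region or None


# Values copied from the first row of the preview above
lon, lat = split_coordinates("-74.93360137939453, 40.07080078125")
state = region_to_state("US-PA")
print(lon, lat, state)
```

The extracted state code could then serve as a join key against the `State Code` column of the demographics data.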

##### 4. Immigration Data

In [11]:
immigration_sample_data = 'immigration_data_sample.csv'
immigration_sample_df = pd.read_csv(immigration_sample_data)

In [12]:
print(len(immigration_sample_df.index))

1000


In [13]:
immigration_i94_df = pd.read_sas('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat', 'sas7bdat', encoding="ISO-8859-1")
immigration_i94_df.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


In [14]:
print(len(immigration_i94_df.index))

3096313


In [15]:
immigration_i94_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3096313 entries, 0 to 3096312
Data columns (total 28 columns):
cicid       float64
i94yr       float64
i94mon      float64
i94cit      float64
i94res      float64
i94port     object
arrdate     float64
i94mode     float64
i94addr     object
depdate     float64
i94bir      float64
i94visa     float64
count       float64
dtadfile    object
visapost    object
occup       object
entdepa     object
entdepd     object
entdepu     object
matflag     object
biryear     float64
dtaddto     object
gender      object
insnum      object
airline     object
admnum      float64
fltno       object
visatype    object
dtypes: float64(13), object(15)
memory usage: 661.4+ MB


##### Data Dictionary

Feature  |Description
:--------|:-----------
cicid    |Unique ID
i94yr    |year
i94mon   |month
i94cit   |3 digit code for immigrant country of citizenship
i94res   |3 digit code for immigrant country of residence
i94port  |Port of admission
arrdate  |Arrival Date in the USA (SAS numeric date: days since 1960-01-01)
i94mode  |Mode of transportation
i94addr  |USA State of arrival
depdate  |Departure Date from the USA (SAS numeric date: days since 1960-01-01)
i94bir   |Age of Respondent in Years
i94visa  |Visa codes collapsed into three categories
count    |Field used for summary statistics
dtadfile |Character Date Field - Date added to I-94 Files
visapost |Department of State post where the visa was issued
occup    |Occupation that will be performed in U.S
entdepa  |Arrival Flag - admitted or paroled into the U.S.
entdepd  |Departure Flag - Departed, lost I-94 or is deceased
entdepu  |Update Flag - Either apprehended, overstayed, adjusted to perm residence
matflag  |Match flag - Match of arrival and departure records
biryear  |year of birth
dtaddto  |Date to which admitted to U.S.
gender   |Non-immigrant sex
insnum   |INS number
airline  |Airline used to arrive in U.S.
admnum   |Admission Number
fltno    |Flight number of Airline used to arrive in U.S.
visatype |Class of admission legally admitting the non-immigrant to temporarily stay in U.S.
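
The `arrdate` and `depdate` values in the preview (for example `20545.0`) are SAS numeric dates, which count days since 1 January 1960. A minimal sketch of the conversion to calendar dates follows; the function name `sas_to_date` is hypothetical, and missing values are assumed to arrive as `None` or NaN.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS counts dates as days since this day


def sas_to_date(days):
    """Convert a SAS numeric date to datetime.date; missing values give None."""
    if days is None or days != days:  # NaN is the only value unequal to itself
        return None
    return SAS_EPOCH + timedelta(days=int(days))


print(sas_to_date(20545.0))  # arrdate from the preview above -> 2016-04-01
print(sas_to_date(20691.0))  # depdate from the same row -> 2016-08-25
```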

#### Extract Tables from I94_SAS_Labels_Description

In [16]:
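def extract_table(start, end, sep):
    """Parse lines [start, end) of the SAS labels file into a Code/Name DataFrame.

    Each line is expected to look like "code<sep>name"; quotes and tabs are removed.
    """
    keys = []
    values = []
    with open('I94_SAS_Labels_Descriptions.SAS', mode='r') as file:
        for i, line in enumerate(file):
            if i < start:
                continue
            if i == end:
                break
            # Drop quotes, tabs and surrounding whitespace before splitting
            entry = line.replace("'", "").replace("\t", "").strip()
            key, value = entry.split(sep, 1)
            keys.append(key)
            values.append(value)
    # The last entry of a SAS value list carries the terminating semicolon
    values[-1] = values[-1].replace(" ;", "")
    df = pd.DataFrame({'Code': keys, 'Name': values})
    return df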
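
To make the expected line format concrete, the sketch below applies the same cleaning steps to a small in-memory excerpt instead of the file. The sample lines are an assumption modelled on the extracted output; the real file's spacing may differ. `parse_label_line` is a hypothetical helper, not part of the notebook.

```python
# Assumed excerpt in the style of the labels file; the real spacing may differ
sample_lines = [
    "\t1 = 'Air'\n",
    "\t2 = 'Sea'\n",
    "\t3 = 'Land'\n",
    "\t9 = 'Not reported' ;\n",
]


def parse_label_line(line, sep):
    """Split one "code<sep>'name'" line into a (code, name) pair."""
    cleaned = line.replace("'", "").replace("\t", "").strip()
    code, name = cleaned.split(sep, 1)
    # The last line of a value list carries the closing semicolon
    return code.strip(), name.replace(";", "").strip()


pairs = [parse_label_line(line, " = ") for line in sample_lines]
print(pairs)
```

Note that the separator must match the file's spacing exactly, which is why the calls in this notebook pass slightly different `sep` strings for each table.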

##### 5. Extracting Country Codes


In [17]:
country_code_df= extract_table(9,298,' =  ')
country_code_df.to_csv('Data/extracted_country_codes.csv', index=False)
country_code_df

index |Code |Name
:-----|:----|:----
0     |582  |MEXICO Air Sea, and Not Reported (I-94, no land arrivals)
1     |236  |AFGHANISTAN
2     |101  |ALBANIA
3     |316  |ALGERIA
4     |102  |ANDORRA
5     |324  |ANGOLA
6     |529  |ANGUILLA
7     |518  |ANTIGUA-BARBUDA
8     |687  |ARGENTINA
9     |151  |ARMENIA



##### Data Dictionary

Feature       |Description
:-------------|:-----------
Code          |Country Code
Name          |Country name

##### 6. Extracting Modes

In [18]:
modes_df = extract_table(972,976,' = ')
modes_df.to_csv('Data/Modes.csv', index=False)
modes_df

index |Code |Name
:-----|:----|:----
0     |1    |Air
1     |2    |Sea
2     |3    |Land
3     |9    |Not reported



##### Data Dictionary

Feature       |Description
:-------------|:-----------
Code          |Code
Name          |Mode name

##### 7. Extracting Visa Codes

In [19]:
visa_df = extract_table(1046, 1049,'= ')
visa_df.to_csv('Data/visa_codes.csv', index=False)
visa_df

index |Code |Name
:-----|:----|:----
0     |1    |Business
1     |2    |Pleasure
2     |3    |Student



##### Data Dictionary

Feature       |Description
:-------------|:-----------
Code          |Code
Name          |Visa Category
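
One detail to watch when using these lookup tables: the I-94 frame stores `i94mode` and `i94visa` as `float64` (for example `1.0`), while `extract_table` returns the `Code` column as strings. The sketch below shows one way to bridge the two in plain Python. The `decode` helper and the `"Unknown"` fallback label are assumptions for illustration, not part of the notebook.

```python
# Lookup tables copied from the extracted Modes and Visa Codes output above
MODES = {"1": "Air", "2": "Sea", "3": "Land", "9": "Not reported"}
VISAS = {"1": "Business", "2": "Pleasure", "3": "Student"}


def decode(value, lookup, default="Unknown"):
    """Map a float-typed code such as 1.0 to its label via string keys."""
    if value is None or value != value:  # None or NaN: no code recorded
        return default
    return lookup.get(str(int(value)), default)


print(decode(1.0, MODES))            # Air
print(decode(float("nan"), MODES))   # Unknown
print(decode(2.0, VISAS))            # Pleasure
```

In Spark, the same effect could be achieved by casting the code columns to a common type before joining.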

In [None]:
# from pyspark.sql import SparkSession

#spark = SparkSession.builder.\
#config("spark.jars.repositories", "https://repos.spark-packages.org/").\
#config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
#enableHiveSupport().getOrCreate()

#df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


In [11]:
#write to parquet
#df_spark.write.parquet("sas_data")
#df_spark=spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

 - Remove rows having null values for id
 - Check if primary keys are unique
 - Check we don't have any duplicate rows
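
The three checks above can be sketched in plain Python as follows. The records are hypothetical and only mimic the shape of the immigration data, with `cicid` playing the role of the id column; the function names are illustrative as well.

```python
from collections import Counter

# Hypothetical records for illustration; "cicid" plays the role of the id column
records = [
    {"cicid": 6.0, "i94port": "XXX"},
    {"cicid": 7.0, "i94port": "ATL"},
    {"cicid": None, "i94port": "NYC"},   # missing id
    {"cicid": 7.0, "i94port": "ATL"},    # exact duplicate of the second row
]


def drop_null_ids(rows, key):
    """Keep only rows whose key column is present."""
    return [row for row in rows if row.get(key) is not None]


def duplicate_keys(rows, key):
    """Return key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def drop_duplicate_rows(rows):
    """Remove rows identical in every column, keeping first occurrences."""
    seen, unique = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


with_ids = drop_null_ids(records, "cicid")
print(duplicate_keys(with_ids, "cicid"))
deduplicated = drop_duplicate_rows(with_ids)
print(len(deduplicated))
```

On the full data set the same steps would more likely be expressed with DataFrame operations such as pandas `dropna`, `duplicated`, and `drop_duplicates`.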

#### Cleaning Steps
Document steps necessary to clean the data

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
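
As a rough sketch of what two of these checks could look like, the functions below compare source and target row counts and verify column types on a list of rows. All names are hypothetical; `3096313` is the April 2016 row count printed earlier, and the sample rows are invented to trigger one type problem.

```python
def check_row_counts(source_count, target_count, table):
    """Completeness check: report an empty or incomplete target table."""
    if target_count == 0:
        return f"{table}: target is empty"
    if target_count != source_count:
        return f"{table}: expected {source_count} rows, found {target_count}"
    return None


def check_column_types(rows, expected_types):
    """Integrity check: every non-null value has the expected Python type."""
    problems = []
    for index, row in enumerate(rows):
        for column, expected in expected_types.items():
            value = row.get(column)
            if value is not None and not isinstance(value, expected):
                problems.append((index, column, type(value).__name__))
    return problems


# 3096313 is the row count printed earlier; the rows below are hypothetical
count_issue = check_row_counts(3096313, 3096313, "immigration")
type_issues = check_column_types(
    [{"cicid": 6, "i94port": "XXX"}, {"cicid": "7", "i94port": "ATL"}],
    {"cicid": int, "i94port": str},
)
print(count_issue, type_issues)
```

If the pipeline deliberately drops rows during cleaning, the expected target count would need to account for that rather than match the raw source count.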
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.