# Project Title
## Data Engineering Capstone Project

### Project Summary
The purpose of this project is to create an ETL pipeline that wrangles data on immigration, demographics and environmental factors (specifically temperature) to glean insights into which factors may make certain US destinations popular for international visits. Example questions that business users may want answered include:
* Is there a correlation between a visitor's home temperature and the temperature of their chosen destination?
* Do people from certain countries prefer destination cities where certain ethnicities are better represented?
* Are particular destinations more popular with holders of visas of a certain type?

We'll be using Spark to process the data.

The project is broken down into 5 steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Clean Up the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Let's get ready by importing the libraries and tools we'll be using...

import pandas as pd
import helpers
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date
from pyspark.sql.types import StringType

# ...and initializing our SparkSession

spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

## Step 1: Scoping the Project and Gathering Data

### Scope 
We have various data on immigration, travel, temperature and demographics, from various sources. The business need is to be able to query the data to see what factors may affect immigration or US travel destination choices. More specifically, the goal is to create a set of relational tables against which we can run queries to look for patterns between chosen immigration or travel destination, and demographic and environmental factors.

To accomplish this, Spark will be used, as it is well suited to processing large amounts of data across multiple servers. If we wanted to take the project further, we could migrate the output data into a suitable RDBMS, like Redshift, and we could automate the pipeline with Airflow; this, however, is outside the scope of this particular project. The primary goal and scope of the task here is to explore the data, model it and load it into relational tables suitable for analytic querying.


### Describing and Gathering the Data 
The source data are the following:

* _I94 Immigration Data_: This data comes from the [US National Travel and Tourism Office](https://www.trade.gov/i-94-arrivals-program). It consists of the 2016 I-94 visitor arrivals data, providing information on arrivals for US visitors who stay 1 night or more. The link to the original dataset is defunct, but a copy of the data was captured and provided by Udacity.
* _World Temperature Data_: [This dataset](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data) was sourced from Kaggle. It consists of a table of global land temperatures by city and by month, starting from approximately 1750.
* _U.S. City Demographic Data_: [This data](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/) comes from OpenDataSoft. The data is compiled from the US Census Bureau's 2015 American Community Survey. It contains demographic information for all US cities that have a population of 65,000 or more.
* _Airport Code Table_: [This](https://datahub.io/core/airport-codes#data) is a table of airport codes and corresponding cities.

To create the relational tables that will allow us to explore demographic and environmental correlations, we will only need data from the first three sets. We will not be using the _Airport Code Table_ dataset.

## Step 2: Exploring and Cleaning Up the Data
Let's identify data quality issues, like missing values, duplicate data, etc.

Some of this data was explored "offline", so to speak; information was gathered by visiting the original sources for the datasets or via supplemental contextual information. The i94 dataset, particularly, benefited from this; the definitions of the columns were understood primarily by looking at the `I94_SAS_Labels_Descriptions.SAS` file located in the same directory as this README file.

### Exploration
#### Immigration Data

In [None]:
# Read in the i94 data
i94_df = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

In [None]:
# Get a preview
i94_df.limit(5).toPandas()

In [None]:
# Show the total of null or NaN values per column
helpers.show_total_missing_values(i94_df)

#### Demographic Data

In [None]:
# Read in the demographic data
demog_df = spark.read.csv('us-cities-demographics.csv', inferSchema=True, header=True, sep=';')

In [None]:
# Get a preview
demog_df.limit(5).toPandas()

In [None]:
# Show the total of null or NaN values per column
helpers.show_total_missing_values(demog_df)

#### Temperature Data

In [2]:
# Read in the temperature data
temp_df = spark.read.csv('../../data2/GlobalLandTemperaturesByCity.csv', header=True, inferSchema=True)

In [None]:
# Get a preview
temp_df.limit(5).toPandas()

In [3]:
# Show the total of null or NaN values per column...
# ...but first we cast the datetime column to string, or our helper function will expect a double or float
temp_df_string = temp_df.withColumn("dt",col("dt").cast(StringType()))
helpers.show_total_missing_values(temp_df_string)

| | column name | total missing | % missing |
|---|---|---|---|
| 0 | dt | 0 | 0.0 |
| 1 | AverageTemperature | 364130 | 4.234458 |
| 2 | AverageTemperatureUncertainty | 364130 | 4.234458 |
| 3 | City | 0 | 0.0 |
| 4 | Country | 0 | 0.0 |
| 5 | Latitude | 0 | 0.0 |
| 6 | Longitude | 0 | 0.0 |
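The `helpers` module is not shown in this README, so the following is only a plain-Python sketch of the kind of summary `show_total_missing_values` appears to produce, judging by the column names in its output. The function name `summarize_missing` and the sample rows are made up for illustration; the real helper works on Spark dataframes and may differ.

```python
import math

def summarize_missing(rows, columns):
    """Count None/NaN values per column and express them as a percentage of rows."""
    total_rows = len(rows)
    summary = []
    for name in columns:
        missing = 0
        for row in rows:
            value = row.get(name)
            # Treat both None and float NaN as missing
            if value is None or (isinstance(value, float) and math.isnan(value)):
                missing += 1
        pct = 100.0 * missing / total_rows if total_rows else 0.0
        summary.append({"column name": name, "total missing": missing, "% missing": pct})
    return summary

# Made-up sample rows, for illustration only
sample = [
    {"dt": "2013-01-01", "AverageTemperature": 1.5},
    {"dt": "2013-02-01", "AverageTemperature": None},
    {"dt": "2013-03-01", "AverageTemperature": float("nan")},
    {"dt": "2013-04-01", "AverageTemperature": 9.0},
]
for line in summarize_missing(sample, ["dt", "AverageTemperature"]):
    print(line)
```

Counting `None` and `NaN` together matters here because a numeric column can hold either, depending on how the file was read.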


### Clean Up

#### Immigration Data
If we look again at the table of total missing values per column in the _Exploration_ section, we can see that three columns have a total of more than 85% missing values: `occup`, `entdepu`, and `insnum`. These can be dropped.

Taking this a step further, as we'll see shortly in our data model, there are only 11 columns we're interested in keeping. Let's select these to make the dataframe easier to work with and more legible to display. While we're at it, let's rename those columns and drop duplicates.

In [None]:
i94_df = i94_df.select(col('cicid').alias('id'),
                             col('i94cit').alias('citizenship_country'),
                             col('i94res').alias('residence_country'),
                             col('i94port').alias('port_of_entry'),
                             col('i94addr').alias('destination_state'),
                             col('arrdate').alias('arrival_date'),
                             col('depdate').alias('departure_date'),
                             col('i94bir').alias('age'),
                             col('i94visa').alias('visa_type'),
                             col('dtaddto').alias('admitted_until'),
                             col('gender').alias('gender')
                            ).dropDuplicates().cache()

We'll also convert the SAS dates in `arrdate` and `depdate`, and the string date in `dtaddto`, to PySpark dates.

In [None]:
i94_df = i94_df.withColumn('arrival_date', helpers.convert_sas_date_to_datetime(i94_df.arrival_date)).cache()
i94_df = i94_df.withColumn('departure_date', helpers.convert_sas_date_to_datetime(i94_df.departure_date)).cache()
i94_df = i94_df.withColumn('admitted_until', to_date(col('admitted_until'),'MMddyyyy')).cache()
# i94_df.limit(5).toPandas()
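For reference, SAS stores a date as the number of days since 1 January 1960. The helper `convert_sas_date_to_datetime` is not shown here, so the sketch below only illustrates the arithmetic in plain Python, together with the `MMddyyyy` parsing applied to `admitted_until`. The function names and the sample values (including the non-date string `'D/S'`) are assumptions made for illustration.

```python
from datetime import date, datetime, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS counts dates as days since this day

def sas_days_to_date(days):
    """Convert a SAS day count (int or float) to a date; None stays None."""
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))

def parse_mmddyyyy(text):
    """Parse an 'MMddyyyy' string such as '10292016'; return None if it does not parse."""
    try:
        return datetime.strptime(text, "%m%d%Y").date()
    except (TypeError, ValueError):
        return None

print(sas_days_to_date(20545.0))   # 2016-04-01
print(parse_mmddyyyy("10292016"))  # 2016-10-29
print(parse_mmddyyyy("D/S"))       # None
```

Returning `None` for values that do not parse keeps bad dates visible as missing values instead of failing the whole conversion.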

We can see that the fields `citizenship_country`, `residence_country`, `port_of_entry`, and `destination_state` are stored as codes. These need to be cross-referenced with a list of codes that is currently only available in text form. The temperature and demographic data both include city fields, the demographic dataframe includes a state field, and the temperature dataframe includes a country field, all written out in full. We can therefore parse the text file of codes and replace the coded values in the i94 dataframe with their full names, which will make it easier to join tables when querying.

In [None]:
with open('I94_SAS_Labels_Descriptions.SAS') as f:
    i94_desc = f.readlines()

country_codes = {}
for countries in i94_desc[10:298]:
    pair = countries.split('=')
    code, country = pair[0].strip(), pair[1].strip().strip("'")
    country_codes[code] = country
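The cell above relies on fixed line positions (`i94_desc[10:298]`), which would break if the labels file were ever edited. As a possible alternative, the sketch below picks out `code = 'label'` pairs by pattern instead. The function name `parse_label_pairs` and the sample lines are hypothetical, and the pattern would match pairs from every section of the file, so in practice it would still need to be applied to one section at a time.

```python
import re

# Matches lines such as:   582 =  'MEXICO'   or   'AL'='ALABAMA'
PAIR_PATTERN = re.compile(r"^\s*'?([^'=\s]+)'?\s*=\s*'([^']*)'")

def parse_label_pairs(lines):
    """Return {code: label} for every line that looks like a code = 'label' pair."""
    pairs = {}
    for line in lines:
        match = PAIR_PATTERN.match(line)
        if match:
            code, label = match.groups()
            pairs[code] = label.strip()
    return pairs

# Hypothetical excerpt, formatted like the country section of the labels file
sample_lines = [
    "/* I94CIT & I94RES - country codes */\n",
    "value i94cntyl\n",
    "   582 =  'MEXICO'\n",
    "   236 =  'AFGHANISTAN'\n",
    ";\n",
]
print(parse_label_pairs(sample_lines))
```

Lines that do not look like a pair (comments, the `value` statement, the closing `;`) are skipped rather than raising an error.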

In [None]:
countries_df = spark.createDataFrame(list(country_codes.items()), ['code', 'country'])

In [None]:
countries_df.show(5)

In [None]:
i94_df.createOrReplaceTempView('i94_view')
i94_new_df = spark.sql('SELECT * FROM i94_view')

# update citizenship_country
i94_df2 = countries_df.join(i94_new_df, (countries_df.code == i94_new_df.citizenship_country), 'right_outer')\
.select(col('id'),
        col('port_of_entry'),
        col('residence_country'),
        col('destination_state'),
        col('arrival_date'),
        col('departure_date'),
        col('age'),
        col('visa_type'),
        col('admitted_until'),
        col('gender'),
        col('country'),
       ).withColumn('citizenship_country',col('country'))\
.drop('country').cache()

# update residence_country
i94_df2.createOrReplaceTempView('i94_view')
i94_new_df2 = spark.sql('SELECT * FROM i94_view')
i94_df3 = countries_df.join(i94_new_df2, (countries_df.code == i94_new_df2.residence_country), 'right_outer')\
.select(col('id'),
        col('port_of_entry'),
        col('citizenship_country'),
        col('destination_state'),
        col('arrival_date'),
        col('departure_date'),
        col('age'),
        col('visa_type'),
        col('admitted_until'),
        col('gender'),
        col('country'),
       ).withColumn('residence_country',col('country'))\
.drop('country').cache()

In [None]:
i94_df3.show(5)
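The right outer joins above keep every i94 row and leave the new value null wherever a code has no match in the lookup table. The plain-Python sketch below mirrors that behavior on a handful of made-up rows and also collects the unmatched codes, which can be a useful sanity check. It is not the Spark code itself; `normalize_code`, `replace_codes` and the sample data are hypothetical, and the sketch assumes numeric codes arrive as floats such as `582.0`.

```python
def normalize_code(value):
    """Render a numeric code such as 582.0 as the string '582'; pass strings through."""
    if value is None:
        return None
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)

def replace_codes(rows, column, lookup):
    """Return copies of rows with `column` replaced by its label, or None if unmatched."""
    updated = []
    unmatched = set()
    for row in rows:
        code = normalize_code(row.get(column))
        label = lookup.get(code)
        if code is not None and label is None:
            unmatched.add(code)
        # Every input row is kept, as in a right outer join onto the i94 data
        updated.append({**row, column: label})
    return updated, unmatched

# Made-up lookup and rows, for illustration only
lookup = {"582": "MEXICO", "236": "AFGHANISTAN"}
rows = [
    {"id": 1, "citizenship_country": 582.0},
    {"id": 2, "citizenship_country": 999.0},
    {"id": 3, "citizenship_country": None},
]
updated, unmatched = replace_codes(rows, "citizenship_country", lookup)
print(updated)
print(unmatched)
```

A large set of unmatched codes would suggest that the lookup table is incomplete or that the codes need normalizing before the join.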

In [None]:
port_cities = {}
port_states = {}
broken_fields = {}
for cities in i94_desc[303:893]:   
    pair = cities.split('=')
    code, location = pair[0].strip("\t").strip().strip("'"), pair[1].strip('\t').strip()
    city_and_state = location.split(',')
    if len(city_and_state) == 2:
        city, state = city_and_state[0].strip().strip("'"), city_and_state[1].strip().strip("'").strip()
        port_cities[code] = city
        port_states[code] = state
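The cell above silently skips any port entry that does not split into exactly two parts, and `broken_fields` is declared but never filled. The sketch below shows one way those entries could be collected for inspection. It differs from the cell above in that it splits on the last comma, so a city name containing a comma is kept instead of skipped. The function name `split_port_label` and the sample entries are hypothetical.

```python
def split_port_label(label):
    """Split 'CITY, ST' into (city, state); return None when there is no comma."""
    cleaned = label.strip().strip("'").strip()
    # rpartition splits on the last comma, so a city containing a comma keeps it
    city, comma, state = cleaned.rpartition(",")
    if not comma:
        return None
    return city.strip(), state.strip()

# Hypothetical entries, formatted like the port section of the labels file
sample_ports = {
    "ALC": "'ALCAN, AK             '",
    "XXX": "'NOT REPORTED/UNKNOWN  '",
}

port_cities, port_states, broken_fields = {}, {}, {}
for code, label in sample_ports.items():
    parsed = split_port_label(label)
    if parsed is None:
        broken_fields[code] = label  # keep for later inspection
    else:
        port_cities[code], port_states[code] = parsed

print(port_cities, port_states, broken_fields)
```

Reviewing `broken_fields` afterwards shows how many ports of entry will end up without a city or state after the join.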

In [None]:
port_cities_df = spark.createDataFrame(list(port_cities.items()), ['code', 'city'])
port_states_df = spark.createDataFrame(list(port_states.items()), ['code','state'])

In [None]:
# extract port_city from port_of_entry
i94_df3.createOrReplaceTempView('i94_view')
i94_new_df3 = spark.sql('SELECT * FROM i94_view')
i94_df4 = port_cities_df.join(i94_new_df3, (port_cities_df.code == i94_new_df3.port_of_entry), 'right_outer')\
.select(col('id'),
        col('port_of_entry'),
        col('citizenship_country'),
        col('residence_country'),
        col('destination_state'),
        col('arrival_date'),
        col('departure_date'),
        col('age'),
        col('visa_type'),
        col('admitted_until'),
        col('gender'),
        col('city'),
       ).withColumn('port_city',col('city'))\
.drop('city').cache()

# extract port_state from port_of_entry
i94_df4.createOrReplaceTempView('i94_view')
i94_new_df4 = spark.sql('SELECT * FROM i94_view')
i94_df5 = port_states_df.join(i94_new_df4, (port_states_df.code == i94_new_df4.port_of_entry), 'right_outer')\
.select(col('id'),
        col('port_of_entry'),
        col('citizenship_country'),
        col('residence_country'),
        col('destination_state'),
        col('arrival_date'),
        col('departure_date'),
        col('age'),
        col('visa_type'),
        col('admitted_until'),
        col('gender'),
        col('state'),
        col('port_city')
       ).withColumn('port_state',col('state'))\
.drop('port_of_entry').drop('state').cache()

In [None]:
i94_df5.show(5)

In [None]:
# get state names by code
state_codes = {}
for states in i94_desc[982:1035]:   
    pair = states.split('=')
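    # strip tabs, padding and the surrounding single quotes from both the code and the state name
    code, state = pair[0].strip("\t").strip().strip("'"), pair[1].strip('\t').strip().strip("'").strip()
    # expand the abbreviations N., S., W. and DIST. so the names are written out in full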
    if 'N.' in state:
        state = state.replace('N.','NORTH')
    if 'S.' in state:
        state = state.replace('S.','SOUTH')
    if 'W.' in state:
        state = state.replace('W.','WEST')
    if 'DIST.' in state:
        state = state.replace('DIST.','DISTRICT')
    state_codes[code] = state
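Plain `str.replace` substitutes an abbreviation wherever it occurs, including in the middle of another token. A slightly stricter variant, sketched below, only expands an abbreviation when it starts a word. The names `ABBREVIATIONS` and `expand_state_name` and the sample inputs are made up for illustration, including the `'U.S. TERRITORY'` case, which is there only to show a string that should be left alone.

```python
import re

# Abbreviations handled by the cell above, mapped to their full words
ABBREVIATIONS = {"N.": "NORTH", "S.": "SOUTH", "W.": "WEST", "DIST.": "DISTRICT"}

# Match an abbreviation only at the start of the name or after whitespace
_ABBREV_PATTERN = re.compile(r"(?<!\S)(DIST\.|N\.|S\.|W\.)")

def expand_state_name(raw):
    """Strip quotes and padding, then expand abbreviations that start a word."""
    name = raw.strip().strip("'").strip()
    return _ABBREV_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], name)

print(expand_state_name("'N. DAKOTA'"))
print(expand_state_name("'DIST. OF COLUMBIA'"))
print(expand_state_name("'U.S. TERRITORY'"))
```

With plain replacement, the last example would have its `S.` expanded as well; the lookbehind prevents that.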

In [None]:
state_codes_df = spark.createDataFrame(list(state_codes.items()), ['code', 'state'])

In [None]:
state_codes_df.show(state_codes_df.count())

In [None]:
# convert destination_state to full state name
i94_df5.createOrReplaceTempView('i94_view')
i94_new_df5 = spark.sql('SELECT * FROM i94_view')
i94_df6 = state_codes_df.join(i94_new_df5, (state_codes_df.code == i94_new_df5.destination_state), 'right_outer')\
.select(col('id'),
        col('citizenship_country'),
        col('residence_country'),
        col('arrival_date'),
        col('departure_date'),
        col('age'),
        col('visa_type'),
        col('admitted_until'),
        col('gender'),
        col('port_state'),
        col('port_city'),
        col('state')
       ).withColumn('destination_state',col('state'))\
.drop('state').cache()

In [None]:
i94_df6.groupBy('destination_state').count().show(100)

In [None]:
# convert port_state to full state name
i94_df6.createOrReplaceTempView('i94_view')
i94_new_df6 = spark.sql('SELECT * FROM i94_view')
i94_df7 = state_codes_df.join(i94_new_df6, (state_codes_df.code == i94_new_df6.port_state), 'right_outer')\
.select(col('id'),
        col('citizenship_country'),
        col('residence_country'),
        col('arrival_date'),
        col('departure_date'),
        col('age'),
        col('visa_type'),
        col('admitted_until'),
        col('gender'),
        col('destination_state'),
        col('port_city'),
        col('state')
       ).withColumn('port_state',col('state'))\
.drop('state').cache()

In [None]:
i94_df7.groupBy('port_state').count().show(100)

In [None]:
i94_df7.show(5)

## Step 3: Define the Data Model
### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

## Step 4: Run Pipelines to Model the Data 
### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
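
As a rough illustration of the first and last items in this list, the sketch below implements a unique-key check and a source/count check in plain Python. The function names and the sample rows are hypothetical; in this project the same checks would more likely be run against the Spark dataframes, and the count rule assumes the only row-reducing step is deduplication.

```python
def check_unique_key(rows, key):
    """Fail if any key value is missing or repeated; return the number of keys."""
    seen = set()
    for row in rows:
        value = row.get(key)
        if value is None:
            raise ValueError(f"missing {key!r} in row {row}")
        if value in seen:
            raise ValueError(f"duplicate {key!r}: {value}")
        seen.add(value)
    return len(seen)

def check_row_count(source_count, target_count):
    """Fail if the target is empty or has more rows than the source."""
    if target_count == 0:
        raise ValueError("target table is empty")
    if target_count > source_count:
        raise ValueError(f"target has {target_count} rows, source only {source_count}")
    return True

# Made-up rows, for illustration only
clean_rows = [{"id": 1}, {"id": 2}, {"id": 3}]
print(check_unique_key(clean_rows, "id"))
print(check_row_count(source_count=4, target_count=len(clean_rows)))
```

Raising an exception on failure, instead of printing a warning, makes it harder for a bad pipeline run to go unnoticed.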
 
Run Quality Checks

In [None]:
# Perform quality checks here

### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

## Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.