# Project Title
## Data Engineering Capstone Project

### Project Summary
The purpose of this project is to create an ETL pipeline that wrangles data on immigration, demographics and environmental factors (specifically temperature), in order to glean insights about which factors may make certain US destinations popular for international visits. Example questions that business users may want answered include:
* Is there a correlation between a visitor's home temperature and the temperature of their chosen destination?
* Do people from certain countries prefer destination cities where certain ethnicities are better represented?
* Are particular destinations more popular with holders of visas of a certain type?

We'll be using Spark to process the data.

The project is broken down into 5 steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Clean Up the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Let's get ready by importing the libraries and tools we'll be using...

import pandas as pd
import helpers
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, upper, first
from pyspark.sql.types import StringType

# ...and initializing our SparkSession

spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

## Step 1: Scoping the Project and Gathering Data

### Scope 
We have various data on immigration, travel, temperature and demographics, from various sources. The business need is to be able to query the data to see what factors may affect immigration or US travel destination choices. More specifically, the goal is to create a set of relational tables against which we can run queries to look for patterns between chosen immigration or travel destination, and demographic and environmental factors.

To accomplish this, Spark will be used, as it is well suited to processing large amounts of data across multiple servers. If we wanted to take the project further, we could migrate the output data into a suitable RDBMS, like Redshift, and we could automate the pipeline with Airflow; this, however, is outside the scope of this particular project. The primary goal and scope of the task here is to explore the data, model it and load it into relational tables suitable for analytic querying.


### Describing and Gathering the Data 
The source data are the following:

* _I94 Immigration Data_: This data comes from the [US National Travel and Tourism Office](https://www.trade.gov/i-94-arrivals-program). It consists of the 2016 I-94 visitor arrivals data, providing information on arrivals for US visitors who stay 1 night or more. The link to the original dataset is defunct, but a copy of the data was captured and provided by Udacity.
* _World Temperature Data_: [This dataset](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data) was sourced from Kaggle. It consists of a table of global land temperatures by city and by month, going back as far as 1743.
* _U.S. City Demographic Data_: [This data](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/) comes from OpenDataSoft. The data is compiled from the US Census Bureau's 2015 American Community Survey. It contains demographic information for all US cities that have a population of 65,000 or more.
* _Airport Code Table_: [This](https://datahub.io/core/airport-codes#data) is a table of airport codes and corresponding cities.

To create the relational tables that will allow us to explore demographic and environmental correlations, we will only need data from the first three sets. We will not be using the _Airport Code Table_ dataset.

## Step 2: Exploring and Cleaning Up the Data
Let's identify data quality issues, like missing values, duplicate data, etc.

Some of this data was explored "offline", so to speak: information was gathered by visiting the original sources for the datasets or from supplemental contextual information. The i94 dataset particularly benefited from this; the column definitions were understood primarily by reading the `I94_SAS_Labels_Descriptions.SAS` file located in the same directory as this README file.

### Exploration
#### Immigration Data

In [2]:
# Read in the i94 data
i94_df = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

In [None]:
# Get a preview
i94_df.limit(5).toPandas()

In [None]:
# Show the total of null or NaN values per column
helpers.show_total_missing_values(i94_df)

#### Demographic Data

In [2]:
# Read in the demographic data
demog_df = spark.read.csv('us-cities-demographics.csv', inferSchema=True, header=True, sep=';')

In [4]:
# Get a preview
demog_df.limit(5).toPandas()

| | City | State | Median Age | Male Population | Female Population | Total Population | Number of Veterans | Foreign-born | Average Household Size | State Code | Race | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Silver Spring | Maryland | 33.8 | 40601 | 41862 | 82463 | 1562 | 30908 | 2.6 | MD | Hispanic or Latino | 25924 |
| 1 | Quincy | Massachusetts | 41.0 | 44129 | 49500 | 93629 | 4147 | 32935 | 2.39 | MA | White | 58723 |
| 2 | Hoover | Alabama | 38.5 | 38040 | 46799 | 84839 | 4819 | 8229 | 2.58 | AL | Asian | 4759 |
| 3 | Rancho Cucamonga | California | 34.5 | 88127 | 87105 | 175232 | 5821 | 33878 | 3.18 | CA | Black or African-American | 24437 |
| 4 | Newark | New Jersey | 34.6 | 138040 | 143873 | 281913 | 5829 | 86253 | 2.73 | NJ | White | 76402 |


In [20]:
# Show the total of null or NaN values per column
helpers.show_total_missing_values(demog_df)

| | column name | total missing | % missing |
|---|---|---|---|
| 0 | City | 0 | 0.0 |
| 1 | State | 0 | 0.0 |
| 2 | Median Age | 0 | 0.0 |
| 3 | Male Population | 3 | 0.10377 |
| 4 | Female Population | 3 | 0.10377 |
| 5 | Total Population | 0 | 0.0 |
| 6 | Number of Veterans | 13 | 0.449671 |
| 7 | Foreign-born | 13 | 0.449671 |
| 8 | Average Household Size | 16 | 0.553442 |
| 9 | State Code | 0 | 0.0 |
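
The `helpers.show_total_missing_values` function isn't shown in this README. As a rough illustration of the kind of per-column count it reports, here's a plain-Python sketch. The function name `count_missing`, the list-of-dicts input and the sample rows are all assumptions made up for this example; the real helper works on a Spark dataframe and may be implemented quite differently.

```python
import math


def count_missing(rows, columns):
    """Return (column, total_missing, pct_missing) for each column.

    A value counts as missing when it is None or a float NaN.
    (Hypothetical stand-in for helpers.show_total_missing_values.)
    """
    total = len(rows)
    summary = []
    for column in columns:
        missing = 0
        for row in rows:
            value = row.get(column)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                missing += 1
        # Percentage of rows where this column is missing
        pct = 100.0 * missing / total if total else 0.0
        summary.append((column, missing, pct))
    return summary


# Tiny made-up sample, not taken from the real dataset
sample = [
    {"City": "A", "Male Population": 100},
    {"City": "B", "Male Population": None},
    {"City": "C", "Male Population": float("nan")},
    {"City": "D", "Male Population": 250},
]
report = count_missing(sample, ["City", "Male Population"])
```

Treating both `None` and `NaN` as missing mirrors the "null or NaN" wording used in the cell comments.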


#### Temperature Data

In [4]:
# Read in the temperature data
temp_df = spark.read.csv('../../data2/GlobalLandTemperaturesByCity.csv', header=True, inferSchema=True)

In [3]:
# Get a preview
temp_df.limit(5).toPandas()

| | dt | AverageTemperature | AverageTemperatureUncertainty | City | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 1743-11-01 | 6.068 | 1.737 | Århus | Denmark | 57.05N | 10.33E |
| 1 | 1743-12-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 2 | 1744-01-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 3 | 1744-02-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 4 | 1744-03-01 | | | Århus | Denmark | 57.05N | 10.33E |


In [6]:
# Show the total of null or NaN values per column, but first we cast the datetime column to string
temp_df_string = temp_df.withColumn("dt",col("dt").cast(StringType()))
helpers.show_total_missing_values(temp_df_string)

| | column name | total missing | % missing |
|---|---|---|---|
| 0 | dt | 0 | 0.0 |
| 1 | AverageTemperature | 0 | 0.0 |
| 2 | AverageTemperatureUncertainty | 0 | 0.0 |
| 3 | City | 0 | 0.0 |
| 4 | Country | 0 | 0.0 |
| 5 | Latitude | 0 | 0.0 |
| 6 | Longitude | 0 | 0.0 |


### Clean Up

#### Immigration Data
If we look again at the table of total missing values per column in the _Exploration_ section, we can see that three columns have a total of more than 85% missing values: `occup`, `entdepu`, and `insnum`. These can be dropped.

Taking this a step further, as we'll see shortly in our data model, there are only 11 columns we're interested in keeping. Let's select these to make the dataframe easier to work with and more legible to display. While we're at it, let's rename those columns and drop duplicates.

In [3]:
i94_df = i94_df.select(col('cicid').alias('id'),
                             col('i94cit').alias('citizenship_country'),
                             col('i94res').alias('residence_country'),
                             col('i94port').alias('port_of_entry'),
                             col('i94addr').alias('destination_state'),
                             col('arrdate').alias('arrival_date'),
                             col('depdate').alias('departure_date'),
                             col('i94bir').alias('age'),
                             col('i94visa').alias('visa_type'),
                             col('dtaddto').alias('admitted_until'),
                             col('gender').alias('gender')
                            ).dropDuplicates().cache()

We'll also convert the SAS dates in `arrdate` and `depdate`, and the string date in `dtaddto`, to PySpark dates.

In [4]:
i94_df = i94_df.withColumn('arrival_date', helpers.convert_sas_date_to_datetime(i94_df.arrival_date)).cache()
i94_df = i94_df.withColumn('departure_date', helpers.convert_sas_date_to_datetime(i94_df.departure_date)).cache()
i94_df = i94_df.withColumn('admitted_until', to_date(col('admitted_until'),'MMddyyyy')).cache()
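
For context, SAS stores dates as a count of days since 1 January 1960, which is why `arrdate` and `depdate` need converting. The body of `helpers.convert_sas_date_to_datetime` isn't shown here, so the following is only a minimal plain-Python sketch of the underlying arithmetic; the name `sas_days_to_date` is hypothetical, and the real helper operates on Spark columns.

```python
from datetime import date, timedelta

# SAS date values count days from this epoch
SAS_EPOCH = date(1960, 1, 1)


def sas_days_to_date(sas_days):
    """Convert a SAS day count (possibly a float) to a date; pass None through."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


# 20545 days after 1960-01-01 falls on the first day of April 2016
arrival = sas_days_to_date(20545.0)
```

Passing `None` through untouched matters here because `depdate` can be empty for visitors who haven't left yet.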

In [None]:
i94_df.limit(5).toPandas()

We can see that the fields `citizenship_country`, `residence_country`, `port_of_entry`, and `destination_state` are referenced by codes. These need to be cross-referenced with a list of codes that is currently only available in text form. Since the temperature and demographic data have city, state and country fields written out in full, we can parse the text file of codes and update the values in the i94 dataframe to also be written out in full. This will make querying easier by reducing the number of joins and tables involved.

In [None]:
# Read our i94 fields and values legend 
with open('I94_SAS_Labels_Descriptions.SAS') as f:
    i94_desc = f.readlines()

In [5]:
# get country names by code
country_codes = {}
for countries in i94_desc[10:298]:
    pair = countries.split('=')
    code, country = pair[0].strip(), pair[1].strip().strip("'")
    country_codes[code] = country
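
To make the parsing step easier to follow in isolation, here's a small self-contained sketch of the same idea: split each `code = 'label'` line on the first `=` and strip whitespace and quotes. The function `parse_label_line` and the sample lines are assumptions written for illustration (the codes are the ones shown in the `countries_df` preview); the exact spacing in the real SAS file may differ.

```python
def parse_label_line(line):
    """Split one "code = 'label'" line into a (code, label) tuple."""
    # Split on the first '=' only, in case a label itself contains '='
    code, label = line.split("=", 1)
    code = code.strip().strip("'")
    # Drop surrounding whitespace, an optional trailing ';', then the quotes
    label = label.strip().strip(";").strip().strip("'").strip()
    return code, label


# Illustrative lines only; not copied from the real file
sample_lines = [
    "   236 =  'AFGHANISTAN'\n",
    "   101 =  'ALBANIA'\n",
    "   316 =  'ALGERIA' ;\n",
]
lookup = dict(parse_label_line(line) for line in sample_lines)
```

Note that the codes come out as strings, so it's worth checking that they compare as expected against the numeric code columns in the i94 data.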

In [6]:
countries_df = spark.createDataFrame(list(country_codes.items()), ['code', 'country'])

In [7]:
countries_df.show(5)

+----+-----------+
|code|    country|
+----+-----------+
| 236|AFGHANISTAN|
| 101|    ALBANIA|
| 316|    ALGERIA|
| 102|    ANDORRA|
| 324|     ANGOLA|
+----+-----------+
only showing top 5 rows



In [8]:
# get city names and state abbreviations by port-of-entry code
port_cities = {}
port_states = {}
broken_fields = {}
for cities in i94_desc[303:893]:   
    pair = cities.split('=')
    code, location = pair[0].strip("\t").strip().strip("'"), pair[1].strip('\t').strip()
    city_and_state = location.split(',')
    if len(city_and_state) == 2:
        city, state = city_and_state[0].strip().strip("'"), city_and_state[1].strip().strip("'").strip()
        port_cities[code] = city
        port_states[code] = state
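
One thing to be aware of in the loop above: entries that don't split into exactly two parts are skipped without being recorded. As a hedged sketch of an alternative, the plain-Python example below keeps track of the entries it couldn't parse so they can be reviewed later. The function `split_port_location` and the sample codes and locations are invented for this illustration.

```python
def split_port_location(location):
    """Split "CITY, ST" into (city, state); return None if there is no comma."""
    if "," not in location:
        return None
    # Split on the last comma so city names containing commas stay intact
    city, state = location.rsplit(",", 1)
    return city.strip(), state.strip()


# Made-up examples covering the three shapes we might encounter
samples = {
    "AAA": "SOMEWHERE, XX   ",
    "BBB": "TWO, PART NAME, YY",
    "CCC": "NO STATE LISTED",
}
ports = {}
unparsed = {}
for code, location in samples.items():
    parsed = split_port_location(location)
    if parsed is None:
        unparsed[code] = location  # keep for manual review
    else:
        ports[code] = parsed
```

Whether splitting on the last comma is the right call depends on the actual contents of the labels file, so treat this as a starting point.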

In [9]:
port_cities_df = spark.createDataFrame(list(port_cities.items()), ['code', 'city'])
port_states_df = spark.createDataFrame(list(port_states.items()), ['code','state'])

In [None]:
port_cities_df.show(5)

In [None]:
port_states_df.show(5)

In [25]:
# get state names by state abbreviation
state_codes = {}
for states in i94_desc[982:1035]:   
    pair = states.split('=')
    code, state = pair[0].strip("\t").strip().strip("'"), pair[1].strip('\t').strip().strip("'")
    if 'N.' in state:
        state = state.replace('N.','NORTH')
    if 'S.' in state:
        state = state.replace('S.','SOUTH')
    if 'W.' in state:
        state = state.replace('W.','WEST')
    if 'DIST.' in state:
        state = state.replace('DIST.','DISTRICT')
    state_codes[code] = state

In [26]:
state_codes_df = spark.createDataFrame(list(state_codes.items()), ['code', 'state'])

In [None]:
state_codes_df.show(5)

In [31]:
# create views from the dictionary dataframes we just created, 
# so we can use sql to replace the codes in our i94 table with their more human-usable values
i94_df.createOrReplaceTempView('i94_view')
countries_df.createOrReplaceTempView('countries_view')
port_cities_df.createOrReplaceTempView('port_cities_view')
port_states_df.createOrReplaceTempView('port_states_view')
state_codes_df.createOrReplaceTempView('state_codes_view')

+---------+------------+--------------+----+---------+--------------+------+-------------------+-----------------+----------------+----------+-----------------+
|       id|arrival_date|departure_date| age|visa_type|admitted_until|gender|citizenship_country|residence_country|       port_city|port_state|destination_state|
+---------+------------+--------------+----+---------+--------------+------+-------------------+-----------------+----------------+----------+-----------------+
|4651287.0|  2016-04-24|    2016-04-29|45.0|      2.0|    2016-10-24|     F|               null|             null|NEWARK/TETERBORO|NEW JERSEY|             null|
|2014736.0|  2016-04-11|    2016-04-15|38.0|      2.0|    2016-07-09|  null|               null|          GERMANY|         HOUSTON|     TEXAS|             null|
|2093085.0|  2016-04-11|    2016-04-15|46.0|      1.0|    2016-10-11|     M|               null|             null|         HOUSTON|     TEXAS|             null|
| 998502.0|  2016-04-06|    2016-0

In [None]:
# clean up our i94 table by converting codes to human-readable values
i94_clean_df = spark.sql("""
    SELECT 
        i94.id,
        i94.arrival_date AS arrival_date,
        i94.departure_date AS departure_date,
        i94.age AS age,
        i94.visa_type AS visa_type,
        i94.admitted_until AS admitted_until,
        i94.gender AS gender,
        c_cit.country AS citizenship_country,
        c_res.country AS residence_country,
        pc.city AS port_city,
        sc_port.state AS port_state,
        sc.state AS destination_state
    FROM i94_view i94
    
    -- retrieve the country names of the code values in i94.citizenship_country
    LEFT JOIN countries_view c_cit
    ON i94.citizenship_country = c_cit.code
    
    -- retrieve the country names of the code values in i94.residence_country
    LEFT JOIN countries_view c_res
    ON i94.residence_country = c_res.code
    
    -- retrieve the city names of the code values in i94.port_of_entry
    LEFT JOIN port_cities_view pc
    ON i94.port_of_entry = pc.code
    
    -- retrieve the state abbreviations of the port codes in i94.port_of_entry...
    -- ...then the full state names of the state abbreviations
    LEFT JOIN port_states_view ps
    ON i94.port_of_entry = ps.code
    LEFT JOIN state_codes_view sc_port
    ON ps.state = sc_port.code 
    
    -- retrieve the full state names of the state abbreviations in i94.destination_state
    LEFT JOIN state_codes_view sc
    ON i94.destination_state = sc.code
    
""")
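
A note on why the query uses `LEFT JOIN` throughout: rows whose code has no match in a lookup table are kept, with `null` in the looked-up column, instead of being dropped. The sketch below shows that behavior on a toy example using Python's built-in `sqlite3` module in place of Spark SQL; the table names and the unmatched code `999` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE arrivals (id INTEGER, residence_country TEXT);
    CREATE TABLE countries (code TEXT, country TEXT);
    INSERT INTO arrivals VALUES (1, '101'), (2, '236'), (3, '999');
    INSERT INTO countries VALUES ('101', 'ALBANIA'), ('236', 'AFGHANISTAN');
""")

# LEFT JOIN keeps arrival 3 even though code '999' has no country entry
rows = conn.execute("""
    SELECT a.id, c.country
    FROM arrivals a
    LEFT JOIN countries c ON a.residence_country = c.code
    ORDER BY a.id
""").fetchall()
conn.close()
```

Here `rows` holds one tuple per arrival, with `None` as the country for the unmatched code. An inner join would have silently dropped that arrival, which would skew any counts computed later.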

In [None]:
i94_clean_df.show(5)

In [32]:
# check the distribution of our destination_state values. Make sure we didn't accidentally end up with all nulls (for example)
i94_clean_df.groupBy('destination_state').count().show(100)

+--------------------+------+
|   destination_state| count|
+--------------------+------+
|          NEW JERSEY| 76531|
|        PENNSYLVANIA| 30293|
|            ILLINOIS| 82126|
|DISTRICT OF COLUMBIA| 28228|
|            MARYLAND| 25360|
|       WEST VIRGINIA|   808|
|               IDAHO|  1752|
|            MISSOURI|  8484|
|             MONTANA|  1339|
|            MICHIGAN| 32101|
|             FLORIDA|621701|
|                null|187354|
|              OREGON| 12574|
|        SOUTH DAKOTA|   557|
|           LOUISIANA| 22655|
|              ALASKA|  1604|
|         PUERTO RICO|  9474|
|      VIRGIN ISLANDS|   226|
|               MAINE|  2361|
|       NEW HAMPSHIRE|  2817|
|            OKLAHOMA|  3239|
|            VIRGINIA| 31399|
|          WASHINGTON| 55792|
|      NORTH CAROLINA| 23375|
|             WYOMING|   460|
|               TEXAS|134321|
|            NEBRASKA| 26574|
|           MINNESOTA| 11194|
|              HAWAII|168764|
|                GUAM| 94107|
|        R

In [None]:
# check the distribution of port_state values. Make sure we didn't accidentally end up with all nulls (for example)
i94_clean_df.groupBy('port_state').count().show(100)

#### Demographic Data
We'll start by renaming the columns and uppercasing the city and state names, to have consistent data formatting across our different datasets.

In [3]:
demog_df = demog_df.select(upper(col('City')).alias('city'),
                             upper(col('State')).alias('state'),
                             col('Median Age').alias('median_age'),
                             col('Male Population').alias('male_population'),
                             col('Female Population').alias('female_population'),
                             col('Total Population').alias('total_population'),
                             col('Number of Veterans').alias('total_veterans'),
                             col('Foreign-born').alias('foreign_born'),
                             col('Average Household Size').alias('average_household_size'),
                             col('Race').alias('race'),
                             col('Count').alias('race_count')
                            ).dropDuplicates().cache()

In [6]:
demog_df.show(5)

+-----------+----------+----------+---------------+-----------------+----------------+--------------+------------+----------------------+--------------------+----------+
|       city|     state|median_age|male_population|female_population|total_population|total_veterans|foreign_born|average_household_size|                race|race_count|
+-----------+----------+----------+---------------+-----------------+----------------+--------------+------------+----------------------+--------------------+----------+
|TALLAHASSEE|   FLORIDA|      26.2|          89390|           100504|          189894|          9575|       16720|                  2.38|  Hispanic or Latino|     13538|
|  VANCOUVER|WASHINGTON|      37.2|          82958|            89895|          172853|         12391|       21748|                  2.49|  Hispanic or Latino|     23184|
|  MILWAUKEE| WISCONSIN|      31.6|         286315|           313839|          600154|         20615|       58321|                  2.51|American Indi

While we may want denormalized data in our fact and dimension tables, we want to make sure that the denormalizations make sense for our purposes. If we look closely, we can observe that the `Race` column in the demographic data has a small number of unnecessarily repeating values. The `Count` column refers to the total number of residents who identify with the associated race in a given row. To simplify our table and our querying, we can simply pivot the `Race` column, meaning that we would take its five possible values and make each of them a single column. The values in the `Count` column would become the values for the associated `Race` column.
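
To make the pivot concrete, here's a plain-Python sketch of the reshaping on a few rows, with the counts taken from the previews in this README. The function `pivot_race_counts` is a hypothetical illustration only: it groups on city and state alone for brevity, and unlike Spark's pivot it simply omits a race key that has no row instead of filling it with `null`.

```python
def pivot_race_counts(rows):
    """Turn one-row-per-(city, race) records into one wide record per city."""
    wide = {}
    for row in rows:
        key = (row["city"], row["state"])
        record = wide.setdefault(key, {"city": row["city"], "state": row["state"]})
        # Keep the first count seen for each race, similar in spirit to first()
        record.setdefault(row["race"], row["race_count"])
    return list(wide.values())


long_rows = [
    {"city": "COLUMBIA", "state": "SOUTH CAROLINA", "race": "Asian", "race_count": 3501},
    {"city": "COLUMBIA", "state": "SOUTH CAROLINA", "race": "White", "race_count": 73232},
    {"city": "GARLAND", "state": "TEXAS", "race": "Asian", "race_count": 27217},
]
wide_rows = pivot_race_counts(long_rows)
```

Each city ends up with a single record, which is what lets us join the demographic table to the others on city and state without multiplying rows.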

In [4]:
# Turn the Race column into five distinct race columns
demog_df = demog_df.groupBy("city",
                            "state",
                            "median_age",
                            "male_population",
                            "female_population",
                            "total_population",
                            "total_veterans", 
                            "foreign_born", 
                            "average_household_size"
                           ).pivot("race").agg(first("race_count"))
demog_df.show(5)

+-----------+--------------+----------+---------------+-----------------+----------------+--------------+------------+----------------------+---------------------------------+-----+-------------------------+------------------+------+
|       city|         state|median_age|male_population|female_population|total_population|total_veterans|foreign_born|average_household_size|American Indian and Alaska Native|Asian|Black or African-American|Hispanic or Latino| White|
+-----------+--------------+----------+---------------+-----------------+----------------+--------------+------------+----------------------+---------------------------------+-----+-------------------------+------------------+------+
|   COLUMBIA|SOUTH CAROLINA|      28.0|          67686|            65707|          133393|          5708|        6074|                  2.32|                             1420| 3501|                    56398|              7545| 73232|
|BLOOMINGTON|     MINNESOTA|      40.9|          43318|         

In [5]:
# Tidy the ethnicity column names
demog_df = demog_df.select(col('city'),
                           col('state'),
                           col('median_age'),
                           col('male_population'),
                           col('female_population'),
                           col('total_population'),
                           col('total_veterans'),
                           col('foreign_born'),
                           col('average_household_size'),
                           col('American Indian and Alaska Native').alias('indigenous'),
                           col('Asian').alias('asian'),
                           col('Black or African-American').alias('black'),
                           col('Hispanic or Latino').alias('latinx'),
                           col('White').alias('white'),
                          ).dropDuplicates().cache()

| | city | state | median_age | male_population | female_population | total_population | total_veterans | foreign_born | average_household_size | indigenous | asian | black | latinx | white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | COLUMBIA | SOUTH CAROLINA | 28.0 | 67686 | 65707 | 133393 | 5708 | 6074 | 2.32 | 1420 | 3501 | 56398 | 7545 | 73232 |
| 1 | BLOOMINGTON | MINNESOTA | 40.9 | 43318 | 43118 | 86436 | 6176 | 10728 | 2.3 | 1745 | 4689 | 5828 | 8021 | 71874 |
| 2 | UNION CITY | NEW JERSEY | 35.4 | 35376 | 33773 | 69149 | 705 | 40553 | 2.85 | 545 | 4044 | 4686 | 53174 | 50031 |
| 3 | GARLAND | TEXAS | 34.5 | 116406 | 120430 | 236836 | 10407 | 62975 | 3.12 | 3083 | 27217 | 40507 | 90989 | 154484 |
| 4 | SAN ANGELO | TEXAS | 32.8 | 49669 | 50744 | 100413 | 8412 | 7859 | 2.5 | 548 | 1658 | 6079 | 42494 | 86655 |
| 5 | CITRUS HEIGHTS | CALIFORNIA | 37.8 | 41982 | 45071 | 87053 | 6171 | 9466 | 2.49 | 1514 | 3615 | 5979 | 14661 | 77525 |
| 6 | GILBERT | ARIZONA | 33.2 | 116711 | 130812 | 247523 | 10817 | 24531 | 3.2 | 5965 | 19740 | 9076 | 39937 | 211322 |
| 7 | APPLE VALLEY | CALIFORNIA | 34.3 | 32873 | 39312 | 72185 | 5714 | 5801 | 3.03 | 1446 | 2281 | 9124 | 25928 | 60767 |
| 8 | DALY CITY | CALIFORNIA | 39.7 | 53817 | 52757 | 106574 | 3782 | 56640 | 3.26 | 893 | 61918 | 5401 | 24316 | 25522 |
| 9 | CARLSBAD | CALIFORNIA | 42.1 | 55119 | 58347 | 113466 | 6031 | 17689 | 2.68 | 1513 | 11948 | 876 | 12969 | 98705 |


In [None]:
demog_df.count()

In [None]:
demog_df.limit(5).toPandas()

#### Temperature Data

We saw in the preview during our data exploration that there are rows with a missing `AverageTemperature`. These rows are useless for our purposes, so we will drop them.

In [5]:
# Drop rows where AverageTemperature is null or NaN
temp_df = temp_df.dropna(subset=['AverageTemperature'])

We will also see shortly in our data model that the columns `AverageTemperatureUncertainty`, `Latitude` and `Longitude` are not relevant to our inquiry, so we will drop them and take the opportunity to rename our remaining columns, as well as to uppercase the city and country values.

In [6]:
temp_df = temp_df.select(col('dt').alias('date'),
                             col('AverageTemperature').alias('average_temperature'),
                             upper(col('City')).alias('city'),
                             upper(col('Country')).alias('country')
                            ).dropDuplicates().cache()

In [7]:
temp_df.limit(5).toPandas()

| | date | average_temperature | city | country |
|---|---|---|---|---|
| 0 | 1772-10-01 | 10.108 | ÅRHUS | DENMARK |
| 1 | 1775-07-01 | 18.487 | ÅRHUS | DENMARK |
| 2 | 1786-09-01 | 11.781 | ÅRHUS | DENMARK |
| 3 | 1828-03-01 | 2.404 | ÅRHUS | DENMARK |
| 4 | 1850-09-01 | 11.837 | ÅRHUS | DENMARK |


Additionally, our i94 dataset is specific to the year 2016. Our inquiry concerns itself with the factors that may have influenced visitors to choose the destinations they chose. The temperature dataset contains data from as far back as 1743. For our context, we will assume that people would not concern themselves with average temperatures more than 10 years old when choosing where to go in the short term. Hence, we will drop historical data from before 2006.

In [16]:
# Drop pre-2006 data, and make sure the date field doesn't get a midnight timestamp added (useless for our purposes)
temp_df = temp_df.filter(temp_df.date >= '2006-01-01').withColumn("date", to_date(col("date"), "yyyy-MM-dd"))
# temp_df.count()

323360

In [17]:
temp_df.show(5)

+----------+-------------------+------+-------+
|      date|average_temperature|  city|country|
+----------+-------------------+------+-------+
|2008-10-01|              9.587| Århus|Denmark|
|2009-07-01|             23.904|Ürümqi|  China|
|2010-01-01|             -13.03|Ürümqi|  China|
|2010-07-01|             18.175|Abakan| Russia|
|2010-10-01|              2.843|Abakan| Russia|
+----------+-------------------+------+-------+
only showing top 5 rows



## Step 3: Define the Data Model
### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

## Step 4: Run Pipelines to Model the Data 
### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
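
As a rough illustration of what the count and unique-key checks listed above could look like, here's a minimal plain-Python sketch. The function names `check_not_empty` and `check_unique_key`, the list-of-dicts input and the sample rows are all assumptions for this example; in the actual pipeline these checks would more likely run against the Spark dataframes.

```python
def check_not_empty(rows, table_name):
    """Completeness check: fail if the table has no rows; return the row count."""
    if len(rows) == 0:
        raise ValueError(f"{table_name}: table is empty")
    return len(rows)


def check_unique_key(rows, key, table_name):
    """Integrity check: fail on a null or repeated value in the key column."""
    seen = set()
    for row in rows:
        value = row[key]
        if value is None:
            raise ValueError(f"{table_name}: null value in key column '{key}'")
        if value in seen:
            raise ValueError(f"{table_name}: duplicate key {value!r} in '{key}'")
        seen.add(value)
    return True


# Made-up rows standing in for a fact table
fact_rows = [{"id": 1, "visa_type": 2}, {"id": 2, "visa_type": 1}]
row_count = check_not_empty(fact_rows, "immigration")
keys_ok = check_unique_key(fact_rows, "id", "immigration")
```

Raising an exception on failure means a broken load stops the pipeline immediately instead of quietly producing misleading query results.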
 
Run Quality Checks

In [None]:
# Perform quality checks here

### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

## Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.