# Data Engineering Capstone Project

#### Project Summary
This project builds a data warehouse for analysis by integrating data from several sources: immigration data, temperature data and demographic data.
 
The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [4]:
# Do all imports and installs here
import pandas as pd
import pyspark
from datetime import datetime

## Step 1: Scope the Project and Gather Data
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

### Scope
This project integrates the I94 Immigration Dataset, the U.S. City Demographic Dataset and the World Temperature Dataset to build a data warehouse made up of fact and dimension tables. The descriptions contained in the I94_SAS_Labels_Descriptions.SAS file are taken into account as well.

* Datasets used in the project:
    * I94 Immigration Dataset
    * U.S. City Demographic Dataset
    * World Temperature Dataset

* Tools used in the project:
    * AWS S3 for data storage on cloud.
    * Pandas for data analysis on small datasets and PySpark for data processing on large datasets.



### Describe and Gather Data 
Describe the data sets you're using. Where did they come from? What type of information is included?

#### Data Sets :

* [I94 Immigration Data](https://www.trade.gov/national-travel-and-tourism-office) : 
    This data comes from the US National Travel and Tourism Office. The dataset contains each traveler's immigration category, port of entry, date of entry into the United States and status expiration date, along with a unique 11-digit identifying number. Its purpose is to record the traveler's lawful admission to the United States. This is the main dataset; a file for each month of 2016 is available in the directory "../../data/18-83510-I94-Data-2016/" in the SAS binary storage format sas7bdat.
    
* [World Temperature Data](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data) :
    This dataset is from Kaggle and contains monthly average temperatures for cities around the world. The original Kaggle dataset includes several files, but this capstone project uses only "GlobalLandTemperaturesByCity.csv".

* [U.S. City Demographic Data](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/) :
    This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. This data comes from OpenDataSoft, and we will be using "us-cities-demographics.csv".

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

We use PySpark on the datasets to test the ETL pipeline logic, split the datasets into dimension tables and rename columns for clarity.

#### Explore immigration data set

The immigration dataset is the fact table, so it sits at the center of the star schema of the data lake.

In [5]:
immigration_fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
immigration_data = pd.read_sas(immigration_fname, 'sas7bdat', encoding="ISO-8859-1")

In [8]:
immigration_data.head(5)

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


#### Explore World temperature data set

In [9]:
temperature_fname = '../../data2/GlobalLandTemperaturesByCity.csv'
temperature_data = pd.read_csv(temperature_fname)

In [10]:
temperature_data.head(5)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


#### Explore demographics data set

In [11]:
demographics_data = pd.read_csv("us-cities-demographics.csv", sep=";")

In [12]:
demographics_data.head(5)

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


### Cleaning Steps
Document steps necessary to clean the data

#### Cleaning immigration_data

In [13]:
immigration_data = immigration_data[['cicid', 'i94yr', 'i94mon', 'i94port', 'i94addr', 'arrdate', 'depdate', 'i94mode', 'i94visa']]
immigration_data.columns = ['cic_id', 'year', 'month', 'city_code', 'state_code', 'arrival_date', 'departure_date', 'mode', 'visa']
immigration_data.head(5)

Unnamed: 0,cic_id,year,month,city_code,state_code,arrival_date,departure_date,mode,visa
0,6.0,2016.0,4.0,XXX,,20573.0,,,2.0
1,7.0,2016.0,4.0,ATL,AL,20551.0,,1.0,3.0
2,15.0,2016.0,4.0,WAS,MI,20545.0,20691.0,1.0,2.0
3,16.0,2016.0,4.0,NYC,MA,20545.0,20567.0,1.0,2.0
4,17.0,2016.0,4.0,NYC,MA,20545.0,20567.0,1.0,2.0


In [14]:
immigration_data['country'] = 'United States'
immigration_data.head(5)

Unnamed: 0,cic_id,year,month,city_code,state_code,arrival_date,departure_date,mode,visa,country
0,6.0,2016.0,4.0,XXX,,20573.0,,,2.0,United States
1,7.0,2016.0,4.0,ATL,AL,20551.0,,1.0,3.0,United States
2,15.0,2016.0,4.0,WAS,MI,20545.0,20691.0,1.0,2.0,United States
3,16.0,2016.0,4.0,NYC,MA,20545.0,20567.0,1.0,2.0,United States
4,17.0,2016.0,4.0,NYC,MA,20545.0,20567.0,1.0,2.0,United States


##### Transform arrival_date and departure_date from SAS time format to pandas datetime format

In [15]:
def SAS_to_datetime(date):
    return pd.to_timedelta(date, unit='D') + pd.Timestamp('1960-1-1')

In [16]:
immigration_data['arrive_date'] = SAS_to_datetime(immigration_data['arrival_date'])
immigration_data['departure_date'] = SAS_to_datetime(immigration_data['departure_date'])
immigration_data.head(5)

Unnamed: 0,cic_id,year,month,city_code,state_code,arrival_date,departure_date,mode,visa,country,arrive_date
0,6.0,2016.0,4.0,XXX,,20573.0,NaT,,2.0,United States,2016-04-29
1,7.0,2016.0,4.0,ATL,AL,20551.0,NaT,1.0,3.0,United States,2016-04-07
2,15.0,2016.0,4.0,WAS,MI,20545.0,2016-08-25,1.0,2.0,United States,2016-04-01
3,16.0,2016.0,4.0,NYC,MA,20545.0,2016-04-23,1.0,2.0,United States,2016-04-01
4,17.0,2016.0,4.0,NYC,MA,20545.0,2016-04-23,1.0,2.0,United States,2016-04-01
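
As a cross-check of the conversion above: SAS stores a date as the number of days since 1960-01-01, so the same arithmetic can be reproduced with the standard library alone. This is a minimal sketch; the helper name `sas_days_to_date` is hypothetical, and the sample values are taken from the table above.

```python
from datetime import date, timedelta
from typing import Optional

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this day


def sas_days_to_date(days: Optional[float]) -> Optional[date]:
    """Convert a SAS day count to a date; missing values stay missing."""
    if days is None or days != days:  # NaN is the only value not equal to itself
        return None
    return SAS_EPOCH + timedelta(days=int(days))


print(sas_days_to_date(20545.0))       # 2016-04-01, matches arrive_date of row 2
print(sas_days_to_date(20573.0))       # 2016-04-29, matches arrive_date of row 0
print(sas_days_to_date(float("nan")))  # None, the counterpart of NaT in pandas
```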


#### Cleaning temperature_data

In [17]:
temperature_data = temperature_data[temperature_data['Country'] == 'United States']
temperature_data = temperature_data[['dt', 'AverageTemperature', 'AverageTemperatureUncertainty', 'City', 'Country']]
temperature_data.columns = ['dt', 'avg_temp', 'avg_temp_uncertainty', 'city', 'country']
temperature_data.head(5)

Unnamed: 0,dt,avg_temp,avg_temp_uncertainty,city,country
47555,1820-01-01,2.101,3.217,Abilene,United States
47556,1820-02-01,6.926,2.853,Abilene,United States
47557,1820-03-01,10.767,2.395,Abilene,United States
47558,1820-04-01,17.989,2.202,Abilene,United States
47559,1820-05-01,21.809,2.036,Abilene,United States


In [18]:
temperature_data['dt'] = pd.to_datetime(temperature_data['dt'])
# The .dt accessor extracts date parts for the whole column at once
temperature_data['year'] = temperature_data['dt'].dt.year
temperature_data['month'] = temperature_data['dt'].dt.month
temperature_data.head()

Unnamed: 0,dt,avg_temp,avg_temp_uncertainty,city,country,year,month
47555,1820-01-01,2.101,3.217,Abilene,United States,1820,1
47556,1820-02-01,6.926,2.853,Abilene,United States,1820,2
47557,1820-03-01,10.767,2.395,Abilene,United States,1820,3
47558,1820-04-01,17.989,2.202,Abilene,United States,1820,4
47559,1820-05-01,21.809,2.036,Abilene,United States,1820,5


In [19]:
temperature_data_usa_2016 = temperature_data[temperature_data['year'] == 2016]
temperature_data_usa_2016.head(5)

Unnamed: 0,dt,avg_temp,avg_temp_uncertainty,city,country,year,month


##### This data set doesn't contain USA 2016 temperature data.
If temperatures for 2016 were available, we could join the two tables (immigration and temperature) to see how waves of immigration to the US relate to changes in temperature.

#### Cleaning demographics_data

In [20]:
demographics_population = demographics_data[['City', 'State', 'Male Population', 'Female Population', 'Number of Veterans', 'Foreign-born', 'Race']]
demographics_population.columns = ['city', 'state', 'male_pop', 'female_pop', 'num_vetarans', 'foreign_born', 'race']
demographics_population.head(5)

Unnamed: 0,city,state,male_pop,female_pop,num_vetarans,foreign_born,race
0,Silver Spring,Maryland,40601.0,41862.0,1562.0,30908.0,Hispanic or Latino
1,Quincy,Massachusetts,44129.0,49500.0,4147.0,32935.0,White
2,Hoover,Alabama,38040.0,46799.0,4819.0,8229.0,Asian
3,Rancho Cucamonga,California,88127.0,87105.0,5821.0,33878.0,Black or African-American
4,Newark,New Jersey,138040.0,143873.0,5829.0,86253.0,White


In [21]:
demographics_stats = demographics_data[['City', 'State', 'Median Age', 'Average Household Size']]
demographics_stats.columns = ['city', 'state', 'median_age', 'avg_household_size']
demographics_stats.head(5)

Unnamed: 0,city,state,median_age,avg_household_size
0,Silver Spring,Maryland,33.8,2.6
1,Quincy,Massachusetts,41.0,2.39
2,Hoover,Alabama,38.5,2.58
3,Rancho Cucamonga,California,34.5,3.18
4,Newark,New Jersey,34.6,2.73


Parse the description file to create the dimension tables country_code, city_code and state_code.

In [22]:
with open("I94_SAS_Labels_Descriptions.SAS") as f:
    contents = f.readlines()

In [23]:
country_code = {}
for countries in contents[10:298]:
    # Each line holds one "<code> = '<country>'" pair; avoid shadowing the built-in `set`
    parts = countries.split('=')
    code, country = parts[0].strip(), parts[1].strip().strip("'")
    country_code[code] = country

In [24]:
df_country_code = pd.DataFrame(list(country_code.items()), columns=['code', 'country'])
df_country_code.head(5)

Unnamed: 0,code,country
0,236,AFGHANISTAN
1,101,ALBANIA
2,316,ALGERIA
3,102,ANDORRA
4,324,ANGOLA


In [25]:
city_code = {}
for cities in contents[303:962]:
    # Each line holds one "'<port code>' = '<city, state>'" pair
    parts = cities.split('=')
    code, city = parts[0].strip("\t").strip().strip("'"), parts[1].strip('\t').strip().strip("'")
    city_code[code] = city

In [26]:
df_city_code = pd.DataFrame(list(city_code.items()), columns=['code', 'city'])
df_city_code.head(5)

Unnamed: 0,code,city
0,ANC,"ANCHORAGE, AK"
1,BAR,"BAKER AAF - BAKER ISLAND, AK"
2,DAC,"DALTONS CACHE, AK"
3,PIZ,"DEW STATION PT LAY DEW, AK"
4,DTH,"DUTCH HARBOR, AK"
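
The city labels above combine a city name and a state abbreviation, so matching them against the city column of another table needs the two parts separated. The following is a minimal sketch of one way to do that; the helper `split_port_label` is hypothetical and is not taken from **"etl.py"**, and the label without a comma is a made-up example.

```python
from typing import Optional, Tuple


def split_port_label(label: str) -> Tuple[str, Optional[str]]:
    """Split 'CITY, ST' into (city, state); labels without a comma keep state=None."""
    # rpartition splits on the LAST comma, so commas inside the city name survive
    city, sep, state = label.rpartition(",")
    if not sep:
        return label.strip(), None
    return city.strip(), state.strip()


print(split_port_label("ANCHORAGE, AK"))                 # ('ANCHORAGE', 'AK')
print(split_port_label("BAKER AAF - BAKER ISLAND, AK"))  # ('BAKER AAF - BAKER ISLAND', 'AK')
print(split_port_label("UNKNOWN PORT"))                  # ('UNKNOWN PORT', None)
```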


In [27]:
state_code = {}
for states in contents[982:1036]:
    # Each line holds one "'<state code>'='<state name>'" pair
    parts = states.split('=')
    code, state = parts[0].strip('\t').strip("'"), parts[1].strip().strip("'")
    state_code[code] = state

In [28]:
df_state_code = pd.DataFrame(list(state_code.items()), columns=['code', 'state'])
df_state_code.head(5)

Unnamed: 0,code,state
0,AK,ALASKA
1,AZ,ARIZONA
2,AR,ARKANSAS
3,CA,CALIFORNIA
4,CO,COLORADO
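
The three parsing cells above depend on hard-coded line ranges and on slightly different `strip` chains. As an alternative sketch (not the approach used in this notebook), a single regular expression can pick out every `code = 'label'` pair and skip anything else. The function name `parse_label_lines` and the sample lines are assumptions made for illustration; the real file layout may differ, and labels that contain an apostrophe would be cut short by this pattern.

```python
import re
from typing import Dict, Iterable

# Optional quote, code, optional quote, "=", then a single-quoted label
LABEL_RE = re.compile(r"^\s*'?(?P<code>[^'=\s]+)'?\s*=\s*'(?P<label>[^']*)'")


def parse_label_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map each code to its label, ignoring lines that are not code/label pairs."""
    mapping = {}
    for line in lines:
        match = LABEL_RE.match(line)
        if match:
            mapping[match.group("code")] = match.group("label").strip()
    return mapping


# Illustrative lines in the three styles the cells above handle
sample = [
    "/* I94PORT - hypothetical comment line */",
    "   236 =  'AFGHANISTAN'",
    "\t'ANC'\t=\t'ANCHORAGE, AK          '",
    "\t'AK'='ALASKA'",
]
print(parse_label_lines(sample))
```

The line ranges would still be needed to keep the country, city and state sections apart, for example `parse_label_lines(contents[10:298])`.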


Transform the city and state columns of the demographics dimension table to upper case to match the city_code and state_code tables.

In [29]:
demographics_stats['city'] = demographics_stats['city'].str.upper()
demographics_stats['state'] = demographics_stats['state'].str.upper()
demographics_stats.head(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Unnamed: 0,city,state,median_age,avg_household_size
0,SILVER SPRING,MARYLAND,33.8,2.6
1,QUINCY,MASSACHUSETTS,41.0,2.39
2,HOOVER,ALABAMA,38.5,2.58
3,RANCHO CUCAMONGA,CALIFORNIA,34.5,3.18
4,NEWARK,NEW JERSEY,34.6,2.73
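
The `SettingWithCopyWarning` printed above appears because `demographics_stats` was selected from `demographics_data`, so pandas cannot tell whether the assignment should also affect the original frame. A minimal sketch of one common way to avoid it is to take an explicit copy when selecting the columns. The two-row frame below is a stand-in built from rows shown earlier, not the full dataset.

```python
import pandas as pd

# Tiny stand-in for demographics_data, using two rows from the table shown earlier
demographics_data = pd.DataFrame({
    "City": ["Silver Spring", "Quincy"],
    "State": ["Maryland", "Massachusetts"],
    "Median Age": [33.8, 41.0],
    "Average Household Size": [2.6, 2.39],
})

# .copy() makes the selection an independent DataFrame, so later assignments
# change only demographics_stats and never the source frame
demographics_stats = demographics_data[
    ["City", "State", "Median Age", "Average Household Size"]
].copy()
demographics_stats.columns = ["city", "state", "median_age", "avg_household_size"]
demographics_stats["city"] = demographics_stats["city"].str.upper()
demographics_stats["state"] = demographics_stats["state"].str.upper()

print(demographics_stats)
```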


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

We model these data sets with a star schema: the immigration fact table sits at the center and is linked to the surrounding dimension tables.

![Star-Schema](images/star-schema.PNG)

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

All preprocessing steps (load, select, clean, transform and store the resulting datasets) are implemented in **"etl.py"**. The open-source framework Apache Spark is the main tool; it provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

The preprocessing logic is concentrated there so that this notebook presents only the general steps of the ETL. All the steps of **"etl.py"** were also tested in **"etl_test.ipynb"**, where the source and destination data are stored locally and the rest of the logic and functionality is the same. Data quality is also tested by verifying that no table is empty.

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

The data pipeline is built inside the **"etl.py"** file included with this Capstone Project.
We also created a local data pipeline for testing purposes in **"etl_test.ipynb"**.

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
#### Run Quality Checks

Data quality checks were already run in **"etl_test.ipynb"**. The following checks are performed:

#### 1. Data schema of every dimensional table matches data model

In [None]:
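from pathlib import Path
from pyspark.sql import SparkSession

**path** contains the destination bucket path. For simplicity, a local destination folder **"output"** is used here.

In [None]:
path = "./output/"
s3_bucket = Path(path)
# The checks below read parquet through `spark`; reuse the active session or start one
spark = SparkSession.builder.getOrCreate()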

In [None]:
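for file_dir in s3_bucket.iterdir():
    if file_dir.is_dir():
        # Each sub-directory of the output folder holds one table as parquet files
        df = spark.read.parquet(str(file_dir))
        print("Table: " + file_dir.name)
        # printSchema() prints directly and returns None, so there is nothing to assign
        df.printSchema()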
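
The loop above prints each schema for visual inspection. To turn this into an automated check, the printed schema can be compared against an expected one. The sketch below is an assumption about how such a check could look, not code from **"etl_test.ipynb"**: `schema_mismatches` and the expected schema are hypothetical, and the function works on plain dictionaries such as the one obtained from `dict(df.dtypes)` on a Spark DataFrame.

```python
from typing import Dict, List


def schema_mismatches(actual: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Return human-readable differences between two column -> type mappings."""
    problems = []
    for column, expected_type in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            problems.append(f"{column}: expected {expected_type}, got {actual[column]}")
    for column in actual:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems


# Hypothetical expected schema for a state_code table
expected_state_code = {"code": "string", "state": "string"}

# Inside the loop above this would be dict(df.dtypes)
actual_state_code = {"code": "string", "state": "string"}

problems = schema_mismatches(actual_state_code, expected_state_code)
if problems:
    raise ValueError("Schema check failed: " + "; ".join(problems))
print("Schema matches the data model.")
```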

#### 2. Tables are not empty after running ETL data pipeline

In [None]:
for file_dir in s3_bucket.iterdir():
    if file_dir.is_dir():
        df = spark.read.parquet(str(file_dir))
        record_num = df.count()
        if record_num <= 0:
            # Name the failing table so the error is actionable
            raise ValueError(f"Table {file_dir.name} is empty!")
        print(f"Table: {file_dir.name} is not empty. It contains a total of {record_num} records.")

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

![Data_dictionary](images/Data_dictionary.PNG)

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.

The whole solution is built on AWS, which provides a low-cost, scalable and highly reliable cloud infrastructure platform. AWS services such as S3 are reasonably priced and billed ‘pay as you go’, and Apache Spark is open source, so we can start small and scale as the solution grows, with no up-front costs.

In particular, why we use the following services:

* __S3:__ Provides relatively cheap, easy-to-use storage with scalability, high availability, security and performance. This makes it a good fit for a staging area like the one in our solution.

* __Spark:__ A widely used framework for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Propose how often the data should be updated and why.
* Dimension tables only have to be updated when I94 creates a new category. However, the time dimension table (immigration_date) can be updated every month.
* Tables created from the immigration and temperature data sets should be updated monthly, since the raw data accumulates monthly.
* The US Cities Demographics data is updated every ten years according to https://www.usa.gov/statistics , so refreshing this data set annually is more than sufficient: demographic data takes time to collect, and more frequent updates would add cost and could lead to wrong conclusions.

Write a description of how you would approach the problem differently under the following scenarios:
* **The data was increased by 100x:**
 
Deploy the Spark solution on an AWS EMR cluster and use S3 for data and parquet file storage. Both can be scaled out as the data grows by 100x.
* **The data populates a dashboard that must be updated on a daily basis by 7am every day:**
 
Apache Airflow could be used to build an ETL data pipeline that regularly updates the data and populates the report. Apache Airflow also integrates well with Python and AWS, and more applications can be combined to deliver more powerful task automation.

* **The database needed to be accessed by 100+ people:**

The saved parquet files can be bulk copied into an AWS Redshift cluster. Redshift's massively parallel processing architecture is designed for big data workloads and for serving queries from many concurrent users.