# Non-immigrants' Statistics

### Data Engineering Capstone Project

#### Project Summary

The project "Non-immigrants' Statistics" combines data from several sources to identify trends among non-immigrants in US states.
The resulting data makes it possible to analyze those trends and to show the inflow and outflow of individuals across the American states.

The project proceeds through the following steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
from pyspark.sql import SparkSession
# Import only the functions used below; a wildcard import would shadow builtins such as sum, min and max
from pyspark.sql.functions import col
from pyspark.sql.types import StringType, IntegerType, DateType
import datetime

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
spark = SparkSession \
    .builder \
    .appName("Non-immigrants' Trends") \
    .getOrCreate()

### Step 1: Scope the Project and Gather Data

<!--#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What is your end solution look like? What tools did you use? etc> -->


##### Problem Statement

To analyze non-immigrants' trends, the data can be used to answer the questions below:
1. Where do most U.S. non-immigrants visit/live?
2. Who is arriving today? (By race/ethnicity/gender)
3. Where do non-immigrants come from?
4. How many people in the U.S. are non-immigrants? 

##### Choice of tools and technologies

1. Spark (on an AWS EMR cluster) - To extract and transform the data collected from the data sources and write it to an S3 bucket.
2. S3 bucket - To store the processed data model as Parquet files of dimension and fact tables.
3. Airflow (deployed on AWS EC2 in a Docker container) - To create and run the ETL/ELT pipeline.
4. AWS Athena - To read the processed dimension and fact tables from the Parquet files in S3 for analysis.

<!--#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? -->

##### Data Sources

- I94 Immigration Data: This data comes from the US National Tourism and Trade Office ([source](https://travel.trade.gov/research/reports/i94/historical/2016.html)). A data dictionary is included in the workspace, along with a sample file in CSV format for inspecting the data before reading it all in. The entire dataset is not required; only the parts needed for the project's goals have to be used.
- World Temperature Data: This dataset came from Kaggle. You can read more about it [here](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).    
- U.S. City Demographic Data: This data comes from OpenDataSoft. You can read more about it [here](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/).
- Airport Code Table: This is a simple table of airport codes and corresponding cities. It comes from [here](https://datahub.io/core/airport-codes#data).

##### Data Files

- airport-codes_csv.csv - Airport Code Table    
- I94_SAS_Labels_Descriptions.SAS - I94 Immigration Data (Labels Descriptions)    
- us-cities-demographics.csv - U.S. City Demographic Data    
- immigration_data_sample.csv - I94 Immigration Data (Data Sample with 1000 records)    
- sas_data - I94 Immigration Data (In Parquet files) (Over 3 million records)    




In [3]:
airport_codes_df=spark.read.csv("airport-codes_csv.csv", header='true', inferSchema='true')

In [4]:
us_cities_demographics_df=spark.read.option("delimiter", ';').csv("us-cities-demographics.csv", header='true', inferSchema='true')

In [5]:
immigration_sample_df=spark.read.csv("immigration_data_sample.csv", header='true', inferSchema='true')

In [6]:
immigration_df=spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
<!-- #### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc. -->

#### Cleaning Steps
<!-- Document steps necessary to clean the data -->

The loaded data comes in different formats: comma-delimited CSV, semicolon-delimited CSV, and Parquet.


In [7]:
airport_codes = airport_codes_df.select(['ident','iso_country','iso_region','name','type'])
airport_codes.printSchema()
airport_codes.show(5)
airport_codes.count()

root
 |-- ident: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- name: string (nullable = true)
 |-- type: string (nullable = true)

+-----+-----------+----------+--------------------+-------------+
|ident|iso_country|iso_region|                name|         type|
+-----+-----------+----------+--------------------+-------------+
|  00A|         US|     US-PA|   Total Rf Heliport|     heliport|
| 00AA|         US|     US-KS|Aero B Ranch Airport|small_airport|
| 00AK|         US|     US-AK|        Lowell Field|small_airport|
| 00AL|         US|     US-AL|        Epps Airpark|small_airport|
| 00AR|         US|     US-AR|Newport Hospital ...|       closed|
+-----+-----------+----------+--------------------+-------------+
only showing top 5 rows



55075
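
The `iso_region` values shown above combine a country code and a subdivision code (for example `US-PA`). The following is a minimal sketch of one possible cleaning step, assuming the data model joins airports to states on the two-letter code found in the demographics `State Code` column. `region_to_state` is a hypothetical helper written in plain Python for illustration; it is not part of the notebook.

```python
from typing import Optional


def region_to_state(iso_region: Optional[str]) -> Optional[str]:
    """Return the two-letter state code from a value such as 'US-PA'.

    Returns None for missing values and for regions outside the US.
    """
    if not iso_region:
        return None
    # Split "US-PA" into country "US" and subdivision "PA".
    country, _, subdivision = iso_region.partition("-")
    if country != "US" or len(subdivision) != 2:
        return None
    return subdivision.upper()


print(region_to_state("US-PA"))
print(region_to_state("CA-ON"))
```

In the Spark pipeline the same logic could be applied as a column expression or a UDF; which of the two fits better depends on the final model.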

In [15]:
#us_cities_demographics_df.select(['City','State','State Code','Median Age','Race','Male Population','Female Population','Foreign-born','Total Population']).show(2)
us_cities_demographics = us_cities_demographics_df.select(col("City").alias("city"),
  col("State").alias("state"),
  col("Median Age").alias("med_age"),
  col("Male Population").alias("male_pop"),
  col("Female Population").alias("fem_pop"),
  col("Total Population").alias("total_pop"),
  col("Number of Veterans").alias("vets"),
  col("Foreign-born").alias("fborn"),
  col("Average Household Size").alias("hsize"),
  col("State Code").alias("state_code"),
  col("Race").alias("race"),
  col("Count").alias("count"))
us_cities_demographics.show(2)
us_cities_demographics.printSchema()
us_cities_demographics.count()

+-------------+-------------+-------+--------+-------+---------+----+-----+-----+----------+------------------+-----+
|         city|        state|med_age|male_pop|fem_pop|total_pop|vets|fborn|hsize|state_code|              race|count|
+-------------+-------------+-------+--------+-------+---------+----+-----+-----+----------+------------------+-----+
|Silver Spring|     Maryland|   33.8|   40601|  41862|    82463|1562|30908|  2.6|        MD|Hispanic or Latino|25924|
|       Quincy|Massachusetts|   41.0|   44129|  49500|    93629|4147|32935| 2.39|        MA|             White|58723|
+-------------+-------------+-------+--------+-------+---------+----+-----+-----+----------+------------------+-----+
only showing top 2 rows

root
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- med_age: double (nullable = true)
 |-- male_pop: integer (nullable = true)
 |-- fem_pop: integer (nullable = true)
 |-- total_pop: integer (nullable = true)
 |-- vets: integer (nullabl

2891
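
The demographics table appears to hold one row per city and race, so city-level columns such as `total_pop` repeat on every race row of the same city. The sketch below, in plain Python, shows why rows should be deduplicated by city before summing population per state. The second Silver Spring row is an illustrative assumption added to demonstrate the repetition, and `population_by_state` is a hypothetical helper.

```python
from collections import defaultdict

# Two rows are taken from the output above; the "White" row for
# Silver Spring is a made-up duplicate used only for illustration.
rows = [
    {"city": "Silver Spring", "state_code": "MD", "total_pop": 82463,
     "race": "Hispanic or Latino"},
    {"city": "Silver Spring", "state_code": "MD", "total_pop": 82463,
     "race": "White"},
    {"city": "Quincy", "state_code": "MA", "total_pop": 93629,
     "race": "White"},
]


def population_by_state(rows):
    """Sum total_pop per state, counting each city only once."""
    seen_cities = set()
    totals = defaultdict(int)
    for row in rows:
        key = (row["city"], row["state_code"])
        if key in seen_cities:
            continue  # Skip further race rows of a city already counted.
        seen_cities.add(key)
        totals[row["state_code"]] += row["total_pop"]
    return dict(totals)


print(population_by_state(rows))
```

With Spark, a comparable result could be obtained by calling `dropDuplicates` on the city and state columns before grouping by state.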

In [16]:
immigration_sample_df.select(['admnum','cicid','i94cit','i94res','i94port','i94mode','i94addr','i94yr','i94mon','arrdate','i94mode','depdate','gender','visatype','i94bir','biryear','i94visa','dtadfile','dtaddto']).show(2)
immigration_sample_df.show(2)
immigration_sample_df.printSchema()
immigration_sample_df.count()

+---------------+---------+------+------+-------+-------+-------+------+------+-------+-------+-------+------+--------+------+-------+-------+--------+--------+
|         admnum|    cicid|i94cit|i94res|i94port|i94mode|i94addr| i94yr|i94mon|arrdate|i94mode|depdate|gender|visatype|i94bir|biryear|i94visa|dtadfile| dtaddto|
+---------------+---------+------+------+-------+-------+-------+------+------+-------+-------+-------+------+--------+------+-------+-------+--------+--------+
|5.6582674633E10|4084316.0| 209.0| 209.0|    HHW|    1.0|     HI|2016.0|   4.0|20566.0|    1.0|20573.0|     F|      WT|  61.0| 1955.0|    2.0|20160422|07202016|
| 9.436199593E10|4422636.0| 582.0| 582.0|    MCA|    1.0|     TX|2016.0|   4.0|20567.0|    1.0|20568.0|     M|      B2|  26.0| 1990.0|    2.0|20160423|10222016|
+---------------+---------+------+------+-------+-------+-------+------+------+-------+-------+-------+------+--------+------+-------+-------+--------+--------+
only showing top 2 rows

+-------+

1000

In [10]:
immigration_df.select(['admnum','cicid','i94cit','i94res','i94port','i94mode','i94addr']).show(2)
immigration_df.select(['i94yr','i94mon','arrdate','i94mode','depdate','gender','visatype','i94bir','biryear','i94visa','dtadfile','dtaddto']).show(2)
immigration_df.show(1)
immigration_df.printSchema()
immigration_df.count()

+--------------+---------+------+------+-------+-------+-------+
|        admnum|    cicid|i94cit|i94res|i94port|i94mode|i94addr|
+--------------+---------+------+------+-------+-------+-------+
|9.495387003E10|5748517.0| 245.0| 438.0|    LOS|    1.0|     CA|
|9.495562283E10|5748518.0| 245.0| 438.0|    LOS|    1.0|     NV|
+--------------+---------+------+------+-------+-------+-------+
only showing top 2 rows

+------+------+-------+-------+-------+------+--------+------+-------+-------+--------+--------+
| i94yr|i94mon|arrdate|i94mode|depdate|gender|visatype|i94bir|biryear|i94visa|dtadfile| dtaddto|
+------+------+-------+-------+-------+------+--------+------+-------+-------+--------+--------+
|2016.0|   4.0|20574.0|    1.0|20582.0|     F|      B1|  40.0| 1976.0|    1.0|20160430|10292016|
|2016.0|   4.0|20574.0|    1.0|20591.0|     F|      B1|  32.0| 1984.0|    1.0|20160430|10292016|
+------+------+-------+-------+-------+------+--------+------+-------+-------+--------+--------+
onl

3096313
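
The `arrdate` and `depdate` columns are stored as floating-point numbers. They look like SAS date values, which count days since 1960-01-01; under that assumption, 20566.0 maps to 2016-04-22, which agrees with the `dtadfile` value 20160422 on the same sample row. A minimal sketch of the conversion, with `sas_to_date` as a hypothetical helper:

```python
import datetime
import math

# SAS date values count days from this epoch (assumed for arrdate/depdate).
SAS_EPOCH = datetime.date(1960, 1, 1)


def sas_to_date(value):
    """Convert a SAS day count to a date; return None for missing values."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return SAS_EPOCH + datetime.timedelta(days=int(value))


print(sas_to_date(20566.0))  # 2016-04-22
```

Missing departure dates are returned as `None` rather than raising, so records without a recorded departure can still be kept in the fact table.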

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [11]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [12]:
# Perform quality checks here
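
The completeness and unique-key checks listed above could be sketched as follows. This is a plain-Python illustration, assuming the row counts and key values have already been collected from the tables (for example with `count()` on a DataFrame); `check_not_empty` and `check_unique_key` are hypothetical helper names.

```python
from collections import Counter


def check_not_empty(table_name, row_count):
    """Completeness check: fail if a table has no rows."""
    if row_count <= 0:
        raise ValueError(
            f"{table_name}: expected at least one row, found {row_count}"
        )


def check_unique_key(table_name, keys):
    """Integrity check: fail if any key value occurs more than once."""
    duplicates = [key for key, n in Counter(keys).items() if n > 1]
    if duplicates:
        raise ValueError(
            f"{table_name}: {len(duplicates)} duplicated key value(s)"
        )


# Values taken from the exploration output above.
check_not_empty("immigration", 3096313)
check_unique_key("immigration", [5748517.0, 5748518.0])
print("quality checks passed")
```

Raising an exception, rather than printing a warning, lets an orchestrator such as Airflow mark the task as failed when a check does not hold.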

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.