# Data Engineering Capstone Project

#### Project Summary

This project combines the technologies covered in the Udacity Data Engineering program. It uses four datasets: the main dataset covers immigration to the United States, and the supplementary datasets cover airport codes, U.S. city demographics, and temperature.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

### Step 1: Scope the Project and Gather Data

#### Scope
This is the capstone project of the Udacity Data Engineering Nanodegree program. It mimics a real-life situation in which you need to analyze and clean the data, save it in a columnar file format, and load it to a data lake on S3 using Spark. An Airflow pipeline then loads the data into a Redshift database for further analysis. The process is shown below.

<img src="process.jpg">


#### Describe and Gather Data 

The project contains four files that were gathered and provided by Udacity.

The I94 Immigration Data comes from the US National Tourism and Trade Office. The original dataset is available at https://travel.trade.gov/research/reports/i94/historical/2016.html. A sample dataset is also provided in the workspace (immigration_data_sample.csv), and the data dictionary can be found in the I94_SAS_Labels_Descriptions.SAS file.

#### I94 Data Dictionary
- cicid - float64 - ID that uniquely identifies one record in the dataset
- i94yr - float64 - 4 digit year
- i94mon - float64 - Numeric month
- i94cit - float64 - 3 digit code of source country for immigration (country of birth)
- i94res - float64 - 3 digit code of source country for immigration (Residence country)
- i94port - object - Port admitted through
- arrdate - float64 - Arrival date in the USA
- i94mode - float64 - Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)
- i94addr - object - State of arrival
- depdate - float64 - Departure date
- i94bir - float64 - Age of Respondent in Years
- i94visa - float64 - Visa codes collapsed into three categories: (1 = Business; 2 = Pleasure; 3 = Student)
- count - float64 - Used for summary statistics
- dtadfile - object - Character Date Field
- visapost - object - Department of State where Visa was issued
- occup - object - Occupation that will be performed in U.S.
- entdepa - object - Arrival Flag. Whether admitted or paroled into the US
- entdepd - object - Departure Flag. Whether departed, lost visa, or deceased
- entdepu - object - Update Flag. Update of visa, either apprehended, overstayed, or updated to PR
- matflag - object - Match flag
- biryear - float64 - 4 digit year of birth
- dtaddto - object - Character date field to when admitted in the US
- gender - object - Gender
- insnum - object - INS number
- airline - object - Airline used to arrive in U.S.
- admnum - float64 - Admission number, should be unique and not nullable
- fltno - object - Flight number of Airline used to arrive in U.S.
- visatype - object - Class of admission legally admitting the non-immigrant to temporarily stay in U.S.
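
The `arrdate` and `depdate` values (for example 20566.0) are numbers rather than calendar dates. Assuming they are SAS date values, that is, day counts starting at 1960-01-01 (an assumption based on the SAS origin of the file, not something the data dictionary states), a minimal conversion sketch could look like this. The helper name `sas_to_date` is illustrative.

```python
from datetime import date, timedelta

# Assumption: SAS date values count days from 1960-01-01.
SAS_EPOCH = date(1960, 1, 1)

def sas_to_date(days):
    """Convert a SAS day count (int or float) to a date; None stays None."""
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))

# arrdate of the first row in the sample file
print(sas_to_date(20566.0))  # 2016-04-22
```

Under this assumption the sample's arrival dates all fall in April 2016, which matches the `i94yr` and `i94mon` columns.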

World Temperature Data comes from Kaggle. Further details about the dataset can be found here: https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data.

#### World Temperature Data Dictionary
- dt - Date in format YYYY-MM-DD
- AverageTemperature - Average temperature of the city on a given date
- AverageTemperatureUncertainty - Standard Deviation of the avg. temperature
- City 
- Country 
- Latitude 
- Longitude 


The U.S. City Demographic Data comes from OpenDataSoft. Further details about the dataset can be found here: https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/.

#### Demographic Data Dictionary
- City - Name of the city
- State - US state of the city
- Median Age - The median of the age of the population
- Male Population - Number of the male population
- Female Population - Number of the female population
- Total Population - Number of the total population
- Number of Veterans - Number of veterans living in the city
- Foreign-born - Number of residents of the city who were not born in the United States
- Average Household Size - Average number of people per household in the city
- State Code - Code of the state of the city
- Race - Race class
- Count - Number of individuals of each race

Airport Data is a simple table of airport codes and corresponding cities. The data can be found here: https://datahub.io/core/airport-codes#data.

#### Airport Data Dictionary
- ident - Unique identifier
- type - Type of the airport
- name - Airport Name
- elevation_ft - Altitude of the airport
- continent - Continent
- iso_country - ISO code of the country of the airport
- iso_region - ISO code for the region of the airport
- municipality - City where the airport is located
- gps_code - GPS code of the airport
- iata_code - IATA code of the airport
- local_code - Local code of the airport
- coordinates - GPS coordinates of the airport

### Step 2: Explore and Assess the Data

In [1]:
# Do all imports and installs
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import configparser
import os
from pyspark.sql.functions import (
    col, date_format, datediff, dayofmonth, month,
    monotonically_increasing_id, to_date, to_timestamp, udf, year,
)
from pyspark.sql.types import *

In [2]:
# read immigration_data_sample file to preview data
df = pd.read_csv("immigration_data_sample.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,...,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,...,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,...,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,...,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,...,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 29 columns):
Unnamed: 0    1000 non-null int64
cicid         1000 non-null float64
i94yr         1000 non-null float64
i94mon        1000 non-null float64
i94cit        1000 non-null float64
i94res        1000 non-null float64
i94port       1000 non-null object
arrdate       1000 non-null float64
i94mode       1000 non-null float64
i94addr       941 non-null object
depdate       951 non-null float64
i94bir        1000 non-null float64
i94visa       1000 non-null float64
count         1000 non-null float64
dtadfile      1000 non-null int64
visapost      382 non-null object
occup         4 non-null object
entdepa       1000 non-null object
entdepd       954 non-null object
entdepu       0 non-null float64
matflag       954 non-null object
biryear       1000 non-null float64
dtaddto       1000 non-null object
gender        859 non-null object
insnum        35 non-null float64
airline       967 non

In [4]:
df.describe()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,arrdate,i94mode,depdate,i94bir,i94visa,count,dtadfile,entdepu,biryear,insnum,admnum
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,951.0,1000.0,1000.0,1000.0,1000.0,0.0,1000.0,35.0,1000.0
mean,1542097.0,3040461.0,2016.0,4.0,302.928,298.262,20559.68,1.078,20575.037855,42.382,1.859,1.0,20160420.0,,1973.618,3826.857143,69372370000.0
std,915287.9,1799818.0,0.0,0.0,206.485285,202.12039,8.995027,0.485955,24.211234,17.903424,0.386353,0.0,49.51657,,17.903424,221.742583,23381340000.0
min,10925.0,13208.0,2016.0,4.0,103.0,103.0,20545.0,1.0,20547.0,1.0,1.0,1.0,20160400.0,,1923.0,3468.0,0.0
25%,721442.2,1412170.0,2016.0,4.0,135.0,131.0,20552.0,1.0,20561.0,30.75,2.0,1.0,20160410.0,,1961.0,3668.0,55993010000.0
50%,1494568.0,2941176.0,2016.0,4.0,213.0,213.0,20560.0,1.0,20570.0,42.0,2.0,1.0,20160420.0,,1974.0,3887.0,59314770000.0
75%,2360901.0,4694151.0,2016.0,4.0,438.0,438.0,20567.25,1.0,20580.0,55.0,2.0,1.0,20160420.0,,1985.25,3943.0,93436230000.0
max,3095749.0,6061994.0,2016.0,4.0,746.0,696.0,20574.0,9.0,20715.0,93.0,3.0,1.0,20160800.0,,2015.0,4686.0,95021510000.0


In [6]:
# display missing values in %
(df.isnull().sum() / len(df))*100

Unnamed: 0      0.0
cicid           0.0
i94yr           0.0
i94mon          0.0
i94cit          0.0
i94res          0.0
i94port         0.0
arrdate         0.0
i94mode         0.0
i94addr         5.9
depdate         4.9
i94bir          0.0
i94visa         0.0
count           0.0
dtadfile        0.0
visapost       61.8
occup          99.6
entdepa         0.0
entdepd         4.6
entdepu       100.0
matflag         4.6
biryear         0.0
dtaddto         0.0
gender         14.1
insnum         96.5
airline         3.3
admnum          0.0
fltno           0.8
visatype        0.0
dtype: float64

Several columns have missing values. gender (14.1%), i94addr (5.9%), airline (3.3%), and fltno (0.8%) are partly missing; note that airline and fltno are missing in different proportions even though both describe the flight. visapost is missing in 61.8% of the rows, occup and insnum in over 95%, and entdepu is empty in every row of the sample.
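
To turn these percentages into a list of candidate columns to drop, a small filter is enough. The sketch below copies the non-zero percentages from the output above into a plain dictionary; the 90% threshold and the helper name `columns_above` are assumptions for illustration, not a rule used elsewhere in this project.

```python
# Missing-value percentages copied from the output above (non-zero columns only).
missing_pct = {
    "i94addr": 5.9, "depdate": 4.9, "visapost": 61.8, "occup": 99.6,
    "entdepd": 4.6, "entdepu": 100.0, "matflag": 4.6, "gender": 14.1,
    "insnum": 96.5, "airline": 3.3, "fltno": 0.8,
}

def columns_above(pct_by_column, threshold):
    """Return the column names whose missing percentage exceeds the threshold."""
    return sorted(c for c, pct in pct_by_column.items() if pct > threshold)

# Assumed threshold: treat columns with more than 90% missing as mostly empty.
mostly_empty = columns_above(missing_pct, 90.0)
print(mostly_empty)  # ['entdepu', 'insnum', 'occup']
```

With a pandas Series of percentages, the same selection can be written as `pct[pct > 90].index`.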

In [44]:
# count cicid values that appear more than once
df['cicid'].duplicated().sum()

0

No cicid appears more than once, so the sample has no duplicated entries.

In [45]:
df[['i94mode']].apply(pd.Series.value_counts)

Unnamed: 0,i94mode
1.0,962
3.0,26
2.0,10
9.0,2


i94mode - float64 - Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)
In the sampled data, most visitors (962 of 1,000) entered the U.S. by air, with small fractions arriving by land (26) and by sea (10). For two records the mode of entry was not reported.
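
The counts above are easier to compare as labeled shares. The following sketch hard-codes the value counts from the sample output and the code-to-label mapping from the data dictionary; the variable names are illustrative.

```python
# Code-to-label mapping from the data dictionary
MODE_LABELS = {1: "Air", 2: "Sea", 3: "Land", 9: "Not reported"}
# Value counts copied from the sample output above
mode_counts = {1: 962, 3: 26, 2: 10, 9: 2}

total = sum(mode_counts.values())
shares = {
    MODE_LABELS[code]: round(100 * n / total, 1)
    for code, n in mode_counts.items()
}
print(shares)  # {'Air': 96.2, 'Land': 2.6, 'Sea': 1.0, 'Not reported': 0.2}
```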

In [8]:
df[['gender']].apply(pd.Series.value_counts)

Unnamed: 0,gender
M,471
F,386
X,2


There are more male visitors (471) than female visitors (386). Two records are marked X, and gender is missing for 141 records (14.1%).

In [31]:
df2 = df.groupby('i94mode').agg(
{'cicid': 'count',
 'i94bir': ['min', 'max'],
 'visatype': ['unique']
})
# flatten the two-level column index into names like i94bir_min
df2.columns = ["_".join(x) for x in df2.columns]
df2.head()

i94mode,cicid_count,i94bir_min,i94bir_max,visatype_unique
1.0,962,1.0,93.0,"[WT, B2, CP, B1, GMT, WB, F1, E2, F2, M1]"
2.0,10,11.0,88.0,"[WT, B1, B2]"
3.0,26,17.0,65.0,"[WT, B2, B1, F1]"
9.0,2,39.0,55.0,[WT]


Visitors who entered the U.S. by air show the widest age range, from 1 to 93 years, although this group is also by far the largest in the sample.

In [47]:
airport = pd.read_csv("airport-codes_csv.csv")
airport.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [48]:
airport.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55075 entries, 0 to 55074
Data columns (total 12 columns):
ident           55075 non-null object
type            55075 non-null object
name            55075 non-null object
elevation_ft    48069 non-null float64
continent       27356 non-null object
iso_country     54828 non-null object
iso_region      55075 non-null object
municipality    49399 non-null object
gps_code        41030 non-null object
iata_code       9189 non-null object
local_code      28686 non-null object
coordinates     55075 non-null object
dtypes: float64(1), object(11)
memory usage: 5.0+ MB


In [50]:
airport.describe()

Unnamed: 0,elevation_ft
count,48069.0
mean,1240.789677
std,1602.363459
min,-1266.0
25%,205.0
50%,718.0
75%,1497.0
max,22000.0


In [49]:
# display missing values in %
(airport.isnull().sum() / len(airport))*100

ident            0.000000
type             0.000000
name             0.000000
elevation_ft    12.720835
continent       50.329551
iso_country      0.448479
iso_region       0.000000
municipality    10.305946
gps_code        25.501589
iata_code       83.315479
local_code      47.914662
coordinates      0.000000
dtype: float64

The airport dataset has missing values, including in columns that matter for joining with other tables: municipality (10.3%), iata_code (83.3%), and local_code (47.9%).

In [51]:
# count ident values that appear more than once
airport['ident'].duplicated().sum()

0

It seems there are no duplicated entries

In [54]:
fname = '/data2/GlobalLandTemperaturesByCity.csv'
temp_df = pd.read_csv(fname)
temp_df.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


The oldest record in the temperature dataset is dated 1743-11-01.

In [58]:
temp_df.tail()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
8599207,2013-05-01,11.464,0.236,Zwolle,Netherlands,52.24N,5.26E
8599208,2013-06-01,15.043,0.261,Zwolle,Netherlands,52.24N,5.26E
8599209,2013-07-01,18.775,0.193,Zwolle,Netherlands,52.24N,5.26E
8599210,2013-08-01,18.025,0.298,Zwolle,Netherlands,52.24N,5.26E
8599211,2013-09-01,,,Zwolle,Netherlands,52.24N,5.26E


The newest record is dated 2013-09-01 and has no temperature value, so 2013 does not cover all 12 months. We therefore use the data for the whole of 2012.
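
Whether a year is fully covered can be checked by counting the months that have a temperature value. The sketch below works on plain `(dt, avg_temp)` pairs; the rows are hypothetical (twelve made-up values for 2012 plus the last two rows from the `tail()` output), and the helper name `months_with_data` is illustrative.

```python
from collections import defaultdict

def months_with_data(rows):
    """Map each year to the set of months that have a non-missing temperature.

    `rows` is an iterable of (dt, avg_temp) pairs with dt as 'YYYY-MM-DD'.
    """
    months = defaultdict(set)
    for dt, avg_temp in rows:
        if avg_temp is not None:
            year, month, _ = dt.split("-")
            months[int(year)].add(int(month))
    return months

# Hypothetical input: 12 made-up monthly values for 2012,
# followed by the last two rows shown in the tail() output.
rows = [(f"2012-{m:02d}-01", 10.0) for m in range(1, 13)]
rows += [("2013-08-01", 18.025), ("2013-09-01", None)]

coverage = months_with_data(rows)
complete_years = [y for y, m in sorted(coverage.items()) if len(m) == 12]
print(complete_years)  # [2012]
```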

In [55]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8599212 entries, 0 to 8599211
Data columns (total 7 columns):
dt                               object
AverageTemperature               float64
AverageTemperatureUncertainty    float64
City                             object
Country                          object
Latitude                         object
Longitude                        object
dtypes: float64(2), object(5)
memory usage: 459.2+ MB


In [56]:
temp_df.describe()

Unnamed: 0,AverageTemperature,AverageTemperatureUncertainty
count,8235082.0,8235082.0
mean,16.72743,1.028575
std,10.35344,1.129733
min,-42.704,0.034
25%,10.299,0.337
50%,18.831,0.591
75%,25.21,1.349
max,39.651,15.396


In [57]:
# display missing values in %
(temp_df.isnull().sum() / len(temp_df))*100

dt                               0.000000
AverageTemperature               4.234458
AverageTemperatureUncertainty    4.234458
City                             0.000000
Country                          0.000000
Latitude                         0.000000
Longitude                        0.000000
dtype: float64

About 4.2% of the values in the AverageTemperature and AverageTemperatureUncertainty columns are missing.

In [60]:
fname = 'us-cities-demographics.csv'
demo_df = pd.read_csv(fname, sep=';')
demo_df.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [62]:
demo_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2891 entries, 0 to 2890
Data columns (total 12 columns):
City                      2891 non-null object
State                     2891 non-null object
Median Age                2891 non-null float64
Male Population           2888 non-null float64
Female Population         2888 non-null float64
Total Population          2891 non-null int64
Number of Veterans        2878 non-null float64
Foreign-born              2878 non-null float64
Average Household Size    2875 non-null float64
State Code                2891 non-null object
Race                      2891 non-null object
Count                     2891 non-null int64
dtypes: float64(6), int64(2), object(4)
memory usage: 271.1+ KB


In [63]:
demo_df.describe()

Unnamed: 0,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,Count
count,2891.0,2888.0,2888.0,2891.0,2878.0,2878.0,2875.0,2891.0
mean,35.494881,97328.43,101769.6,198966.8,9367.832523,40653.6,2.742543,48963.77
std,4.401617,216299.9,231564.6,447555.9,13211.219924,155749.1,0.433291,144385.6
min,22.9,29281.0,27348.0,63215.0,416.0,861.0,2.0,98.0
25%,32.8,39289.0,41227.0,80429.0,3739.0,9224.0,2.43,3435.0
50%,35.3,52341.0,53809.0,106782.0,5397.0,18822.0,2.65,13780.0
75%,38.0,86641.75,89604.0,175232.0,9368.0,33971.75,2.95,54447.0
max,70.5,4081698.0,4468707.0,8550405.0,156961.0,3212500.0,4.98,3835726.0


In [64]:
# display missing values in %
(demo_df.isnull().sum() / len(demo_df))*100

City                      0.000000
State                     0.000000
Median Age                0.000000
Male Population           0.103770
Female Population         0.103770
Total Population          0.000000
Number of Veterans        0.449671
Foreign-born              0.449671
Average Household Size    0.553442
State Code                0.000000
Race                      0.000000
Count                     0.000000
dtype: float64

The dataset looks complete; only a few columns are missing a small fraction of values (under 1%).


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
A star schema is used for this project: a fact table surrounded by multiple dimension tables. The current tables are shown below.

<img src="current_tables.jpg">

And the following Data Model is considered.

### Data Model

<img src="data_model.jpg">
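
To make the star-schema idea concrete, here is a minimal sketch of a fact table joined to one dimension table, using an in-memory SQLite database. The table layout is heavily simplified, the rows are made up, and joining `i94.port` to `airports.airport_code` is an assumption about how the two tables relate, not a confirmed part of the final model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE airports (airport_code TEXT PRIMARY KEY, state TEXT);
    CREATE TABLE i94 (cicid INTEGER PRIMARY KEY, port TEXT, month INTEGER);
    -- made-up rows for illustration only
    INSERT INTO airports VALUES ('ATL', 'GA'), ('ABQ', 'NM');
    INSERT INTO i94 VALUES (1, 'ATL', 4), (2, 'ATL', 4), (3, 'ABQ', 4);
""")

# Join the fact table to the dimension table, then aggregate per state and month.
query = """
    SELECT a.state, f.month, COUNT(*) AS visitors
      FROM i94 AS f
      JOIN airports AS a ON a.airport_code = f.port
     GROUP BY a.state, f.month
     ORDER BY a.state
"""
result = conn.execute(query).fetchall()
print(result)  # [('GA', 4, 2), ('NM', 4, 1)]
conn.close()
```

The same pattern applies to the other dimension tables: the fact table holds the keys, and each analytical question becomes a join plus an aggregation.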

#### 3.2 Mapping Out Data Pipelines

The data pipeline produces 15 tables and provides analytical information for the immigration department based on multiple parameters, such as the temperature in the arrival city and the volume of visitors by month, along with other insights that can help with planning the workload. The pipeline contains the following steps.

- Read I94 SAS files into a Spark data frame and remove duplicates
- Rename columns to a more readable format
- Remove null values from dtadfile since we need to use it as a primary key
- Convert dtaddto and dtadfile to dates with to_date (columns allowed_stay_till and date_created)
- Create a new column stayed_days to define the number of days each visitor stayed
- Create a day column that can be used along with month and city to join with the temperature table
- Create the airport table where iata_code is not null and iso_country is US
- For the airport table, convert iso_region into a new column called state
- Create the temperature table where Country is United States and year is 2012
- Create the demographic table and rename columns to easy-to-read names
- Define two functions to create five mapping tables for I94 data. 
- Save all tables in parquet format and load them to data lake S3.
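
The mapping-table step can be sketched as a small parser that turns `code = 'label'` lines into a dictionary. The excerpt below is a hypothetical sample written in the style of a SAS labels file, and `parse_mapping` is an illustrative name; the actual functions used in this project may differ.

```python
import re

# One "code = 'label'" pair per line; the code may be quoted ('ALC') or numeric (582).
PAIR = re.compile(r"""^\s*'?(?P<code>[^'=\s]+)'?\s*=\s*'(?P<label>[^']*)'""")

def parse_mapping(lines):
    """Return {code: label} for every line that looks like a label assignment."""
    mapping = {}
    for line in lines:
        match = PAIR.match(line)
        if match:
            mapping[match.group("code")] = match.group("label").strip()
    return mapping

# Hypothetical excerpt in the style of a SAS labels file
sample = """
value i94model
    1 = 'Air'
    2 = 'Sea'
    3 = 'Land'
    9 = 'Not reported' ;
""".splitlines()

modes = parse_mapping(sample)
print(modes)  # {'1': 'Air', '2': 'Sea', '3': 'Land', '9': 'Not reported'}
```

The codes come back as strings, so they would need a cast before joining against an integer column such as `mode`.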

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [5]:
# read AWS keys to access S3 storage
config = configparser.ConfigParser()
config.read('dl.cfg')
os.environ['AWS_ACCESS_KEY_ID']=config['AWS CREDS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS CREDS']['AWS_SECRET_ACCESS_KEY']

In [6]:
# create spark session
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.1.0-s_2.11,org.apache.hadoop:hadoop-aws:2.7.0")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # the S3A connector reads its credentials from fs.s3a.access.key / fs.s3a.secret.key
    .config("spark.hadoop.fs.s3a.access.key", os.environ['AWS_ACCESS_KEY_ID'])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ['AWS_SECRET_ACCESS_KEY'])
    .config("spark.driver.memory", "15g")
    .enableHiveSupport()
    .getOrCreate()
)

spark.conf.set("spark.sql.execution.arrow.enabled", "true")    
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")
sc._jsc.hadoopConfiguration().set("spark.speculation","false")

In [91]:
# load one month of i94 (immigration data) in sas format and drop all duplicates
df_spark = spark.read.format('com.github.saurfang.sas.spark') \
    .load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat') \
    .drop_duplicates()
# drop all null values in dtadfile column
df_spark = df_spark.where(col("dtadfile").isNotNull())
# convert dtaddto to a proper date
df_spark = df_spark.withColumn("allowed_stay_till",  to_date("dtaddto","MMddyyyy"))
# convert dtadfile to a proper date
df_spark = df_spark.withColumn('date_created', to_date("dtadfile", 'yyyyMMdd'))
# find difference between dtaddto and dtadfile
df_spark = df_spark.withColumn("allowed_stay_days", datediff('allowed_stay_till', 'date_created'))
# create stayed days for each record
df_spark = df_spark.withColumn('stayed_days', ( df_spark['depdate'] - df_spark['arrdate']))
# create day column
df_spark = df_spark.withColumn('day', dayofmonth("date_created"))
# create a new view and name it as i94
df_spark.createOrReplaceTempView('i94')

query = """SELECT cicid, 
                date_created, 
                i94yr AS year, 
                i94mon AS month,
                day,
                i94cit AS citizenship,
                i94res AS resident,
                i94bir AS age,
                biryear AS birth_year,
                gender,
                occup AS occupation,
                allowed_stay_till,
                allowed_stay_days,
                stayed_days,
                i94visa AS visa_class,
                visatype AS visa_type,
                i94port AS port,
                i94mode AS mode,
                i94addr AS arraval_state,
                visapost AS visa_issued_by,
                entdepa AS arrival_flag,
                entdepd AS depart_flag,
                entdepu AS update_flag,
                matflag AS match_flag,
                airline,
                admnum AS admission_number,
                fltno AS flight_number
           FROM i94
       """
i94 = spark.sql(query)

# since the schema was inferred by default we need to change some data types
i94 = i94.withColumn("cicid", i94["cicid"].cast(IntegerType()))
i94 = i94.withColumn("year", i94["year"].cast(IntegerType()))
i94 = i94.withColumn("month", i94["month"].cast(IntegerType()))
i94 = i94.withColumn("citizenship", i94["citizenship"].cast(IntegerType()))
i94 = i94.withColumn("resident", i94["resident"].cast(IntegerType()))
i94 = i94.withColumn("age", i94["age"].cast(IntegerType()))
i94 = i94.withColumn("birth_year", i94["birth_year"].cast(IntegerType()))
i94 = i94.withColumn("stayed_days", i94["stayed_days"].cast(IntegerType()))
i94 = i94.withColumn("visa_class", i94["visa_class"].cast(IntegerType()))
i94 = i94.withColumn("mode", i94["mode"].cast(IntegerType()))
# admission numbers exceed the 32-bit integer range, so use a 64-bit type
i94 = i94.withColumn("admission_number", i94["admission_number"].cast(LongType()))
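
When choosing integer types for these casts, it helps to check the value ranges seen during exploration against the limits of a 32-bit integer. The sketch below uses the maximum `cicid` and `admnum` values from the `describe()` output of the sample file; the helper name `fits_int32` is illustrative.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold

def fits_int32(value):
    """Return True if the value fits in a 32-bit signed integer."""
    return -INT32_MAX - 1 <= value <= INT32_MAX

max_cicid = 6_061_994         # max cicid from the sample's describe() output
max_admnum = 95_021_510_000   # max admnum from the same output (rounded there)

print(fits_int32(max_cicid))   # True
print(fits_int32(max_admnum))  # False
```

In PySpark, `IntegerType` is a 32-bit type and `LongType` is the 64-bit one, so a column like the admission number needs `LongType` to keep its values intact.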

In [93]:
# save i94 data to parquet format on S3
output_data = "s3a://udacity-data-engineering-stan/data/"
i94.write.parquet(output_data + 'i94/', 'overwrite')

In [94]:
# read parquet file from S3
df_spark2 = spark.read.parquet("s3a://udacity-data-engineering-stan/data/i94/")

In [23]:
# load temperature data
fname = '/data2/GlobalLandTemperaturesByCity.csv'
temp_df = spark.read.csv(fname, header=True)
# create month, year, day and id columns
temp_df = temp_df.withColumn('month', month("dt"))\
        .withColumn('year', year('dt'))\
        .withColumn('day', dayofmonth('dt'))\
        .withColumn('id', monotonically_increasing_id())
# create a new view 
temp_df.createOrReplaceTempView('temperatures')
# select only data for 2012 and United States
query = """SELECT id,
                  AverageTemperature AS avg_temp,
                  AverageTemperatureUncertainty AS sd_temp,
                  LOWER(City) AS city,
                  Country AS country,
                  month,
                  day
             FROM temperatures
             WHERE Country = 'United States'
             AND year = 2012
        """
temperatures = spark.sql(query)

temperatures = temperatures.withColumn("avg_temp", temperatures["avg_temp"].cast(FloatType()))
temperatures = temperatures.withColumn("sd_temp", temperatures["sd_temp"].cast(FloatType()))
# monotonically_increasing_id() produces 64-bit values, so keep id as a long
temperatures = temperatures.withColumn("id", temperatures["id"].cast(LongType()))

#save data in parquet format on S3 
temperatures.write.parquet(output_data + 'temperatures/', 'overwrite')
# load data from S3 
df_temp=spark.read.parquet("s3a://udacity-data-engineering-stan/data/temperatures/")
# display data
temperatures.show()

+------+--------+-------+-------+-------------+-----+---+
|    id|avg_temp|sd_temp|   city|      country|month|day|
+------+--------+-------+-------+-------------+-----+---+
| 49859|   7.996|  0.204|abilene|United States|    1|  1|
| 49860|   8.434|  0.252|abilene|United States|    2|  1|
| 49861|  15.628|  0.173|abilene|United States|    3|  1|
| 49862|  21.069|  0.388|abilene|United States|    4|  1|
| 49863|  24.698|  0.323|abilene|United States|    5|  1|
| 49864|  28.217|  0.126|abilene|United States|    6|  1|
| 49865|  29.581|  0.288|abilene|United States|    7|  1|
| 49866|  29.104|  0.322|abilene|United States|    8|  1|
| 49867|  24.333|  0.313|abilene|United States|    9|  1|
| 49868|  16.702|  0.264|abilene|United States|   10|  1|
| 49869|  13.892|  0.286|abilene|United States|   11|  1|
| 49870|   7.951|  0.286|abilene|United States|   12|  1|
|140284|  -0.344|   0.41|  akron|United States|    1|  1|
|140285|   1.527|  0.319|  akron|United States|    2|  1|
|140286|  10.1
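
A note on the `id` column: Spark documents `monotonically_increasing_id()` as a 64-bit value with the partition ID in the upper 31 bits and the record number in the lower 33 bits. The plain-Python sketch below mimics that layout and shows what happens if such an ID is narrowed to 32 bits, assuming the narrowing keeps only the low 32 bits (as Java-style integer narrowing does). Both helper names are illustrative.

```python
def spark_style_id(partition_id, row_in_partition):
    """Mimic the documented layout: partition in the upper bits, row in the lower 33."""
    return (partition_id << 33) + row_in_partition

def truncate_to_int32(value):
    """Keep only the low 32 bits, interpreted as a signed integer."""
    low = value & 0xFFFFFFFF
    return low - 2**32 if low >= 2**31 else low

first = spark_style_id(0, 49859)   # a row in partition 0
second = spark_style_id(1, 49859)  # the same row position in partition 1

print(first != second)                                        # True
print(truncate_to_int32(first) == truncate_to_int32(second))  # True
```

Under that assumption, two distinct 64-bit IDs collapse to the same 32-bit value, so the IDs only stay unique across partitions when stored as a 64-bit type.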

In [36]:
# load airport data
fname = 'airport-codes_csv.csv'
# load airport-codes_csv.csv data
airport_df = spark.read.csv(fname, header=True, inferSchema=True).drop_duplicates()
# create a new view 
airport_df.createOrReplaceTempView('airports')
# select iata_code that is not null and not nan and only in US
query = """SELECT ident, 
                  type, 
                  name, 
                  elevation_ft, 
                  iso_country, 
                  iso_region, 
                  municipality, 
                  gps_code, 
                  iata_code AS airport_code, 
                  coordinates
             FROM airports 
             WHERE iata_code IS NOT NULL 
             AND NOT iata_code = 'nan'
             AND iso_country = 'US'
        """
airports = spark.sql(query)

# convert iso_region column (e.g. US-PA) to state column (PA)
def state(iso_region):
    return iso_region.strip().split('-')[-1]
udf_state = udf(state, StringType())

airports = airports.withColumn('state', udf_state('iso_region')).drop('iso_region')
airports.show()

+-----+--------------+--------------------+------------+-----------+--------------------+--------+------------+--------------------+-----+
|ident|          type|                name|elevation_ft|iso_country|        municipality|gps_code|airport_code|         coordinates|state|
+-----+--------------+--------------------+------------+-----------+--------------------+--------+------------+--------------------+-----+
|  CZK| small_airport|Cascade Locks Sta...|         151|         US|       Cascade Locks|    KCZK|         CZK|-121.878997803, 4...|   OR|
| KATL| large_airport|Hartsfield Jackso...|        1026|         US|             Atlanta|    KATL|         ATL| -84.428101, 33.6367|   GA|
| KECS| small_airport|       Mondell Field|        4174|         US|           Newcastle|    KECS|         ECS|-104.318001, 43.8...|   WY|
| PASY|medium_airport|Eareckson Air Sta...|          95|         US|              Shemya|    PASY|         SYA|174.1139984, 52.7...|   AK|
|  SAS| small_airport|  Sal

In [37]:
# save data in parquet format on S3
output_data = "s3a://udacity-data-engineering-stan/data/"
airports.write.mode('overwrite').parquet(output_data + "airports_data/")
# load data from S3 
df_airports=spark.read.parquet("s3a://udacity-data-engineering-stan/data/airports_data/")
# display data
df_airports.show()

+-----+--------------+--------------------+------------+-----------+--------------------+--------+------------+--------------------+-----+
|ident|          type|                name|elevation_ft|iso_country|        municipality|gps_code|airport_code|         coordinates|state|
+-----+--------------+--------------------+------------+-----------+--------------------+--------+------------+--------------------+-----+
|  9A8| small_airport|     Ugashik Airport|          25|         US|             Ugashik|     9A8|         UGS|-157.399002075, 5...|   AK|
| CT88|      heliport| Rentschler Heliport|          30|         US|       East Hartford|    CT88|         EHT|   -72.6253, 41.7517|   CT|
| K0K7| small_airport|Humboldt Municipa...|        1093|         US|            Humboldt|     0K7|         HUD|-94.2452011108, 4...|   IA|
| KABQ| large_airport|Albuquerque Inter...|        5355|         US|         Albuquerque|    KABQ|         ABQ|-106.609001, 35.0...|   NM|
| KCEF|medium_airport|Westo

In [38]:
# load demographics data
fname = 'us-cities-demographics.csv'
demo_df = pd.read_csv(fname, sep=';')
# create spark dataframe
demo_df = spark.createDataFrame(demo_df)
# create a new view
demo_df.createOrReplaceTempView('demographics')
query = """SELECT city, 
                 `Median Age` AS median_age, 
                 `Male Population` AS male_population,
                 `Female Population` AS female_population, 
                 `Total Population` AS population,
                 `Number of Veterans` AS num_veterans, 
                 `Foreign-born` AS foreign_born, 
                 `Average Household Size` AS avg_household_size,
                 `State Code` AS state, 
                 race, 
                 count
            FROM demographics"""
us_cities_demo = spark.sql(query)

us_cities_demo.show()

us_cities_demo = us_cities_demo.withColumn("male_population", us_cities_demo["male_population"].cast(IntegerType()))
us_cities_demo = us_cities_demo.withColumn("female_population", us_cities_demo["female_population"].cast(IntegerType()))
us_cities_demo = us_cities_demo.withColumn("num_veterans", us_cities_demo["num_veterans"].cast(IntegerType()))
us_cities_demo = us_cities_demo.withColumn("foreign_born", us_cities_demo["foreign_born"].cast(IntegerType()))

+----------------+----------+---------------+-----------------+----------+------------+------------+------------------+-----+--------------------+------+
|            city|median_age|male_population|female_population|population|num_veterans|foreign_born|avg_household_size|state|                race| count|
+----------------+----------+---------------+-----------------+----------+------------+------------+------------------+-----+--------------------+------+
|   Silver Spring|      33.8|        40601.0|          41862.0|     82463|      1562.0|     30908.0|               2.6|   MD|  Hispanic or Latino| 25924|
|          Quincy|      41.0|        44129.0|          49500.0|     93629|      4147.0|     32935.0|              2.39|   MA|               White| 58723|
|          Hoover|      38.5|        38040.0|          46799.0|     84839|      4819.0|      8229.0|              2.58|   AL|               Asian|  4759|
|Rancho Cucamonga|      34.5|        88127.0|          87105.0|    175232|  

In [40]:
# save data in parquet format on S3
us_cities_demo.write.mode('overwrite').parquet(output_data + "demographics/")
# load data from S3 
df_demo = spark.read.parquet("s3a://udacity-data-engineering-stan/data/demographics/")
# display data
df_demo.show()

+----------------+----------+---------------+-----------------+----------+------------+------------+------------------+-----+--------------------+------+
|            city|median_age|male_population|female_population|population|num_veterans|foreign_born|avg_household_size|state|                race| count|
+----------------+----------+---------------+-----------------+----------+------------+------------+------------------+-----+--------------------+------+
|   Silver Spring|      33.8|          40601|            41862|     82463|        1562|       30908|               2.6|   MD|  Hispanic or Latino| 25924|
|          Quincy|      41.0|          44129|            49500|     93629|        4147|       32935|              2.39|   MA|               White| 58723|
|          Hoover|      38.5|          38040|            46799|     84839|        4819|        8229|              2.58|   AL|               Asian|  4759|
|Rancho Cucamonga|      34.5|          88127|            87105|    175232|  

In [79]:
# let's create a function for mapping
def mapping_function(file_name, column_name1, column_name2):
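    code = []
    name = []
    # the context manager closes the file even if parsing fails
    with open(file_name, 'r') as file:
        for i in file:
            # collapse runs of whitespace before splitting on '='
            row = " ".join(i.split())
            # skip blank lines and lines without a separator
            if '=' not in row:
                continue
            # strip() tolerates both `code = name` and `code=name`
            code.append(row[:row.index('=')].strip())
            name.append(row[row.index('=')+1:].strip())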
    df = pd.DataFrame(list(zip(code,name)), columns = [f'{column_name1}', f'{column_name2}'])
    df = spark.createDataFrame(df)
    return df
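
The parsing step of the mapping function can be isolated and checked without Spark. The sketch below is only an illustration: `parse_code_line` is a hypothetical helper that is not part of the notebook, and the sample lines are made up rather than copied from the `*_code.txt` files.

```python
from typing import Optional, Tuple


def parse_code_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one `code = name` line into (code, name); return None if there is no '='."""
    # Collapse runs of whitespace, as the notebook function does.
    row = " ".join(line.split())
    code, sep, name = row.partition("=")
    if not sep:
        return None
    return code.strip(), name.strip()


# Made-up sample lines, with and without spaces around '='.
print(parse_code_line("ALC =  ALCAN, AK\n"))  # ('ALC', 'ALCAN, AK')
print(parse_code_line("1=Business"))          # ('1', 'Business')
print(parse_code_line("\n"))                  # None
```

Keeping the parser separate from the Spark conversion makes it easy to cover with a unit test before the data is written to S3.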

In [14]:
# mapping data
# load airport_code data for mapping
airport_code = open('airports_code.txt', 'r')
code = []
airport = []
for i in airport_code:
        row = " ".join(i.split())
        code.append(row[:row.index('=')-1])
        row = row.split(",")
        row = row[0].lower()
        row = row[row.index('=')+1:]
        airport.append(row)
airport_code.close()
df = pd.DataFrame(list(zip(code,airport)),columns=['code', 'airport'])
df = spark.createDataFrame(df)
df.show()

+----+--------------------+
|code|             airport|
+----+--------------------+
| ALC|               alcan|
| ANC|           anchorage|
| BAR|baker aaf - baker...|
| DAC|       daltons cache|
| PIZ|dew station pt la...|
| DTH|        dutch harbor|
| EGL|               eagle|
| FRB|           fairbanks|
| HOM|               homer|
| HYD|               hyder|
| JUN|              juneau|
| 5KE|           ketchikan|
| KET|           ketchikan|
| MOS|moses point inter...|
| NIK|             nikiski|
| NOM|                 nom|
| PKC|         poker creek|
| ORI|      port lions spb|
| SKA|             skagway|
| SNP|     st. paul island|
+----+--------------------+
only showing top 20 rows



In [80]:
# let's create a function to map airports
def mapping_function_airport(file_name, column_name1, column_name2):
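    code = []
    name = []
    # the context manager closes the file even if parsing fails
    with open(file_name, 'r') as file:
        for i in file:
            row = " ".join(i.split())
            # skip blank lines and lines without a separator
            if '=' not in row:
                continue
            code.append(row[:row.index('=')].strip())
            # keep the lower-cased text between '=' and the first comma
            row = row[row.index('=')+1:].split(",")[0]
            name.append(row.strip().lower())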
    df = pd.DataFrame(list(zip(code,name)), columns = [f'{column_name1}', f'{column_name2}'])
    df = spark.createDataFrame(df)
    return df

In [81]:
# create mapping dataframes
states = mapping_function('states_code.txt', 'code', 'state')
countries = mapping_function('countries_code.txt', 'code', 'country')
modes = mapping_function('modes_code.txt', 'code', 'mode')
visas = mapping_function('visas_code.txt', 'code', 'visa')
airports = mapping_function_airport('airports_code.txt', 'code', 'airport')
visas = visas.withColumn("code", visas["code"].cast(IntegerType()))
modes = modes.withColumn("code", modes["code"].cast(IntegerType()))
countries = countries.withColumn("code", countries["code"].cast(IntegerType()))
countries = countries.where(col("code").isNotNull())

In [82]:
# save data to S3
output_data = "s3a://udacity-data-engineering-stan/data/"
states.write.mode('overwrite').parquet(output_data + "states_code/")
countries.write.mode('overwrite').parquet(output_data + "countries_code/")
modes.write.mode('overwrite').parquet(output_data + "modes_code/")
visas.write.mode('overwrite').parquet(output_data + "visas_code/")
airports.write.mode('overwrite').parquet(output_data + "airports_code/")

In [83]:
# load data from S3
df_states = spark.read.parquet("s3a://udacity-data-engineering-stan/data/states_code/")
df_countries = spark.read.parquet("s3a://udacity-data-engineering-stan/data/countries_code/")
df_modes = spark.read.parquet("s3a://udacity-data-engineering-stan/data/modes_code/")
df_visas = spark.read.parquet("s3a://udacity-data-engineering-stan/data/visas_code/")
df_airports = spark.read.parquet("s3a://udacity-data-engineering-stan/data/airports_code/")

In [9]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import configparser
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import to_timestamp
from pyspark.sql.functions import datediff, to_date, date_format
from pyspark.sql.functions import year, month, dayofmonth
from pyspark.sql.functions import udf
from pyspark.sql.functions import col
from pyspark.sql.types import *


config = configparser.ConfigParser()
config.read('dl.cfg')

os.environ['AWS_ACCESS_KEY_ID']=config['AWS CREDS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS CREDS']['AWS_SECRET_ACCESS_KEY']


def create_spark_session():
    '''
    Create (or reuse) a Spark session that can read SAS files and write to S3.
    '''
    spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.1.0-s_2.11,org.apache.hadoop:hadoop-aws:2.7.0") \
    .config("spark.hadoop.fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.access.key", os.environ['AWS_ACCESS_KEY_ID']) \
    .config("spark.hadoop.fs.s3a.secret.key", os.environ['AWS_SECRET_ACCESS_KEY']) \
    .config("spark.driver.memory", "15g")\
    .config("spark.speculation", "false") \
    .enableHiveSupport() \
    .getOrCreate()
    # Arrow speeds up conversion between pandas and Spark dataframes
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    sc = spark.sparkContext
    # version 2 of the file output committer shortens the commit phase of parquet writes
    sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")
    print(spark)
    return spark

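def i94_data(month, year):
    '''
    Load one month of I94 immigration data from its SAS file, derive the
    stay-related columns, and save the result on S3 in parquet format.
    month: three-letter lower-case month abbreviation, e.g. 'jan'
    year: two-digit year, e.g. 16
    '''
    df_spark = spark.read.format('com.github.saurfang.sas.spark').load(f'../../data/18-83510-I94-Data-2016/i94_{month}{year}_sub.sas7bdat').drop_duplicates()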
    df_spark = df_spark.where(col("dtadfile").isNotNull())
    df_spark = df_spark.withColumn('stayed_days', ( df_spark['depdate'] - df_spark['arrdate']))
    df_spark = df_spark.withColumn("allowed_stay_till",  to_date("dtaddto","MMddyyyy"))
    df_spark = df_spark.withColumn('date_created', to_date("dtadfile", 'yyyyMMdd'))
    df_spark = df_spark.withColumn("allowed_stay_days", datediff('allowed_stay_till', 'date_created'))
    df_spark.createOrReplaceTempView('i94')
    query = """SELECT cicid, 
                date_created, 
                i94yr AS year, 
                i94mon AS month,
                i94cit AS citizenship,
                i94res AS resident,
                i94bir AS age,
                biryear AS birth_year,
                gender,
                occup AS occupation,
                allowed_stay_till,
                allowed_stay_days,
                stayed_days,
                i94visa AS visa_class,
                visatype AS visa_type,
                i94port AS port,
                i94mode AS mode,
                i94addr AS arraval_state,
                visapost AS visa_issued_by,
                entdepa AS arrival_flag,
                entdepd AS depart_flag,
                entdepu AS update_flag,
                matflag AS match_flag,
                insnum AS ins_number,
                airline,
                admnum AS admission_number,
                fltno AS flight_number
           FROM i94
       """
    i94 = spark.sql(query)
    i94 = i94.withColumn("cicid", i94["cicid"].cast(IntegerType()))
    i94 = i94.withColumn("year", i94["year"].cast(IntegerType()))
    i94 = i94.withColumn("month", i94["month"].cast(IntegerType()))
    i94 = i94.withColumn("citizenship", i94["citizenship"].cast(IntegerType()))
    i94 = i94.withColumn("resident", i94["resident"].cast(IntegerType()))
    i94 = i94.withColumn("age", i94["age"].cast(IntegerType()))
    i94 = i94.withColumn("birth_year", i94["birth_year"].cast(IntegerType()))
    i94 = i94.withColumn("stayed_days", i94["stayed_days"].cast(IntegerType()))
    i94 = i94.withColumn("visa_class", i94["visa_class"].cast(IntegerType()))
    i94 = i94.withColumn("mode", i94["mode"].cast(IntegerType()))
    # admission numbers can exceed the 32-bit integer range, so use a 64-bit type
    i94 = i94.withColumn("admission_number", i94["admission_number"].cast(LongType()))
    i94.write.mode('overwrite').partitionBy("month", "year").parquet(output_data + "immigration/")
    print('Immigration data was saved in parquet format on S3')
    

def airport_data():
    fname = 'airport-codes_csv.csv'
    airport_df = spark.read.csv(fname, header=True, inferSchema=True).drop_duplicates()
    airport_df.createOrReplaceTempView('airports')
    query = """SELECT ident, 
                      type, 
                      name, 
                      elevation_ft, 
                      iso_country, 
                      iso_region, 
                      municipality, 
                      gps_code, 
                      iata_code AS airport_code, 
                      coordinates
                 FROM airports 
                 WHERE iata_code IS NOT NULL 
                 AND NOT iata_code = 'nan'
                 AND iso_country = 'US'
            """
    airports = spark.sql(query)
    
    def state(iso_region):
        return iso_region.strip().split('-')[-1]
    udf_state = udf(lambda iso_region: state(iso_region), StringType())
    airports = airports.withColumn('state', udf_state('iso_region')).drop('iso_region')
    airports.write.mode('overwrite').parquet(output_data + "airports_data/")
    print('Airport data was saved in parquet format on S3')
    
    
def us_demo_data():
    fname = 'us-cities-demographics.csv'
    demo_df = pd.read_csv(fname, sep=';')
    demo_df = spark.createDataFrame(demo_df)
    demo_df.createOrReplaceTempView('demographics')
    query = """SELECT city, 
                     `Median Age` AS median_age, 
                     `Male Population` AS male_population,
                     `Female Population` AS female_population, 
                     `Total Population` AS population,
                     `Number of Veterans` AS num_veterans, 
                     `Foreign-born` AS foreign_born, 
                     `Average Household Size` AS avg_household_size,
                     `State Code` AS state, 
                     race, 
                     count
                FROM demographics"""
    us_cities_demo = spark.sql(query)
    us_cities_demo = us_cities_demo.withColumn("male_population", us_cities_demo["male_population"].cast(IntegerType()))
    us_cities_demo = us_cities_demo.withColumn("female_population", us_cities_demo["female_population"].cast(IntegerType()))
    us_cities_demo = us_cities_demo.withColumn("num_veterans", us_cities_demo["num_veterans"].cast(IntegerType()))
    us_cities_demo = us_cities_demo.withColumn("foreign_born", us_cities_demo["foreign_born"].cast(IntegerType()))
    us_cities_demo.write.mode('overwrite').parquet(output_data + "demographics/")
    print('US cities data was saved in parquet format on S3')
    

def temp_data():
    fname = '/data2/GlobalLandTemperaturesByCity.csv'
    temp_df = spark.read.csv(fname, header=True)
    temp_df = temp_df.withColumn('month', month("dt"))\
            .withColumn('year', year('dt'))\
            .withColumn('day', dayofmonth('dt'))\
            .withColumn('id', monotonically_increasing_id())
    temp_df.createOrReplaceTempView('temperatures')
    query = """SELECT id,
                      AverageTemperature AS avg_temp,
                      AverageTemperatureUncertainty AS sd_temp,
                      LOWER(City) AS city,
                      Country AS country,
                      month,
                      day
                 FROM temperatures
                 WHERE Country = 'United States'
                 AND year = 2012
            """
    temperatures = spark.sql(query)
    temperatures = temperatures.withColumn("avg_temp", temperatures["avg_temp"].cast(FloatType()))
    temperatures = temperatures.withColumn("sd_temp", temperatures["sd_temp"].cast(FloatType()))
    # monotonically_increasing_id() returns 64-bit values, so keep the id as a long
    temperatures = temperatures.withColumn("id", temperatures["id"].cast(LongType()))
    temperatures.write.parquet(output_data + 'temperatures/', 'overwrite')
    print('Temperature data was saved in parquet format on S3')


def mapping_function(file_name):
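    code = []
    name = []
    # the context manager closes the file even if parsing fails
    with open(file_name+'_code.txt', 'r') as file:
        for i in file:
            row = " ".join(i.split())
            # skip blank lines and lines without a separator
            if '=' not in row:
                continue
            # strip() tolerates both `code = name` and `code=name`
            code.append(row[:row.index('=')].strip())
            name.append(row[row.index('=')+1:].strip())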
    df = pd.DataFrame(list(zip(code,name)), columns = ['code', 'name'])
    df = spark.createDataFrame(df)
    if file_name == 'visas':
        df = df.withColumn("code", df["code"].cast(IntegerType()))
    if file_name == 'modes': 
        df = df.withColumn("code", df["code"].cast(IntegerType()))
    if file_name == 'countries':  
        df = df.withColumn("code", df["code"].cast(IntegerType()))
        df = df.where(col("code").isNotNull())
    df.write.mode('overwrite').parquet(output_data + f"{file_name}_code/")
    print(file_name + ' data was saved in parquet format on S3')

def mapping_function_airport(file_name):
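    code = []
    name = []
    # the context manager closes the file even if parsing fails
    with open(file_name+'_code.txt', 'r') as file:
        for i in file:
            row = " ".join(i.split())
            # skip blank lines and lines without a separator
            if '=' not in row:
                continue
            code.append(row[:row.index('=')].strip())
            # keep the lower-cased text between '=' and the first comma
            row = row[row.index('=')+1:].split(",")[0]
            name.append(row.strip().lower())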
    df = pd.DataFrame(list(zip(code,name)), columns = ['code', 'name'])
    df = spark.createDataFrame(df)
    df.write.mode('overwrite').parquet(output_data + f"{file_name}_code/")
    print(file_name + ' data was saved in parquet format on S3')
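
A note on the integer casts in the cell above: Spark's `IntegerType` is a 32-bit type, while `monotonically_increasing_id()` is documented to place the partition index in the upper 31 bits and the row number in the lower 33 bits of a 64-bit value. The sketch below mimics that layout in plain Python to show when an id stops fitting into 32 bits. `monotonic_id` and `fits_int32` are hypothetical helpers for illustration only, not Spark functions.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def monotonic_id(partition_index: int, row_in_partition: int) -> int:
    """Mimic the documented id layout: partition index shifted left by 33 bits, plus the row number."""
    return (partition_index << 33) + row_in_partition


def fits_int32(value: int) -> bool:
    """Check whether a value can be stored in a 32-bit signed integer."""
    return INT32_MIN <= value <= INT32_MAX


print(fits_int32(monotonic_id(0, 1000)))  # True: ids from the first partition are small
print(fits_int32(monotonic_id(1, 0)))     # False: the second partition starts at 2**33
```

Under this layout, any id generated outside the first partition needs a 64-bit column such as `LongType`.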


In [10]:
output_data = "s3a://udacity-data-engineering-stan/data/"
spark=create_spark_session()
i94_data('jan', 16)
airport_data()
us_demo_data()
mapping_list=['countries','states','visas','modes']
for i in mapping_list:
    mapping_function(i)
mapping_function_airport('airports')

countries data was saved in parquet format on S3
states data was saved in parquet format on S3
visas data was saved in parquet format on S3
modes data was saved in parquet format on S3


#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

Data quality checks were run by the Airflow pipeline during the process of uploading data to the Redshift database.
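
The Airflow operators themselves are not shown in this notebook, so the following is only a minimal sketch of what source/count and null-key checks could look like. It assumes a `run_query` callable that executes SQL and returns rows as tuples; that callable, the function names, and the table names are hypothetical stand-ins rather than the project's actual operator code.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical type: a function that runs a SQL string and returns rows as tuples.
QueryFn = Callable[[str], List[Tuple]]


def check_tables_not_empty(run_query: QueryFn, tables: Iterable[str]) -> List[str]:
    """Return one message per failed table; an empty list means every table has rows."""
    failures = []
    for table in tables:
        rows = run_query(f"SELECT COUNT(*) FROM {table}")
        if not rows or not rows[0]:
            failures.append(f"{table}: query returned no result")
        elif rows[0][0] < 1:
            failures.append(f"{table}: table is empty")
    return failures


def check_no_null_keys(run_query: QueryFn, table: str, key_column: str) -> List[str]:
    """Return a failure message if the key column contains NULLs, else an empty list."""
    rows = run_query(f"SELECT COUNT(*) FROM {table} WHERE {key_column} IS NULL")
    null_count = rows[0][0]
    if null_count > 0:
        return [f"{table}.{key_column}: {null_count} NULL value(s)"]
    return []


# Stand-in for a real database connection, used only to exercise the checks.
def fake_run_query(sql: str) -> List[Tuple]:
    if "IS NULL" in sql:
        return [(0,)]
    return [(0,)] if "temperatures" in sql else [(100,)]


print(check_tables_not_empty(fake_run_query, ["immigration", "temperatures"]))
print(check_no_null_keys(fake_run_query, "immigration", "cicid"))
```

Passing the query function in as an argument keeps the checks independent of the database driver, so the same functions can be unit tested with a fake and reused inside a pipeline task.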

<img src="airflow1.jpg">

<img src="airflow2.jpg">

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
- Spark, S3, Airflow, and Redshift were used for this project. Spark provides fast, in-memory, parallel data processing and was used to clean the data and write it to S3 in a columnar file format (parquet). S3 is an object store that serves as the data lake. Airflow runs the pipeline that creates the database and all tables in Redshift and loads the data into them.
* Propose how often the data should be updated and why.
- The tables built from the I94 and temperature data should be updated daily via the Airflow scheduler, since this data is generated on a daily basis and must stay fresh for analytical reports and dashboards to be accurate.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 - The project was designed with big data in mind: Spark, Airflow, and Redshift can all be scaled as data volume and user load grow.
 * The data populates a dashboard that must be updated on a daily basis by 7 am every day.
 - Airflow can schedule the ingestion into the Redshift data warehouse so that the load finishes before 7 am each day.
 * The database needed to be accessed by 100+ people.
 - Many cloud providers offer elastic scaling. For example, an Amazon EMR cluster can be resized to give the data transformations more compute, and a Redshift cluster can be scaled to give all users smooth access to the data.

Sources:  https://github.com/srkucd/data_engineering_capstone/blob/master/prototype.ipynb