# Udacity Data Engineer Nanodegree - Capstone Project

## Project Summary

This project builds upon 4 datasetrs included in the Udacity project workspace. The workflow for this project is described below:

1. Downloading and uploading the data to S3 bucket
2. Exploring all the data to understand them, clean them, and possibly save a new copy
3. Define the Data Model  based on the exploration 
4. Design ETL as such and then run it to model the data

In [180]:
import pandas as pd
import pyspark
import configparser
from datetime import datetime
import os
import glob
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.functions import year, month, dayofmonth, dayofweek, hour, weekofyear, date_format, to_date
from pyspark.sql.functions import lit, expr
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import StructType, StructField, DoubleType, StringType, IntegerType, TimestampType

In [181]:
config = configparser.ConfigParser()
config.read('config.cfg')

os.environ['AWS_ACCESS_KEY_ID']=config.get('AWS', 'AWS_ACCESS_KEY_ID')
os.environ['AWS_SECRET_ACCESS_KEY']=config.get('AWS', 'AWS_SECRET_ACCESS_KEY')
AWS_ACCESS_KEY_ID = config.get('AWS', 'AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = config.get('AWS', 'AWS_SECRET_ACCESS_KEY')

## Step 1: Downloading and uploading data to S3 bucket

### Datasets
For this project, I will be mainly working wiht the following 4 datasets hosted in the Udacity Workspace.

* **I94 Immigration Data:** This data comes from the US National Tourism and Trade Office. A data dictionary is included in the workspace. [This](https://travel.trade.gov/research/reports/i94/historical/2016.html) is where the data comes from. There's also a sample file so you can take a look at the data in csv format before reading it all in. 

* **World Temperature Data:** This dataset came from Kaggle. You can read more about it [here](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).

* **U.S. City Demographic Data:** This data comes from OpenSoft. You can read more about it [here]().
* **Airport Code Table:** This is a simple table of airport codes and corresponding cities. It comes from [here](https://datahub.io/core/airport-codes#data).

### Downloading and uploading to S3

#### Downloading data

* I94 Immigration Data: 
The data is in the  folder with the following path: `../../data/18-83510-I94-Data-2016/`. To download this, I ran the following command in the command prompt.
`!zip -r data.zip ../../data/18-83510-I94-Data-2016` This created a zip folder named **data.zip** in the working directory. Now, I just downloaded this to my local computer by right clicking on it and clicking on **Download**. 

* World Temperature Data: 
The data is in the folder with the following path: `../../data2/`. To download this, I ran the following command in the command prompt.
`!zip -r data2.zip ../../data2`
This created a zip folder named **data2.zip** in the working directory. Now, like before, I downloaded this to my local computer by right clicking on this zipped folder and clicking on **Download**. 

* U.S. City Demographic Data: This data is in the working directory as a single csv file named us-cities-demographics.csv and so just right click it and download.
* Airport Code Table: This file named as airport-codes_csv.csv can also be directly downloaded to the local computer.

#### Uploading to S3
Now, upload all of these data to a S3 bucket so that this can be later accessed through Spark and processed as such.

## Step 2: Exploring all these data to understand them

### Step 2.a. Explore I94_SAS_Labels_Descriptions.SAS

#### First get the country code and name and then write this as CSV file.

* Country code

In [160]:
with open("I94_SAS_Labels_Descriptions.SAS") as file:
    auxiliary_data = file.readlines()

As the data on this starts from 10th row and ends on 298th row, I will now get all these data by looping through these lines.

In [27]:
country = {}
for countries in auxiliary_data[9:298]:
    line = countries.split("=")
    code = line[0].strip()
    country_name = line[1].strip().strip("'")
    country[code] = country_name

In [32]:
country_pd = pd.DataFrame(list(country.items()), columns = ['code', 'country'])

In [33]:
country_pd.head(5)

Unnamed: 0,code,country
0,582,"MEXICO Air Sea, and Not Reported (I-94, no lan..."
1,236,AFGHANISTAN
2,101,ALBANIA
3,316,ALGERIA
4,102,ANDORRA


In [34]:
# Write the data to the directory
country_pd.to_csv("countries.csv", index = False)

#### Secondly, get the city code and name and then write this as CSV file.

* City code

In [4]:
city = {}
for cities in auxiliary_data[302:962]:
    line = cities.split("=")
    code = line[0].strip().strip("'")
    city_name = line[1].strip().strip("'")
    city[code] = city_name

In [5]:
city_pd = pd.DataFrame(list(city.items()), columns = ['code', 'city'])

In [7]:
city_pd.head()

Unnamed: 0,code,city
0,ALC,"ALCAN, AK"
1,ANC,"ANCHORAGE, AK"
2,BAR,"BAKER AAF - BAKER ISLAND, AK"
3,DAC,"DALTONS CACHE, AK"
4,PIZ,"DEW STATION PT LAY DEW, AK"


In [8]:
# Write the data to the directory
city_pd.to_csv("cities.csv", index = False)

#### Thirdly, get the mode code and name and then write this as CSV file.

* Mode code

In [20]:
mode = {}
for modes in auxiliary_data[972:976]:
    line = modes.split("=")
    code = line[0].strip()
    mode_name = line[1].strip().strip("'").strip(";").strip("'")
    mode[code] = mode_name

In [21]:
mode_pd = pd.DataFrame(list(mode.items()), columns = ['code', 'mode'])

In [22]:
mode_pd.head()

Unnamed: 0,code,mode
0,1,Air
1,2,Sea
2,3,Land
3,9,Not reported'


In [23]:
# Write the data to the directory
mode_pd.to_csv("mode.csv", index = False)

#### Fourthly, get the state code and name and then write this as CSV file.
* State code

In [24]:
state = {}
for states in auxiliary_data[981:1036]:
    line = states.split("=")
    code = line[0].strip().strip("'")
    state_name = line[1].strip().strip("'")
    state[code] = state_name

In [25]:
state_pd = pd.DataFrame(list(state.items()), columns = ['code', 'state'])

In [26]:
state_pd.head()

Unnamed: 0,code,state
0,AL,ALABAMA
1,AK,ALASKA
2,AZ,ARIZONA
3,AR,ARKANSAS
4,CA,CALIFORNIA


In [27]:
# Write the data to the directory
state_pd.to_csv("state.csv", index = False)

#### Fifthly, get the visa code and name and then write this as CSV file.
* Visa code

In [28]:
visa = {}
for visas in auxiliary_data[1046:1049]:
    line = visas.split("=")
    code = line[0].strip()
    visa_name = line[1].strip()
    visa[code] = visa_name

In [29]:
visa_pd = pd.DataFrame(list(visa.items()), columns = ['code', 'visa'])

In [30]:
# Write the data to the directory
visa_pd.to_csv("visa.csv", index = False)

In [31]:
visa_pd.head()

Unnamed: 0,code,visa
0,1,Business
1,2,Pleasure
2,3,Student


### Step 2.b. Explore I94 Immigration Data

In [3]:
immigration = pd.read_sas("../../data/18-83510-I94-Data-2016/i94_jun16_sub.sas7bdat", "sas7bdat", encoding="ISO-8859-1")

In [4]:
pd.options.display.max_columns = None
immigration.head(15)

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,validres,delete_days,delete_mexl,delete_dup,delete_visa,delete_recdup,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,4.0,2016.0,6.0,135.0,135.0,XXX,20612.0,,,,59.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,Z,,U,,1957.0,10032016,,,,14938460000.0,,WT
1,5.0,2016.0,6.0,135.0,135.0,XXX,20612.0,,,,50.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,Z,,U,,1966.0,10032016,,,,17460060000.0,,WT
2,6.0,2016.0,6.0,213.0,213.0,XXX,20609.0,,,,27.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,T,,U,,1989.0,D/S,,,,1679298000.0,,F1
3,7.0,2016.0,6.0,213.0,213.0,XXX,20611.0,,,,23.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,T,,U,,1993.0,D/S,,,,1140963000.0,,F1
4,16.0,2016.0,6.0,245.0,245.0,XXX,20632.0,,,,24.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,T,,U,,1992.0,D/S,,,,1934535000.0,,F1
5,19.0,2016.0,6.0,254.0,276.0,XXX,20612.0,,,,21.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,T,,U,,1995.0,D/S,,,,1148758000.0,,F1
6,27.0,2016.0,6.0,343.0,343.0,XXX,20611.0,,,,32.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,T,,U,,1984.0,D/S,,,,1152545000.0,,F1
7,33.0,2016.0,6.0,582.0,582.0,XXX,20612.0,,,,18.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,T,,U,,1998.0,D/S,,,,1150900000.0,,F2
8,38.0,2016.0,6.0,687.0,687.0,XXX,20623.0,,,,19.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,U,,U,,1997.0,06172018,,,,35753780000.0,,E2
9,39.0,2016.0,6.0,694.0,694.0,XXX,20611.0,,,,20.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,,,,U,,U,,1996.0,04162017,,,,1142101000.0,,M1


In [5]:
# Have a look at all the columns
immigration.columns

Index(['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate',
       'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa', 'count',
       'validres', 'delete_days', 'delete_mexl', 'delete_dup', 'delete_visa',
       'delete_recdup', 'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd',
       'entdepu', 'matflag', 'biryear', 'dtaddto', 'gender', 'insnum',
       'airline', 'admnum', 'fltno', 'visatype'],
      dtype='object')

In [6]:
# Have a quick look at different statistics of the table
immigration.describe()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,arrdate,i94mode,depdate,i94bir,i94visa,count,validres,delete_days,delete_mexl,delete_dup,delete_visa,delete_recdup,biryear,admnum
count,3574989.0,3574989.0,3574989.0,3574469.0,3574989.0,3574989.0,3513802.0,3287918.0,3574350.0,3574989.0,3574989.0,3574989.0,3574989.0,3574989.0,3574989.0,3574989.0,3574989.0,3574350.0,3574989.0
mean,3258526.0,2016.0,6.0,316.2554,314.8681,20620.85,1.07024,20635.85,40.30389,1.88548,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1975.696,68911720000.0
std,1888572.0,0.0,0.0,209.0522,207.1219,8.78243,0.4343662,19.79676,18.10871,0.3806378,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.10871,28634610000.0
min,4.0,2016.0,6.0,101.0,101.0,20606.0,1.0,18804.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1900.0,0.0
25%,1625393.0,2016.0,6.0,135.0,135.0,20613.0,1.0,20623.0,27.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1962.0,61634650000.0
50%,3275808.0,2016.0,6.0,245.0,245.0,20621.0,1.0,20632.0,40.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1976.0,62593420000.0
75%,4949151.0,2016.0,6.0,525.0,516.0,20629.0,1.0,20643.0,54.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1989.0,98573180000.0
max,6432838.0,2016.0,6.0,999.0,760.0,20635.0,9.0,20745.0,116.0,3.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2016.0,100000000000.0


In [192]:
# Read using spark
spark = SparkSession.builder\
        .config("spark.jars.repositories", "https://repos.spark-packages.org/")\
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0,saurfang:spark-sas7bdat:2.0.0-s_2.11")\
        .enableHiveSupport().getOrCreate()

In [193]:
# Then setup the sparkContext object
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)

In [184]:
#file_location = os.path.join("../../data/18-83510-I94-Data-2016/*.sas7bdat")
#import glob
# absolute path to search all text files inside a specific folder
path = r'../../data/18-83510-I94-Data-2016/*.sas7bdat'
files = glob.glob(path, recursive=True)
file_location = os.path.join("../../data/18-83510-I94-Data-2016/i94_jun16_sub.sas7bdat")
files

['../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat',
 '../../data/18-83510-I94-Data-2016/i94_sep16_sub.sas7bdat',
 '../../data/18-83510-I94-Data-2016/i94_nov16_sub.sas7bdat',
 '../../data/18-83510-I94-Data-2016/i94_mar16_sub.sas7bdat',
 '../../data/18-83510-I94-Data-2016/i94_jun16_sub.sas7bdat',
 '../../data/18-83510-I94-Data-2016/i94_aug16_sub.sas7bdat',
 '../../data/18-83510-I94-Data-2016/i94_may16_sub.sas7bdat',
 '../../data/18-83510-I94-Data-2016/i94_jan16_sub.sas7bdat',
 '../../data/18-83510-I94-Data-2016/i94_oct16_sub.sas7bdat',
 '../../data/18-83510-I94-Data-2016/i94_jul16_sub.sas7bdat',
 '../../data/18-83510-I94-Data-2016/i94_feb16_sub.sas7bdat',
 '../../data/18-83510-I94-Data-2016/i94_dec16_sub.sas7bdat']

In [185]:
months = []
years = []
for file in files:
    parts = file.split("/")
    year_month_str = parts[-1].split("_")[1]
    month = year_month_str[:3]
    year = year_month_str[3:5]
    months.append(month)
    years.append(year)
print(months)
print(years)

['apr', 'sep', 'nov', 'mar', 'jun', 'aug', 'may', 'jan', 'oct', 'jul', 'feb', 'dec']
['16', '16', '16', '16', '16', '16', '16', '16', '16', '16', '16', '16']


In [194]:
# Get all the column names
immigration_spark = spark.read.format('com.github.saurfang.sas.spark').load(files[0])
colnames = []
for col in immigration_spark.dtypes:
    colnames.append(col[0])

In [187]:
len(colnames)

28

In [206]:
# immigration_spark = spark.read.format('com.github.saurfang.sas.spark').load(file_location)
#immigration_spark_2 = spark.read.format('com.github.saurfang.sas.spark').load(files[4])

new_colnames = []
#for col in immigration_spark_2.dtypes:
    #new_colnames.append(col[0])

In [207]:
len(new_colnames)

0

In [208]:
remove_columns = []
result = set(new_colnames).difference(colnames)
remove_columns = list(result)
remove_columns

[]

In [110]:
#remove_columns[0#

In [81]:
#immigration_spark = spark.read.format('com.github.saurfang.sas.spark').load(files[0])
#for file in files[1:]:
    #immigration_spark = immigration_spark.union(spark.read.format('com.github.saurfang.sas.spark').load(file)) # This doesn';t work as different files have different number of columns

In [166]:
type(immigration_spark)

pyspark.sql.dataframe.DataFrame

In [204]:
fact_immigration = immigration_spark.distinct()\
                         .withColumn("immigration_id", monotonically_increasing_id())

In [168]:
# Show the 5 top rows
fact_immigration.show(n=5)

+------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+---------------+-----+--------+--------------+
| cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|         admnum|fltno|visatype|immigration_id|
+------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+---------------+-----+--------+--------------+
| 474.0|2016.0|   4.0| 103.0| 103.0|    NEW|20545.0|    2.0|   null|20547.0|  25.0|    2.0|  1.0|20160401|    null| null|      G|      O|   null|      M| 1991.0|06292016|     F|  null|    VES|5.5410441233E10|91285|      WT|             0|
|1508.0|2016.0|   4.0| 104.0| 104.0|    NYC|

### This is where I am getting error -- I need help here. The code below worked before but not anymore

In [210]:
fact_immigration = fact_immigration.withColumn("year", col("i94yr").cast("integer"))

TypeError: 'tuple' object is not callable

In [209]:
fact_immigration = fact_immigration.withColumn("month", col("i94mon").cast("integer"))

TypeError: 'tuple' object is not callable

### The below code is also giving error. The code below worked before but not anymore

In [202]:
immigration_reshaped = fact_immigration\
.withColumn("year", col("i94yr").cast("integer"))\
.withColumn("month", col("i94mon").cast("integer"))\
.withColumn("city_code", col("i94cit").cast("integer"))\
.withColumn("origin_country_code", col("i94res").cast("integer"))\
.withColumnRenamed("i94port", "port_code")
.withColumn("data_base_sas", to_date(lit("01/01/1960"), "MM/dd/yyyy"))\
.withColumn("arrival_date", expr("date_add(data_base_sas, arrdate)"))\
.withColumn("mode_code", col("i94mode").cast("integer"))\
.withColumnRenamed("i94addr", "state_code")\
.withColumn("departure_date", expr("date_add(data_base_sas, depdate)"))\
.withColumn("age", col("i94bir").cast("integer")).withColumn("visa_code", col("i94visa").cast("integer")).withColumn("birth_year", col("biryear").cast("integer")).drop("i94yr", "i94mon", "i94cit", "i94res", "data_base_sas", "arrdate", "i94mode", "depdate", "i94bir", "i94visa", "biryear")

SyntaxError: invalid syntax (<ipython-input-202-bcbefb07df35>, line 2)

In [83]:
fact_immigration_reshaped.show(n=5)

+-----+---------+----------+-----+--------+-----------+-----------+----------+-----------+-------------+--------+--------+-----+-------+-------+-------+-------+--------+------+------+-------+---------------+-----+--------+--------------+----+-----+---------+-------------------+------------+---------+--------------+---+---------+----------+
|cicid|port_code|state_code|count|validres|delete_days|delete_mexl|delete_dup|delete_visa|delete_recdup|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag| dtaddto|gender|insnum|airline|         admnum|fltno|visatype|immigration_id|year|month|city_code|origin_country_code|arrival_date|mode_code|departure_date|age|visa_code|birth_year|
+-----+---------+----------+-----+--------+-----------+-----------+----------+-----------+-------------+--------+--------+-----+-------+-------+-------+-------+--------+------+------+-------+---------------+-----+--------+--------------+----+-----+---------+-------------------+------------+---------+--------------+

In [64]:
#output_data = "s3a://aws-logs-608251643021-us-west-2/elasticmapreduce/"
#fact_immigration_reshaped.write.mode("overwrite").parquet("{}final_project/fact_immigration.parquet".format(output_data))

In [211]:
fact_immigration = fact_immigration.withColumn("year", col("i94yr").cast("integer"))

TypeError: 'tuple' object is not callable

In [212]:
fact_immigration = fact_immigration.withColumn("month", col("i94mon").cast("integer"))
# fact_immigration = fact_immigration.withColumnRenamed("i94port", "port_code") \

TypeError: 'tuple' object is not callable

In [213]:
new_colnames = []
for col in fact_immigration.dtypes:
    new_colnames.append(col[0])

In [214]:
new_colnames

['cicid',
 'i94yr',
 'i94mon',
 'i94cit',
 'i94res',
 'i94port',
 'arrdate',
 'i94mode',
 'i94addr',
 'depdate',
 'i94bir',
 'i94visa',
 'count',
 'dtadfile',
 'visapost',
 'occup',
 'entdepa',
 'entdepd',
 'entdepu',
 'matflag',
 'biryear',
 'dtaddto',
 'gender',
 'insnum',
 'airline',
 'admnum',
 'fltno',
 'visatype',
 'immigration_id']

In [215]:
immigration_reshaped = fact_immigration.withColumn("year", col("i94yr").cast("integer")).withColumn("month", col("i94mon").cast("integer")).withColumn("city_code", col("i94cit").cast("integer")).withColumn("origin_country_code", col("i94res").cast("integer")).withColumnRenamed("i94port", "port_code").withColumn("data_base_sas", to_date(lit("01/01/1960"), "MM/dd/yyyy"))\
.withColumn("arrival_date", expr("date_add(data_base_sas, arrdate)")).withColumn("mode_code", col("i94mode").cast("integer")).withColumnRenamed("i94addr", "state_code").withColumn("departure_date", expr("date_add(data_base_sas, depdate)")).withColumn("age", col("i94bir").cast("integer")).withColumn("visa_code", col("i94visa").cast("integer")).withColumn("birth_year", col("biryear").cast("integer")).drop("i94yr", "i94mon", "i94cit", "i94res", "data_base_sas", "arrdate", "i94mode", "depdate", "i94bir", "i94visa", "biryear")

TypeError: 'tuple' object is not callable