# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
#importing libraries
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_add, col

ModuleNotFoundError: No module named 'pyspark'

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data 
Describe the data sets you're using. Where did they come from? What type of information is included?

In [None]:
#launching a spark session
spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

Writing immigration data:
* df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
* df_spark.write.parquet("sas_data")

In [None]:
# Reading in the data
c = pd.read_csv("/Users/tatianatikhonova/Documents/udacity/Capstone/us-cities-demographics.csv", sep = ';')
a = pd.read_csv("/Users/tatianatikhonova/Documents/udacity/Capstone/airport-codes.csv", sep = ',')
immigration = spark.read.parquet("sas_data") 

#immigration sample
i = pd.read_csv("/Users/tatianatikhonova/Documents/udacity/Capstone/immigration_data_sample.csv", sep = ',')


---

### Reading in the data dictionary for Immigration

In [None]:

with open('./I94_SAS_Labels_Descriptions.SAS') as f:
    f_content = f.read()
    f_content = f_content.replace('\t', '')

def code_mapper(file, idx):
    # slice the label block that starts at idx and ends at the next ';'
    block = file[file.index(idx):]
    block = block[:block.index(';')].split('\n')
    block = [line.replace("'", "") for line in block]
    # keep only "code = label" lines and strip whitespace around both parts
    pairs = [line.split('=') for line in block[1:]]
    return dict([p[0].strip(), p[1].strip()] for p in pairs if len(p) == 2)

i94cit_res = code_mapper(f_content, "i94cntyl")
i94port = code_mapper(f_content, "i94prtl")
i94mode = code_mapper(f_content, "i94model")
i94addr = code_mapper(f_content, "i94addrl")
i94visa = {'1':'Business',
'2': 'Pleasure',
'3' : 'Student'}
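
To make the parsing step easier to follow, here is a minimal self-contained sketch of the same idea run on an inline sample. The sample text and the names `SAMPLE_LABELS` and `parse_label_block` are illustrative assumptions; they are not the contents of the real `I94_SAS_Labels_Descriptions.SAS` file.

```python
# Hypothetical sample imitating the layout of a SAS label block (not the real file).
SAMPLE_LABELS = """
value i94model
    1 = 'Air'
    2 = 'Sea'
    3 = 'Land'
    9 = 'Not reported' ;
"""

def parse_label_block(text, marker):
    """Return {code: label} for the block that starts at `marker` and ends at ';'."""
    text = text.replace('\t', '')
    block = text[text.index(marker):]
    block = block[:block.index(';')]
    mapping = {}
    # skip the first line (the marker itself) and keep only "code = label" lines
    for line in block.split('\n')[1:]:
        parts = line.replace("'", "").split('=')
        if len(parts) == 2:
            mapping[parts[0].strip()] = parts[1].strip()
    return mapping

mode_labels = parse_label_block(SAMPLE_LABELS, "i94model")
print(mode_labels)
```

Lines that do not split into exactly two parts on `=` are skipped, which is why the header line and blank lines never end up in the dictionary.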

In [None]:
#turning list of ports into a dataframe with distinct columns
i94port_df = pd.DataFrame.from_dict(i94port, orient='index')
i94port_df.reset_index(level=0, inplace=True)
i94port_df.columns = ['Code','City_State']
i94port_df

In [None]:
#splitting City_State column in two
i94port_df[['City', 'State','Else']] = i94port_df['City_State'].astype("string").str.split(', ', expand=True)
i94port_df.drop(['City_State', 'Else'], axis=1, inplace=True)
i94port_df

In [None]:
#Counting % of missing ports out of total. Turns out to be 9%
i94port_df.query('City.str.contains("No PORT")', engine='python').count() / i94port_df.shape[0]

In [None]:
missing_ports_list = list(i94port_df.query('City.str.contains("No PORT")', engine='python').Code)
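
The split-and-filter steps above can be tried on a small frame first. This is only a sketch: the three port rows are made-up stand-ins for the real labels, and it splits on the first comma only (`n=1`), which is an alternative to the three-column split used in the notebook.

```python
import pandas as pd

# Hypothetical port labels in the "City, State" shape used by the data dictionary.
ports = pd.DataFrame({
    "Code": ["ALC", "XXX", "WAS"],
    "City_State": ["ALCAN, AK", "No PORT Code (XXX)", "WASHINGTON DC"],
})

# n=1 splits on the first comma only, so at most two columns come back.
parts = ports["City_State"].str.split(",", n=1, expand=True)
ports["City"] = parts[0].str.strip()
ports["State"] = parts[1].str.strip()  # stays missing when the label has no comma

# Codes whose label marks a missing port.
missing_codes = ports.loc[ports["City"].str.contains("No PORT"), "Code"].tolist()
print(missing_codes)
```

Labels without a comma keep a missing `State`, so they can be reviewed separately instead of breaking the column assignment.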

---

### Step 2: Explore and Assess the Data

What I will be looking out for:

* **Completeness** (do we have all the records that we need? any missing / NaN?)
* **Validity** (records that don’t conform to a defined schema, e.g. a negative height or a duplicate key identifier)
* **Accuracy** (adheres to the defined schema, but is incorrect; e.g. overestimated values or out-of-date information)
* **Consistency** (data valid and accurate, but fields are represented in an inconsistent manner, e.g. state as NY and New York)
* **Tidiness** (structure of tidy data: variable = column, observation = row, observational unit = table)
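
As a rough illustration of how some of these dimensions translate into checks, the sketch below runs them on a toy frame. The column names and values are made up to trigger each check and do not come from the project data.

```python
import pandas as pd

# Hypothetical toy records; values are made up to trigger each check.
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "state": ["NY", "New York", "CA", None],
    "age": [34, -5, 51, 28],
})

completeness = df.isnull().sum()                 # missing values per column
duplicate_ids = df["id"].duplicated().sum()      # validity: repeated key
negative_ages = (df["age"] < 0).sum()            # validity: impossible value
long_states = df["state"].str.len().gt(2).sum()  # consistency: not a two-letter code

print(completeness["state"], duplicate_ids, negative_ages, long_states)
```

Accuracy is the one dimension that usually cannot be checked from the table alone, since it needs an external source of truth to compare against.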

In [None]:
c.head(5)

In [None]:
a.head(5)

In [None]:
i.head(5)

In [None]:
i94port

In [None]:
immigration.head(1)

In [None]:
#understand number of rows and columns
print(f'count of rows and columns for cities: {c.shape}')
print(f'count of rows and columns for airports: {a.shape}')
print(f'count of rows and columns for immigration: {immigration.count(), len(immigration.columns)}')

In [None]:
#understand columns and data types
c.info()

In [None]:
a.info()

In [None]:
i.info()

In [None]:
c.columns.to_series().groupby(c.dtypes).groups

In [None]:
c.describe()

#### Missing values in Cities

In [None]:
def find_missing_data(df):
    '''
    INPUT:
        df - (dataframe), dataframe to check for missing values in its columns
    OUTPUT:
        df_null: (dataframe), with count & percentage of missing values in input dataframe columns
    '''
    null_data = df.isnull().sum()[df.isnull().sum() > 0]
    
    data_dict = {'count': null_data.values, 
                 'pct': np.round(null_data.values *100/df.shape[0],2)}
    
    df_null = pd.DataFrame(data=data_dict, index=null_data.index)
    df_null.sort_values(by='count', ascending=False, inplace=True)
    return df_null



In [None]:
c.isnull().sum().sum() #count of all missing values in Cities

In [None]:
find_missing_data(c)

In [None]:
def nans(df): 
    return df[df.isnull().any(axis=1)]

nans(c)

In [None]:
#missing values comprise less than 1% of Cities data, so it's safe to drop them
c2 = c.dropna(axis=0)
c2.isnull().sum().sum() 

#### Missing values in Airport

In [None]:
a.isnull().sum().sum() #count of all missing values in Airports

In [None]:
find_missing_data(a)

In [None]:
#dropping columns that are missing over 25% of data
cols = a.columns[a.isnull().sum()/len(a) > .25]
a2 = a.drop(cols,axis=1)
a2.head(2)

In [None]:
#dropping the remaining rows with null values
a2.dropna(axis=0, inplace=True)
a2.isnull().sum().sum() #count of all missing values in Airports

#### Missing values in Immigration

In [None]:
type(immigration)

In [None]:
i2 = immigration[~immigration["i94port"].isin(missing_ports_list)] #filtering out missing ports derived from the SAS file

{col:i2.filter(i2[col].isNull()).count() / i2.count() for col in i2.columns} #checking % of null values in each col

In [None]:
i2 = i2.drop("insnum", "entdepu", "occup", "visapost") #dropping extra columns
i2 = i2.dropna(how='any') #dropping null values
{col:i2.filter(i2[col].isNull()).count() / i2.count() for col in i2.columns}

In [None]:
i2 = i2.filter(i2.i94addr != 'other')
i3 = i2.select(col("cicid").alias("id"), 
                                       col("arrdate").alias("arrival_date"),
                                       col("i94port").alias("port_code"),
                                       col("i94addr").alias("state_code"),
                                       col("i94bir").alias("age"),
                                       col("gender").alias("gender"),
                                       col("i94visa").alias("visa_type"),
                                       "count").drop_duplicates()

i3.head()

In [None]:
type(i3.arrival_date)

In [None]:
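#arrdate is a SAS numeric date (days since 1960-01-01); a plain cast to timestamp would read it as seconds since 1970
from pyspark.sql.functions import expr
i3 = i3.withColumn('new_arr_date', expr("date_add(to_date('1960-01-01'), cast(arrival_date as int))"))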
i3 = i3.drop('arrival_date')
i3.head()

In [None]:
i3.printSchema()
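
One thing worth checking at this point is how the arrival date is encoded. Assuming `arrdate` is a SAS numeric date (a count of days since 1960-01-01, which is how SAS stores dates), the conversion can be sketched in plain Python. The helper name `sas_days_to_date` is illustrative.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS counts dates as days since this day

def sas_days_to_date(days):
    """Convert a SAS numeric date (days since 1960-01-01) to a datetime.date."""
    return SAS_EPOCH + timedelta(days=int(days))

print(sas_days_to_date(20545.0))  # 2016-04-01
```

A value such as 20545 therefore falls in April 2016, which matches the `i94_apr16_sub` file name.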

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model


* Immigration = FACT
* Airports = DIM
* Cities = DIM
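
A fact table with dimension tables around it is queried by joining on shared keys. The sketch below shows that pattern with pandas on made-up rows. The column names follow the data dictionary in section 4.3, while the values (including the population figures) are placeholders.

```python
import pandas as pd

# Hypothetical rows; column names follow the data dictionary in section 4.3.
fact_immigrations = pd.DataFrame({
    "cicid": [1, 2, 3],
    "port_code": ["NYC", "LOS", "NYC"],
    "age": [30, 45, 27],
})
dim_city_demographics = pd.DataFrame({
    "port_code": ["NYC", "LOS"],
    "city": ["New York", "Los Angeles"],
    "total_population": [8_500_000, 4_000_000],  # placeholder numbers
})

# Each fact row picks up its city attributes through the shared port_code key.
joined = fact_immigrations.merge(
    dim_city_demographics, on="port_code", how="left", validate="many_to_one"
)
arrivals_per_city = joined.groupby("city")["cicid"].count()
print(arrivals_per_city)
```

`validate="many_to_one"` makes the merge fail if `port_code` is not unique in the dimension table, which is a cheap guard against accidentally duplicating fact rows.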


#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

In [None]:
!pyspark --version

In [None]:
# writing the immigration fact table to parquet files partitioned by port_code
# (batch write; trigger/outputMode/start belong to the streaming API and do not apply here)
i3.write.mode("overwrite").partitionBy("port_code").parquet("/results/immigration.parquet")
#i3.createOrReplaceTempView("immigration_view")


### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
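
A possible shape for these checks, sketched with pandas on a made-up sample: one count check and one unique-key check. The function names and the sample frame are assumptions for illustration, not the project's final checks.

```python
import pandas as pd

def check_not_empty(df, name):
    """Count check: fail if the table has no rows."""
    if len(df) == 0:
        raise ValueError(f"{name} is empty")
    return len(df)

def check_unique_key(df, key, name):
    """Integrity check: fail if the key column has nulls or duplicates."""
    if df[key].isnull().any() or df[key].duplicated().any():
        raise ValueError(f"{name}.{key} is not a valid unique key")
    return True

# Hypothetical sample standing in for the immigration fact table.
sample = pd.DataFrame({"id": [1, 2, 3], "port_code": ["NYC", "LOS", "NYC"]})
row_count = check_not_empty(sample, "fact_immigrations")
key_ok = check_unique_key(sample, "id", "fact_immigrations")
print(row_count, key_ok)
```

Raising an exception rather than printing a warning means a failed check stops the pipeline run instead of passing unnoticed.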

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

Immigration columns:

* cicid: primary key, id from SAS file
* i94yr: entry year, 4 digit year
* i94mon: entry month, numeric month
* i94cit: i94 citizenship country code as per SAS Labels Descriptions file
* i94res: i94 residence country code as per SAS Labels Descriptions file
* i94port: i94port code as per SAS Labels Descriptions file
* arrdate: date of arrival in U.S.
* i94mode: code for travel mode of arrival as per SAS Labels Descriptions file
* i94addr: address in the U.S. (two letter U.S. state code)
* depdate: departure date from U.S.
* i94bir: age of the immigrant
* i94visa: visa category code as per SAS Labels Descriptions file
* dtadfile: Character Date Field - Date added to I-94 Files - CIC does not use
* visapost: Department of State post where the visa was issued
* occup: occupation of immigrant
* entdepa: Arrival Flag - admitted or paroled into the U.S. - CIC does not use
* entdepd: Departure Flag - Departed, lost I-94 or is deceased - CIC does not use
* entdepu: Update Flag - Either apprehended, overstayed, adjusted to perm residence - CIC does not use
* matflag: Match flag - Match of arrival and departure records
* biryear: birth year of immigrant
* count: used for summary stats
* dtaddto: Character Date Field - Date to which admitted to U.S. (allowed to stay until) - CIC does not use
* gender: gender of immigrant
* insnum: INS number
* airline: airline code used to arrive in U.S.
* admnum: admission number
* fltno: flight number
* visatype: visa type

fact_immigrations:

* cicid: id from SAS file
* entry_year: 4 digit year
* entry_month: numeric month
* origin_country_code: i94 country code as per SAS Labels Descriptions file
* port_code: i94port code as per SAS Labels Descriptions file
* arrival_date: date of arrival in U.S.
* travel_mode_code: code for travel mode of arrival as per SAS Labels Descriptions file
* us_state_code: two letter U.S. state code
* departure_date: departure date from U.S.
* age: age of the immigrant
* visa_category_code: visa category code as per SAS Labels Descriptions file
* occupation: occupation of immigrant
* gender: gender of immigrant
* birth_year: birth year of immigrant
* entry_date: Date to which admitted to U.S. (allowed to stay until)
* airline: airline code used to arrive in U.S.
* admission_number: admission number
* flight_number: flight number
* visa_type: visa type

dim_city_demographics:

* port_code: i94port code
* city: U.S. city name
* state_code: two letter U.S. state code
* male_population: total male population
* female_population: total female population
* total_population: total population
* number_of_veterans: number of veterans
* num_foreign_born: number of foreign born

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.