# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows the follow steps:
* Step 0: Preparation and import data from s3
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [2]:
#Before we continue, we need to install related python package.
import sys

!{sys.executable} -m pip install boto3
!{sys.executable} -m pip install s3fs
!{sys.executable} -m pip install pyspark
!{sys.executable} -m pip install cqlsh

You should consider upgrading via the 'pip install --upgrade pip' command.[0m
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [5]:
# Do all imports and installs here
import configparser
import pandas as pd
import os
import boto3
import uuid
from time import sleep

In [6]:
config = configparser.ConfigParser()
config.read('iam.cfg')
os.environ['AWS_ACCESS_KEY_ID']=config['AWS_CREDS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS_CREDS']['AWS_SECRET_ACCESS_KEY']

client=boto3.client('s3')


### Scope the Project and Gather Data

#### Project description:

This project will be separate to multiple parts, and all four dataset will be used. 

Before we talk about the details, we need to know the characteristics of relational DB and non-relational DB.

For relational DB, its characteristics is low redundancy and high completeness, which means it is very suitable for small or medium size data, and the database does not change so much. In our case, we should store temperature, airport code and US cities demographic data into a relational database that meets 3NF because it does not always change so much and the volume of data is not that large.

For non-relational DB, its characteristics is higher elasticity, faster read & write speed and evoving data volume. In our case, we should save I94 data into non-relational DB. Because this piece of data need to make ETL process almost every minutes in real world background, and it need dynamic write and read for real-time data monitoring.


The final solution will work as a a database management system. When user input the time or time period and the column they interested in (e.g., visa type), the system will return related data as a dataframe. For example, a user needs to know where are the busiest airport for investment visa holder(E-1 visa) in 2016 and its basic information such as temperature, and the status of the city such as population(age, majority race, etc.), or when is the peak-time for international student come to the United States and where are they come from.

* Data will be imported from Amazon S3
* Relational DB will be implement on AWS Redshift
* Non-Relationalship DB will be implement on Amazon Keyspace, and data backup will be stored at S3 as parquet format.
* Data cleaning and ETL process will be implement on Amazon EMR with Spark

#### The dataset is going to use in this project are:

* I94 Immigration Data: This data comes from the US National Tourism and Trade Office. A data dictionary is included in the workspace. https://travel.trade.gov/research/reports/i94/historical/2016.html is where the data comes from. There's a sample file so you can take a look at the data in csv format before reading it all in. You do not have to use the entire dataset, just use what you need to accomplish the goal you set at the beginning of the project.
* World Temperature Data: This dataset came from Kaggle. You can read more about it here: https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data.
* U.S. City Demographic Data: This data comes from OpenSoft. You can read more about it here: https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/.
* Airport Code Table: This is a simple table of airport codes and corresponding cities. It comes from here:https://datahub.io/core/airport-codes#data.

In [4]:
	
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.enableHiveSupport().getOrCreate()

while True:
    if os.path.isdir('16_jan.parquet'):
        break
    else:
        sql = '''SELECT id_, cicid, i94yr, i94mon, i94cit, i94res, i94port,
                 arrdate, i94mode, i94addr, depdate, i94bir, i94visa,
                 count, dtadfile, visapost, occup, entdepa, entdepd,
                 entdepu, matflag, biryear, dtaddto, gender, insnum,
                 airline, admnum, fltno, visatype FROM i94'''
        
        i94_url='s3://srk-data-eng-capstone/i94/i94_jan16_sub.sas7bdat'
        i94 = pd.read_sas(i94_url, 'sas7bdat',encoding="ISO-8859-1").drop_duplicates()
        i94['id_'] = pd.Series([uuid.uuid1() for each in range(len(i94))])
        i94['dtadfile'].fillna(-1,inplace=True)
        for each in i94:
            dt=i94[each].dtype
            if dt == 'object':
                i94[each].fillna('Null',inplace=True)
            else:
                i94[each].fillna(-1,inplace=True)
        i94.to_csv('16_jan.csv', index=False)
        df_spark=spark.read.option('header','true').csv('16_jan.csv')
        df_spark.createOrReplaceTempView('i94')
        parquet_data=spark.sql(sql)
        parquet_data.write.parquet('16_jan.parquet',mode='overwrite')
        os.system('aws s3 cp 16_jan.parquet s3://i94-backup --recursive')
        break        



In [5]:
from cassandra.cluster import Cluster
from ssl import SSLContext, PROTOCOL_TLSv1, CERT_REQUIRED
from cassandra.auth import PlainTextAuthProvider
from cassandra import ConsistencyLevel
from tqdm import tqdm

ssl_context = SSLContext(PROTOCOL_TLSv1)
ssl_context.load_verify_locations('AmazonRootCA1.pem')
ssl_context.verify_mode = CERT_REQUIRED
auth_provider = PlainTextAuthProvider(username=str(config['APACHE_CASSANDRA_CREDS']['CASSANDRA_USERNAME']), password=str(config['APACHE_CASSANDRA_CREDS']['CASSANDRA_PASSWORD']))
cluster = Cluster(['cassandra.eu-west-1.amazonaws.com'], ssl_context=ssl_context, auth_provider=auth_provider, port=9142)

In [6]:
print('Patient...')
session = cluster.connect()
create_keyspace="""CREATE KEYSPACE IF NOT EXISTS "i94"
                   WITH REPLICATION={'class':'SingleRegionStrategy'}"""
session.execute(create_keyspace)
sleep(10)

create_table="""CREATE TABLE IF NOT EXISTS "i94".i94 (
                                                      cicid DOUBLE,
                                                      i94yr DOUBLE,
                                                      i94mon DOUBLE,
                                                      i94cit DOUBLE,
                                                      i94res DOUBLE,
                                                      i94port TEXT,
                                                      arrdate DOUBLE,
                                                      i94mode DOUBLE,
                                                      i94addr TEXT,
                                                      depdate DOUBLE,
                                                      i94bir DOUBLE,
                                                      i94visa DOUBLE,
                                                      count DOUBLE,
                                                      dtadfile DOUBLE,
                                                      visapost TEXT,
                                                      occup TEXT,
                                                      entdepa TEXT,
                                                      entdepd TEXT,
                                                      entdepu TEXT,
                                                      matflag TEXT,
                                                      biryear DOUBLE,
                                                      dtaddto TEXT,
                                                      gender TEXT,
                                                      insnum TEXT,
                                                      airline TEXT,
                                                      admnum DOUBLE,
                                                      fltno TEXT,
                                                      visatype TEXT,
                                                      id_ TEXT,
                                                      PRIMARY KEY(id_)
                ) """
session.execute(create_table)
sleep(10)
print('Table well-prepared. you can input data from dataset.')

Patient...
Table well-prepared. you can input data from dataset.


In [7]:
origin_sql="""INSERT INTO "i94".i94 ("cicid","i94yr","i94mon","i94cit","i94res","i94port","arrdate","i94mode","i94addr","depdate",
                              "i94bir","i94visa","count","dtadfile","visapost","occup","entdepa","entdepd","entdepu","matflag",
                              "biryear","dtaddto","gender","insnum","airline","admnum","fltno","visatype","id_")
                              VALUES ({0},{1},{2},{3},{4},'{5}',{6},{7},'{8}',{9},
                                      {10},{11},{12},{13},'{14}','{15}','{16}','{17}','{18}','{19}',
                                      {20},'{21}','{22}','{23}','{24}',{25},'{26}','{27}','{28}')"""


total_length=len(pd.read_csv('16_jan.csv'))

with open('{}.csv'.format('16_jan'),'r') as csv:
    #Ignore the first line, which is column head.
    csv=iter(csv)
    next(csv)
#     counts=0
    for each in tqdm(csv):
        lists=[]
        columns=each.split(',')
        cicid=columns[0]
        i94yr=columns[1]
        i94mon=columns[2]
        i94cit=columns[3]
        i94res=columns[4]
        i94port=columns[5]
        arrdate=columns[6]
        i94mode=columns[7]
        i94addr=columns[8]
        depdate=columns[9]
        i94bir=columns[10]
        i94visa=columns[11]
        count=columns[12]
        dtadfile=columns[13]
        visapost=columns[14]
        occup=columns[15]
        entdepa=columns[16]
        entdepd=columns[17]
        entdepu=columns[18]
        matflag=columns[19]
        biryear=columns[20]
        dtaddto=columns[21]
        gender=columns[22]
        insnum=columns[23]
        airline=columns[24]
        admnum=columns[25]
        fltno=columns[26]
        visatype=columns[27]
        id_=columns[28]        
        
        original_list=[cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,
                      count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,
                      airline,admnum,fltno,visatype,id_]
        
        for each in original_list:
            try:
                each = float(each)
                lists.append(each)
            except:
                if each == '':
                    each = None
                    lists.append(each)
                else:
                    lists.append(each)
                    continue                            
        
        formated_sql=origin_sql.format(lists[0],lists[1],lists[2],lists[3],lists[4],lists[5],lists[6],lists[7],lists[8],lists[9],
                      lists[10],lists[11],lists[12],lists[13],lists[14],lists[15],lists[16],lists[17],lists[18],lists[19],
                      lists[20],lists[21],lists[22],lists[23],lists[24],lists[25],lists[26],lists[27],lists[28])
        
        sql=session.prepare(formated_sql)
        sql.consistency_level = ConsistencyLevel.LOCAL_QUORUM
        try:
            session.execute(sql)
        except:
            session = cluster.connect()
            session.execute(sql)
        
#         print('Import row{} complete.'.format(counts) + '{} remaining.'.format(total_length-counts))
#         counts+=1
    


62693it [59:14, 17.98it/s]

KeyboardInterrupt: 

62693it [59:31, 17.98it/s]

In [None]:
while True:
    try:
        os.remove('16_jan.csv')
    except:
        break

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [42]:
global_land_temperature_url = 's3://srk-data-eng-capstone/GlobalLandTemperaturesByCity.csv'
airport_codes_url = 's3://srk-data-eng-capstone/airport-codes_csv.csv'
us_city_demographics_url = 's3://srk-data-eng-capstone/us-cities-demographics.csv'

In [78]:
global_land_temperature = pd.read_csv(global_land_temperature_url)
airport_codes=pd.read_csv(airport_codes_url)
us_city_demographics=pd.read_csv(us_city_demographics_url, sep=';')

In [91]:
us_cities = global_land_temperature['Country']=='United States'
global_land_temperature = global_land_temperature[us_cities]
global_land_temperature = global_land_temperature.drop(['Country'],axis=1)
start_date = '2012-01-01'
end_date = '2012-12-01'
period = (global_land_temperature['dt']>=start_date) & (global_land_temperature['dt']<=end_date)
global_land_temperature=global_land_temperature[period]

In [79]:
acceptable_airport = ['small_airport','medium_airport','large_airport']
airport_codes = airport_codes[airport_codes['type'].isin(acceptable_airport)]
#Do we need to delete rows with no IATA code?


In [97]:
us_city_demographics
#The relationship between this table and global temperature table: city=City
#The relationship between this table and nosql table: State Code=i94addr

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.60,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402
...,...,...,...,...,...,...,...,...,...,...,...,...
2886,Stockton,California,32.5,150976.0,154674.0,305650,12822.0,79583.0,3.16,CA,American Indian and Alaska Native,19834
2887,Southfield,Michigan,41.6,31369.0,41808.0,73177,4035.0,4011.0,2.27,MI,American Indian and Alaska Native,983
2888,Indianapolis,Indiana,34.1,410615.0,437808.0,848423,42186.0,72456.0,2.53,IN,White,553665
2889,Somerville,Massachusetts,31.0,41028.0,39306.0,80334,2103.0,22292.0,2.43,MA,American Indian and Alaska Native,374


In [88]:
airport_codes
#The relationship between nosql and this table: i94port=iata_code
#The relationship between this table and global temperature table: city=municipality
#The relationship between this table and US city demographics: iso_region[3:](if the United States)=State Code

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
5,00AS,small_airport,Fulton Airport,1100.0,,US,US-OK,Alex,00AS,,00AS,"-97.8180194, 34.9428028"
6,00AZ,small_airport,Cordes Airport,3810.0,,US,US-AZ,Cordes,00AZ,,00AZ,"-112.16500091552734, 34.305599212646484"
...,...,...,...,...,...,...,...,...,...,...,...,...
55069,ZYYJ,medium_airport,Yanji Chaoyangchuan Airport,624.0,AS,CN,CN-22,Yanji,ZYYJ,YNJ,,"129.451004028, 42.8828010559"
55070,ZYYK,medium_airport,Yingkou Lanqi Airport,0.0,AS,CN,CN-21,Yingkou,ZYYK,YKH,,"122.3586, 40.542524"
55071,ZYYY,medium_airport,Shenyang Dongta Airport,,AS,CN,CN-21,Shenyang,ZYYY,,,"123.49600219726562, 41.784400939941406"
55073,ZZ-0002,small_airport,Glorioso Islands Airstrip,11.0,AF,TF,TF-U-A,Grande Glorieuse,,,,"47.296388888900005, -11.584277777799999"


In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.