# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 0: Prepare the environment and import data from S3
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Before we continue, we need to install the required Python packages.
import sys

!{sys.executable} -m pip install boto3
!{sys.executable} -m pip install s3fs
!{sys.executable} -m pip install pyspark
!{sys.executable} -m pip install cqlsh
!{sys.executable} -m pip install findspark
!{sys.executable} -m pip install pyarrow

You should consider upgrading via the 'pip install --upgrade pip' command.


In [1]:
# Do all imports and installs here
import configparser
import pandas as pd
import os
import boto3
import uuid
from pyspark.sql import types as T
from time import sleep

In [2]:
config = configparser.ConfigParser()
config.read('iam.cfg')
os.environ['AWS_ACCESS_KEY_ID']=config['AWS_CREDS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS_CREDS']['AWS_SECRET_ACCESS_KEY']

client=boto3.client('s3')


# Set spark environments
os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/local/bin/python3'

### Step 1: Scope the Project and Gather Data

#### Project description:

This project is divided into multiple parts, and all four datasets are used.

Before going into the details, we need to review the characteristics of relational and non-relational databases.

A relational database is characterized by low redundancy and high data integrity, which makes it well suited to small or medium-sized data that changes infrequently. In our case, the temperature, airport code and US city demographic data should be stored in a relational database that meets 3NF, because these datasets rarely change and their volume is modest.

The final solution works as a database management system. When a user supplies a time or time period and the column they are interested in (e.g., visa type), the system returns the related data as a dataframe. For example, a user may want to know which airports were busiest for treaty investor visa holders (E-2 visa) in 2016, together with basic information such as the local temperature and the demographics of the city (age, majority race, etc.), or when international student arrivals to the United States peak and where those students come from.
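
As a rough illustration of the kind of lookup described above, the sketch below uses an in-memory SQLite table as a stand-in for the real data stores. The table layout, the `busiest_ports` helper and the sample rows are all hypothetical; only the column names `year`, `port` and `visatype` follow the renamed I94 columns used later in this notebook.

```python
import sqlite3

def busiest_ports(conn, year, visatype, limit=3):
    """Return (port, arrivals) pairs for one visa type and year, busiest first."""
    query = """
        SELECT port, COUNT(*) AS arrivals
        FROM i94
        WHERE year = ? AND visatype = ?
        GROUP BY port
        ORDER BY arrivals DESC, port
        LIMIT ?
    """
    return conn.execute(query, (year, visatype, limit)).fetchall()

# Hypothetical sample data, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE i94 (year INTEGER, port TEXT, visatype TEXT)")
conn.executemany(
    "INSERT INTO i94 VALUES (?, ?, ?)",
    [(2016, "NYC", "F1"), (2016, "NYC", "F1"), (2016, "LOS", "F1"),
     (2016, "LOS", "B2"), (2015, "MIA", "F1")],
)
result = busiest_ports(conn, 2016, "F1")
print(result)
```

In the real system the same parameterized query shape could be sent to the relational store, with the result loaded into a dataframe.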

* Data will be imported from Amazon S3.
* The relational DB will be implemented on Amazon Redshift.
* The non-relational DB will be implemented on Amazon Keyspaces, and a data backup will be stored on S3 in Parquet format.
* Data cleaning and the ETL process will be implemented on Amazon EMR with Spark.

#### The datasets used in this project are:

* I94 Immigration Data: This data comes from the US National Tourism and Trade Office (https://travel.trade.gov/research/reports/i94/historical/2016.html). A data dictionary is included in the workspace. There's a sample file so you can take a look at the data in CSV format before reading it all in. You do not have to use the entire dataset, just use what you need to accomplish the goal you set at the beginning of the project.
* World Temperature Data: This dataset comes from Kaggle. You can read more about it here: https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data.
* U.S. City Demographic Data: This data comes from OpenDataSoft. You can read more about it here: https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/.
* Airport Code Table: This is a simple table of airport codes and corresponding cities. It comes from here: https://datahub.io/core/airport-codes#data.

In [3]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql import SQLContext
from pyspark.sql import types as T
from pyspark.sql.types import *
from pyspark import SparkContext

spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.driver.memory", "15g")
    .enableHiveSupport()
    .getOrCreate()
)
# Use Arrow to speed up conversion between pandas and Spark DataFrames
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

In [4]:
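# Read the January 2016 I94 extract and drop exact duplicate rows
i94 = pd.read_sas('i94_jan16_sub.sas7bdat', 'sas7bdat', encoding="ISO-8859-1").drop_duplicates()
# Assign a plain list so ids are set by position; a Series would be aligned on the
# index, which has gaps after drop_duplicates() and would leave some ids as NaN
i94['id_'] = [str(uuid.uuid1()) for _ in range(len(i94))]
# SAS stores dates as days since 1960-01-01
i94['arrival_date'] = pd.to_timedelta(i94['arrdate'], unit='D') + pd.Timestamp('1960-1-1')
i94 = spark.createDataFrame(i94)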
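
The `arrival_date` conversion above relies on SAS storing dates as a count of days since 1960-01-01. A minimal standard-library sketch of the same arithmetic is shown here for a single value; the helper name `sas_days_to_date` is hypothetical and not part of the notebook's pipeline.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS counts dates as days since this day

def sas_days_to_date(days):
    """Convert a SAS date number (days since 1960-01-01) to a date; None stays None."""
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))

print(sas_days_to_date(20454))  # 2016-01-01
```

The same helper could be applied to `depdate`, which is stored in the same format, provided missing values are handled first.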

In [5]:
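def mapping_processor(names):
    """Load mappings/<names>.txt into a Spark DataFrame with <names>_code and <names>_name columns."""
    code = []
    name = []
    with open('mappings/{}.txt'.format(names), 'r') as origin:
        for each in origin:
            # Collapse runs of whitespace, then locate the first '='
            line = " ".join(each.split())
            if '=' not in line:
                continue  # skip blank or malformed lines
            idx = line.index('=')
            try:
                # Numeric codes (e.g. country codes) are stored as integers
                code.append(int(line[:idx]))
            except ValueError:
                # Non-numeric codes are quoted, so strip the surrounding quotes
                code.append(line[1:idx - 1])
            name.append(line[idx + 2:-1])
    col_code = names + '_code'
    col_name = names + '_name'
    df = pd.DataFrame(list(zip(code, name)), columns=[col_code, col_name])
    return spark.createDataFrame(df)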
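
To make the parsing step in `mapping_processor` easier to follow, here is a small sketch that handles one line at a time. It assumes each mapping line looks like `code='name'`; the actual file layout is not shown in this notebook, so both the sample lines and the `parse_mapping_line` helper are hypothetical.

```python
def parse_mapping_line(line):
    """Split "code='name'" into (code, name); all-digit codes become int."""
    raw_code, _, raw_name = line.partition("=")
    code = raw_code.strip().strip("'")
    name = raw_name.strip().strip("'").strip()
    return (int(code) if code.isdigit() else code), name

# Hypothetical sample lines, for illustration only
print(parse_mapping_line("582='MEXICO'"))
print(parse_mapping_line("'ALC'='ALCAN, AK   '"))
```

Stripping quotes and whitespace explicitly, rather than slicing at fixed offsets, keeps the result the same whether or not there are spaces around the `=` sign.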

In [78]:
country=mapping_processor('country')
mode=mapping_processor('mode')
port=mapping_processor('port')
us_states=mapping_processor('us_states')
visacode=mapping_processor('visacode')

country.createOrReplaceTempView('country')
mode.createOrReplaceTempView('mode')
port.createOrReplaceTempView('port')
us_states.createOrReplaceTempView('us_states')
visacode.createOrReplaceTempView('visacode')
i94.createOrReplaceTempView('i94')

In [83]:
sql="""SELECT i94yr AS year,
              i94mon AS month,
              i94cit AS citizenship,
              i94res AS resident,
              i94port AS port,
              arrival_date,
              i94mode AS mode,
              i94addr AS us_state,
              depdate AS depart_date,
              i94bir AS age,
              i94visa AS visa_category,
              dtadfile AS date_added,
              visapost AS visa_issued_by,
              occup AS occupation,
              entdepa AS arrival_flag,
              entdepd AS depart_flag,
              entdepu AS update_flag,
              matflag AS match_arrival_depart_flag,
              biryear AS birth_year,
              dtaddto AS allowed_date,
              gender,
              insnum AS ins_number,
              airline,
              admnum AS admission_number,
              fltno AS flight_no,
              visatype,
              id_
              FROM i94
       """
i94_df=spark.sql(sql)
i94_df.createOrReplaceTempView('i94')

In [123]:
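# Note: `mode` holds the numeric i94mode codes, so comparing it with the string 'air'
# matches no rows; filtering by name requires a join with the `mode` mapping view.
temp=spark.sql("SELECT mode, COUNT(mode) FROM i94 WHERE NOT (mode = 'air') GROUP BY mode")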
temp.show()

+----+-----------+
|mode|count(mode)|
+----+-----------+
+----+-----------+
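
The result above is empty because `mode` contains numeric codes rather than names. One way to filter by name is to join against the mapping table first. The sketch below shows that pattern with an in-memory SQLite database standing in for the Spark views; the rows and the code-to-name labels are hypothetical sample data, and the column names `mode_code` and `mode_name` follow the output of `mapping_processor('mode')`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE i94 (id_ TEXT, mode REAL)")
conn.execute("CREATE TABLE mode (mode_code INTEGER, mode_name TEXT)")
# Hypothetical rows and labels, for illustration only
conn.executemany("INSERT INTO i94 VALUES (?, ?)",
                 [("a", 1.0), ("b", 1.0), ("c", 2.0), ("d", 3.0), ("e", 3.0)])
conn.executemany("INSERT INTO mode VALUES (?, ?)",
                 [(1, "Air"), (2, "Sea"), (3, "Land")])

# Resolve the numeric code to a name, then filter on the name
non_air = conn.execute("""
    SELECT m.mode_name, COUNT(*) AS arrivals
    FROM i94 AS i
    JOIN mode AS m ON i.mode = m.mode_code
    WHERE LOWER(m.mode_name) <> 'air'
    GROUP BY m.mode_name
    ORDER BY m.mode_name
""").fetchall()
print(non_air)
```

The same join and filter should translate to `spark.sql` against the `i94` and `mode` temp views, although it has not been run against the notebook's data here.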



### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data
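
A minimal sketch of two common checks, missing values and duplicate rows, is given below. It uses an in-memory SQLite table so that it runs on its own; the table name `glt`, its columns and the sample rows are assumptions made for illustration and are not the notebook's actual data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE glt (dt TEXT, city TEXT, avg_temperature REAL)")
# Hypothetical rows: one missing temperature and one exact duplicate
conn.executemany("INSERT INTO glt VALUES (?, ?, ?)", [
    ("2012-01-01", "Abilene", 7.996),
    ("2012-01-01", "Abilene", 7.996),
    ("2012-07-01", "Abilene", None),
    ("2012-01-01", "Akron", -0.344),
])

# COUNT(col) skips NULLs, so the difference is the number of missing values
missing = conn.execute(
    "SELECT COUNT(*) - COUNT(avg_temperature) FROM glt").fetchone()[0]

# Rows that appear more than once across all columns
duplicates = conn.execute("""
    SELECT dt, city, avg_temperature, COUNT(*) AS copies
    FROM glt
    GROUP BY dt, city, avg_temperature
    HAVING COUNT(*) > 1
""").fetchall()
print(missing, duplicates)
```

Both queries use only standard SQL, so the same pattern could be tried with `spark.sql` on the temp views created in this notebook.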

In [250]:
global_land_temperature_url = 's3://srk-data-eng-capstone/GlobalLandTemperaturesByCity.csv'
airport_codes_url = 's3://srk-data-eng-capstone/airport-codes_csv.csv'
us_city_demographics_url = 's3://srk-data-eng-capstone/us-cities-demographics.csv'

In [257]:
#Global land temperature view preparation
global_land_temperature = pd.read_csv(global_land_temperature_url)
global_land_temperature['id_'] = pd.Series([str(uuid.uuid1()) for each in range(len(global_land_temperature))])
global_land_temperature = spark.createDataFrame(global_land_temperature)


In [324]:
global_land_temperature.createOrReplaceTempView('glt')
# sql="""SELECT DISTINCT city, AverageTemperature AS winter_avg FROM glt WHERE dt='2012-01-01'"""
# winter=spark.sql(sql)
# winter.createOrReplaceTempView('winter')
# sql="""SELECT DISTINCT city, AverageTemperature AS summer_avg FROM glt WHERE dt='2012-07-01'"""
# summer=spark.sql(sql)
# summer.createOrReplaceTempView('summer')

sql=""""""
temp=spark.sql(sql)
temp.show()

+-----------------+------------------+--------+---------+
|             city|            summer|latitude|longitude|
+-----------------+------------------+--------+---------+
|         Glendale|            28.416|  32.95N|  112.02W|
|          Abilene|            21.666|  32.95N|  100.53W|
|      Albuquerque|            21.666|  34.56N|  107.03W|
|       Huntsville|            26.982|  34.56N|   85.62W|
|       Huntsville|27.593000000000004|  34.56N|   85.62W|
|         El Monte|            27.813|  34.56N|  118.70W|
|       Alexandria|23.555999999999997|  39.38N|   76.99W|
|Lexington Fayette|            31.184|  37.78N|   85.42W|
|            Tampa|            27.427|  28.13N|   82.73W|
|         Torrance| 26.63000000000001|  34.56N|  118.70W|
|           Joliet|            28.566|  40.99N|   87.34W|
|            Salem|32.300000000000004|  44.20N|  122.98W|
|       Des Moines|32.300000000000004|  40.99N|   93.73W|
|       Des Moines|             27.68|  40.99N|   93.73W|
|       Evansv

In [312]:
sql="""SELECT city, dt, country, AverageTemperature AS winter FROM glt 
       WHERE dt='2012-01-01'
       AND country = 'United States'"""
temp=spark.sql(sql)
temp.show()

+-----------+----------+-------------+--------------------+
|       city|        dt|      country|              winter|
+-----------+----------+-------------+--------------------+
|    Abilene|2012-01-01|United States|               7.996|
|      Akron|2012-01-01|United States|-0.34399999999999986|
|Albuquerque|2012-01-01|United States|                1.84|
| Alexandria|2012-01-01|United States|               2.382|
|  Allentown|2012-01-01|United States|-0.04099999999999...|
|   Amarillo|2012-01-01|United States|               6.534|
|    Anaheim|2012-01-01|United States|              15.113|
|  Anchorage|2012-01-01|United States|             -22.628|
|  Ann Arbor|2012-01-01|United States|               -1.44|
|    Antioch|2012-01-01|United States|               9.993|
|  Arlington|2012-01-01|United States|               9.359|
|  Arlington|2012-01-01|United States|               2.382|
|     Arvada|2012-01-01|United States|              -5.402|
|    Atlanta|2012-01-01|United States|  

In [313]:
sql="""SELECT city, dt, country, AverageTemperature AS summer FROM glt 
       WHERE dt='2012-07-01'
       AND country = 'United States'"""
temp=spark.sql(sql)
temp.show()

+-----------+----------+-------------+------------------+
|       city|        dt|      country|            summer|
+-----------+----------+-------------+------------------+
|    Abilene|2012-07-01|United States|            29.581|
|      Akron|2012-07-01|United States|            24.966|
|Albuquerque|2012-07-01|United States|23.555999999999997|
| Alexandria|2012-07-01|United States|            26.629|
|  Allentown|2012-07-01|United States|            24.479|
|   Amarillo|2012-07-01|United States|            28.553|
|    Anaheim|2012-07-01|United States|            18.984|
|  Anchorage|2012-07-01|United States|            10.697|
|  Ann Arbor|2012-07-01|United States| 24.91500000000001|
|    Antioch|2012-07-01|United States|            19.632|
|  Arlington|2012-07-01|United States|            30.415|
|  Arlington|2012-07-01|United States|            26.629|
|     Arvada|2012-07-01|United States|            15.634|
|    Atlanta|2012-07-01|United States|             26.34|
|     Aurora|2

In [317]:
sql = """SELECT DISTINCT city AS unique_city, latitude, longitude FROM glt
         WHERE country = 'United States'
         ORDER BY unique_city"""
temp=spark.sql(sql)
temp.show()

+-----------+--------+---------+
|unique_city|latitude|longitude|
+-----------+--------+---------+
|    Abilene|  32.95N|  100.53W|
|      Akron|  40.99N|   80.95W|
|Albuquerque|  34.56N|  107.03W|
| Alexandria|  39.38N|   76.99W|
|  Allentown|  40.99N|   74.56W|
|   Amarillo|  34.56N|  101.19W|
|    Anaheim|  32.95N|  117.77W|
|  Anchorage|  61.88N|  151.13W|
|  Ann Arbor|  42.59N|   82.91W|
|    Antioch|  37.78N|  122.03W|
|  Arlington|  39.38N|   76.99W|
|  Arlington|  32.95N|   96.70W|
|     Arvada|  39.38N|  106.13W|
|    Atlanta|  34.56N|   83.68W|
|     Aurora|  40.99N|   87.34W|
|     Aurora|  39.38N|  104.05W|
|     Austin|  29.74N|   97.85W|
|Bakersfield|  36.17N|  119.34W|
|  Baltimore|  39.38N|   76.99W|
|Baton Rouge|  29.74N|   90.46W|
+-----------+--------+---------+
only showing top 20 rows



In [91]:
airport_codes = pd.read_csv(airport_codes_url)
airport_codes = spark.createDataFrame(airport_codes)
airport_codes.createOrReplaceTempView('airports')
sql = """SELECT ident, type, name, elevation_ft, continent, 
                iso_country, iso_region, municipality, gps_code, iata_code AS airport_code, coordinates
         FROM airports WHERE iata_code IS NOT NULL
         UNION
         SELECT ident, type, name, elevation_ft, continent,
                iso_country, iso_region, municipality, gps_code, local_code AS airport_code, coordinates
         FROM airports WHERE local_code IS NOT NULL"""
airports = spark.sql(sql)
airports.createOrReplaceTempView('airports')

In [124]:
us_city_demographics=pd.read_csv(us_city_demographics_url, sep=';')
us_city_demographics['id_'] = pd.Series([str(uuid.uuid1()) for each in range(len(us_city_demographics))])
us_city_demographics=spark.createDataFrame(us_city_demographics)
us_city_demographics.createOrReplaceTempView('us_cities')
sql="""SELECT id_, city, `Median Age` AS median_age, `Male Population` AS male_population,
              `Female Population` AS female_population, `Total Population` AS population,
              `Number of Veterans` AS num_veterans, `Foreign-born` AS foreign_born, `Average Household Size` AS avg_household_size,
              `State Code` AS state, race, count
       FROM us_cities"""
us_cities = spark.sql(sql)
us_cities.createOrReplaceTempView('us_cities')

+--------------------+----------+--------------------+---------------------------+-----------+--------+---------+
|                 id_|        dt|     avg_temperature|avg_temperature_uncertainty|       city|latitude|longitude|
+--------------------+----------+--------------------+---------------------------+-----------+--------+---------+
|78bdec90-f682-11e...|2012-01-01|               7.996|                      0.204|    Abilene|  32.95N|  100.53W|
|78bded76-f682-11e...|2012-07-01|              29.581|        0.28800000000000003|    Abilene|  32.95N|  100.53W|
|78f5f218-f682-11e...|2012-01-01|-0.34399999999999986|                       0.41|      Akron|  40.99N|   80.95W|
|78f5f310-f682-11e...|2012-07-01|              24.966|                      0.401|      Akron|  40.99N|   80.95W|
|7908418c-f682-11e...|2012-01-01|                1.84|                      0.484|Albuquerque|  34.56N|  107.03W|
|7908427a-f682-11e...|2012-07-01|  23.555999999999997|                      0.335|Albuqu

In [137]:
sql="""SELECT uc.id_, uc.city, uc.median_age, uc.male_population, 
              uc.female_population, uc.num_veterans, uc.foreign_born, uc.avg_household_size,
              uc.state, uc.race, (SELECT avg_temperature FROM temperature 
                                  WHERE dt='2012-01-01') AS winter_avg_temperature,
                                  (SELECT avg_temperature FROM temperature
                                  WHERE dt='2012-07-01') AS summer_avg_temperature
       FROM temperature AS t LEFT JOIN us_cities AS uc ON t.city = uc.city 
       WHERE uc.city='New York'"""
temp=spark.sql(sql)
temp.show()

Py4JJavaError: An error occurred while calling o681.showString.
: java.lang.RuntimeException: more than one row returned by a subquery used as an expression:
Subquery scalar-subquery#2708, [id=#1678]
+- *(1) Project [AverageTemperature#2446 AS avg_temperature#2468]
   +- *(1) Filter (((((isnotnull(Country#2449) AND isnotnull(dt#2445)) AND (Country#2449 = United States)) AND (dt#2445 >= 2012-01-01)) AND (dt#2445 <= 2012-12-01)) AND (dt#2445 = 2012-01-01))
      +- *(1) Scan ExistingRDD arrow[dt#2445,AverageTemperature#2446,AverageTemperatureUncertainty#2447,City#2448,Country#2449,Latitude#2450,Longitude#2451,id_#2452]

	at scala.sys.package$.error(package.scala:30)
	at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:85)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:243)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:242)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.sql.execution.SparkPlan.waitForSubqueries(SparkPlan.scala:242)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:212)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
	at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)
	at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)
	at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:41)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:632)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:692)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:434)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3625)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3616)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3614)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2695)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2902)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
	at sun.reflect.GeneratedMethodAccessor237.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
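
The error above occurs because each scalar subquery returns one row per city instead of a single value. One possible alternative is conditional aggregation, which pivots the January and July readings into two columns per city. The sketch below demonstrates the pattern on an in-memory SQLite table; the four sample rows are rounded from the temperature output shown earlier, and the table layout is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE temperature
                (dt TEXT, city TEXT, latitude TEXT, longitude TEXT,
                 avg_temperature REAL)""")
rows = [
    ("2012-01-01", "Abilene", "32.95N", "100.53W", 7.996),
    ("2012-07-01", "Abilene", "32.95N", "100.53W", 29.581),
    ("2012-01-01", "Akron", "40.99N", "80.95W", -0.344),
    ("2012-07-01", "Akron", "40.99N", "80.95W", 24.966),
]
conn.executemany("INSERT INTO temperature VALUES (?, ?, ?, ?, ?)", rows)

# One output row per city: CASE picks the reading for each date,
# MAX collapses the group to that single non-NULL value
seasonal = conn.execute("""
    SELECT city, latitude, longitude,
           MAX(CASE WHEN dt = '2012-01-01' THEN avg_temperature END) AS winter_avg_temperature,
           MAX(CASE WHEN dt = '2012-07-01' THEN avg_temperature END) AS summer_avg_temperature
    FROM temperature
    GROUP BY city, latitude, longitude
    ORDER BY city
""").fetchall()
print(seasonal)
```

Grouping by latitude and longitude as well as city keeps same-named cities (such as the two Arlington entries in the earlier output) apart. The resulting table could then be joined to `us_cities` on the city name; the same SQL shape should carry over to `spark.sql`, though it has not been run against the notebook's data.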


In [70]:
print(country)
print(mode)
print(port)
print(us_states)
print(visacode)

DataFrame[country_code: bigint, country_name: string]
DataFrame[mode_code: bigint, mode_name: string]
DataFrame[port_code: string, port_name: string]
DataFrame[us_states_code: string, us_states_name: string]
DataFrame[visacode_code: bigint, visacode_name: string]


In [76]:
# print(i94_df)
# print(glt_df)
# print(airports)
print(us_cities)

DataFrame[id_: string, city: string, median_age: double, male_population: double, female_population: double, population: bigint, num_veterans: double, foreign_born: double, avg_household_size: double, state_code: string, race: string, count: bigint]


In [107]:
temp=spark.sql("SELECT DISTINCT city FROM us_cities")
temp.show()

+--------------------+
|                city|
+--------------------+
|        Saint George|
|           Worcester|
|               Tyler|
|         Springfield|
|              Caguas|
|          Charleston|
|               Pasco|
|              Corona|
|               Tempe|
|     North Las Vegas|
|              Auburn|
|            Palatine|
|            Thornton|
|Augusta-Richmond ...|
|           Bethlehem|
|             Phoenix|
|            Waukegan|
|           Hollywood|
|           Pittsburg|
|          Toms River|
+--------------------+
only showing top 20 rows



In [117]:
temp=spark.sql("SELECT iso_country, iso_region FROM airports LIMIT 5")
temp.show()

+-----------+----------+
|iso_country|iso_region|
+-----------+----------+
|         PG|    PG-MPL|
|         IS|      IS-4|
|         BZ|     BZ-SC|
|         CA|     CA-BC|
|         CN|     CN-13|
+-----------+----------+



In [74]:
# Tomorrow we start from here.
# Connect the tables above, and modify them if needed.

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
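
As one possible shape for these checks, the sketch below runs a row-count check, a NULL-key check and a uniqueness check against a table. It uses SQLite so it can run standalone; the `run_quality_checks` helper, the sample table and its rows are hypothetical, and the table and key names are interpolated directly into the SQL, so they must come from trusted code rather than user input.

```python
import sqlite3

def run_quality_checks(conn, table, key, expected_rows):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if total != expected_rows:
        failures.append(f"{table}: expected {expected_rows} rows, found {total}")
    null_keys = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {key} IS NULL").fetchone()[0]
    if null_keys:
        failures.append(f"{table}: {null_keys} rows with NULL {key}")
    # COUNT(DISTINCT ...) ignores NULLs, so compare against the non-NULL rows
    distinct = conn.execute(
        f"SELECT COUNT(DISTINCT {key}) FROM {table}").fetchone()[0]
    if distinct != total - null_keys:
        failures.append(f"{table}: {key} is not unique")
    return failures

# Hypothetical sample table with a duplicated key
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE us_cities (id_ TEXT, city TEXT)")
conn.executemany("INSERT INTO us_cities VALUES (?, ?)",
                 [("u1", "Tempe"), ("u2", "Phoenix"), ("u2", "Corona")])
failures = run_quality_checks(conn, "us_cities", "id_", expected_rows=3)
print(failures)
```

Returning messages instead of raising on the first failure lets a pipeline report every problem in one run.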

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.

The data populates a dashboard that must be updated on a daily basis by 7am every day.

In this situation, we can import the data into a NoSQL database every day, as shown below:

In [None]:
from cassandra.cluster import Cluster
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
from cassandra.auth import PlainTextAuthProvider
from cassandra import ConsistencyLevel

# Amazon Keyspaces requires TLS; AmazonRootCA1.pem must be available locally
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations('AmazonRootCA1.pem')
ssl_context.verify_mode = CERT_REQUIRED
auth_provider = PlainTextAuthProvider(
    username=config['APACHE_CASSANDRA_CREDS']['CASSANDRA_USERNAME'],
    password=config['APACHE_CASSANDRA_CREDS']['CASSANDRA_PASSWORD'])
cluster = Cluster(['cassandra.eu-west-1.amazonaws.com'], ssl_context=ssl_context, auth_provider=auth_provider, port=9142)
print('Connecting, please wait...')
session = cluster.connect()

create_keyspace="""CREATE KEYSPACE IF NOT EXISTS "i94"
                   WITH REPLICATION={'class':'SingleRegionStrategy'}"""
session.execute(create_keyspace)
sleep(10)

# Column names follow the raw I94 field names used by the INSERT and SELECT statements in this section
create_table="""CREATE TABLE IF NOT EXISTS "i94".i94 (
                    cicid DOUBLE,
                    i94yr DOUBLE,
                    i94mon DOUBLE,
                    i94cit DOUBLE,
                    i94res DOUBLE,
                    i94port TEXT,
                    arrdate DOUBLE,
                    i94mode DOUBLE,
                    i94addr TEXT,
                    depdate DOUBLE,
                    i94bir DOUBLE,
                    i94visa DOUBLE,
                    "count" DOUBLE,
                    dtadfile DOUBLE,
                    visapost TEXT,
                    occup TEXT,
                    entdepa TEXT,
                    entdepd TEXT,
                    entdepu TEXT,
                    matflag TEXT,
                    biryear DOUBLE,
                    dtaddto TEXT,
                    gender TEXT,
                    insnum TEXT,
                    airline TEXT,
                    admnum DOUBLE,
                    fltno TEXT,
                    visatype TEXT,
                    id_ TEXT,
                    PRIMARY KEY(id_)
                ) """
session.execute(create_table)
sleep(10)
print('Table is ready. You can now insert data from the dataset.')

A non-relational database is characterized by higher elasticity, faster reads and writes, and the ability to handle evolving data volumes. In our case, the I94 data should be saved in a non-relational DB, because in a real-world setting this data would go through the ETL process almost every minute, and it needs dynamic reads and writes for real-time monitoring.
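
For frequent writes of this kind, one option is to build the INSERT statement once with `?` bind markers and pass the values separately, instead of formatting them into the query string. The sketch below only builds the CQL text, so it runs without a database; `build_insert_cql` is a hypothetical helper, and the commented lines show how the result might be used with the DataStax Python driver's `session.prepare` and `session.execute`.

```python
def build_insert_cql(keyspace, table, columns):
    """Build an INSERT with one '?' bind marker per column (identifiers double-quoted)."""
    col_list = ", ".join(f'"{c}"' for c in columns)
    markers = ", ".join("?" for _ in columns)
    return f'INSERT INTO "{keyspace}"."{table}" ({col_list}) VALUES ({markers})'

cql = build_insert_cql("i94", "i94", ["id_", "i94yr", "i94port"])
print(cql)

# With a live session this could be used roughly as follows (not executed here):
#   stmt = session.prepare(cql)
#   stmt.consistency_level = ConsistencyLevel.LOCAL_QUORUM
#   session.execute(stmt, (str(uuid.uuid1()), 2016.0, 'NYC'))
```

Binding values this way avoids quoting mistakes in the query text and lets the statement be prepared once and reused for every row.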

In [None]:
original_sql="""INSERT INTO "i94".i94 ("cicid","i94yr","i94mon","i94cit","i94res","i94port","arrdate","i94mode","i94addr","depdate",
                              "i94bir","i94visa","count","dtadfile","visapost","occup","entdepa","entdepd","entdepu","matflag",
                              "biryear","dtaddto","gender","insnum","airline","admnum","fltno","visatype","id_")
                              VALUES ({0},{1},{2},{3},{4},'{5}',{6},{7},'{8}',{9},
                                      {10},{11},{12},{13},'{14}','{15}','{16}','{17}','{18}','{19}',
                                      {20},'{21}','{22}','{23}','{24}',{25},'{26}','{27}','{28}')"""

lists=[888,1991,10,999,666,'port_test',9527,777,'addr_test',10,10,10,10,10,"visapost",'occup','entdepa','entdepd',
      'entdepu','mat',1984,'dtaddto','M','insnumber','AerLingus',29,'filtnumber','H1B']
# Fill the placeholders with the 28 sample values plus a freshly generated id
sql=original_sql.format(*lists, uuid.uuid1())
sql=session.prepare(sql)
sql.consistency_level = ConsistencyLevel.LOCAL_QUORUM
session.execute(sql)

# This part is intended for use in a standalone script.

# while True:
#     values = input("Insert data. Split values by comma. If a value is empty, just input a comma. Enter Q to quit. ")
#     if values.strip().lower() == 'q':
#         print('Quit.')
#         break
#     lists = values.split(',')
#     if len(lists) < 28:
#         print('Did you miss something?')
#         continue
#     stmt = session.prepare(original_sql.format(*lists[:28], uuid.uuid1()))
#     stmt.consistency_level = ConsistencyLevel.LOCAL_QUORUM
#     session.execute(stmt)
#     next_one = input('Done. Do you wish to continue? Y/N ')
#     if next_one.lower() != 'y':
#         print('Thanks. Quit.')
#         break


temp = session.execute('SELECT * FROM i94.i94')
columns = ['id_','admnum','airline','arrdate','biryear','cicid','count','depdate','dtaddto','dtadfile',
           'entdepa','entdepd','entdepu','fltno','gender','i94addr','i94bir','i94cit','i94mode','i94mon',
           'i94port','i94res','i94visa','i94yr','insnum','matflag','occup','visapost','visatype']
df = pd.DataFrame(list(temp), columns=columns)
df.to_parquet('dashboard.parquet.gzip',compression='gzip')

We run the script above at 7 AM every day to create