# Smart Advertisement Service


## Business scenario

An advertising consultant startup in US, which focuses on consulting range of business on effective advertisement, wants to adapt the cutting edge technology in order to enhance their quality of services. Lately conducted research on company has found out that around 14 percent of the national population are immigrant, [source](https://www.americanimmigrationcouncil.org/research/immigrants-in-the-united-states) . Thus the company has decided to build a data warehouse which will used for analytics for better consultation on advertisement. Not limited to that, the warehouse will also be used as a brain for backend services that will possibly automate the advertisement of certain category.

## Structure of the Project

The project follows the follow steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

### Step 1: Scope the Project and Gather Data

#### Scope
In this project I will create a cloud data warehouse that will support answering question through analytics tables and dashboards. Later, the warehouse could be used as a backend for developing automatic advertisement services.

The following steps will be carried out:
- Transform the data and store in s3 in parquet format
- Load the data to redshift for analytics
- Use airflow to monitor pipeline
- Use SQL client tool to the redshift database for analytics
- Later, api will be developed using lambda function and API gateway for dashboard and automatic advertisement

#### The data

- I94 immigration data that comes from the US National Tourism and Trade Office, [source](https://www.trade.gov/national-travel-and-tourism-office).
- U.S City demographic data that comes from Opensoft, [source](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).
- Airports data from Datahub, [source](https://datahub.io/core/airport-codes#data)


In [1]:
import pandas as pd
import os
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from datetime import datetime, timedelta
import json
from pyspark.sql.functions import desc, monotonically_increasing_id, udf, to_date, from_unixtime, trim, col

In [2]:
# Create spark session
spark = SparkSession.builder.config(
    "spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0"
).getOrCreate()

21/12/08 01:23:06 WARN Utils: Your hostname, Yugeshs-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.0.18 instead (on interface en0)
21/12/08 01:23:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/Users/yugesh/opt/anaconda3/envs/airflow/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /Users/yugesh/.ivy2/cache
The jars for the packages stored in: /Users/yugesh/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5e90d866-2492-43be-b2cf-dede2821d5a4;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;2.7.0 in central
	found org.apache.hadoop#hadoop-common;2.7.0 in central
	found org.apache.hadoop#hadoop-annotations;2.7.0 in central
	found com.google.guava#guava;11.0.2 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central
	found commons-cli#commons-cli;1.2 in central
	found org.apache.commons#commons-math3;3.1.1 in central
	found xmlenc#xmlenc;0.52 in central
	found commons-httpclient#commons-httpclient;3.1 in central
	found commons-logging#commons-logging;1.1.3 in central
	found commons-codec#commons-codec;1.4 in central
	found commons-io#commons-io;2.4 in central
	found commons-net#commons-net;3.1 in central
	found commons-collections#commons-colle

In [3]:
# Read in the data here
# partial
# immigration_df = spark.read.parquet("./data/sas_data/part-00000-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet")
# all
immigration_df = spark.read.parquet("./data/sas_data/")




In [4]:
immigration_df.count()



3096313

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

### Exploring immigration data

In [5]:
immigration_df.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

#### Convert double to integer

In [6]:
int_cols = ["cicid", "i94cit", "i94res", "arrdate", "i94mode", "depdate", "i94bir", "i94visa"]
# Additional options "dtadfile", "daddto"

for col_name in int_cols:
    immigration_df = immigration_df.withColumn(col_name, immigration_df[col_name].cast(IntegerType()))

immigration_df.select(int_cols).show(1)



+-------+------+------+-------+-------+-------+------+-------+
|  cicid|i94cit|i94res|arrdate|i94mode|depdate|i94bir|i94visa|
+-------+------+------+-------+-------+-------+------+-------+
|5748517|   245|   438|  20574|      1|  20582|    40|      1|
+-------+------+------+-------+-------+-------+------+-------+
only showing top 1 row



#### Remove duplicate values

- Drop duplicate with original ccid, however unable to remove duplicate because of ccid.
- Drop duplicate without ccid and recreate ccid able to remove ()

In [7]:
# Dropping duplicate with ccid
with_duplicate = immigration_df.count()

immigration_df.dropDuplicates()

without_duplicate = immigration_df.count()
print(f"Total number of deleted duplicate values {with_duplicate - without_duplicate}")
print(f"Total number of availabel records {immigration_df.count()}")

Total number of deleted duplicate values 0
Total number of availabel records 3096313


In [8]:
# Dropping duplicate by excluding ccid

with_duplicate = immigration_df.count()

immigration_df = immigration_df.drop("cicid")
immigration_df = immigration_df.dropDuplicates()

immigration_df = immigration_df.withColumn("cicid", monotonically_increasing_id())

without_duplicate = immigration_df.count()
print(f"Total number of deleted duplicate values {with_duplicate - without_duplicate}")
print(f"Total number of available records {immigration_df.count()}")


21/12/08 01:23:36 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


Total number of deleted duplicate values 0




Total number of available records 3096313




#### Assign 0 to all null in integer column

In [9]:
immigration_df = immigration_df.fillna(0, int_cols)

#### Assign real values

In [10]:
def assign_value(col_name, key):
    ROOT = "./data/lookup/"
    key = str(key)
    col_name = col_name.lower()

    filepath = os.path.join(ROOT, f"{col_name}.json")
    with open(filepath, "r") as f:
        data = json.load(f)

    if col_name == "i94port":
        return data[key].split(",")[0] if key in data else None

    return data[key] if key in data else None

In [11]:
# Retrieve transporation mode using i94mode
get_mode_udf = udf(lambda x: assign_value(key=x, col_name="i94mode"), T.StringType())
immigration_df = immigration_df.withColumn("transportation_mode", get_mode_udf(immigration_df.i94mode))

In [12]:
# Retrieve arrived city
get_city_udf = udf(lambda x: assign_value(key=x, col_name="i94port"), T.StringType())
immigration_df = immigration_df.withColumn("arrived_city", get_city_udf(immigration_df.i94port))

In [13]:
# Retrieve arrived state
get_state_udf = udf(lambda x: assign_value(key=x, col_name="i94addr"), T.StringType())
immigration_df = immigration_df.withColumn("us_address", get_state_udf(immigration_df.i94addr))

In [14]:
# Retrieve origin city and travelled from using i94CIT and i94res
get_origin_udf = udf(lambda x: assign_value(key=x, col_name="i94cit"), T.StringType())
immigration_df = immigration_df.withColumn("origin_city", get_origin_udf(immigration_df.i94cit)).withColumn("traveled_from", get_origin_udf(immigration_df.i94res))

In [15]:
# Retrive i94visa with value
get_visa_udf = udf(lambda x: assign_value(key=x, col_name="i94visa"), T.StringType())
immigration_df = immigration_df.withColumn("visa_status", get_visa_udf(immigration_df.i94visa))

#### Exclude unused columns

In [16]:
unused_cols = [ "i94yr","i94mon","count", "fltno",  "insnum", "entdepd", "biryear", "dtadfile", "biryear", "visapost", "entdepu", "admnum", "i94cit", "i94res", "i94port", "i94addr", "i94mode", "i94visa", "entdepa", "dtaddto"]
print(f"No of columns before removing: {len(immigration_df.columns)}")
immigration_df = immigration_df.drop(*unused_cols)
print(f"No of columns after removing: {len(immigration_df.columns)}")

No of columns before removing: 34
No of columns after removing: 15


In [17]:
immigration_df.printSchema()

root
 |-- arrdate: integer (nullable = true)
 |-- depdate: integer (nullable = true)
 |-- i94bir: integer (nullable = true)
 |-- occup: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- visatype: string (nullable = true)
 |-- cicid: long (nullable = false)
 |-- transportation_mode: string (nullable = true)
 |-- arrived_city: string (nullable = true)
 |-- us_address: string (nullable = true)
 |-- origin_city: string (nullable = true)
 |-- traveled_from: string (nullable = true)
 |-- visa_status: string (nullable = true)



#### Rename columns

In [18]:
# Rename columns
immigration_df = immigration_df.withColumnRenamed("arrdate", "arrival_date").withColumnRenamed("depdate", "departure_date").withColumnRenamed("i94bir", "age").withColumnRenamed("occup", "occupation").withColumnRenamed("matflag", "matched_flag")

#### Order the columns in proper sequence

In [19]:
immigration_df = immigration_df.select(["cicid", "origin_city", "traveled_from", "arrived_city", "us_address", "arrival_date", "departure_date", "transportation_mode", "age", "gender",
"visa_status", "occupation", "airline"])

#### Convert sas date format to "YYYY_MM_DD"

In [20]:
def convert_datetime(x):
    try:
        start = datetime(1960, 1, 1)
        return start + timedelta(days=int(x))
    except:
        return None
udf_datetime_from_sas = udf(lambda x: convert_datetime(x), T.DateType())

In [21]:
immigration_df = immigration_df.withColumn(
    "arrival_date", udf_datetime_from_sas(immigration_df.arrival_date)
).withColumn("departure_date", udf_datetime_from_sas(immigration_df.departure_date))


### Exploring us cities demographics

In [22]:
filepath = "./data/us_cities_demographics.csv"
demographic_df = spark.read.csv(filepath,inferSchema=True, header=True, sep=';')

In [23]:
demographic_df.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: double (nullable = true)
 |-- Male Population: integer (nullable = true)
 |-- Female Population: integer (nullable = true)
 |-- Total Population: integer (nullable = true)
 |-- Number of Veterans: integer (nullable = true)
 |-- Foreign-born: integer (nullable = true)
 |-- Average Household Size: double (nullable = true)
 |-- State Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: integer (nullable = true)



#### Removing duplicates: Pivoting column

In [24]:
# Finding types of races
demographic_df.select("race").distinct().show()

+--------------------+
|                race|
+--------------------+
|Black or African-...|
|  Hispanic or Latino|
|               White|
|               Asian|
|American Indian a...|
+--------------------+



In [25]:
# Pivot column Race to different columns
pivot_cols = ["City", "State"]
pivot_df = demographic_df.groupBy(pivot_cols).pivot("Race").sum("Count")

# Joining the pivot 
demographic_df = demographic_df.join(other=pivot_df, on=pivot_cols)

In [26]:
demographic_df.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: double (nullable = true)
 |-- Male Population: integer (nullable = true)
 |-- Female Population: integer (nullable = true)
 |-- Total Population: integer (nullable = true)
 |-- Number of Veterans: integer (nullable = true)
 |-- Foreign-born: integer (nullable = true)
 |-- Average Household Size: double (nullable = true)
 |-- State Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: integer (nullable = true)
 |-- American Indian and Alaska Native: long (nullable = true)
 |-- Asian: long (nullable = true)
 |-- Black or African-American: long (nullable = true)
 |-- Hispanic or Latino: long (nullable = true)
 |-- White: long (nullable = true)



#### Remove unwanted columns from the schema

In [27]:
del_cols = ["median age", "Number of Veterans", "foreign-born", "Average Household Size", "State Code", "race", "count"]
demographic_df = demographic_df.drop(*del_cols)
demographic_df.show(1)



+----------+-----+---------------+-----------------+----------------+---------------------------------+-----+-------------------------+------------------+------+
|      City|State|Male Population|Female Population|Total Population|American Indian and Alaska Native|Asian|Black or African-American|Hispanic or Latino| White|
+----------+-----+---------------+-----------------+----------------+---------------------------------+-----+-------------------------+------------------+------+
|Cincinnati| Ohio|         143654|           154883|          298537|                             3362| 7633|                   133430|              9121|162245|
+----------+-----+---------------+-----------------+----------------+---------------------------------+-----+-------------------------+------------------+------+
only showing top 1 row





#### Dropping duplicates 

In [28]:
# Dropping duplicate by excluding ccid

with_duplicate = demographic_df.count()

demographic_df = demographic_df.dropDuplicates()

without_duplicate = demographic_df.count()
print(f"Total number of deleted duplicate values {with_duplicate - without_duplicate}")
print(f"Total number of availabel records {demographic_df.count()}")

Total number of deleted duplicate values 2295
Total number of availabel records 596


In [29]:
demographic_df.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Male Population: integer (nullable = true)
 |-- Female Population: integer (nullable = true)
 |-- Total Population: integer (nullable = true)
 |-- American Indian and Alaska Native: long (nullable = true)
 |-- Asian: long (nullable = true)
 |-- Black or African-American: long (nullable = true)
 |-- Hispanic or Latino: long (nullable = true)
 |-- White: long (nullable = true)



#### Convert column names

In [30]:
demographic_df = demographic_df.toDF("city", "state", "male_population", "female_population", "total_population", "american_indian_alaska_native", "asian", "black_african_american", "hispanic_latino", "white")

In [31]:
demographic_df.printSchema()

root
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- male_population: integer (nullable = true)
 |-- female_population: integer (nullable = true)
 |-- total_population: integer (nullable = true)
 |-- american_indian_alaska_native: long (nullable = true)
 |-- asian: long (nullable = true)
 |-- black_african_american: long (nullable = true)
 |-- hispanic_latino: long (nullable = true)
 |-- white: long (nullable = true)



#### Filling null with 0

In [32]:
num_cols = ["male_population", "female_population", "total_population",  "american_indian_alaska_native", "asian", "black_african_american", "hispanic_latino", "white"]
demographic_df = demographic_df.fillna(0, num_cols)

#### Add id column

In [33]:
demographic_df = demographic_df.withColumn("demographic_id", monotonically_increasing_id())
demographic_df.printSchema()

root
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- male_population: integer (nullable = true)
 |-- female_population: integer (nullable = true)
 |-- total_population: integer (nullable = true)
 |-- american_indian_alaska_native: long (nullable = true)
 |-- asian: long (nullable = true)
 |-- black_african_american: long (nullable = true)
 |-- hispanic_latino: long (nullable = true)
 |-- white: long (nullable = true)
 |-- demographic_id: long (nullable = false)



#### Reordering the columns

In [34]:
demographic_df = demographic_df.select(["demographic_id", "city", "state", "american_indian_alaska_native",
"asian",
"black_african_american",
"hispanic_latino", "white", "male_population", "female_population", "total_population"])

In [35]:
demographic_df.printSchema()

root
 |-- demographic_id: long (nullable = false)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- american_indian_alaska_native: long (nullable = true)
 |-- asian: long (nullable = true)
 |-- black_african_american: long (nullable = true)
 |-- hispanic_latino: long (nullable = true)
 |-- white: long (nullable = true)
 |-- male_population: integer (nullable = true)
 |-- female_population: integer (nullable = true)
 |-- total_population: integer (nullable = true)



### Exploring airport data

In [36]:
# reading csv file
airport_csv_path = "./data/airport_codes.csv"
airport_df = spark.read.csv(airport_csv_path, header=True, sep=',', ignoreTrailingWhiteSpace=False)

airport_df.printSchema()

root
 |-- ident   : string (nullable = true)
 |-- type           : string (nullable = true)
 |-- name                                                                                                                               : string (nullable = true)
 |-- elevation_ft : string (nullable = true)
 |-- continent : string (nullable = true)
 |-- iso_country : string (nullable = true)
 |-- iso_region : string (nullable = true)
 |-- municipality                                                   : string (nullable = true)
 |-- gps_code : string (nullable = true)
 |-- iata_code : string (nullable = true)
 |-- local_code : string (nullable = true)
 |-- coordinates: string (nullable = true)



#### Removing whitespace in column name

In [37]:
airport_df = airport_df.select([F.col(col).alias(col.replace(' ', '')) for col in airport_df.columns])
airport_df.show(1)

+--------+---------------+--------------------+-------------+----------+------------+-----------+--------------------+---------+----------+-----------+--------------------+
|   ident|           type|                name| elevation_ft| continent| iso_country| iso_region|        municipality| gps_code| iata_code| local_code|         coordinates|
+--------+---------------+--------------------+-------------+----------+------------+-----------+--------------------+---------+----------+-----------+--------------------+
|00A     |heliport       |Total Rf Heliport...|11           |NA        |US          |US-PA      |Bensalem         ...|00A      |          |00A        |-74.9336013793945...|
+--------+---------------+--------------------+-------------+----------+------------+-----------+--------------------+---------+----------+-----------+--------------------+
only showing top 1 row



#### Excluding airport outside of US

In [38]:
airport_df = airport_df.filter(airport_df.iso_country.like("%US%"))
airport_df.count()

22757

#### Add state_code column

In [39]:
get_state_code_udf = udf(lambda x :  x if x is None else x.split("-")[1])
airport_df = airport_df.withColumn("state_code", get_state_code_udf("iso_region"))

#### Retrieve state_name from state code

In [40]:
airport_df = airport_df.withColumn("state", get_state_udf(airport_df.state_code))

#### Remove unwanted columns

In [41]:
airport_delete_cols = ["elevation_ft", "continent", "iso_country", "iso_region" , "gps_code", "iata_code", "local_code", "coordinates", "state_code"]
airport_df = airport_df.drop(*airport_delete_cols)
airport_df.printSchema()

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- state: string (nullable = true)



#### Drop duplicates

In [42]:
# Dropping duplicate by excluding ident

with_duplicate = airport_df.count()

airport_df = airport_df.drop("ident")
airport_df = airport_df.dropDuplicates()

airport_df = airport_df.withColumn("airport_id", monotonically_increasing_id())

without_duplicate = airport_df.count()
print(f"Total number of deleted values {with_duplicate - without_duplicate}")
print(f"Total number of available records {airport_df.count()}")



Total number of deleted values 120




Total number of available records 22637




In [43]:
airport_df.printSchema()

root
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- state: string (nullable = true)
 |-- airport_id: long (nullable = false)



#### Rename munacipality to city

In [44]:
airport_df = airport_df.withColumnRenamed("municipality", "city")

#### Reordering columns

In [45]:
airport_df = airport_df.select(["airport_id", "name", "city", "state", "type"])
airport_df.printSchema()

root
 |-- airport_id: long (nullable = false)
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- type: string (nullable = true)



#### Remove space from columns values

In [46]:
remove_whitespace_udf = udf(lambda x: str(x).strip())
airport_df = (
    airport_df.withColumn(
        "name",
        remove_whitespace_udf(airport_df.name)
        if airport_df.name.isNotNull
        else airport_df.name,
    )
    .withColumn(
        "city",
        remove_whitespace_udf(airport_df.city)
        if airport_df.city.isNotNull
        else airport_df.city,
    )
    .withColumn(
        "state",
        remove_whitespace_udf(airport_df.state)
        if airport_df.state.isNotNull
        else airport_df.state,
    )
    .withColumn(
        "type",
        remove_whitespace_udf(airport_df.type)
        if airport_df.type.isNotNull
        else airport_df.type,
    )
)

#### Exclude unused airport

In [47]:
used_airports = [
    "seaplane_base",
    "medium_airport",
    "small_airport",
    "large_airport",
]
airport_df = airport_df.filter(airport_df.type.isin(used_airports))

In [51]:
airport_df.show(5)



+----------+--------------------+------------+-----+-------------+
|airport_id|                name|        city|state|         type|
+----------+--------------------+------------+-----+-------------+
|         0|        Lowell Field|Anchor Point| None|small_airport|
|         2|Pan Lake Strip Ai...|      Willow| None|small_airport|
|         3|     Meadow STOLport|     Jericho| None|small_airport|
|         5|The 2A Ranch Airport|Ormond Beach| None|small_airport|
|         7|    Pankratz Airport| Springfield| None|small_airport|
+----------+--------------------+------------+-----+-------------+
only showing top 5 rows



21/12/08 02:22:01 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 1177655 ms exceeds timeout 120000 ms
21/12/08 02:22:01 WARN SparkContext: Killing executors is not supported by current scheduler.


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
The conceptual model has been designed keeping into the mind that the model will be used for finding effective advertisement strategies targeted for immigrants. Therefore, I have considered the immigration as a fact table, which consists of details regarding the immigrant. Further, the demographics and airports as dimension tables, since these tables will allow us to estimate the details about the movement of immigrants.

Only looking at the conceptual model we could say that from this model we could easily extract information such as:

- Average age of immigrant of certain visa type
- Total number of Asian living in particular city
- Peak month of travel

The list could go on, and each insights could be effectively used for making advertisement decision.

<img src="./images/er_diagram.png" width="500">

#### 3.2 Mapping Out Data Pipelines

All of the collected data were stored in the s3 data lake without transforming. In order to transform the data into the conceptual model following steps were taken

- Using Amazon EMR the data stored in s3 were transformed as shown in conceptual model
- The transformed data are than stored in s3
- The transformed data stored in s3 is than loaded to the redshift
- The loading of data has been monitored using Apache airflow

    - Tree

    <img src="./images/tree.png" width="500">

    - Graph

    <img src="./images/graph.png" width="500">

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [49]:
# In the src code

#### 4.2 Data Quality Checks

In [50]:
# Data quality checks has been performed in the airflow operator

#### 4.3 Data dictionary 

#### immigration (fact_table)

|   column_name  |                              description                              | old_column_name |
|:--------------:|:---------------------------------------------------------------------:|:---------------:|
|      cicid     |               Unique identifier  of the immigrant record              |      cicid      |
|   origin_city  |                      City of origin of immigrant                      |      i94cit     |
|  traveled_from |                 City from where the immigrant traveled                |      i94res     |
|  arrived_city  |          City where immigrant arrived, extracted from i94port         |        -        |
|   us_address   |                       Address of immigrant in US                      |     i94addr     |
|  arrival_date  |                      Date of arrival of immigrant                     |     arrdate     |
| departure_date |              Date on which the immigrant departed form US             |     depdate     |
|       age      |                          Age of the immigrant                         |       age       |
|     gender     |                        Gender of the immigrant                        |      gender     |
|   visa_status  |                      Visa status of the immigrant                     |     i94visa     |
|    visa_type   |                         Visa type of immigrant                        |     visatype    |
|   occupation   |                     Occupation of immigrant in US                     |      occup      |
|     airline    |              Airline through which the immigrant traveled             |     airline     |
|  matched_flag  | Flag that denotes the user has arrived and departed(or visa expired) |     mat_flag    |


#### demographics (dim_table)

| column_name                   | description                                                     |
|-------------------------------|-----------------------------------------------------------------|
| city_id                       | Unique id of for the city                                       |
| city                          | Name of city                                                    |
| state                         | State of city                                                   |
| american_indian_alaska_native | Total population of American Indian or Alaska Native in the city |
| asian                         | Total population of Asian in the city                           |
| hispanic_latino               | Total population of Hispanic or Latino                          |
| white                         | Total population of White                                       |
| female_population             | Total number of females                                         |
| male_population               | Total number of males                                           |
| total_population              | Total population of the city                                    |

#### airports (dim_table)

| column_name | description                                                 |
|-------------|-------------------------------------------------------------|
| airport_id  | Unique identifier of an airport                             |
| name        | Name of the airport                                         |
| city        | City at which the airport is located, extracted from region |
| state       | State at which the airport is located                       |
| type        | Type of the airport                                         |


#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.

#### Clearly state the rationale for the choice of tools and technologies for the project.

During the development of the proposed smart advertisement system several tools and technologies were used as mentioned [above](##technology-used). I have selected each tools very carefully with consideration of project goal and time. For cloud I have selected amazon because of the following services:

- Amazon s3: It is low cost, highly available, secure, infinitely scalable and easy to manage object storage service. This is perfect because effectiveness of the smart advertisement service will increases with the amount of data thus the service should increase the size of the datasets as much as possible.

- Amazon EMR: This service enables us to easily transform our data using apache spark which is lightning fast analytics engine for big data.

- Amazon Redshift: This service is a fast, easy, and secure cloud data warehousing service that provides high scalability. Since it is fast it is the best choice for developing api endpoints for dashboard and automatic advertisement in near future. 

- Managed Apache Airflow: This service allows running of airflow without maintaining servers. The airflow is extremely important tool for monitoring etl tasks. It provides a visual representation of etl process, which is highly valuable for debugging and making discussion easier with data scientist and data analyst.

Furthermore, I have used notion for project management and github for the version control. These two tool are must used tools for effective delivery of a product. Also, I have used python as a programming language because I love writing codes in python.

#### Propose how often the data should be updated and why ?

The data should be updated once a month because it would be difficult get data from US National Tourism and Trade Office often since the data has to be requested. Also, in the real product we will be using huge amount, performing analysis on one month old data might have nearly same effectiveness.

#### How to approach the problem if the data was increased by 100x.

- Transforming data: Currently in EMR only one master and slave node are running, since the EMR is highly scalable, I would have used appropriate number of slaves. I would also have partitioned the immigration data by month while writing in parquet format.

- Loading data to redshift: I would have increase the size of the warehouse and while copying the data from s3 to redshift I would have used template filed that would allow to load timestamped files from S3 based on the execution time and run backfills.

#### The data populates a dashboard that must be updated on a daily basis by 7am every day.

- Use airflow to schedule and run pipelines

#### The database needed to be accessed by 100+ people.

- I would have used developed endpoints using sql client, lambda function and api gateway.
