# Smart Advertisement Service


## Business scenario

A US advertising consultancy startup, which advises a range of businesses on effective advertising, wants to adopt cutting-edge technology to improve the quality of its services. Recent research by the company found that around 14 percent of the national population are immigrants ([source](https://www.americanimmigrationcouncil.org/research/immigrants-in-the-united-states)). The company has therefore decided to build a data warehouse to support analytics for better advertising consultation. Beyond that, the warehouse will also serve as the brain of backend services that could automate advertising for certain categories.

## Structure of the Project

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

### Step 1: Scope the Project and Gather Data

#### Scope
In this project I will create a cloud data warehouse that supports answering questions through analytics tables and dashboards. Later, the warehouse could be used as a backend for developing automatic advertisement services.

The following steps will be carried out:


#### The data

The project uses data from the following sources:
- I94 immigration data that comes from the US National Tourism and Trade Office.
- U.S. city demographic data that comes from Opensoft.
- World temperature data from Kaggle.
- Airport code data (`airport-codes_csv.csv`), explored in Step 2.


### Architecture

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 



In [71]:
# Do all imports and installs here
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from datetime import datetime, timedelta
from pyspark.sql.functions import desc, monotonically_increasing_id, udf, to_date, from_unixtime, trim, col

In [2]:
# Create spark session
spark = SparkSession.builder.config(
    "spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0"
).getOrCreate()


In [3]:
# Read in the data here
# partial
# immigration_df = spark.read.parquet("./data/sas_data/part-00000-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet")
# all
immigration_df = spark.read.parquet("./data/sas_data/")




In [4]:
immigration_df.count()



3096313

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

### Exploring immigration data

In [61]:
immigration_df.select(["i94cit","i94res"]).show(5)


#### Remove unwanted columns

In [6]:
# Columns not needed for the analysis
unused_cols = ["i94yr", "i94mon", "count", "fltno", "insnum", "entdepd", "biryear", "dtadfile", "visapost", "entdepa"]
print(f"No of columns before removing: {len(immigration_df.columns)}")
immigration_df = immigration_df.drop(*unused_cols)
print(f"No of columns after removing: {len(immigration_df.columns)}")

No of columns before removing: 28
No of columns after removing: 18


#### Convert double to integer

In [7]:
int_cols = ["cicid", "i94cit", "i94res", "arrdate", "i94mode", "depdate", "i94bir", "i94visa"]
# Additional options "dtadfile", "daddto"

for col_name in int_cols:
    immigration_df = immigration_df.withColumn(col_name, immigration_df[col_name].cast(IntegerType()))

immigration_df.select(int_cols).show(5)

+-------+------+------+-------+-------+-------+------+-------+
|  cicid|i94cit|i94res|arrdate|i94mode|depdate|i94bir|i94visa|
+-------+------+------+-------+-------+-------+------+-------+
|5748517|   245|   438|  20574|      1|  20582|    40|      1|
|5748518|   245|   438|  20574|      1|  20591|    32|      1|
|5748519|   245|   438|  20574|      1|  20582|    29|      1|
|5748520|   245|   438|  20574|      1|  20588|    29|      1|
|5748521|   245|   438|  20574|      1|  20588|    28|      1|
+-------+------+------+-------+-------+-------+------+-------+
only showing top 5 rows



#### Remove duplicate values

- Dropping duplicates while keeping the original `cicid` removes nothing, because `cicid` is unique for every row.
- Dropping `cicid` first, removing duplicates, and then recreating `cicid` does remove the duplicate rows.

In [8]:
# Dropping duplicates while keeping cicid
with_duplicate = immigration_df.count()

# dropDuplicates() returns a new DataFrame, so the result must be assigned
immigration_df = immigration_df.dropDuplicates()

without_duplicate = immigration_df.count()
print(f"Total number of deleted duplicate values {with_duplicate - without_duplicate}")
print(f"Total number of available records {without_duplicate}")

Total number of deleted duplicate values 0
Total number of available records 3096313


In [9]:
# Dropping duplicates by excluding cicid

with_duplicate = immigration_df.count()

immigration_df = immigration_df.drop("cicid")
immigration_df = immigration_df.dropDuplicates()

immigration_df = immigration_df.withColumn("cicid", monotonically_increasing_id())

without_duplicate = immigration_df.count()
print(f"Total number of deleted duplicate values {with_duplicate - without_duplicate}")
print(f"Total number of available records {immigration_df.count()}")




Total number of deleted duplicate values 7
Total number of available records 3096306
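The effect of excluding the key can be illustrated without Spark. The following is a minimal plain-Python analogy, not the notebook's actual code: `dedupe_ignoring_key` and the sample rows are hypothetical and only show why rows that differ solely in `cicid` survive a full-row comparison but collapse once the key is ignored.

```python
def dedupe_ignoring_key(rows, key="cicid"):
    """Drop rows that are identical once `key` is ignored, then renumber."""
    seen = set()
    unique_rows = []
    for row in rows:
        # Fingerprint of the row without its surrogate key.
        fingerprint = tuple(sorted((k, v) for k, v in row.items() if k != key))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique_rows.append({k: v for k, v in row.items() if k != key})
    # Assign fresh keys, analogous to recreating cicid after dropDuplicates().
    return [dict(row, **{key: new_id}) for new_id, row in enumerate(unique_rows)]


# Hypothetical sample rows, not taken from the I94 data.
rows = [
    {"cicid": 1, "i94port": "DAL", "i94bir": 57},
    {"cicid": 2, "i94port": "DAL", "i94bir": 57},  # same record, different key
    {"cicid": 3, "i94port": "LVG", "i94bir": 59},
]
deduped = dedupe_ignoring_key(rows)
print(len(rows) - len(deduped))  # 1
```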




#### Convert SAS date format to "YYYY-MM-DD"

In [10]:
date_format = "%Y-%m-%d"
date_cols = ["arrdate", "depdate"]
convert_sas_udf = udf(lambda x: x if x is None else (timedelta(days=x) + datetime(1960, 1, 1)).strftime(date_format))


In [11]:
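def convert_datetime(x):
    """Convert a SAS day count (days since 1960-01-01) to a date."""
    try:
        start = datetime(1960, 1, 1)
        return (start + timedelta(days=int(x))).date()
    except (TypeError, ValueError, OverflowError):
        # Missing or malformed values become null
        return None
udf_datetime_from_sas = udf(convert_datetime, T.DateType())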

In [12]:
immigration_df = (
    immigration_df
    .withColumn("arrdate", udf_datetime_from_sas(immigration_df.arrdate))
    .withColumn("depdate", udf_datetime_from_sas(immigration_df.depdate))
)
immigration_df.select(date_cols).show(5)



+----------+-------+
|   arrdate|depdate|
+----------+-------+
|2016-04-09|   null|
|2016-04-30|   null|
|2016-04-23|   null|
|2016-04-05|   null|
|2016-04-02|   null|
+----------+-------+
only showing top 5 rows
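The conversion above relies on SAS dates being stored as a number of days since 1960-01-01. The sketch below reproduces that arithmetic in plain Python so it can be checked by hand; `sas_days_to_date` and `SAS_EPOCH` are illustrative names, and the input `20574` is one of the `arrdate` values shown earlier.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS counts days from this date


def sas_days_to_date(days):
    """Convert a SAS day count to a date; missing or invalid input gives None."""
    if days is None:
        return None
    try:
        return SAS_EPOCH + timedelta(days=int(days))
    except (TypeError, ValueError, OverflowError):
        return None


print(sas_days_to_date(20574))  # 2016-04-30
print(sas_days_to_date(None))   # None
```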





#### Append new columns

In [13]:
# Appending new columns

# Calculate age and append

# No of days stayed and calculate
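The cell above is still a placeholder. As one possible reading of the "days stayed" item, the sketch below subtracts the arrival date from the departure date in plain Python; `stay_length_days` is a hypothetical helper, and treating a missing departure as "unknown" is an assumption. In Spark, the same subtraction could presumably be expressed with `F.datediff` on the converted `depdate` and `arrdate` columns.

```python
from datetime import date


def stay_length_days(arrival, departure):
    """Days between arrival and departure, or None if either is missing."""
    if arrival is None or departure is None:
        return None
    return (departure - arrival).days


print(stay_length_days(date(2016, 4, 30), date(2016, 5, 8)))  # 8
print(stay_length_days(date(2016, 4, 9), None))               # None
```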


### Exploring US cities demographics

In [14]:
filepath = "./data/us-cities-demographics.csv"
demographic_df = spark.read.csv(filepath,inferSchema=True, header=True, sep=';')

In [15]:
demographic_df.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: double (nullable = true)
 |-- Male Population: integer (nullable = true)
 |-- Female Population: integer (nullable = true)
 |-- Total Population: integer (nullable = true)
 |-- Number of Veterans: integer (nullable = true)
 |-- Foreign-born: integer (nullable = true)
 |-- Average Household Size: double (nullable = true)
 |-- State Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: integer (nullable = true)



#### Removing duplicates: Pivoting column

In [16]:
# Finding types of races
demographic_df.select("race").distinct().show()

+--------------------+
|                race|
+--------------------+
|Black or African-...|
|  Hispanic or Latino|
|               White|
|               Asian|
|American Indian a...|
+--------------------+



In [17]:
# Pivot column Race to different columns
pivot_cols = ["City", "State"]
pivot_df = demographic_df.groupBy(pivot_cols).pivot("Race").sum("Count")

# Joining the pivot 
demographic_df = demographic_df.join(other=pivot_df, on=pivot_cols)

In [18]:
demographic_df.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: double (nullable = true)
 |-- Male Population: integer (nullable = true)
 |-- Female Population: integer (nullable = true)
 |-- Total Population: integer (nullable = true)
 |-- Number of Veterans: integer (nullable = true)
 |-- Foreign-born: integer (nullable = true)
 |-- Average Household Size: double (nullable = true)
 |-- State Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: integer (nullable = true)
 |-- American Indian and Alaska Native: long (nullable = true)
 |-- Asian: long (nullable = true)
 |-- Black or African-American: long (nullable = true)
 |-- Hispanic or Latino: long (nullable = true)
 |-- White: long (nullable = true)
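The pivot turns one row per city, state and race into one row per city and state with a column for each race. A small plain-Python sketch of the same reshaping is shown here for intuition; `pivot_race_counts` is an illustrative name, and the two sample rows reuse the Cincinnati figures that appear in this notebook.

```python
from collections import defaultdict


def pivot_race_counts(rows):
    """Turn one row per (city, state, race) into one dict per (city, state)."""
    pivoted = defaultdict(dict)
    for row in rows:
        key = (row["City"], row["State"])
        # Sum in case the same race appears more than once for a city.
        pivoted[key][row["Race"]] = pivoted[key].get(row["Race"], 0) + row["Count"]
    return dict(pivoted)


rows = [
    {"City": "Cincinnati", "State": "Ohio", "Race": "Asian", "Count": 7633},
    {"City": "Cincinnati", "State": "Ohio", "Race": "White", "Count": 162245},
]
print(pivot_race_counts(rows))
# {('Cincinnati', 'Ohio'): {'Asian': 7633, 'White': 162245}}
```

Because the pivoted table is joined back onto the original rows, each city still appears once per race until the `Race` and `Count` columns are dropped and duplicates are removed.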



#### Remove unwanted columns from the schema

In [19]:
del_cols = ["median age", "Number of Veterans", "foreign-born", "Average Household Size", "State Code", "race", "count"]
demographic_df = demographic_df.drop(*del_cols)
demographic_df.show(1)

+----------+-----+---------------+-----------------+----------------+---------------------------------+-----+-------------------------+------------------+------+
|      City|State|Male Population|Female Population|Total Population|American Indian and Alaska Native|Asian|Black or African-American|Hispanic or Latino| White|
+----------+-----+---------------+-----------------+----------------+---------------------------------+-----+-------------------------+------------------+------+
|Cincinnati| Ohio|         143654|           154883|          298537|                             3362| 7633|                   133430|              9121|162245|
+----------+-----+---------------+-----------------+----------------+---------------------------------+-----+-------------------------+------------------+------+
only showing top 1 row



#### Dropping duplicates 

In [20]:
# Dropping the duplicate rows left over after joining the pivoted table

with_duplicate = demographic_df.count()

demographic_df = demographic_df.dropDuplicates()

without_duplicate = demographic_df.count()
print(f"Total number of deleted duplicate values {with_duplicate - without_duplicate}")
print(f"Total number of available records {without_duplicate}")

Total number of deleted duplicate values 2295
Total number of available records 596


In [21]:
demographic_df.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Male Population: integer (nullable = true)
 |-- Female Population: integer (nullable = true)
 |-- Total Population: integer (nullable = true)
 |-- American Indian and Alaska Native: long (nullable = true)
 |-- Asian: long (nullable = true)
 |-- Black or African-American: long (nullable = true)
 |-- Hispanic or Latino: long (nullable = true)
 |-- White: long (nullable = true)



#### Convert column names

In [22]:
demographic_df = demographic_df.toDF("city", "state", "male_population", "female_population", "total_population", "american_indian_alaska_native", "asian", "black_african_american", "hispanic_latino", "white")

In [23]:
demographic_df.printSchema()

root
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- male_population: integer (nullable = true)
 |-- female_population: integer (nullable = true)
 |-- total_population: integer (nullable = true)
 |-- american_indian_alaska_native: long (nullable = true)
 |-- asian: long (nullable = true)
 |-- black_african_american: long (nullable = true)
 |-- hispanic_latino: long (nullable = true)
 |-- white: long (nullable = true)
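The new names were typed by hand, which depends on the column order staying the same. As a hedged alternative, the rule behind them (lowercase, drop "and"/"or", join words with underscores) can be written as a small function; `to_snake_case` and `STOP_WORDS` are hypothetical names, not part of the notebook.

```python
import re

# Words dropped from the names, matching the hand-written list above.
STOP_WORDS = {"and", "or"}


def to_snake_case(name):
    """Lowercase a column name, drop 'and'/'or', and join words with '_'."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return "_".join(w for w in words if w not in STOP_WORDS)


print(to_snake_case("Black or African-American"))  # black_african_american
```

Under this assumption the rename could be derived from the existing columns, for example `demographic_df.toDF(*[to_snake_case(c) for c in demographic_df.columns])`.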



#### Filling null with 0

In [24]:
num_cols = ["male_population", "female_population", "total_population",  "american_indian_alaska_native", "asian", "black_african_american", "hispanic_latino", "white"]
demographic_df = demographic_df.fillna(0, num_cols)

#### Add id column

In [25]:
demographic_df = demographic_df.withColumn("id", monotonically_increasing_id())
demographic_df.printSchema()

root
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- male_population: integer (nullable = true)
 |-- female_population: integer (nullable = true)
 |-- total_population: integer (nullable = true)
 |-- american_indian_alaska_native: long (nullable = true)
 |-- asian: long (nullable = true)
 |-- black_african_american: long (nullable = true)
 |-- hispanic_latino: long (nullable = true)
 |-- white: long (nullable = true)
 |-- id: long (nullable = false)
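Note that `monotonically_increasing_id()` guarantees unique, increasing values but not consecutive ones. The Spark documentation describes the current implementation as placing the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below only mimics that documented layout to show why gaps appear; `monotonic_id` is an illustrative function, not a Spark API.

```python
def monotonic_id(partition_id, record_number):
    """Mimic the documented layout: partition ID in the upper bits, record number in the lower 33."""
    return (partition_id << 33) + record_number


print(monotonic_id(0, 2))  # 2
print(monotonic_id(1, 0))  # 8589934592
```

If consecutive IDs are required downstream, a different numbering scheme would be needed.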



#### Reordering the columns

In [26]:
demographic_df = demographic_df.select(["id", "city", "state", "american_indian_alaska_native",
"asian",
"black_african_american",
"hispanic_latino", "white", "male_population", "female_population", "total_population"])

In [27]:
demographic_df.printSchema()

root
 |-- id: long (nullable = false)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- american_indian_alaska_native: long (nullable = true)
 |-- asian: long (nullable = true)
 |-- black_african_american: long (nullable = true)
 |-- hispanic_latino: long (nullable = true)
 |-- white: long (nullable = true)
 |-- male_population: integer (nullable = true)
 |-- female_population: integer (nullable = true)
 |-- total_population: integer (nullable = true)



### Exploring airport data

In [28]:
# reading csv file
airport_csv_path = "./data/airport-codes_csv.csv"
airport_df = spark.read.csv(airport_csv_path, header=True, sep=',', ignoreTrailingWhiteSpace=False)

airport_df.printSchema()

root
 |-- ident   : string (nullable = true)
 |-- type           : string (nullable = true)
 |-- name                                                                                                                               : string (nullable = true)
 |-- elevation_ft : string (nullable = true)
 |-- continent : string (nullable = true)
 |-- iso_country : string (nullable = true)
 |-- iso_region : string (nullable = true)
 |-- municipality                                                   : string (nullable = true)
 |-- gps_code : string (nullable = true)
 |-- iata_code : string (nullable = true)
 |-- local_code : string (nullable = true)
 |-- coordinates: string (nullable = true)



#### Removing whitespace in column name

In [29]:
airport_df = airport_df.select([F.col(col).alias(col.replace(' ', '')) for col in airport_df.columns])
airport_df.show(1)

+--------+---------------+--------------------+-------------+----------+------------+-----------+--------------------+---------+----------+-----------+--------------------+
|   ident|           type|                name| elevation_ft| continent| iso_country| iso_region|        municipality| gps_code| iata_code| local_code|         coordinates|
+--------+---------------+--------------------+-------------+----------+------------+-----------+--------------------+---------+----------+-----------+--------------------+
|00A     |heliport       |Total Rf Heliport...|11           |NA        |US          |US-PA      |Bensalem         ...|00A      |          |00A        |-74.9336013793945...|
+--------+---------------+--------------------+-------------+----------+------------+-----------+--------------------+---------+----------+-----------+--------------------+
only showing top 1 row
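The output shows that the values, not only the column names, carry trailing padding (for example `00A     `). The sketch below illustrates stripping both on a tiny CSV sample using the standard library; `read_trimmed` and the shortened sample text are assumptions for illustration. In Spark, applying the already imported `trim` function to each string column would be a likely equivalent.

```python
import csv
import io

# Two padded lines shaped like the airport file (columns shortened for the example).
raw = "ident   ,type           ,iso_country \n00A     ,heliport       ,US          \n"


def read_trimmed(text):
    """Parse CSV text and strip padding from both headers and values."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    return [dict(zip(header, (v.strip() for v in row))) for row in reader]


print(read_trimmed(raw))
# [{'ident': '00A', 'type': 'heliport', 'iso_country': 'US'}]
```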



#### Excluding airports outside the US

In [30]:
airport_df = airport_df.filter(airport_df.iso_country.like("%US%"))
airport_df.count()

22757

#### Add state_code column

In [31]:
get_state_udf = udf(lambda x :  x if x is None else x.split("-")[1])
airport_df = airport_df.withColumn("state_code", get_state_udf("iso_region"))
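The lambda above assumes every region contains a `-` and keeps any padding that follows the state code. A slightly more defensive variant is sketched here in plain Python; `state_code_from_region` is a hypothetical helper, and returning `None` for regions without a `-` is an assumption about the desired behavior.

```python
def state_code_from_region(iso_region):
    """Return the part after the first '-' of an ISO region such as 'US-PA', without padding."""
    if iso_region is None:
        return None
    parts = iso_region.strip().split("-", 1)
    # Regions without a '-' have no state part.
    return parts[1] if len(parts) == 2 else None


print(state_code_from_region("US-PA      "))  # PA
print(state_code_from_region("US"))           # None
```

The function could be wrapped with `udf(...)` in the same way as the original lambda.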

#### Remove unwanted columns

In [32]:
airport_delete_cols = ["elevation_ft", "continent", "iso_country", "iso_region" , "gps_code", "iata_code", "local_code", "coordinates"]
airport_df = airport_df.drop(*airport_delete_cols)
airport_df.printSchema()

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- state_code: string (nullable = true)



#### Drop duplicates

In [33]:
# Dropping duplicate by excluding ident

with_duplicate = airport_df.count()

airport_df = airport_df.drop("ident")
airport_df = airport_df.dropDuplicates()

airport_df = airport_df.withColumn("id", monotonically_increasing_id())

without_duplicate = airport_df.count()
print(f"Total number of deleted values {with_duplicate - without_duplicate}")
print(f"Total number of available records {airport_df.count()}")



Total number of deleted values 14
Total number of available records 22743


In [34]:
airport_df.printSchema()

root
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- id: long (nullable = false)



#### Reordering columns

In [35]:
airport_df = airport_df.select(["id", "name", "type", "state_code", "municipality"])
airport_df.printSchema()

root
 |-- id: long (nullable = false)
 |-- name: string (nullable = true)
 |-- type: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- municipality: string (nullable = true)



### Testing

In [42]:
immigration_df.filter(immigration_df.i94visa > 1).show(20)



+------+------+-------+----------+-------+-------+-------+------+-------+-----+-------+-------+--------+------+-------+---------------+--------+-----+
|i94cit|i94res|i94port|   arrdate|i94mode|i94addr|depdate|i94bir|i94visa|occup|entdepu|matflag| dtaddto|gender|airline|         admnum|visatype|cicid|
+------+------+-------+----------+-------+-------+-------+------+-------+-----+-------+-------+--------+------+-------+---------------+--------+-----+
|   582|   582|    DAL|2016-04-03|      1|   null|   null|    57|      2| null|   null|   null|10022016|     M|     AA| 9.265410473E10|      B2|  126|
|   582|   582|    LVG|2016-04-03|      1|   null|   null|    59|      2| null|   null|   null|10022016|     M|     AM| 9.264889993E10|      B2|  127|
|   582|   582|    LVG|2016-04-03|      2|   null|   null|     9|      2| null|   null|   null|10022016|     F|    VES| 9.265079733E10|      B2|  128|
|   582|   582|    LVG|2016-04-03|      2|   null|   null|    15|      2| null|   null|   null



In [57]:
airport_df.select("type").distinct().show()

+---------------+
|           type|
+---------------+
|seaplane_base  |
|medium_airport |
|balloonport    |
|small_airport  |
|closed         |
|heliport       |
|large_airport  |
+---------------+



In [44]:
immigration_df.select("i94mode").distinct().show()

+-------+
|i94mode|
+-------+
|   null|
|      1|
|      3|
|      9|
|      2|
+-------+



In [37]:
immigration_df.select("i94visa").distinct().show()

+-------+
|i94visa|
+-------+
|      1|
|      3|
|      2|
+-------+



In [63]:
demographic_df.select("city").distinct().count()

567

In [38]:
immigration_df.select("matflag").distinct().show()


+-------+
|matflag|
+-------+
|   null|
|      M|
+-------+



In [72]:
demographic_df.groupBy("city").count().sort(col("count").desc()).show()


+------------+-----+
|        city|count|
+------------+-----+
|    Columbia|    3|
| Bloomington|    3|
| Springfield|    3|
|       Allen|    2|
|  Wilmington|    2|
| Westminster|    2|
|   Lafayette|    2|
|   Arlington|    2|
|     Jackson|    2|
|     Norwalk|    2|
| Kansas City|    2|
|      Albany|    2|
|Jacksonville|    2|
|      Aurora|    2|
|    Pasadena|    2|
|   Rochester|    2|
|    Lakewood|    2|
|Fayetteville|    2|
|      Peoria|    2|
|    Columbus|    2|
+------------+-----+
only showing top 20 rows



In [74]:
demographic_df.filter(demographic_df.city == "Columbia").show()

+---+--------+--------------+-----------------------------+-----+----------------------+---------------+-----+---------------+-----------------+----------------+
| id|    city|         state|american_indian_alaska_native|asian|black_african_american|hispanic_latino|white|male_population|female_population|total_population|
+---+--------+--------------+-----------------------------+-----+----------------------+---------------+-----+---------------+-----------------+----------------+
|136|Columbia|      Maryland|                          488|17821|                 30075|           8033|58343|          52202|            51265|          103467|
|340|Columbia|      Missouri|                         1713| 8673|                 15489|           4956|96067|          56544|            62554|          119098|
|500|Columbia|South Carolina|                         1420| 3501|                 56398|           7545|73232|          67686|            65707|          133393|
+---+--------+--------------

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [39]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [40]:
# Perform quality checks here
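As a starting point for this cell, the count and integrity checks listed above could look roughly like the following. This is a plain-Python sketch over a list of records rather than the notebook's Spark tables; `check_table`, its parameters, and the sample records are all hypothetical.

```python
def check_table(rows, key, required_columns):
    """Return a list of problems found in `rows`; an empty list means all checks passed."""
    problems = []
    if not rows:
        return ["table is empty"]
    keys = [row.get(key) for row in rows]
    if len(set(keys)) != len(keys):
        problems.append(f"duplicate values in key column '{key}'")
    for column in required_columns:
        if any(row.get(column) is None for row in rows):
            problems.append(f"null values in required column '{column}'")
    return problems


# Hypothetical records shaped like the cleaned airport table.
airports = [
    {"id": 0, "name": "Example Heliport", "state_code": "PA"},
    {"id": 1, "name": None, "state_code": "PA"},
]
print(check_table(airports, key="id", required_columns=["name", "state_code"]))
# ["null values in required column 'name'"]
```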

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.