# OFLC H-1B Program Data and Job Board Sites
## Data Engineering Capstone Project
### Author - Salinee Kingbaisomboon

#### Project Summary
The primary purpose of this project is to prepare a data structure that can be viewed and analysed by many parties, such as H1B job seekers, international students, HR recruiters and members of the public who want to apply this data to their own interests, for example, **planning for in-demand career paths**, **H1B workers trend prediction**, **H1B processing time analysis & forecast** and so on.

This project focuses on the end-to-end **Data Pipeline Process**: <i>integrating data from various sources</i>, performing <i>ETL (Extract, Transform and Load) on these large datasets</i> to produce a new table schema, creating **parquet files** from the new tables and, finally, writing these files to a <i>data lake</i> on **Amazon S3**.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

# Abstract

The H-1B is a visa in the United States under the Immigration and Nationality Act, section 101 that allows U.S. employers to temporarily employ foreign workers in specialty occupations. A specialty occupation requires the application of specialized knowledge and a bachelor's degree or the equivalent of work experience. The duration of stay is three years, extendable to six years; after which the visa holder may need to reapply. Laws limit the number of H-1B visas that are issued each year: 188,100 new and initial H-1B visas were issued in 2019. Employers must generally withhold Social Security and Medicare taxes from the wages paid to employees in H-1B status.

The H-1B visa has its roots in the H1 visa of the Immigration and Nationality Act of 1952; the split between H-1A (for nurses) and H-1B was created by the Immigration Act of 1990. 65,000 H-1B visas were made available each fiscal year, out of which employers could apply through Labor Condition Applications. Additional modifications to H-1B rules were made by legislation in 1998, 2000, in 2003 for Singapore and Chile, in the H-1B Visa Reform Act of 2004, 2008, and 2009. United States Citizenship and Immigration Services has modified the rules in the years since then.<i>[1]</i>

<i>[1] Source: https://en.wikipedia.org/wiki/H-1B_visa</i>

# Understanding the Important Role that H-1B Workers Play in the U.S. Economy

Foreign workers fill a critical need in the U.S. labor market—particularly in the Science, Technology, Engineering, and Math (STEM) fields.

<img src="https://www.urban.org/sites/default/files/styles/feature2_full_hero/public/feature2/jobs-feature-header-1700x700.png?itok=8lB15_K3" width="500"/>
<center><i>Source: https://www.urban.org/features/how-government-jobs-programs-could-boost-employment</i></center>

The United States has a successful economic system. Foreign-born workers of all types and skills, from every corner of the globe, have joined with American-born workers to build it. Skilled immigrants’ contributions to the U.S. economy help create new jobs and new opportunities for economic expansion. Indeed, H-1B workers **positively** impact the U.S. economy and the employment opportunities of American-born workers.

The skills that H-1B workers bring with them can be critical in responding to national emergencies. For instance, over the past decade (FY 2010-FY 2019), eight companies that are currently trying to develop a coronavirus vaccine—Gilead Sciences, Moderna Therapeutics, GlaxoSmithKline, Inovio, Johnson and Johnson Pharmaceuticals, Regeneron, Vir Therapeutics, and Sanofi—received approvals for 3,310 biochemists, biophysicists, chemists, and other scientists through the H-1B program.<i>[2]</i>

<i>[2] Source: https://www.americanimmigrationcouncil.org/research/h1b-visa-program-fact-sheet#:~:text=Skilled%20immigrants'%20contributions%20to%20the,opportunities%20of%20native%2Dborn%20workers.</i>

# Data Engineering Process

### Step 0: Import all necessary libraries and perform the installation

In [1]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import xml.etree.ElementTree as et

import configparser
from datetime import datetime
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, monotonically_increasing_id, lit
from pyspark.sql.functions import date_format, to_date, unix_timestamp, from_unixtime
from pyspark.sql.types import FloatType, IntegerType, TimestampType, StringType, LongType, DateType
from pyspark.sql import Row

import warnings
warnings.filterwarnings('ignore')

### Step 1: Scope the Project and Gather Data

#### Scope  
**<u><font color=blue>Plan for the Project:</font></u>**
1. Gather three datasets which will be used to construct a new data model.
    - 1.1 The **H1B** dataset, which will be used as the primary data source for the **fact table**.
    - 1.2 Two **job posting** datasets, which will provide the data for the **dimension tables**. We want to show the in-demand or trending jobs in the current market and relate them back to the **H1B** jobs that employers in the U.S. are hiring for.
2. Explore all the datasets and perform data wrangling (discovering, cleaning, enriching, validating and so on).
3. Design a new database schema that will serve as the data source for users and help them answer questions such as:
    - 3.1 What **job titles** and **industries** are in demand for hiring?
    - 3.2 In which **cities** or **states** are those jobs?
    - 3.3 How long does the **H1B** application process take from the submitted date until the decision date?
    
    These questions will help many people (such as international students) decide which program of study to choose, where to look for a job after graduating and so on.
4. Create Python scripts to perform the **Data Pipeline** processes.
    - 4.1 The main goal is to create a very simple **Data Lake** on **AWS S3** by gathering various datasets, performing **ETL** on the raw data and converting it into a new schema with new entities and relationships.
    - 4.2 We will use **Spark** (local mode) to ingest all data, then extract and transform it into the designated fact & dimension tables.
    - 4.3 Then, we will generate **parquet files** for each table. All these files will be written to **AWS S3** via the **Spark session**.
5. The end result will be a complete new dataset residing on the data lake, accessible to any interested party.
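
The processing-time question in the plan above reduces to a date difference between the submission date and the decision date. The following is a minimal sketch, assuming the dates are stored as `M/D/YYYY` strings as they appear in the H1B sample rows shown later; the helper name `processing_days` is hypothetical and not part of the project's scripts.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumed format of the raw date strings, e.g. "8/3/2011"


def processing_days(case_submitted: str, decision_date: str) -> int:
    """Return the number of days between submission and decision."""
    submitted = datetime.strptime(case_submitted, DATE_FORMAT)
    decided = datetime.strptime(decision_date, DATE_FORMAT)
    return (decided - submitted).days


# First sample row of the H1B dataset: submitted 8/3/2011, decided 8/9/2011
print(processing_days("8/3/2011", "8/9/2011"))
```

In the actual pipeline the same calculation would more likely be done on whole columns rather than row by row.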

**<u><font color=blue>Data used in this Project:</font></u>**

There are three datasets used in this project.

All of the datasets used in this project are from <a href="https://www.kaggle.com" target="_blank">Kaggle</a>, a subsidiary of <a href="https://www.google.com" target="_blank">Google LLC</a> and an online community of data scientists and machine learning practitioners.<i>[3]</i>

**<u><font color=blue>End Solution:</font></u>**
This project will be a very simple **Data Pipeline Solution** whose end result will be available for querying through **Data Lake Storage on AWS** via an **S3** bucket.

**<u><font color=blue>Tools used in this Project:</font></u>**
- AWS regular account
- AWS S3
- Jupyter Notebook
- Spark (built-in local mode)

<i>[3] Source: https://en.wikipedia.org/wiki/Kaggle.</i>

<i>[4] Source: https://en.wikipedia.org/wiki/Star_schema#:~:text=The%20star%20schema%20is%20an,it%20representing%20the%20star's%20points..</i>

#### Describe and Gather Data
Below are the three datasets I'll use in this project.

1. **OFLC H-1B Program Data (2011-2018)**
   - Source: <a href="https://www.kaggle.com/rakeshchintha/oflc-h1b-data" target="_blank">Kaggle</a>
   - File Name: **h1b_data_fy2011_fy2018_20190401.csv**
   - File Extension: csv
   - This dataset contains 8 years' worth of H-1B LCA petition data from 2011-2018, with more than **4 million records**.<br />
       **Available Fields**:
       - Fiscal Year
       - Case Status
       - Case Submitted
       - Decision Date
       - Employer Name
       - Employer City
       - Employer State
       - Employer Zip
       - Job Title
       - SOC Code <i>(More details: https://en.wikipedia.org/wiki/Standard_Occupational_Classification_System)</i>
       - SOC Name
       - NAICS Code <i>(More details: https://en.wikipedia.org/wiki/North_American_Industry_Classification_System)</i>
       - Fulltime Position
       - Prevailing Wage
       - Prevailing Wage Unit
       - Wage From
       - Wage To
       - Wage Unit
       - Work City
       - Work State
       - Work Zip
2. **Indeed Job Posting Dataset**
   - Source: <a href="https://www.kaggle.com/promptcloud/indeed-job-posting-dataset" target="_blank">Kaggle</a>
   - File Name: **marketing_sample_for_trulia_com-real_estate__20190901_20191031__30k_data.csv**
   - File Extension: csv
   - This dataset contains around 30K job posting records from https://www.indeed.com/, covering the date range from 1st Aug 2019 to 31st Oct 2019.<br />
       **Available Fields**:
       - Job Title
       - Job Description
       - Job Type
       - Categories
       - Location
       - City
       - State
       - Country
       - Zip Code
       - Address
       - Salary From
       - Salary To
       - Salary Period
       - Apply Url
       - Apply Email
       - Employees
       - Industry
       - Company Name
       - Employer Email
       - Employer Website
       - Employer Phone
       - Employer Logo
       - Company description
       - Employer Location
       - Employer City
       - Employer State
       - Employer Country
       - Employer Zip Code
       - Uniq Id
       - Crawl Timestamp 
3. **Job Posting On Amazon Jobs**
   - Source: <a href="https://www.kaggle.com/promptcloud/job-posting-on-amazon-jobs" target="_blank">Kaggle</a>
   - File Name: **amazon_com-jobs__20190901_20191231_sample.xml**
   - File Extension: xml
   - This dataset contains around 30K job posting records from https://amazon.com/, covering the date range from 1st Sep 2019 to 31st December 2019 (the sample file used in this project contains 50 records).<br />
       **Available Fields**:
       - Unique ID
       - Crawl TimeStamp
       - Job URL
       - Title
       - Description
       - Location
       - Category
       - Job Id
       - Company Name
   - Note: The file extension is xml and we are interested only in the **record** tag and its children.
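
To illustrate the note above, here is a minimal sketch of reading `record` nodes with `xml.etree.ElementTree`. The inline XML is a hypothetical two-record sample, and the `<root>`/`<page>`/`<record>` nesting is an assumption about the file layout; the helper `parse_records` is illustrative only. Child nodes that are absent from a record are mapped to `None` instead of raising an error.

```python
import xml.etree.ElementTree as et

# Hypothetical sample mirroring the assumed <page>/<record> layout
SAMPLE_XML = """
<root>
  <page>
    <record>
      <uniq_id>a1</uniq_id>
      <title>Software Development Engineer</title>
      <location>US, WA, Seattle</location>
    </record>
    <record>
      <uniq_id>b2</uniq_id>
      <title>Instock Manager</title>
    </record>
  </page>
</root>
"""

FIELDS = ["uniq_id", "title", "location"]


def parse_records(xml_text, fields):
    """Return one dict per <record>; absent child nodes map to None."""
    root = et.fromstring(xml_text)
    rows = []
    for record in root.iter("record"):
        row = {}
        for field in fields:
            node = record.find(field)
            # Check the child node itself, not the record, before reading .text
            row[field] = node.text if node is not None else None
        rows.append(row)
    return rows


rows = parse_records(SAMPLE_XML, FIELDS)
print(rows[1])  # the second record has no <location> child
```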

### Step 2: Explore and Assess the Data
In this step, we will explore each of our datasets using the **Pandas** library in order to identify data quality issues. We will perform four tasks:
1. Explore each dataset
2. Identify missing values
3. Identify duplicate data
4. Perform data cleansing

#### 1. Explore the Data

##### 1.1 OFLC H-1B Program Data (2011-2018)

In [2]:
# Read data
h1b_pandas_df = pd.read_csv('h1b_data_fy2011_fy2018_20190401.csv', engine='python')
h1b_pandas_df.head()

Unnamed: 0,fiscal_year,case_status,case_submitted,decision_date,emp_name,emp_city,emp_state,emp_zip,job_title,soc_code,...,naics_code,full_time_position,prevailing_wage,pw_unit,wage_from,wage_to,wage_unit,work_city,work_state,work_zip
0,2011,C,8/3/2011,8/9/2011,WILLISTON NORTHAMPTON SCHOOL,EASTHAMPTON,MA,1027.0,CHINESE TEACHER,,...,611110,1.0,23350.0,Y,40400.0,,Y,EASTHAMPTON,MA,
1,2011,C,1/11/2011,1/18/2011,"NYFIX, INC.",NEW YORK,NY,10005.0,PROJECT MANAGER,15-1031,...,5415,1.0,101088.0,Y,150000.0,,Y,NEW YORK,NY,
2,2011,CW,4/21/2011,4/27/2011,TGS-NOPEC GEOPHYSICAL COMPANY,HOUSTON,TX,77042.0,PRINCIPAL TRAINER / DEVELOPMENT ANALYST,15-1031,...,541360,1.0,77480.0,Y,87500.0,,Y,HOUSTON,TX,
3,2011,C,4/19/2011,4/25/2011,"AFREN USA, INC.",THE WOODLANDS,TX,77380.0,DRILLING MANAGER,11-9041,...,211111,1.0,165506.0,Y,233500.0,,Y,THE WOODLANDS,TX,
4,2011,C,4/4/2011,4/8/2011,"BA-INSIGHT, LLC",NEW YORK,NY,10165.0,TECHNICAL SUPPORT ENGINEER,15-1041,...,541519,1.0,62358.0,Y,68000.0,,Y,BOSTON,MA,


In [3]:
print(f'H1B Dataset Shape:{h1b_pandas_df.shape}')

H1B Dataset Shape:(4192087, 21)


##### 1.2 Indeed Job Posting Dataset

In [4]:
# Read data
indeed_jobs_pandas_df = pd.read_csv('marketing_sample_for_trulia_com-real_estate__20190901_20191031__30k_data.csv')
indeed_jobs_pandas_df.head()

Unnamed: 0,Job Title,Job Description,Job Type,Categories,Location,City,State,Country,Zip Code,Address,...,Employer Phone,Employer Logo,Companydescription,Employer Location,Employer City,Employer State,Employer Country,Employer Zip Code,Uniq Id,Crawl Timestamp
0,Shift Manager,,,,"Mission Hills, CA 91345",Mission Hills,CA,United States,91345.0,,...,,https://d2q79iu7y748jz.cloudfront.net/s/_squar...,Del Taco is an American quick service restaura...,"Mission Hills, CA 91345",Mission Hills,CA,United States,91345.0,511f9a53920f4641d701d51d3589349f,2019-08-24 09:13:18 +0000
1,Operations Support Manager,,,,"Atlanta, GA 30342",Atlanta,GA,United States,30342.0,,...,,https://d2q79iu7y748jz.cloudfront.net/s/_logo/...,"Based in Atlanta, FOCUS Brands Inc. is an inno...",,,,United States,,4955daf0a3facbe2acb6c429ba394e6d,2019-09-19 08:16:55 +0000
2,Senior Product Manager - Data,,,,"Chicago, IL",Chicago,IL,United States,,,...,,,Vibes Corp. reputation was built and establish...,,,,United States,,a0e0d12df1571962b785f17f43ceae12,2019-09-18 02:13:10 +0000
3,Part-Time Office Concierge,,,,"Festus, MO",Festus,MO,United States,,,...,,,,,,,United States,,56e411fd731f76ac916bf4fb169250e9,2019-10-24 16:39:13 +0000
4,Print & Marketing Associate,,,,"Cedar Rapids, IA 52404",Cedar Rapids,IA,United States,52404.0,,...,,https://d2q79iu7y748jz.cloudfront.net/s/_logo/...,"Staples is The Worklife Fulfillment Company, h...","Cedar Rapids, IA 52404",Cedar Rapids,IA,United States,52404.0,3fff5c0ad6981bf4bff6260bd5feab63,2019-08-24 22:29:10 +0000


In [5]:
print(f'Indeed Jobs Posting Dataset Shape:{indeed_jobs_pandas_df.shape}')

Indeed Jobs Posting Dataset Shape:(30002, 30)


##### 1.3 Job Posting On Amazon Jobs
Note: Since this file is in XML format, we need to parse it and convert the results into a pandas dataframe. I use **xml.etree.ElementTree** (https://docs.python.org/3/library/xml.etree.elementtree.html) from the Python standard library to perform this task.

In [6]:
# Read data
xtree = et.parse("amazon_com-jobs__20190901_20191231_sample.xml")
xroot = xtree.getroot()

In [8]:
# Column names taken from the child nodes of each record element in the xml file
amazon_jobs_df_cols = ["uniq_id", "crawl_timestamp", "job_url", "title", "description", "location", "category",
                       "job_id", "company_name"]
amazon_jobs_rows = []

def child_text(record, tag):
    """Return the text of a child node, or None when the node is absent."""
    node = record.find(tag)
    return node.text if node is not None else None

# We are interested in each page node and all of its record elements
for page in xroot.findall('page'):
    for record in page.findall('record'):
        # Append the extracted information to the list of rows
        amazon_jobs_rows.append({col_name: child_text(record, col_name) for col_name in amazon_jobs_df_cols})

# Convert list to Pandas dataframe
amazon_jobs_pandas_df = pd.DataFrame(amazon_jobs_rows, columns = amazon_jobs_df_cols)
amazon_jobs_pandas_df.head()

Unnamed: 0,uniq_id,crawl_timestamp,job_url,title,description,location,category,job_id,company_name
0,850f0199e2ce9d1d51ee5e2e11318d94,2019-11-15 00:15:23 +0000,https://www.amazon.jobs/en/jobs/986897/softwar...,Software Development Engineer I,Amazon’s Global Logistics Technology team is c...,"US, WA, Seattle",Software Development,986897,"Amazon.com Services, Inc."
1,9215700a26684c9dca1ad77718f33ff2,2019-11-15 00:15:23 +0000,https://www.amazon.jobs/en/jobs/986896/busines...,Business Intelligence Engineer,Amazon Advertising is dedicated to driving mea...,"US, CA, Palo Alto",Business Intelligence,986896,"Amazon.com Services, Inc."
2,14b00d40ced925890eb4eca458040133,2019-11-15 00:15:23 +0000,https://www.amazon.jobs/en/jobs/986894/senior-...,Senior Solution Design Manager,The Amazon Cross Border Exports team is lookin...,"US, WA, Bellevue",Supply Chain/Transportation Management,986894,"Amazon.com Services, Inc."
3,0c96a59282dc24bbc546bf31d42825d5,2019-11-15 00:15:23 +0000,https://www.amazon.jobs/en/jobs/986893/instock...,Instock Manager,The AmazonFresh and Prime Now organizations ar...,"US, WA, Seattle","Buying, Planning, & Instock Management",986893,"Amazon.com Services, Inc."
4,fff278b8672d6999631c5517c081fc4d,2019-11-15 00:15:23 +0000,https://www.amazon.jobs/en/jobs/986891/shift-m...,"Shift Manager, Logistics","Do you want to work hard, have fun and make hi...","CA, AB, Calgary",Fulfillment & Operations Management,986891,"AMZN CAN Fulfillment Svcs, ULC"


In [9]:
print(f'Amazon Jobs Posting Dataset Shape:{amazon_jobs_pandas_df.shape}')

Amazon Jobs Posting Dataset Shape:(50, 9)


#### 2. Identify Missing Values

##### 2.1 OFLC H-1B Program Data (2011-2018)
Let's examine the data type of each column and the summary statistics of the numerical columns.

In [10]:
print('*****************************************')
print('H1B Dataframe Data Types:')
print('*****************************************')
print(h1b_pandas_df.dtypes)
print('\n')
print('*****************************************')
print('H1B Dataframe Summary Statistics:')
print('*****************************************')
# Format each describable columns in General format
h1b_pandas_df.describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))

*****************************************
H1B Dataframe Data Types:
*****************************************
fiscal_year             int64
case_status            object
case_submitted         object
decision_date          object
emp_name               object
emp_city               object
emp_state              object
emp_zip               float64
job_title              object
soc_code               object
soc_name               object
naics_code             object
full_time_position    float64
prevailing_wage       float64
pw_unit                object
wage_from             float64
wage_to               float64
wage_unit              object
work_city              object
work_state             object
work_zip              float64
dtype: object


*****************************************
H1B Dataframe Summary Statistics:
*****************************************


Unnamed: 0,fiscal_year,emp_zip,full_time_position,prevailing_wage,wage_from,wage_to,work_zip
count,4192090.0,4191100.0,3557830.0,4191490.0,4191670.0,2361990.0,2488600.0
mean,2014.94,47323.6,0.972199,69744.6,81438.9,86294.2,50989.7
std,2.20426,33546.8,0.164401,876329.0,4245330.0,71573800.0,33518.3
min,2011.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,2013.0,11747.0,1.0,53596.0,60000.0,0.0,19355.0
50%,2015.0,45216.0,1.0,66726.0,72420.0,0.0,48374.0
75%,2017.0,77046.0,1.0,84386.0,93557.0,87000.0,85034.0
max,2018.0,99988.0,1.0,1000000000.0,7278870000.0,110000000000.0,99929.0


From the above statistics, we can see that some of the numeric columns have missing or suspicious values (the count is below the total number of rows, or the min value is **zero**):
1. **Employer Zip**
2. **Prevailing Wage**
3. **Wage From**
4. **Wage To**
5. **Work Zip**

Next, we will check whether there are any missing values in the **Work City & State** columns of the **H1B** dataset.
<br />
**<font color=red>Note:</font>** These two columns will be the **<font color=blue>partition keys</font>** of our fact table, so it is important to find the rows with missing values and remove them from the dataset.

In [11]:
print(f'Does H1B Dataset has any missing value on either work_city or work_state columns? {h1b_pandas_df.loc[:, ["work_city", "work_state"]].isnull().values.any()}')
print('\n')
print('How many rows missing per each columns?')
h1b_pandas_df.loc[:, ["work_city", "work_state"]].isnull().sum()

Does H1B Dataset has any missing value on either work_city or work_state columns? True


How many rows missing per each columns?


work_city     538
work_state    478
dtype: int64

In [12]:
print('*****************************************')
print('H1B Missing Values Dataframe:')
print('*****************************************')
h1b_missing_city_or_state_df = h1b_pandas_df[(h1b_pandas_df['work_city'].isnull()) | (h1b_pandas_df['work_state'].isnull())] 
h1b_missing_city_or_state_df.head()

*****************************************
H1B Missing Values Dataframe:
*****************************************


Unnamed: 0,fiscal_year,case_status,case_submitted,decision_date,emp_name,emp_city,emp_state,emp_zip,job_title,soc_code,...,naics_code,full_time_position,prevailing_wage,pw_unit,wage_from,wage_to,wage_unit,work_city,work_state,work_zip
41,2011,W,2/22/2011,2/22/2011,BIRLASOFT INC,EDISON,NJ,8817.0,SYSTEMS ANALYST,15-1051,...,541511.0,1.0,,Y,,,Y,,,
186,2011,W,11/23/2010,11/23/2010,,,,,,,...,,,,,,,,,,
292,2011,W,2/10/2011,2/10/2011,,,,,,,...,,,,,,,,,,
375,2011,W,3/25/2011,3/25/2011,SKY FOUNDATION,OKLAHOMA CITY,OK,73106.0,BUSINESS MANAGER,,...,611110.0,,,,,,,,,
398,2011,W,3/25/2011,3/25/2011,RIVERWALK FOUNDATION,SAN ANTONIO,TX,78209.0,ASSISTANT PRINCIPAL,11-9032,...,611110.0,1.0,,,,,,,,


In [13]:
print(f'H1B Missing Values Dataset Shape:{h1b_missing_city_or_state_df.shape}')

H1B Missing Values Dataset Shape:(595, 21)
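
Rows that lack either partition key have to be excluded before the fact table is written. The sketch below shows the filtering rule in plain Python so that it stays dependency-free; the three-row extract and the helper `has_partition_keys` are hypothetical. In pandas, the same filter is typically expressed with `DataFrame.dropna(subset=["work_city", "work_state"])`.

```python
import csv
import io

# Hypothetical three-row extract; empty strings stand in for missing values
RAW = """emp_name,work_city,work_state
NYFIX INC.,NEW YORK,NY
BIRLASOFT INC,,
SKY FOUNDATION,OKLAHOMA CITY,
"""

PARTITION_KEYS = ("work_city", "work_state")


def has_partition_keys(row):
    """Return True when every partition key holds a non-blank value."""
    return all((row.get(key) or "").strip() for key in PARTITION_KEYS)


rows = list(csv.DictReader(io.StringIO(RAW)))
kept = [row for row in rows if has_partition_keys(row)]
print(len(rows), len(kept))
```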


##### 2.2 Indeed Job Posting Dataset
Let's examine the data type of each column and the summary statistics of the numerical columns.

In [14]:
print('*****************************************')
print('Indeed Jobs Posting Dataframe Data Types:')
print('*****************************************')
print(indeed_jobs_pandas_df.dtypes)
print('\n')
print('*************************************************')
print('Indeed Jobs Posting Dataframe Summary Statistics:')
print('*************************************************')
# Summary statistics of the numeric columns
indeed_jobs_pandas_df.describe()

*****************************************
Indeed Jobs Posting Dataframe Data Types:
*****************************************
Job Title              object
Job Description       float64
Job Type              float64
Categories            float64
Location               object
City                   object
State                  object
Country                object
Zip Code               object
Address               float64
Salary From           float64
Salary To             float64
Salary Period         float64
Apply Url              object
Apply Email           float64
Employees             float64
Industry              float64
Company Name           object
Employer Email        float64
Employer Website      float64
Employer Phone        float64
Employer Logo          object
Companydescription     object
Employer Location      object
Employer City          object
Employer State         object
Employer Country       object
Employer Zip Code      object
Uniq Id                object
Craw

Unnamed: 0,Job Description,Job Type,Categories,Address,Salary From,Salary To,Salary Period,Apply Email,Employees,Industry,Employer Email,Employer Website,Employer Phone
count,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,,,,,,,,,,,,,
std,,,,,,,,,,,,,
min,,,,,,,,,,,,,
25%,,,,,,,,,,,,,
50%,,,,,,,,,,,,,
75%,,,,,,,,,,,,,
max,,,,,,,,,,,,,


From the above statistics, we can't extract much useful information. However, we can say that all of these numeric columns are entirely empty (the count is **0** and every summary statistic is **NaN**).

**Note:** We will probably <font color=red>drop some of these columns</font> since not all of them are useful to our scenario. For example, the employer-related information doesn't help our audience decide where they want to relocate after graduation. The more relevant information is the **Work City & State**, since that's where they will perform their work if they get hired.

- **Job Type**
- **Categories**
- **Address**
- **Salary From**
- **Salary To**
- **Salary Period**
- **Apply Email**
- **Employees**
- **Industry**
- **Employer Email**
- **Employer Website**
- **Employer Phone**

Next, we will check whether there are any missing values in the **City** & **State** columns of the Indeed Jobs Posting dataset.<br /> 
**<font color=red>Note:</font>** These two columns can be joined with the **<font color=blue>partition keys</font>** of the fact table to create ad-hoc queries.

In [15]:
print(f'Does Indeed Jobs Posting Dataset has any missing value on either Job City or State columns? {indeed_jobs_pandas_df.loc[:, ["City", "State"]].isnull().values.any()}')
print('\n')
print('How many rows missing per each columns?')
indeed_jobs_pandas_df.loc[:, ["City", "State"]].isnull().sum()

Does Indeed Jobs Posting Dataset has any missing value on either Job City or State columns? False


How many rows missing per each columns?


City     0
State    0
dtype: int64

**Nice!** The Indeed Jobs Posting Dataset doesn't have any missing values in our focus columns (**Job City** and **Job State**).

##### 2.3 Job Posting On Amazon Jobs
Let's examine the data type of each column and the summary statistics of the columns.

In [16]:
print('*****************************************')
print('Amazon Jobs Posting Dataframe Data Types:')
print('*****************************************')
print(amazon_jobs_pandas_df.dtypes)
print('\n')
print('*************************************************')
print('Amazon Jobs Posting Dataframe Summary Statistics:')
print('*************************************************')
# Describe this dataframe
amazon_jobs_pandas_df.describe()

*****************************************
Amazon Jobs Posting Dataframe Data Types:
*****************************************
uniq_id            object
crawl_timestamp    object
job_url            object
title              object
description        object
location           object
category           object
job_id             object
company_name       object
dtype: object


*************************************************
Amazon Jobs Posting Dataframe Summary Statistics:
*************************************************


Unnamed: 0,uniq_id,crawl_timestamp,job_url,title,description,location,category,job_id,company_name
count,50,50,50,50,50,50,50,50,50
unique,50,5,50,45,47,25,20,50,18
top,0c96a59282dc24bbc546bf31d42825d5,2019-11-15 00:16:13 +0000,https://www.amazon.jobs/en/jobs/977743/sr-tech...,Software Development Engineer,"Interested in working on IoT products, AWS clo...","US, WA, Seattle",Software Development,977767,"Amazon.com Services, Inc."
freq,1,10,1,4,3,15,9,1,23


We can't tell whether there are any missing values from the above statistics. However, the more serious problem is that this dataframe doesn't have **City** and **State** columns. We need to extract this information from the **location** column.

In [17]:
# Since this dataframe has only 50 rows, it's OK to loop over each row
for index, row in amazon_jobs_pandas_df.iterrows():
    # Extract Country, State and City from the location column, trimming the space after each comma
    addresses = [part.strip() for part in row['location'].split(',')]
    if len(addresses) == 3:
        # This address has complete Country, State and City
        amazon_jobs_pandas_df.loc[index,['Country']] = addresses[0]
        amazon_jobs_pandas_df.loc[index,['State']] = addresses[1]
        amazon_jobs_pandas_df.loc[index,['City']] = addresses[2]
    elif len(addresses) == 2:
        # This address has only Country and City (probably not in the U.S.)
        amazon_jobs_pandas_df.loc[index,['Country']] = addresses[0]
        amazon_jobs_pandas_df.loc[index,['State']] = None
        amazon_jobs_pandas_df.loc[index,['City']] = addresses[1]
    elif len(addresses) == 1:
        # This address has only Country
        amazon_jobs_pandas_df.loc[index,['Country']] = addresses[0]
        amazon_jobs_pandas_df.loc[index,['State']] = None
        amazon_jobs_pandas_df.loc[index,['City']] = None
        
# Check to see if new columns were created
amazon_jobs_pandas_df.head()

Unnamed: 0,uniq_id,crawl_timestamp,job_url,title,description,location,category,job_id,company_name,Country,State,City
0,850f0199e2ce9d1d51ee5e2e11318d94,2019-11-15 00:15:23 +0000,https://www.amazon.jobs/en/jobs/986897/softwar...,Software Development Engineer I,Amazon’s Global Logistics Technology team is c...,"US, WA, Seattle",Software Development,986897,"Amazon.com Services, Inc.",US,WA,Seattle
1,9215700a26684c9dca1ad77718f33ff2,2019-11-15 00:15:23 +0000,https://www.amazon.jobs/en/jobs/986896/busines...,Business Intelligence Engineer,Amazon Advertising is dedicated to driving mea...,"US, CA, Palo Alto",Business Intelligence,986896,"Amazon.com Services, Inc.",US,CA,Palo Alto
2,14b00d40ced925890eb4eca458040133,2019-11-15 00:15:23 +0000,https://www.amazon.jobs/en/jobs/986894/senior-...,Senior Solution Design Manager,The Amazon Cross Border Exports team is lookin...,"US, WA, Bellevue",Supply Chain/Transportation Management,986894,"Amazon.com Services, Inc.",US,WA,Bellevue
3,0c96a59282dc24bbc546bf31d42825d5,2019-11-15 00:15:23 +0000,https://www.amazon.jobs/en/jobs/986893/instock...,Instock Manager,The AmazonFresh and Prime Now organizations ar...,"US, WA, Seattle","Buying, Planning, & Instock Management",986893,"Amazon.com Services, Inc.",US,WA,Seattle
4,fff278b8672d6999631c5517c081fc4d,2019-11-15 00:15:23 +0000,https://www.amazon.jobs/en/jobs/986891/shift-m...,"Shift Manager, Logistics","Do you want to work hard, have fun and make hi...","CA, AB, Calgary",Fulfillment & Operations Management,986891,"AMZN CAN Fulfillment Svcs, ULC",CA,AB,Calgary
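
The location parsing above can also be expressed as a small pure function, which is easier to test in isolation. This is a sketch under the assumption that `location` always follows the `Country, State, City` ordering seen in the sample; the function name `split_location` is hypothetical. Unlike the loop above, it raises an error for locations with more than three parts instead of silently leaving the new columns empty, which is a design choice rather than the notebook's behaviour.

```python
def split_location(location):
    """Split 'Country, State, City' into a (country, state, city) tuple.

    Missing parts are returned as None; surrounding whitespace is trimmed.
    """
    parts = [part.strip() for part in location.split(",")]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        # Country and city only (probably not in the U.S.)
        return parts[0], None, parts[1]
    if len(parts) == 1:
        # Country only
        return parts[0], None, None
    raise ValueError(f"Unexpected location format: {location!r}")


print(split_location("US, WA, Seattle"))
print(split_location("UK, London"))
print(split_location("US"))
```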


Next, we will check whether there are any missing values in the **Job City & State** columns of the **Amazon Jobs Posting** dataset.<br /> 
**<font color=red>Note:</font>** These two columns can be joined with the **<font color=blue>partition keys</font>** of the fact table to create ad-hoc queries.

In [18]:
print(f'Does Amazon Jobs Posting Dataset has any missing value on either City or State columns? {amazon_jobs_pandas_df.loc[:, ["City", "State"]].isnull().values.any()}')
print('\n')
print('How many rows missing per each columns?')
amazon_jobs_pandas_df.loc[:, ["City", "State"]].isnull().sum()

Does Amazon Jobs Posting Dataset has any missing value on either City or State columns? True


How many rows missing per each columns?


City     1
State    4
dtype: int64

In [19]:
print('*********************************************')
print('Amazon Jobs Posting Missing Values Dataframe:')
print('*********************************************')
amazon_jobs_missing_city_or_state_df = amazon_jobs_pandas_df[(amazon_jobs_pandas_df['City'].isnull()) | (amazon_jobs_pandas_df['State'].isnull())] 
amazon_jobs_missing_city_or_state_df.head()

*********************************************
Amazon Jobs Posting Missing Values Dataframe:
*********************************************


Unnamed: 0,uniq_id,crawl_timestamp,job_url,title,description,location,category,job_id,company_name,Country,State,City
12,5c376060d06977df7825cc49141c308b,2019-11-15 00:13:36 +0000,https://www.amazon.jobs/en/jobs/956568/sr-busi...,Sr. Business Development Manager – AWS High Pe...,Amazon Web Services (AWS) is the pioneer and r...,"UK, London",Business & Merchant Development,956568,AWS EMEA SARL (UK Branch),UK,,London
20,6b745f0dfdfb8aba7d584d12bc712643,2019-11-15 00:13:07 +0000,https://www.amazon.jobs/en/jobs/960293/softwar...,Software Development Engineer,Prime Video is changing the way millions of cu...,"UK, London",Software Development,960293,Amazon Dev Centre (London) Ltd,UK,,London
36,82084f962c940000e13168f4d5a03657,2019-11-15 00:16:13 +0000,https://www.amazon.jobs/en/jobs/977623/dceo-cl...,"DCEO Cluster Manager, Singapore",DCEO Manager SingaporeResponsible for the mana...,"SG, Singapore","Operations, IT, & Support Engineering",977623,Amzn Asia-PacificResources SGP,SG,,Singapore
43,19c516821e775cfa56a6d785b4e703a4,2019-11-15 00:17:14 +0000,https://www.amazon.jobs/en/jobs/979735/senior-...,Senior Security Engineer - Hardware,Help us protect not only the Amazon Web Servic...,US,"Systems, Quality, & Security Engineering",979735,"Amazon.com Services, Inc.",US,,
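
The missing **State** and **City** values above follow from the shape of the `location` string, which holds three, two or one comma-separated parts. Below is a minimal sketch of that split. The helper name `split_location` is hypothetical (it is not part of this notebook), and the inputs are the three location shapes that appear in the table above.

```python
def split_location(location):
    """Split a 'location' string into (country, state, city).

    Hypothetical helper for illustration only; parts that are absent come back as None.
    """
    # Strip the space that follows each comma, e.g. "US, WA, Seattle"
    parts = [p.strip() for p in location.split(",")] if location else []
    if len(parts) == 3:
        country, state, city = parts
    elif len(parts) == 2:
        # Country and city only (probably not in the U.S.)
        country, state, city = parts[0], None, parts[1]
    elif len(parts) == 1:
        # Country only
        country, state, city = parts[0], None, None
    else:
        country = state = city = None
    return country, state, city


for loc in ["US, WA, Seattle", "UK, London", "US"]:
    print(loc, "->", split_location(loc))
```

Rows of the second and third shape are the ones that end up with a null **State** (and, for the third, a null **City**).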


#### 3. Identify Duplicate Data

##### 3.1 OFLC H-1B Program Data (2011-2018)
Let's examine whether there are any **duplicate** rows based on all columns.

In [20]:
print('**********************************************')
print('Are there any duplicate rows in H1B Dataframe:')
print('**********************************************')
h1b_pandas_df.duplicated().any()

**********************************************
Are there any duplicate rows in H1B Dataframe:
**********************************************


True

In [21]:
# Select the duplicate rows
# keep = 'first' considers first value as unique and rest of the same values as duplicate
h1b_duplicate_df = h1b_pandas_df[h1b_pandas_df.duplicated(keep = 'first')]

In [22]:
print("Duplicate Rows except first occurrence based on all columns are :")
h1b_duplicate_df

Duplicate Rows except first occurrence based on all columns are :


Unnamed: 0,fiscal_year,case_status,case_submitted,decision_date,emp_name,emp_city,emp_state,emp_zip,job_title,soc_code,...,naics_code,full_time_position,prevailing_wage,pw_unit,wage_from,wage_to,wage_unit,work_city,work_state,work_zip
224,2011,C,5/2/2011,5/6/2011,"ALERIS ROLLED PRODUCTS, INC.",BEACHWOOD,OH,44122.0,SENIOR INDUSTRIAL ENGINEER,17-2112,...,331316,1.0,74298.00,Y,74298.0,,Y,LEWISPORT,KY,
232,2011,C,7/28/2011,8/3/2011,"WS ATKINS, INC.",HOUSTON,TX,77024.0,PRINCIPAL CONSULTANT,19-3032,...,541990,1.0,99445.00,Y,165000.0,,Y,HOUSTON,TX,
234,2011,C,8/30/2011,9/6/2011,"CRISPIN PORTER & BOGUSKY, LLC",MIAMI,FL,33133.0,COPYWRITER,27-3043,...,541810,1.0,32178.00,Y,43000.0,,Y,BOULDER,CO,
236,2011,C,8/30/2011,9/6/2011,"CRISPIN PORTER & BOGUSKY, LLC",MIAMI,FL,33133.0,ART DIRECTOR,27-1011,...,541810,1.0,43300.00,Y,43300.0,,Y,BOULDER,CO,
277,2011,C,11/8/2010,11/15/2010,"MUSE MANAGEMENT, INC.",NEW YORK,NY,10010.0,PROFESSIONAL FASHION MODEL,41-9012,...,711410,0.0,21.87,H,100.0,500.0,H,NEW YORK,NY,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4192041,2018,C,2/8/18,2/15/18,DELOITTE CONSULTING LLP,PHILADELPHIA,PA,19103.0,CONSULTANT,15-1121,...,54161,1.0,57096.00,Y,57096.0,0.0,Y,PHILADELPHIA,PA,19103.0
4192042,2018,C,2/21/18,2/27/18,"TECH MAHINDRA (AMERICAS),INC.",SOUTH PLAINFIELD,NJ,7080.0,NETWORK ARCHITECT,15-1143,...,541511,1.0,99549.00,Y,99549.0,0.0,Y,BOXBOROUGH,MA,1719.0
4192050,2018,C,3/22/18,3/28/18,"NAMITUS TECHNOLOGIES, INC.",FRISCO,TX,75035.0,SOFTWARE DEVELOPER,15-1132,...,541511,1.0,90646.00,Y,90646.0,0.0,Y,FRISCO,TX,75035.0
4192052,2018,C,3/16/18,3/22/18,ADBAKX LLC,MONMOUTH JUNCTION,NJ,8852.0,BUSINESS ANALYST,15-1121,...,541511,1.0,74734.00,Y,74734.0,0.0,Y,ATLANTA,GA,30308.0


In [22]:
print(f'H1B Duplicate Rows Dataset Shape:{h1b_duplicate_df.shape}')

H1B Duplicate Rows Dataset Shape:(353129, 21)


##### 3.2 Indeed Job Posting Dataset
Let's examine whether there are any duplicate rows based on all columns.

In [23]:
print('**************************************************************')
print('Are there any duplicate rows in Indeed Jobs Posting Dataframe:')
print('**************************************************************')
indeed_jobs_pandas_df.duplicated().any()

**************************************************************
Are there any duplicate rows in Indeed Jobs Posting Dataframe:
**************************************************************


False

**Nice!** The Indeed Jobs Posting Dataset doesn't have any duplicate rows when all columns are considered.

##### 3.3 Job Posting On Amazon Jobs
Let's examine whether there are any duplicate rows based on all columns.

In [24]:
print('**************************************************************')
print('Are there any duplicate rows in Amazon Jobs Posting Dataframe:')
print('**************************************************************')
amazon_jobs_pandas_df.duplicated().any()

**************************************************************
Are there any duplicate rows in Amazon Jobs Posting Dataframe:
**************************************************************


False

**Nice!** The Amazon Jobs Posting Dataset doesn't have any duplicate rows when all columns are considered.

#### 4. Cleaning Steps
Steps necessary to clean the data are:
1. Remove rows with missing values (if any). In our case, a row counts as having missing values if either or both of **City & State** are missing.
2. Remove duplicate rows (if any). A row counts as a duplicate when the data in all of its columns matches another row. In our case, we keep the first occurrence and discard the remaining identical rows.
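
The two rules above can be sketched in plain Python, independent of Pandas or PySpark. This is only an illustration: `clean_rows` is a hypothetical helper, and the sample rows are made up rather than taken from the datasets.

```python
def clean_rows(rows, city_key="work_city", state_key="work_state"):
    """Apply both cleaning rules to a list of dict rows (illustrative helper)."""
    seen = set()
    cleaned = []
    for row in rows:
        # Rule 1: skip rows where either City or State is missing
        if row.get(city_key) is None or row.get(state_key) is None:
            continue
        # Rule 2: a row is a duplicate only if every column matches; keep the first occurrence
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


# Made-up sample rows for illustration
sample_rows = [
    {"emp_name": "EXAMPLE CORP", "work_city": "SEATTLE", "work_state": "WA"},
    {"emp_name": "EXAMPLE CORP", "work_city": "SEATTLE", "work_state": "WA"},  # exact duplicate
    {"emp_name": "SAMPLE LLC", "work_city": None, "work_state": "TX"},         # missing city
    {"emp_name": "SAMPLE LLC", "work_city": "AUSTIN", "work_state": "TX"},
]
cleaned = clean_rows(sample_rows)
print(len(sample_rows), "rows before,", len(cleaned), "rows after")
```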

**<center>Below is a summary of what we found in each dataset</center>**

| Dataset | Has Missing Values | Has Duplicate Rows |
|------|------|------|
|   H1B  | Yes | Yes |
|   Indeed Jobs Posting  | No | No |
|   Amazon Jobs Posting  | Yes | No |

Therefore, we will perform **Data Cleansing** on the two datasets:
1. **OFLC H-1B Program Data (2011-2018)**
2. **Job Posting On Amazon Jobs**

##### 4.1 OFLC H-1B Program Data (2011-2018)

In [25]:
# Remove rows with missing values
# Keep only the rows that have both work city and state values
cleaned_h1b_df = h1b_pandas_df[~(h1b_pandas_df['work_city'].isnull() | h1b_pandas_df['work_state'].isnull())]

# Drop duplicate rows
cleaned_h1b_df = cleaned_h1b_df[~cleaned_h1b_df.duplicated()]

In [26]:
print(f'Does the cleaned H1B dataset have any missing values in the City or State columns? {cleaned_h1b_df.loc[:, ["work_city", "work_state"]].isnull().values.any()}')
print('\n')
print(f'Does the cleaned H1B dataset have any duplicate rows? {cleaned_h1b_df.duplicated().any()}')

Does the cleaned H1B dataset have any missing values in the City or State columns? False


Does the cleaned H1B dataset have any duplicate rows? False


In [27]:
print(f'Clean H1B Dataset Shape:{cleaned_h1b_df.shape}')

Clean H1B Dataset Shape:(3838578, 21)


##### 4.2 Job Posting On Amazon Jobs

In [28]:
# Remove rows with missing values
# Keep only the rows that have both City and State values
cleaned_amazon_jobs_df = amazon_jobs_pandas_df[~(amazon_jobs_pandas_df['City'].isnull()
                                               | amazon_jobs_pandas_df['State'].isnull())]

In [29]:
print(f'Does the cleaned Amazon Jobs Posting dataset have any missing values in the City or State columns? {cleaned_amazon_jobs_df.loc[:, ["City", "State"]].isnull().values.any()}')

Does the cleaned Amazon Jobs Posting dataset have any missing values in the City or State columns? False


In [30]:
print(f'Clean Amazon Jobs Posting Dataset Shape:{cleaned_amazon_jobs_df.shape}')

Clean Amazon Jobs Posting Dataset Shape:(46, 12)


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Below is a simple conceptual data model of this project.
<img src="Conceptual Data Model.png"/>

**<center>Database (Star) Schema</center>**
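
To make the diagram concrete, here is a minimal sketch of a star-schema query, using Python's built-in `sqlite3` as a stand-in for the data lake. The table and column names follow the data model of this project (only a subset of the columns is shown); the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employers (
    id INTEGER PRIMARY KEY, employer_name TEXT, employer_city TEXT, employer_state TEXT);
CREATE TABLE fact_h1b_sponsors (
    id INTEGER PRIMARY KEY, employer_id INTEGER, work_city TEXT, work_state TEXT);
""")

# Made-up rows for illustration
conn.executemany("INSERT INTO dim_employers VALUES (?, ?, ?, ?)", [
    (1, "EXAMPLE CORP", "SEATTLE", "WA"),
    (2, "SAMPLE LLC", "AUSTIN", "TX"),
])
conn.executemany("INSERT INTO fact_h1b_sponsors VALUES (?, ?, ?, ?)", [
    (1, 1, "SEATTLE", "WA"),
    (2, 1, "SEATTLE", "WA"),
    (3, 2, "AUSTIN", "TX"),
])

# One join per dimension, then aggregate on the fact table
rows = conn.execute("""
    SELECT e.employer_name, f.work_state, COUNT(*) AS sponsored_cases
    FROM fact_h1b_sponsors AS f
    JOIN dim_employers AS e ON e.id = f.employer_id
    GROUP BY e.employer_name, f.work_state
    ORDER BY sponsored_cases DESC, e.employer_name
""").fetchall()
print(rows)
```

Each additional dimension adds one more `JOIN` on its `id`, which keeps the query logic flat.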
 
There are two main aspects I'd like to address:
1. I chose to design the tables as a **star schema** because of its simplicity: the join logic is generally easy to write, and the layout lends itself to fast aggregations.  
2. I selected **Work City & State** as the partition keys for the fact table's parquet files for the following reason:

   `Based on my personal experience, I always use the Where field as a filter every time I search for a job, since I really don't want to relocate. I think the choice of partition keys really depends on the use case, and there is no absolutely right or wrong answer.` 
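
As a rough illustration of what these partition keys mean on disk: Spark writes partitioned output into nested `column=value` directories, so a filter on a partition column only has to read the matching directories. The sketch below only builds such paths. `partition_dir` is a hypothetical helper, and the city and state values are examples.

```python
from pathlib import PurePosixPath


def partition_dir(table_root, work_city, work_state):
    """Return the directory that would hold one (city, state) partition.

    Hypothetical helper: it only mirrors the column=value directory naming
    and does not touch any files.
    """
    return PurePosixPath(table_root) / f"work_city={work_city}" / f"work_state={work_state}"


# Example partitions for illustration
partitions = [
    partition_dir("output/fact_h1b_sponsors_table", city, state)
    for city, state in [("SEATTLE", "WA"), ("BOULDER", "CO"), ("HOUSTON", "TX")]
]

# A query filtered on work_state = 'WA' only needs the matching directories
wanted = [p for p in partitions if p.name == "work_state=WA"]
for p in wanted:
    print(p)
```

Because city names are high-cardinality, this layout can produce many small directories; whether that trade-off is acceptable depends on the data volume and the expected queries.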
 
#### 3.2 Mapping Out Data Pipelines
<img src="Data Pipelines.png"/>

**<center>Data Pipelines Architecture Diagram</center>**

Below are the steps necessary to pipeline the data into the chosen data model:
1. Create a **Config** file to store the **AWS credentials**, which the Python code loads into environment variables to access the **AWS S3** bucket.
2. Set up the **AWS S3 output files path**. (This path will serve as the **Data Lake**.)
3. Create a **Spark session**.
4. Repeat the exploration and cleaning steps above, using **PySpark** instead of **Pandas**.
5. Read all files into **Spark** dataframes, perform **cleaning** and **drop duplicates (if any exist)**.
6. Extract columns from those dataframes to create the tables:
    - 6.1 **Fact_H1B_Sponsors**. Note: These parquet files will be partitioned by **Work City & State**.
    - 6.2 **Dim_Employers**
    - 6.3 **Dim_Occupations**
    - 6.4 **Dim_H1B_Cases**
    - 6.5 **Dim_Job_Postings**
7. Write those tables as **parquet files** and save them to the output path.
8. Check **AWS S3** to confirm that all of those files are present.
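
Step 1 relies on a small config file. A minimal sketch of its layout is shown below. The `[AWS]` section and key names match those read by the pipeline code in Step 4; the values are placeholders, and the file content is parsed from a string here only to keep the example self-contained.

```python
import configparser

# Placeholder values only; real credentials should never be committed to a repository
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = PLACEHOLDER_KEY_ID
AWS_SECRET_ACCESS_KEY = PLACEHOLDER_SECRET
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)

# Collect the values that the pipeline would export as environment variables
aws_env = {
    "AWS_ACCESS_KEY_ID": config["AWS"]["AWS_ACCESS_KEY_ID"],
    "AWS_SECRET_ACCESS_KEY": config["AWS"]["AWS_SECRET_ACCESS_KEY"],
}
print(sorted(aws_env))
```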

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Below is the code for the end-to-end data pipeline that creates the data model designed above.

***
Note 1: All of this code can be turned into a single **ETL** Python script that runs the end-to-end data pipeline process on its own.
***

***
Note 2: My **output path** is **s3a://salinee-bucket/**, a personal bucket that I already set up on **AWS** to serve as my **S3 Data Lake**.
***

In [10]:
# Read config file
config = configparser.ConfigParser()
config.read('credentials.cfg')

# load AWS Credentials
os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['AWS_SECRET_ACCESS_KEY']

# Set up Output Path for the write
output_data = "output/{}"
# output_data = "s3a://salinee-bucket/{}"

# Create Spark Session
spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
        .getOrCreate()

# When mapreduce.fileoutputcommitter.algorithm.version is set to 2,
# each task moves the data it generates directly to the final destination,
# which makes the writing process much faster
spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

# Restore the datetime parsing behavior from before Spark 3.0, since our raw dates are in M/dd/yyyy format,
# which is handled by the legacy parser
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
# Write the dates to parquet as-is (without calendar rebasing) so that the write does not throw an error
spark.sql("set spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED")

## -- Read all data from files -- ##
# Read H1B data from file into the dataframe
h1b_df = spark.read.format("csv")\
         .option("header", "true")\
         .load("h1b_data_fy2011_fy2018_20190401.csv")


# Read Indeed Job Post data from file into the dataframe
indeed_jobs_df = spark.read.format("csv")\
                 .option("header", "true")\
                 .load("marketing_sample_for_trulia_com-real_estate__20190901_20191031__30k_data.csv")

# Read Amazon Job Post data from file into the dataframe
xtree = et.parse("amazon_com-jobs__20190901_20191231_sample.xml")
xroot = xtree.getroot()

amazon_jobs_row = Row("uniq_id", "crawl_timestamp", "job_url", "title", "description", "location", \
                       "city", "state", "country", "category", "job_id", "company_name")
amazon_jobs_rows = []

# We are interested in the 'page' nodes and all of their 'record' elements
for page in xroot.findall('page'):
    for record in page.findall('record'):
        # findtext() returns None when the child element is missing
        s_uniq_id = record.findtext("uniq_id")
        s_crawl_timestamp = record.findtext("crawl_timestamp")
        s_job_url = record.findtext("job_url")
        s_title = record.findtext("title")
        s_description = record.findtext("description")
        s_location = record.findtext("location")
        
        # Extract country, state, city from the location column
        # (strip the space that follows each comma, e.g. "US, WA, Seattle")
        s_address = [part.strip() for part in s_location.split(',')] if s_location else []
        # Default to None so that an unexpected format never reuses values from the previous record
        s_country = s_state = s_city = None
        if len(s_address) == 3:
            # This address has complete Country, State and City
            s_country = s_address[0]
            s_state = s_address[1]
            s_city = s_address[2]
        elif len(s_address) == 2:
            # This address has only Country and City (probably not in the U.S.)
            s_country = s_address[0]
            s_city = s_address[1]
        elif len(s_address) == 1:
            # This address has only Country
            s_country = s_address[0]
        s_category = record.findtext("category")
        s_job_id = record.findtext("job_id")
        s_company_name = record.findtext("company_name")
        
        # Append the extract information into rows
        amazon_jobs_rows.append(amazon_jobs_row(s_uniq_id, s_crawl_timestamp, s_job_url, s_title, s_description,\
                                s_location, s_city, s_state, s_country, s_category, s_job_id, s_company_name))
amazon_jobs_df = spark.createDataFrame(amazon_jobs_rows)

## -- Count dataframe's records (Before Cleaning) -- ##
# H1B
print('**************************************************')
print(f'H1B Dataset count:{h1b_df.count()}')
print('**************************************************')
print('\n')

# Indeed Jobs
print('**************************************************')
print(f'Indeed Jobs Dataset count:{indeed_jobs_df.count()}')
print('**************************************************')
print('\n')

# Amazon Jobs
print('**************************************************')
print(f'Amazon Jobs Dataset count:{amazon_jobs_df.count()}')
print('**************************************************')
print('\n')

## -- Perform Data Cleaning -- ##
# H1B
# Drop records for null values on 'work_city' or 'work_state' (if exists)
h1b_df_valid = h1b_df.filter(h1b_df.work_city.isNotNull() & h1b_df.work_state.isNotNull())
# Drop duplicate records
h1b_df_valid = h1b_df_valid.dropDuplicates()

# Indeed Jobs
# Drop records for null values on 'City' or 'State' (if exists)
indeed_jobs_df_valid = indeed_jobs_df.filter(indeed_jobs_df.City.isNotNull() & indeed_jobs_df.State.isNotNull())
# Drop duplicate records
indeed_jobs_df_valid = indeed_jobs_df_valid.dropDuplicates()

# Amazon Jobs
# Drop records for null values on 'city' or 'state' (if exists)
amazon_jobs_df_valid = amazon_jobs_df.filter(amazon_jobs_df.city.isNotNull() & amazon_jobs_df.state.isNotNull())
# Drop duplicate records
amazon_jobs_df_valid = amazon_jobs_df_valid.dropDuplicates()

## -- Count dataframe's records (After Cleaning) -- ##
# H1B
print('**************************************************')
print(f'Cleaned H1B Dataset count:{h1b_df_valid.count()}')
print('**************************************************')
print('\n')

# Indeed Jobs
print('**************************************************')
print(f'Cleaned Indeed Jobs Dataset count:{indeed_jobs_df_valid.count()}')
print('**************************************************')
print('\n')

# Amazon Jobs
print('**************************************************')
print(f'Cleaned Amazon Jobs Dataset count:{amazon_jobs_df_valid.count()}')
print('**************************************************')
print('\n')

## -- Extract columns to create tables designed in the Data Model diagram -- ##
# Dim_H1B_Cases
# Note1: need to cast 'case_submitted' & 'decision_date' from String to Date type
# Note2: Drop duplicates first, then add an auto-increment id column
#        (adding the id first would make every row unique, so no duplicates would be dropped)
dim_h1b_cases_table = h1b_df_valid.select(\
                               to_date(from_unixtime(unix_timestamp('case_submitted', 'M/dd/yyyy'))).alias('case_submitted'),\
                               to_date(from_unixtime(unix_timestamp('decision_date', 'M/dd/yyyy'))).alias("decision_date"), \
                               col("job_title"),\
                               col("work_city"),\
                               col("work_state")).dropDuplicates()\
                               .select(monotonically_increasing_id().alias("id"), "*")

print('******************************************************')
print(f'Finished reading "dim_h1b_cases_table" dataframe')
print('******************************************************')
print('\n')

# Dim_Employers
# Note1: Drop duplicates first, then add an auto-increment id column
#        (adding the id first would make every row unique, so no duplicates would be dropped)
dim_employers_table = h1b_df_valid.select(\
                                    col("emp_name").alias("employer_name"),\
                                    col("emp_city").alias("employer_city"),\
                                    col("emp_state").alias("employer_state")).dropDuplicates()\
                                    .select(monotonically_increasing_id().alias("id"), "*")

print('******************************************************')
print(f'Finished reading "dim_employers_table" dataframe')
print('******************************************************')
print('\n')

# Dim_Occupations
# Note: Drop duplicates first, then add an auto-increment id column
#       (adding the id first would make every row unique, so no duplicates would be dropped)
dim_occupations_table = h1b_df_valid.select(\
                                    col("soc_code"),\
                                    col("soc_name"),\
                                    col("naics_code"),\
                                    col("full_time_position").cast(IntegerType()).alias("full_time"),\
                                    col("prevailing_wage").cast(FloatType()).alias("prevailing_wage"), \
                                    col("pw_unit").alias("prevailing_unit"), \
                                    col("wage_from").cast(FloatType()).alias("wage_from"), \
                                    col("wage_to").cast(FloatType()).alias("wage_to")).dropDuplicates()\
                                    .select(monotonically_increasing_id().alias("id"), "*")

print('******************************************************')
print(f'Finished reading "dim_occupations_table" dataframe')
print('******************************************************')
print('\n')

# Dim_Job_Postings
# Extract columns from Indeed Job Posting dataframe
indeed_job_postings = indeed_jobs_df_valid.select(col("Job Title").alias("job_title"),\
                                    col("City").alias("job_city"),\
                                    col("State").alias("job_state"),\
                                    col("Company Name").alias("company_name"),\
                                    col("Crawl Timestamp").cast(TimestampType()).alias("crawl_timestamp"))
# Add new 'source' with constant value = 'Indeed' to inform where this data came from
indeed_job_postings = indeed_job_postings.withColumn("source", lit("Indeed"))

# Extract columns from Amazon Job Posting dataframe
amazon_job_postings = amazon_jobs_df_valid.select(col("title").alias("job_title"),\
                                    col("city").alias("job_city"),\
                                    col("state").alias("job_state"),\
                                    col("company_name").alias("company_name"),\
                                    col("crawl_timestamp").cast(TimestampType()).alias("crawl_timestamp"))
# Add new 'source' with constant value = 'Amazon' to inform where this data came from
amazon_job_postings = amazon_job_postings.withColumn("source", lit("Amazon"))

# Combine the two dataframes into a new 'Job Postings' table
dim_job_postings_table = indeed_job_postings.union(amazon_job_postings)

# Drop duplicates first, then add an auto-increment id column
# (adding the id first would make every row unique, so no duplicates would be dropped)
dim_job_postings_table = dim_job_postings_table.select(\
                    col("job_title"),\
                    col("job_city"),\
                    col("job_state"),\
                    col("company_name"),\
                    col("crawl_timestamp"),
                    col("source")).dropDuplicates()\
                    .select(monotonically_increasing_id().alias("id"), "*")

print('******************************************************')
print(f'Finished reading "dim_job_postings_table" dataframe')
print('******************************************************')
print('\n')

# Fact_H1B_Sponsors
# Note1: 'dim_job_postings_table' is joined with a left outer join because 
#        we can't be sure whether the H1B jobs were advertised on job board sites, 
#        so we probably won't find many matches
# Note2: Add an id column by auto increment this column (on the last select statement)
fact_h1b_sponsors_table = h1b_df_valid \
                          .join(dim_h1b_cases_table, 
                               (date_format(dim_h1b_cases_table.case_submitted, "M/dd/yyyy") == h1b_df_valid.case_submitted) & 
                               (date_format(dim_h1b_cases_table.decision_date, "M/dd/yyyy") == h1b_df_valid.decision_date) &
                               (dim_h1b_cases_table.job_title == h1b_df_valid.job_title) & 
                               (dim_h1b_cases_table.work_city == h1b_df_valid.work_city) &
                               (dim_h1b_cases_table.work_state == h1b_df_valid.work_state),
                           'inner')\
                           .join(dim_occupations_table, 
                               (dim_occupations_table.soc_code == h1b_df_valid.soc_code) &
                               (dim_occupations_table.soc_name == h1b_df_valid.soc_name) &
                               (dim_occupations_table.naics_code == h1b_df_valid.naics_code) &
                               (dim_occupations_table.full_time.cast(StringType()) == h1b_df_valid.full_time_position) & 
                               (dim_occupations_table.prevailing_wage.cast(StringType()) == h1b_df_valid.prevailing_wage) &
                               (dim_occupations_table.prevailing_unit.cast(StringType())  == h1b_df_valid.pw_unit) &
                               (dim_occupations_table.wage_from == h1b_df_valid.wage_from) &
                               (dim_occupations_table.wage_to == h1b_df_valid.wage_to),
                           'inner')\
                            .join(dim_employers_table, 
                               (dim_employers_table.employer_name == h1b_df_valid.emp_name) & 
                               (dim_employers_table.employer_city == h1b_df_valid.emp_city) &
                               (dim_employers_table.employer_state == h1b_df_valid.emp_state),
                           'inner')\
                            .join(dim_job_postings_table, 
                               (dim_job_postings_table.job_title == h1b_df_valid.job_title) & 
                               (dim_job_postings_table.job_city == h1b_df_valid.work_city) &
                               (dim_job_postings_table.job_state == h1b_df_valid.work_state) & 
                               (dim_job_postings_table.company_name == h1b_df_valid.emp_name),
                           'left_outer')\
                             .select(monotonically_increasing_id().alias("id"),\
                                     dim_h1b_cases_table.id.alias("h1b_id"),\
                                     dim_occupations_table.id.alias("occupation_id"),\
                                     dim_employers_table.id.alias("employer_id"),\
                                     dim_job_postings_table.id.alias("job_posting_id"),\
                                     dim_h1b_cases_table.work_city.alias("work_city"),\
                                     dim_h1b_cases_table.work_state.alias("work_state"))

print('******************************************************')
print(f'Finished reading "fact_h1b_sponsors_table" dataframe')
print('******************************************************')
print('\n')

## -- Write all tables to parquet files -- ##
# Dim_H1B_Cases
dim_h1b_cases_table.write.mode("overwrite")\
                    .parquet(output_data.format("dim_h1b_cases_table"))
print('******************************************************')
print(f'Finished writing "dim_h1b_cases_table" parquet files')
print('******************************************************')
print('\n')

# Dim_Employers
dim_employers_table.write.mode("overwrite")\
                    .parquet(output_data.format("dim_employers_table"))
print('******************************************************')
print(f'Finished writing "dim_employers_table" parquet files')
print('******************************************************')
print('\n')

# Dim_Occupations
dim_occupations_table.write.mode("overwrite")\
                    .parquet(output_data.format("dim_occupations_table"))
print('******************************************************')
print(f'Finished writing "dim_occupations_table" parquet files')
print('******************************************************')
print('\n')
      
# Dim_Job_Postings
dim_job_postings_table.write.mode("overwrite")\
                    .parquet(output_data.format("dim_job_postings_table"))
print('******************************************************')
print(f'Finished writing "dim_job_postings_table" parquet files')
print('******************************************************')
print('\n')
      
# Fact_H1B_Sponsors
# Note: Partitioned by 'work_city' and 'work_state'
fact_h1b_sponsors_table.write.mode("overwrite")\
.partitionBy("work_city", "work_state")\
.parquet(output_data.format("fact_h1b_sponsors_table"))
print('******************************************************')
print(f'Finished writing "fact_h1b_sponsors_table" parquet files')
print('******************************************************')

**************************************************
H1B Dataset count:4192087
**************************************************


**************************************************
Indeed Jobs Dataset count:30002
**************************************************


**************************************************
Amazon Jobs Dataset count:50
**************************************************


**************************************************
Cleaned H1B Dataset count:3838664
**************************************************


**************************************************
Cleaned Indeed Jobs Dataset count:30002
**************************************************


**************************************************
Cleaned Amazon Jobs Dataset count:46
**************************************************


******************************************************
Finished reading "dim_h1b_cases_table" dataframe
******************************************************


*****************

#### 4.2 Data Quality Checks
The following checks ensure that the pipeline ran as expected:
 * **Check Schema** for each table to make sure all columns have the intended data types.
 * **Count Check** for each table to make sure every table has some records inserted (i.e., it is not an empty table).
 * **Check Null on the Primary Key** of each table to make sure no table contains corrupted records.
 
 First, we will look at the original schema of each table. We will compare these schemas against the dataframes read back from the **parquet files**; the schema for the same table should look the same regardless of the source. 
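
The comparison described above can be sketched without Spark. Here each schema is a list of (field name, type name) pairs, as printed by `printSchema()`. `diff_schema` is a hypothetical helper for illustration, and the "read back" schema is deliberately altered to show a mismatch.

```python
def diff_schema(expected, actual):
    """Return human-readable differences between two schemas (illustrative helper).

    Both arguments are lists of (field_name, type_name) pairs.
    """
    problems = []
    expected_types = dict(expected)
    actual_types = dict(actual)
    for name, type_name in expected_types.items():
        if name not in actual_types:
            problems.append(f"{name}: missing")
        elif actual_types[name] != type_name:
            problems.append(f"{name}: expected {type_name}, got {actual_types[name]}")
    for name in actual_types:
        if name not in expected_types:
            problems.append(f"{name}: unexpected")
    return problems


# Schema of Dim_H1B_Cases as designed
expected = [("id", "long"), ("case_submitted", "date"), ("decision_date", "date"),
            ("job_title", "string"), ("work_city", "string"), ("work_state", "string")]
# Pretend one date column came back from the files as a string
read_back = [("id", "long"), ("case_submitted", "string"), ("decision_date", "date"),
             ("job_title", "string"), ("work_city", "string"), ("work_state", "string")]

print(diff_schema(expected, expected))
print(diff_schema(expected, read_back))
```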

In [11]:
# Print Schema for 'Dim_H1B_Cases' table
dim_h1b_cases_table.printSchema()

root
 |-- id: long (nullable = false)
 |-- case_submitted: date (nullable = true)
 |-- decision_date: date (nullable = true)
 |-- job_title: string (nullable = true)
 |-- work_city: string (nullable = true)
 |-- work_state: string (nullable = true)



In [12]:
# Print Schema for 'Dim_Employers' table
dim_employers_table.printSchema()

root
 |-- id: long (nullable = false)
 |-- employer_name: string (nullable = true)
 |-- employer_city: string (nullable = true)
 |-- employer_state: string (nullable = true)



In [13]:
# Print Schema for 'Dim_Occupations' table
dim_occupations_table.printSchema()

root
 |-- id: long (nullable = false)
 |-- soc_code: string (nullable = true)
 |-- soc_name: string (nullable = true)
 |-- naics_code: string (nullable = true)
 |-- full_time: integer (nullable = true)
 |-- prevailing_wage: float (nullable = true)
 |-- prevailing_unit: string (nullable = true)
 |-- wage_from: float (nullable = true)
 |-- wage_to: float (nullable = true)



In [14]:
# Print Schema for 'Dim_Job_Postings' table
dim_job_postings_table.printSchema()

root
 |-- id: long (nullable = false)
 |-- job_title: string (nullable = true)
 |-- job_city: string (nullable = true)
 |-- job_state: string (nullable = true)
 |-- company_name: string (nullable = true)
 |-- crawl_timestamp: timestamp (nullable = true)
 |-- source: string (nullable = false)



In [15]:
# Print Schema for 'Fact_H1B_Sponsors' table
fact_h1b_sponsors_table.printSchema()

root
 |-- id: long (nullable = false)
 |-- h1b_id: long (nullable = false)
 |-- occupation_id: long (nullable = false)
 |-- employer_id: long (nullable = false)
 |-- job_posting_id: long (nullable = true)
 |-- work_city: string (nullable = true)
 |-- work_state: string (nullable = true)



#### Run Quality Checks
Below is the code for the end-to-end data quality check process.

***
Note: All of this code can be turned into a single **Unit Test** Python script that runs on its own.
***

In [75]:
## -- Define functions used in this unit test -- ##
def checkSchema(dataframe, table_tuples, table_name):
    """Loop through the fields of the provided dataframe and compare each field's datatype
       with the expected datatype for that field name
    Parameters: 
        dataframe: Spark DataFrame
        table_tuples: list of dicts holding the expected 'field_name' and 'field_datatype'
        table_name: string of the name of the table whose schema is being checked
    Returns: 
        None
    Raises:
        ValueError: if a field's datatype differs from the expected one
    """
    
    # Loop through each dataframe's fields
    for f in dataframe.schema.fields:
        # Find the expected datatype for this field's name (None if no expectation was provided)
        table_field = next((x for x in table_tuples if x["field_name"] == f.name), None)
        if table_field is None:
            # No expectation for this field: report it instead of printing an empty message
            print(f'No expected data type was provided for \033[1m{table_name}.{f.name}\033[0m field. Skipped.')
        elif f.dataType != table_field["field_datatype"]:
            # The datatype differs from the expected one, raise an exception
            raise ValueError(f'Schema quality check failed. \033[1m{table_name}.{f.name}\033[0m field has wrong data type. \u2717')
        else:
            # The datatype matches, check passed
            print(f'Schema quality on table \033[1m{table_name}.{f.name}\033[0m field check passed. \u2713')
            
def checkCount(dataframe, table_name):
    """Check whether the provided dataframe has any records
    Parameters: 
        dataframe: Spark DataFrame
        table_name: string of the name of the table being checked
    Returns: 
        None
    """
    
    # Count the dataframe's records
    counts = dataframe.count()
    if counts < 1:
        # If there is no records, raise an exception
        raise ValueError(f'Data quality check failed. \033[1m{table_name}\033[0m table returned no results. \u2717')
    else:
        # Otherwise, check passed
        print(f'Data quality on table \033[1m{table_name}\033[0m check passed with {counts} records. \u2713')
        
def checkNullOnPK(dataframe, pk_field_name, table_name):
    """Check whether the provided dataframe has any records with a NULL primary key
    Parameters: 
        dataframe: Spark DataFrame
        pk_field_name: string of the primary key column's name
        table_name: string of the name of the table being checked
    Returns: 
        None
    Raises:
        ValueError: if any record has a NULL value in the primary key column
    """
    
    # Count the dataframe's records where its primary key column has a NULL value
    pk_null_counts = dataframe.where(dataframe[pk_field_name].isNull()).count()
    
    if pk_null_counts > 0:
        # If there are any such records, raise an exception
        raise ValueError(f'Data quality check failed. \033[1m{table_name}\033[0m table returned some records with null PRIMARY KEY. \u2717')
    else:
        # Otherwise, check passed
        print(f'Data quality on table \033[1m{table_name}\033[0m check passed. No record is corrupted. \u2713')

## -- Read parquet files from each table's folders -- ##
# Dim_H1B_Cases
dim_h1b_cases_parquet_df = spark.read.parquet(output_data.format("dim_h1b_cases_table"))

# Dim_Employers
dim_employers_parquet_df = spark.read.parquet(output_data.format("dim_employers_table"))

# Dim_Occupations
dim_occupations_parquet_df = spark.read.parquet(output_data.format("dim_occupations_table"))

# Dim_Job_Postings
dim_job_postings_parquet_df = spark.read.parquet(output_data.format("dim_job_postings_table"))

# Fact_H1B_Sponsors
fact_h1b_sponsors_parquet_df = spark.read.parquet(output_data.format("fact_h1b_sponsors_table"))

print('***************************************************')
print(f'Finished reading all dataframes from parquet files')
print('****************************************************')

## -- Perform quality checks -- ##
##################################

## -- 1. Check Schema -- ##
# Dim_H1B_Cases
table_tuples = []

table_tuples.append({'field_name': 'id', 'field_datatype': LongType()})
table_tuples.append({'field_name': 'case_submitted', 'field_datatype': DateType()})
table_tuples.append({'field_name': 'decision_date', 'field_datatype': DateType()})
table_tuples.append({'field_name': 'job_title', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'work_city', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'work_state', 'field_datatype': StringType()})

checkSchema(dim_h1b_cases_parquet_df, table_tuples, 'Dim_H1B_Cases')

# Dim_Employers
table_tuples = []

table_tuples.append({'field_name': 'id', 'field_datatype': LongType()})
table_tuples.append({'field_name': 'employer_name', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'employer_city', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'employer_state', 'field_datatype': StringType()})

checkSchema(dim_employers_parquet_df, table_tuples, 'Dim_Employers')
            
# Dim_Occupations
table_tuples = []

table_tuples.append({'field_name': 'id', 'field_datatype': LongType()})
table_tuples.append({'field_name': 'full_time', 'field_datatype': IntegerType()})
table_tuples.append({'field_name': 'prevailing_wage', 'field_datatype': FloatType()})
table_tuples.append({'field_name': 'wage_from', 'field_datatype': FloatType()})
table_tuples.append({'field_name': 'wage_to', 'field_datatype': FloatType()})
table_tuples.append({'field_name': 'soc_code', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'soc_name', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'naics_code', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'prevailing_unit', 'field_datatype': StringType()})

checkSchema(dim_occupations_parquet_df, table_tuples, 'Dim_Occupations')
            
# Dim_Job_Postings
table_tuples = []

table_tuples.append({'field_name': 'id', 'field_datatype': LongType()})
table_tuples.append({'field_name': 'crawl_timestamp', 'field_datatype': TimestampType()})
table_tuples.append({'field_name': 'job_title', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'job_city', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'job_state', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'company_name', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'source', 'field_datatype': StringType()})

checkSchema(dim_job_postings_parquet_df, table_tuples, 'Dim_Job_Postings')
            
# Fact_H1B_Sponsors
table_tuples = []

table_tuples.append({'field_name': 'id', 'field_datatype': LongType()})
table_tuples.append({'field_name': 'h1b_id', 'field_datatype': LongType()})
table_tuples.append({'field_name': 'occupation_id', 'field_datatype': LongType()})
table_tuples.append({'field_name': 'employer_id', 'field_datatype': LongType()})
table_tuples.append({'field_name': 'job_posting_id', 'field_datatype': LongType()})
table_tuples.append({'field_name': 'work_city', 'field_datatype': StringType()})
table_tuples.append({'field_name': 'work_state', 'field_datatype': StringType()})

checkSchema(fact_h1b_sponsors_parquet_df, table_tuples, 'Fact_H1B_Sponsors')

            
print('****************************************************************')
print(f'Finished checking schemas for all dataframes from parquet files')
print('****************************************************************')

## -- 2. Count Check -- ##

# Dim_H1B_Cases
checkCount(dim_h1b_cases_parquet_df, 'Dim_H1B_Cases')

# Dim_Employers
checkCount(dim_employers_parquet_df, 'Dim_Employers')

# Dim_Occupations
checkCount(dim_occupations_parquet_df, 'Dim_Occupations')

# Dim_Job_Postings
checkCount(dim_job_postings_parquet_df, 'Dim_Job_Postings')

# Fact_H1B_Sponsors
checkCount(fact_h1b_sponsors_parquet_df, 'Fact_H1B_Sponsors')

print('****************************************************************')
print(f'Finished count checking for all dataframes from parquet files')
print('****************************************************************')

## -- 3. Check Null on the Primary Key -- ##

# Dim_H1B_Cases
checkNullOnPK(dim_h1b_cases_parquet_df, "id", "Dim_H1B_Cases")

# Dim_Employers
checkNullOnPK(dim_employers_parquet_df, "id", "Dim_Employers")

# Dim_Occupations
checkNullOnPK(dim_occupations_parquet_df, "id", "Dim_Occupations")

# Dim_Job_Postings
checkNullOnPK(dim_job_postings_parquet_df, "id", "Dim_Job_Postings")

# Fact_H1B_Sponsors
checkNullOnPK(fact_h1b_sponsors_parquet_df, "id", "Fact_H1B_Sponsors")

print('***********************************************************************')
print(f'Finished checking NULL on the PK for all dataframes from parquet files')
print('***********************************************************************')

***************************************************
Finished reading all dataframes from parquet files
****************************************************
Schema quality on table Dim_H1B_Cases.id field check passed. ✓
Schema quality on table Dim_H1B_Cases.case_submitted field check passed. ✓
Schema quality on table Dim_H1B_Cases.decision_date field check passed. ✓
Schema quality on table Dim_H1B_Cases.job_title field check passed. ✓
Schema quality on table Dim_H1B_Cases.work_city field check passed. ✓
Schema quality on table Dim_H1B_Cases.work_state field check passed. ✓
Schema quality on table Dim_Employers.id field check passed. ✓
Schema quality on table Dim_Employers.employer_name field check passed. ✓
Schema quality on table Dim_Employers.employer_city field check passed. ✓
Schema quality on table Dim_Employers.employer_state field check passed. ✓
Schema quality on table Dim_Occupations.id fiel

#### 4.3 Data dictionary
Below is the **data dictionary** for each table in this project's data model.

#### 4.3.1 Dim_H1B_Cases

| Field Name | Data Type | Description | Required? | Accepts null value? | Notes |
|------|------|------|------|------|------|
|   id  | long | Identity column | Yes | No | Primary Key |
|   case_submitted  | date | Date when the H1B case was submitted to the system | No | Yes |  |
|   decision_date  | date | Date when the H1B case's decision was made | No | Yes |  |
|   job_title | string | Job title when filing the H1B's case | No | Yes | |
|   work_city  | string | City to perform the work when filing the H1B's case | No | Yes | |
|   work_state  | string | State to perform the work when filing the H1B's case | No | Yes | |

#### 4.3.2 Dim_Employers

| Field Name | Data Type | Description | Required? | Accepts null value? | Notes |
|------|------|------|------|------|------|
|   id  | long | Identity column | Yes | No | Primary Key |
|   employer_name  | string | Employer's name | No | Yes |  |
|   employer_city  | string | City where the employer's company resides | No | Yes |  |
|   employer_state  | string | State where the employer's company resides | No | Yes |  |

#### 4.3.3 Dim_Occupations

| Field Name | Data Type | Description | Required? | Accepts null value? | Notes |
|------|------|------|------|------|------|
|   id  | long | Identity column | Yes | No | Primary Key |
|   soc_code  | string | Standard Occupational Code | No | Yes |  |
|   soc_name | string | SOC occupation title | No | Yes |  |
|   naics_code  | string | North American Industry Classification System (NAICS) | No | Yes |  |
|   full_time  | integer | Whether the job position is full-time | No | Yes |  |
|   prevailing_wage | float | Average wage paid to similarly employed workers in a specific occupation | No | Yes |  |
|   prevailing_unit | string | Prevailing wage's unit of payment | No | Yes | For example: by year, month, hour |
|   wage_from  | float | Minimum (starting) wage for the job title | No | Yes |  |
|   wage_to | float | Maximum wage for the job title | No | Yes |  |

#### 4.3.4 Dim_Job_Postings 

| Field Name | Data Type | Description | Required? | Accepts null value? | Notes |
|------|------|------|------|------|------|
|   id  | long | Identity column | Yes | No | Primary key |
|   job_title | string | Job title | No | Yes |  |
|   job_city | string | City where the job will be performed | No | Yes |  |
|   job_state | string | State where the job will be performed | No | Yes |  |
|   company_name | string | Job posted by company | No | Yes |  |
|   crawl_timestamp | timestamp | Timestamp when the automatic crawler detected the posting |  No | Yes |  |
|   source | string | Source domain where the job was posted | Yes | No |  |

#### 4.3.5 Fact_H1B_Sponsors 

| Field Name | Data Type | Description | Required? | Accepts null value? | Notes |
|------|------|------|------|------|------|
|   id | long | Identity column | Yes | No | Primary key |
|   h1b_id | long | Reference key to **Dim_H1B_Cases** table | Yes | No | Foreign key |
|   occupation_id | long | Reference key to **Dim_Occupations** table | Yes | No | Foreign key |
|   employer_id  | long | Reference key to **Dim_Employers** table | Yes | No | Foreign key |
|   job_posting_id  | long | Reference key to **Dim_Job_Postings** table | No | Yes | Nullable foreign key |
|   work_city  | string | City where the job will be performed | No | Yes | Partition key |
|   work_state  | string | State where the job will be performed | No | Yes | Partition key |
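
The "Accepts null value?" column of a data dictionary can also be kept in code so that it is checked automatically. The following is a minimal sketch in plain Python (no Spark), assuming rows are available as dictionaries. The field names and nullability are taken from the **Fact_H1B_Sponsors** table above; `FACT_H1B_SPONSORS_FIELDS`, `find_null_violations` and the two sample rows are illustrative names and made-up values that do not appear elsewhere in this project.

```python
# Illustrative sketch: the Fact_H1B_Sponsors data dictionary as plain Python.
# The dictionary layout and the helper function are assumptions for this example.
FACT_H1B_SPONSORS_FIELDS = {
    # field name: (data type, accepts null?)
    "id": ("long", False),
    "h1b_id": ("long", False),
    "occupation_id": ("long", False),
    "employer_id": ("long", False),
    "job_posting_id": ("long", True),
    "work_city": ("string", True),
    "work_state": ("string", True),
}


def find_null_violations(row, fields):
    """Return the names of non-nullable fields that are missing or None in row."""
    return [
        name
        for name, (_dtype, nullable) in fields.items()
        if not nullable and row.get(name) is None
    ]


# Made-up sample rows, for illustration only.
good_row = {"id": 1, "h1b_id": 10, "occupation_id": 20, "employer_id": 30,
            "job_posting_id": None, "work_city": "SEATTLE", "work_state": "WA"}
bad_row = {"id": 2, "h1b_id": None, "occupation_id": 21, "employer_id": 31}

print(find_null_violations(good_row, FACT_H1B_SPONSORS_FIELDS))  # []
print(find_null_violations(bad_row, FACT_H1B_SPONSORS_FIELDS))   # ['h1b_id']
```

The same idea could be extended to the dimension tables by adding one dictionary per table.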

#### Step 5: Complete Project Write Up

### <font color=blue>Clearly state the rationale for the choice of tools and technologies for the project.</font>

**Below is the list of tools I decided to use in this project:**
- AWS regular account
- AWS S3
- Jupyter Notebook
- Spark (built-in local mode)

I chose these technologies for the following reasons:
1. The **AWS** cloud platform has been refined over many years; it is reliable and widely adopted across industries.
2. With the rising popularity of open-source software and the rapid growth of data science and machine learning, the **Jupyter Notebook** has become popular among data scientists. It makes it easy to use **AI & ML** libraries, and the notebook itself can display images, colors and graphics such as tables.
3. **Spark** is optimised to operate in-memory, which lets it process data much faster than many disk-based solutions. This makes it a good fit for prototyping and development, as in this project.

### <font color=blue>Propose how often the data should be updated and why.</font>

We could update our data every 6 months (on a **semiannual** basis). USCIS often takes 4-5 months to process an H-1B petition and update its case status.[1] A 6-month interval therefore gives most pending cases time to reach a decision before the next update.

<i>Source: https://internationalaffairs.uchicago.edu/page/processing-times-h-1b-petitions</i>

### Write a description of how you would approach the problem differently under the following scenarios:

###### <font color=blue>The data was increased by 100x.</font>

We can migrate from the current setup (local-mode Spark writing to a **Data Lake with S3**) to **AWS EMR** instead.

**AWS EMR** has **auto scaling** and **resizing** capabilities, so the cluster's capacity can be increased or decreased on demand. This reduces the overall cost, because we do not need to rent a massive cluster all the time; Amazon manages the scaling for us.

The downside of **AWS EMR** is that it is often run as a **transient cluster**, a compute cluster that shuts down automatically when processing is finished. The data is therefore not updated in real time, and some kind of scheduler automation is required to update it.

###### <font color=blue>The data populates a dashboard that must be updated on a daily basis by 7am every day.</font>

We can integrate our code with **Apache Airflow**, an open-source workflow management platform. Below is sample code showing how to set the schedule interval on a **DAG** so that it runs at 7am every day.

```python
from airflow import DAG

dag = DAG('udac_example_dag',
          description='Sample DAG with Airflow',
          schedule_interval='0 7 * * *'  # minute 0, hour 7, every day
        )
```
The cron expression `0 7 * * *` generates the following schedule <i>(5 intervals shown as an example)</i>:
1. 2020-09-20 Sun 07:00:00
2. 2020-09-21 Mon 07:00:00
3. 2020-09-22 Tue 07:00:00
4. 2020-09-23 Wed 07:00:00
5. 2020-09-24 Thu 07:00:00
***
Note: Airflow's `schedule_interval` takes a standard five-field cron expression (minute, hour, day of month, month, day of week). Generators such as http://www.cronmaker.com/ emit Quartz-style expressions (for example `0 0 7 1/1 * ? *`), which use a different field layout and should be converted first.
***
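
As a quick sanity check of the schedule listed above, the run times can be reproduced with the standard library alone. This is a minimal sketch, not Airflow's scheduler: `daily_runs` is a hypothetical helper introduced for this example, and it ignores time zones.

```python
from datetime import datetime, time, timedelta


def daily_runs(start, count, run_at=time(7, 0)):
    """Return the next `count` daily run times at `run_at`, on or after `start`."""
    first = datetime.combine(start.date(), run_at)
    if first < start:
        # Today's run time has already passed, so begin tomorrow.
        first += timedelta(days=1)
    return [first + timedelta(days=i) for i in range(count)]


# Start date chosen to match the first interval in the list above.
for run in daily_runs(datetime(2020, 9, 20), 5):
    print(run.strftime("%Y-%m-%d %a %H:%M:%S"))
```

With this start date, the loop covers the same five days as the list, from Sunday 2020-09-20 to Thursday 2020-09-24, each at 07:00:00.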

###### <font color=blue>The database needed to be accessed by 100+ people.</font>

We can migrate from the existing **Data Lake with S3** to a **Data Warehouse with AWS Redshift** instead.

**AWS Redshift** generally outperforms reading files from an **S3 Bucket** when a large number of users access the data. Queries that are executed frequently usually run faster on subsequent executions, because Redshift spends a good portion of the first execution planning and compiling the query and can reuse that work afterwards.

**AWS Redshift** has an architecture that allows massively parallel processing across multiple nodes, which reduces load times. Its clusters are available **24/7**, which is convenient for 100+ people who need to access the data warehouse at any time.

The downside of **AWS Redshift** is that it is generally costly compared with most other **AWS Cloud** products.
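
One common way to populate Redshift from existing Parquet files is the `COPY` command with `FORMAT AS PARQUET`. The sketch below only builds the SQL text and does not connect to anything. The bucket path, the IAM role ARN and the target table names are placeholders (assumptions), and `build_copy_statement` is a hypothetical helper; only the `*_table` folder names come from this project. Running the statements would require a Redshift cluster with matching tables already created.

```python
def build_copy_statement(table, s3_path, iam_role):
    """Build a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS PARQUET;"
    )


# Placeholder values, for illustration only.
S3_ROOT = "s3://my-example-bucket/output"
IAM_ROLE = "arn:aws:iam::123456789012:role/example-redshift-role"

for name in ["dim_h1b_cases", "dim_employers", "dim_occupations"]:
    print(build_copy_statement(name, f"{S3_ROOT}/{name}_table/", IAM_ROLE))
    print()
```

Tables written with partition folders may need extra handling, since `COPY` loads only the columns stored inside the Parquet files themselves.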