# Project Title
### Data Engineering Capstone Project

#### Project Summary
The main goal of the project is to create a data lake in S3, using Airflow to trigger Spark jobs.  

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
import pandas as pd
import os
import configparser
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, DateType

In [2]:
spark = SparkSession.builder.appName("DataEngineeringCapstoneProject").getOrCreate()

# spark = SparkSession.builder.\
# config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
# .enableHiveSupport().getOrCreate()

### Step 1: Scope the Project and Gather Data

#### Scope 
The scope of the project is to build a data model for analyzing worldwide COVID-19 vaccination progress. For example, the model can be used to identify where vaccination progress needs to speed up.

#### The following technologies were used: 
- Spark
- Airflow
- AWS EMR and S3

#### Describe and Gather Data 

##### 1. The data sources:

Daily and total COVID-19 vaccinations worldwide are provided by Kaggle:
- [country_vaccinations.csv](https://www.kaggle.com/gpreda/covid-world-vaccination-progress)

WHO Coronavirus Disease (COVID-19) global data is provided by the WHO:
- [WHO-COVID-19-global-data.csv](https://covid19.who.int/)

ISO country codes are provided by Kaggle:
- [country_code.csv](https://www.kaggle.com/koki25ando/country-code)

A dataset of useful country-level features (12 columns) is provided by Kaggle:
- [Countries_usefulFeatures.csv](https://www.kaggle.com/ishivinal/covid19-useful-features-by-country)

Data extracted from Wikipedia's lists of countries by category is provided by Kaggle:
- [WORLD DATA by country (2020)](https://www.kaggle.com/daniboy370/world-data-by-country-2020?select=Life+expectancy.csv)

##### 2. Gather Data

In [None]:
spark.conf.set('spark.sql.session.timeZone', 'UTC') 

In [None]:
COVID19_vaccinations_spark = spark.read.csv("data/country_vaccinations.csv",sep=",", inferSchema=True, header=True)

##### The data contains the following information:

 - Country - the country for which the vaccination information is provided;
 - Country ISO Code - ISO code for the country;
 - Date - date of the data entry; for some dates only the daily vaccinations are available, for others only the (cumulative) total;
 - Total number of vaccinations - the absolute number of total immunizations in the country;
 - Total number of people vaccinated - a person, depending on the immunization scheme, will receive one or more (typically 2) vaccines; at a certain moment, the number of vaccinations might be larger than the number of people;
 - Total number of people fully vaccinated - the number of people who received the entire set of immunizations according to the immunization scheme (typically 2); at a certain moment in time, there might be a certain number of people who received one vaccine and another (smaller) number of people who received all vaccines in the scheme;
 - Daily vaccinations (raw) - for a certain data entry, the number of vaccinations for that date/country;
 - Daily vaccinations - for a certain data entry, the number of vaccinations for that date/country;
 - Total vaccinations per hundred - ratio (in percent) between the number of vaccinations and the total population up to the date in the country;
 - Total number of people vaccinated per hundred - ratio (in percent) between the immunized population and the total population up to the date in the country;
 - Total number of people fully vaccinated per hundred - ratio (in percent) between the fully immunized population and the total population up to the date in the country;
 - Number of vaccinations per day - number of daily vaccinations for that day and country;
 - Daily vaccinations per million - ratio (in ppm) between the number of vaccinations and the total population for the current date in the country;
 - Vaccines used in the country - total number of vaccines used in the country (up to date);
 - Source name - source of the information (national authority, international organization, local organization etc.);
 - Source website - website of the source of information;
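
An explicit schema is one alternative to `inferSchema=True`, which needs an extra pass over the file and infers types from whatever values happen to be present. The sketch below builds a DDL schema string for the columns listed above; `spark.read.csv` accepts such a string through its `schema` argument. The names `vaccines`, `source_name`, and `source_website`, and all of the types, are assumptions that should be checked against the `printSchema()` output.

```python
# (column name, Spark SQL type) pairs. The last three names and all types are
# assumptions; verify them against printSchema() before relying on them.
VACCINATION_COLUMNS = [
    ("country", "STRING"),
    ("iso_code", "STRING"),
    ("date", "DATE"),
    ("total_vaccinations", "DOUBLE"),
    ("people_vaccinated", "DOUBLE"),
    ("people_fully_vaccinated", "DOUBLE"),
    ("daily_vaccinations_raw", "DOUBLE"),
    ("daily_vaccinations", "DOUBLE"),
    ("total_vaccinations_per_hundred", "DOUBLE"),
    ("people_vaccinated_per_hundred", "DOUBLE"),
    ("people_fully_vaccinated_per_hundred", "DOUBLE"),
    ("daily_vaccinations_per_million", "DOUBLE"),
    ("vaccines", "STRING"),
    ("source_name", "STRING"),
    ("source_website", "STRING"),
]


def schema_ddl(columns):
    """Join (name, type) pairs into a DDL string such as '`a` STRING, `b` DOUBLE'."""
    # Backticks keep names that look like SQL keywords (e.g. date) unambiguous.
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


vaccinations_ddl = schema_ddl(VACCINATION_COLUMNS)
```

With a Spark session available, the string could be passed as `spark.read.csv("data/country_vaccinations.csv", header=True, schema=vaccinations_ddl)`.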

In [None]:
COVID19_vaccinations_spark.printSchema()

Add the columns *year*, *month*, and *week_of_year* for aggregating the data (this step is currently commented out):

In [None]:
# COVID19_vaccinations_spark = COVID19_vaccinations_spark.withColumn('year', func.year('date'))\
#                               .withColumn('month', func.month('date'))\
#                               .withColumn('week_of_year', func.weekofyear('date'))
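
Spark's `weekofyear` follows ISO 8601 week numbering (weeks start on Monday, and week 1 is the first week with more than three days), so dates around New Year can carry a week number that belongs to the adjacent year. The sketch below reproduces the three derived values in plain Python to make that edge case visible; `time_parts` is a hypothetical helper, not part of the notebook.

```python
from datetime import date


def time_parts(d):
    """Return the calendar year, month, and ISO 8601 week number of a date."""
    return {
        "year": d.year,
        "month": d.month,
        # isocalendar() returns (ISO year, ISO week, ISO weekday)
        "week_of_year": d.isocalendar()[1],
    }


new_years_day = time_parts(date(2021, 1, 1))  # Friday: ISO week 53, calendar year 2021
first_monday = time_parts(date(2021, 1, 4))   # Monday: ISO week 1
```

Grouping by `year` and `week_of_year` together would therefore place 1 January 2021 in "week 53 of 2021" rather than with the last week of 2020, which is worth keeping in mind when building weekly totals.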

In [None]:
COVID19_vaccinations_spark.show(1)

In [None]:
COVID19_global_spark = spark.read.csv("data/WHO-COVID-19-global-data.csv",sep=",", inferSchema=True, header=True)

In [None]:
COVID19_global_spark.printSchema()

In [None]:
COVID19_global_spark.show(5)

In [None]:
country_code_spark = spark.read.csv("data/country_code.csv",sep=",", inferSchema=True, header=True)

In [None]:
country_code_spark.printSchema()

In [None]:
country_code_spark.show(5)

In [None]:
COVID19_usefulFeatures_spark = spark.read.csv("data/Countries_usefulFeatures.csv",sep=",", inferSchema=True, header=True)

- Country_Region: Name of the country
- Population_Size: the population size 2018 stats
- Tourism: International tourism, number of arrivals 2018
- Date_FirstFatality: Date of the first Fatality of the COVID-19
- Date_FirstConfirmedCase: Date of the first confirmed case of the COVID-19
- Latitude
- Longitude
- Mean_Age: mean age of the population 2018 stats
- Lockdown_Date: date of the lockdown
- Lockdown_Type: type of the lockdown
- Country_Code: 3-character country code

In [None]:
COVID19_usefulFeatures_spark.printSchema()
COVID19_usefulFeatures_spark.show(5)

In [None]:
country_GDP_spark = spark.read.csv("data/WORLD DATA by country (2020)/GDP per capita.csv",sep=",", inferSchema=True, header=True)

In [None]:
country_GDP_spark.printSchema()
country_GDP_spark.show(5)

In [None]:
country_life_expectancy_spark = spark.read.csv("data/WORLD DATA by country (2020)/Life expectancy.csv",sep=",", inferSchema=True, header=True)

In [None]:
country_life_expectancy_spark.printSchema()
country_life_expectancy_spark.show(5)

In [None]:
country_median_age_spark = spark.read.csv("data/WORLD DATA by country (2020)/Median age.csv",sep=",", inferSchema=True, header=True)
country_median_age_spark.printSchema()
country_median_age_spark.show(5)

In [None]:
country_population_growth_spark = spark.read.csv("data/WORLD DATA by country (2020)/Population growth.csv",sep=",", inferSchema=True, header=True)
country_population_growth_spark.printSchema()
country_population_growth_spark.show(5)

In [None]:
country_urbanization_rate_spark = spark.read.csv("data/WORLD DATA by country (2020)/Urbanization rate.csv",sep=",", inferSchema=True, header=True)
country_urbanization_rate_spark.printSchema()
country_urbanization_rate_spark.show(5)

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

2.1 Explore and clean the *COVID19_vaccinations_spark* DataFrame

In [None]:
# Explore the Null value columns
count = COVID19_vaccinations_spark.count()
{col: f'{ COVID19_vaccinations_spark.filter(COVID19_vaccinations_spark[col].isNull()).count()/count *100 :.2f}%'\
for col in COVID19_vaccinations_spark.columns}
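
The null-percentage expression above runs one Spark job per column and is repeated for every DataFrame in this step. A possible refactoring, sketched under the assumption that a single aggregation over all columns is acceptable, is a small reusable helper; `null_percentages` and `format_percentages` are hypothetical names.

```python
def format_percentages(null_counts, total_rows):
    """Turn {column: null count} into {column: 'xx.xx%'}."""
    return {
        column: f"{(nulls / total_rows * 100 if total_rows else 0.0):.2f}%"
        for column, nulls in null_counts.items()
    }


def null_percentages(df):
    """Null percentage per column of a Spark DataFrame, computed in one aggregation."""
    # Imported lazily so the formatting helper above works without pyspark.
    from pyspark.sql import functions as func

    # count() skips nulls, and when() without otherwise() yields null for
    # non-matching rows, so each aggregate counts only the null cells.
    row = df.select(
        func.count(func.lit(1)).alias("__total_rows"),
        *[func.count(func.when(df[c].isNull(), 1)).alias(c) for c in df.columns],
    ).first()
    counts = row.asDict()
    total_rows = counts.pop("__total_rows")
    return format_percentages(counts, total_rows)
```

Each exploration cell could then be reduced to a call such as `null_percentages(COVID19_vaccinations_spark)`.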

Null values in `iso_code` are filled with the ISO country code (`GBR` for the rows found below). Null values in the other columns are filled with 0.

In [None]:
dict_null = COVID19_vaccinations_spark.select("country","iso_code")\
                          .filter(COVID19_vaccinations_spark["iso_code"].isNull()).groupBy("country").count()
dict_null.show()

In [None]:
# Fill missing ISO codes with "GBR" and missing numeric values with 0.0
vaccinations_cleaned = COVID19_vaccinations_spark.fillna({
    "iso_code": "GBR",
    "total_vaccinations": 0.0, "people_vaccinated": 0.0, "people_fully_vaccinated": 0.0,
    "daily_vaccinations_raw": 0.0, "daily_vaccinations": 0.0,
    "total_vaccinations_per_hundred": 0.0, "people_vaccinated_per_hundred": 0.0,
    "people_fully_vaccinated_per_hundred": 0.0, "daily_vaccinations_per_million": 0.0,
})

2.2 Explore and clean *COVID19_global_spark*

In [None]:
# Explore the Null value columns
count = COVID19_global_spark.count()
{col: f'{ COVID19_global_spark.filter(COVID19_global_spark[col].isNull()).count()/count *100 :.2f}%'\
for col in COVID19_global_spark.columns}

In [None]:
global_data_clean = COVID19_global_spark
global_data_clean.select("country","Country_code").show(1)

The country code in this dataset is a 2-character code.
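
Because the WHO data uses a 2-character code while the vaccination data carries a 3-character `iso_code`, the two cannot be joined on the code directly. A minimal sketch of one way to bridge them with a lookup table follows; the key names `alpha2` and `alpha3` are hypothetical, since the column names of `country_code.csv` are not shown here.

```python
def build_code_lookup(rows, alpha2_key="alpha2", alpha3_key="alpha3"):
    """Map 2-character country codes to 3-character ones.

    `rows` is an iterable of dicts; both key names are hypothetical placeholders.
    """
    lookup = {}
    for row in rows:
        alpha2, alpha3 = row.get(alpha2_key), row.get(alpha3_key)
        if alpha2 and alpha3:  # skip rows where either code is missing
            lookup[alpha2.strip().upper()] = alpha3.strip().upper()
    return lookup


# Two well-known ISO 3166-1 pairs plus one incomplete row as sample input.
sample_rows = [
    {"alpha2": "GB", "alpha3": "GBR"},
    {"alpha2": "us", "alpha3": "usa"},
    {"alpha2": None, "alpha3": "XXX"},
]
code_lookup = build_code_lookup(sample_rows)
```

In Spark, the same idea would be a join between `global_data_clean` and `country_code_spark` on the 2-character code column.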

2.3 Explore and clean *COVID19_usefulFeatures_spark*

In [None]:
# Explore the Null value columns
count = COVID19_usefulFeatures_spark.count()
{col: f'{ COVID19_usefulFeatures_spark.filter(COVID19_usefulFeatures_spark[col].isNull()).count()/count *100 :.2f}%'\
for col in COVID19_usefulFeatures_spark.columns}

The NULL values in the following columns are meaningful, so they are kept as-is:
- Date_FirstFatality: Date of the first Fatality of the COVID-19
- Lockdown_Date: date of the lockdown
- Lockdown_Type: type of the lockdown

In [None]:
usefulFeatures_cleaned = COVID19_usefulFeatures_spark

2.4 Explore and clean *WORLD DATA by country (2020)*

In [None]:
count = country_GDP_spark.count()
{col: f'{ country_GDP_spark.filter(country_GDP_spark[col].isNull()).count()/count *100 :.2f}%'\
for col in country_GDP_spark.columns}

In [None]:
country_GDP_cleaned = country_GDP_spark

In [None]:
count = country_life_expectancy_spark.count()
{col: f'{ country_life_expectancy_spark.filter(country_life_expectancy_spark[col].isNull()).count()/count *100 :.2f}%'\
for col in country_life_expectancy_spark.columns}

In [None]:
country_life_expectancy_cleaned = country_life_expectancy_spark 

In [None]:
count = country_median_age_spark.count()
{col: f'{ country_median_age_spark.filter(country_median_age_spark[col].isNull()).count()/count *100 :.2f}%'\
for col in country_median_age_spark.columns}

In [None]:
dict_null = country_median_age_spark.select("Country","ISO-code")\
                          .filter(country_median_age_spark["ISO-code"].isNull()).groupBy("country").count()
dict_null.show(truncate=False)

In [None]:
country_median_age_cleaned = country_median_age_spark.fillna({"ISO-code":"VGB"})

In [None]:
count = country_population_growth_spark.count()
{col: f'{ country_population_growth_spark.filter(country_population_growth_spark[col].isNull()).count()/count *100 :.2f}%'\
for col in country_population_growth_spark.columns}

In [None]:
country_population_growth_cleaned = country_population_growth_spark

In [None]:
count = country_urbanization_rate_spark.count()
{col: f'{ country_urbanization_rate_spark.filter(country_urbanization_rate_spark[col].isNull()).count()/count *100 :.2f}%'\
for col in country_urbanization_rate_spark.columns}

In [None]:
country_urbanization_rate_cleaned = country_urbanization_rate_spark

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
The dimension tables are the entry points to the data. There are four of them: *country_region_dim*, *time_dim*, *vaccines_dim*, and *source_dim*. There is one fact table, *vaccinations_fact*, and each measurement event in the world has a one-to-one relationship with the corresponding fact table row.

![DataModel.jpg](image/DataModel.jpg)
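
To illustrate the one-to-one relationship between a measurement event and a fact row, the sketch below turns one cleaned vaccination record into one fact row that points at the four dimensions. The choice of keys and measures is an assumption made for illustration; the authoritative column layout is the one in the diagram above.

```python
# Assumed natural keys for the four dimensions and assumed fact measures.
DIMENSION_KEYS = {
    "country_region_dim": "iso_code",
    "time_dim": "date",
    "vaccines_dim": "vaccines",
    "source_dim": "source_name",
}
FACT_MEASURES = ["total_vaccinations", "people_vaccinated", "daily_vaccinations"]


def to_fact_row(record):
    """Build one vaccinations_fact row from one source record (one event, one row)."""
    fact = {key: record[key] for key in DIMENSION_KEYS.values()}
    fact.update({measure: record.get(measure, 0.0) for measure in FACT_MEASURES})
    return fact


# Made-up sample values, not real figures.
sample_record = {
    "country": "Exampleland",
    "iso_code": "GBR",
    "date": "2021-01-04",
    "vaccines": "Vaccine A",
    "source_name": "Example authority",
    "total_vaccinations": 100.0,
    "daily_vaccinations": 10.0,
}
fact_row = to_fact_row(sample_record)
```

Descriptive attributes such as the country name stay in the dimension tables, so the fact row carries only keys and numeric measures.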


#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

![ETL.jpg](image/ETL.jpg)
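
As a rough sketch of how Airflow could trigger the Spark steps in order, the code below chains one `BashOperator` per step, each calling `spark-submit`. It assumes Airflow 1.10-style import paths and that `spark-submit` is reachable from the worker; the DAG id, step names, and script directory are hypothetical, and the actual steps are those in the pipeline diagram.

```python
from datetime import datetime, timedelta

# Hypothetical step names; replace with the steps from the pipeline diagram.
ETL_STEPS = ["stage_raw_data", "build_dimensions", "build_fact", "run_quality_checks"]
SCRIPT_DIR = "/usr/local/airflow/dags/scripts"  # assumed location of the Spark scripts


def spark_submit_command(step, script_dir=SCRIPT_DIR):
    """Shell command that runs the Spark script for one step."""
    return f"spark-submit {script_dir}/{step}.py"


def build_dag():
    """Create a DAG whose tasks run the ETL steps one after another."""
    # Airflow 1.10 import paths; imported lazily so the helpers above need no Airflow.
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="covid19_vaccination_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    )
    previous = None
    for step in ETL_STEPS:
        task = BashOperator(task_id=step, bash_command=spark_submit_command(step), dag=dag)
        if previous is not None:
            previous >> task  # run the steps strictly in list order
        previous = task
    return dag
```

In a DAG file, a module-level `dag = build_dag()` makes the DAG visible to the scheduler.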

### Step 4: Run Pipelines to Model the Data 

In [None]:
! ls -l

```
- LICENSE
- README.md
- data   # raw data files
- image # project relevant images
- docker-airflow #  DAG files 
- research.ipynb # The project jupyter notebook file
```

#### 4.1 Install Airflow Docker

##### 1. Download and install Docker
https://www.docker.com/products/docker-desktop

##### 2. Install docker-airflow
```
docker pull puckel/docker-airflow

```

##### 3. Build docker-airflow
```
cd <project path>/docker-airflow/

docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" -t puckel/docker-airflow .

docker build --rm --build-arg PYTHON_DEPS="flask_oauthlib>=0.9" -t puckel/docker-airflow .
```

##### 4. Run Airflow Docker

```
docker-compose -f docker-compose-LocalExecutor.yml up -d

docker ps

docker exec -it <container_id> /bin/bash
```

The Airflow web UI is then available at http://localhost:8080/admin/

<img src="image/localhost.jpg" width="400" align="left">

#### 4.2 Create the data model
Build the data pipelines to create the data model.

#### 4.3 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
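
A minimal sketch of two of the checks listed above, a completeness (row count) check and a unique-key check, is given below. `run_quality_checks` and its `tables` argument are hypothetical; the function expects a mapping from table name to a Spark DataFrame and its key columns.

```python
def check_not_empty(table_name, row_count):
    """Completeness check: the table must contain at least one row."""
    if row_count < 1:
        raise ValueError(f"Quality check failed: {table_name} is empty")


def check_unique_key(table_name, row_count, distinct_key_count):
    """Integrity check: every row must have a distinct key."""
    if row_count != distinct_key_count:
        surplus = row_count - distinct_key_count
        raise ValueError(
            f"Quality check failed: {table_name} has {surplus} row(s) with a repeated key"
        )


def run_quality_checks(tables):
    """Run both checks on {table name: (Spark DataFrame, [key columns])}."""
    for table_name, (df, key_columns) in tables.items():
        row_count = df.count()
        check_not_empty(table_name, row_count)
        distinct_keys = df.select(*key_columns).distinct().count()
        check_unique_key(table_name, row_count, distinct_keys)
```

Raising an exception rather than printing a message lets a scheduler such as Airflow mark the task as failed when a check does not pass.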

#### 4.4 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.
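
One lightweight way to keep the data dictionary next to the code is a list of field entries rendered as a Markdown table. The three entries below reuse column descriptions from Step 1 and are only a starting point; `DATA_DICTIONARY` and `to_markdown_table` are hypothetical names.

```python
DATA_DICTIONARY = [
    {"field": "iso_code", "description": "ISO code for the country",
     "source": "country_vaccinations.csv"},
    {"field": "date", "description": "Date of the data entry",
     "source": "country_vaccinations.csv"},
    {"field": "total_vaccinations",
     "description": "Absolute number of total immunizations in the country",
     "source": "country_vaccinations.csv"},
]


def to_markdown_table(entries, columns=("field", "description", "source")):
    """Render a list of dicts as a Markdown table with the given columns."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    rows = ["| " + " | ".join(entry[c] for c in columns) + " |" for entry in entries]
    return "\n".join([header, divider] + rows)


data_dictionary_md = to_markdown_table(DATA_DICTIONARY)
```

In a notebook, `IPython.display.Markdown(data_dictionary_md)` would render the table in the cell output.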

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.