This project pulls data from all sources and builds fact and dimension tables that show the movement of immigration into the US. It uses Spark to build a data warehouse in Parquet format that reflects immigration data at US airports. The model is a star schema with one fact table and several dimension tables.

The project follows these steps:
- Step 1: Scope the Project and Gather Data
- Step 2: Explore and Assess the Data
- Step 3: Define the Data Model
- Step 4: Run ETL to Model the Data

## Setup

In [1]:
%%capture
!git clone https://github.com/RecoHut-Datasets/i94_immigration
%cd i94_immigration
!tar -xvf sas_data.tar.xz

In [93]:
%%capture
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
!tar xf spark-3.0.0-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install -q pyspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/i94_immigration/spark-3.0.0-bin-hadoop3.2"

import findspark
findspark.init()

In [25]:
import pandas as pd
import pyspark

from pyspark.sql import SparkSession, SQLContext, GroupedData
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import date_add as d_add

import warnings
warnings.filterwarnings('ignore')

In [4]:
# Build the Spark session
spark = SparkSession.builder \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11") \
    .enableHiveSupport() \
    .getOrCreate()

## Gather and Explore Data

### Sources Configurations

In [14]:
paths = {
    "demographics" : "./us-cities-demographics.csv",
    "airports" :  "./airport-codes_csv.csv",
    "sas_data" : "./sas_data",
    "us_states" : "./us_states.csv",
    "cities" : "./cities.csv",
    "countries" : "./countries.csv",
    "visa" : "./visa.csv",
    "inmigrant_airports" : "./airports.csv",
    "mode" : "./mode.csv",
    "airlines" : "./airlines.dat"
}
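
Before any source is read, it can help to confirm that every configured path actually exists. The following is a minimal sketch and not part of the original notebook: `missing_sources` is a hypothetical helper name, the example entries are made up, and it assumes the paths point at the local filesystem, as the `paths` dictionary above does.

```python
import os


def missing_sources(paths):
    """Return the keys of `paths` whose file or directory does not exist (hypothetical helper)."""
    return sorted(name for name, path in paths.items() if not os.path.exists(path))


# Made-up entries: "." always exists, the second path is assumed not to exist.
example_paths = {"here": ".", "nowhere": "./this-path-should-not-exist-12345"}
print(missing_sources(example_paths))
```

Calling `missing_sources(paths)` right after the configuration cell would list any source that still needs to be downloaded or extracted.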

### Get all the sources

In [15]:
class Source:
    """
    Get the sources and return dataframes
    """

    def __init__(self, spark, paths):
        self.spark = spark
        self.paths = paths

    def _get_standard_csv(self, filepath, delimiter=","):
        """
        Get sources in CSV format
        :param filepath: csv file path
        :param delimiter: delimiter
        :return: dataframe
        """
        return self.spark.read.format("csv").option("header", "true").option("delimiter", delimiter).load(filepath)

    def get_cities_demographics_raw(self):
        """
        Get demographics dataset
        :return: demographics dataset
        """
        return self._get_standard_csv(filepath=self.paths["demographics"], delimiter=";")

    def get_airports_raw(self):
        """
        Get airports dataset
        :return: airports dataset
        """
        return self._get_standard_csv(self.paths["airports"])

    def get_inmigration_raw(self):
        """
        Get immigration dataset.
        :return: immigration dataset
        """
        return self.spark.read.parquet(self.paths["sas_data"])

    def get_countries_raw(self):
        """
        Get countries dataset
        :return: countries dataset
        """
        # return self.spark.read.json(self.paths["countries"], multiLine=True)
        return self._get_standard_csv(self.paths["countries"])

    def get_visa_raw(self):
        """
        Get visa dataset
        :return: visa dataset
        """
        return self._get_standard_csv(self.paths["visa"])

    def get_mode_raw(self):
        """
        Get modes dataset
        :return: modes dataset
        """
        return self._get_standard_csv(self.paths["mode"])

    def get_airlines(self):
        """
        Get airlines dataset
        :return: airlines dataset
        """
        schema = StructType([
            StructField("Airline_ID", IntegerType(), True),
            StructField("Name", StringType(), True),
            StructField("Alias", StringType(), True),
            StructField("IATA", StringType(), True),
            StructField("ICAO", StringType(), True),
            StructField("Callsign", StringType(), True),
            StructField("Country", StringType(), True),
            StructField("Active", StringType(), True)])

        return self.spark.read.csv(self.paths["airlines"], header=False, schema=schema)

In [17]:
source = Source(spark, paths)

demog = source.get_cities_demographics_raw()
airport = source.get_airports_raw()
sas_data = source.get_inmigration_raw()
countries = source.get_countries_raw()
visa = source.get_visa_raw()
mode = source.get_mode_raw()
airlines = source.get_airlines()

### View Sources Datasets in raw format

In [18]:
demog.show()

+----------------+--------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+------+
|            City|         State|Median Age|Male Population|Female Population|Total Population|Number of Veterans|Foreign-born|Average Household Size|State Code|                Race| Count|
+----------------+--------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+------+
|   Silver Spring|      Maryland|      33.8|          40601|            41862|           82463|              1562|       30908|                   2.6|        MD|  Hispanic or Latino| 25924|
|          Quincy| Massachusetts|      41.0|          44129|            49500|           93629|              4147|       32935|                  2.39|        MA|               White| 58723|
|          Hoover|       Alabama|      38.5|      

In [19]:
airport.show()

+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|  00A|     heliport|   Total Rf Heliport|          11|       NA|         US|     US-PA|    Bensalem|     00A|     null|       00A|-74.9336013793945...|
| 00AA|small_airport|Aero B Ranch Airport|        3435|       NA|         US|     US-KS|       Leoti|    00AA|     null|      00AA|-101.473911, 38.7...|
| 00AK|small_airport|        Lowell Field|         450|       NA|         US|     US-AK|Anchor Point|    00AK|     null|      00AK|-151.695999146, 5...|
| 00AL|small_airport|        Epps Airpark|         820|       NA|         US|     

In [20]:
sas_data.show()

+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|    cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|        admnum|fltno|visatype|
+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|5748517.0|2016.0|   4.0| 245.0| 438.0|    LOS|20574.0|    1.0|     CA|20582.0|  40.0|    1.0|  1.0|20160430|     SYD| null|      G|      O|   null|      M| 1976.0|10292016|     F|  null|     QF|9.495387003E10|00011|      B1|
|5748518.0|2016.0|   4.0| 245.0| 438.0|    LOS|20574.0|    1.0|     NV|20591.0|  32.0|    1.0|  

In [21]:
countries.show()

+----+--------------------+
|code|        country_name|
+----+--------------------+
| 582|MEXICO Air Sea, a...|
| 236|         AFGHANISTAN|
| 101|             ALBANIA|
| 316|             ALGERIA|
| 102|             ANDORRA|
| 324|              ANGOLA|
| 529|            ANGUILLA|
| 518|     ANTIGUA-BARBUDA|
| 687|           ARGENTINA|
| 151|             ARMENIA|
| 532|               ARUBA|
| 438|           AUSTRALIA|
| 103|             AUSTRIA|
| 152|          AZERBAIJAN|
| 512|             BAHAMAS|
| 298|             BAHRAIN|
| 274|          BANGLADESH|
| 513|            BARBADOS|
| 104|             BELGIUM|
| 581|              BELIZE|
+----+--------------------+
only showing top 20 rows



In [22]:
visa.show()

+---------+--------+
|visa_code|    visa|
+---------+--------+
|        1|Business|
|        2|Pleasure|
|        3| Student|
+---------+--------+



In [23]:
mode.show()

+--------+----------------+
|cod_mode|       mode_name|
+--------+----------------+
|     1.0|             Air|
|     2.0|             Sea|
|     3.0|            Land|
|     9.0|Not reportedmode|
+--------+----------------+



In [24]:
airlines.show()

+----------+--------------------+-----+----+----+--------------+--------------+------+
|Airline_ID|                Name|Alias|IATA|ICAO|      Callsign|       Country|Active|
+----------+--------------------+-----+----+----+--------------+--------------+------+
|        -1|             Unknown|   \N|   -| N/A|            \N|            \N|     Y|
|         1|      Private flight|   \N|   -| N/A|          null|          null|     Y|
|         2|         135 Airways|   \N|null| GNL|       GENERAL| United States|     N|
|         3|       1Time Airline|   \N|  1T| RNX|       NEXTIME|  South Africa|     Y|
|         4|2 Sqn No 1 Elemen...|   \N|null| WYT|          null|United Kingdom|     N|
|         5|     213 Flight Unit|   \N|null| TFU|          null|        Russia|     N|
|         6|223 Flight Unit S...|   \N|null| CHD|CHKALOVSK-AVIA|        Russia|     N|
|         7|   224th Flight Unit|   \N|null| TTF|    CARGO UNIT|        Russia|     N|
|         8|         247 Jet Ltd|   \N|null
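
The raw airlines output above still shows the literal marker `\N` in several columns (for example `Alias`), which the file apparently uses for missing values. Below is a minimal sketch of how such markers could be turned into real nulls. It uses the standard `csv` module on a made-up sample line rather than the real `airlines.dat`, and the helper name `parse_airline_rows` is hypothetical.

```python
import csv
import io

NULL_MARKER = r"\N"  # assumption: missing values are written as a literal backslash-N


def parse_airline_rows(text):
    """Parse headerless CSV text and replace the null marker with None (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    return [[None if field == NULL_MARKER else field for field in row] for row in reader]


# Made-up sample line in the same column order as the schema above.
sample = '3,"1Time Airline",\\N,"1T","RNX","NEXTIME","South Africa","Y"\n'
rows = parse_airline_rows(sample)
print(rows[0])
```

In Spark itself, the CSV reader's `nullValue` option serves the same purpose when the file is loaded.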

## Clean the data

In [38]:
class Cleaner:
    """
    Clean the source datasets
    """


    @staticmethod
    def get_cities_demographics(demographics):
        """
        Clean demographics dataset: group by city and state, pivot Race into separate columns
        and fill null values with 0.
        :param demographics: demographics dataset
        :return: demographics dataset cleaned
        """
        pivot = demographics.groupBy(
            col("City"), col("State"), col("Median Age"), col("Male Population"),
            col("Female Population"), col("Total Population"), col("Number of Veterans"),
            col("Foreign-born"), col("Average Household Size"), col("State Code")
        ).pivot("Race").agg(sum("count").cast("integer")) \
            .fillna({"American Indian and Alaska Native": 0,
                     "Asian": 0,
                     "Black or African-American": 0,
                     "Hispanic or Latino": 0,
                     "White": 0})

        return pivot

    @staticmethod
    def get_airports(airports):
        """
        Clean airports dataset: keep only US airports and discard anything that is not an airport.
        Extract the state from the ISO region and cast the elevation in feet to float.
        :param airports: airports dataframe
        :return: airports dataframe cleaned
        """
        airports = airports \
            .where(
            (col("iso_country") == "US") & (col("type").isin("large_airport", "medium_airport", "small_airport"))) \
            .withColumn("iso_region", substring(col("iso_region"), 4, 2)) \
            .withColumn("elevation_ft", col("elevation_ft").cast("float"))

        return airports

    @staticmethod
    def get_inmigration(inmigration):
        """
        Clean the immigration dataset. Rename columns to understandable names, convert the SAS numeric
        dates to proper dates and select only the relevant columns.
        :param inmigration: immigration dataset
        :return: cleaned immigration dataset
        """
        inmigration = inmigration \
            .withColumn("cic_id", col("cicid").cast("integer")) \
            .drop("cicid") \
            .withColumnRenamed("i94addr", "cod_state") \
            .withColumnRenamed("i94port", "cod_port") \
            .withColumn("cod_visa", col("i94visa").cast("integer")) \
            .drop("i94visa") \
            .withColumn("cod_mode", col("i94mode").cast("integer")) \
            .drop("i94mode") \
            .withColumn("cod_country_origin", col("i94res").cast("integer")) \
            .drop("i94res") \
            .withColumn("cod_country_cit", col("i94cit").cast("integer")) \
            .drop("i94cit") \
            .withColumn("year", col("i94yr").cast("integer")) \
            .drop("i94yr") \
            .withColumn("month", col("i94mon").cast("integer")) \
            .drop("i94mon") \
            .withColumn("bird_year", col("biryear").cast("integer")) \
            .drop("biryear") \
            .withColumn("age", col("i94bir").cast("integer")) \
            .drop("i94bir") \
            .withColumn("counter", col("count").cast("integer")) \
            .drop("count") \
            .withColumn("arr_date", col("arrdate").cast("integer")) \
            .drop("arrdate") \
            .withColumn("dep_date", col("depdate").cast("integer")) \
            .drop("depdate") \
            .withColumn("data_base_sas", to_date(lit("01/01/1960"), "MM/dd/yyyy")) \
            .withColumn("arrival_date", expr("date_add(data_base_sas, arr_date)")) \
            .withColumn("departure_date", expr("date_add(data_base_sas, dep_date)")) \
            .drop("data_base_sas", "arr_date", "dep_date")

        return inmigration.select(
            col("cic_id"), col("cod_port"), col("cod_state"), col("visapost"), col("matflag"),
            col("dtaddto"), col("gender"), col("airline"), col("admnum"), col("fltno"),
            col("visatype"), col("cod_visa"), col("cod_mode"), col("cod_country_origin"),
            col("cod_country_cit"), col("year"), col("month"), col("bird_year"), col("age"),
            col("counter"), col("arrival_date"), col("departure_date"))

    @staticmethod
    def get_countries(countries):
        """
        Clean countries dataset.
        :param countries: countries dataset
        :return: countries dataset cleaned
        """
        country = countries \
            .withColumnRenamed("code", "cod_country")
        return country

    @staticmethod
    def get_visa(visa):
        """
        Clean visa dataset. 
        :param visa: visa dataset
        :return: visa dataset cleaned
        """
        visa = visa \
            .withColumnRenamed("visa_code", "cod_visa")
        return visa

    @staticmethod
    def get_mode(mode):
        """
        Clean mode dataset
        :param mode: mode dataset
        :return: mode dataset cleaned
        """
        modes = mode \
            .withColumn("cod_mode", col("cod_mode").cast("integer")) \
            .withColumnRenamed(" mode_name", "mode_name")
        return modes

    @staticmethod
    def get_airlines(airlines):
        """
        Clean airlines dataset: keep only airlines with an IATA code, drop the placeholder rows
        (Airline_ID <= 1) and drop the Alias column.
        :param airlines: airlines dataset
        :return: cleaned airlines dataset
        """
        airlines = airlines \
            .where((col("IATA").isNotNull()) & (col("Airline_ID") > 1)) \
            .drop("Alias")

        return airlines

In [27]:
demog_clean = Cleaner.get_cities_demographics(demog)
demog_clean.show()

+---------------+--------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+---------------------------------+-----+-------------------------+------------------+------+
|           City|         State|Median Age|Male Population|Female Population|Total Population|Number of Veterans|Foreign-born|Average Household Size|State Code|American Indian and Alaska Native|Asian|Black or African-American|Hispanic or Latino| White|
+---------------+--------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+---------------------------------+-----+-------------------------+------------------+------+
|         Skokie|      Illinois|      43.4|          31382|            33437|           64819|              1066|       27424|                  2.78|        IL|                                0|20272|                     4937|              6
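
The pivot in `get_cities_demographics` turns one row per city and race into one row per city with a column per race, with 0 where a race has no row. The idea can be sketched in plain Python as follows. This only illustrates the reshaping and is not the Spark implementation; the rows, the shortened race list and the helper name `pivot_by_race` are all made up.

```python
from collections import defaultdict

RACES = ["Asian", "White"]  # assumption: shortened race list, for illustration only


def pivot_by_race(rows):
    """Turn (city, race, count) rows into one dict per city with one entry per race."""
    # Every city starts with all known races set to 0, which mirrors the fillna step.
    table = defaultdict(lambda: {race: 0 for race in RACES})
    for city, race, count in rows:
        table[city][race] = table[city].get(race, 0) + count
    return dict(table)


# Made-up rows, not taken from the dataset.
rows = [("A", "Asian", 10), ("A", "White", 30), ("B", "White", 5)]
pivoted = pivot_by_race(rows)
print(pivoted)
```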

In [28]:
airport_clean = Cleaner.get_airports(airport)
airport_clean.show()

+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
| 00AA|small_airport|Aero B Ranch Airport|      3435.0|       NA|         US|        KS|       Leoti|    00AA|     null|      00AA|-101.473911, 38.7...|
| 00AK|small_airport|        Lowell Field|       450.0|       NA|         US|        AK|Anchor Point|    00AK|     null|      00AK|-151.695999146, 5...|
| 00AL|small_airport|        Epps Airpark|       820.0|       NA|         US|        AL|     Harvest|    00AL|     null|      00AL|-86.7703018188476...|
| 00AS|small_airport|      Fulton Airport|      1100.0|       NA|         US|     
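
The `iso_region` column changes from values such as `US-KS` to `KS` because Spark's `substring` takes a 1-based start position and a length. A small plain-Python sketch of the same indexing, where `spark_like_substring` is a hypothetical helper that only covers positive positions:

```python
def spark_like_substring(value, pos, length):
    """Mimic a 1-based substring(pos, length) for positive `pos` (hypothetical helper)."""
    start = pos - 1  # convert the 1-based position to a 0-based Python index
    return value[start:start + length]


print(spark_like_substring("US-KS", 4, 2))
```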

In [33]:
!cp -r /content/Udacity-Data-Engineer-nanodegree/05-capstone-project/sources/sas_data .

In [39]:
inmigrant_clean = Cleaner.get_inmigration(sas_data)
inmigrant_clean.show()

+-------+--------+---------+--------+-------+--------+------+-------+--------------+-----+--------+--------+--------+------------------+---------------+----+-----+---------+---+-------+------------+--------------+
| cic_id|cod_port|cod_state|visapost|matflag| dtaddto|gender|airline|        admnum|fltno|visatype|cod_visa|cod_mode|cod_country_origin|cod_country_cit|year|month|bird_year|age|counter|arrival_date|departure_date|
+-------+--------+---------+--------+-------+--------+------+-------+--------------+-----+--------+--------+--------+------------------+---------------+----+-----+---------+---+-------+------------+--------------+
|5748517|     LOS|       CA|     SYD|      M|10292016|     F|     QF|9.495387003E10|00011|      B1|       1|       1|               438|            245|2016|    4|     1976| 40|      1|  2016-04-30|    2016-05-08|
|5748518|     LOS|       NV|     SYD|      M|10292016|     F|     VA|9.495562283E10|00007|      B1|       1|       1|               438|        
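
The `arrival_date` and `departure_date` columns come from SAS numeric dates, which count days since 1960-01-01. The conversion done in `get_inmigration` with `date_add` can be checked in plain Python with the values from the first row above. This is a sketch; `sas_to_date` is a hypothetical helper name.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS dates count days from this day


def sas_to_date(sas_days):
    """Convert a SAS numeric date to a datetime.date; None stays None (hypothetical helper)."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_to_date(20574.0))  # arrdate of the first row
print(sas_to_date(20582.0))  # depdate of the first row
```

The two printed dates are 2016-04-30 and 2016-05-08, which match the cleaned output.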

In [30]:
countries_clean = Cleaner.get_countries(countries)
countries_clean.show()

+-----------+--------------------+
|cod_country|        country_name|
+-----------+--------------------+
|        582|MEXICO Air Sea, a...|
|        236|         AFGHANISTAN|
|        101|             ALBANIA|
|        316|             ALGERIA|
|        102|             ANDORRA|
|        324|              ANGOLA|
|        529|            ANGUILLA|
|        518|     ANTIGUA-BARBUDA|
|        687|           ARGENTINA|
|        151|             ARMENIA|
|        532|               ARUBA|
|        438|           AUSTRALIA|
|        103|             AUSTRIA|
|        152|          AZERBAIJAN|
|        512|             BAHAMAS|
|        298|             BAHRAIN|
|        274|          BANGLADESH|
|        513|            BARBADOS|
|        104|             BELGIUM|
|        581|              BELIZE|
+-----------+--------------------+
only showing top 20 rows



In [40]:
visa_clean = Cleaner.get_visa(visa)
visa_clean.show()

+--------+--------+
|cod_visa|    visa|
+--------+--------+
|       1|Business|
|       2|Pleasure|
|       3| Student|
+--------+--------+



In [41]:
mode_clean = Cleaner.get_mode(mode)
mode_clean.show()

+--------+----------------+
|cod_mode|       mode_name|
+--------+----------------+
|       1|             Air|
|       2|             Sea|
|       3|            Land|
|       9|Not reportedmode|
+--------+----------------+



In [42]:
airlines_clean = Cleaner.get_airlines(airlines)
airlines_clean.show()

+----------+--------------------+----+----+------------+-----------------+------+
|Airline_ID|                Name|IATA|ICAO|    Callsign|          Country|Active|
+----------+--------------------+----+----+------------+-----------------+------+
|         3|       1Time Airline|  1T| RNX|     NEXTIME|     South Africa|     Y|
|        10|         40-Mile Air|  Q5| MLA|    MILE-AIR|    United States|     Y|
|        13|    Ansett Australia|  AN| AAA|      ANSETT|        Australia|     Y|
|        14|Abacus International|  1B|null|        null|        Singapore|     Y|
|        15|     Abelag Aviation|  W9| AAB|         ABG|          Belgium|     N|
|        21|          Aigle Azur|  ZI| AAF|  AIGLE AZUR|           France|     Y|
|        22|      Aloha Airlines|  AQ| AAH|       ALOHA|    United States|     Y|
|        24|   American Airlines|  AA| AAL|    AMERICAN|    United States|     Y|
|        28|     Asiana Airlines|  OZ| AAR|      ASIANA|Republic of Korea|     Y|
|        29|    

## Define the Data Model

### Mapping Out Data Pipelines

In [43]:
class Transformer:
    """
    Apply the transformations needed to create the model.
    """

    @staticmethod
    def transform_demographics(demographics):
        """
        Transform demographics dataset: group by state and calculate the totals and ratios
        for every race in every state.
        :param demographics: demographics dataset
        :return: demographics dataset transformed
        """
        demo = demographics \
            .groupBy(col("State Code").alias("State_code"), col("State")).agg(
            sum("Total Population").alias("Total_Population")\
            , sum("Male Population").alias("Male_Population"), sum("Female Population").alias("Female_Population")\
            , sum("American Indian and Alaska Native").alias("American_Indian_and_Alaska_Native")\
            , sum("Asian").alias("Asian"), sum("Black or African-American").alias("Black_or_African-American")\
            , sum("Hispanic or Latino").alias("Hispanic_or_Latino")\
            , sum("White").alias("White")) \
            .withColumn("Male_Population_Ratio", round(col("Male_Population") / col("Total_Population"), 2))\
            .withColumn("Female_Population_Ratio", round(col("Female_Population") / col("Total_Population"), 2))\
            .withColumn("American_Indian_and_Alaska_Native_Ratio",
                        round(col("American_Indian_and_Alaska_Native") / col("Total_Population"), 2))\
            .withColumn("Asian_Ratio", round(col("Asian") / col("Total_Population"), 2))\
            .withColumn("Black_or_African-American_Ratio",
                        round(col("Black_or_African-American") / col("Total_Population"), 2))\
            .withColumn("Hispanic_or_Latino_Ratio", round(col("Hispanic_or_Latino") / col("Total_Population"), 2))\
            .withColumn("White_Ratio", round(col("White") / col("Total_Population"), 2))

        return demo

    @staticmethod
    def transform_inmigrants(inmigrants):
        """
        Transform immigration dataset: split the arrival date into separate columns (year, month, day)
        so the dataset can be partitioned by them.
        :param inmigrants: immigration dataset
        :return: immigration dataset transformed
        """
        inmigrants = inmigrants \
            .withColumn("arrival_date-split", split(col("arrival_date"), "-")) \
            .withColumn("arrival_year", col("arrival_date-split")[0]) \
            .withColumn("arrival_month", col("arrival_date-split")[1]) \
            .withColumn("arrival_day", col("arrival_date-split")[2]) \
            .drop("arrival_date-split")

        return inmigrants

In [44]:
demog_transformer = Transformer.transform_demographics(demog_clean)
demog_transformer.show()

+----------+--------------------+----------------+---------------+-----------------+---------------------------------+-------+-------------------------+------------------+--------+---------------------+-----------------------+---------------------------------------+-----------+-------------------------------+------------------------+-----------+
|State_code|               State|Total_Population|Male_Population|Female_Population|American_Indian_and_Alaska_Native|  Asian|Black_or_African-American|Hispanic_or_Latino|   White|Male_Population_Ratio|Female_Population_Ratio|American_Indian_and_Alaska_Native_Ratio|Asian_Ratio|Black_or_African-American_Ratio|Hispanic_or_Latino_Ratio|White_Ratio|
+----------+--------------------+----------------+---------------+-----------------+---------------------------------+-------+-------------------------+------------------+--------+---------------------+-----------------------+---------------------------------------+-----------+--------------------------
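
Each `*_Ratio` column is the state total for that group divided by the state's `Total_Population`, rounded to two decimals. A minimal plain-Python sketch with made-up numbers follows. `population_ratios` is a hypothetical helper, and returning `None` for a zero total is an assumption of this sketch.

```python
def population_ratios(total, groups):
    """Return {name + '_Ratio': share of total} rounded to 2 decimals (hypothetical helper)."""
    if not total:
        # Assumption for this sketch: no ratio is defined without a population.
        return {name + "_Ratio": None for name in groups}
    return {name + "_Ratio": round(count / total, 2) for name, count in groups.items()}


# Made-up state totals, not taken from the dataset.
ratios = population_ratios(1000, {"Male_Population": 480, "Female_Population": 520})
print(ratios)
```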

In [45]:
inmigrant_transformer = Transformer.transform_inmigrants(inmigrant_clean)
inmigrant_transformer.show()

+-------+--------+---------+--------+-------+--------+------+-------+--------------+-----+--------+--------+--------+------------------+---------------+----+-----+---------+---+-------+------------+--------------+------------+-------------+-----------+
| cic_id|cod_port|cod_state|visapost|matflag| dtaddto|gender|airline|        admnum|fltno|visatype|cod_visa|cod_mode|cod_country_origin|cod_country_cit|year|month|bird_year|age|counter|arrival_date|departure_date|arrival_year|arrival_month|arrival_day|
+-------+--------+---------+--------+-------+--------+------+-------+--------------+-----+--------+--------+--------+------------------+---------------+----+-----+---------+---+-------+------------+--------------+------------+-------------+-----------+
|5748517|     LOS|       CA|     SYD|      M|10292016|     F|     QF|9.495387003E10|00011|      B1|       1|       1|               438|            245|2016|    4|     1976| 40|      1|  2016-04-30|    2016-05-08|        2016|           04| 
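
The three arrival columns are string pieces of the ISO date, so month and day keep their leading zero (`04`, not `4`). They are later used as partition columns when the fact table is written, which produces nested `column=value` directories. A sketch of the sub-path one row would land in, where `partition_path` is a hypothetical helper:

```python
def partition_path(arrival_date):
    """Build the Hive-style partition sub-path for an ISO date string (hypothetical helper)."""
    year, month, day = arrival_date.split("-")
    return f"arrival_year={year}/arrival_month={month}/arrival_day={day}"


print(partition_path("2016-04-30"))
```

For the first row above this gives `arrival_year=2016/arrival_month=04/arrival_day=30`.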

### Configuration to write the Model

In [46]:
paths_write = {
    "demographics" : "./model/demographics.parquet",
    "airports" :  "./model/airports.parquet",
    "airlines" : "./model/airlines.parquet",
    "countries" : "./model/countries.parquet",
    "visa" : "./model/visa.parquet",
    "mode" : "./model/mode.parquet",
    "facts" : "./model/facts_inmigration.parquet"
}

In [47]:
class Modelizer:
    """
    Builds the data warehouse (star schema) from the datasets, creating the fact table and the dimension tables.
    """

    def __init__(self, spark, paths):
        self.spark = spark
        self.paths = paths

    def _modelize_demographics(self, demographics):
        """
        Create the demographics dimension table in parquet.
        :param demographics: demographics dataset.
        """
        demographics.write.mode('overwrite').parquet(self.paths["demographics"])

    def _modelize_airports(self, airports):
        """
        Create the airports dimension table in parquet.
        :param airports: airports dataset
        """
        airports.write.mode('overwrite').parquet(self.paths["airports"])

    def _modelize_airlines(self, airlines):
        """
        Create the airlines dimension table in parquet.
        :param airlines: airlines dataset
        """
        airlines.write.mode('overwrite').parquet(self.paths["airlines"])

    def _modelize_countries(self, countries):
        """
        Create countries dimension table in parquet
        :param countries: countries dataset
        """
        countries.write.mode('overwrite').parquet(self.paths["countries"])

    def _modelize_visa(self, visa):
        """
        Create visa dimension table in parquet
        :param visa: visa dataset
        """
        visa.write.mode('overwrite').parquet(self.paths["visa"])

    def _modelize_mode(self, mode):
        """
        Create modes dimension table in parquet
        :param mode: modes dataset
        """
        mode.write.mode('overwrite').parquet(self.paths["mode"])

    def _modelize_facts(self, facts):
        """
        Create the facts table from immigration in parquet, partitioned by arrival_year, arrival_month and arrival_day
        :param facts: inmigration dataset
        """
        facts.write.partitionBy("arrival_year", "arrival_month", "arrival_day").mode('overwrite').parquet(
            self.paths["facts"])

    def modelize(self, facts, dim_demographics, dim_airports, dim_airlines, dim_countries, dim_visa, dim_mode):
        """
        Create the star schema for the data warehouse
        :param facts: facts table, immigration dataset
        :param dim_demographics: dimension demographics
        :param dim_airports: dimension airports
        :param dim_airlines: dimension airlines
        :param dim_countries: dimension countries
        :param dim_visa: dimension visa
        :param dim_mode: dimension mode
        """
        facts = facts \
            .join(dim_demographics, facts["cod_state"] == dim_demographics["State_Code"], "left_semi") \
            .join(dim_airports, facts["cod_port"] == dim_airports["local_code"], "left_semi") \
            .join(dim_airlines, facts["airline"] == dim_airlines["IATA"], "left_semi") \
            .join(dim_countries, facts["cod_country_origin"] == dim_countries["cod_country"], "left_semi") \
            .join(dim_visa, facts["cod_visa"] == dim_visa["cod_visa"], "left_semi") \
            .join(dim_mode, facts["cod_mode"] == dim_mode["cod_mode"], "left_semi")

        self._modelize_demographics(dim_demographics)
        self._modelize_airports(dim_airports)
        self._modelize_airlines(dim_airlines)
        self._modelize_countries(dim_countries)
        self._modelize_visa(dim_visa)
        self._modelize_mode(dim_mode)

        self._modelize_facts(facts)

In [48]:
model = Modelizer(spark, paths_write)

In [49]:
model.modelize(inmigrant_transformer, demog_transformer, airport_clean, airlines_clean, countries_clean, visa_clean, mode_clean)
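
The `left_semi` joins in `modelize` work as filters. A fact row is kept only if its key has a match in the dimension, and no dimension columns are added. Because the six joins are chained, only rows that match all six dimensions reach the fact table. The sketch below shows the semantics in plain Python on made-up rows, together with the complementary `left_anti` join that returns the rows without a match. Both helper functions are hypothetical.

```python
def left_semi(facts, dim_keys, key):
    """Keep fact rows whose `key` value exists in the dimension keys."""
    return [row for row in facts if row[key] in dim_keys]


def left_anti(facts, dim_keys, key):
    """Keep fact rows whose `key` value has no match in the dimension keys."""
    return [row for row in facts if row[key] not in dim_keys]


# Made-up fact rows; the visa dimension has the codes 1, 2 and 3.
facts = [
    {"cic_id": 1, "cod_visa": 1},
    {"cic_id": 2, "cod_visa": 7},     # unknown code
    {"cic_id": 3, "cod_visa": None},  # missing code never matches
]
visa_keys = {1, 2, 3}

kept = left_semi(facts, visa_keys, "cod_visa")
dropped = left_anti(facts, visa_keys, "cod_visa")
print([row["cic_id"] for row in kept], [row["cic_id"] for row in dropped])
```

Comparing row counts before and after the joins is a cheap way to see how many facts this filtering removes.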

### Data Quality Checks

In [50]:
class Validator:
    """
    Validates and checks the model and its data.
    """

    def __init__(self, spark, paths):
        self.spark = spark
        self.paths = paths

    def _get_demographics(self):
        """
        Get demographics dimension
        :return: demographics dimension
        """
        return self.spark.read.parquet(self.paths["demographics"])

    def _get_airports(self):
        """
        Get airports dimension
        :return: airports dimension
        """
        return self.spark.read.parquet(self.paths["airports"])

    def _get_airlines(self):
        """
        Get airlines dimension
        :return: airlines dimension
        """
        return self.spark.read.parquet(self.paths["airlines"])

    def _get_countries(self):
        """
        Get countries dimension
        :return: countries dimension
        """
        return self.spark.read.parquet(self.paths["countries"])

    def _get_visa(self):
        """
        Get visa dimension
        :return: visa dimension
        """
        return self.spark.read.parquet(self.paths["visa"])

    def _get_mode(self):
        """
        Get mode dimension
        :return: mode dimension
        """
        return self.spark.read.parquet(self.paths["mode"])

    def get_facts(self):
        """
        Get facts table
        :return: facts table
        """
        return self.spark.read.parquet(self.paths["facts"])

    def get_dimensions(self):
        """
        Get all dimensions of the model
        :return: all dimensions
        """
        return self._get_demographics(), self._get_airports(), self._get_airlines() \
            , self._get_countries(), self._get_visa(), self._get_mode()

    def exists_rows(self, dataframe):
        """
        Checks if there is any data in a dataframe
        :param dataframe: dataframe
        :return: True if the dataframe has at least one row, False otherwise
        """
        return dataframe.count() > 0

    def check_integrity(self, fact, dim_demographics, dim_airports, dim_airlines, dim_countries, dim_visa, dim_mode):
        """
        Check the integrity of the model: every key value in the facts table must exist in the corresponding dimension.
        :param fact: fact table
        :param dim_demographics: demographics dimension
        :param dim_airports: airports dimension
        :param dim_airlines: airlines dimension
        :param dim_countries: countries dimension
        :param dim_visa: visa dimension
        :param dim_mode: mode dimension
        :return: True if every integrity check passes, False otherwise.
        """
        integrity_demo = fact.select(col("cod_state")).distinct() \
                             .join(dim_demographics, fact["cod_state"] == dim_demographics["State_Code"], "left_anti") \
                             .count() == 0

        integrity_airports = fact.select(col("cod_port")).distinct() \
                                 .join(dim_airports, fact["cod_port"] == dim_airports["local_code"], "left_anti") \
                                 .count() == 0

        integrity_airlines = fact.select(col("airline")).distinct() \
                                 .join(dim_airlines, fact["airline"] == dim_airlines["IATA"], "left_anti") \
                                 .count() == 0

        integrity_countries = fact.select(col("cod_country_origin")).distinct() \
                                  .join(dim_countries, fact["cod_country_origin"] == dim_countries["cod_country"],
                                        "left_anti") \
                                  .count() == 0

        integrity_visa = fact.select(col("cod_visa")).distinct() \
                             .join(dim_visa, fact["cod_visa"] == dim_visa["cod_visa"], "left_anti") \
                             .count() == 0

        integrity_mode = fact.select(col("cod_mode")).distinct() \
                             .join(dim_mode, fact["cod_mode"] == dim_mode["cod_mode"], "left_anti") \
                             .count() == 0

        return integrity_demo and integrity_airports and integrity_airlines and integrity_countries \
               and integrity_visa and integrity_mode

In [51]:
validator = Validator(spark, paths_write)
facts = validator.get_facts()
dim_demographics, dim_airports, dim_airlines, dim_countries, dim_get_visa, dim_get_mode = validator.get_dimensions()

In [52]:
validator.exists_rows(dim_demographics)

True

In [53]:
validator.exists_rows(dim_airports)

True

In [54]:
validator.exists_rows(dim_airlines)

True

In [55]:
validator.exists_rows(dim_countries)

True

In [56]:
validator.exists_rows(dim_get_visa)

True

In [57]:
validator.exists_rows(dim_get_mode)

True

In [58]:
validator.exists_rows(facts)

True

In [59]:
validator.check_integrity(facts, dim_demographics, dim_airports, dim_airlines, dim_countries, dim_get_visa, dim_get_mode)

True
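
`check_integrity` returns a single boolean, so a failure does not say which keys are orphaned. The `left_anti` join it relies on keeps the fact keys that have no match in the dimension. The sketch below reproduces that logic on plain Python collections rather than Spark DataFrames; `orphan_keys` is a hypothetical helper and the sample codes are made up for illustration.

```python
def orphan_keys(fact_keys, dim_keys):
    """Return the distinct fact keys with no matching dimension key (anti-join semantics)."""
    dim_lookup = set(dim_keys)
    # Keep each distinct fact key that is absent from the dimension, sorted for stable output.
    return sorted({key for key in fact_keys if key not in dim_lookup})


# Made-up sample data: mode code 9 has no row in the dimension.
fact_cod_mode = [1, 1, 2, 3, 9]
dim_cod_mode = [1, 2, 3]

missing = orphan_keys(fact_cod_mode, dim_cod_mode)
print(missing)            # [9]
print(len(missing) == 0)  # False: this integrity check would fail
```

In Spark, the equivalent diagnostic would be to display the result of the `left_anti` join instead of only counting it. A null fact key never satisfies the equality condition, so Spark also reports it as unmatched.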

## Data Dictionary

**Data Dictionary Dimension Tables**

### Airports Data
 * ident: string (nullable = true) - Airport id
 * type: string (nullable = true) - size of airport
 * name: string (nullable = true) - name
 * elevation_ft: float (nullable = true) - elevation in feet
 * continent: string (nullable = true) - continent
 * iso_country: string (nullable = true) - country (ISO 3166-1 alpha-2 code)
 * iso_region: string (nullable = true) - region (ISO 3166-2 code)
 * municipality: string (nullable = true) - municipality
 * gps_code: string (nullable = true) - gps code
 * iata_code: string (nullable = true) - IATA code
 * local_code: string (nullable = true) - Local code
 * coordinates: string (nullable = true) - coordinates

### U.S. Demographic by State
 * State: string (nullable = true) - Full state name
 * state_code: string (nullable = true) - State code
 * Total_Population: double (nullable = true) - Total population of the state
 * Male_Population: double (nullable = true) - Total Male population per state
 * Female_Population: double (nullable = true) - Total Female population per state
 * American_Indian_and_Alaska_Native: long (nullable = true) - Total American Indian and Alaska Native population per state
 * Asian: long (nullable = true) - Total Asian population per state
 * Black_or_African-American: long (nullable = true) - Total Black or African-American population per state
 * Hispanic_or_Latino: long (nullable = true) - Total Hispanic or Latino population per state 
 * White: long (nullable = true) - Total White population per state 
 * Male_Population_Ratio: double (nullable = true) - Male population ratio per state
 * Female_Population_Ratio: double (nullable = true) - Female population ratio per state
 * American_Indian_and_Alaska_Native_Ratio: double (nullable = true) - American Indian and Alaska Native population ratio per state
 * Asian_Ratio: double (nullable = true) - Asian population ratio per state
 * Black_or_African-American_Ratio: double (nullable = true) - Black or African-American population ratio per state
 * Hispanic_or_Latino_Ratio: double (nullable = true) - Hispanic or Latino population ratio per state 
 * White_Ratio: double (nullable = true) - White population ratio per state 

### Airlines
 * Airline_ID: integer (nullable = true) - Airline id
 * Name: string (nullable = true) - Airline name
 * IATA: string (nullable = true) - IATA code
 * ICAO: string (nullable = true) - ICAO code
 * Callsign: string (nullable = true) - Airline call sign
 * Country: string (nullable = true) - Country
 * Active: string (nullable = true) - Whether the airline is active

### Countries
 * cod_country: long (nullable = true) - Country code
 * country_name: string (nullable = true) - Country name

### Visas
 * cod_visa: string (nullable = true) - visa code
 * visa: string (nullable = true) - visa description

### Mode of Access
 * cod_mode: integer (nullable = true) - Mode code
 * mode_name: string (nullable = true) - Mode description

### Fact Table (Immigration Registry)

 * cic_id: integer (nullable = true) - CIC id
 * cod_port: string (nullable = true) - Airport code
 * cod_state: string (nullable = true) - US State code
 * visapost: string (nullable = true) - Department of State where the visa was issued
 * matflag: string (nullable = true) - Match flag - Match of arrival and departure records
 * dtaddto: string (nullable = true) - Character Date Field - Date to which admitted to U.S. (allowed to stay until)
 * gender: string (nullable = true) - Gender
 * airline: string (nullable = true) - Airline code
 * admnum: double (nullable = true) - Admission Number
 * fltno: string (nullable = true) - Flight number of Airline used to arrive in U.S.
 * visatype: string (nullable = true) - Class of admission legally admitting the non-immigrant to temporarily stay in the U.S.
 * cod_visa: integer (nullable = true) - Visa code
 * cod_mode: integer (nullable = true) - Mode code
 * cod_country_origin: integer (nullable = true) - Country of origin code
 * cod_country_cit: integer (nullable = true) - Country of citizenship code
 * year: integer (nullable = true) - Year
 * month: integer (nullable = true) - Month
 * bird_year: integer (nullable = true) - Year of Birth
 * age: integer (nullable = true) - Age
 * counter: integer (nullable = true) - Used for summary statistics
 * arrival_date: date (nullable = true) - Arrival date
 * departure_date: date (nullable = true) - Departure Date
 * arrival_year: integer (nullable = true) - Arrival year
 * arrival_month: integer (nullable = true) - Arrival month
 * arrival_day: integer (nullable = true) - Arrival day of month
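
The dictionary above can double as an executable check. The following is a minimal sketch, assuming the `(name, type)` pairs come from PySpark's `DataFrame.dtypes`, which reports `integer` as `int` and `long` as `bigint`. `schema_mismatches` is a hypothetical helper, the sample dtypes are made up, and only a few fact-table columns are listed.

```python
def schema_mismatches(expected, actual_dtypes):
    """Compare expected column types with (name, type) pairs such as DataFrame.dtypes.

    Returns a dict mapping each problem column to (expected_type, actual_type);
    actual_type is None when the column is missing.
    """
    actual = dict(actual_dtypes)
    return {
        name: (dtype, actual.get(name))
        for name, dtype in expected.items()
        if actual.get(name) != dtype
    }


# A few fact-table columns taken from the data dictionary.
expected_facts = {"cic_id": "int", "cod_port": "string", "admnum": "double", "arrival_date": "date"}

# Made-up dtypes for illustration; in the notebook this would be facts.dtypes.
sample_dtypes = [("cic_id", "int"), ("cod_port", "string"), ("admnum", "string")]

print(schema_mismatches(expected_facts, sample_dtypes))
# {'admnum': ('double', 'string'), 'arrival_date': ('date', None)}
```

An empty result would mean the listed columns match the dictionary. Columns that exist in the table but not in `expected` are ignored by this sketch.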