## Exercise 1: Data Modeling

Using draw.io, create a data model. Your data model MUST meet the following requirements:

1. Contain a _tickets_ fact table
1. Contain the following dimensions: _airlines_, _airports_, and _passengers_
1. Develop _passengers_ as an SCD Type2 dimension:
    - Passenger email can be used as the natural key
    - Be sure to add a surrogate key and effective start/end dates
    - You can optionally add an active column
1. IATA codes can be used as the primary key for both _airlines_ and _airports_
1. Use the e-ticket number as the primary key for the _tickets_ fact
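
To make the SCD Type 2 requirement concrete, here is a minimal sketch of what happens to the _passengers_ dimension when a tracked column changes. It is an illustration under assumptions, not part of the required solution: the helper `apply_address_change`, the column names `passenger_id`, `effective_start_date`, `effective_end_date`, and `active`, and the sample passenger are all hypothetical.

```python
import uuid

import pandas as pd


def apply_address_change(dim: pd.DataFrame, email: str, new_street: str,
                         change_date: str) -> pd.DataFrame:
    """Expire the current row for `email` and append a new version of it."""
    dim = dim.copy()  # leave the caller's frame untouched
    current = (dim["passenger.email"] == email) & (dim["active"] == "Y")

    # Start the new version from a copy of the row being replaced.
    new_row = dim.loc[current].iloc[0].copy()

    # Close out the old version.
    dim.loc[current, "effective_end_date"] = change_date
    dim.loc[current, "active"] = "N"

    # The new version gets its own surrogate key; the natural key (email) stays the same.
    new_row["passenger_id"] = str(uuid.uuid4())
    new_row["passenger.street"] = new_street
    new_row["effective_start_date"] = change_date
    new_row["effective_end_date"] = None
    new_row["active"] = "Y"

    return pd.concat([dim, new_row.to_frame().T], ignore_index=True)


# Made-up sample data for illustration only.
passengers = pd.DataFrame([{
    "passenger_id": str(uuid.uuid4()),
    "passenger.email": "jane.doe@example.com",
    "passenger.street": "1 Old Road",
    "effective_start_date": "2022-01-01",
    "effective_end_date": None,
    "active": "Y",
}])

updated = apply_address_change(
    passengers, "jane.doe@example.com", "2 New Street", "2022-06-01"
)
```

After the call, the dimension holds two rows for the same email: the expired version with an end date, and the active version with a fresh surrogate key. This is why the fact table should reference the surrogate key rather than the email.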

In [2]:
# Import the data from json file and explore the data

import os
import sys
import pandas as pd
import logging
from google.cloud import bigquery
from hashlib import md5
from typing import List
import json

# **** SETUP ****

# change to match your filesystem
DATA_DIR = "./data/air_travel/"
DEFAULT_RECEIPTS_FILE = os.path.join(DATA_DIR, "tickets.json")
PROJECT_NAME = "deb-01-371820"
DATASET_NAME = "air_travel"

# the source file is newline-delimited JSON: one ticket record per line
data = []
with open(DEFAULT_RECEIPTS_FILE, 'r') as f:
    for line in f:
        data.append(json.loads(line))

# flatten nested objects (airline, passenger, ...) into dotted column names
df = pd.json_normalize(data)

display(df.head(n=10))

Unnamed: 0,eticket_num,confirmation,ticket_date,price,seat,status,airline.name,airline.iata,airline.icao,airline.callsign,...,passenger.last_name,passenger.gender,passenger.birth_date,passenger.email,passenger.street,passenger.city,passenger.state,passenger.zip,origin,destination
0,498-938211-0795,ZVFDC4,2022-03-23,723.42,31I,active,China Eastern Airlines,MU,CES,CHINA EASTERN,...,Brown,M,1969-02-17,robert.brown.69@hotmail.com,5007 Thomas Way,Lake Hollystad,DC,20027,,
1,482-850738-6048,IL5GUI,2022-03-23,765.18,29B,active,Hawaiian Airlines,HA,HAL,HAWAIIAN,...,Kent,F,1998-08-05,laura.kent.98@hotmail.com,13991 Davis Village,North Catherineborough,PA,16516,,
2,275-207321-8092,CYEFBC,2022-03-21,753.89,26I,active,Wizz Air,W6,WZZ,WIZZ AIR,...,Tucker,F,1965-01-22,lisa.tucker.65@hotmail.com,04135 Marvin Via,North Kristabury,MA,1093,,
3,246-793315-3102,ZNGPC2,2022-03-22,793.89,15A,active,AirAsia,AK,AXM,ASIAN EXPRESS,...,Yates,NB,1975-03-31,matthew.yates.75@yahoo.com,76045 Samantha Road Suite 111,Lake Jeffrey,DE,19898,,
4,091-128904-1226,MGSBD9,2022-03-24,820.25,17F,active,Xiamen Airlines,MF,CXA,XIAMEN AIR,...,Villanueva,NB,1945-08-14,megan.villanueva.45@hotmail.com,848 Melissa Springs Suite 947,Kellerstad,TX,76177,,
5,115-196069-8963,XFYQC0,2022-03-23,892.69,18C,active,Air New Zealand,NZ,ANZ,NEW ZEALAND,...,Hall,NB,1944-08-31,sarah.hall.44@gmail.com,75420 Michael Mountains Suite 485,New Victoria,HI,96727,,
6,396-673460-1326,N5UOOZ,2022-03-23,889.53,3C,active,Jeju Air,7C,JJA,JEJU AIR,...,Thompson,M,1968-05-02,seth.thompson.68@yahoo.com,22455 Higgins Junction Apt. 042,New Keith,OR,97405,,
7,380-894599-8109,PAA19Y,2022-03-22,706.78,7D,active,American Airlines,AA,AAL,AMERICAN,...,Garcia,F,1950-02-12,jennifer.garcia.50@gmail.com,6607 Sharp Common,Chadstad,VA,22121,,
8,614-960971-2686,EF4BHJ,2022-03-23,486.4,24J,active,Juneyao Airlines,HO,DKH,JUNEYAO AIRLINES,...,Clark,F,1991-11-09,becky.clark.91@gmail.com,691 Jones Cliffs,Michaelburgh,TX,76003,,
9,481-321233-0702,FVM9EE,2022-03-23,855.93,16A,active,Royal Air Maroc,AT,RAM,ROYALAIR MAROC,...,Cook,M,1976-07-29,ronald.cook.76@hotmail.com,93328 Davis Island,Rodriguezside,MD,21408,,


Here is the Data Model that includes:
1. tickets fact table
2. airlines dimension table
3. airports dimension table
4. passengers dimension table

<img src="./imgs/air_travel_data_modeling.drawio.png" alt="data model" width="640" />

## Exercise 2: Data Loading and Normalization

Develop an ETL pipeline that loads our dimensions and facts from the source file. Your pipeline MUST meet the following requirements:

**General**:
- Load all dimensions in order: _airlines_, _airports_, and _passengers_
- Load the _tickets_ fact table after loading dimensions
- Your pipeline can drop/replace tables
- You can assume only inserts at this stage. No updates, deletes, or merges

**Airlines Dim:**
- Identify unique airlines
- Use IATA code as the dimension key

**Airports Dim:**
- Identify unique airports from both origin and destination fields
- Use IATA code as the dimension key
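
One way to collect airports from both ends of a trip is to stack the origin and destination columns and de-duplicate on the IATA code. The sketch below assumes the flattened ticket frame exposes columns such as `origin.iata` and `destination.iata`; the preview above elides those columns, so treat the source column names, the helper `build_airports_dim`, and the placeholder codes in the sample as assumptions.

```python
import pandas as pd


def build_airports_dim(tickets: pd.DataFrame, fields: list) -> pd.DataFrame:
    """Stack origin.* and destination.* columns and keep one row per IATA code."""
    parts = []
    for prefix in ("origin", "destination"):
        # Map e.g. "origin.iata" -> "airport.iata" so both sides share one schema.
        cols = {f"{prefix}.{field}": f"airport.{field}" for field in fields}
        parts.append(tickets[list(cols)].rename(columns=cols))

    airports = pd.concat(parts, ignore_index=True)
    return (
        airports.dropna(subset=["airport.iata"])         # a dimension key cannot be null
        .drop_duplicates(subset=["airport.iata"])        # one row per airport
        .sort_values("airport.iata")
        .reset_index(drop=True)
    )


# Made-up sample data; the codes are placeholders, not real airports.
sample = pd.DataFrame({
    "origin.iata": ["AAA", "BBB", "AAA"],
    "origin.name": ["Airport A", "Airport B", "Airport A"],
    "destination.iata": ["BBB", "CCC", "CCC"],
    "destination.name": ["Airport B", "Airport C", "Airport C"],
})

airports_dim = build_airports_dim(sample, ["iata", "name"])
```

An airport that only ever appears as a destination (`CCC` in the sample) still ends up in the dimension, which is the reason for reading both fields.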

**Passengers Dim:**
- Identify unique passengers
- Use the passenger email as the dimension natural key
- Generate UUIDs for the dimension surrogate keys
- Set the effective start date to any date. You can either use the ticket date, current date, or a fixed set date in the past
- Set the effective end date to None
- Optionally set your active flag to 'Y'
- Passenger address columns are considered SCD Type 2 columns
- All other columns are SCD Type 1
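
The initial passengers load described above could look roughly like this. It is a sketch under assumptions: the helper `build_passengers_dim`, the fixed start date `2020-01-01`, the `active` column name, and the sample rows are hypothetical, while the `passenger.*` column names follow the flattened ticket frame.

```python
import uuid

import pandas as pd


def build_passengers_dim(tickets: pd.DataFrame,
                         start_date: str = "2020-01-01") -> pd.DataFrame:
    """Build the first version of each passenger row for an insert-only load."""
    passenger_cols = [c for c in tickets.columns if c.startswith("passenger.")]

    # The email is the natural key, so keep one row per email.
    dim = (
        tickets[passenger_cols]
        .drop_duplicates(subset=["passenger.email"])
        .reset_index(drop=True)
    )

    # Surrogate key: one random UUID per dimension row.
    dim.insert(0, "passenger_id", [str(uuid.uuid4()) for _ in range(len(dim))])

    # SCD Type 2 bookkeeping columns for the initial load.
    dim["effective_start_date"] = start_date
    dim["effective_end_date"] = None
    dim["active"] = "Y"
    return dim


# Made-up sample data: two tickets for the same passenger plus one other passenger.
sample_tickets = pd.DataFrame({
    "eticket_num": ["000-000000-0001", "000-000000-0002", "000-000000-0003"],
    "passenger.email": ["jane.doe@example.com", "jane.doe@example.com",
                        "john.roe@example.com"],
    "passenger.city": ["Springfield", "Springfield", "Shelbyville"],
})

passengers_dim = build_passengers_dim(sample_tickets)
```

Because `uuid.uuid4()` is random, the surrogate keys differ on every run. That is acceptable while the pipeline drops and replaces its tables, but the tickets fact must then be rebuilt against the same run's dimension.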

**Tickets Fact:**
- Link to _airlines_ and _airports_ dimensions by their IATA codes. 
- You don't need a lookup for _airlines_ and _airports_ since we use their IATA as dimension keys
- Link to the _passengers_ dimension by its surrogate key
- You need to perform a lookup to the _passengers_ dimension
- Load all the tickets
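
The passenger lookup can be sketched as a left merge from the tickets onto the dimension's natural key. The helper `build_tickets_fact`, its `fact_cols` parameter, and the sample rows are hypothetical; the sketch also assumes the insert-only case, where each email has exactly one dimension row.

```python
import pandas as pd


def build_tickets_fact(tickets: pd.DataFrame, passengers_dim: pd.DataFrame,
                       fact_cols: list) -> pd.DataFrame:
    """Swap the passenger email for the dimension's surrogate key."""
    lookup = passengers_dim[["passenger.email", "passenger_id"]]

    # validate="many_to_one" raises if an email maps to more than one dimension row.
    fact = tickets.merge(lookup, on="passenger.email", how="left",
                         validate="many_to_one")

    # Every ticket must resolve to a passenger, otherwise the load is incomplete.
    if fact["passenger_id"].isna().any():
        raise ValueError("some tickets have no matching passenger row")

    return fact[fact_cols + ["passenger_id"]]


# Made-up sample data for illustration only.
sample_tickets = pd.DataFrame({
    "eticket_num": ["000-000000-0001", "000-000000-0002", "000-000000-0003"],
    "price": [100.0, 250.5, 80.0],
    "airline.iata": ["AA", "AA", "HA"],
    "passenger.email": ["jane.doe@example.com", "john.roe@example.com",
                        "jane.doe@example.com"],
})
sample_passengers = pd.DataFrame({
    "passenger_id": ["id-jane", "id-john"],
    "passenger.email": ["jane.doe@example.com", "john.roe@example.com"],
})

tickets_fact = build_tickets_fact(
    sample_tickets, sample_passengers, ["eticket_num", "price", "airline.iata"]
)
```

The airline and airport IATA codes pass straight through because they already are the dimension keys; only the passenger needs the lookup.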

<br><br>

In [4]:
DIMS_TABLE_METADATA = {
    'airlines': {
        'table_name': 'airlines',
        'schema': [
            # indexes are written only if named in the schema
            bigquery.SchemaField('airline.iata', 'string', mode='REQUIRED'),
            bigquery.SchemaField('airline.name', 'string', mode='REQUIRED'),
            bigquery.SchemaField('airline.icao', 'string', mode='NULLABLE'),
            bigquery.SchemaField('airline.callsign', 'string', mode='NULLABLE'),
            bigquery.SchemaField('airline.country', 'string', mode='NULLABLE'),
        ],
    },
     'airports': {
        'table_name': 'airports',
        'schema': [
            # indexes are written only if named in the schema
            bigquery.SchemaField('airport.iata', 'string', mode='REQUIRED'),
            bigquery.SchemaField('airport.name', 'string', mode='REQUIRED'),
            bigquery.SchemaField('airport.city', 'string', mode='NULLABLE'),
            bigquery.SchemaField('airport.icao', 'string', mode='NULLABLE'),
            bigquery.SchemaField('airport.latitude', 'float', mode='NULLABLE'),
            bigquery.SchemaField('airport.longitude', 'float', mode='NULLABLE'),
            bigquery.SchemaField('airport.altitude', 'float', mode='NULLABLE'),
            bigquery.SchemaField('airport.tz_timezone', 'string', mode='NULLABLE'),
        ],
    },
    'passengers': {
        'table_name': 'passengers',
        'schema': [
            # indexes are written only if named in the schema
            bigquery.SchemaField('passenger_id', 'string', mode='REQUIRED'),
            bigquery.SchemaField('passenger.email', 'string', mode='REQUIRED'),
            bigquery.SchemaField('passenger.first_name', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.last_name', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.birth_date', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.street', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.city', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.state', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.zip', 'int64', mode='NULLABLE'),
            bigquery.SchemaField('effective_start_date', 'string', mode='NULLABLE'),
            bigquery.SchemaField('effective_end_date', 'string', mode='NULLABLE'),
        ],
    },
}

In [5]:
# **** SETUP LOGGING ****
# setup logging and logger
logging.basicConfig(            # setting up the root logger
    format='[%(levelname)-5s][%(asctime)s][%(module)s:%(lineno)04d] : %(message)s',
    level=logging.INFO,
    stream=sys.stdout
)
logger: logging.Logger = logging.getLogger()            # alias the root logger as `logger`
logger.setLevel(logging.DEBUG)                          # programmatically reassign the logging level


# **** BIGQUERY CLIENT ****
logger.debug("Creating bigquery client")
client = bigquery.Client()

logger.info("Setup Completed")

[DEBUG][2023-01-06 09:38:42,811][309645063:0013] : Creating bigquery client
[INFO ][2023-01-06 09:38:42,820][309645063:0016] : Setup Completed


In [6]:
# create dataset
dataset_id = f"{PROJECT_NAME}.{DATASET_NAME}"
dataset = bigquery.Dataset(dataset_id)
dataset.location = "US"
dataset = client.create_dataset(dataset, exists_ok=True)

logger.info(f"Created air_travel dataset: {dataset.full_dataset_id}")

[INFO ][2023-01-06 09:38:53,921][2174618521:0007] : Created air_travel dataset: deb-01-371820:air_travel


In [7]:
# Create dataframe for airlines to prep for table load

airline_cols = ['airline.iata', 'airline.name', 'airline.icao', 'airline.callsign', 'airline.country']

# group by every airline column to collapse duplicate airlines,
# then turn the group keys back into regular columns
airlines_df = df.groupby(airline_cols).all()
airlines_df = airlines_df.reset_index().loc[:, airline_cols]

logger.info(f"airlines dim - found {len(airlines_df.index)} rows")

display(airlines_df)

[INFO ][2023-01-06 09:39:21,659][3708083480:0009] : airlines dim - found 48 rows


Unnamed: 0,airline.iata,airline.name,airline.icao,airline.callsign,airline.country
0,3U,Sichuan Airlines,CSC,SI CHUAN,China
1,7C,Jeju Air,JJA,JEJU AIR,Republic of Korea
2,9K,Cape Air,KAP,CAIR,United States
3,9S,Spring Airlines,CQH,AIR SPRING,China
4,AA,American Airlines,AAL,AMERICAN,United States
5,AC,Air Canada,ACA,AIR CANADA,Canada
6,AF,Air France,AFR,AIRFRANS,France
7,AK,AirAsia,AXM,ASIAN EXPRESS,Malaysia
8,AS,Alaska Airlines,ASA,Inc.,ALASKA
9,AT,Royal Air Maroc,RAM,ROYALAIR MAROC,Morocco


In [None]:
airports_df = df

airline_cols = ['airline.iata', 'airline.name','airline.icao', 'airline.callsign', 'airline.country']
airlines_df = airlines_df.groupby(airline_cols).all()
# set_index() requires a key argument, so keep the default integer index instead
airlines_df = airlines_df.reset_index().loc[:, airline_cols]