## Exercise 1: Data Modeling

Using draw.io, create a data model. Your data model MUST meet the following requirements:

1. Contain a _tickets_ fact table
1. Contain the following dimensions: _airlines_, _airports_, and _passengers_
1. Develop _passengers_ as an SCD Type 2 dimension:
    - Passenger email can be used as the natural key
    - Be sure to add a surrogate key and effective start/end dates
    - You can optionally add an active column
1. IATA codes can be used as the primary key for both _airlines_ and _airports_
1. Use the e-ticket number (`eticket_num`) as the primary key for the _tickets_ fact
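
To make the SCD Type 2 requirement concrete, here is a minimal sketch of what one _passengers_ row could look like and how an address change would produce a new version. The class name `PassengerDimRow`, the helper `new_version`, and the field names are illustrative assumptions, not part of the exercise; your diagram may name them differently.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional, Tuple
import uuid


@dataclass
class PassengerDimRow:
    """One version of a passenger (hypothetical column names)."""
    passenger_id: str                           # surrogate key, unique per version
    email: str                                  # natural key, shared by all versions
    street: str                                 # example of a Type 2 (tracked) column
    effective_start_date: date
    effective_end_date: Optional[date] = None   # None means "current version"
    active: str = "Y"                           # optional active flag


def new_version(current: PassengerDimRow, new_street: str,
                change_date: date) -> Tuple[PassengerDimRow, PassengerDimRow]:
    """Close the current version and open a new one for the same email."""
    closed = replace(current, effective_end_date=change_date, active="N")
    opened = PassengerDimRow(
        passenger_id=str(uuid.uuid4()),
        email=current.email,
        street=new_street,
        effective_start_date=change_date,
    )
    return closed, opened
```

The natural key (email) stays the same across versions, while each version gets its own surrogate key; that is why the fact table should reference the surrogate key rather than the email.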

In [1]:
# Import the data from the JSON file and explore it

import os
import sys
import pandas as pd
import logging
from google.cloud import bigquery
from hashlib import md5
from typing import List
import json

# **** SETUP ****

# change to match your filesystem
DATA_DIR = "./data/air_travel/"
DEFAULT_RECEIPTS_FILE = os.path.join(DATA_DIR, "tickets.json")
PROJECT_NAME = "deb-01-371820"
DATASET_NAME = "air_travel"

# The file is newline-delimited JSON: one ticket object per line
data = []
with open(DEFAULT_RECEIPTS_FILE, 'r') as f:
    for line in f:
        data.append(json.loads(line))

# json_normalize flattens nested objects into dotted column names (e.g. airline.iata)
df = pd.json_normalize(data)

# Summarize columns, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4096 entries, 0 to 4095
Data columns (total 40 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   eticket_num              4096 non-null   object 
 1   confirmation             4096 non-null   object 
 2   ticket_date              4096 non-null   object 
 3   price                    4096 non-null   float64
 4   seat                     4096 non-null   object 
 5   status                   4096 non-null   object 
 6   airline.name             4096 non-null   object 
 7   airline.iata             4096 non-null   object 
 8   airline.icao             4096 non-null   object 
 9   airline.callsign         4096 non-null   object 
 10  airline.country          4096 non-null   object 
 11  origin.name              4041 non-null   object 
 12  origin.city              4041 non-null   object 
 13  origin.country           4041 non-null   object 
 14  origin.iata             
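
The dotted column names in the listing above (`airline.iata`, `origin.city`, and so on) come from flattening nested JSON objects. The small sketch below shows that behavior on two made-up records; the airline names and codes are invented for illustration and are not taken from `tickets.json`.

```python
import pandas as pd

# Two made-up ticket records with a nested "airline" object
records = [
    {"eticket_num": "0001", "airline": {"name": "Example Air", "iata": "EX"}},
    {"eticket_num": "0002", "airline": {"name": "Sample Airways", "iata": "SA"}},
]

# Nested keys are joined to their parent key with a dot
flat = pd.json_normalize(records)
columns = list(flat.columns)
print(columns)
```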

Here is the Data Model that includes:
1. tickets fact table
2. airlines dimension table
3. airports dimension table
4. passengers dimension table

<img src="./imgs/air_travel_data_modeling.drawio.png" alt="data model" width="640" />

## Exercise 2: Data Loading and Normalization

Develop an ETL pipeline that loads our dimensions and facts from the source file. Your pipeline MUST meet the following requirements:

**General**:
- Load all dimensions in order: _airlines_, _airports_, and _passengers_
- Load the _tickets_ fact table after loading dimensions
- Your pipeline can drop/replace tables
- You can assume only inserts at this stage. No updates, deletes, or merges

**Airlines Dim:**
- Identify unique airlines
- Use IATA code as the dimension key

**Airports Dim:**
- Identify unique airports from both origin and destination fields
- Use IATA code as the dimension key
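
One possible way to collect airports from both sides of a ticket is to stack the `origin.*` and `destination.*` columns under a common `airport.*` prefix and then de-duplicate on the IATA code. This is a sketch, not the required solution: the function name `build_airports` is hypothetical, and it assumes the destination columns mirror the origin columns (for example `destination.iata`). Because some origin fields are null in the source data, rows without an IATA code are dropped.

```python
import pandas as pd


def build_airports(tickets: pd.DataFrame) -> pd.DataFrame:
    """Return one row per airport IATA code found in origin or destination."""
    frames = []
    for prefix in ("origin", "destination"):
        cols = [c for c in tickets.columns if c.startswith(prefix + ".")]
        # origin.iata / destination.iata -> airport.iata, and so on
        part = tickets[cols].rename(
            columns=lambda c: "airport." + c.split(".", 1)[1]
        )
        frames.append(part)

    airports = pd.concat(frames, ignore_index=True)
    # An airport without a key cannot be a dimension row
    airports = airports.dropna(subset=["airport.iata"])
    return airports.drop_duplicates(subset="airport.iata").reset_index(drop=True)
```

With the notebook's DataFrame this would be called as `airports_df = build_airports(df)`.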

**Passengers Dim:**
- Identify unique passengers
- Use the passenger email as the dimension natural key
- Generate UUIDs for the dimension surrogate keys
- Set the effective start date to any date. You can either use the ticket date, current date, or a fixed set date in the past
- Set the effective end date to None
- Optionally set your active flag to 'Y'
- Passenger address columns are considered SCD Type 2 columns
- All other columns are SCD Type 1
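
The passenger requirements above could be sketched as follows. The function name `build_passengers`, the fixed default start date, and the `active` column name are assumptions made for illustration; the `passenger.*` column names are assumed to match the flattened source data.

```python
import uuid

import pandas as pd


def build_passengers(tickets: pd.DataFrame,
                     start_date: str = "2020-01-01") -> pd.DataFrame:
    """Return one current SCD Type 2 row per unique passenger email."""
    cols = [c for c in tickets.columns if c.startswith("passenger.")]
    passengers = (
        tickets[cols]
        .drop_duplicates(subset="passenger.email")  # email is the natural key
        .reset_index(drop=True)
    )
    # One random UUID per row as the surrogate key
    passengers.insert(
        0, "passenger_id", [str(uuid.uuid4()) for _ in range(len(passengers))]
    )
    passengers["effective_start_date"] = start_date  # fixed date in the past
    passengers["effective_end_date"] = None          # open-ended: current version
    passengers["active"] = "Y"                       # optional flag
    return passengers
```

Since the exercise assumes inserts only, every passenger gets exactly one version here. Handling address changes (closing a row and inserting a new version) is deliberately left out.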

**Tickets Fact:**
- Link to _airlines_ and _airports_ dimensions by their IATA codes. 
- You don't need a lookup for _airlines_ and _airports_ since we use their IATA as dimension keys
- Link to the _passengers_ dimension by its surrogate key
- You need to perform a lookup to the _passengers_ dimension
- Load all the tickets
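
The passenger lookup described above can be sketched as a left join from tickets to the current passenger rows on the email address. The function name `build_tickets_fact` and the list of columns kept in the fact are assumptions; `destination.iata` in particular is assumed to exist alongside `origin.iata`.

```python
import pandas as pd


def build_tickets_fact(tickets: pd.DataFrame,
                       passengers: pd.DataFrame) -> pd.DataFrame:
    """Attach the passenger surrogate key to every ticket."""
    # Only current passenger versions (no end date) take part in the lookup
    current = passengers[passengers["effective_end_date"].isna()]
    lookup = current[["passenger.email", "passenger_id"]]

    # Left join keeps every ticket; many_to_one raises if an email
    # appears more than once in the lookup
    fact = tickets.merge(
        lookup, on="passenger.email", how="left", validate="many_to_one"
    )

    # Keep ticket attributes and dimension keys, drop descriptive columns
    keep = [
        "eticket_num", "confirmation", "ticket_date", "price", "seat", "status",
        "passenger_id", "airline.iata", "origin.iata", "destination.iata",
    ]
    return fact[[c for c in keep if c in fact.columns]]
```

The airline and airport keys need no lookup because the IATA codes are already on each ticket row.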

<br><br>

In [None]:
TABLE_METADATA = {
    'airlines': {
        'table_name': 'airlines',
        'schema': [
            # indexes are written only if named in the schema
            bigquery.SchemaField('airline.iata', 'string', mode='REQUIRED'),
            bigquery.SchemaField('airline.name', 'string', mode='REQUIRED'),
            bigquery.SchemaField('airline.icao', 'string', mode='NULLABLE'),
            bigquery.SchemaField('airline.callsign', 'string', mode='NULLABLE'),
            bigquery.SchemaField('airline.country', 'string', mode='NULLABLE'),
        ],
    },
    'airports': {
        'table_name': 'airports',
        'schema': [
            # indexes are written only if named in the schema
            bigquery.SchemaField('airport.iata', 'string', mode='REQUIRED'),
            bigquery.SchemaField('airport.name', 'string', mode='REQUIRED'),
            bigquery.SchemaField('airport.city', 'string', mode='NULLABLE'),
            bigquery.SchemaField('airport.icao', 'string', mode='NULLABLE'),
            bigquery.SchemaField('airport.latitude', 'float', mode='NULLABLE'),
            bigquery.SchemaField('airport.longitude', 'float', mode='NULLABLE'),
            bigquery.SchemaField('airport.altitude', 'float', mode='NULLABLE'),
            bigquery.SchemaField('airport.tz_timezone', 'string', mode='NULLABLE'),
        ],
    },
    'passengers': {
        'table_name': 'passengers',
        'schema': [
            # indexes are written only if named in the schema
            bigquery.SchemaField('passenger_id', 'string', mode='REQUIRED'),
            bigquery.SchemaField('passenger.email', 'string', mode='REQUIRED'),
            bigquery.SchemaField('passenger.first_name', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.last_name', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.birth_date', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.street', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.city', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.state', 'string', mode='NULLABLE'),
            bigquery.SchemaField('passenger.zip', 'int64', mode='NULLABLE'),
            bigquery.SchemaField('effective_start_date', 'string', mode='NULLABLE'),
            bigquery.SchemaField('effective_end_date', 'string', mode='NULLABLE'),
        ],
    },
}
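
One caveat about the schema above: as far as we know, BigQuery column names may not contain periods, so dotted names such as `airline.iata` may be rejected at load time. A possible workaround, shown here as a sketch, is to map dots to underscores in both the schema and the DataFrame before loading. The helper name `to_bq_column` is hypothetical.

```python
def to_bq_column(name: str) -> str:
    """Replace dots with underscores, e.g. 'airline.iata' -> 'airline_iata'."""
    return name.replace(".", "_")


dotted = ["airline.iata", "airline.name", "passenger_id"]
renamed = [to_bq_column(c) for c in dotted]
print(renamed)
```

With pandas, the same function can be passed directly to `DataFrame.rename`, for example `airlines_df.rename(columns=to_bq_column)`.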

In [None]:
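# Select the airline columns from the flattened tickets
airlines_df = df[['airline.iata', 'airline.name', 'airline.icao', 'airline.callsign', 'airline.country']]
# Keep one row per airline; the IATA code is the dimension key
airlines_df = airlines_df.drop_duplicates(subset='airline.iata')
airlines_df = airlines_df.reset_index(drop=True)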


