# HW2: ETL

Before starting the assignment, download the CSV file from <a href="https://drive.google.com/file/d/1adkkH5xlRLj_y78KdITb1sRVuKbsZJc9/view" target="_blank">this Google Drive link</a>. The dataset was downloaded from gapminder.org.

## Instructions

For this assignment, you will Extract, Transform, and Load (ETL) the data in the link above into <a href="https://github.com/xyjiang970/cis9440-dataWarehousing/blob/main/homeworks/hw2/CIS9440_HW2_Fall22.pdf" target="_blank">the following Dimensional Model</a>.

Using any tools or programming language (Excel, Python, SQL, Google Sheets,
R, dbt, etc.), transform the data from the CSV you downloaded into the Fact
Table and Dimensions outlined in the Dimensional Model linked above. The Fact
Table and Dimensions must be saved as 3 separate CSV files:

- CSV 1: GDP_fact
- CSV 2: DateDim
- CSV 3: CountryDim
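
As a rough illustration of the target shape, the sketch below builds the three tables from a two-country, two-year toy frame whose values are copied from the first rows of the real file. The lowercase names (`wide`, `country_dim`, `date_dim`, `fact`) are placeholders for this example only, and the column names `date_id`, `country_id`, and `income_per_person` are assumed to match the Dimensional Model.

```python
import pandas as pd

# Toy wide table shaped like the Gapminder file: one row per country,
# one column per year.
wide = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "1800": [603, 667],
    "1801": [603, 667],
})

# Dimensions: one row per country / year, each with a surrogate key.
country_dim = pd.DataFrame({"country": wide["country"]})
country_dim.insert(0, "country_id", range(1, len(country_dim) + 1))

date_dim = pd.DataFrame({"year": [int(c) for c in wide.columns if c != "country"]})
date_dim.insert(0, "date_id", range(1, len(date_dim) + 1))

# Fact: melt to one row per (country, year), then swap labels for keys.
long = wide.melt(id_vars="country", var_name="year", value_name="income_per_person")
long["year"] = long["year"].astype(int)
fact = (
    long.merge(country_dim, on="country")
        .merge(date_dim, on="year")
        [["date_id", "country_id", "income_per_person"]]
)
print(fact)
```

Each of the three frames would then be written to its own CSV file.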

### Required Python Packages

These cells only need to be run once to install the packages. You can skip them once all packages have been installed.

Once installed, go to "Kernel" in Jupyter Notebook and select "Restart & Clear Output".

Then start from the import cell under "Importing Libraries" and select Cell > Run All Below.

In [None]:
pip install --upgrade google-cloud-bigquery

## Importing Libraries

In [1]:
# import libraries
import pandas as pd
import numpy as np
from google.cloud import bigquery
from google.oauth2 import service_account

## Data Profiling

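Data profiling is a quick pass over the data to examine it and produce useful summaries: a preview of the first rows, the column list, dtypes with non-null counts, and the number of duplicate rows.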

In [2]:
data = pd.read_csv("income_per_person_inflation_adjusted.csv")

display(data.head())
print("\n-------------------------------------------------------")
display(data.columns)
print("\n-------------------------------------------------------")
data.info(verbose=True, show_counts=True)
print("\n-------------------------------------------------------")
print(f"number of duplicate rows: {len(data[data.duplicated()])} \n\n")

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2031,2032,2033,2034,2035,2036,2037,2038,2039,2040
0,Afghanistan,603,603,603,603,603,603,603,603,603,...,2550,2600,2660,2710,2770,2820,2880,2940,3000,3060
1,Albania,667,667,667,667,667,668,668,668,668,...,19400,19800,20200,20600,21000,21500,21900,22300,22800,23300
2,Algeria,715,716,717,718,719,720,721,722,723,...,14300,14600,14900,15200,15500,15800,16100,16500,16800,17100
3,Andorra,1200,1200,1200,1200,1210,1210,1210,1210,1220,...,73600,75100,76700,78300,79900,81500,83100,84800,86500,88300
4,Angola,618,620,623,626,628,631,634,637,640,...,6110,6230,6350,6480,6610,6750,6880,7020,7170,7310



-------------------------------------------------------


Index(['country', '1800', '1801', '1802', '1803', '1804', '1805', '1806',
       '1807', '1808',
       ...
       '2031', '2032', '2033', '2034', '2035', '2036', '2037', '2038', '2039',
       '2040'],
      dtype='object', length=242)


-------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 242 columns):
 #    Column   Non-Null Count  Dtype 
---   ------   --------------  ----- 
 0    country  193 non-null    object
 1    1800     193 non-null    int64 
 2    1801     193 non-null    int64 
 3    1802     193 non-null    int64 
 4    1803     193 non-null    int64 
 5    1804     193 non-null    int64 
 6    1805     193 non-null    int64 
 7    1806     193 non-null    int64 
 8    1807     193 non-null    int64 
 9    1808     193 non-null    int64 
 10   1809     193 non-null    int64 
 11   1810     193 non-null    int64 
 12   1811     193 non-null    int64 
 13   1812     193 non-null    int64 
 14   1813     193 non-null    int64 
 15   1814     193 non-null    int64 
 16   1815     193 non-null    int64 
 17   1816     193 non-null    int64 
 18   1817     193 non-null    int64 
 19   1818     193 non-null    int64 
 

## Creating Dimensions

In [3]:
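# Country Dimension
# Take an explicit copy so CountryDim is independent of `data`
CountryDim = data[["country"]].copy()
# Surrogate key: 1..N in the order the countries appear in the file
CountryDim.insert(0, "country_id", range(1, 1+len(CountryDim)))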

display(CountryDim)
print("\n-------------------------------------------------------")
CountryDim.info()

Unnamed: 0,country_id,country
0,1,Afghanistan
1,2,Albania
2,3,Algeria
3,4,Andorra
4,5,Angola
...,...,...
188,189,Venezuela
189,190,Vietnam
190,191,Yemen
191,192,Zambia



-------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   country_id  193 non-null    int64 
 1   country     193 non-null    object
dtypes: int64(1), object(1)
memory usage: 3.1+ KB


In [4]:
# Date Dimension
# The years are the labels of every column except "country"
DateDim = pd.DataFrame({"year": data.columns.drop("country")})

DateDim["year"] = DateDim["year"].astype(np.int64)

# Surrogate key, then the decade each year falls in (e.g. 1809 -> 1800)
DateDim.insert(0, "date_id", range(1, 1+len(DateDim)))
DateDim.insert(2, "decade", 10 * (DateDim['year'] // 10))

display(DateDim)
print("\n-------------------------------------------------------")
DateDim.info()

Unnamed: 0,date_id,year,decade
0,1,1800,1800
1,2,1801,1800
2,3,1802,1800
3,4,1803,1800
4,5,1804,1800
...,...,...,...
236,237,2036,2030
237,238,2037,2030
238,239,2038,2030
239,240,2039,2030



-------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241 entries, 0 to 240
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   date_id  241 non-null    int64
 1   year     241 non-null    int64
 2   decade   241 non-null    int64
dtypes: int64(3)
memory usage: 5.8 KB
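
The `decade` column relies on integer floor division: `year // 10` drops the last digit and multiplying by 10 puts a zero back. A minimal standalone sketch of that arithmetic follows; the function name `decade_of` is made up for this example.

```python
def decade_of(year: int) -> int:
    """Floor a year to the first year of its decade."""
    return 10 * (year // 10)

for y in (1800, 1809, 2039, 2040):
    print(y, decade_of(y))
```

Note that 2040, the last year in the file, starts a decade of its own, so the final row of `DateDim` has `decade` 2040 rather than 2030.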


## Creating Fact Table

In [5]:
# Fact Table
data

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2031,2032,2033,2034,2035,2036,2037,2038,2039,2040
0,Afghanistan,603,603,603,603,603,603,603,603,603,...,2550,2600,2660,2710,2770,2820,2880,2940,3000,3060
1,Albania,667,667,667,667,667,668,668,668,668,...,19400,19800,20200,20600,21000,21500,21900,22300,22800,23300
2,Algeria,715,716,717,718,719,720,721,722,723,...,14300,14600,14900,15200,15500,15800,16100,16500,16800,17100
3,Andorra,1200,1200,1200,1200,1210,1210,1210,1210,1220,...,73600,75100,76700,78300,79900,81500,83100,84800,86500,88300
4,Angola,618,620,623,626,628,631,634,637,640,...,6110,6230,6350,6480,6610,6750,6880,7020,7170,7310
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,Venezuela,1210,1200,1200,1190,1190,1180,1170,1170,1160,...,8270,8420,8580,8760,8930,9110,9300,9490,9680,9880
189,Vietnam,778,778,778,778,778,778,778,778,778,...,11900,12200,12500,12700,13000,13300,13500,13800,14100,14400
190,Yemen,877,879,882,884,887,889,892,894,897,...,3230,3290,3360,3430,3500,3570,3640,3720,3790,3870
191,Zambia,663,665,667,668,670,671,673,675,676,...,3500,3560,3630,3700,3780,3860,3930,4010,4100,4180


In [6]:
# Reshape wide to long: one row per (country, year)
# https://towardsdatascience.com/wide-to-long-data-how-and-when-to-use-pandas-melt-stack-and-wide-to-long-7c1e0f462a98
GDP_fact = data.melt(id_vars=["country"], var_name="year", value_name="income_per_person")

# Merge with CountryDim to pick up country_id
GDP_fact = pd.merge(GDP_fact, CountryDim, how="left", on="country")

# Merge with DateDim to pick up date_id (year must be an integer to match DateDim)
GDP_fact["year"] = GDP_fact["year"].astype(np.int64)
GDP_fact = pd.merge(GDP_fact, DateDim, how="left", on="year")

# Keep only the two foreign keys and the measure
GDP_fact = GDP_fact[["date_id","country_id","income_per_person"]]

display(GDP_fact)
print("\n-------------------------------------------------------")
GDP_fact.info()

Unnamed: 0,date_id,country_id,income_per_person
0,1,1,603
1,1,2,667
2,1,3,715
3,1,4,1200
4,1,5,618
...,...,...,...
46508,241,189,9880
46509,241,190,14400
46510,241,191,3870
46511,241,192,4180



-------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 46513 entries, 0 to 46512
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   date_id            46513 non-null  int64
 1   country_id         46513 non-null  int64
 2   income_per_person  46513 non-null  int64
dtypes: int64(3)
memory usage: 1.4 MB
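
The row count above is what a complete grid predicts: 193 countries × 241 years = 46513 rows. The sketch below turns that reasoning into a reusable check; `check_fact_table` is a hypothetical helper, not something the assignment requires, and it is shown here on a toy frame.

```python
import pandas as pd

def check_fact_table(fact: pd.DataFrame, n_countries: int, n_dates: int) -> None:
    """Raise AssertionError unless fact is a complete country x date grid."""
    keys = fact[["date_id", "country_id"]]
    assert len(fact) == n_countries * n_dates, "unexpected row count"
    assert not keys.isna().any().any(), "a left merge failed to find a key"
    assert not keys.duplicated().any(), "duplicate (date_id, country_id) pair"

toy_fact = pd.DataFrame({
    "date_id": [1, 1, 2, 2],
    "country_id": [1, 2, 1, 2],
    "income_per_person": [603, 667, 603, 667],
})
check_fact_table(toy_fact, n_countries=2, n_dates=2)
print("fact table looks complete")
```

In the notebook this could be called as `check_fact_table(GDP_fact, len(CountryDim), len(DateDim))`.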


## Saving CSV Files

In [7]:
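# index=False keeps the pandas row index out of the files,
# so each CSV contains only the columns of the Dimensional Model
CountryDim.to_csv("CountryDim.csv", index=False)
DateDim.to_csv("DateDim.csv", index=False)
GDP_fact.to_csv("GDP_fact.csv", index=False)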
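
By default `to_csv` also writes the DataFrame's row index as an extra, unnamed first column, which is not part of the Dimensional Model. The sketch below compares the two behaviours on a two-row toy dimension, writing to strings instead of files so it can be run anywhere.

```python
import io
import pandas as pd

dim = pd.DataFrame({"country_id": [1, 2], "country": ["Afghanistan", "Albania"]})

with_index = dim.to_csv()                # default: row index is written too
without_index = dim.to_csv(index=False)  # only the modelled columns

header_with = with_index.splitlines()[0]
header_without = without_index.splitlines()[0]
print(repr(header_with))
print(repr(header_without))

# Reading the files back shows the extra column that the default creates
reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))
print(list(reread_with.columns))
print(list(reread_without.columns))
```

Passing `index=False` to each `to_csv` call avoids the extra column.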

## Deliver Dimensions and Facts to Google BigQuery

### Google BigQuery Variables

In [8]:
# CHANGE THIS TO YOUR FILE PATH
key_path = r'C:\Users\xj438\Desktop\cis9440-361100-642072a7b126.json'

In [9]:
# Run this cell without changing anything to set up your credentials
credentials = service_account.Credentials.from_service_account_file(key_path,
                                                                    scopes=["https://www.googleapis.com/auth/cloud-platform"])
bigquery_client = bigquery.Client(credentials=credentials,
                                  project=credentials.project_id)

print(f"bigquery client name is: {bigquery_client}")
print(f"bigquery client data type is: {type(bigquery_client)}")

bigquery client name is: <google.cloud.bigquery.client.Client object at 0x000001C6EABCF1F0>
bigquery client data type is: <class 'google.cloud.bigquery.client.Client'>


### Uploading Dimensions and Facts to Google BigQuery

In [10]:
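# Dataset ID of the dataset you created in BigQuery, in the form project.dataset
dataset_id = 'cis9440-361100.hw2_ETL'

# If the ID was copied in the project:dataset form, convert it to project.dataset
dataset_id = dataset_id.replace(':', '.')
print(f"your dataset_id is: {dataset_id}")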

your dataset_id is: cis9440-361100.hw2_ETL


In [11]:
# Create a function to load dataframes to BigQuery

def load_table_to_bigquery(df,
                           table_name,
                           dataset_id):
    """Load df into dataset_id.table_name, replacing any existing table."""

    job_config = bigquery.LoadJobConfig()
    job_config.autodetect = True
    # WRITE_TRUNCATE overwrites the table if it already exists
    job_config.write_disposition = "WRITE_TRUNCATE"

    upload_table_name = f"{dataset_id}.{table_name}"
    
    load_job = bigquery_client.load_table_from_dataframe(df,
                                                         upload_table_name,
                                                         job_config = job_config)

    # Block until the load finishes; raises an exception if the job failed
    load_job.result()

    print(f"completed job {load_job}")

In [12]:
load_table_to_bigquery(df=GDP_fact, 
                       table_name="GDP_fact", 
                       dataset_id=dataset_id)

completed job LoadJob<project=cis9440-361100, location=US, id=8ab83cef-177d-49de-a081-c26e7d8269d5>


In [13]:
load_table_to_bigquery(df=DateDim, 
                       table_name="DateDim", 
                       dataset_id=dataset_id)

completed job LoadJob<project=cis9440-361100, location=US, id=39f15c35-0d98-41dd-9d2c-c09cae6e2266>


In [14]:
load_table_to_bigquery(df=CountryDim, 
                       table_name="CountryDim", 
                       dataset_id=dataset_id)

completed job LoadJob<project=cis9440-361100, location=US, id=e40d46cb-209d-453c-9906-61e4be0fb6b6>
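
A printed `LoadJob` only shows that a job was submitted, so it can be worth comparing row counts once the load has finished. The sketch below is one possible check, not part of the assignment: `table_row_count` and `verify_load` are hypothetical helper names, and they rely only on `Client.get_table` and the `num_rows` property of the table it returns.

```python
def table_row_count(client, table_id: str) -> int:
    """Return the row count BigQuery reports for 'project.dataset.table'."""
    return client.get_table(table_id).num_rows

def verify_load(client, table_id: str, expected_rows: int) -> bool:
    """Compare the uploaded table's row count with the local DataFrame's."""
    actual = table_row_count(client, table_id)
    print(f"{table_id}: expected {expected_rows}, found {actual}")
    return actual == expected_rows
```

In the notebook this could be called as `verify_load(bigquery_client, f"{dataset_id}.GDP_fact", len(GDP_fact))`, and likewise for `DateDim` and `CountryDim`.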
