## Importing Libraries and Loading Environment Variables

I start by importing the libraries I need: os, pandas, dotenv, psycopg2, and openpyxl. Together they cover data manipulation, database interaction, and reading Excel files. I call load_dotenv with override=True so that the values in my .env file replace any values already set in the environment.

In [2]:

import os
import pandas as pd # data analysis and manipulation library
from dotenv import load_dotenv
import psycopg2 # database adapter that lets Python applications interact with PostgreSQL databases
import openpyxl # Python library used for reading from and writing to Excel 2010+ files

load_dotenv(override=True) # override=True: values in .env replace variables already set in the environment, so edits to .env take effect when this cell is re-run

pg_url = os.getenv('POSTGRES_URL')

type(pg_url)

str
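Because `os.getenv` returns `None` when a variable is missing, a small guard can fail early with a clear message instead of passing `None` on to the database driver. The following is a minimal sketch: `require_env` is a hypothetical helper that is not part of the original notebook, and the connection string is a made-up placeholder used only when no real value is set.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Placeholder for demonstration only; setdefault leaves a real value from .env untouched
os.environ.setdefault("POSTGRES_URL", "postgresql://user:password@localhost:5432/example_db")

pg_url = require_env("POSTGRES_URL")
print(type(pg_url).__name__)
```

With this guard, a typo in the variable name surfaces at load time rather than later as a confusing connection error.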

## Reading Data from Excel

After importing the necessary libraries, the next step is to read the data into pandas. I load the Excel file 'global-superstore-data.xlsx' into a pandas DataFrame, which is the Extract phase of the ETL process.

In [4]:
file_path = '/Users/tamarainwang/Downloads/Week_6/hands_on_proj/global-superstore-data.xlsx'

df = pd.read_excel(file_path)

## Exploring the DataFrame

In [5]:
df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,City,State,...,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Shipping Cost,Order Priority
0,32298,CA-2012-124891,2012-07-31,2012-07-31,Same Day,RH-19495,Rick Hansen,Consumer,New York City,New York,...,TEC-AC-10003033,Technology,Accessories,Plantronics CS510 - Over-the-Head monaural Wir...,2309.65,7,0.0,762.1845,933.57,Critical
1,26341,IN-2013-77878,2013-02-05,2013-02-07,Second Class,JR-16210,Justin Ritter,Corporate,Wollongong,New South Wales,...,FUR-CH-10003950,Furniture,Chairs,"Novimex Executive Leather Armchair, Black",3709.395,9,0.1,-288.765,923.63,Critical
2,25330,IN-2013-71249,2013-10-17,2013-10-18,First Class,CR-12730,Craig Reiter,Consumer,Brisbane,Queensland,...,TEC-PH-10004664,Technology,Phones,"Nokia Smart Phone, with Caller ID",5175.171,9,0.1,919.971,915.49,Medium
3,13524,ES-2013-1579342,2013-01-28,2013-01-30,First Class,KM-16375,Katherine Murray,Home Office,Berlin,Berlin,...,TEC-PH-10004583,Technology,Phones,"Motorola Smart Phone, Cordless",2892.51,5,0.1,-96.54,910.16,Medium
4,47221,SG-2013-4320,2013-11-05,2013-11-06,Same Day,RH-9495,Rick Hansen,Consumer,Dakar,Dakar,...,TEC-SHA-10000501,Technology,Copiers,"Sharp Wireless Fax, High-Speed",2832.96,8,0.0,311.52,903.04,Critical


In [6]:
df.columns

# The column names need reformatting: they contain spaces and hyphens.

Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
       'Customer ID', 'Customer Name', 'Segment', 'City', 'State', 'Country',
       'Postal Code', 'Market', 'Region', 'Product ID', 'Category',
       'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount',
       'Profit', 'Shipping Cost', 'Order Priority'],
      dtype='object')
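To see how many names are affected, the column list from the output above can be checked directly. This is only an illustrative sketch; the variable names `columns` and `needs_fixing` are my own and do not appear in the original notebook.

```python
# Column names copied from the df.columns output above
columns = [
    'Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
    'Customer ID', 'Customer Name', 'Segment', 'City', 'State', 'Country',
    'Postal Code', 'Market', 'Region', 'Product ID', 'Category',
    'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount',
    'Profit', 'Shipping Cost', 'Order Priority',
]

# Names containing a space or a hyphen would need quoting in SQL
needs_fixing = [name for name in columns if ' ' in name or '-' in name]

print(f"{len(needs_fixing)} of {len(columns)} column names contain a space or hyphen")
```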

## Data Transformation

After exploring the DataFrame, I defined the function "transform_column_names" to format its column names: it removes leading and trailing spaces, converts the names to lowercase, and replaces spaces and hyphens with underscores. This standardizes the column names for database compatibility.


In [7]:
# Function to transform column names

def transform_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Strip each column name, lowercase it, and replace spaces and hyphens with underscores.

    The DataFrame is modified in place and also returned."""
    df.columns = [column.strip().lower().replace(' ', '_').replace('-', '_') for column in df.columns]
    return df

I used the transform_column_names function to transform each column name in the DataFrame (df). The result is a DataFrame with standardised column names.

In [26]:
raw_data_transformed = transform_column_names(df)

raw_data_transformed.columns


Index(['row_id', 'order_id', 'order_date', 'ship_date', 'ship_mode',
       'customer_id', 'customer_name', 'segment', 'city', 'state', 'country',
       'postal_code', 'market', 'region', 'product_id', 'category',
       'sub_category', 'product_name', 'sales', 'quantity', 'discount',
       'profit', 'shipping_cost', 'order_priority'],
      dtype='object')
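Note that the function assigns to `df.columns`, so `raw_data_transformed` and `df` refer to the same, now renamed, DataFrame. If the original names should stay available, a non-mutating variant is possible. The sketch below is an alternative I am suggesting, not the notebook's method; `clean_column_names` and the sample data are hypothetical.

```python
import pandas as pd


def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with standardized column names; df itself is left unchanged."""
    def clean(name: str) -> str:
        return name.strip().lower().replace(" ", "_").replace("-", "_")

    # DataFrame.rename with a callable returns a new DataFrame by default
    return df.rename(columns=clean)


# Made-up sample with a leading space and a hyphen in the column names
sample = pd.DataFrame({" Order ID": ["CA-2012-124891"], "Sub-Category": ["Accessories"]})
cleaned = clean_column_names(sample)

print(list(cleaned.columns))  # ['order_id', 'sub_category']
print(list(sample.columns))   # [' Order ID', 'Sub-Category']
```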

## Data Normalization


The next step is to break the main DataFrame into four distinct DataFrames: products, customers, locations, and orders. For each one I select the relevant columns and remove duplicates. I also included print statements to track the progress of the normalization.
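The idea can be illustrated on a toy table before applying it to the full dataset. This is a minimal sketch with made-up sales rows; `flat` and `customers` are illustrative names only.

```python
import pandas as pd

# Made-up rows: two order lines from the same customer, one from another
flat = pd.DataFrame({
    "customer_id": ["RH-19495", "RH-19495", "JR-16210"],
    "customer_name": ["Rick Hansen", "Rick Hansen", "Justin Ritter"],
    "sales": [2309.65, 120.00, 3709.395],
})

# Keep only the customer attributes, then collapse repeated customers into one row each
customers = flat[["customer_id", "customer_name"]].drop_duplicates().reset_index(drop=True)

print(len(flat), len(customers))  # 3 2
```

Each customer now appears once in its own table, while the fact rows can refer to it through `customer_id`.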

In [33]:
# Normalization: split the table into the four subsets below, then remove duplicate rows from each

# products
# customers
# locations
# orders

print(f"Normalization in progress.....current length = {len(raw_data_transformed)}")

products_df = raw_data_transformed[['product_id', 'category', 'sub_category', 'product_name']]
products_df_clean = products_df.drop_duplicates()

print(f"Done transforming Products ---> current length = {len(products_df_clean)}")

customers_df = raw_data_transformed[['customer_id', 'customer_name']]
customers_df_clean = customers_df.drop_duplicates()

print(f"Done transforming customers ---> current length = {len(customers_df_clean)}")

locations_df = raw_data_transformed[['city', 'state', 'country']]
locations_df_clean = locations_df.drop_duplicates()

print(f"Done transforming locations ---> current length = {len(locations_df_clean)}")

# The next step is to create orders_df and remove duplicates based on a primary key.
# order_id looks like the obvious choice, but it is not unique in this dataset:
# an order can contain several products, so several rows share the same order_id.
# So I built orders_df by dropping the descriptive fields that already live in the other tables.

orders_df = raw_data_transformed.drop(columns = ['category', 'sub_category', 'product_name','customer_name','state', 'country'])

print(len(orders_df))

# Therefore I drop duplicates in orders_df based on row_id, which is the primary key for the orders table

orders_df_clean = orders_df.drop_duplicates(subset = ['row_id'])

print(len(orders_df_clean))

print(f"Done transforming orders ---> current length = {len(orders_df_clean)}")






Normalization in progress.....current length = 51290
Done transforming Products ---> current length = 10768
Done transforming customers ---> current length = 1590
Done transforming locations ---> current length = 3812
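Whether a column can serve as a primary key can be checked with `Series.is_unique`. The sketch below uses made-up rows rather than the real dataset to show why `order_id` fails that test when one order spans several products while `row_id` passes.

```python
import pandas as pd

# Toy rows (made up for illustration): order "A-1" contains two products
orders = pd.DataFrame({
    "row_id": [1, 2, 3],
    "order_id": ["A-1", "A-1", "B-2"],
    "product_id": ["P-10", "P-11", "P-10"],
})

print(orders["order_id"].is_unique)  # False
print(orders["row_id"].is_unique)    # True

# keep=False marks every row whose order_id occurs more than once
repeated = orders[orders.duplicated(subset=["order_id"], keep=False)]
print(len(repeated))  # 2
```

The same check could be run on each cleaned table before loading it into the database.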


## Writing Data to CSV Files

I then wrote the transformed DataFrames to CSV files in the outputs/models directory, which serves as intermediate storage, and added a print statement to confirm that the files were saved.

In [32]:
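# Write the de-duplicated DataFrames to CSV files

output_dir = "hands_on_proj"
os.makedirs(output_dir, exist_ok=True)  # create the directory if it does not exist yet

# os.path.join adds the path separator that plain string concatenation left out
products_df_clean.to_csv(os.path.join(output_dir, "products.csv"), index=False)
customers_df_clean.to_csv(os.path.join(output_dir, "customers.csv"), index=False)
locations_df_clean.to_csv(os.path.join(output_dir, "locations.csv"), index=False)
orders_df_clean.to_csv(os.path.join(output_dir, "orders.csv"), index=False)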

print("Files written to intermediate storage")

Files written to intermediate storage
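A quick way to confirm the files landed where expected is to write into a known directory and read one file back. This is a sketch under assumptions: it uses a temporary directory and two tiny made-up tables instead of the project's real paths and data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny made-up tables standing in for the real DataFrames
tables = {
    "products": pd.DataFrame({"product_id": ["P-10"], "category": ["Technology"]}),
    "customers": pd.DataFrame({"customer_id": ["C-1"], "customer_name": ["Example Name"]}),
}

# Temporary location for the demo; a real run would point at the project's output directory
output_dir = Path(tempfile.mkdtemp()) / "models"
output_dir.mkdir(parents=True, exist_ok=True)

written = []
for name, table in tables.items():
    path = output_dir / f"{name}.csv"  # the / operator inserts the path separator
    table.to_csv(path, index=False)
    written.append(path)

# Read one file back to check that the columns and row count survived the round trip
round_trip = pd.read_csv(written[0])
print(sorted(p.name for p in written))
print(list(round_trip.columns), len(round_trip))
```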


51290
51290


In [2]:
from datetime import datetime

In [7]:
print(datetime.now().strftime('%Y-%d-%m %H:%M:%S'))  # note: '%Y-%d-%m' prints year-day-month order

2023-09-12 11:43:45
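The order of the format codes matters when reading a timestamp like the one above. The sketch below uses a fixed, made-up date (not the actual run date) to show how `'%Y-%d-%m'` differs from the ISO-style `'%Y-%m-%d'`.

```python
from datetime import datetime

# Fixed, made-up timestamp so the difference between the two formats is visible
moment = datetime(2023, 12, 9, 11, 43, 45)

year_day_month = moment.strftime('%Y-%d-%m %H:%M:%S')
year_month_day = moment.strftime('%Y-%m-%d %H:%M:%S')

print(year_day_month)  # 2023-09-12 11:43:45
print(year_month_day)  # 2023-12-09 11:43:45
```

The ISO-style order has the advantage that timestamps sort correctly as plain strings, which is useful for log lines and file names.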
