# LOADING

In the loading phase, the cleaned and transformed datasets (transformed_full.csv and transformed_incremental.csv) were read into pandas DataFrames and written to a SQLite database. SQLite was chosen for its simplicity and portability: the whole database is a single file, full_data.db, stored in the loaded/ directory. It holds two separate tables, full_data and incremental_data. The DataFrames were written with DataFrame.to_sql() using if_exists='replace' and index=False, so rerunning the cell overwrites both tables instead of appending duplicate rows. To verify the load, a sample query of the form SELECT * FROM full_data LIMIT 5 was run against each table and the result displayed to confirm that the rows were written. With this phase complete, the transformed datasets are persisted in a structured format, ready for downstream analytics or reporting tools.

In [2]:
import pandas as pd
import os
import sqlite3
from IPython.display import display  # for pretty DataFrame output

# Step 1: Define paths using os.path.join
transformed_dir = '2. transformed'  # single path component, nothing to join
loaded_dir = 'loaded'

full_csv_path = os.path.join(transformed_dir, 'transformed_full.csv')
incremental_csv_path = os.path.join(transformed_dir, 'transformed_incremental.csv')
db_path = os.path.join(loaded_dir, 'full_data.db')

# Step 2: Load both transformed CSV files
full_df = pd.read_csv(full_csv_path)
incremental_df = pd.read_csv(incremental_csv_path)

# Step 3: Ensure 'loaded/' directory exists
os.makedirs(loaded_dir, exist_ok=True)

# Step 4: Create a SQLite connection
conn = sqlite3.connect(db_path)

# Step 5: Write both DataFrames to separate tables in the same DB
full_df.to_sql('full_data', conn, if_exists='replace', index=False)
incremental_df.to_sql('incremental_data', conn, if_exists='replace', index=False)

# Step 6: Preview both tables
print("\nPreview of 'full_data' table (first 5 rows):")
preview_full = pd.read_sql_query("SELECT * FROM full_data LIMIT 5", conn)
display(preview_full)

print("\nPreview of 'incremental_data' table (first 5 rows):")
preview_incremental = pd.read_sql_query("SELECT * FROM incremental_data LIMIT 5", conn)
display(preview_incremental)

# Step 7: Close connection
conn.close()



Preview of 'full_data' table (first 5 rows):

| Unnamed: 0 | order_id | customer_name | product | quantity | unit_price | order_date | region | total_price | age | age_group |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Diana | Tablet | 2.0 | 500.0 | 2024-01-20 | South | 1000.0 |  | Unknown |
| 1 | 2 | Eve | Laptop | 2.0 | 496.09375 | 2024-04-29 | North | 992.1875 |  | Unknown |
| 2 | 3 | Charlie | Laptop | 2.0 | 250.0 | 2024-01-08 | Unknown | 500.0 |  | Unknown |
| 3 | 4 | Eve | Laptop | 2.0 | 750.0 | 2024-01-07 | West | 1500.0 |  | Unknown |
| 4 | 5 | Eve | Tablet | 3.0 | 496.09375 | 2024-03-07 | South | 1488.28125 |  | Unknown |



Preview of 'incremental_data' table (first 5 rows):

| Unnamed: 0 | order_id | customer_name | product | quantity | unit_price | order_date | region | total_price | age | age_group |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101 | Alice | Laptop | 1.5 | 900.0 | 2024-05-09 | Central | 1350.0 |  | Unknown |
| 1 | 102 | Unknown | Laptop | 1.0 | 300.0 | 2024-05-07 | Central | 300.0 |  | Unknown |
| 2 | 103 | Unknown | Laptop | 1.0 | 600.0 | 2024-05-04 | Central | 600.0 |  | Unknown |
| 3 | 104 | Unknown | Tablet | 1.5 | 300.0 | 2024-05-26 | Central | 450.0 |  | Unknown |
| 4 | 105 | Heidi | Tablet | 2.0 | 600.0 | 2024-05-21 | North | 1200.0 |  | Unknown |
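
A five-row preview confirms that rows exist, but not that every row arrived. Both previews also carry an `Unnamed: 0` column, which is what pandas typically produces when it reads a CSV that was saved together with its row index. The sketch below shows one possible stricter check. It is not part of the original notebook: `check_loaded_table` is a hypothetical helper, and the demo table and rows are made up for illustration. It compares the row count and column names of a DataFrame against the table it was written to, and lists any index-like columns.

```python
import sqlite3

import pandas as pd


def check_loaded_table(df, conn, table):
    """Compare a DataFrame with the SQLite table it was written to.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    db_rows = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    # PRAGMA table_info returns one row per column; position 1 is the name.
    db_cols = [row[1] for row in conn.execute(f"PRAGMA table_info({quoted})")]
    return {
        "rows_match": db_rows == len(df),
        "columns_match": db_cols == list(df.columns),
        "index_like_columns": [c for c in db_cols if c.startswith("Unnamed:")],
    }


# Demo on an in-memory database with made-up rows (not the project's data).
demo_df = pd.DataFrame(
    {"Unnamed: 0": [0, 1], "order_id": [1, 2], "region": ["South", "North"]}
)
demo_conn = sqlite3.connect(":memory:")
demo_df.to_sql("demo_table", demo_conn, if_exists="replace", index=False)
report = check_loaded_table(demo_df, demo_conn, "demo_table")
print(report)
demo_conn.close()
```

In the notebook, the same helper could be called with `full_df` and `'full_data'` (and likewise for the incremental table) before the connection is closed. If `index_like_columns` comes back non-empty, the column can be avoided either by writing the CSV with `to_csv(..., index=False)` in the transform step or by reading it with `pd.read_csv(..., index_col=0)`.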
