# EXTRACTION

In the extraction phase, two datasets, raw_data.csv and incremental_data.csv, were loaded into memory using pandas. These files hold the historical and the new records, respectively. A preview of each dataset was displayed with .head() to get an initial view of the structure and contents, and .info() was used to inspect data types and non-null counts.

This exploration surfaced several issues: quantity, unit_price, customer_name, and region contained missing values; the order_date column was stored as a string (object dtype), so it needs conversion during transformation; and at least one row in raw_data.csv appeared to be duplicated (order_id 4 occurs twice in the preview). Both datasets were then written back to the 1. data directory so that their raw form is kept before transformation. The extraction step thus identifies the key issues to address in the next phase.
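The missing-value and duplicate observations can be checked with a couple of pandas calls. The following is a minimal sketch, not the notebook's own code: the helper name summarize_quality is hypothetical, and the inline sample is an assumption built from the first six rows of the raw_data.csv preview so that the block runs without the CSV files.

```python
import io

import pandas as pd

# Hypothetical inline sample: the first six rows of the raw_data.csv preview.
SAMPLE_CSV = """order_id,customer_name,product,quantity,unit_price,order_date,region
1,Diana,Tablet,,500.0,2024-01-20,South
2,Eve,Laptop,,,2024-04-29,North
3,Charlie,Laptop,2.0,250.0,2024-01-08,
4,Eve,Laptop,2.0,750.0,2024-01-07,West
5,Eve,Tablet,3.0,,2024-03-07,South
4,Eve,Laptop,2.0,750.0,2024-01-07,West
"""


def summarize_quality(df):
    """Return per-column missing-value counts and the number of fully duplicated rows."""
    missing = df.isna().sum()                  # count of NaN cells in each column
    n_duplicates = int(df.duplicated().sum())  # rows identical to an earlier row
    return missing, n_duplicates


sample_df = pd.read_csv(io.StringIO(SAMPLE_CSV))
missing_counts, duplicate_rows = summarize_quality(sample_df)

print(missing_counts[missing_counts > 0])  # only columns that have gaps
print("Fully duplicated rows:", duplicate_rows)
```

In the notebook itself, the same helper could presumably be called on raw_df and incremental_df once they are loaded.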

In [2]:
# etl_extract.ipynb

# Step 1: Import necessary libraries
import pandas as pd
import os
from IPython.display import display  # for nice DataFrame output

# Step 2: Print the current working directory
print("Current Working Directory:")
print(os.getcwd())

# Step 3: Define the data directory path
data_path = os.path.join(os.getcwd(), '1. data')  # Full path to the '1. data' directory

# Step 4: Define full paths to CSV files
raw_path = os.path.join(data_path, 'raw_data.csv')
inc_path = os.path.join(data_path, 'incremental_data.csv')

# Step 5: Load the CSV files
raw_df = pd.read_csv(raw_path)
incremental_df = pd.read_csv(inc_path)

# Step 6: Display DataFrame previews
print("\nPreview of raw_data.csv:")
display(raw_df.head(10))

print("\nStructure of raw_data.csv:")
raw_df.info()  # info prints to stdout

print("\nPreview of incremental_data.csv:")
display(incremental_df.head(10))

print("\nStructure of incremental_data.csv:")
incremental_df.info()

# Step 7: Observations (based on the previews and .info() output)
# - Both files contain missing values (e.g., quantity, customer_name, region).
# - order_id could be a unique key, but raw_data.csv repeats order_id 4. Check for duplicates later.
# - order_date is read as object (string) and needs conversion to datetime.
# - incremental_data.csv holds new records (order_id 101 to 110).


# Step 8: Save copies into the '1. data' directory
# Note: this rewrites the two CSV files at the same paths they were read from.
os.makedirs(data_path, exist_ok=True)  # Ensure the '1. data' folder exists

raw_df.to_csv(raw_path, index=False)
incremental_df.to_csv(inc_path, index=False)


Current Working Directory:
c:\Users\Admin\Desktop\DSA2040A_ETL_Midterm_Vivian_386\DSA2040A_ETL_Midterm_Vivian_386

Preview of raw_data.csv:


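|   | order_id | customer_name | product | quantity | unit_price | order_date | region |
|---|----------|---------------|---------|----------|------------|------------|--------|
| 0 | 1 | Diana | Tablet | NaN | 500.0 | 2024-01-20 | South |
| 1 | 2 | Eve | Laptop | NaN | NaN | 2024-04-29 | North |
| 2 | 3 | Charlie | Laptop | 2.0 | 250.0 | 2024-01-08 | NaN |
| 3 | 4 | Eve | Laptop | 2.0 | 750.0 | 2024-01-07 | West |
| 4 | 5 | Eve | Tablet | 3.0 | NaN | 2024-03-07 | South |
| 5 | 4 | Eve | Laptop | 2.0 | 750.0 | 2024-01-07 | West |
| 6 | 7 | Charlie | Monitor | 2.0 | 750.0 | 2024-02-02 | West |
| 7 | 8 | Charlie | Laptop | 3.0 | NaN | 2024-02-17 | NaN |
| 8 | 9 | Charlie | Monitor | NaN | 750.0 | 2024-03-16 | West |
| 9 | 10 | Eve | Monitor | 1.0 | 500.0 | 2024-02-28 | North |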



Structure of raw_data.csv:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       100 non-null    int64  
 1   customer_name  99 non-null     object 
 2   product        100 non-null    object 
 3   quantity       74 non-null     float64
 4   unit_price     65 non-null     float64
 5   order_date     99 non-null     object 
 6   region         75 non-null     object 
dtypes: float64(2), int64(1), object(4)
memory usage: 5.6+ KB

Preview of incremental_data.csv:


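|   | order_id | customer_name | product | quantity | unit_price | order_date | region |
|---|----------|---------------|---------|----------|------------|------------|--------|
| 0 | 101 | Alice | Laptop | NaN | 900.0 | 2024-05-09 | Central |
| 1 | 102 | NaN | Laptop | 1.0 | 300.0 | 2024-05-07 | Central |
| 2 | 103 | NaN | Laptop | 1.0 | 600.0 | 2024-05-04 | Central |
| 3 | 104 | NaN | Tablet | NaN | 300.0 | 2024-05-26 | Central |
| 4 | 105 | Heidi | Tablet | 2.0 | 600.0 | 2024-05-21 | North |
| 5 | 106 | NaN | Laptop | 2.0 | 600.0 | 2024-05-18 | Central |
| 6 | 107 | NaN | Tablet | 1.0 | 600.0 | 2024-05-13 | Central |
| 7 | 108 | NaN | Laptop | NaN | 600.0 | 2024-05-11 | NaN |
| 8 | 109 | Grace | Laptop | 2.0 | 600.0 | 2024-05-29 | Central |
| 9 | 110 | Heidi | Phone | NaN | 900.0 | 2024-05-24 | NaN |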



Structure of incremental_data.csv:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       10 non-null     int64  
 1   customer_name  4 non-null      object 
 2   product        10 non-null     object 
 3   quantity       6 non-null      float64
 4   unit_price     10 non-null     float64
 5   order_date     10 non-null     object 
 6   region         8 non-null      object 
dtypes: float64(2), int64(1), object(4)
memory usage: 692.0+ bytes
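
Both structure listings report order_date with dtype object, meaning the dates are still plain strings. As a hedged preview of the conversion flagged for the transformation phase, the sketch below parses such a column with pd.to_datetime. The small DataFrame named dates and its invalid entries are assumptions made up for illustration, and the ISO format string is inferred from the previews rather than confirmed for every row.

```python
import pandas as pd

# Hypothetical sample: two valid dates from the preview, one gap, one bad value.
dates = pd.DataFrame(
    {"order_date": ["2024-05-09", "2024-05-07", None, "not a date"]}
)

# errors="coerce" turns unparseable or missing entries into NaT instead of raising.
dates["order_date"] = pd.to_datetime(
    dates["order_date"], format="%Y-%m-%d", errors="coerce"
)

print(dates.dtypes)
print("Unparsed or missing dates:", int(dates["order_date"].isna().sum()))
```

Coercing rather than raising keeps the row count intact, so the rows with NaT can be reviewed or dropped as an explicit step later.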
