# Part A: Data Cleaning (Extraction & Transformation)

This notebook implements the data cleaning pipeline for the Retail Store Sales dataset.
We handle:
1. Standardization of formats and naming
2. Missing value resolution
3. Business logic validation (Total Spent = Quantity × Price)

In [1]:
import pandas as pd
import numpy as np
import os

DATA_DIR = os.path.join('..', 'data')
RAW_FILE = os.path.join(DATA_DIR, 'retail_store_sales.csv')
CLEAN_FILE = os.path.join(DATA_DIR, 'cleaned_sales.csv')

## 1. Load & Explore Raw Data

In [2]:
df_raw = pd.read_csv(RAW_FILE)
print(f'Shape: {df_raw.shape}')
print(f'\nColumns: {list(df_raw.columns)}')
df_raw.head(10)

Shape: (12575, 11)

Columns: ['Transaction ID', 'Customer ID', 'Category', 'Item', 'Price Per Unit', 'Quantity', 'Total Spent', 'Payment Method', 'Location', 'Transaction Date', 'Discount Applied']


Unnamed: 0,Transaction ID,Customer ID,Category,Item,Price Per Unit,Quantity,Total Spent,Payment Method,Location,Transaction Date,Discount Applied
0,TXN_6867343,CUST_09,Patisserie,Item_10_PAT,18.5,10.0,185.0,Digital Wallet,Online,2024-04-08,True
1,TXN_3731986,CUST_22,Milk Products,Item_17_MILK,29.0,9.0,261.0,Digital Wallet,Online,2023-07-23,True
2,TXN_9303719,CUST_02,Butchers,Item_12_BUT,21.5,2.0,43.0,Credit Card,Online,2022-10-05,False
3,TXN_9458126,CUST_06,Beverages,Item_16_BEV,27.5,9.0,247.5,Credit Card,Online,2022-05-07,
4,TXN_4575373,CUST_05,Food,Item_6_FOOD,12.5,7.0,87.5,Digital Wallet,Online,2022-10-02,False
5,TXN_7482416,CUST_09,Patisserie,,,10.0,200.0,Credit Card,Online,2023-11-30,
6,TXN_3652209,CUST_07,Food,Item_1_FOOD,5.0,8.0,40.0,Credit Card,In-store,2023-06-10,True
7,TXN_1372952,CUST_21,Furniture,,33.5,,,Digital Wallet,In-store,2024-04-02,True
8,TXN_9728486,CUST_23,Furniture,Item_16_FUR,27.5,1.0,27.5,Credit Card,In-store,2023-04-26,False
9,TXN_2722661,CUST_25,Butchers,Item_22_BUT,36.5,3.0,109.5,Cash,Online,2024-03-14,False


In [3]:
# Inspect data types and null counts
print('Data types:')
print(df_raw.dtypes)
print(f'\nNull counts per column:')
print(df_raw.isnull().sum())
print(f'\nNull percentages:')
print((df_raw.isnull().sum() / len(df_raw) * 100).round(2))

Data types:
Transaction ID       object
Customer ID          object
Category             object
Item                 object
Price Per Unit      float64
Quantity            float64
Total Spent         float64
Payment Method       object
Location             object
Transaction Date     object
Discount Applied     object
dtype: object

Null counts per column:
Transaction ID         0
Customer ID            0
Category               0
Item                1213
Price Per Unit       609
Quantity             604
Total Spent          604
Payment Method         0
Location               0
Transaction Date       0
Discount Applied    4199
dtype: int64

Null percentages:
Transaction ID       0.00
Customer ID          0.00
Category             0.00
Item                 9.65
Price Per Unit       4.84
Quantity             4.80
Total Spent          4.80
Payment Method       0.00
Location             0.00
Transaction Date     0.00
Discount Applied    33.39
dtype: float64


In [4]:
# Basic statistics for numeric columns
df_raw.describe()

Unnamed: 0,Price Per Unit,Quantity,Total Spent
count,11966.0,11971.0,11971.0
mean,23.365912,5.53638,129.652577
std,10.743519,2.857883,94.750697
min,5.0,1.0,5.0
25%,14.0,3.0,51.0
50%,23.0,6.0,108.5
75%,33.5,8.0,192.0
max,41.0,10.0,410.0


## 2. Standardization

We standardize:
- **Date formats**: Parse all dates into a single `datetime64` column, so every value renders uniformly as `YYYY-MM-DD`
- **Product/item names**: Strip whitespace, apply consistent title casing
- **Categorical columns**: Normalize Payment Method, Location, Category, Discount Applied

In [5]:
df = df_raw.copy()

# --- Date Standardization ---
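# Parse dates flexibly to handle any mixed formats; the column stays datetime64
# and its values render as YYYY-MM-DD. errors='coerce' turns unparseable values
# into NaT, so the failure count below is meaningful instead of the call raising.
df['Transaction Date'] = pd.to_datetime(
    df['Transaction Date'], format='mixed', dayfirst=False, errors='coerce'
)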

print('Date range:', df['Transaction Date'].min(), 'to', df['Transaction Date'].max())
print('Failed date parses:', df['Transaction Date'].isna().sum())

Date range: 2022-01-01 00:00:00 to 2025-01-18 00:00:00
Failed date parses: 0
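
The failure count printed above is only informative if unparseable values become `NaT` instead of raising an error. A minimal sketch on made-up values (the sample strings and the explicit `format` are assumptions for illustration, not taken from the dataset) shows `errors='coerce'` producing a countable `NaT`, and `dt.strftime` producing `YYYY-MM-DD` strings when text output is needed:

```python
import pandas as pd

# Toy input: two valid ISO dates and one value that cannot be parsed
raw = pd.Series(['2024-04-08', 'not a date', '2023-07-23'])

# errors='coerce' converts unparseable entries to NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
failed = int(parsed.isna().sum())

# Render as YYYY-MM-DD strings; NaT entries come out as missing values
as_text = parsed.dt.strftime('%Y-%m-%d')

print('Failed date parses:', failed)
print(as_text.tolist())
```

Keeping the column as `datetime64` is usually more convenient for later date arithmetic; the string form is mainly useful for display or export.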


In [6]:
# --- Item Name Standardization ---
# Strip whitespace and apply title case for consistency (e.g., "Item_10_PAT" -> "Item_10_Pat")
df['Item'] = df['Item'].str.strip().str.title()

# --- Category Standardization ---
df['Category'] = df['Category'].str.strip().str.title()

# --- Payment Method Standardization ---
df['Payment Method'] = df['Payment Method'].str.strip().str.title()

# --- Location Standardization ---
df['Location'] = df['Location'].str.strip().str.title()

print('Unique Categories:', sorted(df['Category'].unique()))
print('Unique Payment Methods:', sorted(df['Payment Method'].unique()))
print('Unique Locations:', sorted(df['Location'].unique()))

Unique Categories: ['Beverages', 'Butchers', 'Computers And Electric Accessories', 'Electric Household Essentials', 'Food', 'Furniture', 'Milk Products', 'Patisserie']
Unique Payment Methods: ['Cash', 'Credit Card', 'Digital Wallet']
Unique Locations: ['In-Store', 'Online']
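
Title casing also changes identifier-like values; for example `In-store` becomes `In-Store`, as the output above shows. A small check, sketched here on sample values in the style of the dataset (the helper name `collapsed_values` is hypothetical), reports which raw spellings collapse into the same standardized value. This helps confirm that standardization merges only true duplicates:

```python
import pandas as pd

def collapsed_values(series: pd.Series) -> dict:
    """Map each standardized value to the distinct raw spellings behind it,
    keeping only values that had more than one raw spelling."""
    raw = series.dropna()
    standardized = raw.str.strip().str.title()
    groups = raw.groupby(standardized).unique()
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}

# Toy sample: two spellings of the same location plus untouched values
locations = pd.Series(['In-store', 'in-store ', 'Online', 'Online', None])
print(collapsed_values(locations))
```

An empty dictionary would mean that stripping and title casing did not merge any distinct raw values.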


In [7]:
# --- Discount Applied Standardization ---
# Convert to boolean; NaN values will be handled in the missing values step
print('Discount Applied values before:', df['Discount Applied'].value_counts(dropna=False).to_dict())

# Map string 'True'/'False' and boolean to proper booleans
df['Discount Applied'] = df['Discount Applied'].map(
    {True: True, False: False, 'True': True, 'False': False}
)
print('Discount Applied values after:', df['Discount Applied'].value_counts(dropna=False).to_dict())

Discount Applied values before: {True: 4219, nan: 4199, False: 4157}
Discount Applied values after: {True: 4219, nan: 4199, False: 4157}
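
`Series.map` with a dictionary turns every value that is not a key into `NaN`, so an unexpected spelling would silently join the missing values. A minimal sketch, assuming a hypothetical stray value `'yes'` that does not appear in this dataset, compares nulls before and after the mapping to surface that case:

```python
import pandas as pd

mapping = {True: True, False: False, 'True': True, 'False': False}

# Toy column with one genuine null and one unexpected value ('yes')
raw = pd.Series([True, 'False', None, 'yes'], dtype=object)
mapped = raw.map(mapping)

# Values present before mapping but null afterwards were not keys of the mapping
unmapped = raw[raw.notna() & mapped.isna()]
print('Unmapped values:', unmapped.unique().tolist())
```

In the notebook's output the counts before and after the mapping are identical, which indicates that no values were lost this way.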


## 3. Missing Value Handling

The course teaches 5 standard methods for handling missing data:
1. **Ignore the tuple** (drop the row)
2. **Fill in the missing value manually**
3. **Use a measure of central tendency** (mean for numeric, mode for categorical)
4. **Use the attribute mean or median** (for numeric)
5. **Use the most probable value**
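
As a rough illustration of how these methods look in pandas, the sketch below applies them to a made-up numeric series and a made-up categorical series. The data, the constant `0.0`, and the use of the mode as the "most probable value" are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

price = pd.Series([10.0, np.nan, 20.0, 60.0])
payment = pd.Series(['Cash', 'Cash', None, 'Credit Card'])

dropped = price.dropna()                         # Method 1: ignore the tuple
manual = price.fillna(0.0)                       # Method 2: fill in a hand-chosen value
mean_filled = price.fillna(price.mean())         # Methods 3/4: mean of the observed values
median_filled = price.fillna(price.median())     # Method 4: median, less sensitive to outliers
mode_filled = payment.fillna(payment.mode()[0])  # Methods 3/5: most frequent category

print(len(dropped), mean_filled[1], median_filled[1], mode_filled[2])
```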

We apply different methods to different columns based on the nature of the data:

| Column | Null Count | Method Applied | Rationale |
|---|---|---|---|
| **Item** | 1,213 (9.6%) | Method 1: Ignore the tuple (drop row) | Item name is essential for market basket analysis and cannot be inferred from other columns |
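| **Price Per Unit** | 609 (4.8%) | Method 2: Fill using business logic (`Total Spent / Quantity`) | Computable deterministically when Total Spent and Quantity are both present. In this dataset every null price sits on a row with a null Item, so those rows are already dropped by the Item rule and this fill acts as a safeguard |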
| **Quantity / Total Spent** | 604 (4.8%) | Method 1: Ignore the tuple (drop row) | Both are null simultaneously; cannot infer either without the other |
| **Discount Applied** | 4,199 (33.4%) | Method 5: Most probable value (mode) | Boolean attribute; filling with the mode keeps the column and all rows, at the cost of shifting the distribution toward the mode |

In [8]:
print(f'Rows before cleaning: {len(df)}')
print(f'\nNull counts before:')
print(df.isnull().sum())

# === Step 1: Item nulls ===
# Method: IGNORE THE TUPLE (Course Method 1)
# Rationale: Item name is essential for product-level analysis and market basket mining.
# It is a categorical identifier that cannot be inferred from other columns.
# All 1,213 Item-null rows also have missing numeric fields, reinforcing this decision.
item_null_count = df['Item'].isna().sum()
df = df.dropna(subset=['Item'])
print(f'\nStep 1 [Method 1: Ignore the tuple]: Dropped {item_null_count} rows with null Item name')
print(f'Rows after dropping null Items: {len(df)}')

Rows before cleaning: 12575

Null counts before:
Transaction ID         0
Customer ID            0
Category               0
Item                1213
Price Per Unit       609
Quantity             604
Total Spent          604
Payment Method         0
Location               0
Transaction Date       0
Discount Applied    4199
dtype: int64

Step 1 [Method 1: Ignore the tuple]: Dropped 1213 rows with null Item name
Rows after dropping null Items: 11362


In [9]:
# === Step 2: Price Per Unit nulls ===
# Method: FILL MANUALLY USING BUSINESS LOGIC (Course Method 2)
# Rationale: Price Per Unit can be deterministically computed as Total Spent / Quantity
# when both of those columns are present. This is the most accurate fill possible.
price_null_mask = df['Price Per Unit'].isna() & df['Total Spent'].notna() & df['Quantity'].notna()
inferred_count = price_null_mask.sum()
df.loc[price_null_mask, 'Price Per Unit'] = df.loc[price_null_mask, 'Total Spent'] / df.loc[price_null_mask, 'Quantity']
print(f'Step 2 [Method 2: Fill using business logic]: Inferred Price Per Unit for {inferred_count} rows')

# === Step 3: Quantity / Total Spent nulls ===
# Method: IGNORE THE TUPLE (Course Method 1)
# Rationale: In these rows, both Quantity AND Total Spent are null simultaneously.
# We cannot infer one from the other, and using mean/median imputation for transactional
# measures would introduce artificial data that distorts the warehouse facts.
remaining_nulls = df[['Price Per Unit', 'Quantity', 'Total Spent']].isna().any(axis=1).sum()
df = df.dropna(subset=['Price Per Unit', 'Quantity', 'Total Spent'])
print(f'Step 3 [Method 1: Ignore the tuple]: Dropped {remaining_nulls} rows with unresolvable null numerics')
print(f'Rows after numeric null resolution: {len(df)}')

Step 2 [Method 2: Fill using business logic]: Inferred Price Per Unit for 0 rows
Step 3 [Method 1: Ignore the tuple]: Dropped 0 rows with unresolvable null numerics
Rows after numeric null resolution: 11362
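
Both steps report 0 rows because every row with a null numeric field also had a null `Item` and was already removed in Step 1. One way to see such overlap before choosing a drop order is to cross-tabulate the null indicators. This sketch uses a toy frame that mimics the pattern of rows 5 and 7 in the raw preview, not the real data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Item': ['Item_1_FOOD', None, None, 'Item_16_FUR'],
    'Price Per Unit': [5.0, np.nan, 33.5, 27.5],
    'Quantity': [8.0, 10.0, np.nan, 1.0],
    'Total Spent': [40.0, 200.0, np.nan, 27.5],
})

# A row has a numeric null if any of the three numeric fields is missing
numeric_null = toy[['Price Per Unit', 'Quantity', 'Total Spent']].isna().any(axis=1)
item_null = toy['Item'].isna()

# Rows: is Item null? Columns: is any numeric field null?
overlap = pd.crosstab(item_null, numeric_null,
                      rownames=['Item null'], colnames=['Numeric null'])
print(overlap)

# Numeric nulls that would survive dropping the Item-null rows
print('Numeric nulls left after Item drop:', int((numeric_null & ~item_null).sum()))
```

If the off-diagonal cells of the table are zero, the two kinds of missingness coincide and the order of the drop steps does not change the final row count.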


In [10]:
# === Step 4: Discount Applied nulls ===
# Method: MOST PROBABLE VALUE / MODE (Course Method 5)
# Rationale: Discount Applied is a boolean attribute (True/False). With 33% nulls,
# dropping the rows or the column would lose potentially useful information. Instead,
# we fill nulls with the mode (most frequent value). Note that this shifts the
# distribution toward the mode rather than preserving it.
discount_mode = df['Discount Applied'].mode()[0]
discount_null_count = df['Discount Applied'].isna().sum()
# Convert to the nullable boolean dtype before filling, then to plain bool, so the
# result does not rely on fillna's implicit object-to-bool downcasting
df['Discount Applied'] = (
    df['Discount Applied'].astype('boolean').fillna(discount_mode).astype(bool)
)
print(f'Step 4 [Method 5: Most probable value (mode)]: Filled {discount_null_count} '
      f'Discount Applied nulls with mode = {discount_mode}')
print(f'Discount Applied distribution after fill:')
print(df['Discount Applied'].value_counts())

print(f'\nFinal null counts:')
print(df.isnull().sum())
print(f'\nFinal shape: {df.shape}')

Step 4 [Method 5: Most probable value (mode)]: Filled 3783 Discount Applied nulls with mode = True
Discount Applied distribution after fill:
Discount Applied
True     7584
False    3778
Name: count, dtype: int64

Final null counts:
Transaction ID      0
Customer ID         0
Category            0
Item                0
Price Per Unit      0
Quantity            0
Total Spent         0
Payment Method      0
Location            0
Transaction Date    0
Discount Applied    0
dtype: int64

Final shape: (11362, 11)
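
The counts printed above make the effect of mode filling measurable. Since 3,783 of the 7,584 `True` values were filled in, the observed values were almost evenly split before the fill. A short calculation using only the numbers from this output:

```python
# Counts taken from the Step 4 output
filled = 3783          # nulls replaced with the mode (True)
true_after = 7584
false_after = 3778

true_before = true_after - filled   # observed True values before the fill
share_before = true_before / (true_before + false_after)
share_after = true_after / (true_after + false_after)

print(f'True share among observed values: {share_before:.1%}')
print(f'True share after mode fill:       {share_after:.1%}')
```

The share of `True` rises from roughly 50% to roughly 67%, so any downstream analysis of discount effects should treat the filled values with caution.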



## 4. Business Logic Validation

Verify that `Total Spent == Quantity × Price Per Unit` for every row, within a tolerance of 0.01 to absorb floating-point rounding.
Flag any mismatches and fix them by recalculating Total Spent from Quantity × Price Per Unit.

In [11]:
# Compute expected total and compare with actual
df['Expected Total'] = df['Quantity'] * df['Price Per Unit']
df['Total Mismatch'] = abs(df['Expected Total'] - df['Total Spent']) > 0.01

mismatch_count = df['Total Mismatch'].sum()
print(f'Rows failing business logic check (Total != Qty × Price): {mismatch_count}')

if mismatch_count > 0:
    print('\nSample mismatches:')
    print(df[df['Total Mismatch']][['Item', 'Price Per Unit', 'Quantity', 'Total Spent', 'Expected Total']].head(10))
    
    # Fix: Recalculate Total Spent from Quantity × Price (more reliable source)
    df.loc[df['Total Mismatch'], 'Total Spent'] = df.loc[df['Total Mismatch'], 'Expected Total']
    print(f'\nFixed {mismatch_count} rows by recalculating Total Spent = Quantity × Price')
else:
    print('All rows pass the business logic check.')

# Clean up temporary columns
df = df.drop(columns=['Expected Total', 'Total Mismatch'])

Rows failing business logic check (Total != Qty × Price): 0
All rows pass the business logic check.
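
A fixed absolute tolerance such as `0.01` can also be expressed with `numpy.isclose`, which makes the tolerance explicit. The sketch below runs the same kind of check on a toy frame with one deliberately wrong total; the values are made up for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Quantity': [2, 3, 1],
    'Price Per Unit': [21.5, 36.5, 27.5],
    'Total Spent': [43.0, 110.0, 27.5],   # second row should be 109.5
})

expected = toy['Quantity'] * toy['Price Per Unit']
# rtol=0 disables the relative term so only the absolute tolerance applies
mismatch = ~np.isclose(toy['Total Spent'], expected, rtol=0, atol=0.01)

# Repair mismatched rows from Quantity × Price Per Unit
toy.loc[mismatch, 'Total Spent'] = expected[mismatch]
print('Mismatched rows:', int(mismatch.sum()))
```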


## 5. Final Validation & Export

In [12]:
# Ensure correct data types
df['Quantity'] = df['Quantity'].astype(int)
df['Price Per Unit'] = df['Price Per Unit'].astype(float)
df['Total Spent'] = df['Total Spent'].astype(float)

# Final summary of the cleaned data
print('=== Cleaned Data Summary ===')
print(f'Shape: {df.shape}')
print(f'\nData Types:')
print(df.dtypes)
print(f'\nNull Counts (should be all zeros):')
print(df.isnull().sum())
print(f'\nSample Rows:')
df.head(10)

=== Cleaned Data Summary ===
Shape: (11362, 11)

Data Types:
Transaction ID              object
Customer ID                 object
Category                    object
Item                        object
Price Per Unit             float64
Quantity                     int64
Total Spent                float64
Payment Method              object
Location                    object
Transaction Date    datetime64[ns]
Discount Applied              bool
dtype: object

Null Counts (should be all zeros):
Transaction ID      0
Customer ID         0
Category            0
Item                0
Price Per Unit      0
Quantity            0
Total Spent         0
Payment Method      0
Location            0
Transaction Date    0
Discount Applied    0
dtype: int64

Sample Rows:


Unnamed: 0,Transaction ID,Customer ID,Category,Item,Price Per Unit,Quantity,Total Spent,Payment Method,Location,Transaction Date,Discount Applied
0,TXN_6867343,CUST_09,Patisserie,Item_10_Pat,18.5,10,185.0,Digital Wallet,Online,2024-04-08,True
1,TXN_3731986,CUST_22,Milk Products,Item_17_Milk,29.0,9,261.0,Digital Wallet,Online,2023-07-23,True
2,TXN_9303719,CUST_02,Butchers,Item_12_But,21.5,2,43.0,Credit Card,Online,2022-10-05,False
3,TXN_9458126,CUST_06,Beverages,Item_16_Bev,27.5,9,247.5,Credit Card,Online,2022-05-07,True
4,TXN_4575373,CUST_05,Food,Item_6_Food,12.5,7,87.5,Digital Wallet,Online,2022-10-02,False
6,TXN_3652209,CUST_07,Food,Item_1_Food,5.0,8,40.0,Credit Card,In-Store,2023-06-10,True
8,TXN_9728486,CUST_23,Furniture,Item_16_Fur,27.5,1,27.5,Credit Card,In-Store,2023-04-26,False
9,TXN_2722661,CUST_25,Butchers,Item_22_But,36.5,3,109.5,Cash,Online,2024-03-14,False
10,TXN_8776416,CUST_22,Butchers,Item_3_But,8.0,9,72.0,Cash,In-Store,2024-12-14,True
12,TXN_5874772,CUST_23,Food,Item_2_Food,6.5,7,45.5,Cash,Online,2023-09-09,True


In [13]:
# Export cleaned data to CSV
df.to_csv(CLEAN_FILE, index=False)
print(f'Cleaned data saved to {CLEAN_FILE}')
print(f'Total rows: {len(df)}')

Cleaned data saved to ../data/cleaned_sales.csv
Total rows: 11362
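
CSV does not store dtypes, so the datetime column comes back as text unless it is parsed again on load. A minimal round-trip sketch, using a toy frame and a temporary file rather than `CLEAN_FILE` (the file name `cleaned_sample.csv` is hypothetical):

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'Transaction ID': ['TXN_6867343', 'TXN_3731986'],
    'Quantity': [10, 9],
    'Transaction Date': pd.to_datetime(['2024-04-08', '2023-07-23']),
    'Discount Applied': [True, False],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cleaned_sample.csv')
    toy.to_csv(path, index=False)
    # parse_dates restores the datetime dtype; bool and int columns are inferred
    reloaded = pd.read_csv(path, parse_dates=['Transaction Date'])

print(reloaded.dtypes)
```

Any later notebook that loads the cleaned file would likely want the same `parse_dates` argument.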
