# Data Validation – Cleaned Transactions

## Objective
This notebook validates the cleaned transactional dataset generated in Phase 3.
It ensures that:
- Critical columns contain no invalid values
- Known data quality issues have been resolved
- Retention rate is within acceptable limits
- The dataset is safe for feature engineering and modeling

In [2]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
print("Libraries loaded successfully")


Libraries loaded successfully


In [7]:
df = pd.read_csv("../data/processed/cleaned_transactions.csv", parse_dates=["invoicedate"])

print(f"Dataset Shape: {df.shape}")
df.head()


Dataset Shape: (400916, 8)


| | invoice | stockcode | description | quantity | invoicedate | price | customerid | country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |


## Schema Validation

We verify that all required columns exist in the cleaned dataset.


In [10]:
required_columns = [
    "invoice",
    "stockcode",
    "description",
    "quantity",
    "invoicedate",
    "price",
    "customerid",
    "country"
]

missing_cols = [col for col in required_columns if col not in df.columns]

if not missing_cols:
    print("All required columns are present ✅")
else:
    print("Missing columns ❌:", missing_cols)


All required columns are present ✅
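
The check above confirms that the columns exist, but not that they hold the expected types. As a minimal sketch, a dtype check could sit alongside it. The `EXPECTED_KINDS` mapping and the `find_dtype_mismatches` helper are hypothetical names introduced here for illustration, and the expected kinds are assumptions about what Phase 3 should produce, not a documented schema.

```python
import pandas as pd

# Hypothetical mapping: column -> accepted NumPy dtype "kind" codes.
# These expectations are assumptions; adjust them to the real schema.
EXPECTED_KINDS = {
    "quantity": "iu",     # signed or unsigned integer
    "price": "f",         # floating point
    "invoicedate": "M",   # datetime64
}

def find_dtype_mismatches(frame: pd.DataFrame, expected: dict) -> dict:
    """Return {column: actual dtype} for present columns with an unexpected kind."""
    mismatches = {}
    for col, kinds in expected.items():
        if col in frame.columns and frame[col].dtype.kind not in kinds:
            mismatches[col] = str(frame[col].dtype)
    return mismatches

# Tiny demo frame; invoicedate is deliberately left as unparsed strings.
demo = pd.DataFrame({
    "quantity": [12, 48],
    "price": [6.95, 2.10],
    "invoicedate": ["2009-12-01 07:45:00", "2009-12-01 07:45:00"],
})

dtype_mismatches = find_dtype_mismatches(demo, EXPECTED_KINDS)
print("Columns with unexpected dtypes:", list(dtype_mismatches))
```

In the notebook, the same helper would be called on `df` after loading. A mismatch on `invoicedate` would suggest that `parse_dates` did not take effect.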


## Missing Value Validation

After cleaning, the critical columns used for modeling should contain no missing values.


In [12]:
# Missing values check: count nulls once, then derive the percentage
missing_count = df.isnull().sum()
missing_pct = (missing_count / len(df)) * 100

missing_df = pd.DataFrame({
    "missing_count": missing_count,
    "missing_percentage": missing_pct.round(2)
}).sort_values("missing_count", ascending=False, kind="stable")

missing_df


| | missing_count | missing_percentage |
|---|---|---|
| invoice | 0 | 0.0 |
| stockcode | 0 | 0.0 |
| description | 0 | 0.0 |
| quantity | 0 | 0.0 |
| invoicedate | 0 | 0.0 |
| price | 0 | 0.0 |
| customerid | 0 | 0.0 |
| country | 0 | 0.0 |
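
The table above is inspected by eye. A sketch of turning it into an explicit check on the critical columns follows. The notebook does not say which columns count as critical, so the `CRITICAL_COLUMNS` list is an assumption, and `missing_in_critical` is a hypothetical helper.

```python
import pandas as pd

# Assumption: the columns treated as critical for modeling.
CRITICAL_COLUMNS = ["invoice", "quantity", "invoicedate", "price", "customerid"]

def missing_in_critical(frame: pd.DataFrame, columns: list) -> dict:
    """Count nulls per critical column, keeping only columns with at least one null."""
    counts = frame[columns].isnull().sum()
    return counts[counts > 0].to_dict()

# Tiny demo frame with one missing date and one missing customer id.
demo = pd.DataFrame({
    "invoice": ["489434", "489435"],
    "quantity": [12, 24],
    "invoicedate": pd.to_datetime(["2009-12-01 07:45:00", None]),
    "price": [6.95, 1.25],
    "customerid": [13085.0, None],
})

critical_missing = missing_in_critical(demo, CRITICAL_COLUMNS)
print("Critical columns with missing values:", critical_missing)
```

On the cleaned dataset the returned dictionary should be empty, which makes it easy to wrap in an `assert` so the notebook fails loudly instead of relying on a visual scan.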


In [13]:
# Business rule checks
invalid_quantity = (df["quantity"] <= 0).sum()
invalid_price = (df["price"] <= 0).sum()

print(f"Invalid quantity rows (<=0): {invalid_quantity}")
print(f"Invalid price rows (<=0): {invalid_price}")


Invalid quantity rows (<=0): 0
Invalid price rows (<=0): 0


In [14]:
# Date validation
print("Invoice Date Range")
print("-" * 40)
print("Min date:", df["invoicedate"].min())
print("Max date:", df["invoicedate"].max())

# Check future dates
future_dates = (df["invoicedate"] > pd.Timestamp.today()).sum()
print(f"Future-dated invoices: {future_dates}")


Invoice Date Range
----------------------------------------
Min date: 2009-12-01 07:45:00
Max date: 2010-12-09 20:01:00
Future-dated invoices: 0
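
The future-date check only bounds the dates from above. If the expected collection window is known, both ends can be checked. The sketch below assumes a window of 2009-12-01 through 2010-12-31; those bounds are placeholders chosen to cover the observed range, not values taken from the Phase 3 specification, and `count_outside_window` is a hypothetical helper.

```python
import pandas as pd

# Assumed collection window; replace with the real project bounds.
WINDOW_START = pd.Timestamp("2009-12-01")
WINDOW_END = pd.Timestamp("2010-12-31 23:59:59")

def count_outside_window(dates: pd.Series, start: pd.Timestamp, end: pd.Timestamp) -> int:
    """Count dates outside [start, end]; NaT values are counted as outside."""
    return int((~dates.between(start, end)).sum())

# Demo series: two in-window dates and one after the window.
demo_dates = pd.Series(pd.to_datetime([
    "2009-12-01 07:45:00",
    "2010-12-09 20:01:00",
    "2011-01-05 10:00:00",
]))

outside = count_outside_window(demo_dates, WINDOW_START, WINDOW_END)
print(f"Invoices outside expected window: {outside}")
```

`Series.between` is inclusive on both ends by default, so invoices stamped exactly at either bound are treated as valid.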


In [15]:
# Duplicate check
duplicate_rows = df.duplicated().sum()
print(f"Duplicate rows after cleaning: {duplicate_rows}")


Duplicate rows after cleaning: 0
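
`df.duplicated()` flags only rows that match on every column. Rows that repeat a business key but differ elsewhere pass unnoticed. The sketch below contrasts the two checks, assuming for illustration that `invoice` plus `stockcode` forms a key. That is an assumption: the same product may legitimately appear more than once on an invoice, so key-level hits are candidates for review, not errors.

```python
import pandas as pd

# Hypothetical business key; whether it must be unique is an assumption.
KEY_COLUMNS = ["invoice", "stockcode"]

# Demo frame: rows 0 and 1 share the key but differ in quantity.
demo = pd.DataFrame({
    "invoice": ["489434", "489434", "489434"],
    "stockcode": ["85048", "85048", "22041"],
    "quantity": [12, 6, 48],
})

# Full-row duplicates, as in the cell above.
full_row_dupes = int(demo.duplicated().sum())

# Key-level duplicates; keep=False returns every row involved, not just the repeats.
key_dupes = demo[demo.duplicated(subset=KEY_COLUMNS, keep=False)]

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Rows sharing a key: {len(key_dupes)}")
```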


In [16]:
# Dataset summary: row count and cardinality of key columns
total_rows = len(df)
unique_customers = df["customerid"].nunique()
unique_products = df["stockcode"].nunique()
unique_countries = df["country"].nunique()

print("Dataset Summary")
print("-" * 40)
print(f"Total Rows: {total_rows}")
print(f"Unique Customers: {unique_customers}")
print(f"Unique Products: {unique_products}")
print(f"Unique Countries: {unique_countries}")


Dataset Summary
----------------------------------------
Total Rows: 400916
Unique Customers: 4312
Unique Products: 4017
Unique Countries: 37
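
The objective lists the retention rate as something to verify, but the row count alone does not show it. A minimal sketch of that check is given here. The raw row count is not available in this notebook, so the numbers below are placeholders, and both `retention_rate` and the `MIN_RETENTION` threshold are hypothetical.

```python
def retention_rate(cleaned_rows: int, raw_rows: int) -> float:
    """Fraction of raw rows that survived cleaning."""
    if raw_rows <= 0:
        raise ValueError("raw_rows must be positive")
    return cleaned_rows / raw_rows

# Placeholder counts for illustration only; in the notebook, cleaned_rows
# would be len(df) and raw_rows would come from the Phase 3 input file.
rate = retention_rate(cleaned_rows=800, raw_rows=1000)

# Assumed acceptance threshold; the real limit is a project decision.
MIN_RETENTION = 0.70
retention_ok = rate >= MIN_RETENTION

print(f"Retention rate: {rate:.1%}")
print(f"Within acceptable limits: {retention_ok}")
```

Storing the resulting rate in the validation report would let later notebooks confirm that cleaning did not discard more data than intended.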


In [17]:
import json
from pathlib import Path

validation_report = {
    "total_rows": int(total_rows),
    "duplicate_rows": int(duplicate_rows),
    "invalid_quantity_rows": int(invalid_quantity),
    "invalid_price_rows": int(invalid_price),
    "future_dated_rows": int(future_dates),
    "unique_customers": int(unique_customers),
    "unique_products": int(unique_products),
    "unique_countries": int(unique_countries),
    "missing_value_percentage": missing_pct.round(2).to_dict()
}

Path("../data/processed").mkdir(parents=True, exist_ok=True)

with open("../data/processed/validation_report.json", "w") as f:
    json.dump(validation_report, f, indent=4)

print("validation_report.json saved successfully ✅")


validation_report.json saved successfully ✅
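
Once the report is saved, a downstream notebook can use it as a gate. The sketch below shows one possible way to do that, using a dictionary shaped like `validation_report` with the values printed in this notebook. The `failed_checks` helper is a hypothetical name, and treating every counter as "must be zero" is an assumption about the acceptance criteria.

```python
def failed_checks(report: dict) -> list:
    """List the names of checks that did not pass; an empty list means all passed."""
    must_be_zero = [
        "duplicate_rows",
        "invalid_quantity_rows",
        "invalid_price_rows",
        "future_dated_rows",
    ]
    # A missing key counts as a failure, since the check was never recorded.
    failures = [key for key in must_be_zero if report.get(key) != 0]
    failures += [
        f"missing:{col}"
        for col, pct in report.get("missing_value_percentage", {}).items()
        if pct > 0
    ]
    return failures

# Same shape and values as the report written above.
report = {
    "total_rows": 400916,
    "duplicate_rows": 0,
    "invalid_quantity_rows": 0,
    "invalid_price_rows": 0,
    "future_dated_rows": 0,
    "unique_customers": 4312,
    "unique_products": 4017,
    "unique_countries": 37,
    "missing_value_percentage": {
        "invoice": 0.0, "stockcode": 0.0, "description": 0.0, "quantity": 0.0,
        "invoicedate": 0.0, "price": 0.0, "customerid": 0.0, "country": 0.0,
    },
}

failures = failed_checks(report)
print("Failed checks:", failures)
```

In practice the dictionary would be loaded with `json.load` from `validation_report.json`, and a non-empty result could raise an error before feature engineering starts.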


## Validation Summary

- Schema validation passed: all 8 required columns are present
- No missing values in any column
- No non-positive quantities or prices detected
- No duplicate rows after cleaning
- Invoice dates span 2009-12-01 to 2010-12-09, with no future-dated invoices
- Dataset (400,916 rows) is validated and ready for feature engineering and modeling

➡️ Next: Feature Engineering Notebook
