# H&M Data Preprocessing Visualisations

This notebook visualises the H&M data preprocessing and cleaning pipeline. It shows the impact of duplicate removal, missing value imputation, and outlier handling through before/after comparisons and quality improvement metrics.


## Import Libraries and Setup

Import necessary libraries for data processing and visualisation.


In [13]:
import polars as pl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import os
import time
from typing import Tuple, List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Configure Polars
try:
    pl.Config.set_streaming_chunk_size(10000)
    # Table display width is set in characters
    pl.Config.set_tbl_width_chars(120)
except AttributeError:
    # Some Polars releases may not expose these settings
    pass

print("Libraries imported successfully")
print(f"Polars version: {pl.__version__}")
print(f"Matplotlib version: {plt.matplotlib.__version__}")
print(f"Seaborn version: {sns.__version__}")

Libraries imported successfully
Polars version: 0.20.31
Matplotlib version: 3.8.3
Seaborn version: 0.13.2


## Load Original and Cleaned Datasets

Load both the original integrated dataset and the cleaned dataset to perform before/after comparisons.


In [14]:
# Load datasets for comparison
data_dir = '../data'
original_path = os.path.join(data_dir, 'processed', 'hm_integrated_dataset.parquet')
cleaned_path = os.path.join(data_dir, 'processed', 'hm_customer_data_cleaned.parquet')

print("Loading datasets for preprocessing visualisation...")

# Load original dataset
if os.path.exists(original_path):
    df_original = pl.read_parquet(original_path)
    print(f"✓ Original dataset loaded: {df_original.height:,} records")
else:
    # The integrated dataset is produced by the exploration notebook; there is no fallback
    raise FileNotFoundError(
        f"Original integrated dataset not found. Please run the exploration notebook first: {original_path}"
    )

# Load cleaned dataset
if os.path.exists(cleaned_path):
    df_cleaned = pl.read_parquet(cleaned_path)
    print(f"✓ Cleaned dataset loaded: {df_cleaned.height:,} records")
else:
    # Without the cleaned dataset, the before/after comparison cells are skipped
    print("Cleaned dataset not found. Please run the preprocessing notebook first.")
    df_cleaned = None

# Calculate memory usage
original_memory = df_original.estimated_size('mb')
cleaned_memory = df_cleaned.estimated_size('mb') if df_cleaned is not None else 0

print(f"\nDataset Comparison:")
print(f"• Original memory usage: {original_memory:.1f} MB")
if df_cleaned is not None:
    print(f"• Cleaned memory usage: {cleaned_memory:.1f} MB")
    print(f"• Memory change: {cleaned_memory - original_memory:+.1f} MB")

Loading datasets for preprocessing visualisation...
✓ Original dataset loaded: 3,178,832 records
✓ Cleaned dataset loaded: 3,141,398 records

Dataset Comparison:
• Original memory usage: 1634.5 MB
• Cleaned memory usage: 1615.8 MB
• Memory change: -18.7 MB
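The two loading branches above follow the same pattern: read the file if it exists, otherwise either fail or continue without it. A minimal sketch of that pattern as a reusable helper is shown below; `load_if_exists` is a hypothetical name, not part of Polars or of this notebook.

```python
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def load_if_exists(
    path: str, reader: Callable[[str], T], required: bool = False
) -> Optional[T]:
    """Return reader(path) if the file exists.

    If the file is missing, raise FileNotFoundError when `required` is True,
    otherwise return None so that later cells can skip the comparison.
    """
    if Path(path).exists():
        return reader(path)
    if required:
        raise FileNotFoundError(f"Required dataset not found: {path}")
    return None
```

In this notebook the calls would look like `load_if_exists(original_path, pl.read_parquet, required=True)` and `load_if_exists(cleaned_path, pl.read_parquet)`.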


## Final Quality Validation

Validate the cleaned dataset's completeness and uniqueness, and score its readiness for analysis.


In [21]:
# Final quality validation
print("Performing final quality validation...")

if df_cleaned is not None:
    # Comprehensive quality checks
    validation_results = {}

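    # Completeness check: percentage of non-null cells across the whole frame
    total_cells = df_cleaned.height * len(df_cleaned.columns)
    # null_count() returns a single row holding the null count of each column
    missing_cells = sum(df_cleaned.null_count().row(0))
    completeness = ((total_cells - missing_cells) / total_cells) * 100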
    validation_results['completeness'] = {
        'score': completeness,
        'status': 'PASS' if completeness >= 95 else 'REVIEW' if completeness >= 90 else 'FAIL'
    }

    # Uniqueness check
    uniqueness = (df_cleaned.unique().height / df_cleaned.height) * 100
    validation_results['uniqueness'] = {
        'score': uniqueness,
        'status': 'PASS' if uniqueness >= 98 else 'REVIEW' if uniqueness >= 95 else 'FAIL'
    }

    # Overall quality score
    overall_score = np.mean([v['score'] for v in validation_results.values()])
    overall_status = 'PASS' if overall_score >= 95 else 'REVIEW' if overall_score >= 90 else 'FAIL'

    print(f"✓ Final quality validation completed")
    print(f"\n{'='*60}")
    print(f"FINAL QUALITY VALIDATION RESULTS")
    print(f"{'='*60}")
    print(f"\nOverall Quality Score: {overall_score:.1f}% - {overall_status}")
    print(f"\nDetailed Results:")
    for dimension, result in validation_results.items():
        print(f"  • {dimension.title()}: {result['score']:.1f}% - {result['status']}")

    print(f"\nDataset Readiness:")
    if overall_status == 'PASS':
        print(f"  ✅ Dataset is ready for further analysis and modelling")
    elif overall_status == 'REVIEW':
        print(f"  ⚠️  Dataset quality is acceptable but may need attention in some areas")
    else:
        print(f"  ❌ Dataset requires additional cleaning before use")

else:
    print("Cannot perform validation without cleaned dataset")

Performing final quality validation...
✓ Final quality validation completed

FINAL QUALITY VALIDATION RESULTS

Overall Quality Score: 100.0% - PASS

Detailed Results:
  • Completeness: 100.0% - PASS
  • Uniqueness: 100.0% - PASS

Dataset Readiness:
  ✅ Dataset is ready for further analysis and modelling
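The chained conditional expressions in the validation cell apply different thresholds per dimension (95/90 for completeness, 98/95 for uniqueness). As a sketch, the same logic can be written as a small helper; `grade`, `grade_all`, and `THRESHOLDS` are hypothetical names, and the example scores are made up to show how one score can grade differently under the two threshold pairs.

```python
from typing import Dict, Tuple


def grade(score: float, pass_at: float, review_at: float) -> str:
    """Map a percentage score to PASS, REVIEW, or FAIL."""
    if score >= pass_at:
        return "PASS"
    if score >= review_at:
        return "REVIEW"
    return "FAIL"


# (pass_at, review_at) pairs copied from the validation cell above.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "completeness": (95, 90),
    "uniqueness": (98, 95),
}


def grade_all(scores: Dict[str, float]) -> Dict[str, str]:
    """Grade each dimension against its own thresholds."""
    return {dim: grade(score, *THRESHOLDS[dim]) for dim, score in scores.items()}


# Hypothetical scores: the same value passes one check and not the other.
example = grade_all({"completeness": 96.2, "uniqueness": 96.2})
print(example)
```

Note that the notebook prints scores with one decimal place, so a reported 100.0% can still hide a small number of nulls or duplicates in a dataset of this size.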


## Summary and Next Steps

Compile a comprehensive summary of the preprocessing improvements and the dataset's readiness for analysis.


In [22]:
# Comprehensive summary and recommendations
print("=" * 70)
print("H&M DATA PREPROCESSING - COMPREHENSIVE SUMMARY")
print("=" * 70)

if df_cleaned is not None:
    print(f"\n📊 PREPROCESSING SUMMARY:")
    print(f"  • Records processed: {original_total:,} → {cleaned_total:,} ({cleaned_total - original_total:+,})")
    print(f"  • Duplicates removed: {original_metrics['duplicates']:,}")
    print(f"  • Missing value columns resolved: {len(original_missing_df)} → {len(cleaned_missing_df)}")
    print(f"  • Memory usage: {original_memory:.1f}MB → {cleaned_memory:.1f}MB ({cleaned_memory - original_memory:+.1f}MB)")

    print(f"\n✅ PREPROCESSING PIPELINE COMPLETED SUCCESSFULLY")
    print(f"✅ DATASET READY FOR FURTHER ANALYTICS AND MODELLING")

else:
    print("\n⚠️  Please run the preprocessing notebook to generate cleaned dataset")

print("=" * 70)

H&M DATA PREPROCESSING - COMPREHENSIVE SUMMARY

📊 PREPROCESSING SUMMARY:
  • Records processed: 3,178,832 → 3,141,398 (-37,434)
  • Duplicates removed: 37,434
  • Missing value columns resolved: 6 → 0
  • Memory usage: 1634.5MB → 1615.8MB (-18.7MB)

✅ PREPROCESSING PIPELINE COMPLETED SUCCESSFULLY
✅ DATASET READY FOR FURTHER ANALYTICS AND MODELLING
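The absolute figures in the summary are easier to interpret as relative changes. The sketch below recomputes them from the values printed above; `pct_change` is a hypothetical helper. It also checks that the drop in record count equals the number of duplicates removed, which, if the reported figures are accurate, suggests that deduplication alone accounts for the removed rows.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100


# Figures copied from the summary output above.
records_before, records_after = 3_178_832, 3_141_398
duplicates_removed = 37_434
memory_before_mb, memory_after_mb = 1634.5, 1615.8

# Consistency check: rows dropped versus duplicates reported.
rows_dropped = records_before - records_after
print(f"Rows dropped match duplicates removed: {rows_dropped == duplicates_removed}")

print(f"Record change: {pct_change(records_before, records_after):+.2f}%")
print(f"Memory change: {pct_change(memory_before_mb, memory_after_mb):+.2f}%")
```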
