# H&M Exploratory Data Analysis (EDA) with Polars

This notebook provides an exploratory analysis of the H&M dataset structure and quality using Polars for high-performance data processing. The analysis includes data loading, integration, and quality assessment across transactions, customers, and articles datasets.

## Overview

- **Transaction Data**: Customer purchase history with temporal patterns
- **Customer Data**: Demographic and preference information
- **Articles Data**: Product metadata and categorisation


## 1. Import Libraries and Setup

Import Polars and other necessary libraries for data processing, analysis, and visualisation.


In [13]:
import polars as pl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time
from typing import Tuple, List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Configure Polars streaming and display options
try:
    pl.Config.set_streaming_chunk_size(10000)
    # Widen printed tables; the Config method for this is set_tbl_width_chars
    # (pl.Config has no set_table_width, so the earlier hasattr check never fired)
    pl.Config.set_tbl_width_chars(120)
except AttributeError:
    # Gracefully handle config options that are missing in other Polars versions
    pass

print(f"Polars version: {pl.__version__}")
print("Libraries imported successfully")

Polars version: 0.20.31
Libraries imported successfully


## 2. Data Loading Configuration

Set up file paths and configuration for data loading. The machine used here has 128 GB of RAM, which is enough to hold the full dataset in memory; set `use_full_dataset = False` to work on a sample instead.


In [14]:
# Configuration parameters
data_dir = '../data'
use_full_dataset = True  # With 128GB RAM, we can handle the full dataset
sample_fraction = 0.1 if not use_full_dataset else 1.0
random_seed = 42

# Set up file paths
transactions_path = os.path.join(data_dir, 'raw', 'transactions_train.csv')
customers_path = os.path.join(data_dir, 'raw', 'customers.csv')
articles_path = os.path.join(data_dir, 'raw', 'articles.csv')

print("Data Loading Configuration:")
print(f"• Data directory: {data_dir}")
print(f"• Using full dataset: {use_full_dataset}")
print(f"• Sample fraction: {sample_fraction*100:.0f}%")
print(f"• Random seed: {random_seed}")

Data Loading Configuration:
• Data directory: ../data
• Using full dataset: True
• Sample fraction: 100%
• Random seed: 42


## 3. Load Transactions Data

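Before loading the transactions, this cell restates the configuration, switches to a 10% sample for testing (`use_full_dataset = False`), and verifies that the three input files exist. The transactions themselves are read in cell `In [17]` with Polars' CSV reader, which infers the schema automatically.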


In [15]:
# Configuration parameters
data_dir = '../data'  # Corrected path - use relative path from notebooks directory
use_full_dataset = False  # Start with sample for testing, change to True for full dataset
sample_fraction = 0.1 if not use_full_dataset else 1.0
random_seed = 42

# Set up file paths
transactions_path = os.path.join(data_dir, 'raw', 'transactions_train.csv')
customers_path = os.path.join(data_dir, 'raw', 'customers.csv')
articles_path = os.path.join(data_dir, 'raw', 'articles.csv')

print("Data Loading Configuration:")
print(f"• Data directory: {data_dir}")
print(f"• Using full dataset: {use_full_dataset}")
print(f"• Sample fraction: {sample_fraction*100:.0f}%")
print(f"• Random seed: {random_seed}")

# Verify paths exist
print(f"\nPath verification:")
print(f"• Transactions file exists: {os.path.exists(transactions_path)}")
print(f"• Customers file exists: {os.path.exists(customers_path)}")
print(f"• Articles file exists: {os.path.exists(articles_path)}")

if not all(os.path.exists(p) for p in [transactions_path, customers_path, articles_path]):
    print(f"\nTroubleshooting:")
    print(f"• Current working directory: {os.getcwd()}")
    print(f"• Data directory contents: {os.listdir(data_dir) if os.path.exists(data_dir) else 'Not found'}")
    if os.path.exists(os.path.join(data_dir, 'raw')):
        print(f"• Raw directory contents: {os.listdir(os.path.join(data_dir, 'raw'))[:10]}")  # Show first 10 files

Data Loading Configuration:
• Data directory: ../data
• Using full dataset: False
• Sample fraction: 10%
• Random seed: 42

Path verification:
• Transactions file exists: True
• Customers file exists: True
• Articles file exists: True
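
The same existence check can be wrapped in a small helper that returns the missing paths, so later cells can fail early with a clear message. This is a minimal sketch: `find_missing_files` is a hypothetical helper, not part of the notebook, and a temporary directory stands in for `../data/raw`.

```python
import tempfile
from pathlib import Path


def find_missing_files(paths):
    """Return the paths (as strings) that do not point at an existing file."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


# Demonstration: a temporary directory stands in for ../data/raw
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "customers.csv"
    present.write_text("customer_id,age\n")
    absent = Path(tmp) / "articles.csv"  # deliberately never created
    missing = find_missing_files([present, absent])

# Only the articles.csv path is reported as missing
print(missing)
```

In the notebook, the call would be `find_missing_files([transactions_path, customers_path, articles_path])`, followed by raising `FileNotFoundError` if the list is non-empty.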


## 4. Load Customer Data

Load customer demographic and preference data with Polars, then load the transactions data (the full file or a random sample, depending on `use_full_dataset`).


In [16]:
print("\n• Loading customer data...")
start_time = time.time()

if not os.path.exists(customers_path):
    raise FileNotFoundError(f"Customers file not found: {customers_path}")

df_customers = pl.read_csv(customers_path)
customer_count = df_customers.height
load_time = time.time() - start_time

print(f"  ✓ Customers: {customer_count:,} records loaded in {load_time:.2f} seconds")
print(f"  ✓ Memory usage: {df_customers.estimated_size('mb'):.1f} MB")
print("  - Schema:")
print(df_customers.schema)


• Loading customer data...
  ✓ Customers: 1,371,980 records loaded in 0.23 seconds
  ✓ Memory usage: 215.0 MB
  - Schema:
OrderedDict([('customer_id', String), ('FN', Float64), ('Active', Float64), ('club_member_status', String), ('fashion_news_frequency', String), ('age', Int64), ('postal_code', String)])


In [17]:
print("• Loading transactions data with Polars...")
start_time = time.time()

if not os.path.exists(transactions_path):
    raise FileNotFoundError(f"Transaction file not found: {transactions_path}")

try:
    # Load transactions data efficiently with Polars
    if use_full_dataset:
        print("  Loading full dataset (this may take a moment for large files)...")
        df_transactions = pl.read_csv(transactions_path)
    else:
        print(f"  Loading {sample_fraction*100:.0f}% sample...")
        # For compatibility with different Polars versions, read then sample
        df_transactions = pl.read_csv(transactions_path).sample(fraction=sample_fraction, seed=random_seed)

    transaction_count = df_transactions.height
    load_time = time.time() - start_time

    print(f"  ✓ Transactions: {transaction_count:,} records loaded in {load_time:.2f} seconds")
    print(f"  ✓ Memory usage: {df_transactions.estimated_size('mb'):.1f} MB")
    print("  - Schema:")
    print(df_transactions.schema)
    
    # Show a quick sample
    print(f"\nFirst 3 rows:")
    print(df_transactions.head(3))

except Exception as e:
    print(f"Error loading transactions data: {e}")
    print("This might be due to:")
    print("1. Large file size - try setting use_full_dataset = False")
    print("2. Memory constraints - reduce sample_fraction")
    print("3. File corruption - check the CSV file")
    
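    # Retry with a smaller sample. Note that read_csv still parses the whole file
    # before sampling, so this does not lower peak memory during the read itself.
    print("\nTrying with smaller sample (1%)...")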
    try:
        df_transactions = pl.read_csv(transactions_path).sample(fraction=0.01, seed=random_seed)
        transaction_count = df_transactions.height
        load_time = time.time() - start_time
        print(f"  ✓ Emergency sample: {transaction_count:,} records loaded")
        print("  Note: Using 1% sample due to memory constraints")
    except Exception as e2:
        print(f"  Failed with smaller sample too: {e2}")
        raise

• Loading transactions data with Polars...
  Loading 10% sample...
  ✓ Transactions: 3,178,832 records loaded in 22.81 seconds
  ✓ Memory usage: 491.1 MB
  - Schema:
OrderedDict([('t_dat', String), ('customer_id', String), ('article_id', Int64), ('price', Float64), ('sales_channel_id', Int64)])

First 3 rows:
shape: (3, 5)
┌────────────┬─────────────────────────────────┬────────────┬──────────┬──────────────────┐
│ t_dat      ┆ customer_id                     ┆ article_id ┆ price    ┆ sales_channel_id │
│ ---        ┆ ---                             ┆ ---        ┆ ---      ┆ ---              │
│ str        ┆ str                             ┆ i64        ┆ f64      ┆ i64              │
╞════════════╪═════════════════════════════════╪════════════╪══════════╪══════════════════╡
│ 2019-05-02 ┆ 1e6db9e9e42595b9bb7f2076d7dd11… ┆ 600886001  ┆ 0.023949 ┆ 1                │
│ 2019-04-26 ┆ d72800d8455ca7f23189c39daaee30… ┆ 594161006  ┆ 0.027102 ┆ 2                │
│ 2020-03-07 ┆ 900fe1bfbca0a18f

## 5. Load Articles Data

Load product/article information and metadata with detailed product categorisation including product types, groups, colours, and other attributes.


In [18]:
print("\n• Loading articles data...")
start_time = time.time()

if not os.path.exists(articles_path):
    raise FileNotFoundError(f"Articles file not found: {articles_path}")

df_articles = pl.read_csv(articles_path)
article_count = df_articles.height
load_time = time.time() - start_time

print(f"  ✓ Articles: {article_count:,} records loaded in {load_time:.2f} seconds")
print(f"  ✓ Memory usage: {df_articles.estimated_size('mb'):.1f} MB")
print("  - Schema:")
print(df_articles.schema)


• Loading articles data...
  ✓ Articles: 105,542 records loaded in 0.07 seconds
  ✓ Memory usage: 36.4 MB
  - Schema:
OrderedDict([('article_id', Int64), ('product_code', Int64), ('prod_name', String), ('product_type_no', Int64), ('product_type_name', String), ('product_group_name', String), ('graphical_appearance_no', Int64), ('graphical_appearance_name', String), ('colour_group_code', Int64), ('colour_group_name', String), ('perceived_colour_value_id', Int64), ('perceived_colour_value_name', String), ('perceived_colour_master_id', Int64), ('perceived_colour_master_name', String), ('department_no', Int64), ('department_name', String), ('index_code', String), ('index_name', String), ('index_group_no', Int64), ('index_group_name', String), ('section_no', Int64), ('section_name', String), ('garment_group_no', Int64), ('garment_group_name', String), ('detail_desc', String)])


## 6. Dataset Integration

Integrate transaction, customer, and article datasets using Polars' optimised join operations. This creates a unified view of customer interactions with products for comprehensive analysis.


In [19]:
print("Creating integrated customer interaction dataset...")
start_time = time.time()

# Perform left joins to preserve all transaction records - Polars optimises joins automatically
df_integrated = (
    df_transactions
    .join(df_customers, on="customer_id", how="left")
    .join(df_articles, on="article_id", how="left")
)

# Calculate dataset statistics efficiently
integration_stats = {
    'total_records': df_integrated.height,
    'unique_customers': df_integrated['customer_id'].n_unique(),
    'unique_articles': df_integrated['article_id'].n_unique()
}

integration_time = time.time() - start_time

print(f"✓ Integrated dataset created with {integration_stats['total_records']:,} transaction records")
print(f"✓ Integration completed in {integration_time:.2f} seconds")
print(f"✓ Total memory usage: {df_integrated.estimated_size('mb'):.1f} MB")

Creating integrated customer interaction dataset...
✓ Integrated dataset created with 3,178,832 transaction records
✓ Integration completed in 1.27 seconds
✓ Total memory usage: 1828.6 MB


## 7. Sample Data Preview

Display a sample of the integrated dataset to understand the structure and verify successful integration.


In [20]:
# Show sample of integrated data
print("Sample of integrated dataset:")
sample_cols = ["customer_id", "article_id", "price", "sales_channel_id", "age", "club_member_status", "product_type_name"]
available_cols = [col for col in sample_cols if col in df_integrated.columns]

display_df = df_integrated.select(available_cols).head(10)
print(display_df)

Sample of integrated dataset:
shape: (10, 7)
┌─────────────────┬────────────┬──────────┬────────────────┬─────┬────────────────┬────────────────┐
│ customer_id     ┆ article_id ┆ price    ┆ sales_channel_ ┆ age ┆ club_member_st ┆ product_type_n │
│ ---             ┆ ---        ┆ ---      ┆ id             ┆ --- ┆ atus           ┆ ame            │
│ str             ┆ i64        ┆ f64      ┆ ---            ┆ i64 ┆ ---            ┆ ---            │
│                 ┆            ┆          ┆ i64            ┆     ┆ str            ┆ str            │
╞═════════════════╪════════════╪══════════╪════════════════╪═════╪════════════════╪════════════════╡
│ 1e6db9e9e42595b ┆ 600886001  ┆ 0.023949 ┆ 1              ┆ 45  ┆ ACTIVE         ┆ Swimwear       │
│ 9bb7f2076d7dd11 ┆            ┆          ┆                ┆     ┆                ┆ bottom         │
│ …               ┆            ┆          ┆                ┆     ┆                ┆                │
│ d72800d8455ca7f ┆ 594161006  ┆ 0.027102 ┆ 2 

## 8. Dataset Structure Analysis

Analyse the integrated dataset structure including customer and product diversity, date ranges, and feature categorisation for the H&M dataset.


In [21]:
print(f"Dataset Structure Analysis:")
print(f"• Total unique customers: {integration_stats['unique_customers']:,}")
print(f"• Total unique articles: {integration_stats['unique_articles']:,}")

# Check if date column exists and show range
if 't_dat' in df_integrated.columns:
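    # t_dat is an ISO-formatted string, so min/max on the strings is chronological
    date_stats = df_integrated.select([
        pl.col('t_dat').min().alias('min_date'),
        pl.col('t_dat').max().alias('max_date')
    ]).row(0, named=True)  # plain dict; no pandas round trip needed
    print(f"• Date range: {date_stats['min_date']} to {date_stats['max_date']}")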

# Create feature categories for H&M dataset
all_columns = df_integrated.columns
feature_categories = {
    'Customer Demographics': [col for col in ['customer_id', 'age', 'club_member_status', 'fashion_news_frequency'] if col in all_columns],
    'Transaction Behaviour': [col for col in ['t_dat', 'price', 'sales_channel_id'] if col in all_columns],
    'Product Information': [col for col in ['article_id', 'product_type_name', 'product_group_name', 'colour_group_name'] if col in all_columns],
    'Customer Preferences': [col for col in ['garment_group_name', 'index_name', 'section_name'] if col in all_columns]
}

print(f"\nDataset Features by Category:")
for category, features in feature_categories.items():
    if features:  # Only show categories with available features
        print(f"  {category}: {', '.join(features)}")

Dataset Structure Analysis:
• Total unique customers: 821,920
• Total unique articles: 87,121
• Date range: 2018-09-20 to 2020-09-22

Dataset Features by Category:
  Customer Demographics: customer_id, age, club_member_status, fashion_news_frequency
  Transaction Behaviour: t_dat, price, sales_channel_id
  Product Information: article_id, product_type_name, product_group_name, colour_group_name
  Customer Preferences: garment_group_name, index_name, section_name


## 9. Data Quality Assessment

Begin the data quality assessment by counting missing values per column with Polars' `null_count`. Duplicates and price statistics are covered in the sections that follow.


In [22]:
print("Analysing existing data quality issues in H&M dataset...")

# Check for missing values efficiently with Polars
print(f"\n• Missing Values Analysis:")

# Get null counts for all columns in one operation
null_counts = df_integrated.null_count()
total_records = df_integrated.height

missing_summary = []
for col_name in df_integrated.columns:
    missing_count = null_counts[col_name][0]
    if missing_count > 0:
        missing_percentage = (missing_count / total_records) * 100
        missing_summary.append((col_name, missing_count, missing_percentage))
        print(f"  {col_name}: {missing_count:,} ({missing_percentage:.2f}%)")

if not missing_summary:
    print("  ✓ No missing values found in dataset")

Analysing existing data quality issues in H&M dataset...

• Missing Values Analysis:
  FN: 1,821,007 (57.29%)
  Active: 1,841,453 (57.93%)
  club_member_status: 6,179 (0.19%)
  fashion_news_frequency: 14,216 (0.45%)
  age: 14,079 (0.44%)
  detail_desc: 11,480 (0.36%)


## 10. Duplicate Analysis

Analyse the dataset for duplicate records to understand data integrity using Polars' efficient duplicate detection.


In [23]:
# Analyse duplicates efficiently
print(f"\n• Duplicate Analysis:")
total_records = df_integrated.height
unique_records = df_integrated.unique().height
duplicate_count = total_records - unique_records

print(f"  Total records: {total_records:,}")
print(f"  Unique records: {unique_records:,}")
print(f"  Duplicate records: {duplicate_count:,}")


• Duplicate Analysis:
  Total records: 3,178,832
  Unique records: 3,141,398
  Duplicate records: 37,434


## 11. Price Distribution Analysis

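Summarise the price distribution (count, mean, standard deviation, and quartiles) as a first step towards identifying outliers. Prices in this dataset are small scaled values (the maximum is about 0.51), so the two-decimal formatting used below hides most of the variation between quartiles.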


In [24]:
# Check for potential outliers in price data
if 'price' in df_integrated.columns:
    print(f"\n• Price Distribution Analysis:")
    
    # Get comprehensive price statistics in one operation
    price_stats = df_integrated.select([
        pl.col('price').count().alias('count'),
        pl.col('price').mean().alias('mean'),
        pl.col('price').std().alias('std'),
        pl.col('price').min().alias('min'),
        pl.col('price').quantile(0.25).alias('25%'),
        pl.col('price').quantile(0.5).alias('50%'),
        pl.col('price').quantile(0.75).alias('75%'),
        pl.col('price').max().alias('max')
    ]).to_pandas().iloc[0]
    
    for stat_name, value in price_stats.items():
        print(f"  {stat_name}: {value:.2f}")
else:
    print(f"\n• No price column found in dataset")


• Price Distribution Analysis:
  count: 3178832.00
  mean: 0.03
  std: 0.02
  min: 0.00
  25%: 0.02
  50%: 0.03
  75%: 0.03
  max: 0.51


## 12. Performance Summary

Summarise record counts and estimated memory usage for the loaded and integrated datasets.


In [25]:
# Compile quality assessment results
quality_report = {
    'missing_summary': missing_summary,
    'duplicate_count': duplicate_count,
    'total_records': total_records
}

# Calculate total memory usage
total_memory_mb = (
    df_transactions.estimated_size('mb') + 
    df_customers.estimated_size('mb') + 
    df_articles.estimated_size('mb') + 
    df_integrated.estimated_size('mb')
)

print(f"\n✓ Data Exploration Complete")
print(f"  - Transaction records: {transaction_count:,}")
print(f"  - Integrated records: {integration_stats['total_records']:,}")
print(f"  - Unique customers: {integration_stats['unique_customers']:,}")
print(f"  - Unique articles: {integration_stats['unique_articles']:,}")
print(f"  - Total memory usage: {total_memory_mb:.1f} MB")
print(f"  - Processing framework: Polars {pl.__version__}")

print("\n" + "=" * 60)
print("DATA EXPLORATION COMPLETED SUCCESSFULLY")
print("=" * 60)
print(f"Integrated dataset available as 'df_integrated'")
print(f"Quality report available as 'quality_report'")


✓ Data Exploration Complete
  - Transaction records: 3,178,832
  - Integrated records: 3,178,832
  - Unique customers: 821,920
  - Unique articles: 87,121
  - Total memory usage: 2571.0 MB
  - Processing framework: Polars 0.20.31

DATA EXPLORATION COMPLETED SUCCESSFULLY
Integrated dataset available as 'df_integrated'
Quality report available as 'quality_report'


## 13. Save Results

Save the integrated dataset and results for use in preprocessing and further analysis.


In [26]:
# Save integrated dataset for preprocessing
output_dir = os.path.join(data_dir, 'processed')
os.makedirs(output_dir, exist_ok=True)

# Save as Parquet for efficient storage
df_integrated.write_parquet(os.path.join(output_dir, 'hm_integrated_dataset.parquet'))
print(f"✓ Integrated dataset saved as Parquet file")

✓ Integrated dataset saved as Parquet file
