# H&M Personalised Fashion Recommendations Data Understanding

This notebook provides an exploratory analysis of the H&M dataset structure and quality using Polars for high-performance data processing. The analysis includes data loading, integration, and quality assessment across transactions, customers, and articles datasets.

## Dataset Description

**Challenge notes from Kaggle**: For this challenge you are given the purchase history of customers across time, along with supporting metadata. Your challenge is to predict what articles each customer will purchase in the 7-day period immediately after the training data ends. Customers who did not make any purchase during that time are excluded from the scoring.

### Files

- **images/** - a folder of images corresponding to each article_id; images are placed in subfolders starting with the first three digits of the article_id; note, not all article_id values have a corresponding image.
- **articles.csv** - detailed metadata for each article_id available for purchase
- **customers.csv** - metadata for each customer_id in the dataset
- **sample_submission.csv** - a sample submission file in the correct format
- **transactions_train.csv** - the training data, consisting of the purchases of each customer for each date, as well as additional information. Duplicate rows correspond to multiple purchases of the same item. Your task is to predict the article_ids each customer will purchase during the 7-day period immediately after the training data period.

[H&M Personalized Fashion Recommendations Kaggle Competition](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations)
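The subfolder rule for images described in the file list can be sketched as a small helper. This is only an illustration: `article_image_path` and `images_root` are hypothetical names, and the zero-padding to 10 digits and the `.jpg` extension are assumptions that are not stated in the file list above.

```python
from pathlib import Path


def article_image_path(article_id, images_root="images", width=10, ext=".jpg"):
    """Build the expected image path for an article.

    Assumptions (not confirmed by the notebook): IDs are zero-padded to
    `width` digits in file names, and images use the `ext` extension.
    """
    padded = str(article_id).zfill(width)
    # The subfolder is named after the first three digits of the padded ID.
    return Path(images_root) / padded[:3] / f"{padded}{ext}"


example_path = article_image_path(874916005)
print(example_path.as_posix())
```

Because not every article has an image, callers would still need to check `example_path.exists()` before opening the file.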

## Overview

- **Transaction Data**: Customer purchase history with temporal patterns
- **Customer Data**: Demographic and preference information
- **Articles Data**: Product metadata and categorisation


## 1. Import Libraries and Setup

Import Polars and other necessary libraries for data processing, analysis, and visualisation.


In [1]:
import polars as pl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time
from typing import Tuple, List, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

print(f"Polars version: {pl.__version__}")
print("Libraries imported successfully")

Polars version: 1.32.0
Libraries imported successfully


## 2. Data Loading Configuration

Set up file paths and configuration for data loading.


In [2]:
# Configuration parameters
data_dir = '../data'
use_full_dataset = False
sample_fraction = 0.1 if not use_full_dataset else 1.0
random_seed = 42

# Set up file paths
transactions_path = os.path.join(data_dir, 'raw', 'transactions_train.csv')
customers_path = os.path.join(data_dir, 'raw', 'customers.csv')
articles_path = os.path.join(data_dir, 'raw', 'articles.csv')

print("Data Loading Configuration:")
print(f"• Data directory: {data_dir}")
print(f"• Using full dataset: {use_full_dataset}")
print(f"• Sample fraction: {sample_fraction*100:.0f}%")
print(f"• Random seed: {random_seed}")

Data Loading Configuration:
• Data directory: ../data
• Using full dataset: False
• Sample fraction: 10%
• Random seed: 42


## 3. Check Dataset Existence

Check that the three dataset files exist before loading them.


In [3]:
# Configuration parameters
data_dir = '../data'
use_full_dataset = False
sample_fraction = 0.1 if not use_full_dataset else 1.0
random_seed = 42

# Set up file paths
transactions_path = os.path.join(data_dir, 'raw', 'transactions_train.csv')
customers_path = os.path.join(data_dir, 'raw', 'customers.csv')
articles_path = os.path.join(data_dir, 'raw', 'articles.csv')

print("Data Loading Configuration:")
print(f"• Data directory: {data_dir}")
print(f"• Using full dataset: {use_full_dataset}")
print(f"• Sample fraction: {sample_fraction*100:.0f}%")
print(f"• Random seed: {random_seed}")

# Verify paths exist
print(f"\nPath verification:")
print(f"• Transactions file exists: {os.path.exists(transactions_path)}")
print(f"• Customers file exists: {os.path.exists(customers_path)}")
print(f"• Articles file exists: {os.path.exists(articles_path)}")

if not all(os.path.exists(p) for p in [transactions_path, customers_path, articles_path]):
    print(f"\nTroubleshooting:")
    print(f"• Current working directory: {os.getcwd()}")
    print(f"• Data directory contents: {os.listdir(data_dir) if os.path.exists(data_dir) else 'Not found'}")
    if os.path.exists(os.path.join(data_dir, 'raw')):
        print(f"• Raw directory contents: {os.listdir(os.path.join(data_dir, 'raw'))[:10]}")  # Show first 10 files

Data Loading Configuration:
• Data directory: ../data
• Using full dataset: False
• Sample fraction: 10%
• Random seed: 42

Path verification:
• Transactions file exists: True
• Customers file exists: True
• Articles file exists: True
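The existence check can also be wrapped in a reusable helper that reports which files are missing. The sketch below uses `pathlib`; `find_missing_files` is a hypothetical name, and the temporary directory only stands in for `../data/raw` so that the example runs anywhere.

```python
import tempfile
from pathlib import Path


def find_missing_files(paths):
    """Return the paths (as strings) that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


# Demo with a temporary directory standing in for the raw data folder.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "customers.csv"
    present.write_text("customer_id\n")
    absent = Path(tmp) / "articles.csv"  # deliberately never created
    missing = find_missing_files([present, absent])

print(f"Missing files: {len(missing)}")
```

Raising `FileNotFoundError` when the returned list is non-empty would stop the notebook before any loading cell runs.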


## 4. Load Customer Data


In [4]:
print("\n• Loading customer data...")
start_time = time.time()

if not os.path.exists(customers_path):
    raise FileNotFoundError(f"Customers file not found: {customers_path}")

df_customers = pl.read_csv(customers_path)
customer_count = df_customers.height
load_time = time.time() - start_time

print(f"  ✓ Customers: {customer_count:,} records loaded in {load_time:.2f} seconds")
print(f"  ✓ Memory usage: {df_customers.estimated_size('mb'):.1f} MB")
print("  - Schema:")
print(df_customers.schema)


• Loading customer data...
  ✓ Customers: 1,371,980 records loaded in 0.46 seconds
  ✓ Memory usage: 215.0 MB
  - Schema:
Schema([('customer_id', String), ('FN', Float64), ('Active', Float64), ('club_member_status', String), ('fashion_news_frequency', String), ('age', Int64), ('postal_code', String)])


## 5. Load Transactions Data


In [5]:
print("• Loading transactions data with Polars...")
start_time = time.time()

if not os.path.exists(transactions_path):
    raise FileNotFoundError(f"Transaction file not found: {transactions_path}")

try:
    # Load transactions data with Polars
    if use_full_dataset:
        print("  Loading full dataset (this may take a moment for large files)...")
        df_transactions = pl.read_csv(transactions_path)
    else:
        print(f"  Loading {sample_fraction*100:.0f}% sample...")
        # For compatibility with different Polars versions, read then sample
        df_transactions = pl.read_csv(transactions_path).sample(fraction=sample_fraction, seed=random_seed)

    transaction_count = df_transactions.height
    load_time = time.time() - start_time

    print(f"  ✓ Transactions: {transaction_count:,} records loaded in {load_time:.2f} seconds")
    print(f"  ✓ Memory usage: {df_transactions.estimated_size('mb'):.1f} MB")
    print("  - Schema:")
    print(df_transactions.schema)
    
    # Show a sample
    print(f"\nFirst 5 rows:")
    print(df_transactions.head(5))

except Exception as e:
    print(f"Error loading transactions data: {e}")
    print("This might be due to:")
    print("1. Large file size - try setting use_full_dataset = False")
    print("2. Memory constraints - reduce sample_fraction")
    print("3. File corruption - check the CSV file")
    
    # Retry with a smaller sample (note: this still reads the full CSV before sampling)
    print("\nTrying with smaller sample (1%)...")
    try:
        df_transactions = pl.read_csv(transactions_path).sample(fraction=0.01, seed=random_seed)
        transaction_count = df_transactions.height
        load_time = time.time() - start_time
        print(f"  ✓ Emergency sample: {transaction_count:,} records loaded")
        print("  Note: Using 1% sample due to memory constraints")
    except Exception as e2:
        print(f"  Failed with smaller sample too: {e2}")
        raise

• Loading transactions data with Polars...
  Loading 10% sample...
  ✓ Transactions: 3,178,832 records loaded in 10.40 seconds
  ✓ Memory usage: 297.1 MB
  - Schema:
Schema([('t_dat', String), ('customer_id', String), ('article_id', Int64), ('price', Float64), ('sales_channel_id', Int64)])

First 5 rows:
shape: (5, 5)
┌────────────┬─────────────────────────────────┬────────────┬──────────┬──────────────────┐
│ t_dat      ┆ customer_id                     ┆ article_id ┆ price    ┆ sales_channel_id │
│ ---        ┆ ---                             ┆ ---        ┆ ---      ┆ ---              │
│ str        ┆ str                             ┆ i64        ┆ f64      ┆ i64              │
╞════════════╪═════════════════════════════════╪════════════╪══════════╪══════════════════╡
│ 2020-05-18 ┆ 242e34dcafc554396c79c976e08459… ┆ 874916005  ┆ 0.014322 ┆ 2                │
│ 2019-05-12 ┆ 14de0945a699eb27c5b24c69077a1f… ┆ 751260002  ┆ 0.016932 ┆ 2                │
│ 2020-09-08 ┆ 8d7b5cd5a125ed04d08ac

## 6. Load Articles Data


In [6]:
print("\n• Loading articles data...")
start_time = time.time()

if not os.path.exists(articles_path):
    raise FileNotFoundError(f"Articles file not found: {articles_path}")

df_articles = pl.read_csv(articles_path)
article_count = df_articles.height
load_time = time.time() - start_time

print(f"  ✓ Articles: {article_count:,} records loaded in {load_time:.2f} seconds")
print(f"  ✓ Memory usage: {df_articles.estimated_size('mb'):.1f} MB")
print("  - Schema:")
print(df_articles.schema)


• Loading articles data...
  ✓ Articles: 105,542 records loaded in 0.06 seconds
  ✓ Memory usage: 36.4 MB
  - Schema:
Schema([('article_id', Int64), ('product_code', Int64), ('prod_name', String), ('product_type_no', Int64), ('product_type_name', String), ('product_group_name', String), ('graphical_appearance_no', Int64), ('graphical_appearance_name', String), ('colour_group_code', Int64), ('colour_group_name', String), ('perceived_colour_value_id', Int64), ('perceived_colour_value_name', String), ('perceived_colour_master_id', Int64), ('perceived_colour_master_name', String), ('department_no', Int64), ('department_name', String), ('index_code', String), ('index_name', String), ('index_group_no', Int64), ('index_group_name', String), ('section_no', Int64), ('section_name', String), ('garment_group_no', Int64), ('garment_group_name', String), ('detail_desc', String)])


## 7. Dataset Integration

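Integrate the transaction, customer, and article datasets using Polars left joins on `customer_id` and `article_id`. Note that this denormalises the data: customer and article attributes are repeated on every transaction row, which is why the integrated frame is much larger in memory than the three source tables.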


In [7]:
print("Creating integrated customer interaction dataset...")
start_time = time.time()

# Perform left joins to preserve all transaction records
df_integrated = (
    df_transactions
    .join(df_customers, on="customer_id", how="left")
    .join(df_articles, on="article_id", how="left")
)

# Calculate dataset statistics efficiently
integration_stats = {
    'total_records': df_integrated.height,
    'unique_customers': df_integrated['customer_id'].n_unique(),
    'unique_articles': df_integrated['article_id'].n_unique()
}

integration_time = time.time() - start_time

print(f"✓ Integrated dataset created with {integration_stats['total_records']:,} transaction records")
print(f"✓ Integration completed in {integration_time:.2f} seconds")
print(f"✓ Total memory usage: {df_integrated.estimated_size('mb'):.1f} MB")

Creating integrated customer interaction dataset...
✓ Integrated dataset created with 3,178,832 transaction records
✓ Integration completed in 1.03 seconds
✓ Total memory usage: 1634.7 MB


## 8. Integrated Data Preview

Display a sample of the integrated dataset to understand the structure and verify successful integration.


In [8]:
# Show sample of integrated data
print("Sample of integrated dataset:")
sample_cols = ["customer_id", "article_id", "price", "sales_channel_id", "age", "club_member_status", "product_type_name"]
available_cols = [col for col in sample_cols if col in df_integrated.columns]

display_df = df_integrated.select(available_cols).head(10)
print(display_df)

# Print integrated data schema
print("  - Schema:")
print(df_integrated.schema)


Sample of integrated dataset:
shape: (10, 7)
┌─────────────────┬────────────┬──────────┬────────────────┬─────┬────────────────┬────────────────┐
│ customer_id     ┆ article_id ┆ price    ┆ sales_channel_ ┆ age ┆ club_member_st ┆ product_type_n │
│ ---             ┆ ---        ┆ ---      ┆ id             ┆ --- ┆ atus           ┆ ame            │
│ str             ┆ i64        ┆ f64      ┆ ---            ┆ i64 ┆ ---            ┆ ---            │
│                 ┆            ┆          ┆ i64            ┆     ┆ str            ┆ str            │
╞═════════════════╪════════════╪══════════╪════════════════╪═════╪════════════════╪════════════════╡
│ 242e34dcafc5543 ┆ 874916005  ┆ 0.014322 ┆ 2              ┆ 54  ┆ ACTIVE         ┆ Sunglasses     │
│ 96c79c976e08459 ┆            ┆          ┆                ┆     ┆                ┆                │
│ …               ┆            ┆          ┆                ┆     ┆                ┆                │
│ 14de0945a699eb2 ┆ 751260002  ┆ 0.016932 ┆ 2 

## 9. Integrated Dataset Structure Analysis

Analyse the integrated dataset structure including customer and product diversity and date ranges.


In [9]:
print(f"Dataset Structure Analysis:")
print(f"• Total unique customers: {integration_stats['unique_customers']:,}")
print(f"• Total unique articles: {integration_stats['unique_articles']:,}")

# Check if date column exists and show range
if 't_dat' in df_integrated.columns:
    date_stats = df_integrated.select([
        pl.col('t_dat').min().alias('min_date'),
        pl.col('t_dat').max().alias('max_date')
    ]).to_pandas().iloc[0]
    print(f"• Date range: {date_stats['min_date']} to {date_stats['max_date']}")

Dataset Structure Analysis:
• Total unique customers: 822,211
• Total unique articles: 86,988
• Date range: 2018-09-20 to 2020-09-22


## 10. Data Quality Assessment


In [10]:
print("Analysing existing data quality issues in H&M dataset...")

# Check for missing values efficiently with Polars
print(f"\n• Missing Values Analysis:")

# Get null counts for all columns in one operation
null_counts = df_integrated.null_count()
total_records = df_integrated.height

missing_summary = []
for col_name in df_integrated.columns:
    missing_count = null_counts[col_name][0]
    if missing_count > 0:
        missing_percentage = (missing_count / total_records) * 100
        missing_summary.append((col_name, missing_count, missing_percentage))
        print(f"  {col_name}: {missing_count:,} ({missing_percentage:.2f}%)")

if not missing_summary:
    print("  ✓ No missing values found in dataset")

Analysing existing data quality issues in H&M dataset...

• Missing Values Analysis:
  FN: 1,821,934 (57.31%)
  Active: 1,842,236 (57.95%)
  club_member_status: 6,226 (0.20%)
  fashion_news_frequency: 14,299 (0.45%)
  age: 14,156 (0.45%)
  detail_desc: 11,458 (0.36%)


## 11. Duplicate Analysis

Count fully duplicated rows in the integrated dataset. As noted in the dataset description, duplicate rows correspond to multiple purchases of the same item, so they are not necessarily data errors.


In [11]:
# Analyse duplicates efficiently
print(f"\n• Duplicate Analysis:")
total_records = df_integrated.height
unique_records = df_integrated.unique().height
duplicate_count = total_records - unique_records

print(f"  Total records: {total_records:,}")
print(f"  Unique records: {unique_records:,}")
print(f"  Duplicate records: {duplicate_count:,}")


• Duplicate Analysis:
  Total records: 3,178,832
  Unique records: 3,141,366
  Duplicate records: 37,466


## 12. Price Distribution Analysis

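Summarise the price distribution to identify potential outliers and understand the pricing structure of H&M products. Prices in this dataset are small decimals (the maximum in the sample is 0.59), so the two-decimal formatting used below hides most of the variation between the quartiles.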


In [12]:
# Check for potential outliers in price data
if 'price' in df_integrated.columns:
    print(f"\n• Price Distribution Analysis:")
    
    # Get comprehensive price statistics in one operation
    price_stats = df_integrated.select([
        pl.col('price').count().alias('count'),
        pl.col('price').mean().alias('mean'),
        pl.col('price').std().alias('std'),
        pl.col('price').min().alias('min'),
        pl.col('price').quantile(0.25).alias('25%'),
        pl.col('price').quantile(0.5).alias('50%'),
        pl.col('price').quantile(0.75).alias('75%'),
        pl.col('price').max().alias('max')
    ]).to_pandas().iloc[0]
    
    for stat_name, value in price_stats.items():
        print(f"  {stat_name}: {value:.2f}")
else:
    print(f"\n• No price column found in dataset")


• Price Distribution Analysis:
  count: 3178832.00
  mean: 0.03
  std: 0.02
  min: 0.00
  25%: 0.02
  50%: 0.03
  75%: 0.03
  max: 0.59


## 13. Performance Summary

Summarise record counts, unique customers and articles, and the combined estimated memory usage of the three source DataFrames plus the integrated one.


In [13]:
# Compile quality assessment results
quality_report = {
    'missing_summary': missing_summary,
    'duplicate_count': duplicate_count,
    'total_records': total_records
}

# Calculate total memory usage
total_memory_mb = (
    df_transactions.estimated_size('mb') + 
    df_customers.estimated_size('mb') + 
    df_articles.estimated_size('mb') + 
    df_integrated.estimated_size('mb')
)

print(f"\n✓ Data Exploration Complete")
print(f"  - Transaction records: {transaction_count:,}")
print(f"  - Integrated records: {integration_stats['total_records']:,}")
print(f"  - Unique customers: {integration_stats['unique_customers']:,}")
print(f"  - Unique articles: {integration_stats['unique_articles']:,}")
print(f"  - Total memory usage: {total_memory_mb:.1f} MB")

print("\n" + "=" * 60)
print("DATA UNDERSTANDING COMPLETED SUCCESSFULLY")
print("=" * 60)
print(f"Integrated dataset available as 'df_integrated'")
print(f"Quality report available as 'quality_report'")


✓ Data Exploration Complete
  - Transaction records: 3,178,832
  - Integrated records: 3,178,832
  - Unique customers: 822,211
  - Unique articles: 86,988
  - Total memory usage: 2183.1 MB

DATA UNDERSTANDING COMPLETED SUCCESSFULLY
Integrated dataset available as 'df_integrated'
Quality report available as 'quality_report'


## 14. Save Results

Save the integrated dataset and results for use in preprocessing and further analysis.


In [14]:
output_dir = os.path.join(data_dir, 'processed')
os.makedirs(output_dir, exist_ok=True)

# Save as Parquet for efficient storage
df_integrated.write_parquet(os.path.join(output_dir, 'hm_integrated_dataset.parquet'))
print(f"✓ Integrated dataset saved as Parquet file")

# Save integrated dataset as CSV for easier inspection
df_integrated.write_csv(os.path.join(output_dir, 'hm_integrated_dataset.csv'))
print(f"✓ Integrated dataset saved as CSV file")


✓ Integrated dataset saved as Parquet file
✓ Integrated dataset saved as CSV file
