# Rossmann Store Sales: Professional EDA & MLOps Foundation

This notebook walks through exploratory data analysis (EDA) of the Rossmann Store Sales data and uses the findings to choose cleaning rules, features, and monitoring thresholds for a retail forecasting MLOps pipeline (ICA Sverige context).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
from datetime import datetime

# Add project root to path
sys.path.append(os.path.abspath('..'))

from src.ingest.ingestor import DataIngestorFactory
from pipeline.ml_pipeline import RossmannPipeline

sns.set(style='whitegrid', palette='muted')
%matplotlib inline

# Fix for CJK characters in Matplotlib on macOS
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'PingFang HK', 'Heiti TC', 'sans-serif']
plt.rcParams['axes.unicode_minus'] = False

## 1. Data Integrity & Business Logic
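Checking that the data obeys basic retail rules: a closed store should record no sales, and missing competitor distances should be counted before any preprocessing.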

In [None]:
train_path = '../data/raw/train.csv'
factory = DataIngestorFactory()
ingestor = factory.get_data_ingestor('rossmann')
df = ingestor.ingest(train_path)
df['Date'] = pd.to_datetime(df['Date'])

# Business logic check: Closed stores (Open=0) should have 0 Sales
dirty_data = df[(df['Open'] == 0) & (df['Sales'] > 0)]
print(f'Closed stores with sales (Exceptions): {len(dirty_data)}')

# Missing values handling: Tracking CompetitionDistance
missing_dist = df['CompetitionDistance'].isnull().sum()
print(f'Missing CompetitionDistance records: {missing_dist}')
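The two checks above can be bundled into one reusable report. The sketch below is a minimal illustration on a toy frame, not the project's cleaning layer: the function name `integrity_report` and the extra "open but zero sales" check are assumptions added for illustration.

```python
import numpy as np
import pandas as pd


def integrity_report(df: pd.DataFrame) -> dict:
    """Count rows that violate simple retail rules (hypothetical helper)."""
    closed_with_sales = (df["Open"] == 0) & (df["Sales"] > 0)
    # Assumption: open days with zero sales are worth counting too.
    open_without_sales = (df["Open"] == 1) & (df["Sales"] == 0)
    return {
        "closed_with_sales": int(closed_with_sales.sum()),
        "open_without_sales": int(open_without_sales.sum()),
        "missing_competition_distance": int(df["CompetitionDistance"].isna().sum()),
    }


# Toy data standing in for the ingested training frame.
toy = pd.DataFrame({
    "Open": [1, 0, 0, 1],
    "Sales": [5000, 0, 120, 0],
    "CompetitionDistance": [350.0, np.nan, 1200.0, np.nan],
})
report = integrity_report(toy)
print(report)
```

Returning counts instead of printing them makes the same checks usable as assertions in a pipeline step.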

## 2. Target Variable (Sales) Distribution
Analyzing skewness to justify preprocessing choices.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.histplot(df[df['Sales'] > 0]['Sales'], kde=True, ax=axes[0], color='blue')
axes[0].set_title('Raw Sales Distribution (Right-Skewed)')

sns.histplot(np.log1p(df[df['Sales'] > 0]['Sales']), kde=True, ax=axes[1], color='green')
axes[1].set_title('Log-Transformed Sales (Approximately Normal)')
plt.show()
print('Insight: the log(1 + x) transform reduces right skew, which tends to stabilize regression fits.')
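The histograms show the skew visually; a skewness coefficient puts a number on it. The sketch below uses synthetic log-normal data as a stand-in for positive sales (an assumption, not the Rossmann data) and compares `Series.skew()` before and after `np.log1p`. It also confirms that `np.expm1` inverts the transform, which matters when predictions are mapped back to the sales scale.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sales" (assumption: log-normal stand-in data).
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=8.5, sigma=0.5, size=10_000))

raw_skew = sales.skew()            # clearly positive for right-skewed data
log_skew = np.log1p(sales).skew()  # close to zero after the transform

# expm1 is the exact inverse of log1p, so predictions can be mapped back.
round_trip_ok = np.allclose(np.expm1(np.log1p(sales)), sales)

print(f"Skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
print(f"Round trip recovers sales: {round_trip_ok}")
```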

## 3. Time-Series & Seasonality
Examining the weekly sales cycle and the long-term trend.

In [None]:
# 3.1 Weekly Pattern
plt.figure(figsize=(10, 5))
sns.barplot(data=df, x='DayOfWeek', y='Sales', hue='DayOfWeek', legend=False, palette='viridis')
plt.title('Weekly Sales Cycle (Sunday Closure Effect Visible)')
plt.show()

# 3.2 Long-term Trend (mean sales per calendar week, not a moving average)
df.resample('W', on='Date')['Sales'].mean().plot(figsize=(15, 5), color='purple')
plt.title('Weekly Mean Sales Trend')
plt.show()
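Resampling to weekly means and taking a moving average are different aggregations: `resample('W')` yields one value per calendar week, while `rolling(window=7)` yields one value per day over a sliding window. The sketch below contrasts the two on a synthetic daily series (the data and variable names are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Four full weeks of synthetic daily sales, starting on a Monday.
dates = pd.date_range("2015-01-05", periods=28, freq="D")
daily = pd.DataFrame({"Date": dates, "Sales": np.arange(28, dtype=float)})

# One value per calendar week (weeks end on Sunday by default).
weekly_mean = daily.resample("W", on="Date")["Sales"].mean()

# One value per day: the mean of the trailing 7 days.
rolling_mean = daily.set_index("Date")["Sales"].rolling(window=7).mean()

print(weekly_mean)
print(rolling_mean.dropna().head())
```

The rolling version keeps daily resolution and leaves the first six days undefined, so it suits overlaying a trend on the daily series. The resampled version suits a compact week-by-week view.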

## 4. Promotion (Promo) Effectiveness
Checking whether promotion days show higher sales than non-promotion days.

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=df[df['Sales']>0], x='Promo', y='Sales', hue='Promo', palette='husl', legend=False)
plt.title('Sales Distribution: Promo (1) vs. No Promo (0)')
plt.show()
print('Decision: Promo is a critical feature for predicting volume peaks.')
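The box plot can be backed by a single number: the relative lift in median sales on promo days. The sketch below computes it on toy data (the values are made up for illustration). It is a descriptive comparison only; promotions are not randomly assigned, so the figure should not be read as a causal effect.

```python
import pandas as pd

# Toy data standing in for the training frame (values are illustrative).
toy = pd.DataFrame({
    "Promo": [0, 0, 0, 1, 1, 1],
    "Sales": [4000, 5000, 6000, 6000, 7500, 9000],
})

# Restrict to days with sales, mirroring the filter used for the box plot.
open_days = toy[toy["Sales"] > 0]
median_by_promo = open_days.groupby("Promo")["Sales"].median()

# Relative lift of promo days over non-promo days.
relative_lift = median_by_promo.loc[1] / median_by_promo.loc[0] - 1
print(f"Median sales lift on promo days: {relative_lift:.0%}")
```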

## 5. Store Type & Competition Environment
Different commercial formats show distinct performance profiles.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
df.groupby('StoreType')['Sales'].mean().plot(kind='bar', ax=axes[0], color='teal')
axes[0].set_title('Mean Sales by Store Type')

# Fixed random_state keeps the sampled scatter reproducible across runs
sns.scatterplot(data=df.sample(2000, random_state=42), x='CompetitionDistance', y='Sales', alpha=0.3, ax=axes[1])
axes[1].set_title('Sales vs. Competition Distance')
plt.show()

## 6. The Payday Effect Hypothesis
Sales often spike around salary payment dates (start/end of month).

In [None]:
df['DayOfMonth'] = df['Date'].dt.day
payday_sales = df.groupby('DayOfMonth')['Sales'].mean()
plt.figure(figsize=(12, 5))
payday_sales.plot(kind='line', marker='o', color='gold')
plt.axvline(x=1, color='red', linestyle='--', label='Start of Month')
plt.axvline(x=30, color='red', linestyle='--', label='End of Month')
plt.title('Payday Pulse Investigation (Spikes on 1st, 30th/31st)')
plt.legend()
plt.show()
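If the hypothesis holds, the day-of-month pattern can be encoded as a binary flag. The sketch below is one possible definition, not necessarily the pipeline's `IsPayday` logic: the helper name `add_payday_flag` and the window parameters `start_days` and `end_days` are assumptions. It uses `dt.days_in_month` so that the end-of-month window adapts to 28-, 30-, and 31-day months.

```python
import pandas as pd


def add_payday_flag(df: pd.DataFrame, date_col: str = "Date",
                    start_days: int = 1, end_days: int = 2) -> pd.DataFrame:
    """Flag the first `start_days` and last `end_days` days of each month.

    Hypothetical helper; the window sizes are illustrative defaults.
    """
    day = df[date_col].dt.day
    month_length = df[date_col].dt.days_in_month
    out = df.copy()
    is_start = day <= start_days
    is_end = day > month_length - end_days
    out["IsPayday"] = (is_start | is_end).astype(int)
    return out


toy = pd.DataFrame({"Date": pd.to_datetime([
    "2015-01-01", "2015-01-15", "2015-01-30",
    "2015-01-31", "2015-02-26", "2015-02-28",
])})
flagged = add_payday_flag(toy)
print(flagged)
```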

## Summary: Translating EDA to MLOps

Based on these findings, we optimized our MLOps pipeline:

- **Cleaning Layer**: Implemented `Open=0` anomaly filtering.
- **Feature Layer**: Added a `DateTransformer` for Fourier seasonality terms and an `IsPayday` feature.
- **Monitoring Layer**: Set SMAPE drift thresholds in Evidently AI based on observed variance.
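Two of the items above have compact numerical definitions. The sketch below shows Fourier seasonality terms and a plain NumPy SMAPE under stated assumptions. The function names, the yearly period of 365.25 days, the harmonic order, and the convention that a pair of zeros contributes zero error are all illustrative choices, not the behavior of `DateTransformer` or Evidently AI. How the metric is wired into a monitoring tool is not shown.

```python
import numpy as np
import pandas as pd


def fourier_features(dates: pd.Series, period: float = 365.25,
                     order: int = 2) -> pd.DataFrame:
    """Sine/cosine pairs of day-of-year for harmonics 1..order (hypothetical)."""
    t = dates.dt.dayofyear.to_numpy()
    cols = {}
    for k in range(1, order + 1):
        angle = 2 * np.pi * k * t / period
        cols[f"sin_{k}"] = np.sin(angle)
        cols[f"cos_{k}"] = np.cos(angle)
    return pd.DataFrame(cols, index=dates.index)


def smape(y_true, y_pred) -> float:
    """Symmetric MAPE in percent; pairs where both values are 0 count as 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    ratio = np.divide(np.abs(y_pred - y_true), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return float(ratio.mean() * 100)


dates = pd.Series(pd.date_range("2015-01-01", periods=10, freq="D"))
feats = fourier_features(dates)
print(feats.head())
print(f"SMAPE: {smape([100, 200], [110, 180]):.2f}%")
```

The zero-pair convention matters for this data because closed days have zero sales; without it the metric would divide by zero on a correct zero forecast.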