# SmartFlush Exploratory Data Analysis (EDA)

This notebook performs comprehensive exploratory data analysis on the SmartFlush dataset.

## Contents
1. Data Loading
2. Initial Exploration
3. Target Variable Analysis
4. Correlation Analysis
5. Multicollinearity Check (VIF)
6. Feature Distributions (Boxplots)
7. Pairplot Visualization
8. Summary Statistics

In [None]:
# Import required libraries
import sys
sys.path.append('..')  # make the project's src package importable from the notebook directory

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Import custom modules
from src.data_loading import load_and_combine_data, handle_missing_values
from src.eda import (
    perform_eda,
    calculate_pearson_correlation,
    plot_correlation_heatmap,
    plot_pairplot,
    plot_boxplots,
    plot_target_distribution,
    generate_summary_statistics,
    set_plot_style
)
from src.utils import calculate_vif, chi_squared_test

# Configure plotting
set_plot_style()
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 8)

## 1. Data Loading

In [None]:
# Load data files
data_files = [
    '../data/Combined_Data.xlsx',
    '../data/mon_fichier.xlsx'
]

# Check which files exist
existing_files = [f for f in data_files if Path(f).exists()]

if existing_files:
    df = load_and_combine_data(existing_files, how='concat')
else:
    print("No data files found. Creating synthetic data for demonstration.")
    np.random.seed(42)
    n_samples = 1000
    df = pd.DataFrame({
        'photodiode_1': np.random.randn(n_samples),
        'photodiode_2': np.random.randn(n_samples),
        'waste_level': np.random.randint(1, 6, n_samples),   # integers 1-5 (upper bound is exclusive)
        'sensor_1': np.random.randn(n_samples),
        'sensor_2': np.random.randn(n_samples),
        'flush_level': np.random.randint(1, 12, n_samples)   # integers 1-11 (upper bound is exclusive)
    })

print(f"Data shape: {df.shape}")
df.head()

## 2. Initial Exploration

In [None]:
# Display basic information
print("Dataset Info:")
df.info()

print("\nMissing Values:")
print(df.isnull().sum())

print("\nBasic Statistics:")
df.describe()
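
Raw counts from `isnull().sum()` are easier to judge as a share of rows. The following is a minimal sketch on a small made-up frame; the column values are illustrative and not taken from the SmartFlush data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    'sensor_1': [0.1, np.nan, 0.3, 0.4],
    'sensor_2': [1.0, 2.0, np.nan, np.nan],
    'flush_level': [1, 2, 3, 4],
})

# Share of missing entries per column, as a percentage, worst column first
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the notebook's `df`, the same expression would rank columns by how much data they are missing.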

## 3. Target Variable Analysis

In [None]:
# Determine target column
target_col = 'flush_level' if 'flush_level' in df.columns else df.columns[-1]
print(f"Target column: {target_col}")

# Target distribution
plot_target_distribution(df, target_col)
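
Alongside the plot, class balance can be summarized numerically. This is a sketch under the assumption that the target is a discrete level such as `flush_level`; the sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical target values; the real flush_level distribution is not known here
target = pd.Series([1, 1, 1, 2, 2, 3], name='flush_level')

# Relative frequency of each class, ordered by class label
class_share = target.value_counts(normalize=True).sort_index()

# Ratio between the most and least frequent class (1.0 means perfectly balanced)
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A large ratio would suggest considering stratified splits or class weighting at the modeling stage.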

## 4. Correlation Analysis

In [None]:
# Calculate correlations
feature_cols = df.select_dtypes(include=[np.number]).columns.tolist()
if target_col in feature_cols:
    feature_cols.remove(target_col)

corr_matrix = calculate_pearson_correlation(df, feature_cols)

# Plot heatmap
plot_correlation_heatmap(corr_matrix, title='Feature Correlation Matrix')
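
`calculate_pearson_correlation` is a project helper whose implementation is not shown in this notebook. As a rough stand-in, pandas produces a Pearson matrix with `DataFrame.corr`. The sketch below uses synthetic columns with hypothetical names and lists the feature pairs whose absolute correlation exceeds a threshold (0.8 is an arbitrary choice).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    'a': x,
    'b': 2.0 * x + 1.0,          # exact linear function of a, so r = 1
    'c': rng.normal(size=200),   # independent noise, so r is near 0
})

corr = demo.corr(method='pearson')

# Keep only the upper triangle so each pair appears once, then filter by |r|
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = upper.stack()
strong_pairs = strong_pairs[strong_pairs.abs() > 0.8]
print(strong_pairs)
```

A pair list like this is often easier to scan than a heatmap once the number of features grows.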

## 5. Multicollinearity Check (VIF)

In [None]:
# Calculate VIF
vif_df = calculate_vif(df, feature_cols)
print("Variance Inflation Factors:")
print(vif_df)

# Highlight high-VIF features (VIF > 10 is a common rule-of-thumb threshold)
high_vif = vif_df[vif_df['VIF'] > 10]
if not high_vif.empty:
    print("\nFeatures with VIF > 10 (multicollinearity):")
    print(high_vif)
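
For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features with an intercept. These values equal the diagonal of the inverse correlation matrix. The sketch below uses that identity; `vif_from_correlation` is a hypothetical helper and is not necessarily how `src.utils.calculate_vif` is implemented, so the two may report different numbers.

```python
import numpy as np
import pandas as pd

def vif_from_correlation(frame: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns, name='VIF')

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
demo = pd.DataFrame({
    'x1': x1,
    'x2': x2,
    'x3': x1 + x2 + 0.1 * rng.normal(size=500),  # nearly a linear combination of x1 and x2
})

vif = vif_from_correlation(demo)
print(vif)
```

All three columns show a high VIF here because each one is almost determined by the other two. Dropping `x3` brings the remaining values back close to 1.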

## 6. Feature Distributions (Boxplots)

In [None]:
# Boxplots for features vs target
boxplot_features = feature_cols[:6]  # slicing already handles lists shorter than 6
plot_boxplots(df, boxplot_features, target_col)
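
Boxplots in matplotlib and seaborn by default draw points beyond 1.5 times the interquartile range from the quartiles as outliers. Assuming `plot_boxplots` follows that default, the same rule can be used to count outliers directly. `iqr_outlier_mask` and the sample readings below are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up readings with one obviously extreme value
readings = pd.Series([10, 11, 12, 13, 14, 15, 100])
mask = iqr_outlier_mask(readings)
print(readings[mask])
```

Summing the mask per feature gives an outlier count that can feed the data quality notes in the conclusions.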

## 7. Pairplot Visualization

In [None]:
# Pairplot for selected features
pairplot_features = feature_cols[:4]  # slicing already handles lists shorter than 4
plot_pairplot(df, pairplot_features, target_col, sample_size=500)

## 8. Summary Statistics

In [None]:
# Generate comprehensive summary
summary_stats = generate_summary_statistics(df)
print("Summary Statistics:")
summary_stats

## Conclusions

Based on the exploratory data analysis:

1. **Data Quality**: [Add observations about missing values, outliers]
2. **Feature Relationships**: [Add observations about correlations]
3. **Multicollinearity**: [Add observations about VIF results]
4. **Target Distribution**: [Add observations about class balance]
5. **Next Steps**: [Add recommendations for modeling]