# Analysis of Energy vs Charge with Geometry Cuts

This notebook analyzes energy reconstruction across the full dataset, focusing on:
1.  **Energy vs Charge Relationship**: how `towall` cuts affect the relationship.
2.  **Correlation Evolution**: how correlations between energy and other variables change with geometry.
3.  **Model Performance**: how simple regression, multiple regression, and the cyclic model compare.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os
from sklearn.preprocessing import StandardScaler

# Add project root to path for imports
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

from visualization.plots import print_plot_reg, evolution_correlation
from models.model_utils import stepwise_aic, print_res_modeles

%matplotlib inline
sns.set_style('whitegrid')

## 1. Load Data

In [None]:
data_path = '../data/processed/df.pkl'
if not os.path.exists(data_path):
    # Fail fast: every later cell needs `df`, so stop here rather than hit a NameError downstream
    raise FileNotFoundError(
        f"{data_path} not found. Please run 01_Energy_Prefit_Pipeline.ipynb first."
    )
df = pd.read_pickle(data_path)
print(f"Loaded data from {data_path} with shape: {df.shape}")


## 2. Energy vs Charge Analysis with Towall Cuts

We sort the events by `towall` and split them into four equal-count bins to analyze the relationship between Energy and Charge in different geometric regions.
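As a minimal sketch of this slicing step, the snippet below sorts a toy DataFrame by `towall` and cuts it into contiguous chunks. The helper name `split_by_sorted_column` and all the values are hypothetical; only the column names are taken from this notebook.

```python
import numpy as np
import pandas as pd


def split_by_sorted_column(frame, column, n_splits):
    """Sort `frame` by `column` and cut it into `n_splits` contiguous chunks.

    Hypothetical helper for illustration. Each chunk holds
    ceil(len(frame) / n_splits) rows, except that trailing chunks
    can be shorter (or empty) when the rows do not divide evenly.
    """
    ordered = frame.sort_values(by=column, ascending=True)
    chunk_size = int(np.ceil(len(ordered) / n_splits))
    return [
        ordered.iloc[i * chunk_size:(i + 1) * chunk_size]
        for i in range(n_splits)
    ]


# Synthetic stand-in for the real data (all values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "towall": rng.uniform(0.0, 500.0, size=10),
    "charge_totale": rng.uniform(100.0, 1000.0, size=10),
})
toy["energy"] = 0.5 * toy["charge_totale"] + rng.normal(0.0, 5.0, size=10)

slices = split_by_sorted_column(toy, "towall", n_splits=4)
for i, chunk in enumerate(slices):
    print(i, len(chunk), chunk["towall"].min(), chunk["towall"].max())
```

Because the chunks are equal in count rather than in width, each bin covers a different `towall` range. If fixed-width bins were preferred, `pd.cut` on the `towall` column would be one alternative.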

In [None]:
if 'towall' in df.columns and 'energy' in df.columns and 'charge_totale' in df.columns:
    # Sort by towall to create consistent slices
    df_sorted = df.sort_values(by='towall', ascending=True)
    
    # Split into 4 chunks manually to avoid MemoryError with np.array_split on complex columns
    n_splits = 4
    chunk_size = int(np.ceil(len(df_sorted) / n_splits))
    liste_decoupe_towall = [df_sorted.iloc[i*chunk_size : (i+1)*chunk_size] for i in range(n_splits)]
    
    print(f"Created {len(liste_decoupe_towall)} slices based on towall variable.")
    
    # Use the legacy visualization function extracted from utilities.py
    print_plot_reg(
        liste_decoupe_towall, 
        critere_decoupe="towall", 
        cible="energy", 
        x_variable="charge_totale", 
        df=df, 
        reg=True
    )
else:
    print("Missing required columns (towall, energy, charge_totale).")

## 3. Evolution of Correlation

We examine how the correlation between energy and other key variables (charge, n_hits) evolves as events move closer to or farther from the wall.
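One way to picture this step is to compute a Pearson correlation per `towall` slice and compare the values across slices. The sketch below does that on synthetic data built so that the noise shrinks as `towall` grows. The helper `correlation_by_slice` is hypothetical and is not the project's `evolution_correlation`, whose internals are not shown here.

```python
import numpy as np
import pandas as pd


def correlation_by_slice(slices, target, variables):
    """Return one row per slice holding the Pearson correlation of each variable with `target`.

    Hypothetical helper for illustration only.
    """
    rows = [
        {var: chunk[var].corr(chunk[target]) for var in variables}
        for chunk in slices
    ]
    return pd.DataFrame(rows)


# Synthetic data (made up): the energy noise is largest at small towall.
rng = np.random.default_rng(1)
n = 400
towall = np.sort(rng.uniform(0.0, 500.0, size=n))
charge = rng.uniform(100.0, 1000.0, size=n)
noise_scale = 1.0 + (500.0 - towall) / 5.0
energy = 0.5 * charge + rng.normal(0.0, 1.0, size=n) * noise_scale

toy = pd.DataFrame({"towall": towall, "charge_totale": charge, "energy": energy})

# towall is already sorted, so positional slices are towall bins of 100 events each.
slices = [toy.iloc[i * 100:(i + 1) * 100] for i in range(4)]
corr_table = correlation_by_slice(slices, "energy", ["charge_totale"])
print(corr_table)
```

Each row of `corr_table` corresponds to one slice, ordered from the smallest to the largest `towall` values, so reading down a column shows how that variable's correlation with energy evolves.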

In [None]:
if 'towall' in df.columns and 'energy' in df.columns and 'charge_totale' in df.columns:
    evolution_correlation(liste_decoupe_towall, "towall")
else:
    print("Missing required columns for correlation evolution.")

## 4. Feature Selection (Stepwise AIC)

We use stepwise selection based on the Akaike Information Criterion (AIC) to identify the most relevant features for energy reconstruction. The target and the geometry variables are excluded from the input set.
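To illustrate the idea, here is a minimal forward-only stepwise sketch in NumPy. It assumes an ordinary least-squares model with Gaussian errors, for which the AIC equals `n * ln(RSS / n) + 2k` up to an additive constant. The project's `stepwise_aic` may work differently (for example, it may also remove features), and the function names and data below are hypothetical.

```python
import numpy as np


def gaussian_aic(X, y):
    """AIC of a least-squares fit with intercept, up to an additive constant."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    k = design.shape[1]  # number of fitted coefficients, intercept included
    return n * np.log(rss / n) + 2 * k


def forward_stepwise_aic(X, y, names):
    """Greedy forward selection: add the feature that lowers AIC most, stop when none does."""
    selected = []
    best_aic = gaussian_aic(np.empty((len(y), 0)), y)  # intercept-only model
    while len(selected) < X.shape[1]:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        aic, j = min((gaussian_aic(X[:, selected + [j]], y), j) for j in remaining)
        if aic >= best_aic:
            break
        best_aic = aic
        selected.append(j)
    return [names[j] for j in selected]


# Synthetic data (made up): only the first two columns drive the target.
rng = np.random.default_rng(2)
names = ["charge_totale", "n_hits", "noise_a", "noise_b"]
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 0.5, size=200)

selected = forward_stepwise_aic(X, y, names)
print(selected)
```

AIC penalizes each extra coefficient only mildly, so an uninformative column can occasionally be selected by chance. The informative columns should always appear in the result.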

In [None]:
# Filter for numerical columns first
df_num = df.select_dtypes(include=[np.number]).dropna()

# Exclude target and geometry variables from features
drop_cols = ['energy', 'towall', 'dwall', 'vertex_x', 'vertex_y', 'vertex_z', 'time', 'particleDir_x', 'particleDir_y', 'particleDir_z']
features_temp = [c for c in df_num.columns if c not in drop_cols]

X_all = df_num[features_temp]
y_all = df_num["energy"]

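# Standardize features for consistent selection
scaler = StandardScaler()
# Keep the original index: dropna() leaves gaps, and X_scaled must stay aligned with y_all
X_scaled = pd.DataFrame(scaler.fit_transform(X_all), columns=X_all.columns, index=X_all.index)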

print("Running Stepwise AIC selection... (this may take a moment)")
# Assumption: stepwise_aic returns the selected feature names as a list
selected_variables_aic = stepwise_aic(X_scaled, y_all)
print(f"Selected Variables (AIC): {selected_variables_aic}")

## 5. Model Comparison (Cyclic Model vs Baselines)

We compare the performance of:
1.  **Simple Regression**: Energy vs `charge_totale`.
2.  **Multiple Regression**: Energy vs charge, `n_hits`, and max/min charge.
3.  **Cyclic Model**: alternates energy and geometry predictions to refine the energy estimate.

The cyclic model uses an iterative approach:
- Estimate energy from charge and `n_hits`.
- Estimate `towall` from the predicted energy plus geometry features.
- Refine energy from the predicted `towall` plus charge features.
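The three steps above can be sketched with plain least-squares fits. This is an illustration under stated assumptions, not the implementation behind `print_res_modeles`: it assumes a linear model at every step, fits and evaluates on the same synthetic data (no train/test splits), and uses made-up coefficients. The helpers `fit_linear`, `predict_linear`, and `rmse` are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply coefficients from fit_linear to a feature matrix."""
    return np.column_stack([np.ones(X.shape[0]), X]) @ coef


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic data (made up): energy depends on both charge and towall.
rng = np.random.default_rng(3)
n = 1000
towall = rng.uniform(50.0, 500.0, size=n)
charge_totale = rng.uniform(100.0, 1000.0, size=n)
n_hits = 0.2 * charge_totale + 0.3 * towall + rng.normal(0.0, 20.0, size=n)
max_charge = 0.05 * charge_totale - 0.05 * towall + rng.normal(0.0, 2.0, size=n)
energy = 0.5 * charge_totale - 0.2 * towall + rng.normal(0.0, 5.0, size=n)

# Step 1: initial energy estimate from charge and hit count.
X1 = np.column_stack([charge_totale, n_hits])
energy_init = predict_linear(fit_linear(X1, energy), X1)

# Step 2: towall estimate from the energy prediction plus a geometry proxy.
X_tw = np.column_stack([energy_init, n_hits])
towall_pred = predict_linear(fit_linear(X_tw, towall), X_tw)

# Step 3: refined energy from the towall estimate plus charge features.
X2 = np.column_stack([towall_pred, charge_totale, max_charge])
energy_refined = predict_linear(fit_linear(X2, energy), X2)

rmse_init = rmse(energy, energy_init)
rmse_refined = rmse(energy, energy_refined)
print(f"RMSE step 1: {rmse_init:.2f}, RMSE step 3: {rmse_refined:.2f}")
```

With purely linear steps, the `towall` estimate is itself a linear combination of the step 1 inputs, so the gain in this sketch comes from the extra feature (`max_charge`) added in step 3. Nonlinear regressors at any step would change that picture.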

In [None]:
# Define variable groups for the Cyclic Model
# x_variables_1: Used for initial Energy estimation (usually charge, n_hits)
# x_variables_2: Used for Energy refinement given towall (extracted from AIC or standard list)
# x_variables_towall: Used for Towall estimation given Energy (usually n_hits or geometry proxy)

# Use selected AIC variables if available, otherwise default fallback
if 'selected_variables_aic' not in locals():
    selected_variables_aic = ["charge_totale", "n_hits", "max_charge", "min_charge", "mean_charge"]
    # Keep only the fallback columns that exist in df_num, the frame passed to print_res_modeles
    selected_variables_aic = [c for c in selected_variables_aic if c in df_num.columns]

x_variables_1 = ["charge_totale", "n_hits"]
x_variables_2 = selected_variables_aic # Use all selected variables for refinement
x_variables_towall = ["n_hits"] # Simple proxy for geometry

print("Variables for Cyclic Model:")
print(f"  Step 1 (Energy Init): {x_variables_1}")
print(f"  Step 2 (Towall Est): {x_variables_towall}")
print(f"  Step 3 (Energy Refine): {x_variables_2}")

print_res_modeles(
    df_num, 
    x_variables_1, 
    x_variables_2, 
    x_variables_towall, 
    target="energy", 
    n_splits=5, 
    test_size=0.3
)