# Per-Participant Mean Preprocessing with Gender and Missing/NaN Handling

This notebook loads five trial-level CSV files (PSY, EEG, GSR, EYE, TIVA), averages every numeric column per `Participant_ID`, retains `Gender` from the PSY dataset, merges the per-participant summaries, and saves the result as a single `participant_summary_dataset.csv`. Missing and NaN values are filled by median imputation, both before aggregation and after the merge, to ensure data completeness for modeling in Problem ID - 8: Analyzing Gender Differences Using Explainable AI.

## Key Features
- **Missing Value Handling**: Imputes NaNs with column medians before aggregation and in the final merged dataset.
- **NaN Validation**: Records how many NaNs were imputed per column in the merged dataset (`nan_report.txt`) and warns about columns that are more than 50% NaN.
- **Gender Retention**: Keeps `Gender` only from PSY.csv to avoid conflicts.
- **Output**: `participant_summary_dataset.csv` with ~38 rows (participants) and aggregated features.

## Input
- **Directory**: `C:\Users\Anish\anaconda3\envs\IITB\IITB`
- **Files**: `PSY.csv`, `EEG.csv`, `GSR.csv`, `EYE.csv`, `TIVA.csv` (trial-level data with `Participant_ID`).

## Output
- **Directory**: `processed` subfolder.
- **Files**: `participant_summary_dataset.csv` (merged dataset), `nan_report.txt` (NaN summary).

In [None]:
import pandas as pd
from pathlib import Path
from functools import reduce
import numpy as np

## Paths and Configuration

Set input and output directories. Adjust `data_dir` if your files are in a different location.

In [None]:
# Paths: absolute location of the IITB data folder on this machine.
# Edit data_dir if your CSV files live somewhere else.
data_dir = Path(r"C:\Users\Anish\anaconda3\envs\IITB\IITB")  # Change this if different
output_dir = data_dir / "processed"
output_dir.mkdir(parents=True, exist_ok=True)

# Input files
files = {
    "psy": data_dir / "PSY.csv",
    "eeg": data_dir / "EEG.csv",
    "gsr": data_dir / "GSR.csv",
    "eye": data_dir / "EYE.csv",
    "tiva": data_dir / "TIVA.csv"
}

# Validate input files exist
missing_files = [name for name, file in files.items() if not file.exists()]
if missing_files:
    raise FileNotFoundError(f"Missing files: {missing_files}")
print("All input files found.")

## Helper Function: Summarize Mean with Gender and NaN Handling

Imputes NaNs in each numeric column with that column's median (computed over all rows, across participants), then computes the mean of every numeric column per `Participant_ID` and optionally retains `Gender`.

In [None]:
def summarize_mean_with_gender(df, keep_gender=True):
    """
    Compute the mean of every numeric column per Participant_ID.

    NaNs are first filled with the column median taken over all rows
    (all participants). If keep_gender is True and a Gender column exists,
    the first Gender value per participant is attached.
    """
    participant_col = "Participant_ID"

    if participant_col not in df.columns:
        raise ValueError(f"Column '{participant_col}' not found in CSV. Columns: {df.columns.tolist()}")

    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Identify numeric columns (excluding the participant key)
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    if participant_col in numeric_cols:
        numeric_cols.remove(participant_col)

    if len(numeric_cols) == 0:
        print("Warning: No numeric columns found for aggregation.")
        # Return one row per participant so later merges do not duplicate rows
        return df[[participant_col]].drop_duplicates().reset_index(drop=True)

    # Impute NaNs with the column-wide median
    for col in numeric_cols:
        if df[col].isna().any():
            median_val = df[col].median()
            df[col] = df[col].fillna(median_val)
            print(f"Imputed NaNs in {col} with median: {median_val}")

    # Group by participant and compute mean
    grouped = df.groupby(participant_col)[numeric_cols].mean()

    # Add Gender if requested and present (first value per participant)
    if keep_gender and "Gender" in df.columns:
        grouped["Gender"] = df.groupby(participant_col)["Gender"].first()

    return grouped.reset_index()
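
One consequence of imputing before grouping is easier to see on toy data: the helper fills NaNs with the median over all rows, so a participant's missing trials are filled with a value driven by the other participants. The sketch below compares that with filling from each participant's own median. The toy frame, the `score` column, and the per-participant variant are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy trial-level data (hypothetical values): P1 scores low, P2 scores high
toy = pd.DataFrame({
    "Participant_ID": ["P1", "P1", "P1", "P2", "P2", "P2"],
    "score": [1.0, np.nan, 3.0, 10.0, 12.0, 14.0],
})

# Strategy used by the helper: fill with the median over all rows
global_filled = toy["score"].fillna(toy["score"].median())
global_means = global_filled.groupby(toy["Participant_ID"]).mean()

# Alternative (assumption): fill with each participant's own median
own_median = toy.groupby("Participant_ID")["score"].transform("median")
local_filled = toy["score"].fillna(own_median)
local_means = local_filled.groupby(toy["Participant_ID"]).mean()

print(global_means)
print(local_means)
```

Here P1's mean is about 4.67 with the global median and 2.0 with the per-participant median, while P2's mean is 12.0 either way. Which strategy is appropriate depends on how the missing trials arose.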

## Process Each Dataset

Loads each CSV, applies summarization (keeping `Gender` only from PSY), and handles NaNs.

In [None]:
participant_summaries = {}

for name, file in files.items():
    print(f"Processing {name}...")
    df = pd.read_csv(file)
    print(f"  Loaded {len(df)} rows, {len(df.columns)} columns")

    # Keep Gender only from PSY, selected by name rather than by dict order
    keep_gender = (name == "psy")
    summary = summarize_mean_with_gender(df, keep_gender=keep_gender)

    # Drop Gender from other datasets to avoid merge conflicts
    if not keep_gender and "Gender" in summary.columns:
        summary = summary.drop(columns=["Gender"])
    
    participant_summaries[name] = summary
    print(f"  Aggregated to {len(summary)} participants")

print("All datasets processed.")
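
Because only the first `Gender` value per participant is kept, inconsistent labels across a participant's trials would pass unnoticed. One possible check is sketched below on hypothetical toy data; the function name `find_inconsistent_gender` is an assumption and not part of this notebook.

```python
import pandas as pd

def find_inconsistent_gender(df, participant_col="Participant_ID", gender_col="Gender"):
    """Return the participant IDs that have more than one distinct Gender value."""
    # nunique() ignores NaN by default, so missing labels are not counted as a conflict
    n_labels = df.groupby(participant_col)[gender_col].nunique()
    return n_labels[n_labels > 1].index.tolist()

# Toy PSY-like data: participant 2 has two different labels
toy_psy = pd.DataFrame({
    "Participant_ID": [1, 1, 2, 2, 3],
    "Gender": ["F", "F", "M", "F", "M"],
})

conflicts = find_inconsistent_gender(toy_psy)
print(conflicts)
```

An empty list means every participant has a single label, in which case taking the first occurrence loses nothing.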

## Merge All Summaries

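Merges the per-participant summaries on `Participant_ID` with an outer join, so a participant missing from one file still appears in the output, then imputes the NaNs that the join introduces with column medians.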
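
If two files share a non-key column name (for example a trial counter), pandas appends `_x`/`_y` suffixes during the merge, which hides which modality a feature came from. One way to avoid this is to prefix each summary's columns with its dataset name before merging. The sketch below uses toy frames and a hypothetical helper `prefix_columns`; the column names are made up, since the real CSV columns are not listed here.

```python
from functools import reduce
import pandas as pd

def prefix_columns(summary, name, key="Participant_ID", keep=("Gender",)):
    """Prefix feature columns with the dataset name, leaving the key and `keep` columns as-is."""
    mapping = {c: f"{name}_{c}" for c in summary.columns if c != key and c not in keep}
    return summary.rename(columns=mapping)

# Toy per-participant summaries that all share a 'Trial' column (hypothetical names)
toy_summaries = {
    "psy": pd.DataFrame({"Participant_ID": [1, 2], "Trial": [2.5, 2.5], "Gender": ["F", "M"]}),
    "eeg": pd.DataFrame({"Participant_ID": [1, 2], "Trial": [2.5, 2.5], "Alpha": [0.4, 0.6]}),
    "gsr": pd.DataFrame({"Participant_ID": [1, 3], "Trial": [2.5, 2.5], "Peak": [1.1, 0.9]}),
}

prefixed = [prefix_columns(df, name) for name, df in toy_summaries.items()]
merged = reduce(
    lambda left, right: pd.merge(left, right, on="Participant_ID", how="outer"),
    prefixed,
)
print(merged.columns.tolist())
```

Every feature column now carries its source, and the outer join still keeps all three participants.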

In [None]:
# Merge all summaries
dfs = list(participant_summaries.values())
final_summary = reduce(lambda left, right: pd.merge(left, right, on="Participant_ID", how="outer"), dfs)

# Impute NaNs in final dataset with medians
numeric_cols = final_summary.select_dtypes(include="number").columns.tolist()
if "Participant_ID" in numeric_cols:
    numeric_cols.remove("Participant_ID")

nan_report = []
high_nan_cols = []
n_rows = len(final_summary)
for col in numeric_cols:
    nan_count = final_summary[col].isna().sum()
    if nan_count > 0:
        # Flag heavily missing columns before imputation hides the NaNs
        if nan_count / n_rows > 0.5:
            high_nan_cols.append(col)
        median_val = final_summary[col].median()
        final_summary[col] = final_summary[col].fillna(median_val)
        nan_report.append(f"{col}: {nan_count} NaNs imputed with median {median_val}")

# Report high NaN percentages (>50%), measured before imputation
if high_nan_cols:
    print(f"Warning: Columns with >50% NaNs: {high_nan_cols}")

# Save NaN report
with open(output_dir / "nan_report.txt", 'w') as f:
    f.write("NaN Imputation Report:\n")
    f.write("\n".join(nan_report))

print(f"Merged dataset shape: {final_summary.shape}")
print("NaN handling complete.")
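
With an outer join, a participant absent from one file receives NaNs for every feature from that file, and median imputation then fills that entire modality for them. Before imputing, it may help to see who is missing where. A minimal sketch on toy data follows; the `coverage_table` helper and the toy IDs are assumptions.

```python
import pandas as pd

def coverage_table(summaries, key="Participant_ID"):
    """Build a boolean table: one row per participant, one column per dataset."""
    id_sets = {name: set(df[key]) for name, df in summaries.items()}
    all_ids = sorted(set().union(*id_sets.values()))
    data = {name: [pid in ids for pid in all_ids] for name, ids in id_sets.items()}
    return pd.DataFrame(data, index=pd.Index(all_ids, name=key))

# Toy summaries (hypothetical IDs): only participant 3 appears everywhere
toy = {
    "psy": pd.DataFrame({"Participant_ID": [1, 2, 3]}),
    "eeg": pd.DataFrame({"Participant_ID": [1, 3]}),
    "gsr": pd.DataFrame({"Participant_ID": [2, 3, 4]}),
}

coverage = coverage_table(toy)
incomplete = coverage.index[~coverage.all(axis=1)].tolist()
print(coverage)
print("Participants missing from at least one dataset:", incomplete)
```

In this notebook the same helper could be called on `participant_summaries`. Participants listed as incomplete are candidates for exclusion rather than imputation, depending on the analysis.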

## Save Output

Saves the final merged dataset and prints summary statistics.

In [None]:
# Save output
output_file = output_dir / "participant_summary_dataset.csv"
final_summary.to_csv(output_file, index=False)

# Print summary
print(f"Participant summary saved to {output_file}")
print(f"Total participants: {len(final_summary)}")
print(f"Total columns: {len(final_summary.columns)}")
if 'Gender' in final_summary.columns:
    print(f"Gender distribution:\n{final_summary['Gender'].value_counts()}")

# Display first few rows
print("\nFirst 5 rows preview:")
print(final_summary.head())
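
A quick sanity check after saving is to reload the CSV and confirm the properties a modeling step is likely to rely on: one row per participant and no NaNs in numeric columns. The sketch below demonstrates the idea on a deliberately flawed toy frame written to a temporary directory. `check_summary` is a hypothetical helper, and the two checks are assumptions about what downstream code needs.

```python
import tempfile
from pathlib import Path
import pandas as pd

def check_summary(path, key="Participant_ID"):
    """Reload a saved summary and return a list of problems found (empty if none)."""
    df = pd.read_csv(path)
    problems = []
    if df[key].duplicated().any():
        problems.append("duplicate participant IDs")
    numeric = df.select_dtypes(include="number")
    nan_cols = numeric.columns[numeric.isna().any()].tolist()
    if nan_cols:
        problems.append(f"NaNs remain in: {nan_cols}")
    return problems

# Toy summary with two deliberate flaws: a repeated ID and a missing value
toy = pd.DataFrame({
    "Participant_ID": [1, 2, 2],
    "Gender": ["F", "M", "M"],
    "eeg_Alpha": [0.4, None, 0.6],
})

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_summary.csv"
    toy.to_csv(toy_path, index=False)
    problems = check_summary(toy_path)

print(problems)
```

Applied here, the call would be `check_summary(output_file)`, and an empty list indicates that both checks passed.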