# 01 – Data Pre-processing

This notebook loads the raw CSV files, merges them on `RTECS_ID`, imputes missing and infinite values with column medians, drops constant and quasi-constant columns, and saves the result as `train_clean.csv` for the next stage.

**Inputs**  
- `/content/drive/MyDrive/QuantumBoost2025/Dataset/train_features.csv`  
- `/content/drive/MyDrive/QuantumBoost2025/Dataset/train_labels.csv`

**Outputs**  
- `/content/drive/MyDrive/QuantumBoost2025/train_clean.csv` (features + `Toxicity_Class`, without `RTECS_ID`)

Run this notebook **first**.

In [None]:
!pip install -q pandas numpy scikit-learn

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print(f"Start: {datetime.now():%Y-%m-%d %H:%M:%S}")

## 1. Load raw data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

features = pd.read_csv('/content/drive/MyDrive/QuantumBoost2025/Dataset/train_features.csv')
labels   = pd.read_csv('/content/drive/MyDrive/QuantumBoost2025/Dataset/train_labels.csv')

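# Normalize RTECS_ID to a string without leading zeros so both files share the same key
features['RTECS_ID'] = features['RTECS_ID'].astype(str).str.lstrip('0')
labels['RTECS_ID']   = labels['RTECS_ID'].astype(str).str.lstrip('0')

# Left merge keeps every feature row; rows without a matching label get NaN in Toxicity_Class
train_df = features.merge(labels, on='RTECS_ID', how='left')
print(f"Merged shape: {train_df.shape}")
# dropna=False also counts unlabeled rows instead of silently hiding them
print(train_df['Toxicity_Class'].value_counts(dropna=False))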
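The ID normalization and left merge above can be checked on a toy example. The sketch below uses invented IDs and classes (the frames `toy_features` and `toy_labels` and the helper `normalize_id` are hypothetical, not taken from the dataset) and relies on the `validate` and `indicator` options of `DataFrame.merge` to show which feature rows found no label.

```python
import pandas as pd

# Hypothetical toy data; the IDs and classes are invented for illustration.
toy_features = pd.DataFrame({
    "RTECS_ID": ["00123", "0456", "789"],
    "feat_a": [0.1, 0.2, 0.3],
})
toy_labels = pd.DataFrame({
    "RTECS_ID": [123, 456],          # integer IDs, leading zeros already lost
    "Toxicity_Class": [1, 0],
})

def normalize_id(series: pd.Series) -> pd.Series:
    """Cast IDs to strings and strip leading zeros, as in the cell above."""
    return series.astype(str).str.lstrip("0")

toy_features["RTECS_ID"] = normalize_id(toy_features["RTECS_ID"])
toy_labels["RTECS_ID"] = normalize_id(toy_labels["RTECS_ID"])

# validate raises a MergeError if an ID is duplicated on either side;
# indicator adds a _merge column marking rows without a matching label.
merged = toy_features.merge(
    toy_labels, on="RTECS_ID", how="left",
    validate="one_to_one", indicator=True,
)
unlabeled = merged[merged["_merge"] == "left_only"]
print(f"Rows without a label: {len(unlabeled)}")
```

If unlabeled rows turn up on the real data, they would need to be dropped or handled before training, since their target is NaN.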

## 2. Basic cleaning

In [None]:
# Separate X / y
X = train_df.drop(['RTECS_ID', 'Toxicity_Class'], axis=1)
y = train_df['Toxicity_Class']

# Infinite values → NaN first, so they cannot distort the medians computed below
X = X.replace([np.inf, -np.inf], np.nan)

# Missing values (including former infinities) → column median
# (assumes every feature column is numeric)
missing_cols = X.columns[X.isna().any()].tolist()
if missing_cols:
    X[missing_cols] = X[missing_cols].fillna(X[missing_cols].median())

# Remove constant / quasi-constant features (variance <= 0.01, measured on the unscaled values)
vt = VarianceThreshold(threshold=0.01)
X_clean = pd.DataFrame(vt.fit_transform(X), columns=X.columns[vt.get_support()])
print(f"After variance filtering: {X_clean.shape[1]} features")
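The order of the cleaning steps matters: a median computed while a column still holds `inf` can be shifted by the infinite value, so infinities are best converted to NaN before any median is taken. The sketch below illustrates this on invented data, together with a pandas-only stand-in for the variance filter. `clean_features` is a hypothetical helper, not part of the notebook; it uses the population variance (`ddof=0`) and a strict `>` comparison, which is assumed here to mirror what `VarianceThreshold` does.

```python
import numpy as np
import pandas as pd

def clean_features(X: pd.DataFrame, threshold: float = 0.01) -> pd.DataFrame:
    """Hypothetical helper: impute with medians, then drop low-variance columns."""
    # 1. Turn infinities into NaN so they do not take part in the medians.
    X = X.replace([np.inf, -np.inf], np.nan)
    # 2. Fill every NaN with the median of the finite values in its column.
    X = X.fillna(X.median())
    # 3. Keep columns whose population variance exceeds the threshold.
    variances = X.var(ddof=0)
    return X.loc[:, variances > threshold]

# Invented toy data: 'a' holds an inf, 'b' a NaN, 'c' is constant.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.inf, 3.0],
    "b": [10.0, np.nan, 30.0, 20.0],
    "c": [5.0, 5.0, 5.0, 5.0],
})
cleaned = clean_features(toy)
print(cleaned)
```

Because the threshold is compared against raw variances, a feature measured on a small scale (for example, values between 0 and 0.1) can be dropped even when it varies meaningfully. Scaling the features first, or choosing the threshold per feature, is one way to account for that.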

## 3. Save clean dataset

In [None]:
clean_df = pd.concat([X_clean, y.reset_index(drop=True)], axis=1)
clean_path = '/content/drive/MyDrive/QuantumBoost2025/train_clean.csv'
clean_df.to_csv(clean_path, index=False)
print(f"Clean dataset saved to {clean_path}")
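A quick way to confirm that the saved file is usable is to read it back and check that the target column is present and that no feature value is missing or infinite. The sketch below does this on an invented frame written to a temporary directory; `check_clean_csv` is a hypothetical helper, and on the real data the path would be `clean_path` from the cell above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def check_clean_csv(path: str, target: str = "Toxicity_Class") -> pd.DataFrame:
    """Hypothetical helper: reload a cleaned CSV and run basic sanity checks."""
    df = pd.read_csv(path)
    assert target in df.columns, f"missing target column {target!r}"
    features = df.drop(columns=[target])
    assert not features.isna().any().any(), "features still contain NaN"
    assert np.isfinite(features.to_numpy(dtype=float)).all(), "features contain inf"
    return df

# Invented stand-in for the cleaned dataset.
toy_clean = pd.DataFrame({
    "feat_a": [0.1, 0.2],
    "feat_b": [1.0, 3.0],
    "Toxicity_Class": [0, 1],
})
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "train_clean.csv")
    toy_clean.to_csv(toy_path, index=False)
    reloaded = check_clean_csv(toy_path)
print(reloaded.shape)
```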