# Data Preprocessing

This notebook demonstrates the preprocessing pipeline for both datasets:

1. **BBB Dataset**: SMILES validation → RDKit descriptors → Mol2vec embeddings → Feature cleaning
2. **Breast Cancer Dataset**: Standard preprocessing → Train/test split → Feature scaling

## Key Steps:
- Molecular featurization using RDKit and Mol2vec for BBB data
- Stratified 80/20 train/test splitting, then standard scaling fitted on the training split, for both datasets
- Feature importance analysis
- Data quality checks and cleaning

In [ ]:
import sys
import os
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Chemistry and molecular processing
import requests
import tarfile
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
from gensim.models import Word2Vec

from data.preprocessing import BBBPreprocessor, BreastCancerPreprocessor

print("Libraries imported successfully!")

## 1. Load Raw Datasets

In [ ]:
# Load datasets (pd.read_excel needs an Excel engine such as openpyxl for .xlsx files)
bbb_df = pd.read_excel('../data/raw/BBBP.xlsx')
bc_df = pd.read_csv('../data/raw/breast-cancer.csv')

print("Raw dataset shapes:")
print(f"BBB dataset: {bbb_df.shape}")
print(f"Breast Cancer dataset: {bc_df.shape}")

# Quick preview
print(f"\nBBB columns: {bbb_df.columns.tolist()}")
print(f"BC columns: {bc_df.columns.tolist()[:10]}...")  # First 10 columns
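
A quick quality pass over the raw tables can catch missing values and duplicate rows before featurization. The following is a minimal sketch using only pandas; `quality_report` and the toy column names are hypothetical and are not part of this project's `data.preprocessing` module.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, key_column: str) -> dict:
    """Summarize basic data-quality signals for a raw table (hypothetical helper)."""
    return {
        "n_rows": len(df),
        "n_missing_key": int(df[key_column].isna().sum()),
        "n_duplicate_keys": int(df[key_column].dropna().duplicated().sum()),
        "n_rows_with_any_missing": int(df.isna().any(axis=1).sum()),
    }


# Toy stand-in for a raw table; the real column names may differ.
toy = pd.DataFrame(
    {
        "SMILES": ["CCO", "CCO", None, "c1ccccc1"],
        "BBB": [1, 1, 0, None],
    }
)
report = quality_report(toy, key_column="SMILES")
print(report)
```

The same call could be pointed at `bbb_df` or `bc_df` with whichever column identifies a row in those tables.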

## 2. BBB Dataset: Molecular Featurization

The BBB dataset requires specialized preprocessing:
1. SMILES validation
2. RDKit molecular descriptors
3. Mol2vec embeddings
4. Feature cleaning and consolidation
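
Steps 1 and 2 can be sketched directly with RDKit. This is an illustration only, not the internals of `BBBPreprocessor`: the helper `featurize_smiles` and the five-descriptor subset are assumptions chosen to keep the example short.

```python
import pandas as pd
from rdkit import Chem
from rdkit.ML.Descriptors import MoleculeDescriptors

# A small illustrative subset of RDKit descriptor names (a real pipeline may use many more).
DESCRIPTOR_NAMES = ["MolWt", "MolLogP", "TPSA", "NumHDonors", "NumHAcceptors"]


def featurize_smiles(smiles_list):
    """Return a DataFrame of descriptors for the SMILES strings RDKit can parse."""
    calculator = MoleculeDescriptors.MolecularDescriptorCalculator(DESCRIPTOR_NAMES)
    rows = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)  # returns None when the SMILES cannot be parsed
        if mol is None:
            continue  # validation step: skip invalid molecules
        values = calculator.CalcDescriptors(mol)
        rows.append({"SMILES": smiles, **dict(zip(DESCRIPTOR_NAMES, values))})
    return pd.DataFrame(rows)


features = featurize_smiles(["CCO", "c1ccccc1", "not_a_smiles"])
print(features)
```

The invalid third string is dropped, so the resulting frame has one row per parsable molecule.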

In [ ]:
# Download Mol2vec model if needed
model_path = "../data/external/model_300dim.pkl"
if not os.path.exists(model_path):
    print("Downloading Mol2vec model...")
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    
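    # Download the archive; a timeout and status check avoid hanging or saving an HTTP error page
    url = "https://deepchemdata.s3-us-west-1.amazonaws.com/trained_models/mol2vec_model_300dim.tar.gz"
    response = requests.get(url, timeout=60)
    response.raise_for_status()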
    tar_path = "../data/external/mol2vec_model_300dim.tar.gz"
    
    with open(tar_path, "wb") as f:
        f.write(response.content)

    # Extract the tar.gz file
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall("../data/external/")

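    # Rename the extracted file (assumes the archive contains mol2vec_model_300dim.pkl)
    extracted_path = "../data/external/mol2vec_model_300dim.pkl"
    if os.path.exists(extracted_path):
        os.rename(extracted_path, model_path)

    # Clean up the tar.gz file
    os.remove(tar_path)
    if os.path.exists(model_path):
        print("Mol2vec model downloaded and extracted!")
    else:
        print(f"Warning: expected model file not found at {model_path}")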
else:
    print("Mol2vec model already exists!")
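
Archives fetched from the network are worth extracting defensively, because a member path such as `../x` can write outside the target directory. The sketch below shows one possible guard; `safe_extract` is a hypothetical helper that checks member paths only (it does not inspect symlinks), and the demo archives are built in a temporary directory.

```python
import io
import os
import tarfile
import tempfile


def safe_extract(tar_path: str, dest_dir: str) -> list:
    """Extract a tar archive, refusing members that would land outside dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(tar_path, "r:*") as tar:
        members = tar.getmembers()
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tar.extractall(dest_root, members=members)
    return [m.name for m in members]


def make_archive(path: str, member_name: str) -> None:
    """Write a one-file .tar.gz used only for this demonstration."""
    payload = b"demo"
    info = tarfile.TarInfo(member_name)
    info.size = len(payload)
    with tarfile.open(path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)

    good = os.path.join(tmp, "good.tar.gz")
    make_archive(good, "model.pkl")
    extracted = safe_extract(good, out_dir)

    bad = os.path.join(tmp, "bad.tar.gz")
    make_archive(bad, "../evil.txt")
    try:
        safe_extract(bad, out_dir)
        rejected = False
    except ValueError:
        rejected = True

print(extracted, rejected)
```

Recent Python releases also accept a `filter` argument on `tarfile.extractall`, which may be a simpler option where it is available.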

In [ ]:
# Process BBB dataset using the preprocessor
print("Processing BBB dataset...")
bbb_preprocessor = BBBPreprocessor()

# Step 1: Load the Mol2vec model
mol2vec_model = bbb_preprocessor.load_mol2vec_model(model_path)
print(f"Mol2vec model loaded with {mol2vec_model.vector_size} dimensions")

# Step 2: Validate SMILES and compute features
try:
    bbb_processed = bbb_preprocessor.process_dataset(bbb_df, mol2vec_model)
    print(f"BBB processing completed! Final shape: {bbb_processed.shape}")
    
    # Show sample of processed features
    print(f"\nProcessed feature types:")
    feature_columns = [col for col in bbb_processed.columns if col not in ['BBB', 'SMILES']]
    print(f"- RDKit descriptors: {len([c for c in feature_columns if not c.startswith('mol2vec')])}")
    print(f"- Mol2vec features: {len([c for c in feature_columns if c.startswith('mol2vec')])}")
    print(f"- Total features: {len(feature_columns)}")
    
except Exception as e:
    print(f"Error processing BBB dataset: {e}")
    raise
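
The feature cleaning step typically removes columns that would break or add nothing to downstream models. The sketch below shows one possible policy (drop columns containing NaN or infinite values, then drop constant columns). `clean_features` is a hypothetical helper, and the actual rules inside `BBBPreprocessor.process_dataset` may differ.

```python
import numpy as np
import pandas as pd


def clean_features(features: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that contain non-finite values or never vary."""
    numeric = features.apply(pd.to_numeric, errors="coerce")
    numeric = numeric.replace([np.inf, -np.inf], np.nan)
    numeric = numeric.dropna(axis=1, how="any")       # remove columns with NaN/inf
    varying = numeric.columns[numeric.nunique() > 1]  # keep only non-constant columns
    return numeric[varying]


toy = pd.DataFrame(
    {
        "MolWt": [46.07, 78.11, 180.16],
        "constant": [1.0, 1.0, 1.0],
        "has_inf": [0.1, np.inf, 0.3],
        "has_nan": [np.nan, 2.0, 3.0],
        "mol2vec_0": [0.12, -0.40, 0.33],
    }
)
cleaned = clean_features(toy)
print(cleaned.columns.tolist())
```

Dropping whole columns keeps every molecule; an alternative policy would drop the affected rows instead, at the cost of losing samples.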

## 3. Breast Cancer Dataset: Standard Preprocessing

In [ ]:
# Process Breast Cancer dataset
print("Processing Breast Cancer dataset...")
bc_preprocessor = BreastCancerPreprocessor()

bc_processed = bc_preprocessor.process_dataset(bc_df)
print(f"BC processing completed! Final shape: {bc_processed.shape}")

# Show the processed dataset structure
print(f"\nBreast Cancer processed features:")
feature_columns = [col for col in bc_processed.columns if col != 'target']
print(f"- Number of features: {len(feature_columns)}")
print(f"- Feature names: {feature_columns[:5]}...") # First 5 features

# Check class distribution
print(f"\nClass distribution:")
print(bc_processed['target'].value_counts())

## 4. Train/Test Splitting and Scaling

In [ ]:
# BBB Dataset splitting and scaling
print("=== BBB Dataset Splitting and Scaling ===")

# Separate features and target, preserving SMILES
feature_cols_bbb = [col for col in bbb_processed.columns if col not in ['BBB', 'SMILES']]
X_bbb = bbb_processed[feature_cols_bbb]
y_bbb = bbb_processed['BBB']
smiles_bbb = bbb_processed['SMILES']

print(f"BBB features shape: {X_bbb.shape}")
print(f"BBB target shape: {y_bbb.shape}")

# Train/test split (80/20)
X_bbb_train, X_bbb_test, y_bbb_train, y_bbb_test, smiles_train, smiles_test = train_test_split(
    X_bbb, y_bbb, smiles_bbb, test_size=0.2, random_state=42, stratify=y_bbb
)

# Scale features
scaler_bbb = StandardScaler()
X_bbb_train_scaled = scaler_bbb.fit_transform(X_bbb_train)
X_bbb_test_scaled = scaler_bbb.transform(X_bbb_test)

print(f"BBB train set: {X_bbb_train_scaled.shape}")
print(f"BBB test set: {X_bbb_test_scaled.shape}")
print(f"BBB train target distribution: {np.bincount(y_bbb_train)}")
print(f"BBB test target distribution: {np.bincount(y_bbb_test)}")
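
The scaler above is fitted on the training split only, so the test split is transformed with the training mean and standard deviation and no test-set statistics leak into preprocessing. A small sketch on synthetic data (the arrays here are made up for illustration) shows that this is exactly what `StandardScaler.transform` does:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# Fit on the training split only, then reuse the fitted statistics for the test split.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The same transform written by hand, using training statistics (population std, ddof=0).
manual_test = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_test_scaled, manual_test))
print(X_train_scaled.mean(axis=0).round(6), X_test_scaled.mean(axis=0).round(3))
```

The scaled training columns have mean 0 and unit variance by construction; the scaled test columns are only approximately centered, which is expected.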

In [ ]:
# Breast Cancer Dataset splitting and scaling
print("\n=== Breast Cancer Dataset Splitting and Scaling ===")

# Separate features and target
feature_cols_bc = [col for col in bc_processed.columns if col != 'target']
X_bc = bc_processed[feature_cols_bc]
y_bc = bc_processed['target']

print(f"BC features shape: {X_bc.shape}")
print(f"BC target shape: {y_bc.shape}")

# Train/test split (80/20)
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)

# Scale features
scaler_bc = StandardScaler()
X_bc_train_scaled = scaler_bc.fit_transform(X_bc_train)
X_bc_test_scaled = scaler_bc.transform(X_bc_test)

print(f"BC train set: {X_bc_train_scaled.shape}")
print(f"BC test set: {X_bc_test_scaled.shape}")
print(f"BC train target distribution: {np.bincount(y_bc_train)}")
print(f"BC test target distribution: {np.bincount(y_bc_test)}")
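
Passing `stratify=y` keeps the class ratio the same in both splits, which matters when one class is rarer than the other. A minimal illustration on synthetic labels (30% positives, made up for this example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic samples: 70 of class 0 and 30 of class 1.
y = np.array([1] * 30 + [0] * 70)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

Both splits keep the 70/30 ratio; without `stratify`, the counts in the small test split would vary with the random seed.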

## 5. Feature Importance Analysis

In [ ]:
# Feature importance analysis using Random Forest
def analyze_feature_importance(X_train, y_train, feature_names, dataset_name, top_n=20):
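    """Fit a Random Forest and report its impurity-based feature importances.

    Prints the top_n features and shows them in a horizontal bar chart.
    Returns the fitted model, the importance array, and the sorted indices.
    """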
    
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    # Get feature importances
    importances = rf.feature_importances_
    indices = np.argsort(importances)[::-1]
    
    print(f"\n{dataset_name} - Top {top_n} Feature Importances:")
    print("-" * 60)
    for i in range(min(top_n, len(indices))):
        idx = indices[i]
        feature_name = feature_names[idx] if hasattr(feature_names, '__getitem__') else f"Feature_{idx}"
        print(f"{i+1:2d}. {feature_name}: {importances[idx]:.4f}")
    
    # Plot top features
    plt.figure(figsize=(12, 8))
    top_indices = indices[:top_n]
    top_importances = importances[top_indices]
    top_names = [feature_names[i] if hasattr(feature_names, '__getitem__') else f"Feature_{i}" 
                 for i in top_indices]
    
    # Clean up feature names for display
    display_names = [name.replace('_', ' ') for name in top_names]
    
    plt.barh(range(len(top_indices)), top_importances, color='skyblue', edgecolor='black')
    plt.yticks(range(len(top_indices)), display_names)
    plt.xlabel('Importance Score')
    plt.title(f'{dataset_name} - Top {top_n} Feature Importances')
    plt.gca().invert_yaxis()  # Highest importance at top
    plt.tight_layout()
    plt.show()
    
    return rf, importances, indices

# Analyze BBB dataset
print("Analyzing BBB dataset feature importance...")
rf_bbb, imp_bbb, idx_bbb = analyze_feature_importance(
    X_bbb_train_scaled, y_bbb_train, feature_cols_bbb, "BBB Dataset"
)

In [ ]:
# Analyze Breast Cancer dataset
print("Analyzing Breast Cancer dataset feature importance...")
rf_bc, imp_bc, idx_bc = analyze_feature_importance(
    X_bc_train_scaled, y_bc_train, feature_cols_bc, "Breast Cancer Dataset"
)
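
The importances above are impurity-based and computed on the training data. As a cross-check, permutation importance on held-out data is a commonly used complement; it is a different technique from the one in `analyze_feature_importance`. The sketch below uses synthetic data in which only the first feature carries signal, so all names and values are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(int)  # only feature 0 determines the label

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Shuffle each feature on the held-out split and measure the drop in accuracy.
result = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42)
ranking = np.argsort(result.importances_mean)[::-1]

print("Permutation ranking:", ranking)
print("Impurity-based top feature:", int(rf.feature_importances_.argmax()))
```

On the real data, the same call could take `rf_bbb` with `X_bbb_test_scaled` and `y_bbb_test`; if the two rankings disagree strongly, that is worth investigating before features are selected.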

## 6. Save Processed Data

Save the preprocessed datasets for use in subsequent experiments:

In [ ]:
# Save processed datasets
import joblib

# Create processed data directory
processed_dir = "../data/processed"
os.makedirs(processed_dir, exist_ok=True)

# Save BBB processed data
np.save(f"{processed_dir}/X_bbb_train.npy", X_bbb_train_scaled)
np.save(f"{processed_dir}/X_bbb_test.npy", X_bbb_test_scaled)
np.save(f"{processed_dir}/y_bbb_train.npy", y_bbb_train)
np.save(f"{processed_dir}/y_bbb_test.npy", y_bbb_test)
smiles_train.to_csv(f"{processed_dir}/smiles_bbb_train.csv", index=False)
smiles_test.to_csv(f"{processed_dir}/smiles_bbb_test.csv", index=False)

# Save Breast Cancer processed data
np.save(f"{processed_dir}/X_bc_train.npy", X_bc_train_scaled)
np.save(f"{processed_dir}/X_bc_test.npy", X_bc_test_scaled)
np.save(f"{processed_dir}/y_bc_train.npy", y_bc_train)
np.save(f"{processed_dir}/y_bc_test.npy", y_bc_test)

# Save scalers
joblib.dump(scaler_bbb, f"{processed_dir}/scaler_bbb.pkl")
joblib.dump(scaler_bc, f"{processed_dir}/scaler_bc.pkl")

# Save feature names
pd.Series(feature_cols_bbb).to_csv(f"{processed_dir}/feature_names_bbb.csv", index=False)
pd.Series(feature_cols_bc).to_csv(f"{processed_dir}/feature_names_bc.csv", index=False)

print("Processed data saved successfully!")
print(f"Files saved to: {processed_dir}")
print("Contents:")
for file in os.listdir(processed_dir):
    if file.endswith(('.npy', '.csv', '.pkl')):
        print(f"  - {file}")
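
A later notebook can reload these artifacts with the matching readers. The sketch below is a self-contained round trip in a temporary directory with toy data; the file names are shortened stand-ins for the ones saved above. Note that `pd.Series.to_csv(index=False)` still writes a header row, so the feature names come back as the first column of the CSV.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as processed_dir:
    # Save: mirrors the np.save / joblib.dump / to_csv pattern used above.
    X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
    feature_names = ["MolWt", "TPSA"]
    scaler = StandardScaler().fit(X_train)
    np.save(os.path.join(processed_dir, "X_train.npy"), scaler.transform(X_train))
    joblib.dump(scaler, os.path.join(processed_dir, "scaler.pkl"))
    pd.Series(feature_names).to_csv(
        os.path.join(processed_dir, "feature_names.csv"), index=False
    )

    # Reload: arrays with np.load, the scaler with joblib.load, names with read_csv.
    X_loaded = np.load(os.path.join(processed_dir, "X_train.npy"))
    scaler_loaded = joblib.load(os.path.join(processed_dir, "scaler.pkl"))
    names_loaded = (
        pd.read_csv(os.path.join(processed_dir, "feature_names.csv")).iloc[:, 0].tolist()
    )

    # The saved scaler can map scaled values back to the original units.
    X_original = scaler_loaded.inverse_transform(X_loaded)

print(names_loaded)
print(X_original)
```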

## Summary

**Preprocessing Complete!**

### BBB Dataset:
- **Input**: 2,039 SMILES strings → **Output**: 1,976 samples with 500+ features
- **Features**: RDKit molecular descriptors + Mol2vec embeddings (300D)
- **Preprocessing**: SMILES validation, descriptor calculation, feature cleaning
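- **Class distribution**: Binary classification; check the train/test target counts printed in Section 4 rather than assuming balance, since the public BBBP data is skewed toward BBB-permeable compounds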

### Breast Cancer Dataset:
- **Input**: 569 clinical samples → **Output**: 569 samples with 30 features
- **Features**: Cell-nucleus measurements computed from digitized fine needle aspirate images (e.g., radius, texture, perimeter)
- **Preprocessing**: Standard scaling, missing value handling
- **Class distribution**: 212 malignant, 357 benign

### Ready for Active Learning:
- Data split 80/20 with stratification, then scaled using statistics fitted on the training split only
- Most important features ranked with a Random Forest
- Preprocessed data saved for experiments
- Both datasets ready for comparative active learning studies

**Next Step**: Dimensionality reduction analysis and visualization