# 01 - Data Loading and Preprocessing

**TB Drug Discovery ML Pipeline - Phase 1**

This notebook covers:
1. Loading ChEMBL TB inhibitor data
2. SMILES validation and standardization
3. Activity value processing (IC50 → pIC50)
4. Data quality checks

**Target:** Clean dataset with >= 500 compounds
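
Step 3 in the list above turns a raw IC50 into pIC50, defined as the negative base-10 logarithm of the IC50 in mol/L. The sketch below illustrates that arithmetic only. The function name `ic50_to_pic50` and the `UNIT_TO_MOLAR` table are illustrative assumptions, not part of `ChEMBLLoader`, which may handle units and invalid values differently.

```python
import math

# Conversion factors to mol/L. This table is an assumption for illustration;
# ChEMBL typically reports IC50 standard_value in nM.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def ic50_to_pic50(value, units="nM"):
    """Return pIC50 = -log10(IC50 in mol/L), or None if the input is unusable."""
    factor = UNIT_TO_MOLAR.get(units)
    # Reject unknown units, missing values, NaN, and non-positive concentrations.
    if factor is None or value is None or math.isnan(value) or value <= 0:
        return None
    return -math.log10(value * factor)


print(ic50_to_pic50(1000.0))      # 1000 nM = 1 uM, so pIC50 is about 6
print(ic50_to_pic50(10.0))        # 10 nM, so pIC50 is about 8
print(ic50_to_pic50(5.0, "ug/mL"))  # unsupported unit, returns None
```

Under this definition, a pIC50 of 6 corresponds to an IC50 of 1 µM, and a higher pIC50 means a more potent compound.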

In [None]:
# Imports
import sys
from pathlib import Path

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / "src"))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from rdkit import Chem
from rdkit.Chem import Draw

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.max_columns', 50)

print("Imports successful!")

## 1. Download ChEMBL Data

This step downloads bioactivity data from ChEMBL for inhibitors of InhA, the *Mycobacterium tuberculosis* enoyl-ACP reductase.

**Option 1:** Download from ChEMBL website
1. Go to https://www.ebi.ac.uk/chembl/
2. Search for target: CHEMBL1849 (InhA)
3. Download bioactivity data as CSV

**Option 2:** Use ChEMBL API (shown below)

In [None]:
# Create data directories
data_dir = Path.cwd().parent / "data"
raw_dir = data_dir / "raw"
processed_dir = data_dir / "processed"

raw_dir.mkdir(parents=True, exist_ok=True)
processed_dir.mkdir(parents=True, exist_ok=True)

print("Data directories created:")
print(f"  Raw: {raw_dir}")
print(f"  Processed: {processed_dir}")

In [None]:
# Download from ChEMBL API
import requests

def download_chembl_data(target_id="CHEMBL1849", activity_type="IC50"):
    """Download bioactivity data from ChEMBL API."""
    
    base_url = "https://www.ebi.ac.uk/chembl/api/data/activity.json"
    
    params = {
        "target_chembl_id": target_id,
        "standard_type": activity_type,
        "limit": 1000,
    }
    
    all_activities = []
    offset = 0
    
    print(f"Downloading {activity_type} data for {target_id}...")
    
    while True:
        params["offset"] = offset
        # Set a timeout so a stalled connection cannot hang the loop
        response = requests.get(base_url, params=params, timeout=30)
        
        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            break
        
        data = response.json()
        activities = data.get("activities", [])
        
        if not activities:
            break
        
        all_activities.extend(activities)
        print(f"  Downloaded {len(all_activities)} records...")
        
        offset += len(activities)
        
        if len(activities) < params["limit"]:
            break
    
    print(f"Total: {len(all_activities)} records")
    return pd.DataFrame(all_activities)

# Download data (uncomment to run)
# df_raw = download_chembl_data()
# df_raw.to_csv(raw_dir / "chembl_inhA.csv", index=False)

print("\nIf you already have the data, load it in the next cell.")
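
The download loop above stops when a page holds fewer records than the requested limit, which assumes the server returns exactly the page size asked for. ChEMBL JSON responses also carry a `page_meta` object whose `next` field is null on the last page, so an alternative stop condition can be sketched as follows. The `fetch_page` callable and the `fake_fetch` stand-in are hypothetical helpers introduced here so the example runs offline. In the notebook, `fetch_page` would wrap the `requests.get` call.

```python
def iter_activities(fetch_page, limit=1000):
    """Yield activity records page by page until the API reports no next page.

    fetch_page is a hypothetical callable: fetch_page(offset, limit) -> parsed JSON dict.
    """
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        records = page.get("activities", [])
        yield from records
        # Stop on an empty page or when page_meta reports no further page.
        if not records or page.get("page_meta", {}).get("next") is None:
            break
        offset += len(records)


def fake_fetch(offset, limit):
    """Offline stand-in for the HTTP call, serving five fake records."""
    data = [{"activity_id": i} for i in range(5)]
    chunk = data[offset:offset + limit]
    has_more = offset + limit < len(data)
    return {"activities": chunk, "page_meta": {"next": "more" if has_more else None}}


records = list(iter_activities(fake_fetch, limit=2))
print(len(records))  # 5
```

Relying on `page_meta` keeps the loop correct even if the server were to return smaller pages than requested.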

In [None]:
# Load existing data (if already downloaded)
data_path = raw_dir / "chembl_inhA.csv"

if data_path.exists():
    df_raw = pd.read_csv(data_path)
    print(f"Loaded {len(df_raw)} records from {data_path.name}")
    print(f"\nColumns: {list(df_raw.columns)}")
else:
    print(f"Data file not found: {data_path}")
    print("Please download from ChEMBL or run the download cell above.")

## 2. Data Exploration

In [None]:
# Basic info
print("Dataset shape:", df_raw.shape)
print("\nColumn types:")
print(df_raw.dtypes)

In [None]:
# Check for key columns
key_columns = ['canonical_smiles', 'standard_value', 'standard_type', 'pchembl_value']

for col in key_columns:
    if col in df_raw.columns:
        non_null = df_raw[col].notna().sum()
        print(f"{col}: {non_null}/{len(df_raw)} non-null values")
    else:
        print(f"{col}: NOT FOUND")

In [None]:
# Activity type distribution
if 'standard_type' in df_raw.columns:
    print("Activity types:")
    print(df_raw['standard_type'].value_counts())

## 3. Data Preprocessing with ChEMBLLoader

In [None]:
from data.chembl_loader import ChEMBLLoader

# Initialize loader
loader = ChEMBLLoader(min_compounds=500, random_seed=42)

# Preprocess data
df_clean = loader.preprocess(df_raw, activity_types=["IC50"])

print(f"\nCleaned dataset: {len(df_clean)} compounds")
df_clean.head()

In [None]:
# Add activity labels
df_clean = loader.create_activity_labels(df_clean, threshold=6.0)

print("Activity distribution:")
print(df_clean['active'].value_counts())

In [None]:
# Dataset statistics
stats = loader.get_statistics(df_clean)
print("\nDataset Statistics:")
for key, value in stats.items():
    print(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")

## 4. Visualizations

In [None]:
# pIC50 distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Histogram
axes[0].hist(df_clean['pIC50'], bins=30, edgecolor='black', alpha=0.7)
axes[0].axvline(x=6.0, color='red', linestyle='--', label='Activity threshold (pIC50=6)')
axes[0].set_xlabel('pIC50')
axes[0].set_ylabel('Count')
axes[0].set_title('pIC50 Distribution')
axes[0].legend()

# Box plot by activity
df_clean.boxplot(column='pIC50', by='active', ax=axes[1])
axes[1].set_xlabel('Active (0=Inactive, 1=Active)')
axes[1].set_ylabel('pIC50')
axes[1].set_title('pIC50 by Activity Class')
plt.suptitle('')  # Remove auto-title

plt.tight_layout()
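# Create the figures directory first; savefig fails if it does not exist
figures_dir = processed_dir.parent.parent / 'results' / 'figures'
figures_dir.mkdir(parents=True, exist_ok=True)
plt.savefig(figures_dir / 'pic50_distribution.png', dpi=150)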
plt.show()

In [None]:
# Sample molecules visualization
sample_smiles = df_clean['smiles'].head(9).tolist()
mols = [Chem.MolFromSmiles(s) for s in sample_smiles]

img = Draw.MolsToGridImage(mols, molsPerRow=3, subImgSize=(300, 300))
img

## 5. Save Processed Data

In [None]:
# Save cleaned data
output_path = processed_dir / "cleaned_chembl_inhA.csv"
loader.save_processed(df_clean, str(output_path))

print(f"\nSaved {len(df_clean)} compounds to:")
print(f"  {output_path}")

## Summary

### Results:
- **Input:** Raw ChEMBL data
- **Output:** Cleaned dataset with validated SMILES and pIC50 values
- **Compounds:** See the count printed by the preprocessing cell in Section 3 (target: >= 500)

### Quality Checks:
- ✅ SMILES validated with RDKit
- ✅ Activity values converted to pIC50
- ✅ Duplicates removed
- ✅ Activity labels created (threshold: pIC50 >= 6.0)
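
To make the last two checks concrete, here is a minimal sketch of one common way to remove duplicates and assign labels: measurements sharing a SMILES string are collapsed to their median pIC50, and a compound is labelled active when that value is at least 6.0. The median strategy and the function name `aggregate_and_label` are assumptions for illustration. `ChEMBLLoader` may resolve duplicates differently, and matching on raw strings only works once SMILES have been canonicalized.

```python
from collections import defaultdict
from statistics import median


def aggregate_and_label(records, threshold=6.0):
    """Collapse repeated SMILES to their median pIC50 and add a binary label."""
    grouped = defaultdict(list)
    for smiles, pic50 in records:
        grouped[smiles].append(pic50)

    rows = []
    for smiles, values in grouped.items():
        mid = median(values)
        # active = 1 when the aggregated pIC50 meets the threshold, else 0
        rows.append({"smiles": smiles, "pIC50": mid, "active": int(mid >= threshold)})
    return rows


rows = aggregate_and_label([("CCO", 4.0), ("CCO", 5.0), ("c1ccccc1", 6.5)])
for row in rows:
    print(row)
```

Here the two `CCO` measurements collapse to a pIC50 of 4.5 (inactive), while the single `c1ccccc1` record stays at 6.5 (active).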

### Next Steps:
→ Proceed to **02_descriptor_calculation.ipynb** for molecular descriptor calculation