{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Preparation for Molecular Property Prediction\n",
    "\n",
    "This notebook demonstrates the data preparation steps for the molecular solubility prediction project.\n",
    "\n",
    "We'll perform the following tasks:\n",
    "1. Download the Delaney solubility dataset\n",
    "2. Explore the dataset and understand its structure\n",
    "3. Visualize example molecules from the dataset\n",
    "4. Generate molecular descriptors using RDKit\n",
    "5. Analyze relationships between descriptors and solubility\n",
    "6. Save the processed data for model training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Import libraries\n",
    "import os\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from rdkit import Chem\n",
    "from rdkit.Chem import Draw, PandasTools, Descriptors, Lipinski\n",
    "\n",
    "# Import from our own modules\n",
    "import sys\n",
    "sys.path.append('..')\n",
    "from src.features.molecular_descriptors import generate_features, generate_features_df\n",
    "\n",
    "# Set up plotting\n",
    "%matplotlib inline\n",
    "# 'seaborn-whitegrid' was removed from recent matplotlib releases; set the style via seaborn instead\n",
    "sns.set_style('whitegrid')\n",
    "sns.set_palette('viridis')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Download the Delaney Solubility Dataset\n",
    "\n",
    "The Delaney dataset contains measured aqueous solubility values (logS) for 1144 compounds, along with their SMILES representations. It is a commonly used benchmark dataset for QSPR (Quantitative Structure-Property Relationship) models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# URL for the Delaney dataset\n",
    "url = \"https://raw.githubusercontent.com/ESAIBMCIC/MolecularSolubilityPrediction/master/delaney.csv\"\n",
    "\n",
    "# Download the dataset\n",
    "df = pd.read_csv(url)\n",
    "\n",
    "# Create directory for raw data if it doesn't exist\n",
    "os.makedirs('../data/raw', exist_ok=True)\n",
    "\n",
    "# Save the dataset\n",
    "df.to_csv('../data/raw/delaney.csv', index=False)\n",
    "\n",
    "print(f\"Dataset downloaded and saved. Shape: {df.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Explore the Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Display the first few rows\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Examine dataset information\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Compute summary statistics\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Visualize the distribution of solubility values\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.histplot(df['logS'], kde=True, bins=30)\n",
    "plt.title('Distribution of Experimental LogS Values')\n",
    "plt.xlabel('LogS')\n",
    "plt.ylabel('Frequency')\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Visualize Some Example Molecules\n",
    "\n",
    "Let's visualize a few example molecules from the dataset to get a better understanding of the chemical structures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Convert SMILES to RDKit molecules\n",
    "def smiles_to_mol(smiles):\n",
    "    return Chem.MolFromSmiles(smiles)\n",
    "\n",
    "# Select a few example molecules\n",
    "sample_indices = [0, 10, 20, 30, 40]\n",
    "sample_smiles = df.iloc[sample_indices]['smiles'].tolist()\n",
    "sample_mols = [smiles_to_mol(smiles) for smiles in sample_smiles]\n",
    "sample_labels = [f\"LogS: {df.iloc[i]['logS']:.2f}\" for i in sample_indices]\n",
    "\n",
    "# Display molecules with their solubility values\n",
    "img = Draw.MolsToGridImage(sample_mols, molsPerRow=3, subImgSize=(300, 300), legends=sample_labels)\n",
    "display(img)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Generate Molecular Descriptors\n",
    "\n",
    "Now we'll use our custom module to generate molecular descriptors for all compounds in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Generate molecular descriptors for all compounds\n",
    "smiles_list = df['smiles'].tolist()\n",
    "features_df = generate_features_df(smiles_list)\n",
    "\n",
    "# Add experimental solubility values to the features dataframe\n",
    "features_df['logS_exp'] = df['logS'].values\n",
    "\n",
    "# Display the first few rows of the features dataframe\n",
    "features_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Check the shape of the features dataframe\n",
    "print(f\"Feature dataframe shape: {features_df.shape}\")\n",
    "\n",
    "# Check for missing values\n",
    "missing_values = features_df.isnull().sum()\n",
    "print(\"\\nFeatures with missing values:\")\n",
    "print(missing_values[missing_values > 0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Analyze Relationships Between Descriptors and Solubility\n",
    "\n",
    "Let's explore the relationships between the calculated molecular descriptors and the experimental solubility values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Calculate correlations with experimental logS\n",
    "correlations = features_df.drop('SMILES', axis=1).corr()['logS_exp'].sort_values(ascending=False)\n",
    "\n",
    "# Display top positive and negative correlations\n",
    "print(\"Top positively correlated features:\")\n",
    "print(correlations.head(10))\n",
    "print(\"\\nTop negatively correlated features:\")\n",
    "print(correlations.tail(10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Plot correlation heatmap for the top features\n",
    "top_features = correlations.index[:8].tolist() + correlations.index[-8:].tolist()\n",
    "top_features = [f for f in top_features if f != 'logS_exp']\n",
    "top_features.append('logS_exp')\n",
    "\n",
    "plt.figure(figsize=(12, 10))\n",
    "sns.heatmap(features_df[top_features].corr(), annot=True, cmap='coolwarm', linewidths=0.5)\n",
    "plt.title('Correlation Heatmap of Top Features')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
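  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Create pairplot for the features most strongly correlated with logS\n",
    "# Drop the target first: it correlates perfectly with itself and would otherwise appear twice\n",
    "feature_corr = correlations.drop('logS_exp')\n",
    "top_5_features = feature_corr.abs().sort_values(ascending=False).index[:4].tolist()\n",
    "top_5_features.append('logS_exp')\n",
    "\n",
    "# pairplot creates its own figure, so no plt.figure() call is needed\n",
    "sns.pairplot(features_df[top_5_features], diag_kind='kde', height=2.5)\n",
    "plt.suptitle('Pairplot of Top Features vs. Experimental LogS', y=1.02)\n",
    "plt.show()"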
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Save Processed Data\n",
    "\n",
    "Finally, let's save the processed data with all molecular descriptors for model training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "source": [
    "# Create directory for processed data if it doesn't exist\n",
    "os.makedirs('../data/processed', exist_ok=True)\n",
    "\n",
    "# Save the processed data\n",
    "features_df.to_csv('../data/processed/molecular_features.csv', index=False)\n",
    "print(\"Processed data saved to ../data/processed/molecular_features.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this notebook, we have:\n",
    "\n",
    "1. Downloaded the Delaney solubility dataset\n",
    "2. Explored the dataset structure and distribution of solubility values\n",
    "3. Visualized example molecules from the dataset\n",
    "4. Generated molecular descriptors using RDKit\n",
    "5. Analyzed correlations between descriptors and solubility\n",
    "6. Saved the processed data for model training\n",
    "\n",
    "We'll proceed to model development in the next notebook."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}