# Assignment 1, Task A: Classification problem.

## The data:
In this QSAR exercise, you will investigate the mutagenicity of various molecules. The dataset is the Ames Mutagenicity Dataset for Multi-Task learning, accessed via the PyTDC library; essentially the same data is available at https://huggingface.co/datasets/scikit-fingerprints/TDC_ames. Columns have been renamed for clarity.

The dataset gives the overall mutagenicity (1 = mutagen) of various drugs, each represented by its SMILES string. From the SMILES strings, molecular fingerprints can be generated and used as molecular descriptors.

## The tasks:
1) Inspect the data and clean if needed. Adhere to good practices!
2) Calculate the fingerprints (partial snippet provided) and create a feature matrix X and a target vector y.
3) Train four different models on the fingerprints and evaluate them by accuracy and ROC-AUC score to compare their performance. For each model, also assess overfitting.

These four models have to be compared:
- `KNeighborsClassifier`: choose a suitable number of neighbors
- `DecisionTreeClassifier`: use a random_state
- `RandomForestClassifier`: use a random_state and a slightly bigger forest (e.g. 200 trees)
- `GradientBoostingClassifier`: use a random_state

Other than the stated parameters, the models can mostly be used with their `scikit-learn` defaults. No hyperparameter tuning or cross-validation (CV) is required.

4) Conclusion and discussion: Provide answers to the questions.

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, ConfusionMatrixDisplay

In [None]:
df = pd.read_csv("ames_data.csv")
df.head()

Unnamed: 0,drug_id,smiles,mutagenicity
0,Drug 0,O=[N+]([O-])c1ccc2ccc3ccc([N+](=O)[O-])c4c5ccc...,1
1,Drug 1,O=[N+]([O-])c1c2c(c3ccc4cccc5ccc1c3c45)CCCC2,1
2,Drug 2,O=c1c2ccccc2c(=O)c2c1ccc1c2[nH]c2c3c(=O)c4cccc...,0
3,Drug 3,[N-]=[N+]=CC(=O)NCC(=O)NN,1
4,Drug 4,[N-]=[N+]=C1C=NC(=O)NC1=O,1


## 1. Inspect and clean the data
- Get an overview of the data, check for NaNs and duplicates, and clean if needed.
- Inspect the class balance!
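
A minimal sketch of these checks, assuming the column names shown in the `df.head()` output above. The DataFrame `toy` and its values are made up for illustration and stand in for the real `df`; which cleaning steps are appropriate for the actual data is for you to decide.

```python
import pandas as pd

# Toy stand-in for df (hypothetical values), same column names as the dataset
toy = pd.DataFrame({
    "drug_id": ["Drug 0", "Drug 1", "Drug 2", "Drug 3", "Drug 4"],
    "smiles": ["CCO", "c1ccccc1", "CCO", None, "CC(=O)O"],
    "mutagenicity": [0, 1, 0, 1, 0],
})

n_missing = toy.isna().sum()                          # NaNs per column
n_dup_smiles = toy.duplicated(subset="smiles").sum()  # repeated SMILES strings

# Drop rows without a structure or label, then drop repeated SMILES
clean = (
    toy.dropna(subset=["smiles", "mutagenicity"])
       .drop_duplicates(subset="smiles")
       .reset_index(drop=True)
)

# Fraction of each class after cleaning
class_balance = clean["mutagenicity"].value_counts(normalize=True)

print(n_missing, n_dup_smiles, class_balance, sep="\n")
```

Dropping duplicates by SMILES string only catches textual duplicates; the same molecule written as two different SMILES strings would not be detected without canonicalizing first.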

## 2. Create fingerprints from the SMILES
The partial snippet for Morgan fingerprints can be used. Note that the function returns a NumPy array per molecule rather than a dataframe; these arrays are collected in a list. From this list you can create the feature matrix and the target vector. Inspect the shape of the arrays!

In [None]:
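# Build the Morgan generator once and reuse it for every molecule
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def smiles_to_fp(smiles):
    """Return the 2048-bit Morgan fingerprint as a NumPy array, or None if the SMILES is invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = mfpgen.GetFingerprint(mol)
    return np.array(fp)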

# Convert to fingerprints
fps = []
valid_labels = []

for smiles, label in zip(df["smiles"], df["mutagenicity"]):
    fp = smiles_to_fp(smiles)
    if fp is not None:
        fps.append(fp)
        valid_labels.append(label)


Feature matrix shape: (7278, 2048)
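
One possible way to turn the two lists into arrays is sketched below. The names `fps_demo` and `labels_demo` are hypothetical stand-ins for `fps` and `valid_labels`, and the 8-bit vectors are made up so that the example runs without RDKit.

```python
import numpy as np

# Hypothetical stand-ins for the `fps` and `valid_labels` lists:
# three 8-bit "fingerprints" instead of 2048-bit ones
fps_demo = [
    np.array([0, 1, 0, 0, 1, 0, 0, 0]),
    np.array([1, 1, 0, 0, 0, 0, 1, 0]),
    np.array([0, 0, 0, 1, 0, 0, 0, 1]),
]
labels_demo = [1, 0, 1]

X = np.vstack(fps_demo).astype(np.uint8)  # one row per molecule, one column per bit
y = np.asarray(labels_demo)               # one label per molecule

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)
```

The number of rows in `X` should match the length of `y`; a mismatch would indicate that fingerprints and labels got out of sync when invalid SMILES were skipped.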


## 3. Train the models
Use a classic train-test split with a test size of 0.2, a random seed, and `stratify`. For training and prediction, note how long each model takes (this does not have to be coded; an estimate is fine). Predict labels for both the training and test splits in order to identify overfitting. Use accuracy and ROC-AUC as evaluation metrics.
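
The workflow described above could be sketched as follows. This is an illustration under stated assumptions, not a reference solution: `X_demo` and `y_demo` are synthetic data, `evaluate` is a hypothetical helper, and `random_state=42` is an arbitrary choice of seed.

```python
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic binary "fingerprints" (made up), standing in for X and y
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 64))
y_demo = (X_demo[:, :5].sum(axis=1) >= 3).astype(int)

# Stratified 80/20 split keeps the class ratio similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def evaluate(model, name):
    """Fit the model, score it on both splits, and time fit plus prediction."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    results = {"model": name}
    for split, X_s, y_s in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X_s)              # hard labels for accuracy
        proba = model.predict_proba(X_s)[:, 1]  # class-1 probability for ROC-AUC
        results[f"{split}_acc"] = accuracy_score(y_s, pred)
        results[f"{split}_auc"] = roc_auc_score(y_s, proba)
    results["seconds"] = time.perf_counter() - start
    return results

res = evaluate(RandomForestClassifier(n_estimators=200, random_state=42), "Random Forest")
print(res)
```

A large gap between the train and test scores of the same model is the usual sign of overfitting. The same helper can be called with each of the four classifiers, since all of them provide `predict_proba`.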

KNN
  Train Accuracy: 0.861
  Test  Accuracy: 0.790
  Train ROC-AUC:  0.941
  Test  ROC-AUC:  0.860
----------------------------------------
Decision Tree
  Train Accuracy: 0.999
  Test  Accuracy: 0.777
  Train ROC-AUC:  1.000
  Test  ROC-AUC:  0.772
----------------------------------------
Random Forest
  Train Accuracy: 0.999
  Test  Accuracy: 0.825
  Train ROC-AUC:  1.000
  Test  ROC-AUC:  0.901
----------------------------------------
Gradient Boosting
  Train Accuracy: 0.810
  Test  Accuracy: 0.773
  Train ROC-AUC:  0.895
  Test  ROC-AUC:  0.851
----------------------------------------


## 4. Conclusion and discussion
- Which model performed the best?
- Which was the most time efficient?
- Which model showed the worst overfitting?
- Why does ensemble learning outperform a single tree?
- Why does KNN perform well in high-dimensional fingerprint space?
- What does ROC-AUC tell us that accuracy does not?
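
As a starting point for the last question, the toy example below (made-up scores, not taken from the Ames data) shows two sets of predicted probabilities that give identical accuracy at a 0.5 threshold but rank the molecules differently. The names `scores_a` and `scores_b` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])

# Two hypothetical models' predicted probabilities for class 1
scores_a = np.array([0.10, 0.20, 0.60, 0.40, 0.80, 0.90])
scores_b = np.array([0.45, 0.48, 0.60, 0.40, 0.51, 0.52])

# Accuracy only sees the labels obtained after thresholding at 0.5
acc_a = accuracy_score(y_true, (scores_a >= 0.5).astype(int))
acc_b = accuracy_score(y_true, (scores_b >= 0.5).astype(int))

# ROC-AUC uses the scores themselves, i.e. how well positives are ranked above negatives
auc_a = roc_auc_score(y_true, scores_a)
auc_b = roc_auc_score(y_true, scores_b)

print(f"A: accuracy={acc_a:.3f}, ROC-AUC={auc_a:.3f}")
print(f"B: accuracy={acc_b:.3f}, ROC-AUC={auc_b:.3f}")
```

Both score sets produce the same thresholded labels and therefore the same accuracy, while their ROC-AUC values differ because the ordering of positives relative to negatives is not the same.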