<a href="https://colab.research.google.com/github/sofia-sunny/Introductory_Tutorials/blob/main/09_Drug_Type_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Predicting the Drug Class of a Molecule**

Predicting the type of a potential drug early in the discovery process helps researchers focus on compounds most likely to succeed. By knowing whether a molecule is likely to act as, for example, an antibiotic or antidepressant, experiments can be tailored to the appropriate biological targets. This reduces unnecessary testing, speeds up development, and lowers costs by narrowing down the candidate pool to the most relevant and promising molecules.


Molecular descriptors will be calculated for a set of drugs and used to train two classification models: **Logistic Regression (LR)** and **K-Nearest Neighbors (KNN)**. These models will learn to associate descriptor patterns with known drug types, such as antibiotics, analgesics, antidepressants, or antihistamines. Once trained, they will be applied to predict the drug type of new molecules based on their descriptors.



In [1]:
import warnings
warnings.filterwarnings('ignore')  # Ignore warnings

In [2]:
!pip install rdkit

Collecting rdkit
  Downloading rdkit-2025.3.3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.0 kB)
Downloading rdkit-2025.3.3-cp311-cp311-manylinux_2_28_x86_64.whl (34.9 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2025.3.3


### **Import necessary libraries**

In [3]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### **Create a DataFrame from the given file**
The **drug_type.csv** file contains information about various drugs, including:

**Drug:** The name of the drug.

**SMILES:** The SMILES string of each drug.

**Label:** A numerical label representing the category of the drug:

1: Antibiotics

2: Antihistamines

3: Antidepressants

4: Analgesics
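
When inspecting results, it can help to translate the numeric labels back into class names. The following is a minimal sketch, assuming a dictionary called `LABEL_NAMES` (a hypothetical helper that is not part of the original notebook) and a small hand-built frame in place of the real file:

```python
import pandas as pd

# Hypothetical helper: maps the numeric labels listed above to class names
LABEL_NAMES = {1: "Antibiotics", 2: "Antihistamines", 3: "Antidepressants", 4: "Analgesics"}

# Small illustrative frame; the drug/label pairs are taken from the dataset
example = pd.DataFrame({"Drug": ["Vancomycin", "Venlafaxine", "Oxycodone"], "Label": [1, 3, 4]})

# Add a readable class column next to the numeric label
example["Class"] = example["Label"].map(LABEL_NAMES)
print(example)
```

The same `map` call could be applied to predicted labels to make model output easier to read.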

In [8]:
url = 'https://raw.githubusercontent.com/sofia-sunny/Introductory_Tutorials/main/data/drug_type.csv'
df = pd.read_csv(url)
df.head()

| | Drug | SMILES | Label |
|---|---|---|---|
| 0 | Vancomycin | CC1C(C(CC(O1)OC2C(C(C(OC2OC3=C4C=C5C=C3OC6=C(C... | 1 |
| 1 | Venlafaxine | CN(C)CC(C1=CC=C(C=C1)OC)C2(CCCCC2)O | 3 |
| 2 | Doxycycline | CC1C2C(C3C(C(=O)C(=C(C3(C(=O)C2=C(C4=C1C=CC=C4... | 1 |
| 3 | Oxycodone | CN1CCC23C4C(=O)CCC2(C1CC5=C3C(=C(C=C5)OC)O4)O | 4 |
| 4 | Bupropion | CC(C(=O)C1=CC(=CC=C1)Cl)NC(C)(C)C | 3 |


### **Function to calculate molecular descriptors**

In [7]:
# Function to calculate molecular descriptors
def calculate_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        descriptors = {
            'MolWt': Descriptors.MolWt(mol),
            'LogP': Descriptors.MolLogP(mol),
            'NumHAcceptors': Descriptors.NumHAcceptors(mol),
            'NumHDonors': Descriptors.NumHDonors(mol),
            'TPSA': Descriptors.TPSA(mol)
        }
        return pd.Series(descriptors)
    else:
        return pd.Series({'MolWt': None, 'LogP': None, 'NumHAcceptors': None, 'NumHDonors': None, 'TPSA': None})
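
The `if mol:` branch matters because `Chem.MolFromSmiles` returns `None` when it cannot parse a string. As a quick illustration, assuming ethanol (`CCO`) as a valid example and an unclosed ring (`C1CC`) as an invalid one (both chosen here for illustration, not taken from the dataset):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Ethanol parses to a molecule, so descriptors can be computed from it
mol = Chem.MolFromSmiles("CCO")
ethanol_weight = Descriptors.MolWt(mol)
ethanol_donors = Descriptors.NumHDonors(mol)
print(f"MolWt={ethanol_weight:.2f}, NumHDonors={ethanol_donors}")

# An unclosed ring is not valid SMILES; RDKit returns None and logs a parse error
invalid = Chem.MolFromSmiles("C1CC")
print(invalid is None)
```

Rows that fail to parse end up with missing descriptor values, which is why a later step can drop them with `dropna()`.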


### **Calculate descriptors for each SMILES string**

In [21]:
# Calculate descriptors
descriptors_df = df['SMILES'].apply(calculate_descriptors)
descriptors_df.head()

| | MolWt | LogP | NumHAcceptors | NumHDonors | TPSA |
|---|---|---|---|---|---|
| 0 | 1449.271 | 0.1062 | 25.0 | 19.0 | 530.49 |
| 1 | 277.408 | 3.0356 | 3.0 | 1.0 | 32.7 |
| 2 | 444.44 | -0.5042 | 9.0 | 6.0 | 181.62 |
| 3 | 315.369 | 1.0482 | 5.0 | 1.0 | 59.0 |
| 4 | 239.746 | 3.2993 | 2.0 | 1.0 | 29.1 |


In [22]:
data_with_descriptors = pd.concat([df, descriptors_df], axis=1).dropna()  # Concatenate the descriptors with the original DataFrame
data_with_descriptors.head()

| | Drug | SMILES | Label | MolWt | LogP | NumHAcceptors | NumHDonors | TPSA |
|---|---|---|---|---|---|---|---|---|
| 0 | Vancomycin | CC1C(C(CC(O1)OC2C(C(C(OC2OC3=C4C=C5C=C3OC6=C(C... | 1 | 1449.271 | 0.1062 | 25.0 | 19.0 | 530.49 |
| 1 | Venlafaxine | CN(C)CC(C1=CC=C(C=C1)OC)C2(CCCCC2)O | 3 | 277.408 | 3.0356 | 3.0 | 1.0 | 32.7 |
| 2 | Doxycycline | CC1C2C(C3C(C(=O)C(=C(C3(C(=O)C2=C(C4=C1C=CC=C4... | 1 | 444.44 | -0.5042 | 9.0 | 6.0 | 181.62 |
| 3 | Oxycodone | CN1CCC23C4C(=O)CCC2(C1CC5=C3C(=C(C=C5)OC)O4)O | 4 | 315.369 | 1.0482 | 5.0 | 1.0 | 59.0 |
| 4 | Bupropion | CC(C(=O)C1=CC(=CC=C1)Cl)NC(C)(C)C | 3 | 239.746 | 3.2993 | 2.0 | 1.0 | 29.1 |


### **Data Preparation**

In [23]:
# Define X and y
X = data_with_descriptors[['MolWt', 'LogP', 'NumHAcceptors', 'NumHDonors', 'TPSA']]  # Features
y = data_with_descriptors['Label']  # Target variable


In [24]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
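
With a small multi-class dataset, a purely random split can leave some classes under-represented in the test set. One option, not used in the original notebook, is to pass `stratify` to `train_test_split` so each class keeps its share in both sets. The sketch below uses hypothetical stand-in data (`X_demo`, `y_demo`) rather than the real descriptors:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 40 samples, 5 features, 4 balanced classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
y_demo = np.repeat([1, 2, 3, 4], 10)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Count how many test samples fall in each class
classes, counts = np.unique(y_te, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```

On this balanced stand-in data, each of the four classes contributes two of the eight test samples.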

### **Train the model**

In [25]:
# Train the logistic regression model
model = LogisticRegression(max_iter=1000)  # Create a logistic regression model
model.fit(X_train, y_train)
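
The descriptors sit on very different scales (molecular weights in the hundreds, LogP values in single digits), which can slow solver convergence and let large-valued features dominate. A possible variation, not part of the original notebook, is to standardize the features inside a pipeline. The sketch below uses hypothetical stand-in data with one MolWt-like and one LogP-like column:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: two features on very different scales, 4 classes
rng = np.random.default_rng(0)
y_demo = np.repeat([1, 2, 3, 4], 10)
X_demo = np.column_stack([
    300 + 100 * y_demo + rng.normal(scale=20, size=40),  # MolWt-like magnitude
    rng.normal(scale=1.0, size=40),                      # LogP-like magnitude
])

# Standardize each feature, then fit the classifier on the scaled values
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)
train_accuracy = scaled_model.score(X_demo, y_demo)
print(f"Training accuracy on the stand-in data: {train_accuracy:.2f}")
```

Because the scaler is part of the pipeline, it is fitted on the training data only and then reused unchanged when predicting.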

### **Predict**

In [26]:
y_pred = model.predict(X_test)  # Make predictions on the test set


### **Evaluate**

In [20]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.62
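
Accuracy is a single number and does not show which drug classes are being confused with each other. A confusion matrix gives that breakdown. The sketch below uses hypothetical true and predicted labels, not the notebook's actual predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for an 8-molecule test set
y_true_demo = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred_demo = [1, 1, 2, 3, 3, 3, 4, 2]

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes, in the order given by labels
demo_matrix = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3, 4])
print(demo_accuracy)
print(demo_matrix)
```

In this made-up example, 6 of 8 predictions match, so the accuracy is 0.75. The off-diagonal entries show one class-2 molecule predicted as class 3 and one class-4 molecule predicted as class 2.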


### **Prediction for new molecule**

In [27]:
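# Function to predict the class of a new SMILES
def predict_new_smiles(smiles):
    descriptors = calculate_descriptors(smiles)
    if descriptors.isna().any():
        raise ValueError(f"Could not parse SMILES: {smiles}")  # Invalid SMILES yield no descriptors
    features = descriptors.to_frame().T  # One-row DataFrame that keeps the training feature names
    prediction = model.predict(features)
    return prediction[0]  # Return the predicted class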

### **Using the above function (predict_new_smiles) for a new molecule**

In [28]:
# Example of a new SMILES string with at least 20 atoms with C, N, and O
new_smiles = "CC(C)CC1=CCN=C(C=C1)C(C2=CC=COC=C2)NC3=CC=CC=C3"

# Predict the class of the new SMILES
predicted_label = predict_new_smiles(new_smiles)
print(f"The predicted label for the new SMILES is: {predicted_label}")

The predicted label for the new SMILES is: 3


### **KNN model to predict the drug type**

In [29]:
from sklearn.neighbors import KNeighborsClassifier  # Import the KNeighborsClassifier from scikit-learn

In [30]:
knn = KNeighborsClassifier(n_neighbors=2)  # Create a KNN classifier with 2 neighbors
knn.fit(X_train, y_train)  # Fit the model to the training data
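
To make the idea behind KNN concrete: a new point receives the most common label among its k closest training points. The following is a simplified pure-Python sketch with hypothetical 2-D points. The function `knn_predict` is illustrative only and is not how scikit-learn implements `KNeighborsClassifier`:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the most common label among the k training points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points belonging to two classes
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = [1, 1, 1, 3, 3]

print(knn_predict(points, labels, (1.1, 1.0), k=3))
print(knn_predict(points, labels, (5.1, 5.0), k=3))
```

The first query sits among the class-1 points and is assigned label 1. The second has two class-3 points and one class-1 point among its three nearest neighbors, so the majority vote gives label 3.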

### **Predict**

In [31]:
y_pred = knn.predict(X_test)

### **Evaluate**

In [33]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.50


### The KNN model's accuracy using **k=2** (0.50) is lower than that of the logistic regression model (0.62)


### **How about other k values?**

In [35]:
for k in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"k={k}, Accuracy: {accuracy:.2f}")

k=1, Accuracy: 0.50
k=2, Accuracy: 0.50
k=3, Accuracy: 0.75
k=4, Accuracy: 0.75
k=5, Accuracy: 0.75
k=6, Accuracy: 0.25
k=7, Accuracy: 0.25
k=8, Accuracy: 0.38
k=9, Accuracy: 0.38


### **k=3, k=4, and k=5:**
These values of k yield the highest test accuracy, **0.75**, which is also higher than the logistic regression model's 0.62. For this dataset, considering 3 to 5 neighbors appears to balance capturing the structure of the data against overfitting to individual points. Because the comparison rests on a single 20% test split, the ranking of k values should be treated as tentative.
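
Comparing k values on one test split can be noisy, and it also uses the test data to choose a hyperparameter. A common alternative, not used in the original notebook, is to compare k values by cross-validation. The sketch below runs on hypothetical stand-in data (`X_demo`, `y_demo`) rather than the real descriptors:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data: 40 samples, 5 features, 4 balanced classes
rng = np.random.default_rng(0)
y_demo = np.repeat([1, 2, 3, 4], 10)
X_demo = rng.normal(size=(40, 5)) + y_demo[:, None]  # shift each class's mean

# Mean 5-fold cross-validated accuracy for each k
cv_scores = {}
for k in range(1, 10):
    knn_cv = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn_cv, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f"Best k by cross-validation on the stand-in data: {best_k}")
```

Each score here averages over five different validation folds, so it depends less on which molecules happen to land in a single test set.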