<a href="https://colab.research.google.com/github/sofia-sunny/Introductory_Tutorials/blob/main/08_Simple_QSAR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## **Quantitative Structure-Activity Relationship (QSAR)**


Quantitative Structure-Activity Relationship **(QSAR)** is a method used in chemoinformatics and drug discovery to predict the biological activity or properties of chemical compounds based on their chemical structure. The fundamental idea behind QSAR is that **similar molecules have similar activities**, and by analyzing the relationships between molecular structures and their biological effects, one can develop models to predict the activity of new compounds.

QSAR models are widely used in drug discovery to screen large libraries of compounds, prioritizing those most likely to have desired biological effects.

![](https://drive.google.com/uc?id=1XQ0l_Bi9w_-6XY1NaOK9JIYKwf-GgLRq)




### **Example: Predicting Drug Activity Using Linear Regression**
This example shows how to use linear regression to predict the activity of a set of drug compounds from their molecular descriptors.

Suppose you have a set of molecules with known inhibitory activity against a particular enzyme. You calculate descriptors such as molecular weight, number of hydrogen bond donors, and hydrophobicity. You then **use linear regression to correlate these descriptors with the enzyme inhibition data (Activity)**. The resulting QSAR model can predict the inhibitory activity of new molecules from their descriptors.
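
As a sketch, a linear QSAR model over the four descriptors computed in this notebook can be written as shown here. The symbols $\beta_0, \dots, \beta_4$ (intercept and coefficients) are notation assumed for illustration; they do not appear in the original text.

$$
\text{Activity} \approx \beta_0 + \beta_1\,\text{MolWt} + \beta_2\,\text{MolLogP} + \beta_3\,\text{NumHDonors} + \beta_4\,\text{NumHAcceptors}
$$

Fitting the model means choosing the $\beta$ values that minimize the squared error between predicted and measured activities on the training compounds.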


![](https://drive.google.com/uc?id=1o00W4-yx2lXKQCx5G3ti0ybCprOxs4Ec)


In [None]:
# Install RDKit
!pip install rdkit

Collecting rdkit
  Downloading rdkit-2025.3.3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.0 kB)
Downloading rdkit-2025.3.3-cp311-cp311-manylinux_2_28_x86_64.whl (34.9 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2025.3.3


### **Import necessary libraries**

In [None]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

In [None]:
# sklearn libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

### **Example Data**

In [None]:
# Example dataset (SMILES and biological activity)
data = {
    'SMILES': ['CCO', 'CCC', 'CCN', 'C1CCCC1', 'CC(C)(C)C(=O)O', 'C#CC(C)(C)C', 'CCCN'],
    'Activity': [5.2, 7.8, 3.5, 10.2, 15.1, 8.5, 12.3]}


### **Define a function to calculate molecular descriptors from a given SMILES string**

In [None]:
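# Calculate molecular descriptors for one SMILES string
def calculate_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    # MolFromSmiles returns None when the SMILES string cannot be parsed
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    mw = Descriptors.MolWt(mol)                       # molecular weight
    logp = Descriptors.MolLogP(mol)                   # Crippen estimate of logP
    num_h_donors = Descriptors.NumHDonors(mol)        # hydrogen bond donors
    num_h_acceptors = Descriptors.NumHAcceptors(mol)  # hydrogen bond acceptors
    return mw, logp, num_h_donors, num_h_acceptors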
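
One detail worth knowing: `Chem.MolFromSmiles` returns `None` rather than raising an exception when a SMILES string cannot be parsed, so passing that result straight to a descriptor function fails with a less helpful error. The short check below is a sketch of how to spot such inputs. The string `'C1CC'` (an unclosed ring) is a made-up invalid example, not part of the dataset.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's parse-error messages so the output stays readable
RDLogger.DisableLog('rdApp.*')

# 'CCO' is ethanol from the dataset; 'C1CC' is a hypothetical invalid SMILES
candidates = ['CCO', 'C1CC']

# MolFromSmiles returns None for strings it cannot parse
is_valid = {smi: Chem.MolFromSmiles(smi) is not None for smi in candidates}
print(is_valid)
```

Filtering or flagging invalid entries before computing descriptors keeps a single bad string from stopping the whole descriptor loop.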

### **Create a DataFrame with descriptors and Activity**

In [None]:
# Create DataFrame with descriptors and Activity
descriptors = []
for smiles in data['SMILES']:
    mw, logp, num_h_donors, num_h_acceptors = calculate_descriptors(smiles)
    descriptors.append({'MolWt': mw, 'MolLogP': logp, 'NumHDonors': num_h_donors, 'NumHAcceptors': num_h_acceptors})

df = pd.DataFrame(descriptors)
df['Activity'] = data['Activity']

In [None]:
df.head()

| | MolWt | MolLogP | NumHDonors | NumHAcceptors | Activity |
|---|---|---|---|---|---|
| 0 | 46.069 | -0.0014 | 1 | 1 | 5.2 |
| 1 | 44.097 | 1.4163 | 0 | 0 | 7.8 |
| 2 | 45.085 | -0.035 | 1 | 1 | 3.5 |
| 3 | 70.135 | 1.9505 | 0 | 0 | 10.2 |
| 4 | 102.133 | 1.1171 | 1 | 1 | 15.1 |


### **Define X and y**

In [None]:
# Define features (X) and target (y)
X = df[['MolWt', 'MolLogP', 'NumHDonors', 'NumHAcceptors']]
y = df['Activity']

### **Split the data into training and testing sets**

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
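
With only seven compounds, it helps to check how many end up in each set. The sketch below splits seven placeholder row indices with the same `test_size` and `random_state` as the split in this notebook. `row_ids` is a stand-in for the real rows, introduced here only for illustration.

```python
from sklearn.model_selection import train_test_split

# Seven placeholder row indices standing in for the seven compounds
row_ids = list(range(7))

# Same test_size and random_state as the notebook's split
train_ids, test_ids = train_test_split(row_ids, test_size=0.2, random_state=0)

print(f"training rows: {len(train_ids)}, test rows: {len(test_ids)}")
```

scikit-learn rounds the test-set size up, so 20% of seven rows gives two test compounds and five training compounds. Any metric computed on two points is very noisy.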

### **Train the Model**

In [None]:
# Train linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
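
Once fitted, a `LinearRegression` model exposes its learned parameters as `coef_` (one weight per feature) and `intercept_`. The sketch below uses a tiny synthetic dataset generated from y = 2x + 1, an assumption made purely for illustration, to show where those values live.

```python
from sklearn.linear_model import LinearRegression

# Synthetic one-feature data that follows y = 2x + 1 exactly (illustration only)
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [1.0, 3.0, 5.0, 7.0]

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# coef_ holds one weight per feature; intercept_ is the constant term
print(f"slope: {toy_model.coef_[0]:.2f}, intercept: {toy_model.intercept_:.2f}")
```

For the QSAR model in this notebook, `dict(zip(X_train.columns, model.coef_))` would pair each descriptor name with its fitted weight.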

### **Predict!**

In [None]:
# Predict activity for the test data
y_pred_test = model.predict(X_test)

### **Evaluate!**

In [None]:
# Evaluate model performance on test set
mse_test = mean_squared_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)
print("Test Set Metrics:")
print(f"Mean Squared Error (MSE): {mse_test:.2f}")
print(f"R-squared (R2) score: {r2_test:.2f}")


Test Set Metrics:
Mean Squared Error (MSE): 8.68
R-squared (R2) score: 0.55


An **R² score of 0.55** indicates that the model explains approximately 55% of the variance in the test-set activities; the remaining 45% is unaccounted for. Because the dataset contains only seven compounds, the 20% test split holds just two of them, so this score is a rough illustration rather than a reliable estimate of model quality.
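
To make the two metrics concrete, the sketch below computes MSE and R² by hand for two made-up test points. The numbers are hypothetical and chosen for easy arithmetic; they are not the notebook's actual test values.

```python
# Hypothetical measured and predicted activities for a two-compound test set
y_true = [3.0, 7.0]
y_hat = [4.0, 6.0]

n = len(y_true)
mean_true = sum(y_true) / n

# Residual sum of squares: squared gaps between measurements and predictions
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
# Total sum of squares: squared gaps between measurements and their mean
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

mse = ss_res / n            # (1 + 1) / 2 = 1.0
r2 = 1 - ss_res / ss_tot    # 1 - 2 / 8 = 0.75

print(f"MSE: {mse:.2f}, R2: {r2:.2f}")
```

R² compares the model's squared error with that of always predicting the mean, so a value near 0 means the model does little better than the mean, and a negative value means it does worse.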


### **Predicting Activity for a New Data Point**


To predict the activity of a new data point using the trained linear regression model in our QSAR example, we'll follow these steps:

* **Calculate descriptors for the new data point:** Compute the molecular descriptors (such as molecular weight, LogP, and the numbers of hydrogen bond donors and acceptors) from its SMILES representation.
* **Format the data:** Organize the computed descriptors into the same format used during training.
* **Use the trained model:** Apply the trained linear regression model to predict the activity of the new data point from its descriptors.

In [None]:
# New data point (example SMILES)
new_smiles = 'CCOC'

### **Calculate descriptors**

In [None]:
# Calculate descriptors for the new data point
new_mw, new_logp, new_num_h_donors, new_num_h_acceptors = calculate_descriptors(new_smiles)

### **Create the DataFrame**

In [None]:
# Build a one-row DataFrame with the same columns used during training
new_data = pd.DataFrame({
    'MolWt': [new_mw],
    'MolLogP': [new_logp],
    'NumHDonors': [new_num_h_donors],
    'NumHAcceptors': [new_num_h_acceptors]
})

### **Predict Activity!**

In [None]:
# Predict activity using the trained model
predicted_activity = model.predict(new_data)
print(f"Predicted Activity for New Data Point '{new_smiles}': {predicted_activity[0]:.2f}")

Predicted Activity for New Data Point 'CCOC': 5.59


Using a linear regression QSAR model built from a dataset with activity values ranging from 3.5 to 15.1, we predicted the activity of a new compound represented by the SMILES 'CCOC'. **The model estimated its activity to be 5.59**, placing it toward the lower end of the observed activity range. This suggests that 'CCOC' may have relatively modest biological activity compared with the more potent compounds in the dataset. A prediction like this helps researchers decide whether a compound is worth testing experimentally, and such predictions can guide compound selection for synthesis or further investigation in drug development pipelines.