# Lab 1: Virtual Screening 

**Goal**: Use _in silico_ methods to screen libraries ranging from dozens to millions of molecules and identify lead compounds that can be optimized into viable drug candidates.

<img src="figures/In-Silico-Virtual-Screening-1.jpg" width="500"/>
<p style="color:lightgrey">Ref: https://drug-discovery.creative-biostructure.com/in-silico-virtual-screening-p32</p>



## Content
- [Concept](#concept)
- [QSAR modeling](01_qsar_modeling.ipynb)
- [Virtual screening & prioritization](02_virtual_screening.ipynb)
- [Lab recap](03_recap.ipynb)

<a id="concept"> </a>
## Concept
### Ligand-based virtual screening

This strategy focuses on analyzing the structures, physicochemical properties, and structure-activity relationships (SAR) of compounds with both known and unknown activities. By establishing SAR, it predicts and screens the activity of novel compounds.

In this approach, the target protein structure is not used directly to identify potential ligands. Instead, the emphasis is on finding compounds that are structurally and functionally similar to known active compounds. Ligand-based screening is an important complement to molecular docking: it is fast and versatile because it is not limited by the availability of a target structure.
 
<img src="figures/fphar-09-01275-g001.jpg" width="500"/>
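
The idea of "finding compounds similar to known actives" can be sketched with a Tanimoto similarity search. This is a minimal illustration, not the lab's implementation: the fingerprints are hypothetical sets of on-bit indices, and the compound names are placeholders. In practice the fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(query: set, library: dict) -> list:
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: each set holds the indices of bits that are "on".
known_active = {1, 4, 7, 9, 12}
library = {
    "cmpd_A": {1, 4, 7, 9, 13},   # shares 4 of 6 distinct bits with the query
    "cmpd_B": {2, 3, 5},          # shares no bits with the query
    "cmpd_C": {1, 4, 7, 9, 12},   # identical to the query
}
ranking = rank_by_similarity(known_active, library)
for name, similarity in ranking:
    print(f"{name}: {similarity:.2f}")
```

Compounds at the top of the ranking are the closest analogues of the known active and would be the first candidates to consider.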


### Screening libraries
Appropriate screening libraries provide robust starting points for virtual screening and help researchers identify and optimize lead compounds efficiently. The choice of chemical libraries should be guided by the goals of the project and the biological system of interest. It also depends on the diversity of the chemical space covered, the quality and availability of the compounds, and any associated bioactivity data.

<img src="figures/libraries.jpg" width="500"/>
<p style="color:grey">Illustration of chemical spaces of commercially available or tangible compounds.</p>
<p style="color:lightgrey">Ref: https://www.sciencedirect.com/science/article/pii/S1359644618304471#fig0015</p>


### Machine learning (ML)
With the increasing availability of extensive data sources, ML has gained significant momentum in drug discovery, particularly in ligand-based virtual screening.

**Supervised**: A learning algorithm creates rules by finding patterns in labeled training data.
- **Classification**: Identify which category an object belongs to.
- **Regression**: Predict a continuous-valued attribute associated with an object.

**Unsupervised**: A learning algorithm finds structure in data without labels.
- **Clustering**: Automatically group similar objects into sets.


#### Supervised ML algorithms
Two supervised algorithms are used in this lab:
- **Random Forest (RF)**: An ensemble of decision trees. A single decision tree splits the features of the input vector in a way that maximizes an objective function. In the random forest algorithm, the trees are de-correlated because each tree is grown on a bootstrap sample of the training data and the features considered for each split are chosen randomly.

<img src="figures/randomforest.png" width="600"/>
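
To make the de-correlation idea concrete, here is a toy forest written in plain Python. It is a sketch under strong simplifying assumptions: every "tree" is a single-split decision stump, so choosing a random feature subset per tree is the same as choosing it per split, and the objective is simply the number of training errors. The data, function names, and parameters are all hypothetical; a real project would use a library implementation such as the one in `Scikit-learn`.

```python
import random
from collections import Counter


def majority(labels, default=0):
    """Most common label, or `default` for an empty list."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X, y, features):
    """Best single-feature threshold split among `features` (fewest training errors)."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, thr, l_lab, r_lab)
    if best is None:  # no selected feature varies: always predict the majority class
        lab = majority(y)
        return (features[0], float("inf"), lab, lab)
    return best[1:]


def fit_forest(X, y, n_trees=25, max_features=2, seed=0):
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]          # bootstrap sample of rows
        feats = rng.sample(range(n_features), max_features)  # random feature subset
        forest.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [l_lab if row[f] <= thr else r_lab for f, thr, l_lab, r_lab in forest]
    return majority(votes)


# Toy data: features 0 and 1 separate the classes, feature 2 is noise.
X = [[0.1 * i, 1.0 - 0.05 * i, i % 2] for i in range(10)]
X += [[5 + 0.1 * i, 8 + 0.05 * i, i % 2] for i in range(10)]
y = [0] * 10 + [1] * 10

forest = fit_forest(X, y)
print(predict(forest, [0.5, 0.8, 1]), predict(forest, [5.5, 8.2, 0]))
```

Because each stump sees different rows and different features, individual stumps disagree on where to split, and the vote averages out their individual errors.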

- **Graph neural network (GNN)**: A category of deep neural networks that take graphs as input, which provides a way around the choice of descriptors. A GNN can take a molecule directly as input, with atoms as nodes and bonds as edges. GNN layers take a graph as input and output a graph with updated node features.
    
<img src="figures/gnn_ml.png" width="800"/>
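
The "graph in, graph out" behaviour of a GNN layer can be illustrated with a single round of message passing. This sketch is deliberately stripped down: it uses sum aggregation only, with no learned weights or non-linearity, both of which a real GNN layer would apply. The two-element atom features are a hypothetical encoding chosen for readability.

```python
def message_passing_step(node_features, edges):
    """One round of sum aggregation: each atom adds its neighbours' features to its own."""
    neighbours = {node: [] for node in node_features}
    for a, b in edges:            # bonds are undirected, so store both directions
        neighbours[a].append(b)
        neighbours[b].append(a)

    updated = {}
    for node, feats in node_features.items():
        agg = list(feats)
        for nb in neighbours[node]:
            agg = [x + y for x, y in zip(agg, node_features[nb])]
        updated[node] = agg
    return updated


def readout(node_features):
    """Sum node features into one molecule-level vector."""
    return [sum(column) for column in zip(*node_features.values())]


# Heavy atoms of ethanol (C-C-O) with hypothetical features [is_carbon, is_oxygen].
atoms = {0: [1, 0], 1: [1, 0], 2: [0, 1]}
bonds = [(0, 1), (1, 2)]

hidden = message_passing_step(atoms, bonds)
molecule_vector = readout(hidden)
print(hidden, molecule_vector)
```

After one step, each atom's vector describes its immediate neighbourhood. Stacking several steps lets information travel further across the molecule, and the readout turns the node vectors into a single input for a prediction head.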

<!-- <p style="color:sliver">Ref: https://github.com/chaitjo/geometric-gnn-dojo/blob/main/geometric_gnn_101.ipynb</p> -->


### Model evaluation and selection
Model evaluation is the process of using different evaluation metrics to understand a machine learning model’s performance, as well as its strengths and weaknesses.

#### Data splitting 

The original dataset is typically split into two, three, or more sets. The three most commonly used are the training set, the validation set, and the test set.

In drug discovery, proper data splitting helps ensure that models generalize to new, unseen chemical space. This is essential when predicting the activity of novel compounds outside the distribution of the training data (in virtual screening, the chemical space of the screening library).

Common data splitting techniques in drug discovery:
- **Random splitting**: Randomly divides the dataset into training, validation, and test sets.
- **Scaffold-based splitting**: Splits data based on the molecular scaffolds (core structures) of the compounds.
- **Temporal splitting**: Splits data based on the time of data generation, with older data used for training and newer data for testing.
- **Activity-based splitting**: Ensures that compounds with a range of activities (e.g., high, medium, low) are present in both training and test sets.
- **Cluster-based splitting**: Uses clustering algorithms to group similar compounds and then splits clusters into training and test sets.
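
Scaffold-based and cluster-based splitting share one mechanic: whole groups of related compounds go to either the training or the test set, never both. The sketch below shows that mechanic in plain Python. The scaffold labels are hypothetical placeholders assumed to be precomputed, and the "largest groups to train first" heuristic is one common choice, not the only one.

```python
from collections import defaultdict


def group_split(records, test_fraction=0.2):
    """Assign whole groups to train or test so that no group appears in both."""
    groups = defaultdict(list)
    for compound_id, group_label in records:
        groups[group_label].append(compound_id)

    # Largest groups fill the training set first; rarer groups end up in test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train_target = len(records) - int(len(records) * test_fraction)

    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical compounds labelled with a precomputed scaffold name.
records = (
    [(f"quin_{i}", "quinazoline") for i in range(5)]
    + [(f"pyr_{i}", "pyrimidine") for i in range(3)]
    + [(f"ind_{i}", "indole") for i in range(2)]
)
train_ids, test_ids = group_split(records, test_fraction=0.2)
print(len(train_ids), test_ids)
```

Here the test set contains only the indole compounds, so the model is evaluated on a scaffold it never saw during training. A random split of the same data would almost certainly mix scaffolds across both sets.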

The **choice** of splitting approach depends on factors such as: 
- the stage and purpose of the drug discovery program
- the structural diversity of the dataset
- the characteristics of the deployment set to which the predictive model will be applied


#### Validation strategies

<img src="figures/splitting.png" width="500"/>

- Train-Validation-Test Split: Divide the dataset into three subsets: training, validation, and test sets. The training set is used for model training, the validation set for hyperparameter tuning, and the test set for final evaluation.
- Cross-Validation: Partition the dataset into k-folds, train the model on k-1 folds, and validate on the remaining fold. Repeat k times and average the results.
- Leave-One-Out Cross-Validation (LOOCV): A special case of k-fold where k equals the number of samples. Train on all samples except one, and test on the excluded sample. Repeat for all samples.
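
The index bookkeeping behind k-fold cross-validation and LOOCV can be sketched as follows. This is a minimal version with no shuffling, which a real workflow would normally add before partitioning; the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(10, 5))   # 5-fold CV: 8 training and 2 validation samples per fold
loocv = list(k_fold_indices(4, 4))    # LOOCV: k equals the number of samples

for train_idx, val_idx in folds:
    print(train_idx, val_idx)
```

A model would be trained once per fold on `train_idx` and scored on `val_idx`, and the k scores would then be averaged.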

#### External Validation

- **Independent Test Set**: Evaluate the model on an independent dataset not used in any part of the model training or validation process.
- **External Benchmarks**: Compare the model’s performance against established benchmarks or datasets from different sources.



### Compound prioritization
Compound prioritization in virtual screening is the process of ranking and selecting the most promising compounds from a large pool of candidates based on predicted activities and other relevant criteria. This step is crucial in drug discovery because it saves time and resources in the experimental validation phase. Typical criteria include:

- Desired potency 
- Desired molecule properties
- ADMET properties
- Structural diversity
- Selectivity (off-target effects)
- Drug-likeness and rule-based filters (Day 3)
- Specific requirements of the drug discovery program
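
Several of these criteria can be combined in a simple greedy selection: filter on a property window, rank by predicted score, and skip compounds that are too similar to ones already picked. The sketch below is one possible scheme, not a prescribed method. The record fields (`score`, `mw`, `fp`), the similarity cut-off, and all values are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def prioritize(compounds, n_pick, mw_range, max_similarity):
    """Greedy pick: best predicted score first, skipping out-of-range or redundant compounds."""
    lo, hi = mw_range
    eligible = [c for c in compounds if lo <= c["mw"] <= hi]
    eligible.sort(key=lambda c: c["score"], reverse=True)

    picked = []
    for cand in eligible:
        if len(picked) >= n_pick:
            break
        # Keep the candidate only if it is dissimilar to everything picked so far.
        if all(tanimoto(cand["fp"], p["fp"]) <= max_similarity for p in picked):
            picked.append(cand)
    return [c["id"] for c in picked]


compounds = [
    {"id": "A", "score": 0.95, "mw": 350.0, "fp": {1, 2, 3, 4}},
    {"id": "B", "score": 0.93, "mw": 352.0, "fp": {1, 2, 3, 5}},   # close analogue of A
    {"id": "C", "score": 0.90, "mw": 450.0, "fp": {7, 8, 9}},      # outside the MW window
    {"id": "D", "score": 0.80, "mw": 300.0, "fp": {10, 11, 12}},
    {"id": "E", "score": 0.60, "mw": 290.0, "fp": {13, 14}},
]
selection = prioritize(compounds, n_pick=2, mw_range=(280, 400), max_similarity=0.5)
print(selection)
```

Compound B scores higher than D but is skipped as a near-duplicate of A, which is how the diversity criterion trades a little predicted potency for broader coverage of chemical space.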

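#### Multi-objective optimization
When several criteria must be balanced, a single compound is rarely the best on all of them. The **Pareto front** is the set of all Pareto-efficient solutions: candidates that cannot be improved on one objective without becoming worse on another.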

<img src="figures/Pareto_Efficient_Frontier_1024x1024.png" width="300"/>
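
A Pareto front can be computed directly from the definition of dominance. The sketch below assumes two objectives where higher is better for both. The compound names and scores are hypothetical, and the pairwise comparison is quadratic in the number of candidates, which is fine for illustration but slow for very large libraries.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(dominates(other, scores) for other in candidates.values())
    ]


# Hypothetical (predicted potency, drug-likeness score); higher is better for both.
candidates = {
    "cmpd_1": (8.5, 0.40),
    "cmpd_2": (7.0, 0.80),
    "cmpd_3": (6.5, 0.70),   # worse than cmpd_2 on both objectives
    "cmpd_4": (8.0, 0.60),
}
front = pareto_front(candidates)
print(front)
```

Only `cmpd_3` is excluded, because `cmpd_2` beats it on both objectives. The remaining three represent different trade-offs between potency and drug-likeness, and choosing among them is a project decision rather than a computation.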


## Next one-hour session 

In this first lab, we will delve into the realm of virtual screening. Using datasets of 2D molecules, we will develop predictive models to assess inhibitory activity against the human kinase EGFR (Epidermal Growth Factor Receptor). Building on concepts from the lectures on molecular representation, scoring, and Graph Neural Networks (GNNs) for Chemistry, we will use `PyTorch`, `PyG`, `Scikit-learn`, and other libraries to create both GNN models and classical Random Forest models with molecular fingerprints. The ultimate objective is to screen a small commercial library and select **100** promising and **diverse** molecules with molecular weight between **280 and 400 Da** for further experimental investigation.

### Lab 1 target of interest: EGFR (Epidermal Growth Factor Receptor)

The protein encoded by the EGFR gene is a transmembrane glycoprotein that is a member of the protein **kinase** superfamily. It is a receptor for members of the epidermal growth factor family. EGFR is a cell surface protein that binds to epidermal growth factor, which induces receptor dimerization and tyrosine autophosphorylation and leads to cell proliferation.

EGFR is a transmembrane protein that is frequently over-expressed and aberrantly activated in non-small cell lung cancer (NSCLC) patients. Activating EGFR mutations in NSCLC were described for the first time in 2004, and mutations in this gene are associated with lung cancer in particular.

#### Types of EGFR inhibition
- competitive 
- covalent
- allosteric

**Example of a first-generation EGFR inhibitor**\
<img src="figures/EGFR_ATP.png" width="500"/>


### Hit discovery of novel EGFR inhibitors
We have collected an EGFR binding affinity dataset from the public domain. Our project aims to leverage these publicly available data to identify more hits in chemical space.

### Targeted compound library

In this tutorial, we will focus on identifying potential inhibitors that target the ATP-binding pocket of kinases. We will perform virtual screening against 24 000 compounds from the [Hinge Binders Library](https://enamine.net/compound-libraries/targeted-libraries/kinase-library/hinge-binders-library), a commercial library specifically designed for discovering novel kinase ATP pocket binders. By using this targeted library, we aim to identify promising candidates efficiently and focus our experimental resources on the best leads.

<img src="figures/KINASE_HINGE_RDL_1.png" width="500"/>


## Setup for the next one-hour session


To run this tutorial, please ensure all dependencies below are installed. 

- `datamol`
- `molfeat`
- `splito`
- `scikit-learn`
- `pytorch`
- `pyG`
- `umap-learn`

You can install these dependencies with:
```shell
conda env create -f env.yml
```
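
After creating the environment, a quick check can confirm that each dependency is importable. This is an optional sketch, not part of the lab notebooks. It relies on the usual import names of these packages (for example `sklearn` for `scikit-learn` and `torch_geometric` for `pyG`), which are stated here as assumptions about your installation.

```python
import importlib.util

# Package name as listed above -> module name used in `import` statements.
MODULES = {
    "datamol": "datamol",
    "molfeat": "molfeat",
    "splito": "splito",
    "scikit-learn": "sklearn",
    "pytorch": "torch",
    "pyG": "torch_geometric",
    "umap-learn": "umap",
}


def check_dependencies(modules=MODULES):
    """Map each package name to True if its module can be located, else False."""
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in modules.items()
    }


status = check_dependencies()
for package, found in status.items():
    print(f"{package:<14}{'ok' if found else 'MISSING'}")
```

Any package reported as `MISSING` was not found in the active environment, which usually means the environment was not activated or its creation did not finish.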

# Questions?