This AI-based quantitative structure-activity relationship (QSAR) study for Alzheimer's disease was implemented as part of the Data Mining course in my Master's degree in AI. The project analyzes the quantitative structure-activity relationship between the Amyloid beta A4 protein and Alzheimer's disease, where the bioactivity against the protein is predicted as a pIC50 value from the molecular structure.
- Implementation Remarks
- Table of Contents
- Libraries Used
- Preprocessing Steps
- Exploratory Data Analysis
- Machine Learning Regression Problem
This project is implemented on Google Colab, so all additional packages are documented and installed using shell commands in the Colab notebook. The dataset was fetched from ChEMBL in 2021 using the ChEMBL web resource client API. Because ChEMBL is regularly updated, a snapshot of the dataset is saved for reference.
Data Retrieval API
Data Manipulation
Data Visualization
Protein Descriptors Computations
Step 1: Access the ChEMBL database and filter it to extract the data for Alzheimer's disease, where the protein studied is the Amyloid beta A4 protein
Step 2: Handle missing, duplicated, and null data
Step 3: Simplify the simplified molecular-input line-entry system (SMILES) notation, e.g. handle disconnections in the SMILES notation
Step 4: Transform attribute types according to the nature of each attribute
Step 5: Discretize bioactivity into three levels (active, intermediate, inactive) according to the standard value, then drop the intermediate rows to focus on active/inactive instances
Step 6: Normalize the IC50 value by computing pIC50 (the negative base-10 logarithm of IC50)
Step 7: Generate PaDEL descriptors from the SMILES notation using a GitHub project
Step 8: Drop the identifier attribute
Step 9: Reduce dimensionality using the VarianceThreshold method
Step 10: Split the data into training and testing sets with a 67%/33% ratio
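Steps 5 and 6 can be sketched in plain Python as follows. This is only an illustration, not the notebook's code: the cut-offs of 1,000 nM and 10,000 nM, the assumption that the standard value is reported in nM, and the `records` list with its molecule names are all hypothetical, since the write-up does not state them.

```python
import math

# Assumed cut-offs in nM; the project does not state the ones it used.
ACTIVE_MAX_NM = 1_000.0
INACTIVE_MIN_NM = 10_000.0


def bioactivity_class(ic50_nm: float) -> str:
    """Label a compound as active, intermediate, or inactive (Step 5)."""
    if ic50_nm <= ACTIVE_MAX_NM:
        return "active"
    if ic50_nm >= INACTIVE_MIN_NM:
        return "inactive"
    return "intermediate"


def pic50_from_nm(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L); with 1 nM = 1e-9 mol/L this is 9 - log10(nM) (Step 6)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


# Hypothetical records standing in for rows fetched from ChEMBL.
records = [
    {"molecule": "mol_a", "ic50_nm": 50.0},
    {"molecule": "mol_b", "ic50_nm": 5_000.0},
    {"molecule": "mol_c", "ic50_nm": 100_000.0},
]

prepared = []
for row in records:
    label = bioactivity_class(row["ic50_nm"])
    if label == "intermediate":
        continue  # Step 5 drops the intermediate rows
    prepared.append(
        {
            "molecule": row["molecule"],
            "class": label,
            "pIC50": pic50_from_nm(row["ic50_nm"]),
        }
    )
```

A lower IC50 means a more potent compound, so after the transform active compounds have the larger pIC50 values.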
The problem at hand is a regression problem: the input to the regression model is the PaDEL descriptor, which represents the fingerprint of a molecule, and the model predicts the bioactivity as a continuous pIC50 value, hence the name quantitative structure-activity relationship (QSAR). First, the LazyRegressor library is used to narrow the candidates down to four well-performing regression models. These models are then compared to select the best-performing one, which is further optimized by tuning its hyperparameters.
The LazyRegressor library runs 40 regression models, including Support Vector Machine (SVM), Random Forest (RF), AdaBoost regressor, decision tree regressor, and many more. The performance of the regression models is evaluated according to the R-squared value, Root Mean Square Error (RMSE), and computation time.
The following four models are selected for regression:
- Random Forest with 80 estimators
- Gradient Boost with 80 estimators
- Support Vector Machine with Radial Basis Function (RBF) kernel
- K Nearest Neighbor with k=10
where the performance is evaluated according to:
- Mean Absolute Error (MAE)
- R squared
- Computation time
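For reference, the first two metrics can be written out in a few lines of plain Python. This is a sketch of the standard definitions rather than the project's code (which presumably relies on a library implementation), and the `y_true` / `y_pred` values are made-up pIC50 numbers for illustration only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up pIC50 values, not results from this project.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # 0.25
r2 = r_squared(y_true, y_pred)             # 0.9
```

MAE is in the same units as pIC50, so lower is better, while R squared closer to 1 means the model explains more of the variance.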
According to the table below, the most promising regressor is random forest, so the following parameters are tuned using the grid search method to optimize the performance of the random forest model:
- n_estimators = [100, 300, 500, 800, 1200]
- max_depth = [5, 8, 15, 25, 30]
According to the grid search with cross-validation (cv = 3), the best parameter values are n_estimators = 800 and max_depth = 8.
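To give a sense of the size of this search, the sketch below enumerates the candidate combinations in plain Python. It does not train any model; it only assumes that a grid search evaluates every combination once per cross-validation fold, which is how an exhaustive grid search normally works. The variable names are illustrative.

```python
from itertools import product

# Parameter grid taken from the lists above.
param_grid = {
    "n_estimators": [100, 300, 500, 800, 1200],
    "max_depth": [5, 8, 15, 25, 30],
}
cv_folds = 3

# Every combination of one value per parameter is a candidate model.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

# Each candidate is fitted once per fold during the search.
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "model fits")
```

With 5 x 5 candidates and 3 folds the search performs 75 fits, which is why the larger forests make it noticeably slower than training a single model.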
Model | R2 Score | MAE | Execution Time |
---|---|---|---|
Random Forest | 0.7045 | 0.549 | 0.0091 |
Gradient Boosted Regressor | 0.692 | 0.61 | 0.0015 |
K Nearest Neighbor | 0.68 | 0.61 | 0.0016 |
Support Vector Machine | 0.708 | 0.61 | 0.0018 |
Optimized Random Forest | 0.928 | 0.29 | 0.0702 |