***Introduction***

This project develops a tool that uses a Bayesian network to predict the bioactivity of small molecules against target proteins. The aim is to speed up drug discovery using a dataset of 100 known small molecules together with their molecular descriptors, target protein activity, and bioactivity. After preprocessing, we will partition the dataset into training, validation, and testing sets.

**Purpose and Benefits:**

The tool is intended primarily for drug discovery and development. By rapidly identifying candidate molecules for further experimental validation and drug design, it can streamline research efforts. It can also support virtual screening of compound libraries, prioritization of compounds for experimental testing, and optimization of lead compounds for better efficacy and safety profiles.

**Background:**

Bioactivity prediction needs methods that are both efficient and accurate. We will evaluate the Bayesian network model using accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC). We also plan a comparative analysis against other commonly used machine learning algorithms such as support vector machines (SVM), random forests, and potentially deep learning approaches.

If the Bayesian network model proves limiting, we will consider alternatives such as deep learning with convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Ensemble methods such as stacking or boosting could also be explored to combine the strengths of multiple models and improve accuracy and robustness.

***Setup Instructions:***

GitHub Link for the code:

Steps to Clone the project:

Installation of the packages:

pgmpy, pandas, NumPy, scikit-learn (imported as sklearn), Matplotlib, seaborn.

Upload the dataset to Colab:

*   Click the folder icon in the Colab sidebar
*   Click the upload button to upload a file to session storage
*   Select the file downloaded from GitHub
*   After uploading, check the name of the dataset file








***Step-by-Step Instructions to Run the Code***

**Install Required Packages:**

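First, install pgmpy by running `!pip install pgmpy` in a Colab or Jupyter notebook cell (drop the leading `!` when running in a regular shell). The other packages are typically preinstalled in Colab.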

**Load Libraries and Dataset:**


*   Import necessary libraries like pandas, numpy, pgmpy, sklearn, matplotlib, and seaborn.
*   Load your dataset. Replace 'path_to_your_dataset.xlsx' with the actual path to your dataset.
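A minimal sketch of the loading step, assuming the dataset is an Excel or CSV file. The helper name `load_dataset` is hypothetical and not part of the project code; it only picks a pandas reader based on the file extension.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Load the dataset into a DataFrame, choosing the reader by file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix in {".xlsx", ".xls"}:
        # Reading .xlsx files requires the openpyxl package.
        return pd.read_excel(path)
    if suffix == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"Unsupported file type: {path.suffix}")


# Example (replace with the actual path to your dataset):
# df = load_dataset("path_to_your_dataset.xlsx")
# print(df.head())
```

Calling `df.head()` right after loading is a quick way to confirm that the column names match what the later steps expect.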


**Explore the Dataset:**

Check the first few rows of your dataset to ensure it's loaded correctly.

**Define Bayesian Network Structure:**

Define the structure of the Bayesian Network by specifying the directed edges between nodes.

**Fit the Bayesian Network:**

Fit the Bayesian Network using Maximum Likelihood Estimator (MLE).

**Print Conditional Probability Tables (CPTs):**

Print the learned Conditional Probability Tables (CPTs) for each node in the network.

**Discretize Continuous Variables:**


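*   Discretize continuous variables like 'LogP' and 'Molecular Weight' into discrete categories. If these variables are nodes in the network, discretize them before fitting so that each node has a small, fixed set of states.
*   Adjust the discretization strategy according to your dataset.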
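A short sketch of two common discretization strategies using pandas. The bin edges, labels, and the output column names `LogP_bin` and `MW_bin` are assumptions chosen for illustration, not values taken from the project.

```python
import pandas as pd

# Toy descriptor values; a real dataset would supply these columns.
df = pd.DataFrame({
    "LogP": [0.5, 1.2, 2.8, 3.1, 4.7, 5.3],
    "Molecular Weight": [150.0, 220.0, 310.0, 405.0, 480.0, 560.0],
})

# Fixed-edge binning: the cut points below are illustrative assumptions.
df["LogP_bin"] = pd.cut(
    df["LogP"],
    bins=[-float("inf"), 2.0, 4.0, float("inf")],
    labels=["low", "medium", "high"],
)

# Quantile binning: each bin receives roughly the same number of molecules.
df["MW_bin"] = pd.qcut(
    df["Molecular Weight"], q=3, labels=["light", "medium", "heavy"]
)

print(df)
```

Fixed edges are easier to interpret when domain thresholds exist, while quantile bins keep the states balanced, which helps when the dataset is small.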

**Handle Missing States:**


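*   Provide evidence for the observed variables and perform inference to predict the target variable. Check that each evidence state actually occurs in the training data, since a state the model has never seen cannot be used as evidence.
*   Adjust the evidence according to your dataset.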

**Cross-Validation:**

Split the data into training and testing sets using k-fold cross-validation.
Fit the model for each fold and evaluate its performance.
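The cross-validation loop can be sketched as below. To keep the example dependent only on scikit-learn, a Gaussian naive Bayes classifier is used as a stand-in for the Bayesian network, and the data are synthetic. In the project, the fit and predict calls inside the loop would be replaced by fitting the pgmpy model on the training fold and running inference on the test fold.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic data: 100 molecules, 2 descriptors, binary activity label.
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accuracies, fold_f1_scores = [], []

for train_idx, test_idx in kf.split(X):
    clf = GaussianNB()  # stand-in model, not the project's Bayesian network
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    fold_accuracies.append(accuracy_score(y[test_idx], y_pred))
    fold_f1_scores.append(f1_score(y[test_idx], y_pred))

print("Mean accuracy:", np.mean(fold_accuracies))
print("Mean F1 score:", np.mean(fold_f1_scores))
```

A fresh model is created in every fold so that no information from the test fold leaks into training.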

**Generate ROC Curve:**

Plot Receiver Operating Characteristic (ROC) curve to evaluate model performance.
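A minimal sketch of computing and plotting a ROC curve with scikit-learn and Matplotlib. The labels and scores are toy values; in practice `y_score` would be the predicted probability of the active class.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc, roc_curve

# Toy labels and predicted probabilities for the positive (active) class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the decision threshold and returns one (FPR, TPR) point per step.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC (AUC = {roc_auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")
```

An AUC of 0.5 corresponds to the dashed chance line, and values closer to 1 indicate better ranking of active over inactive molecules.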

**Visualize Data Distribution:**

Visualize the distribution of 'LogP' and 'Molecular Weight' using histograms.
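A small sketch of the histogram step using Matplotlib directly (seaborn's `histplot` would work equally well). The DataFrame here is filled with synthetic values as a placeholder for the real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic placeholder values; replace with the columns of your dataset.
df = pd.DataFrame({
    "LogP": rng.normal(2.5, 1.0, 100),
    "Molecular Weight": rng.normal(350.0, 75.0, 100),
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, column in zip(axes, ["LogP", "Molecular Weight"]):
    # dropna() keeps missing descriptor values from breaking the histogram.
    ax.hist(df[column].dropna(), bins=10, edgecolor="black")
    ax.set_title(f"Distribution of {column}")
    ax.set_xlabel(column)
    ax.set_ylabel("Count")
fig.tight_layout()
```

Looking at these distributions before choosing bin edges helps avoid discretization bins that end up nearly empty.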

**Generate Confusion Matrix:**

Generate a confusion matrix to evaluate classification performance.
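A brief sketch of building a confusion matrix with scikit-learn, using toy labels where 1 means active and 0 means inactive.

```python
from sklearn.metrics import confusion_matrix

# Toy ground-truth and predicted labels (1 = active, 0 = inactive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes and columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For a binary problem, ravel() yields the four counts in this order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Passing `labels` explicitly keeps the row and column order stable even if one class happens to be absent from a fold.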

**Calculate Metrics:**

Calculate metrics like accuracy, precision, recall, and F1-score.
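The metrics can be computed with scikit-learn as sketched below on toy labels. With 2 true positives, 2 false negatives, 1 false positive, and 5 true negatives, accuracy is 7/10, precision is 2/3, recall is 2/4, and F1 is 4/7.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels (1 = active, 0 = inactive): TP=2, FN=2, FP=1, TN=5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / total
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of precision and recall
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy looks better than F1 because the inactive class dominates, which is why F1 is reported alongside accuracy for imbalanced data.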

**Adjust Threshold for F1 Score:**

Adjust the decision threshold to optimize F1 score.
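One simple way to tune the threshold is a grid search over candidate values, keeping the one with the highest F1 score. The helper `best_f1_threshold` and the toy probabilities below are hypothetical; the grid from 0.05 to 0.95 is an arbitrary choice.

```python
import numpy as np
from sklearn.metrics import f1_score


def best_f1_threshold(y_true, y_prob, thresholds=None):
    """Return the (threshold, F1) pair with the highest F1 on the given data."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    scores = [
        f1_score(y_true, (y_prob >= t).astype(int), zero_division=0)
        for t in thresholds
    ]
    best = int(np.argmax(scores))  # first maximum wins on ties
    return float(thresholds[best]), float(scores[best])


# Toy labels and predicted probabilities of the active class.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_prob = np.array([0.12, 0.22, 0.31, 0.37, 0.46, 0.62, 0.71, 0.88])

threshold, score = best_f1_threshold(y_true, y_prob)
print(f"Best threshold: {threshold:.2f}, F1: {score:.3f}")
```

On this toy data the search prefers a threshold below the default 0.5 because it recovers one more active molecule at the cost of a single false positive. The threshold should be chosen on training or validation data, not on the test fold, to avoid an optimistic estimate.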

**Final Cross-Validation and Evaluation:**

Perform k-fold cross-validation again and evaluate the model's performance.

**Plot F1 Scores and Accuracies:**

Plot F1 scores and accuracies for each fold.
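A possible plotting helper for the per-fold scores is sketched below. The function name `plot_fold_scores` is hypothetical, and the numbers passed to it are placeholders for illustration only, not results from this project.

```python
import matplotlib.pyplot as plt


def plot_fold_scores(f1_scores, accuracies):
    """Plot per-fold F1 scores and accuracies on a shared axis."""
    folds = list(range(1, len(f1_scores) + 1))
    fig, ax = plt.subplots()
    ax.plot(folds, f1_scores, marker="o", label="F1 score")
    ax.plot(folds, accuracies, marker="s", label="Accuracy")
    ax.set_xticks(folds)
    ax.set_xlabel("Fold")
    ax.set_ylabel("Score")
    ax.set_ylim(0, 1)
    ax.legend()
    return fig, ax


# Placeholder values only; substitute the lists collected during cross-validation.
fig, ax = plot_fold_scores(
    [0.70, 0.75, 0.68, 0.80, 0.72],
    [0.75, 0.80, 0.70, 0.85, 0.75],
)
```

Large differences between folds suggest that the estimates are sensitive to how the small dataset is split.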

***Results***

**Cross-Validation and Model Evaluation:**



*   The data is split into five folds using scikit-learn's K-Fold cross-validation (KFold).
*   For each fold, the model is trained on the training data and then used to make predictions on the test data.
*   The predictions are evaluated using Receiver Operating Characteristic (ROC) curves, area under the curve (AUC), confusion matrices, and F1 scores.

**ROC Curve and AUC:**


*   The code generates ROC curves for each fold of the cross-validation and calculates the area under the curve (AUC) to quantify the model's performance in terms of true positive rate vs. false positive rate.
*   The mean ROC curve and mean AUC across all folds are also calculated and plotted.
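Per-fold ROC curves have different FPR points, so averaging them usually means interpolating each curve onto a common grid first. The sketch below shows one common way to do this; the helper `mean_roc` and the two toy folds are hypothetical, and the mean AUC here is the average of the per-fold AUCs rather than the area under the averaged curve.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve


def mean_roc(fold_results, grid_size=101):
    """Average ROC curves over folds given (y_true, y_score) pairs, one per fold."""
    mean_fpr = np.linspace(0.0, 1.0, grid_size)
    tprs, aucs = [], []
    for y_true, y_score in fold_results:
        fpr, tpr, _ = roc_curve(y_true, y_score)
        aucs.append(auc(fpr, tpr))
        # Resample this fold's curve onto the shared FPR grid.
        interp_tpr = np.interp(mean_fpr, fpr, tpr)
        interp_tpr[0] = 0.0  # every ROC curve starts at (0, 0)
        tprs.append(interp_tpr)
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0  # every ROC curve ends at (1, 1)
    return mean_fpr, mean_tpr, float(np.mean(aucs))


# Two toy folds of labels and predicted probabilities.
folds = [
    (np.array([0, 0, 1, 1]), np.array([0.1, 0.4, 0.35, 0.8])),
    (np.array([0, 0, 1, 1]), np.array([0.1, 0.2, 0.8, 0.9])),
]
mean_fpr, mean_tpr, mean_auc = mean_roc(folds)
print(f"Mean AUC: {mean_auc:.3f}")
```

The returned `mean_fpr` and `mean_tpr` arrays can be plotted on the same axes as the individual fold curves.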

**Confusion Matrix:**


*   Confusion matrices are generated for each fold to visualize the model's performance in terms of true positives, true negatives, false positives, and false negatives.
*   These matrices help in understanding the distribution of actual vs. predicted classes and identifying any misclassifications made by the model.

**F1 Score:**

*   F1 scores are calculated for each fold to summarize the balance between the model's precision and recall.
*   The F1 score is the harmonic mean of precision and recall, providing a single metric for the model's performance at a given decision threshold.













***Conclusion***

The project successfully developed an intuitive tool employing a Bayesian network to predict the bioactivity of small molecules against target proteins.

Computational predictive models such as the Bayesian network developed in this project could accelerate drug discovery by helping to identify novel therapeutics with higher efficacy and fewer side effects.

Future work could involve expanding the dataset, exploring advanced machine learning techniques, addressing ethical and regulatory considerations, and collaborating with domain experts to enhance model interpretability and relevance.