import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import umap
import seaborn as sns
import os 


from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
from rdkit import Chem
from rdkit.Chem import Descriptors, AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator
from rdkit.Chem import Draw
from rdkit.Chem import rdMolDescriptors


from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.svm import SVR


### import the smiles_tg.csv file into a dataframe and identify how many unique smiles strings are in the dataset

### drop the duplicate smiles strings in the dataset, keep the first entry and reset the index
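### a minimal sketch of the two steps above. The column names "SMILES" and "Tg" are assumptions about smiles_tg.csv,
### and the tiny dataframe below is a made-up stand-in for pd.read_csv("smiles_tg.csv") so the snippet runs on its own.

```python
import pandas as pd

# Made-up stand-in for pd.read_csv("smiles_tg.csv"); column names are assumed.
df = pd.DataFrame({
    "SMILES": ["*CC*", "*CC(*)C", "*CC*", "*CC(*)c1ccccc1"],
    "Tg": [-120.0, -10.0, -118.0, 100.0],  # illustrative values only
})

# Count the distinct SMILES strings.
n_unique = df["SMILES"].nunique()
print(f"{n_unique} unique SMILES out of {len(df)} rows")

# Keep the first occurrence of each SMILES string and rebuild a clean 0..n-1 index.
df_unique = df.drop_duplicates(subset="SMILES", keep="first").reset_index(drop=True)
```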

### we are going to use the RDKit library to convert the SMILES strings into molecule objects,
### from which we will compute the features for our models
### let's first generate basic descriptors for the molecules

### here you will write a basic function called get_basic_descriptors that takes a SMILES string as input and returns a dictionary of descriptors
### the descriptors are the following:
### 'MW': molecular weight
### 'HBD': number of hydrogen bond donors
### 'HBA': number of hydrogen bond acceptors
### 'TPSA': topological polar surface area
### 'Rotatable_Bonds': number of rotatable bonds
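### one possible sketch of get_basic_descriptors. Returning None for a SMILES string that RDKit cannot parse
### is an assumption; the exercise does not say how invalid input should be handled.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors


def get_basic_descriptors(smiles):
    """Return a dictionary of basic descriptors for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # assumption: signal unparsable input with None
        return None
    return {
        "MW": Descriptors.MolWt(mol),
        "HBD": rdMolDescriptors.CalcNumHBD(mol),
        "HBA": rdMolDescriptors.CalcNumHBA(mol),
        "TPSA": rdMolDescriptors.CalcTPSA(mol),
        "Rotatable_Bonds": rdMolDescriptors.CalcNumRotatableBonds(mol),
    }


# Quick check on ethanol.
ethanol = get_basic_descriptors("CCO")
print(ethanol)
```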

### next, create a function called get_morgan_fingerprint that takes a SMILES string and generates a Morgan fingerprint
### with a radius of 2 and a length of nBits = 1024
### return the fingerprint as a list

### hint: the Input parameters to the function should be
### smiles, radius, nBits
### the function should return the fingerprint as a list
### this can be done in a very few lines of code (don't complicate it)
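### a short sketch of get_morgan_fingerprint, assuming the GetMorganGenerator API imported at the top of the notebook.
### the generator calls the fingerprint length fpSize, so nBits is passed through to that argument.

```python
from rdkit import Chem
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator


def get_morgan_fingerprint(smiles, radius=2, nBits=1024):
    """Return the Morgan bit-vector fingerprint of a SMILES string as a list of 0/1."""
    mol = Chem.MolFromSmiles(smiles)
    generator = GetMorganGenerator(radius=radius, fpSize=nBits)
    return list(generator.GetFingerprint(mol))


fp = get_morgan_fingerprint("CCO")
print(len(fp), sum(fp))  # length of the fingerprint and number of bits set
```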


### next, create a function called get_topological_fingerprint that generates topological fingerprints from SMILES strings.
### these fingerprints capture the 2D structural features of molecules.
### the function should take in a SMILES string and nBits=25
### and return the fingerprint as a list
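### a sketch of get_topological_fingerprint. Using Chem.RDKFingerprint (RDKit's path-based topological fingerprint)
### is an assumption; the exercise does not name the exact RDKit call it expects.

```python
from rdkit import Chem


def get_topological_fingerprint(smiles, nBits=25):
    """Return the RDKit topological fingerprint of a SMILES string as a list of 0/1."""
    mol = Chem.MolFromSmiles(smiles)
    fp = Chem.RDKFingerprint(mol, fpSize=nBits)  # fpSize sets the number of bits
    return list(fp)


topo_fp = get_topological_fingerprint("c1ccccc1O")
print(topo_fp)
```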


### we are now going to use these functions to generate features for our models.
### let's start with the get_basic_descriptors function: use it to convert the SMILES strings in the dataset to features
### to do this, create a list that contains the descriptors for each SMILES string in the dataset
### remember, your function returns a dictionary of descriptors, so you will end up with a single list in which each SMILES string is represented by one dictionary

### if you did this correctly and print the output, the first entry should look similar to this:

### [{'MW': 167.188, 'HBD': 0, 'HBA': 5, 'TPSA': 75.99, 'Rotatable_Bonds': 0},{.....}]


### convert this list of dictionaries into a dataframe called df_descriptors
### next, let's scale the features using the StandardScaler from sklearn.preprocessing
### fit and transform the dataframe; we'll call the scaled data X
### now that your data is transformed, fit a KMeans clustering model to the scaled data (X)
### use n_clusters = 5 and random_state = 0
### extract the labels using kmeans.labels_ and add them to df_descriptors as a new column called 'cluster'
### next, use the PCA algorithm to reduce the dimensionality of X to 2 dimensions
### add the pca1 and pca2 columns to the df_descriptors dataframe
### finally, use the seaborn library to create a scatter plot
### set data = df_descriptors, x = pca1, y = pca2, hue = cluster
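### the scaling, clustering, PCA and plotting steps above could be sketched as follows.
### the descriptor table here is random synthetic data (not real molecules) so the snippet runs on its own,
### and n_init=10 is an added assumption that only pins down the number of KMeans restarts.

```python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the descriptor table built from get_basic_descriptors.
rng = np.random.default_rng(0)
df_descriptors = pd.DataFrame({
    "MW": rng.uniform(50, 500, 60),
    "HBD": rng.integers(0, 5, 60),
    "HBA": rng.integers(0, 10, 60),
    "TPSA": rng.uniform(0, 150, 60),
    "Rotatable_Bonds": rng.integers(0, 12, 60),
})
feature_cols = list(df_descriptors.columns)

# Scale first, so that large-valued descriptors such as MW do not dominate the distances.
X = StandardScaler().fit_transform(df_descriptors[feature_cols])

# Cluster the scaled features and store the labels.
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10).fit(X)
df_descriptors["cluster"] = kmeans.labels_

# Project the same scaled features onto two principal components.
components = PCA(n_components=2).fit_transform(X)
df_descriptors["pca1"] = components[:, 0]
df_descriptors["pca2"] = components[:, 1]

ax = sns.scatterplot(data=df_descriptors, x="pca1", y="pca2", hue="cluster")
```

### note that X is computed from the descriptor columns only, before 'cluster', 'pca1' and 'pca2' are added to the dataframe.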

### next, let's explore UMAP. UMAP is a dimensionality reduction technique used to visualize high-dimensional data
### take the list of dictionaries we created earlier, and create another dataframe from it.
### use this new dataframe to create a UMAP plot with 2 dimensions
### set the UMAP parameters to n_neighbors = 15, min_dist = 0.1, n_components = 2
### when making the UMAP plot, color the points by the Tg values from the original dataframe and add a color bar


### next, create a train/test split of the data, using the featurized dataset and the Tg values
### set test_size = 0.3 and random_state = 42
### train a linear model (ridge regression) and a non-linear model (random forest) on the training data
### test the models on the test set and print the r2 and rmse scores for each model
### create 2 parity plots, one for each model
### in either of these parity plots, can you spot a clear outlier in the predictions?
### if we were to remove this outlier, would our model improve?
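### the split, fit, score and parity-plot workflow might look like the sketch below.
### X and y are synthetic (a noisy linear target), so the printed scores say nothing about the real Tg data.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the scaled descriptors (X) and the Tg values (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Ridge": Ridge(),
    "Random forest": RandomForestRegressor(random_state=42),
}
results = {}
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE is the square root of the MSE
    results[name] = (r2, rmse)
    print(f"{name}: R2 = {r2:.3f}, RMSE = {rmse:.3f}")

    # Parity plot: points on the dashed diagonal are perfect predictions.
    ax.scatter(y_test, y_pred, alpha=0.6)
    lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
    ax.plot(lims, lims, "k--")
    ax.set_xlabel("Actual")
    ax.set_ylabel("Predicted")
    ax.set_title(name)
```

### an outlier shows up in a parity plot as a point far from the dashed diagonal.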

### next, let's see if GridSearchCV can improve the performance of the models.
### repeat the above exercise, but this time use GridSearchCV to find the best hyperparameters for the models
### for the ridge model, set the param_grid to alpha = [0.01, 0.1, 1.0, 10, 100], and set cv = 5, scoring = 'r2' and n_jobs = -1.
### for the random forest model, search over param_grid_rf = {
###    'n_estimators': [50, 100, 200],
###    'max_depth': [None, 5, 10],}
### print the best score and the best parameters for each model
### create a model from the best parameters and test the model on the test set
### do you see an improvement in the performance of the models?
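### a sketch of the grid search using the grids given above, again on synthetic data.
### with the default refit=True, best_estimator_ is already retrained on the whole training set with the best
### parameters, so it can be scored on the test set directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-ins for the features and the Tg values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_grid_ridge = {"alpha": [0.01, 0.1, 1.0, 10, 100]}
param_grid_rf = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}

grid_ridge = GridSearchCV(Ridge(), param_grid_ridge, cv=5, scoring="r2", n_jobs=-1)
grid_rf = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid_rf, cv=5, scoring="r2", n_jobs=-1
)

test_scores = {}
for name, grid in {"Ridge": grid_ridge, "Random forest": grid_rf}.items():
    grid.fit(X_train, y_train)
    print(f"{name}: best CV R2 = {grid.best_score_:.3f}, best params = {grid.best_params_}")

    # Score the refit best model on the held-out test set.
    y_pred = grid.best_estimator_.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    test_scores[name] = (r2, rmse)
    print(f"{name}: test R2 = {r2:.3f}, test RMSE = {rmse:.3f}")
```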

### next, let's generate features from the get_morgan_fingerprint function
### create one list called morgan_fingerprints containing the Morgan fingerprint of each SMILES string in the dataset
### next, create a new dataframe from the morgan_fingerprints list
### next, split the dataset into train and test sets using the Tg values as your target variable,
### set the test_size = 0.3 and random_state = 42
### standardize the data using the StandardScaler, then train a linear ridge (alpha=0.1) model and a non-linear random forest (n_estimators=100, random_state=42) model
### test the models on the test set and print the r2 and rmse scores for each model
### plot a parity plot for each model

### next, let's use the get_topological_fingerprint function to generate 2D structural features for our SMILES strings
### create a list called topological_fps containing the topological fingerprint of each SMILES string in the dataset
### and convert it to a dataframe. Create a train/test split of the data using the Tg values as the target variable
### set the test_size = 0.3 and random_state = 42.
### for this exercise we are going to train a support vector regression (SVR) model and a random forest model.
### we will also create a 3rd model, called ensemble, made up of both of these models.
### using the VotingRegressor from sklearn.ensemble, create a model that combines the SVR and random forest models.
### test each model, then print the r2 and rmse scores and create a parity plot for each of the 3 models.
### do we see any improvement in performance when we combine the models into an ensemble?
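### a sketch of the ensemble step on synthetic data. Wrapping the SVR in a pipeline with a StandardScaler is an
### assumption (SVR is sensitive to feature scale); the exercise itself only asks for SVR, random forest and VotingRegressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-ins for the fingerprint features and the Tg values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

svr = make_pipeline(StandardScaler(), SVR())
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# VotingRegressor fits its own clones of the base models and averages their predictions.
ensemble = VotingRegressor(estimators=[("svr", svr), ("rf", rf)])

scores = {}
for name, model in {"SVR": svr, "Random forest": rf, "Ensemble": ensemble}.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    scores[name] = (r2, rmse)
    print(f"{name}: R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

### because the ensemble prediction is a plain average, it tends to help most when the two base models make different kinds of errors.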

### for the final exercise, we are going to identify the molecules in our dataset that are most similar to the first molecule in the dataset.
### that is, we want to compare the first molecule to every other molecule in the dataset
### first generate a Morgan fingerprint (radius=2, nBits=1024) for the first molecule in the dataset,
### and then for the rest of the molecules in the dataset.
### calculate the Tanimoto similarity between the first molecule and the rest of the molecules in the dataset
### print the top 10 most similar molecules to the first molecule in the dataset
### it should look something like this:

### Most similar molecules:
###                             SMILES                                                     Similarity
### 0    *C1COC2C1OCC2Oc1ccc(cc1)CNC(=O)CCCCCCC(=O)NCc1...                                  1.000000
### 222                 *N=Occ1                                                             0.444444
### 333  *Oc1cCC(=O)NCCCc1cccOC...                                                          0.428571
### 444  *C1C2OC(=O)CCC(=O)O                                                                0.428571
### 20   *NC....                                                                            0.425926
### 12   *OC(=O)CNC...                                                                      0.353846
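### a sketch of the similarity search on a small made-up list of polymer-style SMILES strings (not the real dataset).
### DataStructs.BulkTanimotoSimilarity compares one fingerprint against a whole list in a single call.

```python
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem.rdFingerprintGenerator import GetMorganGenerator

# Made-up stand-in for the SMILES column of the dataset.
smiles_list = [
    "*CC*",
    "*CC(*)C",
    "*CC(*)CC",
    "*CC(*)c1ccccc1",
    "*CC(*)C(=O)OC",
    "*OCC*",
]

# Morgan fingerprints as bit vectors (radius 2, 1024 bits).
generator = GetMorganGenerator(radius=2, fpSize=1024)
fps = [generator.GetFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]

# Tanimoto similarity of the first molecule to every molecule, itself included.
similarities = DataStructs.BulkTanimotoSimilarity(fps[0], fps)

df_sim = pd.DataFrame({"SMILES": smiles_list, "Similarity": similarities})
top10 = df_sim.sort_values("Similarity", ascending=False).head(10)
print("Most similar molecules:")
print(top10)
```

### the first molecule always matches itself with a similarity of 1.0, which is why it appears at the top of the table.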

### use MolsToGridImage to display the first molecule in the dataset together with its 4 most similar molecules.
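### a minimal sketch of the grid image. The list top_smiles is hypothetical: in the notebook it would hold the
### first molecule followed by its 4 nearest neighbours from the similarity table.

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Hypothetical query molecule followed by its 4 most similar molecules.
top_smiles = ["*CC*", "*CC(*)C", "*CC(*)CC", "*OCC*", "*CC(*)c1ccccc1"]
mols = [Chem.MolFromSmiles(s) for s in top_smiles]

# One row of five structures, each labelled with its SMILES string.
img = Draw.MolsToGridImage(mols, molsPerRow=5, subImgSize=(200, 200), legends=top_smiles)
img  # in a notebook, leaving the image as the last expression displays it
```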