# ***R Project*** 


Database: ChEBI -> [link](https://www.ebi.ac.uk/chebi/)

Contributors: 
Hézio S. -> [GitHub](https://github.com/HezioS1lv4)
Riam Martinelli -> [GitHub](https://github.com/richboyyy)

In [3]:
!pip install rdkit

# Imports
#------------------------------------------------------------------------------------------------------#
import pandas as pd 
#------------------------------------------------------------------------------------------------------#
from rdkit import Chem, RDLogger
from rdkit.Chem import Descriptors, MolSurf, rdMolDescriptors, AllChem, Crippen, PandasTools, Lipinski
#------------------------------------------------------------------------------------------------------#
# Colab helper for uploading and downloading files.
from google.colab import files
# Mount Google Drive in the Colab runtime.
from google.colab import drive
drive.mount('/content/drive')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


__________________________________________________________


In [4]:
# Path to the ChEBI SDF file stored on Google Drive.
chebi = '/content/drive/MyDrive/ChEbi/ChEBI_complete.sdf' 

In [5]:
all_mols = Chem.SDMolSupplier(chebi)  # Reads the molecules from the ChEBI database file.

In [6]:
# One list per descriptor column.
# Re-create the supplier that reads the molecules from the ChEBI SDF file.
all_mols = Chem.SDMolSupplier(chebi)
SMILES = []
NumAtoms = []
ExactMolWt = []
Mol_LogP = []
NumHAcceptors = []
NumHDonors = []
Ring_Count = []
TPSA = []
NumRotableBonds = []
Molecular_formula = []
# Iterate over every record in the supplier.
# A record that RDKit cannot parse is returned as None and is skipped;
# the descriptors are calculated for all other molecules.
for mol in all_mols:
    if mol is None:
        continue
    NumAtoms.append(mol.GetNumAtoms())
    SMILES.append(Chem.MolToSmiles(mol))
    ExactMolWt.append(Descriptors.ExactMolWt(mol))
    TPSA.append(Descriptors.TPSA(mol))
    Mol_LogP.append(Crippen.MolLogP(mol))
    NumHAcceptors.append(Lipinski.NumHAcceptors(mol))
    NumHDonors.append(Lipinski.NumHDonors(mol))
    Ring_Count.append(Lipinski.RingCount(mol))
    NumRotableBonds.append(Lipinski.NumRotatableBonds(mol))
    Molecular_formula.append(rdMolDescriptors.CalcMolFormula(mol))


__________________________________________________________



The database had to be loaded in two different ways, with a different method for each part of the calculation.
When reading the file from Drive with `Chem.SDMolSupplier`, we did not have access to the SDF columns and could not find a way to obtain the ChEBI ID. Reading the file with `PandasTools.LoadSDF` made the ID available as a column.
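
For reference, RDKit exposes SD-file tags as molecule properties, so the identifier may also be readable directly from the `Chem.SDMolSupplier` records. The sketch below is not the approach used in this notebook. It assumes the tag is named `ChEBI ID` (the column name shown by `PandasTools.LoadSDF`) and uses a small in-memory SDF built from two example molecules instead of the real file.

```python
import io

from rdkit import Chem
from rdkit.Chem import Descriptors

# Build a tiny in-memory SDF so the example runs without the ChEBI download.
# The IDs and SMILES are taken from the table shown in this notebook.
examples = {
    "CHEBI:943": "Clc1cccc(Cl)c1C#N",
    "CHEBI:741548": "CCC(C(O)=O)C(O)=O",
}
buffer = io.StringIO()
writer = Chem.SDWriter(buffer)
for chebi_id, smiles in examples.items():
    mol = Chem.MolFromSmiles(smiles)
    mol.SetProp("ChEBI ID", chebi_id)  # assumed tag name
    writer.write(mol)
writer.close()

# Read the records back through an SDMolSupplier, as the notebook does
# for ChEBI_complete.sdf (here from text instead of a file path).
supplier = Chem.SDMolSupplier()
supplier.SetData(buffer.getvalue())

rows = []
for mol in supplier:
    if mol is None:  # record that RDKit could not parse
        continue
    rows.append({
        # Read the ID from the same record the descriptors come from.
        "ChEbi_ID": mol.GetProp("ChEBI ID") if mol.HasProp("ChEBI ID") else None,
        "SMILES": Chem.MolToSmiles(mol),
        "ExactMolWt": Descriptors.ExactMolWt(mol),
    })

for row in rows:
    print(row)
```

Because the ID and the descriptors are taken from the same record in one pass, each row stays consistent even when some records are skipped.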

__________________________________________________________



In [None]:
#Reading the database by uploading it directly to the machine.
dataSDF = '/content/ChEBI_complete.sdf'  

In [None]:
#Reads the sdf and transforms the variable into a data frame.
ChEbi = PandasTools.LoadSDF(dataSDF) 

In [None]:
ChEbi

Unnamed: 0,ChEBI ID,ChEBI Name,Star,Definition,Secondary ChEBI ID,InChI,InChIKey,SMILES,Formulae,Charge,...,GlyGen Database Links,GlyTouCan Database Links,LIPID MAPS class Database Links,RESID Database Links,WebElements Database Links,FAO/WHO standards Database Links,PPR Links,CiteXplore citation Links,SMID Database Links,ChemIDplus Database Links
0,CHEBI:90,(-)-epicatechin,3,"A catechin with (2R,3R)-configuration.",CHEBI:18484,InChI=1S/C15H14O6/c16-8-4-11(18)9-6-13(20)15(2...,PFTAWBLQPZVEMU-UKRRQHHQSA-N,[H][C@@]1(Oc2cc(O)cc(O)c2C[C@H]1O)c1ccc(O)c(O)c1,C15H14O6,0,...,,,,,,,,,,
1,CHEBI:165,"(1S,4R)-fenchone",3,"A fenchone that has 1S,4R stereochemistry. A c...",CHEBI:63901,"InChI=1S/C10H16O/c1-9(2)7-4-5-10(3,6-7)8(9)11/...",LHXDLQBQYFFVNW-XCBNKYQSSA-N,CC1(C)[C@@H]2CC[C@@](C)(C2)C1=O,C10H16O,0,...,,,,,,,,,,
2,CHEBI:598,1-alkyl-2-acylglycerol,3,A glycerol ether having an alkyl substituent a...,CHEBI:19009,,,OCC(CO[*])OC([*])=O,C4H6O4R2,0,...,,,,,,,,,,
3,CHEBI:776,16alpha-hydroxyestrone,3,The 16alpha-hydroxy derivative of estrone; a m...,CHEBI:60497,InChI=1S/C18H22O3/c1-18-7-6-13-12-5-3-11(19)8-...,WPOCIZJTELRQMF-QFXBJFAPSA-N,[H][C@@]12C[C@@H](O)C(=O)[C@@]1(C)CC[C@]1([H])...,C18H22O3,0,...,,,,,,,,,,
4,CHEBI:943,"2,6-dichlorobenzonitrile",3,A nitrile that is benzonitrile which is substi...,CHEBI:73174,InChI=1S/C7H3Cl2N/c8-6-2-1-3-7(9)5(6)4-10/h1-3H,YOYAIZYFCNQIRF-UHFFFAOYSA-N,Clc1cccc(Cl)c1C#N,C7H3Cl2N,0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150340,CHEBI:691037,desoximetasone,3,Dexamethasone in which the hydroxy group at th...,,InChI=1S/C22H29FO4/c1-12-8-16-15-5-4-13-9-14(2...,VWVSBHGCDBMOOT-IIEHVVJPSA-N,[H][C@@]12C[C@@H](C)[C@H](C(=O)CO)[C@@]1(C)C[C...,C22H29FO4,0,...,,,,,,,,,,
150341,CHEBI:691622,"1,3,7-trimethyluric acid",3,An oxopurine in which the purine ring is subst...,,InChI=1S/C8H10N4O3/c1-10-4-5(9-7(10)14)11(2)8(...,BYXCFUMGEBZDDI-UHFFFAOYSA-N,CN1C(=O)NC2=C1C(=O)N(C)C(=O)N2C,C8H10N4O3,0,...,,,,,,,,,,
150342,CHEBI:724125,methyl 5-aminolevulinate,3,The methyl ester of 5-aminolevulinic acid. A p...,,"InChI=1S/C6H11NO3/c1-10-6(9)3-2-5(8)4-7/h2-4,7...",YUUAYBAIHCDHHD-UHFFFAOYSA-N,COC(=O)CCC(=O)CN,C6H11NO3,0,...,,,,,,,,,,
150343,CHEBI:741548,ethylmalonic acid,3,A dicarboxylic acid obtained by substitution o...,,"InChI=1S/C5H8O4/c1-2-3(4(6)7)5(8)9/h3H,2H2,1H3...",UKFXDFUAPNAMPJ-UHFFFAOYSA-N,CCC(C(O)=O)C(O)=O,C5H8O4,0,...,,,,,,,,,,


In [None]:
Chebi_ID = ChEbi['ChEBI ID']

In [None]:
Chebi_ID

0             CHEBI:90
1            CHEBI:165
2            CHEBI:598
3            CHEBI:776
4            CHEBI:943
              ...     
150340    CHEBI:691037
150341    CHEBI:691622
150342    CHEBI:724125
150343    CHEBI:741548
150344    CHEBI:746859
Name: ChEBI ID, Length: 150122, dtype: object

__________________________________________________________


In [None]:
# Collect the ChEBI IDs and the descriptor lists in a dictionary, one key per column.
i = {
    'ChEbi_ID': Chebi_ID,
    'SMILES': SMILES,
    'MolecularFormula': Molecular_formula,
    'NumAtoms': NumAtoms,
    'ExactMolWt': ExactMolWt,
    'NumRotableBonds': NumRotableBonds,
    'MolLogp': Mol_LogP,
    'RingCount': Ring_Count,
    'NumHAcceptors': NumHAcceptors,
    'TPSA': TPSA,
    'NumHDonors': NumHDonors,
}
# Build a DataFrame from the dictionary and write it to a semicolon-separated CSV file.
pd.DataFrame(i).to_csv('base de dados.csv', sep=';')

In [None]:
# Turning the dict into a dataframe that can be queried as a table.
MoleculeDescription = pd.DataFrame(data=i) 

In [None]:
MoleculeDescription

NameError: ignored

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


__________________________________________________________

In [None]:
# Check that both loading methods produced the same number of records.
len(NumAtoms)

150122

In [None]:
len(Chebi_ID)  

150122
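
Equal lengths are necessary for combining the results of the two readers column by column, but they do not by themselves prove that row i of one list describes the same molecule as row i of another. A minimal sketch of the length check as a reusable guard follows; `check_equal_lengths` is a hypothetical helper and the data are made-up stand-ins, not the real lists.

```python
def check_equal_lengths(columns):
    """Return the shared length of all columns, or raise if they differ.

    `columns` maps a column name to a sized sequence. This helper is
    hypothetical and is not part of pandas or RDKit.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # An empty mapping has no columns to compare; report length 0.
    return next(iter(lengths.values()), 0)


# Made-up stand-in data; in the notebook these would be
# Chebi_ID, SMILES, NumAtoms and the other descriptor lists.
columns = {
    "ChEbi_ID": ["ID_A", "ID_B", "ID_C"],
    "NumAtoms": [1, 2, 3],
    "TPSA": [0.0, 0.0, 0.0],
}
n_rows = check_equal_lengths(columns)
print(n_rows)  # prints 3
```

Running a guard like this before building the dictionary turns a silent mismatch into an explicit error that names the offending columns.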

In [None]:
# Note:
#  [CHEBI:192499] and [CHEBI:192500] did not produce results because there was an error in their SMILES and mol data.
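
One way to find such entries programmatically is to record which inputs RDKit returns as `None`. The sketch below works on SMILES strings with `Chem.MolFromSmiles`. The identifiers and the invalid SMILES are made up for illustration and are not the actual ChEBI records mentioned above.

```python
from rdkit import Chem, RDLogger

# Silence RDKit's parser messages for the deliberately bad input.
RDLogger.DisableLog("rdApp.*")

# Made-up identifiers; "C1CC" has an unclosed ring bond and cannot be parsed.
records = {
    "example:1": "CCO",
    "example:2": "C1CC",
    "example:3": "c1ccccc1",
}

# Keep the identifier of every record that RDKit fails to parse.
failed_ids = [
    record_id
    for record_id, smiles in records.items()
    if Chem.MolFromSmiles(smiles) is None
]
print(failed_ids)  # prints ['example:2']
```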