# About the Data

![](https://cdn.pixabay.com/photo/2020/05/15/18/46/corona-5174671_960_720.jpg)

**This dataset consists of two files describing potential drug compounds against the COVID-19 virus. The original file contains only the SMILES notation of each chemical compound and its pIC50 value against the virus. The second file adds features engineered with the Python library pubchempy, which provides access to PubChem, a database of millions of chemical compounds. We used this library to fetch the properties of each compound from its SMILES representation.**

**The dataset was made publicly available by the Government of India as part of its Drug Discovery Hackathon. According to the hackathon organisers, it contains some potential drugs against the COVID-19 virus.**
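
For readers unfamiliar with the target column: pIC50 is conventionally the negative base-10 logarithm of IC50 expressed in mol/L, so a higher value means a more potent compound. The snippet below is a minimal sketch of that conversion, assuming the IC50 values are in micromolar as the column name `pIC50 (IC50 in microM)` suggests. The function name `pic50_from_micromolar` is hypothetical and is not part of the dataset or of pubchempy.

```python
import math


def pic50_from_micromolar(ic50_um: float) -> float:
    """Convert an IC50 in micromolar to pIC50 (-log10 of IC50 in mol/L)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    # 1 microM = 1e-6 mol/L, so -log10(ic50_um * 1e-6) = 6 - log10(ic50_um)
    return 6.0 - math.log10(ic50_um)


print(pic50_from_micromolar(1.0))   # 1 microM  -> 6.0
print(pic50_from_micromolar(10.0))  # 10 microM -> 5.0
```

A tenfold drop in IC50 therefore raises pIC50 by exactly one unit.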

# 1. Install the library `pubchempy`

In [None]:
!pip install pubchempy

# 2. Necessary Imports

In [None]:
import pubchempy as pcp
import pandas as pd

# 3. Checking a few properties for a single SMILES string

In [None]:
smiles = 'CN1N=C(C=C1C(F)(F)F)C1=CC=C(S1)C1=CC=NC(SCC(=O)NC2=CC=C(Cl)C=C2)=N1'
prop = pcp.get_properties(['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'IsomericSMILES',
                           'InChI', 'InChIKey', 'IUPACName'], smiles, 'smiles')

In [None]:
print(type(prop))
print(prop)
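
`get_properties` returns a list of dictionaries, one per matching compound, each holding the `CID` plus the requested properties. A minimal sketch of pulling fields out of such a result follows. The `sample` record is a hand-written stand-in with placeholder values (not real PubChem output), and `first_record` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def first_record(result: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the first property dictionary, or None if the list is empty."""
    return result[0] if result else None


# Placeholder data shaped like a get_properties result; the values are made up.
sample = [{"CID": 0, "MolecularFormula": "C2H6O", "MolecularWeight": "46.07"}]

record = first_record(sample)
if record is not None:
    # .get returns None for a property that is absent instead of raising KeyError.
    print(record.get("MolecularFormula"))
    print(record.get("IUPACName"))
```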

# 4. Reading the data file

In [None]:
df = pd.read_csv('../input/drug-discovery-data/DDH Data.csv')

In [None]:
df.head()

# 5. Getting all the available properties

**Fetch the properties of all 104 compounds in the dataset from the PubChem database. To do this, we pass the property names as a list to the `get_properties` function of the library. The list of available properties can be found [here](https://pubchempy.readthedocs.io/en/latest/guide/properties.html).**

In [None]:
# Properties to request for each compound.
properties = ['MolecularFormula', 'MolecularWeight', 'InChI', 'InChIKey', 'IUPACName',
              'XLogP', 'ExactMass', 'MonoisotopicMass', 'TPSA', 'Complexity', 'Charge',
              'HBondDonorCount', 'HBondAcceptorCount', 'RotatableBondCount',
              'HeavyAtomCount', 'IsotopeAtomCount', 'AtomStereoCount',
              'DefinedAtomStereoCount', 'UndefinedAtomStereoCount', 'BondStereoCount',
              'DefinedBondStereoCount', 'UndefinedBondStereoCount', 'CovalentUnitCount',
              'Volume3D', 'XStericQuadrupole3D', 'YStericQuadrupole3D',
              'ZStericQuadrupole3D', 'FeatureCount3D', 'FeatureAcceptorCount3D',
              'FeatureDonorCount3D', 'FeatureAnionCount3D', 'FeatureCationCount3D',
              'FeatureRingCount3D', 'FeatureHydrophobeCount3D', 'ConformerModelRMSD3D',
              'EffectiveRotorCount3D', 'ConformerCount3D']

data = []
for smiles in df['SMILES']:
    # One request per compound; each call returns a list of property dicts.
    data.append(pcp.get_properties(properties, smiles, 'smiles'))
#data
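
The loop above sends one request per compound and stops at the first network error or failed lookup. The following is a sketch of a more defensive variant, not the approach used in this notebook. `fetch_all` is a hypothetical helper that takes the fetching function as an argument (in the notebook it would wrap `pcp.get_properties`), records `None` for a failed lookup, and pauses between requests. The pause length is an assumption; PubChem's usage policy is the place to check current request limits.

```python
import time
from typing import Any, Callable, Dict, Iterable, List, Optional

Record = Dict[str, Any]


def fetch_all(
    smiles_list: Iterable[str],
    fetch: Callable[[str], List[Record]],
    pause_s: float = 0.2,
) -> List[Optional[Record]]:
    """Fetch one record per SMILES string, keeping the input order.

    A failed or empty lookup is stored as None so that row i of the
    output still corresponds to row i of the input.
    """
    records: List[Optional[Record]] = []
    for smiles in smiles_list:
        try:
            result = fetch(smiles)
        except Exception as exc:  # broad on purpose: this is only a sketch
            print(f"lookup failed for {smiles!r}: {exc}")
            result = []
        records.append(result[0] if result else None)
        time.sleep(pause_s)  # assumed delay between requests
    return records


# Stand-in for the real call so the sketch runs offline. In the notebook the
# fetch argument could wrap pcp.get_properties with the 'smiles' namespace.
def fake_fetch(smiles: str) -> List[Record]:
    if smiles == "bad":
        raise ValueError("no match")
    return [{"CID": 0, "SMILES": smiles}]


print(fetch_all(["CCO", "bad", "C"], fake_fetch, pause_s=0.0))
```

Keeping a `None` placeholder rather than skipping the compound preserves row alignment with the original dataframe, which matters when the `SMILES` and `pIC50` columns are copied across later.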

# 6. Appending the newly found properties to a dataframe

In [None]:
len(data[0][0].keys())

In [None]:
# Each entry of data is a list whose first element is the property dict for that compound.
records = [result[0] for result in data]
props_df = pd.DataFrame(records)
props_df.head()
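
The resulting table is complete only if every lookup returned every requested property. PubChem may leave out a property it has no value for, so a quick check like the sketch below can show which rows are incomplete. `missing_properties` is a hypothetical helper and the sample records are placeholders, not real PubChem output.

```python
from typing import Any, Dict, List, Sequence


def missing_properties(
    records: Sequence[Dict[str, Any]], expected: Sequence[str]
) -> Dict[int, List[str]]:
    """Map row index -> expected property names absent from that record."""
    gaps: Dict[int, List[str]] = {}
    for index, record in enumerate(records):
        absent = [name for name in expected if name not in record]
        if absent:
            gaps[index] = absent
    return gaps


# Placeholder records; in the notebook these would be data[i][0] for each i.
records = [
    {"CID": 0, "MolecularWeight": "46.07", "XLogP": -0.1},
    {"CID": 1, "MolecularWeight": "18.02"},
]
print(missing_properties(records, ["MolecularWeight", "XLogP"]))
```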

# 7. Add the original columns from the original dataframe

In [None]:
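# Place SMILES as the second column; rows are matched by index, so the order must be unchanged.
props_df.insert(1, 'SMILES', df['SMILES'], allow_duplicates=True)
props_df['pIC50'] = df['pIC50 (IC50 in microM)']
props_df.head()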

# 8. Save the new dataframe with all properties as a .csv file

In [None]:
props_df.to_csv('DDH Data with Properties.csv', index=False)