# General information

The competition provides several files with additional molecular features, such as dipole moments, magnetic shielding, Mulliken charges and potential energy, but these files cover only the train molecules and not the test molecules.

So, in order to use these attributes, we first have to build models from the training molecules and then infer the values for each test molecule.

### Dealing With Molecules

The problem is that, when estimating properties that depend on the positions of the atoms in a molecule, it is not enough to look at only 1, 2 or 3 atoms, because these properties are often influenced by several (or all) neighboring atoms of the molecule.   
Thus, the ideal model must consider the entire molecular structure.   
The competition files provide the xyz positions of each atom, but feeding all the positions of all the atoms into a machine learning model can be a very difficult task.   
To overcome this problem we use the concept of a molecular fingerprint, where the molecular structure is represented by a fixed-length array calculated from the atoms of the molecule and the way they are connected. Based on this, in theory you can create models for any molecular property from the structure.    
The most famous technique for defining a molecular fingerprint is the **Morgan** (circular) fingerprint. For more information see the paper:    
*Morgan, H. L. The Generation of a Unique Machine Description for Chemical Structures - A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5: 107-112.*   
Creating the fingerprint of a molecule is complex because it has to encode, for every atom, the identity of the atom and its surroundings, such as the neighboring atoms and the type of chemical bonds. Note that the standard Morgan fingerprint is computed from the molecular graph (which atoms are bonded to which) rather than directly from the 3D coordinates, so conformation and relative angles are not encoded explicitly.   
For this task we will use the incredible DeepChem library, which abstracts most of the complexity and requires only a few lines of code.
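
To make the fingerprint idea more concrete, here is a toy sketch of Morgan-style neighbourhood hashing in pure Python. It is only an illustration: the function name `toy_circular_fingerprint`, the use of `zlib.crc32` as the hash and the 64-bit length are assumptions made for this example, and this is not the algorithm implemented by RDKit or DeepChem.

```python
import zlib

def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy Morgan-style fingerprint: hash growing atom neighbourhoods into bits."""
    # Build an adjacency list from the list of bonds.
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Radius 0: each atom is identified by its element symbol alone.
    identifiers = [zlib.crc32(symbol.encode()) for symbol in atoms]
    bits = [0] * n_bits
    for ident in identifiers:
        bits[ident % n_bits] = 1

    # Each iteration folds the neighbours' identifiers into the atom's own,
    # so the new identifier describes a neighbourhood one bond larger.
    for _ in range(radius):
        updated = []
        for i in range(len(atoms)):
            env = [identifiers[i]] + sorted(identifiers[j] for j in neighbours[i])
            updated.append(zlib.crc32(repr(env).encode()))
        identifiers = updated
        for ident in identifiers:
            bits[ident % n_bits] = 1
    return bits

# Ethanol heavy atoms (C-C-O), written with two different atom orderings.
fp_a = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
fp_b = toy_circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
print(fp_a == fp_b)
```

Because the neighbour identifiers are sorted before hashing, the result does not depend on the order in which the atoms are listed, which is the property that makes such an array usable as a model input.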

### Main goal
The main purpose of this kernel is to show how the Stanford DeepChem library can be used to take advantage of the additional molecular features in this competition. No knowledge of chemistry is needed, just pure Python!
In this kernel we will use the potential energy data as an example; however, the same concept can be applied to the other files provided.

> **If you want to see only the DEEPCHEM library application, you can skip the introduction and preprocessing and go straight to the title: Deepchem to provide the potential energy, at the end of this kernel.**

### Installing packages, libraries, and so on...

In [None]:
import os
import warnings

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.max_rows', 5) # full option name; the short form 'max_rows' is rejected by recent pandas
import matplotlib.pylab as plt
import seaborn as sns
from sklearn import metrics
warnings.simplefilter(action='ignore', category=FutureWarning)
plt.style.use('ggplot')
color_pal = [x['color'] for x in plt.rcParams['axes.prop_cycle']]
from IPython.display import Image, HTML, display

In [None]:
!pip install deepchem

In [None]:
%%bash -e
if ! [[ -f ./xyz2mol.py ]]; then
  wget https://raw.githubusercontent.com/jensengroup/xyz2mol/master/xyz2mol.py
fi

In [None]:
# py3Dmol is built on 3Dmol.js, an object-oriented, WebGL-based JavaScript library for online molecular visualization (Rego & Koes, 2015)
!pip install py3Dmol

In [None]:
file_folder = '../input/champs-scalar-coupling' if 'champs-scalar-coupling' in os.listdir('../input/') else '../input'


In [None]:
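# Keep only the last 60,000 rows to limit the runtime (the original slice start 0-60000 evaluates to -60000)
train = pd.read_csv(f'{file_folder}/train.csv')[-60000:]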

In [None]:
train.head()

In [None]:
import sys
!conda install --yes --prefix {sys.prefix} -c rdkit rdkit

# Dealing with different file formats
To be able to use the DeepChem library satisfactorily we need to have our molecules encoded in SMILES format. The molecule files supplied by Kaggle are in XYZ format.   
So we will convert them. This transformation is done in 3 steps:
1. Convert XYZ to MOL format.
2. Transform MOL to SMILES format.
3. Create a csv file.

## Read molecules in RDKit using XYZ2MOL package
### XYZ file format
The XYZ file format is a chemical file format. An XYZ file normally specifies the molecule geometry by giving the number of atoms on the first line, a comment on the second line, and the Cartesian coordinates of the atoms on the following lines. The file format is used in computational chemistry programs for importing and exporting geometries. The units are generally ångströms. (https://en.wikipedia.org/wiki/XYZ_file_format)    
Normally, we find in an XYZ file:  
- First line: total number of atoms (optional)  
- Second line: molecule name or comment (optional)  
- All other lines: element symbol or atomic number, x, y, and z coordinates, separated by spaces, tabs, or commas  

Example:    
glucose from 2gbp     
C  35.884  30.895  49.120   
C  36.177  29.853  50.124    
C  37.296  30.296  51.074    
C  38.553  30.400  50.259     
C  38.357  31.290  49.044     
C  39.559  31.209  48.082     
O  34.968  30.340  48.234  
O  34.923  29.775  50.910  
O  37.441  29.265  52.113  
O  39.572  30.954  51.086  
O  37.155  30.858  48.364  
O  39.261  32.018  46.920
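
The layout described above can be read with a few lines of plain Python. This is a minimal sketch, not the reader used by xyz2mol: the function name `parse_xyz` and the water coordinates are made up for illustration, and the parser simply skips any line that does not look like an atom line (so a comment line that happens to contain one word followed by three numbers would be misread).

```python
def parse_xyz(text):
    """Parse XYZ-formatted text into a list of (symbol, x, y, z) tuples."""
    atoms = []
    for line in text.strip().splitlines():
        # Fields may be separated by spaces, tabs, or commas.
        fields = line.replace(",", " ").split()
        # Atom lines have exactly four fields, the last three numeric;
        # anything else (atom count, comment) is skipped.
        if len(fields) != 4:
            continue
        try:
            x, y, z = (float(value) for value in fields[1:])
        except ValueError:
            continue
        atoms.append((fields[0], x, y, z))
    return atoms

# Illustrative water molecule with the optional count and comment lines.
sample = """3
water, illustrative coordinates
O  0.000  0.000  0.117
H  0.000  0.757 -0.471
H  0.000 -0.757 -0.471
"""
atoms = parse_xyz(sample)
print(len(atoms), atoms[0])
```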

### Mol file format
An MDL Molfile is a file format for holding information about the atoms, bonds, connectivity and coordinates of a molecule.(https://en.wikipedia.org/wiki/Chemical_table_file)  

### xyz2mol
This is a library that converts an xyz file to an RDKit mol object. The code is based on the paper by Yeonjoon Kim and Woo Youn Kim, "Universal Structure Conversion Method for Organic Molecules: From Atomic Connectivity to Three-Dimensional Geometry", Bull. Korean Chem. Soc. 2015, Vol. 36, 1769-1777, DOI: 10.1002/bkcs.10334. (https://github.com/jensengroup/xyz2mol)  

### Smiles Format
The simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.(https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system)    
In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree.(https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system, http://opensmiles.org)    
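
The depth-first procedure described above can be sketched for the simplest case. The function below is a toy written for this explanation (the name `tree_to_smiles` is hypothetical): it assumes a hydrogen-free graph with single bonds and no rings, so it does not emit ring-closure digits, bond symbols or bracket atoms as a real SMILES writer such as RDKit does.

```python
def tree_to_smiles(atoms, bonds, start=0):
    """Write a SMILES-like string for an acyclic, single-bonded heavy-atom graph."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    def visit(atom, parent):
        children = [n for n in neighbours[atom] if n != parent]
        text = atoms[atom]
        # All but the last child are branches and go in parentheses;
        # the last child continues the main chain.
        for child in children[:-1]:
            text += "(" + visit(child, atom) + ")"
        if children:
            text += visit(children[-1], atom)
        return text

    return visit(start, None)

# Ethanol heavy atoms: C0-C1-O2.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
print(tree_to_smiles(ethanol_atoms, ethanol_bonds, start=0))  # CCO
print(tree_to_smiles(ethanol_atoms, ethanol_bonds, start=2))  # OCC
print(tree_to_smiles(ethanol_atoms, ethanol_bonds, start=1))  # C(C)O
```

Starting the traversal from a different atom gives a different but equally valid string, which is why one molecule can have many SMILES representations.
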
A wide variety of SMILES strings are acceptable as input. For example, all of the following represent ethanol:   

CCO	
OCC	
C(O)C	
[CH3][CH2][OH]	
[H][C]([H])([H])C([H])([H])[O][H]	

### RDKit
RDKit is a collection of cheminformatics and machine-learning software written in C++ and Python, created by Greg Landrum.(https://github.com/rdkit/rdkit)     
It is an amazing library for chemistry, computational biology and bioinformatics studies involving Python programming.
We use it to convert the MOL representation to SMILES.

In [None]:
# Few Snippets from https://www.kaggle.com/sunhwan/using-rdkit-for-atomic-feature-and-visualization
# rdkit & xyz2mol
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole #Needed to show molecules
from rdkit.Chem import Draw
from rdkit.Chem.Draw.MolDrawing import MolDrawing, DrawingOptions #Only needed if modifying defaults
DrawingOptions.bondLineWidth=1.8
from rdkit.Chem.rdmolops import SanitizeFlags

# https://github.com/jensengroup/xyz2mol
from xyz2mol import xyz2mol, xyz2AC, AC2mol, read_xyz_file
from pathlib import Path
import pickle

CACHEDIR = Path('./')

def chiral_stereo_check(mol):
    # avoid sanitization error e.g., dsgdb9nsd_037900.xyz
    Chem.SanitizeMol(mol, SanitizeFlags.SANITIZE_ALL - SanitizeFlags.SANITIZE_PROPERTIES)
    Chem.DetectBondStereochemistry(mol,-1)
    # ignore stereochemistry for now
    #Chem.AssignStereochemistry(mol, flagPossibleStereoCenters=True, force=True)
    #Chem.AssignAtomChiralTagsFromStructure(mol,-1)
    return mol

def xyz2mol(atomicNumList,charge,xyz_coordinates,charged_fragments,quick):
    AC,mol = xyz2AC(atomicNumList,xyz_coordinates)
    new_mol = AC2mol(mol,AC,atomicNumList,charge,charged_fragments,quick)
    new_mol = chiral_stereo_check(new_mol)
    return new_mol

def MolFromXYZ(filename):
    mol = None  # returned as-is if the conversion fails
    charged_fragments = True
    quick = True
    cache_filename = CACHEDIR/f'{filename.stem}.pkl'
    if cache_filename.exists():
        with open(cache_filename, 'rb') as f:
            return pickle.load(f)
    try:
        atomicNumList, charge, xyz_coordinates = read_xyz_file(filename)
        mol = xyz2mol(atomicNumList, charge, xyz_coordinates, charged_fragments, quick)
        # Cache the converted molecule so that later calls can skip the conversion.
        # Note: this caching step was reported to fail when the kernel is committed,
        # although it runs fine interactively.
        with open(cache_filename, 'wb') as f:
            pickle.dump(mol, f)
    except Exception:
        # Report the file that could not be converted and carry on.
        print(filename)
    return mol

#mol = MolFromXYZ(xyzfiles[1])
#m = Chem.MolFromSmiles(Chem.MolToSmiles(mol, allHsExplicit=True)); m

from multiprocessing import Pool
from tqdm import tqdm
from glob import glob

def MolFromXYZ_(filename):
    return filename.stem, MolFromXYZ(filename)

mols = {}
n_cpu = 4
with Pool(n_cpu) as p:
    molecule_names = np.concatenate([train.molecule_name.unique()])
    molecule_names = molecule_names[:400]
    xyzfiles = [Path(f'{file_folder}/structures/{f}.xyz') for f in molecule_names]
    n = len(xyzfiles)
    with tqdm(total=n) as pbar:
        for res in p.imap_unordered(MolFromXYZ_, xyzfiles):
            mols[res[0]] = res[1]
            pbar.update()

## Testing formatting xyz files into smiles format

Let's test the conversion to SMILES format.

In [None]:
molecule_names = np.concatenate([train.molecule_name.unique()])
xyzfiles = [(f'{f}.xyz') for f in molecule_names]
few_molecule_names = molecule_names[:5]
few_molecule_names

In [None]:
# Build an RDKit mol from each xyz file, then write it out as SMILES
for molecule_name in few_molecule_names:
    try:
        m = MolFromXYZ(Path(f'{file_folder}/structures/{molecule_name}.xyz'))
        smile_fromMolecule = Chem.MolToSmiles(m)
        print('SMILES from molecule {}: {}'.format(molecule_name, smile_fromMolecule))
    except Exception as err:
        # Report the failing molecule and continue with the next one
        print('...something is wrong with {}: {}'.format(molecule_name, err))

### Visualize 3d molecules

How can we check that the format conversion is correct?
Let's look at a sample molecule in 3D for verification.

In [None]:
# Inspired by https://github.com/greglandrum/rdkit_blog/tree/master/notebooks

import py3Dmol # Amazing library for 3D visualization
from rdkit import Chem
from rdkit.Chem import AllChem
from ipywidgets import interact, interactive, fixed
def drawit(m,p,confId=-1):
    mb = Chem.MolToMolBlock(m,confId=confId)
    p.removeAllModels()
    p.addModel(mb,'sdf')
    p.setStyle({'stick':{}})
    p.setBackgroundColor('0xeeeeee')
    p.zoomTo()
    return p.show()
p = py3Dmol.view(width=400,height=400)

# now construct the 3d view:
m = MolFromXYZ(Path(f'{file_folder}/structures/dsgdb9nsd_129764.xyz'))
smile_fromMolecule = Chem.MolToSmiles(m)
m = Chem.MolFromSmiles(smile_fromMolecule)
#print(smile_fromMolecule)
m = Chem.AddHs(m)
AllChem.EmbedMultipleConfs(m,randomSeed=0xf00d,useExpTorsionAnglePrefs=True, useBasicKnowledge=True)
interact(drawit, m=fixed(m),p=fixed(p));

'''Tip: click and drag on the image to spin the molecule, and scroll to zoom!'''

# Using DeepChem Library

DeepChem aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.
The library was originally created by Bharath Ramsundar with encouragement and guidance from Vijay Pande.

It started as a Pande group project at Stanford, and is now developed by many academic and industrial collaborators. 

I really recommend you take a look at these sites to get deeper into the tool.    
https://github.com/deepchem/deepchem      
https://deepchem.io    
https://deepchem.io/docs/notebooks/index.html    

## Prepare dataframe for using with DeepChem library

In this example we will use the potential_energy file.
The idea is to create a model from the potential energy of the train molecules so that we can predict it for the test molecules. 

### Potential Energy

Potential energy is a concept that can be used to understand any type of change in a system. Any system (a collection of molecules, atoms, or electrons) has a potential energy associated with it. The origin of this energy and what it represents are not important here; all you need to know is that we can use the potential energy of a molecule to understand how and why it will undergo a change. These changes could include the formation or breaking of a bond, the gain or loss of an electron, or a change in the physical orientation of the atoms in the molecule. Each physical change has an associated change in potential energy. (https://sop4cv.com/chapters/PotentialEnergyThermodynamicsAndKinetics.html)    


potential_energy.csv - contains the potential energy of the molecules. The first column (molecule_name) contains the name of the molecule, the second column (potential_energy) contains the potential energy of the molecule.(https://www.kaggle.com/c/champs-scalar-coupling/data)   

In [None]:
potential_energyDF = pd.read_csv(f'{file_folder}/potential_energy.csv')

In [None]:
potential_energyDF.head()

In [None]:
# Plot the distribution of potential_energy
potential_energyDF['potential_energy'].plot(kind='hist', figsize=(25, 5), bins=500, title='Distribution of Molecular Potential Energy', color='g')
plt.show()


#### Create smiles format from molecules

In [None]:
import tensorflow as tf
import deepchem as dc
import numpy as np
import pandas as pd

# Getting only few samples
molecule_names = molecule_names[0:400]

In [None]:
# Create SMILES strings for the selected molecules
rows = []
for molecule_name in molecule_names:
    try:
        m = MolFromXYZ(Path(f'{file_folder}/structures/{molecule_name}.xyz'))
        rows.append({'molecule_name': molecule_name,
                     'smile_fromMolecule': Chem.MolToSmiles(m)})
    except Exception as err:
        # Skip molecules that cannot be converted instead of stopping the whole loop
        print('...something is wrong with {}: {}'.format(molecule_name, err))
# Build the dataframe once; DataFrame.append was removed in pandas 2.0
df_temp = pd.DataFrame(rows, columns=['molecule_name', 'smile_fromMolecule'])

In [None]:
df_temp.head()

In [None]:
# Merge two dataframes
potential_energyDF_Smiles = pd.merge(df_temp, potential_energyDF, how = 'left', on='molecule_name')
potential_energyDF_Smiles.dropna(inplace=True)

In [None]:
potential_energyDF_Smiles.shape

In [None]:
# To be used with DeepChem, our dataset file must contain one column with the SMILES string and another with our target (potential energy)
potential_energyDF_Smiles[['potential_energy', 'smile_fromMolecule']].to_csv('../potential_energyDF_Smiles.csv', index=False)

## Deepchem to provide the potential energy

In [None]:
from rdkit import Chem
import random
from deepchem.feat import CircularFingerprint
import deepchem as dc
import numpy as np

In [None]:
# Dataset file
dataset_file = '../potential_energyDF_Smiles.csv'

In [None]:
# Featurizer will create our fingerprint, and turn it into an array with 1024 bits.
featurizer = dc.feat.CircularFingerprint(size=1024)

# Prepare our dataset file
loader = dc.data.CSVLoader(
      tasks=["potential_energy"], smiles_field="smile_fromMolecule",
      featurizer=featurizer)

dataset = loader.featurize(dataset_file)

# Split into train, validation and test sets (the splitter is created without arguments)
splitter = dc.splits.ScaffoldSplitter()
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(dataset)
train_mols = [Chem.MolFromSmiles(compound)
              for compound in train_dataset.ids]
valid_mols = [Chem.MolFromSmiles(compound)
              for compound in valid_dataset.ids]
# Normalize the target with statistics computed on the training split only
transformers = [dc.trans.NormalizationTransformer(transform_y=True, dataset=train_dataset)]

# transform() returns the transformed dataset, so the result has to be kept
for transformer in transformers:
    train_dataset = transformer.transform(train_dataset)
    valid_dataset = transformer.transform(valid_dataset)
    test_dataset = transformer.transform(test_dataset)


After the featurization above, we will create our potential energy model. We will use a random forest.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Our regressor (parameters not listed here keep their scikit-learn defaults)
sklearn_model = RandomForestRegressor(n_estimators=100, max_depth=20,
                                      max_features=None, random_state=1)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)

# Evaluation on the validation split
from deepchem.utils.evaluate import Evaluator
metric = dc.metrics.Metric(dc.metrics.r2_score)
evaluator = Evaluator(model, valid_dataset, transformers)
r2score = evaluator.compute_model_performance([metric])
print(r2score)

The R2 score obtained was not bad for a first model trained on such a small dataset. Of course, you can test other models and parameters to improve it.

In [None]:
# Let's plot the predicted versus the true potential energy for the test split.
# Passing the transformers undoes the normalization, so both axes are in the original units.
llim = -600
ulim = -300
predicted_test = model.predict(test_dataset, transformers)
true_test = dc.trans.undo_transforms(test_dataset.y, transformers)
plt.scatter(predicted_test, true_test)
plt.xlim((llim, ulim))
plt.ylim((llim, ulim))
plt.plot([llim, ulim], [llim, ulim])
plt.xlabel('Predicted potential_energy')
plt.ylabel('True potential_energy')
plt.title(r'RF- predicted vs. true potential_energy')
plt.show()

Now you can use your model to predict the potential energy of your test dataframe.    
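
For reference, the R2 metric used in the evaluation step compares the squared prediction errors with the spread of the true values around their mean. The sketch below writes the formula out in plain Python; the helper name `r2_by_hand` and the energy values are made up for illustration and are not results from this kernel.

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up potential energies, for illustration only.
y_true = [-40.5, -79.8, -119.1, -155.0]
mean_value = sum(y_true) / len(y_true)

perfect = r2_by_hand(y_true, y_true)             # every prediction is exact
baseline = r2_by_hand(y_true, [mean_value] * 4)  # always predict the mean
print(perfect, baseline)
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so a value close to 1 on the validation split means the fingerprint carries most of the information needed for the target.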

As we said earlier, the same concept can be applied to the other files provided, such as **dipole moments, mulliken charges**, and so on.

**So, if you find it useful and liked this code what do you think of giving me an upvote? Thanks a lot.    
If you have any suggestions or questions please contact me.   
Keep in touch because I want to update it with more tips.   
Thank you very much.**