# Tutorial: Bag of Bonds 
To learn more about the Bag of Bonds representation, please read the literature reference below:
- DOI: 10.1021/acs.jpclett.5b00831

To make a Bag of Bonds representation for a molecule, you need to pass `chemreps.bag_of_bonds` the empty bags and the bag sizes for your dataset. You can either load premade bags using `LoadBags` from `chemreps.dataset` or create your own bags using `BagMaker` from `chemreps.bagger`.
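
To see why both the empty bags and the bag sizes are needed, here is a minimal toy sketch of the general Bag of Bonds idea: Coulomb-matrix-style terms are grouped into bags by atom or atom-pair type, each bag is sorted in descending order, and each bag is zero-padded to a fixed size so that every molecule yields a vector of the same length. This is not the `chemreps` implementation. The function name `toy_bag_of_bonds`, the coordinates, and the bag sizes are all made up for illustration, and the single-atom term `0.5 * Z**2.4` is assumed from the Coulomb matrix diagonal.

```python
import numpy as np
from itertools import combinations


def toy_bag_of_bonds(symbols, charges, coords, bag_sizes):
    """Toy Bag of Bonds vector; illustrative only, not the chemreps code."""
    coords = np.asarray(coords, dtype=float)
    bags = {key: [] for key in bag_sizes}  # start from empty bags

    # Single-element bags: Coulomb-matrix diagonal term 0.5 * Z**2.4.
    for sym, z in zip(symbols, charges):
        bags[sym].append(0.5 * z ** 2.4)

    # Pair bags: off-diagonal term Z_i * Z_j / |R_i - R_j|.
    for i, j in combinations(range(len(symbols)), 2):
        key = symbols[i] + symbols[j]
        if key not in bags:  # accept either ordering of the pair label
            key = symbols[j] + symbols[i]
        distance = np.linalg.norm(coords[i] - coords[j])
        bags[key].append(charges[i] * charges[j] / distance)

    # Sort each bag in descending order and zero-pad it to its fixed size.
    parts = []
    for key, size in bag_sizes.items():
        padded = np.zeros(size)
        values = sorted(bags[key], reverse=True)
        padded[:len(values)] = values
        parts.append(padded)
    return np.concatenate(parts)


# Hypothetical water-like geometry (made-up coordinates) and made-up bag sizes.
symbols = ['O', 'H', 'H']
charges = [8, 1, 1]
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
bag_sizes = {'H': 3, 'HH': 3, 'O': 1, 'OH': 3}

rep = toy_bag_of_bonds(symbols, charges, coords, bag_sizes)
print(rep.shape)  # the length equals the sum of the bag sizes
```

In this toy version, a bag that is larger than the number of matching atoms or pairs in the molecule is simply filled with trailing zeros, which matches the zero runs visible in the outputs later in this tutorial.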

## Load dataset bags
To load the bags for a dataset, pass `LoadBags` the representation and the dataset name. In the example below these are the strings 'BoB' and 'QM9'.

In [8]:
from chemreps.dataset import LoadBags

In [9]:
bagger = LoadBags('BoB', 'QM9')

Now that we have loaded our bags and stored them in the object called `bagger`, we can get the empty bags with `bagger.bags` and the size of each bag with `bagger.bag_sizes`.

In [10]:
bagger.bags

{'C': [],
 'CC': [],
 'CH': [],
 'F': [],
 'FC': [],
 'FF': [],
 'FH': [],
 'FN': [],
 'FO': [],
 'H': [],
 'HH': [],
 'N': [],
 'NC': [],
 'NH': [],
 'NN': [],
 'O': [],
 'OC': [],
 'OH': [],
 'ON': [],
 'OO': []}

In [11]:
bagger.bag_sizes

OrderedDict([('C', 9),
             ('CC', 36),
             ('CH', 180),
             ('F', 6),
             ('FC', 18),
             ('FF', 15),
             ('FH', 33),
             ('FN', 9),
             ('FO', 6),
             ('H', 20),
             ('HH', 190),
             ('N', 7),
             ('NC', 20),
             ('NH', 48),
             ('NN', 21),
             ('O', 5),
             ('OC', 20),
             ('OH', 48),
             ('ON', 12),
             ('OO', 10)])

## Create your own bags
The dataset that we will be using can be found in the data directory of this repository. If you cloned this repository locally, you should be able to set the path as '../data/sdf/'. Once we have the path to our dataset, we pass it to `BagMaker` along with the type of representation we want. In this case we want the Bag of Bonds representation, so we pass `BagMaker` the string 'BoB'.

Note: For larger datasets this may take some time to run, because `BagMaker` needs to iterate through the entire dataset to find the proper bag sizes.
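
As a rough illustration of why a full pass over the dataset is needed, the sketch below computes bag sizes for a tiny made-up dataset by taking, for every atom type and atom-pair type, the largest count found in any single molecule. This is only an assumption about the general approach, not the internals of `BagMaker`. The function name `toy_bag_sizes` is hypothetical, and the alphabetical ordering of pair labels (for example 'HO') is an arbitrary choice for this sketch that differs from the labels `chemreps` prints.

```python
from collections import Counter
from itertools import combinations


def toy_bag_sizes(molecules):
    """Largest count of each atom and atom-pair type over all molecules."""
    sizes = Counter()
    for symbols in molecules:
        counts = Counter(symbols)
        for elem, n in counts.items():
            sizes[elem] = max(sizes[elem], n)
            # n atoms of one element form n * (n - 1) / 2 same-element pairs.
            if n > 1:
                sizes[elem + elem] = max(sizes[elem + elem], n * (n - 1) // 2)
        # Two different elements form n_a * n_b mixed pairs.
        for a, b in combinations(sorted(counts), 2):
            sizes[a + b] = max(sizes[a + b], counts[a] * counts[b])
    return dict(sorted(sizes.items()))


# Hypothetical two-molecule dataset given as element symbols: water and methane.
molecules = [['O', 'H', 'H'], ['C', 'H', 'H', 'H', 'H']]
sizes = toy_bag_sizes(molecules)
print(sizes)
```

Because each size is a maximum over all molecules, no bag size is final until the last molecule has been read, which is why the cost grows with the size of the dataset.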

In [12]:
from chemreps.bagger import BagMaker

In [13]:
dataset = '../data/sdf/'
bagger = BagMaker('BoB', dataset)

In [14]:
bagger.bags

{'C': [],
 'CC': [],
 'CH': [],
 'H': [],
 'HH': [],
 'N': [],
 'NC': [],
 'NH': [],
 'NN': [],
 'O': [],
 'OC': [],
 'OH': [],
 'ON': [],
 'OO': [],
 'P': [],
 'PC': [],
 'PH': [],
 'PN': [],
 'PO': [],
 'S': [],
 'SC': [],
 'SH': [],
 'SN': [],
 'SO': []}

In [15]:
bagger.bag_sizes

OrderedDict([('C', 16),
             ('CC', 120),
             ('CH', 288),
             ('H', 18),
             ('HH', 153),
             ('N', 2),
             ('NC', 32),
             ('NH', 36),
             ('NN', 1),
             ('O', 5),
             ('OC', 80),
             ('OH', 90),
             ('ON', 10),
             ('OO', 10),
             ('P', 1),
             ('PC', 3),
             ('PH', 8),
             ('PN', 1),
             ('PO', 5),
             ('S', 1),
             ('SC', 16),
             ('SH', 18),
             ('SN', 2),
             ('SO', 5)])

## Make a representation
Once we have the bags and bag sizes for the dataset, we can start making our representations. Here `bagger` is the object created by `BagMaker` in the previous section. To make a Bag of Bonds representation using `chemreps`, all we need to do is pass `bag_of_bonds` the molecule file, `bagger.bags`, and `bagger.bag_sizes`.

In [16]:
from chemreps.bag_of_bonds import bag_of_bonds

In [19]:
mfiles = '../data/sdf/molecule_1.sdf'
print(mfiles)
rep = bag_of_bonds(mfiles, bagger.bags, bagger.bag_sizes)
rep

../data/sdf/molecule_1.sdf


array([36.84   , 36.84   , 36.84   , 36.84   , 36.84   , 36.84   ,
       36.84   , 36.84   , 36.84   ,  0.     ,  0.     ,  0.     ,
        0.     ,  0.     ,  0.     ,  0.     ,  0.     , 23.84   ,
       23.64   , 23.31   , 22.88   , 14.96   , 14.74   , 14.72   ,
       14.62   , 14.52   , 14.445  , 14.19   , 14.03   , 13.98   ,
       12.29   , 11.73   , 11.695  ,  9.87   ,  9.86   ,  9.69   ,
        9.64   ,  9.23   ,  9.125  ,  9.09   ,  9.01   ,  8.914  ,
        8.26   ,  8.11   ,  7.63   ,  7.293  ,  7.285  ,  6.883  ,
        6.85   ,  6.812  ,  6.508  ,  5.844  ,  5.36   ,  0.     ,
        0.     ,  0.     ,  0.     ,  0.     ,  0.     ,  0.     ,
        0.     ,  0.     ,  0.     ,  0.     ,  0.     ,  0.     ,
        0.     ,  0.     ,  0.     ,  0.     ,  0.     ,  0.     ,
        0.     ,  0.     ,  0.     ,  0.     ,  0.     ,  0.     ,
        0.     ,  0.     ,  0.     ,  0.     ,  0.     ,  0.     ,
        0.     ,  0.     ,  0.     ,  0.     ,  0.     ,  0.  

## Make representations for multiple molecules
Disclaimer: There may be better ways to accomplish the same objective. You are welcome to use your own method, and to submit an issue/PR if you think we should use that method.

To make representations for all the molecules in our directory, we are going to use `glob` and `pandas`. To find out more about these libraries you can go to the [glob documentation](https://docs.python.org/3/library/glob.html) or [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html).

We first create an empty list called `rep_list` in which we will store the filename and the representation of each molecule. Next we loop over all of the files in the directory, using glob to match our pattern (e.g. we want all sdf files from our data/sdf/ directory). Inside the loop we make each representation in the same way as above, then store the filename and the representation in a dictionary that is appended to `rep_list`. Once the loop is complete, we store the information in a pandas dataframe.


In [20]:
dataset

'../data/sdf/'

In [21]:
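import glob
import os
import pandas as pd

rep_list = []
# Loop over every .sdf file in the dataset directory, in sorted order
for fname in sorted(glob.iglob(os.path.join(dataset, '*.sdf'))):
    print(fname)
    # Build the representation exactly as for the single molecule above
    rep = bag_of_bonds(fname, bagger.bags, bagger.bag_sizes)
    # Keep the filename alongside its representation
    rep_list.append({'Name': fname, 'Rep': rep})

# Collect the results in a dataframe with one row per molecule
df = pd.DataFrame(rep_list, columns=['Name', 'Rep'])
df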

../data/sdf/benzoic_acid.sdf
../data/sdf/butane.sdf
../data/sdf/molecule_1.sdf
../data/sdf/molecule_2.sdf
../data/sdf/molecule_3.sdf
../data/sdf/molecule_4.sdf
../data/sdf/molecule_5.sdf
../data/sdf/penicillin.sdf
../data/sdf/water.sdf


|   | Name | Rep |
|---|------|-----|
| 0 | ../data/sdf/benzoic_acid.sdf | [36.84, 36.84, 36.84, 36.84, 36.84, 36.84, 36.... |
| 1 | ../data/sdf/butane.sdf | [36.84, 36.84, 36.84, 36.84, 0.0, 0.0, 0.0, 0.... |
| 2 | ../data/sdf/molecule_1.sdf | [36.84, 36.84, 36.84, 36.84, 36.84, 36.84, 36.... |
| 3 | ../data/sdf/molecule_2.sdf | [36.84, 36.84, 36.84, 36.84, 36.84, 36.84, 36.... |
| 4 | ../data/sdf/molecule_3.sdf | [36.84, 36.84, 36.84, 36.84, 36.84, 36.84, 36.... |
| 5 | ../data/sdf/molecule_4.sdf | [36.84, 36.84, 36.84, 0.0, 0.0, 0.0, 0.0, 0.0,... |
| 6 | ../data/sdf/molecule_5.sdf | [36.84, 36.84, 36.84, 0.0, 0.0, 0.0, 0.0, 0.0,... |
| 7 | ../data/sdf/penicillin.sdf | [36.84, 36.84, 36.84, 36.84, 36.84, 36.84, 36.... |
| 8 | ../data/sdf/water.sdf | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |


Once our representation information is stored in the pandas dataframe, we can use NumPy to build an array of our representations, which can then be passed to a machine learning method.
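
A note on array shape: calling `np.asarray` on a dataframe column that holds one array per row produces a one-dimensional object array of arrays, whereas many machine learning libraries expect a two-dimensional numeric array with one row per molecule. The sketch below uses made-up stand-in data (not real `chemreps` output) to show the difference and one possible way to get the two-dimensional form with `np.stack`, assuming every representation has the same length, which the zero-padding is meant to ensure.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for equal-length, zero-padded representations.
df = pd.DataFrame({
    'Name': ['a.sdf', 'b.sdf'],
    'Rep': [np.array([1.0, 0.5, 0.0]), np.array([2.0, 0.0, 0.0])],
})

# 1-D object array whose elements are themselves arrays.
as_objects = np.asarray(df['Rep'])

# 2-D float array with one row per molecule and one column per feature.
stacked = np.stack(df['Rep'].tolist())

print(as_objects.shape, as_objects.dtype)
print(stacked.shape, stacked.dtype)
```

Whether the stacked form is needed depends on the machine learning method you pass the data to.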

In [22]:
import numpy as np
reps = np.asarray(df['Rep'])
reps

array([array([36.84  , 36.84  , 36.84  , 36.84  , 36.84  , 36.84  , 36.84  ,
               0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
               0.    ,  0.    ,  0.    , 25.81  , 25.81  , 25.81  , 25.81  ,
              25.81  , 25.81  , 24.75  , 14.9   , 14.9   , 14.9   , 14.9   ,
              14.9   , 14.9   , 14.59  , 14.586 , 12.91  , 12.91  , 12.91  ,
               9.61  ,  9.6   ,  8.48  ,  0.    ,  0.    ,  0.    ,  0.    ,
               0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
               0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
               0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
               0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
               0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
               0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,
               0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  0.    ,