# Outline: Representation and Selection of Descriptors
1. Description of typical descriptors and libraries for handling biomolecules
	a. Molecular fingerprints (e.g., ECFP, MACCS keys) 
	b. Learned representations (e.g., learned embeddings from molecular strings {SMILES, SELFIES, SMARTS patterns for structural fragments}, 2D graphs).
	c. Atomic-centered descriptors
		i. SOAP
	d. Coulomb matrices (where to put, exactly?)
	e. Physicochemical properties (e.g., molecular weight, logP, perhaps just go over the ones in QM7)

2. Typical data preprocessing steps
	a. Handling data on different scales (normalization and scaling techniques, standardization {e.g., z-score normalization}, min-max scaling, robust scaling 

3. Feature selection
	a.Illustrate with 2D representation, 3D representation, and MD trajectories (3 examples of increasing complexity).

# Description of typical descriptors and libraries for handling biomolecules 
Molecules are complex and can be represented by strings (e.g., SMILES), 2D graphs, and conformers. There are a few libraries for working with molecule "objects" in Python, one of the most popular being [RDKit](https://www.rdkit.org/), which we will also use in this notebook.

To import the necessary libraries for this tutorial, run the following cell:

In [61]:
from rdkit import Chem       # description
from rdkit.Chem import Draw  # used for drawing molecules
import numpy as np
import itertools
import scipy.io
import xyz2mol as xyz2mol

We can begin analyzing molecules by looking at the molecules in QM7. The QM7 dataset is a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) composed of all molecules of up to 23 atoms (including 7 heavy atoms C, N, O, and S), totalling 7165 molecules. It contains the Coulomb matrix representation of these molecules and their atomization energies computed similarly to the FHI-AIMS implementation of the Perdew-Burke-Ernzerhof hybrid functional (PBE0). This dataset features a large variety of molecular structures such as double and triple bonds, cycles, carboxy, cyanide, amide, alcohol and epoxy.

To download and load the QM7 dataset as a Python dictionary, run:

In [11]:
!wget http://quantum-machine.org/data/qm7.mat
!mv qm7.mat ../data/.

raw_qm7 = scipy.io.loadmat("../data/qm7.mat") # the qm7 dataset will be loaded as a Python dictionary

--2023-07-06 11:40:22--  http://quantum-machine.org/data/qm7.mat
Resolving quantum-machine.org (quantum-machine.org)... 130.149.80.145
Connecting to quantum-machine.org (quantum-machine.org)|130.149.80.145|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17884280 (17M)
Saving to: ‘qm7.mat.1’


2023-07-06 11:40:24 (9,26 MB/s) - ‘qm7.mat.1’ saved [17884280/17884280]



The dataset is composed of five multidimensional arrays:
- X (7165 x 23 x 23) - Coulomb matrices
- T (7165) - atomization energies (the labels)
- P (5 x 1433) - cross-validation splits
- Z (7165) - atomic charge of each atom in the molecules
- R (7165 x 3) - Cartesian coordinates of each atom in the molecules

In [90]:
X = raw_qm7["X"]
T = raw_qm7["T"]
P = raw_qm7["P"]
Z = raw_qm7["Z"]
R = raw_qm7["R"]

To make it a bit easier to work with the molecules, let's convert them into RDKit ´Mol´ objects:

In [107]:
# Get the molobjs
atoms       = Z
coordinates = R

mols = xyz2mol.xyz2mol(atoms[0], coordinates[0])

mol

[6 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] <class 'numpy.int64'>


ArgumentError: Python argument types in
    Atom.__init__(Atom, numpy.int64)
did not match C++ signature:
    __init__(_object*, unsigned int)
    __init__(_object*, RDKit::Atom)
    __init__(_object*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>)

In [47]:
print(mol_0)
Chem.rdmolfiles.MolToSmiles(mol_0)


<rdkit.Chem.rdchem.Mol object at 0x7ff058673b30>


'*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.C.[HH].[HH].[HH].[HH]'

## Molecular fingerprints