# Tutorial 1: From Ligand Preparation to MD Simulations

## Outline
1. Useful Information Before Starting <br>
    1.1. Input files <br>
    1.2. Required Software and Python Libraries<br>
    1.3. Scripts Usage<br>
    1.4. How to Run the Pipeline for a Different Set of Molecules
2. Make a Copy of the Original Folder<br>
3. Generate Compound 3D-Structures<br>
    3.1. Generate 3D-Structures from SDF Files<br>
    3.2. Generate 3D-Structures from SMILES<br>
4. Parametrization of the Compounds<br>
5. Build a Water Box Around the Compound<br>
6. MD Simulations<br>
    6.1. Simulation Details<br>
    6.2. Run Equilibration, Production, and Post-Processing

## 1. Useful Information Before Starting 

### 1.1. Input Files
Only one input file is required to run this tutorial on a different set of molecules: either an SDF file containing the 2D-structures or an SMI file containing a list of SMILES.

### 1.2. Required Software and Python Libraries

##### Software:
python3 <br>
openbabel <br>
ChemAxon <br>
GROMACS<br>
Ambertools <br>

##### Python Libraries:
RDKit <br>
sys<br>
subprocess<br>
argparse <br>
pandas <br>
numpy <br>
glob <br>
parmed <br>
mdfptools (Download from https://github.com/rinikerlab/mdfptools)<br>


### 1.3. Scripts Usage
Each script begins with a DESCRIPTION header that explains its purpose and the functions it uses. <br>
To read the DESCRIPTION, type

        less script.xx 

##### Print out the usage, a brief description, and information on the input files of Python scripts:

        python3 script.py -h

For example, 

        python3 clean_2d_structures.py -h  
    
gives as output:

        usage: clean_2d_structures.py [-h] -isdf input_file.sdf

        Remove Salts from SDF structure files. Output = filename_clean.sdf

        optional arguments:
          -h, --help                show this help message and exit
          -isdf input_file.sdf      input SDF file


##### Print out usage of bash scripts

        ./script.sh -h

For example,

        ./run_md_water_4tutorial1.sh -h
        
gives as output:

        Usage: ./run_md_water_4tutorial1.sh [-w working_folder] [-i inputs_folder]
        Options:
                -w   working_folder           directory containing topology .top and structure .gro files 
                -i   inputs_folder            directory containing the input .mdp files for gromacs
                -h                            Show this message

## 2. Make a Copy of the Original Folder

Enter the example folder

## 3. Generate Compound 3D-Structures

The first step is the generation of 3D-structures, either from an SDF file or from an SMI file containing a list of SMILES.

- If an SDF file is used and the 2D-structures also contain salts, use the clean_2d_structures.py script to remove the salts. Afterwards, generate the 3D-structures using the script gen_3d_structure_from_SDF.py.
<br>
- If SMILES are used as input, use gen_3d_structure_from_SMILES.py.



### 3.1. Generate 3D-Structures from SDF Files

Clean 2D-structures and generate 3D-structures

Output PDB files are saved in the current directory.

### 3.2. Generate 3D-Structures from SMILES

Print out Usage:


    usage: gen_3d_structure_from_SMILES.py [-h] -ismi input_file.smi [-sep SEP]

    Generate 3D-structures at pH = 7 from a SMILES file. One output file _pH7.pdb
    for each SMILES in the input .smi file

    optional arguments:
      -h, --help            show this help message and exit
      -ismi input_file.smi  Input file containing a list of SMILES. It must
                            contain a "SMILES" column. Columns "CMPD_ID" and
                            "is_sub" are optional. The "CMPD_ID" column contains
                            compound names while the "is_sub" column contains the
                            classification labels. If the "CMPD_ID" column is
                            missing, then compounds will be named as mol_N. If the
                            "is_sub" column is present then the pdb output files
                            will be saved into two directories, "substrates" and
                            "nonsubstrates", instead of the current directory
      -sep SEP              column separator. Default = " "


Output PDB files are written to the directories "substrates" and "nonsubstrates", according to the classification label specified in the input file (see the usage above).
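
The naming and routing rules from the usage text can be sketched in plain Python. This is only an illustration of the described behavior, not the code of gen_3d_structure_from_SMILES.py: the function name `plan_outputs`, the 1-based `mol_N` numbering, the `<name>_pH7.pdb` file name pattern, and the encoding of "is_sub" as `1`/`0` are all assumptions.

```python
import csv
import io
from pathlib import Path


def plan_outputs(smi_text, sep=" "):
    """Return (smiles, output_path) pairs for a .smi table with a header row.

    Hypothetical helper: it only mirrors the rules stated in the usage text.
    """
    reader = csv.DictReader(io.StringIO(smi_text), delimiter=sep)
    planned = []
    for n, row in enumerate(reader, start=1):
        # Fall back to mol_N when the optional CMPD_ID column is missing.
        name = row.get("CMPD_ID") or f"mol_{n}"
        label = row.get("is_sub")
        if label is None:
            # No classification column: write to the current directory.
            folder = Path(".")
        else:
            # Assumption: substrates are labelled "1" (or "True").
            is_substrate = label.strip() in {"1", "True", "true"}
            folder = Path("substrates" if is_substrate else "nonsubstrates")
        planned.append((row["SMILES"], folder / f"{name}_pH7.pdb"))
    return planned
```

For instance, a file with the header `SMILES CMPD_ID is_sub` and the row `CCO ethanol 1` would be routed to `substrates/ethanol_pH7.pdb` under these assumptions.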

## 4. Parametrization of the Compounds

#### Generate GAFF Parameters using AmberTools

Copy the required input files into the current directory:

The parametrization consists of three steps:

**i. For every compound (every PDB file), AM1-BCC partial charges are generated using antechamber from AmberTools18.** <br> The formal charge is calculated using ChemAxon tools, while GAFF parameters are extracted using parmchk2.<br>
Topology and coordinate files are generated using the AmberTools18 program tleap.<br>

**Inputs:** PDB files of the compounds generated in the previous step<br>
**Outputs from antechamber:** FRCMOD and MOL2 files<br>
**Outputs from tleap:** PRMTOP and INPCRD files<br>

**IMPORTANT:** deactivate anaconda environments before running the following script.


**ii. Round the formal charge to an integer number**

Because of numerical errors, the partial charges obtained with antechamber do not sum exactly to an integer. The excess charge is therefore redistributed over the atoms.

**Outputs:** netcharge.mol2 files (with updated partial charges)
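
A minimal sketch of this correction, assuming the excess charge is spread equally over all atoms (the tutorial script may use a different weighting). The function name `redistribute_excess_charge` is hypothetical, and the charges are passed as a plain list instead of being read from a MOL2 file.

```python
def redistribute_excess_charge(charges):
    """Shift every partial charge by the same amount so that the sum
    matches the nearest integer (up to floating-point rounding).

    Assumption: equal redistribution over all atoms.
    """
    total = sum(charges)
    target = round(total)                    # nearest integer net charge
    shift = (target - total) / len(charges)  # equal share of the excess
    return [q + shift for q in charges], target


# Example: three partial charges that sum to -0.0005 instead of 0.
new_charges, net_charge = redistribute_excess_charge([-0.8340, 0.4170, 0.4165])
```

In this example each charge is shifted by roughly +0.000167, so the corrected charges sum to 0 within floating-point precision.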

**iii. Generate the compound topology and coordinate files**

Rewrite the topology and coordinate files using the AmberTools18 program tleap.<br>
**IMPORTANT:** deactivate anaconda environments before running the following script.

Execute from terminal:

## 5. Build a Water Box Around the Compound

To solvate the compounds, a cubic TIP3P water box is generated using tleap such that the minimum distance between the compound and the wall is 1 nm.<br> 

**Outputs**: PRMTOP and INPCRD files are saved in the folder WAT_box
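
As an illustration of the solvation step, the sketch below builds a tleap input for one compound. It is not the tutorial's own script: the function name, the `mol2`/`frcmod` arguments, and the output naming are assumptions. The 1 nm buffer is converted to 10 Å because tleap distances are given in Å, and the `iso` keyword is used here with the intent of requesting an isometric (cubic) box. Adding counter-ions to charged systems is left out.

```python
def tleap_solvate_input(name, mol2, frcmod, buffer_nm=1.0, out_dir="WAT_box"):
    """Return the text of a tleap input that solvates one compound in TIP3P.

    Hypothetical helper; file names and output layout are assumptions.
    """
    buffer_angstrom = buffer_nm * 10.0  # tleap expects angstrom
    lines = [
        "source leaprc.gaff",           # GAFF atom types and parameters
        "source leaprc.water.tip3p",    # defines the TIP3PBOX solvent unit
        f"loadamberparams {frcmod}",    # compound-specific parameters
        f"mol = loadmol2 {mol2}",       # charges and coordinates
        f"solvateBox mol TIP3PBOX {buffer_angstrom:.1f} iso",
        f"saveamberparm mol {out_dir}/{name}.prmtop {out_dir}/{name}.inpcrd",
        "quit",
    ]
    return "\n".join(lines) + "\n"
```

The returned text can be written to a file and passed to tleap with `tleap -f <file>`.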

**IMPORTANT:** deactivate anaconda environments before running the following script.


Convert topology and coordinate files from PRMTOP and INPCRD formats to TOP and GRO formats (required for Gromacs): 

Output structure (.gro) and topology (.top) files are written in the WAT_box folder. 

#### Remove all the files that are not needed in the next steps

## 6. MD Simulations

### 6.1. Simulation Details

All simulations are performed using GROMACS. Download the latest version of GROMACS from http://manual.gromacs.org/documentation/

 
#### Simulation Details
The systems are minimized using the steepest descent algorithm for a maximum of 50000 steps
with a convergence criterion of 1000 $\mathrm{kJ\ mol^{-1}\ nm^{-1}}$. <br>A two-step NPT equilibration is carried out: first, a 50 ps equilibration is run using the velocity-rescaling thermostat (coupling constant of 0.1 ps) and the Berendsen barostat (coupling constant of 2.0 ps); then, an additional 50 ps equilibration is run using the more accurate Nose-Hoover thermostat (coupling constant of 0.5 ps) and the Parrinello-Rahman barostat (coupling constant of 5 ps). The equilibration is followed by a 5 ns production run in the NVT ensemble, in which the temperature is controlled using the Nose-Hoover chain thermostat with a coupling constant of 0.5 ps. In all simulations, the temperature is kept at 298.15 K. Bonds are constrained using LINCS. Periodic boundary conditions (PBC) are applied. Both Lennard-Jones and electrostatic interactions are truncated at a distance of 10 Å, and long-range electrostatic interactions are treated using the particle-mesh Ewald (PME) summation method. Newton's equations of motion are integrated using the Verlet scheme with a time step of 2 fs. Snapshots are saved every 500 steps (1 ps), which results in a total of 5000 snapshots per compound.
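
The step counts quoted above follow directly from the run length, the time step, and the output interval. The helper below (a hypothetical name, not part of the tutorial scripts) reproduces the arithmetic, which is handy when shortening the production run for a test.

```python
def md_schedule(total_ns=5.0, dt_fs=2.0, save_every=500):
    """Return (nsteps, save interval in ps, number of saved snapshots)."""
    nsteps = round(total_ns * 1e6 / dt_fs)        # 1 ns = 1e6 fs
    save_interval_ps = save_every * dt_fs / 1000  # 1 ps = 1000 fs
    n_snapshots = nsteps // save_every
    return nsteps, save_interval_ps, n_snapshots


print(md_schedule())      # 5 ns production: (2500000, 1.0, 5000)
print(md_schedule(0.05))  # 50 ps test run:  (25000, 1.0, 50)
```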


#### Input files that are required for equilibration and production
The input files that are used to carry out the simulations are stored in the folder "inputs_md_gromacs". These are:

    1. minim_wat.mdp     To minimize the system
    2. npt_wat_1x.mdp    To run the first 50 ps NPT equilibration with velocity-rescaling thermostat 
                         and Berendsen barostat
    3. npt_wat_2x.mdp    To run the second 50 ps NPT equilibration with Nose-Hoover thermostat
                         and Parrinello-Rahman barostat
    4. nvt_wat_3x.mdp    To run the 5 ns NVT production. The temperature is kept at 298.15 K using
                         the Nose-Hoover chain thermostat
                         
    x=a if no counter-ions are present in the system
    x=b if there are counter-ions to neutralize the system
    
    
#### Post-Processing
The ligand is centered in the box using gmx trjconv. Afterwards, the energy terms (LJ and electrostatic) are recalculated using the reaction field (RF) method instead of PME, which was employed for production. While PME is compatible with the force-field parametrization scheme of AMBER, it does not allow for a direct energy decomposition into solute-solute and solute-solvent terms, which is required for the construction of the MDFPs. Therefore, the energy terms are recomputed with the RF expression based on the trajectory coordinates using the "-rerun" option of the GROMACS mdrun tool (see also the [GROMACS documentation](http://manual.gromacs.org/documentation/5.1/user-guide/mdrun-features.html)). The RF dielectric permittivity is set to 78.5. The energy values for each trajectory snapshot are stored in a file named "nvt3_rf_basename.edr". The gmx energy tool of GROMACS extracts the energy components from this edr energy file. The output "nvt3_rf_basename.xvg" file is required to construct the MDFPs.
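
The rerun workflow can be sketched as a sequence of GROMACS commands. The helper below only assembles the command lines; it does not run them and is not taken from run_md_water_4tutorial1.sh. The function name, the argument names, and the `.gro`/`.top`/`.tpr` file naming are assumptions, while the `nvt3_rf_<basename>` output names follow the text above.

```python
def rf_rerun_commands(basename, rf_mdp, trajectory):
    """Return the gmx command lines for an RF energy re-evaluation.

    Hypothetical helper: rf_mdp is the reaction-field .mdp file and
    trajectory is the production trajectory to be re-evaluated.
    """
    rf = f"nvt3_rf_{basename}"
    return [
        # Build a run input file that uses the reaction-field settings.
        ["gmx", "grompp", "-f", rf_mdp, "-c", f"{basename}.gro",
         "-p", f"{basename}.top", "-o", f"{rf}.tpr"],
        # Recompute energies on the saved frames instead of propagating.
        ["gmx", "mdrun", "-s", f"{rf}.tpr", "-rerun", trajectory,
         "-deffnm", rf],
        # Extract energy terms; gmx energy asks for the terms on stdin.
        ["gmx", "energy", "-f", f"{rf}.edr", "-o", f"{rf}.xvg"],
    ]
```

Each list can be passed to `subprocess.run`; for `gmx energy`, the selection of energy terms has to be supplied through standard input.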

#### Input files that are required for the post-processing step
The input file required to recompute the energy terms from the trajectory file is:
    
    nvt_rf_wat_3x.mdp    To recompute energy terms with the reaction field expression 
    
This mdp file is the same as the one used for production (nvt_wat_3x.mdp), except that:

    i)    coulombtype = Reaction-Field instead of coulombtype = PME
    ii)   Related keywords were also modified
    iii)  energy groups for which to write the energy components were specified. Added keywords:
            energygrps       = LIG Water
            energygrp-excl   = Water Water   # exclude water-water interactions

### 6.2. Run Equilibration, Production, and Post-Processing

Each of the steps described above can be executed using the script run_md_water_4tutorial1.sh. To check how to use this script, type:

This gives as output

        Usage: ./run_md_water_4tutorial1.sh [-w working_folder] [-i inputs_folder]
        Options:
                -w   working_folder           directory containing topology .top and structure .gro files 
                -i   inputs_folder            directory containing the input .mdp files for gromacs
                -h                            Show this message

For example, to run the simulations of compound CHEMBL1077779, whose structure and topology files are contained in the folder ./nonsubstrates/WAT_box/, type: 

#### IMPORTANT: 
This step might take a long time. If you want to test the run_md_water_4tutorial1.sh script without waiting too long for the NVT production to finish, modify the nvt_\*.mdp files contained in the inputs_md_gromacs folder by changing the number of steps (nsteps) from 2500000 (5 ns) to 25000 (50 ps).

#### NOTE: 
On a cluster with 4 CPUs, each simulation takes about 1 h. If you need to carry out simulations for 200 compounds
and you want to ensure that the simulations finish within 24 h, you can parallelize the task by splitting the input .gro and .top files into 10 subdirectories (dir_001 to dir_010) containing 20 compounds each using the following bash lines:
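
For reference, the same batching logic can be sketched in Python. This is an illustrative alternative, not the tutorial's bash lines: the function name `split_into_batches`, the choice to copy rather than move files, and the alphabetical ordering of compounds are assumptions.

```python
import shutil
from pathlib import Path


def split_into_batches(src, dest, batch_size=20):
    """Copy each compound's .gro/.top pair into dest/dir_001, dir_002, ...

    Only compounds that have both files are considered.
    Returns the number of compounds that were distributed.
    """
    src, dest = Path(src), Path(dest)
    names = sorted(
        p.stem for p in src.glob("*.gro") if (src / f"{p.stem}.top").exists()
    )
    for i, name in enumerate(names):
        batch = dest / f"dir_{i // batch_size + 1:03d}"
        batch.mkdir(parents=True, exist_ok=True)
        for ext in (".gro", ".top"):
            shutil.copy2(src / f"{name}{ext}", batch / f"{name}{ext}")
    return len(names)
```

With 200 compounds and the default batch size of 20, this produces dir_001 to dir_010, and each directory can then be passed to run_md_water_4tutorial1.sh through the -w option as a separate job.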