### Welcome to the ProtoSyn.jl examples

# 14 - Non-canonical amino acids

ProtoSyn.jl allows users to define non-canonical amino acids (NCAAs) and include them in simulations, such as design efforts and mutations. This brief tutorial walks through the process of adding a new NCAA and then uses it in a simple mutation.

The first step to introduce a new NCAA is to find or retrieve a 3D model for the amino acid. This can be done in various ways:
* From any online repository
* Building from SMILES or 2D structure representation (for example, using a tool like https://molview.org/)
* Selecting a specific NCAA residue from a larger conformation
* Manually building the residue atom by atom in ProtoSyn.jl (laborious, not recommended)

For this tutorial, the NCAA of interest is the p-azido-phenylalanine in the 4J88 entry in the Protein Data Bank (PDB), with the residue name "CQ1". As a first step, the whole PDB structure is downloaded from the database.

In [1]:
using ProtoSyn


.      ____            _       ____              
      |  _ \ _ __ ___ | |_ ___/ ___| _   _ _ __  
      | |_) | '__/ _ \| __/ _ \___ \| | | | '_ \ 
      |  __/| | | (_) | || (_) |__) | |_| | | | |
      |_|   |_|  \___/ \__\___/____/ \__, |_| |_|
                                       |_/       
    
      ---------------------------------------------

 Version      : 1.01
 License      : GNU-GPL-3
 Developed by : José Pereira (jose.manuel.pereira@ua.pt)
                Sérgio Santos


┌ Info: Precompiling ProtoSyn [c9758760-7c0d-11e9-0ffc-fb9355b7d293]
└ @ Base loading.jl:1423
┌ Info: Skipping precompilation since __precompile__(false). Importing ProtoSyn [c9758760-7c0d-11e9-0ffc-fb9355b7d293].
└ @ Base loading.jl:1124
┌ Info: Loading required packages
└ @ ProtoSyn /home/jpereira/ProtoSyn.jl/src/ProtoSyn.jl:17
┌ Info:  | Loading SIMD
└ @ ProtoSyn /home/jpereira/ProtoSyn.jl/src/ProtoSyn.jl:21
┌ Info:  | Loading CUDA
└ @ ProtoSyn /home/jpereira/ProtoSyn.jl/src/ProtoSyn.jl:23
┌ Info: Setting up variables
└ @ ProtoSyn /home/jpereira/ProtoSyn.jl/src/ProtoSyn.jl:26
┌ Info: Current acceleration set to ProtoSyn.CUDA_2
└ @ ProtoSyn /home/jpereira/ProtoSyn.jl/src/ProtoSyn.jl:67
┌ Info: Loading Core
└ @ ProtoSyn /home/jpereira/ProtoSyn.jl/src/ProtoSyn.jl:71
┌ Info: Loading Calculators
└ @ ProtoSyn /home/jpereira/ProtoSyn.jl/src/ProtoSyn.jl:100
┌ Info:  | Loading TorchANI
└ @ ProtoSyn.Calculators /home/jpereira/ProtoSyn.jl/src/Core/Calculators/Calculators.jl:18
┌ Info:  | Loadi

In [2]:
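# Download 4J88, skipping chain B and the listed solvent/buffer residues,
# and explicitly including the CQ1 residue.
pose = ProtoSyn.Peptides.download(
    "4J88",
    bonds_by_distance = true,
    ignore_residues   = ["HOH", "SO4", "EDO", "TRS"],
    ignore_chains     = ["B"],
    include_residues  = ["CQ1"])
rm("4J88.pdb") # remove the local PDB file once the pose is loaded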

└ @ ProtoSyn.Peptides /home/jpereira/ProtoSyn.jl/src/Peptides/Methods/io.jl:90


The residue name will be used to query for the right residue. Note that during the loading of the 4J88 pose, ProtoSyn was unable to pinpoint the exact N atom to act as a starting point for the inter-residue graph. ProtoSyn attempts to identify this atom by the number and types of its bonds, but in complex NCAAs this criterion can fail. Whenever loading a new NCAA, it's important to manually check the attributed graph and make any necessary changes.

In [3]:
# Collect the residue names of the first segment and locate the CQ1 residue
residue_names = [r.name.content for r in eachresidue(pose.graph[1])]
rid  = findfirst(==("CQ1"), residue_names)
ncaa = pose.graph[1, rid]

Residue{/4J88:1476/A:1/CQ1:64}
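
`findfirst` returns `nothing` when no element matches, so indexing the pose with its result would fail with a less obvious error if the residue name were absent (for example, after a typo in the name). A minimal sketch of a guard, using a plain vector of strings as a stand-in for `residue_names`; the helper name `find_residue_index` and the sample names are hypothetical and not part of ProtoSyn:

```julia
# Hypothetical helper (not part of ProtoSyn): locate a residue name in a list
# of names and fail early, with a clear message, when it is absent.
function find_residue_index(names::Vector{String}, target::String)
    idx = findfirst(==(target), names)  # `nothing` when there is no match
    idx === nothing && error("Residue \"$target\" not found")
    return idx
end

# Stand-in data; in the tutorial this would be `residue_names`.
names = ["MET", "SER", "CQ1", "VAL"]
rid_demo = find_residue_index(names, "CQ1")
```

The same check could be applied to `rid` before it is used to index `pose.graph`.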

The objective is now to extract this residue as a single entry. In ProtoSyn, this is called a _Fragment_ (a vehicle for temporary information regarding both the structure graph and state). 

To prepare a Fragment that carries only the residue of interest, a few additional steps may be necessary:
1. Re-order the fragment atoms and fix the intra-residue graph. Note that this is only necessary for some NCAAs. For this example, ProtoSyn identified "N2" as the connection point to any previous amino acids in the chain. Manual inspection of the structure shows that this identification is wrong and needs to be fixed: the connected atom is "N1". As such, a new intra-residue parenthood graph is inferred (starting from the "N1" atom) and the inter-residue parenthood (at both the Atom and Residue levels) is set manually.

In [4]:
ProtoSyn.infer_parenthood!(ncaa, start = ncaa["N1"], overwrite = true)

# Residue level
popparent!(ncaa)
setparent!(ncaa, pose.graph[1, rid - 1])

# Atom level
popparent!(ncaa["N1"])
setparent!(ncaa["N1"], pose.graph[1, rid - 1, "C"]);
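
To picture why the choice of starting atom matters, the toy sketch below derives parents by walking a bond table breadth-first from a chosen atom. It is a conceptual illustration only: the atom names and bonds are invented, the function `toy_infer_parents` is hypothetical, and the traversal is an assumption made for the example rather than a description of how `infer_parenthood!` works internally.

```julia
# Toy illustration only: derive a parent for each atom by walking the bond
# table breadth-first from a chosen start atom. Atom names and bonds are
# invented for this sketch and do not correspond to the real CQ1 residue.
function toy_infer_parents(bonds::Dict{String, Vector{String}}, start::String)
    parents = Dict{String, Union{String, Nothing}}(start => nothing)
    queue = [start]
    while !isempty(queue)
        atom = popfirst!(queue)
        for neighbor in bonds[atom]
            if !haskey(parents, neighbor)   # the first visit decides the parent
                parents[neighbor] = atom
                push!(queue, neighbor)
            end
        end
    end
    return parents
end

bonds = Dict(
    "N1" => ["CA"],
    "CA" => ["N1", "C", "CB"],
    "C"  => ["CA", "N2"],
    "CB" => ["CA"],
    "N2" => ["C"],
)

from_n1 = toy_infer_parents(bonds, "N1")  # "CA" is a child of "N1"
from_n2 = toy_infer_parents(bonds, "N2")  # "CA" is now a child of "C"
```

The same bonds yield different parent assignments depending on the start atom, which is why a wrongly identified starting N atom has to be corrected before the residue is used.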

2. Define the inter-residue bonding atom names. The Peptides grammar expects the existence of certain atoms with given names, namely the "C" and "N" atoms, to form the peptide bond. In this example, the residue extracted from the PDB file uses alternative atom names, outside the IUPAC convention. We therefore need to satisfy this requirement manually by renaming the atoms.

In [5]:
ProtoSyn.rename!(ncaa["N1"], "N");
ProtoSyn.rename!(ncaa["C3"], "C");

3. Most ProtoSyn methods that loop over atoms (using the `eachatom` function) do so based on the current order of atoms within their encompassing `AbstractContainer`. As such, it may be necessary to re-order the atoms to match the new `Graph`.

In [6]:
ProtoSyn.sort_atoms_by_graph!(pose.state, ncaa, ncaa["N"])
ProtoSyn.reindex(pose)

State{Float64}:
 Size: 4141
 i2c: false | c2i: false
 Energy: Dict(:Total => Inf)


4. Since the `Graph` parenthood relationships changed, issuing a `request_i2c!` now would produce cartesian coordinates different from the original ones, because the same internal coordinates would be applied to a different `Graph`. It's therefore necessary to update the internal coordinates from the current cartesian coordinates and `sync!` the `Graph` and `State`.

In [7]:
ProtoSyn.request_c2i!(pose.state)
sync!(pose)

Pose{Topology}(Topology{/4J88:1476}, State{Float64}:
 Size: 4141
 i2c: false | c2i: false
 Energy: Dict(:Total => Inf)
)
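
The reason for refreshing internal coordinates after re-parenting can be seen in a deliberately simplified one-dimensional model, where an atom's only internal coordinate is its offset from its parent. The atom names and positions below are made up for illustration, and the sketch does not use any ProtoSyn function:

```julia
# Toy 1D model: an atom's internal coordinate is its offset from its parent.
cartesian = Dict("A" => 0.0, "B" => 1.5, "C" => 4.0)

# Internal coordinate of C computed while its parent was B.
offset_C = cartesian["C"] - cartesian["B"]          # 2.5

# Re-parent C to A but keep the stale internal coordinate: C moves.
stale_position = cartesian["A"] + offset_C          # 2.5, no longer 4.0

# Refresh the internal coordinate from the cartesian position first
# (the role played by the c2i request), then rebuild: C stays in place.
fresh_offset_C  = cartesian["C"] - cartesian["A"]   # 4.0
synced_position = cartesian["A"] + fresh_offset_C   # 4.0
```

In the stale case the atom ends up at a different position, while recomputing the offset against the new parent preserves the original geometry.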

We can now create the fragment. Since the CQ1 residue has a complex backbone graph, we ask ProtoSyn to sort the atoms according to the newly defined intra-residue graph by setting the `sort_atoms_by_graph` flag to `true`. This is necessary because the `fragment` method simply starts the Graph traversal at the first atom, so leaving the structure unsorted may leave some atoms behind.

In [8]:
frag = fragment(pose, SerialSelection{Residue}(rid, :id), sort_atoms_by_graph = true)

Fragment(Segment{/A:36231}, State{Float64}:
 Size: 24
 i2c: false | c2i: false
 Energy: Dict(:Total => Inf)
)

Although not necessary, it's useful to mark new NCAAs with distinct names.

In [9]:
frag.graph.name = "CQ1"
frag

Fragment(Segment{/CQ1:36231}, State{Float64}:
 Size: 24
 i2c: false | c2i: false
 Energy: Dict(:Total => Inf)
)

This Fragment will then be written to a .YML file containing all the important internal coordinate information. Note that this file does not contain charge information. Charges, if necessary, can be calculated later or assigned manually. Since we intend this file to be available for mutation purposes, we can place the generated file in the ProtoSyn.Peptides resources directory: "resources/Peptides/NCAA/yml".

In [10]:
dest = joinpath(ProtoSyn.Peptides.resource_dir, "NCAA/yml")
ProtoSyn.write(Pose(frag), joinpath(dest, "cq1.yml"))

In order to include the newly defined CQ1 NCAA in the Peptides grammar, we need to add it to the grammar file, located in "resources/Peptides/grammars.yml". Make sure that in the _ncaa_ entry, under the _variables_ field, there's an entry pointing towards the recently defined .YML file ("Peptides/NCAA/yml/cq1.yml").

We can now either re-load ProtoSyn to include the new residue, or re-load the grammar into a new variable. After loading the grammar, we can now use it to perform any manipulation desired, such as a mutation. We will load the 2A3D structure and mutate residue 38 to this new NCAA.

In [11]:
ncaa_grammar = ProtoSyn.load_grammar_from_file(joinpath(ProtoSyn.Peptides.resource_dir, "grammars.yml"), "ncaa")

LGrammar{Float64, String, Vector{String}}:
 Rules:
 Variables:
 c => Fragment(Segment{/CQ1:32867}, State{Float64}:
 Size: 24
 i2c: false | c2i: false
 Energy: Dict(:Total => Inf)
)
 b => Fragment(Segment{/MSE:4424}, State{Float64}:
 Size: 16
 i2c: false | c2i: false
 Energy: Dict(:Total => Inf)
)
 a => Fragment(Segment{/CMA:29688}, State{Float64}:
 Size: 17
 i2c: false | c2i: false
 Energy: Dict(:Total => Inf)
)
 Operators:
 α => #114 (Between atoms C & N)
 β => #114 (Between atoms C & N)

 None.

In [12]:
pose = ProtoSyn.Peptides.load("data/2a3d.pdb")

Pose{Topology}(Topology{/2a3d:61481}, State{Float64}:
 Size: 1140
 i2c: false | c2i: false
 Energy: Dict(:Total => Inf)
)

ProtoSyn provides two methods for introducing mutations. The "regular" method uses the `mutate!` function, which maintains the backbone and substitutes only the sidechain. However, in the case of the CQ1 NCAA, the backbone is completely different. For such cases, ProtoSyn offers a complementary method that replaces the whole residue: the `force_mutate!` function.

In [13]:
ProtoSyn.Peptides.force_mutate!(pose, pose.graph[1][38], ncaa_grammar, seq"c")

Pose{Topology}(Topology{/2a3d:61481}, State{Float64}:
 Size: 1144
 i2c: true | c2i: false
 Energy: Dict(:Total => Inf)
)

In [14]:
using Bio3DView
ProtoSyn.write(pose, "output/example14.pdb")
style = Style("stick")
viewfile("output/example14.pdb", style = style)

In [15]:
pose_raw = ProtoSyn.Peptides.load("data/2a3d.pdb")

Pose{Topology}(Topology{/2a3d:14506}, State{Float64}:
 Size: 1140
 i2c: false | c2i: false
 Energy: Dict(:Total => Inf)
)

## Conclusion

In this example, the incorporation of complex NCAAs was explored in detail. It's important to make sure the newly defined NCAA's graph and state are correct, with atoms in the correct order. Another important aspect is the atom naming. In most cases, renaming the backbone atoms may be necessary in order to follow IUPAC recommendations. ProtoSyn's `assign_default_atom_names!` function may help in the simpler cases.