In [1]:
# This cell is removed with the tag: "remove-input"
# As such, it will not be shown in documentation

import warnings
warnings.filterwarnings('ignore')


(UserGuide_Tools_Basic_Select)=
# Select

Element selection is probably one of the most frequent tasks when working with molecular systems. There are many circumstances in which we need the list of elements that satisfy a certain condition. We may need, for instance, to calculate the contact map between CA atoms from two chains, to remove the solvent atoms, or to know how many 'HIS' residues there are in a protein. All these conditions can be expressed as a sentence that a query over elements needs to match. Each library for working with molecular systems, molecular dynamics engine or molecular visualization software has its own syntax that you have to follow to write this sentence. See, for instance, the different examples in the tools [MDTraj](https://www.mdtraj.org/1.9.7/atom_selection.html), [PyTraj](https://amber-md.github.io/pytraj/latest/atom_mask_selection.html?highlight=select), [MDAnalysis](https://docs.mdanalysis.org/stable/documentation_pages/selections.html), [NGLView](http://nglviewer.org/ngl/api/manual/usage/selection-language.html).

## MolSysMT selection syntax

Although you can use the function {func}`molsysmt.basic.select` with your preferred syntax (see XXX), MolSysMT has its own selection syntax based on the attributes of elements such as atoms, groups, molecules, etc. Let's load a molecular system to explain the logic behind this syntax:

In [5]:
import molsysmt as msm

In [6]:
molecular_system = msm.convert('1TCD', to_form='molsysmt.MolSys')

A molecular system encoded in the native form 'molsysmt.MolSys' has a pandas DataFrame with the breakdown of its atoms:

In [7]:
molecular_system.topology.atoms_dataframe

| | atom_index | atom_name | atom_id | atom_type | group_index | group_name | group_id | group_type | component_index | component_name | ... | molecule_type | entity_index | entity_name | entity_id | entity_type | occupancy | alternate_location | b_factor | formal_charge | partial_charge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | N | 1 | N | 0 | LYS | 4 | aminoacid | 0 | 0 | ... | protein | 0 | Triosephosphate isomerase | 0 | protein | 1.0 | | 0.3226 | 0.0 | |
| 1 | 1 | CA | 2 | C | 0 | LYS | 4 | aminoacid | 0 | 0 | ... | protein | 0 | Triosephosphate isomerase | 0 | protein | 1.0 | | 0.3328 | 0.0 | |
| 2 | 2 | C | 3 | C | 0 | LYS | 4 | aminoacid | 0 | 0 | ... | protein | 0 | Triosephosphate isomerase | 0 | protein | 1.0 | | 0.3253 | 0.0 | |
| 3 | 3 | O | 4 | O | 0 | LYS | 4 | aminoacid | 0 | 0 | ... | protein | 0 | Triosephosphate isomerase | 0 | protein | 1.0 | | 0.3216 | 0.0 | |
| 4 | 4 | CB | 5 | C | 0 | LYS | 4 | aminoacid | 0 | 0 | ... | protein | 0 | Triosephosphate isomerase | 0 | protein | 1.0 | | 0.2535 | 0.0 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3978 | 3978 | O | 3979 | O | 657 | HOH | 339 | water | 162 | 162 | ... | water | 1 | water | 1 | water | 1.0 | | 0.4365 | 0.0 | |
| 3979 | 3979 | O | 3980 | O | 658 | HOH | 340 | water | 163 | 163 | ... | water | 1 | water | 1 | water | 1.0 | | 0.3323 | 0.0 | |
| 3980 | 3980 | O | 3981 | O | 659 | HOH | 341 | water | 164 | 164 | ... | water | 1 | water | 1 | water | 1.0 | | 0.3706 | 0.0 | |
| 3981 | 3981 | O | 3982 | O | 660 | HOH | 342 | water | 165 | 165 | ... | water | 1 | water | 1 | water | 1.0 | | 0.3207 | 0.0 | |


As you can see, the column names are the fundamental attributes of the molecular system elements:

In [8]:
print(molecular_system.topology.atoms_dataframe.columns)

Index(['atom_index', 'atom_name', 'atom_id', 'atom_type', 'group_index',
       'group_name', 'group_id', 'group_type', 'component_index',
       'component_name', 'component_id', 'component_type', 'chain_index',
       'chain_name', 'chain_id', 'chain_type', 'molecule_index',
       'molecule_name', 'molecule_id', 'molecule_type', 'entity_index',
       'entity_name', 'entity_id', 'entity_type', 'occupancy',
       'alternate_location', 'b_factor', 'formal_charge', 'partial_charge'],
      dtype='object')


The syntax proposed by pandas to perform queries on a pandas.DataFrame is the basis of the MolSysMT selection procedure. The boolean syntax of pandas includes the following words and symbols:

<center>

| Word | Symbol | Meaning |
|---|---|---|
| and | & | and |
| or | \| | or |
| not | ~ | not |
| in | | in |
|  | == | equal |
|  | != | not equal |
|  | < | less than |
|  | <= | less than or equal to |
|  | > | greater than |
|  | >= | greater than or equal to |

</center>



As such, the selection sentence can also include references to external lists. Let's see some simple examples.

## Element selections with atom indices

In [None]:
msm.select(molecular_system, selection=[0,1,2])

In [None]:
msm.select(molecular_system, element='group', selection=[0,1,2,3,4,5,6,7,8,9,10,11])

In [None]:
msm.select(molecular_system, element='molecule', selection=[3900, 3910, 3920])

## Simple atom selection by attributes or properties
The following are some examples where a list of atoms is obtained matching some selection criteria:

In [None]:
# Atoms with name C
msm.select(molecular_system, 'atom_name == "C"')

In [None]:
# Atoms with name CA or CB
msm.select(molecular_system, 'atom_name in ["CA","CB"]')

In [None]:
# Atoms of type C or N
msm.select(molecular_system, 'atom_type==["C","N"]')

In [None]:
# Heavy atoms
msm.select(molecular_system, 'not atom_type=="H"')

In [None]:
# Atoms of type C not named CA
msm.select(molecular_system, 'atom_type=="C" and not atom_name=="CA"')

In [None]:
# Atoms not named CA, CB or C
msm.select(molecular_system, 'atom_name!=["CA","CB","C"]')

In [None]:
# Atoms with id number lower than 10
msm.select(molecular_system, 'atom_id<10')

In [None]:
# Atoms with id number lower than 10 and greater than or equal to 3
msm.select(molecular_system, 'atom_id<10 and atom_id>=3')
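
Since these sentences follow the pandas query syntax, their logic can be tried on a small table with plain pandas. The sketch below uses a hypothetical five-atom table (the values are made up for illustration and are not taken from 1TCD) and calls `DataFrame.query` directly. It is only a way to see what each kind of sentence matches, not a description of how `msm.select` works internally.

```python
import pandas as pd

# Toy atoms table (hypothetical values, for illustration only)
atoms = pd.DataFrame({
    "atom_name": ["N", "CA", "C", "O", "CB"],
    "atom_type": ["N", "C", "C", "O", "C"],
    "atom_id": [1, 2, 3, 4, 5],
})

# Comparing a column with a list: '==' behaves like 'in', '!=' like 'not in'
type_c_or_n = atoms.query('atom_type == ["C", "N"]').index.tolist()
not_ca_cb_c = atoms.query('atom_name != ["CA", "CB", "C"]').index.tolist()

# Boolean words combine several conditions in one sentence
c_not_ca = atoms.query('atom_type == "C" and atom_name != "CA"').index.tolist()
id_range = atoms.query('atom_id < 10 and atom_id >= 3').index.tolist()

print(type_c_or_n, not_ca_cb_c, c_not_ca, id_range)
```

The row labels returned here play the role of the atom indices in the examples above.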

## Including attributes of other elements

Atoms can be selected using attributes of the other elements in the hierarchical organization of the molecular system: 'group', 'component', 'molecule', 'chain', 'entity' or 'bioassembly'. You can find further information on these elements in XXX. These are some examples of selection sentences that include criteria other than atom attributes:

In [None]:
# Atoms belonging to molecules of type water.
msm.select(molecular_system, 'molecule_type=="water"')

In [None]:
# Heavy atoms belonging to molecules of type protein, in the group with index 3.
msm.select(molecular_system, 'molecule_type=="protein" and atom_type!="H" and group_index==3')

In [None]:
# Atoms belonging to residues named GLY, ALA or VAL in chain id A.
msm.select(molecular_system, 'group_name==["GLY","ALA","VAL"] and chain_id=="A"') 

## Including external variables

The pandas query method allows the use of external variables in the logical sentence. To include them, variable names have to be preceded by the character '@'. Let's illustrate this with some examples:

In [None]:
# Atoms in groups with indices 10, 11 or 12.
indices=[10,11,12]
msm.select(molecular_system, 'group_index==@indices')

In [None]:
# Atoms named CA, C, O or N with atom indices 10 to 29.
indices=list(range(10,30))
atoms=["CA", "C", "O", "N"]
msm.select(molecular_system, 'atom_name==@atoms & atom_index==@indices') 
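
The `@` mechanism comes from pandas itself, so it can be checked in isolation. The following is a minimal sketch on a hypothetical six-atom table (made-up values); the variable names `names` and `indices` are arbitrary and only need to exist in the scope where the query is evaluated.

```python
import pandas as pd

# Toy atoms table (hypothetical values, for illustration only)
atoms = pd.DataFrame({
    "atom_name": ["N", "CA", "C", "O", "N", "CA"],
    "group_index": [10, 10, 10, 10, 11, 11],
})

# External Python variables are referenced with '@' inside the sentence
indices = [11, 12]
names = ["CA", "N"]
matched = atoms.query('atom_name == @names and group_index == @indices').index.tolist()

print(matched)
```

Keeping the lists outside the sentence avoids building long strings by hand when the indices come from a previous calculation.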

## Including mask filters

Although including masks is not really necessary, `molsysmt.select()` has an optional input argument to do so:

In [None]:
# Atoms named C with atom index in range 10 to 29
indices=list(range(10,30))
msm.select(molecular_system, 'atom_name=="C"', mask=indices)

The use of masks can always be avoided by writing the restriction in the logical sentence:

In [None]:
# Atoms named C with atom index in range 10 to 29
indices=list(range(10,30))
msm.select(molecular_system, 'atom_name=="C" and atom_index in @indices')
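
The equivalence between both forms can be pictured with plain pandas on a toy table. This sketch assumes that a mask can be read as a list of allowed indices applied on top of the sentence; that reading is an assumption made for illustration, and the table values are hypothetical.

```python
import pandas as pd

# Toy atoms table (hypothetical values, for illustration only)
atoms = pd.DataFrame({
    "atom_index": [0, 1, 2, 3, 4, 5],
    "atom_name": ["N", "CA", "C", "O", "C", "C"],
})

mask = [2, 3, 4]

# Option 1: evaluate the sentence first, then keep only the indices allowed by the mask
hits = atoms.query('atom_name == "C"')["atom_index"].tolist()
with_mask = [i for i in hits if i in mask]

# Option 2: write the same restriction inside the sentence
in_sentence = atoms.query('atom_name == "C" and atom_index in @mask')["atom_index"].tolist()

print(with_mask, in_sentence)
```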

## Selection of other elements

The selection method of MolSysMT can also return the indices of elements other than atoms. Like many methods in this library, `molsysmt.select()` has an input argument named `element` that sets the nature of the elements in the output list of indices. Let's see some examples:

In [None]:
# Groups with indices equal to 0, 100 or 200
indices=[0,100,200]
msm.select(molecular_system, 'group_index==@indices', element='group')

In [None]:
# Groups with name "ALA"
msm.select(molecular_system, 'group_name=="ALA"', element='group')

In [None]:
# Groups containing the atoms with index 34, 44 or 64
msm.select(molecular_system, 'atom_index==[34,44,64]', element='group')

In [None]:
# Groups in chains with id A or C that belong to molecules of any type but water
msm.select(molecular_system, 'chain_id==["A","C"] and molecule_type!="water"', element='group')

In [None]:
# Groups of molecules of type water
msm.select(molecular_system, 'molecule_type=="water"', element='group')

In [None]:
# Molecules of type water
msm.select(molecular_system, 'molecule_type=="water"', element='molecule')

In [None]:
# Chains with molecules of type water
msm.select(molecular_system, 'molecule_type=="water"', element='chain')

In [None]:
# Bonds in group index 5
msm.select(molecular_system, 'group_index==5', element='bond')

Finally, notice that `mask` always acts on the elements of the type given by `element`:

In [None]:
# Atoms with index from 0 to 4, restricted by the mask to atom indices 0 to 2
msm.select(molecular_system, 'atom_index in [0,1,2,3,4]', mask=[0,1,2], element='atom')

In [None]:
# Groups with index from 0 to 4, restricted by the mask to group indices 0 to 2
msm.select(molecular_system, 'group_index in [0,1,2,3,4]', mask=[0,1,2], element='group')
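
One way to picture the `element` argument is that the sentence is evaluated on the atoms table and the matching rows are then reported through the index column of the requested element. The sketch below follows that reading with plain pandas on a hypothetical table; it is an assumption for illustration and makes no claim about the internal implementation of `msm.select`.

```python
import pandas as pd

# Toy atoms table (hypothetical values, for illustration only)
atoms = pd.DataFrame({
    "atom_index": [0, 1, 2, 3, 4, 5],
    "atom_name": ["N", "CA", "N", "CA", "O", "O"],
    "group_index": [0, 0, 1, 1, 2, 3],
    "group_name": ["ALA", "ALA", "GLY", "GLY", "HOH", "HOH"],
})

matched_rows = atoms.query('group_name == ["ALA", "HOH"]')

# Reported as atoms: one index per matching row
atom_indices = matched_rows["atom_index"].tolist()

# Reported as groups: unique indices of the groups that own those atoms
group_indices = sorted(matched_rows["group_index"].unique().tolist())

print(atom_indices, group_indices)
```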

## Special selection tools

A selection of elements within a certain distance of a set of elements can be obtained using the string `within ... of`:

In [None]:
msm.select(molecular_system, 'chain_id=="A" within 0.3 nm of chain_id=="B"')

In [None]:
msm.select(molecular_system, 'chain_id=="A" not within 7.8 nanometers of chain_id=="B"')

In [None]:
msm.select(molecular_system, 'chain_id=="A" within 0.3 nm without pbc of chain_id=="B"')

In [None]:
msm.select(molecular_system, 'chain_id=="A" within 0.3 nm with pbc of chain_id=="B"')

In [None]:
msm.select(molecular_system, '(atom_name=="N" and chain_id=="A") within 3 angstroms of (atom_type=="O" and molecule_type=="water")')

In [None]:
msm.select(molecular_system, '(atom_name=="CA" and chain_id=="A") within 0.5 nm of (atom_name=="CA" and chain_id=="B")',
          element='group')
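
The geometric idea behind `within ... of` can be sketched with the standard library alone. The coordinates, index lists and the helper name `within` below are hypothetical. The sketch ignores periodic boundary conditions and uses a strict `<` comparison, which is an assumption rather than the documented behavior of MolSysMT.

```python
import math

# Hypothetical coordinates in nanometers, keyed by atom index
coords = {
    0: (0.0, 0.0, 0.0),
    1: (0.2, 0.0, 0.0),
    2: (1.0, 0.0, 0.0),
    3: (0.25, 0.1, 0.0),
}
chain_a = [0, 2]   # atoms matching the left-hand sentence
chain_b = [1, 3]   # atoms matching the right-hand sentence


def within(candidates, references, cutoff):
    """Return the candidates closer than `cutoff` to at least one reference atom."""
    return [
        i for i in candidates
        if any(math.dist(coords[i], coords[j]) < cutoff for j in references)
    ]


close = within(chain_a, chain_b, 0.3)          # 'within 0.3 nm of'
far = [i for i in chain_a if i not in close]   # 'not within 0.3 nm of'

print(close, far)
```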

Atoms bonded to specific atoms can also be selected with `bonded to`:

In [None]:
msm.select(molecular_system, 'atom_name=="N" bonded to atom_type=="C"')

In [None]:
msm.select(molecular_system, '(all not bonded to atom_type==["H","N","C","O"]) and molecule_type=="protein"')

In [None]:
msm.select(molecular_system, '(atom_type=="O" and chain_id=="A") bonded to (atom_type=="C" and chain_id=="A")')
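
The logic of `bonded to` can be illustrated with a plain list of bonds. The five-atom fragment and the helper names `atoms_of_type` and `bonded_to` are hypothetical and exist only in this sketch: a candidate atom is kept when it shares at least one bond with an atom from the second set.

```python
# Hypothetical five-atom fragment: atom types and bonds (pairs of atom indices)
atom_type = {0: "N", 1: "C", 2: "C", 3: "O", 4: "H"}
bonds = [(0, 1), (1, 2), (2, 3), (0, 4)]


def atoms_of_type(symbol):
    """Atom indices whose type equals `symbol`."""
    return [i for i, t in atom_type.items() if t == symbol]


def bonded_to(candidates, partners):
    """Candidates sharing at least one bond with any atom in `partners`."""
    candidates, partners = set(candidates), set(partners)
    found = set()
    for a, b in bonds:
        if a in candidates and b in partners:
            found.add(a)
        if b in candidates and a in partners:
            found.add(b)
    return sorted(found)


n_bonded_to_c = bonded_to(atoms_of_type("N"), atoms_of_type("C"))
h_bonded_to_c = bonded_to(atoms_of_type("H"), atoms_of_type("C"))
c_bonded_to_c = bonded_to(atoms_of_type("C"), atoms_of_type("C"))

print(n_bonded_to_c, h_bonded_to_c, c_bonded_to_c)
```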

Both `within ... of` and `bonded to` can be combined in the same selection sentence:

In [None]:
msm.select(molecular_system, '((atom_name=="N" and chain_id=="A") bonded to atom_type=="C") within 3 angstroms of (atom_type=="O" and molecule_type=="water")')

## Syntax translation

MolSysMT is designed to interact easily with other tools. The main goal of this library is to provide a set of pipes and joints to set up your workflows, keeping the integration of other tools simple. But different tools have different selection syntaxes. Learning the selection syntax of MDTraj, ParmEd or NGLView is very useful, since these are tools we all use frequently in our labs, but the rules of each tool are soon forgotten. To keep a single selection syntax in your projects, MolSysMT includes the input argument `to_syntax` in the method `molsysmt.select()`. Let's look at some examples:

In [None]:
msm.select(molecular_system, selection='group_index==[3,4,5]', to_syntax='NGLView')

In [None]:
msm.select(molecular_system, selection='group_index==[3,4,5]', to_syntax='MDTraj')

When the selection targets other elements, the output string can be expressed as a sequence of groups or chains:

In [None]:
msm.select(molecular_system, element='group', selection='group_index==[3,4,5]', to_syntax='NGLView')

In [None]:
msm.select(molecular_system, element='group', selection='group_index==[3,4,5]', to_syntax='MDTraj')

### Output syntax supported

MolSysMT translates selection sentences from its own native syntax to NGLView, MDTraj, PyTraj, ParmEd and AMBER.

## Using your favourite selection syntax

Already implemented; testing and documentation are still needed.