In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import molsysmt as msm



# Select

Selecting elements is probably the most frequent task when working with molecular systems. There are many circumstances in which we need the list of elements that satisfy a certain condition. We may need, for instance, to calculate the contact map between CA atoms from two chains, to remove the solvent atoms, or to know how many 'HIS' residues there are in a peptide. All these conditions can be expressed as a sentence that the queried elements need to match. Each library, MD engine or molecular visualization program has its own syntax for writing this sentence. You can see different examples in MDTraj, PyTraj, Amber, PyMOL or VMD.

## MolSysMT selection syntax

MolSysMT has its own selection syntax, based on the attributes of elements such as atoms, groups, molecules, etc. Let's load a molecular system to explain the logic behind this syntax:

In [3]:
file_path = msm.demo_systems.files['1tcd.mmtf']

In [4]:
molecular_system = msm.convert(file_path, to_form='molsysmt.MolSys')

In [5]:
molecular_system = msm.add_missing_hydrogens(molecular_system)

A molecular system encoded in the native form 'MolSys' stores a pandas DataFrame with the atom-by-atom breakdown:

In [6]:
molecular_system.topology.atoms_dataframe

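| | atom_index | atom_name | atom_id | atom_type | group_index | group_name | group_id | group_type | component_index | component_name | ... | chain_id | chain_type | molecule_index | molecule_name | molecule_id | molecule_type | entity_index | entity_name | entity_id | entity_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | N | 1 | N | 0 | LYS | 4 | aminoacid | 0 | | ... | A | | 0 | | | protein | 0 | Protein_0 | | protein |
| 1 | 1 | H | 2 | H | 0 | LYS | 4 | aminoacid | 0 | | ... | A | | 0 | | | protein | 0 | Protein_0 | | protein |
| 2 | 2 | H2 | 3 | H | 0 | LYS | 4 | aminoacid | 0 | | ... | A | | 0 | | | protein | 0 | Protein_0 | | protein |
| 3 | 3 | H3 | 4 | H | 0 | LYS | 4 | aminoacid | 0 | | ... | A | | 0 | | | protein | 0 | Protein_0 | | protein |
| 4 | 4 | CA | 5 | C | 0 | LYS | 4 | aminoacid | 0 | | ... | A | | 0 | | | protein | 0 | Protein_0 | | protein |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8197 | 8197 | H1 | 8198 | H | 660 | HOH | 342 | water | 165 | | ... | D | | 165 | | | water | 2 | water | | water |
| 8198 | 8198 | H2 | 8199 | H | 660 | HOH | 342 | water | 165 | | ... | D | | 165 | | | water | 2 | water | | water |
| 8199 | 8199 | O | 8200 | O | 661 | HOH | 343 | water | 166 | | ... | D | | 166 | | | water | 2 | water | | water |
| 8200 | 8200 | H1 | 8201 | H | 661 | HOH | 343 | water | 166 | | ... | D | | 166 | | | water | 2 | water | | water |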


As you can see, the column names are the fundamental attributes of the molecular system elements:

In [7]:
print(molecular_system.topology.atoms_dataframe.columns)

Index(['atom_index', 'atom_name', 'atom_id', 'atom_type', 'group_index',
       'group_name', 'group_id', 'group_type', 'component_index',
       'component_name', 'component_id', 'component_type', 'chain_index',
       'chain_name', 'chain_id', 'chain_type', 'molecule_index',
       'molecule_name', 'molecule_id', 'molecule_type', 'entity_index',
       'entity_name', 'entity_id', 'entity_type'],
      dtype='object')


The MolSysMT selection procedure is built on the syntax that pandas offers for querying a pandas.DataFrame. The boolean syntax of pandas includes the following words and symbols:

<center>

| Word | Symbol | Meaning |
|---|---|---|
| and | & | and |
| or | \| | or |
| not | ~ | not |
| in | | in |
|  | == | equal |
|  | != | not equal |
|  | < | less than |
|  | <= | less than or equal to |
|  | > | greater than |
|  | >= | greater than or equal to |

</center>

As such, the selection sentence can also reference external lists. Let's see some simple examples.
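As a warm-up, the words and symbols in the table can be tried directly on a small pandas DataFrame. The sketch below is only an illustration: the table is a made-up toy with a few of the column names shown above, not data read from 1tcd, and it calls `DataFrame.query` directly rather than going through MolSysMT.

```python
import pandas as pd

# Toy table (invented for illustration) with a few atoms_dataframe-like columns.
atoms = pd.DataFrame({
    "atom_index": [0, 1, 2, 3, 4],
    "atom_name": ["N", "H", "CA", "C", "O"],
    "atom_type": ["N", "H", "C", "C", "O"],
    "group_name": ["LYS", "LYS", "LYS", "LYS", "HOH"],
})

# Words and symbols are interchangeable inside the query string.
with_words = atoms.query('atom_type == "C" and not atom_name == "CA"')
with_symbols = atoms.query('(atom_type == "C") & ~(atom_name == "CA")')

# 'in' tests membership in a list.
backbone = atoms.query('atom_name in ["N", "CA", "C"]')

print(with_words["atom_index"].tolist())    # [3]
print(with_symbols["atom_index"].tolist())  # [3]
print(backbone["atom_index"].tolist())      # [0, 2, 3]
```

Both spellings of the first query give the same rows, which is why the selection sentences in this notebook can use either form.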

## Target selections with atom indices

In [8]:
msm.select(molecular_system, selection=[0,1,2])

array([0, 1, 2])

In [9]:
msm.select(molecular_system, target='group', selection=[0,1,2,3,4,5,6,7,8,9,10,11])

array([0])

In [10]:
msm.select(molecular_system, target='molecule', selection=[3900, 3910, 3920])

array([1])
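Conceptually, asking for a `target` other than atoms maps each selected atom to the element that contains it and keeps the unique results. The following is a minimal sketch of that idea in plain Python. The lookup list `atom_to_group` and the helper `groups_of_atoms` are hypothetical names invented here, and this is not how MolSysMT implements it internally.

```python
# Hypothetical atom -> group lookup for a toy system (not read from 1tcd):
# atoms 0-3 belong to group 0, atoms 4-6 to group 1, atoms 7-8 to group 2.
atom_to_group = [0, 0, 0, 0, 1, 1, 1, 2, 2]


def groups_of_atoms(atom_indices, atom_to_group):
    """Return the sorted, unique group indices that contain the given atoms."""
    return sorted({atom_to_group[i] for i in atom_indices})


print(groups_of_atoms([0, 1, 2, 3], atom_to_group))  # [0]
print(groups_of_atoms([2, 5, 8], atom_to_group))     # [0, 1, 2]
```

Several atoms of the same group collapse into a single group index, which matches the shape of the outputs above.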

## Simple atom selections by attributes or properties
The following examples return the list of atoms that match some selection criteria:

In [11]:
# Atoms with name C
msm.select(molecular_system, 'atom_name == "C"')

array([   6,   27,   42, ..., 7664, 7674, 7688])

In [12]:
# Atoms with name CA or CB
msm.select(molecular_system, 'atom_name in ["CA","CB"]')

array([   4,    8,   25, ..., 7676, 7686, 7690])

In [13]:
# Atoms of type C or N
msm.select(molecular_system, 'atom_type==["C","N"]')

array([   0,    4,    6, ..., 7696, 7699, 7702])

In [14]:
# Heavy atoms
msm.select(molecular_system, 'not atom_type=="H"')

array([   0,    4,    6, ..., 8193, 8196, 8199])

In [15]:
# Atoms of type C not named CA
msm.select(molecular_system, 'atom_type=="C" and not atom_name=="CA"')

array([   6,    8,   11, ..., 7693, 7696, 7699])

In [16]:
# Atoms not named CA, CB or C
msm.select(molecular_system, 'atom_name!=["CA","CB","C"]')

array([   0,    1,    2, ..., 8199, 8200, 8201])

In [17]:
# Atoms with id number lower than 10
msm.select(molecular_system, 'atom_id<10')

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [18]:
# Atoms with id number lower than 10 and greater than or equal to 3
msm.select(molecular_system, 'atom_id<10 and atom_id>=3')

array([2, 3, 4, 5, 6, 7, 8])

## Including other elements attributes

Atoms can be selected using attributes of the other elements in the hierarchical organization of the molecular system: 'group', 'component', 'molecule', 'chain', 'entity' or 'bioassembly'. You can find further information about these elements in XXX. These are some examples of selection sentences that include criteria other than atom attributes:

In [19]:
# Atoms belonging to molecules of type water.
msm.select(molecular_system, 'molecule_type=="water"')

array([7707, 7708, 7709, ..., 8199, 8200, 8201])

In [20]:
# Heavy atoms belonging to molecules of type protein.
msm.select(molecular_system, 'molecule_type=="protein" and atom_type!="H"')

array([   0,    4,    6, ..., 7699, 7702, 7706])

In [21]:
# Atoms belonging to residues named GLY, ALA or VAL in chain id A.
msm.select(molecular_system, 'group_name==["GLY","ALA","VAL"] and chain_id=="A"') 

array([  88,   89,   90, ..., 3808, 3809, 3810])

## Including external variables

The pandas query method allows the use of external variables in the logical sentence. To include them, variable names have to be preceded by the character '@'. Let's illustrate their use with some examples:

In [22]:
# Atoms in groups with indices 10, 11 or 12.
indices=[10,11,12]
msm.select(molecular_system, 'group_index==@indices')

array([156, 157, 158, ..., 200, 201, 202])

In [23]:
# Atoms named CA, C, O or N with atom indices 10 to 29.
indices=list(range(10,30))
atoms=["CA", "C", "O", "N"]
msm.select(molecular_system, 'atom_name==@atoms & atom_index==@indices') 

array([24, 25, 27, 28])
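The `@` mechanism comes from pandas itself, so it can be checked in isolation. The sketch below uses a made-up toy table and a hypothetical helper, `select_atoms`, to show that `@name` resolves to a Python variable visible where `query` is called. It does not go through MolSysMT.

```python
import pandas as pd

# Toy table (invented for illustration).
atoms = pd.DataFrame({
    "atom_index": [0, 1, 2, 3, 4, 5],
    "atom_name": ["N", "CA", "C", "N", "CA", "C"],
    "group_index": [10, 10, 10, 11, 11, 11],
})


def select_atoms(table, groups, names):
    """Return atom indices whose group is in `groups` and name is in `names`."""
    # '@groups' and '@names' refer to the local variables of this function.
    hits = table.query("group_index == @groups and atom_name in @names")
    return hits["atom_index"].tolist()


print(select_atoms(atoms, [11], ["CA", "C"]))  # [4, 5]
```

Comparing a column to a list with `==` behaves like `in`, which is the same convention used in the selection sentences above.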

## Including mask filters

Although including masks is not really necessary, `molsysmt.select()` has an optional input argument to do so:

In [24]:
# Atoms named C with atom index in range 10 to 29
indices=list(range(10,30))
msm.select(molecular_system, 'atom_name=="C"', mask=indices)

array([27])

The use of a mask can always be avoided by extending the logical sentence:

In [25]:
# Atoms named C with atom index in range 10 to 29
indices=list(range(10,30))
msm.select(molecular_system, 'atom_name=="C" and atom_index in @indices')

array([27])
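One way to picture the equivalence is to treat the mask as a set of allowed indices that is intersected with the result of the selection sentence. This is a sketch under that assumption, with a hypothetical helper `apply_mask`. The list `named_c` holds only the first three indices printed earlier for atoms named C.

```python
def apply_mask(selected, mask=None):
    """Keep only the selected indices that also appear in the mask."""
    if mask is None:
        return sorted(selected)
    allowed = set(mask)
    return sorted(i for i in selected if i in allowed)


# First three indices of atoms named C, taken from the output shown earlier.
named_c = [6, 27, 42]

print(apply_mask(named_c, mask=range(10, 30)))  # [27]
print(apply_mask(named_c))                      # [6, 27, 42]
```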

## Selection of other elements

The selection method of MolSysMT can also return the indices of elements other than atoms. Like many methods in this library, `molsysmt.select()` has an input argument named `target` that sets the kind of element the output indices refer to. Let's see some examples:

In [26]:
# Groups with indices equal to 0, 100 or 200
indices=[0,100,200]
msm.select(molecular_system, 'group_index==@indices', target='group')

array([  0, 100, 200])

In [27]:
# Groups with name "ALA"
msm.select(molecular_system, 'group_name=="ALA"', target='group')

array([  5,   6,   7, ..., 465, 482, 494])

In [28]:
# Groups containing the atoms with index 34, 44 or 64
msm.select(molecular_system, 'atom_index==[34,44,64]', target='group')

array([1, 2, 3])

In [29]:
# Groups belonging to chain id A or C and molecule of type anything but water
msm.select(molecular_system, 'chain_id==["A","C"] and molecule_type!="water"', target='group')

array([  0,   1,   2, ..., 245, 246, 247])

In [30]:
# Groups of molecules of type water
msm.select(molecular_system, 'molecule_type=="water"', target='group')

array([497, 498, 499, ..., 659, 660, 661])

In [31]:
# Molecules of type water
msm.select(molecular_system, 'molecule_type=="water"', target='molecule')

array([  2,   3,   4, ..., 164, 165, 166])

In [32]:
# Chains with molecules of type water
msm.select(molecular_system, 'molecule_type=="water"', target='chain')

array([2, 3])

In [33]:
# Bonds in group index 5
msm.select(molecular_system, 'group_index==5', target='bond')

array([  48,   49,   50,   51,   52, 4256, 4257, 4258, 4259])

Finally, notice that `mask` always acts on the targeted elements:

In [34]:
# Atoms with index from 0 to 4, masked to atom indices 0 to 2
msm.select(molecular_system, 'atom_index in [0,1,2,3,4]', mask=[0,1,2], target='atom')

array([0, 1, 2])

In [35]:
# Groups with index from 0 to 4, masked to group indices 0 to 2
msm.select(molecular_system, 'group_index in [0,1,2,3,4]', mask=[0,1,2], target='group')

array([0, 1, 2])

In [36]:
# Molecules with index from 0 to 4, masked to molecule indices 0 to 2
msm.select(molecular_system, 'molecule_index in [0,1,2,3,4]', mask=[0,1,2], target='molecule')

array([0, 1, 2])

## Special selection tools

A selection of elements within a certain distance of a set of elements can be obtained using the string `within ... of`:

In [37]:
msm.select(molecular_system, 'chain_id=="A" within 0.2 nm of chain_id=="B"')

array([ 179,  183, 1120, 1478])

In [38]:
msm.select(molecular_system, 'chain_id=="A" not within 0.2 nanometers of chain_id=="B"')

array([   0,    1,    2, ..., 3845, 3846, 3847])

In [39]:
msm.select(molecular_system, 'chain_id=="A" within 0.2 nm without pbc of chain_id=="B"')

array([ 179,  183, 1120, 1478])

In [40]:
msm.select(molecular_system, 'chain_id=="A" within 0.2 nm with pbc of chain_id=="B"')

array([ 179,  183, 1120, 1478])

In [41]:
msm.select(molecular_system, '(atom_name=="N" and chain_id=="A") within 3 angstroms of (atom_type=="O" and molecule_type=="water")')

array([ 236,  435,  959, 1075, 1323, 1409, 1604, 2095])

In [42]:
msm.select(molecular_system, '(atom_name=="CA" and chain_id=="A") within 0.5 nm of (atom_name=="CA" and chain_id=="B")',
          target='group')

array([10, 42, 62, 72, 73])
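The logic of a `within ... of` sentence can be sketched as a brute-force distance test: keep every candidate atom that lies within the cutoff of at least one reference atom. The code below is only an illustration of that idea. The coordinates are invented, the helper `within` is hypothetical, periodic boundary conditions are ignored, and whether MolSysMT treats the cutoff as inclusive is an assumption made here.

```python
import math


def within(candidates, references, coords, cutoff):
    """Return candidates at a distance <= cutoff from any reference atom."""
    return [
        i for i in candidates
        if any(math.dist(coords[i], coords[j]) <= cutoff for j in references)
    ]


# Toy coordinates in nanometers (invented for illustration).
coords = {
    0: (0.0, 0.0, 0.0),
    1: (1.0, 0.0, 0.0),
    2: (0.15, 0.0, 0.0),
    3: (1.5, 0.0, 0.0),
}
chain_a = [0, 1]
chain_b = [2, 3]

near = within(chain_a, chain_b, coords, cutoff=0.2)
far = [i for i in chain_a if i not in near]  # the "not within" complement

print(near)  # [0]
print(far)   # [1]
```

A real implementation would typically use a neighbor search instead of testing every pair, but the selected set is the same.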

Atoms bonded to specific atoms can also be selected with `bonded to`:

In [43]:
msm.select(molecular_system, 'atom_name=="N" bonded to atom_type=="H"')

array([0, 38, 69, ..., 7660, 7670, 7684], dtype=object)

In [44]:
msm.select(molecular_system, 'all not bonded to atom_type==["H","N","C","O"]')

array([ 188,  553, 1778, 1904, 4047, 4412, 5637, 5763])

In [45]:
msm.select(molecular_system, '(atom_type=="O" and chain_id=="A") bonded to (atom_type=="H" and chain_id=="A")')

array([219, 245, 376, ..., 3425, 3625, 3819], dtype=object)
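A `bonded to` sentence can be pictured as a filter over the bond list: a candidate atom is kept when it shares at least one bond with an atom of the second selection. The sketch below assumes bonds are stored as pairs of atom indices. The helper `bonded_to` and the toy bond list are hypothetical and are not MolSysMT internals.

```python
def bonded_to(candidates, partners, bonds):
    """Return candidates that share at least one bond with a partner atom."""
    candidate_set = set(candidates)
    partner_set = set(partners)
    hits = set()
    for a, b in bonds:
        # A bond is undirected, so check both orientations.
        if a in candidate_set and b in partner_set:
            hits.add(a)
        if b in candidate_set and a in partner_set:
            hits.add(b)
    return sorted(hits)


# Toy system: N atom 0 is bonded to H atoms 1 and 2; N atom 3 only to C atom 4.
bonds = [(0, 1), (0, 2), (3, 4), (4, 0)]
nitrogens = [0, 3]
hydrogens = [1, 2]

with_h = bonded_to(nitrogens, hydrogens, bonds)
without_h = [i for i in nitrogens if i not in with_h]  # "not bonded to"

print(with_h)     # [0]
print(without_h)  # [3]
```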

Both `within ... of` and `bonded to` can be combined in the same selection sentence:

In [46]:
msm.select(molecular_system, '((atom_name=="N" and chain_id=="A") bonded to atom_type=="H") within 5 angstroms of (atom_type=="O" and molecule_type=="water")')

array([  38,  132,  156, ..., 3627, 3646, 3682])

## Syntax translation

MolSysMT is designed to interact easily with other tools. The main goal of this library is to provide a set of pipes and joins for setting up your workflows while keeping the integration of other tools simple. But different tools have different selection syntaxes. Learning the selection syntax of MDTraj, ParmEd or NGLview is very useful, since these are tools we all use frequently in our labs, but the rules of each tool are soon forgotten. To keep a single selection syntax in your projects, MolSysMT includes the input argument `to_syntaxis` in the method `molsysmt.select()`. Let's illustrate it with some examples:

In [47]:
msm.select(molecular_system, selection='group_index==[3,4,5]', to_syntaxis='NGLView')

'@55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97'

In [48]:
msm.select(molecular_system, selection='group_index==[3,4,5]', to_syntaxis='MDTraj')

'index 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97'
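For atom targets, the two strings above have a simple shape: the selected atom indices joined with commas after `@` for NGLView, and joined with spaces after `index` for MDTraj. The sketch below reproduces only that formatting step. The helper names `to_nglview` and `to_mdtraj` are hypothetical, and MolSysMT's own translation code may differ.

```python
def to_nglview(atom_indices):
    """Format atom indices as an NGLView-style '@i,j,k' string."""
    return "@" + ",".join(str(i) for i in atom_indices)


def to_mdtraj(atom_indices):
    """Format atom indices as an MDTraj-style 'index i j k' string."""
    return "index " + " ".join(str(i) for i in atom_indices)


atom_indices = [55, 56, 57]
print(to_nglview(atom_indices))  # @55,56,57
print(to_mdtraj(atom_indices))   # index 55 56 57
```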

If the selection targets other elements, the output string is expressed as a sequence of groups or chains:

In [49]:
msm.select(molecular_system, target='group', selection='group_index==[3,4,5]', to_syntaxis='NGLView')

'7:A 8:A 9:A'

In [50]:
msm.select(molecular_system, target='group', selection='group_index==[3,4,5]', to_syntaxis='MDTraj')

'resid 3 4 5'

### Supported output syntaxes

MolSysMT translates selection sentences from its own native syntax to NGLview, MDTraj, Pytraj, ParmEd and AMBER.

## Using your favourite selection syntax

Already implemented; testing and documentation are still needed.