Python molecular modeller (with .pdb/.cif and .xyz parsing and production)
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
.github
atomium
docs
tests
.gitignore
.travis.yml
LICENSE
README.rst
requirements.txt
setup.py

README.rst

travis coveralls pypi

atomium

atomium is a molecular modeller and file parser.

Example

>>> import atomium
>>> pdb = atomium.fetch("5HVD")
>>> pdb.model
<Model (2156 atoms)>
>>> pdb.model.chain("A")
<Chain (204 residues)>

Installing

pip

atomium can be installed using pip:

$ pip3 install atomium

atomium is written for Python 3, and does not support Python 2. It currently requires Python 3.5 or above.

If you get permission errors, try using sudo:

$ sudo pip3 install atomium

Development

The repository for atomium, containing the most recent iteration, can be found here. To clone the atomium repository directly from there, use:

$ git clone git://github.com/samirelanduk/atomium.git

Requirements

atomium requires the Python library NumPy (for now), paramiko and requests - pip will install these automatically when it installs atomium.

Overview

atomium is a Python library for opening and saving .pdb, .cif and .xyz files, and presenting and manipulating the information contained within.

Loading Data

Whist you can use atomium to create models from scratch to build an entirely de novo structure, in practice you would generally use it to load molecular data from an existing file...

>>> import atomium
>>> pdb1 = atomium.open('../1LOL.pdb')
>>> xyz1 = atomium.open('/structures/glucose.xyz')
>>> mmcif1 = atomium.open('/structures/1XDA.cif')
>>> pdb2 = atomium.fetch('5HVD')
>>> mmcif2 = atomium.fetch('5HVD.cif')

In that latter case, you don't need the file to be saved locally - it will just go and grab the PDB with that code from the RCSB.

atomium will use the file extension you provide to decide how to parse it. If there isn't one, or it doesn't recognise the extension, it will peek at the file contents and try and guess whether it should be interpreted as .pdb, .cif or .xyz.

The rest of this guide will focus on .pdb and .cif files. .xyz files are very simple structures, and the only annotation they really contain is a .title.

Using Data

Once you've got your File object, what can you do with it?

Annotation

There is various meta information contained within the File object.

>>> pdb1.title
'CRYSTAL STRUCTURE OF OROTIDINE MONOPHOSPHATE DECARBOXYLASE COMPLEX WITH XMP'
>>> pdb1.deposition_date
datetime.date(2002, 5, 6)
>>> pdb1.keywords
['TIM BARREL', 'LYASE']
>>> pdb1.classification
'LYASE'
>>> pdb1.source_organism
'METHANOTHERMOBACTER THERMAUTOTROPHICUS STR. DELTA H'
>>> pdb1.resolution
1.9
>>> pdb1.rfactor
0.193
>>> pdb1.rfree
0.229

atomium doesn't currently parse every bit of information from .pdb and .cif files, but there is more than those shown above. See the full API docs for more details.

Models and Assembly

All .pdb files contain one or more models - little universes containing a molecular scene.

>>> pdb1.model
<Model (3431 atoms)>
>>> pdb1.models
(<Model (3431 atoms)>,)

Most just contain one - it's generally those that come from NMR experiments which contain multiple models.

This model contains the 'asymmetric unit' - this is one or more protein (usually) chains arranged in space, which may not be how the molecule arranges itself in real life. It might just be how they arranged themselves in the experiment. To create the 'real thing' from the asymmetric unit, you use biological assemblies.

Most .pdb files contain one or more biological assemblies - instructions for how to create a more realistic structure from the chains present, which in atomium are accessed using File.biomolecules.

In practice, what you need to know is that you can create a new model - not the one already there containing the asymmetric unit - as follows...

>>> pdb3 = atomium.fetch('1XDA')
>>> pdb3.model
<Model (1842 atoms)>
>>> pdb3.generate_assembly(1)
<Model (924 atoms)>
>>> pdb3.generate_assembly(10)
<Model (2730 atoms)>
>>> pdb3.generate_best_assembly()
<Model (5550 atoms)>

Here you load a .pdb with multiple possible assemblies, have a quick look at the asymmetric unit with 1,842 atoms, generate two of its possible biological assemblies by passing in their IDs, and then generate the 'best' of the assemblies, which is the one with the lowest (that is, most negative) delta free energy change as described in the .pdb file. In this case it is a hexameric formation.

Model Contents

The basic structures within a model are chains, residues, ligands, and atoms.

>>> pdb1.model.chains()
{<Chain (B, 1748 atoms)>, <Chain (A, 1683 atoms)>}
>>> pdb1.model.chain('B')
<Chain (B, 1748 atoms)>
>>> pdb1.model.residues(name='TYR')
{<Residue TYR (A:206, 12 atoms)>, <Residue TYR (A:45, 12 atoms)>, <Residue T
    YR (A:37, 12 atoms)>, <Residue TYR (B:1154, 12 atoms)>, <Residue TYR (B:1206
    , 12 atoms)>, <Residue TYR (A:154, 12 atoms)>, <Residue TYR (B:1045, 12 atom
    s)>, <Residue TYR (B:1037, 12 atoms)>}
>>> pdb1.model.residues(name_regex='TYR|PRO')
{<Residue PRO (B:1046, 7 atoms)>, <Residue TYR (A:37, 12 atoms)>, <Residue P
    RO (A:157, 7 atoms)>, <Residue TYR (B:1206, 12 atoms)>, <Residue PRO (B:1228
    , 7 atoms)>, <Residue PRO (A:211, 7 atoms)>, <Residue PRO (B:1077, 7 atoms)>
    , <Residue PRO (B:1129, 7 atoms)>, <Residue TYR (A:45, 12 atoms)>, <Residue
    TYR (A:154, 12 atoms)>, <Residue PRO (A:180, 7 atoms)>, <Residue PRO (B:1157
    , 7 atoms)>, <Residue TYR (B:1037, 12 atoms)>, <Residue TYR (A:206, 12 atoms
    )>, <Residue PRO (B:1189, 7 atoms)>, <Residue PRO (A:161, 7 atoms)>, <Residu
    e PRO (A:101, 7 atoms)>, <Residue PRO (A46, 7 atoms)>, <Residue TYR (B1045,
    12 atoms)>, <Residue PRO (A:77, 7 atoms)>, <Residue PRO (A:129, 7 atoms)>, <
    Residue PRO (B:1211, 7 atoms)>, <Residue TYR (B1154, 12 atoms)>, <Residue PR
    O (B1180, 7 atoms)>, <Residue PRO (B:1101, 7 atoms)>, <Residue PRO (B:1161,
    7 atoms)>}
>>> pdb1.model.chain('B').residue('B:1206')
<Residue TYR (B:1206, 12 atoms)>
>>> pdb1.model.ligands(water=False)
{<Ligand XMP (B:2002, 24 atoms)>, <Ligand BU2 (A:5001, 6 atoms)>, <Ligand XM
    P (A:2001, 24 atoms)>, <Ligand BU2 (B:5002, 6 atoms)>}
>>> pdb1.model.ligand(name='BU2').atoms()
{<C (C1) Atom 3194 at (2.646, 45.112, 48.995)>, <C (C4) Atom 3199 at (-0.456
, 44.629, 51.162)>, <C (C3) Atom 3197 at (0.706, 44.197, 50.309)>, <O (O3) A
tom 3198 at (1.101, 42.889, 50.701)>, <O (O1) Atom 3195 at (1.781, 45.484, 4
7.929)>, <C (C2) Atom 3196 at (1.922, 45.088, 50.288)>}
>>> pdb1.model.ligand(name='BU2').atoms(mass__gt=12)
{<C (C4) Atom 3199 at (-0.456, 44.629, 51.162)>, <O (O3) Atom 3198 at (1.101
, 42.889, 50.701)>, <C (C2) Atom 3196 at (1.922, 45.088, 50.288)>, <C (C1) A
tom 3194 at (2.646, 45.112, 48.995)>, <C (C3) Atom 3197 at (0.706, 44.197, 5
0.309)>, <O (O1) Atom 3195 at (1.781, 45.484, 47.929)>}
>>> pdb1.model.ligand(name='BU2').atoms(mass__gt=14)
{<O (O3) Atom 3198 at (1.101, 42.889, 50.701)>, <O (O1) Atom 3195 at (1.781,
 45.484, 47.929)>}

The examples above demonstrate atomium's selection language. In the case of the molecules - Model, Chain, Residue and Ligand - you can pass in an id or name, or search by regex pattern with id_regex or name_regex.

Atoms have an even more powerful syntax. You can pass in any property of atoms such as charge=1, any comparitor of a property such as mass__lt=100, or any regex of a property such as name_regex='[^C]'.

For pairwise comparisons, structures also have the AtomStructure.pairwise_atoms generator which will yield all unique atom pairs in the structure. These can obviously get very big indeed - a 5000 atom PDB file would have about 12 million unique pairs.

Structures can be moved around and otherwise compared with each other...

>>> pdb1.model.ligand(id='B:2002').mass
351.1022
>>> pdb1.model.ligand(id='B:2002').formula
Counter({'C': 10, 'O': 9, 'N': 4, 'P': 1})
>>> pdb1.model.ligand(id='B:2002').nearby_atoms(2.8)
{<O (O) Atom 3377 at (-24.077, 59.423, 53.919)>, <O (O) Atom 3418 at (-14.53
5, 62.938, 57.757)>, <O (OD1) Atom 1636 at (-22.92, 57.72, 52.315)>}
>>> pdb1.model.ligand(id='B:2002').nearby_atoms(2.8, name='OD1')
{<O (OD1) Atom 1636 at (-22.92, 57.72, 52.315)>}
>>> pdb1.model.ligand(id='B:2002').nearby_residues(2.8)
{<Residue ASP (B1020, 8 atoms)>}
>>> pdb1.model.ligand(id='B:2002').nearby_residues(2.8, ligands=True)
{<Ligand HOH (B3155, 1 atom)>, <Ligand HOH (B3059, 1 atom)>, <Residue ASP (B
1020, 8 atoms)>}
>>> import math
>>> pdb1.model.ligand(id='B:2002').rotate(math.pi / 2, 'x')
>>> pdb1.model.ligand(id='B:2002').translate(10, 10, 15)
>>> pdb1.model.ligand(id='B:2002').center_of_mass
(-9.886734282781484, -42.558415679537184, 77.33400578435568)
>>> pdb1.model.ligand(id='B:2002').radius_of_gyration
3.6633506511540825
>>> pdb1.model.ligand(id='B:2002').rmsd_with(pdb1.model.ligand(id='A2001'))
90.55588214099254
>>> other_ligand = pdb1.model.ligand(id='A2001')
>>> pdb1.model.ligand(id='B:2002').rmsd_with(other_ligand)
90.55588214099254
>>> pdb1.model.ligand(id='B:2002').rmsd_with(other_ligand, superimpose=True)
0.13325557235580035

Here we look at one of the ligands, identify its mass and molecular formula, look at what atoms are within 2.8 Angstroms of it, and what residues are within that same distance, rotate it and translate it through space, see where its new center of mass is, and then finally get its RMSD with the other similar ligand in the model - first using their locations 'as is', and then by seeing what the RMSD would be if they were superimposed in such a way as to minimise RMSD.

The Atom objects themselves have their own useful properties.

>>> pdb1.model.atom(97)
<C (CA) Atom 97 at (-12.739, 31.201, 43.016)>
>>> pdb1.model.atom(97).mass
12.0107
>>> pdb1.model.atom(97).anisotropy
[0, 0, 0, 0, 0, 0]
>>> pdb1.model.atom(97).bfactor
24.87
>>> pdb1.model.atom(97).location
(-12.739, 31.201, 43.016)
>>> pdb1.model.atom(97).distance_to(pdb1.model.atom(1))
26.18289982030257
>>> pdb1.model.atom(97).bonded_atoms
{<N (N) Atom 96 at (-11.649, 32.148, 42.889)>, <C (C) Atom 98 at (-12.515, 3
0.319, 44.247)>, <C (CB) Atom 100 at (-12.897, 30.387, 41.732)>}
>>> pdb1.model.atom(97).nearby_atoms(2)
{<N (N) Atom 96 at (-11.649, 32.148, 42.889)>, <C (C) Atom 98 at (-12.515, 3
0.319, 44.247)>, <C (CB) Atom 100 at (-12.897, 30.387, 41.732)>}
>>> pdb1.model.atom(97).is_metal
False
>>> pdb1.model.atom(97).residue
<Residue ASN (A:23, 8 atoms)>
>>> pdb1.model.atom(97).chain
<Chain (A, 1683 atoms)>

Chains are a bit different from other structures in that they are iterable, indexable, and return their residues as a tuple, not a set...

>>> pdb1.model.atom(97).chain
<Chain (A, 1683 atoms)>
>>> pdb1.model.chain('A')
<Chain (A, 1683 atoms)>
>>> len(pdb1.model.chain('A'))
204
>>> pdb1.model.chain('A')[10]
<Residue LEU (A21, 8 atoms)>
>>> pdb1.model.chain('A').residues()[:5]
(<Residue VAL (A:11, 7 atoms)>, <Residue MET (A:12, 8 atoms)>, <Residue ASN
    (A:13, 8 atoms)>, <Residue ARG (A:14, 11 atoms)>, <Residue LEU (A:15, 8 atom
    s)>)
>>> pdb1.model.chain('A').sequence
'VMNRLILAMDLMNRDDALRVTGEVREYIDTVKIGYPLVLSEGMDIIAEFRKRFGCRIIADFKVADIPETNEKICR
ATFKAGADAIIVHGFPGADSVRACLNVAEEMGREVFLLTEMSHPGAEMFIQGAADEIARMGVDLGVKNYVGPSTRP
ERLSRLREIIGQDSFLISPGGETLRFADAIIVGRSIYLADNPAAAAAGIIESI'
>>> pdb1.model.chain('A').rep_sequence
'LRSRRVDVMDVMNRLILAMDLMNRDDALRVTGEVREYIDTVKIGYPLVLSEGMDIIAEFRKRFGCRIIADFKVAD
IPETNEKICRATFKAGADAIIVHGFPGADSVRACLNVAEEMGREVFLLTEMSHPGAEMFIQGAADEIARMGVDLGV
KNYVGPSTRPERLSRLREIIGQDSFLISPGVGAQGGDPGETLRFADAIIVGRSIYLADNPAAAAAGIIESIKDLLI
PE'

In those latter two cases, two different sequences are returned. The first just returns the sequence of residues actually present in the model, whereas the second is the 'real' sequence that exists in nature. Some of them will be missing from the model for practical reasons.

Residues can generate name information based on their three letter code, and are aware of their immediate neighbors.

>>> pdb1.model.residue('A:100')
<Residue PHE (A100, 11 atoms)>
>>> pdb1.model.residue('A:100').name
'PHE'
>>> pdb1.model.residue('A:100').code
'F'
>>> pdb1.model.residue('A:100').full_name
'phenylalanine'
>>> pdb1.model.residue('A:100').next
<Residue PRO (A:101, 7 atoms)>
>>> pdb1.model.residue('A100').previous
<Residue GLY (A:99, 4 atoms)>

Saving Data

A model can be saved to file using:

>>> model.save("new.xyz")
>>> model.save("new.pdb")

Any structure can be saved in this way, so you can save chains or molecules to their own seperate files if you so wish.

(Currently atomium doesn't support .cif saving, but that is planned for a future release.)

>>> model.chain("A").save("chainA.pdb")
>>> model.chain("B").save("chainB.pdb")
>>> model.ligand(name="XMP").save("ligand.xyz")

The File object itself can also be saved:

>>> pdb.title = "Modified PDB"
>>> pdb.save("new.pdb")

Note that if the model you are saving is one from a biological assembly, it will likely have many duplicated IDs, so saving to file may create unexpected results.

Changelog

Release 0.11.1

13 September 2018

  • Fixed bug pertaining to residues with ID 0.
  • Fixed bug pertaining to SEQRES parsing when chain ID is numeric.
  • Changed format of residue IDs to include colon.
  • Considerable speed improvements in .mmcif parsing.

Release 0.11.0

22 August 2018

  • Added .mmcif parsing.
  • Changed how parsing in general is done under the hood.
  • Added atom angle calculation.
  • Fixed bug where modified residues were treated as ligands if authors used HETATM records.

Release 0.10.2

29 July 2018

  • Added function for getting PDBs over SSH.
  • Fixed biological assembly parsing bug.
  • Fixed chain copying of sequence bug.

Release 0.10.1

25 June 2018

  • Added function for returning best biological assembly.
  • Fixed bug with sorting None energy assemblies.
  • Fixed bug pertaining to excessive atom duplication when creating assembly.

Release 0.10.0

22 June 2018

  • Parsing of .pdb keywords.
  • Parsing of atom anisotropy.
  • Parsing of .pdb sequence information.
  • More R-factor information.
  • Biological assembly parsing and generation.
  • More powerful transformations rather than just simple rotation.
  • Backend simplifications.
  • Powerful new atom querying syntax.

Release 0.9.1

17 May 2018

  • Added Residue one-letter codes.
  • Fixed stray print statement.

Release 0.9.0

10 April 2018

  • Turned many methods into properties.
  • Added full residue name generation.
  • Made bind site detection more picky.
  • Added coordinate rounding to deal with floating point rounding errors.
  • Atomic structures now 'copy'able.
  • Refactored atom querying.
  • Added grid generation.
  • Implemented Kabsch superposition/rotation.
  • Implemented RMSD comparison.
  • Created Complex class (for later).

Release 0.8.0

2 December 2017

  • Added option to get water residues in binding sites.

  • Added extra PDB meta information parsing, such as:

    • Classification
    • Experimental Technique
    • Source Organism
    • Expression Organism
    • R-factor

Release 0.7.0

2 November 2017

  • PDBs with multiple occupancy can now be parsed correctly.
  • Added pairwise atom generator.
  • PDB parser now extracts resolution.
  • Further speed increased to PDB parser.
  • Miscellaneous bug fixes.
  • Implemented Continuous Integration.

Release 0.6.0

3 October 2017

  • Now allows for fetching and opening of PDB data dictionaries.
  • Added parsing/saving of HEADER and TITLE records in PDB files.
  • Added ability to exclude elements from atom search.
  • Added ability to get nearby atoms in a model.
  • Added bind site identification.
  • Fixed chain length bottleneck in PDB model saving.
  • Overhauled PDB parsing by replacing classes with built in Python types.
  • Fixed bug where numerical residue names were interpreted as integers.
  • Changed atoms so that they can allow negative B factors.
  • Added loading of .xyz data dictionaries.
  • Miscellaneous speed increases.

Release 0.5.0

16 September 2017

  • Added atom temperature factors.
  • Added bond vector production.
  • Added parse time tests and reduced parse time by over a half.
  • Changed way atoms are stored in structures to make ID lookup orders of magnitude faster.
  • Made IDs immutable.
  • Added multiple model parsing and saving.
  • Added option to fetch PDBs from PDBe rather than RCSB.

Release 0.4.0

26 August 2017

  • Added PDB parsing.
  • Added PDB saving.
  • Gave atoms ability to get specific bond with other atom.
  • Added bond angle calculation.
  • Added ability to filter out water molecules.

Release 0.3.0

11 August 2017

  • Added classes for Molecules, Chains, Residues, and their interfaces.
  • Added charges to atoms and structures.
  • Add ability to create AtomicStructures from AtomicStructures.

Release 0.2.0

14 June 2017

  • Made all Atomic Structures savable.
  • Added Atom IDs and uniqueness constraints.
  • Added Atom Bonds.

Release 0.1.1

1 June 2017

  • Fixed setup.py
  • Minor typos

Release 0.1.0

1 June 2017

  • Added basic Model and Atom classes.
  • Added .xyz parsing.