# 1-MMTF-Datastructure
This tutorial shows how to access data from the MMTF data structure.

In [1]:
from pyspark.sql import SparkSession
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.utils import traverseStructureHierarchy
from mmtfPyspark import structureViewer

### Configure Spark

In [2]:
spark = SparkSession.builder.master("local[4]").appName("1-MMTF-Datastructure").getOrCreate()
sc = spark.sparkContext

### Download an example structure
Here we download an HIV protease structure with a bound ligand (Nelfinavir).

In [3]:
pdb = mmtfReader.download_full_mmtf_files(["1OHR"], sc)

Structures are represented as key-value pairs (tuples):
* key: structure identifier (e.g., PDB ID)
* value: MmtfStructure (structure data)

In this case, we only have one structure, so we can use the first() method to extract the data.

In [4]:
pdb_id = pdb.keys().first()
structure = pdb.values().first()

In [5]:
structureViewer.view_structure(pdb_id);

### Access metadata
traverseStructureHierarchy provides methods to explore MMTF structures.
[See the code for how to access these data](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/utils/traverseStructureHierarchy.py#L50-L62)

In [6]:
traverseStructureHierarchy.print_metadata(structure)

*** METADATA ***
StructureId           : 1OHR
Title                 : VIRACEPT (R) (NELFINAVIR MESYLATE, AG1343): A POTENT ORALLY BIOAVAILABLE INHIBITOR OF HIV-1 PROTEASE
Deposition date       : 1997-09-27
Release date          : 1998-12-09
Experimental method(s): [X-RAY DIFFRACTION]
Resolution            : 2.0999999046325684
Rfree                 : None
Rwork                 : 0.20000000298023224
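
The long decimals printed for Resolution and Rwork are consistent with the values 2.1 and 0.2 having been stored as 32-bit floats and then printed at double precision. The sketch below reproduces that effect with the standard library only. The helper name `as_float32` is hypothetical and not part of mmtfPyspark, and the 32-bit storage is an assumption inferred from the printed values.

```python
import struct


def as_float32(value: float) -> float:
    """Round-trip a Python float through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Assumption: the metadata values were stored as 32-bit floats.
resolution = as_float32(2.1)
r_work = as_float32(0.2)

print(resolution)            # the long decimal expansion seen in the metadata
print(round(resolution, 2))  # 2.1
print(round(r_work, 2))      # 0.2
```

Rounding to the precision reported by the experiment is usually enough when these values are displayed or compared.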



### Structural data
[See the code for how to access these data](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/utils/traverseStructureHierarchy.py#L87-L96)

In [7]:
traverseStructureHierarchy.print_structure_data(structure)

*** STRUCTURE DATA ***
Number of models : 1
Number of chains : 5
Number of groups : 250
Number of atoms : 1952
Number of bonds : 1926



### Entity data
Entities are the unique molecular components in a structure.

This structure has one unique polymer (ASPARTYLPROTEASE), one non-polymer ligand, and water.

In [8]:
traverseStructureHierarchy.print_entity_info(structure)

*** ENTITY DATA ***
entity type            : 0 polymer
entity description     : 0 ASPARTYLPROTEASE
entity sequence        : 0 PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
entity type            : 1 non-polymer
entity description     : 1 2-[2-HYDROXY-3-(3-HYDROXY-2-METHYL-BENZOYLAMINO)-4-PHENYL SULFANYL-BUTYL]-DECAHYDRO-ISOQUINOLINE-3-CARBOXYLIC ACID TERT-BUTYLAMIDE
entity sequence        : 1 
entity type            : 2 water
entity description     : 2 water
entity sequence        : 2 




### Chain information
[See the code for how to access these data](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/utils/traverseStructureHierarchy.py#L98-L112)

Note that in the [PDB file for this structure](https://files.rcsb.org/view/1ohr.pdb) you find chains A and B. These "PDB" chains are referred to by chainName in MMTF. Almost all operations in MMTF use the chainNames.

However, in the MMTF data structure, chains are split into polymer, non-polymer, and water chains. This structure has 5 chains: 2 protein chains (99 groups each), 1 non-polymer chain (1 ligand group), and 2 water chains (29 and 22 water groups). These 5 chains are referred to by their chainIds (A, B, C, D, E).

In [9]:
traverseStructureHierarchy.print_chain_info(structure)

*** CHAIN DATA ***
Number of chains: 5
model: 1
chainName: A, chainId: A, groups: 99
chainName: B, chainId: B, groups: 99
chainName: A, chainId: C, groups: 1
chainName: A, chainId: D, groups: 29
chainName: B, chainId: E, groups: 22
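
To make the relationship between chainName and chainId concrete, the following sketch groups the chainIds by chainName using the values printed above. It is plain Python on hand-copied data, not a call into mmtfPyspark, and the variable names are illustrative only.

```python
from collections import defaultdict

# (chainName, chainId, number of groups), copied from the chain data above
chains = [
    ("A", "A", 99),
    ("B", "B", 99),
    ("A", "C", 1),
    ("A", "D", 29),
    ("B", "E", 22),
]

# Collect all MMTF chainIds that belong to the same "PDB" chainName
chain_ids_by_name = defaultdict(list)
for chain_name, chain_id, _ in chains:
    chain_ids_by_name[chain_name].append(chain_id)

# The per-chain group counts should add up to the structure total
total_groups = sum(num_groups for _, _, num_groups in chains)

print(dict(chain_ids_by_name))  # {'A': ['A', 'C', 'D'], 'B': ['B', 'E']}
print(total_groups)             # 250
```

The total of 250 agrees with the number of groups reported in the structure data.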



### Chain, entity, group, and atom information
[See the code for how to access these data](https://github.com/sbl-sdsc/mmtf-pyspark/blob/master/mmtfPyspark/utils/traverseStructureHierarchy.py#L157-L225)

In the data listed below, seq. index is the zero-based index of a specific group (residue) into the one-letter polymer sequence.
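
As a quick illustration of the zero-based seq. index, the sketch below looks up a few groups in the polymer sequence from the entity data. The (oneLetterCode, seq. index) pairs are copied by hand from the printed output of this notebook; nothing here calls mmtfPyspark.

```python
# Polymer sequence of entity 0, copied from the entity data
sequence = (
    "PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICG"
    "HKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF"
)

# (groupName, oneLetterCode, seq. index) as printed for chain A
groups = [
    ("PRO", "P", 0),
    ("ILE", "I", 46),
    ("GLY", "G", 47),
    ("PRO", "P", 80),
    ("CYS", "C", 94),
]

# Each seq. index should point at the matching one-letter code
matches = [sequence[index] == code for _, code, index in groups]

print(len(sequence))  # 99
print(all(matches))   # True
```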

DSSP secStruct. is the DSSP secondary structure annotation recalculated by BioJava's implementation of the DSSP method.

* 5: PI_HELIX
* S: BEND
* H: ALPHA_HELIX
* E: EXTENDED
* G: THREE_TEN_HELIX
* B: BRIDGE
* T: TURN
* C: COIL
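
The code table above can be kept as a small lookup dictionary. The sketch below expands a string of one-letter DSSP codes into the names listed here; `DSSP_CODES` and `expand_dssp` are hypothetical names, not part of mmtfPyspark.

```python
# One-letter DSSP codes and their names, as listed above
DSSP_CODES = {
    "5": "PI_HELIX",
    "S": "BEND",
    "H": "ALPHA_HELIX",
    "E": "EXTENDED",
    "G": "THREE_TEN_HELIX",
    "B": "BRIDGE",
    "T": "TURN",
    "C": "COIL",
}


def expand_dssp(codes: str) -> list:
    """Translate a string of one-letter DSSP codes into their names."""
    return [DSSP_CODES[code] for code in codes]


# Codes of the first four groups of chain A (PRO, GLN, ILE, THR)
print(expand_dssp("CEEC"))  # ['COIL', 'EXTENDED', 'EXTENDED', 'COIL']
```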

In [10]:
traverseStructureHierarchy.print_chain_group_info(structure)

*** CHAIN AND GROUP DATA ***
model: 1
chainName: A, chainId: A, groups: 99
   groupName      : PRO
   oneLetterCode  : P
   seq. index     : 0
   numAtoms       : 9
   numBonds       : 7
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 1
   insertionCode  :  
   DSSP secStruct.: C

   groupName      : GLN
   oneLetterCode  : Q
   seq. index     : 1
   numAtoms       : 12
   numBonds       : 11
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 2
   insertionCode  :  
   DSSP secStruct.: E

   groupName      : ILE
   oneLetterCode  : I
   seq. index     : 2
   numAtoms       : 9
   numBonds       : 8
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 3
   insertionCode  :  
   DSSP secStruct.: E

   groupName      : THR
   oneLetterCode  : T
   seq. index     : 3
   numAtoms       : 9
   numBonds       : 8
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 4
   insertionCode  :  
   DSSP secStruct.: C

   groupName      : LEU
   oneLetterCode  : L
   seq. i

   DSSP secStruct.: E

   groupName      : ILE
   oneLetterCode  : I
   seq. index     : 46
   numAtoms       : 9
   numBonds       : 8
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 47
   insertionCode  :  
   DSSP secStruct.: E

   groupName      : GLY
   oneLetterCode  : G
   seq. index     : 47
   numAtoms       : 5
   numBonds       : 4
   chemCompType   : PEPTIDE LINKING
   groupId        : 48
   insertionCode  :  
   DSSP secStruct.: E

   groupName      : GLY
   oneLetterCode  : G
   seq. index     : 48
   numAtoms       : 5
   numBonds       : 4
   chemCompType   : PEPTIDE LINKING
   groupId        : 49
   insertionCode  :  
   DSSP secStruct.: E

   groupName      : ILE
   oneLetterCode  : I
   seq. index     : 49
   numAtoms       : 9
   numBonds       : 8
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 50
   insertionCode  :  
   DSSP secStruct.: T

   groupName      : GLY
   oneLetterCode  : G
   seq. index     : 50
   numAtoms       : 5
   numBonds   

In [11]:
traverseStructureHierarchy.print_chain_entity_group_atom_info(structure)

*** CHAIN ENTITY GROUP ATOM DATA ***
model: 1
chainName: A, chainId: A, groups: 99
entity type          : polymer
entity description   : ASPARTYLPROTEASE
entity sequence      : PQITLWQRPLVTIKIGGQLKEALLDTGADDTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
   groupName      : PRO
   oneLetterCode  : P
   seq. index     : 0
   numAtoms       : 9
   numBonds       : 7
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 1
   insertionCode  :  
   DSSP secStruct.: C
   Atoms          : 
      1	N	 	-3.477	7.714	33.891	1.0	26.32	N
      2	CA	 	-2.582	6.722	34.505	1.0	24.3	C
      3	C	 	-1.168	6.908	34.016	1.0	22.52	C
      4	O	 	-0.984	7.654	33.063	1.0	22.27	O
      5	CB	 	-3.083	5.331	34.122	1.0	26.46	C
      6	CG	 	-3.631	5.623	32.74	1.0	26.17	C
      7	CD	 	-4.339	6.972	32.959	1.0	26.04	C
      8	H2	 	-4.023	8.297	34.55	1.0	0.0	H
      9	H3	 	-2.859	8.366	33.35	1.0	0.0	H
   groupName      : GLN
   oneLetterCode  : Q
   seq. index     : 1
   numAtoms       : 12
 

   numAtoms       : 9
   numBonds       : 8
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 13
   insertionCode  :  
   DSSP secStruct.: E
   Atoms          : 
      122	N	 	-10.4	-1.1	25.633	1.0	14.77	N
      123	CA	 	-11.004	-0.006	24.902	1.0	15.59	C
      124	C	 	-12.427	-0.437	24.682	1.0	17.87	C
      125	O	 	-12.755	-1.621	24.827	1.0	19.07	O
      126	CB	 	-10.332	0.219	23.54	1.0	15.3	C
      127	CG1	 	-10.445	-1.036	22.667	1.0	17.28	C
      128	CG2	 	-8.857	0.622	23.788	1.0	9.09	C
      129	CD1	 	-10.159	-0.782	21.167	1.0	23.02	C
      130	H	 	-10.556	-2.001	25.298	1.0	0.0	H
   groupName      : LYS
   oneLetterCode  : K
   seq. index     : 13
   numAtoms       : 8
   numBonds       : 7
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 14
   insertionCode  :  
   DSSP secStruct.: E
   Atoms          : 
      131	N	 	-13.301	0.484	24.351	1.0	19.65	N
      132	CA	 	-14.697	0.127	24.11	1.0	19.56	C
      133	C	 	-14.95	0.732	22.781	1.0	20.44	C
      134	O	 	-14.674	1

      686	HG1	 	-2.634	-4.545	14.318	1.0	0.0	H
   groupName      : PRO
   oneLetterCode  : P
   seq. index     : 80
   numAtoms       : 7
   numBonds       : 7
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 81
   insertionCode  :  
   DSSP secStruct.: S
   Atoms          : 
      687	N	 	-2.014	-8.433	13.564	1.0	21.81	N
      688	CA	 	-1.75	-9.395	14.638	1.0	19.23	C
      689	C	 	-2.33	-8.966	15.956	1.0	22.26	C
      690	O	 	-2.732	-9.84	16.699	1.0	23.95	O
      691	CB	 	-0.236	-9.521	14.718	1.0	18.74	C
      692	CG	 	0.18	-9.127	13.293	1.0	22.1	C
      693	CD	 	-0.749	-7.954	12.945	1.0	21.78	C
   groupName      : VAL
   oneLetterCode  : V
   seq. index     : 81
   numAtoms       : 8
   numBonds       : 7
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 82
   insertionCode  :  
   DSSP secStruct.: S
   Atoms          : 
      694	N	 	-2.39	-7.656	16.297	1.0	22.6	N
      695	CA	 	-2.898	-7.174	17.623	1.0	18.8	C
      696	C	 	-3.571	-5.831	17.429	1.0	14.63	C
      697

   groupId        : 94
   insertionCode  :  
   DSSP secStruct.: T
   Atoms          : 
      812	N	 	-4.588	12.945	25.296	1.0	11.65	N
      813	CA	 	-3.534	13.629	26.053	1.0	11.82	C
      814	C	 	-2.408	12.75	26.517	1.0	12.8	C
      815	O	 	-1.725	13.088	27.484	1.0	17.53	O
      816	H	 	-4.68	13.111	24.338	1.0	0.0	H
   groupName      : CYS
   oneLetterCode  : C
   seq. index     : 94
   numAtoms       : 7
   numBonds       : 6
   chemCompType   : L-PEPTIDE LINKING
   groupId        : 95
   insertionCode  :  
   DSSP secStruct.: C
   Atoms          : 
      817	N	 	-2.05	11.714	25.77	1.0	12.94	N
      818	CA	 	-0.975	10.813	26.194	1.0	10.76	C
      819	C	 	0.41	11.341	25.843	1.0	13.43	C
      820	O	 	0.59	11.856	24.738	1.0	15.9	O
      821	CB	 	-1.286	9.462	25.519	1.0	12.85	C
      822	SG	 	-0.106	8.207	25.981	1.0	16.82	S
      823	H	 	-2.509	11.562	24.913	1.0	0.0	H
   groupName      : THR
   oneLetterCode  : T
   seq. index     : 95
   numAtoms       : 9
   numBonds       : 8
   chemC

   chemCompType   : L-PEPTIDE LINKING
   groupId        : 87
   insertionCode  :  
   DSSP secStruct.: H
   Atoms          : 
      1626	N	 	10.741	-1.391	23.954	1.0	6.25	N
      1627	CA	 	10.571	-1.471	25.415	1.0	7.34	C
      1628	C	 	11.748	-2.014	26.177	1.0	9.97	C
      1629	O	 	11.953	-1.59	27.307	1.0	9.53	O
      1630	CB	 	9.32	-2.313	25.737	1.0	6.24	C
      1631	CG	 	7.956	-1.707	25.297	1.0	5.11	C
      1632	CD	 	6.783	-2.453	25.957	1.0	6.34	C
      1633	NE	 	6.712	-3.854	25.508	1.0	11.31	N
      1634	CZ	 	7.129	-4.915	26.227	1.0	12.77	C
      1635	NH1	 	7.625	-4.874	27.461	1.0	11.87	N
      1636	NH2	 	7.012	-6.099	25.683	1.0	15.32	N
      1637	H	 	10.112	-1.875	23.38	1.0	0.0	H
      1638	HE	 	6.388	-4.05	24.615	1.0	0.0	H
      1639	HH11	 	7.715	-3.996	27.938	1.0	0.0	H
      1640	HH12	 	7.902	-5.725	27.919	1.0	0.0	H
      1641	HH21	 	6.621	-6.189	24.777	1.0	0.0	H
      1642	HH22	 	7.323	-6.912	26.183	1.0	0.0	H
   groupName      : ASN
   oneLetterCode  : N
   seq. index     : 87
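
Each atom line in the output above is a tab-separated record. The sketch below parses one such line into named fields. The column meanings (serial number, atom name, alternate location, x, y, z, occupancy, B-factor, element) are an assumption based on the printed values, and `AtomRecord` and `parse_atom_line` are hypothetical helpers, not part of mmtfPyspark.

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    # Field meanings are assumed from the printed output
    serial: int
    name: str
    alt_loc: str
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float
    element: str


def parse_atom_line(line: str) -> AtomRecord:
    """Split one tab-separated atom line into typed fields."""
    fields = [field.strip() for field in line.split("\t")]
    serial, name, alt_loc, x, y, z, occupancy, b_factor, element = fields
    return AtomRecord(
        int(serial), name, alt_loc,
        float(x), float(y), float(z),
        float(occupancy), float(b_factor), element,
    )


# First atom of the first group (PRO 1), copied from the output
atom = parse_atom_line("      1\tN\t \t-3.477\t7.714\t33.891\t1.0\t26.32\tN")
print(atom.name, atom.x, atom.y, atom.z, atom.element)
```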
 

### Crystallographic data

In [13]:
traverseStructureHierarchy.print_crystallographic_data(structure)

*** CRYSTALLOGRAPHIC DATA ***
Space group           : P 21 21 21
Unit cell dimensions  : [52.04, 59.38, 61.67, 90.00, 90.00, 90.00]
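
The six unit cell values can be unpacked into three edge lengths and three angles. The sketch below does this and computes the cell volume with the general triclinic formula, which reduces to a·b·c when all angles are 90°. The order (a, b, c, α, β, γ) and the units (Å and degrees) are assumptions based on common crystallographic convention.

```python
import math


def unit_cell_volume(a, b, c, alpha, beta, gamma):
    """Volume of a unit cell from edge lengths and angles in degrees."""
    cos_a, cos_b, cos_g = (
        math.cos(math.radians(angle)) for angle in (alpha, beta, gamma)
    )
    factor = math.sqrt(
        1.0 - cos_a**2 - cos_b**2 - cos_g**2 + 2.0 * cos_a * cos_b * cos_g
    )
    return a * b * c * factor


# Values copied from the crystallographic data above
cell = [52.04, 59.38, 61.67, 90.00, 90.00, 90.00]
volume = unit_cell_volume(*cell)

print(round(volume, 1))
```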



### Biological assembly data
In this case, the asymmetric unit (the content of the MMTF structure) corresponds to the biological assembly. The transformation matrix in this case is the identity matrix.

In [14]:
traverseStructureHierarchy.print_bioassembly_data(structure)

*** BIOASSEMBLY DATA ***
Number bioassemblies: 1
bioassembly: 1
  Number transformations: 1
    transformation: 0
    chains:         [0, 1, 2, 3, 4]
    rotTransMatrix: [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
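
The rotTransMatrix above is a flat list of 16 numbers. The sketch below reshapes it into a 4x4 matrix, checks that it is the identity, and applies it to one atom coordinate. A row-major layout is assumed here (for the identity matrix the layout makes no difference), and `to_matrix` and `apply_transform` are hypothetical helpers, not part of mmtfPyspark.

```python
def to_matrix(flat, size=4):
    """Reshape a flat list into a size x size matrix (row-major assumed)."""
    return [flat[row * size:(row + 1) * size] for row in range(size)]


def apply_transform(matrix, point):
    """Apply a 4x4 transformation to an (x, y, z) point."""
    vector = (point[0], point[1], point[2], 1.0)
    result = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
    return tuple(result[:3])


# Values copied from the bioassembly data above
rot_trans = [1.0, 0.0, 0.0, 0.0,
             0.0, 1.0, 0.0, 0.0,
             0.0, 0.0, 1.0, 0.0,
             0.0, 0.0, 0.0, 1.0]

matrix = to_matrix(rot_trans)
is_identity = all(
    matrix[r][c] == (1.0 if r == c else 0.0)
    for r in range(4) for c in range(4)
)

# Coordinates of atom 1 (N of PRO 1) from the atom listing
n_atom = (-3.477, 7.714, 33.891)

print(is_identity)                      # True
print(apply_transform(matrix, n_atom))  # (-3.477, 7.714, 33.891)
```

Because the matrix is the identity, the coordinates are unchanged, which matches the statement that the asymmetric unit is the biological assembly.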


In [15]:
spark.stop()