---

# Biotite documentation

---
https://www.biotite-python.org/

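Goals
- Work through the basics of the tutorial and get familiar with the underlying logic Biotite uses to handle biomacromolecule objects
- Combine it with the ESM modules to write my own data-processing code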

What Biotite covers:
1. Searching and fetching data from biological databases
2. Reading and writing popular sequence/structure file formats
3. Analyzing and editing sequence/structure data
4. Visualizing sequence/structure data
5. Interfacing external applications for further analysis

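The biggest difference between Biotite and Biopython: internally, Biotite stores most objects as NumPy ndarrays, which makes them easier to index from outside and faster to compute on.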
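To make the "everything is an ndarray" idea concrete, here is a minimal NumPy-only sketch of a columnar layout: one array per property, all of the same length. The variable names and values are made up for illustration, and this is not Biotite's actual implementation, only the general pattern that makes mask-based indexing cheap.

```python
import numpy as np

# Hypothetical columnar layout: one array per property, one entry per atom.
chain_id = np.array(["A", "A", "B", "B"])
res_id = np.array([1, 2, 1, 2])
coord = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [0.0, 1.5, 0.0],
    [0.0, 0.0, 1.5],
])

# A boolean mask built from the property arrays selects the matching coordinates.
mask = (chain_id == "B") & (res_id > 1)
selected = coord[mask]

# Vectorised computation on a subset, with no Python-level loop over atoms.
centroid_chain_a = coord[chain_id == "A"].mean(axis=0)

print(selected.shape)
print(centroid_chain_a)
```

The same mask can be reused across every property array, which is what keeps filtering consistent between annotations and coordinates.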


## sequence subpackage

## structure subpackage

This subpackage enables handling of 3D structures of biomolecules. Simplified, a structure is represented by a list of atoms and their properties, based on ndarray objects. Optionally, this representation can be enriched with chemical bond information. Biotite supports different structure formats, including the ones provided by the RCSB and Gromacs trajectory formats. The subpackage offers a wide range of functions for atom filtering, coordinate transformations, angle and bond measurements, accessible surface area calculation, structure superimposition and more.

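`biotite.structure` is the module for handling structures. It is built around three types, `Atom`, `AtomArray` and `AtomArrayStack`, which represent a single atom, all atoms of one model, and all atoms of all models, respectively. All three are backed by NumPy arrays.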

### Creating structures

In [2]:
import biotite.structure as struc
atom1 = struc.Atom([0,0,0],chain_id="A",res_id=1,res_name="GLY",atom_name="N",element="N")
atom2 = struc.Atom([0,1,1],chain_id="A",res_id=1,res_name="GLY",atom_name="CA",element="C")
atom3 = struc.Atom([0,0,2],chain_id="A",res_id=1,res_name="GLY",atom_name="C",element="C")

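The first argument is the coordinates, which are converted internally to a NumPy `ndarray`. The remaining arguments are annotations: chain ID, residue ID, residue name, insertion code, atom name, element, and hetero (whether the atom is not part of a protein/nucleotide chain). Any annotation omitted when creating the atom gets a default value. The mandatory annotation categories come from the ATOM and HETATM records of the PDB format. You can also specify arbitrary extra annotations such as b_factor or charge; see https://www.biotite-python.org/apidoc/biotite.structure.html for details.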

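To work with a whole macromolecular structure, it is more common to use `AtomArray` or `AtomArrayStack`. Internally, an `AtomArray` does not store a list of `Atom` objects; instead it stores one ndarray per annotation category plus one for the coordinates (ndarray is NumPy's array type; annotations that hold text use a str dtype).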

In [9]:
import numpy as np
array = struc.array([atom1,atom2,atom3])
print(f"Chain ID:{array.chain_id}")
print(f"Residue ID:{array.res_id}")
print(f"Atom name:{array.atom_name}")
print(f"Coordinates:{array.coord}")
print(array)

Chain ID:['A' 'A' 'A']
Residue ID:[1 1 1]
Atom name:['N' 'CA' 'C']
Coordinates:[[0. 0. 0.]
 [0. 1. 1.]
 [0. 0. 2.]]
    A       1  GLY N      N         0.000    0.000    0.000
    A       1  GLY CA     C         0.000    1.000    1.000
    A       1  GLY C      C         0.000    0.000    2.000


The `biotite.structure.array()` function takes an iterable of `Atom` instances and builds an `AtomArray`. Since an `AtomArray` is itself iterable, you can even build one `AtomArray` from another. Note that each annotation ndarray has shape (n,), while the coordinate ndarray has shape (n, 3).

In [17]:
# Filter coordinates NumPy-style, using an annotation as the mask
array.chain_id[:] = "B"
array.coord[array.chain_id == "B"] = np.array([0.0,0.0,0.0])
# It is also possible to replace an entire annotation with another array
array.res_id = np.array([1,2,3])
print(array)

    B       1  GLY N      N         0.000    0.000    0.000
    B       2  GLY CA     C         0.000    0.000    0.000
    B       3  GLY C      C         0.000    0.000    0.000


Use `array.add_annotation()` or `array.set_annotation()` to add a new annotation to the array.

In [18]:
array.add_annotation("foo", dtype=bool)
array.set_annotation("bar", [1, 2, 3])
print(array.foo)
print(array.bar)

[False False False]
[1 2 3]


To handle atoms that appear at several positions, for example NMR structures with multiple models or MD trajectories with thousands of frames, use an `AtomArrayStack`. Each annotation is still a length-n array, but the coordinates are an (m, n, 3)-shaped ndarray.

In [19]:
stack = struc.stack([array,array.copy()])
print(stack)

Model 1
    B       1  GLY N      N         0.000    0.000    0.000
    B       2  GLY CA     C         0.000    0.000    0.000
    B       3  GLY C      C         0.000    0.000    0.000

Model 2
    B       1  GLY N      N         0.000    0.000    0.000
    B       2  GLY CA     C         0.000    0.000    0.000
    B       3  GLY C      C         0.000    0.000    0.000




### Loading structures from file

File I/O goes through the `PDBFile` class in `biotite.structure.io.pdb`.

In [21]:
import biotite.structure.io.pdb as pdb
import biotite.database.rcsb as rcsb
from tempfile import gettempdir, NamedTemporaryFile

pdb_file_path = rcsb.fetch("1l2y", "pdb", gettempdir())
pdb_file = pdb.PDBFile.read(pdb_file_path)
tc5b = pdb_file.get_structure()
print(type(tc5b).__name__)
print(tc5b.stack_depth())
print(tc5b.array_length())
print(tc5b.shape)

AtomArrayStack
38
304
(38, 304)
