# A Walkthrough of the Code
- This section describes how the program works behind the scenes
- It is mostly for documentation; users do not need to be concerned with these details

# Setup
- The code is stored in a separate protds_v3.py file to keep the Notebook clean
- Start by assigning the name of the input file (stored in the data folder)
- The program refers to entries by column name, so any csv will work as long as its header contains the columns:  
    "ProteinID" and "ModifiedLocationNum"

In [None]:
import os
import sys

import pandas as pd

sys.path.append('..')  #make the parent folder (which contains protds) importable
from protds.protds_v3 import *

filename = 'Peptide_IndexByID_Version_test.csv'
data = pd.read_csv(os.path.join(os.pardir,"data",filename))
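
Because the program looks entries up by column name, it can help to confirm that a file has the two required columns before running the rest of the notebook. The sketch below is one dependency-free way to do that with the standard csv module; missing_columns and the sample rows are made up for illustration and are not part of protds.

```python
import csv
import io

REQUIRED_COLUMNS = {"ProteinID", "ModifiedLocationNum"}


def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header, sorted."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return sorted(set(required) - {name.strip() for name in header})


# Made-up sample data, for illustration only
good_sample = "ProteinID,Peptide,ModifiedLocationNum\nP00350,EXAMPLE,12\n"
bad_sample = "ProteinID,Peptide\nP00350,EXAMPLE\n"

print(missing_columns(good_sample))
print(missing_columns(bad_sample))
```

With the DataFrame loaded above, an equivalent check would be `set(REQUIRED_COLUMNS) - set(data.columns)`.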

In [None]:
moddata = modData(data)
moddata

- The data is now accessible
- We'll mostly work with the entries that have modifications (moddata)

# Fetching Proteins

- The setup code also initializes an empty dictionary called proteins
- We can add Proteins to this dictionary to access their info

In [None]:
proteins #no entries yet

- The getProteins() function takes in the whole dataset and goes through each row to add new Proteins to the dictionary
- [This process takes a while]
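
The row-by-row behaviour described above can be pictured roughly as follows. This is only a sketch of the pattern, not the actual protds code: fetch_protein is a hypothetical stand-in for the real lookup, collect_proteins is an illustrative name, the rows are made-up values, and skipping IDs that are already stored is an assumption.

```python
def fetch_protein(protein_id):
    """Hypothetical stand-in for the real (slow) lookup done by protds."""
    return {"id": protein_id}


def collect_proteins(rows, store, fetch=fetch_protein):
    """Add each ProteinID seen in rows to store, fetching each ID only once."""
    for row in rows:
        protein_id = row["ProteinID"]
        if protein_id in store:
            continue  # already stored, so skip the slow lookup
        result = fetch(protein_id)
        if result is not None:  # None stands for "no results found"
            store[protein_id] = result
    return store


# Made-up rows: two of them refer to the same protein
rows = [
    {"ProteinID": "P00350", "ModifiedLocationNum": 12},
    {"ProteinID": "P00350", "ModifiedLocationNum": 40},
    {"ProteinID": "P0A7K2", "ModifiedLocationNum": 7},
]
store = collect_proteins(rows, {})
print(sorted(store))
```

Several rows can point to the same protein, so the number of stored Proteins can be smaller than the number of rows processed.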

In [None]:
getProteins(moddata[:10]) #the first 10 rows as an example

In [None]:
proteins #5 Proteins have been stored

In [None]:
moddata[:10] #they belong to these 10 rows

## Fetching Individual Proteins

Individual proteins can also be added by their ProteinID with searchPDB():  
- getProteins() from before essentially calls searchPDB() in a loop for us
- [Proteins with a lot of PDB results will take longer to complete]

In [None]:
#example for ProteinIDs Q7DFV3 and P0A7K2
searchPDB('Q7DFV3')
searchPDB('P0A7K2') #takes a while

In [None]:
# no results for Q7DFV3; P0A7K2 now added to proteins
proteins

## Saving Proteins
- The current proteins dictionary can be saved with saveProteins() and loaded again with loadProteins() for future use
- The default file name is 'proteins.pkl'; set a different name to avoid accidentally overwriting it
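
The .pkl extension suggests the dictionary is written with Python's pickle module, though that is an assumption about protds rather than something confirmed here. Under that assumption, a save/load round trip looks roughly like this; save_dict and load_dict are illustrative names, and the stored entry is a placeholder.

```python
import os
import pickle
import tempfile


def save_dict(store, path="proteins.pkl"):
    """Write the dictionary to path, replacing any existing file."""
    with open(path, "wb") as handle:
        pickle.dump(store, handle)


def load_dict(path="proteins.pkl"):
    """Read a dictionary previously written by save_dict."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Use a temporary folder so the sketch never touches real files
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "overview.pkl")
    original = {"P00350": {"structures": 3}}  # placeholder entry
    save_dict(original, path)
    restored = load_dict(path)

print(restored == original)
```

Opening a file in "wb" mode replaces whatever was there, which is why a distinct file name is the safer choice. Only load pickle files you created yourself, since unpickling can execute arbitrary code.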

In [None]:
saveProteins('overview.pkl')

In [None]:
#clear the current dictionary, then reload it:
proteins.clear()
proteins #empty again

In [None]:
loadProteins('overview.pkl')
proteins #re-loaded

# Accessing Protein Information

After storing a Protein, its information can be accessed with proteins['ProtID']  
Take Protein P00350 for example:

In [None]:
P00350 = proteins['P00350']
P00350
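
Assuming proteins behaves like a plain Python dictionary, asking for an ID that was never stored (such as Q7DFV3, which returned no PDB results earlier) raises a KeyError. A minimal sketch of a safer lookup, using a placeholder dictionary rather than real Protein objects; get_protein is an illustrative name:

```python
# Placeholder dictionary standing in for the real proteins dictionary
proteins_demo = {"P00350": "stored Protein object (placeholder)"}


def get_protein(store, protein_id):
    """Return the stored entry, or None when the ID was never added."""
    return store.get(protein_id)


for protein_id in ("P00350", "Q7DFV3"):
    entry = get_protein(proteins_demo, protein_id)
    status = "found" if entry is not None else "not stored"
    print(protein_id, status)
```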

## Some Available Information

In [None]:
# Full list of features from UniProt
P00350.record.features

In [None]:
#Active/Binding sites:
printSites(P00350)

In [None]:
#List of PDB results:
P00350.getPDBs()

## Structure Information

Proteins can have multiple structures associated with them. Our example (P00350) has 3:

In [None]:
#each entry is a PDB object containing the structure's PDBid (name) and the structure data (coordinates)
P00350.structures

In [None]:
#take the first of these as an example:
ex = P00350.structures[0]
ex

In [None]:
#name
ex.PDBid

In [None]:
#coordinates
ex.structure

The structures are stored as biotite.structure objects, so they can be parsed and analyzed with tools from the biotite package  
reference: https://www.biotite-python.org/apidoc/biotite.structure.html#module-biotite.structure  
Examples:

In [None]:
#getting the sequence of chain A
chainSeq('A', ex.structure)[0]

In [None]:
#aligning the sequences of chains A and B
import biotite.sequence.align as align
alignment, order, guide_tree, distance_matrix = align.align_multiple(
    [chainSeq('A', ex.structure)[0], chainSeq('B', ex.structure)[0]],
    matrix=align.SubstitutionMatrix.std_protein_matrix(),
    gap_penalty=-5,
    terminal_penalty=False
)
print(alignment)

In [None]:
import biotite.structure as struc  #explicit import, in case the star import does not expose it
#center of mass
struc.mass_center(ex.structure)

In [None]:
#distance between the centers of mass of chains A and B
A = ex.structure[ex.structure.chain_id=='A']
B = ex.structure[ex.structure.chain_id=='B']
struc.distance(struc.mass_center(A), struc.mass_center(B))
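
To make the last calculation concrete, here is a small pure-Python sketch of what it computes: a mass-weighted average position for each chain, followed by the Euclidean distance between the two centers. The coordinates and masses are made up, and the mass_center helper below is a simplified stand-in written for this example, not biotite's implementation.

```python
import math


def mass_center(coords, masses):
    """Mass-weighted average of 3D coordinates."""
    total = sum(masses)
    return tuple(
        sum(m * c[axis] for m, c in zip(masses, coords)) / total
        for axis in range(3)
    )


# Made-up coordinates: two atoms per "chain", all with the same mass
chain_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
chain_b = [(0.0, 3.0, 0.0), (2.0, 3.0, 8.0)]
masses = [12.0, 12.0]

center_a = mass_center(chain_a, masses)
center_b = mass_center(chain_b, masses)

# Euclidean distance between the two centers (math.dist needs Python 3.8+)
separation = math.dist(center_a, center_b)
print(center_a, center_b, separation)
```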