# A Walkthrough of the Code
- This section describes how the program runs in the background
- It is mostly for documentation; the user doesn't need to be concerned with these details

# Setup
- The code is stored in a separate protds_v2.py file to keep the Notebook clean
- Start by assigning the name of the input file (stored in the data folder)
- The program refers to entries by column name, so any csv will work as long as its header contains:  
    "ProteinID", "ModifiedLocationNum", and "ModifiedSequence"

<font color='red'>Some warnings may come up due to packages using older function names (or bad practice), but everything still works.   
    (Minor issue. Will fix later, if possible)</font> 
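The header requirement from the setup notes can be checked before loading a file. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper (not part of protds_v2.py), and it assumes the column names must match exactly.

```python
import csv
import io

REQUIRED_COLUMNS = {"ProteinID", "ModifiedLocationNum", "ModifiedSequence"}

def missing_columns(csv_file):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])  # first row; [] if the file is empty
    return sorted(REQUIRED_COLUMNS - set(header))

# Made-up two-line file that lacks the ModifiedSequence column
sample = io.StringIO("ProteinID,ModifiedLocationNum,Score\nP00350,12,0.9\n")
print(missing_columns(sample))  # ['ModifiedSequence']
```

An empty list means the file has every column the program looks up.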

In [None]:
import os
import sys
import pandas as pd
sys.path.append('..')  # make protds_v2.py (one directory up) importable
from protds_v2 import *

filename = 'Peptide_IndexByID_Version_test.csv'
data = pd.read_csv(os.path.join(os.pardir, "data", filename))

In [None]:
moddata = data[data['ModifiedLocationNum'].notna()].copy() #select rows with modifications (copy avoids SettingWithCopyWarning)
moddata['ModifiedLocationNum'] = moddata['ModifiedLocationNum'].astype(int) #remove decimals in ModifiedLocationNum
moddata

- The data is now accessible
- We'll mostly work with the entries that have modifications (moddata)

# Fetching Proteins

- The setup code also initializes an empty dictionary called proteins
- We can add Proteins to this dictionary to access their info

In [None]:
proteins #no entries yet

- The getProteins() function takes in the whole dataset and goes through each row to add new Proteins to the dictionary
- In theory, it can process every modified protein in the dataset, but for now, it'll go through the first 4 as an example:
- <font color='red'>[This process takes a while. Have not tested if there's a time-out limit]</font> 

In [None]:
modlist = getProteins(moddata)

## <font color='red'>Future Plans:</font> 
This step takes the longest; will look into different methods to speed it up (later, if possible):
- parallel processing
- caching the Notebook session
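
As a rough illustration of the two ideas above, the sketch below combines a thread pool with an in-memory cache. `fetch_structures` is a hypothetical stand-in for a slow per-protein lookup such as searchPDB(); it is an assumption that the real lookup could be called concurrently like this. The cache only lasts for the life of the kernel, so keeping results across Notebook sessions would require writing them to disk.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

calls = []  # records which IDs actually triggered a (simulated) lookup

@lru_cache(maxsize=None)
def fetch_structures(protein_id):
    """Hypothetical stand-in for a slow per-protein lookup."""
    calls.append(protein_id)
    return f"structures for {protein_id}"

def fetch_all(protein_ids, workers=4):
    """Look up unique IDs concurrently; return {protein_id: result}."""
    unique_ids = list(dict.fromkeys(protein_ids))  # de-duplicate, keep order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch_structures, unique_ids)
        return dict(zip(unique_ids, results))

results = fetch_all(["P00350", "P0A7K2", "P00350"])
fetch_all(["P00350"])  # answered from the cache, no new lookup
print(sorted(calls))   # ['P00350', 'P0A7K2']
```

Threads suit this kind of work if the time is spent waiting on network requests; whether that holds for the real lookup has not been measured here.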

In [None]:
proteins #4 Proteins have been stored

In [None]:
moddata[:8] #the first 4 proteins that we've looked at correspond to these rows:

For convenience, getProteins() also returns these rows as a list:  
(refer to the Modified Entries section for details) 

In [None]:
modlist #simplifies them as a list

## Fetching Individual Proteins

Individual proteins can be added by their ProteinID with searchPDB():  
- getProteins() from before basically loops through searchPDB() for us
- Proteins with many PDB results take longer to complete
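
The relationship between the two functions can be sketched roughly as follows. This is not the code in protds_v2.py: `search_sketch`, `get_proteins_sketch`, `proteins_sketch`, and the lookup table are hypothetical, the sequences and structure names are placeholders, and how the real function counts proteins toward its limit is an assumption.

```python
proteins_sketch = {}  # plays the role of the notebook's `proteins` dictionary

# Placeholder lookup results: ProteinID -> list of structure names (made up)
FAKE_PDB_RESULTS = {"P00350": ["pdb_1", "pdb_2", "pdb_3"], "P0A7K2": ["pdb_4"]}

def search_sketch(protein_id):
    """Store the protein only if the lookup returned at least one structure."""
    hits = FAKE_PDB_RESULTS.get(protein_id, [])
    if hits:
        proteins_sketch[protein_id] = hits
    return bool(hits)

def get_proteins_sketch(rows, max_proteins=4):
    """Walk the rows in order and look up each new ProteinID once.

    Stops at the first row that would exceed `max_proteins` distinct IDs and
    returns the visited rows as
    [ProteinID, ModifiedLocationNum, ModifiedSequence, index].
    """
    seen, entries = set(), []
    for index, (protein_id, location, sequence) in rows:
        if protein_id not in seen:
            if len(seen) == max_proteins:
                break
            seen.add(protein_id)
            search_sketch(protein_id)
        entries.append([protein_id, location, sequence, index])
    return entries

rows = [
    (10, ("P00350", 5, "AAK")),
    (11, ("P00350", 9, "CCK")),
    (15, ("Q7DFV3", 2, "DDK")),  # no lookup results, so it is not stored
    (20, ("P0A7K2", 7, "EEK")),  # third distinct ID, beyond the limit of 2
]
entries = get_proteins_sketch(rows, max_proteins=2)
print([e[3] for e in entries])  # [10, 11, 15]
print(list(proteins_sketch))    # ['P00350']
```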

In [None]:
#example for ProteinIDs Q7DFV3 and P0A7K2
searchPDB('Q7DFV3')
searchPDB('P0A7K2') #has 20 results; takes a while

In [None]:
# no results for Q7DFV3; P0A7K2 now added to proteins
proteins

# Accessing Protein Information

After storing a Protein, its information can be accessed with proteins['ProtID']  
Take Protein P00350 for example:

In [None]:
P00350 = proteins['P00350']
P00350

## Some Information available:

In [None]:
# Full list of features from UniProt
P00350.record.features

In [None]:
#Active/Binding sites:
printSites(P00350)

In [None]:
#List of PDB results:
P00350.getPDBs()

## Structure Information

Proteins can have multiple structures associated with them. Our example (P00350) has 3:

In [None]:
P00350.structures #each is a PDB object containing the structure's PDBid (name) and the structure data (coordinates)

In [None]:
#take the first of these as an example:
ex = P00350.structures[0]
ex

In [None]:
#name
ex.PDBid

In [None]:
#coordinates
ex.structure

The structures are stored as biotite.structure objects, so they can be parsed and analyzed with tools from the biotite package  
Reference: https://www.biotite-python.org/apidoc/biotite.structure.html#module-biotite.structure  
Examples:

In [None]:
#getting sequence of chain A 
chainSeq('A', ex.structure)

In [None]:
#checking to see which chains match the entry's ModifiedSequence:
print('ModifiedSequence:',modlist[0][2])
checkChains(ex.structure, modlist[0])
#(Both chains A and B contain the modSequence)

In [None]:
#comparing sequence alignments for chain A and B
import biotite.sequence.align as align
alignment, order, guide_tree, distance_matrix = align.align_multiple(
    [chainSeq('A', ex.structure), chainSeq('B', ex.structure)],
    matrix=align.SubstitutionMatrix.std_protein_matrix(),
    gap_penalty=-5,
    terminal_penalty=False
)
print(alignment)

In [None]:
#center of mass
struc.mass_center(ex.structure)

In [None]:
#distance between chain A's and chain B's center of mass
A = ex.structure[ex.structure.chain_id=='A']
B = ex.structure[ex.structure.chain_id=='B']
struc.distance(struc.mass_center(A), struc.mass_center(B))

# Modified Entries 

Most of our analysis involves looking at each row, so it's nice to have quick access to relevant row info
- in theory, the final program will look at every row
- this example only looked at the first 8 rows (containing the first 4 unique proteins found)

In [None]:
moddata[:8] #these were rows used for this example

In [None]:
#the list version of these rows comes in handy for easy access:
modlist 

These "entries" are lists formatted as:  
[ProteinID, ModifiedLocationNum, ModifiedSequence, index] 

When asking for some information about an entry, the default input is usually in this form. For example:
- entryView([ProteinID, ModifiedLocationNum, ModifiedSequence, index]) for viewing a row's structure + highlighted sites
- entryDist([ProteinID, ModifiedLocationNum, ModifiedSequence, index]) for viewing the distances between the modLocation and sites for a row

This is how the program handles row entries in the background: it refers to this list structure rather than to the DataFrame rows themselves. Functions like getView() and getDistances() do this conversion internally using checkRow().
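
To make the conversion concrete, here is a minimal sketch of turning a row into the list layout described above. `check_row_sketch` is a hypothetical stand-in for checkRow(); it takes the row index as a separate argument and uses a plain dict with made-up values in place of a DataFrame row, which is an assumption about the real function's behavior.

```python
def check_row_sketch(index, row):
    """Convert a dict-like row into
    [ProteinID, ModifiedLocationNum, ModifiedSequence, index]."""
    return [
        row["ProteinID"],
        int(row["ModifiedLocationNum"]),  # drop the decimal part, e.g. 12.0 -> 12
        row["ModifiedSequence"],
        index,
    ]

# Made-up row values for illustration
row = {"ProteinID": "P00350", "ModifiedLocationNum": 12.0, "ModifiedSequence": "AAK"}
entry = check_row_sketch(3, row)
protein_id, location, sequence, index = entry  # entries unpack positionally
print(entry)  # ['P00350', 12, 'AAK', 3]
```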

In [None]:
#example row 7234:
display(moddata.loc[7234].to_frame().T)
checkRow(moddata.loc[7234])