# Set Up
- Set `filename` in the cell below, then run that cell once to start
- Datasets should be stored in the `data` folder
- The program refers to entries by column name, so any CSV will work as long as its header includes the columns:  
    "ProteinID", "ModifiedLocationNum", and "ModifiedSequence"
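
Before loading a new dataset, it can help to confirm that the header has the three columns listed above. The snippet below is a minimal sketch of such a check; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here for illustration and are not part of `protds_v2`, and the sample row is made up.

```python
import csv
import io

# Column names the notebook looks up (taken from the list above).
REQUIRED_COLUMNS = ("ProteinID", "ModifiedLocationNum", "ModifiedSequence")

def missing_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]

# In-memory stand-in for a CSV file; the values are made up for illustration.
sample_csv = io.StringIO(
    "ProteinID,ModifiedLocationNum,ModifiedSequence,Note\n"
    "P00350,12,EXAMPLE,placeholder\n"
)
header = next(csv.reader(sample_csv))  # first row of the file is the header
print(missing_columns(header))  # an empty list means all required columns exist
```

With pandas, the same helper could be called as `missing_columns(data.columns)` after `pd.read_csv`.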

In [None]:
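import os
import sys
import pandas as pd

# Make the parent directory importable so protds_v2 can be found
sys.path.append('..')
from protds_v2 import *

# Name of the CSV file inside the data folder
filename = 'Peptide_IndexByID_Version_test.csv'
data = pd.read_csv(os.path.join(os.pardir, "data", filename))
# df holds the rows with modifications (displayed in the next cell)
df = modData(data)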

In [None]:
# Show rows with modifications
# pd.set_option("display.max_rows", None)  # uncomment to show all rows
df

Pick the row number of any modified row shown above and assign it to `rowIndex`:

In [None]:
rowIndex = 6540

# Distances

About Distances:
- Distances are reported in angstroms (Å)
- For each site, the distance is the Euclidean distance between centerOfMass(ModifiedLocation) and centerOfMass(siteLocation)
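
The definition above can be illustrated with a small, self-contained sketch. It assumes a mass-weighted center of mass over atom coordinates given in angstroms; how `getDistances()` obtains coordinates and masses is not shown in this notebook, so `center_of_mass`, `com_distance`, and the coordinates below are hypothetical and made up for illustration.

```python
import math

def center_of_mass(coords, masses):
    """Mass-weighted mean of 3D coordinates (same unit as the input)."""
    total = sum(masses)
    return tuple(
        sum(m * c[axis] for c, m in zip(coords, masses)) / total
        for axis in range(3)
    )

def com_distance(coords_a, masses_a, coords_b, masses_b):
    """Euclidean distance between the centers of mass of two atom groups."""
    return math.dist(
        center_of_mass(coords_a, masses_a),
        center_of_mass(coords_b, masses_b),
    )

# Made-up atoms: a two-atom "modified location" and a one-atom "site".
modified_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
modified_masses = [12.0, 12.0]
site_coords = [(4.0, 4.0, 0.0)]
site_masses = [12.0]

print(com_distance(modified_coords, modified_masses, site_coords, site_masses))  # 5.0
```

Here the modified location's center of mass is (1, 0, 0), so the distance to (4, 4, 0) is 5 Å.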

The distances between a modified location and UniProt sites can be found given these assumptions:  
- the ProteinID has active sites (on UniProt)
- ModifiedLocationNum is within range of what's provided by the structure data
- UniProt's site location is within range of what's provided by the structure data
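
The two range assumptions can be checked up front with a sketch like the one below. It assumes the residue numbers covered by a structure are available as a collection of integers; `positions_outside_structure` is a hypothetical helper, not a `protds_v2` function, and all numbers are made up.

```python
def positions_outside_structure(positions, residues_in_structure):
    """Return the positions that the structure does not cover, in input order."""
    covered = set(residues_in_structure)
    return [p for p in positions if p not in covered]

# Hypothetical structure resolving residues 10 through 200 (made-up numbers).
residues_in_structure = range(10, 201)
modified_location = 150
site_locations = [12, 187, 305]

# An empty list means every position can be located in the structure.
print(positions_outside_structure([modified_location], residues_in_structure))  # []
print(positions_outside_structure(site_locations, residues_in_structure))  # [305]
```

Using a set of residue numbers rather than a first/last pair also covers structures whose numbering has gaps.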

### <font color='red'> Work in Progress </font>   
If the assumptions are violated, various results may occur:
   - distances displayed as NaN for invalid entries
   - some structures load while others don't
   - nothing displayed
   - error message (of many types)
  
[These cases are not handled yet. They may be documented in the future, but for now there are too many unaccounted cases.]   
  
[Grouping structures together is still in progress; the output currently shows all of them, in the order the Protein Data Bank lists them.] 
  
[More details (e.g. site location) are planned for the table.]  
  

In [None]:
getDistances(processRow(rowIndex, df), data) 

# Visualization

The getView() function returns an iCn3D structure view:
- if the ProteinID has >1 PDB structure, the user is asked to choose
- ModifiedLocationNum is colored white
- Binding sites from UniProt are colored yellow
- Only relevant chains are considered

### <font color='red'>Minor Issue</font>  
- Some chains may have default colors of white or yellow, so the highlights can be missed (this can be fixed by choosing different highlight colors)

### <font color='red'>Other Issue</font>  
- the highlighted sites are based on UniProt's database
- sometimes, iCn3D's viewer will show different "Functional Sites" than UniProt's list if the option is selected

### <font color='red'> Changes Since Last Version </font>   
- Previously, the input was focused on ProteinID
- It now takes a row number instead (assuming that the user knows which row they want to look at)
- (This reduces the number of prompts and input requests)
    - no more option to select ModifiedLocation for "all" rows containing a specific ProteinID since each row may correspond to different chains

In [None]:
getView(processRow(rowIndex, df), data)

# Old Viewer
This feature will be removed in future versions (unless requested otherwise)

### View by ProteinID:
- It is still possible to view by calling the name of the protein with getView_old()
- However, the binding sites highlighted do not take different chains into consideration and will give the location for
  all available chains
- getView_old() is the previous version of the viewer (slightly modified); it's not as useful anymore

In [None]:
protID = 'P00350'
searchPDB(protID)
getView_old(protID)