# Visualizing SSNs using an interactive network

Networks have nodes and edges. A node is a connection point, and an edge is a connection between two nodes. In this exercise, each protein is a node, and each edge indicates a BLAST hit between two proteins with an e-value (expectation value) smaller than 10e-40.
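
As a minimal sketch of this idea, the snippet below turns a handful of BLAST-style hits into edges and nodes by applying the e-value cutoff. The protein names and e-values are invented for illustration. Note that `10e-40` written in scientific notation is the same number as `1e-39`.

```python
# Toy hits: (source, target, evalue). All names and values here are invented.
hits = [
    ("protA", "protB", 1e-60),
    ("protA", "protC", 1e-12),
    ("protB", "protC", 3e-45),
]

CUTOFF = 10e-40  # the same number as 1e-39

# Keep only hits whose e-value is smaller than the cutoff; each one is an edge.
toy_edges = [(s, t) for s, t, e in hits if e < CUTOFF]

# Every protein that appears in at least one edge is a node.
toy_nodes = sorted({name for edge in toy_edges for name in edge})

print(toy_edges)
print(toy_nodes)
```

The hit between `protA` and `protC` is dropped because its e-value is larger than the cutoff, so no edge is drawn between those two nodes.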

## Examining the BLAST output

<font color=blue><b>STEP 1:</b></font> Let's first look at the output of our BLASTP search by double clicking the file in the file browser. <b>N.B. if you had to relaunch the binder, you will need to upload this file from your personal computer!</b> If you can't find it, you should be able to download the file here: https://drive.google.com/drive/folders/1Qjx3u6T-LIoUV4d2xtyagPN2pXPqHpx_?usp=sharing. 

***
You should see three columns separated by tabs. They aren't labeled, but we dictated them using the -outfmt option in our BLAST search. To refresh your memory, here is the command we ran:

~~~python
!blastp -db files/finalpro_40 -query files/final_40.fasta -outfmt "6 qseqid sseqid evalue" -out files/BLASTe40_out -num_threads 4 -evalue 10e-40
~~~

    - qseqid is the query sequence id (we will call this the source)
    - sseqid is the subject sequence id (we will call this the target)
    - evalue is the expectation value
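
To make the three-column layout concrete, here is a small sketch that splits one tab-separated line into its fields. The line itself is a made-up example and is not taken from the real output file.

```python
# A made-up line in the "6 qseqid sseqid evalue" layout (fields separated by tabs).
example_line = "protA\tprotB\t2.5e-47\n"

# Remove the trailing newline, then split on the tab character.
qseqid, sseqid, evalue_text = example_line.rstrip("\n").split("\t")
evalue = float(evalue_text)  # convert the text to a number

print(qseqid, sseqid, evalue)
```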

<font color=blue><b>STEP 2:</b></font> Answer the following questions:

    1. What is the range of expectation values when the source and target are the same (find a few in the BLASTe40_out)?
    2. Given that the expectation value estimates how many hits of similar quality would be expected by chance alone, does your answer to question 1 make sense?
    3. In the first ten results in the file, what is the expectation value of the closest non-identical match (give the source, target, and evalue)?
    
***

## Creating a Dataframe from the BLAST output

A little lingo here. An API is an Application Programming Interface. APIs are pieces of software that allow applications to talk to each other. A dataframe is a popular data structure that resembles a spreadsheet, and pandas provides the API for working with it. The BLAST output is a tab-separated file, and we will use pandas, a "powerful Python data analysis toolkit", to read our file and convert it to a dataframe.

<font color=blue><b>STEP 3:</b></font> Edit the code to replace the <b>\<<<your file here\>>></b> with the BLAST output file path (remember it is in the files directory). Then run the code below to convert the BLASTe40_out file into a dataframe. Since our BLAST output did not contain any headers (column labels), we can add them in.


In [None]:
import pandas as pd # imports the pandas functions

print("creating header list")
headerList = ['source','target','evalue']

print("reading datafile")
blast_data_con = pd.read_csv('files/BLASTe40_out', sep='\t', header=None)  # reads the BLAST output and looks for 'tab' to separate the values

blast_data_con.columns = headerList  # assigns the names in headerList to the columns

blast_data_con.to_pickle("files/blast_data_con.pkl")

#blast_data_con    # show what is in the dataframe.


Note that the complete dataframe is not shown, but that it contains over 4000 connections, which we call edges in a network.
***
## Creating dataframes of edges and nodes

## Removing duplicates and self-references from edges

The code below removes duplicates (e.g. if <font color="blue">a</font> finds <font color="blue">b</font> and <font color="blue">b</font> finds <font color="blue">a</font>, we only need to keep one of them) and self-references (e.g. remove all instances of <font color="blue">a</font> finds <font color="blue">a</font>).

The code is a bit complicated and uses another library called numpy. Briefly, the code uses the numpy.sort function to put the source and target of each row in a consistent order, and then uses pandas to flag every row that repeats an earlier pair. We then keep only the rows that are not flagged.

We call the new dataframe "edges".
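
To see how the sort-then-flag idea works, here is a sketch on a four-row toy table. The ids `a`, `b`, and `c` and their e-values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "source": ["a", "b", "a", "c"],
    "target": ["b", "a", "a", "a"],
    "evalue": [1e-50, 1e-50, 0.0, 1e-45],
})

# Sort the two names within each row so that (b, a) becomes (a, b).
sorted_pairs = np.sort(toy[["source", "target"]].to_numpy(), axis=1)

# True for every row whose sorted pair has already appeared in an earlier row.
is_repeat = pd.DataFrame(sorted_pairs).duplicated().to_numpy()

toy_edges = toy[~is_repeat]                                        # drop repeats
toy_edges = toy_edges[toy_edges["source"] != toy_edges["target"]]  # drop a-a

print(toy_edges)
```

Only the rows `a`-`b` and `c`-`a` survive: the `b`-`a` row repeats `a`-`b`, and the `a`-`a` row is a self-reference.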

<font color=blue><b>STEP 4:</b></font> Run the code below to create the "edges" dataframe of unique edges.

In [11]:
import numpy as np

df = blast_data_con.copy()       # work on a copy so that the original dataframe is retained

#remove duplicates: sort each source/target pair so that a-b and b-a look the same, then flag repeats
m = pd.DataFrame(np.sort(df[['source','target']].values, axis=1)).duplicated()
df = df[~m.values]

#removes self-reference
df = df[df.source != df.target]

edges = df    # this is a unique set of edges
#edges         # show us the edges dataframe

edges.to_pickle("files/edges.pkl")

## Creating a unique list of nodes

We will use the numpy.unique function to read through the sources and targets in the dataframe and make a list (called uniq_list) of nodes.

Then we will use pandas to convert this list into a simple dataframe of nodes.
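
As a small illustration of what numpy.unique does here, the sketch below builds a node table from a two-row toy edge table. The ids are invented.

```python
import numpy as np
import pandas as pd

toy_edges = pd.DataFrame({"source": ["a", "c"], "target": ["b", "a"]})

# .values gives a 2-D array; np.unique flattens it and returns the sorted unique entries.
toy_ids = np.unique(toy_edges[["source", "target"]].values)

# Wrap the ids in a one-column dataframe with the header "id".
toy_nodes = pd.DataFrame(toy_ids, columns=["id"])

print(toy_nodes)
```

Although `a` appears twice in the edges, it is listed only once in the nodes.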

<font color=blue><b>STEP 5:</b></font> Run the code below to create a dataframe of unique nodes.

In [12]:
uniq_list = np.unique(df[['source', 'target']].values)   # find the unique values and put them in a list

nodes = pd.DataFrame(uniq_list, columns = ['id'])  # make a node dataframe with the column header id

#nodes     # show us the nodes dataframe

nodes.to_pickle("files/nodes.pkl")

***

## Visualizing nodes and edges

You now have a set of nodes and edges that you can visualize. We will import ipycytoscape, a Cytoscape widget for displaying interactive networks. You can find out more about Cytoscape (a really cool stand-alone software package) here: https://cytoscape.org/ and about ipycytoscape here: https://github.com/cytoscape/ipycytoscape.
 
This code is borrowed and edited from https://github.com/joseberlines, who has done some neat work with ipycytoscape.
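
For orientation, the function in the next step builds a dictionary with a `nodes` list and an `edges` list, in which each element keeps its attributes under a `data` key. The sketch below builds that shape by hand for two invented proteins. It uses only the Python standard library and does not call ipycytoscape itself.

```python
import json

# Two invented nodes joined by one invented edge.
toy_graph = {
    "nodes": [
        {"data": {"id": "protA"}},
        {"data": {"id": "protB"}},
    ],
    "edges": [
        {"data": {"source": "protA", "target": "protB", "evalue": 2e-60}},
    ],
}

# Round-trip through JSON text, as the notebook function does, to confirm
# that the structure contains only JSON-compatible values.
as_text = json.dumps(toy_graph)
back = json.loads(as_text)

print(back == toy_graph)
```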

<font color=blue><b>STEP 6:</b></font> Run the code box below to visualize the graph (this might take a few minutes and might take a few seconds to appear even after the asterisk disappears).



In [None]:
# There isn't a real need to edit any of this since it is just making the network graph.

import json                     # json stands for JavaScript Object Notation and ipycytoscape reads the data in this format
import ipycytoscape             # the widget to visualize interactive networks.
from ipywidgets import Output
import pandas as pd

def transform_into_ipycytoscape():
    
    nodes_df = pd.read_pickle("files/nodes.pkl")
    edges_df = pd.read_pickle("files/edges.pkl")
    
    nodes_dict = nodes_df.to_dict('records')    # converts the nodes to a list of dictionaries (one per row)
    edges_dict = edges_df.to_dict('records')    # converts the edges to a list of dictionaries (one per row)

    # building nodes

    data_keys = ['id']  #this is a list of keys in stations (nodes)
    position_keys = ['position_x','position_y']
    rest_keys = ['score','idInt','name','group','removed','selected','selectable','locked','grabbed',
                 'grabbable']   # a comma was missing between 'grabbed' and 'grabbable'
    
    
    nodes_graph_list=[] #an empty list that will hold one JSON-like dictionary per node
    for node in nodes_dict: #iterating over each node
        dict_node = {}
        data_sub_dict = {'data':{el:node[el] for el in data_keys}}
        rest_sub_dict = {el:node[el] for el in node.keys() if el in rest_keys}
        posi_sub_dict = {}
        if 'position_x' in node.keys() and 'position_y' in node.keys():
            posi_sub_dict = {'position':{el:node[el] for el in node.keys() if el in position_keys}}
        
        dict_node = {**data_sub_dict,**rest_sub_dict,**posi_sub_dict}
        nodes_graph_list.append(dict_node)
    
    # building edges
    
    data_keys  = ['source','target','evalue'] #this is a list of keys in edges
    data_keys2 = ['label','classes'] 
    rest_keys  = ['score','weight','group','networkId','networkGroupId','intn','rIntnId','group','removed','selected','selectable','locked','grabbed','grabbable','classes']
    position_keys = ['position_x','position_y']
    
    edges_graph_list = []
    for edge in edges_dict:
        dict_edge = {}
        data_sub_dict = {el:edge[el] for el in data_keys}
        data_sub_dict2 = {el:edge[el] for el in edge.keys() if el in data_keys2}
        rest_sub_dict = {el:edge[el] for el in edge.keys() if el in rest_keys}
        
        dict_edge = {'data':{**data_sub_dict,**data_sub_dict2},**rest_sub_dict}
        edges_graph_list.append(dict_edge)
    
    #print(edges_graph_list)
    
    total_graph_dict = {'nodes': nodes_graph_list, 'edges':edges_graph_list}
    
    #print(total_graph_dict)
    
    # building the style
    all_node_style = ['background-color','background-opacity',
                     'font-family','font-size','label','width',
                     'shape','height','width','text-valign','text-halign']
    all_edge_style = ['background-color','background-opacity',
                     'font-family','font-size','label','width','line-color', 
                     ]
    
    total_style_dict = {}
    style_elements=[]
    for node in nodes_dict:
        node_dict = {'selector': f'node[id = \"{node["id"]}\"]'}
        style_dict ={"style": { el:node[el] for el in node.keys() if el in all_node_style}}
        node_dict.update(style_dict)
        style_elements.append(node_dict)
    
    for edge in edges_dict:
        edge_dict = {'selector': f'edge[id = \"{edge["source"]}\"]'}
        style_dict ={"style": { el:edge[el] for el in edge.keys() if el in all_edge_style}}
        edge_dict.update(style_dict)
        style_elements.append(edge_dict)
    
    # the graph
    data_graph = json.dumps(total_graph_dict)
    json_to_python = json.loads(data_graph)
    result_cyto = ipycytoscape.CytoscapeWidget()
    result_cyto.graph.add_graph_from_json(json_to_python)    
    result_cyto.set_style(style_elements)
    result_cyto.set_layout(name='grid')   #concentric, cola, or grid

    out = Output()

    """
    def log_clicks(node):
        with out:
            print(f'clicked: {(node)}')

    #def log_mouseovers(node):
        #with out:
            #print(f'mouseover: {pformat(node)}')

    result_cyto.on('node', 'click', log_clicks)
    #result_cyto.on('node', 'mouseover', log_mouseovers)
    
    """  
    return result_cyto, out

network, out = transform_into_ipycytoscape()
display(network)
display(out)

<font color=blue><b>STEP 7:</b></font> Try zooming in and out of the network. You can also grab and move nodes.

If you are underwhelmed, you should be. We did make an interactive network of nodes and edges, <b>but</b> the problem is that this network doesn't contain enough information to give us insight into the connections among DtxR-like sequences.

<font color=blue><b>STEP 8:</b></font> The code below adds some information to the dataframe for our nodes. It includes color and labels and sets a default size for the nodes. Run the code below to update the dataframe and then run the code to visualize the network again. In case you would like to see the full range of named colors, check out this site: https://matplotlib.org/stable/gallery/color/named_colors.html.


In [4]:
# Let's add some new columns to our dataframe
nodes['label'] = nodes['id']        # creates a label using the id
nodes['background-color']='cyan'    # our default color is cyan, but could be anything.
nodes['width']='24'                
nodes['height']='24'
nodes['font-size']='20'
nodes['text-valign']='center'
nodes['text-halign']='center'
nodes['count'] = '1'

#nodes
nodes.to_pickle("files/nodes.pkl")

|  | id | label | background-color | width | height | font-size | text-valign | text-halign | count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1C0W_DTXR | 1C0W_DTXR | cyan | 24 | 24 | 20 | center | center | 1 |
| 1 | 1U8R_IDER | 1U8R_IDER | cyan | 24 | 24 | 20 | center | center | 1 |
| 2 | 3HRT_SCAR | 3HRT_SCAR | cyan | 24 | 24 | 20 | center | center | 1 |
| 3 | 3R60_MNTR | 3R60_MNTR | cyan | 24 | 24 | 20 | center | center | 1 |
| 4 | 5CVI_SLOR | 5CVI_SLOR | cyan | 24 | 24 | 20 | center | center | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 300 | Q3ZAA1 | Q3ZAA1 | cyan | 24 | 24 | 20 | center | center | 1 |
| 301 | R9SK60 | R9SK60 | cyan | 24 | 24 | 20 | center | center | 1 |
| 302 | R9SKE6 | R9SKE6 | cyan | 24 | 24 | 20 | center | center | 1 |
| 303 | X1EE71 | X1EE71 | cyan | 24 | 24 | 20 | center | center | 1 |
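
The style columns added in Step 8 are picked up by the visualization function, which turns each row into a style entry: a selector that names the node, plus a dictionary of style properties. Here is a minimal sketch of that conversion for a single row. The list of recognized style names is shortened for illustration.

```python
# One node row as a dictionary, mirroring some of the columns added in Step 8.
node_row = {
    "id": "1C0W_DTXR",
    "label": "1C0W_DTXR",
    "background-color": "cyan",
    "width": "24",
    "height": "24",
    "count": "1",
}

# Shortened list of column names that are treated as style properties.
style_names = ["background-color", "label", "width", "height"]

style_entry = {
    "selector": f'node[id = "{node_row["id"]}"]',
    "style": {key: value for key, value in node_row.items() if key in style_names},
}

print(style_entry)
```

Columns that are not style names, such as `id` and `count`, are left out of the style dictionary, which is why adding a `count` column does not change how the node is drawn.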


***
The network is still very difficult to use. Knowing how many connections each node makes would help us quickly tell highly connected nodes apart from nodes that make only a few connections.

<font color=blue><b>STEP 9:</b></font> The code below creates a list of our node IDs. It goes through the list and counts the number of edges in which each node appears, either as the source or as the target. Then it updates the count column in the dataframe for that node. Lastly, the height and width are set to the count * 10. I just made that up; you could try any multiple or even an exponent (e.g. \**2).


In [5]:
#Let's change the size of the nodes based on the number of connections. 
col_one_list = nodes['id'].tolist()     # make a list from the dataframe

for item in col_one_list: 
    size = len(edges[edges['source']==item]) + len(edges[edges['target']==item])
    nodes.loc[nodes['id'] == item, 'count']=size
    size = size*10
    nodes.loc[nodes['id'] == item, 'width']=size
    nodes.loc[nodes['id'] == item, 'height']=size

nodes['font-size']='50'   # let's also increase the size of the font here.
    
#nodes
nodes.to_pickle("files/nodes.pkl")

|  | id | label | background-color | width | height | font-size | text-valign | text-halign | count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1C0W_DTXR | 1C0W_DTXR | cyan | 80 | 80 | 50 | center | center | 8 |
| 1 | 1U8R_IDER | 1U8R_IDER | cyan | 140 | 140 | 50 | center | center | 14 |
| 2 | 3HRT_SCAR | 3HRT_SCAR | cyan | 110 | 110 | 50 | center | center | 11 |
| 3 | 3R60_MNTR | 3R60_MNTR | cyan | 30 | 30 | 50 | center | center | 3 |
| 4 | 5CVI_SLOR | 5CVI_SLOR | cyan | 110 | 110 | 50 | center | center | 11 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 300 | Q3ZAA1 | Q3ZAA1 | cyan | 50 | 50 | 50 | center | center | 5 |
| 301 | R9SK60 | R9SK60 | cyan | 20 | 20 | 50 | center | center | 2 |
| 302 | R9SKE6 | R9SKE6 | cyan | 10 | 10 | 50 | center | center | 1 |
| 303 | X1EE71 | X1EE71 | cyan | 80 | 80 | 50 | center | center | 8 |


This is finally starting to give us some information. 
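
The connection counts from Step 9 can also be computed without a loop. The sketch below is an alternative, assuming a toy edge table with invented ids. It stacks the source and target columns and counts how often each id appears.

```python
import pandas as pd

# Invented toy data: "a" is connected to three other nodes, "e" to none.
toy_edges = pd.DataFrame({"source": ["a", "a", "a"], "target": ["b", "c", "d"]})
toy_nodes = pd.DataFrame({"id": ["a", "b", "c", "d", "e"]})

# Stack both columns into one Series and count how often each id appears.
degree = pd.concat([toy_edges["source"], toy_edges["target"]]).value_counts()

# Look up the count for each node; nodes that appear in no edge get 0.
toy_nodes["count"] = toy_nodes["id"].map(degree).fillna(0).astype(int)
toy_nodes["width"] = toy_nodes["count"] * 10
toy_nodes["height"] = toy_nodes["count"] * 10

print(toy_nodes)
```

On a few hundred nodes the loop in Step 9 is fast enough, so this is mostly a matter of taste.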

<font color=blue><b>STEP 10:</b></font> Let's make one last amendment to our dataframe by giving a color to our knowns. Note that these are the IDs in the "dtxr_pdbs.fasta" file, and I just picked a different color for each. Run the code below and then rerun the network visualization.

In [6]:

#Here we assign a different color to each of our knowns!

nodes.loc[nodes['id'] == '1U8R_IDER','background-color']  = 'red'
nodes.loc[nodes['id'] == '1C0W_DTXR','background-color']  = 'orange'
nodes.loc[nodes['id'] == '6O5C_MTSR','background-color']  = 'yellow'
nodes.loc[nodes['id'] == '3HRT_SCAR','background-color']  = 'green'
nodes.loc[nodes['id'] == '5CVI_SLOR','background-color']  = 'blue'
nodes.loc[nodes['id'] == '3R60_MNTR','background-color']  = 'magenta'

#nodes
nodes.to_pickle("files/nodes.pkl")

|  | id | label | background-color | width | height | font-size | text-valign | text-halign | count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1C0W_DTXR | 1C0W_DTXR | orange | 80 | 80 | 50 | center | center | 8 |
| 1 | 1U8R_IDER | 1U8R_IDER | red | 140 | 140 | 50 | center | center | 14 |
| 2 | 3HRT_SCAR | 3HRT_SCAR | green | 110 | 110 | 50 | center | center | 11 |
| 3 | 3R60_MNTR | 3R60_MNTR | magenta | 30 | 30 | 50 | center | center | 3 |
| 4 | 5CVI_SLOR | 5CVI_SLOR | blue | 110 | 110 | 50 | center | center | 11 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 300 | Q3ZAA1 | Q3ZAA1 | cyan | 50 | 50 | 50 | center | center | 5 |
| 301 | R9SK60 | R9SK60 | cyan | 20 | 20 | 50 | center | center | 2 |
| 302 | R9SKE6 | R9SKE6 | cyan | 10 | 10 | 50 | center | center | 1 |
| 303 | X1EE71 | X1EE71 | cyan | 80 | 80 | 50 | center | center | 8 |


<font color=blue><b>STEP 11:</b></font> Answer the question:

    1. Was adding color to your knowns as helpful as you thought it might be? Why or why not?
    
***

Let's see if adding more color can help us to generate clusters.

<font color=blue><b>STEP 12:</b></font> In this step, if a node is the source of an edge whose target is a known, we will color it the same as the known. Note that the last section recolors the known nodes, in case a known was itself connected directly to another known. Run the code below and then rerun the network visualization.

In [7]:
records = edges.to_records(index=False)
result = list(records)

for item in result:
    #print(item)
    if item[1] == '1U8R_IDER':
        nodes.loc[nodes['id'] == item[0],'background-color']  = 'red'
    if item[1] == '5CVI_SLOR':
        nodes.loc[nodes['id'] == item[0],'background-color']  = 'blue'
    if item[1] == '3HRT_SCAR':
        nodes.loc[nodes['id'] == item[0],'background-color']  = 'green'
    if item[1] == '1C0W_DTXR':
        nodes.loc[nodes['id'] == item[0],'background-color']  = 'orange'
    if item[1] == '3R60_MNTR':
        nodes.loc[nodes['id'] == item[0],'background-color']  = 'magenta'
    if item[1] == '6O5C_MTSR':
        nodes.loc[nodes['id'] == item[0],'background-color']  = 'yellow'

nodes.loc[nodes['id'] == '1U8R_IDER','background-color']  = 'red'
nodes.loc[nodes['id'] == '1C0W_DTXR','background-color']  = 'orange'
nodes.loc[nodes['id'] == '6O5C_MTSR','background-color']  = 'yellow'
nodes.loc[nodes['id'] == '3HRT_SCAR','background-color']  = 'green'
nodes.loc[nodes['id'] == '5CVI_SLOR','background-color']  = 'blue'
nodes.loc[nodes['id'] == '3R60_MNTR','background-color']  = 'magenta'
        
#nodes

nodes.to_pickle("files/nodes.pkl")

***
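
The chain of `if` statements in Step 12 can also be written with a dictionary that maps each known to its color. The sketch below shows that variant on invented toy data; `seqX` and `seqY` are hypothetical ids, while the known ids and colors are the ones from Step 10.

```python
import pandas as pd

# Colors for the known proteins, as assigned in Step 10.
known_colors = {
    "1U8R_IDER": "red",
    "1C0W_DTXR": "orange",
    "6O5C_MTSR": "yellow",
    "3HRT_SCAR": "green",
    "5CVI_SLOR": "blue",
    "3R60_MNTR": "magenta",
}

# Invented toy data: seqX and seqY are hypothetical ids.
toy_nodes = pd.DataFrame({"id": ["1U8R_IDER", "5CVI_SLOR", "seqX", "seqY"]})
toy_nodes["background-color"] = "cyan"
toy_edges = pd.DataFrame({"source": ["seqX", "seqY"],
                          "target": ["1U8R_IDER", "seqX"]})

# Color each source node whose target is a known.
for source, target in zip(toy_edges["source"], toy_edges["target"]):
    if target in known_colors:
        toy_nodes.loc[toy_nodes["id"] == source, "background-color"] = known_colors[target]

# Recolor the knowns themselves last, so they always keep their own color.
for known, color in known_colors.items():
    toy_nodes.loc[toy_nodes["id"] == known, "background-color"] = color

print(toy_nodes)
```

Here `seqX` turns red because it has an edge to `1U8R_IDER`, while `seqY` stays cyan because it is only connected to `seqX`.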

<font color=blue><b>STEP 13:</b></font> Interact with the graph to create clusters. Briefly, cluster by first moving the larger known clusters to the outside of the grid graph. Then bring the similarly colored groups together. Finally, arrange the remaining cyan nodes nearer to their connections as dictated by the edges.
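
A hand-made clustering can be cross-checked in code. One simple notion of a cluster is a connected component: a set of nodes that can all reach each other by following edges. The sketch below finds the components of a small invented network with a breadth-first search. It illustrates the idea only and is not a replacement for arranging the graph yourself.

```python
from collections import defaultdict, deque

toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]  # invented edges

# Build an undirected adjacency list: each node maps to the set of its neighbors.
neighbors = defaultdict(set)
for s, t in toy_edges:
    neighbors[s].add(t)
    neighbors[t].add(s)

components = []
seen = set()
for start in sorted(neighbors):
    if start in seen:
        continue
    # A breadth-first search from an unvisited node collects one whole component.
    queue = deque([start])
    seen.add(start)
    members = []
    while queue:
        node = queue.popleft()
        members.append(node)
        for nxt in sorted(neighbors[node]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    components.append(sorted(members))

print(components)
```

In this toy network, `a`, `b`, and `c` form one cluster and `d` and `e` form another, because no edge links the two groups.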

<font color=blue><b>STEP 14:</b></font> Use your clustered graph to answer the following questions:

    1. Which DtxR-like proteins are most closely related to each other?
    2. What are the MNTRs connected to? What does this mean? What step or steps might you need to change to identify sequence connections to MNTRs?
    3. What sequence (or sequences) connect the 5CVI_SLOR protein to the 1C0W_DTXR protein?

<font color=blue><b>STEP 15:</b></font> <b>Challenge Question:</b> Having identified the sequence links between the 5CVI_SLOR protein and the 1C0W_DTXR protein (Step 14, question 3), create a fasta file that contains these linking sequences and the knowns (in dtxr_pdbs.fasta). Then create a multiple alignment from this fasta file. Using the multiple alignment output, assign functions to the linking sequences. The choices are: DtxR/IdeR-like, SloR/ScaR-like, MtsR-like, or unknown function.
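
One possible way to pull selected records out of a fasta file is sketched below. The helper `filter_fasta` is a hypothetical function written for this example, and the sequences and ids are invented. It assumes that the id is the first word after `>` on each header line.

```python
def filter_fasta(fasta_text, wanted_ids):
    """Return FASTA text containing only the records whose id is in wanted_ids.

    The id is taken to be the first word after '>' on the header line.
    """
    kept_lines = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A header line starts a new record; decide whether to keep it.
            header_words = line[1:].split()
            record_id = header_words[0] if header_words else ""
            keep = record_id in wanted_ids
        if keep:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")

# Invented three-record example.
toy_fasta = ">seq1 toy\nMKV\nLLA\n>seq2\nMTT\n>seq3\nMQQ\n"
subset = filter_fasta(toy_fasta, {"seq1", "seq3"})

print(subset)
```

To use it on real data, you would read the text of your fasta file into a string, pass in the set of linking ids, and write the returned text to a new file.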