# Package tutorial:
### Creating a network analysis of the metabolic pathways of a microbiome

----------

### 1. Importing the raw data saved as a .csv:
There is a sample .csv file in the Examples directory, which will be used in this tutorial. 

In [1]:
# Import appropriate module:
import pandas as pd

In [2]:
# Import the .csv file:
raw_imported_csv = pd.read_csv("./Examples/sample_raw_input.csv",
                              nrows=1) 

In [3]:
# taking a quick look at what was imported:
raw_imported_csv.head()

Unnamed: 0,L.lysine.fermentation.to.acetate.and.butanoate,pyruvate.fermentation.to.acetate.II,acetate.conversion.to.acetyl.CoA,L.lysine.biosynthesis.I,L.lysine.biosynthesis.III,L.lysine.biosynthesis.VI,acetate.formation.from.acetyl.CoA.I,acetoacetate.degradation..to.acetyl.CoA.,pyruvate.decarboxylation.to.acetyl.CoA
0,24.129287,2.13025,34.141528,26.134562,53.470057,68.188032,94.161874,9.463964,21.280398


As seen above, the sample dataset is in a 'wide' format, with one column per pathway. This needs to be converted into a 'long' format with two columns: one holding the pathway names and one holding the corresponding relative abundances.

In [5]:
# converting to a long format using Pandas' melt() function:
long_data = pd.melt(raw_imported_csv,
                   var_name="pathways", #name of the variable column
                   value_name="relative_abundance" #name of the values column
                   )

In [6]:
# check how it looks now:
long_data


Unnamed: 0,pathways,relative_abundance
0,L.lysine.fermentation.to.acetate.and.butanoate,24.129287
1,pyruvate.fermentation.to.acetate.II,2.13025
2,acetate.conversion.to.acetyl.CoA,34.141528
3,L.lysine.biosynthesis.I,26.134562
4,L.lysine.biosynthesis.III,53.470057
5,L.lysine.biosynthesis.VI,68.188032
6,acetate.formation.from.acetyl.CoA.I,94.161874
7,acetoacetate.degradation..to.acetyl.CoA.,9.463964
8,pyruvate.decarboxylation.to.acetyl.CoA,21.280398


As seen above, the converted data table is now in a 'long' format. The left column, labeled 'pathways', contains the pathway names as strings, formatted exactly as they were produced by another pipeline (paprica). The right column contains float values representing the relative abundance of the corresponding pathway.

This must now be formatted to fit the input criteria for creating a multi-directed network graph using `networkx`. 

### 2. Cleaning up and formatting the input data for network analysis

Network graphs are composed of nodes connected by edges. This example uses a multi-directed graph, a type of network graph in which multiple edges can connect the same pair of nodes, each edge having a specified direction and holding a 'weight' attribute.

In this example, the nodes will be the metabolites, and each edge will point in the direction of the pathway that connects two nodes. For example, if metabolite A gets converted to metabolite B, then the edge would be drawn from metabolite A to metabolite B. Each edge will also be assigned a 'weight', which here is the value in the relative abundance column for that pathway.
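
To make the idea concrete, here is a minimal sketch of such a graph built with the `MultiDiGraph` class from `networkx`. The node labels `A` and `B` and the edge weights are illustrative placeholders, not values from the sample dataset.

```python
import networkx as nx

# A multi-directed graph: directed edges, with parallel edges allowed
graph = nx.MultiDiGraph()

# Two separate pathways that both convert metabolite A into metabolite B
# (placeholder weights standing in for relative abundances)
graph.add_edge("A", "B", weight=1.5)
graph.add_edge("A", "B", weight=0.5)

# Both edges are kept, and they only run from A to B
print(graph.number_of_edges("A", "B"))  # number of parallel A -> B edges
print(graph.number_of_edges("B", "A"))  # edges in the reverse direction

# Each parallel edge carries its own key and its own weight
for source, target, key, weight in graph.edges(keys=True, data="weight"):
    print(source, target, key, weight)
```

A plain `DiGraph` would keep only one edge from A to B, so the second pathway would overwrite the first; the multi-directed variant keeps one edge per pathway.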

##### 2a. Parsing out the pathway names into inputs and outputs

Since the raw example data came from another pipeline, its pathway names follow a specific naming convention. In this example, we know what this convention looks like and which parts to parse out.

In [17]:
print("Example of a formatted pathway item:\n",long_data.iloc[0,0])

Example of a formatted pathway item:
 L.lysine.fermentation.to.acetate.and.butanoate


As you can see above, a single pathway name contains both the originating metabolite and the metabolite products of the pathway. The originating metabolite will be the starting node, and the products will be the receiving nodes.

In order to create a network graph, we must parse out the starting nodes (source) and the receiving nodes (target).

This will be done using a RegEx-based function in a script named `pathway_input_cleanup.py`.
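
The contents of that script are not shown in this tutorial, so the following is only a rough sketch of the idea, not the package's actual function. The name `split_pathway` and the pattern are hypothetical, and they only handle names of the form `<source>.<process>.to.<target>[.and.<target>]`.

```python
import re

# Hypothetical pattern: "<source>.<process>.to.<targets>"
# "\.+" tolerates the doubled dots that appear in some pathway names.
PATHWAY_PATTERN = re.compile(
    r"^(?P<source>.+?)\.[A-Za-z]+\.+to\.(?P<targets>.+)$"
)


def split_pathway(name):
    """Return (source, [targets]), or None if the name does not match."""
    match = PATHWAY_PATTERN.match(name)
    if match is None:
        return None
    # Drop stray leading/trailing dots, then split multiple products on ".and."
    targets = match.group("targets").strip(".").split(".and.")
    return match.group("source"), targets


print(split_pathway("L.lysine.fermentation.to.acetate.and.butanoate"))
# → ('L.lysine', ['acetate', 'butanoate'])
```

Names without a `.to.` segment, such as `L.lysine.biosynthesis.I`, return `None` in this sketch, and variant suffixes such as `.II` are left attached to the target name; a real cleanup function would need to decide how to handle both cases.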

In [26]:
# Absolute import; assumes the notebook runs from the directory that contains scripts/:
import scripts.pathway_input_cleanup as pic