# Analyzing the Phosphoinositide 3-Kinase (PI3K) Research Network
  
Phosphoinositide 3-kinase (PI3K) is an enzyme that regulates cell growth and motility and is implicated in cancer. In this notebook we apply Bu et al.'s metrics to the research network surrounding PI3K, both across the whole network and within its detected sub-communities. We hope to identify publications that play a central role in the growth of the field.

# Data

## Starting Data

- PI3K Network: `/shared/pubmed/pi3k_pubmed_restricted_el.csv`
- PI3K Node-ID to DOI Mappings: `/shared/pubmed/pmid_doi.csv`
- PI3K Degree Distribution: `/shared/pubmed/pi3k_pubmed_restricted_nl_degree_counts.csv`

## Processed Data

- PI3K tsv Edgelist: `/shared/pubmed/pi3k_pubmed_restricted_el.tsv`
- PI3K IKC-30 Clustering: `/shared/pubmed/data/pi3k_pubmed_ikc30.csv`
- PI3K IKC-30 Reformatted Clustering: `/shared/pubmed/data/pi3k_pubmed_ikc30_reformatted.tsv`
- PI3K IKC-30 Cluster Statistics: `/shared/pubmed/data/pi3k_pubmed_ikc30_stats.csv`

**Collected During Notebook Runtime**

- PI3K IKC-30 Densest Cluster: `/shared/pubmed/data/pi3k_k49_cluster1.tsv`

# Pre-Processing Steps

I completed the following pre-processing steps before running this analysis.

## Before Experiment I

1. I converted the edgelist from CSV to TSV format using the script `formatting_scripts/format_pi3k.py` in this repository.
2. I removed the header row by running `tail -n +2 pi3k_pubmed_restricted_el.tsv > pi3k_pubmed_restricted_el2.tsv`.
3. I deleted `pi3k_pubmed_restricted_el.tsv` and renamed `pi3k_pubmed_restricted_el2.tsv` to `pi3k_pubmed_restricted_el.tsv`.
4. I clustered the TSV edgelist with IKC-30 using the script `clustering_scripts/run_ikc.py`.
5. I reformatted the clustering into a two-column `node_id` `cluster_id` TSV by running `formatting_scripts/format_pi3k.py`.
6. I collected cluster statistics on the IKC-30 clustering using [this reporting tool](https://github.com/illinois-or-research-analytics/cm_pipeline/blob/main/scripts/stats.py).
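
Steps 1 to 3 above amount to converting a CSV edgelist into a headerless TSV. The following is a minimal sketch of that conversion, not the contents of `formatting_scripts/format_pi3k.py`; the helper name `csv_to_headerless_tsv` and the column names in the toy input are assumptions made for illustration.

```python
import csv
import io


def csv_to_headerless_tsv(src, dst):
    """Copy rows from a CSV stream to a TSV stream, skipping the header row.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    next(reader, None)  # drop the first (header) row, like `tail -n +2`
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count


# Toy input standing in for the CSV edgelist (the header names are assumed).
sample_csv = io.StringIO("citing,cited\n22956746,53890765\n22956746,55108116\n")
tsv_out = io.StringIO()
rows_written = csv_to_headerless_tsv(sample_csv, tsv_out)
print(tsv_out.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by file handles opened with `newline=""`.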

# Experiment I

## Finding the Densest K-Core Cluster

We start from the IKC-30 clustering and look for its densest cluster, that is, the cluster with the highest k value.

In [1]:
import pandas as pd


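# Note: this file appears to have no header row, so read_csv uses its first record as the column names
pi3k_k_vals = pd.read_csv('/shared/pubmed/data/pi3k_pubmed_ikc30.csv')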
pi3k_k_vals.head()

Unnamed: 0,18337270,1,49,1.1
0,31255140,1,49,1.0
1,22968725,1,49,1.0
2,18335787,1,49,1.0
3,18336071,1,49,1.0
4,22687254,1,49,1.0
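
The column labels in the output above (`18337270`, `1`, `49`, `1.1`) look like a data record rather than names, which suggests the clustering file has no header row and that its first record is being consumed as the header. The sketch below shows the difference `header=None` makes on a toy input. The three rows only mimic the shape of the real file, and the meaning of the fourth column is not stated in this notebook.

```python
import io

import pandas as pd

# Toy rows shaped like the clustering file; they do not reproduce the real data.
raw = "18337270,1,49,1\n31255140,1,49,1\n22968725,1,49,1\n"

# Default behaviour: the first line is taken as the header and lost as data.
default_read = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data and labels the columns 0, 1, 2, 3.
headerless_read = pd.read_csv(io.StringIO(raw), header=None)

print(len(default_read), len(headerless_read))
print(list(default_read.columns))
```

Since every node shown for cluster 1 carries the same k value, the maximum computed in the next cell is unlikely to be affected, but any row count taken from this frame would be short by one.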


In [2]:
# Get the maximum value in column 3
max_k = pi3k_k_vals.iloc[:, 2].max()
print(max_k)

49


Cluster 1, with a k value of 49, is our densest k-core cluster. Let's get stats on it.
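
The cluster ID was read off the `head()` output by eye. As a hedged alternative, the cluster with the highest k can be looked up programmatically. The sketch below uses a toy frame, and the column names `node_id`, `cluster_id`, and `k` are assumptions because the real file is loaded without named columns.

```python
import pandas as pd

# Toy clustering table; the column names are assumed for readability.
clusters = pd.DataFrame({
    "node_id": [101, 102, 103, 104, 105],
    "cluster_id": [1, 1, 2, 2, 3],
    "k": [49, 49, 48, 48, 46],
})

# One k value per cluster, then the cluster whose k is largest.
k_by_cluster = clusters.groupby("cluster_id")["k"].max()
densest_cluster = k_by_cluster.idxmax()
max_k = k_by_cluster.max()

print(densest_cluster, max_k)
```

If several clusters shared the maximum k, `idxmax` would return only the first of them, so a tie would need separate handling.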

In [3]:
pi3k_cluster_stats = pd.read_csv('/shared/pubmed/data/pi3k_pubmed_ikc30_stats.csv')
pi3k_cluster_stats.head()

Unnamed: 0,cluster,n,m,modularity,connectivity,connectivity_normalized_log10(n),connectivity_normalized_log2(n),connectivity_normalized_sqrt(n)/5,conductance
0,1,506,21478,0.000161,49.0,18.120293,5.454752,10.891578,0.681165
1,2,1397,61110,0.000456,48.0,15.261368,4.59413,6.421153,0.697872
2,3,2535,107177,0.000798,46.0,13.513601,4.067999,4.568134,0.688377
3,4,46132,1951190,0.011991,45.0,9.648366,2.904448,1.047566,0.715829
4,5,896,32370,0.000243,41.0,13.88744,4.180536,6.848569,0.502731


For Experiment I, we are analyzing:
  
**Cluster 1**
- 506 nodes
- 21478 edges
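
As a quick sanity check on these figures, the average degree and edge density of the cluster can be computed from `n` and `m` alone. This sketch assumes the cluster is treated as a simple undirected graph, which the notebook does not state explicitly. For a directed reading, the density denominator would be `n * (n - 1)` without the factor of 2 in the numerator.

```python
# Values taken from the cluster statistics table above.
n_nodes = 506
n_edges = 21478

# Undirected simple-graph assumption: each edge contributes to two degrees.
avg_degree = 2 * n_edges / n_nodes
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

print(f"average degree: {avg_degree:.2f}")
print(f"density: {density:.4f}")
```

An average degree well above 49 is consistent with every node in a 49-core having at least 49 neighbours inside the cluster.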

## Extracting an Edge List from the Cluster

Now that we know which cluster to use, let's collect its nodes and extract the subgraph they induce.

In [4]:
pi3k_ikc30 = pd.read_csv('/shared/pubmed/data/pi3k_pubmed_ikc30_reformatted.tsv', sep='\t', header=None)

pi3k_ikc30.head()

Unnamed: 0,0,1
0,18337270,1
1,31255140,1
2,22968725,1
3,18335787,1
4,18336071,1


In [7]:
first_cluster = pi3k_ikc30[pi3k_ikc30.iloc[:, 1] == 1]
first_cluster_nodes = first_cluster.iloc[:, 0].tolist()

print(first_cluster_nodes[:5])

[18337270, 31255140, 22968725, 18335787, 18336071]


In [8]:
pi3k = pd.read_csv('/shared/pubmed/pi3k_pubmed_restricted_el.tsv', sep='\t', header=None)
pi3k.head()

Unnamed: 0,0,1
0,22956746,53890765
1,22956746,55108116
2,22956746,57892820
3,22956750,31510512
4,22956750,40993204


In [10]:
# Keep only the edges whose two endpoints are both in the cluster
induced_subgraph = pi3k[pi3k.iloc[:, 0].isin(first_cluster_nodes) & pi3k.iloc[:, 1].isin(first_cluster_nodes)]
induced_subgraph.head()

Unnamed: 0,0,1
74230,22968725,18335787
74231,22968725,18336071
74232,22968725,18337270
74234,22968725,22687254
74235,22968725,22688518


In [11]:
print(induced_subgraph.shape[0])

21478
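
The edge count of 21478 matches the `m` reported for cluster 1 in the statistics table. A further check, sketched below on a toy graph, is that every node of the induced subgraph has at least k neighbours inside it, as expected for a k-core. The sketch ignores edge direction and assumes the edgelist contains no duplicate or reversed copies of the same edge. The toy edges and node set are illustrative only.

```python
import pandas as pd

# Toy edgelist with integer column labels, as produced by header=None.
edges = pd.DataFrame([[1, 2], [1, 3], [2, 3], [3, 4], [4, 5]])
cluster_nodes = {1, 2, 3}

# Induced subgraph: keep edges with both endpoints inside the cluster.
in_cluster = edges[0].isin(cluster_nodes) & edges[1].isin(cluster_nodes)
induced = edges[in_cluster]

# Degree of each node within the induced subgraph, ignoring direction.
degrees = pd.concat([induced[0], induced[1]]).value_counts()
min_degree = int(degrees.min())

print(len(induced), min_degree)
```

On the real data, the analogous check would compare `min_degree` against 49. Nodes with no edge inside the cluster would be absent from `degrees`, so comparing `len(degrees)` with the cluster's node count is a useful companion check.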


In [12]:
induced_subgraph.to_csv('/shared/pubmed/data/pi3k_k49_cluster1.tsv', sep='\t', header=False, index=False)

We now have a TSV edgelist for the densest cluster. Next we need BDID values.

## Getting BDID Values for PI3K and its Densest Cluster