# Analyzing the Phosphoinositide 3-Kinase (PI3K) Research Network
  
Phosphoinositide 3-kinase (PI3K) is an enzyme involved in cell growth and motility, and it is implicated in cancer. In this notebook we apply Bu et al.'s metrics both to the research network surrounding PI3K as a whole and to the sub-communities detected within that network. We hope to identify publications that play a central role in the growth of the field.

# Data

## Starting Data

- PI3K Network: `/shared/pubmed/pi3k_pubmed_restricted_el.csv`
- PI3K Node-ID to DOI Mappings: `/shared/pubmed/pmid_doi.csv`
- PI3K Degree Distribution: `/shared/pubmed/pi3k_pubmed_restricted_nl_degree_counts.csv`

## Processed Data

- PI3K tsv Edgelist: `/shared/pubmed/pi3k_pubmed_restricted_el.tsv`
- PI3K IKC-30 Clustering: `/shared/pubmed/data/pi3k_pubmed_ikc30.csv`
- PI3K IKC-30 Reformatted Clustering: `/shared/pubmed/data/pi3k_pubmed_ikc30_reformatted.tsv`
- PI3K IKC-30 Cluster Statistics: `/shared/pubmed/data/pi3k_pubmed_ikc30_stats.csv`

# Pre-Processing Steps

I carried out the following steps before running the analysis.

## Before Experiment I

1. I converted the edgelist from csv to tsv format using the script in this repository at `formatting_scripts/format_pi3k.py`.
2. I removed the header by running `tail -n +2 pi3k_pubmed_restricted_el.tsv > pi3k_pubmed_restricted_el2.tsv`.
3. I deleted `pi3k_pubmed_restricted_el.tsv` and renamed `pi3k_pubmed_restricted_el2.tsv` to `pi3k_pubmed_restricted_el.tsv`.
4. I clustered the tsv edgelist with IKC-30 using the script `clustering_scripts/run_ikc.py`.
5. I reformatted the clustering into a two-column `node_id` `cluster_id` tsv by running `formatting_scripts/format_pi3k.py`.
6. I collected cluster statistics on the IKC-30 clustering using [this reporting tool](https://github.com/illinois-or-research-analytics/cm_pipeline/blob/main/scripts/stats.py).
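
Steps 1 to 3 above can also be done in a single pass. The following is a minimal sketch, not the contents of `formatting_scripts/format_pi3k.py`. The function name `csv_to_headerless_tsv` is hypothetical, and the code assumes the input csv is comma-separated with exactly one header row.

```python
import csv


def csv_to_headerless_tsv(src_path, dst_path):
    """Convert a comma-separated edgelist to a tab-separated one, dropping the header.

    Hypothetical helper: assumes the first row of the csv is a header.
    Returns the number of data rows written.
    """
    rows_written = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            if row:  # ignore blank lines
                writer.writerow(row)
                rows_written += 1
    return rows_written
```

Writing straight to the final tsv path avoids the intermediate `pi3k_pubmed_restricted_el2.tsv` file and the delete-and-rename step.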

# Experiment I

First we collect an IKC-30 clustering. Then we need to find the densest cluster, so we search for the cluster with the highest edge-to-node ratio.

In [1]:
import pandas as pd
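
The search for the cluster with the highest edge-to-node ratio could be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual analysis code. The function name `edge_node_ratio` and the column names `source`, `target`, `node_id`, and `cluster_id` are hypothetical labels, each node is assumed to belong to at most one cluster, and only edges with both endpoints in the same cluster are counted.

```python
import pandas as pd


def edge_node_ratio(edges, clustering):
    """Return per-cluster node count n, internal edge count m, and m / n.

    edges:      DataFrame with columns 'source' and 'target' (assumed names).
    clustering: DataFrame with columns 'node_id' and 'cluster_id' (assumed names),
                one row per clustered node.
    """
    # Attach the cluster of each endpoint; inner joins drop unclustered nodes.
    src = clustering.rename(columns={"node_id": "source", "cluster_id": "c_src"})
    dst = clustering.rename(columns={"node_id": "target", "cluster_id": "c_dst"})
    labelled = edges.merge(src, on="source").merge(dst, on="target")

    # Keep only edges whose endpoints share a cluster.
    internal = labelled[labelled["c_src"] == labelled["c_dst"]]

    n = clustering.groupby("cluster_id").size()
    m = internal.groupby("c_src").size().reindex(n.index, fill_value=0)

    stats = pd.DataFrame({"n": n, "m": m})
    stats["ratio"] = stats["m"] / stats["n"]
    return stats.sort_values("ratio", ascending=False)


# Toy example: a triangle (cluster 0), a single edge (cluster 1), and one cross edge.
toy_edges = pd.DataFrame({"source": [1, 2, 1, 4, 3], "target": [2, 3, 3, 5, 4]})
toy_clustering = pd.DataFrame({"node_id": [1, 2, 3, 4, 5],
                               "cluster_id": [0, 0, 0, 1, 1]})
toy_stats = edge_node_ratio(toy_edges, toy_clustering)
densest_cluster = toy_stats.index[0]  # cluster with the highest edge-to-node ratio
```

For the real data, the headerless files listed under Processed Data could be loaded with `pd.read_csv(path, sep="\t", header=None, names=[...])`, assuming the reformatted clustering file has no header row either.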


