# Section 4 - Avoiding data overinterpretation
## Example 4.3
**Application 4.3**: Different entanglements have different structural and topological features.  We need to compute corrected *p*-values for these features to determine which ones have a significant influence on whether a protein is linked to disease. 

* Before running the code cells below, take a minute to think about:
    * What steps will you need to take to correct the *p*-values?
    * How can you check the number of false positives that you have eliminated?

### Step 0 - Load libraries

In [None]:
import pandas as pd
from statsmodels.stats.multitest import multipletests
import matplotlib.pyplot as plt
import numpy as np

### Step 1 - Load the data

In [None]:
data_path = "/home/jovyan/data-store/data/iplant/home/shared/NCEMS/BPS-training-2025/"
use_cols = ["metric", "percentile", "p_value"]
data11 = pd.read_csv(data_path + "disease-assoc_p-values.csv", usecols = use_cols)

### Step 2 - Explore the data

In [None]:
# print a quick summary of "data11"
data11.info()

# print the first 20 rows of "data11"
data11.head(20)

* From this exploration of the data, we can see that there are thirteen different entanglement parameters in the `metric` column
* Each `metric` was tested for its ability to predict linkage to disease at three different thresholds for when a gene counts as linked with disease; these thresholds appear in the `percentile` column as `50%`, `75%`, and `95%`
    * We will focus on the `50%` data in this analysis, so we select only these rows in the cell below

In [None]:
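# keep only the rows for the 50th percentile disease linkage calculations
# (.copy() gives an independent DataFrame, so a new column can be added later without a SettingWithCopyWarning)
data11 = data11[data11["percentile"] == "50%"].copy()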

data11.info()

data11.head(20)

* We are left with 13 rows, each with an associated *p*-value
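
To see why 13 simultaneous tests call for a correction, the short sketch below works out how many false positives we would expect at a 0.05 threshold if none of the features had any real effect. It assumes the tests are independent and that every null hypothesis is true; both are illustrative assumptions, not claims about this dataset, and the names `n_tests` and `alpha_level` are introduced here only for the example.

```python
# Illustrative arithmetic only: assumes 13 independent tests and that
# every null hypothesis is true (an assumption, not a property of the data).
n_tests = 13
alpha_level = 0.05

# expected number of p-values that fall below alpha_level by chance alone
expected_false_positives = n_tests * alpha_level

# probability that at least one test is "significant" by chance alone
prob_at_least_one = 1 - (1 - alpha_level) ** n_tests

print(f"Expected false positives: {expected_false_positives:.2f}")
print(f"Probability of at least one false positive: {prob_at_least_one:.2f}")
```

Under these assumptions, the chance of at least one spurious "significant" feature is close to one in two, which is the motivation for correcting the *p*-values before interpreting them.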

### Step 3 - Run the analysis

In [None]:
# define the significance level for our tests
alpha = 0.05

# apply the Benjamini-Hochberg procedure for FDR correction
_, pvals_corrected, _, _ = multipletests(data11['p_value'], alpha = alpha, method = 'fdr_bh')

# add corrected p-values as a new column
data11['p_value_adjust'] = pvals_corrected

# count the number of p-values < alpha before and after the correction
N_uncorr_acc = (data11['p_value'] < alpha).sum()
N_corr_acc   = (data11['p_value_adjust'] < alpha).sum()
print("Using the uncorrected p-values, we would conclude", N_uncorr_acc, "features are significant")
print("Using the corrected p-values, we would conclude", N_corr_acc, "features are significant")

# make a plot of the distribution of p-values before & after the FDR correction
plt.clf()
plt.title("Histogram")
plt.hist(data11["p_value"], color = "#004488", alpha = 0.7, label = "Uncorrected", histtype = "step", bins = "fd", linewidth=2.5) # here, the "alpha" keyword sets transparency, not the significance level
plt.hist(data11["p_value_adjust"], color = "#BB5566", alpha = 0.7, label = "Corrected", histtype = "step", bins = "fd", linewidth=2.5)
plt.xlabel("p-value")
plt.ylabel("Counts")
plt.legend(loc = "best")
plt.tight_layout()
plt.show()

# make a plot of the cumulative distribution function of p-values before & after the FDR correction
plt.clf()
plt.title("Cumulative distribution function")
plt.hist(data11["p_value"], color = "#004488", alpha = 0.7, label = "Uncorrected", histtype = "step", bins = "fd", cumulative = True, density = True, linewidth=2.5) # here, the "alpha" keyword sets transparency, not the significance level
plt.hist(data11["p_value_adjust"], color = "#BB5566", alpha = 0.7, label = "Corrected", histtype = "step", bins = "fd", cumulative = True, density = True, linewidth=2.5)
plt.xlabel("p-value")
plt.ylabel("Cumulative probability")
plt.legend(loc = "best")
plt.tight_layout()
plt.show()

# make an additional plot showing the individual p-values

n_tests = len(data11)  # number of p-values being plotted
np.random.seed(1)
jitter1 = np.random.uniform(-0.1, 0.1, size=n_tests)
jitter2 = np.random.uniform(-0.1, 0.1, size=n_tests)

plt.clf()
plt.title("Scatter plot")
plt.scatter(np.ones(n_tests) + jitter1, data11["p_value"],  color = "#004488", alpha = 0.7, label = "Uncorrected")
plt.scatter(2.0*np.ones(n_tests) + jitter2, data11["p_value_adjust"],  color = "#BB5566", alpha = 0.7, label = "Corrected")
plt.plot([0, 3], [0.05, 0.05], "r--") # dashed line marks the 0.05 significance level
plt.xlim(0.5, 2.5)
plt.xticks([1, 2], ["Uncorrected", "Corrected"])
plt.yscale('log')
plt.ylim(1E-5, 10)
plt.ylabel("p-value")
plt.tight_layout()
plt.show()
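
To make the correction in Step 3 less of a black box, here is a minimal sketch of the Benjamini-Hochberg adjustment written in plain Python. The function name `bh_adjust` and the toy *p*-values are made up for illustration and are not taken from the dataset; in the notebook itself, `multipletests` with `method = 'fdr_bh'` does this work, and the sketch is only meant to show the idea behind it.

```python
def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Hypothetical helper written for illustration only.
    """
    m = len(pvals)
    # indices of the p-values, ordered from smallest to largest p-value
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted p-values are capped at 1
    # walk from the largest p-value (rank m) down to the smallest (rank 1)
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        # scale each p-value by (number of tests) / (its rank)
        scaled = pvals[i] * m / rank
        # keep the adjusted values non-decreasing with rank
        running_min = min(running_min, scaled)
        adjusted[i] = running_min
    return adjusted


# toy p-values, invented for this example
toy_pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
toy_adjusted = bh_adjust(toy_pvals)

toy_alpha = 0.05
n_before = sum(p < toy_alpha for p in toy_pvals)
n_after = sum(p < toy_alpha for p in toy_adjusted)
print("Adjusted p-values:", [round(p, 5) for p in toy_adjusted])
print("Significant before correction:", n_before)
print("Significant after correction:", n_after)
```

In this toy case, four of the five raw *p*-values fall below 0.05, but only two remain below 0.05 after the adjustment. The two that drop out are the ones closest to the threshold, which is the same pattern to look for when comparing `p_value` and `p_value_adjust` in the real data.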

### Step 4 - Interpret the results

* Think about what we can conclude based on this analysis. Consider the following:
    * How many false positives have you eliminated?
* Once you are confident in your answers, discuss them with someone sitting near you. 