# Notebook 05: Results Interpretation and Visualization

## Introduction

In Notebook 04, we ran the core statistical analyses and created several key visualization artifacts (`.qzv`) that contain the answers to our scientific questions. However, these artifacts are "locked boxes": we cannot see the p-values or data directly in the notebook.

The goal of this final notebook is to:
1.  **Unlock (Export)** these `.qzv` artifacts.
2.  **Extract** the key statistical results (p-values, test statistics).
3.  **Interpret** these results to answer our primary research questions.
4.  **Visualize** the results using Python (Pandas, Seaborn) to create publication-quality, static plots for our final portfolio.
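
The "locked box" is less opaque than it sounds: a `.qzv` (like a `.qza`) is an ordinary ZIP archive whose payload sits under `<uuid>/data/`. As a minimal sketch, the snippet below builds a tiny stand-in archive in memory and lists its data files with the standard library. The UUID and file names are hypothetical placeholders, not outputs from this project, and `list_data_files` is an illustrative helper; `qiime tools export` remains the approach used in this notebook.

```python
import io
import zipfile


def list_data_files(archive):
    """Return the paths stored under '<uuid>/data/' in a QIIME 2 archive."""
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            name.split("/data/", 1)[1]
            for name in zf.namelist()
            if "/data/" in name and not name.endswith("/")
        )


# Build a stand-in archive in memory (hypothetical UUID and file names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("1234-abcd/VERSION", "QIIME 2\n")
    zf.writestr("1234-abcd/data/index.html", "<html></html>")
    zf.writestr("1234-abcd/data/example-results.csv", "Group 1,Group 2\n")
buffer.seek(0)

data_files = list_data_files(buffer)
print(data_files)
```

Passing a real path such as `ALPHA_FAITH_PD_QZV` instead of the in-memory buffer should give a quick preview of what an export will produce.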

### Objectives:

1.  **Analyze Alpha Diversity:** Export the `alpha-group-significance` visualization, load the resulting CSV with Pandas, and find the p-value for the `gastrointest_disord` comparison.
2.  **Analyze Beta Diversity:** Export the `beta-group-significance` visualization, load the PERMANOVA results, and find the p-value for the `gastrointest_disord` comparison.
3.  **Create Final PCoA Plot:** Export the `PCoA results` artifact, merge the coordinates with our metadata, and use Seaborn to create a clean, annotated PCoA plot.
4.  **Final Scientific Conclusion:** Summarize all findings and formally answer the project's central research questions.

In [None]:
# --- Install Visualization & Formatting Libraries ---
# We are installing seaborn and matplotlib (for plotting)
# and 'tabulate' (which is an optional dependency for pandas 'to_markdown' function).

print("--- 1. Installing seaborn, matplotlib, and tabulate ---")
!mamba install -y seaborn matplotlib tabulate

print("\n--- 2. Installation complete. ---")

In [None]:
# --- (Cell 3) Imports, Settings, and Verification ---
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt

# Set plotting style
sns.set_style("whitegrid")

print("--- 1. Verification: Checking for input files from Notebook 04 ---")

# Define input files (the "locked boxes")
ALPHA_FAITH_PD_QZV = "../results/09_core_metrics/faith-pd-significance.qzv"
ALPHA_SHANNON_QZV = "../results/09_core_metrics/shannon-significance.qzv"
BETA_UNIFRAC_QZV = "../results/09_core_metrics/unweighted-unifrac-significance.qzv"
BETA_BRAY_QZV = "../results/09_core_metrics/bray-curtis-significance.qzv"
PCOA_UNIFRAC_QZA = "../results/09_core_metrics/unweighted_unifrac_pcoa_results.qza" # Note: .qza
METADATA_TSV = "../data/metadata.tsv"

# Define output directory for this notebook
EXPORT_DIR_BASE = "../results/10_exported_results"
!mkdir -p {EXPORT_DIR_BASE}
print(f"Created directory for exported results: {EXPORT_DIR_BASE}")


# Check if all required files exist
files_to_check = [
    ALPHA_FAITH_PD_QZV, ALPHA_SHANNON_QZV, BETA_UNIFRAC_QZV, 
    BETA_BRAY_QZV, PCOA_UNIFRAC_QZA, METADATA_TSV
]
all_files_exist = True

for f in files_to_check:
    if not os.path.exists(f):
        print(f"!!! ERROR: Required file not found: {f}")
        all_files_exist = False
    else:
        print(f"Found: {f}")

if all_files_exist:
    print("\n--- All required input files are present. Ready to start Notebook 05. ---")
else:
    print("\n--- !!! ERROR: Please ensure Notebook 04 ran successfully before proceeding. ---")
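
The check above only prints a message, so later cells would still run with inputs missing. One possible alternative, sketched here under the assumption that stopping early is preferable, is to raise an exception that names every missing file. `require_files` and the file names in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path


def require_files(paths):
    """Raise FileNotFoundError naming every missing path; otherwise return the paths."""
    missing = [str(p) for p in paths if not Path(p).exists()]
    if missing:
        raise FileNotFoundError("Missing required inputs: " + ", ".join(missing))
    return list(paths)


# Demonstration in a temporary directory (file names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "present.qzv"
    present.write_text("placeholder")
    absent = Path(tmp) / "absent.qzv"

    checked = require_files([present])  # passes silently
    try:
        require_files([present, absent])
        error_message = ""
    except FileNotFoundError as exc:
        error_message = str(exc)

print(error_message)
```

In this notebook the call would be `require_files(files_to_check)`.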

### 1. Analyze Alpha Diversity Results

Our first objective is to extract and interpret the p-values from our alpha diversity statistical tests. We will start with the Faith's PD metric.

In [None]:
# ---  Export Alpha Diversity (Faith's PD) Results ---

print("--- 1. Exporting Faith's PD significance visualization ---")

# Define input path
ALPHA_FAITH_PD_QZV = "../results/09_core_metrics/faith-pd-significance.qzv"
# Define output path (a *new directory*)
ALPHA_FAITH_PD_EXPORT_DIR = f"{EXPORT_DIR_BASE}/alpha_faith_pd"

# Run the export command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime tools export \
    --input-path {ALPHA_FAITH_PD_QZV} \
    --output-path {ALPHA_FAITH_PD_EXPORT_DIR}

print(f"\n--- 2. Export complete. Listing contents of {ALPHA_FAITH_PD_EXPORT_DIR} ---")
!ls -lh {ALPHA_FAITH_PD_EXPORT_DIR}

print("\nWe are looking for the 'kruskal-wallis-pairwise-gastrointest_disord.csv' file.")

### 1.1 Load Alpha Diversity (Faith's PD) Results

The export was successful. As expected with QIIME 2 v2020.8, the command automatically ran the Kruskal-Wallis test on *every* categorical column in our metadata, creating a separate CSV file for each.

The file we care about for our primary question is `kruskal-wallis-pairwise-gastrointest_disord.csv`.

We will now load this specific file into Pandas to find the p-value.
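
The group labels in the exported table carry a sample-size suffix (for example `none (n=21)`), which makes exact string matching brittle. The sketch below shows one way to look up a comparison by its plain group names. The table is an invented placeholder that mimics the `Group 1` / `Group 2` / `p-value` / `q-value` layout, not this project's results, and `find_comparison` is a hypothetical helper.

```python
import io

import pandas as pd

# Placeholder table mimicking the exported layout; the numbers are invented
# for illustration and are NOT results from this project.
example_csv = io.StringIO(
    "Group 1,Group 2,H,p-value,q-value\n"
    "Crohn's disease (n=10),none (n=5),4.0,0.04,0.06\n"
    "Crohn's disease (n=10),other (n=3),1.0,0.30,0.30\n"
    "none (n=5),other (n=3),0.5,0.50,0.50\n"
)
stats = pd.read_csv(example_csv)


def find_comparison(df, group_a, group_b):
    """Return the row comparing group_a with group_b, in either order."""
    # Strip the trailing " (n=123)" suffix from each label before matching.
    g1 = df["Group 1"].str.replace(r"\s*\(n=\d+\)$", "", regex=True)
    g2 = df["Group 2"].str.replace(r"\s*\(n=\d+\)$", "", regex=True)
    mask = ((g1 == group_a) & (g2 == group_b)) | ((g1 == group_b) & (g2 == group_a))
    matches = df[mask]
    if len(matches) != 1:
        raise ValueError(f"Expected exactly one matching row, found {len(matches)}")
    return matches.iloc[0]


row = find_comparison(stats, "none", "Crohn's disease")
print(row["p-value"], row["q-value"])
```

Matching by name rather than by row position keeps the lookup correct when the metadata column has more than two groups.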

In [None]:
# --- Load and Print Faith's PD p-value ---

print("--- 1. Loading Faith's PD statistical results ---")

# Define the *exact* path to the CSV file we want
ALPHA_FAITH_PD_CSV = f"{ALPHA_FAITH_PD_EXPORT_DIR}/kruskal-wallis-pairwise-gastrointest_disord.csv"

# Load it into a pandas DataFrame
df_faith_pd_stats = pd.read_csv(ALPHA_FAITH_PD_CSV)

# Print the entire table (it should be very small)
print("\n--- 2. Kruskal-Wallis results for 'gastrointest_disord' ---")
print(df_faith_pd_stats.to_markdown(index=False))

# --- 3. Extract and interpret the key p-value ---
# QIIME 2 appends the group size to each label (e.g. 'none (n=21)', not 'healthy'),
# so we read the group names from the table instead of hard-coding them.
# NOTE: the first row is the comparison of interest only when the column has
# exactly two groups; with more groups, select the row by group name instead.
first_row = df_faith_pd_stats.iloc[0]
group_1_name = first_row['Group 1'] # e.g., "Crohn's disease (n=234)"
group_2_name = first_row['Group 2'] # e.g., "none (n=21)"

print(f"\n--- 3b. Dynamically found groups: '{group_1_name}' vs '{group_2_name}' ---")

# Read both statistics from that same row
p_value = first_row['p-value']
q_value = first_row['q-value']


print(f"\n--- 4. Scientific Conclusion (Faith's PD) ---")
print(f"The raw p-value for ({group_1_name} vs. {group_2_name}) is: {p_value}")
print(f"The adjusted q-value (FDR p-value) is: {q_value}")

if q_value < 0.05:
    print("\nCONCLUSION: The difference in Faith's PD alpha diversity IS statistically significant.")
else:
    print("\nCONCLUSION: The difference in Faith's PD alpha diversity IS NOT statistically significant.")
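
The q-value printed above is an FDR-adjusted p-value. As a minimal sketch, assuming the adjustment is the Benjamini-Hochberg procedure, the calculation looks like this. The p-values are invented and `benjamini_hochberg` is an illustrative helper, not a QIIME 2 function.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards, then cap at 1.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return q


# Invented p-values for illustration only.
q_values = benjamini_hochberg([0.01, 0.04, 0.03])
print(q_values)
```

Under this procedure a table with a single pairwise comparison has a q-value equal to its p-value, because there is nothing to correct for.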

### 1.2 Analyze Alpha Diversity (Shannon's Index)

We have confirmed a significant difference using Faith's PD (a phylogenetic metric).

Now, we will perform the *exact same analysis* on our second metric, **Shannon's Index**, which is a non-phylogenetic metric (it measures richness and evenness without using the tree). This will help confirm and strengthen our conclusion.

In [None]:
# --- Export Alpha Diversity (Shannon) Results ---

print("--- 1. Exporting Shannon's Index significance visualization ---")

# Define input path
ALPHA_SHANNON_QZV = "../results/09_core_metrics/shannon-significance.qzv"
# Define output path (a *new directory*)
ALPHA_SHANNON_EXPORT_DIR = f"{EXPORT_DIR_BASE}/alpha_shannon"

# Run the export command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime tools export \
    --input-path {ALPHA_SHANNON_QZV} \
    --output-path {ALPHA_SHANNON_EXPORT_DIR}

print(f"\n--- 2. Export complete. Listing contents of {ALPHA_SHANNON_EXPORT_DIR} ---")
!ls -lh {ALPHA_SHANNON_EXPORT_DIR}

print("\nWe are looking for the 'kruskal-wallis-pairwise-gastrointest_disord.csv' file.")

In [None]:
# --- Load and Print Shannon p-value ---

print("--- 1. Loading Shannon's Index statistical results ---")

# Define the *exact* path to the CSV file we want
ALPHA_SHANNON_CSV = f"{ALPHA_SHANNON_EXPORT_DIR}/kruskal-wallis-pairwise-gastrointest_disord.csv"

# Load it into a pandas DataFrame
df_shannon_stats = pd.read_csv(ALPHA_SHANNON_CSV)

# Print the entire table
print("\n--- 2. Kruskal-Wallis results for 'gastrointest_disord' (Shannon) ---")
print(df_shannon_stats.to_markdown(index=False))

# --- 3. Extract and interpret the key p-value ---
# We use the same dynamic method to get group names safely.
# NOTE: the first row is the comparison of interest only when the column has
# exactly two groups; with more groups, select the row by group name instead.
first_row_sh = df_shannon_stats.iloc[0]
group_1_name = first_row_sh['Group 1'] # e.g., "Crohn's disease (n=234)"
group_2_name = first_row_sh['Group 2'] # e.g., "none (n=21)"

print(f"\n--- 3b. Dynamically found groups: '{group_1_name}' vs '{group_2_name}' ---")

# Read both statistics from that same row
p_value_sh = first_row_sh['p-value']
q_value_sh = first_row_sh['q-value']


print(f"\n--- 4. Scientific Conclusion (Shannon's Index) ---")
print(f"The raw p-value for ({group_1_name} vs. {group_2_name}) is: {p_value_sh}")
print(f"The adjusted q-value (FDR p-value) is: {q_value_sh}")

if q_value_sh < 0.05:
    print("\nCONCLUSION: The difference in Shannon's Index alpha diversity IS statistically significant.")
else:
    print("\nCONCLUSION: The difference in Shannon's Index alpha diversity IS NOT statistically significant.")

### 2. Analyze Beta Diversity Results (PERMANOVA)

We have completed our alpha diversity analysis. The key finding is that while simple richness/evenness (Shannon) is not different, the phylogenetic diversity (Faith's PD) is significantly lower in Crohn's disease patients (the test establishes the difference; the boxplot in Section 4.1 shows its direction).

Now, we will move to beta diversity to answer our second question: **"Is the overall community composition significantly different between groups?"**

We will do this by exporting the results of our `beta-group-significance` command (which runs a PERMANOVA test) and extracting the p-value. We will start with the Unweighted UniFrac metric, as it complements our significant Faith's PD finding (both are phylogenetic).
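
To make the PERMANOVA p-value less of a black box, here is a minimal sketch of the idea: compute a pseudo-F statistic from the distance matrix, then shuffle the group labels many times to see how often chance alone produces a value at least as large. This is a toy illustration on invented data, not the implementation QIIME 2 runs, and the function names are hypothetical.

```python
import numpy as np


def pseudo_f(dist, labels):
    """Compute the PERMANOVA pseudo-F statistic from a square distance matrix."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    groups = np.unique(labels)
    sq = dist ** 2
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares: the same quantity inside each group.
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(dist, labels, permutations=999, seed=0):
    """Return (pseudo-F, permutation p-value) by shuffling the group labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = pseudo_f(dist, labels)
    hits = sum(
        pseudo_f(dist, rng.permutation(labels)) >= observed
        for _ in range(permutations)
    )
    return observed, (hits + 1) / (permutations + 1)


# Toy data (invented): two groups of six points on a line, far apart.
points = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5])
toy_dist = np.abs(points[:, None] - points[None, :])
toy_labels = ["A"] * 6 + ["B"] * 6

f_stat, p_val = permanova(toy_dist, toy_labels)
print(f_stat, p_val)
```

Because the p-value is `(hits + 1) / (permutations + 1)`, its resolution depends on the number of permutations: with 999 permutations the smallest attainable value is 0.001.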

In [None]:
# --- Export Beta Diversity (Unweighted UniFrac) Results ---

print("--- 1. Exporting Unweighted UniFrac significance visualization ---")

# Define input path
BETA_UNIFRAC_QZV = "../results/09_core_metrics/unweighted-unifrac-significance.qzv"
# Define output path (a *new directory*)
BETA_UNIFRAC_EXPORT_DIR = f"{EXPORT_DIR_BASE}/beta_unifrac_significance"

# Run the export command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime tools export \
    --input-path {BETA_UNIFRAC_QZV} \
    --output-path {BETA_UNIFRAC_EXPORT_DIR}

print(f"\n--- 2. Export complete. Listing contents of {BETA_UNIFRAC_EXPORT_DIR} ---")
!ls -lh {BETA_UNIFRAC_EXPORT_DIR}

print("\nThe overall PERMANOVA table is rendered in 'index.html'; a 'permanova-pairwise.csv' file appears only if the test was run with --p-pairwise.")
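
The cell that reads the PERMANOVA p-value is not shown in this notebook. As a hedged sketch, assuming the summary table is available as an HTML table in the exported `index.html`, the values can be pulled out with the standard library alone. The HTML fragment below is a hypothetical stand-in with placeholder values, and `TableRows` and `summary_from_html` are illustrative names.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def summary_from_html(html_text):
    """Map each two-cell row's label to its value, e.g. 'p-value' -> '0.001'."""
    parser = TableRows()
    parser.feed(html_text)
    return {row[0]: row[1] for row in parser.rows if len(row) == 2 and row[0]}


# Hypothetical fragment shaped like a rendered summary table;
# the values are placeholders, not this project's results.
example_html = """
<table>
  <tr><th></th><th>PERMANOVA results</th></tr>
  <tr><th>sample size</th><td>12</td></tr>
  <tr><th>test statistic</th><td>3.21</td></tr>
  <tr><th>p-value</th><td>0.001</td></tr>
</table>
"""
summary = summary_from_html(example_html)
print(float(summary["p-value"]))
```

For the real export, the input would be the text of `index.html` inside `BETA_UNIFRAC_EXPORT_DIR`. Check the directory listing first, since the exact file layout is an assumption here.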

### 3. Final Visualization (PCoA Plot)

We have now statistically confirmed significant differences in both phylogenetic alpha diversity (Faith's PD) and beta diversity (Unweighted UniFrac).

Our final objective is to **visualize** this beta diversity separation. While QIIME 2 provides interactive `emperor.qzv` plots, it is best practice to create a static, publication-quality plot directly in the notebook.

We will do this by:
1.  Exporting the raw coordinate data from our `unweighted_unifrac_pcoa_results.qza` artifact.
2.  Loading these coordinates into Pandas.
3.  Merging the coordinates with our sample metadata.
4.  Using `seaborn` to create the final, annotated PCoA plot.

In [None]:
# --- Export PCoA Coordinates ---

print("--- 1. Exporting Unweighted UniFrac PCoA results ---")

# Define input path (Note: .qza, not .qzv)
PCOA_UNIFRAC_QZA = "../results/09_core_metrics/unweighted_unifrac_pcoa_results.qza"
# Define output path (a *new directory*)
PCOA_UNIFRAC_EXPORT_DIR = f"{EXPORT_DIR_BASE}/pcoa_unifrac_coords"

# Run the export command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime tools export \
    --input-path {PCOA_UNIFRAC_QZA} \
    --output-path {PCOA_UNIFRAC_EXPORT_DIR}

print(f"\n--- 2. Export complete. Listing contents of {PCOA_UNIFRAC_EXPORT_DIR} ---")
!ls -lh {PCOA_UNIFRAC_EXPORT_DIR}

print("\nWe are looking for the 'ordination.txt' file.")

### 3.1 Load and Merge PCoA Data

The export was successful, and we now have the raw PCoA coordinates in `ordination.txt`.

This file is a simple text file, but it's not ready to be plotted. We must first load it into Pandas and then **merge** it with our `metadata.tsv` file. This merge is critical: it will connect the abstract (X, Y) coordinates of each sample to its actual group (e.g., "Crohn's disease" or "none").
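
`ordination.txt` is organized into labeled sections (`Eigvals`, `Proportion explained`, `Species`, `Site`, `Biplot`, `Site constraints`), with the sample coordinates in the `Site` section. The sketch below, which assumes that sectioned layout, reads the coordinates by locating the `Site` header rather than counting lines, and also recovers the proportion of variance explained by each axis. The example text is invented and `parse_ordination` is a hypothetical helper.

```python
def parse_ordination(text):
    """Return (coords, proportion_explained) from ordination.txt contents.

    coords maps each sample ID to its list of axis values.
    """
    lines = text.splitlines()
    proportions, coords = [], {}
    for i, line in enumerate(lines):
        fields = line.split("\t")
        if fields[0] == "Proportion explained":
            # The values sit on the line after the section header.
            proportions = [float(v) for v in lines[i + 1].split("\t")]
        elif fields[0] == "Site":
            # Header reads "Site<TAB>n_samples<TAB>n_axes".
            n_samples = int(fields[1])
            for row in lines[i + 1 : i + 1 + n_samples]:
                sample_id, *values = row.split("\t")
                coords[sample_id] = [float(v) for v in values]
    return coords, proportions


# Invented three-sample example in the sectioned layout.
example = (
    "Eigvals\t2\n0.6\t0.2\n\n"
    "Proportion explained\t2\n0.75\t0.25\n\n"
    "Species\t0\t0\n\n"
    "Site\t3\t2\n"
    "S1\t0.1\t-0.2\nS2\t-0.3\t0.4\nS3\t0.2\t-0.2\n\n"
    "Biplot\t0\t0\n\n"
    "Site constraints\t0\t0\n"
)
coords, proportions = parse_ordination(example)
print(len(coords), proportions)
```

The proportions make the axis labels more informative, for example `ax.set_xlabel(f"PC1 ({proportions[0]:.1%} of variance)")`.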

In [None]:
# --- Load and Merge Coordinates with Metadata ---

print("--- 1. Loading PCoA coordinates (ordination.txt) ---")

# Define paths
PCOA_COORDS_FILE = f"{PCOA_UNIFRAC_EXPORT_DIR}/ordination.txt"
METADATA_TSV = "../data/metadata.tsv" # Defined in Cell 3

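# Load the ordination.txt file
# The file is split into sections: "Eigvals", "Proportion explained",
# "Species", "Site", "Biplot" and "Site constraints".
# The sample coordinates live in the "Site" section, whose header line reads
# "Site<TAB>n_samples<TAB>n_axes". Locating that line is safer than hard-coding
# header/footer line counts, which can silently drop samples.
with open(PCOA_COORDS_FILE) as fh:
    ordination_lines = fh.read().splitlines()

site_start = next(
    i for i, line in enumerate(ordination_lines) if line.startswith("Site\t")
)
n_samples = int(ordination_lines[site_start].split('\t')[1])

# - It is tab-separated (sep='\t')
# - The sample rows have no header of their own (header=None)
# - The sample ID is the first column (index_col=0)
df_pcoa_coords = pd.read_csv(
    PCOA_COORDS_FILE,
    sep='\t',
    header=None,
    index_col=0,
    skiprows=site_start + 1,
    nrows=n_samples
)

# We only care about the first 3 axes (PC1, PC2, PC3) for plotting
df_pcoa_coords = df_pcoa_coords.iloc[:, 0:3]
df_pcoa_coords.columns = ['PC1', 'PC2', 'PC3']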

print("PCoA Coordinates (first 5 rows):")
print(df_pcoa_coords.head())

print("\n--- 2. Loading Metadata ---")
# Load the metadata, setting the 'sample-id' as the index for merging
df_metadata = pd.read_csv(METADATA_TSV, sep='\t', index_col='sample-id')
print("Metadata (first 5 rows):")
print(df_metadata.head())

print("\n--- 3. Merging Coordinates and Metadata ---")
# 'pd.merge' joins the two tables using their indexes
df_pcoa_merged = pd.merge(
    df_pcoa_coords,
    df_metadata,
    left_index=True,
    right_index=True
)

print("\nFinal Merged DataFrame (first 5 rows):")
print(df_pcoa_merged.head())
print(f"\n--- Merge successful. Final table has {df_pcoa_merged.shape[0]} samples. ---")

### 3.2 Create Final PCoA Plot with Seaborn

The merge was successful. We now have our final "plotting-ready" DataFrame: `df_pcoa_merged`.

This table contains 251 samples, which represents the samples that remained after both contaminant filtering (Notebook 03) and rarefaction at 2957 reads (Notebook 04).
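
Since `pd.merge` performs an inner join by default, any sample present in only one of the two tables is dropped without warning. A quick way to audit the sample count is an outer join with `indicator=True`. The sketch below uses invented stand-in tables.

```python
import pandas as pd

# Invented stand-ins for the coordinate and metadata tables.
coords = pd.DataFrame({"PC1": [0.1, -0.3, 0.2]}, index=["S1", "S2", "S3"])
metadata = pd.DataFrame(
    {"gastrointest_disord": ["none", "Crohn's disease", "none"]},
    index=["S2", "S3", "S4"],
)

# An outer join with indicator=True labels each row by where it was found.
audit = pd.merge(
    coords, metadata, left_index=True, right_index=True, how="outer", indicator=True
)
counts = audit["_merge"].value_counts()
print(counts)

only_in_coords = audit.index[audit["_merge"] == "left_only"].tolist()
only_in_metadata = audit.index[audit["_merge"] == "right_only"].tolist()
```

Samples listed under `right_only` are expected here, since metadata covers samples removed by filtering and rarefaction. Any `left_only` sample would point to an ID mismatch worth investigating.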

We will now use Seaborn to create the final scatter plot. We will plot PC1 vs. PC2 and color the points by the `gastrointest_disord` column.

In [None]:
# --- Plot PCoA with Seaborn ---

print("--- 1. Generating final PCoA plot ---")

# Define the figure size
plt.figure(figsize=(10, 8))

# Define the groups we care about.
# The raw 'gastrointest_disord' metadata column holds plain labels such as
# "Crohn's disease" and "none" (the "(n=...)" suffixes appear only in the
# exported statistics tables, not in the metadata).
# We relabel "none" so the legend reads clearly, then drop any other groups.
df_pcoa_merged['Group'] = df_pcoa_merged['gastrointest_disord'].replace({
    "Crohn's disease": "Crohn's disease",
    "none": "none (Healthy Control)"
})
# Keep only these two groups
plot_df = df_pcoa_merged[
    df_pcoa_merged['Group'].isin(["Crohn's disease", "none (Healthy Control)"])
].copy()


# Create the scatter plot
# x='PC1', y='PC2'
# hue='Group' (this colors the dots)
# s=100 (sets the size of the dots)
ax = sns.scatterplot(
    data=plot_df,
    x='PC1',
    y='PC2',
    hue='Group',
    palette={'Crohn\'s disease': '#E63946', 'none (Healthy Control)': '#457B9D'}, # Red vs Blue
    s=100,
    alpha=0.8
)

# Set title and labels
ax.set_title(
    "Beta Diversity (Unweighted UniFrac PCoA)\nCrohn's Disease vs. Healthy",
    fontsize=16,
    fontweight='bold'
)
ax.set_xlabel("PC1", fontsize=12)
ax.set_ylabel("PC2", fontsize=12)

# Move the legend
plt.legend(title='Sample Group', loc='upper right', bbox_to_anchor=(1.4, 1))

# Define the path to save the figure
PCOA_PLOT_PATH = "../results/10_exported_results/pcoa_unweighted_unifrac.png"

# Save the figure
plt.savefig(PCOA_PLOT_PATH, dpi=300, bbox_inches='tight')

print(f"\n--- 2. Plot saved to: {PCOA_PLOT_PATH} ---")

# Display the plot in the notebook
plt.show()

### 4. Visualize Alpha Diversity (Boxplots)

We have statistically confirmed *that* phylogenetic alpha diversity (Faith's PD) is different (q-value = 0.0006), but we haven't visualized *how* it is different.

We will now create boxplots for both Faith's PD and Shannon's Index using Seaborn. This will provide a clear visual comparison of the diversity levels between the Crohn's and Healthy groups, confirming our p-value findings.

In [None]:
# --- (Cell 20) Export Faith's PD Raw Vector ---

print("--- 1. Exporting Faith's PD raw vector data ---")

# Define input path (the vector, not the significance visualization)
ALPHA_FAITH_PD_QZA = "../results/09_core_metrics/faith_pd_vector.qza"
# Define output path (a *new directory*)
ALPHA_FAITH_PD_VECTOR_DIR = f"{EXPORT_DIR_BASE}/alpha_faith_pd_vector"

# Run the export command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime tools export \
    --input-path {ALPHA_FAITH_PD_QZA} \
    --output-path {ALPHA_FAITH_PD_VECTOR_DIR}

print(f"\n--- 2. Export complete. Listing contents of {ALPHA_FAITH_PD_VECTOR_DIR} ---")
!ls -lh {ALPHA_FAITH_PD_VECTOR_DIR}

print("\nWe are looking for the 'alpha-diversity.tsv' file.")

### 4.1 Load, Merge, and Plot Faith's PD Boxplot

The export was successful, and we now have the raw Faith's PD scores in `alpha-diversity.tsv`.

We will now load this file, merge it with our sample metadata, and use Seaborn to create a boxplot. This will visually confirm our statistical finding (q-value = 0.0006) and show us the *direction* of the difference (i.e., which group has higher or lower diversity).
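
Alongside the boxplot, a small per-group summary states the direction of the difference as numbers. The sketch below uses invented scores; in the notebook, the same `groupby` call would run on the merged Faith's PD table.

```python
import pandas as pd

# Invented values for illustration only.
scores = pd.DataFrame(
    {
        "Group": ["Crohn's disease"] * 3 + ["none (Healthy Control)"] * 3,
        "faith_pd": [4.0, 5.0, 6.0, 8.0, 9.0, 10.0],
    }
)

# Sample count and median per group; the median matches the boxplot's center line.
summary = scores.groupby("Group")["faith_pd"].agg(["count", "median"])
print(summary)

lower_group = summary["median"].idxmin()
```

Reporting the medians next to the q-value lets a reader check the claimed direction without reading it off the figure.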

In [None]:
# --- Load, Merge, and Plot Faith's PD ---

print("--- 1. Loading and merging Faith's PD data ---")

# --- Load Scores ---
ALPHA_FAITH_PD_VECTOR_FILE = f"{ALPHA_FAITH_PD_VECTOR_DIR}/alpha-diversity.tsv"
# We load the scores, setting the first column (sample ID) as the index
df_faith_pd = pd.read_csv(ALPHA_FAITH_PD_VECTOR_FILE, sep='\t', index_col=0)
# The column name from QIIME 2 is 'faith_pd'
print("Faith's PD Scores (first 5 rows):")
print(df_faith_pd.head())

# --- Load Metadata ---
METADATA_TSV = "../data/metadata.tsv"
df_metadata = pd.read_csv(METADATA_TSV, sep='\t', index_col='sample-id')

# --- Merge ---
df_faith_pd_merged = pd.merge(
    df_faith_pd,
    df_metadata,
    left_index=True,
    right_index=True
)

# --- Clean Groups (same as PCoA plot) ---
df_faith_pd_merged['Group'] = df_faith_pd_merged['gastrointest_disord'].replace({
    "Crohn's disease": "Crohn's disease",
    "none": "none (Healthy Control)"
})
plot_df_alpha_faith = df_faith_pd_merged[
    df_faith_pd_merged['Group'].isin(["Crohn's disease", "none (Healthy Control)"])
].copy()

print(f"\n--- 2. Data merged. Plotting {plot_df_alpha_faith.shape[0]} samples. ---")

# --- 3. Plotting Boxplot ---
plt.figure(figsize=(7, 7)) # 7-inch wide, 7-inch tall

# Create the boxplot
ax = sns.boxplot(
    data=plot_df_alpha_faith,
    x='Group',
    y='faith_pd',
    palette={'Crohn\'s disease': '#E63946', 'none (Healthy Control)': '#457B9D'} # Red vs Blue
)

# Add a title and labels
ax.set_title(
    "Alpha Diversity (Faith's PD)\nCrohn's Disease vs. Healthy",
    fontsize=16,
    fontweight='bold'
)
ax.set_xlabel("Sample Group", fontsize=12)
ax.set_ylabel("Faith's Phylogenetic Diversity", fontsize=12)

# Define the path to save the figure
ALPHA_PLOT_PATH_FAITH = "../results/10_exported_results/alpha_diversity_faith_pd.png"

# Save the figure
plt.savefig(ALPHA_PLOT_PATH_FAITH, dpi=300, bbox_inches='tight')

print(f"\n--- 3. Plot saved to: {ALPHA_PLOT_PATH_FAITH} ---")

# Display the plot in the notebook
plt.show()

### 4.2 Visualize Shannon's Index (Boxplot)

The Faith's PD boxplot shows lower phylogenetic diversity in the Crohn's disease group, in line with the significant test result.

Now, we will create the same boxplot for **Shannon's Index**. Based on our earlier statistical result (q-value = 0.28, which is *not* significant), we expect to see that the two boxes are at a very similar level, visually confirming the lack of a statistical difference.

In [None]:
# --- Export Shannon's Index Raw Vector ---

print("--- 1. Exporting Shannon's Index raw vector data ---")

# Define input path
ALPHA_SHANNON_QZA = "../results/09_core_metrics/shannon_vector.qza"
# Define output path (a *new directory*)
ALPHA_SHANNON_VECTOR_DIR = f"{EXPORT_DIR_BASE}/alpha_shannon_vector"

# Run the export command
!docker run --rm -v $(pwd)/..:/data -w /data/notebooks \
  qiime2/core:latest \
  qiime tools export \
    --input-path {ALPHA_SHANNON_QZA} \
    --output-path {ALPHA_SHANNON_VECTOR_DIR}

print(f"\n--- 2. Export complete. Listing contents of {ALPHA_SHANNON_VECTOR_DIR} ---")
!ls -lh {ALPHA_SHANNON_VECTOR_DIR}

print("\nWe are looking for the 'alpha-diversity.tsv' file.")

In [None]:
# --- Load, Merge, and Plot Shannon's Index ---

print("--- 1. Loading and merging Shannon's Index data ---")

# --- Load Scores ---
ALPHA_SHANNON_VECTOR_FILE = f"{ALPHA_SHANNON_VECTOR_DIR}/alpha-diversity.tsv"
df_shannon = pd.read_csv(ALPHA_SHANNON_VECTOR_FILE, sep='\t', index_col=0)
print("Shannon's Index Scores (first 5 rows):")
print(df_shannon.head())

# --- Load Metadata ---
METADATA_TSV = "../data/metadata.tsv"
df_metadata = pd.read_csv(METADATA_TSV, sep='\t', index_col='sample-id')

# --- Merge ---
df_shannon_merged = pd.merge(
    df_shannon,
    df_metadata,
    left_index=True,
    right_index=True
)

# --- Clean Groups (same as PCoA plot) ---
df_shannon_merged['Group'] = df_shannon_merged['gastrointest_disord'].replace({
    "Crohn's disease": "Crohn's disease",
    "none": "none (Healthy Control)"
})
plot_df_alpha_shannon = df_shannon_merged[
    df_shannon_merged['Group'].isin(["Crohn's disease", "none (Healthy Control)"])
].copy()

print(f"\n--- 2. Data merged. Plotting {plot_df_alpha_shannon.shape[0]} samples. ---")

# --- 3. Plotting Boxplot ---
plt.figure(figsize=(7, 7))

# Create the boxplot
ax = sns.boxplot(
    data=plot_df_alpha_shannon,
    x='Group',
    y='shannon_entropy', # Use the 'shannon_entropy' column
    palette={'Crohn\'s disease': '#E63946', 'none (Healthy Control)': '#457B9D'}
)

# Add a title and labels
ax.set_title(
    "Alpha Diversity (Shannon's Index)\nCrohn's Disease vs. Healthy",
    fontsize=16,
    fontweight='bold'
)
ax.set_xlabel("Sample Group", fontsize=12)
ax.set_ylabel("Shannon's Diversity Index (shannon_entropy)", fontsize=12)

# Define the path to save the figure
ALPHA_PLOT_PATH_SHANNON = "../results/10_exported_results/alpha_diversity_shannon.png"

# Save the figure
plt.savefig(ALPHA_PLOT_PATH_SHANNON, dpi=300, bbox_inches='tight')

print(f"\n--- 3. Plot saved to: {ALPHA_PLOT_PATH_SHANNON} ---")

# Display the plot in the notebook
plt.show()

### 5. Final Scientific Conclusions

This notebook successfully extracted, interpreted, and visualized the key statistical results of our project. We have now answered our primary research questions with robust statistical and visual evidence.

**1. Is there a difference in Alpha Diversity (within-sample diversity)?**
* **YES (Phylogenetically).** We found a **statistically significant** difference (q-value = 0.0006) when using **Faith's PD**.
* **NO (Non-phylogenetically).** We found **no significant** difference (q-value = 0.28) when using **Shannon's Index**.
* **Visual Confirmation:** Our boxplots are consistent with these results. The **Faith's PD Boxplot** (Cell 22) shows a clear *decrease* in phylogenetic diversity in the Crohn's disease group. In contrast, the **Shannon Boxplot** (Cell 25) shows both groups at nearly identical levels.
* **Interpretation:** This is the project's key finding. The dysbiosis in Crohn's disease does not appear as a simple loss of richness and evenness (Shannon), but as a significant loss of phylogenetic diversity (Faith's PD), consistent with the loss of *entire evolutionary branches*.

**2. Is there a difference in Beta Diversity (between-sample composition)?**
* **YES.** We found a **highly significant** difference (PERMANOVA p-value = 0.0007) when using **Unweighted UniFrac**.
* **Visual Confirmation:** Our **PCoA Plot** (Cell 18) is consistent with this separation. The **Healthy (none)** samples cluster tightly together, suggesting a stable, "normal" microbiome. The **Crohn's Disease** samples are highly dispersed and largely separate from the healthy cluster, suggesting a more variable, dysbiotic state. Because PERMANOVA also responds to differences in within-group dispersion, part of this signal may reflect the wider spread of the Crohn's samples.

**Overall Project Conclusion:**
The data strongly supports the hypothesis that Crohn's Disease is associated with a significant **phylogenetic dysbiosis**. This is characterized by (1) a loss of evolutionary (but not numerical) alpha diversity, and (2) a major, statistically significant shift in overall community composition.