# 1. Taxonomic Profiling and Data Wrangling

## Objective
The goal of this notebook is to process the raw output reports generated by `Kraken2` for all 120 samples (60 T2D / 60 Control). We aim to consolidate the individual read counts into a single, comprehensive Taxa x Samples matrix (Abundance Matrix), which is the necessary input for statistical analysis and visualization of gut dysbiosis.

## Methodology
The process involves reading the 120 individual `.report` files, keeping only species-level rows (rank code `S`) so that reads are not counted at several taxonomic ranks at once, and joining the per-sample results into one table with Pandas.
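
To make the rank filter concrete, here is a minimal sketch on a toy report with made-up counts (not data from this study). In a Kraken2 report the second column is the clade-level count, so a genus row already includes the reads of its species, and summing across ranks would count those reads twice. The column names below are illustrative and differ from the ones used in the notebook.

```python
import io
import pandas as pd

# Toy Kraken2-style report: tab-separated, values invented for illustration only.
# Genus clade count (50) = 5 direct reads + 30 + 15 reads from its two species.
toy_report = (
    "50.00\t50\t5\tG\t816\t    Bacteroides\n"
    "30.00\t30\t30\tS\t820\t      Bacteroides uniformis\n"
    "15.00\t15\t15\tS\t818\t      Bacteroides thetaiotaomicron\n"
)

# Illustrative column names (assumption, chosen here for readability)
cols = ['percent', 'clade_reads', 'direct_reads', 'rank', 'taxid', 'name']
toy = pd.read_csv(io.StringIO(toy_report), sep='\t', header=None, names=cols)
toy['name'] = toy['name'].str.strip()  # names are indented by taxonomic depth

# Keeping a single rank counts each read at most once
species_total = toy.loc[toy['rank'] == 'S', 'clade_reads'].sum()

# Summing over every rank counts the species reads again inside the genus row
all_ranks_total = toy['clade_reads'].sum()

print(species_total, all_ranks_total)
```

The species-only total is smaller than the genus clade count because reads assigned directly to the genus have no species label; they are dropped by the `S` filter.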

In [None]:
#  Imports and Configuration
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import io

# Configuration: Path to Kraken2 reports directory
REPORTS_DIR = 'results/kraken'

# Standard Kraken2 report column names
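# Columns: % of reads in clade | Reads in clade rooted at this taxon | Reads assigned directly to this taxon | Rank code | NCBI TaxID | Scientific name
# Note: 'reads' below is the clade-level count; 'reads_taxo' is the direct count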
KRAKEN_COLS = ['percent', 'reads', 'reads_taxo', 'rank', 'taxid', 'name']

print("✅ Cell 1 Complete: Libraries imported and configuration set.")

In [None]:
#  Define Data Parsing Function
def load_kraken_report(file_path):
    """
    Reads a single Kraken2 report file, filters for Species level,
    and returns a single-column DataFrame of read counts indexed by species name
    (or None if the file cannot be parsed).
    """
    try:
        # 1. Read the report file (Tab-separated)
        # We use latin-1 encoding to handle special characters in scientific names
        df = pd.read_csv(file_path, sep='\t', header=None, names=KRAKEN_COLS, encoding='latin-1')
        
        # 2. Extract Sample ID from the filename (e.g., 'SRR123.report' -> 'SRR123')
        sample_id = os.path.basename(file_path).replace('.report', '')
        
        # 3. Filter: Keep ONLY Species level ('S')
        # This prevents double-counting reads at different taxonomic levels
        df_filtered = df[df['rank'] == 'S'].copy()
        
        # 4. Data Cleaning: Strip whitespace from names
        df_filtered['name'] = df_filtered['name'].str.strip()
        
        # 5. Select relevant columns and rename 'reads' to the Sample ID
        # 'reads' is the clade-level count: reads assigned to the species or to any taxon below it (e.g., strains)
        df_final = df_filtered[['name', 'reads']].rename(columns={'reads': sample_id})
        
        # 6. Set scientific name as index for easy merging later
        return df_final.set_index('name')
        
    except Exception as e:
        print(f"Error processing file {file_path}: {e}")
        return None

print("✅ Cell 2 Complete: Parsing function defined successfully.")

## 2. Data Loading and Integration

In this step, we locate all Kraken2 report files generated by the upstream pipeline. We iterate through each file, extracting species-level abundance data using the parsing function defined above. Finally, we merge these individual datasets into a single comprehensive abundance matrix ($N \times M$), where $N$ represents the number of unique species identified and $M$ represents the number of samples (120).

Missing values (species present in some samples but not others) are filled with `0`.
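
As a minimal sketch of the merge described above, the example below joins two toy per-sample tables (hypothetical sample IDs and counts). The `astype(int)` call is an optional addition not used in the notebook: `fillna(0)` alone leaves the counts as floats.

```python
import pandas as pd

# Two toy per-sample tables; sample IDs and counts are hypothetical
s1 = pd.DataFrame({'sample_A': [10, 5]}, index=['Species X', 'Species Y'])
s2 = pd.DataFrame({'sample_B': [7, 3]}, index=['Species Y', 'Species Z'])

# Outer join on the species index: every species seen in any sample gets a row.
# Species absent from a sample appear as NaN, which is then replaced by 0.
merged = pd.concat([s1, s2], axis=1, join='outer').fillna(0).astype(int)

print(merged)
```

Note that `pd.concat` along `axis=1` needs a unique index in each table, so a report that lists the same species name twice would have to be aggregated first.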

In [None]:
#  Load Reports and Build Abundance Matrix

# 1. Locate all report files
report_files = glob.glob(os.path.join(REPORTS_DIR, '*.report'))
print(f"📂 Found {len(report_files)} Kraken2 report files.")

# 2. Iterate and Load
print("⏳ Loading and merging files... Please wait.")

all_dataframes = []
for file in report_files:
    # Use the function defined in Cell 2
    df = load_kraken_report(file)
    
    if df is not None:
        all_dataframes.append(df)

# 3. Concatenate into a Single Matrix
if all_dataframes:
    # axis=1: Merge by columns (samples side-by-side)
    # join='outer': Keep all species found in any sample
    # fillna(0): If a species is missing in a sample, abundance is 0
    final_tax_table = pd.concat(all_dataframes, axis=1, join='outer').fillna(0)
    
    print("\n=== Matrix Generation Complete ===")
    print("✅ Successfully created abundance matrix.")
    print(f"📊 Dimensions: {final_tax_table.shape[0]} Species (Rows) x {final_tax_table.shape[1]} Samples (Columns)")
    
    # Preview the first few rows
    display(final_tax_table.head())

else:
    print("❌ Error: No dataframes loaded. Please check the file path.")

## 3. Metadata Integration (Automatic Fetch)

To perform comparative analysis, we need to map each Run ID (e.g., `SRR...`) to its experimental group (Type 2 Diabetes vs. Healthy Control).

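Since the local metadata file is incomplete, we fetch the official study design directly from the **European Nucleotide Archive (ENA)** API for project **PRJNA422434**. We parse the sample aliases to classify each run:
* **Case:** Samples whose alias contains 'T2D'.
* **Control:** All other samples (assumed healthy). Any alias without the 'T2D' tag, including a missing one, falls into this group, so the resulting counts should be checked against the expected 60/60 design.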
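
The classification rule can also be written without an explicit loop. The sketch below is one possible vectorized version on hypothetical rows; the run IDs and alias strings are invented, and real ENA aliases may look different.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata rows (assumption: real aliases may follow another pattern)
meta = pd.DataFrame({
    'run_accession': ['RUN001', 'RUN002', 'RUN003'],
    'sample_alias': ['T2D_01', 'CTRL_01', None],
})

# Treat a missing alias as an empty string so the substring test is well defined
alias = meta['sample_alias'].fillna('')

# Case-sensitive substring match on 'T2D'; everything else is labeled Control
meta['Group'] = np.where(alias.str.contains('T2D', regex=False), 'T2D', 'Control')

# Run ID -> Group lookup
run_to_group = dict(zip(meta['run_accession'], meta['Group']))
print(run_to_group)
```

As with the rule above, a run with a missing alias ends up in the Control group, which is worth checking by hand before any statistics are run.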

In [None]:
#  Fetch Metadata and Map Groups (Fixed Logic)

# 1. Define Project ID
PROJECT_ID = "PRJNA422434"
print(f"🌍 Fetching official metadata for {PROJECT_ID} from ENA API...")

# 2. Construct API URL
url = f"https://www.ebi.ac.uk/ena/portal/api/filereport?accession={PROJECT_ID}&result=read_run&fields=run_accession,sample_alias,sample_title&format=tsv"

try:
    # 3. Download and Parse Metadata
    # A timeout keeps the cell from hanging indefinitely if the API does not respond
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    meta_online = pd.read_csv(io.StringIO(response.content.decode('utf-8')), sep='\t')
    
    print("✅ Metadata downloaded successfully.")
    
    # 4. Create Mapping Dictionary (Run ID -> Group)
    run_to_group = {}
    
    for index, row in meta_online.iterrows():
        run_id = row['run_accession']
        full_info = str(row['sample_alias'])
        
        # --- The Fixed Logic ---
        # If 'T2D' is in the alias -> Case
        # Everything else -> Control
        if 'T2D' in full_info:
            run_to_group[run_id] = 'T2D'
        else:
            run_to_group[run_id] = 'Control'
            
    # 5. Apply Mapping to our Matrix Columns
    # We map only the samples that exist in our final_tax_table
    metadata = pd.DataFrame(index=final_tax_table.columns)
    metadata['Group'] = metadata.index.map(run_to_group)
    
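    # 6. Verify Distribution
    # dropna=False also reports samples whose Run ID was not found in the ENA metadata
    print("\n=== Final Sample Distribution ===")
    print(metadata['Group'].value_counts(dropna=False))
    
    # Check that both groups are present and every sample was mapped
    t2d_count = (metadata['Group'] == 'T2D').sum()
    control_count = (metadata['Group'] == 'Control').sum()
    unmapped_count = metadata['Group'].isna().sum()
    
    if t2d_count > 0 and control_count > 0 and unmapped_count == 0:
        print("\n✅ Metadata mapping successful: both groups are present and all samples are mapped.")
    else:
        print("\n⚠️ Warning: A group is empty or some samples are unmapped. Please check the output.")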

except Exception as e:
    print(f"❌ Error fetching metadata: {e}")

In [None]:
#  Visualization and Saving Results

# 1. Calculate Top 10 Most Abundant Species
# Sum reads across all samples, sort descending, and take top 10
top_10_species = final_tax_table.sum(axis=1).sort_values(ascending=False).head(10)

# 2. Plotting
plt.figure(figsize=(12, 8))
sns.barplot(x=top_10_species.values, y=top_10_species.index, palette="viridis")

# Styling the plot
plt.title("Top 10 Most Abundant Species (Gut Microbiome - 120 Samples)", fontsize=16)
plt.xlabel("Total read count", fontsize=14)
plt.ylabel("Species", fontsize=14)

# 3. Save the Plot
plt.savefig("top10_species_abundance.png", dpi=300, bbox_inches='tight')
print("✅ Plot saved as 'top10_species_abundance.png'")

# 4. Save the Data for downstream analysis (e.g., in R)
final_tax_table.to_csv("abundance_matrix_species.csv")
metadata.to_csv("metadata_mapped.csv")
print("✅ Data saved: 'abundance_matrix_species.csv' and 'metadata_mapped.csv'")

# Show the plot
plt.show()

# 4. Conclusion & Next Steps

## Scientific Observations based on Top 10 Taxa
1.  **Dominance of *Segatella copri* (formerly *Prevotella copri*):** The abundance plot shows that *Segatella copri* has the highest total read count across the cohort. It has been associated with insulin resistance and branched-chain amino acid biosynthesis, which makes it a candidate worth testing as a T2D biomarker. The plot pools both groups and uses raw read counts, so it does not by itself show a difference between T2D and Control samples, and deeply sequenced samples weigh more in the ranking.
2.  **Bacteroidetes Prevalence:** The top ranks are heavily populated by *Phocaeicola* and *Bacteroides* species (*P. vulgatus*, *B. uniformis*), indicating that the Bacteroidetes phylum (Bacteroidota in current nomenclature) accounts for a large share of the classified reads in this cohort.
3.  **Presence of Beneficial Taxa:** *Faecalibacterium prausnitzii*, a key butyrate producer reported to be depleted in T2D, appears in the top 10. Future statistical analysis (differential abundance) will determine if its levels are significantly lower in the Case group compared to Controls.
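
Before any group comparison, raw counts are usually converted to relative abundance so that sequencing depth does not dominate. The sketch below shows one way this could be done on a toy matrix; all values, sample IDs, and group labels are hypothetical, and the group means are a descriptive summary, not a statistical test.

```python
import pandas as pd

# Toy species x samples count matrix (hypothetical values)
counts = pd.DataFrame(
    {'s1': [80, 20], 's2': [30, 70], 's3': [50, 50], 's4': [10, 90]},
    index=['Species X', 'Species Y'],
)

# Hypothetical group label for each sample
groups = pd.Series({'s1': 'T2D', 's2': 'T2D', 's3': 'Control', 's4': 'Control'})

# Per-sample relative abundance: divide each column by its total (columns sum to 1)
rel_abund = counts.div(counts.sum(axis=0), axis=1)

# Mean relative abundance per group: transpose so samples are rows, group, transpose back
group_means = rel_abund.T.groupby(groups).mean().T

print(group_means)
```

A formal differential abundance analysis would add a statistical test and multiple-testing correction on top of this kind of table.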

## Future Directions (Functional Analysis)
Having established the taxonomic landscape ("Who is there"), we observe a mix of taxa previously linked to metabolic disease (*S. copri*, formerly *P. copri*) and beneficial bacteria (*F. prausnitzii*). The next step is to investigate the metabolic implications:
* *Functional Profiling:* Do T2D patients show a reduction in genes related to **Butyrate Production**?
* *Pathway Analysis:* Is there an enrichment of **Inflammation-related pathways** in samples dominated by *Segatella*?