# Equine Microbiome Reporter Tutorial

This tutorial will guide you through using the Equine Microbiome Reporter to generate professional PDF reports from 16S rRNA sequencing data.

## Table of Contents
1. [Setup and Installation](#setup)
2. [Understanding the Data Format](#data-format)
3. [Basic PDF Generation](#basic-pdf)
4. [Advanced Polish Laboratory Reports](#advanced-pdf)
5. [Batch Processing](#batch-processing)
6. [Customization Options](#customization)
7. [Troubleshooting](#troubleshooting)

<a id='setup'></a>
## 1. Setup and Installation

First, let's ensure all required dependencies are installed and import the necessary modules.

In [1]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import sys
import os

# Add the project directory to Python path
sys.path.append(os.path.dirname(os.path.abspath('')))

# Import our custom modules
from pdf_generator import MicrobiomeReportGenerator
from advanced_pdf_generator import AdvancedMicrobiomeReportGenerator
from batch_processor import BatchReportProcessor

print("All modules imported successfully!")

All modules imported successfully!


In [2]:
# Create necessary directories
directories = ['data', 'reports', 'temp']
for directory in directories:
    Path(directory).mkdir(exist_ok=True)
    print(f"Created directory: {directory}")

Created directory: data
Created directory: reports
Created directory: temp


<a id='data-format'></a>
## 2. Understanding the Data Format

The microbiome data should be in CSV format with specific columns. Let's examine the real data file from the project to understand the structure.
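
As a rough illustration of that layout, the sketch below parses a three-row table shaped like the real file: a `species` column, a count column per barcode, and taxonomy columns. The rows reuse species and counts from the barcode59 sample in this tutorial. `check_layout` is a hypothetical helper, not part of the project, and it uses only the standard library so it runs without the data file.

```python
import csv
import io

# Three rows in the layout the tutorial's data file uses.
# Species, counts, and phyla come from the barcode59 sample in this tutorial.
SAMPLE_CSV = """species,barcode59,phylum,genus
Streptomyces albidoflavus,27,Actinomycetota,Streptomyces
Acinetobacter lanii,46,Pseudomonadota,Acinetobacter
Streptococcus equinus,10,Bacillota,Streptococcus
"""

def check_layout(csv_text):
    """Hypothetical helper: return (barcode columns, missing taxonomy columns)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    barcodes = [c for c in columns if c.startswith("barcode")]
    missing = [c for c in ("species", "phylum", "genus") if c not in columns]
    return barcodes, missing

barcodes, missing = check_layout(SAMPLE_CSV)
print("Barcode columns:", barcodes)
print("Missing columns:", missing)
```

Each barcode column holds the read counts for one sample, so a file with many samples simply has many `barcode…` columns next to the shared taxonomy columns.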

In [3]:
# Load the real microbiome data
real_data_file = 'data/25_04_23 bact.csv'
df = pd.read_csv(real_data_file)

# Display basic information about the data
print(f"Data shape: {df.shape}")
print(f"\nColumns in the dataset:")
print(f"Species column: {df.columns[0]}")
print(f"Barcode columns: {[col for col in df.columns if 'barcode' in col][:10]}...")
print(f"Taxonomy columns: {[col for col in df.columns if col in ['phylum', 'genus', 'family', 'class', 'order']]}")

# Display first few rows
print("\nFirst 5 species in the dataset:")
df[['species', 'barcode59', 'phylum', 'genus']].head()

Data shape: (169, 47)

Columns in the dataset:
Species column: species
Barcode columns: ['barcode45', 'barcode46', 'barcode47', 'barcode48', 'barcode50', 'barcode51', 'barcode52', 'barcode53', 'barcode54', 'barcode55']...
Taxonomy columns: ['phylum', 'class', 'order', 'family', 'genus']

First 5 species in the dataset:


| | species | barcode59 | phylum | genus |
|---|---|---|---|---|
| 0 | Streptomyces albidoflavus | 27 | Actinomycetota | Streptomyces |
| 1 | Streptomyces laculatispora | 0 | Actinomycetota | Streptomyces |
| 2 | Streptomyces sp. NBC_01685 | 16 | Actinomycetota | Streptomyces |
| 3 | Cellulomonas chengniuliangii | 0 | Actinomycetota | Cellulomonas |
| 4 | Oerskovia sp. KBS0722 | 0 | Actinomycetota | Oerskovia |


In [4]:
# Let's analyze barcode59 specifically (Montana's sample)
barcode_col = 'barcode59'
print(f"Analyzing {barcode_col} data:")

# Filter for species with non-zero counts
species_with_counts = df[df[barcode_col] > 0].copy()
print(f"\nNumber of species detected in {barcode_col}: {len(species_with_counts)}")

# Calculate total count
total_count = species_with_counts[barcode_col].sum()
print(f"Total bacterial count in {barcode_col}: {total_count}")

# Calculate percentages
species_with_counts['percentage'] = (species_with_counts[barcode_col] / total_count * 100).round(2)

# Show top 10 species
print(f"\nTop 10 species in {barcode_col}:")
top_species = species_with_counts.nlargest(10, 'percentage')[['species', barcode_col, 'percentage', 'phylum']]
print(top_species.to_string(index=False))

Analyzing barcode59 data:

Number of species detected in barcode59: 17
Total bacterial count in barcode59: 184

Top 10 species in barcode59:
                     species  barcode59  percentage         phylum
         Acinetobacter lanii         46       25.00 Pseudomonadota
   Streptomyces albidoflavus         27       14.67 Actinomycetota
  Streptomyces sp. NBC_01685         16        8.70 Actinomycetota
     Glutamicibacter sp. M10         14        7.61 Actinomycetota
     Lysinibacillus sp. 2017         14        7.61      Bacillota
        Arthrobacter citreus         11        5.98 Actinomycetota
Solibacillus sp. FSL H8-0523         10        5.43      Bacillota
       Streptococcus equinus         10        5.43      Bacillota
        Pseudomonas saxonica          8        4.35 Pseudomonadota
      Acinetobacter wanghuae          7        3.80 Pseudomonadota


### Exploring the Data Structure

Let's examine the key components of our microbiome data:

In [5]:
# Analyze phylum distribution for barcode59
barcode_col = 'barcode59'

# Filter for non-zero counts
df_filtered = df[df[barcode_col] > 0].copy()

# Calculate phylum distribution
phylum_dist = df_filtered.groupby('phylum')[barcode_col].sum()
total_count = phylum_dist.sum()
phylum_pct = (phylum_dist / total_count * 100).round(2).sort_values(ascending=False)

print(f"Phylum distribution for {barcode_col}:")
for phylum, pct in phylum_pct.items():
    print(f"{phylum}: {pct}%")

# Visualize phylum distribution
plt.figure(figsize=(10, 6))
colors = ['#4CAF50', '#FF5722', '#00BCD4', '#9C27B0', '#00E5FF', '#FFC107', '#795548']
phylum_pct.head(7).plot(kind='bar', color=colors[:len(phylum_pct.head(7))])
plt.title(f'Phylum Distribution in {barcode_col} (Montana)')
plt.xlabel('Phylum')
plt.ylabel('Percentage (%)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Phylum distribution for barcode59:
Actinomycetota: 38.04%
Pseudomonadota: 34.24%
Bacillota: 24.46%
Bacteroidota: 2.17%
Gemmatimonadota: 1.09%




In [6]:
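# Analyze phylum distribution across every phylum in the file,
# including phyla with zero reads in this sample
phylum_dist = df.groupby('phylum')['barcode59'].sum()
# Compute the total here instead of relying on a variable left over from an earlier cell
phylum_pct = (phylum_dist / phylum_dist.sum() * 100).round(2)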

print("Phylum distribution:")
for phylum, pct in phylum_pct.items():
    print(f"{phylum}: {pct}%")

# Visualize phylum distribution
plt.figure(figsize=(8, 6))
phylum_pct.plot(kind='bar', color=['#4CAF50', '#FF5722', '#00BCD4', '#9C27B0', '#00E5FF'])
plt.title('Phylum Distribution in Microbiome Sample')
plt.xlabel('Phylum')
plt.ylabel('Percentage (%)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Phylum distribution:
Actinomycetota: 38.04%
Bacillota: 24.46%
Bacteroidota: 2.17%
Fibrobacterota: 0.0%
Gemmatimonadota: 1.09%
Mycoplasmatota: 0.0%
Pseudomonadota: 34.24%
Spirochaetota: 0.0%




<a id='basic-pdf'></a>
## 3. Basic PDF Generation

Let's start with generating a basic PDF report using the simple generator.

In [7]:
# Create a basic report generator using the real data
# (the path must include the data/ directory, where the file was loaded from earlier)
basic_generator = MicrobiomeReportGenerator('data/25_04_23 bact.csv', 'barcode59')

# Generate basic report
basic_generator.generate_report('reports/basic_montana_report.pdf')

print("Basic report generated: reports/basic_montana_report.pdf")

Basic report generated: reports/basic_montana_report.pdf


<a id='advanced-pdf'></a>
## 4. Advanced Polish Laboratory Reports

Now let's create a professional laboratory report in Polish format with patient information.

In [8]:
# Define patient information for Montana
patient_info = {
    'name': 'Montana',
    'species': 'Koń',
    'age': '20 lat',
    'sample_number': '506',
    'date_received': '07.05.2025 r.',
    'performed_by': 'Julia Kończak',
    'requested_by': 'Aleksandra Matusiak'
}

# Create advanced report generator with real data
advanced_generator = AdvancedMicrobiomeReportGenerator('data/25_04_23 bact.csv', 'barcode59')

# Generate advanced report
advanced_generator.generate_report('reports/montana_advanced_report.pdf', patient_info)

print("Advanced report generated: reports/montana_advanced_report.pdf")

Report generated successfully: reports/montana_advanced_report.pdf
Advanced report generated: reports/montana_advanced_report.pdf


### Understanding the Advanced Report Components

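The advanced report includes eight sections, from the patient-information header through to the parasite screening results. They are listed in full at the end of this section, after a look at the other samples in the data file.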

### Working with Multiple Barcodes

The real data file contains multiple barcode columns representing different samples. Let's explore how to generate reports for different samples:

In [9]:
# Explore all available barcodes in the real data
df_real = pd.read_csv('data/25_04_23 bact.csv')
barcode_columns = [col for col in df_real.columns if col.startswith('barcode')]

print(f"Available barcode columns: {len(barcode_columns)}")
print(f"Barcode columns: {barcode_columns}")

# Check which barcodes have data
barcodes_with_data = []
for barcode in barcode_columns:
    total = df_real[barcode].sum()
    if total > 0:
        barcodes_with_data.append((barcode, total))

print(f"\nBarcodes with data ({len(barcodes_with_data)}):")
for barcode, total in sorted(barcodes_with_data, key=lambda x: x[1], reverse=True)[:10]:
    print(f"  {barcode}: {total} total reads")

Available barcode columns: 30
Barcode columns: ['barcode45', 'barcode46', 'barcode47', 'barcode48', 'barcode50', 'barcode51', 'barcode52', 'barcode53', 'barcode54', 'barcode55', 'barcode56', 'barcode58', 'barcode59', 'barcode60', 'barcode61', 'barcode62', 'barcode63', 'barcode64', 'barcode66', 'barcode67', 'barcode68', 'barcode69', 'barcode70', 'barcode71', 'barcode72', 'barcode74', 'barcode75', 'barcode76', 'barcode77', 'barcode78']

Barcodes with data (29):
  barcode52: 606 total reads
  barcode78: 522 total reads
  barcode56: 520 total reads
  barcode63: 478 total reads
  barcode66: 462 total reads
  barcode71: 418 total reads
  barcode67: 391 total reads
  barcode61: 308 total reads
  barcode53: 285 total reads
  barcode69: 265 total reads
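
One way to turn that listing into a set of report jobs is sketched below. It assumes a minimum read threshold (`min_reads=200`, an arbitrary value chosen for illustration) and a `<barcode>_report.pdf` naming scheme. `plan_reports` is a hypothetical helper, not part of the project. The read totals are copied from the outputs shown in this tutorial.

```python
from pathlib import Path

# Total reads per barcode, copied from outputs shown in this tutorial.
read_totals = {
    "barcode52": 606,
    "barcode78": 522,
    "barcode56": 520,
    "barcode59": 184,
}

def plan_reports(totals, output_dir="reports", min_reads=200):
    """Hypothetical helper: map each sufficiently deep barcode to a PDF path.

    Barcodes are visited from most to fewest reads; those below
    `min_reads` (an assumed cutoff) are skipped.
    """
    plan = {}
    for barcode, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        if total >= min_reads:
            plan[barcode] = Path(output_dir) / f"{barcode}_report.pdf"
    return plan

plan = plan_reports(read_totals)
for barcode, path in plan.items():
    print(f"{barcode} -> {path.as_posix()}")
```

Each entry in `plan` could then be passed to `AdvancedMicrobiomeReportGenerator('data/25_04_23 bact.csv', barcode)` and `generate_report(path, patient_info)`, following the call pattern used for Montana's sample.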


In [10]:
# Display the components included in the advanced report
report_sections = [
    "1. Header with patient information",
    "2. Microbiome profile visualization",
    "3. Phylum distribution with reference ranges",
    "4. Dysbiosis index calculation",
    "5. Clinical interpretation in Polish",
    "6. Microscopic analysis results",
    "7. Biochemical analysis",
    "8. Parasite screening results"
]

print("Advanced Report Sections:")
for section in report_sections:
    print(f"  • {section}")

Advanced Report Sections:
  • 1. Header with patient information
  • 2. Microbiome profile visualization
  • 3. Phylum distribution with reference ranges
  • 4. Dysbiosis index calculation
  • 5. Clinical interpretation in Polish
  • 6. Microscopic analysis results
  • 7. Biochemical analysis
  • 8. Parasite screening results


<a id='batch-processing'></a>
## 5. Batch Processing

For processing multiple samples, we can use the batch processor.

In [11]:
# Create a manifest file for batch processing with real barcodes
# Let's process multiple barcodes from the same file
top_barcodes = ['barcode59', 'barcode72', 'barcode56']  # Based on the real data

manifest_data = pd.DataFrame({
    # Paths include the data/ directory so they resolve from the notebook's location
    'csv_file': ['data/25_04_23 bact.csv', 'data/25_04_23 bact.csv', 'data/25_04_23 bact.csv'],
    'barcode_column': top_barcodes,
    'patient_name': ['Montana', 'Thunder', 'Spirit'],
    'species': ['Koń', 'Koń', 'Koń'],
    'age': ['20 lat', '15 lat', '12 lat'],
    'sample_number': ['506', '507', '508'],
    'date_received': ['07.05.2025 r.', '08.05.2025 r.', '09.05.2025 r.'],
    'performed_by': ['Julia Kończak', 'Julia Kończak', 'Julia Kończak'],
    'requested_by': ['Aleksandra Matusiak', 'Dr. Smith', 'Dr. Johnson']
})

manifest_data.to_csv('manifest_real_data.csv', index=False)
print("Created manifest_real_data.csv for real data processing")
manifest_data

Created manifest_real_data.csv for real data processing


| | csv_file | barcode_column | patient_name | species | age | sample_number | date_received | performed_by | requested_by |
|---|---|---|---|---|---|---|---|---|---|
| 0 | data/25_04_23 bact.csv | barcode59 | Montana | Koń | 20 lat | 506 | 07.05.2025 r. | Julia Kończak | Aleksandra Matusiak |
| 1 | data/25_04_23 bact.csv | barcode72 | Thunder | Koń | 15 lat | 507 | 08.05.2025 r. | Julia Kończak | Dr. Smith |
| 2 | data/25_04_23 bact.csv | barcode56 | Spirit | Koń | 12 lat | 508 | 09.05.2025 r. | Julia Kończak | Dr. Johnson |


In [12]:
# Create sample files from the real data for batch processing demonstration
# Use different barcodes from the real data to create sample files
barcodes_for_samples = ['barcode52', 'barcode78', 'barcode56']  # Top barcodes with most data

for i, barcode in enumerate(barcodes_for_samples, 1):
    # Create a subset of the real data for each barcode
    subset_df = df[df[barcode] > 0].copy()
    
    # Copy the counts into a 'barcode59' column so every sample file uses the same column name as our examples
    subset_df['barcode59'] = subset_df[barcode]
    
    # Keep only necessary columns
    columns_to_keep = ['species', 'barcode59', 'phylum', 'genus', 'family', 'class', 'order']
    subset_df = subset_df[columns_to_keep]
    
    # Save to file
    subset_df.to_csv(f'data/sample_{i}.csv', index=False)
    print(f"Created data/sample_{i}.csv from {barcode} data ({len(subset_df)} species)")

Created data/sample_1.csv from barcode52 data (26 species)
Created data/sample_2.csv from barcode78 data (34 species)
Created data/sample_3.csv from barcode56 data (47 species)


In [13]:
# Create a configuration file for batch processing
config_content = """# Default barcode column to analyze
barcode_column: barcode59

# Default patient information
default_species: Koń
default_age: Unknown
performed_by: Julia Kończak
requested_by: Aleksandra Matusiak

# Processing settings
max_workers: 4
log_level: INFO
"""

with open('config.yaml', 'w') as f:
    f.write(config_content)
print("Created config.yaml")

Created config.yaml


In [14]:
# Create a manifest file for batch processing with patient details
manifest_data = pd.DataFrame({
    'csv_file': ['data/sample_1.csv', 'data/sample_2.csv', 'data/sample_3.csv'],
    'patient_name': ['Montana', 'Thunder', 'Spirit'],
    'species': ['Koń', 'Koń', 'Koń'],
    'age': ['20 lat', '15 lat', '12 lat'],
    'sample_number': ['506', '507', '508'],
    'date_received': ['07.05.2025 r.', '08.05.2025 r.', '09.05.2025 r.'],
    'performed_by': ['Julia Kończak', 'Julia Kończak', 'Julia Kończak'],
    'requested_by': ['Aleksandra Matusiak', 'Dr. Smith', 'Dr. Johnson']
})

manifest_data.to_csv('manifest.csv', index=False)
print("Created manifest.csv")
manifest_data

Created manifest.csv


| | csv_file | patient_name | species | age | sample_number | date_received | performed_by | requested_by |
|---|---|---|---|---|---|---|---|---|
| 0 | data/sample_1.csv | Montana | Koń | 20 lat | 506 | 07.05.2025 r. | Julia Kończak | Aleksandra Matusiak |
| 1 | data/sample_2.csv | Thunder | Koń | 15 lat | 507 | 08.05.2025 r. | Julia Kończak | Dr. Smith |
| 2 | data/sample_3.csv | Spirit | Koń | 12 lat | 508 | 09.05.2025 r. | Julia Kończak | Dr. Johnson |
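
A wrong relative path in a manifest only surfaces once the generator tries to open the file, so it can help to check the `csv_file` column before starting a batch. The sketch below assumes the manifest layout shown above. `missing_manifest_files` is a hypothetical helper, not part of `BatchReportProcessor`, and the demo runs against a throwaway manifest in a temporary directory.

```python
import csv
import tempfile
from pathlib import Path

def missing_manifest_files(manifest_path):
    """Hypothetical helper: list csv_file entries that do not exist on disk.

    Relative paths are resolved against the current working directory.
    """
    with open(manifest_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [row["csv_file"] for row in rows if not Path(row["csv_file"]).is_file()]

# Demo with a throwaway manifest: one file that exists, one that does not.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "sample_1.csv"
    present.write_text("species,barcode59,phylum,genus\n", encoding="utf-8")
    absent = Path(tmp) / "sample_2.csv"  # never created

    manifest = Path(tmp) / "manifest.csv"
    with open(manifest, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["csv_file", "patient_name"])
        writer.writerow([str(present), "Montana"])
        writer.writerow([str(absent), "Thunder"])

    missing = missing_manifest_files(manifest)

print(f"Missing files: {len(missing)}")
```

Running the same check on `manifest.csv` from the notebook's working directory should return an empty list if the sample files were created as shown.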


In [15]:
# Process files using the batch processor
processor = BatchReportProcessor()

# Process from manifest
processor.process_from_manifest('manifest.csv', 'reports/batch/')

print("\nBatch processing complete!")
print("\nGenerated reports:")
for report in Path('reports/batch/').glob('*.pdf'):
    print(f"  • {report}")

2025-06-27 18:41:37,318 - batch_processor - INFO - Processing data/sample_1.csv -> reports/batch/sample_1_report.pdf
Report generated successfully: reports/batch/sample_1_report.pdf
2025-06-27 18:41:39,664 - batch_processor - INFO - Successfully generated report: reports/batch/sample_1_report.pdf
2025-06-27 18:41:39,665 - batch_processor - INFO - Processing data/sample_2.csv -> reports/batch/sample_2_report.pdf
Report generated successfully: reports/batch/sample_2_report.pdf
2025-06-27 18:41:41,861 - batch_processor - INFO - Successfully generated report: reports/batch/sample_2_report.pdf
2025-06-27 18:41:41,862 - batch_processor - INFO - Processing data/sample_3.csv -> reports/batch/sample_3_report.pdf
Report generated successfully: reports/batch/sample_3_report.pdf
2025-06-27 18:41:44,122 - batch_processor - INFO - Successfully generated report: reports/batch/sample_3_report.pdf

Batch processing complete!

Generated reports:
  • reports/batch/sample_3_report.pdf
  • reports/batch/sa

<a id='customization'></a>
## 6. Customization Options

Let's explore how to customize various aspects of the reports.

### Customizing Reference Ranges

You can modify the reference ranges for individual phyla to suit your laboratory's requirements:

In [16]:
# Display current reference ranges
from advanced_pdf_generator import AdvancedMicrobiomeReportGenerator

print("Current Reference Ranges:")
for phylum, (min_val, max_val) in AdvancedMicrobiomeReportGenerator.REFERENCE_RANGES.items():
    print(f"{phylum}: {min_val}% - {max_val}%")

# Example of how to modify reference ranges
# Note: This would need to be done in the actual Python file
custom_ranges = {
    'Actinomycetota': (0.5, 10),
    'Bacillota': (25, 65),
    'Bacteroidota': (5, 35),
    'Pseudomonadota': (3, 30),
    'Fibrobacterota': (0.5, 8)
}

print("\nCustom Reference Ranges (example):")
for phylum, (min_val, max_val) in custom_ranges.items():
    print(f"{phylum}: {min_val}% - {max_val}%")

Current Reference Ranges:
Actinomycetota: 0.1% - 8%
Bacillota: 20% - 70%
Bacteroidota: 4% - 40%
Pseudomonadota: 2% - 35%
Fibrobacterota: 0.1% - 5%

Custom Reference Ranges (example):
Actinomycetota: 0.5% - 10%
Bacillota: 25% - 65%
Bacteroidota: 5% - 35%
Pseudomonadota: 3% - 30%
Fibrobacterota: 0.5% - 8%
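
To see what a set of ranges means for a sample, the sketch below compares the barcode59 phylum percentages from earlier in this tutorial against the example `custom_ranges`. `flag_out_of_range` is a hypothetical helper, not a method of `AdvancedMicrobiomeReportGenerator`. The low/normal/high labels are an assumption for illustration and may differ from how the report itself presents ranges.

```python
def flag_out_of_range(phylum_pct, ranges):
    """Hypothetical helper: label each phylum in `ranges` as low, normal, or high.

    Phyla absent from `phylum_pct` are treated as 0%.
    """
    flags = {}
    for phylum, (low, high) in ranges.items():
        value = phylum_pct.get(phylum, 0.0)
        if value < low:
            flags[phylum] = "low"
        elif value > high:
            flags[phylum] = "high"
        else:
            flags[phylum] = "normal"
    return flags

# Example ranges from the cell above (percent of total reads).
custom_ranges = {
    "Actinomycetota": (0.5, 10),
    "Bacillota": (25, 65),
    "Bacteroidota": (5, 35),
    "Pseudomonadota": (3, 30),
    "Fibrobacterota": (0.5, 8),
}

# Phylum percentages for barcode59, as printed earlier in this tutorial.
observed = {
    "Actinomycetota": 38.04,
    "Bacillota": 24.46,
    "Bacteroidota": 2.17,
    "Pseudomonadota": 34.24,
    "Fibrobacterota": 0.0,
}

flags = flag_out_of_range(observed, custom_ranges)
for phylum, flag in flags.items():
    print(f"{phylum}: {observed[phylum]}% ({flag})")
```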


### Customizing Colors and Visualization

In [17]:
# Display current color scheme
print("Current Phylum Colors:")
for phylum, color in AdvancedMicrobiomeReportGenerator.PHYLUM_COLORS.items():
    print(f"{phylum}: {color}")

# Visualize the color scheme
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
y_pos = 0
for phylum, color in AdvancedMicrobiomeReportGenerator.PHYLUM_COLORS.items():
    ax.barh(y_pos, 1, color=color, height=0.8)
    ax.text(0.5, y_pos, phylum, ha='center', va='center', fontweight='bold')
    y_pos += 1

ax.set_xlim(0, 1)
ax.set_ylim(-0.5, len(AdvancedMicrobiomeReportGenerator.PHYLUM_COLORS) - 0.5)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('Phylum Color Scheme')
plt.tight_layout()
plt.show()

Current Phylum Colors:
Actinomycetota: #00BCD4
Bacillota: #4CAF50
Bacteroidota: #FF5722
Pseudomonadota: #00E5FF
Fibrobacterota: #9C27B0
Other: #9E9E9E




### Creating Custom Clinical Interpretations

In [18]:
# Example function to generate custom clinical interpretation
def generate_custom_interpretation(phylum_dist, species_data):
    """
    Generate a custom clinical interpretation based on the microbiome data.
    
    Args:
        phylum_dist: Dictionary of phylum percentages
        species_data: DataFrame with species information
    
    Returns:
        str: Clinical interpretation text
    """
    interpretation = "Analiza mikrobiomu wykazała "
    
    # Check Bacillota levels
    bacillota_pct = phylum_dist.get('Bacillota', 0)
    if 20 <= bacillota_pct <= 70:
        interpretation += "prawidłowy poziom bakterii Bacillota, "
    elif bacillota_pct < 20:
        interpretation += "obniżony poziom bakterii Bacillota, "
    else:
        interpretation += "podwyższony poziom bakterii Bacillota, "
    
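    # Check for pathogenic species
    # Names must match the 'species' column exactly. The data file in this tutorial
    # writes species names with spaces (e.g. 'Streptococcus equinus'), so adapt these
    # underscore-separated names to your data or they will never match.
    pathogenic = ['Clostridium_difficile', 'Salmonella_enterica', 'Campylobacter_jejuni']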
    pathogens_found = species_data[species_data['species'].isin(pathogenic)]
    
    if pathogens_found.empty:
        interpretation += "brak patogenów jelitowych. "
    else:
        interpretation += f"wykryto potencjalne patogeny: {', '.join(pathogens_found['species'])}. "
    
    interpretation += "Zaleca się kontynuację monitorowania stanu zdrowia."
    
    return interpretation

# Test the function with our sample data
test_phylum_dist = {'Bacillota': 24.46, 'Bacteroidota': 2.17, 'Pseudomonadota': 34.24}
test_interpretation = generate_custom_interpretation(test_phylum_dist, df)
print("Custom interpretation:")
print(test_interpretation)

Custom interpretation:
Analiza mikrobiomu wykazała prawidłowy poziom bakterii Bacillota, brak patogenów jelitowych. Zaleca się kontynuację monitorowania stanu zdrowia.


In [19]:
# Check if required columns are present in our real data
def validate_csv_format(csv_path):
    """
    Validate that the CSV file has all required columns.
    """
    required_columns = ['species', 'phylum', 'genus']
    
    try:
        df = pd.read_csv(csv_path)
        missing_columns = [col for col in required_columns if col not in df.columns]
        
        if missing_columns:
            print(f"❌ Missing required columns: {missing_columns}")
            return False
        
        # Check for barcode columns
        barcode_columns = [col for col in df.columns if col.startswith('barcode')]
        if not barcode_columns:
            print("❌ No barcode columns found")
            return False
        
        print("✅ CSV format is valid")
        print(f"   Found {len(barcode_columns)} barcode columns")
        print(f"   Species count: {len(df)}")
        print(f"   Unique phylums: {df['phylum'].nunique()}")
        return True
        
    except Exception as e:
        print(f"❌ Error reading CSV: {e}")
        return False

# Test validation with real data
print("Validating real data file:")
validate_csv_format('data/25_04_23 bact.csv')

Validating real data file:
✅ CSV format is valid
   Found 30 barcode columns
   Species count: 169
   Unique phylums: 8


True

<a id='troubleshooting'></a>
## 7. Troubleshooting

Common issues and their solutions:

### Issue 1: Missing Columns in CSV

In [20]:
# Check if required columns are present
def validate_csv_format(csv_path):
    """
    Validate that the CSV file has all required columns.
    """
    required_columns = ['species', 'phylum', 'genus']
    
    try:
        df = pd.read_csv(csv_path)
        missing_columns = [col for col in required_columns if col not in df.columns]
        
        if missing_columns:
            print(f"❌ Missing required columns: {missing_columns}")
            return False
        
        # Check for barcode columns
        barcode_columns = [col for col in df.columns if col.startswith('barcode')]
        if not barcode_columns:
            print("❌ No barcode columns found")
            return False
        
        print("✅ CSV format is valid")
        print(f"   Found barcode columns: {barcode_columns}")
        return True
        
    except Exception as e:
        print(f"❌ Error reading CSV: {e}")
        return False

# Test validation
validate_csv_format('data/25_04_23 bact.csv')

✅ CSV format is valid
   Found barcode columns: ['barcode45', 'barcode46', 'barcode47', 'barcode48', 'barcode50', 'barcode51', 'barcode52', 'barcode53', 'barcode54', 'barcode55', 'barcode56', 'barcode58', 'barcode59', 'barcode60', 'barcode61', 'barcode62', 'barcode63', 'barcode64', 'barcode66', 'barcode67', 'barcode68', 'barcode69', 'barcode70', 'barcode71', 'barcode72', 'barcode74', 'barcode75', 'barcode76', 'barcode77', 'barcode78']


True

### Issue 2: Polish Characters Not Displaying

If Polish characters are not displaying correctly in your PDFs, try these solutions:

1. **Install system fonts with Polish support**
2. **Configure matplotlib to use appropriate fonts**
3. **Ensure your data files are UTF-8 encoded**
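
For the third point, a file exported from older Windows software may be in a legacy code page rather than UTF-8. The sketch below tries UTF-8 first and falls back to `cp1250` (the Windows Central European code page). The fallback order is an assumption about where such files come from, and `read_text_with_fallback` is a hypothetical helper, not part of the project.

```python
import tempfile
from pathlib import Path

def read_text_with_fallback(path, encodings=("utf-8", "cp1250")):
    """Hypothetical helper: decode a file, trying each encoding in order.

    Returns the decoded text and the encoding that worked.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with any of {encodings}")

# Demo: write a small file in cp1250, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.csv"
    legacy.write_bytes("name,species\nMontana,Koń\n".encode("cp1250"))
    text, used = read_text_with_fallback(legacy)

print(f"Decoded with {used}: {text.splitlines()[1]}")
```

Once the encoding is known, the same name can be passed to `pd.read_csv(path, encoding=...)`, and the file can be re-saved as UTF-8 to avoid the problem in future runs.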

In [21]:
# Test Polish character support
polish_text = "KOMPLEKSOWE BADANIE KAŁU - żółć, źrebię, gęś"
print(f"Polish text: {polish_text}")
print(f"Encoded: {polish_text.encode('utf-8')}")

# Ensure matplotlib supports Polish characters
plt.rcParams['font.family'] = 'DejaVu Sans'
fig, ax = plt.subplots(figsize=(6, 2))
ax.text(0.5, 0.5, polish_text, ha='center', va='center', fontsize=12)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.axis('off')
plt.title('Polish Character Test')
plt.show()

Polish text: KOMPLEKSOWE BADANIE KAŁU - żółć, źrebię, gęś
Encoded: b'KOMPLEKSOWE BADANIE KA\xc5\x81U - \xc5\xbc\xc3\xb3\xc5\x82\xc4\x87, \xc5\xbarebi\xc4\x99, g\xc4\x99\xc5\x9b'




In [22]:
# Solution: Install and configure fonts for Polish characters
import matplotlib.font_manager as fm

# List available fonts
available_fonts = [f.name for f in fm.fontManager.ttflist]
print(f"Total fonts available: {len(available_fonts)}")

# Find fonts that support Polish characters
polish_compatible_fonts = ['DejaVu Sans', 'Liberation Sans', 'Arial Unicode MS', 'Noto Sans']
found_fonts = [font for font in polish_compatible_fonts if font in available_fonts]

print(f"\nPolish-compatible fonts found: {found_fonts}")

# Set a font that works
if found_fonts:
    plt.rcParams['font.family'] = found_fonts[0]
    print(f"\nUsing font: {found_fonts[0]}")
else:
    print("\n⚠️ No Polish-compatible fonts found. Consider installing DejaVu fonts.")
    print("   On Ubuntu/Debian: sudo apt-get install fonts-dejavu")
    print("   On macOS: brew install font-dejavu")
    print("   On Windows: Download from https://dejavu-fonts.github.io/")

Total fonts available: 575

Polish-compatible fonts found: ['DejaVu Sans', 'Liberation Sans', 'Noto Sans']

Using font: DejaVu Sans


In [23]:
# Working with the real data file
real_data_file = 'data/25_04_23 bact.csv'
print(f"Looking for real data file: {real_data_file}")

# Load and examine the data
real_df = pd.read_csv(real_data_file)
print(f"\nData shape: {real_df.shape}")
print(f"\nColumns: {list(real_df.columns)[:10]}...")  # Show first 10 columns

# Find barcode columns
barcode_cols = [col for col in real_df.columns if 'barcode' in col.lower()]
print(f"\nBarcode columns found: {len(barcode_cols)}")
print(f"First few: {barcode_cols[:5]}")

# Analyze diversity across samples
print("\nSpecies diversity by sample (first 5 barcodes):")
diversity_stats = []
for barcode in barcode_cols[:5]:
    species_count = (real_df[barcode] > 0).sum()
    total_reads = real_df[barcode].sum()
    if total_reads > 0:
        diversity_stats.append({
            'Barcode': barcode,
            'Species Count': species_count,
            'Total Reads': total_reads
        })

diversity_df = pd.DataFrame(diversity_stats)
print(diversity_df.to_string(index=False))

Looking for real data file: data/25_04_23 bact.csv

Data shape: (169, 47)

Columns: ['species', 'barcode45', 'barcode46', 'barcode47', 'barcode48', 'barcode50', 'barcode51', 'barcode52', 'barcode53', 'barcode54']...

Barcode columns found: 30
First few: ['barcode45', 'barcode46', 'barcode47', 'barcode48', 'barcode50']

Species diversity by sample (first 5 barcodes):
  Barcode  Species Count  Total Reads
barcode45              8          104
barcode46             17          135
barcode47              7           82
barcode48              5           67
barcode50              6           66


### Issue 3: Memory Issues with Large CSV Files

Large CSV files can exhaust available memory. To handle them:

1. **Use chunk processing** - Process the file in smaller pieces
2. **Filter unnecessary columns** - Only load the columns you need
3. **Use data types efficiently** - Specify dtypes when reading CSV
4. **Consider using databases** - For very large datasets, use SQLite or PostgreSQL
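
The database option can be sketched with the standard library alone. The example below streams CSV rows into an in-memory SQLite table and lets SQL do the per-phylum aggregation. The table name `counts`, its schema, and `phylum_totals_sqlite` are assumptions made for illustration. For a genuinely large file you would connect to a file on disk rather than `:memory:`.

```python
import csv
import io
import sqlite3

# Three rows taken from the barcode59 sample shown in this tutorial.
SAMPLE_CSV = """species,barcode59,phylum
Acinetobacter lanii,46,Pseudomonadota
Streptomyces albidoflavus,27,Actinomycetota
Streptomyces sp. NBC_01685,16,Actinomycetota
"""

def phylum_totals_sqlite(csv_file, barcode_column):
    """Stream rows into an in-memory SQLite table and sum reads per phylum."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counts (species TEXT, phylum TEXT, reads INTEGER)")

    # executemany consumes the generator row by row, so the CSV is never
    # held in memory as a whole.
    reader = csv.DictReader(csv_file)
    conn.executemany(
        "INSERT INTO counts VALUES (?, ?, ?)",
        ((r["species"], r["phylum"], int(r[barcode_column])) for r in reader),
    )

    rows = conn.execute(
        "SELECT phylum, SUM(reads) FROM counts WHERE reads > 0 "
        "GROUP BY phylum ORDER BY SUM(reads) DESC"
    ).fetchall()
    conn.close()
    return dict(rows)

totals = phylum_totals_sqlite(io.StringIO(SAMPLE_CSV), "barcode59")
print(totals)
```

With a real file, pass an open file handle (`open(path, newline='', encoding='utf-8')`) in place of the `io.StringIO` object.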

In [24]:
# Example: Efficient loading of large CSV files
def load_microbiome_data_efficiently(csv_path, barcode_column):
    """
    Load microbiome data efficiently by only reading necessary columns.
    """
    # Read only the header row; pandas handles quoted column names correctly
    header = pd.read_csv(csv_path, nrows=0).columns
    
    # Identify columns we need
    taxonomy_cols = ['species', 'phylum', 'genus', 'family', 'class', 'order']
    cols_to_read = [col for col in taxonomy_cols + [barcode_column] if col in header]
    
    # Build dtypes only for the columns that are actually present
    dtypes = {col: 'string' for col in cols_to_read if col != barcode_column}
    if barcode_column in cols_to_read:
        dtypes[barcode_column] = 'int32'  # Use int32 instead of int64 to save memory
    
    # Read only the necessary columns
    df = pd.read_csv(csv_path, usecols=cols_to_read, dtype=dtypes)
    
    print(f"Loaded {len(df)} rows with {len(df.columns)} columns")
    # deep=True also counts the memory held by the string values themselves
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    return df

# Example usage
# df = load_microbiome_data_efficiently('data/25_04_23 bact.csv', 'barcode59')

In [25]:
# Function to process large CSV files in chunks
def process_large_csv(csv_path, barcode_column, chunk_size=1000):
    """
    Process large CSV files in chunks to avoid memory issues.
    """
    total_counts = {}
    phylum_counts = {}
    
    # Read only the needed columns, chunk_size rows at a time.
    # pandas raises ValueError if any of these columns is missing from the file.
    needed = ['species', 'phylum', barcode_column]
    for chunk in pd.read_csv(csv_path, usecols=needed, chunksize=chunk_size):
        # Keep detected species, then aggregate the chunk without a Python row loop
        detected = chunk[chunk[barcode_column] > 0]
        for species, count in detected.groupby('species')[barcode_column].sum().items():
            total_counts[species] = total_counts.get(species, 0) + count
        for phylum, count in detected.groupby('phylum')[barcode_column].sum().items():
            phylum_counts[phylum] = phylum_counts.get(phylum, 0) + count
    
    print(f"Processed {len(total_counts)} unique species")
    print(f"Total phyla: {len(phylum_counts)}")
    
    return total_counts, phylum_counts

# Example usage (would work with large files)
print("This function reads the file in chunks, so only one chunk of rows is held in memory at a time.")

This function reads the file in chunks, so only one chunk of rows is held in memory at a time.


### Working with Real Data

Let's test with the actual data file in the repository:

In [26]:
# Check if the real data file exists
real_data_file = 'data/25_04_23 bact.csv'
if Path(real_data_file).exists():
    print(f"Found real data file: {real_data_file}")
    
    # Load and examine the data
    real_df = pd.read_csv(real_data_file)
    print(f"\nData shape: {real_df.shape}")
    print(f"\nColumns: {list(real_df.columns)[:10]}...")  # Show first 10 columns
    
    # Find barcode columns
    barcode_cols = [col for col in real_df.columns if 'barcode' in col.lower()]
    print(f"\nBarcode columns found: {len(barcode_cols)}")
    if barcode_cols:
        print(f"First few: {barcode_cols[:5]}")
else:
    print(f"Real data file not found: {real_data_file}")
    print("Make sure to place the data file in the data/ directory")

Found real data file: data/25_04_23 bact.csv

Data shape: (169, 47)

Columns: ['species', 'barcode45', 'barcode46', 'barcode47', 'barcode48', 'barcode50', 'barcode51', 'barcode52', 'barcode53', 'barcode54']...

Barcode columns found: 30
First few: ['barcode45', 'barcode46', 'barcode47', 'barcode48', 'barcode50']


## Summary

In this tutorial, we've covered:

1. ✅ Setting up the Equine Microbiome Reporter
2. ✅ Understanding the required data format
3. ✅ Generating basic PDF reports
4. ✅ Creating advanced Polish laboratory reports
5. ✅ Batch processing multiple samples
6. ✅ Customizing reports and interpretations
7. ✅ Troubleshooting common issues

### Next Steps

- Experiment with your own microbiome data
- Customize the report templates for your specific needs
- Integrate the reporter into your laboratory workflow
- Contribute improvements to the project

For more information, check the project README and documentation.

In [27]:
# Clean up temporary files (optional)
import shutil

# Uncomment to clean up
# shutil.rmtree('temp', ignore_errors=True)
# print("Cleaned up temporary files")