# SickleScope Examples: Multiple Use Cases and Real-World Applications

This notebook demonstrates various practical use cases for the SickleScope genomics analysis package. Each example showcases different aspects of genetic variant analysis, from basic screening to advanced population studies.

## Example Use Cases Covered

This notebook works through the first three use cases in full. The remaining three are listed for context and revisited under "Additional Use Cases to Explore" at the end.

1. **Clinical Screening Workflow** - Standard patient genetic screening
2. **Family Genetic Analysis** - Multi-sample family study
3. **Population Comparison Study** - Analysing variants across different populations
4. **Research Dataset Analysis** - Large-scale variant analysis
5. **Clinical Trial Screening** - Identifying patients for clinical trials
6. **Carrier Screening Program** - Population-wide carrier detection

---

## Prerequisites

- Basic understanding of genetics and variant analysis
- Python 3.9+ with SickleScope installed
- Sample datasets (provided in this notebook)

Let's begin with the setup and imports.

In [None]:
# Import required libraries
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
from datetime import datetime
import json

warnings.filterwarnings('ignore')

# Set up plotting parameters
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11
sns.set_style("whitegrid")

print("Libraries imported successfully")
print(f"Python version: {sys.version.split()[0]}")  # slicing [:6] would truncate e.g. 3.10.12
print(f"Working directory: {os.getcwd()}")

In [None]:
# Import SickleScope modules
sys.path.append('../')

try:
    from sickle_scope.analyser import SickleAnalyser
    from sickle_scope.visualiser import SickleVisualiser
    print("SickleScope modules imported successfully")
    print("Ready to begin multiple use case examples!")
except ImportError as e:
    print(f"Import error: {e}")
    print("Make sure you're running this notebook from the notebooks/ directory")
    print("Or install SickleScope with: pip install -e ../")

## Example 1: Clinical Screening Workflow

**Use Case**: A genetics clinic needs to screen a new patient for sickle cell disease risk based on their genetic variants.

**Scenario**: Patient John Doe, 28 years old, presents for preconception genetic counselling. His family history includes relatives with anaemia, and he wants to know his genetic risk profile.

### Patient Data Setup
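
Before a table like the one built in the next cell is handed to an analyser, a quick structural check can catch malformed rows early. The sketch below is not part of SickleScope: `REQUIRED_COLUMNS` and `validate_variant_rows` are hypothetical names, and the rules (five required columns, unphased biallelic genotypes such as `0/1`, single-base alleles that differ) are assumptions drawn from the example data in this notebook.

```python
import numbers
import re

# Column names taken from the example tables in this notebook (an assumption
# about what an analyser needs, not a documented SickleScope schema).
REQUIRED_COLUMNS = ("chromosome", "position", "ref_allele", "alt_allele", "genotype")
GENOTYPE_PATTERN = re.compile(r"^[01]/[01]$")
BASES = {"A", "C", "G", "T"}


def validate_variant_rows(rows):
    """Return a list of human-readable problems found in a list of row dicts."""
    problems = []
    for i, row in enumerate(rows):
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
            continue
        if not GENOTYPE_PATTERN.match(str(row["genotype"])):
            problems.append(f"row {i}: unexpected genotype {row['genotype']!r}")
        if row["ref_allele"] not in BASES or row["alt_allele"] not in BASES:
            problems.append(f"row {i}: alleles must be single bases A/C/G/T")
        elif row["ref_allele"] == row["alt_allele"]:
            problems.append(f"row {i}: ref and alt alleles are identical")
        if not (isinstance(row["position"], numbers.Integral) and row["position"] > 0):
            problems.append(f"row {i}: position must be a positive integer")
    return problems


rows = [
    {"chromosome": "11", "position": 5227002, "ref_allele": "T", "alt_allele": "A", "genotype": "0/1"},
    {"chromosome": "11", "position": 5248000, "ref_allele": "C", "alt_allele": "C", "genotype": "1|1"},
]
for problem in validate_variant_rows(rows):
    print(problem)
```

With pandas, `patient_variants.to_dict('records')` produces rows in the shape this helper expects.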

In [None]:
# Create realistic patient data for clinical screening
print("Creating Clinical Screening Example Data")
print("=" * 45)

# Patient information
patient_info = {
    'patient_id': 'PATIENT_001',
    'name': 'John Doe',
    'age': 28,
    'ethnicity': 'African American',
    'indication': 'Preconception genetic counselling'
}

print("Patient Information:")
for key, value in patient_info.items():
    print(f"  {key.replace('_', ' ').title()}: {value}")

# Create patient variant data (realistic genetic profile)
patient_variants = pd.DataFrame({
    'chromosome': ['11', '11', '11', '2', '6'],
    'position': [5227002, 5226977, 5248000, 60494113, 135460698],
    'ref_allele': ['T', 'G', 'C', 'G', 'T'],
    'alt_allele': ['A', 'A', 'G', 'T', 'C'],
    'genotype': ['0/1', '0/0', '0/0', '1/1', '0/1'],
    'variant_id': ['rs334', 'rs33930165', 'novel_001', 'rs1427407', 'rs9399137']
})

print(f"\nPatient Genetic Profile:")
print(f"Total variants to analyse: {len(patient_variants)}")
print(f"Chromosomes involved: {sorted(patient_variants['chromosome'].unique())}")

# Display the variant data
print("\nPatient Variant Data:")
display(patient_variants)

### Clinical Analysis and Risk Assessment
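
The next cell writes the patient table to a fixed filename in the working directory and deletes it afterwards, so the file is left behind if the analysis raises. A minimal sketch of a more defensive pattern uses `tempfile.TemporaryDirectory`, which cleans up even on error. `run_on_temp_csv` and `count_data_rows` are hypothetical stand-ins; in this notebook the callable passed in would be `analyser.analyse_file`.

```python
import csv
import tempfile
from pathlib import Path


def run_on_temp_csv(rows, fieldnames, analyse):
    """Write rows to a throwaway CSV, call analyse(path), and always clean up."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        csv_path = Path(tmp_dir) / "variants.csv"
        with csv_path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        # The directory and file are removed when the with-block exits,
        # whether analyse() returns normally or raises.
        return analyse(str(csv_path))


def count_data_rows(path):
    """Hypothetical stand-in for an analyser: counts the data rows in the CSV."""
    with open(path, newline="") as handle:
        return sum(1 for _ in csv.DictReader(handle))


rows = [
    {"chromosome": "11", "position": 5227002, "genotype": "0/1"},
    {"chromosome": "2", "position": 60494113, "genotype": "1/1"},
]
print(run_on_temp_csv(rows, ["chromosome", "position", "genotype"], count_data_rows))
```

With pandas, `patient_variants.to_csv(csv_path, index=False)` could replace the `csv.DictWriter` lines.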

In [None]:
# Initialise analyser for clinical screening
print("CLINICAL SCREENING ANALYSIS")
print("=" * 35)

# Save patient data to temporary file for analysis
temp_file = 'temp_patient_001.csv'
patient_variants.to_csv(temp_file, index=False)

# Initialise SickleScope analyser
analyser = SickleAnalyser(verbose=True)
print(f"\nAnalysing patient: {patient_info['patient_id']}")

# Perform comprehensive analysis
results = analyser.analyse_file(temp_file)

print(f"\nANALYSIS COMPLETED")
print(f"Variants analysed: {len(results)}")
print(f"Analysis columns: {results.shape[1]}")

# Clean up temporary file
os.remove(temp_file)

### Clinical Report Generation
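
The report cell below applies the same two thresholds (40 and 20) twice: once for the risk category and once for the recommendation list. A small helper keeps the cut-offs in one place. `classify_risk` is a hypothetical name; the thresholds and labels are simply the ones used in this notebook.

```python
def classify_risk(max_risk):
    """Map a maximum risk score to the (category, recommendation) pair used in this notebook."""
    if max_risk > 40:
        return "MODERATE TO HIGH", "Genetic counselling recommended"
    if max_risk > 20:
        return "MILD TO MODERATE", "Consider genetic counselling"
    return "LOW", "Standard care protocols"


# A score of exactly 40 falls into the middle band because the comparison is strict
for score in (55.0, 40.0, 12.5):
    category, recommendation = classify_risk(score)
    print(f"{score:5.1f} -> {category}: {recommendation}")
```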

In [None]:
# Generate comprehensive clinical report
print("GENETIC COUNSELLING REPORT")
print("=" * 30)
print(f"Patient: {patient_info['name']} (ID: {patient_info['patient_id']})")
print(f"Date: {datetime.now().strftime('%Y-%m-%d')}")
print(f"Indication: {patient_info['indication']}")

print("\n" + "=" * 50)
print("VARIANT ANALYSIS SUMMARY")
print("=" * 50)

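# Risk assessment
# Default to 0.0 so the recommendations section still runs if 'risk_score' is absent
max_risk = 0.0
if 'risk_score' in results.columns:
    max_risk = results['risk_score'].max()
    mean_risk = results['risk_score'].mean()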
    
    print(f"Overall Risk Assessment:")
    print(f"  Maximum risk score: {max_risk:.2f}")
    print(f"  Average risk score: {mean_risk:.2f}")
    
    # Risk interpretation
    if max_risk > 40:
        risk_category = "MODERATE TO HIGH"
        recommendation = "Genetic counselling recommended"
    elif max_risk > 20:
        risk_category = "MILD TO MODERATE"  
        recommendation = "Consider genetic counselling"
    else:
        risk_category = "LOW"
        recommendation = "Standard care protocols"
    
    print(f"  Risk Category: {risk_category}")
    print(f"  Recommendation: {recommendation}")

# Pathogenic variant assessment
if 'is_pathogenic' in results.columns:
    pathogenic_count = results['is_pathogenic'].sum()
    print(f"\nPathogenic Variants: {pathogenic_count}")
    
    if pathogenic_count > 0:
        pathogenic_variants = results[results['is_pathogenic'] == True]
        print("  CRITICAL FINDINGS:")
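        for idx, (_, variant) in enumerate(pathogenic_variants.iterrows(), 1):
            # Fall back gracefully if the analyser did not add these columns
            name = variant.get('variant_name', 'unnamed variant')
            risk = variant.get('risk_score', float('nan'))
            print(f"  {idx}. {name} (Risk: {risk:.2f})")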
            print(f"     Location: Chr{variant['chromosome']}:{variant['position']}")
            print(f"     Genotype: {variant['genotype']}")

# Carrier status assessment
print(f"\nCarrier Status Assessment:")
carrier_variants = results[results['genotype'] == '0/1']
print(f"  Heterozygous variants: {len(carrier_variants)}")

if len(carrier_variants) > 0:
    print("  Patient is heterozygous for:")
    for _, variant in carrier_variants.iterrows():
        if 'variant_name' in variant:
            print(f"    - {variant['variant_name']}")

# Clinical recommendations
print(f"\n" + "=" * 50)
print("CLINICAL RECOMMENDATIONS")
print("=" * 50)

recommendations = []

if max_risk > 40:
    recommendations.extend([
        "1. Referral to haematology specialist",
        "2. Partner genetic testing recommended",
        "3. Preimplantation genetic diagnosis consideration",
        "4. Regular follow-up appointments"
    ])
elif max_risk > 20:
    recommendations.extend([
        "1. Genetic counselling session",
        "2. Partner testing recommended",
        "3. Family planning discussion"
    ])
else:
    recommendations.extend([
        "1. Standard preconception care",
        "2. General genetic counselling available",
        "3. No immediate genetic concerns"
    ])

for rec in recommendations:
    print(rec)

---

## Example 2: Family Genetic Analysis

**Use Case**: A family in which one child has sickle cell disease and several members are carriers wants a comprehensive genetic analysis to understand inheritance patterns and inform family planning decisions.

**Scenario**: The Smith family includes parents (both carriers) and three children. They need analysis to understand each family member's genetic status and risk profile.
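
For a single autosomal variant such as HbS, the expected genotype distribution among the children of two carriers follows from Mendelian segregation: each parent passes on one of their two alleles with equal probability. The sketch below enumerates those combinations; `offspring_distribution` is a hypothetical helper, not a SickleScope function.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_distribution(parent_a, parent_b):
    """Return {genotype: probability} for the children of two diploid genotypes like '0/1'."""
    counts = Counter()
    for allele_a, allele_b in product(parent_a.split("/"), parent_b.split("/")):
        # Sort so that '1/0' and '0/1' count as the same unphased genotype
        genotype = "/".join(sorted((allele_a, allele_b)))
        counts[genotype] += 1
    total = sum(counts.values())
    return {gt: Fraction(n, total) for gt, n in sorted(counts.items())}


# Both parents are HbS carriers, as in the Smith family scenario
for genotype, probability in offspring_distribution("0/1", "0/1").items():
    print(genotype, probability)
# 0/0 1/4
# 0/1 1/2
# 1/1 1/4
```

These probabilities apply to each child independently, so a family of three can show any mix of the three genotypes.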

### Family Data Setup

In [None]:
# Create family genetic data for multi-sample analysis
print("FAMILY GENETIC ANALYSIS SETUP")
print("=" * 35)

# Family information
family_info = {
    'family_id': 'SMITH_FAMILY',
    'members': 5,
    'indication': 'Family genetic counselling and inheritance pattern analysis',
    'proband': 'CHILD_002'  # Child with sickle cell disease
}

print("Family Information:")
for key, value in family_info.items():
    print(f"  {key.replace('_', ' ').title()}: {value}")

# Create genetic data for each family member
family_members = {
    'FATHER': {
        'age': 45,
        'status': 'Carrier',
        'variants': {
            'chromosome': ['11', '11', '2'],
            'position': [5227002, 5248000, 60494113],
            'ref_allele': ['T', 'C', 'G'],
            'alt_allele': ['A', 'G', 'T'],
            'genotype': ['0/1', '0/0', '1/1'],  # HbS carrier
            'sample_id': ['FATHER'] * 3
        }
    },
    'MOTHER': {
        'age': 42,
        'status': 'Carrier',
        'variants': {
            'chromosome': ['11', '11', '6'],
            'position': [5227002, 5226977, 135460698],
            'ref_allele': ['T', 'G', 'T'],
            'alt_allele': ['A', 'A', 'C'],
            'genotype': ['0/1', '0/0', '0/1'],  # HbS carrier
            'sample_id': ['MOTHER'] * 3
        }
    },
    'CHILD_001': {
        'age': 18,
        'status': 'Unaffected',
        'variants': {
            'chromosome': ['11', '11'],
            'position': [5227002, 5248000],
            'ref_allele': ['T', 'C'],
            'alt_allele': ['A', 'G'],
            'genotype': ['0/0', '0/0'],  # Normal
            'sample_id': ['CHILD_001'] * 2
        }
    },
    'CHILD_002': {
        'age': 15,
        'status': 'Affected - Sickle Cell Disease',
        'variants': {
            'chromosome': ['11', '11', '2'],
            'position': [5227002, 5248000, 60494113],
            'ref_allele': ['T', 'C', 'G'],
            'alt_allele': ['A', 'G', 'T'],
            'genotype': ['1/1', '0/0', '0/1'],  # Homozygous HbS
            'sample_id': ['CHILD_002'] * 3
        }
    },
    'CHILD_003': {
        'age': 12,
        'status': 'Carrier',
        'variants': {
            'chromosome': ['11', '6'],
            'position': [5227002, 135460698],
            'ref_allele': ['T', 'T'],
            'alt_allele': ['A', 'C'],
            'genotype': ['0/1', '1/1'],  # HbS carrier + protective modifier
            'sample_id': ['CHILD_003'] * 2
        }
    }
}

# Combine all family member data into one dataframe
all_family_data = []

for member_id, member_data in family_members.items():
    member_df = pd.DataFrame(member_data['variants'])
    member_df['member_id'] = member_id
    member_df['age'] = member_data['age']
    member_df['clinical_status'] = member_data['status']
    all_family_data.append(member_df)

family_variants = pd.concat(all_family_data, ignore_index=True)

print(f"\nFamily Genetic Dataset:")
print(f"Total variants: {len(family_variants)}")
print(f"Family members: {family_variants['member_id'].nunique()}")
print(f"Chromosomes: {sorted(family_variants['chromosome'].unique())}")

print(f"\nFamily Composition:")
for member_id, member_data in family_members.items():
    print(f"  {member_id}: Age {member_data['age']}, {member_data['status']}")

# Display sample data
print(f"\nSample Family Genetic Data:")
display(family_variants.head(10))

### Family-wide Genetic Analysis
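
Alongside risk scores, a family table allows a basic sanity check: each child genotype should be obtainable by taking one allele from each parent. The sketch below tests this for one variant. `is_mendelian_consistent` is a hypothetical helper, and it deliberately ignores de novo mutations and genotyping error.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """True if the child's unphased genotype can arise from one allele of each parent."""
    child_alleles = sorted(child.split("/"))
    return any(
        sorted((paternal, maternal)) == child_alleles
        for paternal, maternal in product(father.split("/"), mother.split("/"))
    )


# HbS (rs334) genotypes from the Smith family example: both parents are 0/1
father, mother = "0/1", "0/1"
children = {"CHILD_001": "0/0", "CHILD_002": "1/1", "CHILD_003": "0/1"}
for child_id, genotype in children.items():
    print(child_id, genotype, is_mendelian_consistent(genotype, father, mother))

# A 1/1 child is not compatible with a 0/0 parent
print(is_mendelian_consistent("1/1", "0/0", "0/1"))
```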

In [None]:
# Analyse the whole family in one pass, then summarise the results per member
print("FAMILY-WIDE GENETIC ANALYSIS")
print("=" * 35)

# Save family data for analysis
family_temp_file = 'temp_smith_family.csv'
family_variants.to_csv(family_temp_file, index=False)

# Analyse all family members together
family_results = analyser.analyse_file(family_temp_file)

print(f"Analysis completed for {family_variants['member_id'].nunique()} family members")
print(f"Total variants analysed: {len(family_results)}")

# Create individual family member summaries
family_summary = {}

for member_id in family_variants['member_id'].unique():
    member_results = family_results[family_results['member_id'] == member_id]
    
    summary = {
        'member_id': member_id,
        'age': member_results['age'].iloc[0],
        'clinical_status': member_results['clinical_status'].iloc[0],
        'total_variants': len(member_results),
        'max_risk_score': member_results['risk_score'].max() if 'risk_score' in member_results.columns else 0,
        'avg_risk_score': member_results['risk_score'].mean() if 'risk_score' in member_results.columns else 0,
        'pathogenic_variants': member_results['is_pathogenic'].sum() if 'is_pathogenic' in member_results.columns else 0,
        'carrier_variants': len(member_results[member_results['genotype'] == '0/1'])
    }
    
    family_summary[member_id] = summary

print(f"\nFAMILY RISK PROFILE SUMMARY")
print("=" * 30)

for member_id, summary in family_summary.items():
    print(f"\n{member_id}:")
    print(f"  Age: {summary['age']}")
    print(f"  Clinical Status: {summary['clinical_status']}")
    print(f"  Max Risk Score: {summary['max_risk_score']:.2f}")
    print(f"  Pathogenic Variants: {summary['pathogenic_variants']}")
    print(f"  Carrier Status: {summary['carrier_variants']} heterozygous variants")

# Clean up temporary file
os.remove(family_temp_file)

---

## Example 3: Population Comparison Study

**Use Case**: A research team wants to compare sickle cell variant frequencies and risk profiles across different ethnic populations to understand population-specific genetic patterns.

**Scenario**: Comparative analysis of African American, Mediterranean, and Middle Eastern populations to identify population-specific risk factors and protective variants.

### Population Dataset Creation
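
The generation cell below treats `hbs_frequency` as the probability that a sample carries at least one HbS allele and then picks `0/1` or `1/1` with fixed weights. An alternative, sketched here under the assumption of Hardy-Weinberg equilibrium, draws each of the two alleles independently with alternate-allele frequency q, so the genotypes occur with probabilities (1-q)², 2q(1-q) and q². `sample_genotype` is a hypothetical helper, and it uses the standard-library `random` module rather than NumPy.

```python
import random
from collections import Counter


def sample_genotype(alt_freq, rng):
    """Draw one unphased diploid genotype by sampling two alleles independently."""
    alt_count = sum(rng.random() < alt_freq for _ in range(2))
    return ("0/0", "0/1", "1/1")[alt_count]


rng = random.Random(42)  # local generator, seeded for reproducibility
alt_freq = 0.08
genotypes = [sample_genotype(alt_freq, rng) for _ in range(10_000)]
counts = Counter(genotypes)

# Compare observed proportions with the Hardy-Weinberg expectations
expected = {
    "0/0": (1 - alt_freq) ** 2,
    "0/1": 2 * alt_freq * (1 - alt_freq),
    "1/1": alt_freq ** 2,
}
for genotype in ("0/0", "0/1", "1/1"):
    observed = counts[genotype] / len(genotypes)
    print(f"{genotype}: observed {observed:.4f}, expected {expected[genotype]:.4f}")
```

With this approach the parameter is a true allele frequency, so it can be compared directly with the frequency estimated later from the simulated genotypes.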

In [None]:
# Create population-stratified genetic data
print("POPULATION COMPARISON STUDY SETUP")
print("=" * 40)

# Define population characteristics
populations = {
    'African_American': {
        'n_samples': 30,
        'hbs_frequency': 0.08,  # Higher HbS frequency
        'protective_frequency': 0.15,
        'description': 'African American population'
    },
    'Mediterranean': {
        'n_samples': 25, 
        'hbs_frequency': 0.03,  # Lower HbS frequency
        'protective_frequency': 0.25,  # Higher protective variants
        'description': 'Mediterranean population'
    },
    'Middle_Eastern': {
        'n_samples': 20,
        'hbs_frequency': 0.05,  # Moderate HbS frequency
        'protective_frequency': 0.20,
        'description': 'Middle Eastern population'
    }
}

print("Population Study Parameters:")
for pop_name, pop_data in populations.items():
    print(f"  {pop_name}: {pop_data['n_samples']} samples, HbS freq: {pop_data['hbs_frequency']:.3f}")

# Generate population-specific variant data
np.random.seed(42)  # For reproducible results

population_data = []

for pop_name, pop_params in populations.items():
    n_samples = pop_params['n_samples']
    
    for sample_id in range(1, n_samples + 1):
        sample_name = f"{pop_name}_S{sample_id:03d}"
        
        # Generate variants for this sample
        sample_variants = []
        
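        # HbS variant (rs334). Note: 'hbs_frequency' acts here as the per-sample
        # probability of carrying at least one HbS allele, not as an allele frequency.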
        if np.random.random() < pop_params['hbs_frequency']:
            genotype = np.random.choice(['0/1', '1/1'], p=[0.8, 0.2])  # Mostly carriers
        else:
            genotype = '0/0'
        
        sample_variants.append({
            'chromosome': '11',
            'position': 5227002,
            'ref_allele': 'T',
            'alt_allele': 'A',
            'genotype': genotype,
            'sample_id': sample_name,
            'population': pop_name,
            'variant_id': 'rs334'
        })
        
        # Protective variants (BCL11A region)
        if np.random.random() < pop_params['protective_frequency']:
            protective_genotype = np.random.choice(['0/1', '1/1'], p=[0.7, 0.3])
        else:
            protective_genotype = '0/0'
            
        sample_variants.append({
            'chromosome': '2',
            'position': 60494113,
            'ref_allele': 'G',
            'alt_allele': 'T',
            'genotype': protective_genotype,
            'sample_id': sample_name,
            'population': pop_name,
            'variant_id': 'rs1427407'
        })
        
        # Additional random variants
        for _ in range(2):
            random_pos = np.random.randint(5200000, 5280000)
            random_genotype = np.random.choice(['0/0', '0/1', '1/1'], p=[0.6, 0.3, 0.1])
            # Draw two distinct bases so that ref and alt alleles never coincide
            ref_base, alt_base = np.random.choice(['A', 'T', 'G', 'C'], size=2, replace=False)
            
            sample_variants.append({
                'chromosome': '11',
                'position': random_pos,
                'ref_allele': ref_base,
                'alt_allele': alt_base,
                'genotype': random_genotype,
                'sample_id': sample_name,
                'population': pop_name,
                'variant_id': f'var_{random_pos}'
            })
        
        population_data.extend(sample_variants)

# Create population dataframe
population_df = pd.DataFrame(population_data)

print(f"\nPopulation Dataset Created:")
print(f"Total samples: {population_df['sample_id'].nunique()}")
print(f"Total variants: {len(population_df)}")
print(f"Populations: {population_df['population'].nunique()}")

# Population distribution
pop_counts = population_df.groupby('population')['sample_id'].nunique()
print(f"\nSample Distribution:")
for pop, count in pop_counts.items():
    print(f"  {pop}: {count} samples")

# Display sample data
print(f"\nSample Population Data:")
display(population_df.head(10))

### Population Analysis and Comparison
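
The analysis cell below counts alternate alleles per population: a `0/1` genotype contributes one HbS allele, `1/1` contributes two, and the total is divided by twice the number of samples. The same arithmetic on a plain list, with the genotype counts expected under Hardy-Weinberg equilibrium for comparison, might look like this. `allele_frequency` and `hardy_weinberg_expected` are hypothetical helpers, and the genotype list is made up for illustration.

```python
def allele_frequency(genotypes):
    """Alternate-allele frequency for a list of genotypes such as '0/1'."""
    if not genotypes:
        return 0.0
    alt_alleles = sum(gt.split("/").count("1") for gt in genotypes)
    return alt_alleles / (2 * len(genotypes))


def hardy_weinberg_expected(alt_freq, n_samples):
    """Expected genotype counts under Hardy-Weinberg equilibrium."""
    ref_freq = 1 - alt_freq
    return {
        "0/0": ref_freq ** 2 * n_samples,
        "0/1": 2 * ref_freq * alt_freq * n_samples,
        "1/1": alt_freq ** 2 * n_samples,
    }


# Illustrative data: 30 samples with 4 carriers and 1 homozygote
genotypes = ["0/0"] * 25 + ["0/1"] * 4 + ["1/1"]
q = allele_frequency(genotypes)
print(f"HbS allele frequency: {q:.3f}")
for genotype, count in hardy_weinberg_expected(q, len(genotypes)).items():
    print(f"expected {genotype}: {count:.2f}")
```

With only 20 to 30 samples per population, observed counts will differ from these expectations by chance alone, so small differences between populations should be read with caution.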

In [None]:
# Analyse population-specific patterns
print("POPULATION GENETIC ANALYSIS")
print("=" * 35)

# Save population data for analysis
pop_temp_file = 'temp_population_study.csv'
population_df.to_csv(pop_temp_file, index=False)

# Analyse all populations
pop_results = analyser.analyse_file(pop_temp_file)

print(f"Analysis completed for {population_df['population'].nunique()} populations")
print(f"Total samples analysed: {population_df['sample_id'].nunique()}")

# Population-specific summary statistics
population_stats = {}

for pop_name in population_df['population'].unique():
    pop_data = pop_results[pop_results['population'] == pop_name]
    
    # Calculate HbS allele frequency
    hbs_variants = pop_data[pop_data['variant_id'] == 'rs334']
    total_alleles = len(hbs_variants) * 2
    
    # Convert genotype to allele count explicitly as numeric
    alt_alleles = hbs_variants['genotype'].astype(str).apply(
        lambda x: 2 if x == '1/1' else 1 if x == '0/1' else 0
    ).astype(int).sum()
    
    hbs_freq = alt_alleles / total_alleles if total_alleles > 0 else 0
    
    population_stats[pop_name] = {
        'samples': len(hbs_variants),
        'hbs_frequency': hbs_freq,
        'carriers': len(hbs_variants[hbs_variants['genotype'].astype(str) == '0/1']),
        'affected': len(hbs_variants[hbs_variants['genotype'].astype(str) == '1/1']),
        'avg_risk': pop_data['risk_score'].mean() if 'risk_score' in pop_data.columns else 0
    }

print(f"\nPOPULATION COMPARISON RESULTS")
print("=" * 40)

for pop_name, stats in population_stats.items():
    print(f"\n{pop_name}:")
    print(f"  Total samples: {stats['samples']}")
    print(f"  HbS allele frequency: {stats['hbs_frequency']:.3f}")
    print(f"  Carriers (0/1): {stats['carriers']}")
    print(f"  Affected (1/1): {stats['affected']}")
    print(f"  Average risk score: {stats['avg_risk']:.2f}")

# Clean up temporary file
os.remove(pop_temp_file)

### Visualisation of Population Differences

In [None]:
# Create visualisations for population comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Population Genetic Comparison Study', fontsize=16, fontweight='bold')

# Plot 1: HbS Allele Frequencies
ax1 = axes[0, 0]
populations_list = list(population_stats.keys())
hbs_frequencies = [population_stats[pop]['hbs_frequency'] for pop in populations_list]

bars1 = ax1.bar(populations_list, hbs_frequencies, color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
ax1.set_ylabel('HbS Allele Frequency', fontsize=12)
ax1.set_title('HbS Variant Frequency Across Populations', fontsize=12)
ax1.set_ylim(0, max(max(hbs_frequencies), 0.01) * 1.2)  # floor avoids a zero-height axis

# Add value labels on bars
for bar, freq in zip(bars1, hbs_frequencies):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height,
             f'{freq:.3f}', ha='center', va='bottom')

# Plot 2: Carrier vs Affected Distribution
ax2 = axes[0, 1]
carriers = [population_stats[pop]['carriers'] for pop in populations_list]
affected = [population_stats[pop]['affected'] for pop in populations_list]

x = np.arange(len(populations_list))
width = 0.35

bars2 = ax2.bar(x - width/2, carriers, width, label='Carriers (0/1)', color='#95E1D3')
bars3 = ax2.bar(x + width/2, affected, width, label='Affected (1/1)', color='#F38181')

ax2.set_ylabel('Number of Individuals', fontsize=12)
ax2.set_title('Carrier vs Affected Status by Population', fontsize=12)
ax2.set_xticks(x)
ax2.set_xticklabels(populations_list)
ax2.legend()

# Plot 3: Average Risk Scores
ax3 = axes[1, 0]
avg_risks = [population_stats[pop]['avg_risk'] for pop in populations_list]

bars4 = ax3.bar(populations_list, avg_risks, color=['#FFD93D', '#6BCB77', '#4D96FF'])
ax3.set_ylabel('Average Risk Score', fontsize=12)
ax3.set_title('Mean Genetic Risk Score by Population', fontsize=12)

# Add value labels
for bar, risk in zip(bars4, avg_risks):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height,
             f'{risk:.1f}', ha='center', va='bottom')

# Plot 4: Sample Size Distribution
ax4 = axes[1, 1]
sample_sizes = [population_stats[pop]['samples'] for pop in populations_list]
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']

wedges, texts, autotexts = ax4.pie(sample_sizes, labels=populations_list, 
                                     colors=colors, autopct='%1.1f%%',
                                     startangle=90)
ax4.set_title('Sample Distribution Across Populations', fontsize=12)

# Make percentage text bold
for autotext in autotexts:
    autotext.set_fontweight('bold')

plt.tight_layout()
plt.show()

print("\nPopulation comparison visualisations generated successfully!")

---

## Summary and Next Steps

### What We've Demonstrated

Through these examples, we've shown how SickleScope can be used for:

1. **Clinical Screening**: Individual patient risk assessment with actionable recommendations
2. **Family Studies**: Multi-sample analysis to understand inheritance patterns
3. **Population Research**: Comparative analysis across different ethnic groups

### Key Features Highlighted

- **Flexible Data Input**: Accepts CSV variant tables for single patients, families, and population cohorts
- **Comprehensive Analysis**: Risk scoring, variant classification, and modifier detection
- **Clinical Relevance**: Generates reports suitable for genetic counselling
- **Research Applications**: Supports population studies and comparative analyses
- **Visualisation**: Results feed directly into matplotlib plots for comparison across groups

### Additional Use Cases to Explore

4. **Research Dataset Analysis**: Process large-scale genomic datasets
5. **Clinical Trial Screening**: Identify eligible patients based on genetic criteria
6. **Carrier Screening Programs**: Population-wide carrier detection initiatives
7. **Pharmacogenomics**: Predict drug response based on genetic variants
8. **Severity Prediction**: Use machine learning to predict disease severity

### Next Steps

To further explore SickleScope capabilities:

1. Try the **advanced.ipynb** notebook for machine learning features
2. Use the command-line interface for batch processing
3. Integrate SickleScope into your existing genomics pipeline
4. Contribute to the project on GitHub

### Resources

- **Documentation**: See README.md for complete API documentation
- **GitHub**: https://github.com/talhahzubayer/sickle-scope
- **Support**: Open an issue on GitHub for questions or bugs
- **Contributing**: Pull requests welcome for new features

---

Thank you for using SickleScope!