# Genomic Variant Analysis: VCF File Analysis and Visualization

This notebook demonstrates analysis of genetic variants from VCF (Variant Call Format) files.

## Learning Objectives

1. Load and parse VCF files
2. Calculate variant statistics (SNPs vs indels)
3. Perform quality control
4. Compute allele frequencies
5. Filter variants by quality
6. Visualize variant distributions (Manhattan plot, density)

## What is VCF?

VCF (Variant Call Format) is the standard format for storing genetic variation data:
- **SNPs** (Single Nucleotide Polymorphisms): Single base changes
- **Indels**: Insertions or deletions
- **Quality scores**: Confidence in variant calls
- **Genotypes**: Individual genetic makeup (0/0, 0/1, 1/1)
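
As an illustration of the genotype notation above, the sketch below counts the alternate alleles in a GT string. The helper name `alt_allele_count` is hypothetical and not part of this notebook. It assumes alleles are separated by `/` (unphased) or `|` (phased) and treats `.` as a missing call.

```python
import re

def alt_allele_count(gt):
    """Return the number of non-reference alleles in a GT string.

    Returns None if any allele is missing ('.').
    """
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")

for gt in ["0/0", "0/1", "1/1", "1|0", "./."]:
    print(gt, alt_allele_count(gt))
```

A count of 0 is homozygous reference (`0/0`), 1 is heterozygous (`0/1`), and 2 is homozygous alternate (`1/1`).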

## Dataset

A synthetic VCF file (`sample_variants.vcf`) with 25 variants on chromosome 22, created for educational purposes.

## 1. Setup and Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import re

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('Set2')
%matplotlib inline

print("✓ Imports successful")

## 2. Load VCF File

We'll parse the VCF file manually to understand its structure.
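
To make that structure concrete, here is a minimal sketch that splits one data line into its eight fixed columns, the FORMAT column, and the per-sample columns. The record and the sample names `S1` and `S2` are invented for illustration and are not taken from `sample_variants.vcf`.

```python
# A made-up, tab-separated VCF data line; all values are illustrative only.
line = "22\t16050075\t.\tA\tG\t50.0\tPASS\tDP=120;AF=0.25\tGT:DP\t0/1:60\t0/0:60"
sample_names = ["S1", "S2"]  # hypothetical sample names

fields = line.split("\t")
fixed_columns = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
fixed = dict(zip(fixed_columns, fields[:8]))

# FORMAT lists the keys; each sample column lists its values in the same order
format_keys = fields[8].split(":")
samples = {
    name: dict(zip(format_keys, column.split(":")))
    for name, column in zip(sample_names, fields[9:])
}

print(fixed["CHROM"], fixed["POS"], fixed["REF"], ">", fixed["ALT"])
print(samples)
```

All values are still strings at this point; converting `POS`, `QUAL`, and the INFO entries to numbers is a separate step.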

In [None]:
def parse_vcf(vcf_file):
    """Parse VCF file into pandas DataFrame."""
    variants = []
    samples = []
    
    with open(vcf_file, 'r') as f:
        for line in f:
            # Skip blank lines and metadata lines
            if not line.strip() or line.startswith('##'):
                continue
            
            # Parse header to get sample names
            if line.startswith('#CHROM'):
                header = line.strip().split('\t')
                samples = header[9:]  # Sample names start at index 9 (the 10th column)
                continue
            
            # Parse variant line
            fields = line.strip().split('\t')
            chrom, pos, id_field, ref, alt, qual, filter_field, info, format_field = fields[:9]
            sample_genotypes = fields[9:]
            
            # Parse INFO field
            info_dict = {}
            for item in info.split(';'):
                if '=' in item:
                    key, value = item.split('=', 1)
                    info_dict[key] = value
            
            variant = {
                'CHROM': chrom,
                'POS': int(pos),
                'ID': id_field if id_field != '.' else None,
                'REF': ref,
                'ALT': alt,
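                'QUAL': float(qual) if qual != '.' else np.nan,  # '.' marks a missing QUAL
                'FILTER': filter_field,
                'DP': int(info_dict.get('DP', 0)),
                # AF and AC hold one value per ALT allele; keep the first for multi-allelic sites
                'AF': float(info_dict.get('AF', '0').split(',')[0]),
                'AC': int(info_dict.get('AC', '0').split(',')[0]),
                'AN': int(info_dict.get('AN', 0)),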
                'TYPE': info_dict.get('TYPE', 'UNKNOWN')
            }
            
            variants.append(variant)
    
    return pd.DataFrame(variants), samples

# Load data
df, sample_names = parse_vcf('sample_variants.vcf')

print(f"Loaded {len(df)} variants")
print(f"Samples: {', '.join(sample_names)}")
df.head(10)

## 3. Basic Variant Statistics
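
The statistics in this section rely on a `TYPE` key in the INFO field, which not every VCF provides. As a fallback, the variant class can usually be inferred from the lengths of REF and ALT. The function below is a minimal sketch under that assumption. `infer_variant_type` is a hypothetical helper, and it handles only simple biallelic records (no symbolic alleles such as `<DEL>` and no comma-separated ALT lists).

```python
def infer_variant_type(ref, alt):
    """Classify a biallelic variant as SNP, INS, DEL, or OTHER from allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "INS"
    if len(ref) > len(alt):
        return "DEL"
    return "OTHER"  # equal-length, multi-base substitution

examples = [("A", "G"), ("A", "ATG"), ("CTT", "C"), ("AC", "GT")]
for ref, alt in examples:
    print(ref, alt, infer_variant_type(ref, alt))
```

On the notebook's DataFrame this could be applied row by row, for example `df.apply(lambda r: infer_variant_type(r['REF'], r['ALT']), axis=1)`.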

In [None]:
print("="*60)
print("VARIANT STATISTICS")
print("="*60)

# Overall counts
print(f"Total variants: {len(df)}")
print(f"Chromosomes: {df['CHROM'].unique()}")
print(f"Position range: {df['POS'].min():,} - {df['POS'].max():,}")

# Variant types
print(f"\nVariant Types:")
type_counts = df['TYPE'].value_counts()
for variant_type, count in type_counts.items():
    print(f"  {variant_type}: {count} ({count/len(df)*100:.1f}%)")

# Known vs novel
known = df['ID'].notna().sum()
novel = df['ID'].isna().sum()
print(f"\nAnnotation Status:")
print(f"  Known (has rsID): {known} ({known/len(df)*100:.1f}%)")
print(f"  Novel: {novel} ({novel/len(df)*100:.1f}%)")

# Filter status
print(f"\nQuality Filter:")
filter_counts = df['FILTER'].value_counts()
for filter_status, count in filter_counts.items():
    print(f"  {filter_status}: {count} ({count/len(df)*100:.1f}%)")

print("="*60)

In [None]:
# Visualize variant type distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Variant types
type_counts.plot(kind='bar', ax=axes[0], color=['#3498db', '#e74c3c', '#2ecc71'])
axes[0].set_title('Variant Type Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Variant Type')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=0)
axes[0].grid(True, alpha=0.3, axis='y')

# Filter status
filter_counts.plot(kind='bar', ax=axes[1], color=['#27ae60', '#e67e22'])
axes[1].set_title('Quality Filter Status', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Filter')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=0)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 4. Quality Control Analysis
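
The plots in this section mark QUAL = 30 as a threshold. QUAL is Phred-scaled, so a score Q corresponds to an estimated error probability of 10^(-Q/10). The sketch below converts a few scores. The function name `phred_to_error_prob` is hypothetical.

```python
def phred_to_error_prob(qual):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-qual / 10)

for q in (10, 20, 30, 40):
    print(f"QUAL {q}: error probability {phred_to_error_prob(q):.4f}")
```

On this scale, QUAL 30 corresponds to roughly a 1 in 1,000 chance that the call is wrong, and QUAL 40 to roughly 1 in 10,000.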

In [None]:
# Quality score statistics
print("Quality Score Statistics:")
print(df['QUAL'].describe())

# Depth statistics
print("\nDepth of Coverage Statistics:")
print(df['DP'].describe())

In [None]:
# Quality control plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Quality score distribution
axes[0, 0].hist(df['QUAL'], bins=20, color='steelblue', edgecolor='black', alpha=0.7)
axes[0, 0].axvline(df['QUAL'].mean(), color='red', linestyle='--', 
                  linewidth=2, label=f'Mean: {df["QUAL"].mean():.1f}')
axes[0, 0].axvline(30, color='orange', linestyle='--', 
                  linewidth=2, label='Threshold: 30')
axes[0, 0].set_title('Quality Score Distribution')
axes[0, 0].set_xlabel('QUAL Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Depth distribution
axes[0, 1].hist(df['DP'], bins=20, color='coral', edgecolor='black', alpha=0.7)
axes[0, 1].axvline(df['DP'].mean(), color='red', linestyle='--', 
                  linewidth=2, label=f'Mean: {df["DP"].mean():.1f}')
axes[0, 1].set_title('Depth of Coverage Distribution')
axes[0, 1].set_xlabel('Depth (DP)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Quality vs Depth scatter
axes[1, 0].scatter(df['DP'], df['QUAL'], c=df['QUAL'], cmap='viridis', 
                  s=100, alpha=0.6, edgecolors='black')
axes[1, 0].set_title('Quality vs Depth')
axes[1, 0].set_xlabel('Depth (DP)')
axes[1, 0].set_ylabel('Quality Score')
axes[1, 0].grid(True, alpha=0.3)

# Quality by variant type
for variant_type in df['TYPE'].unique():
    subset = df[df['TYPE'] == variant_type]
    axes[1, 1].scatter(subset.index, subset['QUAL'], label=variant_type, s=50, alpha=0.7)
axes[1, 1].axhline(30, color='red', linestyle='--', alpha=0.5, label='Threshold')
axes[1, 1].set_title('Quality by Variant Type')
axes[1, 1].set_xlabel('Variant Index')
axes[1, 1].set_ylabel('Quality Score')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Allele Frequency Analysis
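
The cells in this section read AF directly from the INFO field. When AC (alternate allele count) and AN (total number of called alleles) are present, AF can be cross-checked as AC / AN, and MAF is the smaller of AF and 1 - AF. A minimal sketch on made-up counts, not taken from the notebook's dataset:

```python
import pandas as pd

# Made-up allele counts for three biallelic sites
toy = pd.DataFrame({"AC": [1, 5, 9], "AN": [10, 10, 10]})

toy["AF"] = toy["AC"] / toy["AN"]
# MAF folds the frequency so that it never exceeds 0.5
toy["MAF"] = toy["AF"].where(toy["AF"] <= 0.5, 1 - toy["AF"])

print(toy)
```

A large gap between the recomputed value and the reported AF would suggest that the INFO annotations were computed on a different sample set.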

In [None]:
# Allele frequency statistics
print("Allele Frequency Statistics:")
print(df['AF'].describe())

# Minor allele frequency (MAF)
df['MAF'] = df['AF'].apply(lambda x: min(x, 1-x))

print("\nMinor Allele Frequency Statistics:")
print(df['MAF'].describe())

# Classify by frequency
def classify_maf(maf):
    if maf < 0.01:
        return 'Rare (< 1%)'
    elif maf < 0.05:
        return 'Low Freq (1-5%)'
    else:
        return 'Common (≥ 5%)'

df['MAF_class'] = df['MAF'].apply(classify_maf)

print("\nMAF Classification:")
print(df['MAF_class'].value_counts())

In [None]:
# Allele frequency visualizations
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# AF distribution
axes[0].hist(df['AF'], bins=20, color='purple', edgecolor='black', alpha=0.7)
axes[0].axvline(0.5, color='red', linestyle='--', linewidth=2, label='AF = 0.5')
axes[0].set_title('Allele Frequency Distribution')
axes[0].set_xlabel('Allele Frequency (AF)')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# MAF distribution
axes[1].hist(df['MAF'], bins=20, color='teal', edgecolor='black', alpha=0.7)
axes[1].axvline(0.05, color='orange', linestyle='--', linewidth=2, label='MAF = 5%')
axes[1].axvline(0.01, color='red', linestyle='--', linewidth=2, label='MAF = 1%')
axes[1].set_title('Minor Allele Frequency Distribution')
axes[1].set_xlabel('Minor Allele Frequency (MAF)')
axes[1].set_ylabel('Frequency')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# MAF classes
maf_counts = df['MAF_class'].value_counts()
maf_counts.plot(kind='bar', ax=axes[2], color='salmon', alpha=0.7)
axes[2].set_title('MAF Classification')
axes[2].set_xlabel('MAF Class')
axes[2].set_ylabel('Count')
axes[2].tick_params(axis='x', rotation=45)
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 6. Variant Filtering
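
The filter in this section combines three conditions with a logical AND, so a variant is kept only if it passes all of them. To see which condition removes the most variants, it can help to count failures per criterion first. The sketch below does this on a made-up table, using the same thresholds as the notebook (QUAL ≥ 40, DP ≥ 100, FILTER equal to `PASS`). The `LowQual` filter value is invented for the example.

```python
import pandas as pd

# Made-up variants; 'LowQual' is an illustrative non-PASS filter value
toy = pd.DataFrame({
    "QUAL":   [55.0, 35.0, 80.0, 45.0],
    "DP":     [150, 200, 60, 120],
    "FILTER": ["PASS", "PASS", "PASS", "LowQual"],
})

# One boolean column per criterion
checks = pd.DataFrame({
    "qual_ok":   toy["QUAL"] >= 40,
    "depth_ok":  toy["DP"] >= 100,
    "filter_ok": toy["FILTER"] == "PASS",
})

failures = (~checks).sum()          # number of variants failing each criterion
kept = toy[checks.all(axis=1)]      # variants passing every criterion

print(failures)
print(f"Kept {len(kept)} of {len(toy)} variants")
```

Keeping the checks as separate columns also makes it easy to relax one threshold and see its effect in isolation.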

In [None]:
# Apply quality filters
def apply_filters(df, min_qual=40, min_depth=100):
    """Apply quality control filters."""
    filtered = df[
        (df['QUAL'] >= min_qual) & 
        (df['DP'] >= min_depth) &
        (df['FILTER'] == 'PASS')
    ].copy()
    return filtered

# Apply filters
high_quality = apply_filters(df, min_qual=40, min_depth=100)

print(f"Original variants: {len(df)}")
print(f"High-quality variants: {len(high_quality)} ({len(high_quality)/len(df)*100:.1f}%)")
print(f"Filtered out: {len(df) - len(high_quality)} ({(len(df)-len(high_quality))/len(df)*100:.1f}%)")

print("\nHigh-Quality Variant Types:")
print(high_quality['TYPE'].value_counts())

In [None]:
# Compare before and after filtering
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Quality distribution comparison
axes[0].hist([df['QUAL'], high_quality['QUAL']], bins=20, 
            label=['All variants', 'High-quality'], 
            color=['lightblue', 'darkblue'], alpha=0.6, edgecolor='black')
axes[0].axvline(40, color='red', linestyle='--', linewidth=2, label='Threshold: 40')
axes[0].set_title('Quality Score: Before vs After Filtering')
axes[0].set_xlabel('Quality Score')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Allele frequency comparison
axes[1].hist([df['AF'], high_quality['AF']], bins=20, 
            label=['All variants', 'High-quality'], 
            color=['lightcoral', 'darkred'], alpha=0.6, edgecolor='black')
axes[1].set_title('Allele Frequency: Before vs After Filtering')
axes[1].set_xlabel('Allele Frequency')
axes[1].set_ylabel('Frequency')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Manhattan Plot

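A Manhattan plot conventionally shows association significance, as -log10(p-value), against genomic position. No p-values are available for this dataset, so the plot below places QUAL on the y-axis instead, giving a Manhattan-style view of call quality along chromosome 22.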
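
For comparison with the QUAL-based plot in this section, the usual Manhattan transform maps each association p-value to -log10(p), so smaller p-values rise higher on the y-axis. The sketch below applies that transform to made-up p-values (not derived from this notebook's data) and compares them with 5e-8, a commonly used genome-wide significance threshold.

```python
import math

# Made-up (position, p-value) pairs; not derived from the notebook's data
assoc = [(16050000, 0.2), (16051000, 1e-3), (16052000, 5e-9)]

# Height of the significance line on the -log10 scale
genome_wide = -math.log10(5e-8)

for pos, p in assoc:
    y = -math.log10(p)
    flag = "significant" if y > genome_wide else ""
    print(pos, round(y, 2), flag)
```

The resulting y values would be passed to `ax.scatter` in place of `QUAL`, with a horizontal line drawn at `genome_wide`.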

In [None]:
# Manhattan plot
fig, ax = plt.subplots(figsize=(14, 6))

# Color by variant type
colors = {'SNP': '#3498db', 'INS': '#e74c3c', 'DEL': '#2ecc71'}

for variant_type in df['TYPE'].unique():
    subset = df[df['TYPE'] == variant_type]
    ax.scatter(subset['POS'], subset['QUAL'], 
              label=variant_type, color=colors.get(variant_type, 'gray'),
              s=100, alpha=0.7, edgecolors='black', linewidth=0.5)

# Add quality threshold line
ax.axhline(40, color='red', linestyle='--', linewidth=2, 
          alpha=0.5, label='Quality Threshold')

ax.set_title('Manhattan Plot: Variant Quality Across Chromosome 22', 
            fontsize=16, fontweight='bold')
ax.set_xlabel('Position on Chromosome 22', fontsize=12)
ax.set_ylabel('Quality Score', fontsize=12)
ax.legend(loc='best')
ax.grid(True, alpha=0.3)

# Format x-axis
ax.ticklabel_format(style='plain', axis='x')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## 8. Variant Density Along Chromosome
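
Variant density is a histogram over position: the covered span is cut into fixed-width windows and the variants falling in each window are counted. The cell in this section does this with an explicit loop. An equivalent vectorized sketch with `numpy.histogram`, on made-up positions, looks like this:

```python
import numpy as np

positions = np.array([1000, 1100, 1450, 1600, 2300])  # made-up positions
window_size = 500

# Bin edges start at the first variant and extend past the last one
edges = np.arange(positions.min(), positions.max() + window_size + 1, window_size)
counts, _ = np.histogram(positions, bins=edges)

for start, count in zip(edges[:-1], counts):
    print(f"[{start}, {start + window_size}): {count}")
```

Because the final edge always lies beyond the largest position, every variant falls inside a half-open window.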

In [None]:
# Calculate variant density in windows
window_size = 500  # Base pairs
min_pos = df['POS'].min()
max_pos = df['POS'].max()

# Create windows
windows = range(min_pos, max_pos + window_size, window_size)
density = []

for window_start in windows:
    window_end = window_start + window_size
    count = len(df[(df['POS'] >= window_start) & (df['POS'] < window_end)])
    density.append(count)

# Plot density
fig, ax = plt.subplots(figsize=(14, 5))
# align='edge' so each bar spans its window [start, start + window_size)
ax.bar(windows, density, width=window_size, align='edge', color='steelblue', 
      edgecolor='black', alpha=0.7)
ax.set_title(f'Variant Density Along Chromosome 22 ({window_size}bp windows)', 
            fontsize=14, fontweight='bold')
ax.set_xlabel('Position', fontsize=12)
ax.set_ylabel('Number of Variants', fontsize=12)
ax.grid(True, alpha=0.3, axis='y')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 9. Summary Statistics Export

In [None]:
# Create summary report
summary = {
    'Total Variants': len(df),
    'SNPs': len(df[df['TYPE'] == 'SNP']),
    'Insertions': len(df[df['TYPE'] == 'INS']),
    'Deletions': len(df[df['TYPE'] == 'DEL']),
    'Known (rsID)': df['ID'].notna().sum(),
    'Novel': df['ID'].isna().sum(),
    'High Quality (QUAL≥40)': len(df[df['QUAL'] >= 40]),
    'PASS Filter': len(df[df['FILTER'] == 'PASS']),
    'Mean Quality': df['QUAL'].mean(),
    'Mean Depth': df['DP'].mean(),
    'Mean AF': df['AF'].mean(),
    'Rare Variants (MAF<1%)': len(df[df['MAF'] < 0.01]),
    'Common Variants (MAF≥5%)': len(df[df['MAF'] >= 0.05])
}

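# Format values before building the DataFrame: a mixed int/float column would be
# coerced to float, and every count would then print with two decimals
formatted = {k: f"{v:.2f}" if isinstance(v, float) else int(v) for k, v in summary.items()}
summary_df = pd.DataFrame.from_dict(formatted, orient='index', columns=['Value'])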

print("VARIANT ANALYSIS SUMMARY")
print("="*60)
print(summary_df.to_string())
print("="*60)

# Save summary
summary_df.to_csv('variant_summary.csv')
print("\n✓ Summary saved to variant_summary.csv")

In [None]:
# Save high-quality variants
high_quality.to_csv('high_quality_variants.csv', index=False)
print(f"✓ Saved {len(high_quality)} high-quality variants to high_quality_variants.csv")

## Key Findings

### Variant Composition
- **Total Variants**: [X] variants analyzed
- **SNPs**: [Y]% of variants
- **Indels**: [Z]% of variants (insertions + deletions)

### Quality Control
- **Mean Quality Score**: [value]
- **PASS Filter**: [X]% of variants
- **High Quality (≥40)**: [Y]% of variants

### Allele Frequencies
- **Rare Variants** (MAF < 1%): [X] variants
- **Common Variants** (MAF ≥ 5%): [Y] variants
- **Mean MAF**: [value]

### Known vs Novel
- **Known** (has rsID): [X]% annotated in dbSNP
- **Novel**: [Y]% not in databases

## Next Steps

1. **Functional Annotation**: Use tools like VEP or ANNOVAR to predict variant effects
2. **Population Comparison**: Compare allele frequencies with 1000 Genomes
3. **Association Testing**: Test for disease/trait associations
4. **Pathway Analysis**: Identify affected biological pathways
5. **Validation**: Confirm variants with Sanger sequencing

## Resources

- [VCF Specification](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
- [1000 Genomes Project](https://www.internationalgenome.org/)
- [dbSNP Database](https://www.ncbi.nlm.nih.gov/snp/)
- [Ensembl VEP](https://www.ensembl.org/Tools/VEP)
- [UCSC Genome Browser](https://genome.ucsc.edu/)