# Comparison with Secondary Finding Genes

## Overview
This notebook identifies which variants from our Principal Component 1 (PC1) analysis fall within genes that the American College of Medical Genetics and Genomics (ACMG) recommends for reporting as secondary findings.

## Purpose
The main objective is to determine whether the most important variants identified by PCA (the top 5% of contributors to PC1) lie within clinically relevant genes that are routinely screened for secondary findings in genomic analysis.

## Workflow

### Step 1: Data Loading and Parsing
- **Input**: `refGene.xlsx` file containing variant data in the `top5Variants` sheet
- **Format**: Each variant is stored as a `chrom:pos:ref>alt` string, e.g. `chr2:8731872:T>C`
- **Processing**: Parses each variant string into separate columns:
  - `chrom`: Chromosome (e.g., "chr1", "chr2")
  - `pos`: Genomic position
  - `ref`: Reference allele
  - `alt`: Alternate allele
- **Output**: Creates a structured DataFrame and saves to `ParsedVariants` sheet
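
The parsing rule in Step 1 can also be written as a single regular expression. The sketch below is only an illustration and is not the code this notebook runs: `VARIANT_RE`, `parse_variant`, and the sample strings are hypothetical names and data, and the pattern assumes every variant follows the `chrom:pos:ref>alt` layout exactly.

```python
import re

# Pattern for the "chrom:pos:ref>alt" layout, e.g. "chr2:8731872:T>C"
VARIANT_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<pos>\d+):(?P<ref>[^:>]+)>(?P<alt>[^:>]+)$"
)


def parse_variant(text):
    """Return (chrom, pos, ref, alt), or None if the string does not match."""
    match = VARIANT_RE.match(text.strip())
    if match is None:
        return None
    return (match["chrom"], int(match["pos"]), match["ref"], match["alt"])


# Hypothetical inputs: two well-formed variants and one malformed string
examples = ["chr2:8731872:T>C", "chr12:103742538:C>T", "chr1-12345-A-G"]
parsed = [parse_variant(s) for s in examples]
print(parsed)
```

Returning `None` for malformed strings lets the caller decide whether to warn, skip, or stop, which mirrors the warnings printed by the parsing cell further down.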

### Step 2: Gene-Variant Mapping
- **Input**: 
  - Parsed variants from Step 1
  - `refGenesFiltered` DataFrame: the `refGene` table restricted to ACMG secondary-finding genes (chromosome, start/end positions, gene names)
- **Process**: 
  - For each variant, checks if it falls within any gene's boundaries
  - A variant is considered "within a gene" if:
    - It's on the same chromosome as the gene
    - Its position lies between the gene's start and end coordinates (both ends inclusive)
- **Output**: 
  - Comprehensive mapping of variants to genes
  - List of unique genes containing variants
  - Summary statistics
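
As a rough illustration of the containment rule, the sketch below groups gene intervals by chromosome so that each variant is compared only with genes on its own chromosome. It is a sketch under assumptions, not the notebook's implementation: `build_gene_index`, `genes_containing`, and the toy coordinates are hypothetical, and it assumes the gene table may contain several rows (transcripts) for the same gene.

```python
from collections import defaultdict


def build_gene_index(genes):
    """Group (start, end, name) intervals by chromosome."""
    index = defaultdict(list)
    for chrom, start, end, name in genes:
        index[chrom].append((start, end, name))
    return index


def genes_containing(index, chrom, pos):
    """Return the sorted, de-duplicated gene names whose interval contains pos."""
    return sorted({name for start, end, name in index.get(chrom, [])
                   if start <= pos <= end})


# Hypothetical gene rows: (chrom, start, end, gene); GENE_A has two transcripts
genes = [
    ("chr1", 100, 200, "GENE_A"),
    ("chr1", 120, 180, "GENE_A"),
    ("chr2", 500, 600, "GENE_B"),
]
index = build_gene_index(genes)

print(genes_containing(index, "chr1", 150))  # inside both GENE_A transcripts
print(genes_containing(index, "chr2", 150))  # same position, different chromosome
print(genes_containing(index, "chr2", 600))  # the end coordinate is inclusive
```

Collecting names in a set means a variant that overlaps several transcripts of one gene is reported against that gene only once.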

### Step 3: Results and Analysis
- **Identifies**: Which PC1-contributing variants are in ACMG secondary finding genes
- **Provides**: 
  - Complete variant-to-gene mapping
  - Summary of variants with/without gene matches
  - List of all genes containing variants
- **Saves**: Results to `VariantGeneMatches` sheet in Excel
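
The summary figures listed above can be derived from a variant-to-gene mapping in a few lines. The following is a minimal sketch with made-up data: the `matches` dictionary and its gene names are hypothetical stand-ins for the real results, not values produced by this analysis.

```python
from collections import Counter

# Hypothetical mapping of variant -> list of genes that contain it
matches = {
    "chr1:150:A>G": ["GENE_A"],
    "chr1:900:C>T": [],
    "chr2:550:G>A": ["GENE_B"],
    "chr1:160:T>C": ["GENE_A"],
}

# Split variants into those with and without at least one gene match
with_genes = [v for v, genes in matches.items() if genes]
without_genes = [v for v, genes in matches.items() if not genes]

# Unique genes that contain at least one variant, and variants per gene
unique_genes = sorted({g for genes in matches.values() for g in genes})
variants_per_gene = Counter(g for genes in matches.values() for g in genes)

print(f"Variants with gene matches: {len(with_genes)}")
print(f"Variants without gene matches: {len(without_genes)}")
print(f"Unique genes: {unique_genes}")
print(f"Variants per gene: {dict(variants_per_gene)}")
```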

## Clinical Significance
This analysis helps determine the clinical relevance of our PCA findings by identifying whether the most important variants (those driving the principal component) are located in genes that are routinely screened for secondary findings in clinical genomics.

## Output Files
- **Excel Sheets**:
  - `ParsedVariants`: Structured variant data
  - `VariantGeneMatches`: Complete variant-gene mapping results
- **Console Output**: Summary statistics and gene lists


In [1]:
%load_ext autoreload
%autoreload 2

import os

import pandas as pd  # pd is used throughout; import it explicitly rather than relying on utils

from utils import *

### Parsing the variants in Principal Component 1

In [None]:

# Step 1: Load your variants from the top5Variants sheet
variants_read = pd.read_excel("refGene.xlsx", sheet_name="top5Variants")

print("Loaded variants from refGene.xlsx - top5Variants sheet:")
print(f"Shape: {variants_read.shape}")
print(f"Columns: {variants_read.columns.tolist()}")
print("\nFirst few rows:")
print(variants_read.head())

# Parse the variant data from format "chr2:8731872:T>C" into separate columns
print("\nParsing variant data from format 'chr2:8731872:T>C'...")

# Get the Variant column
variant_column = variants_read["Variant"]

variants = []
for variant_string in variant_column:
    if pd.notna(variant_string):  # Skip NaN values
        # Split by ':' to get [chr, pos, ref_alt]
        parts = str(variant_string).split(":")
        if len(parts) == 3:
            chrom = parts[0]  # Keep 'chr' prefix (e.g., "chr2")
            pos = int(parts[1])  # Convert position to integer
            ref_alt = parts[2]  # Get ref>alt part
            
            # Split ref>alt by '>'
            if '>' in ref_alt:
                ref, alt = ref_alt.split('>')
                variants.append([chrom, pos, ref, alt])
            else:
                print(f"Warning: Could not parse ref>alt from: {ref_alt}")
        else:
            print(f"Warning: Could not parse variant string: {variant_string}")

# Create DataFrame with parsed variants
variants_df = pd.DataFrame(variants, columns=["chrom", "pos", "ref", "alt"])

print(f"\nParsed {len(variants)} variants successfully")
print(f"Final variants_df shape: {variants_df.shape}")
print("Final variants_df:")
print(variants_df.head())

# Write the parsed variants to a new sheet in the Excel file
print("\nWriting parsed variants to new sheet 'ParsedVariants' in refGene.xlsx...")

# Open the existing workbook in append mode so the other sheets are preserved;
# an existing 'ParsedVariants' sheet is replaced
with pd.ExcelWriter("refGene.xlsx", engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    variants_df.to_excel(writer, sheet_name='ParsedVariants', index=False)

print("Successfully saved parsed variants to 'ParsedVariants' sheet in refGene.xlsx")

Loaded variants from refGene.xlsx - top5Variants sheet:
Shape: (7259, 1)
Columns: ['Variant']

First few rows:
               Variant
0     chr2:8731872:T>C
1  chr12:103742538:C>T
2  chr12:105026635:C>A
3  chr12:105030141:C>A
4   chr17:47916678:C>T

Parsing variant data from format 'chr2:8731872:T>C'...

Parsed 7259 variants successfully
Final variants_df shape: (7259, 4)
Final variants_df:
   chrom        pos ref alt
0   chr2    8731872   T   C
1  chr12  103742538   C   T
2  chr12  105026635   C   A
3  chr12  105030141   C   A
4  chr17   47916678   C   T

Writing parsed variants to new sheet 'ParsedVariants' in refGene.xlsx...
Successfully saved parsed variants to 'ParsedVariants' sheet in refGene.xlsx


In [None]:
# Load the list of secondary-finding genes from the secFinding sheet
secondary_findings_file = pd.read_excel("refGene.xlsx", sheet_name="secFinding")
secFinding = secondary_findings_file["Gene"]

In [14]:
refGene_file = pd.read_excel("refGene.xlsx", sheet_name="refGene")

# Filter the refGene table down to the genes in the secondary-findings list
refGenesFiltered = refGene_file[refGene_file["gene"].isin(secFinding)]


In [None]:

# Function to check if a variant falls within a gene
def variant_in_gene(variant_chrom, variant_pos, gene_chrom, gene_start, gene_end):
    """
    Check if a variant falls within a gene's boundaries
    
    Args:
        variant_chrom: Chromosome of the variant (e.g., "chr1")
        variant_pos: Position of the variant
        gene_chrom: Chromosome of the gene (e.g., "chr1")
        gene_start: Start position of the gene
        gene_end: End position of the gene
    
    Returns:
        bool: True if variant is within gene boundaries
    """
    return (variant_chrom == gene_chrom and 
            gene_start <= variant_pos <= gene_end)

# Create a list to store results
variant_gene_matches = []

# Check each variant against each gene
print(f"\nChecking {len(variants_df)} variants against {len(refGenesFiltered)} genes...")

for idx, variant in variants_df.iterrows():
    variant_chrom = variant['chrom']
    variant_pos = variant['pos']
    
    # Find genes that contain this variant
    matching_genes = []
    
    for gene_idx, gene in refGenesFiltered.iterrows():
        gene_chrom = gene['chrom']  
        gene_start = gene['start']       
        gene_end = gene['end']           
        gene_name = gene['gene']    
        
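        # refGene can hold several transcript rows per gene, so record each gene name once
        if (gene_name not in matching_genes and
                variant_in_gene(variant_chrom, variant_pos, gene_chrom, gene_start, gene_end)):
            matching_genes.append(gene_name)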
    
    # Store the result
    variant_gene_matches.append({
        'variant': f"{variant_chrom}:{variant_pos}:{variant['ref']}>{variant['alt']}",
        'chrom': variant_chrom,
        'pos': variant_pos,
        'ref': variant['ref'],
        'alt': variant['alt'],
        'matching_genes': matching_genes,
        'num_genes': len(matching_genes)
    })

# Convert results to DataFrame
results_df = pd.DataFrame(variant_gene_matches)

print(f"\nResults summary:")
print(f"Total variants checked: {len(results_df)}")
print(f"Variants with gene matches: {len(results_df[results_df['num_genes'] > 0])}")
print(f"Variants without gene matches: {len(results_df[results_df['num_genes'] == 0])}")

# Show variants that have gene matches
variants_with_genes = results_df[results_df['num_genes'] > 0]
if len(variants_with_genes) > 0:
    print(f"\nVariants that fall within genes:")
    for idx, row in variants_with_genes.iterrows():
        print(f"  {row['variant']} -> Genes: {', '.join(row['matching_genes'])}")
else:
    print("\nNo variants found within gene boundaries.")

# Get unique gene names that contain variants
all_matching_genes = []
for genes_list in results_df['matching_genes']:
    all_matching_genes.extend(genes_list)

unique_genes_with_variants = list(set(all_matching_genes))
print(f"\nUnique genes that contain variants ({len(unique_genes_with_variants)}):")
for gene in sorted(unique_genes_with_variants):
    print(f"  - {gene}")

# Save results to Excel
print(f"\nSaving results to Excel...")
with pd.ExcelWriter("refGene.xlsx", engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    results_df.to_excel(writer, sheet_name='VariantGeneMatches', index=False)

print("Results saved to 'VariantGeneMatches' sheet in refGene.xlsx")

Checking if variants fall within specific genes...
refGenesFiltered shape: (368, 14)
refGenesFiltered columns: ['chrom', 'strand', 'start', 'end', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'gene', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13']

First few rows of refGenesFiltered:
      chrom strand      start        end  Unnamed: 4  Unnamed: 5  Unnamed: 6  \
385   chr16      -   15703134   15857028    15708810    15838252          43   
395    chr2      +  219418376  219426734   219418462   219425990           7   
1064  chr19      +   46602071   46610782    46602207    46609153           6   
1066  chr19      +   46601073   46610782    46608270    46609153           6   
2595   chr1      -  201359013  201377680   201359209   201373254          16   

                                             Unnamed: 7  \
385   15703134,15708802,15714908,15715163,15717139,1...   
395   219418376,219420094,219420250,219421483,219423...   
1064  46602071,466058