## Structural Variant Calling Output

In [1]:
import pandas as pd

#path to your new CSV file
csv_path = "/home/vaiicodes/sv_task/results/HG002.variants.csv"

#read the CSV
df = pd.read_csv(csv_path)

#print the first few variants
print(df.head(20))

#quick stats
print("\nNumber of variants:", len(df))
print("Unique chromosomes:", df['CHROM'].unique())

#most common variant types
print("\nTop 5 most common Variant Types:")
print(df['Variant_Type'].value_counts().head())

#filtering only PASS variants
pass_variants = df[df['FILTER'] == "PASS"]
print("\nNumber of PASS variants:", len(pass_variants))

#overview of FILTER categories
print("\nFILTER category counts:")
print(df['FILTER'].value_counts())


    CHROM    START       END      SIZE   QUAL   FILTER Variant_Type
0   chr21  5227400   5227561       161  240.0     PASS          DEL
1   chr21  5227477  41545919  36318442  120.0  LowQual          INV
2   chr21  5231410   5231442        32  180.0     PASS          DEL
3   chr21  5242021   5242524       503  120.0  LowQual          DEL
4   chr21  5248247   5248647       400  120.0  LowQual          DEL
5   chr21  5291906   8035546   2743640   21.0  LowQual          DUP
6   chr21  5292139   5298256      6117   21.0  LowQual          DUP
7   chr21  5292329  10755289   5462960  120.0  LowQual          INV
8   chr21  5312293  44091742  38779449   30.0  LowQual          INV
9   chr21  5312295  25350434  20038139   80.0  LowQual          DEL
10  chr21  5313072  12984442   7671370  111.0     PASS          INV
11  chr21  5313193  12984263   7671070   68.0     PASS          INV
12  chr21  5316526   7246105   1929579  155.0  LowQual          DUP
13  chr21  5317498  10778114   5460616   31.0  L
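
Several LowQual calls in the preview span tens of megabases (for example the 36,318,442 bp INV in row 1), so a per-type size summary may be a useful next check. The sketch below assumes the same column names as the output above and rebuilds the first five preview rows inline so that it runs without the CSV. `summarize_sizes` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# First five rows copied from the preview above, so the example runs standalone.
preview = pd.DataFrame(
    {
        "CHROM": ["chr21"] * 5,
        "START": [5227400, 5227477, 5231410, 5242021, 5248247],
        "END": [5227561, 41545919, 5231442, 5242524, 5248647],
        "SIZE": [161, 36318442, 32, 503, 400],
        "QUAL": [240.0, 120.0, 180.0, 120.0, 120.0],
        "FILTER": ["PASS", "LowQual", "PASS", "LowQual", "LowQual"],
        "Variant_Type": ["DEL", "INV", "DEL", "DEL", "DEL"],
    }
)


def summarize_sizes(df: pd.DataFrame) -> pd.DataFrame:
    """Count calls and report median/max SIZE per (Variant_Type, FILTER)."""
    return (
        df.groupby(["Variant_Type", "FILTER"])["SIZE"]
        .agg(n_calls="count", median_size="median", max_size="max")
        .reset_index()
    )


summary = summarize_sizes(preview)
print(summary)

# Sanity check (assumption: SIZE is END - START, which holds for these rows).
size_matches = (preview["END"] - preview["START"] == preview["SIZE"]).all()
print("SIZE == END - START for all rows:", size_matches)
```

On the full file, the same call would be `summarize_sizes(df)`.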

## GWAS-to-Drug Target Strategy for Type 2 Diabetes

## Summary
**My approach follows the biological cascade from genetic variants -> gene expression changes -> disrupted cellular processes -> druggable targets. I'll develop three linked deep learning models, each consuming the previous model's output, that build on existing architectures but are designed specifically for type 2 diabetes (T2D) drug target discovery.**

## 3-Step Linked Strategy

### Step 1: Develop SNP-to-Expression Model (DNA -> RNA Level)
- **What we're building**: A deep learning model to predict gene expression changes from T2D SNPs
- **Architecture**: Fine-tune the Enformer transformer with additional tissue-specific layers
- **Model design**: Add T2D-specific output heads for pancreatic islet, liver, and muscle tissue
- **Training**: Train on GTEx eQTL data plus T2D-specific RNA-seq datasets
- **Output**: Tissue-specific expression change predictions for each SNP -> **feeds into Step 2**
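
One common way to turn a sequence-to-expression model into per-SNP predictions is to score the reference and alternate alleles separately and take the difference per tissue. The sketch below illustrates only that readout, under assumptions: `toy_predict` is a hypothetical stand-in for the fine-tuned model (it is not Enformer and has no biological meaning), and the tissue keys and helper names are illustrative.

```python
from typing import Callable, Dict

# Tissue heads named in Step 1; the key spellings are illustrative.
TISSUES = ("pancreatic_islet", "liver", "muscle")

Predictor = Callable[[str], Dict[str, float]]


def substitute_allele(seq: str, pos: int, ref: str, alt: str) -> str:
    """Return seq with the single base at 0-based pos changed from ref to alt."""
    if seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {seq[pos]!r}")
    return seq[:pos] + alt + seq[pos + 1:]


def snp_expression_delta(predict: Predictor, seq: str, pos: int,
                         ref: str, alt: str) -> Dict[str, float]:
    """Per-tissue predicted expression change: prediction(alt) - prediction(ref)."""
    ref_pred = predict(seq)
    alt_pred = predict(substitute_allele(seq, pos, ref, alt))
    return {t: alt_pred[t] - ref_pred[t] for t in TISSUES}


def toy_predict(seq: str) -> Dict[str, float]:
    """Hypothetical placeholder model: GC fraction scaled by a made-up weight."""
    gc = sum(base in "GC" for base in seq) / len(seq)
    weights = {"pancreatic_islet": 1.0, "liver": 0.5, "muscle": -0.5}
    return {t: weights[t] * gc for t in TISSUES}


delta = snp_expression_delta(toy_predict, "ACGTACGTAC", pos=0, ref="A", alt="G")
print({t: round(v, 3) for t, v in delta.items()})
```

In the proposed pipeline, these per-tissue deltas would be the values handed to Step 2.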

### Step 2: Develop Pathway Disruption Model (RNA -> Function Level)
- **What we're building**: A neural network to predict pathway-level dysfunction from gene expression changes
- **Architecture**: Multi-task neural network with pathway-aware attention mechanisms
- **Model design**: Incorporate biological pathway structure as graph neural network components
- **Input**: **Expression predictions from Step 1** + pathway databases
- **Training**: Use known T2D pathway disruptions as supervision
- **Output**: Pathway disruption scores and affected biological processes -> **feeds into Step 3**
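
As a much simpler stand-in for the pathway-aware network described above, the sketch below aggregates Step 1 expression changes over pathway membership sets, which is the kind of structure a pathway-aware layer would encode. Gene and pathway names are placeholders, and the mean-absolute-change score is an assumption chosen for illustration, not the proposed trained model.

```python
from typing import Dict, Set


def pathway_disruption_scores(
    expr_delta: Dict[str, float],
    pathways: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Mean absolute expression change over each pathway's member genes.

    Genes without a prediction contribute 0, so sparsely covered pathways
    are not inflated.
    """
    scores = {}
    for name, genes in pathways.items():
        if not genes:
            scores[name] = 0.0
            continue
        total = sum(abs(expr_delta.get(g, 0.0)) for g in genes)
        scores[name] = total / len(genes)
    return scores


# Placeholder inputs: per-gene changes as they might arrive from Step 1.
expr_delta = {"GENE_A": -0.8, "GENE_B": 0.4, "GENE_C": 0.0, "GENE_D": 0.2}
pathways = {
    "pathway_1": {"GENE_A", "GENE_B"},
    "pathway_2": {"GENE_C", "GENE_D", "GENE_E"},  # GENE_E has no prediction
}

scores = pathway_disruption_scores(expr_delta, pathways)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Using the absolute value treats up- and down-regulation as equally disruptive. A signed or learned aggregation would be needed to distinguish them.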

### Step 3: Develop Drug Target Prioritization Model
- **What we're building**: A ranking model to identify the most promising druggable targets
- **Architecture**: Ensemble model combining pathway disruption scores with druggability features
- **Input**: **Pathway disruption scores from Step 2** + protein druggability data + tissue expression
- **Model design**: Multi-modal architecture incorporating protein structure and drug interaction data
- **Training**: Use known T2D drug targets as positive examples for supervised learning
- **Output**: Ranked list of druggable targets with therapeutic strategy recommendations
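
The ranking step can be illustrated with a plain weighted score over the three input types listed above. This is a deliberately simplified stand-in for the proposed ensemble: the `Candidate` fields, the weights, and the example values are all hypothetical, and in the actual design the weights would be learned from known T2D drug targets rather than fixed by hand.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    gene: str
    pathway_disruption: float  # from Step 2, assumed scaled to [0, 1]
    druggability: float        # assumed scaled to [0, 1]
    tissue_expression: float   # expression in a T2D-relevant tissue, [0, 1]


# Hand-picked illustrative weights (assumption); a trained model would learn these.
WEIGHTS = (0.5, 0.3, 0.2)


def priority_score(c: Candidate) -> float:
    """Weighted sum of the three evidence types for one candidate."""
    w_path, w_drug, w_expr = WEIGHTS
    return (w_path * c.pathway_disruption
            + w_drug * c.druggability
            + w_expr * c.tissue_expression)


def rank_targets(candidates: List[Candidate]) -> List[Tuple[str, float]]:
    """Return (gene, score) pairs sorted from highest to lowest priority."""
    scored = [(c.gene, priority_score(c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = [
    Candidate("GENE_A", pathway_disruption=0.9, druggability=0.2, tissue_expression=0.5),
    Candidate("GENE_B", pathway_disruption=0.6, druggability=0.9, tissue_expression=0.8),
    Candidate("GENE_C", pathway_disruption=0.1, druggability=0.7, tissue_expression=0.3),
]

ranking = rank_targets(candidates)
for gene, score in ranking:
    print(f"{gene}: {score:.2f}")
```

With these placeholder values, a moderately disrupted but highly druggable gene outranks a strongly disrupted gene with poor druggability, which is the trade-off the prioritization model is meant to capture.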