Split VCF File by Column Value

Overview

This script reads a VCF (Variant Call Format) file, pads every row to the same number of columns, and then splits the data by the value in the 20th column. Each subset is saved as a separate .vcf file.

Features

* Reads every line of the VCF file, so no rows are skipped.

* Handles irregular rows: short rows are padded with empty strings, and the header gains Extra_N names for any additional columns.

* Identifies unique values in the 20th column.

* Splits the data into multiple DataFrames based on those values.

* Saves each subset as a new VCF file in a separate folder.
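
The steps listed above can be sketched on a tiny in-memory example. This is a minimal illustration, not the notebook's exact code: the helper name `split_rows_by_column` and the sample lines are hypothetical, and the grouping column is index 2 rather than 19 to keep the sample short.

```python
from collections import defaultdict


def split_rows_by_column(lines, col_index):
    """Pad ragged tab-separated rows and group the data rows by one column.

    Hypothetical helper for illustration; lines[0] is treated as the header.
    """
    rows = [line.rstrip("\r\n").split("\t") for line in lines]
    width = max(len(row) for row in rows)
    # Name any columns the header lacks, mirroring the Extra_N convention.
    header = rows[0] + [f"Extra_{i + 1}" for i in range(width - len(rows[0]))]
    # Pad short data rows with empty strings up to the widest row.
    data = [row + [""] * (width - len(row)) for row in rows[1:]]
    groups = defaultdict(list)
    for row in data:
        groups[row[col_index]].append(row)
    return header, dict(groups)


sample = [
    "id\tpos\teffect\n",
    "a\t10\tintron_variant\textra\n",
    "b\t20\tmissense_variant\n",
    "c\t30\tintron_variant\n",
]
header, groups = split_rows_by_column(sample, col_index=2)
print(header)
print({key: len(rows) for key, rows in groups.items()})
```

The pure-Python version is only meant to make the logic visible; the notebook itself does the grouping with Pandas.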

Requirements

* Python 3.7+

* Pandas

In [16]:
import pandas as pd
import csv

In [None]:
file_path = "Your_vcf_file_name_here.vcf"

In [24]:
# Step 1: Read the file and find max column count
max_columns = 0
rows = []

with open(file_path, 'r') as f:
    for line in f:
        values = line.rstrip("\r\n").split("\t")  # Drop only the line ending (keeps empty edge fields), then split by tab
        max_columns = max(max_columns, len(values))  # Track max column count
        rows.append(values)  # Store the row data
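
Splitting on tabs by hand needs some care with whitespace trimming, because empty fields at the start or end of a line are easy to lose. One possible alternative, sketched here under the assumption that the file is plain tab-separated text, is the standard-library `csv.reader`, which keeps empty leading and trailing fields. The `io.StringIO` object is a stand-in for the opened file, and the sample content is made up.

```python
import csv
import io

# Stand-in for an open file; with a real file this would be open(file_path, newline="").
sample_file = io.StringIO(
    "id\tpos\teffect\n"
    "a\t10\t\t\n"
    "b\t20\tintron_variant\textra\n"
)

max_columns = 0
rows = []
# QUOTE_NONE treats quote characters as literal data instead of interpreting them.
reader = csv.reader(sample_file, delimiter="\t", quoting=csv.QUOTE_NONE)
for values in reader:
    max_columns = max(max_columns, len(values))
    rows.append(values)

print(max_columns)
print(rows[1])
```

Note that `csv.reader` yields an empty list for a blank line, so such lines would simply be padded later like any other short row.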

In [25]:
# Step 2: Adjust the header to match max columns
header = rows[0]  # First row is the header
extra_cols_needed = max_columns - len(header)

if extra_cols_needed > 0:
    header += [f"Extra_{i+1}" for i in range(extra_cols_needed)]  # Add missing columns

In [26]:
# Step 3: Convert all rows to match max column count
normalized_rows = [row + [""] * (max_columns - len(row)) for row in rows[1:]]  # Fill missing columns with empty strings
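
To see how many rows the padding step actually touches, the row lengths can be tallied first. The snippet below is a small sketch on made-up rows; `length_counts` and `short_rows` are illustrative names that do not appear in the notebook.

```python
from collections import Counter

# Toy rows standing in for the parsed lines (header first).
rows = [
    ["id", "pos", "effect"],
    ["a", "10", "intron_variant", "extra"],
    ["b", "20", "missense_variant"],
    ["c", "30"],
]

max_columns = max(len(row) for row in rows)
# Count how many data rows have each length.
length_counts = Counter(len(row) for row in rows[1:])
# Rows shorter than the widest row are the ones that receive padding.
short_rows = sum(n for length, n in length_counts.items() if length < max_columns)

print(dict(length_counts))
print(f"{short_rows} of {len(rows) - 1} data rows need padding")
```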

In [27]:
# Step 4: Create DataFrame
df = pd.DataFrame(normalized_rows, columns=header)

In [28]:
print(df.head())  # Verify the output

          rs# alleles chrom   pos strand assembly# center protLSID assayLSID  \
0  1pos1178.1     G/T     1  1178      +        NA     NA       NA        NA   
1  1pos1203.1     T/C     1  1203      +        NA     NA       NA        NA   
2  1pos1249.1     A/C     1  1249      +        NA     NA       NA        NA   
3  1pos1266.1     G/A     1  1266      +        NA     NA       NA        NA   
4  1pos1277.1     T/C     1  1277      +        NA     NA       NA        NA   

  panel  ...                 Extra_5 Extra_6 Extra_7    Extra_8 Extra_9  \
0    NA  ...                                                              
1    NA  ...                                                              
2    NA  ...                                                              
3    NA  ...  CHR_START-Os01g0100100                  n.1266G>A           
4    NA  ...  CHR_START-Os01g0100100                  n.1277T>C           

  Extra_10 Extra_11 Extra_12 Extra_13 Extra_14  
0                  

In [29]:
# Step 5: Identify unique values in the 20th column (index 19)
column_20_name = df.columns[19]  # Get column name
unique_values = df[column_20_name].unique()  # List unique values

In [30]:
unique_values

array(['intergenic_region', 'downstream_gene_variant',
       '5_prime_UTR_variant', 'intron_variant', 'upstream_gene_variant',
       'missense_variant', 'synonymous_variant',
       'non_coding_transcript_exon_variant', '3_prime_UTR_variant',
       'stop_gained', 'splice_region_variant&synonymous_variant',
       'splice_region_variant&intron_variant',
       'splice_acceptor_variant&intron_variant', 'stop_lost',
       'start_lost', 'missense_variant&splice_region_variant',
       'splice_donor_variant&intron_variant', 'stop_retained_variant',
       'stop_lost&splice_region_variant',
       'splice_region_variant&non_coding_transcript_exon_variant',
       'splice_donor_variant&splice_region_variant&intron_variant',
       'stop_gained&splice_region_variant',
       'splice_region_variant&stop_retained_variant',
       'initiator_codon_variant'], dtype=object)
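
Alongside the list of unique values, it can be useful to know how many rows fall into each one before writing files. The sketch below uses a toy DataFrame; the column name `effect` is a hypothetical label for the annotation column, and position 1 stands in for index 19. Selecting by position with `iloc` also avoids surprises if two columns happen to share a name.

```python
import pandas as pd

# Toy frame; "effect" is a hypothetical name for the annotation column.
df_demo = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "effect": ["intron_variant", "missense_variant", "intron_variant", ""],
})

# Select the column by position, then count rows per distinct value.
counts = df_demo.iloc[:, 1].value_counts()
print(counts)
```

An empty string counts as its own category here. In the real data, that category would appear if any row had fewer than 20 fields and was padded.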

In [31]:
import os

In [32]:
# Step 6: Split and save each part
output_dir = "split_vcf_files"
os.makedirs(output_dir, exist_ok=True)  # Create directory if not exists

In [33]:
for value in unique_values:
    subset_df = df[df[column_20_name] == value]  # Filter rows by value
    file_name = os.path.join(output_dir, f"split_{value}.vcf")
    
    # Save as tab-separated text, keeping the header row
    subset_df.to_csv(file_name, sep="\t", index=False, header=True)

    print(f"Saved: {file_name}")

print("Splitting complete! 🚀")

Saved: split_vcf_files\split_intergenic_region.vcf
Saved: split_vcf_files\split_downstream_gene_variant.vcf
Saved: split_vcf_files\split_5_prime_UTR_variant.vcf
Saved: split_vcf_files\split_intron_variant.vcf
Saved: split_vcf_files\split_upstream_gene_variant.vcf
Saved: split_vcf_files\split_missense_variant.vcf
Saved: split_vcf_files\split_synonymous_variant.vcf
Saved: split_vcf_files\split_non_coding_transcript_exon_variant.vcf
Saved: split_vcf_files\split_3_prime_UTR_variant.vcf
Saved: split_vcf_files\split_stop_gained.vcf
Saved: split_vcf_files\split_splice_region_variant&synonymous_variant.vcf
Saved: split_vcf_files\split_splice_region_variant&intron_variant.vcf
Saved: split_vcf_files\split_splice_acceptor_variant&intron_variant.vcf
Saved: split_vcf_files\split_stop_lost.vcf
Saved: split_vcf_files\split_start_lost.vcf
Saved: split_vcf_files\split_missense_variant&splice_region_variant.vcf
Saved: split_vcf_files\split_splice_donor_variant&intron_variant.vcf
Saved: split_vcf_files\s
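
Some of the values above contain `&`, which is legal in file names but awkward to handle in a shell. A possible variation, sketched below on a toy DataFrame, replaces such characters before building the path and uses `groupby` so the frame is partitioned in one pass. The helper `safe_name`, the column name `effect`, and the `empty` fallback are assumptions made for illustration. The demo writes to a temporary folder rather than `split_vcf_files`.

```python
import os
import re
import tempfile

import pandas as pd


def safe_name(value):
    """Replace characters other than letters, digits, '_', '.' and '-' with '_'.

    Hypothetical helper; falls back to "empty" for an empty string.
    """
    return re.sub(r"[^A-Za-z0-9_.-]", "_", value) or "empty"


df_demo = pd.DataFrame({
    "id": ["a", "b", "c"],
    "effect": ["stop_gained&splice_region_variant", "intron_variant", ""],
})

output_dir = tempfile.mkdtemp()  # temporary folder so the demo leaves the working directory alone
written = []
# groupby partitions the frame once instead of filtering it once per value.
for value, subset_df in df_demo.groupby("effect", sort=False):
    file_name = os.path.join(output_dir, f"split_{safe_name(value)}.vcf")
    subset_df.to_csv(file_name, sep="\t", index=False)
    written.append(os.path.basename(file_name))

print(written)
```

Two different values could map to the same sanitized name, so a real run would want to check for collisions before writing.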