Generating required files for variant effect prediction #17
-
Hello, I am interested in deploying a pre-trained model to compute GPN scores for all possible SNPs in a specific region. As I understand it, the VEP tool from Ensembl can be used to generate all the consequence annotations. Could you provide the file, or instructions on how to generate the file, used as input for the VEP tool? The VCF file found here: https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/variation/vcf/arabidopsis_thaliana/ does not seem to include all possible variants considered in the paper, correct? Essentially, I want to generate this file (minus the 'score' column) to feed to the run_vep script. Thank you!
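As far as I can tell, the file is essentially a standard VCF body: the usual eight tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), with one row per possible alternate allele at each position. A minimal illustration of what I mean (the positions and REF bases below are placeholders, not the actual TAIR10 sequence):

```
#CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO
5       3500000  .   A    C    .     .       .
5       3500000  .   A    G    .     .       .
5       3500000  .   A    T    .     .       .
```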
-
The above file seems to include all variants observed in the 1001 Genomes Project, while the file below contains all possible SNPs in a 1 Mb region. I used this code to generate the file used as input for Ensembl VEP:

```
import pandas as pd
# Imports added here for completeness; Genome is assumed to be the FASTA
# helper class from gpn.data used elsewhere in this repo.
from gpn.data import Genome


rule make_simulated_variants:
    input:
        "output/genome.fa.gz",
    output:
        # then take this file to Ensembl VEP online and run with option
        # upstream/downstream = 500
        "output/simulated_variants/variants.vcf.gz",
        "output/simulated_variants/variants.parquet",
    run:
        genome = Genome(input[0])
        chrom = "5"
        start = 3_500_000
        end = start + 1_000_000
        rows = []
        nucleotides = list("ACGT")
        # enumerate the three possible substitutions at every position in the window
        for pos in range(start, end):
            ref = genome.get_nuc(chrom, pos).upper()
            for alt in nucleotides:
                if alt == ref:
                    continue
                rows.append([chrom, pos, ".", ref, alt, ".", ".", "."])
        df = pd.DataFrame(data=rows)
        print(df)
        # headerless, tab-separated VCF body for Ensembl VEP
        df.to_csv(output[0], sep="\t", index=False, header=False)
        # keep only chrom/pos/ref/alt for the downstream GPN pipeline
        df[[0, 1, 3, 4]].rename(
            columns={0: "chrom", 1: "pos", 3: "ref", 4: "alt"}
        ).to_parquet(output[1], index=False)
```
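If you would rather script the annotation step than use the web interface, the command-line VEP should accept the same VCF. A rough sketch, assuming you have installed the Ensembl Plants cache for arabidopsis_thaliana locally, with --distance 500 mirroring the "upstream/downstream = 500" option mentioned in the comment above:

```bash
vep \
  --input_file output/simulated_variants/variants.vcf.gz \
  --output_file output/simulated_variants/vep.tsv \
  --species arabidopsis_thaliana \
  --offline --cache \
  --distance 500 \
  --tab --force_overwrite
```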