Generating required files for variant effect prediction #17
-
Hello, I am interested in deploying a pre-trained model to compute GPN scores for all possible SNPs in a specific region. As I understand it, the VEP tool from Ensembl can be used to generate all the consequence annotations. Could you provide the file, or instructions on how to generate the file, used as input for the VEP tool? The VCF file found here: https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/variation/vcf/arabidopsis_thaliana/ does not seem to include all possible variants considered in the paper, correct? Essentially, I want to generate this file (minus the 'score' column) to feed to the run_vep script. Thank you!
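As far as I can tell, the file is essentially a standard VCF body: the usual eight tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), with one row per possible alternate allele at each position. A minimal illustration of what I mean (the positions and REF bases below are placeholders, not the actual TAIR10 sequence):

```
#CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO
5       3500000  .   A    C    .     .       .
5       3500000  .   A    G    .     .       .
5       3500000  .   A    T    .     .       .
```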
-
The above file seems to include all variants observed in the 1001 Genomes Project, while the file below contains all possible SNPs in a 1 Mb region. I used this code to generate the file used as input for Ensembl VEP:

```
import pandas as pd
# Imports added here for completeness; Genome is assumed to be the FASTA
# helper class from gpn.data used elsewhere in this repo.
from gpn.data import Genome


rule make_simulated_variants:
    input:
        "output/genome.fa.gz",
    output:
        # then take this file to Ensembl VEP online and run with option
        # upstream/downstream = 500
        "output/simulated_variants/variants.vcf.gz",
        "output/simulated_variants/variants.parquet",
    run:
        genome = Genome(input[0])
        chrom = "5"
        start = 3_500_000
        end = start + 1_000_000
        rows = []
        nucleotides = list("ACGT")
        # enumerate the three possible substitutions at every position in the window
        for pos in range(start, end):
            ref = genome.get_nuc(chrom, pos).upper()
            for alt in nucleotides:
                if alt == ref:
                    continue
                rows.append([chrom, pos, ".", ref, alt, ".", ".", "."])
        df = pd.DataFrame(data=rows)
        print(df)
        # headerless, tab-separated VCF body for Ensembl VEP
        df.to_csv(output[0], sep="\t", index=False, header=False)
        # keep only chrom/pos/ref/alt for the downstream GPN pipeline
        df[[0, 1, 3, 4]].rename(
            columns={0: "chrom", 1: "pos", 3: "ref", 4: "alt"}
        ).to_parquet(output[1], index=False)
```
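If you would rather script the annotation step than use the web interface, the command-line VEP should accept the same VCF. A rough sketch, assuming you have installed the Ensembl Plants cache for arabidopsis_thaliana locally, with --distance 500 mirroring the "upstream/downstream = 500" option mentioned in the comment above:

```bash
vep \
  --input_file output/simulated_variants/variants.vcf.gz \
  --output_file output/simulated_variants/vep.tsv \
  --species arabidopsis_thaliana \
  --offline --cache \
  --distance 500 \
  --tab --force_overwrite
```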