Scripts and notebooks I used to analyze my own genome sequence.
I use a laptop with 16G RAM and Ubuntu operating system.
Here are the tools that we will be using:
- IGV genome browser - for visualization
- samtools - for processing genome alignment (BAM) files
- bcftools - for manipulating variant call format (VCF) files
- Illumina's akt - ancestry and kinship toolkit - for projecting onto 1000 genomes pca
I sequenced my genome with Dante labs (disclaimer: I am neither advertising nor getting anything from them) for two reasons:
- attractive price (250$ for 30x coverage in 2024)
- they deliver all the raw data
I came to the idea thanks to this highly recommended blogpost from my friend German.
Here are the file types that they deliver:
File | Description |
---|---|
GFX0000000_SA_L001_R1_001.fastq.gz | read1 (raw) |
GFX0000000_SA_L001_R2_001.fastq.gz | read2 (raw) |
GFX0000000.bam | reads aligned to reference genome (hg19) |
GFX0000000.bam.bai | index file for aligned reads |
GFX0000000.filtered.snp.vcf.gz | variants: single nucleotide polymorphisms (SNP) |
GFX0000000.filtered.snp.vcf.gz.tbi | index file for SNP |
GFX0000000.filtered.indel.vcf.gz | variants: insertions and deletions |
GFX0000000.filtered.indel.vcf.gz.tbi | index file for indels |
GFX0000000.cnv.vcf.gz | variants: copy number variants (CNV) |
GFX0000000.cnv.vcf.gz.tbi | index file for CNV |
Important: make sure to download all the files and keep them backed up. Dante labs will remove them after a few months.
I use IGV genome browser to look at the genomic data.
- Download IGV according to your system and load your data according to the manual:
- Choose reference genome. In my case it is Human (GRCh37/hg19). (see below)
- Important: Go to View -> Preferences -> Alignments and click "Downsample reads" (I use 100 reads per window of 50 bases), otherwise IGV will be slow on a laptop.
- Go to File -> Load from file and select your bam and/or vcf file. For example, I loaded bam and SNP vcf.
Dante labs may have update their reference to a newer one, hg38. We can use samtools to look at the header of the BAM alignment file, which stores the full command which have been used to process the data, including path to the reference.
samtools view -H GFX0000000.bam | grep --color ht-reference
In the output, we can see "v37" or "grch37" after the "--ht-reference" parameter:
@PG ID: Hash Table Build VN: 01.003.044.3.4.11-hv-7 CL: dragen --build-hash-table true --ht-reference /staging/human_g1k_v37_decoy.fasta --output-directory /staging/grch37/ --enable-cnv true --enable-rna true DS: digest_type: 1 digest: 0xF2543D4A ref_digest: 0xC2311E75 ref_index_digest: 0xB4307AF0 hash_digest: 0x7BF2A3E5
This means that the reference genome is Human (GRCh37/hg19). If you see "v38", choose Human (hg38). Other references, such as T2T, are also possible.
Let us look at one specific variant. For example, we can google for "cilantro taste SNP" to find out if I like or hate cilantro taste. Each SNP has an id, and this one's id is rs72921001. We can enter this into the search field of the IGV and the browser will center on it. Click and drag around the site to zoom in a bit more.
There are several websites where we can get information about a variant. Because our reference genome is a bit older, I prefer to use the archived version of Ensembl.
We can see some information about this variant, such as:
- what is the reference nucleotide (C)
- what is the alternate allele (A)
- what is the frequency of this allele in the population (minor allele frequency, MAF) - this seems to be a very common allele with 0.32
- what is the deleteriousness of this allele (CADD score)
- location in the genome
- links to other databases and publications
Looks like people with A like cilantro and people with C hate it. What about me? I am heterozygous, have both A and C - this can be seen by the two different colors on the vcf file, as well as in raw reads (C's are not spelled out because they are equal to reference, and A's are spelled out in green).
Note: A single T is probably an error in the sequencing process - this can happen, especially at an end of the read
Well, I used to hate cilantro as a child and I am crazy about it as an adult. I am not a neuroscientist, but this gene is an olfactory receptor (what OR in the gene name stands for) and it seems that neurons express them in monoallelic fashion - in contrast to usual situation, when both alleles of a gene are active. Could it be that cilantro loving neurons took over as I grew up? Who knows, but I am not unhappy about the fact.
We will follow this helpful resource from Merriman Lab.
VCF file from Dante labs does not contain SNP IDs (rs...). To add them, we would first need to get them from dbSNP database which is hosted by NCBI.
-
Download the dbSNP VCF from this FTP site: https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/. Again, make sure that the version of the genome corresponds to the reference that has been used by Dante labs.
If your internet connection at home is slow, you could use common SNPs file, which is 1.5G (all SNPs is 15G and would take too long to download):
00-common_all.vcf.gz
. You will also need the00-common_all.vcf.gz.tbi
index. Alternatively, download the full00-All.vcf.gz
. -
Use bcftools to annotate your VCF like this:
bcftools annotate -a 00-common_all.vcf.gz -c ID -o GFX0000000.filtered.snp.dbsnp_annotated.vcf.gz -O z GFX0000000.filtered.snp.vcf.gz
Replace GFX0000000... with your actual VCF file name.
We can use the data from 1000 genomes project to visualize where among the big populations our ancestry lies. This was inspired by this great blogpost by Wendy Wong, who found out this way that their sample was swapped by Dante labs! 😱
We will use Illumina's ancestry and kinship toolkit (akt) which already has all the necessary data in the right format. Follow instructions to clone the akt repository and install akt.
After that run the following command to project your genome onto principal components of the 1000 genomes and create the file 1000G.pca.txt
:
akt pca --assume-homref -W ./data/wgs.grch37.vcf.gz ../../genomes/GFX00000000.filtered.snp.vcf.gz > 1000G.pca.txt
Usually the first two principal components are taken for visualization. We can plot our genome together with the 1000 genomes on a scatterplot (see notebooks/Plot_1000genomes_PCA.ipynb
):
It looks plausible for me to be among the European population, so this time hopefully no sample swap!