# Intersecting sgRNA sequences with VCF

Use command-line tools (`vcftools` and `bedtools`) to merge the VCFs and intersect them with the CDS-PAMs generated in step 1.

### Prepare VCF file

1. Download gnomAD and dbSNP build 151:
- gnomAD: https://gnomad.broadinstitute.org/downloads#v2-liftover-variants
- dbSNP 151: https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/

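2. Strip the `chr` prefix from the chromosome names in the gnomAD VCF, then index the result:
```bash
# Strip the leading "chr" from record lines (header ##contig IDs are left untouched).
zcat gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.gz \
    | awk '{gsub(/^chr/,""); print}' \
    | bgzip -c > gnomad.exomes.r2.1.1.sites.liftover_grch38.NoCHR.vcf.gz \
    && tabix -p vcf gnomad.exomes.r2.1.1.sites.liftover_grch38.NoCHR.vcf.gz
```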

3. Merge gnomAD and dbSNP:
```bash
vcf-merge \
    Gnomad_V2_Exomes_hg38_liftover/gnomad.exomes.r2.1.1.sites.liftover_grch38.NoCHR.vcf.gz \
    hg38_All_20180418.vcf.gz \
    | bgzip -c > Gnomad_V2_Exomes_hg38_liftover-AND-hg38_dbSNP151_20180418.vcf.gz
```
**Warning: this can take a very long time (about 2 days).**


4. Split the merged VCF by chromosome.
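
No command is recorded for this step. One possible way to do the split is sketched below, using only `awk`. The function name `split_vcf_by_chrom` and the output directory argument are hypothetical, and the sketch assumes an uncompressed VCF stream on stdin whose chromosome names have already had the `chr` prefix stripped (as in step 2). It is not necessarily how the existing per-chromosome files were produced.

```bash
# Split an uncompressed VCF stream (stdin) into one file per chromosome,
# named OUTDIR/chr<CHROM>.vcf. Every output file gets a full copy of the header.
# Hypothetical helper; usage: zcat merged.vcf.gz | split_vcf_by_chrom OUTDIR
split_vcf_by_chrom() {
    local outdir="$1"
    mkdir -p "$outdir"
    awk -F'\t' -v outdir="$outdir" '
        # Collect all header lines (they precede the records in a valid VCF).
        /^#/ { header = header $0 "\n"; next }
        {
            out = outdir "/chr" $1 ".vcf"
            # First record seen for this chromosome: write the header first.
            if (!(out in started)) { printf "%s", header > out; started[out] = 1 }
            print > out
        }
    '
}
```

For example, `zcat Gnomad_V2_Exomes_hg38_liftover-AND-hg38_dbSNP151_20180418.vcf.gz | split_vcf_by_chrom by_chrom` would write `by_chrom/chr1.vcf`, `by_chrom/chrX.vcf`, and so on. The sketch keeps one output file open per contig, which is fine for a few dozen contigs but may hit the open-file limit of some `awk` implementations on inputs with many alternate contigs.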


5. Intersect with the CRISPR/Cas9 target sequences.


### Create the per-chromosome `.vcf` files to intersect with the `.bed` files (run in the `human_SNP` folder, NOT the `croton` folder)

The chromosome-level VCF files are generated by filtering out low-MAF variants (minor allele frequency < 0.1%):
```bash
# Filter each autosome and chrX; the filtered files are written to the current directory.
for i in $(seq 1 22) X; do
    echo "$i"
    python /mnt/ceph/users/zzhang/croton/src/variant/filter_vcf.py \
        < "/mnt/ceph/users/vli/human_SNP/Gnomad_V2_Exomes_hg38_liftover2/chr$i.vcf" \
        > "chr$i.vcf"
done
```

`filter_vcf.py` can be found in the `backend` folder.
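
The filter script is not reproduced in this section. To illustrate the kind of filtering described above, here is a minimal `awk` sketch. It is not `filter_vcf.py`: the function name `filter_vcf_by_maf` is hypothetical, and it assumes the allele frequency is stored as an `AF=` entry in the INFO column (column 8), which may not hold for every record in a merged gnomAD/dbSNP file. Records without an `AF=` entry are dropped.

```bash
# Keep header lines and records with at least one ALT allele whose minor
# allele frequency, min(AF, 1 - AF), is >= MIN_MAF (default 0.001, i.e. 0.1%).
# Hypothetical helper; assumes AF is given as "AF=..." in the INFO column.
# Usage: filter_vcf_by_maf [MIN_MAF] < in.vcf > out.vcf
filter_vcf_by_maf() {
    local min_maf="${1:-0.001}"
    awk -F'\t' -v min_maf="$min_maf" '
        /^#/ { print; next }
        {
            keep = 0
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) {
                if (info[i] !~ /^AF=/) continue
                # Multi-allelic sites list one AF per ALT allele, comma-separated.
                m = split(substr(info[i], 4), afs, ",")
                for (j = 1; j <= m; j++) {
                    af = afs[j] + 0
                    maf = (af > 0.5) ? 1 - af : af
                    if (maf >= min_maf) keep = 1
                }
            }
            if (keep) print
        }
    '
}
```

With this sketch, `filter_vcf_by_maf 0.001 < chr1.vcf > chr1.filtered.vcf` would keep only records at or above the 0.1% threshold; the file names here are placeholders.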


In [None]:
%%time
%%bash
# -loj (left outer join): keep every CDS-PAM record, even those with no overlapping variant.
for i in $(seq 1 22); do
    echo "$i"
    bedtools intersect \
        -a "./frontend/data/bed/CDSpams-byChrom/$i.bed" \
        -b "/mnt/home/zzhang/ceph/human_SNP/Gnomad_V2_Exomes_hg38_liftover/by_chrom/chr$i.vcf" \
        -loj > "./frontend/data/bed/CDSpams-byChrom/$i.intersect.bed"
done

In [None]:
%%bash
i="X"
echo "$i"
bedtools intersect \
    -a "./frontend/data/bed/CDSpams-byChrom/$i.bed" \
    -b "/mnt/home/zzhang/ceph/human_SNP/Gnomad_V2_Exomes_hg38_liftover/by_chrom/chr$i.vcf" \
    -loj > "./frontend/data/bed/CDSpams-byChrom/$i.intersect.bed"

In [None]:
%load_ext watermark
%watermark -n -u -v -iv -w