## Use bedtools to see where DMLs and MACAU loci are located.

DMLs between the Olympia oyster populations, Hood Canal and South Sound, were identified using MethylKit. File is: [analyses/dml25.bed](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/dml25.bed) 

MACAU was used to identify loci at which methylation is associated with a phenotype, in our case shell length, while controlling for relatedness. 
- Loci, all samples with 10x coverage: [analyses/macau-all10x.bed](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/macau-all10x.bed)
- Loci, any samples with 10x coverage: [analyses/macau-any10x.bed](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/analyses/macau-any10x.bed)

In [1]:
pwd

'/Users/laura/Documents/roberts-lab/paper-oly-mbdbs-gen/code'

### Make directory for BED output

In [2]:
mkdir ../analyses/BEDtools/

### Preview DML and MACAU loci bed files 

In [3]:
DML = "../analyses/dml25.bed"
!head {DML}
!wc -l {DML}

Contig102998	2220	2222	26
Contig104531	8145	8147	-37
Contig109515	3377	3379	54
Contig1104	15920	15922	29
Contig128059	154	156	-27
Contig129435	3172	3174	-26
Contig1297	49910	49912	25
Contig131260	1798	1800	-29
Contig132309	816	818	28
Contig13829	2520	2522	25
      51 ../analyses/dml25.bed
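
As a quick sanity check on the preview above, the ten rows can be parsed in Python and split by the sign of the fourth column. This is only a minimal sketch: it assumes the fourth column is the MethylKit percent methylation difference (the file has no header), and `parse_bed` is a hypothetical helper that is not part of the notebook.

```python
# The ten rows shown by `head` above, copied verbatim (tab-separated).
preview = """\
Contig102998\t2220\t2222\t26
Contig104531\t8145\t8147\t-37
Contig109515\t3377\t3379\t54
Contig1104\t15920\t15922\t29
Contig128059\t154\t156\t-27
Contig129435\t3172\t3174\t-26
Contig1297\t49910\t49912\t25
Contig131260\t1798\t1800\t-29
Contig132309\t816\t818\t28
Contig13829\t2520\t2522\t25
"""


def parse_bed(text):
    """Parse 4-column BED text into (contig, start, end, score) tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        contig, start, end, score = line.split("\t")
        records.append((contig, int(start), int(end), float(score)))
    return records


records = parse_bed(preview)

# Assumption: column 4 is a methylation difference, so its sign gives the direction.
positive = [r for r in records if r[3] > 0]
negative = [r for r in records if r[3] < 0]
print(len(records), "loci:", len(positive), "positive,", len(negative), "negative")
```

The same function could be pointed at the full file by passing `open(DML).read()`, assuming the file keeps this four-column layout throughout.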


In [4]:
macau75 = "../analyses/macau/macau-10x75perc.bed"
!head {macau75}
!wc -l {macau75}

Contig110278	3989	3989
Contig110906	429	429
Contig11177	2433	2433
Contig112997	5372	5372
Contig113403	874	874
Contig120547	984	984
Contig12795	477	477
Contig132207	6026	6026
Contig136499	1267	1267
Contig141213	6476	6476
      90 ../analyses/macau/macau-10x75perc.bed


### Set file paths for feature files 

[Olurida_v081.gene.gff](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-features/Olurida_v081.gene.gff) - genes    
[Olurida_v081.CDS.gff](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-features/Olurida_v081.CDS.gff) - Coding regions of genes    
[Olurida_v081.exon.gff](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-features/Olurida_v081.exon.gff) - Exons   
[Olurida_v081.intron.bed](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-features/Olurida_v081.intron.bed) - Introns    
[Olurida_v081.mRNA.gff](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-features/Olurida_v081.mRNA.gff) - mRNA    
[Olurida_v081.three_prime_UTR.gff](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-features/Olurida_v081.three_prime_UTR.gff) - 3' untranslated regions   
[Olurida_v081.five_prime_UTR.gff](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-features/Olurida_v081.five_prime_UTR.gff) - 5' untranslated regions   
[Olurida_v081_TE-Cg.gff](https://github.com/sr320/paper-oly-mbdbs-gen/blob/master/genome-features/Olurida_v081_TE-Cg.gff) - Transposable elements  

In [7]:
CDS = "../genome-features/Olurida_v081.CDS.gff"
exon = "../genome-features/Olurida_v081.exon.gff"
UTR5 = "../genome-features/Olurida_v081.five_prime_UTR.gff"
gene = "../genome-features/Olurida_v081.gene.gff"
intron = "../genome-features/Olurida_v081.intron.bed"
mRNA = "../genome-features/Olurida_v081.mRNA.gff"
UTR3 = "../genome-features/Olurida_v081.three_prime_UTR.gff"
TE = "../genome-features/Olurida_v081_TE-Cg.gff"
AllLoci = "../analyses/macau/macau-all-loci.bed"

In [8]:
! bedtools intersect -h


Tool:    bedtools intersect (aka intersectBed)
Version: v2.29.0
Summary: Report overlaps between two feature files.

Usage:   bedtools intersect [OPTIONS] -a <bed/gff/vcf/bam> -b <bed/gff/vcf/bam>

	Note: -b may be followed with multiple databases and/or 
	wildcard (*) character(s). 
Options: 
	-wa	Write the original entry in A for each overlap.

	-wb	Write the original entry in B for each overlap.
		- Useful for knowing _what_ A overlaps. Restricted by -f and -r.

	-loj	Perform a "left outer join". That is, for each feature in A
		report each overlap with B.  If no overlaps are found, 
		report a NULL feature for B.

	-wo	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlaps restricted by -f and -r.
		  Only A features with overlap are reported.

	-wao	Write the original A and B entries plus the number of base
		pairs of overlap between the two features.
		- Overlapping features restricted by -f 

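bedtools options to use:  
`-u` - Write the original A entry _once_ if _any_ overlaps are found in B, _i.e._ just report the fact that >=1 hit was found  
`-v` - Only report those entries in A that have _no_ overlap in B  
`-wb` - Write the original entry in B for each overlap  
`-a` - File A  
`-b` - File B  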
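
To make the difference between reporting a locus once (`-u`) and reporting only loci without any hit (`-v`) concrete, here is a minimal pure-Python sketch. It is not how bedtools is implemented: it assumes all features are already in 0-based, half-open coordinates, ignores strand and zero-length features, and uses hypothetical toy data and function names.

```python
def overlaps(a, b):
    """True if two (contig, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect_u(a_features, b_features):
    """Like `-u`: report each A feature once if it overlaps any B feature."""
    return [a for a in a_features if any(overlaps(a, b) for b in b_features)]


def intersect_v(a_features, b_features):
    """Like `-v`: report only the A features with no overlap in B."""
    return [a for a in a_features if not any(overlaps(a, b) for b in b_features)]


# Hypothetical toy data: three loci and three gene models.
loci = [("ctgA", 100, 102), ("ctgA", 500, 502), ("ctgB", 10, 12)]
genes = [("ctgA", 50, 150), ("ctgA", 90, 120), ("ctgB", 200, 300)]

hits = intersect_u(loci, genes)    # first locus, listed once despite two overlapping genes
misses = intersect_v(loci, genes)  # the two loci that fall outside every gene
print(len(hits), "overlapping;", len(misses), "non-overlapping")
```

Because every A feature lands in exactly one of the two lists, the `-u` and `-v` counts against the same B file always sum to the number of A features.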

## 1. DMLs 

In [9]:
! echo "Total methylated loci:" 
! cat {AllLoci} | wc -l

! echo "Loci differentially methylated between SS and HC populations:"
! cat {DML} | wc -l 

!echo "Loci that overlap with genes:"
! bedtools intersect \
-u \
-a {DML} \
-b {gene} | wc -l

!echo "Loci that overlap with exons:"
! bedtools intersect \
-u \
-a {DML} \
-b {exon} | wc -l

!echo "Loci that overlap with introns:"
! bedtools intersect \
-u \
-a {DML} \
-b {intron} | wc -l

!echo "Loci that overlap with coding sequences:"
! bedtools intersect \
-u \
-a {DML} \
-b {CDS} | wc -l

!echo "Loci that overlap with 3' UTRs:"
! bedtools intersect \
-u \
-a {DML} \
-b {UTR3} | wc -l

!echo "Loci that overlap with 5' UTRs:"
! bedtools intersect \
-u \
-a {DML} \
-b {UTR5} | wc -l

!echo "Loci that overlap with mRNA:"
! bedtools intersect \
-u \
-a {DML} \
-b {mRNA} | wc -l

!echo "Loci that overlap with transposable elements:"
! bedtools intersect \
-u \
-a {DML} \
-b {TE} | wc -l

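# Note: -v is run against the mRNA file only, so this counts loci outside annotated mRNA
# (loci that overlap transposable elements are not excluded here).
!echo "Loci that do not overlap with known features:"
! bedtools intersect \
-v \
-a {DML} \
-b {mRNA} | wc -l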

Total methylated loci:
  256043
Loci differentially methylated between SS and HC populations:
      51
Loci that overlap with genes:
      22
Loci that overlap with exons:
      20
Loci that overlap with introns:
       2
Loci that overlap with coding sequences:
      19
Loci that overlap with 3' UTRs:
       1
Loci that overlap with 5' UTRs:
       0
Loci that overlap with mRNA:
      22
Loci that overlap with transposable elements:
       3
Loci that do not overlap with known features:
      29


### Save DML lists to file 

In [10]:
! bedtools intersect -wb -a {DML} -b {gene} >  ../analyses/BEDtools/DML-gene.txt
! bedtools intersect -wb -a {DML} -b {exon} >  ../analyses/BEDtools/DML-exon.txt
! bedtools intersect -wb -a {DML} -b {intron} >  ../analyses/BEDtools/DML-intron.txt
! bedtools intersect -wb -a {DML} -b {CDS} >  ../analyses/BEDtools/DML-CDS.txt
! bedtools intersect -wb -a {DML} -b {UTR3} >  ../analyses/BEDtools/DML-UTR3.txt
! bedtools intersect -wb -a {DML} -b {mRNA} >  ../analyses/BEDtools/DML-mRNA.txt
! bedtools intersect -wb -a {DML} -b {TE} >  ../analyses/BEDtools/DML-TE.txt
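# Loci that overlap none of the feature files. Despite the "intragenic" file name,
# these are the loci that fall outside all annotated features.
! bedtools intersect -v -a {DML} -b {gene} {exon} {intron} {CDS} {UTR3} {UTR5} {mRNA} {TE} >  ../analyses/BEDtools/DML-intragenic.txt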
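
The repeated calls above follow one pattern, so they could also be generated in a loop. The following is a sketch under stated assumptions rather than the notebook's actual code: `build_intersect_cmd` and `save_overlaps` are hypothetical helpers, the feature paths are the ones defined earlier, and bedtools is only invoked if it is found on the `PATH` and the input file exists.

```python
import shutil
import subprocess
from pathlib import Path

# Feature files as defined earlier in the notebook.
features = {
    "gene": "../genome-features/Olurida_v081.gene.gff",
    "exon": "../genome-features/Olurida_v081.exon.gff",
    "intron": "../genome-features/Olurida_v081.intron.bed",
    "CDS": "../genome-features/Olurida_v081.CDS.gff",
    "UTR3": "../genome-features/Olurida_v081.three_prime_UTR.gff",
    "UTR5": "../genome-features/Olurida_v081.five_prime_UTR.gff",
    "mRNA": "../genome-features/Olurida_v081.mRNA.gff",
    "TE": "../genome-features/Olurida_v081_TE-Cg.gff",
}


def build_intersect_cmd(a_file, b_file):
    """Return the argument list for `bedtools intersect -wb -a A -b B`."""
    return ["bedtools", "intersect", "-wb", "-a", a_file, "-b", b_file]


def save_overlaps(a_file, prefix, out_dir="../analyses/BEDtools"):
    """Write one <prefix>-<feature>.txt file per feature type."""
    for name, path in features.items():
        out_path = Path(out_dir) / f"{prefix}-{name}.txt"
        with open(out_path, "w") as out:
            subprocess.run(build_intersect_cmd(a_file, path), stdout=out, check=True)


# Only run when bedtools and the input file are actually available.
dml = "../analyses/dml25.bed"
if shutil.which("bedtools") and Path(dml).exists():
    save_overlaps(dml, "DML")
```

Unlike the cell above, this loop also writes a 5' UTR file, which would be empty for the DMLs because no DML overlaps a 5' UTR.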

### Save background loci feature lists to files 

In [16]:
! echo "genes" 
! bedtools intersect -u -a {AllLoci} -b {gene} | wc -l
! echo "exon" 
! bedtools intersect -u -a {AllLoci} -b {exon} | wc -l
! echo "intron" 
! bedtools intersect -u -a {AllLoci} -b {intron} | wc -l
! echo "CDS" 
! bedtools intersect -u -a {AllLoci} -b {CDS} | wc -l
! echo "UTR3" 
! bedtools intersect -u -a {AllLoci} -b {UTR3} | wc -l
! echo "UTR5" 
! bedtools intersect -u -a {AllLoci} -b {UTR5} | wc -l
! echo "mRNA"
! bedtools intersect -u -a {AllLoci} -b {mRNA} | wc -l
! echo "TE" 
! bedtools intersect -u -a {AllLoci} -b {TE} | wc -l
! echo "intragenic" 
! bedtools intersect -v -a {AllLoci} -b {gene} {exon} {intron} {CDS} {UTR3} {UTR5} {mRNA} {TE} | wc -l

genes
   98741
exon
   74880
intron
   23983
CDS
   72939
UTR3
    1581
UTR5
     411
mRNA
   98741
TE
   15510
intragenic
  144501
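
To put the DML counts in context, the share of DMLs in each feature type can be compared with the share of all methylated loci in that feature type. The sketch below only re-uses the counts printed in this notebook; the dictionary names and the `percent` helper are hypothetical, and no statistical test is implied.

```python
# Counts copied from the notebook output above.
total_loci = 256043
total_dml = 51

background = {"gene": 98741, "exon": 74880, "intron": 23983, "CDS": 72939,
              "UTR3": 1581, "UTR5": 411, "mRNA": 98741, "TE": 15510}
dml = {"gene": 22, "exon": 20, "intron": 2, "CDS": 19,
       "UTR3": 1, "UTR5": 0, "mRNA": 22, "TE": 3}


def percent(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


summary = {}
for feature in background:
    dml_pct = percent(dml[feature], total_dml)
    bg_pct = percent(background[feature], total_loci)
    summary[feature] = (dml_pct, bg_pct)
    print(f"{feature:6s} DML {dml_pct:5.1f}%   all loci {bg_pct:5.1f}%")
```

With only 51 DMLs, differences of a few percentage points correspond to one or two loci, so the comparison is descriptive at best.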


In [17]:
! bedtools intersect -wb -a {AllLoci} -b {gene} >  ../analyses/BEDtools/AllLoci-gene.txt
! bedtools intersect -wb -a {AllLoci} -b {exon} >  ../analyses/BEDtools/AllLoci-exon.txt
! bedtools intersect -wb -a {AllLoci} -b {intron} >  ../analyses/BEDtools/AllLoci-intron.txt
! bedtools intersect -wb -a {AllLoci} -b {CDS} >  ../analyses/BEDtools/AllLoci-CDS.txt
! bedtools intersect -wb -a {AllLoci} -b {UTR3} >  ../analyses/BEDtools/AllLoci-UTR3.txt
! bedtools intersect -wb -a {AllLoci} -b {UTR5} >  ../analyses/BEDtools/AllLoci-UTR5.txt
! bedtools intersect -wb -a {AllLoci} -b {mRNA} >  ../analyses/BEDtools/AllLoci-mRNA.txt
! bedtools intersect -wb -a {AllLoci} -b {TE} >  ../analyses/BEDtools/AllLoci-TE.txt
! bedtools intersect -v -a {AllLoci} -b {gene} {exon} {intron} {CDS} {UTR3} {UTR5} {mRNA} {TE} >  ../analyses/BEDtools/AllLoci-intragenic.txt

## 2. MACAU Loci 

In [12]:
! echo "Total methylated loci:" 
! cat ../analyses/macau/macau-all-loci.bed | wc -l

! echo "Loci associated with shell length (MACAU):"
! cat {macau75} | wc -l 

!echo "Loci that overlap with genes:"
! bedtools intersect \
-u \
-a {macau75} \
-b {gene} | wc -l

!echo "Loci that overlap with exons:"
! bedtools intersect \
-u \
-a {macau75} \
-b {exon} | wc -l

!echo "Loci that overlap with introns:"
! bedtools intersect \
-u \
-a {macau75} \
-b {intron} | wc -l

!echo "Loci that overlap with coding sequences:"
! bedtools intersect \
-u \
-a {macau75} \
-b {CDS} | wc -l

!echo "Loci that overlap with 3' UTRs:"
! bedtools intersect \
-u \
-a {macau75} \
-b {UTR3} | wc -l

!echo "Loci that overlap with 5' UTRs:"
! bedtools intersect \
-u \
-a {macau75} \
-b {UTR5} | wc -l

!echo "Loci that overlap with mRNA:"
! bedtools intersect \
-u \
-a {macau75} \
-b {mRNA} | wc -l

!echo "Loci that overlap with transposable elements:"
! bedtools intersect \
-u \
-a {macau75} \
-b {TE} | wc -l

# Note: as for the DMLs, -v is run against the mRNA file only.
!echo "Loci that do not overlap with known features:"
! bedtools intersect \
-v \
-a {macau75} \
-b {mRNA} | wc -l

Total methylated loci:
  256043
Loci associated with shell length (MACAU):
      90
Loci that overlap with genes:
      40
Loci that overlap with exons:
      36
Loci that overlap with introns:
       4
Loci that overlap with coding sequences:
      35
Loci that overlap with 3' UTRs:
       1
Loci that overlap with 5' UTRs:
       0
Loci that overlap with mRNA:
      40
Loci that overlap with transposable elements:
       1
Loci that do not overlap with known features:
      50
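
The counts for both locus sets can be checked for internal consistency with simple arithmetic. This sketch uses only the numbers printed above; the key `outside_mRNA` is a hypothetical label for the last count, which comes from running `-v` against the mRNA file.

```python
# Counts copied from the DML and MACAU outputs in this notebook.
counts = {
    "DML": {"total": 51, "gene": 22, "exon": 20, "intron": 2,
            "mRNA": 22, "outside_mRNA": 29},
    "MACAU": {"total": 90, "gene": 40, "exon": 36, "intron": 4,
              "mRNA": 40, "outside_mRNA": 50},
}


def check(c):
    """Return named consistency checks for one set of counts."""
    return {
        "mRNA + outside_mRNA == total": c["mRNA"] + c["outside_mRNA"] == c["total"],
        "exon + intron == gene": c["exon"] + c["intron"] == c["gene"],
    }


results = {name: check(c) for name, c in counts.items()}
for name, checks in results.items():
    for label, ok in checks.items():
        print(f"{name}: {label}: {ok}")
```

The first identity must hold because `-u` and `-v` against the same file partition the input. The second holds for these two sets but is not guaranteed in general, since a locus could overlap an exon of one gene model and an intron of another.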


### Save MACAU lists to file 

In [13]:
! bedtools intersect -wb -a {macau75} -b {gene} >  ../analyses/BEDtools/macau75-gene.txt
! bedtools intersect -wb -a {macau75} -b {exon} >  ../analyses/BEDtools/macau75-exon.txt
! bedtools intersect -wb -a {macau75} -b {intron} >  ../analyses/BEDtools/macau75-intron.txt
! bedtools intersect -wb -a {macau75} -b {CDS} >  ../analyses/BEDtools/macau75-CDS.txt
! bedtools intersect -wb -a {macau75} -b {UTR3} >  ../analyses/BEDtools/macau75-UTR3.txt
! bedtools intersect -wb -a {macau75} -b {mRNA} >  ../analyses/BEDtools/macau75-mRNA.txt
! bedtools intersect -wb -a {macau75} -b {TE} >  ../analyses/BEDtools/macau75-TE.txt
! bedtools intersect -v -a {macau75} -b {gene} {exon} {intron} {CDS} {UTR3} {UTR5} {mRNA} {TE} >  ../analyses/BEDtools/macau75-intragenic.txt

## Merge with blastx annotations 

In [6]:
! curl https://raw.githubusercontent.com/sr320/paper-oly-mbdbs-gen/master/analyses/Olgene_blastx_uniprot.05.tab \
    > ../data/Olgene_blastx_uniprot.05.tab

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1037k  100 1037k    0     0  1383k      0 --:--:-- --:--:-- --:--:-- 1381k


In [7]:
#convert pipes to tab
!tr '|' '\t' < ../data/Olgene_blastx_uniprot.05.tab \
> ../data/Olgene_blastx_uniprot.05-20191122.tab

In [9]:
#Reduce the number of columns using awk. Sort, and save as a new file.
!awk -v OFS='\t' '{print $1, $3, $13}' \
< ../data/Olgene_blastx_uniprot.05-20191122.tab | sort \
> ../data/Olgene_blastx_uniprot.05-20191122-sort.tab
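
The `tr` and `awk` steps can be mirrored in Python to see which fields survive. This is a sketch under an assumption about the input layout: it supposes BLAST tabular output whose subject ID has the form `sp|ACCESSION|NAME`, which would make field 3 the accession and field 13 the e-value after the pipes are converted. The example line is hypothetical apart from the contig, accession, entry name and e-value, which appear elsewhere in this notebook.

```python
def reduce_blast_line(line):
    """Convert pipes to tabs, then keep fields 1, 3 and 13 (1-based), as the awk call does."""
    fields = line.rstrip("\n").replace("|", "\t").split("\t")
    return fields[0], fields[2], fields[12]


# Hypothetical input line; the numeric alignment columns are made up for illustration.
example = (
    "Contig100018:1232-2375\tsp|P31695|NOTC4_MOUSE\t30.0\t100\t60\t2"
    "\t1\t300\t500\t600\t2.23e-06\t50.0"
)

contig, accession, evalue = reduce_blast_line(example)
print(contig, accession, evalue, sep="\t")
```

One difference from the shell version: `awk` splits on any run of whitespace by default, whereas this sketch splits strictly on tabs, so the two would diverge on lines that contain spaces or empty fields.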

In [10]:
! head ../data/Olgene_blastx_uniprot.05-20191122-sort.tab

Contig100018:1232-2375	P31695	2.23e-06
Contig100073:8284-10076	H2A0L8	6.98e-24
Contig100101:2158-2821	O35796	3.67e-28
Contig100107:1089-2009	Q2KMM2	8.78e-15
Contig100114:437-2094	Q9V4M2	1.41e-09
Contig100163:2402-6678	P23708	2.55e-18
Contig100166:542-2054	G5EBR3	2.08e-11
Contig100170:472-1350	Q5F3T9	9.14e-42
Contig100188:460-2761	Q8TD26	1.35e-18
Contig100206:5719-12338	Q2HJH1	1.51e-14


In [70]:
#Uniprot codes have ".1" appended, so those need to be removed.
#Isolate the first column with cut -f1
#Flip order of characters with rev
#Delete the first two characters of the reversed string (the trailing ".1") with cut -c 3-
#Flip order of characters back with rev
#The result can then be added as a new column to the annotated table with paste

!cut -f1 temporary/olurida-blast-sort.tab \
| rev \
| cut -c 3- \
| rev \
> temporary/olurida-blast-sort2.tab
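
For comparison, the `rev | cut -c 3- | rev` idiom amounts to dropping the last two characters of each value. The sketch below shows that equivalent alongside a more defensive variant that only removes a trailing `.<digits>` when one is present. Both function names are hypothetical, and the accession `P31695.1` is an illustrative value, not taken from the file.

```python
def strip_last_two(value):
    """Equivalent of `rev | cut -c 3- | rev`: drop the final two characters."""
    return value[:-2]


def strip_version(accession):
    """Drop a trailing '.<digits>' suffix only when it is actually present."""
    base, dot, version = accession.rpartition(".")
    return base if dot and version.isdigit() else accession


print(strip_last_two("P31695.1"))  # suffix removed
print(strip_version("P31695.1"))   # suffix removed
print(strip_last_two("P31695"))    # no suffix: two real characters are lost
print(strip_version("P31695"))     # no suffix: value left unchanged
```

The fixed-width approach is fine as long as every value ends in a two-character suffix; the second variant avoids silently truncating values that do not.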

In [20]:
!curl http://owl.fish.washington.edu/halfshell/bu-alanine-wd/17-07-20/uniprot-SP-GO.sorted \
    > ../data/uniprot-SP-GO-sorted.tab

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  340M  100  340M    0     0  2083k      0  0:02:47  0:02:47 --:--:-- 2187k


In [21]:
! head ../data/uniprot-SP-GO-sorted.tab

A0A023GPI8	LECA_CANBL	reviewed	Lectin alpha chain (CboL) [Cleaved into: Lectin beta chain; Lectin gamma chain]		Canavalia boliviana	237			mannose binding [GO:0005537]; metal ion binding [GO:0046872]	mannose binding [GO:0005537]; metal ion binding [GO:0046872]	GO:0005537; GO:0046872
A0A023GPJ0	CDII_ENTCC	reviewed	Immunity protein CdiI	cdiI ECL_04450.1	Enterobacter cloacae subsp. cloacae (strain ATCC 13047 / DSM 30054 / NBRC 13535 / NCDC 279-56)	145					
A0A023PXA5	YA19A_YEAST	reviewed	Putative uncharacterized protein YAL019W-A	YAL019W-A	Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)	189					
A0A023PXB0	YA019_YEAST	reviewed	Putative uncharacterized protein YAR019W-A	YAR019W-A	Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)	110					
A0A023PXB5	IRC2_YEAST	reviewed	Putative uncharacterized membrane protein IRC2 (Increased recombination centers protein 2)	IRC2 YDR112W	Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)	102		i

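Join the second column of the first file (the UniProt accession) with the first column of the second file (`-1 2 -2 1`). Note that `join` expects both inputs to be sorted on their join fields.

The files are tab delimited, and the output should also be tab delimited (`-t $'\t'`)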

In [22]:
! join -1 2 -2 1 -t $'\t' \
../data/Olgene_blastx_uniprot.05-20191122-sort.tab \
../data/uniprot-SP-GO-sorted.tab \
> ../data/Oly_blastx_uniprot.tab
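
The same join can be sketched in Python with a dictionary lookup, which does not depend on the sort order of either input. This is an illustration only: `join_on_accession` is a hypothetical helper, the two BLAST rows are copied from the preview above, and the annotation table is a hypothetical, heavily abbreviated stand-in for the UniProt file.

```python
# Rows from the sorted BLAST table: (contig, accession, e-value).
blast_rows = [
    ("Contig100018:1232-2375", "P31695", "2.23e-06"),
    ("Contig100073:8284-10076", "H2A0L8", "6.98e-24"),
]

# Hypothetical, abbreviated annotation table keyed by UniProt accession.
annotations = {
    "P31695": ("NOTC4_MOUSE", "reviewed"),
}


def join_on_accession(rows, table):
    """Inner join: accession first, then the other BLAST fields, then the annotation fields."""
    joined = []
    for contig, accession, evalue in rows:
        if accession in table:  # rows without an annotation are dropped
            joined.append((accession, contig, evalue) + table[accession])
    return joined


merged = join_on_accession(blast_rows, annotations)
for row in merged:
    print("\t".join(row))
```

The output column order mirrors that of `join`: the join field comes first, followed by the remaining fields of the first input and then those of the second.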

In [18]:
! head ../data/Oly_blastx_uniprot.tab

P31695	Contig100018:1232-2375	2.23e-06	NOTC4_MOUSE	reviewed	Neurogenic locus notch homolog protein 4 (Notch 4) [Cleaved into: Transforming protein Int-3; Notch 4 extracellular truncation; Notch 4 intracellular domain]	Notch4 Int-3 Int3	Mus musculus (Mouse)	1964	branching involved in blood vessel morphogenesis [GO:0001569]; cell differentiation [GO:0030154]; embryo development [GO:0009790]; mammary gland development [GO:0030879]; morphogenesis of a branching structure [GO:0001763]; negative regulation of endothelial cell differentiation [GO:0045602]; negative regulation of Notch signaling pathway [GO:0045746]; Notch signaling pathway [GO:0007219]; positive regulation of angiogenesis [GO:0045766]; positive regulation of aorta morphogenesis [GO:1903849]; regulation of Notch signaling pathway [GO:0008593]; regulation of protein localization [GO:0032880]; regulation of protein processing [GO:0070613]; regulation of transcription, DNA-templated [GO:0006355]; transcription, DNA-templated [GO: