# Methylation landscape analysis

In this notebook, I'll characterize the methylation landscape using output from [`BAT_summarize`](https://github.com/yaaminiv/killifish-hypoxia-RRBS/tree/main/output/05-analysis/summarize).

## 0. Prepare notebook for analysis

In [1]:
!pwd

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS/code


In [2]:
cd ../output/05-analysis/

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS/output/05-analysis


In [3]:
cd new-genome/

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS/output/05-analysis/new-genome


In [4]:
!mkdir methylation-landscape

In [5]:
cd methylation-landscape/

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS/output/05-analysis/new-genome/methylation-landscape


In [6]:
!which bedtools

/opt/homebrew/bin/bedtools


In [7]:
bedtoolsDirectory = "/opt/homebrew/bin/"

In [8]:
#Import pandas for this notebook
import pandas as pd
print(pd.__version__)

0.25.1


## 1. Create union file

I want to examine the data in two ways: one that keeps CpGs with missing values in some samples, and one that only includes CpGs with data in every sample. First, I will use `unionBedGraphs` to create a file that retains CpGs with missing data for methylation landscape analysis.
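
To make the "with missing values" versus "without" distinction concrete, here is a minimal pandas sketch. The two toy samples and their values are hypothetical, and the outer merge only approximates what `unionBedGraphs` does with `-filler N/A`; it is not the tool's implementation.

```python
import pandas as pd

# Toy per-sample tables (hypothetical values, not taken from the real bedGraphs)
sample_a = pd.DataFrame(
    {"chrom": ["chr1", "chr1"], "start": [10, 20], "end": [11, 21], "A": [0.50, 1.00]}
)
sample_b = pd.DataFrame(
    {"chrom": ["chr1", "chr1"], "start": [20, 30], "end": [21, 31], "B": [0.25, 0.00]}
)

keys = ["chrom", "start", "end"]

# Outer merge: keep every CpG seen in either sample, missing values stay as NaN
union = (
    sample_a.merge(sample_b, on=keys, how="outer")
    .sort_values(keys)
    .reset_index(drop=True)
)

# Inner merge: keep only CpGs with data in both samples
complete = sample_a.merge(sample_b, on=keys, how="inner")

print(union.to_string(na_rep="N/A", index=False))
print(complete.to_string(index=False))
```

The union table has more rows than the complete table because it keeps CpGs covered in only one sample.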

In [17]:
#Find files to concatenate
!find /Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/*sort.bedgraph

/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_20-N4_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_20-S1_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_20-S3_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_20-S4_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_5-N1_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_5-N2_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_5-S3_CG.sort.bedgraph
/Volumes/

In [16]:
!tail /Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_20-N4_CG.sort.bedgraph

NW_023397471.1	20968	20969	0.00
NW_023397471.1	20977	20978	0.00
NW_023397471.1	22730	22731	0.50
NW_023397471.1	22762	22763	0.33
NW_023397471.1	22789	22790	0.94
NW_023397471.1	22797	22798	1.00
NW_023397471.1	22824	22825	0.78
NW_023397471.1	22879	22880	0.97
NW_023397471.1	22892	22893	0.99
NW_023397471.1	22914	22915	0.81


In [18]:
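#Create a union bedGraph
#Use N/A when there is no data for a CpG in a sample
#Define sample IDs. -names are assigned in the order the files are passed to -i, so confirm this list matches the order in which the glob expands (see the file listing above)
#Use sorted bedgraphs
#Count the number of lines (CpGs) with data in the next cell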
!{bedtoolsDirectory}unionBedGraphs \
-header \
-filler N/A \
-names N_20-N4 N_5-N1 N_5-N2 N_20-N2 N_5-N3 N_20-N1 N_OC-N5 N_OC-N1 N_OC-N2 N_OC-N4 S_20-S1 S_20-S3 S_20-S4 S_5-S3 S_5-S4 S_5-S2 S_20-S2 S_5-S1 S_OC-S1 S_OC-S2 S_OC-S3 S_OC-S5 \
-i /Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/*sort.bedgraph \
> union_10x.bedgraph
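
Since `-names` labels files in the order they are passed to `-i`, one way to check the labels is to derive them from the file names themselves. The sketch below uses the seven file names visible in the truncated listing above. The helper `sample_name` is hypothetical, and the parsing rule (fifth underscore-separated field, population letter taken from the part after the hyphen) is an assumption inferred from those names. Python's `sorted` only approximates shell glob ordering.

```python
import os

# File names visible in the (truncated) find output above
files = [
    "190626_I114_FCH7TVNBBXY_L2_5-N1_CG.sort.bedgraph",
    "190626_I114_FCH7TVNBBXY_L2_20-S1_CG.sort.bedgraph",
    "190626_I114_FCH7TVNBBXY_L2_20-N4_CG.sort.bedgraph",
    "190626_I114_FCH7TVNBBXY_L2_5-S3_CG.sort.bedgraph",
    "190626_I114_FCH7TVNBBXY_L2_20-S3_CG.sort.bedgraph",
    "190626_I114_FCH7TVNBBXY_L2_5-N2_CG.sort.bedgraph",
    "190626_I114_FCH7TVNBBXY_L2_20-S4_CG.sort.bedgraph",
]


def sample_name(path):
    """Hypothetical helper: build an ID such as N_20-N4 from a file name."""
    sample = os.path.basename(path).split("_")[4]  # e.g. "20-N4"
    population = sample.split("-")[1][0]           # "N" or "S"
    return f"{population}_{sample}"


# Sort first so the names follow the same order as the sorted file list
names_in_file_order = [sample_name(f) for f in sorted(files)]
print(" ".join(names_in_file_order))
```

The printed string could be pasted after `-names`, or compared against a hand-written list to catch a label assigned to the wrong file.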

In [19]:
#Check output
!head union_10x.bedgraph
!wc -l union_10x.bedgraph

chrom	start	end	N_20-N4	N_5-N1	N_5-N2	N_20-N2	N_5-N3	N_20-N1	N_OC-N5	N_OC-N1	N_OC-N2	N_OC-N4	S_20-S1	S_20-S3	S_20-S4	S_5-S3	S_5-S4	S_5-S2	S_20-S2	S_5-S1	S_OC-S1	S_OC-S2	S_OC-S3	S_OC-S5
NC_012312.1	60	61	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.00	N/A	0.00	0.00	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
NC_012312.1	61	62	0.00	N/A	N/A	0.00	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.00	N/A	N/A	N/A	N/A	N/A	N/A
NC_012312.1	126	127	N/A	0.00	N/A	0.00	N/A	N/A	N/A	N/A	N/A	N/A	0.00	N/A	0.00	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
NC_012312.1	127	128	N/A	N/A	0.00	N/A	N/A	N/A	0.00	0.00	N/A	N/A	N/A	0.00	0.00	N/A	N/A	N/A	0.00	N/A	N/A	N/A	N/A	N/A
NC_012312.1	296	297	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.01	0.01	0.02	0.05	0.01	0.02	0.04	0.03	0.01	0.01	0.01	N/A	0.03
NC_012312.1	307	308	N/A	0.00	N/A	N/A	0.00	N/A	N/A	N/A	0.00	N/A	N/A	0.55	0.00	N/A	N/A	N/A	0.00	0.00	N/A	N/A	0.00	N/A
NC_012312.1	308	309	0.00	0.07	0.00	0.00	0.00	0.00	N/A	0.00	0.00	N/A	0.00	N/A	0.00	N/A	N/A	0.00	0.00	N/A	N/A	0.01	N/A	N/A
NC_012312.1	31

In [20]:
#Import data into pandas
#Check head
df = pd.read_table("union_10x.bedgraph")
df.head(5)

Unnamed: 0,chrom,start,end,N_20-N4,N_5-N1,N_5-N2,N_20-N2,N_5-N3,N_20-N1,N_OC-N5,...,S_20-S4,S_5-S3,S_5-S4,S_5-S2,S_20-S2,S_5-S1,S_OC-S1,S_OC-S2,S_OC-S3,S_OC-S5
0,NC_012312.1,60,61,,,,,,,,...,0.0,,,,,,,,,
1,NC_012312.1,61,62,0.0,,,0.0,,,,...,,,,0.0,,,,,,
2,NC_012312.1,126,127,,0.0,,0.0,,,,...,0.0,,,,,,,,,
3,NC_012312.1,127,128,,,0.0,,,,0.0,...,0.0,,,,0.0,,,,,
4,NC_012312.1,296,297,,,,,,,,...,0.05,0.01,0.02,0.04,0.03,0.01,0.01,0.01,,0.03


In [21]:
#Average all samples for total genome methylation information and save as a new column
#NAs are skipped when averaging (pandas default, skipna=True)
#Check output
df['average'] = df[['N_20-N4', 'N_5-N1', 'N_5-N2', 'N_20-N2', 'N_5-N3', 'N_20-N1', 'N_OC-N5', 'N_OC-N1', 'N_OC-N2', 'N_OC-N4', 'S_20-S1', 'S_20-S3', 'S_20-S4', 'S_5-S3', 'S_5-S4', 'S_5-S2', 'S_20-S2', 'S_5-S1', 'S_OC-S1', 'S_OC-S2', 'S_OC-S3', 'S_OC-S5']].mean(axis=1)
df.head(10)

Unnamed: 0,chrom,start,end,N_20-N4,N_5-N1,N_5-N2,N_20-N2,N_5-N3,N_20-N1,N_OC-N5,...,S_5-S3,S_5-S4,S_5-S2,S_20-S2,S_5-S1,S_OC-S1,S_OC-S2,S_OC-S3,S_OC-S5,average
0,NC_012312.1,60,61,,,,,,,,...,,,,,,,,,,0.0
1,NC_012312.1,61,62,0.0,,,0.0,,,,...,,,0.0,,,,,,,0.0
2,NC_012312.1,126,127,,0.0,,0.0,,,,...,,,,,,,,,,0.0
3,NC_012312.1,127,128,,,0.0,,,,0.0,...,,,,0.0,,,,,,0.0
4,NC_012312.1,296,297,,,,,,,,...,0.01,0.02,0.04,0.03,0.01,0.01,0.01,,0.03,0.020833
5,NC_012312.1,307,308,,0.0,,,0.0,,,...,,,,0.0,0.0,,,0.0,,0.06875
6,NC_012312.1,308,309,0.0,0.07,0.0,0.0,0.0,0.0,,...,,,0.0,0.0,,,0.01,,,0.006154
7,NC_012312.1,319,320,,,,,0.0,,,...,,,,0.0,0.0,,,0.0,,0.0
8,NC_012312.1,320,321,0.0,0.0,0.0,0.08,0.0,0.0,,...,,,0.0,0.0,,,0.0,,,0.006154
9,NC_012312.1,321,322,,,,,0.01,,,...,,,,0.03,0.0,,,0.0,,0.005714
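
A small sketch of how the row-wise mean treats missing values. The toy columns `s1`, `s2`, and `s3` are hypothetical. It also counts how many samples contributed to each average, which may help when interpreting CpGs covered in only a few samples.

```python
import numpy as np
import pandas as pd

# Toy methylation table with missing values (hypothetical samples s1-s3)
toy = pd.DataFrame(
    {
        "s1": [0.0, np.nan, 0.5],
        "s2": [np.nan, np.nan, 1.0],
        "s3": [0.0, 0.2, np.nan],
    }
)
sample_cols = ["s1", "s2", "s3"]

# mean(axis=1) skips NaN by default, so each row is averaged over the samples with data
toy["average"] = toy[sample_cols].mean(axis=1)

# Number of samples that actually contributed to each average
toy["n_samples"] = toy[sample_cols].notna().sum(axis=1)

print(toy)
```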


In [22]:
#Save dataframe as a tab-delimited file, writing missing values as N/A and no quotes (quoting = 3 is csv.QUOTE_NONE). The row index is written as the first column.
df.to_csv("all-samples-averages-union.bedgraph", sep = "\t", na_rep = "N/A", quoting = 3)

In [23]:
!head all-samples-averages-union.bedgraph
!wc -l all-samples-averages-union.bedgraph

	chrom	start	end	N_20-N4	N_5-N1	N_5-N2	N_20-N2	N_5-N3	N_20-N1	N_OC-N5	N_OC-N1	N_OC-N2	N_OC-N4	S_20-S1	S_20-S3	S_20-S4	S_5-S3	S_5-S4	S_5-S2	S_20-S2	S_5-S1	S_OC-S1	S_OC-S2	S_OC-S3	S_OC-S5	average
0	NC_012312.1	60	61	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.0	N/A	0.0	0.0	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.0
1	NC_012312.1	61	62	0.0	N/A	N/A	0.0	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.0	N/A	N/A	N/A	N/A	N/A	N/A	0.0
2	NC_012312.1	126	127	N/A	0.0	N/A	0.0	N/A	N/A	N/A	N/A	N/A	N/A	0.0	N/A	0.0	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.0
3	NC_012312.1	127	128	N/A	N/A	0.0	N/A	N/A	N/A	0.0	0.0	N/A	N/A	N/A	0.0	0.0	N/A	N/A	N/A	0.0	N/A	N/A	N/A	N/A	N/A	0.0
4	NC_012312.1	296	297	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.01	0.01	0.02	0.05	0.01	0.02	0.04	0.03	0.01	0.01	0.01	N/A	0.03	0.020833333333333332
5	NC_012312.1	307	308	N/A	0.0	N/A	N/A	0.0	N/A	N/A	N/A	0.0	N/A	N/A	0.55	0.0	N/A	N/A	N/A	0.0	0.0	N/A	N/A	0.0	N/A	0.06875
6	NC_012312.1	308	309	0.0	0.07	0.0	0.0	0.0	0.0	N/A	0.0	0.0	N/A	0.0	N/A	0.0	N/A	N/A	0.0	0.0	N

In [24]:
#Confirm column number with average methylation information
!cut -f27 all-samples-averages-union.bedgraph | head

average
0.0
0.0
0.0
0.0
0.020833333333333332
0.06875
0.006153846153846154
0.0
0.006153846153846154
cut: stdout: Broken pipe
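
The average ends up in field 27 rather than 26 because `to_csv` writes the row index as an unnamed first column. The toy sketch below (hypothetical one-row table) shows how the 1-based field number used by `cut` and `awk` shifts depending on whether the index is written.

```python
import io

import pandas as pd

# Toy one-row table with the same kind of layout (hypothetical values)
toy = pd.DataFrame({"chrom": ["chr1"], "start": [60], "end": [61], "average": [0.0]})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index, sep="\t", na_rep="N/A")

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, sep="\t", na_rep="N/A", index=False)

header_with = with_index.getvalue().splitlines()[0].split("\t")
header_without = without_index.getvalue().splitlines()[0].split("\t")

# 1-based field numbers, as used by cut -f and awk
field_with = header_with.index("average") + 1
field_without = header_without.index("average") + 1
print(field_with, field_without)
```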


In [25]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-samples-averages-union.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($27 > 0) { print $27 }}' \
| wc -l

  354151


In [16]:
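#Number of unmethylated CpGs (total CpGs minus methylated CpGs)
#These hard-coded values do not match the count printed by the previous cell (354151), so recheck them before use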
5413382 - 4339834

1073548

In [14]:
#Import data into pandas
#Calculate average methylation (pandas ignores NAs)
df = pd.read_table("all-samples-averages-union.bedgraph")
df[['average']].mean()

average    0.588945
dtype: float64

In [10]:
#Calculate average methylation using awk
#Answer matches pandas because the average column has no N/A values. awk does not skip N/A: it would treat it as 0 and still count the row
! tail -n+2 all-samples-averages-union.bedgraph \
| awk '{ total += $27; count++ } END { print total/count }'

0.588945
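
To illustrate why the two approaches only agree when the column has no missing entries, here is a sketch with a hypothetical three-value column. `awk_number` is a simplified stand-in for awk's numeric coercion (real awk also accepts a leading numeric prefix), so treat it as an approximation.

```python
import pandas as pd

# Toy column with one missing entry (hypothetical values)
values = ["0.5", "N/A", "1.0"]

# pandas-style: N/A becomes NaN and is skipped by mean()
pandas_mean = pd.to_numeric(pd.Series(values), errors="coerce").mean()


def awk_number(text):
    """Simplified stand-in for awk: non-numeric text counts as 0."""
    try:
        return float(text)
    except ValueError:
        return 0.0


# awk-style: every row adds to the total and increments the count
awk_mean = sum(awk_number(v) for v in values) / len(values)

print(pandas_mean, awk_mean)
```

With a missing entry present, the awk-style mean is lower because the missing row is counted as a zero.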


## 2. Global methylation

### 2a. Format data

In [83]:
#metilene output from all population comparison that includes all samples
!head ../summarize/all_pop/all_pop_metilene_N_S.txt
!wc -l ../summarize/all_pop/all_pop_metilene_N_S.txt

chr	pos	N_20-N4	N_5-N1	N_5-N2	N_20-N2	N_5-N3	N_20-N1	N_OC-N5	N_OC-N1	N_OC-N2	N_OC-N4	S_20-S1	S_20-S3	S_20-S4	S_5-S3	S_5-S4	S_5-S2	S_20-S2	S_5-S1	S_OC-S1	S_OC-S2	S_OC-S3	S_OC-S5
JXMV01056319.1	89	0.15	0.07	0.00	0.16	0.05	0.35	0.22	0.52	0.09	0.00	0.06	0.05	0.06	0.12	0.03	0.10	0.00	0.04	0.02	0.08	0.02	0.00
JXMV01056319.1	145	0.00	0.28	0.00	0.00	0.35	0.03	0.00	0.01	0.01	0.27	0.08	0.03	0.01	0.00	0.01	0.33	0.00	0.00	0.02	0.03	0.34	0.75
JXMV01057363.1	4077	0.01	0.00	0.04	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.17	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
JXMV01057363.1	4120	0.13	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
JXMV01057451.1	2455	0.21	0.10	0.30	0.00	0.49	0.00	0.42	0.16	0.03	0.31	0.22	0.03	0.03	0.01	0.01	0.12	0.00	0.00	0.00	0.38	0.02	0.08
JXMV01058392.1	169	0.09	0.00	0.00	0.21	0.18	0.00	0.00	0.00	0.00	0.00	0.16	0.11	0.00	0.00	0.12	0.09	0.00	0.99	0.00	0.00	0.00	0.00
JXMV01058392.1	174	0.00	0.00	0.00	0.00	0.00	0.0

In [84]:
#Import data into pandas
#Check head
df = pd.read_table("../summarize/all_pop/all_pop_metilene_N_S.txt")
df.head(5)

Unnamed: 0,chr,pos,N_20-N4,N_5-N1,N_5-N2,N_20-N2,N_5-N3,N_20-N1,N_OC-N5,N_OC-N1,...,S_20-S4,S_5-S3,S_5-S4,S_5-S2,S_20-S2,S_5-S1,S_OC-S1,S_OC-S2,S_OC-S3,S_OC-S5
0,JXMV01056319.1,89,0.15,0.07,0.0,0.16,0.05,0.35,0.22,0.52,...,0.06,0.12,0.03,0.1,0.0,0.04,0.02,0.08,0.02,0.0
1,JXMV01056319.1,145,0.0,0.28,0.0,0.0,0.35,0.03,0.0,0.01,...,0.01,0.0,0.01,0.33,0.0,0.0,0.02,0.03,0.34,0.75
2,JXMV01057363.1,4077,0.01,0.0,0.04,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,JXMV01057363.1,4120,0.13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,JXMV01057451.1,2455,0.21,0.1,0.3,0.0,0.49,0.0,0.42,0.16,...,0.03,0.01,0.01,0.12,0.0,0.0,0.0,0.38,0.02,0.08


In [85]:
#Average all samples for total genome methylation information and save as a new column
#NAs, if present, are skipped when averaging
#Check output
df['average'] = df[['N_20-N4', 'N_5-N1', 'N_5-N2', 'N_20-N2', 'N_5-N3', 'N_20-N1', 'N_OC-N5', 'N_OC-N1', 'N_OC-N2', 'N_OC-N4', 'S_20-S1', 'S_20-S3', 'S_20-S4', 'S_5-S3', 'S_5-S4', 'S_5-S2', 'S_20-S2', 'S_5-S1', 'S_OC-S1', 'S_OC-S2', 'S_OC-S3', 'S_OC-S5']].mean(axis=1)
df.head(10)

Unnamed: 0,chr,pos,N_20-N4,N_5-N1,N_5-N2,N_20-N2,N_5-N3,N_20-N1,N_OC-N5,N_OC-N1,...,S_5-S3,S_5-S4,S_5-S2,S_20-S2,S_5-S1,S_OC-S1,S_OC-S2,S_OC-S3,S_OC-S5,average
0,JXMV01056319.1,89,0.15,0.07,0.0,0.16,0.05,0.35,0.22,0.52,...,0.12,0.03,0.1,0.0,0.04,0.02,0.08,0.02,0.0,0.099545
1,JXMV01056319.1,145,0.0,0.28,0.0,0.0,0.35,0.03,0.0,0.01,...,0.0,0.01,0.33,0.0,0.0,0.02,0.03,0.34,0.75,0.115909
2,JXMV01057363.1,4077,0.01,0.0,0.04,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01
3,JXMV01057363.1,4120,0.13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005909
4,JXMV01057451.1,2455,0.21,0.1,0.3,0.0,0.49,0.0,0.42,0.16,...,0.01,0.01,0.12,0.0,0.0,0.0,0.38,0.02,0.08,0.132727
5,JXMV01058392.1,169,0.09,0.0,0.0,0.21,0.18,0.0,0.0,0.0,...,0.0,0.12,0.09,0.0,0.99,0.0,0.0,0.0,0.0,0.088636
6,JXMV01058392.1,174,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.11,0.6,0.0,0.0,0.037273
7,JXMV01058392.1,183,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015
8,JXMV01058392.1,190,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,...,0.0,0.0,0.08,0.0,0.0,0.0,0.0,0.0,0.0,0.006364
9,JXMV01058392.1,197,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.05,0.0,0.0,0.13,0.0,0.0,0.0,0.008182


In [86]:
#Save dataframe in a tabular format and include N/As. Do not include quotes.
df.to_csv("all-samples-averages.bedgraph", sep = "\t", na_rep = "N/A", quoting = 3)

In [87]:
!head all-samples-averages.bedgraph
!wc -l all-samples-averages.bedgraph

	chr	pos	N_20-N4	N_5-N1	N_5-N2	N_20-N2	N_5-N3	N_20-N1	N_OC-N5	N_OC-N1	N_OC-N2	N_OC-N4	S_20-S1	S_20-S3	S_20-S4	S_5-S3	S_5-S4	S_5-S2	S_20-S2	S_5-S1	S_OC-S1	S_OC-S2	S_OC-S3	S_OC-S5	average
0	JXMV01056319.1	89	0.15	0.07	0.0	0.16	0.05	0.35	0.22	0.52	0.09	0.0	0.06	0.05	0.06	0.12	0.03	0.1	0.0	0.04	0.02	0.08	0.02	0.0	0.09954545454545456
1	JXMV01056319.1	145	0.0	0.28	0.0	0.0	0.35	0.03	0.0	0.01	0.01	0.27	0.08	0.03	0.01	0.0	0.01	0.33	0.0	0.0	0.02	0.03	0.34	0.75	0.11590909090909092
2	JXMV01057363.1	4077	0.01	0.0	0.04	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.17	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.010000000000000002
3	JXMV01057363.1	4120	0.13	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.005909090909090909
4	JXMV01057451.1	2455	0.21	0.1	0.3	0.0	0.49	0.0	0.42	0.16	0.03	0.31	0.22	0.03	0.03	0.01	0.01	0.12	0.0	0.0	0.0	0.38	0.02	0.08	0.1327272727272727
5	JXMV01058392.1	169	0.09	0.0	0.0	0.21	0.18	0.0	0.0	0.0	0.0	0.0	0.16	0.11	0.0	0.0	0.12	0.09	0.0	0.99	0.0	0.0	0.0	0.

### 2b. Number of methylated and unmethylated CpGs

In [18]:
#Confirm which column has the average methylation information
!cut -f26 all-samples-averages.bedgraph | head

average
0.09954545454545456
0.11590909090909092
0.010000000000000002
0.005909090909090909
0.1327272727272727
0.08863636363636364
0.03727272727272727
0.015000000000000001
0.006363636363636364


In [19]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-samples-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($26 > 0) { print $26 }}' \
| wc -l

   13723


In [13]:
#Remove header
#Count number of CpGs
! tail -n+2 all-samples-averages.bedgraph \
| wc -l

   14895


In [20]:
#Number of unmethylated CpGs
14895 - 13723

1172

In [21]:
#Calculate average methylation
! tail -n+2 all-samples-averages.bedgraph \
| awk '{ total += $26; count++ } END { print total/count }'

0.206832
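
The counts and mean above could also be computed in one step in pandas. This is a minimal sketch, not the method used in this notebook: `summarize_methylation` is a hypothetical helper, and the input values are made up for illustration.

```python
import pandas as pd


def summarize_methylation(averages, threshold=0.0):
    """Hypothetical helper: count CpGs above and at-or-below a cutoff, plus the mean."""
    averages = pd.Series(averages, dtype="float64").dropna()
    total = int(averages.size)
    methylated = int((averages > threshold).sum())
    return {
        "total": total,
        "methylated": methylated,
        "unmethylated": total - methylated,
        "mean": float(averages.mean()),
    }


# Toy per-CpG averages (hypothetical values)
toy_summary = summarize_methylation([0.10, 0.12, 0.0, 0.01, 0.0])
print(toy_summary)
```

In the notebook, the same function could be called on `df["average"]`, which avoids copying counts by hand between cells.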


## 3. Methylation by population

### 3a. New Bedford Harbor

In [27]:
!ls ../summarize/20_5_N/

N_20-N1.bedgraph         N_5-N1.bw                N_mean_HY.bedgraph
N_20-N1.bw               N_5-N2.bedgraph          N_mean_HY.bw
N_20-N2.bedgraph         N_5-N2.bw                N_mean_NO.bedgraph
N_20-N2.bw               N_5-N3.bedgraph          N_mean_NO.bw
N_20-N4.bedgraph         N_5-N3.bw                N_metilene_NO_HY.txt
N_20-N4.bw               N_diff_NO_HY.bedgraph    N_summary_NO_HY.bedgraph
N_5-N1.bedgraph          N_diff_NO_HY.bw


In [29]:
#metilene output from NBH comparison of hypoxia and normoxia samples
!head ../summarize/20_5_N/N_metilene_NO_HY.txt
!wc -l ../summarize/20_5_N/N_metilene_NO_HY.txt

chr	pos	NO_20-N4	NO_20-N2	NO_20-N1	HY_5-N1	HY_5-N2	HY_5-N3
JXMV01051582.1	7435	0.93	0.51	0.10	0.80	0.77	0.80
JXMV01051582.1	7457	0.52	0.62	0.30	0.78	0.61	0.99
JXMV01051582.1	7488	0.19	1.00	0.60	0.93	0.52	0.78
JXMV01051609.1	3478	0.02	0.67	0.97	0.12	0.63	0.85
JXMV01051609.1	3495	0.48	0.83	1.00	1.00	0.98	1.00
JXMV01051609.1	3502	0.07	0.75	0.03	1.00	0.37	0.99
JXMV01051609.1	3512	0.48	0.83	1.00	1.00	0.72	1.00
JXMV01051609.1	3518	0.59	0.83	1.00	1.00	0.72	0.99
JXMV01051609.1	3523	0.47	0.83	1.00	0.08	0.59	1.00
  176287 ../summarize/20_5_N/N_metilene_NO_HY.txt


In [30]:
#Import data into pandas
#Check head
df = pd.read_table("../summarize/20_5_N/N_metilene_NO_HY.txt")
df.head(5)

Unnamed: 0,chr,pos,NO_20-N4,NO_20-N2,NO_20-N1,HY_5-N1,HY_5-N2,HY_5-N3
0,JXMV01051582.1,7435,0.93,0.51,0.1,0.8,0.77,0.8
1,JXMV01051582.1,7457,0.52,0.62,0.3,0.78,0.61,0.99
2,JXMV01051582.1,7488,0.19,1.0,0.6,0.93,0.52,0.78
3,JXMV01051609.1,3478,0.02,0.67,0.97,0.12,0.63,0.85
4,JXMV01051609.1,3495,0.48,0.83,1.0,1.0,0.98,1.0


In [31]:
#Average all NBH samples for population-wide methylation information and save as a new column
#NAs, if present, are skipped when averaging
#Check output
df['average'] = df[['NO_20-N4', 'NO_20-N2', 'NO_20-N1', 'HY_5-N1', 'HY_5-N2', 'HY_5-N3']].mean(axis=1)
df.head(10)

Unnamed: 0,chr,pos,NO_20-N4,NO_20-N2,NO_20-N1,HY_5-N1,HY_5-N2,HY_5-N3,average
0,JXMV01051582.1,7435,0.93,0.51,0.1,0.8,0.77,0.8,0.651667
1,JXMV01051582.1,7457,0.52,0.62,0.3,0.78,0.61,0.99,0.636667
2,JXMV01051582.1,7488,0.19,1.0,0.6,0.93,0.52,0.78,0.67
3,JXMV01051609.1,3478,0.02,0.67,0.97,0.12,0.63,0.85,0.543333
4,JXMV01051609.1,3495,0.48,0.83,1.0,1.0,0.98,1.0,0.881667
5,JXMV01051609.1,3502,0.07,0.75,0.03,1.0,0.37,0.99,0.535
6,JXMV01051609.1,3512,0.48,0.83,1.0,1.0,0.72,1.0,0.838333
7,JXMV01051609.1,3518,0.59,0.83,1.0,1.0,0.72,0.99,0.855
8,JXMV01051609.1,3523,0.47,0.83,1.0,0.08,0.59,1.0,0.661667
9,JXMV01051609.1,12744,0.98,1.0,0.0,1.0,0.79,1.0,0.795
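
Rather than typing each sample column by name, the columns could be selected from the header. The sketch below uses the first two data rows of `N_metilene_NO_HY.txt` shown above. Grouping by the `NO_` and `HY_` prefixes is an assumption based on the column names in that file.

```python
import io

import pandas as pd

# First two data rows of N_metilene_NO_HY.txt, as shown above
text = (
    "chr\tpos\tNO_20-N4\tNO_20-N2\tNO_20-N1\tHY_5-N1\tHY_5-N2\tHY_5-N3\n"
    "JXMV01051582.1\t7435\t0.93\t0.51\t0.10\t0.80\t0.77\t0.80\n"
    "JXMV01051582.1\t7457\t0.52\t0.62\t0.30\t0.78\t0.61\t0.99\n"
)
df_n = pd.read_csv(io.StringIO(text), sep="\t")

# Every column except the coordinates is treated as a sample
sample_cols = [c for c in df_n.columns if c not in ("chr", "pos")]

# Assumed convention: NO_ marks normoxia samples and HY_ marks hypoxia samples
no_cols = [c for c in sample_cols if c.startswith("NO_")]
hy_cols = [c for c in sample_cols if c.startswith("HY_")]

df_n["average"] = df_n[sample_cols].mean(axis=1)
df_n["mean_NO"] = df_n[no_cols].mean(axis=1)
df_n["mean_HY"] = df_n[hy_cols].mean(axis=1)

print(df_n[["chr", "pos", "average", "mean_NO", "mean_HY"]])
```

For these two rows, the per-treatment means can be compared against the first lines of `N_mean_NO.bedgraph` and `N_mean_HY.bedgraph` as a sanity check.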


In [32]:
#Save dataframe in a tabular format and include N/As. Do not include quotes.
df.to_csv("N-samples-averages.bedgraph", sep = "\t", na_rep = "N/A", quoting = 3)

In [33]:
!head N-samples-averages.bedgraph
!wc -l N-samples-averages.bedgraph

	chr	pos	NO_20-N4	NO_20-N2	NO_20-N1	HY_5-N1	HY_5-N2	HY_5-N3	average
0	JXMV01051582.1	7435	0.93	0.51	0.1	0.8	0.77	0.8	0.6516666666666667
1	JXMV01051582.1	7457	0.52	0.62	0.3	0.78	0.61	0.99	0.6366666666666667
2	JXMV01051582.1	7488	0.19	1.0	0.6	0.93	0.52	0.78	0.67
3	JXMV01051609.1	3478	0.02	0.67	0.97	0.12	0.63	0.85	0.5433333333333333
4	JXMV01051609.1	3495	0.48	0.83	1.0	1.0	0.98	1.0	0.8816666666666667
5	JXMV01051609.1	3502	0.07	0.75	0.03	1.0	0.37	0.99	0.535
6	JXMV01051609.1	3512	0.48	0.83	1.0	1.0	0.72	1.0	0.8383333333333334
7	JXMV01051609.1	3518	0.59	0.83	1.0	1.0	0.72	0.99	0.855
8	JXMV01051609.1	3523	0.47	0.83	1.0	0.08	0.59	1.0	0.6616666666666666
  176287 N-samples-averages.bedgraph


In [34]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 N-samples-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($10 > 0) { print $10 }}' \
| wc -l

  116821


In [35]:
#Remove header
#Count number of CpGs
! tail -n+2 N-samples-averages.bedgraph \
| wc -l

  176286


In [36]:
#Number of unmethylated CpGs
176286 - 116821

59465

In [37]:
#Calculate average methylation
! tail -n+2 N-samples-averages.bedgraph \
| awk '{ total += $10; count++ } END { print total/count }'

0.212373


### 3b. Scorton Creek

In [38]:
!ls ../summarize/20_5_S/

S_20-S1.bedgraph         S_5-S1.bedgraph          S_diff_NO_HY.bedgraph
S_20-S1.bw               S_5-S1.bw                S_diff_NO_HY.bw
S_20-S2.bedgraph         S_5-S2.bedgraph          S_mean_HY.bedgraph
S_20-S2.bw               S_5-S2.bw                S_mean_HY.bw
S_20-S3.bedgraph         S_5-S3.bedgraph          S_mean_NO.bedgraph
S_20-S3.bw               S_5-S3.bw                S_mean_NO.bw
S_20-S4.bedgraph         S_5-S4.bedgraph          S_metilene_NO_HY.txt
S_20-S4.bw               S_5-S4.bw                S_summary_NO_HY.bedgraph


In [39]:
#metilene output from SC comparison of hypoxia and normoxia samples
!head ../summarize/20_5_S/S_metilene_NO_HY.txt
!wc -l ../summarize/20_5_S/S_metilene_NO_HY.txt

chr	pos	NO_20-S1	NO_20-S3	NO_20-S4	NO_20-S2	HY_5-S3	HY_5-S4	HY_5-S2	HY_5-S1
JXMV01052040.1	7874	0.29	0.06	0.19	0.33	0.25	0.35	0.07	0.31
JXMV01052596.1	1405	0.00	0.00	0.00	0.00	0.01	0.00	0.00	0.00
JXMV01052596.1	16005	0.00	0.00	0.00	0.00	0.18	0.01	0.00	0.00
JXMV01052596.1	16038	0.00	0.00	0.00	0.00	0.18	0.00	0.00	0.00
JXMV01052596.1	16084	0.00	0.00	0.00	0.00	0.18	0.00	0.00	0.00
JXMV01052596.1	16088	0.00	0.00	0.00	0.00	0.18	0.00	0.00	0.00
JXMV01052596.1	16095	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
JXMV01054326.1	10898	0.68	0.45	0.72	0.90	0.99	0.83	0.74	0.03
JXMV01054326.1	10903	0.83	0.45	0.99	0.90	0.84	0.91	0.59	1.00
   82306 ../summarize/20_5_S/S_metilene_NO_HY.txt


In [40]:
#Import data into pandas
#Check head
df = pd.read_table("../summarize/20_5_S/S_metilene_NO_HY.txt")
df.head(5)

Unnamed: 0,chr,pos,NO_20-S1,NO_20-S3,NO_20-S4,NO_20-S2,HY_5-S3,HY_5-S4,HY_5-S2,HY_5-S1
0,JXMV01052040.1,7874,0.29,0.06,0.19,0.33,0.25,0.35,0.07,0.31
1,JXMV01052596.1,1405,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0
2,JXMV01052596.1,16005,0.0,0.0,0.0,0.0,0.18,0.01,0.0,0.0
3,JXMV01052596.1,16038,0.0,0.0,0.0,0.0,0.18,0.0,0.0,0.0
4,JXMV01052596.1,16084,0.0,0.0,0.0,0.0,0.18,0.0,0.0,0.0


In [41]:
#Average all SC samples for population-wide methylation information and save as a new column
#NAs, if present, are skipped when averaging
#Check output
df['average'] = df[['NO_20-S1', 'NO_20-S3', 'NO_20-S4', 'NO_20-S2', 'HY_5-S3', 'HY_5-S4', 'HY_5-S2', 'HY_5-S1']].mean(axis=1)
df.head(10)

Unnamed: 0,chr,pos,NO_20-S1,NO_20-S3,NO_20-S4,NO_20-S2,HY_5-S3,HY_5-S4,HY_5-S2,HY_5-S1,average
0,JXMV01052040.1,7874,0.29,0.06,0.19,0.33,0.25,0.35,0.07,0.31,0.23125
1,JXMV01052596.1,1405,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.00125
2,JXMV01052596.1,16005,0.0,0.0,0.0,0.0,0.18,0.01,0.0,0.0,0.02375
3,JXMV01052596.1,16038,0.0,0.0,0.0,0.0,0.18,0.0,0.0,0.0,0.0225
4,JXMV01052596.1,16084,0.0,0.0,0.0,0.0,0.18,0.0,0.0,0.0,0.0225
5,JXMV01052596.1,16088,0.0,0.0,0.0,0.0,0.18,0.0,0.0,0.0,0.0225
6,JXMV01052596.1,16095,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,JXMV01054326.1,10898,0.68,0.45,0.72,0.9,0.99,0.83,0.74,0.03,0.6675
8,JXMV01054326.1,10903,0.83,0.45,0.99,0.9,0.84,0.91,0.59,1.0,0.81375
9,JXMV01054326.1,10905,0.88,0.45,0.69,0.9,0.56,0.78,0.48,0.99,0.71625


In [42]:
#Save dataframe in a tabular format and include N/As. Do not include quotes.
df.to_csv("S-samples-averages.bedgraph", sep = "\t", na_rep = "N/A", quoting = 3)

In [43]:
!head S-samples-averages.bedgraph
!wc -l S-samples-averages.bedgraph

	chr	pos	NO_20-S1	NO_20-S3	NO_20-S4	NO_20-S2	HY_5-S3	HY_5-S4	HY_5-S2	HY_5-S1	average
0	JXMV01052040.1	7874	0.29	0.06	0.19	0.33	0.25	0.35	0.07	0.31	0.23125000000000004
1	JXMV01052596.1	1405	0.0	0.0	0.0	0.0	0.01	0.0	0.0	0.0	0.00125
2	JXMV01052596.1	16005	0.0	0.0	0.0	0.0	0.18	0.01	0.0	0.0	0.02375
3	JXMV01052596.1	16038	0.0	0.0	0.0	0.0	0.18	0.0	0.0	0.0	0.0225
4	JXMV01052596.1	16084	0.0	0.0	0.0	0.0	0.18	0.0	0.0	0.0	0.0225
5	JXMV01052596.1	16088	0.0	0.0	0.0	0.0	0.18	0.0	0.0	0.0	0.0225
6	JXMV01052596.1	16095	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
7	JXMV01054326.1	10898	0.68	0.45	0.72	0.9	0.99	0.83	0.74	0.03	0.6675000000000001
8	JXMV01054326.1	10903	0.83	0.45	0.99	0.9	0.84	0.91	0.59	1.0	0.81375
   82306 S-samples-averages.bedgraph


In [44]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 S-samples-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($12 > 0) { print $12 }}' \
| wc -l

   57338


In [45]:
#Remove header
#Count number of CpGs
! tail -n+2 S-samples-averages.bedgraph \
| wc -l

   82305


In [46]:
#Number of unmethylated CpGs
82305 - 57338

24967

In [47]:
#Calculate average methylation
! tail -n+2 S-samples-averages.bedgraph \
| awk '{ total += $12; count++ } END { print total/count }'

0.162864


## 4. Methylation by population and oxygen treatment

### 4a. New Bedford Harbor

#### Hypoxia

In [101]:
#Mean methylation per CpG across NBH hypoxia samples
!head ../summarize/20_5_N/N_mean_HY.bedgraph
!wc -l ../summarize/20_5_N/N_mean_HY.bedgraph

#chr	start	end	mean_HY
JXMV01051582.1	7434	7435	0.79
JXMV01051582.1	7456	7457	0.793333333333333
JXMV01051582.1	7487	7488	0.743333333333334
JXMV01051609.1	3477	3478	0.533333333333333
JXMV01051609.1	3494	3495	0.993333333333333
JXMV01051609.1	3501	3502	0.786666666666667
JXMV01051609.1	3511	3512	0.906666666666667
JXMV01051609.1	3517	3518	0.903333333333333
JXMV01051609.1	3522	3523	0.556666666666667
  176287 ../summarize/20_5_N/N_mean_HY.bedgraph


In [26]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 ../summarize/20_5_N/N_mean_HY.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($4 > 0) { print $4 }}' \
| wc -l

   96857


In [27]:
#Remove header
#Count number of CpGs
! tail -n+2 ../summarize/20_5_N/N_mean_HY.bedgraph \
| wc -l

  176286


In [28]:
#Count the number of unmethylated CpGs
176286 - 96857

79429

In [106]:
#Calculate average methylation
! tail -n+2 ../summarize/20_5_N/N_mean_HY.bedgraph \
| awk '{ total += $4; count++ } END { print total/count }'

0.220691


#### Normoxia

In [107]:
#Mean methylation per CpG across NBH normoxia samples
!head ../summarize/20_5_N/N_mean_NO.bedgraph
!wc -l ../summarize/20_5_N/N_mean_NO.bedgraph

#chr	start	end	mean_NO
JXMV01051582.1	7434	7435	0.513333333333333
JXMV01051582.1	7456	7457	0.48
JXMV01051582.1	7487	7488	0.596666666666667
JXMV01051609.1	3477	3478	0.553333333333333
JXMV01051609.1	3494	3495	0.77
JXMV01051609.1	3501	3502	0.283333333333333
JXMV01051609.1	3511	3512	0.77
JXMV01051609.1	3517	3518	0.806666666666667
JXMV01051609.1	3522	3523	0.766666666666667
  176287 ../summarize/20_5_N/N_mean_NO.bedgraph


In [29]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 ../summarize/20_5_N/N_mean_NO.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($4 > 0) { print $4 }}' \
| wc -l

   93895


In [30]:
#Remove header
#Count number of CpGs
! tail -n+2 ../summarize/20_5_N/N_mean_NO.bedgraph \
| wc -l

  176286


In [31]:
#Count the number of unmethylated CpGs
176286 - 93895

82391

In [110]:
#Calculate average methylation
! tail -n+2 ../summarize/20_5_N/N_mean_NO.bedgraph \
| awk '{ total += $4; count++ } END { print total/count }'

0.204056


### 4b. Scorton Creek

#### Hypoxia

In [112]:
#Mean methylation per CpG across SC hypoxia samples
!head ../summarize/20_5_S/S_mean_HY.bedgraph
!wc -l ../summarize/20_5_S/S_mean_HY.bedgraph

#chr	start	end	mean_HY
JXMV01052040.1	7873	7874	0.245
JXMV01052596.1	1404	1405	0.0025
JXMV01052596.1	16004	16005	0.0475
JXMV01052596.1	16037	16038	0.045
JXMV01052596.1	16083	16084	0.045
JXMV01052596.1	16087	16088	0.045
JXMV01052596.1	16094	16095	0
JXMV01054326.1	10897	10898	0.6475
JXMV01054326.1	10902	10903	0.835
   82306 ../summarize/20_5_S/S_mean_HY.bedgraph


In [32]:
#Remove header
#Find CpGs with mean methylation > 0.1. This cutoff differs from the > 0 cutoff used in the other sections
#Count number of CpGs
! tail -n+2 ../summarize/20_5_S/S_mean_HY.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($4 > .1) { print $4 }}' \
| wc -l

   22694


In [33]:
#Remove header
#Count number of CpGs
! tail -n+2 ../summarize/20_5_S/S_mean_HY.bedgraph \
| wc -l

   82305


In [34]:
#Count the number of CpGs at or below the 0.1 cutoff used in the previous count (not strictly unmethylated CpGs)
82305 - 22694

59611
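
The choice of cutoff changes the count noticeably. As a small illustration, the sketch below applies a `> 0` and a `> 0.1` cutoff to the nine `mean_HY` values printed by `head` above. It is only a demonstration on those nine rows, not a recalculation for the full file.

```python
import pandas as pd

# mean_HY values from the first nine rows of S_mean_HY.bedgraph shown above
mean_hy = pd.Series([0.245, 0.0025, 0.0475, 0.045, 0.045, 0.045, 0.0, 0.6475, 0.835])

# Count CpGs above each cutoff
counts = {cutoff: int((mean_hy > cutoff).sum()) for cutoff in (0.0, 0.1)}

for cutoff, n_above in counts.items():
    print(f"cutoff > {cutoff}: {n_above} of {mean_hy.size} CpGs")
```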

In [115]:
#Calculate average methylation
! tail -n+2 ../summarize/20_5_S/S_mean_HY.bedgraph \
| awk '{ total += $4; count++ } END { print total/count }'

0.164196


#### Normoxia

In [116]:
#Mean methylation per CpG across SC normoxia samples
!head ../summarize/20_5_S/S_mean_NO.bedgraph
!wc -l ../summarize/20_5_S/S_mean_NO.bedgraph

#chr	start	end	mean_NO
JXMV01052040.1	7873	7874	0.2175
JXMV01052596.1	1404	1405	0
JXMV01052596.1	16004	16005	0
JXMV01052596.1	16037	16038	0
JXMV01052596.1	16083	16084	0
JXMV01052596.1	16087	16088	0
JXMV01052596.1	16094	16095	0
JXMV01054326.1	10897	10898	0.6875
JXMV01054326.1	10902	10903	0.7925
   82306 ../summarize/20_5_S/S_mean_NO.bedgraph


In [35]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 ../summarize/20_5_S/S_mean_NO.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($4 > 0) { print $4 }}' \
| wc -l

   45661


In [36]:
#Remove header
#Count number of CpGs
! tail -n+2 ../summarize/20_5_S/S_mean_NO.bedgraph \
| wc -l

   82305


In [37]:
#Count the number of unmethylated CpGs
82305 - 45661

36644

In [119]:
#Calculate average methylation
! tail -n+2 ../summarize/20_5_S/S_mean_NO.bedgraph \
| awk '{ total += $4; count++ } END { print total/count }'

0.161531
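
The per-group results from section 4 can be gathered into one table for comparison. The sketch below copies the totals, counts, and means printed in the cells above. The column names are my own, and the `cutoff` column records that the SC hypoxia count used `> 0.1` while the others used `> 0`, so that row is not directly comparable.

```python
import pandas as pd

# Values copied from the outputs in section 4 above
summary = pd.DataFrame(
    [
        {"population": "NBH", "treatment": "hypoxia", "total": 176286,
         "above_cutoff": 96857, "cutoff": 0.0, "mean": 0.220691},
        {"population": "NBH", "treatment": "normoxia", "total": 176286,
         "above_cutoff": 93895, "cutoff": 0.0, "mean": 0.204056},
        {"population": "SC", "treatment": "hypoxia", "total": 82305,
         "above_cutoff": 22694, "cutoff": 0.1, "mean": 0.164196},
        {"population": "SC", "treatment": "normoxia", "total": 82305,
         "above_cutoff": 45661, "cutoff": 0.0, "mean": 0.161531},
    ]
)

# CpGs at or below the cutoff, and the fraction above it
summary["at_or_below_cutoff"] = summary["total"] - summary["above_cutoff"]
summary["fraction_above"] = summary["above_cutoff"] / summary["total"]

print(summary.to_string(index=False))
```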
