# Methylation landscape analysis

In this notebook, I'll characterize the methylation landscape using output from [`BAT_summarize`](https://github.com/yaaminiv/killifish-hypoxia-RRBS/tree/main/output/05-analysis/summarize).

## 0. Prepare notebook for analysis

In [1]:
!pwd

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS/code


In [2]:
cd ../output/05-analysis/

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS/output/05-analysis


In [3]:
cd new-genome/

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS/output/05-analysis/new-genome


In [4]:
!mkdir methylation-landscape

mkdir: methylation-landscape: File exists


In [5]:
cd methylation-landscape/

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS/output/05-analysis/new-genome/methylation-landscape


In [6]:
!mkdir missing_1

mkdir: missing_1: File exists


In [7]:
cd missing_1

/Users/yaaminivenkataraman/Documents/killifish-hypoxia-RRBS/output/05-analysis/new-genome/methylation-landscape/missing_1


In [8]:
!which bedtools

/opt/homebrew/bin/bedtools


In [9]:
bedtoolsDirectory = "/opt/homebrew/bin/"

In [10]:
#Import pandas for this notebook
import pandas as pd
print(pd.__version__)

0.25.1


## 1. Create union file

I want to understand the data in two different ways: one with missing values, and one without. I will use `unionBedGraphs` to create a file with missing data for methylation landscape analysis.

### 1a. Count the number of CpGs in the genome

First, I will count the number of CpGs in the *F. heteroclitus* genome.
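
The counting step itself is not shown in this section. As a minimal sketch (not the command used for this project), CG dinucleotides could be counted from a genome FASTA as below. The function name `count_cpgs` and the toy records are hypothetical, and only the forward strand is scanned.

```python
def count_cpgs(fasta_lines):
    """Count CG dinucleotides in FASTA-formatted lines (forward strand only)."""
    total = 0
    prev_base = ""  # last base of the previous sequence line, to catch CGs split across lines
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            prev_base = ""  # new record: never join bases across records
            continue
        seq = line.upper()  # treat soft-masked (lowercase) bases like any other base
        total += (prev_base + seq).count("CG")
        prev_base = seq[-1]
    return total


# Toy records (hypothetical); a real run would pass an open genome FASTA file handle
example = [">chr1", "ACGTC", "GGCG", ">chr2", "GCG", "cgcg"]
print(count_cpgs(example))
```

Passing an open file handle instead of a list keeps memory use low, because the file is read one line at a time.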

### 1b. Union file

In [17]:
#Find files to concatenate
!find /Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/*sort.bedgraph

/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_20-N4_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_20-S1_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_20-S3_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_20-S4_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_5-N1_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_5-N2_CG.sort.bedgraph
/Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_5-S3_CG.sort.bedgraph
/Volumes/

In [16]:
!tail /Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/190626_I114_FCH7TVNBBXY_L2_20-N4_CG.sort.bedgraph

NW_023397471.1	20968	20969	0.00
NW_023397471.1	20977	20978	0.00
NW_023397471.1	22730	22731	0.50
NW_023397471.1	22762	22763	0.33
NW_023397471.1	22789	22790	0.94
NW_023397471.1	22797	22798	1.00
NW_023397471.1	22824	22825	0.78
NW_023397471.1	22879	22880	0.97
NW_023397471.1	22892	22893	0.99
NW_023397471.1	22914	22915	0.81


In [18]:
#Create a union bedGraph
#Use N/A when there is no data for a CpG in a sample
#Define sample IDs (-names are assigned to the -i files in order, so the two lists must match)
#Use sorted bedgraphs
#The number of lines (CpGs) with data is counted in the next cell
!{bedtoolsDirectory}unionBedGraphs \
-header \
-filler N/A \
-names N_20-N4 N_5-N1 N_5-N2 N_20-N2 N_5-N3 N_20-N1 N_OC-N5 N_OC-N1 N_OC-N2 N_OC-N4 S_20-S1 S_20-S3 S_20-S4 S_5-S3 S_5-S4 S_5-S2 S_20-S2 S_5-S1 S_OC-S1 S_OC-S2 S_OC-S3 S_OC-S5 \
-i /Volumes/yaamini.venkataraman/killifish-hypoxia-RRBS/output/04-calling/new-genome/filtered/*sort.bedgraph \
> union_10x.bedgraph
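
`unionBedGraphs` assigns `-names` to the `-i` files by position, so the label order has to match the order in which the glob expands. Below is a minimal sketch of deriving the labels from the file names instead of typing them by hand. `label_from_path` and the regular expression are hypothetical, and they assume every file follows the `..._<treatment>-<population><replicate>_CG.sort.bedgraph` pattern seen in the listing above.

```python
import re
from pathlib import Path

# Assumed pattern: ..._<treatment>-<population><replicate>_CG.sort.bedgraph
LABEL_PATTERN = re.compile(
    r"_(?P<treatment>\d+|OC)-(?P<pop>[NS])(?P<rep>\d+)_CG\.sort\.bedgraph$"
)


def label_from_path(path):
    """Turn a bedgraph file name into a label such as N_20-N4."""
    match = LABEL_PATTERN.search(Path(path).name)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    pop = match["pop"]
    return f"{pop}_{match['treatment']}-{pop}{match['rep']}"


# Two file names taken from the listing above; sorted() mimics glob expansion order
paths = sorted([
    "filtered/190626_I114_FCH7TVNBBXY_L2_5-N1_CG.sort.bedgraph",
    "filtered/190626_I114_FCH7TVNBBXY_L2_20-S1_CG.sort.bedgraph",
])
names = [label_from_path(p) for p in paths]
print(" ".join(names))  # could be pasted after -names
```

Building the labels from the same sorted list that is passed to `-i` keeps each column header tied to its file.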

In [12]:
#Check output
!head ../union_10x.bedgraph
!wc -l ../union_10x.bedgraph

chrom	start	end	N_20-N4	N_5-N1	N_5-N2	N_20-N2	N_5-N3	N_20-N1	N_OC-N5	N_OC-N1	N_OC-N2	N_OC-N4	S_20-S1	S_20-S3	S_20-S4	S_5-S3	S_5-S4	S_5-S2	S_20-S2	S_5-S1	S_OC-S1	S_OC-S2	S_OC-S3	S_OC-S5
NC_012312.1	60	61	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.00	N/A	0.00	0.00	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
NC_012312.1	61	62	0.00	N/A	N/A	0.00	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.00	N/A	N/A	N/A	N/A	N/A	N/A
NC_012312.1	126	127	N/A	0.00	N/A	0.00	N/A	N/A	N/A	N/A	N/A	N/A	0.00	N/A	0.00	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
NC_012312.1	127	128	N/A	N/A	0.00	N/A	N/A	N/A	0.00	0.00	N/A	N/A	N/A	0.00	0.00	N/A	N/A	N/A	0.00	N/A	N/A	N/A	N/A	N/A
NC_012312.1	296	297	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.01	0.01	0.02	0.05	0.01	0.02	0.04	0.03	0.01	0.01	0.01	N/A	0.03
NC_012312.1	307	308	N/A	0.00	N/A	N/A	0.00	N/A	N/A	N/A	0.00	N/A	N/A	0.55	0.00	N/A	N/A	N/A	0.00	0.00	N/A	N/A	0.00	N/A
NC_012312.1	308	309	0.00	0.07	0.00	0.00	0.00	0.00	N/A	0.00	0.00	N/A	0.00	N/A	0.00	N/A	N/A	0.00	0.00	N/A	N/A	0.01	N/A	N/A
NC_012312.1	31

In [13]:
#Import data into pandas
#Check head
df = pd.read_table("../union_10x.bedgraph")
df.head(5)

Unnamed: 0,chrom,start,end,N_20-N4,N_5-N1,N_5-N2,N_20-N2,N_5-N3,N_20-N1,N_OC-N5,...,S_20-S4,S_5-S3,S_5-S4,S_5-S2,S_20-S2,S_5-S1,S_OC-S1,S_OC-S2,S_OC-S3,S_OC-S5
0,NC_012312.1,60,61,,,,,,,,...,0.0,,,,,,,,,
1,NC_012312.1,61,62,0.0,,,0.0,,,,...,,,,0.0,,,,,,
2,NC_012312.1,126,127,,0.0,,0.0,,,,...,0.0,,,,,,,,,
3,NC_012312.1,127,128,,,0.0,,,,0.0,...,0.0,,,,0.0,,,,,
4,NC_012312.1,296,297,,,,,,,,...,0.05,0.01,0.02,0.04,0.03,0.01,0.01,0.01,,0.03


In [14]:
#Average all samples for total genome methylation information and save as a new column
#NA are not included in averages
#Check output
df['average'] = df[['N_20-N4', 'N_5-N1', 'N_5-N2', 'N_20-N2', 'N_5-N3', 'N_20-N1', 'N_OC-N5', 'N_OC-N1', 'N_OC-N2', 'N_OC-N4', 'S_20-S1', 'S_20-S3', 'S_20-S4', 'S_5-S3', 'S_5-S4', 'S_5-S2', 'S_20-S2', 'S_5-S1', 'S_OC-S1', 'S_OC-S2', 'S_OC-S3', 'S_OC-S5']].mean(axis=1)
df.head(10)

Unnamed: 0,chrom,start,end,N_20-N4,N_5-N1,N_5-N2,N_20-N2,N_5-N3,N_20-N1,N_OC-N5,...,S_5-S3,S_5-S4,S_5-S2,S_20-S2,S_5-S1,S_OC-S1,S_OC-S2,S_OC-S3,S_OC-S5,average
0,NC_012312.1,60,61,,,,,,,,...,,,,,,,,,,0.0
1,NC_012312.1,61,62,0.0,,,0.0,,,,...,,,0.0,,,,,,,0.0
2,NC_012312.1,126,127,,0.0,,0.0,,,,...,,,,,,,,,,0.0
3,NC_012312.1,127,128,,,0.0,,,,0.0,...,,,,0.0,,,,,,0.0
4,NC_012312.1,296,297,,,,,,,,...,0.01,0.02,0.04,0.03,0.01,0.01,0.01,,0.03,0.020833
5,NC_012312.1,307,308,,0.0,,,0.0,,,...,,,,0.0,0.0,,,0.0,,0.06875
6,NC_012312.1,308,309,0.0,0.07,0.0,0.0,0.0,0.0,,...,,,0.0,0.0,,,0.01,,,0.006154
7,NC_012312.1,319,320,,,,,0.0,,,...,,,,0.0,0.0,,,0.0,,0.0
8,NC_012312.1,320,321,0.0,0.0,0.0,0.08,0.0,0.0,,...,,,0.0,0.0,,,0.0,,,0.006154
9,NC_012312.1,321,322,,,,,0.01,,,...,,,,0.03,0.0,,,0.0,,0.005714


In [15]:
#Average all NBH samples and NBH x treatment samples
#NA are not included in averages
#Check output
df['NBH.average'] = df[['N_20-N4', 'N_5-N1', 'N_5-N2', 'N_20-N2', 'N_5-N3', 'N_20-N1', 'N_OC-N5', 'N_OC-N1', 'N_OC-N2', 'N_OC-N4']].mean(axis=1)
df['NBH.OC.average'] = df[['N_OC-N5', 'N_OC-N1', 'N_OC-N2', 'N_OC-N4']].mean(axis=1)
df['NBH.NO.average'] = df[['N_20-N4', 'N_20-N2', 'N_20-N1']].mean(axis=1)
df['NBH.HY.average'] = df[['N_5-N1', 'N_5-N2', 'N_5-N3']].mean(axis=1)
df.head(10)

Unnamed: 0,chrom,start,end,N_20-N4,N_5-N1,N_5-N2,N_20-N2,N_5-N3,N_20-N1,N_OC-N5,...,S_5-S1,S_OC-S1,S_OC-S2,S_OC-S3,S_OC-S5,average,NBH.average,NBH.OC.average,NBH.NO.average,NBH.HY.average
0,NC_012312.1,60,61,,,,,,,,...,,,,,,0.0,0.0,0.0,,
1,NC_012312.1,61,62,0.0,,,0.0,,,,...,,,,,,0.0,0.0,,0.0,
2,NC_012312.1,126,127,,0.0,,0.0,,,,...,,,,,,0.0,0.0,,0.0,0.0
3,NC_012312.1,127,128,,,0.0,,,,0.0,...,,,,,,0.0,0.0,0.0,,0.0
4,NC_012312.1,296,297,,,,,,,,...,0.01,0.01,0.01,,0.03,0.020833,0.01,0.01,,
5,NC_012312.1,307,308,,0.0,,,0.0,,,...,0.0,,,0.0,,0.06875,0.0,0.0,,0.0
6,NC_012312.1,308,309,0.0,0.07,0.0,0.0,0.0,0.0,,...,,,0.01,,,0.006154,0.00875,0.0,0.0,0.023333
7,NC_012312.1,319,320,,,,,0.0,,,...,0.0,,,0.0,,0.0,0.0,0.0,,0.0
8,NC_012312.1,320,321,0.0,0.0,0.0,0.08,0.0,0.0,,...,,,0.0,,,0.006154,0.01,0.0,0.026667,0.0
9,NC_012312.1,321,322,,,,,0.01,,,...,0.0,,,0.0,,0.005714,0.005,0.0,,0.01


In [16]:
#Average all SC samples and SC x treatment samples
#NA are not included in averages
#Check output
df['SC.average'] = df[['S_20-S1', 'S_20-S3', 'S_20-S4', 'S_5-S3', 'S_5-S4', 'S_5-S2', 'S_20-S2', 'S_5-S1', 'S_OC-S1', 'S_OC-S2', 'S_OC-S3', 'S_OC-S5']].mean(axis=1)
df['SC.OC.average'] = df[['S_OC-S1', 'S_OC-S2', 'S_OC-S3', 'S_OC-S5']].mean(axis=1)
df['SC.NO.average'] = df[['S_20-S1', 'S_20-S3', 'S_20-S4', 'S_20-S2']].mean(axis=1)
df['SC.HY.average'] = df[['S_5-S3', 'S_5-S4', 'S_5-S2', 'S_5-S1']].mean(axis=1)
df.head(10)

Unnamed: 0,chrom,start,end,N_20-N4,N_5-N1,N_5-N2,N_20-N2,N_5-N3,N_20-N1,N_OC-N5,...,S_OC-S5,average,NBH.average,NBH.OC.average,NBH.NO.average,NBH.HY.average,SC.average,SC.OC.average,SC.NO.average,SC.HY.average
0,NC_012312.1,60,61,,,,,,,,...,,0.0,0.0,0.0,,,0.0,,0.0,
1,NC_012312.1,61,62,0.0,,,0.0,,,,...,,0.0,0.0,,0.0,,0.0,,,0.0
2,NC_012312.1,126,127,,0.0,,0.0,,,,...,,0.0,0.0,,0.0,0.0,0.0,,0.0,
3,NC_012312.1,127,128,,,0.0,,,,0.0,...,,0.0,0.0,0.0,,0.0,0.0,,0.0,
4,NC_012312.1,296,297,,,,,,,,...,0.03,0.020833,0.01,0.01,,,0.021818,0.016667,0.0275,0.02
5,NC_012312.1,307,308,,0.0,,,0.0,,,...,,0.06875,0.0,0.0,,0.0,0.11,0.0,0.183333,0.0
6,NC_012312.1,308,309,0.0,0.07,0.0,0.0,0.0,0.0,,...,,0.006154,0.00875,0.0,0.0,0.023333,0.002,0.01,0.0,0.0
7,NC_012312.1,319,320,,,,,0.0,,,...,,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0
8,NC_012312.1,320,321,0.0,0.0,0.0,0.08,0.0,0.0,,...,,0.006154,0.01,0.0,0.026667,0.0,0.0,0.0,0.0,0.0
9,NC_012312.1,321,322,,,,,0.01,,,...,,0.005714,0.005,0.0,,0.01,0.006,0.0,0.01,0.0
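
The averaging cells above list every sample column by hand. As a sketch of the same idea, the columns could instead be selected by their name prefix. `add_group_averages` and the toy values are hypothetical; the mapping of `20` to NO, `5` to HY, and `OC` to OC follows the column lists used above.

```python
import pandas as pd

# Treatment label -> code used in the sample names (taken from the column lists above)
TREATMENTS = {"OC": "OC", "NO": "20", "HY": "5"}


def add_group_averages(df, pop_prefix, label):
    """Add population and population x treatment averages; NaN values are skipped."""
    pop_cols = [c for c in df.columns if c.startswith(f"{pop_prefix}_")]
    df[f"{label}.average"] = df[pop_cols].mean(axis=1)
    for name, code in TREATMENTS.items():
        cols = [c for c in pop_cols if c.startswith(f"{pop_prefix}_{code}-")]
        df[f"{label}.{name}.average"] = df[cols].mean(axis=1)
    return df


# Toy data (hypothetical values); None becomes NaN and is ignored by mean()
toy = pd.DataFrame({
    "N_20-N1": [0.0, 0.5],
    "N_5-N1": [0.2, None],
    "N_OC-N1": [None, 0.1],
    "S_20-S1": [0.4, 0.4],
})
toy = add_group_averages(toy, "N", "NBH")
print(toy)
```

A row with no data for a group gets NaN for that group's average, which matches the blank cells in the outputs above.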


In [17]:
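#Save dataframe as a tab-delimited file and write missing values as N/A. Do not include quotes (quoting = 3 is csv.QUOTE_NONE).
#The row index is written as the first column, so every data column shifts right by one.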
df.to_csv("all-samples-averages-union.bedgraph", sep = "\t", na_rep = "N/A", quoting = 3)

In [18]:
!head all-samples-averages-union.bedgraph
!wc -l all-samples-averages-union.bedgraph

	chrom	start	end	N_20-N4	N_5-N1	N_5-N2	N_20-N2	N_5-N3	N_20-N1	N_OC-N5	N_OC-N1	N_OC-N2	N_OC-N4	S_20-S1	S_20-S3	S_20-S4	S_5-S3	S_5-S4	S_5-S2	S_20-S2	S_5-S1	S_OC-S1	S_OC-S2	S_OC-S3	S_OC-S5	average	NBH.average	NBH.OC.average	NBH.NO.average	NBH.HY.average	SC.average	SC.OC.average	SC.NO.average	SC.HY.average
0	NC_012312.1	60	61	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.0	N/A	0.0	0.0	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.0	0.0	0.0	N/A	N/A	0.0	N/A	0.0	N/A
1	NC_012312.1	61	62	0.0	N/A	N/A	0.0	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.0	N/A	N/A	N/A	N/A	N/A	N/A	0.0	0.0	N/A	0.0	N/A	0.0	N/A	N/A	0.0
2	NC_012312.1	126	127	N/A	0.0	N/A	0.0	N/A	N/A	N/A	N/A	N/A	N/A	0.0	N/A	0.0	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.0	0.0	N/A	0.0	0.0	0.0	N/A	0.0	N/A
3	NC_012312.1	127	128	N/A	N/A	0.0	N/A	N/A	N/A	0.0	0.0	N/A	N/A	N/A	0.0	0.0	N/A	N/A	N/A	0.0	N/A	N/A	N/A	N/A	N/A	0.0	0.0	0.0	N/A	0.0	0.0	N/A	0.0	N/A
4	NC_012312.1	296	297	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	0.01	0.01	0.02	0.05	0.01	0.02	0.04	0.03	0.01	0.01	0.01	N/A

In [19]:
#Confirm column number with average methylation information
!cut -f27 all-samples-averages-union.bedgraph | head

average
0.0
0.0
0.0
0.0
0.020833333333333332
0.06875
0.006153846153846154
0.0
0.006153846153846154
cut: stdout: Broken pipe


In [25]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-samples-averages-union.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($27 > 0) { print $27 }}' \
| wc -l

  354151


In [26]:
#Number of unmethylated CpGs
#Total CpGs with data (line count of file) - CpGs with > 0% methylation
439470 - 354151

85319

In [27]:
#Import data into pandas
#Calculate average methylation (pandas ignores NAs)
df = pd.read_table("all-samples-averages-union.bedgraph")
df[['average']].mean()

average    0.603309
dtype: float64

In [28]:
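#Calculate average methylation using awk
#Answer matches pandas because the average column has no N/As (every union row has data for at least one sample)
#awk does not skip N/A: it would treat it as 0 and still count the row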
! tail -n+2 all-samples-averages-union.bedgraph \
| awk '{ total += $27; count++ } END { print total/count }'

0.603309
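
For columns that can contain `N/A`, the count and the mean need to skip those rows explicitly. Below is a minimal sketch using only the standard library. `summarize_column` and the toy rows are hypothetical, and the layout assumes the tab-delimited file with a leading index column written above.

```python
import csv
import io


def summarize_column(handle, column, missing="N/A"):
    """Count loci and average one column of a tab-delimited file, skipping missing values."""
    n_loci = n_methylated = 0
    running_sum = 0.0
    for row in csv.DictReader(handle, delimiter="\t"):
        value = row[column]
        if value == missing:
            continue  # skip the row rather than treating N/A as 0
        methylation = float(value)
        n_loci += 1
        running_sum += methylation
        if methylation > 0:
            n_methylated += 1
    return {
        "n_loci": n_loci,
        "n_methylated": n_methylated,
        "n_unmethylated": n_loci - n_methylated,
        "mean": running_sum / n_loci if n_loci else float("nan"),
    }


# Toy rows (hypothetical values) in the same layout as the saved file
toy = io.StringIO(
    "\tchrom\tstart\tend\taverage\n"
    "0\tchr1\t60\t61\t0.0\n"
    "1\tchr1\t61\t62\t0.5\n"
    "2\tchr1\t126\t127\tN/A\n"
    "3\tchr1\t127\t128\t0.25\n"
)
result = summarize_column(toy, "average")
print(result)
```

In this toy example the mean over the three rows with data is 0.25, whereas dividing by all four rows would give 0.1875.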


## 2. Methylation by population

### 2a. Create summary file

The union bedgraph is helpful for knowing how many unique loci were included in the analysis. To be conservative with my measurements of average methylation, I will only consider loci that are common to all samples.

In [9]:
!ls ../../summarize/missing_1/all_pop/

all_pop_20-N1.bedgraph       all_pop_5-S4.bedgraph
all_pop_20-N1.bw             all_pop_5-S4.bw
all_pop_20-N2.bedgraph       all_pop_OC-N1.bedgraph
all_pop_20-N2.bw             all_pop_OC-N1.bw
all_pop_20-N4.bedgraph       all_pop_OC-N2.bedgraph
all_pop_20-N4.bw             all_pop_OC-N2.bw
all_pop_20-S1.bedgraph       all_pop_OC-N4.bedgraph
all_pop_20-S1.bw             all_pop_OC-N4.bw
all_pop_20-S2.bedgraph       all_pop_OC-N5.bedgraph
all_pop_20-S2.bw             all_pop_OC-N5.bw
all_pop_20-S3.bedgraph       all_pop_OC-S1.bedgraph
all_pop_20-S3.bw             all_pop_OC-S1.bw
all_pop_20-S4.bedgraph       all_pop_OC-S2.bedgraph
all_pop_20-S4.bw             all_pop_OC-S2.bw

In [10]:
#metilene output from all population comparison of NBH and SC samples
!head ../../summarize/missing_1/all_pop/all_pop_metilene_N_S.txt
!wc -l ../../summarize/missing_1/all_pop/all_pop_metilene_N_S.txt

chr	pos	N_20-N4	N_5-N1	N_5-N2	N_20-N2	N_5-N3	N_20-N1	N_OC-N5	N_OC-N1	N_OC-N2	N_OC-N4	S_20-S1	S_20-S3	S_20-S4	S_5-S3	S_5-S4	S_5-S2	S_20-S2	S_5-S1	S_OC-S1	S_OC-S2	S_OC-S3	S_OC-S5
NC_012312.1	1062	0.00	0.00	0.00	0.00	0.00	0.00	0.02	0.08	0.00	0.04	0.00	0.02	0.00	0.04	0.00	0.09	0.03	0.06	0.00	0.02	0.04	0.04
NC_012312.1	1063	0.00	0.00	0.00	0.00	0.00	0.00	0.33	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.04	0.00	0.00	0.03	0.00	NA
NC_012312.1	1073	0.04	0.05	0.04	0.06	0.08	0.09	0.03	0.41	0.05	0.04	0.00	0.14	0.06	0.07	0.00	0.00	0.17	0.00	0.04	0.02	0.03	NA
NC_012312.1	1999	0.00	0.00	0.02	0.00	0.00	0.03	0.00	0.00	0.02	0.02	0.00	0.00	0.02	0.00	0.01	0.02	NA	0.00	0.00	0.03	0.04	0.00
NC_012312.1	7746	0.00	0.00	0.00	0.01	0.04	NA	0.01	0.01	0.02	0.01	0.01	0.05	0.01	0.01	0.00	0.01	0.00	0.00	0.08	0.00	0.03	0.00
NC_046361.1	905	0.02	0.30	0.28	0.00	0.33	0.00	0.02	0.01	0.16	0.72	0.34	0.06	0.07	0.45	0.22	0.37	0.08	NA	0.02	0.07	0.09	0.18
NC_046361.1	924	0.80	0.43	0.88	0.14	0.75	0.99	0.40	0.62	0.71	0.70	0.77	0

In [11]:
#Import data into pandas
#Check head
df = pd.read_table("../../summarize/missing_1/all_pop/all_pop_metilene_N_S.txt")
df.head(5)

Unnamed: 0,chr,pos,N_20-N4,N_5-N1,N_5-N2,N_20-N2,N_5-N3,N_20-N1,N_OC-N5,N_OC-N1,...,S_20-S4,S_5-S3,S_5-S4,S_5-S2,S_20-S2,S_5-S1,S_OC-S1,S_OC-S2,S_OC-S3,S_OC-S5
0,NC_012312.1,1062,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.08,...,0.0,0.04,0.0,0.09,0.03,0.06,0.0,0.02,0.04,0.04
1,NC_012312.1,1063,0.0,0.0,0.0,0.0,0.0,0.0,0.33,0.0,...,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.03,0.0,
2,NC_012312.1,1073,0.04,0.05,0.04,0.06,0.08,0.09,0.03,0.41,...,0.06,0.07,0.0,0.0,0.17,0.0,0.04,0.02,0.03,
3,NC_012312.1,1999,0.0,0.0,0.02,0.0,0.0,0.03,0.0,0.0,...,0.02,0.0,0.01,0.02,,0.0,0.0,0.03,0.04,0.0
4,NC_012312.1,7746,0.0,0.0,0.0,0.01,0.04,,0.01,0.01,...,0.01,0.01,0.0,0.01,0.0,0.0,0.08,0.0,0.03,0.0


In [14]:
#Average all NBH samples and save as a new column
#Average NBH OC samples and save as a new column
#Average NBH NO samples and save as a new column
#Average NBH HY samples and save as a new column
#NA are not included in averages
#Check output
df['NBH.average'] = df[['N_OC-N5' ,'N_OC-N1' ,'N_OC-N2', 'N_OC-N4', 'N_20-N4', 'N_20-N2', 'N_20-N1', 'N_5-N1', 'N_5-N2', 'N_5-N3']].mean(axis=1)
df['NBH.OC.average'] = df[['N_OC-N5' ,'N_OC-N1' ,'N_OC-N2', 'N_OC-N4']].mean(axis=1)
df['NBH.NO.average'] = df[['N_20-N4', 'N_20-N2', 'N_20-N1']].mean(axis=1)
df['NBH.HY.average'] = df[['N_5-N1', 'N_5-N2', 'N_5-N3']].mean(axis=1)
df.head(10)

Unnamed: 0,chr,pos,N_20-N4,N_5-N1,N_5-N2,N_20-N2,N_5-N3,N_20-N1,N_OC-N5,N_OC-N1,...,S_20-S2,S_5-S1,S_OC-S1,S_OC-S2,S_OC-S3,S_OC-S5,NBH.average,NBH.OC.average,NBH.NO.average,NBH.HY.average
0,NC_012312.1,1062,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.08,...,0.03,0.06,0.0,0.02,0.04,0.04,0.014,0.035,0.0,0.0
1,NC_012312.1,1063,0.0,0.0,0.0,0.0,0.0,0.0,0.33,0.0,...,0.04,0.0,0.0,0.03,0.0,,0.033,0.0825,0.0,0.0
2,NC_012312.1,1073,0.04,0.05,0.04,0.06,0.08,0.09,0.03,0.41,...,0.17,0.0,0.04,0.02,0.03,,0.089,0.1325,0.063333,0.056667
3,NC_012312.1,1999,0.0,0.0,0.02,0.0,0.0,0.03,0.0,0.0,...,,0.0,0.0,0.03,0.04,0.0,0.009,0.01,0.01,0.006667
4,NC_012312.1,7746,0.0,0.0,0.0,0.01,0.04,,0.01,0.01,...,0.0,0.0,0.08,0.0,0.03,0.0,0.011111,0.0125,0.005,0.013333
5,NC_046361.1,905,0.02,0.3,0.28,0.0,0.33,0.0,0.02,0.01,...,0.08,,0.02,0.07,0.09,0.18,0.184,0.2275,0.006667,0.303333
6,NC_046361.1,924,0.8,0.43,0.88,0.14,0.75,0.99,0.4,0.62,...,0.62,,0.62,0.83,0.43,0.57,0.642,0.6075,0.643333,0.686667
7,NC_046361.1,931,0.95,0.99,0.99,0.2,1.0,0.99,0.97,0.79,...,0.94,,0.96,0.92,0.63,0.57,0.864,0.88,0.713333,0.993333
8,NC_046361.1,3080,0.85,0.72,0.69,0.77,0.75,0.27,0.76,0.73,...,0.92,0.8,0.77,0.52,0.79,0.52,0.71,0.7625,0.63,0.72
9,NC_046361.1,3088,0.89,0.86,0.85,0.82,0.87,0.93,0.5,0.8,...,0.91,0.83,0.7,0.75,0.74,0.76,0.795,0.6825,0.88,0.86


In [19]:
#Average all SC samples and save as a new column
#Average SC OC samples and save as a new column
#Average SC NO samples and save as a new column
#Average SC HY samples and save as a new column
#NA are not included in averages
#Note: the chained assignment also creates an 'average' column that duplicates SC.average (SC samples only)
#That column sits between SC.average and SC.OC.average in the saved file
#Check output
df['SC.average'] = df['average'] = df[['S_OC-S1', 'S_OC-S2', 'S_OC-S3', 'S_OC-S5', 'S_20-S1', 'S_20-S3', 'S_20-S4', 'S_20-S2', 'S_5-S3', 'S_5-S4', 'S_5-S2', 'S_5-S1']].mean(axis=1)
df['SC.OC.average'] = df[['S_OC-S1', 'S_OC-S2', 'S_OC-S3', 'S_OC-S5']].mean(axis=1)
df['SC.NO.average'] = df[['S_20-S1', 'S_20-S3', 'S_20-S4', 'S_20-S2']].mean(axis=1)
df['SC.HY.average'] = df[['S_5-S3', 'S_5-S4', 'S_5-S2', 'S_5-S1']].mean(axis=1)
df.head(10)

Unnamed: 0,chr,pos,N_20-N4,N_5-N1,N_5-N2,N_20-N2,N_5-N3,N_20-N1,N_OC-N5,N_OC-N1,...,S_OC-S5,NBH.average,NBH.OC.average,NBH.NO.average,NBH.HY.average,SC.average,average,SC.OC.average,SC.NO.average,SC.HY.average
0,NC_012312.1,1062,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.08,...,0.04,0.014,0.035,0.0,0.0,0.028333,0.028333,0.025,0.0125,0.0475
1,NC_012312.1,1063,0.0,0.0,0.0,0.0,0.0,0.0,0.33,0.0,...,,0.033,0.0825,0.0,0.0,0.006364,0.006364,0.01,0.01,0.0
2,NC_012312.1,1073,0.04,0.05,0.04,0.06,0.08,0.09,0.03,0.41,...,,0.089,0.1325,0.063333,0.056667,0.048182,0.048182,0.03,0.0925,0.0175
3,NC_012312.1,1999,0.0,0.0,0.02,0.0,0.0,0.03,0.0,0.0,...,0.0,0.009,0.01,0.01,0.006667,0.010909,0.010909,0.0175,0.006667,0.0075
4,NC_012312.1,7746,0.0,0.0,0.0,0.01,0.04,,0.01,0.01,...,0.0,0.011111,0.0125,0.005,0.013333,0.016667,0.016667,0.0275,0.0175,0.005
5,NC_046361.1,905,0.02,0.3,0.28,0.0,0.33,0.0,0.02,0.01,...,0.18,0.184,0.2275,0.006667,0.303333,0.177273,0.177273,0.09,0.1375,0.346667
6,NC_046361.1,924,0.8,0.43,0.88,0.14,0.75,0.99,0.4,0.62,...,0.57,0.642,0.6075,0.643333,0.686667,0.649091,0.649091,0.6125,0.7275,0.593333
7,NC_046361.1,931,0.95,0.99,0.99,0.2,1.0,0.99,0.97,0.79,...,0.57,0.864,0.88,0.713333,0.993333,0.836364,0.836364,0.77,0.8725,0.876667
8,NC_046361.1,3080,0.85,0.72,0.69,0.77,0.75,0.27,0.76,0.73,...,0.52,0.71,0.7625,0.63,0.72,0.739167,0.739167,0.65,0.81,0.7575
9,NC_046361.1,3088,0.89,0.86,0.85,0.82,0.87,0.93,0.5,0.8,...,0.76,0.795,0.6825,0.88,0.86,0.834167,0.834167,0.7375,0.8775,0.8875


In [20]:
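#Save dataframe as a tab-delimited file and write missing values as N/A. Do not include quotes (quoting = 3 is csv.QUOTE_NONE).
#The row index is written as the first column, so every data column shifts right by one.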
df.to_csv("all-common-loci-averages.bedgraph", sep = "\t", na_rep = "N/A", quoting = 3)

In [21]:
!head all-common-loci-averages.bedgraph
!wc -l all-common-loci-averages.bedgraph

	chr	pos	N_20-N4	N_5-N1	N_5-N2	N_20-N2	N_5-N3	N_20-N1	N_OC-N5	N_OC-N1	N_OC-N2	N_OC-N4	S_20-S1	S_20-S3	S_20-S4	S_5-S3	S_5-S4	S_5-S2	S_20-S2	S_5-S1	S_OC-S1	S_OC-S2	S_OC-S3	S_OC-S5	NBH.average	NBH.OC.average	NBH.NO.average	NBH.HY.average	SC.average	average	SC.OC.average	SC.NO.average	SC.HY.average
0	NC_012312.1	1062	0.0	0.0	0.0	0.0	0.0	0.0	0.02	0.08	0.0	0.04	0.0	0.02	0.0	0.04	0.0	0.09	0.03	0.06	0.0	0.02	0.04	0.04	0.014000000000000002	0.035	0.0	0.0	0.028333333333333335	0.028333333333333335	0.025	0.0125	0.0475
1	NC_012312.1	1063	0.0	0.0	0.0	0.0	0.0	0.0	0.33	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.04	0.0	0.0	0.03	0.0	N/A	0.033	0.0825	0.0	0.0	0.006363636363636364	0.006363636363636364	0.01	0.01	0.0
2	NC_012312.1	1073	0.04	0.05	0.04	0.06	0.08	0.09	0.03	0.41	0.05	0.04	0.0	0.14	0.06	0.07	0.0	0.0	0.17	0.0	0.04	0.02	0.03	N/A	0.089	0.13249999999999998	0.06333333333333334	0.056666666666666664	0.04818181818181819	0.04818181818181819	0.03	0.0925	0.0175
3	NC_012312.1	1999	0.0	0.0	0.02	0.0	0.0	0.03	0.0	0.0

### 2b. New Bedford Harbor

#### All NBH Samples

In [26]:
!cut -f26 all-common-loci-averages.bedgraph | head

NBH.average
0.014000000000000002
0.033
0.089
0.009000000000000001
0.011111111111111112
0.184
0.6419999999999999
0.8640000000000001
0.71
cut: stdout: Broken pipe
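
The `cut` check above confirms a column number by eye. As a sketch of an alternative, the 1-based positions used by `cut -f` and `awk` could be looked up from the header line. `column_positions` is a hypothetical helper, and the header is rebuilt here from the column names shown in the saved file above.

```python
def column_positions(header_line, delimiter="\t"):
    """Map each column name to its 1-based position, as used by cut -f and awk."""
    names = header_line.rstrip("\n").split(delimiter)
    return {name: position for position, name in enumerate(names, start=1)}


# Header rebuilt from the saved file: empty index column, chr, pos, samples, averages
samples = (
    "N_20-N4 N_5-N1 N_5-N2 N_20-N2 N_5-N3 N_20-N1 N_OC-N5 N_OC-N1 N_OC-N2 N_OC-N4 "
    "S_20-S1 S_20-S3 S_20-S4 S_5-S3 S_5-S4 S_5-S2 S_20-S2 S_5-S1 "
    "S_OC-S1 S_OC-S2 S_OC-S3 S_OC-S5"
).split()
averages = [
    "NBH.average", "NBH.OC.average", "NBH.NO.average", "NBH.HY.average",
    "SC.average", "average", "SC.OC.average", "SC.NO.average", "SC.HY.average",
]
header = "\t".join(["", "chr", "pos"] + samples + averages)

positions = column_positions(header)
for name in averages:
    print(name, positions[name])
```

In a real run the header would come from the first line of `all-common-loci-averages.bedgraph`. Looking positions up by name avoids off-by-one slips when an extra column sits between two averages.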


In [27]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($26 > 0) { print $26 }}' \
| wc -l

  121017


In [28]:
#Remove header
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| wc -l

  148752


In [29]:
#Number of unmethylated CpGs
148752 - 121017

27735

In [30]:
#Calculate average methylation
! tail -n+2 all-common-loci-averages.bedgraph \
| awk '{ total += $26; count++ } END { print total/count }'

0.282805


#### Outside control

In [31]:
!cut -f27 all-common-loci-averages.bedgraph | head

NBH.OC.average
0.035
0.0825
0.13249999999999998
0.01
0.0125
0.22749999999999998
0.6074999999999999
0.88
0.7625000000000001
cut: stdout: Broken pipe


In [32]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($27 > 0) { print $27 }}' \
| wc -l

  101532


In [28]:
#Remove header
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| wc -l

  148752


In [33]:
#Number of unmethylated CpGs
148752 - 101532

47220

In [34]:
#Calculate average methylation
! tail -n+2 all-common-loci-averages.bedgraph \
| awk '{ total += $27; count++ } END { print total/count }'

0.279189


#### Normoxia

In [35]:
!cut -f28 all-common-loci-averages.bedgraph | head

NBH.NO.average
0.0
0.0
0.06333333333333334
0.01
0.005
0.006666666666666667
0.6433333333333334
0.7133333333333333
0.63
cut: stdout: Broken pipe


In [36]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($28 > 0) { print $28 }}' \
| wc -l

   92498


In [37]:
#Remove header
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| wc -l

  148752


In [38]:
#Number of unmethylated CpGs
148752 - 92498

56254

In [39]:
#Calculate average methylation
! tail -n+2 all-common-loci-averages.bedgraph \
| awk '{ total += $28; count++ } END { print total/count }'

0.272361


#### Hypoxia

In [40]:
!cut -f29 all-common-loci-averages.bedgraph | head

NBH.HY.average
0.0
0.0
0.056666666666666664
0.006666666666666667
0.013333333333333334
0.3033333333333334
0.6866666666666666
0.9933333333333333
0.7200000000000001
cut: stdout: Broken pipe


In [41]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($29 > 0) { print $29 }}' \
| wc -l

   96541


In [28]:
#Remove header
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| wc -l

  148752


In [42]:
#Number of unmethylated CpGs
148752 - 96541

52211

In [43]:
#Calculate average methylation
! tail -n+2 all-common-loci-averages.bedgraph \
| awk '{ total += $29; count++ } END { print total/count }'

0.297145


### 2c. Scorton Creek

#### All SC Samples

In [44]:
!cut -f30 all-common-loci-averages.bedgraph | head

SC.average
0.028333333333333335
0.006363636363636364
0.04818181818181819
0.01090909090909091
0.016666666666666666
0.17727272727272728
0.649090909090909
0.8363636363636363
0.7391666666666667
cut: stdout: Broken pipe


In [45]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($30 > 0) { print $30 }}' \
| wc -l

  124001


In [28]:
#Remove header
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| wc -l

  148752


In [47]:
#Number of unmethylated CpGs
148752 - 124001

24751

In [46]:
#Calculate average methylation
! tail -n+2 all-common-loci-averages.bedgraph \
| awk '{ total += $30; count++ } END { print total/count }'

0.28804


#### Outside control

In [48]:
!cut -f32 all-common-loci-averages.bedgraph | head

SC.OC.average
0.025
0.01
0.03
0.0175
0.0275
0.09
0.6124999999999999
0.7699999999999999
0.65
cut: stdout: Broken pipe


In [49]:
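#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
#SC.OC.average is column 32 (column 31 is the 'average' column that duplicates SC.average)
#The count recorded below came from column 31, so it matches the all-SC count and should be regenerated
! tail -n+2 all-common-loci-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($32 > 0) { print $32 }}' \
| wc -l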

  124001


In [28]:
#Remove header
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| wc -l

  148752


In [51]:
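#Number of unmethylated CpGs
#Note: 124001 is the count from column 31 (identical to the all-SC count), not from SC.OC.average in column 32
148752 - 124001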

24751

In [50]:
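#Calculate average methylation
#SC.OC.average is column 32 (column 31 is the 'average' column that duplicates SC.average)
#The value recorded below came from column 31, so it matches the all-SC average and should be regenerated
! tail -n+2 all-common-loci-averages.bedgraph \
| awk '{ total += $32; count++ } END { print total/count }'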

0.28804


#### Normoxia

In [53]:
!cut -f33 all-common-loci-averages.bedgraph | head

SC.NO.average
0.0125
0.01
0.0925
0.006666666666666667
0.0175
0.1375
0.7275
0.8724999999999999
0.8099999999999999
cut: stdout: Broken pipe


In [54]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($33 > 0) { print $33 }}' \
| wc -l

   99786


In [37]:
#Remove header
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| wc -l

  148752


In [56]:
#Number of unmethylated CpGs
148752 - 99786

48966

In [55]:
#Calculate average methylation
! tail -n+2 all-common-loci-averages.bedgraph \
| awk '{ total += $33; count++ } END { print total/count }'

0.288239


#### Hypoxia

In [57]:
!cut -f34 all-common-loci-averages.bedgraph | head

SC.HY.average
0.0475
0.0
0.0175
0.0075
0.005
0.3466666666666667
0.5933333333333334
0.8766666666666666
0.7575000000000001
cut: stdout: Broken pipe


In [58]:
#Remove header
#Find CpGs with > 0% methylation
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| awk -F'\t' -v OFS='\t' '{if ($34 > 0) { print $34 }}' \
| wc -l

  101036


In [28]:
#Remove header
#Count number of CpGs
! tail -n+2 all-common-loci-averages.bedgraph \
| wc -l

  148752


In [60]:
#Number of unmethylated CpGs
148752 - 101036

47716

In [59]:
#Calculate average methylation
! tail -n+2 all-common-loci-averages.bedgraph \
| awk '{ total += $34; count++ } END { print total/count }'

0.291675
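
The per-group steps above (count loci with methylation > 0, subtract from the total, take the mean) could also be collected in one pass. Below is a sketch with pandas. `summarize_averages` and the toy values are hypothetical, and selecting columns by the `average` suffix is an assumption about the column naming.

```python
import pandas as pd


def summarize_averages(df, suffix="average"):
    """Summarize every column whose name ends with the suffix, ignoring missing values."""
    rows = []
    for col in [c for c in df.columns if c.endswith(suffix)]:
        values = df[col].dropna()
        rows.append({
            "column": col,
            "n_loci": len(values),
            "n_methylated": int((values > 0).sum()),
            "n_unmethylated": int((values == 0).sum()),
            "mean": values.mean(),
        })
    return pd.DataFrame(rows).set_index("column")


# Toy data (hypothetical values); a real run would read all-common-loci-averages.bedgraph
toy = pd.DataFrame({
    "chr": ["chr1", "chr1", "chr1"],
    "NBH.average": [0.0, 0.5, None],
    "SC.average": [0.2, 0.0, 0.4],
})
summary = summarize_averages(toy)
print(summary)
```

Each row of the summary corresponds to one of the subsections above, and selecting columns by name avoids relying on column numbers.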
