# Draft Workflow

This workflow processes the raw data stored in `../data/raw` step by step.

1. Preprocess the H2AX/$\gamma$-H2AX .bed files: remap them to hg38
2. Preprocess the somatic mutation .maf files: convert them to .bed files
3. Filter the .bed files against the hg38 blacklist and RepeatMasker regions (and possibly other filters)
4. Use bedtools window to find the intersection between H2AX/$\gamma$-H2AX and somatic mutations

These are the basic four steps to this workflow.

Before doing anything, just make sure we're in the root directory of this project.

```{bash}
# Ensure data/ lies in your current directory
cd ../
```

> NOTE: The above must appear at the beginning of every bash cell, since bash cells do not appear to share state (such as the working directory) with each other.

The Python imports necessary for this notebook are shown below.

In [25]:
from pybedtools import BedTool # Note this requires bedtools
import os
import glob

Since Python cells communicate with each other, we can just change the current directory once.

In [26]:
# Ensure at top-level directory
os.chdir("../")
os.getcwd()

'/home/work2017/Documents/Jamin'

Current directory tree:

```
   .
   |-data
   |---preprocessed
   |-----filtered
   |-------marker
   |-------mutations
   |-----liftOver
   |-----somaticmutations_bed
   |---processed
   |---raw
   |-----blacklists
   |-----gdc_breast_somaticmutations
   |-----mark_gammaH2AX
   |-----mark_H2AX
   |-----mark_other
   |-src
```

# 1: Preprocessing H2AX/$\gamma$-H2AX .bed files

.bed data from GEO accession GSE25577 were downloaded, extracted, and split into the directories `/data/raw/mark_H2AX`, `/data/raw/mark_gammaH2AX` and `/data/raw/mark_other`.

Remapping will be done with liftOver, provided by UCSC (https://genome.ucsc.edu/cgi-bin/hgLiftOver).

So far I have tested the remapping on `GSM628535_jurkat_gh2ax.bed` and saved the two output files, `GSM628535_jurkat_gh2ax_liftOver.bed` and `GSM628535_jurkat_gh2ax_liftOver.err` (error output), in `/data/preprocessed/liftOver`. Attempts with larger files have not run to completion yet.

> This will need to be done from the command line later, but the current approach suffices for testing.
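
As a starting point for the command-line version, here is a minimal sketch that builds the liftOver call for one file. It assumes the UCSC `liftOver` binary is on the PATH and that the hg18 to hg38 chain file has been downloaded; the chain file path and the helper names `build_liftover_command` and `run_liftover` are hypothetical, not part of this project yet.

```python
import os
import subprocess


def build_liftover_command(bed_path, chain_path, out_dir):
    """Return the liftOver argument list for one BED file.

    UCSC liftOver usage: liftOver oldFile map.chain newFile unMapped
    """
    stem = os.path.splitext(os.path.basename(bed_path))[0]
    lifted = os.path.join(out_dir, stem + "_liftOver.bed")
    unmapped = os.path.join(out_dir, stem + "_liftOver.err")
    return ["liftOver", bed_path, chain_path, lifted, unmapped]


def run_liftover(bed_path, chain_path, out_dir):
    """Run liftOver for one file; assumes the `liftOver` binary is on PATH."""
    subprocess.run(build_liftover_command(bed_path, chain_path, out_dir), check=True)


# Hypothetical chain file location (assumption: downloaded from UCSC beforehand)
cmd = build_liftover_command(
    "data/raw/mark_gammaH2AX/GSM628535_jurkat_gh2ax.bed",
    "data/raw/hg18ToHg38.over.chain.gz",
    "data/preprocessed/liftOver",
)
print(" ".join(cmd))
```

Only the command is built and printed here; `run_liftover` would be called once per file, for example in a loop over `glob.glob("data/raw/mark_*/*.bed")`.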

My greatest concern here is the conversion between assembly versions. I need to read up on it more before I feel confident about it.

Let's check this:

In [39]:
%%bash

# Ensure data/ lies in your current directory
cd ../

head data/preprocessed/liftOver/GSM628535_jurkat_gh2ax_liftOver.bed

chr1	10150	10350	.	.	-
chr1	10474	10811	.	.	+
chr1	10332	10532	.	.	-
chr1	10503	10840	.	.	+
chr1	10526	10863	.	.	+
chr1	10528	10865	.	.	+
chr1	10532	10869	.	.	+
chr1	10443	10780	.	.	-
chr1	12879	13079	.	.	-
chr1	14909	15109	.	.	+


Just out of interest, let's compare this to the original hg18 raw .bed file:

In [40]:
%%bash

# Ensure data/ lies in your current directory
cd ../

head data/raw/mark_gammaH2AX/GSM628535_jurkat_gh2ax.bed

chr1	150	350	.	.	 -
chr1	474	674	.	.	 +
chr1	332	532	.	.	 -
chr1	503	703	.	.	 +
chr1	526	726	.	.	 +
chr1	528	728	.	.	 +
chr1	532	732	.	.	 +
chr1	443	643	.	.	 -
chr1	2742	2942	.	.	 -
chr1	4772	4972	.	.	 +


# 2: Preprocessing Somatic Mutations .maf files

.maf files of somatic mutations were downloaded from the GDC data portal using the query `Primary Site IS Breast AND Data Type IS Masked Somatic Mutation`. The cart was extracted and placed in the directory `/data/raw/gdc_breast_somaticmutations`.


To convert each .maf file to a .bed file, columns 5-8 inclusive (Chromosome, Start_Position, End_Position and Strand) are extracted from row 6 onwards, and two placeholder columns populated with "." are inserted after the third column. Assigning to `$3` makes awk rebuild each line with its default separator, a single space, so the tabs are lost (unfortunately, I am still new to awk...); the spaces are therefore replaced with tabs afterwards to produce the .bed file.

In [17]:
%%bash

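# Work inside the raw MAF directory; the output path below is relative to it
cd ../data/raw/gdc_breast_somaticmutations

for filename in *.maf;
do 
    # Keep columns 5-8, keep rows 6 onwards, insert two "." columns
    # after column 3, then convert the space delimiters to tabs
    cut -f 5-8 "$filename" \
    | awk 'NR >= 6 {print}' \
    | awk '$3 = $3 FS "." FS "."' \
    | sed s/" "/"\t"/g \
    > ../../preprocessed/somaticmutations_bed/"${filename%.maf}".bed
done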

> NOTE TO SELF: Must come back to this! This isn't generalised yet (especially the saving to file at the end: make it the same name as the original, but with the .bed extension).
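
For comparison, here is a sketch of the same conversion in Python that selects columns by name instead of by position, which may be easier to generalise. It assumes the standard MAF column names (`Chromosome`, `Start_Position`, `End_Position`, `Strand`) and that header comments start with `#`. The function name, the `zero_based_start` option, and the example rows (including the gene name `GENE1`) are made up for illustration. The option defaults to `False` so the output matches the bash pipeline above; MAF positions are 1-based while BED starts are 0-based, so it may be worth switching on later.

```python
import csv


def maf_to_bed_rows(lines, zero_based_start=False):
    """Yield BED6 rows (chrom, start, end, name, score, strand) from MAF lines.

    Lines starting with '#' are skipped; the first remaining line is the header.
    """
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for record in reader:
        start = int(record["Start_Position"])
        if zero_based_start:
            start -= 1  # MAF is 1-based inclusive; BED is 0-based half-open
        yield [record["Chromosome"], str(start), record["End_Position"],
               ".", ".", record["Strand"]]


# Illustrative input only (hypothetical header comment and gene name)
example = [
    "#version gdc-1.0.0\n",
    "Hugo_Symbol\tChromosome\tStart_Position\tEnd_Position\tStrand\n",
    "GENE1\tchr1\t152355460\t152355460\t+\n",
]
for row in maf_to_bed_rows(example):
    print("\t".join(row))
```

In practice `lines` would be an open file handle for one of the downloaded .maf files.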

Let's check this:

In [37]:
%%bash

# This is necessary again because bash cells don't seem to communicate with each other
cd ../

head data/preprocessed/somaticmutations_bed/gdc_breast.bed

chr1	152355460	152355460	.	.	+
chr1	190098358	190098358	.	.	+
chr1	231694371	231694371	.	.	+
chr2	88947539	88947539	.	.	+
chr2	96865893	96865893	.	.	+
chr2	113443886	113443886	.	.	+
chr2	124772881	124772881	.	.	+
chr2	166056443	166056443	.	.	+
chr2	238025349	238025349	.	.	+
chr3	11259195	11259195	.	.	+


# 3: Filters

The hg38 blacklist was obtained from https://sites.google.com/site/anshulkundaje/projects/blacklists and was saved to `/data/raw/blacklists`. 

> I won't use RepeatMasker right now since I'm not sure how to do this at this stage...it seems I need a sequence (FASTA) file to do so...

With only a small blacklist at the moment, this is not a difficult task. Any further blacklists will need to be concatenated with it first.
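
A rough sketch of how several blacklist files could be combined into one non-overlapping set, in plain Python. The helper names `read_bed_intervals` and `merge_intervals` are hypothetical; only the first three BED columns are used, and touching intervals are merged as well as overlapping ones.

```python
def read_bed_intervals(path):
    """Read (chrom, start, end) tuples from a BED file, ignoring extra columns."""
    intervals = []
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split("\t")[:3]
            intervals.append((chrom, int(start), int(end)))
    return intervals


def merge_intervals(intervals):
    """Sort intervals and merge those that overlap or touch on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Made-up coordinates, as if read from two different blacklist files
combined = merge_intervals([
    ("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 10, 20), ("chr1", 400, 500),
])
print(combined)
```

The merged tuples could then be written out as a single .bed file and passed to `BedTool` in the same way as `hg38.blacklist.bed`.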

In [22]:
# INPUT/OUTPUT
dir_liftOver = "data/preprocessed/liftOver/"
dir_somaticmutations = "data/preprocessed/somaticmutations_bed/"

dir_blacklist = "data/raw/blacklists/"
name_blacklist = "hg38.blacklist.bed"

dir_save = "data/preprocessed/filtered/"

In [3]:
# Produce blacklist BedTool instance
blacklist = BedTool(os.path.join(dir_blacklist, name_blacklist))

# Filter both the liftOver marker files and the somatic mutation files
dir_list = [dir_liftOver, dir_somaticmutations]
type_list = ["marker/", "mutations/"]
for directory, type_dir in zip(dir_list, type_list):
    for filename in glob.glob(os.path.join(directory, "*.bed")):
        full_bed = BedTool(filename)
        # subtract() removes the parts of each interval that overlap the blacklist
        filtered_bed = full_bed.subtract(blacklist)
        # Save as data/preprocessed/filtered/<type>/filtered_<original name>
        filtered_bed.saveas(os.path.join(dir_save, type_dir,
                                         "filtered_" + os.path.basename(filename)))

I would normally look at an example of the filtered output here, but the blacklist is currently so small that there is little point for now.
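
One thing worth keeping in mind about the filtering step: by default, bedtools subtract removes only the portion of an interval that overlaps a blacklisted region, so an interval can be trimmed or split rather than dropped (the `-A` option of bedtools subtract removes the whole feature instead). The toy function below is not pybedtools code, just a plain-Python illustration of the default behaviour for a single interval and a single blacklisted region on the same chromosome.

```python
def subtract_interval(feature, blocked):
    """Return the parts of `feature` (start, end) left after removing `blocked`.

    Coordinates are half-open [start, end) on the same chromosome.
    """
    f_start, f_end = feature
    b_start, b_end = blocked
    if b_end <= f_start or b_start >= f_end:
        return [feature]  # no overlap: the feature is untouched
    pieces = []
    if b_start > f_start:
        pieces.append((f_start, b_start))  # part left of the blocked region
    if b_end < f_end:
        pieces.append((b_end, f_end))  # part right of the blocked region
    return pieces


print(subtract_interval((100, 300), (250, 400)))  # trimmed on the right
print(subtract_interval((100, 300), (150, 200)))  # split into two pieces
print(subtract_interval((100, 300), (0, 500)))    # removed entirely
```

Because of trimming and splitting, the line count of a filtered file is not simply the original count minus the number of blacklisted intervals.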

# 4: Finding the Intersection between Marker and Mutations

To find the intersection, we will use `BedTool.window`, which reports mutations that fall within a symmetric window (a fixed number of base pairs upstream and downstream) around each marker interval.
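
To make the window idea concrete, here is a plain-Python sketch of the test applied to one marker interval and one mutation on the same chromosome. It is only an illustration: the function name is made up, and the exact boundary handling in bedtools (especially for mutations whose start equals their end) may differ from the inclusive comparison used here.

```python
def within_window(marker, mutation, w=1000):
    """True if `mutation` overlaps `marker` after padding the marker by w bp per side.

    Both arguments are (start, end) tuples on the same chromosome.
    """
    m_start, m_end = marker
    x_start, x_end = mutation
    padded_start = max(0, m_start - w)  # do not extend past the chromosome start
    padded_end = m_end + w
    return x_start <= padded_end and x_end >= padded_start


marker = (10150, 10350)
print(within_window(marker, (11000, 11000)))  # 650 bp downstream of the marker
print(within_window(marker, (12000, 12000)))  # 1650 bp downstream of the marker
```

With the 1000 bp window, the first mutation counts as a hit and the second does not.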

In [23]:
# INPUT/OUTPUT
dir_filtered = "data/preprocessed/filtered/"
dir_marker = "".join([dir_filtered, "marker/"])
dir_mutations = "".join([dir_filtered, "mutations/"])

dir_save = "data/processed/"

window_size = 1000 #default

In [11]:
for file_marker in glob.glob("".join([dir_marker, "*.bed"])):
    for file_mutations in glob.glob("".join([dir_mutations, "*.bed"])):
        bed_marker = BedTool(file_marker)
        bed_mutations = BedTool(file_mutations)
        bed_window = bed_marker.window(bed_mutations, w = window_size)
        bed_window.saveas("".join([dir_save,
                                   file_marker.replace(dir_marker, "").replace("filtered_", "").replace("_liftOver.bed", ""),
                                   "_X_",
                                   file_mutations.replace(dir_mutations, "").replace("filtered_", "")
                                  ]))

# Brief Look at the Processed Data


## Proportion of Marker that Matched with Mutation within Window


In [19]:
# Quick function to count lines in file
def count_lines(filename):
    # Use a context manager so the file handle is closed after counting
    with open(filename) as handle:
        num_lines = sum(1 for line in handle)
    return num_lines

In [27]:
total_marker = count_lines("".join([dir_marker, "filtered_GSM628535_jurkat_gh2ax_liftOver.bed"]))
print(total_marker)

11669349


In [34]:
total_marker_unfiltered = count_lines("".join([dir_liftOver, "GSM628535_jurkat_gh2ax_liftOver.bed"]))
print(total_marker_unfiltered)

11669508


In [29]:
# File name as produced by the window step above, which strips "_liftOver.bed" from the marker name
total_intersect = count_lines("".join([dir_save, "GSM628535_jurkat_gh2ax_X_gdc_breast.bed"]))
print(total_intersect)

898935


In [20]:
total_intersect / total_marker * 100

7.7033860243617704
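
Note that each line of the window output is one marker-mutation pair, so a marker with several mutations in its window is counted more than once and the percentage above is an upper bound on the share of markers with a hit. A possible way to count distinct markers instead is sketched below. The function name is hypothetical, and it assumes the first six columns of each output line are the marker record, as in the .bed files used here. Identical marker rows in the input would also collapse into one.

```python
def count_unique_markers(window_lines, marker_columns=6):
    """Count distinct marker records among bedtools window output lines.

    Each output line is a marker record followed by the mutation paired with it.
    """
    seen = set()
    for line in window_lines:
        fields = line.rstrip("\n").split("\t")
        seen.add(tuple(fields[:marker_columns]))
    return len(seen)


# Made-up output: the first marker pairs with two different mutations
example_lines = [
    "chr1\t10150\t10350\t.\t.\t-\tchr1\t10900\t10900\t.\t.\t+\n",
    "chr1\t10150\t10350\t.\t.\t-\tchr1\t11000\t11000\t.\t.\t+\n",
    "chr1\t14909\t15109\t.\t.\t+\tchr1\t15500\t15500\t.\t.\t+\n",
]
print(count_unique_markers(example_lines))
```

For the real data, `window_lines` would be an open handle on the file in `data/processed/`, and the result would replace `total_intersect` in the ratio.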

Now just looking at the mutation file:

In [31]:
total_mutations = count_lines("".join([dir_mutations, "filtered_gdc_breast.bed"]))
print(total_mutations)

88299


## Trying a Control...

In [40]:
control_bed = BedTool()
test_control = control_bed.random(l=0, n=total_marker_unfiltered, genome="hg38", seed=0)
test_control_filtered = test_control.subtract(blacklist)
test_control_window = test_control_filtered.window(BedTool("".join([dir_mutations, "filtered_gdc_breast.bed"])), w=window_size)
print(test_control_window.count())

642933
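
A quick way to put the two hit counts side by side is a simple ratio of observed to control hits, sketched below with the numbers printed above. This is only a rough comparison: the helper name is made up, the control intervals were drawn with length 0 while the marker intervals are at least 200 bp long, and both counts are marker-mutation pairs rather than distinct markers.

```python
def fold_enrichment(observed_hits, control_hits):
    """Ratio of observed window hits to control window hits."""
    if control_hits == 0:
        raise ValueError("control_hits must be greater than zero")
    return observed_hits / control_hits


# Counts taken from the cells above (marker vs. random control)
ratio = fold_enrichment(898935, 642933)
print(round(ratio, 3))
```

This comes out at roughly 1.4 times as many hits for the marker as for the random control; a length-matched control and several random seeds would be needed before reading much into it.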
