# Draft Workflow

This workflow processes the data stored in `../data/raw` step by step:

1. Preprocess the H2AX/$\gamma$-H2AX .bed files: remap to hg38
2. Preprocess the somatic mutation .maf files: convert to .bed
3. Filter the .bed files against the hg38 blacklist and RepeatMasker regions (and other?)
4. Use `bedtools window` to find intersections between H2AX/$\gamma$-H2AX marks and somatic mutations

These are the four basic steps of the workflow.
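
Steps 3 and 4 can be sketched ahead of time on toy data. The block below is only an illustration under stated assumptions: the three input files are hypothetical miniatures written to a temporary directory, the 100 bp window is an arbitrary choice, and only the marks file is filtered. It uses `bedtools intersect -v` to drop blacklisted intervals and `bedtools window -w` to pair nearby features, and it skips both calls if `bedtools` is not installed.

```bash
# Hypothetical miniature inputs so the sketch is self-contained.
work=$(mktemp -d)
printf 'chr1\t100\t300\t.\t.\t+\nchr1\t5000\t5200\t.\t.\t-\n' > "$work/marks.bed"
printf 'chr1\t349\t350\t.\t.\t+\n' > "$work/mutations.bed"
printf 'chr1\t4900\t5300\n' > "$work/blacklist.bed"

if command -v bedtools >/dev/null 2>&1; then
    # Step 3: keep only marks that do not overlap a blacklisted region.
    bedtools intersect -v -a "$work/marks.bed" -b "$work/blacklist.bed" \
        > "$work/marks.filtered.bed"
    # Step 4: report mark/mutation pairs within 100 bp of each other
    # (the window size is an assumption made for this example).
    bedtools window -a "$work/marks.filtered.bed" -b "$work/mutations.bed" -w 100 \
        > "$work/hits.bed"
    cat "$work/hits.bed"
else
    echo "bedtools not found; skipping steps 3 and 4" >&2
fi
```

The real run would substitute the preprocessed files for the toy inputs and would need a window size chosen for the analysis.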

Before doing anything, just make sure we're in the root directory of this project.

In [None]:
# Ensure data/ lies in your current directory
cd ../

> NOTE: The `cd ../` above must be repeated at the beginning of every bash cell. Each `%%bash` cell appears to run in its own shell, so a directory change made in one cell does not carry over to the next.
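
The behaviour in the note can be reproduced outside the notebook. This is a minimal sketch, assuming each bash cell is executed as a separate bash process; `bash -c` stands in for a cell here, and the variable names are made up for the example.

```bash
# Each `bash -c` call imitates one bash cell: a separate child process
# with its own working directory (an assumption about the notebook).
start_dir=$(pwd)

# "Cell 1": change directory inside the child process.
cell1_dir=$(bash -c 'cd / && pwd')

# "Cell 2": a fresh child process starts from the parent's directory again.
cell2_dir=$(bash -c 'pwd')

echo "cell 1 ran in: $cell1_dir"
echo "cell 2 ran in: $cell2_dir"
```

The second process does not see the `cd` from the first, which is why every cell has to change directory for itself.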

# 1: Preprocessing H2AX/$\gamma$-H2AX .bed files

The .bed data from GEO accession GSE25577 were downloaded, extracted, and split into the directories `/data/raw/mark_H2AX`, `/data/raw/mark_gammaH2AX` and `/data/raw/mark_other`.

> NOTE: HAVE NOT YET REMAPPED TO hg38. liftOver doesn't seem to be cooperating at the moment.

I have tested the remapping on `GSM628535_jurkat_gh2ax.bed` and saved the two output files, `GSM628535_jurkat_gh2ax_liftOver.bed` and `GSM628535_jurkat_gh2ax_liftOver.err` (error output), in `/data/preprocessed/liftOver`.

> It is imperative that you do this locally from the command line at a later time, but at the moment this will suffice as a test.

Let's check this:

In [39]:
%%bash

# Ensure data/ lies in your current directory
cd ../

head data/preprocessed/liftOver/GSM628535_jurkat_gh2ax_liftOver.bed

chr1	10150	10350	.	.	-
chr1	10474	10811	.	.	+
chr1	10332	10532	.	.	-
chr1	10503	10840	.	.	+
chr1	10526	10863	.	.	+
chr1	10528	10865	.	.	+
chr1	10532	10869	.	.	+
chr1	10443	10780	.	.	-
chr1	12879	13079	.	.	-
chr1	14909	15109	.	.	+
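
Beyond looking at the first rows, it may help to compare how many records were remapped with how many were rejected. The sketch below assumes the error file follows liftOver's usual layout, in which each unmapped record is preceded by a `#` line giving the reason. The two sample files are hypothetical stand-ins written to a temporary directory, not the real outputs.

```bash
# Hypothetical stand-ins for the liftOver output files, so the sketch
# runs without the real data.
tmp=$(mktemp -d)
printf 'chr1\t10150\t10350\t.\t.\t-\nchr1\t10474\t10811\t.\t.\t+\n' \
    > "$tmp/sample_liftOver.bed"
printf '#Deleted in new\nchr1\t500\t700\t.\t.\t+\n' \
    > "$tmp/sample_liftOver.err"

# Mapped records: every line of the remapped .bed file.
mapped=$(awk 'END { print NR }' "$tmp/sample_liftOver.bed")

# Unmapped records: lines of the error file that are not "#" reason lines.
unmapped=$(awk '!/^#/ { n++ } END { print n + 0 }' "$tmp/sample_liftOver.err")

echo "mapped: $mapped, unmapped: $unmapped"
```

Pointing the two `awk` calls at the files in `data/preprocessed/liftOver` would give the counts for the actual test run.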


# 2: Preprocessing Somatic Mutations .maf files

The somatic mutation .maf files were downloaded from the GDC Data Portal using the query `Primary Site IS Breast AND Data Type IS Masked Somatic Mutation`. The cart was extracted and placed in the directory `/data/raw/gdc_breast_somaticmutations`.


To convert the .maf file to a .bed file, columns 5-8 (inclusive) are extracted, the first five lines are skipped (only rows >= 6 are kept), and two columns populated with "." are inserted after the third column. Assigning to a field makes awk rebuild the line with its default output separator, a single space, and `FS` also defaults to a space here, so the result comes out space-delimited rather than tab-delimited. The spaces are therefore replaced with tabs to produce the .bed file.
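
The extra `sed` pass can be avoided by telling awk to use tabs for both input and output. This is a minimal sketch rather than the cell actually run in this notebook: the two sample rows mirror the output shown further down and are for illustration only, and the variable names are made up.

```bash
# Two rows in the shape of MAF columns 5-8 (chromosome, start, end, strand).
sample=$(printf 'chr1\t152355460\t152355460\t+\nchr2\t88947539\t88947539\t+\n')

# Setting FS and OFS to a tab makes awk read and write tab-delimited
# fields, so the two "." columns can be inserted without a sed pass.
bed=$(printf '%s\n' "$sample" \
    | awk 'BEGIN { FS = OFS = "\t" } { print $1, $2, $3, ".", ".", $4 }')

printf '%s\n' "$bed"
```

Listing the output fields explicitly also makes the column order of the resulting .bed file easy to read off the command.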

In [35]:
%%bash

# Ensure data/ lies in your current directory
cd ../

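# cut: keep MAF columns 5-8 (chromosome, start, end, strand)
# awk: keep rows >= 6; NR counts across all files matched by the glob
# awk: append two "." columns after column 3 (line becomes space-delimited)
# sed: turn the spaces back into tabs (\t in the replacement needs GNU sed)
cut -f 5-8 data/raw/gdc_breast_somaticmutations/*.maf \
| awk 'NR >= 6 {print}' \
| awk '$3 = $3 FS "." FS "."' \
| sed 's/ /\t/g' \
> data/preprocessed/somaticmutations_bed/gdc_breast.bed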

/home/work2017/Documents/Jamin


Let's check this:

In [37]:
%%bash

# This is necessary again because bash cells don't seem to communicate with each other
cd ../

head data/preprocessed/somaticmutations_bed/gdc_breast.bed

chr1	152355460	152355460	.	.	+
chr1	190098358	190098358	.	.	+
chr1	231694371	231694371	.	.	+
chr2	88947539	88947539	.	.	+
chr2	96865893	96865893	.	.	+
chr2	113443886	113443886	.	.	+
chr2	124772881	124772881	.	.	+
chr2	166056443	166056443	.	.	+
chr2	238025349	238025349	.	.	+
chr3	11259195	11259195	.	.	+
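
Note that every row above has identical start and end values. BED intervals are 0-based and half-open, whereas MAF start and end positions are 1-based and inclusive, so a single-base mutation at position p is conventionally written in BED as start p-1, end p. The sketch below shows that adjustment on one of the rows above. It is not part of the conversion cell, and whether to apply it depends on how the later steps should treat zero-length intervals.

```bash
# One row in the layout produced above (chrom, start, end, ".", ".", strand).
row=$(printf 'chr1\t152355460\t152355460\t.\t.\t+')

# Subtract 1 from the start only: 1-based inclusive [p, p] becomes
# 0-based half-open [p-1, p), which covers exactly one base.
fixed=$(printf '%s\n' "$row" \
    | awk 'BEGIN { FS = OFS = "\t" } { $2 = $2 - 1; print }')

printf '%s\n' "$fixed"
```

The same `$2 = $2 - 1` step could be folded into the conversion pipeline if the adjusted coordinates turn out to be the ones wanted.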


# Filter 