Rework Hi-C workflow proposal

stjudecloud · Apr 4, 2024 · acb94d7 · acb94d7
1 parent 6b0751e
commit acb94d7
Showing 1 changed file with 67 additions and 14 deletions.
diff --git a/text/hic-workflow.md b/text/hic-workflow.md
@@ -30,16 +30,37 @@ The Hi-C method measures the frequency at which two fragments physically associa
 Though Hi-C experiments utilize paired-end sequencing, the reads can not be mapped using the paired-end mode of aligners. Since the two reads in the pair come from different fragments, the insert size of the pair is highly variable and this violates the assumptions that most paired-end aligners make. Therefore reads are often mapped as independent single-end reads. The goal of finding the ligation junctions in the Hi-C experiment have led to most projects using an iterative mapping strategy. Often starting with a truncated read and searching for a unique mapping. Non-uniquely mapped reads are then extended and an additional alignment run is performed. This can be done iteratively until the reads are either uniquely mapped or no unique mapping is found for each read. Only reads with both ends uniquely mapped are retained for analysis.
 
 At Jinghui's direction, this workflow will provide Hi-C data in a raw format. We will opt to provide this in unaligned BAM format.
+
 # Specification
 
 ## Dependencies
 
 - Picard Tools
+- samtools
+- bwa
+- pairix
+- juicer
 
 ## Reference Files
 
-There are no reference files for the Hi-C harmonization workflow.
+The following reference files are used as the basis of the Hi-C Workflow:
+
+- Similarly to all analysis pipelines in St. Jude Cloud, we use the `GRCh38_no_alt` analysis set for our reference genome. You can get a copy of the file [here](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz). Additionally, you can get the file by running the following commands:
+
+  ```bash
+  wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz -O GRCh38_no_alt.fa.gz
+  gunzip GRCh38_no_alt.fa.gz
+
+  echo "a6da8681616c05eb542f1d91606a7b2f  GRCh38_no_alt.fa" > GRCh38_no_alt.fa.md5
+  md5sum -c GRCh38_no_alt.fa.md5
+  # > GRCh38_no_alt.fa: OK
+  ```
 
+- The following command is used to prepare the BWA index file:
+
+    ```bash
+    bwa index GRCh38_no_alt.fa
+    ```
 
 ## Workflow
 
@@ -54,27 +75,59 @@ Here are the resulting steps in the Hi-C workflow. There might be slight alterat
                         IGNORE=MISSING_PLATFORM_VALUE
     ```
 
-2. Run Picard `RevertSam` on the validated BAM file.
+2. Run samtools `collate` on the validated BAM file.
 
     ```bash
-        picard RevertSam \
-                INPUT=$INPUT_BAM \                        # Input BMA to revert
-                OUTPUT=$REVERTED_BAM \                    # Output unaligned BAM name
-                REMOVE_ALIGNMENT_INFORMATION=true \       # Remove alignments
-                REMOVE_DUPLICATE_INFORMATION=true \       # Remove duplicate flags
-                VALIDATION_STRINGENCY=SILENT \            # Ignore some validation warnings
-                SORT_ORDER=queryname                      # Sort by queryname
+        samtools collate \ 
+            -o $COLLATED_BAM \ 
+            $INPUT_BAM
     ```
 
-3. Run `picard ValidateSam` on the reverted BAM to ensure that it is well-formed.
+3. Run Samtools `fastq` on the collated BAM file.
 
     ```bash
-    picard ValidateSamFile \
-                        I=$REVERTED_BAM \                  # Input BAM.
-                        IGNORE=INVALID_PLATFORM_VALUE \    # Validations to ignore.
-                        IGNORE=MISSING_PLATFORM_VALUE
+        samtools fastq \
+                -1 $READ_ONE_GZ \    # Fastq with read 1s
+                -2 $READ_TWO_GZ \    # Fastq with read 2s
+                $COLLATED_BAM        # Input BMA to convert to fastq
+
     ```
+4. Run picard `FastqToSam` on fastq files to create unaligned BAM file.
 
+    ```bash
+        picard FastqToSam \      
+            FASTQ=$READ_ONE_GZ \   # Fastq with read 1s
+            FASTQ2=$READ_TWO_GZ \  # Fastq with read 2s
+            OUTPUT=$UNALIGNED_BAM  # Output unaligned BAM file
+    ```
+
+5. Run bwa `mem` on fastq files to create aligned BAM file.
+
+    ```bash
+        bwa mem \ 
+            $BWA_DB \             # BWA reference database
+            $READ_ONE_GZ \        # Fastq with read 1s
+            $READ_TWO_GZ \        # Fastq with read 2s
+            | samtools view -hb \ # Convert output SAM to BAM
+            > $ALIGNED_BAM
+    ```
+
+6. Run pairix `bam2pairs` on aligned BAM file.
+
+    ```bash
+        bam2pairs \ 
+            $ALIGNED_BAM \    # BAM file from bwa mem
+            $PAIRS            # Output mapped read pairs
+    ```
+
+7. Run Juicer `pre` to generate .hic file.
+
+    ```bash
+        juicer_tools pre \ 
+            $PAIRS \         # Reads with both mates mapped
+            $HIC \           # hic file with multiple resolutions
+            $GENOME_ID       # Predefined genome matching alignment
+    ```
 
 # Further Reading
 - [The Hitchhiker's Guide to Hi-C Analysis: Practical guidelines](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4347522/)