# Case study: Regular exercise suppresses steatosis-associated liver cancer development by degrading E2F1 and c-Myc via circadian gene upregulation

## Theodoro Gasperin Terra Camargo (r0974221)

**Research question:** How does regular exercise suppress liver cancer development at the molecular level in mice?

**Experiment setup:** 

* Exercise group:
    * tumor tissue
    * non-tumor tissue
    
* Non-Exercise group:
    * tumor tissue (positive control)
    * non-tumor tissue (negative control)
    
**Comparisons of interest:**

* Exercise (Non-Tumor) vs. Non-Exercise (Non-Tumor) --> genes affected by exercise in healthy tissue
* Exercise (Tumor) vs. Non-Exercise (Tumor) --> genes affected by exercise in cancer tissue
    
**Note**:

All processed data can be found here:

https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA1142455&o=acc_s%3Aa

Paper can be found here: 

https://onlinelibrary.wiley.com/doi/full/10.1111/gtc.13161


## Step 0: set up working directory

In [1]:
pwd

/mnt/storage/r0974221/jupyternotebooks/assignment_1


In [2]:
mkdir -p /mnt/storage/$USER/jupyternotebooks/assignment_1/bulk_RNA_seq
cd /mnt/storage/$USER/jupyternotebooks/assignment_1/bulk_RNA_seq

In [3]:
pwd

/mnt/storage/r0974221/jupyternotebooks/assignment_1/bulk_RNA_seq


## Step 1: get reads
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE273579

In [6]:
mkdir fastq_files #make directory to store fastq files

In [8]:
# Download raw fastq files
# SRR30054413 SRR30054417: fastq files for two tumor exercise samples
# SRR30054426 SRR30054429: fastq files for two non-tumor exercise samples
# SRR30054446 SRR30054452: fastq files for two tumor non-exercise samples
# SRR30054455 SRR30054456: fastq files for two non-tumor non-exercise samples
# --split-files: Write read 1 and read 2 to separate files (necessary for paired-end data; these runs only produce _1 files)
# --outdir fastq_files: Save files in the specified folder
# --verbose: Show all log messages
fastq-dump \
    SRR30054413 SRR30054417 \
    SRR30054426 SRR30054429 \
    SRR30054446 SRR30054452 \
    SRR30054455 SRR30054456 \
    --split-files \
    --outdir fastq_files \
    --verbose 

Preference setting is: Prefer SRA Normalized Format files with full base quality scores if available.
SRR30054413 is an SRA Normalized Format file with full base quality scores.
Read 6216860 spots for SRR30054413
Written 6216860 spots for SRR30054413
SRR30054417 is an SRA Normalized Format file with full base quality scores.
Read 5970478 spots for SRR30054417
Written 5970478 spots for SRR30054417
SRR30054426 is an SRA Normalized Format file with full base quality scores.
Read 5876033 spots for SRR30054426
Written 5876033 spots for SRR30054426
SRR30054429 is an SRA Normalized Format file with full base quality scores.
Read 6187225 spots for SRR30054429
Written 6187225 spots for SRR30054429
SRR30054446 is an SRA Normalized Format file with full base quality scores.
Read 5920104 spots for SRR30054446
Written 5920104 spots for SRR30054446
SRR30054452 is an SRA Normalized Format file with full base quality scores.
Read 5930437 spots for SRR30054452
Written 5930437 spots for SRR30054452
SRR3

In [9]:
ls fastq_files

SRR30054413_1.fastq  SRR30054429_1.fastq  SRR30054455_1.fastq
SRR30054417_1.fastq  SRR30054446_1.fastq  SRR30054456_1.fastq
SRR30054426_1.fastq  SRR30054452_1.fastq


In [10]:
head -n4 fastq_files/SRR30054413_1.fastq

@SRR30054413.1 NB501915:417:HLJYNBGXW:3:11401:16292:1022 length=76
CACTANTCTGTTCTACATTAAAGTTCCTTTCCATAGAACTAGATTCTTCTGCATGGATACAGACTAAAGTCAGTTC
+SRR30054413.1 NB501915:417:HLJYNBGXW:3:11401:16292:1022 length=76
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE


### We will rename the files so it is easier to distinguish what data we are looking at.

- __exe_tumor_1__: Exercise, tumor, replicate 1
- __exe_tumor_2__: Exercise, tumor, replicate 2
- __exe_non_tumor_1__: Exercise, non-tumor, replicate 1
- __exe_non_tumor_2__: Exercise, non-tumor, replicate 2
- __non_exe_tumor_1__: Non-Exercise, tumor, replicate 1
- __non_exe_tumor_2__: Non-Exercise, tumor, replicate 2
- __non_exe_non_tumor_1__: Non-Exercise, non-tumor, replicate 1
- __non_exe_non_tumor_2__: Non-Exercise, non-tumor, replicate 2

In [11]:
mv fastq_files/SRR30054413_1.fastq fastq_files/exe_tumor_1.fastq
mv fastq_files/SRR30054417_1.fastq fastq_files/exe_tumor_2.fastq
mv fastq_files/SRR30054426_1.fastq fastq_files/exe_non_tumor_1.fastq
mv fastq_files/SRR30054429_1.fastq fastq_files/exe_non_tumor_2.fastq
mv fastq_files/SRR30054446_1.fastq fastq_files/non_exe_tumor_1.fastq
mv fastq_files/SRR30054452_1.fastq fastq_files/non_exe_tumor_2.fastq
mv fastq_files/SRR30054455_1.fastq fastq_files/non_exe_non_tumor_1.fastq
mv fastq_files/SRR30054456_1.fastq fastq_files/non_exe_non_tumor_2.fastq

In [12]:
ls fastq_files

exe_non_tumor_1.fastq  exe_tumor_2.fastq          non_exe_tumor_1.fastq
exe_non_tumor_2.fastq  non_exe_non_tumor_1.fastq  non_exe_tumor_2.fastq
exe_tumor_1.fastq      non_exe_non_tumor_2.fastq
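
The eight `mv` commands above can also be driven from a small accession-to-name table, which keeps the mapping in one place. This is a minimal sketch, not the method used in this notebook: `rename_fastq` is a hypothetical helper, and it assumes every file follows the `<accession>_1.fastq` pattern seen in the `ls` output.

```shell
# Hypothetical helper: rename <accession>_1.fastq files inside a directory,
# using a two-column "accession name" table read from stdin.
rename_fastq() {
    dir=$1
    while read -r accession name; do
        # Skip blank lines in the table.
        [ -n "$accession" ] || continue
        src="$dir/${accession}_1.fastq"
        # Only rename files that actually exist; missing ones are ignored.
        [ -f "$src" ] && mv -- "$src" "$dir/${name}.fastq"
    done
    return 0
}

# Usage with the mapping from this notebook (first two samples shown):
# rename_fastq fastq_files <<'EOF'
# SRR30054413 exe_tumor_1
# SRR30054417 exe_tumor_2
# EOF
```

Keeping the table next to the sample description makes it easier to check that each accession is paired with the intended condition and replicate.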


In [13]:
cat fastq_files/exe_tumor_1.fastq | wc -l

24867440


In [14]:
echo 24867440/4 | bc #the number of reads is the number of lines in the fastq file divided by four

6216860


This number of reads (6216860) is consistent with the number of reads for SRR30054413 in "https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&page_size=10&acc=SRR30054413&display=reads"

![Screenshot%202024-11-11%20at%204.52.44%E2%80%AFPM.png](attachment:Screenshot%202024-11-11%20at%204.52.44%E2%80%AFPM.png)
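
The line-count arithmetic above can be wrapped in a small function so the same check is easy to repeat for every sample. This is a sketch under the assumption that each record occupies exactly four lines (no wrapped sequence lines); `count_fastq_reads` is a name introduced here for illustration.

```shell
# Hypothetical helper: number of reads in a FASTQ file,
# assuming exactly four lines per record.
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Usage, assuming the renamed files from the previous cells:
# for f in fastq_files/*.fastq; do
#     printf '%s\t%s\n' "$f" "$(count_fastq_reads "$f")"
# done
```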

## Step 2: Quality control using FASTQC

In [16]:
mkdir fastQC

In [17]:
# Run fastqc on all files
fastqc fastq_files/*.fastq -o fastQC

Started analysis of exe_non_tumor_1.fastq
Approx 5% complete for exe_non_tumor_1.fastq
Approx 10% complete for exe_non_tumor_1.fastq
Approx 15% complete for exe_non_tumor_1.fastq
Approx 20% complete for exe_non_tumor_1.fastq
Approx 25% complete for exe_non_tumor_1.fastq
Approx 30% complete for exe_non_tumor_1.fastq
Approx 35% complete for exe_non_tumor_1.fastq
Approx 40% complete for exe_non_tumor_1.fastq
Approx 45% complete for exe_non_tumor_1.fastq
Approx 50% complete for exe_non_tumor_1.fastq
Approx 55% complete for exe_non_tumor_1.fastq
Approx 60% complete for exe_non_tumor_1.fastq
Approx 65% complete for exe_non_tumor_1.fastq
Approx 70% complete for exe_non_tumor_1.fastq
Approx 75% complete for exe_non_tumor_1.fastq
Approx 80% complete for exe_non_tumor_1.fastq
Approx 85% complete for exe_non_tumor_1.fastq
Approx 90% complete for exe_non_tumor_1.fastq
Approx 95% complete for exe_non_tumor_1.fastq
Analysis complete for exe_non_tumor_1.fastq
Started analysis of exe_non_tumor_2.fastq

### Fastqc Analysis:

![Screenshot%202024-11-11%20at%205.13.52%E2%80%AFPM.png](attachment:Screenshot%202024-11-11%20at%205.13.52%E2%80%AFPM.png)


All FASTQ files failed the Kmer Content test, and some also failed the Sequence Duplication Levels test. I am not particularly worried about these failures because all the other tests passed. The Kmer Content failure could be caused by contamination from exogenous sources, while the duplication failure could reflect highly expressed genes or PCR amplification bias. Contamination is addressed by aligning the reads to the reference genome, which discards exogenous reads that do not align, and duplication levels will be normalized.
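
The per-module results summarized above can also be collected on the command line rather than read from the HTML reports. The sketch below assumes the standard FastQC `summary.txt` layout (status, module, file name, tab-separated) inside each `*_fastqc.zip`; `fastqc_flagged` is a hypothetical helper name.

```shell
# Hypothetical helper: read FastQC summary.txt lines on stdin and print
# every module that did not PASS as: file <TAB> status <TAB> module.
fastqc_flagged() {
    awk -F '\t' '$1 != "PASS" { print $3 "\t" $1 "\t" $2 }'
}

# Usage, assuming the zip archives written to fastQC/ by the cell above:
# for z in fastQC/*_fastqc.zip; do
#     unzip -p "$z" '*/summary.txt'
# done | fastqc_flagged
```

Listing only WARN and FAIL entries makes it quick to see whether a problem affects all samples or only a few.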

## Step 3: aligning reads to a reference genome

In [19]:
mkdir alignment

In [21]:
FILES_TO_ALIGN=`ls fastq_files | grep fastq`
echo "Following files will be aligned:" 
echo ${FILES_TO_ALIGN} | sed "s/ /, /g"              #Print fastq files and put commas between each file name
for fastq_file in ${FILES_TO_ALIGN}                  #Loop over all fastq files
do
    SAMPLE_NAME=`echo ${fastq_file} | cut -f 1 -d .` #Get the sample name by removing '.fastq'
    echo "Aligning ${SAMPLE_NAME}" 
    STAR \
        --genomeDir /mnt/storage/db/star/mm10/genomedir/ \
        --runThreadN 20 \
        --readFilesIn fastq_files/${fastq_file} \
        --outFileNamePrefix alignment/${SAMPLE_NAME}. \
        --outSAMtype BAM SortedByCoordinate 
done

Following files will be aligned:
exe_non_tumor_1.fastq, exe_non_tumor_2.fastq, exe_tumor_1.fastq, exe_tumor_2.fastq, non_exe_non_tumor_1.fastq, non_exe_non_tumor_2.fastq, non_exe_tumor_1.fastq, non_exe_tumor_2.fastq
Aligning exe_non_tumor_1
Nov 11 20:10:59 ..... started STAR run
Nov 11 20:11:00 ..... loading genome
Nov 11 20:12:43 ..... started mapping
Nov 11 20:13:34 ..... started sorting BAM
Nov 11 20:13:42 ..... finished successfully
Aligning exe_non_tumor_2
Nov 11 20:13:44 ..... started STAR run
Nov 11 20:13:44 ..... loading genome
Nov 11 20:15:08 ..... started mapping
Nov 11 20:15:58 ..... started sorting BAM
Nov 11 20:16:09 ..... finished successfully
Aligning exe_tumor_1
Nov 11 20:16:11 ..... started STAR run
Nov 11 20:16:11 ..... loading genome
Nov 11 20:17:18 ..... started mapping
Nov 11 20:17:59 ..... started sorting BAM
Nov 11 20:18:08 ..... finished successfully
Aligning exe_tumor_2
Nov 11 20:18:10 ..... started STAR run
Nov 11 20:18:10 ..... loading genome
Nov 11 20:19:19 
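
The loop above derives the sample name with `cut -f 1 -d .`, which keeps everything before the first dot. An alternative sketch using shell parameter expansion is shown below; it strips only the `.fastq` suffix, so it would also work for file names that contain extra dots. `sample_name` is a hypothetical helper, not part of the original loop.

```shell
# Hypothetical helper: turn a FASTQ path into a sample name.
sample_name() {
    base=${1##*/}          # drop any leading directory
    echo "${base%.fastq}"  # drop the .fastq suffix
}

# Usage, iterating over a glob instead of parsing `ls` output:
# for fastq_file in fastq_files/*.fastq; do
#     echo "Aligning $(sample_name "$fastq_file")"
# done
```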

In [22]:
ls alignment | grep bam

exe_non_tumor_1.Aligned.sortedByCoord.out.bam
exe_non_tumor_2.Aligned.sortedByCoord.out.bam
exe_tumor_1.Aligned.sortedByCoord.out.bam
exe_tumor_2.Aligned.sortedByCoord.out.bam
non_exe_non_tumor_1.Aligned.sortedByCoord.out.bam
non_exe_non_tumor_2.Aligned.sortedByCoord.out.bam
non_exe_tumor_1.Aligned.sortedByCoord.out.bam
non_exe_tumor_2.Aligned.sortedByCoord.out.bam


In [31]:
samtools view alignment/exe_non_tumor_1.Aligned.sortedByCoord.out.bam | grep -v "^@" | head -n 1

SRR30054426.4426108	272	1	3016932	0	76M	*	0	0	AATGTATTTTATATTATTTGTGACTATTGAGAAGGGTGTTGTTTCCCTAATTTCTTTCTCAGCCTGTTTATCCTTT	EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA	NH:i:8	HI:i:6	AS:i:74	nM:i:0
grep: write error: Broken pipe
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1


In [25]:
cat alignment/exe_non_tumor_1.Log.final.out

                                 Started job on |	Nov 11 20:10:59
                             Started mapping on |	Nov 11 20:12:43
                                    Finished on |	Nov 11 20:13:42
       Mapping speed, Million of reads per hour |	358.54

                          Number of input reads |	5876033
                      Average input read length |	76
                                    UNIQUE READS:
                   Uniquely mapped reads number |	5043247
                        Uniquely mapped reads % |	85.83%
                          Average mapped length |	75.79
                       Number of splices: Total |	1347698
            Number of splices: Annotated (sjdb) |	1340240
                       Number of splices: GT/AG |	1337723
                       Number of splices: GC/AG |	8238
                       Number of splices: AT/AC |	684
               Number of splices: Non-canonical |	1053
                      Mismatch rate per base, % |	0.18%
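
To compare mapping rates across all samples without opening each log, the "Uniquely mapped reads %" line can be pulled out of every `Log.final.out`. This is a minimal sketch that assumes the `label | value` layout shown above; `unique_mapped_pct` is a hypothetical helper name.

```shell
# Hypothetical helper: print "log file <TAB> uniquely mapped %" for each
# STAR Log.final.out file given as an argument.
unique_mapped_pct() {
    for log in "$@"; do
        awk -F '|' -v f="$log" '
            /Uniquely mapped reads %/ {
                gsub(/[ \t]/, "", $2)   # strip padding around the value
                print f "\t" $2
            }
        ' "$log"
    done
}

# Usage, assuming the alignment/ folder created above:
# unique_mapped_pct alignment/*.Log.final.out
```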
                  

## Step 4: visualize data in a genome browser (IGV)

In [46]:
# Indexing bam files
for bam_file in alignment/*.bam; do
    samtools index -b "$bam_file"
done

[E::hts_hopen] Failed to open file alignment/exe_non_tumor_2.Aligned.sortedByCoord.out.bam
[E::hts_open_format] Failed to open file alignment/exe_non_tumor_2.Aligned.sortedByCoord.out.bam
samtools index: failed to open "alignment/exe_non_tumor_2.Aligned.sortedByCoord.out.bam": Exec format error

Indexing failed only for exe_non_tumor_2, whose BAM file could not be opened. The next cells confirm the problem with samtools view, then realign and re-index that sample.


In [48]:
samtools view alignment/exe_non_tumor_2.Aligned.sortedByCoord.out.bam | head

[E::hts_hopen] Failed to open file alignment/exe_non_tumor_2.Aligned.sortedByCoord.out.bam
[E::hts_open_format] Failed to open file alignment/exe_non_tumor_2.Aligned.sortedByCoord.out.bam
samtools view: failed to open "alignment/exe_non_tumor_2.Aligned.sortedByCoord.out.bam" for reading: Exec format error
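
An unreadable BAM file like this one is often truncated or otherwise damaged. `samtools quickcheck` is the usual tool for catching that; the sketch below shows a dependency-free approximation that only tests whether the file ends with the 28-byte BGZF end-of-file block defined in the SAM/BAM specification. `bam_has_eof_marker` is a hypothetical helper, and passing this check does not prove the rest of the file is intact.

```shell
# Hypothetical helper: succeed if the file ends with the BGZF EOF block.
bam_has_eof_marker() {
    # 28-byte empty BGZF block that terminates a complete BAM file.
    expected=$(printf '%s' '1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00' | tr -d ' ')
    # Hex dump of the last 28 bytes, with whitespace removed.
    actual=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \t\n')
    [ "$actual" = "$expected" ]
}

# Usage, assuming the alignment/ folder created above:
# for bam in alignment/*.bam; do
#     bam_has_eof_marker "$bam" || echo "possibly truncated: $bam"
# done
```

Running a check like this right after alignment would flag a damaged file before indexing or counting is attempted.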


In [59]:
STAR \
        --genomeDir /mnt/storage/db/star/mm10/genomedir/ \
        --runThreadN 20 \
        --readFilesIn fastq_files/exe_non_tumor_2.fastq \
        --outFileNamePrefix alignment/exe_non_tumor_2. \
        --outSAMtype BAM SortedByCoordinate 

Nov 13 10:37:43 ..... started STAR run
Nov 13 10:37:43 ..... loading genome
Nov 13 10:38:44 ..... started mapping
Nov 13 10:39:03 ..... started sorting BAM
Nov 13 10:39:08 ..... finished successfully


In [62]:
samtools index -b alignment/exe_non_tumor_2.Aligned.sortedByCoord.out.bam

In [63]:
ls alignment | grep "^exe_non_tumor_2"

exe_non_tumor_2.Aligned.sortedByCoord.out.bam
exe_non_tumor_2.Aligned.sortedByCoord.out.bam.bai
exe_non_tumor_2.Log.final.out
exe_non_tumor_2.Log.out
exe_non_tumor_2.Log.progress.out
exe_non_tumor_2.SJ.out.tab


In [49]:
samtools idxstats alignment/exe_non_tumor_1.Aligned.sortedByCoord.out.bam
# This command outputs a tab-delimited table with the following columns:
# Col 1 chromosome name
# Col 2 length of the chromosome
# Col 3 number of reads mapped to this chromosome
# Col 4 number of unmapped read fragments

1	195471971	571914	0
2	182113224	394681	0
3	160039680	320455	0
4	156508116	604489	0
5	151834684	605114	0
6	149736546	286284	0
7	145441459	451497	0
8	129401213	249141	0
9	124595110	381555	0
10	130694993	232759	0
11	122082543	333186	0
12	120129022	433726	0
13	120421639	174391	0
14	124902244	214372	0
15	104043685	202405	0
16	98207768	155475	0
17	94987271	281247	0
18	90702639	147886	0
19	61431566	248054	0
MT	16299	512501	0
X	171031299	107010	0
Y	91744698	4671	0
*	0	0	0
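
The idxstats table shows 512501 reads on MT, a large number for a 16 kb chromosome, so it can be useful to express it as a share of all mapped reads. The sketch below does this with awk; `mt_read_pct` is a hypothetical helper, and it assumes the mitochondrial chromosome is named `MT` as in this reference.

```shell
# Hypothetical helper: read `samtools idxstats` output on stdin and print
# the percentage of mapped reads (column 3) that fall on chromosome MT.
mt_read_pct() {
    awk -F '\t' '
        { total += $3 }
        $1 == "MT" { mt = $3 }
        END { if (total > 0) printf "%.2f\n", 100 * mt / total }
    '
}

# Usage, assuming the indexed BAM files from this step:
# samtools idxstats alignment/exe_non_tumor_1.Aligned.sortedByCoord.out.bam | mt_read_pct
```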


### IGV analysis of Per1 Gene 

![Screenshot%202024-11-13%20at%202.23.42%E2%80%AFPM.png](attachment:Screenshot%202024-11-13%20at%202.23.42%E2%80%AFPM.png)

#### Observations:

The two groups (Exercise vs. Non-Exercise) show visible differences in coverage of the Per1 gene (circadian rhythm).

## Step 5: count number of reads overlapping each gene.

Before we can start with differential expression analysis, we need to count, for each sample separately, how many reads overlap each gene. This will generate a count matrix that we can use to identify genes whose expression differs between our conditions.

For this we will use the program:

    featureCounts
    
and we will need a gtf file.

A GTF file contains genomic features (such as genes, transcripts, and exons) along with their genomic locations and other metadata. It is a tab-separated file with the following fields:

1. seqname: name of the chromosome (or scaffold)
2. source: name of the program that generated this feature or the data source
3. feature: feature type name, e.g. gene, transcript, exon
4. start: genomic start position of the feature
5. end: genomic end position of the feature
6. score: a floating point value of the score of the feature (optional)
7. strand: whether the feature is on the forward (+) or reverse (-) strand
8. frame: one of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on
9. attribute: a semicolon (;) separated list of tag-value pairs containing additional information

see https://www.ensembl.org/info/website/upload/gff.html for more information.
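
To make the field layout concrete, the sketch below pulls the `gene_name` tag out of the attribute column for every row whose feature type is `gene`. It assumes Ensembl-style attributes of the form `gene_name "VALUE";` and is only an illustration; `gtf_gene_names` is a hypothetical helper.

```shell
# Hypothetical helper: print the gene_name of every "gene" row in a GTF
# file (read from the given files, or from stdin when none are given).
gtf_gene_names() {
    awk -F '\t' '
        /^#/ { next }                       # skip header comments
        $3 == "gene" && match($9, /gene_name "[^"]+"/) {
            # The match covers: gene_name "VALUE"
            # Skip the 11-character prefix and the closing quote.
            print substr($9, RSTART + 11, RLENGTH - 12)
        }
    ' "$@"
}

# Usage, assuming a GTF file in the working directory:
# gtf_gene_names Mus_musculus.GRCm38.90.gtf | head
```

These are the names featureCounts reports when it is run with `-g gene_name`.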

#### Retrieving the mouse GTF file

In [64]:
ln -sf /mnt/storage/db/star/mm10/gtfFile/Mus_musculus.GRCm38.90.gtf .

In [65]:
head Mus_musculus.GRCm38.90.gtf

#!genome-build GRCm38.p5
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.7
#!genebuild-last-updated 2017-06
1	havana	gene	3073253	3074322	.	+	.	gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC";
1	havana	transcript	3073253	3074322	.	+	.	gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; transcript_name "4933401J01Rik-201"; transcript_source "havana"; transcript_biotype "TEC"; tag "basic"; transcript_support_level "NA";
1	havana	exon	3073253	3074322	.	+	.	gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; exon_number "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; transcript_name "4933401J01Rik-201"; transcript_source "havana"; transcript_biotype "TEC"; exon_id "ENSMUSE000013

#### Creating dir

In [66]:
mkdir counts

In [67]:
#-Q 10 only count reads with a minimum mapping quality of 10
#-g gene_name use gene_name as feature names (you can also use gene_id, although this is less human readable)
#-a Mus_musculus.GRCm38.90.gtf specify the annotation file
#-o counts/samples.counts specify the output file
#alignment/*.bam count for all the bam files in the alignment folder
featureCounts \
    -Q 10 \
    -g gene_name \
    -a Mus_musculus.GRCm38.90.gtf \
    -o counts/samples.counts \
    alignment/*.bam


        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
	  v1.6.0

||                                                                            ||
||             Input files : 8 BAM files                                      ||
||                           S alignment/exe_non_tumor_1.Aligned.sortedBy ... ||
||                           S alignment/exe_non_tumor_2.Aligned.sortedBy ... ||
||                           S alignment/exe_tumor_1.Aligned.sortedByCoor ... ||
||                           S alignment/exe_tumor_2.Aligned.sortedByCoor ... ||
||                           S alignment/no

||    Total reads : 6868233                                                   ||
||    Successfully assigned reads : 4497785 (65.5%)                           ||
||    Running time : 0.24 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
|| Summary of counting results can be found in file "counts/samples.counts.s  ||
|| ummary"                                                                    ||
||                                                                            ||



In [68]:
ls counts

samples.counts  samples.counts.summary


In [70]:
cat counts/samples.counts.summary

Status	alignment/exe_non_tumor_1.Aligned.sortedByCoord.out.bam	alignment/exe_non_tumor_2.Aligned.sortedByCoord.out.bam	alignment/exe_tumor_1.Aligned.sortedByCoord.out.bam	alignment/exe_tumor_2.Aligned.sortedByCoord.out.bam	alignment/non_exe_non_tumor_1.Aligned.sortedByCoord.out.bam	alignment/non_exe_non_tumor_2.Aligned.sortedByCoord.out.bam	alignment/non_exe_tumor_1.Aligned.sortedByCoord.out.bam	alignment/non_exe_tumor_2.Aligned.sortedByCoord.out.bam
Assigned	4436155	4582770	4658401	4520283	4336783	4298848	4492274	4497785
Unassigned_Unmapped	0	0	0	0	0	0	0	0
Unassigned_MappingQuality	1869566	2117605	1881575	1673341	1654263	1644659	1717756	1568373
Unassigned_Chimera	0	0	0	0	0	0	0	0
Unassigned_FragmentLength	0	0	0	0	0	0	0	0
Unassigned_Duplicate	0	0	0	0	0	0	0	0
Unassigned_MultiMapping	0	0	0	0	0	0	0	0
Unassigned_Secondary	0	0	0	0	0	0	0	0
Unassigned_Nonjunction	0	0	0	0	0	0	0	0
Unassigned_NoFeatures	361641	396036	378679	328869	356769	353075	361430	324356
Unassigned_Overlapping_Length	0	0	0	0	
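
The summary table is easier to compare across samples when the Assigned row is expressed as a percentage of all reads in each column. The sketch below computes that with awk, assuming the tab-separated layout shown above (header row of BAM paths, one row per status); `assigned_pct` is a hypothetical helper name.

```shell
# Hypothetical helper: for a featureCounts .summary file, print
# "sample <TAB> percent assigned" for every sample column.
assigned_pct() {
    awk -F '\t' '
        NR == 1 { n = NF; for (i = 2; i <= n; i++) name[i] = $i; next }
        { for (i = 2; i <= n; i++) total[i] += $i }       # sum all statuses
        $1 == "Assigned" { for (i = 2; i <= n; i++) assigned[i] = $i }
        END {
            for (i = 2; i <= n; i++)
                if (total[i] > 0)
                    printf "%s\t%.1f\n", name[i], 100 * assigned[i] / total[i]
        }
    ' "$1"
}

# Usage, assuming the counts/ folder created above:
# assigned_pct counts/samples.counts.summary
```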

## ------------------------ The End Part 1 ------------------------