# Download fastq files from the web and run fastqc to look at the quality of the sequencing run

**Step 1: Download dataset**


First we will download our raw sequencing data. We will use the [ENCODE](https://www.encodeproject.org/) database as our main source because its data are openly available and held to thorough quality standards. It hosts many different experiments and datasets, and you can download both raw and fully processed data.

For our learning purposes, we will download the raw data (fastq). We want a biologically relevant dataset that we know will show noticeable changes in gene expression. So, let's take a look at RNA-seq data from mouse dendritic cells that have been activated *in vitro* with the endotoxin lipopolysaccharide (LPS) [here](https://www.encodeproject.org/experiments/ENCSR085TSA/), as well as control cells [here](https://www.encodeproject.org/experiments/ENCSR945SYV/). This is a relatively small dataset that is easy to work with and should still show clear differences between the two conditions.

Let's find the fastq files now. Go down to the "Files" section and click on the "File details" tab. Under "Raw sequencing data" we can find our fastq files. When you have found the link to a fastq file, right click on it and select "Copy Link Address". ENCODE also has processed data files that have already been aligned (sometimes to different genome builds, including mm9) and quantified. Because we want to run the entire pipeline ourselves, we only want the raw reads stored in fastq format.

Then on TSCC, move into the directory where you would like your data to end up, and paste the link you copied after the "wget" command. (Remember, this is what we did when we downloaded Anaconda.) Keep in mind that this data is paired-end, so each sample has two read files (R1 and R2). You will therefore need to download two files per sample.

    cd ~/raw_data/
    
Let's make a directory in raw_data specifically for the raw data for this project. 

    mkdir ~/raw_data/mouse_LPS/
    
Then move into that directory before running wget. REMEMBER TO USE TAB COMPLETION TO EASILY MOVE BETWEEN DIRECTORIES.

    cd ~/raw_data/mouse_LPS/

    wget https://www.encodeproject.org/files/ENCFF178GZL/@@download/ENCFF178GZL.fastq.gz
    
    wget https://www.encodeproject.org/files/ENCFF925PIZ/@@download/ENCFF925PIZ.fastq.gz 
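
If you need more than a couple of files, the same download can be written as a loop. The following is a minimal sketch, not part of the original workflow: the `encode_url` helper and the `DRY_RUN` switch are hypothetical names introduced here for illustration, and the URL pattern is simply copied from the two wget commands above. By default it only prints the commands so you can check them first.

```shell
# Accession IDs taken from the two wget commands above (R1 and R2 of one sample).
accessions="ENCFF178GZL ENCFF925PIZ"

# Hypothetical helper: build the download URL for one ENCODE fastq accession,
# following the pattern of the links used above.
encode_url() {
    echo "https://www.encodeproject.org/files/$1/@@download/$1.fastq.gz"
}

# DRY_RUN=1 (the default in this sketch) only prints the commands.
# Run with DRY_RUN=0 to actually download.
DRY_RUN="${DRY_RUN:-1}"

for acc in $accessions; do
    url="$(encode_url "$acc")"
    if [ "$DRY_RUN" = "1" ]; then
        echo "wget -c $url"
    else
        wget -c "$url"    # -c resumes a partially downloaded file
    fi
done
```

The `-c` option is useful for large fastq files because an interrupted download can be resumed instead of restarted.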
    
*NOTE* - to avoid a backlog on the head node with all of us downloading the same datasets, please make soft links to the files that I have already downloaded and stored in our shared folder for later use:

    ~/bms_2018/rna_seq/raw_data/
    
    
Here you will find fastq files named:
    
    mouse_0hr_rep1_R1.fastq.gz
    mouse_0hr_rep1_R2.fastq.gz
    mouse_0hr_rep2_R1.fastq.gz
    mouse_0hr_rep2_R2.fastq.gz
    
    
    mouse_4hr_rep1_R1.fastq.gz    
    mouse_4hr_rep1_R2.fastq.gz
    mouse_4hr_rep2_R1.fastq.gz
    mouse_4hr_rep2_R2.fastq.gz
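
The soft links mentioned above can be made with `ln -s`. This is a sketch under the assumption that the shared folder contains the files listed here and that you want the links in `~/raw_data/mouse_LPS/`; `link_fastqs` is a hypothetical helper name used only for this example.

```shell
# link_fastqs SRC DEST: symlink every mouse_*.fastq.gz in SRC into DEST.
link_fastqs() {
    src="$1"
    dest="$2"
    mkdir -p "$dest"
    for f in "$src"/mouse_*.fastq.gz; do
        [ -e "$f" ] || continue    # skip if the pattern matched nothing
        ln -sf "$f" "$dest/"       # -s: symbolic link, -f: replace an existing link
    done
}

link_fastqs ~/bms_2018/rna_seq/raw_data ~/raw_data/mouse_LPS

# Links show up with an arrow pointing at the original file.
ls -l ~/raw_data/mouse_LPS
```

A soft link takes almost no disk space, and programs such as fastqc can read it as if it were the original file.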
      


***Note*** If you do plan on downloading datasets, be sure to use mv to rename your files to something more meaningful. It is too difficult to go through the remainder of the pipeline steps with ENCODE accession codes as file names:

    mv ENCFF178GZL.fastq.gz mouse_0hr_rep1_R1.fastq.gz
    mv ENCFF925PIZ.fastq.gz mouse_0hr_rep1_R2.fastq.gz

**Step 2: Run fastqc to check the sequencing quality of the reads that you downloaded. Remember that we installed fastqc with:**

    conda install -c bioconda fastqc
    
You can see that it has installed properly with:

    which fastqc
    
The output should be something like:

    ~/anaconda2/bin/fastqc
    
*Q. Why is it finding the program in this location?*
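
One way to explore this question is to look at the `PATH` variable, which lists the directories the shell searches for programs, in order. A small sketch using only standard shell tools (the variable name `path_dirs` is just for this example):

```shell
# Print each directory on PATH on its own line, in search order.
path_dirs="$(echo "$PATH" | tr ':' '\n')"
echo "$path_dirs"

# Report where fastqc would be found, or say so if it is not available in this shell.
command -v fastqc || echo "fastqc is not on PATH in this shell"
```

Compare the location reported for fastqc with the directories near the top of the list.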

Let's make a directory in projects for our new mouse_LPS project, and make another directory within that folder for the results of our fastqc run.


    mkdir ~/projects/mouse_LPS/
    mkdir ~/projects/mouse_LPS/fastqc/
    
For future reference, you could alternatively use the -p flag to make the fastqc directory and its parent directory (mouse_LPS/) in one step:

    mkdir -p ~/projects/mouse_LPS/fastqc/
    

Let's run fastqc to check the quality of your sequencing results. Remember to specify the *full path* of where your datasets are stored and where you want the processed data to end up. You will have to do this one file at a time. REMEMBER TO USE TAB COMPLETION TO AVOID TYPOS! The -o argument is used to specify the location of the output files.

    fastqc ~/raw_data/mouse_LPS/mouse_0hr_rep1_R1.fastq.gz -o ~/projects/mouse_LPS/fastqc/
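
Instead of typing the command once per file, you could wrap it in a loop. This is a minimal sketch that assumes your fastq files sit in `~/raw_data/mouse_LPS/` (either downloaded or soft linked); the variable names `indir` and `outdir` are only for this example.

```shell
indir=~/raw_data/mouse_LPS
outdir=~/projects/mouse_LPS/fastqc
mkdir -p "$outdir"

# Run fastqc on every fastq.gz file in the input directory, one file at a time.
for fq in "$indir"/*.fastq.gz; do
    [ -e "$fq" ] || continue    # nothing to do if no fastq files are present
    fastqc "$fq" -o "$outdir"
done
```

Each file still takes a few minutes, so the loop saves typing rather than run time.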

By the way, how do we find out more about the fastqc command? Try the following:

    fastqc --help

If you could not download the files yourself, run fastqc on a file from the shared folder instead:
    
    fastqc ~/bms_2018/rna_seq/raw_data/mouse_0hr_rep1_R1.fastq.gz -o ~/projects/mouse_LPS/fastqc/

This will take a few minutes to run. When it has finished, you should have a .zip file and a .html file in your fastqc output directory.
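
Before opening the html report, you can get a quick text overview from the .zip file. The sketch below assumes the usual FastQC layout, where each archive contains a tab-separated `summary.txt` with a PASS, WARN, or FAIL status per module, and it assumes `unzip` is available on your system. `flag_problems` is a hypothetical helper name.

```shell
# Read FastQC summary text on stdin and print every module that is not PASS.
flag_problems() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 " (" $3 ")" }'
}

# Pull summary.txt out of each FastQC archive without unpacking it to disk.
for zip in ~/projects/mouse_LPS/fastqc/*_fastqc.zip; do
    [ -e "$zip" ] || continue    # skip if there are no archives yet
    unzip -p "$zip" '*/summary.txt' | flag_problems
done
```

A WARN or FAIL is a prompt to look at that module in the html report, not necessarily a sign that the data are unusable.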


**Step 3: View your output files in a web browser:**

We'll use a simple method called secure copy (SCP) to transfer the html file to our local machine. For alternative means of transferring files, try looking into the [SSHFS notebook](https://github.com/ryanmarina/BMS_bioinformatics_bootcamp_2018/blob/master/tutorials/SSHFS_installs.ipynb) in your own free time.

For SCP, on our LOCAL MACHINE, we will type the following command to transfer files from our TSCC account to our Desktop in order to access with a web browser:
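
A sketch of what such a command can look like is given below. The username is a placeholder and the login host is an assumption (use whatever you normally type after `ssh` to reach TSCC); the remote path follows the fastqc output directory used above and is written relative to your TSCC home directory.

```shell
# Run this on your LOCAL machine, not on TSCC.
user="your_username"           # placeholder: replace with your TSCC login name
host="tscc-login.sdsc.edu"     # assumed login host: use the one from your ssh command
remote_glob='projects/mouse_LPS/fastqc/*.html'   # relative to your TSCC home directory

# Keep the source quoted so the wildcard is not expanded on the local machine.
src="$user@$host:$remote_glob"

if [ "$user" = "your_username" ]; then
    echo "Edit 'user' first. Would run: scp \"$src\" ~/Desktop/"
else
    scp "$src" ~/Desktop/
fi
```

Once the copy finishes, open the html file on your Desktop with any web browser.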



    