<div class="alert alert-warning">
    <strong>Analyst Note:</strong><br />
    Fill in the human-readable name of your project and the type of your data, such as:
    
   > # Dr. Doe Human Patient Time-Series RNASeq

</div>


# Data Download And Preparation

<div class="alert alert-warning">
    <strong>Analyst Note:</strong><br />
    Fill in the author attributions for your analysis, such as:
    
   > * Guorong Xu, CCBB (g1xu@ucsd.edu)
</div>


## Table of Contents
* [Introduction](#Introduction)
* [Parameters Input](#Parameters-Input)
* [Files Download](#Files-Download)
* [Lane Merge Commands Generation](#Lane-Merge-Commands-Generation)
* [Lane Merge and Files Upload](#Lane-Merge-and-Files-Upload)


[Table of Contents](#Table-of-Contents)

## Introduction

This notebook provides steps to download either RNASeq or miRNASeq fastq data onto a machine or virtual instance and, if necessary, combine fastq files across lanes to produce inputs suitable for primary analysis.

<div class="alert alert-info">

Before running this notebook, ensure:

* You are running it on a Linux or macOS platform
* The fastq files being prepped are named according to the convention:

    `<somename>_S<#+>_L<###>_R<1 or 2>_001.fastq.gz` 
    
    For example: 
    `NA0007887693_S28_L007_R1_001.fastq.gz`
    
    (This is the naming convention used for almost all sequencing data produced by the UCSD IGM sequencing core. Note that if the suffix is NOT ALWAYS \_001, the code below will need to be extended to account for this.)

</div>
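
As a rough illustration of the naming convention above, the following sketch flags file names that do not follow it. The pattern and the helper name `find_nonconforming_names` are assumptions made for this example and are not part of the notebook's own code.

```python
import re

# Assumed pattern mirroring the stated convention:
# <somename>_S<#+>_L<###>_R<1 or 2>_001.fastq.gz
FASTQ_NAME_PATTERN = re.compile(
    r"^(?P<name>.+)_S(?P<sample_num>\d+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_001\.fastq\.gz$"
)


def find_nonconforming_names(file_names):
    """Return the file names that do not match the expected convention."""
    return [x for x in file_names if FASTQ_NAME_PATTERN.match(x) is None]


example_names = [
    "NA0007887693_S28_L007_R1_001.fastq.gz",
    "NA0007887693_S28_L007_R1_002.fastq.gz",  # suffix is not _001
]
# only the second name is reported
print(find_nonconforming_names(example_names))
```

An empty list suggests the file names can be handled by the cells below without changes.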

[Table of Contents](#Table-of-Contents)

## Parameters Input


In [1]:
g_min_lane_num = 4
g_max_lane_num = 8
g_num_ends = 1 # only acceptable choices are 1 (single-end) or 2 (paired-end)
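
The parameter values can be sanity-checked before anything else runs. The sketch below is one possible check; the function name `validate_parameters` is hypothetical, and the lower bound of 1 assumes that lanes are numbered starting from 1.

```python
def validate_parameters(min_lane_num, max_lane_num, num_ends):
    """Raise ValueError if the parameter values are inconsistent."""
    # assumption: lane numbers start at 1
    if not 1 <= min_lane_num <= max_lane_num:
        raise ValueError("lane numbers must satisfy 1 <= min <= max")
    # the lane-handling functions later in this notebook accept
    # only single-digit lane numbers
    if max_lane_num > 9:
        raise ValueError("currently lane numbers must be single-digit")
    if num_ends not in (1, 2):
        raise ValueError("number of ends must be 1 (single-end) or 2 (paired-end)")


# the example values above pass silently
validate_parameters(4, 8, 1)
```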

[Table of Contents](#Table-of-Contents)

## Files Download

<div class="alert alert-info">
        
**NOTE** that the `aws` commands below will work only if the `aws` command line tool has been configured with AWS credentials on the machine being used. (If the machine is an AWS cirrus-ngs node that was created using the BasicCFNClusterSetup.ipynb notebook, this will have been done automatically.)
        
</div>
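
One way to test whether the `aws` tool is usable is to run it on the machine in question. The sketch below is a minimal check under the assumption that running `aws sts get-caller-identity` is an acceptable credential test in your environment; the function name `aws_cli_is_configured` is hypothetical.

```python
import shutil
import subprocess


def aws_cli_is_configured():
    """Return True if the aws CLI is on the PATH and can authenticate."""
    if shutil.which("aws") is None:
        return False
    try:
        # this command succeeds only when valid credentials are available
        completed = subprocess.run(
            ["aws", "sts", "get-caller-identity"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return completed.returncode == 0


print(aws_cli_is_configured())
```

The check must run on the machine that will execute the `aws` commands, which may not be the machine running this notebook.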

    # if the files are being downloaded to a remote machine,
    # ssh into that instance; for example,
    # ssh -i ~/Keys/abirmingham_oregon.pem ec2-user@52.13.244.188
    ssh -i <key_file_path> <user-name>@<aws_ip_address>
    
    # check how much free space is available
    df -h
    
    # move to a drive with adequate space for the data
    # being downloaded; if using a cirrus-ngs cluster
    # node, this will be the /shared dir, so for example,
    # cd /shared
    cd <spacious_dir>
    
    # make a temporary directory to hold the data
    # being downloaded and move into it
    mkdir data
    cd data
    
    # use wget to download the fastq.gz files from
    # a password-protected ftp server, for example:
    # wget --user=modigliano --password='HWq5ZlaZ' ftp://igm-storage1.ucsd.edu/190319_K00180_0772_AH2Y27BBXY_SR75_Combo/NA000*.gz
    wget --user=<username> --password='<password>' <ftp url with directory>/*.gz
    
    # push all of these fastq.gz files up to CCBB's
    # S3 bucket for input data for archiving/future use
    # (If you would like to see the size of the directory before
    # pushing it up, run du -sh)
    # The s3 bucket url will be of the format
    # s3://ucsd-ccb-data-upload-1/20190205_hepokoski_mtDNA-in-acute-kidney-injury/20190205_mouse_rnaseq/
    # (note that the root is the input bucket, the first subdirectory is
    # the project name, and the second subdirectory is the name of this
    # particular dataset, prefixed with the date it was downloaded).
    # It is ok if the first and second subdirectories don't exist
    # yet (as long as the input bucket does) because the S3 command
    # will make them.  For example,
    # aws s3 cp --recursive . s3://ucsd-ccb-data-upload-1/20181202_califano_luana-dos-santos_squamous-cell-carcinoma/20190327_human_rnaseq/
    aws s3 cp --recursive . <s3 bucket>/ 
    
    # get the number of files that were downloaded
    ls | wc -l
    
    # get all the fastq.gz file names that were downloaded,
    # one file name per line
    ls | cat

<div class="alert alert-warning">
<h4>Analyst note:</h4>
The values in the cell below are example settings, and <strong>MUST</strong> be replaced with appropriate values for your data.
</div>

In [2]:
g_file_names_str = """218_S48_L004_R1_001.fastq.gz
218_S48_L005_R1_001.fastq.gz
218_S48_L006_R1_001.fastq.gz
218_S48_L007_R1_001.fastq.gz
218_S48_L008_R1_001.fastq.gz
549_S49_L004_R1_001.fastq.gz
549_S49_L005_R1_001.fastq.gz
549_S49_L006_R1_001.fastq.gz
549_S49_L007_R1_001.fastq.gz
549_S49_L008_R1_001.fastq.gz
NA0007886209_S40_L004_R1_001.fastq.gz
NA0007886209_S40_L005_R1_001.fastq.gz
NA0007886209_S40_L006_R1_001.fastq.gz
NA0007886209_S40_L007_R1_001.fastq.gz
NA0007886209_S40_L008_R1_001.fastq.gz
NA0007886614_S14_L004_R1_001.fastq.gz
NA0007886614_S14_L005_R1_001.fastq.gz
NA0007886614_S14_L006_R1_001.fastq.gz
NA0007886614_S14_L007_R1_001.fastq.gz
NA0007886614_S14_L008_R1_001.fastq.gz
NA0007886888_S15_L004_R1_001.fastq.gz
NA0007886888_S15_L005_R1_001.fastq.gz
NA0007886888_S15_L006_R1_001.fastq.gz
NA0007886888_S15_L007_R1_001.fastq.gz
NA0007886888_S15_L008_R1_001.fastq.gz
NA0007886925_S4_L004_R1_001.fastq.gz
NA0007886925_S4_L005_R1_001.fastq.gz
NA0007886925_S4_L006_R1_001.fastq.gz
NA0007886925_S4_L007_R1_001.fastq.gz
NA0007886925_S4_L008_R1_001.fastq.gz
NA0007887034_S3_L004_R1_001.fastq.gz
NA0007887034_S3_L005_R1_001.fastq.gz
NA0007887034_S3_L006_R1_001.fastq.gz
NA0007887034_S3_L007_R1_001.fastq.gz
NA0007887034_S3_L008_R1_001.fastq.gz
NA0007887056_S31_L004_R1_001.fastq.gz
NA0007887056_S31_L005_R1_001.fastq.gz
NA0007887056_S31_L006_R1_001.fastq.gz
NA0007887056_S31_L007_R1_001.fastq.gz
NA0007887056_S31_L008_R1_001.fastq.gz
NA0007887125_S1_L004_R1_001.fastq.gz
NA0007887125_S1_L005_R1_001.fastq.gz
NA0007887125_S1_L006_R1_001.fastq.gz
NA0007887125_S1_L007_R1_001.fastq.gz
NA0007887125_S1_L008_R1_001.fastq.gz
NA0007887143_S46_L004_R1_001.fastq.gz
NA0007887143_S46_L005_R1_001.fastq.gz
NA0007887143_S46_L006_R1_001.fastq.gz
NA0007887143_S46_L007_R1_001.fastq.gz
NA0007887143_S46_L008_R1_001.fastq.gz
NA0007887343_S7_L004_R1_001.fastq.gz
NA0007887343_S7_L005_R1_001.fastq.gz
NA0007887343_S7_L006_R1_001.fastq.gz
NA0007887343_S7_L007_R1_001.fastq.gz
NA0007887343_S7_L008_R1_001.fastq.gz
NA0007887346_S32_L004_R1_001.fastq.gz
NA0007887346_S32_L005_R1_001.fastq.gz
NA0007887346_S32_L006_R1_001.fastq.gz
NA0007887346_S32_L007_R1_001.fastq.gz
NA0007887346_S32_L008_R1_001.fastq.gz
NA0007887358_S17_L004_R1_001.fastq.gz
NA0007887358_S17_L005_R1_001.fastq.gz
NA0007887358_S17_L006_R1_001.fastq.gz
NA0007887358_S17_L007_R1_001.fastq.gz
NA0007887358_S17_L008_R1_001.fastq.gz
NA0007887442_S39_L004_R1_001.fastq.gz
NA0007887442_S39_L005_R1_001.fastq.gz
NA0007887442_S39_L006_R1_001.fastq.gz
NA0007887442_S39_L007_R1_001.fastq.gz
NA0007887442_S39_L008_R1_001.fastq.gz
NA0007887449_S21_L004_R1_001.fastq.gz
NA0007887449_S21_L005_R1_001.fastq.gz
NA0007887449_S21_L006_R1_001.fastq.gz
NA0007887449_S21_L007_R1_001.fastq.gz
NA0007887449_S21_L008_R1_001.fastq.gz
NA0007887672_S42_L004_R1_001.fastq.gz
NA0007887672_S42_L005_R1_001.fastq.gz
NA0007887672_S42_L006_R1_001.fastq.gz
NA0007887672_S42_L007_R1_001.fastq.gz
NA0007887672_S42_L008_R1_001.fastq.gz
NA0007887693_S28_L004_R1_001.fastq.gz
NA0007887693_S28_L005_R1_001.fastq.gz
NA0007887693_S28_L006_R1_001.fastq.gz
NA0007887693_S28_L007_R1_001.fastq.gz
NA0007887693_S28_L008_R1_001.fastq.gz
NA0007887705_S38_L004_R1_001.fastq.gz
NA0007887705_S38_L005_R1_001.fastq.gz
NA0007887705_S38_L006_R1_001.fastq.gz
NA0007887705_S38_L007_R1_001.fastq.gz
NA0007887705_S38_L008_R1_001.fastq.gz
NA0007887861_S29_L004_R1_001.fastq.gz
NA0007887861_S29_L005_R1_001.fastq.gz
NA0007887861_S29_L006_R1_001.fastq.gz
NA0007887861_S29_L007_R1_001.fastq.gz
NA0007887861_S29_L008_R1_001.fastq.gz
NA0007887878_S8_L004_R1_001.fastq.gz
NA0007887878_S8_L005_R1_001.fastq.gz
NA0007887878_S8_L006_R1_001.fastq.gz
NA0007887878_S8_L007_R1_001.fastq.gz
NA0007887878_S8_L008_R1_001.fastq.gz
NA0007888020_S9_L004_R1_001.fastq.gz
NA0007888020_S9_L005_R1_001.fastq.gz
NA0007888020_S9_L006_R1_001.fastq.gz
NA0007888020_S9_L007_R1_001.fastq.gz
NA0007888020_S9_L008_R1_001.fastq.gz
NA0007888035_S44_L004_R1_001.fastq.gz
NA0007888035_S44_L005_R1_001.fastq.gz
NA0007888035_S44_L006_R1_001.fastq.gz
NA0007888035_S44_L007_R1_001.fastq.gz
NA0007888035_S44_L008_R1_001.fastq.gz
NA0007888504_S35_L004_R1_001.fastq.gz
NA0007888504_S35_L005_R1_001.fastq.gz
NA0007888504_S35_L006_R1_001.fastq.gz
NA0007888504_S35_L007_R1_001.fastq.gz
NA0007888504_S35_L008_R1_001.fastq.gz
NA0007888559_S22_L004_R1_001.fastq.gz
NA0007888559_S22_L005_R1_001.fastq.gz
NA0007888559_S22_L006_R1_001.fastq.gz
NA0007888559_S22_L007_R1_001.fastq.gz
NA0007888559_S22_L008_R1_001.fastq.gz
NA0007888615_S43_L004_R1_001.fastq.gz
NA0007888615_S43_L005_R1_001.fastq.gz
NA0007888615_S43_L006_R1_001.fastq.gz
NA0007888615_S43_L007_R1_001.fastq.gz
NA0007888615_S43_L008_R1_001.fastq.gz
NA0007888634_S27_L004_R1_001.fastq.gz
NA0007888634_S27_L005_R1_001.fastq.gz
NA0007888634_S27_L006_R1_001.fastq.gz
NA0007888634_S27_L007_R1_001.fastq.gz
NA0007888634_S27_L008_R1_001.fastq.gz
NA0007888704_S30_L004_R1_001.fastq.gz
NA0007888704_S30_L005_R1_001.fastq.gz
NA0007888704_S30_L006_R1_001.fastq.gz
NA0007888704_S30_L007_R1_001.fastq.gz
NA0007888704_S30_L008_R1_001.fastq.gz
NA0007888760_S47_L004_R1_001.fastq.gz
NA0007888760_S47_L005_R1_001.fastq.gz
NA0007888760_S47_L006_R1_001.fastq.gz
NA0007888760_S47_L007_R1_001.fastq.gz
NA0007888760_S47_L008_R1_001.fastq.gz
NA0007888810_S12_L004_R1_001.fastq.gz
NA0007888810_S12_L005_R1_001.fastq.gz
NA0007888810_S12_L006_R1_001.fastq.gz
NA0007888810_S12_L007_R1_001.fastq.gz
NA0007888810_S12_L008_R1_001.fastq.gz
NA0007888878_S33_L004_R1_001.fastq.gz
NA0007888878_S33_L005_R1_001.fastq.gz
NA0007888878_S33_L006_R1_001.fastq.gz
NA0007888878_S33_L007_R1_001.fastq.gz
NA0007888878_S33_L008_R1_001.fastq.gz
NA0007889020_S36_L004_R1_001.fastq.gz
NA0007889020_S36_L005_R1_001.fastq.gz
NA0007889020_S36_L006_R1_001.fastq.gz
NA0007889020_S36_L007_R1_001.fastq.gz
NA0007889020_S36_L008_R1_001.fastq.gz
NA0007889211_S18_L004_R1_001.fastq.gz
NA0007889211_S18_L005_R1_001.fastq.gz
NA0007889211_S18_L006_R1_001.fastq.gz
NA0007889211_S18_L007_R1_001.fastq.gz
NA0007889211_S18_L008_R1_001.fastq.gz
NA0007889228_S11_L004_R1_001.fastq.gz
NA0007889228_S11_L005_R1_001.fastq.gz
NA0007889228_S11_L006_R1_001.fastq.gz
NA0007889228_S11_L007_R1_001.fastq.gz
NA0007889228_S11_L008_R1_001.fastq.gz
NA0007889307_S6_L004_R1_001.fastq.gz
NA0007889307_S6_L005_R1_001.fastq.gz
NA0007889307_S6_L006_R1_001.fastq.gz
NA0007889307_S6_L007_R1_001.fastq.gz
NA0007889307_S6_L008_R1_001.fastq.gz
NA0007889386_S5_L004_R1_001.fastq.gz
NA0007889386_S5_L005_R1_001.fastq.gz
NA0007889386_S5_L006_R1_001.fastq.gz
NA0007889386_S5_L007_R1_001.fastq.gz
NA0007889386_S5_L008_R1_001.fastq.gz
NA0007889390_S41_L004_R1_001.fastq.gz
NA0007889390_S41_L005_R1_001.fastq.gz
NA0007889390_S41_L006_R1_001.fastq.gz
NA0007889390_S41_L007_R1_001.fastq.gz
NA0007889390_S41_L008_R1_001.fastq.gz
NA0007889526_S19_L004_R1_001.fastq.gz
NA0007889526_S19_L005_R1_001.fastq.gz
NA0007889526_S19_L006_R1_001.fastq.gz
NA0007889526_S19_L007_R1_001.fastq.gz
NA0007889526_S19_L008_R1_001.fastq.gz
NA0007889536_S2_L004_R1_001.fastq.gz
NA0007889536_S2_L005_R1_001.fastq.gz
NA0007889536_S2_L006_R1_001.fastq.gz
NA0007889536_S2_L007_R1_001.fastq.gz
NA0007889536_S2_L008_R1_001.fastq.gz
NA0007889936_S24_L004_R1_001.fastq.gz
NA0007889936_S24_L005_R1_001.fastq.gz
NA0007889936_S24_L006_R1_001.fastq.gz
NA0007889936_S24_L007_R1_001.fastq.gz
NA0007889936_S24_L008_R1_001.fastq.gz
NA0007889968_S10_L004_R1_001.fastq.gz
NA0007889968_S10_L005_R1_001.fastq.gz
NA0007889968_S10_L006_R1_001.fastq.gz
NA0007889968_S10_L007_R1_001.fastq.gz
NA0007889968_S10_L008_R1_001.fastq.gz
NA0007889984_S34_L004_R1_001.fastq.gz
NA0007889984_S34_L005_R1_001.fastq.gz
NA0007889984_S34_L006_R1_001.fastq.gz
NA0007889984_S34_L007_R1_001.fastq.gz
NA0007889984_S34_L008_R1_001.fastq.gz
NA0007890164_S16_L004_R1_001.fastq.gz
NA0007890164_S16_L005_R1_001.fastq.gz
NA0007890164_S16_L006_R1_001.fastq.gz
NA0007890164_S16_L007_R1_001.fastq.gz
NA0007890164_S16_L008_R1_001.fastq.gz
NA0007890192_S26_L004_R1_001.fastq.gz
NA0007890192_S26_L005_R1_001.fastq.gz
NA0007890192_S26_L006_R1_001.fastq.gz
NA0007890192_S26_L007_R1_001.fastq.gz
NA0007890192_S26_L008_R1_001.fastq.gz
NA0007890194_S20_L004_R1_001.fastq.gz
NA0007890194_S20_L005_R1_001.fastq.gz
NA0007890194_S20_L006_R1_001.fastq.gz
NA0007890194_S20_L007_R1_001.fastq.gz
NA0007890194_S20_L008_R1_001.fastq.gz
NA0007890213_S23_L004_R1_001.fastq.gz
NA0007890213_S23_L005_R1_001.fastq.gz
NA0007890213_S23_L006_R1_001.fastq.gz
NA0007890213_S23_L007_R1_001.fastq.gz
NA0007890213_S23_L008_R1_001.fastq.gz
NA0007890364_S25_L004_R1_001.fastq.gz
NA0007890364_S25_L005_R1_001.fastq.gz
NA0007890364_S25_L006_R1_001.fastq.gz
NA0007890364_S25_L007_R1_001.fastq.gz
NA0007890364_S25_L008_R1_001.fastq.gz
NA0007890566_S45_L004_R1_001.fastq.gz
NA0007890566_S45_L005_R1_001.fastq.gz
NA0007890566_S45_L006_R1_001.fastq.gz
NA0007890566_S45_L007_R1_001.fastq.gz
NA0007890566_S45_L008_R1_001.fastq.gz
NA0007890738_S13_L004_R1_001.fastq.gz
NA0007890738_S13_L005_R1_001.fastq.gz
NA0007890738_S13_L006_R1_001.fastq.gz
NA0007890738_S13_L007_R1_001.fastq.gz
NA0007890738_S13_L008_R1_001.fastq.gz
NA0007890757_S37_L004_R1_001.fastq.gz
NA0007890757_S37_L005_R1_001.fastq.gz
NA0007890757_S37_L006_R1_001.fastq.gz
NA0007890757_S37_L007_R1_001.fastq.gz
NA0007890757_S37_L008_R1_001.fastq.gz"""

[Table of Contents](#Table-of-Contents)

## Lane Merge Commands Generation


In [3]:
g_file_names = g_file_names_str.split()
print(len(g_file_names))

245


<div class="alert alert-warning">
<h4>Analyst note:</h4>
Ensure that the number of file names above matches the number of downloaded files reported in the ssh session above.
</div>

In [4]:
def get_unique_sorted_names(input_names):
    result = sorted(set(input_names))
    print(len(result))
    return result

In [5]:
g_lane_names = [x.replace("_R1_001.fastq.gz", "").replace("_R2_001.fastq.gz", "") for x in g_file_names]
g_unique_lane_names = get_unique_sorted_names(g_lane_names)

245


<div class="alert alert-warning">
<h4>Analyst note:</h4>
Ensure that the number of unique lane names above matches the expected number: either the number of sequencing files, if the sequencing is single-end, or half the number of sequencing files, if the sequencing is paired-end.
</div>
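
The expected count described in the note can be computed rather than worked out by hand. This is a small sketch; `expected_num_lane_names` is a hypothetical helper, not part of the notebook.

```python
def expected_num_lane_names(num_files, num_ends):
    """Return the number of unique lane names expected for the inputs."""
    if num_ends not in (1, 2):
        raise ValueError("number of ends must be 1 or 2")
    if num_files % num_ends != 0:
        raise ValueError("paired-end data requires an even number of files")
    return num_files // num_ends


# 245 single-end files should yield 245 unique lane names
print(expected_num_lane_names(245, 1))
```

Comparing this value with `len(g_unique_lane_names)` would turn the manual check into an assertion.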

In [10]:
def check_lane_num(lane_num):
    # there is no technical reason why lane numbers must be single-digit,
    # but these functions have not been "taught" to handle multi-digit
    # lane numbers, so they are currently disallowed to prevent
    # garbage output.
    if lane_num > 9:
        raise ValueError("currently lane numbers must be single-digit")
        
def loop_over_lane_nums(a_name, min_lane_num, max_lane_num, per_lane_num_func):
    result = None
    # +1 is because range max is exclusive but max_lane_num is inclusive
    for i in range(min_lane_num, max_lane_num+1):
        result = per_lane_num_func(i, a_name, result)
    return result

def remove_lane_nums(lane_name, min_lane_num, max_lane_num):
    def remove_one_lane_num(i, lane_name, result):
        check_lane_num(i)
        curr_name = lane_name if result is None else result
        return curr_name.replace("_L00{0}".format(i), "")
    
    return loop_over_lane_nums(lane_name, min_lane_num, max_lane_num, remove_one_lane_num)
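
If multi-digit lane numbers ever need to be supported, one alternative to the loop above is a regular expression that strips the lane token directly. This is a sketch of a different technique, not the notebook's method. It assumes the lane token is always `_L` followed by exactly three digits and sits at the end of the lane name, which holds once the read suffix has been removed as in the earlier cell. The name `remove_lane_token` is hypothetical.

```python
import re

# assumed form of the lane token: _L plus three digits, at the end of the name
_TRAILING_LANE_TOKEN = re.compile(r"_L\d{3}$")


def remove_lane_token(lane_name):
    """Strip a trailing _L### token; names without one are returned unchanged."""
    return _TRAILING_LANE_TOKEN.sub("", lane_name)


print(remove_lane_token("218_S48_L004"))            # 218_S48
print(remove_lane_token("NA0007887693_S28_L012"))   # NA0007887693_S28
```

Unlike `remove_lane_nums`, this version does not need the minimum and maximum lane numbers, so it would not catch a lane number outside the expected range.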

In [11]:
g_sample_names = [remove_lane_nums(x, g_min_lane_num, g_max_lane_num) for x in g_unique_lane_names]
g_unique_sample_names = get_unique_sorted_names(g_sample_names)
g_unique_sample_names

49


['218_S48',
 '549_S49',
 'NA0007886209_S40',
 'NA0007886614_S14',
 'NA0007886888_S15',
 'NA0007886925_S4',
 'NA0007887034_S3',
 'NA0007887056_S31',
 'NA0007887125_S1',
 'NA0007887143_S46',
 'NA0007887343_S7',
 'NA0007887346_S32',
 'NA0007887358_S17',
 'NA0007887442_S39',
 'NA0007887449_S21',
 'NA0007887672_S42',
 'NA0007887693_S28',
 'NA0007887705_S38',
 'NA0007887861_S29',
 'NA0007887878_S8',
 'NA0007888020_S9',
 'NA0007888035_S44',
 'NA0007888504_S35',
 'NA0007888559_S22',
 'NA0007888615_S43',
 'NA0007888634_S27',
 'NA0007888704_S30',
 'NA0007888760_S47',
 'NA0007888810_S12',
 'NA0007888878_S33',
 'NA0007889020_S36',
 'NA0007889211_S18',
 'NA0007889228_S11',
 'NA0007889307_S6',
 'NA0007889386_S5',
 'NA0007889390_S41',
 'NA0007889526_S19',
 'NA0007889536_S2',
 'NA0007889936_S24',
 'NA0007889968_S10',
 'NA0007889984_S34',
 'NA0007890164_S16',
 'NA0007890192_S26',
 'NA0007890194_S20',
 'NA0007890213_S23',
 'NA0007890364_S25',
 'NA0007890566_S45',
 'NA0007890738_S13',
 'NA0007890757_S37']

<div class="alert alert-warning">
<h4>Analyst note:</h4>
Ensure that the number and names of samples above match the expected number and names provided by the customer in their metadata.
</div>
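
The comparison against the customer's metadata can be partly automated with set differences. The sketch below assumes that the metadata lists sample names without the trailing `_S<number>` token; if that is not the case for your data, drop the stripping step. Both function names and the example metadata names are hypothetical.

```python
import re


def strip_sample_sheet_suffix(sample_name):
    """Remove a trailing _S<number> token, e.g. '218_S48' becomes '218'."""
    return re.sub(r"_S\d+$", "", sample_name)


def compare_to_metadata(found_sample_names, metadata_sample_names):
    """Return (names missing from the data, names absent from the metadata)."""
    found = {strip_sample_sheet_suffix(x) for x in found_sample_names}
    expected = set(metadata_sample_names)
    return sorted(expected - found), sorted(found - expected)


# hypothetical metadata names, for illustration only
missing, unexpected = compare_to_metadata(["218_S48", "549_S49"], ["218", "550"])
print(missing)     # ['550']
print(unexpected)  # ['549']
```

Two empty lists indicate that the sample names agree with the metadata.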

In [14]:
def get_lane_file_str(i, sample_name, lane_files_str, read_num):
    check_lane_num(i)
    base_str = "" if lane_files_str is None else lane_files_str
    return "{0}{1}_L00{2}_R{3}_001.fastq.gz ".format(base_str, sample_name, i, read_num)
    
def make_cat_cmd(sample_name, min_lane_num, max_lane_num, read_num):
    cat_script_template = """cat {0}> {1}_R{2}_001.fastq.gz
rm {1}_L00*_R{2}_001.fastq.gz"""
    
    def per_lane_func(i, sample_name, result):
        return get_lane_file_str(i, sample_name, result, read_num)
    
    lane_files_str = loop_over_lane_nums(sample_name, min_lane_num, max_lane_num, per_lane_func)
    return cat_script_template.format(lane_files_str, sample_name, read_num)
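
The generated `cat` commands name one file per lane in the configured range, so a sample that lacks a lane file would cause its command to fail. A check along the following lines could be run first. The helper `find_missing_lane_files` is hypothetical and simply rebuilds the expected file names from the naming convention described in the introduction.

```python
def find_missing_lane_files(file_names, sample_names, min_lane_num,
                            max_lane_num, num_ends):
    """Return expected per-lane file names that are not in file_names."""
    present = set(file_names)
    missing = []
    for sample_name in sample_names:
        for read_num in range(1, num_ends + 1):
            for lane_num in range(min_lane_num, max_lane_num + 1):
                # lane numbers are zero-padded to three digits
                expected = "{0}_L{1:03d}_R{2}_001.fastq.gz".format(
                    sample_name, lane_num, read_num)
                if expected not in present:
                    missing.append(expected)
    return missing


example_files = ["218_S48_L004_R1_001.fastq.gz", "218_S48_L005_R1_001.fastq.gz"]
# lane 6 is absent from the example input
print(find_missing_lane_files(example_files, ["218_S48"], 4, 6, 1))
```

With the notebook's own variables, the call would take `g_file_names`, `g_unique_sample_names`, `g_min_lane_num`, `g_max_lane_num`, and `g_num_ends`.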

In [15]:
for curr_sample_name in g_unique_sample_names:
    # +1 is because range max is exclusive but g_num_ends is inclusive
    for curr_read_num in range(1, g_num_ends+1):
        print(make_cat_cmd(curr_sample_name, g_min_lane_num, g_max_lane_num, curr_read_num))

cat 218_S48_L004_R1_001.fastq.gz 218_S48_L005_R1_001.fastq.gz 218_S48_L006_R1_001.fastq.gz 218_S48_L007_R1_001.fastq.gz 218_S48_L008_R1_001.fastq.gz > 218_S48_R1_001.fastq.gz
rm 218_S48_L00*_R1_001.fastq.gz
cat 549_S49_L004_R1_001.fastq.gz 549_S49_L005_R1_001.fastq.gz 549_S49_L006_R1_001.fastq.gz 549_S49_L007_R1_001.fastq.gz 549_S49_L008_R1_001.fastq.gz > 549_S49_R1_001.fastq.gz
rm 549_S49_L00*_R1_001.fastq.gz
cat NA0007886209_S40_L004_R1_001.fastq.gz NA0007886209_S40_L005_R1_001.fastq.gz NA0007886209_S40_L006_R1_001.fastq.gz NA0007886209_S40_L007_R1_001.fastq.gz NA0007886209_S40_L008_R1_001.fastq.gz > NA0007886209_S40_R1_001.fastq.gz
rm NA0007886209_S40_L00*_R1_001.fastq.gz
cat NA0007886614_S14_L004_R1_001.fastq.gz NA0007886614_S14_L005_R1_001.fastq.gz NA0007886614_S14_L006_R1_001.fastq.gz NA0007886614_S14_L007_R1_001.fastq.gz NA0007886614_S14_L008_R1_001.fastq.gz > NA0007886614_S14_R1_001.fastq.gz
rm NA0007886614_S14_L00*_R1_001.fastq.gz
cat NA0007886888_S15_L004_R1_001.fastq.gz NA00

[Table of Contents](#Table-of-Contents)

## Lane Merge and Files Upload

Access a terminal on the machine holding the data files (ssh-ing in again if necessary). Copy the lane merge commands above into the terminal and run them.  After the merge is complete, upload the merged files to S3 for future use.

    # check the number of files remaining; 
    # this should equal either the number of samples
    # (for single-end data) or 2 * the number of samples
    # (for paired-end data)
    ls | wc -l
    
    # push all of these files up to CCBB's
    # S3 bucket for input data for archiving/future use
    # (If you would like to see the size of the directory before
    # pushing it up, run du -sh)
    # The s3 bucket url will be of the format
    # s3://ucsd-ccb-data-upload-1/20190205_hepokoski_mtDNA-in-acute-kidney-injury/20190205_mouse_rnaseq/merged/
    # (note that the root is the input bucket, the first subdirectory is
    # the project name, the second subdirectory is the name of this
    # particular dataset, prefixed with the date it was downloaded, and the
    # third subdirectory indicates these are merged files rather than raw inputs).
    # It is ok if the first through third subdirectories don't exist
    # yet (as long as the input bucket does) because the S3 command
    # will make them.  For example,
    # aws s3 cp --recursive . s3://ucsd-ccb-data-upload-1/20181202_califano_luana-dos-santos_squamous-cell-carcinoma/20190327_human_rnaseq/merged/
    aws s3 cp --recursive . <s3 bucket>/ 
    
    # print out (and copy into the cell below) all
    # the merged file names; these will be needed
    # for design file creation for primary analysis
    ls | cat
    
    # finally, delete the merged files and the directory
    # created to hold them; for example, 
    # rm -rf /shared/data
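
The lane merge commands concatenate the compressed files directly with `cat`, without decompressing them first. This works because the gzip format allows several compressed members to be placed back to back in one file. The short sketch below demonstrates this with Python's standard `gzip` module, using made-up reads in place of real lane files.

```python
import gzip

# two tiny made-up fastq records, each compressed separately,
# standing in for two per-lane fastq.gz files
lane_4 = gzip.compress(b"@read1\nACGT\n+\nIIII\n")
lane_5 = gzip.compress(b"@read2\nTTGA\n+\nIIII\n")

# byte-level concatenation, equivalent to `cat lane_4 lane_5 > merged`
merged = lane_4 + lane_5

# decompressing the merged bytes yields both records, in order
print(gzip.decompress(merged).decode())
```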

[Table of Contents](#Table-of-Contents)


Copyright (c) 2018 UC San Diego Center for Computational Biology & Bioinformatics under the MIT License

Notebook template by Amanda Birmingham