### Running in Docker container on Ostrich

#### Started Docker container with the following command:

```docker run -p 8888:8888 -v /Users/sam/data/:/data -v /Users/sam/owl_home/:/owl_home -v /Users/sam/owl_web/:/owl_web -v /Users/sam/gitrepos:/gitrepos -it f99537d7e06a```

The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files on Owl/home and Owl/web accessible to the Docker container.

Once the container was running, I started Jupyter Notebook with the following command inside the Docker container:

```jupyter notebook```

This is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888.

The Docker container is running on an image created from this [Dockerfile (Git commit 443bc42)](https://github.com/sr320/LabDocs/blob/443bc425cd36d23a07cf12625f38b7e3a397b9be/code/dockerfiles/Dockerfile.bio)

In [1]:
%%bash
date

Mon Feb 27 18:32:53 UTC 2017


### Check computer specs

In [2]:
%%bash
hostname

0f2bca9c664b


In [3]:
%%bash
lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 26
Model name:            Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
Stepping:              5
CPU MHz:               2260.998
BogoMIPS:              4521.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K


### Download Jay's Non-Demultiplexed data

#### The files in this folder are as follows (email correspondence):
> Hi Sam,
>
>Your new directory is on the server "Dimond_170224", the checksum info in listed in a text file for the three files. Is there a better way to list the checksums? I'm new to this.  
>  
>Also, I gave you the 2 reads and the 6bp index file.  
>  
>Best,
>
>Shana

>Shana McDevitt

>Director

>Vincent J. Coates Genomics Sequencing Laboratory

>California Institute for Quantitative Biosciences (QB3)

>University of California, Berkeley

In [1]:
cd /data/20170227_jay_data_tmp/

/data/20170227_jay_data_tmp


#### The following command uses ```wget``` to download all of the files in the target directory. Here's an explanation of the code:

- ```time```: Evaluates how long it takes for the command to complete. 

- ```WGETRC=/data/wgetrc_berk_seq```: Sets the environment variable ```WGETRC``` to the path of the ```wgetrc_berk_seq``` file for the duration of this one command, which tells ```wget``` to read its settings from that file. The file contains the username and password needed to download the data from the UC Berkeley FTP server. Using this allows me to run the command in a Jupyter notebook without pasting the actual username and password into the command string.

- ```-r```: Recursive; i.e. download all things in this directory and anything in any subdirectories.

- ```-np```: No parent; i.e. do not ascend to higher directories.

- ```-nc```: No clobber; i.e. do not overwrite any existing files in the download directory.

- ```-q```: Quiet; i.e. do not print wget status to screen. This is to prevent bogging down the Jupyter notebook with thousands of output lines.
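
As a rough sketch of how the ```WGETRC``` approach described above fits together: the file name, credentials, and URL below are placeholders (not the real contents of ```wgetrc_berk_seq```), and no download is actually run. It only shows the ```name = value``` layout of a wgetrc file and that a ```VAR=value``` prefix applies to a single command.

```shell
# Hypothetical credentials file; the real one in this notebook is /data/wgetrc_berk_seq.
demo_dir=$(mktemp -d)
demo_rc="$demo_dir/wgetrc_demo"

# wgetrc files use "name = value" lines; the values here are placeholders.
cat > "$demo_rc" <<'EOF'
user = example_user
password = example_password
EOF
chmod 600 "$demo_rc"   # keep the credentials readable by the owner only

# A VAR=value prefix sets the variable for that one command only.
seen_by_child=$(WGETRC="$demo_rc" sh -c 'printf "%s" "$WGETRC"')
echo "WGETRC seen by the command: $seen_by_child"

# With a real server, the download would then look something like:
#   time WGETRC="$demo_rc" wget -r -np -nc -q ftp://example.org/some_directory
rm -r "$demo_dir"
```

Keeping the credentials in a separate file also means the notebook can be shared or committed to GitHub without exposing them.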

In [5]:
%%bash
time WGETRC=/data/wgetrc_berk_seq wget -r -np -nc -q ftp://gslserver.qb3.berkeley.edu/Dimond_170224


real	37m7.999s
user	0m3.130s
sys	20m34.060s


In [6]:
%%bash
ls -lh

total 0
drwxr-xr-x 1 srlab staff 102 Feb 27 19:54 gslserver.qb3.berkeley.edu


In [7]:
cd gslserver.qb3.berkeley.edu/Dimond_170224/

/data/20170227_jay_data_tmp/gslserver.qb3.berkeley.edu/Dimond_170224


In [8]:
%%bash
ls -lh

total 41G
-rw-r--r-- 1 srlab staff 2.1G Feb 24 23:28 JD002_S0_L005_I1_001.fastq.gz
-rw-r--r-- 1 srlab staff  18G Feb 24 23:28 JD002_S0_L005_R1_001.fastq.gz
-rw-r--r-- 1 srlab staff  22G Feb 24 23:28 JD002_S0_L005_R2_001.fastq.gz
-rw-r--r-- 1 srlab staff  192 Feb 25 00:01 md5sum_report


In [9]:
cat md5sum_report

baa87464b77f937fccf496351bb7f000  JD002_S0_L005_I1_001.fastq.gz
e05eea61dbd405c890f241f824b2012b  JD002_S0_L005_R1_001.fastq.gz
9e34ddfc4dbdd9a96bd4f8f102f52693  JD002_S0_L005_R2_001.fastq.gz


### Generate our own MD5 checksums

In [10]:
%%bash
time for i in *.gz
    do
    md5sum "$i" >> checksums.md5
    done


real	8m0.815s
user	0m4.260s
sys	4m44.580s


In [11]:
cat checksums.md5

baa87464b77f937fccf496351bb7f000  JD002_S0_L005_I1_001.fastq.gz
e05eea61dbd405c890f241f824b2012b  JD002_S0_L005_R1_001.fastq.gz
9e34ddfc4dbdd9a96bd4f8f102f52693  JD002_S0_L005_R2_001.fastq.gz


### Compare MD5 checksums

Visual inspection suggests that these are good to go, but we'll compare them programmatically anyway...

In [12]:
%%bash
diff checksums.md5 md5sum_report

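No output means ```diff``` found no differences between the two files. To verify further, we'll check the exit status of the last command run (should be 0 if the last command completed successfully with no errors) by reading the bash special parameter ```$?```. One caveat: each ```%%bash``` cell runs in its own shell, so ```$?``` only reflects the ```diff``` when both are run in the same cell; read in a separate cell, as here, it prints 0 regardless of the ```diff``` result.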

In [13]:
%%bash
echo $?

0
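
Because every ```%%bash``` cell starts a fresh shell, a more reliable version of the exit-status check keeps ```diff``` and the test of its result in the same cell. The following is a minimal sketch using small made-up files in a temporary directory as stand-ins for ```checksums.md5``` and ```md5sum_report```:

```shell
# Hypothetical demo files standing in for checksums.md5 and md5sum_report.
tmpdir=$(mktemp -d)
printf 'baa87464b77f937fccf496351bb7f000  demo.fastq.gz\n' > "$tmpdir/checksums.md5"
printf 'baa87464b77f937fccf496351bb7f000  demo.fastq.gz\n' > "$tmpdir/md5sum_report"

# diff exits with 0 when the files are identical and 1 when they differ,
# so its status is tested directly, in the same shell that ran it.
if diff "$tmpdir/checksums.md5" "$tmpdir/md5sum_report" > /dev/null; then
    result="match"
else
    result="differ"
fi
echo "checksum files: $result"
rm -r "$tmpdir"
```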


### Copy files to directories on Owl

Jay has three different species in his sequencing data, so I'm copying the data to each of three different species folders on Owl. The code below uses the ```--no-clobber``` argument to prevent ```cp``` from overwriting any existing files in the destination directory that have the same file name.

In [14]:
%%bash
time for file in *.gz
    do
    cp --no-clobber "$file" /owl_web/nightingales/P_generosa/
    cp --no-clobber "$file" /owl_web/nightingales/Porites_spp/
    cp --no-clobber "$file" /owl_web/nightingales/A_elegantissima/
    done


real	130m18.473s
user	0m0.020s
sys	18m23.290s
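
The same copy could also be written with a second loop over the species folders instead of one ```cp``` line per folder. This is only a sketch: the temporary directories and tiny demo files below are stand-ins for the download directory and the ```/owl_web/nightingales/``` folders.

```shell
# Hypothetical stand-ins for the download directory and the Owl destination.
src=$(mktemp -d)
owl=$(mktemp -d)
printf 'reads\n' > "$src/demo_R1.fastq.gz"
printf 'index\n' > "$src/demo_I1.fastq.gz"

for species in P_generosa Porites_spp A_elegantissima
do
    mkdir -p "$owl/$species"
    for file in "$src"/*.gz
    do
        # -n is the short form of --no-clobber: never overwrite an existing file.
        cp -n "$file" "$owl/$species/"
    done
done

# Count what ended up in the destination folders.
copied=$(find "$owl" -name '*.gz' | wc -l)
echo "files copied: $copied"
rm -r "$src" "$owl"
```

Adding a fourth species would then only mean adding one name to the list.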


### Generate new checksums for files copied to Owl

In [17]:
%%bash
time for i in /owl_web/nightingales/P_generosa/JD002_S0_L005*.gz
    do
    md5sum "$i" >> temp_checksums.md5
    done


real	31m19.144s
user	0m3.860s
sys	4m44.720s


In [18]:
%%bash
time for i in /owl_web/nightingales/Porites_spp/JD002_S0_L005*.gz
    do
    md5sum "$i" >> temp_checksums.md5
    done


real	27m17.756s
user	0m4.390s
sys	4m44.170s


In [19]:
%%bash
time for i in /owl_web/nightingales/A_elegantissima/JD002_S0_L005*.gz
    do
    md5sum "$i" >> temp_checksums.md5
    done


real	26m45.605s
user	0m4.870s
sys	4m43.930s


### Compare initial checksums with temporary checksums on Owl

I screwed up and didn't create/write the ```temp_checksums.md5``` file into the different directories on Owl. Will create a concatenated ```md5sum_report``` file that mimics the contents of the ```temp_checksums.md5``` file.

In [20]:
%%bash
cat temp_checksums.md5

baa87464b77f937fccf496351bb7f000  /owl_web/nightingales/P_generosa/JD002_S0_L005_I1_001.fastq.gz
e05eea61dbd405c890f241f824b2012b  /owl_web/nightingales/P_generosa/JD002_S0_L005_R1_001.fastq.gz
9e34ddfc4dbdd9a96bd4f8f102f52693  /owl_web/nightingales/P_generosa/JD002_S0_L005_R2_001.fastq.gz
baa87464b77f937fccf496351bb7f000  /owl_web/nightingales/Porites_spp/JD002_S0_L005_I1_001.fastq.gz
e05eea61dbd405c890f241f824b2012b  /owl_web/nightingales/Porites_spp/JD002_S0_L005_R1_001.fastq.gz
9e34ddfc4dbdd9a96bd4f8f102f52693  /owl_web/nightingales/Porites_spp/JD002_S0_L005_R2_001.fastq.gz
baa87464b77f937fccf496351bb7f000  /owl_web/nightingales/A_elegantissima/JD002_S0_L005_I1_001.fastq.gz
e05eea61dbd405c890f241f824b2012b  /owl_web/nightingales/A_elegantissima/JD002_S0_L005_R1_001.fastq.gz
9e34ddfc4dbdd9a96bd4f8f102f52693  /owl_web/nightingales/A_elegantissima/JD002_S0_L005_R2_001.fastq.gz


In [21]:
%%bash
cat md5sum_report >> md5sum_report_cat
cat md5sum_report >> md5sum_report_cat
cat md5sum_report >> md5sum_report_cat

In [22]:
%%bash
cat md5sum_report_cat

baa87464b77f937fccf496351bb7f000  JD002_S0_L005_I1_001.fastq.gz
e05eea61dbd405c890f241f824b2012b  JD002_S0_L005_R1_001.fastq.gz
9e34ddfc4dbdd9a96bd4f8f102f52693  JD002_S0_L005_R2_001.fastq.gz
baa87464b77f937fccf496351bb7f000  JD002_S0_L005_I1_001.fastq.gz
e05eea61dbd405c890f241f824b2012b  JD002_S0_L005_R1_001.fastq.gz
9e34ddfc4dbdd9a96bd4f8f102f52693  JD002_S0_L005_R2_001.fastq.gz
baa87464b77f937fccf496351bb7f000  JD002_S0_L005_I1_001.fastq.gz
e05eea61dbd405c890f241f824b2012b  JD002_S0_L005_R1_001.fastq.gz
9e34ddfc4dbdd9a96bd4f8f102f52693  JD002_S0_L005_R2_001.fastq.gz


In [23]:
%%bash
diff md5sum_report_cat temp_checksums.md5

1,9c1,9
< baa87464b77f937fccf496351bb7f000  JD002_S0_L005_I1_001.fastq.gz
< e05eea61dbd405c890f241f824b2012b  JD002_S0_L005_R1_001.fastq.gz
< 9e34ddfc4dbdd9a96bd4f8f102f52693  JD002_S0_L005_R2_001.fastq.gz
< baa87464b77f937fccf496351bb7f000  JD002_S0_L005_I1_001.fastq.gz
< e05eea61dbd405c890f241f824b2012b  JD002_S0_L005_R1_001.fastq.gz
< 9e34ddfc4dbdd9a96bd4f8f102f52693  JD002_S0_L005_R2_001.fastq.gz
< baa87464b77f937fccf496351bb7f000  JD002_S0_L005_I1_001.fastq.gz
< e05eea61dbd405c890f241f824b2012b  JD002_S0_L005_R1_001.fastq.gz
< 9e34ddfc4dbdd9a96bd4f8f102f52693  JD002_S0_L005_R2_001.fastq.gz
---
> baa87464b77f937fccf496351bb7f000  /owl_web/nightingales/P_generosa/JD002_S0_L005_I1_001.fastq.gz
> e05eea61dbd405c890f241f824b2012b  /owl_web/nightingales/P_generosa/JD002_S0_L005_R1_001.fastq.gz
> 9e34ddfc4dbdd9a96bd4f8f102f52693  /owl_web/nightingales/P_generosa/JD002_S0_L005_R2_001.fastq.gz
> baa87464b77f937fccf496351bb7f000  /owl_web/nightingales/Porites_spp/JD002_S0_L005_I1_001.fastq.

Well, I didn't take into account that the full path to each file would be written to the checksum file. As a result, ```diff``` reports every line as different, even though the checksums themselves match on visual inspection. Will proceed with adding the checksums to the checksum files in each Owl directory.
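
For future runs, two ways around the full-path problem are sketched below. Both assume GNU ```md5sum```; the temporary directory and demo files are made up for illustration and stand in for one species folder on Owl and the sequencing facility's ```md5sum_report```.

```shell
# Hypothetical stand-ins for one Owl species folder and the facility's report.
work=$(mktemp -d)
mkdir -p "$work/P_generosa"
printf 'reads\n' > "$work/P_generosa/demo_R1.fastq.gz"
printf 'index\n' > "$work/P_generosa/demo_I1.fastq.gz"
(cd "$work/P_generosa" && md5sum *.gz) > "$work/md5sum_report"

# Option 1: run md5sum from inside the directory so only bare file names
# are written, which makes the output directly comparable with diff.
(cd "$work/P_generosa" && md5sum *.gz) > "$work/temp_checksums.md5"
if diff "$work/md5sum_report" "$work/temp_checksums.md5" > /dev/null; then
    diff_result="identical"
else
    diff_result="different"
fi

# Option 2: let md5sum check the copied files against the report itself.
if (cd "$work/P_generosa" && md5sum -c --quiet "$work/md5sum_report"); then
    check_result="verified"
else
    check_result="failed"
fi
echo "diff: $diff_result, md5sum -c: $check_result"
rm -r "$work"
```

Option 2 skips the intermediate checksum file entirely and compares only the checksums, so the paths never enter into it.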

### Append checksums to existing checksum files in each directory

In [24]:
%%bash
cat checksums.md5sums.md5sum_report >> /owl_web/nightingales/P_generosa/checksums.md5
cat md5sum_report >> /owl_web/nightingales/Porites_spp/checksums.md5
cat md5sum_report >> /owl_web/nightingales/A_elegantissima/checksums.md5

Whoops! Typo in that first line above! Fixed below

In [25]:
%%bash
cat md5sum_report >> /owl_web/nightingales/P_generosa/checksums.md5

In [26]:
%%bash
ls -lh

total 41G
-rw-r--r-- 1 srlab staff 2.1G Feb 24 23:28 JD002_S0_L005_I1_001.fastq.gz
-rw-r--r-- 1 srlab staff  18G Feb 24 23:28 JD002_S0_L005_R1_001.fastq.gz
-rw-r--r-- 1 srlab staff  22G Feb 24 23:28 JD002_S0_L005_R2_001.fastq.gz
-rw-r--r-- 1 srlab staff  192 Feb 27 20:44 checksums.md5
-rw-r--r-- 1 srlab staff  192 Feb 25 00:01 md5sum_report
-rw-r--r-- 1 srlab staff  576 Feb 28 01:13 md5sum_report_cat
-rw-r--r-- 1 srlab staff  891 Feb 28 01:13 temp_checksums.md5


In [27]:
rm -rf /data/20170227_jay_data_tmp/gslserver.qb3.berkeley.edu/

In [28]:
%%bash
ls -lh /data/20170227_jay_data_tmp/

total 0


shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
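
The ```getcwd``` error above is what bash reports when it is started in a working directory that no longer exists; here, the notebook was still sitting inside the directory that ```rm -rf``` had just deleted. A minimal sketch of avoiding that, using a made-up temporary directory tree in place of ```/data/20170227_jay_data_tmp/```, is to change to the parent directory before deleting:

```shell
# Hypothetical stand-in for /data/20170227_jay_data_tmp/ and the download tree inside it.
parent=$(mktemp -d)
mkdir -p "$parent/gslserver.qb3.berkeley.edu/Dimond_170224"
cd "$parent/gslserver.qb3.berkeley.edu/Dimond_170224"

# Step out of the tree first, so the shell still has a valid
# working directory once the tree is gone.
cd "$parent"
rm -rf "$parent/gslserver.qb3.berkeley.edu"

remaining=$(ls -A "$parent" | wc -l)
echo "entries left in parent: $remaining"

# Clean up the demo directory.
cd /
rmdir "$parent"
```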


### Download Jay's demultiplexed data

In [29]:
cd /data/20170227_jay_data_tmp/

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/root/miniconda2/lib/python2.7/site-packages/IPython/core/ultratb.py", line 1132, in get_records
    return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
  File "/root/miniconda2/lib/python2.7/site-packages/IPython/core/ultratb.py", line 313, in wrapped
    return f(*args, **kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
    records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
  File "/root/miniconda2/lib/python2.7/inspect.py", line 1049, in getinnerframes
    framelist.append((tb.tb_frame,) + getframeinfo(tb, context))
  File "/root/miniconda2/lib/python2.7/inspect.py", line 1009, in getframeinfo
    filename = getsourcefile(frame) or getfile(frame)
  File "/root/miniconda2/lib/python2.7/inspect.py", line 454, in getsourcefile
    if hasattr(getmodule(object, filename), '__loader__'):
  File "/root/miniconda2/lib/python2.7/ins

IndexError: string index out of range