### Running in Docker container on EC2 instance

#### Started Docker container with the following command:

```docker run  -p 8888:8888 -v /home/ubuntu/gitrepos/LabDocs/jupyter_nbs/sam:/home/notebooks -v /home/ubuntu/data:/data/ -it kubu4/bioinformatics:v11 /bin/bash```

This command exposes Jupyter Notebook on port 8888 and mounts my Jupyter Notebook GitHub repo and my data directory inside the Docker container.
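For reference, the same command can be laid out piece by piece. This is only a sketch: the variable names are my own labels, not anything Docker requires, and the block prints the command for review instead of running it.

```bash
# -p host_port:container_port publishes the Jupyter port to the EC2 host.
port_map="8888:8888"

# -v host_dir:container_dir bind-mounts a host directory into the container.
notebooks_mount="/home/ubuntu/gitrepos/LabDocs/jupyter_nbs/sam:/home/notebooks"
data_mount="/home/ubuntu/data:/data/"

# Image and tag to start; /bin/bash is the command run inside it.
image="kubu4/bioinformatics:v11"

# -it keeps the session interactive with a terminal attached.
docker_cmd="docker run -p $port_map -v $notebooks_mount -v $data_mount -it $image /bin/bash"

# Print the assembled command so it can be checked before pasting into the terminal.
echo "$docker_cmd"
```

Because the data directory is mounted at /data/ in the container, anything written there from the notebook lands in /home/ubuntu/data on the EC2 instance.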

Once the container was running, I started Jupyter Notebook with the following command inside the Docker container:

```jupyter notebook```

The Docker container is configured so that this command launches Jupyter Notebook on port 8888 without opening a browser.

#### Created a tunnel from my local computer to the Docker container:

```ssh -i ~/Dropbox/Lab/Sam/bioinformatics.pem -N -L localhost:8888:localhost:8888 ubuntu@ec2.ip.address```

This command is run in a separate Terminal window from the one used to ssh into the EC2 instance and start Docker.

This ssh command authenticates with my Amazon EC2 key file (bioinformatics.pem). The -N option tells ssh not to run a remote command, and the -L option forwards port 8888 on my local computer to port 8888 on the EC2 instance (see man ssh for details).

The tunnel lets me open the Jupyter Notebook in my local web browser by entering ```localhost:8888``` as the URL.

In [1]:
%%bash
date

Fri Jul 15 16:56:46 UTC 2016


In [2]:
%%bash
hostname

570c28713283


### Check computer specs

In [3]:
%%bash
lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
Stepping:              2
CPU MHz:               2900.088
BogoMIPS:              5800.17
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7


In [4]:
cd /data/

/data


In [5]:
ls

160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_1.fq.gz  1NF_25A_2.fq.gz*
160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_2.fq.gz  1NF_26A_1.fq.gz*
1HL_10A_1.fq.gz*                                            1NF_26A_2.fq.gz*
1HL_10A_2.fq.gz*                                            1NF_27A_1.fq.gz*
1HL_11A_1.fq.gz*                                            1NF_27A_2.fq.gz*
1HL_11A_2.fq.gz*                                            1NF_28A_1.fq.gz*
1HL_12A_1.fq.gz*                                            1NF_28A_2.fq.gz*
1HL_12A_2.fq.gz*                                            1NF_29A_1.fq.gz*
1HL_13A_1.fq.gz*                                            1NF_29A_2.fq.gz*
1HL_13A_2.fq.gz*                                            1NF_2A_1.fq.gz*


### Rename FASTQ files to match R1 and R2 requirements for pyrad demultiplexing

In [9]:
%%bash
mv 160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_1.fq.gz 160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_R1_.fq.gz

In [10]:
%%bash
mv 160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_2.fq.gz 160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_R2_.fq.gz
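The two mv commands follow one pattern (_1 becomes _R1_, _2 becomes _R2_), so they could be wrapped in a small helper. A minimal sketch, assuming the file prefix is passed in explicitly so that other *_1.fq.gz and *_2.fq.gz files in the same directory are left alone; rename_pair is a hypothetical name, not part of pyrad.

```bash
# Hypothetical helper: rename <prefix>_1.fq.gz and <prefix>_2.fq.gz
# to <prefix>_R1_.fq.gz and <prefix>_R2_.fq.gz.
rename_pair() {
    prefix=$1
    for n in 1 2; do
        src="${prefix}_${n}.fq.gz"
        dst="${prefix}_R${n}_.fq.gz"
        # Only move when the source exists and the target would not be overwritten.
        if [ -e "$src" ] && [ ! -e "$dst" ]; then
            mv -- "$src" "$dst"
        else
            echo "skipping $src" >&2
        fi
    done
}

# Example call (commented out so the block has no side effects):
# rename_pair /data/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109
```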

In [11]:
cd analysis/

/data/analysis


In [12]:
ls

barcodes.txt  params.txt


In [13]:
%%bash
cat params.txt

./                     ## 1. Working directory                                 (all)
/data/oly_gbs/*.gz             	       ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
/analysis/20160609_pyrad/barcodes.txt	               ## 3. Loc. of barcode file (if not line 18)             (s1)
/usr/local/bioinformatics/vsearch-1.11.1-linux-x86_64/bin/vsearch                ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
/usr/local/bioinformatics/muscle3.8.31_i86linux64                    ## 5. command (or path) to call muscle                  (s3,s7)
CWGC                   ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
16                     ## 7. N processors (parallel)                           (all)
6                      ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                      ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.88                    ## 10. Wclust: clustering threshold as a decimal        (

In [14]:
%%bash
mv /data/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_R1_.fq.gz /data/analysis/

In [16]:
%%bash
mv /data/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_R2_.fq.gz /data/analysis/

In [17]:
ls

160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_R1_.fq.gz  barcodes.txt
160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_R2_.fq.gz  params.txt


Need to edit params file to match current locations of data files. Will do outside of notebook. BRB...
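I made this edit by hand, but it could also be scripted. A rough sketch, assuming each params line keeps the layout shown above (value first, then a ## comment) and that the new value contains no #, | or & characters; set_param is a hypothetical helper, not something pyrad provides.

```bash
# Hypothetical helper: replace the value (everything before the "##" comment)
# on one line of a pyrad params file.
set_param() {
    file=$1
    line=$2
    value=$3
    # Write to a temporary file first, then move it over the original.
    sed "${line}s|^[^#]*##|${value}    ##|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example calls for the first three lines (commented out so nothing is changed here):
# set_param params.txt 1 /data/analysis/
# set_param params.txt 2 /data/analysis/
# set_param params.txt 3 /data/analysis/barcodes.txt
```

The spacing before each ## comment will differ from a hand-edited file, since the helper always writes four spaces.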

In [18]:
%%bash
cat params.txt

/data/analysis/                     ## 1. Working directory                                 (all)
/data/analysis/             	       ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
/data/analysis/barcodes.txt	               ## 3. Loc. of barcode file (if not line 18)             (s1)
/usr/local/bioinformatics/vsearch-1.11.1-linux-x86_64/bin/vsearch                ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
/usr/local/bioinformatics/muscle3.8.31_i86linux64                    ## 5. command (or path) to call muscle                  (s3,s7)
CWGC                   ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
16                     ## 7. N processors (parallel)                           (all)
6                      ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                      ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.88                    ## 10. Wclust: clustering threshold as a decimal        (
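One thing worth checking before running pyrad: line 7 of the params file requests 16 processors, while the lscpu output earlier in this notebook reports 8 CPUs. A small sketch of how that comparison could be automated; check_processors is a hypothetical helper, and it assumes the processor count is the first field on line 7.

```bash
# Hypothetical helper: compare the processor count on line 7 of a pyrad
# params file with the number of CPUs available. The optional second
# argument overrides the detected CPU count.
check_processors() {
    params_file=$1
    available=${2:-$(getconf _NPROCESSORS_ONLN)}
    requested=$(awk 'NR == 7 { print $1 }' "$params_file")
    if [ "$requested" -gt "$available" ]; then
        echo "params line 7 requests $requested processors; only $available available"
        return 1
    fi
    echo "params line 7 requests $requested processors; $available available"
}

# Example call (commented out; params.txt must exist in the working directory):
# check_processors params.txt
```

Requesting more processes than there are CPUs may not speed anything up, so lowering line 7 to match the machine is probably the safer choice.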

### View barcodes.txt file

In [19]:
%%bash
cat barcodes.txt

1NF_1A	CTCC
1NF_2A	TGCA
1NF_4A	ACTA
1NF_5A	CAGA
1NF_6A	AACT
1NF_7A	GCGT
1NF_8A	CGAT
1NF_9A	GTAA
1NF_10A	AGGC
1NF_11A	GATC
1NF_12A	TCAC
1NF_13A	TGCGA
1NF_14A	CGCTT
1NF_15A	TCACC
1NF_16A	CTAGC
1NF_17A	ACAAA
1NF_18A	TTCTC
1NF_19A	AGCCC
1NF_20A	GTATT
1NF_21A	CTGTA
1NF_22A	ACCGT
1NF_23A	GCTTA
1NF_24A	GGTGT
1NF_25A	AGGAT
1NF_26A	ATTGA
1NF_27A	CATCT
1NF_28A	CCTAC
1NF_29A	GAGGA
1NF_30A	GGAAC
1NF_31A	GTCAA
1NF_32A	TAATA
1NF_33A	TACAT
1SN_1A	TCGTT
1SN_2A	GGTTGT
1SN_3A	CCAGCT
1SN_4A	TTCAGA
1SN_5A	TAGGAA
1SN_6A	GCTCTA
1SN_7A	CCACAA
1SN_8A	CTTCCA
1SN_9A	GAGATA
1SN_10A	ATGCCT
1SN_11A	AGTGGA
1SN_12A	ACCTAA
1SN_13A	ATATGT
1SN_14A	ATCGTA
1SN_15A	CATCGT
1SN_16A	CGCGGT
1SN_17A	CTATTA
1SN_18A	GCCAGT
1SN_19A	GGAAGA
1SN_20A	GTACTT
1SN_21A	GTTGAA
1SN_22A	TAACGA
1SN_23A	TGGCTA
1SN_24A	TATTTTT
1SN_25A	CTTGCTT
1SN_26A	ATGAAAC
1SN_27A	AAAAGTT
1SN_28A	GAATTCA
1SN_29A	GAACTTC
1SN_30A	GGACCTA
1SN_31A	GTCGATT
1SN_32A	AACGCCT
1HL_1A	AATATGC
1HL_2A	ACGTGTT
1HL_3A	ATTAATT
1HL_4A	ATTGGAT
1HL_5A	CATAAGT
1HL_6A	CGCTGAT
1H
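The barcodes shown range from 4 to 7 bases, so a quick sanity check of the file before demultiplexing seems worthwhile. A minimal sketch, assuming the file is tab-separated with the sample name in column 1 and the barcode in column 2; both function names are made up for illustration, and lines without a second column are ignored.

```bash
# Print any barcode sequence that appears more than once.
duplicate_barcodes() {
    awk -F '\t' 'NF >= 2 { count[$2]++ }
                 END { for (b in count) if (count[b] > 1) print b }' "$1"
}

# Print "length count" pairs, one per barcode length, sorted by length.
barcode_lengths() {
    awk -F '\t' 'NF >= 2 { n[length($2)]++ }
                 END { for (l in n) print l, n[l] }' "$1" | sort -n
}

# Example calls (commented out; barcodes.txt must exist in the working directory):
# duplicate_barcodes barcodes.txt
# barcode_lengths barcodes.txt
```

No output from duplicate_barcodes would mean every barcode is unique.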

### Step 1: De-multiplex reads

In [None]:
%%bash
time pyrad -p params.txt -s 1