# mapping and counting
In this note, I address the following tasks:

1. How many FAST5 files do we have?
1. Take several sample files and convert them into FASTQ files
1. Map to the reference genome (hg19)
1. Bring in prior data (UKBB 500k individuals)
1. Take the platinum whole-genome VCF file to compute $\theta$

In [1]:
!python --version

Python 3.5.2 :: Anaconda 4.1.1 (64-bit)


# Summary

## 0] Preparation
- software installation and data download

## 1] How many FAST5 files?

- cDNA: 20161006_minion_human_cDNA
- WGS: 20161008_wgs_caucasian_48hr

| | cDNA | WGS |
| ---- | ---- | ---- |
| Number of FAST5 files | 48280 | 184911 |
| Total reads | 26,854 | 29,964 |
| Total base pairs | 46,314,462 | 44,839,915 |
| Mean read length (bp) | 1724.68 | 1496.46 |
| Median read length (bp) | 1094 | 925 |
| Min read length (bp) | 58 | 35 |
| Max read length (bp) | 108262 | 94024 |
| N25 (bp) | 5201 | 4547 |
| N50 (bp) | 2529 | 2227 |
| N75 (bp) | 1327 | 1140 |

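The N25/N50/N75 rows in the table can be reproduced from a list of read lengths. Below is a minimal Python sketch, assuming the usual definition (N50 is the length L such that reads of length >= L together contain at least 50% of all bases). The function names `nx` and `summarize` are hypothetical, and this is not necessarily the tool that produced the numbers above.

```python
def nx(lengths, fraction):
    """Return the length L such that reads of length >= L hold at least
    `fraction` of all bases (fraction=0.5 gives N50)."""
    if not lengths:
        raise ValueError("no read lengths given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    threshold = fraction * sum(lengths)
    running = 0
    # Walk from the longest read downwards until the threshold is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


def summarize(lengths):
    """Collect the same kind of statistics as the summary table."""
    ordered = sorted(lengths)
    n = len(ordered)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return {
        "total_reads": n,
        "total_bp": sum(ordered),
        "mean": sum(ordered) / n,
        "median": median,
        "min": ordered[0],
        "max": ordered[-1],
        "N25": nx(ordered, 0.25),
        "N50": nx(ordered, 0.50),
        "N75": nx(ordered, 0.75),
    }
```

With this definition N25 >= N50 >= N75, which matches the ordering in the table.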
## 1) How many fast5 files do we have?
- Simply use the `find` command to count the files

In [3]:
!find /home/ytanigaw/data/nanopore/20161006_minion_human_cDNA/ -name "*.fast5"| wc -l

48280


In [2]:
!find /home/ytanigaw/data/nanopore/20161008_wgs_caucasian_48hr/ -name "*.fast5"| wc -l

^C

- The WGS directory took too long to scan interactively, so I interrupted the command and submitted it as a batch job instead
```
[ytanigaw@sh-5-36 ~/projects/nanopore/scripts/20161018]$ sbatch 20161008_wgs_caucasian_48hr-count-fast5.sbatch
Submitted batch job 10280157
```
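The same count can be obtained from Python without a shell pipeline. This is a minimal sketch using only the standard library; `count_files` is a hypothetical helper name. Unlike `find -name "*.fast5"`, it counts only file entries, not directories whose names happen to end in `.fast5`.

```python
import os


def count_files(root, suffix=".fast5"):
    """Count files under `root` (recursively) whose names end with `suffix`."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += sum(1 for name in filenames if name.endswith(suffix))
    return total
```

For example, `count_files("/home/ytanigaw/data/nanopore/20161006_minion_human_cDNA/")` would walk the cDNA directory. On a large directory this is still bound by file-system speed, so running it inside a batch job remains sensible.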

In [4]:
!tail -n1 /home/ytanigaw/projects/nanopore/scripts/20161018/20161008_wgs_caucasian_48hr-count-fast5.sbatch

find ${HOME}/data/nanopore/20161008_wgs_caucasian_48hr/ -name "*.fast5"| wc -l


In [20]:
!cat ../scripts/20161018/20161008_wgs_caucasian_48hr-count-fast5.out

184911


In [2]:
!find /home/ytanigaw/data/nanopore/20161006_minion_human_cDNA/ -name "*.fast5"|head

/home/ytanigaw/data/nanopore/20161006_minion_human_cDNA/DN0a27203a_SUNet_20161001_FNFAB30583_MN20225_sequencing_run_20161001_brcabl_80046_ch292_read263_strand.fast5
/home/ytanigaw/data/nanopore/20161006_minion_human_cDNA/DN0a27203a_SUNet_20161001_FNFAB30583_MN20225_sequencing_run_20161001_brcabl_80046_ch22_read997_strand.fast5
/home/ytanigaw/data/nanopore/20161006_minion_human_cDNA/DN0a27203a_SUNet_20161001_FNFAB30583_MN20225_sequencing_run_20161001_brcabl_80046_ch358_read263_strand.fast5
/home/ytanigaw/data/nanopore/20161006_minion_human_cDNA/DN0a27203a_SUNet_20161001_FNFAB30583_MN20225_sequencing_run_20161001_brcabl_80046_ch218_read2332_strand.fast5
/home/ytanigaw/data/nanopore/20161006_minion_human_cDNA/DN0a27203a_SUNet_20161001_FNFAB30583_MN20225_sequencing_run_20161001_brcabl_80046_ch494_read1184_strand.fast5
/home/ytanigaw/data/nanopore/20161006_minion_human_cDNA/DN0a27203a_SUNet_20161001_FNFAB30583_MN20225_sequencing_run_20161001_brcabl_80046_ch211_read974_strand.fast5
/home/yta

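The file names listed above all end in `_ch<number>_read<number>_strand.fast5`. Assuming that this pattern holds for every file (it is inferred only from the paths shown here) and that `ch` is the channel number, a small sketch like the following could pull both numbers out of a path. The function name `parse_fast5_name` is hypothetical.

```python
import os
import re

# Pattern inferred from the listed file names; treat it as an assumption.
_NAME_PATTERN = re.compile(r"_ch(\d+)_read(\d+)_strand\.fast5$")


def parse_fast5_name(path):
    """Return (channel, read_number) parsed from a FAST5 file name,
    or None if the name does not follow the assumed pattern."""
    match = _NAME_PATTERN.search(os.path.basename(path))
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Returning `None` for non-matching names makes it easy to spot files that break the assumed naming scheme.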
## Helio's script
- Helio kindly shared his code with us
- I will have a look at it

In [5]:
!cat /home/ytanigaw/projects/nanopore/scripts/from_Helio/nanopore_rna.20161009.v1.sh

########################################################
### Declare paths
########################################################
SAMTOOLS='/srv/gsfs0/projects/bustamante/progs/samtools-0.1.20/samtools'
GENOME_GRCH37='/srv/gsfs0/projects/bustamante/hcosta_projects/resources/Gencode/Release19_GRCh37/GRCh37.p13.genome.fa'
GENOME_BCRABL_DIR='/srv/gsfs0/projects/bustamante/hcosta_projects/resources/BCR-ABL_genome'
FASTQ='/srv/gsfs0/projects/bustamante/hcosta_projects/projects/minion/data/20161001_bcrabl/20161001_bcrabl.fq'
LAST='/srv/gsfs0/projects/bustamante/progs/last-759'
RESULTS_DIR='/srv/gsfs0/projects/bustamante/hcosta_projects/projects/minion/results'
PICARD='/srv/gsfs0/projects/bustamante/progs/picard-tools-1.138/picard.jar'


########################################################
### QLOGIN
########################################################

screen
qlogin -l h_vmem=40G -q extended


#################################################

- Basically, they have installed the necessary software in their lab partition on the scg4 cluster

## mapping software
- Should we use the Sherlock cluster or the scg4 cluster?
  - It depends on which software is available on each

### Sherlock cluster

In [11]:
!hostname

sh-5-36.local


In [9]:
!module avail > /tmp/module-avail 2>&1

In [10]:
!cat /tmp/module-avail

Rebuilding cache, please wait ... (not written to file) done

---------------------------- /share/sw/modules/Core ----------------------------
   APBS/1.4.1
   CNTK/1.1                                        (g)
   CUB/1.5.2                                       (g)
   FETK/1.4
   GraphicsMagick/GraphicsMagick-1.4.020151212
   IGV/2.3.79
   LAMMPS/9Dec2014
   NAMD/2.11                                       (g)
   OpenBabel/2.3.2
   QuantumEspresso/5.1.1/intel
   R/3.0.2
   R/3.2.0
   R/3.2.2                                         (D)
   R/3.2.5.intel.tcltk
   R/3.2.5.intelmpi
   R/3.2.5
   R/3.3.0
   STAR/STAR
   STAR-Fusion/v0.8
   afni/16.2.13
   allinea/5.0
   allinea/5.0.1
   allinea/5.1
   allinea/6.0                                     (D)
   amber/14-cuda                                   (g)
   amber/14-intel                                  (D)
   anaconda/anaconda2
   anaconda/anaconda3                              (L,D)
   ansys/icemcfd/15.0


- This cluster provides fewer bioinformatics tools as modules

### scg4 cluster
- They have bowtie, blast, bwa, and most other commonly used bioinformatics tools.
- However, they do not have an anaconda module, and they do not have Python 3.5 either (only Python 3.4)

## data download
- In the meantime, I download the data

#### hg19
- Download hg19 from the UCSC Genome Browser
  - http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

In [15]:
!tail -n3 ../scripts/20161018/20161008_hg19_dl.sbatch

SOURCE="http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz"
wget ${SOURCE} \
     -P ${HOME}/data/


## software installation
- I need to install the relevant software on Sherlock

- Before that, I submitted the hg19 download as a batch job
```
[ytanigaw@sh-5-36 ~/projects/nanopore/scripts/20161018]$ sbatch 20161008_hg19_dl.sbatch
Submitted batch job 10281245
```
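Once the download job finishes, `chromFa.tar.gz` contains one FASTA file per chromosome. If a single FASTA file is more convenient for indexing, the members can be concatenated without unpacking the archive to disk first. This is a minimal sketch using the standard library; `concat_fasta_from_tar` is a hypothetical helper, and whether a merged file is actually needed for the mapping step is an assumption.

```python
import shutil
import tarfile


def concat_fasta_from_tar(tar_path, out_path, suffix=".fa"):
    """Append every regular member ending in `suffix` to `out_path`.

    Returns the member names in the order they were written.
    Each member is assumed to end with a newline.
    """
    written = []
    # "r:*" lets tarfile detect the compression (gzip here) by itself.
    with tarfile.open(tar_path, "r:*") as archive, open(out_path, "wb") as out:
        for member in archive:
            if not (member.isfile() and member.name.endswith(suffix)):
                continue
            source = archive.extractfile(member)
            shutil.copyfileobj(source, out)
            written.append(member.name)
    return written
```

The members are written in archive order, so the chromosome order of the output follows the tarball rather than any sorted order.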

### LAST


In [23]:
!wget http://last.cbrc.jp/last-759.zip -P /tmp

--2016-10-18 15:29:21--  http://last.cbrc.jp/last-759.zip
Resolving last.cbrc.jp... 124.35.84.43
Connecting to last.cbrc.jp|124.35.84.43|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 710209 (694K) [application/zip]
Saving to: “/tmp/last-759.zip”


2016-10-18 15:29:22 (822 KB/s) - “/tmp/last-759.zip” saved [710209/710209]



In [25]:
!unzip /tmp/last-759.zip

Archive:  /tmp/last-759.zip
replace last-759/ChangeLog.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [27]:
!cd /tmp/last-759/

In [28]:
!./configure --prefix=/share/PI/mrivas

/bin/sh: ./configure: No such file or directory


In [29]:
!cat /tmp/last-759/README.txt

Please see the documentation in the "doc" directory.

* Installation & general info: doc/last.txt
* Usage: start with doc/last-tutorial.txt


- I followed the documentation and installed LAST into our lab's shared directory.
```
[ytanigaw@sh-5-36 /tmp/last-759]$ ml load gcc
[ytanigaw@sh-5-36 /tmp/last-759]$ make
[ytanigaw@sh-5-36 /tmp/last-759]$ make install prefix=/share/PI/mrivas/
```
- added `/share/PI/mrivas/bin/` to my path (~/.bash_profile)

In [1]:
!echo $PATH

/share/sw/free/anaconda/anaconda3/bin:/share/PI/mrivas/bin:/home/ytanigaw/.local/bin:/share/sw/srcc/bin:/share/sw/srcc/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/opt/pdsh/bin:/opt/rocks/bin:/opt/rocks/sbin:/opt/dell/srvadmin/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/ytanigaw/bin


In [2]:
!which lastdb

/share/PI/mrivas/bin/lastdb


- successfully installed

#### LAST example #1 from the tutorial
- http://last.cbrc.jp/doc/last-tutorial.html

In [4]:
!ml load gcc

In [10]:
!lastdb -cR01 /tmp/last-759/humdb /tmp/last-759/examples/humanMito.fa

lastdb: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.17' not found (required by lastdb)
lastdb: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by lastdb)
lastdb: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by lastdb)


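- I got an error: the `lastdb` binary requires `GLIBCXX`/`CXXABI` symbol versions that the system `libstdc++` does not provide
- After some Google searching, I decided to rebuild with an older version of gcc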
```
[ytanigaw@sh-5-36 /tmp/last-759]$ ml load gcc/4.9.1
[ytanigaw@sh-5-36 /tmp/last-759]$ make
[ytanigaw@sh-5-36 /tmp/last-759]$ make install prefix=/share/PI/mrivas/
```
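To see which `GLIBCXX` versions a given `libstdc++.so.6` advertises, the library can be scanned for version tags, in the spirit of `strings libstdc++.so.6 | grep GLIBCXX`. The following is a minimal Python sketch, not something that was run for this note; the function names are hypothetical.

```python
import re

# Version tags are stored as plain ASCII strings such as "GLIBCXX_3.4.17".
_VERSION_TAG = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)*)")


def glibcxx_versions(blob):
    """Return the distinct GLIBCXX version numbers found in `blob` (bytes),
    sorted numerically (so 3.4.9 comes before 3.4.13)."""
    found = {m.group(1).decode("ascii") for m in _VERSION_TAG.finditer(blob)}
    return sorted(found, key=lambda v: tuple(int(part) for part in v.split(".")))


def glibcxx_versions_in_file(path):
    """Read a shared library from disk and list its GLIBCXX version tags."""
    with open(path, "rb") as handle:
        return glibcxx_versions(handle.read())
```

For example, `"3.4.21" in glibcxx_versions_in_file("/usr/lib64/libstdc++.so.6")` would tell whether the system library can satisfy a binary that asks for `GLIBCXX_3.4.21`.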


In [11]:
!lastdb -cR01 /tmp/last-759/humdb /tmp/last-759/examples/humanMito.fa

- it works!! :)

In [12]:
!lastal /tmp/last-759/humdb /tmp/last-759/examples/fuguMito.fa > ~/myalns.maf

In [13]:
!wc -l /home/ytanigaw/myalns.maf

93 /home/ytanigaw/myalns.maf

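`wc -l` counts lines, not alignments. In MAF, each alignment block starts with a line whose first field is `a`, followed by `key=value` pairs such as `score=...`. A minimal sketch for counting blocks and reading those pairs is shown below; `read_maf_block_headers` is a hypothetical helper, and the values are kept as strings.

```python
def read_maf_block_headers(lines):
    """Return one dict of key=value pairs per MAF alignment block.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    blocks = []
    for line in lines:
        fields = line.split()
        # Alignment blocks start with a lone "a" field; skip everything else
        # (comments, "s" sequence lines, blank separators).
        if fields[:1] != ["a"]:
            continue
        blocks.append(dict(f.split("=", 1) for f in fields[1:] if "=" in f))
    return blocks
```

With the output file above, `len(read_maf_block_headers(open("/home/ytanigaw/myalns.maf")))` would give the number of alignments rather than the number of lines.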

In [16]:
!git status

# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#	modified:   20161018_mapping_and_counting.ipynb
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	../scripts/20161018/
no changes added to commit (use "git add" and/or "git commit -a")


In [17]:
!git add 20161018_mapping_and_counting.ipynb

In [18]:
!git commit -m "update note: LAST installation"

[master 39a5fcd] update note: LAST installation
 1 file changed, 474 insertions(+), 5 deletions(-)


In [19]:
!git push origin master

Counting objects: 7, done.
Delta compression using up to 16 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 2.48 KiB, done.
Total 4 (delta 3), reused 0 (delta 0)
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
To git@github.com:rivas-lab/nanopore.git
   08d728a..39a5fcd  master -> master
