Create slides and run slide show

* ipython nbconvert presentation.ipynb --to slides --post serve

Or, if the slides have already been created:
* python -m SimpleHTTPServer

Navigate to http://127.0.0.1:8000/presentation.slides.html

# Command Line Tools for Data Manipulation


<img width='1000' height='500' src=pics/command_line_fu.png /img>




This notebook is using the BASH Kernel by *takluyver*
https://github.com/takluyver/bash_kernel

<center><b> Leveraging Linux Shell Programs for Data Science Tasks</b></center>

* Extraction in this field can be difficult. We are not just pulling data from a database...

* Skipping Extraction for now, let's focus on Transformation.

* We will touch on Storage only for small prototyping projects.

<img src=pics/ETL_diagram.png /img>

# Strengths of Shell Programs:  
* Written in C 
* Close to the Kernel
* Robust 
 * Well documented
 * Written by amazing programmers
 
<div align="left">
<table style="width:25%">
 <b>A few common authors</b>
* Ken Thompson
* Lee E. McMahon 
* Richard M. Stallman
* Linus Torvalds
     

Are you really going to write better, more robust, faster code than these people??

<img width='700' src=pics/c_prog_comic_foxtrot.jpg /img>

<b>What Is The Command Line Good For ?</b>

* Small data sets
* R&D code for Map Reduce



# Imaginary Use Case: File Inventory

For now, imagine we have a data file in CSV format with inventory information about
a hard-drive.

## Basic Inventory Consists Of:
* number of files
* size of volume
* number of files by file-extension
* lines of code in the files

In [None]:
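#!/bin/bash
# make_inventory.sh: print one CSV row ("path","nlines","ext") for a file

# number of lines in the file
function count_lines
{
    local input="$1"
    echo $(wc -l < "$input")
}

# echo 1 if `file` reports ASCII text, otherwise 0
function check_for_ascii
{
    local input="$1"
    # -b drops the file name from the report, so a path containing "ascii" cannot match
    if file -b "$input" | grep -qi "ascii"; then
        echo 1
    else
        echo 0
    fi
}

# text after the last dot of the file name, or NONE if there is no dot
function get_extension
{
    local input="$1"
    local base
    base=$(basename "$input")

    if [[ "$base" != *.* ]]; then
        echo "NONE"
    else
        echo "$base" | rev | cut -d. -f 1 | rev
    fi
}

#### calls here ###
input="$1"

count=$(count_lines "$input")
ascii_bool=$(check_for_ascii "$input")
ext=$(get_extension "$input")

# line counts are only reported for ASCII files; everything else gets 0
if [ "$ascii_bool" == 1 ]; then
    printf '"%s","%s","%s"\n' "$input" "$count" "$ext"
else
    printf '"%s","0","%s"\n' "$input" "$ext"
fi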
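
A possible variation on `get_extension`: bash parameter expansion can give the same answer without starting `basename`, `rev`, and `cut` for every file. This is a sketch and not the script used for the timings in this notebook; the sample paths in the usage lines are only illustrations.

```bash
# Sketch: same result as get_extension in the script, using only shell built-ins.
get_extension() {
    local base="${1##*/}"            # drop everything up to the last "/"
    case "$base" in
        *.*) echo "${base##*.}" ;;   # keep the text after the last "."
        *)   echo "NONE" ;;          # no dot in the file name
    esac
}

get_extension "net/wireless/ibss.c"    # prints: c
get_extension "net/sunrpc/Makefile"    # prints: NONE
get_extension "some/dir/.gitignore"    # prints: gitignore
```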

In [11]:
root_dir="/home/daniel/git/Python2.7/DataScience/notebooks/orbital"
export PATH=$PWD:$PATH



In [2]:
du -h -c linux-2.6.32.67 | tail -n 1

418M	total


In [10]:
#make a header function for inventory csv file
function make_header() {
printf "\"path\",\"nlines\",\"ext\"\n"
}



In [22]:
make_header > $root_dir/linux_inventory_local_path.csv
# run inventory program on files 
# use xargs multi process argument
time find $root_dir/linux-2.6.32.67 -type f | xargs -n 1 -P 0 $root_dir/make_inventory.sh >> $root_dir/linux_inventory_local_path.csv


real	1m21.277s
user	4m25.776s
sys	6m2.499s
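
In the command above, `-n 1` hands one path to each invocation and `-P` sets how many invocations run at once (`-P 0` in GNU xargs means as many as possible). A toy sketch of the same pattern, with a made-up squaring worker standing in for `make_inventory.sh`. Output order is not guaranteed when jobs run in parallel, so the result is sorted:

```bash
# Hypothetical stand-in for make_inventory.sh: print the square of each input number.
square_all() {
    # -n 1: one argument per invocation; -P 4: up to four invocations at a time
    xargs -n 1 -P 4 sh -c 'echo $(( $1 * $1 ))' _
}

printf '%s\n' 1 2 3 4 | square_all | sort -n
```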


In [23]:
cat $root_dir/linux_inventory_local_path.csv | wc -l | thou_comma.sh

30,486


In [24]:
head -n 5 "${root_dir}/linux_inventory_local_path.csv" | cut -d/ -f 7- | column -t -s,

"path"                                                          "nlines"  "ext"
notebooks/orbital/linux-2.6.32.67/net/decnet/dn_dev.c"          "0"       "c"
notebooks/orbital/linux-2.6.32.67/net/sunrpc/Makefile"          "18"      "NONE"
notebooks/orbital/linux-2.6.32.67/net/bluetooth/cmtp/Kconfig"   "11"      "NONE"
notebooks/orbital/linux-2.6.32.67/net/bluetooth/bnep/Makefile"  "7"       "NONE"


What if we want to change the root path in the inventory files ?

In [14]:
head -n 1 $root_dir/linux_inventory_local_path.csv

"/home/daniel/git/Python2.7/DataScience/notebooks/orbital/linux-2.6.32.67/net/wireless/ibss.c","509","c"


In [3]:
head -n 1 $root_dir/linux_inventory_local_path.csv | cut -d, -f 1 | sed 's/\"//g' | cut -d/ -f 9-

linux-2.6.32.67/net/wireless/ibss.c


In [5]:
new_path="/work/sandbox/orbital_slides/"
head -n 1 $root_dir/linux_inventory_local_path.csv | sed 's/\"//1' |  cut -d/ -f 9- | sed "s#^#$new_path#g"   # need to change the sed sep to '#' bc path has '/'

/work/sandbox/orbital_slides/linux-2.6.32.67/net/wireless/ibss.c","509","c"


In [7]:
head -n 1 $root_dir/linux_inventory_local_path.csv | sed 's/\"//1' |  cut -d/ -f 9- | sed "s#^#$new_path#g" | sed 's/^/\"/1' # sed sep is '#' bc path has '/'; the last sed puts the leading quote back

"/work/sandbox/orbital_slides/linux-2.6.32.67/net/wireless/ibss.c","509","c"
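
The same rewrite can also be done with one `sed` call that matches the old root directly, instead of counting `/`-separated fields. A sketch, assuming neither path contains `#`, `&`, or other characters that are special to `sed` (the `.` in `Python2.7` matches any character, which is harmless here). `swap_root` and `old_root` are names made up for this example:

```bash
# Sketch: replace one leading root directory with another in the first CSV field.
# "#" is the sed delimiter because both paths contain "/".
swap_root() {
    local old_root="$1" new_root="$2"
    sed "s#^\"${old_root}#\"${new_root}#"
}

old_root="/home/daniel/git/Python2.7/DataScience/notebooks/orbital/"
new_path="/work/sandbox/orbital_slides/"
echo '"/home/daniel/git/Python2.7/DataScience/notebooks/orbital/linux-2.6.32.67/net/wireless/ibss.c","509","c"' |
    swap_root "$old_root" "$new_path"
```

The leading quote is kept inside the pattern, so no extra step is needed to put it back.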


## What Types Of File Are Present ?   Sort by extension

In [26]:
cat $root_dir/linux_inventory_local_path.csv | cut -d, -f 3 | sort | uniq -ic | sort -n | tail -n 5

    857 "txt"
   1080 "S"
   2818 "NONE"
  11638 "h"
  13154 "c"


# GNU Parallel

<img width='300'  src=pics/parallel_orig.png /img>

## Run shell scripts and/or commands (which are really C programs) in parallel from a terminal.

*Documentation*
<url>https://www.gnu.org/software/parallel/parallel_tutorial.html#The-7-predefined-replacement-strings</url>

Good examples of using advanced features of *parallel*
<url>https://www.biostars.org/p/63816/</url>

I used to work with brain imaging, where a common task is to extract the brain from the skull with the Brain Extraction Tool (BET).
BET is a complex monster that uses Naive Bayes and a brain atlas. It runs fairly quickly for one brain, but here we need to run it on 210 separate images.

BACKGROUND:
In order to register the anatomic image (T1 weighted high res) to the lower res BOLD image, the mean of the time series is taken.
But, the single volume frames should first be extracted from the skull, because the skull has few features to match.

* Extract the brain from every (210) BOLD volume
* Motion correct (register the brain volumes to each other)
* Make a mean image for registration (T1 is registered to the BOLD mean and resampled)

Running BET on 210 images in serial takes a long time and is annoying when you are working fast.

## Time Command
<url>http://stackoverflow.com/questions/556405/what-do-real-user-and-sys-mean-in-the-output-of-time1</url>

* Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked.

* User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. 

* Sys is the amount of CPU time spent in the kernel within the process. 
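
The three numbers can be seen side by side with a small sketch. It relies on bash's `TIMEFORMAT` variable to format the output of the `time` keyword; `time_parts` and `burn` are helper names made up for this example, and the exact values will differ from machine to machine:

```bash
# Sketch: run a command and print its real, user and sys times in seconds.
# TIMEFORMAT is the bash variable that formats the output of the `time` keyword.
time_parts() {
    local TIMEFORMAT=$'real %R\nuser %U\nsys %S'
    # the command's own output is discarded; the timing report goes to stdout
    { time "$@" > /dev/null 2>&1; } 2>&1
}

# Mostly blocked: real is about 1 second while user and sys stay near 0.
time_parts sleep 1

# CPU-bound work in two background subshells: user counts both,
# so on a multi-core machine it can be larger than real.
burn() {
    ( n=0; while [ "$n" -lt 100000 ]; do n=$((n + 1)); done ) &
    ( n=0; while [ "$n" -lt 100000 ]; do n=$((n + 1)); done ) &
    wait
}
time_parts burn
```

This is the pattern behind the timings in this notebook, where user plus sys is well above real once several processes run at the same time.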

In [8]:
in_path="/home/daniel/git/Python2.7/DataScience/notebooks/orbital/bold_split/original_split"
out_path="/home/daniel/git/Python2.7/DataScience/notebooks/orbital/bold_split/bet"

time parallel --jobs 8 "bet {} $out_path/{#}vol_bet" :::  $in_path/vol*.nii.gz


real	1m15.517s
user	8m57.207s
sys	0m10.395s


In [10]:
ls $out_path | head -n 3

100vol_bet.nii.gz
101vol_bet.nii.gz
102vol_bet.nii.gz


## Blood oxygenation level dependent (BOLD) image: 
### With Skull and Scalp (Before) and After Extraction AKA Skull Stripping (After)
<img src=pics/bf_and_after_bold_bet_masked.png>

## Run Inventory With Parallel

In [4]:
time find $root_dir/linux-2.6.32.67 -type f | parallel -n 1 --jobs 8 'make_inventory.sh' >> /dev/null


real	1m46.358s
user	4m39.934s
sys	6m7.481s


# Best Use Case: File System Manipulation And Quick Regex

In [None]:
# become root and run 
# free && sync && echo 3 > /proc/sys/vm/drop_caches && free
time find "${root_dir}/linux-2.6.32.67" -maxdepth 2 -mindepth 1 -type f -name "*.c" |  parallel --jobs 8 -n 1 'pcregrep -no "(?sm)if\s*\(.*?\)" /dev/null' | sed 's/\s//g' > /dev/null

# FIFO For Moving Data

* mkfifo \$root_dir/pipe
* cat > \$root_dir/pipe &
    * get pid
     * echo $!
* run inventory with output to pipe
* load inventory output into DB via pipe
* kill cat with pid
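
A minimal sketch of the idea, using a scratch directory from `mktemp` and three made-up rows instead of the real inventory. It leaves out the `cat > pipe &` step from the list above, because one writer and one reader are enough to show the data flow:

```bash
# Hypothetical scratch location; the notebook uses $root_dir/pipe instead.
work_dir=$(mktemp -d)
fifo="$work_dir/pipe"
mkfifo "$fifo"

# Writer: runs in the background and blocks until a reader opens the FIFO.
{
    printf '"path","nlines","ext"\n'
    printf '"a.c","10","c"\n'
    printf '"b.h","3","h"\n'
} > "$fifo" &
writer_pid=$!   # $! holds the PID of the most recent background job

# Reader: consumes the rows as they arrive; the data never lands in a regular file.
rows=$(grep -c '' < "$fifo")
wait "$writer_pid"
echo "rows read from FIFO: $rows"

rm -r "$work_dir"
```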

In [21]:
if [ -p $root_dir/pipe ];then
    rm $root_dir/pipe
fi
   
if [ -f $root_dir/inven.db ];then
    rm $root_dir/inven.db
fi
    
mkfifo $root_dir/pipe
#$cat > $root_dir/pipe &
make_header > $root_dir/pipe &
# using -maxdepth 3 for a shorter experiment
find $root_dir/linux-2.6.32.67 -maxdepth 3 -type f  | parallel --jobs 2 'make_inventory.sh' >> $root_dir/pipe &

[2] 15708


In [22]:
csvsql --db sqlite:///inven.db --insert $root_dir/pipe --table inven 

[1]-  Done                    make_header > $root_dir/pipe
[2]+  Done                    find $root_dir/linux-2.6.32.67 -maxdepth 3 -type f | parallel --jobs 2 'make_inventory.sh' >> $root_dir/pipe


In [24]:
pkill cat
#sql2csv --db sqlite:///inven.db --query "select * from inven;"
sql2csv --db sqlite:///inven.db --query "select ext, count(ext) from inven group by ext order by count(ext) desc limit 5;"

ext,count(ext)
c,4293
h,2541
txt,705
ihex,111
gitignore,33


# Implement Shell Commands In Map Reduce, Via MRJob

<img width='500' src=pics/map_reduce.png /img>

* We've already invested time working with the shell commands, why not use them in a Map Reduce process ?
* Assuming that the inventory is something huge like the Facebook source.

* We could use a sample of the source and extrapolate.
* Even so, an exact total count of a simple metric is still good for convincing people.
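
Before moving to MRJob, the map, shuffle, and reduce stages can be imitated locally with a pipe, in the style of Hadoop streaming: the mapper emits `key<TAB>1`, `sort` plays the shuffle, and the reducer sums per key. This is only a local sketch with made-up rows, not MRJob code; `mapper` and `reducer` are names chosen for the example:

```bash
# Mapper: one "ext<TAB>1" record per inventory row (the header line is skipped).
mapper() {
    tail -n +2 | cut -d, -f3 | tr -d '"' | awk '{ print $1 "\t" 1 }'
}

# Reducer: sum the counts for each key.
reducer() {
    awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }'
}

printf '%s\n' '"path","nlines","ext"' '"a.c","10","c"' '"b.h","3","h"' '"d.c","7","c"' |
    mapper | sort | reducer | sort
```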

In [None]:
cat linux_inventory_local_path.csv | cut -d, -f 1 | xargs -n 1  grep -i "copyright" | sed 's/\W//g' | sed 's/[0-9]//g' | sort | uniq -c | sort -n

In [8]:
cat ~/.mrjob.conf | sed  's/\(aws_.*:\s*\)\(.*$\)/\1******/g'

runners:
  emr:
    aws_access_key_id: ******
    aws_secret_access_key: ******
    ec2_key_pair: Example
    ec2_key_pair_file: ~/.ssh/Example.pem
    python_bin: python2.7
    strict_protocols: true
    bootstrap:
    - sudo python2.7 -m pip install mrjob

  hadoop:
    strict_protocols: true
  inline:
    strict_protocols: true
  local:
    strict_protocols: true


# Helpful Custom Shell Scripts

In [28]:
function pp
# script to pretty print variables in an easy-to-read list

{
    var=$1
    echo $var | sed 's/:/\n/g' | sort | uniq
}

pp $PATH

/bin
/home/daniel/anaconda/bin
/home/daniel/anaconda/envs/py27/bin
/home/daniel/bin
/home/daniel/FSL
/home/daniel/spark-1.5.2-bin-hadoop2.6/bin
/home/daniel/spark-1.5.2-bin-hadoop2.6/sbin
/opt/afni_bin/linux_xorg7_64
/sbin
/usr/bin
/usr/games
/usr/lib/cmtk/bin/
/usr/lib/fsl/5.0
/usr/local/bin
/usr/local/sbin
/usr/sbin


In [25]:
function thou_comma(){
# call the external printf by its full path: an alias defined inside a
# function body does not take effect within that same body
read -r input
/usr/bin/printf "%'d\n" "$input"
}

echo "1234" | thou_comma.sh

1,234


# Links
<img width='500' src=pics/chain.png /img>

* Command line tricks http://www.commandlinefu.com/commands/browse
* fifo for DB http://stackoverflow.com/questions/30688178/maintaining-a-fifo-readable-across-different-executions
* sed grouped example http://unix.stackexchange.com/questions/24140/return-only-the-portion-of-a-line-after-a-matching-pattern
* CSVKIT http://csvkit.readthedocs.org/en/0.9.1/tutorial.html