# Command Line Tools for Data Manipulation


<img src="pics/command_line_fu.png" />




<center><b> Leveraging Linux Shell Programs for Data Science Tasks</b></center>

* Extraction in this field can be difficult. We are not just pulling data from a database...

* Skipping Extraction for now, let's focus on Transformation.

* Storage we will touch on only for small prototyping projects.

<img src="pics/ETL_diagram.png" />

# Shell programs are written in C and run close to the kernel
<b>Authors</b>

* Ken Thompson
* Lee E. McMahon
* Richard M. Stallman

Are you really going to write better, more robust, faster code than these people??




<b>What Is The Command Line Good For?</b>

* Small data sets



# Imaginary use case: File Inventory

For now, imagine we have a data file in CSV format with inventory information about
a hard-drive.

## basic inventory
* number of files
* size of volume
* number of files by file-extension
* lines of code in the files
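
The first two items need no custom script. A minimal sketch, assuming a throwaway directory created with `mktemp`; the file names inside it are made up for illustration:

```shell
#!/bin/bash
# Build a small throwaway tree so the commands have something to count.
# (All names below are hypothetical sample data.)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/src" "$demo_dir/docs"
printf 'int main(void) { return 0; }\n' > "$demo_dir/src/main.c"
printf '#define N 1\n' > "$demo_dir/src/main.h"
printf 'notes\n' > "$demo_dir/docs/README"

# Number of files: find prints one line per regular file, wc counts the lines.
file_count=$(find "$demo_dir" -type f | wc -l | tr -d ' ')
echo "files: $file_count"

# Size of volume: du prints the size, a tab, and the path; keep only the size.
volume_size=$(du -sh "$demo_dir" | cut -f 1)
echo "size: $volume_size"

rm -r "$demo_dir"
```

The per-extension counts and line counts need a look inside each file name and file body, which is what the inventory script is for.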

In [None]:
#!/bin/bash
# make_inventory.sh: print one CSV row ("path","nlines","ext") for a single file.

# Count the lines in a file.
function count_lines
{
    local input="$1"
    wc -l < "$input"
}

# Print 1 if file(1) describes the input as ASCII, 0 otherwise.
function check_for_ascii
{
    local input="$1"
    if file "$input" | grep -qi "ascii"; then
        echo 1
    else
        echo 0
    fi
}

# Print the text after the last dot of the base name, or NONE if there is no dot.
function get_extension
{
    local input="$1"
    local base ext
    base=$(basename "$input")

    if [[ "$base" == *.* ]]; then
        ext="${base##*.}"
    else
        ext="NONE"
    fi

    echo "$ext"
}

#### calls here ###
input="$1"

count=$(count_lines "$input")
ascii_bool=$(check_for_ascii "$input")
ext=$(get_extension "$input")

# Only ASCII files get a line count; everything else is reported with 0 lines.
if [ "$ascii_bool" -eq 1 ]; then
    printf '"%s","%s","%s"\n' "$input" "$count" "$ext"
else
    printf '"%s","0","%s"\n' "$input" "$ext"
fi

In [2]:
root_dir="/home/daniel/git/Python2.7/DataScience/notebooks/orbital"
export PATH=$PWD:$PATH



In [2]:
du -h -c linux-2.6.32.67 | tail -n 1

418M	total


In [17]:
#make a header
printf "\"path\",\"nlines\",\"ext\"\n" > $root_dir/linux_inventory.csv
# run inventory program on files 
time find $root_dir/linux-2.6.32.67 -type f | xargs -n 1 $root_dir/make_inventory.sh >> $root_dir/linux_inventory_local_path.csv


real	5m37.544s
user	2m18.962s
sys	4m50.221s


In [29]:
cat $root_dir/linux_inventory_local_path.csv | wc -l | thou_comma.sh

30,485
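
`thou_comma.sh` is a custom script that these notes do not show. A hypothetical stand-in, assuming all it has to do is add thousands separators to non-negative integers read from standard input:

```shell
#!/bin/bash
# Hypothetical stand-in for thou_comma.sh (the original script is not shown).
thou_comma() {
    awk '{
        digits = $1 ""      # force string handling of the first field
        grouped = ""
        # Peel off three digits at a time from the right.
        while (length(digits) > 3) {
            grouped = "," substr(digits, length(digits) - 2) grouped
            digits = substr(digits, 1, length(digits) - 3)
        }
        print digits grouped
    }'
}

formatted=$(echo 30485 | thou_comma)
echo "$formatted"   # prints 30,485
```

Reading the first field with `$1` also tolerates the leading spaces that some `wc` implementations print.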


In [38]:
head -n 5 "${root_dir}/linux_inventory_local_path.csv" | cut -d/ -f 7- | column -t -s,

notebooks/orbital/linux-2.6.32.67/net/wireless/ibss.c"                 "509"   "c"
notebooks/orbital/linux-2.6.32.67/net/wireless/scan.c"                 "1027"  "c"
notebooks/orbital/linux-2.6.32.67/net/wireless/lib80211_crypt_tkip.c"  "788"   "c"
notebooks/orbital/linux-2.6.32.67/net/wireless/util.c"                 "717"   "c"
notebooks/orbital/linux-2.6.32.67/net/wireless/nl80211.c"              "4896"  "c"


What if we want to change the root path in the inventory file?

In [14]:
head -n 1 $root_dir/linux_inventory_local_path.csv

"/home/daniel/git/Python2.7/DataScience/notebooks/orbital/linux-2.6.32.67/net/wireless/ibss.c","509","c"


In [3]:
head -n 1 $root_dir/linux_inventory_local_path.csv | cut -d, -f 1 | sed 's/\"//g' | cut -d/ -f 9-

linux-2.6.32.67/net/wireless/ibss.c


In [5]:
new_path="/work/sandbox/orbital_slides/"
head -n 1 $root_dir/linux_inventory_local_path.csv | sed 's/\"//1' |  cut -d/ -f 9- | sed "s#^#$new_path#g"   # need to change the sed sep to '#' bc path has '/'

/work/sandbox/orbital_slides/linux-2.6.32.67/net/wireless/ibss.c","509","c"
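
The steps in this walkthrough can also be collapsed into one `sed` call that swaps the prefix and keeps the opening quote. This is a sketch, not the method used in these notes: `old_root` is `$root_dir` with a trailing slash, the two sample rows are copied from the inventory output, and it assumes no path contains a `#`.

```shell
#!/bin/bash
old_root="/home/daniel/git/Python2.7/DataScience/notebooks/orbital/"
new_path="/work/sandbox/orbital_slides/"

# Two sample rows in the same format as linux_inventory_local_path.csv.
sample_csv=$(mktemp)
printf '"%slinux-2.6.32.67/net/wireless/ibss.c","509","c"\n' "$old_root" > "$sample_csv"
printf '"%slinux-2.6.32.67/net/wireless/scan.c","1027","c"\n' "$old_root" >> "$sample_csv"

# Replace the old prefix right after the opening quote.
# '#' is the sed separator because the paths contain '/'.
rewritten=$(sed "s#^\"${old_root}#\"${new_path}#" "$sample_csv")
echo "$rewritten"

rm "$sample_csv"
```

Because the whole file goes through `sed` once, there is no need to strip and re-add the quote or to count `/`-separated fields for `cut`.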


In [7]:
head -n 1 $root_dir/linux_inventory_local_path.csv | sed 's/\"//1' |  cut -d/ -f 9- | sed "s#^#$new_path#g" | sed 's/^/\"/1' # the last sed puts the opening quote back; the sed sep is '#' bc path has '/'

"/work/sandbox/orbital_slides/linux-2.6.32.67/net/wireless/ibss.c","509","c"


## What Types Of Files Are Present? Sort By Extension

In [36]:
cat $root_dir/linux_inventory.csv | cut -d, -f 3 | sort | uniq -ic | sort -n | tail -n 5

    857 "txt"
   1080 "S"
   2818 "NONE"
  11638 "h"
  13154 "c"


## GNU Parallel

Run shell scripts and/or commands (which are really C programs) in parallel from a terminal.

*Documentation*
<url>https://www.gnu.org/software/parallel/parallel_tutorial.html#The-7-predefined-replacement-strings</url>

Good examples of using advanced features of *parallel*
<url>https://www.biostars.org/p/63816/</url>

I used to work with brain imaging, where a common task is to extract the brain from the skull with the Brain Extraction Tool (BET).
BET is a complex monster that fits a deformable surface model to the brain. It runs fairly quickly for one brain, but here we need to run it on 210 separate images.

BACKGROUND:
To register the anatomical image (T1-weighted, high resolution) to the lower-resolution BOLD image, the mean of the BOLD time series is used as the registration target.
First, though, the brain should be extracted from the skull in each single-volume frame, because the skull has few features to match.

* Extract the brain from each of the 210 BOLD volumes
* Motion correct (register the brain volumes to each other)
* Make a mean image for registration (the T1 is registered to the BOLD mean and resampled)

Running BET on 210 images in serial takes a long time and is annoying when you are working fast.

## Time Command
<url>http://stackoverflow.com/questions/556405/what-do-real-user-and-sys-mean-in-the-output-of-time1</url>

* Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).

* User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.

* Sys is the amount of CPU time spent in the kernel within the process. This means CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like 'user', this is only CPU time used by the process.
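
The three numbers are easiest to tell apart on two contrasting commands. A minimal sketch, assuming bash (the `time` keyword and the `TIMEFORMAT` variable are bash features); the variable names are made up for illustration:

```shell
#!/bin/bash
# TIMEFORMAT controls what the bash `time` keyword prints (values in seconds).
TIMEFORMAT='real=%R user=%U sys=%S'

# sleep waits without using the CPU: real is about 1 s, user and sys stay near 0.
sleep_timing=$( { time sleep 1; } 2>&1 )
echo "sleep: $sleep_timing"

# A busy loop runs in user mode: most of its real time shows up under user.
loop_timing=$( { time for i in $(seq 1 200000); do :; done; } 2>&1 )
echo "loop:  $loop_timing"
```

User and sys are summed over every process and core involved, which is why the 8-job `parallel` runs in these notes report more user time than real time.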


In [8]:
in_path="/home/daniel/git/Python2.7/DataScience/notebooks/orbital/bold_split/original_split"
out_path="/home/daniel/git/Python2.7/DataScience/notebooks/orbital/bold_split/bet"

time parallel --jobs 8 "bet {} $out_path/{#}vol_bet" :::  $in_path/vol*.nii.gz


real	1m15.517s
user	8m57.207s
sys	0m10.395s


In [10]:
ls $out_path | head -n 3

100vol_bet.nii.gz
101vol_bet.nii.gz
102vol_bet.nii.gz


## Blood oxygenation level dependent (BOLD) image: 
### With Skull and Scalp (Before) and After Extraction AKA Skull Stripping (After)
<img src=pics/bf_and_after_bold_bet_masked.png>

## Run Inventory With Parallel

In [4]:
time find $root_dir/linux-2.6.32.67 -type f | parallel -n 1 --jobs 8 'make_inventory.sh' >> /dev/null


real	1m46.358s
user	4m39.934s
sys	6m7.481s
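
Where GNU Parallel is not installed, `xargs -P` is a simpler substitute (a different tool, without Parallel's replacement strings or output grouping). A minimal sketch, assuming an `xargs` that supports `-0` and `-P` (GNU and BSD both do) and using `wc -l` as a stand-in for `make_inventory.sh`:

```shell
#!/bin/bash
# Hypothetical sample files; the space in one name is deliberate.
demo_dir=$(mktemp -d)
printf 'a\nb\n' > "$demo_dir/two lines.txt"
printf 'a\n' > "$demo_dir/one.txt"

# -print0 and -0 pass NUL-separated names so spaces survive;
# -n 1 gives each job one file; -P 4 runs up to 4 jobs at once.
parallel_rows=$(find "$demo_dir" -type f -print0 | xargs -0 -n 1 -P 4 wc -l | sort -n)
echo "$parallel_rows"

row_count=$(printf '%s\n' "$parallel_rows" | wc -l | tr -d ' ')
rm -r "$demo_dir"
```

The NUL-separated hand-off is worth copying even with `parallel` (its `-0` option does the same job), because a plain newline-separated `find` listing breaks on unusual file names.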


# Best Use Case: File System Manipulation And Quick Regex

In [None]:
# to time with a cold page cache, first become root and run
# free && sync && echo 3 > /proc/sys/vm/drop_caches && free
time find "${root_dir}/linux-2.6.32.67" -maxdepth 2 -mindepth 1 -type f -name "*.c" |  parallel --jobs 8 -n 1 'pcregrep -no "(?sm)if\s*\(.*?\)" /dev/null' | sed 's/\s//g' > /dev/null

# FIFO For Moving Data
This is my question on SO and the answer is nice: <url> http://stackoverflow.com/questions/30688178/maintaining-a-fifo-readable-across-different-executions </url>


* mkfifo \$root_dir/pipe
* cat > \$root_dir/pipe &
    * get pid
* run inventory with output to pipe
* kill cat with pid

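NOTE: Terminate cat (*kill -HUP $pid*) once the inventory has been written. The MySQL load reading the pipe only sees end-of-file, and so only finishes, when the last writer has closed it.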

In [9]:
mkfifo "$root_dir/pipe"
# run cat in the background so that $! holds its PID
cat > "$root_dir/pipe" &
pid=$!
echo $pid
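
The end-of-file behaviour can be tried without MySQL. In this sketch a write descriptor held open with `exec` plays the role of the background `cat`, and `wc -l` stands in for the database load; the paths and sample rows are made up for illustration:

```shell
#!/bin/bash
workdir=$(mktemp -d)
fifo="$workdir/pipe"
mkfifo "$fifo"

# Reader in the background: counts the rows that arrive on the FIFO.
wc -l < "$fifo" > "$workdir/count.txt" &
reader_pid=$!

# Hold the write end open on descriptor 3 across several separate writes.
exec 3> "$fifo"
printf '"a.c","10","c"\n' >&3
printf '"b.h","5","h"\n' >&3

# Closing the last writer is what lets the reader see end-of-file and finish.
exec 3>&-
wait "$reader_pid"

row_count=$(tr -d ' ' < "$workdir/count.txt")
echo "$row_count"   # prints 2
rm -r "$workdir"
```

Without the held-open descriptor, the reader would see end-of-file as soon as the first `printf` closed the pipe and would miss the second row.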




In [None]:
mysqladmin --user=root --password=test create LinuxKernel
csvsql --db "mysql://root:test@127.0.0.1/LinuxKernel" --tables "inventory" --insert "${root_dir}/pipe"
# or load the finished file instead: "${root_dir}/linux_inventory.csv"

### Check that it worked

The notebook isn't expanding the alias the way I expected (bash only expands aliases in interactive shells unless `shopt -s expand_aliases` is set), so I had to type out the full command with user and password.
In normal practice you'd define an alias for it.

In [28]:
alias mysql='mysql --user=root --password=test'
mysql -e "SELECT ext, COUNT(ext) FROM inventory GROUP BY ext ORDER BY COUNT(ext) DESC LIMIT 5;" LinuxKernel

+------+------------+
| ext  | COUNT(ext) |
+------+------------+
| c    |      13154 |
| h    |      11639 |
| S    |       1080 |
| txt  |        857 |
| dts  |        115 |
+------+------------+


# Working With A Database (for small data exploration)

I keep MySQL and MongoDB on my laptop for various experiments and local programming, but it's also great to lean on SQLite for small command-line-driven exploratory exercises.

This way I can easily version and share the complete workflow.

Python and R have database connection layers, but sometimes you just want a csv without any additional platform.


In [5]:
query="select ext, count(ext) from inventory group by ext order by count(ext);"
sql2csv --db mysql://root:test@127.0.0.1/LinuxKernel --query "${query}" | tail -n 5 | column -t -s, 

dts  115
txt  857
S    1080
h    11639
c    13154


In [34]:
# paste:   -s  --serial    -d --delimiter
sql2csv --db mysql://root:test@127.0.0.1/LinuxKernel --query "${query}" | tail -n 5 | cut -d, -f 2 | paste -s -d+ | bc


26845
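
The `paste -s -d+ | bc` idiom works by gluing the column into a single arithmetic expression and letting `bc` evaluate it. A minimal sketch that shows the intermediate expression, using the five counts from the query output and `awk` as an alternative for machines without `bc`:

```shell
#!/bin/bash
# The five counts from the query output, one per line.
counts=$(printf '115\n857\n1080\n11639\n13154\n')

# paste -s joins all lines into one; -d+ puts a plus sign between them.
expression=$(printf '%s\n' "$counts" | paste -s -d+ -)
echo "$expression"   # prints 115+857+1080+11639+13154

# awk keeps a running sum and prints it after the last line.
total=$(printf '%s\n' "$counts" | awk '{ sum += $1 } END { print sum }')
echo "$total"        # prints 26845
```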


# Map Reduce Via MRJob and Shell Commands

In [8]:
cat ~/.mrjob.conf | sed  's/\(aws_.*:\s*\)\(.*$\)/\1******/g'

runners:
  emr:
    aws_access_key_id: ******
    aws_secret_access_key: ******
    ec2_key_pair: Example
    ec2_key_pair_file: ~/.ssh/Example.pem
    python_bin: python2.7
    strict_protocols: true
    bootstrap:
    - sudo python2.7 -m pip install mrjob

  hadoop:
    strict_protocols: true
  inline:
    strict_protocols: true
  local:
    strict_protocols: true


# Helpful Custom Shell Scripts

In [28]:
# pretty-print a colon-separated variable (such as PATH) as a sorted, de-duplicated list
function pp
{
    local var="$1"
    echo "$var" | tr ':' '\n' | sort -u
}

pp "$PATH"

/bin
/home/daniel/anaconda/bin
/home/daniel/anaconda/envs/py27/bin
/home/daniel/bin
/home/daniel/FSL
/home/daniel/spark-1.5.2-bin-hadoop2.6/bin
/home/daniel/spark-1.5.2-bin-hadoop2.6/sbin
/opt/afni_bin/linux_xorg7_64
/sbin
/usr/bin
/usr/games
/usr/lib/cmtk/bin/
/usr/lib/fsl/5.0
/usr/local/bin
/usr/local/sbin
/usr/sbin


# Links

* http://www.commandlinefu.com/commands/browse