# Command Line Tools for Data Manipulation


<img src="pics/command_line_fu.png" />




<center><b>Leveraging Linux Shell Programs for Data Science Tasks</b></center>

* Extraction in this field can be difficult: we are not just pulling data from a database.

* Skipping Extraction for now, let's focus on Transformation.

* Storage we will touch on only for small prototyping projects.

<img src="pics/ETL_diagram.png" />

# Shell programs are written in C and run close to the kernel.
<div align="left">
<table style="width:25%">
 <b>Authors</b>
* Ken Thompson
* Lee E. McMahon 
* Richard M. Stallman
     

Are you really going to write better, more robust, faster code than these people?




<b>What Is The Command Line Good For?</b>

* Small data sets
* Rehearsal for Map Reduce



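# Imaginary use case: File Inventory

For now, imagine we want a data file in CSV format with inventory information about
a hard drive: one row per file, holding its path, line count, and extension. The script
below (make_inventory.sh) prints one such row for the file it is given.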


In [None]:
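#!/bin/bash
# make_inventory.sh: print one CSV row ("path","nlines","ext") for the file given as $1.

# Print the number of lines in a file.
function count_lines
{
    local input="$1"
    # The unquoted substitution strips any padding that wc adds.
    echo $(wc -l < "$input")
}

# Print 1 if file(1) reports ASCII text, else 0.
function check_for_ascii
{
    local input="$1"
    # -b omits the file name, so a path containing "ascii" cannot cause a false match.
    if file -b "$input" | grep -qi "ascii"; then
        echo 1
    else
        echo 0
    fi
}

# Print the text after the last dot in the file name, or NONE if there is no dot.
function get_extension
{
    local input="$1"
    local base ext
    base=$(basename "$input")

    if [[ "$base" != *.* ]]; then
        ext="NONE"
    else
        ext=$(echo "$base" | rev | cut -d. -f 1 | rev)
    fi

    echo "${ext}"
}

#### calls here ###
input="$1"

count=$(count_lines "$input")
ascii_bool=$(check_for_ascii "$input")
ext=$(get_extension "$input")

# Only text files get a line count; everything else is recorded with 0 lines.
if [ "$ascii_bool" == 1 ]; then
    printf '"%s","%s","%s"\n' "$input" "$count" "$ext"
else
    printf '"%s","0","%s"\n' "$input" "$ext"
fi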

In [None]:
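# $root is assumed to be the same data directory that later cells call $root_dir
# make a header
printf '"path","nlines","ext"\n' > "$root/linux_inventory.csv"
# run the inventory program on every file (null-delimited so unusual file names survive)
find "$root/linux-2.6.32.67" -type f -print0 | xargs -0 -n 1 "$root/make_inventory.sh" >> "$root/linux_inventory.csv"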

In [29]:
root_dir="/home/daniel/git/Python2.7/DataScience/notebooks/command_line_pres_data"



In [31]:
cat $root_dir/linux_inventory.csv | wc -l

30486


In [25]:
head -n 5 "${root_dir}/linux_inventory.csv" | cut -d/ -f 7- | column -t -s,

"path"                                                          "nlines"  "ext"
command_line_pres_data/linux-2.6.32.67/net/wireless/scan.c"     "1027"    "c"
command_line_pres_data/linux-2.6.32.67/net/wireless/core.h"     "401"     "h"
command_line_pres_data/linux-2.6.32.67/net/wireless/ibss.c"     "509"     "c"
command_line_pres_data/linux-2.6.32.67/net/wireless/nl80211.c"  "4896"    "c"


## What Types Of Files Are Present? Count by extension

In [26]:
cat $root_dir/linux_inventory.csv | cut -d, -f 3 | sort | uniq -ic | sort -n | tail -n 5

    857 "txt"
   1080 "S"
   2818 "NONE"
  11638 "h"
  13154 "c"
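
The pipeline above follows a common counting idiom: `sort` groups identical values, `uniq -c` counts each group, and the final `sort -n` orders the groups by count. One detail worth knowing is that `-i` makes `uniq` compare case-insensitively, so values such as `"S"` and `"s"` are merged whenever they end up adjacent after sorting. A minimal sketch on made-up data (the `sample_ext` helper and its values are hypothetical, not taken from the inventory):

```shell
#!/bin/bash
# Hypothetical extension column, quoted the same way as in the CSV.
sample_ext() {
    printf '%s\n' '"c"' '"S"' '"s"' '"c"' '"h"' '"S"'
}

# Case-folded sort puts "S" and "s" next to each other, so uniq -i merges them.
with_i=$(sample_ext | LC_ALL=C sort -f | uniq -ic | sort -n)

# Byte-order sort with a case-sensitive uniq keeps "S" and "s" as separate groups.
without_i=$(sample_ext | LC_ALL=C sort | uniq -c | sort -n)

echo "case-insensitive groups:"   # 3 groups
echo "$with_i"
echo "case-sensitive groups:"     # 4 groups
echo "$without_i"
```

If assembly files (`.S`) and any lower-case `.s` files should be counted separately, dropping `-i` is the safer choice.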


## Map Reduce Via MRJob

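In the example file below, each line holds two JSON objects side by side. It would be
great if the inner '}' and '{' braces weren't there, so that every line parsed as a
single JSON object.

We could run 'sed' in the terminal for small data, but for really large data we need
Map Reduce.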

In [38]:
head -n 1 $root_dir/map_reduce_sed_example.txt

{"ip": "121.197.179.5", "ua": "bt-ua-1", "ts": 1443552358} {"eventName": "data_test", "action": "bar", "pid": 5193, "geo": "NL", "v": 9174834}


In [39]:
head -n 1 $root_dir/map_reduce_sed_example.txt | sed 's/}/,/1' | sed 's/{//2'

{"ip": "121.197.179.5", "ua": "bt-ua-1", "ts": 1443552358, "eventName": "data_test", "action": "bar", "pid": 5193, "geo": "NL", "v": 9174834}
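
In the command above, the trailing number in each `s` expression selects which match on the line is replaced: `s/}/,/1` turns the first `}` into a comma, and `s/{//2` deletes the second `{`. A minimal sketch on a made-up two-object line (the `record` value is illustrative, not taken from the data file):

```shell
#!/bin/bash
# Hypothetical record with two JSON objects side by side.
record='{"a": 1} {"b": 2}'

# Two sed processes, as in the cell above.
merged=$(printf '%s\n' "$record" | sed 's/}/,/1' | sed 's/{//2')

# The same two expressions in a single sed process; they run in order on each line.
merged_single=$(printf '%s\n' "$record" | sed -e 's/}/,/1' -e 's/{//2')

echo "$merged"          # {"a": 1, "b": 2}
echo "$merged_single"   # {"a": 1, "b": 2}
```

This relies on every line having exactly the two-object shape shown; a `}` or `{` inside a string value of the first object would shift which brace is matched.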


In [None]:
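from mrjob.util import bash_wrap
from mrjob.job import MRJob
from mrjob.step import MRStep
import json

class Process(MRJob):

    # Field to count by. The original never defines self.key, so this
    # class-level default is an assumption; change it to count another field.
    key = "eventName"

    def steps(self):
        return [
                # Step 1: merge the two JSON objects on each line with sed.
                # A shell command is passed as mapper_cmd, not mapper.
                MRStep(mapper_cmd = bash_wrap("sed 's/}/,/1' | sed 's/{//2'")),
                # Step 2: emit one count per record, then sum the counts per key.
                MRStep(mapper = self.mapper,
                       reducer = self.reducer)
               ]

    def mapper(self, _, line):
        dict_ = json.loads(line)
        if self.key == "eventName":
            # Count each (eventName, action) pair as one key.
            string = "%s %s" %(dict_[self.key], dict_['action'])
            yield string, 1
        else:
            yield dict_[self.key], 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == "__main__":
    Process.run()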
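
Before submitting a job like the one above, the same logic can be rehearsed on a few lines in the shell: the `sed` calls play the first mapper, a key-extraction step plays the second mapper, `sort` stands in for the shuffle, and `uniq -c` for the reducer. This is only a sketch under stated assumptions: the `sample_events` records are made up, and the `sed -E` extraction is a brittle stand-in for `json.loads` that assumes `"action"` directly follows `"eventName"`, as in the sample record shown earlier.

```shell
#!/bin/bash
# Hypothetical records in the same two-object layout as the example file.
sample_events() {
    printf '%s\n' \
        '{"ip": "10.0.0.1", "ts": 1} {"eventName": "data_test", "action": "bar", "pid": 1}' \
        '{"ip": "10.0.0.2", "ts": 2} {"eventName": "data_test", "action": "bar", "pid": 2}' \
        '{"ip": "10.0.0.3", "ts": 3} {"eventName": "data_test", "action": "foo", "pid": 3}'
}

counts=$(sample_events \
    | sed 's/}/,/1' | sed 's/{//2' \
    | sed -E 's/.*"eventName": "([^"]*)", "action": "([^"]*)".*/\1 \2/' \
    | LC_ALL=C sort \
    | uniq -c)

# One line per "eventName action" key, prefixed by its count.
echo "$counts"
```

If the counts look right on a small sample, the pipeline has exercised the same key and aggregation choices that the MRJob steps make at scale.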