# Command Tools 

IPython running in a terminal is about the best way to work with data interactively. Loading a CSV into a Pandas DataFrame or just a NumPy array is very powerful. However, there are times when a handful of shell programs can either save the day or just make the day more productive.

Primarily, I've found the program <i>find</i> to be extremely helpful, and combining it with <i>grep / pcregrep</i> can be enough to get a report out. 

I also find myself loading data into a MySQL table as the size of the test data grows. This has the advantage of being a little easier to share with coworkers too.

Csvkit was brought to my attention by the O'Reilly book, <u>Data Science on the Command Line</u> and I've come to really like it for small scale work. Pandas can also push and pull data into and out of tables. 

There are times when keeping my preliminary preprocessing steps completely shell driven is convenient, especially when combined with a makefile. I've been exploring Drake, since it can incorporate native Python and I think that will be the future. But GNU Make has been around a long time and is typically installed on any machine you're likely to come across, so I've been trying to use it more and appreciate it before making the switch to Drake.





## Programs you'll need to install
* csvkit by onyxfish, install with pip or conda
* pcregrep, on Debian: sudo apt-get install pcregrep
* parallel, in the Debian repo: sudo apt-get install parallel

## Other
I also contrast the shell approach with Python and Pandas. You can skip those cells if you don't already have the following installed.

* MySQL
* Pandas

## Links
* regular expression: http://www.rexegg.com/
* freeing memory: http://unix.stackexchange.com/questions/87908/how-do-you-empty-the-buffers-and-cache-on-a-linux-system
* csvkit: https://csvkit.readthedocs.org/en/0.9.1/
* stackoverflow on "find" program, http://stackoverflow.com/questions/1489277/how-to-use-prune-option-of-find-in-sh


# How to use this notebook
The IPython notebook is great for Python, and there's some support for Bash and SQL. However, it's not perfect yet, and you will have to modify some lines.

The cells that run bash commands have the bash magic, %%bash, at the top, followed by a shell variable (root or root_dir) that holds a path.

That path is specific to my personal computer and will not work for you. You will need to change **every** occurrence to the root path you downloaded the data set to.

In the future I'll use this BASH kernel for ipython notebooks:
https://github.com/takluyver/bash_kernel

The above kernel would allow me to export any shell variables once. 
I also like this SQL magic provided here: 
https://github.com/catherinedevlin/ipython-sql

This notebook doesn't explain the individual commands in great detail. I am assuming that you either know them or will research them on your own as you work through the cells. I have added limited notes about the commands you see, mainly to explain things that are not obvious at first glance, but I still assume you have googled the command or read its man page.

# Man page
If you are very new to the command line you might not know how the man pages work.
The manual opens in a pager (usually _less_) with vi-like key bindings, so navigation might be difficult at first.
Pressing _h_ while in the man pages will bring up a help file.

Here is a summary of a few commands for the man page viewer:
* q to quit
* / to search
* n for the next match
* N for the previous match
* h  man page navigation help
* j scroll down one line
* k scroll up one line

# Things I don't cover here
Quoting is tricky when dealing with the shell. The shell has to interpret (or not) everything passed on the command line. That means if there are spaces or special tokens in the strings you are passing around, you'll have to escape them or wrap the strings in quotes, double or single.

This is a topic in itself. I didn't want to get sidetracked here, so please read about it. In fact you'll have to,
because you'll end up banging your head against a wall sooner rather than later because of it.

A nice workaround is to write the intermediate steps of a shell script to a file, then read the file back. Since the strings are not passed through the shell interpreter on the command line, they do not need special escaping or treatment. There's nothing wrong with this, and it can make things clearer to others than obscure escape sequences would.

I made a point to use only _nice_ file names here. Windows users have a habit of putting spaces in file names, which is a huge PITA.

## In brief:
1. Single quotes, ' , are for string literals with no special actions due to tokens
2. Double quotes, " , allow the expansion of shell variables and parameters, \$FOO, and \\ escape characters
3. Back ticks, ` , are completely different and are used to evaluate a command, although I prefer the \$(cmd) syntax
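To see the three rules side by side, here is a minimal sketch. The variable FOO and the helper count_args are made up for the illustration; the last two assignments show why an unquoted string with a space in it causes trouble.

```shell
# FOO and count_args are hypothetical names used only in this sketch
FOO="two words"

single='$FOO is not expanded'            # single quotes: taken literally
double="$FOO is expanded"                # double quotes: the variable is substituted
backtick=`echo "$FOO" | cut -d' ' -f1`   # back ticks: run a command, keep its output
dollar=$(echo "$FOO" | cut -d' ' -f1)    # $(cmd): same idea, easier to read and nest

count_args() { echo $#; }
unquoted=$(count_args $FOO)              # the shell splits on the space: 2 arguments
quoted=$(count_args "$FOO")              # quoted, it stays a single argument

echo "$single"
echo "$double"
echo "$backtick $dollar $unquoted $quoted"
```

A file name with a space in it behaves exactly like FOO here, which is why the quoted form is the safer habit.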

## Imaginary use case

I'm given a hard drive full of source code and told that we will be doing a dependency analysis for some particular C libraries. We don't know yet exactly what we are looking for. That may be revealed on a client call, or more information will be passed down. For now we have the directories, so what can we do with them quickly?

For this exercise I'm using the Linux Kernel 2.6 because it's sufficiently large and free to download.

#### The first thing I'd do is inventory what we have been given and prepare a basic report about it.

Below is a simple inventory shell script that gets the line count of each file (including blanks and comments, so not a really good measure of code size)
and the file extension. This is more advanced than I plan to cover in this notebook, so for now just accept it as is if it's all new to you. 

### The point is, for this exercise we need an inventory.

This code goes into a file named make_inventory.sh

In [1]:
%%bash
# the %%bash magic is here for syntax highlighting only

#!/bin/bash

# print the number of lines in the file (blanks and comments included)
function count_lines
{
    wc -l < "$1"
}

# print 1 if file(1) describes the input as ASCII text, otherwise 0
function check_for_ascii
{
    if file "$1" | grep -qi "ascii"; then
        echo 1
    else
        echo 0
    fi
}

# print whatever follows the last dot in the file name, or NONE if there is no dot
function get_extension
{
    base=$(basename "$1")

    if [[ "$base" == *.* ]]; then
        echo "${base##*.}"
    else
        echo "NONE"
    fi
}

#### calls here ###
input="$1"

count=$(count_lines "$input")
ascii_bool=$(check_for_ascii "$input")
ext=$(get_extension "$input")

# files that are not ASCII text are reported with a line count of 0
if [ "$ascii_bool" -eq 1 ]; then
    printf '"%s","%s","%s"\n' "$input" "$count" "$ext"
else
    printf '"%s","0","%s"\n' "$input" "$ext"
fi


# Make an inventory
The _find_ program is tricky to learn because it has many options. I plan to dive more into _find_, but for now, I really just want an inventory.csv to play with. 

I give _find_ the path to search, which is our Linux Kernel directory, then I tell it to return only entries of type "f", which means regular files. No directories. 

I then use a pipe '|' to pass the output into the input of the next program, 'make\_inventory.sh'. We need _xargs_ with -n 1 so that 'make\_inventory.sh' is called once per path, with that path as its argument. _Xargs_ is another topic to learn about, so for now, just trust me. 
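As a small sketch of the same find-into-xargs pattern, the snippet below builds a throwaway directory (the file names are invented) and uses _wc -l_ as a stand-in for the inventory script.

```shell
# build a tiny throwaway tree to search; all names here are made up
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/src/sub"
printf 'a\nb\n' > "$demo_dir/src/one.c"
printf 'a\n'    > "$demo_dir/src/sub/two.h"

# -type f returns only files, so the directories "src" and "sub" are skipped;
# xargs -n 1 starts wc once per path instead of once for the whole list
listing=$(find "$demo_dir/src" -type f | sort | xargs -n 1 wc -l)
echo "$listing"

n_lines=$(echo "$listing" | grep -c '')
rm -rf "$demo_dir"
```

Because _wc_ is started once per file, there is one output line per file and no "total" line. Without -n 1 it would receive both paths in a single call and add a total.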

In [30]:
%%bash
root="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"
#make a header
printf '"path","nlines","ext"\n' > "$root/linux_inventory.csv"
find "$root/linux-2.6.32.67" -type f | xargs -n 1 "$root/make_inventory.sh" >> "$root/linux_inventory.csv"

The > is for stream redirection. The output stream from _printf_ is sent to a new file. This will clobber an existing file, so practice with it. The second use is slightly different: >> is an appending stream redirection.

Since we don't want to clobber the linux_inventory.csv, we use the append operation for the actual data.
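The difference is easy to try on a scratch file. This is only a sketch; the rows are invented and the temporary file stands in for the real CSV.

```shell
out=$(mktemp)                                 # scratch file, not the real inventory

printf '"path","nlines","ext"\n' >  "$out"    # > creates or truncates the file
printf '"a.c","10","c"\n'        >> "$out"    # >> adds to the end
printf '"b.h","3","h"\n'         >> "$out"
after_append=$(grep -c '' "$out")             # header plus two rows

printf '"oops"\n' > "$out"                    # a second > wipes what was there
after_clobber=$(grep -c '' "$out")            # only the new line is left

echo "$after_append $after_clobber"
rm -f "$out"
```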

In [31]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

head -n 5 "${root_dir}/linux_inventory.csv"

"path","nlines","ext"
"/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/linux-2.6.32.67/.gitignore","77","gitignore"
"/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/linux-2.6.32.67/scripts/.gitignore","10","gitignore"
"/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/linux-2.6.32.67/scripts/module-common.lds","8","lds"
"/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/linux-2.6.32.67/scripts/checkkconfigsymbols.sh","59","sh"


The pipe, |, takes the output from one process and supplies it as input to the next program.
You will get different output from the two lines below. The first line tells _grep_ to search the output stream of the _find_ command, that is, the list of paths. The second tells _grep_ to search the contents of the files whose paths _find_ prints. 

If you don't know what _head_ does or how it works, try its man page. Look up the option _n_ .

In [None]:
%%bash
root="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

find $root/linux-2.6.32.67 -type f | grep "2.6"
#
find $root/linux-2.6.32.67 -type f | xargs grep "2.6"
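A self-contained sketch of that difference, using two invented files: one has "2.6" in its path, the other has it in its contents. I use _grep -F_ here so the dot is taken literally; in the cell above the dot in "2.6" is a regular expression and matches any character.

```shell
# throwaway directory; the directory and file names are made up
demo_dir=$(mktemp -d)
mkdir "$demo_dir/v2.6"
echo "nothing to see"   > "$demo_dir/v2.6/notes.txt"
echo "kernel 2.6 notes" > "$demo_dir/readme.txt"

# grep reads the list of paths, so it reports the path that contains 2.6
by_name=$(find "$demo_dir" -type f | grep -F "2.6")

# xargs hands the paths to grep as arguments, so grep searches file contents
by_content=$(find "$demo_dir" -type f | xargs grep -F "2.6")

echo "$by_name"
echo "$by_content"
rm -rf "$demo_dir"
```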

# Let's make a summary of the kinds of files we were given by extension.

The next four cells illustrate how one can use <i>cat, cut, sort and uniq</i> to get a preliminary summary of this inventory.

I broke out the steps so that you can follow along. There may be shortcuts and other arguments I didn't use. Sometimes it doesn't matter if you do something the correct way, just that you get it done quickly and you are confident that you know what you did. This type of prototyping is great, because you can see what's happening.

### The program <i> cut </i> is super helpful. We can parse many output streams with <i>cut</i>.
If the format of a stream is tabular, then <i>awk</i> may be the best tool, but <i>awk</i> is a whole animal unto itself, and most days I rely on <i>cut</i> and then step into IPython to use Pandas or just NumPy.

The <i>-d,</i> option sets the delimiter to a comma and <i>-f 3</i> says to use the 3rd field.

In [32]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"
cat $root_dir/linux_inventory.csv | cut -d, -f 3 | tail -n 5

"c"
"c"
"c"
"c"
"NONE"


In [13]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"
cat "${root_dir}/linux_inventory.csv" | cut -d, -f 3 | sort  | tail -n  5


"y"
"y"
"y"
"y"
"ymfsb"


In [63]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"
cat "${root_dir}/linux_inventory.csv" | cut -d, -f 3 | sort | uniq -c | tail -n 5

      1 "x86"
    105 "xml"
      6 "xsl"
      5 "y"
      1 "ymfsb"


In [33]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"
cat $root_dir/linux_inventory.csv | cut -d, -f 3 | sort | uniq -ic | sort -n | tail -n 5

    849 "txt"
   1080 "S"
   2391 "NONE"
  11622 "h"
  13147 "c"


## Putting this command line data into a database.
* Pandas:  df.to_sql()
* csvkit: csvsql

### Both Pandas and csvkit use SQLAlchemy to handle the connection to the DB. 

In [42]:
%%bash 
root="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"
# I'm no expert at SQLAlchemy, but learning the connection strings is quite important
# The form is "mysql://<log-in>:<passwd>@<ip>/<db-name>". I'm running MySQL locally; localhost has IP 127.0.0.1
csvsql --db "mysql://root:test@127.0.0.1/LinuxKernel" --table "inventory" --insert "${root}/linux_inventory.csv"




### Check that it worked

The notebook isn't using the alias the way I expected so I had to type out the full command with user and password.
In normal practice you'd make an alias as in the comments.

In [2]:
%%bash
#alias mysql='mysql --user=root --password=test'
#mysql -e "SELECT * FROM inventory LIMIT 5;" LinuxKernel

mysql --user=root --password=test -e "SELECT * FROM inventory LIMIT 5;" LinuxKernel


path	isText	ext	type	size	sloc	comments	blank	tot_lines
linux-2.6.32.67/arch/h8300/mm/fault.c	1	c	C source ASCII text	1441	23	28	6	57
linux-2.6.32.67/arch/h8300/mm/kmap.c	1	c	C source ASCII text	1276	27	25	8	60
linux-2.6.32.67/arch/h8300/mm/Makefile	1	NULL	ASCII text	114	0	0	0	5
linux-2.6.32.67/arch/h8300/mm/init.c	1	c	C source ASCII text	5463	120	56	27	201
linux-2.6.32.67/arch/h8300/mm/memory.c	1	c	C source ASCII text	1099	26	21	9	56


In [57]:
%%bash
sql2csv --db mysql://root:test@127.0.0.1/LinuxKernel --query "select ext, count(ext) from inventory group by ext order by count(ext);" | tail -n 5

dts,115
txt,849
S,1080
h,11623
c,13147


### The numbers check out.

#### Let's explore <i>awk</i> a little. The easiest thing to make it do (and perhaps the most useful) is to print columns or rows from a file stream.

<i>awk</i>
* FS field separator set to comma
* $3 is the 3rd column

<i>grep</i>
* -E extended regular expression
* -c count matching lines (grep operates per line, we use pcregrep for multiple line searches)

There is a way to do the whole regex and count in <i>awk</i>; I just didn't want to get into it. I'm still learning <i>awk</i> myself, and I haven't decided on its usefulness over other tools. If I'm already in a database then forget it... I mostly want to show the use of <i>grep</i> and <i>awk</i> for column parsing (rather than cut).

In [46]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

cat "${root_dir}/linux_inventory.csv"  | awk  'FS=","; {print $3}' | grep -cE "txt"

1699
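One thing to watch in the cell above: written as 'FS=","; {print $3}', the assignment acts as an _awk_ pattern that is always true, so every input line is printed in full in addition to a third field. That inflates a count taken with _grep -c_. A minimal sketch of the more usual form, on invented rows, with the separator set through the -F option:

```shell
# made-up rows in the same layout as linux_inventory.csv
rows='"notes.txt","4","txt"
"main.c","10","c"
"txt_tools.h","3","h"'

# -F, sets the field separator before the first line is read
third=$(echo "$rows" | awk -F, '{print $3}')
txt_count=$(echo "$rows" | awk -F, '{print $3}' | grep -c "txt")

# the form from the cell above emits two output lines per input line
doubled=$(echo "$rows" | awk 'FS=","; {print $3}' | grep -c '')

echo "$third"
echo "$txt_count $doubled"
```

With -F, only the extension column reaches _grep_, so the path "txt_tools.h" is not counted as a txt file.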


In [21]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_data"

sql2csv --db mysql://root:test@127.0.0.1/LinuxKernel --query "select ext, count(ext) from inventory where ext='txt';"

ext,count(ext)
txt,2491


# Break down the commands

In [None]:
# do not run
find "${root_dir}/linux-2.6.32.67" -maxdepth 2 -mindepth 1 -type f -iname "*.c" | \
    xargs -n 1 pcregrep -no "(?sim)[a-z]+\w*\(.*?\)" /dev/null | \
    sed 's/\s//g' | \
    tail -n 10 

## find "${root_dir}/linux-2.6.32.67" -maxdepth 2 -mindepth 1 -type f -iname "*.c"

* path to search
* -maxdepth 2, don't descend more than 2 directory levels below the starting path
* -mindepth 1, skip the starting path itself and only test what is at least 1 level below it   (I limited the depth to save time, and the notebook kept crashing)
* -type f, only look for files, not directories or links or anything else
* -iname, case insensitive glob style name matching, like the patterns you would give to <i>ls</i>

## xargs
A neat helper program that takes the output from another shell program and parses it into discrete input arguments for the next program in a pipe.

With -n 1, this lets us pass one path at a time from find to grep. Some programs do not need xargs, as they are designed to take a stream of input. Not in this case however.

## pcregrep -no "(?sim)[a-z]+\w*\(.*?\)" /dev/null
We also introduce <i>Perl Compatible Regular Expression grep (pcregrep)</i>. 
The <i>(?sim)</i> are options:

* s dot matches everything including newlines "\n"
* i case insensitive
* m multiline, ^ and $ match at the start and end of every line

The <i>.*?</i> makes the normally greedy ".*" lazy, so it stops at the first right parenthesis, escaped like this \\) 
This page explains the "Lazy Trap" issue, where the .*? control on the greedy match "jumps the fence"
http://www.rexegg.com/regex-quantifiers.html#lazytrap

### This is going to hunt for functions
That is, strings that match (?sim)[a-z]+\w*\(.*?\), where:

* [a-z] any one letter (lower case as written, either case because of the i option)
* '+' at least once, or more
* \w followed by '*', any word character (letter, digit or underscore) zero or more times
* \\( literal left parenthesis
* .*? lazy match of anything, as little as possible
* \\) literal right parenthesis

The <i>/dev/null</i> is a trick to make grep print the file path it's working on. With more than one file argument, grep prefixes each match with the name of the file it came from.
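The /dev/null trick can be checked with plain _grep_ on a scratch file. This is a sketch with an invented C snippet and a simpler single-line pattern than the pcregrep one above.

```shell
src=$(mktemp)     # scratch file standing in for one .c file
printf 'int main(void)\n{\n    return helper(1);\n}\n' > "$src"

# one file argument: grep prints only the line number and the match
alone=$(grep -no 'helper([^)]*)' "$src")

# two file arguments: each match is prefixed with the file it came from;
# /dev/null is empty, so it never adds a match of its own
with_null=$(grep -no 'helper([^)]*)' "$src" /dev/null)

echo "$alone"
echo "$with_null"
rm -f "$src"
```

This matters with _xargs -n 1_, where each grep call sees a single file and would otherwise leave the path out.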

## sed 
used to remove any spaces:
sed 's/\s//g'

* s  substitute
* \s  regular expression for any kind of white space
* //  replace with nothing... easier to read if it was sed 's/foo/bar/g', replace 'foo' with 'bar'
* g globally, as many times as a match can be made
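A quick sketch of the substitute command on a made-up line. I use [[:space:]] here, the POSIX spelling of white space, since \s is an extension that not every _sed_ accepts.

```shell
line='kfree ( user )'

# remove every white space character on the line
no_space=$(echo "$line" | sed 's/[[:space:]]//g')

# without the trailing g only the first match on the line is replaced
first_only=$(echo "$line" | sed 's/[[:space:]]//')

# the same command with visible text: replace every foo with bar
swapped=$(echo 'foo and foo' | sed 's/foo/bar/g')

echo "$no_space"
echo "$first_only"
echo "$swapped"
```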

In [10]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

find "${root_dir}/linux-2.6.32.67" -maxdepth 2 -mindepth 1 -type f -iname "*.c" | xargs -n 1 pcregrep -no "(?sim)[a-z]+\w*\(.*?\)" /dev/null | sed 's/\s//g' | tail -n 10 

/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:332:mutex_lock(&dcookie_mutex)
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:334:list_del(&user->next)
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:335:kfree(user)
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:337:is_live()
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:338:dcookie_exit()
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:340:mutex_unlock(&dcookie_mutex)
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:343:EXPORT_SYMBOL_GPL(dcookie_register)
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:344:EXPORT_SYMBOL_GPL(dcookie_unregister)
/home/daniel/git/Python2.7/DataScience/comm

### This example uses the regular expression syntax for <i>find</i>, whereas above I used the "globbing" rules.

In [11]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"
find "${root_dir}/linux-2.6.32.67" -maxdepth 2 -mindepth 1 -type f -regextype posix-extended -regex ".*\.(c|h|cpp)" | \
    xargs -n 1 pcregrep  -no "(?sim)[a-z]+\w*\(.*?\)" /dev/null | sed 's/\s//g' | tail -n 10 

/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:332:mutex_lock(&dcookie_mutex)
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:334:list_del(&user->next)
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:335:kfree(user)
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:337:is_live()
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:338:dcookie_exit()
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:340:mutex_unlock(&dcookie_mutex)
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:343:EXPORT_SYMBOL_GPL(dcookie_register)
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/fs/dcookies.c:344:EXPORT_SYMBOL_GPL(dcookie_unregister)
/home/daniel/git/Python2.7/DataScience/comm

# An aside about loops vs the _parallel_ program
## While loops in bash

In [68]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

while read f;do   
    file_=$(echo $f | cut -d, -f 1 | sed 's/\"//g')
    grep -Eo -m 1 "^#include <linux"  "${file_}" /dev/null
done < "${root_dir}/linux_inventory.csv"

Process is interrupted.


In [95]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

cat $root_dir/prep.sh 

root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

input_=$1
file_=$(echo $input_ | cut -d, -f 1 | sed s/\"//g)
out=$(grep -Eo -m 1 "^#include <linux"  "${root_dir}/${file_}" /dev/null)
echo $out


In [98]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

tail -n 10 "${root_dir}/linux_inventory.csv" | xargs -P 4 -n 1 $root_dir/prep.sh


/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/drivers/rapidio/rio.c:#include <linux
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/drivers/rapidio/switches/tsi500.c:#include <linux
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/drivers/rapidio/rio.h:#include <linux

/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/drivers/rapidio/rio-access.c:#include <linux
/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/drivers/rapidio/rio-scan.c:#include <linux

/home/daniel/git/Python2.7/DataScience/command_line_pres_data/linux-2.6.32.67/drivers/rapidio/rio-sysfs.c:#include <linux
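The -P 4 option in the cell above lets _xargs_ run up to four copies of the script at once, which is why the order of the output is not guaranteed. A minimal sketch with invented inputs, where a small _sh -c_ command stands in for prep.sh:

```shell
# -n 1: one argument per command; -P 4: up to four commands at a time
# the "_" fills $0 so that each word arrives in the inline script as $1
results=$(printf '%s\n' alpha beta gamma delta | xargs -P 4 -n 1 sh -c 'echo "done:$1"' _)

# jobs can finish in any order, so sort before comparing runs
sorted=$(echo "$results" | sort)
echo "$sorted"
```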



## <i>Parallel</i>

<i>parallel</i> is a powerful program that makes running shell programs on multiple cores trivial.
It also cleans up code, turning a loop into a single line even when used with just a single core.

In [12]:
%%bash
# Free the cache to really test the timing
# become root and run 
# free && sync && echo 3 > /proc/sys/vm/drop_caches && free

root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"
time find "${root_dir}/linux-2.6.32.67" -maxdepth 2 -mindepth 1 -type f -name "*.c" | \
    parallel --jobs 1 -n 1 'pcregrep -no "(?sm)if\s*\(.*?\)" /dev/null' | sed 's/\s//g' > /dev/null


real	0m2.031s
user	0m0.828s
sys	0m1.250s


In [13]:
%%bash
# become root and run 
# free && sync && echo 3 > /proc/sys/vm/drop_caches && free

# each %%bash cell starts a fresh shell, so root_dir has to be set again
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"
time find "${root_dir}/linux-2.6.32.67" -maxdepth 2 -mindepth 1 -type f -name "*.c" | \
    parallel --jobs 4 -n 1 'pcregrep -no "(?sm)if\s*\(.*?\)" /dev/null' | sed 's/\s//g' > /dev/null


## Add columns to our inventory database
Let's say that we want to add a categorical column to our SQL table. This is a nice way to store information about a table without worrying about normalization. We are not DBAs trying to maintain a strict schema here. We just need a fast and intuitive way to query our data.

## Categorical Variables
I had some difficulty when I first started reading about categorical variables. They are written and talked about in various contexts. For now, I'm thinking about a column in a database-like table that contains a string which describes the row.

In contrast, I could have used boolean flag columns. For every file with an include statement that has "linux" in it, give it a 1 and the rest a 0. Then for another condition, "foo", I'd have to add another flag column of 1s and 0s, and so on. 

In this case, a categorical variable is a single column that makes it easy to do a "group by" and aggregate across categories. We can use more logic, such as "where" clauses, to define even more granular categories.

In [17]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

find "${root_dir}/linux-2.6.32.67" -regextype posix-extended -regex ".*\.(c|h)" -type f | \
    parallel --jobs 4 -n 1 'grep -Eo -m 1 "^#include\s\Wlinux/\w+\.h\W" /dev/null' > "${root_dir}/sub_list.txt"

In [None]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

csvsql --db "mysql://root:test@127.0.0.1/LinuxKernel" --table "inventory" --insert "${root_dir}/linux_inventory.csv"

Use sed (stream editor) to add the header row. You can certainly open the file in an editor and do this manually. I like to practice sed as much as possible because it takes a while to learn. I often use this syntax, 
<i>sed 's/foo/bar/g'</i>, to substitute. 

However, this is an "address" command, and to be honest I've been forgetting how to use it, but that's what this Meetup is all about: practice through teaching and sharing. 

In [25]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

sed -i '1 i\path_:include' "${root_dir}/sub_list.txt" # address command, 'insert' before the first line so the header becomes row 1
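With an address of 1, the _i_ command inserts text before the first line and the _a_ command appends it after the first line, so only _i_ produces a header row. A small sketch on a scratch file with invented contents, assuming GNU sed (which accepts this one-line form):

```shell
# assumes GNU sed; the two rows are made up
list=$(mktemp)
printf 'a.c:#include <linux/x.h>\nb.c:#include <linux/y.h>\n' > "$list"

# 1 i\ puts the text before line 1, so it becomes the new first line
inserted=$(sed '1 i\path_:include' "$list" | head -n 1)

# 1 a\ puts the text after line 1, so the original first line stays on top
appended=$(sed '1 a\path_:include' "$list" | head -n 1)

echo "$inserted"
echo "$appended"
rm -f "$list"
```

Leaving out -i, as here, prints the result instead of editing the file in place, which is a safe way to check an address command before committing to it.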

In [26]:
%%bash 
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

csvsql -d ":" --db "mysql://root:test@127.0.0.1/LinuxKernel" --table "sub_list" --insert "${root_dir}/sub_list.txt"  




In [None]:
%%bash
sql="update inventory set category = case when inventory.path in (select path_ from sub_list) then 'linux' else NULL end;"

mysql --user=root --password=test -e "${sql}" LinuxKernel

In [None]:
%%bash
sql2csv --db "mysql://root:test@127.0.0.1/LinuxKernel" --query "SELECT * from inventory;" > linux_inventory_include.csv

In [28]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"


find "${root_dir}/linux-2.6.32.67" -regextype posix-extended -regex ".*\.(c|h)" -type f | \
    parallel --jobs 4 -n 1 'grep -Eo -m 1 "^#include\s\Wlinux/kernel\.h\W" /dev/null' > linux_inventory_kernel.txt

## Do the same thing with Pandas

In [1]:
import sqlalchemy
import pandas as pd
import numpy as np
from os import path
root_dir = "/home/daniel/git/Python2.7/DataScience/command_line_pres_data"
engine = sqlalchemy.create_engine("mysql://root:test@127.0.0.1/LinuxKernel")

### What if we did the find | grep steps in  Python....? 

In [32]:
import re

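# raw string so that \s, \W and \w reach the regex engine untouched
ob = re.compile(r"#include\s\Wlinux/(?P<name>\w+)\.h\W")

def regex_check(filename):
    """Return NAME from the first linux/NAME.h include in the file, or None."""
    fname = path.join(root_dir, filename)
    # the with block closes the file even if read() raises
    with open(fname, 'r') as f:
        match = ob.search(f.read())

    if match:
        return match.group('name')

    return None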

In [33]:
print(regex_check('linux-2.6.32.67/drivers/rapidio/rio-access.c'))

rio


In [34]:
src = path.join(root_dir, "linux_inventory.csv")
df = pd.read_csv(src, quotechar='"', quoting=1)
df['category'] = df.apply(lambda row: regex_check(row['path']) if row['ext'] in ('h', 'c') else None, axis=1)

In [36]:
df['category'][0:10]

0          mutex
1             if
2         kernel
3    etherdevice
4            err
5         bitops
6           None
7         kernel
8         device
9           None
Name: category, dtype: object

In [38]:
# flavor is not needed when passing a SQLAlchemy engine, and newer pandas releases reject it
df.to_sql("inventory2", engine)

In [48]:
%%bash
sql="select count(path) as cts, category from inventory2 group by category order by cts desc limit 10;"
mysql -uroot -ptest -e "${sql}" LinuxKernel

cts	category
15235	NULL
3002	module
2266	kernel
1534	types
1327	init
319	delay
318	errno
259	fs
245	sched
241	mm


In [None]:
%%bash
root_dir="/home/daniel/git/Python2.7/DataScience/command_line_pres_data"

sql2csv --db "mysql://root:test@127.0.0.1/LinuxKernel" --query "select count(path) as cts, category from inventory2 group by category order by cts desc;" \
> "${root_dir}/categories.csv"

## Back to Command line

## Save the output to a file for the future
We'll redirect the output from standard out (terminal display) to a file.

In [1]:
%%bash
#root_dir="/home/daniel/git/Python2.7/DataScience/command_line_data"
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"
cat $root_dir/linux_inventory.csv | cut -d, -f 3 | sort | uniq -c | sort -n > $root_dir/ext_list.txt

## Makefile
Let's try using a makefile. We have 2 steps required to create the sorted list of extensions and their counts.

1. make an inventory
2. sort the inventory by extension

We also have 2 dependencies

1. Linux kernel source
2. inventory

There's one final output, the sorted list of counts by extension

The idea behind the makefile is that if we change a dependency, we want the steps that produce the target output to run again.
If a file was added to the Linux kernel, then we need a new inventory and then a new file extension count list. In real-world usage I'd make more of an effort to avoid rerunning the entire inventory. Here, that's a bit overkill.


## What is happening
Make keeps track of when a file or directory was last modified. If a prerequisite is newer than its target, the recipe is invoked to handle the updated information. Make can also use functions and shell parameters, although in a slightly different form.

Makefiles are typically named "makefile" or "Makefile", and each recipe line must begin with a tab. Make has to parse the makefile, so there's some special syntax that is similar to but distinct from that of the shell.
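The timestamp comparison can be imitated in the shell with the -nt ("newer than") test. This is only a rough sketch of the idea, not how Make is implemented; the helper needs_rebuild and the file names are made up.

```shell
work=$(mktemp -d)
touch -t 202001010000 "$work/inventory.csv"   # prerequisite, given an old date
touch -t 202001020000 "$work/ext_list.txt"    # target, one day newer

# hypothetical helper: needs_rebuild TARGET PREREQUISITE
# succeeds when the target is missing or older than the prerequisite
needs_rebuild() {
    [ ! -e "$1" ] || [ "$2" -nt "$1" ]
}

if needs_rebuild "$work/ext_list.txt" "$work/inventory.csv"; then before=yes; else before=no; fi

touch "$work/inventory.csv"                   # the prerequisite is now the newer file
if needs_rebuild "$work/ext_list.txt" "$work/inventory.csv"; then after=yes; else after=no; fi

echo "$before $after"
rm -rf "$work"
```

Before the _touch_ nothing needs to run; after it, the target is out of date and its recipe would be invoked.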

In [2]:
%%bash
#root_dir="/home/daniel/git/Python2.7/DataScience/command_line_data"
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"
cat -n makefile

cat: makefile: No such file or directory


In [3]:
%%bash
#root_dir="/home/daniel/git/Python2.7/DataScience/command_line_data"
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"
cat -n $root_dir/makefile

     1	# simple makefile for creating a new inventory and extension list
     2	# if the kernel is updated with new source
     3	
     4	# global path prefix
     5	root_dir = /home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data
     6	
     7	# prerequistes
     8	kernel = $(root_dir)/linux-2.6.32.67
     9	inventory = $(root_dir)/linux_inventory.csv
    10	
    11	# target
    12	extension_list = $(root_dir)/ext_list.txt
    13	
    14	# shell script required for recipe
    15	make_inventory = $(root_dir)/make_inventory.sh
    16	
    17	
    18	$(extension_list): $(inventory)
    19		cat $(inventory) | cut -d, -f 3 | sort | uniq -c | sort -n > $(extension_list)
    20	
    21	
    22	$(inventory): $(kernel)
    23		find $(kernel) -type f | parallel -n 1 --jobs 2 $(make_inventory) > $(inventory)
    24		sed -i '1 i\path_,tot_lines' $(inventory)


### Test it out

We can use the _touch_ command to update the modification date of a file or directory.
There are at least 3 ways to see the modification date of a file, the most common being _stat_. Incidentally, _stat_ is a really basic program that is called internally in many other shell programs.

In [8]:
%%bash
#root_dir="/home/daniel/git/Python2.7/DataScience/command_line_data"
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"
stat $root_dir/linux-2.6.32.67

  File: ‘/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/linux-2.6.32.67’
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: 805h/2053d	Inode: 7494692     Links: 23
Access: (0775/drwxrwxr-x)  Uid: ( 1000/  dcuneo)   Gid: ( 1000/  dcuneo)
Access: 2015-09-29 18:31:19.851654480 -0700
Modify: 2015-09-29 18:31:19.795654477 -0700
Change: 2015-09-29 18:31:19.795654477 -0700
 Birth: -


The program _touch_ is used to update the modification time to the present date. It is also used to create an empty file when one needs such a thing.

In [10]:
%%bash
#root_dir="/home/daniel/git/Python2.7/DataScience/command_line_data"
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"
touch $root_dir/linux-2.6.32.67
stat $root_dir/linux-2.6.32.67

  File: ‘/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/linux-2.6.32.67’
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: 805h/2053d	Inode: 7494692     Links: 23
Access: (0775/drwxrwxr-x)  Uid: ( 1000/  dcuneo)   Gid: ( 1000/  dcuneo)
Access: 2015-09-29 18:51:46.139712980 -0700
Modify: 2015-09-29 18:51:46.139712980 -0700
Change: 2015-09-29 18:51:46.139712980 -0700
 Birth: -


In [14]:
%%bash
#root_dir="/home/daniel/git/Python2.7/DataScience/command_line_data"
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"
# if we are in the directory where the makefile exists, just issue the _make_ command
# because this notebook could theoretically be run from anywhere, I'm using the full path and the -f option
make -f $root_dir/makefile

find /home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/linux-2.6.32.67 -type f | parallel -n 1 --jobs 2 /home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/make_inventory.sh > /home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/linux_inventory.csv
sed -i '1 i\path_,tot_lines' /home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/linux_inventory.csv
cat /home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/linux_inventory.csv | cut -d, -f 3 | sort | uniq -c | sort -n > /home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/ext_list.txt


In [15]:
%%bash
#root_dir="/home/daniel/git/Python2.7/DataScience/command_line_data"
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"

stat $root_dir/ext_list.txt

  File: ‘/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/ext_list.txt’
  Size: 3435      	Blocks: 8          IO Block: 4096   regular file
Device: 805h/2053d	Inode: 7354139     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  dcuneo)   Gid: ( 1000/  dcuneo)
Access: 2015-09-29 19:01:56.123742080 -0700
Modify: 2015-09-29 18:58:19.531731747 -0700
Change: 2015-09-29 18:58:19.531731747 -0700
 Birth: -


## Storing the output of a command in a variable (shell parameter)

We can set a shell parameter from the output of another shell command or program.
You've seen this trick in the simple inventory program earlier; I use it a lot to
assign the output of commands to a shell parameter.

In [5]:
%%bash

echo $USER # shell variable set up when you log in
var=$(echo $USER | cut -c 1-4)
echo $var

dcuneo
dcun


Another example with the _date_ program.

In [6]:
%%bash
date

Mon Sep 28 19:38:23 PDT 2015


In [8]:
%%bash
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"

dir_path=${root_dir}/$(date +%Y_%m_%d_%H:%M:%S) # date can take a format string
echo $dir_path

/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/2015_09_28_19:39:13
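As a minimal sketch of the same trick in a slightly larger setting, the block below captures a line count and builds a timestamped directory path. The scratch directory from `mktemp -d` and the file name `sample.txt` are assumptions made for illustration; they are not part of the original data set.

```bash
#!/usr/bin/env bash
# Hypothetical scratch directory so the example does not touch real data.
work_dir=$(mktemp -d)

# Write three lines, then capture the line count in a shell parameter.
printf 'a\nb\nc\n' > "$work_dir/sample.txt"
line_count=$(wc -l < "$work_dir/sample.txt")
line_count=$((line_count))   # normalize: some wc versions pad with spaces
echo "lines: $line_count"

# A substitution can sit inside a larger quoted string.
stamp_dir="$work_dir/$(date +%Y_%m_%d_%H:%M:%S)"
mkdir -p "$stamp_dir"
echo "$stamp_dir"
```

Quoting the whole path keeps it a single word even if a directory name contains spaces.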


## Shell Loops

There are two kinds of loops that I tend to use:

1. while
2. for

### Tests

The square brackets are called "tests". This is another topic in shell scripting that I can't fully cover here, but you can see a use case for it later in this section.

The loop below is rather contrived. We would really just use the _cat_ command to see the contents, but it's good practice because the output is predictable.
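A few common forms of the test are sketched here. The values of `count` and `ext` and the temporary file are made up for illustration.

```bash
#!/usr/bin/env bash
count=12                 # made-up value for illustration
ext="txt"                # made-up value for illustration
tmp_file=$(mktemp)       # hypothetical empty file to test against

# Integer comparison: -gt, -lt, -eq, -ne, -ge, -le
if [ "$count" -gt 10 ]; then
    echo "count is greater than 10"
fi

# String comparison uses = and !=
if [ "$ext" = "txt" ]; then
    echo "extension is txt"
fi

# File tests: -f regular file, -d directory, -s non-empty file
if [ -f "$tmp_file" ] && [ ! -s "$tmp_file" ]; then
    echo "file exists and is empty"
fi

rm -f "$tmp_file"
```

The spaces just inside the brackets are required, because `[` is itself a command and `]` is its last argument.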

In [9]:
%%bash
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"

while read f;do
    echo $f
done < $root_dir/ext_list.txt

1 "1992-1997"
1 "1994-2004"
1 "1995-2002"
1 "1996-2002"
1 "278"
1 "5"
1 "act2000"
1 "AddingFirmware"
1 "AdvancedTopics"
1 "agh"
1 "aic79xx"
1 "aic7xxx"
1 "arcmsr"
1 "asp"
1 "au0828"
1 "audio"
1 "auto"
1 "avmb1"
1 "awk"
1 "ax"
1 "binfmt"
1 "bttv"
1 "buddha"
1 "build"
1 "CAPI"
1 "cc"
1 "cert"
1 "ChangeLog"
1 "char"
1 "clean"
1 "Coding"
1 "common"
1 "concap"
1 "Conclusion"
1 "copyright"
1 "cpia"
1 "cpia2"
1 "cputype"
1 "cx23885"
1 "cycladesZ"
1 "DAC960"
1 "dino"
1 "diversion"
1 "DOC"
1 "drm"
1 "drv_ba_resend"
1 "dtc"
1 "dvb-usb"
1 "Early-stage"
1 "em28xx"
1 "ext"
1 "FIRST"
1 "FlashPoint"
1 "Followthrough"
1 "FPE"
1 "freeze"
1 "freezer"
1 "fwinst"
1 "gate"
1 "gdbinit_200MHz_16MB"
1 "gdbinit_300MHz_32MB"
1 "gdbinit_400MHz_32MB"
1 "generic"
1 "gigaset"
1 "glade"
1 "headersinst"
1 "hfc-pci"
1 "HiSax"
1 "history"
1 "hm12"
1 "host"
1 "hp300"
1 "hysdn"
1 "hz"
1 "i2400m"
1 "icn"
1 "ide"
1 "include"
1 "inc_shipped"
1 "inf"
1 "ini"
1 "Intro"
1 "ioctl"
1 "iosched"
1 "ips"
1 "ipw2100"
1 "ipw2200"
1 "

Bash splits the output of _cat_ on spaces, tabs, or newlines (\n). To get one item per line in a _for_ loop, we set IFS so that only \n acts as a delimiter.

_read_ is a better way to work with the contents of a file when you want the data on a per-line basis, and most of the time we do.

In [10]:
%%bash
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"

IFS=$'\n'
for f in $(cat $root_dir/ext_list.txt);do
    echo $f
done

      1 "1992-1997"
      1 "1994-2004"
      1 "1995-2002"
      1 "1996-2002"
      1 "278"
      1 "5"
      1 "act2000"
      1 "AddingFirmware"
      1 "AdvancedTopics"
      1 "agh"
      1 "aic79xx"
      1 "aic7xxx"
      1 "arcmsr"
      1 "asp"
      1 "au0828"
      1 "audio"
      1 "auto"
      1 "avmb1"
      1 "awk"
      1 "ax"
      1 "binfmt"
      1 "bttv"
      1 "buddha"
      1 "build"
      1 "CAPI"
      1 "cc"
      1 "cert"
      1 "ChangeLog"
      1 "char"
      1 "clean"
      1 "Coding"
      1 "common"
      1 "concap"
      1 "Conclusion"
      1 "copyright"
      1 "cpia"
      1 "cpia2"
      1 "cputype"
      1 "cx23885"
      1 "cycladesZ"
      1 "DAC960"
      1 "dino"
      1 "diversion"
      1 "DOC"
      1 "drm"
      1 "drv_ba_resend"
      1 "dtc"
      1 "dvb-usb"
      1 "Early-stage"
      1 "em28xx"
      1 "ext"
      1 "FIRST"
      1 "FlashPoint"
      1 "Followthrough"
      1 "FPE"
      1 "freeze"
      1 "freezer"
      1 "fwinst"
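Notice that the _while read_ output earlier lost the leading spaces that _uniq -c_ adds, while the _for_ loop kept them. A hedged sketch of how to get both behaviors from _read_ follows; the two-line sample file stands in for ext_list.txt and is an assumption for illustration.

```bash
#!/usr/bin/env bash
sample=$(mktemp)   # hypothetical stand-in for ext_list.txt
printf '      1 "awk"\n     34 "sh"\n' > "$sample"

# Plain read trims leading and trailing whitespace from each line.
trimmed=()
while read -r line; do
    trimmed+=("$line")
done < "$sample"

# An empty IFS for the read keeps the whitespace; -r keeps backslashes literal.
kept=()
while IFS= read -r line; do
    kept+=("$line")
done < "$sample"

printf '[%s]\n' "${trimmed[@]}" "${kept[@]}"
rm -f "$sample"
```

Setting IFS only on the _read_ command limits the change to that command, so the rest of the script keeps the default word splitting.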


Let's do something slightly more interesting and introduce the _test_ while we're at it.

In [11]:
%%bash
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"

while read f;do
    count=$(echo $f | awk '{print $1}')
    if [ $count -gt 10 ];then
        echo $f
    fi
done < $root_dir/ext_list.txt

11 "c_shipped"
13 "tst"
14 "ppm"
15 "lds"
23 "pl"
26 "HEX"
28 "debug"
33 "tmpl"
34 "sh"
50 "boot"
79 "gitignore"
105 "xml"
111 "ihex"
115 "dts"
857 "txt"
1080 "S"
2818 "NONE"
11638 "h"
13154 "c"
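As an alternative sketch, _read_ can split each line into fields itself, which avoids starting _awk_ once per line. The three-line sample file and the names `count`, `ext`, and `big` are assumptions for illustration, not part of the original notebook.

```bash
#!/usr/bin/env bash
sample=$(mktemp)   # hypothetical stand-in for ext_list.txt
printf '      5 "pl"\n     34 "sh"\n  13154 "c"\n' > "$sample"

big=()
# read assigns the first field to count and the rest of the line to ext.
while read -r count ext; do
    if [ "$count" -gt 10 ]; then
        echo "$count $ext"
        big+=("$ext")
    fi
done < "$sample"

rm -f "$sample"
```

The last variable named on the _read_ line receives everything left over, so extensions containing spaces would still land in `ext` intact.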


### Use case
Maybe you run a program in a loop and save each output to a new file named with the date and time.

In [12]:
%%bash
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"

for ind in {1..5};do 
    fname="${root_dir}/$(date +%Y_%m_%d_%H:%M:%S).test"
    echo $fname
done

/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/2015_09_28_19:41:38.test
/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/2015_09_28_19:41:38.test
/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/2015_09_28_19:41:38.test
/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/2015_09_28_19:41:38.test
/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/2015_09_28_19:41:38.test


In [13]:
%%bash
root_dir="/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data"

for ind in {1..5};do 
    fname="${root_dir}/test_$ind.txt"
    echo $fname
done

/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/test_1.txt
/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/test_2.txt
/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/test_3.txt
/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/test_4.txt
/home/dcuneo/git/PersonalDS/DataScience/command_line_pres_data/test_5.txt
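In the timestamped loop above, all five names came out identical because the loop finished within one second, so each run would overwrite the previous file. One possible fix, sketched here, is to combine the timestamp with the loop index. The scratch directory and the `run N` file contents are assumptions for illustration.

```bash
#!/usr/bin/env bash
root_dir=$(mktemp -d)   # hypothetical scratch directory, not the real data dir

names=()
for ind in {1..5}; do
    # The index keeps names unique even when the timestamp repeats.
    fname="${root_dir}/$(date +%Y_%m_%d_%H:%M:%S)_${ind}.test"
    echo "run $ind" > "$fname"
    names+=("$fname")
done

ls "$root_dir"
```

The timestamp still records when the run happened, and the index guarantees that no file is overwritten.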
