# Command line


***

## Command Line (Terminal)

One of the most underappreciated tools for dealing with data is the command line of the Unix/Linux operating system. As you probably know, an operating system runs computer-based devices, including servers, laptops, smartphones, and many others, usually "under the hood". "Unix" (counting Unix-like systems such as Linux) is the family of operating systems that runs the most devices. If you have a Mac -- there's Unix under the hood. An iPhone? Unix under the hood. An Android phone? Guess what ... Unix under the hood (Linux, to be precise). And what are you running on now? You guessed it. (But Windows machines? Not Unix.)

You can get under the hood of a Unix-based computer via the command line, often called the Terminal, and do all manner of crazy things.  We will study a few of them.

We can access the command line of our cloud machines through the Jupyter Notebooks. You can use "shell" commands (such as the following) by prefixing the line with an exclamation point.

(The "shell" is the technical name for the command line interface you see in a Terminal window. There are actually different shells with different commands and syntax, even for the same operating system, but we won't delve into that in this class. We will use the particular shell that JupyterHub provides us by default. The default is the "bash" shell, for those of you who care.)

*Here we will give you a brief overview, and show some very useful commands.  If you're serious about dealing with data, and you don't have much command-line experience, you should get the book "Data Science at the Command Line" by Janssens. The first half of this (thin) book gives an excellent practical guide to dealing with data at the Unix command line.*


#### Interaction with files and folders

We can navigate the folder structure in which we are working.  Folders are called "directories". You will typically use commands such as `ls` (list directory contents) and `cd` (change to another directory--but wait on that for a second). You can make a directory with `mkdir` or move (`mv`) and copy (`cp`) files. To delete a file you can `rm` (remove) it--careful, there's no getting it back.  To see the contents of a file you can `cat` it to the screen.  (Why `cat`?  That command actually concatenates multiple files and outputs the result.  If you give it one, then it just outputs that single file.  The default is to output it to the screen.  You'll see below that we can send command outputs elsewhere besides to the screen--like into another command!)
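The file commands above can be tried safely in a scratch directory. The following is a minimal sketch in plain bash (in a notebook you could paste it into a cell that begins with the `%%bash` cell magic). The directory and file names (`notes`, `a.txt`, and so on) are made up for illustration, and the `>` used to create the first file is the redirection operator explained later in this notebook.

```shell
# Work in a throwaway directory so nothing real is touched.
work=$(mktemp -d)
cd "$work"

mkdir notes                               # make a directory
printf 'hello\n' > notes/a.txt            # create a small file to play with
cp notes/a.txt notes/b.txt                # copy: a.txt and b.txt both exist now
mv notes/b.txt notes/c.txt                # move (rename): b.txt becomes c.txt
listing=$(ls notes)                       # list the directory contents
echo "$listing"

contents=$(cat notes/a.txt notes/c.txt)   # cat concatenates the files in order
echo "$contents"

rm notes/c.txt                            # delete one file (there is no undo)
after_rm=$(ls notes)
echo "$after_rm"
```

Giving `cat` two files prints one after the other, which is where the name comes from.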

Many commands have options you can set when running them. For example, to get a listing of files as a vertical list with extra details you can pass the `-l` (long format) flag, e.g. `ls -l`. During the normal course of using the command line, you will learn the most useful flags. If you want to see all possible options you can always read the `man` (manual) page for a command, e.g. `man ls`. When you are done reading the `man` page, you can exit by hitting `q` to quit.


In [None]:
! git clone https://github.com/yizuc/datamining.git

In [None]:
%cd /content/datamining/Module1_Bash_Pandas
!ls

In [None]:
!ls -l

In [18]:
!mkdir test

In [None]:
!ls

Ok, so above we said to wait on changing directories for a second.  That's the cd command.  But watch this -- normally you'd think this would put you in the data directory:

In [20]:
!cd data

In [None]:
!ls

NOTICE: when you `ls`, you see that you are still in the same directory you were in before. (You see the same contents, including the data directory.)

Understanding exactly why requires understanding how Unix processes work, so we won't go deep into it here. (The bottom line: each `!` line in a Jupyter notebook starts a new shell process, runs the command there, and then that process exits. The `cd` changes the directory of that short-lived process only, leaving the notebook right where it was.) The trick here is to use `%` instead of `!`: `%cd` is a notebook "magic" command that changes the working directory of the notebook itself. A minor pain to remember, but...

In [None]:
%cd data

In [None]:
!ls
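The reason `!cd` did not stick can be reproduced in a plain shell. This is a small sketch rather than a description of Jupyter's internals: `$( ... )` runs its commands in a child shell, which is roughly what a `!` line does, and the scratch directory is created only for the demonstration.

```shell
start=$(pwd)                     # remember where we are
work=$(mktemp -d)                # scratch directory, made up for this demo

# $( ... ) runs the commands in a child shell, much like a `!` line does.
inside=$( cd "$work" && pwd )
after=$(pwd)                     # the parent shell never moved

echo "child was in:  $inside"
echo "parent is in:  $after"
```

The child process changed directory and then exited; the parent's working directory is untouched, just as the notebook's was after `!cd data`.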

In [24]:
!cp users.csv users2.csv

In [None]:
!ls

To go back up a directory, we use `..` like this:

In [None]:
%cd ..

In [None]:
!ls

In [None]:
!ls test/

In [None]:
# WARNING: THIS WILL DELETE THE TEST FOLDER AND THE FILE JUST CREATED
!rm -rf test/
!rm data/users2.csv

In [None]:
!ls

#### Data manipulation and exploration
Virtually anything you want to do with a data file can be done at the command line. There are dozens of commands that can be put together to get almost any result! Let's try it.

Let's take a look at the file `data/users.csv`.

Before we do anything, let's look at the first few lines of the file to get an idea of what's in it.

In [None]:
!head data/users.csv

Maybe we want to see a few more lines of the file:

In [None]:
!head -n 15 data/users.csv

How about the last few lines of the file?

In [None]:
!tail data/users.csv
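Both `head` and `tail` print 10 lines unless told otherwise, and both accept `-n` to choose a different count. Here is a minimal sketch on a made-up five-line file (not the course data), written as plain bash that could go in a `%%bash` cell:

```shell
sample=$(mktemp)                                  # temporary file, hypothetical data
printf 'line1\nline2\nline3\nline4\nline5\n' > "$sample"

top=$(head -n 2 "$sample")       # first two lines
bottom=$(tail -n 2 "$sample")    # last two lines

echo "$top"
echo "$bottom"
```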

We can count how many lines are in the file by using `wc` (word count, which counts more than just words) with the `-l` flag to count lines:

In [None]:
!wc -l data/users.csv
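To see what else `wc` counts, here is a small sketch on two made-up rows (not the real `users.csv`). Reading the file through `<` is an assumption of convenience: it keeps the file name out of the output, so each result is just a number.

```shell
sample=$(mktemp)
printf 'alice,3,4\nbob,-1,2\n' > "$sample"    # two hypothetical rows

lines=$(wc -l < "$sample")    # number of lines (newline characters)
words=$(wc -w < "$sample")    # whitespace-separated words
bytes=$(wc -c < "$sample")    # bytes

echo "lines:" $lines "words:" $words "bytes:" $bytes
```

Each row contains no spaces, so it counts as a single word; a CSV file will often have about as many "words" as lines for that reason.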

It looks like there are three columns in this file; let's take a look at the first one alone. Here, we can `cut` the field (`-f`) we want, as long as we give the proper delimiter (`-d`) to tell `cut` what separates fields. (If `-d` is omitted, the delimiter defaults to tab.)

In [None]:
!cut -f1 -d',' data/users.csv

That's a lot of output. Let's combine the `cut` command with the `head` command by *piping* the output of one command into another command.  The vertical bar `|` is the *pipe*.

In [None]:
!cut -f1 -d',' data/users.csv | head
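`cut` can also pull out several fields at once by listing them after `-f`. A brief sketch, using two made-up rows in place of the real file:

```shell
sample=$(mktemp)
printf 'alice,3,4\nbob,-1,2\n' > "$sample"        # hypothetical three-column rows

first=$(cut -d',' -f1 "$sample")                  # first column only
first_and_third=$(cut -d',' -f1,3 "$sample")      # columns 1 and 3, still comma-separated

echo "$first"
echo "$first_and_third"
```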

We can use pipes (`|`) to string together many commands to create very powerful one-liners. For example, let's figure out the number of _unique_ users in the first column of the data file. We will get all the values from the first column, sort them, reduce that to only the unique values, and then count the number of lines in the result:

In [None]:
!cut -f1 -d',' data/users.csv | sort | uniq | wc -l
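Why sort before `uniq`? Because `uniq` only collapses repeated lines that sit next to each other. The sketch below shows the difference on a few made-up names whose duplicates are deliberately not adjacent:

```shell
names=$(mktemp)
printf 'bob\nalice\nbob\nalice\nbob\n' > "$names"   # hypothetical values

unsorted=$(uniq "$names" | wc -l)         # uniq alone merges only neighbouring repeats
sorted=$(sort "$names" | uniq | wc -l)    # sorting first puts repeats side by side

echo "without sort:" $unsorted "with sort:" $sorted
```

Without the sort, nothing is merged and all five lines survive; with it, only the two distinct names remain. `sort -u` gives the same result as `sort | uniq` in a single step.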

Or, we can get a list of the top-10 most frequently occurring users. If we give `uniq` the `-c` flag, it will return the number of times each value occurs. Since these counts are the first entry in each new line, we can tell `sort` to expect numbers (`-n`) and to give us the results in reverse (`-r`) order. `head` then keeps the first 10 lines, which is its default. Note that when you want to use two or more single-letter flags, you can just place them one after another.

In [None]:
!cut -f1 -d',' data/users.csv | sort | uniq -c | sort -nr | head
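The `-n` flag matters because plain `sort` compares text, not numbers. The sketch below first shows that on three bare numbers and then runs the same counting pipeline on a handful of made-up names (not the course data):

```shell
# Plain sort compares text, so "10" lands before "2"; -n compares as numbers.
as_text=$(printf '9\n10\n2\n' | sort | head -n 1)
as_number=$(printf '9\n10\n2\n' | sort -n | head -n 1)
largest=$(printf '9\n10\n2\n' | sort -nr | head -n 1)
echo "text order first: $as_text, numeric first: $as_number, largest: $largest"

names=$(mktemp)
printf 'bob\nalice\nbob\ncarol\nbob\nalice\n' > "$names"   # hypothetical values

# Count each name, then list the two most frequent, biggest count first.
top=$(sort "$names" | uniq -c | sort -nr | head -n 2)
echo "$top"
```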

**Don't freak out at this point.**  Of course if you've never used this before, you're not expected to master all that at once.  The point here is to show how powerful a data manipulation tool the Unix command-line can be.  You can get resources (like the thin book mentioned above) to teach you and to use as a reference.  Using this notebook you should be able to use these particular commands in different ways by changing (and experimenting with) the flags.

After some exploration we decide we want to keep only part of our data and bring it into a new file. Let's find all the records that have negative values in both the second and third columns and put these results in a file called `data/negative_users.csv`. Searching through files can be done using _[regular expressions](http://www.robelle.com/smugbook/regexpr.html#expression)_ with a tool called `grep` (the name comes from the `ed` editor command `g/re/p`: globally search for a regular expression and print the matching lines).

And do you remember that we can send the output of a command to the screen or into another command? You can also direct the output of a command (or a string of commands) into a file using the "redirection" operator `>`. (NB: the pipe is a close cousin; it redirects output into another command rather than into a file. Another redirection operator is `>>`, which appends the output to the end of the file, rather than overwriting it.)

In [None]:
!grep '.*,-.*,-.*' data/users.csv > data/negative_users.csv

(You'll have to look into the regular expression link to figure that one out.)
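As a starting point, here is a small sketch of that same pattern applied to four made-up rows, assuming three comma-separated columns as in `users.csv`. In the pattern, `.*` matches any run of characters and `,-` is a literal comma followed by a minus sign, so a line has to contain `,-` twice to match.

```shell
sample=$(mktemp)
# Hypothetical rows: only carol has a minus sign at the start of both columns 2 and 3.
printf 'alice,3,4\nbob,-1,2\ncarol,-5,-6\ndave,7,-8\n' > "$sample"

matches=$(grep '.*,-.*,-.*' "$sample")
echo "$matches"
```

Rows with a negative value in only one of the two columns contain `,-` just once and are filtered out.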

In [None]:
!ls data
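The difference between `>` and `>>` described above is easy to check on a scratch file. A minimal sketch in plain bash, where the file is temporary and the words written to it are made up:

```shell
out=$(mktemp)

echo 'first'  >  "$out"     # > creates the file or overwrites it
echo 'second' >  "$out"     # overwrites again: "first" is gone
echo 'third'  >> "$out"     # >> appends to what is already there

result=$(cat "$out")
echo "$result"
```

This is why re-running the `grep ... > data/negative_users.csv` cell is safe: each run replaces the file instead of piling duplicate rows onto it, which is what `>>` would do.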