# CSV command-line kung fu

You might be surprised how much data slicing and dicing you can do from the command line using a few simple tools plus I/O redirection and piping. (See [A Quick Introduction to Pipes and Redirection](http://bconnelly.net/working-with-csvs-on-the-command-line/#a-quick-introduction-to-pipes-and-redirection).)

Why use the command line rather than just reading everything into Python? Command-line tools will often process data much faster. (Although, in this case, we are also using a Python-based command-line tool, csvkit.) Most importantly, you can launch many of these commands simultaneously from the command line, computing everything in parallel using the multiple CPU cores you have in your computer. If you have 4 cores, you have the potential to process the data up to four times faster than a single-threaded Python program.

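As a rough illustration of running commands simultaneously, the sketch below starts two `grep` processes in the background with `&` and then uses `wait` to pause until both have finished. The scratch directory, file names, and rows are made up for the example; on a real dataset you might split the work by file or by pattern.

```bash
# Hypothetical scratch directory and sample data, invented for this sketch.
workdir=$(mktemp -d)
printf 'id,category\n1,Furniture\n2,Technology\n3,Furniture\n' > "$workdir/a.csv"
printf 'id,category\n4,Technology\n5,Technology\n' > "$workdir/b.csv"

# Launch one grep per file in the background; each runs as its own process.
grep -c 'Technology' "$workdir/a.csv" > "$workdir/a.count" &
grep -c 'Technology' "$workdir/b.csv" > "$workdir/b.count" &

# Block until both background jobs have finished.
wait

count_a=$(cat "$workdir/a.count")
count_b=$(cat "$workdir/b.count")
echo "a.csv: $count_a, b.csv: $count_b"
```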
## Set up for CSV on the command line

```bash
pip install csvkit
```

## Extracting rows with grep

Now, let me introduce you to the `grep` command, which lets us filter the lines in a file according to a regular expression. Here's how to find all rows that contain `Claire Gute`:

In [3]:
! grep 'Claire Gute' SampleSuperstore.csv | head -3

1.0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136
2.0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back",731.9399999999999,3.0,0.0,219.58199999999997
5492.0,CA-2017-164098,2017-01-26,2017-01-27,First Class,CG-12520,Claire Gute,Consumer,United States,Houston,Texas,77070.0,Central,OFF-ST-10000615,Office Supplies,Storage,"SimpliFile Personal File, Black Granite, 15w x 6-15/16d x 11-1/4h",18.16,2.0,0.2,1.8160000000000016


We didn't have to write any code. We didn't have to jump into a development environment or editor. We just asked for the sales for Claire Gute. If we want to write the data to a file, we simply redirect it:

In [4]:
! grep 'Claire Gute' SampleSuperstore.csv > /tmp/Claire.csv
! head -3 /tmp/Claire.csv # show first 3 lines of that new file

1.0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136
2.0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back",731.9399999999999,3.0,0.0,219.58199999999997
5492.0,CA-2017-164098,2017-01-26,2017-01-27,First Class,CG-12520,Claire Gute,Consumer,United States,Houston,Texas,77070.0,Central,OFF-ST-10000615,Office Supplies,Storage,"SimpliFile Personal File, Black Granite, 15w x 6-15/16d x 11-1/4h",18.16,2.0,0.2,1.8160000000000016

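One thing to notice is that `grep` keeps only the matching lines, so the header row is missing from the redirected file. A minimal sketch of one way to keep it, assuming the header is the first line of the file, is to group `head -n 1` and `grep` so that both write to the same output. The tiny `sales.csv` here is a trimmed, made-up stand-in for `SampleSuperstore.csv`:

```bash
# Hypothetical scratch directory and a shortened, invented sample file.
workdir=$(mktemp -d)
cat > "$workdir/sales.csv" <<'EOF'
Row ID,Customer Name,Sales
1,Claire Gute,261.96
2,Darrin Van Huff,14.62
3,Claire Gute,731.94
EOF

# Group two commands so that both write into the same output file:
# first the header line, then every line that matches the pattern.
{
  head -n 1 "$workdir/sales.csv"
  grep 'Claire Gute' "$workdir/sales.csv"
} > "$workdir/claire.csv"

line_count=$(wc -l < "$workdir/claire.csv" | tr -d ' ')
first_line=$(head -n 1 "$workdir/claire.csv")
cat "$workdir/claire.csv"
```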

## Filtering with csvgrep

[csvkit](https://csvkit.readthedocs.io/en/1.0.3/) is an amazing package with lots of cool CSV utilities for use on the command line. `csvgrep` is one of them.

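It filters tabular data down to the rows where certain columns contain a given value or match a regular expression. In the example below, `-c 2` selects the second column (`Order ID`) and `-r` supplies the regular expression, so we keep only the orders whose ID starts with `CA-2016`: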

In [25]:
! csvgrep -c 2 -r '^CA-2016' SampleSuperstore.csv

Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
1.0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136
2.0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back",731.9399999999999,3.0,0.0,219.58199999999997
3.0,CA-2016-138688,2016-06-12,2016-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters by Universal,14.62,2.0,0.0,6.8713999999999995
14.0,CA-2016-161389,2016-12-05,2016-12-1

5658.0,CA-2016-145261,2016-12-18,2016-12-21,First Class,AH-10120,Adrian Hane,Home Office,United States,Salem,Oregon,97301.0,West,TEC-AC-10000991,Technology,Accessories,Sony Micro Vault Click 8 GB USB 2.0 Flash Drive,112.77600000000001,3.0,0.2,-8.458199999999998
5659.0,CA-2016-145261,2016-12-18,2016-12-21,First Class,AH-10120,Adrian Hane,Home Office,United States,Salem,Oregon,97301.0,West,FUR-TA-10002530,Furniture,Tables,"Iceberg OfficeWorks 42"" Round Tables",377.45,5.0,0.5,-264.21500000000003
5660.0,CA-2016-145261,2016-12-18,2016-12-21,First Class,AH-10120,Adrian Hane,Home Office,United States,Salem,Oregon,97301.0,West,OFF-LA-10000407,Office Supplies,Labels,Avery White Multi-Purpose Labels,15.936000000000002,4.0,0.2,5.1792
5661.0,CA-2016-145261,2016-12-18,2016-12-21,First Class,AH-10120,Adrian Hane,Home Office,United States,Salem,Oregon,97301.0,West,TEC-PH-10004833,Technology,Phones,Macally Suction Cup Mount,28.68,3.0,0.2,-7.17
5662.0,CA-2016-108875,2016-09-24,2016-10-01,Standard 

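To see why matching on a specific column matters, here is a small sketch comparing plain `grep`, which matches anywhere in the line, with a column-aware filter. Since csvkit may not be installed everywhere, the sketch uses `awk -F,` as a stand-in for `csvgrep -c 2 -r`; unlike `csvgrep`, that stand-in does not understand quoted fields that contain commas. The rows, including the `Note` column, are invented for illustration:

```bash
# Hypothetical scratch directory and invented sample rows.
workdir=$(mktemp -d)
cat > "$workdir/orders.csv" <<'EOF'
Row ID,Order ID,Note
1,CA-2016-152156,first order
2,US-2015-108966,follow-up to CA-2016-152156
3,CA-2016-138688,none
EOF

# grep matches the text anywhere in the line, so row 2 is counted as well.
anywhere=$(grep -c 'CA-2016' "$workdir/orders.csv")

# awk splits each line on commas and tests only field 2 (the Order ID).
# NR > 1 skips the header line.
in_column=$(awk -F, 'NR > 1 && $2 ~ /^CA-2016/' "$workdir/orders.csv" | wc -l | tr -d ' ')

echo "grep anywhere: $anywhere, column 2 only: $in_column"
```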
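We can also filter by customer ID. First, let's peek at the customer ID column (the 6th) with `cut`; then `csvgrep` keeps only the rows whose customer ID is exactly `CG-12520` or `IM-15070`: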

In [28]:
! cut -d "," -f 6 SampleSuperstore.csv | head

Customer ID
CG-12520
CG-12520
DV-13045
SO-20335
SO-20335
BH-11710
BH-11710
BH-11710
BH-11710
cut: stdout: Broken pipe

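`cut` works in the cell above because no field before column 6 contains a comma. It simply splits on every comma, so a quoted field with an embedded comma (like some of the product names) shifts the columns that follow it. A small sketch with two made-up, shortened rows shows the effect:

```bash
# Hypothetical scratch directory and shortened, invented sample rows.
workdir=$(mktemp -d)
cat > "$workdir/products.csv" <<'EOF'
Row ID,Product Name,Sales
1,Bush Somerset Collection Bookcase,261.96
2,"Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back",731.94
EOF

# Ask for the third comma-separated field (we expect the Sales value).
good_row=$(sed -n '2p' "$workdir/products.csv" | cut -d, -f3)
bad_row=$(sed -n '3p' "$workdir/products.csv" | cut -d, -f3)

# The second data row yields part of the product name instead of the sales figure.
echo "row 1 field 3: [$good_row]"
echo "row 2 field 3: [$bad_row]"
```

A CSV-aware tool such as `csvcut` parses the quotes, which is the main reason to prefer it over `cut` for anything but the simplest files.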

In [29]:
! csvgrep -c 6 -r '^(CG-12520|IM-15070)$' SampleSuperstore.csv

Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
1.0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136
2.0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back",731.9399999999999,3.0,0.0,219.58199999999997
14.0,CA-2016-161389,2016-12-05,2016-12-10,Standard Class,IM-15070,Irene Maddox,Consumer,United States,Seattle,Washington,98103.0,West,OFF-BI-10003656,Office Supplies,Binders,Fellowes PB200 Plastic Comb Binding Machine,407.97600000000006,3.0,0.2,132.59219999999993
1253.0,CA-2015-154956,2015-07-04,2015-07-09,S

## Beginning, end of files

If we'd like to see just the header row and the first data row, we can use `head`:

In [7]:
! head -2 SampleSuperstore.csv

Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
1.0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2.0,0.0,41.9136


If, on the other hand, we want to see the end of the file, we can use `tail`, which prints the last 10 lines by default (I pipe it to `head` so that I see only the first two of those lines):

In [9]:
! tail SampleSuperstore.csv | head -2

9985.0,CA-2015-100251,2015-05-17,2015-05-23,Standard Class,DV-13465,Dianna Vittorini,Consumer,United States,Long Beach,New York,11561.0,East,OFF-LA-10003766,Office Supplies,Labels,Self-Adhesive Removable Labels,31.5,10.0,0.0,15.120000000000001
9986.0,CA-2015-100251,2015-05-17,2015-05-23,Standard Class,DV-13465,Dianna Vittorini,Consumer,United States,Long Beach,New York,11561.0,East,OFF-SU-10000898,Office Supplies,Supplies,"Acme Hot Forged Carbon Steel Scissors with Nickel-Plated Handles, 3 7/8"" Cut, 8""L",55.6,4.0,0.0,16.123999999999995


Here I have *piped* the output of `tail` to the `head` command to print just the first two rows. This is especially handy with `tail -n +2`, which prints everything but the header row and would otherwise produce many thousands of lines. We can pipe many commands together, sending the output of one command as input to the next command.

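The three most common ways of slicing a file by position can be sketched on a tiny made-up file: `head -n 1` for the header, `tail -n +2` for everything after the header, and `tail -n 2` for the last two lines.

```bash
# Hypothetical scratch directory and a tiny invented file.
workdir=$(mktemp -d)
printf 'Row ID,Sales\n1,10\n2,20\n3,30\n4,40\n' > "$workdir/tiny.csv"

# First line only (the header).
header=$(head -n 1 "$workdir/tiny.csv")

# From line 2 to the end (everything but the header).
body=$(tail -n +2 "$workdir/tiny.csv")
body_lines=$(tail -n +2 "$workdir/tiny.csv" | wc -l | tr -d ' ')

# The last two lines of the file.
last_two=$(tail -n 2 "$workdir/tiny.csv")

echo "header: $header"
echo "data rows: $body_lines"
echo "$last_two"
```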
### Exercise

Count how many sales items there are in the `Technology` product category that are also `High` order priorities. Hint: `wc -l` counts the number of lines. (This version of `SampleSuperstore.csv` has no order priority column, so the count below is 0.)

In [10]:
! grep Technology, SampleSuperstore.csv | grep High, | wc -l

       0

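Chaining filters like this acts as an AND: a line must pass every `grep` to reach `wc -l`. Here is a minimal sketch on a made-up file that does have an order priority column; the column layout and rows are assumptions for illustration only:

```bash
# Hypothetical scratch directory and invented rows with a priority column.
workdir=$(mktemp -d)
cat > "$workdir/priorities.csv" <<'EOF'
Row ID,Order Priority,Category,Sales
1,High,Technology,120.50
2,Low,Technology,80.00
3,High,Furniture,261.96
4,High,Technology,45.10
EOF

# Keep lines containing "Technology," and then, of those, lines containing "High,".
both=$(grep 'Technology,' "$workdir/priorities.csv" | grep 'High,' | wc -l | tr -d ' ')

# grep -c counts matching lines directly, so the final wc -l is optional.
both_c=$(grep 'Technology,' "$workdir/priorities.csv" | grep -c 'High,')

echo "$both $both_c"
```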

## Extracting columns with csvcut

Extracting columns is also pretty easy with `csvcut`. For example, let's say we wanted to get the postal code column (which is 12th by my count).

In [13]:
! csvcut -c 12 -e latin1 SampleSuperstore.csv | head -10

Postal Code
42420.0
42420.0
90036.0
33311.0
33311.0
90032.0
90032.0
90032.0
90032.0


Actually, hang on a second. We don't want the `Postal Code` header to appear in the list, so we combine with `tail -n +2`, which prints from line 2 onward, to strip the header.

In [15]:
! csvcut -c 12 -e latin1 SampleSuperstore.csv | tail -n +2 | head -10

42420.0
42420.0
90036.0
33311.0
33311.0
90032.0
90032.0
90032.0
90032.0
90032.0
tail: stdout: Broken pipe


What if we want a unique list? All we have to do is sort and then call `uniq`:

In [16]:
! csvcut -c 12 -e latin1 SampleSuperstore.csv | tail -n +2 | sort | uniq | head -10

10009.0
10011.0
10024.0
10035.0
1040.0
10550.0
10701.0
10801.0
11520.0
11550.0

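The `sort` step is needed because `uniq` only collapses duplicate lines that are adjacent. A small sketch with made-up postal codes shows the difference; `sort -u` is a common shorthand for `sort | uniq`:

```bash
# Hypothetical scratch directory and invented values.
workdir=$(mktemp -d)
printf '90032\n10035\n90032\n10035\n' > "$workdir/codes.txt"

# No two equal lines are adjacent, so uniq alone removes nothing.
unsorted=$(uniq "$workdir/codes.txt" | wc -l | tr -d ' ')

# Sorting groups equal lines together so uniq can collapse them.
sorted=$(sort "$workdir/codes.txt" | uniq | wc -l | tr -d ' ')

# sort -u sorts and de-duplicates in one step.
shorthand=$(sort -u "$workdir/codes.txt" | wc -l | tr -d ' ')

echo "$unsorted $sorted $shorthand"
```

Note that `sort` compares lines as text by default, which is why `1040.0` lands between `10035.0` and `10550.0` in the output above; add `-n` if you want numeric order.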

You can get multiple columns at once in the order specified. For example, here is how to get the order ID and the postal code together (postal code first, then order ID):

In [17]:
! csvcut -c 12,2 -e latin1 SampleSuperstore.csv | head -10

Postal Code,Order ID
42420.0,CA-2016-152156
42420.0,CA-2016-152156
90036.0,CA-2016-138688
33311.0,US-2015-108966
33311.0,US-2015-108966
90032.0,CA-2014-115812
90032.0,CA-2014-115812
90032.0,CA-2014-115812
90032.0,CA-2014-115812


Naturally, we can write any of this output to a file using the `>` redirection operator. Let's do that and put each of those columns into a separate file and then `paste` them back with the postal code first. Note that `paste` joins the columns with a tab by default.

In [18]:
! csvcut -c 2 -e latin1 SampleSuperstore.csv > /tmp/IDs
! csvcut -c 12 -e latin1 SampleSuperstore.csv > /tmp/postal_codes
! paste /tmp/postal_codes /tmp/IDs | head -10

Postal Code	Order ID
42420.0	CA-2016-152156
42420.0	CA-2016-152156
90036.0	CA-2016-138688
33311.0	US-2015-108966
33311.0	US-2015-108966
90032.0	CA-2014-115812
90032.0	CA-2014-115812
90032.0	CA-2014-115812
90032.0	CA-2014-115812
paste: stdout: Broken pipe

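If the pasted result should stay comma-separated, `paste` accepts a delimiter with `-d`. The sketch below uses two small made-up column files; it assumes none of the values need CSV quoting, since `paste` just glues lines together:

```bash
# Hypothetical scratch directory and invented single-column files.
workdir=$(mktemp -d)
printf 'Postal Code\n42420\n90036\n' > "$workdir/codes"
printf 'Order ID\nCA-2016-152156\nCA-2016-138688\n' > "$workdir/ids"

# Default: corresponding lines are joined with a tab.
tabbed=$(paste "$workdir/codes" "$workdir/ids" | sed -n '2p')

# -d, joins with a comma instead, so the result is still comma-separated.
comma=$(paste -d, "$workdir/codes" "$workdir/ids" | sed -n '2p')

echo "$tabbed"
echo "$comma"
```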

Amazing, right?! This is often a very efficient means of manipulating data files because you are running small, specialized programs directly from the shell instead of going through Python libraries (csvkit aside, which is itself written in Python). We also don't have to write a program; we just have to know some syntax for terminal commands.

Not impressed yet? Ok, how about creating a histogram indicating the number of sales per postal code, sorted in reverse numerical order? We've already got the list of postal codes and we can use an argument on `uniq` to get the count instead of just making a unique set. Then, we can use a second `sort` with arguments to reverse sort and use numeric rather than text-based sorting. This gives us a histogram:

In [19]:
! csvcut -c 12 -e latin1 SampleSuperstore.csv | tail -n +2 | sort | uniq -c | sort -r -n | head -10

 263 10035.0
 230 10024.0
 229 10009.0
 203 94122.0
 193 10011.0
 166 94110.0
 165 98105.0
 160 19134.0
 151 98103.0
 151 90049.0

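To make each stage of that pipeline concrete, here is a minimal sketch on six made-up postal codes. The extra `awk` step is only there to trim the padding that `uniq -c` puts in front of each count, which varies between systems:

```bash
# Hypothetical scratch directory and invented values.
workdir=$(mktemp -d)
printf '10035\n94122\n10035\n10024\n10035\n94122\n' > "$workdir/codes.txt"

# sort      : group equal values together
# uniq -c   : collapse each group and prefix it with its count
# sort -rn  : order by that count, numerically, largest first
# awk       : normalize whitespace to "count value"
histogram=$(sort "$workdir/codes.txt" | uniq -c | sort -rn | awk '{print $1, $2}')

top=$(printf '%s\n' "$histogram" | head -n 1)
echo "$histogram"
```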

### Exercise

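Modify the command so that you get a histogram of the shipping mode. Note that the solution below was run against a different file, `data/SampleSuperstoreSales.csv`, in which the shipping mode is column 8; in `SampleSuperstore.csv` it is column 5 (`Ship Mode`).

In [14]:
! csvcut -c 8 -e latin1 data/SampleSuperstoreSales.csv | tail -n +2 | sort | uniq -c | sort -r -n | head -10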

6270 Regular Air
1146 Delivery Truck
 983 Express Air
