Hotels data pipeline
====================

We'll use this notebook to show how to build a data pipeline to process data. It can also be rerun with new data if need be.

We'll be using Hotel Occupancy Tax Receipts data, which can be used to see which hotels around the state pull in the most money. For more information about the data itself, see this [README.md](https://github.com/utdata/cli-tools/blob/master/data/hoteltax/README.md).

## The Goal

* We're going to download one year of quarterly report files.
* We'll add a header row to each file.
* We'll then pull out just hotels in Austin.
* We'll convert them into normalized, well-formatted csv files.
* We'll then put all the Austin files into a single big file.

(One note on this ... I would usually normalize all the data and not just the Austin records, but it takes a little while to run on big files, so we'll cut it down first.)

## Windows

I'm having trouble getting the bash_kernel to work on Windows. You can use `Git Bash` to run all these commands separately, without being in a notebook. It's not ideal because you can't annotate them or rerun them unless we put them in a Bash script. You might want to use a lab computer instead.

## Cleanup

This resets this project to a beginning point. Not certain I'll keep it for realz.

A couple of very important things:

* The `cd` path needs to go to our class code directory, which I suggest you create in your Documents folder, calling it **rwd**.
* The project name, `hotels` in this case, needs to be set to something that doesn't already exist for something else. In other words, if you use this to start your own project, make sure you set a new project name so you don't delete this hotels work.
* Start `jupyter notebook` from your class folder and save your .ipynb files there. Do NOT save your ipynb file inside your project folder or you will delete it with the following commands!!!!!

In [1]:
# cd to working directory
cd ~/Documents/rwd/
# create the hotels directory if it isn't there
mkdir -p hotels
# remove contents of hotels
rm -rf hotels/*
# move into it
cd hotels
# make sure you are there
pwd

/Users/christian/Documents/rwd/hotels
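
If you adapt this cell for your own project, one cautious variation is to chain the steps with `&&` so the `rm` only runs when the `cd` before it worked. This is a minimal sketch, not what the cell above does: `CLASS_DIR` and `PROJECT` are hypothetical variable names, and a temporary folder stands in for `~/Documents/rwd` so nothing real gets deleted.

```bash
# Hypothetical variables for illustration; a temp folder stands in for ~/Documents/rwd.
CLASS_DIR="$(mktemp -d)"
PROJECT="hotels"

# Each step runs only if the one before it succeeded.
# ${PROJECT:?} stops with an error if PROJECT is ever empty, so rm never sees "/*".
cd "$CLASS_DIR" \
  && mkdir -p "$PROJECT" \
  && rm -rf "${PROJECT:?}"/* \
  && cd "$PROJECT" \
  && pwd
```

To use it for real, you would point `CLASS_DIR` at your class folder and set `PROJECT` to a name that isn't already taken.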



## Getting the data

We'll use a new command called `curl` to download our data. For more information on curl, you can read the [man page](https://curl.haxx.se/docs/manpage.html) or this [handy tip sheet](http://www.thegeekstuff.com/2012/04/curl-examples/), which is much more understandable.

Since we are in a Bash notebook, we can use our command-line tools. Let's make sure we know where we are. Type in `pwd` in the prompt below and then do shift-return to execute the command.

In [2]:
# make sure we are starting in the `hotels` directory
# if you aren't, then get there.
pwd

/Users/christian/Documents/rwd/hotels


## Getting the raw data
OK, we will create a new folder called `data`, where we will store our raw data:

In [3]:
# create the directory
mkdir data
# move into the directory
cd data
# show we are inside `hotels/data`
pwd

/Users/christian/Documents/rwd/hotels/data


### Look at the data
If you read about the tax data in the intro, you'll see you can get it from the comptroller, but they have some naming issues, so to help with this assignment, I've saved the data and we'll pull it down from this [github repo](https://github.com/utdata/cli-tools/tree/master/hoteltax/data). Open that link in a new window and look at it.

* `header.csv` is a file I created that has the header names. It will get used later.
* The next 10 files that end with `hotl15XX.csv` are monthly files for 2015.
* The last four that end with `hotl15qX.csv` are the quarterly files we are after for this example.

### curl
Now we'll use `curl` to pull down a quarterly hotel tax file. Let's do it first as it takes a couple seconds to run, then I'll explain it.

In [4]:
curl -O -L https://raw.githubusercontent.com/utdata/cli-tools/master/data/hoteltax/data/hotl15q1.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4674k  100 4674k    0     0   941k      0  0:00:04  0:00:04 --:--:-- 1365k


Let's break down that `curl` statement.

* `curl` is the command. I think of it as "capture URL". [man curl](http://man.cx/curl)
* `-O` (that's capital O, not zero). This saves the result to a file on your computer instead of printing it to your screen, using the same file name as the original.
* `-L` stands for `--location`, and it will allow the request to follow a redirect link. It's good to use it.
* And then we have the url of the file.
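
As a sketch of how those flags might be bundled for reuse, here is a small wrapper. The name `fetch_csv` is hypothetical, and the extra flags `-f` and `-sS` are additions the cell above doesn't use: `-f` makes `curl` report an error on something like a 404 instead of quietly saving the server's error page under your csv name, and `-sS` hides the progress meter while still showing error messages.

```bash
# fetch_csv URL  (hypothetical helper name)
# -f   exit with an error on HTTP failures such as 404
# -sS  hide the progress meter but still print error messages
# -L   follow redirects
# -O   save under the same file name the URL ends with
fetch_csv() {
  curl -f -sS -L -O "$1"
}
```

You would call it with the same URL as above, for example `fetch_csv https://raw.githubusercontent.com/utdata/cli-tools/master/data/hoteltax/data/hotl15q1.csv`, and then check the result with `ls -l`.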

Let's check that the file made it to our computer, and that there is some data in it:

In [31]:
ls -l

total 9352
-rw-r--r--  1 christian  staff  4786600 Jul 17 13:52 hotl15q1.csv


Looks like it is there, and it is 4.7M. Pretty big file.

Because our file names are well formatted, we can pull down multiple files at the same time. The quarterly file names have `hotl15` followed by `q` for quarter, then a number for that quarter. `curl` has a feature that lets us request a numeric or alphabetic sequence in the URL by putting the range in square brackets, `[]`.

* `file[1-4].csv` would get you file1.csv, file2.csv, file3.csv and file4.csv.
* If the numbers are zero-padded, like file01.csv, then you can do it like this: `file[01-04].csv`.

Ours don't have the zero padding, but the monthly files do. Remember that for your assignment ;-).
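
If you want to preview which file names a range stands for before downloading anything, a rough stand-in is to print them yourself. This sketch uses `printf` rather than `curl`, and `preview_range` is a hypothetical helper name; it only imitates the naming, it doesn't fetch anything.

```bash
# preview_range PATTERN NUMBER...  (hypothetical helper)
# Prints one file name per number, filling the number into the printf pattern.
preview_range() {
  local pattern="$1"
  shift
  local i
  for i in "$@"; do
    printf "$pattern\n" "$i"
  done
}

# Like hotl15q[1-4].csv: no padding.
preview_range 'hotl15q%d.csv' 1 2 3 4

# Like file[01-04].csv: %02d pads each number to two digits.
preview_range 'file%02d.csv' 1 2 3 4
```

When you run the real `curl` command, it's worth putting the URL in quotes so Bash hands the square brackets to `curl` untouched.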

Let's pull down the four quarterly files:

In [5]:
curl -O -L https://raw.githubusercontent.com/utdata/cli-tools/master/data/hoteltax/data/hotl15q[1-4].csv


[1/4]: https://raw.githubusercontent.com/utdata/cli-tools/master/data/hoteltax/data/hotl15q1.csv --> hotl15q1.csv
--_curl_--https://raw.githubusercontent.com/utdata/cli-tools/master/data/hoteltax/data/hotl15q1.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4674k  100 4674k    0     0   719k      0  0:00:06  0:00:06 --:-

This will pull down all 4 quarterly files. It takes a little bit, and Jupyter Notebooks will have an asterisk for that line until it is complete. Once it is, we take a look to make sure we have all the files.

In [6]:
ls -l

total 38384
-rw-r--r--  1 christian  staff  4786600 Jul 17 14:10 hotl15q1.csv
-rw-r--r--  1 christian  staff  4883450 Jul 17 14:10 hotl15q2.csv
-rw-r--r--  1 christian  staff  4913675 Jul 17 14:10 hotl15q3.csv
-rw-r--r--  1 christian  staff  5062200 Jul 17 14:10 hotl15q4.csv


## Adding a header row

Let's take a look at the top of the first file. We'll use a command called `head`, which shows the first ten lines of a file. We'll also show that tab complete works here, so type in `head hot` and then hit tab, and you'll get a pop-up that shows available files to choose from. Choose the right one, then use shift-return to execute.

In [7]:
head hotl15q1.csv

32015066601,"JENNIFER SALES                                    ","6363 S NETHERLAND WAY                   ","CENTENNIAL          ","CO","80016",000,00001,"PEAK2PAR                                          ","22018 E COSTILLA DR                     ","AURORA              ","CO","80016",   ,    1,     10122.08,      6372.08
32054882710,"HOLIDAY HACIENDAS LLC                             ","25500 BRUSH COLLEGE RD                  ","HARRISONVILLE       ","MO","64701",000,00001,"HOLIDAY HACIENDAS LLC                             ","25500 BRUSH COLLEGE RD                  ","HARRISONVILLE       ","MO","64701",   ,    3,      4924.00,      4924.00
32038336601,"HAGEMAN RESERVE LLC                               ","147 DUNHAM RANCH RD                     ","SULPHUR BLUFF       ","TX","75481",112,00003,"HAGEMAN RESERVE LLC                               ","8910 PURDUE RD                          ","INDIANAPOLIS        ","IN","46268",   ,   29,     51800.79,     51800.79
32051814922,"BRETT BAE

OK, this looks like something, but we don't have a header row, which sucks. We can compare it against our [table layout](https://github.com/utdata/cli-tools/blob/master/hoteltax/HOTELTAX_LYOT.TXT), copied here:

```
Column_Order|Column_Description|Data_Type|Size
Col01|Taxpayer Number|Number|11
Col02|Taxpayer Name|Char|50
Col03|Taxpayer Address|Char|40
Col04|Taxpayer City|Char|20
Col05|Taxpayer State|Char|2
Col06|Taxpayer Zip Code|Number|5
Col07|Taxpayer County|Number|3
Col08|Outlet Number|Number|5
Col09|Location Name|Char|50
Col10|Location Address|Char|40
Col11|Location City|Char|20
Col12|Location State|Char|2
Col13|Location Zip Code|Number|5
Col14|Location County|Number|3
Col15|Location Room Capacity|Number|5
Col16|Location Tot Room Receipts|Number|13
Col17|Location Taxable Receipts|Number|13

```

### editing a file in place

We need to add a header row to all of these files or we'll have problems later. This is one place that we are making changes to the original file, though we'll make a backup while we're doing it, and then remove the backups. We could do this manually, but even with a text editor, it can be tricky because the files are so big.

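We'll use a command-line program called `sed`. I'll be honest, this took me a couple of hours to figure out, especially since the `sed` on a Mac handles `-i` differently than the GNU `sed` on Linux, and I thought I was going crazy. The Mac version expects a backup extension (the `'.bak'` below) as a separate argument after `-i`, so this is possibly a Mac-only solution.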

First, we need the text that will go in the header row. I built this by hand based on the layout file above:

```
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
```

Here is the command to change one of the files, and then I'll explain it:

In [8]:
# adding header row to q1
# if on Windows, leave out the '.bak' part
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl15q1.csv



OK, let's take a look at that file to make sure the header line was added properly.

In [9]:
# checking header q1
head -n 5 hotl15q1.csv

Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
32015066601,"JENNIFER SALES                                    ","6363 S NETHERLAND WAY                   ","CENTENNIAL          ","CO","80016",000,00001,"PEAK2PAR                                          ","22018 E COSTILLA DR                     ","AURORA              ","CO","80016",   ,    1,     10122.08,      6372.08
32054882710,"HOLIDAY HACIENDAS LLC                             ","25500 BRUSH COLLEGE RD                  ","HARRISONVILLE       ","MO","64701",000,00001,"HOLIDAY HACIENDAS LLC                             ","25500 BRUSH COLLEGE RD                  ","HARRISONVILLE       ","MO","64701",   ,    3,      4924.00,      4924.00
32038336601,"HAGEMAN RESERVE LLC                     

Looks good. Now, let's do this for the other 3 files. (We don't want to do the first one again, or we'll add the header twice.)

In [10]:
# adding headers for q2-q4
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl15q2.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl15q3.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl15q4.csv
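
If the `sed -i` differences get in your way, one alternative is to skip `sed` and build the new file with `printf` and `cat`. This is a sketch under my own assumptions, not the method used above: `add_header` is a hypothetical helper, and the demo runs on a tiny made-up file in a temp folder. It also checks the first line before doing anything, so running it twice won't add the header twice.

```bash
# add_header FILE HEADER  (hypothetical helper)
# Puts HEADER on the first line of FILE unless it is already there.
add_header() {
  local file="$1" header="$2"
  # If the first line already matches, do nothing, so reruns are safe.
  if [ "$(head -n 1 "$file")" = "$header" ]; then
    return 0
  fi
  # Write the header plus the original rows to a temp file, then swap it in.
  { printf '%s\n' "$header"; cat "$file"; } > "$file.tmp" && mv "$file.tmp" "$file"
}

# Try it on a small made-up file in a temp folder.
demo="$(mktemp -d)/demo.csv"
printf '1,"A"\n2,"B"\n' > "$demo"
add_header "$demo" "Id,Name"
add_header "$demo" "Id,Name"   # second call changes nothing
cat "$demo"
```

For the real files, you would pass each quarterly file name and the full header text from above in place of the demo values.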



Let's check the first two lines of all the files to make sure it is OK.

In [11]:
# checking for header rows
head -n 2 hotl15q1.csv
head -n 2 hotl15q2.csv
head -n 2 hotl15q3.csv
head -n 2 hotl15q4.csv

Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
32015066601,"JENNIFER SALES                                    ","6363 S NETHERLAND WAY                   ","CENTENNIAL          ","CO","80016",000,00001,"PEAK2PAR                                          ","22018 E COSTILLA DR                     ","AURORA              ","CO","80016",   ,    1,     10122.08,      6372.08
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
32015066601,"JENNIFER SALES                                    ","6363 S NETHERLAND 

### removing bak files
Now, one thing about this ... The `sed` command created a bunch of *.bak* files as backups that we'll need to get rid of. To do so, we get to use the power of `rm`.

(If you ran the `sed` command on Windows without the `'.bak'` part, you won't have the backup files.)

Let's look at the directory first:


In [12]:
ls -l

total 76768
-rw-r--r--  1 christian  staff  4786894 Jul 17 14:10 hotl15q1.csv
-rw-r--r--  1 christian  staff  4786600 Jul 17 14:10 hotl15q1.csv.bak
-rw-r--r--  1 christian  staff  4883744 Jul 17 14:10 hotl15q2.csv
-rw-r--r--  1 christian  staff  4883450 Jul 17 14:10 hotl15q2.csv.bak
-rw-r--r--  1 christian  staff  4913969 Jul 17 14:10 hotl15q3.csv
-rw-r--r--  1 christian  staff  4913675 Jul 17 14:10 hotl15q3.csv.bak
-rw-r--r--  1 christian  staff  5062494 Jul 17 14:10 hotl15q4.csv
-rw-r--r--  1 christian  staff  5062200 Jul 17 14:10 hotl15q4.csv.bak


We'll remove all the files with .bak in the name, and then `ls` the directory again.

In [13]:
# removes files that end with .bak
rm *.bak
# lists the directory again to check
ls -l

total 38384
-rw-r--r--  1 christian  staff  4786894 Jul 17 14:10 hotl15q1.csv
-rw-r--r--  1 christian  staff  4883744 Jul 17 14:10 hotl15q2.csv
-rw-r--r--  1 christian  staff  4913969 Jul 17 14:10 hotl15q3.csv
-rw-r--r--  1 christian  staff  5062494 Jul 17 14:10 hotl15q4.csv


## grep for Austin, normalize the files

A couple of things about that file. Note that Col02, Taxpayer Name, is 50 characters long. Look at the first line of data:

```
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
32015066601,"JENNIFER SALES                                    ","6363 S NETHERLAND WAY                   ","CENTENNIAL          ","CO","80016",000,00001,"PEAK2PAR                                          ","22018 E COSTILLA DR                     ","AURORA              ","CO","80016",   ,    1,     10122.08,      6372.08
```

Notice anything about that "JENNIFER SALES" name? That name is in quotes, but look where the quote mark closes. That field uses the full 50 characters, filled in with spaces. Same for the address, and all the other fields. This is one of the things **csvkit** can help us with, cleaning up and normalizing a csv file. (You might want to have the [csvkit docs](https://csvkit.readthedocs.io) open as a reference as we go through this.)

We want to use this technique on our files, but they are pretty big and it would take some time. So, to save some time and to show you the power of **pipes**, we'll find just our Austin hotels and then normalize that data.

Let's move out of our data directory and create a new one to put our processed data into, so it is separated from the raw data.

In [14]:
# moving out of data into hotels
cd ../
# making sure we are in hotels
pwd

/Users/christian/Documents/rwd/hotels


In [15]:
# make our data-done directory
# we use the -p here so it won't error if it already exists
mkdir -p data-done
# list just to show it was created
ls -l

total 0
drwxr-xr-x  6 christian  staff  204 Jul 17 14:11 data
drwxr-xr-x  2 christian  staff   68 Jul 17 14:11 data-done


### csvcut to see columns

Let's use [csvcut](https://csvkit.readthedocs.io/en/540/scripts/csvcut.html) with the `-n` option to peek at the header row of the first file to confirm our Location City column. The `-n` stands for `--names`, because it is usually used to see the header row.

In [16]:
csvcut -n data/hotl15q1.csv

  1: Taxpayer Number
  2: Taxpayer Name
  3: Taxpayer Address
  4: Taxpayer City
  5: Taxpayer State
  6: Taxpayer Zip Code
  7: Taxpayer County
  8: Outlet Number
  9: Location Name
 10: Location Address
 11: Location City
 12: Location State
 13: Location Zip Code
 14: Location County
 15: Location Room Capacity
 16: Location Tot Room Receipts
 17: Location Taxable Receipts


### grep for Austin

OK, we can see `Location City` is in the 11th column, which looks good. We can use `csvgrep` to find the AUSTIN rows.

Grep is a command-line tool for searching text with regular expressions, and [csvgrep](https://csvkit.readthedocs.io/en/540/scripts/csvgrep.html) works the same way on a single column. It needs a couple of arguments:

* `-c` is which column to search. We want `11`.
* `-m` you would use to match an exact string. We could instead use `-r` to build a regular expression.

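We know we need column 11, but the match word will be tricky. We can't search for just the word "AUSTIN" because the value in that field is padded with spaces out to its full 20 characters, as the layout above shows. Instead we'll use `-r` with a regular expression that anchors AUSTIN to the start of the field.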

The other thing we are going to do is to **pipe** the result into another command, `head` in this case. This is so we can just look at the first couple of lines to test our output before using it. This **pipe** concept is really important: You can take the output of one command and make it the input of another, and you can string these together into a pipeline. That's what we're working on here, piece by piece ... a pipeline to cut and clean our files.
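
Here is the pipe idea in miniature, using plain `grep` and `wc` on a few made-up city values instead of csvkit on the real file. It's only a sketch: `cities` is a hypothetical helper, "PORT AUSTIN" is an invented value, and plain `grep` looks at whole lines where `csvgrep` looks at one column.

```bash
# Six made-up city values, padded with trailing spaces to 20 characters.
cities() {
  printf '%-20s\n' "AUSTIN" "HOUSTON" "AUSTIN" "PORT AUSTIN" "DALLAS" "AUSTIN"
}

# Step 1 alone: every value.
cities

# Step 1 | step 2: keep lines that start with AUSTIN (the ^ anchors the match),
# which leaves out PORT AUSTIN.
cities | grep '^AUSTIN'

# Step 1 | step 2 | step 3: count what is left.
cities | grep '^AUSTIN' | wc -l
```

Building the pipeline one step at a time like this makes it easy to check each piece before adding the next.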

So, our `csvgrep` command searches the 11th column for the word AUSTIN at the beginning (the ^). We then pipe it into `head` and use the `-n` flag to show just 5 lines.

In [17]:
# grep for austin, show first 5 lines
csvgrep -c 11 -r '^AUSTIN' data/hotl15q1.csv | head -n 5

Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
32051871906,DSN HOSPITALITY LLC                               ,4710 S LAMAR BLVD                       ,AUSTIN              ,TX,78745,227,00001,DSN HOSPITALITY LLC                               ,3110 STATE HIGHWAY 71 EAST              ,AUSTIN              ,TX,78745,011,   37,     91205.03,     90870.01
32054409241,JEANETTE WELSHE                                   ,13801 EVERGREEN WAY                     ,AUSTIN              ,TX,78737,105,00001,BED AND BREAKFAST                                 ,13801 EVERGREEN WAY                     ,AUSTIN              ,TX,78737,105,    4,      5417.92,      5417.92
32047098168,AMY MARIE CAPUTO                                  ,13601 PAISANO CIR               

OK, that looks like we are getting the right rows. Now we are going to pipe that result into `in2csv` to normalize it and then again into `head` so we just look at the top of the file.

In this case, `in2csv` needs a `-f` flag for filetype, which we will set as `csv`.

In [18]:
# grep for austin, convert to csv, show first five lines
csvgrep -c 11 -r "^AUSTIN" data/hotl15q1.csv | in2csv -f csv | head -n 5

Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
32051871906,DSN HOSPITALITY LLC,4710 S LAMAR BLVD,AUSTIN,TX,78745,227,00001,DSN HOSPITALITY LLC,3110 STATE HIGHWAY 71 EAST,AUSTIN,TX,78745,011,37,91205.03,90870.01
32054409241,JEANETTE WELSHE,13801 EVERGREEN WAY,AUSTIN,TX,78737,105,00001,BED AND BREAKFAST,13801 EVERGREEN WAY,AUSTIN,TX,78737,105,4,5417.92,5417.92
32047098168,AMY MARIE CAPUTO,13601 PAISANO CIR,AUSTIN,TX,78737,105,00001,FLORA PROPERTIES/AMY M. CAPUTO,13601 PAISANO CIR,AUSTIN,TX,78737,105,4,7280.23,7280.23
32055460730,NATHANIEL R BAUERNFEIND,163 KINLOCH CT,AUSTIN,TX,78737,105,00001,NATHANIEL R BAUERNFEIND,163 KINLOCH CT,AUSTIN,TX,78737,105,1,4735.0,4735.0


See the difference in the files now? All the bad space is gone. We can take that command above and instead of piping it into `head`, we can redirect it into a new file in the data-done folder. We do this with `>` and then specify the file location, which we'll call `hotl15q1-austin.csv`.

In [19]:
# grep for austin, convert to csv, put in new file
csvgrep -c 11 -r "^AUSTIN" data/hotl15q1.csv | in2csv -f csv > data-done/hotl15q1-austin.csv



List the new `data-done` directory to see that the finished file is there:

In [20]:
# check for newly-made file in data-done
ls -l data-done

total 448
-rw-r--r--  1 christian  staff  226903 Jul 17 14:11 hotl15q1-austin.csv


OK, let's go ahead and process all of the files so we have clean versions of the Austin records.

(FYI, There is a better way to do this with a loop of some sort, but I don't know how. Yet.)
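
One possible version of that loop, as a sketch: a Bash `for` loop that runs the same pipeline once per quarter. The function name `process_quarters` is hypothetical, and it assumes you are in the `hotels` folder with the raw files in `data` and an existing `data-done` folder.

```bash
# process_quarters  (hypothetical helper)
# Runs the Austin filter-and-normalize step for each of the four quarterly files.
# Assumes data/hotl15q1.csv through data/hotl15q4.csv exist and data-done/ is there.
process_quarters() {
  local q
  for q in 1 2 3 4; do
    csvgrep -c 11 -r "^AUSTIN" "data/hotl15q${q}.csv" \
      | in2csv -f csv > "data-done/hotl15q${q}-austin.csv"
  done
}
```

Defining the function doesn't run anything; you would type `process_quarters` on its own line to start it. The upside of a loop is that the file name only appears in one place, so there is less to mistype.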

When I set this up, I just copied that first one over 3 more times, then updated the file names in both places on each line.

In [21]:
# Make sure you get the file names right
# this will take a couple of seconds
csvgrep -c 11 -r "^AUSTIN" data/hotl15q1.csv | in2csv -f csv > data-done/hotl15q1-austin.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl15q2.csv | in2csv -f csv > data-done/hotl15q2-austin.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl15q3.csv | in2csv -f csv > data-done/hotl15q3-austin.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl15q4.csv | in2csv -f csv > data-done/hotl15q4-austin.csv




In [22]:
# making sure they are all there
ls -l data-done

total 1800
-rw-r--r--  1 christian  staff  226903 Jul 17 14:11 hotl15q1-austin.csv
-rw-r--r--  1 christian  staff  224339 Jul 17 14:11 hotl15q2-austin.csv
-rw-r--r--  1 christian  staff  225120 Jul 17 14:11 hotl15q3-austin.csv
-rw-r--r--  1 christian  staff  238415 Jul 17 14:11 hotl15q4-austin.csv


## Stack into a single file

Now we can use [csvstack](https://csvkit.readthedocs.io/en/540/scripts/csvstack.html) to combine all the files into one big file.

* **-g** flag lets us create a new column and give a value to each row that defines which file it came from. In our case, we need to know what quarter it came from, so we'll list all the quarters.
* **-n** lets us name that group column. We'll call it Quarter.

Then we list all the files we want to put together. When we use **-g**, we have to have the same number of group names as we do input files.

I'm breaking this command up into multiple lines using "\" at the end so you can see the whole command. The group names and the files need to be in the same order.
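
To see what that grouping column does, here is a rough imitation with `awk` on two tiny made-up files. This is only an illustration of the idea, not how `csvstack` works inside, and the file names and values are invented.

```bash
# Two tiny made-up files in a temp folder, each with its own header row.
work="$(mktemp -d)"
printf 'Name,Receipts\nHOTEL A,100\n' > "$work/q1.csv"
printf 'Name,Receipts\nHOTEL A,250\nHOTEL B,75\n' > "$work/q2.csv"

# FNR is the line number within the current file, NR counts across all files.
# On each file's first line, bump the file counter and print the header only once.
# Every other line gets its group label (Q1, Q2, ...) put in front of it.
awk -v OFS=, '
  FNR == 1 { q++; if (NR == 1) print "Quarter", $0; next }
  { print "Q" q, $0 }
' "$work/q1.csv" "$work/q2.csv" > "$work/stacked.csv"

cat "$work/stacked.csv"
```

The result has one header row, and every record carries the label of the file it came from, which is why the labels and the files have to be listed in the same order.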

In [23]:
csvstack -n Quarter -g Q1,Q2,Q3,Q4 \
data-done/hotl15q1-austin.csv data-done/hotl15q2-austin.csv \
data-done/hotl15q3-austin.csv data-done/hotl15q4-austin.csv \
> data-done/austin-hotels.csv



Let's `ls` that directory to make sure the new file is there:

In [24]:
ls -l data-done

total 3624
-rw-r--r--  1 christian  staff  931618 Jul 17 14:12 austin-hotels.csv
-rw-r--r--  1 christian  staff  226903 Jul 17 14:11 hotl15q1-austin.csv
-rw-r--r--  1 christian  staff  224339 Jul 17 14:11 hotl15q2-austin.csv
-rw-r--r--  1 christian  staff  225120 Jul 17 14:11 hotl15q3-austin.csv
-rw-r--r--  1 christian  staff  238415 Jul 17 14:11 hotl15q4-austin.csv


## Quick stats on files

We'll use [csvstat](https://csvkit.readthedocs.io/en/540/scripts/csvstat.html) to take a closer look at the combined file. Sometimes the result is all you need for a story ... the min, max, sum, mean and median of a particular column.

In [25]:
csvstat data-done/austin-hotels.csv

  1. Quarter
	<class 'str'>
	Nulls: False
	Values: Q4, Q2, Q3, Q1
  2. Taxpayer Number
	<class 'int'>
	Nulls: False
	Min: 10204561905
	Max: 32059607492
	Sum: 179205405825924
	Mean: 30348078886.69331
	Median: 32049933412
	Standard Deviation: 5292074890.428423
	Unique values: 1221
	5 most frequent values:
		32049933412:	829
		32052153940:	104
		32022337540:	96
		32043490237:	96
		12016274339:	48
  3. Taxpayer Name
	<class 'str'>
	Nulls: False
	Unique values: 1226
	5 most frequent values:
		TURNKEY VACATION RENTALS, INC.:	829
		EMERSON GUEST PROPERTIES LLC:	104
		CHEREEN FISHER:	96
		VACATIONCAKE LLC:	96
		ESA P PORTFOLIO OPERATING LESSEE LLC:	48
	Max length: 50
  4. Taxpayer Address
	<class 'str'>
	Nulls: False
	Unique values: 1176
	5 most frequent values:
		4544 S LAMAR BLVD STE G300:	465
		4544 S LAMAR BLVD BLDG 300:	364
		PO BOX 3089 C/O HOTSPOT TAX SERVICES:	123
		707 JOSEPHINE ST:	104
		1709 BLUEBONNET LN:	96
	Max length: 40
  5. Taxpayer Ci

## The result, and possible next steps

You can already see a couple of things here.

* The most a single hotel reported in a quarter in 2015 was 18,139,550.47
* The mean (or average) reported by all establishments was 177,591.70, but given the median is 6,064.93 there are many establishments that did not make that much money.
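
If the gap between the mean and the median seems odd, this small sketch shows how one very large value pulls the mean up while the median stays put. The five amounts are made up for illustration; they are not from the hotel data.

```bash
# Five made-up receipt amounts: four small ones and one very large one.
amounts="10 20 30 40 10000"

# Mean: add everything up and divide by how many there are.
mean=$(printf '%s\n' $amounts | awk '{ sum += $1 } END { print sum / NR }')

# Median: sort, then take the middle value (or average the two middle ones).
median=$(printf '%s\n' $amounts | sort -n | awk '
  { v[NR] = $1 }
  END {
    if (NR % 2) print v[(NR + 1) / 2]
    else print (v[NR / 2] + v[NR / 2 + 1]) / 2
  }')

echo "mean=$mean median=$median"
```

This prints `mean=2020 median=30`. A handful of big downtown hotels does the same thing to the Austin numbers.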

Now your single `austin-hotels` file can be analyzed so you are looking at one year of data all together.

By having this all in this notebook, you can run the whole process over again by going to the Kernel menu and choosing **Restart and Run All**.

If you find you had a mistake somewhere along the line, you can fix it, then **Restart and Run All**.

Imagine if you had done all this by hand in Excel, and then found you made an error early in the process. Or, worse yet, you didn't discover you made an error.


## Best quarter for a hotel

You can use these **csvkit** commands to make a pretty decent display of your data. We're going to string together a command to show the best quarters for hotels based on our processed file.

In [4]:
# just making sure I'm in the hotels folder
cd ~/Documents/rwd/hotels/
# get columns, sort by room receipts, get the top, csvlook for display
csvcut -c 1,10,11,17 data-done/austin-hotels.csv | csvsort -c 4 -r | head -n 10 | csvlook

|----------+--------------------------------+---------------------+-----------------------------|
|  Quarter | Location Name                  | Location Address    | Location Tot Room Receipts  |
|----------+--------------------------------+---------------------+-----------------------------|
|  Q4      | JW MARRIOTT AUSTIN DOWNTOWN    | 110 E 2ND ST        | 18139550.47                 |
|  Q2      | JW MARRIOTT AUSTIN DOWNTOWN    | 110 E 2ND ST        | 17954220.22                 |
|  Q3      | JW MARRIOTT AUSTIN DOWNTOWN    | 110 E 2ND ST        | 16639896.29                 |
|  Q1      | AUSTIN HILTON CONVENTION HOTEL | 500 E 4TH ST        | 15510707.02                 |
|  Q2      | AUSTIN HILTON CONVENTION HOTEL | 500 E 4TH ST        | 14011634.18                 |
|  Q4      | AUSTIN HILTON CONVENTION HOTEL | 500 E 4TH ST        | 12743324.32                 |
|  Q1      | JW MARRIOTT AUSTIN DOWNTOWN    | 110 E 2ND ST        | 11833556.27                 |
|  Q3     

Let's break down that command:

* `csvcut -c 1,10,11,17 data-done/austin-hotels.csv`. I'm using csvcut to get just certain columns. Remember you can figure out the column numbers by doing `csvcut -n filename` to get a printout of the first line, usually the header.
* `csvsort -c 4 -r` I'm using `csvsort` here, and I'm sorting on Location Tot Room Receipts, which is now the 4th column since we cut it. I set the `-r` to get descending order to get the most on the top.
* `head -n 10` to get just the first 10 lines from the result of the sort. The header row counts as one of them, so that is the top nine records.
* `csvlook` makes the nice little table view.

The order of this is important. You can't start with `head` or you'll only have the first 10 rows of the file to consider for your sort.
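
To see why the order matters, here is a small sketch with plain `sort` and `head` on a made-up four-record file. These stand in for `csvsort`, which takes care of the header row for you; here the header is set aside by hand.

```bash
# A tiny made-up file: a header row plus four records.
f="$(mktemp)"
printf 'Name,Receipts\nA,5\nB,50\nC,20\nD,40\n' > "$f"

# Sort first, then head: keep the header on top, sort the rest by
# column 2 (largest first), then take 3 lines = header + top 2 records.
right=$({ head -n 1 "$f"; tail -n +2 "$f" | sort -t, -k2,2nr; } | head -n 3)
echo "$right"

# head first, then sort: only the first 2 records (A and B) are ever considered.
wrong=$(head -n 3 "$f" | { read -r header; echo "$header"; sort -t, -k2,2nr; })
echo "$wrong"
```

The first version ends with B and D, the two largest values. The second ends with B and A, because D never made it past `head`.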