Hotels data pipeline
====================

We'll use this notebook to show how to build a data pipeline to process data. The notebook can be rerun with new data if need be.

We'll be using Hotel Occupancy Tax Receipts data, which can be used to see which hotels around the state pull in the most money. For more information about the data itself, see this [README.md](https://github.com/utdata/cli-tools/blob/master/hoteltax/README.md).


## The Goal

* We're going to download one year of monthly files.
* We'll convert them into well-formatted csv files.
* We'll then put all those monthly files into a single big file.
* We'll then pull out just hotels in Austin.

## Getting the data

We'll use a new command called `curl` to download our data. For more information on curl, you can read the [man page](https://curl.haxx.se/docs/manpage.html) or this [handy tip sheet](http://www.thegeekstuff.com/2012/04/curl-examples/), which is much more understandable.

Since we are in a Bash notebook, we can use our command-line tools. Let's make sure we know where we are. Type `pwd` in the cell below and then press Shift-Return to execute the command.


In [28]:
pwd

/Users/christian/Documents/code/cli-tools/csvkit/hotels


OK, we need to be inside the data folder when we download our data, so let's move there:

In [2]:
cd data



In [3]:
pwd

/Users/christian/Documents/code/cli-tools/csvkit/hotels/data


Now we'll use `curl` to pull down a single monthly hotel tax file. If you read about the tax data, you'll see you can get it from the comptroller, but they have some naming issues, so to help with this assignment, I've saved the data and we'll pull it down from this GitHub repo.

In [4]:
curl -O -L https://raw.githubusercontent.com/utdata/cli-tools/master/hoteltax/data/hotl1501.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2769k  100 2769k    0     0   842k      0  0:00:03  0:00:03 --:--:--  985k


Let's break down that `curl` statement.

* `curl` is the command. I think of it as "capture URL"
* `-O` (that's capital O, not zero) saves the result to a file on your computer, named after the remote file, instead of printing it to your screen.
* `-L` stands for `--location`, and it will allow the request to follow a redirect link. It's good to use it.
* And then we have the url of the file.

Let's check that the file made it to our computer, and that it has some data in it:

In [5]:
ls -l

total 67304
-rw-r--r--  1 christian  staff  2836275 Jul 10 22:58 hotl1501.csv
-rw-r--r--  1 christian  staff  2798838 Jul 10 21:20 hotl1502.csv
-rw-r--r--  1 christian  staff  2804719 Jul 10 21:20 hotl1503.csv
-rw-r--r--  1 christian  staff  2792044 Jul 10 21:20 hotl1504.csv
-rw-r--r--  1 christian  staff  2805369 Jul 10 21:20 hotl1505.csv
-rw-r--r--  1 christian  staff  2884019 Jul 10 21:20 hotl1506.csv
-rw-r--r--  1 christian  staff  2881094 Jul 10 21:20 hotl1507.csv
-rw-r--r--  1 christian  staff  2962019 Jul 10 21:20 hotl1508.csv
-rw-r--r--  1 christian  staff  2988669 Jul 10 21:20 hotl1509.csv
-rw-r--r--  1 christian  staff  2705594 Jul 10 21:20 hotl1510.csv
-rw-r--r--  1 christian  staff  2944794 Jul 10 21:20 hotl1511.csv
-rw-r--r--  1 christian  staff  3034819 Jul 10 21:20 hotl1512.csv


Looks like it is there, and it is 2.8M. Pretty big file.

Because our file names are well formatted, we can pull down multiple files at the same time. The file names have `hotl` followed by the year and month: `YYMM`. `curl` can request a whole sequence of URLs if we put a numeric or alphabetic range in square brackets, like this:

In [6]:
curl -O -L https://raw.githubusercontent.com/utdata/cli-tools/master/hoteltax/data/hotl15[01-12].csv


[1/12]: https://raw.githubusercontent.com/utdata/cli-tools/master/hoteltax/data/hotl1501.csv --> hotl1501.csv
--_curl_--https://raw.githubusercontent.com/utdata/cli-tools/master/hoteltax/data/hotl1501.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2769k  100 2769k    0     0   725k      0  0:00:03  0:00:03 --:--:--  725k

[2/12]: https://raw.githubusercontent.com/utdata/cli-tools/master/hoteltax/data/hotl1502.csv --> hotl1502.csv
--_curl_--https://raw.githubusercontent.co

This pulled down all 12 files. Now we can take a look to make sure we have all the files.

In [7]:
ls -l

total 67304
-rw-r--r--  1 christian  staff  2836275 Jul 10 22:59 hotl1501.csv
-rw-r--r--  1 christian  staff  2798250 Jul 10 22:59 hotl1502.csv
-rw-r--r--  1 christian  staff  2804425 Jul 10 22:59 hotl1503.csv
-rw-r--r--  1 christian  staff  2791750 Jul 10 22:59 hotl1504.csv
-rw-r--r--  1 christian  staff  2805075 Jul 10 22:59 hotl1505.csv
-rw-r--r--  1 christian  staff  2883725 Jul 10 22:59 hotl1506.csv
-rw-r--r--  1 christian  staff  2880800 Jul 10 22:59 hotl1507.csv
-rw-r--r--  1 christian  staff  2961725 Jul 10 22:59 hotl1508.csv
-rw-r--r--  1 christian  staff  2988375 Jul 10 22:59 hotl1509.csv
-rw-r--r--  1 christian  staff  2705300 Jul 10 22:59 hotl1510.csv
-rw-r--r--  1 christian  staff  2944500 Jul 10 22:59 hotl1511.csv
-rw-r--r--  1 christian  staff  3034525 Jul 10 22:59 hotl1512.csv

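To preview which files a `[01-12]` range refers to before downloading anything, the same list can be printed with a small loop. This is only a sketch: `list_urls` is a hypothetical helper name, and it builds the zero-padded month numbers with `seq` and `printf` rather than asking `curl` to do it.

```shell
# Hypothetical helper: print the twelve monthly URLs without downloading them.
list_urls() {
  base="https://raw.githubusercontent.com/utdata/cli-tools/master/hoteltax/data"
  for m in $(seq 1 12); do
    # %02d zero-pads the month number, so 1 becomes 01
    printf '%s/hotl15%02d.csv\n' "$base" "$m"
  done
}

list_urls
```

Checking the list this way is a cheap way to catch a typo in the range before twelve downloads start.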

## Head to look at a file

Let's take a look at the top of the first file. We'll use a command called `head`, which shows the first ten lines of a file by default. Tab completion works here too: type `head hot` and then hit Tab, and you'll get a pop-up that shows available files to choose from. Choose the right one, then use Shift-Return to execute.

In [8]:
head hotl1501.csv

32050067050,"COASTAL WAVES VACATIONS, LLC                      ","3616 7 MILE RD                          ","GALVESTON           ","TX","77554",084,00010," COASTAL WAVES VACATIONS                          ","4158 GREEN HERON DR                     ","GALVESTON           ","TX","77554",084,    1,       530.00,       530.00
32049649729,"ALL SEASONS RENTALS,  INC.                        ","15113 BAT HAWK CIR                      ","AUSTIN              ","TX","78738",227,00022," CROSSING C                                       ","613 HI CIR N                            ","HORSESHOE BAY       ","TX","78657",150,    1,         0.00,         0.00
32039987733,"TIRUPATI LODGING CORP.                            ","1407 S MAIN ST                          ","HIGHLANDS           ","TX","77562",101,00001," HIGHLANDS SUITES                                 ","1407 S MAIN ST                          ","HIGHLANDS           ","TX","77562",101,   31,     27264.87,     22864.87
32001947285,"IRMA A HA

OK, this looks like something, but we don't have a header row, which sucks. We can compare it against our [table layout](https://github.com/utdata/cli-tools/blob/master/hoteltax/HOTELTAX_LYOT.TXT):

```
Column_Order|Column_Description|Data_Type|Size
Col01|Taxpayer Number|Number|11
Col02|Taxpayer Name|Char|50
Col03|Taxpayer Address|Char|40
Col04|Taxpayer City|Char|20
Col05|Taxpayer State|Char|2
Col06|Taxpayer Zip Code|Number|5
Col07|Taxpayer County|Number|3
Col08|Outlet Number|Number|5
Col09|Location Name|Char|50
Col10|Location Address|Char|40
Col11|Location City|Char|20
Col12|Location State|Char|2
Col13|Location Zip Code|Number|5
Col14|Location County|Number|3
Col15|Location Room Capacity|Number|5
Col16|Location Tot Room Receipts|Number|13
Col17|Location Taxable Receipts|Number|13

```

We need to add a header row to all of these files or we'll have problems later. This is the one place where we change the original files, though we'll make backups while we're doing it and remove them afterward. We could do this manually, but even with a text editor, it takes a long time just to save the changes.

We'll use a program called `sed`. I'll be honest, this took me a couple of hours to figure out, especially since the `sed` that ships with macOS handles `-i` differently than GNU `sed` on Linux, and I thought I was going crazy. So, this is possibly a Mac-only solution.

First, we need the text that will go in the header row. I built this by hand based on the layout file above:

```
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
```

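Here is the command to change one of the files. The `-i '.bak'` option edits the file in place and saves a backup copy with a `.bak` extension, and `1i` tells `sed` to insert the text that follows before line 1: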

In [9]:
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1501.csv

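If the `sed -i` syntax above does not work on your system, one alternative is to skip in-place editing and rebuild the file from a backup copy. The sketch below is not what this notebook ran: `prepend_header` is a made-up helper name, it relies only on `cp`, `cat` and `printf`, and the demo runs on a throwaway file rather than the hotel data.

```shell
# Hypothetical helper: put a header line at the top of a file, keeping a .bak copy.
prepend_header() {
  header="$1"
  file="$2"
  cp "$file" "$file.bak"                 # back up the original first
  # Write the header, then the old contents, back into the original file name
  { printf '%s\n' "$header"; cat "$file.bak"; } > "$file"
}

# Demo on a throwaway two-line file
demo_dir=$(mktemp -d)
printf '1,a\n2,b\n' > "$demo_dir/demo.csv"
prepend_header "id,letter" "$demo_dir/demo.csv"
cat "$demo_dir/demo.csv"
```

Like the `sed` version, this leaves a `.bak` file behind that needs cleaning up later.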


OK, let's take a look at that file to make sure the header line was added properly.

In [10]:
head -n 5 hotl1501.csv

Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
32050067050,"COASTAL WAVES VACATIONS, LLC                      ","3616 7 MILE RD                          ","GALVESTON           ","TX","77554",084,00010," COASTAL WAVES VACATIONS                          ","4158 GREEN HERON DR                     ","GALVESTON           ","TX","77554",084,    1,       530.00,       530.00
32049649729,"ALL SEASONS RENTALS,  INC.                        ","15113 BAT HAWK CIR                      ","AUSTIN              ","TX","78738",227,00022," CROSSING C                                       ","613 HI CIR N                            ","HORSESHOE BAY       ","TX","78657",150,    1,         0.00,         0.00
32039987733,"TIRUPATI LODGING CORP.                  

Looks good. Now, let's do this for the other 11 files. (We don't want to do the first one again, or we'll add the header twice.)

In [11]:
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1502.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1503.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1504.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1505.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1506.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1507.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1508.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1509.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1510.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1511.csv
sed -i '.bak' '1i \
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
' hotl1512.csv

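The same change can be written as a loop so the header text appears only once. This is a sketch rather than what was run above: `add_header_once` is a hypothetical helper, it rebuilds each file from a `.bak` copy instead of using `sed -i`, and it skips any file whose first line already matches the header, so running it twice does no harm. The demo uses small stand-in files in a temporary folder.

```shell
# Hypothetical helper: add the header to each file unless it is already there.
add_header_once() {
  header="$1"
  shift
  for f in "$@"; do
    if [ "$(head -n 1 "$f")" = "$header" ]; then
      continue                      # header already present, leave the file alone
    fi
    cp "$f" "$f.bak"                # back up, then rebuild with the header on top
    { printf '%s\n' "$header"; cat "$f.bak"; } > "$f"
  done
}

# Demo with stand-in files
loop_dir=$(mktemp -d)
printf '1,a\n' > "$loop_dir/hotl1501.csv"
printf '2,b\n' > "$loop_dir/hotl1502.csv"
add_header_once "id,letter" "$loop_dir"/hotl15*.csv
add_header_once "id,letter" "$loop_dir"/hotl15*.csv   # second run changes nothing
head -n 2 "$loop_dir"/hotl15*.csv
```

The check on the first line is what removes the worry about adding the header twice to a file that was already processed.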


Now, one thing about this ... we created a bunch of *.bak* files as backups that we'll need to get rid of. To do so, we get to use the power of `rm`.

Let's look at all the files first.

In [12]:
ls -l

total 134608
-rw-r--r--  1 christian  staff  2836569 Jul 10 22:59 hotl1501.csv
-rw-r--r--  1 christian  staff  2836275 Jul 10 22:59 hotl1501.csv.bak
-rw-r--r--  1 christian  staff  2798544 Jul 10 22:59 hotl1502.csv
-rw-r--r--  1 christian  staff  2798250 Jul 10 22:59 hotl1502.csv.bak
-rw-r--r--  1 christian  staff  2804719 Jul 10 22:59 hotl1503.csv
-rw-r--r--  1 christian  staff  2804425 Jul 10 22:59 hotl1503.csv.bak
-rw-r--r--  1 christian  staff  2792044 Jul 10 22:59 hotl1504.csv
-rw-r--r--  1 christian  staff  2791750 Jul 10 22:59 hotl1504.csv.bak
-rw-r--r--  1 christian  staff  2805369 Jul 10 22:59 hotl1505.csv
-rw-r--r--  1 christian  staff  2805075 Jul 10 22:59 hotl1505.csv.bak
-rw-r--r--  1 christian  staff  2884019 Jul 10 22:59 hotl1506.csv
-rw-r--r--  1 christian  staff  2883725 Jul 10 22:59 hotl1506.csv.bak
-rw-r--r--  1 christian  staff  2881094 Jul 10 22:59 hotl1507.csv
-rw-r--r--  1 christian  staff  2880800 Jul 10 22:59 hotl1507.csv.bak
-rw-r--r--  1 christ

We'll remove all the files with .bak in the name, and then `ls` the directory again.

In [13]:
rm *.bak
ls -l

total 67304
-rw-r--r--  1 christian  staff  2836569 Jul 10 22:59 hotl1501.csv
-rw-r--r--  1 christian  staff  2798544 Jul 10 22:59 hotl1502.csv
-rw-r--r--  1 christian  staff  2804719 Jul 10 22:59 hotl1503.csv
-rw-r--r--  1 christian  staff  2792044 Jul 10 22:59 hotl1504.csv
-rw-r--r--  1 christian  staff  2805369 Jul 10 22:59 hotl1505.csv
-rw-r--r--  1 christian  staff  2884019 Jul 10 22:59 hotl1506.csv
-rw-r--r--  1 christian  staff  2881094 Jul 10 22:59 hotl1507.csv
-rw-r--r--  1 christian  staff  2962019 Jul 10 22:59 hotl1508.csv
-rw-r--r--  1 christian  staff  2988669 Jul 10 22:59 hotl1509.csv
-rw-r--r--  1 christian  staff  2705594 Jul 10 22:59 hotl1510.csv
-rw-r--r--  1 christian  staff  2944794 Jul 10 22:59 hotl1511.csv
-rw-r--r--  1 christian  staff  3034819 Jul 10 22:59 hotl1512.csv


In [14]:
head -n 3 hotl1501.csv

Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
32050067050,"COASTAL WAVES VACATIONS, LLC                      ","3616 7 MILE RD                          ","GALVESTON           ","TX","77554",084,00010," COASTAL WAVES VACATIONS                          ","4158 GREEN HERON DR                     ","GALVESTON           ","TX","77554",084,    1,       530.00,       530.00
32049649729,"ALL SEASONS RENTALS,  INC.                        ","15113 BAT HAWK CIR                      ","AUSTIN              ","TX","78738",227,00022," CROSSING C                                       ","613 HI CIR N                            ","HORSESHOE BAY       ","TX","78657",150,    1,         0.00,         0.00


A couple of things about that file. Note that Col02, Taxpayer Name, is 50 characters long. Look at the first line of data:

```
Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
32050067050,"COASTAL WAVES VACATIONS, LLC                      ","3616 7 MILE RD                          ","GALVESTON           ","TX","77554",084,00010," COASTAL WAVES VACATIONS                          ","4158 GREEN HERON DR                     ","GALVESTON           ","TX","77554",084,    1,       530.00,       530.00
```

Notice anything about that "COASTAL WAVES VACATIONS, LLC" name? That name is in quotes, but look where the closing quote mark is. The field uses the full 50 characters, padded with spaces. The same goes for the address and the other text fields. This is one of the things **csvkit** can help us with: cleaning up and normalizing a csv file.
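
To see the padding in numbers, here is a small sketch that rebuilds a 50-character field from the name and then strips the trailing spaces again. It does not read the hotel file; the variable names are illustrative, and the padding is recreated with `printf` on the assumption that the field is left-aligned and space-filled, as the sample row suggests.

```shell
name="COASTAL WAVES VACATIONS, LLC"

# Pad the name on the right with spaces to a width of 50, like the raw field
padded=$(printf '%-50s' "$name")

# Strip the trailing spaces again
trimmed=$(printf '%s' "$padded" | sed 's/ *$//')

echo "padded length:  ${#padded}"
echo "trimmed length: ${#trimmed}"
```

The difference between the two lengths is the padding that gets stored in every row of the raw file.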

Let's move out of our data directory and create a new one to put our processed data into. We don't ever want to change our original data.

In [15]:
cd ../



In [27]:
# we use the -p here so it won't error if it already exists
mkdir -p data-done



You might want to have the [csvkit docs](https://csvkit.readthedocs.io) open as a reference as we go through this.

Csvkit has a command called `in2csv` that not only converts .xlsx files to .csv, it also cleans up .csv files, such as removing all the extra spaces in our name fields. We want to use this technique on our files, but at about 2.8 MB each, it would take some time to do one file, much less all 12. It's worth doing if you have time. We don't.

Remember that our goal is to look at all the Austin hotels? Well, we can cut our files down to the smaller Austin files and then clean them up at the same time. To do this, we need to find out in which column to search for Austin.

We can see from our table layout above that the 11th column is the Location City, but let's use `csvcut -n` to peek at the header row of the first file to confirm. The `-n` stands for `--names`, because it is usually used to see the header row.

In [17]:
csvcut -n data/hotl1501.csv

  1: Taxpayer Number
  2: Taxpayer Name
  3: Taxpayer Address
  4: Taxpayer City
  5: Taxpayer State
  6: Taxpayer Zip Code
  7: Taxpayer County
  8: Outlet Number
  9: Location Name
 10: Location Address
 11: Location City
 12: Location State
 13: Location Zip Code
 14: Location County
 15: Location Room Capacity
 16: Location Tot Room Receipts
 17: Location Taxable Receipts


OK, we can see `Location City` is in the 11th column, which looks good. We can use `csvgrep` to find the AUSTIN rows.

Grep is a command-line tool for searching text with regular expressions, and `csvgrep` works the same way on the columns of a csv file. It needs a couple of arguments:

* `-c` is which column to search. We want `11`.
* `-m` is the literal string to match. We'll instead use `-r`, which lets us build a regular expression.

We know we need column 11, but the match word will be tricky. We can't search for just the word "AUSTIN" because the 

The other thing we are going to do is to **pipe** the result into another command, `head` in this case. This is so we can just look at the first couple of lines to test our output before using it. This **pipe** concept is really important: You can take the output of one command and make it the input of another, and you can string these together into a pipeline. That's what we're working on here, piece by piece ... a pipeline to cut and clean our files.
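
Here is the pipe idea on a tiny scale, using only standard tools and a made-up list of city names. `printf` produces five lines, `grep` keeps those that start with AUSTIN, and `head` shows the first two of whatever `grep` lets through. `austin_sample` is just an illustrative name.

```shell
# Three commands chained with pipes: make lines, filter them, show the first two
austin_sample() {
  printf 'AUSTIN\nDALLAS\nAUSTIN\nHOUSTON\nAUSTIN\n' | grep '^AUSTIN' | head -n 2
}

austin_sample
```

Each command only sees what the one before it wrote out, which is why the order of the steps matters.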

So, our `csvgrep` command searches the 11th column for values that begin with AUSTIN (that's what the `^` means). We then pipe it into `head` and use the `-n` flag to show just 5 lines.

In [18]:
csvgrep -c 11 -r '^AUSTIN' data/hotl1501.csv | head -n 5

Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
32016602719,BHAKTA DAYARAM                                    ,2627 MANOR RD                           ,AUSTIN              ,TX,78722,227,00001,ACE MOTEL                                         ,2627 MANOR RD                           ,AUSTIN              ,TX,78722,227,   27,      9165.00,      8580.00
32016658448,SIDNEY CORINNE LOCK                               ,4300 AVENUE G                           ,AUSTIN              ,TX,78751,227,00005,ADAMS HOUSE BED & BREAKFAST                       ,4300 AVENUE G                           ,AUSTIN              ,TX,78751,227,    3,     14903.00,     13544.00
32043492993,AMANDA K CRIBBS                                   ,4202 FLAGSTAFF DR               

OK, that looks like we are getting the right rows. Now we are going to pipe that result into `in2csv` and then again into `head` to see what it looks like.

In this case, `in2csv` needs a `-f` flag for filetype, which we will set as `csv`.

In [19]:
csvgrep -c 11 -r "^AUSTIN" data/hotl1501.csv | in2csv -f csv | head -n 5

Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip Code,Taxpayer County,Outlet Number,Location Name,Location Address,Location City,Location State,Location Zip Code,Location County,Location Room Capacity,Location Tot Room Receipts,Location Taxable Receipts
32016602719,BHAKTA DAYARAM,2627 MANOR RD,AUSTIN,TX,78722,227,00001,ACE MOTEL,2627 MANOR RD,AUSTIN,TX,78722,227,27,9165.0,8580.0
32016658448,SIDNEY CORINNE LOCK,4300 AVENUE G,AUSTIN,TX,78751,227,00005,ADAMS HOUSE BED & BREAKFAST,4300 AVENUE G,AUSTIN,TX,78751,227,3,14903.0,13544.0
32043492993,AMANDA K CRIBBS,4202 FLAGSTAFF DR,AUSTIN,TX,78759,227,00007,ALLANDALE RENTALS,1107A BRENTWOOD ST,AUSTIN,TX,78757,227,2,250.0,250.0
32043492993,AMANDA K CRIBBS,4202 FLAGSTAFF DR,AUSTIN,TX,78759,227,00009,ALLANDALE RENTALS,11900 ALOE VERA TRL,AUSTIN,TX,78750,246,8,2728.0,2728.0


See the difference in the files now? All the extra spaces are gone. We can take that command above and instead of piping it into `head`, we can redirect it into a new file in the data-done folder. We do this with `>` and then specify the file location, which we'll call `hotl1501-atx.csv`.

In [20]:
csvgrep -c 11 -r "^AUSTIN" data/hotl1501.csv | in2csv -f csv > data-done/hotl1501-atx.csv



List the new `data-done` directory to see that the finished file is there:

In [21]:
ls -l data-done

total 4504
-rw-r--r--  1 christian  staff  1162130 Jul 10 22:39 austin-hotels.csv
-rw-r--r--  1 christian  staff    89362 Jul 10 23:01 hotl1501-atx.csv
-rw-r--r--  1 christian  staff    88878 Jul 10 22:13 hotl1502-atx.csv
-rw-r--r--  1 christian  staff    93104 Jul 10 22:13 hotl1503-atx.csv
-rw-r--r--  1 christian  staff    91828 Jul 10 22:13 hotl1504-atx.csv
-rw-r--r--  1 christian  staff    90653 Jul 10 22:13 hotl1505-atx.csv
-rw-r--r--  1 christian  staff    89636 Jul 10 22:13 hotl1506-atx.csv
-rw-r--r--  1 christian  staff    86633 Jul 10 22:13 hotl1507-atx.csv
-rw-r--r--  1 christian  staff    90431 Jul 10 22:13 hotl1508-atx.csv
-rw-r--r--  1 christian  staff    98754 Jul 10 22:13 hotl1509-atx.csv
-rw-r--r--  1 christian  staff    95913 Jul 10 22:13 hotl1510-atx.csv
-rw-r--r--  1 christian  staff   101552 Jul 10 22:13 hotl1511-atx.csv
-rw-r--r--  1 christian  staff   100217 Jul 10 22:13 hotl1512-atx.csv
-rw-r--r--  1 christian  staff        0 Jul 10 22:31 test.csv


OK, let's go ahead and process all of the files so we have clean versions of the Austin records. There is definitely a better way to do this with a loop of some sort, but I don't know how. Yet.

When I set this up, I just copied that first one over 11 more times, then updated the file names in both places on each line.

In [22]:
# Make sure you get the file names right
csvgrep -c 11 -r "^AUSTIN" data/hotl1501.csv | in2csv -f csv > data-done/hotl1501-atx.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl1502.csv | in2csv -f csv > data-done/hotl1502-atx.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl1503.csv | in2csv -f csv > data-done/hotl1503-atx.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl1504.csv | in2csv -f csv > data-done/hotl1504-atx.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl1505.csv | in2csv -f csv > data-done/hotl1505-atx.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl1506.csv | in2csv -f csv > data-done/hotl1506-atx.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl1507.csv | in2csv -f csv > data-done/hotl1507-atx.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl1508.csv | in2csv -f csv > data-done/hotl1508-atx.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl1509.csv | in2csv -f csv > data-done/hotl1509-atx.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl1510.csv | in2csv -f csv > data-done/hotl1510-atx.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl1511.csv | in2csv -f csv > data-done/hotl1511-atx.csv
csvgrep -c 11 -r "^AUSTIN" data/hotl1512.csv | in2csv -f csv > data-done/hotl1512-atx.csv

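One possible loop for this step is sketched below. It is not what produced the output in this notebook: `out_name` is a hypothetical helper that turns an input path such as `data/hotl1501.csv` into `data-done/hotl1501-atx.csv`, and the loop assumes it is run from the `hotels` folder with csvkit installed.

```shell
# Hypothetical helper: build the output path from an input path.
out_name() {
  base=$(basename "$1" .csv)            # data/hotl1501.csv -> hotl1501
  printf 'data-done/%s-atx.csv\n' "$base"
}

mkdir -p data-done
for f in data/hotl15*.csv; do
  [ -e "$f" ] || continue               # skip if the pattern matched no files
  # Same pipeline as above, with the file names filled in by the loop
  csvgrep -c 11 -r "^AUSTIN" "$f" | in2csv -f csv > "$(out_name "$f")"
done
```

Because the names are derived from the input file, there is only one place to get them wrong instead of twenty-four.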


In [23]:
# making sure they are all there
ls -l data-done

total 4504
-rw-r--r--  1 christian  staff  1162130 Jul 10 22:39 austin-hotels.csv
-rw-r--r--  1 christian  staff    89362 Jul 10 23:01 hotl1501-atx.csv
-rw-r--r--  1 christian  staff    88878 Jul 10 23:01 hotl1502-atx.csv
-rw-r--r--  1 christian  staff    93104 Jul 10 23:01 hotl1503-atx.csv
-rw-r--r--  1 christian  staff    91828 Jul 10 23:01 hotl1504-atx.csv
-rw-r--r--  1 christian  staff    90653 Jul 10 23:01 hotl1505-atx.csv
-rw-r--r--  1 christian  staff    89636 Jul 10 23:01 hotl1506-atx.csv
-rw-r--r--  1 christian  staff    86633 Jul 10 23:01 hotl1507-atx.csv
-rw-r--r--  1 christian  staff    90431 Jul 10 23:01 hotl1508-atx.csv
-rw-r--r--  1 christian  staff    98754 Jul 10 23:02 hotl1509-atx.csv
-rw-r--r--  1 christian  staff    95913 Jul 10 23:02 hotl1510-atx.csv
-rw-r--r--  1 christian  staff   101552 Jul 10 23:02 hotl1511-atx.csv
-rw-r--r--  1 christian  staff   100217 Jul 10 23:02 hotl1512-atx.csv
-rw-r--r--  1 christian  staff        0 Jul 10 22:31 test.csv


## Stack into a single file

Now we can use [csvstack](https://csvkit.readthedocs.io/en/540/scripts/csvstack.html) to combine all the files into one big file.

* **-g** flag lets us create a new column and give a value to each row that defines which file it came from. In our case, we need to know what month it came from, so we'll list all the months.
* **-n** lets us name that group column. We'll call it Month.

Then we list all the files we want to put together. When we use **-g**, we have to have the same number of group names as we do input files.

I'm breaking this command up into multiple lines using "\" at the end so you can see the whole command. The group names and the files need to be in the same order.

In [24]:
csvstack -n Month -g \
January,February,March,April,May,June,July,August,September,October,November,December \
data-done/hotl1501-atx.csv data-done/hotl1502-atx.csv data-done/hotl1503-atx.csv \
data-done/hotl1504-atx.csv data-done/hotl1505-atx.csv data-done/hotl1506-atx.csv \
data-done/hotl1507-atx.csv data-done/hotl1508-atx.csv data-done/hotl1509-atx.csv \
data-done/hotl1510-atx.csv data-done/hotl1511-atx.csv data-done/hotl1512-atx.csv \
> data-done/austin-hotels.csv

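Since the group names and the files have to line up, a quick count before stacking can catch a mismatch. The sketch below is an assumption-laden variant of the command above: it gathers the monthly files with a glob (which the shell sorts by name, so `1501` comes first and `1512` last), compares that count with the number of month names, and only then runs `csvstack`. The variable names are illustrative.

```shell
months="January,February,March,April,May,June,July,August,September,October,November,December"

# Count the comma-separated group names
n_groups=$(printf '%s\n' "$months" | tr ',' '\n' | wc -l | tr -d ' ')

# Collect the monthly files into the positional parameters
set -- data-done/hotl15*-atx.csv
[ -e "$1" ] || set --                   # pattern matched nothing: empty the list
n_files=$#

if [ "$n_groups" -eq "$n_files" ]; then
  csvstack -n Month -g "$months" "$@" > data-done/austin-hotels.csv
else
  echo "Have $n_groups group names but $n_files files; not stacking." >&2
fi
```

The glob only matches the `-atx` files, so the combined `austin-hotels.csv` is not fed back into itself on a rerun.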


## Quick stats on files

We'll use [csvstat](https://csvkit.readthedocs.io/en/540/scripts/csvstat.html) to take a closer look at the combined file. Sometimes the result is all you need for a story ... the min, max, sum, mean and median of a particular column.

In [25]:
csvstat data-done/austin-hotels.csv

  1. Month
	<class 'str'>
	Nulls: False
	Unique values: 12
	5 most frequent values:
		November:	612
		December:	603
		September:	593
		October:	575
		March:	555
	Max length: 9
  2. Taxpayer Number
	<class 'int'>
	Nulls: False
	Min: 10204561905
	Max: 32059158348
	Sum: 187936198875413
	Mean: 28087908963.59483
	Median: 32047098168
	Standard Deviation: 7532652034.898627
	Unique values: 252
	5 most frequent values:
		32049933412:	2302
		32043490237:	276
		32052153940:	275
		32022337540:	264
		12016274339:	144
  3. Taxpayer Name
	<class 'str'>
	Nulls: False
	Unique values: 256
	5 most frequent values:
		TURNKEY VACATION RENTALS, INC.:	2302
		VACATIONCAKE LLC:	276
		EMERSON GUEST PROPERTIES LLC:	275
		CHEREEN FISHER:	264
		ESA P PORTFOLIO OPERATING LESSEE LLC:	120
	Max length: 50
  4. Taxpayer Address
	<class 'str'>
	Nulls: False
	Unique values: 225
	5 most frequent values:
		4544 S LAMAR BLVD STE G300:	1117
		4544 S LAMAR BLVD BLDG 300:	871
		3006

## The result, and possible next steps

You can already see a couple of things here.

* The most a single hotel reported in 2015 was 8,846,099.
* The mean (or average) reported by all establishments was 152,185, but given the median is 4,595 there are many establishments that did not make that much money.

Now your single `austin-hotels` file can be analyzed so you are looking at one year of data all together.

By having this all in this notebook, you can run the whole process over again by going to the Kernel menu and choosing **Restart and Run All**.

If you find you had a mistake somewhere along the line, you can fix it, then **Restart and Run All**.

Imagine if you had done all this by hand in Excel, and then found you made an error early in the process. Or, worse yet, you didn't discover you made an error.
