# Processing mixed beverage data
This Python notebook uses a Python library called [agate](http://agate.readthedocs.io/) to clean and process a monthly [Mixed Beverage Gross Receipts](https://comptroller.texas.gov/taxes/mixed-beverage/receipts.php) file from the Texas Comptroller [data repository](https://comptroller.texas.gov/transparency/open-data/search-datasets/). The file itself is downloaded with `curl`.

The first part of this notebook imports the Python modules that we will use:

In [1]:
import agate
import re



In [2]:
# this is to suppress the timezone warning
import warnings
warnings.filterwarnings('ignore')

In [3]:
%%bash
## downloads the mixedbev file
## You have to set this URL based on the directory site
curl -O https://comptroller.texas.gov/auto-data/odc/MIXEDBEV_03_2017.CSV

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2493k  100 2493k    0     0  2930k      0 --:--:-- --:--:-- --:--:-- 3121k



The [agate-remote](http://agate-remote.readthedocs.io/en/0.2.0/) extension is supposed to let agate load a file straight from a remote URL, but I decided to use bash above to curl the file instead.
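
If you would rather stay in Python for the download, a minimal sketch using only the standard library could look like the following. This is not what the notebook does (it uses `curl`), and `download_file` is a hypothetical helper name introduced here for illustration.

```python
import shutil
import urllib.request


def download_file(url, dest):
    """Save the contents of `url` to the local path `dest` and return `dest`.

    Hypothetical helper, not part of agate or of the original notebook.
    """
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out:
        # stream the response to disk instead of holding it all in memory
        shutil.copyfileobj(response, out)
    return dest


# Example call, commented out so the sketch does not hit the network:
# download_file(
#     'https://comptroller.texas.gov/auto-data/odc/MIXEDBEV_03_2017.CSV',
#     'MIXEDBEV_03_2017.CSV',
# )
```

As with the `curl` cell, the URL has to be updated each month to match the file listed on the Comptroller site.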

Next, we'll use a bash command to peek at the data, which we know is a mess:

In [4]:
%%bash
head -n 5 MIXEDBEV_03_2017.CSV

"MB821424    ","ABI-HAUS                      ","959 N 2ND ST                  ","ABILENE             ","TX","79601","221","          ","2017/01", 000000523.40
"MB638028    ","ABILENE BEEHIVE INC           ","442 CEDAR ST STE A            ","ABILENE             ","TX","79601","221","          ","2017/02", 000002610.52
"MB543114    ","ABILENE BOWLING LANES INC     ","279 RUIDOSA AVE               ","ABILENE             ","TX","79605","221","          ","2017/02", 000000256.27
"MB933130    ","ABILENE CABARET LLC           ","1918 BUTTERNUT ST             ","ABILENE             ","TX","79602","221","          ","2017/02", 000000699.41
"N 037863    ","ABILENE COUNTRY CLUB          ","4039 S TREADAWAY BLVD         ","ABILENE             ","TX","79602","221","          ","2017/02", 000001801.63

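To see why the data is a mess, here is a small standard-library sketch that parses the first line of the output above. It only illustrates the padding problem; the notebook itself does the trimming with agate further down.

```python
import csv

# One raw line copied from the `head` output above
raw_line = (
    '"MB821424    ","ABI-HAUS                      ",'
    '"959 N 2ND ST                  ","ABILENE             ",'
    '"TX","79601","221","          ","2017/01", 000000523.40'
)

# skipinitialspace drops the blank that follows the last comma
fields = next(csv.reader([raw_line], skipinitialspace=True))

# every text field is padded with trailing spaces, so strip each one
trimmed = [field.strip() for field in fields]

print(len(fields))
print(trimmed)
```

The eighth field is nothing but spaces, which is why it gets the name `Blank` when the columns are named later on.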

### Study variables
These are set here depending on what city or county you want to study.

First, we'll list the files in our directory that we have downloaded so far:

In [5]:
ls

MIXEDBEV_02_2017.CSV         counties.csv
MIXEDBEV_03_2017.CSV         headers.txt
Mixed beverages agate.ipynb  mixbev-env.txt
README.md


Then we set some values based on those.

- The `file` is the name of the file we want to process.
- The `month_studied` is the month before the one in the file name (the `03_2017` file mostly holds `2017/02` records). We check this later in the notebook to make sure we are using the month that has the most records.
- The `city_studied` is the city we are interested in, used for the table and chart later.
- The `county_studied` is the county we are interested in, used the same way.

In [6]:
# this is our source file, which may have been downloaded above
file = 'MIXEDBEV_03_2017.CSV'

# setting the month_studied var.
# This should be checked in the table below that counts records by month
month_studied = '2017/02'

# city studied needs to be ALL CAPS, as that's how the data comes
city_studied = 'AUSTIN'

# county studied should be ALL CAPS, as we set it to search that way
county_studied = 'BASTROP'
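
The `month_studied` value above is typed in by hand. As a sketch only, it could also be derived from the file name, assuming the `MIXEDBEV_MM_YYYY.CSV` naming pattern seen in this directory holds for other months. `previous_period` is a hypothetical helper, and the result should still be checked against the record counts by period.

```python
import re


def previous_period(filename):
    """Return the period (YYYY/MM) one month before the month in the file name.

    Hypothetical helper; assumes names shaped like MIXEDBEV_03_2017.CSV.
    """
    match = re.search(r'MIXEDBEV_(\d{2})_(\d{4})', filename)
    if match is None:
        raise ValueError('unexpected file name: %s' % filename)
    month, year = int(match.group(1)), int(match.group(2))
    # step back one month, rolling over to December of the previous year
    if month == 1:
        month, year = 12, year - 1
    else:
        month -= 1
    return '%04d/%02d' % (year, month)


print(previous_period('MIXEDBEV_03_2017.CSV'))
```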

### Data variables
These probably won't change between analyses or data sets.

In [7]:
column_names = [
    'TABC Permit Number',
    'Trade Name',
    'Location Address',
    'Location City',
    'Location State',
    'Location Zip Code',
    'Location County Code',
    'Blank',
    'Report Period',
    'Report Tax'
]
specified_types = {
    'Location Zip Code': agate.Text(),
    'Location County Code': agate.Text()
}
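
The zip code and county code are forced to text because they are identifiers, not quantities. The sketch below shows what a numeric cast would do to them, using `Decimal` since that is the type agate uses for its `Number` columns. The three codes are taken from the data and the county list shown in this notebook.

```python
from decimal import Decimal

county_codes = ['008', '011', '221']

# What a numeric cast does to the codes: the leading zeros are gone
as_numbers = [Decimal(code) for code in county_codes]
round_tripped = [str(number) for number in as_numbers]

print(round_tripped)
```

A code stored as `11` would no longer match the `011` in the county lookup file when the two tables are joined.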

### Processing the file


In [8]:
# this imports the file specified above
mixbev_raw = agate.Table.from_csv(file, column_names, encoding='iso-8859-1', column_types=specified_types)

# prints table fields
print(mixbev_raw)

| column               | data_type |
| -------------------- | --------- |
| TABC Permit Number   | Text      |
| Trade Name           | Text      |
| Location Address     | Text      |
| Location City        | Text      |
| Location State       | Text      |
| Location Zip Code    | Text      |
| Location County Code | Text      |
| Blank                | Boolean   |
| Report Period        | Text      |
| Report Tax           | Number    |



In [9]:
# This creates a new interim table from the results of the compute function.
# It strips the four columns that need trimming and adds the stripped
# values to the end of the table under new names.
mixbev_trim = mixbev_raw.compute([
    ('Permit', agate.Formula(agate.Text(), lambda r: r['TABC Permit Number'].strip())),
    ('Name', agate.Formula(agate.Text(), lambda r: r['Trade Name'].strip())),
    ('Address', agate.Formula(agate.Text(), lambda r: r['Location Address'].strip())),
    ('City', agate.Formula(agate.Text(), lambda r: r['Location City'].strip()))
])

In [10]:
## shows the new columns added to the interim table
print(mixbev_trim)

| column               | data_type |
| -------------------- | --------- |
| TABC Permit Number   | Text      |
| Trade Name           | Text      |
| Location Address     | Text      |
| Location City        | Text      |
| Location State       | Text      |
| Location Zip Code    | Text      |
| Location County Code | Text      |
| Blank                | Boolean   |
| Report Period        | Text      |
| Report Tax           | Number    |
| Permit               | Text      |
| Name                 | Text      |
| Address              | Text      |
| City                 | Text      |



In [11]:
## creates new table with just stuff we need with clean names
# new_table = table.select(['3rd_column_name', '1st_column_name', '2nd_column_name'])
mixbev_cleaned = mixbev_trim.select([
    'Permit',
    'Name',
    'Address',
    'City',
    'Location State',
    'Location County Code',
    'Report Period',
    'Report Tax'
]).rename(column_names = {
    'Location State': 'State',
    'Location County Code': 'CountyCode',
    'Report Period': 'Period',
    'Report Tax': 'Tax'
})

In [12]:
## these are now the columns present in our new table
print(mixbev_cleaned)

| column     | data_type |
| ---------- | --------- |
| Permit     | Text      |
| Name       | Text      |
| Address    | Text      |
| City       | Text      |
| State      | Text      |
| CountyCode | Text      |
| Period     | Text      |
| Tax        | Number    |



In [13]:
# and this peeks at the data
# I did send this to_csv and made sure columns were trimmed
mixbev_cleaned.limit(5).print_table()

| Permit   | Name                 | Address              | City    | State | CountyCode | ... |
| -------- | -------------------- | -------------------- | ------- | ----- | ---------- | --- |
| MB638028 | ABILENE BEEHIVE INC  | 442 CEDAR ST STE A   | ABILENE | TX    | 221        | ... |
| MB543114 | ABILENE BOWLING L... | 279 RUIDOSA AVE      | ABILENE | TX    | 221        | ... |
| MB933130 | ABILENE CABARET LLC  | 1918 BUTTERNUT ST    | ABILENE | TX    | 221        | ... |
| N 037863 | ABILENE COUNTRY CLUB | 4039 S TREADAWAY ... | ABILENE | TX    | 221        | ... |
| MB200506 | ABILENE SEAFOOD T... | 1882 S CLACK ST      | ABILENE | TX    | 221        | ... |


### Create establishment column

We do this so that each group is a single establishment. Grouping on trade name alone would lump together locations that share a name but have different addresses, like 'CHILI'S GRILL & BAR'.

In [14]:
mixbev_cleaned_est = mixbev_cleaned.compute([
    ('Establishment', agate.Formula(agate.Text(), lambda r: '%(Name)s %(Address)s' % r))
])
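
To illustrate why the address is appended, here is a plain-Python sketch with two rows that share a trade name. The names and addresses come from the outputs in this notebook, and the formatting expression is the same one used in the lambda above, applied to ordinary dicts instead of agate rows.

```python
rows = [
    {'Name': "CHILI'S GRILL & BAR", 'Address': '4302 S CLACK ST'},
    {'Name': "CHILI'S GRILL & BAR", 'Address': '734 HIGHWAY 71 W'},
]

# same formatting expression as the notebook's lambda
establishments = ['%(Name)s %(Address)s' % row for row in rows]

unique_names = {row['Name'] for row in rows}
unique_establishments = set(establishments)

# one trade name, but two distinct establishments
print(len(unique_names), len(unique_establishments))
```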

In [15]:
print(mixbev_cleaned_est)

| column        | data_type |
| ------------- | --------- |
| Permit        | Text      |
| Name          | Text      |
| Address       | Text      |
| City          | Text      |
| State         | Text      |
| CountyCode    | Text      |
| Period        | Text      |
| Tax           | Number    |
| Establishment | Text      |



In [16]:
mixbev_establishment = mixbev_cleaned_est.select('Establishment')
mixbev_establishment.print_table(max_column_width=80)

| Establishment                                      |
| -------------------------------------------------- |
| ABILENE BEEHIVE INC 442 CEDAR ST STE A             |
| ABILENE BOWLING LANES INC 279 RUIDOSA AVE          |
| ABILENE CABARET LLC 1918 BUTTERNUT ST              |
| ABILENE COUNTRY CLUB 4039 S TREADAWAY BLVD         |
| ABILENE SEAFOOD TAVERN 1882 S CLACK ST             |
| ABUELO'S BEVERAGE CORPORATION 4782 S 14TH ST       |
| ACE IN THE HOLE 133 EPLENS CT                      |
| AMNESIA, LLC. 1850 S CLACK ST                      |
| BILLIARDS PLUS 5495 S 7TH ST                       |
| BONZAI JAPANESE STEAK HOUSE 1802 S CLACK ST        |
| BREAKERS SPORTS BAR 1874 S CLACK ST                |
| BUFFALO WILD WINGS GRILL & BAR 1010 E OVERLAND TRL |
| BUFFALO WILD WINGS GRILL AND B 4401 RIDGEMONT DR   |
| CAHOOTS CATFISH & OYSTER BAR/J 301 S 11TH ST       |
| CHELSEA'S ST PUB 4310 BUFFALO GAP RD STE 1342      |
| CHILI'S GRILL & BAR 4302 S CLACK ST                |
| CHILIS G

In [17]:
# importing counties.csv, ensuring that the 'code' column is text
counties = agate.Table.from_csv('counties.csv', column_types={'code': agate.Text()}).rename()

In [18]:
print(counties)

| column | data_type |
| ------ | --------- |
| id     | Number    |
| county | Text      |
| code   | Text      |



In [19]:
counties.print_table()

| id | county    | code |
| -- | --------- | ---- |
|  1 | Anderson  | 001  |
|  2 | Andrews   | 002  |
|  3 | Angelina  | 003  |
|  4 | Aransas   | 004  |
|  5 | Archer    | 005  |
|  6 | Armstrong | 006  |
|  7 | Atascosa  | 007  |
|  8 | Austin    | 008  |
|  9 | Bailey    | 009  |
| 10 | Bandera   | 010  |
| 11 | Bastrop   | 011  |
| 12 | Baylor    | 012  |
| 13 | Bee       | 013  |
| 14 | Bell      | 014  |
| 15 | Bexar     | 015  |
| 16 | Blanco    | 016  |
| 17 | Borden    | 017  |
| 18 | Bosque    | 018  |
| 19 | Bowie     | 019  |
| 20 | Brazoria  | 020  |
| ... | ...       | ...  |


In [20]:
mixbev_joined = mixbev_cleaned_est.join(counties, 'CountyCode', 'code')
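
The join above is a lookup of each receipt's `CountyCode` in the `code` column of the county table. A rough plain-Python sketch of the same idea follows, assuming agate's default behavior of keeping every row from the left table. The second receipt row and its `999` code are made up to show what happens when there is no match.

```python
counties = [
    {'county': 'Austin', 'code': '008'},
    {'county': 'Bastrop', 'code': '011'},
]
receipts = [
    {'Name': "CHILI'S GRILL & BAR", 'CountyCode': '011'},
    {'Name': 'HYPOTHETICAL BAR', 'CountyCode': '999'},  # made-up row, code not in the lookup
]

# build the lookup once, then attach the county name to each receipt
county_by_code = {row['code']: row['county'] for row in counties}
joined = [
    dict(row, county=county_by_code.get(row['CountyCode']))
    for row in receipts
]

for row in joined:
    print(row['Name'], row['county'])
```

Rows without a match keep a county of `None`, which is worth remembering before calling string methods on that column.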

In [21]:
print(mixbev_joined)

| column        | data_type |
| ------------- | --------- |
| Permit        | Text      |
| Name          | Text      |
| Address       | Text      |
| City          | Text      |
| State         | Text      |
| CountyCode    | Text      |
| Period        | Text      |
| Tax           | Number    |
| Establishment | Text      |
| id            | Number    |
| county        | Text      |



In [22]:
mixbev = mixbev_joined.select([
    'Permit',
    'Name',
    'Address',
    'Establishment',
    'City',
    'State',
    'county',
    'Period',
    'Tax'
]).rename(column_names = {
    'county': 'County'
})

In [23]:
print(mixbev)

| column        | data_type |
| ------------- | --------- |
| Permit        | Text      |
| Name          | Text      |
| Address       | Text      |
| Establishment | Text      |
| City          | Text      |
| State         | Text      |
| County        | Text      |
| Period        | Text      |
| Tax           | Number    |



In [24]:
mixbev.print_table()

| Permit   | Name                 | Address              | Establishment        | City    | State | ... |
| -------- | -------------------- | -------------------- | -------------------- | ------- | ----- | --- |
| MB638028 | ABILENE BEEHIVE INC  | 442 CEDAR ST STE A   | ABILENE BEEHIVE I... | ABILENE | TX    | ... |
| MB543114 | ABILENE BOWLING L... | 279 RUIDOSA AVE      | ABILENE BOWLING L... | ABILENE | TX    | ... |
| MB933130 | ABILENE CABARET LLC  | 1918 BUTTERNUT ST    | ABILENE CABARET L... | ABILENE | TX    | ... |
| N 037863 | ABILENE COUNTRY CLUB | 4039 S TREADAWAY ... | ABILENE COUNTRY C... | ABILENE | TX    | ... |
| MB200506 | ABILENE SEAFOOD T... | 1882 S CLACK ST      | ABILENE SEAFOOD T... | ABILENE | TX    | ... |
| MB541702 | ABUELO'S BEVERAGE... | 4782 S 14TH ST       | ABUELO'S BEVERAGE... | ABILENE | TX    | ... |
| MB932373 | ACE IN THE HOLE      | 133 EPLENS CT        | ACE IN THE HOLE 1... | ABILENE | TX    | ... |
| MB969482 | AMNESIA, LLC.        | 1850 S CLA

### Looking at dates of the records

Here we have to:
- create a tableset using `group_by` on the period
- create a table using the `aggregate` function to count the records in each period
- create a table that sorts those counts in descending order
- then print the sorted table (top 10 rows)

In [25]:
by_period = mixbev.group_by('Period')

period_totals = by_period.aggregate([
    ('count', agate.Count())
])

period_totals_sorted = period_totals.order_by('count', reverse=True)

period_totals_sorted.limit(10).print_table(max_rows=None)


| Period  |  count |
| ------- | ------ |
| 2017/02 | 14,090 |
| 2017/01 |  1,423 |
| 2016/12 |    141 |
| 2016/11 |     52 |
| 2017/03 |     32 |
| 2016/10 |     26 |
| 2016/09 |     21 |
| 2016/08 |     13 |
| 2016/07 |      9 |
| 2016/05 |      8 |


This gives us something we need later: the period that has the most records. It should match the `month_studied` value set in the Study variables section. If it does not, for example when processing a different file, update `month_studied` before going on.
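
That check is done by eye here. A small sketch of doing it in code, using `collections.Counter` and the first five counts from the table above, might look like this. It is an illustration rather than part of the notebook's agate pipeline.

```python
from collections import Counter

month_studied = '2017/02'

# counts copied from the first five rows of the table above
period_counts = Counter({
    '2017/02': 14090,
    '2017/01': 1423,
    '2016/12': 141,
    '2016/11': 52,
    '2017/03': 32,
})

# most_common(1) returns a list holding the single (period, count) pair
busiest_period, record_count = period_counts.most_common(1)[0]

if busiest_period != month_studied:
    raise ValueError(
        'month_studied is %s but most records are from %s'
        % (month_studied, busiest_period)
    )

print(busiest_period, record_count)
```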

Now we can filter or select the records where the month equals what we want.

In [26]:
mixbev_month = mixbev.where(lambda row: row['Period'] == month_studied)
len(mixbev_month)

14090

### Top sales in a city

Uses the `city_studied` variable set at the top of the notebook.

In [27]:
# Filters the data to the specified city
mixbev_city = mixbev_month.where(lambda row: row['City'] == city_studied)

# groups the data based on Establishment and City
city_grouped = mixbev_city.group_by('Establishment').group_by('City')
# sums the Tax column for each group and stores it as total_sales
summary = city_grouped.aggregate([
    ('total_sales', agate.Sum('Tax'))
])
# sorts the results by most sold
summary_sorted = summary.order_by('total_sales', reverse=True)
# prints the top 10 results
summary_sorted.limit(10).print_table(max_column_width=80)


| Establishment                                      | City   | total_sales |
| -------------------------------------------------- | ------ | ----------- |
| WLS BEVERAGE CO 110 E 2ND ST                       | AUSTIN |   79,804.43 |
| ROSE ROOM/ 77 DEGREES 11500 ROCK ROSE AVE          | AUSTIN |   29,873.89 |
| 400 BAR/CUCARACHA/CHUPACABRA/J 400 E 6TH ST        | AUSTIN |   28,395.13 |
| THE DOGWOOD DOMAIN 11420 ROCK ROSE AVE STE 700     | AUSTIN |   27,372.64 |
| KUNG FU SALOON 11501 ROCK ROSE AVE STE 140         | AUSTIN |   23,468.15 |
| BARTON CREEK COUNTRY CLUB 8212 BARTON CLUB DR      | AUSTIN |   23,065.68 |
| TOP GOLF 2700 ESPERANZA XING                       | AUSTIN |   22,937.51 |
| SAN JACINTO BEVERAGE COMPANY L 98 SAN JACINTO BLVD | AUSTIN |   22,361.92 |
| ALAMO DRAFTHOUSE CINEMA 1120 S LAMAR BLVD          | AUSTIN |   22,324.66 |
| THE PALAZIO 501 E BEN WHITE BLVD                   | AUSTIN |   21,275.24 |
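
The filter, group, sum and sort steps above can be sketched in plain Python as follows. All rows here are made-up sample data, not records from the file, and the sketch only mirrors the logic of the agate calls.

```python
from collections import defaultdict
from decimal import Decimal

city_studied = 'AUSTIN'

# made-up sample rows; the first two share an establishment
rows = [
    {'Establishment': 'HYPOTHETICAL BAR 1 MAIN ST', 'City': 'AUSTIN', 'Tax': Decimal('100.50')},
    {'Establishment': 'HYPOTHETICAL BAR 1 MAIN ST', 'City': 'AUSTIN', 'Tax': Decimal('50.25')},
    {'Establishment': 'SAMPLE GRILL 2 OAK ST', 'City': 'AUSTIN', 'Tax': Decimal('200.00')},
    {'Establishment': 'OTHER PLACE 3 ELM ST', 'City': 'ABILENE', 'Tax': Decimal('999.00')},
]

# filter to the city, then sum Tax per (Establishment, City) group
totals = defaultdict(Decimal)
for row in rows:
    if row['City'] == city_studied:
        totals[(row['Establishment'], row['City'])] += row['Tax']

# sort groups from largest total to smallest and keep the top 10
top = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:10]

for (establishment, city), total in top:
    print(establishment, city, total)
```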


In [28]:
summary_sorted.limit(10).print_bars('Establishment', 'total_sales', width=80)

Establishment                                      total_sales
WLS BEVERAGE CO 110 E 2ND ST                         79,804.43 ▓░░░░░░░░░░░░░░░░
ROSE ROOM/ 77 DEGREES 11500 ROCK ROSE AVE            29,873.89 ▓░░░░░░          
400 BAR/CUCARACHA/CHUPACABRA/J 400 E 6TH ST          28,395.13 ▓░░░░░░          
THE DOGWOOD DOMAIN 11420 ROCK ROSE AVE STE 700       27,372.64 ▓░░░░░           
KUNG FU SALOON 11501 ROCK ROSE AVE STE 140           23,468.15 ▓░░░░░           
BARTON CREEK COUNTRY CLUB 8212 BARTON CLUB DR        23,065.68 ▓░░░░░           
TOP GOLF 2700 ESPERANZA XING                         22,937.51 ▓░░░░░           
SAN JACINTO BEVERAGE COMPANY L 98 SAN JACINTO BLVD   22,361.92 ▓░░░░            
ALAMO DRAFTHOUSE CINEMA 1120 S LAMAR BLVD            22,324.66 ▓░░░░            
THE PALAZIO 501 E BEN WHITE BLVD                     21,275.24 ▓░░░░            
                                                               +---------------+
                                              

### Top sales in a county

Uses the `county_studied` variable set at the top of the notebook.

In [29]:
# Filters the data to the specified county
# (a County of None, from a county code with no match in counties.csv, is skipped)
mixbev_county = mixbev_month.where(lambda row: (row['County'] or '').upper() == county_studied)

# groups the data based on Establishment and County
county_grouped = mixbev_county.group_by('Establishment').group_by('County')
# sums the Tax column for each group and stores it as total_sales
cn_summary = county_grouped.aggregate([
    ('total_sales', agate.Sum('Tax'))
])
# sorts the results by most sold
cn_summary_sorted = cn_summary.order_by('total_sales', reverse=True)
# prints the top 10 results
cn_summary_sorted.limit(10).print_table(max_column_width=80)


| Establishment                                     | County  | total_sales |
| ------------------------------------------------- | ------- | ----------- |
| LOST PINES BEVERAGE LLC 575 HYATT LOST PINES ROAD | Bastrop |   28,824.53 |
| OLD TOWN RESTURANT AND BAR/PIN 931 MAIN ST        | Bastrop |    4,089.61 |
| CHILI'S GRILL & BAR 734 HIGHWAY 71 W              | Bastrop |    2,659.63 |
| NEIGHBOR'S 601 CHESTNUT ST UNIT C                 | Bastrop |    1,849.13 |
| RED ROCK STEAKHOUSE & SALOON 101 S LENTZ ST       | Bastrop |    1,712.92 |
| LA HACIENDA RESTAURANT 1800 WALNUT ST             | Bastrop |    1,570.01 |
| VERANDA 910 MAIN ST                               | Bastrop |    1,420.40 |
| MORELIA MEXICAN CAFE 608 W ALAMO ST               | Bastrop |    1,225.89 |
| JALISCO MEXICAN RESTAURANT 244 HIGHWAY 290 W      | Bastrop |    1,135.31 |
| SP BASTROP THEATRE, LP 1600 CHESTNUT ST           | Bastrop |    1,113.87 |


In [30]:
cn_summary_sorted.limit(10).print_bars('Establishment', 'total_sales', width=80)

Establishment                                     total_sales
LOST PINES BEVERAGE LLC 575 HYATT LOST PINES ROAD   28,824.53 ▓░░░░░░░░░░░░░░░░ 
OLD TOWN RESTURANT AND BAR/PIN 931 MAIN ST           4,089.61 ▓░░               
CHILI'S GRILL & BAR 734 HIGHWAY 71 W                 2,659.63 ▓░░               
NEIGHBOR'S 601 CHESTNUT ST UNIT C                    1,849.13 ▓░                
RED ROCK STEAKHOUSE & SALOON 101 S LENTZ ST          1,712.92 ▓░                
LA HACIENDA RESTAURANT 1800 WALNUT ST                1,570.01 ▓░                
VERANDA 910 MAIN ST                                  1,420.40 ▓░                
MORELIA MEXICAN CAFE 608 W ALAMO ST                  1,225.89 ▓░                
JALISCO MEXICAN RESTAURANT 244 HIGHWAY 290 W         1,135.31 ▓░                
SP BASTROP THEATRE, LP 1600 CHESTNUT ST              1,113.87 ▓░                
                                                              +---+------------+
                                               