# Processing mixed beverage data
This Jupyter Notebook uses curl to download [Mixed Beverage Gross Receipts](https://comptroller.texas.gov/taxes/mixed-beverage/receipts.php) files from the Texas Comptroller's [data center](https://comptroller.texas.gov/transparency/open-data/search-datasets/), and then uses a Python library called [agate](http://agate.readthedocs.io/) to clean and process that data for [stories similar to this one](http://www.mystatesman.com/business/austin-alcohol-sales-percent-february/Oo2txZUkuDlqBl0rU9O1lJ/) on monthly alcohol sales.

This is a stripped-down version (compared to the original fork) that explains the steps only through code comments.

## Get to the goods

To skip most of the setup code and get to what you really want, search for:

- Top sales statewide
- Austin sales
- Central Texas cities

### File download

- Go to the [Texas Comptroller data center](https://comptroller.texas.gov/transparency/open-data/search-datasets/), copy the URL of this month's CSV and enter it below.
- You also need to set the [processing variables](#Processing-variables) for this month.

In [34]:
%%bash
## downloads the mixedbev file into mixbev-files folder
## You have to set this URL based on location in data center
cd ../mixbev-files
curl -O https://comptroller.texas.gov/auto-data/odc/MIXEDBEV_04_2017.CSV



In [35]:
# imports the libraries we will use
import agate
from decimal import Decimal
import re

# this suppresses the timezone warning
# Might comment out during development so other warnings
# are not suppressed
import warnings
warnings.filterwarnings('ignore')

### Processing variables
Next we set some values that the rest of the notebook depends on.

- The **`file`** is the name of the file we want to process
- The **`tax_rate`** is the value we need for this file to get the Gross Receipts (vs the Tax Reported, which is just the tax amount the establishment paid). The comptroller [has information on the tax](https://comptroller.texas.gov/taxes/mixed-beverage/receipts.php), but this [old record layout](https://github.com/utdata/cli-tools/blob/master/data/mixbevtax/OLD-MIXEDBEVTAX-LAYOUT.txt) best describes the math.
- The **`month_studied`** is the YYYY/MM designation for the month before the file release. The file released in February has mostly records from January, but can also have records from any other month, so we set here the specific month we want. Note there is a check later on that counts the number of records by month, which is worth checking.
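
The tax-to-receipts math can be sketched on its own. This is a minimal illustration that assumes the same 6.7 percent rate set in this notebook; the function name `gross_receipts` is hypothetical and does not appear in the notebook's code.

```python
from decimal import Decimal

# Tax rate in percent, the same value this notebook uses
tax_rate = Decimal('6.7')

def gross_receipts(report_tax, rate=tax_rate):
    """Convert reported tax to gross receipts, rounded to cents."""
    return (report_tax / rate * 100).quantize(Decimal('0.01'))

# A reported tax of 75,785.84 works back to receipts of 1,131,131.94
print(gross_receipts(Decimal('75785.84')))
```

That pair of figures matches the WLS BEVERAGE CO row in the March 2017 results further down.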

Here are the files we have downloaded:

In [36]:
ls ../mixbev-files

MIXEDBEV_02_2017.CSV  MIXEDBEV_04_2015.CSV  MIXEDBEV_04_2017.CSV
MIXEDBEV_03_2017.CSV  MIXEDBEV_04_2016.CSV


In [37]:
# this is our source file, which may have been downloaded above
# Swap out the file name here and date below as needed
file = '../mixbev-files/MIXEDBEV_04_2017.CSV'

# setting the month_studied var.
# This should be checked in the table below that counts records by month
month_studied = '2017/03'

# Sets the tax rate to convert Report Tax to Gross Receipts
# The rate has been 6.7 percent since January 1, 2014
tax_rate = Decimal('6.7')
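
Because `month_studied` is the month before the file release, it could be derived from the file name instead of typed by hand. This is only a sketch: it assumes file names always follow the `MIXEDBEV_MM_YYYY.CSV` pattern seen in the listing above, and the helper name `month_before_release` is hypothetical.

```python
import re

def month_before_release(file_name):
    """Return 'YYYY/MM' for the month before the release named in the file."""
    match = re.search(r'MIXEDBEV_(\d{2})_(\d{4})\.CSV$', file_name)
    if match is None:
        raise ValueError('Unexpected file name: {}'.format(file_name))
    month, year = int(match.group(1)), int(match.group(2))
    # Step back one month, rolling January back to December of the prior year
    if month == 1:
        month, year = 12, year - 1
    else:
        month -= 1
    return '{}/{:02d}'.format(year, month)

print(month_before_release('../mixbev-files/MIXEDBEV_04_2017.CSV'))
```

The derived value should still be compared against the count of records by period, since a file can hold records from several months.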

### Import and processing

In [38]:
# sets the column names of the original data set.
column_names = [
    'TABC Permit Number',
    'Trade Name',
    'Location Address',
    'Location City',
    'Location State',
    'Location Zip Code',
    'Location County Code',
    'Blank',
    'Report Period',
    'Report Tax'
]
# Helps us import some text fields that may be considered numbers in error.
specified_types = {
    'Location Zip Code': agate.Text(),
    'Location County Code': agate.Text()
}

# this imports the file specified above, along with the proper types
mixbev_raw = agate.Table.from_csv(file, column_names, encoding='iso-8859-1', column_types=specified_types)

# mixbev_trim creates a new interim table with results of compute function
# that takes the four columns that need trimming and strips them of white space,
# adding them to the end of the table with new names.
# The last computation does the math to create the Gross Receipts based on the tax_rate set above

mixbev_trim = mixbev_raw.compute([
    ('Permit', agate.Formula(agate.Text(), lambda r: r['TABC Permit Number'].strip())),
    ('Name', agate.Formula(agate.Text(), lambda r: r['Trade Name'].strip())),
    ('Address', agate.Formula(agate.Text(), lambda r: r['Location Address'].strip())),
    ('City', agate.Formula(agate.Text(), lambda r: r['Location City'].strip())),
    ('Receipts_compute', agate.Formula(agate.Number(), lambda r: (r['Report Tax'] / tax_rate) * 100))
])

# the Receipts_compute computation above returns as a decimal number,
# so this function rounds those numbers.
# I might refactor this later so I can use it elsewhere.
def round_receipt(row):
    return row['Receipts_compute'].quantize(Decimal('0.01'))

# This compute method uses the round_receipt function above,
# putting the results into a new table.
mixbev_round = mixbev_trim.compute([
    ('Receipts', agate.Formula(agate.Number(), round_receipt))
])

# creates new table, selecting just the columns we need
# then renames some of them for ease later.
mixbev_cleaned = mixbev_round.select([
    'Permit',
    'Name',
    'Address',
    'City',
    'Location State',
    'Location Zip Code',
    'Location County Code',
    'Report Period',
    'Report Tax',
    'Receipts'
]).rename(column_names = {
    'Location State': 'State',
    'Location Zip Code': 'Zip',
    'Location County Code': 'CountyCode',
    'Report Period': 'Period',
    'Report Tax': 'Tax'
})

# Concatenates the name and address
mixbev_cleaned_est = mixbev_cleaned.compute([
    ('Establishment', agate.Formula(agate.Text(), lambda row: '%(Name)s %(Address)s' % row))
])

# importing counties.csv, ensuring that the 'code' column is text
counties = agate.Table.from_csv('../resource-files/counties.csv', column_types={'code': agate.Text()})

# joins the counties table to the mixed bev cleaned data with establishments
mixbev_joined = mixbev_cleaned_est.join(counties, 'CountyCode', 'code')

# get just the columns we need and rename county
# THIS is the finished, cleaned mixbev table
mixbev = mixbev_joined.select([
    'Permit',
    'Name',
    'Address',
    'Establishment',
    'City',
    'State',
    'Zip',
    'county',
    'Period',
    'Tax',
    'Receipts'
]).rename(column_names = {
    'county': 'County'
})
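
The county join above can be pictured as a dictionary lookup on the county code. The sketch below uses plain Python rather than agate, and every code, county name and record in it is made up for illustration. It also shows why the code columns are kept as text: converting a code to a number drops its leading zeros, so it would no longer match.

```python
# Hypothetical lookup rows standing in for counties.csv
counties = [
    {'code': '001', 'county': 'Example One'},
    {'code': '227', 'county': 'Example Two'},
]
county_by_code = {row['code']: row['county'] for row in counties}

# Hypothetical cleaned records; '999' has no match in the lookup
records = [
    {'Name': 'SAMPLE BAR', 'CountyCode': '001'},
    {'Name': 'SAMPLE GRILL', 'CountyCode': '999'},
]

# Keep every record; unmatched codes get None, like a left outer join
joined = [dict(r, County=county_by_code.get(r['CountyCode'])) for r in records]

# A code read as a number loses its leading zeros and stops matching
code_as_number = str(int('001'))
print(joined)
print(code_as_number in county_by_code)
```

agate's `join` also keeps unmatched rows by default, so a missing county shows up as an empty value rather than a dropped record.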


### Looking at dates of the records

This basically confirms that the file has multiple dates, and that we are looking at the right month of data. Typically a data set will have mostly reports from the previous month, but there are always also submissions from other months. We want to filter out those other months, which we do based on the `month_studied` variable set near the top of the file, which should match the period at the top of the table below.


In [39]:
# Pivot the mixbev table by Period. By default it gives a Count of the records
# We then order the table by Count in descending order
by_period = mixbev.pivot('Period').order_by('Count', reverse=True)

# prints the table of period and number of records
by_period.limit(5).print_table(max_rows=None)

| Period  |  Count |
| ------- | ------ |
| 2017/03 | 14,144 |
| 2017/02 |  1,718 |
| 2017/01 |    258 |
| 2016/12 |     61 |
| 2017/04 |     49 |
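
The visual check against `month_studied` could also be automated. A minimal sketch in plain Python, using the five counts printed in the table above as its input; in the notebook itself the counts would come from the `by_period` table instead.

```python
from collections import Counter

# Record counts by period, copied from the table above
period_counts = Counter({
    '2017/03': 14144,
    '2017/02': 1718,
    '2017/01': 258,
    '2016/12': 61,
    '2017/04': 49,
})

month_studied = '2017/03'

# most_common(1) returns a list holding the (period, count) pair with the highest count
top_period, top_count = period_counts.most_common(1)[0]

if top_period != month_studied:
    raise ValueError('month_studied is {} but most records are from {}'.format(
        month_studied, top_period
    ))

print('{} has the most records: {}'.format(top_period, top_count))
```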


In [40]:
## filters the records to our month_studied
mixbev_month = mixbev.where(lambda row: row['Period'] == month_studied)

# function to group sales by a specific location
# City or County passed in should be ALL CAPS
# Location_type can be 'City' or 'County'

def location_sum(location_type, location):
    # Filters the data to the specified city or county
    location_filtered = mixbev_month.where(lambda row: row[location_type].upper() == location)

    # groups the data based on Establishment and location
    location_grouped = location_filtered.group_by('Establishment').group_by(location_type)
    # computes the sales based on the grouping
    location_summary = location_grouped.aggregate([
        ('Tax_sum', agate.Sum('Tax')),
        ('Receipts_sum', agate.Sum('Receipts'))
    ])
    
    # sorts the results by most sold
    location_summary_sorted = location_summary.order_by('Receipts_sum', reverse=True)

    # returns the sorted table; the caller limits and prints it
    return location_summary_sorted


## Top sales statewide

Because we want to group our results by more than one field and perform more than one aggregation, we'll do this a little differently. We'll use `group_by` to create a grouped table, then perform aggregations on that new table to compute the Tax and Receipts columns.
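
For readers new to agate, the idea of grouping on several fields and computing more than one aggregate can be sketched in plain Python. This is an illustration with made-up rows, not a description of how agate works internally.

```python
from collections import defaultdict
from decimal import Decimal

# Made-up rows for illustration only
rows = [
    {'Establishment': 'BAR A 1 MAIN ST', 'County': 'Travis', 'City': 'AUSTIN',
     'Tax': Decimal('67.00'), 'Receipts': Decimal('1000.00')},
    {'Establishment': 'BAR A 1 MAIN ST', 'County': 'Travis', 'City': 'AUSTIN',
     'Tax': Decimal('33.50'), 'Receipts': Decimal('500.00')},
    {'Establishment': 'BAR B 2 OAK ST', 'County': 'Hays', 'City': 'KYLE',
     'Tax': Decimal('134.00'), 'Receipts': Decimal('2000.00')},
]

# One running total per (Establishment, County, City) combination
totals = defaultdict(lambda: {'Tax_sum': Decimal('0'), 'Sales_sum': Decimal('0')})
for row in rows:
    key = (row['Establishment'], row['County'], row['City'])
    totals[key]['Tax_sum'] += row['Tax']
    totals[key]['Sales_sum'] += row['Receipts']

# Sort the groups by sales, largest first
ranked = sorted(totals.items(), key=lambda item: item[1]['Sales_sum'], reverse=True)
for key, sums in ranked:
    print(key, sums['Tax_sum'], sums['Sales_sum'])
```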

In [41]:
# groups the data based on Establishment, County and City
mixbev_grouped = mixbev_month.group_by('Establishment').group_by('County').group_by('City')

# computes the sales based on the grouping
state_summary = mixbev_grouped.aggregate([
    ('Tax_sum', agate.Sum('Tax')),
    ('Sales_sum', agate.Sum('Receipts'))
])

# sorts the results by most sold. We could probably chain it above if we wanted to.
state_summary_sorted = state_summary.order_by('Sales_sum', reverse=True)

# summing sales statewide for month
print('\nTotal sales across the state for the given month: {}\n'.format(
    mixbev_month.aggregate(agate.Sum('Receipts'))
))

print('Top sales by establishment statewide\n')

# prints the top 10 results
state_summary_sorted.limit(10).print_table(max_column_width=40)


Total sales across the state for the given month: 589318616.73

Top sales by establishment statewide

| Establishment                            | County  | City        |    Tax_sum |    Sales_sum |
| ---------------------------------------- | ------- | ----------- | ---------- | ------------ |
| THREE NRG PARK 2000 SOUTH LOOP W         | Harris  | HOUSTON     | 251,360.81 | 3,751,653.88 |
| ARAMARK SPORTS AND ENTERTAINME 211 AT... | Bexar   | SAN ANTONIO | 127,188.37 | 1,898,333.88 |
| GAYLORD TEXAN 1501 GAYLORD TRL           | Tarrant | GRAPEVINE   | 106,811.33 | 1,594,198.96 |
| HOSPITALITY INTERNATIONAL, INC 23808 ... | Bexar   | SAN ANTONIO | 102,455.66 | 1,529,188.96 |
| LEVY RESTAURANTS AT TOYOTA CEN 1510 P... | Harris  | HOUSTON     |  79,008.41 | 1,179,230.00 |
| SALC, INC. 2201 N STEMMONS FWY FL 1      | Dallas  | DALLAS      |  75,850.63 | 1,132,098.96 |
| WLS BEVERAGE CO 110 E 2ND ST             | Travis  | AUSTIN      |  75,785.84 | 1,131,131.94 |
| OMNI DALLAS CONVENTION

## Austin sales and sums

With this, we reference the `location_sum` function above, and pass the type of location (City) and the name of the city (AUSTIN). We then limit the result of that function to the first five records and print the results. We are basically stringing together a bunch of steps at once.

In [42]:
# uses the location_sum function to filter
austin = location_sum('City', 'AUSTIN')

print('\nTotal sales in Austin for the given month: {}\n'.format(
    austin.aggregate(agate.Sum('Receipts_sum'))
))

# print the resulting table
austin.limit(5).print_table(max_column_width=40)


Total sales in Austin for the given month: 75059310.65

| Establishment                            | City   |   Tax_sum | Receipts_sum |
| ---------------------------------------- | ------ | --------- | ------------ |
| WLS BEVERAGE CO 110 E 2ND ST             | AUSTIN | 75,785.84 | 1,131,131.94 |
| 400 BAR/CUCARACHA/CHUPACABRA/J 400 E ... | AUSTIN | 43,934.51 |   655,738.96 |
| SAN JACINTO BEVERAGE COMPANY L 98 SAN... | AUSTIN | 37,672.15 |   562,270.90 |
| RAIN ON 4TH 217 W 4TH ST STE B           | AUSTIN | 36,828.02 |   549,671.94 |
| THE BLIND PIG PUB 317 E 6TH ST           | AUSTIN | 32,472.48 |   484,663.88 |


## More Central Texas cities

In [43]:
location_sum('City', 'BASTROP').limit(3).print_table(max_column_width=40)

| Establishment                            | City    |  Tax_sum | Receipts_sum |
| ---------------------------------------- | ------- | -------- | ------------ |
| OLD TOWN RESTURANT AND BAR/PIN 931 MA... | BASTROP | 5,095.81 |    76,056.87 |
| CHILI'S GRILL & BAR 734 HIGHWAY 71 W     | BASTROP | 2,838.38 |    42,363.88 |
| NEIGHBOR'S 601 CHESTNUT ST UNIT C        | BASTROP | 2,812.92 |    41,983.88 |


In [44]:
location_sum('City', 'BEE CAVE').limit(3).print_table(max_column_width=40)

| Establishment                            | City     |  Tax_sum | Receipts_sum |
| ---------------------------------------- | -------- | -------- | ------------ |
| HCG BEVERAGE, LLC 12525 BEE CAVE PKWY    | BEE CAVE | 6,331.63 |    94,501.94 |
| WOODY TAVERN AND GRILL, INC. 12801 SH... | BEE CAVE | 6,323.46 |    94,380.00 |
| MAUDIE'S HILL COUNTRY, LLC 12506 SHOP... | BEE CAVE | 5,968.29 |    89,078.96 |


In [45]:
location_sum('City', 'BUDA').limit(3).print_table(max_column_width=40)

| Establishment                  | City |  Tax_sum | Receipts_sum |
| ------------------------------ | ---- | -------- | ------------ |
| BUCKS BACKYARD 1750 S FM 1626  | BUDA | 7,863.05 |   117,358.96 |
| WILLIE'S JOINT 824 MAIN ST     | BUDA | 4,295.03 |    64,104.93 |
| PINBALLZ KINGDOM 15201 S IH 35 | BUDA | 3,521.92 |    52,565.97 |


In [46]:
location_sum('City', 'CEDAR PARK').limit(3).print_table(max_column_width=40)

| Establishment                            | City       |   Tax_sum | Receipts_sum |
| ---------------------------------------- | ---------- | --------- | ------------ |
| CHUY'S 4911 183A TOLL RD                 | CEDAR PARK | 12,001.04 |   179,120.00 |
| LUPE TORTILLA MEXICAN RESTAURA 4501 1... | CEDAR PARK |  7,989.81 |   119,250.90 |
| RYAN SANDERS SPORTS SERVICES, 2100 AV... | CEDAR PARK |  6,700.20 |   100,002.99 |


In [47]:
location_sum('City', 'DRIPPING SPRINGS').limit(3).print_table(max_column_width=40)

| Establishment                            | City             |  Tax_sum | Receipts_sum |
| ---------------------------------------- | ---------------- | -------- | ------------ |
| TRUDY'S FOUR STAR 13059 FOUR STAR BLVD   | DRIPPING SPRINGS | 4,229.71 |    63,130.00 |
| DEEP EDDY DISTILLING CO 2250 E HIGHWA... | DRIPPING SPRINGS | 4,001.30 |    59,720.90 |
| FLORES MEXICAN RESTAURANT 2440 E HIGH... | DRIPPING SPRINGS | 3,631.73 |    54,204.93 |


In [48]:
location_sum('City', 'GEORGETOWN').limit(3).print_table(max_column_width=40)

| Establishment                      | City       |  Tax_sum | Receipts_sum |
| ---------------------------------- | ---------- | -------- | ------------ |
| EL MONUMENTO 205 W 2ND ST          | GEORGETOWN | 6,815.03 |   101,716.87 |
| HARDTAILS 1515 N IH 35             | GEORGETOWN | 4,984.66 |    74,397.91 |
| DOS SALSAS CAFE INC 1104 S MAIN ST | GEORGETOWN | 4,818.90 |    71,923.88 |


In [49]:
location_sum('City', 'KYLE').limit(3).print_table(max_column_width=40)

| Establishment                            | City |  Tax_sum | Receipts_sum |
| ---------------------------------------- | ---- | -------- | ------------ |
| CASA GARCIA'S MEXICAN RESTAURA 5401 F... | KYLE | 5,228.68 |    78,040.00 |
| EVO ENTERTAINMENT CENTER 3200 KYLE XING  | KYLE | 4,822.32 |    71,974.93 |
| CENTERFIELD SPORTS BAR & GRILL 200 W ... | KYLE | 2,941.43 |    43,901.94 |


In [50]:
location_sum('City', 'LAGO VISTA').limit(3).print_table(max_column_width=40)

| Establishment                            | City       |  Tax_sum | Receipts_sum |
| ---------------------------------------- | ---------- | -------- | ------------ |
| COPPERHEAD GRILL 6115 LOHMANS FORD RD    | LAGO VISTA | 1,543.61 |    23,038.96 |
| MARIA'S BAR & GRILL MEXICAN RE 20602 ... | LAGO VISTA |   490.10 |     7,314.93 |


In [51]:
location_sum('City', 'LAKEWAY').limit(3).print_table(max_column_width=40)

| Establishment                            | City    |  Tax_sum | Receipts_sum |
| ---------------------------------------- | ------- | -------- | ------------ |
| THE GROVE WINE BAR AND KITCHEN 3001 R... | LAKEWAY | 7,590.96 |   113,297.91 |
| LAKEWAY RESORT AND SPA 101 LAKEWAY DR    | LAKEWAY | 6,797.01 |   101,447.91 |
| HIGH 5 ENTERTAINMENT 1502 RANCH ROAD ... | LAKEWAY | 4,468.96 |    66,700.90 |


In [52]:
location_sum('City', 'LEANDER').limit(3).print_table(max_column_width=40)

| Establishment                            | City    |  Tax_sum | Receipts_sum |
| ---------------------------------------- | ------- | -------- | ------------ |
| BROOKLYN HEIGHTS PIZZERIA 3550 LAKELI... | LEANDER | 3,792.26 |    56,600.90 |
| JARDIN DEL REY 703 S HIGHWAY 183         | LEANDER | 2,939.29 |    43,870.00 |
| TAPATIA JALISCO #3 LLC 651 N US 183      | LEANDER |   755.42 |    11,274.93 |


In [53]:
location_sum('City', 'LIBERTY HILL').limit(3).print_table(max_column_width=40)

| Establishment                            | City         |  Tax_sum | Receipts_sum |
| ---------------------------------------- | ------------ | -------- | ------------ |
| JARDIN CORONA 15395 W STATE HIGHWAY 29   | LIBERTY HILL | 2,991.81 |    44,653.88 |
| MARGARITA'S RESTAURANT 10280 W STATE ... | LIBERTY HILL | 2,156.32 |    32,183.88 |
| ELENAS MEXICAN RESTAURANT 14801 W STA... | LIBERTY HILL |   280.32 |     4,183.88 |


In [54]:
location_sum('City', 'PFLUGERVILLE').limit(3).print_table(max_column_width=40)

| Establishment                          | City         |  Tax_sum | Receipts_sum |
| -------------------------------------- | ------------ | -------- | ------------ |
| MAVERICKS 1700 GRAND AVENUE PKWY STE 2 | PFLUGERVILLE | 7,534.15 |   112,450.00 |
| HANOVER'S DRAUGHT HAUS 108 E MAIN ST   | PFLUGERVILLE | 4,784.67 |    71,412.99 |
| LAST CALL 1615 GRAND AVENUE PKWY STE 2 | PFLUGERVILLE | 4,573.08 |    68,254.93 |


In [55]:
location_sum('City', 'ROUND ROCK').limit(5).print_table(max_column_width=40)

| Establishment                            | City       |   Tax_sum | Receipts_sum |
| ---------------------------------------- | ---------- | --------- | ------------ |
| CHUY'S ROUND ROCK 2320 N INTERSTATE 35   | ROUND ROCK | 10,927.56 |   163,097.91 |
| THIRD BASE ROUND ROCK, LLC 3107 S INT... | ROUND ROCK | 10,351.03 |   154,492.99 |
| FAST EDDIE'S NEIGHBORHOOD BILL 100 PA... | ROUND ROCK | 10,327.98 |   154,148.96 |
| RICK'S CABARET 3105 S INTERSTATE 35      | ROUND ROCK |  9,802.36 |   146,303.88 |
| JACK ALLEN'S KITCHEN 2500 HOPPE TRL      | ROUND ROCK |  9,461.94 |   141,222.99 |


In [56]:
location_sum('City', 'SAN MARCOS').limit(5).print_table(max_column_width=40)

| Establishment                          | City       |  Tax_sum | Receipts_sum |
| -------------------------------------- | ---------- | -------- | ------------ |
| ZELICKS 336 W HOPKINS ST               | SAN MARCOS | 9,369.34 |   139,840.90 |
| CHUY'S SAN MARCOS 1121 N INTERSTATE 35 | SAN MARCOS | 7,638.87 |   114,012.99 |
| THE MARC 120 E SAN ANTONIO ST          | SAN MARCOS | 7,286.78 |   108,757.91 |
| SEAN PATRICK'S 202 E SAN ANTONIO ST    | SAN MARCOS | 7,081.83 |   105,698.96 |
| PLUCKERS WING BAR 105 N INTERSTATE 35  | SAN MARCOS | 6,924.58 |   103,351.94 |


In [57]:
location_sum('City', 'SPICEWOOD').limit(3).print_table(max_column_width=40)

| Establishment                            | City      |  Tax_sum | Receipts_sum |
| ---------------------------------------- | --------- | -------- | ------------ |
| ANGEL'S ICEHOUSE 21815 W HWY 71          | SPICEWOOD | 4,225.15 |    63,061.94 |
| POODIES HILLTOP ROADHOUSE 22308 STATE... | SPICEWOOD | 3,499.87 |    52,236.87 |
| APIS RESTAURANT 23526 STATE HIGHWAY 71 W | SPICEWOOD | 2,262.45 |    33,767.91 |


In [58]:
location_sum('City', 'SUNSET VALLEY').limit(3).print_table(max_column_width=40)

| Establishment                            | City          |  Tax_sum | Receipts_sum |
| ---------------------------------------- | ------------- | -------- | ------------ |
| DOC'S BACKYARD 5207 BRODIE LN # 100      | SUNSET VALLEY | 6,248.62 |    93,262.99 |
| BJ'S RESTAURANT AND BREWHOUSE 5207 BR... | SUNSET VALLEY | 3,561.98 |    53,163.88 |
| LONGHORN STEAKHOUSE #5423 4809 W HIGH... | SUNSET VALLEY | 2,439.60 |    36,411.94 |


In [59]:
location_sum('City', 'WEST LAKE HILLS').limit(3).print_table(max_column_width=40)

| Establishment                            | City            |  Tax_sum | Receipts_sum |
| ---------------------------------------- | --------------- | -------- | ------------ |
| LUPE TORTILLA MEXICAN RESTAURA 701 S ... | WEST LAKE HILLS | 6,853.36 |   102,288.96 |
| CHIPOTLE CHIPOTLE MEXICAN GRIL 3300 B... | WEST LAKE HILLS |    39.12 |       583.88 |


## Sales by county example

In this case, we pass in the location type of 'County' and then a county name in caps to get the top-selling establishments in a particular county.
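
The caps requirement exists because `location_sum` upper-cases the value in the table but not the value passed in, and county names are stored in title case ("Caldwell"). A small sketch of a comparison that normalizes both sides follows; the helper name `matches_location` is hypothetical and not part of this notebook.

```python
def matches_location(row, location_type, location):
    """Compare a row's city or county to a name, ignoring case and padding."""
    return row[location_type].strip().upper() == location.strip().upper()

# A row shaped like the cleaned mixbev table, with only the fields needed here
row = {'City': 'LOCKHART', 'County': 'Caldwell'}

print(matches_location(row, 'County', 'CALDWELL'))
print(matches_location(row, 'County', 'caldwell'))
print(matches_location(row, 'County', 'Travis'))
```

With a comparison like this inside the filter, callers would not need to remember the ALL CAPS rule.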

In [60]:
location_sum('County', 'CALDWELL').limit(3).print_table(max_column_width=40)

| Establishment                            | County   |  Tax_sum | Receipts_sum |
| ---------------------------------------- | -------- | -------- | ------------ |
| GUADALAJARA MEXICAN RESTAURANT 1710 S... | Caldwell | 1,258.86 |    18,788.96 |
| THE PEARL 110 N MAIN ST                  | Caldwell |   644.94 |     9,625.97 |
| MR TACO 1132 E PIERCE ST                 | Caldwell |   618.61 |     9,232.99 |


## Sales by ZIP Code
Just making sure that 78701 is at the top of this list, which it has been every month for a decade.

In [61]:
# top zip code gross receipts
zip_receipts = mixbev_month.pivot('Zip', aggregation=agate.Sum('Receipts')).order_by('Sum', reverse=True)
zip_receipts.limit(5).print_table()

| Zip   |           Sum |
| ----- | ------------- |
| 78701 | 32,492,045.23 |
| 78205 | 13,819,245.15 |
| 75201 | 11,461,946.05 |
| 77002 |  9,463,902.34 |
| 78704 |  8,062,829.61 |
