# Processing mixed beverage data

This Jupyter Notebook takes the data.texas.gov [Mixed Beverage Gross Receipts data](https://data.texas.gov/Government-and-Taxes/Mixed-Beverage-Gross-Receipts/naix-2893) and uses a Python library called [agate](http://agate.readthedocs.io/) to clean and process it for [stories similar to this one](http://www.mystatesman.com/business/austin-alcohol-sales-percent-february/Oo2txZUkuDlqBl0rU9O1lJ/) on monthly alcohol sales.

This is a work in progress.

- The first version will process a downloaded file. The data is in a new format, so the processing has to be reworked.
- The second phase will work on pulling the data directly from Socrata.
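
For the second phase, the request URL could be built against Socrata's SODA endpoint for this dataset. The sketch below only constructs the URL and does not fetch anything. The dataset ID `naix-2893` comes from the link above; the field name `obligation_end_date_yyyymmdd` and the helper `build_soda_url` are assumptions that should be checked against the dataset's API documentation.

```python
from urllib.parse import urlencode

# Dataset ID taken from the data.texas.gov link above
DATASET_ID = 'naix-2893'
BASE_URL = 'https://data.texas.gov/resource/{}.csv'.format(DATASET_ID)


def build_soda_url(report_date, limit=50000):
    """Build a SODA query URL for one reporting month (hypothetical helper).

    The date field name is an assumption; confirm it in the dataset's
    API docs before relying on it.
    """
    params = {
        '$where': "obligation_end_date_yyyymmdd = '{}'".format(report_date),
        '$limit': limit,
    }
    return '{}?{}'.format(BASE_URL, urlencode(params))


url = build_soda_url('2017-09-30')
print(url)
```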

In [6]:
# import libraries
import agate

In [2]:
# this suppresses the timezone warning
# Might comment out during development so other warnings
# are not suppressed
import warnings
warnings.filterwarnings('ignore')
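
If blanket suppression hides too much during development, one option is to silence warnings only around the call that triggers them. This is a minimal sketch using the standard library's `warnings.catch_warnings`; `noisy_import` is a hypothetical stand-in for whichever call raises the timezone warning.

```python
import warnings


def noisy_import():
    # Hypothetical stand-in for the call that emits the timezone warning
    warnings.warn('timezone warning', UserWarning)
    return 'loaded'


# Warnings are ignored only inside this block;
# the previous filter settings are restored on exit
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    result = noisy_import()

print(result)
```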

Show downloaded files. The path is `../mixbev-files/YYYY`. There are also older files in `../mixbev-files/old-format/`.

In [48]:
ls ../mixbev-files/2017

Mixed_Beverage_Gross_Receipts_2017_09.csv
Mixed_Beverage_Gross_Receipts_2017_10.csv


## Import the file

We'll set some project variables, including the file name, and then import the data into agate.

In [49]:
# This is the source file, which should be downloaded
file = '../mixbev-files/2017/Mixed_Beverage_Gross_Receipts_2017_09.csv'

In [50]:
# Helps us import some text fields that might otherwise be mistaken for numbers.
specified_types = {
    'Taxpayer Number': agate.Text(),
    'Location Number': agate.Text(),
    'Taxpayer Zip': agate.Text(),
    'Location Zip': agate.Text(),
    'Location County': agate.Text(),
    'Taxpayer County': agate.Text()
}

# this imports the file specified above, along with the proper types
mixbev_raw = agate.Table.from_csv(file, column_types=specified_types)

# prints table fields so we can check those data types
print(mixbev_raw)

| column                     | data_type |
| -------------------------- | --------- |
| Taxpayer Number            | Text      |
| Taxpayer Name              | Text      |
| Taxpayer Address           | Text      |
| Taxpayer City              | Text      |
| Taxpayer State             | Text      |
| Taxpayer Zip               | Text      |
| Taxpayer County            | Text      |
| Location Number            | Text      |
| Location Name              | Text      |
| Location Address           | Text      |
| Location City              | Text      |
| Location State             | Text      |
| Location Zip               | Text      |
| Location County            | Text      |
| Inside/Outside City Limits | Boolean   |
| TABC Permit Number         | Text      |
| Responsibility Begin Date  | Date      |
| Responsibility End Date    | Date      |
| Obligation End Date        | Date      |
| Liquor Receipts            | Number    |
| Wine Receipts              | Number    |
| Beer Rece
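
The reason for forcing these columns to text is that identifier-like values lose information when treated as numbers. Here is a small standard-library illustration with made-up values (not rows from the source file):

```python
import csv
import io

# Made-up sample row for illustration only
sample = io.StringIO(
    'Taxpayer Number,Taxpayer Zip\n'
    '00123456789,02134\n'
)

row = next(csv.DictReader(sample))

# Kept as text, the ZIP code survives intact
as_text = row['Taxpayer Zip']

# Converted to a number, the leading zero is lost
as_number = int(row['Taxpayer Zip'])

print(as_text, as_number)
```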

## Create establishment column

We do this so each location is treated as a single establishment, instead of grouping locations at different addresses under a shared trade name like 'CHILI'S BAR & GRILL'.

In [51]:
# Concatenates the name and address
mixbev_establishment = mixbev_raw.compute([
    ('Establishment', agate.Formula(agate.Text(), lambda row: '%(Location Name)s (%(Location Address)s)' % row))
])

# Uncomment line below to print Establishment to check what it looks like
# mixbev_establishment.select('Establishment').limit(5).print_table(max_column_width=80)
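
To see why the address is folded into the key, consider two locations that share a trade name. The sketch below uses plain dictionaries with invented rows; the format string is the same one passed to `agate.Formula` above.

```python
# Invented rows for illustration; not taken from the source data
rows = [
    {'Location Name': "CHILI'S BAR & GRILL", 'Location Address': '100 MAIN ST'},
    {'Location Name': "CHILI'S BAR & GRILL", 'Location Address': '200 ELM ST'},
]

# Same %-style mapping format used for the Establishment column
establishments = [
    '%(Location Name)s (%(Location Address)s)' % row for row in rows
]

# Grouping by name alone would merge both locations into one
unique_names = {row['Location Name'] for row in rows}
unique_establishments = set(establishments)

print(len(unique_names), len(unique_establishments))
```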

## Import and merge counties lookup table

We do this to get county names. I got this list from the comptroller.

NOTE: Wisdom would suggest we join on the code column from counties, but the data.texas.gov data does not have the zero padding from those values, so I'm using the id column.
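
If joining on `code` is preferred later, the county values from data.texas.gov could be zero-padded first. This is a minimal sketch, assuming the comptroller codes are three characters wide (an assumption to verify against counties.csv); `pad_county_code` is a hypothetical helper.

```python
def pad_county_code(value, width=3):
    """Left-pad a county value with zeros (the width is an assumption)."""
    return str(value).strip().zfill(width)


# Unpadded values as they might appear in the downloaded file
print(pad_county_code('5'))
print(pad_county_code('227'))
```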

In [52]:
# importing counties.csv, ensuring that the 'code' and 'id' columns are text
counties = agate.Table.from_csv('../resource-files/counties.csv', column_types={
    'code': agate.Text(),
    'id': agate.Text()
})

# uncomment below to peek at the column names and an example
# print(counties)
# counties.limit(5).print_table()

# joins the counties table to the mixed bev cleaned data with establishments
mixbev_joined = mixbev_establishment.join(counties, 'Location County', 'id')

# uncomment below if you want to check that the merge was successful
# print(mixbev_joined)

In [53]:
# get just the columns we need and rename county
# THIS is the finished, cleaned mixbev table
mixbev = mixbev_joined.select([
    'Location Name',
    'Location Address',
    'Establishment',
    'Location City',
    'Location State',
    'Location Zip',
    'county',
    'Total Receipts',
    'Obligation End Date'
]).rename(column_names = {
    'Location Name' : 'Name',
    'Location Address' : 'Address',
    'Location City': 'City',
    'Location State': 'State',
    'Location Zip': 'Zip',
    'Total Receipts' : 'Receipts',
    'county': 'County',
    'Obligation End Date': 'Report date'
})

# peek at the column names
print(mixbev)

| column        | data_type |
| ------------- | --------- |
| Name          | Text      |
| Address       | Text      |
| Establishment | Text      |
| City          | Text      |
| State         | Text      |
| Zip           | Text      |
| County        | Text      |
| Receipts      | Number    |
| Report date   | Date      |



## Location sums function

Because we want to get the top sellers in a bunch of cities and counties, we create a function so we don't have to repeat the code. This function allows us to pass in a city or county name to filter the monthly receipts table and then sum the Receipts column for each establishment. The result can then be acted on to print or aggregate. It is used later in the file.

In [54]:
# function to group sales by a specific location
# City or County passed in should be ALL CAPS
# location_type can be 'City' or 'County'

def location_sum(location_type, location):
    # filters the data to the specified city or county
    location_filtered = mixbev.where(lambda row: row[location_type].upper() == location)

    # groups the data based on Establishment and location
    location_grouped = location_filtered.group_by('Establishment').group_by(location_type)
    # computes the sales based on the grouping
    location_summary = location_grouped.aggregate([
        ('Receipts_sum', agate.Sum('Receipts'))
    ])

    # sorts the results by most sold
    location_summary_sorted = location_summary.order_by('Receipts_sum', reverse=True)

    # returns the full sorted table; callers limit and print it
    return location_summary_sorted
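
One thing to watch in the filter: calling `.upper()` on a missing value raises an error, and the comparison assumes the caller already upper-cased the name. A more forgiving predicate might look like the sketch below; `matches_location` is a hypothetical helper, not part of the notebook.

```python
def matches_location(value, location):
    """Case-insensitive comparison that treats missing values as no match."""
    if value is None:
        return False
    return value.strip().upper() == location.strip().upper()


print(matches_location('Travis', 'TRAVIS'))
print(matches_location(None, 'TRAVIS'))
```

Inside `location_sum`, the lambda could then read `lambda row: matches_location(row[location_type], location)`.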

In [55]:
# double-checking I'm looking at one report date
mixbev_dates = mixbev.select('Report date').distinct('Report date')
mixbev_dates.print_table()
print('\nNumber of records in table: {}'.format(
        len(mixbev))
     )

| Report date |
| ----------- |
|  2017-09-30 |

Number of records in table: 15536
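
The same check can be turned into a guard that fails loudly when a file mixes reporting periods. This is a small sketch on plain Python values; `assert_single_report_date` is a hypothetical helper and the dates are illustrative.

```python
import datetime


def assert_single_report_date(dates):
    """Return the only report date, or raise if there is not exactly one."""
    distinct = set(dates)
    if len(distinct) != 1:
        raise ValueError(
            'Expected one report date, found {}'.format(len(distinct))
        )
    return distinct.pop()


# Illustrative values; in the notebook these would come from
# the 'Report date' column
dates = [datetime.date(2017, 9, 30)] * 3
report_date = assert_single_report_date(dates)
print(report_date)
```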


## Top sales statewide

Because we want to group our results by more than one field, we'll do this a little differently. We'll use group_by to create a grouped table, then perform an aggregation on that new table to compute the summed Receipts column.

In [56]:
# summing sales statewide for month

print('Total statewide sales for this month are: {}\n'.format(
    mixbev.aggregate(agate.Sum('Receipts')) # <<< I should format better
))

# groups the data based on Establishment, County and City
mixbev_grouped = mixbev.group_by('Establishment').group_by('County').group_by('City')

# computes the sales based on the grouping
state_summary = mixbev_grouped.aggregate([
    ('Sales_sum', agate.Sum('Receipts'))
])

# sorts the results by most sold. We could probably chain it above if we wanted to.
state_summary_sorted = state_summary.order_by('Sales_sum', reverse=True)

# prints the top 10 results
state_summary_sorted.limit(10).print_table(max_column_width=40)


Total statewide sales for this month are: 572120305

| Establishment                            | County  | City        | Sales_sum |
| ---------------------------------------- | ------- | ----------- | --------- |
| AT&T STADIUM (1 LEGENDS WAY)             | Tarrant | ARLINGTON   | 3,883,207 |
| GAYLORD TEXAN (1501 GAYLORD TRL)         | Tarrant | GRAPEVINE   | 1,739,696 |
| HOSPITALITY INTERNATIONAL, INC. (2380... | Bexar   | SAN ANTONIO | 1,376,321 |
| WLS BEVERAGE CO (110 E 2ND ST)           | Travis  | AUSTIN      | 1,100,292 |
| OMNI DALLAS CONVENTION CENTER (555 S ... | Dallas  | DALLAS      |   983,038 |
| ARAMARK SPORTS & ENTERTAINMENT SERVIC... | Harris  | HOUSTON     |   948,918 |
| METROPLEX SPORTSERVICE, INC. (1000 BA... | Tarrant | ARLINGTON   |   902,437 |
| RYAN SANDERS SPORTS SERVICES, LLC (92... | Travis  | DEL VALLE   |   873,760 |
| HAPPIEST HOUR, LLC (2616 OLIVE ST)       | Dallas  | DALLAS      |   830,658 |
| SALC, INC. (2201 N STEMMONS FWY FL 1)    | Dallas  | D
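
On the formatting note in the cell above: Python's format mini-language can add thousands separators. The aggregated sum should come back as a `Decimal`, so the sketch below formats a `Decimal` holding the statewide total printed above.

```python
from decimal import Decimal

# Statewide total from the output above
total = Decimal('572120305')

# ',' adds thousands separators; '.0f' drops any decimal places
formatted = '{:,.0f}'.format(total)
print('Total statewide sales for this month are: {}'.format(formatted))
```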

## Austin sales and sums

With this, we reference the location_sum function above, and pass the type of location (City) and the name of the city (AUSTIN). At the same time, we limit the result of that function to the first five records, and then print the results. We are basically stringing together a bunch of stuff at once.

In [57]:
# Austin total sales as a city
# This sums the grouped table, but it works

print('Total sales in Austin are: {}\n'.format(
    location_sum('City', 'AUSTIN').aggregate(agate.Sum('Receipts_sum'))
))

# uses the location_sum function to filter
austin = location_sum('City', 'AUSTIN')


# print the resulting table
print('Top sellers in Austin are:\n')
austin.limit(5).print_table(max_column_width=60)

Total sales in Austin are: 64426348

Top sellers in Austin are:

| Establishment                                                | City   | Receipts_sum |
| ------------------------------------------------------------ | ------ | ------------ |
| WLS BEVERAGE CO (110 E 2ND ST)                               | AUSTIN |    1,100,292 |
| ROSE ROOM/ 77 DEGREES (11500 ROCK ROSE AVE)                  | AUSTIN |      585,686 |
| 400 BAR/CUCARACHA/CHUPACABRA/JACKALOPE/MOOSENUCKLE (400 E... | AUSTIN |      503,352 |
| BLIND PIG PUB / PIG PEN (317 E 6TH ST)                       | AUSTIN |      502,440 |
| W HOTEL AUSTIN (200 LAVACA ST)                               | AUSTIN |      441,207 |


## More Central Texas cities

In [71]:
# move these to separate cells so they don't scroll
location_sum('City', 'BASTROP').limit(5).print_table(max_column_width=60)
print('\n')
location_sum('City', 'BEE CAVE').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'BUDA').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'CEDAR PARK').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'DRIPPING SPRINGS').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'GEORGETOWN').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'KYLE').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'LAGO VISTA').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'LAKEWAY').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'LEANDER').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'LIBERTY HILL').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'PFLUGERVILLE').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'ROUND ROCK').limit(5).print_table(max_column_width=60)
print('\n')
location_sum('City', 'SAN MARCOS').limit(5).print_table(max_column_width=60)
print('\n')
location_sum('City', 'SPICEWOOD').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'SUNSET VALLEY').limit(3).print_table(max_column_width=60)
print('\n')
location_sum('City', 'WEST LAKE HILLS').limit(3).print_table(max_column_width=60)


| Establishment                                                | City    | Receipts_sum |
| ------------------------------------------------------------ | ------- | ------------ |
| OLD TOWN RESTURANT AND BAR/PINEY CREEK CHOP HOUSE (931 MA... | BASTROP |       61,145 |
| BACK 9 (834 HIGHWAY 71 W)                                    | BASTROP |       46,248 |
| CHILI'S GRILL & BAR (734 HIGHWAY 71 W)                       | BASTROP |       40,163 |
| NEIGHBOR'S (601 CHESTNUT ST UNIT C)                          | BASTROP |       33,228 |
| LA HACIENDA RESTAURANT (1800 WALNUT ST)                      | BASTROP |       23,257 |


| Establishment                                         | City     | Receipts_sum |
| ----------------------------------------------------- | -------- | ------------ |
| WOODY TAVERN AND GRILL, INC. (12801 SHOPS PKWY # 100) | BEE CAVE |      110,075 |
| MAUDIE'S HILL COUNTRY, LLC (12506 SHOPS PKWY)         | BEE CAVE |       82,482 |
| HCG BEVERAGE, LLC (12525 BEE C
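
The repeated calls in the cell above could also be driven by a list of city names and row limits. Below is a sketch of that loop; `for_each_location` is a hypothetical helper, and `fake_show` stands in for the notebook's `location_sum(...)` chain so the example runs on its own.

```python
# City names and how many rows to show for each (subset for illustration)
CITY_LIMITS = [
    ('BASTROP', 5),
    ('BEE CAVE', 3),
    ('BUDA', 3),
]

shown = []


def for_each_location(locations, show):
    """Call show(name, limit) for each pair, with a blank line between."""
    for name, limit in locations:
        show(name, limit)
        print('\n')


def fake_show(city, limit):
    # Stand-in for:
    # location_sum('City', city).limit(limit).print_table(max_column_width=60)
    shown.append((city, limit))
    print('Top {} sellers in {}'.format(limit, city))


for_each_location(CITY_LIMITS, fake_show)
```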

## Sales by county
In this case, we pass in the location type of 'County' and then a county name in caps to get the top sellers in a particular county.

In [72]:
# MOVE TO SEPARATE CELLS SO THEY DON'T FORCE A SCROLL
location_sum('County', 'BASTROP').limit(5).print_table(max_column_width=80)
print('\n')
location_sum('County', 'CALDWELL').limit(5).print_table(max_column_width=80)
print('\n')
location_sum('County', 'HAYS').limit(5).print_table(max_column_width=80)
print('\n')
location_sum('County', 'TRAVIS').limit(5).print_table(max_column_width=80)
print('\n')
location_sum('County', 'WILLIAMSON').limit(5).print_table(max_column_width=80)


| Establishment                                                   | County  | Receipts_sum |
| --------------------------------------------------------------- | ------- | ------------ |
| LOST PINES BEVERAGE LLC (575 HYATT LOST PINES ROAD)             | Bastrop |      336,342 |
| OLD TOWN RESTURANT AND BAR/PINEY CREEK CHOP HOUSE (931 MAIN ST) | Bastrop |       61,145 |
| BACK 9 (834 HIGHWAY 71 W)                                       | Bastrop |       46,248 |
| CHILI'S GRILL & BAR (734 HIGHWAY 71 W)                          | Bastrop |       40,163 |
| NEIGHBOR'S (601 CHESTNUT ST UNIT C)                             | Bastrop |       33,228 |


| Establishment                                               | County   | Receipts_sum |
| ----------------------------------------------------------- | -------- | ------------ |
| RISKY BUSINESS (211 E MARKET ST)                            | Caldwell |       20,681 |
| GUADALAJARA MEXICAN RESTAURANT (1710 S COLORADO ST STE 110) | Caldwell |   

## Sales by ZIP Code
A list of sales by ZIP Code. If anything other than 78701 is at the top, it is news.

In [70]:
# top zip code gross receipts
zip_receipts = mixbev.pivot('Zip', aggregation=agate.Sum('Receipts')).order_by('Sum', reverse=True)
zip_receipts.limit(5).print_table()

| Zip   |        Sum |
| ----- | ---------- |
| 78701 | 25,150,987 |
| 75201 | 13,006,593 |
| 78205 |  9,100,329 |
| 77002 |  7,677,477 |
| 76011 |  7,500,736 |
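
The "is it news?" test can be written down explicitly. The sketch below uses the five totals from the output above as plain tuples; `EXPECTED_TOP_ZIP`, `top_zip` and `top_sum` are names introduced for illustration.

```python
# (ZIP, summed receipts) pairs copied from the output above,
# deliberately listed out of order
zip_sums = [
    ('75201', 13006593),
    ('78701', 25150987),
    ('78205', 9100329),
    ('77002', 7677477),
    ('76011', 7500736),
]

EXPECTED_TOP_ZIP = '78701'

# Pick the ZIP with the largest total
top_zip, top_sum = max(zip_sums, key=lambda pair: pair[1])

if top_zip != EXPECTED_TOP_ZIP:
    print('News: {} leads with {:,}'.format(top_zip, top_sum))
else:
    print('{} is on top as usual with {:,}'.format(top_zip, top_sum))
```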
