# Mixed beverage data problems

This Jupyter Notebook is part of a series that outlines problems with [Mixed Beverage Gross Receipts](https://comptroller.texas.gov/taxes/mixed-beverage/receipts.php) files from the Texas Comptroller's [data center](https://comptroller.texas.gov/transparency/open-data/search-datasets/). It uses a Python library called [agate](http://agate.readthedocs.io/) to clean and process that data.

The `process_mixbev` function does more than it needs to in this case, since we only need to pivot by Report Period, but the extra cleaning does not hurt or change the result.

This is a screenshot of the dataset site's source code from May 27, showing which file is connected to which title for the 2017 files:

![2017_data](../data_problem/Screen_Shot_2017-05-27_at_9.40.44 PM.png)

In [1]:
# imports the libraries we will use
import agate
from decimal import Decimal
import re

# this suppresses the timezone warning
# Might comment out during development so other warnings
# are not suppressed
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Sets the tax rate used to convert Report Tax to Gross Receipts
# The rate has been 6.7 percent since January 1, 2014
tax_rate = Decimal('6.7')

# sets the column names of the original data set.
column_names = [
    'TABC Permit Number',
    'Trade Name',
    'Location Address',
    'Location City',
    'Location State',
    'Location Zip Code',
    'Location County Code',
    'Blank',
    'Report Period',
    'Report Tax'
]
# Helps us import some text fields that may be considered numbers in error.
specified_types = {
    'Location Zip Code': agate.Text(),
    'Location County Code': agate.Text()
}

### MIXBEV CLEAN FUNCTION
# This function cleans a raw mixbev file in a number of ways.
# Details are outlined in comments below
###
def process_mixbev(title, path):

    # splits the path to get the file name
    file_name = path.rsplit('/', 1)[-1]

    # creates file location to downloaded data
    file_location = '../data_problem/' + file_name
    
    # this imports the file specified above, along with the proper types
    mixbev_raw = agate.Table.from_csv(file_location, column_names, encoding='iso-8859-1', column_types=specified_types)

    # mixbev_trim creates a new interim table with results of compute function
    # that takes the four columns that need trimming and strips them of white space,
    # adding them to the end of the table with new names.
    # The last computation does the math to create the Gross Receipts
    # column based on the tax_rate variable set above

    mixbev_trim = mixbev_raw.compute([
        ('Permit', agate.Formula(agate.Text(), lambda r: r['TABC Permit Number'].strip())),
        ('Name', agate.Formula(agate.Text(), lambda r: r['Trade Name'].strip())),
        ('Address', agate.Formula(agate.Text(), lambda r: r['Location Address'].strip())),
        ('City', agate.Formula(agate.Text(), lambda r: r['Location City'].strip())),
        ('Receipts_compute', agate.Formula(agate.Number(), lambda r: (r['Report Tax'] / tax_rate) * 100))
    ])

    # the Receipts_compute computation above returns as a decimal number,
    # so this function rounds those numbers.
    def round_receipt(row):
        return row['Receipts_compute'].quantize(Decimal('0.01'))

    # This compute method uses round_receipt function above,
    # putting the results into a new table.
    mixbev_round = mixbev_trim.compute([
        ('Receipts', agate.Formula(agate.Number(), round_receipt))
    ])

    # creates new table, selecting just the columns we need
    # then renames some of them for ease later.
    mixbev_cleaned = mixbev_round.select([
        'Permit',
        'Name',
        'Address',
        'City',
        'Location State',
        'Location Zip Code',
        'Location County Code',
        'Report Period',
        'Report Tax',
        'Receipts'
    ]).rename(column_names = {
        'Location State': 'State',
        'Location Zip Code': 'Zip',
        'Location County Code': 'CountyCode',
        'Report Period': 'Period',
        'Report Tax': 'Tax'
    })

    # Concatenates the name and address for a new column, Establishment
    # This is so we can find individual locations that might have the same
    # name but different addresses
    mixbev_cleaned_est = mixbev_cleaned.compute([
        ('Establishment', agate.Formula(agate.Text(), lambda row: '%(Name)s %(Address)s' % row))
    ])

    # imports counties.csv, ensuring that the 'code' column is text
    counties = agate.Table.from_csv('../resource-files/counties.csv', column_types={'code': agate.Text()})

    # joins the counties table to the mixed bev cleaned data with establishments
    mixbev_joined = mixbev_cleaned_est.join(counties, 'CountyCode', 'code')

    # get just the columns we need and rename county
    # THIS is the finished, cleaned mixbev table
    mixbev = mixbev_joined.select([
        'Permit',
        'Name',
        'Address',
        'Establishment',
        'City',
        'State',
        'Zip',
        'county',
        'Period',
        'Tax',
        'Receipts'
    ]).rename(column_names = {
        'county': 'County'
    })

    # Pivot the mixbev table by Period. By default this gives a Count of the records.
    # We then order the table by Count in descending order
    mixbev_by_period = mixbev.pivot('Period').order_by('Count', reverse=True)

    
    print('The entry titled:\n {}\n\nThe file name is:\n {}\n'.format(
            title,
            file_name
        ))

    print('The count of records by Report Period is:\n')
    
    # prints the table of period and number of records
    mixbev_by_period.limit(5).print_table(max_rows=None)
    
    return(mixbev)
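
The receipts arithmetic inside `process_mixbev` can be checked on its own. The sketch below repeats the same steps (divide the tax by the rate, multiply by 100, then round to cents) using only the `decimal` module. The helper name `tax_to_receipts` and the sample tax amounts are made up for illustration and do not come from the data.

```python
from decimal import Decimal

# Same rate the notebook sets above: 6.7 percent.
tax_rate = Decimal('6.7')

def tax_to_receipts(report_tax):
    """Hypothetical helper: convert a Report Tax amount to gross receipts."""
    receipts = (Decimal(report_tax) / tax_rate) * 100
    # Round to cents, as round_receipt does inside process_mixbev.
    return receipts.quantize(Decimal('0.01'))

# Illustrative values only: at 6.7 percent, $670.00 in tax
# corresponds to $10,000.00 in gross receipts.
print(tax_to_receipts('670.00'))
print(tax_to_receipts('1.00'))
```

Working in `Decimal` rather than floats keeps the rounding to cents predictable, which is presumably why the notebook defines `tax_rate` as a `Decimal`.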

### Looking at dates of the records

The pivot confirms that each file contains multiple report periods and shows whether we are looking at the right month of data. Typically a data set holds mostly reports from the previous month, but there are always some submissions from other months. We want to filter out those other months, which we do based on the `month_studied` variable set near the top of the file, which should match the period at the top of the table below.
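
As a rough sketch of that check and filter, the snippet below counts records per period and keeps only the rows for one month. It uses plain Python rather than agate so that it runs on its own. The sample rows are invented, and the `month_studied` value is an assumption, since this notebook does not set it.

```python
from collections import Counter

# Invented sample rows standing in for the cleaned mixbev table.
rows = [
    {'Name': 'BAR A', 'Period': '2016/11'},
    {'Name': 'BAR B', 'Period': '2016/11'},
    {'Name': 'BAR C', 'Period': '2016/10'},
    {'Name': 'BAR D', 'Period': '2016/11'},
]

# Assumed value; it should match the most common period in the file.
month_studied = '2016/11'

# Count records per period, most common first (similar to the pivot
# and descending sort at the end of process_mixbev).
period_counts = Counter(row['Period'] for row in rows)
top_period, top_count = period_counts.most_common(1)[0]

# Keep only the records reported for the month being studied.
rows_for_month = [row for row in rows if row['Period'] == month_studied]

print(top_period, top_count)
print(len(rows_for_month))
```

With an agate table, the same filter could likely be written as `mixbev.where(lambda row: row['Period'] == month_studied)`, assuming the `Period` column is stored as text.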


### Investigate mixbev files

I downloaded the files from Texas Transparency, but these values are saved so we can explain and print them:

- `mon_year_file_path` is the URL to the file on the comptroller's website
- `mon_year_downloaded` is the file location once downloaded
- `mon_year_file_name` is the file name as it existed on the comptroller's website
- `mon_year_title` is the title used for that file on the agency's website
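
To illustrate how these values relate, here is a minimal sketch that derives the file name and download location from a URL, using the same split and folder as `process_mixbev`. The helper name `describe_file` is hypothetical. The `MIXEDBEV_<month>_<year>.CSV` naming pattern is an assumption inferred from the URLs used in this notebook.

```python
import re

def describe_file(path):
    """Hypothetical helper: break a comptroller URL into its parts."""
    # Same split process_mixbev uses to get the file name.
    file_name = path.rsplit('/', 1)[-1]
    # Same local folder process_mixbev reads from.
    downloaded = '../data_problem/' + file_name
    # Assumed naming pattern: MIXEDBEV_<month>_<year>.CSV
    match = re.match(r'MIXEDBEV_(\d{2})_(\d{4})\.CSV$', file_name)
    month = int(match.group(1)) if match else None
    year = int(match.group(2)) if match else None
    return {
        'file_name': file_name,
        'downloaded': downloaded,
        'month': month,
        'year': year,
    }

dec_2016_file_path = 'https://comptroller.texas.gov/auto-data/odc/MIXEDBEV_12_2016.CSV'
info = describe_file(dec_2016_file_path)
print(info['file_name'])
print(info['downloaded'])
print(info['month'], info['year'])
```

The month and year parsed from the file name can then be read side by side with the title the site assigns to that file.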


In [None]:
### Mixed Beverage Tax Receipts - DEC 2016
dec_2016_title = 'Mixed Beverage Tax Receipts - DEC 2016'
dec_2016_file_path = 'https://comptroller.texas.gov/auto-data/odc/MIXEDBEV_12_2016.CSV'

dec_mixbev = process_mixbev(dec_2016_title, dec_2016_file_path)

The entry titled:
 Mixed Beverage Tax Receipts - DEC 2016

The file name is:
 MIXEDBEV_12_2016.CSV

The count of records by Report Period is:

| Period  |  Count |
| ------- | ------ |
| 2016/11 | 14,224 |
| 2016/10 |  1,624 |
| 2016/09 |    158 |
| 2016/08 |     48 |
| 2016/12 |     41 |


In [None]:
### Mixed Beverage Tax Receipts - JAN 2017
jan_2017_title = 'Mixed Beverage Tax Receipts - JAN 2017'
jan_2017_file_path = 'https://comptroller.texas.gov/auto-data/odc/MIXEDBEV_03_2017.CSV'

jan_2017_mixbev = process_mixbev(jan_2017_title, jan_2017_file_path)

In [None]:
### Mixed Beverage Tax Receipts - FEB 2017
feb_2017_title = 'Mixed Beverage Tax Receipts - FEB 2017'
feb_2017_file_path = 'https://comptroller.texas.gov/auto-data/odc/MIXEDBEV_04_2017.CSV'

feb_2017_mixbev = process_mixbev(feb_2017_title, feb_2017_file_path)

In [None]:
### Mixed Beverage Tax Receipts - MAR 2017
mar_2017_title = 'Mixed Beverage Tax Receipts - MAR 2017'
mar_2017_file_path = 'https://comptroller.texas.gov/auto-data/odc/MIXEDBEV_05_2017.CSV'

mar_2017_mixbev = process_mixbev(mar_2017_title, mar_2017_file_path)