# Getting started

Once you've chosen your scenario, download the data from [the Iowa website](https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy) in CSV format. Start by loading the data with pandas. The date column is read in as plain text, so you may need to parse it into a datetime type.
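A minimal sketch of the loading step, assuming a tiny in-memory CSV in place of the real download (the two rows are trimmed from the sample shown below, and `sample` is a hypothetical name). It reads the file and then parses the `MM/DD/YYYY` dates explicitly.

```python
import io
import pandas as pd

# Tiny stand-in for the downloaded file; the real CSV has many more rows and columns.
sample_csv = io.StringIO(
    "Date,Store Number,City,Bottles Sold,Sale (Dollars)\n"
    "11/04/2015,3717,SUMNER,12,$81.00\n"
    "03/02/2016,2614,DAVENPORT,2,$41.26\n"
)

# Read the raw text first, then convert the date column with an explicit format
# so that 11/04/2015 is read as November 4 and not April 11.
sample = pd.read_csv(sample_csv)
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")

print(sample.dtypes)
```

Passing an explicit `format` is a design choice: it avoids ambiguous day/month guesses and is usually faster on large files than letting pandas infer the format.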

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

In [2]:
# Load the CSV into a DataFrame named Iowa
Iowa = pd.read_csv('/Users/macbook/GA-DSI/projects/projects-weekly/project-03/Iowa_Liquor_sales_sample_10pct.csv')

In [3]:
# Inspect the first few rows
Iowa.head()


Unnamed: 0,Date,Store Number,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
0,11/04/2015,3717,SUMNER,50674,9.0,Bremer,1051100.0,APRICOT BRANDIES,55,54436,Mr. Boston Apricot Brandy,750,$4.50,$6.75,12,$81.00,9.0,2.38
1,03/02/2016,2614,DAVENPORT,52807,82.0,Scott,1011100.0,BLENDED WHISKIES,395,27605,Tin Cup,750,$13.75,$20.63,2,$41.26,1.5,0.4
2,02/11/2016,2106,CEDAR FALLS,50613,7.0,Black Hawk,1011200.0,STRAIGHT BOURBON WHISKIES,65,19067,Jim Beam,1000,$12.59,$18.89,24,$453.36,24.0,6.34
3,02/03/2016,2501,AMES,50010,85.0,Story,1071100.0,AMERICAN COCKTAILS,395,59154,1800 Ultimate Margarita,1750,$9.50,$14.25,6,$85.50,10.5,2.77
4,08/18/2015,3654,BELMOND,50421,99.0,Wright,1031080.0,VODKA 80 PROOF,297,35918,Five O'clock Vodka,1750,$7.20,$10.80,12,$129.60,21.0,5.55


In [4]:
# Drop every row that contains at least one NaN value
Iowa = Iowa.dropna()

In [5]:
Iowa.dtypes

Date                      object
Store Number               int64
City                      object
Zip Code                  object
County Number            float64
County                    object
Category                 float64
Category Name             object
Vendor Number              int64
Item Number                int64
Item Description          object
Bottle Volume (ml)         int64
State Bottle Cost         object
State Bottle Retail       object
Bottles Sold               int64
Sale (Dollars)            object
Volume Sold (Liters)     float64
Volume Sold (Gallons)    float64
dtype: object

In [6]:
# Parse the MM/DD/YYYY strings into datetimes
Iowa['Date'] = pd.to_datetime(Iowa['Date'], format='%m/%d/%Y')

In [7]:
# 'Date' was converted to datetime in the previous cell, so no extra mapping is needed; check the dtype
Iowa['Date'].dtype

In [None]:
# Coerce zip codes to numbers (non-numeric entries become NaN),
# then store them as nullable integers so the NaNs do not break the cast
Iowa['Zip Code'] = pd.to_numeric(Iowa['Zip Code'], errors='coerce').astype('Int64')

In [None]:
# Change County Number to int
Iowa['County Number'] = Iowa['County Number'].astype(int)

In [None]:
Iowa.head()

In [None]:
# Checking for NaNs
Iowa.isnull().sum()

In [None]:
# Clean up column names 1: remove units of measurement
Iowa.rename(columns={'Bottle Volume (ml)': 'Bottle Volume',
                     'Sale (Dollars)': 'Sales',
                     'Volume Sold (Liters)': 'Volume Sold',
                     'Volume Sold (Gallons)': 'Volume Sold Gallons'}, inplace=True)

# Clean up column names 2: convert all to lowercase letters
Iowa.rename(columns=lambda x: x.lower(), inplace=True)

# Clean up column names 3: replace ' ' with '_' and strip '(' and ')'
Iowa.rename(columns=lambda x: x.replace(' ', '_').replace('(', '').replace(')', ''), inplace=True)

In [None]:
Iowa.head()

In [None]:
# Change the types of all dollar columns: strip the literal '$', then convert to float.
# regex=False matters: as a regular expression, '$' matches the end of the string, not the dollar sign.
for col in ['state_bottle_retail', 'state_bottle_cost', 'sales']:
    Iowa[col] = Iowa[col].str.replace('$', '', regex=False).astype(float)

In [None]:

# Group each code column with its human-readable name (category, item, county)
category_df = Iowa.groupby('category')[['category', 'category_name']]
desc_df = Iowa.groupby('item_number')[['item_number', 'item_description']]
county_df = Iowa.groupby('county_number')[['county_number', 'county']]

In [None]:
category_df.head(3)

In [None]:
# Generalize categories

#vodka
#schnapps
#whiskey
#rum
#scotch
#gin
#liqueurs
#brandies
#tequila
#beer
#other
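
One possible way to fill in the generalization listed above is keyword matching on `category_name`. This is only a sketch: the keyword list `GENERAL_TYPES` and the helper `generalize` are hypothetical names, and the keywords are assumptions that would need checking against the full set of category names.

```python
import pandas as pd

# Hypothetical keyword -> general type pairs; order matters because the first match wins
# (SCOTCH is listed before WHISK so that "SCOTCH WHISKIES" maps to scotch).
GENERAL_TYPES = [
    ("VODKA", "vodka"),
    ("SCHNAPPS", "schnapps"),
    ("SCOTCH", "scotch"),
    ("WHISK", "whiskey"),    # matches WHISKY, WHISKEY and WHISKIES
    ("BOURBON", "whiskey"),
    ("RUM", "rum"),
    ("GIN", "gin"),
    ("LIQUEUR", "liqueurs"),
    ("BRAND", "brandies"),   # matches BRANDY and BRANDIES
    ("TEQUILA", "tequila"),
    ("BEER", "beer"),
]

def generalize(category_name):
    """Return the first general type whose keyword appears in the name, else 'other'."""
    name = str(category_name).upper()
    for keyword, general in GENERAL_TYPES:
        if keyword in name:
            return general
    return "other"

# Category names taken from the sample rows shown earlier
names = pd.Series(["APRICOT BRANDIES", "BLENDED WHISKIES", "STRAIGHT BOURBON WHISKIES",
                   "AMERICAN COCKTAILS", "VODKA 80 PROOF"])
general = names.map(generalize)
print(general.tolist())
```

Plain substring matching is crude: a short keyword such as `GIN` or `RUM` can match inside an unrelated word, so the resulting groups should be inspected before they are used in a model.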

In [None]:
# Create a Year column.

Iowa['year'] = Iowa['date'].dt.year

In [None]:
# Create profit per bottle and total profit per transaction
Iowa['profit_per_bottle'] = Iowa['state_bottle_retail'] - Iowa['state_bottle_cost']
Iowa['total_profit'] = Iowa['profit_per_bottle'] * Iowa['bottles_sold']
Iowa.head()

In [None]:
# Total profit per county per category
profit_county_per_type = Iowa.groupby(['county', 'category_name'], as_index=False)['total_profit'].sum()
profit_county_per_type.head(5)

In [None]:
# Total profit per store per year, sorted by city, store, and year
profit_store_per_type = Iowa.groupby(['store_number', 'city', 'year'], as_index=False)['total_profit'].sum()
profit_store_per_type.sort_values(['city', 'store_number', 'year'])


In [None]:
# Total profit per store per year
store_profit = pd.pivot_table(Iowa, values='total_profit', index=['store_number', 'year'], aggfunc='sum')
store_profit.reset_index(inplace=True)
store_profit.head()

# Explore the data

Perform some exploratory statistical analysis and make some plots, such as histograms of transaction totals, bottles sold, etc.
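
As a rough sketch of the kind of plot meant here, assuming the cleaned columns are called `sales` and `bottles_sold`: the `toy` frame is a hypothetical stand-in built from the five sample rows shown earlier, not the real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook; drop this line when using %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the cleaned transactions
toy = pd.DataFrame({"sales": [81.00, 41.26, 453.36, 85.50, 129.60],
                    "bottles_sold": [12, 2, 24, 6, 12]})

# One histogram per variable, side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(toy["sales"], bins=5)
axes[0].set_xlabel("sale (dollars)")
axes[0].set_ylabel("number of transactions")
axes[1].hist(toy["bottles_sold"], bins=5)
axes[1].set_xlabel("bottles sold")

# Summary statistics to read alongside the plots
summary = toy.describe()
print(summary)
```

On the full dataset the distributions are likely to be strongly right-skewed, so a log-scaled axis or a cap on the plotted range may make the histograms easier to read.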

In [None]:

# Select the 2015 transactions
store_sales_2015 = Iowa[Iowa['year'] == 2015]

In [None]:
# Mean total profit for each city, year, and store (pivot_table averages by default)
Iowa_cities = pd.pivot_table(Iowa, index=['city', 'year', 'store_number'], values=['total_profit'])


In [None]:
Iowa_cities

In [None]:
# Compute sales per county and year
county_year = Iowa.groupby(by=['county', 'year'], as_index=False)
# Compute sums and means
county_sales = county_year.agg({'sales': ['sum', 'mean'],
                                'volume_sold': ['sum', 'mean'],
                                'total_profit': ['sum', 'mean']})

In [None]:
# Compute the sum and mean of sales, volume, and profit per store
city_sales = Iowa.groupby(by=['store_number', 'city'], as_index=False)

store_sales = city_sales.agg({'sales': ['sum', 'mean'],
                              'volume_sold': ['sum', 'mean'],
                              'total_profit': ['sum', 'mean']})
store_sales.head(3)

In [None]:

sns.pairplot(store_sales, kind="reg");

In [None]:
col_list = store_sales.columns

In [None]:
store_sales.loc[:, col_list].corr()

In [None]:
store_sales.columns


## Record your findings

Be sure to write out any observations from your exploratory analysis.

# Mine the data
Now you are ready to compute the variables you will use for your regression from the data. For example, you may want to compute total sales per store from January to March of 2015, mean price per bottle, etc. Refer to the readme for more ideas appropriate to your scenario.

Pandas is your friend for this task. Take a look at the operations [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for ideas on how to make the best use of pandas and feel free to search for blog and Stack Overflow posts to help you group data by certain variables and compute sums, means, etc. You may find it useful to create a new data frame to house this summary data.
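
A minimal sketch of one such summary frame, assuming cleaned columns named `date`, `store_number`, `sales`, and `bottles_sold`. The `toy` data and the output column names (`q1_sales`, `q1_bottles`, `mean_price_per_bottle`) are made up for illustration.

```python
import pandas as pd

# Hypothetical transactions for two stores
toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-05", "2015-02-10", "2015-03-20",
                            "2015-06-01", "2015-01-15"]),
    "store_number": [2106, 2106, 3717, 3717, 3717],
    "sales": [100.0, 50.0, 30.0, 999.0, 20.0],
    "bottles_sold": [10, 5, 3, 90, 1],
})

# Keep only January to March 2015
q1 = toy[(toy["date"] >= "2015-01-01") & (toy["date"] < "2015-04-01")]

# One row per store: total sales and total bottles in the first quarter
per_store = q1.groupby("store_number").agg(
    q1_sales=("sales", "sum"),
    q1_bottles=("bottles_sold", "sum"),
)

# Mean price per bottle = dollars taken in / bottles sold
per_store["mean_price_per_bottle"] = per_store["q1_sales"] / per_store["q1_bottles"]
print(per_store)
```

Dividing the summed sales by the summed bottles weights every bottle equally; averaging the per-transaction prices instead would weight every transaction equally, which is a different quantity.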

# Refine the data
Look for any statistical relationships, correlations, or other relevant properties of the dataset.
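
One quick check is a pairwise correlation matrix of the per-store summary variables. The sketch below uses made-up numbers, and the column names (`q1_sales`, `annual_sales`, `mean_price_per_bottle`) are assumptions rather than names defined in this notebook.

```python
import pandas as pd

# Hypothetical per-store summary: first-quarter sales, full-year sales, mean bottle price
toy = pd.DataFrame({
    "q1_sales": [100.0, 200.0, 300.0, 400.0],
    "annual_sales": [410.0, 790.0, 1220.0, 1580.0],
    "mean_price_per_bottle": [12.0, 9.0, 15.0, 11.0],
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr()
print(corr.round(2))
```

A correlation close to 1 between a candidate predictor and the target suggests a linear model may fit well, but it says nothing about outliers or non-linear structure, so it is worth pairing with a scatter plot.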

# Build your models

Using scikit-learn or statsmodels, build the necessary models for your scenario. Evaluate model fit.

In [None]:
from sklearn import linear_model
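
A minimal sketch of fitting and scoring a linear model with scikit-learn, assuming a single predictor (first-quarter sales) and a target (annual sales). The numbers are invented, and the fit is scored on the same data it was trained on, which overstates how well it would generalize.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical per-store data: first-quarter sales (predictor) and annual sales (target)
X = np.array([[100.0], [200.0], [300.0], [400.0], [500.0]])
y = np.array([410.0, 790.0, 1220.0, 1580.0, 2010.0])

# Ordinary least squares fit
model = LinearRegression().fit(X, y)
predictions = model.predict(X)

# R^2 on the training data; a held-out split gives a more honest estimate
r2 = r2_score(y, predictions)
print("slope:", model.coef_[0], "intercept:", model.intercept_, "R^2:", r2)
```

For a fairer evaluation, `sklearn.model_selection.train_test_split` or cross-validation can be used to score the model on stores it has not seen.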


## Plot your results

Again, make sure that you record any valuable information. For example, in the tax scenario, did you find the sales from the first three months of the year to be a good predictor of the total sales for the year? Plot the predictions versus the true values and discuss the successes and limitations of your models.
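
A sketch of a predicted-versus-true plot, assuming the true values and model predictions are available as arrays. `y_true` and `y_pred` are hypothetical names holding invented numbers.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook; drop this line when using %matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical true annual sales and model predictions
y_true = np.array([410.0, 790.0, 1220.0, 1580.0, 2010.0])
y_pred = np.array([400.0, 800.0, 1200.0, 1600.0, 2000.0])

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_true, y_pred)

# Dashed diagonal marks perfect predictions; points far from it are the model's misses
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax.plot(lims, lims, linestyle="--", color="gray")
ax.set_xlabel("true annual sales")
ax.set_ylabel("predicted annual sales")

# Residuals give the same information numerically
residuals = y_true - y_pred
print(residuals)
```

Points above the diagonal are over-predictions and points below it are under-predictions; a systematic drift away from the line at high values would suggest the linear model is missing something for the largest stores.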

# Present the results

Present your conclusions and results. If you have more than one interesting model feel free to include more than one along with a discussion. Use your work in this notebook to prepare your write-up.