# Getting started

Once you've chosen your scenario, download the data from [the Iowa website](https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy) in CSV format. Start by loading the data with pandas. You may need to parse the date column explicitly, since pandas reads it in as text by default.
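
As a minimal sketch of the loading step, the snippet below reads a two-row sample and parses its date and dollar columns. The inline sample and the names `sample_csv` and `sample` are assumptions for illustration; the values are copied from the first rows of the real file, and you would pass the path to your download instead.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as the Iowa export
sample_csv = io.StringIO(
    "Date,Store Number,Sale (Dollars)\n"
    "11/04/2015,3717,$81.00\n"
    "03/02/2016,2614,$41.26\n"
)

sample = pd.read_csv(sample_csv)

# Dates arrive as strings; an explicit format avoids day/month ambiguity
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")

# Dollar amounts arrive as strings with a leading "$"
sample["Sale (Dollars)"] = sample["Sale (Dollars)"].str.lstrip("$").astype(float)

print(sample.dtypes)
```

Checking `dtypes` right after loading is a cheap way to confirm that the conversions took effect before any aggregation depends on them.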

In [111]:
!pwd

/Users/michaelmainzer/Documents/GA/DSI/mike1/01-projects/notebooks


In [112]:
%matplotlib inline
import datetime
import numpy as np
import pandas as pd

In [113]:
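def read_csv(path):
    """Load the Iowa liquor sales sample from a comma-separated file."""
    return pd.read_csv(path, sep=',')

# Absolute path to the sample file; adjust it for your machine
path = "/Users/michaelmainzer/Documents/GA/DSI/mike1/01-projects/assets/03-project3-assets/Iowa_Liquor_sales_sample.csv"
df = read_csv(path)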

In [114]:
df.head()

Unnamed: 0,Date,Store Number,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
0,11/04/2015,3717,SUMNER,50674,9.0,Bremer,1051100.0,APRICOT BRANDIES,55,54436,Mr. Boston Apricot Brandy,750,$4.50,$6.75,12,$81.00,9.0,2.38
1,03/02/2016,2614,DAVENPORT,52807,82.0,Scott,1011100.0,BLENDED WHISKIES,395,27605,Tin Cup,750,$13.75,$20.63,2,$41.26,1.5,0.4
2,02/11/2016,2106,CEDAR FALLS,50613,7.0,Black Hawk,1011200.0,STRAIGHT BOURBON WHISKIES,65,19067,Jim Beam,1000,$12.59,$18.89,24,$453.36,24.0,6.34
3,02/03/2016,2501,AMES,50010,85.0,Story,1071100.0,AMERICAN COCKTAILS,395,59154,1800 Ultimate Margarita,1750,$9.50,$14.25,6,$85.50,10.5,2.77
4,08/18/2015,3654,BELMOND,50421,99.0,Wright,1031080.0,VODKA 80 PROOF,297,35918,Five O'clock Vodka,1750,$7.20,$10.80,12,$129.60,21.0,5.55


In [115]:
df.dtypes

Date                      object
Store Number               int64
City                      object
Zip Code                  object
County Number            float64
County                    object
Category                 float64
Category Name             object
Vendor Number              int64
Item Number                int64
Item Description          object
Bottle Volume (ml)         int64
State Bottle Cost         object
State Bottle Retail       object
Bottles Sold               int64
Sale (Dollars)            object
Volume Sold (Liters)     float64
Volume Sold (Gallons)    float64
dtype: object

In [116]:
df.count()

Date                     270955
Store Number             270955
City                     270955
Zip Code                 270955
County Number            269878
County                   269878
Category                 270887
Category Name            270323
Vendor Number            270955
Item Number              270955
Item Description         270955
Bottle Volume (ml)       270955
State Bottle Cost        270955
State Bottle Retail      270955
Bottles Sold             270955
Sale (Dollars)           270955
Volume Sold (Liters)     270955
Volume Sold (Gallons)    270955
dtype: int64

In [117]:
df.shape

(270955, 18)

In [118]:
# Remove redundant columns
del df['Volume Sold (Gallons)']

In [119]:
# Remove $ from the dollar columns and convert them from strings to floats
dollar_cols = ['State Bottle Cost', 'State Bottle Retail', 'Sale (Dollars)']
for col in dollar_cols:
    df[col] = df[col].str.lstrip('$').astype(float)

# Convert dates
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Drop or replace bad values

# Add Margin column and Price per liter column
df['Margin'] = df['State Bottle Retail'] - df['State Bottle Cost']
df['Price per Liter'] = (df['State Bottle Retail'] / df['Bottle Volume (ml)'])*1000

In [120]:
# Sales per store, 2015. Note: the date filter below keeps every store with at
# least one 2015 sale; it does not check that a store was open for all of 2015.

# Filter by our start and end dates
df.sort_values(by=["Store Number", "Date"], inplace=True)
start_date = pd.Timestamp("20150101")
end_date = pd.Timestamp("20151231")
mask = (df['Date'] >= start_date) & (df['Date'] <= end_date)
sales2015 = df[mask]

# Group by store number
sales2015 = sales2015.groupby(by=["Store Number"], as_index=False)
# Compute sums, means (string names avoid the deprecation warning for np.sum/np.mean)
sales2015 = sales2015.agg({"Sale (Dollars)": ["sum", "mean"],
                   "Volume Sold (Liters)": ["sum", "mean"],
                   "Margin": "mean",
                   "Price per Liter": "mean",
                   "Zip Code": lambda x: x.iloc[0], # just extract once, should be the same
                   "City": lambda x: x.iloc[0],
                   "County Number": lambda x: x.iloc[0]})
# Collapse the column indices
sales2015.columns = [' '.join(col).strip() for col in sales2015.columns.values]

In [121]:
# Quick check
#pd.options.display.max_rows = 999
#sales2015

In [122]:
#sales2015.dtypes

In [123]:
#sales2015.count()

In [124]:
#sales2015.shape

In [134]:
# Build a per-store summary of sales made in the first quarter of 2015
df.sort_values(by=["Store Number", "Date"], inplace=True)
start_date2 = pd.Timestamp("20150101")
end_date2 = pd.Timestamp("20150331")
mask2 = (df['Date'] >= start_date2) & (df['Date'] <= end_date2)
sales2015Q1 = df[mask2]

# Group by store number
sales2015Q1 = sales2015Q1.groupby(by=["Store Number"], as_index=False)

# Compute sums, means
sales2015Q1 = sales2015Q1.agg({"Sale (Dollars)": ["sum", "mean"],
                   "Volume Sold (Liters)": ["sum", "mean"],
                   "Margin": "mean",
                   "Price per Liter": "mean",
                   "Zip Code": lambda x: x.iloc[0], # just extract once, should be the same
                   "City": lambda x: x.iloc[0],
                   "County Number": lambda x: x.iloc[0]})

sales2015Q1.columns = [' '.join(col).strip() for col in sales2015Q1.columns.values]
pd.options.display.max_rows = 999
sales2015Q1

Unnamed: 0,Store Number,City <lambda>,Sale (Dollars) sum,Sale (Dollars) mean,County Number <lambda>,Price per Liter mean,Zip Code <lambda>,Volume Sold (Liters) sum,Volume Sold (Liters) mean,Margin mean
0,2106,CEDAR FALLS,39287.29,304.552636,7.0,17.846608,50613,2526.10,19.582171,5.033721
1,2113,GOWRIE,2833.25,67.458333,94.0,19.358141,50543,177.11,4.216905,5.275000
2,2130,WATERLOO,24272.57,278.995057,7.0,17.565430,50703,1447.25,16.635057,5.140920
3,2152,ROCKWELL,2003.46,62.608125,17.0,13.991012,50469,151.74,4.741875,4.836875
4,2178,WAUKON,5856.41,122.008542,3.0,16.724712,52172,409.81,8.537708,4.932083
5,2190,DES MOINES,29452.92,84.878732,77.0,21.929651,50314,1666.58,4.802824,5.468040
6,2191,KEOKUK,29085.57,192.619669,56.0,18.592154,52632,1957.28,12.962119,5.827550
7,2200,SAC CITY,4900.43,58.338452,81.0,16.790782,50583,367.72,4.377619,5.879048
8,2205,CLARINDA,6407.74,91.539143,73.0,20.226282,51632,375.38,5.362571,5.324000
9,2228,WINTERSET,5193.97,86.566167,61.0,17.100714,50273,405.62,6.760333,4.884500


In [126]:
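# Attach the Q1 columns to the 2015 totals, matching rows on Store Number.
# Direct column assignment aligns on the row index and misplaces values
# whenever a store has 2015 sales but no Q1 sales.
q1_cols = sales2015Q1[['Store Number', 'Sale (Dollars) sum', 'Sale (Dollars) mean']]
q1_cols = q1_cols.rename(columns={'Sale (Dollars) sum': 'Q1 Total Sales',
                                  'Sale (Dollars) mean': 'Q1 Average Sale'})
sales2015 = sales2015.merge(q1_cols, on='Store Number', how='left')

# Rename by column name so the labels do not depend on column order
sales2015 = sales2015.rename(columns={
    'City <lambda>': 'City', 'Zip Code <lambda>': 'Zip Code',
    'County Number <lambda>': 'County Number',
    'Sale (Dollars) sum': 'Total Sales', 'Sale (Dollars) mean': 'Average Sale',
    'Price per Liter mean': 'Average Price per Liter',
    'Volume Sold (Liters) sum': 'Total Volume Sold (Liters)',
    'Volume Sold (Liters) mean': 'Average Volume Sold (Liters)',
    'Margin mean': 'Average Margin'})
sales15 = sales2015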

In [133]:
pd.options.display.max_rows = 999
sales15

Unnamed: 0,Store Number,City,Total Sales,Average Sale,County Number,Average Price per Liter,Zip Code,Total Volume Sold (Liters),Average Volume Sold (Liters),Average Margin,Q1 Total Sales,Q1 Average Sale
0,2106,CEDAR FALLS,146326.22,277.658861,7.0,17.856601,50613,9731.85,18.466509,5.166319,39287.29,304.552636
1,2113,GOWRIE,9310.22,63.334830,94.0,18.504292,50543,659.85,4.488776,5.445102,2833.25,67.458333
2,2130,WATERLOO,111871.43,285.386301,7.0,16.835669,50703,6891.37,17.580026,4.925842,24272.57,278.995057
3,2152,ROCKWELL,7721.08,54.759433,17.0,13.020983,50469,633.37,4.491986,4.322624,2003.46,62.608125
4,2178,WAUKON,24324.18,102.633671,3.0,16.062136,52172,1917.12,8.089114,4.868861,5856.41,122.008542
5,2190,DES MOINES,121689.06,92.539209,77.0,23.306242,50314,6322.17,4.807734,5.774259,29452.92,84.878732
6,2191,KEOKUK,125093.49,209.888406,56.0,19.067467,52632,8053.32,13.512282,5.778087,29085.57,192.619669
7,2200,SAC CITY,22811.55,56.604342,81.0,16.707356,50583,1817.24,4.509280,5.620868,4900.43,58.338452
8,2205,CLARINDA,24681.39,85.699271,73.0,19.165570,51632,1556.91,5.405937,5.098785,6407.74,91.539143
9,2228,WINTERSET,17462.07,72.758625,61.0,17.893750,50273,1367.65,5.698542,4.875417,5193.97,86.566167


In [128]:
#sales15.dtypes

In [129]:
#sales15.count()

In [130]:
#sales15.shape

# Explore the data

Perform some exploratory statistical analysis and make some plots, such as histograms of transaction totals, bottles sold, etc.
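
A minimal sketch of this kind of exploration is shown below. It uses the five transactions from the `df.head()` output rather than the full dataset, and the names `toy`, `summary`, and `corr` are placeholders for illustration.

```python
import pandas as pd

# The five transactions from the df.head() output, after dollar cleaning
toy = pd.DataFrame({
    "Sale (Dollars)": [81.00, 41.26, 453.36, 85.50, 129.60],
    "Bottles Sold": [12, 2, 24, 6, 12],
})

# Summary statistics: count, mean, std, min, quartiles, max
summary = toy.describe()
print(summary)

# One histogram per column; pandas returns a 2-D array of matplotlib Axes
axes = toy.hist(bins=5, figsize=(8, 3))

# Pairwise Pearson correlations between the numeric columns
corr = toy.corr()
print(corr)
```

On the full data, transaction totals are likely to be strongly right-skewed, so a log-scaled axis or a cap on the plotted range may make the histogram easier to read.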

In [131]:
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations between the numeric columns
df.select_dtypes(include='number').corr()

Unnamed: 0,Store Number,County Number,Category,Vendor Number,Item Number,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Margin,Price per Liter
Store Number,1.0,0.00751,-0.012111,-0.004533,-0.026386,-0.063475,-0.071382,-0.07152,0.014422,-0.017941,-0.017696,-0.07178,-0.013388
County Number,0.00751,1.0,-0.006336,0.001746,0.00681,-0.027694,0.012186,0.012113,0.018945,0.019851,0.010199,0.011963,0.031327
Category,-0.012111,-0.006336,1.0,0.09192,0.116385,-0.009294,-0.013358,-0.013545,-0.000208,0.006382,-0.004375,-0.013915,-0.018943
Vendor Number,-0.004533,0.001746,0.09192,1.0,0.13612,0.024561,0.003477,0.003085,-0.002116,-0.012277,-0.007432,0.002301,0.012246
Item Number,-0.026386,0.00681,0.116385,0.13612,1.0,-0.057282,0.097879,0.097612,-0.004766,0.002987,-0.009555,0.097055,0.160451
Bottle Volume (ml),-0.063475,-0.027694,-0.009294,0.024561,-0.057282,1.0,0.312841,0.313819,-0.012476,0.082446,0.156258,0.315698,-0.306428
State Bottle Cost,-0.071382,0.012186,-0.013358,0.003477,0.097879,0.312841,1.0,0.99996,-0.06298,0.135931,0.009296,0.999642,0.733251
State Bottle Retail,-0.07152,0.012113,-0.013545,0.003085,0.097612,0.313819,0.99996,1.0,-0.062831,0.136114,0.009736,0.999841,0.732808
Bottles Sold,0.014422,0.018945,-0.000208,-0.002116,-0.004766,-0.012476,-0.06298,-0.062831,1.0,0.825446,0.883348,-0.062518,-0.061534
Sale (Dollars),-0.017941,0.019851,0.006382,-0.012277,0.002987,0.082446,0.135931,0.136114,0.825446,1.0,0.84642,0.136449,0.063459


## Record your findings

Be sure to write out any observations from your exploratory analysis.

# Mine the data
Now you are ready to compute the variables you will use for your regression from the data. For example, you may want to compute total sales per store from January to March of 2015, mean price per bottle, etc. Refer to the readme for more ideas appropriate to your scenario.

Pandas is your friend for this task. Take a look at the operations [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for ideas on how to make the best use of pandas and feel free to search for blog and Stack Overflow posts to help you group data by certain variables and compute sums, means, etc. You may find it useful to create a new data frame to house this summary data.
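
One way to build such a summary frame is pandas' named aggregation, sketched below on made-up transactions. The data values and the output column names (`total_sales`, `average_sale`, `total_liters`, `city`) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical transactions; column names match the cleaned df
toy = pd.DataFrame({
    "Store Number": [2106, 2106, 2113, 2113, 2113],
    "City": ["CEDAR FALLS", "CEDAR FALLS", "GOWRIE", "GOWRIE", "GOWRIE"],
    "Sale (Dollars)": [100.0, 300.0, 30.0, 60.0, 90.0],
    "Volume Sold (Liters)": [9.0, 21.0, 1.5, 3.0, 4.5],
})

# Named aggregation: output_column=(input_column, function)
per_store = toy.groupby("Store Number", as_index=False).agg(
    total_sales=("Sale (Dollars)", "sum"),
    average_sale=("Sale (Dollars)", "mean"),
    total_liters=("Volume Sold (Liters)", "sum"),
    city=("City", "first"),  # one value per store, assumed constant
)
print(per_store)
```

Named aggregation produces flat, readable column names directly, so there is no multi-level column index to collapse afterwards.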

# Refine the data
Look for any statistical relationships, correlations, or other relevant properties of the dataset.
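
As a small sketch of such a check, the snippet below correlates Q1 sales with full-year sales for the ten stores shown in the `sales15` output. Ten stores are only a sample, so the number it prints is illustrative rather than a result for the whole dataset; the name `stores` is a placeholder.

```python
import pandas as pd

# Total and Q1 sales for the ten stores shown in the sales15 output
stores = pd.DataFrame({
    "City": ["CEDAR FALLS", "GOWRIE", "WATERLOO", "ROCKWELL", "WAUKON",
             "DES MOINES", "KEOKUK", "SAC CITY", "CLARINDA", "WINTERSET"],
    "Total Sales": [146326.22, 9310.22, 111871.43, 7721.08, 24324.18,
                    121689.06, 125093.49, 22811.55, 24681.39, 17462.07],
    "Q1 Total Sales": [39287.29, 2833.25, 24272.57, 2003.46, 5856.41,
                       29452.92, 29085.57, 4900.43, 6407.74, 5193.97],
})

# Keep only numeric columns so text columns do not break the calculation
numeric = stores.select_dtypes(include="number")

# Pearson correlation matrix of the numeric columns
corr = numeric.corr()
print(corr)
```

A correlation close to 1 between the two columns would suggest that Q1 sales are a promising predictor of the annual total, which is worth confirming on the full set of stores.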

# Build your models

Using scikit-learn or statsmodels, build the necessary models for your scenario. Evaluate model fit.

In [132]:
from sklearn import linear_model
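
The following is a minimal sketch of one possible model, not a prescribed solution: an ordinary least-squares regression of full-year sales on Q1 sales with scikit-learn. It uses only the ten stores shown in the `sales15` output, and the split size and `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Q1 and full-year totals for the ten stores shown in the sales15 output
q1_sales = np.array([39287.29, 2833.25, 24272.57, 2003.46, 5856.41,
                     29452.92, 29085.57, 4900.43, 6407.74, 5193.97])
total_sales = np.array([146326.22, 9310.22, 111871.43, 7721.08, 24324.18,
                        121689.06, 125093.49, 22811.55, 24681.39, 17462.07])

X = q1_sales.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
y = total_sales

# Hold out three stores so the fit is scored on data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=3, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("R^2 on held-out stores:", r2_score(y_test, predictions))
```

With so few held-out stores the R^2 value is noisy; on the full per-store frame, cross-validation would give a steadier estimate of fit.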


## Plot your results

Again, make sure that you record any valuable information. For example, in the tax scenario, did you find the sales from the first three months of the year to be a good predictor of the total sales for the year? Plot the predictions versus the true values and discuss the successes and limitations of your models.
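
A predicted-versus-true plot could be sketched as follows. To keep the example short, a straight-line fit from `np.polyfit` stands in for whatever model you built, and the data are again the ten stores shown in the `sales15` output.

```python
import matplotlib.pyplot as plt
import numpy as np

# Q1 and full-year totals for the ten stores shown in the sales15 output
q1_sales = np.array([39287.29, 2833.25, 24272.57, 2003.46, 5856.41,
                     29452.92, 29085.57, 4900.43, 6407.74, 5193.97])
total_sales = np.array([146326.22, 9310.22, 111871.43, 7721.08, 24324.18,
                        121689.06, 125093.49, 22811.55, 24681.39, 17462.07])

# Straight-line fit (np.polyfit stands in for the regression model here)
slope, intercept = np.polyfit(q1_sales, total_sales, deg=1)
predicted = slope * q1_sales + intercept
residuals = total_sales - predicted

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(total_sales, predicted, label="stores")

# Points on the dashed diagonal are predicted exactly
low, high = total_sales.min(), total_sales.max()
ax.plot([low, high], [low, high], linestyle="--", color="gray",
        label="perfect prediction")

ax.set_xlabel("True 2015 total sales (dollars)")
ax.set_ylabel("Predicted 2015 total sales (dollars)")
ax.legend()
```

Points far from the diagonal are the stores the model handles worst; inspecting `residuals` for those stores is a natural starting point when discussing limitations.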

# Present the results

Present your conclusions and results. If you have more than one interesting model, feel free to include each of them along with a discussion. Use your work in this notebook to prepare your write-up.