# Getting started

Once you've chosen your scenario, download the data from [the Iowa website](https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy) in CSV format. Start by loading the data with pandas. The date column is read in as plain text, so you will need to parse it into a datetime type before working with it.
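As a minimal sketch of the loading step, the snippet below reads a two-row sample from an in-memory string instead of the real file. The `sample` string is an illustrative assumption (its values are copied from the data preview, trimmed to a few columns), and the explicit month/day/year format is inferred from those values.

```python
import io
import pandas as pd

# Two rows copied from the dataset preview, trimmed to a few columns for illustration.
sample = io.StringIO(
    "Date,Store Number,City,Sale (Dollars)\n"
    "11/04/2015,3717,SUMNER,$81.00\n"
    "03/02/2016,2614,DAVENPORT,$41.26\n"
)

df = pd.read_csv(sample)
# Dates in this file are month/day/year; an explicit format avoids ambiguity.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
print(df.dtypes)
```

With the real file, the same two steps apply after pointing `pd.read_csv` at the downloaded CSV path.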

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import datasets
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

In [2]:
## Load the data into a DataFrame
iowa = pd.read_csv('/Users/macbook/GA-DSI/projects/projects-weekly/project-03/Iowa_Liquor_sales_sample_10pct.csv')

In [3]:
iowa.head(3)

Unnamed: 0,Date,Store Number,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
0,11/04/2015,3717,SUMNER,50674,9.0,Bremer,1051100.0,APRICOT BRANDIES,55,54436,Mr. Boston Apricot Brandy,750,$4.50,$6.75,12,$81.00,9.0,2.38
1,03/02/2016,2614,DAVENPORT,52807,82.0,Scott,1011100.0,BLENDED WHISKIES,395,27605,Tin Cup,750,$13.75,$20.63,2,$41.26,1.5,0.4
2,02/11/2016,2106,CEDAR FALLS,50613,7.0,Black Hawk,1011200.0,STRAIGHT BOURBON WHISKIES,65,19067,Jim Beam,1000,$12.59,$18.89,24,$453.36,24.0,6.34


In [4]:
iowa.dtypes

Date                      object
Store Number               int64
City                      object
Zip Code                  object
County Number            float64
County                    object
Category                 float64
Category Name             object
Vendor Number              int64
Item Number                int64
Item Description          object
Bottle Volume (ml)         int64
State Bottle Cost         object
State Bottle Retail       object
Bottles Sold               int64
Sale (Dollars)            object
Volume Sold (Liters)     float64
Volume Sold (Gallons)    float64
dtype: object

In [5]:
print iowa.columns
print iowa.shape

Index([u'Date', u'Store Number', u'City', u'Zip Code', u'County Number',
       u'County', u'Category', u'Category Name', u'Vendor Number',
       u'Item Number', u'Item Description', u'Bottle Volume (ml)',
       u'State Bottle Cost', u'State Bottle Retail', u'Bottles Sold',
       u'Sale (Dollars)', u'Volume Sold (Liters)', u'Volume Sold (Gallons)'],
      dtype='object')
(270955, 18)


# Clean Data

### Drop redundant columns

In [6]:
# Drop cols that contain same info as another col
iowa.drop(['County Number', 'Item Number', 'Volume Sold (Gallons)'], axis=1, inplace=True)

In [7]:
# Made a separate df for category to category name mapping for future reference
category_df = pd.pivot_table(iowa, index=['Category', 'Category Name'], values=['Bottles Sold'])
category_df.drop('Bottles Sold', axis=1, inplace=True)
category_df.head(2)

Category,Category Name
1011100.0,BLENDED WHISKIES
1011200.0,STRAIGHT BOURBON WHISKIES


### Clean column names

In [8]:
# Clean column name 1: Rename columns
iowa.rename(columns={'Store Number':'Store', 'Bottle Volume (ml)':'Bottle Volume', 'Sale (Dollars)':'Sales', \
               'Volume Sold (Liters)':'Volume Sold'}, inplace=True)

# Clean column names 2: Change all column names to lowercase letters
iowa.rename(columns=lambda x: x.lower(), inplace=True)

# Clean column names 3: Replace ' ' with '_'
iowa.rename(columns=lambda x: x.replace(" ","_"), inplace=True)


### Clean values in columns with currency and date

In [9]:
# Convert columns with dollar amounts from object to numeric float
currency = ['sales', 'state_bottle_cost', 'state_bottle_retail']
iowa[currency] = iowa[currency].apply(lambda x: x.str.replace('$',''))
iowa[currency] = iowa[currency].apply(lambda x: x.str.replace(',',''))
iowa[currency] = iowa[currency].apply(lambda x: pd.to_numeric(x))
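The currency cleanup above relies on `str.replace` treating `$` as a literal character. In a regular expression `$` is the end-of-string anchor, so a hedged, more explicit variant passes `regex=False` (an argument available in newer pandas releases, assumed here). The helper name `to_dollars` and the `$1,234.50` value are hypothetical and only illustrate the comma case.

```python
import pandas as pd

def to_dollars(series):
    """Strip '$' and ',' as literal characters, then convert to float."""
    cleaned = (series.str.replace("$", "", regex=False)
                     .str.replace(",", "", regex=False))
    return pd.to_numeric(cleaned)

# The first two values appear in the data preview; the third is made up.
prices = pd.Series(["$4.50", "$453.36", "$1,234.50"])
dollars = to_dollars(prices)
print(dollars)
```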

In [10]:
# Convert date from object to datetime (dates are month/day/year, e.g. 11/04/2015)
iowa['date'] = pd.to_datetime(iowa['date'], format='%m/%d/%Y')

In [11]:
# Inspect the zip codes before converting them to numbers
print iowa['zip_code'].nunique()
print iowa['zip_code'].unique()
# One problematic value: 712-2

415
['50674' '52807' '50613' '50010' '50421' '52402' '52501' '50428' '50035'
 '52332' '50265' '52577' '52806' '52656' '52241' '50703' '50208' '52342'
 '51250' '50401' '51351' '52246' '51501' '50111' '52245' '52632' '50125'
 '50501' '50311' '50317' '50124' '52804' '50320' '50651' '50129' '50021'
 '52224' '50533' '50212' '52060' '51401' '50595' '51104' '52404' '52353'
 '50616' '52057' '51201' '50009' '50588' '52802' '51503' '50638' '51106'
 '51360' '52001' '50250' '51461' '52641' '52303' '50115' '52144' '51301'
 '52761' '50851' '51555' '52240' '50126' '50511' '50310' '50263' '50314'
 '52753' '50701' '50140' '52732' '50665' '52601' '712-2' '51041' '51455'
 '51453' '52405' '52302' '50023' '50131' '50662' '52310' '50423' '52208'
 '52361' '50201' '50003' '50315' '52136' '52544' '52556' '51334' '50158'
 '52778' '51601' '52337' '51105' '51632' '50583' '50325' '50707' '51103'
 '51040' '52340' '52101' '50220' '52356' '52172' '52043' '50450' '50676'
 '50036' '52803' '50028' '50112' '50219' '52205

In [12]:
# Coerce zip codes to numeric; the problematic value '712-2' becomes NaN
# The column stays float64 because an int64 column cannot hold NaN
iowa['zip_code'] = pd.to_numeric(iowa["zip_code"], errors='coerce')
iowa['zip_code'].head(2)

0    50674.0
1    52807.0
Name: zip_code, dtype: float64
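If integer zip codes are wanted despite the missing value, one option is pandas' nullable integer dtype. This is a sketch under the assumption that a pandas version with the `Int64` extension dtype is available; the three sample values come from the output above.

```python
import pandas as pd

zips = pd.Series(["50674", "52807", "712-2"])
# errors='coerce' turns the unparseable '712-2' into NaN (float64 at this point).
as_float = pd.to_numeric(zips, errors="coerce")
# 'Int64' (capital I) is the nullable integer dtype; the missing value is kept as <NA>.
as_int = as_float.astype("Int64")
print(as_int)
```

Keeping zip codes as strings is an equally reasonable choice, since they are identifiers rather than quantities.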

In [13]:
iowa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270955 entries, 0 to 270954
Data columns (total 15 columns):
date                   270955 non-null datetime64[ns]
store                  270955 non-null int64
city                   270955 non-null object
zip_code               270738 non-null float64
county                 269878 non-null object
category               270887 non-null float64
category_name          270323 non-null object
vendor_number          270955 non-null int64
item_description       270955 non-null object
bottle_volume          270955 non-null int64
state_bottle_cost      270955 non-null float64
state_bottle_retail    270955 non-null float64
bottles_sold           270955 non-null int64
sales                  270955 non-null float64
volume_sold            270955 non-null float64
dtypes: datetime64[ns](1), float64(6), int64(4), object(4)
memory usage: 31.0+ MB


### Convert dtype object to str

In [14]:
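# Caution: astype(str) turns NaN into the string 'nan', so isnull() reports
# no missing values in county and category_name from here on
iowa['city'] = iowa['city'].astype(str)
iowa['county'] = iowa['county'].astype(str)
iowa['category_name'] = iowa['category_name'].astype(str)
iowa['item_description'] = iowa['item_description'].astype(str)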
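A hedged alternative, in case the missing values should survive the conversion: convert only the non-missing entries. The small `county` series is illustrative, not taken from the full dataset.

```python
import numpy as np
import pandas as pd

county = pd.Series(["Bremer", np.nan, "Scott"])
# Keep missing entries as they are; convert everything else to str.
county_str = county.where(county.isnull(), county.astype(str))
n_missing = int(county_str.isnull().sum())
print(n_missing)
```

With this approach, later calls to `isnull()`, `fillna()` and `dropna()` still see the missing counties.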

### Extract dates into year, quarter and month

In [15]:
def extract_month(x):
    month = str('{:02d}'.format(x.month)) + "-" + str((x.year))
    return month
def extract_quarter(x):
    quarter = "Q" + str(x.quarter) + "-" + str(x.year)
    return quarter

iowa["year"] = iowa["date"].dt.year
iowa["quarter"] = iowa["date"].apply(extract_quarter)
iowa["month"] = iowa["date"].apply(extract_month)
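The same month and quarter labels can also be built without `apply`, using the vectorized `.dt` accessors. This is a sketch on two sample dates from the preview; it is meant to produce the same `MM-YYYY` and `Qn-YYYY` strings as the helper functions above.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2015-11-04", "2016-03-02"]))
# Zero-padded month followed by year, e.g. '11-2015'.
month = dates.dt.strftime("%m-%Y")
# Quarter label, e.g. 'Q4-2015'.
quarter = "Q" + dates.dt.quarter.astype(str) + "-" + dates.dt.year.astype(str)
print(month.tolist(), quarter.tolist())
```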

### Check for and replace NaNs

In [16]:
iowa.isnull().sum()
#Will leave zip_code NaNs as I filled problematic zip code with np.nan above

date                     0
store                    0
city                     0
zip_code               217
county                   0
category                68
category_name            0
vendor_number            0
item_description         0
bottle_volume            0
state_bottle_cost        0
state_bottle_retail      0
bottles_sold             0
sales                    0
volume_sold              0
year                     0
quarter                  0
month                    0
dtype: int64

### Remove Duplicate County Names & Fill in Missing Counties

In [17]:
# Match missing County with City
df_county = pd.pivot_table(iowa, index=['city', 'county'], values=['sales'], aggfunc=sum)
df_county.drop('sales', axis=1, inplace=True)
df_county.reset_index(inplace=True)
df_county.head()

Unnamed: 0,city,county
0,ACKLEY,Hardin
1,ACKLEY,Webster
2,ACKLEY,
3,ADAIR,Adair
4,ADEL,Dallas


In [18]:
print df_county['city'].duplicated().sum()
a = df_county[df_county['city'].duplicated()].index.tolist()
df_county['city'].loc[a]

44


1               ACKLEY
2               ACKLEY
13             ALTOONA
16             ANAMOSA
25            ATLANTIC
27             AUDUBON
37             BELMOND
39          BETTENDORF
60        CEDAR RAPIDS
65            CHARITON
69            CLARINDA
74             CLINTON
76               CLIVE
83          CORALVILLE
85             CORNING
91              CRESCO
98           DAVENPORT
108         DES MOINES
113            DUBUQUE
116             DUNLAP
137          EVANSDALE
152         FORT DODGE
153         FORT DODGE
180            HAMPTON
182             HARLAN
223          LARCHWOOD
253         MASON CITY
282             NEWTON
286      NORTH LIBERTY
287      NORTH LIBERTY
290            NORWALK
300            OSCEOLA
302          OSKALOOSA
314              PERRY
317      PLEASANTVILLE
334           ROCKWELL
352          SIGOURNEY
355         SIOUX CITY
368       STATE CENTER
401           WATERLOO
405            WAVERLY
410             WESLEY
413        WEST BRANCH
416    WEST

In [19]:
# Did not use drop_duplicates() because I did not want to automatically drop the second duplicate
# Instead, I checked the counties of duplicate cities and dropped the incorrect city-county mappings
# It also turns out some towns (Ackley, Clive, and West Des Moines) are located in multiple counties
# Remove from df_county: index [32, 74, 134, 260, 263, 276, 288, 336]

b = [i+1 for i in a]
c = [i-1 for i in a]
d = a + b + c
e = sorted(d)

df_county.loc[e, :]
# Reference mapping of duplicated cities to their correct county
county_dict = {'BETTENDORF': 'Scott', 'CORNING': 'Adams', 'FORT DODGE': 'Webster', 'NEWTON': 'Jasper', \
               'NORTH LIBERTY': 'Johnson', 'OSKALOOSA': 'Mahaska', 'PERRY': 'Dallas', 'STATE CENTER': 'Marshall'}

In [20]:
# Remove from df_county: index [32, 74, 134, 260, 263, 276, 288, 336]
df_county.drop([32, 74, 134, 260, 263, 276, 288, 336], inplace=True)

# Rename Ackley, Clive, and West Des Moines to account for parts of town in different counties
# (set_value is meant for single scalars and was removed in pandas 1.0; .loc assigns all three labels)
df_county.loc[[1, 66, 380], 'city'] = ['ACKLEY_WEBSTER', 'CLIVE_POLK', 'WEST_DES_MOINES_POLK']
df_county.head(10)

Unnamed: 0,city,county
0,ACKLEY,Hardin
1,ACKLEY_WEBSTER,Webster
2,ACKLEY,
3,ADAIR,Adair
4,ADEL,Dallas
5,AFTON,Union
6,AKRON,Plymouth
7,ALBIA,Monroe
8,ALDEN,Hardin
9,ALGONA,Kossuth


In [21]:
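# Merge df_county into original iowa df
# Note: Original "county" renamed "county_x" and new "county_y" col created
# Caution: df_county still has more than one row for some cities (e.g. DAVENPORT),
# so this left join duplicates the matching iowa rows (see rows 1 and 2 below)
iowa = pd.merge(iowa,df_county, on = 'city', how = 'left')
iowa.head(3)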

Unnamed: 0,date,store,city,zip_code,county_x,category,category_name,vendor_number,item_description,bottle_volume,state_bottle_cost,state_bottle_retail,bottles_sold,sales,volume_sold,year,quarter,month,county_y
0,2015-11-04,3717,SUMNER,50674.0,Bremer,1051100.0,APRICOT BRANDIES,55,Mr. Boston Apricot Brandy,750,4.5,6.75,12,81.0,9.0,2015,Q4-2015,11-2015,Bremer
1,2016-03-02,2614,DAVENPORT,52807.0,Scott,1011100.0,BLENDED WHISKIES,395,Tin Cup,750,13.75,20.63,2,41.26,1.5,2016,Q1-2016,03-2016,Scott
2,2016-03-02,2614,DAVENPORT,52807.0,Scott,1011100.0,BLENDED WHISKIES,395,Tin Cup,750,13.75,20.63,2,41.26,1.5,2016,Q1-2016,03-2016,
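Rows 1 and 2 above are the same DAVENPORT sale, which suggests the lookup table still held two rows for that city when it was merged. A minimal sketch of one way to guard against this: make the lookup unique per city first and let `validate='many_to_one'` raise if it is not. The tiny frames and the names `lookup`, `clean_lookup`, `naive` and `safe` are illustrative assumptions.

```python
import pandas as pd

sales = pd.DataFrame({"city": ["SUMNER", "DAVENPORT"], "sales": [81.0, 41.26]})
# Lookup with a repeated city, one copy lacking a county (as in df_county above).
lookup = pd.DataFrame({
    "city": ["SUMNER", "DAVENPORT", "DAVENPORT"],
    "county": ["Bremer", "Scott", None],
})

# A plain left join repeats the DAVENPORT sale once per matching lookup row.
naive = sales.merge(lookup, on="city", how="left")

# Drop empty counties, keep one row per city, and ask pandas to verify uniqueness.
clean_lookup = lookup.dropna(subset=["county"]).drop_duplicates("city")
safe = sales.merge(clean_lookup, on="city", how="left", validate="many_to_one")
print(len(naive), len(safe))
```

Comparing the row count before and after the merge is a cheap additional check.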


In [22]:
# Fill missing county names in "county_x" with county names from "county_y"
iowa['county_x'].fillna(iowa['county_y'], inplace=True)
iowa.isnull().sum()

date                      0
store                     0
city                      0
zip_code                434
county_x                  0
category                102
category_name             0
vendor_number             0
item_description          0
bottle_volume             0
state_bottle_cost         0
state_bottle_retail       0
bottles_sold              0
sales                     0
volume_sold               0
year                      0
quarter                   0
month                     0
county_y               3652
dtype: int64

In [23]:
# Check for remaining missing counties
iowa[iowa["county_x"].isnull()]['city'].unique()

array([], dtype=object)

In [24]:
# Fill in missing counties by mapping city to county
missing_counties = {'TABOR': 'Fremont', 'SEYMOUR': 'Wayne', 'RUNNELLS': 'Polk'}

# Note: this overwrites county_y; cities not in the dict map to NaN
iowa['county_y'] = iowa['city'].map(missing_counties)
iowa['county_x'].fillna(iowa['county_y'], inplace=True)

In [25]:
iowa.isnull().sum()

date                        0
store                       0
city                        0
zip_code                  434
county_x                    0
category                  102
category_name               0
vendor_number               0
item_description            0
bottle_volume               0
state_bottle_cost           0
state_bottle_retail         0
bottles_sold                0
sales                       0
volume_sold                 0
year                        0
quarter                     0
month                       0
county_y               394426
dtype: int64

### Drop Missing Values in "category_name"

In [26]:
# Find duplicates in category name
df_category_name = pd.pivot_table(iowa,index = ['item_description','category_name'], \
                                  values = ['sales'], aggfunc = sum)
df_category_name.reset_index(inplace = True)
df_category_name.drop('sales', axis=1, inplace=True)
df_category_name.duplicated().sum()
# No duplicates

0

In [27]:
# Drop rows with NaN in "category_name" col as it is only 0.233% of total (632 of the original 270,955 rows)
# Did not have the time (as I did for counties above) to go through the 632 NaN rows
# Note: astype(str) earlier turned these NaNs into the string 'nan', so the check
# below prints 0.0% and dropna() removes nothing
null_cat_percent = (iowa['category_name'].isnull().sum())/ \
                    float(len(iowa['category_name']))*100
print str(round(null_cat_percent, 3)) + '% of total rows ' + 'is null'
print 'These 632 rows will be dropped from analysis'

iowa.dropna(subset=['category_name'], inplace=True)
iowa[['category_name']].head(3)

0.0% of total rows is null
These 632 rows will be dropped from analysis


Unnamed: 0,category_name
0,APRICOT BRANDIES
1,BLENDED WHISKIES
2,BLENDED WHISKIES


In [28]:
iowa.isnull().sum()

date                        0
store                       0
city                        0
zip_code                  434
county_x                    0
category                  102
category_name               0
vendor_number               0
item_description            0
bottle_volume               0
state_bottle_cost           0
state_bottle_retail         0
bottles_sold                0
sales                       0
volume_sold                 0
year                        0
quarter                     0
month                       0
county_y               394426
dtype: int64

In [29]:
# Drop county_y column and rename county_x as county
iowa.drop('county_y', axis=1, inplace=True)
iowa.rename(columns={'county_x':'county'}, inplace=True)

### Create new columns with metrics

In [30]:
# Just checking to make sure that the 'sales' column represents total revenue
iowa['sales_check'] = iowa['state_bottle_retail'] * iowa['bottles_sold']
print sum(iowa['sales_check']-iowa['sales'])
iowa.drop('sales_check', axis=1, inplace=True)

1.65947255937e-11


In [31]:
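# New metrics added to iowa
# Note: volume_sold is in liters, so the "*_per_ml" metrics are really per liter;
# price_per_ml divides the bottle price by the total liters sold on that line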
iowa['profit'] = (iowa['state_bottle_retail'] - iowa['state_bottle_cost']) * iowa['bottles_sold']
iowa['rev_per_ml'] = iowa['sales'] / iowa['volume_sold']
iowa['price_per_ml'] = iowa['state_bottle_retail'] / iowa['volume_sold']
iowa['margin_percent'] = iowa['profit'] / iowa['sales']
iowa['profit_per_ml'] = iowa['profit'] / iowa['volume_sold']
iowa['profit_%_per_ml'] = iowa['profit_per_ml'] / iowa['rev_per_ml']

In [32]:
print 'Iowa is a "control state", meaning the state has direct control over the wholesale alcohol \
market. The state dictates retail prices at a set margin of 33.3%.'
print 'See Iowa Alcoholic Beverages Division for more info.'
iowa.iloc[:3,11:28]

Iowa is a "control state", meaning the state has direct control over the wholesale alcohol market. The state dictates retail prices at a set margin of 33.3%.
See Iowa Alcoholic Beverages Division for more info.


Unnamed: 0,state_bottle_retail,bottles_sold,sales,volume_sold,year,quarter,month,profit,rev_per_ml,price_per_ml,margin_percent,profit_per_ml,profit_%_per_ml
0,6.75,12,81.0,9.0,2015,Q4-2015,11-2015,27.0,9.0,0.75,0.333333,3.0,0.333333
1,20.63,2,41.26,1.5,2016,Q1-2016,03-2016,13.76,27.506667,13.753333,0.333495,9.173333,0.333495
2,20.63,2,41.26,1.5,2016,Q1-2016,03-2016,13.76,27.506667,13.753333,0.333495,9.173333,0.333495
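The near-constant margin in the table can be checked by hand. In the sample rows the retail price is 1.5 times the state cost, and a 50% markup on cost is a one-third margin on retail: (1.5c - c) / 1.5c = 1/3. The sketch below redoes that arithmetic for two rows from the preview; treating the 1.5 ratio as a general rule is an assumption based only on those rows.

```python
# Row 0: Mr. Boston Apricot Brandy
cost, retail = 4.50, 6.75
markup = retail / cost               # retail as a multiple of cost
margin = (retail - cost) / retail    # profit as a share of retail

# Row 1: Tin Cup (13.75 * 1.5 = 20.625, listed as 20.63)
cost2, retail2 = 13.75, 20.63
margin2 = (retail2 - cost2) / retail2

print(markup, round(margin, 6), round(margin2, 6))
```

The small deviation in the second row comes from rounding the retail price to whole cents.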


In [33]:
# Drop margin % columns (they are constant and will not affect model)
iowa.drop(['margin_percent', 'profit_%_per_ml'], inplace=True,axis=1)

# Add $margin per bottle as this is not a constant measure
iowa['profit_per_bottle'] = iowa['state_bottle_retail'] - iowa['state_bottle_cost']
iowa.iloc[:2,11:28]

Unnamed: 0,state_bottle_retail,bottles_sold,sales,volume_sold,year,quarter,month,profit,rev_per_ml,price_per_ml,profit_per_ml,profit_per_bottle
0,6.75,12,81.0,9.0,2015,Q4-2015,11-2015,27.0,9.0,0.75,3.0,2.25
1,20.63,2,41.26,1.5,2016,Q1-2016,03-2016,13.76,27.506667,13.753333,9.173333,6.88


### Drop stores that were only partially open in 2015

In [34]:
# Check for stores that were not open for the full year or closed during the year
dates_open = iowa.groupby(["store"])["date"].agg([min, max])
dates_open.reset_index(inplace=True)
dates_open.tail(3)

Unnamed: 0,store,min,max
1397,9013,2015-06-04,2016-03-09
1398,9018,2015-10-27,2015-10-27
1399,9023,2016-03-08,2016-03-08


In [35]:
def open_date(x):
    if x > pd.to_datetime('2015-03-31'):
        return 1
    else:
        return 0
    
def close_date(x):
    if x <= pd.to_datetime('2015-12-31'):
        return 1
    else:
        return 0
    
dates_open["closed"] = dates_open["max"].apply(close_date)
dates_open["opened"] = dates_open["min"].apply(open_date)
dates_open["partial_year"] = dates_open["closed"] + dates_open["opened"]

In [36]:
dates_open.tail(3)

Unnamed: 0,store,min,max,closed,opened,partial_year
1397,9013,2015-06-04,2016-03-09,0,1,1
1398,9018,2015-10-27,2015-10-27,1,1,2
1399,9023,2016-03-08,2016-03-08,0,1,1
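The two flag functions can also be written as vectorized comparisons. This sketch rebuilds the three rows shown above and is intended to reproduce the same `partial_year` values; the cutoff dates are the ones used in `open_date` and `close_date`.

```python
import pandas as pd

dates_open = pd.DataFrame({
    "store": [9013, 9018, 9023],
    "min": pd.to_datetime(["2015-06-04", "2015-10-27", "2016-03-08"]),
    "max": pd.to_datetime(["2016-03-09", "2015-10-27", "2016-03-08"]),
})

# 1 if the first recorded sale falls after Q1-2015, else 0.
opened = (dates_open["min"] > pd.Timestamp("2015-03-31")).astype(int)
# 1 if the last recorded sale falls on or before the end of 2015, else 0.
closed = (dates_open["max"] <= pd.Timestamp("2015-12-31")).astype(int)
dates_open["partial_year"] = opened + closed
print(dates_open)
```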


In [37]:
# List of 217 stores not open full year in 2015
partial_stores = list(dates_open[dates_open["partial_year"]!=0]["store"].values)
print 'Number of stores not open for full year 2015: ' + str(len(partial_stores))

open_stores = list(dates_open[dates_open['partial_year'] == 0]['store'].values)
print 'Total stores: '+ str(len(partial_stores) + len(open_stores))
print 'Total stores open all of 2015: ' + str(len(open_stores))


Number of stores not open for full year 2015: 217
Total stores: 1400
Total stores open all of 2015: 1183


In [38]:
# Amount of revenue generated by stores only partially open in 2015
partial_rev = sum(iowa[iowa['store'].isin(partial_stores)]['sales']) / sum(iowa['sales']) 
print 'Only ' + str(round(partial_rev*100,2)) + '% of sales from stores partially open in 2015'
print 'Sales from these 217 stores will be dropped from analysis'

Only 3.77% of sales from stores partially open in 2015
Sales from these 217 stores will be dropped from analysis


## Create a new df with only stores open for full year 2015

In [39]:
# Create new df consisting only of stores open for full year 2015
idf = iowa[iowa['store'].isin(open_stores)]

## Create a new df summarizing 2015 data only

In [40]:
# Create a df summarizing total revenue and profit in 2015 at store-level
sum_metrics = ['sales', 'profit']
iowa_sum_2015 = idf[idf['year'] == 2015].groupby(['store'])[sum_metrics].agg(np.sum)
iowa_sum_2015.columns = ['2015_total_revenue','2015_profit']
iowa_sum_2015.reset_index(inplace=True)
print len(iowa_sum_2015)
iowa_sum_2015.describe(include='all')

1183


Unnamed: 0,store,2015_total_revenue,2015_profit
count,1183.0,1183.0,1183.0
mean,4165.12257,35304.76,11804.658199
std,812.526287,100240.3,33466.651183
min,2106.0,472.08,157.56
25%,3805.5,4691.535,1569.955
50%,4376.0,11085.92,3711.3
75%,4740.5,29639.43,9895.47
max,9010.0,2001567.0,667633.82


In [41]:
# Create a df summarizing mean metrics for 2015 at store-level
mean_metrics = ['sales', 'profit', 'price_per_ml', 'profit_per_ml', 'profit_per_bottle']
iowa_mean_2015 = idf[idf['year'] == 2015].groupby(['store'])[mean_metrics].agg(np.mean)
iowa_mean_2015.columns = ['2015_avg_rev','2015_avg_profit','2015_avg_price_per_ml', \
                         '2015_avg_profit_per_ml', '2015_avg_profit_per_bottle']
iowa_mean_2015.reset_index(inplace=True)
print len(iowa_mean_2015)
iowa_mean_2015.describe(include='all')

1183


Unnamed: 0,store,2015_avg_rev,2015_avg_profit,2015_avg_price_per_ml,2015_avg_profit_per_ml,2015_avg_profit_per_bottle
count,1183.0,1183.0,1183.0,1183.0,1183.0,1183.0
mean,4165.12257,123.9268,41.458439,4.716566,5.687314,4.565414
std,812.526287,117.297322,39.233806,2.916528,1.251092,0.989853
min,2106.0,24.757826,8.268261,0.543021,3.173605,1.488667
25%,3805.5,76.632554,25.596544,2.198975,4.945446,3.935385
50%,4376.0,106.078333,35.546951,4.280716,5.539003,4.54151
75%,4740.5,143.609183,48.125233,6.391313,6.233269,5.097468
max,9010.0,2061.811833,691.704,19.266372,27.428178,11.65375


In [42]:
# Create a df showing all days with sales for each store

idf_open = idf.loc[:,['store', 'date', 'year', 'quarter']].copy()
idf_open.drop_duplicates(['store', 'date'], inplace=True)

# Count the distinct sale dates per store in 2015
days_open_15 = idf_open[idf_open['year'] == 2015].groupby(['store'])[['date']].agg(len)
days_open_15.reset_index(inplace=True)
days_open_15.columns = ['store', 'days_open_15']
days_open_15['days_open_15'] = days_open_15['days_open_15'].astype(int)

In [43]:
# Create daily metrics (rev_per_day and profit_per_day) with days_open data 
iowa_sum_2015['2015_rev_per_day'] = iowa_sum_2015['2015_total_revenue'] / days_open_15['days_open_15']
iowa_sum_2015['2015_profit_per_day'] = iowa_sum_2015['2015_profit'] / days_open_15['days_open_15']
iowa_sum_2015.head()

Unnamed: 0,store,2015_total_revenue,2015_profit,2015_rev_per_day,2015_profit_per_day
0,2106,146326.22,48838.08,2813.965769,939.193846
1,2113,9310.22,3109.04,198.089787,66.149787
2,2130,223742.86,74650.4,4302.747308,1435.584615
3,2152,15442.16,5175.06,315.146122,105.613469
4,2178,24324.18,8165.7,476.944706,160.111765


In [44]:
# Create df of categorical variables only 
idf_category = idf.loc[:,['store', 'city', 'county', 'zip_code']].drop_duplicates()

# Merge (left join) iowa_sum, iowa_mean and days_open_15 to create df of all stores for 2015 data 
idf_2015 = pd.merge(idf_category, days_open_15, how='left', on='store'). \
merge(iowa_sum_2015, how='left', on='store').merge(iowa_mean_2015, how='left', on='store')
print len(idf_2015)
print 'Note: discrepancy between len(idf) and len(idf_2015) result of 2 stores each counted twice. No time to remove.'
idf_2015.head(3)

1204
Note: discrepancy between len(idf) and len(idf_2015) result of 2 stores each counted twice. No time to remove.


Unnamed: 0,store,city,county,zip_code,days_open_15,2015_total_revenue,2015_profit,2015_rev_per_day,2015_profit_per_day,2015_avg_rev,2015_avg_profit,2015_avg_price_per_ml,2015_avg_profit_per_ml,2015_avg_profit_per_bottle
0,3717,SUMNER,Bremer,50674.0,50,9022.86,3011.02,180.4572,60.2204,34.438397,11.492443,10.623053,5.3898,5.103779
1,2614,DAVENPORT,Scott,52807.0,52,285350.58,95391.52,5487.511154,1834.452308,138.385344,46.261649,7.146836,6.427207,5.232396
2,2106,CEDAR FALLS,Black Hawk,50613.0,52,146326.22,48838.08,2813.965769,939.193846,277.658861,92.671879,2.788873,5.961008,5.166319


## Create a new df summarizing Q1-2015 data only

In [45]:
# Create a df summarizing total revenue and profit in Q1-2015 at store-level
sum_metrics = ['sales', 'profit']
iowa_sum_2015Q1 = idf[idf['quarter'] == 'Q1-2015'].groupby(['store'])[sum_metrics].agg(np.sum)
iowa_sum_2015Q1.columns = ['Q1-15_total_revenue','Q1-15_profit']
iowa_sum_2015Q1.reset_index(inplace=True)
print len(iowa_sum_2015Q1)

# Create a df summarizing mean metrics
mean_metrics = ['sales', 'profit', 'price_per_ml', 'profit_per_ml', 'profit_per_bottle']
iowa_mean_2015Q1 = idf[idf['quarter'] == 'Q1-2015'].groupby(['store'])[mean_metrics].agg(np.mean)
iowa_mean_2015Q1.columns = ['Q1-15_avg_rev','Q1-15_avg_profit','Q1-15_avg_price_per_ml', \
                         'Q1-15_avg_profit_per_ml', 'Q1-15_avg_profit_per_bottle']
iowa_mean_2015Q1.reset_index(inplace=True)
print len(iowa_mean_2015Q1)

# Create days_open df for Q1-2015
days_open_15Q1 = idf_open[idf_open['quarter'] == 'Q1-2015'].groupby(['store'])[['date']].agg(len)
days_open_15Q1.reset_index(inplace=True)
days_open_15Q1.columns = ['store', 'days_open_15Q1']
days_open_15Q1['days_open_15Q1'] = days_open_15Q1['days_open_15Q1'].astype(int)

1183
1183


In [46]:
# Create daily metrics (rev_per_day and profit_per_day) with days_open data 
iowa_sum_2015Q1['rev_per_day'] = iowa_sum_2015Q1['Q1-15_total_revenue'] / days_open_15Q1['days_open_15Q1']
iowa_sum_2015Q1['profit_per_day'] = iowa_sum_2015Q1['Q1-15_profit'] / days_open_15Q1['days_open_15Q1']
iowa_sum_2015Q1.head()

Unnamed: 0,store,Q1-15_total_revenue,Q1-15_profit,rev_per_day,profit_per_day
0,2106,39287.29,13108.37,3273.940833,1092.364167
1,2113,2833.25,944.72,257.568182,85.883636
2,2130,48545.14,16217.36,4045.428333,1351.446667
3,2152,4006.92,1337.2,333.91,111.433333
4,2178,5856.41,1961.28,488.034167,163.44


In [47]:
# Merge (left join) iowa_sum, iowa_mean and days_open_15Q1 to create df of all stores for Q1-2015 data 
idf_2015Q1 = pd.merge(idf_category, days_open_15Q1, how='left', on='store')
idf_2015Q1 = pd.merge(idf_2015Q1,iowa_sum_2015Q1, how='left', on='store'). \
merge(iowa_mean_2015Q1, how='left', on='store')

print len(idf_2015Q1)
print 'Note: discrepancy between len(idf) and len(idf_2015) result of 2 stores each counted twice. No time to remove.'
idf_2015Q1.head(3)

1204
Note: discrepancy between len(idf) and len(idf_2015) result of 2 stores each counted twice. No time to remove.


Unnamed: 0,store,city,county,zip_code,days_open_15Q1,Q1-15_total_revenue,Q1-15_profit,rev_per_day,profit_per_day,Q1-15_avg_rev,Q1-15_avg_profit,Q1-15_avg_price_per_ml,Q1-15_avg_profit_per_ml,Q1-15_avg_profit_per_bottle
0,3717,SUMNER,Bremer,50674.0,11,1583.13,527.81,143.920909,47.982727,35.980227,11.995682,11.149558,5.540246,4.902955
1,2614,DAVENPORT,Scott,52807.0,12,64520.24,21642.5,5376.686667,1803.541667,135.546723,45.467437,5.58463,5.687667,4.805462
2,2106,CEDAR FALLS,Black Hawk,50613.0,12,39287.29,13108.37,3273.940833,1092.364167,304.552636,101.615271,3.432935,5.959059,5.033721


## Create a new df summarizing Q1-2016 data only

In [48]:
# Create a df summarizing total revenue and profit in Q1-2016 at store-level
sum_metrics = ['sales', 'profit']
iowa_sum_2016Q1 = idf[idf['quarter'] == 'Q1-2016'].groupby(['store'])[sum_metrics].agg(np.sum)
iowa_sum_2016Q1.columns = ['Q1-16_total_revenue','Q1-16_profit']
iowa_sum_2016Q1.reset_index(inplace=True)
print len(iowa_sum_2016Q1)

# Create a df summarizing mean metrics
mean_metrics = ['sales', 'profit', 'price_per_ml', 'profit_per_ml', 'profit_per_bottle']
iowa_mean_2016Q1 = idf[idf['quarter'] == 'Q1-2016'].groupby(['store'])[mean_metrics].agg(np.mean)
iowa_mean_2016Q1.columns = ['Q1-16_avg_rev','Q1-16_avg_profit','Q1-16_avg_price_per_ml', \
                         'Q1-16_avg_profit_per_ml', 'Q1-16_avg_profit_per_bottle']
iowa_mean_2016Q1.reset_index(inplace=True)
print len(iowa_mean_2016Q1)

# Create days_open df for Q1-2016
days_open_16Q1 = idf_open[idf_open['quarter'] == 'Q1-2016'].groupby(['store'])[['date']].agg(len)
days_open_16Q1.reset_index(inplace=True)
days_open_16Q1.columns = ['store', 'days_open']
days_open_16Q1['days_open'] = days_open_16Q1['days_open'].astype(int)

1183
1183


In [None]:
iowa_sum_2016Q1.head(3)

In [None]:
# Create daily metrics (rev_per_day and profit_per_day) with days_open data 
iowa_sum_2016Q1['rev_per_day'] = iowa_sum_2016Q1['Q1-16_total_revenue'] / days_open_16Q1['days_open']
iowa_sum_2016Q1['profit_per_day'] = iowa_sum_2016Q1['Q1-16_profit'] / days_open_16Q1['days_open']

# Merge (left join) iowa_sum, iowa_mean and days_open_16Q1 to create df of all stores for Q1-2016 data 
idf_2016Q1 = pd.merge(idf_category, days_open_16Q1, how='left', on='store')
idf_2016Q1 = pd.merge(idf_2016Q1,iowa_sum_2016Q1, how='left', on='store'). \
merge(iowa_mean_2016Q1, how='left', on='store')

print len(idf_2016Q1)
print 'Note: discrepancy between len(idf) and len(idf_2016) result of 2 stores each counted twice. No time to remove.'
idf_2016Q1.head(3)
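The 2015, Q1-2015 and Q1-2016 summaries repeat the same steps with different filters and column prefixes. One way to reduce the copy-and-paste risk is a small helper. This is a sketch, not the notebook's method: the name `summarize_period`, its parameters and the toy transactions are hypothetical, it covers only a subset of the metrics, and named aggregation assumes pandas 0.25 or later.

```python
import pandas as pd

def summarize_period(df, quarter, prefix):
    """Return one row per store with revenue, profit and days-with-sales for a quarter."""
    period = df[df["quarter"] == quarter]
    out = period.groupby("store").agg(
        total_revenue=("sales", "sum"),
        profit=("profit", "sum"),
        avg_rev=("sales", "mean"),
        days_open=("date", "nunique"),  # distinct dates with at least one sale
    )
    out["rev_per_day"] = out["total_revenue"] / out["days_open"]
    # Prefix the metric columns only; 'store' is still the index at this point.
    return out.add_prefix(prefix + "_").reset_index()

# Tiny made-up transactions for illustration only.
tx = pd.DataFrame({
    "store": [1, 1, 1, 2],
    "date": pd.to_datetime(["2015-01-05", "2015-01-05", "2015-02-02", "2015-01-07"]),
    "quarter": ["Q1-2015"] * 4,
    "sales": [10.0, 20.0, 30.0, 40.0],
    "profit": [3.0, 6.0, 9.0, 12.0],
})
q1 = summarize_period(tx, "Q1-2015", "Q1-15")
print(q1)
```

Because the per-day metric is computed inside the grouped frame, it cannot be misaligned with the day counts.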

In [None]:
# Make sure all three new dfs are the same shape
print idf_2015.shape
print idf_2015Q1.shape
print idf_2016Q1.shape

In [None]:
###################################################
a1 = idf_2015.copy()
a2 = idf_2015Q1.copy()
a3 = idf_2016Q1.copy()

## Merge 2015, Q1-2015 and Q1-2016 dfs

In [None]:
stores_df = idf_2015.merge(idf_2015Q1, on=['store', 'county', 'city', 'zip_code'], how ='left')\
.merge(idf_2016Q1, on= ['store', 'county', 'city', 'zip_code'], how ="left")

In [None]:
print stores_df.shape
print pd.DataFrame(stores_df.columns)
stores_df.head(3)

In [None]:
# 1: Create idf_sums
idf_sums = idf.groupby(['store'])[['sales', 'profit', 'bottles_sold', 'volume_sold']].agg(np.sum)
idf_sums.reset_index(inplace=True)
idf_sums.head(3)

In [None]:
# 2: Create idf_means
idf_means = idf.groupby(['store'])[['sales', 'profit', 'bottles_sold', 'volume_sold']].agg(np.mean)
idf_means.reset_index(inplace=True)
idf_means.columns = ['store', 'rev_per_sale', 'prof_per_sale', 'bott_per_sale', \
                    'vol_per_sale']
idf_means.head(3)

In [None]:
# 3: Merge sums and means into idf_sales
idf_sales = idf_sums.merge(idf_means, how='inner', on='store')
idf_sales.head(3)

In [None]:
iowa.head()

In [None]:
# 4: Create new columns in idf for daily averages
# idf has no days-open columns yet, so merge the store-level counts in first
idf = idf.merge(days_open_15, how='left', on='store')
idf = idf.merge(days_open_16Q1.rename(columns={'days_open': 'days_open_16'}), how='left', on='store')

idf['sales_pd_15'] = idf['sales'] / idf['days_open_15']
idf['profit_pd_15'] = idf['profit'] / idf['days_open_15']
idf['bottl_pd_15'] = idf['bottles_sold'] / idf['days_open_15']
idf['vol_pd_15'] = idf['volume_sold'] / idf['days_open_15']

idf['sales_pd_16'] = idf['sales'] / idf['days_open_16']
idf['profit_pd_16'] = idf['profit'] / idf['days_open_16']
idf['bottles_pd_16'] = idf['bottles_sold'] / idf['days_open_16']
idf['vol_pd_16'] = idf['volume_sold'] / idf['days_open_16']

In [None]:
# 5: Separate daily average columns into own df
idf_pd = idf.groupby(['store'])[['sales_pd_15', 'profit_pd_15', 'vol_pd_15']].agg(np.sum)
idf_pd.reset_index(inplace=True)
idf_pd.head(3)

In [None]:
# 6: Merge sales and daily average dfs
idf_metrics = idf_sales.merge(idf_pd, how='inner', on='store')
idf_metrics.head(3)

In [None]:
# 7: Create location only df
idf_location = pd.pivot_table(idf, values=['sales'], index=['store', 'city', 'county', 'zip_code'], aggfunc=np.sum)
idf_location.reset_index(inplace=True)
idf_location.drop('sales', axis=1, inplace=True)
idf_location.head(3)

In [None]:
# 8: Create new df by merging metrics and location
idf_new= idf_metrics.merge(idf_location, how='inner', on='store')
idf_new.head(3)

In [None]:
# View correlations
col_list = idf_new.columns
idf_new.loc[:,col_list].corr()

In [None]:
idf_new.loc[:,col_list].corr()['sales']
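Since `idf_new` also carries text columns such as city and county, a hedged variant restricts the correlation to numeric columns and ranks the result. Newer pandas versions raise an error on non-numeric columns in `corr()`, which is an assumption about the reader's environment. The `stores` frame below is made-up data.

```python
import pandas as pd

stores = pd.DataFrame({
    "sales": [100.0, 200.0, 300.0],
    "vol_per_sale": [1.0, 2.0, 3.0],
    "city": ["SUMNER", "DAVENPORT", "CEDAR FALLS"],
})

# Keep numeric columns only, then rank each feature by its correlation with sales.
corr_with_sales = (stores.select_dtypes(include="number")
                         .corr()["sales"]
                         .drop("sales")
                         .sort_values(ascending=False))
print(corr_with_sales)
```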

In [None]:
idf.columns

In [None]:
def my_pivot(df, index, values, aggfunc, plot=False):
    # Build the pivot table and optionally plot a histogram of its values
    # (the flag is named plot so it does not shadow the matplotlib alias plt)
    piv = pd.pivot_table(df, index=index, values=values, aggfunc=aggfunc)
    if plot:
        piv.plot(title='Average Current Liquor Sales by County', kind='hist', figsize=(16,8), bins=40)

my_pivot(idf, index=["county"], values=['sales'], aggfunc=np.mean, plot=True)

In [None]:
sns.jointplot(x=idf_new['sales_pd_15'], y=idf_new['sales'])

In [None]:
sns.jointplot(x=idf_new['vol_pd_15'], y=idf_new['sales'])

In [None]:
sns.jointplot(x=idf_new['vol_per_sale'], y=idf_new['sales'])

In [None]:
sns.jointplot(x=idf_new['bott_per_sale'], y=idf_new['sales'])

### Create new dataframe for Q1-2015 only

In [None]:
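# Keep only Q1-2015 transactions from stores in open_stores.
# Both filters are applied together (the original second assignment overwrote the first);
# .copy() lets new columns be added later without a SettingWithCopyWarning.
idf15 = idf[idf['store'].isin(open_stores) & (idf['quarter'] == 'Q1-2015')].copy()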

In [None]:
idf15.head(5)

In [None]:
# 1: Create idf_sums
idf15_sums = idf15.groupby(['store'])[['sales', 'profit', 'bottles_sold', 'volume_sold']].agg(np.sum)
idf15_sums.reset_index(inplace=True)

# 2: Create idf_means
idf15_means = idf15.groupby(['store'])[['sales', 'profit', 'bottles_sold', 'volume_sold']].agg(np.mean)
idf15_means.reset_index(inplace=True)
idf15_means.columns = ['store', 'rev_per_sale', 'prof_per_sale', 'bott_per_sale', \
                    'vol_per_sale']

# 3: Merge sums and means into idf_sales
idf15_sales = idf15_sums.merge(idf15_means, how='inner', on='store')
idf15_sales.head(3)

In [None]:
# 4: Create new columns in idf for daily averages
days_open_Q1_15 = idf_open[idf_open['quarter'] == 'Q1-2015'].groupby(['store'])[['date']].agg(len)
days_open_Q1_15.reset_index(inplace=True)
days_open_Q1_15.columns = ['store', 'days_open_Q1_15']
days_open_Q1_15['days_open_Q1_15'] = days_open_Q1_15['days_open_Q1_15'].astype(int)

# Look up each transaction's store-level days-open count so the division
# aligns on store rather than on row index
days_open = idf15['store'].map(days_open_Q1_15.set_index('store')['days_open_Q1_15'])
idf15['sales_pd_15'] = idf15['sales'] / days_open
idf15['profit_pd_15'] = idf15['profit'] / days_open
idf15['bottl_pd_15'] = idf15['bottles_sold'] / days_open
idf15['vol_pd_15'] = idf15['volume_sold'] / days_open

# 5: Separate daily average columns into own df (from the Q1-2015 frame, not the full idf)
idf15_pd = idf15.groupby(['store'])[['sales_pd_15', 'profit_pd_15', 'vol_pd_15']].agg(np.sum)
idf15_pd.reset_index(inplace=True)
idf15_pd.head(3)

In [None]:
# 6: Merge sales and daily average dfs
idf15_metrics = idf15_sales.merge(idf15_pd, how='inner', on='store')
idf15_metrics.head(3)



In [None]:
# 7: Create location only df
idf_location = pd.pivot_table(idf, values=['sales'], index=['store', 'city', 'county', 'zip_code'], aggfunc=np.sum)
idf_location.reset_index(inplace=True)
idf_location.drop('sales', axis=1, inplace=True)
idf_location.head(3)

# 8: Create new df by merging metrics and location
idf15_new= idf15_metrics.merge(idf_location, how='inner', on='store')
idf15_new.head(3)

In [None]:
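# Match full-period sales to each store by store number; subtracting the two
# frames directly aligns on row index, which need not refer to the same store
full_sales = idf_new.drop_duplicates('store').set_index('store')['sales']
idf15_new['sales_2Q-4Q-15'] = idf15_new['store'].map(full_sales) - idf15_new['sales']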
idf15_new.head(3)

In [None]:
iowa_2015 = iowa[iowa['year'] == 2015]
iowa_2016 = iowa[iowa['year'] == 2016]

In [None]:
idf15_new['sales_2Q-4Q-15'].shape

In [None]:
# train all models only on data from stores with full data from 2015 (no stores which opened or closed)

# base model - predict 2015 Q2-4 with just 2015Q1
y = idf15_new['sales_2Q-4Q-15']
X = idf15_new[['sales']]  # double brackets keep X as a one-column DataFrame

lm = linear_model.LinearRegression()
base_model = lm.fit(X,y)

print "Coef:", base_model.coef_
print "Intercept:", base_model.intercept_
print "Score:", base_model.score(X, y) 
print "MSE:", mean_squared_error(y, base_model.predict(X))

In [None]:
print "Mean cross-val R2:",cross_val_score(lm, X, y, cv=5).mean()

In [None]:
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.33)
lm = linear_model.LinearRegression()
train_base_model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

print "Train Test R2:", train_base_model.score(X_test, y_test) 

### Full Year 2015

In [None]:
# train all models only on data from stores with full data from 2015 (no stores which opened or closed)

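# full-year model - predict total 2015 sales per store from 2015 Q1 sales
# (full-year total rebuilt as Q1 sales plus the Q2-Q4 remainder computed above)
y = idf15_new['sales'] + idf15_new['sales_2Q-4Q-15']
X = idf15_new[['sales']]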

lm = linear_model.LinearRegression()
base_model = lm.fit(X,y)

print "Coef:", base_model.coef_
print "Intercept:", base_model.intercept_
print "Score:", base_model.score(X, y) 
print "MSE:", mean_squared_error(y, base_model.predict(X))

In [None]:
ax = idf_sales.plot(kind="scatter", x='bottles_sold', y='profit', figsize=(10,6))
ax.set_title('profit vs. bottles sold', y=1.01, fontsize=30)
ax.set_xlabel('bottles sold', fontsize=16)
ax.set_ylabel('profit', fontsize=16)
ax.tick_params(axis='both', which='both', labelsize=16)

### Dataframes by store

In [None]:
#Df showing full year sales by store 2015
print 'Nunique stores: ' + str(idf['store'].nunique())
idf_store_2015 = idf_2015.groupby(['store']) \
[['sales', 'profit', 'bottles_sold', 'volume_sold']].agg(sum)
idf_store_2015.reset_index(inplace=True)
idf_store_2015.head(3)

In [None]:
#Df showing full year sales by store 2016
print 'Nunique stores: ' + str(idf['store'].nunique())
idf_store_2016 = idf_2016.groupby(['store']) \
[['sales', 'profit', 'bottles_sold', 'volume_sold']].agg(sum)
idf_store_2016.reset_index(inplace=True)
idf_store_2016.head(3)

In [None]:
idf_store_2016.plot(x='store', y=['profit', 'sales'], kind= 'bar')

In [None]:
#Df showing full year sales by county 2015
print 'Nunique counties: ' + str(idf['county'].nunique())
idf_county_2015 = idf_2015.groupby(['county']) \
[['sales', 'profit', 'bottles_sold', 'volume_sold']].agg(sum)
idf_county_2015.reset_index(inplace=True)
idf_county_2015.tail(3)

In [None]:
#Df showing full year sales by city 2015
print 'Nunique cities: ' + str(idf['city'].nunique())
idf_city_2015 = idf_2015.groupby(['city']) \
[['sales', 'profit', 'bottles_sold', 'volume_sold']].agg(sum)
idf_city_2015.reset_index(inplace=True)
idf_city_2015.tail(3)

In [None]:
#Df showing full year sales by zip 2015
print 'Nunique zip codes: ' + str(idf['zip_code'].nunique())
idf_zip_2015 = idf_2015.groupby(['zip_code']) \
[['sales', 'profit', 'bottles_sold', 'volume_sold']].agg(sum)
idf_zip_2015.reset_index(inplace=True)
idf_zip_2015.tail(3)

In [None]:
#Df showing full year sales by category
print 'Nunique categories: ' + str(idf['category_name'].nunique())
idf_category_2015 = idf_2015.groupby(['category_name']) \
[['sales', 'profit', 'bottles_sold', 'volume_sold']].agg(sum)
idf_category_2015.reset_index(inplace=True)
idf_category_2015.tail(3)

In [None]:
quarter_sales = idf[idf["year"]==2015].groupby(["quarter"])["sales"].agg([np.sum])
quarter_sales.columns = ["Total Sales"]
quarter_sales = quarter_sales / 1000000.0  # convert dollars to millions to match the axis label

ax = quarter_sales.plot(kind="bar", figsize=(10,6))
ax.set_title("2015 Sales by Quarter", y=1.01, fontsize=30)
ax.set_xlabel('Quarter', fontsize=16)
ax.set_ylabel('Sales (in $mm)', fontsize=16)
ax.tick_params(axis='both', which='both', labelsize=16)


In [None]:
def my_pivot(df, index, values, aggfunc, plot=False, title=None):
    # Pivot df and optionally plot a histogram of the aggregated values.
    # The flag is named `plot` so it does not shadow the conventional `plt` alias.
    piv = pd.pivot_table(df, index=index, values=values, aggfunc=aggfunc)
    if plot:
        piv.plot(title=title, kind='hist', figsize=(16,8), bins=40)
    return piv

store_piv = my_pivot(idf, index=["store"], values=['sales'], aggfunc=np.mean, plot=True,
                     title='Average Current Liquor Sales by Store')

In [None]:
sns.heatmap(iowa.corr())

In [None]:
iowa_target = pd.DataFrame(iowa['sales'])

In [None]:
ax = sns.regplot(x=y_pred, y=y)
ax.figure.set_figheight(6)
ax.figure.set_figwidth(14)

ax.set_ylabel('Actual Values')
ax.set_xlabel('Predicted Values')
ax.set_title('Predicted vs. Actual Values');

In [None]:
lr_r2 =  r2_score(y_true=y, y_pred=y_pred)
lr_r2

In [None]:
len(lr_model.coef_)

In [None]:
lr_model.coef_

In [None]:
iowa['bottle_volume'].unique()

In [None]:
# Liquor type groupings for a category mapping (mapping not yet defined):
#vodka
#schnapps
#whiskey
#rum
#scotch
#gin
#liqueurs
#brandies
#tequila
#beer
#other


# Explore the data

Perform some exploratory statistical analysis and make some plots, such as histograms of transaction totals, bottles sold, etc.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
len(iowa['County'].value_counts())

In [None]:
iowa_county = pd.pivot_table(iowa, index='County', values=['Sale (Dollars)'])
iowa_county.head()
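
The raw export shown at the top of this notebook stores dollar amounts as strings such as `$81.00`, so a pivot on `Sale (Dollars)` only aggregates meaningfully once that column is numeric. The following is a minimal sketch of that conversion, assuming the raw column names; the toy rows and the names `raw` and `county_mean` are made up for illustration.

```python
import pandas as pd

# Toy rows mimicking the raw export; the values are illustrative only
raw = pd.DataFrame({
    'County': ['Bremer', 'Scott', 'Scott', 'Black Hawk'],
    'Sale (Dollars)': ['$81.00', '$41.26', '$20.00', '$453.36'],
})

# Strip the leading dollar sign and convert to float so the column can be aggregated
raw['Sale (Dollars)'] = raw['Sale (Dollars)'].str.lstrip('$').astype(float)

# Mean transaction value per county (pivot_table aggregates with the mean by default)
county_mean = pd.pivot_table(raw, index='County', values=['Sale (Dollars)'])
```

The same conversion would apply to the other dollar-formatted columns, such as `State Bottle Cost` and `State Bottle Retail`.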

## Record your findings

Be sure to write out any observations from your exploratory analysis.

# Mine the data
Now you are ready to compute the variables you will use for your regression from the data. For example, you may want to
compute total sales per store from Jan to March of 2015, mean price per bottle, etc. Refer to the readme for more ideas appropriate to your scenario.

Pandas is your friend for this task. Take a look at the operations [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for ideas on how to make the best use of pandas and feel free to search for blog and Stack Overflow posts to help you group data by certain variables and compute sums, means, etc. You may find it useful to create a new data frame to house this summary data.
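
As one possible illustration of the summaries described above, the sketch below computes total sales per store for January through March 2015 and a mean price per bottle. The frame `sales` and its values are hypothetical; the column names (`store`, `date`, `sales`, `bottles_sold`) are assumed to follow the renamed columns used earlier in this notebook.

```python
import pandas as pd

# Hypothetical transactions; the values are illustrative only
sales = pd.DataFrame({
    'store': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2015-01-10', '2015-02-20', '2015-05-01',
                            '2015-03-15', '2015-07-04']),
    'sales': [100.0, 50.0, 75.0, 200.0, 25.0],
    'bottles_sold': [10, 5, 5, 20, 5],
})

# Keep only January through March of 2015
q1_mask = (sales['date'] >= '2015-01-01') & (sales['date'] < '2015-04-01')
q1 = sales[q1_mask]

# Total Q1 sales and bottles per store
q1_summary = q1.groupby('store')[['sales', 'bottles_sold']].sum().reset_index()

# Mean price per bottle = total dollars / total bottles
q1_summary['price_per_bottle'] = q1_summary['sales'] / q1_summary['bottles_sold']
```

Keeping the result in its own frame (`q1_summary` here) makes it easy to merge with other per-store summaries on `store` later.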

# Refine the data
Look for any statistical relationships, correlations, or other relevant properties of the dataset.
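
One way to look for such relationships is to rank candidate predictors by their correlation with the target. The sketch below does this on a hypothetical per-store summary; the frame `stores`, its column names, and its numbers are assumptions made for illustration, not results from the Iowa data.

```python
import pandas as pd

# Hypothetical per-store summary; the numbers are illustrative only
stores = pd.DataFrame({
    'q1_sales': [10.0, 20.0, 30.0, 40.0],
    'full_year_sales': [45.0, 80.0, 125.0, 160.0],
    'bottles_sold': [9, 4, 7, 2],
})

# Pairwise Pearson correlations between the numeric columns
corr = stores.corr()

# Rank candidate predictors by absolute correlation with the target
target_corr = (corr['full_year_sales']
               .drop('full_year_sales')
               .abs()
               .sort_values(ascending=False))
```

A high correlation only indicates a linear association; it is still worth checking scatter plots for outliers before relying on a feature.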

# Build your models

Using scikit-learn or statsmodels, build the necessary models for your scenario. Evaluate model fit.
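
A minimal sketch of fitting and evaluating a linear model follows. It uses synthetic data with a made-up relationship (full-year sales roughly four times Q1 sales), so the scores say nothing about the Iowa data. It also imports from `sklearn.model_selection`, which replaced the `sklearn.cross_validation` module imported at the top of this notebook in newer scikit-learn releases.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: target is roughly 4x the single feature plus noise (made-up relationship)
rng = np.random.RandomState(0)
X = rng.uniform(1000, 50000, size=(200, 1))
y = 4.0 * X[:, 0] + rng.normal(0, 2000, size=200)

# Hold out a test set so the score reflects rows the model has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

model = LinearRegression().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# 5-fold cross-validated R^2 gives an estimate that depends less on one particular split
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5).mean()
```

Fixing `random_state` makes the split reproducible, which helps when comparing several models on the same held-out rows.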

In [None]:
from sklearn import linear_model


## Plot your results

Again, make sure that you record any valuable information. For example, in the tax scenario, did you find the sales from the first three months of the year to be a good predictor of the total sales for the year? Plot the predictions versus the true values and discuss the successes and limitations of your models.
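
A predicted-versus-actual plot can be sketched as below, assuming matplotlib is available. The arrays `y_true` and `y_hat` hold made-up values for illustration; in practice they would come from a fitted model and its held-out targets.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(y_hat, y_true)

# Reference line y = x: points on it are perfect predictions
lims = [min(y_true.min(), y_hat.min()), max(y_true.max(), y_hat.max())]
ax.plot(lims, lims, linestyle='--')

ax.set_xlabel('Predicted Values')
ax.set_ylabel('Actual Values')
ax.set_title('Predicted vs. Actual Values')

# Root mean squared error summarizes how far predictions fall from the line
residuals = y_true - y_hat
rmse = np.sqrt(np.mean(residuals ** 2))
```

Points that sit consistently above or below the dashed line suggest a systematic bias, which is worth discussing alongside the overall fit.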

# Present the Results

Present your conclusions and results. If you have more than one interesting model feel free to include more than one along with a discussion. Use your work in this notebook to prepare your write-up.