# Getting started

Once you've chosen your scenario, download the data from [the Iowa website](https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy) in CSV format. Start by loading the data with pandas. The `Date` column is read in as plain text, so you may need to parse it into a datetime type.
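
As a rough sketch of parsing dates at load time, `read_csv` accepts a `parse_dates` argument. The inline `sample_csv` string below is only a stand-in for the downloaded file; it reuses two rows from the sample output and keeps just a few columns.

```python
import io
import pandas as pd

# Two rows in the same layout as the Iowa liquor sales export (stand-in for the real file)
sample_csv = io.StringIO(
    "Date,Store Number,City,Bottles Sold\n"
    "11/04/2015,3717,SUMNER,12\n"
    "03/02/2016,2614,DAVENPORT,2\n"
)

# parse_dates converts the named column to a datetime type while reading;
# pandas reads ambiguous dates such as 11/04/2015 as month/day/year by default
sample = pd.read_csv(sample_csv, parse_dates=["Date"])

print(sample.dtypes)
```

If the inferred format ever looks wrong, converting afterwards with `pd.to_datetime` and an explicit `format` string is the more defensive option.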

In [273]:
# Import relevant libraries
import pandas as pd
import numpy as np

In [274]:
## Load the data into a DataFrame
df = pd.read_csv('/Users/stel/joce/data_science/project-3-datasets/Iowa_Liquor_sales_sample_10pct.csv')

In [275]:
df.head(3)

Unnamed: 0,Date,Store Number,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
0,11/04/2015,3717,SUMNER,50674,9.0,Bremer,1051100.0,APRICOT BRANDIES,55,54436,Mr. Boston Apricot Brandy,750,$4.50,$6.75,12,$81.00,9.0,2.38
1,03/02/2016,2614,DAVENPORT,52807,82.0,Scott,1011100.0,BLENDED WHISKIES,395,27605,Tin Cup,750,$13.75,$20.63,2,$41.26,1.5,0.4
2,02/11/2016,2106,CEDAR FALLS,50613,7.0,Black Hawk,1011200.0,STRAIGHT BOURBON WHISKIES,65,19067,Jim Beam,1000,$12.59,$18.89,24,$453.36,24.0,6.34


In [298]:
df.loc[df['Item Number'] == 19067, ['Date', 'State Bottle Cost', 'State Bottle Retail']].drop_duplicates()

Unnamed: 0,Date,State Bottle Cost,State Bottle Retail
2,2016-02-11,$12.59,$18.89
74,2015-12-16,$12.59,$18.89
423,2015-11-04,$12.59,$18.89
487,2015-11-03,$12.59,$18.89
1346,2015-01-29,$12.08,$18.12
1817,2016-03-09,$12.59,$18.89
2103,2015-10-19,$12.59,$18.89
2365,2015-09-22,$12.59,$18.89
3936,2015-10-15,$12.59,$18.89
4824,2015-04-01,$12.59,$18.89


In [277]:
## Transform the dates column
df["Date"] = pd.to_datetime(df["Date"], format='%m/%d/%Y')

In [278]:
df.head(3)

Unnamed: 0,Date,Store Number,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
0,2015-11-04,3717,SUMNER,50674,9.0,Bremer,1051100.0,APRICOT BRANDIES,55,54436,Mr. Boston Apricot Brandy,750,$4.50,$6.75,12,$81.00,9.0,2.38
1,2016-03-02,2614,DAVENPORT,52807,82.0,Scott,1011100.0,BLENDED WHISKIES,395,27605,Tin Cup,750,$13.75,$20.63,2,$41.26,1.5,0.4
2,2016-02-11,2106,CEDAR FALLS,50613,7.0,Black Hawk,1011200.0,STRAIGHT BOURBON WHISKIES,65,19067,Jim Beam,1000,$12.59,$18.89,24,$453.36,24.0,6.34


Columns we are interested in:
- Date
- Store number
- City
- Zip code
- County
- Category
- Category name
- Vendor number
- Item number
- Item description
- State bottle cost
- State bottle retail
- Bottles sold
- Sale (dollars)
- Volume sold (liters)

In [279]:
df.dtypes

Date                     datetime64[ns]
Store Number                      int64
City                             object
Zip Code                         object
County Number                   float64
County                           object
Category                        float64
Category Name                    object
Vendor Number                     int64
Item Number                       int64
Item Description                 object
Bottle Volume (ml)                int64
State Bottle Cost                object
State Bottle Retail              object
Bottles Sold                      int64
Sale (Dollars)                   object
Volume Sold (Liters)            float64
Volume Sold (Gallons)           float64
dtype: object
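
The three dollar columns show up as `object` because of the leading `$`. A minimal sketch of converting them to floats is below; it works on a small stand-in frame built from the sample rows, and the handling of comma thousands separators is an assumption, since none appear in the rows shown.

```python
import pandas as pd

# Small stand-in frame using values from the sample rows above
prices = pd.DataFrame({
    "State Bottle Cost": ["$4.50", "$13.75", "$12.59"],
    "State Bottle Retail": ["$6.75", "$20.63", "$18.89"],
    "Sale (Dollars)": ["$81.00", "$41.26", "$453.36"],
})

# Strip the leading "$" (and any commas, assumed to be thousands separators),
# then cast the cleaned strings to float
for col in prices.columns:
    prices[col] = (
        prices[col]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .astype(float)
    )

print(prices.dtypes)
```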

In [280]:
# Check for blanks in city

df.City.unique()

array(['SUMNER', 'DAVENPORT', 'CEDAR FALLS', 'AMES', 'BELMOND',
       'CEDAR RAPIDS', 'OTTUMWA', 'CLEAR LAKE', 'BONDURANT', 'SHELLSBURG',
       'WEST DES MOINES', 'OSKALOOSA', 'WEST POINT', 'CORALVILLE',
       'WATERLOO', 'NEWTON', 'TOLEDO', 'SIOUX CENTER', 'MASON CITY',
       'MILFORD', 'IOWA CITY', 'COUNCIL BLUFFS', 'GRIMES', 'KEOKUK',
       'INDIANOLA', 'FORT DODGE', 'DES MOINES', 'HUXLEY', 'LA PORTE CITY',
       'MARION', 'ANKENY', 'DYSART', 'EAGLE GROVE', 'OGDEN', 'MAQUOKETA',
       'CARROLL', 'WEBSTER CITY', 'SIOUX CITY', 'WASHINGTON',
       'CHARLES CITY', 'MANCHESTER', 'SHELDON', 'ALTOONA', 'STORM LAKE',
       'GRUNDY CENTER', 'SPIRIT LAKE', 'DUBUQUE', 'STUART', 'SCHLESWIG',
       'MOUNT PLEASANT', 'GUTHRIE CENTER', 'FORT ATKINSON', 'SPENCER',
       'MUSCATINE', 'LENOX', 'MISSOURI VALLEY', 'IOWA FALLS', 'ALGONA',
       'WAUKEE', 'LECLAIRE', 'LAMONI', 'CLINTON', 'PARKERSBURG',
       'BURLINGTON', 'DUNLAP', 'ORANGE CITY', 'MANNING', 'LOHRVILLE',
       'JOHNSTON', 'O

In [281]:
# Zip code should be integers

# Find the entries which are causing the type to be forced to object
# We may have to do this a few times, so we'll define a function that takes a column name to do it
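def find_non_int(col):
    """Return the distinct values in df[col] that cannot be cast to int."""
    non_int = []
    for i in df[col]:
        try:
            int(i)
        except (ValueError, TypeError):
            # Record each offending value only once
            if i not in non_int:
                non_int.append(i)
    non_int = pd.Series(non_int)
    return non_int.value_counts(dropna=False)
print(find_non_int('Zip Code'))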

712-2    1
dtype: int64
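
A vectorized alternative to the loop is sketched below using `pd.to_numeric` with `errors="coerce"`. It is not identical to `find_non_int`: it counts every occurrence rather than each distinct value once, and it would accept strings such as `"1.5"` that `int()` rejects. The `zips` series is a stand-in for `df['Zip Code']`.

```python
import pandas as pd

# Stand-in for df['Zip Code']: strings, one value of which is not a valid number
zips = pd.Series(["50674", "52807", "712-2", "712-2", "50613"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
parsed = pd.to_numeric(zips, errors="coerce")

# Keep the original values wherever parsing failed and count them
bad_values = zips[parsed.isna()].value_counts()
print(bad_values)
```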


In [282]:
# Find the cities that have '712-2' as the zip code

df['City'][df['Zip Code'] == '712-2'].value_counts()

DUNLAP    217
Name: City, dtype: int64

In [283]:
# Find the corresponding zip codes and counties for entries with 'DUNLAP' as city
print(df['Zip Code'][df['City'] == 'DUNLAP'].value_counts())
print(df['County'][df['City'] == 'DUNLAP'].value_counts())
print(df['County Number'][df['City'] == 'DUNLAP'].value_counts())

712-2    217
Name: Zip Code, dtype: int64
Harrison    186
Name: County, dtype: int64
43.0    186
Name: County Number, dtype: int64


Based on [Wikipedia](https://en.wikipedia.org/wiki/Dunlap,_Iowa),

> Dunlap is a city in Harrison County, Iowa, United States, along the Boyer River.

> County: Harrison

> Zip code: 51529

In [284]:
# For all entries with 'DUNLAP' as city
# Set zip code as 51529
# (.loc replaces .ix, which has been removed from recent pandas versions)
df.loc[df['City']=='DUNLAP', 'Zip Code'] = '51529'

# Set county as Harrison
df.loc[df['City']=='DUNLAP', 'County'] = 'Harrison'

# Set county number as 43.0
df.loc[df['City']=='DUNLAP', 'County Number'] = 43.0

In [285]:
# Now change zip code to integers
df['Zip Code'] = df['Zip Code'].astype(int)

In [286]:
df.dtypes

Date                     datetime64[ns]
Store Number                      int64
City                             object
Zip Code                          int64
County Number                   float64
County                           object
Category                        float64
Category Name                    object
Vendor Number                     int64
Item Number                       int64
Item Description                 object
Bottle Volume (ml)                int64
State Bottle Cost                object
State Bottle Retail              object
Bottles Sold                      int64
Sale (Dollars)                   object
Volume Sold (Liters)            float64
Volume Sold (Gallons)           float64
dtype: object

In [287]:
# Counties are null for some entries
# Let's look at zip codes
null_county_zips = df['Zip Code'][df['County'].isnull()].unique()
null_county_zips

array([52402, 51103, 50707, 52205, 50677, 50441, 50421, 52241, 52804,
       52732, 50211, 51632, 52317, 50225, 50009, 50401, 50022, 50049,
       52136, 50483, 52802, 50601, 50469, 51537, 50703, 50317, 52358,
       52003, 52404, 51653, 50025, 52590, 51241, 50213, 52591, 50237, 50501])

In [288]:
zip_county = {}
for i in df['Zip Code'].drop_duplicates():
    # Take the first county recorded for this zip code (may be NaN)
    zip_county[i] = df['County'][df['Zip Code']==i].unique().tolist()[0]

In [289]:
# List the zip codes whose first recorded county is missing
for i in zip_county:
    if pd.isnull(zip_county[i]):
        print(i, zip_county[i])

51653 nan
50237 nan
52590 nan
50677 nan
51103 nan


In [293]:
df['County'][df['Zip Code']==51632].unique().tolist()[0]

'Clayton'

In [259]:
# We'll edit the counties manually for those which do not have a value elsewhere in the dataset
# or where the zip straddles 2 counties
zip_county[50237] = 'Polk'
zip_county[51653] = 'Fremont'
zip_county[52590] = 'Wayne'
zip_county[50677] = 'Bremer'
zip_county[51103] = 'Woodbury'
zip_county[51632] = 'Page'

In [256]:
zip_county

{50002: 'Adair',
 50003: 'Dallas',
 50006: 'Hardin',
 50009: 'Polk',
 50010: 'Story',
 50014: 'Story',
 50020: 'Cass',
 50021: 'Polk',
 50022: 'Cass',
 50023: 'Polk',
 50025: 'Audubon',
 50028: 'Jasper',
 50033: 'Madison',
 50035: 'Polk',
 50036: 'Boone',
 50044: 'Marion',
 50046: 'Polk',
 50047: 'Warren',
 50048: 'Guthrie',
 50049: 'Lucas',
 50054: 'Jasper',
 50056: 'Story',
 50058: 'Carroll',
 50060: 'Wayne',
 50061: 'Warren',
 50069: 'Dallas',
 50071: 'Wright',
 50072: 'Madison',
 50075: 'Hamilton',
 50076: 'Audubon',
 50107: 'Greene',
 50109: 'Dallas',
 50111: 'Polk',
 50112: 'Poweshiek',
 50115: 'Guthrie',
 50122: 'Hardin',
 50123: 'Wayne',
 50124: 'Story',
 50125: 'Warren',
 50126: 'Hardin',
 50129: 'Linn',
 50130: 'Hamilton',
 50131: 'Polk',
 50135: 'Jasper',
 50136: 'Jasper',
 50138: 'Marion',
 50140: 'Decatur',
 50142: 'Marshall',
 50144: 'Decatur',
 50150: 'Monroe',
 50156: 'Boone',
 50158: 'Marshall',
 50160: 'Warren',
 50161: 'Story',
 50162: 'Marshall',
 50163: 'Marion',
 

In [252]:
for k, v in zip_county.items():
    zip_county[k] = [v, ]

In [239]:
df['City'][df['Zip Code']==51632]

187        CORNING
544        CORNING
888       CLARINDA
890        CORNING
948       CLARINDA
1381      CLARINDA
1433       CORNING
1602      CLARINDA
1670      CLARINDA
2278      CLARINDA
2474      CLARINDA
2549      CLARINDA
2678      CLARINDA
3009       CORNING
3479      CLARINDA
3598      CLARINDA
4176      CLARINDA
4228       CORNING
4829      CLARINDA
5099      CLARINDA
5250       CORNING
5949       CORNING
6727      CLARINDA
6944      CLARINDA
7531       CORNING
7752       CORNING
7916      CLARINDA
8191      CLARINDA
8337      CLARINDA
8575      CLARINDA
            ...   
260445    CLARINDA
260527    CLARINDA
260730    CLARINDA
261631    CLARINDA
261934    CLARINDA
262908     CORNING
263029     CORNING
263063     CORNING
263093     CORNING
263305    CLARINDA
263730    CLARINDA
263846    CLARINDA
263948    CLARINDA
264391    CLARINDA
265238     CORNING
265273    CLARINDA
265713    CLARINDA
265751    CLARINDA
265765    CLARINDA
267070    CLARINDA
267814    CLARINDA
267844     C

In [229]:
df['County'] = df['Zip Code'].map(zip_county)

In [163]:
df[df['County'].isnull()]

Unnamed: 0,Date,Store Number,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
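
The same idea can be sketched without a Python loop by building the zip-to-county lookup with `groupby` and filling only the missing values. This is an alternative under stated assumptions, not the approach used above: it leaves existing county values untouched rather than overwriting them, and the `sales` frame and `manual` series are small stand-ins.

```python
import numpy as np
import pandas as pd

# Stand-in rows: one zip code has a county on some rows but not others,
# and one zip code (50237) has no county anywhere
sales = pd.DataFrame({
    "Zip Code": [50674, 50674, 51529, 50237],
    "County": ["Bremer", np.nan, "Harrison", np.nan],
})

# First non-null county seen for each zip code
lookup = sales.dropna(subset=["County"]).groupby("Zip Code")["County"].first()

# Manual entries for zip codes with no county anywhere in the data
manual = pd.Series({50237: "Polk"})

# Fill only the missing values: first from the lookup, then from the manual entries
sales["County"] = (
    sales["County"]
    .fillna(sales["Zip Code"].map(lookup))
    .fillna(sales["Zip Code"].map(manual))
)
print(sales)
```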



# Explore the data

Perform some exploratory statistical analysis and make some plots, such as histograms of transaction totals, bottles sold, etc.
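
A possible starting point is sketched below: summary statistics followed by one histogram per numeric column. The `sample` frame holds only the three sample rows shown earlier and stands in for the cleaned DataFrame, so the plots themselves are not meaningful here.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in values taken from the sample rows; replace with the cleaned DataFrame
sample = pd.DataFrame({
    "Bottles Sold": [12, 2, 24],
    "Sale (Dollars)": [81.00, 41.26, 453.36],
})

# Summary statistics give a quick feel for scale and skew before plotting
summary = sample.describe()
print(summary)

# One histogram per numeric column
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, sample.columns):
    ax.hist(sample[col], bins=10)
    ax.set_title(col)
    ax.set_xlabel(col)
    ax.set_ylabel("Number of transactions")
fig.tight_layout()
```

On the full dataset, sales totals are likely to be heavily right-skewed, in which case a log-scaled axis may be easier to read.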

In [4]:
import seaborn as sns
import matplotlib.pyplot as plt

## Record your findings

Be sure to write out any observations from your exploratory analysis.

# Mine the data
Now you are ready to compute the variables you will use for your regression from the data. For example, you may want to compute total sales per store from January to March 2015, mean price per bottle, etc. Refer to the readme for more ideas appropriate to your scenario.

Pandas is your friend for this task. Take a look at the operations [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for ideas on how to make the best use of pandas, and feel free to search for blog and Stack Overflow posts to help you group data by certain variables and compute sums, means, etc. You may find it useful to create a new DataFrame to house this summary data.
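
As an illustration, total sales per store for January to March 2015 and a mean price per bottle could be computed roughly as follows. The column names follow the dataset, but the `sales` frame is a small hypothetical stand-in and assumes the dollar columns have already been converted to floats.

```python
import pandas as pd

# Hypothetical transactions for illustration only
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-15", "2015-02-20", "2015-03-05", "2015-06-01"]),
    "Store Number": [3717, 3717, 2614, 2614],
    "Bottles Sold": [12, 6, 2, 10],
    "Sale (Dollars)": [81.0, 40.5, 41.26, 206.3],
})

# Keep only January through March 2015
q1 = sales[(sales["Date"] >= "2015-01-01") & (sales["Date"] < "2015-04-01")]

# One row per store: total sales and total bottles in the period
per_store = q1.groupby("Store Number").agg(
    {"Sale (Dollars)": "sum", "Bottles Sold": "sum"}
)

# Mean price per bottle = total dollars / total bottles
per_store["Mean Price per Bottle"] = (
    per_store["Sale (Dollars)"] / per_store["Bottles Sold"]
)
print(per_store)
```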

# Refine the data
Look for any statistical relationships, correlations, or other relevant properties of the dataset.
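
One quick way to look for linear relationships is a correlation matrix. The sketch below uses a hypothetical per-store summary table; the column names and numbers are illustrative, not results from the dataset.

```python
import pandas as pd

# Hypothetical per-store summary table; the numbers are illustrative only
per_store = pd.DataFrame({
    "Q1 Sales": [100.0, 250.0, 400.0, 550.0],
    "Annual Sales": [420.0, 980.0, 1650.0, 2150.0],
    "Mean Price per Bottle": [12.0, 9.5, 14.0, 11.0],
})

# Pairwise Pearson correlations between the numeric columns
corr = per_store.corr()
print(corr.round(2))
```

A high correlation between two candidate predictors is worth noting, since including both in a linear model can make the coefficients hard to interpret.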

# Build your models

Using scikit-learn or statsmodels, build the necessary models for your scenario. Evaluate model fit.

In [6]:
from sklearn import linear_model
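
A minimal sketch of fitting and scoring a linear regression with scikit-learn is shown below. The feature and target arrays are hypothetical per-store figures chosen for illustration; in practice they would come from your summary DataFrame.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical per-store figures: first-quarter sales (feature) and full-year sales (target)
q1_sales = np.array([100.0, 250.0, 400.0, 550.0, 700.0]).reshape(-1, 1)
annual_sales = np.array([420.0, 980.0, 1650.0, 2150.0, 2790.0])

model = LinearRegression()
model.fit(q1_sales, annual_sales)

predictions = model.predict(q1_sales)

# R^2 on the training data; a held-out set or cross-validation gives a more honest estimate
r2 = r2_score(annual_sales, predictions)
print("slope:", model.coef_[0], "intercept:", model.intercept_, "R^2:", r2)
```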


## Plot your results

Again, make sure that you record any valuable information. For example, in the tax scenario, did you find the sales from the first three months of the year to be a good predictor of the total sales for the year? Plot the predictions versus the true values and discuss the successes and limitations of your models.
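
A predicted-versus-true plot might look roughly like the sketch below. The `y_true` and `y_pred` arrays are hypothetical placeholders for your own targets and model output.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical true and predicted annual sales for a handful of stores
y_true = np.array([420.0, 980.0, 1650.0, 2150.0, 2790.0])
y_pred = np.array([415.0, 1010.0, 1600.0, 2190.0, 2780.0])

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_true, y_pred)

# Points on the dashed diagonal are perfect predictions
limits = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax.plot(limits, limits, linestyle="--", color="gray")

ax.set_xlabel("True annual sales")
ax.set_ylabel("Predicted annual sales")
ax.set_title("Predicted vs. true values")
```

Points that sit far from the diagonal are the stores the model handles worst and are a natural place to start when discussing limitations.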

# Present the Results

Present your conclusions and results. If you have more than one interesting model, feel free to include more than one, along with a discussion. Use your work in this notebook to prepare your write-up.