## Introduction:

Our dataset is about 3.6 GB on disk, and `wc -l datafile.csv` reports 37,768,482 lines. Each record spans about 3 lines of the file (the Store Location field contains embedded newlines), so we have roughly 12,589,494 rows to work with. Kaggle Kernels provide just enough memory to read the entire data frame in.
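
The gap between the line count and the row count comes from quoted fields that contain newlines. A minimal sketch, using a made-up two-column sample rather than the real file, of how physical lines and CSV records differ:

```python
import csv
import io

# Hypothetical sample: one header line plus one record whose quoted
# location field spans three physical lines.
sample = (
    'Store Name,Store Location\n'
    'J D Spirits Liquor,"1023 9TH ST\nONAWA 51040\n(42.025841, -96.095845)"\n'
)

physical_lines = sample.count("\n")  # what `wc -l` counts
records = list(csv.reader(io.StringIO(sample, newline="")))
data_rows = len(records) - 1         # exclude the header

print(physical_lines, data_rows)
```

The `csv` module keeps the quoted field together, so the record count is what matters for the data frame, not the raw line count.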

## Goal: 

Tidy the dataset, then build a simple interface to subset and output the data for visualizations and analysis.

## Note: run this notebook from the root of the project folder.

In [39]:
#Imports:
import os
import re

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import altair as alt
import matplotlib.pyplot as plt

# List the input files. Paths are relative to the project root.
print(os.listdir("./data/originalcsv"))

# Cleaned output files are written here.
resultDir = "./data/stage1/"

['Iowa_Liquor_Sales_reduced.csv', 'splitfile.csv', 'Iowa_Liquor_Sales.csv']


In [40]:
#Settings:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
alt.data_transformers.enable('default', max_rows=None)
%matplotlib inline 
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 40)
pd.set_option('display.width', 1000)

DataTransformerRegistry.enable('default')

In [41]:
#Support functions:

#Apply a compiled pattern object to a string with the findall method.
#Return True if the pattern matches exactly once (a single run of 5 digits), False otherwise.
def regapp(x,pattObj):
    return len(pattObj.findall(x)) == 1

def cleanup(cell):
    return float(cell.replace("$",""))

def cutgps(cell,myregex):
    store = re.findall(myregex,cell)
    if store: #non-empty: at least one match
        retVal = store[0]
    else:
        retVal = "NA" #Some of the entries are missing their lat/long data
    return retVal

#Extract one GPS coordinate from the cleaned storelocation string.
#Assumes each cell is a non-empty string: either "NA" or "(lat, long)".
def getlatlong(cell, pos):
    if (cell == "NA"):
        return cell
    sectionList = cell.replace("(","").replace(")","").split(",")
    return float(sectionList[pos]) #0 or 1. Float will clip out spaces for us!



## Data Loading:

In [42]:
dfILS = pd.read_csv("./data/originalcsv/Iowa_Liquor_Sales.csv")
dfILS.columns.values

array(['Invoice/Item Number', 'Date', 'Store Number', 'Store Name',
       'Address', 'City', 'Zip Code', 'Store Location', 'County Number',
       'County', 'Category', 'Category Name', 'Vendor Number',
       'Vendor Name', 'Item Number', 'Item Description', 'Pack',
       'Bottle Volume (ml)', 'State Bottle Cost', 'State Bottle Retail',
       'Bottles Sold', 'Sale (Dollars)', 'Volume Sold (Liters)',
       'Volume Sold (Gallons)'], dtype=object)

## Data Tidying:

In [43]:
#we will use this to calculate the percentage of rows lost after cleaning.
lossDict = {}
lossDict['fullsize'] = dfILS.shape[0] 


Let's get some basic information about our data frame. We can already see that some integer-valued columns have been upcast to floats (County Number, Category, Vendor Number).

In [44]:
dfILS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12591077 entries, 0 to 12591076
Data columns (total 24 columns):
Invoice/Item Number      object
Date                     object
Store Number             int64
Store Name               object
Address                  object
City                     object
Zip Code                 object
Store Location           object
County Number            float64
County                   object
Category                 float64
Category Name            object
Vendor Number            float64
Vendor Name              object
Item Number              int64
Item Description         object
Pack                     int64
Bottle Volume (ml)       int64
State Bottle Cost        object
State Bottle Retail      object
Bottles Sold             int64
Sale (Dollars)           object
Volume Sold (Liters)     float64
Volume Sold (Gallons)    float64
dtypes: float64(5), int64(5), object(14)
memory usage: 2.3+ GB
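
The "+" in the memory figure means pandas did not measure the contents of the object (string) columns. A small sketch, on a toy frame with assumed values, of how a deep measurement can be requested:

```python
import pandas as pd

# Toy frame; the string column is forced to object dtype to mirror the
# object columns reported by info() above.
toy = pd.DataFrame({
    "storename": pd.Series(
        ["Stammer Liquor Corp", "J D Spirits Liquor"], dtype=object
    ),
    "pack": [6, 1],
})

shallow = toy.memory_usage(deep=False).sum()  # object columns: references only
deep = toy.memory_usage(deep=True).sum()      # also counts the string payloads

print(shallow, deep)
```

On the full frame, `dfILS.info(memory_usage="deep")` would report the deep figure directly, at the cost of a slower scan.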


**Checking out nulls:**

In [45]:
dfILS.isnull().sum() #our number of nas.

Invoice/Item Number          0
Date                         0
Store Number                 0
Store Name                   0
Address                   2376
City                      2375
Zip Code                  2420
Store Location            2375
County Number            79178
County                   79178
Category                  8020
Category Name            16086
Vendor Number                3
Vendor Name                  1
Item Number                  0
Item Description             0
Pack                         0
Bottle Volume (ml)           0
State Bottle Cost           10
State Bottle Retail         10
Bottles Sold                 0
Sale (Dollars)              10
Volume Sold (Liters)         0
Volume Sold (Gallons)        0
dtype: int64

In [46]:
#We should be able to drop NAs, and still have over 12M rows to choose from.
#This loss is acceptable.
dfILS.dropna(inplace=True)
lossDict['nullloss'] = (lossDict['fullsize']- dfILS.shape[0])

**Our Column Names need to be tidied up.**

In [47]:
#replace number with id, so we don't have redundant table.column names (later)
nameDict = {"Invoice/Item Number":"invoiceid"
,"Date":"date"
,"Store Number":"storeid"
,"Store Name":"storename"
,"Address":"address"            
,"City":"city"
,"Zip Code":"zipcode"
,"Store Location":"storelocation"
,"County Number":"countyid"
,"County":"countyname"
,"Category":"categoryid"
,"Category Name":"categoryname"
,"Vendor Name":"vendorname"
,"Vendor Number":"vendorid"
,"Item Number":"itemid"
,"Item Description":"itemdescription" #this is an exception (not "name"), but semantically makes more sense.
,"Pack":"pack"
,"Bottle Volume (ml)":"bottlevolumeml"
,"State Bottle Cost":"statebottlecost"
,"State Bottle Retail":"statebottleretail"
,"Bottles Sold":"bottlessold"
,"Sale (Dollars)":"saleprice"
,"Volume Sold (Liters)":"volumesoldlitre"
,"Volume Sold (Gallons)":"volumesoldgallon"}

dfILS.rename(columns = nameDict,inplace=True)

Pandas upcasts several of these columns on load, so we need to make the datatypes more specific (example: float64 -> int32). Integer columns that contained missing values (countyid, categoryid, vendorid) were read as floats, and zipcode was read as a string object. Observe below:

In [48]:
dfILS.tail(3)

Unnamed: 0,invoiceid,date,storeid,storename,address,city,zipcode,storelocation,countyid,countyname,categoryid,categoryname,vendorid,vendorname,itemid,itemdescription,pack,bottlevolumeml,statebottlecost,statebottleretail,bottlessold,saleprice,volumesoldlitre,volumesoldgallon
12591074,INV-08368000076,10/31/2017,5423,Stammer Liquor Corp,615 2nd Ave,Sheldon,51201,"615 2nd Ave\nSheldon 51201\n(43.184614, -95.85...",71.0,OBRIEN,1011500.0,Bottled in Bond Bourbon,85.0,Brown Forman Corp.,20372,Old Forester 1897 Whisky Row Series,6,750,$24.98,$37.47,2,$37.47,1.5,0.39
12591075,INV-08368000077,10/31/2017,5423,Stammer Liquor Corp,615 2nd Ave,Sheldon,51201,"615 2nd Ave\nSheldon 51201\n(43.184614, -95.85...",71.0,OBRIEN,1011200.0,Straight Bourbon Whiskies,85.0,Brown Forman Corp.,20369,Old Forester 1870 Whisky Row Series,6,750,$22.49,$33.74,2,$33.74,1.5,0.39
12591076,INV-08368000078,10/31/2017,5423,Stammer Liquor Corp,615 2nd Ave,Sheldon,51201,"615 2nd Ave\nSheldon 51201\n(43.184614, -95.85...",71.0,OBRIEN,1091100.0,American Distilled Spirit Specialty,481.0,Sugarlands Distilling Company LLC,77309,Sugarlands Shine Peanut Butter & Jelly Moonshine,6,750,$13.00,$19.50,2,$19.50,1.5,0.39


In [49]:
#The following are easy to correct. The ints are not that big, so we can use int32 instead of int64 to save space.
#Note: casting volumesoldlitre and volumesoldgallon to int32 truncates fractional volumes (1.5 -> 1, 0.39 -> 0).
convertList = ["countyid","vendorid","storeid","categoryid","pack","bottlevolumeml","bottlessold",
               "volumesoldlitre","volumesoldgallon","itemid"]

#astype returns a new frame (it has no inplace option), so assign the result back.
dfILS = dfILS.astype({item: "int32" for item in convertList})
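
To make the upcasting concrete, here is a small sketch on a toy frame (the values are illustrative, not taken from the full dataset). A missing value forces an integer column to float64. Once the NaN rows are dropped, the column can be cast to int32. The same cast applied to a fractional column silently truncates.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "countyid": [71.0, np.nan, 7.0],       # NaN forces float64
    "volumesoldlitre": [1.5, 9.0, 0.75],
})
print(toy.dtypes)

toy = toy.dropna()
toy = toy.astype({"countyid": "int32"})    # safe: values are whole numbers

# Casting a fractional column to int32 drops the fractional part.
truncated = toy["volumesoldlitre"].astype("int32")
print(truncated.tolist())
```

If fractional volumes matter downstream, float32 would be a space-saving choice that keeps them.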


### Dealing with Zipcodes:

Zipcode is classed as a string object type because of zipcode anomalies in the data. There are zipcodes of the form "752-6", and these non-numeric characters forced the whole column to be read as strings. We need to cut out rows that don't conform to a 5-digit code before casting to int.

In [50]:
dfILS['zipcode'].unique() #if you want to look for yourself.

array(['50702', '52761', '51025', '51040', '50219', '50517', '50126',
       '50208', '50312', '50138', '52240', '51555', '50058', '50266',
       '50701', '52804', '52601', '51501', '50115', '52205', '52627',
       '52632', '50010', '50703', '52807', '50049', '51360', '50023',
       '52722', '51566', '52577', '51351', '52405', '52806', '52101',
       '50111', '50009', '50401', '50665', '50201', '52001', '50158',
       '50533', '50613', '52753', '50428', '50317', '51331', '52404',
       '50314', '52778', '50131', '52772', '52317', '51034', '51249',
       '50450', '52324', '50621', '51103', '52653', '50676', '51031',
       '52314', '52732', '52003', '51250', '50511', '52002', '50536',
       '52031', '52656', '50588', '50616', '52301', '50315', '50458',
       '50320', '50021', '51301', '50529', '52310', '50054', '50310',
       '50313', '50309', '50801', '51445', '52403', '51105', '52159',
       '50022', '52245', '50548', '50265', '50211', '50707', '50674',
       '50677', '516

There appear to be *non-numeric characters ("712-2"), floats, strings and ints all mixed together*. Yikes. Upcast everything to strings,
filter out the anomalies, and then convert what's left to ints.

In [51]:
dfILS['zipcode'] = dfILS['zipcode'].apply(str)

In [52]:
pattObj = re.compile(r"[0-9]{5}")
boolSelect = dfILS['zipcode'].apply(regapp,args=(pattObj,))
lossDict['ziploss'] = boolSelect.value_counts().loc[False]
dfILS = dfILS[boolSelect]
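
Because the pattern is not anchored, `findall` also accepts longer strings that happen to contain exactly one run of five digits. The sketch below illustrates this with a few sample strings and shows one possible stricter check. `strictPatt` and `is_strict_zip` are hypothetical names, not part of the original notebook.

```python
import re

pattObj = re.compile(r"[0-9]{5}")

# findall returns non-overlapping 5-digit runs anywhere in the string.
for candidate in ["50702", "50702.0", "712-2", "7723632.0"]:
    print(candidate, pattObj.findall(candidate))

# Hypothetical stricter check: exactly five digits, optionally followed
# by ".0" (for zipcodes that were read as floats).
strictPatt = re.compile(r"[0-9]{5}(\.0)?")

def is_strict_zip(text):
    return strictPatt.fullmatch(text) is not None

print([is_strict_zip(c) for c in ["50702", "50702.0", "712-2", "7723632.0"]])
```

Swapping in the stricter check would change the number of rows dropped, so the loss figures reported later would no longer apply.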


We have to step down the type incrementally: a float-formatted string such as '7723632.0' throws a ValueError when passed to int() directly, so we convert to float first.

In [53]:
dfILS['zipcode'] = dfILS['zipcode'].astype(float).astype("int32") #string -> float -> int32
#done!
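
An alternative, sketched here on a handful of sample values, is to let `pd.to_numeric` coerce unparseable entries to NaN and then keep only five-digit values. This is a different technique from the regex filter above and is shown only for comparison. The range check assumes no zipcode starts with a zero, which holds for Iowa but not in general.

```python
import pandas as pd

# Mixed strings, ints and floats, similar to the raw zipcode column.
zips = pd.Series(["50702", "52761.0", "712-2", 51025, 50219.0])

numeric = pd.to_numeric(zips.astype(str), errors="coerce")  # "712-2" -> NaN
is_valid = numeric.between(10000, 99999)                    # NaN compares False
clean = numeric[is_valid].astype("int32")

print(clean.tolist())
```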

### Dealing with Sales Columns:

Next, we need to clean up the sales columns. They are strings because the dollar sign was included in the spreadsheet. Again,
the numbers in sales aren't that large, so let's use float32 instead of float64 to save some space.

In [54]:
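for columnname in ["statebottlecost","statebottleretail","saleprice"]:
    dfILS[columnname] = dfILS[columnname].apply(cleanup) #strip "$" and convert to float64
    #astype returns a new frame; the result is not assigned here, so the column
    #stays float64 (see the dfILS.info() output in the summary).
    dfILS.astype({columnname:"float32"})
    
#dfILS.info() #check to see that the three cols are now floats.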

Unnamed: 0,invoiceid,date,storeid,storename,address,city,zipcode,storelocation,countyid,countyname,categoryid,categoryname,vendorid,vendorname,itemid,itemdescription,pack,bottlevolumeml,statebottlecost,statebottleretail,bottlessold,saleprice,volumesoldlitre,volumesoldgallon
6,S28865700001,11/09/2015,2538,Hy-Vee Food Store #3 / Waterloo,1422 FLAMMANG DR,WATERLOO,50702,"1422 FLAMMANG DR\nWATERLOO 50702\n(42.459938, ...",7,Black Hawk,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.620000,$17.43,6,$104.58,9,2
8,S29339300091,11/30/2015,2662,Hy-Vee Wine & Spirits / Muscatine,"522 MULBERRY, SUITE A",MUSCATINE,52761,"522 MULBERRY, SUITE A\nMUSCATINE 52761\n",70,Muscatine,1701100,DECANTERS & SPECIALTY PACKAGES,65,Jim Beam Brands,173,Laphroaig w/ Whiskey Stones,12,750,19.580000,$29.37,4,$117.48,3,0
13,S28866900001,11/11/2015,3650,"Spirits, Stogies and Stuff",118 South Main St.,HOLSTEIN,51025,118 South Main St.\nHOLSTEIN 51025\n(42.490073...,47,Ida,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.620000,$17.43,1,$17.43,1,0
18,S29134300126,11/18/2015,3723,J D Spirits Liquor,1023 9TH ST,ONAWA,51040,"1023 9TH ST\nONAWA 51040\n(42.025841, -96.095845)",67,Monona,1081200,CREAM LIQUEURS,305,MHW Ltd,258,"Rumchata ""GoChatas""",1,6000,99.000000,$148.50,1,$148.50,6,1
21,S29282800048,11/23/2015,2642,Hy-Vee Wine and Spirits / Pella,512 E OSKALOOSA,PELLA,50219,"512 E OSKALOOSA\nPELLA 50219\n(41.397023, -92....",63,Marion,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.620000,$17.43,6,$104.58,9,2
25,S28867000001,11/04/2015,3842,Bancroft Liquor Store,107 N PORTLAND ST PO BX 222,BANCROFT,50517,107 N PORTLAND ST PO BX 222\nBANCROFT 50517\n(...,55,Kossuth,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.620000,$17.43,3,$52.29,4,1
29,S28865800001,11/09/2015,2539,Hy-Vee Food Store / iowa Falls,HIGHWAY 65 SOUTH,IOWA FALLS,50126,HIGHWAY 65 SOUTH\nIOWA FALLS 50126\n,42,Hardin,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.620000,$17.43,6,$104.58,9,2
38,S28867100001,11/09/2015,4604,Pit Stop Liquors / Newton,"1324, 1st AVE E",NEWTON,50208,"1324, 1st AVE E\nNEWTON 50208\n(41.699173, -93...",50,Jasper,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.620000,$17.43,2,$34.86,3,0
42,S29191200001,11/19/2015,2248,Ingersoll Liquor and Beverage,3500 INGERSOLL AVE,DES MOINES,50312,3500 INGERSOLL AVE\nDES MOINES 50312\n(41.5863...,77,Polk,1701100,DECANTERS & SPECIALTY PACKAGES,65,Jim Beam Brands,173,Laphroaig w/ Whiskey Stones,12,750,19.580000,$29.37,36,$1057.32,27,7
50,S29137200001,11/18/2015,2566,Hy-Vee Food Store / Knoxville,813 N LINCOLN STE 1,KNOXVILLE,50138,813 N LINCOLN STE 1\nKNOXVILLE 50138\n(41.3254...,63,Marion,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.620000,$17.43,12,$209.16,18,4


Unnamed: 0,invoiceid,date,storeid,storename,address,city,zipcode,storelocation,countyid,countyname,categoryid,categoryname,vendorid,vendorname,itemid,itemdescription,pack,bottlevolumeml,statebottlecost,statebottleretail,bottlessold,saleprice,volumesoldlitre,volumesoldgallon
6,S28865700001,11/09/2015,2538,Hy-Vee Food Store #3 / Waterloo,1422 FLAMMANG DR,WATERLOO,50702,"1422 FLAMMANG DR\nWATERLOO 50702\n(42.459938, ...",7,Black Hawk,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.430000,6,$104.58,9,2
8,S29339300091,11/30/2015,2662,Hy-Vee Wine & Spirits / Muscatine,"522 MULBERRY, SUITE A",MUSCATINE,52761,"522 MULBERRY, SUITE A\nMUSCATINE 52761\n",70,Muscatine,1701100,DECANTERS & SPECIALTY PACKAGES,65,Jim Beam Brands,173,Laphroaig w/ Whiskey Stones,12,750,19.58,29.370001,4,$117.48,3,0
13,S28866900001,11/11/2015,3650,"Spirits, Stogies and Stuff",118 South Main St.,HOLSTEIN,51025,118 South Main St.\nHOLSTEIN 51025\n(42.490073...,47,Ida,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.430000,1,$17.43,1,0
18,S29134300126,11/18/2015,3723,J D Spirits Liquor,1023 9TH ST,ONAWA,51040,"1023 9TH ST\nONAWA 51040\n(42.025841, -96.095845)",67,Monona,1081200,CREAM LIQUEURS,305,MHW Ltd,258,"Rumchata ""GoChatas""",1,6000,99.00,148.500000,1,$148.50,6,1
21,S29282800048,11/23/2015,2642,Hy-Vee Wine and Spirits / Pella,512 E OSKALOOSA,PELLA,50219,"512 E OSKALOOSA\nPELLA 50219\n(41.397023, -92....",63,Marion,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.430000,6,$104.58,9,2
25,S28867000001,11/04/2015,3842,Bancroft Liquor Store,107 N PORTLAND ST PO BX 222,BANCROFT,50517,107 N PORTLAND ST PO BX 222\nBANCROFT 50517\n(...,55,Kossuth,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.430000,3,$52.29,4,1
29,S28865800001,11/09/2015,2539,Hy-Vee Food Store / iowa Falls,HIGHWAY 65 SOUTH,IOWA FALLS,50126,HIGHWAY 65 SOUTH\nIOWA FALLS 50126\n,42,Hardin,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.430000,6,$104.58,9,2
38,S28867100001,11/09/2015,4604,Pit Stop Liquors / Newton,"1324, 1st AVE E",NEWTON,50208,"1324, 1st AVE E\nNEWTON 50208\n(41.699173, -93...",50,Jasper,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.430000,2,$34.86,3,0
42,S29191200001,11/19/2015,2248,Ingersoll Liquor and Beverage,3500 INGERSOLL AVE,DES MOINES,50312,3500 INGERSOLL AVE\nDES MOINES 50312\n(41.5863...,77,Polk,1701100,DECANTERS & SPECIALTY PACKAGES,65,Jim Beam Brands,173,Laphroaig w/ Whiskey Stones,12,750,19.58,29.370001,36,$1057.32,27,7
50,S29137200001,11/18/2015,2566,Hy-Vee Food Store / Knoxville,813 N LINCOLN STE 1,KNOXVILLE,50138,813 N LINCOLN STE 1\nKNOXVILLE 50138\n(41.3254...,63,Marion,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.430000,12,$209.16,18,4


Unnamed: 0,invoiceid,date,storeid,storename,address,city,zipcode,storelocation,countyid,countyname,categoryid,categoryname,vendorid,vendorname,itemid,itemdescription,pack,bottlevolumeml,statebottlecost,statebottleretail,bottlessold,saleprice,volumesoldlitre,volumesoldgallon
6,S28865700001,11/09/2015,2538,Hy-Vee Food Store #3 / Waterloo,1422 FLAMMANG DR,WATERLOO,50702,"1422 FLAMMANG DR\nWATERLOO 50702\n(42.459938, ...",7,Black Hawk,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.43,6,104.580002,9,2
8,S29339300091,11/30/2015,2662,Hy-Vee Wine & Spirits / Muscatine,"522 MULBERRY, SUITE A",MUSCATINE,52761,"522 MULBERRY, SUITE A\nMUSCATINE 52761\n",70,Muscatine,1701100,DECANTERS & SPECIALTY PACKAGES,65,Jim Beam Brands,173,Laphroaig w/ Whiskey Stones,12,750,19.58,29.37,4,117.480003,3,0
13,S28866900001,11/11/2015,3650,"Spirits, Stogies and Stuff",118 South Main St.,HOLSTEIN,51025,118 South Main St.\nHOLSTEIN 51025\n(42.490073...,47,Ida,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.43,1,17.430000,1,0
18,S29134300126,11/18/2015,3723,J D Spirits Liquor,1023 9TH ST,ONAWA,51040,"1023 9TH ST\nONAWA 51040\n(42.025841, -96.095845)",67,Monona,1081200,CREAM LIQUEURS,305,MHW Ltd,258,"Rumchata ""GoChatas""",1,6000,99.00,148.50,1,148.500000,6,1
21,S29282800048,11/23/2015,2642,Hy-Vee Wine and Spirits / Pella,512 E OSKALOOSA,PELLA,50219,"512 E OSKALOOSA\nPELLA 50219\n(41.397023, -92....",63,Marion,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.43,6,104.580002,9,2
25,S28867000001,11/04/2015,3842,Bancroft Liquor Store,107 N PORTLAND ST PO BX 222,BANCROFT,50517,107 N PORTLAND ST PO BX 222\nBANCROFT 50517\n(...,55,Kossuth,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.43,3,52.290001,4,1
29,S28865800001,11/09/2015,2539,Hy-Vee Food Store / iowa Falls,HIGHWAY 65 SOUTH,IOWA FALLS,50126,HIGHWAY 65 SOUTH\nIOWA FALLS 50126\n,42,Hardin,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.43,6,104.580002,9,2
38,S28867100001,11/09/2015,4604,Pit Stop Liquors / Newton,"1324, 1st AVE E",NEWTON,50208,"1324, 1st AVE E\nNEWTON 50208\n(41.699173, -93...",50,Jasper,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.43,2,34.860001,3,0
42,S29191200001,11/19/2015,2248,Ingersoll Liquor and Beverage,3500 INGERSOLL AVE,DES MOINES,50312,3500 INGERSOLL AVE\nDES MOINES 50312\n(41.5863...,77,Polk,1701100,DECANTERS & SPECIALTY PACKAGES,65,Jim Beam Brands,173,Laphroaig w/ Whiskey Stones,12,750,19.58,29.37,36,1057.319946,27,7
50,S29137200001,11/18/2015,2566,Hy-Vee Food Store / Knoxville,813 N LINCOLN STE 1,KNOXVILLE,50138,813 N LINCOLN STE 1\nKNOXVILLE 50138\n(41.3254...,63,Marion,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.43,12,209.160004,18,4
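
Note that `astype` returns a new object rather than modifying the frame, so its result has to be assigned for the downcast to stick. A minimal sketch on a toy column of a vectorised version that does keep the float32 result:

```python
import pandas as pd

toy = pd.DataFrame({"saleprice": ["$37.47", "$1057.32"]})

# Strip the literal "$" (regex=False) and cast in one vectorised step,
# assigning the result back to the column.
toy["saleprice"] = (
    toy["saleprice"].str.replace("$", "", regex=False).astype("float32")
)
print(toy.dtypes)
```

float32 carries about seven significant digits, which is why values such as 29.370001 appear in the tables above.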


### Separating GPS coordinates from the Store Location Column:

Next we will split the Store Location column. It encodes the address, zipcode and GPS coordinates in a single string. Individual store addresses don't matter, as this dataset needs to be aggregated over a larger geographical area to do reasonable modelling. The storelocation column will be reduced to its GPS substring and then dropped. Latitude and longitude columns can be added in its place (that step is suppressed below, as it is not needed for Tableau).

In [55]:
#First, mutate the store location column; replace it with just the GPS substring.
myregex = r"(\(.+,.+\))"
dfILS['storelocation'] = dfILS['storelocation'].apply(cutgps,args=(myregex,))

In [56]:
#suppressed as I don't need this for Tableau!
#dfILS['latitude'] = dfILS['storelocation'].apply(getlatlong,args=(0,))
#dfILS['longitude'] = dfILS['storelocation'].apply(getlatlong,args=(1,))
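
For reference, the latitude/longitude split can also be done in one pass with `Series.str.extract`. The sketch below uses two location strings shaped like those in the dataset. `coordPattern` is an assumed pattern, and missing coordinates come back as NaN rather than the "NA" string used above.

```python
import pandas as pd

locations = pd.Series([
    "1023 9TH ST\nONAWA 51040\n(42.025841, -96.095845)",
    "HIGHWAY 65 SOUTH\nIOWA FALLS 50126\n",  # no coordinates
])

# Named groups become column names; rows without a match yield NaN.
coordPattern = r"\((?P<latitude>-?\d+\.\d+),\s*(?P<longitude>-?\d+\.\d+)\)"
coords = locations.str.extract(coordPattern).astype(float)
print(coords)
```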


In [57]:
dfILS.drop(columns=["storelocation"],inplace=True)

Finally, our GPS conversion has introduced some NAs, as not every storelocation had GPS coordinates. Let's check the string NAs.

In [58]:
#boolSelect2 = dfILS['latitude'] == "NA"
#boolSelect2.value_counts() #our number of nas.

931655 NAs is roughly 7.5 percent of our data. Should they be dumped? I choose not to. I'll deal with "NA"s further down the pipeline. 

**Cleaning up City Names:**

City names are inconsistently cased (many are in all uppercase). Let's make them all lower case instead.


In [59]:
dfILS["city"] = dfILS["city"].str.lower()


### Summary and Check of Data Tidying:

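We have saved some memory by casting the columns to smaller types: the reported usage drops from 2.3+ GB to 1.7+ GB (the dropped rows and the dropped storelocation column also contribute).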

In [60]:
dfILS.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12488910 entries, 6 to 12591076
Data columns (total 23 columns):
invoiceid            object
date                 object
storeid              int32
storename            object
address              object
city                 object
zipcode              int32
countyid             int32
countyname           object
categoryid           int32
categoryname         object
vendorid             int32
vendorname           object
itemid               int32
itemdescription      object
pack                 int32
bottlevolumeml       int32
statebottlecost      float64
statebottleretail    float64
bottlessold          int32
saleprice            float64
volumesoldlitre      int32
volumesoldgallon     int32
dtypes: float64(3), int32(11), object(9)
memory usage: 1.7+ GB


In [61]:
rowsum = lossDict['nullloss'] + lossDict['ziploss']
print("Rows Lost: " + str(rowsum) + "\n Percentage Loss: " + str(rowsum/lossDict['fullsize']))

Rows Lost: 102167
 Percentage Loss: 0.008114238361023445

Note that the value printed is a fraction: about 0.81 percent of the original rows were lost.


## Subsetting and Saving our Data:

Now that our data is tidy, we can subset and save it to .csv files. There are some examples below:

In [62]:
dfILS.head(5)

Unnamed: 0,invoiceid,date,storeid,storename,address,city,zipcode,countyid,countyname,categoryid,categoryname,vendorid,vendorname,itemid,itemdescription,pack,bottlevolumeml,statebottlecost,statebottleretail,bottlessold,saleprice,volumesoldlitre,volumesoldgallon
6,S28865700001,11/09/2015,2538,Hy-Vee Food Store #3 / Waterloo,1422 FLAMMANG DR,waterloo,50702,7,Black Hawk,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.43,6,104.58,9,2
8,S29339300091,11/30/2015,2662,Hy-Vee Wine & Spirits / Muscatine,"522 MULBERRY, SUITE A",muscatine,52761,70,Muscatine,1701100,DECANTERS & SPECIALTY PACKAGES,65,Jim Beam Brands,173,Laphroaig w/ Whiskey Stones,12,750,19.58,29.37,4,117.48,3,0
13,S28866900001,11/11/2015,3650,"Spirits, Stogies and Stuff",118 South Main St.,holstein,51025,47,Ida,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.43,1,17.43,1,0
18,S29134300126,11/18/2015,3723,J D Spirits Liquor,1023 9TH ST,onawa,51040,67,Monona,1081200,CREAM LIQUEURS,305,MHW Ltd,258,"Rumchata ""GoChatas""",1,6000,99.0,148.5,1,148.5,6,1
21,S29282800048,11/23/2015,2642,Hy-Vee Wine and Spirits / Pella,512 E OSKALOOSA,pella,50219,63,Marion,1701100,DECANTERS & SPECIALTY PACKAGES,962,Duggan's Distillers Products Corp,238,Forbidden Secret Coffee Pack,6,1500,11.62,17.43,6,104.58,9,2


## Putting Redundant Information into Separate Tables:

We can shrink this data frame significantly by moving coupled columns into separate lookup tables, which the user can join back
at a later date. In particular:

- store number and store name
- county number and county name
- category and category name
- vendor number and vendor name
- item number and item description

We can keep just the integer id column in the main table, and store a reduced table of unique (id, name) pairs in a separate file. Via a left join, we can 
reconstruct the data if needed.

In [63]:
#Write the unique (colA, colB) pairs of dfILS to filename.csv as a lookup table.
def uniquefilewrite(colA,colB,filename):
    storage = dfILS.drop_duplicates(subset=[colA,colB],keep="first",inplace=False)
    storage.loc[:,[colA,colB]].to_csv(filename + ".csv",index=False)

In [64]:
colA = ["storeid","countyid","categoryid","vendorid","itemid"]
colB = ["storename","countyname","categoryname","vendorname","itemdescription"]

for idCol, nameCol in zip(colA,colB):
    uniquefilewrite(idCol,nameCol,resultDir+nameCol)
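
The left join mentioned above can be sketched as follows, using toy stand-ins for the main table and one lookup file (the frames and their contents are illustrative only):

```python
import pandas as pd

# Toy stand-ins for the main table and the storename lookup file.
sales = pd.DataFrame({
    "invoiceid": ["S1", "S2", "S3"],
    "storeid": [2538, 2662, 2538],
})
stores = pd.DataFrame({
    "storeid": [2538, 2662],
    "storename": [
        "Hy-Vee Food Store #3 / Waterloo",
        "Hy-Vee Wine & Spirits / Muscatine",
    ],
})

# validate="many_to_one" raises if an id appears more than once in the
# lookup, which would otherwise duplicate sales rows.
restored = sales.merge(stores, on="storeid", how="left", validate="many_to_one")
print(restored)
```

Because the lookup files are de-duplicated on the (id, name) pair, an id recorded with several name spellings would appear more than once. The `validate` argument is one way to surface that before it inflates the row count.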
    

### Writing out our dataframe

The data is now tidy, clean, and free of redundant data.

Kaggle session information indicates our data now takes up less than 1.4 GB of space. Much better!

In [65]:
colList = ["invoiceid","date","city", "zipcode", "storeid","countyid",
           "categoryid", "vendorid","itemid","bottlevolumeml","statebottlecost",
            "statebottleretail", "bottlessold", "saleprice" ,"volumesoldlitre"]

dfILS.loc[:,colList].to_csv(resultDir+"ILS_clean.csv",index=False)
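
A CSV file does not record dtypes, so the int32 choices made above are lost on write and have to be restated on read. A minimal sketch, using an in-memory stand-in for `ILS_clean.csv` with only a few of its columns:

```python
import io
import pandas as pd

# Stand-in for ILS_clean.csv (subset of columns, one sample row).
csvText = (
    "invoiceid,date,zipcode,storeid,saleprice\n"
    "S28865700001,11/09/2015,50702,2538,104.58\n"
)

clean = pd.read_csv(
    io.StringIO(csvText),
    dtype={"zipcode": "int32", "storeid": "int32", "saleprice": "float32"},
    parse_dates=["date"],  # dates are stored as MM/DD/YYYY strings
)
print(clean.dtypes)
```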
