## Join Table Corrections:

While loading the ILS set of tables into DataGrip, I got non-unique key errors for some of the left-join tables I constructed. In this notebook, I do some basic cleaning and investigate the non-unique keys.
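
A quick way to reproduce this kind of error outside the database is to test whether the intended key column is unique. The sketch below uses a small made-up table; `toyDF` and its values are illustrative assumptions, not the real ILS data.

```python
import pandas as pd

# Toy lookup table (hypothetical values) with one repeated key.
toyDF = pd.DataFrame({
    "countynumber": [1, 2, 2, 3],
    "countyname": ["Adair", "Adams", "ADAMS", "Allamakee"],
})

# True only if every key value occurs exactly once.
keyIsUnique = toyDF["countynumber"].is_unique

# keep=False marks every row that shares its key with another row.
dupRows = toyDF[toyDF.duplicated("countynumber", keep=False)]
print(keyIsUnique)
print(dupRows)
```

Running the same two checks on each join table before loading should point at the rows that would violate a unique-key constraint.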

## Load Data and Setup:

In [18]:
#Libraries
import pandas as pd
import numpy as np
import os

In [19]:
#Settings
from IPython.core.interactiveshell import InteractiveShell
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 40)
pd.set_option('display.width', 1000)

In [22]:
#What tables are we working with?
#Everything except the ILS_clean.csv
inputDir = "./data/stage1/"
outputDir = "./data/stage2/"
os.listdir(inputDir)

['vendorname.csv',
 'storename.csv',
 'countyname.csv',
 'categoryname.csv',
 'ILS_clean.csv',
 'itemdescription.csv']

In [34]:
vendorDF = pd.read_csv(inputDir+"vendorname.csv")
storeDF = pd.read_csv(inputDir+"storename.csv")
countyDF = pd.read_csv(inputDir+"countyname.csv")
categoryDF = pd.read_csv(inputDir+"categoryname.csv")
itemDF = pd.read_csv(inputDir+"itemdescription.csv")

## Experimentation and Examination:

In [20]:
#from this, we know there are 99 counties, and each group has two elements.
countyGroup = countyDF.groupby("countyid")
#countyGroup.count()

In [22]:
#str.isupper() is True only when every cased character is upper case, and str.islower() only when every cased
#character is lower case, so mixed-case names return False for both. I did not filter on these and split the sets by position.
#I see full separation, so let's take a chance.
countyDF.iloc[0:99,:]
countyDF.iloc[100:200,:]

Unnamed: 0,countynumber,countyname
100,9,BREMER
101,42,HARDIN
102,77,POLK
103,50,JASPER
104,12,BUTLER
105,34,FLOYD
106,41,HANCOCK
107,45,HOWARD
108,19,CHICKASAW
109,66,MITCHELL
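
For reference, a minimal sketch of how the pandas string predicates behave on a made-up sample. It assumes the non-upper-case set is spelled in title case, which the output shown here does not confirm; `names` is a hypothetical series, not the real table.

```python
import pandas as pd

# Hypothetical sample mixing title-case and upper-case spellings.
names = pd.Series(["Bremer", "BREMER", "Polk", "POLK"])

# str.isupper() is True only when every cased character is upper case.
upperMask = names.str.isupper()

upperNames = names[upperMask].tolist()
otherNames = names[~upperMask].tolist()
print(upperNames)
print(otherNames)
```

Under that assumption, a mask like this could replace positional slicing, which depends on the row order of the file.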


In [19]:
#our index and county number end on 99, but start at 0 and 1 (respectively). There is a redundant county.
#This is number 85
countyDF = countyDF.loc[0:99,:]
countyDF.sort_values(by="countyid",inplace=True)
countyDF = countyDF[countyDF["countyname"] != "STORY"]
countyDF.reset_index(inplace=True,drop=True)

countyDF.to_csv("./countynames.csv",index=False,header=True)

In [5]:
#Next, let's look at categoryname:
categoryGroup = categoryDF.groupby("categoryid")
#Some duplicates, some not. There appear to be 104 unique items in the set.
#There are some serious problems with the categories. For example: Irish and Japanese whiskies are mapped to the
#same number.
categoryGroup.sum()

categorynumber,categoryname
1011100,BLENDED WHISKIESBlended Whiskies
1011200,STRAIGHT BOURBON WHISKIESStraight Bourbon Whis...
1011250,SINGLE BARREL BOURBON WHISKIES
1011300,TENNESSEE WHISKIESSingle Barrel Bourbon Whiskies
1011400,BOTTLED IN BOND BOURBONTennessee Whiskies
1011500,STRAIGHT RYE WHISKIESBottled in Bond Bourbon
1011600,CORN WHISKIESStraight Rye Whiskies
1011700,Corn Whiskies
1011800,Iowa Distillery Whiskies
1012100,CANADIAN WHISKIESCanadian Whiskies


In [43]:
#Let's take a look at the dataframe sections
categoryDF.shape
categoryDF.iloc[1:99,:]
categoryDF.iloc[100:132]

(132, 2)

In [6]:
#Just look at the groups with exactly 1 item:
for key, group in categoryGroup:
    if len(group) == 1: #group links to a data frame, but the length of a DF is the number of rows. OK.
        print(group.shape)
        print(group.iloc[0,0])
        print(group)

(1, 2)
1011250
    categorynumber                    categoryname
53         1011250  SINGLE BARREL BOURBON WHISKIES
(1, 2)
1011700
     categorynumber   categoryname
119         1011700  Corn Whiskies
(1, 2)
1011800
     categorynumber              categoryname
116         1011800  Iowa Distillery Whiskies
(1, 2)
1012210
    categorynumber        categoryname
16         1012210  SINGLE MALT SCOTCH
(1, 2)
1022200
    categorynumber        categoryname
78         1022200  100% Agave Tequila
(1, 2)
1022300
     categorynumber categoryname
117         1022300       Mezcal
(1, 2)
1031000
    categorynumber    categoryname
75         1031000  American Vodka
(1, 2)
1031080
   categorynumber    categoryname
7         1031080  VODKA 80 PROOF
(1, 2)
1031090
    categorynumber       categoryname
52         1031090  OTHER PROOF VODKA
(1, 2)
1031110
    categorynumber     categoryname
65         1031110  LOW PROOF VODKA
(1, 2)
1032000
    categorynumber    categoryname
86         1032000  Imported
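
The same inspection can be done without an explicit loop. This is a minimal sketch on a toy frame; `toyCatDF` reuses a few values from the output above purely for illustration.

```python
import pandas as pd

# Toy category table with one pair and one singleton.
toyCatDF = pd.DataFrame({
    "categorynumber": [1011100, 1011100, 1011250],
    "categoryname": ["BLENDED WHISKIES", "Blended Whiskies",
                     "SINGLE BARREL BOURBON WHISKIES"],
})

# Number of rows per key.
groupSizes = toyCatDF.groupby("categorynumber").size()

# Keys that occur exactly once, and the rows that belong to them.
singleKeys = groupSizes[groupSizes == 1].index
singleRows = toyCatDF[toyCatDF["categorynumber"].isin(singleKeys)]
print(singleRows)
```

Changing the comparison to `groupSizes > 1` would select the multi-row groups instead.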

In [25]:
#Item Description
itemGroup = itemDF.groupby("itemid")
#Collect the keys of all groups with 2 or more rows, then stack those groups into one frame.
#(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.)
keyL = [key for key, df in itemGroup if len(df) >= 2]
checkDF = pd.concat([itemGroup.get_group(key) for key in keyL])

checkDF
#So we see the following problems: some duplicate elements, some elements that are similar but typed in differently,
#and potential control character issues.

Unnamed: 0,itemid,itemdescription
6358,155,Pinnacle Vodka w/Shaker
6504,155,Pinnacle Vodka w/Punch Dispenser
2,258,"Rumchata ""GoChatas"""
6204,258,"""Rumchata """"GoChatas"""""""
5915,308,Jack Daniel's 4YR Rye Single Barrel
6220,308,Jack Daniel's 4YR Rye Single Barrel
2675,472,Jack Daniels Tennessee Honey w/Glass
6364,472,Jack Daniels TN Honey w/Glass
4364,614,Rumchata w/Thermal Cup
6348,614,Rumchata w/Mug
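
To tell exact duplicates apart from conflicting descriptions, one option is to count distinct descriptions per `itemid`. The sketch below copies four rows from the output above into a hypothetical `sampleDF`; it assumes the two 308 rows are byte-for-byte identical, which the printed output suggests but does not prove.

```python
import pandas as pd

# A few rows copied from the output above.
sampleDF = pd.DataFrame({
    "itemid": [155, 155, 308, 308],
    "itemdescription": [
        "Pinnacle Vodka w/Shaker",
        "Pinnacle Vodka w/Punch Dispenser",
        "Jack Daniel's 4YR Rye Single Barrel",
        "Jack Daniel's 4YR Rye Single Barrel",
    ],
})

# Distinct descriptions per itemid: 1 means exact duplicates,
# more than 1 means the rows disagree on the text.
distinctPerId = sampleDF.groupby("itemid")["itemdescription"].nunique()
exactDupIds = distinctPerId[distinctPerId == 1].index.tolist()
conflictIds = distinctPerId[distinctPerId > 1].index.tolist()
print(exactDupIds, conflictIds)
```

Exact duplicates can be dropped safely, while conflicting groups need a rule (or a manual look) to decide which text to keep.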


## Join Table Transformations. Removal of Redundant Rows:

In [38]:
#Support Functions:

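def chainreplace(thestr):
    #Strip characters that tend to cause trouble when loading the CSV.
    #Note: "''" is two consecutive single quotes; a lone apostrophe is left in place.
    for badchars in ("''", "\"", ",", ";", "\n", "`"):
        thestr = thestr.replace(badchars, "")
    return thestr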



'''#We should construct a dataframe, with the following rules:
#1) Group all elements by category number
#2) For each group:
    - check to see if both entries are the same (ignoring case)
    - if so, keep one copy, lower-cased
    - if not, OR them together with a string operation.
#3) We also check for bad characters, `,:"';` and the like.
'''

def reformatcatDF(targetDF,col1,col2):
    tarGroup = targetDF.groupby(col1)
    tarNumL = []
    tarNameL = []
    for key, group in tarGroup: #groups can't be empty!
        #the name is assumed to be the second column of targetDF
        if (len(group) == 1) or (group.iloc[0,1].upper() == group.iloc[1,1].upper()):
            tarNumL.append(key)
            tarNameL.append(chainreplace(group.iloc[0,1]).lower())
        else: #assumed to be of size two; any further rows in the group are ignored.
            tarNumL.append(key)
            tarNameL.append(chainreplace(group.iloc[0,1] + " OR " + group.iloc[1,1]).lower())
    return pd.DataFrame({col1:tarNumL,col2:tarNameL})



### categorynames.csv:

For multi-label categories, I just concatenated the names with an OR. It is not clear which category was intended, or if one replaced the other. For upper/lower case issues, I chose one category. All categories are converted to lower case, for easy reading.
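
The rule described above can be sketched on a toy frame with `groupby(...).agg`. This is not the `reformatcatDF` function used in this notebook: `mergeNames` is a hypothetical helper that handles groups of any size and does not strip punctuation.

```python
import pandas as pd

def mergeNames(names):
    # Keep each name once, compared case-insensitively, in first-seen order.
    seen = []
    for name in names:
        lowered = name.lower()
        if lowered not in seen:
            seen.append(lowered)
    return " or ".join(seen)

# Toy rows reusing a few names from the category output.
toyDF = pd.DataFrame({
    "categoryid": [1011100, 1011100, 1011300, 1011300],
    "categoryname": ["BLENDED WHISKIES", "Blended Whiskies",
                     "TENNESSEE WHISKIES", "Single Barrel Bourbon Whiskies"],
})

merged = toyDF.groupby("categoryid")["categoryname"].agg(mergeNames).reset_index()
print(merged)
```

Case-only variants collapse to a single lower-case name, and genuinely different names are joined with "or", leaving one row per `categoryid`.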

In [39]:
catDFProcessed = reformatcatDF(categoryDF,"categoryid","categoryname")
#catDFProcessed.head(200)
catDFProcessed.to_csv(outputDir+"categorynames.csv",index=False,header=True)

Unnamed: 0,categoryid,categoryname
0,1011100,blended whiskies
1,1011200,straight bourbon whiskies
2,1011250,single barrel bourbon whiskies
3,1011300,tennessee whiskies or single barrel bourbon wh...
4,1011400,bottled in bond bourbon or tennessee whiskies
5,1011500,straight rye whiskies or bottled in bond bourbon
6,1011600,corn whiskies or straight rye whiskies
7,1011700,corn whiskies
8,1011800,iowa distillery whiskies
9,1012100,canadian whiskies


### countynames.csv:

There are two sets of county names: a lower-case set and an upper-case set. I keep the lower-case set and remove one redundant upper-case name ("STORY") to make the set unique.

In [None]:
#our index and county number end on 99, but start at 0 and 1 (respectively). There is a redundant county.
#This is number 85
countyDF = countyDF.loc[0:99,:]
countyDF.sort_values(by="countynumber",inplace=True)
countyDF = countyDF[countyDF["countyname"] != "STORY"]
countyDF.reset_index(inplace=True,drop=True)

countyDF.to_csv(outputDir+"countynames.csv",index=False,header=True)
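
As an alternative to slicing by position and hard-coding "STORY", the redundant row could be found by comparing names case-insensitively. This is a sketch under assumptions: `toyCountyDF` is made up, and the kept set is assumed to be in title case and to appear first in the file.

```python
import pandas as pd

# Made-up rows: a title-case set with one stray upper-case repeat.
toyCountyDF = pd.DataFrame({
    "countynumber": [85, 1, 85, 2],
    "countyname": ["Story", "Adair", "STORY", "Adams"],
})

# Compare on a normalized key so "Story" and "STORY" count as the same county.
normKey = toyCountyDF["countyname"].str.lower()
dedupedDF = (
    toyCountyDF[~normKey.duplicated(keep="first")]
    .sort_values(by="countynumber")
    .reset_index(drop=True)
)
print(dedupedDF)
```

With `keep="first"`, whichever spelling appears first in the file survives, so the row order still matters for which case is kept.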

### vendornames.csv


### storenames.csv
