# EDA and Further Cleaning

In this notebook I use EDA to identify what further cleaning my data needs. I expect to remove many features after vectorization, which means expanding my stop-word vocabulary.

**I hypothesize I will need to ...**
- Remove references to the beer itself or to other beers from reviews.
- Remove any reference to the style of beer.
- Remove words that relate to stories (I noticed some reviews talk about _how_ the reviewer came across the beer in question).
- Remove words that don't appear often and/or don't contribute to describing the beer itself.
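
As a rough illustration of the first two bullets, beer names and style labels could be tokenized into extra stop words. This is only a sketch: `name_tokens` and its `keep` parameter are hypothetical helpers, and the example strings are a few values from the dataset's `Type` and `Beer_Name` columns.

```python
import re

def name_tokens(names, keep=frozenset()):
    """Lowercased alphabetic tokens from beer or style names, minus `keep`."""
    tokens = set()
    for name in names:
        tokens.update(re.findall(r"[a-z]+", name.lower()))
    return tokens - set(keep)

# A few values from the Type and Beer_Name columns of the dataset.
styles = ["Milk / Sweet Stout", "American Strong Ale", "Irish Red Ale"]
beer_names = ["Triple Bag", "Hornswoggled"]

# `keep` protects words that also describe flavor (a judgment call, not a rule).
extra_stop_words = name_tokens(styles + beer_names, keep={"sweet", "milk"})
print(sorted(extra_stop_words))
```

Because `clean_reviews` is lemmatized (for example "hornswoggle" rather than "hornswoggled"), the names would probably need the same cleaning before their tokens match the review text.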

In [2]:
import pandas as pd


In [3]:
beers = pd.read_csv('data/Beers_clean.csv')
#beers2 = pd.read_csv('data/Beers_clean_2.csv')

In [4]:
beers.head()

Unnamed: 0,Beer_Name,Brewery_Name,ABV,Type,full_text,clean_reviews
0,That's What She Said,Tree House Brewing Company,5.6,Milk / Sweet Stout,"thanks to my buddy back east for the pint can,...",thank buddy back east pint cool beer first non...
1,Triple Bag,Long Trail Brewing Co.,11.0,American Strong Ale,Was surprised to even see this at my local sto...,surprised even see local store buy pack lo...
2,Great Lakes Devil's Pale Ale,Great Lakes Brewery,6.6,English Pale Ale,Like a brown or scotch with molasses and licor...,like brown scotch molass licorice nose lot lac...
3,Hornswoggled,Cigar City Brewing,5.0,Irish Red Ale,On tap at Engine 15. \nSurprised to see that t...,tap engine surprise see hornswoggle red ale...
4,Schlafly Raspberry Hefeweizen,The Schlafly Tap Room,4.1,Fruit / Vegetable Beer,A: The beer is hazy reddish yellow in color an...,beer hazy reddish yellow color light amount vi...


In [5]:
# review example
beers['clean_reviews'].loc[2]

'like brown scotch molass licorice nose lot lace brown body tap volo dry flavour lot different malt variety caramel hop kick finish good great lake brew ontario football field ontap c   can mostly dig followthrough whole   theme   beer appear clear dark reddish amber hue one finger soapy white head leave thin webbed lace around glass smell caramel malt earthy weedy hop taste sweet caramel malt somewhat acrid piney hop carbonation moderate body average weight thin acidity finish ramp hop soapy piney flavour increase noticeably   seem like cross basic american english ipa hopwise pour deep dark brown color ale big foamy head good retention nice lace aroma dry malt medium note caramel malt taste also dominate medium caramel malt slightly bitter ending hop presence subdued expect medium carbonation interesting necessarily something would go back often pint sized receive dyan late trade time review number   fucking metal consume listen new miley cyrus album unholy music could find   appeara

In [6]:
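# NLP Imports
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
# sklearn.feature_extraction.stop_words is no longer importable in recent
# scikit-learn releases; the `text` module exposes ENGLISH_STOP_WORDS instead.
from sklearn.feature_extraction import text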


In [7]:
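from nltk.corpus import stopwords
stopwords = stopwords.words('english')  # note: rebinds the name to a plain list

# Iteration 1
iteration_1 = ['beer', 'head', 'taste', 'nice', 'good', 'like', 'medium', 'finish', 'bit', 'flavor', 'aroma', 
                'body', 'color', 'glass', 'little', 'pours', 'notes']
# Stop words must be strings, so add the digits as '0'..'9' rather than ints.
# (CountVectorizer's default token_pattern already drops single-character tokens.)
iteration_1 += [str(digit) for digit in range(10)]
stopwords.extend(iteration_1)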



In [1]:
#stopwords

In [9]:
# Vectorization options
CV = CountVectorizer(min_df=10, stop_words=stopwords)
text_vec = CV.fit_transform(beers['clean_reviews'])

# Convert to a DataFrame
# (scikit-learn >= 1.0; older releases use CV.get_feature_names() instead)
word_df = pd.DataFrame(text_vec.todense(), columns=CV.get_feature_names_out())
word_df.head()

Unnamed: 0,aa,aaah,aah,aal,aamber,aaron,ab,aba,aback,abacus,...,zombie,zone,zound,zs,zucchini,zum,zwanze,zwickel,zythos,zywiec
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [74]:
#[print(word) for word in word_df.columns]

In [10]:
# Word counts: summing every column at once avoids a slow row-by-row append loop.
word_count_df = word_df.sum(axis=0).rename_axis('Word').reset_index(name='Count')

word_count_df.head()

Unnamed: 0,Word,Count
0,aa,76
1,aaah,10
2,aah,21
3,aal,276
4,aamber,13


In [11]:
word_count_df.sort_values(by='Count', ascending=False, inplace = True)

In [12]:
#word_count_df.head(25)
# Possible additions to the stop words:
# ['well','one','bottle','lacing','overall','flavors','nose','really','smell']
word_count_df.tail(25)

Unnamed: 0,Word,Count
13154,prepp,10
7581,groundwork,10
16959,terrys,10
7576,grotesque,10
412,alcoholwise,10
1136,awakening,10
5120,donation,10
7567,grizzly,10
5117,domineering,10
1141,awas,10


# Spelling Errors

### There are many misspelled words in my corpus (turns out people writing after drinking beer are prone to making errors).

**Using the `autocorrect` library's `spell` function, I can correct many words.**
- One nice property: if it has no correction for a word, it just returns the original word.

- I have noticed that it miscorrects "Vanila" to "Manila" and "Molass" to "Molars". That is two out of two spot checks where it destroyed information that is critical for my model's logic (the intended words are obviously "Vanilla" and "Molasses").

- I'm going to need to inject some conditional probability into this spell checking using the `pyenchant` library, which returns a list of candidate spellings. Those candidates will be looked up in the corpus word counts, and the spelling that appears most often will be selected. (Hopefully "Vanilla" appears much more often in this dataset than "Manila".)
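
The selection rule in the last bullet can be sketched in plain Python. `most_frequent_suggestion` is a hypothetical helper name; the suggestion lists and counts are the ones that `d.suggest` and `word_count_df` report later in this notebook.

```python
def most_frequent_suggestion(word, suggestions, corpus_counts):
    """Return the suggestion seen most often in the corpus, or `word` if none was seen."""
    seen = [s.lower() for s in suggestions if s.lower() in corpus_counts]
    if not seen:
        return word
    return max(seen, key=corpus_counts.get)

# Counts as reported by word_count_df later in this notebook.
corpus_counts = {"vanilla": 30862, "manila": 36, "molasses": 133, "morass": 32}

# Suggestion lists as returned by pyenchant later in this notebook.
vanila_suggestions = ['vanilla', 'manila', 'Danila', 'Manila', 'Valina']
molass_suggestions = ['morass', 'molars', 'mo lass', 'molasses', 'molar', 'molal', 'Moluccas']

print(most_frequent_suggestion("vanila", vanila_suggestions, corpus_counts))  # vanilla
print(most_frequent_suggestion("molass", molass_suggestions, corpus_counts))  # molasses
```

Holding the counts in a dict keeps each lookup cheap, and a word with no matching suggestion in the corpus is returned unchanged.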

# PICK UP HERE FOR TEXT CLEANING

Using spell checking I can remove approx. 1,500 misspelled words.
- However, I noticed some occurrences of "Vanilla" misspelled as "Vanila", which this spell checker corrects to "Manila".

In [11]:
from autocorrect import spell
#spell('allagashs')
words_checked = [spell(word) for word in word_df.columns] # This part takes a while



In [13]:
print(len(word_df.columns))
print(len(set(words_checked)))

19206
17774


In [18]:
# autocorrect sometimes returns capitalized corrections (proper names), hence .lower()
correction_map_dict = {original:correction.lower() for original, correction in zip(word_df.columns,words_checked)}

In [121]:
import nltk
words = set(nltk.corpus.words.words())

In [26]:
def sentence_correct(clean_review):
    """Replace each word with its mapped correction; unmapped words pass through."""
    return ' '.join(correction_map_dict.get(word, word) for word in clean_review.split())

In [27]:
test = beers['clean_reviews'].loc[2]

sentence_correct(test)



'like brown scotch molars licorice nose lot lace brown body tap vols dry flavour lot different malt variety caramel hop kick finish good great lake brew ontario football field ontal c can mostly dig followthrough whole theme beer appear clear dark reddish amber hue one finger soapy white head leave thin webbed lace around glass smell caramel malt earthy weedy hop taste sweet caramel malt somewhat acrid piney hop carbonation moderate body average weight thin acidity finish ramp hop soapy piney flavour increase noticeably seem like cross basic american english ipa topwise pour deep dark brown color ale big foamy head good retention nice lace aroma dry malt medium note caramel malt taste also dominate medium caramel malt slightly bitter ending hop presence subdued expect medium carbonation interesting necessarily something would go back often pint sized receive dyan late trade time review number fucking metal consume listen new miley cyrus album unholy music could find appearance dirty cr

## Experimenting with Py Enchant

In [28]:
!pip install pyenchant

Collecting pyenchant
  Downloading https://files.pythonhosted.org/packages/4f/c5/5c18df3c5dbf2ce1e6fc8b0fcce1a5dfe7c4ec5ab33b76722fcca9cbfff5/pyenchant-2.0.0-py2.py3.cp27.cp32.cp33.cp34.cp35.cp36.pp27.pp33.pp35-none-macosx_10_6_intel.macosx_10_9_intel.whl (4.3MB)
    100% |████████████████████████████████| 4.3MB 3.2MB/s eta 0:00:01
Installing collected packages: pyenchant
Successfully installed pyenchant-2.0.0


In [34]:
# Importing the Enchant module 
import enchant 
  
# Using 'en_US' dictionary 
d = enchant.Dict("en_US") 
  
# Suggest similar words
# from the given dictionary
print(d.suggest("vanila")) 
print(d.suggest("molass")) 
print(d.suggest("allagash")) 

['vanilla', 'manila', 'Danila', 'Manila', 'Valina']
['morass', 'molars', 'mo lass', 'molasses', 'molar', 'molal', 'Moluccas']
['Malagasy', 'Allahabad', 'Gallagher', 'Callaghan']


In [41]:
# Complication: "manila" appears legitimately as a color descriptor.
word_count_df[word_count_df['Word'].isin(['vanilla', 'manila', 'Danila', 'Manila', 'Valina'])]
# However, the overwhelmingly more common option is "vanilla".

Unnamed: 0,Word,Count
18130,vanilla,30862
10211,manila,36


In [42]:
# OK, "morass" is a real word; however, it refers to swampland.
word_count_df[word_count_df['Word'].isin(['morass', 'molars', 'mo lass', 'molasses', 'molar', 'molal', 'Moluccas'])]

Unnamed: 0,Word,Count
10912,molasses,133
10988,morass,32


In [44]:
word_count_df[word_count_df['Word'].isin(['Malagasy', 'Allahabad', 'Gallagher', 'Callaghan','allagash'])]

Unnamed: 0,Word,Count
453,allagash,524


1. Figure out all unique words.
2. Run `CountVectorizer` with `min_df=1`.
3. Organize the data by the summed occurrences of each word.


In [71]:
CV2 = CountVectorizer(min_df=1, stop_words=stopwords)
AWV = CV2.fit_transform(beers['clean_reviews'])

# scikit-learn >= 1.0; older releases use CV2.get_feature_names() instead
all_words = pd.DataFrame(AWV.todense(), columns=CV2.get_feature_names_out())

# Sum every column at once rather than appending one row per word.
all_word_count_df = all_words.sum(axis=0).rename_axis('Word').reset_index(name='Count')

all_word_count_df.head()

Unnamed: 0,Word,Count
0,aa,76
1,aaa,3
2,aaaaaaaalllllllllllll,1
3,aaaaaaand,1
4,aaaaahahhahaa,1


(150110, 2)

In [83]:
# Note: this took about 5 hours to finish running.
corrections = []
for word in all_words.columns:
    # Get the suggested spelling options.
    check_list = d.suggest(word)

    # Filter the word counts down to the suggested spellings that occur in the corpus.
    t_df = all_word_count_df[all_word_count_df['Word'].isin(check_list)]

    # Total occurrences across all suggested spellings.
    check_list_occ = t_df['Count'].sum()

    # Keep any option that accounts for more than 50% of those occurrences.
    t_df2 = t_df[t_df['Count'] > check_list_occ*0.5]

    if len(t_df2) > 0:  # if so, take that option
        corrections.append(t_df2['Word'].values[0])
    else:  # otherwise, keep the original word
        corrections.append(word)
    

In [85]:
all_word_count_df['auto_corrected'] = corrections

In [87]:
# These results are poor. I need to find a way to apply this only to obviously misspelled words.
all_word_count_df.sort_values(by='Count', ascending=False)

Unnamed: 0,Word,Count,auto_corrected
64094,hop,214922,top
79445,malt,206710,malty
76027,light,191734,slight
103920,pour,158418,sour
23189,carbonation,129795,carbonation
129258,sweet,118065,sweet
35099,dark,105828,dank
73245,lace,101365,lace
91249,note,98648,nose
120450,smell,96138,smell
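
One possible way to limit the correction to obviously misspelled words is to skip any word the dictionary already accepts. The sketch below is not the method used above: `guarded_correction` and the stub lookups are hypothetical, and in the notebook `is_known` and `suggest` would presumably be pyenchant's `d.check` and `d.suggest`. The counts and suggestion lists are the ones shown in the outputs above.

```python
def guarded_correction(word, corpus_counts, is_known, suggest, share=0.5):
    """Correct `word` only if the dictionary rejects it and one suggestion dominates."""
    if is_known(word):
        return word  # correctly spelled words are never rewritten
    counts = {}
    for s in suggest(word):
        s = s.lower()
        if s in corpus_counts:
            counts[s] = corpus_counts[s]
    total = sum(counts.values())
    if total == 0:
        return word  # no suggestion occurs in the corpus
    best = max(counts, key=counts.get)
    return best if counts[best] > share * total else word

# Stubs standing in for d.check / d.suggest so the sketch runs without pyenchant.
KNOWN = {"hop", "malt", "light", "vanilla", "manila", "molasses"}
SUGGESTIONS = {
    "hop": ["top"],
    "vanila": ['vanilla', 'manila', 'Danila', 'Manila', 'Valina'],
    "molass": ['morass', 'molars', 'mo lass', 'molasses', 'molar', 'molal', 'Moluccas'],
    "allagash": ['Malagasy', 'Allahabad', 'Gallagher', 'Callaghan'],
}
corpus_counts = {"hop": 214922, "vanilla": 30862, "manila": 36,
                 "molasses": 133, "morass": 32, "allagash": 524}

for w in ["hop", "vanila", "molass", "allagash"]:
    print(w, "->", guarded_correction(w, corpus_counts, KNOWN.__contains__,
                                      lambda x: SUGGESTIONS.get(x, [])))
```

With these inputs "hop" is left alone, "vanila" and "molass" map to "vanilla" and "molasses", and "allagash" is kept because none of its suggestions occur in the corpus. Looking counts up in a dict instead of filtering the DataFrame for every word may also reduce the runtime, although the `suggest` calls themselves could still dominate.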
