# EDA and Further Cleaning

In this notebook I use exploratory data analysis (EDA) to identify what further cleaning my data needs. I expect to remove many features after vectorization, which means expanding my stop word vocabulary.

**I hypothesize I will need to ...**
- Remove references to the beer itself or to other beers from reviews.
- Remove any reference to the style of beer.
- Remove words that relate to stories (I noticed some reviews talk about _how_ the reviewer came across the beer in question).
- Remove words that don't appear often and/or don't contribute to describing the beer itself.
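
As a starting point for the first two hypotheses, here is a minimal sketch of turning beer names and styles into extra stop words. The helper `build_custom_stop_words` and the tiny sample lists are illustrative assumptions; in this notebook the values would come from the `Beer_Name` and `Type` columns of the dataset.

```python
import re

def build_custom_stop_words(base_stop_words, *text_columns):
    """Return base_stop_words plus every lowercase token found in the given columns."""
    extra = set()
    for column in text_columns:
        for value in column:
            # Keep alphabetic tokens only, so "Milk / Sweet Stout" -> milk, sweet, stout
            extra.update(re.findall(r"[a-z]+", value.lower()))
    return set(base_stop_words) | extra

# Hypothetical sample values, copied from the first rows of the dataset
beer_names = ["Triple Bag", "Hornswoggled"]
beer_types = ["Milk / Sweet Stout", "Irish Red Ale"]

custom_stop_words = build_custom_stop_words({"the", "and"}, beer_names, beer_types)
print(sorted(custom_stop_words))
```

Blanket removal like this would also strip genuinely descriptive words such as "sweet" or "red", so the resulting list probably needs a manual review. The cleaned reviews also appear to be lemmatized (for example "hornswoggle"), so the name tokens may need the same treatment before they match.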

In [2]:
import pandas as pd


In [3]:
beers = pd.read_csv('data/Beers_clean.csv')
#beers2 = pd.read_csv('data/Beers_clean_2.csv')

In [4]:
beers.head()

Unnamed: 0,Beer_Name,Brewery_Name,ABV,Type,full_text,clean_reviews
0,That's What She Said,Tree House Brewing Company,5.6,Milk / Sweet Stout,"thanks to my buddy back east for the pint can,...",thank buddy back east pint cool beer first non...
1,Triple Bag,Long Trail Brewing Co.,11.0,American Strong Ale,Was surprised to even see this at my local sto...,surprised even see local store buy pack lo...
2,Great Lakes Devil's Pale Ale,Great Lakes Brewery,6.6,English Pale Ale,Like a brown or scotch with molasses and licor...,like brown scotch molass licorice nose lot lac...
3,Hornswoggled,Cigar City Brewing,5.0,Irish Red Ale,On tap at Engine 15. \nSurprised to see that t...,tap engine surprise see hornswoggle red ale...
4,Schlafly Raspberry Hefeweizen,The Schlafly Tap Room,4.1,Fruit / Vegetable Beer,A: The beer is hazy reddish yellow in color an...,beer hazy reddish yellow color light amount vi...


In [5]:
# review example
beers['clean_reviews'].loc[2]

'like brown scotch molass licorice nose lot lace brown body tap volo dry flavour lot different malt variety caramel hop kick finish good great lake brew ontario football field ontap c   can mostly dig followthrough whole   theme   beer appear clear dark reddish amber hue one finger soapy white head leave thin webbed lace around glass smell caramel malt earthy weedy hop taste sweet caramel malt somewhat acrid piney hop carbonation moderate body average weight thin acidity finish ramp hop soapy piney flavour increase noticeably   seem like cross basic american english ipa hopwise pour deep dark brown color ale big foamy head good retention nice lace aroma dry malt medium note caramel malt taste also dominate medium caramel malt slightly bitter ending hop presence subdued expect medium carbonation interesting necessarily something would go back often pint sized receive dyan late trade time review number   fucking metal consume listen new miley cyrus album unholy music could find   appeara

In [6]:
# NLP Imports
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from sklearn.feature_extraction import text


In [7]:
from nltk.corpus import stopwords
# Note: this rebinds `stopwords` from the NLTK corpus reader to a plain list of English stop words
stopwords = stopwords.words('english')


In [8]:
#stopwords

In [9]:
# Vectorization options
CV = CountVectorizer(min_df=10, stop_words=stopwords)
text_vec = CV.fit_transform(beers['clean_reviews'])

# Convert to a DataFrame
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
word_df = pd.DataFrame(text_vec.toarray(), columns=CV.get_feature_names_out())
word_df.head()

Unnamed: 0,aa,aaah,aah,aal,aamber,aaron,ab,aba,aback,abacus,...,zombie,zone,zound,zs,zucchini,zum,zwanze,zwickel,zythos,zywiec
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
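
For reference, `min_df=10` keeps a term only if it appears in at least 10 documents (reviews), not 10 times overall. The pure-Python sketch below mimics that filter on a toy corpus. `filter_by_document_frequency` is a hypothetical helper, not part of scikit-learn, and it splits on whitespace rather than using `CountVectorizer`'s token pattern.

```python
from collections import Counter

def filter_by_document_frequency(documents, min_df):
    """Return the sorted vocabulary of terms appearing in at least min_df documents."""
    doc_freq = Counter()
    for doc in documents:
        # set() so a term repeated within one document is counted once
        doc_freq.update(set(doc.split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Toy corpus (made up for illustration)
toy_reviews = [
    "caramel malt caramel hop",
    "caramel malt finish",
    "piney hop finish",
]
vocab = filter_by_document_frequency(toy_reviews, min_df=2)
print(vocab)
```

Here "piney" is dropped because it occurs in only one toy review, while "caramel" counts as two documents even though it is written three times.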


In [10]:
# Word counts: summing every column in one call avoids a slow row-by-row loop
word_count_df = (word_df.sum(axis=0)
                        .rename_axis('Word')
                        .reset_index(name='Count'))

word_count_df.head()

Unnamed: 0,Word,Count
0,aa,76
1,aaah,10
2,aah,21
3,aal,276
4,aamber,13


In [11]:
word_count_df.sort_values(by='Count', ascending=False, inplace = True)

In [12]:
#word_count_df.head(25)
word_count_df.tail(25)

Unnamed: 0,Word,Count
5242,drinakability,10
14825,sehr,10
10665,midtongue,10
17093,thisbut,10
10679,mikereaser,10
14819,seethe,10
18239,verrry,10
2318,brownnearly,10
2314,brownishtan,10
2205,breweryorient,10


# Spelling Errors

### There are many misspelled words in my corpus. (Turns out people writing after drinking beer are prone to making errors.)

**Using the `autocorrect` library's `spell` function, I can correct many words.**
- One nice property is that if it has no correction for a word, it simply returns the original word.

- However, I have noticed it miscorrecting "Vanila" to "Manila" and "Molass" to "Molars" (the intended words are obviously "Vanilla" and "Molasses"). Both cases I checked destroyed information that is critical to my model's logic.

- I'm going to need to inject some conditional probability into this spell checking using the `pyenchant` library, which returns a list of candidate spellings. Each candidate will be looked up in the word counts for the corpus, and the spelling that appears most often will be selected. (Hopefully "Vanilla" appears much more often in this dataset than "Manila".)
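
The selection rule described in the last bullet can be sketched in a few lines. This is a simplified illustration, not the final implementation: `pick_by_corpus_frequency` is a hypothetical name, the counts are made up, and the candidate lists stand in for whatever the spell checker suggests.

```python
def pick_by_corpus_frequency(candidates, corpus_counts):
    """Return the candidate that occurs most often in the corpus, or None if none occur."""
    scored = [(corpus_counts.get(c.lower(), 0), c.lower()) for c in candidates]
    best_count, best_word = max(scored, default=(0, None))
    return best_word if best_count > 0 else None

# Hypothetical counts; in practice these would come from the vectorized corpus
corpus_counts = {"vanilla": 5200, "villa": 3, "molasses": 1800}

# Hypothetical suggestion lists, standing in for a spell checker's output
print(pick_by_corpus_frequency(["Manila", "vanilla", "villa"], corpus_counts))
print(pick_by_corpus_frequency(["molars", "molasses"], corpus_counts))
```

Returning `None` when no candidate appears in the corpus leaves room to treat such words separately instead of forcing a correction.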

# PICK UP HERE FOR TEXT CLEANING



Using spell checking I can remove approx. 1500 misspelled words.
- However, I noticed that some occurrences of "Vanilla" are misspelled as "Vanila", which this spell checker will correct to "Manila".

1. Vectorize all text to get a list of all words.
2. Run the list through the PyEnchant English dictionary to identify non-English words.
    - We will assume that these words are either misspellings or random gibberish.
3. _Figure out a way to identify random gibberish._
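
For step 3, one possible heuristic, untested on this corpus, is to flag tokens that contain the same character three or more times in a row or that have no vowels at all. `looks_like_gibberish` is a hypothetical helper and the rule itself is an assumption:

```python
import re

TRIPLE_CHAR = re.compile(r"(.)\1\1")   # same character three times in a row
HAS_VOWEL = re.compile(r"[aeiouy]")    # treat y as a vowel ("dry", "hymn")

def looks_like_gibberish(token):
    """Heuristically flag tokens that are unlikely to be a misspelled real word."""
    token = token.lower()
    return bool(TRIPLE_CHAR.search(token)) or not HAS_VOWEL.search(token)

# Sample tokens taken from the vocabulary shown in this notebook
for token in ["aaah", "verrry", "zs", "vanila", "choclate"]:
    print(token, looks_like_gibberish(token))
```

Note that this also flags emphatic spellings such as "verrry", which may or may not be desirable; plain misspellings such as "vanila" and "choclate" pass through to the spell-checking step.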

In [13]:
#!pip install autocorrect
#!pip install pyenchant

In [14]:
word_df.columns[:10]

Index(['aa', 'aaah', 'aah', 'aal', 'aamber', 'aaron', 'ab', 'aba', 'aback',
       'abacus'],
      dtype='object')

In [15]:
import enchant

In [16]:
d = enchant.Dict("en_US")

In [17]:
# Check every word to see whether it exists in the English dictionary
word_check = {word : d.check(word) for word in word_df.columns}

In [18]:
# Wow, I thought I was bad at English and spelling.
# These sorry review saps managed to use 7500 words that don't exist,
# each in 10 or more reviews!
pd.Series(word_check).value_counts()


True     11722
False     7500
dtype: int64

In [19]:
from autocorrect import spell

spell('ahhhh')


'ahahah'

In [20]:
is_word_df = pd.DataFrame(pd.Series(word_check), columns = ['IsWord'])
test_pack  = is_word_df[is_word_df['IsWord']==False].sample(6, random_state = 5)
# Test words to run through the function I need to build
print(test_pack.index)


Index(['thisit', 'whtie', 'offflavor', 'definitly', 'choclate', 'onthe'], dtype='object')


In [23]:
true_words  = is_word_df[is_word_df['IsWord']==True].index
false_words = is_word_df[is_word_df['IsWord']==False].index


In [200]:
false_words



Index(['aa', 'aaah', 'aah', 'aal', 'aamber', 'aaron', 'aba', 'abas', 'abbaye',
       'abbeystyle',
       ...
       'zinfindale', 'zippiness', 'zoe', 'zound', 'zs', 'zum', 'zwanze',
       'zwickel', 'zythos', 'zywiec'],
      dtype='object', length=7500)

In [30]:
true_words


Index(['ab', 'aback', 'abacus', 'abandon', 'abate', 'abates', 'abbey',
       'abbeys', 'abbot', 'abbreviated',
       ...
       'zestful', 'zesty', 'zilch', 'zing', 'zingy', 'zip', 'zippy', 'zombie',
       'zone', 'zucchini'],
      dtype='object', length=11722)

In [33]:
true_count_df = word_count_df[word_count_df['Word'].isin(true_words)]

In [115]:
import nltk
from nltk.corpus import brown

In [189]:
from time import sleep

In [194]:
word_dfc = word_df.copy()

In [195]:
# Brown corpus tags treated as descriptive: adverb, adjective, superlative adjective
DESCRIPTIVE_TAGS = ('RB', 'JJ', 'JJT')

def corpus_spell(word, corpus=true_count_df):
    """Merge the misspelled column `word` of word_dfc into its likeliest correction, or drop it."""
    # Count how often each PyEnchant suggestion occurs among the valid corpus words
    counts = corpus.set_index('Word')['Count']
    occurrence_dict = {s_word: counts.get(s_word, 0) for s_word in d.suggest(word)}

    s_word_df = pd.DataFrame(pd.Series(occurrence_dict), columns=['Occurrence'])
    s_word_df['OccurrenceFrequency'] = s_word_df['Occurrence'] / s_word_df['Occurrence'].sum()
    s_word_df.sort_values(by='Occurrence', inplace=True, ascending=False)
    s_word_df.reset_index(inplace=True)

    top = s_word_df.iloc[0]   # most frequent suggestion; raises IndexError if there is none
    sleep(1)                  # brief pause between words
    print(top['Occurrence'], top['OccurrenceFrequency'])

    change = None
    if top['Occurrence'] > 100 and top['OccurrenceFrequency'] > 0.75:
        # Only the top suggestion is used, so look up its Brown tags alone
        pos = nltk.FreqDist(t for w, t in brown.tagged_words()
                            if w.lower() == top['index']).most_common()
        checked_tags = [tag for tag, _ in pos[:2]]   # at most the two most common tags
        if not checked_tags:
            reason = 'No Parts of speech found'
        elif any(tag in DESCRIPTIVE_TAGS for tag in checked_tags):
            # Fold the misspelled word's counts into the corrected word's column
            change = top['index']
            word_dfc[change] = word_dfc[change] + word_dfc[word]
        elif len(checked_tags) > 1:
            reason = 'Not a descriptive word : 2 parts checked'
        else:
            reason = 'Not a descriptive word : 1 part checked'
    else:
        reason = 'Not frequent enough'

    # The misspelled column is removed in every case
    word_dfc.drop(word, axis=1, inplace=True)
    if change is not None:
        print(f"{word} corrected to {change}")
    else:
        print('Dropped column ', word)
        print(reason)


In [167]:
from IPython.display import clear_output

In [176]:
false_words[103]

'ahold'

In [191]:
corpus_spell('ahold')

10739 1.0
Dropped column  ahold
Not a descriptive word : 2 parts checked


In [196]:
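words_corrected = 0   # number of words processed so far
issuewords = []       # words for which corpus_spell raised an error
for w in false_words:
    print(words_corrected)
    try:
        corpus_spell(w)
    except Exception:
        issuewords.append(w)
    words_corrected += 1
    # Clear the cell output every 10 words to keep the notebook readable
    if words_corrected % 10 == 0:
        clear_output()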

In [197]:
word_dfc.to_csv('corrected_cvec.csv')

In [198]:
word_dfc.shape

(9490, 11751)

In [199]:
word_df.shape

(9490, 19222)

In [201]:
from concurrent.futures import ThreadPoolExecutor
import sys

def some_action(line):
    print('a')

with ThreadPoolExecutor() as executor:
    for line in sys.stdin:
        print(line)
        future = executor.submit(some_action, line)

In [202]:
import multiprocessing

pool = multiprocessing.Pool(  )

multiprocessing.cpu_count()

NameError: name 'args' is not defined

In [207]:
from multiprocessing import Pool
import time

def f(x):
    return x*x
start = time.time()
if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
        
print(time.time() - start)

[1, 4, 9]
0.1437549591064453


In [208]:
nltk.FreqDist(t for w, t in brown.tagged_words() if w.lower() == 'time').most_common()


[('NN', 1568), ('NN-TL', 21), ('NN-HL', 8), ('VB', 1)]

In [210]:
start = time.time()
for word in word_dfc.columns[:5]:
    print(nltk.FreqDist(t for w, t in brown.tagged_words() if w.lower() ==  word).most_common())
    
print(time.time() - start)


[('NN', 2), ('NP-TL', 1)]
[('RB', 2)]
[]
[('VB', 15), ('NN', 2)]
[]
15.329962968826294


In [2]:
951/6

158.5

In [277]:

def identify_if_desc(word):
    """Return `word` if the Brown corpus ever tags it as an adjective or adverb, else None."""
    pos = nltk.FreqDist(t for w, t in brown.tagged_words() if w.lower() == word)
    # Iterating a FreqDist yields the tags observed for this word
    if any(tag in ('RB', 'JJ', 'JJT', 'JJR', 'JJS', 'RBR', 'RBT') for tag in pos):
        return word
    return None


In [276]:
# How many worker processes I can use. A common rule of thumb is the CPU count minus 1.
multiprocessing.cpu_count()

8

In [278]:
start = time.time()
if __name__ == '__main__':
    with Pool(7) as p:
        x = p.map(identify_if_desc , word_dfc.columns)
        
print(time.time() - start)

# Keep the unique descriptive words; drop the None returned for non-descriptive words
x = set(x) - {None}

9419.33224105835


In [279]:
9419/(60*60)

2.616388888888889
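
The timings above suggest each lookup rescans the whole tagged Brown corpus (about 3 seconds per word in the 5-word test). A possible alternative, sketched under assumptions, is to build a word-to-tags index in a single pass and then answer every lookup from that dictionary. `build_tag_index` is a hypothetical helper, and the small list below only stands in for `brown.tagged_words()`, which yields `(word, tag)` pairs in the same shape.

```python
from collections import Counter, defaultdict

def build_tag_index(tagged_words):
    """Map each lowercased word to a Counter of its part-of-speech tags in one pass."""
    index = defaultdict(Counter)
    for word, tag in tagged_words:
        index[word.lower()][tag] += 1
    return index

# Hypothetical stand-in for nltk's brown.tagged_words()
sample_tagged_words = [("Time", "NN"), ("time", "NN"), ("time", "VB"), ("sweet", "JJ")]

tag_index = build_tag_index(sample_tagged_words)
print(tag_index["time"].most_common())    # tags seen for a known word
print(tag_index["zywiec"].most_common())  # empty list for an unseen word
```

With the real corpus the index would be built once, and each of the thousands of vocabulary lookups becomes a dictionary access rather than a full scan, which may remove the need for multiprocessing altogether.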

In [284]:
# Dump the descriptive words into a pickle file.
import pickle
file = 'descriptive_words'
with open(file, 'wb') as outfile:
    pickle.dump(x, outfile)