In [1]:
import pandas as pd
# import NLTK (natural language toolkit)
import nltk 
nltk.download('wordnet') # WordNet lexical database (used by the lemmatizer)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4') # open multilingual wordnet library

[nltk_data] Downloading package wordnet to C:\Users\javier.perez-
[nltk_data]     alvaro\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\javier.perez-
[nltk_data]     alvaro\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\javier.perez-
[nltk_data]     alvaro\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\javier.perez-
[nltk_data]     alvaro\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to C:\Users\javier.perez-
[nltk_data]     alvaro\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

# Stemming & Lemmatization

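Stemming and lemmatization are techniques for normalizing text: both reduce the inflected forms of a word to a common base form. Stemming strips suffixes with fixed rules and may produce non-words; lemmatization looks words up in a vocabulary and returns their dictionary form.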

reading -> read

Books -> book

Stories -> stori (from stemming) or story (from lemmatization)

More info [here](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

## The problem with the bag-of-words approach

In [2]:
# toy training data
X_train = ['I love the book',
           'This is a great book',
           'The fit is great',
           'I love the shoes']
y_train = ['books',
           'books',
           'clothings',
           'clothings']

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
pd.DataFrame(data=X_train_dtm.toarray(),
             columns=vect.get_feature_names_out(),
             index=X_train)

,book,fit,great,is,love,shoes,the,this
I love the book,1,0,0,0,1,0,1,0
This is a great book,1,0,1,1,0,0,0,1
The fit is great,0,1,1,1,0,0,1,0
I love the shoes,0,0,0,0,1,1,1,0


In [4]:
# train a naive bayes model
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB()
nb_clf.fit(X_train_dtm,y_train)

MultinomialNB()

In [5]:
# toy testing data 
X_test = ['I like the book',
          'Shoes are alright',
          'I love the books',
          'I lost a shoe']

X_test_dtm = vect.transform(X_test)
nb_clf.predict(X_test_dtm)

array(['books', 'clothings', 'clothings', 'books'], dtype='<U9')

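The predictions for 'I love the books' and 'I lost a shoe' are wrong. Why? The vectorizer's vocabulary was built from the training data, which contains 'book' and 'shoes' but not 'books' or 'shoe'. Tokens outside the vocabulary are silently ignored at transform time, so the model never sees the one word in each sentence that identifies its category.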
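
To see which tokens the vectorizer drops, one can compare each test sentence against the fitted vocabulary. The sketch below re-creates the toy training set so it runs on its own; the helper name `unseen_tokens` is ours, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# same toy training data as above
X_train = ['I love the book',
           'This is a great book',
           'The fit is great',
           'I love the shoes']

vect = CountVectorizer().fit(X_train)
analyzer = vect.build_analyzer()  # same lowercasing and token splitting used by fit/transform

def unseen_tokens(doc):
    """Return the tokens of `doc` that are missing from the fitted vocabulary."""
    return [tok for tok in analyzer(doc) if tok not in vect.vocabulary_]

for doc in ['I love the books', 'I lost a shoe']:
    print(doc, '->', unseen_tokens(doc))

# a sentence made only of unseen tokens becomes an all-zero row
print(vect.transform(['I lost a shoe']).toarray())
```

For 'I lost a shoe' every token that survives tokenization is unseen, so the row is all zeros and the prediction rests on the class priors alone.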

## Stemming

In [6]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

In [7]:
# initialize the stemmer
stemmer = PorterStemmer()

In [8]:
stemmer.stem('reading')

'read'

In [9]:
stemmer.stem('books')

'book'

In [10]:
# organize, organizes, and organizing
stemmer.stem('organize')

'organ'

In [11]:
stemmer.stem('organizes')

'organ'

In [12]:
stemmer.stem('organizing')

'organ'

The tokenizer breaks a sentence into its individual tokens (words and punctuation marks).

In [13]:
phrase = 'I love the books.'
words = word_tokenize(phrase)
words

['I', 'love', 'the', 'books', '.']

In [14]:
stemmed_words = [stemmer.stem(word) for word in words]
stemmed_words

['i', 'love', 'the', 'book', '.']

In [15]:
' '.join(stemmed_words)

'i love the book .'
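
One way to apply stemming to the toy example from the bag-of-words section is to plug the stemmer into the vectorizer through its `analyzer` parameter. This is a minimal sketch and not the only option; `stemmed_analyzer` is a hypothetical helper name, and the default scikit-learn tokenization is used here instead of `word_tokenize`.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

X_train = ['I love the book',
           'This is a great book',
           'The fit is great',
           'I love the shoes']
y_train = ['books', 'books', 'clothings', 'clothings']

stemmer = PorterStemmer()
base_analyzer = CountVectorizer().build_analyzer()  # default lowercasing + tokenization

def stemmed_analyzer(doc):
    """Tokenize with the default analyzer, then stem every token."""
    return [stemmer.stem(tok) for tok in base_analyzer(doc)]

vect = CountVectorizer(analyzer=stemmed_analyzer)
X_train_dtm = vect.fit_transform(X_train)
nb_clf = MultinomialNB().fit(X_train_dtm, y_train)

# the two sentences that were misclassified without stemming
X_test = ['I love the books', 'I lost a shoe']
predictions = nb_clf.predict(vect.transform(X_test))
print(predictions)
```

Because 'books' and 'shoes' are now stored as 'book' and 'shoe', both test sentences share a token with the right category.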

## Lemmatization

In [16]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [17]:
lemmatizer = WordNetLemmatizer()

The lemmatizer needs to know each token's part of speech. If `pos` is omitted, it treats the token as a noun.

In [18]:
lemmatizer.lemmatize('eats', pos='v')

'eat'

In [19]:
lemmatizer.lemmatize('ate', pos='v')

'eat'

In [20]:
lemmatizer.lemmatize('running', pos='v') # she is running 

'run'

In [21]:
lemmatizer.lemmatize('running', pos='n') # running is good for you

'running'

In [22]:
lemmatizer.lemmatize('better', pos = 'r')  # She sings better than me (adverb)

'well'

In [23]:
lemmatizer.lemmatize('better', pos='v') # to better oneself

'better'

In [24]:
lemmatizer.lemmatize('better', pos='a') # The better team won the match (adjective)

'good'

In [25]:
# parts of speech tagging
pos_list = nltk.pos_tag(words)
pos_list

[('I', 'PRP'), ('love', 'VBP'), ('the', 'DT'), ('books', 'NNS'), ('.', '.')]

In [26]:
# map a Penn Treebank POS tag to the corresponding WordNet POS constant
def process_pos(pos):
    if pos.startswith('J'): # adjectives
        return wordnet.ADJ
    elif pos.startswith('V'): # verbs
        return wordnet.VERB
    elif pos.startswith('N'): # nouns
        return wordnet.NOUN
    elif pos.startswith('R'): # adverbs
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [27]:
words

['I', 'love', 'the', 'books', '.']

In [28]:
lemmatized_words = [lemmatizer.lemmatize(word, pos=process_pos(pos)) 
                    for word,pos 
                    in nltk.pos_tag(words)]
lemmatized_words

['I', 'love', 'the', 'book', '.']

In [29]:
' '.join(lemmatized_words)

'I love the book .'
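
The `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` constants are the single-letter strings `'a'`, `'v'`, `'n'` and `'r'`, so the tag mapping can also be written as a dictionary lookup on the first letter of the Penn Treebank tag. The sketch below is an equivalent, dependency-free variant; the names `TAG_TO_WORDNET` and `to_wordnet_pos` are ours.

```python
# first letter of a Penn Treebank tag -> WordNet POS code
TAG_TO_WORDNET = {'J': 'a',   # adjectives (JJ, JJR, JJS)
                  'V': 'v',   # verbs (VB, VBD, VBG, ...)
                  'N': 'n',   # nouns (NN, NNS, NNP, ...)
                  'R': 'r'}   # adverbs (RB, RBR, RBS)

def to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return TAG_TO_WORDNET.get(tag[:1], 'n')

# tagged tokens copied from the nltk.pos_tag output above
tagged = [('I', 'PRP'), ('love', 'VBP'), ('the', 'DT'), ('books', 'NNS'), ('.', '.')]
print([(word, to_wordnet_pos(tag)) for word, tag in tagged])
```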

## Stopwords Removal

Stopwords are the most common words in English: this, that, he, it, ... They carry little meaning on their own, so they are often removed before vectorizing.

In [30]:
from nltk.corpus import stopwords

In [31]:
stop_words = stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [32]:
phrase = 'here is an example sentence demonstrating the removal of stopwords'
phrase

'here is an example sentence demonstrating the removal of stopwords'

In [33]:
words = word_tokenize(phrase)
stripped_phrase = [word for word in words if word not in stop_words]
" ".join(stripped_phrase)

'example sentence demonstrating removal stopwords'
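
NLTK's English stopword list is all lowercase, so the membership test above is case-sensitive: a capitalized 'Here' would not be removed. A possible fix is to compare lowercased tokens, and to keep the stopwords in a set for faster lookups. The sketch uses a small hand-copied subset of the list (an assumption made to keep the example self-contained) and a plain `split()` as a stand-in for `word_tokenize`.

```python
# small subset of stopwords.words('english'), copied by hand for illustration
stop_words = {'here', 'is', 'an', 'the', 'of'}

phrase = 'Here is an example sentence demonstrating the removal of stopwords'
words = phrase.split()  # stand-in for word_tokenize

# original filter: exact, case-sensitive match
case_sensitive = [w for w in words if w not in stop_words]
# variant: lowercase each token before the lookup
case_insensitive = [w for w in words if w.lower() not in stop_words]

print(' '.join(case_sensitive))    # 'Here' survives
print(' '.join(case_insensitive))  # 'Here' is removed
```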

## Punctuation removal

In [34]:
import string
punctuation = [punc for punc in string.punctuation]
punctuation

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [35]:
phrase = 'Hello! How are you?'
words = word_tokenize(phrase)
stripped_phrase = [word for word in words if word not in punctuation]
" ".join(stripped_phrase)

'Hello How are you'

## Yelp reviews

In [36]:
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/yelp.csv'
yelp = pd.read_csv(url)[['text','stars']]
yelp.head()

,text,stars
0,My wife took me here on my birthday for breakf...,5
1,I have no idea why some people give bad review...,5
2,love the gyro plate. Rice is so good and I als...,4
3,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",5
4,General Manager Scott Petello is a good egg!!!...,5


In [37]:
yelp.stars

0       5
1       5
2       4
3       5
4       5
       ..
9995    3
9996    4
9997    4
9998    2
9999    5
Name: stars, Length: 10000, dtype: int64

In [38]:
# keep only the 5-star and 1-star reviews
yelp = yelp[yelp.stars.isin([1,5])].reset_index(drop=True)
yelp

,text,stars
0,My wife took me here on my birthday for breakf...,5
1,I have no idea why some people give bad review...,5
2,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",5
3,General Manager Scott Petello is a good egg!!!...,5
4,Drop what you're doing and drive here. After I...,5
...,...,...
4081,Yes I do rock the hipster joints. I dig this ...,5
4082,Only 4 stars? \n\n(A few notes: The folks that...,5
4083,I'm not normally one to jump at reviewing a ch...,5
4084,Let's see...what is there NOT to like about Su...,5


In [39]:
print(yelp.loc[0,'text'])

My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I've ever had.

Anyway, I can't wait to go back!


In [40]:
text = yelp.loc[0,'text']
words = word_tokenize(text)
words = [word.lower() for word in words]
lemmatized_words = [lemmatizer.lemmatize(word, pos=process_pos(pos)) 
                    for word,pos in nltk.pos_tag(words) 
                    if word not in stop_words and word not in punctuation]
print(' '.join(lemmatized_words))

wife take birthday breakfast excellent weather perfect make sit outside overlook ground absolute pleasure waitress excellent food arrive quickly semi-busy saturday morning look like place fill pretty quickly early get good favor get bloody mary phenomenal simply best 've ever 'm pretty sure use ingredient garden blend fresh order amaze everything menu look excellent white truffle scramble egg vegetable skillet tasty delicious come 2 piece griddle bread amaze absolutely make meal complete best `` toast '' 've ever anyway ca n't wait go back
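
The processed review above still contains quote tokens (two backticks and two apostrophes, which is how the tokenizer renders double quotes), because the filter only removes tokens that exactly match a single punctuation character. One possible refinement, sketched below, drops every token that contains no letter or digit. `has_alnum` is a hypothetical helper, and whether contraction pieces such as `'ve` or `n't` should also be removed is a separate choice.

```python
def has_alnum(token):
    """True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# a few tokens taken from the processed review above
tokens = ['best', '``', 'toast', "''", "'ve", 'ever', 'ca', "n't", 'wait']
kept = [tok for tok in tokens if has_alnum(tok)]
print(kept)
```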


In [41]:
# apply the same preprocessing to every review
def preprocess(text):
    words = [word.lower() for word in word_tokenize(text)]
    lemmatized_words = [lemmatizer.lemmatize(word, pos=process_pos(pos)) 
                        for word,pos in nltk.pos_tag(words) 
                        if word not in stop_words and word not in punctuation]
    return ' '.join(lemmatized_words)

yelp['processed_text'] = yelp.text.apply(preprocess)

In [42]:
yelp

,text,stars,processed_text
0,My wife took me here on my birthday for breakf...,5,wife take birthday breakfast excellent weather...
1,I have no idea why some people give bad review...,5,idea people give bad review place go show plea...
2,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",5,rosie dakota love chaparral dog park 's conven...
3,General Manager Scott Petello is a good egg!!!...,5,general manager scott petello good egg go deta...
4,Drop what you're doing and drive here. After I...,5,drop 're drive eat go back next day food good ...
...,...,...,...
4081,Yes I do rock the hipster joints. I dig this ...,5,yes rock hipster joint dig place little bit sc...
4082,Only 4 stars? \n\n(A few notes: The folks that...,5,4 star note folk rat place low must isolate in...
4083,I'm not normally one to jump at reviewing a ch...,5,'m normally one jump review chain restaurant e...
4084,Let's see...what is there NOT to like about Su...,5,let 's see ... like surprise stadium well 9.50...


In [43]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

In [44]:
y = yelp.stars
X = yelp.processed_text

In [45]:
X_train,X_test,y_train,y_test = train_test_split(X,y)
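
`train_test_split` shuffles randomly, so the results that follow change from run to run unless `random_state` is fixed. Because 5-star reviews far outnumber 1-star reviews in this dataset, passing `stratify` is one way to keep the class proportions the same in both splits. The sketch uses made-up toy data, not the Yelp reviews, and the `test_size` and `random_state` values are arbitrary.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# made-up imbalanced toy data: 20 one-star and 80 five-star "reviews"
X = [f'review {i}' for i in range(100)]
y = [1] * 20 + [5] * 80

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# both splits keep the 20/80 class ratio
print(Counter(y_tr), Counter(y_te))
```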

In [84]:
pipe = Pipeline(steps=[
    ('vect', CountVectorizer()), 
    ('clf', MultinomialNB()) 
])

In [48]:
pipe.fit(X_train,y_train)

Pipeline(steps=[('vect', TfidfVectorizer()), ('clf', MultinomialNB())])

In [49]:
y_test_pred = pipe.predict(X_test)

In [50]:
confusion_matrix(y_test,y_test_pred)

array([[  3, 177],
       [  0, 842]], dtype=int64)

In [105]:
params_dic =  {'vect__max_features' : [2000,5000,7000,10000],
               'vect__min_df' : [5,25,50],
               'vect__max_df' : [1.0,0.9,0.8],
               'vect__ngram_range' : [(1,1), (1,2)],
               }

grid = GridSearchCV(pipe,
                    params_dic,
                    scoring='accuracy', 
                    cv=5, 
                    n_jobs=-1,
                    verbose=2)
grid.fit(X_train,y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect', CountVectorizer()),
                                       ('clf', MultinomialNB())]),
             n_jobs=-1,
             param_grid={'vect__max_df': [1.0, 0.9, 0.8],
                         'vect__max_features': [2000, 5000, 7000, 10000],
                         'vect__min_df': [5, 25, 50],
                         'vect__ngram_range': [(1, 1), (1, 2)]},
             scoring='accuracy', verbose=2)

In [106]:
grid.best_params_

{'vect__max_df': 1.0,
 'vect__max_features': 5000,
 'vect__min_df': 5,
 'vect__ngram_range': (1, 2)}
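
Each key in the grid has the form `<step name>__<parameter>`, which is how scikit-learn addresses a parameter of a pipeline step. A quick way to check that a key is valid is to look it up in `get_params()`. The snippet rebuilds the same two-step pipeline so it can run on its own.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline(steps=[('vect', CountVectorizer()),
                       ('clf', MultinomialNB())])

# every tunable parameter appears under '<step>__<parameter>'
params = pipe.get_params()
print(params['vect__ngram_range'], params['vect__min_df'], params['clf__alpha'])

# set_params uses the same naming scheme
pipe.set_params(vect__ngram_range=(1, 2), vect__min_df=5)
```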

In [107]:
best_pipe = grid.best_estimator_

In [108]:
y_test_pred = best_pipe.predict(X_test)
confusion_matrix(y_test,y_test_pred)

array([[140,  40],
       [ 36, 806]], dtype=int64)

In [109]:
accuracy_score(y_test,y_test_pred)

0.9256360078277887
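
The two confusion matrices above can be read row by row: with the labels sorted as [1, 5], rows are the true classes and columns the predicted ones. The sketch below copies both matrices by hand and computes accuracy and per-class recall with NumPy; `summarize` is a hypothetical helper.

```python
import numpy as np

# confusion matrices copied from the outputs above
# (rows: true 1-star, 5-star; columns: predicted 1-star, 5-star)
cm_before = np.array([[3, 177], [0, 842]])     # pipeline before the grid search
cm_after = np.array([[140, 40], [36, 806]])    # best estimator from the grid search

def summarize(cm):
    accuracy = np.trace(cm) / cm.sum()       # correct predictions / all predictions
    recall = np.diag(cm) / cm.sum(axis=1)    # per true class: fraction predicted correctly
    return accuracy, recall

for name, cm in [('before', cm_before), ('after', cm_after)]:
    accuracy, recall = summarize(cm)
    print(name, round(accuracy, 4), recall.round(4))
```

The first model labels almost every review as 5-star (177 of the 180 one-star reviews are misclassified), which accuracy alone hides. After tuning, 140 of the 180 are classified correctly.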

### How does the model choose between 5-star and 1-star ratings?

In [110]:
# store the vocabulary of X_train
words = best_pipe['vect'].get_feature_names_out()

In [111]:
best_pipe['clf'].classes_

array([1, 5], dtype=int64)

In [112]:
# number of times each feature (word or bigram) appears across all 1-star reviews
bad_word_count = best_pipe['clf'].feature_count_[0,:]
# number of times each feature (word or bigram) appears across all 5-star reviews
good_word_count = best_pipe['clf'].feature_count_[1,:]

In [113]:
# create a DataFrame of words with their separate 1-star and 5-star counts
words = pd.DataFrame({'word' : words,
                      'bad' : bad_word_count, 
                      'good' : good_word_count}).set_index('word')
words.head()

word,bad,good
00,27.0,35.0
000,4.0,10.0
00pm,1.0,5.0
10,73.0,132.0
10 15,2.0,6.0


In [114]:
# add 1 to every count to avoid dividing by 0
words.bad = words.bad+1
words.good = words.good+1

In [115]:
# convert the counts into frequencies
words.bad = words.bad/words.bad.sum()
words.good = words.good/words.good.sum()
words.head()

word,bad,good
00,0.000558,0.000231
000,0.0001,7.1e-05
00pm,4e-05,3.9e-05
10,0.001475,0.000855
10 15,6e-05,4.5e-05


In [116]:
# ratio of the 1-star frequency to the 5-star frequency, and its inverse
words['bad_ratio'] = words.bad/words.good
words['good_ratio'] = words.good/words.bad
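
Adding 1 and normalizing is the same Laplace smoothing that `MultinomialNB` applies with its default `alpha=1.0`, so (assuming the default was kept, as in the pipeline above) the frequencies computed here should match the exponentiated `feature_log_prob_` of the fitted classifier. The sketch checks this on a made-up count matrix rather than on the Yelp data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# made-up document-term counts: 4 documents, 3 features
X = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 2]])
y = np.array([1, 1, 5, 5])

clf = MultinomialNB().fit(X, y)  # default alpha=1.0

# manual version of the steps above: add 1, then normalize each class row
smoothed = clf.feature_count_ + 1
freq = smoothed / smoothed.sum(axis=1, keepdims=True)

# True if the manual frequencies equal the model's smoothed likelihoods
print(np.allclose(freq, np.exp(clf.feature_log_prob_)))

# ratio of 1-star to 5-star frequency for each feature
bad_ratio = freq[0] / freq[1]
print(bad_ratio)
```

A ratio well above 1 marks a feature that is far more typical of 1-star reviews. For a two-class model, the classifier in effect adds up the logs of these ratios, weighted by how often each feature occurs in the review, together with the log ratio of the class priors.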

In [117]:
words.sort_values(by='good_ratio', ascending=False).head(20)

word,bad,good,bad_ratio,good_ratio
fantastic,4e-05,0.001318,0.030255,33.052909
one favorite,2e-05,0.000437,0.045604,21.927783
perfect,8e-05,0.001517,0.052561,19.025577
yum,2e-05,0.000354,0.056383,17.735707
favorite,0.00016,0.002398,0.066511,15.035043
amaze,0.0001,0.001369,0.072796,13.737111
pasty,2e-05,0.00027,0.073835,13.543631
reasonably,2e-05,0.00027,0.073835,13.543631
awesome,0.00014,0.001852,0.075374,13.26723
perfection,2e-05,0.000264,0.075636,13.221163


In [118]:
words.sort_values(by='bad_ratio', ascending=False).head(20)

word,bad,good,bad_ratio,good_ratio
never return,0.000239,6e-06,37.213064,0.026872
service horrible,0.000239,6e-06,37.213064,0.026872
poor service,0.000239,6e-06,37.213064,0.026872
never come,0.000219,6e-06,34.111975,0.029315
unprofessional,0.000199,6e-06,31.010886,0.032247
disgust,0.000578,1.9e-05,29.97719,0.033359
mediocre best,0.000179,6e-06,27.909798,0.03583
unacceptable,0.000179,6e-06,27.909798,0.03583
horrible experience,0.000179,6e-06,27.909798,0.03583
waste money,0.000179,6e-06,27.909798,0.03583


In [119]:
yelp[yelp.processed_text.str.contains('dr ')].iloc[0].text

"I love Dr. Scott!! He is the funniest doctor I've ever met and he makes me feel good too! Dr. Scott is always willing to bend over backwards to make sure you are provided the best service and treatment possible. The massage therapists rock too! I will be a lifelong patient of Integrated Chiropractic."

In [120]:
print(yelp[yelp.processed_text.str.contains('mozzarella')].iloc[3].text)

WOW this place is good!  SO good!  And not just yummy good, but intrinsically good.  Check out their amazing list of environmentally responsible business practices - http://essencebakery.com/essence_bakery_environmentally_friendly.shtml !  

That, and it's cute.  And it tastes Yummy!  And they are SO nice!  SOOO nice!  And their deserts are just disgustingly cute and beautiful and freaking good.  I almost bought the mini box of 4 cupcakes for just $3.50.  Bite sized so not too bad, but if I bought them today they'd be gone before I got home. 

I got their grilled cheese w/ mozzarella, basil and tomato on grilled buttery brioche bread with a light and tasty green salad on the side.  It will be hard to not come here every week.  I love that all the drinks are refillable - and it's up to you to refill them.  Coffee, tea or soda, just walk on up and fill your glass.  And I love that it's all so fresh and local!  

Important note - You order at the counter, they give you a number and bring 