---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

*Note: Some of the cells in this notebook are computationally expensive. To reduce runtime, this notebook is using a subset of the data.*

# Case Study: Sentiment Analysis

### Data Prep

In [2]:
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')

# Sample the data to speed up computation
# Comment out this line to match with lecture
df = df.sample(frac=0.1, random_state=10)

df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
394349,Sony XPERIA Z2 D6503 FACTORY UNLOCKED Internat...,,244.95,5,Very good one! Better than Samsung S and iphon...,0.0
34377,Apple iPhone 5c 8GB (Pink) - Verizon Wireless,Apple,194.99,1,"The phone needed a SIM card, would have been n...",1.0
248521,Motorola Droid RAZR MAXX XT912 M Verizon Smart...,Motorola,174.99,5,I was 3 months away from my upgrade and my Str...,3.0
167661,CNPGD [U.S. Office Extended Warranty] Smartwat...,CNPGD,49.99,1,an experience i want to forget,0.0
73287,Apple iPhone 7 Unlocked Phone 256 GB - US Vers...,Apple,922.0,5,GREAT PHONE WORK ACCORDING MY EXPECTATIONS.,1.0


In [3]:
# Drop missing values
df.dropna(inplace=True)

# Remove any 'neutral' ratings equal to 3
# (.copy() avoids pandas' SettingWithCopyWarning when a column is added below)
df = df[df['Rating'] != 3].copy()

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
34377,Apple iPhone 5c 8GB (Pink) - Verizon Wireless,Apple,194.99,1,"The phone needed a SIM card, would have been n...",1.0,0
248521,Motorola Droid RAZR MAXX XT912 M Verizon Smart...,Motorola,174.99,5,I was 3 months away from my upgrade and my Str...,3.0,1
167661,CNPGD [U.S. Office Extended Warranty] Smartwat...,CNPGD,49.99,1,an experience i want to forget,0.0,0
73287,Apple iPhone 7 Unlocked Phone 256 GB - US Vers...,Apple,922.0,5,GREAT PHONE WORK ACCORDING MY EXPECTATIONS.,1.0,1
277158,Nokia N8 Unlocked GSM Touch Screen Phone Featu...,Nokia,95.0,5,I fell in love with this phone because it did ...,0.0,1
100311,Blackberry Torch 2 9810 Unlocked Phone with 1....,BlackBerry,77.49,5,I am pleased with this Blackberry phone! The p...,0.0,1
251669,Motorola Moto E (1st Generation) - Black - 4 G...,Motorola,89.99,5,"Great product, best value for money smartphone...",0.0,1
279878,OtterBox 77-29864 Defender Series Hybrid Case ...,OtterBox,9.99,5,I've bought 3 no problems. Fast delivery.,0.0,1
406017,Verizon HTC Rezound 4G Android Smarphone - 8MP...,HTC,74.99,4,Great phone for the price...,0.0,1
302567,"RCA M1 Unlocked Cell Phone, Dual Sim, 5Mp Came...",RCA,159.99,5,My mom is not good with new technoloy but this...,4.0,1


In [4]:
# Most ratings are positive
df['Positively Rated'].mean()

0.74717766860786672

In [5]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0)

In [6]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 Everything about it is awesome!


X_train shape:  (23052,)


# CountVectorizer

In [9]:
import re
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
cat_in_the_hat_docs=[
       "One Cent, Two Cents, Old Cent, New Cent: All About Money (Cat in the Hat's Learning Library",
       "Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
       "Oh, The Things You Can Do That Are Good for You: All About Staying Healthy (Cat in the Hat's Learning Library)",
       "On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
       "There's No Place Like Space: All About Our Solar System (Cat in the Hat's Learning Library)" 
      ]

In [25]:
# Custom stop words
# (the documents are passed to fit_transform, not to the constructor)
cv = CountVectorizer(stop_words=["all","in","the","is","and"])
count_vector=cv.fit_transform(cat_in_the_hat_docs)
count_vector.shape

(5, 40)

In [26]:
# any stop words that we explicitly specified?
cv.stop_words

['all', 'in', 'the', 'is', 'and']

### Stop Words using `min_df`

The goal of `min_df` is to ignore words that occur too rarely to be considered meaningful. For example, your text may contain names of people that appear in only one or two documents. In some applications, these qualify as noise and can be eliminated from further analysis.

Rather than using a minimum term frequency (total occurrences of a word) to eliminate words, `min_df` looks at how many documents contain a term, better known as its document frequency. The `min_df` value can be an absolute count (e.g. 1, 2, 3, 4) or a proportion of documents (e.g. 0.25, meaning ignore words that appear in fewer than 25% of the documents).

Eliminating words that appear in fewer than 2 documents:

In [28]:
# ignore terms that appear in fewer than 2 documents
cv = CountVectorizer(min_df=2)
count_vector=cv.fit_transform(cat_in_the_hat_docs)

# To see what’s remaining, all we need to do is check the vocabulary again with:
cv.vocabulary_ 

{'about': 0,
 'all': 1,
 'cat': 2,
 'hat': 3,
 'in': 4,
 'learning': 5,
 'library': 6,
 'the': 7}
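
The difference between an integer and a float `min_df` can be checked on a small corpus. The sketch below uses a hypothetical four-document corpus (`toy_docs`, not part of the course data) and assumes scikit-learn's default tokenization.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, made up for illustration
toy_docs = [
    "red apple",
    "red cherry",
    "red apple pie",
    "green apple pie",
]

# Integer threshold: keep terms found in at least 2 documents
cv_abs = CountVectorizer(min_df=2).fit(toy_docs)

# Float threshold: keep terms found in at least 75% of documents (3 of 4)
cv_prop = CountVectorizer(min_df=0.75).fit(toy_docs)

vocab_abs = sorted(cv_abs.vocabulary_)
vocab_prop = sorted(cv_prop.vocabulary_)
print(vocab_abs)   # ['apple', 'pie', 'red']
print(vocab_prop)  # ['apple', 'red']
```

Here "pie" appears in 2 of 4 documents, so it survives the absolute threshold but not the 75% one.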

### Stop Words using `max_df`

Just as we ignored words that were too rare with `min_df`, we can ignore words that are too common with `max_df`. `max_df` looks at how many documents contain a term, and if that count exceeds the `max_df` threshold, the term is eliminated from consideration. The value can be an absolute count (e.g. 1, 2, 3, 4) or a proportion of documents (e.g. 0.85, meaning ignore words that appear in more than 85% of the documents because they are too common).

In [30]:
# ignore terms that appear in more than 50% of the documents
cv = CountVectorizer(max_df=0.50)
count_vector=cv.fit_transform(cat_in_the_hat_docs)

# I’ve typically used a value from 0.75-0.85 depending on the task 
# and for more aggressive stop word removal you can even use a smaller value.
# Now, to see which words have been eliminated, you can use cv.stop_words_ (see output below):
cv.stop_words_

{'about', 'all', 'cat', 'hat', 'in', 'learning', 'library', 'the'}
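
Document frequency itself can be computed directly from the document-term matrix, which makes it easier to predict what `max_df` will drop. This is a minimal sketch on a hypothetical toy corpus (`toy_docs` is an assumption, not the course data).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, made up for illustration
toy_docs = [
    "red apple",
    "red cherry",
    "red apple pie",
    "green apple pie",
]

cv_all = CountVectorizer().fit(toy_docs)
X = cv_all.transform(toy_docs)

# Document frequency: number of documents in which each term occurs at least once
doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
df_by_term = {term: int(doc_freq[idx]) for term, idx in cv_all.vocabulary_.items()}
print(df_by_term)

# max_df=0.5 drops terms found in more than 50% of documents (more than 2 of 4)
cv_max = CountVectorizer(max_df=0.5).fit(toy_docs)
kept = sorted(cv_max.vocabulary_)
print(kept)  # ['cherry', 'green', 'pie']
```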

In [24]:
for text in cat_in_the_hat_docs:
    text=re.sub("(\\W)"," \\1 ",text)
    print(text) 
    # split based on whitespace
    #return re.split("\\s+",text)

One   Cent ,    Two   Cents ,    Old   Cent ,    New   Cent :    All   About   Money    ( Cat   in   the   Hat ' s   Learning   Library
Inside   Your   Outside :    All   About   the   Human   Body    ( Cat   in   the   Hat ' s   Learning   Library ) 
Oh ,    The   Things   You   Can   Do   That   Are   Good   for   You :    All   About   Staying   Healthy    ( Cat   in   the   Hat ' s   Learning   Library ) 
On   Beyond   Bugs :    All   About   Insects    ( Cat   in   the   Hat ' s   Learning   Library ) 
There ' s   No   Place   Like   Space :    All   About   Our   Solar   System    ( Cat   in   the   Hat ' s   Learning   Library ) 


In [13]:
import re
 
def my_tokenizer(text):
    # create a space between special characters 
    text=re.sub("(\\W)"," \\1 ",text)
 
    # split based on whitespace
    return re.split("\\s+",text)
    
 
cv = CountVectorizer(tokenizer=my_tokenizer)
count_vector=cv.fit_transform(cat_in_the_hat_docs)
print(cv.vocabulary_)

{'one': 34, 'cent': 14, ',': 4, 'two': 47, 'cents': 15, 'old': 32, 'new': 29, ':': 5, 'all': 7, 'about': 6, 'money': 28, '(': 2, 'cat': 13, 'in': 22, 'the': 44, 'hat': 19, "'": 1, 's': 38, 'learning': 25, 'library': 26, 'inside': 24, 'your': 49, 'outside': 36, 'human': 21, 'body': 10, ')': 3, '': 0, 'oh': 31, 'things': 46, 'you': 48, 'can': 12, 'do': 16, 'that': 43, 'are': 8, 'good': 18, 'for': 17, 'staying': 41, 'healthy': 20, 'on': 33, 'beyond': 9, 'bugs': 11, 'insects': 23, 'there': 45, 'no': 30, 'place': 37, 'like': 27, 'space': 40, 'our': 35, 'solar': 39, 'system': 42}
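
Note that the vocabulary above contains an empty-string token (`''`), because `re.split` returns empty strings when the text starts or ends with whitespace. One possible variant, sketched here under the hypothetical name `tokenizer_without_empties`, uses `str.split()` with no arguments, which discards empty pieces.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def tokenizer_without_empties(text):
    # Hypothetical variant of my_tokenizer: pad every non-word character
    # with spaces so punctuation becomes its own token
    text = re.sub(r"(\W)", r" \1 ", text)
    # str.split() with no arguments never returns empty strings
    return text.split()

cv_tok = CountVectorizer(tokenizer=tokenizer_without_empties)
cv_tok.fit(["Hat's Learning Library)"])

vocab = sorted(cv_tok.vocabulary_)
print(vocab)  # ["'", ')', 'hat', 'learning', 'library', 's']
```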


In [15]:
import re
import nltk
import pandas as pd
from nltk.stem import PorterStemmer
 
# init stemmer
porter_stemmer=PorterStemmer()
 
def my_cool_preprocessor(text):
    
    text=text.lower() 
    text=re.sub("\\W"," ",text) # remove special chars
    text=re.sub("\\s+(in|the|all|for|and|on)\\s+"," _connector_ ",text) # normalize certain words
    
    # stem words
    words=re.split("\\s+",text)
    stemmed_words=[porter_stemmer.stem(word=word) for word in words]
    return ' '.join(stemmed_words)
 
cv = CountVectorizer(preprocessor=my_cool_preprocessor)
count_vector=cv.fit_transform(cat_in_the_hat_docs)
print(cv.vocabulary_)

{'one': 25, 'cent': 8, 'two': 37, 'old': 23, 'new': 20, '_connector_': 0, 'about': 1, 'money': 19, 'cat': 7, 'the': 34, 'hat': 11, 'learn': 16, 'librari': 17, 'insid': 15, 'your': 39, 'outsid': 27, 'human': 13, 'bodi': 4, 'oh': 22, 'thing': 36, 'you': 38, 'can': 6, 'do': 9, 'that': 33, 'are': 2, 'good': 10, 'stay': 31, 'healthi': 12, 'on': 24, 'beyond': 3, 'bug': 5, 'insect': 14, 'there': 35, 'no': 21, 'place': 28, 'like': 18, 'space': 30, 'our': 26, 'solar': 29, 'system': 32}


In [17]:
# word-level bigrams only
cv = CountVectorizer(ngram_range=(2,2),preprocessor=my_cool_preprocessor)
count_vector=cv.fit_transform(cat_in_the_hat_docs)
print(cv.vocabulary_)

{'one cent': 35, 'cent two': 19, 'two cent': 47, 'cent old': 18, 'old cent': 33, 'cent new': 17, 'new cent': 30, 'cent _connector_': 16, '_connector_ about': 0, 'about money': 7, 'money cat': 29, 'cat _connector_': 15, '_connector_ the': 2, 'the hat': 44, 'hat learn': 22, 'learn librari': 27, 'insid your': 26, 'your outsid': 50, 'outsid _connector_': 37, 'about _connector_': 5, '_connector_ human': 1, 'human bodi': 24, 'bodi cat': 12, 'oh _connector_': 32, '_connector_ thing': 3, 'thing you': 46, 'you can': 49, 'can do': 14, 'do that': 20, 'that are': 43, 'are good': 10, 'good _connector_': 21, '_connector_ you': 4, 'you _connector_': 48, 'about stay': 9, 'stay healthi': 41, 'healthi cat': 23, 'on beyond': 34, 'beyond bug': 11, 'bug _connector_': 13, 'about insect': 6, 'insect cat': 25, 'there no': 45, 'no place': 31, 'place like': 38, 'like space': 28, 'space _connector_': 40, 'about our': 8, 'our solar': 36, 'solar system': 39, 'system cat': 42}


In [18]:
# unigrams and bigrams, word level
cv = CountVectorizer(ngram_range=(1,2),preprocessor=my_cool_preprocessor)
count_vector=cv.fit_transform(cat_in_the_hat_docs)
print(cv.vocabulary_)

{'one': 60, 'cent': 24, 'two': 84, 'old': 56, 'new': 50, '_connector_': 0, 'about': 6, 'money': 48, 'cat': 22, 'the': 78, 'hat': 33, 'learn': 43, 'librari': 45, 'one cent': 61, 'cent two': 28, 'two cent': 85, 'cent old': 27, 'old cent': 57, 'cent new': 26, 'new cent': 51, 'cent _connector_': 25, '_connector_ about': 1, 'about money': 9, 'money cat': 49, 'cat _connector_': 23, '_connector_ the': 3, 'the hat': 79, 'hat learn': 34, 'learn librari': 44, 'insid': 41, 'your': 89, 'outsid': 64, 'human': 37, 'bodi': 16, 'insid your': 42, 'your outsid': 90, 'outsid _connector_': 65, 'about _connector_': 7, '_connector_ human': 2, 'human bodi': 38, 'bodi cat': 17, 'oh': 54, 'thing': 82, 'you': 86, 'can': 20, 'do': 29, 'that': 76, 'are': 12, 'good': 31, 'stay': 72, 'healthi': 35, 'oh _connector_': 55, '_connector_ thing': 4, 'thing you': 83, 'you can': 88, 'can do': 21, 'do that': 30, 'that are': 77, 'are good': 13, 'good _connector_': 32, '_connector_ you': 5, 'you _connector_': 87, 'about stay'

In [19]:
# character-level bigrams only, built within word boundaries
cv = CountVectorizer(ngram_range=(2,2),analyzer='char_wb')
count_vector=cv.fit_transform(cat_in_the_hat_docs)
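
To see what `char_wb` produces, it helps to fit it on a single short word. In this sketch (the one-word input is an assumption chosen for illustration) each word is padded with a space on both sides before character n-grams are extracted, so the bigrams include the word boundaries.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character bigrams within word boundaries, fitted on a single word
cv_char = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2))
cv_char.fit(["Cat"])

# The input is lowercased and padded to " cat " before bigrams are taken
char_bigrams = sorted(cv_char.vocabulary_)
print(char_bigrams)  # [' c', 'at', 'ca', 't ']
```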

### Limiting Vocabulary Size

When your feature space gets too large, you can limit its size by capping the vocabulary size. Say you want a maximum of 10,000 n-grams: CountVectorizer will keep the 10,000 n-grams with the highest total counts across the corpus and drop the rest.

In [20]:
# unigrams only (the default), limited to a vocabulary size of 10
cv = CountVectorizer(max_features=10)
count_vector=cv.fit_transform(cat_in_the_hat_docs)
count_vector.shape

(5, 10)
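
The ranking behind `max_features` uses total term counts over the whole corpus, not document frequency. A minimal sketch on a hypothetical three-document corpus (chosen so that there are no ties) illustrates this.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: every term appears in exactly 2 documents,
# but total counts differ (spam=4, eggs=3, ham=2)
toy_docs = [
    "spam spam spam ham",
    "spam eggs",
    "ham eggs eggs",
]

# Keep only the 2 terms with the highest total counts
cv_top = CountVectorizer(max_features=2).fit(toy_docs)
top_terms = sorted(cv_top.vocabulary_)
print(top_terms)  # ['eggs', 'spam']
```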

### Using CountVectorizer to Extract N-Gram / Term Counts

Finally, you may want to use CountVectorizer to obtain counts of your n-grams. This is slightly tricky to do with CountVectorizer, but achievable as shown below:

In [23]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
 
def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """return n-gram counts in descending order of counts"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]
 
    results=[]
    
    # each item is a (feature index, count) pair
    for idx, count in sorted_items:
        
        # get the ngram name
        n_gram=feature_names[idx]
        
        # collect as a list of tuples
        results.append((n_gram,count))
 
    return results

 
cv = CountVectorizer(ngram_range=(1,2),preprocessor=my_cool_preprocessor,max_features=100)
count_vector=cv.fit_transform(cat_in_the_hat_docs)
 
#sort the counts of first book title by descending order of counts
sorted_items=sort_coo(count_vector[0].tocoo())
 
# Get feature names (words/n-grams), ordered by column index in the sparse matrix
feature_names=cv.get_feature_names_out()
n_grams=extract_topn_from_vector(feature_names,sorted_items,10)

# The counts are first ordered in descending order. 
# Then from this list, each feature name is extracted and returned with corresponding counts.
n_grams

[('cent', 4),
 ('_connector_', 2),
 ('two cent', 1),
 ('two', 1),
 ('the hat', 1),
 ('the', 1),
 ('one cent', 1),
 ('one', 1),
 ('old cent', 1),
 ('old', 1)]
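
The cell above ranks n-grams within a single document. For counts over the entire corpus, a simpler route is to sum the columns of the document-term matrix. This is a sketch on a hypothetical toy corpus; it reads term names from `vocabulary_` so it does not depend on a particular feature-name accessor.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, made up for illustration
toy_docs = ["spam spam ham", "spam eggs", "ham eggs eggs eggs"]

cv_counts = CountVectorizer()
X = cv_counts.fit_transform(toy_docs)

# Column sums give the total count of each term across all documents
totals = np.asarray(X.sum(axis=0)).ravel()

# Order the terms by their column index so they line up with `totals`
terms = sorted(cv_counts.vocabulary_, key=cv_counts.vocabulary_.get)

# Sort by descending count, breaking ties alphabetically
corpus_counts = sorted(zip(terms, totals.tolist()), key=lambda p: (-p[1], p[0]))
print(corpus_counts)  # [('eggs', 4), ('spam', 3), ('ham', 2)]
```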

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

In [8]:
vect.get_feature_names_out()[::2000].tolist()

['00',
 'arroja',
 'comapañias',
 'dvds',
 'golden',
 'lands',
 'oil',
 'razonable',
 'smallsliver',
 'tweak']

In [31]:
len(vect.get_feature_names_out())

19601

In [32]:
# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

X_train_vectorized

<23052x19601 sparse matrix of type '<class 'numpy.int64'>'
	with 613289 stored elements in Compressed Sparse Row format>

In [33]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [34]:
from sklearn.metrics import roc_auc_score

# Predict the transformed test documents
predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.897433277667
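
One caveat about the score above: `model.predict` returns hard 0/1 labels, so the ROC curve behind that AUC has a single operating point. `roc_auc_score` also accepts continuous scores, such as `model.predict_proba(...)[:, 1]` or `model.decision_function(...)`, which rank the examples and generally give a different value. The sketch below uses made-up labels and scores (not the review data) to show the gap.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and classifier scores, made up for illustration
y_true = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.10, 0.60, 0.55, 0.80, 0.90, 0.30])

# Hard labels at a 0.5 threshold, analogous to what model.predict returns
hard_labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)
print('AUC from scores:     ', auc_from_scores)
print('AUC from hard labels:', auc_from_labels)
```

With these numbers the score-based AUC is 8/9 and the label-based AUC is 7.5/9. The values reported in this notebook are of the label-based kind.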


In [35]:
# get the feature names as a numpy array
feature_names = np.array(vect.get_feature_names_out())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['worst' 'terrible' 'slow' 'junk' 'poor' 'sucks' 'horrible' 'useless'
 'waste' 'disappointed']

Largest Coefs: 
['excelent' 'excelente' 'excellent' 'perfectly' 'love' 'perfect' 'exactly'
 'great' 'best' 'awesome']


# Tfidf

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names_out())

5442

In [37]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.889951006492


In [38]:
feature_names = np.array(vect.get_feature_names_out())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['61' 'printer' 'approach' 'adjustment' 'consequences' 'length' 'emailing'
 'degrees' 'handsfree' 'chipset']

Largest tfidf: 
['unlocked' 'handy' 'useless' 'cheat' 'up' 'original' 'exelent' 'exelente'
 'exellent' 'satisfied']
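
For reference, the weights produced by `TfidfVectorizer` with its default settings can be reproduced by hand: raw counts are multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit L2 norm. The sketch below checks this on a hypothetical three-document corpus and assumes the default parameters (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, made up for illustration
toy_docs = ["good phone", "bad phone", "good good battery"]

# Raw term counts per document
counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]

# Smoothed inverse document frequency
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each document vector to unit L2 norm
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
matches = bool(np.allclose(manual_tfidf, sklearn_tfidf))
print(matches)  # True
```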


In [39]:
# Sort the coefficients of the tf-idf model
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'slow' 'disappointed' 'worst' 'terrible' 'never' 'return' 'doesn'
 'horrible' 'waste']

Largest Coefs: 
['great' 'love' 'excellent' 'good' 'best' 'perfect' 'price' 'awesome' 'far'
 'perfectly']


In [40]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
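
The two reviews get the same prediction because they contain exactly the same words, so any unigram bag-of-words representation maps them to the same vector. A minimal sketch with plain `CountVectorizer` (used here instead of the fitted tf-idf model, as a simplifying assumption) shows that adding bigrams is what separates them.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ['not an issue, phone is working',
           'an issue, phone is not working']

# Unigrams only: word order is discarded
unigram = CountVectorizer().fit_transform(reviews).toarray()

# Unigrams and bigrams: pairs like 'not an' vs. 'not working' are kept
bigram = CountVectorizer(ngram_range=(1, 2)).fit_transform(reviews).toarray()

same_unigram = bool((unigram[0] == unigram[1]).all())
same_bigram = bool((bigram[0] == bigram[1]).all())
print(same_unigram)  # True
print(same_bigram)   # False
```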


# n-grams

In [41]:
# Fit the CountVectorizer to the training data specifying a minimum 
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names_out())

29072

In [42]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.91106617946


In [43]:
feature_names = np.array(vect.get_feature_names_out())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'junk' 'poor' 'slow' 'worst' 'broken' 'not good' 'terrible'
 'defective' 'horrible']

Largest Coefs: 
['excellent' 'excelente' 'excelent' 'perfect' 'great' 'love' 'awesome'
 'no problems' 'good' 'best']


In [44]:
# These reviews are now correctly identified
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]
