---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

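*Note: Some of the cells in this notebook are computationally expensive. To reduce runtime, you can run the notebook on a subset of the data by uncommenting the sampling line in the first cell; the outputs shown below come from the full data.*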

# Case Study: Sentiment Analysis

### Data Prep

In [1]:
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile.zip', compression='zip')

# Optionally sample the data to speed up computation.
# The line below is left commented out so the results match the lecture.
# df = df.sample(frac=0.1, random_state=10)

df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [2]:
# Drop missing values
df.dropna(inplace=True)

# Remove any 'neutral' ratings equal to 3
# (.copy() avoids a possible SettingWithCopyWarning when a column is added later)
df = df[df['Rating'] != 3].copy()

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,1
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,1
5,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,1,I already had a phone with problems... I know ...,1.0,0
6,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,The charging port was loose. I got that solder...,0.0,0
7,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,"Phone looks good but wouldn't stay charged, ha...",0.0,0
8,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I originally was using the Samsung S2 Galaxy f...,0.0,1
11,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,This is a great product it came after two days...,0.0,1


In [3]:
# Most ratings are positive
df['Positively Rated'].mean()

0.748269374249846

In [4]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0)
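
When no `test_size` is passed, `train_test_split` holds out 25% of the rows for testing. The sketch below illustrates this on a hypothetical toy list (the names `toy_reviews` and `toy_labels` are made up for illustration and are not part of the notebook's data).

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 placeholder "reviews" and their labels
toy_reviews = ['r{}'.format(i) for i in range(8)]
toy_labels = [1, 1, 1, 0, 1, 0, 1, 1]

# Same call pattern as above: no test_size, fixed random_state
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_reviews, toy_labels, random_state=0)

# With the default split, 6 items go to training and 2 to testing
print(len(toy_train), len(toy_test))
```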

In [5]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 Excellent exactly as described thanks


X_train shape:  (231202,)


# CountVectorizer

In [6]:
X_train

5781                  Excellent exactly as described thanks
118777    Love this phone. Bought about 6 months ago and...
218170    Great phone. Much better than most phones at t...
382658    This item is as described in the add. I only m...
39488                                   Nice! Worked great!
                                ...                        
159246                                            excellent
408354    I love this cell phone. I dont do anything on ...
197432    Although I'm only 26 I'm kind of a backwoods h...
153503              for the money not bad, but cheaply made
410168    Did even buy this phone six months ago and alr...
Name: Reviews, Length: 231202, dtype: object

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

In [8]:
vect

In [18]:
vect.get_feature_names_out()[::2000]

array(['00', '4h', 'adpaters', 'assembly', 'blasts', 'cashiers',
       'condidtion', 'debit', 'domestic', 'estimates', 'flawlessy',
       'gothrough', 'hui', 'irritating', 'lighting5', 'microcomputer',
       'nigeria', 'p7_l00', 'poorer', 'quirkyness', 'responsibility',
       'sens', 'sorrow', 'synch', 'trace', 'usvi', 'within3'],
      dtype=object)

In [22]:
len(vect.get_feature_names_out())

53271

In [23]:
# Transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

X_train_vectorized

<231202x53271 sparse matrix of type '<class 'numpy.int64'>'
	with 6113585 stored elements in Compressed Sparse Row format>

In [24]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
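
The message above means the solver stopped at its iteration limit before converging. One option it suggests is raising `max_iter`. The sketch below shows that option on a hypothetical toy corpus; the value 1000 is an arbitrary choice for illustration, and applying it to the full review data would likely change the scores reported in this notebook.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, not taken from the review data
toy_docs = ['great phone', 'love it', 'terrible phone', 'waste of money']
toy_labels = [1, 1, 0, 0]
X_toy = CountVectorizer().fit_transform(toy_docs)

# Record any warnings raised while fitting
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    toy_model = LogisticRegression(max_iter=1000).fit(X_toy, toy_labels)

# True when no ConvergenceWarning was raised during the fit
converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
print('iterations used:', toy_model.n_iter_[0], 'converged:', converged)
```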


In [33]:
from sklearn.metrics import roc_auc_score

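# Predict class labels (0 or 1) for the transformed test documents
# Note: the AUC below is computed from these hard labels, not from predicted probabilities
predictions = model.predict(vect.transform(X_test))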

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.9185552594839563
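
The AUC above is computed from hard 0/1 predictions. AUC is more commonly computed from continuous scores such as `model.predict_proba(...)[:, 1]`, and the two can differ. The sketch below uses made-up labels and scores (all values are hypothetical) to show the difference; it is not a re-evaluation of the model above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
toy_y_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.2, 0.3, 0.4, 0.9])

# Thresholding at 0.5 turns the scores into hard 0/1 predictions
toy_hard = (toy_scores >= 0.5).astype(int)

# Every positive is scored above every negative, so the ranking is perfect
auc_from_scores = roc_auc_score(toy_y_true, toy_scores)

# After thresholding, one positive is tied with the negatives
auc_from_hard = roc_auc_score(toy_y_true, toy_hard)

print(auc_from_scores, auc_from_hard)
```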


In [35]:
# Get the feature names as a numpy array
feature_names = np.array(vect.get_feature_names_out())

# Get the indices that sort the model coefficients in ascending order
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['worst' 'junk' 'garbage' 'unusable' 'useless' 'waste' 'terrible'
 'horrible' 'false' 'awful']

Largest Coefs: 
['excelente' 'excelent' 'exelente' 'loves' 'loving' 'excellent' 'perfecto'
 'complaints' 'awesome' 'bien']
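
The `argsort` and negative-step slicing used above can be checked on a small example. The feature names and coefficient values below are hypothetical, chosen only to make the ordering easy to follow.

```python
import numpy as np

# Hypothetical feature names and coefficients
toy_names = np.array(['awful', 'ok', 'fine', 'great', 'worst'])
toy_coefs = np.array([-1.5, 0.1, 0.4, 2.0, -2.3])

# Indices that order the coefficients from most negative to most positive
toy_order = toy_coefs.argsort()

# First two indices: the two most negative coefficients
smallest_two = toy_names[toy_order[:2]]

# [:-3:-1] walks backwards from the end and stops before the third-from-last
# element, giving the two largest coefficients, largest first
largest_two = toy_names[toy_order[:-3:-1]]

print(smallest_two)
print(largest_two)
```

In the same way, `[:-11:-1]` in the cell above returns the ten largest coefficients in descending order.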


# Tfidf

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names_out())

18024

In [54]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


AUC:  0.9272231191634459


In [64]:
X_train_vectorized.max(0).toarray()[0].argsort()

array([15244, 17402,  1280, ..., 16534,  9926, 16275], dtype=int64)

In [58]:
feature_names = np.array(vect.get_feature_names_out())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['storageso' 'warmness' 'aggregration' 'commenter' 'pthalo' '1300' '34ghz'
 'bridging' 'srgb' 'seizing']

Largest tfidf: 
['too' 'malo' 'true' 'bjvjjbkvjvj' 'satisfied' 'problems' 'satisfecho'
 'malfunction' 'satisfactory' 'horrible']


In [65]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'worst' 'terrible' 'useless' 'disappointed' 'waste' 'poor' 'return'
 'horrible' 'returning']

Largest Coefs: 
['love' 'great' 'excellent' 'amazing' 'perfect' 'easy' 'awesome' 'best'
 'perfectly' 'loves']


In [66]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
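
The two reviews get the same prediction because a bag-of-words representation ignores word order: both sentences contain exactly the same words. The sketch below checks this with plain Python. The helper `toy_tokenize` is hypothetical; it only approximates the vectorizer's default tokenization (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter


def toy_tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())


review_a = 'not an issue, phone is working'
review_b = 'an issue, phone is not working'

# Word counts for each review, with order discarded
counts_a = Counter(toy_tokenize(review_a))
counts_b = Counter(toy_tokenize(review_b))

# The two bags of words are identical
same_bag = counts_a == counts_b
print(same_bag)
```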


# n-grams

In [68]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names_out())

200192

In [70]:
vect.get_feature_names_out()[::2000]

array(['00', '30fps without', '93', 'additional 50', 'all parts',
       'amazing too', 'and fill', 'and xplus', 'app where',
       'arrived pretty', 'audio books', 'basically its', 'because tried',
       'big it', 'bottom', 'but affordable', 'calendar just',
       'capable phone', 'chance', 'clear cache', 'completely still',
       'cool by', 'current os', 'decent build', 'device up',
       'dissapointed in', 'drastic', 'electronics was', 'european market',
       'expect phone', 'farther', 'fine blu', 'for explanation',
       'from cellcow', 'gb cpu', 'going again', 'great flashlight',
       'hands and', 'have provided', 'high ppi', 'hugely also', 'in lot',
       'instructions from', 'is frozen', 'it comparable', 'its moments',
       'key as', 'launcher if', 'like screen', 'longer now', 'maintain',
       'me much', 'mini laptop', 'more fast', 'my blu', 'needs of',
       'no cons', 'not registered', 'of cardboard', 'old news',
       'only arrived', 'or zero', 'overall condi

In [71]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


AUC:  0.9604960900141196


In [73]:
feature_names = np.array(vect.get_feature_names_out())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'worst' 'junk' 'not worth' 'not happy' 'not good'
 'not satisfied' 'garbage' 'terrible' 'horrible']

Largest Coefs: 
['excelent' 'excelente' 'not bad' 'exelente' 'excellent' 'no problems'
 'perfect' 'awesome' 'no issues' 'perfecto']


In [74]:
# These reviews are now correctly identified
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]
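
To see why bigrams separate these two reviews, the sketch below lists the n-grams that appear in only one of them. It uses an unfitted `CountVectorizer` analyzer with `ngram_range=(1, 2)` and no `min_df`, so it shows the candidate features rather than the ones the fitted model above actually kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Analyzer that produces the 1-grams and 2-grams of a single string
analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

grams_a = set(analyzer('not an issue, phone is working'))
grams_b = set(analyzer('an issue, phone is not working'))

# The unigrams are shared, so only bigrams remain in the differences
only_in_a = sorted(grams_a - grams_b)
only_in_b = sorted(grams_b - grams_a)

print(only_in_a)
print(only_in_b)
```

Bigrams such as `not working` carry the negation that single words cannot express, which is consistent with the `not ...` and `no ...` features in the coefficient lists above.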
