---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

*Note: Some of the cells in this notebook are computationally expensive. To reduce runtime, this notebook is using a subset of the data.*

# Case Study: Sentiment Analysis

### Data Prep

In [1]:
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')

# Optionally sample the data to speed up computation.
# The line below is left commented out so that the results match the lecture.
# df = df.sample(frac=0.1, random_state=10)

df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [2]:
# Drop missing values
df.dropna(inplace=True)

# Remove any 'neutral' ratings equal to 3
df = df[df['Rating'] != 3]

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,1
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,1
5,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,1,I already had a phone with problems... I know ...,1.0,0
6,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,The charging port was loose. I got that solder...,0.0,0
7,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,"Phone looks good but wouldn't stay charged, ha...",0.0,0
8,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I originally was using the Samsung S2 Galaxy f...,0.0,1
11,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,This is a great product it came after two days...,0.0,1


In [3]:
# Most ratings are positive
df['Positively Rated'].mean()

0.7482686025879323

In [4]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0)

In [5]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)

X_train first entry:

 I bought a BB Black and was deliveried a White BB.Really is not a serious provider...Next time is better to cancel the order.


X_train shape:  (231207,)


# CountVectorizer

 - First approach: the bag-of-words model. It ignores structure and looks only at how often each word appears.
 - It converts text into a matrix of token counts (numeric data).
 - scikit-learn can then work with this numeric data.
 - `CountVectorizer().fit` tokenizes the text, lowercases every token, and builds a vocabulary from the tokens.
 - `vect.transform()` then converts the text in `X_train` into its bag-of-words representation (a document-term matrix).
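
The fit and transform steps described above can be sketched on a tiny corpus. The two sentences and the `toy_` names are made up for illustration and are not part of the Amazon data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the Amazon reviews
toy_docs = ["Great phone, great price", "Phone stopped working"]

# fit() tokenizes, lowercases, and builds the vocabulary
toy_vect = CountVectorizer().fit(toy_docs)

# vocabulary_ maps each token to its column index (assigned in sorted order)
toy_vocab = sorted(toy_vect.vocabulary_)
print(toy_vocab)

# transform() returns a sparse document-term matrix of counts
toy_counts = toy_vect.transform(toy_docs).toarray()
print(toy_counts)
```

Each row of `toy_counts` is one document, and the columns follow the order of `toy_vocab`, so "great" is counted twice in the first row.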

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

In [12]:
# get the vocabulary
vect.get_feature_names()[::2000]

['00',
 '4less',
 'adr6275',
 'assignment',
 'blazingly',
 'cassettes',
 'condishion',
 'debi',
 'dollarsshipping',
 'esteem',
 'flashy',
 'gorila',
 'human',
 'irullu',
 'like',
 'microsaudered',
 'nightmarish',
 'p770',
 'poori',
 'quirky',
 'responseive',
 'send',
 'sos',
 'synch',
 'trace',
 'utiles',
 'withstanding']

In [13]:
# count how many features (tokens) we have
len(vect.get_feature_names())

53216

 - The transformed documents are stored in a SciPy sparse matrix. Each row is a document (a single review); each column is a word in the fitted vocabulary.
 - Each entry in the matrix is the number of times that word occurs in that document.
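
A minimal sketch of how to read such a matrix, using a made-up two-document corpus (the `toy_` names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus for illustration only
toy_docs = ["good phone good battery", "bad battery"]
toy_vect = CountVectorizer().fit(toy_docs)
toy_matrix = toy_vect.transform(toy_docs)

print(toy_matrix.shape)  # (number of documents, vocabulary size)
print(toy_matrix.nnz)    # number of stored non-zero counts

# Look up the count of one word in one document via its column index
col = toy_vect.vocabulary_["good"]
good_count = toy_matrix[0, col]
print(good_count)
```

Only the non-zero counts are stored, which is why the full training matrix fits in memory even though it has tens of thousands of columns.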

In [14]:
# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

X_train_vectorized

<231207x53216 sparse matrix of type '<class 'numpy.int64'>'
	with 6117776 stored elements in Compressed Sparse Row format>

In [10]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [11]:
from sklearn.metrics import roc_auc_score

# Predict the transformed test documents
predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.9267064844136235
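
The cell above passes hard 0/1 predictions to `roc_auc_score`. The function also accepts continuous scores, such as the positive-class column of `predict_proba`, and the two usually give different AUC values. The sketch below shows both calls on a made-up corpus; it scores the training data only to stay short, so the numbers themselves mean nothing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical toy reviews and labels (1 = positive), for illustration only
toy_text = ["great phone", "love it", "works great",
            "terrible phone", "waste of money", "stopped working"]
toy_y = [1, 1, 1, 0, 0, 0]

toy_vect = CountVectorizer().fit(toy_text)
toy_X = toy_vect.transform(toy_text)
toy_model = LogisticRegression().fit(toy_X, toy_y)

# AUC from hard class labels, as in the cell above
auc_labels = roc_auc_score(toy_y, toy_model.predict(toy_X))

# AUC from the predicted probability of the positive class
auc_scores = roc_auc_score(toy_y, toy_model.predict_proba(toy_X)[:, 1])

print(auc_labels, auc_scores)
```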


In [21]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['worst' 'false' 'worthless' 'mony' 'junk' 'garbage' 'useless' 'messing'
 'blacklist' 'unusable']

Largest Coefs: 
['excelent' 'excelente' 'exelente' 'excellent' 'loving' 'loves'
 'efficient' 'perfecto' 'amazing' 'lovely']


# Tfidf
 - Term Frequency-Inverse Document Frequency
 - Weights each term by how important it is to a document
 - High weight: terms that appear often in a particular document but rarely in the corpus as a whole
 - Low weight: terms that are commonly used across the whole corpus
 - Low weight: terms that are rarely used and occur only in very long documents
 - `TfidfVectorizer().fit` tokenizes the text and builds a vocabulary from the tokens
 - `vect.transform()` then converts the data into a matrix representation that ML methods can use

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
len(vect.get_feature_names())

17951
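
The effect of `min_df` on vocabulary size can be seen on a small made-up corpus. This sketch uses `min_df=2` rather than 5 because the toy corpus has only three documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus for illustration only
toy_docs = ["good phone", "good battery", "good screen, bad battery"]

# Default min_df=1 keeps every token
vocab_all = sorted(TfidfVectorizer().fit(toy_docs).vocabulary_)

# min_df=2 keeps only tokens that occur in at least 2 documents
vocab_min2 = sorted(TfidfVectorizer(min_df=2).fit(toy_docs).vocabulary_)

print(vocab_all)
print(vocab_min2)
```

Tokens that occur in a single document, which in real reviews are often typos or product codes, are dropped from the vocabulary.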

In [23]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.9266100666746837


### Understanding TF-IDF in detail

https://stackoverflow.com/questions/36966019/how-aretf-idf-calculated-by-the-scikit-learn-tfidfvectorizer

https://stackoverflow.com/questions/45680421/is-this-correct-tfidf

The link below can be skipped: it is not wrong, but it does not account for the smoothing (+1), so it does not explain the values computed here.

https://datascience.stackexchange.com/questions/15311/weights-for-keywords-in-a-set-of-documents-using-term-frequency-and-inverse-docu

TF-IDF stands for term frequency–inverse document frequency.

 - TF measures how often a term occurs in a given document, so the value differs from document to document. scikit-learn uses the raw count; some other definitions divide by the total number of terms in the document.
 - IDF is the log of the ratio (total number of documents) / (number of documents containing the term). It is constant for a given term across the corpus. The greater the IDF of a term, the higher its significance.

Example:

Document 1: This is a sample example.

Document 2: This is another example.

__Calculate the TF__
 - TF(is, Document 1) = 1
 - TF(is, Document 2) = 1
 
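__Calculate the IDF__ (with smoothing, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$): $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$
 - The IDF for "is" = $\ln\frac{1 + 2}{1 + 2} + 1 = 1$
 - Similarly, the IDFs for "this" and "example" are also 1
 - The IDF for "sample" and "another" = $\ln\frac{1 + 2}{1 + 1} + 1 = 1.40546$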

__Calculate the TF-IDF for document 1__: TFIDF = TF * IDF
 - TFIDF(is, document 1) = 1 * 1 = 1
 - Similarly, the TFIDF for "this" and "example" in document 1 = 1
 - The TFIDF for "sample" = 1 * 1.40546 = 1.40546
 - The TF-IDF vector for document 1 is [1 (this), 1 (is), 1.40546 (sample), 1 (example)]

__Last step: normalize the TF-IDF vector:__
 - Divide each element of the vector by its Euclidean (L2) norm, $\sqrt{1^2 + 1^2 + 1.40546^2 + 1^2} \approx 2.2305$, which gives 0.44832 for "this", "is", and "example" and 0.63010 for "sample"
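
The steps above can be reproduced with NumPy and compared against `TfidfVectorizer`. This is a sketch that assumes the vectorizer's default settings (smoothed IDF, L2 normalization); the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["this is a sample example", "this is another example"]

# Raw term counts (TF); the default tokenizer drops the one-letter token "a"
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF * IDF, then divide each row by its Euclidean (L2) norm
tfidf_manual = counts * idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

# Compare with scikit-learn's result
tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Both vectorizers sort their vocabulary alphabetically, so the columns of the two matrices line up and can be compared directly.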
 

In [12]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

d = pd.Series(['this is a sample example','this is another example'])
df = pd.DataFrame(d)
# tfidf_vectorizer = TfidfVectorizer(analyzer='word', min_df=0)
tfidf_vectorizer = TfidfVectorizer()
# To keep only the top 2 features (terms), for example:
# tfidf_vectorizer = TfidfVectorizer(analyzer='word', min_df=0, max_features=2, max_df=3)
# Terms are left out of the vocabulary when they:
#  - occur in too many documents (max_df)
#  - occur in too few documents (min_df)
#  - are cut off by feature selection (max_features)
print(df[0])
tfidf = tfidf_vectorizer.fit_transform(df[0])
# dictionary: each key is a word (feature name) and each value is that word's index in the feature list
print(tfidf_vectorizer.vocabulary_)
# output: {u'this': 4, u'sample': 3, u'is': 2, u'example': 1, u'another': 0}
print(tfidf_vectorizer.idf_)
# output(constant): [ 1.40546511  1.          1.          1.40546511  1.        ]
print(tfidf_vectorizer.get_feature_names())
print(tfidf)

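# IDF of "sample" and "another" with smoothing: ln((1 + 2) / (1 + 1)) + 1
print(np.log(3/2) +1)
# Normalized TF-IDF of "this" in document 1: 1 / (L2 norm of the document vector)
print(1/np.sqrt(1**2 + 1**2 + 1.40546511**2 + 1**2))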
# output: 
#(0, 1)        0.448320873199    Document 1, term = example
#(0, 3)        0.630099344518    Document 1, term = sample
#(0, 2)        0.448320873199    Document 1, term = is
#(0, 4)        0.448320873199    Document 1, term = this
#(1, 0)        0.630099344518    Document 2, term = another
#(1, 1)        0.448320873199    Document 2, term = example
#(1, 2)        0.448320873199    Document 2, term = is
#(1, 4)        0.448320873199    Document 2, term = this

0    this is a sample example
1     this is another example
Name: 0, dtype: object
{'this': 4, 'is': 2, 'sample': 3, 'example': 1, 'another': 0}
[1.40546511 1.         1.         1.40546511 1.        ]
['another', 'example', 'is', 'sample', 'this']
  (0, 4)	0.44832087319911734
  (0, 2)	0.44832087319911734
  (0, 3)	0.6300993445179441
  (0, 1)	0.44832087319911734
  (1, 4)	0.44832087319911734
  (1, 2)	0.44832087319911734
  (1, 1)	0.44832087319911734
  (1, 0)	0.6300993445179441
1.4054651081081644
0.44832087295952644


#### Find the smallest and largest tfidf
__My understanding of the code below__: `X_train_vectorized` is a matrix. Rows are reviews, columns are words, and each entry is the weight of a word in a document (how important it is). `X_train_vectorized.max(0)` takes the maximum along `axis=0` (down the rows), so it gives the maximum of each column, that is, the maximum weight of each word across all documents.
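
A small sketch of the same column-wise maximum on a made-up 2 x 3 sparse matrix (the values are arbitrary and only illustrate the shapes involved):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical weights: 2 documents (rows) x 3 words (columns)
toy = csr_matrix(np.array([[0.0, 0.5, 0.2],
                           [0.7, 0.1, 0.0]]))

# max(0) collapses the rows and leaves one maximum per column (word)
col_max = toy.max(0).toarray()[0]
print(col_max)

# argsort gives the column indices ordered from smallest to largest maximum
order = col_max.argsort()
print(order)
```

Indexing the feature names with `order[:10]` and `order[:-11:-1]` is what produces the smallest and largest lists in the cell below.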

In [28]:
feature_names = np.array(vect.get_feature_names())


# print(X_train_vectorized.toarray().shape) # 231207 reviews x 17951 words
# after taking the maximum over all rows, we have an array of shape 1 x 17951 (one value per word)
# print(X_train_vectorized.max(0).toarray().shape) # (1,17951)

print(vect.idf_[:5])
print(X_train_vectorized.max(0).toarray()[0][:5])
sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

[ 6.52778689  8.38822839 10.86616637 11.27163148 10.78612366]
[0.71042189 0.32454897 0.31976905 0.26919504 0.21120331]
Smallest tfidf:
['commenter' 'pthalo' 'warmness' 'storageso' 'aggregration' '1300'
 '625nits' 'a10' 'submarket' 'brawns']

Largest tfidf: 
['defective' 'batteries' 'gooood' 'epic' 'luis' 'goood' 'basico'
 'aceptable' 'problems' 'excellant']


In [25]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'worst' 'useless' 'disappointed' 'terrible' 'return' 'waste' 'poor'
 'horrible' 'doesn']

Largest Coefs: 
['love' 'great' 'excellent' 'perfect' 'amazing' 'awesome' 'perfectly'
 'easy' 'best' 'loves']


In [26]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]


# n-grams
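
The two reviews in the previous cell contain exactly the same words, so a model built on single-word features cannot tell them apart. The sketch below fits vectorizers on just those two sentences (not on the training data) to show that their unigram vectors are identical while their vectors with bigrams differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["not an issue, phone is working",
           "an issue, phone is not working"]

# Unigrams only: word order is lost
unigrams = CountVectorizer().fit_transform(reviews).toarray()

# Unigrams and bigrams: pairs such as "not working" become features
bigrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(reviews).toarray()

same_unigrams = (unigrams[0] == unigrams[1]).all()
same_bigrams = (bigrams[0] == bigrams[1]).all()
print(same_unigrams, same_bigrams)
```

With bigrams, "not an" appears only in the first review and "not working" only in the second, which gives the classifier something to separate them on.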

In [27]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names())

198917

In [28]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.9671350730214722


In [None]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

In [None]:
# These reviews are now correctly identified
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting

In [6]:
text = ['This is a string','This is another string','TFIDF computation calculation','TfIDF is the product of TF and IDF']

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english',norm = None)

X = vectorizer.fit_transform(text)
X_vocab = vectorizer.get_feature_names()
X_mat = X.todense()
X_idf = vectorizer.idf_
print(X_idf)

[1.91629073 1.91629073 1.91629073 1.91629073 1.51082562 1.91629073
 1.51082562]
