---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

*Note: Some of the cells in this notebook are computationally expensive. To reduce runtime, this notebook is using a subset of the data.*

# Case Study: Sentiment Analysis

### Data Prep

In [47]:
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')

# Sample the data to speed up computation (Comment out this line to match with lecture)
# df = df.sample(frac=0.1, random_state=10)  # Commented out so that the full dataset is used.


# The dataset contains Amazon reviews of unlocked mobile phones.
# We have the Product Name, Brand Name, Price, Rating, Review text, and the number of people who found the review helpful (Review Votes).
df.head()
# we'll be focusing on the Rating and Reviews columns

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [48]:
# First, we'll drop any rows with missing values
df.dropna(inplace=True)

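# Next, remove any rows with Rating == 3; we'll assume these are neutral.
# Take a copy so the column assignment below does not raise SettingWithCopyWarning.
df = df[df['Rating'] != 3].copy()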

# Finally, we'll create a new column that will serve as our target for our model.
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.head(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,1
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,1
5,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,1,I already had a phone with problems... I know ...,1.0,0
6,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,The charging port was loose. I got that solder...,0.0,0
7,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,2,"Phone looks good but wouldn't stay charged, ha...",0.0,0
8,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I originally was using the Samsung S2 Galaxy f...,0.0,1
11,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,This is a great product it came after two days...,0.0,1
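The labeling rule can be checked on a small frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not part of the Amazon dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings, including a missing review and a neutral 3.
toy = pd.DataFrame({
    'Rating': [5, 4, 3, 2, 1, 5],
    'Reviews': ['great', 'good', 'ok', 'bad', 'awful', None],
})

toy = toy.dropna()                    # drop the row with the missing review
toy = toy[toy['Rating'] != 3].copy()  # drop the neutral rating
toy['Positively Rated'] = np.where(toy['Rating'] > 3, 1, 0)

print(toy)
```

Ratings of 4 and 5 become 1, ratings of 1 and 2 become 0, and the neutral and incomplete rows are gone.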


In [49]:
# Most ratings are positive
df['Positively Rated'].mean()
# ↓ About 75% of the reviews are positive, so we have imbalanced classes

0.74826860258793226

In [50]:
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], 
                                                    df['Positively Rated'], 
                                                    random_state=0)

In [51]:
print('X_train first entry:\n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)
# ↓ We'll need to convert these into a numeric representation that scikit-learn can use.

X_train first entry:

 I bought a BB Black and was deliveried a White BB.Really is not a serious provider...Next time is better to cancel the order.


X_train shape:  (231207,)


In [52]:
X_train.head()

97039     I bought a BB Black and was deliveried a White...
243783    overall i am very happy so far with this phone...
88792     the keyboard stutters! after i made a research...
388802    excellent smart phone, good performance. all p...
161607    I received my new Blu Vivo 5 Smartphone 3 days...
Name: Reviews, dtype: object

# CountVectorizer

### The bag-of-words approach
・ is a simple and commonly used way to represent text for use in machine learning.<br>
・ CountVectorizer allows us to use the bag-of-words approach by converting a collection of text documents into a matrix of token counts.

In [53]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)
# What this does:
# 1. tokenizes the training data
# 2. builds the vocabulary

# Fitting the CountVectorizer tokenizes each document
# by finding all sequences of characters of at least two letters or numbers 
# separated by word boundaries.
# Converts everything to lowercase and builds a vocabulary using these tokens.
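The tokenization rule can be seen directly with `build_analyzer()`, which returns the function a vectorizer applies to each document. This is a sketch using the default settings; the sentence is adapted from the first training review shown above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenizing function
# that the vectorizer applies to every document.
analyze = CountVectorizer().build_analyzer()

tokens = analyze('I bought a BB Black and was deliveried a White BB.')
print(tokens)
```

The one-character words "I" and "a" are dropped, everything is lowercased, and the misspelling "deliveried" is kept as its own token.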

In [54]:
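# get_feature_names() returns the vocabulary, which is built from the tokens found in the training data.
# (In scikit-learn 1.0 and later the method is get_feature_names_out(); get_feature_names() was removed in 1.2.)

# Take every 2000th feature to get a small sample of the vocabulary.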
vect.get_feature_names()[::2000]

['00',
 '4less',
 'adr6275',
 'assignment',
 'blazingly',
 'cassettes',
 'condishion',
 'debi',
 'dollarsshipping',
 'esteem',
 'flashy',
 'gorila',
 'human',
 'irullu',
 'like',
 'microsaudered',
 'nightmarish',
 'p770',
 'poori',
 'quirky',
 'responseive',
 'send',
 'sos',
 'synch',
 'trace',
 'utiles',
 'withstanding']

In [55]:
vect.get_feature_names()[:15]  # the first 15 features

['00',
 '000',
 '0000',
 '00000',
 '000000',
 '0000000',
 '0000from',
 '0001',
 '0004',
 '000ma',
 '000mah',
 '000mh',
 '000restricted',
 '001',
 '002']

In [56]:
len(vect.get_feature_names())  # about 53,000 vocabulary entries in total

53216

In [57]:
# transform the documents in the training data to a document-term matrix
# (this gives the bag-of-words representation of X_train)
X_train_vectorized = vect.transform(X_train)

X_train_vectorized

# This representation is stored in a SciPy sparse matrix,
# where each row corresponds to a document
# and each column a word from our training vocabulary.

<231207x53216 sparse matrix of type '<class 'numpy.int64'>'
	with 6117776 stored elements in Compressed Sparse Row format>

In [58]:
X_train_vectorized[0, :100]  # a slice: the first 100 features of the first document (see https://docs.scipy.org/doc/scipy/reference/sparse.html)

# The entries in this matrix are the number of times each word appears in each document.
# This row shows how often each vocabulary word is used in the first document.
# Because the vocabulary is far larger than the number of words in any one review, most entries are 0.

<1x100 sparse matrix of type '<class 'numpy.int64'>'
	with 0 stored elements in Compressed Sparse Row format>

In [59]:
# We'll use LogisticRegression, which works well for high dimensional sparse data
# such as this matrix, where most entries are 0.
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [60]:
from sklearn.metrics import roc_auc_score

# Predict the transformed test documents
# (words in the test reviews that are not in the training vocabulary are ignored)
predictions = model.predict(vect.transform(X_test))

# Compute the area under the curve score (here from the hard 0/1 predictions)
print('AUC: ', roc_auc_score(y_test, predictions))

# We obtain a score of about 0.926.

AUC:  0.92648398605
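The cell above passes hard 0/1 predictions to `roc_auc_score`. The function also accepts continuous scores, such as those from `decision_function`, in which case the ROC curve is traced over all thresholds rather than one. The sketch below contrasts the two on synthetic data; the data, the names `clf`, `auc_labels` and `auc_scores`, and scoring on the training set are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical synthetic data: one noisy feature that partly separates the classes.
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=500)
X = (y + rng.normal(scale=1.0, size=500)).reshape(-1, 1)

clf = LogisticRegression().fit(X, y)

# Scored on the training data purely for illustration.
auc_labels = roc_auc_score(y, clf.predict(X))            # hard 0/1 labels, as in the cell above
auc_scores = roc_auc_score(y, clf.decision_function(X))  # continuous scores

print('AUC from labels:', auc_labels)
print('AUC from scores:', auc_scores)
```

The two numbers generally differ, so it is worth stating which one is being reported when comparing models.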


In [63]:
# Next, after the area under the curve score, we look at the model coefficients.

# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest

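# The most negative coefficients belong to words associated with negative reviews.
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
# Words such as 'excelent' have large positive coefficients and are strongly associated with positive reviews.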
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['worst' 'false' 'worthless' 'junk' 'garbage' 'mony' 'useless' 'messing'
 'unusable' 'horrible']

Largest Coefs: 
['excelent' 'excelente' 'exelente' 'excellent' 'loving' 'loves' 'efficient'
 'perfecto' 'amazing' 'love']
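The `argsort` and slicing pattern used above can be tried on a small array. This is a sketch with made-up feature names and coefficients.

```python
import numpy as np

# Hypothetical coefficients for five made-up features.
names = np.array(['bad', 'fine', 'great', 'ok', 'worst'])
coefs = np.array([-1.5, 0.2, 2.0, 0.0, -3.0])

order = coefs.argsort()          # indices from most negative to most positive

smallest = names[order[:2]]      # the two most negative coefficients
largest = names[order[:-3:-1]]   # the two most positive, largest first

print(smallest)
print(largest)
```

A slice `[:-(k+1):-1]` walks backwards from the end and stops after `k` items, which is why `[:-11:-1]` yields the 10 largest coefficients in descending order.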


# Tfidf
####  Term frequency-inverse document frequency

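・ allows us to rescale features.<br>
・ weights each term by how important it is to a document.<br>
・ A term that occurs often within a particular document but rarely across the corpus as a whole receives a high weight, because it is considered very important to that document.<br>
・ Conversely, a term with a low tf-idf is either used commonly across all documents or used rarely and only in long documents.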
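The weighting can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (smoothed idf, raw term counts, L2 row normalization) and uses a made-up corpus; `manual` and `reference` are hypothetical names.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['great phone', 'bad phone', 'great great battery']  # hypothetical reviews

# Raw term counts (tf).
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (the scikit-learn default)

manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Terms that appear in many documents get a smaller idf, so they contribute less to each document's vector than rarer terms with the same count.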

In [64]:
# As with CountVectorizer, instantiate TfidfVectorizer and fit it to the training data.
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)

len(vect.get_feature_names())

# Because the documents are tokenized in the same way, we might expect the same number of features,
# but there are actually far fewer than the roughly 53,000 found earlier.
# The reason is that both CountVectorizer and TfidfVectorizer accept arguments that reduce the number of features,
# which can help model performance.

# min_df: the minimum number of documents in which a token must appear
# to become part of the vocabulary (rare tokens such as proper nouns are dropped).


17951

In [65]:
X_train_vectorized = vect.transform(X_train)

# As before, we use LogisticRegression.
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

# The AUC of about 0.927 is not a dramatic improvement, but reaching the same score with far fewer features is worthwhile.

AUC:  0.926610066675


In [66]:
# Extract the terms with the highest and the lowest tf-idf.
feature_names = np.array(vect.get_feature_names())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

# Terms with a low tf-idf either appear in almost every document or appear only in long reviews.
print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
# Terms with a high tf-idf occur frequently in particular reviews (but are not words that appear in every review).
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['commenter' 'pthalo' 'warmness' 'storageso' 'aggregration' '1300'
 '625nits' 'a10' 'submarket' 'brawns']

Largest tfidf: 
['defective' 'batteries' 'gooood' 'epic' 'luis' 'goood' 'basico'
 'aceptable' 'problems' 'excellant']


In [67]:
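# As with CountVectorizer, extract the terms with the largest and smallest coefficients,
# i.e. those most strongly associated with positive and negative reviews.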
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))


Smallest Coefs:
['not' 'worst' 'useless' 'disappointed' 'terrible' 'return' 'waste' 'poor'
 'horrible' 'doesn']

Largest Coefs: 
['love' 'great' 'excellent' 'perfect' 'amazing' 'awesome' 'perfectly'
 'easy' 'best' 'loves']


In [68]:
# These reviews are treated the same by our current model:
# it fails to separate the two sentences into 1 (positive) and 0 (negative).
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

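# Both CountVectorizer and TfidfVectorizer use a bag-of-words representation, so word order is ignored:
# "phone is working" (no problem) and "phone is not working" (a problem) cannot be told apart when "not" appears in both sentences.
# The current TfidfVectorizer model sees the same negative features in both sentences and classifies both as negative reviews.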


[0 0]


# n-grams

In [69]:
# To address this, CountVectorizer supports n-grams. (With bigrams, for example, "is working" can count as positive and "not working" as negative.)

# Fit the CountVectorizer to the training data specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams (ngram_range gives the minimum and maximum sequence length)
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

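# Because both single words and word pairs become features, the feature count can grow very large when there are many long reviews.
# Here it is nearly four times the roughly 53,000-word vocabulary found earlier.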
len(vect.get_feature_names())

198917

In [70]:
# As a result, training takes considerably longer, but the AUC improves clearly over the earlier scores of about 0.92.
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))

AUC:  0.967143758101


In [71]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()
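# Terms such as 'no good' and 'not happy' are picked up as features of negative reviews,
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
# while terms such as 'not bad' and 'no problems' are picked up as features of positive reviews, so the two are now clearly distinguished. This helps explain the improved AUC.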
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'worst' 'junk' 'not good' 'not happy' 'horrible' 'garbage'
 'terrible' 'looks ok' 'nope']

Largest Coefs: 
['not bad' 'excelent' 'excelente' 'excellent' 'perfect' 'no problems'
 'exelente' 'awesome' 'no issues' 'great']


In [72]:
# These reviews are now correctly identified:
# the two sentences are separated into 1 (positive) and 0 (negative).
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]


In [None]:
# The vectorizers we saw in this tutorial are very flexible and also support tasks such as removing stop words or lemmatization.
# So be sure to check the documentation for more info.
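As one example of that flexibility, stop-word removal can be switched on with the `stop_words` argument. This is a sketch using scikit-learn's built-in English list on a made-up sentence. Lemmatization is not shown, since it would need a custom tokenizer from another library.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the battery is great and the screen is bright']  # hypothetical review

plain = CountVectorizer().fit(docs)
no_stop = CountVectorizer(stop_words='english').fit(docs)

print(sorted(plain.vocabulary_))    # includes 'the', 'is', 'and'
print(sorted(no_stop.vocabulary_))  # common function words removed
```

Whether removing stop words helps depends on the task: for sentiment, words such as "not" carry signal, so it is worth checking which words a given list drops.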