# Sentiment Analysis of Las Vegas Review Text

### [Blog post](http://xiehongfeng100.github.io/2018/08/09/yelper-las-vegas-review-text-sentiment-analysis/) accompanying this notebook

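If the stars value of a review is a user's coarse-grained rating of a business, the text he or she writes is the fine-grained one. `Sentiment analysis of the review text therefore gives us an additional, finer-grained way to measure how good a business is.`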

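Two methods are used below to analyze the sentiment of the review text. The first is based on the AFINN lexicon (via the `afinn` library); it has no explicit learning step and serves as a baseline for the second. The second is based on a CNN model that is trained on the data. Both methods are described later in this notebook.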

In [1]:
import numpy as np
import pandas as pd

# 1. Load the data

## 1.1 Review dataset (without text)

In [2]:
# Reviews
yelp_lv_rvs = pd.read_csv('../../dataset/las_vegas/review/las_vegas_review_with_db_id.csv')

In [3]:
len(yelp_lv_rvs)

1604246

In [4]:
yelp_lv_rvs[:5]

Unnamed: 0,db_id,review_id,user_db_id,business_db_id,stars,year
0,3,---3OXpexMp0oAg77xWfYA,999269,92729,5,2012
1,6,---94vtJ_5o_nikEs6hUjg,313272,122971,5,2014
2,8,---D6-P4MpS86LYldBfX7w,735101,160943,4,2016
3,20,---WDP9kwKyVQiw9GTgNmQ,1045600,12131,1,2014
4,22,---zHMCae68gIbSbtXxD5w,971613,15470,4,2015


## 1.2 Review text dataset

In [5]:
# Review text
yelp_lv_rts = pd.read_csv('../../dataset/las_vegas/review/las_vegas_review_text_preprocessed_with_db_id.csv')

In [6]:
# This differs from len(yelp_lv_rvs) because the review text was preprocessed earlier:
# non-English and empty reviews were removed. See
# https://github.com/xiehongfeng100/yelper_dpps_and_eda/blob/master/dpps/las_vegas/lv_dpps04_Preprocess_Review_Text.ipynb
len(yelp_lv_rts)

1604044

In [7]:
yelp_lv_rts[:5]

Unnamed: 0,review_db_id,text_words
0,3,pizza make night good people great pizza anyth...
1,6,one absolute favorite restaurant usually go on...
2,8,know place star lifesaver stay mandalay bay lo...
3,20,nd time eat today st time great dont think hus...
4,22,regal locate village square super convenient p...


# 2. Binarize the sentiment of review text

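`Reviews with a stars value of 5 are labeled positive, and reviews with a stars value of 1 or 2 are labeled negative` (reviews with 3 or 4 stars are middling and hard to call either way, so they are simply discarded). Note that we do not use the [time-filtered stars values](http://xiehongfeng100.github.io/2018/07/31/yelper-dpps-las-vegas-data-preprocessing/#stars-%E5%A4%84%E7%90%86) as the label, because the sentiment of a text has almost nothing to do with time.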
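The labeling rule above can be sketched on plain Python values. This is only an illustration: `binarize_stars` is a hypothetical helper that does not appear in the notebook, which applies the same rule with pandas in section 2.3.

```python
def binarize_stars(stars):
    """Map a star rating to a sentiment label.

    Returns 1 for 5 stars, 0 for 1 or 2 stars, and None for 3 or 4 stars,
    which are treated as middling and dropped.
    """
    if stars == 5:
        return 1
    if stars < 3:
        return 0
    return None


# Hypothetical ratings, one per review.
ratings = [5, 4, 1, 2, 3, 5]
labels = [binarize_stars(s) for s in ratings]
kept = [label for label in labels if label is not None]
print(labels)  # [1, None, 0, 0, None, 1]
print(kept)    # [1, 0, 0, 1]
```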

## 2.1 Extract the db_id and stars columns from the review dataset

In [8]:
# Take the db_id and stars columns, and rename db_id to review_db_id so it can be joined with the text dataset
rv_id_stars = yelp_lv_rvs[['db_id', 'stars']]
rv_id_stars = rv_id_stars.rename(columns={'db_id': 'review_db_id'})

In [9]:
len(rv_id_stars)

1604246

In [10]:
rv_id_stars[:5]

Unnamed: 0,review_db_id,stars
0,3,5
1,6,5
2,8,4
3,20,1
4,22,4


## 2.2 Join the stars values into the text dataset

In [11]:
# Join the stars values into the text dataset on review_db_id
yelp_lv_rts = yelp_lv_rts.join(rv_id_stars.set_index('review_db_id'), on='review_db_id')

In [12]:
len(yelp_lv_rts)

1604044

In [13]:
yelp_lv_rts[:5]

Unnamed: 0,review_db_id,text_words,stars
0,3,pizza make night good people great pizza anyth...,5
1,6,one absolute favorite restaurant usually go on...,5
2,8,know place star lifesaver stay mandalay bay lo...,4
3,20,nd time eat today st time great dont think hus...,1
4,22,regal locate village square super convenient p...,4


## 2.3 Binarize the sentiment of the text

In [14]:
# Discard reviews with a stars value of 3 or 4
yelp_lv_filterd_rts = yelp_lv_rts[(yelp_lv_rts.stars < 3) | (yelp_lv_rts.stars == 5)]

In [15]:
len(yelp_lv_filterd_rts)

1067530

In [16]:
yelp_lv_filterd_rts[:10]

Unnamed: 0,review_db_id,text_words,stars
0,3,pizza make night good people great pizza anyth...,5
1,6,one absolute favorite restaurant usually go on...,5
3,20,nd time eat today st time great dont think hus...,1
6,25,wow get bad review come friend amaze meal ever...,5
7,26,allegiant disaster fare cheap cheap enough fli...,1
8,28,go twice leave eh feel visit decide time switc...,2
11,34,every time come food perfect one favorite sush...,5
13,37,place awesome vibe brett rubin b day bash sit ...,5
14,40,husband go honeymoon enjoy nice dinner mention...,5
16,46,romantic spot lv great people watch nice menu ...,5


In [17]:
# Binarize: 0 for stars < 3 (negative), 1 otherwise (only 5-star reviews remain, so positive)
yelp_lv_bin_rts = yelp_lv_filterd_rts.assign(stars_binarized=[0 if stars < 3 else 1 for stars in yelp_lv_filterd_rts.stars])

In [18]:
yelp_lv_bin_rts[:10]

Unnamed: 0,review_db_id,text_words,stars,stars_binarized
0,3,pizza make night good people great pizza anyth...,5,1
1,6,one absolute favorite restaurant usually go on...,5,1
3,20,nd time eat today st time great dont think hus...,1,0
6,25,wow get bad review come friend amaze meal ever...,5,1
7,26,allegiant disaster fare cheap cheap enough fli...,1,0
8,28,go twice leave eh feel visit decide time switc...,2,0
11,34,every time come food perfect one favorite sush...,5,1
13,37,place awesome vibe brett rubin b day bash sit ...,5,1
14,40,husband go honeymoon enjoy nice dinner mention...,5,1
16,46,romantic spot lv great people watch nice menu ...,5,1


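# 3. AFINN-based text sentiment analysis (baseline)

In the [previous notebook](https://github.com/xiehongfeng100/yelper_dpps_and_eda/blob/master/eda/las_vegas/lv_eda02_Review_Text_Sentiment_Analysis_EDA.ipynb) we already used AFINN to measure the sentiment of review text, but mainly to analyze a single business, Mon Ami Gabi. Here, AFINN-based sentiment analysis serves as a baseline to compare against the CNN-based approach below.

## 3.1 Computing the sentiment score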

In [19]:
from afinn import Afinn
import multiprocessing as mp

In [20]:
af = Afinn()

In [21]:
def calc_sentiment(text_words):
    # Average the per-word AFINN scores; calling af.score(text_words) on the whole text gave lower accuracy
    return np.mean([af.score(word) for word in text_words.split(' ')])
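To make the scoring rule concrete, here is a minimal sketch that mimics `calc_sentiment` with a tiny hand-written lexicon. `TOY_LEXICON` and its values are made up for illustration and are not the real AFINN word list; the sketch also assumes that words missing from the lexicon score 0.

```python
# Toy word scores; NOT the real AFINN lexicon.
TOY_LEXICON = {"good": 3, "great": 3, "bad": -3, "disaster": -2}


def mean_word_score(text_words, lexicon=TOY_LEXICON):
    """Average the per-word scores of a space-separated, preprocessed text."""
    words = text_words.split(" ")
    scores = [lexicon.get(word, 0) for word in words]
    return sum(scores) / float(len(scores))


print(mean_word_score("pizza good people great"))      # 1.5
print(mean_word_score("allegiant disaster bad fare"))  # -1.25
```

Averaging rather than summing keeps long reviews from receiving large scores merely because they contain more words.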

In [22]:
%%time
# Note: multiprocessing's Pool.map returns results in the same order as its input, so the scores stay aligned with the rows
pool = mp.Pool(processes=12)
affin_sentiments = pool.map(calc_sentiment, yelp_lv_bin_rts.text_words)
pool.close()
pool.join()

CPU times: user 1.52 s, sys: 532 ms, total: 2.06 s
Wall time: 1min 50s


In [23]:
yelp_lv_bin_rts = yelp_lv_bin_rts.assign(affin_sentiment=affin_sentiments)

In [24]:
yelp_lv_bin_rts[:5]

Unnamed: 0,review_db_id,text_words,stars,stars_binarized,affin_sentiment
0,3,pizza make night good people great pizza anyth...,5,1,0.818182
1,6,one absolute favorite restaurant usually go on...,5,1,0.448276
3,20,nd time eat today st time great dont think hus...,1,0,-0.058824
6,25,wow get bad review come friend amaze meal ever...,5,1,0.28169
7,26,allegiant disaster fare cheap cheap enough fli...,1,0,-0.211538


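## 3.2 Normalizing the sentiment score

The raw scores computed above are not convenient for binary classification, so we normalize them to the range 0 to 1 and then use 0.5 as the threshold between positive and negative.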

In [25]:
from sklearn import preprocessing

In [26]:
min_max_scaler = preprocessing.MinMaxScaler()

In [27]:
# MinMaxScaler expects 2D input; ravel() flattens the (n, 1) result back into a single column
yelp_lv_bin_rts = yelp_lv_bin_rts.assign(affin_sentiment_scaled=min_max_scaler.fit_transform(yelp_lv_bin_rts[['affin_sentiment']]).ravel())

In [28]:
yelp_lv_bin_rts[:5]

Unnamed: 0,review_db_id,text_words,stars,stars_binarized,affin_sentiment,affin_sentiment_scaled
0,3,pizza make night good people great pizza anyth...,5,1,0.818182,0.545455
1,6,one absolute favorite restaurant usually go on...,5,1,0.448276,0.492611
3,20,nd time eat today st time great dont think hus...,1,0,-0.058824,0.420168
6,25,wow get bad review come friend amaze meal ever...,5,1,0.28169,0.468813
7,26,allegiant disaster fare cheap cheap enough fli...,1,0,-0.211538,0.398352
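The min-max normalization can be checked by hand. The sketch below assumes the raw scores range from -3 to 4; these bounds are inferred from the rows displayed above (they reproduce the scaled values shown) and are not printed anywhere in the notebook. `min_max_scale` is a hypothetical helper standing in for `MinMaxScaler`.

```python
def min_max_scale(value, lo, hi):
    """Linearly map value from the range [lo, hi] to [0, 1]."""
    return (value - lo) / float(hi - lo)


# Bounds inferred from the displayed rows (an assumption, not notebook output).
LO, HI = -3.0, 4.0

print(round(min_max_scale(0.818182, LO, HI), 6))   # 0.545455
print(round(min_max_scale(-0.058824, LO, HI), 6))  # 0.420168
print(round(min_max_scale(0.0, LO, HI), 6))        # 0.428571
```

Under these bounds a neutral raw score of 0 lands at about 0.43, so a cutoff of 0.5 on the scaled score is equivalent to a cutoff of 0.5 on the raw score.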


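## 3.3 Accuracy

Accuracy is the number of correctly classified reviews (positive and negative) divided by the total. As the result below shows, **the accuracy obtained with AFINN is only 0.5124, about the same as random guessing**. Part of this may come from the threshold rather than from AFINN itself: the scaled values shown above are consistent with min-max scaling from the range [-3, 4], under which a neutral raw score of 0 maps to about 0.43, so a cutoff of 0.5 on the scaled score corresponds to a raw score of 0.5 and labels mildly positive reviews as negative.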

In [29]:
count_correct = len(yelp_lv_bin_rts[(yelp_lv_bin_rts.stars_binarized == 1) & (yelp_lv_bin_rts.affin_sentiment_scaled >= 0.5)]) + \
                len(yelp_lv_bin_rts[(yelp_lv_bin_rts.stars_binarized == 0) & (yelp_lv_bin_rts.affin_sentiment_scaled < 0.5)])

In [30]:
correct_rate = count_correct / float(len(yelp_lv_bin_rts))

In [31]:
correct_rate

0.5124343109795509
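The accuracy computation, and its sensitivity to where the cutoff is placed, can be sketched as follows. The scores and labels are hypothetical and chosen only to illustrate the effect; they say nothing about how AFINN would actually perform on the Yelp data with a different cutoff.

```python
def accuracy(scores, labels, cutoff):
    """Fraction of items whose thresholded score matches the 0/1 label."""
    predictions = [1 if s >= cutoff else 0 for s in scores]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / float(len(labels))


# Hypothetical raw (unscaled) scores and labels.
toy_scores = [0.8, 0.3, 0.1, -0.2, -0.05, 0.45]
toy_labels = [1, 1, 1, 0, 0, 1]

print(accuracy(toy_scores, toy_labels, cutoff=0.0))  # 1.0
print(accuracy(toy_scores, toy_labels, cutoff=0.5))  # 0.5
```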

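# 4. CNN-based text sentiment analysis

CNNs are mostly applied to image processing, but they work just as well for text classification. The difference is that an image is two-dimensional while text is one-dimensional.

The architecture we follow comes from the paper [A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1510.03820):

![](images/sentiment/cnn-text-classification-arch.png)  

If you have studied 2D CNNs before ([diagram](https://en.wikipedia.org/wiki/File:Typical_cnn.png)), you will notice that this architecture has the same structure: `input layer + convolution layer (with activation) + pooling layer + fully connected layer (including the output layer)`.

In the figure above:
- The input layer is a 7×5 matrix in which each row vector represents one word (e.g. like); the CNN is "one-dimensional" in the sense that `the row vector (a word) is the basic unit`, so the sentence forms a one-dimensional sequence of such units
- The convolution layer uses 3 kernel sizes with 2 kernels each; convolving the input with these kernels yields 6 feature maps in 3 groups
- The pooling layer applies max pooling to take the largest value from each feature map, and the results are assembled into a vector of 6 features that is fed to the fully connected layer
- The fully connected layer (including the output layer) uses the softmax function to compute the classification result

If the explanation above is not clear enough, the caption from the paper may help: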
> Illustration of a Convolutional Neural Network (CNN) architecture for sentence classification. Here we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states. 
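The caption can be traced step by step in plain Python. This is a minimal sketch, not the paper's or the notebook's implementation: the sentence matrix and kernel values are made up, biases are omitted, and ReLU is assumed as the activation. `conv_feature_map` is a hypothetical helper.

```python
def conv_feature_map(sentence, kernel):
    """Valid 1D convolution of an (n, d) sentence matrix with an (h, d) kernel.

    The kernel spans whole word vectors, so it slides only along the word axis.
    A ReLU is applied to each response.
    """
    h = len(kernel)
    feature_map = []
    for start in range(len(sentence) - h + 1):
        total = 0.0
        for offset in range(h):
            word_vec = sentence[start + offset]
            total += sum(w * k for w, k in zip(word_vec, kernel[offset]))
        feature_map.append(max(total, 0.0))
    return feature_map


# Toy 7x5 sentence matrix: 7 words, 5-dimensional vectors (made-up values).
sentence = [[(i + j) % 3 - 1 for j in range(5)] for i in range(7)]

# Two made-up kernels for each region size 2, 3 and 4.
kernels = []
for h in (2, 3, 4):
    kernels.append([[1.0] * 5 for _ in range(h)])
    kernels.append([[(-1.0) ** (r + c) for c in range(5)] for r in range(h)])

feature_maps = [conv_feature_map(sentence, k) for k in kernels]
pooled = [max(fm) for fm in feature_maps]  # 1-max pooling: one value per map

print([len(fm) for fm in feature_maps])  # [6, 6, 5, 5, 4, 4]
print(len(pooled))                       # 6
```

The feature maps have different lengths because a kernel covering h words fits in 7 - h + 1 positions; 1-max pooling removes that difference and leaves one feature per kernel.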

## 4.1 Data preparation and train/test split

In [32]:
import keras
from keras.preprocessing import sequence
from keras.preprocessing import text as txt
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [33]:
text_words = yelp_lv_bin_rts.text_words.tolist()

In [34]:
tk = txt.Tokenizer(split=' ')
tk.fit_on_texts(text_words)

In [35]:
x = tk.texts_to_sequences(text_words)
y = yelp_lv_bin_rts.stars_binarized.tolist()

In [36]:
x[:2]

[[138, 12, 68, 8, 46, 6, 138, 173, 33, 6, 554],
 [11,
  1015,
  134,
  42,
  336,
  2,
  11,
  2145,
  224,
  207,
  3081,
  5081,
  168,
  11,
  3309,
  148,
  45,
  1577,
  40,
  57,
  42,
  3222,
  1118,
  429,
  921,
  65,
  36,
  65,
  4]]

In [37]:
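# Pad sequences shorter than maxlen with trailing zeros; longer ones are truncated to maxlen
# (by default pad_sequences truncates from the front, keeping the last maxlen tokens)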
maxlen = 50
x = sequence.pad_sequences(x, maxlen=maxlen, padding='post')
x[:2]

array([[ 138,   12,   68,    8,   46,    6,  138,  173,   33,    6,  554,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0],
       [  11, 1015,  134,   42,  336,    2,   11, 2145,  224,  207, 3081,
        5081,  168,   11, 3309,  148,   45, 1577,   40,   57,   42, 3222,
        1118,  429,  921,   65,   36,   65,    4,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0]], dtype=int32)
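What `pad_sequences` does to a single sequence can be sketched in plain Python. `pad_post` is a hypothetical helper that mirrors `padding='post'` combined with the library's default `truncating='pre'`; it is meant as an illustration, not as a replacement for the Keras function.

```python
def pad_post(seq, maxlen, value=0):
    """Pad at the end; if the sequence is too long, keep the LAST maxlen tokens."""
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[-maxlen:]
    return seq + [value] * (maxlen - len(seq))


print(pad_post([138, 12, 68], maxlen=5))          # [138, 12, 68, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

With `maxlen = 50`, reviews longer than 50 tokens therefore lose their opening words rather than their closing ones; passing `truncating='post'` to `pad_sequences` would keep the beginning instead.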

In [38]:
y[:2]

[1, 1]

In [39]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print('{} {} {} {}'.format(len(X_train), len(y_train), len(X_test), len(y_test)))

854024 854024 213506 213506


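## 4.2 Computing the vocabulary size

After the conversion above, the input data x is expressed as integer indices into a vocabulary. We need the size of this vocabulary so that it can be passed to the embedding layer (Embedding) as a parameter.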

In [40]:
vacabulary_size = 0
for item in x:
    max_num = max(item)
    if max_num > vacabulary_size:
        vacabulary_size = max_num

# The loop above only finds the largest index; adding 1 gives the actual vocabulary size (index 0 is the padding value)
vacabulary_size += 1

In [41]:
vacabulary_size

216661
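The relationship between word indices and the vocabulary size can be sketched without Keras. `build_word_index` is a hypothetical stand-in for what a fitted tokenizer provides: indices start at 1, more frequent words get smaller indices, and 0 is left free for padding.

```python
from collections import Counter


def build_word_index(texts):
    """Assign indices 1..V to words, most frequent first; 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.split(" "))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


# Made-up preprocessed texts.
texts = ["pizza good pizza great", "good service"]
word_index = build_word_index(texts)
vocabulary_size = len(word_index) + 1  # +1 for the padding index 0

print(sorted(word_index.values()))  # [1, 2, 3, 4]
print(vocabulary_size)              # 5
```

With the notebook's fitted `tk`, `len(tk.word_index) + 1` should give an upper bound on every index `texts_to_sequences` can emit, whereas scanning the padded `x` only sees indices that survive truncation. The notebook uses the latter; the former is mentioned here only as a possible alternative.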

## 4.3 Training

In [42]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import Dense, Activation
from keras.layers import Conv1D, GlobalMaxPooling1D

In [43]:
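# Embedding layer
input_dim = vacabulary_size # vocabulary size
output_dim = 128 # dimension of each word vector
input_length = maxlen # length of each input sequence, i.e. the number of tokens per review after padding

# Convolution layer
kernel_size = 5 # kernel size (number of consecutive words each kernel covers)
filters = 64 # number of kernels

# Training parameters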
batch_size = 30
epochs = 2

In [44]:
%%time

# Define a Sequential model
cnn_model = Sequential()

# Map each sequence of word indices to dense vectors; output shape is (batch_size, input_length, output_dim)
cnn_model.add(Embedding(input_dim, output_dim, input_length=input_length))

# Add the convolution layer
cnn_model.add(Conv1D(filters,
                     kernel_size,
                     padding='valid',
                     activation='relu',
                     strides=1))

# Add the pooling layer
cnn_model.add(GlobalMaxPooling1D())

# Add the fully connected output layer (a single sigmoid unit for binary classification)
cnn_model.add(Dense(1))
cnn_model.add(Activation('sigmoid'))

# Define the loss function, optimizer and metric
cnn_model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

# Train
cnn_model.fit(X_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(X_test, y_test))

# Evaluate accuracy on the test set
score, acc = cnn_model.evaluate(X_test, y_test, batch_size=batch_size)
print('Test score: {}'.format(score))
print('Test accuracy: {}'.format(acc))

Train on 854024 samples, validate on 213506 samples
Epoch 1/2
Epoch 2/2
Test score: 0.11517073286927942
Test accuracy: 0.9567974559067239
CPU times: user 14min 47s, sys: 6min 25s, total: 21min 13s
Wall time: 15min 18s
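The layer sizes implied by these hyperparameters can be worked out by hand. The sketch below uses the standard parameter-count formulas for embedding, 1D convolution and dense layers; the notebook does not show `cnn_model.summary()`, so the totals are a calculation rather than reported output. `conv1d_output_length` is a hypothetical helper.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 'valid' (unpadded) 1D convolution."""
    return (input_length - kernel_size) // stride + 1


vocabulary_size, output_dim = 216661, 128  # values from the notebook
maxlen, kernel_size, filters = 50, 5, 64

embedding_params = vocabulary_size * output_dim
conv_params = kernel_size * output_dim * filters + filters  # weights + biases
dense_params = filters * 1 + 1                              # weights + bias

print(conv1d_output_length(maxlen, kernel_size))      # 46
print(embedding_params)                               # 27732608
print(conv_params)                                    # 41024
print(dense_params)                                   # 65
print(embedding_params + conv_params + dense_params)  # 27773697
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the size of the model.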


## 4.4 Saving the model

In [45]:
cnn_model.save('yelp_las_vegas_review_text_sentiment_analysis_cnn_model.h5')

## 4.5 Loading the model

In [46]:
from keras.models import load_model

In [47]:
loaded_model = load_model('yelp_las_vegas_review_text_sentiment_analysis_cnn_model.h5')

In [48]:
score, acc = loaded_model.evaluate(X_test, y_test, batch_size=batch_size)
print('Test score: {}'.format(score))
print('Test accuracy: {}'.format(acc))

Test score: 0.11517073286927942
Test accuracy: 0.9567974559067239


## 4.6 Predicting the sentiment of all review texts and saving the results

In [49]:
# Be sure to reuse the original Tokenizer instance (tk) to convert all review texts
all_text_words = yelp_lv_rts.text_words.tolist()
all_text_sequences = tk.texts_to_sequences(all_text_words)
all_text_sequences = sequence.pad_sequences(all_text_sequences, maxlen=maxlen, padding='post')

In [50]:
len(all_text_sequences)

1604044

In [51]:
all_text_sequences[:2]

array([[ 138,   12,   68,    8,   46,    6,  138,  173,   33,    6,  554,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0],
       [  11, 1015,  134,   42,  336,    2,   11, 2145,  224,  207, 3081,
        5081,  168,   11, 3309,  148,   45, 1577,   40,   57,   42, 3222,
        1118,  429,  921,   65,   36,   65,    4,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0]], dtype=int32)

In [52]:
# Predict the sentiment score for every review text
cnn_sentiments = loaded_model.predict(all_text_sequences, batch_size=batch_size)

In [53]:
yelp_lv_rts = yelp_lv_rts.assign(cnn_sentiment=cnn_sentiments)

In [54]:
yelp_lv_rts[:10]

Unnamed: 0,review_db_id,text_words,stars,cnn_sentiment
0,3,pizza make night good people great pizza anyth...,5,0.808333
1,6,one absolute favorite restaurant usually go on...,5,0.992467
2,8,know place star lifesaver stay mandalay bay lo...,4,0.00104
3,20,nd time eat today st time great dont think hus...,1,0.000239
4,22,regal locate village square super convenient p...,4,0.532928
5,24,super good food friend order lbs shrimp lb cra...,4,0.95816
6,25,wow get bad review come friend amaze meal ever...,5,0.99966
7,26,allegiant disaster fare cheap cheap enough fli...,1,0.001531
8,28,go twice leave eh feel visit decide time switc...,2,0.000279
9,31,book deluxe suite vdara past weekend two night...,4,0.876377
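Because the model ends in a single sigmoid unit, each `cnn_sentiment` value lies between 0 and 1 and can be read as an estimated probability of the positive class. The notebook stores the raw scores; the sketch below shows how they could be turned into labels, with the 0.5 threshold and the helper `to_label` being assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, the model's final activation."""
    return 1.0 / (1.0 + math.exp(-z))


def to_label(probability, threshold=0.5):
    """Turn a sigmoid output into a 0/1 sentiment label."""
    return 1 if probability >= threshold else 0


# Scores copied from the first five rows of the table above.
scores = [0.808333, 0.992467, 0.00104, 0.000239, 0.532928]
print([to_label(p) for p in scores])  # [1, 1, 0, 0, 1]
print(sigmoid(0.0))                   # 0.5
```

Reviews with 3 or 4 stars were excluded from training, so their scores are extrapolations; the third row, for example, is a 4-star review that scores close to 0.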


In [55]:
# Save the results
yelp_lv_rts[['review_db_id', 'cnn_sentiment']].to_csv('../../dataset/las_vegas/review/las_vegas_review_text_sentiment_with_db_id.csv', index=False)