# Installation

In [1]:
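# Not required for this notebook: Counter is imported from the standard-library
# `collections` module; the PyPI package `collection` is a separate third-party package.
!pip install collection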

Defaulting to user installation because normal site-packages is not writeable
Collecting collection
  Downloading collection-0.1.6.tar.gz (5.0 kB)
Building wheels for collected packages: collection
  Building wheel for collection (setup.py) ... done
  Created wheel for collection: filename=collection-0.1.6-py3-none-any.whl size=5114 sha256=9833c237870afd4fd0893b2775d5e55dddd23bbc4b3318fb225c6233c6158d84
  Stored in directory: /home/chunyi/.cache/pip/wheels/3b/e0/fe/8e68dd2243f4e4741fd3950f2dbeb2fdf4b604767fde39598f
Successfully built collection
Installing collected packages: collection
Successfully installed collection-0.1.6


In [2]:
!pip install jieba

Defaulting to user installation because normal site-packages is not writeable
Collecting jieba
  Downloading jieba-0.42.1.tar.gz (19.2 MB)
Building wheels for collected packages: jieba
  Building wheel for jieba (setup.py) ... done
  Created wheel for jieba: filename=jieba-0.42.1-py3-none-any.whl size=19314478 sha256=fdd27b992ab37ad4693023b50bc3c020b1891f7b0e79d9119f92bc8149883677
  Stored in directory: /home/chunyi/.cache/pip/wheels/ca/38/d8/dfdfe73bec1d12026b30cb7ce8da650f3f0ea2cf155ea018ae
Successfully built jieba
Installing collected packages: jieba
Successfully installed jieba-0.42.1


In [6]:
!pip install keras

Defaulting to user installation because normal site-packages is not writeable
Collecting keras
  Downloading Keras-2.4.3-py2.py3-none-any.whl (36 kB)
Installing collected packages: keras
Successfully installed keras-2.4.3


# Import

In [7]:
import os
import numpy as np
import pandas as pd
import string
import jieba
# import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from collections import Counter
from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.initializers import Constant
from keras.optimizers import Adam


# Implementation

In [8]:
# read data

document = pd.read_csv("ICDM_REVIEWS_TO_RELEASE_encoding=gb18030.csv", encoding = "GB18030", header = None)
document

Unnamed: 0,0,1,2,3,4,5
0,review_id,label,user,ip,star,text
1,REVIEW0,+,USER1,IP1,5,他家的面很有嚼劲，牛肉汤很有味道，服务员的服务也特别的好，其中服务员顾存芳的服务特好！好喜欢...
2,REVIEW1,-,USER2,IP2,4,鲜榨果汁很不错，水果都是很新鲜的，口感也很好，是个小妹妹态度非常好，环境也很好，但是专家告诉...
3,REVIEW2,+,USER3,IP3,5,跟老婆过二人世界，就定了他家的一间包房，他家挺好的，跟老婆点了几个菜，边吃边聊，包房的话，那...
4,REVIEW3,-,USER4,IP4,5,我是在开业那天去得，他们的环境很好，进去给人一种很温软的感觉 他们的服务很好，销售很耐心的...
...,...,...,...,...,...,...
9761,REVIEW9760,+,USER9063,IP4317,5,之前一直在苏浙汇吃，看到这边新开了家丰收日就去尝了下，味道不错，价格也比较实惠，不是很贵，包...
9762,REVIEW9761,-,USER9064,IP25,4,我一直等着世界末日的到来，等到21号，钱用光了，什么都没有了，结果它没有来，TMD谁造的谣啊...
9763,REVIEW9762,+,USER9065,IP1687,5,菜很好吃，尤其是剁椒鱼头，强烈推荐！！建议老板稍微扩大一下场地，位置太少！每次去都要等！
9764,REVIEW9763,-,USER9066,IP5535,5,公司楼下 最好吃的麻辣烫 老板是河南人 煮麻辣烫的小姑娘长得不错


In [9]:
# rename header

header = document.iloc[0]
document = document[1:]
document = document.rename(columns = header)
data = pd.DataFrame()
data["label"] = document["label"].copy()
data["text"] = document["text"].copy()

In [10]:
data

Unnamed: 0,label,text
1,+,他家的面很有嚼劲，牛肉汤很有味道，服务员的服务也特别的好，其中服务员顾存芳的服务特好！好喜欢...
2,-,鲜榨果汁很不错，水果都是很新鲜的，口感也很好，是个小妹妹态度非常好，环境也很好，但是专家告诉...
3,+,跟老婆过二人世界，就定了他家的一间包房，他家挺好的，跟老婆点了几个菜，边吃边聊，包房的话，那...
4,-,我是在开业那天去得，他们的环境很好，进去给人一种很温软的感觉 他们的服务很好，销售很耐心的...
5,-,好像说属于宁波菜系，东西还可以，只是上菜速度比较慢，好在服务还不错。还有赠送的餐前小吃，水果...
...,...,...
9761,+,之前一直在苏浙汇吃，看到这边新开了家丰收日就去尝了下，味道不错，价格也比较实惠，不是很贵，包...
9762,-,我一直等着世界末日的到来，等到21号，钱用光了，什么都没有了，结果它没有来，TMD谁造的谣啊...
9763,+,菜很好吃，尤其是剁椒鱼头，强烈推荐！！建议老板稍微扩大一下场地，位置太少！每次去都要等！
9764,-,公司楼下 最好吃的麻辣烫 老板是河南人 煮麻辣烫的小姑娘长得不错


In [11]:
# remove stop words (punctuation is removed only if it is listed in stopwords.txt)
# source of Chinese punct: https://github.com/tsroten/zhon/blob/develop/docs/index.rst

# load the stop-word list once instead of re-reading the file for every review
with open("stopwords.txt", "r", encoding="GB18030") as f:
    STOP_WORDS = set(f.read().splitlines())

def tokenize(text):
    """Segment a sentence with jieba and drop tokens found in the stop-word list."""
    seg_list = jieba.cut(text, cut_all=False)
    return [seg for seg in seg_list if seg not in STOP_WORDS]
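
The filtering step can be illustrated without jieba on tokens that are already segmented. This is a minimal sketch: the stop-word set and the token list are hypothetical and are not taken from `stopwords.txt` or from real jieba output. Using a `set` keeps each membership test cheap.

```python
# Hypothetical stop-word set; the notebook loads its own list from stopwords.txt.
stop_words = {"的", "很", "也", "，"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]

# Hypothetical segmentation of a short phrase (not actual jieba output).
tokens = ["牛肉汤", "很", "有", "味道", "，", "服务", "也", "好"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

Tokens that are not listed, including whitespace, pass through unchanged, which is why `' '` still shows up among the tokens in the notebook output.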



In [12]:
data["texts"] = data["text"].apply(tokenize)
data

Building prefix dict from the default dictionary ...
Dumping model to file cache /scratch/local/jieba.cache
Loading model cost 0.736 seconds.
Prefix dict has been built successfully.


Unnamed: 0,label,text,texts
1,+,他家的面很有嚼劲，牛肉汤很有味道，服务员的服务也特别的好，其中服务员顾存芳的服务特好！好喜欢...,"[他家, 面, 嚼, 劲, 牛肉汤, 味道, 服务员, 服务, 特别, 服务员, 顾存芳, ..."
2,-,鲜榨果汁很不错，水果都是很新鲜的，口感也很好，是个小妹妹态度非常好，环境也很好，但是专家告诉...,"[鲜榨, 果汁, 不错, 水果, 新鲜, 口感, 小妹妹, 态度, 环境, 专家, 告诉, ..."
3,+,跟老婆过二人世界，就定了他家的一间包房，他家挺好的，跟老婆点了几个菜，边吃边聊，包房的话，那...,"[老婆, 二人, 世界, 就定, 他家, 一间, 包房, 他家, 挺, 老婆, 点, 几个,..."
4,-,我是在开业那天去得，他们的环境很好，进去给人一种很温软的感觉 他们的服务很好，销售很耐心的...,"[开业, 那天, 环境, 一种, 温软, 感觉, , , 服务, 销售, 耐心, 老公,..."
5,-,好像说属于宁波菜系，东西还可以，只是上菜速度比较慢，好在服务还不错。还有赠送的餐前小吃，水果...,"[好像, 说, 宁波, 菜系, 东西, 上菜, 速度, 比较慢, 服务, 不错, 赠送, 餐..."
...,...,...,...
9761,+,之前一直在苏浙汇吃，看到这边新开了家丰收日就去尝了下，味道不错，价格也比较实惠，不是很贵，包...,"[苏浙, 汇吃, 新开, 家, 丰收, 日, 尝, 味道, 不错, 价格, 实惠, 贵, 包..."
9762,-,我一直等着世界末日的到来，等到21号，钱用光了，什么都没有了，结果它没有来，TMD谁造的谣啊...,"[世界末日, 到来, 21, 号, 钱, 用光, TMD, 谁造, 谣, 剩, 100, 钱..."
9763,+,菜很好吃，尤其是剁椒鱼头，强烈推荐！！建议老板稍微扩大一下场地，位置太少！每次去都要等！,"[菜, 好吃, 剁, 椒, 鱼头, 强烈推荐, 建议, 老板, 稍微, 场地, 位置, 太,..."
9764,-,公司楼下 最好吃的麻辣烫 老板是河南人 煮麻辣烫的小姑娘长得不错,"[公司, 楼下, , 好吃, 麻辣烫, , 老板, 河南人, , 煮, 麻辣烫, 小姑..."


In [13]:
# labels to int: "+" -> 1 (spam), "-" -> 0 (ham)

def label_to_num(label):
    return 1 if label == "+" else 0
data["labels"] = data["label"].apply(label_to_num)



data

Unnamed: 0,label,text,texts,labels
1,+,他家的面很有嚼劲，牛肉汤很有味道，服务员的服务也特别的好，其中服务员顾存芳的服务特好！好喜欢...,"[他家, 面, 嚼, 劲, 牛肉汤, 味道, 服务员, 服务, 特别, 服务员, 顾存芳, ...",1
2,-,鲜榨果汁很不错，水果都是很新鲜的，口感也很好，是个小妹妹态度非常好，环境也很好，但是专家告诉...,"[鲜榨, 果汁, 不错, 水果, 新鲜, 口感, 小妹妹, 态度, 环境, 专家, 告诉, ...",0
3,+,跟老婆过二人世界，就定了他家的一间包房，他家挺好的，跟老婆点了几个菜，边吃边聊，包房的话，那...,"[老婆, 二人, 世界, 就定, 他家, 一间, 包房, 他家, 挺, 老婆, 点, 几个,...",1
4,-,我是在开业那天去得，他们的环境很好，进去给人一种很温软的感觉 他们的服务很好，销售很耐心的...,"[开业, 那天, 环境, 一种, 温软, 感觉, , , 服务, 销售, 耐心, 老公,...",0
5,-,好像说属于宁波菜系，东西还可以，只是上菜速度比较慢，好在服务还不错。还有赠送的餐前小吃，水果...,"[好像, 说, 宁波, 菜系, 东西, 上菜, 速度, 比较慢, 服务, 不错, 赠送, 餐...",0
...,...,...,...,...
9761,+,之前一直在苏浙汇吃，看到这边新开了家丰收日就去尝了下，味道不错，价格也比较实惠，不是很贵，包...,"[苏浙, 汇吃, 新开, 家, 丰收, 日, 尝, 味道, 不错, 价格, 实惠, 贵, 包...",1
9762,-,我一直等着世界末日的到来，等到21号，钱用光了，什么都没有了，结果它没有来，TMD谁造的谣啊...,"[世界末日, 到来, 21, 号, 钱, 用光, TMD, 谁造, 谣, 剩, 100, 钱...",0
9763,+,菜很好吃，尤其是剁椒鱼头，强烈推荐！！建议老板稍微扩大一下场地，位置太少！每次去都要等！,"[菜, 好吃, 剁, 椒, 鱼头, 强烈推荐, 建议, 老板, 稍微, 场地, 位置, 太,...",1
9764,-,公司楼下 最好吃的麻辣烫 老板是河南人 煮麻辣烫的小姑娘长得不错,"[公司, 楼下, , 好吃, 麻辣烫, , 老板, 河南人, , 煮, 麻辣烫, 小姑...",0


In [14]:
# count unique words

def count_words(textset):
    count = Counter()
    max_tkn = 0
    for row in textset:
        i = 0
        for token in row:
            count[token] += 1
            i += 1
        max_tkn = i if i > max_tkn else max_tkn
    return count, max_tkn

words_statistic, max_token = count_words(data["texts"])
print(max_token)
print(len(words_statistic))
words_statistic

832
22868


Counter({'他家': 795,
         '面': 661,
         '嚼': 101,
         '劲': 150,
         '牛肉汤': 9,
         '味道': 3931,
         '服务员': 1738,
         '服务': 2028,
         '特别': 959,
         '顾存芳': 194,
         '特': 94,
         '喜欢': 2585,
         '鲜榨': 86,
         '果汁': 138,
         '不错': 5505,
         '水果': 145,
         '新鲜': 651,
         '口感': 345,
         '小妹妹': 5,
         '态度': 435,
         '环境': 1960,
         '专家': 5,
         '告诉': 61,
         '吃水果': 9,
         ' ': 24010,
         '老婆': 60,
         '二人': 4,
         '世界': 17,
         '就定': 10,
         '一间': 25,
         '包房': 256,
         '挺': 1754,
         '点': 1779,
         '几个': 356,
         '菜': 2132,
         '吃': 7081,
         '聊': 19,
         '小姑娘': 148,
         '两人': 17,
         '打扰': 8,
         '感到': 21,
         '挺舒服': 18,
         '开业': 77,
         '那天': 77,
         '一种': 110,
         '温软': 1,
         '感觉': 2116,
         '销售': 28,
         '耐心': 103,
         '老公': 169,
         '介绍': 287
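
For reference, the same two statistics (token frequencies and the longest row) can be gathered a little more compactly with `Counter.update` and `len`. This is a sketch on hypothetical tokenized rows; `count_words_compact` is an illustrative name, not a function used in the notebook.

```python
from collections import Counter

def count_words_compact(rows):
    """Return (token frequencies, length of the longest row)."""
    counts = Counter()
    longest = 0
    for row in rows:
        counts.update(row)                # add one count per token in this row
        longest = max(longest, len(row))  # track the longest tokenized review
    return counts, longest

# Hypothetical tokenized reviews.
rows = [["味道", "不错"], ["环境", "不错", "服务"], []]
counts, longest = count_words_compact(rows)
print(longest, len(counts), counts.most_common(1))
```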

In [15]:
# training and testing dataset

X_train_temp, X_test_temp, y_train, y_test = train_test_split(data["texts"], data["labels"], test_size = 0.2, shuffle = True, random_state = 0, stratify = data["labels"])
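
What `stratify` is meant to achieve can be sketched in plain Python: each class is shuffled and split separately, so the train and test sets keep the same class proportions. This is only an illustration on hypothetical labels, not scikit-learn's implementation, and `stratified_split` is a made-up helper name.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # test share of this class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical balanced labels: 50 positives and 50 negatives.
labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```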

In [16]:
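# tokenizer
# NOTE: num_words limits texts_to_sequences to the (num_words - 1) most frequent
# words. max_token is the longest review length (832), not the vocabulary size,
# so only the 831 most frequent words are kept in the sequences built below.

tokenizer = Tokenizer(num_words = max_token)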
tokenizer.fit_on_texts(X_train_temp)
word_index = tokenizer.word_index
word_index

{' ': 1,
 '吃': 2,
 '不错': 3,
 '味道': 4,
 '~': 5,
 '好吃': 6,
 '喜欢': 7,
 '菜': 8,
 '感觉': 9,
 '服务': 10,
 '环境': 11,
 '这家': 12,
 '挺': 13,
 '点': 14,
 '服务员': 15,
 '店': 16,
 '说': 17,
 '朋友': 18,
 ',': 19,
 '真的': 20,
 '价格': 21,
 '烤鸭': 22,
 '做': 23,
 '口味': 24,
 '特别': 25,
 '家': 26,
 '下次': 27,
 '地方': 28,
 '东西': 29,
 '他家': 30,
 '老板': 31,
 '推荐': 32,
 '一家': 33,
 '鸭': 34,
 '面': 35,
 '买': 36,
 '吃饭': 37,
 '.': 38,
 '新鲜': 39,
 '牛肉': 40,
 '牛肉面': 41,
 '饭店': 42,
 '想': 43,
 '/': 44,
 '!': 45,
 '热情': 46,
 '元': 47,
 '实惠': 48,
 '汤': 49,
 "'": 50,
 '喝': 51,
 '康师傅': 52,
 '上海': 53,
 '好喝': 54,
 '生意': 55,
 '特色': 56,
 '干净': 57,
 '装修': 58,
 '太': 59,
 '态度': 60,
 '肉': 61,
 '里': 62,
 '送': 63,
 '套餐': 64,
 '海鲜': 65,
 '〜': 66,
 '鸭子': 67,
 '便宜': 68,
 '团购': 69,
 '每次': 70,
 '适合': 71,
 '一份': 72,
 '服务态度': 73,
 '崇明': 74,
 '红烧': 75,
 '奶茶': 76,
 '量': 77,
 '点评': 78,
 '几个': 79,
 '贵': 80,
 '一点': 81,
 '好好': 82,
 '很大': 83,
 '口感': 84,
 '性价比': 85,
 '第一次': 86,
 '选择': 87,
 '赞': 88,
 '羊肉': 89,
 '中午': 90,
 '差': 91,
 '爱': 92,
 '饭': 93,
 '茶': 94,
 ...
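
The index above is ranked by frequency, with the most frequent token at index 1. The sketch below mimics that ranking and the effect of a `num_words` cap in plain Python. It is a simplified illustration on hypothetical rows, not the Keras implementation, and `build_word_index` and `to_sequence` are made-up helper names.

```python
from collections import Counter

def build_word_index(token_rows):
    """Rank tokens by frequency; the most frequent token gets index 1."""
    counts = Counter(tok for row in token_rows for tok in row)
    return {tok: rank
            for rank, (tok, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(tokens, word_index, num_words=None):
    """Map tokens to indices, skipping unknown tokens and indices >= num_words."""
    seq = []
    for tok in tokens:
        idx = word_index.get(tok)
        if idx is None:
            continue                      # token never seen during fitting
        if num_words is not None and idx >= num_words:
            continue                      # token is too rare for the cap
        seq.append(idx)
    return seq

# Hypothetical tokenized training rows.
rows = [["吃", "不错", "吃"], ["味道", "不错", "吃"]]
word_index = build_word_index(rows)
full = to_sequence(["味道", "吃", "烤鸭"], word_index)
capped = to_sequence(["味道", "吃", "烤鸭"], word_index, num_words=3)
print(word_index, full, capped)
```

With a cap of 3, only indices 1 and 2 survive, so the rarer token is silently dropped from the sequence.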

In [17]:
# train sequences

X_train_sequences = tokenizer.texts_to_sequences(X_train_temp)
X_train = pad_sequences(X_train_sequences, maxlen = max_token, padding = "post", truncating = "post")
X_train.shape

(7812, 832)

In [18]:
# test sequences

X_test_sequences = tokenizer.texts_to_sequences(X_test_temp)
X_test = pad_sequences(X_test_sequences, maxlen = max_token, padding = "post", truncating = "post")
X_test.shape

(1953, 832)
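
With `padding = "post"` and `truncating = "post"`, short sequences are filled with zeros at the end and long ones lose their tail. A minimal pure-Python sketch of that behaviour for a single sequence, where `pad_post` is an illustrative helper and not part of Keras:

```python
def pad_post(seq, maxlen, value=0):
    """Cut the sequence to maxlen from the end, then right-pad with `value`."""
    trimmed = seq[:maxlen]
    return trimmed + [value] * (maxlen - len(trimmed))

short = pad_post([5, 3], maxlen=4)
long = pad_post([1, 2, 3, 4, 5], maxlen=4)
print(short, long)
```

Because `maxlen` is the longest review in the whole dataset (832), nothing is truncated here and most rows consist largely of trailing zeros.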

In [31]:
# model

model = Sequential()
model.add(Embedding(len(words_statistic), 32, input_length = max_token))
model.add(LSTM(64, dropout = .1))
model.add(Dense(1, activation = "sigmoid"))

optimizer = Adam(learning_rate=8e-7)  # `lr` is a deprecated alias of `learning_rate`

model.compile(loss = "binary_crossentropy", optimizer = optimizer, metrics = ["accuracy"])
model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 832, 32)           731776    
_________________________________________________________________
lstm_6 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 65        
Total params: 756,673
Trainable params: 756,673
Non-trainable params: 0
_________________________________________________________________
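
The parameter counts in the summary can be checked by hand from the layer sizes. The arithmetic below uses the vocabulary size printed earlier (22868), the embedding width (32) and the LSTM width (64); the variable names are only for illustration.

```python
vocab_size = 22868   # len(words_statistic) printed earlier
embed_dim = 32       # Embedding output dimension
lstm_units = 64      # LSTM hidden size

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim
# Four gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# One output unit: a weight per LSTM unit plus a bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the weights sit in the embedding table, so the vocabulary size dominates the model size.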


In [32]:
result = model.fit(X_train, y_train, epochs = 10, validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [27]:
# model2

model2 = Sequential()
model2.add(Embedding(len(words_statistic), 32, input_length = max_token))
model2.add(LSTM(64, dropout = .1))
model2.add(Dense(1, activation = "sigmoid"))

optimizer = Adam(learning_rate=2e-6)  # `lr` is a deprecated alias of `learning_rate`

model2.compile(loss = "binary_crossentropy", optimizer = optimizer, metrics = ["accuracy"])
model2.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 832, 32)           731776    
_________________________________________________________________
lstm_4 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 65        
Total params: 756,673
Trainable params: 756,673
Non-trainable params: 0
_________________________________________________________________


In [28]:
result2 = model2.fit(X_train, y_train, epochs = 10, validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# index = dict([(value, key) for (key, value) in word_index.items()])
# def get_original_text(text):
#     return "".join([index.get(i, "") for i in text])
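
The commented-out helper above inverts `word_index` to turn a padded sequence back into text. A small runnable sketch of the same idea, using a hypothetical three-word index in place of the fitted tokenizer's:

```python
# Hypothetical word index; in the notebook this would be tokenizer.word_index.
word_index = {"吃": 1, "不错": 2, "味道": 3}

# Reverse mapping: integer index -> token.
index_word = {idx: tok for tok, idx in word_index.items()}

def get_original_text(sequence):
    """Join the tokens of a sequence; padding (0) and unknown indices map to ''."""
    return "".join(index_word.get(i, "") for i in sequence)

decoded = get_original_text([3, 2, 0, 0])
print(decoded)
```

The result is the filtered token stream, not the raw review, because stop words and capped-out words were removed before the sequences were built.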