# Tutorial: Chinese sentiment analysis with hotel review data
## Dependencies

Python 3.5, numpy, pickle, keras, tensorflow, [jieba](https://github.com/fxsjy/jieba)

## Optional for plotting

pylab, scipy


In [1]:
from os import listdir
from os.path import isfile, join
import jieba
import codecs
from langconv import * # convert Traditional Chinese characters to Simplified Chinese characters
import pickle
import random

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU
from keras.preprocessing.text import Tokenizer
from keras.layers.core import Dense
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import TensorBoard

Using TensorFlow backend.


## Helper functions to pickle and load objects

In [2]:

def __pickleStuff(filename, stuff):
    # Serialize any Python object to disk
    with open(filename, "wb") as save_stuff:
        pickle.dump(stuff, save_stuff)


def __loadStuff(filename):
    # Load a previously pickled object from disk
    with open(filename, "rb") as saved_stuff:
        return pickle.load(saved_stuff)

## Get the lists of positive and negative files

In [3]:
dataBaseDirPos = "./Data/positive/"
dataBaseDirNeg = "./Data/negative/"
positiveFiles = [dataBaseDirPos + f for f in listdir(dataBaseDirPos) if isfile(join(dataBaseDirPos, f)) and '.txt' in f]
negativeFiles = [dataBaseDirNeg + f for f in listdir(dataBaseDirNeg) if isfile(join(dataBaseDirNeg, f)) and '.txt' in f]

## Show the number of sample files

In [4]:
print(len(positiveFiles))
print(len(negativeFiles))

print()
print(positiveFiles)
print(negativeFiles)

6
4

['./Data/positive/diary.txt', './Data/positive/msgs.txt', './Data/positive/theory.txt', './Data/positive/mind.txt', './Data/positive/drafts.txt', './Data/positive/saying.txt']
['./Data/negative/QQZoneComments.txt', './Data/negative/DuanZi.txt', './Data/negative/SiBuDeJieDianzi.txt', './Data/negative/BilibiliComments.txt']


## Have a look at what's in a file (one hotel review)

In [5]:
filename = positiveFiles[0]
with codecs.open(filename, "r", encoding="utf-8", errors="ignore") as doc_file:
    text=doc_file.read()
    print(text[:200])

在这个世界上我能活多久？是空留无一物还是另类？我不知道，也不会去想。

世界总是要我们给予什么，但残酷的命运无情的夺走我们的一切。

时间在这时已停止，只留下一串串时间的印记串联起的文字。

因此才有了这本日记，他是属于自己的，没人偷看。

这是一片自由的天空，任自己遨游，飞跃时间的限制，让我们能在年老的时候说：瞧！这就是青春，我的宝贵时间就是那样过的！

——————————————

天空一如


## Test removing stop words
This demo shows what a sentence looks like after tokenizing it and removing stop words.

In [6]:
filename = positiveFiles[1]
with codecs.open(filename, "r", encoding="utf-8", errors="ignore") as doc_file:
    text=doc_file.read()[:200]
    text = text.replace("\n", "")
    text = text.replace("\r", "")
print("==Orginal==:\n\r{}".format(text))
    
stopwords = [ line.rstrip() for line in codecs.open('./Data/chinese_stop_words.txt',"r", encoding="utf-8") ]
seg_list = jieba.cut(text, cut_all=False)
final =[]
seg_list = list(seg_list)
for seg in seg_list:
    if seg not in stopwords:
        final.append(seg)
print("==Tokenized==\tToken count:{}\n\r{}".format(len(seg_list)," ".join(seg_list)))
print("==Stop Words Removed==\tToken count:{}\n\r{}".format(len(final)," ".join(final)))


Building prefix dict from the default dictionary ...


==Orginal==:
我对垃圾的断绝能力一直很低导致我在现实中经常很不爽要是拒绝可以更坚决一点，就没那么多伤害了——————————————喜剧之王 一点都不好看——————————————构建一套系统真的没那么容易比如 找工作APP如何构建一个诚信机制，既能让没有任何认证的人找到工作，又不让企业吃亏(淘宝是怎么做的？让人数少的想赚钱的商家交保证金，人数多的消费者不


Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.861 seconds.
Prefix dict has been built succesfully.


==Tokenized==	Token count:119
我 对 垃圾 的 断绝 能力 一直 很 低 导致 我 在 现实 中 经常 很 不爽 要是 拒绝 可以 更 坚决 一点 ， 就 没 那么 多 伤害 了 — — — — — — — — — — — — — — 喜剧之王   一点 都 不 好看 — — — — — — — — — — — — — — 构建 一套 系统 真的 没 那么 容易 比如   找 工作 APP 如何 构建 一个 诚信 机制 ， 既能 让 没有 任何 认证 的 人 找到 工作 ， 又 不让 企业 吃亏 ( 淘宝 是 怎么 做 的 ？ 让 人数 少 的 想 赚钱 的 商家 交 保证金 ， 人数 多 的 消费者 不
==Stop Words Removed==	Token count:44
垃圾 断绝 能力 低 导致 现实 中 不爽 拒绝 一点 伤害 喜剧之王   一点 好看 构建 一套 系统 真的   找 工作 APP 构建 诚信 机制 既能 认证 找到 工作 不让 企业 吃亏 淘宝 做 人数 少 想 赚钱 商家 交 保证金 人数 消费者
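
The filtering step above can be isolated into a small helper. This is a minimal sketch, not the notebook's own code: `remove_stop_words` is a hypothetical name, and the token and stop-word lists are made up for illustration instead of coming from jieba and `chinese_stop_words.txt`. It converts the stop words to a set first, since membership tests against a list get slow on a large corpus.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words, preserving their order."""
    stop_set = set(stop_words)  # set membership is O(1), list membership is O(n)
    return [tok for tok in tokens if tok not in stop_set]


# Hypothetical pre-tokenized sentence and stop-word list (illustration only)
tokens = ["我", "对", "垃圾", "的", "断绝", "能力", "一直", "很", "低"]
stop_words = ["我", "对", "的", "一直", "很"]

filtered = remove_stop_words(tokens, stop_words)
print("Token count before:", len(tokens))
print("Token count after:", len(filtered))
print(" ".join(filtered))
```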


## Prepare "documents", a list of tuples
Some files contain bytes that cannot be decoded, so each file is opened as UTF-8 with errors="ignore", which silently drops the offending bytes. Traditional Chinese characters are converted to Simplified Chinese, and the text is split on the "——————————————" separator line so that each entry becomes one (text, label) tuple.

In [7]:
documents = []
positive_nums = 0
negative_nums = 0

for filename in positiveFiles:
    with open(filename, "r", encoding="utf-8", errors="ignore") as f:
        text = f.read()
    all_text = Converter('zh-hans').convert(text)# Convert from traditional to simplified Chinese
    text_list = all_text.split("\n\n——————————————\n\n")
    for text in text_list:
        #text = text.replace("\n", "")
        #text = text.replace("\r", "")
        documents.append((text, "pos"))
        positive_nums += 1

for filename in negativeFiles:
    with open(filename, "r", encoding="utf-8", errors="ignore") as f:
        text = f.read()
    all_text = Converter('zh-hans').convert(text)# Convert from traditional to simplified Chinese
    text_list = all_text.split("\n\n——————————————\n\n")
    for text in text_list:
        #text = text.replace("\n", "")
        #text = text.replace("\r", "")
        documents.append((text, "neg"))
        negative_nums += 1

print('positive_nums:', positive_nums)
print('negative_nums:', negative_nums)

positive_nums: 8739
negative_nums: 13422
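
To see what the decoding and splitting do in isolation, here is a minimal sketch that works on an in-memory byte string instead of a file. `split_entries` and the sample text are hypothetical; the separator is the same string used in the cell above, and the Traditional-to-Simplified conversion is left out.

```python
SEPARATOR = "\n\n——————————————\n\n"  # same separator string as in the cell above


def split_entries(raw_bytes, label):
    """Decode bytes as UTF-8, dropping invalid bytes, and split into labeled entries."""
    text = raw_bytes.decode("utf-8", errors="ignore")
    return [(entry, label) for entry in text.split(SEPARATOR)]


# Hypothetical file content: two entries followed by one byte that is not valid UTF-8
sample = ("第一条" + SEPARATOR + "第二条").encode("utf-8") + b"\xff"

docs = split_entries(sample, "pos")
print(len(docs))
print(docs)
```

The trailing invalid byte disappears without raising an error, which is the trade-off of errors="ignore": the load never fails, but damaged characters are lost silently.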


## Optional step to save/load the documents as pickle file

In [8]:
# Uncomment these two lines to save/load the documents for later use, since the step above takes a while
# __pickleStuff("./Data/chinese_sentiment_corpus.p", documents)
# documents = __loadStuff("./Data/chinese_sentiment_corpus.p")
print(len(documents))
print(documents[-4:-1])

22161
[('每天都做，但还没研究过，现在好了哈哈', 'neg'), ('极限6分钟，四分钟开始全身抖动', 'neg'), ('(=・ω・=)', 'neg')]


## Shuffle the data

In [9]:
random.shuffle(documents)

## Prepare the input and output for the model
Each input (hotel review) is a list of tokens, and each output is a single token ("pos" or "neg"). Stop words are not removed here because the dataset is relatively small, so removing them would not save much training time.

In [10]:
# Tokenize only
totalX = []
totalY = [str(doc[1]) for doc in documents]
for doc in documents:
    seg_list = jieba.cut(doc[0], cut_all=False)
    seg_list = list(seg_list)
    totalX.append(seg_list)


#Switch to below code to experiment with removing stop words
# Tokenize and remove stop words
# totalX = []
# totalY = [str(doc[1]) for doc in documents]
# stopwords = [ line.rstrip() for line in codecs.open('./Data/chinese_stop_words.txt',"r", encoding="utf-8") ]
# for doc in documents:
#     seg_list = jieba.cut(doc[0], cut_all=False)
#     seg_list = list(seg_list)
#     Uncomment below code to experiment with removing stop words
#     final =[]
#     for seg in seg_list:
#         if seg not in stopwords:
#             final.append(seg)
#     totalX.append(final)


## Visualize distribution of sentence length
Decide the maximum input sequence length; here we pick a length that covers 60% of the sentences. A longer input sequence takes more time to train but could improve prediction accuracy.

In [11]:
import numpy as np

import scipy.stats as stats
import pylab as pl
h = sorted([len(sentence) for sentence in totalX])
maxLength = h[int(len(h) * 0.60)]
print("Max length is: ",h[len(h)-1])
print("60% cover length up to: ",maxLength)
h = h[:5000]
fit = stats.norm.pdf(h, np.mean(h), np.std(h))  # normal density fitted to the lengths

pl.plot(h,fit,'-o')
pl.hist(h,density=True)      # normalized histogram; density replaces the normed argument removed from newer matplotlib
pl.show() 

Max length is:  2677
60% cover length up to:  16


[Figure: histogram of sentence lengths with the fitted normal density curve]
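
The cut-off rule used above (sort the lengths, then index at 60% of the list) can be written as a small function. This is a sketch with made-up lengths; `coverage_length` is a hypothetical name, not part of the notebook.

```python
def coverage_length(lengths, coverage):
    """Return the length found at the given fraction of the sorted length list."""
    ordered = sorted(lengths)
    return ordered[int(len(ordered) * coverage)]


# Hypothetical token counts per sentence, with a long tail like the real data
lengths = [3, 5, 8, 12, 16, 20, 45, 90, 400, 2677]

max_length = coverage_length(lengths, 0.60)
share_covered = sum(1 for n in lengths if n <= max_length) / len(lengths)
print("Cut-off length:", max_length)
print("Share of sentences that fit without truncation:", share_covered)
```

A handful of very long documents dominates the maximum but barely moves the cut-off, which is why a percentile is a more useful choice than the maximum length.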

## Words to number tokens, padding
Pad each input sequence to the maximum input length if it is shorter; longer sequences are truncated.

Save the input tokenizer, since the same tokenizer is needed for new prediction data.

In [12]:
totalX = [" ".join(wordslist) for wordslist in totalX]  # Keras Tokenizer expects the word tokens to be separated by spaces
input_tokenizer = Tokenizer(30000) # num_words: upper bound on the vocabulary used when converting texts to sequences
input_tokenizer.fit_on_texts(totalX)
vocab_size = len(input_tokenizer.word_index) + 1
print("input vocab_size:",vocab_size)
totalX = np.array(pad_sequences(input_tokenizer.texts_to_sequences(totalX), maxlen=maxLength))
__pickleStuff("./Data/input_tokenizer_chinese.p", input_tokenizer)

input vocab_size: 44932
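
To make the token-to-number step less of a black box, the following pure-Python sketch imitates roughly what `Tokenizer` and `pad_sequences` do with their default settings: words get 1-based indices in order of descending frequency, index 0 is reserved for padding, and both padding and truncation happen at the front of the sequence. It is a simplification under those assumptions (it ignores the punctuation filters and the num_words cap), and `build_word_index` and `to_padded_sequences` are hypothetical names.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def to_padded_sequences(texts, word_index, maxlen):
    """Convert texts to index lists of exactly maxlen entries."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.split() if w in word_index]
        seq = seq[-maxlen:]                             # keep the last maxlen tokens
        padded.append([0] * (maxlen - len(seq)) + seq)  # left-pad with zeros
    return padded


# Hypothetical space-separated sentences, as produced by " ".join(tokens)
texts = ["房间 很 干净", "房间 很 小 但是 很 安静"]

word_index = build_word_index(texts)
vocab_size = len(word_index) + 1  # +1 for the padding index 0
padded = to_padded_sequences(texts, word_index, maxlen=4)
print(word_index)
print(padded)
```

Words that were never seen during fitting are simply skipped, which is why the same fitted tokenizer has to be reused at prediction time.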


## Output: array of 0s and 1s

In [13]:
target_tokenizer = Tokenizer(3)
target_tokenizer.fit_on_texts(totalY)
print("output vocab_size:",len(target_tokenizer.word_index) + 1)
totalY = np.array(target_tokenizer.texts_to_sequences(totalY)) -1
totalY = totalY.reshape(totalY.shape[0])

output vocab_size: 3


In [14]:
totalY[40:50]

array([0, 1, 0, 1, 1, 0, 0, 0, 0, 1])

## Turn the output 0s and 1s into categories (one-hot vectors)

In [15]:
totalY = to_categorical(totalY, num_classes=2)

In [16]:
totalY[40:50]

array([[1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])
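
The conversion is simple enough to reproduce by hand. The sketch below is a plain-Python stand-in for `to_categorical`, using the first five labels printed earlier; `one_hot` is a hypothetical helper name.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the class
        rows.append(row)
    return rows


labels = [0, 1, 0, 1, 1]  # first five entries of totalY[40:50] shown above
encoded = one_hot(labels, num_classes=2)
for row in encoded:
    print(row)
```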

In [17]:
output_dimen = totalY.shape[1] # which is 2

## Save metadata for later prediction
maxLength: the input sequence length

vocab_size: the input vocabulary size

output_dimen: the number of output classes, which is 2 in this example (pos or neg)

sentiment_tag: either ["neg","pos"] or ["pos","neg"], matching the order used by the target tokenizer

In [18]:
target_reverse_word_index = {v: k for k, v in list(target_tokenizer.word_index.items())}
sentiment_tag = [target_reverse_word_index[1],target_reverse_word_index[2]] 
metaData = {"maxLength":maxLength,"vocab_size":vocab_size,"output_dimen":output_dimen,"sentiment_tag":sentiment_tag}
__pickleStuff("./Data/meta_sentiment_chinese.p", metaData)
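
The order of sentiment_tag depends on how the target tokenizer numbered the two labels. The sketch below imitates that numbering under the assumption that indices are assigned by descending label frequency, as the Keras `Tokenizer` does; the label list is made up, and the whole block is an illustration, not the notebook's code.

```python
from collections import Counter

# Hypothetical labels; "neg" is the more frequent class here
labels = ["pos", "neg", "neg", "pos", "neg"]

# 1-based indices, most frequent label first
counts = Counter(labels)
word_index = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

# Subtract 1 to get 0-based class ids, as done for totalY above
class_ids = [word_index[label] - 1 for label in labels]

# Invert the mapping so that class id i corresponds to sentiment_tag[i]
reverse_index = {v: k for k, v in word_index.items()}
sentiment_tag = [reverse_index[1], reverse_index[2]]

print(word_index)
print(class_ids)
print(sentiment_tag)
```

Storing sentiment_tag with the metadata avoids hard-coding which class id means "pos", since a differently balanced corpus could flip the order.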

## Build the Model, train and save it
Training metrics are logged for TensorBoard. To view them, cd into the directory

"./Graph/sentiment_chinese" and run


"python -m tensorflow.tensorboard --logdir=."

On newer TensorFlow installations, the equivalent command is "tensorboard --logdir=."

In [19]:
embedding_dim = 256

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim,input_length = maxLength))
# Each input has shape (maxLength x 256); the 256-dimensional vectors are fed into the GRU layer one at a time.
# All the intermediate outputs are collected and passed on to the second GRU layer.
model.add(GRU(256, dropout=0.9, return_sequences=True))
# The second GRU layer consumes the intermediate outputs and returns only its final output
model.add(GRU(256, dropout=0.9))
# The output is then sent to a fully connected layer that gives us the final output_dimen classes
model.add(Dense(output_dimen, activation='softmax'))
# Log training to TensorBoard
tbCallBack = TensorBoard(log_dir='./Graph/sentiment_chinese', histogram_freq=0,
                            write_graph=True, write_images=True)
# We use the adam optimizer instead of standard SGD since it converges much faster
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(totalX, totalY, validation_split=0.1, batch_size=32, epochs=20, verbose=1, callbacks=[tbCallBack])
model.save('./Data/sentiment_chinese_model.HDF5')

print("Saved model!")

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 16, 256)           11502592  
_________________________________________________________________
gru_1 (GRU)                  (None, 16, 256)           393984    
_________________________________________________________________
gru_2 (GRU)                  (None, 256)               393984    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 514       
Total params: 12,291,074
Trainable params: 12,291,074
Non-trainable params: 0
_________________________________________________________________
Train on 19944 samples, validate on 2217 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 
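
The parameter counts in the summary above can be checked by hand. The sketch below assumes the classic GRU formulation with three gates and a single bias vector per gate, which matches the numbers printed here; newer Keras versions default to a variant with an extra bias and would report a slightly larger count. `gru_params` is a hypothetical helper.

```python
vocab_size = 44932    # from the tokenizer output above
embedding_dim = 256
units = 256
num_classes = 2


def gru_params(input_dim, units):
    """Three gates, each with input weights, recurrent weights and one bias vector."""
    return 3 * (units * (input_dim + units) + units)


embedding_params = vocab_size * embedding_dim    # one vector per vocabulary entry
gru_1_params = gru_params(embedding_dim, units)  # input is the embedding vector
gru_2_params = gru_params(units, units)          # input is the first GRU's output
dense_params = units * num_classes + num_classes # weights plus biases

total = embedding_params + gru_1_params + gru_2_params + dense_params
print(embedding_params, gru_1_params, gru_2_params, dense_params, total)
```

Almost all of the 12.3 million parameters sit in the embedding layer, so the vocabulary size is the main lever on model size.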

## Prediction code
A function to load the metadata and the model we just trained.

In [20]:
model = None
sentiment_tag = None
maxLength = None
def loadModel():
    global model, sentiment_tag, maxLength
    metaData = __loadStuff("./Data/meta_sentiment_chinese.p")
    maxLength = metaData.get("maxLength")
    vocab_size = metaData.get("vocab_size")
    output_dimen = metaData.get("output_dimen")
    sentiment_tag = metaData.get("sentiment_tag")
    embedding_dim = 256
    if model is None:
        model = Sequential()
        model.add(Embedding(vocab_size, embedding_dim, input_length=maxLength))
        # Each input has shape (maxLength x 256); the 256-dimensional vectors are fed into the GRU layer one at a time.
        # All the intermediate outputs are collected and passed on to the second GRU layer.
        model.add(GRU(256, dropout=0.9, return_sequences=True))
        # The second GRU layer consumes the intermediate outputs and returns only its final output
        model.add(GRU(256, dropout=0.9))
        # The output is then sent to a fully connected layer that gives us the final output_dimen classes
        model.add(Dense(output_dimen, activation='softmax'))
        # We use the adam optimizer instead of standard SGD since it converges much faster
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        model.load_weights('./Data/sentiment_chinese_model.HDF5')
        model.summary()
    print("Model weights loaded!")

## Functions to convert sentence to model input, and predict result

In [21]:
def findFeatures(text):
    text=Converter('zh-hans').convert(text)
    text = text.replace("\n", "")
    text = text.replace("\r", "") 
    seg_list = jieba.cut(text, cut_all=False)
    seg_list = list(seg_list)
    text = " ".join(seg_list)
    textArray = [text]
    input_tokenizer_load = __loadStuff("./Data/input_tokenizer_chinese.p")
    textArray = np.array(pad_sequences(input_tokenizer_load.texts_to_sequences(textArray), maxlen=maxLength))
    return textArray
def predictResult(text):
    if model is None:
        print("Please run \"loadModel\" first.")
        return None
    features = findFeatures(text)
    predicted = model.predict(features)[0] # we have only one sentence to predict, so take index 0
    predicted = np.array(predicted)
    probab = predicted.max()
    prediction = sentiment_tag[predicted.argmax()]
    return prediction, probab
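
The last two steps of predictResult, picking the most probable class and mapping it back to a label, can be tried without a trained model. This is a minimal sketch with a made-up probability vector and an assumed tag order; `decode_prediction` is a hypothetical name.

```python
def decode_prediction(probabilities, tags):
    """Return the label with the highest probability together with that probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return tags[best], probabilities[best]


tags = ["neg", "pos"]        # assumed order; the real one comes from the saved metadata
softmax_output = [0.25, 0.75]  # hypothetical model output for one sentence

label, probability = decode_prediction(softmax_output, tags)
print(label, probability)
```

With only two classes the returned probability is never below 0.5, so values close to 0.5 indicate that the model is effectively undecided.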

## Calling the load model function

In [22]:
loadModel()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 16, 256)           11502592  
_________________________________________________________________
gru_3 (GRU)                  (None, 16, 256)           393984    
_________________________________________________________________
gru_4 (GRU)                  (None, 256)               393984    
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 514       
Total params: 12,291,074
Trainable params: 12,291,074
Non-trainable params: 0
_________________________________________________________________
Model weights loaded!


## Try some new comments, feel free to try your own
The result tuple consists of the predicted label and its likelihood.

In [40]:
predictResult("还好，床很大而且很干净，前台很友好，很满意，下次还来。")

('neg', 0.9984174)

In [41]:
predictResult("房间有点小但是设备还齐全，没有异味。")

('neg', 0.9882289)

In [42]:
predictResult("房间还算干净，一般般吧，短住还凑合。")

('neg', 0.99914956)

In [43]:
predictResult("开始不太满意，前台好说话换了一间，房间很干净没有异味。")

('neg', 0.9998275)

In [44]:
predictResult("以前从没有出现过这种情况，这一定有问题")

('neg', 0.9313915)

In [45]:
predictResult("需求决定人的行为")

('pos', 0.99997973)

In [46]:
predictResult("我不同意你所说的每一个字，但我誓死捍卫你说话的权力")

('pos', 0.5365014)

In [47]:
predictResult("凡夫俗子只关心如何去打发时间，而略具才华的人却考虑如何应用时间")

('pos', 0.9996892)

In [48]:
predictResult("清华大学的傻逼们，请出来说句话")

('neg', 0.9985605)

In [49]:
predictResult("我好可怜奥")

('neg', 0.99960333)

In [50]:
predictResult("好久都没有听到一首这样有韵味的歌了！")

('neg', 0.9997774)

In [51]:
predictResult("在一个傍晚的偏远小镇上，街道上寒冷凄清，几乎看不到路人，只有几盏闪烁的霓虹灯，渲染着寂寥的风景。")

('pos', 0.9094575)

In [35]:
predictResult("走开，女大十八变不知道啊")

('neg', 0.99970263)

In [36]:
predictResult("踢个球右腿被干了，瓜皮瓜皮")

('neg', 0.9998098)

In [37]:
predictResult("终于他梁的忙完这些稀里糊涂的东西了，爆炸")

('neg', 0.99515176)

In [38]:
predictResult("大家都是平等的")

('neg', 0.50790167)

In [39]:
saying = """
never give up
I'm born to do this
有希望在的地方，痛苦也成欢乐。
信仰是人生杠杆的支撑点，具备这个支撑点，才可能成为一个强而有力的人；信仰是事业的大门，没有正确的信仰，注定做不出伟大的事业。
哲学是有严密逻辑系统的宇宙观，它研究宇宙的性质、宇宙内万事万物演化的总规律、人在宇宙中的位置等等一些很基本的问题
伟人与平凡人的差别在于，伟人的胸中并不是没有不自信的时候，只是他能够在不自信时调整自己，从而从不自信中走出来，以达到自信的旺盛的精神状态
别人是自己的镜子，自己应该在别人成功与失败的教训中避免不幸的重现。
劣书是损害我们精神思想的毒药。
I love losing face
陈述性的讲演不会被当成 negative
偏激的、平庸的、不讲逻辑的才会
生死狙击是这两年兴起的一款页游

 
A teacher from a community college addressed a sympathetic audience.
你怕是个傻子
好耶好耶，妈妈有爸爸了
小学生们要喷就喷点有营养的好么
SB游戏
本人玩这个英雄联盟也有几千场了，打这么多场下来，不说100%的场次， 至少90%的场次是属于以下类型的。1,己方3路全爆或者敌方3路全爆2.赢是躺赢，输是凯瑞。3一方默契到爆每次抓人先人一步，或者无脑团，每次团得比对方快几秒。这个游戏秒人速度大家是有目共睹的，任何一个小小的失误都会导致被秒，团灭或者队友之间的胡喷，而且请记住，你是绝对无法彻底控制一场对战的随机性的。在这个战局优劣瞬息万变的游戏，5个随机的人打另外5个随机的人，又有各式各样的阵容克制，单个英雄之间的克制，还有暴击率。在这样一个随机性游戏里面，概率事件变得如此之多的游戏，很有可能这个游戏需要的运气量比你打牌或者赌钱的运气更多，前提是运气能量化的话。能决定你输或者赢得跟你技术关系真不大，不管你是翻盘局，少胜多，还是你凯瑞了，或者你带崩全局。都说明不了你，你队友或者你对手很垃圾或者很NB。综上经常开比赛，描述英雄联盟是一个多需要技术多注重竞技性的游戏，来洗脑这个只能玩路人局的你，舔着B脸说自己是竞技游戏的，真的是太垃圾了。
"""
text_list = [text for text in saying.split('\n') if text.strip('\n ') != '']
for text in text_list:
    print(text[:88], '\n', predictResult(text), '\n'*2)

never give up 
 ('pos', 0.9989967) 


I'm born to do this 
 ('pos', 0.999956) 


有希望在的地方，痛苦也成欢乐。 
 ('neg', 0.99284315) 


信仰是人生杠杆的支撑点，具备这个支撑点，才可能成为一个强而有力的人；信仰是事业的大门，没有正确的信仰，注定做不出伟大的事业。 
 ('pos', 0.9999392) 


哲学是有严密逻辑系统的宇宙观，它研究宇宙的性质、宇宙内万事万物演化的总规律、人在宇宙中的位置等等一些很基本的问题 
 ('pos', 0.99524117) 


伟人与平凡人的差别在于，伟人的胸中并不是没有不自信的时候，只是他能够在不自信时调整自己，从而从不自信中走出来，以达到自信的旺盛的精神状态 
 ('pos', 0.9999727) 


别人是自己的镜子，自己应该在别人成功与失败的教训中避免不幸的重现。 
 ('pos', 0.9999732) 


劣书是损害我们精神思想的毒药。 
 ('pos', 0.9948881) 


I love losing face 
 ('pos', 0.7475669) 


陈述性的讲演不会被当成 negative 
 ('neg', 0.9969944) 


偏激的、平庸的、不讲逻辑的才会 
 ('pos', 0.99997866) 


生死狙击是这两年兴起的一款页游 
 ('pos', 0.567123) 


A teacher from a community college addressed a sympathetic audience. 
 ('neg', 0.9859315) 


你怕是个傻子 
 ('neg', 0.9981139) 


好耶好耶，妈妈有爸爸了 
 ('neg', 0.99984396) 


小学生们要喷就喷点有营养的好么 
 ('neg', 0.9998635) 


SB游戏 
 ('neg', 0.9826949) 


本人玩这个英雄联盟也有几千场了，打这么多场下来，不说100%的场次， 至少90%的场次是属于以下类型的。1,己方3路全爆或者敌方3路全爆2.赢是躺赢，输是凯瑞。3一方默契到爆每 
 ('neg', 0.99814904) 


