# Sentiment analysis on OpenRice comments

This notebook demonstrates a basic, standard sentiment-analysis pipeline trained on comments from the [OpenRice](https://www.openrice.com/en/hongkong) website<sup>1</sup>. The task is to predict the sentiment score of a given user comment.


INPUT: user comment in Cantonese

OUTPUT: sentiment score [1-5] or [1,5] (emoji)


<sup>1</sup>This demo is partially based on [Chinese sentence classifier](https://gist.github.com/fpsluozi/468aa8cff4f6c7e92eed9a7d7c1112ad) and [doc2vec tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb).


## Data pre-processing

The data were downloaded with a Python scraper. Each record has all the fields of a standard OpenRice comment: user information, image links, the heading and text of the comment, and evaluation scores on a 1-5 scale (displayed as stars in the OpenRice UI). Because this research has not yet been published, dataset statistics will be disclosed only after the paper is published.

The small demo sample keeps only the information needed for sentiment analysis (comments and scores) and excludes user-sensitive information.


In [40]:
import json

import numpy as np  # needed for np.nan below
import pandas as pd

###### filtering out non-Chinese comments #########
#get_ipython().magic(u'cat comments/uid0.txt | grep -v "Not comment in Chinese" > comments/uid0_canto.json') # exclude non-Chinese comments
#get_ipython().magic(u'cat comments/all_comments | grep -v "Not comment in Chinese" > comments/all_canto.json') # exclude non-Chinese comments
df_canto = pd.read_json('comments/uid0_canto.json', lines=True)
# Treat empty comments as missing values, then drop those rows
df_canto['comment'] = df_canto['comment'].replace('', np.nan)
df_canto = df_canto.dropna(subset=['comment'])
########## hiding user-sensitive content#########
demo_sample = df_canto[['title', 'comment', 'taste', 'environment', 'service', 'sanitation', 'worthy']]
demo_sample.to_csv('comments/demo_sample.csv', sep='\t', encoding='utf-8')


######load data
df = pd.read_csv('comments/demo_sample.csv', sep='\t')
train_x = df[['comment']]
train_y = df[['worthy']]

#print (train_x)


## Tokenization

Chinese text has no spaces between words, so the comments are first segmented into tokens with [jieba](https://github.com/fxsjy/jieba). Automatic segmentation is imperfect, so it introduces some errors into the data.


In [None]:
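import jieba

dictionary = list() # segmented text, one single-element list per comment
for index, sentence in train_x.iterrows():
    comment = sentence['comment']
    seg = " ".join(jieba.cut(comment))  # tokens separated by spaces
    dictionary.append([seg])


#### put segmented text into a file, one comment per line #####
# utf-8 is set explicitly so the Chinese text is written the same way on every platform
with open("segmented_comments.txt", "w", encoding="utf-8") as f:
    for pair in dictionary:
        f.write(" ".join(pair))
        f.write("\n")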
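
`TaggedLineDocument` treats every line of the file as one document, so a comment that contains a line break would shift the alignment between vectors and labels. The sketch below shows one possible way to keep each comment on a single line. The helper name `segment_comment` and the character-level fallback (used only when jieba is not installed) are assumptions for illustration and are not part of the original notebook.

```python
import re

try:
    import jieba  # the segmenter used in this notebook

    def tokenize(text):
        return jieba.lcut(text)
except ImportError:
    # Assumption: fall back to character-level tokens when jieba is unavailable.
    def tokenize(text):
        return list(text)


def segment_comment(comment):
    """Return a single line of space-separated tokens for one comment."""
    # Collapse line breaks and runs of whitespace so the comment stays on one line.
    flat = re.sub(r"\s+", " ", comment).strip()
    # Drop whitespace-only tokens so tokens are separated by exactly one space.
    tokens = [tok for tok in tokenize(flat) if tok.strip()]
    return " ".join(tokens)


line = segment_comment("雖然貴.\n但真心好食")
print(line)
```

The exact token boundaries depend on the segmenter, but the result never contains a newline, so line `i` of the file always corresponds to comment `i`.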

## Training the doc2vec model

Each comment is encoded as a fixed-length vector (100 dimensions here) with a doc2vec model.


In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

# Each line of the file is one document, tagged with its line number
documents = TaggedLineDocument("segmented_comments.txt")
model = Doc2Vec(documents, vector_size=100, window=8, min_count=1, workers=4)
model.save("comments_doc2vec.vec")



In [11]:
#########Testing the model###########
print ("first review in original format: ", train_x.iloc[0] )
print ("Emoji rank of the first review: ", train_y.iloc[0] )
print ("first review in segmented format: ", dictionary[0] )
print ("first review in doc2vec format: ", model.docvecs[0] )
print("Most similar to good: ", model.wv.most_similar('好')) # word vectors live on model.wv; just to test the model



first review in original format:  comment    雖然貴. 但真心好食嘅爆汁牛肉餅. 九龍城除咗泰國菜出名外. 仲有呢間. 清真牛肉館. 佢地...
Name: 0, dtype: object
Emoji rank of the first review:  worthy    3
Name: 0, dtype: int64
first review in segmented format:  ['雖然 貴 .   但 真心 好 食 嘅 爆 汁 牛肉 餅 .   九龍城 除 咗 泰國菜 出名 外 .   仲有 呢 間 .   清真 牛肉 館 .   佢 地 最 出名 嘅 就 係 呢 個 牛肉 餅 .   細細 一件 已經要 .   價錢 確實 有 啲 貴 .   但 味道 真 係 唔 錯 .   外皮 煎 得 好 香脆 .   一咬落 去 .   啲 湯汁 噴晒出 嚟 .   如果 無經驗者 .   真 係 超 容易 整污 糟件 衫 .   其實 我 差少少 都 中招 .   好彩 身手 敏捷 .   避開 咗 .   另外 佢 肉 味 香 濃 .   口感 飽滿 .   以 牛肉 餅呢 講 .   佢 都 算 一 哥 .   如果 佢 係 旺 區 開 分店 就 好 啦 .   咁 就 唔 洗 吓 吓碌入 九龍城 先食 到']
first review in doc2vec format:  [ 7.03203678e-02  1.03436872e-01  9.81666446e-02 -2.11124554e-01
  2.25318759e-03  1.40257671e-01 -8.35996941e-02  4.61226143e-02
 -3.49120386e-02 -1.32962922e-02 -5.50856031e-02  8.08985680e-02
 -1.51788685e-02  2.92610209e-02  5.13075758e-03 -2.64100172e-02
 -1.28817603e-01  6.25235289e-02  5.54200150e-02 -6.27737790e-02
  1.02614313e-01 -8.15101340e-03  2.64504217e-02

In [13]:


num_line = len(model.docvecs)

#######Put the doc2vec representation for each comment into an array
x_vecs = np.zeros((num_line, 100))
for i in range(0,num_line):
    x_vecs[i] = model.docvecs[i]

#######Put y_labels (without column names) into an array    
y_label = []
for index, label in train_y.iterrows():
    label_only = label['worthy']
    y_label.append(label_only)

######Hold out the first 1000 comments as a test set and train on the rest
######Normally, we should have done some cross-validation
test_num = 1000
X_test = x_vecs[:test_num]
Y_test = y_label[:test_num]
X_train = x_vecs[test_num:]  # start at test_num so that no comment is skipped
Y_train = y_label[test_num:]
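
The comment in the cell above notes that cross-validation would normally be used instead of a single fixed split. A minimal sketch of what that could look like with scikit-learn follows. The arrays `X_demo` and `y_demo` are random stand-ins for `x_vecs` and `y_label` (an assumption made so the example is self-contained), so the printed accuracies say nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-ins: 300 "comments", 100-dimensional vectors, scores 1-5.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 100))
y_demo = rng.integers(1, 6, size=300)

# Stratified folds keep the score distribution similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

With the real data, `x_vecs` and `np.array(y_label)` would replace the synthetic arrays, and the mean over folds gives a less split-dependent estimate than one held-out block.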


## Training the classifier 

In [18]:
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn import neighbors

########
clf_logreg = LogisticRegression()
clf_logreg.fit(X_train, Y_train)
print ("Logistic regression: ", clf_logreg.score(X_test, Y_test) )
########
clf_svc = LinearSVC()
clf_svc.fit(X_train, Y_train)
print ("Support Vector Machines: ", clf_svc.score(X_test, Y_test) )
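########
num_class = 5  # assumption: not defined in the cells shown; set to the number of score classes (1-5) and used as the number of neighbours
clf_nei = neighbors.KNeighborsClassifier(num_class, weights='distance')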
clf_nei.fit(X_train, Y_train)
print ("Nearest neighbours: ", clf_nei.score(X_test, Y_test) )



Logistic regression:  0.477
Support Vector Machines:  0.469
Nearest neighbours:  0.408
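
Accuracy on a five-class problem is easier to interpret next to a trivial baseline that always predicts the most frequent training label. The sketch below computes that baseline. The function name `majority_baseline_accuracy` and the toy labels are hypothetical; they are not taken from the OpenRice data, whose statistics are not disclosed here.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# Hypothetical labels for illustration only.
toy_train = [3, 4, 4, 5, 4, 2, 3, 4]
toy_test = [4, 3, 4, 5]
label, acc = majority_baseline_accuracy(toy_train, toy_test)
print(label, acc)  # 4 0.5
```

Calling the function with `Y_train` and `Y_test` from this notebook would give the figure against which the classifier scores above can be compared.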


## Credits
This project is a collaboration between two departments of PolyU (CBS and COMP), carried out by [Natalia Klyueva](https://www.linkedin.com/in/nataliaklyueva/) and [Yunfei Long](https://www.linkedin.com/in/yunfei-long-3342b08a/) under the supervision of [Lu Qin](http://www4.comp.polyu.edu.hk/~csluqin/).