* Name: SP Tian 
* Date: May 5, 2019 

#            Sentiment Analysis on Anime Reviews 

* ## 0 Introduction 
    * ### 0.1 Import Libraries
    * ### 0.2 Loading the Dataset

* ## 1 Exploratory Data Analysis 
    * ### 1.1 Data Exploration 
        * #### 1.1.1 Rating Frequency Table
        * #### 1.1.2 Word Cloud 
    * ### 1.2 Data Cleaning 
    * ### 1.3 Data Split - only get rTrain
    * ### 1.4 Word Frequency Table 
    * ### 1.5  Import Sentiment Weights 

* ## 2 Train Models 
    * ### 2.1 Logistic Regression 
    * ### 2.2 Gaussian Naive Bayes 
    * ### 2.3 Random Forests 

* ## 3 Model Selection 
        * Cross Validation
        * print skm_conf_mat

* ## 4 Prediction on Random Forests 

## 0. Introduction 

* Question: 
How can sentiment analysis of viewer comments be used to predict the ratings those viewers give?

* Source: 
Reviews of a Japanese anime on bilibili.com, a Chinese video website that is listed on a US stock exchange under the ticker BILI. The reviews were scraped from the website's JSON responses and cover comments posted up to May 6, 2019.

* Deliverables: a CSV file of predicted ratings

## Note: 
Test data is split half (train/test) and then 70-30, containing 2258 comments. 

### 0.1 Import Libraries 

In [None]:
import os 
import sys 
import re

import scipy
import numpy as np
import pandas as pd
import jieba.analyse
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

# import sklearn modules 
import sklearn.metrics as skm
import sklearn.model_selection
import sklearn.preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix as skm_conf_mat
from collections import Counter
from collections import defaultdict

### 0.2 Loading the Dataset

In [None]:
datas = pd.read_csv("../input/bilibilib_gongzuoxibao.csv", sep = ",")

## 1. Exploratory Data Analysis 
### 1.1. Data Exploration 

In [None]:
colnames = datas.columns
print(colnames) # author, score, disliked, likes, liked, ctime, score.1, content, last_ep_index, cursor, date

In [None]:
datas.shape

In [None]:
datas.head()

#### 1.1.1 Rating Frequency Table

In [None]:
datas['score'].value_counts()

In [None]:
# Order the counts by score so each bar lines up with its own label
counts = datas['score'].value_counts().sort_index()
plt.bar(counts.index, counts.values, color='orange')
plt.xlabel('Score')
plt.ylabel('Number of reviews')
plt.title('Rating Frequencies')
plt.show()

In [None]:
#%% Content Analysis 
texts = ';'.join(datas['content'].tolist())
cut_text = " ".join(jieba.cut(texts))
# TF_IDF
keywords = jieba.analyse.extract_tags(cut_text, topK=100, withWeight=True, allowPOS=('a','e','n','nr','ns'))
text_cloud = dict(keywords)
###pd.DataFrame(keywords).to_excel('TF_IDF关键词前100.xlsx')

In [None]:
# Remove punctuation, bracketed tags such as 【...】 and 《...》, and #hashtags#
# (a raw string avoids invalid escape sequences in the pattern)
temp = r"\【.*?】+|\《.*?》+|\#.*?#+|[.!/_,$&%^*()<>+\"'?@|:~{}#]+|[——！\\，。=？、：“”‘’￥……（）《》【】]"
cut_text = re.sub(pattern = temp, repl = "", string = cut_text)
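
To see what this kind of cleaning does to a single comment, here is a minimal sketch. The pattern is a simplified, hypothetical version of `temp` (not the exact one used above), and `strip_marks` and the sample comment are made up for illustration.

```python
import re

# Simplified, hypothetical pattern: bracketed tags and hashtags first,
# then ASCII punctuation, then common full-width punctuation.
PATTERN = re.compile(
    r"【.*?】|《.*?》|#.*?#|[.!/_,$&%^*()<>+\"'?@|:~{}#]+|[！，。=？、：“”‘’￥…（）]"
)

def strip_marks(text: str) -> str:
    """Remove bracketed tags, hashtags, and punctuation from one comment."""
    return PATTERN.sub("", text)

sample = "【推荐】这部番真的很好看！！#工作细胞# 血小板，太可爱了。"
cleaned = strip_marks(sample)
print(cleaned)
```

Whitespace is left in place on purpose, so line breaks between comments survive the cleaning step.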

#### Note: the word cloud picture could not be opened

### 1.2. Data Cleaning 

In [None]:
# Drop the columns that are not used for modelling
datas = datas.drop(columns=['ctime', 'cursor', 'liked', 'disliked', 'likes', 'last_ep_index'])
# Count missing values per column
datas.isnull().sum()

### 1.3. Data Split

In [None]:
# Downsample the score-10 reviews to 1583 and keep every review with a lower score
perfect = datas[datas.score == 10]
imperfect = datas[datas.score != 10]
perfect_sample = perfect.sample(n = 1583, random_state = 1 )
new_data = pd.concat([perfect_sample, imperfect], axis = 0)

features = new_data['content']
labels = new_data['score']
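
The downsampling idea in the cell above can be sketched in plain Python, without pandas. This is only an illustration: `downsample_majority` and the toy rows are hypothetical, and the seed plays the same role as `random_state`.

```python
import random
from collections import Counter

def downsample_majority(rows, majority_label, n_keep, seed=1):
    """Keep n_keep randomly chosen rows of the majority label plus all other rows."""
    majority = [r for r in rows if r[1] == majority_label]
    others = [r for r in rows if r[1] != majority_label]
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return rng.sample(majority, n_keep) + others

# Toy (comment, score) pairs standing in for the review table
rows = [("c%d" % i, 10) for i in range(8)] + [("d1", 8), ("d2", 6), ("d3", 2)]
balanced = downsample_majority(rows, majority_label=10, n_keep=3)
counts = Counter(score for _, score in balanced)
print(counts)
```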

In [None]:
rTrain, rTest, y_train, y_test = train_test_split(features, labels, test_size = 0.3, random_state=42)
# Inspect the split: print the shapes of the resulting feature sets
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(rTrain.shape), 
      #"\nValidation set: \t{}".format(rValidation.shape),
      "\nTest set: \t\t{}".format(rTest.shape))

### 1.4. Word Frequency Table for Top 100 Keywords

In [None]:
texts = '\n'.join(rTrain.tolist())
#cut_text = jieba.lcut(texts)
cut_text = "".join(jieba.cut(texts))
cut_text = re.sub(pattern = temp, repl = "", string = cut_text)

keyword = jieba.analyse.extract_tags(cut_text, topK=100, allowPOS=('a','e','n','nr','ns'))  # list
cut_text = cut_text.split('\n')
keyword

In [None]:
cutlist = []

for i in range(0, len(cut_text)):
    cut_dic = defaultdict(int) 
    comment = cut_text[i]
    comment_cut = jieba.lcut(comment)
    for word in comment_cut: # word freq for every comment 
        if word in keyword:
            cut_dic[word] += 1  
    order = sorted(cut_dic.items(),key = lambda x:x[1],reverse = True) # word freq in descending order
    #print(order)
 
    myresult = "" 
    for j in range(0,len(order)): 
        result = order[j][0]+ "-" + str(order[j][1])
        myresult = myresult + " " + result  
    cutlist.append(myresult)
#print(cutlist)

In [None]:
word_freqs = []
for raw in cutlist:
    word_freq = {}
    for word_freq_raw in raw.split():
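        index = word_freq_raw.rfind('-')  # split on the last hyphen, in case the word contains one
        word = word_freq_raw[:index]
        freq = int(word_freq_raw[index + 1:])  # slice to the end so multi-digit counts are kept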
        word_freq[word] = freq
    word_freqs.append(word_freq)
    
matrix = []
for word_freq in word_freqs:
    row = []
    for word in keyword:
        if word in word_freq:
            row.append(word_freq[word])
        else:
            row.append(0)
    matrix.append(row)
#print(matrix)
matrix = np.array(matrix)
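
The two cells above build one row per comment and one column per keyword. A shorter route to the same kind of matrix, shown here as a sketch under the assumption that each comment is already tokenized, is to count tokens with `Counter` and skip the intermediate `word-count` strings. `keyword_count_matrix` and the toy data are hypothetical.

```python
from collections import Counter

def keyword_count_matrix(tokenized_comments, keywords):
    """Return one row per comment with the count of each keyword, in keyword order."""
    matrix = []
    for tokens in tokenized_comments:
        counts = Counter(tokens)  # a Counter returns 0 for words it has not seen
        matrix.append([counts[word] for word in keywords])
    return matrix

keywords = ["可爱", "好看", "无聊"]
comments = [["血小板", "可爱", "可爱"], ["剧情", "无聊"], []]
m = keyword_count_matrix(comments, keywords)
print(m)
```

An empty comment gives an all-zero row, so the number of rows still matches the number of labels.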

### 1.5. Import Sentiment Weights 
> X_rTrain

In [None]:
grade1 = np.array([0.1
,0
,0
,0.7
,0.8
,0.1
,0
,0.3
,0
,0
,0
,0
,0.6
,0.1
,-1
,0
,0
,1
,0
,0
,0
,0.5
,-0.3
,-0.1
,0.8
,0
,0.4
,0
,0
,0
,0.6
,0.6
,0.8
,0
,0.6
,0.4
,0.6
,1
,0
,-0.7
,0
,0.9
,0
,-0.2
,0
,0
,0
,0
,0
,0.7
,0
,1
,0
,0
,0
,0
,-0.2
,0
,0
,0.6
,0.1
,0
,0.6
,0.3
,0
,0.7
,0.7
,0
,0
,0
,0
,0
,0
,0
,0
,0.4
,0
,0.6
,0
,1
,0.6
,0
,0
,1
,0.4
,0.2
,-1
,0.8
,-1
,0
,1
,0
,0.9
,0.7
,-0.3
,0
,0.2
,0
,0
,0])

In [None]:
X = np.array(matrix) * grade1
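
The multiplication above relies on NumPy broadcasting: a matrix of shape (comments, keywords) times a weight vector of shape (keywords,) scales each column by that keyword's weight. A small sketch with made-up counts and weights:

```python
import numpy as np

# Toy keyword counts: 2 comments x 3 keywords
counts = np.array([[2, 0, 1],
                   [0, 3, 0]])
# One hypothetical sentiment weight per keyword (column)
weights = np.array([0.8, -1.0, 0.5])

# Broadcasting multiplies every row element-wise by the weight vector
X_demo = counts * weights
print(X_demo)

# The vector length has to match the number of columns
try:
    counts * np.array([0.8, -1.0])
    shape_ok = True
except ValueError:
    shape_ok = False
```

If keyword extraction returns fewer keywords than there are weights, the product fails with a `ValueError`, which is a useful early check.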

## 2. Train Models
### 2.1. Logistic Regression

In [None]:
# import Logistic model
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV

clf = LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial').fit(X, y_train)
clf.score(X, y_train)

In [None]:
np.unique(clf.predict(X))

### 2.2 Gaussian Naive Bayes

In [None]:
#Import Library of Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

gnb.fit(X, y_train)
gnb.score(X,y_train)

In [None]:
np.unique(gnb.predict(X))

### 2.3 Random Forests 

In [None]:
from sklearn.ensemble import RandomForestClassifier as RFClass
model_rf = RFClass(n_estimators = 100, max_depth=5, random_state=2019)
model_rf.fit(X, y_train)
model_rf.score(X, y_train)

In [None]:
np.unique(model_rf.predict(X))

## 3. Model Selection
> Import rTest

In [None]:
texts = '\n'.join(rTest.tolist())
#cut_text = jieba.lcut(texts)
cut_text = "".join(jieba.cut(texts))
cut_text = re.sub(pattern = temp, repl = "", string = cut_text)

keyword = jieba.analyse.extract_tags(cut_text, topK=100, allowPOS=('a','e','n','nr','ns'))  # list
cut_text = cut_text.split('\n')
keyword

In [None]:
cutlist = []

for i in range(0, len(cut_text)):
    cut_dic = defaultdict(int) 
    comment = cut_text[i]
    comment_cut = jieba.lcut(comment)
    for word in comment_cut: # word freq for every comment 
        if word in keyword:
            cut_dic[word] += 1  
    order = sorted(cut_dic.items(),key = lambda x:x[1],reverse = True) # word freq in descending order
    #print(order)
 
    myresult = "" 
    for j in range(0,len(order)): 
        result = order[j][0]+ "-" + str(order[j][1])
        myresult = myresult + " " + result  
    cutlist.append(myresult)
#print(cutlist)

In [None]:
word_freqs = []
for raw in cutlist:
    word_freq = {}
    for word_freq_raw in raw.split():
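        index = word_freq_raw.rfind('-')  # split on the last hyphen, in case the word contains one
        word = word_freq_raw[:index]
        freq = int(word_freq_raw[index + 1:])  # slice to the end so multi-digit counts are kept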
        word_freq[word] = freq
    word_freqs.append(word_freq)
    
matrix = []
for word_freq in word_freqs:
    row = []
    for word in keyword:
        if word in word_freq:
            row.append(word_freq[word])
        else:
            row.append(0)
    matrix.append(row)
#print(matrix)
matrix = np.array(matrix)

In [None]:
grade2 = np.array([0.1
,0
,0
,0.7
,0.3
,0
,0
,0.8
,0.5
,0
,0.1
,0.1
,0
,0
,1
,-1
,0
,0
,0
,0.4
,0
,0.6
,0
,0.6
,0
,0
,1
,0
,0.8
,-0.1
,0
,0
,0.4
,0
,0
,0
,0.6
,0.6
,-0.4
,0
,0
,0
,0
,0
,0.4
,1
,-0.6
,0
,-0.7
,0.9
,-1
,0.4
,0.1
,-0.2
,-0.3
,0.6
,0
,0.2
,0
,0
,0
,0
,0.2
,0
,0.6
,0
,0.5
,-1
,0
,0
,0.9
,0
,0
,-0.6
,0.1
,0
,0.4
,-0.8
,0
,0
,-0.3
,0
,0.7
,0.5
,0
,0.8
,0
,0
,0
,0
,-0.2
,0.6
,0.5
,0.7
,0
,0
,0.8
,0.5
,0.7
,-0.4])

In [None]:
xTest = np.array(matrix) * grade2
xTest.shape
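
In the cells above the keyword list is extracted again from `rTest`, so column j of `xTest` need not be the same word as column j of `X`. One possible alternative, which is not what this notebook does, is to fix the keyword list on the training comments and reuse it for the test comments. The sketch below uses plain frequency counts as a stand-in for TF-IDF extraction; `top_keywords`, `vectorize`, and the toy data are hypothetical.

```python
from collections import Counter

def top_keywords(tokenized_comments, k):
    """Pick the k most frequent tokens (a stand-in for TF-IDF keyword extraction)."""
    counts = Counter(tok for tokens in tokenized_comments for tok in tokens)
    return [word for word, _ in counts.most_common(k)]

def vectorize(tokenized_comments, keywords):
    """Count each keyword per comment, always in the same column order."""
    return [[Counter(tokens)[w] for w in keywords] for tokens in tokenized_comments]

train = [["好看", "可爱"], ["可爱", "可爱", "好看"], ["无聊"]]
test = [["无聊", "无聊"], ["好看"]]

vocab = top_keywords(train, k=2)       # chosen from the training comments only
x_train_demo = vectorize(train, vocab)
x_test_demo = vectorize(test, vocab)   # same columns as the training matrix
print(vocab, x_test_demo)
```

With a shared keyword list, a single weight vector applies to both matrices, and words that only appear in the test set are simply ignored.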

> Helper function for plotting a confusion matrix

In [None]:
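def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix `cm` (rows: true labels, columns: predictions).
    Set `normalize=True` to show row fractions instead of raw counts.
    """
    if normalize:
        # Divide each row by its total; guard against empty rows
        row_sums = cm.sum(axis=1, keepdims=True)
        cm = cm / np.maximum(row_sums, 1)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.show()

np.set_printoptions(precision = 2)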
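
For reference, here is a minimal sketch of what a confusion matrix holds and what row normalization does to it, written in plain Python. The layout follows the scikit-learn convention (rows are true labels, columns are predicted labels); `confusion_counts`, `normalize_rows`, and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, labels):
    """Count pairs of (true, predicted); rows are true labels, columns predictions."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm

def normalize_rows(cm):
    """Divide each row by its total so the diagonal becomes per-class recall."""
    out = []
    for row in cm:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

labels = [4, 8, 10]
y_true_demo = [10, 10, 8, 4, 10, 8]
y_pred_demo = [10, 8, 8, 10, 10, 8]
cm_demo = confusion_counts(y_true_demo, y_pred_demo, labels)
cm_norm = normalize_rows(cm_demo)
print(cm_demo)
```

Row normalization makes classes of very different sizes comparable, which matters when one score is much more common than the others.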

### Logistic Regression Prediction

In [None]:
clf_proba = clf.predict_proba(xTest)   # predict probability 
clf_pred = clf.predict(xTest)   # prediction result
clf.score(xTest, y_test)

In [None]:
clf_cm = skm_conf_mat(y_test, clf_pred)
plot_confusion_matrix(clf_cm, classes = list(sorted(y_train.unique())), title = 'Confusion Matrix')

### Cross Validation

In [None]:
clfcv = LogisticRegressionCV(cv=5, random_state=0, multi_class='multinomial').fit(X, y_train)
clfcv.score(X, y_train)
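
`cv=5` means the training data is split into five folds, and each fold is held out once for validation while the model is fitted on the other four. The sketch below shows only the index bookkeeping for plain, unshuffled k-fold. It is an illustration rather than scikit-learn's own splitter, which by default stratifies the folds by class for classifiers; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds without shuffling."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print(val_idx, "held out;", len(train_idx), "used for fitting")
```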

In [None]:
clfcv_proba = clfcv.predict_proba(xTest)
clfcv_pred = clfcv.predict(xTest)
clfcv.score(xTest, y_test)

In [None]:
clfcv_cm = skm_conf_mat(y_test, clfcv_pred)
plot_confusion_matrix(clfcv_cm, classes = list(sorted(datas['score'].unique())), title = 'Confusion Matrix')

### Random Forest Prediction

In [None]:
rf_proba = model_rf.predict_proba(xTest)
rf_pred = model_rf.predict(xTest)
model_rf.score(xTest, y_test)

In [None]:
# Tree Plot
from graphviz import Source
from sklearn import tree as treemodule
Source(treemodule.export_graphviz(
        model_rf.estimators_[1]
        , out_file=None
        , filled = True
        , proportion = True #@@ try False and understand the differences
        )
)

In [None]:
rf_cm = skm_conf_mat(y_test, rf_pred)
plot_confusion_matrix(rf_cm, classes = list(sorted(datas['score'].unique())), title = 'Confusion Matrix')

## 4. Prediction on Random Forests

In [None]:
rf_pred = pd.DataFrame(rf_pred)
rf_pred.to_csv("Predictions on Ratings.csv")

## 3. Model Selection (Revised)
### 3.1 Group Ratings: very high (10), high (8), and others (2-6, relabeled as 4)

In [None]:
#score = (new_data.score == 2)|(new_data.score == 6)
new_data.loc[new_data.score == 6, 'score'] = 4
new_data.loc[new_data.score == 2, 'score'] = 4
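
The relabeling above collapses five score levels into three. The same mapping written as a small function, as a sketch with made-up scores (`regroup_score` is a hypothetical helper):

```python
def regroup_score(score):
    """Map the low scores 2 and 6 onto 4; leave 4, 8, and 10 unchanged."""
    return 4 if score in (2, 6) else score

scores = [2, 4, 6, 8, 10, 6]
grouped = [regroup_score(s) for s in scores]
print(grouped)
```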

In [None]:
features = new_data['content']
labels = new_data['score']

new_data['score'].value_counts()

In [None]:
rTrain, rTest, y_train, y_test = train_test_split(features, labels, test_size = 0.3, random_state=42)
# Inspect the split: print the shapes of the resulting feature sets
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(rTrain.shape), 
      #"\nValidation set: \t{}".format(rValidation.shape),
      "\nTest set: \t\t{}".format(rTest.shape))

* rTrain

In [None]:
texts = '\n'.join(rTrain.tolist())
#cut_text = jieba.lcut(texts)
cut_text = "".join(jieba.cut(texts))
cut_text = re.sub(pattern = temp, repl = "", string = cut_text)

keyword = jieba.analyse.extract_tags(cut_text, topK=100, allowPOS=('a','e','n','nr','ns'))  # list
cut_text = cut_text.split('\n')
keyword

In [None]:
cutlist = []

for i in range(0, len(cut_text)):
    cut_dic = defaultdict(int) 
    comment = cut_text[i]
    comment_cut = jieba.lcut(comment)
    for word in comment_cut: # word freq for every comment 
        if word in keyword:
            cut_dic[word] += 1  
    order = sorted(cut_dic.items(),key = lambda x:x[1],reverse = True) # word freq in descending order
    #print(order)
 
    myresult = "" 
    for j in range(0,len(order)): 
        result = order[j][0]+ "-" + str(order[j][1])
        myresult = myresult + " " + result  
    cutlist.append(myresult)
#print(cutlist)

In [None]:
word_freqs = []
for raw in cutlist:
    word_freq = {}
    for word_freq_raw in raw.split():
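        index = word_freq_raw.rfind('-')  # split on the last hyphen, in case the word contains one
        word = word_freq_raw[:index]
        freq = int(word_freq_raw[index + 1:])  # slice to the end so multi-digit counts are kept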
        word_freq[word] = freq
    word_freqs.append(word_freq)
    
matrix = []
for word_freq in word_freqs:
    row = []
    for word in keyword:
        if word in word_freq:
            row.append(word_freq[word])
        else:
            row.append(0)
    matrix.append(row)
#print(matrix)
matrix = np.array(matrix)

In [None]:
X = np.array(matrix) * grade1

* rTest 

In [None]:
texts = '\n'.join(rTest.tolist())
#cut_text = jieba.lcut(texts)
cut_text = "".join(jieba.cut(texts))
cut_text = re.sub(pattern = temp, repl = "", string = cut_text)

keyword = jieba.analyse.extract_tags(cut_text, topK=100, allowPOS=('a','e','n','nr','ns'))  # list
cut_text = cut_text.split('\n')
keyword

In [None]:
cutlist = []

for i in range(0, len(cut_text)):
    cut_dic = defaultdict(int) 
    comment = cut_text[i]
    comment_cut = jieba.lcut(comment)
    for word in comment_cut: # word freq for every comment 
        if word in keyword:
            cut_dic[word] += 1  
    order = sorted(cut_dic.items(),key = lambda x:x[1],reverse = True) # word freq in descending order
    #print(order)
 
    myresult = "" 
    for j in range(0,len(order)): 
        result = order[j][0]+ "-" + str(order[j][1])
        myresult = myresult + " " + result  
    cutlist.append(myresult)
#print(cutlist)

In [None]:
word_freqs = []
for raw in cutlist:
    word_freq = {}
    for word_freq_raw in raw.split():
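        index = word_freq_raw.rfind('-')  # split on the last hyphen, in case the word contains one
        word = word_freq_raw[:index]
        freq = int(word_freq_raw[index + 1:])  # slice to the end so multi-digit counts are kept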
        word_freq[word] = freq
    word_freqs.append(word_freq)
    
matrix = []
for word_freq in word_freqs:
    row = []
    for word in keyword:
        if word in word_freq:
            row.append(word_freq[word])
        else:
            row.append(0)
    matrix.append(row)
#print(matrix)
matrix = np.array(matrix)

In [None]:
xTest = np.array(matrix) * grade2
xTest.shape

In [None]:
clf = LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial').fit(X, y_train)
clf.score(xTest, y_test)

In [None]:
clfcv = LogisticRegressionCV(cv=5, random_state=0, multi_class='multinomial').fit(X, y_train)
clfcv.score(xTest, y_test)

In [None]:
gnb.fit(X, y_train)
gnb.score(xTest, y_test)

In [None]:
model_rf.fit(X, y_train)
print(model_rf.score(X, y_train))
print(model_rf.score(xTest, y_test))

> So far, we have tried two approaches to improve the accuracy of predicting scores from sentiment features:

> 1. Importing different sets of sentiment weights
> 2. Reducing the score classes to [4, 8, 10] rather than [2, 4, 6, 8, 10]