**Readme:**
This functions include two steps: <br>
 -- Step one is to implement bag-of-words to convert user reviews into word vectors <br>
 -- Step two is to implement a simple Neural Network to train a classifier based on word vectors created using bag-of-words in the previous stage

In [4]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

import jieba
import jieba.posseg as pseg
import jieba.analyse

import glob
import numpy as np
import time

<h3> Data preprocessing

In [8]:
'''
combine dataset (multiple categories) into one single category;
add a column called 'label'
'''

files= glob.glob('../output_data/*.txt')

df_lst = []
for f in files:
    label = f.split('/')[-1][:2]
    df = pd.read_csv(f,header=None)
    df['label'] = label
    df_lst.append(df)

all_df = pd.concat(df_lst)
print('the whole dataset include %d reviews'%len(all_df))
all_df = all_df.rename(columns = {0:'review_tokens'})
all_df.head(10)

the whole dataset include 1623 reviews


Unnamed: 0,review_tokens,label
0,11 月 15 日 提前 预订 2018 年 11 月 27 日 长沙 飞往 沈阳 cz3...,出发
1,航班 延误 登机口 升舱 活动 以原 航班 起飞时间 为准 办理 理解,出发
2,重庆 乌鲁木齐 南航 航班 天气 原因 延误 和田 乘坐 天津 航班,出发
3,沿途 停靠 理解 延误 小时,出发
4,飞机 无故 延误 小时 脸,出发
5,延误 五个 小时 算上 值机 时间 机场 八个 小时 早上 晚上 解释 解决方案 机长 人影...,出发
6,cz3842 航班 延误 投诉无门 十点 五十 起飞 下午 三点 弄 飞机 两个 小时 告知...,出发
7,南航 航班 延误 发 短信 太 严谨 回复 改 航班 用户名 密码 我要 变更 航班 做 延...,出发
8,行李 延误 重大损失,出发
9,确认 航班 延误 订 票 显示 确认,出发


<h3> Neural Network

The functions below implement a Neural Network based on word vectors after 'bag-of-words' approach.

In [9]:
print('data set has %d training examples'%len(all_df))
all_df.head()

data set has 1623 training examples


Unnamed: 0,review_tokens,label
0,11 月 15 日 提前 预订 2018 年 11 月 27 日 长沙 飞往 沈阳 cz3...,出发
1,航班 延误 登机口 升舱 活动 以原 航班 起飞时间 为准 办理 理解,出发
2,重庆 乌鲁木齐 南航 航班 天气 原因 延误 和田 乘坐 天津 航班,出发
3,沿途 停靠 理解 延误 小时,出发
4,飞机 无故 延误 小时 脸,出发


In [11]:
classes = all_df.label.unique().tolist()
# create our training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(classes)

In [12]:
#### Bag-of-Words
'''
below using bag-of-words to convert reviews into word vectors
'''

import re

all_reviews = ''
for review in all_df['review_tokens'].values:
    all_reviews+=review
    all_reviews = re.sub(r'\d+','',all_reviews)  # remove digits
    
# a list of all unique words ever appear in user reviews
word_lst = list(set(all_reviews.split()))

for idx in range(len(all_df)):
    label = all_df.iloc[idx]['label']
    review = all_df.iloc[idx]['review_tokens']
    review = re.sub(r'\d+','',review)
    tokens = review.split()
    
    bag = [0]*len(word_lst)  
    for token in tokens:
        bag[word_lst.index(token)] = 1
    training.append(np.array(bag))
    output_row = list(output_empty)
    output_row[classes.index(label)] = 1 
    output.append(output_row)

In [13]:
print(len(training))
len(output)

1623


1623

In [14]:
# checking
print(all_df['review_tokens'].iloc[0]) # the first review
print(training[0]) # the first traning example
output[0]  # the first output

 11 月 15 日 提前 预订 2018 年 11 月 27 日 长沙 飞往 沈阳 cz3983 航班 做好 相关 会议 安排 11 月 17 日 收到 航班 延误 推迟 下午 13 40 CZ6408 航班 只好 解释 调整 会议 时间 12 点 飞机场 通知 时间 调整 下午 四点 二十 晚上 六点 起飞 飞机场 整整 六个 小时 一整天 做 事 一再 失信 生意 伙伴 信任 失望 
[0 0 0 ... 0 0 0]


[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [58]:
class NeuralNetwork():
    def __init__(self):
        np.random.seed(1)
        self.synaptic_weights = 2 * np.random.random((4871, 10)) - 1   #random starting synaptic weights --> fit the shape of input data
        self.output = []
        
    def __sigmoid(self,x):
        return  1 / (1 + np.exp(-x))
    
    def __sigmoid_derivative(self,x):
        return x * (1 - x)
    
    def train(self, training_set_inputs, training_set_outputs, number_of_training_iterations):
        for iteration in iter(range(number_of_training_iterations)):
            output = self.think(training_set_inputs)
            self.output.append(output)
            error = training_set_outputs - output
            adjustment = np.dot(training_set_inputs.T, error * self.__sigmoid_derivative(output))
            
            self.synaptic_weights += adjustment
            if (iteration % 100 == 0):           # here we calculate every 1000 iterations 
                print ("error after %s iterations: %s" % (iteration, str(np.mean(np.abs(error)))))
                
    def think(self, inputs):
        return self.__sigmoid(np.dot(inputs, self.synaptic_weights))


In [59]:
neural_network = NeuralNetwork()

training_set_inputs = np.array(training)
training_set_outputs = np.array(output) 

neural_network.train(training_set_inputs, training_set_outputs, 1000)

print ("New synaptic weights after training: ")
print (neural_network.synaptic_weights)    

error after 0 iterations: 0.5051054383454322
error after 100 iterations: 0.05794935703699803
error after 200 iterations: 0.053463185762920196
error after 300 iterations: 0.05086073545902663
error after 400 iterations: 0.0485279970818316
error after 500 iterations: 0.047803951285920916
error after 600 iterations: 0.04652002000814068
error after 700 iterations: 0.04617310898706783
error after 800 iterations: 0.04562254886502111
error after 900 iterations: 0.04462926201024846
New synaptic weights after training: 
[[-0.69383858  0.33430924 -0.9998525  ... -0.31390752 -0.24052333
   0.07744929]
 [-1.69443484  0.17171839 -0.70445646 ... -0.04693435 -1.00860108
  -0.85799909]
 [ 2.45629673  0.80992955 -0.52975671 ... -1.2356938  -0.92297394
   0.72247602]
 ...
 [-0.30099736 -0.75848004  0.53118438 ...  0.06455959 -0.65520357
  -0.72378409]
 [-0.56536272 -0.34975075 -0.55708374 ... -0.52093481 -0.13449763
  -1.02603128]
 [-0.23088864 -0.81449824 -0.59503592 ... -0.43183004 -0.4014115
   0.1673

reference: implementation of Neural Network (https://github.com/ugik/notebooks/blob/master/Simple_Neural_Network.ipynb)

In [61]:
# get the output of the last iteration
neural_network.output[-1]   

array([[7.74985244e-01, 1.49937953e-45, 4.54781476e-43, ...,
        6.33191557e-21, 2.87027266e-16, 1.10113033e-52],
       [9.93001269e-01, 3.84562412e-20, 9.80287832e-19, ...,
        3.21484003e-12, 2.47643941e-06, 6.43823921e-22],
       [9.90022435e-01, 5.09659571e-30, 3.48220677e-26, ...,
        3.55115887e-11, 7.40254746e-09, 1.53464486e-30],
       ...,
       [4.56632051e-12, 1.53511120e-14, 7.32727246e-07, ...,
        7.26383317e-10, 2.40901445e-13, 1.03094790e-14],
       [1.27359358e-11, 3.00595823e-84, 7.67401854e-70, ...,
        1.79270355e-35, 7.19254826e-63, 2.10817550e-75],
       [2.73093888e-04, 6.33731406e-07, 5.53600216e-01, ...,
        8.30622326e-07, 4.12208825e-07, 4.30062604e-08]])