README: The scripts below build a basic pipeline for classification modeling. Further things to try: <br>
 - embedding: try pretrained models
 - features: add tf-idf processing
 - modeling: try methods other than naive Bayes; tune hyperparameters
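
As a starting point for the tf-idf item above, here is a minimal sketch using scikit-learn's `TfidfVectorizer`. The toy reviews are made up for illustration; the only assumption carried over from this notebook is that reviews are already tokenized by jieba and joined with spaces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, made-up reviews in the same space-separated token format as the notebook.
toy_reviews = [
    "航班 延误 小时",
    "行李 延误 损失",
    "航班 取消 退票",
]

# Split on whitespace only, because the text is already tokenized.
# token_pattern=None avoids the warning about the unused default pattern.
vectorizer = TfidfVectorizer(tokenizer=str.split, lowercase=False,
                             token_pattern=None)
tfidf = vectorizer.fit_transform(toy_reviews)

print(tfidf.shape)                    # (number of reviews, vocabulary size)
print(sorted(vectorizer.vocabulary_))
```

Words that appear in many reviews (here 延误 and 航班) get a lower weight than words that appear in only one, which is the main difference from the binary bag-of-words used below.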
 

In [99]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

import jieba
import jieba.posseg as pseg
import jieba.analyse

import glob
import re
import numpy as np
import time

In [100]:
'''
combine the per-category datasets into one single DataFrame;
add a column called 'label' that holds the category of each review
'''

files= glob.glob('../output_data/*.txt')

df_lst = []
for f in files:
    label = f.split('/')[-1][:2]
    df = pd.read_csv(f,header=None)
    df['label'] = label
    df_lst.append(df)

all_df = pd.concat(df_lst)
print('the whole dataset includes %d reviews'%len(all_df))
all_df = all_df.rename(columns = {0:'review_tokens'})
all_df.head(10)

the whole dataset includes 1623 reviews


Unnamed: 0,review_tokens,label
0,11 月 15 日 提前 预订 2018 年 11 月 27 日 长沙 飞往 沈阳 cz3...,出发
1,航班 延误 登机口 升舱 活动 以原 航班 起飞时间 为准 办理 理解,出发
2,重庆 乌鲁木齐 南航 航班 天气 原因 延误 和田 乘坐 天津 航班,出发
3,沿途 停靠 理解 延误 小时,出发
4,飞机 无故 延误 小时 脸,出发
5,延误 五个 小时 算上 值机 时间 机场 八个 小时 早上 晚上 解释 解决方案 机长 人影...,出发
6,cz3842 航班 延误 投诉无门 十点 五十 起飞 下午 三点 弄 飞机 两个 小时 告知...,出发
7,南航 航班 延误 发 短信 太 严谨 回复 改 航班 用户名 密码 我要 变更 航班 做 延...,出发
8,行李 延误 重大损失,出发
9,确认 航班 延误 订 票 显示 确认,出发


In [105]:
# get the data size for each label
labels = all_df.label.unique().tolist()
label_size = {}
for label in labels:
    label_size[label] = len(all_df[all_df.label == label])

print(label_size)

{'出发': 352, '到达': 147, '性能': 148, '售后': 166, '设计': 47, '计划': 38, '机上': 299, '预订': 218, '中转': 147, '行程': 61}
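
The same per-label counts can be obtained with pandas directly. The sketch below uses a small made-up DataFrame in place of `all_df`; `toy_df` and `label_share` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up stand-in for all_df, with only the 'label' column filled in.
toy_df = pd.DataFrame({"label": ["出发", "出发", "到达", "机上", "出发", "机上"]})

# value_counts() counts rows per label, sorted from most to least frequent.
label_size = toy_df["label"].value_counts().to_dict()
print(label_size)

# Share of each label, useful for spotting class imbalance at a glance.
label_share = toy_df["label"].value_counts(normalize=True).round(2).to_dict()
print(label_share)
```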


In [55]:
### try bag-of-words for now
def get_bag_of_words(training_df):
    '''
    input: a training set df that contains all data
    output: a list of binary bag-of-words vectors, one per review
    '''
    # remove digits, then split each review into its whitespace-separated tokens
    tokenized = [re.sub(r'\d+', '', review).split()
                 for review in training_df['review_tokens'].values]

    # a sorted list of all unique words that ever appear in user reviews
    word_lst = sorted(set(token for tokens in tokenized for token in tokens))
    # map each word to its column index (faster than word_lst.index)
    word_idx = {word: i for i, word in enumerate(word_lst)}

    training = []
    for tokens in tokenized:
        bag = [0]*len(word_lst)
        for token in tokens:
            bag[word_idx[token]] = 1  # 1 if the word occurs in the review
        training.append(np.array(bag))
    return training
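
For comparison, a binary bag-of-words can also be built with scikit-learn's `CountVectorizer`. This is a sketch on made-up reviews, not the notebook's own method; `split_tokens` is a hypothetical helper that mirrors the digit removal above.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def split_tokens(review):
    """Drop digits, then split the pre-tokenized review on whitespace."""
    return re.sub(r"\d+", "", review).split()

# Made-up reviews in the notebook's space-separated format.
toy_reviews = ["航班 延误 2 小时", "cz3842 航班 延误"]

# binary=True records presence/absence instead of counts.
vectorizer = CountVectorizer(tokenizer=split_tokens, lowercase=False,
                             binary=True, token_pattern=None)
X_toy = vectorizer.fit_transform(toy_reviews).toarray()

print(X_toy.shape)             # (number of reviews, vocabulary size)
print(vectorizer.vocabulary_)  # word -> column index
```

The vectorizer returns a sparse matrix by default, which matters once the vocabulary grows to thousands of words; `.toarray()` is used here only to make the toy result easy to inspect.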

In [82]:
# features are the X for all data;
# the bag-of-words vocabulary is built on the whole dataset before splitting,
# so that train and test vectors have the same length

from sklearn import preprocessing

features = get_bag_of_words(all_df)
le = preprocessing.LabelEncoder()
targets = le.fit_transform(all_df.label)
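
Building the vocabulary on all data lets test reviews influence the features. One possible alternative, sketched below on made-up data, is to split the raw text first and fit a `CountVectorizer` on the training part only; `transform` then maps test reviews onto the same columns, so both matrices keep the same width.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up tokenized reviews and labels, standing in for all_df.
toy_reviews = ["航班 延误 小时", "行李 延误 损失", "航班 取消 退票",
               "座位 舒适 服务", "餐食 服务 态度", "座位 空间 狭窄"]
toy_labels = ["出发", "出发", "出发", "机上", "机上", "机上"]

# Split the raw text first, so test reviews never influence the vocabulary.
train_text, test_text, y_train, y_test = train_test_split(
    toy_reviews, toy_labels, test_size=2, random_state=42)

vectorizer = CountVectorizer(tokenizer=str.split, lowercase=False,
                             binary=True, token_pattern=None)
X_train = vectorizer.fit_transform(train_text)  # learns the vocabulary
X_test = vectorizer.transform(test_text)        # reuses it; unseen words are dropped

print(X_train.shape, X_test.shape)
```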

In [89]:
# convert features and targets to arrays
X = np.array(features)
y = np.array(targets)  # 1-D label vector, the shape scikit-learn estimators expect

In [91]:
### train test split data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33, random_state=42)
print('training data has %d examples' %len(X_train))
print('test data has %d examples' %len(X_test))

training data has 1087 examples
test data has 536 examples


In [132]:
from sklearn.metrics import classification_report

def get_model_performance(model, X,y):
    '''
    input: the model to fit, the entire data split into features(X) and target(y)
    output: None; prints the classification report on test data
    '''
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33, random_state=42)
    print('training data has %d examples' %len(X_train))
    print('test data has %d examples' %len(X_test))

    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)

    # overall accuracy can be misleading on an imbalanced dataset (the accuracy paradox),
    # because it is a single score over all classes
    # accuracy = accuracy_score(y_test, y_pred)
    
    # sklearn "classification_report" returns precision, recall, f1-score for each target class
    result = classification_report(y_test, y_pred)
    
    print('model classification report', result)
    # print("model accuracy score on test data: "+"{:.2f}".format(accuracy))
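
The accuracy paradox mentioned in the comments can be seen on a tiny made-up example: a "model" that always predicts the majority class scores well on accuracy but poorly on macro F1. The labels below are invented for illustration only.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up imbalanced labels: nine examples of class 0, one of class 1.
y_true = [0] * 9 + [1]
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 instead of warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("accuracy: {:.2f}".format(accuracy))
print("macro F1: {:.2f}".format(macro_f1))
```

Accuracy comes out at 0.90 while macro F1 is about 0.47, because class 1 is never predicted and contributes an F1 of 0 to the average.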


In [133]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

print('Naive bayes performance:')
get_model_performance(GaussianNB(),X,y)
print('================================')

print('logistic regression performance:')
get_model_performance(LogisticRegression(),X,y)

Naive bayes performance:
training data has 1087 examples
test data has 536 examples
model classification report               precision    recall  f1-score   support

           0       0.33      0.33      0.33        39
           1       0.38      0.43      0.40       116
           2       0.40      0.29      0.33        59
           3       0.36      0.36      0.36        53
           4       0.22      0.16      0.19        56
           5       0.53      0.45      0.49        93
           6       0.31      0.38      0.34        21
           7       0.05      0.12      0.07         8
           8       0.17      0.38      0.24        13
           9       0.38      0.36      0.37        78

   micro avg       0.36      0.36      0.36       536
   macro avg       0.31      0.33      0.31       536
weighted avg       0.37      0.36      0.36       536

logistic regression performance:
training data has 1087 examples
test data has 536 examples
model classification report          

In [118]:
# implement cross validation on training data
from sklearn.model_selection import cross_val_score

clf = GaussianNB()
# implement 5-fold cross validation
# TODO: cross validation for hyperparameter tuning
scores = cross_val_score(clf, X_train, y_train, cv=5)
print('scores:',scores)
print('average accuracy score:'+ '{:.2f}'.format(np.average(scores)))

scores: [0.33936652 0.31506849 0.3853211  0.31627907 0.3271028 ]
average accuracy score:0.34
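
By default `cross_val_score` reports accuracy for a classifier. A sketch of the same 5-fold setup with an explicit stratified splitter and the `f1_weighted` metric is shown below; it runs on synthetic data from `make_classification`, not on the review data, so the numbers it prints say nothing about this dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for (X_train, y_train); not the review data.
X_toy, y_toy = make_classification(n_samples=200, n_features=20, n_informative=5,
                                   n_classes=3, random_state=42)

# Stratified folds keep each class's share roughly equal across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GaussianNB(), X_toy, y_toy, cv=cv, scoring="f1_weighted")

print("scores:", scores)
print("average weighted F1: {:.2f}".format(np.average(scores)))
```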


In [134]:
# find the best parameters by grid search with cross validation
from sklearn.model_selection import GridSearchCV
parameters = {'penalty':('l1', 'l2'), 'C':[0.1, 1, 10]}
# the 'liblinear' solver supports both l1 and l2 penalties
model = LogisticRegression(solver='liblinear')

# use "f1_weighted" as the evaluation metric (see the explanation below)
clf = GridSearchCV(model, parameters, cv=5, scoring = 'f1_weighted')
clf.fit(X_train, y_train)
print(clf.best_params_)

{'C': 1, 'penalty': 'l1'}


About the `f1_weighted` metric: it calculates metrics for each label, and finds their average weighted by support (the number of true instances for each label). 
This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall. <br>
reference: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score <br>
other scoring metrics in sklearn: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
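
To make the weighting concrete, the sketch below recomputes the weighted F1 by hand from per-class F1 scores and supports, and compares it with `f1_score(average="weighted")`. The labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels with unequal class sizes (support 4, 2 and 1).
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]

per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 per label
support = np.bincount(y_true)                          # true instances per label

macro_f1 = per_class_f1.mean()                         # every label counts equally
weighted_by_hand = np.average(per_class_f1, weights=support)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print("per-class F1:", per_class_f1.round(3))
print("macro F1: {:.3f}".format(macro_f1))
print("weighted F1: {:.3f} (by hand: {:.3f})".format(weighted_f1, weighted_by_hand))
```

The weighted score is pulled toward the F1 of the largest class, which is why it differs from the macro average whenever the classes are imbalanced.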


In [135]:
print('logistic regression performance:')
get_model_performance(LogisticRegression(C=1, penalty='l1', solver='liblinear'),X,y)

logistic regression performance:
training data has 1087 examples
test data has 536 examples
model classification report               precision    recall  f1-score   support

           0       0.54      0.67      0.60        39
           1       0.60      0.56      0.58       116
           2       0.62      0.58      0.60        59
           3       0.63      0.49      0.55        53
           4       0.47      0.27      0.34        56
           5       0.53      0.77      0.63        93
           6       0.70      0.67      0.68        21
           7       0.40      0.25      0.31         8
           8       0.38      0.38      0.38        13
           9       0.63      0.62      0.62        78

   micro avg       0.57      0.57      0.57       536
   macro avg       0.55      0.53      0.53       536
weighted avg       0.57      0.57      0.56       536



In [None]:
Next steps:
 - do hyper-parameter tuning
 - may 