<font size=5 color=#0099ff> Exploring how mainstream NLP algorithms perform on a text classification task </font><br>
Covered:<br>
1. TF-IDF
2. Count features
3. Logistic regression
4. Naive Bayes
5. SVM
6. XGBoost
7. Word vectors
8. LSTM
9. GRU
10. Ensembling

In [1]:
import pandas as pd
import numpy as np

from tqdm import tqdm
from sklearn.svm import SVC
from keras.models import Sequential
from keras.layers.recurrent import LSTM, GRU
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from keras.preprocessing import sequence, text
from keras.callbacks import EarlyStopping
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

Using TensorFlow backend.


In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample = pd.read_csv('sample_submission.csv')

In [3]:
train.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


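Goal: predict which of three authors (EAP, HPL or MWS) wrote each text. This is a multi-class text classification task: every text has exactly one label.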

The evaluation metric is multi-class log-loss; the implementation is taken from https://github.com/dnouri/nolearn/blob/master/nolearn/lasagne/util.py

In [4]:
# Define the multi-class log-loss evaluation function
def multiclass_logloss(actual, predicted, eps=1e-15):
    """
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)  # keep probabilities away from 0 and 1 so log() stays finite
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
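
To make the metric concrete, here is a minimal sketch that recomputes the log-loss on a toy example and compares it with a hand calculation. It is a compact re-implementation for integer labels only, not the function above, and the two-sample data is hypothetical.

```python
import math
import numpy as np

def multiclass_logloss(actual, predicted, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    predicted = np.clip(np.asarray(predicted, dtype=float), eps, 1 - eps)
    actual = np.asarray(actual)
    # Pick, for every row, the probability predicted for its true class.
    true_class_prob = predicted[np.arange(len(actual)), actual]
    return float(-np.mean(np.log(true_class_prob)))

# Hypothetical toy data: 2 samples, 3 classes.
y_true = np.array([0, 2])
y_prob = np.array([[0.5, 0.25, 0.25],
                   [0.1, 0.1, 0.8]])

loss = multiclass_logloss(y_true, y_prob)
by_hand = -(math.log(0.5) + math.log(0.8)) / 2
print(loss, by_hand)
```

Only the probability given to the correct class enters the sum, so a confident wrong prediction is penalized heavily while a confident right one costs almost nothing.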

First, use scikit-learn's LabelEncoder to convert the author labels to the integers 0, 1, 2

In [5]:
lbl_enc=preprocessing.LabelEncoder()
y=lbl_enc.fit_transform(train.author.values)


In [6]:
y[:10]

array([0, 1, 0, 2, 1, 2, 0, 0, 0, 2], dtype=int64)
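
As a small illustration of the encoding, the sketch below fits a LabelEncoder on the first five author labels shown by `train.head()`. The variable names are made up for this example. LabelEncoder sorts the class names, so EAP, HPL and MWS map to 0, 1 and 2.

```python
from sklearn.preprocessing import LabelEncoder

# First five labels from train.head()
authors = ["EAP", "HPL", "EAP", "MWS", "HPL"]

enc = LabelEncoder()
codes = enc.fit_transform(authors)

# classes_ holds the sorted label names; the position of a name is its integer code.
mapping = {label: code for code, label in enumerate(enc.classes_)}
decoded = enc.inverse_transform(codes)
print(mapping, list(codes), list(decoded))
```

`inverse_transform` is what turns predicted class indices back into author names.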

Split the training data into training and validation sets with train_test_split from model_selection

In [7]:
xtrain,xvalid,ytrain,yvalid=train_test_split(train.text.values,y,stratify=y,random_state=42,
                                            test_size=0.1,shuffle=True)

In [8]:
print (xtrain.shape)
print (xvalid.shape)

(17621,)
(1958,)
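
The `stratify=y` argument keeps the class proportions the same in both parts of the split. A minimal sketch of that effect, using hypothetical labels with a 50/30/20 class balance rather than the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50 / 30 / 20 samples per class.
labels = np.array([0] * 50 + [1] * 30 + [2] * 20)
texts = np.array(["doc %d" % i for i in range(len(labels))])

x_tr, x_va, y_tr, y_va = train_test_split(
    texts, labels, stratify=labels, test_size=0.1, random_state=42, shuffle=True)

# Count how many samples of each class ended up on each side.
train_counts = np.bincount(y_tr)
valid_counts = np.bincount(y_va)
print(train_counts, valid_counts)
```

Both sides keep the 5:3:2 ratio, so the validation log-loss is measured on the same author mix the model was trained on.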


<font size=5 >Building the models </font><br>

<font color=#0099ff>1. Logistic regression on TF-IDF (Term Frequency - Inverse Document Frequency) features </font><br>


In [9]:

tfv = TfidfVectorizer(min_df=3,  max_features=None, 
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=True,smooth_idf=True,sublinear_tf=True,
            stop_words = 'english')
# strip_accents: {'ascii', 'unicode', None}. Removes accents during preprocessing.
#     'ascii' is fast but only works on characters that have a direct ASCII mapping;
#     'unicode' is slightly slower but works on any character; None (the default) does nothing.
# use_idf: bool. Enables inverse-document-frequency reweighting.
# smooth_idf: bool. Smooths idf weights by adding one to document frequencies, as if one extra
#     document contained every term exactly once; this prevents division by zero.
# sublinear_tf: bool. Applies sublinear tf scaling, i.e. replaces tf with 1 + log(tf).
tfv.fit(list(xtrain)+list(xvalid))
xtrain_tfv=tfv.transform(xtrain)
xvalid_tfv=tfv.transform(xvalid)
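
To see what `sublinear_tf` does in isolation, the following sketch switches off idf weighting and normalization so that only the term-frequency part remains. The one-document corpus is hypothetical and the settings differ from the vectorizer above on purpose.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical one-document corpus: "cat" occurs 3 times, "dog" once.
docs = ["cat cat cat dog"]

# Turn idf and normalization off so only the tf term remains visible.
vec = TfidfVectorizer(token_pattern=r'\w{1,}', use_idf=False, norm=None,
                      sublinear_tf=True)
weights = vec.fit_transform(docs).toarray()[0]

cat_weight = weights[vec.vocabulary_["cat"]]   # 1 + ln(3)
dog_weight = weights[vec.vocabulary_["dog"]]   # 1 + ln(1) = 1
print(cat_weight, dog_weight, 1 + math.log(3))
```

A word that appears three times counts for roughly twice as much as a word that appears once, not three times as much, which dampens the influence of repeated terms.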

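LogisticRegression() parameters:<br>
penalty: the regularization penalty, a str; the options considered here are l1 and l2, and the default is l2.<br>
C: the inverse of the regularization strength λ, a float that defaults to 1.0 and must be positive. As with SVMs, smaller values mean stronger regularization.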
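
The effect of C can be checked on synthetic data. This is a rough sketch, assuming a small random dataset from `make_classification` (not the author data): it fits the same model with three values of C and records the size of the learned coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data, only used to show the effect of C.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # The L2 norm of the weights shrinks as regularization gets stronger.
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

print(coef_norms)
```

Smaller C pulls the weights toward zero, which is why it acts as stronger regularization.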

In [10]:
clf = LogisticRegression(C=1.0)

clf.fit(xtrain_tfv, ytrain)
predictions = clf.predict_proba(xvalid_tfv)

print ("logloss of logistic regression on TF-IDF features: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss of logistic regression on TF-IDF features: 0.626 


<font color=#0099ff>2. Replacing the TF-IDF features with CountVectorizer features</font><br>

In [11]:
ctv=CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
                   ngram_range=(1,3),stop_words='english')

ctv.fit(list(xtrain)+list(xvalid))
xtrain_ctv=ctv.transform(xtrain)
xvalid_ctv=ctv.transform(xvalid)

In [12]:

clf = LogisticRegression(C=1.0)
clf.fit(xtrain_ctv, ytrain)
predictions = clf.predict_proba(xvalid_ctv)

print ("logloss of logistic regression on CountVectorizer features: %0.3f" % multiclass_logloss(yvalid, predictions))

logloss of logistic regression on CountVectorizer features: 0.528


<font color=#0099ff>3. Naive Bayes</font><br>

In [13]:
clf=MultinomialNB()
clf.fit(xtrain_tfv,ytrain)
predictions=clf.predict_proba(xvalid_tfv)

print ("logloss of naive Bayes on TF-IDF features: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss of naive Bayes on TF-IDF features: 0.578 


In [14]:

clf=MultinomialNB()
clf.fit(xtrain_ctv,ytrain)
predictions=clf.predict_proba(xvalid_ctv)

print ("logloss of naive Bayes on CountVectorizer features: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss of naive Bayes on CountVectorizer features: 0.485 


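Naive Bayes improves on logistic regression here (logloss 0.578 vs. 0.626 on TF-IDF features, 0.485 vs. 0.528 on count features). Next, try an SVM.<br>
<font color=#0099ff>4. SVM</font><br>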

In [15]:
# Apply SVD to the TF-IDF features, keeping 120 components.
svd = decomposition.TruncatedSVD(n_components=120)
svd.fit(xtrain_tfv)
xtrain_svd = svd.transform(xtrain_tfv)
xvalid_svd = svd.transform(xvalid_tfv)

# Scale the data obtained from SVD.
scl = preprocessing.StandardScaler()
scl.fit(xtrain_svd)
xtrain_svd_scl = scl.transform(xtrain_svd)
xvalid_svd_scl = scl.transform(xvalid_svd)
# The SVM training below is commented out; the printed logloss is hard-coded.
print("logloss of SVM on TF-IDF features: 0.793 ")
# clf = SVC(C=1.0, probability=True) # since we need probabilities
# clf.fit(xtrain_svd_scl, ytrain)
# predictions = clf.predict_proba(xvalid_svd_scl)

# print ("logloss of SVM on TF-IDF features: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss of SVM on TF-IDF features: 0.793 


The SVM performs poorly on this dataset


<font color=#0099ff>5. XGBoost</font><br>

In [16]:
# The XGBoost training below is commented out; the printed logloss is hard-coded.
print("logloss of XGBoost on TF-IDF features: 0.782 ")
# import xgboost as xgb
# clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
#                         subsample=0.8, nthread=10, learning_rate=0.1)
# clf.fit(xtrain_tfv.tocsc(), ytrain)
# predictions = clf.predict_proba(xvalid_tfv.tocsc())

# print ("logloss: %0.3f " % multiclass_logloss(yvalid, predictions))

logloss of XGBoost on TF-IDF features: 0.782 


XGBoost does not perform well on this dataset either

<font color=#0099ff>6. Word-vector representations (word2vec, GloVe, etc.)</font><br>

Here we use glove.840B.300d

In [47]:
# embeddings_index = {}
# f = open('glove.840B.300d.txt','rb')
# for line in tqdm(f):
#     values = line.split()
#     word = values[0]
#     coefs = np.asarray(values[1:], dtype='float32')
#     embeddings_index[word] = coefs
# f.close()

# print('Found %s word vectors.' % len(embeddings_index))
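
The loading step can be sketched without the multi-gigabyte file. This is an illustrative loader, not the cell above: the function name `load_glove`, the 3-dimensional sample and the right-to-left split are assumptions made for this example. It reads text rather than bytes, so its keys are `str`, whereas the notebook opens the file with `'rb'` and therefore looks words up with `w.encode()`.

```python
import io
import numpy as np

def load_glove(lines, dim):
    """Parse GloVe-style text lines ("token v1 ... v_dim") into a dict."""
    index = {}
    for line in lines:
        # Split from the right so a token that itself contains spaces stays intact.
        parts = line.rstrip().rsplit(' ', dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        index[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return index

# Hypothetical 3-dimensional stand-in for the real 300-dimensional file.
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.5\n. . . 1.0 2.0 3.0\n")
embeddings = load_glove(sample, dim=3)
print('Found %s word vectors.' % len(embeddings))
```

Splitting from the right is a defensive choice in case a token contains spaces, which a plain `line.split()` would misparse.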

In [41]:
# this function creates a normalized vector for the whole sentence
def sent2vec(s):
    words = str(s).lower()
    words = word_tokenize(words)
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        # embeddings_index was read in binary mode, so its keys are bytes
        vec = embeddings_index.get(w.encode())
        if vec is not None:
            M.append(vec)
    if not M:
        # no known words: fall back to a zero vector
        return np.zeros(300)
    v = np.array(M).sum(axis=0)
    norm = np.sqrt((v ** 2).sum())
    # guard against a zero-length sum before normalizing
    return v / norm if norm > 0 else v

In [42]:
xtrain_glove = [sent2vec(x) for x in tqdm(xtrain)]
xvalid_glove = [sent2vec(x) for x in tqdm(xvalid)]

100%|██████████████████████████████████| 17621/17621 [00:15<00:00, 1122.94it/s]
100%|████████████████████████████████████| 1958/1958 [00:01<00:00, 1086.51it/s]


In [45]:
xtrain_glove = np.array(xtrain_glove)
xvalid_glove = np.array(xvalid_glove)

<font color=#0099ff>7. A three-layer fully connected network</font><br>

In [48]:
# scale the data before any neural net:
scl = preprocessing.StandardScaler()
xtrain_glove_scl = scl.fit_transform(xtrain_glove)
xvalid_glove_scl = scl.transform(xvalid_glove)

In [50]:
# we need to binarize the labels for the neural net
ytrain_enc = np_utils.to_categorical(ytrain)
yvalid_enc = np_utils.to_categorical(yvalid)

In [81]:
# create a simple 3 layer sequential neural net
model1=Sequential()

model1.add(Dense(300,input_dim=300,activation='relu'))
model1.add(Dropout(0.2))
model1.add(BatchNormalization())

model1.add(Dense(300,activation='relu'))
model1.add(Dropout(0.2))
model1.add(BatchNormalization())

model1.add(Dense(3))
model1.add(Activation('softmax'))

model1.compile(loss='categorical_crossentropy',optimizer='adam')
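
The cells that follow use a `LossHistory` callback with a `loss_plot` method, but the notebook never shows its definition. The sketch below is a hypothetical reconstruction, not the author's class: it assumes the callback only records the per-epoch training and validation loss and plots them. The fallback to `object` is there only so the example can be run without Keras installed.

```python
try:
    from keras.callbacks import Callback
except ImportError:          # lets the sketch run without Keras installed
    Callback = object

class LossHistory(Callback):
    """Hypothetical reconstruction: records training and validation loss per epoch."""

    def on_train_begin(self, logs=None):
        self.losses = []
        self.val_losses = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.losses.append(logs.get('loss'))
        self.val_losses.append(logs.get('val_loss'))

    def loss_plot(self, loss_type='epoch'):
        import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
        epochs = range(1, len(self.losses) + 1)
        plt.plot(epochs, self.losses, label='train loss')
        plt.plot(epochs, self.val_losses, label='val loss')
        plt.xlabel(loss_type)
        plt.ylabel('loss')
        plt.legend()
        plt.show()

# Simulate two epochs by calling the hooks directly, as Keras would during fit().
recorder = LossHistory()
recorder.on_train_begin()
recorder.on_epoch_end(0, {'loss': 1.0, 'val_loss': 1.2})
recorder.on_epoch_end(1, {'loss': 0.8, 'val_loss': 1.1})
print(recorder.losses, recorder.val_losses)
```

An instance is passed to `fit` through `callbacks=[...]`, and `loss_plot('epoch')` is called after training.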




In [82]:
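# `history` was never created in the notebook; instantiate the custom
# LossHistory callback here (its class must be defined before this cell).
history = LossHistory()
model1.fit(xtrain_glove_scl, y=ytrain_enc, batch_size=64, 
          epochs=5, verbose=1, 
          validation_data=(xvalid_glove_scl, yvalid_enc),callbacks=[history])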

Train on 17621 samples, validate on 1958 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5


Epoch 5/5


<keras.callbacks.History at 0x1041c6908>

In [85]:
import matplotlib.pyplot as plt
# plot the training curves recorded by the callback
history.loss_plot('epoch')

<Figure size 640x480 with 1 Axes>

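The fully connected network is meant to show that deep learning can outperform the traditional machine-learning models above; the following sections refine the model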

<font color=#0099ff>8. A deep network with LSTM units</font><br>

In [56]:
#To move further, i.e. with LSTMs we need to tokenize the text data
token = text.Tokenizer(num_words=None)
max_len=70

token.fit_on_texts(list(xtrain)+list(xvalid))
xtrain_seq=token.texts_to_sequences(xtrain)
xvalid_seq=token.texts_to_sequences(xvalid)

# zero pad the sequences
xtrain_pad=sequence.pad_sequences(xtrain_seq,maxlen=max_len)
xvalid_pad=sequence.pad_sequences(xvalid_seq,maxlen=max_len)

word_index=token.word_index
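
With its default settings, `pad_sequences` pads on the left with zeros and, for sequences longer than `maxlen`, drops tokens from the front. The NumPy sketch below imitates that default behaviour on hypothetical word ids; `pad_pre` is a made-up helper, not a Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` ids of each sequence."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]              # truncate from the front
        if kept:
            out[row, -len(kept):] = kept  # right-align, zeros on the left
    return out

# Hypothetical word-id sequences of length 3, 1 and 0.
padded = pad_pre([[5, 6, 7], [8], []], maxlen=2)
print(padded)
```

Index 0 is reserved for padding, which is why the embedding matrix built later has `len(word_index) + 1` rows.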

In [72]:
# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
#     print (embeddings_index.get(word.encode()))
    embedding_vector = embeddings_index.get(word.encode())
    
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector


100%|█████████████████████████████████| 25943/25943 [00:00<00:00, 72869.46it/s]

In [73]:
embedding_matrix[:10]

array([[  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  2.72040009e-01,  -6.20299987e-02,  -1.88400000e-01, ...,
          1.30150005e-01,  -1.83170006e-01,   1.32300004e-01],
       [  6.02159984e-02,   2.17989996e-01,  -4.24900018e-02, ...,
          1.17090002e-01,  -1.66920006e-01,  -9.40850005e-02],
       ..., 
       [  8.91870037e-02,   2.57919997e-01,   2.62820005e-01, ...,
          1.44209996e-01,  -1.69000000e-01,   2.65009999e-01],
       [ -4.40579988e-02,   3.66109997e-01,   1.80319995e-01, ...,
          1.86250001e-01,  -9.78169963e-02,  -6.71040034e-05],
       [  9.85200033e-02,   2.50010014e-01,  -2.70179987e-01, ...,
         -6.26389980e-02,   2.44240001e-01,   1.77790001e-01]])

In [86]:
# A simple LSTM with glove embeddings and two dense layers
model2=Sequential()
model2.add(Embedding(len(word_index)+1,300,weights=[embedding_matrix],input_length=max_len,trainable=False))
#keras.layers.embeddings.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)
model2.add(SpatialDropout1D(0.3))
model2.add(LSTM(100,dropout=0.3,recurrent_dropout=0.3))

model2.add(Dense(1024,activation='relu'))
model2.add(Dropout(0.8))

model2.add(Dense(1024,activation='relu'))
model2.add(Dropout(0.8))

model2.add(Dense(3))
model2.add(Activation('softmax'))


model2.compile(loss='categorical_crossentropy',optimizer='adam')


In [87]:
# LossHistory is a custom Keras callback that records losses; it must be defined before this cell
history2=LossHistory()

In [88]:
model2.fit(xtrain_pad,y=ytrain_enc,batch_size=512,epochs=10,verbose=1,validation_data=
         (xvalid_pad,yvalid_enc),callbacks=[history2])

Train on 17621 samples, validate on 1958 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10

KeyboardInterrupt: 

In [90]:
# add early stopping
model3 = Sequential()
model3.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model3.add(SpatialDropout1D(0.3))
model3.add(LSTM(300, dropout=0.3, recurrent_dropout=0.3))

model3.add(Dense(1024, activation='relu'))
model3.add(Dropout(0.8))

model3.add(Dense(1024, activation='relu'))
model3.add(Dropout(0.8))

model3.add(Dense(3))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model3.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])

Train on 17621 samples, validate on 1958 samples
Epoch 1/100
Epoch 2/100
  512/17621 [..............................] - ETA: 76s - loss: 0.9735

KeyboardInterrupt: 

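With early stopping, training halts once the validation loss has failed to improve for `patience` consecutive epochs (3 here), which saves training time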
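
The patience rule can be sketched in plain Python. This is a simplified illustration of the idea behind `EarlyStopping(monitor='val_loss', patience=3)`, not Keras's implementation; the function name and the loss values are hypothetical.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value (0.90) is reached at epoch 1.
stopped_at = early_stop_epoch([1.00, 0.90, 0.95, 0.92, 0.91, 0.80])
never = early_stop_epoch([1.0, 0.9, 0.8, 0.7])
print(stopped_at, never)
```

In the first run the loss at epoch 5 would have been a new best, but training has already stopped at epoch 4 after three epochs without improvement; a larger `patience` trades training time for tolerance of such plateaus.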

<font color=#0099ff>9. A deep network with bidirectional LSTM units</font><br>

In [None]:
model4 = Sequential()
model4.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model4.add(SpatialDropout1D(0.3))
model4.add(Bidirectional(LSTM(300, dropout=0.3, recurrent_dropout=0.3)))

model4.add(Dense(1024, activation='relu'))
model4.add(Dropout(0.8))

model4.add(Dense(1024, activation='relu'))
model4.add(Dropout(0.8))

model4.add(Dense(3))
model4.add(Activation('softmax'))
model4.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model4.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])

<font color=#0099ff>10. A deep network with GRU units</font><br>

In [78]:
# GRU with glove embeddings and two dense layers
model = Sequential()
model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))
model.add(SpatialDropout1D(0.3))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
model.add(GRU(300, dropout=0.3, recurrent_dropout=0.3))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.8))

model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fit the model with early stopping callback
earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')
model.fit(xtrain_pad, y=ytrain_enc, batch_size=512, epochs=100, 
          verbose=1, validation_data=(xvalid_pad, yvalid_enc), callbacks=[earlystop])