## NLP基础方法
源kaggle kernel: https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle


**step1.** 导入相关模块和数据

**step2.**  定义multiclass_logloss——多分类对数损失函数

**step3.** 编码目标label，然后划分训练集和验证集

**step4.** 建立基础模型：TF-IDF & word count

* TF-IDF: TfidfVectorizer in sklearn.feature_extraction.text
* 用逻辑回归建立基础分类器
* Word Count: CountVectorizer in sklearn.feature_extraction.text
* 然后用逻辑回归建立基于word count特征的模型
* 用朴素贝叶斯建模
* 用SVM建模
	* 由于SVM会很慢，首先得降维：SVD + scale
	* 用SVC建模，并打印结果
* 用xgboost建模
	* 用tf-idf特征训练
	* 用word count特征训练
	* 用降维后的训练集训练

**step5. ** Grid Search
* 创建评分函数
* 创建一个管道pipeline
* 然后需要一个参数组合
* 初始化GridSearchCV模型，拟合训练

**step6.** Word2vec

In [15]:
# step1. import packages
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

In [3]:
# step1. read data
DIR = '.'
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [4]:
# step2. 定义对数损失函数
def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota

In [5]:
train.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [16]:
# step3. 编码目标值，
label_ec = preprocessing.LabelEncoder()
y = label_ec.fit_transform(train['author'].values)

In [17]:
# step3. 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(train['text'].values, y,
                                                    stratify=y,
                                                    test_size=0.1,
                                                    random_state=42)

In [18]:
print(X_train.shape)
print(y_train.shape)

(17621,)
(17621,)


In [20]:
# step4. 用TFIDF或word count 特征，建立基础模型
# Always start with these features. They work (almost) everytime!
tfv = TfidfVectorizer(min_df=3,  max_features=None, 
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=1,smooth_idf=1,sublinear_tf=1,
            stop_words = 'english')

tfv.fit(list(X_train)+list(X_train))

xtrain_tfv = tfv.transform(X_train)
xtest_tfv = tfv.transform(X_test)

In [37]:
# 逻辑回归 + tf-idf
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_tfv, y_train)
y_pred = clf.predict_proba(xtest_tfv)
print(multiclass_logloss(y_test, y_pred))

0.628676206264


In [33]:
# 词频特征
ctv = CountVectorizer(analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), stop_words = 'english')

# Fitting Count Vectorizer to both training and test sets (semi-supervised learning)
ctv.fit(list(X_train) + list(X_test))
xtrain_ctv =  ctv.transform(X_train) 
xtest_ctv = ctv.transform(X_test)

In [35]:
# 逻辑回归 + word count
clf = LogisticRegression(C=1.0)
clf.fit(xtrain_ctv, y_train)
y_pred = clf.predict_proba(xtest_ctv)
print(multiclass_logloss(y_test, y_pred))

0.528310978546


In [36]:
# 朴素贝叶斯 + tfidf
clf = MultinomialNB()
clf.fit(xtrain_tfv, y_train)
y_pred = clf.predict_proba(xtest_tfv)
print(multiclass_logloss(y_test, y_pred))

0.585469600444


In [38]:
# 朴素贝叶斯 + wc
clf = MultinomialNB()
clf.fit(xtrain_ctv, y_train)
y_pred = clf.predict_proba(xtest_ctv)
print(multiclass_logloss(y_test, y_pred))

0.485414923135
