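## Sentiment analysis on Kaggle
### Dataset: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews
### 子豪兄 NLP video: https://www.bilibili.com/video/av57066293
- Don't be intimidated by Kaggle.
- It doesn't matter whether the cat is black or white, as long as it catches mice: a simple algorithm that works is good enough.
- Data processing and feature engineering matter a lot.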

## Loading the dataset

In [1]:
import numpy as np
import pandas as pd

In [2]:
data_train = pd.read_csv('train.tsv',sep='\t') # TSV files are tab-separated, so sep='\t'
data_test = pd.read_csv('test.tsv',sep='\t')

In [3]:
data_train.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2


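## Building the corpus
- Feature engineering for text: represent each sentence as a vector.
- Common choices are the bag-of-words model, the TF-IDF model, and the word2vec model.
- The corpus is all the text from both the training set and the test set.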
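
To make "a sentence as a vector" concrete, here is a minimal bag-of-words sketch in plain Python. The three phrases are made up for illustration and are not taken from the dataset, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

# Toy corpus; these phrases are hypothetical, not from train.tsv.
corpus = ["a good movie", "a bad movie", "good good fun"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def to_count_vector(doc):
    """Return the word counts of `doc`, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc) for doc in corpus]

print(vocabulary)
for doc, vec in zip(corpus, vectors):
    print(vec, doc)
```

Each phrase becomes a vector as long as the vocabulary, which is why the corpus is built from both the training and the test text: a word seen only in the test set would otherwise have no column.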

In [4]:
train_sentences = data_train['Phrase']
test_sentences = data_test['Phrase']
sentences = pd.concat([train_sentences,test_sentences])

label = data_train['Sentiment']
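# utf-8-sig drops a leading BOM if the file has one, so the first stop word stays clean
with open('stop_words.txt',encoding='utf-8-sig') as f:
    stop_words = f.read().splitlines()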

### Feature extraction: bag-of-words model

In [5]:
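# Build the bag-of-words model with CountVectorizer from sklearn
# See 子豪兄's video for a detailed introduction to the bag-of-words model
# analyzer='word' analyzes the text word by word; for some languages it can help to analyze character by character instead, with analyzer='char'
# ngram_range also extracts runs of adjacent words, which recovers some of the word order that a plain bag-of-words model loses
# max_features keeps only the most frequent terms in the corpus, up to this many, as columns of the bag-of-words matrix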

from sklearn.feature_extraction.text import CountVectorizer
co = CountVectorizer(
    analyzer='word',
    ngram_range=(1,4), # extract every n-gram from 1 to 4 words long
    stop_words=stop_words,
    max_features=150000
)
# Fit the bag-of-words vocabulary on the whole corpus
co.fit(sentences)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=150000, min_df=1,
        ngram_range=(1, 4), preprocessor=None,
        stop_words=["\ufeffain'", 'happy', 'isn', 'ain', 'al', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'sn', 'll', 'mon', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn', "'d", "'ll", "'m", "'re", "'s", "'t", "'ve", 'ZT', 'ZZ', 'a', "a's", 'able', 'about', 'above', 'abst', 'accordance', 'accor...', ',', '·', '￥', '……', '（', '）', '——', '、', '：', '；', '“', '’', '《', '》', '，', '。', '、', '？', '★ '],
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
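
As a rough illustration of what `ngram_range=(1,4)` extracts, the sketch below lists the word n-grams of one phrase by hand. It reuses the `token_pattern` shown in the repr above and lowercases the text, but it is a simplified stand-in: it skips stop-word removal and the `max_features` cut-off, and `word_ngrams` is a hypothetical helper, not part of sklearn.

```python
import re

# Same token pattern the CountVectorizer repr reports: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, min_n=1, max_n=4):
    """Lowercase `text`, tokenize it, and list every n-gram with min_n <= n <= max_n."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the phrase.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

result = word_ngrams("A series of escapades")
print(result)
```

The single-letter token "A" is dropped by the token pattern, so the phrase yields three unigrams, two bigrams, and one trigram, and no 4-gram because only three tokens remain.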

In [6]:
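# Randomly split the original training set into a new training set and a validation set (70/30 here because test_size=0.3; the default would be 75/25); term counting follows in the next step
# Both parts come from the original training set, so both are labeled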

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(train_sentences, label, test_size=0.3, random_state=1234)
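
For readers who want to see what such a split does, here is a minimal pure-Python sketch of a seeded random hold-out split. It assumes the same `test_size=0.3` and seed `1234` as the call above, but it does not reproduce sklearn's exact shuffling or rounding, and `simple_split` is a hypothetical helper.

```python
import random

def simple_split(items, labels, test_size=0.3, seed=1234):
    """Shuffle indices with a fixed seed, then hold out the first `test_size` share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_valid = round(len(items) * test_size)
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    x_tr = [items[i] for i in train_idx]
    x_va = [items[i] for i in valid_idx]
    y_tr = [labels[i] for i in train_idx]
    y_va = [labels[i] for i in valid_idx]
    return x_tr, x_va, y_tr, y_va

# Toy data: ten made-up phrases whose label equals their position.
toy_items = [f"phrase {i}" for i in range(10)]
toy_labels = list(range(10))
x_tr, x_va, y_tr, y_va = simple_split(toy_items, toy_labels)
print(len(x_tr), len(x_va))
```

Because phrases and labels are selected with the same indices, every phrase keeps its own label after the shuffle, and the fixed seed makes the split repeatable.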

In [7]:
# Use the bag-of-words model fitted above to turn each phrase in the training and validation sets into a count vector
x_train = co.transform(x_train)
x_test = co.transform(x_test)

In [8]:
x_train[0] # a sparse matrix

<1x150000 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>
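
The output above reports only 10 stored elements in a row with 150000 columns, which is why a sparse format is used. The sketch below builds the three arrays of the Compressed Sparse Row layout by hand for a tiny made-up matrix; `to_csr` is a hypothetical helper for illustration, not the scipy implementation.

```python
def to_csr(dense_rows):
    """Convert a list of dense rows into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it sits in
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

dense = [
    [0, 0, 2, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [3, 0, 0, 0, 0, 0],
]
data, indices, indptr = to_csr(dense)
print(data)     # stored values
print(indices)  # column index of each stored value
print(indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]
```

Only the non-zero counts and their positions are kept, so memory grows with the number of n-grams a phrase actually contains rather than with the vocabulary size.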

## Logistic regression classifier

In [9]:
import warnings 
warnings.filterwarnings('ignore')

from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()
lg.fit(x_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [11]:
print('Features: bag-of-words; classifier: sklearn LogisticRegression with default settings')

print('train acc:' ,lg.score(x_train,y_train))
print('valid acc:' ,lg.score(x_test,y_test))

Features: bag-of-words; classifier: sklearn LogisticRegression with default settings
train acc: 0.7685322495011077
valid acc: 0.6412277329232347


In [12]:
# Predict on the test set
test_X = co.transform(test_sentences) # turn the test phrases into bag-of-words vectors
prediction = lg.predict(test_X) 

data_test.loc[:,'Sentiment'] = prediction
data_test.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,156061,8545,An intermittently pleasing but mostly routine ...,3
1,156062,8545,An intermittently pleasing but mostly routine ...,3
2,156063,8545,An,2
3,156064,8545,intermittently pleasing but mostly routine effort,3
4,156065,8545,intermittently pleasing but mostly routine,3


In [None]:
# Save the results
final_data = data_test.loc[:,['PhraseId','Sentiment']]
final_data.to_csv('final_data.csv',index=False)

### Naive Bayes classifier (faster)

In [13]:
# Train and evaluate a naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(x_train,y_train)
print('Features: bag-of-words; classifier: sklearn MultinomialNB with default settings')
print('train acc:' ,classifier.score(x_train,y_train))
print('valid acc:' ,classifier.score(x_test,y_test))

Features: bag-of-words; classifier: sklearn MultinomialNB with default settings
train acc: 0.7125098405375222
valid acc: 0.6088256653423897
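
To show roughly what a multinomial naive Bayes classifier computes on word counts, here is a small from-scratch sketch. It assumes add-one (Laplace) smoothing, matching `MultinomialNB`'s default `alpha=1.0`; the four phrases, the 0/1 labels, and both helper functions are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocabulary = sorted({w for doc in docs for w in doc.split()})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior, log_likelihood = {}, {}
    for label, n_docs in class_docs.items():
        log_prior[label] = math.log(n_docs / len(docs))
        # Smoothed denominator: words seen in the class plus alpha per vocabulary word.
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        log_likelihood[label] = {
            w: math.log((word_counts[label][w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label in log_prior:
        scores[label] = log_prior[label] + sum(
            log_likelihood[label][w] for w in doc.split() if w in log_likelihood[label]
        )
    return max(scores, key=scores.get)

# Hypothetical toy data: 1 = positive, 0 = negative.
docs = ["good fun movie", "good good story", "bad boring movie", "bad bad plot"]
labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict("good movie", log_prior, log_likelihood))
print(predict("boring plot", log_prior, log_likelihood))
```

Training is a single pass of counting, with no iterative optimization, which is one reason this classifier is fast on large sparse matrices.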


## Feature extraction: TF-IDF

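TF (term frequency) measures how often a word occurs in a document.<br>
IDF (inverse document frequency) measures how commonplace a word is across the corpus. Words that appear almost everywhere, such as "the", "an", and "a", get a very low IDF.<br>
For example, "China" and "kung fu" may appear together in one document, but "China" occurs in nearly every document, so its IDF is low and its TF-IDF is low too. "Kung fu" occurs only in particular documents, so it carries far more distinguishing information.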

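\begin{aligned}
TF(\text{"kung fu"})&=\frac{\text{number of times "kung fu" occurs in the current document}}{\text{total number of words in the current document}}\\
\\
\\
IDF(\text{"kung fu"})&=\ln \frac{\text{total number of documents in the corpus}}{\text{number of documents in the corpus that contain "kung fu"}}\\
\\
\\
TF\_IDF(\text{"kung fu"})&=TF(\text{"kung fu"}) \times IDF(\text{"kung fu"})
\end{aligned}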
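
A small worked example may help here. The sketch below computes TF-IDF by hand on a made-up three-document corpus, taking TF as the term's share of the words in a document and IDF as the natural log of total documents over documents containing the term. Note that sklearn's `TfidfVectorizer` uses a variant by default: raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row, so its numbers will differ from these.

```python
import math

# Hypothetical toy corpus, already tokenized.
corpus = [
    "kung fu china".split(),
    "china economy".split(),
    "china history china".split(),
]

def tf(term, doc):
    """Share of the words in `doc` that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """ln(total documents / documents containing `term`); `term` must occur somewhere."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

first = corpus[0]
china_score = tf_idf("china", first, corpus)
kung_score = tf_idf("kung", first, corpus)
print(china_score)
print(kung_score)
```

"china" appears in all three documents, so its IDF is ln(1) = 0 and its TF-IDF vanishes, while "kung" appears in only one document and scores (1/3) * ln(3).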

In [19]:
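# Build the TF-IDF model with TfidfVectorizer from sklearn
# See 子豪兄's video for a detailed introduction to the TF-IDF model
# analyzer='word' analyzes the text word by word; for some languages it can help to analyze character by character instead, with analyzer='char'
# ngram_range also extracts runs of adjacent words, which recovers some of the word order that a plain bag-of-words model loses
# max_features keeps only the most frequent terms in the corpus, up to this many, as columns of the feature matrix

# TF-IDF already down-weights words that appear everywhere, so no stop_words list is passed here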

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1,4),
    # stop_words=stop_words,
    max_features=150000
)

In [20]:
tf_idf.fit(sentences)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=150000, min_df=1,
        ngram_range=(1, 4), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [29]:
# Split into training and validation sets (default 75/25 split), then convert both to TF-IDF features
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(train_sentences,label,random_state=1234)

x_train = tf_idf.transform(x_train)
x_test = tf_idf.transform(x_test)

In [34]:
classifier = MultinomialNB()
classifier.fit(x_train,y_train)
print('Features: TF-IDF; classifier: sklearn MultinomialNB with default settings')
print('train acc:' ,classifier.score(x_train,y_train))
print('valid acc:' ,classifier.score(x_test,y_test))


Features: TF-IDF; classifier: sklearn MultinomialNB with default settings
train acc: 0.6708872655816139
valid acc: 0.6045367166474432


In [35]:
lg = LogisticRegression()
lg.fit(x_train,y_train)
print('Features: TF-IDF; classifier: sklearn LogisticRegression with default settings')
print('train acc:' ,lg.score(x_train,y_train))
print('valid acc:' ,lg.score(x_test,y_test))

Features: TF-IDF; classifier: sklearn LogisticRegression with default settings
train acc: 0.7058481780511769
valid acc: 0.6326541073945918


### Tuning the logistic regression model

In [38]:
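# C: inverse of the regularization strength; the smaller C is, the stronger the regularization
# dual: solve the dual formulation of the problem (only supported by the liblinear solver with the l2 penalty)
# solver='liblinear' is set explicitly; it was the default in the sklearn version that produced this output
lg2 = LogisticRegression(C=3,dual=True,solver='liblinear')
lg2.fit(x_train,y_train)
print('Features: TF-IDF; classifier: sklearn LogisticRegression with tuned parameters')
print('train acc:' ,lg2.score(x_train,y_train))
print('valid acc:' ,lg2.score(x_test,y_test))


Features: TF-IDF; classifier: sklearn LogisticRegression with tuned parameters
train acc: 0.7719595027553505
valid acc: 0.6533384595668332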
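
Why a smaller C means stronger regularization can be read off the L2-penalized objective that scikit-learn documents for binary logistic regression. The formula below is for labels $y_i \in \{-1, 1\}$, with weights $w$ and intercept $c$; for the five-class problem here, liblinear fits one such binary problem per class (one-vs-rest).

```latex
\min_{w,\, c} \; \frac{1}{2} w^{T} w \;+\; C \sum_{i=1}^{n} \log\!\left( \exp\!\left( -y_i \left( x_i^{T} w + c \right) \right) + 1 \right)
```

C multiplies the data-fit term, so a large C lets the model chase the training data, while a small C lets the penalty $\frac{1}{2} w^{T} w$ dominate and shrinks the weights. This is consistent with the higher training accuracy seen at C=3 compared with the default C=1.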


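### Parameter search for the logistic regression model
C and dual can be tuned with a grid search. The search space is C from 1 to 9, and for each C both dual=True and dual=False are tried.<br>
GridSearchCV then selects the combination with the highest cross-validated accuracy on the data passed to fit.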
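
The mechanics of a grid search can be sketched in a few lines of plain Python: enumerate every parameter combination, score each one, keep the best. The grid matches the one described above, but `toy_score` is a made-up stand-in that models nothing; a real search would score each candidate by cross-validated accuracy, and `pick_best` is a hypothetical helper.

```python
from itertools import product

param_grid = {"C": range(1, 10), "dual": [True, False]}

# Every (C, dual) pair the search has to try.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
print(len(candidates))

def pick_best(candidates, score_fn):
    """Return the candidate with the highest score."""
    return max(candidates, key=score_fn)

def toy_score(params):
    # Hypothetical score with an arbitrary peak; stands in for cross-validated accuracy.
    return -(params["C"] - 4) ** 2 - (1 if params["dual"] else 0)

best = pick_best(candidates, toy_score)
print(best)
```

Nine values of C times two values of dual gives 18 candidates, and each one is refit once per cross-validation fold, so the cost of the search grows quickly with the size of the grid.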

In [37]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C':range(1,10),
              'dual':[True,False]
             }
lg3 = LogisticRegression(solver='liblinear') # dual=True is only supported by the liblinear solver
grid = GridSearchCV(lg3,param_grid=param_grid)
grid.fit(x_train,y_train)

GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': range(1, 10), 'dual': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [39]:
grid.best_params_

{'C': 5, 'dual': True}

In [40]:
lg_final = grid.best_estimator_

In [41]:
print('Accuracy with the best parameters found by grid search:')
print('train acc:' ,lg_final.score(x_train,y_train))
print('valid acc:' ,lg_final.score(x_test,y_test))

Accuracy with the best parameters found by grid search:
train acc: 0.7973172711350336
valid acc: 0.6546456491093169
