# Task 3: Text Classification with Machine Learning

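Because the full dataset is too large to work with locally, only a sample (the first 10,000 training rows) is loaded for these experiments, and each model is scored by 3-fold cross-validation with macro-F1.

Conclusions:
- 1) Raw word counts perform poorly (macro-F1 of about 0.70).
- 2) TF-IDF is a large improvement over word counts (about 0.87), although that run also adds 2-grams and 3-grams.
- 3) Adding the log of the text length as a feature helps only marginally (0.8678 to 0.8688), which is not significant.
- 4) With default parameters, the tree-based models (random forest at about 0.77, GBDT at about 0.79) seem to do worse than the ridge classifier in this setting. More models have not been tried yet, so this remains to be verified.
- 5) Increasing the number of terms kept by TF-IDF can improve the model.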

In [67]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
import math
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

## 1. Word Counts

In [56]:
train_df = pd.read_csv('train_set.csv', sep='\t', nrows=10000)

In [57]:
# Bag-of-words counts, keeping the 3000 most frequent terms
vectorizer = CountVectorizer(max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])
clf = RidgeClassifier()
X = train_test
y = train_df['label']
# Mean macro-F1 over 3 cross-validation folds
print(cross_val_score(clf, X, y, cv=3, scoring='f1_macro').mean())

0.7018818268874215
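One detail worth checking with count features: the default `token_pattern` of `CountVectorizer` only keeps tokens of two or more word characters. The sketch below uses a made-up two-document corpus (an assumption, not data from `train_set.csv`) to show how single-character tokens are dropped by default and how a wider pattern keeps them. Whether this matters depends on what the tokens in the `text` column look like.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical space-separated tokens, for illustration only
docs = ["1 22 333 22", "22 4 333"]

# Default pattern r"(?u)\b\w\w+\b" ignores the single-character tokens "1" and "4"
default_vec = CountVectorizer()
default_counts = default_vec.fit_transform(docs)
default_vocab = sorted(default_vec.vocabulary_)

# A wider pattern keeps every token, including single characters
keep_all_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
keep_all_counts = keep_all_vec.fit_transform(docs)
keep_all_vocab = sorted(keep_all_vec.vocabulary_)

print(default_vocab, default_counts.toarray().tolist())
print(keep_all_vocab, keep_all_counts.toarray().tolist())
```

The same `token_pattern` argument is accepted by `TfidfVectorizer`.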


## 2. TF-IDF

In [62]:
# TF-IDF weights over 1- to 3-grams, keeping the 3000 most frequent terms
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=3000)
clf = RidgeClassifier()
train_test = tfidf.fit_transform(train_df['text'])
X = train_test
y = train_df['label']
print(cross_val_score(clf, X, y, cv=3, scoring='f1_macro').mean())

0.8678348836244041
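The TF-IDF run above changes two things at once relative to the word-count run: the weighting and the n-gram range. It also fits the vectorizer on all rows before cross-validation, so the vocabulary and IDF weights have seen the validation folds. A minimal sketch of a more controlled comparison follows, assuming a hypothetical helper `compare_vectorizers` and a tiny synthetic corpus that only demonstrates the call. Wrapping the vectorizer and classifier in a pipeline makes `cross_val_score` refit the vectorizer inside each fold, and the `max_features_grid` argument allows the vocabulary size to be varied as well.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def compare_vectorizers(texts, labels, max_features_grid=(1000, 3000),
                        ngram_range=(1, 1), cv=3):
    """Return {(name, max_features): mean macro-F1} with identical settings."""
    scores = {}
    for name, vec_cls in (("count", CountVectorizer), ("tfidf", TfidfVectorizer)):
        for max_features in max_features_grid:
            # The pipeline refits the vectorizer on the training part of each fold
            model = make_pipeline(
                vec_cls(ngram_range=ngram_range, max_features=max_features),
                RidgeClassifier(),
            )
            scores[(name, max_features)] = cross_val_score(
                model, texts, labels, cv=cv, scoring="f1_macro"
            ).mean()
    return scores


# Synthetic data: the scores it produces say nothing about the real task
toy_texts = ["10 11 12", "10 12 13", "11 12 13",
             "20 21 22", "20 22 23", "21 22 23"] * 2
toy_labels = [0, 0, 0, 1, 1, 1] * 2
toy_scores = compare_vectorizers(toy_texts, toy_labels, max_features_grid=(3, 6))
for key, value in sorted(toy_scores.items()):
    print(key, round(value, 3))
```

On the real sample the call would look like `compare_vectorizers(train_df['text'], train_df['label'])`, with the caveat that each grid point refits both the vectorizer and the classifier three times.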


In [63]:
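# Extra feature: log of the number of space-separated tokens in each text
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
train_df['log_text'] = np.log(train_df['text_len'])
X = pd.concat([pd.DataFrame(train_test.toarray()), train_df['log_text']], axis=1)
# scikit-learn >= 1.2 rejects mixed int/str column names, so make them all strings
X.columns = X.columns.astype(str)
y = train_df['label']
print(cross_val_score(clf, X, y, cv=3, scoring='f1_macro').mean())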

0.8687687338227859
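Converting the TF-IDF matrix with `toarray()` materializes every zero, which becomes costly as the number of rows or terms grows. As an alternative sketch, the length feature can be appended while staying sparse by using `scipy.sparse.hstack`. The three texts below are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical space-separated token texts
texts = ["10 11 12 13", "20 21", "10 20 30 40 50 60"]
tfidf_matrix = TfidfVectorizer().fit_transform(texts)

# Log token count per document, shaped as a single column
token_counts = np.array([len(t.split(" ")) for t in texts])
log_len = np.log(token_counts).reshape(-1, 1)

# Stack the extra column to the right of the TF-IDF columns, keeping CSR format
X_sparse = hstack([tfidf_matrix, csr_matrix(log_len)]).tocsr()
print(X_sparse.shape)  # (3, 11): 10 TF-IDF terms plus the length column
```

`RidgeClassifier` accepts this sparse matrix directly, so no dense copy is needed.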


In [64]:
clf = RandomForestClassifier()
train_test = tfidf.fit_transform(train_df['text'])
X = train_test
y = train_df['label']
print(cross_val_score(clf, X, y, cv=3, scoring='f1_macro').mean() )

0.7703247992300352


In [68]:
clf = GradientBoostingClassifier()
train_test = tfidf.fit_transform(train_df['text'])
X = train_test
y = train_df['label']
print(cross_val_score(clf, X, y, cv=3, scoring='f1_macro').mean() )

0.7889408927102632
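Random forests and gradient boosting are stochastic, so their scores shift from run to run unless `random_state` is fixed. The sketch below scores several candidates on the same features in one loop, assuming a hypothetical helper `compare_models`. It also includes `LogisticRegression`, which is imported above but not evaluated. The hyperparameters and the synthetic corpus are illustrative assumptions, not tuned settings or real data.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import cross_val_score


def compare_models(X, y, models, cv=3):
    """Return {model name: mean macro-F1} on the same features and folds."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
        for name, model in models.items()
    }


candidate_models = {
    "ridge": RidgeClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
    # Fixed seeds make the tree-based scores reproducible across runs
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gbdt": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

# Synthetic data: the scores it produces say nothing about the real task
toy_texts = ["10 11 12", "10 12 13", "11 12 13",
             "20 21 22", "20 22 23", "21 22 23"] * 2
toy_labels = [0, 0, 0, 1, 1, 1] * 2
toy_X = TfidfVectorizer().fit_transform(toy_texts)

model_scores = compare_models(toy_X, toy_labels, candidate_models)
for name, score in model_scores.items():
    print(name, round(score, 3))
```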


## 3. Output Predictions

In [49]:
test_df = pd.read_csv('test_a.csv', sep='\t',nrows=1000)

In [50]:
test_trans = tfidf.transform(test_df['text'])

In [None]:
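# cross_val_score only fits clones of clf, so fit clf itself on the training sample first
clf.fit(train_test, train_df['label'])
val_pred = clf.predict(test_trans)
pd.DataFrame(val_pred, columns=['label']).to_csv('/home/tianchi/myspace/result.csv', index=False)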
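Note that `cross_val_score` works on clones of the estimator it is given, so the original object is still unfitted afterwards and `predict` needs an explicit `fit` first. A minimal end-to-end sketch follows, using made-up frames in place of `train_set.csv` and `test_a.csv` and a `RidgeClassifier` as the final model (an assumption based on it scoring best above).

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins for the real train and test files
toy_train = pd.DataFrame({
    "text": ["10 11 12", "10 12 13", "11 12 13",
             "20 21 22", "20 22 23", "21 22 23"] * 2,
    "label": [0, 0, 0, 1, 1, 1] * 2,
})
toy_test = pd.DataFrame({"text": ["10 11 13", "21 22 23"]})

vec = TfidfVectorizer()
X_train = vec.fit_transform(toy_train["text"])
model = RidgeClassifier()

# Scoring fits clones only; the model object itself has no coefficients yet
cross_val_score(model, X_train, toy_train["label"], cv=3, scoring="f1_macro")
fitted_after_cv = hasattr(model, "coef_")

# Fit on all training rows, then reuse the fitted vectorizer on the test texts
model.fit(X_train, toy_train["label"])
pred = model.predict(vec.transform(toy_test["text"]))

submission = pd.DataFrame({"label": pred})
csv_text = submission.to_csv(index=False)  # pass a path instead to write a file
print(fitted_after_cv)
print(csv_text)
```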