## Data Source
Data source: http://t.cn/Ezyacne

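The data consists of two CSV files:

>train_set.csv: used to train the model; each row corresponds to one article. The articles have been anonymized at both the character level and the word level. There are four columns:
The first column is the article index (id); the second is the article body at the character level, i.e. the text as separated character IDs (article); the third is the word-level representation, i.e. the text as separated word IDs (word_seg); the fourth is the article's label (class).
Note: each number stands for one character, one word, or one punctuation mark. Character IDs and word IDs are numbered independently!

>test_set.csv: used for testing. The format is the same as train_set.csv, but without the class column.
Note: article ids in test_set and train_set are numbered independently.

Friendly reminder: do not try to open these files in Excel! Because a single article is very long, Excel may fail to read an entire row.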
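
To make the column layout concrete, here is a minimal sketch that parses a tiny in-memory sample with the same four columns. The token IDs in the sample are invented for illustration and are not taken from the real files.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as train_set.csv;
# the IDs below are invented for illustration only.
sample_csv = io.StringIO(
    "id,article,word_seg,class\n"
    "0,7 12 7 30,101 205,14\n"
    "1,5 7 9,300 101 412,3\n"
)
sample = pd.read_csv(sample_csv)

# Each text column is a whitespace-separated string of anonymized token IDs.
char_tokens = sample["article"].str.split()
word_tokens = sample["word_seg"].str.split()

print(char_tokens[0])  # ['7', '12', '7', '30']
print(word_tokens[1])  # ['300', '101', '412']
```

Because the character and word IDs are numbered independently, the same number in `article` and `word_seg` should not be treated as the same token.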

## Loading the Data

In [1]:
import pandas as pd

# datas = pd.read_csv('./Data/new_data/train_set.csv')  # the file is large and exhausts memory: CParserError: Error tokenizing data. C error: out of memory

datas = pd.read_csv('./Data/new_data/train_set.csv', nrows=10000)  # read only the first 10,000 rows
datas.head()

Unnamed: 0,id,article,word_seg,class
0,0,7368 1252069 365865 755561 1044285 129532 1053...,816903 597526 520477 1179558 1033823 758724 63...,14
1,1,581131 165432 7368 957317 1197553 570900 33659...,90540 816903 441039 816903 569138 816903 10343...,3
2,2,7368 87936 40494 490286 856005 641588 145611 1...,816903 1012629 957974 1033823 328210 947200 65...,12
3,3,299237 760651 299237 887082 159592 556634 7489...,563568 1239563 680125 780219 782805 1033823 19...,13
4,4,7368 7368 7368 865510 7368 396966 995243 37685...,816903 816903 816903 139132 816903 312320 1103...,12


In [2]:
datas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
id          10000 non-null int64
article     10000 non-null object
word_seg    10000 non-null object
class       10000 non-null int64
dtypes: int64(2), object(2)
memory usage: 312.6+ KB
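
Reading the whole file at once ran out of memory, and `nrows` sidesteps that by using only a subset. Another option is the `chunksize` parameter of `pd.read_csv`, which yields the file piece by piece. The sketch below uses a small in-memory stand-in with invented rows so that it runs on its own; with the real data the path to train_set.csv would be passed instead.

```python
import io
from collections import Counter
import pandas as pd

# Stand-in for './Data/new_data/train_set.csv'; the rows are invented.
sample_csv = io.StringIO(
    "id,article,word_seg,class\n"
    "0,7 12 7,101 205,14\n"
    "1,5 7 9,300 101,3\n"
    "2,8 8 1,205 412,14\n"
    "3,2 4 6,77 101,12\n"
    "4,9 9 9,300 300,3\n"
)

class_counts = Counter()
n_rows = 0
# chunksize makes read_csv yield DataFrames of at most that many rows,
# so only one chunk is held in memory at a time. usecols skips the
# character-level column, which is not needed for this count.
for chunk in pd.read_csv(sample_csv, chunksize=2,
                         usecols=["id", "word_seg", "class"]):
    n_rows += len(chunk)
    class_counts.update(chunk["class"].tolist())

print(n_rows)              # 5
print(dict(class_counts))  # {14: 2, 3: 2, 12: 1}
```

A chunk size of 2 is only for the toy sample; a real run would use a much larger value that still fits in memory.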


## TF-IDF

In [3]:
datas['article'].head()

0    7368 1252069 365865 755561 1044285 129532 1053...
1    581131 165432 7368 957317 1197553 570900 33659...
2    7368 87936 40494 490286 856005 641588 145611 1...
3    299237 760651 299237 887082 159592 556634 7489...
4    7368 7368 7368 865510 7368 396966 995243 37685...
Name: article, dtype: object

In [4]:
datas[['article']].head()

Unnamed: 0,article
0,7368 1252069 365865 755561 1044285 129532 1053...
1,581131 165432 7368 957317 1197553 570900 33659...
2,7368 87936 40494 490286 856005 641588 145611 1...
3,299237 760651 299237 887082 159592 556634 7489...
4,7368 7368 7368 865510 7368 396966 995243 37685...


In [5]:
# word_seg
from sklearn.feature_extraction.text import CountVectorizer
count_vect=CountVectorizer()
word_seg_counts=count_vect.fit_transform(datas['word_seg']) # term-count matrix
word_seg_counts.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [6]:
# tf-idf
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer=TfidfTransformer()
word_seg_tfidf=tfidf_transformer.fit_transform(word_seg_counts)

word_seg_tfidf.toarray()

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
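
The two cells above compute TF-IDF in two steps (`CountVectorizer`, then `TfidfTransformer`). As a small check on an invented toy corpus, the sketch below compares that route with the one-step `TfidfVectorizer`, both with default settings, and inspects the result while it is still sparse.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Invented toy corpus of space-separated token IDs.
docs = ["101 205 101 300", "205 412 412", "101 300 77 300"]

# Two-step route: raw counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with default settings.
one_step = TfidfVectorizer().fit_transform(docs)

# Both results are SciPy sparse matrices; shape and nnz can be read
# without converting to a dense array.
print(two_step.shape, two_step.nnz)
print(np.allclose(two_step.toarray(), one_step.toarray()))
```

Calling `.toarray()` is fine for a toy corpus, but on a matrix with hundreds of thousands of columns the dense copy can be very large, so keeping the sparse form is usually the safer choice. Note also that the default `token_pattern` only keeps tokens of two or more characters, so single-digit IDs would be dropped.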

## word2vec

Reference: https://github.com/Heitao5200/DGB/blob/master/feature/feature_code/train_word2vec.py

In [8]:
def transfer(word_seg):
    return word_seg.split()

word_seg_list=list(datas['word_seg'].apply(transfer))

In [14]:
word_seg_list[1]  # tokens of the second article

['90540',
 '816903',
 '441039',
 '816903',
 '569138',
 '816903',
 '1034376',
 '997489',
 '382714',
 '1236006',
 '520477',
 '132364',
 '1207796',
 '89613',
 '1033299',
 '834740',
 '133940',
 '816903',
 '848000',
 '386932',
 '520477',
 '31046',
 '766772',
 '748483',
 '481786',
 '701424',
 '522428',
 '764656',
 '572782',
 '266161',
 '432549',
 '457828',
 '968083',
 '397939',
 '323159',
 '1033823',
 '925313',
 '520477',
 '848000',
 '220238',
 '701424',
 '500399',
 '133940',
 '816903',
 '1256303',
 '520477',
 '1243427',
 '520477',
 '487094',
 '993110',
 '477703',
 '768219',
 '133940',
 '1034376',
 '140644',
 '98991',
 '1207796',
 '569876',
 '1130139',
 '1033823',
 '257190',
 '878073',
 '585102',
 '1105940',
 '985047',
 '834740',
 '520477',
 '506606',
 '266784',
 '781202',
 '566120',
 '768219',
 '133940',
 '816903',
 '31046',
 '422170',
 '441513',
 '703615',
 '847492',
 '520477',
 '300241',
 '54111',
 '1278725',
 '173393',
 '652252',
 '133940',
 '1014945',
 '1231069',
 '1046814',
 '1033823',

In [9]:
import gensim
# Parameter names for gensim >= 4.0; gensim < 4.0 used size= and iter= instead
model = gensim.models.Word2Vec(word_seg_list, vector_size=100, window=5, min_count=5, workers=8, sg=0, epochs=5)
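
The notebook trains word vectors but does not show how they could become per-article features. One common option, sketched here as an assumption rather than as the author's method, is to average the vectors of the tokens in an article. The helper `average_vector` and the 3-dimensional toy embeddings are hypothetical; a plain dict stands in for the trained model so the example runs without gensim.

```python
import numpy as np

def average_vector(tokens, lookup, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    vectors = [lookup[t] for t in tokens if t in lookup]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)

# Hypothetical 3-dimensional embeddings standing in for a trained model.
# Any mapping that supports `in` and `[]` works as `lookup`.
toy_lookup = {
    "816903": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "520477": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

# "999" is not in the lookup, so it is skipped.
doc_vec = average_vector(["816903", "520477", "999"], toy_lookup, dim=3)
print(doc_vec)  # [2. 1. 1.]
```

With gensim, the trained model's `wv` attribute supports the same `in` and `[]` operations and could presumably be passed as `lookup`, with `dim` set to the vector size used in training.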


In [10]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

train_datas,test_datas = train_test_split(datas, test_size=0.3, random_state=2019)

train_datas.shape, test_datas.shape



((7000, 4), (3000, 4))

In [12]:
train_datas.head()

Unnamed: 0,id,article,word_seg,class
9158,9158,415555 1269463 1044285 1219589 62534 942897 11...,362928 520477 1025743 907424 669476 599826 990...,13
9064,9064,360364 100053 1215246 1077049 1222182 1220011 ...,747180 1279476 1107357 1033823 327773 520477 5...,13
5053,5053,506019 647476 37174 524569 7368 1220011 206394...,376878 77803 816903 1132917 47925 816903 56122...,13
3282,3282,755561 345037 79747 394856 518106 700959 11079...,769051 1226448 1119114 1020809 816903 816903 8...,15
5824,5824,7368 1120647 360394 79747 155248 910763 112265...,816903 266069 1226448 187531 1083859 1021911 1...,19


In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
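model=TfidfVectorizer()  # note: rebinds `model`, replacing the Word2Vec model above
model.fit(datas["word_seg"])  # fit on all 10,000 rows, i.e. before the train/test split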

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [40]:
train_datas,test_datas = train_test_split(datas, test_size=0.3, random_state=2019)
train_datas.shape,test_datas.shape

((7000, 4), (3000, 4))

In [42]:
X_train=model.transform(train_datas["word_seg"])
X_test=model.transform(test_datas["word_seg"])
Y_train=train_datas["class"]

In [43]:
X_train.shape, X_test.shape,Y_train.shape

((7000, 251069), (3000, 251069), (7000,))
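
The vectorizer above was fit on all 10,000 rows, so its vocabulary and IDF weights also reflect the rows that later end up in the test split. A minimal sketch of the alternative, fitting on the training rows only, is shown below on an invented toy frame that reuses the notebook's column names.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented toy frame with the same column names as the notebook's `datas`.
toy = pd.DataFrame({
    "word_seg": ["101 205 101", "205 412", "101 300 77", "300 300 412",
                 "77 101 205", "412 412 300", "205 77", "101 101 300",
                 "300 205 412", "77 77 101"],
    "class": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
})

train_df, test_df = train_test_split(toy, test_size=0.3, random_state=2019)

# Fit the vocabulary and IDF weights on the training rows only,
# then reuse them unchanged for the test rows.
vectorizer = TfidfVectorizer()
X_train_toy = vectorizer.fit_transform(train_df["word_seg"])
X_test_toy = vectorizer.transform(test_df["word_seg"])

print(X_train_toy.shape[0], X_test_toy.shape[0])    # 7 3
print(X_train_toy.shape[1] == X_test_toy.shape[1])  # True
```

Tokens that appear only in the test rows are simply ignored by `transform`, which mirrors what would happen with genuinely unseen articles.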

In [45]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
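Y_pred1 = logreg.predict(X_test)  # predictions for the held-out split (not scored in this cell)
# Accuracy on the training set, in percent; this is not a held-out estimate
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)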
acc_log

83.459999999999994
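
The figure above is measured on the same rows the model was trained on. A held-out estimate needs the true labels of the test split, which the cell does not extract. The sketch below shows one way to score on unseen rows, using invented and deliberately easy toy data, so its numbers say nothing about the real dataset. Macro F1 is included only as one additional metric option.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Invented toy data: class 0 documents draw from tokens 10x,
# class 1 documents draw from tokens 20x.
rng = np.random.RandomState(2019)
vocab = {0: ["101", "102", "103"], 1: ["201", "202", "203"]}
labels = np.array([0, 1] * 20)
docs = [" ".join(rng.choice(vocab[y], size=5)) for y in labels]

docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.3, random_state=2019, stratify=labels
)

vectorizer = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vectorizer.fit_transform(docs_train), y_train)

# Score on rows the model has not seen during fitting.
y_pred = clf.predict(vectorizer.transform(docs_test))
train_acc = clf.score(vectorizer.transform(docs_train), y_train)
test_acc = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred, average="macro")

print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
print("test macro F1: ", test_f1)
```

In the notebook's own variables, the equivalent step would be to take `test_datas["class"]` as the true labels and compare them with `Y_pred1`.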