# 描述：

    报社等相关的机构，往往会遇到一个问题，就是别的机构使用他们的文章但没有标明来源，这种属于抄袭的现象。在本次任务中，我们将解决新华社的文章被抄袭引用的问题。
    在给定的数据集合中，存在一些新闻语料来自新华社，但是其来源并不是新华社，请设计技巧学习模型解决该问题（判别该文章是不是抄袭新华社的）。
    
    参考资料：
    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import jieba
import re
import warnings
import time
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib
warnings.filterwarnings('ignore')

Using matplotlib backend: MacOSX


## 数据分析

In [3]:
# 导入数据
filename = '../data_sets/sqlResult_1558435.csv'
all_data = pd.read_csv(filename, encoding='gb18030')
all_data.head()

Unnamed: 0,id,author,source,content,feature,title,url
0,89617,,快科技@http://www.kkj.cn/,此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...,"{""type"":""科技"",""site"":""cnbeta"",""commentNum"":""37""...",小米MIUI 9首批机型曝光：共计15款,http://www.cnbeta.com/articles/tech/623597.htm
1,89616,,快科技@http://www.kkj.cn/,骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...,"{""type"":""科技"",""site"":""cnbeta"",""commentNum"":""15""...",骁龙835在Windows 10上的性能表现有望改善,http://www.cnbeta.com/articles/tech/623599.htm
2,89615,,快科技@http://www.kkj.cn/,此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\r\n...,"{""type"":""科技"",""site"":""cnbeta"",""commentNum"":""18""...",一加手机5细节曝光：3300mAh、充半小时用1天,http://www.cnbeta.com/articles/tech/623601.htm
3,89614,,新华社,这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\r\n,"{""type"":""国际新闻"",""site"":""环球"",""commentNum"":""0"",""j...",葡森林火灾造成至少62人死亡 政府宣布进入紧急状态（组图）,http://world.huanqiu.com/hot/2017-06/10866126....
4,89613,胡淑丽_MN7479,深圳大件事,（原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\r\n@深圳交警微博称：昨日清...,"{""type"":""新闻"",""site"":""网易热门"",""commentNum"":""978"",...",44岁女子约网友被拒暴雨中裸奔 交警为其披衣相随,http://news.163.com/17/0618/00/CN617P3Q0001875...


In [4]:
# 查看 table_labels
print(all_data.columns.values.tolist())

['id', 'author', 'source', 'content', 'feature', 'title', 'url']


In [5]:
# 查看整体数据的情况
print(all_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89611 entries, 0 to 89610
Data columns (total 7 columns):
id         89611 non-null int64
author     79396 non-null object
source     89609 non-null object
content    87054 non-null object
feature    89611 non-null object
title      89577 non-null object
url        87144 non-null object
dtypes: int64(1), object(6)
memory usage: 4.8+ MB
None


In [6]:
# 计算新华社文章所占的比例
xinhua_news = all_data[all_data['source'] == '新华社']
len(xinhua_news), len(xinhua_news)/len(all_data)

(78661, 0.8778051801676133)

## 数据预处理

In [7]:
all_data = all_data.dropna(subset=['content'])

In [8]:
all_data.shape

(87054, 7)

In [9]:
# 添加一列 label，来源于新华社的为1，其他为0
all_data['label'] = all_data['source'].apply(lambda x: 1 if x == '新华社' else 0)

In [10]:
print(all_data.columns.values.tolist())

['id', 'author', 'source', 'content', 'feature', 'title', 'url', 'label']


In [11]:
data = all_data[['content', 'label']]

In [12]:
data.head()

Unnamed: 0,content,label
0,此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...,0
1,骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...,0
2,此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\r\n...,0
3,这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\r\n,1
4,（原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\r\n@深圳交警微博称：昨日清...,0


In [13]:
len(data[data['label'] == 0])

8393

In [14]:
data_0 = data[data['label'] == 0]
data_1 = data[data['label'] == 1].sample(n=len(data_0),replace=False, random_state=None)

In [15]:
data_0.shape, data_1.shape

((8393, 2), (8393, 2))

In [16]:
df = data_0.merge(data_1, how='outer')

In [17]:
df.shape

(16786, 2)

In [18]:
df.index = range(len(df))

In [19]:
df.head()

Unnamed: 0,content,label
0,此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...,0
1,骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...,0
2,此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\r\n...,0
3,（原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\r\n@深圳交警微博称：昨日清...,0
4,受到A股被纳入MSCI指数的利好消息刺激，A股市场从周三开始再度上演龙马行情，周四上午金...,0


In [20]:
texts = df['content'] # 内容
label = df['label'] # 标签

## 文本向量化（使用tf_idf）

In [21]:
from tqdm import tqdm # 用于显示进度条的

In [22]:
def run_time(f):
    '''@param f is a function'''
    def warp(n):
        s = time.clock()
        result = f(n)
        e = time.clock()
        print('运行时间为：{}秒'.format(e-s))
        return result
    return warp

In [23]:
# 切词
@run_time
def get_contents(texts):
    contents = []
    for text in tqdm(texts):
        sentence = ''.join(re.findall(r'\w+', text))
        contents.append(' '.join(jieba.cut(sentence)))
    return contents

In [24]:
contents = get_contents(texts)

  0%|          | 0/16786 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/dp/9d1tlc2926x2djj4c05f3325ps5m7_/T/jieba.cache
Loading model cost 0.818 seconds.
Prefix dict has been built succesfully.
100%|██████████| 16786/16786 [00:48<00:00, 342.94it/s]

运行时间为：49.046831000000005秒





In [25]:
contents[0:1]

['此外 自 本周 6 月 12 日起 除 小米 手机 6 等 15 款 机型 外 其余 机型 已 暂停 更新 发布 含 开发 版 体验版 内测 稳定版 暂不受 影响 以 确保 工程师 可以 集中 全部 精力 进行 系统优化 工作 有人 猜测 这 也 是 将 精力 主要 用到 MIUI9 的 研发 之中 MIUI8 去年 5 月 发布 距今已有 一年 有余 也 是 时候 更新换代 了 当然 关于 MIUI9 的 确切 信息 我们 还是 等待 官方消息']

In [26]:
vectorizer = TfidfVectorizer(max_features=500)
vectors = vectorizer.fit_transform(contents)

In [27]:
X = vectors.toarray()
y = label.tolist()

In [28]:
X.shape, len(y)

((16786, 500), 16786)

## 构建不同的机器学习模型进行训练

In [29]:
# 构建模型
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

from sklearn.model_selection import GridSearchCV

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [30]:
# 拆分数据集
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=50, test_size=0.15) # 拆成训练和测试集
x_train, x_valid, y_train, y_valid = train_test_split(X_train, Y_train, random_state=50, test_size=0.15) # 拆成训练和验证集

In [31]:
X_train.shape, X_test.shape, x_valid.shape

((14268, 500), (2518, 500), (2141, 500))

### KNN

In [32]:
knn = KNeighborsClassifier()
parameters = {'n_neighbors': [i for i in range(1,6)]}
gcv = GridSearchCV(knn, parameters, scoring='roc_auc', n_jobs=4)
gcv.fit(x_train, y_train)

GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [33]:
gcv.best_params_

{'n_neighbors': 4}

In [34]:
model_info = "The model is {}, the parameters are {}."
result_info = '''
Test result: 
score = {}
ps = {}
rs = {}
f1 = {}
ras = {}
'''

In [35]:
knn_best = KNeighborsClassifier(n_neighbors = 4)
knn_best.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=4, p=2,
           weights='uniform')

In [36]:
score = knn_best.score(x_valid, y_valid)
y_pred = knn_best.predict(x_valid)
y_pred_prob = knn_best.predict_proba(x_valid)
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("KNN", "(n_neighbors = 4)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is KNN, the parameters are (n_neighbors = 4).

Test result: 
score = 0.8169079869219991
ps = 0.9538258575197889
rs = 0.6694444444444444
f1 = 0.7867247007616974
ras = 0.8800262680210844



### LogisticRegression

In [37]:
lr_classifier = LogisticRegression()
lr_classifier.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [38]:
score = lr_classifier.score(x_valid, y_valid)
y_pred = lr_classifier.predict(x_valid)
y_pred_prob = lr_classifier.predict_proba(x_valid)
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("LR", "(default)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is LR, the parameters are (default).

Test result: 
score = 0.9444184960298926
ps = 0.9781094527363184
rs = 0.9101851851851852
f1 = 0.942925659472422
ras = 0.9912844276887633



In [39]:
lrc = LogisticRegression(C=50)
lrc.fit(x_train, y_train)

score = lrc.score(x_valid, y_valid)
y_pred = lrc.predict(x_valid)
y_pred_prob = lr_classifier.predict_proba(x_valid)
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("LR", "(default)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is LR, the parameters are (default).

Test result: 
score = 0.9584306398879029
ps = 0.9714557564224549
rs = 0.9453703703703704
f1 = 0.9582355701548569
ras = 0.9912844276887633



### Naive Bayes

    高斯朴素贝叶斯(sklearn.naive_bayes.GaussianNB(priors=None))

In [40]:
gnb_classifier = GaussianNB()
gnb_classifier.fit(x_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [41]:
score = gnb_classifier.score(x_valid, y_valid)
y_pred = gnb_classifier.predict(x_valid)
y_pred_prob = gnb_classifier.predict_proba(x_valid)
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("Naive Bayes", "(GaussianNB)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is Naive Bayes, the parameters are (GaussianNB).

Test result: 
score = 0.8612797758056983
ps = 0.9125395152792413
rs = 0.8018518518518518
f1 = 0.8536224741251849
ras = 0.895738646280588



    多项式分布贝叶斯(sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None))

In [42]:
mnb_classifier = MultinomialNB()
mnb_classifier.fit(x_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [43]:
score = mnb_classifier.score(x_valid, y_valid)
y_pred = mnb_classifier.predict(x_valid)
y_pred_prob = mnb_classifier.predict_proba(x_valid)
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("Naive Bayes", "(MultinomialNB)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is Naive Bayes, the parameters are (MultinomialNB).

Test result: 
score = 0.8589444184960299
ps = 0.9322222222222222
rs = 0.7768518518518519
f1 = 0.8474747474747475
ras = 0.9261798792194644



### SVM

In [44]:
svm_svc = SVC(probability=True)
svm_svc.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=True, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [45]:
score = svm_svc.score(x_valid, y_valid)
y_pred = svm_svc.predict(x_valid)
y_pred_prob = svm_svc.predict_proba(x_valid)
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("SVM", "(SVC)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is SVM, the parameters are (SVC).

Test result: 
score = 0.811303129378795
ps = 1.0
rs = 0.6259259259259259
f1 = 0.7699316628701595
ras = 0.9612315425699026



In [46]:
s = time.clock()
svm_svc_c = SVC(C=50.0, probability=True)
svm_svc_c.fit(x_train, y_train)

score = svm_svc_c.score(x_valid, y_valid)
y_pred = svm_svc_c.predict(x_valid)
y_pred_prob = svm_svc_c.predict_proba(x_valid)
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("SVM", "(SVC)"))
print(result_info.format(score, ps, rs, f1, ras))
e = time.clock()
print('运行时间为：{}秒'.format(e-s))

The model is SVM, the parameters are (SVC).

Test result: 
score = 0.9420831387202242
ps = 0.9799196787148594
rs = 0.9037037037037037
f1 = 0.9402697495183044
ras = 0.9917530631479737

运行时间为：173.83277099999998秒


In [47]:
s = time.clock()
svm_svc = SVC(C=1000.0, probability=True)
svm_svc.fit(x_train, y_train)

score = svm_svc.score(x_valid, y_valid)
y_pred = svm_svc.predict(x_valid)
y_pred_prob = svm_svc.predict_proba(x_valid)
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("SVM", "(SVC)"))
print(result_info.format(score, ps, rs, f1, ras))
e = time.clock()
print('运行时间为：{}秒'.format(e-s))

The model is SVM, the parameters are (SVC).

Test result: 
score = 0.9645025688930406
ps = 0.9735849056603774
rs = 0.9555555555555556
f1 = 0.9644859813084113
ras = 0.9932540929242156

运行时间为：90.99185499999999秒


### DecisionTree

In [48]:
clf_dt = DecisionTreeClassifier(max_depth=None)
clf_dt.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [49]:
score = clf_dt.score(x_valid, y_valid)
y_pred = clf_dt.predict(x_valid)
y_pred_prob = clf_dt.predict_proba(x_valid)
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("DecisionTree", "(default)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is DecisionTree, the parameters are (default).

Test result: 
score = 0.9663708547407753
ps = 0.9658040665434381
rs = 0.9675925925925926
f1 = 0.9666975023126734
ras = 0.9669022934338674



In [50]:
clf_dd = DecisionTreeClassifier(random_state=100)
clf_dd.fit(x_train, y_train)

score = clf_dd.score(x_valid, y_valid)
y_pred = clf_dd.predict(x_valid)
y_pred_prob = clf_dd.predict_proba(x_valid)
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("DecisionTree", "(default)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is DecisionTree, the parameters are (default).

Test result: 
score = 0.9696403549743111
ps = 0.9686057248384118
rs = 0.9712962962962963
f1 = 0.9699491447064261
ras = 0.970210667783712



### RandomForest

In [51]:
clf_rf = RandomForestClassifier(n_estimators=50)
clf_rf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [52]:
score = clf_rf.score(x_valid, y_valid)
y_pred = clf_rf.predict(x_valid)
y_pred_prob = clf_rf.predict_proba(x_valid)
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("RandomForest", "(n_estimators=10)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is RandomForest, the parameters are (n_estimators=10).

Test result: 
score = 0.9775805698271836
ps = 0.9813432835820896
rs = 0.9740740740740741
f1 = 0.9776951672862454
ras = 0.9966820260411212



In [53]:
clf_rf1 = RandomForestClassifier(n_estimators=100)
clf_rf1.fit(x_train, y_train)

score = clf_rf1.score(x_valid, y_valid)
y_pred = clf_rf1.predict(x_valid)
y_pred_prob = clf_rf1.predict_proba(x_valid)  
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("RandomForest", "(n_estimators=10)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is RandomForest, the parameters are (n_estimators=10).

Test result: 
score = 0.9780476412891173
ps = 0.9699727024567789
rs = 0.987037037037037
f1 = 0.9784304726938963
ras = 0.9965882116801061



In [54]:
clf_rf2 = RandomForestClassifier(n_estimators=500)
clf_rf2.fit(x_train, y_train)

score = clf_rf2.score(x_valid, y_valid)
y_pred = clf_rf2.predict(x_valid)
y_pred_prob = clf_rf2.predict_proba(x_valid)  
ps = precision_score(y_valid, y_pred)
rs = recall_score(y_valid, y_pred)
f1 = f1_score(y_valid, y_pred)
ras = roc_auc_score(y_valid, y_pred_prob[:, 1])
print(model_info.format("RandomForest", "(n_estimators=10)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is RandomForest, the parameters are (n_estimators=10).

Test result: 
score = 0.978514712751051
ps = 0.9743119266055046
rs = 0.9833333333333333
f1 = 0.9788018433179724
ras = 0.996987468146752



### 各模型在 test 数据集上的表现

In [55]:
x_test = X_test
y_test = Y_test

#### LogisticRegression

In [56]:
score = lr_classifier.score(x_test, y_test)
y_pred = lr_classifier.predict(x_test)
y_pred_prob = lr_classifier.predict_proba(x_test)
ps = precision_score(y_test, y_pred)
rs = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
ras = roc_auc_score(y_test, y_pred_prob[:, 1])
print(model_info.format("LogisticRegression", "(default)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is LogisticRegression, the parameters are (default).

Test result: 
score = 0.9467831612390787
ps = 0.9741602067183462
rs = 0.9157894736842105
f1 = 0.9440734557595993
ras = 0.9908009125878429



#### Naive Bayes (MultinomialNB)

In [57]:
score = mnb_classifier.score(x_test, y_test)
y_pred = mnb_classifier.predict(x_test)
y_pred_prob = mnb_classifier.predict_proba(x_test)
ps = precision_score(y_test, y_pred)
rs = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
ras = roc_auc_score(y_test, y_pred_prob[:, 1])
print(model_info.format("Naive Bayes (MultinomialNB)", "(default)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is Naive Bayes (MultinomialNB), the parameters are (default).

Test result: 
score = 0.857823669579031
ps = 0.9311701081612586
rs = 0.7668016194331984
f1 = 0.8410301953818827
ras = 0.9230087629890723



#### SVM

In [58]:
score = svm_svc.score(x_test, y_test)
y_pred = svm_svc.predict(x_test)
y_pred_prob = svm_svc.predict_proba(x_test)
ps = precision_score(y_test, y_pred)
rs = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
ras = roc_auc_score(y_test, y_pred_prob[:, 1])
print(model_info.format("SVM", "(C=1000.0, probability=True)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is SVM, the parameters are (C=1000.0, probability=True).

Test result: 
score = 0.9682287529785544
ps = 0.972972972972973
rs = 0.9619433198380567
f1 = 0.96742671009772
ras = 0.9916920426252993



#### DecisionTree¶

In [59]:
score = clf_dd.score(x_test, y_test)
y_pred = clf_dd.predict(x_test)
y_pred_prob = clf_dd.predict_proba(x_test)
ps = precision_score(y_test, y_pred)
rs = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
ras = roc_auc_score(y_test, y_pred_prob[:, 1])
print(model_info.format("DecisionTree", "(random_state=100)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is DecisionTree, the parameters are (random_state=100).

Test result: 
score = 0.9682287529785544
ps = 0.9668552950687146
rs = 0.968421052631579
f1 = 0.9676375404530745
ras = 0.9678385363252245



#### RandomForest

In [60]:
score = clf_rf2.score(x_test, y_test)
y_pred = clf_rf2.predict(x_test)
y_pred_prob = clf_rf2.predict_proba(x_test)
ps = precision_score(y_test, y_pred)
rs = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
ras = roc_auc_score(y_test, y_pred_prob[:, 1])
print(model_info.format("DecisionTree", "(n_estimators=500)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is DecisionTree, the parameters are (n_estimators=100).

Test result: 
score = 0.9817315329626688
ps = 0.9775100401606426
rs = 0.9854251012145749
f1 = 0.9814516129032258
ras = 0.9976825570130735



In [61]:
score = clf_rf1.score(x_test, y_test)
y_pred = clf_rf1.predict(x_test)
y_pred_prob = clf_rf1.predict_proba(x_test)
ps = precision_score(y_test, y_pred)
rs = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
ras = roc_auc_score(y_test, y_pred_prob[:, 1])
print(model_info.format("RandomForest", "(n_estimators=100)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is RandomForest, the parameters are (n_estimators=100).

Test result: 
score = 0.9817315329626688
ps = 0.9759807846277022
rs = 0.9870445344129555
f1 = 0.9814814814814815
ras = 0.9975228856961638



#### 综上，下面选择 LogisticRegression 来检测文章是否抄袭新华社。

## 找出预测结果与实际结果相反的文章，作为抄袭的候选者。

In [62]:
data.head()

Unnamed: 0,content,label
0,此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...,0
1,骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...,0
2,此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\r\n...,0
3,这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\r\n,1
4,（原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\r\n@深圳交警微博称：昨日清...,0


In [64]:
data.index = range(len(data))

In [71]:
texts = data['content'] # 内容
contents = get_contents(texts)
vectorizer = TfidfVectorizer(max_features=500)
vectors = vectorizer.fit_transform(contents)
X = vectors.toarray()

100%|██████████| 87054/87054 [04:51<00:00, 298.45it/s]


运行时间为：286.097573秒


In [72]:
label = data['label'] # 标签
y = label.tolist()

In [73]:
X.shape

(87054, 500)

In [74]:
len(y)

87054

In [78]:
# 预测所有的样本
score = clf_rf1.score(X, y)
y_pred = clf_rf1.predict(X)
y_pred_prob = clf_rf1.predict_proba(X)
ps = precision_score(y, y_pred)
rs = recall_score(y, y_pred)
f1 = f1_score(y, y_pred)
ras = roc_auc_score(y, y_pred_prob[:, 1])
print(model_info.format("RandomForest", "(default)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is RandomForest, the parameters are (default).

Test result: 
score = 0.47463643255910126
ps = 0.9656879384476126
rs = 0.4340016018102999
f1 = 0.598861533333918
ras = 0.7176080925187094



### 可以看出效果并不好，尽管训练和测试误差都较小，但是一旦在整体数据上来看就太差了。

    下面增加训练样本的数量（正负样本个2w个）

In [86]:
data_f = data[data['label'] == 0].sample(n=20000,replace=True, random_state=100)
data_z = data[data['label'] == 1].sample(n=20000,replace=False, random_state=100)

In [87]:
data_f.shape, data_z.shape

((20000, 3), (20000, 3))

In [91]:
data_farame = data_f.merge(data_z, how='outer')

In [92]:
data_farame.head()

Unnamed: 0,content,label,pred_label
0,参考消息网6月23日报道 台媒称，英国著名物理学家霍金20日于挪威举办的斯塔穆斯节发表演讲，...,0,0
1,参考消息网6月23日报道 台媒称，英国著名物理学家霍金20日于挪威举办的斯塔穆斯节发表演讲，...,0,0
2,参考消息网6月23日报道 台媒称，英国著名物理学家霍金20日于挪威举办的斯塔穆斯节发表演讲，...,0,0
3,参考消息网6月23日报道 台媒称，英国著名物理学家霍金20日于挪威举办的斯塔穆斯节发表演讲，...,0,0
4,参考消息网6月23日报道 台媒称，英国著名物理学家霍金20日于挪威举办的斯塔穆斯节发表演讲，...,0,0


In [93]:
texts = data_farame['content'] # 内容
label = data_farame['label'] # 标签

In [94]:
contents = get_contents(texts)

100%|██████████| 40000/40000 [03:26<00:00, 194.13it/s]

运行时间为：200.47628699999996秒





In [95]:
vectorizer = TfidfVectorizer(max_features=500)
vectors = vectorizer.fit_transform(contents)
X = vectors.toarray()
y = label.tolist()

In [96]:
X.shape, len(y)

((40000, 500), 40000)

In [97]:
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=50, test_size=0.3) # 拆成训练和测试集

In [98]:
clf_rf1 = RandomForestClassifier(n_estimators=100)
clf_rf1.fit(x_train, y_train)

score = clf_rf1.score(x_test, y_test)
y_pred = clf_rf1.predict(x_test)
y_pred_prob = clf_rf1.predict_proba(x_test)  
ps = precision_score(y_test, y_pred)
rs = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
ras = roc_auc_score(y_test, y_pred_prob[:, 1])
print(model_info.format("RandomForest", "(n_estimators=10)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is RandomForest, the parameters are (n_estimators=10).

Test result: 
score = 0.9925
ps = 0.994316282179873
rs = 0.9906728847435043
f1 = 0.9924912397797431
ras = 0.9991044023797343



In [106]:
score = clf_rf1.score(X, y)
y_pred = clf_rf1.predict(X)
y_pred_prob = clf_rf1.predict_proba(X)
ps = precision_score(y, y_pred)
rs = recall_score(y, y_pred)
f1 = f1_score(y, y_pred)
ras = roc_auc_score(y, y_pred_prob[:, 1])
print(model_info.format("RandomForest", "(default)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is RandomForest, the parameters are (default).

Test result: 
score = 0.997325
ps = 0.9981968444778362
rs = 0.99645
f1 = 0.9973226573251596
ras = 0.9997926225000001



In [99]:
texts = data['content'] # 内容
label = data['label'] # 标签

In [100]:
contents = get_contents(texts)
vectorizer = TfidfVectorizer(max_features=500)
vectors = vectorizer.fit_transform(contents)
X_o = vectors.toarray()
y_o = label.tolist()

100%|██████████| 87054/87054 [04:30<00:00, 321.86it/s]


运行时间为：263.9786130000002秒


In [101]:
data.index = range(len(data))

In [103]:
# 预测所有的样本
score = clf_rf1.score(X_o, y_o)
y_pred = clf_rf1.predict(X_o)
y_pred_prob = clf_rf1.predict_proba(X_o)
ps = precision_score(y_o, y_pred)
rs = recall_score(y_o, y_pred)
f1 = f1_score(y_o, y_pred)
ras = roc_auc_score(y_o, y_pred_prob[:, 1])
print(model_info.format("RandomForest", "(default)"))
print(result_info.format(score, ps, rs, f1, ras))

The model is RandomForest, the parameters are (default).

Test result: 
score = 0.6090472580237554
ps = 0.9757980254600508
rs = 0.5817622455854871
f1 = 0.7289380206757036
ras = 0.7312480778508906



In [104]:
data['pred_label'] = y_pred

In [105]:
# 预测错误的个数
(data['label'] != data['pred_label']).sum()

34034

In [107]:
# 查看实际是0，预测为1。表示抄袭新华社的文章
copy_data = data[(data['label'] == 0) & (data['pred_label'] == 1)]

In [108]:
copy_data.head(10)

Unnamed: 0,content,label,pred_label
14,6月21日，MSCI在官网发布公告称，从明年6月起将中国A股纳入MSCI新兴市场指数和MSC...,0,1
29,文章导读： 供应商围堵追债、20多位高管离职、上千人被裁员、孤注一掷史上最大规模的降价…...,0,1
41,6月14日，记者从省卫计委的答复中获悉，广东将启动城市三甲医院人员下沉基层计划，每年全省三级...,0,1
53,央广网贵阳6月19日消息（记者王珩 贵州台记者黄瑾）为规范省级救灾储备物资管理，提高救灾物资...,0,1
54,（原标题：高速上50秒别车6次 面包车司机现身：我一时冲动犯了错）\r\n高速上50秒别车6...,0,1
90,新疆日报讯（记者王永飞报道）近日，新疆广汇实业投资（集团）有限责任公司在新加坡交易所成功发行...,0,1
96,中新网6月23日电 (记者潘旭临) 意大利航空首席商务官乔治先生22日在北京接受中新网记者专...,0,1
127,自治区信访局干部冯玉是自治区第二批驻村工作队队员。2016年11月，她把4岁的女儿托付给婆婆...,0,1
153,佩莱格里尼\r\n凤凰体育讯（记者范宏基报道）6月17日，中超联赛进行了一场土豪间的对决，结...,0,1
155,今年，或许是在“新零售”的感召下，让大家本以为已回归平静的电商市场又起波澜。无论是线上线下融...,0,1


In [109]:
copy_text = copy_data['content'].values[0]

In [110]:
print(copy_text)

6月21日，MSCI在官网发布公告称，从明年6月起将中国A股纳入MSCI新兴市场指数和MSCI ACWI全球指数，这恐怕是近半年来中国资本市场上最令人振奋的消息。
A股早在2013年6月就已纳入新兴市场指数的候选列表中，但此后几年，都因为配额分配、资本流动限制、资本利得税等所谓原因而遭否决，尤其是在2016年第三次闯关失败后，中国投资者和相关监管部门似乎对“A股入摩”已心灰意冷，甚至连证监会分管国际合作的副主席方星海都在今年一月份的时候表示，“中国与MSCI在股指期货上的观点存在分歧，中国并不急于加入MSCI全球指数”。
然而事情最终出现了转机，今年3月，MSCI提出纳入A股的新方案——将A股的权重由原计划的1%降低至0.5%，并将指数纳入A股的数量由原来计划的448只减至169只，这一举动其实已经预示了A股今年大概率“入摩”。从6月21日宣布的结果来看，相比3月份的调整可以说还有惊喜，最终确定的权重为0.73%，股票数量为222只。
就具体的时间表而言，MSCI新兴市场A股纳入计划分两步走，第一步是在2018年5月按2.5%的指数纳入因子（index inclusion factor）给予A股0.37%的权重，第二步是在2018年8月按5%的因子将权重提高至计划的0.73%。从现在到A股正式进入MSCI新兴市场指数尚有一年时间，因此短期来看，这一事件不会马上起到提振国内股市的作用。另一方面，大部分机构预计本次“入摩”将为中国带来约1000亿美元的资金流入。相比于标的公司近2万亿美元的市值来说，这些资金并不能在市场上掀起太大的涟漪，只有当纳入因子进一步提高时（根据MSCI在2016年提出的A股纳入计划，纳入因子达到100%时，A股的权重将达到18.1%），“入摩”才可能在资金面上直接对A股市场有重大利好。
除了股价上的利好，“入摩”更重大的意义在于其给国内资本市场改革带来机遇与动力。MSCI在做出纳入决定前，需要广泛咨询国际机构投资者，这些投资者能够在全球范围内进行资产配置，他们最擅于比较各个国家的投资环境、资本市场对外资的友好程度。此前A股屡次碰壁就是因为国内金融市场上的QFII配额限制、QFII每月资本赎回限制、大面积股票停牌以及交易所需对A股相关金融产品预审等过多的管制与不确定性，给海外机构投资中国市场设下实质障碍，也让海外机构担心投入中国的资金的安全

In [111]:
xinhua_news = data[data['label'] == 1]['content'].tolist()

In [112]:
xinhua_news[0]

'这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\r\n'

In [None]:
# s -> t
def edit_distance(s, t):
    len1, len2 = len(s), len(t)
    if len1 == 0 and len2 == 0: return 0
    if len2 == 0: return len1
    if len1 == 0: return len2
    solution = {}
    dp = [[0] * (len2+1) for _ in range(len1+1)]
    for i in range(1, len2+1):
        dp[0][i] = i
    for i in range(1, len1+1):
        dp[i][0] = i
    for i in range(1, len1+1):
        for j in range(1, len2+1):
            if s[i-1] == t[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                options = [(dp[i-1][j-1], 'sub'), (dp[i-1][j], 'del'), (dp[i][j-1], 'add')]
                tmp = min(options)
                dp[i][j] = tmp[0] + 1
    return dp[-1][-1]

In [None]:
def setEditDistance(text):
    return edit_distance(copy_text, text)

In [None]:
data_news['edit_dis'] = data_news['content'].apply(lambda x: setEditDistance(x))

#### 上面这步太卡了，没有算不出来 :(

In [None]:
# 在上一步的基础上，找出与 copy_text 编辑距离最短的文章
sim_text = data_news[data_news['edit_dis'] == min(data_news['edit_dis'])]['content'].values[0]

#### 下面我想从 X 中直接用余弦相似度的方法求得与当前 copy_text 最相似的文章

    1、从 X 中找出 copy_text 对应的向量 X_copy
    2、求解 X_copy 与 X 中对应新华社的文章向量相似度最高的向量 X_sim，即为新华社被抄袭的文章的向量
    3、再根据 X_sim 找到对应的文章 sim_text，即为被抄袭的文章

In [None]:
len(X)

In [None]:
# 需要重新设置一下所有==索引，因为把 content 为空的去掉了，这样就能对应上 X 的索引
data_news.index = range(len(data_news))

In [None]:
copy_index = copy_data.index[0]
X_copy = X[copy_index]

In [None]:
# 求解两个向量的余弦相似度
def cos_sim(vector_a, vector_b):
    """
    计算两个向量之间的余弦相似度
    :param vector_a: 向量 a 
    :param vector_b: 向量 b
    :return: sim
    """
    vector_a = np.mat(vector_a)
    vector_b = np.mat(vector_b)
    num = float(vector_a * vector_b.T)
    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)
    cos = num / denom
    sim = 0.5 + 0.5 * cos
    return sim

In [None]:
def getMostSimText(X_copy):
    sim = 0.
    sim_idx = 0
    for idx in tqdm(range(len(X))):
        _sim = cos_sim(X[idx], X_copy)
        if y[idx] == 1 and _sim > sim:
            sim = _sim
            sim_idx = idx
    return data_news.loc[sim_idx]['content'], sim, sim_idx

In [None]:
sim_text, sim, sim_idx = getMostSimText(X_copy)

In [None]:
(sim, sim_idx)

In [None]:
print(sim_text)

In [None]:
print(copy_text)

### 总结：
### 上面结果看起来差不多吧。相似度 0.83，肉眼看貌似有点像。特别提示：最后根据索引提取文章的时候，一定要先重置一下原来的索引，因为在数据预处理的时候删除了一些 content 为空的文章，导致索引不再是连续的。

## 上面存在一个问题，因为数据是imbalance的。同时训练数据不一定要用全量的数据，将在下次作业中改进