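# Quote Curator - Supercharged

“The tags from last time were interesting, but are there other ways to do it?”

The parts repeated from the previous lab are already implemented. Run the earlier cells as they are, skip ahead to the "Automatic tagging" section, and implement a series of models to complete the task.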

## Splitting the collection

Split the 100 scraped quotes (`spider_items.jl`) into 80 "stocked-in" quotes, `stocked_in`, and 20 "new-arrival" quotes, `new_arrival`.

In [1]:
import pandas as pd
from collections import Counter

df = pd.read_json('spider_items.jl', lines=True)

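stocked_in = df[:80]
# .copy() yields an independent frame, so removing and later re-adding
# the `tags` column does not touch (or warn about) the original df.
new_arrival = df[80:].copy()
del new_arrival['tags']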

## Counting word frequencies

Select the 100 most frequent words in the stocked-in quotes, `top_words`, as the word-frequency features.

In [3]:
from collections import Counter

counter_all = Counter()
counter_text = []

for _, data in stocked_in.iterrows():
    words = data['text'].split()
    counter_all.update(words)
    counter_text.append(Counter(words))

top_words, _ = zip(*counter_all.most_common(100))

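Define a function, `build_word_count_vector()`, that builds a word-count vector, apply it to every stocked-in quote, and stack the resulting vectors into the matrix `word_count_matrix`.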

In [5]:
def build_word_count_vector(word_counts):
    return [word_counts.get(word, 0) for word in top_words]
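
To see what `build_word_count_vector()` produces, here is a minimal self-contained sketch on two made-up sentences. The sentences and the vocabulary size of 2 are assumptions for illustration only; they are not taken from `spider_items.jl`.

```python
from collections import Counter

# Toy corpus standing in for the stocked-in quotes (hypothetical data).
texts = ["love is love", "love is life"]

counter_all = Counter()
counter_text = []
for text in texts:
    words = text.split()
    counter_all.update(words)            # corpus-wide counts
    counter_text.append(Counter(words))  # per-quote counts

# Keep the 2 most common words as the feature vocabulary.
top_words, _ = zip(*counter_all.most_common(2))

def build_word_count_vector(word_counts):
    # One entry per vocabulary word; words absent from the quote count as 0.
    return [word_counts.get(word, 0) for word in top_words]

vectors = [build_word_count_vector(c) for c in counter_text]
print(top_words)  # vocabulary, most frequent first
print(vectors)    # one row per sentence, one column per vocabulary word
```

Words outside `top_words` (here `life`) are simply dropped, which is why a quote made up only of rare words ends up as an all-zero vector.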

In [7]:
import numpy as np

vectors = []

for word_counts in counter_text:
    vectors.append(build_word_count_vector(word_counts))

word_count_matrix = np.array(vectors)

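## Normalization

Compute the L2 norm of each row (one vector), that is, the square root of the sum of its squared elements, and divide the row by it so that the squared elements of the new vector sum to 1.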

In [9]:
s = (np.sum(word_count_matrix ** 2, axis=1) ** 0.5)[:, np.newaxis]
word_count_matrix = word_count_matrix * (1 / s)
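
The division above assumes every row has a non-zero norm. A quote that contains none of the `top_words` would give a norm of 0 and produce `nan` values. The following is a sketch of one possible guard, not part of the original lab; the helper name `l2_normalize_rows` and the demo matrix are assumptions.

```python
import numpy as np

def l2_normalize_rows(matrix):
    """Scale each row to unit L2 norm; all-zero rows are left as zeros."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Replace zero norms by 1 so that zero rows are divided by 1, not by 0.
    safe_norms = np.where(norms == 0, 1.0, norms)
    return matrix / safe_norms

# Hypothetical demo matrix: a regular row, an all-zero row, a unit row.
demo = np.array([[3, 4], [0, 0], [1, 0]])
normalized = l2_normalize_rows(demo)
print(normalized)
print(np.sum(normalized ** 2, axis=1))  # squared elements per row
```

Leaving zero rows untouched is only one option; dropping such quotes before training would be another.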

Take the new-arrival data and turn it into normalized word-count vectors in the same way, so that each model can later predict on it.

In [11]:
vectors = []

for _, data in new_arrival.iterrows():
    words = data['text'].split()
    word_counts = Counter(words)
    vectors.append(build_word_count_vector(word_counts))

new_arrival_word_count_matrix = np.array(vectors)

s = (np.sum(new_arrival_word_count_matrix ** 2, axis=1) ** 0.5)[:, np.newaxis]
new_arrival_word_count_matrix = new_arrival_word_count_matrix * (1 / s)

## Automatic tagging

Build several kinds of models to predict tags.

In [13]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


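### Building the target label

Take one tag, for example `love`, and build the target label `stocked_in_target`: a value of 1 means the quote is tagged `love`, and 0 means it is not.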

In [15]:
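# 1 if the quote is tagged `love`, otherwise 0
stocked_in_target = []
for _, data in stocked_in.iterrows():
    tags = data['tags']
    if 'love' in tags:
        stocked_in_target.append(1)
    else:
        stocked_in_target.append(0)

# Column vector of shape (80, 1). scikit-learn estimators expect a 1-D target
# and warn when they are given a column vector like this one.
stocked_in_target = np.array(stocked_in_target)[:, np.newaxis]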
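
The same construction extends to any tag. As a sketch under assumptions (the helper `build_target` and the toy tag lists are hypothetical and not part of the lab), one target vector per label could be built like this:

```python
import numpy as np

# Hypothetical tag lists standing in for stocked_in['tags'].
tag_lists = [["love", "life"], ["humor"], [], ["love"]]

def build_target(tag_lists, label):
    # 1-D array with 1 where the quote carries `label`, else 0.
    return np.array([1 if label in tags else 0 for tags in tag_lists])

# One binary target per tag of interest.
targets = {label: build_target(tag_lists, label) for label in ("love", "humor")}
print(targets["love"])
print(targets["humor"])
```

Each array here is 1-D, which is the shape scikit-learn classifiers accept without a conversion warning.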

### Logistic regression

In [17]:
from sklearn.linear_model import LogisticRegression

# Train on word_count_matrix and stocked_in_target; ravel() flattens the
# (n, 1) column vector into the 1-D target scikit-learn expects.
model = LogisticRegression(max_iter=1000)
model.fit(word_count_matrix, stocked_in_target.ravel())

# Predict the `love` tag for the new arrivals.
tags = model.predict(new_arrival_word_count_matrix)

new_arrival["tags"] = [["love"] if x else [] for x in tags]
new_arrival


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |


### Perceptron

In [19]:
from sklearn.linear_model import Perceptron

# Train on word_count_matrix and stocked_in_target; ravel() flattens the
# (n, 1) column vector into the 1-D target scikit-learn expects.
model = Perceptron(max_iter=1000)
model.fit(word_count_matrix, stocked_in_target.ravel())

# Predict the `love` tag for the new arrivals.
tags = model.predict(new_arrival_word_count_matrix)

new_arrival["tags"] = [["love"] if x else [] for x in tags]
new_arrival


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [love] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [love] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [love] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |


Example output:

![image.png](attachment:70752991-648d-4d6b-9a18-47d946b02aad.png)

### K-nearest neighbors

In [21]:
from sklearn.neighbors import KNeighborsClassifier

# Train on word_count_matrix and stocked_in_target; ravel() flattens the
# (n, 1) column vector into the 1-D target scikit-learn expects.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(word_count_matrix, stocked_in_target.ravel())

# Predict the `love` tag for the new arrivals.
tags = model.predict(new_arrival_word_count_matrix)

new_arrival["tags"] = [["love"] if x else [] for x in tags]
new_arrival


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |


### Decision tree

In [23]:
from sklearn.tree import DecisionTreeClassifier

# Train on word_count_matrix and stocked_in_target; ravel() flattens the
# (n, 1) column vector into a 1-D target.
model = DecisionTreeClassifier()
model.fit(word_count_matrix, stocked_in_target.ravel())

# Predict the `love` tag for the new arrivals.
tags = model.predict(new_arrival_word_count_matrix)

new_arrival["tags"] = [["love"] if x else [] for x in tags]
new_arrival

| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [love] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |


### Support vector machine

In [25]:
from sklearn.svm import SVC

# Train on word_count_matrix and stocked_in_target; ravel() flattens the
# (n, 1) column vector into the 1-D target scikit-learn expects.
model = SVC(kernel='linear', C=0.9)
model.fit(word_count_matrix, stocked_in_target.ravel())

# Predict the `love` tag for the new arrivals.
tags = model.predict(new_arrival_word_count_matrix)

new_arrival["tags"] = [["love"] if x else [] for x in tags]
new_arrival


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |


### Multilayer perceptron

In [27]:
from sklearn.neural_network import MLPClassifier

# Train on word_count_matrix and stocked_in_target; ravel() flattens the
# (n, 1) column vector into the 1-D target scikit-learn expects.
model = MLPClassifier(max_iter=1000, hidden_layer_sizes=(20, 20))
model.fit(word_count_matrix, stocked_in_target.ravel())

# Predict the `love` tag for the new arrivals.
tags = model.predict(new_arrival_word_count_matrix)

new_arrival["tags"] = [["love"] if x else [] for x in tags]
new_arrival


| | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [love] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [] |
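
The six cells above share the same fit-and-predict pattern, so they can also be run side by side in one loop. The sketch below uses randomly generated stand-ins for `word_count_matrix`, `stocked_in_target`, and `new_arrival_word_count_matrix`; the synthetic data, the variable names `X_train`, `y_train`, `X_new`, and the fixed `random_state` values are all assumptions made so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for the notebook's matrices (assumed data).
rng = np.random.default_rng(0)
X_train = rng.random((80, 10))
y_train = (X_train[:, 0] > 0.5).astype(int)  # 1-D binary target
X_new = rng.random((20, 10))

# Same estimators and hyperparameters as in the cells above.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "perceptron": Perceptron(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=1),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "support vector machine": SVC(kernel="linear", C=0.9),
    "multilayer perceptron": MLPClassifier(
        max_iter=1000, hidden_layer_sizes=(20, 20), random_state=0
    ),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_new)
    tagged = int(predictions[name].sum())
    print(f"{name}: {tagged} of {len(X_new)} quotes tagged")
```

In the notebook itself, the decision tree and the multilayer perceptron are created without a `random_state`, so their predictions can change from run to run.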


### Parameters

So what are the parameters of these models? Try to find them!

In [29]:
# Weight matrices of the multilayer perceptron trained above, one array per layer
model.coefs_

[array([[-0.09453553,  0.33495664, -0.32552333, ..., -0.19460135,
         -0.09571239,  0.14786669],
        [-0.00946966,  0.18233274,  0.21095334, ...,  0.04478435,
          0.14907428, -0.23473225],
        [-0.06801127,  0.20796023,  0.09875849, ..., -0.18600205,
         -0.16171338,  0.37552048],
        ...,
        [-0.14859373,  0.03432624, -0.27427968, ..., -0.22180307,
         -0.08040728,  0.12252447],
        [-0.00600715, -0.31506203, -0.08440662, ..., -0.15705855,
          0.0544379 ,  0.24300162],
        [ 0.08812621,  0.12919036,  0.11633853, ..., -0.01444048,
         -0.14617087,  0.19054971]]),
 array([[-2.24721652e-01,  6.80002089e-04, -6.63555129e-03,
          3.41884652e-01,  2.56724050e-01, -2.85927744e-01,
          1.00416289e-01, -1.34559806e-01,  2.55927167e-08,
         -1.42001694e-01, -2.32009269e-03, -2.75607998e-01,
         -8.14841337e-02,  7.49035763e-02, -2.17816754e-02,
         -1.27785957e-03, -2.49749350e-01,  3.66344316e-01,
         -7.9

Example output:

![image.png](attachment:1a7d881d-61f3-4d67-bc46-f2ca3b129b5a.png)
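
`coefs_` only exists on the multilayer perceptron. As a hedged sketch of where the other estimators keep what they learned, the block below fits each model on synthetic data (the data and the names `X`, `y`, `shapes`, and `mlp_shapes` are assumptions) and prints the shapes of the fitted attributes. `KNeighborsClassifier` is omitted because it learns no weights; it stores the training samples and looks them up at prediction time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the word-count features and the `love` target.
rng = np.random.default_rng(0)
X = rng.random((80, 10))
y = (X[:, 0] > 0.5).astype(int)

# Linear models: one weight per feature plus an intercept.
linear_models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "perceptron": Perceptron(max_iter=1000),
    "linear SVM": SVC(kernel="linear", C=0.9),  # coef_ needs a linear kernel
}
shapes = {}
for name, model in linear_models.items():
    model.fit(X, y)
    shapes[name] = (model.coef_.shape, model.intercept_.shape)
    print(name, shapes[name])

# Decision tree: split structure in tree_, per-feature importance scores.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("tree nodes:", tree.tree_.node_count)
print("importances shape:", tree.feature_importances_.shape)

# Multilayer perceptron: one weight matrix and one bias vector per layer.
mlp = MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=1000, random_state=0)
mlp.fit(X, y)
mlp_shapes = [w.shape for w in mlp.coefs_]
print("MLP weight shapes:", mlp_shapes)
print("MLP bias shapes:", [b.shape for b in mlp.intercepts_])
```

With the notebook's 100 word-count features, the first MLP weight matrix would have 100 rows instead of 10.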