# Famous Quotes Administrator

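“I may not be able to write famous quotes myself, but to become someone of value to society, I became a famous quotes administrator.
How should newly entered quotes be tagged? Let Python help me.”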

## Splitting the Data

Split the 100 scraped quotes (`spider_items.jl`) into 80 "stocked" quotes, `stocked_in`, and 20 "new arrival" quotes, `new_arrival`.

In [1]:
import pandas as pd
from collections import Counter

df = pd.read_json('spider_items.jl', lines=True)

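stocked_in = df[:80]
# Take an explicit copy (the original row labels 80, 81, ... are kept)
# so that deleting `tags` leaves `df` untouched.
new_arrival = df[80:].copy()
del new_arrival['tags']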
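
As a rough sketch of the same split on a toy table, assuming pandas is installed: the frame `toy`, its contents, and the names `toy_stocked` and `toy_new` are made up for illustration and stand in for the data loaded from `spider_items.jl`.

```python
import pandas as pd

# Hypothetical stand-in for the scraped data: 5 rows instead of 100.
toy = pd.DataFrame({
    "text": [f"quote {i}" for i in range(5)],
    "author": ["A", "B", "C", "D", "E"],
    "tags": [["life"], ["love"], ["humor"], ["books"], ["truth"]],
})

n_stocked = 3  # plays the role of 80 in the notebook

# Positional slicing: the first rows are "stocked", the rest are "new".
toy_stocked = toy.iloc[:n_stocked]
# drop() returns a new frame, so `toy` keeps its `tags` column.
toy_new = toy.iloc[n_stocked:].drop(columns="tags")

print(list(toy_new.index))    # original row labels are kept
print(list(toy_new.columns))
```

The new rows keep their original labels here, which matches the index shown in the final result of this notebook.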

## Counting Word Frequencies

Pick the 100 most frequent words in the stocked quotes, `top_words`, to serve as the word-frequency features.

In [3]:
from collections import Counter

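# Join all stocked quotes into one string, lowercase it, and split on whitespace.
# Punctuation stays attached to the tokens (e.g. '“the').
texts = ' '.join(stocked_in['text'])
texts = texts.lower().split()
c = Counter(texts)

# (word, count) pairs for the 100 most frequent tokens
top_words = c.most_common(100)
print(top_words)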

[('you', 73), ('is', 59), ('to', 56), ('a', 48), ('the', 43), ('and', 36), ('of', 33), ('not', 29), ('i', 28), ('it', 24), ('that', 24), ('but', 24), ('be', 22), ('in', 22), ('your', 18), ('can', 15), ('have', 13), ('who', 13), ('as', 12), ('what', 12), ('will', 12), ('all', 12), ('love', 12), ('are', 11), ('or', 11), ('she', 11), ('“the', 10), ('no', 10), ('with', 10), ('if', 10), ('think', 10), ('my', 10), ('more', 9), ("it's", 9), ('“i', 9), ('never', 9), ('make', 9), ('up', 9), ('so', 9), ('her', 9), ('one', 8), ('for', 8), ('do', 8), ('we', 7), ('“it', 7), ('than', 7), ('only', 7), ('just', 7), ('like', 7), ('going', 7), ("don't", 7), ('“if', 7), ('may', 7), ('our', 6), ('without', 6), ('live', 6), ('them', 6), ('give', 6), ('because', 6), ('at', 6), ('keep', 6), ('when', 6), ('good', 5), ('must', 5), ('-', 5), ('some', 5), ("doesn't", 5), ('“you', 5), ('from', 5), ('“there', 4), ('nothing', 4), ('man', 4), ('“a', 4), ('know', 4), ('life', 4), ("you're", 4), ('get', 4), ('let', 4)
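
Tokens such as `“the` and `“i` in the output above still carry the opening quotation mark, because `str.split()` only splits on whitespace. One possible variation, shown as an assumption rather than as part of the exercise, strips surrounding punctuation before counting. The helper `tokenize`, the `PUNCT` character set, and the sample sentences are all hypothetical.

```python
from collections import Counter

# Characters to strip from both ends of a token (a hypothetical choice).
PUNCT = "“”\"'.,;:!?()-"

def tokenize(text):
    """Lowercase, split on whitespace, strip edge punctuation, drop empties."""
    tokens = (tok.strip(PUNCT) for tok in text.lower().split())
    return [tok for tok in tokens if tok]

# Made-up sample quotes.
sample = [
    "“The truth is rarely pure.”",
    "“I think the truth matters.”",
]

counts = Counter()
for quote in sample:
    counts.update(tokenize(quote))

print(counts.most_common(2))
```

Only the edges of each token are stripped, so contractions such as `it's` stay intact. Switching to such a tokenizer would change `top_words` and every result derived from it, so the outputs shown in this notebook correspond to the plain `split()` version.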

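Write a function, `build_word_count_vector()`, that builds a word-count vector: for a single record (one quote), count how many times each of the top 100 words occurs and return a one-dimensional array of length 100. Apply it to the stocked data to get one word-count vector per quote, and stack the vectors into the matrix `word_count_matrix` (shape `(80, 100)`).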

In [5]:
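def build_word_count_vector(word_counts):
    # word_counts: (word, count) pairs for one quote, in top_words order.
    # Returns only the counts, as a list with one entry per top word.
    vector = []
    for word, count in word_counts:
        vector.append(count)
    return vector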
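
The description above asks for a function that maps one quote to a fixed-length vector, whereas `build_word_count_vector()` as written expects the counts to be paired with the words beforehand. A minimal sketch of a variant that takes the raw text directly; the name `count_top_words` and the sample sentence are hypothetical, and the three sample words are the first three entries of the output above.

```python
from collections import Counter

def count_top_words(text, top_words):
    """Return how often each top word occurs in `text`, in top_words order.

    `top_words` is a list of (word, count) pairs, as returned by
    Counter.most_common(); only the words are used here.
    """
    c = Counter(text.lower().split())
    # A Counter returns 0 for missing keys, so absent words need no special case.
    return [c[word] for word, _ in top_words]

sample_top_words = [("you", 73), ("is", 59), ("to", 56)]
vector = count_top_words("To be is to do", sample_top_words)
print(vector)  # [0, 1, 2]
```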

In [7]:
import numpy as np

rows = []
for text in stocked_in['text']:
    # Count the lowercased, whitespace-separated tokens of this quote.
    c = Counter(text.lower().split())
    # Pair every top word with its count in this quote (0 if absent).
    top_count = [(word, c[word]) for word, _ in top_words]
    rows.append(build_word_count_vector(top_count))

# One row per stocked quote, one column per top word.
word_count_matrix = np.array(rows)
print(word_count_matrix)

[[0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 4 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 2 2 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


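## Normalization

Compute the L2 norm of each row (one vector), that is, the square root of the sum of its squared elements, and divide every element by it, so that the squared elements of the new vector sum to 1.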

In [9]:
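# Convert to float so the normalized values are not truncated when stored back.
word_count_matrix = word_count_matrix.astype(float)
for i in range(len(word_count_matrix)):
    norm = np.linalg.norm(word_count_matrix[i])
    # Leave an all-zero row unchanged rather than dividing by zero.
    if norm > 0:
        word_count_matrix[i] = word_count_matrix[i] / norm

print(word_count_matrix)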

[[0.         0.23570226 0.         ... 0.         0.         0.        ]
 [0.         0.30151134 0.         ... 0.         0.         0.        ]
 [0.         0.68599434 0.17149859 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.45883147 0.45883147 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
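
The row-by-row loop can also be written as a single vectorized step. A minimal sketch on a made-up 3x3 matrix, assuming NumPy; leaving rows with zero norm as zeros is a choice made here, not something the exercise specifies.

```python
import numpy as np

# Made-up counts: 3 quotes x 3 words; the last row contains no top words.
counts = np.array([[3, 4, 0],
                   [1, 0, 0],
                   [0, 0, 0]], dtype=float)

# L2 norm of every row, kept as a column so it broadcasts across each row.
norms = np.linalg.norm(counts, axis=1, keepdims=True)

# Replace zero norms with 1 so all-zero rows stay zero instead of becoming NaN.
safe_norms = np.where(norms == 0, 1.0, norms)
normalized = counts / safe_norms

print(normalized)
print((normalized ** 2).sum(axis=1))  # 1 for non-zero rows, 0 otherwise
```

`keepdims=True` keeps the norms in shape `(3, 1)`, so the division applies each row's norm to that row only.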


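## Automatic Tagging

Compute the word counts of each "new arrival" quote, find the stocked quote whose word-count vector has the highest dot-product similarity with it, and use that quote's tags as the tags of the new quote by writing them to its `tags` field.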

In [11]:
new_tags = []
for text in new_arrival['text']:
    c = Counter(text.lower().split())
    # Word-count vector of the new quote over the same top words.
    new_count = [(word, c[word]) for word, _ in top_words]
    new_temp = np.array(build_word_count_vector(new_count), dtype=float)
    # Dot product with every normalized stocked vector: one score per stocked quote.
    result = np.dot(word_count_matrix, new_temp)
    # Position of the most similar stocked quote.
    max_index = np.argmax(result)
    new_tags.append(stocked_in['tags'].iloc[max_index])

new_arrival['tags'] = new_tags
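
The matching step can be sketched on toy data as follows. The vectors and tag lists are invented for illustration; only the technique (dot product against row-normalized vectors, then `argmax`) follows the description above.

```python
import numpy as np

# Row-normalized word-count vectors of three "stocked" quotes (made up).
stocked_vectors = np.array([[1.0, 0.0, 0.0],
                            [0.0, 0.6, 0.8],
                            [0.6, 0.8, 0.0]])
stocked_tags = [["life"], ["humor"], ["love", "truth"]]

# Raw word counts of two "new" quotes (made up).
new_vectors = np.array([[0.0, 1.0, 2.0],
                        [2.0, 0.0, 0.0]])

# scores[i, j] is the dot product of new quote i with stocked quote j.
scores = new_vectors @ stocked_vectors.T

# For each new quote, the position of the best-scoring stocked quote.
best = scores.argmax(axis=1)
assigned = [stocked_tags[j] for j in best]

print(assigned)
```

Only the stocked vectors need to be normalized for this comparison: scaling a new quote's vector scales all of its scores by the same factor and does not change which stocked quote wins the `argmax`.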

## Results

(Because this data set is fairly random, the assigned tags may not match the true ones; correctness is not evaluated here.)

In [13]:
new_arrival

|  | text | author | tags |
|---|---|---|---|
| 80 | “Anyone who has never made a mistake has never... | Albert Einstein | [books, inspirational, reading, tea] |
| 81 | “A lady's imagination is very rapid; it jumps ... | Jane Austen | [humor] |
| 82 | “Remember, if the time should come when you ha... | J.K. Rowling | [books, contentment, friends, friendship, life] |
| 83 | “I declare after all there is no enjoyment lik... | Jane Austen | [music] |
| 84 | “There are few people whom I really love, and ... | Jane Austen | [the-hunger-games] |
| 85 | “Some day you will be old enough to start read... | C.S. Lewis | [imagination] |
| 86 | “We are not necessarily doubting that God will... | C.S. Lewis | [inspirational] |
| 87 | “The fear of death follows from the fear of li... | Mark Twain | [adulthood, success, value] |
| 88 | “A lie can travel half way around the world wh... | Mark Twain | [knowledge, learning, understanding, wisdom] |
| 89 | “I believe in Christianity as I believe that t... | C.S. Lewis | [humor, insanity, lies, lying, self-indulgence... |


For reference:
![image.png](attachment:dced374c-c868-4e17-8bcd-cf21f2da3137.png)