# Dataset and preprocessing
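Our dataset consists of articles from medium.com. The full collection contains more than 190,000 articles covering topics such as technology, society, health, environmental protection, and lifestyle; for this experiment we use the first 256 of them (`FILE_CNT` in the code below). The dataset is a single CSV file in which each row holds an article's title, text, URL, author, publication time, tags, and so on. Only the article text is indexed and scored; the title, URL, and timestamp are used only when displaying results.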

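During preprocessing we strip punctuation, digits, and special symbols from the text, convert every word to lowercase, and split the text into individual tokens. The `CountVectorizer` class in sklearn handles all of this. We also supply a stop-word list to exclude words that do not help in understanding an article's main content, which makes the computation more efficient. Next we build a vector-space representation of the texts. For simplicity, this experiment uses the bag-of-words model: we consider only how often each word occurs in each article and ignore the article's structure, word positions, and similar information. In the code below, the `fit_transform` method of `CountVectorizer` builds this representation, and `count[i][j]` (after converting the result to a dense array) is the number of times the j-th word occurs in the i-th text.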


In [1]:
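import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict

FILE_CNT = 256  # number of articles read from the CSV
# Load the stop-word list (one word per line), skipping blank lines.
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    STOP_WORDS = [line.strip() for line in f if line.strip()]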

def get_text_list():
    df = pd.read_csv('1.csv', nrows=FILE_CNT, encoding='utf-8')
    return df['text'].values.tolist()


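def get_bag(texts):
    """Build the bag-of-words model; returns the vectorizer and the sparse count matrix."""
    # Tokens are runs of ASCII letters only, so digits and punctuation are dropped.
    bag = CountVectorizer(token_pattern=r'\b[A-Za-z]+\b', stop_words=STOP_WORDS)
    count = bag.fit_transform(texts)
    return bag, count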


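def generate_inverse_index(text_list, bag, array):
    """Map each word to a list of (document index, term frequency, match spans)."""
    result = defaultdict(list)
    words = bag.get_feature_names_out()
    for index, value in enumerate(text_list):
        # Visit only the words that actually occur in this document.
        for i in array[index].nonzero()[0]:
            word = words[i]
            # CountVectorizer lowercases tokens, so match case-insensitively
            # to locate capitalized occurrences in the original text as well.
            pattern = r'\b' + re.escape(word) + r'\b'
            position_list = [m.span() for m in re.finditer(pattern, value, re.IGNORECASE)]
            result[word].append((index, array[index][i], position_list))  # document index, term frequency, position list
    return result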


# Get scores and search


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import math
import numpy as np


def get_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        # A vector with no known words has no direction; treat it as dissimilar.
        return 0.0
    return float(np.dot(a, b) / norm)

def get_tfidf_score(freq_of_text, word_cnt_of_text, text_cnt_of_word):
    tf = freq_of_text / word_cnt_of_text
    idf = math.log(FILE_CNT / (text_cnt_of_word + 1))
    return tf * idf

class ResultItem:
    def __init__(self, index, title, text, url, timestamp):
        self.index = index
        self.title = title
        self.text = text
        self.url = url
        self.timestamp = timestamp
        self.score = 0.0
        self.freq = 0
        self.count = 0
        self.occurrence = []
        self.similarity = 0.0

    def __str__(self):
        # f-strings also handle non-string metadata (e.g. a missing title read as NaN).
        s = (f"file_index: {self.index}\n"
             f"title: {self.title}\n"
             f"url: {self.url}\n"
             f"timestamp: {self.timestamp}\n"
             f"freq: {self.freq}\n"
             f"score: {self.score}\n"
             f"similarity: {self.similarity}\n")
        for j in self.occurrence:
            # Show up to 50 characters of context on each side of the match start.
            s += "> ..." + self.text[max(0, j[0] - 50):j[0] + 50] + "...\n"
        return s

def run_search(search_str, inverse_index, metadata, texts, bag, array):
    # Lowercase the query to match the lowercased vocabulary; split() also drops repeated spaces.
    s_list = search_str.lower().split()
    word_cnt = array.toarray().sum(axis=1)  # total token count of each document
    result_dict = {}
    for word in s_list:
        # .get() avoids inserting empty entries into the defaultdict index.
        postings = inverse_index.get(word, [])  # inverted-index entries for this query word
        text_cnt = len(postings)  # number of documents containing the word
        for text_index, text_freq, positions in postings:
            if text_index not in result_dict:
                result_dict[text_index] = ResultItem(
                    text_index,
                    metadata[text_index][0],
                    texts[text_index],
                    metadata[text_index][1],
                    metadata[text_index][2])
            item = result_dict[text_index]
            item.count += 1
            item.freq += text_freq
            item.score += get_tfidf_score(text_freq, word_cnt[text_index], text_cnt)
            item.occurrence.extend(positions)
    result_list = list(result_dict.values())

    # Vectorize the query over the same vocabulary and token pattern as the corpus.
    search_vec = CountVectorizer(
        vocabulary=bag.get_feature_names_out(),
        token_pattern=r'\b[A-Za-z]+\b').fit_transform([search_str]).toarray()
    for i in result_list:
        # toarray() turns the single sparse row into a dense 1 x V array.
        i.similarity = get_similarity(search_vec[0], array[i.index].toarray()[0])

    # Rank by TF-IDF score weighted by the number of query words matched.
    result_list.sort(key=lambda x: -x.score * x.count)
    return result_list
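
To make the two scores above concrete, here is a minimal sketch with made-up numbers. The helper names `tfidf` and `cosine` and every input value are hypothetical; the formulas mirror `get_tfidf_score` and `get_similarity`, except that returning 0.0 for a zero vector is an assumption of this sketch.

```python
import math

N_DOCS = 256  # corpus size, matching FILE_CNT above


def tfidf(freq_in_doc, doc_length, docs_with_word, n_docs=N_DOCS):
    """TF-IDF with the same smoothing as get_tfidf_score: log(N / (df + 1))."""
    tf = freq_in_doc / doc_length
    idf = math.log(n_docs / (docs_with_word + 1))
    return tf * idf


def cosine(a, b):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0.0


# Hypothetical numbers: a word occurring 4 times in a 1000-token article.
rare = tfidf(4, 1000, docs_with_word=7)      # word found in 7 of 256 articles
common = tfidf(4, 1000, docs_with_word=127)  # word found in 127 of 256 articles
print(rare, common)

# A one-word query against a document that contains the word twice among other words.
print(cosine([0, 1, 0], [3, 2, 1]))
```

With the same term frequency, the rarer word receives the higher score. Because of the `+ 1` in the denominator, a word that appears in all but one of the articles scores exactly zero, and a word that appears in every article scores slightly below zero.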


# Manual evaluation


In [None]:
def get_manual_accuracy(total_search, fine_search):
    if total_search == 0:
        return float('nan')
    return fine_search / total_search * 100

In [None]:
import IPython.display as display
print("Loading data...")
text_list = get_text_list()
print("Vectorizing...")
bag, count = get_bag(text_list)
print("Generating index...")
inverse_index = generate_inverse_index(text_list, bag, count.toarray())
print("Done.")
metadata = pd.read_csv('1.csv', nrows=FILE_CNT, encoding='utf-8')[['title', 'url', 'timestamp']].values.tolist()
search_cnt = 0
fine_search_cnt = 0
while True:
    manual_accuracy = get_manual_accuracy(search_cnt, fine_search_cnt)
    search_str = input(f"Current manual accuracy: {manual_accuracy}.\nType your keyword to search now. Type q to exit.\n> ")
    display.clear_output()
    if search_str == 'q':
        print('Bye :)')
        break

    search_cnt += 1
    result = run_search(search_str, inverse_index, metadata, text_list, bag, count)
    for i in result:
        print(i)
    search_evaluate = input("Is this result correct? (Y/n)\n> ")
    # Count the search as correct unless the user answers 'n' or 'N'.
    if search_evaluate.strip().lower() != 'n':
        fine_search_cnt += 1

file_index: 190
title: How can atmospheric pollution result to excessive harm in the marine ecosystems?
url: https://medium.com/datadriveninvestor/how-atmospheric-pollution-can-result-to-excessive-harm-in-the-marine-ecosystems-922fc295a258
timestamp: 2020-12-28 12:42:46.667000+00:00
freq: 4
score: 0.015267305600951218
similarity: 0.030442246621763137
> ...How can atmospheric pollution result to excessive harm in the marine e...
> ...the atmospheric system, the focus is based on the pollution and the effect of the excessive quantiti...
> ...ware of the fact that our harmful actions for the pollution of the air are equally and sometimes eve...
> ...negative mechanism which is caused by atmospheric pollution and it’s attached to the acidic characte...

file_index: 178
title: There is Plastic in Your Fruits & Vegetables
url: https://medium.com/age-of-awareness/there-is-plastic-in-your-fruits-vegetables-100ac353ac51
timestamp: 2020-07-08 11:23:47.512000+00:00
freq: 1
score: 0.01131845061454