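## This experiment builds two search engines from crawled web pages, one ranked by TF-IDF and one by PageRank
- Use TF-IDF and the search keywords to find matching pages
- Rank the pages by TF-IDF and by PageRank, respectively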

### The folder www.cmiw.cn holds the HTML files downloaded by the crawler web_crawler.ipynb, along with linked_url.json, which records the links between pages

In [5]:
!dir

 Volume in drive C has no label.
 Volume Serial Number is D4A3-95FD

 Directory of C:\Users\Mark\1AI.NLP\AI.NLP\lecture7

2019-01-11  14:50    <DIR>          .
2019-01-11  14:50    <DIR>          ..
2019-01-04  16:34    <DIR>          .ipynb_checkpoints
2019-01-11  14:49    <DIR>          log
2019-01-11  14:48            48,643 simple_search_engine.ipynb
2019-01-11  14:50            12,206 web_crawler.ipynb
2019-01-11  14:51    <DIR>          www.cmiw.cn
2019-01-04  19:37    <DIR>          www.cmiw.cn_text_
               2 File(s)         60,849 bytes
               6 Dir(s)  26,884,120,576 bytes free


In [97]:
html_base = "www.cmiw.cn" # html folder

### The crawler saved HTML files whose text is full of tags; html2text converts them to plain text here
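To make the idea of tag stripping concrete, here is a minimal sketch that uses only the standard library's `html.parser` as a stand-in. It is not how html2text works internally (html2text emits Markdown-flavoured text); `TextExtractor`, `strip_tags` and the sample HTML are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_tags(html):
    """Return the text content of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Made-up sample page, not taken from the crawl.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <a href='/x'>world</a></p></body></html>"
)
plain = strip_tags(sample)
print(plain)
```

The link target `/x` disappears from the text, which is roughly the effect of setting `ignore_links = True` on the html2text handler.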

In [98]:
from tqdm import tqdm
import os
import html2text

html_handler = html2text.HTML2Text()
html_handler.ignore_links = True

def extract_text_from_html(html_folder):
    """Convert the article/thread HTML files in html_folder to text and return the output folder."""
    text_folder = html_folder + '_text'
    if not os.path.isdir(text_folder):
        os.mkdir(text_folder)
    # Keep only article and thread pages.
    html_files = [html_f for html_f in os.listdir(html_folder) if '.html' in html_f and ('article' in html_f or 'thread' in html_f)]
    for f in tqdm(html_files):
        # Skip files that were already converted on a previous run.
        if not os.path.exists(os.path.join(text_folder, f)):
            with open(os.path.join(html_folder, f)) as html_file:
                with open(os.path.join(text_folder, f),'w',encoding='utf-8') as txt_file:
                    txt_file.write(html_handler.handle(html_file.read()))
    
    return text_folder

In [101]:
text_base = extract_text_from_html(html_base)

100%|████████████████████████████████████████████████████████████████████████████████| 663/663 [00:37<00:00, 17.66it/s]


In [102]:
import os

len(os.listdir(text_base))

663

### Tokenize each text file first

In [103]:
import jieba

def cut(string): return ' '.join(jieba.cut(string))
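Joining the jieba tokens with spaces matters because the `TfidfVectorizer` used later in this notebook splits text with its default token pattern, `(?u)\b\w\w+\b`, which knows nothing about Chinese word boundaries. The sketch below applies that same regular expression with the standard `re` module. The segmented string is hand-written as a stand-in for jieba output (actual jieba output may differ), and `sklearn_style_tokens` is a hypothetical helper name.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def sklearn_style_tokens(text):
    """Tokenize roughly the way TfidfVectorizer does by default (lowercase, then regex)."""
    return TOKEN_PATTERN.findall(text.lower())


unsegmented = "深圳机械师傅的工作"
segmented = "深圳 机械 师傅 的 工作"  # hand-segmented stand-in for cut(unsegmented)

tokens_raw = sklearn_style_tokens(unsegmented)
tokens_cut = sklearn_style_tokens(segmented)

print(tokens_raw)
print(tokens_cut)
```

Without segmentation the whole sentence becomes a single token. With segmentation each word is a token, but single-character words such as 的 are dropped by the pattern, so they never enter the vocabulary.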

In [104]:
from tqdm import tqdm
import os

def cut_text(text_folder):
    cut_folder = text_folder + '_cut'
    if not os.path.isdir(cut_folder):
        os.mkdir(cut_folder)
        
    for f in tqdm(os.listdir(text_folder)):
        if not os.path.exists(os.path.join(cut_folder, f)):
            with open(os.path.join(text_folder, f), encoding = 'utf-8') as text_file:
                with open(os.path.join(cut_folder, f), 'w',encoding = 'utf-8') as cut_file:
                    cut_file.write(cut(text_file.read()))
    
    return cut_folder

In [105]:
cut_base = cut_text(text_base)

100%|████████████████████████████████████████████████████████████████████████████████| 663/663 [00:16<00:00, 39.81it/s]


### Load the page paths (file names) and the page corpus
- A path stands in for the URL, so a search result only needs to return the path
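A file name can stand in for a URL if the crawler stored each page under its percent-encoded address. That naming scheme is an assumption here, inferred from the `urllib.parse.unquote` calls further down; the crawler code itself is not shown. A minimal round-trip sketch:

```python
from urllib.parse import quote, unquote

url = "http://www.cmiw.cn/thread-240107-1-1.html"

# Encode every reserved character so the URL is usable as a single file name.
# (Assumed scheme; the crawler may encode differently.)
file_name = quote(url, safe="")

# Decoding the file name recovers the original URL.
restored = unquote(file_name)

print(file_name)
print(restored)
```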

In [106]:
from tqdm import tqdm
import os

def load_webpage_corpus(folder):
    files = os.listdir(folder)
    corpus = []
    for name in tqdm(files):
        with open(os.path.join(folder, name), encoding = 'utf-8') as cut_file:
            corpus.append(cut_file.read())
    return (files, corpus)

In [107]:
webpage_corpus = load_webpage_corpus(cut_base)

100%|██████████████████████████████████████████████████████████████████████████████| 663/663 [00:00<00:00, 2483.13it/s]


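### TfidfVectorizer turns the documents into TF-IDF vectors
- A TF-IDF value measures how important a word is to a given document
- This experiment has 663 documents and a vocabulary of 39223 words
- If a document does not contain a word, that word's TF-IDF value is 0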
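The following pure-Python sketch shows where these values come from. It follows the formula that `TfidfVectorizer` uses with its default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each document row), but it is an illustration on a made-up three-document corpus, not the library's code. `tfidf_rows` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """docs: list of token lists. Returns (vocabulary, rows) with rows[d][v] = TF-IDF weight."""
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    n = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for d in docs for w in set(d))
    # Smoothed inverse document frequency.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for d in docs:
        row = [0.0] * len(vocab)
        for w, count in Counter(d).items():
            row[vocab[w]] = count * idf[w]
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm for x in row] if norm else row)
    return vocab, rows


# Made-up, already tokenized documents.
docs = [["深圳", "机械", "师傅"], ["机械", "设计"], ["深圳", "招聘", "招聘"]]
vocab, rows = tfidf_rows(docs)

print(rows[1][vocab["深圳"]])  # document 1 does not contain 深圳
print(rows[0][vocab["师傅"]], rows[0][vocab["机械"]])
```

In document 0, 师傅 gets a larger weight than 机械 because it occurs in fewer documents, even though both appear once.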

In [112]:
len(webpage_corpus[1])

663

In [135]:
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize(corpus):
    vectorizer = TfidfVectorizer()
    tf_idf_vec = vectorizer.fit_transform(corpus)
    print(tf_idf_vec.shape)
    return (vectorizer.vocabulary_, tf_idf_vec)

In [136]:
voca_vec = vectorize(webpage_corpus[1])
voca = voca_vec[0]
doc_word_vec = voca_vec[1]

(663, 39223)


In [137]:
doc_word_array = doc_word_vec.toarray()

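### The transpose is the inverted index of words
- If word v occurs in document r, then word_doc_vec[v][r] is non-zero
- The value word_doc_vec[v][r] measures the importance of the word in that document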
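A tiny example of why the transpose acts as an inverted index, using plain lists instead of the sparse matrix. The 3x3 weights are made up, and `postings` is a hypothetical name for the per-word list of document ids.

```python
# Rows are documents, columns are words (made-up weights).
doc_word = [
    [0.0, 0.6, 0.8],  # document 0
    [0.5, 0.0, 0.0],  # document 1
    [0.3, 0.9, 0.0],  # document 2
]

# Transposing gives one row per word, one column per document.
word_doc = [list(column) for column in zip(*doc_word)]

# For each word, the documents with a non-zero weight are its postings list.
postings = [[doc for doc, weight in enumerate(row) if weight != 0] for row in word_doc]

print(word_doc)
print(postings)
```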

In [138]:
word_doc_vec = doc_word_vec.transpose()
word_doc_vec.shape

(39223, 663)

In [139]:
word_doc_array = word_doc_vec.toarray()

In [143]:
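def get_word_ids(sentence, vocabulary):
    # vocabulary.get returns None for words outside the vocabulary; skip those.
    ids = (vocabulary.get(c) for c in cut(sentence).split())
    return [word_id for word_id in ids if word_id is not None]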

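### Look up each word's row of per-document TF-IDF values by word id; np.where(array) then returns the indices of the non-zero entries, i.e. the documents that contain the word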

In [171]:
import numpy as np
from operator import and_
from functools import reduce

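def find_docs_by_sentence(sentence, vocabulary, word_doc_vector):
    if not sentence:
        return None
    
    word_ids = [vocabulary.get(c) for c in cut(sentence).split()]
    if None in word_ids:
        # A query word outside the vocabulary is not indexed for any document,
        # so no document can match every query word.
        return set()
    
    # For each word, collect the indices of documents with a non-zero TF-IDF value.
    found_doc_indexes = [set(np.where(word_doc_vector[word_id])[0]) for word_id in word_ids]
    if not found_doc_indexes:
        return None
    
    # Keep only the documents that contain all query words.
    common_doc_ids = reduce(and_, found_doc_indexes)
    return common_doc_ids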

In [175]:
find_docs_by_sentence("深圳机械师傅", voca, word_doc_array)

{123, 635}
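The lookup above is an AND query: a document is returned only if it appears in the postings set of every query word. The sketch below isolates that step with plain sets. The postings are made up, `docs_with_all` is a hypothetical helper, and returning an empty set for an unknown word is one possible design choice rather than the only reasonable one.

```python
from functools import reduce
from operator import and_

# Made-up postings: word -> set of document ids containing it.
postings = {
    "深圳": {0, 2, 5},
    "机械": {0, 1, 5},
    "师傅": {0, 5, 7},
}


def docs_with_all(words, postings):
    """Return the ids of documents that contain every word in `words`."""
    sets = []
    for word in words:
        if word not in postings:
            # An unknown word matches nothing, so the intersection is empty.
            return set()
        sets.append(postings[word])
    return reduce(and_, sets) if sets else set()


print(docs_with_all(["深圳", "机械", "师傅"], postings))
print(docs_with_all(["深圳", "焊接"], postings))
```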

In [204]:
import urllib.parse

# Checked by hand: 深圳, 机械 and 师傅 all appear on the following pages
print(urllib.parse.unquote(webpage_corpus[0][123]))
print(urllib.parse.unquote(webpage_corpus[0][635]))

http://www.cmiw.cn/thread-240107-1-1.html
http://www.cmiw.cn/thread-967758-1-1.html


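### Search results are ordered here by the sum of the keywords' TF-IDF values in each document
- Suppose the search keywords are w1 and w2
- Documents d1 and d2 contain both w1 and w2, so both are search results
- Sum the TF-IDF values of w1 and w2 in each result document to get sum1 and sum2; a larger sum means a more relevant document
- By this measure, document 635 should rank ahead of document 123 in this experiment, because its sum_Tfidf is larger
- The experiment does not use the cosine distance from the lecture, because in principle the keyword vector should have little to do with the document vector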
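The steps in the list can be sketched on toy data as follows. The weights and document names are made up, and `rank_by_summed_weight` is a hypothetical helper that mirrors the idea, not the notebook's own function.

```python
def rank_by_summed_weight(query_word_ids, candidate_docs, doc_word):
    """Order candidate documents by the sum of their weights for the query words."""

    def score(doc_id):
        return sum(doc_word[doc_id][word_id] for word_id in query_word_ids)

    return sorted(candidate_docs, key=score, reverse=True)


# Made-up TF-IDF rows: doc_word[doc][word_id].
doc_word = {
    "d1": [0.010, 0.005, 0.000, 0.300],
    "d2": [0.050, 0.040, 0.020, 0.000],
}

# The query consists of words 0, 1 and 2; word 3 is not part of it.
ranking = rank_by_summed_weight([0, 1, 2], ["d1", "d2"], doc_word)
print(ranking)
```

Here d2 comes first: its query words sum to 0.11 against 0.015 for d1, and the large weight d1 has for the unrelated word 3 plays no role.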

In [201]:
import jieba

def sum_Tfidf(sentence, Tfidf_doc, vocabulary = voca):
    cut_word_ids = get_word_ids(sentence, vocabulary)
    # Words outside the vocabulary (id None) contribute nothing to the sum.
    return sum(Tfidf_doc[word_id] for word_id in cut_word_ids if word_id is not None)

In [205]:
print(sum_Tfidf("深圳机械师傅", doc_word_array[123]))
print(sum_Tfidf("深圳机械师傅", doc_word_array[635]))

0.015197336243449423
0.11507366592701099


### Rank the documents by the summed TF-IDF weight of the query words

In [213]:
def search_webpage_STI(keywords, vocabulary = voca, doc_word_ary = doc_word_array):
    if not keywords:
        return None
    
    found_doc_ids = find_docs_by_sentence(keywords, vocabulary, doc_word_ary.transpose())
    if not found_doc_ids:
        return None
    
    # Sort the matching documents by the summed TF-IDF of the query words, highest first.
    sum_TI_sorted_ids = sorted(found_doc_ids, key = lambda x: sum_Tfidf(keywords, doc_word_ary[x], vocabulary), reverse = True)

    return sum_TI_sorted_ids

In [230]:
sorted_results = search_webpage_STI("深圳机械师傅")

In [226]:
import urllib.parse

def show_webpage(doc_index, webpages):
    # Decode the stored name back into a readable URL before printing.
    print(urllib.parse.unquote(webpages[doc_index]))
    return webpages[doc_index]

In [233]:
url_results = [show_webpage(id_, webpage_corpus[0]) for id_ in sorted_results]

http://www.cmiw.cn/thread-967758-1-1.html
http://www.cmiw.cn/thread-240107-1-1.html


## PageRank
### Reference https://blog.csdn.net/hguisu/article/details/7996185

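### Compute PageRank on the graph of links between pages
- linked_url.json is the page-link data structure (a dict) built while crawling
- Unlike in the lecture, PageRank should run on a directed graph, so nx.DiGraph() is used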
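To show why direction matters, here is a small power-iteration sketch of PageRank in plain Python. It is an illustration of the idea, not NetworkX's implementation. It assumes the link data maps each URL to a list of the URLs it links to (the real layout of linked_url.json is not shown in this notebook), and the four-page graph and the `pagerank` helper are made up.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """links: dict mapping a page to the list of pages it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(max_iter):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        # Each page passes its rank along its outgoing links only.
        for u, targets in links.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Made-up directed link graph: a -> b, a -> c, b -> c, c -> a, d -> c.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Page c is linked from three pages and ends up on top, while d links out but receives no links and ends up last. In an undirected graph the link d -> c would also count in favour of d, which is not what PageRank intends.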

In [206]:
import json
def get_linked_urls(filename):
    with open(filename, encoding= 'utf-8') as json_file:
        return json.load(json_file)

In [110]:
linked_urls = get_linked_urls("linked_url.json")

In [208]:
import networkx as nx
%matplotlib inline

webpage_network = nx.DiGraph(linked_urls)
ranked_webpages = nx.pagerank(webpage_network)

In [235]:
def search_webpage_with_sort(keywords, sort_strategy, vocabulary = voca, doc_word_ary = doc_word_array):
    if not keywords:
        return None
    
    found_doc_ids = find_docs_by_sentence(keywords, vocabulary, doc_word_ary.transpose())
    if not found_doc_ids:
        return None

    # sort_strategy maps a document index to a score; higher scores come first.
    sorted_ids = sorted(found_doc_ids, key = sort_strategy, reverse = True)
    return sorted_ids

In [239]:
def get_page_rank_value(doc_index):
    url = urllib.parse.unquote(webpage_corpus[0][doc_index])
    return ranked_webpages[url]

In [241]:
print(get_page_rank_value(123))
print(get_page_rank_value(635))

3.054620590299883e-05
2.8554413385059092e-05


In [242]:
pagerank_results = search_webpage_with_sort("深圳机械师傅", get_page_rank_value)
pagerank_url_results = [show_webpage(id_, webpage_corpus[0]) for id_ in pagerank_results]

http://www.cmiw.cn/thread-240107-1-1.html
http://www.cmiw.cn/thread-967758-1-1.html


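### The PageRank ordering differs from the summed TF-IDF ordering; combining the two scores into a single ranking is worth considering
- Due to time constraints, this is not pursued further here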
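One way such a combination could look, as a rough sketch only: rescale both scores to the range 0 to 1 across the result set, then take a weighted sum. The weight `alpha`, the min-max scaling and the helper names are all assumptions made for illustration; none of this was tried in the experiment. The scores are the ones printed above for documents 123 and 635.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 0.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


def combined_ranking(tfidf_scores, pagerank_scores, alpha=0.7):
    """Rank by alpha * TF-IDF + (1 - alpha) * PageRank after rescaling (alpha is hypothetical)."""
    t = min_max(tfidf_scores)
    p = min_max(pagerank_scores)
    combined = {doc: alpha * t[doc] + (1 - alpha) * p[doc] for doc in tfidf_scores}
    return sorted(combined, key=combined.get, reverse=True)


# Scores taken from the outputs above.
tfidf_scores = {123: 0.015197336243449423, 635: 0.11507366592701099}
pagerank_scores = {123: 3.054620590299883e-05, 635: 2.8554413385059092e-05}

print(combined_ranking(tfidf_scores, pagerank_scores, alpha=0.7))
print(combined_ranking(tfidf_scores, pagerank_scores, alpha=0.3))
```

With only two results the rescaled scores are always 0 and 1, so the order simply follows whichever signal has the larger weight. A useful value for `alpha` would have to be tuned on larger result sets.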