## Lecture 06 Keywords Extraction & Search Engine

Outline:
* Finding similar words using Word2Vec;
* TF-IDF Keyword Extraction;
* Word cloud;
* Build a search engine
     + input: words;
     + output: matched documents;
     + two ranking methods: 1) ranking by TF-IDF; 2) PageRank

In [1]:
## Note for re-running this code:
## you can skip all of these initial set-up steps and start from the cell that defines
## line_setences_path = '/Users/xinweixu/Dropbox/learn/Comp_Prog/nlp/data/sentences-cut.txt'

csv_path = '/Users/xinweixu/Dropbox/learn/Comp_Prog/nlp/data/sqlResult_1558435.csv'

In [2]:
import pandas as pd

In [3]:
content = pd.read_csv(csv_path, encoding='gb18030')

In [5]:
content = content.fillna('')  # replace missing values (NaN) with empty strings

In [7]:
news_content = content['content'].tolist()

In [8]:
import jieba

def cut(string): return ' '.join(jieba.cut(string))

In [9]:
cut('这是一个测试')

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/1y/1btp7xpj7b1f82lnwvn2916h0000gn/T/jieba.cache
Loading model cost 0.845 seconds.
Prefix dict has been built succesfully.


'这是 一个 测试'

In [10]:
import re

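def find_tokens(string):
    "Find all valid tokens: maximal runs of word characters (letters, digits, CJK characters)."
    # \w already covers digits; a '|' inside [...] would match a literal pipe character
    return re.findall(r'\w+', string)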

In [11]:
find_tokens('这是一个测试\n\n\n')

['这是一个测试']
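
As a quick check of what this kind of tokenizing regex keeps, the sketch below runs `re.findall` on a few made-up strings. In Python 3, `\w` matches Unicode word characters (Chinese characters, Latin letters, digits, and the underscore), so punctuation and whitespace act as separators. The helper name `split_on_non_word` and the sample strings are illustrative assumptions, not part of the original notebook.

```python
import re

def split_on_non_word(text):
    """Return maximal runs of word characters (hypothetical helper for illustration)."""
    return re.findall(r'\w+', text)

# Made-up sample strings: trailing newlines, full-width punctuation, mixed Latin text
samples = ['这是一个测试\n\n\n', '此外，自本周（6月12日）起', 'Windows 10, ARM']
for s in samples:
    print(split_on_non_word(s))
```

Note that the regex only strips separators; it does not segment Chinese text into words, which is why `jieba` is still applied afterwards.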

In [12]:
news_content = [find_tokens(n) for n in news_content]

In [13]:
news_content = [' '.join(n) for n in news_content]

In [15]:
news_content[:10]

['此外 自本周 6月12日 起 除小米手机6等15款机型外 其余机型已暂停更新发布 含开发版 体验版内测 稳定版暂不受影响 以确保工程师可以集中全部精力进行系统优化工作 有人猜测这也是将精力主要用到MIUI 9的研发之中 MIUI 8去年5月发布 距今已有一年有余 也是时候更新换代了 当然 关于MIUI 9的确切信息 我们还是等待官方消息',
 '骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器 高通强调 不会因为只考虑性能而去屏蔽掉小核心 相反 他们正联手微软 找到一种适合桌面平台的 兼顾性能和功耗的完美方案 报道称 微软已经拿到了一些新的源码 以便Windows 10更好地理解big little架构 资料显示 骁龙835作为一款集成了CPU GPU 基带 蓝牙 Wi Fi的SoC 比传统的Wintel方案可以节省至少30 的PCB空间 按计划 今年Q4 华硕 惠普 联想将首发骁龙835 Win10电脑 预计均是二合一形态的产品 当然 高通骁龙只是个开始 未来也许还能见到三星Exynos 联发科 华为麒麟 小米澎湃等进入Windows 10桌面平台',
 '此前的一加3T搭载的是3400mAh电池 DashCharge快充规格为5V 4A 至于电池缩水 可能与刘作虎所说 一加手机5要做市面最轻薄大屏旗舰的设定有关 按照目前掌握的资料 一加手机5拥有5 5寸1080P三星AMOLED显示屏 6G 8GB RAM 64GB 128GB ROM 双1600万摄像头 备货量 惊喜 根据京东泄露的信息 一加5起售价是xx99元 应该是在2799 2899 2999中的某个',
 '这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车 新华社记者张立云摄',
 '原标题 44岁女子跑深圳约会网友被拒 暴雨中裸身奔走 深圳交警微博称 昨日清晨交警发现有一女子赤裸上身 行走在南坪快速上 期间还起了轻生年头 一辅警发现后赶紧为其披上黄衣 并一路劝说她 那么事发时 到底都发生了些什么呢 南都记者带您一起还原现场 南都记者在龙岗大队坂田中队见到了辅警刘青 发现女生的辅警 一位外表高大帅气 说话略带些腼腆的90后青年 刘青介绍 6月16日早上7时36分 他正在环城南路附近值勤 接到中队关于一位女子裸身进入机动车可能有危险的警情 随后骑着小铁骑开始沿路

In [25]:
news_content = [cut(n) for n in news_content]

In [26]:
news_content[1]

'骁龙 835 作为 唯一 通过 Windows   10 桌面 平台 认证 的 ARM 处理器   高通 强调   不会 因为 只 考虑 性能 而 去 屏蔽掉 小 核心   相反   他们 正 联手 微软   找到 一种 适合 桌面 平台 的   兼顾 性能 和 功耗 的 完美 方案   报道 称   微软 已经 拿到 了 一些 新 的 源码   以便 Windows   10 更好 地 理解 big   little 架构   资料 显示   骁龙 835 作为 一款 集成 了 CPU   GPU   基带   蓝牙   Wi   Fi 的 SoC   比 传统 的 Wintel 方案 可以 节省 至少 30   的 PCB 空间   按计划   今年 Q4   华硕   惠普   联想 将 首发 骁龙 835   Win10 电脑   预计 均 是 二合一 形态 的 产品   当然   高通 骁龙 只是 个 开始   未来 也许 还 能 见到 三星 Exynos   联发科   华为 麒麟   小米 澎湃 等 进入 Windows   10 桌面 平台'

### Get similar words using word2vec model

In [None]:
with open('sentences-cut.txt', 'w', encoding='utf-8') as f:
    for n in news_content:  # one pre-segmented document per line
        f.write(n + '\n')
        

In [2]:
line_setences_path = '/Users/xinweixu/Dropbox/learn/Comp_Prog/nlp/data/sentences-cut.txt'

In [3]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

In [29]:
news_word2vec = Word2Vec(LineSentence(line_setences_path), size=35, workers=8)

# Parameters:
# --- size: the dimensionality of the word vectors (the size of the hidden layer),
#     which corresponds to the "degrees of freedom" the training algorithm has.
#     A larger size requires more training data.
#     The default value for size is 100.
#     Note: in gensim >= 4.0 this parameter is named vector_size.
# --- workers: the number of worker threads used for training.
#  For further clarification, see:
#  https://stackoverflow.com/questions/45444964/python-what-is-the-size-parameter-in-gensim-word2vec-model-class
#  https://rare-technologies.com/word2vec-tutorial/ 

In [35]:
news_word2vec.wv.most_similar('葡萄牙', topn=20)

[('意大利', 0.8864297866821289),
 ('摩洛哥', 0.8568353056907654),
 ('捷克', 0.8296809792518616),
 ('比利时', 0.8292907476425171),
 ('乌拉圭', 0.8234138488769531),
 ('奥地利', 0.8224449157714844),
 ('巴塞罗那', 0.8199140429496765),
 ('克罗地亚', 0.8178519010543823),
 ('拉脱维亚', 0.814125657081604),
 ('斯洛文尼亚', 0.8134593367576599),
 ('里斯本', 0.8040810227394104),
 ('丹麦', 0.798108696937561),
 ('南非', 0.792057454586029),
 ('苏格兰', 0.7903223633766174),
 ('西班牙', 0.7767381072044373),
 ('马德里', 0.7761043906211853),
 ('瑞士', 0.77527916431427),
 ('伊斯坦布尔', 0.7747988700866699),
 ('巴拉圭', 0.7737539410591125),
 ('威尔士', 0.7732177972793579)]

In [36]:
news_word2vec.wv.most_similar('意大利', topn=20)

[('葡萄牙', 0.8864297866821289),
 ('西班牙', 0.857528567314148),
 ('比利时', 0.8561197519302368),
 ('瑞士', 0.85273277759552),
 ('捷克', 0.8408275246620178),
 ('摩洛哥', 0.837763786315918),
 ('南非', 0.8138989806175232),
 ('德国', 0.8035558462142944),
 ('奥地利', 0.7948752045631409),
 ('马德里', 0.7915687561035156),
 ('苏格兰', 0.780310869216919),
 ('法国', 0.773610532283783),
 ('澳大利亚', 0.7686328887939453),
 ('拉脱维亚', 0.7664074897766113),
 ('保加利亚', 0.7656012773513794),
 ('加拿大', 0.7587913274765015),
 ('柏林', 0.7525461912155151),
 ('俄罗斯', 0.7469874024391174),
 ('巴塞罗那', 0.7465837597846985),
 ('罗马尼亚', 0.7409399747848511)]

In [32]:
news_word2vec.wv.most_similar('捷克', topn=20)

[('罗马尼亚', 0.8822157382965088),
 ('拉脱维亚', 0.8692420125007629),
 ('丹麦', 0.8523069620132446),
 ('意大利', 0.840827465057373),
 ('葡萄牙', 0.8296809792518616),
 ('奥地利', 0.8258203864097595),
 ('斯洛文尼亚', 0.8257920742034912),
 ('比利时', 0.8113787174224854),
 ('巴林', 0.810397207736969),
 ('克罗地亚', 0.7818545699119568),
 ('立陶宛', 0.7805928587913513),
 ('乌克兰', 0.7758355736732483),
 ('南非', 0.7729107737541199),
 ('基辅', 0.7706931829452515),
 ('黎巴嫩', 0.7637196779251099),
 ('斯洛伐克', 0.7608702778816223),
 ('爱沙尼亚', 0.760601282119751),
 ('塞内加尔', 0.7586889863014221),
 ('格鲁吉亚', 0.7585309743881226),
 ('中国香港', 0.7581397891044617)]

In [39]:
news_word2vec.wv.most_similar('罗马尼亚', topn=30)

[('捷克', 0.882215678691864),
 ('丹麦', 0.8520300984382629),
 ('拉脱维亚', 0.8184226751327515),
 ('克罗地亚', 0.8179722428321838),
 ('奥地利', 0.8108478784561157),
 ('斯洛文尼亚', 0.8080112338066101),
 ('乌克兰', 0.8075355887413025),
 ('比利时', 0.8007112741470337),
 ('南非', 0.7932149171829224),
 ('纳米比亚', 0.7860801815986633),
 ('中国香港', 0.7854606509208679),
 ('马耳他', 0.7831732630729675),
 ('黎巴嫩', 0.7809414267539978),
 ('爱沙尼亚', 0.7779865264892578),
 ('葡萄牙', 0.7722561955451965),
 ('吉尔吉斯斯坦', 0.7693463563919067),
 ('白俄罗斯', 0.7632394433021545),
 ('立陶宛', 0.7571207880973816),
 ('波兰', 0.7551602721214294),
 ('巴林', 0.7545751333236694),
 ('马来西亚', 0.7530940771102905),
 ('赞比亚', 0.7514887452125549),
 ('乌兹别克斯坦', 0.7475714683532715),
 ('津巴布韦', 0.7469395995140076),
 ('阿根廷', 0.7454819679260254),
 ('意大利', 0.7409399747848511),
 ('匈牙利', 0.737257182598114),
 ('保加利亚', 0.7372565865516663),
 ('埃塞俄比亚', 0.7371681928634644),
 ('毛里求斯', 0.7367457151412964)]

In [40]:
news_word2vec.wv.most_similar('说', topn=30)

[('表示', 0.8955628871917725),
 ('指出', 0.8502642512321472),
 ('认为', 0.8479128479957581),
 ('告诉', 0.812511682510376),
 ('看来', 0.8101485371589661),
 ('坦言', 0.7899549603462219),
 ('介绍', 0.7614747285842896),
 ('称', 0.7429639101028442),
 ('透露', 0.7301914691925049),
 ('强调', 0.7105162143707275),
 ('特别强调', 0.7053742408752441),
 ('明说', 0.6989708542823792),
 ('所说', 0.6948589086532593),
 ('中说', 0.6791709065437317),
 ('时说', 0.6265304088592529),
 ('文说', 0.6251811981201172),
 ('提到', 0.6126987338066101),
 ('嚷嚷', 0.5899860262870789),
 ('称赞', 0.5895407199859619),
 ('道', 0.5849297642707825),
 ('相信', 0.5809491872787476),
 ('地说', 0.5805396437644958),
 ('问', 0.5783715844154358),
 ('的话', 0.5772743225097656),
 ('写道', 0.573288083076477),
 ('所指', 0.5670838952064514),
 ('陈说', 0.561398983001709),
 ('建议', 0.5582199692726135),
 ('一家之言', 0.5528523921966553),
 ('普遍认为', 0.5519106388092041)]

In [41]:
news_word2vec.wv.most_similar('认为', topn=30)

[('指出', 0.9022014737129211),
 ('表示', 0.8893882036209106),
 ('说', 0.8479128479957581),
 ('看来', 0.8315796852111816),
 ('普遍认为', 0.8058453798294067),
 ('坦言', 0.7952060103416443),
 ('称', 0.7836077809333801),
 ('透露', 0.7590065598487854),
 ('建议', 0.7248430848121643),
 ('告诉', 0.7241552472114563),
 ('强调', 0.7051271796226501),
 ('所说', 0.6783744692802429),
 ('相信', 0.6777569651603699),
 ('介绍', 0.6732577681541443),
 ('特别强调', 0.6587983965873718),
 ('对此', 0.658518373966217),
 ('看法', 0.6486562490463257),
 ('而言', 0.6473656892776489),
 ('中称', 0.6216552257537842),
 ('表明', 0.6216184496879578),
 ('看好', 0.6183799505233765),
 ('现阶段', 0.6123632788658142),
 ('猜测', 0.6115888357162476),
 ('说明', 0.6077918410301208),
 ('观点', 0.6037999391555786),
 ('呼吁', 0.6030945181846619),
 ('资深', 0.6025510430335999),
 ('事实上', 0.6024030447006226),
 ('一家之言', 0.6022484302520752),
 ('嚷嚷', 0.6000785827636719)]

In [43]:
news_word2vec.wv.most_similar('表明', topn=30)

[('说明', 0.7730811238288879),
 ('事实上', 0.7121999859809875),
 ('现阶段', 0.7006252408027649),
 ('结论', 0.6929256319999695),
 ('依赖', 0.6705552935600281),
 ('普遍认为', 0.6523441076278687),
 ('可能性', 0.6512296199798584),
 ('预料', 0.6495680212974548),
 ('判断', 0.6482747197151184),
 ('揭示', 0.6474969983100891),
 ('指向', 0.6359158754348755),
 ('迹象', 0.6356276273727417),
 ('动向', 0.635370135307312),
 ('变化', 0.6328083276748657),
 ('迄今为止', 0.6326295137405396),
 ('与此同时', 0.63169264793396),
 ('释放', 0.624531626701355),
 ('现状', 0.6239757537841797),
 ('反映', 0.6232850551605225),
 ('毫无疑问', 0.6231241226196289),
 ('而言', 0.6229085922241211),
 ('秘密', 0.6226353049278259),
 ('取决于', 0.6221134662628174),
 ('认为', 0.6216184496879578),
 ('孤立', 0.6190369129180908),
 ('做法', 0.618963897228241),
 ('敏感', 0.6176596283912659),
 ('看法', 0.6175332069396973),
 ('对', 0.6163378357887268),
 ('并非', 0.6162695288658142)]
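
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal pure-Python sketch of that measure follows. The three-dimensional vectors are toy values invented for illustration, not vectors from the trained model, and `most_similar_toy` is a hypothetical stand-in for the gensim method.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (made up for illustration only)
toy_vectors = {
    'a': [1.0, 0.0, 0.0],
    'b': [1.0, 1.0, 0.0],
    'c': [0.0, 0.0, 1.0],
}

def most_similar_toy(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scored = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items() if other != word
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:topn]

print(most_similar_toy('a', toy_vectors))
```

Here `'b'` ranks above `'c'` because it points in a direction closer to `'a'`; vector length plays no role in the score.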

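Notes: 
* The more training data, the better the similar-word results (for example, adding Wikipedia data).
* Finding related words resembles expanding a search tree:
      + 葡萄牙 (Portugal) --> 意大利 (Italy), 摩洛哥 (Morocco), 捷克 (Czech Republic), ...
      + 意大利 (Italy) --> 葡萄牙 (Portugal), 西班牙 (Spain), 比利时 (Belgium), 捷克 (Czech Republic), ...
      + 捷克 (Czech Republic) --> 罗马尼亚 (Romania), 拉脱维亚 (Latvia), 丹麦 (Denmark), ...
      + In this search tree, the more often a word appears, the closer its meaning is likely to be to the original word.


Following the search-tree algorithm we used earlier for map search, a first version of the function looks like this: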

In [44]:
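def get_related_words(words, model):
    """
    @words: list of initial words whose related words we want to search for
    @model: the trained word2vec model
    """
    unseen = list(words)  # queue of words still to expand (breadth-first)
    
    seen = set()
    
    while unseen:
        node = unseen.pop(0)
        if node in seen:
            continue  # already expanded; without this check words are re-expanded forever
        new_branches = [w for w, s in model.wv.most_similar(node, topn=30)]
        unseen += new_branches
        
        seen.add(node)
    
    return seen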
 

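Unlike map search, however, here we want to know how many times each word is reached, so that we can rank all candidate words and pick out the most frequent ones. To do that, we change the `seen` variable of the original program from a set to a counter: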

In [55]:
from collections import defaultdict

def get_related_words(words, model):
    """
    @words: list of initial words whose related words we want to search for
    @model: the trained word2vec model
    Returns a dict mapping each visited word to the number of times it was reached.
    """
    unseen = list(words)  # copy, so the caller's list is not modified
    
    seen = defaultdict(int)
    
    max_size = 1000  # cap the number of distinct words to limit the search space
    
    while unseen and len(seen) < max_size:
        if len(seen) % 50 == 0:
            print('seen length: {}'.format(len(seen)))
            
        node = unseen.pop(0)
        new_branches = [w for w, s in model.wv.most_similar(node, topn=30)]
        unseen += new_branches
        
        seen[node] += 1  # count every visit, including repeated ones
    
    return seen
    

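Points that could still be optimized:
* The `seen[node] += 1` counting scheme could be refined to adjust the weight given to each word.
* Borrow the idea of dynamic programming: store the results of branches that are computed repeatedly, to reduce the total running time.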
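
The two optimization ideas above could be combined roughly as follows. This is only a sketch under stated assumptions: the neighbor table `TOY_NEIGHBORS` is made up and stands in for `model.wv.most_similar`; weighting each visit by the similarity score is one possible choice, not the notebook's method; and `functools.lru_cache` plays the role of the stored branch results. The names `neighbors` and `get_related_words_weighted` are hypothetical.

```python
from collections import defaultdict, deque
from functools import lru_cache

# Toy neighbor table standing in for model.wv.most_similar (made-up words and scores)
TOY_NEIGHBORS = {
    'say':     [('state', 0.9), ('think', 0.8)],
    'state':   [('say', 0.9), ('think', 0.7)],
    'think':   [('say', 0.8), ('believe', 0.6)],
    'believe': [('think', 0.6)],
}

@lru_cache(maxsize=None)
def neighbors(word):
    """Cached neighbor lookup: each word's neighbors are computed only once."""
    return tuple(TOY_NEIGHBORS.get(word, ()))

def get_related_words_weighted(seeds, max_size=1000, max_steps=50):
    """Breadth-first expansion that accumulates similarity scores instead of raw counts."""
    # Each queue entry is (word, similarity with which it was reached); seeds count as 1.0
    unseen = deque((w, 1.0) for w in seeds)
    scores = defaultdict(float)
    steps = 0
    while unseen and len(scores) < max_size and steps < max_steps:
        node, weight = unseen.popleft()
        scores[node] += weight          # weighted replacement for seen[node] += 1
        unseen.extend(neighbors(node))  # repeated lookups hit the cache
        steps += 1
    return scores

scores = get_related_words_weighted(['say'])
for word, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print(word, round(score, 2))
```

With a real model, `neighbors` would wrap the `most_similar` call, so a word reached many times triggers only one similarity search. The `max_steps` cap is an extra safeguard added here because the toy graph is too small to ever reach `max_size`.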

In [46]:
len(news_word2vec.wv.vocab)

97927

In [56]:
related_words = get_related_words(['说', '建议'], news_word2vec)

# Note: this is essentially a weakly supervised learning approach

seen length: 0
seen length: 50




seen length: 100
seen length: 100
seen length: 100
seen length: 150
seen length: 200
seen length: 200
seen length: 200
seen length: 250
seen length: 300
seen length: 350
seen length: 350
seen length: 400
seen length: 450
seen length: 450
seen length: 450
seen length: 450
seen length: 450
seen length: 450
seen length: 500
seen length: 550
seen length: 600
seen length: 600
seen length: 650
seen length: 700
seen length: 750
seen length: 750
seen length: 750
seen length: 750
seen length: 750
seen length: 750
seen length: 800
seen length: 850
seen length: 850
seen length: 850
seen length: 900
seen length: 950


In [57]:
# now we want to rank all the related words by frequency

sorted(related_words.items(), key=lambda x: x[1], reverse=True)

[('所说', 24),
 ('说', 20),
 ('表示', 20),
 ('指出', 20),
 ('认为', 20),
 ('坦言', 18),
 ('建议', 17),
 ('看来', 16),
 ('透露', 16),
 ('特别强调', 16),
 ('告诉', 14),
 ('称', 14),
 ('看法', 14),
 ('强调', 13),
 ('提到', 13),
 ('介绍', 11),
 ('嚷嚷', 11),
 ('相信', 11),
 ('普遍认为', 11),
 ('呼吁', 11),
 ('中说', 10),
 ('对此', 10),
 ('说明', 10),
 ('中称', 10),
 ('明说', 8),
 ('文说', 8),
 ('要求', 8),
 ('观点', 8),
 ('回答', 8),
 ('时说', 7),
 ('写道', 7),
 ('所指', 7),
 ('一家之言', 7),
 ('相应', 7),
 ('提及', 7),
 ('资深', 7),
 ('解释', 6),
 ('适当', 6),
 ('表明', 6),
 ('现阶段', 6),
 ('问道', 6),
 ('问', 5),
 ('补充', 5),
 ('还应', 5),
 ('可行性', 5),
 ('必要性', 5),
 ('为此', 5),
 ('觉得', 5),
 ('说道', 5),
 ('主张', 5),
 ('做法', 5),
 ('原话', 5),
 ('质疑', 5),
 ('判断', 5),
 ('称赞', 4),
 ('地说', 4),
 ('明确要求', 4),
 ('接受', 4),
 ('重申', 4),
 ('聊起', 4),
 ('眼中', 4),
 ('直言', 4),
 ('说法', 4),
 ('声称', 4),
 ('抨击', 4),
 ('立场', 4),
 ('明确提出', 4),
 ('引用', 4),
 ('谈论', 4),
 ('请问', 4),
 ('理由', 4),
 ('管理工作', 4),
 ('考量', 4),
 ('合理', 4),
 ('的话', 3),
 ('陈说', 3),
 ('评价', 3),
 ('应', 3),
 ('必要', 3),
 ('是否', 3),
 ('咨询

### Keyword Extraction
+ calculating TF-IDF;
+ TF-IDF vectorized;
+ visualization - word cloud

Suppose we have a collection of $N$ documents. For a given document $d$, the number of times term $t$ appears in the document is $tf_{t,d}$ (*term frequency*), and the number of documents in the collection that contain $t$ is $df_t$ (*document frequency*). We then define the following:

the ***inverse document frequency*** is given by:
$$idf_t = \log \frac{N}{df_t}$$

and therefore ***TF-IDF*** is given by:
$$\text{tf-idf}_{t,d} = tf_{t,d} \times idf_t$$
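
To make the definitions concrete, here is a small worked example with made-up numbers (the values of `N`, `df_t`, and `tf_td` are assumptions chosen so the arithmetic is easy to follow). It uses a base-10 logarithm; a different base would only rescale all scores by a constant factor.

```python
import math

N = 1000      # number of documents in a hypothetical collection
df_t = 10     # number of documents containing the term
tf_td = 3     # occurrences of the term in one particular document

idf_t = math.log10(N / df_t)   # log10(100) = 2
tfidf_td = tf_td * idf_t       # 3 * 2 = 6
print(idf_t, tfidf_td)

# A term that occurs in every document carries no discriminating information:
idf_everywhere = math.log10(N / N)   # log10(1) = 0
print(idf_everywhere)
```

A rare term that occurs several times in a document therefore scores high, while a term found in every document scores zero regardless of how often it occurs.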

In [58]:
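def document_frequency(word, list_text): 
    "Returns the document frequency of a @word in a given collection of texts, @list_text"
    # Note: `word in text` is a substring test on each space-joined document string,
    # so a word is also counted when it occurs inside a longer token.
    return sum(1 for text in list_text if word in text)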

In [59]:
document_frequency('的', news_content)

70342

In [60]:
document_frequency('火星', news_content)

116

In [61]:
import math
def idf(word, list_text):
    """Get the inverse document frequency (base-10 log) of a @word in a collection of texts, @list_text"""
    return math.log10(len(list_text) / document_frequency(word, list_text))

In [62]:
idf('的', news_content)

0.1051466115514474

In [63]:
idf('火星', news_content)

2.887903334565555

In [64]:
idf('小米', news_content)

# the commonplace word,'的', should have a lower idf than more important words such as '小米' 

2.948039950009831

In [65]:
def tf(word, document):
    """
    Get the term frequency of a @word in a @document.
    """
    words = document.split()
    
    return sum(1 for w in words if w == word)

In [66]:
tf('银行', news_content[11])

6

In [67]:
tf('创业板', news_content[11])

6

In [68]:
tf('短期', news_content[11])

3

In [72]:
# now we can define a function that returns
# the words of one document ranked by their tf-idf values,
# given a collection of documents

def get_word_rank_in_a_document(index, document_collection):
    """
    @index = an integer specifying the index of the document in the collection;
    @document_collection = a list of documents to be searched
    """
    document = document_collection[index]
    words = set(document.split())
    
    tfidf = [
        (w, tf(w, document) * idf(w, document_collection)) for w in words
    ]
    
    tfidf = sorted(tfidf, key=lambda x: x[1], reverse=True)
    
    return tfidf
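
Computing the IDF separately for every word rescans the whole collection each time, which gets slow on a large corpus. One possible optimization, sketched below on a made-up toy corpus, is to count document frequencies once with `collections.Counter` and reuse them. The names `build_document_frequency` and `rank_words` are hypothetical, and this version counts exact tokens (after `split()`) rather than substrings, so its numbers can differ from a substring-based count.

```python
import math
from collections import Counter

def build_document_frequency(documents):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))   # set(): count each token once per document
    return df

def rank_words(index, documents, df):
    """Return (word, tf-idf) pairs for one document, highest score first."""
    counts = Counter(documents[index].split())   # term frequencies of this document
    n_docs = len(documents)
    scored = [(w, tf * math.log10(n_docs / df[w])) for w, tf in counts.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy corpus of space-separated tokens (made up for illustration)
toy_docs = [
    'market index rebound market market',
    'market bank',
    'mars rover',
    'bank loan',
]
df = build_document_frequency(toy_docs)
print(rank_words(0, toy_docs, df))
```

The document-frequency table is built in a single pass over the corpus, after which ranking any document only needs dictionary lookups.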

In [73]:
get_word_rank_in_a_document(11, news_content)

[('市场', 21.353584391728972),
 ('股指', 18.198034968575843),
 ('周四', 17.26088617439966),
 ('均线', 15.505514875366993),
 ('板块', 15.184208429020511),
 ('创业板', 15.040542723113257),
 ('沪', 14.096891190311872),
 ('反弹', 11.40131732928378),
 ('巨丰', 11.244724023409647),
 ('普涨', 11.1657372072426),
 ('居前', 10.78835527475667),
 ('午后', 10.712813898115176),
 ('早盘', 10.614032989531069),
 ('大盘', 10.528860150725679),
 ('保险', 9.712428450401568),
 ('跳水', 9.392541082832015),
 ('具备', 9.384071811999714),
 ('局部性', 8.950480138145622),
 ('走势', 8.886316192504337),
 ('回落', 8.85749598983617),
 ('银行', 8.730101656649362),
 ('大涨', 8.113164172292002),
 ('涨幅', 7.982850978349081),
 ('阴线', 7.904722647584947),
 ('普跌', 7.676835942971274),
 ('半年线', 7.552540129473584),
 ('上影线', 7.552540129473584),
 ('题材', 7.529644664184074),
 ('个股', 7.317431171173604),
 ('伏击', 7.0747759516433115),
 ('探底', 7.010406584900509),
 ('行情', 6.978062869252294),
 ('兴业银行', 6.560526931713512),
 ('沪市', 6.52433048752792),
 ('复星', 6.472715960315349),
 ('白马股'

In [1]:
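# This run raised a NameError because the function was not defined in that session;
# run the cells above that define tf, idf, and get_word_rank_in_a_document first.
get_word_rank_in_a_document(101, news_content)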

NameError: name 'get_word_rank_in_a_document' is not defined

### TF-IDF Vectorized

### Word cloud