# Word Similarity Analysis of Las Vegas Review Text

### Blog post for this notebook: [link](XXX)

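When looking for a business, users typically search by keyword and expect results that are closely related to that keyword. This notebook implements the underlying step with Word2Vec: it learns word vectors from the review text and uses them to measure how similar two words are.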

In [1]:
import pandas as pd

# 1. Load the review text data

In [2]:
yelp_lv_rts = pd.read_csv('../../dataset/las_vegas/review/las_vegas_review_text_preprocessed_with_db_id.csv')

In [3]:
len(yelp_lv_rts)

1604044

In [4]:
yelp_lv_rts[:5]

Unnamed: 0,review_db_id,text_words
0,3,pizza make night good people great pizza anyth...
1,6,one absolute favorite restaurant usually go on...
2,8,know place star lifesaver stay mandalay bay lo...
3,20,nd time eat today st time great dont think hus...
4,22,regal locate village square super convenient p...


In [5]:
text_words = [words.split(' ') for words in yelp_lv_rts.text_words]
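
`pd.read_csv` reads an empty field as `NaN` (a float), and calling `.split` on it would raise `AttributeError`. The cell above ran without error, so this dataset presumably has no such rows, but a defensive variant could look like the sketch below. The helper name `tokenize_reviews` and the sample rows are hypothetical.

```python
import math

def tokenize_reviews(texts):
    """Split each review into tokens, skipping missing or blank entries."""
    tokenized = []
    for text in texts:
        # pandas represents a missing CSV field as float('nan')
        if text is None or (isinstance(text, float) and math.isnan(text)):
            continue
        # split() with no argument also collapses repeated whitespace
        tokens = text.split()
        if tokens:
            tokenized.append(tokens)
    return tokenized

# Hypothetical rows: one missing value, one double space, one empty string
sample_rows = ["pizza make night good", float("nan"), "regal  locate village", ""]
sample_tokens = tokenize_reviews(sample_rows)
```

Skipping rows breaks positional alignment with `review_db_id`. That does not matter for Word2Vec training, which only needs the token lists.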

# 2. Word2Vec modeling

For an explanation of the Gensim Word2Vec parameters, see the [official documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec).

## 2.1 Train and save the model

In [6]:
from gensim.models import word2vec

In [7]:
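# Set values for the model's parameters
sg = 0               # use CBOW (1 would select Skip-gram)
hs = 1               # use Hierarchical Softmax (the alternative is Negative Sampling)
                     # note: `negative` defaults to 5 in gensim, so pass negative=0 to use Hierarchical Softmax alone
num_iter = 10        # number of training iterations (epochs)
num_features = 300   # dimensionality of the word vectors
min_word_count = 40  # minimum word frequency; rarer words are dropped from the vocabulary
context = 10         # maximum distance between the current word and a context word
num_workers = 10     # number of worker threads
downsampling = 1e-3  # threshold for downsampling high-frequency words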
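
To make `min_word_count` and `downsampling` concrete, the sketch below applies both rules to a toy corpus. The keep-probability formula is the one from the original word2vec C implementation, which gensim is understood to follow. The function names and toy sentences are hypothetical, and this is an illustration rather than gensim's actual code.

```python
import math
from collections import Counter

def prune_vocab(sentences, min_count):
    """Keep only words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: c for word, c in counts.items() if c >= min_count}

def keep_probability(word_count, total_count, sample):
    """Probability of keeping one occurrence of a word under subsampling.

    Assumed formula (word2vec C code): (sqrt(f / t) + 1) * t / f,
    where f is the word's relative frequency and t is the threshold.
    """
    freq = word_count / total_count
    prob = (math.sqrt(freq / sample) + 1) * sample / freq
    return min(prob, 1.0)

toy_sentences = [["good", "pizza"], ["good", "bbq"], ["good", "ribs", "bbq"]]
vocab = prune_vocab(toy_sentences, min_count=2)   # "pizza" and "ribs" are dropped
total = sum(vocab.values())
probs = {w: keep_probability(c, total, sample=1e-3) for w, c in vocab.items()}
```

The more frequent a word is relative to the threshold, the lower its keep probability. Words whose relative frequency is well below the threshold are always kept.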

In [8]:
%%time

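# Train the Word2Vec model
# Note: `iter` and `size` are the gensim 3.x parameter names;
# gensim 4.x renamed them to `epochs` and `vector_size`.
model = word2vec.Word2Vec(text_words,
                          workers=num_workers,
                          sg=sg,
                          hs=hs,
                          iter=num_iter,
                          size=num_features,
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)

# Discard the raw word vectors and keep only the L2-normalized ones to save memory.
# After this call the model can no longer be trained further. `init_sims` is deprecated in gensim 4.x.
# Ref: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors.init_sims
model.init_sims(replace=True)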

CPU times: user 2h 20min 56s, sys: 20.4 s, total: 2h 21min 17s
Wall time: 14min 40s
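
`init_sims(replace=True)` replaces each vector with its unit-length (L2-normalized) version. The pure-Python sketch below uses made-up 3-dimensional vectors to show why that is convenient: for unit vectors, the plain dot product already equals the cosine similarity. The vectors and helper names are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity of two raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up raw vectors standing in for two word embeddings
raw_bbq = [3.0, 4.0, 0.0]
raw_ribs = [4.0, 3.0, 0.0]

unit_bbq = l2_normalize(raw_bbq)
unit_ribs = l2_normalize(raw_ribs)

# After normalization, a dot product is all that similarity queries need
similarity_from_units = dot(unit_bbq, unit_ribs)
similarity_from_raw = cosine(raw_bbq, raw_ribs)
```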


In [9]:
# Save the model
model.save('yelp_las_vegas_review_text_word_similarities_model')

## 2.2 Test

In [10]:
# Show the first 10 components of the word vector for "bbq"
model.wv['bbq'][:10]

array([-0.0656217 , -0.08170802,  0.04135355, -0.02397001, -0.01003061,
       -0.05073321, -0.037329  , -0.08237506, -0.00115939, -0.07006032],
      dtype=float32)

In [11]:
# Show the 10 words most similar to "bbq"
model.wv.most_similar('bbq')

[('barbecue', 0.8572158217430115),
 ('barbeque', 0.77912437915802),
 ('brisket', 0.6074708700180054),
 ('ribs', 0.5418552160263062),
 ('korean', 0.49812448024749756),
 ('bbqs', 0.4939558506011963),
 ('kalbi', 0.4618802070617676),
 ('dickey', 0.43527039885520935),
 ('memphis', 0.43213239312171936),
 ('riblets', 0.431049108505249)]
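
`most_similar` ranks the vocabulary by cosine similarity to the query word's vector, and the scores above are those cosine values. A minimal sketch of that ranking over a hypothetical toy embedding table follows. The vectors and the `toy_most_similar` helper are invented for illustration and are not taken from the trained model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_most_similar(embeddings, query, topn=10):
    """Return the topn (word, score) pairs closest to the query word."""
    query_vec = embeddings[query]
    scored = [
        (word, cosine(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query  # the query word itself is excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Invented 3-dimensional vectors; real vectors here have 300 dimensions
toy_embeddings = {
    "bbq": [1.0, 0.1, 0.0],
    "barbecue": [0.9, 0.2, 0.0],
    "brisket": [0.6, 0.6, 0.1],
    "funny": [0.0, 0.1, 1.0],
}
ranking = toy_most_similar(toy_embeddings, "bbq", topn=2)
```

Scanning every word like this is linear in the vocabulary size. gensim performs the same comparison as a single matrix operation over the whole vocabulary.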

In [12]:
model.wv.most_similar('funny')

[('hilarious', 0.7780593633651733),
 ('laugh', 0.5951111316680908),
 ('witty', 0.5746638774871826),
 ('humorous', 0.5719525814056396),
 ('hysterical', 0.566609263420105),
 ('entertaining', 0.5625545382499695),
 ('corny', 0.5624991655349731),
 ('entertain', 0.5589728355407715),
 ('laughed', 0.5135879516601562),
 ('comedian', 0.5086338520050049)]

## 2.3 Load the model

In [13]:
loaded_model = word2vec.Word2Vec.load('yelp_las_vegas_review_text_word_similarities_model')

In [14]:
loaded_model.wv.most_similar('bbq')

[('barbecue', 0.8572158217430115),
 ('barbeque', 0.77912437915802),
 ('brisket', 0.6074708700180054),
 ('ribs', 0.5418552160263062),
 ('korean', 0.49812448024749756),
 ('bbqs', 0.4939558506011963),
 ('kalbi', 0.4618802070617676),
 ('dickey', 0.43527039885520935),
 ('memphis', 0.43213239312171936),
 ('riblets', 0.431049108505249)]