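# Project description

Goal: use user reviews to train word embeddings with Word2Vec.
Workflow:
   Split each review into sentences and clean them; the sentence is the unit of processing.
   Tokenize each sentence into words, so that a sentence is represented as a word list such as ['life', 's', 'like', 'that'].
   Working sentence by sentence, filter out uninformative words, then train word2vec.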
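
The workflow above can be sketched on a toy review as follows. This is a minimal, standard-library illustration only: the regular-expression sentence split is an assumed stand-in for the NLTK punkt tokenizer that the notebook uses, and `review_to_sentences` is a hypothetical helper name.

```python
import re

def review_to_sentences(review):
    """Split a review into sentences, then each sentence into lowercase word tokens."""
    # Naive split after '.', '!' or '?' followed by whitespace
    # (assumption: a stand-in for a real sentence tokenizer such as punkt).
    raw_sentences = re.split(r'(?<=[.!?])\s+', review.strip())
    sentences = []
    for raw in raw_sentences:
        # Keep letters only, so "Life's" becomes the two tokens "life" and "s".
        words = re.sub(r'[^a-zA-Z]', ' ', raw).lower().split()
        if words:
            sentences.append(words)
    return sentences

example = review_to_sentences("Life's like that. I saw it 20 years ago!")
print(example)
```

Each inner list is one training sentence, which is the input shape that word2vec training expects.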

# Training word vectors with word2vec

In [30]:
import os
import re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup

import nltk.data
nltk.download()
from nltk.corpus import stopwords

from gensim.models import word2vec
from gensim.models.word2vec import Word2Vec


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [4]:
def load_dataset(name, nrows=None):
    datasets = {
        'unlabeled_train': 'unlabeledTrainData.tsv',
        'labeled_train': 'labeledTrainData.tsv',
        'test': 'testData.tsv'
    }
    if name not in datasets:
        raise ValueError(name)
    data_file = os.path.join('..', 'data', datasets[name])
    df = pd.read_csv(data_file, sep='\t', escapechar='\\', nrows=nrows)
    print('Number of reviews: {}'.format(len(df)))
    return df

### Load the unlabeled data
It is used to train the word2vec word vectors.

In [5]:
df = load_dataset('unlabeled_train')
df.head()

Number of reviews: 50000


Unnamed: 0,id,review
0,9999_0,"Watching Time Chasers, it obvious that it was ..."
1,45057_0,I saw this film about 20 years ago and remembe...
2,15561_0,"Minor Spoilers<br /><br />In New York, Joan Ba..."
3,7161_0,I went to see this film with a great deal of e...
4,43971_0,"Yes, I agree with everyone on this site this m..."


### Preprocess the data as in the first IPython notebook
One small difference: stop-word removal is left as an option, so stop words can be either removed or kept.

In [9]:
# eng_stopwords = set(stopwords.words('english'))
with open('../stopwords.txt') as f:
    eng_stopwords = {line.rstrip() for line in f}

def clean_text(text, remove_stopwords=False):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in eng_stopwords]
    return words

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def print_call_counts(f):
    n = 0
    def wrapped(*args, **kwargs):
        nonlocal n
        n += 1
        if n % 1000 == 1:
            print('method {} called {} times'.format(f.__name__, n))
        return f(*args, **kwargs)
    return wrapped

@print_call_counts
def split_sentences(review):
    # Split the review into sentences, then clean each one
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = [clean_text(s) for s in raw_sentences if s]
    return sentences
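
To make the effect of the `remove_stopwords` switch concrete, here is a small standalone sketch. It assumes a tiny illustrative stop-word set (`demo_stopwords`) rather than the list loaded from `../stopwords.txt`, and `clean_text_demo` is a hypothetical simplification of `clean_text` that skips the HTML stripping.

```python
import re

# Tiny illustrative stop-word set (assumption); the notebook loads its own list from a file.
demo_stopwords = {'i', 'this', 'it', 'as', 'and', 'about'}

def clean_text_demo(text, remove_stopwords=False, stopwords=demo_stopwords):
    # Replace every non-letter with a space, lowercase, and split on whitespace.
    words = re.sub(r'[^a-zA-Z]', ' ', text).lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in stopwords]
    return words

sentence = 'I saw this film about 20 years ago.'
with_stops = clean_text_demo(sentence)
without_stops = clean_text_demo(sentence, remove_stopwords=True)
print(with_stops)
print(without_stops)
```

Note that `split_sentences` calls `clean_text` with the default `remove_stopwords=False`, so the sentences fed to word2vec in this notebook keep their stop words.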

In [10]:
%time sentences = sum(df.review.apply(split_sentences), [])
print('{} reviews -> {} sentences'.format(len(df), len(sentences)))

method split_sentences called 1 times




method split_sentences called 1001 times
method split_sentences called 2001 times




method split_sentences called 3001 times




method split_sentences called 4001 times
method split_sentences called 5001 times
method split_sentences called 6001 times
method split_sentences called 7001 times




method split_sentences called 8001 times
method split_sentences called 9001 times




method split_sentences called 10001 times
method split_sentences called 11001 times
method split_sentences called 12001 times
method split_sentences called 13001 times
method split_sentences called 14001 times
method split_sentences called 15001 times
method split_sentences called 16001 times
method split_sentences called 17001 times
method split_sentences called 18001 times
method split_sentences called 19001 times
method split_sentences called 20001 times
method split_sentences called 21001 times




method split_sentences called 22001 times
method split_sentences called 23001 times
method split_sentences called 24001 times
method split_sentences called 25001 times
method split_sentences called 26001 times
method split_sentences called 27001 times
method split_sentences called 28001 times
method split_sentences called 29001 times
method split_sentences called 30001 times
method split_sentences called 31001 times
method split_sentences called 32001 times
method split_sentences called 33001 times
method split_sentences called 34001 times
method split_sentences called 35001 times




method split_sentences called 36001 times
method split_sentences called 37001 times
method split_sentences called 38001 times
method split_sentences called 39001 times
method split_sentences called 40001 times
method split_sentences called 41001 times
method split_sentences called 42001 times
method split_sentences called 43001 times
method split_sentences called 44001 times




method split_sentences called 45001 times
method split_sentences called 46001 times
method split_sentences called 47001 times
method split_sentences called 48001 times




method split_sentences called 49001 times
Wall time: 8min 57s
50000 reviews -> 537851 sentences
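
A side note on the flattening step: `sum(list_of_lists, [])` builds a new list at every addition, so its cost grows quadratically with the number of sentences, which may account for part of the wall time above. The sketch below, on made-up toy data, shows that `itertools.chain.from_iterable` produces the same flat list in a single linear pass. The variable names are illustrative only.

```python
from itertools import chain

# Toy stand-in for df.review.apply(split_sentences): one list of sentences per review.
per_review = [
    [['a', 'b'], ['c']],   # review with two sentences
    [],                    # review that produced no sentences
    [['d']],               # review with one sentence
]

# Approach used in the notebook: each `+` copies the accumulated list.
flat_sum = sum(per_review, [])

# Linear-time alternative: iterate once over all per-review lists.
flat_chain = list(chain.from_iterable(per_review))

print(flat_chain)
```

With the notebook's data this would read `sentences = list(chain.from_iterable(df.review.apply(split_sentences)))`.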


### Train the word embedding model with gensim

In [25]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [24]:
# Walk through the preprocessing steps on one review
print(df['review'][1])
print(tokenizer.tokenize(df['review'][1].strip()))
print([clean_text(s) for s in tokenizer.tokenize(df['review'][1].strip())])
sentences[5]

I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses' home and rapes, tortures and kills various women.<br /><br />It is in black and white but saves the colour for one shocking shot.<br /><br />At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene.<br /><br />Avoid.
['I saw this film about 20 years ago and remember it as being particularly nasty.', "I believe it is based on a true incident: a young man breaks into a nurses' home and rapes, tortures and kills various women.<br /><br />It is in black and white but saves the colour for one shocking shot.<br /><br />At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene.<br /><br />Avoid."]
[['i', 'saw', 'this', 'film', 'about', 'years', 'ago', 'and', 'remember', 'it', 'as', 'being', 'particularly', 'nasty']

['life', 's', 'like', 'that']

In [26]:
# Set the word-vector training parameters
num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

model_name = '{}features_{}minwords_{}context.model'.format(num_features, min_word_count, context)
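
As an illustration of what `min_word_count` controls: words whose total frequency in the corpus is below the threshold are dropped from the vocabulary before training. The following standard-library sketch mimics that filter on toy data. `build_vocab_demo` is a hypothetical helper, not gensim's implementation.

```python
from collections import Counter

def build_vocab_demo(sentences, min_count):
    """Count every token and keep only those seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [['good', 'film'], ['good', 'plot'], ['bad', 'film']]
vocab = build_vocab_demo(toy_sentences, min_count=2)
print(vocab)
```

With `min_word_count = 40`, rare words such as typos and unusual names never receive a vector, which keeps the model smaller.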

In [32]:
print('Training model...')
# Note: gensim >= 4.0 renamed the `size` argument to `vector_size`.
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

# If you don't plan to train the model any further, calling 
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and 
# save the model for later use. You can load it later using Word2Vec.load()
model.save(os.path.join('..', 'models', model_name))

2019-04-02 11:22:01,126 : INFO : collecting all words and their counts
2019-04-02 11:22:01,127 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-02 11:22:01,194 : INFO : PROGRESS: at sentence #10000, processed 225072 words, keeping 17237 word types
2019-04-02 11:22:01,260 : INFO : PROGRESS: at sentence #20000, processed 443536 words, keeping 24570 word types


Training model...


2019-04-02 11:22:01,323 : INFO : PROGRESS: at sentence #30000, processed 666343 words, keeping 29785 word types
2019-04-02 11:22:01,390 : INFO : PROGRESS: at sentence #40000, processed 886903 words, keeping 33939 word types
2019-04-02 11:22:01,452 : INFO : PROGRESS: at sentence #50000, processed 1103863 words, keeping 37503 word types
2019-04-02 11:22:01,519 : INFO : PROGRESS: at sentence #60000, processed 1327231 words, keeping 40738 word types
2019-04-02 11:22:01,582 : INFO : PROGRESS: at sentence #70000, processed 1550828 words, keeping 43603 word types
2019-04-02 11:22:01,652 : INFO : PROGRESS: at sentence #80000, processed 1772824 words, keeping 46155 word types
2019-04-02 11:22:01,717 : INFO : PROGRESS: at sentence #90000, processed 1987492 words, keeping 48328 word types
2019-04-02 11:22:01,793 : INFO : PROGRESS: at sentence #100000, processed 2210772 words, keeping 50551 word types
2019-04-02 11:22:01,867 : INFO : PROGRESS: at sentence #110000, processed 2435500 words, keeping 

2019-04-02 11:22:18,901 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-04-02 11:22:18,903 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-02 11:22:18,912 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-02 11:22:18,916 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-02 11:22:18,917 : INFO : EPOCH - 1 : training on 11877527 raw words (8394428 effective words) took 13.3s, 630965 effective words/s
2019-04-02 11:22:19,933 : INFO : EPOCH 2 - PROGRESS: at 8.04% examples, 674411 words/s, in_qsize 7, out_qsize 0
2019-04-02 11:22:20,937 : INFO : EPOCH 2 - PROGRESS: at 15.93% examples, 667820 words/s, in_qsize 8, out_qsize 0
2019-04-02 11:22:21,947 : INFO : EPOCH 2 - PROGRESS: at 23.45% examples, 654323 words/s, in_qsize 7, out_qsize 0
2019-04-02 11:22:22,949 : INFO : EPOCH 2 - PROGRESS: at 31.38% examples, 656075 words/s, in_qsize 8, out_qsize 0
2019-04-02 11:22:23,956 : INFO : EPOCH 2 - PRO

2019-04-02 11:23:14,599 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-02 11:23:14,600 : INFO : EPOCH - 5 : training on 11877527 raw words (8395117 effective words) took 13.6s, 616162 effective words/s
2019-04-02 11:23:14,601 : INFO : training on a 59387635 raw words (41975440 effective words) took 69.0s, 608343 effective words/s
2019-04-02 11:23:14,602 : INFO : precomputing L2-norms of word weight vectors
2019-04-02 11:23:14,618 : INFO : saving Word2Vec object under ..\models\300features_40minwords_10context.model, separately None
2019-04-02 11:23:14,619 : INFO : not storing attribute vectors_norm
2019-04-02 11:23:14,620 : INFO : not storing attribute cum_table
2019-04-02 11:23:15,448 : INFO : saved ..\models\300features_40minwords_10context.model


### Inspect the trained word vectors

In [6]:
# doesnt_match picks the word whose meaning differs from the rest of the list
print(model.wv.doesnt_match("man woman child kitchen".split()))
print(model.wv.doesnt_match('france england germany berlin'.split()))
# The model has learned relationships between words

kitchen
berlin
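
One way to picture how an odd-one-out query can work: normalize each word vector to unit length, average them, and return the word least aligned with that average. The sketch below does this on made-up 2-D vectors. It is an assumption-laden illustration of the idea, not gensim's code, and `doesnt_match_demo` and `toy_vectors` are hypothetical names.

```python
import math

def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def doesnt_match_demo(words, vectors):
    units = {w: unit(vectors[w]) for w in words}
    dim = len(vectors[words[0]])
    # Mean direction of all unit vectors, itself rescaled to unit length.
    mean = unit([sum(units[w][i] for w in words) / len(words) for i in range(dim)])
    # The odd word is the one with the smallest dot product with the mean direction.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up 2-D vectors purely for illustration.
toy_vectors = {
    'man': [1.0, 0.1],
    'woman': [0.9, 0.3],
    'child': [0.8, 0.2],
    'kitchen': [0.1, 1.0],
}
odd = doesnt_match_demo(['man', 'woman', 'child', 'kitchen'], toy_vectors)
print(odd)
```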


In [10]:
# Find the words most similar to "man"
model.wv.most_similar("man")

[('woman', 0.6256189346313477),
 ('lady', 0.5953349471092224),
 ('lad', 0.576863169670105),
 ('person', 0.5407935380935669),
 ('farmer', 0.5382746458053589),
 ('chap', 0.536788821220398),
 ('soldier', 0.5292650461196899),
 ('men', 0.5261573791503906),
 ('monk', 0.5237958431243896),
 ('guy', 0.5213091373443604)]
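
The scores in this output are cosine similarities between word vectors. As a minimal sketch of the ranking, using made-up 2-D vectors and hypothetical helper names (`cosine`, `most_similar_demo`), one could write:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_demo(word, vectors, topn=2):
    query = vectors[word]
    # Score every other word against the query and sort from most to least similar.
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-D vectors purely for illustration.
toy_vectors = {
    'man': [1.0, 0.1],
    'woman': [0.9, 0.3],
    'kitchen': [0.1, 1.0],
}
ranking = most_similar_demo('man', toy_vectors)
print(ranking)
```

In the real model the same comparison runs over 300-dimensional vectors for every word in the vocabulary.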

In [11]:
model.wv.most_similar("queen")

[('princess', 0.6749982833862305),
 ('maid', 0.6223365068435669),
 ('bride', 0.6201028227806091),
 ('belle', 0.6200867891311646),
 ('temple', 0.6171057224273682),
 ('stripper', 0.608874499797821),
 ('catherine', 0.6072724461555481),
 ('eva', 0.6019693613052368),
 ('dancer', 0.594109833240509),
 ('sylvia', 0.5933606624603271)]

In [12]:
model.wv.most_similar("awful")

[('terrible', 0.7551683187484741),
 ('atrocious', 0.7340768575668335),
 ('horrible', 0.7315883040428162),
 ('dreadful', 0.7080680131912231),
 ('abysmal', 0.7010548114776611),
 ('horrendous', 0.6951696872711182),
 ('appalling', 0.691646933555603),
 ('horrid', 0.6708598136901855),
 ('amateurish', 0.6481891870498657),
 ('embarrassing', 0.6306308507919312)]