https://github.com/corazzon/KaggleStruggle/blob/master/word2vec-nlp-tutorial/tutorial-part-2.ipynb  
[NLP 2](2/3) Vectorizing words with Word2Vec via Gensim and visualizing them with t-SNE - IMDB movie review analysis, Kaggle machine learning  

@  
I vectorize words with Word2Vec, a neural-network technique that is often grouped under deep learning.  
I visualize the resulting word vectors with t-SNE.  
For the supervised learning step I use a hybrid approach that combines the Word2Vec vectors with a random forest.  
  
@  
Word2Vec (word embedding to vector).  
A computer can only work with numbers.  
Characters and images are ultimately stored as binary data.  
Previously, I vectorized words with the "bag of words" method so that the computer could process them.  
  
@  
Vectorizing with "one-hot encoding" or "bag of words" produces vectors that are very large and sparse.  
This hurts performance in a neural network.  
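
To make the sparsity concrete, here is a minimal sketch that builds bag-of-words count vectors for a toy corpus and measures how many entries are zero. The three sentences are invented for illustration; with a real vocabulary of tens of thousands of words the fraction of zeros would be much closer to 1.

```python
# Toy corpus, invented for illustration only.
docs = [
    "the movie was great",
    "the plot was boring",
    "great acting and great music",
]

# Vocabulary: one vector dimension per distinct word.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return a count vector with one entry per vocabulary word."""
    words = doc.split()
    return [words.count(term) for term in vocab]

vectors = [bag_of_words(doc) for doc in docs]
total = len(vectors) * len(vocab)
zeros = sum(value == 0 for vec in vectors for value in vec)

print(len(vocab))     # number of dimensions
print(zeros / total)  # fraction of entries that are zero
```

Even in this tiny example more than half of the entries are zero, and every new word adds another mostly empty dimension.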
  
@  
Word2Vec builds on the idea that the meaning of a word should be similar to that of the words around it.  
During training, the surrounding words are used as labels for the given word.  
In this way Word2Vec maps each word to a "dense vector" that captures its meaning.  
Because Word2Vec captures similarity between words, it can pick up relations such as "paris is to france" as "berlin is to germany".  
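
The "paris and france" versus "berlin and germany" relation can be illustrated with vector arithmetic. The sketch below uses hand-made 2-D vectors that are purely hypothetical (real Word2Vec vectors are learned from data and usually have hundreds of dimensions); the `analogy` helper is my own illustration, not a Gensim function.

```python
import math

# Hand-made 2-D vectors, purely hypothetical; real Word2Vec vectors are
# learned from data and usually have hundreds of dimensions.
vectors = {
    "paris":   [1.0, 1.0],
    "france":  [1.0, 3.0],
    "berlin":  [3.0, 1.0],
    "germany": [3.0, 3.0],
    "apple":   [-2.0, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the vector b - a + c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("paris", "france", "berlin"))  # germany
```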
  
@  
![word2vec](https://1.bp.blogspot.com/-Q7F8ulD6fC0/UgvnVCSGmXI/AAAAAAAAAbg/MCWLTYBufhs/s1600/image00.gif)  
Image source: https://opensource.googleblog.com/2013/08/learning-meaning-behind-words.html  
  
You can watch the word embedding process visualized in real time: [word embedding visual inspector](https://ronxin.github.io/wevi/)  
  
@  
![CBOW and Skip-Gram](https://i.imgur.com/yXY1LxV.png)  
Source: https://arxiv.org/pdf/1301.3781.pdf  
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.  
  
  
@  
There are two methods: CBOW and Skip-Gram.  
CBOW (continuous bag-of-words) predicts a single word from its whole surrounding context, treating that context as one observation, so it tends to work well on smaller datasets.  
  
@  
A typical CBOW-style task is predicting the word that fills the blank in a simple sentence.  
<pre>  
1. __ has good taste   
2. Riding __ is fun   
3. Since I ate food 2 __, I'm feeling __ ache  
</pre>  
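
As a rough sketch of how such fill-in-the-blank examples become training data, the snippet below builds (context, target) pairs from a tokenized sentence. The function name `cbow_pairs` and the window size are my own assumptions for illustration; this is not how Gensim implements it internally.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for a CBOW-style objective."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        yield left + right, target

tokens = "riding bike is fun".split()
for context, target in cbow_pairs(tokens, window=2):
    print(context, "->", target)
```

Each pair asks the model to recover the hidden word (the blank) from the words around it.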
  
@  
Skip-Gram works the other way around: it predicts the surrounding context words from a target word.  
Unlike CBOW, Skip-Gram treats each "context-target" pair as a new observation, which tends to work better when you have a large dataset.  
  
@  
You can use Skip-Gram to predict the words likely to appear around each marked word below.  
<pre>  
1. *Apple* has good taste   
2. Riding *bike* is fun   
3. Since I ate food 2 *times*, I'm feeling *stomach* ache  
</pre>  
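
A minimal sketch of how Skip-Gram training pairs could be generated: every marked (target) word is paired with each word inside its window, and each pair counts as a separate observation. The function name `skipgram_pairs` and the window size are assumptions for illustration only.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target_word, context_word) pairs for a skip-gram-style objective."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                 # a word is not its own context
                yield target, tokens[j]

tokens = "riding bike is fun".split()
pairs = list(skipgram_pairs(tokens, window=1))
print(pairs)
```

Because every pair is its own training example, the same sentence yields more examples than a one-target-per-position scheme, which is one reason this approach benefits from larger datasets.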
  
@  
Word2Vec references    
[word2vec model · Korean translation of the TensorFlow documentation](https://tensorflowkorea.gitbooks.io/tensorflow-kr/g3doc/tutorials/word2vec/)  
[Classifying sentences with Word2Vec · ratsgo's blog](https://ratsgo.github.io/natural%20language%20processing/2017/03/08/word2vec/)  
  
[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781v3.pdf)  
[Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)  
[CS224n: Natural Language Processing with Deep Learning](http://web.stanford.edu/class/cs224n/syllabus.html)  
[Word2Vec Tutorial - The Skip-Gram Model · Chris McCormick](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)  
  
  
@  
Gensim  
[gensim: models.word2vec – Deep learning with word2vec](https://radimrehurek.com/gensim/models/word2vec.html)  
[gensim: Tutorials](https://radimrehurek.com/gensim/tutorial.html)  
[Korean meets NLTK and Gensim - PyCon Korea 2015](https://www.lucypark.kr/docs/2015-pyconkr/)

In [42]:
# Warnings are suppressed to keep the output short,
# but when actually training it is recommended to comment out the two lines below.
import warnings
warnings.filterwarnings('ignore')

In [43]:
import pandas as pd
import re
import nltk

import numpy as np

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

train = pd.read_csv('D:/chromedown/labeledTrainData.tsv', 
                    header=0, delimiter='\t', quoting=3)
test = pd.read_csv('D:/chromedown/testData.tsv', 
                   header=0, delimiter='\t', quoting=3)
unlabeled_train = pd.read_csv('D:/chromedown/unlabeledTrainData.tsv', 
                              header=0, delimiter='\t', quoting=3)

print(train.shape)
print(test.shape)
print(unlabeled_train.shape)

print(train['review'].size)
print(test['review'].size)
print(unlabeled_train['review'].size)

(25000, 3)
(25000, 2)
(50000, 2)
25000
25000
50000


In [44]:
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [45]:
# Unlike the training data, the test data has no sentiment column
test.head()

Unnamed: 0,id,review
0,"""12311_10""","""Naturally in a film who's main themes are of ..."
1,"""8348_2""","""This movie is a disaster within a disaster fi..."
2,"""5828_4""","""All in all, this is a movie for kids. We saw ..."
3,"""7186_2""","""Afraid of the Dark left me with the impressio..."
4,"""12128_7""","""A very accurate depiction of small time mob l..."


In [46]:
from KaggleWord2VecUtility import KaggleWord2VecUtility

In [47]:
KaggleWord2VecUtility.review_to_wordlist(train['review'][0])[:10]

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with']

In [48]:
from nltk import word_tokenize

sentences = []
for review in train["review"]:
    sentences += KaggleWord2VecUtility.review_to_sentences(
        review, remove_stopwords=False)

AttributeError: 'str' object has no attribute 'decode'
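
The `AttributeError` above is what Python 3 raises when code calls `.decode()` on a `str`, which suggests the helper module was written for Python 2. Its source is not shown here, so the following is only a rough, dependency-free stand-in and not the utility's actual implementation: the name `review_to_sentences_py3`, the `stopwords` argument, and the regex-based sentence splitting are all assumptions. A tokenizer such as NLTK's punkt would split sentences more reliably.

```python
import re

def review_to_sentences_py3(review, remove_stopwords=False, stopwords=frozenset()):
    """Split a review (a Python 3 str) into sentences, each a list of lowercase words."""
    text = re.sub(r"<[^>]+>", " ", review)       # drop HTML tags such as <br />
    raw_sentences = re.split(r"[.!?]+", text)    # naive sentence boundaries
    sentences = []
    for raw in raw_sentences:
        # Keep letters only, lowercase, and split on whitespace.
        words = re.sub(r"[^a-zA-Z]", " ", raw).lower().split()
        if remove_stopwords:
            words = [w for w in words if w not in stopwords]
        if words:                                # skip empty sentences
            sentences.append(words)
    return sentences

print(review_to_sentences_py3("Great film!<br />I loved it. Really."))
```

Word2Vec expects its input as a list of sentences, each a list of words, which is the shape this function returns.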

In [None]:
for review in unlabeled_train["review"]:
    sentences += KaggleWord2VecUtility.review_to_sentences(
        review, remove_stopwords=False)

In [None]:
len(sentences)

In [None]:
sentences[0][:10]

In [None]:
sentences[1][:10]

@  
Gensim
[gensim: models.word2vec – Deep learning with word2vec](https://radimrehurek.com/gensim/models/word2vec.html)

@  
Parameters of Word2Vec model  

@  
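architecture : the options are skip-gram or CBOW. Skip-gram is described here as the default, but recent Gensim releases default to CBOW (`sg=0`); pass `sg=1` to select skip-gram.  
Skip-gram is slower but tends to give better results.  
  
learning algorithm : hierarchical softmax or negative sampling. Hierarchical softmax is described here as the default, but recent Gensim releases default to negative sampling (`hs=0`, `negative=5`).  
The default generally works well.  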
  
@  
downsampling of frequent words : the Google documentation recommends values between 0.00001 and 0.001.  
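
To give a feel for what this threshold does, here is a small sketch of the subsampling rule from the "Distributed Representations of Words and Phrases" paper, where a word with relative frequency f is discarded with probability 1 - sqrt(t / f). The word frequencies are hypothetical, and actual implementations may use a slightly different variant of this formula.

```python
import math

def discard_probability(word_frequency, threshold=1e-3):
    """P(discard) = 1 - sqrt(t / f), clipped at 0 for rare words."""
    return max(0.0, 1.0 - math.sqrt(threshold / word_frequency))

# Hypothetical relative frequencies (fraction of all tokens in the corpus).
for word, freq in [("the", 0.05), ("movie", 0.004), ("cinematography", 0.00002)]:
    print(word, round(discard_probability(freq), 3))
```

Very frequent words are dropped most of the time, while words rarer than the threshold are always kept.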
  
@  
dimensionality of the word vectors  
  
@  
context / window size : how many surrounding words are taken into account for each word  
  
@  
minimum word count : words that occur fewer times than this in the corpus are ignored  
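
The effect of the minimum word count can be sketched with a simple frequency filter. The helper `build_vocab` and the toy sentences are my own illustration of the idea, not Gensim's internal code.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only words that occur at least `min_count` times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Toy tokenized sentences, invented for illustration.
sentences = [
    ["the", "film", "was", "good"],
    ["the", "film", "was", "bad"],
    ["the", "ending", "surprised", "me"],
]
print(sorted(build_vocab(sentences, min_count=2)))  # ['film', 'the', 'was']
```

A higher threshold shrinks the vocabulary to words with enough examples to learn a meaningful vector.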

In [None]:
import logging
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s', 
    level=logging.INFO)

In [None]:
# Set the parameter values.
num_features = 300    # dimensionality of the word vectors
min_word_count = 40   # minimum number of occurrences for a word to be kept
num_workers = 4       # number of worker threads used for training
context = 10          # context window size
downsampling = 1e-3   # downsampling threshold for frequent words

# Initialize the model and train it.
# Note: `size` is the Gensim 3.x name; Gensim 4.x renamed it to `vector_size`.
from gensim.models import word2vec
model = word2vec.Word2Vec(sentences, 
                          workers=num_workers, 
                          size=num_features, 
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)
model

In [None]:
# Once training is finished, unload memory that is no longer needed.
# Note: init_sims() is deprecated in Gensim 4.x.
model.init_sims(replace=True)

model_name = '300features_40minwords_10text'
# model_name = '300features_50minwords_20text'
model.save(model_name)

## Exploring the Model Results

In [None]:
# Pick out the word that does not match the others
model.wv.doesnt_match('man woman child kitchen'.split())

In [None]:
model.wv.doesnt_match("france england germany berlin".split())

In [None]:
# Find the most similar words
model.wv.most_similar("man")

In [None]:
model.wv.most_similar("queen")

In [None]:
# model.wv.most_similar("awful")

In [None]:
model.wv.most_similar("film")

In [None]:
# model.wv.most_similar("happy")
model.wv.most_similar("happi") # when stemming was applied

### Visualizing the Word2Vec word vectors with t-SNE

In [None]:
# Reference: https://stackoverflow.com/questions/43776572/visualise-word2vec-generated-from-gensim
from sklearn.manifold import TSNE
import matplotlib as mpl
import matplotlib.pyplot as plt
import gensim 
import gensim.models as g

# Work around the minus sign rendering as a broken glyph in plots
mpl.rcParams['axes.unicode_minus'] = False

model_name = '300features_40minwords_10text'
# The saved model is a Word2Vec model, so load it with Word2Vec.load
model = g.Word2Vec.load(model_name)

# Gensim 3.x API; in Gensim 4.x use model.wv.key_to_index instead of model.wv.vocab
vocab = list(model.wv.vocab)
X = model.wv[vocab]

print(len(X))
print(X[0][:10])
tsne = TSNE(n_components=2)

# Visualize only the first 100 words
X_tsne = tsne.fit_transform(X[:100,:])
# X_tsne = tsne.fit_transform(X)

In [None]:
df = pd.DataFrame(X_tsne, index=vocab[:100], columns=['x', 'y'])
df.shape

In [None]:
df.head(10)

In [None]:
fig = plt.figure()
fig.set_size_inches(40, 20)
ax = fig.add_subplot(1, 1, 1)

ax.scatter(df['x'], df['y'])

for word, pos in df.iterrows():
    ax.annotate(word, pos, fontsize=30)
plt.show()

In [None]:
import numpy as np

def makeFeatureVec(words, model, num_features):
    """
    Return the mean of the word vectors for the words in the given text.
    """
    # Pre-allocate a zero vector for speed.
    featureVec = np.zeros((num_features,),dtype="float32")

    nwords = 0.
    # index2word is the list of words in the model's vocabulary
    # (Gensim 3.x name; Gensim 4.x calls it index_to_key).
    # Convert it to a set for fast membership tests.
    index2word_set = set(model.wv.index2word)
    # Add up the vectors of the words that are in the model's vocabulary.
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model.wv[word])
    # Divide by the number of words to get the mean vector;
    # skip the division when no word was found, to avoid NaN values.
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec
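
The averaging idea behind `makeFeatureVec` can be checked by hand on a tiny example. The sketch below uses a plain dictionary of hypothetical 3-dimensional vectors in place of a trained model; the helper name `average_vector` is an assumption, and returning a zero vector when no word is known is a choice made for this sketch.

```python
def average_vector(words, vectors, num_features):
    """Average the vectors of the words found in `vectors`; unknown words are skipped."""
    total = [0.0] * num_features
    known = 0
    for word in words:
        if word in vectors:
            known += 1
            total = [t + v for t, v in zip(total, vectors[word])]
    if known == 0:
        return total  # all zeros when no word is in the vocabulary
    return [t / known for t in total]

# Hypothetical 3-dimensional vectors standing in for a trained model.
toy_vectors = {"good": [1.0, 0.0, 2.0], "movie": [3.0, 2.0, 0.0]}
print(average_vector(["good", "movie", "zzz"], toy_vectors, 3))  # [2.0, 1.0, 1.0]
```

The unknown word "zzz" is ignored, so the result is the mean of the two known vectors. Every review, whatever its length, ends up as one fixed-length vector that a classifier can consume.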

In [None]:
def getAvgFeatureVecs(reviews, model, num_features):
    # Compute the mean feature vector for each review (a list of words)
    # and return the result as a 2D numpy array.
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")

    for counter, review in enumerate(reviews):
        # Print status every 1000 reviews
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))

        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)

    return reviewFeatureVecs

In [None]:
# Clean the reviews in parallel using 4 worker processes
# and return the cleaned reviews.
def getCleanReviews(reviews):
    clean_reviews = KaggleWord2VecUtility.apply_by_multiprocessing(reviews["review"], KaggleWord2VecUtility.review_to_wordlist, workers=4)
    return clean_reviews

In [None]:
%time trainDataVecs = getAvgFeatureVecs(getCleanReviews(train), model, num_features) 

In [None]:
%time testDataVecs = getAvgFeatureVecs(getCleanReviews(test), model, num_features)

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100, n_jobs = -1, random_state=2018)

In [None]:
%time forest = forest.fit(trainDataVecs, train["sentiment"])

In [None]:
from sklearn.model_selection import cross_val_score
%time score = np.mean(cross_val_score(forest, trainDataVecs, train['sentiment'], cv=10, scoring='roc_auc'))

In [None]:
score

In [None]:
result = forest.predict( testDataVecs )

In [None]:
output = pd.DataFrame(data={"id":test["id"], "sentiment":result})
output.to_csv('data/Word2Vec_AverageVectors_{0:.5f}.csv'.format(score), index=False, quoting=3)

* With 300features_40minwords_10text: 0.90709436799999987
* With 300features_50minwords_20text: 0.86815798399999999

In [None]:
output_sentiment = output['sentiment'].value_counts()
print(output_sentiment[0] - output_sentiment[1])
output_sentiment

In [None]:
import seaborn as sns 
%matplotlib inline

fig, axes = plt.subplots(ncols=2)
fig.set_size_inches(12,5)
# Pass the data via the x keyword; recent seaborn releases no longer accept it positionally
sns.countplot(x=train['sentiment'], ax=axes[0])
sns.countplot(x=output['sentiment'], ax=axes[1])

In [None]:
544/578