#### Let's build a topic-word vector that combines LDA and word2vec

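- What is "정의"? The answer depends on the reader's background, because the Korean word "정의" is a homonym.
    - It is most often used in the sense of "justice", but people with a science or engineering background frequently use it to mean "definition".
    - In standard word2vec, "정의" gets a single vector, so depending on the training data it will land somewhere between "justice" and "definition".
    - The more such words there are, the more of the vector space is taken up by homonyms, and more dimensions are needed to compensate.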
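
As a rough illustration of the "somewhere in between" claim, the sketch below treats the single vector of a homonym as a frequency-weighted average of two sense vectors. This is a simplification and not how word2vec literally computes embeddings; the sense vectors, the helper `blended_vector`, and the 70/30 split are made-up values for illustration only.

```python
# Hypothetical 2-D sense vectors (illustrative values, not trained embeddings).
justice_sense = [1.0, 0.0]
definition_sense = [0.0, 1.0]

def blended_vector(sense_a, sense_b, share_a):
    """Frequency-weighted average of two sense vectors."""
    return [share_a * a + (1.0 - share_a) * b for a, b in zip(sense_a, sense_b)]

# Assume 70% of the occurrences in the corpus mean "justice".
single_vector = blended_vector(justice_sense, definition_sense, 0.7)
print(single_vector)
```

The blended vector matches neither sense exactly, which is the problem the topic-aware approach tries to avoid.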

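- So a word like "정의" should first be classified by topic, and only then analyzed semantically.
    - The idea is to use LDA to first assign the data to k latent topics, then train a word2vec model of shape k x W x V on top of that.
    - In other words, when the one-hot encoded inputs are combined, the k topic weights are applied so that several W x V matrices are trained at the same time.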
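
One way to picture the k x W x V idea is a stack of k embedding matrices whose rows are mixed by topic weights at lookup time. The NumPy sketch below shows only that forward lookup, with no training; the sizes, the random initialization, the helper `topic_weighted_vector`, and the topic weights are all assumptions made for illustration.

```python
import numpy as np

k, W, V = 3, 5, 4          # topics, vocab size, vector size (toy values)
rng = np.random.default_rng(0)

# One W x V embedding matrix per topic, stacked into a k x W x V tensor.
topic_embeddings = rng.uniform(-1.0, 1.0, size=(k, W, V))

def topic_weighted_vector(word_id, topic_weights):
    """Mix the k per-topic vectors of one word using weights that sum to 1."""
    per_topic = topic_embeddings[:, word_id, :]   # shape (k, V)
    return topic_weights @ per_topic              # shape (V,)

# Hypothetical topic distribution for the current document (e.g. from LDA).
weights = np.array([0.8, 0.15, 0.05])
vec = topic_weighted_vector(2, weights)
print(vec.shape)
```

With a one-hot weight vector this reduces to picking the word's vector from a single topic matrix, so ordinary word2vec is the special case k = 1.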
    
- For the implementation, lean on existing libraries as much as possible; if that does not work out, implement only CBOW at a small scale.

---

- Data preprocessing -> LDA (LDA and W2V share the same vocabulary) -> k (number of topics) x W (vocab size) x V (word vector size)

---
  
References
- https://hulk89.github.io/neural%20machine%20translation/2017/05/08/Word2Vec-impl/ 
- http://solarisailab.com/archives/374




In [10]:
# library imports
import collections 
import math 
import os 
import zipfile 

import numpy as np 
import tensorflow as tf 

import json
import pandas as pd

import gensim

#### step 1: Read the text data


In [11]:
# Read the Amazon review dump (one JSON object per line) into a list of dicts
def read_Amazon_review_data(file_path):
    review_list = []
    with open(file_path) as json_file:
        for text in json_file:
            if text.strip():  # skip blank lines
                text_list = json.loads(text)
                review_list.append({"ID": text_list["reviewerID"],
                                    "Review": text_list["reviewText"],
                                    "Rating": text_list["overall"],
                                    "asin": text_list["asin"],
                                    "timestemp": text_list["unixReviewTime"]
                                   })

    return review_list

In [12]:
read_data = read_Amazon_review_data("reviews_Musical_Instruments_5.json")

In [13]:
read_data[:2]

[{'ID': 'A2IBPI20UZIR0U',
  'Review': "Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,",
  'Rating': 5.0,
  'asin': '1384719342',
  'timestemp': 1393545600},
 {'ID': 'A14VAT5EAX3D9S',
  'Review': "The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording. :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]",
  'Rating': 5.0,
  'asin': '1384719

In [14]:
# Tokenize each review (lowercased, accents stripped) and collect the token lists
data_word = [] 
for line in read_data: 
    line["Review"] = list(gensim.utils.simple_preprocess(str(line["Review"]), deacc=True))
    data_word.append(line["Review"])

#### step 2: Run LDA

In [20]:
# Requires the NLTK stopwords corpus (run nltk.download('stopwords') once)
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])


bigram = gensim.models.Phrases(data_word, min_count = 5, threshold = 100)
bigram_mode = gensim.models.phrases.Phraser(bigram)

def remove_stopwords(texts): 
    # texts is already tokenized, so filter the tokens directly
    return [[word for word in doc if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mode[doc] for doc in texts]

data_words_nostops = remove_stopwords(data_word)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)


# Create Dictionary
id2word = gensim.corpora.Dictionary(data_words_bigrams)

# Create Corpus
texts = data_words_bigrams

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
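
To make the bag-of-words step concrete, the standard-library sketch below mimics what a dictionary plus `doc2bow` produce: each document becomes a sorted list of `(token_id, count)` pairs. It is not gensim's implementation; the toy documents are invented, and assigning ids in first-seen order is an assumption that may differ from the ids gensim assigns.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, not taken from the review data).
toy_texts = [["pop", "filter", "works"], ["cheap", "pop", "filter", "filter"]]

# Assign an integer id to each distinct token in first-seen order.
token2id = {}
for doc in toy_texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_texts]
print(toy_corpus)
```

Word order is discarded at this point, which is fine for LDA but is the reason the word2vec step later works from `texts` rather than from `corpus`.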


In [None]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

In [22]:
lda_model.print_topics()

[(0,
  '0.016*"less" + 0.015*"pick" + 0.014*"picks" + 0.013*"however" + 0.012*"give" + 0.011*"big" + 0.010*"distortion" + 0.010*"clip" + 0.010*"accurate" + 0.010*"power"'),
 (1,
  '0.032*"star" + 0.031*"tight" + 0.024*"wall" + 0.023*"bottom" + 0.020*"version" + 0.017*"violin" + 0.016*"speakers" + 0.016*"screws" + 0.016*"hd" + 0.013*"micro"'),
 (2,
  '0.111*"pedal" + 0.091*"tone" + 0.042*"pedals" + 0.024*"gain" + 0.021*"joyo" + 0.019*"overdrive" + 0.017*"battery" + 0.017*"knob" + 0.014*"tones" + 0.013*"effect"'),
 (3,
  '0.043*"many" + 0.036*"volume" + 0.033*"effects" + 0.031*"noise" + 0.028*"tube" + 0.023*"amps" + 0.021*"output" + 0.020*"software" + 0.019*"trying" + 0.018*"loud"'),
 (4,
  '0.047*"bright" + 0.035*"especially" + 0.029*"compared" + 0.025*"setting" + 0.023*"smaller" + 0.023*"designed" + 0.021*"stands" + 0.019*"prefer" + 0.019*"adjustable" + 0.017*"pickup"'),
 (5,
  '0.051*"music" + 0.037*"studio" + 0.036*"amazon" + 0.033*"inexpensive" + 0.027*"level" + 0.026*"feature" + 0.

#### step 3: Implement word2vec using tensorflow

In [94]:
import random

def generate_input(dataset, id2word, window_size):
    """Build CBOW samples: context word ids as data, the center word id as label."""
    random.shuffle(dataset)  # note: shuffles the caller's list in place
    half = window_size // 2
    data = []
    label = []
    for doc in dataset:
        for idx in range(half, len(doc) - (window_size - half)):
            front = idx - half
            rear = idx + (window_size - half)
            # context = words before the center + words after it (center excluded)
            data.append(id2word.doc2idx(doc[front:idx] + doc[idx + 1:rear]))
            label.append(id2word.doc2idx([doc[idx]]))

    return data, label
            

In [95]:
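vocab_size = len(id2word)
# NOTE: embedding_size is passed to generate_input as the window size
# (5 = center word + 4 context words); the graph below then uses
# embedding_size - 1 = 4 as the embedding dimension.
embedding_size = 5
input_ , label = generate_input(texts, id2word, embedding_size)
batch_size = len(label)  # the whole dataset is fed as a single batch
num_sampled = vocab_size//2  # number of negative classes sampled for the NCE loss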

In [97]:
print(input_[:2], '\n\n',label[:2])

[[96, 120, 1675, 121], [120, 183, 121, 248]] 

 [[183], [1675]]
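
The windowing performed by `generate_input` can be traced on a toy document. The sketch below reuses the same slicing arithmetic but works on plain strings instead of dictionary ids; the helper name `cbow_pairs` and the toy sentence are hypothetical.

```python
def cbow_pairs(doc, window_size):
    """Return (context, center) pairs using the same slicing as generate_input."""
    half = window_size // 2
    pairs = []
    for idx in range(half, len(doc) - (window_size - half)):
        context = doc[idx - half:idx] + doc[idx + 1:idx + (window_size - half)]
        pairs.append((context, doc[idx]))
    return pairs

toy_doc = ["this", "pop", "filter", "works", "really", "well", "indeed"]
for context, center in cbow_pairs(toy_doc, 5):
    print(context, "->", center)
```

In this trace only "filter" and "works" become center words, and the final token "indeed" never appears in any pair, so the loop bound may be worth revisiting if every position should be used.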


In [101]:
embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_size-1], -1.0, 1.0))
nce_weights = tf.Variable(tf.truncated_normal([vocab_size, embedding_size-1], stddev = 1.0/ math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

train_inputs = tf.placeholder(tf.int32, shape=[batch_size, embedding_size-1])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

In [102]:
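# Look up the context words: shape [batch_size, context_size, embedding_dim]
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
# CBOW: average the context vectors down to [batch_size, embedding_dim].
# Passing the rank-3 tensor straight to nce_loss raises
# "ValueError: Shape must be rank 2 but is rank 3 for 'nce_loss_1/MatMul'
#  (op: 'MatMul') with input shapes: [393318,4,4], [9593,4]."
embed_mean = tf.reduce_mean(embed, axis=1)
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_biases,
                                     labels=train_labels,
                                     inputs=embed_mean,
                                     num_sampled=num_sampled,
                                     num_classes=vocab_size))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

init = tf.global_variables_initializer()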
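
The rank mismatch reported for this graph can be reproduced with plain NumPy shapes. In the sketch below, integer indexing stands in for `tf.nn.embedding_lookup`, and averaging over the context axis is the usual CBOW choice, which is an assumption about the intended model; all sizes and ids are toy values.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4
batch, context_len = 3, 4

embeddings = np.random.default_rng(0).uniform(-1.0, 1.0, (vocab_size, embedding_dim))
context_ids = np.array([[0, 1, 3, 4],
                        [1, 2, 4, 5],
                        [2, 3, 5, 6]])       # shape (batch, context_len)

looked_up = embeddings[context_ids]          # rank 3: (batch, context_len, embedding_dim)
cbow_input = looked_up.mean(axis=1)          # rank 2: (batch, embedding_dim)

print(looked_up.shape, cbow_input.shape)
```

Only the rank-2 result lines up with a `[vocab_size, embedding_dim]` weight matrix in a matrix multiplication, which is what the NCE loss needs.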

In [None]:
with tf.Session() as sess:
    init.run()
    for i in range(1000):
        batch_inputs, batch_labels = generate_input(texts, id2word,5)
        print(batch_inputs[:5],'\n', batch_labels[:5])
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
    
        _, loss_val = sess.run([optimizer, loss], feed_dict=feed_dict)
        if i % 100 == 0:
            print(loss_val)
    emb_weights = sess.run(embeddings)

In [None]:
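# Toy Korean vocabulary: subjects (I, she, you, he),
# objects (rice, bean rice, fight, flower), verbs (ate, did, bought, made)
vocab = ['나는',   '그녀가',  '너는', '그가',
         '밥을', '콩밥을', '싸움을', '꽃을', 
         '먹었다', '했다',  '샀다', '만들었다']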
dataset = [[0, 4, 8], [1, 4, 8], [2, 4, 8], [3, 4, 8],
           [0, 5, 8], [0, 5, 9], [1, 5, 8], [1, 5, 9],
           [2, 5, 8], [2, 5, 9], [3, 5, 8], [3, 5, 9],
           [0, 6, 9], [1, 6, 9], [2, 6, 9], [3, 6, 9],
           [0, 7, 10], [1, 7, 10], [2, 7, 10], [3, 7, 10], 
           [0, 6, 11], [1, 6, 11], [2, 6, 11], [3, 6, 11]]

def decode_data(data, vocab):  
    '''
    Turn a list of indices into a sentence.
    '''
    decoded_list = [vocab[idx] for idx in data]
    return ' '.join(decoded_list)

for data in dataset[0:5]:   # print just a few samples
    print(decode_data(data, vocab))
    
# Skip-gram version; this redefines the CBOW generate_input from the earlier cell.
def generate_input(dataset, num_skips):
    random.shuffle(dataset)  # shuffle at the sentence level

    # Flatten into a 1-D list (so a window can slide over it)
    flatten = []
    for list_ in dataset:
        flatten += list_

    # (I, her, saw) => (input: her, label: I), (input: her, label: saw)
    data = []
    label = []
    for idx in range(num_skips, len(flatten)-num_skips):
        data.append(flatten[idx])
        data.append(flatten[idx])
        label.append([flatten[idx-1]])
        label.append([flatten[idx+1]])
    return data, label

input_, label = generate_input(dataset, 1)

vocab_size = len(vocab)
embedding_size = 5
batch_size = len(label)
num_sampled = 6  # vocab_size//2

## Graph build
embeddings = tf.Variable(tf.random_uniform([vocab_size,
                                            embedding_size],
                                           -1.0, 1.0))

nce_weights = tf.Variable(tf.truncated_normal([vocab_size,
                                               embedding_size],
                                              stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

# Placeholders for inputs
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Compute the NCE loss using negatively labeled samples drawn each time
loss = tf.reduce_mean(tf.nn.nce_loss(nce_weights,
                                     nce_biases,
                                     train_labels,
                                     embed,
                                     num_sampled,
                                     vocab_size))

# Use the SGD optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    init.run()
    for i in range(1000):
        batch_inputs, batch_labels = generate_input(dataset, 1)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
    
        _, loss_val = sess.run([optimizer, loss], feed_dict=feed_dict)
        if i % 100 == 0:
            print(loss_val)
    emb_weights = sess.run(embeddings)

np.savetxt('dataset', emb_weights, fmt='%.5e', delimiter='\t')
with open('meta', 'w') as f:
    for i in vocab:
        f.write("{}\n".format(i))
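
The skip-gram pair generation in the cell above can be traced by hand on two of the toy sentences. The sketch below repeats the same left/right neighbor pairing without the shuffle so that the result is deterministic; the helper name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(flat_ids):
    """Pair every interior token with its left and right neighbor."""
    inputs, labels = [], []
    for idx in range(1, len(flat_ids) - 1):
        inputs += [flat_ids[idx], flat_ids[idx]]
        labels += [[flat_ids[idx - 1]], [flat_ids[idx + 1]]]
    return inputs, labels

# Two sentences from the toy dataset, flattened without shuffling.
flat = [0, 4, 8] + [1, 4, 8]
inputs, labels = skipgram_pairs(flat)
print(list(zip(inputs, labels)))
```

Because the sentences are flattened first, pairs such as input 8 with label 1 cross a sentence boundary; that is harmless for this toy data but may add noise on real text.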

#### step 4: Run word2vec using the LDA results
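
A possible starting point for this step is to turn the LDA topic-term matrix into per-word topic weights, which could then serve as the k weights of the k x W x V model. The sketch below uses a toy matrix shaped like the output of gensim's `lda_model.get_topics()` (topics x vocabulary); the numbers are invented, and treating the topic prior as uniform is an assumption.

```python
import numpy as np

# Toy stand-in for a (num_topics x vocab_size) topic-term matrix,
# where each row is p(word | topic). Values are illustrative only.
topic_term = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])

# Assumption: uniform topic prior, so p(topic | word) is proportional
# to p(word | topic). Normalize each column so it sums to 1.
word_topic = topic_term / topic_term.sum(axis=0, keepdims=True)

print(word_topic[:, 0])   # topic weights for word 0
```

Each column of `word_topic` then says how strongly a word belongs to each topic, for example a word that is mostly used in one topic gets a weight close to 1 for that topic's embedding matrix.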
