In [None]:
'''To be done here:
1) Visualization of embeddings for trained common domain words
2) Visualization of embeddings using Doc2vec
3) Visualization of embeddings using custom differential embeddings-> better starting state
4) Intro to word contribution to detect domain linkers, independent and dependent words. Common embedding for linked words. 
'''

Q: Why do we need pretrained embeddings?

General embeddings
The purpose of embeddings is to represent words with the information we deem useful for our task, which is loosely the definition of a feature for a model. The prevailing way to represent word information in deep learning is to encode how a word is used through its neighbours, i.e. its context. These context-specific features can be derived through tuned matrix transformations, i.e. a neural network. From a neural network point of view, then, for each word we try to obtain a representation (vector) that, when transformed to the vocabulary space by a softmax layer, gives high activations for the words it most often co-occurs with.
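
To make the "transform to the vocab space" step concrete, here is a minimal NumPy sketch. The toy vocabulary, the dimensions and the random weights are all made up for illustration; only the shape of the computation (embedding lookup, linear projection, softmax) reflects the description above.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab = ["stock", "price", "hit", "ball"]   # toy vocabulary (hypothetical)
n_vocab, n_dim = len(vocab), 3

embedding = rng.uniform(-1, 1, size=(n_vocab, n_dim))   # one row per word
softmax_w = rng.normal(size=(n_vocab, n_dim))           # projection to vocab space
softmax_b = np.zeros(n_vocab)

def context_distribution(word_id):
    """Return P(context word | target word) over the whole vocabulary."""
    logits = softmax_w.dot(embedding[word_id]) + softmax_b
    logits = logits - logits.max()    # stabilise the exponentials
    exp = np.exp(logits)
    return exp / exp.sum()

probs = context_distribution(vocab.index("stock"))
```

Training pushes `embedding` and `softmax_w` so that the entries of `probs` belonging to frequent neighbours of the target word become large.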

These embeddings can be used in several ways for a deep learning text task. They can be tuned further for the task (which adds many more trainable parameters) or kept fixed. In either case the idea is to give the model a representation and ask it to capture further task-specific information on top of it. A good detailed overview, including other techniques: https://towardsdatascience.com/word-embeddings-exploration-explanation-and-exploitation-with-code-in-python-5dac99d5d795

2nd extension: Word senses
Consider the sentences:
S1: "The stock price took a hit during recession"
S2: "He hit the ball for a six" 
The word 'hit' has a different meaning and contribution in each sentence. A word used in different ways like this is said to exhibit different senses, which a vanilla word embedding will not capture because it stores a single vector per word.
A simple method to capture different senses is to attach the POS tag to the word when computing the embedding. Hence, hit|NOUN and hit|VERB will have two different embeddings.
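
A small sketch of the word|POS idea. The tags below are written by hand for the two example sentences; in practice they would come from a POS tagger. The helper `attach_pos` and the name `sense_vocab` are hypothetical, not part of this notebook's pipeline.

```python
def attach_pos(tokens, tags):
    """Join each token with its POS tag, e.g. ('hit', 'VERB') -> 'hit|VERB'."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return ["{}|{}".format(tok.lower(), tag) for tok, tag in zip(tokens, tags)]

# Hand-assigned tags (an assumption; a real tagger would supply these)
s1 = attach_pos(["The", "stock", "price", "took", "a", "hit"],
                ["DET", "NOUN", "NOUN", "VERB", "DET", "NOUN"])
s2 = attach_pos(["He", "hit", "the", "ball"],
                ["PRON", "VERB", "DET", "NOUN"])

# Build a vocabulary over the sense-tagged tokens
sense_vocab = {}
for sentence in (s1, s2):
    for token in sentence:
        sense_vocab.setdefault(token, len(sense_vocab))
```

Because the vocabulary is built over the tagged tokens, the two uses of "hit" get separate ids and therefore separate rows in the embedding matrix.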

3rd extension: Domain extension-Domain linkers, etc
Now consider the scenario for our problem, in which we train a deep learning model to detect aspect and opinion terms, with our pretrained word embeddings as its input features. The resulting trained model, depending on its architecture, learns transformation matrices (which may also be used to derive additional variables for the computation, e.g. attention) that map the data into a 'latent/middle ground' space.
Let's break down the computation and training process of a BiLSTM-CRF model:
1. Prepare the sequence of inputs and their corresponding embeddings.
2. Run the computation steps to obtain the intermediate features for each word: a composite of transformations of the two BiLSTM hidden states (https://arxiv.org/abs/1511.00215).
3. Compute the log-likelihoods from the CRF feature weights to output the sequence labels.
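
Step 3 can be sketched with a plain Viterbi decoder over per-word label scores. This is a generic illustration and not the model used here: `emissions` stands in for the BiLSTM features of step 2, `transitions` stands in for the CRF label-to-label weights, and the numbers are invented.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence.

    emissions:   (seq_len, n_labels) score of each label at each position
    transitions: (n_labels, n_labels), transitions[i, j] = score of i -> j
    """
    seq_len, n_labels = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((seq_len, n_labels), dtype=int)
    for t in range(1, seq_len):
        # candidate[i, j]: best path ending in label i at t-1, then moving to j
        candidate = score[:, None] + transitions
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0) + emissions[t]
    # Follow the back-pointers from the best final label
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1], float(score.max())

emissions = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
transitions = np.array([[0.0, -5.0], [-5.0, 0.0]])   # switching labels is penalised
path, best_score = viterbi_decode(emissions, transitions)
```

With a strong penalty on switching labels, the decoder prefers a consistent sequence even though the first word on its own favours label 0.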

I suppose that the network tries to do the following:
Given 

The question now is what to do when we have two domains where labels are available for the first, but only a limited number of labels are available for the second (for example, a setting where we can ask which sentences most need a label, i.e. which would help our model the most).

One simple approach is to say that, since the representations are tuned generally, similar words should have similar feature representations across domains. This can be viewed as sharing a common latent space, which can be achieved by finding words that make similar contributions in sentences from different domains: words that perform similar roles should have similar embeddings.
An easier approach is to say that words linked by the same word across domains should have the same embedding, since they act as the same feature.
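
One possible reading of the "linked by the same word" idea, sketched under strong assumptions: we are given (linker, word) pairs collected from both domains, and every group of words sharing a linker is tied to the mean of its members' vectors. The function `tie_linked_words`, the pairs and the vectors are all hypothetical.

```python
import numpy as np
from collections import defaultdict

def tie_linked_words(linked_pairs, embeddings):
    """Give words that share a linker a common vector.

    linked_pairs: iterable of (linker, word), pooled over domains
    embeddings:   dict word -> 1-D numpy array
    """
    groups = defaultdict(set)
    for linker, word in linked_pairs:
        if word in embeddings:
            groups[linker].add(word)
    tied = dict(embeddings)
    for linker, words in groups.items():
        if len(words) > 1:
            shared = np.mean([embeddings[w] for w in sorted(words)], axis=0)
            for w in words:
                tied[w] = shared
    return tied

toy_embeddings = {
    "food": np.array([1.0, 0.0]),       # restaurant domain
    "battery": np.array([0.0, 1.0]),    # laptop domain
    "keyboard": np.array([1.0, 1.0]),
}
pairs = [("great", "food"), ("great", "battery")]
tied = tie_linked_words(pairs, toy_embeddings)
```

A word reached through several linkers simply keeps the last group mean assigned to it, so this sketch is only meaningful when the groups do not overlap.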

--> Another problem in domain adaptation for general sentiment analysis (not just extraction) is that the same word can connote different sentiments in different domains:
easy -> good for a test, perhaps bad for describing a footballer
difficult -> good for describing a defence, bad for describing a situation
(https://nlp.stanford.edu/projects/socialsent/) 

--> This is again why a reasoning structure is needed: 'soft' used in football can describe a soft shot (-), a soft tackle (-), a soft touch (-) or feather-like control (+).

In [1]:
import pickle
import pandas as pd
import csv
import tensorflow as tf
#import torch
#import torch.autograd as autograd
#import torch.nn as nn
from collections import Counter
import numpy as np
import random
import time
#torch.manual_seed(8)



## Auxiliary functions (subsampling, sense, etc)

## Load the dataset

In [9]:
# Binary mode is the portable way to read pickle files
with open('./Final_data/Domains/rest.pkl', 'rb') as p1:
    domain_class = pickle.load(p1)
data_dir = './Final_data/Domains/Laptop/WITH_SENSE__normal_training_list.pickle'#laptop_class.data_path
vocab_dir = domain_class.vocab_path
domain_name = domain_class.name

In [20]:
#pd_csv = pd.read_csv(da)
with open(data_dir, 'rb') as p2:
    tr_data = pickle.load(p2)
    # Keep only the first element (the token list) of each training instance
    tr_data_for_embeddings = [x[0] for x in tr_data]
with open(vocab_dir, 'rb') as p1:
    vocab_to_int = pickle.load(p1)
    int_to_vocab = {val:key for key, val in vocab_to_int.items()}
#Mapping training data to indices (a list, so it can be iterated more than once)
idd_tr_data = [[vocab_to_int[word] for word in x] for x in tr_data_for_embeddings]
#vocab_to_int[' <START> '] = len(vocab_to_int)

In [21]:
### Functions to generate target context pairs
def get_context(words, index, window_size):
    '''Given a window size and current index of target, return the context words    
    '''
    r = np.random.randint(1, window_size+1)
    start = index - r if(index - r) > 0 else 0 
    stop = index + r
    context_words = list(set(words[start:index]+words[index+1:stop+1]))
    return context_words


def generate_training_batch(tokenized_corpus_list, window_size = 5):
    '''
    This generator function runs over the entire dataset row by row (Each row is a batch)
    
    Input
    ------
    Tokenized training list
    
    Output
    -------
    Yields [target, context pairs]
    '''
    #num_batches = np.ceil(float(len(corpus)))/batch_size #This is assuming that each target->all contexts are taken as a single element
    
    #num_in_last_batch = len(corpus)%batch_size 
    
    for tokenized_instance in tokenized_corpus_list:
        # For each word, obtain the context words using a random radius between 1 and window_size
        tr_x = []
        tr_y = []
        # idx is the position of the target inside the sentence (not the sentence index)
        for idx, target in enumerate(tokenized_instance):
            temp_y = get_context(tokenized_instance, idx, window_size)
            tr_y += temp_y
            tr_x += [target]*len(temp_y)
        yield tr_x, tr_y
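
To see what kind of (target, context) pairs come out of this, here is a deterministic variant. It is a simplified sketch and not a drop-in replacement: the cell above draws a random radius per target and de-duplicates context words with `set`, whereas this version uses a fixed window so the output is easy to inspect. The function name `skipgram_pairs` is made up.

```python
def skipgram_pairs(sentence, window):
    """Return (target, context) id pairs using a fixed symmetric window."""
    pairs = []
    for idx, target in enumerate(sentence):
        start = max(0, idx - window)
        context = sentence[start:idx] + sentence[idx + 1:idx + window + 1]
        pairs.extend((target, c) for c in context)
    return pairs

pairs = skipgram_pairs([10, 11, 12, 13], window=1)
```

Each target id is repeated once per context id, which is the same layout as `tr_x` and `tr_y` in the generator.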
  
    

In [22]:
'''Parameters for training'''
threshold = 1e-1  #drop threshold as per formula
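# Count tokens over the whole corpus
word_counts = Counter(word_id for sentence in idd_tr_data for word_id in sentence)
# Frequencies are relative to the total number of tokens, not the number of sentences
total_count = sum(word_counts.values())
freqs = {word_id: float(count)/total_count for word_id, count in word_counts.items()}
p_drop = {word_id: 1 - np.sqrt(threshold/freqs[word_id]) for word_id in word_counts}

#Randomly drop words (a negative p_drop means the word is always kept)
final_tr_data = [[word_id for word_id in x if random.random() < (1 - p_drop[word_id])] for x in idd_tr_data]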
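
A worked example of the drop formula P_drop(w) = 1 - sqrt(t / f(w)) on a toy corpus. The ids and the threshold are invented so the numbers are easy to check by hand; clipping at 0 is an addition here to make "never dropped" explicit.

```python
import numpy as np
from collections import Counter

corpus = [[0, 1, 0, 2], [0, 3, 0, 1]]            # toy id sequences (hypothetical)
counts = Counter(w for sent in corpus for w in sent)
total = float(sum(counts.values()))               # total tokens, not sentences
threshold = 0.2

freqs = {w: c / total for w, c in counts.items()}
# P_drop(w) = 1 - sqrt(t / f(w)), clipped at 0 for words rarer than the threshold
p_drop = {w: max(0.0, 1.0 - np.sqrt(threshold / f)) for w, f in freqs.items()}
```

Word 0 makes up half of the tokens and is dropped most often, while words 2 and 3 are rarer than the threshold and are always kept.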

In [23]:
x = generate_training_batch(idd_tr_data, 5)

## Model 

In [24]:
train_graph = tf.Graph()
with train_graph.as_default():
    inputs = tf.placeholder(tf.int32, [None], name='inputs')
    labels = tf.placeholder(tf.int32, [None, None], name='labels')

In [25]:
n_vocab = len(vocab_to_int)
n_embedding =  100

with train_graph.as_default():
    #make embedding variable-> need to save after training
    embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs) # use tf.nn.embedding_lookup to get the hidden layer output

In [26]:
# Number of negative labels to sample
n_sampled = 100
with train_graph.as_default():
    softmax_w = tf.Variable(tf.truncated_normal((n_vocab, n_embedding))) # create softmax weight matrix here
    softmax_b = tf.Variable(tf.zeros(n_vocab), name="softmax_bias") # create softmax biases here
    
    #Backprop selected vars 
    # Calculate the loss using sampled softmax (only n_sampled negative classes are scored per batch)
    loss = tf.nn.sampled_softmax_loss(
        weights=softmax_w,
        biases=softmax_b,
        labels=labels,
        inputs=embed,
        num_sampled=n_sampled,
        num_classes=n_vocab)
    
    cost = tf.reduce_mean(loss)
    optimizer = tf.train.AdamOptimizer().minimize(cost)

In [28]:
with train_graph.as_default():
    valid_size = 16 # Random set of words to evaluate similarity on.
    valid_window = 100
    # pick 8 samples from (0,100) and (1000,1100) each ranges. lower id implies more frequent 
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples, 
                               random.sample(range(1000,1000+valid_window), valid_size//2))

    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    
    norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keep_dims=True))
    normalized_embedding = embedding / norm
    valid_embedding = tf.nn.embedding_lookup(normalized_embedding, valid_dataset)#in passed dict, id the ones in valid dataset
    similarity = tf.matmul(valid_embedding, tf.transpose(normalized_embedding)) #create a node in the graph
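
The similarity node is just cosine similarity between normalised rows. A NumPy sketch of the same lookup, with a made-up 2-D embedding matrix and a hypothetical helper name:

```python
import numpy as np

def nearest_neighbours(embedding, query_id, top_k=3):
    """Return ids of the top_k rows most cosine-similar to row query_id."""
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    normalized = embedding / norms
    sims = normalized.dot(normalized[query_id])
    sims[query_id] = -np.inf          # exclude the query itself explicitly
    return np.argsort(-sims)[:top_k]

toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbours = nearest_neighbours(toy, query_id=0, top_k=2)
```

Masking the query explicitly avoids relying on it always sorting first, which can fail when two rows are identical.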

In [16]:
epochs = 10
window_size = 10

with train_graph.as_default():
    saver = tf.train.Saver()
    
with tf.Session(graph=train_graph) as sess:
    iteration = 1
    running_loss = 0  # separate name so the graph's `loss` tensor is not shadowed
    sess.run(tf.global_variables_initializer()) #initialize variables before training
    
    for e_ in range(1, epochs+1):
        gen_batch = generate_training_batch(idd_tr_data, window_size)
        start = time.time()
        
        for tr_x, tr_y in gen_batch:
            feed = {inputs: tr_x, labels: np.array(tr_y)[:, None]}
            train_loss, _ = sess.run([cost, optimizer], feed_dict=feed)
            running_loss += train_loss
        
            if iteration % 100 == 0: 
                end = time.time()
                print("Epoch {}/{}".format(e_, epochs),
                      "Iteration: {}".format(iteration),
                      "Avg. Training loss: {:.4f}".format(running_loss/100),
                      "{:.4f} sec/batch".format((end-start)/100))
                running_loss = 0
                start = time.time()
                
            if iteration % 1000 == 0:
                
                sim = similarity.eval() #get similarity node evaluated 
                for i in range(valid_size): #valid size is the number of words to compare
                    valid_word = int_to_vocab[valid_examples[i]]
                    top_k = 8 # number of nearest neighbors
                    nearest = (-sim[i, :]).argsort()[1:top_k+1] # ignore the 1st element (itself)
                    log = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = int_to_vocab[nearest[k]]
                        log = '%s %s,' % (log, close_word)
                    print(log)
                    
            iteration += 1
    #Save model
    save_path = saver.save(sess, "checkpoints/text82.ckpt")
    embed_mat = sess.run(normalized_embedding)

ResourceExhaustedError: Ran out of GPU memory when allocating 0 bytes for 
	 [[Node: SoftmaxCrossEntropyWithLogits = SoftmaxCrossEntropyWithLogits[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Reshape_1)]]
	 [[Node: Mean/_51 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_644_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'SoftmaxCrossEntropyWithLogits', defined at:
  File "<ipython-input-11-f8051408975e>", line 14, in <module>
    num_classes=n_vocab)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_impl.py", line 1258, in sampled_softmax_loss
    labels=labels, logits=logits)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 1783, in softmax_cross_entropy_with_logits
    precise_logits, labels, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 4364, in _softmax_cross_entropy_with_logits
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access



In [30]:
def output_weights_as_json(embeddings_matrix, directory_addr, int_to_vocab):
    '''Save the embedding matrix as a text file: a "count dims" header line
    followed by one "word v1 v2 ..." line per word. The ".json" suffix of
    directory_addr is replaced by ".txt".'''
    embeddings = {int_to_vocab[i]: embeddings_matrix[i] for i,_ in enumerate(embeddings_matrix)}

    opfname = directory_addr.replace('.json', '.txt')
    total_sg_count = len(embeddings)
    # Take the dimensionality from the matrix itself (n_embedding is 100 here, not 300)
    dims = len(embeddings_matrix[0])
    with open(opfname, 'w') as fh:
        fh.write('{} {}\n'.format(total_sg_count, dims))
        for k, v in embeddings.items():
            fh.write('{} {}\n'.format(k, ' '.join(str(vv) for vv in v)))

    return embeddings
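
For reading the saved file back, a small loader sketch. It assumes the layout written above (a "count dims" header, then space-separated "word v1 v2 ..." lines) and tolerates a trailing space before the newline. The name `load_text_embeddings` is hypothetical, and the temporary file only exists to make the example self-contained.

```python
import os
import tempfile
import numpy as np

def load_text_embeddings(path):
    """Read a 'count dims' header followed by 'word v1 v2 ...' lines."""
    embeddings = {}
    with open(path) as fh:
        count, dims = (int(x) for x in fh.readline().split())
        for line in fh:
            parts = line.rstrip().split(" ")
            if len(parts) != dims + 1:
                continue              # skip malformed rows, e.g. words containing spaces
            embeddings[parts[0]] = np.array([float(v) for v in parts[1:]])
    return embeddings, count, dims

# Round-trip on a tiny hand-written file
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("2 3\nfood 0.1 0.2 0.3 \nhit|VERB 1.0 2.0 3.0\n")
tmp.close()
loaded, count, dims = load_text_embeddings(tmp.name)
os.remove(tmp.name)
```

Comparing `len(loaded)` with `count` is a cheap check that no rows were skipped.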

In [None]:
import os  # needed for os.path.join; not imported earlier in the notebook

out_dir = "./Final_data/embeddings/{}/".format(domain_name)
saved_dict = output_weights_as_json(embed_mat, os.path.join(out_dir, "tf_weights.json"), int_to_vocab)

In [11]:
#1) Embeddings -30mins
#2) Bilstm - 15mins
#3) ML - 1hr

'Rest'

## Embeddings Visualization across Domains

In [11]:

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

In [None]:
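def get_common_words_across_domains(domain_vocabs_list):
    '''Return the set of words present in every domain vocabulary (word -> id dicts).'''
    if not domain_vocabs_list:
        return set()
    key_sets = [set(vocab.keys()) for vocab in domain_vocabs_list]
    return set.intersection(*key_sets)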

In [None]:
viz_words = 500
tsne = TSNE()
embed_tsne = tsne.fit_transform(embed_mat[:viz_words, :])

In [None]:
fig, ax = plt.subplots(figsize=(14, 14))
for idx in range(viz_words):
    plt.scatter(*embed_tsne[idx, :], color='steelblue')
    plt.annotate(int_to_vocab[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]), alpha=0.7)

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class ContextPredictionEmbedding(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(ContextPredictionEmbedding,self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size*embedding_dim, 300)
        self.linear2 = nn.Linear(300, vocab_size)
        
    def forward(self, inputs):
        # Concatenate the embeddings of all input ids into a single row vector
        input_embedding = self.embeddings(inputs).view((1,-1))
        out = F.relu(self.linear1(input_embedding))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
    
    
# NOTE: `vocab` (word -> id dict) and `training_batch` (iterable of (target, context)
# word pairs) are not defined anywhere in this notebook; they are assumed to exist.
losses = []
loss_function = nn.NLLLoss()
# context_size is 1 because a single target id is fed to the model per step
model = ContextPredictionEmbedding(len(vocab), 300, 1)
optimizer = optim.SGD(model.parameters(), lr = 0.001)

for epoch in range(10):
    total_loss = 0.0
    for target, context in training_batch:
        
        #1) Convert the target and context words to id tensors
        target_var = torch.LongTensor([vocab[target]])
        context_var = torch.LongTensor([vocab[context]])
        
        #2) reset gradients
        model.zero_grad()
        
        #3) run forward pass
        log_probs = model(target_var)
        
        #4) compute loss and update parameters
        loss = loss_function(log_probs, context_var)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        
    losses.append(total_loss)
        