In [1]:
'''To be done here:
1) Visualization of embeddings and common words using Glove
1) Visualization of embeddings for trained common domain words (sense is always included) <-- Using Gensim w2vec and custom
#2) Visualization of embeddings using Doc2vec
3) Visualization of embeddings using custom differential embeddings-> better starting state
4) Intro to word contribution to detect domain linkers, independent and dependent words. Common embedding for linked words. 
'''

'To be done here:\n1) Visualization of embeddings and common words using Glove\n1) Visualization of embeddings for trained common domain words (sense is always included) <-- Using Gensim w2vec\n2) Visualization of embeddings using Doc2vec\n3) Visualization of embeddings using custom differential embeddings-> better starting state\n4) Intro to word contribution to detect domain linkers, independent and dependent words. Common embedding for linked words. \n'

Q: Why do we need pretrained embeddings?

General embeddings
The purpose of embeddings is to be able to represent words with information we deem useful for our task. This is vaguely the definition of a feature for a model. The current trending mechanism to represent word information in deep learning is by being able to encode the usage of a word through its neighbours/context. These context specific features can be derived through tuned matrix transformations, ie, a neural network. Thus, from a neural network point of view, for each word we try and obtain a representation/vector that when transformed to the vocab space (softmax layer) results in high activations for mostly co-occuring words. 

These embeddings can be used in multiple ways for a deep learning text specific task. They can be tuned further based on the task (this results in a lot more parameters), or set as fixed. Give rep to a model and ask it to based on this, capture other info. Good detailed overview  with other techniques: https://towardsdatascience.com/word-embeddings-exploration-explanation-and-exploitation-with-code-in-python-5dac99d5d795) 

2nd extension: Word senses
Consider the sentences:
S1: "The stock price took a hit during recession"
S2: "He hit the ball for a six" 
The word hit has a different meaning/contribution in the sentence. This varying usage of a word is termed as a word exhibiting different senses, and something the vanilla word embedding will not capture.
A simple method to capture different senses is to associate the POS tag with the word when computing the embedding. Hence, hit|NOUN and hit|VERB will have two different embeddings. 

3rd extension: Domain extension-Domain linkers, etc
Now consider the scenario for our problem wherein we train a deep learning model to detect aspect and opinion terms with its input features as our pretrained word embeddings. The resultant trained model, based on its architecture, learns the transformation matrices (which may be used to derive additional variables for computation- ex. attention) to transform the data into a 'latent/middle ground' space.
Let's break down the computation and training process of a BiLSTM-CRF model:
1. Prepare the sequence of inputs- and corresponding embeddings.
2. Run the computation steps to obtain the intermediate features for each word- composite of transformations of two hidden bilstm states (https://arxiv.org/abs/1511.00215) 
3. Compute the log likelihoods based on crf feature weights to output sequence labels.

I suppose that the network tries to do the following:
Given 

The question now is if we had two domains where labels are available for the first one, but a limited number of labels are available for the second one (we can take the case of being able to ask which sentence really needs a label-which would help our model the most, etc)

One simple approach would be to say that since the data is tuned generally- we should have similar feature representations for similar words across domains. This can be viewed as sharing a common latent space- which can either be done by finding similar word contributions amongst words in sentences of different domains- basically words that perform similar roles should have similar embeddings. 
An easier approach is to say that words that are linked by the same word across domains should have the same embedding- since they're the same feature. 

-->Another problem in domain adaptation for general sentiment analysisis (not just extraction) is that words can connotate different sentiments. easy-> good for a test, perhaps bad for describing a footballer
difficult-> good for describing defence, bad for describing a situation 
(https://nlp.stanford.edu/projects/socialsent/) 

--> This is again why a reasoning structure is needed-> soft when used in football can tell about a soft shot(-), soft tackle(-), soft touch(-), feather like control(+). 

In [11]:
import pickle
import pandas as pd
import csv
import tensorflow as tf
#import torch
#import torch.autograd as autograd
#import torch.nn as nn
import os
from collections import Counter
import numpy as np
import random
import time
#torch.manual_seed(8)

  from ._conv import register_converters as _register_converters


## Auxiliary functions (subsampling, sense, etc)

## Word2vec using Gensim

In [12]:
from gensim.models import Word2Vec
import pickle

In [13]:
def load_training_data_for_embeddings(path_to_pkl):
    '''Not optimized will consume large memory since data is not distributed (currently stored in a single pkl)'''
    with open(path_to_pkl) as p2:
        tr_data = pickle.load(p2)
    
    combined_docs = map(lambda x: x[0], tr_data)
    
    #for row_n in range(len(tr_data)):
     #   combined_docs.append(tr_data[row_n][1][1])
      #  combined_docs.append(tr_data[row_n][2][1])
        
    return combined_docs
    

In [52]:
#https://rare-technologies.com/word2vec-tutorial/
class TrainingInstances(object):
    def __init__(self, domain_tr_path_list):
        self.domain_tr_path_list = domain_tr_path_list
        
    def __iter__(self):
        for domain_tr_path in self.domain_tr_path_list:
            training_data = load_training_data_for_embeddings(domain_tr_path)
            
            for row in training_data:
                yield(row)
with_sense = False
if(with_sense):
    raw_dir_str = './Final_data/Domains/{}/WITH_SENSE__normal_training_list.pickle'
else:
    raw_dir_str = './Final_data/Domains/{}/Normal__normal_training_list.pickle'

In [57]:
domains = ["Rest"]#["Laptop"]#,"Rest"]
domains_data_list = []
for domain in domains:
    domains_data_list.append(raw_dir_str.format(domain,domain))

if(with_sense):
    #model_save_dir ="./Final_data/embeddings/Gensim_w2v/together_sense"
    model_save_dir ="./Final_data/embeddings/Gensim_w2v/{}_sense".format(domains[0])
else:
    model_save_dir ="./Final_data/embeddings/Gensim_w2v/{}_normal".format(domains[0])
tr_instance_obj = TrainingInstances(domains_data_list)

In [58]:
#Training
model = Word2Vec(tr_instance_obj, min_count=2, size=300, workers = 4)  
model.save(model_save_dir)
model.wv.save_word2vec_format(model_save_dir+'.emd')

In [59]:
model.most_similar("gift")


[('now', 0.999557375907898),
 ('fast', 0.9995460510253906),
 ("'s", 0.9995405077934265),
 ('they', 0.9995402693748474),
 ('not', 0.9995400905609131),
 ('an', 0.999539852142334),
 ('$', 0.9995396137237549),
 ('even', 0.9995391964912415),
 ('which', 0.9995386600494385),
 ('just', 0.999537467956543)]

In [51]:
%ls Final_data/embeddings/Gensim_w2v/


[0m[01;34mabsa_models[0m/                         [01;34mFinal_data[0m/
Approach 1- Word contribution.ipynb  [01;34mPriming Net[0m/
[01;34mAssorted[0m/                            Problem description.ipynb
Baseline BiLSTM-CRF.ipynb            README.md
[01;34mcheckpoints[0m/                         Scripts_PreApproach1.ipynb
Domain_embeddings.ipynb              Seq2Seq.ipynb
[01;34mDomainLexicons[0m/                      [01;34mTrained_models[0m/
domain_processing.py                 utils.py


## Load the dataset

In [3]:
with open('./Final_data/Domains/laptop.pkl') as p1:
    domain_class = pickle.load(p1)
data_dir = './Final_data/Domains/Laptop/WITH_SENSE__normal_training_list.pickle'#laptop_class.data_path
vocab_dir = domain_class.vocab_path
domain_name = domain_class.name

In [5]:
#pd_csv = pd.read_csv(da)
with open(data_dir) as p2:
    tr_data = pickle.load(p2)
    tr_data_for_embeddings = map(lambda x: x[0], tr_data)
with open(vocab_dir) as p1:
    vocab_to_int = pickle.load(p1)
    int_to_vocab = {val:key for key, val in vocab_to_int.items()}
#Mapping training data to indices
idd_tr_data= map(lambda x: [vocab_to_int[word] for word in x], tr_data_for_embeddings) 
#vocab_to_int[' <START> '] = len(vocab_to_int)

In [18]:
### Functions to generate target context pairs
def get_context(words, index, window_size):
    '''Given a window size and current index of target, return the context words    
    '''
    r = np.random.randint(1, window_size+1)
    start = index - r if(index - r) > 0 else 0 
    stop = index + r
    context_words = list(set(words[start:index]+words[index+1:stop+1]))
    return context_words


def generate_training_batch(tokenized_corpus_list, window_size = 5):
    '''
    This generator function runs over the entire dataset row by row (Each row is a batch)
    
    Input
    ------
    Tokenized training list
    
    Output
    -------
    Yields [target, context pairs]
    '''
    #num_batches = np.ceil(float(len(corpus)))/batch_size #This is assuming that each target->all contexts are taken as a single element
    
    #num_in_last_batch = len(corpus)%batch_size 
    
    for i, tokenized_instance in enumerate(tokenized_corpus_list): #do the last batch with 
    #1) For each word, we obtain the context words with a random variable ranging from 1 to desired window size
        tr_x = []
        tr_y = []
        for target in tokenized_instance :
            temp_x = [target]
            temp_y = get_context(tokenized_instance, i, window_size)
            tr_y.extend(temp_y)
            tr_x.extend(temp_x*len(temp_y))
        yield tr_x, tr_y
  
    

In [19]:
'''Parameters for training'''
threshold = 1e-1  #drop threshold as per formula
word_counts = Counter(reduce(lambda x,y:x+y, idd_tr_data))
total_count = len(idd_tr_data)
freqs = {word_id: float(count)/total_count for word_id, count in word_counts.items()}
p_drop = {word_id: 1 - np.sqrt(threshold/freqs[word_id]) for word_id in word_counts}

#Randomly drop words 
final_tr_data = map(lambda x: [word_id for word_id in x if random.random() < (1- p_drop[word_id])], idd_tr_data)

## Differential Embedding Model 

In [20]:
train_graph = tf.Graph()
with train_graph.as_default():
    inputs = tf.placeholder(tf.int32, [None], name='inputs')
#     labels = tf.placeholder(tf.int32, [None, None], name='labels')
    labels = tf.placeholder(tf.int32, [None, None], name='labels')

In [21]:
n_vocab = len(vocab_to_int)
n_embedding =  100

with train_graph.as_default():
    #make embedding variable-> need to save after training
    embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs) # use tf.nn.embedding_lookup to get the hidden layer output

In [22]:
# Number of negative labels to sample
n_sampled = 100
with train_graph.as_default():
    softmax_w = tf.Variable(tf.truncated_normal((n_vocab, n_embedding))) # create softmax weight matrix here
    softmax_b = tf.Variable(tf.zeros(n_vocab), name="softmax_bias") # create softmax biases here
    
    #Backprop selected vars 
    # Calculate the loss using negative sampling
    loss = tf.nn.sampled_softmax_loss(
        weights=softmax_w,
        biases=softmax_b,
        labels=labels,
        inputs=embed,
        num_sampled=n_sampled,
        num_classes=n_vocab)
    
    cost = tf.reduce_mean(loss)
    optimizer = tf.train.AdamOptimizer().minimize(cost)

In [23]:
with train_graph.as_default():
    valid_size = 16 # Random set of words to evaluate similarity on.
    valid_window = 100
    # pick 8 samples from (0,100) and (1000,1100) each ranges. lower id implies more frequent 
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples, 
                               random.sample(range(1000,1000+valid_window), valid_size//2))

    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    
    norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keep_dims=True))
    normalized_embedding = embedding / norm
    valid_embedding = tf.nn.embedding_lookup(normalized_embedding, valid_dataset)#in passed dict, id the ones in valid dataset
    similarity = tf.matmul(valid_embedding, tf.transpose(normalized_embedding)) #create a node in the graph

In [61]:
stop_at_testing_loss = 1e-2
epochs = int(1e6)
window_size = 8

with train_graph.as_default():
    saver = tf.train.Saver()
    
with tf.Session(graph=train_graph) as sess:
    iteration = 1
    loss = 0
    sess.run(tf.global_variables_initializer()) #initialize variables before training
    
    for e_ in range(1, epochs+1):
        
        gen_batch = generate_training_batch(final_tr_data, window_size)
        start = time.time()
        
        for tr_x, tr_y in gen_batch:
            #print("iteration", iteration)
            #print(tr_x)
            #print(tr_y)
            if(len(tr_x)==0 or len(tr_y)==0):
                continue
            feed = {inputs: tr_x, labels: np.array(tr_y)[:, None]}
            
            train_loss, _ = sess.run([cost, optimizer], feed_dict=feed)
            
            loss += train_loss
        
            if iteration % 100 == 0: 
                end = time.time()
                print("Epoch {}/{}".format(e_, epochs),
                      "Iteration: {}".format(iteration),
                      "Avg. Training loss: {:.4f}".format(loss/100),
                      "{:.4f} sec/batch".format((end-start)/100))
                if(loss<stop_at_testing_loss):
                    break
                loss = 0
                start = time.time()
                
            if iteration % 1000 == 0:
                
                sim = similarity.eval() #get similarity node evaluated 
                for i in range(valid_size): #valid size is the number of words to compare
                    valid_word = int_to_vocab[valid_examples[i]]
                    top_k = 8 # number of nearest neighbors
                    nearest = (-sim[i, :]).argsort()[1:top_k+1] # ignore the 1st element (itsefl)
                    log = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = int_to_vocab[nearest[k]]
                        log = '%s %s,' % (log, close_word)
                    print(log)
                    
            iteration += 1
    #Save model
    save_path = saver.save(sess, "checkpoints/text82.ckpt")
    embed_mat = sess.run(normalized_embedding)

('Epoch 5/1000000', 'Iteration: 100', 'Avg. Training loss: 15.0928', '0.0003 sec/batch')
('Epoch 9/1000000', 'Iteration: 200', 'Avg. Training loss: 14.4144', '0.0006 sec/batch')
('Epoch 14/1000000', 'Iteration: 300', 'Avg. Training loss: 13.8355', '0.0002 sec/batch')
('Epoch 18/1000000', 'Iteration: 400', 'Avg. Training loss: 13.3947', '0.0005 sec/batch')
('Epoch 23/1000000', 'Iteration: 500', 'Avg. Training loss: 13.0241', '0.0000 sec/batch')
('Epoch 27/1000000', 'Iteration: 600', 'Avg. Training loss: 12.3032', '0.0004 sec/batch')
('Epoch 31/1000000', 'Iteration: 700', 'Avg. Training loss: 11.8042', '0.0007 sec/batch')
('Epoch 36/1000000', 'Iteration: 800', 'Avg. Training loss: 11.2886', '0.0002 sec/batch')
('Epoch 40/1000000', 'Iteration: 900', 'Avg. Training loss: 10.6103', '0.0005 sec/batch')
('Epoch 45/1000000', 'Iteration: 1000', 'Avg. Training loss: 10.2887', '0.0001 sec/batch')
Nearest to shop|NOUN: hardrive|ADJ, beginners|NOUN, click|NOUN, pulling|VERB, ram/|NOUN, beat|NOUN, t

KeyboardInterrupt: 

In [62]:
'''Save tf weights as json'''
def output_weights_as_json(embeddings_matrix, directory_addr, int_to_vocab):
    #get_float_list = lambda x: list(map(lambda y:float(y),x))
    embeddings = {int_to_vocab[i]: embeddings_matrix[i] for i,_ in enumerate(embeddings_matrix)}
    #with open(directory_addr,'w') as f1:
     #   json.dump(embeddings, f1)

    opfname = directory_addr.replace('.json', '.txt')
    total_sg_count = len(embeddings)
    dims = str(300)
    first_line = '{} {}'.format(total_sg_count,dims)
    fh = open (opfname,'w')
    fh.write(first_line+"\n")
    for k,v in embeddings.items():
        fh.write('{} {} \n'.format(str(k), ' '.join([str(vv) for vv in v])))
    fh.close()
    
    return embeddings

In [66]:
saved_dict = output_weights_as_json(embed_mat, os.path.join("./Final_data/embeddings/{}/".format(domain_name),"tf_weights" +".json"), int_to_vocab)

In [11]:
#1) Embeddings -30mins
#2) Bilstm - 15mins
#3) ML - 1hr

## Embeddings Visualization across Domains

In [11]:

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

In [None]:
def get_common_words_across_domains(domain_vocabs_list):
    #Simple optimum -> Start with domain with smallest vocab, and then keep intersecting-> counting freq will be a costlier operation
    f_set = set(smallest_vocab.keys())
    for domain_vocab in domain_vocabs_list:
        set1 = set(domain_vocab.keys())
        f_set = f_set.intersection(set1)
    
    return f_set

def get_most_prominent_common_words(domain_names, domain_freqs_list, remove_stop_words = False):
    '''Input: MAKE SURE THAT domains_names is in order of domain_freqs_list
    Output: Dictionary of words ordered by counts in domains and cumulative freq
    To obtain format: Key ["<ORDER_OF_DOMAINS>"] 
    '''
    assert len(domain_names) == len(domain_freqs_list) #Just a small limited check
    common_word_dict = {}
    common_words = get_common_words_across_domains(domain_freqs_list)
    common_word_dict["<ORDER_OF_DOMAINS>"] = domain_names+["Cumulative_count"]
    for word in common_words:
        if(remove_stop_words and word in stop_words):
            continue
        counts = []
        for domain_freq_dict in domain_freqs_list:
            counts.append(domain_freq_dict[word])
        counts.append(reduce(lambda x,y:x+y, counts))
        common_word_dict[word] = counts
    
    return common_word_dict

In [None]:
viz_words = 500
tsne = TSNE()
embed_tsne = tsne.fit_transform(embed_mat[:viz_words, :])

In [None]:
fig, ax = plt.subplots(figsize=(14, 14))
for idx in range(viz_words):
    plt.scatter(*embed_tsne[idx, :], color='steelblue')
    plt.annotate(int_to_vocab[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]), alpha=0.7)

In [None]:
class ContextPredictionEmbedding(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(ContextPredictionEmbedding,self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size*embedding_dim, 300)
        self.linear2 = nn.Linear(300, vocab_size)
        
    def forward(self, inputs):
        input_embedding = self.embeddings(inputs).view((1,-1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
    
    
    
losses = []
loss_function = nn.NLLLoss()
model = ContextPredictionEmbedding(len(vocab), 300, 2)
optimizer = optim.SGD(model.parameters(), lr = 0.001)

for epoch in range(10):
    total_loss = torch.Tensor([0]):
    for target, context in training_batch:
        
        #1) Convert target var to embedding and wrap as a variable
        target_id = [vocab[target]]
        context_id = [vocab[context]]
        target_var = autograd.Variable(torch.LongTensor(target_id))
        context_var = autograd.Variable(torch.LongTensor(context))
        
        #2) reset gradients
        model.zero_grad()
        
        #3) run forward pass
        log_probs = model(target_var)
        
        #4) compute loss and update parameters
        loss = loss_function(log_probs, context_var)
        loss.backward()
        optimizer.step()
        total_loss+= loss.data
        
    losses.append(total_loss)
        