Q: Why do we need pretrained embeddings?

General embeddings
The purpose of embeddings is to represent words with information we deem useful for our task, which is roughly the definition of a feature for a model. The currently popular way to represent word information in deep learning is to encode the usage of a word through its neighbours/context. These context-specific features can be derived through tuned matrix transformations, i.e., a neural network. Thus, from a neural network point of view, for each word we try to obtain a representation/vector that, when transformed to the vocab space (softmax layer), results in high activations for the words it most often co-occurs with.
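A minimal sketch of the "vector to vocab space" step, in plain Python. The vocabulary, the 2-dimensional word vector and the output weights are all made-up numbers for illustration; in a real model they would be learned.

```python
import math

# Toy vocabulary and 2-dimensional vectors; all values are invented for illustration.
vocab = ["stock", "price", "ball", "bat"]
word_vector = [1.0, 0.2]   # representation of some input word
output_weights = [         # one row per vocabulary word
    [2.0, 0.1],    # stock
    [1.5, 0.3],    # price
    [-1.0, 0.5],   # ball
    [-1.2, 0.4],   # bat
]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Project the word vector into vocabulary space, then normalise with softmax.
scores = [sum(w * x for w, x in zip(row, word_vector)) for row in output_weights]
probs = softmax(scores)
most_likely = vocab[probs.index(max(probs))]
```

Training pushes `word_vector` and `output_weights` so that the words which really co-occur with the input word end up with the highest probabilities.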

These embeddings can be used in multiple ways for a deep learning text-specific task. They can be tuned further based on the task (this results in a lot more parameters), or kept fixed. In either case, we give the representation to a model and ask it to capture other information on top of it. A good detailed overview with other techniques: https://towardsdatascience.com/word-embeddings-exploration-explanation-and-exploitation-with-code-in-python-5dac99d5d795

2nd extension: Word senses
Consider the sentences:
S1: "The stock price took a hit during recession"
S2: "He hit the ball for a six" 
The word hit has a different meaning/contribution in each sentence. A word used in such varying ways is said to exhibit different senses, which is something the vanilla word embedding will not capture.
A simple method to capture different senses is to associate the POS tag with the word when computing the embedding. Hence, hit|NOUN and hit|VERB will have two different embeddings. 
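A rough sketch of the word|POS idea for S1 and S2. The sentences are assumed to be POS-tagged already; the tags below are written by hand for illustration rather than produced by a tagger, and `to_sense_tokens` and `build_vocab` are hypothetical helper names.

```python
# Hand-written (word, tag) pairs for S1 and S2; a real pipeline would get these from a tagger.
tagged_s1 = [("the", "DET"), ("stock", "NOUN"), ("price", "NOUN"), ("took", "VERB"),
             ("a", "DET"), ("hit", "NOUN"), ("during", "ADP"), ("recession", "NOUN")]
tagged_s2 = [("he", "PRON"), ("hit", "VERB"), ("the", "DET"), ("ball", "NOUN"),
             ("for", "ADP"), ("a", "DET"), ("six", "NUM")]

def to_sense_tokens(tagged_sentence):
    """Join each word with its POS tag so every (word, tag) pair becomes its own token."""
    return ["{}|{}".format(word, tag) for word, tag in tagged_sentence]

def build_vocab(sentences):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab

sense_sentences = [to_sense_tokens(tagged_s1), to_sense_tokens(tagged_s2)]
sense_vocab = build_vocab(sense_sentences)
```

Since `hit|NOUN` and `hit|VERB` get separate ids, an embedding layer indexed by `sense_vocab` learns a separate vector for each.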

3rd extension: Domain extension-Domain linkers, etc
Now consider the scenario for our problem, wherein we train a deep learning model to detect aspect and opinion terms with our pretrained word embeddings as its input features. The resultant trained model, based on its architecture, learns the transformation matrices (which may be used to derive additional variables for computation, e.g. attention) to transform the data into a 'latent/middle ground' space.
Let's break down the computation and training process of a BiLSTM-CRF model:
1. Prepare the sequence of inputs and their corresponding embeddings.
2. Run the computation steps to obtain the intermediate features for each word: a composite of transformations of the two hidden BiLSTM states (https://arxiv.org/abs/1511.00215)
3. Compute the log likelihoods based on the CRF feature weights to output sequence labels.
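Step 3 can be sketched in plain Python for a linear-chain CRF. This is a simplified illustration, not the exact model from the paper: `emissions` stands in for the per-word BiLSTM features of step 2, `transitions` for the CRF tag-to-tag weights, start/stop transitions are left out, and all numbers are invented.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(emissions, transitions, tags):
    """Unnormalised score: emission score per step plus transitions between consecutive tags."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score

def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores over all possible tag sequences."""
    num_tags = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j] + logsumexp([alpha[i] + transitions[i][j] for i in range(num_tags)])
            for j in range(num_tags)
        ]
    return logsumexp(alpha)

def log_likelihood(emissions, transitions, tags):
    """Log probability of one tag sequence under the CRF."""
    return sequence_score(emissions, transitions, tags) - log_partition(emissions, transitions)

# Toy example: 3 words, 2 tags (0 = other, 1 = aspect term).
emissions = [[0.5, 1.0], [2.0, 0.1], [0.3, 0.9]]
transitions = [[0.2, -0.5], [0.4, 0.1]]
ll = log_likelihood(emissions, transitions, [1, 0, 1])
```

Training maximises this log likelihood for the gold tag sequences; at prediction time the highest-scoring sequence would be found with Viterbi decoding, which is not shown here.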

I suppose that the network tries to do the following:
Given 

The question now is what to do if we have two domains, where labels are available for the first one but only a limited number of labels are available for the second one (we can take the case of being able to ask which sentence really needs a label, i.e. which would help our model the most, etc).

One simple approach would be to say that, since the embeddings are tuned generally, we should have similar feature representations for similar words across domains. This can be viewed as sharing a common latent space, which can be done by finding similar word contributions amongst words in sentences of different domains: basically, words that perform similar roles should have similar embeddings.
An easier approach is to say that words that are linked by the same word across domains should have the same embedding, since they're the same feature.
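One way to read the easier approach is a single vocabulary shared by both domains, so a word that occurs in both maps to the same embedding row. The sketch below assumes this reading; the two tiny corpora (laptop and restaurant reviews) and the helper name `build_shared_vocab` are hypothetical.

```python
def build_shared_vocab(source_sentences, target_sentences):
    """Map every word to one index; a word seen in both domains keeps a single index."""
    vocab = {}
    for sentence in source_sentences + target_sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab

# Hypothetical tokenised sentences from two domains.
laptop_sentences = [["the", "screen", "is", "great"], ["battery", "life", "is", "poor"]]
restaurant_sentences = [["the", "pasta", "is", "great"], ["service", "was", "poor"]]

shared_vocab = build_shared_vocab(laptop_sentences, restaurant_sentences)

source_words = {word for sentence in laptop_sentences for word in sentence}
target_words = {word for sentence in restaurant_sentences for word in sentence}
# Words occurring in both domains share one embedding row and act as links between them.
linking_words = sorted(source_words & target_words)
```

Here the opinion words `great` and `poor` end up as links, while aspect words such as `screen` and `pasta` stay domain-specific.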

--> Another problem in domain adaptation for general sentiment analysis (not just extraction) is that words can connote different sentiments depending on what they describe:
easy -> good for a test, perhaps bad for describing a footballer
difficult -> good for describing defence, bad for describing a situation
(https://nlp.stanford.edu/projects/socialsent/)

--> This is again why a reasoning structure is needed: soft, when used in football, can describe a soft shot (-), soft tackle (-), soft touch (-) or feather-like control (+).

In [6]:
import pickle
import pandas as pd
import csv
import torch
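import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F  # used as F.relu / F.log_softmax in the model
import torch.optim as optim      # used for optim.SGD in the training loop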
torch.manual_seed(8)

<torch._C.Generator at 0x7f0bb0241530>

In [7]:
#Input layer and vocab one hot encoded inputs
#Embedding layer and its resultant transformation into vocab size<- another parameter
#Noisy inputs, etc 

## Auxiliary functions (subsampling, sense, etc)
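The notes do not say which subsampling function is meant. A minimal sketch, assuming the heuristic from the word2vec paper (Mikolov et al., 2013), where a word with relative frequency f is discarded with probability 1 - sqrt(t / f); the function names and the toy corpus are made up for illustration.

```python
import math
import random
from collections import Counter

def discard_probability(word_count, total_count, threshold=1e-5):
    """P(discard) = 1 - sqrt(threshold / frequency), clipped at 0 for rare words."""
    frequency = word_count / float(total_count)
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))

def subsample(tokens, threshold=1e-5, rng=random):
    """Return a copy of tokens with frequent words randomly dropped."""
    counts = Counter(tokens)
    total = len(tokens)
    return [
        word for word in tokens
        if rng.random() >= discard_probability(counts[word], total, threshold)
    ]

# Toy corpus: "the" is very frequent, "laptop" is rare.
tokens = ["the"] * 90 + ["laptop"] * 10
rng = random.Random(8)
# A large threshold is used only because the toy corpus is tiny.
kept = subsample(tokens, threshold=0.1, rng=rng)
```

With these numbers "laptop" is never dropped, while each occurrence of "the" is dropped with probability of roughly two thirds.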

In [None]:
import numpy as np

### Subsampling-> Remove high-frequency words with a probability given by the subsampling function (not implemented yet)


### Generate target context pairs
def get_target(words, index, window_size):
    '''Given a window size and current index of target, return the context words'''
    # Random radius between 1 and window_size (inclusive)
    r = np.random.randint(1, window_size+1)
    start = max(index - r, 0)
    stop = index + r
    # Words on both sides of the target, with duplicates removed
    target_words = list(set(words[start:index] + words[index+1:stop+1]))
    return target_words


def generate_training_batch(tokenized_corpus, batch_size, window_size = 5):
    '''This runs over the entire dataset once, yielding (targets, contexts) for each batch'''
    # This is assuming that each target->all contexts are taken as a single element
    num_batches = int(np.ceil(float(len(tokenized_corpus))/batch_size))

    for batch_num in range(num_batches):
        batch_start = batch_num*batch_size
        # The last batch is simply shorter when the corpus size is not a multiple of batch_size
        batch_stop = min(batch_start + batch_size, len(tokenized_corpus))
        targets, contexts = [], []
        # For each word, obtain the context words with a random radius ranging from 1 to the desired window size
        for index in range(batch_start, batch_stop):
            targets.append(tokenized_corpus[index])
            contexts.append(get_target(tokenized_corpus, index, window_size))
        yield targets, contexts

## Load the dataset

In [8]:
class ContextPredictionEmbedding(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(ContextPredictionEmbedding,self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size*embedding_dim, 300)
        self.linear2 = nn.Linear(300, vocab_size)
        
    def forward(self, inputs):
        # Look up the input ids and flatten them into a single row vector
        input_embedding = self.embeddings(inputs).view((1,-1))
        out = F.relu(self.linear1(input_embedding))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
    
    

In [10]:
with open('./Final_data/laptop_lower_additional_training_list.pickle', 'rb') as f:
    training_data_laptop = pickle.load(f)

In [11]:
with open('./Final_data/laptop_lower_vocab.pickle', 'rb') as f:
    laptop_vocab = pickle.load(f)

In [None]:
losses = []
loss_function = nn.NLLLoss()
# A single target word is fed in per step, so context_size is 1 here
model = ContextPredictionEmbedding(len(laptop_vocab), 300, 1)
optimizer = optim.SGD(model.parameters(), lr = 0.001)

# training_batch is not defined in this notebook yet; it is assumed to be an iterable of
# (target, context) word pairs, and laptop_vocab is assumed to map word -> index
for epoch in range(10):
    total_loss = 0.0
    for target, context in training_batch:
        
        #1) Convert the target and context words to vocab ids and wrap them as tensors
        target_var = torch.LongTensor([laptop_vocab[target]])
        context_var = torch.LongTensor([laptop_vocab[context]])
        
        #2) reset gradients
        model.zero_grad()
        
        #3) run forward pass
        log_probs = model(target_var)
        
        #4) compute loss and update parameters
        loss = loss_function(log_probs, context_var)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        
    losses.append(total_loss)