Q: Why do we need pretrained embeddings?

General embeddings
The purpose of embeddings is to represent words with information we deem useful for our task, which is loosely the definition of a feature for a model. The currently popular way to represent word information in deep learning is to encode the usage of a word through its neighbours/context. These context-specific features can be derived through tuned matrix transformations, i.e., a neural network. Thus, from a neural network point of view, for each word we try to obtain a representation/vector that, when transformed to the vocab space (softmax layer), results in high activations for the words it most often co-occurs with.
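
As a rough illustration of the last sentence, the sketch below projects one word vector into vocab space and applies a softmax. The toy vocabulary, the 2-dimensional vectors and the helper names `softmax` and `context_distribution` are made up for illustration; a trained network would learn these values.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_distribution(word_vec, output_matrix):
    """Score the word vector against every vocabulary row, then normalise."""
    scores = [sum(w * o for w, o in zip(word_vec, row)) for row in output_matrix]
    return softmax(scores)

# Hypothetical toy values: 3-word vocabulary, 2-dimensional embeddings.
vocab = ["battery", "life", "waiter"]
battery_vec = [1.0, 0.5]
output_matrix = [
    [0.1, 0.0],    # battery
    [2.0, 1.0],    # life (aligned with battery_vec, so it scores highest)
    [-1.0, -0.5],  # waiter
]

probs = context_distribution(battery_vec, output_matrix)
best_context = vocab[probs.index(max(probs))]
print(best_context, [round(p, 3) for p in probs])
```

Training amounts to nudging the vectors so that the words that really co-occur end up with the largest probabilities.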

These embeddings can be used in multiple ways for a text-specific deep learning task. They can be tuned further for the task (this adds many more trainable parameters) or kept fixed. Either way, the representation is given to a model, which is asked to capture other information based on it. A good detailed overview with other techniques: https://towardsdatascience.com/word-embeddings-exploration-explanation-and-exploitation-with-code-in-python-5dac99d5d795

2nd extension: Word senses
Consider the sentences:
S1: "The stock price took a hit during recession"
S2: "He hit the ball for a six" 
The word hit has a different meaning/contribution in each sentence. This varying usage is described as the word exhibiting different senses, and it is something the vanilla word embedding will not capture.
A simple method to capture different senses is to associate the POS tag with the word when computing the embedding. Hence, hit|NOUN and hit|VERB will have two different embeddings. 
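
A minimal sketch of that idea, using the two sentences above. The POS tags are written by hand here (an assumption; in practice a tagger would supply them), and `add_sense` is a hypothetical helper name. The restriction to nouns, verbs and adjectives mirrors the processing code further down.

```python
def add_sense(tagged_tokens, sense_tags=("NOUN", "VERB", "ADJ")):
    """Append |POS to content words so that each sense gets its own vocab entry."""
    return [tok + "|" + pos if pos in sense_tags else tok
            for tok, pos in tagged_tokens]

# Hand-written (token, POS) pairs for S1 and S2.
s1 = [("the", "DET"), ("stock", "NOUN"), ("price", "NOUN"), ("took", "VERB"),
      ("a", "DET"), ("hit", "NOUN"), ("during", "ADP"), ("recession", "NOUN")]
s2 = [("he", "PRON"), ("hit", "VERB"), ("the", "DET"), ("ball", "NOUN"),
      ("for", "ADP"), ("a", "DET"), ("six", "NUM")]

# Each distinct key gets its own id, and therefore its own embedding row.
vocab = {}
for sentence in (s1, s2):
    for key in add_sense(sentence):
        vocab.setdefault(key, len(vocab))

print(vocab["hit|NOUN"], vocab["hit|VERB"])
```

The price of this trick is a larger vocabulary, and it only separates senses that differ in part of speech.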

3rd extension: Domain extension (domain linkers, etc.)
Now consider the scenario for our problem, wherein we train a deep learning model to detect aspect and opinion terms, with our pretrained word embeddings as its input features. The trained model, depending on its architecture, learns the transformation matrices (which may be used to derive additional variables for computation, e.g. attention) that transform the data into a 'latent/middle ground' space.
Let's break down the computation and training process of a BiLSTM-CRF model:
1. Prepare the sequence of inputs and the corresponding embeddings.
2. Run the computation steps to obtain the intermediate features for each word, a composite of transformations of the two hidden BiLSTM states (https://arxiv.org/abs/1511.00215)
3. Compute the log-likelihoods based on the CRF feature weights to output sequence labels.
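
Step 3 can be sketched in plain Python as below. This is a simplified linear-chain CRF with no start/end transitions, and the emission and transition numbers are invented; in the real model the emissions would come from the BiLSTM features of step 2. All function names are hypothetical.

```python
import math
from itertools import product

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(emissions, transitions, tags):
    """Score of one tag path: emission scores plus tag-to-tag transition scores."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score

def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores over all tag paths."""
    n_tags = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [logsumexp([alpha[p] + transitions[p][c] for p in range(n_tags)])
                 + emissions[t][c] for c in range(n_tags)]
    return logsumexp(alpha)

def log_likelihood(emissions, transitions, tags):
    """log P(tags | sentence) = path score - log partition."""
    return path_score(emissions, transitions, tags) - log_partition(emissions, transitions)

# Invented numbers: 3 tokens, 2 tags.
emissions = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]]
transitions = [[0.5, -0.2], [0.1, 0.4]]  # transitions[prev][cur]

gold = [0, 1, 0]
ll = log_likelihood(emissions, transitions, gold)

# Sanity check: the forward algorithm must agree with brute-force enumeration.
all_paths = [list(p) for p in product(range(2), repeat=3)]
brute = logsumexp([path_score(emissions, transitions, p) for p in all_paths])
print(ll, abs(brute - log_partition(emissions, transitions)) < 1e-9)
```

Training maximises this log-likelihood for the gold label sequence; decoding would use the same scores with a max in place of the log-sum-exp.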

I suppose that the network tries to do the following:
Given 

The question now is what to do if we have two domains, where labels are available for the first one but only a limited number of labels are available for the second (we can take the case of being able to ask which sentence really needs a label, i.e. which one would help our model the most, etc.).

One simple approach would be to say that since the data is tuned generally, we should have similar feature representations for similar words across domains. This can be viewed as sharing a common latent space, which can be done by finding similar word contributions amongst words in sentences of different domains: words that perform similar roles should have similar embeddings.
An easier approach is to say that words that are linked by the same word across domains should have the same embedding, since they're the same feature.
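
A small sketch of the easier approach, assuming we simply build one vocabulary over both domains so that a word seen in both gets a single id and hence a single embedding row. The two token lists and the names `build_shared_vocab` and `lookup` are hypothetical.

```python
def build_shared_vocab(domain_a_tokens, domain_b_tokens):
    """Map every word to one id; a word seen in both domains gets a single id."""
    vocab = {}
    for token in list(domain_a_tokens) + list(domain_b_tokens):
        vocab.setdefault(token, len(vocab))
    shared = set(domain_a_tokens) & set(domain_b_tokens)
    return vocab, shared

# Hypothetical tokens from a laptop review and a restaurant review.
laptop = ["the", "battery", "is", "great"]
restaurant = ["the", "staff", "is", "horrible"]

vocab, shared = build_shared_vocab(laptop, restaurant)

# One embedding row per id; both domains read and update the same row for "the".
dim = 4
embedding_table = [[0.0] * dim for _ in range(len(vocab))]

def lookup(token):
    return embedding_table[vocab[token]]

print(sorted(shared), len(vocab))
```

The shared words then act as anchors between the domains, while domain-specific words such as "battery" or "staff" keep their own rows.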

--> Another problem in domain adaptation for general sentiment analysis (not just extraction) is that words can connote different sentiments across domains:
easy-> good for a test, perhaps bad for describing a footballer
difficult-> good for describing defence, bad for describing a situation
(https://nlp.stanford.edu/projects/socialsent/) 

--> This is again why a reasoning structure is needed: 'soft', when used in football, can describe a soft shot(-), a soft tackle(-), a soft touch(-), or feather-like control(+).

In [1]:
import pickle
import pandas as pd
import csv
import torch
import torch.autograd as autograd
import torch.nn as nn
torch.manual_seed(8)

<torch._C.Generator at 0x7f16aaba15b8>

In [26]:
#Input layer and vocab one hot encoded inputs
#Embedding layer and its resultant transformation into vocab size<- another parameter
#Noisy inputs, etc 
import csv
import string
import pickle
import pandas as pd
import nltk
import spacy

nlp = spacy.load('en') #load spacy model

    
def obtain_tokens(sentence):
    return sentence.split()

start_tag = "<START>"
end_tag = "<END>"
tag_to_id = {start_tag:0,end_tag:-1, "BA":1, "IA":2, "BO":3, "IO":4, "OT":5}
id_to_tag = {id_: tag for tag,id_ in tag_to_id.items()}

'''The below functions that use Spacy can be made faster using Spacy specific syntax- example using doc, checking for verbs, etc. However, here it is converted to Python string format to make the primary function compatible with other NLP packages'''
def get_tokenized_list_for_delim(phrases, tokenizer_func = obtain_tokens, delim =";"):
    '''
    Input: A sequence of phrases delimited as provided by delimiter: Ex A B; C; D E F
    Output: A list of tokenized phrases: [[A, B], [C], [D, E, F]]
    '''
    split_phrases = phrases.split(delim)
    #Build a real list (not a lazy map object) because callers index it and take its length
    tokenized_split_phrases = [tokenizer_func(x) for x in split_phrases]
    return tokenized_split_phrases


punctuation_dict = {'.':' <PERIOD> ',
                    ',':' <COMMA> ', 
                    '"':' <QUOTATION_MARK> ', 
                    ';':' <SEMICOLON> ',
                    '!': ' <EXCLAMATION_MARK> ',
                    '?': ' <QUESTION_MARK> ', 
                    ':': ' <COLON> ',
                    '-': ' <HYPHEN> ',
                    '(': ' <LEFT PARENTHESIS> ',
                    ')': ' <RIGHT PARENTHESIS> '
                   }


def process_text_pos_and_punctuation(text, replace_punctuation):
    text = nlp(unicode(text))
    tagged_text = []
    tokenized_text = []
    if(replace_punctuation):
        for token in text:
            temp_token = token
            if(str(token.pos_) == 'PUNCT'):
                if(str(token) in punctuation_dict.keys()):
                    token = punctuation_dict[str(token)]
            tokenized_text.append(str(token))
            tagged_text.append((str(token), str(temp_token.pos_), str(temp_token.tag_), str(temp_token.dep_)))
    
    else:
        tagged_text = [(str(token), str(token.pos_), str(token.tag_), str(token.dep_)) for token in text]
        tokenized_text = [x[0] for x in tagged_text] #a list, so that callers can index and modify it
    return tokenized_text, tagged_text

def store_dataset_info(df, save_to_file = False, save_path ='Final_data/laptop', include_sense = False,  replace_punctuation= True, to_lower = True, get_tags = True, remove_punctuation= False, get_dependency_structure = False, tokenizer_func= obtain_tokens):
    training_data = []
    training_data_with_other_stuff = []
    out_row = []
    token_to_freq = {} 
    #token_sense_to_id = {}
    
    token_to_id = {}
    loop_i = 0 
    for sentence, aspect_words, opinion_words in zip(df.Sentence, df.Aspects, df.Opinions):
        
        
        if(remove_punctuation):
            sentence = sentence.translate(None, string.punctuation)
        
        
        if(to_lower):
            sentence = sentence.lower()
            
        
        tokenized_sentence, tagged_text = process_text_pos_and_punctuation(sentence, replace_punctuation)

        if(get_dependency_structure):
            #Obtain dependency tree
            dependency_tree = get_dependency_tree(sentence)
        #if(get_parser_tree):
            #Obtain parser tree
         #   None
        
        
        aspect_list = get_tokenized_list_for_delim(aspect_words, tokenizer_func, ';')
        opinion_list = get_tokenized_list_for_delim(opinion_words, tokenizer_func, ';')
        #print(aspect_list)
        seq_absa_tagged = [] 
        aspect_ptr = 0
        opinion_ptr = 0
        skip = 0
        
        #labels = []
        for i_loop, token in enumerate(tokenized_sentence):
            temp_token = token 
            
            #1) If sense is included then we change token to token|POS
            if(include_sense):
                '''THIS is specific to spacy format'''
            
                #Only limit to nouns, verbs and adjectives
                
                if(tagged_text[i_loop][1] in ['NOUN','VERB','ADJ']):
                    token+= '|'+ tagged_text[i_loop][1]
                    tokenized_sentence[i_loop] = token
                    #print(token)
                    
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
                token_to_freq[token] = 1 #the first occurrence counts as one
            else:
                token_to_freq[token] += 1
                
                
                
            if(skip>0): #if we encounter incomplete aspect/opinion matches previously
                skip -= 1
             
            else:    
                label = [tag_to_id["OT"]] #assume it is OT, store as list in case of multiple aspect/opinion terms
        
        
                token = temp_token
                
                if(aspect_ptr < len(aspect_list) and token == aspect_list[aspect_ptr][0]): #Incomplete match: battery--> battery life
                    label = [tag_to_id["BA"]]
                    skip = len(aspect_list[aspect_ptr]) - 1 #words to skip ahead since they have been already covered
                    if(skip>0):
                        label += [tag_to_id["IA"] for i in aspect_list[aspect_ptr][1:]] 
                    aspect_ptr += 1 
        
                elif(opinion_ptr< len(opinion_list) and token == opinion_list[opinion_ptr][0]): 
                    label = [tag_to_id["BO"]]
                    skip = len(opinion_list[opinion_ptr]) - 1
                    if(skip>0):
                        label += [tag_to_id["IO"] for i in opinion_list[opinion_ptr][1:]]
                    opinion_ptr += 1 
        
            
                seq_absa_tagged+=label
                
        if(get_dependency_structure and get_tags): #both requested; the single-option cases are handled below
            training_data_with_other_stuff.append((tokenized_sentence, seq_absa_tagged, tagged_text, dependency_tree))
            out_row = ["Sentence", "Sequence", "Tags","Dependency Tree"]
        
        elif(get_tags):
            training_data_with_other_stuff.append((tokenized_sentence, seq_absa_tagged, tagged_text))
            out_row = ["Sentence", "Sequence", "Tags"]
        
        elif(get_dependency_structure):
            training_data_with_other_stuff.append((tokenized_sentence, seq_absa_tagged, dependency_tree))
            out_row = ["Sentence", "Sequence", "Dependency Tree"]
        else:
            training_data_with_other_stuff.append((tokenized_sentence, seq_absa_tagged))
            out_row = ["Sentence", "Sequence"]
            
        training_data.append((tokenized_sentence, seq_absa_tagged))
        #Write data to csv and save vocab as pickle
        
        if(loop_i%300==0):
            print("At sentence: {}".format(loop_i))
            print(training_data_with_other_stuff[-1]) #the sentence that was just processed
        loop_i+=1
    
    
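    opt_info = "Normal_"
    if(include_sense):
        opt_info='WITH_SENSE_'

    #Build the returned paths up front so that the return statement also works when save_to_file is False
    processed_normal_csv_path = "{}/{}_absa_seq_labelled.csv".format(save_path, opt_info)
    processed_vocab_pickle = "{}/{}_vocab.pickle".format(save_path, opt_info)

    if(save_to_file):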
        with open(processed_normal_csv_path,'wb') as fout:
            csv_out = csv.writer(fout)
            csv_out.writerow(["Sentence","Sequence"])
            for row in training_data:
                csv_out.writerow(row)
                
        processed_additional_info_csv_path = "{}/{}_additional_info_seq_labelled.csv".format(save_path, opt_info)
        with open(processed_additional_info_csv_path, 'wb') as fout:
            csv_out = csv.writer(fout)
            csv_out.writerow(out_row)
            for row in training_data_with_other_stuff:
                csv_out.writerow(row) 
        
        processed_norm_training_data_pickle = "{}/{}_normal_training_list.pickle".format(save_path, opt_info)
        with open(processed_norm_training_data_pickle,'wb') as pickle_o:
            pickle.dump(training_data, pickle_o)
        
        processed_add_training_data_pickle = "{}/{}_additional_training_list.pickle".format(save_path, opt_info)
        with open(processed_add_training_data_pickle,'wb') as pickle_o:
            pickle.dump(training_data_with_other_stuff, pickle_o)
        
        processed_vocab_pickle = "{}/{}_vocab.pickle".format(save_path, opt_info)
        with open(processed_vocab_pickle,'wb') as pickle_o: 
            pickle.dump(token_to_id, pickle_o)
        
        token_to_freq_pickle = "{}/{}_tokenfreq.pickle".format(save_path, opt_info)
        with open(token_to_freq_pickle,'wb') as pickle_o: 
            pickle.dump(token_to_freq, pickle_o)
        
        tag_to_id_pickle = "{}/{}_tag2id.pickle".format(save_path, opt_info)
        with open(tag_to_id_pickle, 'wb') as pickle_o:
            pickle.dump(tag_to_id, pickle_o)
    
    return save_path, processed_normal_csv_path, processed_vocab_pickle 

class Domain:
    def __init__(self, name, primary_dir, data_path, raw_csv_file=None, already_processed = True):
        """
        primary_dir refers to the directory where the domain info is stored
        Ideally data_path will be uniform throughout all domains.
        """
        self.name = name
        self.primary_dir = primary_dir
        self.raw_csv_file = raw_csv_file
        self.data_path = data_path #this is where the processed data is stored
        self.vocab_path = None
        self.embedding_dir = None
        if(not already_processed):
            self.tr_data_path, self.vocab_path = self.run_data_processing(True) #with sense, lower
            _, _ = self.run_data_processing(False)
            
    def run_data_processing(self, with_sense= False):
        pd_file = pd.read_csv(self.raw_csv_file) #convert to pandas
        print("Processing raw data first with and then without sense")
        _, tr_data_path, vocab_path = store_dataset_info(pd_file, True, self.data_path, with_sense)
        return tr_data_path, vocab_path
        
    def get_domain_independent_features(self):
        pass #not implemented yet
        
    def get_feature_label_mutual_info(self):
        pass #not implemented yet
        
    def features(self):
        pass #not implemented yet
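
To make the labelling scheme easier to follow, here is a stripped-down sketch of the BA/IA/BO/IO/OT tagging on a made-up sentence. Unlike `store_dataset_info` above, which matches only the first token of each aspect/opinion phrase, this hypothetical `label_sequence` matches the whole phrase; it assumes the phrases are given as token lists in sentence order.

```python
TAGS = {"BA": 1, "IA": 2, "BO": 3, "IO": 4, "OT": 5}

def label_sequence(tokens, aspects, opinions):
    """Label each token; aspects/opinions are lists of token lists in sentence order."""
    labels = []
    i = 0
    a_ptr = o_ptr = 0
    while i < len(tokens):
        if a_ptr < len(aspects) and tokens[i:i + len(aspects[a_ptr])] == aspects[a_ptr]:
            span = len(aspects[a_ptr])
            labels += [TAGS["BA"]] + [TAGS["IA"]] * (span - 1)  # begin + inside aspect
            a_ptr += 1
        elif o_ptr < len(opinions) and tokens[i:i + len(opinions[o_ptr])] == opinions[o_ptr]:
            span = len(opinions[o_ptr])
            labels += [TAGS["BO"]] + [TAGS["IO"]] * (span - 1)  # begin + inside opinion
            o_ptr += 1
        else:
            span = 1
            labels.append(TAGS["OT"])  # everything else
        i += span
    return labels

# Made-up example sentence with one aspect phrase and one opinion phrase.
tokens = "the battery life is very good".split()
labels = label_sequence(tokens, [["battery", "life"]], [["very", "good"]])
print(labels)  # [5, 1, 2, 5, 3, 4]
```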
    

In [30]:
d2 = Domain('Rest','./Final_data/Domains/Rest/', './Final_data/Domains/Rest/','./Final_data/Semeval_14_ver1/Combined_restaurant.csv',False)

Processing raw data first with and then without sense
At sentence: 0
(['but', 'the', 'staff|NOUN', 'was|VERB', 'so', 'horrible|ADJ', 'to', 'us', ' <PERIOD> '], [5, 5, 1, 5, 5, 3, 5, 5, 5], [('but', 'CCONJ', 'CC', 'cc'), ('the', 'DET', 'DT', 'det'), ('staff', 'NOUN', 'NN', 'nsubj'), ('was', 'VERB', 'VBD', 'ROOT'), ('so', 'ADV', 'RB', 'advmod'), ('horrible', 'ADJ', 'JJ', 'acomp'), ('to', 'ADP', 'IN', 'prep'), ('us', 'PRON', 'PRP', 'pobj'), (' <PERIOD> ', 'PUNCT', '.', 'punct')])
At sentence: 300
(['this', 'place|NOUN', 'is|VERB', 'great|ADJ', ' <PERIOD> '], [5, 5, 5, 3, 5], [('this', 'DET', 'DT', 'det'), ('place', 'NOUN', 'NN', 'nsubj'), ('is', 'VERB', 'VBZ', 'ROOT'), ('great', 'ADJ', 'JJ', 'acomp'), (' <PERIOD> ', 'PUNCT', '.', 'punct')])
At sentence: 600
(['the', 'food|NOUN', 'is|VERB', 'o.k', ' <PERIOD> ', ' <COMMA> ', 'but', 'not', 'any', 'better|ADJ', 'than', 'what|NOUN', 'you', 'get|VERB', 'at', 'a', 'good|ADJ', 'neighborhood|NOUN', 'restaurant|NOUN', ' <PERIOD> '], [5, 1, 5, 5, 5,

In [31]:
with open('./Final_data/Domains/rest.pkl', 'wb') as output:
    pickle.dump(d2, output, pickle.HIGHEST_PROTOCOL) #d2 is the restaurant domain created above

## Auxiliary functions (subsampling, sense, etc)

In [4]:
### Subsampling-> Remove words with a probability that grows with their frequency, given the function



### Generate target context pairs
import numpy as np

def get_context(words, index, window_size):
    '''Given a window size and current index of target, return the context words'''
    r = np.random.randint(1, window_size+1) #random radius between 1 and window_size (inclusive)
    start = max(index - r, 0)
    stop = index + r
    context_words = list(set(words[start:index] + words[index+1:stop+1]))
    return context_words


def generate_training_batch(tokenized_corpus, batch_size, window_size = 5):
    '''This runs over the entire dataset once, yielding one (targets, contexts) batch at a time.
    tokenized_corpus is assumed to be a flat list of tokens.'''
    #This is assuming that each target->all contexts are taken as a single element
    num_batches = int(np.ceil(float(len(tokenized_corpus))/batch_size))

    for batch_num in range(num_batches): #the last batch may hold fewer than batch_size targets
        tr_x = []
        tr_y = []
        start = batch_num*batch_size
        batched_corpus = tokenized_corpus[start:start+batch_size]
        #For each word, we obtain the context words with a random radius ranging from 1 to the desired window size
        for offset, target in enumerate(batched_corpus):
            tr_x.append(target)
            tr_y.append(get_context(tokenized_corpus, start+offset, window_size))
        yield tr_x, tr_y
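
The subsampling step mentioned at the top of this cell is not implemented. The sketch below assumes the frequent-word heuristic from word2vec (Mikolov et al., 2013), where a word with relative frequency f is kept with probability min(1, sqrt(t / f)); the notes above do not say which function was intended, so treat the formula, the threshold and the function names as assumptions.

```python
import math
import random
from collections import Counter

def keep_probability(word_count, total_count, threshold=1e-3):
    """word2vec-style heuristic: P(keep) = min(1, sqrt(threshold / frequency))."""
    frequency = word_count / float(total_count)
    return min(1.0, math.sqrt(threshold / frequency))

def subsample(tokens, threshold=1e-3, rng=random):
    """Randomly drop tokens; frequent tokens are dropped more often."""
    counts = Counter(tokens)
    total = len(tokens)
    return [t for t in tokens
            if rng.random() < keep_probability(counts[t], total, threshold)]

# Hypothetical corpus: "the" is very frequent, "battery" is rare.
corpus = ["the"] * 900 + ["battery"] * 1 + ["screen"] * 99
p_the = keep_probability(900, 1000)
p_battery = keep_probability(1, 1000)
kept = subsample(corpus, rng=random.Random(8))
print(p_the, p_battery, len(kept))
```

The threshold is a tuning knob: the word2vec paper suggests values around 1e-5 for large corpora, whereas a small review dataset would likely need a much larger one.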
    

## Load the dataset

In [8]:
import torch.nn.functional as F

class ContextPredictionEmbedding(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(ContextPredictionEmbedding,self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size*embedding_dim, 300)
        self.linear2 = nn.Linear(300, vocab_size)
        
    def forward(self, inputs):
        #Look up the embeddings and flatten them into a single row vector
        input_embedding = self.embeddings(inputs).view((1,-1))
        out = F.relu(self.linear1(input_embedding))
        out = self.linear2(out) #scores over the whole vocabulary
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
    
    

In [10]:
with open('./Final_data/laptop_lower_additional_training_list.pickle', 'rb') as f: #pickles are binary files
    training_data_laptop = pickle.load(f)

In [11]:
with open('./Final_data/laptop_lower_vocab.pickle', 'rb') as f: #pickles are binary files
    laptop_vocab = pickle.load(f)

In [None]:
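import torch.optim as optim

vocab = laptop_vocab #assumption: the token -> id mapping loaded above is the vocab meant here
losses = []
loss_function = nn.NLLLoss()
#The model receives a single target word per step, so context_size is 1
model = ContextPredictionEmbedding(len(vocab), 300, 1)
optimizer = optim.SGD(model.parameters(), lr = 0.001)

for epoch in range(10):
    total_loss = torch.Tensor([0])
    #training_batch is assumed to be an iterable of (target word, context word) pairs
    for target, context in training_batch:
        
        #1) Convert target and context words to ids and wrap them as variables
        target_id = [vocab[target]]
        context_id = [vocab[context]]
        target_var = autograd.Variable(torch.LongTensor(target_id))
        context_var = autograd.Variable(torch.LongTensor(context_id))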
        
        #2) reset gradients
        model.zero_grad()
        
        #3) run forward pass
        log_probs = model(target_var)
        
        #4) compute loss and update parameters
        loss = loss_function(log_probs, context_var)
        loss.backward()
        optimizer.step()
        total_loss+= loss.data
        
    losses.append(total_loss)
        

In [3]:
import spacy
nlp = spacy.load('en')

In [13]:
x = "Hi Mr. K, what is up?"
text = nlp(unicode(x))

In [39]:
pos_tagged_text = [(str(token), str(token.pos_), str(token.tag_), str(token.dep_)) for token in text]

In [40]:
pos_tagged_text

[('Hi', 'INTJ', 'UH', 'intj'),
 ('Mr.', 'PROPN', 'NNP', 'compound'),
 ('K', 'PROPN', 'NNP', 'npadvmod'),
 (',', 'PUNCT', ',', 'punct'),
 ('what', 'NOUN', 'WP', 'nsubj'),
 ('is', 'VERB', 'VBZ', 'ROOT'),
 ('up', 'ADV', 'RB', 'advmod'),
 ('?', 'PUNCT', '.', 'punct')]

In [36]:
tokenized_text, tagged_text = process_text_pos_and_punctuation(x, True)

In [37]:
tokenized_text

['Hi', 'Mr.', 'K', ' <COMMA> ', 'what', 'is', 'up', ' <QUESTION_MARK> ']

In [38]:
tagged_text

[('Hi', 'INTJ', 'UH', 'intj'),
 ('Mr.', 'PROPN', 'NNP', 'compound'),
 ('K', 'PROPN', 'NNP', 'npadvmod'),
 (' <COMMA> ', 'PUNCT', ',', 'punct'),
 ('what', 'NOUN', 'WP', 'nsubj'),
 ('is', 'VERB', 'VBZ', 'ROOT'),
 ('up', 'ADV', 'RB', 'advmod'),
 (' <QUESTION_MARK> ', 'PUNCT', '.', 'punct')]