## Document Encoder
This notebook takes the documents (the Z column) from our dataset and passes them through the encoder to create the document embeddings.
The encoder used here is BertModel, with BertTokenizer as the tokenizer.

The model is used as a pre-trained model and is not fine-tuned on the current data.
This is done for two reasons:
1) We index the documents. If we trained the document encoder, the index would have to be rebuilt every time the encoder changed, which is compute-intensive.
2) If we wanted to add more documents, we would need to re-train the complete model, which is another compute-intensive task.

## Challenges faced here
1) In spite of using Colab Pro with the High-RAM option, CUDA runs out of memory after some 60% of the training data, which is about 14000 samples.
So I divided the training data into two parts: I encoded the first part and saved it, killed the kernel, restarted, encoded the second part and saved it, and then repeated the process to encode the test data.
2) About 50% of the documents were longer than the maximum number of tokens the BERT model can take (512 tokens), so they are truncated (`truncation=True` in the tokenizer calls below).
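
Truncation discards everything past the token limit. One common alternative, which this notebook does not use, is to split a long document into overlapping windows and encode each window separately. The sketch below works on a plain list of token ids; `split_into_windows`, `max_len` and `stride` are hypothetical names, and the default of 510 assumes two positions are reserved for the `[CLS]` and `[SEP]` tokens.

```python
def split_into_windows(token_ids, max_len=510, stride=128):
    """Split a list of token ids into windows of at most max_len tokens.

    Consecutive windows share `stride` tokens so that text cut at a
    window boundary still appears with some context in the next window.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    step = max_len - stride  # how far each window start advances
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
    return windows


# A stand-in "document" of 1000 token ids.
doc = list(range(1000))
windows = split_into_windows(doc)
print([len(w) for w in windows])
```

How the per-window embeddings are combined afterwards (for example by averaging them, or by indexing every window as its own entry) is a separate design choice that would have to be evaluated on this dataset.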


In [None]:
!pip install transformers



In [None]:
import pandas as pd
import numpy as np

# torch imports
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

#from transformers import BertTokenizer

In [None]:
train_data=pd.read_csv("/content/drive/MyDrive/train_qa.csv")

In [None]:
train_data.head()

Unnamed: 0,X,Y,Z
0,"How to use torch.atan, give an example?",>>> a = torch.randn(4)\n>>> a\ntensor([ 0.2341...,>>> a = torch.randn(4)\n>>> a\ntensor([ 0.2341...
1,How can a handle be used to remove the added h...,callinghandle.remove(),Hooks will be called in order of registration....
2,What tensor of sizewin_length can a window be?,1-D,"windowcan be a 1-D tensor of sizewin_length, e..."
3,What did aScriptModuleorScriptFunction previou...,withtorch.jit.save,"Functionally equivalent to aScriptModule, but ..."
4,What is used as an entry point into aScriptMod...,annn.Module,Warning The@torch.jit.ignoreannotation’s behav...


In [None]:
batch_size=8

In [None]:
train_data.head()

Unnamed: 0,X,Y,Z
10008,"How to use torch.atan, give an example?",>>> a = torch.randn(4)\n>>> a\ntensor([ 0.2341...,>>> a = torch.randn(4)\n>>> a\ntensor([ 0.2341...
6408,How can a handle be used to remove the added h...,callinghandle.remove(),Hooks will be called in order of registration....
17395,What tensor of sizewin_length can a window be?,1-D,"windowcan be a 1-D tensor of sizewin_length, e..."
15488,What did aScriptModuleorScriptFunction previou...,withtorch.jit.save,"Functionally equivalent to aScriptModule, but ..."
11847,What is used as an entry point into aScriptMod...,annn.Module,Warning The@torch.jit.ignoreannotation’s behav...


In [None]:
train_data1=train_data[0:7200]

In [None]:
train_data2=train_data[7200:]

In [None]:
#doctrain_data=[t for t in train_data['Z']]

In [None]:
#BERT imports
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'Tesla T4'

In [None]:
#BertModel to encode context (Z)
bert_z=BertModel.from_pretrained("bert-base-uncased")
for param in bert_z.parameters():
    param.requires_grad = False

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Get document embeddings for training
The training embeddings were computed in two parts because CUDA runs out of memory.
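
A plausible contributor to the memory pressure (an assumption, not something measured in this notebook) is what gets stored per batch. In PyTorch a slice such as `outputs[0][:, 0, :]` is a view into the full last hidden state, so appending the slice to a list keeps the whole `(batch, seq_len, 768)` tensor alive, whereas a copy of the slice (for example via `.clone()`) holds only the `[CLS]` vectors. The arithmetic below assumes float32 activations and that `padding=True` pads every document to 512 tokens; `retained_bytes` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4


def retained_bytes(num_docs, seq_len, hidden, keep_full_hidden_state):
    """Bytes kept alive after encoding num_docs documents."""
    # A view keeps every token position alive; a copy keeps one vector per document.
    positions_per_doc = seq_len if keep_full_hidden_state else 1
    return num_docs * positions_per_doc * hidden * BYTES_PER_FLOAT32


full = retained_bytes(7200, seq_len=512, hidden=768, keep_full_hidden_state=True)
cls_only = retained_bytes(7200, seq_len=512, hidden=768, keep_full_hidden_state=False)

print(f"views of full hidden states: {full / 2**30:.2f} GiB")
print(f"copied [CLS] vectors only:   {cls_only / 2**20:.2f} MiB")
```

Under these assumptions the first 7200 documents alone account for roughly 10.5 GiB when views are stored, against about 21 MiB when only the `[CLS]` vectors are copied.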


In [None]:
tokenized_train1_z=tokenizer([t for t in train_data1['Z']],truncation=True,padding=True, return_tensors='pt')
tokenized_train2_z=tokenizer([t for t in train_data2['Z']],truncation=True,padding=True, return_tensors='pt')


In [None]:
train1_dataset_z=TensorDataset(tokenized_train1_z['input_ids'],tokenized_train1_z['attention_mask'],tokenized_train1_z['token_type_ids'])
train2_dataset_z=TensorDataset(tokenized_train2_z['input_ids'],tokenized_train2_z['attention_mask'],tokenized_train2_z['token_type_ids'])

In [None]:
z1_sampler=SequentialSampler(train1_dataset_z)
z2_sampler=SequentialSampler(train2_dataset_z)

z_train1_dataloader=DataLoader(train1_dataset_z, sampler=z1_sampler, batch_size=batch_size, shuffle=False)
z_train2_dataloader=DataLoader(train2_dataset_z, sampler=z2_sampler, batch_size=batch_size, shuffle=False)


In [None]:
len(train1_dataset_z)

In [None]:
from tqdm import tqdm, trange

In [None]:
torch.cuda.empty_cache()

In [None]:
bert_z.to(device)  # same device the batches are moved to in the loops below

In [None]:
num_train_epochs = 1
encoded_z=[]
z_train_iterator = trange(num_train_epochs,desc="epochs")
for _ in z_train_iterator:
    epoch_iterator = tqdm(z_train1_dataloader, desc="Iteration")
    for step, batch in enumerate(epoch_iterator):
      batch = tuple(t.to(device) for t in batch)
      inputs = {'input_ids':       batch[0],
                'attention_mask':  batch[1], 
                'token_type_ids':  batch[2]}
      outputs = bert_z(**inputs)
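      # Copy only the [CLS] vector to the CPU; storing the GPU slice (a view) would keep the full hidden-state tensor alive.
      cls_tok=outputs[0][:,0,:].clone().cpu()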
      encoded_z.append(cls_tok)

In [None]:
len(encoded_z)

In [None]:
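# Concatenate the per-batch [CLS] vectors into one (num_docs, 768) tensor.
# Unlike torch.stack, torch.cat also works when the last batch is smaller than batch_size.
new_z1=torch.cat(encoded_z, dim=0)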

In [None]:
torch.save(new_z1,"encoded1_doc.pt")
from google.colab import files
files.download('encoded1_doc.pt')

In [None]:
torch.cuda.empty_cache()

In [None]:
num_train_epochs = 1
encoded_z=[]
z_train_iterator = trange(num_train_epochs,desc="epochs")
for _ in z_train_iterator:
    epoch_iterator = tqdm(z_train2_dataloader, desc="Iteration")
    for step, batch in enumerate(epoch_iterator):
      batch = tuple(t.to(device) for t in batch)
      inputs = {'input_ids':       batch[0],
                'attention_mask':  batch[1], 
                'token_type_ids':  batch[2]}
      outputs = bert_z(**inputs)
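      # Copy only the [CLS] vector to the CPU; storing the GPU slice (a view) would keep the full hidden-state tensor alive.
      cls_tok=outputs[0][:,0,:].clone().cpu()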
      encoded_z.append(cls_tok)

In [None]:
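# Concatenate the per-batch [CLS] vectors into one (num_docs, 768) tensor.
# Unlike torch.stack, torch.cat also works when the last batch is smaller than batch_size.
new_z1=torch.cat(encoded_z, dim=0)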

In [None]:
torch.save(new_z1,"encoded2_doc.pt")
from google.colab import files
files.download('encoded2_doc.pt')

Now compute the embeddings for the test data.
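
The encoding loop is the same for every split, so it could be factored into one helper. This is a minimal sketch rather than the code this notebook runs: `encode_cls` and `TinyEncoder` are hypothetical names, and `TinyEncoder` is a stand-in for `BertModel` so that the example runs without downloading any weights. It assumes the model returns the last hidden state as the first element of its output, as `BertModel` does.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()  # inference only, so no autograd bookkeeping is needed
def encode_cls(model, dataloader, device):
    """Return one [CLS] vector per row of the dataloader, as a CPU tensor."""
    model.eval()
    chunks = []
    for input_ids, attention_mask, token_type_ids in dataloader:
        outputs = model(
            input_ids=input_ids.to(device),
            attention_mask=attention_mask.to(device),
            token_type_ids=token_type_ids.to(device),
        )
        # outputs[0] has shape (batch, seq_len, hidden). clone() makes a compact
        # copy of the first position so the full tensor can be freed.
        chunks.append(outputs[0][:, 0, :].clone().cpu())
    # torch.cat tolerates a smaller final batch, unlike torch.stack.
    return torch.cat(chunks, dim=0)


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for BertModel: an embedding lookup only."""

    def __init__(self, vocab_size=100, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        return (self.embed(input_ids),)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ids = torch.randint(0, 100, (10, 12))  # 10 fake documents of 12 token ids each
mask = torch.ones_like(ids)
types = torch.zeros_like(ids)
loader = DataLoader(TensorDataset(ids, mask, types), batch_size=4, shuffle=False)

embeddings = encode_cls(TinyEncoder().to(device), loader, device)
print(embeddings.shape)
```

With the real model the call would look like `encode_cls(bert_z, z_test_dataloader, device)`. Ten rows with a batch size of 4 give batches of 4, 4 and 2, which is the uneven case where concatenating is needed instead of stacking.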

In [None]:
test_data=pd.read_csv("/content/drive/MyDrive/test_qa.csv")

In [None]:
test_data.head()

In [None]:
tokenized_test_z=tokenizer([t for t in test_data['Z']],truncation=True,padding=True, return_tensors='pt')

In [None]:
test_dataset_z=TensorDataset(tokenized_test_z['input_ids'],tokenized_test_z['attention_mask'],tokenized_test_z['token_type_ids'])

In [None]:
z_sampler=SequentialSampler(test_dataset_z)

z_test_dataloader=DataLoader(test_dataset_z, sampler=z_sampler, batch_size=batch_size, shuffle=False)


In [None]:
num_test_epochs = 1
encoded_z=[]
z_test_iterator = trange(num_test_epochs,desc="epochs")
for _ in z_test_iterator:
    epoch_iterator = tqdm(z_test_dataloader, desc="Iteration")
    for step, batch in enumerate(epoch_iterator):
      batch = tuple(t.to(device) for t in batch)
      inputs = {'input_ids':       batch[0],
                'attention_mask':  batch[1], 
                'token_type_ids':  batch[2]}
      outputs = bert_z(**inputs)
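      # Copy only the [CLS] vector to the CPU; storing the GPU slice (a view) would keep the full hidden-state tensor alive.
      cls_tok=outputs[0][:,0,:].clone().cpu()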
      encoded_z.append(cls_tok)

In [None]:
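# Concatenate the per-batch [CLS] vectors into one (num_docs, 768) tensor.
# Unlike torch.stack, torch.cat also works when the last batch is smaller than batch_size.
new_z1=torch.cat(encoded_z, dim=0)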

In [None]:
torch.save(new_z1,"encoded3_doc.pt")
from google.colab import files
files.download('encoded3_doc.pt')