# Job Reccer Project - Job Search using Deep Learning Bi-Encoders

In [1]:
!pip install tensorboardX

Defaulting to user installation because normal site-packages is not writeable


In [2]:
from edited_roberta import *
from run_classifier import evaluate, load_and_cache_examples, accuracy, set_seed
import argparse
import glob
import logging
import os
import random
import numpy as np
import torch
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from torch.utils.data.distributed import DistributedSampler
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm, trange
from transformers import (WEIGHTS_NAME, get_linear_schedule_with_warmup, AdamW,RobertaConfig,RobertaTokenizer)
from utils import (compute_metrics, convert_examples_to_features,
                        output_modes, processors)

import pandas as pd

from torch.nn.functional import cosine_similarity

logger = logging.getLogger(__name__)


# Defining our Bi-Encoder model

In the following block, we'll create our neural network bi-encoder model. To do so, we'll define a new class `JobSearchBiencoderModel` that contains the two encoders, which are the encoder for query and the encoder for code. 

In pytorch, neural networks are defined by specifying their _parameters_ (the things that get updated during training) and a `forward` function that determines how to turn the inputs into outputs. For our bi-encoder, we'll need to fill these in as follows:

* Specify the two encoders in the `__init__` function as fields of the class (e.g., `self.x = 1` makes `x` a field of the object), which will tell PyTorch that we'll be updating their parameters during training
* Write the `forward` function so that we...
  * encode the query as a vector
  * encode the code-document as a vector
  * compute the cosine similarity of the two vectors (where 1 is relevant, 0 is not-relevant)
  
We'll detail these next.

## Creating the encoders

How do we instantiate an encoder? There are two steps. First we need to figure out what is the _architecture_ of the model. This defines things like how many layers are in a neural network and how the layers are connected. In our case, _both_ of our encoders will use the RoBERTa architecture; as you might have guessed, RoBERTa is related to BERT and just has slightly different tweaks. There are [many BERT variants](https://towardsdatascience.com/exploring-bert-variants-albert-roberta-electra-642dfe51bc23) and for the purposes of this excise, you can safely think of RoBERTa as the same as BERT.

Second, once we have our model architecture, we need to specify which parameters we'll start with. You can think of the difference between the model architecture and parameters as if you were specifying a meal: The architecture is a bit like specifying the plates/bowls/container based on what kind of food you want and the parameters are like filling the container with a specific kind of food. There are many pre-trained sets of parameters for architectures, so a neural network starts with some existing knowledge of certain kinds of things (e.g., what human  language looks like,  what programming languages look like, or how to classify images). In neural network land, the [Huggingface Model Repository](https://huggingface.co/models) is a common place to look for parameters that people have shared with others.

Returning to our IR problem, in our setting (conveniently), both of our encoders will use the same architecture: `RobertaModel`. If you _really_ want to know more, the code for this is provided in `edited_roberta.py`, but many end-users of these models (like us) will never need to look at this kind of code--and you certainly don't to complete this homework!

## Writing the forward function 

The `forward` function defines how the neural network goes from inputs to outputs. In our case, we're going to feed the different inputs (query and document) to separate encoders and then compare the outputs. 


Your task here:
1. Define the two encoders(query encoder and job_encoder) which are RobertaModel loaded from edit_roberta.py. You should two arguments when creating the instance of the Roberta Model, which are config and add_pooling_layer(equal to False).
2. define and calculate the cosine similarity between two embeddings.
3. define the loss function and loss.

### TODOs in this block are worth 20 points total

In [3]:
class JobSearchBiencoderModel(RobertaPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config

        ## TODO: 
        # Fill in the following parts where you specify each encoder's architecture. 
        # You'll need to pass in "config" as an argument to the architecture's constructor
        # so it knows how to set things up.
        #
        # NOTE 1: Notice that we haven't specified the *parameters* here, just the architecture.
        # We'll fill in the parameters later
        #
        # NOTE 2: If you were ever curious how to do other kinds of non-text IR (e.g., images),
        # this part is where you'd specify a different kind of encoder architecture, such as
        # ResNet50 for encoding images. The rest of the code for this class would be mostly the same!
        # (The one caveat is that both models need to produce vector representations of the same size)
        
        self.query_encoder = RobertaModel(config, add_pooling_layer=False)
        self.job_encoder = RobertaModel(config, add_pooling_layer=False)

        # This is our loss function that determines how "good" our model's output is.
        # We'll use this in the forward() function to evaluate the model's outputs and
        # then return the predictions and loss.        
        self.loss_fn = BCEWithLogitsLoss()
        
        # This will initialize weights and apply final processing
        self.post_init()
 
    def forward(
        self,
        query_token_ids: Optional[torch.LongTensor] = None,
        job_token_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
       
    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
       

        outputs = self.query_encoder(
            query_token_ids,
            attention_mask=attention_mask,
            
        )
        query_emb = outputs[0][:, 0, :]
        
        outputs_code = self.job_encoder(
            job_token_ids,
            attention_mask=attention_mask,
           
        )
        code_emb = outputs_code[0][:, 0, :]

        # TODO: using the cosine_similarity function (imported above),
        # compute the similarity of the query and code embeddings.
        cosine_sim = cosine_similarity(query_emb, code_emb)
                       
        # TODO: use the self.loss_fn (our loss function) to measure how good/bad
        # the predictions are. This function will return a value you should 
        # call "loss". You can see how to call the function here
        # https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html
        # 
        # NOTE: The "labels" input to this function is the ground truth
        # relevance scores (labels) for each input. You'll want to compare
        # the cosine similarities with these labels when calling the function
        #
        # NOTE 2: There are many kinds of loss functions so understanding how
        # to call them and which order the arguments go in is important
        loss = self.loss_fn(cosine_sim, labels)

        # Finally, let's return some output! Pytorch has provided
        # some structure for us to say what-is-what in the output
        # values. We'll return our loss (how "bad" the model's prediction was)
        # and the logits, which is our predictions. 
        #
        # You don't need to worry about the other two outputs.
        #
        # NOTE: Normally, we'd be passing the cosine similarity through
        # some non-linear function (like a sigmoid!) to get "logits" as
        # our output values. However, in a bi-encoder, often you just 
        # return the cosine similarity as the logits, so the name is wrong
        # but the value is what's expected. Later on, when we access the
        # "logits" part of the output, remember these are the query-doc
        # cosine similarity scores
        return SequenceClassifierOutput(
            loss=loss,
            logits=cosine_sim,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

# Configure the Models and Training Setting

Training deep learning models often involves _lots_ of hyperparameter decisions--not to mention a bunch of seemingly-random bookkeeping options for where and when to save things. We have defined these all for you (yay) but it's worth at least looking through to see what kinds of decisions you'll need to make. Most importantly, we've specified the number of epochs and the batch size (more on that later) so that the model trains quickly.

You will eventually need to edit the input file name to use the full dataset. Everything else can stay the same, though you're welcome to try changing some things and seeing what happens once you've completed the assignment.

In [4]:
class Args:
    def __init__(self):
        
        # Where to save things
        self.data_dir = './data'
        self.model_type = 'roberta'
        self.model_name_or_path = 'microsoft/codebert-base'
        self.task_name = 'codesearch'
        self.output_dir = './models'
        self.output_mode = 'codesearch'

        # These are going to be your most common hyperparameters to change.
        # If you want to do deep learning stuff, it's worth learning a bit
        # about what they are and what they do.
        self.train_batch_size = 16
        self.eval_batch_size = 16
        self.gradient_accumulation_steps = 1
        self.learning_rate = 1e-5
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        self.num_train_epochs = 3 # NOTE: Change this to 1 if debugging so it runs faster
        self.max_steps = -1
        self.warmup_steps = 0
        self.n_gpu = 1
        self.no_cuda = False

        # These are mostly configuration options for which pieces to run
        self.config_name = ""
        self.tokenizer_name = ""
        self.cache_dir = ""
        self.max_seq_length = 200
        self.do_train = True
        self.do_eval = True
        self.do_predict = False
        self.evaluate_during_training = False
        self.do_lower_case = False

        # How often we save things
        self.logging_steps = 1000 
        self.save_steps = 1000
        self.eval_all_checkpoints = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.seed = 42
        
        # Ignore all of these
        self.fp16 = False
        self.fp16_opt_level = 'O1'
        self.local_rank = -1
        self.server_ip = ""
        self.server_port = ""
        
        # Input and output files.
        #
        # TODO: Change the training file to train_300k.txt when ready
        #
        self.train_file = "train_10.txt"# CHANGE ME WHEN READY TO TRAIN!!!!!
        self.dev_file = "valid.txt"
        self.test_file = "test_data.txt"
        self.pred_model_dir = './models/checkpoint-best'
        self.test_result_dir = './results/'

args = Args()
print(args.test_file)

test_data.txt


In the following blocks, we are starting to train our model. Here we will firstly define the train function.

In the train function, we will define the procedure of training the model, the main steps are:
1. Define the dataloader (to do). You should use the function DataLoader(). Three arguments are required for you to input, which are dataset, batch_size, sampler.
2. set the gradient to zero and train the model using back propagation.
    

# Define the training process

Let's see how the training procedure works! The code block below specifies how we'll train our model. There's one part for you to fill in that loads the data using the `DataLoader` class. The rest is helpful to understand how 

### TODO in this block is worth 10 points

In [5]:
def train(args, train_dataset, model, tokenizer, optimizer):
    """ Train the model """

    # The sampler specifies how we should access the training data, which
    # in this case is in a random order
    train_sampler = RandomSampler(train_dataset, replacement=False,num_samples=20)
    
    # TODO: Initailize the DataLoader (https://pytorch.org/docs/stable/data.html)
    # so that it 
    # - loads from the provided train_dataset 
    # - samples using our sampler
    # - uses the specified batch size
    #
    # NOTE: The batch size is pretty important! This says how many examples to train on 
    # at one time. If you recall, we talked about Stocastic Gradient Descent (SGD) that
    # updates based on one instance at a time (e.g., changing the dog t-shirt size after seeing one dog)
    # versus Gradient Descent (GD) that updates after all the data. SGD is much faster to
    # converge to the "right" parameters but can make many missteps. The batch size 
    # says we can look at more than one instance at a time in determining how to update
    # our parameters (e.g., look at a few dogs at a time to determine how to best update 
    # the t-shirt size, rather than just one dog or all the dogs)
    train_dataloader = DataLoader(train_dataset, batch_size=5, sampler=train_sampler,
           batch_sampler=None)
    
    # How many total steps we'll take
    t_total = len(train_dataloader) //  args.num_train_epochs

    # The scheduler helps decide how quickly to update the weights based on how much
    # training data we've seen. 
    scheduler = get_linear_schedule_with_warmup(optimizer, args.warmup_steps, t_total)
    checkpoint_last = os.path.join(args.output_dir, 'checkpoint-last')
    scheduler_last = os.path.join(checkpoint_last, 'scheduler.pt')
    if os.path.exists(scheduler_last):
        scheduler.load_state_dict(torch.load(scheduler_last))

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Total train batch size (w. parallel, distributed & accumulation) = %d",
                args.train_batch_size * args.gradient_accumulation_steps * (
                    torch.distributed.get_world_size() if args.local_rank != -1 else 1))
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = args.start_step
    tr_loss, logging_loss = 0.0, 0.0
    best_acc = 0.0
    model.zero_grad()
    
    # Note that this "train_iterator" is just tdqm wrapper that prints out which
    # epoch we're currently in.     
    train_iterator = trange(args.start_epoch, int(args.num_train_epochs), desc="Epoch")
    
    set_seed(args) 
    
    # This tells pytorch that we're going to be changing the parameters so it needs
    # to start keeping track of stuff
    model.train()
    for idx, _ in enumerate(train_iterator):
        
        # Keep train of the training loss (how "bad" the performance is) for this epohch
        tr_loss = 0.0
        
        # For one epoch, loop over all the data, one batch at a time
        for step, batch in tqdm(enumerate(train_dataloader)):

            batch = tuple(t.to(args.device) for t in batch)
            inputs = {'query_token_ids': batch[0],
                      'job_token_ids': batch[1],
                      'labels': batch[3]}
            
            ouputs = model(**inputs)
            loss = ouputs[0]        
            
            # Do the back propagration to figure out which parameters to change.
            # It's that easy!
            loss.backward() 
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

            tr_loss += loss.item()
            
            # Update the parameters of our model based on the gradient and whatever
            # else the optimizer is keeping track of
            optimizer.step() 
            scheduler.step()  
            
            # This sets the gradient to zero before doing next update so we don't
            # accidentally update the model based on the last batch's performance
            model.zero_grad() 
            global_step += 1

            if args.max_steps > 0 and global_step > args.max_steps:
                break

        # Once we finish an epoch, evaluate the model on the development data and see
        # how well it does. We'll use this information to decide which version of
        # the parameters to use.
        results = evaluate(args, model, tokenizer, checkpoint=str(args.start_epoch + idx))

        # 
        # Save the model and if we've already saved it, overwrite that saved model with 
        # the newly-trained parameters
        #
        last_output_dir = os.path.join(args.output_dir, 'checkpoint-last')
        if not os.path.exists(last_output_dir):
            os.makedirs(last_output_dir)
        model_to_save = model.module if hasattr(model,
                                                'module') else model 
        model_to_save.save_pretrained(last_output_dir)
        logger.info("Saving model checkpoint to %s", last_output_dir)
        idx_file = os.path.join(last_output_dir, 'idx_file.txt')
        with open(idx_file, 'w', encoding='utf-8') as idxf:
            idxf.write(str(args.start_epoch + idx) + '\n')

        torch.save(optimizer.state_dict(), os.path.join(last_output_dir, "optimizer.pt"))
        torch.save(scheduler.state_dict(), os.path.join(last_output_dir, "scheduler.pt"))
        logger.info("Saving optimizer and scheduler states to %s", last_output_dir)

        step_file = os.path.join(last_output_dir, 'step_file.txt')
        with open(step_file, 'w', encoding='utf-8') as stepf:
            stepf.write(str(global_step) + '\n')

        # Optional part 1 goes here

        #
        # If this model is better (on the training data) than the models from any of the 
        # past checkpoints, then keep a separate record of that too
        #
        if (results['acc'] > best_acc):
            best_acc = results['acc']
            output_dir = os.path.join(args.output_dir, 'checkpoint-best')
            if not os.path.exists(output_dir):
                os.makedirs(output_dir)
            model_to_save = model.module if hasattr(model, 'module') else model  
            model_to_save.save_pretrained(output_dir)
            torch.save(args, os.path.join(output_dir, 'training_{}.bin'.format(idx)))
            logger.info("Saving model checkpoint to %s", output_dir)

            torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
            torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
            logger.info("Saving optimizer and scheduler states to %s", output_dir)

        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    return global_step, tr_loss / global_step

# Set up the training environment

This will get a few things ready for the model to train. You don't need to really do much in this block but it's worth seeing how it works if you want to train models in the future

In [6]:
# Setup CUDA so we can run on the GPU
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.device = device

# Setup logging
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN)

# Set seed
set_seed(args)

# This code will help us if we restart training and want to pick back up where we left off
args.start_epoch = 0
args.start_step = 0
checkpoint_last = os.path.join(args.output_dir, 'checkpoint-last')
if os.path.exists(checkpoint_last) and os.listdir(checkpoint_last):
    args.model_name_or_path = os.path.join(checkpoint_last, 'pytorch_model.bin')
    args.config_name = os.path.join(checkpoint_last, 'config.json')
    idx_file = os.path.join(checkpoint_last, 'idx_file.txt')
    with open(idx_file, encoding='utf-8') as idxf:
        args.start_epoch = int(idxf.readlines()[0].strip()) + 1

    step_file = os.path.join(checkpoint_last, 'step_file.txt')
    if os.path.exists(step_file):
        with open(step_file, encoding='utf-8') as stepf:
            args.start_step = int(stepf.readlines()[0].strip())
    logger.info("reload model from {}, resume from {} epoch".format(checkpoint_last, args.start_epoch))

# Task: Setting up the bi-encoder model and its parameters

### the TODO in this block is worth 10 points

In [7]:
# We'll specify some general configurations that tell the models what kind
# of parameters to use and how to turn incoming text/code data into identifiers for 
# processing with the neural network 
#
# We set num_labels = 1 because this is a regression class
# (compared to a classification task with many class labels)
num_labels = 5
config = RobertaConfig.from_pretrained('microsoft/codebert-base',
                                      num_labels=num_labels, finetuning_task=args.task_name)

# We'll treat relevance as a regression problem
config.problem_type = 'regression'

# If you remember from our neural language model part of the lecture, we talked
# about one language model that gets fed a series of words to predict the next
# and each word is mapped to an embedding. The "tokenizer" specifies how to 
# break up words into tokens but it frequently doesn't use just spaces!
# In fact, most tokenizers break words into *pieces* to reduce the size of the
# vocabularly (fewer embeddings!) so we need to specify which tokenizer to use.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")


# Now comes the time to define our model. Let's specify the model class (which we'll need later)
model_class = JobSearchBiencoderModel
# And we'll instantiate the model itself.
model = JobSearchBiencoderModel(config)

# Now comes the magic where we specify the two encoders. Conveniently for us,
# there's actually a very recent langauge model that knows *both* code and human language!!
# We'll use this set of parameters to initialize *each* of our encoders. Over time,
# each encoder's parameters will start to become different since one side is going
# to learn how to encode queries better and the other will learn how to encode 
# code documents.
#
# NOTE: There's nothing stopping us from trying other parameters for the
# encoders too. If you're feeling curious you could swap in any RoBERTa model
# for the query encoder and it will just work.
#
# TODO: Initialize each of the coders using the "from_pretrained" method and
# specifying the pretrained model you want. Here, we'll use the CodeBERT model, 
# which is hosted on Huggingface https://huggingface.co/microsoft/codebert-base
# You should pass in the full name of the pretrained model (which includes the "/").
# Note that this code is going to look the same for both encoders and may 
# seem kind of easy to do but we want you to see how to do it yourself. :) 
model.query_encoder = RobertaModel.from_pretrained("microsoft/codebert-base")
model.job_encoder =  RobertaModel.from_pretrained("microsoft/codebert-base")

# This will move the model's parameters onto the GPU so it runs fast
model.to(args.device)

# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     'weight_decay': args.weight_decay},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
# Remember how we talked about stochastic gradient descent (SGD)? Well, it's not
# the only way to update parameters. There are many (many) ways to do this
# and the usual standard is actually AdamW which uses a bit of bookkeeping to figure
# out how to update the weights more efficiently so the model learns faster.
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)

# If we're restarting, load the optimizer's state at the last time step
optimizer_last = os.path.join(checkpoint_last, 'optimizer.pt')
if os.path.exists(optimizer_last):
    optimizer.load_state_dict(torch.load(optimizer_last))

Some weights of the model checkpoint at microsoft/codebert-base were not used when initializing RobertaModel: ['pooler.dense.weight', 'pooler.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at microsoft/codebert-base were not used when initializing RobertaModel: ['pooler.dense.weight', 'pooler.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).


# Do the training!

Finally!! Let's train that model and save it to a file so we can evaluate it

In [8]:
logger.info("Training/evaluation parameters %s", args)

# Load in the training dataset. Here, we've handled most of the data preprocessing for you
# train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, ttype='train')
train_dataset = 
print(f'The trainint dataset looks like this: ', train_dataset)
# train_dataset = 
print(len(train_dataset))
# Call the training function that we defined above
global_step, tr_loss = train(args, train_dataset, model, tokenizer, optimizer)
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()

# Create output directory if needed
if not os.path.exists(args.output_dir):
    os.makedirs(args.output_dir)

# Save the trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
logger.info("Saving model checkpoint to %s", args.output_dir)
model_to_save = model.module if hasattr(model, 'module') else model  
model_to_save.save_pretrained(args.output_dir)
tokenizer.save_pretrained(args.output_dir)

# Good practice: save your training arguments together with the trained model
torch.save(args, os.path.join(args.output_dir, 'training_args.bin'))

# Load a trained model and vocabulary that you have fine-tuned
model = AutoModel.from_pretrained(args.output_dir)
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
model.to(args.device)

12/14/2022 01:57:11 - INFO - __main__ -   Training/evaluation parameters <__main__.Args object at 0x14e8c0ffb610>
12/14/2022 01:57:11 - INFO - run_classifier -   Loading features from cached file ./data/cached_train_train_10_codebert-base_200_codesearch
12/14/2022 01:57:11 - INFO - __main__ -   ***** Running training *****
12/14/2022 01:57:11 - INFO - __main__ -     Num examples = 5
12/14/2022 01:57:11 - INFO - __main__ -     Num Epochs = 3
12/14/2022 01:57:11 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 16
12/14/2022 01:57:11 - INFO - __main__ -     Gradient Accumulation steps = 1
12/14/2022 01:57:11 - INFO - __main__ -     Total optimization steps = 1


cached_features_file ./data/cached_train_train_10_codebert-base_200_codesearch True
<utils.InputFeatures object at 0x14e8054c7460>
The trainint dataset looks like this:  <torch.utils.data.dataset.TensorDataset object at 0x14e8052a7c70>
5


Epoch:   0%|          | 0/3 [00:00<?, ?it/s]
0it [00:00, ?it/s][A
1it [00:00,  1.64it/s][A
2it [00:00,  3.04it/s][A
3it [00:00,  4.25it/s][A
4it [00:00,  4.03it/s][A
12/14/2022 01:57:12 - INFO - run_classifier -   Loading features from cached file ./data/cached_dev_valid_codebert-base_200_codesearch
12/14/2022 01:57:12 - INFO - run_classifier -   Creating features from dataset file at ./data
12/14/2022 01:57:12 - INFO - utils -   LOOKING AT ./data/valid.txt


cached_features_file ./data/cached_dev_valid_codebert-base_200_codesearch False


12/14/2022 01:57:13 - INFO - utils -   Writing example 0 of 46213
12/14/2022 01:57:13 - INFO - utils -   *** Example ***
12/14/2022 01:57:13 - INFO - utils -   guid: dev-0
12/14/2022 01:57:13 - INFO - utils -   query_token_ids: 0 6407 13851 37357 5 2324 828 2622 30 5222 41 3169 19470 463 36 192 20387 39891 463 37380 1215 347 4839 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12/14/2022 01:57:13 - INFO - utils -   code_token_ids: 0 9232 386 1215 9981 10845 36 1403 2156 2345 2156 1100 2156 425 5457 9291 2156 414 5457 9291 2156 17017 5457 9291 2156 923 5457 321 2156 1123 5457 883 612 4839 4832 18088 1403 479 18134 642 4345 1215 9981 10845 16 9291 2156 22 44307 554 48726 113 1403 479 18134 642 4345 1215 9981 10845 5457 221 4345 48132 36 2345 2156 1100 2156 425 2156 414 2156 17017 2156 923 2156 1123 4839 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12/14/2022 01:57:13 - INFO - utils -   

<utils.InputFeatures object at 0x14e7edfdbd30>


12/14/2022 01:57:49 - INFO - run_classifier -   ***** Running evaluation  *****
12/14/2022 01:57:49 - INFO - run_classifier -     Num examples = 46213
12/14/2022 01:57:49 - INFO - run_classifier -     Batch size = 16

Evaluating:   0%|          | 0/2889 [00:00<?, ?it/s][A
Epoch:   0%|          | 0/3 [00:37<?, ?it/s]


TypeError: forward() got an unexpected keyword argument 'code_token_ids'

# Evaluate the best model on the test data

Our code in training keeps track of how the model is doing and currently keeps around the files for the "best performing" model on the training data. How well does this model do on the test data? Let's find out!

In [None]:
# Evaluation
checkpoint = args.output_dir

logger.info("Evaluate the following checkpoint: %s", checkpoint)

print(checkpoint)
global_step = ""
model = model_class.from_pretrained(checkpoint)
model.to(args.device)
result = evaluate(args, model, tokenizer, checkpoint=checkpoint, prefix=global_step)
result = dict((k + '_{}'.format(global_step), v) for k, v in result.items())
print(result)

    
# Optional part 3 goes here


# Doing Inference on the Test Dataset

Finally, let's estimate the relevance scores for the query-document pairs in our test dataset. The test dataset **test_data.csv** contains each pair of 99 queries and 958 documents, which in total adds up to 94,842 query-document annotations (compare that with the project update number!). 

For ease of this exercise, we've already processed the data into a ready-to-go format in **test_data.txt** which is required by the model. In the following block, for each query-document pair, we generate a prediction score that measure the relevance of that pair by feeding the pair as inputs to the model's `forward` function (note that in pytorch if you have some model, doing `model(inputs)` and `model.forward(inputs)` is the same--it's trying to emphasize thinking of these as functions!). 

Once we have the model predictions, we'll create a new dataframe that contains (1) the query id, (2) the document id, and (3) the relevance score for that pair. We'll hand this dataframe off to Part 2 so you can finish up your GPU work.

You should adjust the directory path accordingly in order to successfully do the inference. Check what you got in the result.txt file and write this back into the **test_data.csv** that adds an additional column "sim". Later In Part 2, you will incorporate the prediction score into the learning to rank model to see if it can improve the performance of ranking,

### Implementation notes

This implementation works because we've aligned the `test_data.txt` and `test_data.csv` files so they're in the same order. That means you can write a long list of similarities and then add it back to the test data's DataFrame and it will Just Work™. However, in production settings, it's often useful to keep identifiers with the data as much as possible so that you don't just have a file of predictions and instead can write the predictions with the query/document identifiers (or whatever data you're working with).

In this homework's setup, we're precomputed the relevance scores for later integration with some overall ranking function (done in Part 2). To get these, we re-encode everything for each step. In commercial systems, what is typically done is the documents are encoded once and then cached, much like how our inverted index caches the terms in each document. Then when a new queries arrives, we only have to encode it and compare it with the cached document emeddings. This saves a lot of time! Thankfully our dataset is quite small here so we don't need to do that, but the ability to precompute and cache embeddings is worth remembering why bi-encoders are helpful and efficient for ranking.

In [None]:
# This tells the model that we're switching to evaluation mode (rather than training)
# so it should turn off any training-specific functionality that could make it slower
# or interfere with our results.
model.eval()
    
# Note here: we're loading the test set ***in sequential order***. This is critical
# for the next step because we need to map these predictions to query-document pairs.
# In training, we want to see a random order, but typically not during test.
eval_dataset, instances = load_and_cache_examples(args, "jobsearch", tokenizer, ttype='test')
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

# This data structures will have our predictions and we'll fill them as we process each batch
relevance_predictions = np.array([])

for batch in tqdm(eval_dataloader, desc="Evaluating"):
    batch = tuple(t.to(args.device) for t in batch)

    # Get the model's cosine similarity for the query-document pairs
    # we pass as input. This no_grad() call also tells pytorch that
    # we're doing evaluation so pytorch doesn't have to keep track
    # of any gradients for updating the model (e.g., remember how
    # in the dog t-shirt fitting, we had to keep around how much to change
    # the t-shirt sizes).
    with torch.no_grad():
        
        # Prepare the inputs
        inputs = {'query_token_ids': batch[0],
                  'job_token_ids': batch[1],
                  'labels': batch[3]}
       
        # Note that this is a list of outputs, which includes the cosine
        # similarity, among other stuff
        outputs = model(**inputs)
        
    # Let's pull out just the cosine similarity
    _, cosine_sim = outputs[:2] 
    
    # Pytorch works with "tensors" which are just like fancy numpy arrays.
    # One main difference is that the tensor might "live" on a GPU, which 
    # means we need to copy it into regular computer memory to use it.
    # Here, we'll call .cpu() to get the value back off the GPU and then
    # convert the similarities to numpy. Remember, we're getting a list
    # of similarities back out!
    cosine_sims = cosine_sim.cpu().numpy()
    
    # Add these similarities to our current similarities
    relevance_predictions = np.append(relevance_predictions, cosine_sims, axis=0)

if not os.path.exists(args.test_result_dir):
    os.makedirs(args.test_result_dir)

output_test_file = os.path.join(args.test_result_dir, 'relevance-scores.csv')

with open(output_test_file, "w") as outf:
    logger.info("***** Writing relevance predictions *****")
    all_logits = relevance_predictions.tolist()
    
    # Note that we write these all as one big list. In the next step,
    # we'll merge these with the data frame
    outf.write(",".join([str(item) for item in all_logits]))
    
# Optional part 3 goes here

### TODO: Merge the predictions with the query/doc pairs (10 points)

The numpy array `relevance_predictions` now contains a list of all the similarities, which we'll need to merge with the test data. Conveniently, these predictions appear in the exact same order as the query-document pairs in the `data/test_data.csv` file.

Your task is to read in `test_data.csv` as a dataframe and merge these relevance predictions as a new column called "sim". We'll export this dataframe to a separate file with just a few columns for better efficiency. Write a _new_ file with a subset of this dataframe containing only the columns
* "sim"
* "qid" (the query id)
* "docno" (the document id)
These last two columns match the pyterrier naming conventions, which you'll need for Part 2.

In [None]:
df = pd.read_csv('data/test_data.csv')
df.head()

# _Optional TODO_: Evaluating the different models (20 points total; this is part 1)

 In the code above, we save the model's parameter for the most recent epoch and an extra directory for saving the model with the highest accuracy so far on the validation data. How much training does the model actually need to recognize relevance? Would one epoch be enough? What if we did 10? or 100? (100 might be too many for Great Lakes limits...). In this **optional part**, we'll describe a series of steps you can take to explore this part!
 
Most of this optional part consists of changing or extending the code above using regular python/pandas things (no deep learning) so this is accessible to anyone. It will require you to figure out how some of the code does work though, so it's useful in general. 
 
Here's what you need to do:
* Right now the model trains for 3 epochs total. Increase that number to 5 or more. There's a 3-hour limit per session for GPUs in Great Lakes so if you've completed all of part 1, it's worth getting a fresh session to get all 3 hours again. You can increase the number of epochs if you want too.
* When training, we save the last checkpoint (overwriting the previous result) and also see if this is the "best" model and save that too. You will need to add more code here to save the model after every epoch. The code to do the saving is already shown in that block, so you'll need to figure out which parts to re-use _and_ be sure to change the directory. Look for "Optional part 1 goes here" on where to start
* After training completes, let's see how well each of the models does on the test set. We've already provided some code that does the evaluation on the best performing model. Add more code so that it evaluates the models you just saved for each epoch (look for "Optional part 2 goes here") and make a plot of the performance on the test set
* To measure the impact on NDCG, we'll need to calculate the different bi-encoders' relevance estimates to different files to use in Part 2. You'll need to add more code that loads in each of these models from the checkpoint directories and runs the inference
* In the Part 4, just write the different models' predictions to separate files. You might combine this with the part 3 code if it's easier to do there. We haven't marked a spot for it explicitly, but you'll use these files in part 2.

The output of Part 1 should be a plot showing the F1 performance per epoch and a list of files for each epoch's trained model's relevance predictions.