# AIR - Exercise in Google Colab

## Colab Preparation

Open via Google Drive -> right click -> Open with Colab

**Get a GPU**

Toolbar -> Runtime -> Change Runtime Type -> GPU

**Mount Google Drive**

* Download the data and clone your GitHub repo to your Google Drive folder
* Use Google Drive as the connection between GitHub and Colab (direct GitHub access also works, but re-submitting credentials might be annoying)
* Commit to GitHub locally from the synced Drive folder

**Keep Alive**

During long training runs, Google Colab tends to disconnect the session. This might help: https://medium.com/@shivamrawat_756/how-to-prevent-google-colab-from-disconnecting-717b88a128c0

**Get Started**

Run the following script to mount Google Drive and install the needed Python packages. PyTorch comes pre-installed.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/AIR/
# download the installer once; -nc skips the download if the file is already on Drive
! wget -nc https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.12.0-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.12.0-Linux-x86_64.sh -b -f -p /content/drive/MyDrive/AIR/Miniconda

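# activate the miniconda environment
# note: each ! command runs in its own subshell, so the activation does not persist for later commands
!source /content/drive/MyDrive/AIR/Miniconda/bin/activate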

# PYTHONPATH lists directories to search for modules, so point it at site-packages
%env PYTHONPATH=/content/drive/MyDrive/AIR/Miniconda/lib/python3.7/site-packages
!echo $PYTHONPATH

#Add miniconda to the system PATH:

import sys
sys.path.append('/content/drive/MyDrive/AIR/Miniconda/lib/python3.7/site-packages/')

import os  # needed for os.environ below
path = '/content/drive/MyDrive/AIR/Miniconda/bin:' + os.environ['PATH']
%env PATH=$path

%cd /content/drive/MyDrive/AIR/condaENVair
!conda create --name MYcondaENVair python=3.6 --yes

%cd /content/drive/MyDrive/AIR/condaENVair
!source activate MYcondaENVair

!conda info --envs

!conda install --channel defaults conda python=3.6 --yes
!conda update --channel defaults --all --yes

!pip install -r /content/drive/MyDrive/AIR/requirements.txt

!conda install pytorch==1.6.0 torchvision==0.7.0 -c pytorch --yes

In [None]:
import torch

print("Version:",torch.__version__)
print("Has GPU:",torch.cuda.is_available()) # check that 1 gpu is available
print("Random tensor:",torch.rand(10,device="cuda")) # check that pytorch works

# Main.py Replacement

-> add your code here

- Replace *air_test* with your Google Drive location in the sys.path.append() call

# Config

In [None]:
import os  # used below for os.path.abspath
import sys
sys.path.append('/content/drive/MyDrive/AIR/src')

from allennlp.common import Params, Tqdm
from allennlp.common.util import prepare_environment
from allennlp.data.dataloader import PyTorchDataLoader
prepare_environment(Params({})) # sets the seeds to be fixed

import pandas as pd
import torch
import torch.optim as optim
import torch.nn as nn

from allennlp.data.vocabulary import Vocabulary
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder

from data_loading import *
from model_knrm import *
from model_tk import *
from core_metrics import *

# change paths to your data directory
config = {
    "vocab_directory": "../AIR/data/Part-2/allen_vocab_lower_10",
    "pre_trained_embedding": "../AIR/data/Part-2/glove.42B.300d.txt",
    "model": "tk",
    "train_data": "../AIR/data/Part-2/triples.train.tsv",
    "validation_data": "../AIR/data/Part-2/msmarco_tuples.validation.tsv",
    "test_data": "../AIR/data/Part-2/msmarco_tuples.test.tsv",
    "fira_test_data": "../AIR/data/Part-2/fira-22.tuples.tsv",
    "qrels": "../AIR/data/Part-2/msmarco_qrels.txt",
    "fira_qrels": "../AIR/data/Part-1/fira-22.baseline-qrels.tsv",
    "custom_qrels": "../AIR/data/Part-1/aggregated_qrels.tsv",
    "custom_qrels_log": "../AIR/data/Part-1/aggregated_qrels_log.tsv",
    "model_path": "../AIR/data/Part-2/model_weights_tk.pth"
}

#
# data loading
#
vocab_directory = config["vocab_directory"]
print("Expected vocabulary directory:", os.path.abspath(vocab_directory))

vocab = Vocabulary.from_files(config["vocab_directory"])
tokens_embedder = Embedding(vocab=vocab,
                           pretrained_file= config["pre_trained_embedding"],
                           embedding_dim=300,
                           trainable=True,
                           padding_index=0)
word_embedder = BasicTextFieldEmbedder({"tokens": tokens_embedder})

# recommended default params for the models (but you may change them if you want)
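if config["model"] == "knrm":
    model = KNRM(word_embedder, n_kernels=11)
elif config["model"] == "tk":
    model = TK(word_embedder, n_kernels=11, n_layers = 2, n_tf_dim = 300, n_tf_heads = 10)
else:
    # fail early instead of hitting an undefined `model` further down
    raise ValueError(f"Unknown model: {config['model']}")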

# put model to gpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# optimizer, loss
optimizer = optim.Adam(model.parameters(), lr=4e-5)
loss_function = nn.MarginRankingLoss(margin=1.0)

# print model struct
print('Model', config["model"], 'total parameters:', sum(p.numel() for p in model.parameters() if p.requires_grad))
print('Network:', model)


Expected vocabulary directory: /content/drive/MyDrive/AIR/data/Part-2/allen_vocab_lower_10


1917494it [01:22, 23133.62it/s]


Model tk total parameters: 97569520
Network: TK(
  (word_embeddings): BasicTextFieldEmbedder(
    (token_embedder_tokens): Embedding()
  )
  (positional_encoding): PositionalEncoding()
  (transformer): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=300, out_features=300, bias=True)
        )
        (linear1): Linear(in_features=300, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=300, bias=True)
        (norm1): LayerNorm((300,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((300,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (log_norm_layer): Linear(in_features=11, out_features=1, bias=False)
  (length_norm_layer): Linear(in_

# Define train/eval functions

In [None]:
def train_model(train_loader):

    model.train()
    total_loss = 0.0
    num_batches = 0

    for batch in Tqdm.tqdm(train_loader):

        #set grad to 0
        optimizer.zero_grad()

        #put data to device
        query = batch['query_tokens']['tokens']['tokens'].to(device)
        doc_pos = batch['doc_pos_tokens']['tokens']['tokens'].to(device)
        doc_neg = batch['doc_neg_tokens']['tokens']['tokens'].to(device)

        #get score for relevant and nonrelevant document
        score_pos = model(query, doc_pos)
        score_neg = model(query, doc_neg)

        #make targets
        target = torch.ones(score_pos.size(), device = score_pos.device)

        #compute loss
        loss = loss_function(score_pos, score_neg, target)
        total_loss += loss.item()

        #backprop
        loss.backward()
        optimizer.step()
        num_batches = num_batches + 1

    return total_loss / num_batches

def evaluate_model(eval_loader, qrels):

    # put model to eval mode
    model.eval()
    ranking = {}

    with torch.no_grad():
        for batch in Tqdm.tqdm(eval_loader):

            # data
            query = batch['query_tokens']['tokens']['tokens'].to(device)
            doc = batch['doc_tokens']['tokens']['tokens'].to(device)

            # get scores; flatten to 1-D so a final batch with a single pair still has a length
            scores = model(query, doc).reshape(-1)

            # append scores with documents to queries
            for i in range(len(scores)):

                query_id = batch['query_id'][i]
                doc_id = batch['doc_id'][i]

                if query_id not in ranking:
                    ranking[query_id] = []
                ranking[query_id].append((doc_id, scores[i].item()))

    # core_metrics.py functions, calculate metrics
    ranked_results = unrolled_to_ranked_result(ranking)
    eval_metrics = calculate_metrics_plain(ranked_results, qrels)
    return eval_metrics, ranked_results
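
The training function above passes a target of ones to nn.MarginRankingLoss, which asks the relevant document to score at least one margin higher than the non-relevant one. As a rough illustration of what that loss computes, here is a plain-Python sketch of the documented formula max(0, -y * (x1 - x2) + margin) with y = 1 and mean reduction. The helper names and the scores are made up for the example and are not part of the exercise code.

```python
def pairwise_margin_losses(scores_pos, scores_neg, margin=1.0):
    """Per-pair hinge loss for target y = +1: max(0, margin - (pos - neg))."""
    return [max(0.0, margin - (p - n)) for p, n in zip(scores_pos, scores_neg)]


def margin_ranking_loss(scores_pos, scores_neg, margin=1.0):
    """Mean over all pairs, mirroring the default 'mean' reduction."""
    losses = pairwise_margin_losses(scores_pos, scores_neg, margin)
    return sum(losses) / len(losses)


# hypothetical scores for three (query, relevant doc, non-relevant doc) triples
scores_pos = [2.5, 0.25, 1.0]
scores_neg = [0.5, 0.75, 0.5]

pair_losses = pairwise_margin_losses(scores_pos, scores_neg)
print(pair_losses)  # [0.0, 1.5, 0.5]
print(margin_ranking_loss(scores_pos, scores_neg))
```

The first pair is already separated by more than the margin and contributes nothing. The second pair is ranked the wrong way round and is penalised most. The third is ranked correctly but too close, so it still adds a small loss.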


# Instance dataloaders

In [None]:
# read train data, instance dataloader
train_reader = IrTripleDatasetReader(lazy=True, max_doc_length=180, max_query_length=30)
train_data = train_reader.read(config["train_data"])
train_data.index_with(vocab)
train_loader = PyTorchDataLoader(train_data, batch_size=256)

# read val data, instance dataloader
val_reader = IrLabeledTupleDatasetReader(lazy=True, max_doc_length=180, max_query_length=30)
val_data = val_reader.read(config["validation_data"])
val_data.index_with(vocab)
val_loader = PyTorchDataLoader(val_data, batch_size=256)

# read test data, instance dataloader
test_reader = IrLabeledTupleDatasetReader(lazy=True, max_doc_length=180, max_query_length=30)
test_data = test_reader.read(config["test_data"])
test_data.index_with(vocab)
test_loader = PyTorchDataLoader(test_data, batch_size=256)

# read fira data, instance dataloader
fira_reader = IrLabeledTupleDatasetReader(lazy=True, max_doc_length=180, max_query_length=30)
fira_data = fira_reader.read(config["fira_test_data"])
fira_data.index_with(vocab)
fira_loader = PyTorchDataLoader(fira_data, batch_size=256)

# read qrels
qrels = load_qrels(config["qrels"])
fira_qrels = load_qrels(config["fira_qrels"])
custom_qrels = load_qrels(config["custom_qrels"])
custom_qrels_log = load_qrels(config["custom_qrels_log"])

# Run training

In [None]:
# for visualizing for report
train_loss_per_epoch = []
val_metrics_per_epoch = []

num_epochs = 10
best_mrr = 0
patience = 2
no_improvement_cnt = 0

for epoch in range(num_epochs):

    train_loss = train_model(train_loader)
    train_loss_per_epoch.append(train_loss)
    print(f"Epoch {epoch + 1}: Training Loss = {train_loss:.4f}")

    val_metrics, _ = evaluate_model(val_loader, qrels)
    val_metrics_per_epoch.append(val_metrics)
    print(f"Validation Metrics: {val_metrics}")

    #early stopping based on mrr@10 values
    curr_epoch_mrr = val_metrics.get("MRR@10", 0)
    if curr_epoch_mrr > best_mrr:
        best_mrr = curr_epoch_mrr
        no_improvement_cnt = 0
        torch.save(model.state_dict(), config["model_path"])
    else:
        no_improvement_cnt = no_improvement_cnt + 1
        if no_improvement_cnt >= patience:
            print(f"Stopped early at epoch {epoch + 1}")
            break
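
Early stopping in the loop above keys on MRR@10. The actual computation lives in core_metrics.py, which is not shown here, so the following is only a minimal sketch of how a reciprocal-rank metric of this kind is commonly computed. It assumes rankings of the form {query_id: [doc_id, ...]} (best first) and qrels of the form {query_id: {doc_id: relevance}}. The function name, the data layout and the choice to average over all ranked queries are assumptions for the example.

```python
def mrr_at_k(ranked_results, qrels, k=10):
    """Mean reciprocal rank of the first relevant document within the top k."""
    if not ranked_results:
        return 0.0
    total = 0.0
    for query_id, doc_ids in ranked_results.items():
        relevant = qrels.get(query_id, {})
        for rank, doc_id in enumerate(doc_ids[:k], start=1):
            if relevant.get(doc_id, 0) > 0:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_results)


# toy data: q1 finds its relevant doc at rank 2, q2 at rank 1, q3 not at all
toy_ranking = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"], "q3": ["d9"]}
toy_qrels = {"q1": {"d1": 1}, "q2": {"d5": 1}, "q3": {"d7": 1}}

print(mrr_at_k(toy_ranking, toy_qrels))  # 0.5
```

The reciprocal ranks are 1/2, 1 and 0, which average to 0.5. A query without a relevant document in the top k lowers the mean in the same way as a very poor ranking.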

# Save metrics from training

In [None]:
train_loss_df = pd.DataFrame(train_loss_per_epoch, columns=['loss'])
train_loss_df.to_csv("/content/drive/MyDrive/AIR/data/Part-2/train_loss_tk.csv", index=False)

val_metrics_df = pd.DataFrame(val_metrics_per_epoch)
val_metrics_df.to_csv("/content/drive/MyDrive/AIR/data/Part-2/val_metrics_tk.csv", index=False)
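
For the report it can help to know which epoch produced the checkpoint that was saved. A small sketch, assuming the per-epoch metrics are kept as a list of dicts like val_metrics_per_epoch above. The helper name and the numbers in the example are hypothetical.

```python
def best_epoch(metrics_per_epoch, key="MRR@10"):
    """Return (1-based epoch number, score) of the best epoch for the given metric."""
    scores = [metrics.get(key, 0.0) for metrics in metrics_per_epoch]
    # max() keeps the first maximum, i.e. the earliest epoch wins on ties
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index + 1, scores[best_index]


# made-up validation history for illustration only
history = [{"MRR@10": 0.21}, {"MRR@10": 0.26}, {"MRR@10": 0.25}]
print(best_epoch(history))  # (2, 0.26)
```

Note that the helper raises a ValueError on an empty history, so it should only be called after at least one epoch has finished.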

# Evaluate on test sets

In [None]:
msmarco_test_eval, ranked_results = evaluate_model(test_loader, qrels)
fira_test_eval, _ = evaluate_model(fira_loader, fira_qrels)
custom_fira_eval, _ = evaluate_model(fira_loader, custom_qrels)

print(f"MSMarco metrics: {msmarco_test_eval}")
print(f"Fira metrics: {fira_test_eval}")
print(f"Custom Fira metrics: {custom_fira_eval}")

MSMarco metrics: {'MRR@10': 0.278959126984127, 'Recall@10': 0.4949583333333333, 'QueriesWithNoRelevant@10': 999, 'QueriesWithRelevant@10': 1001, 'AverageRankGoldLabel@10': 3.128871128871129, 'MedianRankGoldLabel@10': 2.0, 'MRR@20': 0.2838755692342573, 'Recall@20': 0.5644166666666668, 'QueriesWithNoRelevant@20': 858, 'QueriesWithRelevant@20': 1142, 'AverageRankGoldLabel@20': 4.587565674255692, 'MedianRankGoldLabel@20': 3.0, 'MRR@1000': 0.2851484426801613, 'Recall@1000': 0.6002916666666668, 'QueriesWithNoRelevant@1000': 788, 'QueriesWithRelevant@1000': 1212, 'AverageRankGoldLabel@1000': 5.963696369636963, 'MedianRankGoldLabel@1000': 3.0, 'nDCG@3': 0.27037205359351596, 'nDCG@5': 0.3003718305083099, 'nDCG@10': 0.32911543600391163, 'nDCG@20': 0.34693789013860776, 'nDCG@1000': 0.35447183872949234, 'QueriesRanked': 2000, 'MAP@1000': 0.281939943312788}
Fira metrics: {'MRR@10': 0.9309903041676441, 'Recall@10': 0.9406342631096326, 'QueriesWithNoRelevant@10': 116, 'QueriesWithRelevant@10': 4059, 

# Evaluate log qrels

In [None]:
model.load_state_dict(torch.load(config["model_path"]))
custom_log_fira_eval, _ = evaluate_model(fira_loader, custom_qrels_log)
print(f"Custom Fira metrics with np.log1p: {custom_log_fira_eval}")

Custom Fira metrics with np.log1p: {'MRR@10': 0.939856925843254, 'Recall@10': 0.9408800529711174, 'QueriesWithNoRelevant@10': 80, 'QueriesWithRelevant@10': 4095, 'AverageRankGoldLabel@10': 1.143101343101343, 'MedianRankGoldLabel@10': 1.0, 'MRR@20': 0.9398791204455267, 'Recall@20': 1.0, 'QueriesWithNoRelevant@20': 79, 'QueriesWithRelevant@20': 4096, 'AverageRankGoldLabel@20': 1.1455078125, 'MedianRankGoldLabel@20': 1.0, 'MRR@1000': 0.9398791204455267, 'Recall@1000': 1.0, 'QueriesWithNoRelevant@1000': 79, 'QueriesWithRelevant@1000': 4096, 'AverageRankGoldLabel@1000': 1.1455078125, 'MedianRankGoldLabel@1000': 1.0, 'nDCG@3': 0.8484802235896395, 'nDCG@5': 0.8583890503971509, 'nDCG@10': 0.8842504576080512, 'nDCG@20': 0.9024749796866838, 'nDCG@1000': 0.9024749796866838, 'QueriesRanked': 4096, 'MAP@1000': 0.928988627689344}


# Extract and save top1 MSMarco

In [None]:
top1_results = {qid: did[0] for qid, did in ranked_results.items()}
df_top1 = pd.DataFrame(list(top1_results.items()), columns=['query_id', 'doc_id'])
df_top1.to_csv('/content/drive/MyDrive/AIR/data/Part-2/top1_results_tk.csv', index=False)

In [None]:
top2_results = {qid: (did[0], did[1]) for qid, did in ranked_results.items()}
print(top2_results)
df_top2 = pd.DataFrame([(qid, dids[0], dids[1]) for qid, dids in top2_results.items()], columns=['query_id', 'doc1_id', 'doc2_id'])
df_top2.to_csv('/content/drive/MyDrive/AIR/data/Part-2/top2_results_tk.csv', index=False)
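
The top-2 extraction above indexes did[1] directly, which fails for any query that has only one ranked document. A hedged sketch of a more general top-k helper follows, assuming ranked_results maps each query id to a list of doc ids ordered best first (as the indexing above suggests). The function name and toy data are made up for the example.

```python
def top_k_per_query(ranked_results, k):
    """Build one row per query: (query_id, doc_1, ..., doc_k), padded with None."""
    rows = []
    for query_id, doc_ids in ranked_results.items():
        top = list(doc_ids[:k])
        top += [None] * (k - len(top))  # pad queries with fewer than k documents
        rows.append((query_id, *top))
    return rows


# toy rankings: q2 has only a single candidate document
toy_ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d5"]}
rows = top_k_per_query(toy_ranked, k=2)
print(rows)  # [('q1', 'd3', 'd1'), ('q2', 'd5', None)]
```

The resulting rows can be passed to pd.DataFrame(rows, columns=['query_id', 'doc1_id', 'doc2_id']) in the same way as in the cell above.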