# BM25 + ColBERT Re-ranking Pipeline with PyTerrier

### Step 1: Import everything and load variables

In [2]:
import pyterrier as pt
import pandas as pd
from tira.third_party_integrations import (
    ensure_pyterrier_is_loaded,
    get_input_directory_and_output_directory,
    persist_and_normalize_run,
)
import json
from tqdm import tqdm
import os

ensure_pyterrier_is_loaded()
input_directory, output_directory = get_input_directory_and_output_directory('./sample-input-full-rank')
# Imported only after PyTerrier has been started, which pyterrier_colbert requires.
from pyterrier_colbert.ranking import ColBERTFactory


I will use a small hardcoded example located in ./sample-input-full-rank.
The output directory is /tmp/


### Step 2: Load the Data

In [3]:
print('Step 2: Load the data.')

queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml')

# Read one JSON document per line and close the file afterwards.
with open(input_directory + '/documents.jsonl', 'r') as f:
    documents = [json.loads(line) for line in f]


Step 2: Load the data.
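
For readers without the sample data at hand, here is a minimal sketch of the JSON Lines loading step. It assumes each line of `documents.jsonl` is a JSON object with `docno` and `text` fields (the two fields the indexer stores as metadata); the sample records and the `load_jsonl` helper are hypothetical and only illustrate the file layout.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative records only; these are not the real sample documents.
sample = [
    {'docno': 'doc-a', 'text': 'first example text'},
    {'docno': 'doc-b', 'text': 'second example text'},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'documents.jsonl')
    # Write one JSON object per line, then read the file back.
    with open(path, 'w', encoding='utf-8') as f:
        for record in sample:
            f.write(json.dumps(record) + '\n')
    loaded = load_jsonl(path)

print(loaded[0]['docno'])
```

The resulting list of dictionaries has the same shape as `documents` above, which is what `IterDictIndexer` iterates over in the next step.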


### Step 3: Create the Index

In [4]:
print('Step 3: Create the Index.')

!rm -Rf ./index
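# Store docno (up to 100 characters) and text (up to 10240 characters) as
# metadata so the document text is available to the re-ranker later.
iter_indexer = pt.IterDictIndexer("./index", meta={'docno': 100, 'text': 10240})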
index_ref = iter_indexer.index(tqdm(documents))


Step 3: Create the Index.


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 10.51it/s]


In [5]:
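# First stage: BM25 retrieval that also returns the docno and text metadata.
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", metadata=['docno', 'text'])

# Second stage: ColBERT, loaded from the checkpoint named in MODEL_NAME.
pytcolbert = ColBERTFactory(os.environ['MODEL_NAME'], "/index", "index")

# Keep the top 1000 BM25 results per query, then re-score their text with ColBERT.
pipeline = bm25 % 1000 >> pytcolbert.text_scorer(verbose=True)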

Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAI

[May 03, 20:02:45] #> Loading model checkpoint.
[May 03, 20:02:45] #> Loading checkpoint http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip




[May 03, 20:03:00] #> checkpoint['epoch'] = 0
[May 03, 20:03:00] #> checkpoint['batch'] = 44500
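
The expression `bm25 % 1000 >> pytcolbert.text_scorer(...)` composes a retrieve-then-rerank pipeline: a first stage ranks documents, `%` applies a rank cutoff, and `>>` passes the surviving candidates to a second scorer. The plain-Python sketch below mimics that control flow without PyTerrier. All function names, the toy documents, and both toy scoring functions are hypothetical stand-ins; they are not BM25 or ColBERT.

```python
def tokens(text):
    return text.lower().split()


def first_stage(query, docs):
    """Toy lexical stage: score = number of distinct query terms in the document."""
    terms = set(tokens(query))
    scored = [(docno, sum(t in tokens(text) for t in terms))
              for docno, text in docs.items()]
    # Only matching documents are retrieved; ties are broken by docno.
    matched = [(docno, score) for docno, score in scored if score > 0]
    return sorted(matched, key=lambda x: (-x[1], x[0]))


def cutoff(results, k):
    """Rank cutoff, playing the role of `% k`."""
    return results[:k]


def toy_rescorer(query, text):
    """Stand-in for the second-stage model: term overlap normalised by length."""
    terms = set(tokens(query))
    doc_tokens = tokens(text)
    return sum(t in doc_tokens for t in terms) / len(doc_tokens)


def rerank(query, candidates, docs, scorer):
    """Re-score only the candidates, playing the role of `>> scorer`."""
    rescored = [(docno, scorer(query, docs[docno])) for docno, _ in candidates]
    return sorted(rescored, key=lambda x: (-x[1], x[0]))


docs = {
    'd1': 'the fox',
    'd2': 'a fox and a dog and a fox',
    'd3': 'lazy dog',
    'd4': 'zebra',
}
query = 'fox dog'

candidates = cutoff(first_stage(query, docs), k=2)
reranked = rerank(query, candidates, docs, toy_rescorer)
print([docno for docno, _ in candidates])
print([docno for docno, _ in reranked])
```

Note that the second stage can only reorder what the first stage returns: `d3` is dropped by the cutoff and never reaches the re-scorer, which is why the cutoff depth (1000 in the pipeline above) bounds the recall of the whole pipeline.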


### Step 4: Create Run

In [6]:
print('Step 4: Create Run.')
run = pipeline(queries)

Step 4: Create Run.


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.99q/s]


In [7]:
run

|   | qid | query | docno | score | rank |
|---|-----|-------|-------|-------|------|
| 0 | 1 | fox jumps above animal | pangram-04 | 21.435955 | 0 |
| 1 | 1 | fox jumps above animal | pangram-02 | 20.018982 | 1 |
| 2 | 1 | fox jumps above animal | pangram-03 | 14.482786 | 2 |
| 3 | 1 | fox jumps above animal | pangram-01 | 12.042614 | 3 |
| 4 | 2 | multiple animals including a zebra | pangram-03 | 20.021858 | 0 |
| 5 | 2 | multiple animals including a zebra | pangram-01 | 16.255941 | 1 |
| 6 | 2 | multiple animals including a zebra | pangram-05 | 16.099251 | 2 |
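
A quick sanity check on a run like this is that, within each query, ranks start at 0 and follow descending score. The sketch below performs that check with the standard library only; the rows are copied from the run shown above, and `ranks_match_scores` is a hypothetical helper, not part of PyTerrier or TIRA.

```python
from collections import defaultdict

# (qid, docno, score, rank) copied from the run shown above.
rows = [
    ('1', 'pangram-04', 21.435955, 0),
    ('1', 'pangram-02', 20.018982, 1),
    ('1', 'pangram-03', 14.482786, 2),
    ('1', 'pangram-01', 12.042614, 3),
    ('2', 'pangram-03', 20.021858, 0),
    ('2', 'pangram-01', 16.255941, 1),
    ('2', 'pangram-05', 16.099251, 2),
]


def ranks_match_scores(rows):
    """Return True if, per query, ranks 0..n-1 follow descending score."""
    by_query = defaultdict(list)
    for qid, _docno, score, rank in rows:
        by_query[qid].append((score, rank))
    for entries in by_query.values():
        ordered = sorted(entries, key=lambda e: -e[0])
        if [rank for _, rank in ordered] != list(range(len(ordered))):
            return False
    return True


print(ranks_match_scores(rows))
```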


### Step 5: Persist Run

In [8]:
print('Step 5: Persist Run.')

persist_and_normalize_run(run, output_file=output_directory, system_name='colbert', depth=1000)

Step 5: Persist Run.
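
To give an idea of what a persisted run typically looks like, the sketch below formats result rows in the conventional six-column TREC run layout (`qid Q0 docno rank score system`). This is an assumption about the output format, since the written file is not shown in this notebook; `to_trec_lines` is a hypothetical helper, and the ranks are kept exactly as they appear in the run, without any normalization that `persist_and_normalize_run` may apply.

```python
def to_trec_lines(rows, system_name):
    """Format (qid, docno, score, rank) tuples as TREC-style run lines."""
    return [
        f'{qid} Q0 {docno} {rank} {score} {system_name}'
        for qid, docno, score, rank in rows
    ]


# Two rows taken from the run in Step 4.
rows = [
    ('1', 'pangram-04', 21.435955, 0),
    ('1', 'pangram-02', 20.018982, 1),
]

lines = to_trec_lines(rows, system_name='colbert')
print('\n'.join(lines))
```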
