The purpose of this notebook is to convert `triples.train.small.tsv`, which stores each triple as raw text, into `triples.train.small.json`, a JSON Lines file in which each line is a `[qid, positive pid, negative pid]` triple. This notebook was created by running:

```
modal launch jupyter --volume colbert-maintenance
```

My `colbert-maintenance` Modal volume contains the files:

- collection.tsv
- queries.train.tsv
- triples.train.small.tsv


The corresponding URLs are:

- https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz
- https://msmarco.z22.web.core.windows.net/msmarcoranking/queries.tar.gz
- https://msmarco.z22.web.core.windows.net/msmarcoranking/triples.train.small.tar.gz

In [None]:
!pip install pandas

In [21]:
import pandas as pd
import json

In [3]:
collection = pd.read_csv("collection.tsv", sep='\t', header=None)

In [4]:
collection.shape

(8841823, 2)

In [5]:
queries = pd.read_csv("queries.train.tsv", sep='\t', header=None)
queries.shape

(808731, 2)

In [6]:
triples = pd.read_csv("triples.train.small.tsv", sep='\t', header=None)
triples.shape

(39780811, 3)

In [7]:
query_text_to_id = dict(zip(queries.iloc[:, 1], queries.iloc[:, 0]))
passage_text_to_id = dict(zip(collection.iloc[:, 1], collection.iloc[:, 0]))
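
Both lookups are keyed by text, so if the same text appeared under more than one id, `dict(zip(...))` would keep only the last id seen. I have not checked whether that happens in these files. A minimal sketch of how one might check, on toy data (the `toy_*` names and values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for queries.train.tsv: column 0 is the id, column 1 is the text.
toy_queries = pd.DataFrame(
    [[10, "define extreme"], [11, "types of fruit trees"], [12, "define extreme"]]
)

# Same construction as above: a repeated text keeps the last id seen.
toy_text_to_id = dict(zip(toy_queries.iloc[:, 1], toy_queries.iloc[:, 0]))

# Count rows whose text also appears in at least one other row.
n_duplicated = int(toy_queries.iloc[:, 1].duplicated(keep=False).sum())
print(toy_text_to_id["define extreme"], n_duplicated)
```

Running the same `duplicated(keep=False).sum()` on `queries.iloc[:, 1]` and `collection.iloc[:, 1]` would show whether any ids are being collapsed.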

In [8]:
queries.head()

Unnamed: 0,0,1
0,121352,define extreme
1,634306,what does chattel mean on credit history
2,920825,what was the great leap forward brainly
3,510633,tattoo fixers how much does it cost
4,737889,what is decentralization process.


In [12]:
query_text_to_id["define extreme"]

121352

In [9]:
collection.head()

Unnamed: 0,0,1
0,0,The presence of communication amid scientific ...
1,1,The Manhattan Project and its atomic bomb help...
2,2,Essay on The Manhattan Project - The Manhattan...
3,3,The Manhattan Project was the name for a proje...
4,4,versions of each volume as well as complementa...


In [14]:
passage_text_to_id[collection.iloc[0,1]]

0

In [15]:
triples['qid'] = triples.iloc[:, 0].map(query_text_to_id)
triples['pos_pid'] = triples.iloc[:, 1].map(passage_text_to_id)
triples['neg_pid'] = triples.iloc[:, 2].map(passage_text_to_id)

In [16]:
triples.head()

Unnamed: 0,0,1,2,qid,pos_pid,neg_pid
0,is a little caffeine ok during pregnancy,We don't know a lot about the effects of caf...,It is generally safe for pregnant women to eat...,400296.0,1540783,3518497
1,what fruit is native to australia,Passiflora herbertiana. A rare passion fruit n...,"The kola nut is the fruit of the kola tree, a ...",662731.0,193249,2975302
2,how large is the canadian military,The Canadian Armed Forces. 1 The first large-...,The Canadian Physician Health Institute (CPHI)...,238256.0,4435042,100008
3,types of fruit trees,Cherry. Cherry trees are found throughout the ...,"The kola nut is the fruit of the kola tree, a ...",527862.0,1505983,2975302
4,how many calories a day are lost breastfeeding,"Not only is breastfeeding better for the baby,...","However, you still need some niacin each day; ...",275813.0,5736515,1238670


In [17]:
triples.shape

(39780811, 6)

Dropping NAs removes 13,191 of the 39,780,811 rows (about 0.03%), namely the rows where a text-to-id lookup found no match. That's fine for my use case, as I'm only interested in getting training to "just work".

In [18]:
triples_mapped = triples[['qid', 'pos_pid', 'neg_pid']].dropna()
triples_mapped = triples_mapped.astype(int)
triples_mapped.shape

(39767620, 3)
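
To see which column the missing ids come from, one option is to count NAs per column before dropping. The sketch below uses a small made-up frame (`toy_triples` is hypothetical) rather than the real data:

```python
import pandas as pd

# Toy stand-in for the mapped columns; None marks a lookup that found no match.
toy_triples = pd.DataFrame({
    "qid": [1.0, None, 3.0],
    "pos_pid": [10, 11, 12],
    "neg_pid": [20, 21, 22],
})

# Number of missing ids in each column.
missing_per_column = toy_triples.isna().sum()

# Number of rows that dropna() would remove.
n_dropped = len(toy_triples) - len(toy_triples.dropna())
print(missing_per_column.to_dict(), n_dropped)
```

The same two lines applied to `triples[['qid', 'pos_pid', 'neg_pid']]` would report the per-column counts for the real data.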

In [None]:
with open('triples.train.small.json', 'w') as f:
    for row in triples_mapped.values:
        f.write(json.dumps(row.tolist()) + '\n')
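
As a quick sanity check of the output format, the file can be read back one line at a time with `json.loads`. This sketch writes and reads a tiny temporary file with made-up rows (`toy_rows` and the temp path are hypothetical) so that it does not touch the real output:

```python
import json
import os
import tempfile

# Two made-up [qid, pos_pid, neg_pid] triples.
toy_rows = [[400296, 1540783, 3518497], [662731, 193249, 2975302]]

# Write them the same way as above: one JSON array per line.
path = os.path.join(tempfile.mkdtemp(), "toy_triples.json")
with open(path, "w") as f:
    for row in toy_rows:
        f.write(json.dumps(row) + "\n")

# Read the file back, parsing each line independently.
with open(path) as f:
    loaded = [json.loads(line) for line in f]
print(loaded == toy_rows)
```

Because each line is an independent JSON document, the real file can also be streamed line by line instead of being loaded whole.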