# ColBERTv2: Indexing Notebook

## Setup
First, we'll download the necessary dependencies; then we'll import the relevant classes. Note that `Indexer` is the key actor here.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!git -C /content/drive/MyDrive/ColBERT_Indexing/ColBERT/ pull || git clone https://github.com/stanford-futuredata/ColBERT.git /content/drive/MyDrive/ColBERT_Indexing/ColBERT/
import sys; sys.path.insert(0, '/content/drive/MyDrive/ColBERT_Indexing/ColBERT')


Already up to date.


In [3]:
!git -C /content/drive/MyDrive/ColBERT_Indexing/eAS pull || git clone https://github.com/rfeinberg3/eBayAutoSeller.git /content/drive/MyDrive/ColBERT_Indexing/eAS

Already up to date.


In [4]:
try:  # When on Google Colab, let's install all dependencies with pip.
    import google.colab
    !pip install -U pip
    !pip install -e /content/drive/MyDrive/ColBERT_Indexing/ColBERT/['faiss-gpu','torch']
except Exception:
    import sys; sys.path.insert(0, 'ColBERT/')
    try:
        from colbert import Indexer, Searcher
    except Exception:
        print("If you're running outside Colab, please make sure you install ColBERT in conda following the instructions in our README. You can also install (as above) with pip but it may install slower or less stable faiss or torch dependencies. Conda is recommended.")
        assert False

Obtaining file:///content/drive/MyDrive/ColBERT_Indexing/ColBERT
  Preparing metadata (setup.py) ... done
Installing collected packages: colbert-ai
  Attempting uninstall: colbert-ai
    Found existing installation: colbert-ai 0.2.20
    Uninstalling colbert-ai-0.2.20:
      Successfully uninstalled colbert-ai-0.2.20
  Running setup.py develop for colbert-ai
Successfully installed colbert-ai-0.2.20

In [5]:
import colbert

In [6]:
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection

In [7]:
import sys; sys.path.insert(0, '/content/drive/MyDrive/ColBERT_Indexing')
import eAS
from eAS.RAG import DataCollator

## Preprocessing Data for Indexing
- The data must be formatted into a collection, defined as a list of strings. The Searcher compares each query against these strings to find relevant data. We'll use the item description strings, since they carry a lot of information about a given item.

- We want to solve the user story:
`As an e-commerce user looking to sell used items, I would like to list my item at a reasonable price, so as to attract potential buyers quickly.`

- We can achieve this by using the Search feature later to extract indices of the top item description search results. Then we would use the indices to find the corresponding prices for each item returned by the Searcher. (Indexing doesn't change the order of our collection.)
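
The lookup described above can be sketched with toy data. The listings, prices, and search result below are made up for illustration, and `average_price` is a hypothetical helper. The sketch assumes that the Searcher returns a tuple of (passage ids, ranks, scores) and that prices are stored in a list parallel to the collection.

```python
# Toy stand-ins for the real data (hypothetical values, for illustration only).
collection = [
    "Used laptop, 13 inch, good condition",
    "Gaming monitor, 27 inch, 170Hz",
    "Laptop with charger and original box",
]
prices = ["450.00", "220.00", "550.00"]  # prices[i] belongs to collection[i]

# Shape of a search result: (passage_ids, ranks, scores).
# These values are hard-coded here rather than produced by a real Searcher.
results = ([2, 0], [1, 2], [21.3, 19.8])


def average_price(prices, passage_ids):
    """Average the prices found at the given collection indices."""
    values = [float(prices[pid]) for pid in passage_ids]
    return sum(values) / len(values)


passage_ids = results[0]
suggested_price = average_price(prices, passage_ids)
print(suggested_price)
```

Because indexing preserves the order of the collection, the passage ids can be used directly as positions in the price list.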

In [8]:
path_to_data = "/content/drive/MyDrive/ColBERT_Indexing/eAS/Dataset/dataset.json"

In [9]:
dataset = DataCollator(path_to_data)
collection = dataset.get_collection()

In [10]:
# The collection is now a list of passage texts
print(collection[0], collection[1], collection[2], collection[3], sep='\n\n---\n\n')

Life time toll free support 30 day money back guarantee Free shipping Product *Monitor sold separately Qwe Monitor Acer 27” 170Hz 2K Gaming Monitor 1ms AMD FreeSync Premium, WQHD (2560 x 1440), HDR Support (1 x Display Port 1.2 & 2 x HDMI 2.0 Ports) Nitro KG271U Pbiip Cases AZZA Spectra 280W / Gaming / ATX Mid Tower / Tempered Glass / White Power Supply Super Flower Leadex V Platinum PRO White 1000W ATX 80 PLUS PLATINUM Certified Power Supply Memory 8gb $670.00 *This is custom PC configuration build using our system configurator tool. To insure order accuracy please review system configuration before ordering. .st0 { fill: #000720; } System Configurator More info Monitor Acer 27” 170Hz 2K Gaming Monitor 1ms AMD FreeSync Premium, WQHD (2560 x 1440), HDR Support (1 x Display Port 1.2 & 2 x HDMI 2.0 Ports) Nitro KG271U Pbiip More info Assembly and test NONE SELECTED More info Speakers NONE SELECTED More info Cases AZZA Spectra 280W / Gaming / ATX Mid Tower / Tempered Glass / White More in

In [11]:
# Let's see how many passages (string rows) we have
len(collection)

1034

## Indexing

For an efficient search, we can pre-compute the ColBERT representation of each passage and index them.

Below, the `Indexer` takes a model checkpoint and writes a (compressed) index to disk. We then prepare a `Searcher` for retrieval from this index.
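
For intuition about what is being pre-computed: ColBERT represents a passage as one vector per token and scores a query against a passage by taking, for each query token vector, its maximum similarity over the passage's token vectors, then summing those maxima (late interaction). The sketch below shows that scoring rule in plain Python on tiny hand-written 2-dimensional vectors. The vectors and the `maxsim_score` helper are hypothetical; the real library works on batched, compressed tensors.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, passage_vecs):
    """Sum, over query tokens, of the best match among passage tokens."""
    return sum(max(dot(q, d) for d in passage_vecs) for q in query_vecs)


# Hypothetical unit-length token vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # matches both query tokens
passage_b = [[0.6, 0.8]]                          # partial match only

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a, score_b)
```

The passage vectors do not depend on the query, which is why they can be computed once at indexing time and reused for every search.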

In [12]:
# Defining hyperparameters
nbits = 2   # encode each dimension with 2 bits
doc_maxlen = 300 # truncate passages at 300 tokens
max_id = 10000 # Only index up to pid 10,000 to save time and space
nranks = 1 # nranks specifies the number of GPUs to use
kmeans_niters = 4 # kmeans_niters specifies the number of iterations of k-means clustering; 4 is a good and fast default.
                # Consider larger numbers for small datasets.

experiment_path = "/content/drive/MyDrive/ColBERT_Indexing/eAS/RAG"
index_type = 'collection'
index_name = f'{index_type}.kmeans_{kmeans_niters}iters.{nbits}bits' # to be used as index name for the indexer later
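
To get a rough sense of what `nbits` buys, the sketch below estimates the per-token storage of a compressed index. It assumes the layout described for ColBERTv2: 128-dimensional token embeddings, each stored as a 4-byte centroid id plus `nbits` bits per dimension for the residual. The helper name and the 16-bit uncompressed baseline are illustrative assumptions, not values read from this notebook's index.

```python
def bytes_per_embedding(nbits, dim=128, centroid_id_bytes=4):
    """Approximate storage for one compressed token embedding.

    Assumes dim-dimensional vectors whose residual uses nbits bits per
    dimension, plus a fixed-size centroid id (assumed values, see above).
    """
    residual_bytes = dim * nbits // 8
    return centroid_id_bytes + residual_bytes


for candidate_nbits in (1, 2, 4):
    size = bytes_per_embedding(candidate_nbits)
    print(f"nbits={candidate_nbits}: {size} bytes per token embedding")

# Uncompressed baseline for comparison: 128 dimensions at 16-bit precision.
uncompressed = 128 * 2
print(f"uncompressed (16-bit): {uncompressed} bytes per token embedding")
```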

To save space and time, the `Indexer` is capped at the first 10,000 passages (`max_id`). With a larger collection, we would also filter out any queries that have no associated passage with an id below 10,000. Our collection has only 1,034 passages, so neither step applies right now, but it's good to keep in mind.
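
Because the cap is applied with a plain list slice (`collection[:max_id]` in the indexing cell), it is harmless when the collection is smaller than `max_id`. A minimal sketch, using placeholder lists where only the lengths matter:

```python
max_id = 10000

# Placeholder passages; only the lengths matter for this illustration.
small_collection = [f"passage {i}" for i in range(1034)]
large_collection = [f"passage {i}" for i in range(25000)]

# Slicing past the end of a list returns the whole list without raising an error.
print(len(small_collection[:max_id]))
print(len(large_collection[:max_id]))
```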

Now we run the `Indexer` on the collection. This takes about 6 minutes on a T4 GPU.

In [13]:
checkpoint = 'colbert-ir/colbertv2.0'

with Run().context(RunConfig(nranks=nranks, experiment=experiment_path)):
  config = ColBERTConfig(
    nbits=nbits,
    kmeans_niters=kmeans_niters,
    doc_maxlen=doc_maxlen,
    overwrite=True, # overwrite the indexer config if present
  )
  indexer = Indexer(checkpoint=checkpoint, config=config)
  indexer.index(name=index_name, collection=collection[:max_id], overwrite=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




[Jun 22, 01:09:22] #> Note: Output directory /content/drive/MyDrive/ColBERT_Indexing/eAS/RAG/indexes/collection.kmeans_4iters.2bits already exists


[Jun 22, 01:09:22] #> Will delete 10 files already at /content/drive/MyDrive/ColBERT_Indexing/eAS/RAG/indexes/collection.kmeans_4iters.2bits in 20 seconds...
#> Starting...
#> Joined...


In [14]:
indexer.get_index() # You can get the absolute path of the index, if needed.

'/content/drive/MyDrive/ColBERT_Indexing/eAS/RAG/indexes/collection.kmeans_4iters.2bits'

## Search

Having built the index, we can now prepare our `searcher` and search for individual query strings.

In [15]:
# To create the searcher using its relative name (i.e., not a full path), set
# experiment=value_used_for_indexing in the RunConfig.
with Run().context(RunConfig(experiment=experiment_path)):
    searcher = Searcher(index=index_name, collection=collection)

# If you want to customize the search latency--quality tradeoff, you can also supply a
# config=ColBERTConfig(ncells=.., centroid_score_threshold=.., ndocs=..) argument.
# The default settings with k <= 10 (1, 0.5, 256) give the fastest search,
# but you can get a more extensive search by setting larger values of k or by
# manually specifying more conservative ColBERTConfig settings (e.g. (4, 0.4, 4096)).



[Jun 22, 01:10:11] #> Loading codec...
[Jun 22, 01:10:11] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jun 22, 01:10:11] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jun 22, 01:10:11] #> Loading IVF...
[Jun 22, 01:10:11] #> Loading doclens...


100%|██████████| 1/1 [00:00<00:00, 357.57it/s]

[Jun 22, 01:10:11] #> Loading codes and residuals...



100%|██████████| 1/1 [00:00<00:00, 40.10it/s]


Load the item titles. Each title corresponds to the description in the same row of the collection.

In [16]:
queries = dataset.get_queries()

In [17]:
print(queries[0], queries[1], queries[2], queries[3], sep='\n\n---\n\n')

asdzxcqwe MM7.61.04

---

JPT 30W 50W 60W 100W Fiber Laser Marking Engraving Metal Machine & Rotary Axis

---

Fanxiang M.2 NVME 4TB 2TB 1TB SSD PCIE Game SSD Internal Solid State Drive Lot

---

30W Solar Panel 12V Trickle Charger Battery Charger Kit Maintainer Boat Car RV


In [18]:
query = "Macbook Pro" # try with an in-range query or supply your own
print(f"#> {query}")

# Find the top-3 passages for this query
results = searcher.search(query, k=3)

# Print out the top-k retrieved passages
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"\t [{passage_rank}] \t\t {passage_score:.1f} \t\t {searcher.collection[passage_id]}")

#> Macbook Pro

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . Macbook Pro, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([ 101,    1, 6097, 8654, 4013,  102,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103], device='cuda:0')
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')

	 [1] 		 23.6 		 Brand New Apple MacBook Pro MB990LL/A 13.3 in. Notebook!
	 [2] 		 23.6 		 Brand New Apple MacBook Pro MB990LL/A 13.3 in. Notebook!
	 [3] 		 23.6 		 Brand New Apple MacBook Pro MB990LL/A 13.3 in. Notebook!


In [20]:
# Get the average price of the top-3 results
prices = dataset.get_price()
top_passage_ids = results[0]  # results is (passage_ids, ranks, scores)
print(sum(float(prices[pid]) for pid in top_passage_ids) / len(top_passage_ids))

500.0
