
Make possible to pip install #2

Closed
cmacdonald opened this issue Aug 7, 2020 · 14 comments

Comments

@cmacdonald
Contributor

Hello, thanks for your repository and SIGIR paper. We would like to develop wrappers on top of ColBERT. Would it be possible to make the repo pip-installable? This would need:

  • make a setup.py
  • rename src directory as colbert
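A minimal setup.py along these lines could work; the name, version, and layout below are illustrative placeholders, not the repo's actual choices:

```python
# setup.py -- minimal sketch; name/version are placeholder assumptions
from setuptools import setup, find_packages

setup(
    name="colbert",            # assumes src/ has been renamed to colbert/
    version="0.1.0",
    packages=find_packages(),  # picks up any directory with an __init__.py
)
```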
@okhat
Collaborator

okhat commented Aug 8, 2020

Hi Craig. Thanks for developing these wrappers!

I'm working to push a number of major updates (lots of changes, but mainly end-to-end retrieval and optimized inference) to the current barebone version of the repository soon. This sounds like a helpful component to add in there. I will update here once done.

@cmacdonald
Contributor Author

For reference, my code is https://github.com/cmacdonald/pyterrier_bert/blob/master/pyterrier_bert/pyt_colbert.py.

Being able to refer to the colbert package as a dependency would be advantageous.

@okhat
Collaborator

okhat commented Sep 22, 2020

Hi Craig.

FYI, I just released v0.2.0 under a new branch. This is a near-complete rewrite with lots of added functionality and much faster inference, among other things. I added the setup file although I haven't tested pip install yet.

You will notice substantial (but not too massive) changes to the command-line interface. I will update here when I create new instructions for v0.2 (then will merge into master).

https://github.com/stanford-futuredata/ColBERT/tree/v0.2

@cmacdonald
Contributor Author

Thanks, looking forward to trying it!

@cmacdonald
Contributor Author

cmacdonald commented Oct 12, 2020

I tried to use the v0.2 branch tonight and didn't succeed. Here are some points:

  • requirements.txt not present, nor in setup.py -- addressed in my fork, see https://github.com/stanford-futuredata/ColBERT/compare/v0.2...cmacdonald:v0.2?expand=1
  • mlflow is a huge dependency, adding 80MB to an installation (including flask, azure-storage-blob, sqlalchemy).
  • logging.py tries to initialise many things. Can't we have a simpler proxy for this?
  • I don't know what to install to include in requirements.txt: ImportError: TensorBoard logging requires TensorBoard with Python summary writer installed. This should be available in 1.14 or above.
  • Why is Python 3.7 now required? Google Colab is still on 3.6. Which changes actually need 3.7?

I currently use only two functions from ColBERT in "v0.1":

from colbert.evaluation.loaders import load_colbert
from colbert.evaluation.ranking import rerank

In essence:

  • why so many dependencies?
  • Can we have a simpler API for ranking in "client" code:
    • Can load_colbert avoid using Runs?
    • Can colbert/evaluation/ranking.py have its plain rerank function restored?

@okhat
Collaborator

okhat commented Oct 12, 2020

Thanks, Craig. All good points; I think I can address most of those today (PST time).

For the (current) dependencies, it seems like conda is important for faiss (end-to-end retrieval) efficiency since it builds the right setup, but maybe there's a way to get pip to match that.

With conda, this is a minimal working conda env for v0.2

name: colbert-v0.2
channels:
  - pytorch
  - defaults
dependencies:
  - python=3.7
  - pip=20.2.2
  - cudatoolkit=10.1
  - pytorch=1.6.0
  - torchvision=0.7.0
  - faiss-gpu
  - pip:
    - mlflow
    - tensorboard
    - tqdm
    - transformers==3.0.2
    - ujson

Is there a reason to prefer pip?

@cmacdonald
Contributor Author

Thanks for the quick reply. I'll give the dependencies another shot tomorrow. I suspect that if you address the last two points:

  • "Can load_colbert avoid using Runs?"
  • "Can colbert/evaluation/ranking.py have its plain rerank function restored?"

then my immediate integration will be easier.

@okhat
Collaborator

okhat commented Oct 12, 2020

For the second point about rerank, I think this could work for you with the v0.2 code:

from colbert.modeling.inference import ModelInference
from colbert.evaluation.slow import slow_rerank

# Prepare args   (new: set args.amp to True to use mixed-precision mode, which is often >2x faster with virtually the same accuracy for document encoding)
[...]

# Create an inference manager
args.inference = ModelInference(args.colbert, amp=args.amp)

slow_rerank(args, query, pids, passages)

If you like, we could wrap these two lines in a function called rerank but that would be slow-ish since it would keep re-creating the inference manager. One could cache that into the colbert object though. Let me know if you prefer that.

The current code happens to be more tuned for command-line usage than internal-library usage, but I see the value of the latter (e.g., in the context of pyterrier) so I'm happy to see what would make things easier on your end.

This should be fairly straightforward for "slow" re-ranking (as above), but it should also be doable for indexing, fast re-ranking, and/or end-to-end retrieval.
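The caching idea mentioned above could be sketched as follows. The rerank helper and Args holder are hypothetical, and the ModelInference and slow_rerank definitions below are self-contained stand-ins for the real colbert.modeling.inference.ModelInference and colbert.evaluation.slow.slow_rerank, so the pattern is runnable on its own:

```python
# Stand-in for colbert.modeling.inference.ModelInference (not the real class).
class ModelInference:
    def __init__(self, colbert, amp=False):
        self.colbert, self.amp = colbert, amp

# Stand-in for colbert.evaluation.slow.slow_rerank (not the real function).
def slow_rerank(args, query, pids, passages):
    return list(zip(pids, passages))

class Args:  # simple attribute holder, mimicking the args object
    pass

def rerank(args, query, pids, passages):
    # Build the inference manager once, then reuse it across calls,
    # instead of re-creating it on every rerank invocation.
    if getattr(args, "inference", None) is None:
        args.inference = ModelInference(args.colbert, amp=getattr(args, "amp", False))
    return slow_rerank(args, query, pids, passages)
```

The point is only the caching shape: the expensive ModelInference construction happens on the first call, and subsequent calls reuse the object stored on args (or it could be cached on the colbert object instead).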

@okhat
Collaborator

okhat commented Oct 13, 2020

Added conda_env.yml.

conda create --name colbert-v0.2 --file conda_env.yml

I don't have reliable experience installing faiss with pip (it works but could be compiled suboptimally for the architecture, not use MKL, etc.), so in general I recommend this way if end-to-end retrieval will be used.

  • mlflow is a huge dependency, adding 80MB to an installation (including flask, azure-storage-blob, sqlalchemy).
  • logging.py tries to initialise many things. Cant we have a simpler proxy for this?

I agree that lightweight logging makes sense for some use cases. Perhaps I can update the code to use these libraries iff they are installed. How critical do you find this?

  • Why Python 3.7 now required. Google Colab is still on 3.6. What changes actually need 3.7?

Thanks for the point about Colab. Python 3.7 dictionaries are ordered by default (this is most relevant for defaultdict objects, since otherwise one could rely on OrderedDict). The v0.2 code isn't tested without that assumption.

This only affects loading related code (mostly, if not entirely, evaluation/loaders.py and the downstream code that uses it), though, which isn't too deep into the core code.

  • Can colbert/evaluation/ranking.py have its plain rerank function restored?

Addressed separately above.

  • Can load_colbert avoid using Runs?

Extracted load_model with minimal dependencies under colbert.evaluation.load_model. This should do the job.

@cmacdonald
Contributor Author

cmacdonald commented Oct 13, 2020

Is there a reason to prefer pip?

Just because it's a familiar method to install library dependencies.
I added tensorboard etc. I couldn't get faiss-gpu to pip install on my laptop (swig error), but faiss-cpu does. faiss-gpu pip installs fine on Colab. See Colab example: https://colab.research.google.com/drive/1EC5RhRk0eux0CTbgoYqxKrWB2c4NEMwA?usp=sharing

Thanks for the point about Colab. Python 3.7 dictionaries are ordered by default (this is most relevant for defaultdict objects, since otherwise one could rely on OrderedDict). The v0.2 code isn't tested without that assumption.
This only affects loading related code (mostly, if not entirely, evaluation/loaders.py and the downstream code that uses it), though, which isn't too deep into the core code.

The CPython implementation of Python 3.6 retains dictionary insertion order; it's just not guaranteed across all Python implementations: https://stackoverflow.com/a/39980744.
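For illustration, the distinction in one runnable snippet (behaviour as documented for 3.7+; nothing here is ColBERT-specific):

```python
from collections import OrderedDict

d = {"b": 1, "a": 2}
# Insertion order of plain dicts is part of the language spec from 3.7;
# CPython 3.6 also preserves it, but only as an implementation detail.
assert list(d) == ["b", "a"]

# OrderedDict gives the same guarantee on every implementation and version.
od = OrderedDict([("b", 1), ("a", 2)])
assert list(od) == ["b", "a"]
```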

Extracted load_model with minimal dependencies under colbert.evaluation.load_model. This should do the job.

Thanks. Testing this.

I then got:

/usr/local/lib/python3.6/dist-packages/colbert/evaluation/ranking.py in <module>()
     12 
     13 from colbert.evaluation.metrics import Metrics
---> 14 from colbert.evaluation.ranking_logger import RankingLogger
     15 from colbert.modeling.inference import ModelInference
     16 

/usr/local/lib/python3.6/dist-packages/colbert/evaluation/ranking_logger.py in <module>()
      1 import os
      2 
----> 3 from contextlib import contextmanager, nullcontext
      4 from colbert.utils.utils import print_message
      5 from colbert.utils.runs import Run

ImportError: cannot import name 'nullcontext'

nullcontext only exists in Python 3.7 - I used a workaround from https://stackoverflow.com/a/45187287. You can see my fork.
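The workaround amounts to a small backport of nullcontext, roughly along these lines (a sketch in the spirit of the Stack Overflow answer above):

```python
from contextlib import contextmanager

@contextmanager
def nullcontext(enter_result=None):
    # Drop-in stand-in for contextlib.nullcontext (added in Python 3.7):
    # a no-op context manager that simply yields its argument.
    yield enter_result
```

Usage is identical to the 3.7 built-in, e.g. `with nullcontext(f) as g: ...`.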

Finally, I wasn't able to load an existing ColBERT model. Is this expected?

Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[Oct 13, 13:12:36] #> Loading model checkpoint.
[Oct 13, 13:12:36] #> Loading checkpoint ./colbert.dnn
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-f89614f10a2b> in <module>()
      1 from pyterrier_bert.passager import SlidingWindowPassager, MaxPassage
      2 from pyterrier_bert.pyt_colbert import ColBERTPipeline
----> 3 colbert = ColBERTPipeline("./colbert.dnn", doc_attr="body", verbose=True)

3 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
   1043         if len(error_msgs) > 0:
   1044             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
-> 1045                                self.__class__.__name__, "\n\t".join(error_msgs)))
   1046         return _IncompatibleKeys(missing_keys, unexpected_keys)
   1047 

RuntimeError: Error(s) in loading state_dict for ColBERT:
	Missing key(s) in state_dict: "bert.embeddings.position_ids". 

Might this be because our models were trained with older huggingface versions?

@okhat
Collaborator

okhat commented Oct 13, 2020

Thanks! Will address each.

For the last item, it should work with transformers 3.0.2 as in the env. They introduced a breaking change in 3.1.

@cmacdonald
Contributor Author

cmacdonald commented Oct 13, 2020

For the last item, it should work with transformers 3.0.2 as in the env. They introduced a breaking change in 3.1.

Ah, yes, thanks.

Finally, the input text is type-checked for list or tuple, while I passed a numpy array. I simply used .tolist(), but it seemed unnecessary?
Relevant lines were: https://github.com/stanford-futuredata/ColBERT/blob/v0.2/colbert/modeling/tokenization/doc_tokenization.py#L19 or https://github.com/stanford-futuredata/ColBERT/blob/v0.2/colbert/modeling/tokenization/doc_tokenization.py#L45 - would a numpy array of text not be acceptable here?
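For reference, the conversion I used is simply this (a minimal illustration; the isinstance check mirrors the kind of assertion in doc_tokenization.py):

```python
import numpy as np

docs = np.array(["first passage", "second passage"])

# The tokenizer asserts the batch is a list or tuple, so a numpy array
# of strings must be converted before being passed in:
batch = docs.tolist()
assert isinstance(batch, list) and all(isinstance(d, str) for d in batch)
```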

Ok pegging to 3.0.2 and using slow_rerank works for me. Thanks for your timely help!

@okhat
Collaborator

okhat commented Oct 13, 2020

I'm not sure about a numpy array of text. It could work, possibly. These assertions guard against string inputs (at least) because the downstream code wouldn't complain about them; it would just produce weird outcomes.

By the way, be sure to set args.mask_punctuation = True (the default happens to be False in v0.2 simply because it's a flag, --mask-punctuation; I'll try to switch to a negative flag in the next version, since the intention is that it stays on unless someone explicitly disables it).

@cmacdonald
Contributor Author

args.mask_punctuation = True

Got it - https://github.com/cmacdonald/pyterrier_bert/blob/colbert0.2/pyterrier_bert/pyt_colbert.py#L40

PR sent for ColBERT.

okhat pushed a commit that referenced this issue Dec 5, 2020
* add .gitignore

* define requirements.txt and import in setup.py, try setup.py at Python 3.6

* fix requirements for pip

* added missing init.py files

* add missing init.py

* py36 support

* patch the nullcontext

* peg transformer version

* 3.6 not 3.5
okhat closed this as completed Dec 5, 2020
Yaasha pushed a commit to Yaasha/ColBERT that referenced this issue Sep 7, 2022