
Make possible to pip install #2

Closed
cmacdonald opened this issue Aug 7, 2020 · 14 comments

Comments

@cmacdonald
Contributor

Hello, thanks for your repository and SIGIR paper. We would like to develop wrappers on top of ColBERT. Would it be possible to make the repo pip-installable? This would need:

  • make a setup.py
  • rename src directory as colbert
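A minimal setup.py along these lines could work; the name, version, and layout below are illustrative placeholders, not the repo's actual choices:

```python
# setup.py -- minimal sketch; name/version are placeholder assumptions
from setuptools import setup, find_packages

setup(
    name="colbert",            # assumes src/ has been renamed to colbert/
    version="0.1.0",
    packages=find_packages(),  # picks up any directory with an __init__.py
)
```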
@okhat
Collaborator

okhat commented Aug 8, 2020

Hi Craig. Thanks for developing these wrappers!

I'm working to push a number of major updates (lots of changes, but mainly end-to-end retrieval and optimized inference) to the current barebone version of the repository soon. This sounds like a helpful component to add in there. I will update here once done.

@cmacdonald
Contributor Author

For reference, my code is https://github.com/cmacdonald/pyterrier_bert/blob/master/pyterrier_bert/pyt_colbert.py.

Being able to refer to the colbert package as a dependency would be advantageous.

@okhat
Collaborator

okhat commented Sep 22, 2020

Hi Craig.

FYI, I just released v0.2.0 under a new branch. This is a near-complete rewrite with lots of added functionality and much faster inference, among other things. I added the setup file although I haven't tested pip install yet.

You will notice substantial (but not too massive) changes to the command-line interface. I will update here when I create new instructions for v0.2 (then will merge into master).

https://github.com/stanford-futuredata/ColBERT/tree/v0.2

@cmacdonald
Contributor Author

Thanks, looking forward to trying it!

@cmacdonald
Contributor Author

cmacdonald commented Oct 12, 2020

I tried to use the v0.2 branch tonight and didn't succeed. Here are some points:

  • requirements.txt not present, nor in setup.py -- addressed in my fork, see https://github.com/stanford-futuredata/ColBERT/compare/v0.2...cmacdonald:v0.2?expand=1
  • mlflow is a huge dependency, adding 80MB to an installation (including flask, azure-storage-blob, sqlalchemy).
  • logging.py tries to initialise many things. Can't we have a simpler proxy for this?
  • I don't know what to install to include in requirements.txt: ImportError: TensorBoard logging requires TensorBoard with Python summary writer installed. This should be available in 1.14 or above.
  • Why is Python 3.7 now required? Google Colab is still on 3.6. Which changes actually need 3.7?

I currently use only two functions from ColBERT in "v0.1":

from colbert.evaluation.loaders import load_colbert
from colbert.evaluation.ranking import rerank

In essence:

  • why so many dependencies?
  • Can we have a simpler API for ranking in "client" code:
    • Can load_colbert avoid using Runs?
    • Can colbert/evaluation/ranking.py have its plain rerank function restored?

@okhat
Collaborator

okhat commented Oct 12, 2020

Thanks, Craig. All good points; I think I can address most of those today (PST time).

For the (current) dependencies, it seems like conda is important for faiss (end-to-end retrieval) efficiency since it builds the right setup, but maybe there's a way to get pip to match that.

With conda, this is a minimal working conda env for v0.2

name: colbert-v0.2
channels:
  - pytorch
  - defaults
dependencies:
  - python=3.7
  - pip=20.2.2
  - cudatoolkit=10.1
  - pytorch=1.6.0
  - torchvision=0.7.0
  - faiss-gpu
  - pip:
    - mlflow
    - tensorboard
    - tqdm
    - transformers==3.0.2
    - ujson

Is there a reason to prefer pip?

@cmacdonald
Contributor Author

Thanks for the quick reply. I'll give the dependencies another shot tomorrow. I suspect that if you address the last two points:

  • "Can load_colbert avoid using Runs?"
  • "Can colbert/evaluation/ranking.py have its plain rerank function restored?"

then my immediate integration will be easier.

@okhat
Collaborator

okhat commented Oct 12, 2020

For the second point about rerank, I think this could work for you with the v0.2 code:

from colbert.modeling.inference import ModelInference
from colbert.evaluation.slow import slow_rerank

# Prepare args   (new: set args.amp to True to use mixed-precision mode, which is often >2x faster with virtually the same accuracy for document encoding)
[...]

# Create an inference manager
args.inference = ModelInference(args.colbert, amp=args.amp)

slow_rerank(args, query, pids, passages)

If you like, we could wrap these two lines in a function called rerank but that would be slow-ish since it would keep re-creating the inference manager. One could cache that into the colbert object though. Let me know if you prefer that.

The current code happens to be more tuned for command-line usage than internal-library usage, but I see the value of the latter (e.g., in the context of pyterrier) so I'm happy to see what would make things easier on your end.

This should be fairly straightforward for "slow" re-ranking (as above), but it should also be doable for indexing, fast re-ranking, and/or end-to-end retrieval.
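The caching idea mentioned above could be sketched as follows. The rerank helper and Args holder are hypothetical, and the ModelInference and slow_rerank definitions below are self-contained stand-ins for the real colbert.modeling.inference.ModelInference and colbert.evaluation.slow.slow_rerank, so the pattern is runnable on its own:

```python
# Stand-in for colbert.modeling.inference.ModelInference (not the real class).
class ModelInference:
    def __init__(self, colbert, amp=False):
        self.colbert, self.amp = colbert, amp

# Stand-in for colbert.evaluation.slow.slow_rerank (not the real function).
def slow_rerank(args, query, pids, passages):
    return list(zip(pids, passages))

class Args:  # simple attribute holder, mimicking the args object
    pass

def rerank(args, query, pids, passages):
    # Build the inference manager once, then reuse it across calls,
    # instead of re-creating it on every rerank invocation.
    if getattr(args, "inference", None) is None:
        args.inference = ModelInference(args.colbert, amp=getattr(args, "amp", False))
    return slow_rerank(args, query, pids, passages)
```

The point is only the caching shape: the expensive ModelInference construction happens on the first call, and subsequent calls reuse the object stored on args (or it could be cached on the colbert object instead).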

@okhat
Collaborator

okhat commented Oct 13, 2020

Added conda_env.yml.

conda create --name colbert-v0.2 --file conda_env.yml

I don't have reliable experience installing faiss with pip (it works but could be compiled suboptimally for the architecture, not use MKL, etc.), so in general I recommend this way if end-to-end retrieval will be used.

  • mlflow is a huge dependency, adding 80MB to an installation (including flask, azure-storage-blob, sqlalchemy).
  • logging.py tries to initialise many things. Cant we have a simpler proxy for this?

I agree that lightweight logging makes sense for some use cases. Perhaps I can update the code to use these libraries iff they are installed. How critical do you find this?

  • Why Python 3.7 now required. Google Colab is still on 3.6. What changes actually need 3.7?

Thanks for the point about Colab. Python 3.7 dictionaries are ordered by default (this is most relevant for defaultdict objects, since otherwise one could rely on OrderedDict). The v0.2 code isn't tested without that assumption.

This only affects loading related code (mostly, if not entirely, evaluation/loaders.py and the downstream code that uses it), though, which isn't too deep into the core code.

  • Can colbert/evaluation/ranking.py have its plain rerank function restored?

Addressed separately above.

  • Can load_colbert avoid using Runs?

Extracted load_model with minimal dependencies under colbert.evaluation.load_model. This should do the job.

@cmacdonald
Contributor Author

cmacdonald commented Oct 13, 2020

Is there a reason to prefer pip?

Just because it's a familiar method to install library dependencies.
I added tensorboard etc. I couldn't get faiss-gpu to pip install on my laptop (swig error), but faiss-cpu does. faiss-gpu pip installs fine on Colab. See Colab example: https://colab.research.google.com/drive/1EC5RhRk0eux0CTbgoYqxKrWB2c4NEMwA?usp=sharing

Thanks for the point about Colab. Python 3.7 dictionaries are ordered by default (this is most relevant for defaultdict objects, since otherwise one could rely on OrderedDict). The v0.2 code isn't tested without that assumption.
This only affects loading related code (mostly, if not entirely, evaluation/loaders.py and the downstream code that uses it), though, which isn't too deep into the core code.

The CPython implementation of Python 3.6 retains dictionary insertion order; it's just not guaranteed across all Python implementations: https://stackoverflow.com/a/39980744.
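For illustration, the distinction in one runnable snippet (behaviour as documented for 3.7+; nothing here is ColBERT-specific):

```python
from collections import OrderedDict

d = {"b": 1, "a": 2}
# Insertion order of plain dicts is part of the language spec from 3.7;
# CPython 3.6 also preserves it, but only as an implementation detail.
assert list(d) == ["b", "a"]

# OrderedDict gives the same guarantee on every implementation and version.
od = OrderedDict([("b", 1), ("a", 2)])
assert list(od) == ["b", "a"]
```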

Extracted load_model with minimal dependencies under colbert.evaluation.load_model. This should do the job.

Thanks. Testing this.

I then got:

/usr/local/lib/python3.6/dist-packages/colbert/evaluation/ranking.py in <module>()
     12 
     13 from colbert.evaluation.metrics import Metrics
---> 14 from colbert.evaluation.ranking_logger import RankingLogger
     15 from colbert.modeling.inference import ModelInference
     16 

/usr/local/lib/python3.6/dist-packages/colbert/evaluation/ranking_logger.py in <module>()
      1 import os
      2 
----> 3 from contextlib import contextmanager, nullcontext
      4 from colbert.utils.utils import print_message
      5 from colbert.utils.runs import Run

ImportError: cannot import name 'nullcontext'

nullcontext only exists in Python 3.7 - I used a workaround from https://stackoverflow.com/a/45187287. You can see my fork.
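The workaround amounts to a small backport of nullcontext, roughly along these lines (a sketch in the spirit of the Stack Overflow answer above):

```python
from contextlib import contextmanager

@contextmanager
def nullcontext(enter_result=None):
    # Drop-in stand-in for contextlib.nullcontext (added in Python 3.7):
    # a no-op context manager that simply yields its argument.
    yield enter_result
```

Usage is identical to the 3.7 built-in, e.g. `with nullcontext(f) as g: ...`.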

Finally, I wasn't able to load an existing ColBERT model. Is this expected?

Some weights of the model checkpoint at bert-base-uncased were not used when initializing ColBERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing ColBERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing ColBERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[Oct 13, 13:12:36] #> Loading model checkpoint.
[Oct 13, 13:12:36] #> Loading checkpoint ./colbert.dnn
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-f89614f10a2b> in <module>()
      1 from pyterrier_bert.passager import SlidingWindowPassager, MaxPassage
      2 from pyterrier_bert.pyt_colbert import ColBERTPipeline
----> 3 colbert = ColBERTPipeline("./colbert.dnn", doc_attr="body", verbose=True)

3 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
   1043         if len(error_msgs) > 0:
   1044             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
-> 1045                                self.__class__.__name__, "\n\t".join(error_msgs)))
   1046         return _IncompatibleKeys(missing_keys, unexpected_keys)
   1047 

RuntimeError: Error(s) in loading state_dict for ColBERT:
	Missing key(s) in state_dict: "bert.embeddings.position_ids". 

Might this be because our models were trained with older huggingface versions?

@okhat
Collaborator

okhat commented Oct 13, 2020

Thanks! Will address each.

For the last item, it should work with transformers 3.0.2 as in the env. They introduced a breaking change in 3.1.

@cmacdonald
Contributor Author

cmacdonald commented Oct 13, 2020

For the last item, it should work with transformers 3.0.2 as in the env. They introduced a breaking change in 3.1.

Ah, yes, thanks.

Finally, the input text is type-checked for list or tuple, while I passed a numpy array. I simply used .tolist(), but it seemed unnecessary?
Relevant lines were: https://github.com/stanford-futuredata/ColBERT/blob/v0.2/colbert/modeling/tokenization/doc_tokenization.py#L19 or https://github.com/stanford-futuredata/ColBERT/blob/v0.2/colbert/modeling/tokenization/doc_tokenization.py#L45 - would a numpy array of text not be acceptable here?
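For reference, the conversion I used is simply this (a minimal illustration; the isinstance check mirrors the kind of assertion in doc_tokenization.py):

```python
import numpy as np

docs = np.array(["first passage", "second passage"])

# The tokenizer asserts the batch is a list or tuple, so a numpy array
# of strings must be converted before being passed in:
batch = docs.tolist()
assert isinstance(batch, list) and all(isinstance(d, str) for d in batch)
```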

Ok pegging to 3.0.2 and using slow_rerank works for me. Thanks for your timely help!

@okhat
Collaborator

okhat commented Oct 13, 2020

I'm not sure about a numpy array of text. It could work, possibly. These assertions guard against string inputs (at least) because the downstream code wouldn't complain about them; it would just produce weird outcomes.

By the way, be sure to set args.mask_punctuation = True (the default happens to be False in v0.2 simply because it's a flag, --mask-punctuation; I'll try to switch to a negative flag in the next version, since the intention is that it stays on unless someone explicitly disables it).

@cmacdonald
Contributor Author

args.mask_punctuation = True

Got it - https://github.com/cmacdonald/pyterrier_bert/blob/colbert0.2/pyterrier_bert/pyt_colbert.py#L40

PR sent for ColBERT.

okhat pushed a commit that referenced this issue Dec 5, 2020
* add .gitignore

* define requirements.txt and import in setup.py, try setup.py at Python 3.6

* fix requirements for pip

* added missing init.py files

* add missing init.py

* py36 support

* patch the nullcontext

* peg transformer version

* 3.6 not 3.5
okhat closed this as completed Dec 5, 2020
Yaasha pushed a commit to Yaasha/ColBERT that referenced this issue Sep 7, 2022