Skip to content

FEA Add TextEncoder#1077

Merged
jeromedockes merged 78 commits intoskrub-data:mainfrom
Vincent-Maladiere:add_sentence_transformer
Nov 18, 2024
Merged

FEA Add TextEncoder#1077
jeromedockes merged 78 commits intoskrub-data:mainfrom
Vincent-Maladiere:add_sentence_transformer

Conversation

@Vincent-Maladiere
Copy link
Copy Markdown
Member

@Vincent-Maladiere Vincent-Maladiere commented Sep 19, 2024

closes #1047

Reference
#1047

What does this PR propose?

This PR wraps the SentenceTransformers library, in the same fashion as ragger-duck and embetter does.

Following the good empirical baselines of arxiv.org/abs/2312.09634, this PR also proposes to:

  1. Add a PCA within the estimator, controlled by a n_components parameter. This also allows the class to stay consistent with the other skrub encoders and perform HP tuning more easily. Following the paper, we set n_components=30 by default.
  2. Add a normalization option, with a default "l2". This simplifies cosine distance to dot products and brings embeddings as comparable features to distance-based estimators (e.g., NearestNeighbors). This is useful when other normalized features are used (fuzzy joining).

What the SentenceEncoder does not:

  • perform weights or embedding quantization (kept for later research)
  • set torch threading, which doesn't interact well with joblib and raises an error if called twice.
  • choose a different embedding Pooling scheme (only mean pooling is implemented in SentenceTransformer)

See the implementation of the paper for more details.

Todo in another PR
Enriching an existing example from skrub with the SentenceEncoder.

@Vincent-Maladiere
Copy link
Copy Markdown
Member Author

Pixi is extremely annoying. It has negative returns on productivity as far as I am concerned.

@Vincent-Maladiere
Copy link
Copy Markdown
Member Author

Vincent-Maladiere commented Sep 19, 2024

I have an intriguing segfault due to pytorch, that I can't reproduce locally with pixi or mamba. I'd be happy to get feedback on this. I suspect pixi doesn't download torch along with sentence-transformers from Pypi.

Comment thread pyproject.toml Outdated
@glemaitre glemaitre self-requested a review September 19, 2024 15:36
Comment thread pyproject.toml
@Vincent-Maladiere
Copy link
Copy Markdown
Member Author

Ping @jovan-stojanovic :)

Copy link
Copy Markdown
Member

@jovan-stojanovic jovan-stojanovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

@Vincent-Maladiere
Copy link
Copy Markdown
Member Author

@GaelVaroquaux can we merge this one?

@GaelVaroquaux
Copy link
Copy Markdown
Member

I'm having a look right now. I played with the example on my computer. It's really cool

I've created a small PR to your branch here: Vincent-Maladiere#1
If you're happy with it, can you merge it please?

Also, there is a conflict with main, which means that we cannot merge

Focus on the string encoding, rather than the sentiment analysis
Copy link
Copy Markdown
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please merge my doc PR Vincent-Maladiere#1 :)

@jeromedockes jeromedockes merged commit b2c4f82 into skrub-data:main Nov 18, 2024
@jeromedockes
Copy link
Copy Markdown
Member

Nice!!! thanks @Vincent-Maladiere ! very exciting -- let's try to release it soon :) 🚀

@GaelVaroquaux
Copy link
Copy Markdown
Member

Very exciting indeed! Bravo!

@GaelVaroquaux
Copy link
Copy Markdown
Member

Darn, we have a build failure on circleCI on main:
image

I'm restarting the build.

@jeromedockes
Copy link
Copy Markdown
Member

jeromedockes commented Nov 18, 2024 via email

@GaelVaroquaux
Copy link
Copy Markdown
Member

Yes indeed. It fails reliably. I fear that it might be related to the PR: some nasty interaction between examples.

We'll need to bisect to know whether it's related to the dependencies that are being installed, or the code that is being run. I've created an issue to track this: #1143

@glemaitre
Copy link
Copy Markdown
Member

Memory issues?

@Vincent-Maladiere
Copy link
Copy Markdown
Member Author

I can try to bisect on this tomorrow!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a FastText encoder

5 participants