text_embedder is a powerful and flexible Python library for generating and managing text embeddings with pre-trained, transformer-based multilingual embedding models. It supports multiple pooling strategies, similarity functions, and quantization techniques, making it a versatile tool for NLP tasks such as embedding generation, similarity search, and clustering.
- Model Integration: Wraps 🤗 Transformers to leverage state-of-the-art pre-trained embedding models.
- Pooling Strategies: Choose from multiple pooling methods, such as CLS token and max/mean pooling, to tailor embeddings to your needs.
- Flexible Similarity Metrics: Compute similarity scores between embeddings using cosine, dot-product, Euclidean, and Manhattan metrics.
- Quantization Support: Reduce memory usage and improve performance by quantizing embeddings to multiple precision levels, with support for automatic mixed-precision quantization.
- Prompt Support: Optionally include a custom prompt in embeddings for contextualized representation.
- Configurable Options: Tune embedding generation with options for batch size, sequence length, normalization, and more.
Install text_embedder from PyPI using pip:
```bash
pip install text_embedder
```

Initialize the `TextEmbedder` with your desired configuration:

```python
from text_embedder import TextEmbedder

embedder = TextEmbedder(
    model="BAAI/bge-small-en",
    sim_fn="cosine",
    pooling_strategy=["cls"],
    device="cuda",  # Specify the device if needed
)
```

Generate embeddings for a list of texts:
```python
embeddings = embedder.embed(["Hello world", "Transformers are amazing!"])
print(embeddings)
```

Compute similarity between two embeddings:
```python
embedding1 = embedder.embed(["Cat jumped from a chair"])
embedding2 = embedder.embed(["Mamba architecture is better than transformers tho, ngl."])

similarity_score = embedder.get_similarity(embedding1, embedding2)
print(f"Similarity Score: {similarity_score}")
```

You can choose from various pooling strategies:
"cls": Use the CLS token embedding."max": Take the maximum value across tokens."mean": Compute the mean of token embeddings."mean_sqrt_len": Compute the mean divided by the square root of token length."weightedmean": Compute a weighted mean of token embeddings."lasttoken": Use the last token embedding.
Supported similarity functions:
- Cosine Similarity: Measures the cosine of the angle between two vectors.
- Dot Product: Measures the dot product between two vectors.
- Euclidean Distance: Measures the straight-line (L2) distance between two vectors.
- Manhattan Distance: Measures the sum of absolute differences (L1) between two vectors.
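As a quick reference for what each metric computes, here is a small NumPy sketch; it is independent of the library's actual `get_similarity` implementation:

```python
import numpy as np

def cosine(a, b):
    # Cosine of the angle between the vectors; 1.0 means identical direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot(a, b):
    # Unnormalized dot product; sensitive to vector magnitude.
    return np.dot(a, b)

def euclidean(a, b):
    # Straight-line (L2) distance; smaller means more similar.
    return np.linalg.norm(a - b)

def manhattan(a, b):
    # Sum of absolute differences (L1); smaller means more similar.
    return np.abs(a - b).sum()

a, b = np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])
print(cosine(a, b), dot(a, b), euclidean(a, b), manhattan(a, b))
```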
Quantize embeddings to lower precision:
- float32: 32-bit floating-point precision.
- float16: 16-bit floating-point precision.
- int8: 8-bit integer precision.
- uint8: 8-bit unsigned integer precision.
- binary: Binary quantization.
- ubinary: Unsigned binary quantization.
- 2bit: 2-bit quantization.
- 4bit: 4-bit quantization.
- 8bit: 8-bit quantization.
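For intuition, the sketch below shows one common way scalar int8 and binary quantization are implemented for embeddings. The library's actual quantization code may differ; the helper names here are illustrative:

```python
import numpy as np

def quantize_int8(embeddings: np.ndarray) -> np.ndarray:
    # Scale each dimension's observed range onto the int8 range [-127, 127].
    scale = np.abs(embeddings).max(axis=0, keepdims=True) + 1e-12
    return np.round(embeddings / scale * 127).astype(np.int8)

def quantize_binary(embeddings: np.ndarray) -> np.ndarray:
    # Keep only the sign of each dimension and pack 8 dimensions per byte:
    # a 384-dim float32 vector shrinks from 1536 bytes to 48 bytes.
    return np.packbits(embeddings > 0, axis=-1)

emb = np.random.randn(4, 384).astype(np.float32)
print(quantize_int8(emb).shape, quantize_binary(emb).shape)  # (4, 384) (4, 48)
```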
Planned improvements:
- Additional Pooling Strategies: Implement more advanced pooling methods (e.g., attention-based). Also add an `auto` option to `pooling_strategy` that selects an appropriate pooling method based on the model config.
- Custom Quantization Methods: Add new quantization techniques for further improvements.
- Similarity Functions: Add more similarity metrics.
Contributions are welcome! Please follow these steps to get started with your contribution:
- Fork the repository.
- Create a new branch (`git checkout -b feature/your-feature`).
- Make your changes.
- Commit your changes (`git commit -am 'Add new feature'`).
- Push to the branch (`git push origin feature/your-feature`).
- Create a new Pull Request.
This project is licensed under the MIT License. See the LICENSE file for details.
Special thanks to the developers of the Sentence-Transformers library.