Search the posts on X that you have liked, using the archive download files.
- GitHub repository: https://github.com/cast42/search-x-likes/
- Documentation: https://cast42.github.io/search-x-likes/
- PyPI package: https://pypi.org/project/search-x-likes/
- Hugging Face Space: https://huggingface.co/spaces/cast42/x_likes_search/
First, create a repository on GitHub with the same name as this project, and then run the following commands:
```bash
git init -b main
git add .
git commit -m "init commit"
git remote add origin git@github.com:cast42/search-x-likes.git
git push -u origin main
```
Then, install the environment and the pre-commit hooks with:

```bash
make install
```

This will also generate your `uv.lock` file.
Initially, the CI/CD pipeline might fail due to formatting issues. To resolve them, run:

```bash
uv run pre-commit run -a
```
Lastly, commit the changes made by the two steps above to your repository.
```bash
git add .
git commit -m 'Fix formatting issues'
git push origin main
```
If you use the OpenAI API (for example, to generate the synthetic evaluation dataset), export your key:

```bash
export OPENAI_API_KEY=<your key>
```
You are now ready to start development on your project! The CI/CD pipeline will be triggered when you open a pull request, merge to main, or when you create a new release.
To finalize the set-up for publishing to PyPI, see here. For activating the automatic documentation with MkDocs, see here. To enable the code coverage reports, see here.
- Create an API Token on PyPI.
- Add the API Token to your project's secrets with the name `PYPI_TOKEN` by visiting this page.
- Create a new release on GitHub.
- Create a new tag in the form `*.*.*`.

For more details, see here.
Use `ruff` for linting and formatting, `mypy` for static code analysis, and `pytest` for testing. The documentation is built with `mkdocs`, `mkdocs-material`, and `mkdocstrings`.
To run `uv run python search_x_likes/fix_datasets.py`, set:

```bash
export PYTORCH_ENABLE_MPS_FALLBACK=1
```
Retrieve the first k exact matches. This approach is implemented as a Textual TUI in `search_x_likes/exact_search.py`.
```python
k = 5  # retrieve at most k matching documents
retrieved = []
for document in documents:
    if query in document:
        retrieved.append(document)
        if len(retrieved) == k:  # stop once k matches are found
            break
```
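As a quick sanity check, the exact-match loop can be run on a toy corpus (the documents below are made up for illustration):

```python
documents = [
    "rust is fast",
    "python is popular",
    "rust has a borrow checker",
    "go compiles quickly",
]
query = "rust"
k = 2  # retrieve at most k matching documents

retrieved = []
for document in documents:
    if query in document:
        retrieved.append(document)
        if len(retrieved) == k:  # stop once k matches are found
            break

print(retrieved)  # → ['rust is fast', 'rust has a borrow checker']
```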
BM25S is an efficient Python-based implementation of BM25 that depends only on NumPy and SciPy. It achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them in sparse matrices. This approach is implemented in `search_x_likes/bm25_search.py`.
- BM25 for Python: Achieving high performance while simplifying dependencies with BM25S⚡
- Xing Han Lù, BM25S: Orders of magnitude faster lexical search via eager sparse scoring
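The eager-scoring idea can be illustrated with a minimal pure-Python sketch. This is not the BM25S code itself (which stores scores in SciPy sparse matrices); the function names, tokenization, and parameter defaults here are illustrative only:

```python
import math
from collections import Counter

def build_index(corpus, k1=1.5, b=0.75):
    """Eagerly compute BM25 term-document scores at index time.

    The index is a sparse mapping: term -> {doc_id: precomputed score}.
    """
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency of each term
    df = Counter(term for d in docs for term in set(d))
    index = {}
    for doc_id, d in enumerate(docs):
        for term, f in Counter(d).items():
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
            index.setdefault(term, {})[doc_id] = score
    return index, n

def search(index, n, query, k=5):
    """At query time, scoring reduces to sparse lookups and sums."""
    scores = [0.0] * n
    for term in query.lower().split():
        for doc_id, s in index.get(term, {}).items():
            scores[doc_id] += s
    return sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
```

Because every (term, document) score is computed once at indexing time, each query costs only a handful of dictionary lookups and additions, which is the core of BM25S's speedup.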
Given the embedding A of a document and the embedding B of a query, score their similarity as the normalized dot product of the two vectors: cos(A, B) = (A · B) / (‖A‖ ‖B‖). This approach is implemented in `search_x_likes/cosine_search.py`.
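A minimal sketch of this scoring function using only the standard library (the repository code presumably uses NumPy or sentence-transformers utilities instead):

```python
import math

def cosine_similarity(a, b):
    """Normalized dot product of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0 (orthogonal)
```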
The code to generate the synthetic dataset with gpt-4o-mini is in `search_x_likes/generate_synthetic_eval_dataset.py`. It takes as input the dataset that contains the liked posts on X.
The evaluation results are:
Model | MRR | Recall@5 | NDCG@5 | Wall Time (CPU) | Wall Time (GPU) |
---|---|---|---|---|---|
BM25s | 0.7711 | 0.8367 | 0.3376 | 0.2s | 0.4s |
sentence-transformers/all-MiniLM-L6-v2 | 0.6517 | 0.9246 | 0.3964 | 20s | 4.09s |
nomic-ai/modernbert-embed-base | 0.6654 | 0.9472 | 0.4044 | 3m01s | 6.82s |
intfloat/multilingual-e5-large | 0.7063 | 0.9246 | 0.3823 | 7m57s | 12.5s |
minishlab/potion-retrieval-32M | 0.6346 | 0.8894 | 0.3813 | 2s | 1.64s |
minishlab/potion-base-8M | 0.6128 | 0.8794 | 0.3887 | 0.7s | 1.99s |
tomaarsen/static-retrieval-mrl-en-v1 | 0.5958 | 0.8543 | 0.3771 | 3.23s | 1.87s |
Quality results with rerankers:
Model | MRR | Recall@5 | NDCG@5 |
---|---|---|---|
BM25s - no reranker | 0.7711 | 0.8367 | 0.3376 |
bi-encoder sentence-transformers/all-MiniLM-L6-v2 | 0.7106 | 0.7889 | 0.3243 |
bi-encoder all-mpnet-base-v2 | 0.6778 | 0.7789 | 0.3315 |
bi-encoder minishlab/potion-retrieval-32M | 0.5973 | 0.7638 | 0.3396 |
bi-encoder nomic-ai/modernbert-embed-base | 0.7210 | 0.8065 | 0.3347 |
cross encoder cross-encoder/ms-marco-MiniLM-L-6-v2 | 0.7958 | 0.8417 | 0.3347 |
cross encoder mixedbread-ai/mxbai-rerank-xsmall-v1 | 0.7836 | 0.8417 | 0.3422 |
cross encoder mixedbread-ai/mxbai-rerank-base-v1 | 0.7708 | 0.8417 | 0.3409 |
cross encoder mixedbread-ai/mxbai-rerank-large-v1 | 0.7605 | 0.8342 | 0.3362 |
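For reference, the MRR and Recall@K columns in the tables above can be computed with a sketch like the following; the repository's evaluation script may differ in details such as tie handling, and the function names here are illustrative:

```python
def mrr(results, gold):
    """Mean Reciprocal Rank: average of 1/rank of the relevant doc per query.

    results: list of ranked doc-id lists, one per query.
    gold: the relevant doc id for each query.
    """
    total = 0.0
    for retrieved, relevant in zip(results, gold):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break  # queries with no hit contribute 0
    return total / len(gold)

def recall_at_k(results, gold, k=5):
    """Fraction of queries whose relevant doc appears in the top k."""
    hits = sum(
        1 for retrieved, relevant in zip(results, gold) if relevant in retrieved[:k]
    )
    return hits / len(gold)
```

For example, with two queries whose relevant docs are ranked 2nd and 1st respectively, MRR is (1/2 + 1/1) / 2 = 0.75.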
All contributions are welcome, including more documentation, examples, code, and tests. Even questions.
The package is open-sourced under the conditions of the MIT license.
Repository initiated with fpgmaas/cookiecutter-uv.