
search-x-likes


Search the posts on X that you have liked, using the archive download files.

Getting started with your project

1. Create a New Repository

First, create a repository on GitHub with the same name as this project, and then run the following commands:

git init -b main
git add .
git commit -m "init commit"
git remote add origin git@github.com:cast42/search-x-likes.git
git push -u origin main

2. Set Up Your Development Environment

Then, install the environment and the pre-commit hooks with

make install

This will also generate your uv.lock file.

3. Run the pre-commit hooks

Initially, the CI/CD pipeline might fail due to formatting issues. To resolve them, run:

uv run pre-commit run -a

4. Commit the changes

Lastly, commit the changes made by the two steps above to your repository.

git add .
git commit -m 'Fix formatting issues'
git push origin main

5. Set the OPENAI_API_KEY environment variable

export OPENAI_API_KEY=<your key>

You are now ready to start development on your project! The CI/CD pipeline will be triggered when you open a pull request, merge to main, or when you create a new release.

To finalize the set-up for publishing to PyPI, see here. For activating the automatic documentation with MkDocs, see here. To enable the code coverage reports, see here.

Releasing a new version

  • Create an API Token on PyPI.
  • Add the API Token to your project's secrets with the name PYPI_TOKEN by visiting this page.
  • Create a new release on GitHub.
  • Create a new tag in the form *.*.*.

For more details, see here.

Development

Use ruff for linting and formatting, mypy for static code analysis, and pytest for testing.

The documentation is built with mkdocs, mkdocs-material and mkdocstrings.

Datasets

Before running uv run python search_x_likes/fix_datasets.py, set:

export PYTORCH_ENABLE_MPS_FALLBACK=1

Search approaches

Exact search

Retrieve the first k exact matches. This approach is implemented as a Textual TUI in search_x_likes/exact_search.py.

k = 5  # retrieve at most k matching documents
retrieved = []
for document in documents:
    if query in document:
        retrieved.append(document)
    if len(retrieved) == k:  # stop once k matches have been found
        break

BM25S

BM25S is an efficient Python-based implementation of BM25 that depends only on Numpy and Scipy. It achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them in sparse matrices.

This approach is implemented in search_x_likes/bm25_search.py

  • Hugging Face blog: BM25 for Python: Achieving high performance while simplifying dependencies with BM25S⚡
  • arXiv: Xing Han Lù, BM25S: Orders of magnitude faster lexical search via eager sparse scoring
  • GitHub: BM25S repository
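
As an illustration, a minimal sketch of indexing and querying with the bm25s package might look as follows (documents and query are the same variables as in the exact-search snippet above; the actual search_x_likes/bm25_search.py may differ):

import bm25s

# Tokenize the corpus and build the index; BM25 scores are precomputed eagerly.
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(documents, stopwords="en"))

# Retrieve the top-5 documents and their scores for a query.
results, scores = retriever.retrieve(bm25s.tokenize(query), corpus=documents, k=5)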

Retrieve top-k documents scored with cosine similarity of their embeddings

Given the embedding A of a document and the embedding B of a query, score their similarity as the normalized dot product of the two vectors:

cosine similarity = (A · B) / (‖A‖ ‖B‖)

This approach is implemented in search_x_likes/cosine_search.py
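
As an illustration, a minimal sketch of this approach with sentence-transformers could look as follows (the model name and the documents/query variables are assumptions; the actual script may differ):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# With normalize_embeddings=True the dot product equals the cosine similarity.
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode(query, normalize_embeddings=True)

scores = doc_embeddings @ query_embedding
top_k = np.argsort(-scores)[:5]
retrieved = [documents[i] for i in top_k]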

To evaluate the different retrieval methods, a synthetic dataset is created with an LLM.

The code to generate the synthetic dataset with gpt-4o-mini is in search_x_likes/generate_synthetic_eval_dataset.py. It takes as input the dataset that contains the liked posts on X.
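
A hypothetical sketch of generating one synthetic query per liked post with the OpenAI Python SDK (the prompt, function name, and liked_posts variable are illustrative assumptions, not the repository's actual code):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def synthetic_query(post_text: str) -> str:
    # Ask gpt-4o-mini for a short search query that this post should answer.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Write a short search query a user might type to find the given post."},
            {"role": "user", "content": post_text},
        ],
    )
    return response.choices[0].message.content.strip()

# liked_posts is assumed to be the list of liked post texts.
eval_pairs = [(synthetic_query(post), post) for post in liked_posts]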

The evaluation results are:

Retrieval Results (Colab CPU & GPU T4)

| Model | MRR | Recall@5 | NDCG@5 | Wall Time (CPU) | Wall Time (GPU) |
|---|---|---|---|---|---|
| BM25s | 0.7711 | 0.8367 | 0.3376 | 0.2s | 0.4s |
| sentence-transformers/all-MiniLM-L6-v2 | 0.6517 | 0.9246 | 0.3964 | 20s | 4.09s |
| nomic-ai/modernbert-embed-base | 0.6654 | 0.9472 | 0.4044 | 3m01s | 6.82s |
| intfloat/multilingual-e5-large | 0.7063 | 0.9246 | 0.3823 | 7m57s | 12.5s |
| minishlab/potion-retrieval-32M | 0.6346 | 0.8894 | 0.3813 | 2s | 1.64s |
| minishlab/potion-base-8M | 0.6128 | 0.8794 | 0.3887 | 0.7s | 1.99s |
| tomaarsen/static-retrieval-mrl-en-v1 | 0.5958 | 0.8543 | 0.3771 | 3.23s | 1.87s |
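
For reference, MRR and Recall@k over a set of queries can be computed with a generic routine like the one below (a sketch, not necessarily the evaluation code used here); each query has one relevant document id and a ranked list of retrieved ids:

def mrr(ranked_ids_per_query, relevant_id_per_query):
    # Mean reciprocal rank: average of 1 / (rank of the relevant document).
    total = 0.0
    for ranked_ids, relevant_id in zip(ranked_ids_per_query, relevant_id_per_query):
        if relevant_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(relevant_id) + 1)
    return total / len(relevant_id_per_query)

def recall_at_k(ranked_ids_per_query, relevant_id_per_query, k=5):
    # Fraction of queries whose relevant document appears in the top-k results.
    hits = sum(rel in ranked[:k] for ranked, rel in zip(ranked_ids_per_query, relevant_id_per_query))
    return hits / len(relevant_id_per_query)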

Quality results with rerankers:

| Model | MRR | Recall@5 | NDCG@5 |
|---|---|---|---|
| BM25s - no reranker | 0.7711 | 0.8367 | 0.3376 |
| bi-encoder sentence-transformers/all-MiniLM-L6-v2 | 0.7106 | 0.7889 | 0.3243 |
| bi-encoder all-mpnet-base-v2 | 0.6778 | 0.7789 | 0.3315 |
| bi-encoder minishlab/potion-retrieval-32M | 0.5973 | 0.7638 | 0.3396 |
| bi-encoder nomic-ai/modernbert-embed-base | 0.7210 | 0.8065 | 0.3347 |
| cross encoder cross-encoder/ms-marco-MiniLM-L-6-v2 | 0.7958 | 0.8417 | 0.3347 |
| cross encoder mixedbread-ai/mxbai-rerank-xsmall-v1 | 0.7836 | 0.8417 | 0.3422 |
| cross encoder mixedbread-ai/mxbai-rerank-base-v1 | 0.7708 | 0.8417 | 0.3409 |
| cross encoder mixedbread-ai/mxbai-rerank-large-v1 | 0.7605 | 0.8342 | 0.3362 |
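
As an illustration, reranking first-stage candidates with one of the cross encoders above can be sketched with sentence-transformers (candidates and query are assumed variables; the repository's reranking code may differ):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Score each (query, candidate) pair and sort the candidates by that score.
pair_scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(pair_scores, candidates), key=lambda p: p[0], reverse=True)]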

Contributing

All contributions are welcome, including more documentation, examples, code, and tests. Even questions.

License - MIT

The package is open-sourced under the conditions of the MIT license.


Repository initiated with fpgmaas/cookiecutter-uv.

