mathlib-semantic-search

Note: Because this repo relied on the now unsupported code-davinci-002, the search demo will no longer work. It is trivial to replace codex with the chat api.

mathlib-semantic-search

Semantic search for Lean mathlib using OpenAI's embeddings API and Faiss.

Tested with python 3.10.4. Packages in requirements.txt.

Running demo

Save an openai api key to the environment variable OPENAI_API_KEY. Then cd into src and run demo_search.py. The mathlib embeddings are pre-computed, but do note embedding a query will cost you $0.0004 / 1K tokens.

Recomputing mathlib embeddings

This step is only documented for reproducibility purposes. It is not necessary to run the demo.

Pull leanprover-community/doc-gen and run the gen_docs script. When generating the docs, we used mathlib commit hash 06d0adfa76594f304b4650d098273d4366ede61b. Move the generated export.json to src/parse_docgen/docgen_export.json.

Then, in the src/parse_docgen directory run python parse.py. Then, cd into the src/embed_mathlib directory and run python embed_mathlib.py and embeddings_to_numpy.py.

How `demo_search.py` works

At a high-level, the script embeds queries using the embeddings api, then uses Faiss to do a fast kNN search against precomputed embeddings of mathlib declarations.

The search indexes into the file src/parse_docgen/docgen_export_with_formal_statement.jsonl, which is loaded into the docs variable in src/demo_search.py. The entries of docs are in exactly the same format as the entries of the export.json generated by gen_docs, so in theory we should be use to reuse the doc-gen code that maps export.json entries to html.

Each entry of docs actually contains one additional field compared to export.json: the formal_statement field, which is a deparsed string representation of the theorem statement.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
.vscode		.vscode
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
deploy.md		deploy.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.vscode

.vscode

src

src

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

deploy.md

deploy.md

pyproject.toml

pyproject.toml

requirements.txt

requirements.txt

Repository files navigation

mathlib-semantic-search

Running demo

Recomputing mathlib embeddings

How `demo_search.py` works

About

Releases

Packages

Contributors 2

Languages

License

zhangir-azerbayev/mathlib-semantic-search

Folders and files

Latest commit

History

Repository files navigation

mathlib-semantic-search

Running demo

Recomputing mathlib embeddings

How demo_search.py works

About

Resources

License

Stars

Watchers

Forks

Languages

How `demo_search.py` works