Note: Because this repo relied on the now unsupported code-davinci-002
, the search demo will no longer work. It is trivial to replace codex with the chat api.
Semantic search for Lean mathlib using OpenAI's embeddings API and Faiss.
Tested with python 3.10.4. Packages in requirements.txt
.
Save an openai api key to the environment variable OPENAI_API_KEY
. Then cd
into src
and run demo_search.py
. The mathlib embeddings are pre-computed, but do note embedding a query will cost you $0.0004 / 1K tokens
.
This step is only documented for reproducibility purposes. It is not necessary to run the demo.
Pull leanprover-community/doc-gen and run the gen_docs
script. When generating the docs, we used mathlib commit hash 06d0adfa76594f304b4650d098273d4366ede61b
. Move the generated export.json
to src/parse_docgen/docgen_export.json
.
Then, in the src/parse_docgen
directory run python parse.py
. Then, cd
into the src/embed_mathlib
directory and run python embed_mathlib.py
and embeddings_to_numpy.py
.
At a high-level, the script embeds queries using the embeddings api, then uses Faiss to do a fast kNN search against precomputed embeddings of mathlib declarations.
The search indexes into the file src/parse_docgen/docgen_export_with_formal_statement.jsonl
, which is loaded into the docs
variable in src/demo_search.py
. The entries of docs
are in exactly the same format as the entries of the export.json
generated by gen_docs
, so in theory we should be use to reuse the doc-gen
code that maps export.json
entries to html.
Each entry of docs
actually contains one additional field compared to export.json
: the formal_statement
field, which is a deparsed string representation of the theorem statement.