

W?

An experimental natural language based querying system for Wikipedia.

How does it work?

The basic approach is as follows:

  • Prepare and store vector embeddings for the sections of Wikipedia articles
  • Calculate the vector embedding for the given query
  • Do a vector search in the vector store with Hierarchical Navigable Small World (HNSW) to find similar vectors (see the sketch after this list)
  • Retrieve the corresponding article and the narrowed-down section
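
To illustrate the vector search step, here is a minimal sketch using the hnswlib library and placeholder vectors. The project itself keeps the vectors in Redis (see the Embedding storage section below), so this only shows the idea, not the actual implementation.

import hnswlib
import numpy as np

# Placeholder section embeddings; the real ones come from the e5 model (384 dimensions for e5-small-v2).
section_embeddings = np.random.rand(1000, 384).astype("float32")

# Build an HNSW index over the section embeddings.
index = hnswlib.Index(space="cosine", dim=384)
index.init_index(max_elements=len(section_embeddings), ef_construction=200, M=16)
index.add_items(section_embeddings, np.arange(len(section_embeddings)))

# Embed the query with the same model (not shown here) and fetch the most similar sections.
query_embedding = np.random.rand(1, 384).astype("float32")
labels, distances = index.knn_query(query_embedding, k=5)
# labels holds the ids of the matching sections; map them back to article and section metadata.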

Note that we are not using an LLM here, as the usual RAG approach does. This experiment is to see whether "Retrieval is all you need".

For searching Wikipedia with natural language questions, getting the context paragraph and a link to read more in the original Wikipedia article is more useful than an articulated single-sentence answer. My reasoning is:

  1. LLMs are not 100% free from hallucination. Removing hallucination from answer articulation is compute-intensive, costly, and not guaranteed.
  2. Wikipedia is known for its exploratory nature. Presenting the content as it is, with the freedom to learn and explore, suits Wikipedia better than dead-end single-sentence answers.

Embedding

The primary way to ingest articles into the embedding store is by feeding a list of articles to the wq.injest module. Each article is split into sections, and sections are split into sentences. An embedding is calculated for each sentence, and the sentence embeddings are stored along with the article metadata.
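
Conceptually, each stored item pairs one sentence embedding with enough metadata to get back to the article and section. The field names below are illustrative assumptions, not the actual schema:

import numpy as np

record = {
    "title": "Oxygen",                            # article title
    "section": "Lead",                            # section the sentence came from
    "sentence": "Oxygen is a chemical element with symbol O.",
    "embedding": np.zeros(384, dtype="float32"),  # placeholder for the e5-small-v2 vector
}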

The embedding model I used is E5. For English experiments I used e5-small-v2 with ONNX optimization. The multilingual-e5 model supports about 100 languages, and I did some experiments with it too; the results vary by language. The base and large variants of e5 require more compute time even with ONNX optimization.
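
For reference, this is roughly how e5 embeddings can be computed with the sentence-transformers loader; the project itself uses an ONNX-optimized pipeline, so treat this only as a sketch. Note that e5 models expect the "query: " and "passage: " prefixes.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")

# e5 models are trained with "passage: " for documents and "query: " for search queries.
passage_embeddings = model.encode(
    ["passage: Oxygen is a chemical element with symbol O and atomic number 8."],
    normalize_embeddings=True,
)
query_embedding = model.encode(["query: what is oxygen"], normalize_embeddings=True)
print(passage_embeddings.shape)  # (1, 384) for e5-small-v2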

I also tried CTranslate2 inference optimization for the e5 models, but there are still some bugs in my implementation.

Keeping the articles in sync with Wikipedia: when somebody asks a question and we don't get an answer from the retrieval step, I search the actual Wikipedia with the same query, get the relevant titles, embed those articles, and try retrieval again. This makes the system ready to accept any question. Another approach I am considering is to refresh an article from the wiki whenever it is hit by a search and a newer revision exists.
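
The fallback search can use the public MediaWiki search API. The endpoint and parameters below are the standard ones; how the project actually wires the result back into ingestion and retry is simplified here.

import requests

def wikipedia_search_titles(query, limit=5, lang="en"):
    """Ask the MediaWiki search API for article titles matching the query."""
    response = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": query,
            "srlimit": limit,
            "format": "json",
        },
        timeout=10,
    )
    response.raise_for_status()
    return [hit["title"] for hit in response.json()["query"]["search"]]

# When retrieval misses, fetch candidate titles, ingest and embed them, then retry retrieval.
print(wikipedia_search_titles("lightest noble gas"))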

I don't see a reason to have a prebuilt vector store for every article in Wikipedia, updated every time an edit happens. If embedding an article is fast enough, we can do the embedding insert/update and the retrieval on demand. However, in a real production situation this theory might turn out to be wrong.

Embedding storage

Currently Redis is used. The latest version of Redis has native vector storage and KNN search. In my testing it has been reliable and performant.
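
For illustration, here is a minimal redis-py sketch of creating a vector index and running a KNN query. The index name, key prefix, field names, and vector dimension are assumptions, not the project's actual schema.

import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis()

# Create an index with an HNSW vector field (names and dimension are illustrative).
r.ft("wq_idx").create_index(
    fields=[
        TextField("title"),
        TextField("section"),
        VectorField(
            "embedding",
            "HNSW",
            {"TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE"},
        ),
    ],
    definition=IndexDefinition(prefix=["wq:"], index_type=IndexType.HASH),
)

# Store one sentence embedding as a hash under the indexed prefix.
vec = np.zeros(384, dtype=np.float32)  # placeholder; real vectors come from the e5 model
r.hset("wq:Oxygen:0", mapping={"title": "Oxygen", "section": "Lead", "embedding": vec.tobytes()})

# KNN search: the 5 stored vectors nearest to the query vector.
knn = (
    Query("*=>[KNN 5 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("title", "section", "score")
    .dialect(2)
)
results = r.ft("wq_idx").search(knn, query_params={"vec": vec.tobytes()})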

Known issues

Splitting content into sentences and embedding just single sentences has some issues. The embedding models can accept many more tokens than a single sentence contains, and sentences are often semantically incomplete, especially when they contain co-references. This affects semantic retrieval.

My current thinking is that we can create an embedding for multiple sentences at a time, for example 2 or 3 sentences together. This might help with the co-reference issue. However, if we use too many sentences, retrieval quality suffers, and the combined text can exceed the tokenizer's input size, causing truncation.
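
One possible sketch of that idea is a small sliding window over the sentence list (the window size and step are just example values):

def sentence_windows(sentences, size=3, step=1):
    """Group consecutive sentences so each embedded chunk carries a little more context."""
    for start in range(0, max(len(sentences) - size + 1, 1), step):
        yield " ".join(sentences[start:start + size])

sentences = [
    "Oxygen is a chemical element with symbol O.",
    "It is a member of the chalcogen group.",
    "It is a highly reactive nonmetal.",
]
for chunk in sentence_windows(sentences, size=2):
    print(chunk)  # each chunk would be embedded instead of a single sentence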

There are also general clean-up issues, such as skipping the references section and external links when embedding.

How do we feed table and list content to the embedder so that each entry is semantically complete? One approach is to prefix the article title and the section title to each entry in the list. If this works, why not also feed Wikidata properties and values to the embedder for each article?
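
A sketch of the prefixing idea (the exact separator and format are assumptions):

def contextualize(article_title, section_title, entry):
    """Prefix article and section titles so a bare list or table entry is self-contained."""
    return f"passage: {article_title} / {section_title}: {entry}"

print(contextualize("Oxygen", "Allotropes", "Ozone (O3)"))
# -> passage: Oxygen / Allotropes: Ozone (O3)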

Telegram Bot

To make the application a real chatbot, I added a Telegram bot integration. Please contact me to get the link to the bot.

How to run?

Use .env.example to set up the environment variables in a .env file.

Install dependencies in a virtual env:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Then run:

uvicorn app:app

Open the link provided by uvicorn.

Add Wikipedia articles to Vector store

Pass the titles to the wq.injest module. You may also prepare a text file with titles and pipe it to the script.

echo "Oxygen" | python -m wq.injest

Using docker

Clone the repository, build the Docker image, and run it.

docker build -t wq .
docker run -dp 80:8989 wq:latest

Open http://localhost/ in a browser (the docker run command above maps host port 80 to the container's port 8989).