
Implement vector search #99

Closed
@m-i-l

Description

This is spun off from #96, to implement a simple vector search using Hugging Face's sentence-transformers/all-MiniLM-L6-v2 model.

This would be accessed via a new link below the main search box, to the left of "Browse Sites", called e.g. "Chat". This would take you to a page with a larger box where you enter your search term and get back the closest matching chunks of text along with their URLs.

I've got a stand-alone (i.e. not integrated into the indexing process) vector search proof of concept running on local dev, and results are good, suggesting it might be viable from a performance, CPU and possibly even disk space perspective.
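As a rough illustration of what "closest matching chunks" could mean, here is a minimal sketch that ranks stored chunk vectors by cosine similarity to a query vector. It is not the proof of concept itself: the function names, the dict layout and the tiny 3-dimensional vectors are all made up for illustration (all-MiniLM-L6-v2 produces 384-dimensional embeddings), and a real deployment would presumably let Solr do the nearest-neighbour search rather than scanning in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_chunks(query_vector, indexed_chunks, top_k=3):
    """Return the top_k chunks most similar to the query, best match first."""
    scored = [
        (cosine_similarity(query_vector, chunk["vector"]), chunk)
        for chunk in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"url": chunk["url"], "text": chunk["text"], "score": score}
        for score, chunk in scored[:top_k]
    ]


# Toy index: hypothetical URLs and hand-written 3-d vectors standing in
# for real 384-d embeddings of each chunk.
INDEXED_CHUNKS = [
    {"url": "https://example.com/a", "text": "chunk about topic A", "vector": [0.9, 0.1, 0.0]},
    {"url": "https://example.com/b", "text": "chunk about topic B", "vector": [0.0, 1.0, 0.0]},
    {"url": "https://example.com/c", "text": "chunk about A and B", "vector": [0.7, 0.7, 0.0]},
]

if __name__ == "__main__":
    # In practice the query vector would come from embedding the search term.
    for hit in closest_chunks([1.0, 0.0, 0.0], INDEXED_CHUNKS, top_k=2):
        print(f'{hit["score"]:.3f}  {hit["url"]}  {hit["text"]}')
```

The brute-force scan is only there to make the ranking explicit; it would not scale to a full index.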

Integrating the proof of concept into the indexing process will be fairly involved though:

  • It will need to be implemented along with Detect if page content has changed #94 for efficiency, i.e. so embeddings are not repeatedly generated for unchanged content.
  • Because all-MiniLM-L6-v2 truncates input longer than 256 word pieces, each page will have to be "chunked", with each chunk stored as a separate vector. I've found many giant (often auto-generated) pages that would produce tens of thousands or even hundreds of thousands of chunks, and it wouldn't be viable (or even particularly useful) to vectorise all of them, so I'll need a sensible limit, e.g. a maximum of 12 chunks/vectors per page.
  • Solr allows only one vector to be stored per page, because DenseVectorField is not currently multivalued (there is a proposal to change this, but nothing is confirmed). The best workaround I can think of is to add the vector for each chunk as a child document, although this will require reviewing and potentially updating all the existing search queries to ensure they don't return child documents by mistake.
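The three points above could fit together roughly as follows. This is a sketch under stated assumptions, not the planned implementation: every function and field name (`content_hash`, `content_chunks`, `content_chunk_vector`, and so on) is hypothetical, the `embed` callable stands in for the real model, and chunks are split on whitespace with a conservative word budget because a single word can map to several word pieces. The nested dict only illustrates a parent/child shape and is not claimed to match the final Solr schema.

```python
import hashlib

MAX_WORDS_PER_CHUNK = 150  # assumed conservative proxy for the 256 word-piece limit
MAX_CHUNKS_PER_PAGE = 12   # the example cap suggested above


def content_hash(text):
    """Fingerprint of the page text, used to detect unchanged content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_text(text, max_words=MAX_WORDS_PER_CHUNK, max_chunks=MAX_CHUNKS_PER_PAGE):
    """Split text into consecutive word-based chunks, stopping at max_chunks."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_words):
        if len(chunks) >= max_chunks:
            break  # giant pages: ignore everything past the cap
        chunks.append(" ".join(words[start:start + max_words]))
    return chunks


def build_page_document(url, text, embed, previous_hash=None):
    """Build a parent document with one child per chunk.

    Returns None when the content hash matches previous_hash, so that
    embeddings are not regenerated for unchanged pages.
    """
    new_hash = content_hash(text)
    if new_hash == previous_hash:
        return None
    children = [
        {
            "id": f"{url}!chunk{i}",
            "url": url,
            "content_chunk_text": chunk,
            "content_chunk_vector": embed(chunk),  # one vector per child document
        }
        for i, chunk in enumerate(chunk_text(text))
    ]
    return {"id": url, "url": url, "content_hash": new_hash, "content_chunks": children}


if __name__ == "__main__":
    def fake_embed(chunk):
        # Placeholder embedding; a real one would call the model.
        return [float(len(chunk))]

    page = " ".join(["word"] * 5000)
    doc = build_page_document("https://example.com/", page, fake_embed)
    print(len(doc["content_chunks"]), "chunks indexed")
    unchanged = build_page_document("https://example.com/", page, fake_embed, doc["content_hash"])
    print("reindex needed:", unchanged is not None)
```

Truncating after the first 12 chunks is the simplest possible policy; sampling chunks from across the page would be an alternative, at the cost of more complexity.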


Labels: enhancement (New feature or request)
