Add Data Transformers to Data Repository #107

diptanu · 2023-09-01T07:28:08Z

Content is extracted when a developer binds an extractor to a data repository. As new content lands, the extractors are applied to it and the derived information is written to indexes.

Extractors are responsible for chunking content, for example splitting the text of a document before it is embedded. Certain extractors, such as the NER and Embedding extractors, could share the same chunked content, since the context length of their underlying models is limited. Currently these extractors duplicate the text-splitting work.

The solution would be to introduce a high-level transformer concept that applies algorithms to content and stores the intermediate representation, for example splitting text into smaller chunks, extracting log-mel features from audio files (as most speech models use log-mel features), or applying filters to images. The intermediate/processed content will live in buffers, a logical storage abstraction that triggers the extractors when data lands in them.

So it will look something like this:
Content -> Transformers -> Buffer -> Extractors -> Index (continuously)
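
To make the proposed flow concrete, here is a minimal in-memory sketch of it. Nothing in it is an existing API of this project: `split_text`, `Buffer`, `bind`, `put`, and `ingest` are hypothetical names, and the two extractors are trivial stand-ins for real NER and embedding models. The point is only that the text is split once by the transformer and every extractor bound to the buffer receives the same chunks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def split_text(text: str, chunk_size: int = 4) -> List[str]:
    """Hypothetical transformer: split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


@dataclass
class Buffer:
    """Hypothetical buffer: holds transformed content and triggers bound extractors."""
    items: List[str] = field(default_factory=list)
    extractors: Dict[str, Callable[[str], object]] = field(default_factory=dict)
    indexes: Dict[str, list] = field(default_factory=dict)

    def bind(self, name: str, extractor: Callable[[str], object]) -> None:
        # Each bound extractor writes to its own index (here just a list).
        self.extractors[name] = extractor
        self.indexes[name] = []

    def put(self, item: str) -> None:
        # Data landing in the buffer triggers every bound extractor.
        self.items.append(item)
        for name, extractor in self.extractors.items():
            self.indexes[name].append((item, extractor(item)))


def ingest(content: str, transformer: Callable[[str], List[str]], buffer: Buffer) -> None:
    """Content -> Transformer -> Buffer; the buffer then drives Extractors -> Index."""
    for chunk in transformer(content):
        buffer.put(chunk)


# Stand-in extractors (assumptions, not real models).
def fake_ner(chunk: str) -> List[str]:
    return [w for w in chunk.split() if w[0].isupper()]


def fake_embedding(chunk: str) -> int:
    return len(chunk.split())


buffer = Buffer()
buffer.bind("ner", fake_ner)
buffer.bind("embedding", fake_embedding)
ingest("Alice met Bob in Paris last week", split_text, buffer)
```

In this sketch the chunking happens once per piece of content no matter how many extractors are bound. A real buffer would presumably be persistent and asynchronous rather than a Python list.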

yenicelik · 2023-10-03T21:52:49Z

Could buffers be a persistent queue like Kafka or Redis, i.e. serialized through protobuf? Or were you thinking of something more structured?
