Add Data Transformers to Data Repository #107

Open
diptanu opened this issue Sep 1, 2023 · 1 comment
Comments

@diptanu
Collaborator

diptanu commented Sep 1, 2023

Content is extracted when a developer binds an extractor to a data repository. As new content lands, the extractors are applied to it and the derived information is written to indexes.

Extractors are responsible for chunking content, for example splitting the text of a document before it is embedded. Certain extractors, like the NER and Embedding extractors, could share the same chunked content, since both have to chunk because the context length of their underlying models is limited. Currently these extractors duplicate the text-splitting work.

The solution would be to introduce a high-level transformer concept that applies algorithms to content and stores the intermediate representation, such as splitting text into smaller chunks, extracting log mel features from audio files (as most speech models use log mel features), applying filters to images, etc. The intermediate/processed content will live in buffers, a logical storage abstraction that triggers the extractors when data lands in them.

So it will look something like this:
Content -> Transformers -> Buffer -> Extractors -> Index (continuously)
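
To make the flow above concrete, here is a minimal in-memory sketch. It is only an illustration under assumptions: the names `Buffer`, `bind`, `write`, `ingest`, `split_text`, `fake_embedding`, and `fake_ner` are hypothetical and not part of any existing API, and the extractors are toy stand-ins for real models. The point it shows is that the text is split once by a transformer and every bound extractor is triggered on the same chunks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Chunk = str


def split_text(text: str, max_words: int = 4) -> List[Chunk]:
    """Hypothetical transformer: split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


@dataclass
class Buffer:
    """Hypothetical buffer: stores chunks and triggers bound extractors on each write."""
    extractors: Dict[str, Callable[[Chunk], object]] = field(default_factory=dict)
    index: Dict[str, list] = field(default_factory=dict)
    items: List[Chunk] = field(default_factory=list)

    def bind(self, name: str, extractor: Callable[[Chunk], object]) -> None:
        self.extractors[name] = extractor
        self.index.setdefault(name, [])

    def write(self, chunk: Chunk) -> None:
        # Store the intermediate representation, then run every bound extractor on it.
        self.items.append(chunk)
        for name, extractor in self.extractors.items():
            self.index[name].append(extractor(chunk))


def ingest(content: str, transformer: Callable[[str], List[Chunk]], buffer: Buffer) -> None:
    """Content -> Transformer -> Buffer; the buffer fans out to Extractors -> Index."""
    for chunk in transformer(content):
        buffer.write(chunk)


def fake_embedding(chunk: Chunk) -> List[float]:
    # Toy stand-in for an embedding model: a one-dimensional "vector".
    return [float(len(chunk))]


def fake_ner(chunk: Chunk) -> List[str]:
    # Toy stand-in for NER: treat capitalized words as entities.
    return [w for w in chunk.split() if w[:1].isupper()]


buffer = Buffer()
buffer.bind("embedding", fake_embedding)
buffer.bind("ner", fake_ner)
ingest("Alice met Bob in Paris last summer", split_text, buffer)
```

Both extractors consume the chunks produced by the single `split_text` call, so the splitting work is no longer duplicated. A real buffer would presumably be persistent rather than a Python list.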

@yenicelik
Contributor

Could buffers be a persistent queue like Kafka or Redis, i.e. with items serialized through protobuf? Or were you thinking of something more structured?
