
Chunking for the 384 words limit #82

Closed
rjalexa opened this issue May 8, 2024 · 5 comments

rjalexa commented May 8, 2024

What is the best way to chunk longer texts so that each chunk fits under the 384-word (or 512-subtoken) limit?
My articles are on average around 1,200 tokens / 5,000 characters.
Thank you
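For context, a simple word-count chunker along these lines keeps each piece under the limit. This is an illustrative sketch only, not part of GLiNER; the 350-word budget is an assumption chosen to leave headroom under the 384-word cap, and the function name is made up.

```python
# Illustrative sketch (not part of GLiNER): split a long text on whitespace
# into pieces that stay under the model's word limit.
def chunk_by_words(text: str, max_words: int = 350) -> list[str]:
    # max_words = 350 is an assumed margin below the 384-word cap
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```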

urchade (Owner) commented May 8, 2024

Hi, I think that gliner-spacy (https://github.com/theirstory/gliner-spacy?ref=bramadams.dev) integrates a chunking function.

Cc @wjbmattingly

wjbmattingly (Contributor) commented:

Hi all. Yes, gliner-spacy handles the chunking for you. I kept the chunk size as an argument so that, as the GLiNER model improves (and can handle larger inputs), the package won't need to be updated.
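A sketch of how the gliner-spacy component is typically added, with chunking controlled through the pipe config. The exact config keys, the label set, and the chunk_size value below are assumptions; check the gliner-spacy README for the names used by the version you install.

```python
import spacy

nlp = spacy.blank("en")
# Config keys are assumptions based on the gliner-spacy README; verify them
# against the installed version.
nlp.add_pipe("gliner_spacy", config={
    "gliner_model": "urchade/gliner_multi_pii-v1",
    "labels": ["person", "organization", "email"],
    "chunk_size": 250,  # words per internal chunk, kept well under the limit
})

# A long article in practice; shortened here for illustration.
article_text = "Contact Jane Doe at jane.doe@example.com about the invoice."
doc = nlp(article_text)
for ent in doc.ents:
    print(ent.text, ent.label_)
```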

rjalexa (Author) commented May 9, 2024

Thank you

abedit commented May 10, 2024

On that note, is it possible to use GLiNER spaCy's chunking for finetuning GLiNER, specifically the urchade/gliner_multi_pii-v1 model? I'm also dealing with long documents.

wjbmattingly (Contributor) commented:

I believe there are a few of us working on GLiNER finetuning packages. I have one that's not ready yet, but I believe @urchade has made progress and has a few notebooks in this repository to get you started.

In all these cases, you could use gliner-spacy to help with the annotation process in something like Prodigy, from ExplosionAI. It's primarily what I use for annotating textual data because it works so easily with spaCy. You would then need to modify the output to align with the GLiNER finetuning approach.

This is actually exactly what we did for the Placing the Holocaust project. You can see our GLiNER finetuned model here: https://huggingface.co/placingholocaust/gliner_small-v2.1-holocaust
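For that last step (aligning annotated spaCy docs with the finetuning data), a conversion along these lines is one possibility. The "tokenized_text"/"ner" schema and the inclusive end-index convention below are assumptions about the format used in the finetuning notebooks, so verify against them before training.

```python
# Illustrative sketch: turn a spaCy Doc with entity annotations into a
# token-level training example. The schema and index convention are
# assumptions; check the finetuning notebooks in this repository.
def doc_to_gliner_example(doc):
    return {
        "tokenized_text": [token.text for token in doc],
        "ner": [
            # token-level span with an inclusive end index (assumed convention)
            [ent.start, ent.end - 1, ent.label_.lower()]
            for ent in doc.ents
        ],
    }
```

Mapping over a corpus of annotated docs with this function would then give a list of examples to feed to the finetuning scripts.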

urchade closed this as completed Jul 3, 2024