In [None]:
from pathlib import Path

from datasets import load_from_disk

from vectormesh import Vectorizer

assets = Path("../assets")
tag = next(assets.glob("aktes_*/"))
trainpath = tag / "train"
trainpath, tag.name

I load the aktes dataset from the assets folder and set up a vectorizer with a Hugging Face model.
You can find more information about the model [here](https://huggingface.co/Gerwin/legal-bert-dutch-english).



In [None]:
train = load_from_disk(trainpath)
model_name = "Gerwin/legal-bert-dutch-english"
vectorizer = Vectorizer(model_name=model_name, col_name="legal_dutch")

We load the data from disk, give the vectorizer the name of the model, and specify a column name.
The column name is useful later on, because we can add multiple types of vectors to a cache.

Let's use a sample, because vectorization can take a while.

In [None]:
sample = train.select(range(64))

The VectorCache will:
- loop over the full dataset (in this notebook, the sample; for the full dataset, use hardware acceleration)
- add metadata to keep track of which vectors belong to which dataset and vectorizer
- store the vectorcache on disk
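
The three steps above can be sketched with the standard library only. This is a toy illustration, not the vectormesh implementation: `create_cache`, `toy_vectorize`, the `rows.json` file and the folder naming are all hypothetical; only the idea of a `metadata.json` next to the data comes from this notebook.

```python
import json
import tempfile
from pathlib import Path


def create_cache(cache_dir, dataset_tag, col_name, texts, vectorize):
    """Vectorize every text, record metadata, and write both to disk."""
    # Assumption: one folder per dataset/vectorizer combination.
    folder = Path(cache_dir) / f"{dataset_tag}_{col_name}"
    folder.mkdir(parents=True, exist_ok=True)

    # 1. loop over the dataset and add a column with the vectors
    rows = [{"text": text, col_name: vectorize(text)} for text in texts]

    # 2. keep track of which vectors belong to which dataset and vectorizer
    metadata = {
        "dataset_tag": dataset_tag,
        "vectorizers": [col_name],
        "n_rows": len(rows),
    }

    # 3. store everything on disk
    (folder / "rows.json").write_text(json.dumps(rows))
    (folder / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return folder


def toy_vectorize(text):
    # Stand-in for a real model: two hand-made "features" per text.
    return [float(len(text)), float(text.count(" "))]


with tempfile.TemporaryDirectory() as tmp:
    folder = create_cache(
        tmp, "aktes_demo", "legal_dutch",
        ["eerste akte", "tweede akte tekst"], toy_vectorize,
    )
    metadata = json.loads((folder / "metadata.json").read_text())
    rows = json.loads((folder / "rows.json").read_text())

print(metadata)
```

Keeping the metadata in a separate small file means a later run can inspect what is already cached without loading the vectors themselves.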

In [None]:
from vectormesh import VectorCache

vectorcache = VectorCache.create(
    cache_dir=Path("tmp/artefacts"),
    vectorizer=vectorizer,
    dataset=sample,
    dataset_tag=tag.name,
)

This takes a while! That is why I have already created a cache with this model.

Check the vectorcache for some useful metadata:

In [None]:
vectorcache.metadata

Every document is split into chunks of at most 512 tokens, with an overlap of about 50 tokens (512 // 10 = 51).
This means that every document is turned into a 2D tensor with shape (chunks, dim), where dim is the embedding dimension (e.g. 768 for `bert-base-uncased`).
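
To make the chunking arithmetic concrete, here is a minimal sketch of a sliding window over token ids. It assumes a plain list of integers instead of a real tokenizer, and `chunk_tokens` is a hypothetical helper, not the function vectormesh uses internally; only the 512-token limit and the `512 // 10` overlap come from the text above.

```python
def chunk_tokens(token_ids, max_length=512, overlap=None):
    """Split token ids into windows of at most max_length with a fixed overlap."""
    if overlap is None:
        overlap = max_length // 10  # 51 for max_length=512
    if not token_ids:
        return []
    stride = max_length - overlap  # how far the window moves each step
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reached the end of the document
        start += stride
    return chunks


tokens = list(range(1200))  # pretend document of 1200 tokens
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

Each chunk would then be embedded separately, so a document with three chunks and a 768-dimensional model would end up with shape (3, 768).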

If we call `__getitem__` on the vectorcache by doing `vectorcache[0]`, the call is passed on to the underlying dataset and we get the first document in the dataset.

- "text" is the original text
- "target" is the `rechtsfeit` we need to predict
- "labels" is the rechtsfeit, turned into integer labels `0, 1, 2, ...`
- "legal_dutch" was the `col_name` we specified when creating the vectorizer, and contains the vectors created by that vectorizer.
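
The delegation of indexing can be illustrated with a small stand-in class. `TinyCache` is hypothetical and only mirrors the behaviour described above (indexing is forwarded to the wrapped dataset); the real `VectorCache` may do more.

```python
class TinyCache:
    """Toy wrapper that forwards indexing to an underlying dataset."""

    def __init__(self, dataset, metadata):
        self.dataset = dataset
        self.metadata = metadata

    def __getitem__(self, idx):
        # cache[idx] is simply dataset[idx]
        return self.dataset[idx]

    def __len__(self):
        return len(self.dataset)


# A list of dicts stands in for the Hugging Face dataset.
dataset = [
    {"text": "eerste akte", "target": "hypotheek", "labels": 0},
    {"text": "tweede akte", "target": "levering", "labels": 1},
]
cache = TinyCache(dataset, metadata={"dataset_tag": "aktes_demo"})
first = cache[0]
print(first)
```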

In [None]:
vectorcache[0]

We can now **extend** the existing dataset with more vectors! 

I created a RegexVectorizer that checks the text for references to laws.
You can run it on the full train set; it is pretty fast.

In [None]:
from vectormesh import RegexVectorizer
from vectormesh.data.vectorizers import (
    build_legal_reference_pattern,
    harmonize_legal_reference,
)

# Initialize & fit with training_texts
regexvectorizer = RegexVectorizer(
    pattern_builder=build_legal_reference_pattern,
    harmonizer=harmonize_legal_reference,
    min_doc_frequency=15,
    max_features=200,
    device="cpu",
    training_texts=train["text"],  # fit it on the full train set
)
regexvectorizer.get_metadata

The vectorizer was fitted on the full train set and found 123 features. That number is determined by:

1. min_doc_frequency: the minimum number of documents a feature must appear in
2. max_features: the maximum number of features to extract

To get an idea of what it found, we can print some stats:
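
How the two settings interact can be sketched as follows. The regular expression is a deliberately simple, made-up pattern (not the one returned by `build_legal_reference_pattern`), and `select_features` is a hypothetical helper; the real RegexVectorizer may rank or harmonize features differently.

```python
import re
from collections import Counter


def select_features(texts, pattern, min_doc_frequency, max_features):
    """Keep references that occur in enough documents, most frequent first."""
    doc_freq = Counter()
    for text in texts:
        # set(): count each reference at most once per document
        doc_freq.update(set(pattern.findall(text)))
    frequent = [
        (feature, count)
        for feature, count in doc_freq.items()
        if count >= min_doc_frequency
    ]
    # Sort by descending document frequency, then alphabetically for stable output.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [feature for feature, _ in frequent[:max_features]]


pattern = re.compile(r"art\. \d+:\d+ BW")  # assumption: toy pattern
texts = [
    "zie art. 3:89 BW en art. 7:2 BW",
    "art. 3:89 BW is van toepassing",
    "geen verwijzing",
    "art. 3:89 BW en art. 7:2 BW en art. 5:1 BW",
]
features = select_features(texts, pattern, min_doc_frequency=2, max_features=200)
top_one = select_features(texts, pattern, min_doc_frequency=2, max_features=1)
print(features, top_one)
```

Raising `min_doc_frequency` removes rare references first; `max_features` then caps what is left.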

In [None]:
regexvectorizer.print_stats()

But this was just the "training" of the regexvectorizer. We didn't actually store the resulting vectors in the vectorcache yet. We can do this now.

First, let's pick up the existing vectorcache from disk, so that we can extend it with the new vectors.

In [None]:
new_tag = next(
    Path("tmp/artefacts").glob("*legal_dutch*/")
)  # next picks the first one it finds
new_tag.name

This is the foldername of the dataset in `tmp/artefacts` we want to extend.

For the new vectors, the cache will use the `.col_name` from `regexvectorizer`, which you can change.

In [None]:
regexvectorizer.col_name

We can now:
- use the existing vectorcache dataset with the Hugging Face vectors
- extend it with the regex vectors, stored under the `col_name` of the regexvectorizer (`regex_features` by default; you can change it, for example when you modify the regexes)
- save the new cache to `tmp/artefacts`
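
Conceptually, extending a cache means adding one more column per row and registering the new vectorizer in the metadata. The sketch below shows that idea on plain dictionaries; `extend_cache` and `toy_regex_vectorize` are hypothetical names, and the metadata layout is an assumption rather than the format vectormesh writes.

```python
def extend_cache(rows, metadata, col_name, vectorize):
    """Return new rows and metadata with one extra vector column."""
    if col_name in metadata["vectorizers"]:
        raise ValueError(f"column {col_name!r} is already in the cache")
    # Copy each row and add the new column; existing columns stay untouched.
    new_rows = [{**row, col_name: vectorize(row["text"])} for row in rows]
    new_metadata = {
        **metadata,
        "vectorizers": metadata["vectorizers"] + [col_name],
    }
    return new_rows, new_metadata


def toy_regex_vectorize(text):
    # Stand-in for the regex vectorizer: a single binary feature.
    return [1 if "BW" in text else 0]


rows = [
    {"text": "art. 3:89 BW", "legal_dutch": [[0.1, 0.2]]},
    {"text": "geen verwijzing", "legal_dutch": [[0.3, 0.4]]},
]
metadata = {"dataset_tag": "aktes_demo", "vectorizers": ["legal_dutch"]}

new_rows, new_metadata = extend_cache(
    rows, metadata, "regex_features", toy_regex_vectorize
)
print(new_metadata["vectorizers"])
```

Refusing to overwrite an existing column is one reason to pick a different `col_name` when you change the regexes.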

In [None]:
updated_cache = VectorCache.create(
    cache_dir=Path("tmp/artefacts"),
    vectorizer=regexvectorizer,  # use our new regex vectorizer
    dataset=vectorcache.dataset,  # use the existing dataset
    dataset_tag=new_tag.name,  # this will check for existing metadata.json in the old folder
)

The data is now extended with the regex vectors!
Let's check the metadata:

In [None]:
updated_cache.metadata

Let's have a look at an actual observation:

In [None]:
updated_cache[0]

You see:
- The original text
- The original target to predict (`"rechtsfeit"`), e.g. `tensor([579])`
- The text, embedded by the legal-dutch model as a 2D tensor
- The text, encoded with the regexvectorizer as a binary 1D tensor
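
What "binary 1D" means can be shown with a small example: one position per known legal reference, set to 1 when that reference occurs in the text. The vocabulary and pattern below are made up for illustration, and `encode` is a hypothetical helper, not the RegexVectorizer API.

```python
import re

pattern = re.compile(r"art\. \d+:\d+ BW")  # assumption: toy pattern
vocabulary = ["art. 3:89 BW", "art. 7:2 BW", "art. 5:1 BW"]


def encode(text, vocabulary, pattern):
    """Return a 0/1 list with one entry per reference in the vocabulary."""
    found = set(pattern.findall(text))
    return [1 if feature in found else 0 for feature in vocabulary]


vector = encode("Op grond van art. 7:2 BW wordt geleverd.", vocabulary, pattern)
print(vector)
```

The length of the vector equals the number of features the vectorizer kept, regardless of how long the document is.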

Check the shape of the vectors:

In [None]:
updated_cache[0]["legal_dutch"].shape, updated_cache[0]["regex_features"].shape

Let's clean up the `tmp/artefacts` folder, because we only made it for a small sample.

In [None]:
import shutil

shutil.rmtree("tmp/", ignore_errors=True)
shutil.rmtree("logs/", ignore_errors=True)
