# Text tokenization and vectorization
We use spaCy to tokenize the documents in batches through its `pipe` method, and we save the tokens to disk to avoid recomputing them. We then compute document vectors with a TF-IDF vectorizer and, as before, save the results to disk.

In [1]:
import sys
sys.path.append("..")

import pandas as pd
import numpy as np

from src.dataset import Dataset
from src.tokenizers import BatchTokenizer
from src.vectorizers import TokenTfidfVectorizer

## Loading the dataset
The dataset is loaded as a list of document texts; the other information about each document is not needed at this stage.

In [2]:
# number of documents to load, -1 means all of them
n_documents = 300

processed_dataset = "processed"  # ~10 s to load, contains approx. 19k documents
# processed_dataset = "test_processed"  # ~1 s to load, contains approx. 1k documents

dataset = Dataset(dataset_path="../data/raw/data.jsonl", save_path=f"../data/processed/{processed_dataset}.json")
texts_list = dataset.load_text_list(size=n_documents)
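For readers without access to `src.dataset`, the loading step can be approximated as below. This is only a sketch: the standalone `load_text_list` function, the `text_field` parameter and the `"text"` field name are assumptions for illustration, not the actual `Dataset` implementation. It mirrors the convention used above, where a size of -1 means "load everything".

```python
import json
from itertools import islice
from pathlib import Path


def load_text_list(path, size=-1, text_field="text"):
    """Read up to `size` document texts from a JSON Lines file; -1 reads all.

    Hypothetical helper: the field name "text" is an assumption.
    """
    limit = None if size < 0 else size
    texts = []
    with Path(path).open(encoding="utf-8") as handle:
        # Skip blank lines lazily, then stop after `limit` records.
        records = (line for line in handle if line.strip())
        for line in islice(records, limit):
            texts.append(json.loads(line)[text_field])
    return texts
```

Reading lazily with `islice` means that asking for 300 documents does not parse the whole file.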

In [3]:
len(texts_list)

300

## Tokenize in batch
The tokenizer processes the documents in parallel and returns one token list per document. We save these lists to disk so that they can be loaded later without tokenizing again.

In [4]:
bt = BatchTokenizer()
tokens = bt.tokenize(texts_list)
# bt.save_tokens(tokens)
# tokens = BatchTokenizer.load_tokens()
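The internals of `BatchTokenizer` are not shown in this notebook. As a rough illustration of batch tokenization with spaCy's `pipe` method, a minimal sketch could look like the following. The function name `tokenize_in_batches`, the blank English pipeline and the filtering rules (lowercasing, dropping punctuation and whitespace) are assumptions, not the project's actual choices.

```python
import spacy


def tokenize_in_batches(texts, batch_size=64, n_process=1):
    """Return one list of lowercased, non-punctuation tokens per document."""
    # A blank English pipeline runs only the tokenizer, so no model download is needed.
    nlp = spacy.blank("en")
    token_lists = []
    # nlp.pipe yields Doc objects in input order and processes the texts in batches;
    # n_process > 1 spreads the batches over several worker processes.
    for doc in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
        token_lists.append(
            [tok.text.lower() for tok in doc if not (tok.is_punct or tok.is_space)]
        )
    return token_lists


example = tokenize_in_batches(["Hello world!", "Batch tokenization with spaCy."])
```

Passing the whole list to `nlp.pipe` is generally faster than calling `nlp(text)` once per document, because spaCy can buffer and batch the work internally.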

## TF-IDF vectorization

We now apply a TF-IDF vectorizer to the token lists obtained above.

In [None]:
dv = TokenTfidfVectorizer(tokens)

vectors = dv.vectors()
dv.save_vectors_vectorizer(vectors)

print(f"Vocabulary length: {len(dv.vectorizer.vocabulary_)}")
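To make the content of the vectors concrete, here is a small pure-Python sketch of TF-IDF weighting over pre-tokenized documents. It assumes the weighting scheme that scikit-learn's `TfidfVectorizer` uses by default (raw term counts, smoothed IDF, L2 normalisation). Whether `TokenTfidfVectorizer` uses those defaults is an assumption, and `tfidf_vectors` is a hypothetical helper, not part of `src.vectorizers`.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return (vocabulary, rows); each row maps term -> L2-normalised TF-IDF weight."""
    n_docs = len(token_lists)
    # Document frequency: number of documents containing each term at least once.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF, as if one extra document contained every term.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in token_lists:
        # Term frequency is the raw count of the term in the document.
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Empty documents get an empty (all-zero) row instead of dividing by zero.
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return sorted(idf), rows
```

Terms that occur in few documents receive a larger IDF, so within a document they outweigh terms that appear everywhere. The length of the returned vocabulary corresponds to the vocabulary length printed above.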


In [None]:
loaded_vectors, loaded_vec = TokenTfidfVectorizer.load_vectors_vectorizer()
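# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
# (assumes loaded_vec is a scikit-learn vectorizer, version 1.0 or later)
X = pd.DataFrame(loaded_vectors.toarray(), columns=loaded_vec.get_feature_names_out())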
X.head()