# Text tokenization and vectorization
We use spaCy to tokenize documents in batches through its `pipe` method, and we save the tokens to disk to avoid recomputing them. We then compute document vectors with a TF-IDF vectorizer and, as before, save the results to disk.

In [1]:
import sys
sys.path.append("..")

import pandas as pd
import numpy as np

from src.dataset import Dataset
from src.tokenizers import BatchTokenizer

### Loading the dataset
The dataset is loaded as a list of document texts; the rest of the information about each document is not needed at this stage.

In [3]:
# number of documents to load, -1 means all of them
n_documents = -1

name = "processed"  # takes about 10 seconds to load; contains approx. 194k documents

dataset = Dataset(dataset_path="../data/raw/data.jsonl", save_path=f"../data/processed/{name}.json")

In [5]:
texts_list = dataset.load_text_list(size=n_documents)

194366
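
The `Dataset` class itself is not shown in this notebook. As a rough illustration of loading only the texts, here is a minimal sketch that reads a JSON Lines file and keeps a single field per record. The function name `read_texts_from_jsonl` and the field name `"text"` are assumptions made for this example, not the project's actual API.

```python
import json


def read_texts_from_jsonl(jsonl_path, size=-1, text_field="text"):
    """Read up to `size` document texts from a JSON Lines file; -1 reads all.

    Hypothetical helper: the field name "text" is an assumption.
    """
    texts = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if size != -1 and len(texts) >= size:
                break
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Keep only the text; the other fields are dropped.
            texts.append(json.loads(line)[text_field])
    return texts
```

Keeping only the text field keeps memory usage lower than holding every full record in memory.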

## Tokenize in batch
The tokenizer processes the documents in parallel and returns one token list per document.
We save these lists to disk and load them again later.
This cell was run on Google Colab because it is computationally intensive.

In [2]:
bt = BatchTokenizer()

interval = 1000  # number of documents per batch
for i in range(0, len(texts_list), interval):
    # the last batch may contain fewer than `interval` documents
    end = min(i + interval, len(texts_list))
    print(f"\nProcessing documents from {i} to {end}")

    tokens = bt.tokenize(texts_list[i:end])
    # the file name keeps the nominal upper bound i + interval
    BatchTokenizer.save_tokens(
        tokens,
        tokens_save_path=f"../data/processed/tokens/tokens_{i}-{i + interval}.json",
    )
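
The internals of `BatchTokenizer` are not shown here. The following is a minimal sketch of what batch tokenization with spaCy's `pipe` method might look like. It assumes a blank English pipeline (tokenizer only) and a simple filter that drops punctuation and whitespace; the project's tokenizer may load a full model and filter tokens differently. The function name `tokenize_in_batches` is hypothetical.

```python
import spacy


def tokenize_in_batches(texts, batch_size=1000, n_process=1):
    """Return one list of lowercased tokens per input document."""
    # Assumption: tokenizer-only pipeline, so no model download is needed.
    nlp = spacy.blank("en")
    tokens_per_doc = []
    # nlp.pipe streams the texts through the pipeline in batches;
    # n_process > 1 distributes the batches over worker processes.
    for doc in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
        tokens_per_doc.append(
            [tok.text.lower() for tok in doc if not (tok.is_punct or tok.is_space)]
        )
    return tokens_per_doc


sample = ["Hello world.", "spaCy tokenizes documents in batches."]
print(tokenize_in_batches(sample))
```

Calling `nlp.pipe` on a list of texts is generally faster than calling `nlp` on each text separately, because the documents are processed in batches instead of one by one.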

In [2]:
BatchTokenizer.merge_tokens()  # merge the per-batch token files
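
How `merge_tokens` works is not shown in this notebook. A minimal sketch of such a merge, assuming each batch file follows the `tokens_<start>-<end>.json` naming used above and contains a JSON list of token lists, could look like this. The function name `merge_token_files` and its parameters are hypothetical.

```python
import glob
import json
import os
import re


def merge_token_files(tokens_dir, merged_path):
    """Concatenate tokens_<start>-<end>.json batch files into one JSON file.

    Hypothetical helper; assumes each file holds a JSON list of token lists.
    Returns the number of documents written.
    """
    pattern = re.compile(r"tokens_(\d+)-(\d+)\.json$")
    batch_files = []
    for path in glob.glob(os.path.join(tokens_dir, "tokens_*.json")):
        match = pattern.search(os.path.basename(path))
        if match:
            batch_files.append((int(match.group(1)), path))

    merged = []
    # Sort numerically by start index so documents keep their original order
    # (a plain string sort would place tokens_10000 before tokens_2000).
    for _, path in sorted(batch_files):
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))

    with open(merged_path, "w", encoding="utf-8") as f:
        json.dump(merged, f)
    return len(merged)
```

The numeric sort matters because the merged token lists must stay aligned with the order of the documents in the dataset.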

### Saving a new dataset with tokenized text

In [4]:
# dataset.save_token_dataset()
dataset.save_token_dataset(tokens_path="../data/processed/filtered_tokens.json")
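
As a reference for the vectorization step described in the introduction, here is a minimal sketch of computing TF-IDF document vectors from pre-tokenized documents and saving them to disk. It assumes scikit-learn's `TfidfVectorizer` and SciPy's sparse `.npz` format; the toy token lists and the output path are placeholders, not the project's data or file layout.

```python
import os
import tempfile

from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    """Pass pre-tokenized documents through unchanged."""
    return tokens


# Toy token lists standing in for the tokens loaded from disk.
token_lists = [["cat", "sat"], ["dog", "sat"], ["cat", "cat", "dog"]]

# A callable analyzer bypasses scikit-learn's own preprocessing and tokenization,
# so the token lists produced earlier are used as they are.
vectorizer = TfidfVectorizer(analyzer=identity)
doc_vectors = vectorizer.fit_transform(token_lists)  # sparse matrix, one row per document

# Placeholder output location (assumption for this example).
vectors_path = os.path.join(tempfile.gettempdir(), "tfidf_vectors.npz")
sparse.save_npz(vectors_path, doc_vectors)

# Reload later without recomputing.
reloaded = sparse.load_npz(vectors_path)
print(doc_vectors.shape)
```

Storing the matrix in sparse form keeps the file small, since most entries of a TF-IDF matrix are zero.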