# Text tokenization and vectorization
We use spaCy to tokenize documents in batches via its `pipe` method, then save the tokens to disk so they do not have to be recomputed. Finally, we use a TF-IDF vectorizer to compute document vectors and, as before, save the results to disk.
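
To make the vectorization step concrete, here is a minimal pure-Python sketch of TF-IDF weighting on already tokenized documents. It assumes the textbook formula `tf * log(N / df)`; library vectorizers typically add smoothing and normalization, so their numbers will differ. The function name `tfidf_vectors` and the sample documents are hypothetical and not part of this project.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) using plain tf * log(N / df) weighting."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vocabulary = sorted(df)
    index = {term: i for i, term in enumerate(vocabulary)}

    vectors = []
    for doc in tokenized_docs:
        vec = [0.0] * len(vocabulary)
        for term, count in Counter(doc).items():
            tf = count / len(doc)               # relative term frequency
            idf = math.log(n_docs / df[term])   # rarer terms weigh more
            vec[index[term]] = tf * idf
        vectors.append(vec)
    return vocabulary, vectors

# Hypothetical toy corpus of two tokenized documents.
docs = [["steam", "engine", "steam"], ["engine", "patent"]]
vocabulary, vectors = tfidf_vectors(docs)
```

A term that occurs in every document, such as `engine` here, gets weight zero, while terms specific to one document dominate that document's vector.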

In [7]:
import sys
sys.path.append("..")

import json
import numpy as np
import pandas as pd

from src.dataset import Dataset
from src.tokenizers import BatchTokenizer

### Loading the dataset
The dataset is loaded as a list of documents. We do not need all the information about them right now.

In [3]:
dataset = Dataset(dataset_path="../data/raw/data.jsonl")

In [4]:
years = [1760, 1800, 1820, 1840, 1860, 1880, 
         1900, 1920, 1940, 1960, 1980, 2000]

tokens_folder = "../data/processed/tokens"

In [5]:
bt = BatchTokenizer()
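
The implementation of `BatchTokenizer` lives in `src/tokenizers` and is not shown here. As a rough sketch of the underlying idea, batch tokenization with spaCy's `pipe` method could look like the following. The function name `tokenize_batch`, the blank English pipeline, the batch size, and the lowercasing and punctuation filtering are all assumptions made for illustration, not the project's actual choices.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no model download).
nlp = spacy.blank("en")

def tokenize_batch(texts, batch_size=256):
    """Tokenize texts in batches, lowercasing and dropping punctuation and whitespace."""
    tokenized = []
    # nlp.pipe streams the texts through the pipeline in batches,
    # which is faster than calling nlp(text) once per document.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        tokenized.append(
            [tok.text.lower() for tok in doc if not (tok.is_punct or tok.is_space)]
        )
    return tokenized

sample_tokens = tokenize_batch(["Steam engines changed industry.", "A second document"])
```

The result is a list of token lists, one per input document, which serializes directly to JSON.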

In [14]:
for year in years:
    # Keep only the raw text of each document for this year
    texts = [el["text"] for el in dataset.load_dataset(year)]
    print(f"Processing year {year}: {len(texts)} documents")
    tokens = bt.tokenize(texts)

    # Save the tokens to disk, one JSON file per year
    with open(f"{tokens_folder}/{year}.json", "w") as f:
        json.dump(tokens, f)

Processing year 1760: 1 documents
Processing year 1800: 5 documents
Processing year 1820: 440 documents
Processing year 1840: 2657 documents
Processing year 1860: 9255 documents
Processing year 1880: 19648 documents
Processing year 1900: 28932 documents
Processing year 1920: 26954 documents
Processing year 1940: 14668 documents
Processing year 1960: 33683 documents
Processing year 1980: 35641 documents
Processing year 2000: 11262 documents
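
Since the tokens are saved precisely to avoid recomputing them, a later run can read each year's file back instead of tokenizing again. The helper below is a minimal sketch of that pattern; `load_or_tokenize` is a hypothetical name, and it assumes the one-JSON-file-per-year layout used in the loop above.

```python
import json
from pathlib import Path

def load_or_tokenize(year, texts, tokenize, tokens_folder):
    """Return cached tokens for a year if present, otherwise tokenize and cache them."""
    path = Path(tokens_folder) / f"{year}.json"
    if path.exists():
        # Cache hit: reuse the tokens written by a previous run.
        with path.open() as f:
            return json.load(f)

    # Cache miss: tokenize, then write the result for next time.
    tokens = tokenize(texts)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        json.dump(tokens, f)
    return tokens
```

In this notebook it could be called as `load_or_tokenize(year, texts, bt.tokenize, tokens_folder)` inside the loop over `years`.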


### Saving a new dataset with tokenized text