# Text tokenization and vectorization
We use spaCy to tokenize documents in batches through its pipe method, and we save the tokens to disk to avoid recomputing them. We then use a TF-IDF vectorizer to compute document vectors and, as before, save the results to disk.
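As a rough illustration of batch tokenization with spaCy's pipe method, here is a minimal sketch. It is not the project's BatchTokenizer: the helper name `tokenize_batch`, the blank English pipeline, and the filtering of punctuation and whitespace are all assumptions made for the example.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")

def tokenize_batch(texts, batch_size=2):
    """Hypothetical helper: return one list of lowercase tokens per document."""
    token_lists = []
    # nlp.pipe streams the texts through the pipeline in batches
    for doc in nlp.pipe(texts, batch_size=batch_size):
        token_lists.append(
            [t.lower_ for t in doc if not (t.is_punct or t.is_space)]
        )
    return token_lists

sample_tokens = tokenize_batch(["Hello world!", "Batch processing is fast."])
```

On a large corpus, pipe also accepts an n_process argument to spread the work across several processes.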

In [1]:
import sys
sys.path.append("..")

import pandas as pd
import numpy as np

from src.dataset import Dataset
from src.tokenizers import BatchTokenizer

### Loading the dataset
The dataset is loaded as a list of document texts; we do not need the other information about the documents right now.

In [None]:
# number of documents to load, -1 means all of them
n_documents = -1

name = "processed"  # takes about 10 seconds to load, contains approx. 19k documents

dataset = Dataset(dataset_path="../data/raw/data.jsonl", save_path=f"../data/processed/{name}.json")
texts_list = dataset.load_text_list(size=n_documents)

## Tokenize in batch
The tokenizer processes the documents in parallel and produces a list of tokens for each document. 
We save these lists to disk and load them again later. 
This cell was run on Google Colab because it is computationally intensive.
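The save-then-merge pattern described above can be sketched with the standard library alone. The helpers `chunk_bounds`, `save_chunk` and `merge_chunks` are hypothetical names for this example and are not the BatchTokenizer methods; the file naming only mimics the one used in this notebook.

```python
import json
import tempfile
from pathlib import Path

def chunk_bounds(n_items, interval):
    """Yield (start, end) pairs covering range(n_items); the last chunk may be shorter."""
    for start in range(0, n_items, interval):
        yield start, min(start + interval, n_items)

def save_chunk(token_lists, directory, start, end):
    """Write one chunk of token lists to its own JSON file."""
    path = Path(directory) / f"tokens_{start}-{end}.json"
    path.write_text(json.dumps(token_lists), encoding="utf-8")
    return path

def merge_chunks(directory):
    """Concatenate all chunk files, ordered by their numeric start index."""
    paths = sorted(
        Path(directory).glob("tokens_*.json"),
        key=lambda p: int(p.stem.split("_")[1].split("-")[0]),
    )
    merged = []
    for path in paths:
        merged.extend(json.loads(path.read_text(encoding="utf-8")))
    return merged

docs = [["doc", str(i)] for i in range(25)]  # stand-in for tokenized documents
with tempfile.TemporaryDirectory() as tmp:
    for start, end in chunk_bounds(len(docs), interval=10):
        save_chunk(docs[start:end], tmp, start, end)
    merged_docs = merge_chunks(tmp)
```

Sorting by the numeric start index matters: a plain alphabetical sort would place tokens_10000-… before tokens_2000-… and scramble the document order.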

In [2]:
bt = BatchTokenizer()

interval = 1000
for i in range(0, len(texts_list), interval):
    # the last batch may be shorter than `interval`
    end = min(i + interval, len(texts_list))
    print(f"\nProcessing documents from {i} to {end}")

    tokens = bt.tokenize(texts_list[i:end])
    # file names keep the nominal upper bound i+interval, even for the last batch
    bt.save_tokens(tokens,
                   tokens_save_path=f"../data/processed/tokens/tokens_{i}-{i+interval}.json")

In [19]:
BatchTokenizer.merge_tokens()  # merge the per-batch token files; run once

## TF-IDF vectorization

We now apply a TF-IDF vectorizer to the tokens obtained above.
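To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on pre-tokenized documents. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn's TfidfVectorizer uses by default. Whether TokenTfidfVectorizer uses those settings is not confirmed by this notebook, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(token_lists):
    """Return (vocabulary, rows); each row maps column index -> TF-IDF weight."""
    n_docs = len(token_lists)
    # column index for every distinct term, in alphabetical order
    terms = sorted({t for doc in token_lists for t in doc})
    vocabulary = {term: idx for idx, term in enumerate(terms)}
    # document frequency: number of documents containing each term
    df = Counter(term for doc in token_lists for term in set(doc))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}

    rows = []
    for doc in token_lists:
        counts = Counter(doc)
        weights = {vocabulary[t]: c * idf[t] for t, c in counts.items()}
        # L2-normalize each document vector
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({col: w / norm for col, w in weights.items()} if norm else {})
    return vocabulary, rows

toy_vocabulary, toy_rows = tfidf([["cat", "sat"], ["dog", "sat"]])
```

Each row is a sparse mapping from column index to weight, which is the same information a printed sparse matrix shows as (row, column) value entries. Under L2 normalization, a long document spreads its unit norm over many terms, so individual weights become small.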

In [11]:
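# TokenTfidfVectorizer must be imported from the project's src package (the import is not shown in this notebook).
# `tokens` should hold the token lists of all documents, not only the last batch of the loop above.
dv = TokenTfidfVectorizer(tokens)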

vectors = dv.vectors()
dv.save_vectors_vectorizer(vectors)

print(f"Vocabulary length: {len(dv.vectorizer.vocabulary_)}")

Vocabulary length: 219250


In [13]:
loaded_vectors, loaded_vec = TokenTfidfVectorizer.load_vectors_vectorizer()

In [18]:
print(loaded_vectors)

  (0, 82039)	0.0037695325196483685
  (0, 195381)	0.006203894177435243
  (0, 164259)	0.0029779364521888957
  (0, 67135)	0.004823325955442913
  (0, 109067)	0.005762846891644566
  (0, 206607)	0.017453122251463598
  (0, 204172)	0.007797523784253059
  (0, 18660)	0.002970212717526003
  (0, 55008)	0.0042905188991160335
  (0, 218411)	0.006062187392660119
  (0, 73201)	0.0016162452320801482
  (0, 148861)	0.0028484467099615823
  (0, 101074)	0.003726458166408488
  (0, 157691)	0.0022911045026632842
  (0, 165413)	0.004689255363353887
  (0, 167383)	0.0038350358220870426
  (0, 79660)	0.003189237888898805
  (0, 35153)	0.005554848421869665
  (0, 151123)	0.004888198204107622
  (0, 166799)	0.0033914349315768773
  (0, 53991)	0.007973018618683157
  (0, 24190)	0.0036485483688472584
  (0, 142971)	0.004067931882201199
  (0, 100215)	0.005377195363231539
  (0, 75298)	0.002514982429814531
  :	:
  (50999, 25372)	0.03001682081716593
  (50999, 166606)	0.0420542427795643
  (50999, 136686)	0.011504271206836085
  (5099