In this notebook, I am going to build a baseline model based on [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5) for the Jigsaw Multilingual Toxic Comment Classification (Kaggle challenge [link](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification)). 

**What am I predicting?** (comes from the challenge homepage)

You are predicting the probability that a comment is toxic. A toxic comment would receive a 1.0. A benign, non-toxic comment would receive a 0.0. In the test set, all comments are classified as either a 1.0 or a 0.0.

In [13]:
import tensorflow as tf
print(tf.__version__)

2.1.0


An amazing EDA on the dataset in available here: https://www.kaggle.com/tarunpaparaju/jigsaw-multilingual-toxicity-eda-models. 

## Load and prepare data

In [None]:
!ls /kaggle/input/jigsaw-multilingual-toxic-comment-classification/

Data description is available [here](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/data). 

In [None]:
# Load datasets
import pandas as pd
import os

DATA_PATH = "/kaggle/input/jigsaw-multilingual-toxic-comment-classification/"

TEST_PATH = os.path.join(DATA_PATH, "test.csv")
VAL_PATH = os.path.join(DATA_PATH, "validation.csv")
TRAIN_PATH = os.path.join(DATA_PATH, "jigsaw-toxic-comment-train.csv")

val_data = pd.read_csv(VAL_PATH)
test_data = pd.read_csv(TEST_PATH)
train_data = pd.read_csv(TRAIN_PATH)

In [None]:
# Preview train set
train_data.sample(5)

Columns (comes from [here](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/data)): 
- id - identifier within each file.
- comment_text - the text of the comment to be classified.
- toxic:identity_hate - whether or not the comment is classified as toxic. 

In [None]:
val_data.sample(5)

In [None]:
test_data.sample(5)

I am going to borrow the helper functions as shown here: https://www.kaggle.com/tarunpaparaju/jigsaw-multilingual-toxicity-eda-models. 

In [None]:
# Remove usernames and links
import re

val = val_data
train = train_data

def clean(text):
    # fill the missing entries and convert them to lower case
    text = text.fillna("fillna").str.lower()
    # replace the newline characters with space 
    text = text.map(lambda x: re.sub('\\n',' ',str(x)))
    text = text.map(lambda x: re.sub("\[\[User.*",'',str(x)))
    # remove usernames and links
    text = text.map(lambda x: re.sub("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",'',str(x)))
    text = text.map(lambda x: re.sub("\(http://.*?\s\(http://.*\)",'',str(x)))
    return text

val["comment_text"] = clean(val["comment_text"])
test_data["content"] = clean(test_data["content"])
train["comment_text"] = clean(train["comment_text"])

In [None]:
# Load DistilBERT tokenizer
import transformers

tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')

The following function comes from [here](https://github.com/dipanjanS/deep_transfer_learning_nlp_dhs2019/blob/master/notebooks/6%20-%20Transformers%20-%20DistilBERT.ipynb).

In [15]:
import numpy as np
import tqdm

def create_bert_input_features(tokenizer, docs, max_seq_length):
    
    all_ids, all_masks = [], []
    for doc in tqdm.tqdm(docs, desc="Converting docs to features"):
        tokens = tokenizer.tokenize(doc)
        if len(tokens) > max_seq_length-2:
            tokens = tokens[0 : (max_seq_length-2)]
        tokens = ['[CLS]'] + tokens + ['[SEP]']
        ids = tokenizer.convert_tokens_to_ids(tokens)
        masks = [1] * len(ids)
        # Zero-pad up to the sequence length.
        while len(ids) < max_seq_length:
            ids.append(0)
            masks.append(0)
        all_ids.append(ids)
        all_masks.append(masks)
    encoded = np.array([all_ids, all_masks])
    return encoded

In [None]:
# Segregate the comments and their labels (not applicable for test set)
train_comments = train.comment_text.astype(str).values
val_comments = val_data.comment_text.astype(str).values
test_comments = test_data.content.astype(str).values

y_valid = val.toxic.values
y_train = train.toxic.values

In [None]:
import gc
gc.collect()

In [16]:
# Encode the comments
MAX_SEQ_LENGTH = 500

train_features_ids, train_features_masks = create_bert_input_features(tokenizer, train_comments, 
                                                                      max_seq_length=MAX_SEQ_LENGTH)
val_features_ids, val_features_masks = create_bert_input_features(tokenizer, val_comments, 
                                                                  max_seq_length=MAX_SEQ_LENGTH)
# test_features = create_bert_input_features(tokenizer, test_comments, 
#                                            max_seq_length=MAX_SEQ_LENGTH)


Converting docs to features:   0%|          | 0/223549 [00:00<?, ?it/s][A
Converting docs to features:   0%|          | 36/223549 [00:00<10:26, 356.88it/s][A
Converting docs to features:   0%|          | 59/223549 [00:00<12:31, 297.42it/s][A
Converting docs to features:   0%|          | 95/223549 [00:00<11:55, 312.13it/s][A
Converting docs to features:   0%|          | 127/223549 [00:00<12:02, 309.21it/s][A
Converting docs to features:   0%|          | 161/223549 [00:00<12:21, 301.30it/s][A
Converting docs to features:   0%|          | 199/223549 [00:00<11:39, 319.46it/s][A
Converting docs to features:   0%|          | 230/223549 [00:00<11:48, 315.29it/s][A
Converting docs to features:   0%|          | 262/223549 [00:00<11:48, 315.14it/s][A
Converting docs to features:   0%|          | 312/223549 [00:00<10:30, 354.10it/s][A
Converting docs to features:   0%|          | 349/223549 [00:01<10:28, 355.28it/s][A
Converting docs to features:   0%|          | 392/223549 [00:01<09:

Converting docs to features:   2%|▏         | 3433/223549 [00:10<14:17, 256.56it/s][A
Converting docs to features:   2%|▏         | 3472/223549 [00:10<12:50, 285.57it/s][A
Converting docs to features:   2%|▏         | 3514/223549 [00:10<11:59, 305.62it/s][A
Converting docs to features:   2%|▏         | 3547/223549 [00:10<11:45, 311.84it/s][A
Converting docs to features:   2%|▏         | 3588/223549 [00:11<10:55, 335.80it/s][A
Converting docs to features:   2%|▏         | 3624/223549 [00:11<10:50, 338.09it/s][A
Converting docs to features:   2%|▏         | 3659/223549 [00:11<11:04, 330.84it/s][A
Converting docs to features:   2%|▏         | 3693/223549 [00:11<11:45, 311.54it/s][A
Converting docs to features:   2%|▏         | 3725/223549 [00:11<12:06, 302.66it/s][A
Converting docs to features:   2%|▏         | 3760/223549 [00:11<11:37, 315.33it/s][A
Converting docs to features:   2%|▏         | 3796/223549 [00:11<11:12, 326.89it/s][A
Converting docs to features:   2%|▏        

Converting docs to features:   3%|▎         | 6748/223549 [00:21<11:24, 316.59it/s][A
Converting docs to features:   3%|▎         | 6794/223549 [00:21<10:20, 349.18it/s][A
Converting docs to features:   3%|▎         | 6831/223549 [00:21<10:32, 342.58it/s][A
Converting docs to features:   3%|▎         | 6870/223549 [00:21<10:17, 351.17it/s][A
Converting docs to features:   3%|▎         | 6912/223549 [00:21<09:56, 363.40it/s][A
Converting docs to features:   3%|▎         | 6957/223549 [00:21<09:23, 384.12it/s][A
Converting docs to features:   3%|▎         | 6997/223549 [00:21<11:16, 319.88it/s][A
Converting docs to features:   3%|▎         | 7032/223549 [00:21<11:46, 306.35it/s][A
Converting docs to features:   3%|▎         | 7065/223549 [00:22<11:50, 304.82it/s][A
Converting docs to features:   3%|▎         | 7109/223549 [00:22<10:45, 335.43it/s][A
Converting docs to features:   3%|▎         | 7145/223549 [00:22<12:59, 277.66it/s][A
Converting docs to features:   3%|▎        

Converting docs to features:   5%|▍         | 10206/223549 [00:32<13:18, 267.18it/s][A
Converting docs to features:   5%|▍         | 10237/223549 [00:32<12:58, 273.92it/s][A
Converting docs to features:   5%|▍         | 10271/223549 [00:32<12:13, 290.67it/s][A
Converting docs to features:   5%|▍         | 10302/223549 [00:32<12:00, 295.90it/s][A
Converting docs to features:   5%|▍         | 10334/223549 [00:32<11:48, 300.84it/s][A
Converting docs to features:   5%|▍         | 10365/223549 [00:32<12:06, 293.55it/s][A
Converting docs to features:   5%|▍         | 10401/223549 [00:32<11:29, 309.17it/s][A
Converting docs to features:   5%|▍         | 10439/223549 [00:32<10:53, 326.19it/s][A
Converting docs to features:   5%|▍         | 10473/223549 [00:32<10:58, 323.39it/s][A
Converting docs to features:   5%|▍         | 10508/223549 [00:32<10:55, 325.14it/s][A
Converting docs to features:   5%|▍         | 10549/223549 [00:33<10:23, 341.86it/s][A
Converting docs to features:   5

Converting docs to features:   6%|▌         | 13496/223549 [00:42<11:00, 318.09it/s][A
Converting docs to features:   6%|▌         | 13528/223549 [00:42<11:25, 306.20it/s][A
Converting docs to features:   6%|▌         | 13570/223549 [00:42<10:32, 332.01it/s][A
Converting docs to features:   6%|▌         | 13605/223549 [00:42<10:47, 324.29it/s][A
Converting docs to features:   6%|▌         | 13639/223549 [00:42<10:52, 321.77it/s][A
Converting docs to features:   6%|▌         | 13672/223549 [00:43<12:45, 274.32it/s][A
Converting docs to features:   6%|▌         | 13701/223549 [00:43<12:54, 271.02it/s][A
Converting docs to features:   6%|▌         | 13735/223549 [00:43<12:10, 287.19it/s][A
Converting docs to features:   6%|▌         | 13765/223549 [00:43<12:33, 278.48it/s][A
Converting docs to features:   6%|▌         | 13798/223549 [00:43<12:04, 289.57it/s][A
Converting docs to features:   6%|▌         | 13835/223549 [00:43<11:17, 309.42it/s][A
Converting docs to features:   6

Converting docs to features:   8%|▊         | 16899/223549 [00:53<10:25, 330.34it/s][A
Converting docs to features:   8%|▊         | 16936/223549 [00:53<10:06, 340.86it/s][A
Converting docs to features:   8%|▊         | 16971/223549 [00:53<10:14, 336.39it/s][A
Converting docs to features:   8%|▊         | 17006/223549 [00:53<10:51, 317.09it/s][A
Converting docs to features:   8%|▊         | 17047/223549 [00:53<10:09, 338.81it/s][A
Converting docs to features:   8%|▊         | 17082/223549 [00:53<10:11, 337.37it/s][A
Converting docs to features:   8%|▊         | 17117/223549 [00:53<11:49, 291.00it/s][A
Converting docs to features:   8%|▊         | 17164/223549 [00:54<10:36, 324.06it/s][A
Converting docs to features:   8%|▊         | 17199/223549 [00:54<10:57, 313.73it/s][A
Converting docs to features:   8%|▊         | 17244/223549 [00:54<10:02, 342.67it/s][A
Converting docs to features:   8%|▊         | 17281/223549 [00:54<10:57, 313.90it/s][A
Converting docs to features:   8

Converting docs to features:   9%|▉         | 20421/223549 [01:03<10:48, 313.09it/s][A
Converting docs to features:   9%|▉         | 20454/223549 [01:03<11:08, 303.66it/s][A
Converting docs to features:   9%|▉         | 20493/223549 [01:04<10:30, 321.93it/s][A
Converting docs to features:   9%|▉         | 20540/223549 [01:04<09:35, 352.61it/s][A
Converting docs to features:   9%|▉         | 20593/223549 [01:04<08:39, 390.97it/s][A
Converting docs to features:   9%|▉         | 20635/223549 [01:04<09:37, 351.21it/s][A
Converting docs to features:   9%|▉         | 20673/223549 [01:04<10:12, 331.18it/s][A
Converting docs to features:   9%|▉         | 20714/223549 [01:04<09:45, 346.18it/s][A
Converting docs to features:   9%|▉         | 20757/223549 [01:04<09:11, 367.42it/s][A
Converting docs to features:   9%|▉         | 20796/223549 [01:04<09:45, 346.49it/s][A
Converting docs to features:   9%|▉         | 20836/223549 [01:04<09:21, 360.74it/s][A
Converting docs to features:   9

Converting docs to features:  11%|█         | 23972/223549 [01:14<10:12, 325.58it/s][A
Converting docs to features:  11%|█         | 24006/223549 [01:14<10:39, 311.96it/s][A
Converting docs to features:  11%|█         | 24044/223549 [01:14<10:11, 326.13it/s][A
Converting docs to features:  11%|█         | 24082/223549 [01:14<09:45, 340.53it/s][A
Converting docs to features:  11%|█         | 24121/223549 [01:14<09:25, 352.37it/s][A
Converting docs to features:  11%|█         | 24165/223549 [01:14<08:52, 374.18it/s][A
Converting docs to features:  11%|█         | 24204/223549 [01:15<11:18, 293.97it/s][A
Converting docs to features:  11%|█         | 24249/223549 [01:15<10:09, 327.19it/s][A
Converting docs to features:  11%|█         | 24286/223549 [01:15<10:06, 328.78it/s][A
Converting docs to features:  11%|█         | 24324/223549 [01:15<09:44, 340.80it/s][A
Converting docs to features:  11%|█         | 24362/223549 [01:15<09:30, 348.95it/s][A
Converting docs to features:  11

Converting docs to features:  12%|█▏        | 27413/223549 [01:24<10:52, 300.53it/s][A
Converting docs to features:  12%|█▏        | 27446/223549 [01:25<10:38, 307.19it/s][A
Converting docs to features:  12%|█▏        | 27492/223549 [01:25<09:35, 340.69it/s][A
Converting docs to features:  12%|█▏        | 27528/223549 [01:25<09:52, 330.97it/s][A
Converting docs to features:  12%|█▏        | 27568/223549 [01:25<09:26, 346.04it/s][A
Converting docs to features:  12%|█▏        | 27604/223549 [01:25<10:03, 324.45it/s][A
Converting docs to features:  12%|█▏        | 27642/223549 [01:25<09:37, 338.99it/s][A
Converting docs to features:  12%|█▏        | 27683/223549 [01:25<09:11, 355.23it/s][A
Converting docs to features:  12%|█▏        | 27720/223549 [01:25<10:12, 319.93it/s][A
Converting docs to features:  12%|█▏        | 27754/223549 [01:25<10:41, 305.05it/s][A
Converting docs to features:  12%|█▏        | 27786/223549 [01:26<11:45, 277.37it/s][A
Converting docs to features:  12

Converting docs to features:  14%|█▍        | 30842/223549 [01:35<10:06, 317.60it/s][A
Converting docs to features:  14%|█▍        | 30876/223549 [01:35<11:32, 278.12it/s][A
Converting docs to features:  14%|█▍        | 30909/223549 [01:35<11:02, 290.68it/s][A
Converting docs to features:  14%|█▍        | 30948/223549 [01:35<10:12, 314.40it/s][A
Converting docs to features:  14%|█▍        | 30992/223549 [01:35<09:21, 342.69it/s][A
Converting docs to features:  14%|█▍        | 31029/223549 [01:36<09:40, 331.73it/s][A
Converting docs to features:  14%|█▍        | 31064/223549 [01:36<11:54, 269.39it/s][A
Converting docs to features:  14%|█▍        | 31096/223549 [01:36<11:32, 278.01it/s][A
Converting docs to features:  14%|█▍        | 31135/223549 [01:36<10:32, 304.02it/s][A
Converting docs to features:  14%|█▍        | 31168/223549 [01:36<10:55, 293.36it/s][A
Converting docs to features:  14%|█▍        | 31204/223549 [01:36<10:19, 310.57it/s][A
Converting docs to features:  14

Converting docs to features:  15%|█▌        | 34057/223549 [01:46<10:07, 311.92it/s][A
Converting docs to features:  15%|█▌        | 34089/223549 [01:46<10:55, 288.82it/s][A
Converting docs to features:  15%|█▌        | 34119/223549 [01:46<11:18, 279.35it/s][A
Converting docs to features:  15%|█▌        | 34153/223549 [01:46<10:43, 294.41it/s][A
Converting docs to features:  15%|█▌        | 34190/223549 [01:46<10:04, 313.44it/s][A
Converting docs to features:  15%|█▌        | 34223/223549 [01:46<10:00, 315.29it/s][A
Converting docs to features:  15%|█▌        | 34260/223549 [01:46<09:34, 329.53it/s][A
Converting docs to features:  15%|█▌        | 34309/223549 [01:46<08:38, 365.19it/s][A
Converting docs to features:  15%|█▌        | 34348/223549 [01:46<08:40, 363.56it/s][A
Converting docs to features:  15%|█▌        | 34386/223549 [01:47<09:09, 344.56it/s][A
Converting docs to features:  15%|█▌        | 34422/223549 [01:47<09:17, 339.27it/s][A
Converting docs to features:  15

Converting docs to features:  17%|█▋        | 37348/223549 [01:56<09:34, 324.26it/s][A
Converting docs to features:  17%|█▋        | 37391/223549 [01:56<09:00, 344.21it/s][A
Converting docs to features:  17%|█▋        | 37426/223549 [01:56<09:35, 323.56it/s][A
Converting docs to features:  17%|█▋        | 37463/223549 [01:56<09:15, 335.28it/s][A
Converting docs to features:  17%|█▋        | 37498/223549 [01:57<10:54, 284.38it/s][A
Converting docs to features:  17%|█▋        | 37530/223549 [01:57<10:37, 291.59it/s][A
Converting docs to features:  17%|█▋        | 37561/223549 [01:57<10:41, 289.87it/s][A
Converting docs to features:  17%|█▋        | 37595/223549 [01:57<10:14, 302.41it/s][A
Converting docs to features:  17%|█▋        | 37628/223549 [01:57<10:01, 308.96it/s][A
Converting docs to features:  17%|█▋        | 37668/223549 [01:57<09:27, 327.50it/s][A
Converting docs to features:  17%|█▋        | 37702/223549 [01:57<09:44, 317.87it/s][A
Converting docs to features:  17

Converting docs to features:  18%|█▊        | 40780/223549 [02:07<09:49, 310.05it/s][A
Converting docs to features:  18%|█▊        | 40827/223549 [02:07<08:50, 344.70it/s][A
Converting docs to features:  18%|█▊        | 40864/223549 [02:07<09:03, 335.89it/s][A
Converting docs to features:  18%|█▊        | 40905/223549 [02:07<08:35, 354.61it/s][A
Converting docs to features:  18%|█▊        | 40942/223549 [02:07<08:36, 353.69it/s][A
Converting docs to features:  18%|█▊        | 40979/223549 [02:07<09:12, 330.57it/s][A
Converting docs to features:  18%|█▊        | 41014/223549 [02:07<09:03, 335.75it/s][A
Converting docs to features:  18%|█▊        | 41049/223549 [02:08<10:18, 294.97it/s][A
Converting docs to features:  18%|█▊        | 41085/223549 [02:08<09:46, 310.94it/s][A
Converting docs to features:  18%|█▊        | 41118/223549 [02:08<10:01, 303.41it/s][A
Converting docs to features:  18%|█▊        | 41158/223549 [02:08<09:26, 322.01it/s][A
Converting docs to features:  18

Converting docs to features:  20%|█▉        | 44211/223549 [02:17<10:05, 296.02it/s][A
Converting docs to features:  20%|█▉        | 44244/223549 [02:17<09:50, 303.47it/s][A
Converting docs to features:  20%|█▉        | 44275/223549 [02:17<10:23, 287.73it/s][A
Converting docs to features:  20%|█▉        | 44313/223549 [02:18<09:38, 309.98it/s][A
Converting docs to features:  20%|█▉        | 44346/223549 [02:18<09:28, 315.41it/s][A
Converting docs to features:  20%|█▉        | 44379/223549 [02:18<09:25, 316.68it/s][A
Converting docs to features:  20%|█▉        | 44413/223549 [02:18<09:14, 322.87it/s][A
Converting docs to features:  20%|█▉        | 44454/223549 [02:18<08:40, 344.40it/s][A
Converting docs to features:  20%|█▉        | 44490/223549 [02:18<08:35, 347.32it/s][A
Converting docs to features:  20%|█▉        | 44526/223549 [02:18<08:36, 346.49it/s][A
Converting docs to features:  20%|█▉        | 44561/223549 [02:18<08:36, 346.37it/s][A
Converting docs to features:  20

Converting docs to features:  21%|██▏       | 47600/223549 [02:28<09:55, 295.38it/s][A
Converting docs to features:  21%|██▏       | 47630/223549 [02:28<10:02, 291.87it/s][A
Converting docs to features:  21%|██▏       | 47671/223549 [02:28<09:10, 319.22it/s][A
Converting docs to features:  21%|██▏       | 47704/223549 [02:28<09:32, 307.16it/s][A
Converting docs to features:  21%|██▏       | 47736/223549 [02:28<10:18, 284.19it/s][A
Converting docs to features:  21%|██▏       | 47772/223549 [02:28<09:41, 302.45it/s][A
Converting docs to features:  21%|██▏       | 47804/223549 [02:28<09:37, 304.25it/s][A
Converting docs to features:  21%|██▏       | 47836/223549 [02:28<09:55, 294.84it/s][A
Converting docs to features:  21%|██▏       | 47880/223549 [02:28<08:58, 326.17it/s][A
Converting docs to features:  21%|██▏       | 47915/223549 [02:29<10:01, 292.16it/s][A
Converting docs to features:  21%|██▏       | 47952/223549 [02:29<09:23, 311.63it/s][A
Converting docs to features:  21

Converting docs to features:  23%|██▎       | 50968/223549 [02:38<08:57, 321.04it/s][A
Converting docs to features:  23%|██▎       | 51012/223549 [02:38<08:15, 348.50it/s][A
Converting docs to features:  23%|██▎       | 51053/223549 [02:38<07:53, 364.68it/s][A
Converting docs to features:  23%|██▎       | 51091/223549 [02:38<08:11, 351.22it/s][A
Converting docs to features:  23%|██▎       | 51129/223549 [02:39<08:00, 358.69it/s][A
Converting docs to features:  23%|██▎       | 51167/223549 [02:39<07:55, 362.21it/s][A
Converting docs to features:  23%|██▎       | 51204/223549 [02:39<08:14, 348.74it/s][A
Converting docs to features:  23%|██▎       | 51240/223549 [02:39<08:13, 349.00it/s][A
Converting docs to features:  23%|██▎       | 51276/223549 [02:39<08:19, 344.91it/s][A
Converting docs to features:  23%|██▎       | 51311/223549 [02:39<08:28, 338.45it/s][A
Converting docs to features:  23%|██▎       | 51346/223549 [02:39<08:27, 339.30it/s][A
Converting docs to features:  23

Converting docs to features:  24%|██▍       | 54182/223549 [02:53<09:41, 291.27it/s][A
Converting docs to features:  24%|██▍       | 54218/223549 [02:53<09:29, 297.53it/s][A
Converting docs to features:  24%|██▍       | 54253/223549 [02:53<09:05, 310.53it/s][A
Converting docs to features:  24%|██▍       | 54286/223549 [02:54<09:23, 300.44it/s][A
Converting docs to features:  24%|██▍       | 54317/223549 [02:54<09:56, 283.62it/s][A
Converting docs to features:  24%|██▍       | 54350/223549 [02:54<09:34, 294.53it/s][A
Converting docs to features:  24%|██▍       | 54392/223549 [02:54<08:46, 321.14it/s][A
Converting docs to features:  24%|██▍       | 54426/223549 [02:54<08:59, 313.60it/s][A
Converting docs to features:  24%|██▍       | 54470/223549 [02:54<08:13, 342.72it/s][A
Converting docs to features:  24%|██▍       | 54506/223549 [02:54<09:04, 310.37it/s][A
Converting docs to features:  24%|██▍       | 54543/223549 [02:54<08:39, 325.06it/s][A
Converting docs to features:  24

Converting docs to features:  26%|██▌       | 57510/223549 [03:04<08:26, 328.02it/s][A
Converting docs to features:  26%|██▌       | 57544/223549 [03:04<08:25, 328.44it/s][A
Converting docs to features:  26%|██▌       | 57581/223549 [03:04<08:14, 335.59it/s][A
Converting docs to features:  26%|██▌       | 57615/223549 [03:04<09:18, 297.34it/s][A
Converting docs to features:  26%|██▌       | 57646/223549 [03:04<09:38, 286.70it/s][A
Converting docs to features:  26%|██▌       | 57676/223549 [03:04<09:52, 279.81it/s][A
Converting docs to features:  26%|██▌       | 57707/223549 [03:04<09:38, 286.72it/s][A
Converting docs to features:  26%|██▌       | 57742/223549 [03:05<09:17, 297.57it/s][A
Converting docs to features:  26%|██▌       | 57789/223549 [03:05<08:16, 333.79it/s][A
Converting docs to features:  26%|██▌       | 57825/223549 [03:05<09:00, 306.70it/s][A
Converting docs to features:  26%|██▌       | 57858/223549 [03:05<10:04, 273.99it/s][A
Converting docs to features:  26

Converting docs to features:  27%|██▋       | 60895/223549 [03:14<07:59, 339.25it/s][A
Converting docs to features:  27%|██▋       | 60937/223549 [03:15<07:32, 359.41it/s][A
Converting docs to features:  27%|██▋       | 60974/223549 [03:15<07:28, 362.29it/s][A
Converting docs to features:  27%|██▋       | 61011/223549 [03:15<07:51, 344.78it/s][A
Converting docs to features:  27%|██▋       | 61049/223549 [03:15<07:39, 353.40it/s][A
Converting docs to features:  27%|██▋       | 61085/223549 [03:15<08:12, 329.99it/s][A
Converting docs to features:  27%|██▋       | 61122/223549 [03:15<07:58, 339.37it/s][A
Converting docs to features:  27%|██▋       | 61157/223549 [03:15<08:46, 308.68it/s][A
Converting docs to features:  27%|██▋       | 61189/223549 [03:15<08:59, 300.95it/s][A
Converting docs to features:  27%|██▋       | 61220/223549 [03:15<08:57, 301.73it/s][A
Converting docs to features:  27%|██▋       | 61251/223549 [03:16<09:04, 298.12it/s][A
Converting docs to features:  27

Converting docs to features:  29%|██▊       | 64183/223549 [03:25<10:12, 260.23it/s][A
Converting docs to features:  29%|██▊       | 64220/223549 [03:25<09:18, 285.10it/s][A
Converting docs to features:  29%|██▊       | 64259/223549 [03:25<08:33, 309.99it/s][A
Converting docs to features:  29%|██▉       | 64292/223549 [03:25<08:37, 307.88it/s][A
Converting docs to features:  29%|██▉       | 64333/223549 [03:26<08:03, 329.52it/s][A
Converting docs to features:  29%|██▉       | 64376/223549 [03:26<07:30, 353.09it/s][A
Converting docs to features:  29%|██▉       | 64413/223549 [03:26<08:01, 330.31it/s][A
Converting docs to features:  29%|██▉       | 64448/223549 [03:26<07:55, 334.72it/s][A
Converting docs to features:  29%|██▉       | 64485/223549 [03:26<07:47, 340.48it/s][A
Converting docs to features:  29%|██▉       | 64525/223549 [03:26<07:33, 350.79it/s][A
Converting docs to features:  29%|██▉       | 64561/223549 [03:26<07:42, 343.51it/s][A
Converting docs to features:  29

Converting docs to features:  30%|███       | 67602/223549 [03:36<07:37, 340.66it/s][A
Converting docs to features:  30%|███       | 67641/223549 [03:36<07:21, 353.42it/s][A
Converting docs to features:  30%|███       | 67678/223549 [03:36<07:21, 352.75it/s][A
Converting docs to features:  30%|███       | 67714/223549 [03:36<07:25, 350.06it/s][A
Converting docs to features:  30%|███       | 67750/223549 [03:36<07:32, 344.03it/s][A
Converting docs to features:  30%|███       | 67785/223549 [03:36<08:03, 322.18it/s][A
Converting docs to features:  30%|███       | 67818/223549 [03:36<08:01, 323.62it/s][A
Converting docs to features:  30%|███       | 67851/223549 [03:36<08:40, 299.24it/s][A
Converting docs to features:  30%|███       | 67882/223549 [03:36<08:40, 298.83it/s][A
Converting docs to features:  30%|███       | 67913/223549 [03:37<08:57, 289.82it/s][A
Converting docs to features:  30%|███       | 67943/223549 [03:37<09:02, 286.72it/s][A
Converting docs to features:  30

Converting docs to features:  32%|███▏      | 71016/223549 [03:46<07:45, 327.50it/s][A
Converting docs to features:  32%|███▏      | 71050/223549 [03:46<09:02, 281.26it/s][A
Converting docs to features:  32%|███▏      | 71081/223549 [03:46<09:12, 275.85it/s][A
Converting docs to features:  32%|███▏      | 71124/223549 [03:46<08:16, 307.11it/s][A
Converting docs to features:  32%|███▏      | 71158/223549 [03:46<08:02, 316.07it/s][A
Converting docs to features:  32%|███▏      | 71199/223549 [03:47<07:30, 338.29it/s][A
Converting docs to features:  32%|███▏      | 71235/223549 [03:47<07:22, 344.01it/s][A
Converting docs to features:  32%|███▏      | 71272/223549 [03:47<07:17, 347.85it/s][A
Converting docs to features:  32%|███▏      | 71308/223549 [03:47<07:30, 337.98it/s][A
Converting docs to features:  32%|███▏      | 71359/223549 [03:47<06:46, 374.25it/s][A
Converting docs to features:  32%|███▏      | 71399/223549 [03:47<06:49, 371.28it/s][A
Converting docs to features:  32

Converting docs to features:  33%|███▎      | 74506/223549 [03:57<08:17, 299.31it/s][A
Converting docs to features:  33%|███▎      | 74538/223549 [03:57<08:34, 289.44it/s][A
Converting docs to features:  33%|███▎      | 74576/223549 [03:57<08:02, 308.60it/s][A
Converting docs to features:  33%|███▎      | 74612/223549 [03:57<07:47, 318.37it/s][A
Converting docs to features:  33%|███▎      | 74645/223549 [03:57<07:58, 311.45it/s][A
Converting docs to features:  33%|███▎      | 74677/223549 [03:57<08:28, 292.54it/s][A
Converting docs to features:  33%|███▎      | 74714/223549 [03:57<07:58, 310.86it/s][A
Converting docs to features:  33%|███▎      | 74746/223549 [03:57<08:07, 305.16it/s][A
Converting docs to features:  33%|███▎      | 74778/223549 [03:57<08:26, 293.91it/s][A
Converting docs to features:  33%|███▎      | 74808/223549 [03:58<08:23, 295.14it/s][A
Converting docs to features:  33%|███▎      | 74838/223549 [03:58<08:43, 284.02it/s][A
Converting docs to features:  33

Converting docs to features:  35%|███▍      | 77856/223549 [04:07<08:54, 272.43it/s][A
Converting docs to features:  35%|███▍      | 77909/223549 [04:07<07:38, 317.86it/s][A
Converting docs to features:  35%|███▍      | 77945/223549 [04:08<07:38, 317.35it/s][A
Converting docs to features:  35%|███▍      | 77989/223549 [04:08<07:04, 342.71it/s][A
Converting docs to features:  35%|███▍      | 78026/223549 [04:08<07:57, 304.62it/s][A
Converting docs to features:  35%|███▍      | 78059/223549 [04:08<08:05, 299.64it/s][A
Converting docs to features:  35%|███▍      | 78095/223549 [04:08<07:42, 314.50it/s][A
Converting docs to features:  35%|███▍      | 78132/223549 [04:08<07:22, 328.64it/s][A
Converting docs to features:  35%|███▍      | 78167/223549 [04:08<07:34, 319.57it/s][A
Converting docs to features:  35%|███▍      | 78202/223549 [04:08<07:25, 326.35it/s][A
Converting docs to features:  35%|███▍      | 78238/223549 [04:08<07:23, 327.63it/s][A
Converting docs to features:  35

Converting docs to features:  36%|███▋      | 81339/223549 [04:18<07:03, 335.80it/s][A
Converting docs to features:  36%|███▋      | 81379/223549 [04:18<06:44, 351.52it/s][A
Converting docs to features:  36%|███▋      | 81415/223549 [04:18<06:53, 343.95it/s][A
Converting docs to features:  36%|███▋      | 81457/223549 [04:18<06:38, 356.96it/s][A
Converting docs to features:  36%|███▋      | 81494/223549 [04:18<06:38, 356.33it/s][A
Converting docs to features:  36%|███▋      | 81531/223549 [04:18<06:38, 356.57it/s][A
Converting docs to features:  36%|███▋      | 81576/223549 [04:18<06:13, 379.82it/s][A
Converting docs to features:  37%|███▋      | 81620/223549 [04:19<05:58, 395.93it/s][A
Converting docs to features:  37%|███▋      | 81661/223549 [04:19<07:20, 322.10it/s][A
Converting docs to features:  37%|███▋      | 81699/223549 [04:19<07:03, 335.15it/s][A
Converting docs to features:  37%|███▋      | 81735/223549 [04:19<07:01, 336.31it/s][A
Converting docs to features:  37

Converting docs to features:  38%|███▊      | 84777/223549 [04:28<07:41, 300.55it/s][A
Converting docs to features:  38%|███▊      | 84818/223549 [04:29<07:12, 320.70it/s][A
Converting docs to features:  38%|███▊      | 84853/223549 [04:29<07:02, 328.66it/s][A
Converting docs to features:  38%|███▊      | 84892/223549 [04:29<06:45, 342.29it/s][A
Converting docs to features:  38%|███▊      | 84936/223549 [04:29<06:20, 364.24it/s][A
Converting docs to features:  38%|███▊      | 84974/223549 [04:29<06:28, 356.41it/s][A
Converting docs to features:  38%|███▊      | 85020/223549 [04:29<06:02, 381.82it/s][A
Converting docs to features:  38%|███▊      | 85063/223549 [04:29<06:14, 369.78it/s][A
Converting docs to features:  38%|███▊      | 85105/223549 [04:29<06:01, 382.85it/s][A
Converting docs to features:  38%|███▊      | 85150/223549 [04:29<05:45, 400.62it/s][A
Converting docs to features:  38%|███▊      | 85191/223549 [04:30<06:44, 342.47it/s][A
Converting docs to features:  38

Converting docs to features:  39%|███▉      | 88247/223549 [04:39<06:39, 338.78it/s][A
Converting docs to features:  39%|███▉      | 88284/223549 [04:39<06:30, 346.66it/s][A
Converting docs to features:  40%|███▉      | 88319/223549 [04:39<07:12, 312.32it/s][A
Converting docs to features:  40%|███▉      | 88352/223549 [04:39<07:29, 300.76it/s][A
Converting docs to features:  40%|███▉      | 88383/223549 [04:39<08:08, 276.86it/s][A
Converting docs to features:  40%|███▉      | 88412/223549 [04:39<08:47, 256.31it/s][A
Converting docs to features:  40%|███▉      | 88439/223549 [04:40<09:47, 229.82it/s][A
Converting docs to features:  40%|███▉      | 88475/223549 [04:40<08:53, 253.02it/s][A
Converting docs to features:  40%|███▉      | 88515/223549 [04:40<07:55, 284.05it/s][A
Converting docs to features:  40%|███▉      | 88555/223549 [04:40<07:14, 310.34it/s][A
Converting docs to features:  40%|███▉      | 88589/223549 [04:40<07:05, 317.01it/s][A
Converting docs to features:  40

Converting docs to features:  41%|████      | 91632/223549 [04:49<06:59, 314.79it/s][A
Converting docs to features:  41%|████      | 91665/223549 [04:49<07:24, 296.90it/s][A
Converting docs to features:  41%|████      | 91705/223549 [04:50<06:50, 321.41it/s][A
Converting docs to features:  41%|████      | 91746/223549 [04:50<06:24, 342.63it/s][A
Converting docs to features:  41%|████      | 91782/223549 [04:50<06:28, 339.10it/s][A
Converting docs to features:  41%|████      | 91818/223549 [04:50<06:24, 342.60it/s][A
Converting docs to features:  41%|████      | 91854/223549 [04:50<06:20, 346.00it/s][A
Converting docs to features:  41%|████      | 91890/223549 [04:50<07:05, 309.09it/s][A
Converting docs to features:  41%|████      | 91922/223549 [04:50<07:44, 283.20it/s][A
Converting docs to features:  41%|████      | 91957/223549 [04:50<07:28, 293.43it/s][A
Converting docs to features:  41%|████      | 91988/223549 [04:50<07:28, 293.55it/s][A
Converting docs to features:  41

Converting docs to features:  43%|████▎     | 95092/223549 [05:00<07:31, 284.25it/s][A
Converting docs to features:  43%|████▎     | 95129/223549 [05:00<07:00, 305.39it/s][A
Converting docs to features:  43%|████▎     | 95169/223549 [05:00<06:33, 326.64it/s][A
Converting docs to features:  43%|████▎     | 95204/223549 [05:00<06:40, 320.22it/s][A
Converting docs to features:  43%|████▎     | 95240/223549 [05:00<06:28, 330.66it/s][A
Converting docs to features:  43%|████▎     | 95274/223549 [05:01<07:32, 283.58it/s][A
Converting docs to features:  43%|████▎     | 95314/223549 [05:01<06:53, 309.81it/s][A
Converting docs to features:  43%|████▎     | 95363/223549 [05:01<06:09, 346.46it/s][A
Converting docs to features:  43%|████▎     | 95401/223549 [05:01<06:50, 312.10it/s][A
Converting docs to features:  43%|████▎     | 95439/223549 [05:01<06:32, 326.57it/s][A
Converting docs to features:  43%|████▎     | 95474/223549 [05:01<06:34, 324.95it/s][A
Converting docs to features:  43

Converting docs to features:  44%|████▍     | 98601/223549 [05:11<05:46, 360.67it/s][A
Converting docs to features:  44%|████▍     | 98644/223549 [05:11<05:32, 375.98it/s][A
Converting docs to features:  44%|████▍     | 98683/223549 [05:11<05:51, 355.54it/s][A
Converting docs to features:  44%|████▍     | 98720/223549 [05:11<05:52, 354.41it/s][A
Converting docs to features:  44%|████▍     | 98757/223549 [05:11<06:23, 325.01it/s][A
Converting docs to features:  44%|████▍     | 98795/223549 [05:11<06:08, 338.51it/s][A
Converting docs to features:  44%|████▍     | 98830/223549 [05:11<06:10, 336.65it/s][A
Converting docs to features:  44%|████▍     | 98867/223549 [05:11<06:04, 342.52it/s][A
Converting docs to features:  44%|████▍     | 98905/223549 [05:11<05:53, 352.80it/s][A
Converting docs to features:  44%|████▍     | 98941/223549 [05:12<06:12, 334.84it/s][A
Converting docs to features:  44%|████▍     | 98975/223549 [05:12<06:39, 311.85it/s][A
Converting docs to features:  44

Converting docs to features:  46%|████▌     | 101951/223549 [05:21<07:30, 269.91it/s][A
Converting docs to features:  46%|████▌     | 101984/223549 [05:21<07:09, 282.72it/s][A
Converting docs to features:  46%|████▌     | 102030/223549 [05:21<06:23, 316.91it/s][A
Converting docs to features:  46%|████▌     | 102064/223549 [05:21<06:54, 293.37it/s][A
Converting docs to features:  46%|████▌     | 102095/223549 [05:22<07:24, 273.04it/s][A
Converting docs to features:  46%|████▌     | 102128/223549 [05:22<07:06, 284.59it/s][A
Converting docs to features:  46%|████▌     | 102168/223549 [05:22<06:30, 310.95it/s][A
Converting docs to features:  46%|████▌     | 102202/223549 [05:22<06:24, 315.40it/s][A
Converting docs to features:  46%|████▌     | 102251/223549 [05:22<05:43, 352.84it/s][A
Converting docs to features:  46%|████▌     | 102289/223549 [05:22<05:57, 338.97it/s][A
Converting docs to features:  46%|████▌     | 102325/223549 [05:22<06:03, 333.59it/s][A
Converting docs to fe

Converting docs to features:  47%|████▋     | 105298/223549 [05:32<06:27, 304.90it/s][A
Converting docs to features:  47%|████▋     | 105336/223549 [05:32<06:06, 322.98it/s][A
Converting docs to features:  47%|████▋     | 105370/223549 [05:32<06:42, 293.85it/s][A
Converting docs to features:  47%|████▋     | 105423/223549 [05:32<05:58, 329.25it/s][A
Converting docs to features:  47%|████▋     | 105461/223549 [05:32<05:45, 341.98it/s][A
Converting docs to features:  47%|████▋     | 105505/223549 [05:32<05:47, 340.13it/s][A
Converting docs to features:  47%|████▋     | 105541/223549 [05:32<05:42, 344.22it/s][A
Converting docs to features:  47%|████▋     | 105577/223549 [05:32<05:47, 339.46it/s][A
Converting docs to features:  47%|████▋     | 105612/223549 [05:32<05:55, 332.20it/s][A
Converting docs to features:  47%|████▋     | 105652/223549 [05:33<05:37, 349.37it/s][A
Converting docs to features:  47%|████▋     | 105688/223549 [05:33<06:08, 319.46it/s][A
Converting docs to fe

Converting docs to features:  49%|████▊     | 108639/223549 [05:42<06:03, 316.09it/s][A
Converting docs to features:  49%|████▊     | 108672/223549 [05:42<05:59, 319.61it/s][A
Converting docs to features:  49%|████▊     | 108710/223549 [05:42<05:42, 335.17it/s][A
Converting docs to features:  49%|████▊     | 108745/223549 [05:42<05:45, 332.57it/s][A
Converting docs to features:  49%|████▊     | 108781/223549 [05:42<05:38, 338.71it/s][A
Converting docs to features:  49%|████▊     | 108816/223549 [05:43<05:57, 320.55it/s][A
Converting docs to features:  49%|████▊     | 108849/223549 [05:43<06:32, 292.54it/s][A
Converting docs to features:  49%|████▊     | 108880/223549 [05:43<07:05, 269.22it/s][A
Converting docs to features:  49%|████▊     | 108924/223549 [05:43<06:16, 304.36it/s][A
Converting docs to features:  49%|████▊     | 108961/223549 [05:43<06:21, 300.00it/s][A
Converting docs to features:  49%|████▉     | 109004/223549 [05:43<05:47, 329.78it/s][A
Converting docs to fe

Converting docs to features:  50%|█████     | 111942/223549 [05:52<05:56, 313.29it/s][A
Converting docs to features:  50%|█████     | 111974/223549 [05:52<05:58, 311.14it/s][A
Converting docs to features:  50%|█████     | 112006/223549 [05:53<06:02, 307.89it/s][A
Converting docs to features:  50%|█████     | 112050/223549 [05:53<05:29, 338.07it/s][A
Converting docs to features:  50%|█████     | 112097/223549 [05:53<05:03, 366.77it/s][A
Converting docs to features:  50%|█████     | 112136/223549 [05:53<05:09, 360.19it/s][A
Converting docs to features:  50%|█████     | 112174/223549 [05:53<05:13, 354.78it/s][A
Converting docs to features:  50%|█████     | 112211/223549 [05:53<05:18, 350.04it/s][A
Converting docs to features:  50%|█████     | 112247/223549 [05:53<06:22, 290.82it/s][A
Converting docs to features:  50%|█████     | 112281/223549 [05:53<06:07, 303.03it/s][A
Converting docs to features:  50%|█████     | 112313/223549 [05:53<06:21, 291.73it/s][A
Converting docs to fe

Converting docs to features:  52%|█████▏    | 115503/223549 [06:03<05:48, 310.00it/s][A
Converting docs to features:  52%|█████▏    | 115536/223549 [06:03<05:45, 312.77it/s][A
Converting docs to features:  52%|█████▏    | 115572/223549 [06:03<05:33, 323.56it/s][A
Converting docs to features:  52%|█████▏    | 115606/223549 [06:03<05:33, 323.26it/s][A
Converting docs to features:  52%|█████▏    | 115639/223549 [06:03<06:02, 297.58it/s][A
Converting docs to features:  52%|█████▏    | 115680/223549 [06:03<05:32, 323.93it/s][A
Converting docs to features:  52%|█████▏    | 115717/223549 [06:03<05:28, 328.70it/s][A
Converting docs to features:  52%|█████▏    | 115753/223549 [06:03<05:19, 337.16it/s][A
Converting docs to features:  52%|█████▏    | 115788/223549 [06:04<05:20, 335.84it/s][A
Converting docs to features:  52%|█████▏    | 115823/223549 [06:04<05:31, 325.25it/s][A
Converting docs to features:  52%|█████▏    | 115867/223549 [06:04<05:22, 334.16it/s][A
Converting docs to fe

Converting docs to features:  53%|█████▎    | 118946/223549 [06:13<05:18, 328.13it/s][A
Converting docs to features:  53%|█████▎    | 118980/223549 [06:13<05:27, 319.59it/s][A
Converting docs to features:  53%|█████▎    | 119014/223549 [06:13<05:21, 325.41it/s][A
Converting docs to features:  53%|█████▎    | 119056/223549 [06:13<05:00, 347.98it/s][A
Converting docs to features:  53%|█████▎    | 119093/223549 [06:14<05:12, 333.99it/s][A
Converting docs to features:  53%|█████▎    | 119128/223549 [06:14<05:24, 321.33it/s][A
Converting docs to features:  53%|█████▎    | 119161/223549 [06:14<05:40, 306.96it/s][A
Converting docs to features:  53%|█████▎    | 119195/223549 [06:14<05:33, 312.81it/s][A
Converting docs to features:  53%|█████▎    | 119227/223549 [06:14<05:48, 299.36it/s][A
Converting docs to features:  53%|█████▎    | 119266/223549 [06:14<05:24, 321.29it/s][A
Converting docs to features:  53%|█████▎    | 119303/223549 [06:14<05:12, 333.95it/s][A
Converting docs to fe

Converting docs to features:  55%|█████▍    | 122337/223549 [06:23<04:42, 358.00it/s][A
Converting docs to features:  55%|█████▍    | 122374/223549 [06:23<04:44, 355.01it/s][A
Converting docs to features:  55%|█████▍    | 122411/223549 [06:24<04:42, 358.34it/s][A
Converting docs to features:  55%|█████▍    | 122448/223549 [06:24<04:42, 358.32it/s][A
Converting docs to features:  55%|█████▍    | 122485/223549 [06:24<05:10, 325.02it/s][A
Converting docs to features:  55%|█████▍    | 122519/223549 [06:24<05:28, 307.91it/s][A
Converting docs to features:  55%|█████▍    | 122560/223549 [06:24<05:06, 329.93it/s][A
Converting docs to features:  55%|█████▍    | 122596/223549 [06:24<05:08, 327.76it/s][A
Converting docs to features:  55%|█████▍    | 122630/223549 [06:24<05:10, 325.28it/s][A
Converting docs to features:  55%|█████▍    | 122664/223549 [06:24<05:11, 324.15it/s][A
Converting docs to features:  55%|█████▍    | 122704/223549 [06:24<04:59, 336.39it/s][A
Converting docs to fe

Converting docs to features:  56%|█████▋    | 125798/223549 [06:34<04:54, 331.75it/s][A
Converting docs to features:  56%|█████▋    | 125845/223549 [06:34<04:29, 362.62it/s][A
Converting docs to features:  56%|█████▋    | 125887/223549 [06:34<04:24, 369.85it/s][A
Converting docs to features:  56%|█████▋    | 125929/223549 [06:34<04:25, 367.08it/s][A
Converting docs to features:  56%|█████▋    | 125975/223549 [06:34<04:09, 390.34it/s][A
Converting docs to features:  56%|█████▋    | 126015/223549 [06:34<04:17, 379.13it/s][A
Converting docs to features:  56%|█████▋    | 126054/223549 [06:34<04:29, 362.36it/s][A
Converting docs to features:  56%|█████▋    | 126091/223549 [06:35<04:41, 346.41it/s][A
Converting docs to features:  56%|█████▋    | 126127/223549 [06:35<04:59, 325.20it/s][A
Converting docs to features:  56%|█████▋    | 126163/223549 [06:35<04:52, 332.94it/s][A
Converting docs to features:  56%|█████▋    | 126199/223549 [06:35<04:47, 339.06it/s][A
Converting docs to fe

Converting docs to features:  58%|█████▊    | 129266/223549 [06:44<04:17, 365.82it/s][A
Converting docs to features:  58%|█████▊    | 129304/223549 [06:45<04:21, 359.86it/s][A
Converting docs to features:  58%|█████▊    | 129341/223549 [06:45<04:34, 342.89it/s][A
Converting docs to features:  58%|█████▊    | 129376/223549 [06:45<04:36, 340.93it/s][A
Converting docs to features:  58%|█████▊    | 129411/223549 [06:45<05:15, 298.28it/s][A
Converting docs to features:  58%|█████▊    | 129442/223549 [06:45<05:22, 292.23it/s][A
Converting docs to features:  58%|█████▊    | 129473/223549 [06:45<05:33, 281.84it/s][A
Converting docs to features:  58%|█████▊    | 129517/223549 [06:45<04:58, 315.40it/s][A
Converting docs to features:  58%|█████▊    | 129551/223549 [06:45<05:30, 284.36it/s][A
Converting docs to features:  58%|█████▊    | 129587/223549 [06:45<05:13, 299.80it/s][A
Converting docs to features:  58%|█████▊    | 129619/223549 [06:46<05:18, 295.20it/s][A
Converting docs to fe

Converting docs to features:  59%|█████▉    | 132714/223549 [06:55<04:27, 339.84it/s][A
Converting docs to features:  59%|█████▉    | 132749/223549 [06:55<04:28, 338.44it/s][A
Converting docs to features:  59%|█████▉    | 132785/223549 [06:55<04:23, 344.20it/s][A
Converting docs to features:  59%|█████▉    | 132820/223549 [06:55<05:14, 288.39it/s][A
Converting docs to features:  59%|█████▉    | 132854/223549 [06:56<05:02, 300.04it/s][A
Converting docs to features:  59%|█████▉    | 132897/223549 [06:56<04:37, 327.10it/s][A
Converting docs to features:  59%|█████▉    | 132941/223549 [06:56<04:20, 347.66it/s][A
Converting docs to features:  59%|█████▉    | 132978/223549 [06:56<04:38, 325.58it/s][A
Converting docs to features:  60%|█████▉    | 133018/223549 [06:56<04:29, 335.80it/s][A
Converting docs to features:  60%|█████▉    | 133053/223549 [06:56<04:30, 334.73it/s][A
Converting docs to features:  60%|█████▉    | 133099/223549 [06:56<04:09, 362.30it/s][A
Converting docs to fe

Converting docs to features:  61%|██████    | 136114/223549 [07:05<04:21, 334.06it/s][A
Converting docs to features:  61%|██████    | 136149/223549 [07:06<04:44, 307.31it/s][A
Converting docs to features:  61%|██████    | 136181/223549 [07:06<04:42, 309.36it/s][A
Converting docs to features:  61%|██████    | 136216/223549 [07:06<04:36, 315.53it/s][A
Converting docs to features:  61%|██████    | 136251/223549 [07:06<04:30, 323.14it/s][A
Converting docs to features:  61%|██████    | 136284/223549 [07:06<04:39, 312.61it/s][A
Converting docs to features:  61%|██████    | 136316/223549 [07:06<04:48, 302.12it/s][A
Converting docs to features:  61%|██████    | 136358/223549 [07:06<04:25, 328.99it/s][A
Converting docs to features:  61%|██████    | 136392/223549 [07:06<05:04, 286.14it/s][A
Converting docs to features:  61%|██████    | 136425/223549 [07:06<04:53, 296.77it/s][A
Converting docs to features:  61%|██████    | 136460/223549 [07:07<04:41, 309.85it/s][A
Converting docs to fe

Converting docs to features:  62%|██████▏   | 139402/223549 [07:16<04:03, 345.25it/s][A
Converting docs to features:  62%|██████▏   | 139447/223549 [07:16<03:50, 364.91it/s][A
Converting docs to features:  62%|██████▏   | 139485/223549 [07:16<04:11, 333.75it/s][A
Converting docs to features:  62%|██████▏   | 139520/223549 [07:16<04:08, 337.87it/s][A
Converting docs to features:  62%|██████▏   | 139555/223549 [07:16<04:12, 332.44it/s][A
Converting docs to features:  62%|██████▏   | 139589/223549 [07:16<04:12, 331.99it/s][A
Converting docs to features:  62%|██████▏   | 139623/223549 [07:17<04:33, 306.58it/s][A
Converting docs to features:  62%|██████▏   | 139663/223549 [07:17<04:14, 329.29it/s][A
Converting docs to features:  62%|██████▏   | 139697/223549 [07:17<04:14, 329.57it/s][A
Converting docs to features:  63%|██████▎   | 139731/223549 [07:17<04:13, 330.44it/s][A
Converting docs to features:  63%|██████▎   | 139774/223549 [07:17<04:02, 345.56it/s][A
Converting docs to fe

Converting docs to features:  64%|██████▍   | 142833/223549 [07:26<04:48, 280.20it/s][A
Converting docs to features:  64%|██████▍   | 142867/223549 [07:26<04:36, 292.10it/s][A
Converting docs to features:  64%|██████▍   | 142906/223549 [07:26<04:15, 315.69it/s][A
Converting docs to features:  64%|██████▍   | 142942/223549 [07:27<04:06, 326.57it/s][A
Converting docs to features:  64%|██████▍   | 142984/223549 [07:27<03:50, 349.81it/s][A
Converting docs to features:  64%|██████▍   | 143021/223549 [07:27<04:05, 327.96it/s][A
Converting docs to features:  64%|██████▍   | 143055/223549 [07:27<04:09, 322.95it/s][A
Converting docs to features:  64%|██████▍   | 143089/223549 [07:27<04:25, 303.18it/s][A
Converting docs to features:  64%|██████▍   | 143129/223549 [07:27<04:06, 326.38it/s][A
Converting docs to features:  64%|██████▍   | 143163/223549 [07:27<04:21, 307.05it/s][A
Converting docs to features:  64%|██████▍   | 143195/223549 [07:27<04:40, 286.93it/s][A
Converting docs to fe

Converting docs to features:  65%|██████▌   | 146203/223549 [07:37<04:11, 307.85it/s][A
Converting docs to features:  65%|██████▌   | 146235/223549 [07:37<04:09, 309.56it/s][A
Converting docs to features:  65%|██████▌   | 146267/223549 [07:37<04:58, 259.24it/s][A
Converting docs to features:  65%|██████▌   | 146295/223549 [07:37<04:55, 261.46it/s][A
Converting docs to features:  65%|██████▌   | 146323/223549 [07:37<05:06, 251.66it/s][A
Converting docs to features:  65%|██████▌   | 146368/223549 [07:37<04:28, 287.75it/s][A
Converting docs to features:  65%|██████▌   | 146400/223549 [07:37<05:13, 245.94it/s][A
Converting docs to features:  66%|██████▌   | 146429/223549 [07:37<05:01, 255.84it/s][A
Converting docs to features:  66%|██████▌   | 146465/223549 [07:38<04:37, 278.11it/s][A
Converting docs to features:  66%|██████▌   | 146498/223549 [07:38<04:24, 291.69it/s][A
Converting docs to features:  66%|██████▌   | 146536/223549 [07:38<04:05, 313.23it/s][A
Converting docs to fe

Converting docs to features:  67%|██████▋   | 149459/223549 [07:47<03:43, 331.19it/s][A
Converting docs to features:  67%|██████▋   | 149493/223549 [07:47<03:41, 333.60it/s][A
Converting docs to features:  67%|██████▋   | 149527/223549 [07:47<03:52, 317.74it/s][A
Converting docs to features:  67%|██████▋   | 149560/223549 [07:47<04:08, 297.71it/s][A
Converting docs to features:  67%|██████▋   | 149591/223549 [07:48<04:15, 289.20it/s][A
Converting docs to features:  67%|██████▋   | 149628/223549 [07:48<03:59, 308.22it/s][A
Converting docs to features:  67%|██████▋   | 149672/223549 [07:48<03:39, 337.33it/s][A
Converting docs to features:  67%|██████▋   | 149708/223549 [07:48<04:04, 302.54it/s][A
Converting docs to features:  67%|██████▋   | 149749/223549 [07:48<03:45, 327.11it/s][A
Converting docs to features:  67%|██████▋   | 149791/223549 [07:48<03:35, 342.54it/s][A
Converting docs to features:  67%|██████▋   | 149827/223549 [07:48<04:04, 301.62it/s][A
Converting docs to fe

Converting docs to features:  68%|██████▊   | 152881/223549 [07:58<03:16, 359.34it/s][A
Converting docs to features:  68%|██████▊   | 152918/223549 [07:58<03:33, 330.44it/s][A
Converting docs to features:  68%|██████▊   | 152952/223549 [07:58<03:39, 321.56it/s][A
Converting docs to features:  68%|██████▊   | 152985/223549 [07:58<04:11, 281.02it/s][A
Converting docs to features:  68%|██████▊   | 153015/223549 [07:58<04:07, 285.25it/s][A
Converting docs to features:  68%|██████▊   | 153052/223549 [07:58<03:50, 305.34it/s][A
Converting docs to features:  68%|██████▊   | 153098/223549 [07:58<03:27, 339.52it/s][A
Converting docs to features:  69%|██████▊   | 153140/223549 [07:59<03:20, 350.65it/s][A
Converting docs to features:  69%|██████▊   | 153181/223549 [07:59<03:12, 366.46it/s][A
Converting docs to features:  69%|██████▊   | 153219/223549 [07:59<03:13, 363.00it/s][A
Converting docs to features:  69%|██████▊   | 153257/223549 [07:59<03:32, 331.27it/s][A
Converting docs to fe

Converting docs to features:  70%|██████▉   | 156351/223549 [08:08<03:25, 327.29it/s][A
Converting docs to features:  70%|██████▉   | 156385/223549 [08:08<03:32, 315.43it/s][A
Converting docs to features:  70%|██████▉   | 156426/223549 [08:08<03:18, 338.84it/s][A
Converting docs to features:  70%|██████▉   | 156462/223549 [08:08<03:35, 311.83it/s][A
Converting docs to features:  70%|███████   | 156495/223549 [08:09<04:07, 271.45it/s][A
Converting docs to features:  70%|███████   | 156534/223549 [08:09<03:48, 293.84it/s][A
Converting docs to features:  70%|███████   | 156568/223549 [08:09<03:38, 306.11it/s][A
Converting docs to features:  70%|███████   | 156613/223549 [08:09<03:18, 336.83it/s][A
Converting docs to features:  70%|███████   | 156657/223549 [08:09<03:05, 360.32it/s][A
Converting docs to features:  70%|███████   | 156695/223549 [08:09<03:27, 321.75it/s][A
Converting docs to features:  70%|███████   | 156730/223549 [08:09<03:34, 311.49it/s][A
Converting docs to fe

Converting docs to features:  71%|███████▏  | 159730/223549 [08:22<03:38, 292.08it/s][A
Converting docs to features:  71%|███████▏  | 159773/223549 [08:22<03:17, 322.31it/s][A
Converting docs to features:  71%|███████▏  | 159807/223549 [08:22<03:34, 296.81it/s][A
Converting docs to features:  72%|███████▏  | 159843/223549 [08:22<03:23, 313.00it/s][A
Converting docs to features:  72%|███████▏  | 159880/223549 [08:22<03:14, 327.57it/s][A
Converting docs to features:  72%|███████▏  | 159914/223549 [08:22<03:20, 317.01it/s][A
Converting docs to features:  72%|███████▏  | 159951/223549 [08:23<03:12, 330.71it/s][A
Converting docs to features:  72%|███████▏  | 159985/223549 [08:23<03:25, 309.02it/s][A
Converting docs to features:  72%|███████▏  | 160017/223549 [08:23<03:43, 284.59it/s][A
Converting docs to features:  72%|███████▏  | 160047/223549 [08:23<03:47, 279.57it/s][A
Converting docs to features:  72%|███████▏  | 160078/223549 [08:23<03:40, 287.40it/s][A
Converting docs to fe

Converting docs to features:  73%|███████▎  | 163180/223549 [08:32<03:42, 270.83it/s][A
Converting docs to features:  73%|███████▎  | 163213/223549 [08:32<03:48, 264.63it/s][A
Converting docs to features:  73%|███████▎  | 163255/223549 [08:33<03:23, 296.45it/s][A
Converting docs to features:  73%|███████▎  | 163291/223549 [08:33<03:16, 307.11it/s][A
Converting docs to features:  73%|███████▎  | 163324/223549 [08:33<03:14, 309.77it/s][A
Converting docs to features:  73%|███████▎  | 163359/223549 [08:33<03:07, 320.38it/s][A
Converting docs to features:  73%|███████▎  | 163403/223549 [08:33<02:53, 347.37it/s][A
Converting docs to features:  73%|███████▎  | 163440/223549 [08:33<02:53, 347.04it/s][A
Converting docs to features:  73%|███████▎  | 163485/223549 [08:33<02:41, 371.51it/s][A
Converting docs to features:  73%|███████▎  | 163524/223549 [08:33<03:10, 314.63it/s][A
Converting docs to features:  73%|███████▎  | 163560/223549 [08:33<03:05, 323.84it/s][A
Converting docs to fe

Converting docs to features:  74%|███████▍  | 166522/223549 [08:43<02:55, 325.64it/s][A
Converting docs to features:  75%|███████▍  | 166556/223549 [08:43<02:58, 319.52it/s][A
Converting docs to features:  75%|███████▍  | 166596/223549 [08:43<02:47, 339.03it/s][A
Converting docs to features:  75%|███████▍  | 166632/223549 [08:43<02:57, 320.02it/s][A
Converting docs to features:  75%|███████▍  | 166671/223549 [08:43<02:49, 335.85it/s][A
Converting docs to features:  75%|███████▍  | 166706/223549 [08:43<02:51, 330.86it/s][A
Converting docs to features:  75%|███████▍  | 166740/223549 [08:43<03:05, 305.66it/s][A
Converting docs to features:  75%|███████▍  | 166786/223549 [08:44<02:47, 339.48it/s][A
Converting docs to features:  75%|███████▍  | 166837/223549 [08:44<02:31, 374.91it/s][A
Converting docs to features:  75%|███████▍  | 166877/223549 [08:44<02:33, 368.37it/s][A
Converting docs to features:  75%|███████▍  | 166916/223549 [08:44<02:32, 370.53it/s][A
Converting docs to fe

Converting docs to features:  76%|███████▌  | 170071/223549 [08:53<02:35, 343.43it/s][A
Converting docs to features:  76%|███████▌  | 170109/223549 [08:53<02:31, 353.22it/s][A
Converting docs to features:  76%|███████▌  | 170155/223549 [08:53<02:20, 378.97it/s][A
Converting docs to features:  76%|███████▌  | 170194/223549 [08:54<02:32, 350.29it/s][A
Converting docs to features:  76%|███████▌  | 170231/223549 [08:54<02:46, 320.43it/s][A
Converting docs to features:  76%|███████▌  | 170265/223549 [08:54<02:44, 323.54it/s][A
Converting docs to features:  76%|███████▌  | 170303/223549 [08:54<02:40, 331.24it/s][A
Converting docs to features:  76%|███████▌  | 170337/223549 [08:54<02:39, 333.34it/s][A
Converting docs to features:  76%|███████▌  | 170371/223549 [08:54<02:55, 302.94it/s][A
Converting docs to features:  76%|███████▌  | 170403/223549 [08:54<02:52, 307.29it/s][A
Converting docs to features:  76%|███████▌  | 170435/223549 [08:54<02:50, 310.90it/s][A
Converting docs to fe

Converting docs to features:  78%|███████▊  | 173456/223549 [09:03<02:41, 310.00it/s][A
Converting docs to features:  78%|███████▊  | 173492/223549 [09:04<02:36, 320.33it/s][A
Converting docs to features:  78%|███████▊  | 173535/223549 [09:04<02:25, 343.74it/s][A
Converting docs to features:  78%|███████▊  | 173571/223549 [09:04<02:27, 339.47it/s][A
Converting docs to features:  78%|███████▊  | 173606/223549 [09:04<02:32, 327.68it/s][A
Converting docs to features:  78%|███████▊  | 173640/223549 [09:04<02:32, 327.94it/s][A
Converting docs to features:  78%|███████▊  | 173674/223549 [09:04<02:35, 321.28it/s][A
Converting docs to features:  78%|███████▊  | 173707/223549 [09:04<02:38, 315.27it/s][A
Converting docs to features:  78%|███████▊  | 173739/223549 [09:04<02:42, 306.55it/s][A
Converting docs to features:  78%|███████▊  | 173773/223549 [09:04<02:37, 315.68it/s][A
Converting docs to features:  78%|███████▊  | 173805/223549 [09:05<02:43, 304.82it/s][A
Converting docs to fe

Converting docs to features:  79%|███████▉  | 176949/223549 [09:14<02:17, 338.66it/s][A
Converting docs to features:  79%|███████▉  | 176991/223549 [09:14<02:10, 357.20it/s][A
Converting docs to features:  79%|███████▉  | 177044/223549 [09:14<01:57, 395.87it/s][A
Converting docs to features:  79%|███████▉  | 177093/223549 [09:14<01:51, 417.29it/s][A
Converting docs to features:  79%|███████▉  | 177137/223549 [09:14<01:54, 404.69it/s][A
Converting docs to features:  79%|███████▉  | 177179/223549 [09:14<01:56, 399.41it/s][A
Converting docs to features:  79%|███████▉  | 177229/223549 [09:14<01:51, 414.41it/s][A
Converting docs to features:  79%|███████▉  | 177272/223549 [09:15<01:52, 409.55it/s][A
Converting docs to features:  79%|███████▉  | 177314/223549 [09:15<01:56, 397.88it/s][A
Converting docs to features:  79%|███████▉  | 177355/223549 [09:15<02:06, 365.87it/s][A
Converting docs to features:  79%|███████▉  | 177393/223549 [09:15<02:21, 326.36it/s][A
Converting docs to fe

Converting docs to features:  81%|████████  | 180453/223549 [09:24<02:33, 280.64it/s][A
Converting docs to features:  81%|████████  | 180483/223549 [09:24<02:34, 279.11it/s][A
Converting docs to features:  81%|████████  | 180512/223549 [09:25<02:42, 264.67it/s][A
Converting docs to features:  81%|████████  | 180546/223549 [09:25<02:35, 276.41it/s][A
Converting docs to features:  81%|████████  | 180575/223549 [09:25<02:34, 277.26it/s][A
Converting docs to features:  81%|████████  | 180618/223549 [09:25<02:18, 310.30it/s][A
Converting docs to features:  81%|████████  | 180662/223549 [09:25<02:06, 340.35it/s][A
Converting docs to features:  81%|████████  | 180712/223549 [09:25<01:53, 376.12it/s][A
Converting docs to features:  81%|████████  | 180753/223549 [09:25<01:55, 371.05it/s][A
Converting docs to features:  81%|████████  | 180797/223549 [09:25<01:50, 388.51it/s][A
Converting docs to features:  81%|████████  | 180838/223549 [09:25<01:53, 376.20it/s][A
Converting docs to fe

Converting docs to features:  82%|████████▏ | 183886/223549 [09:35<02:08, 309.43it/s][A
Converting docs to features:  82%|████████▏ | 183923/223549 [09:35<02:01, 324.94it/s][A
Converting docs to features:  82%|████████▏ | 183958/223549 [09:35<02:00, 328.82it/s][A
Converting docs to features:  82%|████████▏ | 183992/223549 [09:35<02:02, 322.45it/s][A
Converting docs to features:  82%|████████▏ | 184034/223549 [09:35<01:54, 346.49it/s][A
Converting docs to features:  82%|████████▏ | 184070/223549 [09:35<02:04, 318.05it/s][A
Converting docs to features:  82%|████████▏ | 184111/223549 [09:35<01:55, 340.75it/s][A
Converting docs to features:  82%|████████▏ | 184151/223549 [09:35<01:50, 355.88it/s][A
Converting docs to features:  82%|████████▏ | 184188/223549 [09:36<02:07, 309.18it/s][A
Converting docs to features:  82%|████████▏ | 184221/223549 [09:36<02:20, 279.49it/s][A
Converting docs to features:  82%|████████▏ | 184263/223549 [09:36<02:07, 308.92it/s][A
Converting docs to fe

Converting docs to features:  84%|████████▍ | 187340/223549 [09:45<01:47, 336.84it/s][A
Converting docs to features:  84%|████████▍ | 187375/223549 [09:45<01:48, 333.21it/s][A
Converting docs to features:  84%|████████▍ | 187412/223549 [09:45<01:45, 343.38it/s][A
Converting docs to features:  84%|████████▍ | 187449/223549 [09:45<01:43, 349.95it/s][A
Converting docs to features:  84%|████████▍ | 187485/223549 [09:46<01:50, 326.99it/s][A
Converting docs to features:  84%|████████▍ | 187519/223549 [09:46<02:00, 299.52it/s][A
Converting docs to features:  84%|████████▍ | 187557/223549 [09:46<01:53, 317.62it/s][A
Converting docs to features:  84%|████████▍ | 187592/223549 [09:46<01:50, 326.15it/s][A
Converting docs to features:  84%|████████▍ | 187636/223549 [09:46<01:41, 353.47it/s][A
Converting docs to features:  84%|████████▍ | 187673/223549 [09:46<01:44, 343.88it/s][A
Converting docs to features:  84%|████████▍ | 187726/223549 [09:46<01:34, 379.09it/s][A
Converting docs to fe

Converting docs to features:  85%|████████▌ | 190825/223549 [09:56<01:45, 309.53it/s][A
Converting docs to features:  85%|████████▌ | 190858/223549 [09:56<01:55, 282.34it/s][A
Converting docs to features:  85%|████████▌ | 190888/223549 [09:56<01:54, 284.99it/s][A
Converting docs to features:  85%|████████▌ | 190921/223549 [09:56<01:50, 294.34it/s][A
Converting docs to features:  85%|████████▌ | 190952/223549 [09:56<01:52, 290.33it/s][A
Converting docs to features:  85%|████████▌ | 190989/223549 [09:56<01:45, 309.74it/s][A
Converting docs to features:  85%|████████▌ | 191031/223549 [09:56<01:37, 335.05it/s][A
Converting docs to features:  85%|████████▌ | 191066/223549 [09:56<01:45, 308.07it/s][A
Converting docs to features:  85%|████████▌ | 191101/223549 [09:56<01:41, 319.27it/s][A
Converting docs to features:  86%|████████▌ | 191145/223549 [09:57<01:33, 347.68it/s][A
Converting docs to features:  86%|████████▌ | 191182/223549 [09:57<01:55, 280.16it/s][A
Converting docs to fe

Converting docs to features:  87%|████████▋ | 194310/223549 [10:06<01:24, 345.52it/s][A
Converting docs to features:  87%|████████▋ | 194350/223549 [10:06<01:21, 359.47it/s][A
Converting docs to features:  87%|████████▋ | 194393/223549 [10:06<01:17, 377.69it/s][A
Converting docs to features:  87%|████████▋ | 194432/223549 [10:06<01:16, 378.31it/s][A
Converting docs to features:  87%|████████▋ | 194471/223549 [10:06<01:20, 362.58it/s][A
Converting docs to features:  87%|████████▋ | 194508/223549 [10:07<01:22, 350.74it/s][A
Converting docs to features:  87%|████████▋ | 194544/223549 [10:07<01:28, 327.82it/s][A
Converting docs to features:  87%|████████▋ | 194578/223549 [10:07<01:30, 320.87it/s][A
Converting docs to features:  87%|████████▋ | 194611/223549 [10:07<01:32, 313.64it/s][A
Converting docs to features:  87%|████████▋ | 194646/223549 [10:07<01:29, 323.21it/s][A
Converting docs to features:  87%|████████▋ | 194679/223549 [10:07<01:37, 295.38it/s][A
Converting docs to fe

Converting docs to features:  88%|████████▊ | 197783/223549 [10:16<01:12, 355.71it/s][A
Converting docs to features:  88%|████████▊ | 197820/223549 [10:16<01:17, 333.26it/s][A
Converting docs to features:  89%|████████▊ | 197855/223549 [10:17<01:17, 329.68it/s][A
Converting docs to features:  89%|████████▊ | 197901/223549 [10:17<01:11, 356.36it/s][A
Converting docs to features:  89%|████████▊ | 197952/223549 [10:17<01:05, 389.97it/s][A
Converting docs to features:  89%|████████▊ | 197993/223549 [10:17<01:12, 354.59it/s][A
Converting docs to features:  89%|████████▊ | 198031/223549 [10:17<01:12, 354.39it/s][A
Converting docs to features:  89%|████████▊ | 198068/223549 [10:17<01:14, 341.78it/s][A
Converting docs to features:  89%|████████▊ | 198104/223549 [10:17<01:22, 309.23it/s][A
Converting docs to features:  89%|████████▊ | 198141/223549 [10:17<01:18, 324.89it/s][A
Converting docs to features:  89%|████████▊ | 198175/223549 [10:18<01:17, 328.03it/s][A
Converting docs to fe

Converting docs to features:  90%|█████████ | 201319/223549 [10:27<01:18, 284.71it/s][A
Converting docs to features:  90%|█████████ | 201349/223549 [10:27<01:20, 275.87it/s][A
Converting docs to features:  90%|█████████ | 201378/223549 [10:27<01:23, 266.95it/s][A
Converting docs to features:  90%|█████████ | 201425/223549 [10:27<01:12, 305.52it/s][A
Converting docs to features:  90%|█████████ | 201459/223549 [10:27<01:10, 314.35it/s][A
Converting docs to features:  90%|█████████ | 201494/223549 [10:28<01:08, 322.81it/s][A
Converting docs to features:  90%|█████████ | 201528/223549 [10:28<01:11, 306.09it/s][A
Converting docs to features:  90%|█████████ | 201561/223549 [10:28<01:10, 310.22it/s][A
Converting docs to features:  90%|█████████ | 201606/223549 [10:28<01:05, 334.05it/s][A
Converting docs to features:  90%|█████████ | 201641/223549 [10:28<01:07, 326.83it/s][A
Converting docs to features:  90%|█████████ | 201678/223549 [10:28<01:06, 330.42it/s][A
Converting docs to fe

Converting docs to features:  92%|█████████▏| 204813/223549 [10:37<00:56, 332.91it/s][A
Converting docs to features:  92%|█████████▏| 204848/223549 [10:38<00:58, 318.11it/s][A
Converting docs to features:  92%|█████████▏| 204889/223549 [10:38<00:55, 337.08it/s][A
Converting docs to features:  92%|█████████▏| 204924/223549 [10:38<00:57, 324.12it/s][A
Converting docs to features:  92%|█████████▏| 204965/223549 [10:38<00:55, 333.14it/s][A
Converting docs to features:  92%|█████████▏| 205010/223549 [10:38<00:51, 360.97it/s][A
Converting docs to features:  92%|█████████▏| 205053/223549 [10:38<00:48, 378.08it/s][A
Converting docs to features:  92%|█████████▏| 205092/223549 [10:38<00:50, 364.31it/s][A
Converting docs to features:  92%|█████████▏| 205131/223549 [10:38<00:49, 370.56it/s][A
Converting docs to features:  92%|█████████▏| 205174/223549 [10:38<00:47, 384.23it/s][A
Converting docs to features:  92%|█████████▏| 205213/223549 [10:38<00:49, 368.45it/s][A
Converting docs to fe

Converting docs to features:  93%|█████████▎| 208284/223549 [10:48<00:52, 292.11it/s][A
Converting docs to features:  93%|█████████▎| 208322/223549 [10:48<00:48, 313.89it/s][A
Converting docs to features:  93%|█████████▎| 208355/223549 [10:48<00:48, 310.49it/s][A
Converting docs to features:  93%|█████████▎| 208387/223549 [10:48<00:52, 286.92it/s][A
Converting docs to features:  93%|█████████▎| 208425/223549 [10:48<00:48, 309.05it/s][A
Converting docs to features:  93%|█████████▎| 208470/223549 [10:49<00:44, 338.83it/s][A
Converting docs to features:  93%|█████████▎| 208509/223549 [10:49<00:43, 343.67it/s][A
Converting docs to features:  93%|█████████▎| 208545/223549 [10:49<00:44, 336.33it/s][A
Converting docs to features:  93%|█████████▎| 208585/223549 [10:49<00:42, 352.37it/s][A
Converting docs to features:  93%|█████████▎| 208623/223549 [10:49<00:41, 358.20it/s][A
Converting docs to features:  93%|█████████▎| 208660/223549 [10:49<00:45, 324.44it/s][A
Converting docs to fe

Converting docs to features:  95%|█████████▍| 211663/223549 [10:59<00:43, 270.97it/s][A
Converting docs to features:  95%|█████████▍| 211700/223549 [10:59<00:40, 293.54it/s][A
Converting docs to features:  95%|█████████▍| 211759/223549 [10:59<00:34, 345.27it/s][A
Converting docs to features:  95%|█████████▍| 211800/223549 [10:59<00:34, 337.39it/s][A
Converting docs to features:  95%|█████████▍| 211838/223549 [10:59<00:36, 316.93it/s][A
Converting docs to features:  95%|█████████▍| 211873/223549 [10:59<00:40, 290.88it/s][A
Converting docs to features:  95%|█████████▍| 211915/223549 [10:59<00:36, 314.70it/s][A
Converting docs to features:  95%|█████████▍| 211949/223549 [10:59<00:37, 307.65it/s][A
Converting docs to features:  95%|█████████▍| 211982/223549 [11:00<00:39, 290.01it/s][A
Converting docs to features:  95%|█████████▍| 212013/223549 [11:00<00:39, 295.32it/s][A
Converting docs to features:  95%|█████████▍| 212059/223549 [11:00<00:34, 328.74it/s][A
Converting docs to fe

Converting docs to features:  96%|█████████▋| 215205/223549 [11:09<00:21, 393.12it/s][A
Converting docs to features:  96%|█████████▋| 215246/223549 [11:09<00:21, 383.02it/s][A
Converting docs to features:  96%|█████████▋| 215287/223549 [11:09<00:21, 386.40it/s][A
Converting docs to features:  96%|█████████▋| 215327/223549 [11:09<00:24, 341.69it/s][A
Converting docs to features:  96%|█████████▋| 215368/223549 [11:10<00:23, 347.93it/s][A
Converting docs to features:  96%|█████████▋| 215408/223549 [11:10<00:22, 359.77it/s][A
Converting docs to features:  96%|█████████▋| 215445/223549 [11:10<00:23, 345.46it/s][A
Converting docs to features:  96%|█████████▋| 215481/223549 [11:10<00:26, 309.25it/s][A
Converting docs to features:  96%|█████████▋| 215527/223549 [11:10<00:23, 342.47it/s][A
Converting docs to features:  96%|█████████▋| 215564/223549 [11:10<00:23, 333.82it/s][A
Converting docs to features:  96%|█████████▋| 215599/223549 [11:10<00:23, 334.33it/s][A
Converting docs to fe

Converting docs to features:  98%|█████████▊| 218626/223549 [11:19<00:14, 348.63it/s][A
Converting docs to features:  98%|█████████▊| 218662/223549 [11:20<00:14, 345.63it/s][A
Converting docs to features:  98%|█████████▊| 218702/223549 [11:20<00:13, 358.65it/s][A
Converting docs to features:  98%|█████████▊| 218739/223549 [11:20<00:14, 337.03it/s][A
Converting docs to features:  98%|█████████▊| 218774/223549 [11:20<00:14, 330.52it/s][A
Converting docs to features:  98%|█████████▊| 218808/223549 [11:20<00:16, 290.29it/s][A
Converting docs to features:  98%|█████████▊| 218842/223549 [11:20<00:15, 301.20it/s][A
Converting docs to features:  98%|█████████▊| 218874/223549 [11:20<00:15, 295.66it/s][A
Converting docs to features:  98%|█████████▊| 218905/223549 [11:20<00:15, 295.80it/s][A
Converting docs to features:  98%|█████████▊| 218937/223549 [11:21<00:15, 302.51it/s][A
Converting docs to features:  98%|█████████▊| 218968/223549 [11:21<00:15, 293.16it/s][A
Converting docs to fe

Converting docs to features:  99%|█████████▉| 221992/223549 [11:30<00:05, 295.07it/s][A
Converting docs to features:  99%|█████████▉| 222023/223549 [11:30<00:05, 297.52it/s][A
Converting docs to features:  99%|█████████▉| 222054/223549 [11:30<00:05, 297.86it/s][A
Converting docs to features:  99%|█████████▉| 222108/223549 [11:30<00:04, 343.30it/s][A
Converting docs to features:  99%|█████████▉| 222146/223549 [11:30<00:04, 346.44it/s][A
Converting docs to features:  99%|█████████▉| 222183/223549 [11:30<00:03, 351.42it/s][A
Converting docs to features:  99%|█████████▉| 222220/223549 [11:31<00:03, 352.51it/s][A
Converting docs to features:  99%|█████████▉| 222257/223549 [11:31<00:03, 328.98it/s][A
Converting docs to features:  99%|█████████▉| 222301/223549 [11:31<00:03, 351.29it/s][A
Converting docs to features:  99%|█████████▉| 222338/223549 [11:31<00:03, 334.50it/s][A
Converting docs to features:  99%|█████████▉| 222373/223549 [11:31<00:03, 336.60it/s][A
Converting docs to fe

Converting docs to features:  23%|██▎       | 1823/8000 [00:05<00:18, 325.92it/s][A
Converting docs to features:  23%|██▎       | 1858/8000 [00:05<00:18, 331.19it/s][A
Converting docs to features:  24%|██▎       | 1892/8000 [00:05<00:18, 330.77it/s][A
Converting docs to features:  24%|██▍       | 1926/8000 [00:05<00:18, 330.80it/s][A
Converting docs to features:  24%|██▍       | 1960/8000 [00:05<00:18, 325.88it/s][A
Converting docs to features:  25%|██▍       | 1996/8000 [00:06<00:17, 334.92it/s][A
Converting docs to features:  25%|██▌       | 2030/8000 [00:06<00:18, 322.97it/s][A
Converting docs to features:  26%|██▌       | 2065/8000 [00:06<00:18, 329.36it/s][A
Converting docs to features:  26%|██▌       | 2099/8000 [00:06<00:18, 314.92it/s][A
Converting docs to features:  27%|██▋       | 2131/8000 [00:06<00:18, 312.25it/s][A
Converting docs to features:  27%|██▋       | 2163/8000 [00:06<00:19, 303.13it/s][A
Converting docs to features:  27%|██▋       | 2194/8000 [00:06<00

Converting docs to features:  64%|██████▍   | 5130/8000 [00:15<00:09, 311.98it/s][A
Converting docs to features:  65%|██████▍   | 5162/8000 [00:15<00:09, 308.78it/s][A
Converting docs to features:  65%|██████▍   | 5194/8000 [00:16<00:09, 307.11it/s][A
Converting docs to features:  65%|██████▌   | 5225/8000 [00:16<00:09, 302.89it/s][A
Converting docs to features:  66%|██████▌   | 5262/8000 [00:16<00:08, 315.73it/s][A
Converting docs to features:  66%|██████▌   | 5295/8000 [00:16<00:08, 315.48it/s][A
Converting docs to features:  67%|██████▋   | 5327/8000 [00:16<00:08, 313.12it/s][A
Converting docs to features:  67%|██████▋   | 5362/8000 [00:16<00:08, 322.63it/s][A
Converting docs to features:  67%|██████▋   | 5396/8000 [00:16<00:08, 322.29it/s][A
Converting docs to features:  68%|██████▊   | 5430/8000 [00:16<00:07, 323.89it/s][A
Converting docs to features:  68%|██████▊   | 5463/8000 [00:16<00:07, 317.93it/s][A
Converting docs to features:  69%|██████▊   | 5495/8000 [00:16<00

In [34]:
# Create TensorFlow datasets for better performance
train_ds = (
    tf.data.Dataset
    .from_tensor_slices(((train_features_ids, train_features_masks), y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
    
valid_ds = (
    tf.data.Dataset
    .from_tensor_slices(((val_features_ids, val_features_masks), y_valid))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

## Model building and training

In [18]:
# Configure TPU
from kaggle_datasets import KaggleDatasets

tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)

GCS_DS_PATH = KaggleDatasets().get_gcs_path('jigsaw-multilingual-toxic-comment-classification')

EPOCHS = 2
BATCH_SIZE = 32 * strategy.num_replicas_in_sync

In [35]:
# Create utility function to get a training ready model on demand
def get_training_model():
    inp_id = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype=tf.int64, name="bert_input_ids")
    inp_mask = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype=tf.int64, name="bert_input_masks")
    inputs = [inp_id, inp_mask]

    hidden_state = transformers.TFDistilBertModel.from_pretrained('distilbert-base-multilingual-cased')(inputs)[0]
    pooled_output = hidden_state[:, 0]    
    dense1 = tf.keras.layers.Dense(128, activation='relu')(pooled_output)
    output = tf.keras.layers.Dense(1, activation='sigmoid')(dense1)

    model = tf.keras.Model(inputs=inputs, outputs=output)
    model.compile(optimizer=tf.optimizers.Adam(learning_rate=2e-5, 
                                            epsilon=1e-08), 
                loss='binary_crossentropy', metrics=['accuracy'])

    return model

In [36]:
print(train_features_ids.shape, train_features_masks.shape, y_train.shape)
print(val_features_ids.shape, val_features_masks.shape, y_valid.shape)

(223549, 500) (223549, 500) (223549,)
(8000, 500) (8000, 500) (8000,)


In [21]:
# Authorize wandb
import wandb

wandb.login()
from wandb.keras import WandbCallback

[34m[1mwandb[0m: [32m[41mERROR[0m Not authenticated.  Copy a key from https://app.wandb.ai/authorize


API Key: ········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [22]:
# Initialize wandb
wandb.init(project="jigsaw-toxic", id="distilbert-tpu-kaggle")

W&B Run: https://app.wandb.ai/sayakpaul/jigsaw-toxic/runs/distilbert-tpu

In [23]:
# Create 32 random indices from the test comments
RANDOM_INDICES = np.random.choice(test_comments.shape[0], 32)
RANDOM_INDICES

array([46771,  5492, 17013, 43950, 37318, 19737, 21729, 53810, 59826,
       58041, 30132, 52925, 19974,  5363, 45463, 39680, 45243, 26091,
       38110,  4029, 15824,  4062, 58038, 15559, 31898, 38526,  2098,
       44042, 18642, 29816, 31069, 25361])

In [45]:
# Create a sample prediction logger
# A custom callback to view predictions on the above samples in real-time
class TextLogger(tf.keras.callbacks.Callback):
    def __init__(self):
        super(TextLogger, self).__init__()

    def on_epoch_end(self, logs, epoch):
        samples = []
        for index in RANDOM_INDICES:
            # Grab the comment
            comment = test_comments[index]
            # Create BERT features
            comment_feature_ids, comment_features_masks = create_bert_input_features(tokenizer,  
                                    comment, max_seq_length=MAX_SEQ_LENGTH)
            # Employ the model to get the prediction and parse it
            predicted_label = self.model.predict([comment_feature_ids, comment_features_masks])
            predicted_label = np.argmax(predicted_label[0])
            if predicted_label==0: predicted_label="Non-Toxic"
            else: predicted_label="Toxic"
            
            sample = [comment, predicted_label]
            
            samples.append(sample)
        wandb.log({"text": wandb.Table(data=samples, 
                                       columns=["Comment", "Predicted Label"])})

In [38]:
# Garbage collection
gc.collect()

3082

In [44]:
# Train the model
import time

start = time.time()

# Compile the model with TPU Strategy
with strategy.scope():
    model = get_training_model()
    
model.fit(train_ds, 
          steps_per_epoch=train_data.shape[0] // BATCH_SIZE,
          validation_data=valid_ds,
          validation_steps=val_data.shape[0] // BATCH_SIZE,
          epochs=EPOCHS,
          shuffle=True,
          callbacks=[WandbCallback(), TextLogger()],
          verbose=1)
end = time.time() - start
print("Time taken ",end)
wandb.log({"training_time":end})

Train for 873 steps, validate for 31 steps
Epoch 1/2

[34m[1mwandb[0m: [32m[41mERROR[0m Can't save model, h5py returned error: 

Converting docs to features: 100%|██████████| 289/289 [00:00<00:00, 3613.02it/s]


AttributeError: 'Model' object has no attribute 'predict_classes'