# Sponsor content detection in YouTube videos
## Transfomers for binary text classification
This notebook seeks to accomplish the task of sponsored-content detection using a binary text classification model. The text classification model is created by fine-tuning a DistilBERT pre-trained model.

## Motivation
Several similar projects based on a BERT-type text classification model have been written about in on the Internet. Unfortunately, in both instances the authors do not share details about the performance of the model. Instead, they used vague language like "95% accuracy" without qualifying that in any meaningful way. What is more, the trained models in both instances then demonstrably perform poorly in the downstream task of task classification, but no exact numbers are reported. 

We wanted to investigate how well a text classification model can perform on what is essentially a span extraction task.

In [4]:
import os
import sys

import numpy as np
import torch
from datasets import Dataset, IterableDataset, IterableDatasetDict, ClassLabel, load_dataset, load_from_disk, load_metric
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
import pandas as pd
import pyarrow as pa

sys.path.append(os.path.dirname(os.path.realpath('..')))
from data_loader import load_examples_from_chunks, load_captions_from_chunks

os.environ["WANDB_DISABLED"] = "true"

# Prepare the data

Read the transcripts from the `data.N.json.gz` and extract examples using `load_examples_from_chunks`. 

In [5]:
LABELS = {
    'content': 0,
    'sponsor': 1,
}

def load_examples(chunks=None):
    for example, label in load_examples_from_chunks(base_name='data', root_dir='./', chunks=chunks):
        yield example, LABELS[label]

def iterable_to_pandas(columns, iterable, max_length):
    from tqdm.auto import tqdm
    df = pd.DataFrame(columns=columns)
    for item in tqdm(iterable, total=max_length):
        df.loc[len(df)] = item
    
    return df

# Save prepared data to disk
The dataset returned by `load_examples_from_chunks` is much smaller than the original ~10 GiB dataset because it does not include full video transcripts. We read this whole thing into memory into a pandas `DataFrame` and then save it to disk for further use. Loading the dataset into memory makes it easier to work with. 

In [6]:
import itertools
for x in itertools.islice(load_examples(), 0, 50):
    print(x)

[34mFound ./data.1.json.gz.[0m
[34mFound ./data.10.json.gz.[0m
[34mFound ./data.11.json.gz.[0m
[34mFound ./data.12.json.gz.[0m
[34mFound ./data.13.json.gz.[0m
[34mFound ./data.14.json.gz.[0m
[34mFound ./data.15.json.gz.[0m
[34mFound ./data.16.json.gz.[0m
[34mFound ./data.2.json.gz.[0m
[34mFound ./data.3.json.gz.[0m
[34mFound ./data.4.json.gz.[0m
[34mFound ./data.5.json.gz.[0m
[34mFound ./data.6.json.gz.[0m
[34mFound ./data.7.json.gz.[0m
[34mFound ./data.8.json.gz.[0m
[34mFound ./data.9.json.gz.[0m
[34mOpening ./data.1.json.gz for reading...[0m


("a sponsor of this video I work with pet flow because I think they really do offer a way to make your life better and easier so needless to say I think you should get your dog food from pet flow the great thing about them is that you can go and you order your dog food one time and then it's just automatically there whenever you need it you just select how often you want to deliver they save you the hassle of having to drive to the store every week or two to get your dog food I love that they and you guys support content like this because I think it's so important now I'll have their link in the description along with a coupon code that will give you an awesome discount on your first order did you know that", 1)
("puppies and their parents by making a contribution of any amount you'd like to our patreon campaign setup automatic pet food delivery with Peplow I'll have a link in the description as well as a coupon code that'll give you a terrific discount on your first order see you guys

("all right baby we back i want to give a huge shout out for rayconf for sponsoring another video i mean we already know i use my raycons every day at the gym just like in league we stay sweating at the gym and instead of you know rocking that dog music to jimmy plane i rock my 90 boomer tunes all day like a league we stay sweating at the gym and with their new rubber oil look and feel these bad boys damn that's what we like to call nice even when we're sweating bricks they don't fall off not only that but they're half the price of other premium brands out on the market we're talking about 32 hours of battery life eight hours of play time plus you got a built-in mic to answer calls you guys see me at them squats for big numbers baby big numbers look at the gel tips look how perfectly they fit in my ear bing bada boom all you got to do is go to the website link down below in the description pick a color add it to the card and boom 15 off just like that thank you again raycon for sponsor

('is sponsored by Squarespace the all-in-one platform to build an online presence and pursue your dream but more on that at the end of', 1)
('launch go to Squarespace com forward slash history time or simply use the offer code history time to get 10% off your first purchase of a website or domain', 0)
("suspicious speaking of smells though let's take a second and hear a word from the sponsor of this video now you guys know i love smelling like someone's expensive wife because that's what i am i am someone's expensive wife you know him his name's tony off-screen tony if you will now i love smelling expensive but i don't want to spend all that money on expensive perfumes so what's the best solution the sponsor of today's video which is scentbird if you don't know you most likely already do but scentbird is a subscription fragrance service that gives you the opportunity to shop over 600 that's a lot over 600 brands that's so many and that is a flexible subscription and it is monthly but n

In [None]:
df = iterable_to_pandas(['text', 'label'], load_examples(), 16 * 20_000)

NOTE: The above progress bar was out of 320,000 because that was the realistically maximum number of samples that we could get given the dataset that we have. The red color is not an indicator of failure.

In [6]:
Dataset.from_pandas(df).remove_columns('__index_level_0__').save_to_disk('./classification-dataset')

# Read prepared data
Read the prepared dataset using the 🤗 API. 

In [7]:
raw_datasets = load_from_disk('./classification-dataset').train_test_split(test_size=0.1)
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 112118
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12458
    })
})

In [8]:
raw_datasets['test'][:30]

{'text': ["illegal but before we do get started we have a sponsorship for today's video our first sponsorship on the channel and what better first sponsorship then braid shadow legends a game that I've been so addicted to that my girlfriend is threatening to break up with me because I refuse to put it down not going to lie a fluffy cartoon looking game here and there is no problem with me but why is every game becoming more and more cartoony I can't take it anymore I'm tired of all the candy all the unicorns all the bright colors I'm tired of it ok from now on you can miss me with all that I need an RPG game that's not afraid to get dark to get real to get raw but also one that's epic and awesome and guess what raid shadow legends is just that this game will take you to a ward of dark fantasy and realism now what do I like about the game you may be asking you know why am i promoting this game well for one the customization is out of this world I absolutely love it I love being able to 

In [10]:
# If we've arrived here, everything with the dataset is okay and it has been stored to disk. We
# can drop the in-memory `DataFrame` we constructed originally. 
df = None

# Tokenize inputs
Tokenize the datatset with the pre-trained tokenizer. Sequences are padded to the maximum length supported by BERT and truncated if longer.

In [11]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [12]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

  0%|          | 0/113 [00:00<?, ?ba/s]

  0%|          | 0/13 [00:00<?, ?ba/s]

In [13]:
cleaned_datasets = tokenized_datasets.remove_columns(['text'])
train_dataset = cleaned_datasets['train']
test_dataset = cleaned_datasets['test']

# Prepare for training
Set training parameters, configure metrics, etc.

In [14]:
torch.cuda.empty_cache()
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
training_args = TrainingArguments(
    output_dir="distilbert-classification-uncased", 
    per_device_train_batch_size=4, 
    per_device_eval_batch_size=4,
    save_total_limit=5, 
    evaluation_strategy='steps',
    eval_steps=10_001,
    save_steps=5_000)

accuracy_metric = load_metric("accuracy")
precision_metric = load_metric("precision")
recall_metric = load_metric("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    precision = precision_metric.compute(predictions=predictions, references=labels)
    recall = recall_metric.compute(predictions=predictions, references=labels)
    return {**accuracy, **precision, **recall}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w

# Train the model ⚡
We're using the default number of batches, but we terminate the training early because we observe that the model performs extremely well on all metric on the test dataset and because the training loss and validation loss are comparable after step 30,000, indicating that there is not too much over- or under-fitting, and that the model is not likely to learn anything else.

In [15]:
trainer.train('distilbert-classification-uncased/checkpoint-30000')

Loading model from distilbert-classification-uncased/checkpoint-30000).
***** Running training *****
  Num examples = 112118
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 84090
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 1
  Continuing training from global step 30000
  Will skip the first 1 epochs then the first 1970 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.


  0%|          | 0/1970 [00:00<?, ?it/s]

Step,Training Loss,Validation Loss,Accuracy,Precision,Recall
30003,0.0329,0.040455,0.993739,0.994917,0.99271
40004,0.0452,0.025425,0.995906,0.996195,0.995721
50005,0.0263,0.033823,0.995264,0.995403,0.995246
60006,0.0284,0.023235,0.995746,0.993221,0.998415
70007,0.0032,0.020563,0.99703,0.996517,0.997623
80008,0.013,0.012398,0.997351,0.997149,0.997623


***** Running Evaluation *****
  Num examples = 12458
  Batch size = 4
Saving model checkpoint to distilbert-classification-uncased/checkpoint-35000
Configuration saved in distilbert-classification-uncased/checkpoint-35000/config.json
Model weights saved in distilbert-classification-uncased/checkpoint-35000/pytorch_model.bin
Deleting older checkpoint [distilbert-classification-uncased/checkpoint-10000] due to args.save_total_limit
Saving model checkpoint to distilbert-classification-uncased/checkpoint-40000
Configuration saved in distilbert-classification-uncased/checkpoint-40000/config.json
Model weights saved in distilbert-classification-uncased/checkpoint-40000/pytorch_model.bin
Deleting older checkpoint [distilbert-classification-uncased/checkpoint-15000] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 12458
  Batch size = 4
Saving model checkpoint to distilbert-classification-uncased/checkpoint-45000
Configuration saved in distilbert-classification-unc

TrainOutput(global_step=84090, training_loss=0.015288903694017305, metrics={'train_runtime': 26905.3744, 'train_samples_per_second': 12.501, 'train_steps_per_second': 3.125, 'total_flos': 4.455593940754022e+16, 'train_loss': 0.015288903694017305, 'epoch': 3.0})

In [17]:
model = None
trainer = None
trained = None
torch.cuda.empty_cache()

def softmax_outputs(outputs) -> dict:
    return torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()

trained = AutoModelForSequenceClassification.from_pretrained('./distilbert-classification-uncased/checkpoint-80000')
trained.to('cuda')

loading configuration file ./distilbert-classification-uncased/checkpoint-80000/config.json
Model config DistilBertConfig {
  "_name_or_path": "./distilbert-classification-uncased/checkpoint-80000",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.18.0",
  "vocab_size": 30522
}

loading weights file ./distilbert-classification-uncased/checkpoint-80000/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassi

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

# Run on full video transcripts

In [76]:
import itertools
from collections import defaultdict

from data_loader import Caption, load_captions_from_chunks, segment_text

def caption_times(c):
    return c.start, c.end

def prediction_times(p):
    return tuple(p[0])

def tumbling_time_window(captions, duration, key=caption_times):
    results = [captions[0]]
    for caption in captions:
        if key(results[-1])[1] - key(results[0])[0] <= duration:
            results.append(caption)
        else:
            yield results
            results = [caption]

    yield results
    
def session_time_window(captions, duration, key=caption_times):
    captions_iter = iter(captions)
    results = [next(captions_iter)]
    for caption in captions_iter:
        if key(results[-1])[1] - key(caption)[0] <= duration:
            results.append(caption)
        else:
            yield results
            results = [caption]

    yield results

def batch(iterable, n):
    length = len(iterable)
    for i in range(0, length, n):
        yield iterable[i:min(i + n, length)]
        
def decode_label(outputs):
    content, sponsor = outputs
    
    prediction_dict = {'sponsor': sponsor, 'content': content}
    prediction_dict = {k: v for k, v in sorted(prediction_dict.items(), key=lambda item: item[1], reverse=True)}

    return next(iter(prediction_dict.items()))
        
def predict_in_batches(texts, batch_size: int = 8):    
    batches = list(batch(texts, batch_size))
    for b in batches:
        inputs = defaultdict(list)
        for text in b:
            tokenized = tokenize_function({ 'text': text })
            for k, v in tokenized.items():
                inputs[k].append(v)
            
        inputs = { k: torch.tensor(v).cuda() for k, v in inputs.items() }
        outputs = trained(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1).tolist()
        yield from predictions
        
def predict_sponsor_segments(captions, window_duration=10):
    windows = list(tumbling_time_window(captions, window_duration))
    window_texts = [segment_text(window) for window in windows]
    predictions = predict_in_batches(window_texts, 4)
    
    for window, text, prediction in zip(windows, window_texts, predictions):
        yield [window[0].start, window[-1].end], text, *decode_label(prediction)
        
def merge_prediction_(predictions):
    assert len(set((label for _, _, label, _ in predictions))) == 1
    # All co-occurring predictions have the same label so we merge them
    merged_start, merged_end = predictions[0][0][0], predictions[-1][0][1]
    merged_text = ' '.join((text for _, text, _, _ in predictions))
    # Don't know what the correct way to compute the joint probability here is,
    # just assume they are independent; We don't really use this number anywhere
    prob = np.prod([prob for _, _, _, prob in predictions])
    return [merged_start, merged_end], text, predictions[0][2], prob

def merge_predictions(predictions, within_duration=5):
    for co_occuring in session_time_window(predictions, within_duration, key=prediction_times):
        merged = [co_occuring[0]]
        for times, text, label, prob in co_occuring[1:]:
            _, _, prev_label, _ = merged[0]
            if label == prev_label:
                merged.append((times, text, label, prob))
            else:
                yield merge_prediction_(merged)
                merged = [(times, text, label, prob)]
        
        if len(merged) > 0:
            yield merge_prediction_(merged)
        

In [84]:
from termcolor import colored

def evaluate(videos):
    from tqdm.auto import tqdm
    
    predicted = []
    expected = []
    
    output = []
    for video_id, captions, sponsor_ranges in tqdm(videos):
        sponsor_times = [(captions[start].start, captions[end].end) for start, end in sponsor_ranges]

        output.append(lambda: print(colored(f'{video_id} {sponsor_times}', None, 'on_magenta')))
        predicted_sponsor_times = []

        for times, text, label, prob in merge_predictions(predict_sponsor_segments(captions, window_duration=10), within_duration=10):
            if label == 'sponsor':
                predicted_sponsor_times.append((f'{int(prob * 100)}%', times))

            color = { 'sponsor': 'yellow', 'content': None }[label]
            # print(colored(f'{int(prob * 100)}% {times[0]} <--> {times[1]} {text}', color=color))

        predicted.append(predicted_sponsor_times)
        expected.append(sponsor_times)
        output.append(lambda: print(f'\tPredicted={predicted_sponsor_times},\n\tExpected={sponsor_times}'))    
    
    # TODO: Evaluate predicted vs. expected
    
    for o in output:
        o()
        
evaluate(list(itertools.islice(load_captions_from_chunks('data', './', [1]), 0, 100)))

[34mFound ./data.1.json.gz.[0m
[34mOpening ./data.1.json.gz for reading...[0m


Dropping --6T95cQa50 because sponsor times do not match the captions
Dropping --BXjAWlPDQ because sponsor times do not match the captions
Dropping --xfK_Uhly4 because sponsor times do not match the captions
Dropping --yUuR_F_wU because sponsor times do not match the captions
Dropping -1STsVEsLSU because sponsor times do not match the captions
Dropping -2a7i00mcS0 because sponsor times do not match the captions
Dropping -3bMKfaMY7I because sponsor times do not match the captions
Dropping -3GY3WjZY4Y because sponsor times do not match the captions


  0%|          | 0/100 [00:00<?, ?it/s]

[45m-3nw9slXrBc [(979.606, 1044.58)][0m
	Predicted=[('99%', [979.606, 1022.91]), ('99%', [1035.2, 1059.683]), ('99%', [1313.259, 1324.0])],
	Expected=[(979.606, 1044.58)]
[45m-3nw9slXrBc [(979.606, 1044.58)][0m
	Predicted=[('99%', [979.606, 1022.91]), ('99%', [1035.2, 1059.683]), ('99%', [1313.259, 1324.0])],
	Expected=[(979.606, 1044.58)]
[45m-3nw9slXrBc [(979.606, 1044.58)][0m
	Predicted=[('99%', [979.606, 1022.91]), ('99%', [1035.2, 1059.683]), ('99%', [1313.259, 1324.0])],
	Expected=[(979.606, 1044.58)]
[45m-3nw9slXrBc [(979.606, 1044.58)][0m
	Predicted=[('99%', [979.606, 1022.91]), ('99%', [1035.2, 1059.683]), ('99%', [1313.259, 1324.0])],
	Expected=[(979.606, 1044.58)]
[45m-3nw9slXrBc [(979.606, 1044.58)][0m
	Predicted=[('99%', [979.606, 1022.91]), ('99%', [1035.2, 1059.683]), ('99%', [1313.259, 1324.0])],
	Expected=[(979.606, 1044.58)]
[45m-3nw9slXrBc [(979.606, 1044.58)][0m
	Predicted=[('99%', [979.606, 1022.91]), ('99%', [1035.2, 1059.683]), ('99%', [1313.259, 1324

# Experiment with Topical Change Detection

In [37]:
tc_tokenizer = AutoTokenizer.from_pretrained('dennlinger/roberta-cls-consec')
tc_model = AutoModelForSequenceClassification.from_pretrained('dennlinger/roberta-cls-consec')

Downloading:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/150 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/478M [00:00<?, ?B/s]

Some weights of the model checkpoint at dennlinger/roberta-cls-consec were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [45]:
for left_captions, right_captions in group(group(videos[2][1], 2), 2, []):
    left_text = ' '.join((caption['text'] for caption in left_captions))
    right_text = ' '.join((caption['text'] for caption in right_captions))
    inputs = tc_tokenizer(left_text, right_text)
    inputs = { k: torch.tensor([v]) for k, v in inputs.items() }
    outputs = tc_model(**inputs)
    non_sponsor, sponsor = torch.nn.functional.softmax(outputs.logits, dim=-1).tolist()[0]
    
    print({ 'not_same': '%.2f' % non_sponsor, 'same': '%.2f' % sponsor }, left_text, right_text)
    

{'not_same': '0.00', 'same': '1.00'} oh my god  i'm holding my life back in my hands
 right now it feels like this is the only

{'not_same': '0.21', 'same': '0.79'} good thing that being severely anemic
has ever done for me [Music]  okay wait actually before we go to

{'not_same': '0.28', 'same': '0.72'} boston i want to tell you guys about the
 sponsor today's video which is current i
 mean it's a way that i'm paying for this
 entire trip in general so it might as

{'not_same': '0.03', 'same': '0.97'} well tell you about how i'm doing this
 in the first place
 while wearing their awesome hat that
 they sent me as well current is the new

{'not_same': '0.00', 'same': '1.00'} way to bank it is truly the future i
 mean i feel like we're all always
 dreaming of what the future holds and
 honestly it is you with this car

{'not_same': '0.02', 'same': '0.98'} current is a mobile bank with a visa
 debit card and a real bank account with
 no hidden fees and no minimum balance
 requirement cur

{'not_same': '0.00', 'same': '1.00'} my dad and aiden but now tonight is
 about my mom
 i'm going out with her and her friends
 the ladies are going to go get some
drinks
{'not_same': '0.01', 'same': '0.99'} drinks
 let me show you the view of the hotel
 room right now i'm sure you really care
 but i personally appreciate it so i want

{'not_same': '0.00', 'same': '1.00'} you guys to appreciate with me we've got
 the highway we've got the garden right
 there oh yeah i lost a nail last night
 so that's really sad oh my god my nails

{'not_same': '0.03', 'same': '0.97'} so gross actually don't look at it it's
 kind of gross out kind of annoyed about
that that
 but yeah awesome 10 out of 10 scenery

{'not_same': '0.00', 'same': '1.00'} probably talk to you guys tomorrow
 because i'm really [ __ ] excited about
 that that's what i'm going to see
 all of my friends and go out to eat and

{'not_same': '0.00', 'same': '1.00'} get a hotel and go shopping and
 everything so let's just [ __ ] cu

{'not_same': '0.00', 'same': '1.00'} manicure first
 i was just about to here they are
 the next step manicure yes let's just
 get on that they only would let me have

{'not_same': '0.03', 'same': '0.97'} one shoe again
 the other one's in a bag but i just
 don't feel like unboxing that one so
 that's what i have ready to take on the

{'not_same': '0.01', 'same': '0.99'} summer with these shoes
 i'm destroying the box with water really
 good inspirational things oh
wait look at the real prize 
{'not_same': '0.01', 'same': '0.99'} are you kidding me oh wow i have to pull
 myself together and you did buy
 something else
 i did buy something else i bitched about

{'not_same': '0.00', 'same': '1.00'} this item in this bag for years
 yes now i do not buy designer bags often
 i truly only bought one which is my
 iconic balenciaga that i bought in
hawaii
{'not_same': '0.05', 'same': '0.95'} hawaii
 fall 2019. which one are you wearing
 tonight i think this
 this bag has been sold out for a ve

{'not_same': '0.01', 'same': '0.99'} don't know it just feels really good to
 just like
 reconnect and just remember where i came
 from i don't know

{'not_same': '0.01', 'same': '0.99'} thank you guys for [ __ ] everything
 and i will see you in my next
 video april like kind of might be fun
 like i don't know like we'll see i have

{'not_same': '0.31', 'same': '0.69'} some things planned that i think you
 guys will like but
 we'll [ __ ] see no more peace signs i
 need to end this video i love you guys

{'not_same': '0.01', 'same': '0.99'} so much oh my god i went to a sports
 game in this video holy [ __ ]
[ __ ]  bring that freaks out i put choices
through


TypeError: 'NoneType' object is not subscriptable

In [24]:
load_dataset('json', data_files='data.1.json.gz')

Using custom data configuration default-9dc7759ca38c0da6


Downloading and preparing dataset json/default to /home/veselin/.cache/huggingface/datasets/json/default-9dc7759ca38c0da6/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Failed to read file '/home/veselin/.cache/huggingface/datasets/downloads/extracted/58b78e85c06cba74c6ee3cc9f04a35a99e1d7a2d5eb255e822ce8bc17b8ba33d' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column(/captions/[]/[]) changed from string to number in row 0


ArrowInvalid: JSON parse error: Column(/captions/[]/[]) changed from string to number in row 0