# Testing words
This notebook provides some skeleton code for loading the training data and getting the predictions from the model for the different keywords.

## Warning: I am still training the models. For now, testing can only be done with the agency model.

## Imports and setup

This section installs the libraries the code uses and upgrades them to the newest version.

In [43]:
%pip install transformers -Uqq
%pip install nltk -Uqq

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
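To confirm which versions the kernel actually picked up after a restart, a small check like the following may help. It is only a sketch using the standard library; `installed_version` is a hypothetical helper and not part of this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages installed by the cell above
for name in ("transformers", "nltk"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```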


In [45]:
import json
from collections import defaultdict
from transformers import (
    BertForSequenceClassification,
    TextClassificationPipeline,
    BertTokenizer,
)
from nltk import word_tokenize

## Loading data

In [7]:
with open('data/data.json') as file:
    # Keep only the headline text of each entry
    dataset = [entry['text'] for entry in json.load(file)['data']]
    
dataset[0:2]

['A new vision of artificial intelligence for the people',
 'The gig workers fighting back against the algorithms']
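The loading cell implies that `data/data.json` has a top-level `data` key holding a list of objects, each with a `text` field. The sketch below writes a tiny file of that shape to a temporary directory and reads it back. The layout is inferred from the loading code only, the real file may carry other fields, and `load_headlines` is a hypothetical helper name.

```python
import json
import tempfile
from pathlib import Path

# Layout inferred from the loading cell: {"data": [{"text": ...}, ...]}
sample = {
    "data": [
        {"text": "A new vision of artificial intelligence for the people"},
        {"text": "The gig workers fighting back against the algorithms"},
    ]
}


def load_headlines(path):
    """Read the JSON file at `path` and return the list of headline strings."""
    with open(path, encoding="utf-8") as file:
        return [entry["text"] for entry in json.load(file)["data"]]


# Write the sample to a temporary file, then load it the same way as above
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "data.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    headlines = load_headlines(sample_path)

print(headlines)
```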

## Loading model for some keyword

In [3]:
keyword = 'agency'  # Note that other models are not supported as of now
model = BertForSequenceClassification.from_pretrained(f'xt0r3/aihype_{keyword}-vs-rest')

Downloading (…)lve/main/config.json:   0%|          | 0.00/745 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/433M [00:00<?, ?B/s]

## Adding input processing

In [8]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [12]:
pipe = TextClassificationPipeline(
    model=model, tokenizer=tokenizer, top_k=None,
)

## Define prediction function

In [30]:
# Note that keyword is defined above when loading the model
decode = defaultdict(lambda: 'Undefined label', {
    'LABEL_0': f'no {keyword}',
    'LABEL_1': f'{keyword} present',
})

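def predict(text):
    """Return a dict mapping each decoded label to its score for `text`."""
    # With this setup, pipe(text) returns a one-element list;
    # [0] holds one {"label": ..., "score": ...} dict per label
    preds = pipe(text)[0]
    return {decode[pred["label"]]: pred["score"] for pred in preds}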
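To see what the decoding step does without loading the model, the sketch below applies the same mapping to a hand-written stand-in for the pipeline output. The scores in `raw_preds` are invented for illustration; only the list-of-dicts shape with `label` and `score` keys is taken from how `predict` reads `pipe(text)[0]`.

```python
from collections import defaultdict

keyword = "agency"
decode = defaultdict(lambda: "Undefined label", {
    "LABEL_0": f"no {keyword}",
    "LABEL_1": f"{keyword} present",
})

# Illustrative stand-in for pipe(text)[0]; scores are invented, not model output
raw_preds = [
    {"label": "LABEL_1", "score": 0.9},
    {"label": "LABEL_0", "score": 0.1},
]

# Same mapping as in predict: raw label -> readable label
decoded = {decode[pred["label"]]: pred["score"] for pred in raw_preds}
print(decoded)

# Any label missing from the mapping falls back to the default string
print(decode["LABEL_7"])
```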

## Playing around with the model

In [33]:
dataset[0:4]

['A new vision of artificial intelligence for the people',
 'The gig workers fighting back against the algorithms',
 'How the AI industry profits from catastrophe',
 'Artificial intelligence is creating a new colonial world order']

In [41]:
for i in range(8):
    preds = predict(dataset[i])
    print(f'''Headline: {dataset[i]}
Prediction: {preds}
Classification: {max(preds, key=preds.get).upper()}
''')

Headline: A new vision of artificial intelligence for the people
Prediction: {'no agency': 0.9999657869338989, 'agency present': 3.424444730626419e-05}
Classification: NO AGENCY

Headline: The gig workers fighting back against the algorithms
Prediction: {'agency present': 0.9999499320983887, 'no agency': 5.0033140723826364e-05}
Classification: AGENCY PRESENT

Headline: How the AI industry profits from catastrophe
Prediction: {'no agency': 0.99996018409729, 'agency present': 3.9774677134118974e-05}
Classification: NO AGENCY

Headline: Artificial intelligence is creating a new colonial world order
Prediction: {'no agency': 0.9999510049819946, 'agency present': 4.896329119219445e-05}
Classification: NO AGENCY

Headline: South Africa’s private surveillance machine is fueling a digital apartheid
Prediction: {'agency present': 0.9999490976333618, 'no agency': 5.091609273222275e-05}
Classification: AGENCY PRESENT

Headline: We built a database to understand the China Initiative. Then the gove
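When going through more than a handful of headlines, it can be easier to collect the results than to read them off the printout. The sketch below counts the top labels and lists the headlines flagged for the keyword. The predictions are copied from the printed output for the first three headlines, and `top_label` is a hypothetical helper.

```python
from collections import Counter

# Predictions copied from the printed output for the first three headlines
results = {
    "A new vision of artificial intelligence for the people": {
        "no agency": 0.9999657869338989,
        "agency present": 3.424444730626419e-05,
    },
    "The gig workers fighting back against the algorithms": {
        "agency present": 0.9999499320983887,
        "no agency": 5.0033140723826364e-05,
    },
    "How the AI industry profits from catastrophe": {
        "no agency": 0.99996018409729,
        "agency present": 3.9774677134118974e-05,
    },
}


def top_label(preds):
    """Return the label with the highest score."""
    return max(preds, key=preds.get)


# How often each label wins, and which headlines were flagged
counts = Counter(top_label(preds) for preds in results.values())
flagged = [
    headline
    for headline, preds in results.items()
    if top_label(preds) == "agency present"
]

print(counts)
print(flagged)
```

In the notebook itself, `results` could instead be built from the model, for example with `{headline: predict(headline) for headline in dataset[:8]}`.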

## Tokenization example
This section shows an example of tokenization, which should make it easier to classify the data.

In [46]:
# NLTK tokenizer for human-understood words
word_tokenize(dataset[0])

['A',
 'new',
 'vision',
 'of',
 'artificial',
 'intelligence',
 'for',
 'the',
 'people']

In [52]:
word_tokenize('infinitesimal')

['infinitesimal']

In [50]:
# BERT tokenizer for the subwords the model pays attention to.
# It not only finds words, but also splits some long words into smaller subwords.
tokenizer.tokenize(dataset[0])

['A',
 'new',
 'vision',
 'of',
 'artificial',
 'intelligence',
 'for',
 'the',
 'people']

In [53]:
tokenizer.tokenize('infinitesimal')

['infinite', '##si', '##mal']
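In the BERT output, a leading `##` marks a piece that continues the previous token. To line the subwords up with whole words again, the pieces can be glued back together. The sketch below uses a hypothetical helper, `merge_wordpieces`, that relies only on that `##` convention. It will not necessarily reproduce NLTK's tokens exactly, for instance around punctuation.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens back into words; '##' marks a continuation piece."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word without the marker
            words[-1] += token[2:]
        else:
            # A new word starts here
            words.append(token)
    return words


print(merge_wordpieces(["infinite", "##si", "##mal"]))
print(merge_wordpieces(["A", "new", "vision"]))
```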