In [1]:
from datasets import load_dataset, get_dataset_split_names

In [2]:
dataset = load_dataset("rotten_tomatoes")
print(get_dataset_split_names("rotten_tomatoes"))

['train', 'validation', 'test']


In [6]:
training_data = dataset["train"]
print(f"Number of training examples: {len(training_data)}")
print(training_data.info.description)  # dataset description (empty in this run)
print(training_data.features)

Number of training examples: 8530

{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}


In [14]:
sentences = training_data["text"][:5]
for sentence in sentences:
    print(f"{sentence}\n")

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .

the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .

effective but too-tepid biopic

if you sometimes like to go to the movies to have fun , wasabi is a good place to start .

emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .



In [15]:
from transformers import AutoTokenizer
sentence = training_data["text"][0]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokens_input = tokenizer(sentence)
print(tokens_input)

{'input_ids': [101, 1103, 2067, 1110, 17348, 1106, 1129, 1103, 6880, 1432, 112, 188, 1207, 107, 14255, 1389, 107, 1105, 1115, 1119, 112, 188, 1280, 1106, 1294, 170, 24194, 1256, 3407, 1190, 170, 11791, 5253, 188, 1732, 7200, 10947, 12606, 2895, 117, 179, 7766, 118, 172, 15554, 1181, 3498, 6961, 3263, 1137, 188, 1566, 7912, 14516, 6997, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
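
The tokenizer returns three parallel lists: `input_ids` (vocabulary indices), `token_type_ids` (segment markers, all 0 for a single sentence) and `attention_mask` (1 for real tokens). The mask only matters once sequences are padded to a common length. Below is a minimal plain-Python sketch of that padding step. `pad_batch` is a hypothetical helper, not a `transformers` function, and the pad id of 0 is an assumption that matches `[PAD]` in BERT vocabularies.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad lists of token ids to equal length and build attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # Real tokens keep their ids; the tail is filled with the pad id.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 101 and 102 are the [CLS] and [SEP] ids seen in the output above.
batch = pad_batch([[101, 1103, 2067, 102], [101, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

With the real tokenizer, `tokenizer(list_of_sentences, padding=True)` returns the same fields for a whole batch.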


In [17]:
tokens = tokenizer.convert_ids_to_tokens(tokens_input['input_ids'])
print(tokens)

['[CLS]', 'the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'", 's', 'new', '"', 'con', '##an', '"', 'and', 'that', 'he', "'", 's', 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'a', '##rno', '##ld', 's', '##ch', '##war', '##zen', '##eg', '##ger', ',', 'j', '##ean', '-', 'c', '##lau', '##d', 'van', 'dam', '##me', 'or', 's', '##te', '##ven', 'se', '##gal', '.', '[SEP]']
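
In the token list above, a `##` prefix marks a WordPiece continuation: the piece attaches to the token before it. A small sketch of how the pieces can be joined back into words, assuming only that convention (`merge_wordpieces` is a hypothetical helper, not part of `transformers`):

```python
def merge_wordpieces(tokens):
    """Join WordPiece continuation pieces ('##...') onto the preceding token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # drop the '##' marker and append
        else:
            words.append(tok)
    return words

# Pieces copied from the tokenizer output above.
pieces = ["a", "##rno", "##ld", "s", "##ch", "##war", "##zen", "##eg", "##ger"]
print(merge_wordpieces(pieces))
```

For real use, `tokenizer.convert_tokens_to_string(tokens)` or `tokenizer.decode(ids)` handles this along with special tokens and spacing.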


In [18]:
# Sentiment analysis with Pretrained Transformers
from datasets import load_dataset
from evaluate import evaluator, combine
from transformers import pipeline
import torch

In [19]:
device = 0 if torch.cuda.is_available() else -1
sentences = load_dataset("rotten_tomatoes", split="test").select(range(5))
for sentence in sentences["text"]:
    print(sentence)

lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
consistently clever and suspenseful .
it's like a " big chill " reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists .
the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .
red dragon " never cuts corners .


In [23]:
# Initialize the pipeline with a RoBERTa model fine-tuned on Rotten Tomatoes
roberta_pipeline = pipeline(
    "sentiment-analysis",
    model="textattack/roberta-base-rotten-tomatoes",
    device=device,
)


Some weights of the model checkpoint at textattack/roberta-base-rotten-tomatoes were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [28]:
# classification
text_list = list(sentences["text"])
predictions = roberta_pipeline(text_list)
print(predictions)

[{'label': 'LABEL_1', 'score': 0.9650675058364868}, {'label': 'LABEL_1', 'score': 0.9958857893943787}, {'label': 'LABEL_0', 'score': 0.9162591099739075}, {'label': 'LABEL_1', 'score': 0.9955574870109558}, {'label': 'LABEL_1', 'score': 0.8789991736412048}]


In [33]:
for idx, sentence in enumerate(text_list):
    print(f"Sentence: {sentence}")
    print(f"Actual: {sentences['label'][idx]}")
    print(f"Prediction: {predictions[idx]['label']}")
    print(f"Score: {predictions[idx]['score']}\n")

Sentence: lovingly photographed in the manner of a golden book sprung to life , stuart little 2 manages sweetness largely without stickiness .
Actual: 1
Prediction: LABEL_1
Score: 0.9650675058364868

Sentence: consistently clever and suspenseful .
Actual: 1
Prediction: LABEL_1
Score: 0.9958857893943787

Sentence: it's like a " big chill " reunion of the baader-meinhof gang , only these guys are more harmless pranksters than political activists .
Actual: 1
Prediction: LABEL_0
Score: 0.9162591099739075

Sentence: the story gives ample opportunity for large-scale action and suspense , which director shekhar kapur supplies with tremendous skill .
Actual: 1
Prediction: LABEL_1
Score: 0.9955574870109558

Sentence: red dragon " never cuts corners .
Actual: 1
Prediction: LABEL_1
Score: 0.8789991736412048
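
The pipeline reports generic `LABEL_0` / `LABEL_1` names while the dataset stores integer labels. The sketch below maps one onto the other and scores the five examples printed above. It assumes `LABEL_0` is `neg` and `LABEL_1` is `pos`, following the order in `ClassLabel(names=['neg', 'pos'])`; the model card is the place to confirm that.

```python
# Values copied from the output above.
predicted_labels = ["LABEL_1", "LABEL_1", "LABEL_0", "LABEL_1", "LABEL_1"]
actual = [1, 1, 1, 1, 1]

# Assumption: label order follows ClassLabel(names=['neg', 'pos']).
label_to_id = {"LABEL_0": 0, "LABEL_1": 1}
id_to_name = ["neg", "pos"]

predicted_ids = [label_to_id[p] for p in predicted_labels]
correct = sum(p == a for p, a in zip(predicted_ids, actual))
accuracy = correct / len(actual)

for p, a in zip(predicted_ids, actual):
    print(f"predicted={id_to_name[p]:<4} actual={id_to_name[a]}")
print(f"accuracy on these 5 examples: {accuracy:.2f}")
```

The single miss is the third sentence, which the model labels negative with a score of about 0.92.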



In [34]:
# Generate inference for the entire test set
test_dataset = load_dataset("rotten_tomatoes", split="test")
task_evaluator = evaluator("sentiment-analysis")

In [35]:
eval_results = task_evaluator.compute(
    model_or_pipeline=roberta_pipeline,
    data=test_dataset,
    metric=combine(["accuracy", "precision", "recall", "f1"]),
    label_mapping={"LABEL_0": 0, "LABEL_1": 1}
)

In [37]:
print("\n" + "="*60)
print("EVALUATION RESULTS")
print("="*60)
for key, value in eval_results.items():
    print(f"{key:.<30} {value:.4f}")
print("="*60)


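============================================================
EVALUATION RESULTS
============================================================
accuracy...................... 0.8874
precision..................... 0.9223
recall........................ 0.8462
f1............................ 0.8826
total_time_in_seconds......... 88.4583
samples_per_second............ 12.0509
latency_in_seconds............ 0.0830
============================================================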
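
The reported numbers are related to one another, which gives a quick sanity check. The sketch below recomputes F1 from precision and recall and recovers the number of evaluated examples from the timing fields. All inputs are copied from the output above and nothing is re-run.

```python
# Values copied from the evaluation output above.
precision = 0.9223
recall = 0.8462
total_time_in_seconds = 88.4583
samples_per_second = 12.0509

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Throughput times wall-clock time recovers the number of evaluated examples.
n_samples = round(samples_per_second * total_time_in_seconds)

# Latency is the average time spent per example.
latency = 1 / samples_per_second

print(f"f1 = {f1:.4f}")
print(f"samples = {n_samples}")
print(f"latency = {latency:.4f} s")
```

Precision being higher than recall suggests the model is somewhat conservative about predicting the positive class: its positive predictions are usually right, but it misses a share of the true positives.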


In [38]:
# zero-shot classification
from transformers import pipeline
import torch
device = 0 if torch.cuda.is_available() else -1

In [40]:
zero_shot_pipe = pipeline(
    model="facebook/bart-large-mnli", 
    task="zero-shot-classification", 
    device=device)

Device set to use cpu


In [41]:
result = zero_shot_pipe(
    "I am so hooked to NLP modeling that I work late every night.",
    candidate_labels=["technology", "gaming", "hobby", "art", "computer"]
)

In [43]:
print(result["sequence"])
for i, label in enumerate(result["labels"]):
    print(f"{label:.<15} {result['scores'][i]:.4f}")

I am so hooked to NLP modeling that I work late every night.
technology..... 0.6315
hobby.......... 0.2763
computer....... 0.0664
art............ 0.0166
gaming......... 0.0091
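
The five scores above sum to about 1. In the default single-label mode, the zero-shot pipeline turns each candidate label into an NLI hypothesis, takes the model's entailment score for each one, and normalizes across labels with a softmax. A plain-Python sketch of that last step follows. The logits are made up for illustration and are not the values BART produced; the hypothesis wording follows the pipeline's default template as I understand it.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["technology", "gaming", "hobby", "art", "computer"]

# One NLI hypothesis per candidate label.
hypotheses = [f"This example is {label}." for label in labels]

# Hypothetical entailment logits, one per hypothesis (illustrative only).
entailment_logits = [2.0, -2.2, 1.2, -1.6, -0.2]

scores = softmax(entailment_logits)
for label, score in sorted(zip(labels, scores), key=lambda pair: -pair[1]):
    print(f"{label:.<15} {score:.4f}")
```

Passing `multi_label=True` to the pipeline scores each label independently instead, so the scores no longer need to sum to 1.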


In [44]:
# Generating text
from transformers import pipeline
import torch
device = 0 if torch.cuda.is_available() else -1

In [60]:
text = "I think the next big breakthrough in AI will be"

In [61]:
generator = pipeline(
    "text-generation",
    model="gpt2",
    device=device
)

Device set to use cpu


In [62]:
generated_outputs = generator(
    text,
    max_length=50,
    num_return_sequences=5,
    num_beams=5,
    pad_token_id=50256  # to avoid warnings
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


In [63]:
for output in generated_outputs:
    print(output['generated_text'])

I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the next big breakthrough in AI will be in artificial intelligence. I think the 
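
The output above loops on a single sentence, a common failure mode of beam search. The `no_repeat_ngram_size` option used in the next cell counters it by forbidding any n-gram from appearing twice. The sketch below shows the idea on words in plain Python. It is an illustration of the rule, not the library's implementation, which works on token ids and masks the banned tokens' scores.

```python
def banned_next_tokens(generated, ngram_size):
    """Return tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

history = ("the next big breakthrough in AI will be in "
           "artificial intelligence . the next").split()

# "next big" already occurred, so "big" may not follow "next" again.
print(banned_next_tokens(history, ngram_size=2))
```

A size of 2 is strict: no pair of adjacent tokens may ever recur, which can also block legitimate repeats such as names.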

In [64]:
generated_outputs2 = generator(
    text,
    max_length=50,
    num_return_sequences=5,
    num_beams=5,
    no_repeat_ngram_size=2,
    pad_token_id=50256  # to avoid warnings
)
for output in generated_outputs2:
    print(output['generated_text'])

Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


In [67]:
generated_outputs2 = generator(
    text,
    max_length=500,
    num_return_sequences=1,
    num_beams=5,
    no_repeat_ngram_size=2,
    top_k=50,
    top_p=0.85,
    pad_token_id=50256  # to avoid warnings
)
for output in generated_outputs2:
    print(output['generated_text'])

Both `max_new_tokens` (=256) and `max_length`(=500) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


I think the next big breakthrough in AI will be in machine learning," he said.

"It's going to be very interesting to see how it evolves over time."
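
The last cell also sets `top_k=50` and `top_p=0.85`. These restrict which tokens may be sampled at each step: `top_k` keeps the k most likely tokens, and `top_p` keeps the smallest set whose cumulative probability reaches p. In `generate` they only take effect when sampling is enabled (`do_sample=True`); whether that holds for the cell above depends on the pipeline defaults of the installed version. A simplified sketch on a made-up distribution, not the library's code:

```python
def normalize(pairs):
    """Rescale (token, probability) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(token, p / total) for token, p in pairs]

def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Apply top-k, then top-p (nucleus) filtering to a token distribution."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    ranked = normalize(ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # smallest set whose mass reaches top_p
    return dict(normalize(kept))

# Made-up next-token distribution for illustration.
probs = {"machine": 0.5, "robotics": 0.25, "language": 0.125,
         "vision": 0.0625, "chess": 0.0625}
filtered = top_k_top_p_filter(probs, top_k=4, top_p=0.85)
print(filtered)
```

Here `top_k=4` drops `chess`, and `top_p=0.85` then trims the tail further so that sampling chooses among three candidates.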


In [68]:
# Language translation
from transformers import (
    T5Tokenizer, T5ForConditionalGeneration
)

In [70]:
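# T5Tokenizer needs the sentencepiece package (pip install sentencepiece)
tokenizer = T5Tokenizer.from_pretrained(
    "t5-base", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained("t5-base", return_dict=True)
# `device` is a pipeline-style index (0 or -1); model.to() needs a torch device
model = model.to("cuda" if torch.cuda.is_available() else "cpu")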

ImportError: 
T5Tokenizer requires the SentencePiece library but it was not found in your environment. Check out the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.
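
The error above comes from a missing optional dependency, not from the model code. A small pre-flight check, sketched below with the standard library only, makes the requirement explicit before the tokenizer is loaded. `has_module` is a hypothetical helper, and the example sentence is made up; the task prefix follows the convention T5 checkpoints were trained with.

```python
import importlib.util

def has_module(name):
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

if has_module("sentencepiece"):
    print("sentencepiece is available; T5Tokenizer can be loaded.")
else:
    print("sentencepiece is missing; run `pip install sentencepiece` "
          "and restart the runtime before loading T5Tokenizer.")

# T5 selects its task from a text prefix on the input.
prompt = "translate English to German: " + "The house is wonderful."
print(prompt)
```

Once the tokenizer loads, the prompt would be tokenized with `return_tensors="pt"`, passed to `model.generate`, and the result decoded with `tokenizer.decode(..., skip_special_tokens=True)`.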
