In [1]:
from transformers import pipeline

In [2]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
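
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model name and revision hash reported in the message; the helper name `build_sentiment_classifier` is hypothetical, and calling it downloads the checkpoint on first use.

```python
# Model name and revision copied from the warning message above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_sentiment_classifier():
    """Hypothetical helper: build the pipeline with a pinned model and revision."""
    # Imported lazily so the constants can be reused without loading transformers.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model=MODEL_NAME,
        revision=MODEL_REVISION,
    )
```

Pinning both values keeps results reproducible if the default checkpoint for the task changes later.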


In [3]:
classifier("I am using pipelines to access the HF tansformer library")

[{'label': 'NEGATIVE', 'score': 0.998201847076416}]

In [4]:
classifier("My interest in coding is opposite of dislike")

[{'label': 'POSITIVE', 'score': 0.9411544799804688}]

In [5]:
classifier("My interest in coding is opposite of like")

[{'label': 'NEGATIVE', 'score': 0.9859604835510254}]

In [6]:
classifier("My interest in coding is opposite of anti like")

[{'label': 'NEGATIVE', 'score': 0.9812062382698059}]
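
The `score` in each result is the probability assigned to the predicted label. A rough sketch of how such a score can be derived from raw model outputs, assuming the pipeline applies a softmax over two class logits; the logit values below are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the two classes, in the order (NEGATIVE, POSITIVE).
labels = ["NEGATIVE", "POSITIVE"]
logits = [2.0, -1.5]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": labels[best], "score": probs[best]})
```

A high score therefore only says the model strongly prefers one class over the other; as the negation examples above show, it does not mean the prediction is right.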

In [7]:
### Passing more than one input

In [8]:
results = classifier(
    [
        "Beauty is in the eyes of the beholder",
        "Its not binary",
        "You can be decent and gifted at the same time"
    ])

In [9]:
for result in results:
    print(f"label: {result['label']} with score: {round(result['score'], 4)}")

label: POSITIVE with score: 0.9998
label: POSITIVE with score: 0.7042
label: POSITIVE with score: 0.9999


### Automatic Speech Recognition Tasks

In [2]:
import torch
from transformers import pipeline

In [3]:
speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
### Loading an audio dataset
from datasets import load_dataset, Audio

In [9]:
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")

Downloading and preparing dataset minds14/en-US to /home/cerebrum/.cache/huggingface/datasets/PolyAI___minds14/en-US/1.0.0/aa40414f15e0f919231d617440192034af844835dc1e6a697f4b552e0551fd26...


Generating train split: 0 examples [00:00, ? examples/s]

Dataset minds14 downloaded and prepared to /home/cerebrum/.cache/huggingface/datasets/PolyAI___minds14/en-US/1.0.0/aa40414f15e0f919231d617440192034af844835dc1e6a697f4b552e0551fd26. Subsequent calls will reuse this data.


In [None]:
### Make sure that the sampling rate of the dataset matches the sampling rate the model was trained on.

In [10]:
dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))
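
To illustrate what changing the sampling rate means, here is a small sketch that resamples a signal by linear interpolation. This is for intuition only: real resamplers use band-limited filters, and this is not how `datasets` implements it. The function name `resample_linear` and the 8 kHz and 16 kHz rates are assumptions chosen for the example.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Resample a 1-D signal by linear interpolation (illustration only)."""
    if orig_rate == target_rate or len(samples) < 2:
        return list(samples)
    # The clip keeps its duration, so the sample count scales with the rate.
    n_out = int(round(len(samples) * target_rate / orig_rate))
    out = []
    for i in range(n_out):
        # Position of output sample i on the original time axis.
        pos = i * orig_rate / target_rate
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of a hypothetical 8 kHz ramp signal, upsampled to 16 kHz.
signal_8k = [i / 8000 for i in range(8000)]
signal_16k = resample_linear(signal_8k, orig_rate=8000, target_rate=16000)
print(len(signal_8k), len(signal_16k))
```

The duration stays at one second while the number of samples doubles, which is why a model trained at one rate misreads audio stored at another.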

In [None]:
### The audio files are automatically loaded and resampled when accessing the audio column.

In [12]:
result = speech_recognizer(dataset[:4]["audio"])

In [15]:
for res in result:
    print(res['text'])

I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT
FODING HOW I'D SET UP A JOIN TO HET WITH MY WIFE AND WHERE THE AP MIGHT BE
I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE AP SO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AND I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS
HOW DO I THURN A JOIN A COUNT
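
The `Wav2Vec2ForCTC` class named in the earlier warning is a CTC model, which predicts one symbol per audio frame. A minimal sketch of greedy CTC decoding, which collapses repeated symbols and drops blanks; the frame predictions, the `<pad>` blank symbol, and the `|` word delimiter are assumptions made for this illustration.

```python
def ctc_greedy_decode(frame_tokens, blank="<pad>", word_delimiter="|"):
    """Collapse repeated frame predictions and drop blanks (greedy CTC)."""
    decoded = []
    previous = None
    for token in frame_tokens:
        # Keep a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token != previous and token != blank:
            decoded.append(token)
        previous = token
    return "".join(decoded).replace(word_delimiter, " ").strip()


# Hypothetical per-frame predictions for the words "HOW DO".
frames = ["H", "H", "<pad>", "O", "W", "W", "|", "|", "<pad>", "D", "<pad>", "O", "O"]
print(ctc_greedy_decode(frames))
```

Because decoding works character by character with no language model, misspellings such as those in the transcripts above are a typical failure mode.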


In [16]:
### AutoTokenizer

In [17]:
from transformers import AutoTokenizer

In [18]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

In [19]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [21]:
encoding = tokenizer("Show Avengers End Game on Disney Plus")
print(encoding)
encoding = tokenizer("What is the time?")
print(encoding)

{'input_ids': [101, 11391, 49766, 11421, 11336, 10125, 15258, 10608, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 11523, 10127, 10103, 10573, 136, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
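
Both encodings start with 101 and end with 102, which in BERT-style vocabularies are typically the `[CLS]` and `[SEP]` special tokens. The sketch below imitates that behaviour with a greedy longest-match-first sub-word split. The vocabulary is a toy one: the ids of ordinary tokens are made up, and the real tokenizer does more normalization than this.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-words by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, and wrap the ids in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    input_ids = [vocab[t] for t in tokens]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


# Toy vocabulary: the ids of ordinary tokens are made up for illustration.
toy_vocab = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "what": 1, "is": 2, "the": 3, "time": 4, "##s": 5}
print(encode("What is the times", toy_vocab))
```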


In [22]:
# Tokenizing a batch of inputs with padding

In [23]:
pt_batch = tokenizer(
    ["Happy to use the transformer lib", "What is the time"],
    padding=True,
    truncation=True,  # truncate inputs longer than max_length
    max_length=512,
    return_tensors="pt"
)
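
A plain-Python sketch of what padding a batch to its longest sequence involves, including the attention mask that marks padded positions. The token ids are taken from the two encodings printed earlier in this notebook; using 0 as the padding id is an assumption.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids from the two encodings printed earlier (9 and 7 tokens).
batch = pad_batch([
    [101, 11391, 49766, 11421, 11336, 10125, 15258, 10608, 102],
    [101, 11523, 10127, 10103, 10573, 136, 102],
])
print(batch["attention_mask"][1])
```

After padding, every row has the same length, which is what allows the batch to be returned as a single rectangular tensor.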



In [None]:
Note: We can also use AutoModel to load a pretrained model, similar to AutoTokenizer.

In [24]:
### Custom Model Builds

We can modify a model's configuration to change how the model is built.
The configuration specifies a model's attributes, e.g. the number of hidden layers or attention heads.

A model initialized from a custom config has randomly initialized weights, so it must be trained from scratch before it is useful.

In [25]:
from transformers import AutoConfig

In [26]:
my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12)

In [28]:
# Creating a model using custom configuration
from transformers import AutoModel

my_model = AutoModel.from_config(my_config)
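
When changing the number of attention heads, the hidden size generally has to be divisible by it, since multi-head attention splits the hidden vector evenly across heads. A small sketch of that check, assuming a hidden size of 768 for `distilbert-base-uncased`; the helper name `head_dim` is hypothetical.

```python
def head_dim(hidden_dim, n_heads):
    """Return the per-head size, or raise if the heads do not divide evenly."""
    if hidden_dim % n_heads != 0:
        raise ValueError(
            f"hidden size {hidden_dim} is not divisible by {n_heads} heads"
        )
    return hidden_dim // n_heads


# 768 is assumed here as the hidden size of distilbert-base-uncased.
print(head_dim(768, 12))  # 64
```

With these numbers each of the 12 heads works on a 64-dimensional slice; a value such as 10 heads would be rejected.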

In [None]:
### All models in HF inherit from torch.nn.Module and hence can be used in a typical training loop.
# Transformers provides a Trainer class for PyTorch which supports features like
# distributed training, mixed precision, etc.

In [31]:
# Step 1: Selecting a pretrained model
from transformers import AutoModelForSequenceClassification

# Note: without num_labels, the classification head defaults to 2 output classes.
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier

In [117]:
# Step 2: Setting training arguments
### Training arguments: we pass the training hyperparameters such as learning rate, batch size, and number of epochs.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = "media/cerebrum/HDD1/Huggingface_Data",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [35]:
# Step 3: Since the model is DistilBERT, we need a preprocessing class; for text this is a tokenizer (a feature extractor plays the same role for audio or images)

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
# Step 4: Preprocess the test and training data, since the input text must be converted into token ids for the transformer model

In [40]:
from datasets import concatenate_datasets, load_dataset

dataset = load_dataset("indonlu", 'emot')

Downloading and preparing dataset indonlu/emot (download: 821.21 KiB, generated: 835.31 KiB, post-processed: Unknown size, total: 1.62 MiB) to /home/cerebrum/.cache/huggingface/datasets/indonlu/emot/1.0.0/0a83b181cd831cd5d9c15ffe39f3be76af23407eba2c902bccca53fa905d68af...


Generating train split:   0%|          | 0/3521 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/440 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/440 [00:00<?, ? examples/s]

Dataset indonlu downloaded and prepared to /home/cerebrum/.cache/huggingface/datasets/indonlu/emot/1.0.0/0a83b181cd831cd5d9c15ffe39f3be76af23407eba2c902bccca53fa905d68af. Subsequent calls will reuse this data.


In [None]:
Apply the tokenizer over the entire dataset with map. Since this notebook uses PyTorch, the tokenized dataset is passed to the Trainer directly; prepare_tf_dataset() is only needed for TensorFlow models.
The batch size is set through the training arguments.

In [92]:
def tokenize_dataset(examples):
    # Pad or truncate every tweet to exactly 90 tokens
    return tokenizer(examples['tweet'], padding="max_length", max_length=90, truncation=True, return_tensors="pt")

In [93]:
for x in dataset["train"]:
    print(tokenize_dataset(x)['input_ids'].shape)
    break

torch.Size([1, 90])
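
The shape above is fixed at 90 because `padding="max_length"` and `truncation=True` together force every example to the same length. A plain-Python sketch of that rule, assuming 0 as the padding id; note that the real tokenizer also keeps the closing special token when it truncates, which this simplified version does not.

```python
def pad_or_truncate(ids, max_length=90, pad_id=0):
    """Force a token-id list to exactly max_length entries."""
    ids = ids[:max_length]  # truncation: cut anything past max_length
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad  # padding: fill the remainder


# Hypothetical inputs: one very short and one overly long sequence.
short_ids = pad_or_truncate([101, 2023, 102])
long_ids = pad_or_truncate(list(range(200)))
print(len(short_ids), len(long_ids))
```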


In [130]:
train_dataset = dataset["train"].map(tokenize_dataset, batched=True)

Loading cached processed dataset at /home/cerebrum/.cache/huggingface/datasets/indonlu/emot/1.0.0/0a83b181cd831cd5d9c15ffe39f3be76af23407eba2c902bccca53fa905d68af/cache-36052c3fba3548f7.arrow


In [131]:
test_dataset = dataset["test"].map(tokenize_dataset, batched=True)

Loading cached processed dataset at /home/cerebrum/.cache/huggingface/datasets/indonlu/emot/1.0.0/0a83b181cd831cd5d9c15ffe39f3be76af23407eba2c902bccca53fa905d68af/cache-d3d827d7ff3ad873.arrow


In [44]:
### Step 5: Initializing a data collator to build batches of examples from the dataset

In [115]:
from transformers import DefaultDataCollator

In [121]:
data_collator = DefaultDataCollator()
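
Roughly, a default collator turns a list of per-example dictionaries into one dictionary of batched values and stores the target under `labels`. The sketch below shows that idea with plain lists; the real collator returns tensors and handles more cases, and the example records are hypothetical.

```python
def collate(features):
    """Turn a list of per-example dicts into one dict of batched lists."""
    batch = {}
    for key in features[0]:
        # Models expect the target under "labels", so rename "label".
        out_key = "labels" if key == "label" else key
        batch[out_key] = [example[key] for example in features]
    return batch


# Two hypothetical tokenized examples.
examples = [
    {"input_ids": [101, 7, 102], "attention_mask": [1, 1, 1], "label": 3},
    {"input_ids": [101, 9, 102], "attention_mask": [1, 1, 1], "label": 0},
]
print(collate(examples))
```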

Now that we have the data collator, we pass the model, training arguments, datasets, tokenizer, and collator to the Trainer.

In [135]:
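import os
# Make CUDA errors synchronous for debugging; this only takes effect if set before CUDA is initialized
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

from transformers import Trainer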

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,
    data_collator = data_collator
)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

In [120]:
# Calling the train method to train
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: tweet. If tweet are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3521
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 442
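
The step count in this log can be reproduced by hand. A short worked calculation using the numbers printed above, assuming the last partial batch of each epoch is kept and that the total batch size of 16 comes from 8 examples on each of 2 devices.

```python
import math

num_examples = 3521
total_batch_size = 16  # 8 per device, which implies 2 devices in this run
num_epochs = 2

# Each epoch needs enough steps to cover every example once.
steps_per_epoch = math.ceil(num_examples / total_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 221 442
```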


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
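
A device-side assert during classification fine-tuning is often caused by a label value that falls outside the range of the model's output classes, although other causes are possible. A quick check that could be run on the dataset labels before training; the helper name and the label values are hypothetical.

```python
def find_invalid_labels(labels, num_labels):
    """Return the label values that fall outside [0, num_labels)."""
    return sorted({y for y in labels if not 0 <= y < num_labels})


# Hypothetical labels from a 5-class dataset checked against a 2-class head.
labels = [0, 1, 4, 2, 3, 1, 0]
print(find_invalid_labels(labels, num_labels=2))  # [2, 3, 4]
print(find_invalid_labels(labels, num_labels=5))  # []
```

If the first call reports values, loading the model with a matching `num_labels` is the usual remedy. Running one batch on CPU also tends to give a clearer error message.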

In [None]:
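from torch.utils.data import DataLoader

# Drop the raw text column and return PyTorch tensors so batches can be collated
small_train_dataset = train_dataset.remove_columns(["tweet"]).rename_column("label", "labels").with_format("torch")
small_eval_dataset = test_dataset.remove_columns(["tweet"]).rename_column("label", "labels").with_format("torch")

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)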
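
Conceptually, a data loader with `shuffle=True` and `batch_size=8` shuffles the example indices once per pass and then hands them out in groups of 8. A plain-Python sketch of that idea; the function name, the seed, and the dataset size of 20 are assumptions for illustration, and the real `DataLoader` also collates the examples into tensors.

```python
import random


def iterate_batches(n_examples, batch_size, shuffle=False, seed=0):
    """Yield lists of example indices, one list per batch."""
    indices = list(range(n_examples))
    if shuffle:
        random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]


batches = list(iterate_batches(n_examples=20, batch_size=8, shuffle=True))
print([len(b) for b in batches])  # [8, 8, 4]
```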