# Text classification

Date: 10-04-2024

* Text classification is a common NLP task that assigns a label or class to a piece of text.
* One of the most popular forms of text classification is sentiment analysis, which assigns a label such as 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text.

In [1]:
pip install transformers datasets evaluate accelerate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl.metadata (29 kB)
Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0
Note: you may need to restart the kernel to use updated packages.


In [2]:

from kaggle_secrets import UserSecretsClient
huggingface_token = UserSecretsClient().get_secret("huggingface_token")

In [3]:
from huggingface_hub import login

login(token=huggingface_token)


Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [4]:
import torch

# Use a GPU if one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

**Load IMDb dataset**

In [5]:
from datasets import load_dataset

imdb = load_dataset("imdb")
imdb


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [6]:
imdb.pop('unsupervised')
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})

In [7]:
from datasets import concatenate_datasets

imdb = concatenate_datasets([imdb['train'], imdb['test']])
imdb

Dataset({
    features: ['text', 'label'],
    num_rows: 50000
})

In [8]:
imdb.push_to_hub("vishnun0027/imdb_dataset", private=False)


CommitInfo(commit_url='https://huggingface.co/datasets/vishnun0027/imdb_dataset/commit/8e18916dae8ecff0e9dbd6fd576a4cb65bda96bf', commit_message='Upload dataset', commit_description='', oid='8e18916dae8ecff0e9dbd6fd576a4cb65bda96bf', pr_url=None, pr_revision=None, pr_num=None)

In [9]:
imdb = imdb.train_test_split(test_size=0.2)

In [10]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 40000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 10000
    })
})
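
The split above is unseeded, so each run produces a different partition and `imdb["test"][100]` will generally return a different review. `train_test_split` shuffles by default and accepts a `seed` argument for a repeatable split. The sketch below illustrates the idea in plain Python; the helper `train_test_split_indices` and the seed value are hypothetical and are not the `datasets` implementation.

```python
import random


def train_test_split_indices(num_rows, test_size, seed):
    """Shuffle row indices with a fixed seed, then cut off the test portion."""
    indices = list(range(num_rows))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_rows * test_size))
    return indices[num_test:], indices[:num_test]


# 50,000 rows with test_size=0.2, matching the sizes shown above.
train_idx, test_idx = train_test_split_indices(50_000, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Calling the helper again with the same seed returns the same two index lists, which is what makes later cells reproducible.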

In [11]:
imdb["test"][100]

{'text': "The original Lensman series of novels is a classic of the genre. It's pure adventure SF with some substance (here and there) and I've always wondered why Hollywood hasn't filmed it verbatim because it's just the kind of thing they love: massive explosions, super-weapons, uber-heroics, hero gets the girl, aliens (great CGI potential), good versus evil in the purest form, etc etc. Instead (and bear in mind I'm a Japan-o-phile and anime lover) we get this horrendous kiddies movie that rips the guts out of the story, mixes in Star-Wars (ironic as the latter ripped off the books occasionally) pastiches and dumbs the whole thing down to 'Thundercats' level. To see Kimball Kinnison, the epitome of the Galactic Patrol officer and second stage Lensman portrayed as a small boy is pitiful (etc). I just can't understand why the makers did this because they obviously had the rights to the story and could have made far more money (FAR!) by telling straight. It makes no sense.",
 'label': 0}

**Preprocess**

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")


In [13]:
def preprocess_function(examples):
    return tokenizer(examples["text"], max_length=512, truncation=True)

In [14]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)


In [15]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

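
Because the preprocessing step truncates but does not pad, the examples still have different lengths. `DataCollatorWithPadding` pads each batch to the length of its longest sequence when the batch is assembled, which avoids padding every review to 512 tokens. A minimal sketch of that idea in plain Python follows; `pad_batch`, the token ids, and `pad_id=0` are illustrative assumptions, not the collator's actual code.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence in one batch to the longest sequence in that batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token id sequences of different lengths.
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```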


**Evaluate**

In [16]:
import evaluate

accuracy = evaluate.load("accuracy")


In [17]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
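
To make the metric concrete, here is a small worked example of what `compute_metrics` calculates: the index of the largest logit in each row becomes the predicted class, and accuracy is the fraction of predictions that match the labels. It is written in plain Python so it runs without any downloads; the helper names, logits, and labels are made up for illustration.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_score(predictions, references):
    """Fraction of predictions that equal the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Four made-up examples with two classes (0 = NEGATIVE, 1 = POSITIVE).
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]
labels = [0, 1, 0, 0]

predictions = argmax_rows(logits)
result = {"accuracy": accuracy_score(predictions, labels)}
print(predictions, result)
```

Three of the four predictions match the labels, so the result has the same shape as the dictionary returned by the `accuracy` metric, with a value of 0.75.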

**Train**

In [18]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [19]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
training_args = TrainingArguments(
    output_dir="vishnun0027/Text_classification_model_10042024",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.push_to_hub()





| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.2125        | 0.183901        | 0.9307   |
| 2     | 0.1318        | 0.180242        | 0.9366   |
| 3     | 0.0762        | 0.228744        | 0.9373   |




CommitInfo(commit_url='https://huggingface.co/vishnun0027/Text_classification_model_10042024/commit/c0f4fa9ca9ed1a4fd1ce7c6607948c744e21cb31', commit_message='End of training', commit_description='', oid='c0f4fa9ca9ed1a4fd1ce7c6607948c744e21cb31', pr_url=None, pr_revision=None, pr_num=None)
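
The results show validation loss rising in epoch 3 while accuracy still improves slightly. With `load_best_model_at_end=True` and no `metric_for_best_model` set, the `Trainer` documentation says checkpoints are compared by evaluation loss, so the model reloaded at the end is most likely the epoch 2 checkpoint rather than the epoch 3 one. The sketch below reproduces that comparison on the reported numbers; the `eval_history` list and its key names are assumptions made for this illustration.

```python
# Evaluation results copied from the training log above.
eval_history = [
    {"epoch": 1, "eval_loss": 0.183901, "eval_accuracy": 0.9307},
    {"epoch": 2, "eval_loss": 0.180242, "eval_accuracy": 0.9366},
    {"epoch": 3, "eval_loss": 0.228744, "eval_accuracy": 0.9373},
]

# Lower is better for loss; higher is better for accuracy.
best_by_loss = min(eval_history, key=lambda row: row["eval_loss"])
best_by_accuracy = max(eval_history, key=lambda row: row["eval_accuracy"])

print("best by loss: epoch", best_by_loss["epoch"])
print("best by accuracy: epoch", best_by_accuracy["epoch"])
```

If accuracy should drive checkpoint selection instead, passing `metric_for_best_model="accuracy"` to `TrainingArguments` is the documented way to request that.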

**Inference**

In [21]:
import torch

In [22]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

In [23]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vishnun0027/Text_classification_model_10042024")
inputs = tokenizer(text, return_tensors="pt")

In [24]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("vishnun0027/Text_classification_model_10042024")
with torch.no_grad():
    logits = model(**inputs).logits

In [25]:
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

'POSITIVE'
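
The `argmax` above returns only the winning class. To see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below shows the calculation in plain Python; the logit values are made up for illustration and are not the model's actual output for the review above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits for one review: [NEGATIVE score, POSITIVE score].
example_logits = [-2.0, 2.0]
probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted], round(probs[predicted], 3))
```

With real model output, the equivalent PyTorch call would be `torch.softmax(logits, dim=-1)`.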