<a href="https://colab.research.google.com/github/rebeljel/joke_classifier/blob/main/Humor_Classification_wth_DistillBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Humour Detection**

The goal is to train a model that can distinguish between funny text and neutral text. The humor dataset can be downloaded from:
https://www.kaggle.com/datasets/deepcontractor/200k-short-texts-for-humor-detection

# **Project steps**


*   Download the dataset
*   Explore the data: info, value counts
*   Prepare the data
*   Split the data into training and test sets
*   Load the DistilBERT tokenizer
*   Tokenize the text
*   Choose DistilBERT as the base model for classification
*   Train on the training set
*   Evaluate the model on the test set
*   Save the model




In [None]:
!pip install transformers datasets evaluate huggingface_hub

In [None]:
# Data processing
import pandas as pd
import numpy as np

# Modeling
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification, get_scheduler

# Progress bar
from tqdm.auto import tqdm

# Hugging Face Dataset
from datasets import Dataset

# Model performance evaluation
import evaluate

In [None]:
# Colab offers a free GPU, which has to be enabled in the runtime settings.
# This cell checks whether a GPU is available; if not, the CPU is used.

import torch

# If there's a GPU available
if torch.cuda.is_available():

    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

No GPU available, using the CPU instead.


**Load data**

In [None]:
# Load Humor dataset

df = pd.read_csv('dataset.csv')
df.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


**Explore data**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    200000 non-null  object
 1   humor   200000 non-null  bool  
dtypes: bool(1), object(1)
memory usage: 1.7+ MB


In [None]:
# Rename column to label
df.rename(columns={"humor": "label"}, inplace=True)


In [None]:
df.describe()

Unnamed: 0,text,label
count,200000,200000
unique,200000,2
top,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
freq,1,100000


In [None]:
df['label'].value_counts()

False    100000
True     100000
Name: label, dtype: int64

The dataset has 200,000 data points and no missing values. It is balanced, with 100,000 examples for each of the two labels, False and True.
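
Because both classes are equally frequent, a classifier that always predicts the same label would reach 50% accuracy, which is the baseline a trained model has to beat. A minimal sketch of that check, using a hypothetical helper `majority_baseline_accuracy` (not part of the notebook) and the class counts reported above:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Class counts taken from the value_counts() output above
labels = [False] * 100_000 + [True] * 100_000
baseline = majority_baseline_accuracy(labels)
print(baseline)
```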


**Data Processing**

In [None]:
import re

# Remove special characters from text

def remove_special_characters(text):
    pat = r'[^a-zA-Z0-9]'
    return re.sub(pat, ' ', text)


df['text'] = df['text'].map(remove_special_characters)
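
A note on the character class: the ranges `A-Z` and `A-z` are not the same. `A-z` spans ASCII codes 65 to 122, which also covers the characters between `Z` and `a`, so brackets, the caret and the underscore would survive the cleaning. The sketch below illustrates the difference on a made-up sample string:

```python
import re

text = "snake_case [tag] x^2"

# 'A-z' spans ASCII 65-122, which also covers [ \ ] ^ _ and the backtick
loose = re.sub(r"[^a-zA-z0-9]", " ", text)

# 'A-Z' covers only the uppercase letters
strict = re.sub(r"[^a-zA-Z0-9]", " ", text)

print(repr(loose))
print(repr(strict))
```

With the loose pattern the sample string comes back unchanged, while the strict pattern replaces every non-alphanumeric character with a space.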

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Remove stopwords from text

stop = set(stopwords.words("english"))

def remove_stopwords(text):
    text = [word.lower() for word in text.split() if word.lower() not in stop]

    return " ".join(text)


df["text"] = df["text"].map(remove_stopwords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
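
To make the effect of this step concrete, here is a small sketch that applies the same idea to the visible part of one joke from the dataset. The stopword set `SAMPLE_STOPWORDS` is a hand-picked assumption for illustration, not NLTK's actual English list, which is much longer.

```python
# Hand-picked subset for illustration only; NLTK's English list is much longer
SAMPLE_STOPWORDS = {"what", "do", "you", "a", "its"}

def drop_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Lowercase the text and drop every token found in `stopwords`."""
    return " ".join(w for w in text.lower().split() if w not in stopwords)

cleaned = drop_stopwords("What do you call a turtle without its shell")
print(cleaned)
```

Note that this step removes function words that pretrained BERT-style models saw during pretraining, so it may be worth comparing results with and without it.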


In [None]:
# Convert df to Dataset

hf_dataset = Dataset.from_pandas(df)

In [None]:
hf_dataset[:10]

{'text': ['joe biden rules 2020 bid guys running',
  'watch darvish gave hitter whiplash slow pitch',
  'call turtle without shell dead',
  '5 reasons 2016 election feels personal',
  'pasco police shot mexican migrant behind new autopsy shows',
  'martha stewart tweets hideous food photo twitter responds accordingly',
  'pokemon master favorite kind pasta wartortellini',
  'native americans hate rains april brings mayflowers',
  'obama climate change legacy impressive imperfect vulnerable',
  'family tree cactus pricks'],
 'label': [False, False, True, False, False, False, True, True, False, True]}

**Split Dataset**

In [None]:
# Split dataset into train and test set

dataset = hf_dataset.train_test_split(test_size=0.2).class_encode_column("label")
dataset

Stringifying the column:   0%|          | 0/160000 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/160000 [00:00<?, ? examples/s]

Stringifying the column:   0%|          | 0/40000 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/40000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 160000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 40000
    })
})

In [None]:
train_X = dataset['train']
test_X = dataset['test']

# Length of the Dataset

print(f'The training dataset has {len(train_X)} records.')
print(f'The testing dataset has {len(test_X)} records.')

The training dataset has 160000 records.
The testing dataset has 40000 records.
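
The split above is drawn at random, so the rows that end up in each set can differ between runs unless a seed is fixed (`train_test_split` accepts a `seed` argument for this). The following is a plain-Python sketch of a seeded 80/20 split; `split_indices` is a hypothetical helper for illustration, not the library's implementation.

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off a test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    # First n_test shuffled indices form the test set, the rest the training set
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(200_000)
print(len(train_idx), len(test_idx))
```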


**Load Tokenizer and Model**

In [None]:
# Load the tokenizer of a pretrained model
# I chose DistilBERT, a lighter version of BERT

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenizer

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [None]:
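# Function to tokenize the text from the train and test datasets

def tokenize_dataset(data):
    # Return plain Python lists here. With return_tensors="pt" every row would
    # get an extra batch dimension; set_format("torch") converts to tensors later.
    return tokenizer.encode_plus(data["text"],
                     max_length=32,
                     truncation=True,
                     padding="max_length"
                     )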

# Tokenize the dataset
dataset_train = train_X.map(tokenize_dataset)
dataset_test = test_X.map(tokenize_dataset)

Map:   0%|          | 0/160000 [00:00<?, ? examples/s]

Map:   0%|          | 0/40000 [00:00<?, ? examples/s]

In [None]:
dataset_train

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 160000
})

In [None]:
print(dataset_train.features)
print(dataset_test)

{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['False', 'True'], id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 40000
})


In [None]:
# Take a look at the data

print(dataset_train.features)
print(dataset_test)

# At this point, the tokenized dataset contains the text, the label,
# the input_ids and the attention_mask.
# The input_ids are the text tokens converted to vocabulary ids.
# The attention_mask marks which tokens the model should attend to (1)
# and which it should ignore (0). Only the PAD tokens are masked out;
# the CLS and SEP special tokens are attended to.

{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['False', 'True'], id=None), 'input_ids': Sequence(feature=Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), length=-1, id=None)}
Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 40000
})
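
How padding and the attention mask fit together can be sketched in plain Python. The helper `pad_with_mask` below is hypothetical and simplified (a real tokenizer also keeps the `[SEP]` token when truncating); the token ids are those of the first training row printed further down in this notebook.

```python
PAD_ID = 0  # id of [PAD] in the distilbert-base-uncased vocabulary

def pad_with_mask(token_ids, max_length=32, pad_id=PAD_ID):
    """Truncate or right-pad `token_ids` and build the matching attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)            # 1 = real token, including [CLS] and [SEP]
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding

# 101 is [CLS], 102 is [SEP]
row = [101, 2047, 9789, 3945, 7344, 2166, 2464, 3565, 3011, 12504, 102]
input_ids, attention_mask = pad_with_mask(row)
print(input_ids)
print(attention_mask)
```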


In [None]:
# Formatting

# Remove the text column because it will not be used by the model
dataset_train = dataset_train.remove_columns(["text"])
dataset_test = dataset_test.remove_columns(["text"])

# Rename label to labels because the model expects the name labels
dataset_train = dataset_train.rename_column("label", "labels")
dataset_test = dataset_test.rename_column("label", "labels")

# Change the format to PyTorch tensors
dataset_train.set_format("torch")
dataset_test.set_format("torch")

# Take a look at the data
print(dataset_train)
print(dataset_test)

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 160000
})
Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 40000
})


In [None]:
dataset_train.features

{'labels': ClassLabel(names=['False', 'True'], id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [None]:
dataset_train

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 160000
})

In [None]:
dataset_train[0]

# The datatypes from the rows have been converted to tensors

{'labels': tensor(0),
 'input_ids': tensor([  101,  2047,  9789,  3945,  7344,  2166,  2464,  3565,  3011, 12504,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0])}

In [None]:
# Empty cache
torch.cuda.empty_cache()

# Feed the dataset into the torch DataLoader
train_dataloader = DataLoader(dataset=dataset_train, shuffle=True, batch_size=16)
test_dataloader = DataLoader(dataset=dataset_test, batch_size=16)

In [None]:
# Inspect a single batch only; the dataloader holds 10,000 batches
for batch in train_dataloader:
  print(batch)
  break

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Pretrained model

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

**Train Model**

In [None]:
# I based the number of epochs, learning rate and batch size on
# this fine-tuned DistilBERT model: https://huggingface.co/nsi319/distilbert-base-uncased-finetuned-app

# Number of epochs
num_epochs = 5

# Number of training steps
num_training_steps = num_epochs * len(train_dataloader)

# Optimizer
optimizer = AdamW(params=model.parameters(), lr=2e-5, weight_decay=0.01)

# Set up the learning rate scheduler
lr_scheduler = get_scheduler(name="linear",
                             optimizer=optimizer,
                             num_warmup_steps=0,
                             num_training_steps=num_training_steps)

# Use GPU if it is available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
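
The step count and the shape of the "linear" schedule configured above can be worked out by hand. The function `linear_lr` below is a sketch of a linear warmup and decay rule under the settings used in this notebook (batch size 16, 5 epochs, learning rate 2e-5, no warmup); it is an illustration, not the library's code.

```python
import math

def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` updates: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

steps_per_epoch = math.ceil(160_000 / 16)   # training rows / batch size
total_steps = 5 * steps_per_epoch           # epochs * batches per epoch

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 2e-5, total_steps))
```

The learning rate starts at its full value, is halved midway through training and reaches zero at the last step.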
 

In [None]:
# I first trained the model with the Hugging Face Trainer class.
# I then implemented a plain PyTorch training loop to make each
# step of the process visible.

# Set the progress bar
progress_bar = tqdm(range(num_training_steps))

model.train()

# Loop through the epochs
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # print(batch)
        # Compute the model output for the batch
        outputs = model(**batch)

        # Loss computed by the model
        loss = outputs.loss

        # Backpropagate the loss to compute the gradients
        loss.backward()

        # Update the model weights
        optimizer.step()

        # Advance the learning rate schedule
        lr_scheduler.step()

        # Clear the gradients
        optimizer.zero_grad()

        progress_bar.update(1)

    # Note: this is the loss of the last batch only, not the epoch average
    print(f"Epoch {epoch+1} loss: {loss.item():.5f}")



  0%|          | 0/50000 [00:00<?, ?it/s]

{'labels': tensor([0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0]), 'input_ids': tensor([[  101,  5962, 26406,  2667,  2425,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [  101,  6351,  4171,  2147,  3198,  2329,  3996,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [  101, 26101,  3348,  7817,  4987,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [  101,  5223, 14708,  2015,  9129,  4124, 19571,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0

KeyboardInterrupt: ignored
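
The loss printed at the end of each epoch in the loop above comes from a single batch, so it can be noisy. One option is to track the mean over all batches of the epoch. A minimal sketch, assuming a hypothetical `RunningMean` helper that would be fed `loss.item()` once per batch and reset at the start of each epoch:

```python
class RunningMean:
    """Accumulates values and reports their mean."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    @property
    def value(self):
        # Mean of all values seen so far; 0.0 before the first update
        return self.total / self.count if self.count else 0.0

# Stand-in batch losses for illustration
meter = RunningMean()
for batch_loss in [0.9, 0.5, 0.1]:
    meter.update(batch_loss)
print(f"mean loss: {meter.value:.5f}")
```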

**Evaluate Model**

In [None]:
metric1 = evaluate.load("accuracy")
metric2 = evaluate.load("f1")
metric3 = evaluate.load("recall")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

In [None]:
model.eval()

# Iterate over the test dataloader

for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    # Disable the gradient calculation
    with torch.no_grad():
        # Compute the model output
        outputs = model(**batch)
    logits = outputs.logits

    # Get the predicted probabilities for the batch
    predicted_prob = torch.softmax(logits, dim=1)

    # Get the predicted labels for the batch
    predictions = torch.argmax(logits, dim=-1)

    # Add the prediction batch to the evaluation metric
    metric1.add_batch(predictions=predictions, references=batch["labels"])
    metric2.add_batch(predictions=predictions, references=batch["labels"])
    metric3.add_batch(predictions=predictions, references=batch["labels"])

# Compute the metric
print(metric1.compute())
print(metric2.compute())
print(metric3.compute())

{'accuracy': 0.987575}
{'f1': 0.9874864667522724}
{'recall': 0.9834503510531595}
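
For reference, the three metrics can be derived from the counts of true and false positives and negatives. The sketch below uses a hypothetical helper `binary_metrics` and made-up toy predictions; it is not the `evaluate` library's implementation.

```python
def binary_metrics(predictions, references, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy data: 3 true positives, 1 false positive, 1 false negative, 3 true negatives
preds = [1, 1, 0, 0, 1, 0, 1, 0]
refs = [1, 0, 0, 0, 1, 1, 1, 0]
scores = binary_metrics(preds, refs)
print(scores)
```

Since F1 is the harmonic mean of precision and recall, an F1 score above the recall, as in the results above, implies that precision is slightly higher than recall.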


**Save Model / Load Model**

In [None]:
# The Model is available on Hugging Face:
# https://huggingface.co/r3b3lj3l/humor_classifier

In [None]:
# Save tokenizer
tokenizer.save_pretrained('./humor_classifier_tokenizer_pytorch/')

# Save model
model.save_pretrained('./humor_classifier_model_pytorch/')

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("./humor_classifier_tokenizer_pytorch/")

# Load model
loaded_model = AutoModelForSequenceClassification.from_pretrained('./humor_classifier_model_pytorch/')