## Workflow

- [Load Dataset](#load-dataset)
- [Preprocess](#preprocess)
- [Prepare Data](#prepare-data)
- [Build Model](#build-model)
- [Train Model](#train-model)
- [Generate Jokes](#generate-jokes)
- [Evaluate & Tune](#evaluate--tune)
- [Save & Deploy](#save--deploy)



In [1]:
from datasets import load_dataset
import pandas as pd
import os
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns

df = load_dataset("diwank/good_joke-dataset")
df = df["train"].to_pandas() # since we are only interested in the training set, then convert it to pandas df

  from .autonotebook import tqdm as notebook_tqdm


### Load Dataset
Use HuggingFace `datasets` library to load `diwank/good_joke-dataset`.


In [2]:
print("Dataset shape:", df.shape)
df.info()

Dataset shape: (20045, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20045 entries, 0 to 20044
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   source    20045 non-null  object 
 1   body      20045 non-null  object 
 2   title     20045 non-null  object 
 3   category  20045 non-null  object 
 4   rating    20045 non-null  float64
dtypes: float64(1), object(4)
memory usage: 783.1+ KB


In [3]:
df.head()

Unnamed: 0,source,body,title,category,rating
0,wocka,What do you call a cow with no legs?\nGround B...,Cow With No Legs,Animal,4.0
1,wocka,What's black and white and red all over?\nA ne...,"Black, White and Red",Other / Misc,4.0
2,wocka,There's this dyslexic guy... he walked into a ...,Into the Bar,Bar,4.0
3,wocka,There was a dyslexic insomniac agnostic.\nHe l...,Pondering the afterlife,One Liners,4.0
4,wocka,What do you call 500 lawyers at the bottom of ...,500 Lawyers,Lawyer,4.0


In [4]:
df['category'].value_counts()

category
                   19451
Insults              109
One Liners            96
Miscellaneous         66
Yo Mama               43
Other / Misc          38
Yo Momma              33
Light Bulbs           27
Puns                  26
Animal                18
Lightbulb             11
Medical               11
Knock-Knock           10
Redneck                9
Bar                    9
Religious              8
Children               7
Sex                    6
Political              6
Lawyer                 5
Sports                 5
Men / Women            5
Computers              4
Tech                   3
Women                  3
Blond                  3
At Work                3
Blonde Jokes           3
Deep Thoughts          3
Heaven and Hell        3
Family, Parents        3
Animals                2
Men                    2
College                2
Gross                  2
Military               2
Police Jokes           2
News / Politics        1
Idiots                 1
Business        

In [5]:
# get the top 1 categories
top_categories = df['category'].value_counts().head(1).index.tolist()
print("Top categories:", top_categories)

Top categories: ['']


In [10]:
df['title'].value_counts().head(10)

title
                                                           286
Why did the chicken cross the road?                         15
Knock knock                                                 13
How do you make holy water?                                 12
What did the leper say to the prostitute?                   11
What do you call a cow with no legs?                        11
How many feminists does it take to change a light bulb?     10
How many tickles does it take to make an octopus laugh?      9
Why was six afraid of seven?                                 9
Knock Knock                                                  9
Name: count, dtype: int64

### Preprocess
- Clean text (lowercase, strip extra spaces).
- Tokenize jokes (split into sequences).
- Build vocabulary & encode tokens to integers.


In [11]:
import re

class JokePreprocessor:
    def __init__(self):
        self.vocab = {}
        self.token2idx = {}
        self.idx2token = {}

    def clean_text(self, text):
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
        text = re.sub(r'\n', '', text)  # Remove newlines
        text = re.sub(r'\t+', ' ', text)  # Remove tabs
        text = re.sub(r'\s+', ' ', text)  # Remove multiple spaces
        text = re.sub(r'\s+$', '', text)  # Remove trailing spaces
        text = re.sub(r'^\s+', '', text)  # Remove leading spaces
        return text.strip()

    def tokenize(self, text):
        # Simple whitespace tokenizer
        return text.split()

    def build_vocab(self, texts):
        # Build vocabulary from a list of texts
        tokens = set()
        for text in texts:
            tokens.update(self.tokenize(self.clean_text(text)))
        self.vocab = sorted(tokens)
        self.token2idx = {token: idx for idx, token in enumerate(self.vocab)}
        self.idx2token = {idx: token for token, idx in self.token2idx.items()} 

    def encode(self, text):
        # Encode text to list of token indices
        tokens = self.tokenize(self.clean_text(text))
        return [self.token2idx[token] for token in tokens if token in self.token2idx]

    def decode(self, indices):
        # Decode list of indices to text
        return ' '.join([self.idx2token[idx] for idx in indices if idx in self.idx2token])

In [15]:
# show full body of jokes
pd.set_option('display.max_colwidth', None)
df['body'].head()

0                                                        What do you call a cow with no legs?\nGround Beef!
1                                                    What's black and white and red all over?\nA newspaper.
2                                                      There's this dyslexic guy... he walked into a bra...
3    There was a dyslexic insomniac agnostic.\nHe laid awake all night wondering if there really was a Dog.
4                                          What do you call 500 lawyers at the bottom of the sea?\nA start.
Name: body, dtype: object

In [18]:
# Instantiate preprocessor and clean the 'body' column
joke_preprocessor = JokePreprocessor()
df['body_clean'] = df['body'].apply(joke_preprocessor.clean_text)

# Build vocabulary on cleaned jokes
joke_preprocessor.build_vocab(df['body_clean'])

# Show a sample of cleaned data
df[['body', 'body_clean']].head()

Unnamed: 0,body,body_clean
0,What do you call a cow with no legs?\nGround Beef!,what do you call a cow with no legs? ground beef!
1,What's black and white and red all over?\nA newspaper.,what's black and white and red all over? a newspaper.
2,There's this dyslexic guy... he walked into a bra...,there's this dyslexic guy... he walked into a bra...
3,There was a dyslexic insomniac agnostic.\nHe laid awake all night wondering if there really was a Dog.,there was a dyslexic insomniac agnostic. he laid awake all night wondering if there really was a dog.
4,What do you call 500 lawyers at the bottom of the sea?\nA start.,what do you call 500 lawyers at the bottom of the sea? a start.


### Prepare Data
- Convert cleaned jokes to sequences of token indices.
- Prepare input and target sequences for training (next-word prediction).

In [20]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import numpy as np

# Convert cleaned jokes to sequences of token indices
sequences = [joke_preprocessor.encode(joke) for joke in df['body_clean'] if len(joke_preprocessor.encode(joke)) > 1] # Filter out empty jokes

# Prepare input and target sequences
input_seqs = []
target_seqs = []
for seq in sequences: 
    for i in range(1, len(seq)):
        input_seqs.append(seq[:i])
        target_seqs.append(seq[i])

# Pad input sequences
max_seq_len = max([len(seq) for seq in input_seqs])
input_seqs_padded = pad_sequences(input_seqs, maxlen=max_seq_len, padding='pre') #make pre-padding for the input sequences that has a less length than the max sequence length
target_seqs = np.array(target_seqs)

# Split into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(input_seqs_padded, target_seqs, test_size=0.1, random_state=42)
vocab_size = len(joke_preprocessor.vocab)
print("Vocabulary size:", vocab_size)
print("Max sequence length:", max_seq_len)

2025-06-28 21:23:40.779420: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-28 21:23:40.912534: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1751135020.948607   15040 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1751135020.959498   15040 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1751135021.040430   15040 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

Vocabulary size: 25522
Max sequence length: 34


### Build Model
- Use a simple RNN (LSTM) for next-word prediction.

In [23]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

embedding_dim = 64
hidden_units = 128

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_seq_len),
    LSTM(hidden_units),
    Dense(vocab_size, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])



### Train Model

In [22]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    batch_size=128
)

Epoch 1/10


I0000 00:00:1751135225.387242   15568 cuda_dnn.cc:529] Loaded cuDNN version 90701


[1m1025/1025[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 10ms/step - accuracy: 0.0357 - loss: 8.0158 - val_accuracy: 0.0436 - val_loss: 7.5466
Epoch 2/10
[1m1025/1025[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 10ms/step - accuracy: 0.0538 - loss: 7.1213 - val_accuracy: 0.0764 - val_loss: 7.3168
Epoch 3/10
[1m1025/1025[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 10ms/step - accuracy: 0.0851 - loss: 6.6236 - val_accuracy: 0.1013 - val_loss: 7.1577
Epoch 4/10
[1m1025/1025[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 10ms/step - accuracy: 0.1154 - loss: 6.1628 - val_accuracy: 0.1144 - val_loss: 7.1094
Epoch 5/10
[1m1025/1025[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 10ms/step - accuracy: 0.1340 - loss: 5.7707 - val_accuracy: 0.1221 - val_loss: 7.1411
Epoch 6/10
[1m1025/1025[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 9ms/step - accuracy: 0.1513 - loss: 5.4200 - val_accuracy: 0.1301 - val_loss: 7.1833
Epoch 7/10
[1m1

In [24]:
model.summary()

### Generate Jokes
- Function to generate joke completion given a prompt.

In [None]:
def generate_joke(prompt, max_gen_len=30):
    # Clean and encode the prompt
    cleaned = joke_preprocessor.clean_text(prompt)
    encoded = joke_preprocessor.encode(cleaned)
    for _ in range(max_gen_len):
        padded = pad_sequences([encoded], maxlen=max_seq_len, padding='pre')
        pred_probs = model.predict(padded, verbose=0)[0]
        next_idx = np.argmax(pred_probs)
        encoded.append(next_idx)
        # Stop if end of joke (optional: if next_idx is for a special token)
    return joke_preprocessor.decode(encoded)

### Try Joke Completion!
- Enter a joke prompt and let the model complete it.

In [None]:
prompt = "Knock knock"
completed_joke = generate_joke(prompt)
print("Prompt:", prompt)
print("Generated joke:", completed_joke)