# Joke Generation Bot (Decoder-Based Transformer)

## Collecting the Dataset

Dataset Resource: https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes

In [None]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from datasets import load_dataset
import pandas as pd

# Load the dataset from Hugging Face
dataset = load_dataset('socialgrep/one-million-reddit-jokes')



  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
# Convert the training split to a Pandas DataFrame
# (to_pandas() avoids iterating over one million rows in Python)
df = dataset['train'].to_pandas()

# Print the first few rows of the DataFrame
print(df.head())

   type      id subreddit.id subreddit.name  subreddit.nsfw  created_utc  \
0  post  ftbp1i        2qh72          jokes           False   1585785543   
1  post  ftboup        2qh72          jokes           False   1585785522   
2  post  ftbopj        2qh72          jokes           False   1585785508   
3  post  ftbnxh        2qh72          jokes           False   1585785428   
4  post  ftbjpg        2qh72          jokes           False   1585785009   

                                           permalink      domain   url  \
0  https://old.reddit.com/r/Jokes/comments/ftbp1i...  self.jokes  None   
1  https://old.reddit.com/r/Jokes/comments/ftboup...  self.jokes  None   
2  https://old.reddit.com/r/Jokes/comments/ftbopj...  self.jokes  None   
3  https://old.reddit.com/r/Jokes/comments/ftbnxh...  self.jokes  None   
4  https://old.reddit.com/r/Jokes/comments/ftbjpg...  self.jokes  None   

                                            selftext  \
0  My corona is covered with foreskin so i

## Preprocessing the Data

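Preprocessing can include cleaning the text, removing stop words, tokenizing, and encoding. In this notebook we replace missing, `[removed]`, and `[deleted]` post bodies with empty strings, join each title with its body, and tokenize the result.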
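
The cleaning step can be sketched in plain Python before any tokenizer is involved. This is only an illustration: `clean_joke` and `PLACEHOLDERS` are hypothetical names, and the rows are toy examples rather than records from the dataset.

```python
# Illustrative sketch: `clean_joke` and `PLACEHOLDERS` are hypothetical names.
# Placeholder bodies left behind for removed or deleted posts.
PLACEHOLDERS = {"[removed]", "[deleted]"}

def clean_joke(title, selftext):
    """Join a post title and body, dropping missing or placeholder bodies."""
    title = (title or "").strip()
    body = (selftext or "").strip()
    if body in PLACEHOLDERS:
        body = ""
    return f"{title} {body}".strip()

# Toy rows (not taken from the dataset) covering the three cases.
rows = [
    {"title": "Why did the chicken cross the road?", "selftext": "To get to the other side."},
    {"title": "A joke with no body", "selftext": "[removed]"},
    {"title": "Another one", "selftext": None},
]
cleaned = [clean_joke(r["title"], r["selftext"]) for r in rows]
```

Each row yields exactly one string, which is the shape a tokenizer expects when it is later applied per example.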

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.2 transformers-4.26.1


In [None]:
from transformers import AutoTokenizer

# Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Bodies that carry no joke text: missing values and removed/deleted posts
PLACEHOLDERS = {None, '[removed]', '[deleted]'}

# Tokenize the text. With batched=True every field is a list of values,
# so we build one string per joke instead of joining the whole batch.
def tokenize_function(examples):
    texts = []
    for title, selftext in zip(examples['title'], examples['selftext']):
        title = title or ''
        selftext = '' if selftext in PLACEHOLDERS else selftext
        texts.append((title + ' ' + selftext).strip())
    # bert-base-uncased accepts at most 512 positions, so max_length=1000 would overflow
    return tokenizer(texts, padding='max_length', truncation=True, max_length=512)

# Apply the tokenizer to the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/1000000 [00:00<?, ? examples/s]
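
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy sketch on plain lists of token IDs. It is an assumption-laden simplification: `pad_or_truncate` is a hypothetical helper, the pad ID of 0 is chosen for illustration, and special tokens such as `[CLS]` and `[SEP]` are not modeled.

```python
# Toy illustration of fixed-length padding and truncation.
# `pad_or_truncate` is a hypothetical helper, not part of Transformers.
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)           # padding: fill the remainder
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence is padded; a long one is cut to max_length.
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
```

Padding every joke to the maximum length keeps tensor shapes uniform, at the cost of spending memory and compute on padding for short jokes.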

In [None]:
# Encode the tokens as input and output sequences for the language model
def encode_function(examples):
    return {
        'input_ids': examples['input_ids'],
        'attention_mask': examples['attention_mask'],
        'labels': examples['input_ids'],
    }

encoded_dataset = tokenized_dataset.map(encode_function, batched=True)

Map:   0%|          | 0/1000000 [00:00<?, ? examples/s]
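
For a decoder-style model of the kind named in the title, the input and output sequences are the same tokens offset by one position. Hugging Face causal language-model heads perform this shift internally when `labels` equals `input_ids`, so the sketch below only makes the alignment visible. `next_token_pairs` is a hypothetical helper, and the token IDs are made up.

```python
# Illustrative sketch of next-token targets; `next_token_pairs` is hypothetical.
IGNORE_INDEX = -100  # label value that the loss in Hugging Face models skips

def next_token_pairs(input_ids, attention_mask):
    """Pair each position with the token that follows it, ignoring padding."""
    inputs = input_ids[:-1]
    targets = [
        tok if mask == 1 else IGNORE_INDEX
        for tok, mask in zip(input_ids[1:], attention_mask[1:])
    ]
    return inputs, targets

# Three real tokens followed by two padding tokens.
inputs, targets = next_token_pairs([11, 12, 13, 0, 0], [1, 1, 1, 0, 0])
```

Setting padded targets to `-100` keeps padding from contributing to the loss, which matters when most of a fixed-length sequence is padding.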

## Creating the Model

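We use the Hugging Face Transformers library to instantiate a pre-trained model: `bert-base-uncased` with a masked-language-modeling head. Note that BERT is an encoder-only model; a decoder-based model would instead be loaded through `AutoModelForCausalLM`.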

In [None]:
from transformers import AutoModelForMaskedLM

# Instantiate the pre-trained BERT model
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Training the Model

To train the language model, we define the training configuration with `TrainingArguments` and run the training loop with `Trainer`, both from the Hugging Face Transformers library.

In [None]:
from transformers import TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
import numpy as np
import torch

# Split the encoded dataset into training and validation sets
train_dataset, valid_dataset = train_test_split(encoded_dataset['train'], test_size=0.2)
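
The call above has no `random_state`, so the split changes from run to run. The idea behind an 80/20 split can be sketched as a seeded shuffle over row indices. `split_indices` is a hypothetical helper, and the seed of 42 is an arbitrary choice for illustration.

```python
import random

# Illustrative sketch; `split_indices` is a hypothetical helper.
def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # local RNG, so global state is untouched
    n_valid = round(n_rows * test_size)
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = split_indices(10)
```

With scikit-learn, passing `random_state` to `train_test_split` gives the same reproducibility. The `datasets` library also offers `Dataset.train_test_split`, which returns dataset objects directly instead of plain dictionaries.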

In [None]:
from datasets import Dataset

# Convert dictionary of lists to Dataset object
train_dataset = Dataset.from_dict(train_dataset)
valid_dataset = Dataset.from_dict(valid_dataset)

# Set format of Dataset object
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
valid_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='steps',
    eval_steps=50,
)

# Create the Trainer object and train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
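
As a rough check on the configuration above, the arithmetic below estimates how many optimizer steps and evaluation passes it implies. It assumes a single device, no gradient accumulation, and the 1,000,000 examples reported by the `Map` progress bars with a 20% validation split.

```python
import math

# Assumptions: one device, no gradient accumulation, 80/20 split of 1,000,000 rows.
n_examples = 1_000_000
n_valid = round(n_examples * 0.2)
n_train = n_examples - n_valid

train_batch, eval_batch = 16, 32
epochs, eval_steps = 3, 50

steps_per_epoch = math.ceil(n_train / train_batch)        # optimizer steps per epoch
total_steps = steps_per_epoch * epochs                    # steps over the whole run
n_evaluations = total_steps // eval_steps                 # evaluation passes triggered
eval_batches_per_run = math.ceil(n_valid / eval_batch)    # batches in one evaluation pass
total_eval_batches = n_evaluations * eval_batches_per_run

print(total_steps, n_evaluations, eval_batches_per_run, total_eval_batches)
# prints: 150000 3000 6250 18750000
```

Under these assumptions, evaluation would process far more batches than training itself, so a larger `eval_steps` or a smaller validation subset is probably worth considering.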


Continue...