# Introduction
This notebook lets you try out the different types of language models described in my blog posts on language modelling. The blog posts are listed below.

# Data Processing
Run the cells below to preprocess and prepare the data for modelling. We will use the [1 Million Reddit Jokes Dataset](https://www.kaggle.com/datasets/pavellexyr/one-million-reddit-jokes) for this task. The original dataset is in CSV format; we will extract the raw joke text from the CSV and write it to a text file, which we will then use to build our models.

In [1]:
import os
import pandas as pd

In [2]:
raw_file_path = "data/one-million-reddit-jokes.csv"
raw_df = pd.read_csv(raw_file_path)

In [3]:
raw_df.head()

Unnamed: 0,type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,permalink,domain,url,selftext,title,score
0,post,ftbp1i,2qh72,jokes,False,1585785543,https://old.reddit.com/r/Jokes/comments/ftbp1i...,self.jokes,,My corona is covered with foreskin so it is no...,I am soooo glad I'm not circumcised!,2
1,post,ftboup,2qh72,jokes,False,1585785522,https://old.reddit.com/r/Jokes/comments/ftboup...,self.jokes,,It's called Google Sheets.,Did you know Google now has a platform for rec...,9
2,post,ftbopj,2qh72,jokes,False,1585785508,https://old.reddit.com/r/Jokes/comments/ftbopj...,self.jokes,,The vacuum doesn't snore after sex.\n\n&amp;#x...,What is the difference between my wife and my ...,15
3,post,ftbnxh,2qh72,jokes,False,1585785428,https://old.reddit.com/r/Jokes/comments/ftbnxh...,self.jokes,,[removed],My last joke for now.,9
4,post,ftbjpg,2qh72,jokes,False,1585785009,https://old.reddit.com/r/Jokes/comments/ftbjpg...,self.jokes,,[removed],The Nintendo 64 turns 18 this week...,134


# Raw Data Preprocessing

In [4]:
# Drop posts whose title or body was replaced by '[removed]' or '[deleted]',
# then drop rows where the title or body is missing (NaN)
removed_or_deleted_title = (raw_df['title'] == '[removed]') | (raw_df['title'] == '[deleted]')
removed_or_deleted_text = (raw_df['selftext'] == '[removed]') | (raw_df['selftext'] == '[deleted]')
jokes_filtered = raw_df[~removed_or_deleted_title & ~removed_or_deleted_text]
jokes_filtered = jokes_filtered.dropna(subset=['title', 'selftext'])
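
To see what this filter keeps, here is a minimal sketch on a toy DataFrame. The rows are invented for illustration and are not taken from the dataset, and `Series.isin` is used as a compact alternative to the chained `==` comparisons above; it is a variation, not the cell as written.

```python
import pandas as pd

# Toy rows, invented for illustration (not taken from the dataset)
toy_df = pd.DataFrame({
    "title": ["Joke A", "[deleted]", "Joke C", "Joke D", None],
    "selftext": ["Punchline A", "Punchline B", "[removed]", None, "Punchline E"],
})

# Flag rows where either column holds a Reddit placeholder value
placeholders = ["[removed]", "[deleted]"]
is_placeholder = toy_df["title"].isin(placeholders) | toy_df["selftext"].isin(placeholders)

# Drop flagged rows, then drop rows with a missing title or body
toy_filtered = toy_df[~is_placeholder].dropna(subset=["title", "selftext"])
print(toy_filtered)
```

Only the first toy row survives: the others contain a placeholder or a missing value in one of the two columns.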

In [5]:
print(f"Total rows original: {raw_df.shape[0]}")
print(f"Total rows filtered: {jokes_filtered.shape[0]}")

Total rows original: 1000000
Total rows filtered: 574120


In [6]:
# Create a single column to store the joke text
# Sometimes the title begins a joke and the self text completes it, so we combine the two.
# If the self text already starts with the title, we keep the self text alone to avoid repeating the title.
def combine_title_and_text(title, text):
    if text.startswith(title):
        return text
    return title + " " + text

# Access the columns by label; positional access such as x[0] on a labelled row is deprecated in recent pandas
jokes_filtered['joke'] = jokes_filtered[['title', 'selftext']].apply(lambda row: combine_title_and_text(row['title'], row['selftext']), axis=1)
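
The two branches of `combine_title_and_text` can be checked on a pair of hand-written inputs. This is a small sketch: the function is restated so the snippet runs on its own, and the example strings are made up for illustration rather than taken from the dataset.

```python
def combine_title_and_text(title, text):
    # Behaves like the function in the cell above:
    # skip the title when the body already starts with it
    if text.startswith(title):
        return text
    return title + " " + text

# Hypothetical inputs, invented for illustration
repeated = combine_title_and_text("Knock knock", "Knock knock. Who's there?")
separate = combine_title_and_text("Why did the chicken cross the road?", "To get to the other side.")

print(repeated)  # Knock knock. Who's there?
print(separate)  # Why did the chicken cross the road? To get to the other side.
```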

In [7]:
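# Replace every non-alphanumeric character with a space, then drop stray single-character tokens.
import re

def keep_alphanumeric(input_string):
    # Non-string values (e.g. NaN) become an empty string
    if not isinstance(input_string, str):
        return ""
    return re.sub(r'[^a-zA-Z0-9]', ' ', input_string)

def remove_stray_single_characters(input_string):
    # Keep multi-character tokens plus 'I', 'a', 'A' and 'm'
    # ('m' is what remains of "I'm" once the apostrophe has been replaced).
    # Every other single-character token, digits included, is dropped.
    # A dropped token leaves its separating space behind, so runs of spaces remain.
    tokens = input_string.split(" ")
    filtered_string = ""
    for tok in tokens:
        if len(tok) > 1:
            filtered_string += tok
        elif tok in ['I', 'a', 'A', 'm']:
            filtered_string += tok
        filtered_string += " "
    return filtered_string.strip()


jokes_filtered['joke'] = jokes_filtered['joke'].apply(keep_alphanumeric)
jokes_filtered['joke'] = jokes_filtered['joke'].apply(remove_stray_single_characters)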
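
To make the effect of the two cleaning steps concrete, here is a small sketch that runs them on one made-up sentence. Both functions are restated so the snippet is self-contained, and the sample string is an assumption for illustration only.

```python
import re

def keep_alphanumeric(input_string):
    if not isinstance(input_string, str):
        return ""
    return re.sub(r'[^a-zA-Z0-9]', ' ', input_string)

def remove_stray_single_characters(input_string):
    filtered_string = ""
    for tok in input_string.split(" "):
        # Keep longer tokens and the whitelisted single characters
        if len(tok) > 1 or tok in ['I', 'a', 'A', 'm']:
            filtered_string += tok
        filtered_string += " "
    return filtered_string.strip()

sample = "I'm happy, it's 5pm!"  # hypothetical input, not from the dataset
cleaned = remove_stray_single_characters(keep_alphanumeric(sample))

# Whitespace-split view of the result, to make the surviving tokens easy to read
print(cleaned.split())  # ['I', 'm', 'happy', 'it', '5pm']
```

The stray `s` left over from "it's" is dropped while `m` from "I'm" is kept. The cleaned string itself still contains double spaces where punctuation or dropped tokens used to be, because neither function collapses whitespace.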

In [8]:
jokes_filtered['joke'].head(5)

0     I am soooo glad I m not circumcised  My corona...
1     Did you know Google now has a platform for rec...
2     What is the difference between my wife and my ...
7     What did the French man say to the attractive ...
10    Yo Mama Yo momma  so fat  that when she went t...
Name: joke, dtype: object

# Write Jokes to a Text File

In [9]:
joke_file = "data/jokes.txt"
with open(joke_file, 'w', encoding='utf-8') as f:
    for joke in jokes_filtered['joke'].values:
        f.write(joke+'\n')
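
Because `keep_alphanumeric` replaces newline characters with spaces, each cleaned joke fits on a single line, so the file can be read back one joke per line. The sketch below shows that round trip on two made-up strings, using a temporary directory rather than `data/jokes.txt`; the names `toy_jokes` and `toy_path` are hypothetical.

```python
import os
import tempfile

# Hypothetical cleaned jokes, invented for illustration
toy_jokes = ["first cleaned joke", "second cleaned joke"]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "jokes.txt")

    # Write one joke per line, as in the cell above
    with open(toy_path, "w", encoding="utf-8") as f:
        for joke in toy_jokes:
            f.write(joke + "\n")

    # Read the file back, stripping the trailing newline from each line
    with open(toy_path, encoding="utf-8") as f:
        loaded = [line.rstrip("\n") for line in f]

print(loaded == toy_jokes)  # True
```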