<h1 style="background-color:skyblue;font-family:sans-serif;font-size:320%;text-align:center">This Is Why Your Jokes Aren't Funny</h1>

In [None]:
from IPython.display import Image
import os
Image("../input/laughing/dan-cook-MCauAnBJeig-unsplash.jpg")

<h2 style="background-color:skyblue;font-family:sans-serif;font-size:300%;text-align:center">Table Of Contents</h2>

* [1. Good Words - Bad Words](#1)
* [2. A Very Common And Surprising Mistake](#2)
* [3. There Is No Such Thing As The Ultimate Joke](#3)    
* [4. Some Special Jokes](#4) 
* [5. Conclusion](#5) 


You’re standing in a group of colleagues. Everyone is in a good mood, so you crack a joke. Suddenly there is an awkward silence. Some look away, others give you a pitying smile.

How embarrassing!

You can prepare better for the next time.

But what makes a good joke? And which mistakes should you avoid?

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import re

from wordcloud import WordCloud, STOPWORDS

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer



# List every file under the Kaggle input directory.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
items = pd.read_csv("/kaggle/input/jester-17m-jokes-ratings-dataset/jester_items.csv")
ratings = pd.read_csv("/kaggle/input/jester-17m-jokes-ratings-dataset/jester_ratings.csv")

In [None]:
# Data Exploration

In [None]:
## Exploration of items

In [None]:
items.shape

In [None]:
items.head()

In [None]:
items.describe()

In [None]:
items.info()

In [None]:
# Are there any duplicates in the data?

items.duplicated().sum()


# conclusion: No

In [None]:
items["length"] = items["jokeText"].str.len()

In [None]:
## Exploration of Ratings

In [None]:
ratings.shape

In [None]:
ratings

In [None]:
ratings.describe()

In [None]:
ratings.info()

In [None]:
# Are there any duplicates in the data?

ratings.duplicated().sum()


# conclusion: No

In [None]:
# Distribution of the ratings

ratings["rating"].hist(range=(-10,10), bins=20, color="purple", edgecolor="indigo", linewidth=1)
plt.grid(False)
plt.xlabel("Rating")
plt.ylabel("Count")
plt.title("Distribution of the Rating")

<a id="1"></a>
<h2 style="background-color:skyblue;font-family:sans-serif;font-size:300%;text-align:center">Good Words - Bad Words</h2>

In [None]:
# Normalization, Tokenization, Stopwords removal, Verb lemmatization

# Build the stopword set and the lemmatizer once instead of once per word.
english_stopwords = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

items["prepared_jokeText"] = items["jokeText"].apply(lambda x: re.sub(r"[^a-zA-Z0-9]", " ", x.lower()))
items["prepared_jokeText"] = items["prepared_jokeText"].apply(word_tokenize)
items["prepared_jokeText"] = [[w for w in words if w not in english_stopwords]
                              for words in items["prepared_jokeText"]]
items["prepared_jokeText"] = [[lemmatizer.lemmatize(w, pos="v") for w in words]
                              for words in items["prepared_jokeText"]]
items["prepared_jokeText"]

In [None]:
# Merge of the two data sets without the user information. Just mean ratings per joke.
# Advantage: The jokes do not have to be repeated per user.

jokes_mean_rating = items.merge(ratings.groupby(["jokeId"]).mean().rating , on='jokeId', how='inner')
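
To make the shape of the merged table concrete, here is a minimal sketch on made-up data. The frames `toy_items` and `toy_ratings` are assumptions for illustration; only the column names `jokeId`, `jokeText`, `userId` and `rating` come from the dataset. Unlike the cell above, the sketch turns the mean ratings into a regular column before merging, which is just one possible way to write it.

```python
import pandas as pd

# Hypothetical toy data that mimics the column layout of the Jester files.
toy_items = pd.DataFrame({"jokeId": [1, 2, 3],
                          "jokeText": ["joke a", "joke b", "joke c"]})
toy_ratings = pd.DataFrame({"userId": [10, 11, 10, 11],
                            "jokeId": [1, 1, 2, 2],
                            "rating": [4.0, 2.0, -6.0, -2.0]})

# One mean rating per joke, with jokeId kept as a regular column.
toy_mean = toy_ratings.groupby("jokeId", as_index=False)["rating"].mean()

# The inner join keeps only jokes that received at least one rating,
# so joke 3 (never rated in this toy data) drops out.
toy_merged = toy_items.merge(toy_mean, on="jokeId", how="inner")
print(toy_merged)
```

Each joke appears exactly once in the result, no matter how many users rated it.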

In [None]:
jokes_mean_rating

In [None]:
best_rated = jokes_mean_rating[jokes_mean_rating["rating"] > 3]
worst_rated = jokes_mean_rating[jokes_mean_rating["rating"] < -1]

In [None]:
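def word_list(joke_data):
    """Flatten the token lists of all jokes into one list of words."""
    list_of_words = []

    # Use the name `words` so the built-in `list` is not shadowed.
    for words in joke_data["prepared_jokeText"]:
        list_of_words.extend(words)
    return list_of_words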

Good jokes often contain words like ‘man’, ‘go’, ‘say’ and ‘tell’. So storytelling seems to be an important part of a good joke.

In [None]:
text = ' '.join(word_list(best_rated))
cloud = WordCloud(background_color='white', width=1920, height=1080).generate(text)
plt.figure(figsize=(32, 18))
plt.axis("off")
plt.imshow(cloud)
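
A word cloud is easy on the eye but hard to read exactly. As a rough cross-check, the most frequent words can also be counted directly. The sketch below uses made-up token lists (`toy_tokens` is an assumption, shaped like the `prepared_jokeText` column), not the real jokes.

```python
from collections import Counter

# Hypothetical token lists, shaped like the "prepared_jokeText" column.
toy_tokens = [["man", "go", "bar", "say"],
              ["man", "tell", "wife"],
              ["doctor", "say", "man"]]

# Flatten the nested lists and count every word.
word_counts = Counter(word for tokens in toy_tokens for word in tokens)

# The most frequent words are roughly the ones a word cloud draws largest.
print(word_counts.most_common(2))
```

On the real data, the same counter could be fed with `word_list(best_rated)`.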

The word frequency of bad jokes reveals that ‘knock knock’ and ‘lightbulb’ jokes are out. You should avoid those.

In [None]:
# Use a separate name so the nltk `stopwords` module imported above is not overwritten.
cloud_stopwords = set(STOPWORDS)
cloud_stopwords.update(["q", "na", "j"])

text = ' '.join(word_list(worst_rated))
cloud = WordCloud(stopwords=cloud_stopwords, background_color='white', width=1920, height=1080).generate(text)
plt.figure(figsize=(32, 18))
plt.axis('off')
plt.imshow(cloud)
plt.savefig('worst_jokes_wordcloud.png')

<a id="2"></a>
<h2 style="background-color:skyblue;font-family:sans-serif;font-size:300%;text-align:center">A Very Common And Surprising Mistake</h2>

You keep your jokes short so that your listeners won’t be bored? That’s not the best idea! The scatterplot below suggests that jokes with at least 600 characters tend to be rated as funnier.

In [None]:
from scipy import optimize


def fitfunc(x, a, b, c, d):
    # Logarithmic model for the rating as a function of the joke length.
    return a + b * np.log(c * x + d)


params, params_covariance = optimize.curve_fit(fitfunc, jokes_mean_rating["length"], jokes_mean_rating["rating"])
x_values = np.arange(0, 1400, 1)
plt.figure(figsize=(10,8))
plt.scatter(jokes_mean_rating["length"], jokes_mean_rating["rating"], label="Data", color="cadetblue")
plt.plot(x_values, fitfunc(x_values, *params),
         label='Fitted function')
plt.legend(loc='best')
plt.xlabel("Joke Length in Number of Characters")
plt.ylabel("Rating")
plt.title("Are Longer or Shorter Jokes Funnier?")

plt.show()
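
The fitted curve shows the trend, but a plain group comparison can make the 600-character claim easier to check. Here is a minimal sketch, assuming a table with `length` and `rating` columns like `jokes_mean_rating`. The numbers in `toy` are invented for illustration and are not results from the Jester data.

```python
import numpy as np
import pandas as pd

# Hypothetical joke lengths and mean ratings, for illustration only.
toy = pd.DataFrame({"length": [120, 300, 450, 650, 800, 1200],
                    "rating": [0.5, 1.0, 0.0, 2.0, 2.5, 1.5]})

# Split the jokes at the 600-character mark used in the text.
toy["group"] = np.where(toy["length"] >= 600, "long", "short")

# Mean rating and number of jokes in each group.
summary = toy.groupby("group")["rating"].agg(["mean", "count"])
print(summary)
```

Reporting the count next to the mean matters here, because the long group is much smaller than the short one in the real data.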

Yet 81% of the jokes are shorter than 600 characters.

In [None]:
(jokes_mean_rating["length"] < 600).sum()/len(jokes_mean_rating["length"])

In [None]:
jokes_mean_rating["length"].hist(range=(0,1000), bins=20, color="purple", edgecolor="indigo", linewidth=1)
plt.grid(False)
plt.xlabel("Text Length in Number of Characters")
plt.ylabel("Count")
plt.title("Distribution of the Text Length")

<a id="3"></a>
<h2 style="background-color:skyblue;font-family:sans-serif;font-size:300%;text-align:center">There Is No Such Thing As The Ultimate Joke</h2>

The mean ratings per joke all crowd into the middle of the scale, between -2.7 and 3.7. No joke comes anywhere near the extremes of -10 or 10.

In [None]:
# Distribution of the mean ratings

jokes_mean_rating["rating"].hist(range=(-10,10), bins=20, color="purple", edgecolor="indigo", linewidth=1)
plt.grid(False)
plt.xlabel("Rating")
plt.ylabel("Count")
plt.title("Distribution of the Mean Rating per Joke")

In [None]:
# Are there many people who rate very high and others who rate very low?
# Or does the majority vote relatively neutral?

ratings_per_person = ratings.groupby(["userId"]).mean().rating  
ratings_per_person.hist(range=(-10,10), bins=20, color="purple", edgecolor="indigo", linewidth=1)
plt.grid(False)
plt.xlabel("Rating")
plt.ylabel("Count")
plt.title("Distribution of the Mean Rating per Person")

<a id="4"></a>
<h2 style="background-color:skyblue;font-family:sans-serif;font-size:300%;text-align:center">Some Special Jokes</h2>

The Best Rated Joke

In [None]:
# Best rated joke
best_joke = jokes_mean_rating[jokes_mean_rating["rating"] == jokes_mean_rating["rating"].max()]
best_joke.iloc[0].jokeText

The Worst Rated Joke

In [None]:
# Here I want to look at more than one joke to see relationships with 
# the most controversial jokes later.

worst_jokes = jokes_mean_rating.sort_values("rating")
worst_jokes

In [None]:
worst_jokes.iloc[0].jokeText

The Most Controversial (And Second Worst) Joke

In [None]:
# most controversial

jokes_mean_std = ratings.groupby(["jokeId"]).std().rating 
jokes_mean_std.sort_values(ascending=False)
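
"Controversial" here means that the ratings of a joke are widely spread, so the standard deviation is the measure. Mean and spread can be computed in one step, and the joke ID can be looked up instead of typed in by hand. This is a minimal sketch on invented ratings (`toy_ratings` is an assumption, not the real data).

```python
import pandas as pd

# Hypothetical ratings: joke 1 splits the audience, joke 2 does not.
toy_ratings = pd.DataFrame({"jokeId": [1, 1, 1, 2, 2, 2],
                            "rating": [-9.0, 0.0, 9.0, 1.0, 2.0, 3.0]})

# Mean and standard deviation per joke in a single pass.
stats = toy_ratings.groupby("jokeId")["rating"].agg(["mean", "std"])

# The joke with the largest spread is the most controversial one.
most_controversial_id = stats["std"].idxmax()
print(stats)
print(most_controversial_id)
```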

In [None]:
# This one is both: second worst and most controversial

most_controversial_joke = jokes_mean_rating[jokes_mean_rating["jokeId"] == 124]
most_controversial_joke.iloc[0].jokeText

<a id="5"></a>
<h2 style="background-color:skyblue;font-family:sans-serif;font-size:300%;text-align:center">Conclusion</h2>

Jokes that are funny to everyone just do not exist. Nor do jokes that everyone regards as terrible. The second worst rated joke is also the most controversial one.

Whether a joke is funny or not remains very subjective. So the best you can do is respond to the preferences of your listeners.