# **They'll never stop the Simpsons**

A simple graphical analysis of The Simpsons (data available through season 27)


![](http://frinkiac.com/img/S13E17/1214296.jpg)

In [None]:
import warnings
#Remove anaconda warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

sns.set()

%matplotlib inline


plt.rcParams["figure.figsize"] = 12, 9
plt.style.use('fivethirtyeight')


We will read the first dataset, which contains information about the episodes through season 27. We drop the rows with missing values because they belong to season 28 (we only have data for the first episode of that season).

In [None]:
df_e = pd.read_csv('../input/the-simpsons-dataset/simpsons_episodes.csv').dropna()
print(df_e.shape)
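
Before trusting a blanket `dropna()`, it can help to check which seasons the incomplete rows actually come from. Here is a minimal sketch of that check on a toy dataframe; the values are made up for illustration, and only the column names are borrowed from the episodes file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for simpsons_episodes.csv (values are made up for illustration)
toy_episodes = pd.DataFrame({
    "season": [27, 27, 28, 28],
    "imdb_rating": [7.0, 6.8, 6.5, np.nan],
    "us_viewers_in_millions": [3.3, 2.9, np.nan, np.nan],
})

# Flag rows with at least one missing value, then count them per season
rows_with_na = toy_episodes.isna().any(axis=1)
na_per_season = rows_with_na.groupby(toy_episodes["season"]).sum()
print(na_per_season)

# Dropping the incomplete rows leaves only the fully populated episodes
cleaned = toy_episodes.dropna()
print(cleaned.shape)
```

If the counts are non-zero only for the last season, dropping the rows is a safe shortcut.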

In [None]:
#Check the data
df_e.info()

In [None]:
#Drop columns we won't use
df_e.drop(['image_url','production_code', 'video_url'], axis=1, inplace=True)
#Sort for episodes 
df_e.sort_values(['number_in_series'], inplace=True)
df_e.head()

Let's see how viewership has changed over time.

In [None]:
sns.lineplot(x='number_in_series', y='us_viewers_in_millions', data=df_e, linewidth=0.8 )
plt.xlabel('Episode')
plt.ylabel('US viewers in millions')
plt.title('US viewers per episode')
plt.show()

We can see that, in general, viewership has declined over time.

Anyway, 600 episodes are a lot to visualize, so let's try to group this plot by season.
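
A rolling mean is one way to make a noisy per-episode trend easier to read while keeping the episode-level granularity. The sketch below uses a made-up series of viewer counts and a window of 3 episodes; both are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Toy viewer counts in millions (made-up values, ordered by episode number)
viewers = pd.Series([30.0, 28.0, 26.0, 20.0, 18.0, 16.0])

# Centered 3-episode rolling mean; the first and last values are NaN
# because the window is incomplete at the edges
smoothed = viewers.rolling(window=3, center=True).mean()
print(smoothed)
```

On the real data, a wider window (say, about one season's worth of episodes) would probably be a more sensible choice.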


![](http://frinkiac.com/img/S08E21/269101.jpg)

In [None]:
df_e.groupby('season')['us_viewers_in_millions'].mean().plot(linestyle='--', marker='o', mfc='yellow', mec='black', linewidth=1.2)
plt.xticks((np.arange(0, 30, step=1)))
plt.xlabel('Season')
plt.ylabel('Average US views in millions')
plt.title('Average US views per season')
plt.show()

The trend is basically the same, but here we can see a small peak after season 11 that lasts until season 15, when viewership drops again. Maybe this peak is a consequence of the movie, which was released in 2007. Let's check whether season 11 also aired in 2007.

In [None]:
df_e.loc[df_e['original_air_year'] == 2007, 'season']

Nope, the movie was released between seasons 18 and 19. So maybe we have a spike because the season 11 finale is _'Who Shot Mr. Burns?'_ Let's check it out.

In [None]:
df_e.loc[df_e['title'] == 'Who Shot Mr. Burns? (Part One)']

That's not the case. That episode was the finale of season 6.
Let's move on and see how the ratings have changed over time.

In [None]:
sns.lineplot(x='number_in_series', y='imdb_rating', data=df_e, linewidth= 0.6)
plt.xlabel('Episodes')
plt.ylabel('Rating (IMDB)')
plt.title('IMDB rating per episode')
plt.show()

Here too we can see that ratings have decreased over time. Let's group the ratings by season to get a clearer view of the situation.

In [None]:
df_e.groupby('season')['imdb_rating'].mean().plot(linestyle='--', marker='o', mfc='yellow', mec='black', linewidth=1.2)
plt.xticks((np.arange(0, 30, step=1)))
plt.xlabel('Season')
plt.ylabel('Average rating(IMDB)')
plt.title('Average rating by season')
plt.show()

Things look quite different here. We can see that after season 8 there has been a sharp decline in the average season rating, but this doesn't mean there were no good episodes after season 8; the seasons overall just had fewer great episodes. By the way, if you are planning to binge-watch the Simpsons, according to this plot, seasons 5 and 7 were the best. But if you want to watch only the best episodes, the next plot is going to be more useful.

In [None]:
#Create a pivot table with episodes and ratings
ep_piv = df_e.pivot_table(index='season', columns='number_in_season', values='imdb_rating')
ep_piv.head()
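
To make the shape of that pivot table concrete, here is a minimal sketch on a toy dataframe with made-up ratings. Seasons become rows, episode positions become columns, and any season shorter than the longest one gets `NaN` in the missing cells.

```python
import pandas as pd

# Toy ratings (made-up values): season 1 has 2 episodes, season 2 has 3
toy = pd.DataFrame({
    "season": [1, 1, 2, 2, 2],
    "number_in_season": [1, 2, 1, 2, 3],
    "imdb_rating": [7.5, 8.0, 8.2, 7.9, 8.4],
})

# Rows are seasons, columns are episode positions; missing cells become NaN
toy_piv = toy.pivot_table(index="season", columns="number_in_season", values="imdb_rating")
print(toy_piv)
```

Those `NaN` cells are what show up as blank squares in a heatmap.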

In [None]:
#Heatmap visualization
sns.heatmap(ep_piv, cmap='RdYlGn', annot=True, linewidths=0.2,cbar_kws={'label': 'rating'})
plt.xlabel('Episode Number')
plt.ylabel('Season')
plt.title('IMDB rating per episode')
plt.show()

Sadly, we have some missing values here and there and just the first episode of season 28, but the situation is still clear. Here we can see that even the 'new' seasons have some good episodes, like '_Barthood_' (season 27, episode 9). But now, let's find out the best and the worst episodes ever.

![](http://frinkiac.com/img/S08E14/961843.jpg)

In [None]:
#Worst episode ever
df_e[df_e.imdb_rating == df_e.imdb_rating.min()]

In [None]:
#Best episode ever
df_e[df_e.imdb_rating == df_e.imdb_rating.max()]
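
Comparing against `.max()` keeps every episode tied for the top rating, not just the first one. As a hedged alternative, `nlargest` with `keep="all"` behaves the same way; the sketch below uses made-up titles and ratings.

```python
import pandas as pd

# Toy ratings (made-up values) with a tie for first place
toy = pd.DataFrame({
    "title": ["Episode A", "Episode B", "Episode C"],
    "imdb_rating": [9.2, 8.1, 9.2],
})

# keep="all" returns every row tied for the top rating, not just the first
best = toy.nlargest(1, "imdb_rating", keep="all")
print(best)
```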

Looks like we have two best episodes, both from season 8. Let's check my favourite episode: _'The Springfield Files'_

![](http://frinkiac.com/img/S08E10/1165663.jpg)

In [None]:
df_e.loc[df_e['title'] == 'The Springfield Files']

9.0, pretty good! And it's from season 8 as well. Now let's move on to the other datasets.

In [None]:
df_c = pd.read_csv('../input/the-simpsons-dataset/simpsons_characters.csv')
df_c.head()

In [None]:
print(df_c.shape)

Oh my, looks like we have a lot of Simpsons characters here. Luckily, we have the gender value only for the main characters, so dropping the missing values will leave us with a more manageable dataframe.

In [None]:
df_c.dropna(inplace=True)
df_c.reset_index(drop=True, inplace=True)
df_c.head()

In [None]:
print(df_c.shape)

We can't do much with this data, so let's just check the gender distribution of the characters.

In [None]:
sns.countplot(x=df_c.gender)
plt.title('Characters by gender')
plt.xlabel('Gender')
plt.ylabel('')
plt.show()


It looks like most of the Simpsons cast is made up of male characters.
Let's move on with the analysis and open up a new dataframe.

In [None]:
df_d = pd.read_csv('../input/the-simpsons-dataset/simpsons_script_lines.csv').dropna()
df_d.head()

In [None]:
print(df_d.shape)

Here we have the script lines of every character by episode, and we can get some cool plots out of them. We will start with some cleaning and sorting operations.

In [None]:
df_d.sort_values(['number'], inplace=True)
df_d.head()

In [None]:
#Drop some columns that we will not use
df_d.drop(['id', 
            'episode_id',   
            'raw_text',     
            'timestamp_in_ms',    
            'speaking_line'],    
            axis=1, 
            inplace=True)
df_d.head()

Convert `character_id` to int so this dataframe can be merged with the characters dataframe.

In [None]:
df_d['character_id'] = df_d['character_id'].astype('int64')
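
`astype('int64')` raises an error as soon as it meets a value that is not a clean number. If that happens, a more defensive route is `pd.to_numeric` with `errors="coerce"`, which turns bad values into `NaN` so they can be dropped first. The sketch below uses a made-up series of IDs; whether the real file contains such values is an assumption, not something checked here.

```python
import pandas as pd

# Toy IDs as they might be read from a messy CSV (made-up values)
raw_ids = pd.Series(["2", "1", "not_an_id", None])

# Unparseable or missing entries become NaN instead of raising
ids = pd.to_numeric(raw_ids, errors="coerce")

# Drop the NaN entries, then the integer cast is safe
valid_ids = ids.dropna().astype("int64")
print(valid_ids)
```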

In [None]:
#Rename the column before merging
df_d.rename(columns={'character_id':'id'}, inplace=True)
df = df_c.merge(df_d, on='id', how='left').dropna()
df.head()
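
A left merge followed by `dropna()` keeps only the characters that have at least one script line. When neither dataframe has missing values to begin with, that should give the same rows as an inner merge. Here is a minimal sketch on toy data (the IDs, names, and lines are made up for illustration):

```python
import pandas as pd

# Toy character table and toy script lines (made-up values)
chars = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Marge Simpson", "Homer Simpson", "Seymour Skinner"],
})
lines = pd.DataFrame({
    "id": [2, 2, 1, 9],
    "spoken_words": ["D'oh!", "Mmm, donuts.", "Hmm.", "Hello."],
})

# Left merge keeps character 3 with NaN in spoken_words; dropna removes it
left_then_drop = chars.merge(lines, on="id", how="left").dropna()

# Inner merge never creates that unmatched row in the first place
inner = chars.merge(lines, on="id", how="inner")

print(len(left_then_drop), len(inner))
```

Note that lines spoken by id 9, which has no entry in the character table, are discarded either way.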

In [None]:
#Drop a few more columns we won't use
df.drop(['normalized_name', 'number', 'raw_character_text'],axis=1, inplace=True)
#Count script lines per character and sort in descending order
engage = df.groupby('name')['name'].count().sort_values(ascending=False)
#get the first 10 values
top_10 = engage[:10]
print(top_10)
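
The group-count-sort chain above can also be written with `value_counts()`, which counts and sorts in descending order in one step. A minimal sketch on toy data (the rows are made up) shows that both give the same counts:

```python
import pandas as pd

# Toy script-line table: one row per spoken line (made-up values)
toy = pd.DataFrame({"name": [
    "Homer Simpson", "Bart Simpson", "Homer Simpson",
    "Lisa Simpson", "Homer Simpson", "Bart Simpson",
]})

# Same pattern as in the notebook: group, count, sort
by_group = toy.groupby("name")["name"].count().sort_values(ascending=False)

# value_counts() counts and sorts descending by default
by_value_counts = toy["name"].value_counts()

print(by_value_counts)
```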

In [None]:
#Lollipop plot
fig, ax = plt.subplots()
ax.hlines(top_10.index, xmin=0, xmax=top_10.values, linewidth=2)
ax.plot(top_10.values, top_10.index, 'o', markersize=10, alpha=0.5)
ax.invert_yaxis()
ax.set_title('Top 10 characters by number of lines')
ax.set_xlabel('Number of lines')


plt.show()

As we would expect, the Simpson family has the most lines in the show. The first non-Simpson character is Montgomery Burns, who still has a direct relationship with the character with the most lines, Homer Simpson, being his boss.
Let's see who the supporting characters of the show are.


![](http://frinkiac.com/img/S03E19/291597.jpg)

In [None]:
support = engage[10:40]

In [None]:
fig, ax = plt.subplots()
ax.hlines(support.index, xmin=0, xmax=support.values, linewidth=2, color='orange')
ax.plot(support.values, support.index, 'o', markersize=10, alpha=0.5, color='orange')
ax.invert_yaxis()
ax.set_title('Best supporting cast by number of lines')
ax.set_xlabel('Number of lines')


plt.show()

Just for fun, let's create a word cloud of the most common words in the show. This kind of representation has no statistical value, but it's not too hard to make and it's nice to look at.

![](http://frinkiac.com/img/S10E01/78778.jpg)

In [None]:
text = " ".join(line for line in df_d.spoken_words)
stopwords = set(STOPWORDS)
stopwords.update(["hey", "gonna", "yeah", "uh", "ya", "ho", "la", "em",
                   "ah", "huh", "ooh", "gotta", "eh", "aw", "heh", "wow",
                   "ow", "haw", "woo",  "ha", "wanna", "whoa", "hoo", "ye", "wait","now","Oh","Well","one",
                 "go", "okay", "know","right",'look', 'let','got', 'Thank', 'see', 'will', 'want',
                 'come', 'think', 'take', 'time', 'good', 'keep', 'say', 'make', 'going', 'Dad'])
plt.figure(figsize=[15,7])
wordcloud = WordCloud(max_words=1000, background_color="white", stopwords=stopwords).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
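
A word cloud only shows relative sizes. To get actual counts for the most frequent words, a plain `collections.Counter` works too. The sketch below uses three made-up lines and a tiny hypothetical stopword set; the tokenization is a simple regex and is not the one `WordCloud` uses internally.

```python
from collections import Counter
import re

# Toy dialogue lines (made up for illustration)
lines = ["Homer, where is Bart?", "Bart is at school.", "Homer! Bart!"]

# Tiny hypothetical stopword set, just for this example
stop = {"where", "is", "at"}

# Lowercase everything and split into words made of letters and apostrophes
words = re.findall(r"[a-z']+", " ".join(lines).lower())

# Count every word that is not a stopword
counts = Counter(w for w in words if w not in stop)
print(counts.most_common(2))
```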

After removing some of the most common words in the English language, it's no surprise that the most common words are the names of the main characters.
Now let's try to see where the action takes place.

In [None]:
location = df_d.groupby('raw_location_text')['raw_location_text'].count().sort_values(ascending=False)
best_location = location[:10]

In [None]:
fig, ax = plt.subplots(1,1)
sns.barplot(y=best_location.index, x=best_location.values, palette='viridis')
ax.set_ylabel('Location')
ax.set_title('Top 10 locations')
plt.show()
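
The bar chart shows raw counts. To say how dominant a location is, the counts can be turned into shares of all lines. Here is a minimal sketch with made-up rows; only the column name `raw_location_text` comes from the script-lines file.

```python
import pandas as pd

# Toy script lines with their location (made-up values)
toy = pd.DataFrame({"raw_location_text": [
    "Simpson Home", "Simpson Home", "Simpson Home",
    "Moe's Tavern", "Springfield Elementary School",
]})

# Same counting pattern as in the notebook
counts = toy.groupby("raw_location_text")["raw_location_text"].count().sort_values(ascending=False)

# Share of all lines spoken at each location
share = counts / counts.sum()
print(share)
```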

The bar chart shows that the Simpson home hosts by far the most dialogue.
This concludes this simple EDA of one of my favourite shows. I hope you liked it and found it interesting. This is also my first post on Kaggle, and I'm working to improve.

All images from: [frinkiac](http://frinkiac.com)