In this kernel, I want to explore the Seinfeld scripts dataset.
Let's have a look at the data we're dealing with here.

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)  # show the full column contents when displaying dataframes (None replaces the deprecated -1)

In [None]:
scripts = pd.read_csv('../input/scripts.csv')
episode_info = pd.read_csv('../input/episode_info.csv')

In [None]:
scripts.head()

So the scripts consist of character lines, one line per CSV record, in order. The episode information is present in the **SEID** and **Season** columns, but if you have a look at more lines than I display here, there seems to be no indication of when the scenes change.

### Dealing with missing data
As a next step, let's check if the dataset contains any null values.

In [None]:
scripts[scripts.isnull().any(axis=1)]

There are a handful of records with null dialogues and odd words/sentences in the **Character** column. It is probably safe to just go ahead and delete those.

In [None]:
scripts = scripts.dropna(axis=0)
scripts.info()

And now we are left with 54606 lines across all seasons.
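
Before dropping rows, it can help to see which columns the nulls sit in and how many rows would be lost. Here is a minimal sketch on a hypothetical toy frame (the values are made up; only the column names mirror the dataset):

```python
import pandas as pd

# Hypothetical toy frame with one missing dialogue.
toy = pd.DataFrame({
    'Character': ['JERRY', 'GEORGE', 'odd text'],
    'Dialogue': ['Hi.', 'Hello.', None],
})

nulls_per_column = toy.isnull().sum()   # null count for each column
cleaned = toy.dropna(axis=0)            # drop rows containing any null
rows_dropped = len(toy) - len(cleaned)

print(nulls_per_column)
print(f'dropped {rows_dropped} of {len(toy)} rows')
```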

### Let's look at some simple stats

We've already seen that the dataset consists of 54606 character lines. Let's try to break those down a bit and learn a few basic facts about our dataset. The **Seinfeld** series ran for 9 seasons.

In [None]:
scripts['Season'].unique()

The dataset agrees.

Now let's look at how the lines are spread across the seasons. 

In [None]:
sns.histplot(scripts['Season'], discrete=True)  # distplot is deprecated in recent seaborn; one bar per season

This distribution is not so surprising given that the first seasons had a smaller number of episodes. Let's see the exact numbers.

In [None]:
episodes_per_season = scripts.groupby('Season')['SEID'].aggregate(['count', 'unique'])
episodes_per_season['nr_episodes'] = episodes_per_season['unique'].apply(len)
episodes_per_season[['count', 'nr_episodes']]

So that gives us on average this many lines per episode (broken down per season):

In [None]:
episodes_per_season['count']/episodes_per_season['nr_episodes']
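
To make the arithmetic concrete, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical; only the `Season` and `SEID` column names come from the dataset). It counts distinct episodes with `nunique`, which gives the same number as the length of the `unique` array.

```python
import pandas as pd

# Hypothetical toy data: two episodes in season 1, one in season 2.
toy = pd.DataFrame({
    'Season': [1, 1, 1, 2, 2],
    'SEID': ['S01E01', 'S01E01', 'S01E02', 'S02E01', 'S02E01'],
})

# 'count' is the number of lines, 'nunique' the number of distinct episodes.
per_season = toy.groupby('Season')['SEID'].agg(['count', 'nunique'])
per_season['lines_per_episode'] = per_season['count'] / per_season['nunique']
print(per_season)
```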

Interestingly, the dataset shows only 4 episodes for Season 1, although there should be 5, including the pilot (which we already saw marked as S01E01).
Sure enough, there are only 4 distinct SEIDs in Season 1.

In [None]:
scripts[scripts['Season'] == 1.0]['SEID'].unique()

So what's happening here? Well, as it turns out, both the pilot and the first episode after it are marked as S01E01. Here's some evidence:

In [None]:
def lines_in_episode(episode):
    return scripts[scripts['SEID'] == episode]['Dialogue'].count()

print('# lines in S01E01: %d. # lines in S01E02: %d' % (lines_in_episode('S01E01'), lines_in_episode('S01E02')))

We can fix this by finding the end of the pilot and marking the first entries in the dataset as belonging to it. After some reading around, it looks like the pilot ends at line 210 (right after Jerry's monologue about not understanding women).


In [None]:
scripts[205:215]

So let's go ahead and tag those lines accordingly.

In [None]:
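# Assign through a single positional indexer; chained indexing (scripts[:211]['SEID'] = ...)
# may write to a temporary copy and leave `scripts` unchanged.
scripts.iloc[:211, scripts.columns.get_loc('SEID')] = 'Pilot'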
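
Two details matter when tagging rows like this. After `dropna`, the index labels no longer line up with row positions, so the first rows have to be selected by position. The write should also go through a single indexer so that it lands on the frame itself and not on a temporary copy. Here is a minimal sketch on hypothetical toy data (the frame `toy` and the cutoff `n_pilot` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with gaps in the index, as left behind by dropna().
toy = pd.DataFrame(
    {'SEID': ['S01E01'] * 5, 'Dialogue': list('abcde')},
    index=[0, 1, 3, 4, 6],
)

n_pilot = 3  # hypothetical cutoff: the first three rows belong to the pilot

# Select rows by position and the column by its integer location,
# then assign in a single step.
toy.iloc[:n_pilot, toy.columns.get_loc('SEID')] = 'Pilot'
print(toy)
```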

### Distribution of character lines

First, let's look at the characters involved in the show, according to this dataset.

In [None]:
scripts['Character'].value_counts()

Clearly, this is not the cleanest of data columns. We'll have to keep this in mind during further analysis and watch for characters that appear under more than one **Character** entry.
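
One way to surface such near-duplicates is to compare the raw labels against a normalized form. The sketch below is only an illustration on made-up labels. The `normalize_character` helper and its rules (trim whitespace, upper-case, drop a trailing parenthetical note) are assumptions, not something the dataset prescribes.

```python
import re
import pandas as pd

def normalize_character(name):
    """Hypothetical cleanup: drop a trailing '(...)' note, trim, upper-case."""
    name = re.sub(r'\s*\(.*?\)\s*$', '', str(name))
    return name.strip().upper()

# Made-up labels illustrating the kind of variants one might find.
raw = pd.Series(['JERRY', 'Jerry ', 'JERRY (to Elaine)', 'ELAINE', 'elaine'])

normalized = raw.map(normalize_character)
print(normalized.value_counts())
```

Whether a given pair of labels really refers to the same character still needs a manual look; the normalization only narrows down the candidates.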

Next, let's have a look at the distribution of lines across different characters. We will make use of this helper function:

In [None]:
def plot_lines(season = None, episode = None, top_n = 10, ax = None):
    filtered_scripts = scripts
    if season:
        filtered_scripts = filtered_scripts[filtered_scripts['Season'] == season]
    if episode:
        filtered_scripts = filtered_scripts[filtered_scripts['SEID'] == episode]
    filtered_scripts['Character'].value_counts().head(top_n).plot(kind = 'bar', ax = ax)

In [None]:
plot_lines()

Nothing too surprising here, but it is worth noting `[Setting`, a special "character" that makes a lot of appearances.

Let's see some examples of these lines.

In [None]:
scripts[scripts['Character'] == '[Setting'].head(10)

Got the idea.

Next, let's try breaking down the character lines per season.

In [None]:
fig, axes = plt.subplots(ncols=3, nrows=3, figsize=(10, 20), dpi=100)
for i in range(9):
    season = i + 1
    row = i//3
    col = i%3
    plot_lines(season = season, ax = axes[row][col])
    axes[row][col].set_title(f'Season {season}', fontsize=12)
plt.show()

Now we're seeing quite a lot of episodic characters. This really shows how much the show revolved around the 4 main characters throughout.

## NLP

Next, let's try to do some more classical NLP.

### Meta features

We'll look at a couple of typical meta features used in NLP: `word_count` and `mean_word_length`.
Both of these will be calculated per **Dialogue** line, i.e. once for each record in our dataset.

In [None]:
scripts['word_count'] = scripts['Dialogue'].apply(lambda x: len(str(x).split()))
scripts['mean_word_length'] = scripts['Dialogue'].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
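
As a quick sanity check of what these two features measure, here is a small sketch on two sample lines chosen purely for illustration. Both features split on whitespace, so punctuation stays attached to words and counts toward their length.

```python
import numpy as np
import pandas as pd

# Illustrative sample lines (hypothetical, not read from the CSV).
toy = pd.DataFrame({'Dialogue': ['Hello there.', 'This is a test']})

words = toy['Dialogue'].str.split()  # whitespace tokenization
toy['word_count'] = words.apply(len)
toy['mean_word_length'] = words.apply(lambda ws: np.mean([len(w) for w in ws]))
print(toy)
```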

We'll focus on the characters that appear consistently on the show. We'll use the heuristic of picking all characters that have at least 100 dialogue lines.

In [None]:
def get_characters_having_lines(min_lines):
    character_lines = scripts['Character'].value_counts()
    characters_with_multiple_lines = character_lines.index[character_lines >= min_lines].tolist()  # >= so that "at least min_lines" holds
    return characters_with_multiple_lines

In [None]:
characters = get_characters_having_lines(100)
characters

In [None]:
# Following the example in https://stackoverflow.com/questions/17578115/pass-percentiles-to-pandas-agg-function
def get_percentiles(column_name):
    def percentile(n):
        def percentile_(x):
            return np.percentile(x, n)
        percentile_.__name__ = 'percentile_%s' % n
        return percentile_
    return scripts[scripts['Character'].isin(characters)][['Character', column_name]].groupby('Character').agg(
        [percentile(50), percentile(75), percentile(95)])

In [None]:
get_percentiles('word_count')

In [None]:
get_percentiles('mean_word_length')
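
A similar table can also be produced with pandas' built-in `quantile`, which accepts a list of quantiles after a `groupby`. This is an alternative sketch on a hypothetical toy frame, not the approach used above. With default settings both `np.percentile` and `quantile` interpolate linearly, so the numbers should match.

```python
import pandas as pd

# Hypothetical word counts for two characters.
toy = pd.DataFrame({
    'Character': ['A'] * 5 + ['B'] * 5,
    'word_count': [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# Quantiles are given as fractions (0.5 is the median, i.e. the 50th percentile).
quantiles = (
    toy.groupby('Character')['word_count']
       .quantile([0.5, 0.75, 0.95])
       .unstack()  # one row per character, one column per quantile
)
print(quantiles)
```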