# Exploring the Word Corpus and EDA

With the dataset cleaned and truncated, this notebook digs deeper into the actual dialogue and builds the word corpus. A large part of that involves examining stop words and deciding which words to drop from the data.

In [2]:
# Load in the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')

The next few cells set up the data. For more details refer to notebook \#1. The adjusted data set is named `lines_final`.

In [3]:
lines = pd.read_csv('../data/All-seasons.csv')

lines = lines[lines.Season != 'Season']

In [4]:
lines[['Season', 'Episode']] = lines[['Season', 'Episode']].astype('int64')

In [5]:
support_chars = ['Mr. Garrison', 'Chef', 'Kenny', 'Sharon',\
                 'Mr. Mackey', 'Gerald', 'Liane', 'Sheila',\
                 'Stephen', 'Ms. Garrison', 'Mrs. Garrison']

lines.loc[lines.Character.isin(support_chars), 'Character'] = 'Support Character'

In [6]:
final_labels = ['Cartman', 'Stan', 'Kyle', 'Butters', 'Randy', 'Support Character']

lines_final = lines[lines.Character.isin(final_labels)]

### Word Corpus

Here the basic corpus is created: all the strings of dialogue from the `Line` column are compiled into one list, and a regular expression removes the newline character `\n` from each string.

In [7]:
import re

corpus = lines_final.Line.tolist()

for line in range(len(corpus)):
    corpus[line] = re.sub('\\n', '', corpus[line].rstrip())
    
corpus[:10]

['You guys, you guys! Chef is going away.',
 'Going away? For how long?',
 'Forever.',
 "I'm sorry boys.",
 "Chef said he's been bored, so he joining a group called the Super Adventure Club.",
 'Wow!',
 'Chef?? What kind of questions do you think adventuring around the world is gonna answer?!',
 "What's the meaning of life? Why are we here?",
 "I hope you're making the right choice.",
 "I'm gonna miss him.  I'm gonna miss Chef and I...and I don't know how to tell him!"]
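The same cleaning step can also be written as a list comprehension. The sketch below runs on a hypothetical `raw_lines` list standing in for `lines_final.Line.tolist()`; the third entry is made up to show what happens when a newline sits in the middle of a line.

```python
import re

# Hypothetical sample lines standing in for lines_final.Line.tolist()
raw_lines = [
    'Going away? For how long?\n',
    'Forever.\n',
    'Oh my God!\nThey killed Kenny!\n',  # made-up line with an embedded newline
]

# Strip trailing whitespace, then remove any newline left inside the string
clean_lines = [re.sub(r'\n', '', line.rstrip()) for line in raw_lines]

print(clean_lines[2])  # Oh my God!They killed Kenny!
```

Note that replacing `\n` with an empty string joins the words on either side of an embedded newline. Whether the real data contains such lines is not checked here; if it does, substituting a space instead would keep the words separate.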

<b>Average length</b><br>
How many words does the average document (a single line of dialogue) contain?

In [8]:
np.mean([len(doc.split()) for doc in corpus])

11.004596119444063
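A mean on its own can hide a skewed distribution, so it may be worth looking at the median as well. The sketch below uses three lines taken from the corpus preview above as a tiny stand-in corpus (`sample_corpus` is a hypothetical name), and the standard-library `statistics` module instead of NumPy so that it runs on its own.

```python
import statistics

# Three lines from the corpus preview, used as a tiny stand-in corpus
sample_corpus = ['Forever.', "I'm sorry boys.", 'Going away? For how long?']

# Document length = number of whitespace-separated tokens
doc_lengths = [len(doc.split()) for doc in sample_corpus]
print(doc_lengths)  # [1, 3, 5]

mean_length = statistics.mean(doc_lengths)
median_length = statistics.median(doc_lengths)
print(mean_length, median_length)
```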

### Tokens

With the corpus created, we can get more granular and examine individual words. The documents are joined into one string and converted to lowercase, and leading and trailing punctuation is stripped from each word. A `Counter` then tallies the words, and the counts are stored in an OrderedDict sorted from most to least frequent.

In [9]:
import string

words = " ".join(corpus).lower()
words = " ".join(word.strip(string.punctuation) for word in words.split())

In [10]:
from collections import Counter, OrderedDict

word_counts = Counter(words.split())

token_dict = OrderedDict(word_counts.most_common())

The top of the OrderedDict shows which words appear most often, and these are the likely stop-word candidates.

In [78]:
[(k, v) for k, v in token_dict.items()][:50]

[('you', 13092),
 ('the', 10677),
 ('i', 10013),
 ('to', 9533),
 ('a', 7179),
 ('and', 6167),
 ('it', 5443),
 ('that', 4642),
 ('we', 4595),
 ('is', 4454),
 ('what', 4051),
 ('of', 4035),
 ('this', 3401),
 ('in', 3324),
 ('have', 3149),
 ('my', 3101),
 ('on', 3082),
 ('oh', 3077),
 ('all', 2946),
 ('just', 2875),
 ('do', 2770),
 ('me', 2727),
 ("i'm", 2707),
 ('no', 2584),
 ('for', 2562),
 ("don't", 2545),
 ('are', 2467),
 ('be', 2301),
 ('your', 2276),
 ("it's", 2235),
 ('yeah', 2199),
 ('get', 2195),
 ('but', 2000),
 ('not', 1972),
 ('with', 1960),
 ('know', 1948),
 ('so', 1925),
 ('dude', 1915),
 ('now', 1894),
 ('well', 1886),
 ('go', 1836),
 ('can', 1770),
 ('right', 1710),
 ('out', 1673),
 ('like', 1621),
 ('was', 1610),
 ('gonna', 1595),
 ("that's", 1565),
 ('here', 1564),
 ('guys', 1511)]
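Contractions such as "i'm" and "don't" survive in the list above because `str.strip` only removes characters from the two ends of a token. A small illustration (the `samples` list is hypothetical, loosely based on tokens seen in the corpus preview):

```python
import string

# str.strip(chars) removes matching characters from both ends only
samples = ["Chef??", "don't", "I...and", "'cause"]
cleaned = [token.lower().strip(string.punctuation) for token in samples]

print(cleaned)  # ['chef', "don't", 'i...and', 'cause']
```

Interior punctuation is left alone, so a token like `I...and` stays a single "word" in the counts.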

### Frequency Metrics

Beyond identifying common words, the two functions below analyze different kinds of word frequency. The first compares the total number of times a word appears in the corpus (total frequency) with the number of documents that contain the word (document frequency).

In [79]:
def freq_compare(word, corpus=corpus):
    '''Takes a word and a word corpus and calculates
    the total word frequency and the number of documents
    containing that word.'''
    
    word = word.lower()
    
    words = " ".join(corpus).lower()
    words = " ".join(word.strip(string.punctuation) for word in words.split())
    
    word_counts = Counter(words.split())
    token_dict = OrderedDict(word_counts.most_common())
    
    # .get avoids a KeyError when the word never appears in the corpus
    total_frequency = token_dict.get(word, 0)
    
    doc_freq = 0
    
    for line in corpus:
        if word in [token.strip(string.punctuation).lower() for token in line.split()]:
            doc_freq += 1
    
    print('The total frequency of the word \'{}\' is: \t'.format(word), total_frequency)
    print('The number of documents with the word \'{}\': \t'.format(word), doc_freq)

Now a sanity check:

In [80]:
freq_compare('A', ['a', 'a a.', 'b'])

The total frequency of the word 'a' is: 	 3
The number of documents with the word 'a': 	 2
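`freq_compare` rebuilds the word counts on every call. If many words need checking, one alternative is to compute both tables once and look words up afterwards. This is a sketch only, not part of the original notebook; `tf_df_tables` is a hypothetical helper name.

```python
import string
from collections import Counter

def tf_df_tables(docs):
    """Return (total_counts, doc_counts) Counters covering every word in docs."""
    total_counts = Counter()
    doc_counts = Counter()
    for doc in docs:
        tokens = [t.strip(string.punctuation).lower() for t in doc.split()]
        tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
        total_counts.update(tokens)        # every occurrence counts
        doc_counts.update(set(tokens))     # each word counts once per document
    return total_counts, doc_counts

# Same toy corpus as the sanity check above
total_counts, doc_counts = tf_df_tables(['a', 'a a.', 'b'])
print(total_counts['a'], doc_counts['a'])  # 3 2
```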


The next function compares the total frequency of a particular word between the different character labels. This is useful for assessing if a particular word has more importance for specific characters. 

In [81]:
def compare_labels(term):
    '''Takes a particular word and calculates
    how often it is used by each character'''
    
    term = term.lower()
    
    count_dict = {'Cartman': 0, 'Stan': 0, 'Kyle': 0, 'Butters': 0,\
                  'Randy': 0, 'Support Character': 0}
    
    for k, v in count_dict.items():
        subset = lines_final[lines_final.Character == k]
        
        corpus = subset.Line.tolist()

        for line in range(len(corpus)):
            corpus[line] = re.sub('\\n', '', corpus[line].rstrip())
        
        words = " ".join(corpus).lower()
        words = " ".join(word.strip(string.punctuation) for word in words.split())
    
        word_counts = Counter(words.split())
        token_dict = OrderedDict(word_counts.most_common())
        
        if term in token_dict.keys():
            count_dict[k] += token_dict[term]
    
    print('How often the word \'{}\' appears in each class:'.format(term))
    print(count_dict)
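`compare_labels` re-tokenizes every label's dialogue each time it is called. A possible variation is to build one `Counter` per label up front and reuse it. The sketch below uses made-up `(character, line)` pairs in place of `lines_final`; `sample_rows`, `counts_by_label` and `label_counts` are hypothetical names.

```python
import string
from collections import Counter, defaultdict

# Made-up (character, line) pairs standing in for rows of lines_final
sample_rows = [
    ('Stan', 'Dude, this is pretty messed up.'),
    ('Kyle', 'Dude! No way, dude.'),
    ('Butters', 'Oh, hamburgers.'),
]

# Build one Counter per label in a single pass over the data
counts_by_label = defaultdict(Counter)
for character, line in sample_rows:
    tokens = (t.strip(string.punctuation).lower() for t in line.split())
    counts_by_label[character].update(t for t in tokens if t)

def label_counts(term):
    """Look up a term in the precomputed per-label Counters."""
    term = term.lower()
    return {label: counts[term] for label, counts in counts_by_label.items()}

print(label_counts('Dude'))  # {'Stan': 1, 'Kyle': 2, 'Butters': 0}
```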

<b>Trying the functions: </b>Now that the functions are defined, let's examine a couple of words from `token_dict`, the OrderedDict of all words. Keep in mind that there are approximately 36,000 documents in the data, and that Cartman, Stan and Kyle have the majority of the lines.

In [82]:
freq_compare('The')
compare_labels('The')

The total frequency of the word 'the' is: 	 10677
The number of documents with the word 'the': 	 7654
How often the word 'the' appears in each class:
{'Cartman': 3678, 'Stan': 1729, 'Kyle': 1565, 'Butters': 683, 'Randy': 909, 'Support Character': 2113}


In [83]:
freq_compare('but')
compare_labels('but')

The total frequency of the word 'but' is: 	 2000
The number of documents with the word 'but': 	 1858
How often the word 'but' appears in each class:
{'Cartman': 546, 'Stan': 366, 'Kyle': 342, 'Butters': 217, 'Randy': 168, 'Support Character': 361}


In [77]:
freq_compare('dude')
compare_labels('dude')

The total frequency of the word 'dude' is: 	 1915
The number of documents with the word 'dude': 	 1867
How often the word 'dude' appears in each class:
{'Cartman': 415, 'Stan': 853, 'Kyle': 607, 'Butters': 0, 'Randy': 3, 'Support Character': 37}


The words 'the' and 'but' are common and appear in every label's dialogue. The word 'dude', although nearly as frequent as 'but' (1,915 versus 2,000 occurrences), is used almost exclusively by Cartman, Stan and Kyle: together they account for 1,875 of its uses, while Butters' count is zero. This suggests it might be a good word to help identify those three characters.
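To put a number on the comparison, the share of each word's occurrences that comes from Cartman, Stan and Kyle can be computed directly from the printed counts. The dictionaries below are copied from the `compare_labels` output above; `label_counts_by_word` and `shares` are names introduced only for this sketch.

```python
# Per-label counts copied from the compare_labels output above
label_counts_by_word = {
    'the': {'Cartman': 3678, 'Stan': 1729, 'Kyle': 1565,
            'Butters': 683, 'Randy': 909, 'Support Character': 2113},
    'but': {'Cartman': 546, 'Stan': 366, 'Kyle': 342,
            'Butters': 217, 'Randy': 168, 'Support Character': 361},
    'dude': {'Cartman': 415, 'Stan': 853, 'Kyle': 607,
             'Butters': 0, 'Randy': 3, 'Support Character': 37},
}
main_three = ('Cartman', 'Stan', 'Kyle')

# Fraction of each word's total uses spoken by the three main characters
shares = {}
for word, counts in label_counts_by_word.items():
    total = sum(counts.values())
    shares[word] = sum(counts[name] for name in main_three) / total
    print(f'{word}: {shares[word]:.1%}')
# the: 65.3%
# but: 62.7%
# dude: 97.9%
```

These shares are not adjusted for how much each character speaks overall, so they are best read relative to one another: 'the' and 'but' sit near the same level, while 'dude' is far above it.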