# Exploring the Word Corpus and EDA

With the dataset cleaned, this notebook digs into the dialogue itself and builds the word corpus. A large part of that work is examining candidate stop words and deciding which words to drop from the data.

In [1]:
# Load in the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')

The next few cells set up the data. For more details, refer to notebook #1.

In [2]:
lines = pd.read_csv('../data/All-seasons.csv')

lines = lines[lines.Season != 'Season']

In [3]:
lines[['Season', 'Episode']] = lines[['Season', 'Episode']].astype('int64')

In [4]:
main_chars = ['Cartman', 'Stan', 'Kyle', 'Kenny', 'Butters']

lines['Group'] = 0

lines.loc[lines.Character.isin(main_chars), 'Group'] = 1

lines.head()

Unnamed: 0,Season,Episode,Character,Line,Group
0,10,1,Stan,"You guys, you guys! Chef is going away. \n",1
1,10,1,Kyle,Going away? For how long?\n,1
2,10,1,Stan,Forever.\n,1
3,10,1,Chef,I'm sorry boys.\n,0
4,10,1,Stan,"Chef said he's been bored, so he joining a gro...",1


### Word Corpus

Here the basic corpus is created. All the strings of dialogue from the `Line` column are compiled into one list. Trailing whitespace, including the newline character `\n`, is then stripped from each string, and a regular expression removes any newline left inside a line.

In [5]:
import re

corpus = lines.Line.tolist()

# Strip trailing whitespace, then remove any newline left inside the line
corpus = [re.sub(r'\n', '', line.rstrip()) for line in corpus]
    
corpus[:10]

['You guys, you guys! Chef is going away.',
 'Going away? For how long?',
 'Forever.',
 "I'm sorry boys.",
 "Chef said he's been bored, so he joining a group called the Super Adventure Club.",
 'Wow!',
 'Chef?? What kind of questions do you think adventuring around the world is gonna answer?!',
 "What's the meaning of life? Why are we here?",
 "I hope you're making the right choice.",
 "I'm gonna miss him.  I'm gonna miss Chef and I...and I don't know how to tell him!"]

<b>Average length</b><br>
How many words does each line (document) contain on average?

In [6]:
np.mean([len(doc.split()) for doc in corpus])

11.455551009466838
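
A mean can be pulled up by a few long lines, so the median and maximum may be worth a look as well. The following is a minimal sketch only: it uses the standard-library `statistics` module, and `sample_corpus` and `length_summary` are hypothetical names standing in for the notebook's `corpus` and for whatever helper one might prefer.

```python
import statistics

# Hypothetical stand-in for the notebook's `corpus` list
sample_corpus = [
    "Going away? For how long?",
    "Forever.",
    "I'm sorry boys.",
    "Wow!",
]

def length_summary(docs):
    """Return mean, median, and max whitespace-token counts for a list of strings."""
    lengths = [len(doc.split()) for doc in docs]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }

summary = length_summary(sample_corpus)
print(summary)  # {'mean': 2.5, 'median': 2.0, 'max': 5}
```

Calling `length_summary(corpus)` in the notebook would give the same three figures for the full dialogue list.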

### Tokens

With the corpus created, we can now get more granular and examine the individual words. First, the contractions are expanded (for example, "I'm" becomes "I am"). Then the lines are joined into one string, converted to lowercase, and stripped of leading and trailing punctuation on each word. Next, each token is lemmatized as a verb and then as a noun. Finally, a `Counter` tallies the words, and the counts are stored in an `OrderedDict` sorted by frequency within the corpus.

In [7]:
import string
import contractions
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

for line in range(len(corpus)):
    corpus[line] = contractions.fix(corpus[line])

words = " ".join(corpus).lower()
words = " ".join(word.strip(string.punctuation) for word in words.split())

In [8]:
lem = WordNetLemmatizer()

# Tokenize, lemmatize each token as a verb, then as a noun (the default pos)
word_list = word_tokenize(words)
word_list = [lem.lemmatize(w, pos='v') for w in word_list]
words = ' '.join(lem.lemmatize(w) for w in word_list)

In [9]:
from collections import Counter, OrderedDict

word_counts = Counter(words.split())

token_dict = OrderedDict(word_counts.most_common())
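
As a small aside on the counting step, `Counter.most_common()` already returns its pairs sorted from most to least frequent, and a plain `dict` keeps insertion order in Python 3.7+, so the ordering survives either container. The sketch below uses a toy string (`toy_words` is a hypothetical stand-in for `words`) to show the idea:

```python
from collections import Counter

# Toy token string standing in for the notebook's `words`
toy_words = "be you be the you be"
toy_counts = Counter(toy_words.split())

# most_common(n) returns the n highest counts, already sorted
top_two = toy_counts.most_common(2)
print(top_two)  # [('be', 3), ('you', 2)]

# A plain dict preserves that order too (Python 3.7+)
toy_token_dict = dict(toy_counts.most_common())
print(list(toy_token_dict))  # ['be', 'you', 'the']
```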

Looking at the OrderedDict shows which words show up the most, indicating likely stop words.

In [10]:
list(token_dict.items())[:250]

[('be', 50399),
 ('you', 29087),
 ('the', 25182),
 ('i', 25032),
 ('to', 23957),
 ('a', 16796),
 ('it', 14770),
 ('do', 14720),
 ('not', 14323),
 ('and', 13407),
 ('we', 12758),
 ('that', 11968),
 ('have', 11793),
 ('of', 9403),
 ('what', 8311),
 ('go', 8168),
 ('get', 7532),
 ('in', 7403),
 ('this', 6868),
 ('all', 6102),
 ('on', 6024),
 ('oh', 5617),
 ('for', 5602),
 ('my', 5585),
 ('can', 5511),
 ('just', 5206),
 ('your', 5088),
 ('me', 5028),
 ('no', 4595),
 ('will', 4436),
 ('with', 4224),
 ('he', 4183),
 ('now', 4064),
 ('but', 3900),
 ('know', 3899),
 ('they', 3723),
 ('yeah', 3672),
 ('u', 3648),
 ('so', 3592),
 ('here', 3589),
 ('well', 3540),
 ("'s", 3537),
 ('right', 3462),
 ('come', 3415),
 ('like', 3333),
 ('there', 3278),
 ('out', 3267),
 ('want', 3116),
 ('think', 2921),
 ('see', 2874),
 ('up', 2861),
 ('about', 2714),
 ('guy', 2609),
 ('our', 2603),
 ('make', 2520),
 ('how', 2483),
 ('let', 2474),
 ('say', 2453),
 ('look', 2412),
 ('if', 2406),
 ('at', 2380),
 ('take', 

### Frequency Metrics

Beyond identifying common words, the two functions below analyze different kinds of word frequency. The first compares a word's total frequency (the number of times it appears anywhere in the corpus) with its document frequency (the number of documents that contain it at least once).

In [11]:
def freq_compare(word, corpus=corpus):
    '''Takes a word and a word corpus and prints
    the total word frequency and the number of documents
    containing that word.'''
    
    word = word.lower()
    
    # Lowercase the corpus and strip punctuation from both ends of each token
    words = " ".join(corpus).lower()
    words = " ".join(w.strip(string.punctuation) for w in words.split())
    
    word_counts = Counter(words.split())
    
    # A Counter returns 0 for a missing word instead of raising a KeyError
    total_frequency = word_counts[word]
    
    doc_freq = 0
    
    # Count each document at most once, however often the word appears in it
    for line in corpus:
        if word in [token.strip(string.punctuation).lower() for token in line.split()]:
            doc_freq += 1
    
    print('The total frequency of the word \'{}\' is: \t'.format(word), total_frequency)
    print('The number of documents with the word \'{}\': \t'.format(word), doc_freq)

Now a sanity check:

In [12]:
freq_compare('A', ['a', 'a a.', 'b'])

The total frequency of the word 'a' is: 	 3
The number of documents with the word 'a': 	 2
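
For reuse elsewhere, it can be handy to return the two numbers rather than print them. The following is only a sketch of that variant, not the notebook's function: `term_and_doc_freq` and `_tokens` are hypothetical names, and the tokenization (lowercase, split on whitespace, strip surrounding punctuation) is assumed to mirror the cell above.

```python
import string
from collections import Counter

def _tokens(doc):
    """Lowercase a document and strip punctuation from both ends of each token."""
    stripped = (tok.strip(string.punctuation) for tok in doc.lower().split())
    return [tok for tok in stripped if tok]

def term_and_doc_freq(word, docs):
    """Return (total frequency, document frequency) of `word` in `docs`."""
    word = word.lower()
    total = Counter()       # every occurrence of every token
    doc_counts = Counter()  # number of documents containing each token
    for doc in docs:
        toks = _tokens(doc)
        total.update(toks)
        doc_counts.update(set(toks))  # a set counts each token once per document
    return total[word], doc_counts[word]

print(term_and_doc_freq('A', ['a', 'a a.', 'b']))  # (3, 2)
```

The result matches the sanity check: three occurrences of 'a' spread over two documents.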


The next function compares how often a particular word is used in each class, expressed as occurrences per line of dialogue in that class. This is useful for assessing whether a word is more characteristic of one label than the other.

In [13]:
def compare_labels(term):
    '''Takes a particular word and calculates how often
    it is used in each class (1 = main characters, 0 = everyone else),
    as occurrences per line of dialogue.'''
    
    term = term.lower()
    
    count_dict = {1: 0, 0: 0}
    
    for k in count_dict:
        subset = lines[lines.Group == k]
        
        # Strip trailing whitespace and newlines from this class's lines
        corpus = [re.sub(r'\n', '', line.rstrip()) for line in subset.Line.tolist()]
        
        words = " ".join(corpus).lower()
        words = " ".join(w.strip(string.punctuation) for w in words.split())
    
        # A Counter returns 0 when the term never appears in this class
        word_counts = Counter(words.split())
        
        # Convert the raw count to a rate: occurrences per line in this class
        count_dict[k] = round(word_counts[term] / len(subset), 3)
    
    print('How often the word \'{}\' appears in each class:'.format(term))
    print(count_dict)
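
To make the per-class rate concrete, here is a toy version that works on plain `(label, text)` pairs instead of the `lines` DataFrame. It is a sketch under that assumption, and `per_class_rate` and `toy` are hypothetical names introduced only for illustration.

```python
import string
from collections import Counter, defaultdict

def per_class_rate(term, labeled_lines):
    """Occurrences of `term` per line, for each label in (label, text) pairs."""
    term = term.lower()
    hits = defaultdict(int)  # occurrences of the term per label
    n_lines = Counter()      # number of lines per label
    for label, text in labeled_lines:
        n_lines[label] += 1
        toks = [t.strip(string.punctuation) for t in text.lower().split()]
        hits[label] += toks.count(term)
    return {label: round(hits[label] / n_lines[label], 3) for label in n_lines}

toy = [(1, "Dude, dude!"), (1, "No way."), (0, "Hello there."), (0, "Dude?")]
print(per_class_rate("dude", toy))  # {1: 1.0, 0: 0.5}
```

Class 1 has two occurrences over two lines (rate 1.0), while class 0 has one occurrence over two lines (rate 0.5). Dividing by the number of lines is what keeps the larger class from dominating the comparison.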

<b>Trying the functions: </b><br>
Now that the functions are defined, let's examine a couple of words from `token_dict`, the OrderedDict of all words. Keep in mind that there are approximately 71,000 documents in the data and that the majority class is label 0.

In [14]:
freq_compare('The')
compare_labels('The')

The total frequency of the word 'the' is: 	 25149
The number of documents with the word 'the': 	 16793
How often the word 'the' appears in each class:
{1: 0.277, 0: 0.405}


In [23]:
freq_compare('okay')
compare_labels('okay')

The total frequency of the word 'okay' is: 	 2103
The number of documents with the word 'okay': 	 1865
How often the word 'okay' appears in each class:
{1: 0.031, 0: 0.028}


In [16]:
freq_compare('dude')
compare_labels('dude')

The total frequency of the word 'dude' is: 	 2048
The number of documents with the word 'dude': 	 1998
How often the word 'dude' appears in each class:
{1: 0.068, 0: 0.003}


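The words 'the' and 'okay' are both common. 'Okay' is used at nearly the same rate in both classes (0.031 versus 0.028 occurrences per line), and 'the', although more frequent in class 0 (0.405 versus 0.277), is so common in both classes that it is unlikely to distinguish them. On the other hand, 'dude', although nearly as frequent overall as 'okay', is used roughly twenty times more often per line by the main characters (0.068 versus 0.003). This suggests it could be a good word for identifying those characters.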
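
One way to put a single number on this comparison is the ratio of the class-1 rate to the class-0 rate. The sketch below applies that to the rates printed above; `rate_ratio` is a hypothetical helper, and because the inputs are already rounded to three decimals, the ratios are only approximate.

```python
def rate_ratio(rates):
    """Ratio of the class-1 rate to the class-0 rate (inf if class 0 never uses the word)."""
    if rates[0] == 0:
        return float('inf')
    return rates[1] / rates[0]

# Per-line rates copied from the outputs above
reported = {
    'the': {1: 0.277, 0: 0.405},
    'okay': {1: 0.031, 0: 0.028},
    'dude': {1: 0.068, 0: 0.003},
}

for word, rates in reported.items():
    print(word, round(rate_ratio(rates), 2))
```

A ratio near 1 means both classes use the word at a similar rate, while a ratio far above or below 1 points to a word that leans toward one class.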

<b>Sifting through stopwords</b><br>
With the functions defined, we can now run them over the list of common words to compare frequencies and confirm which words should be dropped as stopwords. The small function below combines the two functions from above.

In [17]:
def word_freqs(word):
    freq_compare(word)
    compare_labels(word)

In [18]:
word_freqs("than")

The total frequency of the word 'than' is: 	 569
The number of documents with the word 'than': 	 550
How often the word 'than' appears in each class:
{1: 0.008, 0: 0.008}


For this process, I chose to manually check each word in `token_dict` in order to dig into the list and compare each one. Having done so, I defined the final list of stopwords below. None of the words in the list is too surprising, but some of the words I chose not to treat as stopwords may be of interest. For example, I kept 'my' and 'me' because they had slightly imbalanced distributions, which might owe to Cartman's selfish nature. I also kept 'no' and 'not' because they might be useful later when moving beyond a simple bag-of-words model.<br>
<br>
The final list contains 53 words.

In [19]:
stop_words = ['be', 'you', 'i', 'to', 'the', 'do', 'it',
              'a', 'we', 'that', 'and', 'have', 'go', 'what',
              'get', 'of', 'this', 'in', 'on', 'all', 'just',
              'for', 'he', 'know', 'will', 'but', 'with', 'so',
              'they', 'now', 'well', "'s", 'guy', 'u', 'come',
              'like', 'there', 'at', 'would', 'who', 'him',
              'them', 'his', 'thing', 'where', 'should', 'an',
              'please', 'maybe', 'their', 'even', 'any', 'than']

In [20]:
len(stop_words)

53
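
Once the list is fixed, dropping the stopwords from a line is a simple filter. The following is a minimal sketch that assumes the text has already been lowercased and lemmatized as in the cells above; `drop_stop_words` and `toy_stop_words` are hypothetical names, and in the notebook the real `stop_words` list would be passed instead.

```python
# Hypothetical short stop list; the notebook's `stop_words` would be used instead
toy_stop_words = ['be', 'you', 'the', 'to', 'a']

def drop_stop_words(text, stop_words):
    """Return the whitespace tokens of `text` that are not in `stop_words`."""
    stop_set = set(stop_words)  # set membership is O(1) per token
    return [tok for tok in text.split() if tok not in stop_set]

print(drop_stop_words('you be the one to go', toy_stop_words))  # ['one', 'go']
```

Converting the list to a set first keeps the lookup fast even when the filter runs over every line in the corpus.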