Text Mining South Park

South Park follows four fourth-grade boys (Stan, Kyle, Cartman and Kenny) and an extensive ensemble cast of recurring characters. This analysis reviews their speech to determine which words and phrases are distinctive of each character. Since the series uses a lot of running gags, common phrases should be easy to find.

The programming language R and the packages tm, RWeka and stringr were used to read South Park episode transcripts from a repository, attribute each line to a character, break the lines into ngrams, calculate the log likelihood for each ngram/character pair, and rank the pairs to create a list of the most characteristic words/phrases for each character. The results were visualized using ggplot2, wordcloud and RColorBrewer.
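The ngram step can be illustrated with a minimal base R sketch. It is not the tm/RWeka code used in this project: the helper `make_ngrams`, the data frame `transcript` and its three rows are hypothetical and exist only to show how lines attributed to a character turn into ngram counts.

```r
# Hypothetical helper (not from the project's scripts): split one line of
# dialogue into word ngrams of length n, using base R only.
make_ngrams <- function(text, n) {
  # Lowercase, then replace everything except letters, apostrophes and spaces
  clean <- gsub("[^a-z' ]", " ", tolower(text))
  words <- strsplit(trimws(clean), " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy stand-in for the transcripts; these rows are made up for illustration
transcript <- data.frame(
  character = c("Stan", "Kyle", "Stan"),
  line = c("Oh my God, they killed Kenny!", "You bastards!", "Oh my God!"),
  stringsAsFactors = FALSE
)

# Collect every bigram spoken by one character and count the occurrences
stan_lines <- transcript$line[transcript$character == "Stan"]
stan_bigrams <- unlist(lapply(stan_lines, make_ngrams, n = 2))
stan_counts <- sort(table(stan_bigrams), decreasing = TRUE)
print(stan_counts)
```

Counts like these, computed once for a character and once for everyone else, are the inputs that a log likelihood calculation needs.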


Complete transcripts (70,000 lines amounting to 5.5 MB) were downloaded from BobAdamsEE's GitHub repository SouthParkData. The original source of the transcripts is the South Park Wikia page.

Log Likelihood

Each corpus was analyzed to determine the most characteristic words for each speaker. Frequent and characteristic words are not the same thing; otherwise words like "I", "school", and "you" would rise to the top instead of unique words and phrases like "professor chaos", "hippies" and "you killed kenny".

Log likelihood was used to measure the uniqueness of the ngrams by character. Log likelihood compares the occurrence of a word in a particular corpus (the body of a character's speech) to its occurrence in another corpus (all of the remaining South Park text) to determine whether it shows up more or less often than expected. The returned value indicates how strongly the observed counts depart from what would be expected if both corpora came from the same, larger corpus; in that sense it plays a role similar to a test statistic such as the one from a t-test.
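As a rough sketch of the calculation, the function below implements one common form of the log likelihood statistic for comparing two corpora. Whether the full report uses exactly this form, including the sign convention, is an assumption; the name `log_likelihood`, its arguments and the counts in the usage lines are hypothetical.

```r
# Hypothetical helper: signed log likelihood for one ngram.
#   a: count of the ngram in the character's speech
#   b: count of the ngram in all remaining text
#   c: total ngrams in the character's speech
#   d: total ngrams in all remaining text
log_likelihood <- function(a, b, c, d) {
  # Expected counts if the ngram occurred at the same rate in both corpora
  e1 <- c * (a + b) / (c + d)
  e2 <- d * (a + b) / (c + d)
  # Treat 0 * log(0) as 0 so ngrams absent from one corpus stay finite
  t1 <- ifelse(a > 0, a * log(a / e1), 0)
  t2 <- ifelse(b > 0, b * log(b / e2), 0)
  ll <- 2 * (t1 + t2)
  # Assumed sign convention: positive when the character over-uses the ngram
  ifelse(a / c >= b / d, ll, -ll)
}

# Made-up counts, for illustration only
over_used  <- log_likelihood(50, 10, 10000, 90000)   # frequent for the character
same_rate  <- log_likelihood(10, 90, 1000, 9000)     # identical relative frequency
under_used <- log_likelihood(1, 100, 10000, 90000)   # rare for the character
```

With this convention, ranking a character's ngrams by descending score puts the most characteristic phrases first, and an ngram used at the same rate by everyone scores zero.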

Read the full report