Analyze South Park lines and build classification model to guess the speakers

yanfei-wu/tv_lines

Lines from South Park

This is a data analysis project done for fun in Python 3.5. The goal is to see what I can find out about the show from its lines alone. The dataset comes from Kaggle and contains the lines from the first 18 seasons of the show, each annotated with season, episode, and speaker.

A quick line/word count shows that there are over 70,000 lines and over 800,000 words in the dataset. Not surprisingly, the lines are relatively short (a median of 8 words and a 75th percentile of 14 words). Line and word counts are compared by season to show how they have evolved. The lines are also used to understand the show's characters: finding the top speakers, their most frequent words, and the words with the highest term frequency-inverse document frequency (tf-idf).
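
The counts and tf-idf scores described above can be sketched with the standard library alone. This is a minimal illustration, not the code used in the project: the sample rows are made-up placeholders standing in for the Kaggle data, the helper names (`tokenize`, `percentile`) are hypothetical, and the tf-idf formula is one common variant (relative term frequency times `log(N / df)`), with all of a speaker's lines treated as a single document.

```python
import math
import re
from collections import Counter, defaultdict

# Placeholder rows in the annotated form (season, episode, speaker, line).
# These are illustrative only and are not taken from the dataset.
rows = [
    (1, 1, "Cartman", "Screw you guys, I'm going home!"),
    (1, 1, "Kyle", "Dude, that is not cool."),
    (1, 2, "Stan", "Oh my God, they killed Kenny!"),
    (2, 1, "Cartman", "Respect my authority!"),
    (2, 1, "Kyle", "You are kidding, dude."),
]


def tokenize(text):
    """Lowercase a line and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def percentile(values, q):
    """Return the q-th percentile (0-100), interpolating linearly between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Overall line/word counts and the length distribution of the lines.
words_per_line = [len(tokenize(line)) for _, _, _, line in rows]
total_lines = len(rows)
total_words = sum(words_per_line)
median_words = percentile(words_per_line, 50)
p75_words = percentile(words_per_line, 75)

# Lines and words per season, to compare seasons against each other.
lines_by_season = Counter()
words_by_season = Counter()
for (season, _, _, _), n_words in zip(rows, words_per_line):
    lines_by_season[season] += 1
    words_by_season[season] += n_words

# Top speakers by number of lines, and each speaker's word counts.
speaker_lines = Counter(speaker for _, _, speaker, _ in rows)
speaker_words = defaultdict(Counter)
for _, _, speaker, line in rows:
    speaker_words[speaker].update(tokenize(line))

# Document frequency: in how many speakers' vocabularies each word appears.
doc_freq = Counter()
for counts in speaker_words.values():
    doc_freq.update(counts.keys())

# tf-idf per speaker: (count / total words) * log(N / df).
n_docs = len(speaker_words)
tfidf = {}
for speaker, counts in speaker_words.items():
    total = sum(counts.values())
    tfidf[speaker] = {
        word: (count / total) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

print("lines:", total_lines, "words:", total_words)
print("median words per line:", median_words, "75th percentile:", p75_words)
print("top speakers:", speaker_lines.most_common(2))
print("Kyle's top tf-idf word:", max(tfidf["Kyle"], key=tfidf["Kyle"].get))
```

Words that every speaker uses get an idf of `log(1) = 0` under this variant, so the highest-scoring words are the ones that are frequent for one speaker and rare for the others. On the full dataset, a stop-word filter and a more careful tokenizer would likely be needed before the most-frequent-word lists become informative.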
