Analysis of The Simpsons
Ruby R HTML CSS JavaScript
Latest commit 101fec7 Oct 30, 2016 @toddwschneider Update README

README.md

The Simpsons by the Data

Code in support of this post: The Simpsons by the Data

It's a Rails app, but it isn't intended to be run as a server. It processes data from Simpsons World, Wikipedia, and IMDb, and populates a PostgreSQL database called simpsons_development. The database contains four primary tables: episodes, script_lines, characters, and locations.

Instructions

Assumes you have Ruby and PostgreSQL installed

git clone git@github.com:toddwschneider/flim-springfield.git
cd flim-springfield/
createdb simpsons_development
bundle exec rake db:migrate
bundle exec rake import_data
bundle exec rake jobs:work

It takes about 45 minutes to process everything with one worker

Analysis

R code to analyze the data lives in the analysis/ folder

Caveats/areas for improvement

  • I deduped some character names when they're printed in different ways, e.g. "TROY" is the same as "Troy McClure", but I certainly did not dedupe all 6000+ characters that appear in the scripts
  • Similarly, I manually assigned genders to the top 320 or so characters, who collectively account for 86% of the show's dialogue
  • I did not dedupe any locations
  • Simpsons World is not available in all countries, so the code might not run depending on where you're located
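
The name deduplication and dialogue-share caveats above can be illustrated with a minimal Ruby sketch. This is not the repo's actual implementation: the alias table and the method names `canonical_character_name`, `line_counts`, and `top_share` are hypothetical, and only the "TROY" / "Troy McClure" pairing comes from this README.

```ruby
# Hypothetical alias table mapping a normalized raw name to a canonical name.
# Only the TROY example comes from the README; a real table would be much larger.
CHARACTER_ALIASES = {
  "troy" => "Troy McClure",
  "troy mcclure" => "Troy McClure"
}.freeze

# Collapse surrounding and repeated whitespace, then look the name up
# case-insensitively. Names without an alias pass through unchanged apart
# from the whitespace cleanup, which mirrors a partial (not exhaustive) dedupe.
def canonical_character_name(raw_name)
  cleaned = raw_name.to_s.strip.gsub(/\s+/, " ")
  CHARACTER_ALIASES.fetch(cleaned.downcase, cleaned)
end

# Count script lines per canonical character, given one raw name per line.
def line_counts(raw_names)
  raw_names.each_with_object(Hash.new(0)) do |name, counts|
    counts[canonical_character_name(name)] += 1
  end
end

# Fraction of all lines spoken by the top_n characters ranked by line count.
def top_share(counts, top_n)
  total = counts.values.sum
  return 0.0 if total.zero?

  counts.values.sort.reverse.first(top_n).sum.to_f / total
end

# Example usage with made-up input:
counts = line_counts(["TROY", "Troy McClure", "Homer Simpson", "Homer Simpson"])
puts counts.inspect
puts top_share(counts, 1)
```

An explicit alias table keeps each merge decision visible and reviewable, at the cost of manual upkeep. That trade-off fits the caveat that only some of the 6000+ names were deduped.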
