We analyze Nielsen ratings and Rotten Tomatoes scores to find that critic and audience reviews don't have much to do with what people actually watch.
This is our Flatiron School (NYC Data Science) Module 1 project.
See the presentation and conclusions on Google Slides, or view the PDF shake-it-off.pdf in our repo.
- The purpose of this project was to provide actionable insights to a hypothetical large company looking to enter the streaming wars (i.e., to compete with Amazon Prime Video, Netflix, Hulu, Disney+, Apple TV+, etc.).
- An ancillary purpose was to demonstrate and practice our new-found skills in web scraping, API usage, SQL, pandas, visualization (matplotlib/seaborn), and creation of ETL pipelines.
- Data:
  - Nielsen ratings (national overnights, ages 18–49) on a daily basis for broadcast primetime and the top 25 cable shows. Scraped from TV By the Numbers. Available back to 2015.
  - Rotten Tomatoes audience and critic scores for the 766 TV shows we were able to match.
  - A list of Netflix and Amazon shows (via Wikipedia: Netflix, Amazon).
- Tools (all in Python):
  - BeautifulSoup
  - pandas
  - SQLAlchemy
  - MySQL Server on AWS RDS
  - Seaborn/Matplotlib
- data-extraction.ipynb does the extraction (use this).
- tv_by_the_numbers.py scrapes TV By the Numbers. It also contains several scraping utilities.
- tv-show-extra-finding.ipynb improves the matching of TV By the Numbers shows to Rotten Tomatoes shows.
- nflix_amaz_shows.ipynb loads the Wikipedia data from a stored CSV.
- rotten_tomatoes.py provides Rotten Tomatoes scraping.
- transform-and-load.ipynb joins the appropriate tables and does data cleaning.
- analysis.ipynb, in the top-level directory, contains the analysis.
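
Matching show titles between two sources (the job tv-show-extra-finding.ipynb is described as improving) could be approached roughly as below. This is a minimal sketch using only the standard library, not the code from the notebooks: the function names `normalize_title` and `match_titles`, the 0.85 cutoff, and the sample titles are all assumptions made up for illustration.

```python
import re
from difflib import get_close_matches


def normalize_title(title: str) -> str:
    """Lowercase, drop a trailing parenthetical such as '(2017)', strip punctuation."""
    title = title.lower()
    title = re.sub(r"\s*\([^)]*\)\s*$", "", title)  # trailing "(...)" qualifier
    title = re.sub(r"[^a-z0-9 ]+", " ", title)      # punctuation becomes a space
    return " ".join(title.split())                  # collapse repeated whitespace


def match_titles(ratings_titles, review_titles, cutoff=0.85):
    """Map each ratings title to its closest review title, or None if no match clears the cutoff."""
    by_norm = {normalize_title(t): t for t in review_titles}
    matches = {}
    for title in ratings_titles:
        norm = normalize_title(title)
        if norm in by_norm:  # exact match after normalization
            matches[title] = by_norm[norm]
            continue
        # Otherwise fall back to the single closest fuzzy match, if any.
        close = get_close_matches(norm, list(by_norm), n=1, cutoff=cutoff)
        matches[title] = by_norm[close[0]] if close else None
    return matches


# Hypothetical placeholder titles, not rows from the project's data.
ratings_titles = ["SHOW ALPHA", "Show Beta (2017)", "Show: Gamma", "Unlisted Program"]
review_titles = ["Show Alpha", "Show Beta", "Show Gamma", "Another Series"]

matched = match_titles(ratings_titles, review_titles)
for ratings_title, review_title in matched.items():
    print(f"{ratings_title!r} -> {review_title!r}")
```

Titles that map to `None` are the ones that would still need manual review. Raising the cutoff trades fewer false matches for more unmatched shows.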