Predict movie's IMDB rating
Latest commit 4fc1952 Sep 5, 2017


Predict IMDB movie rating

by Chuan Sun (sundeepblue at gmail dot com)

Scrapy project @ NYC Data Science Academy




Fetch a list of 5000 movie titles and budgets from

This step will generate a JSON file 'movie_budget.json'

$ scrapy crawl movie_budget -o movie_budget.json
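With `-o movie_budget.json`, Scrapy writes the scraped items as a JSON array. A minimal sketch of loading that file and normalizing the budgets might look like the following. The item keys `movie_name` and `budget`, and the `"$425,000,000"` string format, are assumptions for illustration and are not taken from the spider's code.

```python
import json
import re


def parse_budget(text):
    """Convert a budget string such as "$425,000,000" to an int.

    Returns None when the text contains no digits.
    """
    digits = re.sub(r"[^\d]", "", text or "")
    return int(digits) if digits else None


def load_budgets(path="movie_budget.json"):
    """Return {title: budget_in_dollars} from the scraped JSON array.

    The keys "movie_name" and "budget" are hypothetical.
    """
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    return {item["movie_name"]: parse_budget(item.get("budget")) for item in items}
```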



Load 5000+ movie titles from the JSON file 'movie_budget.json'

Then search for those titles on the IMDB website to get the actual IMDB movie links

It will generate a JSON file 'fetch_imdb_url.json' containing movie-link pairs

$ scrapy crawl fetch_imdb_url -o fetch_imdb_url.json
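The title lookup can be pictured as turning each movie name into a search request. The sketch below only builds the search URLs; the `https://www.imdb.com/find` pattern with `q` and `s=tt` parameters is an assumption about IMDB's search page, and the spider in this project may construct its requests differently.

```python
from urllib.parse import urlencode


def imdb_search_url(title):
    """Build a title-search URL for one movie name (URL pattern is assumed)."""
    return "https://www.imdb.com/find?" + urlencode({"q": title, "s": "tt"})


def build_search_requests(titles):
    """Pair each title with its search URL, preserving input order."""
    return [(title, imdb_search_url(title)) for title in titles]
```

Each pair can then be handed to a spider, which follows the URL and records the first matching movie link.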



Scrape 5000+ IMDB movie information

This step will load the JSON file 'fetch_imdb_url.json', go into each movie page, and grab data

This step will generate a JSON file 'imdb_output.json' (20MB) containing detailed info on 5000+ movies

It will also download all available posters for all movies.

A total of 4907 posters can be downloaded (998MB). I was not sure whether I could upload all of those posters to GitHub, so I only uploaded a few. You can see from my code how to use Scrapy to grab them all.

$ scrapy crawl imdb -o imdb_output.json



Perform face detection to count the number of faces in each poster

This step will save the results to the JSON file 'image_and_facenumber_pair_list.json'

$ python
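As a rough illustration of this step, the sketch below counts faces with OpenCV's bundled frontal-face Haar cascade and writes image/count pairs to JSON. The choice of OpenCV and of this cascade is an assumption; the project's own script may use a different detector. The function names are hypothetical.

```python
import json


def count_faces(image_path):
    """Count frontal faces in one image using OpenCV's Haar cascade (assumed detector)."""
    import cv2  # requires the opencv-python package

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    image = cv2.imread(image_path)
    if image is None:  # unreadable or missing file
        return 0
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces)


def build_pairs(image_paths, counter=count_faces):
    """Return [[image_path, face_count], ...] for every poster."""
    return [[path, counter(path)] for path in image_paths]


def save_pairs(pairs, out_path="image_and_facenumber_pair_list.json"):
    """Write the pair list as a JSON array."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(pairs, f)
```

Passing the detector in as `counter` keeps the pairing logic independent of OpenCV, so a different detector can be swapped in without touching the rest.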



Load two JSON files 'imdb_output.json' and 'image_and_facenumber_pair_list.json'

Parse all variables into a valid format.

Generate a final CSV table containing 28 variables that can be loaded in R or Pandas

The output will be a CSV file 'movie_metadata.csv' (1.5MB)

"movie_facebook_likes" "duration"
"actor_3_name" "actor_3_facebook_likes"
"actor_1_name" "actor_1_facebook_likes"
"cast_total_facebook_likes" "facenumber_in_poster"

$ python
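The merge can be sketched as a join between the movie records and the face counts, followed by a CSV write. Only four of the 28 columns are shown. The keys `movie_title` and `poster_file` are hypothetical; `duration`, `movie_facebook_likes`, and `facenumber_in_poster` are column names from the list above.

```python
import csv

# A small subset of the 28 columns, for illustration only.
COLUMNS = ["movie_title", "duration", "movie_facebook_likes", "facenumber_in_poster"]


def merge_to_rows(movies, face_pairs):
    """Attach each poster's face count to its movie record.

    movies: list of dicts loaded from 'imdb_output.json'
    face_pairs: list of [image_path, face_count] pairs
    """
    faces_by_image = {image: count for image, count in face_pairs}
    rows = []
    for movie in movies:
        row = {col: movie.get(col) for col in COLUMNS}
        # "poster_file" is an assumed key linking a movie to its poster image.
        row["facenumber_in_poster"] = faces_by_image.get(movie.get("poster_file"))
        rows.append(row)
    return rows


def write_csv(rows, out_path="movie_metadata.csv"):
    """Write the merged rows with a header line; missing values become empty cells."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```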



Load the 'movie_metadata.csv' file in RStudio, and perform EDA and LASSO regression

$ > run RStudio

$ > load the file 'movie_rating_prediction.R'
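For readers unfamiliar with LASSO, the idea can be illustrated outside R. The sketch below is a pure-Python coordinate-descent solver for the objective (1/(2n))·||y − Xβ||² + λ·||β||₁, assuming the features and response are already centered (no intercept). It is not the code in 'movie_rating_prediction.R', and the function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Fit LASSO coefficients by cycling through one coordinate at a time.

    X: list of rows (each a list of feature values), y: list of targets,
    lam: L1 penalty strength. Assumes centered data (no intercept term).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n) if z else 0.0
    return beta
```

The soft-thresholding step is what sets weak predictors exactly to zero, which is why LASSO doubles as a variable-selection method on a table with many candidate variables.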