- Introduction
- Tools Used
- Problem Statement
- The Dataset
- Data Exploration, Visualisation and Analysis
- The Model
- Evaluating the Model
- Conclusion
In this mini project, I built a Linear Regression model to study the linear relationship between a movie's production budget and its worldwide revenue. I used the Python programming language with the relevant modules to analyse, clean and visualise the data.
- JupyterLab - A web-based interactive development environment for notebooks, data and code.
- pandas - Used to manipulate and clean the data.
- matplotlib - Used to create basic plots in Python.
- scikit-learn - Provides simple tools for effective and efficient data analysis applications.
- requests - Used to fetch the page content (HTML) from the URL.
- BeautifulSoup - Provides simple tools for parsing HTML and extracting data from it.
As stated earlier, the question is:
"Can a reasonable relationship be established between production budget and movie revenue?"
I will investigate the relationship between movie revenue and production budget, and use the goodness of fit score (the coefficient of determination, R²) to judge how well a linear model captures it.
The dataset was scraped from "The Numbers" at https://www.the-numbers.com/movie/budgets/all
The dataset contains the budget and worldwide revenue for 6370 movies. First, I used the requests module to obtain the page's HTML. I then used BeautifulSoup to parse the retrieved HTML, and created a CSV file from it using pandas.
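The scraping steps above can be sketched roughly as follows. This is a minimal illustration rather than the code used in the project: the function names, the `User-Agent` header, the output file name and the assumption that the data sits in the first `<table>` on the page are all mine.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://www.the-numbers.com/movie/budgets/all"


def fetch_html(url: str) -> str:
    """Download a page and return its HTML as text."""
    # Assumption: a browser-like User-Agent is sent in case the default is rejected.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    return response.text


def parse_table(html: str) -> pd.DataFrame:
    """Turn the first HTML table into a DataFrame (first row used as header)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> found in the HTML")
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    if not rows:
        raise ValueError("the table has no rows")
    header, *body = rows
    return pd.DataFrame(body, columns=header)


def scrape_to_csv(url: str, path: str) -> None:
    """Fetch one page, parse its table and write it to a CSV file."""
    parse_table(fetch_html(url)).to_csv(path, index=False)
```

Calling `scrape_to_csv(URL, "movies.csv")` would write one page of results to a hypothetical `movies.csv`. If the listing is split across several pages, the same functions would need to be called once per page and the results concatenated.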
I used pandas to read the CSV file into a DataFrame with the standard read_csv() function. While exploring the data, I discovered many null or zero values for movies that had either not been released yet (as of the time of this project, 4th March 2023) or had incomplete data. I dropped all such rows, reducing the dataset to 5954 movies.
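A cleaning step of this kind might look like the sketch below. The column names and the three sample rows are made up for illustration, and the currency formatting (`$` and thousands separators) is an assumption about how the scraped values were stored.

```python
import io

import pandas as pd

# Tiny stand-in for the scraped CSV; the column names are assumptions.
SAMPLE_CSV = io.StringIO(
    'Movie,Production Budget,Worldwide Gross\n'
    'Film A,"$1,000,000","$3,500,000"\n'
    'Film B,"$2,000,000",$0\n'
    'Film C,,"$900,000"\n'
)


def to_number(series: pd.Series) -> pd.Series:
    """Convert strings such as "$1,000,000" to numbers (NaN if unparseable)."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


def clean(df: pd.DataFrame, budget_col: str, revenue_col: str) -> pd.DataFrame:
    """Drop rows whose budget or revenue is missing or zero."""
    out = df.copy()
    out[budget_col] = to_number(out[budget_col])
    out[revenue_col] = to_number(out[revenue_col])
    out = out.dropna(subset=[budget_col, revenue_col])
    out = out[(out[budget_col] > 0) & (out[revenue_col] > 0)]
    return out.reset_index(drop=True)


movies = clean(pd.read_csv(SAMPLE_CSV), "Production Budget", "Worldwide Gross")
print(movies)
```

Here only `Film A` survives: `Film B` has zero revenue and `Film C` has no budget.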
I then used matplotlib to draw a scatter plot of the dependent variable (movie revenue) against the independent variable (movie budget). This plot is shown below:
Simple linear regression is one of the simplest machine learning algorithms. In general, linear regression is used to find the linear relationship between a target and one or more predictors (features).
Simple linear regression is the special case with one target and one feature. It models a statistical relationship that approximately predicts the dependent variable (revenue) from the independent variable (budget).
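In symbols, writing the budget as $x$ and the revenue as $y$ (the $\theta$ notation is my own choice, not something fixed by the project), the model predicts

$$\hat{y} = \theta_0 + \theta_1 x$$

where $\theta_0$ is the intercept and $\theta_1$ is the slope. Ordinary least squares, which is what scikit-learn's `LinearRegression` fits, picks the two values that minimise the sum of squared residuals over the $n$ movies:

$$\min_{\theta_0,\,\theta_1} \sum_{i=1}^{n} \left( y_i - \theta_0 - \theta_1 x_i \right)^2$$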
I used the scikit-learn module to fit a linear regression model to the data, generating a line of best fit. The fitted slope was 3.19688475, meaning that each additional unit of budget is associated with roughly 3.2 additional units of worldwide revenue. I then drew this line of best fit on the scatter plot, again using matplotlib.
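The fitting and plotting steps can be sketched as below. The data here is synthetic (generated with a true slope of 3 plus noise), not the real dataset, so the numbers it produces will not match the slope reported above. The variable names are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the real dataset): revenue = 3 * budget + noise.
rng = np.random.default_rng(0)
budget = rng.uniform(1e6, 2e8, size=200)
revenue = 3.0 * budget + rng.normal(0, 5e7, size=200)

# scikit-learn expects a 2-D feature matrix, so reshape the single feature.
X = budget.reshape(-1, 1)
y = revenue

model = LinearRegression().fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.3f}, intercept={intercept:.0f}")

# Scatter plot of the data with the fitted line drawn between the two extremes.
fig, ax = plt.subplots()
ax.scatter(budget, revenue, alpha=0.3, label="movies")
line_x = np.array([[budget.min()], [budget.max()]])
ax.plot(line_x.ravel(), model.predict(line_x), color="red", label="line of best fit")
ax.set_xlabel("Production budget")
ax.set_ylabel("Worldwide revenue")
ax.legend()
```

In a notebook the figure renders automatically; in a script, `plt.show()` or `fig.savefig(...)` would be needed to see it.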
I also used scikit-learn to obtain a goodness of fit score (R²) of 0.5465. In other words, the linear regression model explains about 55% of the variance in revenue using budget alone. It does not mean that 55% of individual revenue values are predicted accurately.
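To make the meaning of the score concrete, the sketch below computes R² by hand as 1 − SS_res / SS_tot and compares it with the value returned by `LinearRegression.score`. The five data points are invented for illustration, and I am assuming the score in this project was obtained with `score` on the same data the model was fitted to.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small made-up dataset, only to illustrate the calculation.
budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
revenue = np.array([2.0, 7.0, 8.0, 14.0, 13.0])

model = LinearRegression().fit(budget, revenue)
predicted = model.predict(budget)

# Residual sum of squares: variation the model fails to explain.
ss_res = np.sum((revenue - predicted) ** 2)
# Total sum of squares: variation of revenue around its mean.
ss_tot = np.sum((revenue - revenue.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = model.score(budget, revenue)
print(r2_manual, r2_sklearn)
```

The two values agree, which shows that the score is a share of explained variance rather than a share of correctly predicted movies.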
Based on the goodness of fit score, the model is not good enough to deploy on new data. Budget alone cannot accurately predict revenue. We would need to consider other factors, such as the release date, the production studio and whether or not the movie is a sequel. This would help to give a more accurate relationship and prediction of movie revenue.