`Predicting Movie Rental Durations`

`October 2025`

This project focuses on predicting the duration of movie rentals using regression models. The goal is to help a movie rental company optimize its inventory planning by accurately forecasting how many days a customer will rent a movie based on various features.

`Any questions, please reach out!`

Chiawei Wang, PhD\
Data & Product Analyst\
<chiawei.w@outlook.com>

`*` Note that the table of contents and other links may not work directly on GitHub.

[Table of Contents](#table-of-contents)
1. [Executive Summary](#executive-summary)
   - [Challenge](#challenge)
   - [Objectives](#objectives)
   - [Data Overview](#data-overview)
   - [Approach](#approach)
   - [Results](#results)
   - [Conclusion](#conclusion)
2. [Exploratory Data Analysis](#exploratory-data-analysis)

# Executive Summary

## Challenge

With the rise of streaming services, movie rental companies are facing increasing competition. To stay competitive, a movie rental company wants to optimize its inventory planning by predicting how many days a customer will rent a movie based on various features. The goal is to develop a regression model that can accurately predict rental duration, helping the company manage its stock more efficiently.

## Objectives

- Develop a regression model to predict DVD rental duration
- Achieve a Mean Squared Error (MSE) of 3 or less on the test set
- Provide actionable insights to improve inventory management
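
For reference, MSE is the mean of the squared differences between observed and predicted rental days, so the target of 3 is expressed in squared days. The following is a minimal pure-Python sketch of that calculation; the function name `mean_squared_error_days` and the toy values are illustrative assumptions, not project data.

```python
def mean_squared_error_days(actual, predicted):
    """Mean of squared differences between actual and predicted rental days."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Toy values for illustration only (not taken from the rental data).
actual_days = [3, 5, 7, 2]
predicted_days = [4, 5, 5, 2]

mse = mean_squared_error_days(actual_days, predicted_days)
meets_target = mse <= 3  # the objective: an MSE of 3 or less
print(mse, meets_target)
```

Squaring penalises large misses more than small ones, which is why a single prediction that is off by several days can dominate the score.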

## Data Overview

| Index | Column               | Type        | Description                                                                       |
| ----- | -------------------- | ----------- | --------------------------------------------------------------------------------- |
| 0     | `rental_date`        | object      | The date (and time) the customer rents the movie                                  |
| 1     | `return_date`        | object      | The date (and time) the customer returns the movie                                |
| 2     | `amount`             | float64     | The amount paid by the customer for renting the movie                             |
| 3     | `release_year`       | float64     | The year the movie being rented was released                                      |
| 4     | `rental_rate`        | float64     | The rate at which the movie is rented                                             |
| 5     | `length`             | float64     | Length of the movie being rented, in minutes                                      |
| 6     | `replacement_cost`   | float64     | The amount it will cost the company to replace the movie                          |
| 7     | `special_features`   | object      | Special features included with the movie, for example trailers or deleted scenes  |
| 8     | `NC-17`              | int64       | Dummy variable for movie rating NC-17                                             |
| 9     | `PG`                 | int64       | Dummy variable for movie rating PG                                                |
| 10    | `PG-13`              | int64       | Dummy variable for movie rating PG-13                                             |
| 11    | `R`                  | int64       | Dummy variable for movie rating R                                                 |
| 12    | `amount_2`           | float64     | The square of amount                                                              |
| 13    | `length_2`           | float64     | The square of length                                                              |
| 14    | `rental_rate_2`      | float64     | The square of rental_rate                                                         |
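
The three `_2` columns are plain squares of their base columns. As a quick sanity check, the sketch below compares them using values copied from the first row of the data preview shown later in this README; the helper name `squares_match` and the tolerance are assumptions made for illustration.

```python
import math

# Values copied from the first row of the data preview in this README.
row = {
    "amount": 2.99, "amount_2": 8.9401,
    "length": 126.0, "length_2": 15876.0,
    "rental_rate": 2.99, "rental_rate_2": 8.9401,
}


def squares_match(record, base_columns, tol=1e-6):
    """Return True if every `<col>_2` value equals `<col>` squared within tol."""
    return all(
        math.isclose(record[col] ** 2, record[f"{col}_2"], abs_tol=tol)
        for col in base_columns
    )


print(squares_match(row, ["amount", "length", "rental_rate"]))
```

A tolerance is used rather than exact equality because squaring a float such as 2.99 does not reproduce the stored decimal exactly.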

## Approach

1. Getting the number of rental days
2. Adding dummy variables using the special features column
3. Executing a train-test split
4. Performing feature selection
5. Choosing models and performing hyperparameter tuning
6. Predicting values on test set
7. Computing mean squared error
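
Steps 1 and 2 can be sketched on a tiny frame before running them on the full dataset. The two date pairs below are copied from the data preview later in this README; the second `special_features` value and the name `toy` are hypothetical, added only so that both dummy columns take both values.

```python
import pandas as pd

toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00", "2005-07-10 04:27:45+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00", "2005-07-17 10:11:45+00:00"],
    # The second value is a hypothetical example, not a row from the data.
    "special_features": ['{Trailers,"Behind the Scenes"}', '{"Deleted Scenes"}'],
})

# Step 1: rental length in whole days (partial days are dropped by .dt.days)
elapsed = pd.to_datetime(toy["return_date"]) - pd.to_datetime(toy["rental_date"])
toy["rental_length_days"] = elapsed.dt.days

# Step 2: 0/1 dummies based on substrings of the special_features text
toy["deleted_scenes"] = toy["special_features"].str.contains("Deleted Scenes").astype(int)
toy["behind_the_scenes"] = toy["special_features"].str.contains("Behind the Scenes").astype(int)

print(toy[["rental_length_days", "deleted_scenes", "behind_the_scenes"]])
```

Note that `.dt.days` keeps only the whole-day component, so a rental lasting 3 days and 20 hours counts as 3 days.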

## Results

- Best model: Random Forest Regressor
- MSE = 2.22 on the test set

## Conclusion

A Random Forest Regressor was trained to predict movie rental durations and reached an MSE of 2.22 on the test set, which meets the company's requirement of an MSE of 3 or less. The model can support inventory planning by forecasting how many days a rental is likely to last from features such as the amount paid, the rental rate, and the film's length.
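
Because MSE is measured in squared days, its square root (RMSE) gives a typical error size in days, which is easier to relate to inventory planning. The sketch below does that conversion for the target of 3 and for the reported test MSE rounded to one decimal; the helper name `rmse_from_mse` is an assumption for illustration.

```python
import math


def rmse_from_mse(mse):
    """Convert an MSE in squared days to an RMSE in days."""
    return math.sqrt(mse)


target_rmse = rmse_from_mse(3.0)    # the MSE ceiling from the objectives
reported_rmse = rmse_from_mse(2.2)  # reported test MSE, rounded to one decimal

print(f"target: {target_rmse:.2f} days, reported: {reported_rmse:.2f} days")
```

Read this way, the model's predictions are typically off by roughly a day and a half, against a ceiling of about 1.7 days implied by the target.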

# Exploratory Data Analysis

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [2]:
# Read in the CSV as a DataFrame
df = pd.read_csv('rental.csv')

# Preview the data
print(df.shape)
df.head()

(15861, 15)


|     | rental_date | return_date | amount | release_year | rental_rate | length | replacement_cost | special_features | NC-17 | PG | PG-13 | R | amount_2 | length_2 | rental_rate_2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2005-05-25 02:54:33+00:00 | 2005-05-28 23:40:33+00:00 | 2.99 | 2005.0 | 2.99 | 126.0 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 | 8.9401 | 15876.0 | 8.9401 |
| 1 | 2005-06-15 23:19:16+00:00 | 2005-06-18 19:24:16+00:00 | 2.99 | 2005.0 | 2.99 | 126.0 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 | 8.9401 | 15876.0 | 8.9401 |
| 2 | 2005-07-10 04:27:45+00:00 | 2005-07-17 10:11:45+00:00 | 2.99 | 2005.0 | 2.99 | 126.0 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 | 8.9401 | 15876.0 | 8.9401 |
| 3 | 2005-07-31 12:06:41+00:00 | 2005-08-02 14:30:41+00:00 | 2.99 | 2005.0 | 2.99 | 126.0 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 | 8.9401 | 15876.0 | 8.9401 |
| 4 | 2005-08-19 12:30:04+00:00 | 2005-08-23 13:35:04+00:00 | 2.99 | 2005.0 | 2.99 | 126.0 | 16.99 | {Trailers,"Behind the Scenes"} | 0 | 0 | 0 | 1 | 8.9401 | 15876.0 | 8.9401 |


In [3]:
# Add information on rental duration
df['rental_length'] = pd.to_datetime(df['return_date']) - pd.to_datetime(df['rental_date'])
df['rental_length_days'] = df['rental_length'].dt.days

### Add dummy variables
# Add dummy for deleted scenes
df['deleted_scenes'] = np.where(df['special_features'].str.contains('Deleted Scenes'), 1, 0)
# Add dummy for behind the scenes
df['behind_the_scenes'] = np.where(df['special_features'].str.contains('Behind the Scenes'), 1, 0)

# Choose columns to drop
cols_to_drop = ['special_features', 'rental_length', 'rental_length_days', 'rental_date', 'return_date']

# Split into feature and target sets
X = df.drop(cols_to_drop, axis=1)
y = df['rental_length_days']

# Further split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the Lasso model
lasso = Lasso(alpha=0.3, random_state=42)

# Train the model and access the coefficients
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_

# Perform feature selection by keeping columns with positive Lasso coefficients
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]

# Fit an OLS model on the Lasso-selected features
ols = LinearRegression()
ols = ols.fit(X_lasso_train, y_train)
y_test_pred = ols.predict(X_lasso_test)
mse_lin_reg_lasso = mean_squared_error(y_test, y_test_pred)

# Random forest hyperparameter space
param_dist = {'n_estimators': np.arange(1, 101, 1), 'max_depth': np.arange(1, 11, 1)}

# Create a random forest regressor (unseeded, so the search can vary slightly between runs)
rf = RandomForestRegressor()

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf, param_distributions=param_dist, cv=5, random_state=42)

# Fit the random search object to the training data
rand_search.fit(X_train, y_train)

# Store the best hyperparameters
hyper_params = rand_search.best_params_

# Refit the random forest with the chosen hyperparameters
rf = RandomForestRegressor(n_estimators=hyper_params['n_estimators'], max_depth=hyper_params['max_depth'], random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
mse_random_forest = mean_squared_error(y_test, rf_pred)

# Random forest gives lowest MSE:
best_model = rf
best_mse = mse_random_forest
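
The feature-selection step above keeps only columns whose Lasso coefficient is strictly positive, which also discards any feature with a negative coefficient. A common alternative, shown here as a sketch and not as the method used in this project, is to keep every feature the Lasso did not shrink to zero. The coefficient values and the feature names `f0` to `f4` are hypothetical.

```python
import numpy as np

# Hypothetical coefficients for illustration; not output from the fitted model.
coef = np.array([0.58, 0.0, -0.12, 0.0, 0.03])
features = np.array(["f0", "f1", "f2", "f3", "f4"])

positive_mask = coef > 0   # rule used in the cell above
nonzero_mask = coef != 0   # alternative: keep anything not shrunk to zero

kept_positive = features[positive_mask].tolist()
kept_nonzero = features[nonzero_mask].tolist()

print(kept_positive)
print(kept_nonzero)
```

The two rules only differ when some coefficients are negative, so comparing the two masks on the fitted `lasso_coef` is a cheap way to see whether the choice matters for this data.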

In [4]:
# Displaying the best model and its MSE
print(f'Best model: {type(best_model).__name__}')
print(f'MSE = {best_mse:.2f}')

Best model: RandomForestRegressor
MSE = 2.22
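
The notebook assigns `best_model = rf` by hand. A small helper can make that choice from the scores instead, which avoids a stale assignment if the models or data change. This is a sketch under assumptions: `pick_best` is a hypothetical helper, and the scores below are placeholders; in the notebook the dictionary would hold `mse_lin_reg_lasso` and `mse_random_forest`.

```python
def pick_best(mse_by_model):
    """Return the (name, mse) pair with the lowest MSE."""
    if not mse_by_model:
        raise ValueError("no models to compare")
    best_name = min(mse_by_model, key=mse_by_model.get)
    return best_name, mse_by_model[best_name]


# Placeholder scores for illustration only; not results from this project.
example_scores = {"model_a": 5.0, "model_b": 1.0}

best_name, best_score = pick_best(example_scores)
print(f"Best model: {best_name}")
print(f"MSE = {best_score:.2f}")
```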
