<a href="https://colab.research.google.com/github/vicaaa12/simple-regression-Predict-Movie-Rental-Duration/blob/main/Predict_Movie_Rental_Durations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


A DVD rental company needs your help! They want to figure out how many days a customer will rent a DVD for based on some features and has approached you for help. They want you to try out some regression models which will help predict the number of days a customer will rent a DVD for. The company wants a model which yeilds a MSE of 3 or less on a test set. The model you make will help the company become more efficient inventory planning.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented for.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Lenght of the movie being rented, in minuites.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables of the rating of the movie. It takes the value 1 if the move is rated as the column name and 0 otherwise. For your convinience, the reference dummy has already been dropped.

## Importing necessary libraries

In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to split the data into train and test
from sklearn.model_selection import train_test_split

# to build regression_model
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
# to tune the model
from sklearn.model_selection import RandomizedSearchCV

# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
path = '/content/drive/MyDrive/python/datacampprojects/moviedurationprediction/rental_info.csv'

In [4]:
df = pd.read_csv(path)

In [5]:
df.shape

(15861, 15)

There are 15861 rows and 15 columns

In [6]:
df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB


Rental date, return date, special features have object data types.
dataset doesn't have null values
Need to convert rental and return date to DateTime format

In [8]:
df['return_date'] = pd.to_datetime(df['return_date'])
df['rental_date'] = pd.to_datetime(df['rental_date'])

In [9]:
for column_name in ['rental_date', 'return_date']:
  dtype = df[column_name].dtype
  print(f"Column '{column_name}' has data type: {dtype}")

Column 'rental_date' has data type: datetime64[ns, UTC]
Column 'return_date' has data type: datetime64[ns, UTC]


In [10]:
#Calculating rental period
df['rental_length_days'] = (df['return_date'] - df['rental_date']).dt.days

In [11]:
df.duplicated().sum()

0

There is 0 duplicated

In [12]:
#rechecking for missing values
round(df.isnull().sum() / df.isnull().count() * 100, 2)

rental_date           0.0
return_date           0.0
amount                0.0
release_year          0.0
rental_rate           0.0
length                0.0
replacement_cost      0.0
special_features      0.0
NC-17                 0.0
PG                    0.0
PG-13                 0.0
R                     0.0
amount_2              0.0
length_2              0.0
rental_rate_2         0.0
rental_length_days    0.0
dtype: float64

In [13]:
df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4


In [14]:
numerical = ['amount', 'rental_rate', 'length', 'replacement_cost', 'amount_2', 'length_2', 'rental_rate_2', 'rental_length_days']

In [15]:
# checking statistic summary of the numarical columns in the data
df[numerical].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
amount,15861.0,4.217161,2.360383,0.99,2.99,3.99,4.99,11.99
rental_rate,15861.0,2.944101,1.649766,0.99,0.99,2.99,4.99,4.99
length,15861.0,114.994578,40.114715,46.0,81.0,114.0,148.0,185.0
replacement_cost,15861.0,20.224727,6.083784,9.99,14.99,20.99,25.99,29.99
amount_2,15861.0,23.355504,23.503164,0.9801,8.9401,15.9201,24.9001,143.7601
length_2,15861.0,14832.841876,9393.431996,2116.0,6561.0,12996.0,21904.0,34225.0
rental_rate_2,15861.0,11.389287,10.005293,0.9801,0.9801,8.9401,24.9001,24.9001
rental_length_days,15861.0,4.525944,2.635108,0.0,2.0,5.0,7.0,9.0


In [16]:
df.corr()

  df.corr()


Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days
amount,1.0,0.021726,0.68587,0.018947,-0.026725,0.003968,-0.010591,0.012773,-0.007029,0.956141,0.017864,0.678597,0.551593
release_year,0.021726,1.0,0.037304,0.031088,0.069991,0.027187,-0.022237,0.027442,-0.052645,0.015941,0.03064,0.025106,0.007044
rental_rate,0.68587,0.037304,1.0,0.055224,-0.064787,0.03628,0.000174,0.022812,-0.033648,0.587627,0.05339,0.982489,-0.00106
length,0.018947,0.031088,0.055224,1.0,0.026976,-0.030133,-0.049304,0.057023,0.068685,0.015765,0.987667,0.051516,-0.004547
replacement_cost,-0.026725,0.069991,-0.064787,0.026976,1.0,-0.001685,-0.077158,0.044224,0.017768,-0.018281,0.029747,-0.065835,0.015684
NC-17,0.003968,0.027187,0.03628,-0.030133,-0.001685,1.0,-0.254017,-0.272206,-0.252767,0.001186,-0.029444,0.038815,0.000783
PG,-0.010591,-0.022237,0.000174,-0.049304,-0.077158,-0.254017,1.0,-0.268408,-0.24924,-0.012859,-0.053299,-0.00142,-0.008066
PG-13,0.012773,0.027442,0.022812,0.057023,0.044224,-0.272206,-0.268408,1.0,-0.267087,0.008954,0.062629,0.022525,0.010201
R,-0.007029,-0.052645,-0.033648,0.068685,0.017768,-0.252767,-0.24924,-0.267087,1.0,-0.004797,0.05931,-0.033067,-0.007961
amount_2,0.956141,0.015941,0.587627,0.015765,-0.018281,0.001186,-0.012859,0.008954,-0.004797,1.0,0.014662,0.596622,0.549412


Provided data use the same feature, just squared it.
For model selection features can be used only one attribute
rental rate and rental rate 2, amount and amount 2, lenghth and lenghth 2  are identical features. Thet are also highly correlated, it's important to choose only one of them to avoid multicollinearity issues in your model. Multicollinearity occurs when independent variables in a regression model are highly correlated with each other, which can lead to unstable estimates and difficulties in interpreting the model

In [17]:
df['special_features'].value_counts()

{Trailers,Commentaries,"Behind the Scenes"}                     1308
{Trailers}                                                      1139
{Trailers,Commentaries}                                         1129
{Trailers,"Behind the Scenes"}                                  1122
{"Behind the Scenes"}                                           1108
{Commentaries,"Deleted Scenes","Behind the Scenes"}             1101
{Commentaries}                                                  1089
{Commentaries,"Behind the Scenes"}                              1078
{Trailers,"Deleted Scenes"}                                     1047
{"Deleted Scenes","Behind the Scenes"}                          1035
{"Deleted Scenes"}                                              1023
{Commentaries,"Deleted Scenes"}                                 1011
{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}     983
{Trailers,Commentaries,"Deleted Scenes"}                         916
{Trailers,"Deleted Scenes","Behind

Creating new columns behind the scenes and deleted scenes. I can then use these columns as features for machine learning model without the need for dummy variables as they encoded.

In [18]:
df['behind_the_scenes'] = df['special_features'].apply(lambda x: 1 if "Behind the Scene" in x else 0 )
df['deleted_scenes'] = df['special_features'].apply(lambda x: 1 if 'Deleted Scenes' in x else 0)

In [19]:
df['behind_the_scenes'].value_counts()

1    8507
0    7354
Name: behind_the_scenes, dtype: int64

In [20]:
df['deleted_scenes'].value_counts()

0    7973
1    7888
Name: deleted_scenes, dtype: int64

In [21]:
X = df.drop(['rental_length_days', 'special_features', 'rental_date', 'return_date', 'amount_2', 'length_2', 'rental_rate_2'], axis =1)

In [22]:
y=df['rental_length_days']

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

In [24]:

model = Lasso(alpha=0.3, random_state=9)
model.fit(X_train, y_train)
lasso_coef = model.coef_
print("Coefficients:", lasso_coef)

Coefficients: [ 9.62167821e-01  0.00000000e+00 -8.41179857e-01  4.94571646e-04
 -0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -0.00000000e+00  0.00000000e+00 -0.00000000e+00]


In [25]:
coefficients_df = pd.DataFrame({'Column Name': X_train.columns, 'Coefficient': lasso_coef})

# Display the coefficients DataFrame
print(coefficients_df)

          Column Name  Coefficient
0              amount     0.962168
1        release_year     0.000000
2         rental_rate    -0.841180
3              length     0.000495
4    replacement_cost    -0.000000
5               NC-17     0.000000
6                  PG     0.000000
7               PG-13     0.000000
8                   R    -0.000000
9   behind_the_scenes     0.000000
10     deleted_scenes    -0.000000


The Lasso (Least Absolute Shrinkage and Selection Operator) is a linear regression model regularized with an L1 penalty term. It is used for variable selection and regularization to prevent overfitting in high-dimensional datasets.

The Lasso model includes a hyperparameter called alpha (α), which controls the strength of regularization. Increasing alpha leads to more regularization, which in turn leads to fewer features being selected by the mode

In [26]:
# Perform feature selectino by choosing columns with positive coefficients
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]

In [27]:
model1 = LinearRegression()
model1.fit(X_lasso_train, y_train)

In [28]:
r_squared_1 = model1.score(X_lasso_test, y_test)
print("R-squared score:", r_squared_1)

R-squared score: 0.31794186050284556


In [29]:
y_pred1 = model1.predict(X_lasso_test)
mse1= mean_squared_error(y_test, y_pred1)
print(mse1)

4.842319865123174


In [30]:
model2 = LinearRegression()

In [31]:
model2.fit(X_train, y_train)

In [32]:
r_squared_2 = model2.score(X_test, y_test)
print(r_squared_2)

0.5753951653256832


In [33]:
y_pred2 = model2.predict(X_test)
mse2= mean_squared_error(y_test, y_pred2)
print(mse2)

3.0145119700303256


MAE and r2 shows better result

In [34]:
model3 = DecisionTreeRegressor()
model3.fit(X_train, y_train)
r_squared_3 = model3.score(X_test, y_test)
print('r2', r_squared_3)
y_pred3 = model3.predict(X_test)
mse3 = mean_squared_error(y_test, y_pred3)
print('mse', mse3)

r2 0.696193005367677
mse 2.1568991850988746


In [35]:
# Get feature importances
feature_importances = model3.feature_importances_

In [36]:
importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})

# Sort the DataFrame by importances in ascending order
importance_df_sorted = importance_df.sort_values(by='Importance', ascending=False)

# Print the sorted DataFrame
print(importance_df_sorted)

              Feature  Importance
0              amount    0.649066
2         rental_rate    0.157775
3              length    0.074106
4    replacement_cost    0.044075
1        release_year    0.030408
9   behind_the_scenes    0.009803
10     deleted_scenes    0.009211
5               NC-17    0.007431
7               PG-13    0.007393
8                   R    0.005379
6                  PG    0.005353


In [37]:
model4 = RandomForestRegressor()
model4.fit(X_train, y_train)
r_squared_4 = model4.score(X_test, y_test)
print('r2', r_squared_3)
y_pred4 = model4.predict(X_test)
mse4 = mean_squared_error(y_test, y_pred4)
print('mse4', mse4)

r2 0.696193005367677
mse4 2.030966112160573


In [38]:
# Get feature importances
feature_importances_2 = model4.feature_importances_

In [39]:
importance_df_2 = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances_2})

# Sort the DataFrame by importances in ascending order
importance_df_sorted = importance_df_2.sort_values(by='Importance', ascending=False)

# Print the sorted DataFrame
print(importance_df_sorted)

              Feature  Importance
0              amount    0.608354
2         rental_rate    0.176139
3              length    0.081662
4    replacement_cost    0.053731
1        release_year    0.032213
10     deleted_scenes    0.009727
9   behind_the_scenes    0.009546
5               NC-17    0.007592
7               PG-13    0.007521
8                   R    0.006893
6                  PG    0.006623


In [40]:
param_grid = {
    "n_estimators": [20,40,80, 85, 90, 95, 00, 150, 180],

}

In [41]:

rf =RandomForestRegressor()
random_search = RandomizedSearchCV(rf, param_distributions = param_grid, random_state=9, cv= 5)
search = random_search.fit(X_train, y_train)
best_params = search.best_params_
print(best_params)
tuned_model = RandomForestRegressor(**best_params)
tuned_model.fit(X_train, y_train)
y_pred5=tuned_model.predict(X_test)
mse5=mean_squared_error(y_test, y_pred5)
print(mse5)

5 fits failed out of a total of 45.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/ensemble/_forest.py", line 340, in fit
    self._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 600, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
    raise InvalidParameterError(
sklea

{'n_estimators': 180}
2.026670582184777


In [43]:
results =pd.DataFrame({'Model': ['Linear Regression Lasso', 'Linear Regression', 'Decision Tree', 'RandomForest', 'Tuned Random Forest'],
                       'MSE': [ mse1, mse2, mse3, mse4, mse5]})
best_mse = results['MSE'].min()
print(results)

                     Model       MSE
0  Linear Regression Lasso  4.842320
1        Linear Regression  3.014512
2            Decision Tree  2.156899
3             RandomForest  2.030966
4      Tuned Random Forest  2.026671


Let's create a pipeline

In [44]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline

In [45]:
features =['amount','rental_rate', 'length', 'replacement_cost', 'NC-17', 'PG', 'PG-13', 'R', 'deleted_scenes', 'behind_the_scenes']

In [46]:
numeric_features = ['amount','rental_rate', 'length', 'replacement_cost']

In [47]:
cat = ['NC-17', 'PG', 'PG-13', 'R', 'deleted_scenes', 'behind_the_scenes']

In [48]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(df[features], y, test_size=0.30, random_state=1)

In [49]:
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(),
    cat ))

In [50]:
# Define pipelines for different models
pipelines = {
    'Linear Regression': make_pipeline(preprocessor, LinearRegression()),
    'Decision Tree': make_pipeline(preprocessor, DecisionTreeRegressor()),
    'Random Forest': make_pipeline(preprocessor, RandomForestRegressor())
}

In [51]:
for model_name, pipeline in pipelines.items():
    pipeline.fit(X_train2, y_train2)

    # Make predictions on the test data
    predictions = pipeline.predict(X_test2)
    MSE = mean_squared_error(y_test2, predictions)
    MAE = mean_absolute_error(y_test2, predictions)
    r2 = r2_score(y_test2, predictions)
    # Now you can use the predictions as needed
    # For example, you can print the predictions
    print(f"Predictions for {model_name}: mse {MSE} mae {MAE} r2 {r2}")

Predictions for Linear Regression: mse 2.928500833575855 mae 1.4119645118360704 r2 0.5739665831441506
Predictions for Decision Tree: mse 2.2080038699910385 mae 1.1175851119748115 r2 0.6787832796978936
Predictions for Random Forest: mse 2.0529644755282304 mae 1.1049781006433976 r2 0.701338152215925


In [52]:
feature_importances_d = pipelines['Random Forest'].named_steps['randomforestregressor'].feature_importances_
feature_importances_d

array([0.59058545, 0.18799644, 0.09717098, 0.06255774, 0.00463765,
       0.00435496, 0.00393157, 0.00412348, 0.00474477, 0.00477826,
       0.00432985, 0.00444258, 0.00667539, 0.00647411, 0.00666896,
       0.00652781])

In [53]:
# Create a dictionary mapping feature importances to column names
feature_importance_map = dict(zip(X_test2.columns, feature_importances_d))
feature_importance_map

{'amount': 0.5905854528302588,
 'rental_rate': 0.18799644100189933,
 'length': 0.09717098467557746,
 'replacement_cost': 0.06255773956244727,
 'NC-17': 0.004637645863014577,
 'PG': 0.004354960817037128,
 'PG-13': 0.003931569724015854,
 'R': 0.004123484589958561,
 'deleted_scenes': 0.004744765127403834,
 'behind_the_scenes': 0.0047782642445148025}

The ensemble model, Random Forest, outperformed both Linear Regression and Decision Tree Regressor models. The Random Forest algorithm leverages bootstrap sampling with replacement and aggregates predictions using mean values, leading to improved predictive performance.

Given the non-linear nature of the data, the Linear Regression model performed poorly. Additionally, some features exhibited multicollinearity, such as 'amount' and 'amount squared', where one feature could be derived from another. To address this, one of the correlated features was dropped from the model to prevent redundancy and improve interpretability.

Fortunately, there were no missing values in the dataset, simplifying the modeling process. To efficiently tune the Random Forest model, Randomized Search was employed, allowing for rapid hyperparameter optimization.

After tuning, the Random Forest model identified and utilized the most important features for prediction. These features were instrumental in capturing the underlying patterns in the data, leading to enhanced predictive accuracy.
Another parameter of Random Forest can be tuned as min_samples_split, min_samples_leaf, max_features
