![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the rental duration in days. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
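
To make the target concrete, here is a minimal sketch of how the mean squared error (MSE) is computed and compared against the threshold of 3. The helper name `mean_squared_error_manual` and the rental lengths are made up for illustration; they are not part of the provided data.

```python
import numpy as np

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical rental lengths in days, invented for this example
actual_days = [3, 2, 7, 2, 4]
predicted_days = [4, 2, 5, 3, 4]

example_mse = mean_squared_error_manual(actual_days, predicted_days)
meets_target = example_mse <= 3  # the company's requirement
```

Because the errors are squared, one prediction that is off by two days costs as much as four predictions that are each off by one day.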

The data they provided is in the CSV file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features the DVD also has, for example trailers or deleted scenes.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the movie's rating. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.
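
As a rough illustration of what "the reference dummy has already been dropped" means, the sketch below builds rating dummies from a made-up `rating` column. The column name and the assumption that `"G"` is the reference category are hypothetical; the provided file already contains the finished dummy columns.

```python
import pandas as pd

# Hypothetical raw ratings; the real file ships the dummies already built
ratings = pd.DataFrame({"rating": ["G", "PG", "R", "NC-17", "PG-13", "G"]})

# drop_first=True removes the first category in sorted order ("G" here),
# which then acts as the reference level: a row of all zeros means "G"
rating_dummies = pd.get_dummies(ratings["rating"], drop_first=True, dtype=int)
```

Dropping one category avoids perfectly collinear columns, which matters for the linear models tried later.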

In [1]:
import pandas as pd
import numpy as np

In [2]:
rental = pd.read_csv("rental_info.csv")
rental.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [3]:
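# Parse the timestamp strings into datetime values
rental["rental_date"] = pd.to_datetime(rental["rental_date"])
rental["return_date"] = pd.to_datetime(rental["return_date"])

# Target: whole days between rental and return (.dt.days drops the partial day)
rental["rental_length_days"] = (rental["return_date"] - rental["rental_date"]).dt.days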
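
A small sketch of what `.dt.days` does to a single rental, using the timestamps from the first row shown above. The names `toy`, `whole_days`, and `fractional_days` are illustrative only.

```python
import pandas as pd

# Timestamps copied from the first row of rental.head()
toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00"],
})
toy["rental_date"] = pd.to_datetime(toy["rental_date"])
toy["return_date"] = pd.to_datetime(toy["return_date"])

duration = toy["return_date"] - toy["rental_date"]  # 3 days 20:46:00

# .dt.days keeps only the whole-day component of the timedelta
whole_days = duration.dt.days

# For comparison: the same duration as a fractional number of days
fractional_days = duration.dt.total_seconds() / 86400
```

The target used in this notebook is therefore an integer count of completed days, not a rounded or fractional duration.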

In [4]:
# Flag DVDs whose special features include deleted scenes or behind-the-scenes footage
rental["deleted_scenes"] = np.where(rental["special_features"].str.contains("Deleted Scenes"), 1, 0)
rental["behind_the_scenes"] = np.where(rental["special_features"].str.contains("Behind the Scenes"), 1, 0)
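
The substring flags can be checked on a few sample strings. In this sketch the first value is taken from the output above, while the other two are made-up examples in the same format; `regex=False` and `.astype(int)` are an alternative spelling of the same idea, not the notebook's exact code.

```python
import pandas as pd

features = pd.Series([
    '{Trailers,"Behind the Scenes"}',        # as seen in rental.head()
    '{"Deleted Scenes","Behind the Scenes"}',  # hypothetical
    '{Trailers}',                            # hypothetical
])

# regex=False treats the pattern as a literal substring
deleted_scenes = features.str.contains("Deleted Scenes", regex=False).astype(int)
behind_the_scenes = features.str.contains("Behind the Scenes", regex=False).astype(int)
```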

In [5]:
# Drop the target, the raw text column, and the dates (the dates would leak the target)
cols_to_drop = ["special_features", "rental_length_days", "rental_date", "return_date"]

X = rental.drop(cols_to_drop, axis=1)
y = rental["rental_length_days"]

In [6]:
rental.isna().any()

rental_date           False
return_date           False
amount                False
release_year          False
rental_rate           False
length                False
replacement_cost      False
special_features      False
NC-17                 False
PG                    False
PG-13                 False
R                     False
amount_2              False
length_2              False
rental_rate_2         False
rental_length_days    False
deleted_scenes        False
behind_the_scenes     False
dtype: bool

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)
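
A minimal sketch of what this split does, on synthetic arrays rather than the rental data: `test_size=0.2` holds out 20% of the rows, and a fixed `random_state` makes the split reproducible. The `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

# Repeating the call with the same random_state yields the same rows
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)
```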

In [8]:
X_train.head()

Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,deleted_scenes,behind_the_scenes
6682,2.99,2010.0,2.99,90.0,25.99,1,0,0,0,8.9401,8100.0,8.9401,0,1
8908,4.99,2008.0,0.99,53.0,25.99,1,0,0,0,24.9001,2809.0,0.9801,1,0
11827,6.99,2007.0,4.99,171.0,25.99,0,0,1,0,48.8601,29241.0,24.9001,0,1
6153,2.99,2010.0,2.99,73.0,29.99,0,0,0,1,8.9401,5329.0,8.9401,1,1
10713,5.99,2004.0,0.99,122.0,14.99,1,0,0,0,35.8801,14884.0,0.9801,1,0


In [9]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, BaggingRegressor, AdaBoostRegressor, GradientBoostingRegressor

lr = LinearRegression()
lasso = Lasso()
ridge = Ridge()
en = ElasticNet()
dt = DecisionTreeRegressor(max_depth=5)
br = BaggingRegressor(estimator=dt, n_estimators=300)
ada = AdaBoostRegressor(estimator=dt, n_estimators=300)
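# Note: gb sets subsample=0.8 and max_features=0.2, which makes it the stochastic variant; sgb uses the defaults
gb = GradientBoostingRegressor(n_estimators=300, subsample=0.8, max_features=0.2, max_depth=5)
sgb = GradientBoostingRegressor(n_estimators=300, max_depth=5)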
rf = RandomForestRegressor(n_estimators=300)
et = ExtraTreesRegressor(n_estimators=300)

regressors = [('linear_regression', lr),
             ('lasso_regression', lasso),
             ('ridge_regression', ridge),
             ('elasticnet_regression', en),
             ('decision_tree_regression', dt),
             ('bagging_regression', br),
             ('ada_boost_regression', ada),
             ('gradient_boosting_regression', gb),
             ('stochastic_gradient_boosting_regression', sgb),
             ('random_forest_regression', rf),
             ('extra_trees_regression', et)]

In [10]:
from sklearn.metrics import r2_score as r2, mean_squared_error as MSE
from sklearn.model_selection import cross_val_score

regressor_names = []
r2_scores = []
cv_scores = []
mse_scores = []

for reg_name, reg in regressors:
    # 10-fold CV on the full data; with no scoring argument, regressors are scored by R2
    cv_score = np.mean(cross_val_score(reg, X, y, cv=10))
    
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    r2_score = r2(y_test, y_pred)
    mse = MSE(y_test, y_pred)
    
    regressor_names.append(reg_name)
    r2_scores.append(r2_score)
    cv_scores.append(cv_score)
    mse_scores.append(mse)
    
    print(f"CV Score for {reg_name}: {cv_score}")
    print(f"R2 Score for {reg_name}: {r2_score}")    
    print(f"MSE Score for {reg_name}: {mse}\n")    

CV Score for linear_regression: 0.5838153291318162
R2 Score for linear_regression: 0.5856476313096645
MSE Score for linear_regression: 2.9417238646976434

CV Score for lasso_regression: 0.4644754083176464
R2 Score for lasso_regression: 0.46395512311674036
MSE Score for lasso_regression: 3.80568840926521

CV Score for ridge_regression: 0.5838180327426572
R2 Score for ridge_regression: 0.5856427463124059
MSE Score for ridge_regression: 2.941758546080207

CV Score for elasticnet_regression: 0.472125164658001
R2 Score for elasticnet_regression: 0.47089649871635486
MSE Score for elasticnet_regression: 3.7564076236388186

CV Score for decision_tree_regression: 0.6441034055933845
R2 Score for decision_tree_regression: 0.6449313942833753
MSE Score for decision_tree_regression: 2.520834608338213

CV Score for bagging_regression: 0.6474204403620847
R2 Score for bagging_regression: 0.6491290823817921
MSE Score for bagging_regression: 2.491032825631624

CV Score for ada_boost_regression: 0.6419636
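
Note that the "CV Score" printed above is a mean R2, because `cross_val_score` falls back to the estimator's own `score` method when no `scoring` argument is given. Since the company's requirement is stated in terms of MSE, a cross-validated MSE may be more directly comparable. The sketch below shows one way to obtain it on synthetic data; the `*_demo` names and the choice of `LinearRegression` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the rental data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# Default scoring for a regressor is R2
default_scores = cross_val_score(model, X_demo, y_demo, cv=5)
r2_scores_demo = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

# scikit-learn scorers follow "higher is better", so MSE is returned negated
neg_mse = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
cv_mse = -neg_mse.mean()
```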

In [11]:
performance_df = pd.DataFrame({'Algorithm':regressor_names,'R2':r2_scores, 'MSE': mse_scores, 
                               'CV Score':cv_scores}).sort_values('MSE',ascending=True)
performance_df

Unnamed: 0,Algorithm,R2,MSE,CV Score
8,stochastic_gradient_boosting_regression,0.71937,1.992351,0.60098
9,random_forest_regression,0.714287,2.028442,0.624234
10,extra_trees_regression,0.71326,2.035728,0.610901
7,gradient_boosting_regression,0.712117,2.043846,0.615679
5,bagging_regression,0.649129,2.491033,0.64742
4,decision_tree_regression,0.644931,2.520835,0.644103
6,ada_boost_regression,0.640623,2.551426,0.641964
0,linear_regression,0.585648,2.941724,0.583815
2,ridge_regression,0.585643,2.941759,0.583818
3,elasticnet_regression,0.470896,3.756408,0.472125
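
The five best models could also be selected from the table programmatically instead of by reading off the index by hand. This is a sketch only: `mse_table` is a hypothetical stand-in for `performance_df`, filled with the rounded test MSE values printed above, in the original order of the `regressors` list.

```python
import pandas as pd

# Test MSE per model, in the same order as the `regressors` list
mse_table = pd.DataFrame({
    "Algorithm": [
        "linear_regression", "lasso_regression", "ridge_regression",
        "elasticnet_regression", "decision_tree_regression", "bagging_regression",
        "ada_boost_regression", "gradient_boosting_regression",
        "stochastic_gradient_boosting_regression", "random_forest_regression",
        "extra_trees_regression",
    ],
    "MSE": [2.941724, 3.805688, 2.941759, 3.756408, 2.520835, 2.491033,
            2.551426, 2.043846, 1.992351, 2.028442, 2.035728],
})

# The row labels double as positions in `regressors`
top_indexes = mse_table.nsmallest(5, "MSE").index.tolist()
```

This works because the DataFrame keeps its original integer index after sorting, so each label still points at the matching entry in `regressors`.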


In [12]:
from sklearn.ensemble import VotingRegressor

# Positions in `regressors` of the five models with the lowest test MSE
top_regressors_indexes = [8, 9, 10, 7, 5]
top_regressors = [regressors[i] for i in top_regressors_indexes]
vc = VotingRegressor(estimators=top_regressors)

cv_score = np.mean(cross_val_score(vc, X, y, cv=10))
    
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)
r2_score = r2(y_test, y_pred)
mse = MSE(y_test, y_pred)
    
print(f"CV Score for Voting Regressor: {cv_score}")
print(f"R2 Score for Voting Regressor: {r2_score}")   
print(f"MSE Score for Voting Regressor: {mse}\n")  

CV Score for Voting Regressor: 0.6355553259906952
R2 Score for Voting Regressor: 0.7244248288067907
MSE Score for Voting Regressor: 1.956465363476781
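
With its default uniform weights, `VotingRegressor` predicts the plain average of its members' predictions. The sketch below checks this on synthetic data with two simple members; the data, the member choice, and the variable names are illustrative and unrelated to the rental models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

members = [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
]
voter = VotingRegressor(estimators=members).fit(X_demo, y_demo)

# voter.estimators_ holds the fitted clones of the members
member_preds = np.column_stack([est.predict(X_demo) for est in voter.estimators_])
manual_average = member_preds.mean(axis=1)
ensemble_preds = voter.predict(X_demo)
```

Averaging tends to help most when the members make different kinds of errors, which is one plausible reason the ensemble above edges out its best single member on the test set.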



In [13]:
best_model = vc
best_mse = mse

In [14]:
best_model

In [15]:
best_mse

1.956465363476781