# Modeling and Evaluation

In this notebook, we train, tune, and evaluate baseline machine learning models for two tasks:

- **Revenue Prediction** (Regression)
- **Success Classification** (Binary Classification)

We use scikit-learn pipelines so that scaling and model fitting run as a single step and the scaler is fit only on the training data.


In [3]:

import pandas as pd
import numpy as np


from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split


In [5]:
movies_df = pd.read_csv("../data/clean_movies.csv")

## Training Baseline Models with Pipelines

We now create baseline models for both regression and classification tasks using Scikit-learn Pipelines.

- **Regression:** RandomForestRegressor
- **Classification:** LogisticRegression

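Each pipeline includes a `StandardScaler`, which standardizes every feature to zero mean and unit variance before training. Scaling matters for logistic regression; random forests are largely insensitive to it, so the scaler has little effect in the regression pipeline.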
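One practical benefit of wrapping the scaler in a pipeline is that its statistics are learned from the training rows only. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical stand-ins, not the notebook's movie data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the movie features (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(X_tr, y_tr)

# The fitted scaler stores the mean and standard deviation of X_tr only.
scaler = demo_pipeline.named_steps["scaler"]
mean_matches_train = np.allclose(scaler.mean_, X_tr.mean(axis=0))
scale_matches_train = np.allclose(scaler.scale_, X_tr.std(axis=0))
print(mean_matches_train, scale_matches_train)

# predict() reuses those stored statistics on the test rows; nothing is re-fit.
test_predictions = demo_pipeline.predict(X_te)
```

Because the test rows never influence the scaler, the test metrics are not inflated by information leaking from the held-out data.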


In [8]:
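# Features: every column except the two targets
X = movies_df.drop(columns=['revenue', 'success'])
y_reg = movies_df['revenue']
y_clf = movies_df['success']

# Regression split: hold out 20% of the rows for testing
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y_reg, test_size=0.2, random_state=42)

# Classification split: stratify so train and test keep the same class balance
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X, y_clf, test_size=0.2, stratify=y_clf, random_state=42)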
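The effect of `stratify` in the classification split can be seen on a toy label vector. This is a minimal sketch with hypothetical labels (60% positive), not the notebook's `success` column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 600 positives and 400 negatives.
y_demo = np.array([1] * 600 + [0] * 400)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Plain random split: the positive rate in the test set can drift.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Stratified split: both parts keep (almost exactly) the original 60% rate.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(f"overall positive rate:    {y_demo.mean():.3f}")
print(f"plain split, test set:    {y_te_plain.mean():.3f}")
print(f"stratified, training set: {y_tr_strat.mean():.3f}")
print(f"stratified, test set:     {y_te_strat.mean():.3f}")
```

Stratification matters most when classes are imbalanced or the test set is small, since a skewed test set makes precision and recall harder to compare across runs.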

In [12]:
reg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(random_state=42))
])

reg_pipeline.fit(X_train_reg, y_train_reg)

# Predict revenue for the held-out test rows
y_pred_reg = reg_pipeline.predict(X_test_reg)


In [14]:
clf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000, random_state=42))
])

clf_pipeline.fit(X_train_clf, y_train_clf)

y_pred_clf = clf_pipeline.predict(X_test_clf)

## Model Evaluation

In this step, we evaluate the performance of our baseline models using appropriate metrics.

- For **regression**, we assess:  
  - MAE (Mean Absolute Error)  
  - RMSE (Root Mean Squared Error)  
  - R² (Coefficient of Determination)

- For **classification**, we assess:  
  - Accuracy  
  - Precision  
  - Recall  
  - F1 Score  
  - ROC AUC
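The three regression metrics listed above can be computed by hand, which makes their definitions concrete. The sketch below uses four hypothetical revenue values and NumPy only; scikit-learn's metric functions implement the same formulas.

```python
import numpy as np

# Hypothetical true and predicted revenues, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_true - y_pred

# MAE: the average size of an error, in the same units as the target.
mae_demo = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; large misses weigh more.
rmse_demo = np.sqrt(np.mean(errors ** 2))

# R²: one minus the ratio of the residual sum of squares to the
# total sum of squares around the mean of y_true.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f"MAE  = {mae_demo:.2f}")
print(f"RMSE = {rmse_demo:.2f}")
print(f"R²   = {r2_demo:.4f}")
```

RMSE is never smaller than MAE, and the gap widens when a few predictions miss by a large amount, so reading the two together gives a rough picture of how the errors are distributed.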


In [19]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))
r2 = r2_score(y_test_reg, y_pred_reg)

print("Regression Metrics:")
print(f"MAE  = {mae:,.2f}")
print(f"RMSE = {rmse:,.2f}")
print(f"R²   = {r2:.4f}")

Regression Metrics:
MAE  = 37,690,098.67
RMSE = 75,519,375.03
R²   = 0.7807


In [23]:
from sklearn.metrics import classification_report, roc_auc_score

# The report covers accuracy, precision, recall, and F1 for each class
print(classification_report(y_test_clf, y_pred_clf, target_names=["Not Successful", "Successful"]))

# ROC AUC is computed from the predicted probability of the positive class
roc_auc = roc_auc_score(y_test_clf, clf_pipeline.predict_proba(X_test_clf)[:, 1])
print(f"ROC AUC: {roc_auc:.4f}")

                precision    recall  f1-score   support

Not Successful       0.68      0.57      0.62       392
    Successful       0.73      0.82      0.77       569

      accuracy                           0.71       961
     macro avg       0.71      0.69      0.69       961
  weighted avg       0.71      0.71      0.71       961

ROC AUC: 0.7830
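The per-class numbers in a report like the one above are derived from four counts: true positives, false positives, false negatives, and true negatives. A minimal sketch with ten hypothetical labels (not taken from the movie data) shows the arithmetic for the positive class.

```python
# Hypothetical labels for ten movies (1 = successful, 0 = not successful).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # successful, predicted successful
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not successful, predicted successful
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # successful, predicted not successful
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not successful, predicted not successful

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(pairs)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Read the same way, the recall of 0.57 for "Not Successful" in the report means that roughly 43% of the unsuccessful movies in the test set were predicted as successful.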


## Hyperparameter Tuning with GridSearchCV

We will now use `GridSearchCV` to tune our models and find the best hyperparameters for:

- **LogisticRegression** (for classification)
- **RandomForestRegressor** (for regression)

We use 5-fold cross-validation on the training set, so each candidate is scored on five different held-out folds and the test set stays untouched until the final evaluation.


In [28]:
from sklearn.model_selection import GridSearchCV

param_grid_logreg = {
    'model__C': [0.01, 0.1, 1, 10, 100],
    'model__penalty': ['l2'],
    'model__solver': ['lbfgs']
}
grid_clf = GridSearchCV(clf_pipeline, param_grid_logreg, cv=5, scoring='f1', n_jobs=-1)

grid_clf.fit(X_train_clf, y_train_clf)

print("Best Logistic Regression Params:", grid_clf.best_params_)


Best Logistic Regression Params: {'model__C': 10, 'model__penalty': 'l2', 'model__solver': 'lbfgs'}
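Beyond `best_params_`, a fitted `GridSearchCV` exposes per-candidate scores in `cv_results_`, which helps judge whether the winner is clearly better or within fold-to-fold noise. The sketch below runs the same kind of search on synthetic data; `X_demo`, `y_demo`, and `demo_grid` are hypothetical names, and the scores it prints say nothing about the movie dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (hypothetical stand-in).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

demo_grid = GridSearchCV(
    demo_pipeline,
    {"model__C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="f1",
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate: mean and standard deviation of F1 across the 5 folds.
columns = ["param_model__C", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(demo_grid.cv_results_)[columns]
print(results.sort_values("rank_test_score").to_string(index=False))
```

If the gap between the top candidates is smaller than `std_test_score`, the choice of `C` is unlikely to matter much on new data.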


In [34]:
 
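# Candidate settings for the random forest: 2 x 3 x 2 = 12 combinations
param_grid_rf = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [10, 20, None],
    'model__min_samples_split': [2, 5],
}

# Score each candidate by negated MAE (scikit-learn maximizes scores)
grid_reg = GridSearchCV(reg_pipeline, param_grid_rf, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)

# Run the search on the training set only
grid_reg.fit(X_train_reg, y_train_reg)

print("Best Random Forest Params:", grid_reg.best_params_)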


Best Random Forest Params: {'model__max_depth': 10, 'model__min_samples_split': 5, 'model__n_estimators': 200}
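The cost of a grid search grows with the product of the option counts. As a rough sketch, the two grids defined above can be counted with the standard library alone; `count_fits` is a hypothetical helper, not part of scikit-learn.

```python
from math import prod

# The two grids from this notebook.
param_grid_logreg = {
    "model__C": [0.01, 0.1, 1, 10, 100],
    "model__penalty": ["l2"],
    "model__solver": ["lbfgs"],
}
param_grid_rf = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [10, 20, None],
    "model__min_samples_split": [2, 5],
}


def count_fits(param_grid, n_folds):
    """Return (number of candidates, number of cross-validation fits)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * n_folds


logreg_candidates, logreg_fits = count_fits(param_grid_logreg, n_folds=5)
rf_candidates, rf_fits = count_fits(param_grid_rf, n_folds=5)

print(f"Logistic regression: {logreg_candidates} candidates, {logreg_fits} fits")
print(f"Random forest:       {rf_candidates} candidates, {rf_fits} fits")
# With the default refit=True, GridSearchCV adds one final fit of the best
# candidate on the whole training set.
```

Adding a fourth hyperparameter with three options would triple the random forest count, which is why grids are usually kept small or replaced by a randomized search.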


## Evaluation of Tuned Models

Now that we have identified the best hyperparameters using GridSearchCV, we evaluate the performance of the tuned models on the test set.

We compare the results with our baseline models to determine whether tuning improved performance.


In [37]:
best_reg = grid_reg.best_estimator_

y_pred_reg_tuned = best_reg.predict(X_test_reg)

mae_tuned = mean_absolute_error(y_test_reg, y_pred_reg_tuned)
rmse_tuned = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg_tuned))
r2_tuned = r2_score(y_test_reg, y_pred_reg_tuned)

print("Tuned Regression Model Metrics:")
print(f"MAE  = {mae_tuned:,.2f}")
print(f"RMSE = {rmse_tuned:,.2f}")
print(f"R²   = {r2_tuned:.4f}")

Tuned Regression Model Metrics:
MAE  = 37,148,816.07
RMSE = 75,969,538.58
R²   = 0.7780
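To compare the tuned and baseline regressors, the printed metrics can be turned into relative changes. The sketch below copies the values from the two outputs in this notebook; `percent_change` is a hypothetical helper. Lower is better for MAE and RMSE, higher is better for R².

```python
# Test-set metrics copied from the baseline and tuned outputs above.
baseline = {"MAE": 37_690_098.67, "RMSE": 75_519_375.03, "R2": 0.7807}
tuned = {"MAE": 37_148_816.07, "RMSE": 75_969_538.58, "R2": 0.7780}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


changes = {name: percent_change(baseline[name], tuned[name]) for name in baseline}

for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```

The MAE drops slightly while RMSE rises and R² falls slightly, so the tuned model is not clearly better. With a single test split, differences this small may well be within sampling noise.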


In [39]:
best_clf = grid_clf.best_estimator_

y_pred_clf_tuned = best_clf.predict(X_test_clf)

print("Tuned Classification Model Report:")
print(classification_report(y_test_clf, y_pred_clf_tuned, target_names=["Not Successful", "Successful"]))

 
# ROC AUC uses the predicted probability of the positive class
roc_auc_tuned = roc_auc_score(y_test_clf, best_clf.predict_proba(X_test_clf)[:, 1])
print(f"ROC AUC: {roc_auc_tuned:.4f}")

Tuned Classification Model Report:
                precision    recall  f1-score   support

Not Successful       0.68      0.57      0.62       392
    Successful       0.73      0.82      0.77       569

      accuracy                           0.72       961
     macro avg       0.71      0.69      0.70       961
  weighted avg       0.71      0.72      0.71       961

ROC AUC: 0.7835


In [42]:
import joblib


In [44]:
# Save the tuned classification pipeline (scaler + model) to disk
joblib.dump(grid_clf.best_estimator_, "../models/classification_model.pkl")
print("Classification model saved as classification_model.pkl")


Classification model saved as classification_model.pkl


In [46]:
joblib.dump(grid_reg.best_estimator_, "../models/regression_model.pkl")
print("Regression model saved as regression_model.pkl")

Regression model saved as regression_model.pkl


In [52]:
joblib.dump((X_test_reg, y_test_reg), "../models/test_data_regression.pkl")

['../models/test_data_regression.pkl']

In [49]:
joblib.dump((X_test_clf, y_test_clf), "../models/test_data_clf.pkl")

['../models/test_data_clf.pkl']
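A saved pipeline can be restored with `joblib.load` and should reproduce the original predictions. The round trip below is a minimal sketch on synthetic data written to a temporary directory; `demo_pipeline` and `demo_model.pkl` are hypothetical names, so the notebook's files under `../models/` are not touched.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem (hypothetical stand-in for the movie data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_pipeline, model_path)  # returns a list of written file names
    restored = joblib.load(model_path)      # scaler and model come back together

same_predictions = np.array_equal(
    demo_pipeline.predict(X_demo), restored.predict(X_demo)
)
print(same_predictions)
```

In a later notebook the same pattern would apply to the files saved here, for example `joblib.load("../models/classification_model.pkl")` for the model and `X_test_clf, y_test_clf = joblib.load("../models/test_data_clf.pkl")` for the test tuple. Pickled estimators are best loaded with the same scikit-learn version that saved them.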

## Summary of Modeling

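Both pipelines were trained as baselines and then tuned with GridSearchCV. On the test set, tuning changed performance only marginally: the regression MAE fell from about 37.7M to 37.1M while R² moved from 0.7807 to 0.7780, and the classification ROC AUC moved from 0.7830 to 0.7835.

- **Best classification model:** Logistic Regression with C=10
- **Best regression model:** Random Forest with 200 trees, max_depth=10, min_samples_split=5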

We now proceed to visualize and interpret the results in the next notebook:  
📁 `03_evaluation.ipynb`
