# Feature Importance Analysis – Movie Revenue Prediction

This notebook analyzes **global feature importance** for the movie revenue prediction model using **SHAP values**.

We will:

1. Load the `df_all_scored` dataset (2017–2025, with predictions).  
2. Load the trained **Random Forest pipeline** used in Ensemble C.  
3. Compute SHAP values on a representative subset of data.  
4. Generate and save:
   - SHAP summary (beeswarm) plot  
   - SHAP bar plot (top features)  
5. Optionally explore dependence plots for selected features.


In [1]:
# Imports and configuration

import numpy as np
import pandas as pd
from pathlib import Path

import matplotlib.pyplot as plt

import shap  # make sure shap is installed: pip install shap
import joblib
import json

from movie_revenue_prediction.utils.paths import RESULTS_DIR, ARTIFACTS_DIR

# Plot style
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["axes.grid"] = True


In [2]:
# Paths and data loading
DF_ALL_SCORED_PATH = Path(RESULTS_DIR/"predictions/df_all_predictions.csv")

df_all_scored = pd.read_csv(DF_ALL_SCORED_PATH)

print("df_all_scored shape:", df_all_scored.shape)
df_all_scored.head()


df_all_scored shape: (3505, 152)


Unnamed: 0,id,title,original_title,release_date,revenue,budget,runtime,certification,genres,production_countries,...,x_directors_avg_revenue_prevyear,x_lead_cast_avg_revenue_prevyear,x_composers_avg_revenue_prevyear,y_pred_log_revenue_C,y_pred_revenue_C,set,x_month_sin,x_month_cos,x_wday_sin,x_wday_cos
0,354912,Coco,Coco,2017-10-27,814641172,175000000,105,PG,Family|Animation|Music|Adventure,United States of America,...,63258680.0,63633040.0,63824750.0,19.425488,273138000.0,train,-0.866025,0.5,-0.433884,-0.900969
1,398175,Brawl in Cell Block 99,Brawl in Cell Block 99,2017-09-23,64453,10000000,132,NR,Action|Crime|Thriller,United States of America,...,63258680.0,63633040.0,63824750.0,14.997126,3259634.0,train,-1.0,-1.83697e-16,-0.974928,-0.222521
2,346364,It,It,2017-09-06,704242888,35000000,135,R,Horror|Thriller,United States of America,...,63258680.0,63633040.0,63824750.0,18.808114,147319500.0,train,-1.0,-1.83697e-16,0.974928,-0.222521
3,315635,Spider-Man: Homecoming,Spider-Man: Homecoming,2017-07-05,880166924,175000000,133,PG-13,Action|Adventure|Science Fiction,United States of America,...,63258680.0,63633040.0,63824750.0,20.070412,520558000.0,train,-0.5,-0.8660254,0.974928,-0.222521
4,419430,Get Out,Get Out,2017-02-24,255407969,4500000,104,R,Mystery|Thriller|Horror,United States of America,...,63258680.0,63633040.0,63824750.0,15.709495,6645798.0,train,0.866025,0.5,-0.433884,-0.900969
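The `x_month_sin` / `x_month_cos` and `x_wday_sin` / `x_wday_cos` columns in the preview look like a standard sine/cosine encoding of the release month and weekday. The sketch below is an assumption about how such columns could be derived (period 12 for the month, period 7 for the weekday with Monday = 0). It is not taken from the project's preprocessing code, and `cyclical_encode` is a hypothetical helper. Under these assumptions it matches the values shown for *Coco* (released 2017-10-27, a Friday).

```python
import math
from datetime import date


def cyclical_encode(value: int, period: int) -> tuple[float, float]:
    """Map a periodic integer onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


release = date(2017, 10, 27)  # release_date of the first preview row

# Assumed convention: raw month number with period 12
month_sin, month_cos = cyclical_encode(release.month, 12)

# Assumed convention: Python weekday (Monday = 0) with period 7
wday_sin, wday_cos = cyclical_encode(release.weekday(), 7)

print(round(month_sin, 6), round(month_cos, 6))
print(round(wday_sin, 6), round(wday_cos, 6))
```

Encoding both sine and cosine keeps December adjacent to January (and Sunday adjacent to Monday), which a single integer column cannot express.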


In [3]:
# Load canonical feature list used in Ensemble C (cyclical features)
feature_cols_path = Path(ARTIFACTS_DIR/"ensemble_C/preprocessing/feature_cols_cyc.json")
with open(feature_cols_path, "r") as f:
    feature_cols = json.load(f)

print("Number of features:", len(feature_cols))
print("First 10 features:", feature_cols[:10])

# Load the trained RandomForest pipeline from Ensemble C
rf_model_path = Path(ARTIFACTS_DIR/"ensemble_C/base_models/RF_cyc.pkl")
rf_pipe = joblib.load(rf_model_path)

rf_pipe

Number of features: 128
First 10 features: ['x_is_in_collection', 'x_has_homepage', 'x_budget_log', 'x_runtime', 'x_num_spoken_languages', 'x_year', 'x_month', 'x_weekday', 'x_day', 'x_cast_ratio_male']


0,1,2
,steps,"[('imp', ...), ('rf', ...)]"
,transform_input,
,memory,
,verbose,False

0,1,2
,missing_values,
,strategy,'median'
,fill_value,
,copy,True
,add_indicator,False
,keep_empty_features,False

0,1,2
,n_estimators,500
,criterion,'squared_error'
,max_depth,22
,min_samples_split,5
,min_samples_leaf,2
,min_weight_fraction_leaf,0.0
,max_features,0.7
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


In [4]:
# Prepare feature matrix and year column

# Ensure all required feature columns exist
missing_feats = [c for c in feature_cols if c not in df_all_scored.columns]
if missing_feats:
    raise ValueError(f"The following features are missing from df_all_scored: {missing_feats}")

X_all = df_all_scored[feature_cols].copy()

# Ensure we have a year column
if "x_year" in df_all_scored.columns:
    df_all_scored["year"] = df_all_scored["x_year"].astype(int)
elif "year" in df_all_scored.columns:
    df_all_scored["year"] = df_all_scored["year"].astype(int)
else:
    raise KeyError("df_all_scored needs a 'x_year' or 'year' column.")

# Optional: focus SHAP analysis on 2024–2025 (business-relevant period)
mask_2024_2025 = df_all_scored["year"].between(2024, 2025)
X_focus = X_all[mask_2024_2025].copy()

print("X_all shape:", X_all.shape)
print("X_focus (2024–2025) shape:", X_focus.shape)


X_all shape: (3505, 128)
X_focus (2024–2025) shape: (871, 128)


In [5]:
# Sample data for SHAP (to keep computations reasonable)

# Background data for SHAP expectation
n_background = min(1000, len(X_all))
X_background = X_all.sample(n=n_background, random_state=42)

# Data we actually explain (focusing on 2024–2025 if there is enough data)
if len(X_focus) >= 500:
    X_shap = X_focus.sample(n=min(2000, len(X_focus)), random_state=42)
else:
    X_shap = X_all.sample(n=min(2000, len(X_all)), random_state=42)

print("Background sample shape:", X_background.shape)
print("SHAP explanation sample shape:", X_shap.shape)


Background sample shape: (1000, 128)
SHAP explanation sample shape: (871, 128)



In [9]:
from sklearn.pipeline import Pipeline

# rf_pipe is Pipeline(steps=[('imp', SimpleImputer), ('rf', RandomForestRegressor)])
assert isinstance(rf_pipe, Pipeline)

imp = rf_pipe.named_steps["imp"]
rf_model = rf_pipe.named_steps["rf"]

# 1) Transform background and SHAP samples with the same imputer
X_background_imp = imp.transform(X_background)  # numpy array
X_shap_imp = imp.transform(X_shap)              # numpy array

# 2) Build TreeExplainer on the RandomForest model itself
explainer = shap.TreeExplainer(rf_model, X_background_imp)

# 3) Compute SHAP values on the imputed SHAP sample
shap_values = explainer(X_shap_imp)

# Handle both old (np.array) and new (Explanation) SHAP formats
if hasattr(shap_values, "values"):
    shap_vals_array = shap_values.values
else:
    shap_vals_array = shap_values  # already a numpy array

shap_vals_array.shape




(871, 128)
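SHAP values are additive: for each row, the base value plus the sum of that row's SHAP values equals the model output. Showing this with `TreeExplainer` needs the fitted forest, so the sketch below uses a toy linear model instead, where (assuming features are perturbed independently against a background sample) the SHAP value of feature *i* has the closed form `w_i * (x_i - mean(x_i))`. All names and numbers here are illustrative and not taken from the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model f(x) = X @ w + b (stand-in for the real regressor)
w = np.array([2.0, -1.0, 0.5])
b = 3.0


def predict(X):
    return X @ w + b


X_bg = rng.normal(size=(200, 3))      # background sample
X_explain = rng.normal(size=(5, 3))   # rows to explain

# Base value: average prediction over the background sample
base_value = predict(X_bg).mean()

# Closed-form SHAP values for a linear model with independent features
shap_vals = (X_explain - X_bg.mean(axis=0)) * w

# Additivity: base value + row-wise sum of SHAP values == prediction
reconstructed = base_value + shap_vals.sum(axis=1)
print(np.allclose(reconstructed, predict(X_explain)))
```

For the notebook's own objects, the analogous check (assuming `shap_values` is an `Explanation` object) would compare `shap_values.base_values + shap_vals_array.sum(axis=1)` against `rf_model.predict(X_shap_imp)`.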

In [10]:
# Create output directory for feature importance plots

feature_importance_dir = RESULTS_DIR / "plots" / "feature_importance"
feature_importance_dir.mkdir(parents=True, exist_ok=True)

feature_importance_dir

PosixPath('/Users/newuser/Desktop/Victoria/Projects 2025/Movie Revenue Project/results/plots/feature_importance')

In [None]:
# Global SHAP summary (beeswarm plot)

plt.figure()
shap.summary_plot(
    shap_vals_array,
    features=X_shap,
    feature_names=X_shap.columns,
    show=False
)
plt.tight_layout()
out_path = feature_importance_dir / "rf_shap_summary_beeswarm.png"
plt.savefig(out_path, dpi=150)
plt.close()

print("Saved:", out_path)

Saved: /Users/newuser/Desktop/Victoria/Projects 2025/Movie Revenue Project/results/plots/feature_importance/rf_shap_summary_beeswarm.png


In [15]:
# SHAP bar plot (mean |SHAP| per feature, top 20)

plt.figure()
shap.summary_plot(
    shap_vals_array,
    features=X_shap,
    feature_names=X_shap.columns,
    plot_type="bar",
    max_display=20,
    show=False
)

plt.tight_layout()
out_path = feature_importance_dir / "rf_shap_summary_bar_top20.png"
plt.savefig(out_path, dpi=150)
plt.close()

print("Saved:", out_path)

Saved: /Users/newuser/Desktop/Victoria/Projects 2025/Movie Revenue Project/results/plots/feature_importance/rf_shap_summary_bar_top20.png


In [16]:
# Identify top features by mean absolute SHAP value

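# Use shap_vals_array (not shap_values.values) so this also works when
# SHAP returns a plain numpy array instead of an Explanation object
mean_abs_shap = np.abs(shap_vals_array).mean(axis=0)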
feat_importance = pd.Series(mean_abs_shap, index=X_shap.columns).sort_values(ascending=False)

top_features = feat_importance.head(5).index.tolist()
feat_importance.head(10)

x_budget_log                          3.247812
x_runtime                             0.387752
x_is_in_collection                    0.265852
x_year                                0.090174
x_cast_ratio_male                     0.089431
x_cert_pg-13                          0.062654
x_cast_ratio_female                   0.062518
x_country_united_states_of_america    0.059974
x_wday_sin                            0.041948
x_has_homepage                        0.036726
dtype: float64
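Because the model output is log-revenue, each SHAP value is an additive shift in log units. Assuming the target is the natural log of revenue (an assumption, consistent with the `y_pred_log_revenue_C` / `y_pred_revenue_C` pairs in the data preview), a per-row SHAP value of δ corresponds to multiplying the predicted revenue by exp(δ). The helper name and the example values below are hypothetical.

```python
import math


def log_shift_to_factor(shap_value: float) -> float:
    """Convert an additive shift in log-revenue into a multiplicative factor."""
    return math.exp(shap_value)


# Illustrative per-row SHAP values in log-revenue units (made-up numbers)
for delta in (-0.5, 0.0, 0.25, 1.0):
    factor = log_shift_to_factor(delta)
    print(f"SHAP {delta:+.2f} -> predicted revenue x {factor:.3f}")
```

This conversion applies to individual SHAP values. The mean absolute SHAP value summarizes typical shift size in log units, and exponentiating it gives only a rough sense of scale, since the mean of the factors is not the factor of the mean.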

In [17]:
# SHAP dependence plots for the top 3 features

for feat in top_features[:3]:
    # dependence_plot opens its own figure when no `ax` is passed, so a
    # preceding plt.figure() would only leave an empty figure behind
    shap.dependence_plot(
        feat,
        shap_vals_array,
        X_shap,
        feature_names=X_shap.columns,
        show=False
    )
    plt.tight_layout()
    out_path = feature_importance_dir / f"rf_shap_dependence_{feat}.png"
    plt.savefig(out_path, dpi=150)
    plt.close()
    print(f"Saved dependence plot for {feat} → {out_path}")

Saved dependence plot for x_budget_log → /Users/newuser/Desktop/Victoria/Projects 2025/Movie Revenue Project/results/plots/feature_importance/rf_shap_dependence_x_budget_log.png
Saved dependence plot for x_runtime → /Users/newuser/Desktop/Victoria/Projects 2025/Movie Revenue Project/results/plots/feature_importance/rf_shap_dependence_x_runtime.png
Saved dependence plot for x_is_in_collection → /Users/newuser/Desktop/Victoria/Projects 2025/Movie Revenue Project/results/plots/feature_importance/rf_shap_dependence_x_is_in_collection.png

## Interpretation Notes

- The **RandomForest SHAP summary (beeswarm)** highlights:
  - which features have the strongest overall impact on predicted log-revenue,
  - and whether high/low feature values push predictions up or down.

- The **bar plot (top 20)** shows global feature importance ordered by mean absolute SHAP value.

- **Dependence plots** for the top three features plot each feature’s value against its SHAP contribution; points are colored by the feature that SHAP estimates to interact most strongly with it.
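
The bar plot's ranking can be reproduced by hand. The sketch below uses a small made-up SHAP matrix with hypothetical feature names to show why the absolute value is taken before averaging: a feature whose contributions are large but opposite in sign would look unimportant under a plain mean.

```python
import numpy as np

feature_names = ["feat_a", "feat_b", "feat_c"]  # hypothetical names

# Rows = samples, columns = features (made-up SHAP values)
toy_shap = np.array([
    [ 2.0, 0.3, -0.1],
    [-2.0, 0.2,  0.1],
    [ 1.5, 0.4, -0.2],
    [-1.5, 0.1,  0.2],
])

signed_mean = toy_shap.mean(axis=0)       # contributions cancel for feat_a
mean_abs = np.abs(toy_shap).mean(axis=0)  # magnitude regardless of sign

# Rank features from most to least important by mean |SHAP|
order = np.argsort(mean_abs)[::-1]
ranking = [(feature_names[i], float(mean_abs[i])) for i in order]
print(ranking)
```

Here `feat_a` has a signed mean of zero yet ranks first by mean |SHAP|, which is the quantity the bar plot orders by.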

Plots are saved to:

- `results/plots/feature_importance/rf_shap_summary_beeswarm.png`
- `results/plots/feature_importance/rf_shap_summary_bar_top20.png`
- `results/plots/feature_importance/rf_shap_dependence_<feature>.png`

The tables:

- `results/plots/feature_importance/rf_feature_importance_all.csv`
- `results/plots/feature_importance/rf_feature_importance_top20.csv`

can be used in Tableau / Power BI for interactive feature-importance views.


In [19]:
# Build the feature-importance table (np and pd are imported in the first cell)

# 1. Mean absolute SHAP value per feature, sorted from most to least important
mean_abs_shap = np.abs(shap_vals_array).mean(axis=0)

fi_df = (
    pd.DataFrame({
        "feature": X_shap.columns,
        "mean_abs_shap": mean_abs_shap,
    })
    .sort_values("mean_abs_shap", ascending=False)
    .reset_index(drop=True)
)

# 2. Keep top 20 for dashboard
fi_top20 = fi_df.head(20).copy()

# 3. Save both full and top20 tables (handy for analysis)
full_path = feature_importance_dir / "rf_feature_importance_all.csv"
top20_path = feature_importance_dir / "rf_feature_importance_top20.csv"

fi_df.to_csv(full_path, index=False)
fi_top20.to_csv(top20_path, index=False)

print("Saved full feature importance to:", full_path)
print("Saved top 20 feature importance to:", top20_path)

fi_top20


Saved full feature importance to: /Users/newuser/Desktop/Victoria/Projects 2025/Movie Revenue Project/results/plots/feature_importance/rf_feature_importance_all.csv
Saved top 20 feature importance to: /Users/newuser/Desktop/Victoria/Projects 2025/Movie Revenue Project/results/plots/feature_importance/rf_feature_importance_top20.csv


                              feature  mean_abs_shap
0                        x_budget_log       3.247812
1                           x_runtime       0.387752
2                  x_is_in_collection       0.265852
3                              x_year       0.090174
4                   x_cast_ratio_male       0.089431
5                        x_cert_pg-13       0.062654
6                 x_cast_ratio_female       0.062518
7  x_country_united_states_of_america       0.059974
8                          x_wday_sin       0.041948
9                      x_has_homepage       0.036726
