# In Class Assignment – "Behind the Scenes: Predicting Box Office Success with Random Forest"


**Dataset:** IMDB-Movie-Data.csv  
**Goal:** Predict whether a movie is a *Hit* (rating ≥ 7.0) or *Flop* using Random Forest.


## Theoretical Concept: Random Forests

### Overview
A **Random Forest** is an ensemble machine learning algorithm that combines the predictions of multiple **Decision Trees** to improve accuracy and reduce overfitting.  
Each tree in the forest is trained on a bootstrap sample of the data, a process called **bagging (Bootstrap Aggregating)**, and considers only a random subset of features at each split. These two sources of randomness make the trees less correlated with one another, so the forest usually generalizes better than any single tree.

---

### How It Works
1. **Bootstrap Sampling:**  
   Each Decision Tree is trained on a bootstrap sample: rows drawn at random from the original dataset with replacement, usually as many draws as there are rows.  
   Because some rows repeat and others are left out, each tree sees slightly different data, which creates model diversity.

2. **Feature Randomness:**  
   At each node, a random subset of features is chosen to determine the best split instead of considering all features.  
   This prevents dominant predictors (like “Votes”) from controlling every split, improving generalization.

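3. **Voting Mechanism:**  
   In classification tasks, each tree makes its own prediction.  
   The final output of the Random Forest is decided by **majority voting** among all trees. (scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead, which often yields the same label.)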
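
The two sources of randomness in steps 1 and 2 can be sketched with the standard library alone. This is a minimal illustration, not how scikit-learn implements them; the dataset size `n_rows` and the choice of two candidate features per split are assumptions made for the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n_rows = 1000  # hypothetical dataset size, not the size of the real CSV
features = ["Votes", "Revenue (Millions)", "Metascore", "Runtime (Minutes)"]

# Step 1, bootstrap sampling: draw n_rows row indices WITH replacement.
bootstrap_idx = [random.randrange(n_rows) for _ in range(n_rows)]

# Some rows repeat and some never appear, so fewer than n_rows are distinct.
unique_fraction = len(set(bootstrap_idx)) / n_rows

# Step 2, feature randomness: only a random subset of features
# is considered as split candidates at a given node (k=2 is an assumption).
split_candidates = random.sample(features, k=2)

print(f"Distinct rows in this bootstrap sample: {unique_fraction:.1%}")
print("Features considered at this split:", split_candidates)
```

With 1,000 rows, `unique_fraction` should come out near 0.63, since each row has roughly a 63% chance of appearing at least once in a given bootstrap sample.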

---

### Mathematical Intuition:
If \( T_1, T_2, \dots, T_n \) are the predictions from individual trees, then for classification:

\[
\hat{y} = \text{mode}(T_1, T_2, \dots, T_n)
\]

For regression problems, the average prediction is used:

\[
\hat{y} = \frac{1}{n} \sum_{i=1}^n T_i
\]

Aggregating the trees' predictions in this way reduces variance and makes the forest's output more stable.
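
The two aggregation rules above can be tried on a handful of numbers. The per-tree predictions below are invented for illustration (1 = Hit, 0 = Flop for the votes; IMDb-style ratings for the regression case).

```python
from statistics import mean, mode

# Classification: hypothetical class votes from seven trees for one movie.
class_votes = [1, 1, 0, 1, 0, 1, 1]
y_hat_class = mode(class_votes)  # the most frequent vote wins

# Regression: hypothetical rating predictions from five trees for one movie.
tree_ratings = [6.8, 7.4, 7.1, 6.9, 7.3]
y_hat_reg = mean(tree_ratings)  # the average of the trees' predictions

print("Majority vote:", y_hat_class)
print(f"Average prediction: {y_hat_reg:.2f}")
```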

---

### Advantages:
- **Reduces Overfitting:** Combining many trees reduces the noise and variance of individual trees.  
- **Improves Accuracy:** Random Forests often outperform a single Decision Tree by leveraging ensemble learning.  
- **Feature Importance:** They naturally provide insights into which features are most influential.  
- **Handles Nonlinear Relationships:** Can model complex, nonlinear patterns without heavy preprocessing.

---

### Bias-Variance Trade-off:
A single fully grown Decision Tree tends to have **low bias but high variance**: it fits the training data closely and can overfit easily.  
Random Forests lower the variance by averaging many trees while keeping bias relatively low.  
This balance usually leads to stronger performance on unseen data.
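
One common way to make this precise, assuming the \( n \) tree predictions are identically distributed with variance \( \sigma^2 \) and pairwise correlation \( \rho \) (symbols introduced here for illustration), is to look at the variance of their average:

\[
\operatorname{Var}\left( \frac{1}{n} \sum_{i=1}^n T_i \right) = \rho \sigma^2 + \frac{1 - \rho}{n} \sigma^2
\]

Under these assumptions, adding trees shrinks the second term toward zero, while the first term depends only on how correlated the trees are, which is what feature randomness is meant to reduce.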

---

### Application in This Assignment
In this assignment, Random Forests are used to predict whether a movie is a **Hit** (IMDb rating ≥ 7.0) or **Flop** based on features such as:
- **Votes:** audience engagement  
- **Revenue (Millions):** box office performance  
- **Metascore:** critic evaluation  
- **Runtime (Minutes):** viewer accessibility


---

### Key Terms
| Concept | Description |
|----------|-------------|
| **Bootstrap Sampling** | Randomly drawing data samples with replacement for each tree |
| **Feature Bagging** | Random selection of features for each split |
| **Majority Voting** | Predicting the class chosen by the largest number of trees |
| **Overfitting** | When the model learns noise instead of true patterns |
| **Feature Importance** | Quantitative measure of how much each variable contributes to prediction |

---

### Summary
Random Forests combine simplicity, a degree of interpretability through feature importances, and predictive power, making them well suited to real-world prediction tasks like analyzing movie success.  
They show how many individually unstable trees, when combined, can form a **strong, stable model**, much like how collective decisions often outperform individual judgments.



# Step 1. Load Data

In [None]:
import pandas as pd

df = pd.read_csv('____')   ### FILL IN BLANK
print("Dataset shape:", ____)   ### FILL IN BLANK
print("\nColumns:\n", ____)   ### FILL IN BLANK
print("\nMissing values:\n", ____)   ### FILL IN BLANK
df.head()


**Interpretation:**  
Which features from this dataset do you think most strongly determine whether a movie is a hit or a flop?


# Step 2. Data Cleaning and Feature Engineering

In [None]:
# Drop rows with missing key numerical values
df = df.dropna(subset=['____','____','____','____'])   ### FILL IN BLANK

# Create a binary target: 1 = Hit (Rating ≥ 7), 0 = Flop
df['____'] = (df['____'] >= ____).astype(int)   ### FILL IN BLANK

# Select relevant numeric features
features = ['____','____','____','____']   ### FILL IN BLANK
X = df[features]
y = df['____']

print("Features:", features)
print("Target distribution:\n", y.value_counts(normalize=True))


**Interpretation:**  
Why might features like votes and revenue be more predictive of success than runtime or metascore?


# Step 3. Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    ____, ____, test_size=____, random_state=____, stratify=____   ### FILL IN BLANKS
)
print("Training samples:", X_train.shape[0], " Testing samples:", X_test.shape[0])


**Interpretation:**  
Why is stratified sampling useful when the dataset has a mix of hits and flops?
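
To see what stratification does mechanically, the sketch below splits a made-up label list so that each class is divided in the same proportion. The helper `stratified_split` is hypothetical and written for illustration; it is not scikit-learn's implementation, and the 30/70 class mix is an assumption, not a property of the movie dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                         # shuffle within the class
        n_test = round(len(idx) * test_size)     # same share taken from each class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

labels = [1] * 30 + [0] * 70  # hypothetical mix: 30 hits, 70 flops
train_idx, test_idx = stratified_split(labels, test_size=0.2)

hit_share_train = sum(labels[i] for i in train_idx) / len(train_idx)
hit_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"Hit share in train: {hit_share_train:.2f}, in test: {hit_share_test:.2f}")
```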

# Step 4. Train Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(
    n_estimators=____, max_depth=____, random_state=____,
    class_weight='____', n_jobs=____   ### FILL IN BLANKS
)
rf.fit(____, ____)   ### FILL IN BLANK

y_pred = rf.predict(____)   ### FILL IN BLANK
print("Random Forest Accuracy:", accuracy_score(____, ____))   ### FILL IN BLANK


**Interpretation:**  
How does using many trees help Random Forests reduce overfitting compared to a single Decision Tree?


# Step 5. Evaluate Model Performance

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

print("Confusion Matrix:\n", confusion_matrix(____, ____))   ### FILL IN BLANK
print("\nClassification Report:\n", classification_report(____, ____))   ### FILL IN BLANK



**Interpretation:**  
Which metric (precision, recall, or F1) gives you the best insight into the model’s reliability for predicting hit movies?
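
As a reminder of what the three metrics measure, the sketch below computes them for the Hit class from a confusion matrix. The counts are invented for illustration and are not results from this dataset.

```python
# Hypothetical confusion-matrix counts, with Hit as the positive class.
tn, fp = 50, 10   # actual flops: predicted Flop, predicted Hit
fn, tp = 5, 35    # actual hits:  predicted Flop, predicted Hit

# Precision: of the movies predicted as hits, how many really were hits?
precision = tp / (tp + fp)

# Recall: of the movies that really were hits, how many did the model find?
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision = {precision:.3f}, Recall = {recall:.3f}, F1 = {f1:.3f}")
```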


# Step 6. Feature Importance Visualization

In [None]:
import matplotlib.pyplot as plt
import numpy as np

importances = pd.Series(____.feature_importances_, index=____).sort_values(ascending=False)   ### FILL IN BLANK
importances.plot(kind='barh', color='lightgreen', title='Feature Importance in Random Forest')
plt.xlabel('Importance Score')
plt.show()


**Interpretation:**  
Which feature was most influential, and does this align with your intuition about what makes a movie successful?


# Step 7. Hyperparameter Experimentation (Group Task)

In [None]:

for n in [____, ____, ____, ____]:   ### FILL IN BLANK
    model = RandomForestClassifier(n_estimators=n, random_state=____)   ### FILL IN BLANK
    model.fit(____, ____)   ### FILL IN BLANK
    preds = model.predict(____)   ### FILL IN BLANK
    acc = accuracy_score(____, ____)   ### FILL IN BLANK
    print(f"n_estimators = {n}: Accuracy = {acc:.3f}")


**Interpretation:**  
As you increase the number of trees, what trend do you observe in model accuracy and stability?


# Step 8. Visualize ROC Curve

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

y_proba = ____.predict_proba(____)[:, 1]   ### FILL IN BLANK
fpr, tpr, _ = roc_curve(____, ____)   ### FILL IN BLANK
auc = roc_auc_score(____, ____)   ### FILL IN BLANK

plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f"Random Forest (AUC = {auc:.3f})")
plt.plot([0,1],[0,1],'--', color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve – Random Forest Classifier")
plt.legend()
plt.grid(True)
plt.show()



**Interpretation:**  
What does the ROC-AUC score tell you about the overall quality of your Random Forest model?
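
One way to read the AUC, sketched below with made-up labels and scores, is as the probability that a randomly chosen hit receives a higher predicted score than a randomly chosen flop, with ties counted as half. The helper `auc_by_ranking` is hypothetical and written for clarity rather than speed; `roc_auc_score` remains the function this assignment uses.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (hit, flop) pairs in which the hit is scored higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up example: two flops (0) and two hits (1) with predicted Hit scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 hit/flop pairs are ranked correctly, so the result is 0.75.
toy_auc = auc_by_ranking(toy_labels, toy_scores)
print("Toy AUC:", toy_auc)
```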


# Step 9. Reflection and Discussion

1. How does combining multiple Decision Trees improve predictive performance?  
2. Which features were most influential in predicting movie success?  
3. How does increasing the number of trees affect bias and variance?  
4. In what ways could this model be improved (e.g., adding categorical features, tuning hyperparameters)?
