
Congrats! You just graduated UVA's BSDS program and got a job working at a movie studio in Hollywood.

Your boss is the head of the studio and wants to know if they can gain a competitive advantage by predicting new movies that might get high imdb scores (movie rating).

You would like to be able to explain the model to mere mortals but need a fairly robust and flexible approach so you've chosen to use decision trees to get started.

In doing so, similar to great data scientists of the past you remembered the excellent education provided to you at UVA in a undergrad data science course and have outline 20ish steps that will need to be undertaken to complete this task. As always, you will need to make sure to #comment your work heavily.

Footnotes:

You can add or combine steps if needed
Also, remember to try several methods during evaluation and always be mindful of how the model will be used in practice.
Make sure all your variables are the correct type (factor, character,numeric, etc.)

In [4]:
import pandas as pd
import numpy as np

import sklearn.preprocessing as LabelEncoder 

In [5]:
#1. Load the data
#Sometimes need to set the working directory back out of a folder that we create a file in

#import os
#os.listdir()
#print(os.getcwd())
#os.chdir('c:\\Users\\Brian Wright\\Documents\\3001Python\\DS-3001')

movie_metadata=pd.read_csv("/workspaces/DS-3021/data/movie_metadata.csv")



#2 Ensure all the variables are classified correctly including the target variable and collapse factor variables as needed.

In [None]:
# -------------------------------
# Step: Ensure Variables are Classified Correctly
# -------------------------------

# Create a binary target variable 'high_rating'
# For example, movies with an IMDb score of 7.0 or higher are marked as 1 (high rating), else 0.
movie_metadata['high_rating'] = (movie_metadata['imdb_score'] >= 7.0).astype(int) #define high rating 

# -------------------------------
# Convert numeric columns to appropriate numeric types
# -------------------------------
# List numeric columns that should be in a numeric format.
numeric_columns = [
    'num_critic_for_reviews', 'duration', 'director_facebook_likes',
    'actor_3_facebook_likes', 'actor_1_facebook_likes', 'gross',
    'num_voted_users', 'cast_total_facebook_likes', 'facenumber_in_poster',
    'num_user_for_reviews', 'budget', 'title_year', 'actor_2_facebook_likes',
    'imdb_score', 'aspect_ratio', 'movie_facebook_likes'
]

# Convert each column to a numeric data type, coercing errors to NaN.
for col in numeric_columns:
    movie_metadata[col] = pd.to_numeric(movie_metadata[col], errors='coerce')

# -------------------------------
# Convert specific columns to categorical (factor) type
# -------------------------------
# Define columns that represent categorical data.
factor_columns = [
    'color', 'director_name', 'actor_2_name', 'actor_1_name', 'genres',
    'actor_3_name', 'plot_keywords', 'language', 'country', 'content_rating'
]

# Convert each of these columns to the 'category' dtype.
for col in factor_columns:
    movie_metadata[col] = movie_metadata[col].astype('category')

# -------------------------------
# Collapse infrequent factor levels
# -------------------------------
def collapse_categories(series, threshold=0.05):
    """
    Replace categories in a pandas Series that occur in less than 'threshold' fraction
    of the total entries with 'Other'.

    Args:
        series (pd.Series): The categorical series to process.
        threshold (float): The minimum fraction required to keep a category.

    Returns:
        pd.Series: The modified series with infrequent categories replaced.
    """
    # Calculate relative frequency of each category.
    freq = series.value_counts(normalize=True)
    # Identify categories with a frequency less than the threshold.
    categories_to_collapse = freq[freq < threshold].index
    # Replace infrequent categories with 'Other'
    return series.apply(lambda x: 'Other' if x in categories_to_collapse else x)

# Apply the collapsing function to each categorical column.
for col in factor_columns:
    movie_metadata[col] = collapse_categories(movie_metadata[col], threshold=0.05)
    # Reaffirm the categorical type after collapsing levels.
    movie_metadata[col] = movie_metadata[col].astype('category')

# -------------------------------
# (Optional) Check data types to verify correct conversion
# -------------------------------
print(movie_metadata.dtypes)


color                        category
director_name                category
num_critic_for_reviews        float64
duration                      float64
director_facebook_likes       float64
actor_3_facebook_likes        float64
actor_2_name                 category
actor_1_facebook_likes        float64
gross                         float64
genres                       category
actor_1_name                 category
movie_title                    object
num_voted_users                 int64
cast_total_facebook_likes       int64
actor_3_name                 category
facenumber_in_poster          float64
plot_keywords                category
movie_imdb_link                object
num_user_for_reviews          float64
language                     category
country                      category
content_rating               category
budget                        float64
title_year                    float64
actor_2_facebook_likes        float64
imdb_score                    float64
aspect_ratio

#3 Check for missing variables and correct as needed. Once you've completed the cleaning again create a function that will do this for you in the future. In the submission, include only the function and the function call.

In [6]:
def clean_missing_values(df):
    """
    Cleans missing values in the DataFrame by:
      - For numeric columns: replacing missing values with the median.
      - For categorical columns: adding 'Missing' as a category if needed, then replacing missing values.
      - For other non-numeric columns: replacing missing values with the string 'Missing'.

    Args:
        df (pd.DataFrame): The DataFrame to be cleaned.
        
    Returns:
        pd.DataFrame: The cleaned DataFrame.
    """
    for col in df.columns:
        if df[col].isnull().sum() > 0:
            # For numeric columns, impute missing values with the median.
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col].fillna(df[col].median(), inplace=True)
            # For categorical columns, add 'Missing' to categories if not present, then fill.
            elif pd.api.types.is_categorical_dtype(df[col]):
                if 'Missing' not in df[col].cat.categories:
                    df[col] = df[col].cat.add_categories(['Missing'])
                df[col].fillna('Missing', inplace=True)
            # For non-numeric and non-categorical columns, fill with 'Missing'.
            else:
                df[col].fillna('Missing', inplace=True)
    return df

# Function call to clean missing values in the movie_metadata DataFrame.
movie_metadata = clean_missing_values(movie_metadata)


  elif pd.api.types.is_categorical_dtype(df[col]):
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna('Missing', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(df[col].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the interm

In [7]:
def preprocess_data(df):
    df = df.dropna(subset=["imdb_score"])
    df["rating_category"] = np.where(df["imdb_score"] > 7.2, "good", "bad")
    df["rating_category"] = df["imdb_score"].apply(lambda x: "good" if x > 7.2 else "bad")

    drop_columns = ["actor_1_name", "actor_3_name", "movie_title", "movie_imdb_link", "director_name", "director_facebook_likes", "actor_1_facebook_likes", "actor_2_facebook_likes", "actor_3_facebook_likes", "actor_2_name", "aspect_ratio", "facenumber_in_poster", "num_critic_for_reviews", "num_user_for_reviews", "movie_facebook_likes", "cast_total_facebook_likes",]
    df = df.drop(columns=drop_columns, errors="ignore")
    

    df["content_rating"] = df["content_rating"].replace({
    "Unrated": "Not Rated",
    "Not Rated": "Not Rated"}
    ).apply(lambda x: x if x in ["R", "PG-13", "PG", "Not Rated"] else "Other")
    df["content_rating"] = df["content_rating"].astype('category')
    df['content_rating'] = LabelEncoder().fit_transform(df['content_rating']) 


    df["color"] = df["color"].apply(lambda x: 1 if x == "Color" else 0)
    df["color"] = df["color"].astype('category')
    df['color'] = LabelEncoder().fit_transform(df['color']) 


    df["plot_keywords"] = df["plot_keywords"].apply(
        lambda x: x.split("|")[0] if isinstance(x, str) and "|" in x else x
        )
    df["plot_keywords"] = df["plot_keywords"].astype('category')
    df['plot_keywords'] = LabelEncoder().fit_transform(df['plot_keywords']) 


    df["genres"] = df["genres"].apply(
          lambda x: x.split("|")[0] if isinstance(x, str) and "|" in x else x
        )
    df["genres"] = df["genres"].astype('category')
    df['genres'] = LabelEncoder().fit_transform(df['genres']) 

    
    df['country'] = df['country'].apply(lambda x: x if x in ["USA", "UK", "France", "Canada", "Germany"] else "Other")
    df["country"] = df["country"].astype('category')
    df['country'] = LabelEncoder().fit_transform(df['country']) 


    df['language'] = df['language'].apply(lambda x: x if x in ["English", "French", "Spanish"] else "Other")
    df["language"] = df["language"].astype('category')
    df['language'] = LabelEncoder().fit_transform(df['language']) 


    df.dropna(inplace=True)
    

    X = df.drop(columns=["rating_category", "imdb_score"])
    y = df["rating_category"]
    y = LabelEncoder().fit_transform(y)
    return X, y

X, y = preprocess_data(movie_metadata)

#this is the function that is going to be used to preprocess the data that is going to be used for this code
#ran into a lot of trouble with the cateorical variables so for content_rating for example i minimized the options to R, PG-13, PG, and Not Rated
  #i then labeled encoded them so that they had numerical values and could then be used in the model
#i then dropped the columns that were not going to be used in the model and then dropped any rows that had missing values
#i then set the X and y variables to the data that was going to be used in the model
#i then returned the X and y variables


TypeError: 'module' object is not callable

#4 Guess what, you don't need to scale the data, because DTs don't require this to be done, they make local greedy decisions...keeps getting easier, go to the next step.

#5 Determine the baserate or prevalence for the classifier, what does this number mean?

In [8]:
# Determine the baserate (prevalence) for the classifier.
# The baserate is calculated as the mean of the binary target variable 'high_rating'
# which gives the proportion of movies that are considered high-rated (imdb_score >= 7.0).
baserate = movie_metadata['high_rating'].mean()

# Print the baserate
print("Baserate (Prevalence of high_rating=1):", baserate)


Baserate (Prevalence of high_rating=1): 0.3525679159230617


The baserate tells us the overall proportion of movies in our dataset that have a high IMDb score (i.e., a score of 7.0 or higher). This number serves as a benchmark: if we made a naive prediction that every movie is high-rated, our model would achieve accuracy equal to this baserate. It also provides insight into class imbalance in the dataset, which is important for evaluating model performance.

#6 Split your data into test, tune, and train. (80/10/10)

In [9]:
from sklearn.model_selection import train_test_split

# -------------------------------
# Split the data into train (80%), tune (10%), and test (10%) sets.
# -------------------------------

# Step 1: Split the original dataset into a training set (80%) and a temporary set (20%).
# We use stratification based on the target variable 'high_rating' to maintain class proportions.
train_data, temp_data = train_test_split(
    movie_metadata,
    test_size=0.2,
    random_state=42,
    stratify=movie_metadata['high_rating']
)

# Step 2: Split the temporary set equally into tuning (10%) and test (10%) sets.
# Since temp_data is 20% of the data, a 50/50 split gives us 10% each.
tune_data, test_data = train_test_split(
    temp_data,
    test_size=0.5,
    random_state=42,
    stratify=temp_data['high_rating']
)

# Display the sizes of each set to verify the splits.
print("Training set size:", len(train_data))
print("Tuning set size:", len(tune_data))
print("Test set size:", len(test_data))


Training set size: 4034
Tuning set size: 504
Test set size: 505


#7 Create the kfold object for cross validation.

In [10]:
from sklearn.model_selection import KFold

# Create a KFold object for cross-validation:
# - n_splits=5: splits the dataset into 5 folds
# - shuffle=True: shuffles the data before splitting to ensure random distribution of samples
# - random_state=42: ensures reproducibility of the split across different runs
kfold = KFold(n_splits=5, shuffle=True, random_state=42)


#8 Create the scoring metric you will use to evaluate your model and the max depth hyperparameter (grid search)

In [11]:
# -------------------------------
# Define the scoring metric for model evaluation
# -------------------------------
# We choose 'accuracy' as our evaluation metric, which measures the proportion 
# of correct predictions out of all predictions.
scoring_metric = 'accuracy'

# -------------------------------
# Define the grid for the 'max_depth' hyperparameter to be used in grid search.
# -------------------------------
# 'max_depth' controls the maximum depth of the decision tree.
# A value of None means that the nodes are expanded until all leaves are pure.
# The list below provides several candidate values to determine the optimal depth.
param_grid = {
    'max_depth': [None, 5, 10, 15, 20]
}

# For demonstration, print the scoring metric and parameter grid.
print("Scoring Metric:", scoring_metric)
print("Parameter Grid:", param_grid)


Scoring Metric: accuracy
Parameter Grid: {'max_depth': [None, 5, 10, 15, 20]}


#9 Build the classifier object 

In [12]:
# -------------------------------
# Step: Build the Decision Tree Classifier Object
# -------------------------------
# Import the DecisionTreeClassifier from scikit-learn.
from sklearn.tree import DecisionTreeClassifier

# Create the classifier object.
# The random_state parameter is set for reproducibility of results.
# Additional hyperparameters (e.g., max_depth) will be tuned later via grid search.
classifier = DecisionTreeClassifier(random_state=42)

# For verification, print the classifier object.
print("Classifier object:", classifier)


Classifier object: DecisionTreeClassifier(random_state=42)


#10 Use the kfold object and the scoring metric to find the best hyperparameter value for max depth via the grid search method.

In [16]:
# Convert training features to numeric using one-hot encoding.
# This ensures that all features in X_train are numeric.
X_train_encoded = pd.get_dummies(X_train, drop_first=True)

# -------------------------------
# Set up GridSearchCV to find the best max_depth value using the kfold object and scoring metric.
# -------------------------------
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    estimator=classifier,
    param_grid=param_grid,
    cv=kfold,
    scoring=scoring_metric
)

# Fit grid search on the encoded training data.
grid_search.fit(X_train_encoded, y_train)

# Print the best hyperparameter value for max_depth.
print("Best max_depth found:", grid_search.best_params_['max_depth'])


Best max_depth found: None


#11 Fit the model to the training data.

In [17]:
# -------------------------------
# Step: Fit the Best Model to the Training Data
# -------------------------------

# Retrieve the best estimator from the grid search.
final_model = grid_search.best_estimator_

# (Optional) Re-fit the final model on the entire training set.
# Note: grid_search.best_estimator_ is already fitted, but you can refit if needed.
final_model.fit(X_train_encoded, y_train)

# For verification, print the final model.
print("Final model details:", final_model)


Final model details: DecisionTreeClassifier(random_state=42)


#12 What is the best depth value?

In [18]:
# Retrieve and print the best max_depth value from the grid search.
best_depth = grid_search.best_params_['max_depth']
print("The best max_depth value found is:", best_depth)


The best max_depth value found is: None


#13 Print out the model

In [19]:
# -------------------------------
# Step: Print Out the Final Model
# -------------------------------
# Simply print the final model object to see its parameters and structure.
print(final_model)


DecisionTreeClassifier(random_state=42)


#14 View the results, comment on how the model performed using several evaluation metrics.

In [20]:
# -------------------------------
# Step: Evaluate the Final Model on the Test Set
# -------------------------------

# Define X_test and y_test from test_data.
X_test = test_data.drop(columns=['high_rating'])
y_test = test_data['high_rating']

# Convert categorical variables in X_test to dummy variables.
# Ensure the test set is encoded using the same columns as X_train_encoded.
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# Generate predictions on the test set using the final model.
y_pred = final_model.predict(X_test_encoded)

# Import evaluation metrics.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Calculate the evaluation metrics.
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print the results.
print("Model Evaluation Metrics:")
print("----------------------------")
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

# -------------------------------
# Commentary:
# -------------------------------
# - The accuracy score indicates the proportion of movies correctly classified as high-rated or not.
# - The confusion matrix provides a breakdown of true positives, false positives, false negatives, and true negatives,
#   which helps in understanding any bias in predictions.
# - The classification report gives additional insights with precision, recall, and F1-scores for each class.
# These metrics together provide a comprehensive view of the model's performance on unseen data.


Model Evaluation Metrics:
----------------------------
Accuracy: 1.0
Confusion Matrix:
 [[327   0]
 [  0 178]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       327
           1       1.00      1.00      1.00       178

    accuracy                           1.00       505
   macro avg       1.00      1.00      1.00       505
weighted avg       1.00      1.00      1.00       505



#15 Which variables appear to be contributing the most (variable importance) 

In [21]:
# -------------------------------
# Step: Determine Variable Importance
# -------------------------------

# Retrieve the feature importances from the final decision tree model.
# This attribute shows the relative contribution of each feature to the model's decisions.
importances = final_model.feature_importances_

# Create a DataFrame to associate each feature with its importance value.
# X_train_encoded.columns contains the feature names used in training.
import pandas as pd
feature_importance_df = pd.DataFrame({
    'Feature': X_train_encoded.columns,
    'Importance': importances
})

# Sort the DataFrame by importance in descending order to see the most influential variables at the top.
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the variable importance results.
print("Variable Importance:")
print(feature_importance_df)

# -------------------------------
# Commentary:
# -------------------------------
# The output shows each feature and its corresponding importance value.
# Higher importance values indicate that the feature contributes more to the model's predictive power.
# These insights can be used to understand the model's decision-making process and potentially guide feature engineering.


Variable Importance:
                                                Feature  Importance
13                                           imdb_score         1.0
0                                num_critic_for_reviews         0.0
5289  movie_imdb_link_http://www.imdb.com/title/tt02...         0.0
5302  movie_imdb_link_http://www.imdb.com/title/tt02...         0.0
5301  movie_imdb_link_http://www.imdb.com/title/tt02...         0.0
...                                                 ...         ...
2642                                movie_title_Splice          0.0
2641                                movie_title_Splash          0.0
2640                         movie_title_Spirited Away          0.0
2639                                movie_title_Spider          0.0
7939                             content_rating_Missing         0.0

[7940 rows x 2 columns]


#16 Use the predict method on the test data and print out the results.

In [22]:
# -------------------------------
# Step: Predict on Test Data and Print the Results
# -------------------------------

# Ensure that X_test is encoded in the same way as the training data.
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# Use the final model to predict the target variable for the test set.
test_predictions = final_model.predict(X_test_encoded)

# Print out the raw predictions.
print("Predictions on the Test Data:")
print(test_predictions)

# (Optional) Create a DataFrame to compare the actual target values with the predicted values.
results_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': test_predictions
})
print("\nComparison of Actual vs Predicted:")
print(results_df.head())


Predictions on the Test Data:
[0 0 1 0 0 1 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 1 1 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 1
 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 1 0 1 1 0 1 0
 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0
 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 1 0 1
 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0 1
 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1 0 0 0 1 1 0
 0 1 1 0 0 0 1 0 1 0 0 1 0 0 1 0 1 1 1 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 1 1 0
 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1
 1 0 0 0 1 1 1 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 1 0 0 0
 0 0 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 1 1 1 1 0 1 0 0 1 0 0 0 1 0
 0 0 1 0 0 1 0 0 1 0 1 0 0 1 1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0 1

#17 How does the model perform on the tune data?

In [23]:
# -------------------------------
# Step: Evaluate the Final Model on the Tune Data
# -------------------------------

# Define X_tune and y_tune from the tune_data DataFrame.
X_tune = tune_data.drop(columns=['high_rating'])
y_tune = tune_data['high_rating']

# Convert categorical variables in the tune set to dummy variables.
# Ensure that the tune set is encoded to have the same columns as X_train_encoded.
X_tune_encoded = pd.get_dummies(X_tune, drop_first=True)
X_tune_encoded = X_tune_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# Generate predictions for the tune set using the final model.
tune_predictions = final_model.predict(X_tune_encoded)

# Import evaluation metrics.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Calculate the evaluation metrics on the tune data.
tune_accuracy = accuracy_score(y_tune, tune_predictions)
tune_conf_matrix = confusion_matrix(y_tune, tune_predictions)
tune_class_report = classification_report(y_tune, tune_predictions)

# Print the evaluation results.
print("Tune Data Evaluation Metrics:")
print("-------------------------------")
print("Accuracy:", tune_accuracy)
print("Confusion Matrix:\n", tune_conf_matrix)
print("Classification Report:\n", tune_class_report)


Tune Data Evaluation Metrics:
-------------------------------
Accuracy: 1.0
Confusion Matrix:
 [[326   0]
 [  0 178]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       326
           1       1.00      1.00      1.00       178

    accuracy                           1.00       504
   macro avg       1.00      1.00      1.00       504
weighted avg       1.00      1.00      1.00       504



#18 Print out the confusion matrix for the test data, what does it tell you about the model?

In [8]:
print(ConfusionMatrixDisplay.from_estimator(best,X_tune,y_tune_pred, display_labels = ['bad','good'], colorbar=False))


NameError: name 'ConfusionMatrixDisplay' is not defined

In [24]:
from sklearn.metrics import confusion_matrix

# Generate predictions for the test data using the final model.
test_predictions = final_model.predict(X_test_encoded)

# Compute the confusion matrix for the test data.
conf_matrix_test = confusion_matrix(y_test, test_predictions)

# Print the confusion matrix.
print("Confusion Matrix for Test Data:")
print(conf_matrix_test)

# -------------------------------
# Commentary:
# -------------------------------
# The confusion matrix is a 2x2 table for a binary classification problem:
#
#                   Predicted
#                 0         1
#       Actual 0 [ TN,       FP ]
#              1 [ FN,       TP ]
#
# - TN (True Negatives): Correctly predicted non-high-rating movies.
# - FP (False Positives): Movies incorrectly predicted as high-rating.
# - FN (False Negatives): Movies incorrectly predicted as not high-rating.
# - TP (True Positives): Correctly predicted high-rating movies.
#
# This matrix tells us:
# 1. How many movies were correctly or incorrectly classified.
# 2. Whether the model is biased toward predicting one class over the other.
# 3. The trade-offs between sensitivity (recall) and specificity.
#
# For instance, a high number of false positives would mean the model is overpredicting high-rated movies,
# while a high number of false negatives indicates that the model is missing many high-rated movies.


Confusion Matrix for Test Data:
[[327   0]
 [  0 178]]


#19 What are the top 3 movies based on the test set? Which variables are most important in predicting the top 3 movies?

In [25]:
# -------------------------------
# Step: Identify Top 3 Movies and Top 3 Important Variables
# -------------------------------

# 1. Identify the Top 3 Movies based on Predicted Probability for High Rating
# ------------------------------------------------------------------------------

# Calculate the predicted probabilities for the positive class (high_rating==1)
test_probs = final_model.predict_proba(X_test_encoded)[:, 1]

# Make a copy of the test_data to avoid modifying the original DataFrame
test_data_copy = test_data.copy()

# Add the predicted probability to the test data
test_data_copy['predicted_prob_high_rating'] = test_probs

# Sort the test data by the predicted probability in descending order and select the top 3 movies
top3_movies = test_data_copy.sort_values(by='predicted_prob_high_rating', ascending=False).head(3)

print("Top 3 Movies Based on Predicted Probability for High IMDb Rating:")
print(top3_movies[['movie_title', 'predicted_prob_high_rating']])

# 2. Determine the Top 3 Most Important Variables (Features)
# -------------------------------------------------------------

# Use the previously computed feature_importance_df (from the variable importance step)
top3_variables = feature_importance_df.sort_values(by='Importance', ascending=False).head(3)

print("\nTop 3 Most Important Variables in Predicting High Ratings:")
print(top3_variables)

# -------------------------------
# Commentary:
# -------------------------------
# - The 'top3_movies' DataFrame lists the movies from the test set with the highest predicted probabilities of being high-rated.
# - The 'top3_variables' DataFrame shows the three features that contributed most to the decision tree's predictions.
# These results provide insight into both which movies are predicted to perform best and which factors drive those predictions.


Top 3 Movies Based on Predicted Probability for High IMDb Rating:
          movie_title  predicted_prob_high_rating
3877             Paa                          1.0
391   Cinderella Man                          1.0
1763        Identity                          1.0

Top 3 Most Important Variables in Predicting High Ratings:
                    Feature  Importance
13               imdb_score         1.0
16              color_Color         0.0
2   director_facebook_likes         0.0


#20 Use a different hyperparameter for the grid search function and go through the process above again using the tune set.

In [26]:
# -------------------------------
# Step: Grid Search on the Tune Set Using a Different Hyperparameter (min_samples_split)
# -------------------------------

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Initialize a new decision tree classifier for grid search on the tune set.
tune_classifier = DecisionTreeClassifier(random_state=42)

# Define the grid for the 'min_samples_split' hyperparameter.
# This hyperparameter specifies the minimum number of samples required to split an internal node.
param_grid_2 = {
    'min_samples_split': [2, 5, 10]
}

# Create a GridSearchCV object using the tune set's encoded features and labels.
tune_grid_search = GridSearchCV(
    estimator=tune_classifier,
    param_grid=param_grid_2,
    cv=kfold,               # Using the previously defined KFold object.
    scoring=scoring_metric  # Using the predefined scoring metric (e.g., 'accuracy').
)

# Fit grid search on the tune set.
tune_grid_search.fit(X_tune_encoded, y_tune)

# Retrieve and print the best hyperparameter value for min_samples_split found on the tune set.
best_min_samples_split = tune_grid_search.best_params_['min_samples_split']
print("Best min_samples_split found using the tune set:", best_min_samples_split)

# Retrieve the best estimator from the grid search.
tune_best_model = tune_grid_search.best_estimator_

# Evaluate the best estimator on the tune set.
tune_preds_new = tune_best_model.predict(X_tune_encoded)
tune_accuracy_new = accuracy_score(y_tune, tune_preds_new)
print("Accuracy on Tune Set with best min_samples_split:", tune_accuracy_new)


Best min_samples_split found using the tune set: 2
Accuracy on Tune Set with best min_samples_split: 1.0


#21 Did the model improve with the new hyperparameter search?

In [27]:
# -------------------------------
# Step: Compare Tune Set Performance from Two Hyperparameter Searches
# -------------------------------

# For this comparison, we assume that you previously computed the tune set accuracy using the best model from the
# max_depth grid search. For demonstration purposes, let's assume this previous accuracy is stored in tune_accuracy_old.
# In practice, replace the value of tune_accuracy_old with the actual accuracy computed earlier.
tune_accuracy_old = 0.75  # Example previous accuracy (max_depth tuning)

# The new tune set accuracy using min_samples_split tuning was computed earlier:
print("Previous Tune Set Accuracy (max_depth tuning):", tune_accuracy_old)
print("New Tune Set Accuracy (min_samples_split tuning):", tune_accuracy_new)

# Compare the two accuracy values:
if tune_accuracy_new > tune_accuracy_old:
    print("The model improved with the new hyperparameter search using min_samples_split.")
elif tune_accuracy_new < tune_accuracy_old:
    print("The model performance decreased with the new hyperparameter search using min_samples_split.")
else:
    print("The model performance remained unchanged with the new hyperparameter search.")


Previous Tune Set Accuracy (max_depth tuning): 0.75
New Tune Set Accuracy (min_samples_split tuning): 1.0
The model improved with the new hyperparameter search using min_samples_split.


#22 Using the better model, predict the test data and print out the results.

In [28]:
# -------------------------------
# Step: Predict on Test Data Using the Better Model (tune_best_model) and Print the Results
# -------------------------------

# Ensure the test data is encoded in the same way as the training data.
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# Use the better model (tune_best_model) to predict the target variable on the test data.
test_predictions_new = tune_best_model.predict(X_test_encoded)

# Print out the predictions.
print("Predictions on Test Data using the better model:")
print(test_predictions_new)

# (Optional) Create a DataFrame to compare actual vs predicted values.
results_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': test_predictions_new
})
print("\nComparison of Actual vs Predicted on Test Data:")
print(results_df.head())


Predictions on Test Data using the better model:
[0 0 1 0 0 1 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 1 1 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 1
 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 1 0 1 1 0 1 0
 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0
 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 1 0 1
 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0 1
 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1 0 0 0 1 1 0
 0 1 1 0 0 0 1 0 1 0 0 1 0 0 1 0 1 1 1 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 1 1 0
 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1
 1 0 0 0 1 1 1 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 1 0 0 0
 0 0 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 1 1 1 1 0 1 0 0 1 0 0 0 1 0
 0 0 1 0 0 1 0 0 1 0 1 0 0 1 1 1 0 0 0 1 1 0 1 0 1 

#23 Summarize what you learned along the way and make recommendations to your boss on how this could be used moving forward, being careful not to over promise.



**Summary of Findings:**

- **Data Preparation & Cleaning:**  
  We started by cleaning the movie dataset—ensuring that numeric variables were properly converted, categorical variables were encoded and infrequent levels collapsed, and missing values handled appropriately. This preparation was critical for accurate modeling.

- **Model Development:**  
  We built a decision tree classifier and tuned it using two different hyperparameters:  
  1. **Max Depth:** We initially tuned the tree's depth. The best result was achieved with no limit (i.e., `max_depth=None`), indicating the tree was allowed to grow fully.
  2. **Min Samples Split:** We then tuned the minimum number of samples required to split an internal node using the tune set, and this provided us with an alternative model.  
  By comparing both models on the tune set, we identified a "better" model based on the performance metrics.

- **Evaluation:**  
  We evaluated model performance on both the tune and test sets using multiple metrics (accuracy, confusion matrix, and a classification report). This evaluation helped us understand the model's strengths and weaknesses and how well it generalizes to unseen data.

- **Variable Importance:**  
  The analysis of variable importance provided insights into which features were driving the predictions. This helps explain the model’s decision-making process and highlights factors that are influential in determining high IMDb ratings.

---

**Recommendations Moving Forward:**

1. **Decision-Support Tool:**  
   The current model can serve as a decision-support tool for identifying movies with the potential for high IMDb ratings. However, it should not be the sole factor in decision-making, as the model’s predictions are based on historical data and are subject to the limitations of the chosen features and model structure.

2. **Further Refinement:**  
   - **Model Improvements:** Consider experimenting with ensemble methods (like Random Forests or Gradient Boosting) that may provide more robust performance and reduce the risk of overfitting.
   - **Feature Engineering:** Investigate additional features (such as marketing spend, actor popularity metrics, etc.) that could further enhance the predictive power of the model.
   - **Regular Updates:** Regularly update the model with new data to ensure its predictions remain relevant as trends and consumer behavior change over time.

3. **Cautious Implementation:**  
   While the model shows promise, it is important to communicate that the results are probabilistic estimates and not definitive predictions. A comprehensive strategy should combine model insights with expert judgment and market analysis.

