Congrats! You just graduated UVA's BSDS program and got a job working at a movie studio in Hollywood. 

Your boss is the head of the studio and wants to know if they can gain a competitive advantage by predicting which new movies might get high IMDb scores (movie ratings). 

You would like to be able to explain the model to mere mortals but need a fairly robust and flexible approach, so you've chosen to use decision trees to get started. 

In doing so, like the great data scientists of the past, you remembered the excellent education provided to you at UVA in an undergrad data science course and have outlined 20ish steps that will need to be undertaken to complete this task. As always, you will need to make sure to #comment your work heavily. 

Footnotes: 
- You can add or combine steps if needed
- Also, remember to try several methods during evaluation and always be mindful of how the model will be used in practice.
- Make sure all your variables are the correct type (factor, character, numeric, etc.)

In [2]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)

In [3]:
df_movies=pd.read_csv("data/movie_metadata.csv")

In [4]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      5024 non-null   object 
 1   director_name              4939 non-null   object 
 2   num_critic_for_reviews     4993 non-null   float64
 3   duration                   5028 non-null   float64
 4   director_facebook_likes    4939 non-null   float64
 5   actor_3_facebook_likes     5020 non-null   float64
 6   actor_2_name               5030 non-null   object 
 7   actor_1_facebook_likes     5036 non-null   float64
 8   gross                      4159 non-null   float64
 9   genres                     5043 non-null   object 
 10  actor_1_name               5036 non-null   object 
 11  movie_title                5043 non-null   object 
 12  num_voted_users            5043 non-null   int64  
 13  cast_total_facebook_likes  5043 non-null   int64

#2 Ensure all the variables are classified correctly including the target variable and collapse factor variables as needed.

In [5]:
# The target, imdb_score, has no missing values
df_movies['imdb_score'].info()

df_movies.shape

<class 'pandas.core.series.Series'>
RangeIndex: 5043 entries, 0 to 5042
Series name: imdb_score
Non-Null Count  Dtype  
--------------  -----  
5043 non-null   float64
dtypes: float64(1)
memory usage: 39.5 KB


(5043, 28)
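
Step 2 also asks for factor variables to be collapsed as needed, which the notebook does not do before one-hot encoding. A minimal sketch of one way to do it, on a toy column rather than the real data: the helper `collapse_rare`, the `min_count` threshold, and the `"Other"` label are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def collapse_rare(series: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Replace categories that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; everything else becomes other_label
    return series.where(~series.isin(rare), other_label)


# Toy stand-in for a high-cardinality column such as `country`
country = pd.Series(["USA", "USA", "USA", "UK", "UK", "Aruba", "Kenya"])
collapsed = collapse_rare(country, min_count=2)
print(collapsed.value_counts())
```

Applied to columns such as `language` or `country`, this would likely shrink the number of dummy columns considerably, at the cost of lumping rare levels together.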

#3 Check for missing variables and correct as needed.

In [6]:
# Most columns have missing values; gross (884) and budget (492) are the worst
df_movies.isna().sum()

color                         19
director_name                104
num_critic_for_reviews        50
duration                      15
director_facebook_likes      104
actor_3_facebook_likes        23
actor_2_name                  13
actor_1_facebook_likes         7
gross                        884
genres                         0
actor_1_name                   7
movie_title                    0
num_voted_users                0
cast_total_facebook_likes      0
actor_3_name                  23
facenumber_in_poster          13
plot_keywords                153
movie_imdb_link                0
num_user_for_reviews          21
language                      14
country                        5
content_rating               303
budget                       492
title_year                   108
actor_2_facebook_likes        13
imdb_score                     0
aspect_ratio                 329
movie_facebook_likes           0
dtype: int64

In [137]:
# Drop columns that can't be used as-is: genres and plot_keywords hold several values per cell and would need splitting, and movie_imdb_link and color are not useful here
cols_to_drop = ['movie_imdb_link', 'plot_keywords', 'genres', 'color']
df_movies_dropped = df_movies.drop(columns=cols_to_drop)

In [138]:
# Simple imputation: fill numeric columns with the column mean and categorical columns with the most frequent value
from sklearn.impute import SimpleImputer

# include='number' picks up every numeric dtype regardless of platform int width
numerical_columns = df_movies_dropped.select_dtypes(include='number').columns
categorical_columns = df_movies_dropped.select_dtypes(include=['object']).columns

num_imputer = SimpleImputer(strategy='mean')
cat_imputer = SimpleImputer(strategy='most_frequent')

df_movies_dropped[numerical_columns] = num_imputer.fit_transform(df_movies_dropped[numerical_columns])
df_movies_dropped[categorical_columns] = cat_imputer.fit_transform(df_movies_dropped[categorical_columns])
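
One caveat with the cell above: the imputers are fitted on the full dataset before the train/tune/test split, so the fill values carry a little information from the held-out rows. A minimal sketch of the alternative, on a toy frame with made-up values: fit the imputer on the training rows only, then reuse it on the rest.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame with missing values; the numbers are illustrative only
toy = pd.DataFrame({
    "budget": [10.0, 20.0, np.nan, 40.0, 50.0, np.nan, 70.0, 80.0],
    "duration": [90.0, np.nan, 110.0, 120.0, 95.0, 100.0, np.nan, 130.0],
})
train, test = train_test_split(toy, train_size=0.75, random_state=0)

# Learn the column means from the training rows only
imputer = SimpleImputer(strategy="mean")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=toy.columns, index=train.index)

# Apply the same training means to the held-out rows
test_filled = pd.DataFrame(imputer.transform(test), columns=toy.columns, index=test.index)
print(test_filled)
```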

In [139]:
# No missing values remain
df_movies_dropped.isna().sum()

director_name                0
num_critic_for_reviews       0
duration                     0
director_facebook_likes      0
actor_3_facebook_likes       0
actor_2_name                 0
actor_1_facebook_likes       0
gross                        0
actor_1_name                 0
num_voted_users              0
cast_total_facebook_likes    0
actor_3_name                 0
facenumber_in_poster         0
num_user_for_reviews         0
language                     0
country                      0
content_rating               0
budget                       0
title_year                   0
actor_2_facebook_likes       0
imdb_score                   0
aspect_ratio                 0
movie_facebook_likes         0
dtype: int64

In [176]:
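# Name and title columns are nearly unique per row, so one-hot encoding them would add thousands of columns
cols_to_drop = ['director_name', 'actor_2_name', 'actor_1_name', 'actor_3_name', 'movie_title']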

df_movies_transformed = df_movies_dropped.drop(columns=cols_to_drop)

In [177]:
categorical_columns = df_movies_transformed.select_dtypes(include=['object']).columns
df_movies_one_hot = pd.get_dummies(df_movies_transformed, columns=categorical_columns, drop_first=True)

#4 Guess what, you don't need to scale the data, because DTs don't require this to be done, they make local greedy decisions...keeps getting easier, go to the next step.

In [178]:
# Sweet
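
To see why scaling is unnecessary, a small sketch on synthetic data (nothing here comes from the movie dataset): a tree splits on thresholds, so rescaling a feature moves the thresholds but should leave the partitions, and therefore the predictions, unchanged.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic integer-valued features and a noisy target
X = rng.integers(0, 1000, size=(200, 3)).astype(float)
y = 0.01 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Same features after a positive linear rescaling
X_scaled = X * 1000.0 + 5.0

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y)

# Each tree is evaluated on its own version of the training features
same_predictions = np.allclose(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same_predictions)
```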

#5 Determine the baserate or prevalence for the classifier, what does this number mean?

In [179]:
# This is a regression task so base rate and prevalence aren't applicable. 
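
The closest regression analogue to a base rate is a naive baseline that always predicts the training mean. A minimal sketch using scikit-learn's `DummyRegressor`; the synthetic scores below (centre 6.4, spread 1.1) are arbitrary stand-ins, not statistics from the movie data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for y_train / y_test; the features are irrelevant to the baseline
y_train_demo = rng.normal(loc=6.4, scale=1.1, size=400)
y_test_demo = rng.normal(loc=6.4, scale=1.1, size=100)
X_train_demo = np.zeros((400, 1))
X_test_demo = np.zeros((100, 1))

# Always predict the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
baseline_pred = baseline.predict(X_test_demo)

baseline_mse = mean_squared_error(y_test_demo, baseline_pred)
baseline_r2 = r2_score(y_test_demo, baseline_pred)
print(f"Baseline MSE: {baseline_mse:.3f}, baseline R^2: {baseline_r2:.3f}")
```

Any fitted model should beat this baseline's MSE; a constant prediction cannot score above zero on R^2.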

#6 Split your data into test, tune, and train. (80/10/10)

In [181]:
from sklearn.model_selection import train_test_split

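# Separate the features (X) from the target (y)
X = df_movies_one_hot.drop(columns='imdb_score')
y = df_movies_one_hot['imdb_score']

# 80% train, then split the remaining 20% evenly into tune and test.
# random_state makes the split reproducible; results will vary with the seed.
X_train, X_remaining, y_train, y_remaining = train_test_split(X, y, train_size = 0.8, random_state = 42)

X_tune, X_test, y_tune, y_test = train_test_split(X_remaining, y_remaining, train_size = 0.5, random_state = 42)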


In [182]:
print(f"Train size: {X_train.shape}")
print(f"Tune size: {X_tune.shape}")
print(f"Test size: {X_test.shape}")

Train size: (4034, 141)
Tune size: (504, 141)
Test size: (505, 141)
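
The two-stage split can be wrapped in a small helper so the 80/10/10 proportions and the seed live in one place. This is a sketch: the function name `train_tune_test_split` and the `seed` parameter are hypothetical, and the demo arrays only mimic the row count of the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_tune_test_split(X, y, seed=42):
    """Split into 80% train, 10% tune, 10% test with a fixed seed."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.8, random_state=seed
    )
    X_tune, X_test, y_tune, y_test = train_test_split(
        X_rest, y_rest, train_size=0.5, random_state=seed
    )
    return X_train, X_tune, X_test, y_train, y_tune, y_test


# Demo arrays with the same number of rows as the movie data
X_demo = np.arange(5043).reshape(-1, 1)
y_demo = np.arange(5043)

splits = train_tune_test_split(X_demo, y_demo)
sizes = [len(part) for part in splits[:3]]
print(sizes)
```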


#7 Create the kfold object for cross validation.

In [183]:
from sklearn.model_selection import KFold

# Define the number of folds (K)
n_splits = 5  # You can choose the number of splits you want

# Create a K-Fold cross-validation object
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

#8 Create the scoring metric you will use to evaluate your model and the max depth hyperparameter 

In [184]:
scoring = "neg_mean_squared_error"

max_depth = 3

#9 Build the classifier object 

In [186]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(max_depth=max_depth)

scores = cross_val_score(model, X_train, y_train, cv=kf, scoring=scoring)

mean = -scores.mean() 
std = scores.std()

In [187]:
print(f"Mean of Mean Squared Errors: {mean}")
print(f"STD of Mean Squared Errors: {std}")

Mean of Mean Squared Errors: 0.9414555506133979
STD of Mean Squared Errors: 0.033666606823298785
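
MSE is in squared rating points, which is hard to explain to a non-technical audience. Taking the square root gives RMSE, in the same units as imdb_score. A quick worked conversion using the mean MSE printed above:

```python
import math

# Mean cross-validated MSE copied from the output above
mean_cv_mse = 0.9414555506133979

# RMSE is the square root of MSE, expressed in imdb_score points
rmse = math.sqrt(mean_cv_mse)
print(f"RMSE: {rmse:.3f} rating points")
```

scikit-learn also accepts `scoring="neg_root_mean_squared_error"` if RMSE is wanted directly from cross-validation.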


#10 Use the kfold object and the scoring metric to find the best hyperparameter value for max depth via the grid search method.

In [188]:
from sklearn.model_selection import GridSearchCV

# Create your Decision Tree Regressor
regressor = DecisionTreeRegressor()

# Define the parameter grid to search over
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10]  # Define the range of max_depth values to search
}

# Create a GridSearchCV object with K-Fold cross-validation
grid_search = GridSearchCV(regressor, param_grid, cv=kf, scoring='neg_mean_squared_error')

# Fit the GridSearchCV object to your data
grid_search.fit(X_train, y_train)

# Get the best hyperparameter values and corresponding MSE score
best_max_depth = grid_search.best_params_
best_mse = -grid_search.best_score_  # Reverse the negative sign to get positive MSE

#11 Fit the model to the training data.

In [189]:
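# GridSearchCV was already fitted in step 10. With the default refit=True it has
# retrained the best model on the full training set, so no second fit is needed.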

#12 What is the best depth value?

In [190]:
print(f"Best MSE: {best_mse}")
print(f"Best Depth: {best_max_depth}")

Best MSE: 0.8665000296130924
Best Depth: {'max_depth': 6}
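
The best depth alone hides how close the other candidates were. A sketch, on synthetic data, of pulling the per-depth scores out of `cv_results_` so the whole curve can be inspected; the data, the depth grid, and the `summary` column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem standing in for the movie features
X_demo = rng.uniform(size=(300, 4))
y_demo = np.sin(6 * X_demo[:, 0]) + rng.normal(scale=0.2, size=300)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 3, 4, 5, 6]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# One row per candidate depth, with the sign flipped back to a positive MSE
summary = pd.DataFrame({
    "max_depth": [params["max_depth"] for params in search.cv_results_["params"]],
    "mean_mse": -search.cv_results_["mean_test_score"],
    "std_mse": search.cv_results_["std_test_score"],
})
print(summary)

best_depth = search.best_params_["max_depth"]
```

If neighbouring depths score within one standard deviation of the best, the shallower tree is usually the easier one to explain.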


#13 Print out the model

In [191]:
grid_search.best_estimator_

#14 View the results, comment on how the model performed using several evaluation metrics.

In [192]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = grid_search.best_estimator_.predict(X_tune)

mae = mean_absolute_error(y_tune, y_pred)
mse = mean_squared_error(y_tune, y_pred)
r_squared = r2_score(y_tune, y_pred)

# Print and comment on the evaluation metrics
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R^2): {r_squared}")



"""
The error metrics are reasonable: an MAE of about 0.68 means predictions are off by roughly two thirds of an imdb_score point on average, so the prediction could certainly be worse.

R^2 is weak, though. At about 0.34, the model explains only about a third of the variance in imdb_score; closer to 1 would mean the features are more predictive.

"""

Mean Absolute Error (MAE): 0.6766427516943283
Mean Squared Error (MSE): 0.8016945198600803
R-squared (R^2): 0.3420779559433199



#15 Which variables appear to be contributing the most (variable importance) 

In [193]:
# Pair each feature name with its importance (column order matches the training data)
for feature, importance in zip(X_train.columns, grid_search.best_estimator_.feature_importances_):
    print(f" Feature: {feature}, importance {importance}")

 Feature: num_critic_for_reviews, importance 0.031810030403633775
 Feature: duration, importance 0.19545719270880793
 Feature: director_facebook_likes, importance 0.0017460606292825144
 Feature: actor_3_facebook_likes, importance 0.014589973498547371
 Feature: actor_1_facebook_likes, importance 0.0010190738634170363
 Feature: gross, importance 0.056411908369452965
 Feature: num_voted_users, importance 0.4264122579485081
 Feature: cast_total_facebook_likes, importance 0.03992957609961394
 Feature: facenumber_in_poster, importance 0.009817102592987167
 Feature: num_user_for_reviews, importance 0.00037926986397254065
 Feature: budget, importance 0.07330862276235219
 Feature: title_year, importance 0.06641051689614823
 Feature: actor_2_facebook_likes, importance 0.011672090115371393
 Feature: aspect_ratio, importance 0.0018839249012145962
 Feature: movie_facebook_likes, importance 0.004907957459345934
 Feature: language_Arabic, importance 0.0
 Feature: language_Aramaic, importance 0.0
 Fea
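
With well over a hundred features, most of them dummy columns with zero importance, the list is easier to read once sorted and filtered. A minimal sketch on synthetic data; the column names `signal`, `noise_a`, and `noise_b` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# One informative column and two pure-noise columns
X_demo = pd.DataFrame(rng.uniform(size=(300, 3)), columns=["signal", "noise_a", "noise_b"])
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=300)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)

# Label the importances with column names and sort from most to least important
importances = pd.Series(tree.feature_importances_, index=X_demo.columns).sort_values(ascending=False)

# Drop features the tree never split on
nonzero = importances[importances > 0]
print(nonzero)
```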

#16 Use the predict method on the test data and print out the results.

In [194]:
y_pred = grid_search.best_estimator_.predict(X_test)
y_pred

array([6.62094241, 6.21145511, 5.39969183, 8.76666667, 6.24365672,
       7.87894737, 7.75517241, 6.2       , 6.665     , 5.39969183,
       5.39969183, 6.6173913 , 6.80050761, 6.24365672, 6.97692308,
       6.25661376, 6.21145511, 7.16818182, 5.39969183, 6.21145511,
       6.21145511, 6.02553191, 5.39969183, 7.03645833, 5.39969183,
       6.62094241, 6.80050761, 7.46233766, 6.21145511, 5.39969183,
       6.96865672, 7.34      , 5.96444444, 7.75517241, 8.        ,
       7.36666667, 6.15744681, 6.626     , 6.96865672, 6.96865672,
       6.25661376, 7.34      , 6.62094241, 6.626     , 6.24365672,
       5.39969183, 6.21145511, 7.61818182, 6.25921053, 5.6287037 ,
       6.25921053, 5.39969183, 7.08888889, 8.1625    , 6.80050761,
       6.21145511, 5.6287037 , 6.02553191, 6.09411765, 6.21145511,
       6.80050761, 7.03645833, 7.328     , 7.43333333, 7.328     ,
       6.62094241, 7.475     , 6.97692308, 5.72833333, 7.5       ,
       6.96865672, 6.09411765, 5.39969183, 5.72833333, 6.96865

#17 How does the model perform on the test data?

In [195]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)

# Print and comment on the evaluation metrics
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R^2): {r_squared}")


"""
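Test performance is close to tune performance (MAE about 0.70 vs. 0.68, R^2 about 0.37 vs. 0.34), although MSE is somewhat higher (0.93 vs. 0.80). The model generalizes reasonably, with predictions off by about 0.7 imdb_score points on average.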
"""

Mean Absolute Error (MAE): 0.6952281502575215
Mean Squared Error (MSE): 0.9263197108906793
R-squared (R^2): 0.36547633249236244



#18 Print out the confusion matrix for the test data, what does it tell you about the model?

In [196]:
# confusion matrix does not apply to regression tasks

#19 What are the top 3 movies based on the test set? Which variables are most important in predicting the top 3 movies?

In [197]:

y_pred_series = pd.Series(y_pred, name='Predicted IMDB')
X_preds_concat = pd.concat([y_pred_series, X_test.reset_index(drop=True)], axis=1)


"""
The most important columns for predicting imdb_score are num_voted_users (importance about 0.43), duration (about 0.20), and budget (about 0.07); the remaining features have smaller impacts.
"""

# Top 5 movies by predicted rating (the top 3 are the first three rows)
X_preds_concat.sort_values(by="Predicted IMDB", ascending=False).head()

Unnamed: 0,Predicted IMDB,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,aspect_ratio,movie_facebook_likes,language_Arabic,language_Aramaic,language_Bosnian,language_Cantonese,language_Chinese,language_Czech,language_Danish,language_Dari,language_Dutch,language_Dzongkha,language_English,language_Filipino,language_French,language_German,language_Greek,language_Hebrew,language_Hindi,language_Hungarian,language_Icelandic,language_Indonesian,language_Italian,language_Japanese,language_Kannada,language_Kazakh,language_Korean,language_Mandarin,language_Maya,language_Mongolian,language_Norwegian,language_Panjabi,language_Persian,language_Polish,language_Portuguese,language_Romanian,language_Russian,language_Slovenian,language_Spanish,language_Swahili,language_Swedish,language_Tamil,language_Telugu,language_Thai,language_Urdu,language_Vietnamese,language_Zulu,country_Argentina,country_Aruba,country_Australia,country_Bahamas,country_Belgium,country_Brazil,country_Bulgaria,country_Cambodia,country_Cameroon,country_Canada,country_Chile,country_China,country_Colombia,country_Czech Republic,country_Denmark,country_Dominican Republic,country_Egypt,country_Finland,country_France,country_Georgia,country_Germany,country_Greece,country_Hong Kong,country_Hungary,country_Iceland,country_India,country_Indonesia,country_Iran,country_Ireland,country_Israel,country_Italy,country_Japan,country_Kenya,country_Kyrgyzstan,country_Libya,country_Mexico,country_Netherlands,country_New Line,country_New Zealand,country_Nigeria,country_Norway,country_Official site,country_Pakistan,country_Panama,country_Peru,country_Philippines,country_Poland,country_Romania,country_Russia,country_Slovakia,country_Slovenia,country_South Africa,country_South Korea,country_Soviet Union,country_Spain,country_Sweden,country_Switzerland,country_Taiwan,country_Thailand,country_Turkey,country_UK,country_USA,country_United Arab Emirates,country_West Germany,content_rating_G,content_rating_GP,content_rating_M,content_rating_NC-17,content_rating_Not Rated,content_rating_PG,content_rating_PG-13,content_rating_Passed,content_rating_R,content_rating_TV-14,content_rating_TV-G,content_rating_TV-MA,content_rating_TV-PG,content_rating_TV-Y,content_rating_TV-Y7,content_rating_Unrated,content_rating_X
3,8.766667,645.0,152.0,22000.0,11000.0,23000.0,533316100.0,1676169.0,57802.0,0.0,4667.0,185000000.0,2008.0,13000.0,2.35,37000.0,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False
185,8.733333,253.0,146.0,0.0,413.0,888.0,48468410.0,610333.0,2560.0,0.0,1320.0,19000000.0,1980.0,629.0,1.37,37000.0,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
245,8.55,103.0,44.0,686.509212,148.0,544.0,48468410.0,159910.0,996.0,1.0,270.0,39752620.0,2002.470517,183.0,1.78,59000.0,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False
168,8.55,79.0,22.0,686.509212,424.0,813.0,48468410.0,133415.0,1784.0,5.0,151.0,39752620.0,2002.470517,547.0,1.33,0.0,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False
238,8.525,162.0,101.0,194.0,602.0,1000.0,6712241.0,782437.0,3858.0,2.0,1420.0,7500000.0,1998.0,816.0,1.85,35000.0,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
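
The table above ranks movies by predicted score but cannot name them, because the title column is no longer in the feature matrix and `reset_index(drop=True)` discards the original row labels. Since `train_test_split` keeps the DataFrame index, one option is to look the titles up by that index before resetting it. A sketch on toy data: the titles, features, and predictions below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the original title column, indexed like the raw DataFrame
titles = pd.Series(["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"], name="movie_title")

# Stand-in for X_test, which keeps the original row labels after splitting
X_test_demo = pd.DataFrame({"duration": [100.0, 152.0, 95.0]}, index=[4, 1, 3])
y_pred_demo = np.array([6.2, 8.8, 5.4])

# Attach titles and predictions by original index, then rank by predicted score
ranked = pd.DataFrame(
    {
        "movie_title": titles.loc[X_test_demo.index].to_numpy(),
        "predicted_imdb": y_pred_demo,
    },
    index=X_test_demo.index,
).sort_values("predicted_imdb", ascending=False)

top3 = ranked.head(3)
print(top3)
```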


#20 Summarize what you learned along the way and make recommendations on how this could be used moving forward, being careful not to over promise.

In [200]:
"""
The model performs moderately well: predictions land within about 0.7 imdb_score points on average, although R^2 is only about 0.37. The main disappointment at the moment is scikit-learn's decision tree regressor. Because it only accepts numeric input, many of our categorical
variables had to be given up to keep the dataset's dimensionality from getting out of control. Had the implementation supported categorical variables directly, we could have included actor names as well as movie titles,
which could certainly be impactful. Test performance is close to tune performance, so the model generalizes reasonably to new data. One caution before promising anything: the most important feature, num_voted_users, is only known after release, so a model meant for unreleased movies would need to be retrained without it. Improvements could likely also be made in pipelining and generalizing the preprocessing steps, but the proof of 
concept is here. 
"""
x = 1
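
The pipelining mentioned in the summary could look roughly like the sketch below, which bundles imputation, one-hot encoding, and the tree into one estimator so that preprocessing is learned from training rows only and reapplied to new movies. The column subset, the toy rows, and the fixed `max_depth=6` are assumptions for illustration, not the notebook's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

numeric_cols = ["duration", "budget"]
categorical_cols = ["content_rating"]

# Numeric columns: mean imputation. Categorical columns: mode imputation, then one-hot.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Categories unseen during training are encoded as all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(max_depth=6, random_state=0)),
])

# Toy training rows with some missing values
train_demo = pd.DataFrame({
    "duration": [90.0, 120.0, np.nan, 150.0, 100.0, 110.0],
    "budget": [1e6, 5e7, 2e7, np.nan, 3e6, 8e7],
    "content_rating": ["PG", "R", "R", np.nan, "PG-13", "R"],
})
y_train_demo = np.array([5.5, 7.1, 6.4, 8.0, 6.0, 6.9])
model.fit(train_demo, y_train_demo)

# New movies, including a rating ("G") the model never saw
new_movies = pd.DataFrame({
    "duration": [105.0, np.nan],
    "budget": [np.nan, 4e7],
    "content_rating": ["G", "R"],
})
predictions = model.predict(new_movies)
print(predictions)
```

Passing `model` to `GridSearchCV` with a grid key of `tree__max_depth` would tune the depth while refitting the preprocessing inside each fold.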