Congrats! You just graduated from UVA's BSDS program and got a job working at a movie studio in Hollywood.

Your boss is the head of the studio and wants to know if they can gain a competitive advantage by predicting which new movies might get high IMDb scores (movie ratings).

You would like to be able to explain the model to mere mortals but need a fairly robust and flexible approach, so you've chosen to use decision trees to get started.

In doing so, like the great data scientists of the past, you remembered the excellent education provided to you at UVA in an undergrad data science course and have outlined 20ish steps that will need to be undertaken to complete this task. As always, you will need to make sure to #comment your work heavily.

Footnotes:
- You can add or combine steps if needed.
- Remember to try several methods during evaluation, and always be mindful of how the model will be used in practice.
- Make sure all your variables are the correct type (factor, character, numeric, etc.).

In [1]:
import pandas as pd
import numpy as np

1 Load the data.

In [2]:
# data loading
movie_metadata=pd.read_csv("../data/movie_metadata.csv")

In [3]:
# EDA 

# list all numeric columns
numeric_columns = movie_metadata.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric columns in the dataset:")
print(numeric_columns)

# relevant columns for analysis
# num_critic_for_reviews, actor_1_facebook_likes, num_voted_users, num_user_for_reviews, gross, cast_total_facebook_likes, budget, title_year, imdb_score, movie_facebook_likes

Numeric columns in the dataset:
['num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_1_facebook_likes', 'gross', 'num_voted_users', 'cast_total_facebook_likes', 'facenumber_in_poster', 'num_user_for_reviews', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes']


In [4]:
# list all categorical columns
categorical_columns = movie_metadata.select_dtypes(include=[object]).columns.tolist()
print("\nCategorical columns in the dataset:")
print(categorical_columns)

# relevant columns for analysis
# director_name, genres, actor_1_name, movie_title, country, content_rating


Categorical columns in the dataset:
['color', 'director_name', 'actor_2_name', 'genres', 'actor_1_name', 'movie_title', 'actor_3_name', 'plot_keywords', 'movie_imdb_link', 'language', 'country', 'content_rating']


In [5]:
# drop irrelevant columns
columns = ['duration', 'actor_3_facebook_likes', 'facenumber_in_poster', 'actor_2_facebook_likes', 'aspect_ratio', 'color',
           'actor_2_name', 'actor_3_name', 'plot_keywords', 'movie_imdb_link', 'language']
movie_metadata.drop(columns=columns, inplace=True)
movie_metadata.reset_index(drop=True, inplace=True)


movie_metadata.head()

Unnamed: 0,director_name,num_critic_for_reviews,director_facebook_likes,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,num_user_for_reviews,country,content_rating,budget,title_year,imdb_score,movie_facebook_likes
0,James Cameron,723.0,0.0,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,3054.0,USA,PG-13,237000000.0,2009.0,7.9,33000
1,Gore Verbinski,302.0,563.0,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,1238.0,USA,PG-13,300000000.0,2007.0,7.1,0
2,Sam Mendes,602.0,0.0,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,994.0,UK,PG-13,245000000.0,2015.0,6.8,85000
3,Christopher Nolan,813.0,22000.0,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,2701.0,USA,PG-13,250000000.0,2012.0,8.5,164000
4,Doug Walker,,131.0,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens ...,8,143,,,,,,7.1,0


2 Ensure all the variables are classified correctly including the target variable and collapse factor variables as needed.

In [6]:
movie_metadata.info()
movie_metadata.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   director_name              4939 non-null   object 
 1   num_critic_for_reviews     4993 non-null   float64
 2   director_facebook_likes    4939 non-null   float64
 3   actor_1_facebook_likes     5036 non-null   float64
 4   gross                      4159 non-null   float64
 5   genres                     5043 non-null   object 
 6   actor_1_name               5036 non-null   object 
 7   movie_title                5043 non-null   object 
 8   num_voted_users            5043 non-null   int64  
 9   cast_total_facebook_likes  5043 non-null   int64  
 10  num_user_for_reviews       5022 non-null   float64
 11  country                    5038 non-null   object 
 12  content_rating             4740 non-null   object 
 13  budget                     4551 non-null   float

(5043, 17)

In [7]:
# check unique values in director_name
unique_directors = movie_metadata['director_name'].unique()
print("\nUnique directors in the dataset:")
print(len(unique_directors))


Unique directors in the dataset:
2399


In [8]:
# parsing 'genres' column to ensure it contains only the first genre
def parse_genres(genres):
    if isinstance(genres, str):  # Check if the value is a string
        # Split the string by '|' and return the first genre
        return genres.split('|')[0].strip()
    elif isinstance(genres, list) and len(genres) > 0:  # If it's already a list, return the first element
        return genres[0]
    else:
        return None  # Return None if the value is not a string or a non-empty list

# applying
movie_metadata['genres'] = movie_metadata['genres'].apply(parse_genres)

# updated
movie_metadata.head()

Unnamed: 0,director_name,num_critic_for_reviews,director_facebook_likes,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,num_user_for_reviews,country,content_rating,budget,title_year,imdb_score,movie_facebook_likes
0,James Cameron,723.0,0.0,1000.0,760505847.0,Action,CCH Pounder,Avatar,886204,4834,3054.0,USA,PG-13,237000000.0,2009.0,7.9,33000
1,Gore Verbinski,302.0,563.0,40000.0,309404152.0,Action,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,1238.0,USA,PG-13,300000000.0,2007.0,7.1,0
2,Sam Mendes,602.0,0.0,11000.0,200074175.0,Action,Christoph Waltz,Spectre,275868,11700,994.0,UK,PG-13,245000000.0,2015.0,6.8,85000
3,Christopher Nolan,813.0,22000.0,27000.0,448130642.0,Action,Tom Hardy,The Dark Knight Rises,1144337,106759,2701.0,USA,PG-13,250000000.0,2012.0,8.5,164000
4,Doug Walker,,131.0,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens ...,8,143,,,,,,7.1,0


In [9]:
# check unique values in genres 
unique_genres = movie_metadata['genres'].unique()
print("\nUnique genres in the dataset:")
print(len(unique_genres))


Unique genres in the dataset:
21


In [10]:
# unique values in country 
unique_countries = movie_metadata['country'].unique()
print("\nUnique countries in the dataset:")
print(len(unique_countries))


Unique countries in the dataset:
66


In [11]:
# unique title_year values 
unique_years = movie_metadata['title_year'].unique()
print("\nUnique title_year values in the dataset:")
print(len(unique_years))


Unique title_year values in the dataset:
92


In [22]:
# unique actor_1_name values
unique_actors = movie_metadata['actor_1_name'].unique()
print("\nUnique actor_1_name values in the dataset:")
print(len(unique_actors))



Unique actor_1_name values in the dataset:
1472


In [21]:
# cast country, genres, title_year, director_name, and content_rating to the category dtype
# (this only changes the dtype; rare levels are collapsed into 'Other' in separate cells)

movie_metadata['country'] = movie_metadata['country'].astype('category')
movie_metadata['genres'] = movie_metadata['genres'].astype('category')
movie_metadata['title_year'] = movie_metadata['title_year'].astype('category')
movie_metadata['director_name'] = movie_metadata['director_name'].astype('category')
movie_metadata['content_rating'] = movie_metadata['content_rating'].astype('category')

movie_metadata.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3838 entries, 0 to 3837
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   director_name              3838 non-null   category
 1   num_critic_for_reviews     3838 non-null   float64 
 2   director_facebook_likes    3838 non-null   float64 
 3   actor_1_facebook_likes     3838 non-null   float64 
 4   gross                      3838 non-null   float64 
 5   genres                     3838 non-null   category
 6   actor_1_name               3838 non-null   object  
 7   movie_title                3838 non-null   object  
 8   num_voted_users            3838 non-null   int64   
 9   cast_total_facebook_likes  3838 non-null   int64   
 10  num_user_for_reviews       3838 non-null   float64 
 11  country                    3838 non-null   category
 12  content_rating             3838 non-null   category
 13  budget                     3838 n

In [13]:
movie_metadata.head()

Unnamed: 0,director_name,num_critic_for_reviews,director_facebook_likes,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,num_user_for_reviews,country,content_rating,budget,title_year,imdb_score,movie_facebook_likes
0,James Cameron,723.0,0.0,1000.0,760505847.0,Action,CCH Pounder,Avatar,886204,4834,3054.0,USA,PG-13,237000000.0,2009.0,7.9,33000
1,Gore Verbinski,302.0,563.0,40000.0,309404152.0,Action,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,1238.0,USA,PG-13,300000000.0,2007.0,7.1,0
2,Sam Mendes,602.0,0.0,11000.0,200074175.0,Action,Christoph Waltz,Spectre,275868,11700,994.0,UK,PG-13,245000000.0,2015.0,6.8,85000
3,Christopher Nolan,813.0,22000.0,27000.0,448130642.0,Action,Tom Hardy,The Dark Knight Rises,1144337,106759,2701.0,USA,PG-13,250000000.0,2012.0,8.5,164000
4,Doug Walker,,131.0,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens ...,8,143,,,,,,7.1,0


In [None]:
# collapse content_rating: keep the five listed ratings and relabel everything else as 'Other'
movie_metadata.content_rating.value_counts()

top_ratings = ['R', 'PG-13', 'PG', 'G', 'Not Rated']
movie_metadata['content_rating'] = movie_metadata['content_rating'].apply(lambda x: x if x in top_ratings else 'Other').astype('category')
movie_metadata['content_rating'].value_counts()

content_rating
R            1737
PG-13        1330
PG            576
G              91
Other          63
Not Rated      41
Name: count, dtype: int64

In [26]:
# collapsing actor_1_name 
movie_metadata.actor_1_name.value_counts()

top5actors = ['Robert De Niro', 'Johnny Depp', 'Nicolas Cage', 'J.K. Simmons', 'Denzel Washington']
movie_metadata.actor_1_name = (movie_metadata.actor_1_name.apply(lambda x: x if x in top5actors else 'Other')).astype('category')
movie_metadata.actor_1_name.value_counts()


actor_1_name
Other                3665
Robert De Niro         42
Johnny Depp            39
J.K. Simmons           31
Nicolas Cage           31
Denzel Washington      30
Name: count, dtype: int64

In [14]:
# collapsing country
movie_metadata.country.value_counts()
top5countries = ['USA', 'UK', 'France', 'Canada', 'Germany']

movie_metadata.country = (movie_metadata.country.apply(lambda x: x if x in top5countries else "Other")).astype('category')

movie_metadata.country.value_counts()

country
USA        3807
UK          448
Other       406
France      154
Canada      126
Germany      97
Name: count, dtype: int64

In [15]:
# collapsing genres
movie_metadata.genres.value_counts()
top5genres = ['Comedy', 'Action', 'Drama', 'Adventure', 'Crime']

movie_metadata.genres = (movie_metadata.genres.apply(lambda x: x if x in top5genres else "Other")).astype('category')

movie_metadata.genres.value_counts()

genres
Comedy       1329
Action       1153
Drama         972
Other         787
Adventure     453
Crime         349
Name: count, dtype: int64

In [16]:
# collapsing title_year 
movie_metadata.title_year.value_counts()
top5years = [2009.0, 2014.0, 2006.0, 2013.0, 2010.0]
movie_metadata.title_year = (movie_metadata.title_year.apply(lambda x: x if x in top5years else "Other")).astype('category')

movie_metadata.title_year.value_counts()
 

title_year
Other     3717
2009.0     260
2014.0     252
2006.0     239
2013.0     237
2010.0     230
Name: count, dtype: int64

In [17]:
# collapsing directors 
movie_metadata.director_name.value_counts()
top5directors = ['Steven Spielberg', 'Woody Allen', 'Clint Eastwood', 'Martin Scorsese', 'Ridley Scott']

movie_metadata.director_name = (movie_metadata.director_name.apply(lambda x: x if x in top5directors else "Other")).astype('category')

movie_metadata.director_name.value_counts()


director_name
Other               4834
Steven Spielberg      26
Woody Allen           22
Clint Eastwood        20
Martin Scorsese       20
Ridley Scott          17
Name: count, dtype: int64

In [28]:
movie_metadata.dropna(inplace=True)
movie_metadata.reset_index(drop=True, inplace=True)
movie_metadata.head()
movie_metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3838 entries, 0 to 3837
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   director_name              3838 non-null   category
 1   num_critic_for_reviews     3838 non-null   float64 
 2   director_facebook_likes    3838 non-null   float64 
 3   actor_1_facebook_likes     3838 non-null   float64 
 4   gross                      3838 non-null   float64 
 5   genres                     3838 non-null   category
 6   actor_1_name               3838 non-null   category
 7   movie_title                3838 non-null   object  
 8   num_voted_users            3838 non-null   int64   
 9   cast_total_facebook_likes  3838 non-null   int64   
 10  num_user_for_reviews       3838 non-null   float64 
 11  country                    3838 non-null   category
 12  content_rating             3838 non-null   category
 13  budget                     3838 n

3 Check for missing values and correct as needed. Once you've completed the cleaning, create a function that will do it again for you in the future. In the submission, include only the function and the function call.
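
One way the cleaning steps above could be wrapped into a reusable function is sketched below. The names `collapse_top_n` and `clean_movies` are hypothetical, the demo frame is made up so the block runs on its own, and the sketch assumes rows with missing values are dropped before rare levels are collapsed (the cells above do it in the other order).

```python
import numpy as np
import pandas as pd


def collapse_top_n(series: pd.Series, n: int = 5, other: str = "Other") -> pd.Series:
    """Keep the n most frequent levels and relabel every other level as `other`."""
    top_levels = series.value_counts().nlargest(n).index
    # where() keeps values that satisfy the condition and replaces the rest
    return series.where(series.isin(top_levels), other).astype("category")


def clean_movies(df: pd.DataFrame, drop_cols=(), collapse_cols=(), n: int = 5) -> pd.DataFrame:
    """Drop unused columns, drop rows with missing values, collapse rare levels."""
    out = df.drop(columns=list(drop_cols))
    out = out.dropna().reset_index(drop=True)
    for col in collapse_cols:
        out[col] = collapse_top_n(out[col], n=n)
    return out


# Made-up demo data (not the real movie_metadata.csv)
demo = pd.DataFrame({
    "country": ["USA", "USA", "UK", "France", None, "USA"],
    "budget": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "color": ["Color"] * 6,
})

cleaned = clean_movies(demo, drop_cols=["color"], collapse_cols=["country"], n=1)
print(cleaned)
```

On the real data the call would pass the dropped-column list from the earlier cell as `drop_cols` and the categorical columns as `collapse_cols`. Note that this helper picks the top levels by frequency, whereas the cells above use hand-picked lists.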

4 Guess what, you don't need to scale the data: decision trees make local greedy decisions, splitting on a threshold of one feature at a time, so the scale of a feature doesn't change the splits. It keeps getting easier; go to the next step.

5 Determine the base rate, or prevalence, for the classifier. What does this number mean?
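
A minimal sketch of the base-rate calculation follows. Since `imdb_score` is numeric, a classifier needs a binary target first; the cutoff of 7.0 below is purely an assumption for illustration, and the five scores are the ones shown in the `head()` output above.

```python
import pandas as pd

# The five imdb_score values from the head() output above
scores = pd.Series([7.9, 7.1, 6.8, 8.5, 7.1], name="imdb_score")

CUTOFF = 7.0  # hypothetical threshold for a "high" score; not set by the assignment

# 1 = high score, 0 = not high
high_score = (scores >= CUTOFF).astype(int)

# The base rate is the share of positives: the accuracy you would get
# by always predicting the positive class
base_rate = high_score.mean()
print(f"Base rate: {base_rate:.2f}")
```

A model is only useful if it beats this number; on the full dataset the base rate will differ from this five-row toy value.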

6 Split your data into train, tune, and test sets (80/10/10).
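
The 80/10/10 split can be sketched with two calls to scikit-learn's `train_test_split`: first carve off 80% for training, then halve the remainder. The random `X` and `y` below are placeholders standing in for the cleaned features and binary target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 rows, 4 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# Step 1: 80% train, 20% held back
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42
)

# Step 2: split the held-back 20% evenly into tune and test (10% each)
X_tune, X_test, y_tune, y_test = train_test_split(
    X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_tune), len(X_test))
```

Stratifying on the target keeps the class balance roughly equal across the three sets, which matters when the base rate is far from 50%.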

7 Create the kfold object for cross validation.

8 Create the scoring metric you will use to evaluate your model and the grid of max depth hyperparameter values for the grid search.

9 Build the classifier object.

10 Use the kfold object and the scoring metric to find the best hyperparameter value for max depth via the grid search method.

11 Fit the model to the training data.

12 What is the best depth value?
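
Steps 7 through 12 could be sketched together as below. The data comes from `make_classification` as a stand-in for the training split, and the choices of F1 as the metric, 5 folds, and depths 1 to 10 are assumptions rather than requirements of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the training split
X_train, y_train = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# Step 7: k-fold object (stratified so each fold keeps the class balance)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Step 8: scoring metric and the max_depth values to search
scoring = make_scorer(f1_score)
param_grid = {"max_depth": list(range(1, 11))}

# Step 9: classifier object
clf = DecisionTreeClassifier(random_state=42)

# Steps 10-11: grid search over max_depth; fit() also refits the best model
search = GridSearchCV(clf, param_grid, scoring=scoring, cv=kf)
search.fit(X_train, y_train)

# Step 12: best depth and its cross-validated score
best_depth = search.best_params_["max_depth"]
print("Best max_depth:", best_depth)
print("Mean CV F1 at best depth:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the tree refit on all of the training data at the chosen depth, and `search.cv_results_` holds the per-depth scores.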

13 Print out the model.

14 View the results and comment on how the model performed using the metrics you selected.

15 Which variables appear to be contributing the most (variable importance)?
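
For printing the model and reading variable importance, a sketch along these lines may help. It uses a small tree fit on synthetic data, and the names `feature_0` through `feature_5` are placeholders for the real column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder feature names and synthetic data
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=1
)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Text rendering of the fitted tree's split rules
print(export_text(tree, feature_names=feature_names))

# Impurity-based importances, sorted from most to least important
importances = pd.Series(
    tree.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances sum to 1 and tend to favor features with many possible split points, so they are best read as a rough ranking rather than a precise measure.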

16 Use the predict method on the tune data and print out the results.

17 How does the model perform on the tune data?

18 Print out the confusion matrix for the tune data. What does it tell you about the model?
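
Steps 16 through 18 could look roughly like this. The data is synthetic, and a single 80/20 split stands in for the train and tune sets so the block stays self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train and tune sets
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=2
)
X_train, X_tune, y_train, y_tune = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)

# Step 16: predictions on the tune set
tune_pred = model.predict(X_tune)

# Step 18: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_tune, tune_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Step 17: precision, recall, and F1 per class
print(classification_report(y_tune, tune_pred))
```

For the studio, a false positive (`fp`) is a movie predicted to score high that does not, and a false negative (`fn`) is a missed hit; which error is costlier depends on how the predictions will be used.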

19 What are the top 3 movies based on the tune set? Which variables are most important in predicting the top 3 movies?

20 Use a different hyperparameter for the grid search function and go through the process above again using the tune set. 
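
As one possible choice for the second search, the sketch below tunes `min_samples_leaf` instead of `max_depth`. The hyperparameter, its candidate values, and the synthetic training data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the training split
X_train, y_train = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Larger leaves mean a simpler tree, much like a smaller max_depth
leaf_grid = {"min_samples_leaf": [1, 5, 10, 25, 50]}

search_leaf = GridSearchCV(
    DecisionTreeClassifier(random_state=42), leaf_grid, scoring="f1", cv=cv
)
search_leaf.fit(X_train, y_train)

print(search_leaf.best_params_, round(search_leaf.best_score_, 3))
```

Comparing `search_leaf.best_estimator_` with the depth-tuned model on the tune set, using the same metric, gives a like-for-like answer to whether the new search helped.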

21 Did the model improve with the new hyperparameter search?

22 Using the better model, predict the test data and print out the results.

23 Summarize what you learned along the way and make recommendations to your boss on how this could be used moving forward, being careful not to overpromise.