## Prediction Model based on Genres


### Model 1: Generate a list of movie recommendations based on "genres" + "rating"
- Input from the user: a genre
- Scores are calculated and ranked to produce the top 5 movies
- This notebook offers 2 simple approaches:
    - based on rating score alone
    - based on rating score and the genre picked by the user
- Features include:
    - genres
    - movie rating score (based on the IMDB weighted-rating formula)

In [227]:
import pandas as pd

pd.set_option("display.max_columns", 85)
pd.set_option("display.max_rows", 85)
df = pd.read_csv('../../edit_data/Lee/cleaned_data/movies_main.csv')  
df.head()

Unnamed: 0,Id,Original_Title,Original_Language,Spoken_Languages,Budget,Revenue,Runtime,Release_Date,Production_Companies,Production_Countries,imdb_id,Popularity,Vote_Average,Vote_Count,Belongs_To_Collection,Tagline,Overview,Genres,Genres_Parse,Belongs_To_Collection_Parse,Spoken_Languages_Parse,Production_Companies_Parse,Production_Countries_Parse,Production_Countries_Code_Parse,Keywords,Keywords_parse,Cast_parse,Director_parse
0,862,Toy Story,en,"[{'iso_639_1': 'en', 'name': 'English'}]",30000000.0,373554033.0,81.0,1995-10-30,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0114709,21.946943,7.7,5415.0,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","['Animation', 'Comedy', 'Family']",['Toy Story Collection'],['English'],['Pixar Animation Studios'],['United States of America'],['US'],"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","['jealousy', 'toy', 'boy', 'friendship', 'frie...","['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...",['John Lasseter']
1,8844,Jumanji,en,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",65000000.0,262797249.0,104.0,1995-12-15,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0113497,17.015539,6.9,2413.0,,Roll the dice and unleash the excitement!,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","['Adventure', 'Fantasy', 'Family']",,"['English', 'Français']","['TriStar Pictures', 'Teitler Film', 'Intersco...",['United States of America'],['US'],"[{'id': 10090, 'name': 'board game'}, {'id': 1...","['board game', 'disappearance', ""based on chil...","['Robin Williams', 'Jonathan Hyde', 'Kirsten D...",['Joe Johnston']
2,15602,Grumpier Old Men,en,"[{'iso_639_1': 'en', 'name': 'English'}]",0.0,0.0,101.0,1995-12-22,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0113228,11.7129,6.5,92.0,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",Still Yelling. Still Fighting. Still Ready for...,A family wedding reignites the ancient feud be...,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","['Romance', 'Comedy']",['Grumpy Old Men Collection'],['English'],"['Warner Bros.', 'Lancaster Gate']",['United States of America'],['US'],"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","['fishing', 'best friend', 'duringcreditssting...","['Walter Matthau', 'Jack Lemmon', 'Ann-Margret...",['Howard Deutch']
3,31357,Waiting to Exhale,en,"[{'iso_639_1': 'en', 'name': 'English'}]",16000000.0,81452156.0,127.0,1995-12-22,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",tt0114885,3.859495,6.1,34.0,,Friends are the people who let you be yourself...,"Cheated on, mistreated and stepped on, the wom...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","['Comedy', 'Drama', 'Romance']",,['English'],['Twentieth Century Fox Film Corporation'],['United States of America'],['US'],"[{'id': 818, 'name': 'based on novel'}, {'id':...","['based on novel', 'interracial relationship',...","['Whitney Houston', 'Angela Bassett', 'Loretta...",['Forest Whitaker']
4,11862,Father of the Bride Part II,en,"[{'iso_639_1': 'en', 'name': 'English'}]",0.0,76578911.0,106.0,1995-02-10,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0113041,8.387519,5.7,173.0,"{'id': 96871, 'name': 'Father of the Bride Col...",Just When His World Is Back To Normal... He's ...,Just when George Banks has recovered from his ...,"[{'id': 35, 'name': 'Comedy'}]",['Comedy'],['Father of the Bride Collection'],['English'],"['Sandollar Productions', 'Touchstone Pictures']",['United States of America'],['US'],"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","['baby', 'midlife crisis', 'confidence', 'agin...","['Steve Martin', 'Diane Keaton', 'Martin Short...",['Charles Shyer']


Method 1: Step 1 - Calculate C (the mean of Vote_Average over all movies) and m (the minimum Vote_Count a movie needs to qualify).

In [231]:
# C: mean of Vote_Average across all movies
C = df['Vote_Average'].mean()
C

5.6259312895463855

In [232]:
# m: vote-count threshold, set to the 90th percentile of Vote_Count
m = df['Vote_Count'].quantile(0.9)
m

163.0
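
With C and m computed, the weighted rating applied in step 3 can be written out explicitly. This is the commonly cited IMDB-style formula and it mirrors the `weighted_rating()` function defined in step 3; here v stands for a movie's Vote_Count and R for its Vote_Average.

```latex
\[
\text{Score} = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C,
\qquad C \approx 5.626,\quad m = 163
\]
```

The two weights sum to 1, so a movie with few votes (v much smaller than m) is pulled toward the global mean C, while a heavily voted movie keeps a score close to its own average R.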

Method 1: Step 2 - Select the qualified movies (the filter)
- Movies qualify based on the number of votes they have received.
- To qualify, a movie must be at or above the 90th percentile of vote counts (i.e. 163 votes or more).
- This is also where ['Genres'] is applied as a filter.

Filter 1: m-value only

In [110]:
## only filter based on m value
# qualified_movies = df.copy().loc[df['Vote_Count'] >= m]
# qualified_movies.shape


Filter 2: m-value + genres (entered by user)

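Re-parse Genres into a new column (the existing Genres_Parse column is not working; in the preview above it appears to hold the list as a plain string rather than a Python list)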

In [111]:
import ast
import numpy as np

# Function to parse the string and extract 'name'
def extract_names(data_string):
    if pd.isna(data_string):
        return np.nan
    try:
        # Safely evaluate the string to a Python object
        data_object = ast.literal_eval(data_string)
        if isinstance(data_object, list):
            # Extract 'name' from each dictionary in the list
            names = [item['name'] for item in data_object]
            return names
        elif isinstance(data_object, dict):
            # Extract 'name' from the dictionary
            return [data_object.get('name', np.nan)]
    except (ValueError, SyntaxError):
        return np.nan
    # Any other parsed type (e.g. a bare number) has no names to extract
    return np.nan
    
    
df['Genres_Parse_new'] = df['Genres'].apply(extract_names)
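
As a small illustration of what `extract_names` does on a single cell, the sketch below runs `ast.literal_eval` on one hand-written string in the shape shown in the Genres column. The sample strings are made up for the example; only the parsing approach comes from the cell above.

```python
import ast

# Hand-written sample in the shape of one 'Genres' cell
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# literal_eval only accepts Python literals, so it never executes code
parsed = ast.literal_eval(raw)
names = [item['name'] for item in parsed]
print(names)

# A truncated string raises SyntaxError (or ValueError), which is why
# extract_names wraps the call in try/except
try:
    ast.literal_eval("[{'id': 16, 'name': ")
    malformed_ok = True
except (ValueError, SyntaxError):
    malformed_ok = False
print(malformed_ok)
```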

In [112]:
### TEST : to generate a unique list of genres

# Flatten the list of genres, ignoring NaN values, and extract unique values
unique_genres = set(
    genre
    for sublist in df['Genres_Parse_new'].dropna()  # Drop NaN values
    for genre in sublist
)

# Convert the set to a sorted list (optional)
unique_genres_list = sorted(unique_genres)

# Display the unique genres
print(unique_genres_list)
print(len(unique_genres_list))

['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Foreign', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western']
20


Ask the user to enter a genre
- The code can be hardened against typos, etc.

In [113]:

# Genres List in total (20)
genres_list= ['Action', 'Adventure', 'Animation', 
              'Comedy', 'Crime', 'Documentary', 
              'Drama', 'Family', 'Fantasy', 
              'Foreign', 'History', 'Horror', 
              'Music', 'Mystery', 'Romance', 
              'Science Fiction', 'TV Movie', 
              'Thriller', 'War', 'Western']

print('total genres:', len(genres_list))

genre_input = input("Enter the genre: ")

if genre_input in genres_list:
    # Keep movies that meet the vote-count threshold and list the chosen genre
    has_genre = df['Genres_Parse_new'].apply(lambda g: isinstance(g, list) and genre_input in g)
    filt = (df['Vote_Count'] >= m) & has_genre
    qualified_movies = df.loc[filt].copy()
    print("You have selected the genre: ", genre_input)
    print("qualified_movies: ", qualified_movies.shape)
else:
    print("wrong genre entered.")



total genres: 20
You have selected the genre:  Science Fiction
qualified_movies:  (578, 29)
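
One way the genre prompt could be made more tolerant of typos and letter case is sketched below. `normalize_genre` and its `cutoff` value are hypothetical and not part of the notebook; the sketch relies only on the standard-library `difflib.get_close_matches`.

```python
import difflib

GENRES = ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
          'Drama', 'Family', 'Fantasy', 'Foreign', 'History', 'Horror',
          'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie',
          'Thriller', 'War', 'Western']

def normalize_genre(text, genres=GENRES, cutoff=0.8):
    """Return the canonical genre for user text, or None if nothing is close.

    Hypothetical helper: the cutoff of 0.8 is an arbitrary starting point.
    """
    lookup = {g.lower(): g for g in genres}
    key = text.strip().lower()
    if key in lookup:                      # exact match, ignoring case and spaces
        return lookup[key]
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None

print(normalize_genre("  science fiction "))
print(normalize_genre("Commedy"))
print(normalize_genre("xyz"))
```

The returned name could then be used in place of `genre_input` in the filter, with a `None` result triggering a re-prompt.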


Method 1: Step 3 - Set up the rating score calculation

In [230]:
def weighted_rating(x, m=m, C=C):
    v = x['Vote_Count']
    R = x['Vote_Average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)
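
To see how the formula behaves, the sketch below plugs in the C and m values printed in step 1 together with the vote figures of two rows from the first preview (Toy Story and Waiting to Exhale). `weighted_rating_values` is a hypothetical plain-number variant of `weighted_rating()`, written only for this illustration.

```python
# Values printed earlier in the notebook
C = 5.6259312895463855   # mean Vote_Average
m = 163.0                # 90th-percentile Vote_Count

def weighted_rating_values(v, R, m=m, C=C):
    """Same formula as weighted_rating(), applied to plain numbers."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Toy Story: 5415 votes, average 7.7 (many votes, score stays near R)
toy_story = weighted_rating_values(v=5415.0, R=7.7)
# Waiting to Exhale: 34 votes, average 6.1 (few votes, score pulled toward C)
waiting = weighted_rating_values(v=34.0, R=6.1)

print(round(toy_story, 3), round(waiting, 3))
```

Both results agree with the Score values shown for these two movies in the Model 2 preview further down.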

In [115]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
qualified_movies['Score'] = qualified_movies.apply(weighted_rating, axis=1)

qualified_movies.head()

Unnamed: 0,Id,Original_Title,Original_Language,Spoken_Languages,Budget,Revenue,Runtime,Release_Date,Production_Companies,Production_Countries,imdb_id,Popularity,Vote_Average,Vote_Count,Belongs_To_Collection,Tagline,Overview,Genres,Genres_Parse,Belongs_To_Collection_Parse,Spoken_Languages_Parse,Production_Companies_Parse,Production_Countries_Parse,Production_Countries_Code_Parse,Keywords,Keywords_parse,Cast_parse,Director_parse,Genres_Parse_new,Score
28,902,La Cité des Enfants Perdus,fr,"[{'iso_639_1': 'cn', 'name': '广州话 / 廣州話'}, {'i...",18000000.0,1738611.0,108.0,1995-05-16,"[{'name': 'Procirep', 'id': 311}, {'name': 'Co...","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",tt0112682,9.822423,7.6,308.0,,Where happily ever after is just a dream.,A scientist in a surrealist society kidnaps ch...,"[{'id': 14, 'name': 'Fantasy'}, {'id': 878, 'n...","['Fantasy', 'Science Fiction', 'Adventure']",,"['广州话 / 廣州話', 'Français']","['Procirep', 'Constellation Productions', 'Fra...","['France', 'Germany', 'Spain']","['FR', 'DE', 'ES']","[{'id': 402, 'name': 'clone'}, {'id': 1566, 'n...","['clone', 'dream', 'island', 'eye', 'dystopia'...","['Ron Perlman', 'Dominique Pinon', 'Judith Vit...","['Jean-Pierre Jeunet', 'Marc Caro']","[Fantasy, Science Fiction, Adventure]",6.91683
31,63,Twelve Monkeys,en,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",29500000.0,168840000.0,129.0,1995-12-29,"[{'name': 'Universal Pictures', 'id': 33}, {'n...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0114746,12.297305,7.4,2470.0,,The future is history.,"In the year 2035, convict James Cole reluctant...","[{'id': 878, 'name': 'Science Fiction'}, {'id'...","['Science Fiction', 'Thriller', 'Mystery']",,"['English', 'Français']","['Universal Pictures', 'Atlas Entertainment', ...",['United States of America'],['US'],"[{'id': 222, 'name': 'schizophrenia'}, {'id': ...","['schizophrenia', 'philadelphia', 'cassandra s...","['Bruce Willis', 'Madeleine Stowe', 'Brad Pitt...",['Terry Gilliam'],"[Science Fiction, Thriller, Mystery]",7.290173
157,10329,Congo,en,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",50000000.0,152022101.0,109.0,1995-06-09,"[{'name': 'Paramount Pictures', 'id': 4}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0112715,7.260574,5.0,214.0,,Where you are the endangered species.,Eight people embark on an expedition into the ...,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...","['Action', 'Adventure', 'Drama', 'Mystery', 'S...",,"['English', 'Français']","['Paramount Pictures', 'Kennedy/Marshall Compa...",['United States of America'],['US'],"[{'id': 690, 'name': 'gorilla'}, {'id': 2521, ...","['gorilla', 'kongo', 'diamond mine', 'diamond']","['Laura Linney', 'Dylan Walsh', 'Ernie Hudson'...",['Frank Marshall'],"[Action, Adventure, Drama, Mystery, Science Fi...",5.270628
169,9886,Johnny Mnemonic,en,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",25000000.0,19075720.0,97.0,1995-05-26,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",tt0113481,11.715868,5.5,380.0,,The hottest data on earth. In the coolest head...,"A data courier, literally carrying a data pack...","[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...","['Adventure', 'Action', 'Drama', 'Science Fict...",,"['English', '日本語']","['TriStar Pictures', 'Alliance Communications ...","['Canada', 'United States of America']","['CA', 'US']","[{'id': 2588, 'name': 'brain'}, {'id': 3321, '...","['brain', 'childhood memory', 'dystopia', 'pha...","['Keanu Reeves', 'Dina Meyer', 'Takeshi Kitano...",['Robert Longo'],"[Adventure, Action, Drama, Science Fiction, Th...",5.537803
170,9482,Judge Dredd,en,"[{'iso_639_1': 'en', 'name': 'English'}]",90000000.0,113493481.0,96.0,1995-06-30,"[{'name': 'Hollywood Pictures', 'id': 915}, {'...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0113492,8.184815,5.4,643.0,,"In the future, one man is the law.","In a dystopian future, Dredd, the most famous ...","[{'id': 878, 'name': 'Science Fiction'}]",['Science Fiction'],,['English'],"['Hollywood Pictures', 'Cinergi Pictures Enter...",['United States of America'],['US'],"[{'id': 934, 'name': 'judge'}, {'id': 4458, 'n...","['judge', 'post-apocalyptic', 'dystopia', 'bas...","['Sylvester Stallone', 'Diane Lane', 'Armand A...",['Danny Cannon'],[Science Fiction],5.445691


Method 1: Step 4 - Create the movie recommendation based on the Score

In [193]:
#Sort movies based on score calculated above
qualified_movies = qualified_movies.sort_values('Score', ascending=False)

#Print the top 5 movies
print("Genre you have chosen:      ", genre_input)
print("Number of qualified movies: ", qualified_movies.shape)
qualified_movies[['Original_Title', 'Genres_Parse_new','Vote_Count','Vote_Average', 'Popularity','Score']].head(5)

Genre you have chosen:       Science Fiction
Number of qualified movies:  (578, 30)


Unnamed: 0,Original_Title,Genres_Parse_new,Vote_Count,Vote_Average,Popularity,Score
1149,The Empire Strikes Back,"[Adventure, Action, Science Fiction]",5998.0,8.2,19.470959,8.131899
15395,Inception,"[Action, Thriller, Science Fiction, Mystery, A...",14075.0,8.1,29.108149,8.071676
22702,Interstellar,"[Adventure, Drama, Science Fiction]",11187.0,8.1,32.213481,8.064469
255,Star Wars,"[Adventure, Action, Science Fiction]",6778.0,8.1,42.149697,8.0419
1220,Back to the Future,"[Adventure, Comedy, Science Fiction, Family]",6239.0,8.0,25.778509,7.939554
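
Steps 2 to 4 could also be bundled into one reusable function. The sketch below is one possible arrangement, not the notebook's own code: `recommend` is a hypothetical name, the score is computed column-wise instead of with `apply`, and the small DataFrame is made up purely to show the call.

```python
import pandas as pd

def recommend(movies, genre, m, C, top_n=5):
    """Return the top_n movies of one genre, ranked by weighted rating.

    Hypothetical helper: `movies` needs 'Vote_Count', 'Vote_Average' and a
    'Genres_Parse_new' column holding lists of genre names.
    """
    has_genre = movies['Genres_Parse_new'].apply(
        lambda g: isinstance(g, list) and genre in g)
    picked = movies.loc[(movies['Vote_Count'] >= m) & has_genre].copy()
    v, R = picked['Vote_Count'], picked['Vote_Average']
    picked['Score'] = v / (v + m) * R + m / (v + m) * C   # vectorised formula
    return picked.sort_values('Score', ascending=False).head(top_n)

# Tiny made-up frame, only to demonstrate the call
toy = pd.DataFrame({
    'Original_Title': ['A', 'B', 'C', 'D'],
    'Vote_Count': [500.0, 300.0, 10.0, 400.0],
    'Vote_Average': [8.0, 6.0, 9.5, 7.0],
    'Genres_Parse_new': [['Drama'], ['Drama', 'War'], ['Drama'], float('nan')],
})
top = recommend(toy, 'Drama', m=100.0, C=6.0, top_n=2)
print(top[['Original_Title', 'Score']])
```

Movie C is dropped for having too few votes and movie D for having no parsed genres, so only A and B are ranked.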


### Model 2: Simple KNN Regression Model by Score

- KNN Regression Model
- Input (features, X):
    - ['Vote_Average']
    - ['Vote_Count']
    - ['Budget']
    - ['Revenue']
    - ['Runtime']
- Output (target, y): ['Score']

In [228]:
# preview the data loaded earlier
df.head()

Unnamed: 0,Id,Original_Title,Original_Language,Spoken_Languages,Budget,Revenue,Runtime,Release_Date,Production_Companies,Production_Countries,imdb_id,Popularity,Vote_Average,Vote_Count,Belongs_To_Collection,Tagline,Overview,Genres,Genres_Parse,Belongs_To_Collection_Parse,Spoken_Languages_Parse,Production_Companies_Parse,Production_Countries_Parse,Production_Countries_Code_Parse,Keywords,Keywords_parse,Cast_parse,Director_parse
0,862,Toy Story,en,"[{'iso_639_1': 'en', 'name': 'English'}]",30000000.0,373554033.0,81.0,1995-10-30,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0114709,21.946943,7.7,5415.0,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","['Animation', 'Comedy', 'Family']",['Toy Story Collection'],['English'],['Pixar Animation Studios'],['United States of America'],['US'],"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","['jealousy', 'toy', 'boy', 'friendship', 'frie...","['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...",['John Lasseter']
1,8844,Jumanji,en,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",65000000.0,262797249.0,104.0,1995-12-15,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0113497,17.015539,6.9,2413.0,,Roll the dice and unleash the excitement!,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","['Adventure', 'Fantasy', 'Family']",,"['English', 'Français']","['TriStar Pictures', 'Teitler Film', 'Intersco...",['United States of America'],['US'],"[{'id': 10090, 'name': 'board game'}, {'id': 1...","['board game', 'disappearance', ""based on chil...","['Robin Williams', 'Jonathan Hyde', 'Kirsten D...",['Joe Johnston']
2,15602,Grumpier Old Men,en,"[{'iso_639_1': 'en', 'name': 'English'}]",0.0,0.0,101.0,1995-12-22,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0113228,11.7129,6.5,92.0,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",Still Yelling. Still Fighting. Still Ready for...,A family wedding reignites the ancient feud be...,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","['Romance', 'Comedy']",['Grumpy Old Men Collection'],['English'],"['Warner Bros.', 'Lancaster Gate']",['United States of America'],['US'],"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","['fishing', 'best friend', 'duringcreditssting...","['Walter Matthau', 'Jack Lemmon', 'Ann-Margret...",['Howard Deutch']
3,31357,Waiting to Exhale,en,"[{'iso_639_1': 'en', 'name': 'English'}]",16000000.0,81452156.0,127.0,1995-12-22,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",tt0114885,3.859495,6.1,34.0,,Friends are the people who let you be yourself...,"Cheated on, mistreated and stepped on, the wom...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","['Comedy', 'Drama', 'Romance']",,['English'],['Twentieth Century Fox Film Corporation'],['United States of America'],['US'],"[{'id': 818, 'name': 'based on novel'}, {'id':...","['based on novel', 'interracial relationship',...","['Whitney Houston', 'Angela Bassett', 'Loretta...",['Forest Whitaker']
4,11862,Father of the Bride Part II,en,"[{'iso_639_1': 'en', 'name': 'English'}]",0.0,76578911.0,106.0,1995-02-10,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0113041,8.387519,5.7,173.0,"{'id': 96871, 'name': 'Father of the Bride Col...",Just When His World Is Back To Normal... He's ...,Just when George Banks has recovered from his ...,"[{'id': 35, 'name': 'Comedy'}]",['Comedy'],['Father of the Bride Collection'],['English'],"['Sandollar Productions', 'Touchstone Pictures']",['United States of America'],['US'],"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","['baby', 'midlife crisis', 'confidence', 'agin...","['Steve Martin', 'Diane Keaton', 'Martin Short...",['Charles Shyer']


In [None]:
### NOT USING THIS 
#create a mapping of unique title and Id
# lookup_title = dict(zip(df.Id.unique(), df.Original_Title.unique()))
# lookup_title

In [233]:
# add Score as a new column
df['Score'] = df.apply(weighted_rating, axis=1)
df.head()

Unnamed: 0,Id,Original_Title,Original_Language,Spoken_Languages,Budget,Revenue,Runtime,Release_Date,Production_Companies,Production_Countries,imdb_id,Popularity,Vote_Average,Vote_Count,Belongs_To_Collection,Tagline,Overview,Genres,Genres_Parse,Belongs_To_Collection_Parse,Spoken_Languages_Parse,Production_Companies_Parse,Production_Countries_Parse,Production_Countries_Code_Parse,Keywords,Keywords_parse,Cast_parse,Director_parse,Score
0,862,Toy Story,en,"[{'iso_639_1': 'en', 'name': 'English'}]",30000000.0,373554033.0,81.0,1995-10-30,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0114709,21.946943,7.7,5415.0,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"Led by Woody, Andy's toys live happily in his ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","['Animation', 'Comedy', 'Family']",['Toy Story Collection'],['English'],['Pixar Animation Studios'],['United States of America'],['US'],"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","['jealousy', 'toy', 'boy', 'friendship', 'frie...","['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...",['John Lasseter'],7.639392
1,8844,Jumanji,en,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",65000000.0,262797249.0,104.0,1995-12-15,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0113497,17.015539,6.9,2413.0,,Roll the dice and unleash the excitement!,When siblings Judy and Peter discover an encha...,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","['Adventure', 'Fantasy', 'Family']",,"['English', 'Français']","['TriStar Pictures', 'Teitler Film', 'Intersco...",['United States of America'],['US'],"[{'id': 10090, 'name': 'board game'}, {'id': 1...","['board game', 'disappearance', ""based on chil...","['Robin Williams', 'Jonathan Hyde', 'Kirsten D...",['Joe Johnston'],6.819382
2,15602,Grumpier Old Men,en,"[{'iso_639_1': 'en', 'name': 'English'}]",0.0,0.0,101.0,1995-12-22,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0113228,11.7129,6.5,92.0,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",Still Yelling. Still Fighting. Still Ready for...,A family wedding reignites the ancient feud be...,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","['Romance', 'Comedy']",['Grumpy Old Men Collection'],['English'],"['Warner Bros.', 'Lancaster Gate']",['United States of America'],['US'],"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","['fishing', 'best friend', 'duringcreditssting...","['Walter Matthau', 'Jack Lemmon', 'Ann-Margret...",['Howard Deutch'],5.941282
3,31357,Waiting to Exhale,en,"[{'iso_639_1': 'en', 'name': 'English'}]",16000000.0,81452156.0,127.0,1995-12-22,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",tt0114885,3.859495,6.1,34.0,,Friends are the people who let you be yourself...,"Cheated on, mistreated and stepped on, the wom...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","['Comedy', 'Drama', 'Romance']",,['English'],['Twentieth Century Fox Film Corporation'],['United States of America'],['US'],"[{'id': 818, 'name': 'based on novel'}, {'id':...","['based on novel', 'interracial relationship',...","['Whitney Houston', 'Angela Bassett', 'Loretta...",['Forest Whitaker'],5.70775
4,11862,Father of the Bride Part II,en,"[{'iso_639_1': 'en', 'name': 'English'}]",0.0,76578911.0,106.0,1995-02-10,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",tt0113041,8.387519,5.7,173.0,"{'id': 96871, 'name': 'Father of the Bride Col...",Just When His World Is Back To Normal... He's ...,Just when George Banks has recovered from his ...,"[{'id': 35, 'name': 'Comedy'}]",['Comedy'],['Father of the Bride Collection'],['English'],"['Sandollar Productions', 'Touchstone Pictures']",['United States of America'],['US'],"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","['baby', 'midlife crisis', 'confidence', 'agin...","['Steve Martin', 'Diane Keaton', 'Martin Short...",['Charles Shyer'],5.664068


Prepare the DataFrame for the model

In [242]:
df2 = df[['Vote_Average','Vote_Count','Budget','Revenue','Runtime','Score']].copy()

# Drop a record only when all six of these columns are missing (how="all")
df2.dropna(axis="index", how="all", subset=['Vote_Average','Vote_Count','Budget','Revenue','Runtime','Score'], inplace=True)

print(df2.shape)
print(df2.head())

(44884, 6)
   Vote_Average  Vote_Count      Budget      Revenue  Runtime     Score
0           7.7      5415.0  30000000.0  373554033.0     81.0  7.639392
1           6.9      2413.0  65000000.0  262797249.0    104.0  6.819382
2           6.5        92.0         0.0          0.0    101.0  5.941282
3           6.1        34.0  16000000.0   81452156.0    127.0  5.707750
4           5.7       173.0         0.0   76578911.0    106.0  5.664068
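
The `how="all"` argument used above only removes rows in which every listed column is missing. The sketch below contrasts it with `how="any"` on a three-row made-up frame, in case a stricter filter is wanted.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Vote_Average': [7.7, np.nan, np.nan],
    'Runtime':      [81.0, 104.0, np.nan],
})

kept_all = demo.dropna(how="all")   # drops only the fully empty last row
kept_any = demo.dropna(how="any")   # drops every row with at least one NaN

print(len(kept_all), len(kept_any))
```

With `how="all"`, rows that are only partly missing stay in the frame, which is why imputation is still needed before fitting the model.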


Plot the training and test sets
- plot a scatter matrix
- train_test_split uses a 75-25 split by default

In [249]:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

X = df[['Vote_Average','Vote_Count','Budget','Revenue','Runtime']]
y = df['Score']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
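
The 75-25 default can be checked on a small made-up array. The sketch below calls `train_test_split` without `test_size`, as in the cell above; the data is invented for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)   # 100 made-up rows, 2 features
y_demo = np.arange(100)

# No test_size given: scikit-learn holds out 25% of the rows by default
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

print(X_tr.shape, X_te.shape)
```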

Create KNN Regression:
- Option 1: impute NaN values in X and remove the rows where y is NaN

In [251]:
# Step 0 - drop rows where y is NaN
not_nan_indices = ~np.isnan(y)
X = X.loc[not_nan_indices]
y = y[not_nan_indices]

# Step 1: Imputation - Replace NaN with the mean of each column
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Step 2: Verify that there are no NaN values left in the data
print("Are there any NaN values after imputation?")
print(pd.DataFrame(X_imputed).isna().sum())  # Should output 0 for each column

# Step 3: Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)

# Step 4: Fit the KNN Regressor model
knnreg = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# Step 5: Make predictions and evaluate the model
print("Predictions:", knnreg.predict(X_test))
print('R-squared test score: {:.3f}'.format(knnreg.score(X_test, y_test)))

Are there any NaN values after imputation?
0    0
1    0
2    0
3    0
4    0
dtype: int64
Predictions: [5.72256889 5.67225629 5.62593129 ... 5.78875609 5.63735532 5.53499975]
R-squared test score: 0.340
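
KNN is distance-based, so columns on a large scale (Budget and Revenue run into the hundreds of millions) may dominate columns such as Vote_Average, and this may be one reason for the modest R-squared. The sketch below is a possible variation, not the notebook's method: it uses synthetic data and wraps imputation, scaling and the regressor in a scikit-learn pipeline so that the imputer and scaler are fitted on the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Made-up stand-ins for Vote_Average (0-10) and Budget (up to 1e8)
vote_avg = rng.uniform(0, 10, n)
budget = rng.uniform(0, 1e8, n)
X_demo = np.column_stack([vote_avg, budget])
X_demo[rng.random(n) < 0.05, 1] = np.nan       # sprinkle some missing budgets
y_demo = vote_avg + rng.normal(0, 0.1, n)      # target driven by the small-scale column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Imputer and scaler are fitted on the training rows only, inside the pipeline
scaled_knn = make_pipeline(SimpleImputer(strategy='mean'),
                           StandardScaler(),
                           KNeighborsRegressor(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)

# Same model without scaling, for comparison
raw_knn = make_pipeline(SimpleImputer(strategy='mean'),
                        KNeighborsRegressor(n_neighbors=5))
raw_knn.fit(X_tr, y_tr)

scaled_r2 = scaled_knn.score(X_te, y_te)
raw_r2 = raw_knn.score(X_te, y_te)
print('scaled R^2: {:.3f}'.format(scaled_r2))
print('raw R^2:    {:.3f}'.format(raw_r2))
```

On this synthetic data the scaled pipeline scores higher because the unscaled distances are driven almost entirely by the budget column. Whether the same holds for the movie data would need to be checked by rerunning the notebook.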


Option 2: Impute NaN values in both X and y.

In [252]:
# Step 1: Impute missing values in X
imputer_X = SimpleImputer(strategy='mean')
X_imputed = imputer_X.fit_transform(X)

# Step 2: Impute missing values in y
imputer_y = SimpleImputer(strategy='mean')
# A pandas Series has no .reshape(), so convert to a NumPy array first
y_imputed = imputer_y.fit_transform(y.to_numpy().reshape(-1, 1)).ravel()  # ravel() to return to original shape

# Step 3: Check that there are no NaN values left in y
print("Are there any NaN values in y after imputation?")
print(np.isnan(y_imputed).sum())  # Should output 0

# Step 4: Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y_imputed, test_size=0.2, random_state=42)

# Step 5: Fit the KNN Regressor model
knnreg = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# Step 6: Make predictions and evaluate the model
print("Predictions:", knnreg.predict(X_test))
print('R-squared test score: {:.3f}'.format(knnreg.score(X_test, y_test)))