use overview column to help with recommender system | doc.to.vec | gensem wrapper, etc

# The Cinematic Nexus: Unveiling the Future of Movie Recommendations and Analysis

by Anthony Amadasun

## 1.2 Data Modeling 

---

### 1.2.1 Introduction




In this section, the project will delve into the process of transforming and engineering the data for our movie recommendation system. Additionally, we will build and evaluate predictive models and employ data visualization techniques to gain insights into the performance and characteristics of the models.

---

#### Imports

In [37]:
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix

#sklearn import
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

#nltk import
import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

import string
import json


In [2]:
tmdb_df = pd.read_csv('../data/tmdb_data.csv')

In [3]:
tmdb_df.head()

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,...,vote_average,vote_count,directors,cast,popularity_normalized,vote_count_normalized,vote_average_normalized,genre_names,release_year,genre_ids_str
0,False,/4MCKNAc6AbWjEsM2h9Xc29owo4z.jpg,"[28, 53, 18]",866398,en,The Beekeeper,One man’s campaign for vengeance takes on nati...,3775.726,/A7EByudX0eOzlkQ2FIbogzyazm2.jpg,2024-01-10,...,7.252,881,David Ayer,"Jason Statham, Emmy Raver-Lampman, Bobby Nader...",1.0,0.025019,0.7252,"['Action', 'Thriller', 'Drama']",2024.0,"['Action', 'Thriller', 'Drama']"
1,False,/pWsD91G2R1Da3AKM3ymr3UoIfRb.jpg,"[28, 878, 18]",933131,ko,황야,After a deadly earthquake turns Seoul into a l...,1734.954,/zVMyvNowgbsBAL6O6esWfRpAcOb.jpg,2024-01-26,...,6.794,245,Heo Myeong-haeng,"Ma Dong-seok, Lee Hee-jun, Lee Jun-young, Roh ...",0.4554,0.006958,0.6794,"['Action', 'Science Fiction', 'Drama']",2024.0,"['Action', 'Science Fiction', 'Drama']"
2,False,/criPrxkTggCra1jch49jsiSeXo1.jpg,"[878, 12, 28]",609681,en,The Marvels,"Carol Danvers, aka Captain Marvel, has reclaim...",1362.364,/9GBhzXMFjgcZ3FdR9w3bUMMTps5.jpg,2023-11-08,...,6.331,1485,Nia DaCosta,"Brie Larson, Teyonah Parris, Iman Vellani, Zaw...",0.355971,0.042172,0.6331,"['Science Fiction', 'Adventure', 'Action']",2023.0,"['Science Fiction', 'Adventure', 'Action']"
3,False,/yyFc8Iclt2jxPmLztbP617xXllT.jpg,"[35, 10751, 14]",787699,en,Wonka,Willy Wonka – chock-full of ideas and determin...,1340.068,/qhb1qOilapbapxWQn9jtRCMwXJF.jpg,2023-12-06,...,7.208,1955,Paul King,"Timothée Chalamet, Calah Lane, Keegan-Michael ...",0.350021,0.055519,0.7208,"['Comedy', 'Family', 'Fantasy']",2023.0,"['Comedy', 'Family', 'Fantasy']"
4,False,/cnqwv5Uz3UW5f086IWbQKr3ksJr.jpg,"[28, 12, 14]",572802,en,Aquaman and the Lost Kingdom,Black Manta seeks revenge on Aquaman for his f...,993.425,/7lTnXOy0iNtBAdRP3TZvaKJ77F6.jpg,2023-12-20,...,6.95,1510,James Wan,"Jason Momoa, Patrick Wilson, Yahya Abdul-Matee...",0.257516,0.042882,0.695,"['Action', 'Adventure', 'Fantasy']",2023.0,"['Action', 'Adventure', 'Fantasy']"


### 1.2.2 Data Transformation/Engineering

**Deliverables:**
- Feature Engineering: Create new features that might enhance the predictive power of the models, such as extracting information from actors amd directors.
- Handle Sparse Data: Address potential sparsity issues in user-item interaction matrices, as sparse data can impact collaborative filtering models.
- Encoding: Encode categorical features, ensuring all data is in a format suitable for modeling.

---

**Feature Engineering**

Create columns that looks at the influences of lead actors/actress and Director popularity

In [4]:
# Define a function to extract the lead actor from the comma-separated list
def extract_lead_actor(x):
    try:
        if not pd.isna(x): #Check if the value is non NaN
            cast_list = x.split(', ')
            return cast_list[0] if cast_list else None
        else:
            return None
    except Exception as e:
        print(f"Error extracting lead actor: {e}")
        return None

# Apply the function to create the lead_actor column
tmdb_df['lead_actor'] = tmdb_df['cast'].apply(extract_lead_actor)

# Calculate the average user rating for movies featuring each lead actor
actor_avg_rating = tmdb_df.groupby('lead_actor')['vote_average'].mean().reset_index()
actor_avg_rating.rename(columns={'vote_average': 'lead_actor_avg_rating'}, inplace=True)

# Merge the actor average ratings back to df
tmdb_df = pd.merge(tmdb_df, actor_avg_rating, how='left', on='lead_actor')


In [5]:
def extract_director(x):
    try:
        if pd.notna(x):  # Check if the value is not NaN
            directors_list = json.loads(x)
            if directors_list:
                return directors_list[0]['name']
    except (json.JSONDecodeError, KeyError, IndexError):
        pass  # Handle errors by returning None or any default value
    return None


In [6]:
# Create the single director column by extracting the first director name
tmdb_df['director'] = tmdb_df['directors'].apply(lambda x: x.split(',')[0] if pd.notna(x) else None)

# Calculate the average user rating for movies directed by each director
director_avg_rating = tmdb_df.groupby('director')['vote_average'].mean().reset_index()
director_avg_rating.rename(columns={'vote_average': 'director_avg_rating'}, inplace=True)

# Merge the director average ratings to df
tmdb_df = pd.merge(tmdb_df, director_avg_rating, how='left', on='director')


**Handling Sparse Data and Encoding**

In [7]:
# Convert the genre_ids_str column to a list of lists
tmdb_df['genre_ids_str'] = tmdb_df['genre_ids_str'].apply(eval)

In [8]:
# Use MultiLabelBinarizer to one-hot encode the genre_ids_str column
mlb = MultiLabelBinarizer()
genre_encoded = pd.DataFrame(mlb.fit_transform(tmdb_df['genre_ids_str']), columns=mlb.classes_, index=tmdb_df.index)

In [9]:
# Concatenate the one-hot encoded genres with the original DataFrame
tmdb_df = pd.concat([tmdb_df, genre_encoded], axis=1)

In [10]:
# Drop the original genre_ids_str column
tmdb_df = tmdb_df.drop('genre_ids_str', axis=1)

In [11]:
# Preparing data for building a recommendation system 
# by handling sparsity and creating a sparse matrix for 
# Collaborative filtering methods
interaction_data = tmdb_df[['popularity_normalized', 'vote_average_normalized', 
                            'vote_count_normalized', 'release_year', 'lead_actor_avg_rating',
                           'director_avg_rating'] + list(mlb.classes_)]

In [12]:
interaction_data.dtypes

popularity_normalized      float64
vote_average_normalized    float64
vote_count_normalized      float64
release_year               float64
lead_actor_avg_rating      float64
director_avg_rating        float64
Action                       int64
Adventure                    int64
Animation                    int64
Comedy                       int64
Crime                        int64
Documentary                  int64
Drama                        int64
Family                       int64
Fantasy                      int64
History                      int64
Horror                       int64
Music                        int64
Mystery                      int64
Romance                      int64
Science Fiction              int64
TV Movie                     int64
Thriller                     int64
War                          int64
Western                      int64
dtype: object

In [36]:
# columns in the interaction_data_pivot
interaction_data_columns = ['director_avg_rating', 'lead_actor_avg_rating', 'popularity_normalized', 'vote_average_normalized', 'vote_count_normalized', 'release_year', 'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western']

# title as index and  the desired columns
interaction_data_pivot = tmdb_df.set_index('title')[interaction_data_columns]

interaction_data_pivot.head()


Unnamed: 0_level_0,director_avg_rating,lead_actor_avg_rating,popularity_normalized,vote_average_normalized,vote_count_normalized,release_year,Action,Adventure,Animation,Comedy,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
The Beekeeper,6.895333,6.525095,1.0,0.7252,0.025019,2024.0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
Badland Hunters,6.794,7.219714,0.4554,0.6794,0.006958,2024.0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
The Marvels,6.181,5.987,0.355971,0.6331,0.042172,2023.0,1,1,0,0,...,0,0,0,0,0,1,0,0,0,0
Wonka,7.336,6.0616,0.350021,0.7208,0.055519,2023.0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
Aquaman and the Lost Kingdom,7.0163,6.3992,0.257516,0.695,0.042882,2023.0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
# Handling Sparse Data: Replace NaN values with 0
interaction_matrix = interaction_data.fillna(0)

In [15]:
#convert into sparse matrix using csr_matrix
sparse_interaction_matrix = csr_matrix(interaction_matrix.values)

In [16]:
# Calculate cosine similarity using sparse matrix
sparse_distances = pairwise_distances(sparse_interaction_matrix, metric='cosine')
sparse_cosine_similarities = 1.0 - sparse_distances


### 1.2.3 Data Modeling


**Deliverables:**

- Collaborative Filtering: Implement collaborative filtering techniques to make movie recommendations based on user preferences and similarities between users or items.
- Content-Based Filtering: Apply content-based filtering approaches to recommend movies based on their features, such as overview, cast, or director and nlp technique.
- Hybrid Models: Explore the development of hybrid models that combine collaborative and content-based filtering for improved recommendation accuracy.

---

**Collaborative Filtering**

- Created an interaction matrix 
- Apply collaborative filtering using cosine similarity on interaction_data_pivot
- item_similarity_matrix will then be used to recommend movies for a given input movie using the get_movie_recommendations function.
- create an iteractive recommendation function that allow users to input their favorite movie and receive recommendations based on their preferences.

In [17]:
def recommend_based_on_user_preference(user_preference, tmdb_df):
    # Split user input
    preferences = [preference.strip() for preference in user_preference.split(',')]

    # Filter movies based on user preferences in directors and cast columns
    filtered_movies = tmdb_df[
        tmdb_df['directors'].apply(lambda x: any(pref.lower() in str(x).lower() for pref in preferences)) |
        tmdb_df['cast'].apply(lambda x: any(pref.lower() in str(x).lower() for pref in preferences))
    ]

    # Sort movies by popularity (can change for other metric)
    sorted_movies = filtered_movies.sort_values(by='popularity_normalized', ascending=False)

    # Extract recommended movie titles
    recommended_movies = sorted_movies['title'].tolist()

    return recommended_movies


In [18]:
item_similarity = cosine_similarity(sparse_interaction_matrix.T, dense_output=False)
# interaction_data_pivot is user interaction data
item_similarity_matrix = cosine_similarity(interaction_data_pivot.fillna(0))

# Function to get movie recommendations based on item similarity
def get_movie_recommendations(movie_title, item_similarity_matrix, interaction_data_pivot):
    """
    This function takes a movie title, an item similarity matrix, and a interaction_data_pivot as input.
    It returns a list of movie recommendations based on the item similarity of the input movie.
    
    - movie_title (str): The title of the movie for which recommendations are requested.
    - item_similarity_matrix (numpy.ndarray): The item similarity matrix, computed using collaborative filtering
      (e.g., cosine similarity on the interaction_data_pivot ).
    - interaction_data_pivot (pd.DataFrame): The user-item interaction matrix where rows represent movies, and columns represent various features
      like 'director_avg_rating', 'lead_actor_avg_rating', 'popularity_normalized', 'vote_average_normalized', 'vote_count_normalized', 'release_year', and genre indicators.
      The values represent movie features or characteristics.
    
    Returns:
    List[str]: A list of recommended movies based on item similarity. 
    The list is sorted in descending order of similarity.
      
    """
    if movie_title in interaction_data_pivot.index:
        similar_scores = item_similarity_matrix[interaction_data_pivot.index.get_loc(movie_title)]
        similar_movies = list(interaction_data_pivot.index[np.argsort(similar_scores)[::-1]])
        return similar_movies[1:]  # Exclude the input movie itself
    else:
        print(f"Movie '{movie_title}' not found in the dataset.")
        user_preference = input("Enter your preferred actor, genre, or other relevant information: ")
        # Perform recommendation based on user's additional input
        recommendations = recommend_based_on_user_preference(user_preference, tmdb_df)
        return recommendations


In [19]:
def interactive_movie_recommendation(tmdb_df):
    """
    This function allows users to input their favorite movie and receive recommendations
    based on their preferences,including genre, director, actor/actress, and release year range.
    
    Parameters:
    - tmdb_df (pd.DataFrame): DataFrame containing movie data, including columns like 'title', 'genre_names', 'directors',
    'cast', 'release_year', 'popularity_normalized', and others.
    
    Returns:
    None
    
    Note: The function utlizes the 'get_movie_recommendations' function, and the 'item_similarity_matrix' computed
    using collaborative filtering (e.g., cosine similarity on the item interaction matrix).
    
    Example usage:
    interactive_movie_recommendation(tmdb_df)
    """
    while True:
        # Prompt the user to enter their favorite movie
        user_input_movie = input("Enter your favorite movie: ")

        # Check if the movie exists in the dataset
        matching_movies = tmdb_df[tmdb_df['title'].str.lower() == user_input_movie.lower()]

        if not matching_movies.empty:
            # If the movie is found, recommend similar movies
            recommended_movies = get_movie_recommendations(user_input_movie, item_similarity_matrix, interaction_data_pivot)
            print(f"\nHere are some recommendations based on '{user_input_movie}':")
            print(recommended_movies[:5])
        else:
            print(f"Movie '{user_input_movie}' not found in the dataset.")
            print("Let's try to find recommendations based on your preferences.")

            while True:
                # Prompt the user for their favorite genre, director, or actor/actress
                user_preference = input("Enter your favorite genre, director, or actor/actress: ")

                # Prompt the user for the desired release year range
                start_year = int(input("Enter the starting year: "))
                end_year = int(input("Enter the ending year: "))

                # Filter movies based on user preferences and release year range
                filtered_movies = tmdb_df[
                    (tmdb_df['genre_names'].apply(lambda x: user_preference.lower() in str(x).lower())) |
                    (tmdb_df['directors'].apply(lambda x: user_preference.lower() in str(x).lower())) |
                    (tmdb_df['cast'].apply(lambda x: user_preference.lower() in str(x).lower())) &
                    (tmdb_df['release_year'].between(start_year, end_year))
                ]
                print("Filtered Movies:")
                print(filtered_movies[['title', 'release_year']])


                # Sort movies by popularity
                sorted_movies = filtered_movies.sort_values(by='popularity_normalized', ascending=False)

                # Extract recommended movie titles
                recommended_movies = sorted_movies['title'].tolist()

                if not recommended_movies:
                    print("No movies found based on your preferences.")
                    break

                print(f"\nHere are some recommendations based on your preferences:")
                print(recommended_movies[:5])

                # user feedback
                user_feedback = input("Do these movies appeal to you? (yes/no): ").lower()

                if user_feedback == 'yes':
                    print("Great! Enjoy watching.")
                    return
                elif user_feedback == 'no':
                    print("Let's try refining your preferences.")
                    continue
                else:
                    print("Invalid input. Please enter 'yes' or 'no'.")
                    continue

        #user feedback
        user_feedback = input("Do these movies appeal to you? (yes/no): ").lower()

        if user_feedback == 'yes':
            print("Great! Enjoy watching.")
            break
        elif user_feedback == 'no':
            print("Sorry to hear that. Let's try refining your preferences.")
            continue
        else:
            print("Invalid input. Please enter 'yes' or 'no'.")
            continue


In [20]:
# resulting interaction prompt
interactive_movie_recommendation(tmdb_df)

Enter your favorite movie:  Wonka



Here are some recommendations based on 'Wonka':
['How the Grinch Stole Christmas', "Roald Dahl's Matilda the Musical", "Roald Dahl's The Witches", 'Disenchanted', 'Hocus Pocus 2']


Do these movies appeal to you? (yes/no):  no


Sorry to hear that. Let's try refining your preferences.


Enter your favorite movie:  star war


Movie 'star war' not found in the dataset.
Let's try to find recommendations based on your preferences.


Enter your favorite genre, director, or actor/actress:  Jason Momoa
Enter the starting year:  2015
Enter the ending year:  2024


Filtered Movies:
                                   title  release_year
4           Aquaman and the Lost Kingdom        2023.0
43                                Fast X        2023.0
46                                  Dune        2021.0
68                             The Flash        2023.0
101                              Aquaman        2018.0
390         Zack Snyder's Justice League        2021.0
477                       Justice League        2017.0
650   Batman v Superman: Dawn of Justice        2016.0
1504                          Sweet Girl        2021.0
2934                              Braven        2018.0

Here are some recommendations based on your preferences:
['Aquaman and the Lost Kingdom', 'Fast X', 'Dune', 'The Flash', 'Aquaman']


Do these movies appeal to you? (yes/no):  no


Let's try refining your preferences.


Enter your favorite genre, director, or actor/actress:  George Lucas
Enter the starting year:  2000
Enter the ending year:  2023


Filtered Movies:
                                             title  release_year
397                                      Star Wars        1977.0
1840     Star Wars: Episode I - The Phantom Menace        1999.0
2022  Star Wars: Episode III - Revenge of the Sith        2005.0
2376                             American Graffiti        1973.0
2437  Star Wars: Episode II - Attack of the Clones        2002.0
2522               Obi-Wan Kenobi: A Jedi's Return        2022.0

Here are some recommendations based on your preferences:
['Star Wars', 'Star Wars: Episode I - The Phantom Menace', 'Star Wars: Episode III - Revenge of the Sith', 'American Graffiti', 'Star Wars: Episode II - Attack of the Clones']


Do these movies appeal to you? (yes/no):  yes


Great! Enjoy watching.


In [21]:
american_graffiti_release_year = tmdb_df.loc[tmdb_df['title'] == 'American Graffiti', 'release_year']
print(american_graffiti_release_year)


2376    1973.0
Name: release_year, dtype: float64


In [22]:
print(tmdb_df['release_year'].dtype)


float64


**Content Based Filtering**

<ins>Text preprocessing step:</ins> 
- Lowercasing: Convert all text to lowercase to ensure consistency.
- Tokenization: Break down the text into individual words or tokens.
- Removing Stopwords: Exclude common words that don't carry much meaning (e.g., "the," "and," "is").
- Removing Punctuation and Special Characters: Keep only alphanumeric characters and relevant symbols.
- Stemming or Lemmatization: Reduce words to their base or root form for better feature representation.

<ins>Feature Extraction:</ins>
-  convert the text into numerical features (TF-IDF or Cvec)

In [28]:
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aamad_000/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/aamad_000/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /Users/aamad_000/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [40]:
#functions for text preprocessing 
#
def preprocess_text(text):
    #lowcasing
    text = text.lower()
    
    #tokenization
    tokens = word_tokenize(text)
    
    #removing stopwords and punctuation
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.isalnum() and token not in stop_words]
    
    #Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    
    return ' '.join(tokens)

def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV, "J": wordnet.ADJ}
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_text(text):
    
    if pd.isnull(text):  # Check for NaN values
        return ''
    
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    
    tokens = word_tokenize(text)
    filtered_tokens = [token.lower() for token in tokens if token.isalpha() and token.lower() not in stop_words]

    lemmatized_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in filtered_tokens]
    return ' '.join(lemmatized_tokens)


In [35]:
# overview_example = "In a GALAXY far, far away..."
# preprocessed_overview = lemmatize_text(overview_example)
# print(preprocessed_overview)

galaxy far far away


In [41]:
tmdb_df['preprocessed_overview'] = tmdb_df['overview'].apply(lemmatize_text)

### 1.2.4 Data Visualization


**Deliverables:**

- Model Evaluation: Visualize the performance of different recommendation models using metrics such as precision, recall, and accuracy.
- Feature Importance: Gain insights into the importance of different features in the models through visualizations, aiding in model interpretation.
- User-Item Interaction: Visualize patterns in user-item interaction matrices to understand user preferences and item popularity.

---