TODO: use the overview column to help with the recommender system (Doc2Vec, gensim wrapper, etc.)

# The Cinematic Nexus: Unveiling the Future of Movie Recommendations and Analysis

by Anthony Amadasun

## 1.2 Data Modeling 

---

### 1.2.1 Introduction




This section transforms and engineers the data for the movie recommendation system. It also builds and evaluates predictive models and uses data visualization to gain insight into their performance and characteristics.

---

#### Imports

In [1]:
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix

from sklearn.preprocessing import MultiLabelBinarizer

import json


In [2]:
tmdb_df = pd.read_csv('../data/tmdb_data.csv')

In [3]:
tmdb_df.head()

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,...,vote_average,vote_count,directors,cast,popularity_normalized,vote_count_normalized,vote_average_normalized,genre_names,release_year,genre_ids_str
0,False,/4MCKNAc6AbWjEsM2h9Xc29owo4z.jpg,"[28, 53, 18]",866398,en,The Beekeeper,One man’s campaign for vengeance takes on nati...,3775.726,/A7EByudX0eOzlkQ2FIbogzyazm2.jpg,2024-01-10,...,7.252,881,David Ayer,"Jason Statham, Emmy Raver-Lampman, Bobby Nader...",1.0,0.025019,0.7252,"['Action', 'Thriller', 'Drama']",2024.0,"['Action', 'Thriller', 'Drama']"
1,False,/pWsD91G2R1Da3AKM3ymr3UoIfRb.jpg,"[28, 878, 18]",933131,ko,황야,After a deadly earthquake turns Seoul into a l...,1734.954,/zVMyvNowgbsBAL6O6esWfRpAcOb.jpg,2024-01-26,...,6.794,245,Heo Myeong-haeng,"Ma Dong-seok, Lee Hee-jun, Lee Jun-young, Roh ...",0.4554,0.006958,0.6794,"['Action', 'Science Fiction', 'Drama']",2024.0,"['Action', 'Science Fiction', 'Drama']"
2,False,/criPrxkTggCra1jch49jsiSeXo1.jpg,"[878, 12, 28]",609681,en,The Marvels,"Carol Danvers, aka Captain Marvel, has reclaim...",1362.364,/9GBhzXMFjgcZ3FdR9w3bUMMTps5.jpg,2023-11-08,...,6.331,1485,Nia DaCosta,"Brie Larson, Teyonah Parris, Iman Vellani, Zaw...",0.355971,0.042172,0.6331,"['Science Fiction', 'Adventure', 'Action']",2023.0,"['Science Fiction', 'Adventure', 'Action']"
3,False,/yyFc8Iclt2jxPmLztbP617xXllT.jpg,"[35, 10751, 14]",787699,en,Wonka,Willy Wonka – chock-full of ideas and determin...,1340.068,/qhb1qOilapbapxWQn9jtRCMwXJF.jpg,2023-12-06,...,7.208,1955,Paul King,"Timothée Chalamet, Calah Lane, Keegan-Michael ...",0.350021,0.055519,0.7208,"['Comedy', 'Family', 'Fantasy']",2023.0,"['Comedy', 'Family', 'Fantasy']"
4,False,/cnqwv5Uz3UW5f086IWbQKr3ksJr.jpg,"[28, 12, 14]",572802,en,Aquaman and the Lost Kingdom,Black Manta seeks revenge on Aquaman for his f...,993.425,/7lTnXOy0iNtBAdRP3TZvaKJ77F6.jpg,2023-12-20,...,6.95,1510,James Wan,"Jason Momoa, Patrick Wilson, Yahya Abdul-Matee...",0.257516,0.042882,0.695,"['Action', 'Adventure', 'Fantasy']",2023.0,"['Action', 'Adventure', 'Fantasy']"


### 1.2.2 Data Transformation/Engineering

**Deliverables:**
- Feature Engineering: Create new features that might enhance the predictive power of the models, such as extracting information from movie titles, actors, or directors.
- Handle Sparse Data: Address potential sparsity issues in user-item interaction matrices, as sparse data can impact collaborative filtering models.
- Encoding: Encode categorical features, ensuring all data is in a format suitable for modeling.

---

**Feature Engineering**

Create columns that capture the influence of lead actors/actresses and directors, using the average rating of their movies.

In [4]:
# Define a function to extract the lead actor from the comma-separated list
def extract_lead_actor(x):
    try:
        if not pd.isna(x):
            cast_list = x.split(', ')
            return cast_list[0] if cast_list else None
        else:
            return None
    except Exception as e:
        print(f"Error extracting lead actor: {e}")
        return None

# Apply the function to create the lead_actor column
tmdb_df['lead_actor'] = tmdb_df['cast'].apply(extract_lead_actor)

# Calculate the average user rating for movies featuring each lead actor
actor_avg_rating = tmdb_df.groupby('lead_actor')['vote_average'].mean().reset_index()
actor_avg_rating.rename(columns={'vote_average': 'lead_actor_avg_rating'}, inplace=True)

# Merge the actor average ratings back to df
tmdb_df = pd.merge(tmdb_df, actor_avg_rating, how='left', on='lead_actor')


In [5]:
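# Alternative extractor for a JSON-encoded `directors` value (a list of
# objects with a 'name' key). It is not called below: in this dataset the
# column holds plain comma-separated names, which the next cell splits directly.
def extract_director(x):
    try:
        if pd.notna(x):  # Skip missing values
            directors_list = json.loads(x)
            if directors_list:
                return directors_list[0]['name']
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        pass  # Malformed or unexpected input falls through to None
    return None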


In [6]:
# Create the single director column by extracting the first director name
tmdb_df['director'] = tmdb_df['directors'].apply(lambda x: x.split(',')[0] if pd.notna(x) else None)

# Calculate the average user rating for movies directed by each director
director_avg_rating = tmdb_df.groupby('director')['vote_average'].mean().reset_index()
director_avg_rating.rename(columns={'vote_average': 'director_avg_rating'}, inplace=True)

# Merge the director average ratings to df
tmdb_df = pd.merge(tmdb_df, director_avg_rating, how='left', on='director')


In [7]:
tmdb_df

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,...,popularity_normalized,vote_count_normalized,vote_average_normalized,genre_names,release_year,genre_ids_str,lead_actor,lead_actor_avg_rating,director,director_avg_rating
0,False,/4MCKNAc6AbWjEsM2h9Xc29owo4z.jpg,"[28, 53, 18]",866398,en,The Beekeeper,One man’s campaign for vengeance takes on nati...,3775.726,/A7EByudX0eOzlkQ2FIbogzyazm2.jpg,2024-01-10,...,1.000000,0.025019,0.7252,"['Action', 'Thriller', 'Drama']",2024.0,"['Action', 'Thriller', 'Drama']",Jason Statham,6.525095,David Ayer,6.895333
1,False,/pWsD91G2R1Da3AKM3ymr3UoIfRb.jpg,"[28, 878, 18]",933131,ko,황야,After a deadly earthquake turns Seoul into a l...,1734.954,/zVMyvNowgbsBAL6O6esWfRpAcOb.jpg,2024-01-26,...,0.455400,0.006958,0.6794,"['Action', 'Science Fiction', 'Drama']",2024.0,"['Action', 'Science Fiction', 'Drama']",Ma Dong-seok,7.219714,Heo Myeong-haeng,6.794000
2,False,/criPrxkTggCra1jch49jsiSeXo1.jpg,"[878, 12, 28]",609681,en,The Marvels,"Carol Danvers, aka Captain Marvel, has reclaim...",1362.364,/9GBhzXMFjgcZ3FdR9w3bUMMTps5.jpg,2023-11-08,...,0.355971,0.042172,0.6331,"['Science Fiction', 'Adventure', 'Action']",2023.0,"['Science Fiction', 'Adventure', 'Action']",Brie Larson,5.987000,Nia DaCosta,6.181000
3,False,/yyFc8Iclt2jxPmLztbP617xXllT.jpg,"[35, 10751, 14]",787699,en,Wonka,Willy Wonka – chock-full of ideas and determin...,1340.068,/qhb1qOilapbapxWQn9jtRCMwXJF.jpg,2023-12-06,...,0.350021,0.055519,0.7208,"['Comedy', 'Family', 'Fantasy']",2023.0,"['Comedy', 'Family', 'Fantasy']",Timothée Chalamet,6.061600,Paul King,7.336000
4,False,/cnqwv5Uz3UW5f086IWbQKr3ksJr.jpg,"[28, 12, 14]",572802,en,Aquaman and the Lost Kingdom,Black Manta seeks revenge on Aquaman for his f...,993.425,/7lTnXOy0iNtBAdRP3TZvaKJ77F6.jpg,2023-12-20,...,0.257516,0.042882,0.6950,"['Action', 'Adventure', 'Fantasy']",2023.0,"['Action', 'Adventure', 'Fantasy']",Jason Momoa,6.399200,James Wan,7.016300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,False,/cImYSNm0piuUfjlBCozNPEEvTK2.jpg,"[28, 18, 10752]",74513,fr,Forces spéciales,Afghanistan. War correspondent Elsa Casanova i...,28.451,/xpe9F3214mNrk8tnIqZvHSmIV1I.jpg,2011-11-02,...,0.000004,0.015165,0.6705,"['Action', 'Drama', 'War']",2011.0,"['Action', 'Drama', 'War']",Diane Kruger,6.705000,Stéphane Rybojad,6.705000
2996,False,/oisfyausZuNUeKOafx8nxaSdIuB.jpg,"[27, 53, 28]",760883,de,Blood Red Sky,A woman with a mysterious illness is forced in...,28.446,/v7aOJKI5vxCHotHvN8O7SR6SpP6.jpg,2021-07-23,...,0.000002,0.040354,0.6884,"['Horror', 'Thriller', 'Action']",2021.0,"['Horror', 'Thriller', 'Action']",Peri Baumeister,6.884000,Peter Thorwarth,6.774500
2997,False,/b5NH2rjQFATYfkd1XYbGQWeyevx.jpg,"[10751, 12, 16, 14]",47292,ja,劇場版ポケットモンスター ダイヤモンド&パール ギラティナと氷空（そら）の花束 シェイミ,When Giratina is discovered to be able to crea...,28.444,/8dJxgyryI2XmheaTd3hrCkOabNu.jpg,2008-07-19,...,0.000002,0.006446,0.6749,"['Family', 'Adventure', 'Animation', 'Fantasy']",2008.0,"['Family', 'Adventure', 'Animation', 'Fantasy']",Rica Matsumoto,6.828400,Kunihiko Yuyama,6.750333
2998,False,/sJ2FSMSYJqCQNOf76KGnzJQmPzG.jpg,"[35, 27, 14]",928,en,Gremlins 2: The New Batch,Young sweethearts Billy and Kate move to the B...,28.443,/jN7yvxnIHRozhq2mzWZDE5GPRc0.jpg,1990-06-15,...,0.000002,0.067674,0.6407,"['Comedy', 'Horror', 'Fantasy']",1990.0,"['Comedy', 'Horror', 'Fantasy']",Zach Galligan,6.752000,Joe Dante,6.752000


**Handling Sparse Data and Encoding**

In [8]:
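# Parse each genre_ids_str value (the string form of a list) into a Python list.
# ast.literal_eval only evaluates literals, so it is safer than eval here.
import ast

tmdb_df['genre_ids_str'] = tmdb_df['genre_ids_str'].apply(ast.literal_eval)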

In [9]:
# Use MultiLabelBinarizer to one-hot encode the genre_ids_str column
mlb = MultiLabelBinarizer()
genre_encoded = pd.DataFrame(mlb.fit_transform(tmdb_df['genre_ids_str']), columns=mlb.classes_, index=tmdb_df.index)

In [10]:
# Concatenate the one-hot encoded genres with the original DataFrame
tmdb_df = pd.concat([tmdb_df, genre_encoded], axis=1)

In [11]:
# Drop the original genre_ids_str column
tmdb_df = tmdb_df.drop('genre_ids_str', axis=1)

In [12]:
# Prepare data for building a recommendation system
# by handling sparsity and creating a sparse matrix for
# collaborative filtering methods
interaction_data = tmdb_df[['popularity_normalized', 'vote_average_normalized', 
                            'vote_count_normalized', 'release_year'] + list(mlb.classes_)]

In [13]:
interaction_data.dtypes

popularity_normalized      float64
vote_average_normalized    float64
vote_count_normalized      float64
release_year               float64
Action                       int64
Adventure                    int64
Animation                    int64
Comedy                       int64
Crime                        int64
Documentary                  int64
Drama                        int64
Family                       int64
Fantasy                      int64
History                      int64
Horror                       int64
Music                        int64
Mystery                      int64
Romance                      int64
Science Fiction              int64
TV Movie                     int64
Thriller                     int64
War                          int64
Western                      int64
dtype: object

In [14]:
# Handling Sparse Data: Replace NaN values with 0
interaction_matrix = interaction_data.fillna(0)

In [15]:
# Convert to a sparse matrix using csr_matrix
sparse_interaction_matrix = csr_matrix(interaction_matrix.values)
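
Before modeling, it can help to check how sparse the matrix actually is. The following is a minimal sketch on a small made-up array (`toy` is a hypothetical stand-in for `interaction_matrix.values`, not the project data); it measures the share of non-zero cells through the `nnz` attribute.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for interaction_matrix.values (hypothetical values)
toy = np.array([
    [0.5, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.2, 0.0, 0.0, 0.0],
])
toy_sparse = csr_matrix(toy)

# nnz is the number of stored values; when the matrix is built from a
# dense array, these are exactly the non-zero cells
n_cells = toy_sparse.shape[0] * toy_sparse.shape[1]
density = toy_sparse.nnz / n_cells
sparsity = 1.0 - density
print(f"density={density:.3f}, sparsity={sparsity:.3f}")
```

If the density turns out to be high, which is plausible here because the normalized columns and `release_year` are mostly non-zero, the sparse format may save little memory compared with a dense array.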

In [None]:
# TODO: include director and cast in the sparse matrix
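
One possible way to act on the note above is to one-hot encode the `director` and `lead_actor` columns and stack them next to the existing features. The sketch below is an assumption about how this could be done, not the notebook's implementation; `toy_df` and its names and values are hypothetical placeholders.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of tmdb_df, for illustration only
toy_df = pd.DataFrame({
    "popularity_normalized": [1.0, 0.46, 0.36],
    "director": ["Director A", "Director B", "Director A"],
    "lead_actor": ["Actor X", "Actor Y", None],
})

# Missing names get a placeholder category so the encoder sees only strings
people = toy_df[["director", "lead_actor"]].fillna("unknown")

# OneHotEncoder returns a SciPy sparse matrix by default
encoder = OneHotEncoder(handle_unknown="ignore")
people_sparse = encoder.fit_transform(people)

numeric_sparse = csr_matrix(toy_df[["popularity_normalized"]].values)

# Stack the numeric feature and the one-hot name columns side by side
combined = hstack([numeric_sparse, people_sparse], format="csr")
print(combined.shape)
```

With thousands of distinct names the one-hot block becomes very wide but stays cheap to store, since each row holds only one non-zero value per encoded column.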

### 1.2.3 Data Modeling


**Deliverables:**

- Collaborative Filtering: Implement collaborative filtering techniques to make movie recommendations based on user preferences and similarities between users or items.
- Content-Based Filtering: Apply content-based filtering approaches to recommend movies based on their features, such as genre, cast, or director.
- Hybrid Models: Explore the development of hybrid models that combine collaborative and content-based filtering for improved recommendation accuracy.
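
As a starting point for the content-based deliverable, the following sketch ranks movies by cosine similarity between feature rows. It is illustrative only: the titles and feature values are hypothetical, and `recommend` is a helper name introduced here. The columns shown earlier appear to contain aggregate votes rather than per-user ratings, so collaborative filtering is not sketched.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature rows: [Action, Comedy, Drama, vote_average_normalized]
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
features = csr_matrix(np.array([
    [1, 0, 1, 0.7],
    [1, 0, 0, 0.6],
    [0, 1, 0, 0.7],
    [1, 0, 1, 0.5],
]))

# Pairwise cosine similarity between all movies (dense n x n array)
similarity = cosine_similarity(features)

def recommend(index, top_n=2):
    """Return the top_n titles most similar to the movie at `index`."""
    scores = similarity[index].copy()
    scores[index] = -np.inf  # never recommend the query movie itself
    best = np.argsort(scores)[::-1][:top_n]
    return [titles[i] for i in best]

print(recommend(0))
```

The same call should work on `sparse_interaction_matrix`, although unscaled columns such as `release_year` would dominate the similarity unless they are normalized first.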

---

### 1.2.4 Data Visualization


**Deliverables:**

- Model Evaluation: Visualize the performance of different recommendation models using metrics such as precision, recall, and accuracy.
- Feature Importance: Gain insights into the importance of different features in the models through visualizations, aiding in model interpretation.
- User-Item Interaction: Visualize patterns in user-item interaction matrices to understand user preferences and item popularity.
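
The precision and recall metrics named above are usually computed over the top k recommendations. A minimal sketch, assuming a ranked list of recommended titles and a set of titles the user actually liked (both hypothetical here, as is the function name `precision_recall_at_k`):

```python
def precision_recall_at_k(recommended, relevant, k):
    """Compute precision@k and recall@k for one ranked recommendation list."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked output and ground-truth set of liked movies
recommended = ["Movie D", "Movie B", "Movie C", "Movie E"]
relevant = {"Movie B", "Movie E", "Movie F"}

for k in (1, 2, 4):
    p, r = precision_recall_at_k(recommended, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

Plotting these values against k for each model would give one of the evaluation charts this section calls for.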

---