# Vectorizing Movies Overview

In this notebook, we transform movie metadata into numerical vectors that can be used to build a recommendation system. Each movie is represented as a vector of roughly 100 values that capture attributes of the film such as its genres, spoken languages, budget, popularity, runtime, and vote statistics.





In [1]:
# Import libraries
import pandas as pd              
import numpy as np                   
import matplotlib.pyplot as plt      

## Data Loading
In this section, we load the movie metadata and user ratings datasets from CSV files. 


In [2]:
movies_metadata_path = r"f:\projects\Movie-Recommender\dataset\movies_metadata.csv"
ratings_small_path = r"f:\projects\Movie-Recommender\dataset\ratings_small.csv"

# Read the CSV files containing movie metadata and user ratings
movies = pd.read_csv(movies_metadata_path, low_memory=False)
ratings = pd.read_csv(ratings_small_path)

movies.head(5)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
column_names = movies.columns.tolist()
print(column_names)


['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count']


## Data Cleaning and Vectorization
In this section, we perform data cleaning and vectorization of the movie metadata dataset. This includes transforming categorical features into a numerical format suitable for machine learning algorithms.

In [4]:
import ast

# Convert 'id' to numeric; malformed IDs become NaN
movies['id'] = pd.to_numeric(movies['id'], errors='coerce')
# Drop rows where 'id' is NaN (i.e., rows with invalid IDs)
movies.dropna(subset=['id'], inplace=True)
# Extract the release year; unparseable dates become NaT and yield a NaN year
movies['release_year'] = pd.to_datetime(
    movies['release_date'], errors='coerce').dt.year

# Genre processing
# Fill missing 'genres' with an empty list ('[]') and convert genre strings to a list of dictionaries
movies['genres'] = movies['genres'].fillna('[]')
movies['genres'] = movies['genres'].apply(ast.literal_eval)
movies['genres'] = movies['genres'].apply(
    lambda x: [genre['name'] for genre in x])
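The genre column stores each value as the string form of a Python list of dictionaries, which is why `ast.literal_eval` is needed before the names can be extracted. The sketch below shows the same idea on a hand-written sample string; the helper name `parse_names` is hypothetical (it is not part of the notebook), and falling back to an empty list for malformed cells is an assumption about how such rows should be treated.

```python
import ast


def parse_names(raw):
    """Return the 'name' field of each dict in a stringified list of dicts."""
    if not isinstance(raw, str) or not raw.strip():
        return []  # missing or empty cell: no labels
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # malformed cell: assumed to mean "no labels"
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items
            if isinstance(item, dict) and "name" in item]


sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
names = parse_names(sample)
print(names)
```

A helper like this could be passed to `.apply(...)` in place of the separate `fillna`, `literal_eval`, and lambda steps, at the cost of silently hiding malformed rows.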


In [5]:
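# Spoken-language processing: same approach as for genres
# Fill missing values with '[]', parse the strings, and keep only the language names
movies['spoken_languages'] = movies['spoken_languages'].fillna('[]')
movies['spoken_languages'] = movies['spoken_languages'].apply(ast.literal_eval)
movies['spoken_languages'] = movies['spoken_languages'].apply(
    lambda x: [language['name'] for language in x])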


In [6]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
genres_encoded = mlb.fit_transform(movies['genres'])
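To make the multi-hot encoding concrete, here is a minimal sketch on three made-up movies (the toy lists and the `toy_` names are illustrative only). Each distinct label becomes one column, and a row has a 1 in every column whose label appears in that movie's list.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three toy movies; the last one has no genres at all
toy_genres = [["Animation", "Comedy"], ["Adventure"], []]

toy_mlb = MultiLabelBinarizer()
toy_encoded = toy_mlb.fit_transform(toy_genres)

# classes_ gives the label behind each column (sorted alphabetically)
print(list(toy_mlb.classes_))
print(toy_encoded)
```

A movie with an empty genre list becomes an all-zero row, which is what the earlier `fillna('[]')` step relies on.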

In [10]:
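# The map keys assume 'adult' holds the strings 'True'/'False'; any other value
# becomes NaN and makes astype(int) fail, so this cell cannot be run twice
movies['adult'] = movies['adult'].map({'True': 1, 'False': 0}).astype(int)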


In [8]:
from sklearn.preprocessing import MinMaxScaler

# Rescale 'budget' to the [0, 1] range
scaler = MinMaxScaler()
movies['budget_normalized'] = scaler.fit_transform(movies[['budget']])
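Min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below checks that formula against `MinMaxScaler` on a toy column (the values are made up) and also shows what happens to missing values: the scaler ignores NaN when fitting and leaves it in place when transforming, so any NaN in a scaled column would carry through to the final vectors unless it is filled first.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy budgets stored as strings, with one missing value
toy = pd.DataFrame({"budget": ["0", "30000000", "60000000", None]})
toy["budget"] = pd.to_numeric(toy["budget"], errors="coerce")

toy_scaler = MinMaxScaler()
scaled = toy_scaler.fit_transform(toy[["budget"]]).ravel()

# The same result computed by hand with (x - min) / (max - min)
col = toy["budget"]
manual = ((col - col.min()) / (col.max() - col.min())).to_numpy()

print(scaled)
print(np.allclose(scaled, manual, equal_nan=True))
```

Whether to fill missing values with 0, the median, or something else before scaling is a modelling choice that the notebook does not make explicit.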


In [9]:
# One-hot encode the original language (one column per language code)
original_language_encoded = pd.get_dummies(movies['original_language'])
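Unlike genres, each movie has a single original language, so a plain one-hot encoding is enough. A small sketch with made-up language codes (the `toy_` names are illustrative) shows the resulting layout, including how a missing value is handled:

```python
import pandas as pd

toy_lang = pd.Series(["en", "fr", "en", None], name="original_language")

# dtype=int gives 0/1 columns instead of booleans
toy_dummies = pd.get_dummies(toy_lang, dtype=int)

print(toy_dummies)
```

By default `get_dummies` does not create a column for missing values, so the last row is all zeros.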

In [11]:
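# Use a separate binarizer so that `mlb` keeps the genre classes
mlb_languages = MultiLabelBinarizer()
spoken_languages_encoded = mlb_languages.fit_transform(movies['spoken_languages'])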


In [12]:
movies['popularity_normalized'] = scaler.fit_transform(movies[['popularity']])


In [13]:
movies['runtime_normalized'] = scaler.fit_transform(movies[['runtime']])


In [14]:
movies['vote_average_normalized'] = scaler.fit_transform(
    movies[['vote_average']])
movies['vote_count_normalized'] = scaler.fit_transform(movies[['vote_count']])


In [16]:
# Combine the encoded features into one NumPy array, column block by column block.
# Note: original_language_encoded is computed above but is not included here.
final_features = np.hstack((
    genres_encoded,
    spoken_languages_encoded,
    movies[['adult', 'budget_normalized', 'popularity_normalized', 'runtime_normalized', 'vote_average_normalized', 'vote_count_normalized']].to_numpy()
))
# Wrap the array in a DataFrame for easier inspection
final_df = pd.DataFrame(final_features)

In [18]:
# Use the movie IDs as the index so each vector can be traced back to its movie.
# After the NaN rows were dropped, the IDs can safely be cast to integers.
movie_ids = movies['id'].astype(int).to_numpy()
# Create a DataFrame containing movie IDs and their corresponding vectors
final_df = pd.DataFrame(final_features, index=movie_ids)
final_df.index.name = 'movie_id'
# Save the vectors to a CSV file in the current working directory
final_df.to_csv('movie_vectors.csv')
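The saved vectors have integer column positions only, which makes them hard to interpret later. One possible refinement, sketched here on two toy movies, is to derive column names from the binarizers' `classes_`. The `genre_` and `lang_` prefixes and all `toy`/`labeled` names are assumptions made for illustration, and the sketch uses one binarizer per column so that both label lists stay available.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Two toy movies with already-parsed label lists
toy = pd.DataFrame({
    "id": [862, 8844],
    "genres": [["Animation", "Comedy"], ["Adventure"]],
    "spoken_languages": [["English"], ["English", "Français"]],
    "adult": [0, 0],
})

genre_mlb = MultiLabelBinarizer()
language_mlb = MultiLabelBinarizer()
genre_block = genre_mlb.fit_transform(toy["genres"])
language_block = language_mlb.fit_transform(toy["spoken_languages"])

# Column names follow the same order in which the blocks are stacked
columns = (
    [f"genre_{g}" for g in genre_mlb.classes_]
    + [f"lang_{lang}" for lang in language_mlb.classes_]
    + ["adult"]
)

features = np.hstack((genre_block, language_block, toy[["adult"]].to_numpy()))
labeled = pd.DataFrame(
    features,
    index=pd.Index(toy["id"], name="movie_id"),
    columns=columns,
)

print(labeled)
```

With named columns, a CSV written from `labeled` would be self-describing.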
