<a href="https://colab.research.google.com/github/sharlynmuturi/Pytorch-Tutorial/blob/main/movie_recommender.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os


[Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata/data?select=tmdb_5000_credits.csv)

In [3]:
from google.colab import files

uploaded = files.upload()

Saving tmdb_5000_movies.csv to tmdb_5000_movies.csv


In [4]:
uploaded = files.upload()

Saving tmdb_5000_credits.csv to tmdb_5000_credits (1).csv


In [5]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

# Data Pre-processing

Combining movie info and credits using movie title, selecting only necessary columns and removing rows with missing values.



In [6]:

movies = movies.merge(credits, on = 'title')

movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

movies.dropna(inplace=True)
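
Because the merge key is the title rather than a unique ID, movies that share a title are paired in every combination. The sketch below uses small made-up frames (the titles, overviews and IDs are illustrative, not taken from the TMDB files) to show the effect and a quick way to check for it.

```python
import pandas as pd

# Toy frames (hypothetical data): two different movies share the title "Batman".
toy_movies = pd.DataFrame({
    "title": ["Batman", "Batman", "Avatar"],
    "overview": ["first film", "second film", None],
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
})

merged = toy_movies.merge(toy_credits, on="title")

# Each "Batman" row pairs with both credit rows: 2 x 2 + 1 = 5 rows.
print(len(merged))

# dropna removes the "Avatar" row, whose overview is missing.
cleaned = merged.dropna()
print(len(cleaned))

# Number of rows whose title already appeared earlier in the frame.
n_duplicates = int(cleaned["title"].duplicated().sum())
print(n_duplicates)
```

If the duplicate count on the real data is not zero, merging on an ID column instead of the title may be worth considering.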


In [7]:
movies.head(2)

| | movie_id | title | overview | genres | keywords | cast | crew |
|---|---|---|---|---|---|---|---|
| 0 | 19995 | Avatar | In the 22nd century, a paraplegic Marine is di... | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 1463, "name": "culture clash"}, {"id":... | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 285 | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |


## Merging all important textual info into a single list.

### Converting stringified lists/dicts into Python lists.


Extracting each genre/keyword name into a simple list.

In [8]:
import ast

def convert(text):
    l = []
    for i in ast.literal_eval(text):
        l.append(i['name'])
    return l

movies['genres'] = movies['genres'].apply(convert)
movies['keywords'] = movies['keywords'].apply(convert)
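
As a small illustration of what this conversion does, the sketch below runs an equivalent of `convert` on one sample string. The sample is shortened and modeled on the `genres` values shown in the table above.

```python
import ast

def convert(text):
    # Parse the stringified list of dicts and keep only each "name" value.
    return [item["name"] for item in ast.literal_eval(text)]

# Shortened sample modeled on a genres cell.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
result = convert(sample)
print(result)  # ['Action', 'Adventure']
```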

Extracting the top 3 cast members.

In [9]:
def convert_cast(text):
    # Keep only the first three names, in the order they are listed.
    return [i['name'] for i in ast.literal_eval(text)[:3]]

movies['cast'] = movies['cast'].apply(convert_cast)

Finding and extracting only the director's name from the crew.

In [10]:
def fetch_director(text):
    l = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            l.append(i['name'])
            break
    return l

movies['crew'] = movies['crew'].apply(fetch_director)
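
The director lookup can be tried on a tiny input. In this sketch the crew string and the names in it are hypothetical placeholders, and the function is an equivalent rewrite that returns as soon as it finds a match.

```python
import ast

def fetch_director(text):
    # Return a one-element list with the first crew member whose job is "Director".
    for member in ast.literal_eval(text):
        if member["job"] == "Director":
            return [member["name"]]
    return []  # no director listed

# Hypothetical, shortened crew entry with two directors.
sample_crew = (
    '[{"job": "Editor", "name": "Person A"}, '
    '{"job": "Director", "name": "Person B"}, '
    '{"job": "Director", "name": "Person C"}]'
)
first_director = fetch_director(sample_crew)
no_director = fetch_director("[]")
print(first_director)  # ['Person B']
print(no_director)     # []
```

Only the first director is kept, and a movie without one ends up with an empty list.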

### Splitting overview into individual words

Tokenizing the movie overview text.

In [11]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())

### Removing spaces inside multi-word feature entries

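This matters for vectorization: the vectorizer splits text on whitespace and punctuation, so a two-word name would become two separate tokens, and unrelated people who share a first name would look like a match. Removing the space keeps each name or phrase as a single token.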

In [12]:
def remove_space(word):
    l = []
    for i in word:
        l.append(i.replace(' ',''))
    return l

movies['keywords'] = movies['keywords'].apply(remove_space)
movies['cast'] = movies['cast'].apply(remove_space)
movies['crew'] = movies['crew'].apply(remove_space)
movies['genres'] = movies['genres'].apply(remove_space)
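
A minimal sketch of the effect, using plain whitespace tokenization as a stand-in for the vectorizer. The two tag lists are illustrative, and the `tokens` helper is hypothetical.

```python
# Illustrative tags for two unrelated movies whose people share a first name.
movie_a = ["Sam Worthington", "Action"]
movie_b = ["Sam Mendes", "Drama"]

def tokens(tags, strip_spaces):
    # Optionally collapse each entry to one token, then tokenize on whitespace.
    if strip_spaces:
        tags = [t.replace(" ", "") for t in tags]
    return set(" ".join(tags).lower().split())

shared_with_spaces = tokens(movie_a, False) & tokens(movie_b, False)
shared_without_spaces = tokens(movie_a, True) & tokens(movie_b, True)
print(shared_with_spaces)     # {'sam'}
print(shared_without_spaces)  # set()
```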


In [13]:
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']


## Creating dataframe to store only essential columns for recommendation.

In [14]:
# Take an explicit copy so the assignments below do not trigger SettingWithCopyWarning.
new_df = movies[['movie_id', 'title', 'tags']].copy()

new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x))

new_df['tags'] = new_df['tags'].apply(lambda x: x.lower())


In [15]:
new_df.head()

| | movie_id | title | tags |
|---|---|---|---|
| 0 | 19995 | Avatar | in the 22nd century, a paraplegic marine is di... |
| 1 | 285 | Pirates of the Caribbean: At World's End | captain barbossa, long believed to be dead, ha... |
| 2 | 206647 | Spectre | a cryptic message from bond’s past sends him o... |
| 3 | 49026 | The Dark Knight Rises | following the death of district attorney harve... |
| 4 | 49529 | John Carter | john carter is a war-weary, former military ca... |


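## Applying stemming to reduce words to root forms

Stemming maps inflected forms of a word (for example "loved" and "loving") to a single stem, which shrinks the vocabulary and lets the vectorizer count them as the same term.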

In [16]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

def stems(text):
    l = []
    for i in text.split():
        l.append(ps.stem(i))
    return ' '.join(l)

# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
new_df = new_df.assign(tags=new_df['tags'].apply(stems))
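
To see what the stemmer does to individual words, here is a short sketch. The word list is made up for illustration.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Inflected forms of the same word (illustrative list).
words = ["loved", "loving", "loves"]
stemmed = [ps.stem(w) for w in words]
print(stemmed)

# Number of distinct stems left after stemming.
distinct = len(set(stemmed))
print(distinct)
```

Stems are not always dictionary words, which is fine here because they are only used as matching keys.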


### Converting text to numeric vectors

Creates a bag-of-words matrix over the 5,000 most frequent terms, with English stop words removed.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000, stop_words='english')
vector = cv.fit_transform(new_df['tags']).toarray()
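
On a toy input the resulting matrix is small enough to read directly. This sketch uses two made-up tag strings with the same vectorizer settings as above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy tag strings (made up for illustration).
docs = ["action hero saves city", "the hero saves the day"]

cv = CountVectorizer(max_features=5000, stop_words="english")
counts = cv.fit_transform(docs).toarray()

# vocabulary_ maps each kept term to its column index; "the" is dropped as a stop word.
terms = sorted(cv.vocabulary_)
print(terms)
print(counts)
```

Each row is one document and each column one term, in alphabetical order of the terms.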


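### Computing similarity between movies

Measures how similar two movies are as the cosine of the angle between their tag vectors. `similarity[i][j]` is the score for movies `i` and `j`, and scores closer to 1 mean the tags overlap more.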

In [18]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vector)
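
For intuition, the same quantity can be computed by hand for a single pair. This is a sketch on made-up count vectors; the `cosine` helper is hypothetical and is not how scikit-learn implements it.

```python
import math

def cosine(u, v):
    # dot(u, v) / (|u| * |v|); for count vectors, 0.0 means no shared terms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy count vectors (made up): the two movies share two terms.
movie_a = [1, 1, 0, 1, 1]
movie_b = [0, 0, 1, 1, 1]

score_ab = round(cosine(movie_a, movie_b), 3)
score_aa = round(cosine(movie_a, movie_a), 3)
print(score_ab)  # 0.577
print(score_aa)  # 1.0
```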


# Recommendation function

*   Gets the index of the selected movie.
*   Retrieves its similarity scores with all movies.
*   Sorts the movies by similarity score, highest first.
*   Skips the first one (because it's the same movie).
*   Prints the top 5 most similar movies.




In [19]:
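def recommend(movie):
    # Use the row position, not the index label: after dropna() the labels
    # can differ from the row order of the similarity matrix.
    index = np.flatnonzero((new_df['title'] == movie).to_numpy())[0]
    distances = sorted(list(enumerate(similarity[index])),
                       reverse=True,
                       key = lambda x: x[1])
    # Skip the first entry (the movie itself) and print the next five titles.
    for i in distances[1:6]:
        print(new_df.iloc[i[0]].title)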
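
The ranking step can be traced on a tiny example. The scores below are a hypothetical similarity row for the movie at position 1.

```python
# Hypothetical similarity row for movie 1 against four movies (itself included).
scores = [0.20, 1.00, 0.70, 0.40]

# Pair each score with its position, then sort by score, highest first.
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# ranked[0] is the movie itself (score 1.0), so skip it and keep the next two.
top = [position for position, _ in ranked[1:3]]
print(top)  # [2, 3]
```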


### Example

In [21]:
recommend('Spider-Man')

Spider-Man 3
Spider-Man 2
The Amazing Spider-Man 2
Arachnophobia
Kick-Ass


### Saving data to pickle files

Saves the processed data and similarity matrix to disk so they can be loaded in another script.

In [20]:
import pickle
# Context managers make sure both files are flushed and closed.
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(new_df, f)
with open('similarity_list.pkl', 'wb') as f:
    pickle.dump(similarity, f)
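
Loading the files back in another script could look roughly like this. To keep the sketch runnable on its own, it round-trips a small made-up matrix through a temporary directory instead of the real files.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real similarity matrix (hypothetical data).
toy_similarity = [[1.0, 0.5], [0.5, 1.0]]

path = os.path.join(tempfile.mkdtemp(), "similarity_list.pkl")

# Write the object, then read it back; the with-blocks close the file handles.
with open(path, "wb") as f:
    pickle.dump(toy_similarity, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == toy_similarity)  # True
```

Unpickling the saved DataFrame needs pandas installed in the loading environment, and pickle files should only be loaded from trusted sources.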
