**Project Name:** Movie Recommender Module

**Created By:** Raushan Kumar

**Date:** July 11th, 2022.

**Description:** There are three main types of recommendation approaches/systems:
1. Content Based -> In this approach we collect information about each movie (e.g. genre, cast, crew) and build recommendations from those attributes.
2. Collaborative Based -> It is based on similarity between users. E.g. if user A and user B have similar tastes, movies liked by user A may be recommended to user B, and vice versa.
3. Hybrid Technique -> This is a combination of the Content Based and Collaborative Based approaches.
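
To make the difference between the first two approaches concrete, here is a minimal sketch on made-up data. The movie names, tags, and user likes are hypothetical, and Jaccard overlap is used only as a simple stand-in similarity measure; it is not the measure used later in this notebook.

```python
# Hypothetical toy data, for illustration only.
movie_tags = {
    "MovieA": {"action", "space", "alien"},
    "MovieB": {"action", "space", "robot"},
    "MovieC": {"romance", "paris"},
}
user_likes = {
    "user1": {"MovieA", "MovieC"},
    "user2": {"MovieA", "MovieC"},
    "user3": {"MovieB"},
}

def jaccard(a, b):
    """Share of common elements between two sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def fans(movie):
    """Set of users who liked the given movie."""
    return {user for user, liked in user_likes.items() if movie in liked}

# Content based: compare what the movies are about.
content_sim = jaccard(movie_tags["MovieA"], movie_tags["MovieB"])

# Collaborative: compare who liked the movies, ignoring their content.
collab_sim = jaccard(fans("MovieA"), fans("MovieC"))

print(content_sim)  # 0.5
print(collab_sim)   # 1.0
```

MovieA and MovieC share no tags, yet they look identical from the collaborative point of view because the same users liked both.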

In this mini project I use the Content Based technique. The data is ~5k movie records with ratings and related metadata (the TMDB 5000 files loaded below). I'll use only specific features, namely those whose contribution is meaningful.


In [156]:
# Importing Library
import pandas as pd
import numpy as np

In [157]:
# Reading zip csv data from Github
movies=pd.read_csv('https://github.com/raushan1info/NLP/blob/main/tmdb_5000_movies.csv.zip?raw=true',compression='zip')
credits=pd.read_csv('https://github.com/raushan1info/NLP/blob/main/tmdb_5000_credits.csv.zip?raw=true',compression='zip')

In [158]:
#movies=movie.copy()

In [159]:
(movies.shape,credits.shape)

((4803, 20), (4803, 4))

In [160]:
print('movies columns: ',movies.columns)
print('credit columns: ', credits.columns)

movies columns:  Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')
credit columns:  Index(['movie_id', 'title', 'cast', 'crew'], dtype='object')


In [161]:
movies=movies.merge(credits,on='title')
#movies.head(3)

# Data Cleaning & Preprocessing

In [162]:
# Selecting the columns that are helpful for recommendations (.copy() avoids pandas chained-assignment warnings later)
movies=movies[['movie_id','title','overview','genres','keywords','cast','crew']].copy()

In [163]:
# Checking nulls
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [164]:
# Dropping rows with null values, as their count is negligible; resetting the index keeps row labels equal to row positions
movies=movies.dropna(axis=0).reset_index(drop=True)

In [165]:
# Checking for duplicate rows
movies.duplicated().sum()

0

**AST Module:** AST stands for Abstract Syntax Tree. This module is used here to convert a string that contains a list of dictionaries into an actual Python list of dictionaries.
For example: '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'
In this example the entire list is a single string, so it has to be parsed first. This is where the literal_eval function comes into the picture.
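
As a small illustration of the point above, the sketch below parses a shortened version of the example string. `ast.literal_eval` only evaluates Python literals (strings, numbers, lists, dicts and so on), which makes it a safer choice than `eval` for this kind of data. The variable names are only for illustration.

```python
import ast

# A shortened version of the genres string from the dataset.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

print(type(raw).__name__)       # str

# Parse the string into real Python objects.
parsed = ast.literal_eval(raw)
print(type(parsed).__name__)    # list

# Now the dictionaries can be accessed normally.
names = [item["name"] for item in parsed]
print(names)                    # ['Action', 'Adventure']
```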

In [166]:
import ast

In [167]:
# s='[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'
# for i in ast.literal_eval(s):
#   print(i['name'])


Defining a function to extract specific text from the list of dictionaries

In [168]:
def convert(text):
  l=[]
  for i in ast.literal_eval(text):
    l.append(i['name'])
  return l


In [169]:
# Extracting Genre names
movies['genres']=movies['genres'].apply(convert)

In [170]:
#movies.keywords[0]

In [171]:
# Extracting names of keywords
movies['keywords']=movies['keywords'].apply(convert)

In [172]:
movies.cast[0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [173]:
# Function to extract up to 3 cast names from the list of dictionaries
def convert2(text):
  l=[]
  count=0
  for i in ast.literal_eval(text):
    if count<3:
      l.append(i['name'])
      count+=1
  return l

In [174]:
# Extracting the 3 lead cast members
movies['cast']=movies['cast'].apply(convert2)

In [175]:
#movies.cast.apply(lambda x : x[0:3])

In [176]:
# func to fetch director name

def fetch_director(text):
  l=[]
  for i in ast.literal_eval(text):
    if i['job']=='Director':
      l.append(i['name'])
  return l

In [177]:
# Extracting Director name
movies['crew']=movies.crew.apply(fetch_director)

In [178]:
# Field values contain spaces (e.g. 'Sam Worthington'); removing them turns each name into a single token, which reduces name ambiguity and gives better results
def collapse(text):
  t=[]
  for i in text:
    t.append(i.replace(' ',''))
  return t
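
To see why removing the spaces helps, consider two different people who share a first name. The sketch below assumes the text is later split into words on whitespace; the two names are chosen only for illustration, and `collapse` is rewritten as a list comprehension with the same behaviour as the function above.

```python
def collapse(names):
    """Remove spaces inside each name so it becomes a single token."""
    return [name.replace(" ", "") for name in names]

cast_a = ["Sam Worthington"]   # illustrative name from one movie
crew_b = ["Sam Mendes"]        # illustrative name from another movie

# Without collapsing, splitting on whitespace yields a shared token "sam".
tokens_a = set(" ".join(cast_a).lower().split())
tokens_b = set(" ".join(crew_b).lower().split())
shared_before = tokens_a & tokens_b

# After collapsing, each full name is one token and nothing overlaps.
tokens_a2 = set(" ".join(collapse(cast_a)).lower().split())
tokens_b2 = set(" ".join(collapse(crew_b)).lower().split())
shared_after = tokens_a2 & tokens_b2

print(shared_before)  # {'sam'}
print(shared_after)   # set()
```

Without this step, the shared word "sam" would make the two movies look slightly more similar even though the people are unrelated.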



In [179]:
# A lambda function could also be used instead of the collapse function
movies['genres']=movies['genres'].apply(collapse)
movies['keywords']=movies['keywords'].apply(collapse)
movies['cast']=movies['cast'].apply(collapse)
movies['crew']=movies['crew'].apply(collapse)

In [180]:
# def find_float(text):
#   f=[]
#   for i in ast.literal_eval(text):
#     if i.isfloat():
#       f.append(i)
#     else:
#       i.split()
#   return f,i
#movies.iloc[0]

In [181]:
# Converting the overview from a string into a list of words
movies['overview']=movies['overview'].apply(lambda x: list(str(x).split()))


In [182]:
# Combining all fields into a single one for further process
movies['tags']=movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']

In [183]:
# Dropping the columns that are no longer needed after combining them
new_df = movies.drop(columns=['overview','genres','keywords','cast','crew'])

In [184]:
# Converting the tags field from a list into a string
new_df['tags']=new_df['tags'].apply(lambda x : ' '.join(x))

In [185]:
# Converting the tags field to lower case
new_df['tags']=new_df['tags'].apply(lambda x:x.lower())

# Text conversion into vectors and calculating similarities between movies

In [186]:
new_df.head(3)

| | movie_id | title | tags |
|---|---|---|---|
| 0 | 19995 | Avatar | in the 22nd century, a paraplegic marine is di... |
| 1 | 285 | Pirates of the Caribbean: At World's End | captain barbossa, long believed to be dead, ha... |
| 2 | 206647 | Spectre | a cryptic message from bond’s past sends him o... |


**Here the Bag of Words technique is used for text vectorization.**
The data is now in a clean text format and needs to be vectorized. For this purpose I'll use the CountVectorizer class with the max_features and stop_words parameters.
max_features: keeps only the 5000 most frequent words in the corpus.
stop_words: words that are needed to form sentences but carry little meaning on their own (e.g. "the", "and"). These words are removed using this parameter.
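
The idea behind these two parameters can be sketched in plain Python on a toy corpus. This is only a conceptual stand-in, not what CountVectorizer does internally: the three sentences and the stop list are made up, tokenization is a plain whitespace split, and the real class handles tokenization and ties differently.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
docs = [
    "the pirate sails the sea",
    "a pirate finds treasure at sea",
    "the pirate captain sails",
]
stop_words = {"the", "a", "at"}   # tiny made-up stop list
max_features = 3                  # keep only the 3 most frequent words

# Tokenize on whitespace and drop stop words.
tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]

# Vocabulary: the most frequent words over the whole corpus,
# sorted alphabetically here for readability.
totals = Counter(w for doc in tokenized for w in doc)
vocab = sorted(w for w, _ in totals.most_common(max_features))

# One count vector per document, one column per vocabulary word.
vectors = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

print(vocab)     # ['pirate', 'sails', 'sea']
print(vectors)   # [[1, 1, 1], [1, 0, 1], [1, 1, 0]]
```

Rare words such as "treasure" and "captain" fall outside the vocabulary, so every movie ends up as a vector of the same fixed length.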

In [187]:
# cv.get_feature_names_out() (get_feature_names() in older scikit-learn versions) returns the word list, which helps to check whether the words are really distinct.
# A word can appear in several forms (e.g. different verb forms), which can cause bad results. This needs to be rectified by stemming.
# Using PorterStemmer from the nltk library
from nltk.stem.porter import PorterStemmer
# Initializing Stemmer class
ps=PorterStemmer()

In [188]:
# Defining a function to stem each word so that similar word forms map to a single word
def steming(text):
  l=[]
  for i in text.split():
    l.append(ps.stem(i))
  return ' '.join(l)
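
The effect of stemming on the vocabulary can be shown with a deliberately naive suffix stripper. This is a toy stand-in and NOT the Porter algorithm used above; the function names and the suffix list are hypothetical, and real stemmers apply many more rules.

```python
def toy_stem(word):
    """Naive suffix stripping, for illustration only (not the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least 3 characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_stem_text(text):
    """Apply toy_stem to every whitespace-separated word."""
    return " ".join(toy_stem(w) for w in text.split())

before = "fight fights fighting jumped jumps"
after = toy_stem_text(before)

print(after)                      # fight fight fight jump jump
print(len(set(before.split())))   # 5
print(len(set(after.split())))    # 2
```

Five distinct tokens collapse into two, so movies that use different forms of the same word now share a feature.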


In [189]:
new_df['tags']=new_df['tags'].apply(steming)

In [190]:
# The text data is now arranged in a good format and needs to be vectorized to produce the right result
# Importing the sklearn module that handles it
from sklearn.feature_extraction.text import CountVectorizer

In [191]:
# Initializing CountVectorizer class
cv=CountVectorizer(max_features=5000,stop_words='english')

In [192]:
# Transforming text data into numerical vector form
vector=cv.fit_transform(new_df['tags']).toarray()

**Calculate similarity between movies**

Here similarity is based on the distance between movie vectors. To measure it, I'll use cosine similarity, i.e. the cosine of the angle between two movie vectors. In high-dimensional data, Euclidean distance is not a good measure. Similarity is inversely related to distance: the smaller the angle between two movies, the more similar they are.
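
A small worked example may help here. The sketch below computes cosine similarity by hand for three made-up count vectors; the helper name `cosine_sim` and the vectors are hypothetical, and returning 0.0 for an all-zero vector is an assumption of this sketch.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an empty vector as similar to nothing
    return dot / (norm_u * norm_v)

movie_a = [1, 1, 0]
movie_b = [2, 2, 0]   # same words as movie_a, each used twice as often
movie_c = [0, 0, 3]   # shares no words with movie_a

print(round(cosine_sim(movie_a, movie_b), 6))  # 1.0
print(cosine_sim(movie_a, movie_c))            # 0.0

# Euclidean distance is non-zero for a and b even though they point the same way.
print(math.dist(movie_a, movie_b) > 0)         # True
```

Cosine similarity ignores vector length, so a movie with a longer description is not penalised just for repeating the same words more often.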



In [193]:
from sklearn.metrics.pairwise import cosine_similarity

In [194]:
similarity=cosine_similarity(vector)

In [195]:
#similarity[0]

In [196]:
# Defining a function which will recommend 5 similar movies
def recomend_movie(movie):
  # row position of the movie (row labels can differ from positions after dropping rows)
  index_movie=np.flatnonzero((new_df['title']==movie).to_numpy())[0]
  distance=similarity[index_movie]     # similarity scores of this movie against every movie
  movie_list=sorted(enumerate(distance),reverse=True,key=lambda x:x[1])[1:6]    # 5 most similar movies, skipping the movie itself
  print('5 similar movies to "{}"\n'.format(movie))
  count=1
  for i in movie_list:
    print(count,'. ',new_df.iloc[i[0]]['title'])  # looking up the movie name by row position
    count+=1
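
The ranking step can be tried in isolation on a made-up similarity row. The helper `top_n_similar` below is hypothetical and differs slightly from the function above: it excludes the movie by its index rather than by slicing off the first result, which also works if another movie happens to have the same top score.

```python
def top_n_similar(sim_row, self_index, n=5):
    """Return (index, score) pairs for the n highest scores, excluding the movie itself."""
    ranked = sorted(enumerate(sim_row), key=lambda pair: pair[1], reverse=True)
    return [(i, score) for i, score in ranked if i != self_index][:n]

# Hypothetical similarity row for the movie at position 0.
row = [1.0, 0.2, 0.9, 0.0, 0.5]

result = top_n_similar(row, self_index=0, n=3)
print(result)   # [(2, 0.9), (4, 0.5), (1, 0.2)]
```

Each pair holds the row position of a similar movie and its score, so the position can be passed to `new_df.iloc` to look up the title.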

In [198]:
# Final module
recomend_movie("Pirates of the Caribbean: At World's End")

5 similar movies to "Pirates of the Caribbean: At World's End"

1 .  Pirates of the Caribbean: Dead Man's Chest
2 .  Pirates of the Caribbean: The Curse of the Black Pearl
3 .  Pirates of the Caribbean: On Stranger Tides
4 .  Life of Pi
5 .  20,000 Leagues Under the Sea


### Module Ends Here

**Future Work**
1. Build a website
2. Accuracy control/improvement
3. Explore additional fields/features