###**Content Based**
##**[Movie Recommender System](https://github.com/pkvidyarthi?tab=repositories)**

`By Prince Kumar`



**Mounting Google Drive**

In [65]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Importing Import Libraries**

In [66]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

**Loading Datasets**

In [67]:
movies = pd.read_csv('/content/drive/MyDrive/Dataset/tmdb_5000_movies.csv')
credits = pd.read_csv('/content/drive/MyDrive/Dataset/tmdb_5000_credits.csv')

In [68]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [69]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


**Shape of Datasets**

In [70]:
movies.shape

(4803, 20)

In [71]:
credits.shape

(4803, 4)

**Merging both datasets on the basis of 'title' or 'movie_id'**

In [72]:
movies = movies.merge(credits, on = 'title')

In [73]:
movies.shape

(4809, 23)

**Keeping important columns**

In [74]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

**Lists Of Important Columns**
1. genres
2. movie_id
3. keywords
4. title
5. overview
6. cast
7. crew
**All of the above columns are only important for our projects rest are not much important**

In [75]:
movies = movies[['movie_id', 'title','overview','genres','keywords','cast','crew']]

In [76]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


**We'll make a new column 'tags' merging all columns except 'movie_id' & 'title' to create a tag.**

**Which will help to recommend the desired movie or output.**

In [77]:
# Checking for null values
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [78]:
# overview has 3 null values, 
# we'll delete those rows which contains null values.

In [79]:
movies.dropna(inplace = True)       # It'll drop all null values.
movies.isnull().sum()

movie_id    0
title       0
overview    0
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [80]:
# Checking for duplicate rows
movies.duplicated().sum()

0

In [81]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

**'genres' contains data in the form of dictionary, we have to convert these in list.**

**Keep only name of genres like 'Action', 'Adventure', 'Fantasy'**

In [82]:
# Making convertor function
def converter(obj):
  l = []
  # We have to convert these string of lists to list.
  for i in ast.literal_eval(obj):
    l.append(i['name'])
  return l


In [83]:
# Import ast
import ast

In [84]:
movies['genres'] = movies['genres'].apply(converter)

In [85]:
movies['genres']

0       [Action, Adventure, Fantasy, Science Fiction]
1                        [Adventure, Fantasy, Action]
2                          [Action, Adventure, Crime]
3                    [Action, Crime, Drama, Thriller]
4                [Action, Adventure, Science Fiction]
                            ...                      
4804                        [Action, Crime, Thriller]
4805                                [Comedy, Romance]
4806               [Comedy, Drama, Romance, TV Movie]
4807                                               []
4808                                    [Documentary]
Name: genres, Length: 4806, dtype: object

In [86]:
# Doing same thing for keyword column
movies['keywords'] = movies['keywords'].apply(converter)

In [87]:
movies['keywords']

0       [culture clash, future, space war, space colon...
1       [ocean, drug abuse, exotic island, east india ...
2       [spy, based on novel, secret agent, sequel, mi...
3       [dc comics, crime fighter, terrorist, secret i...
4       [based on novel, mars, medallion, space travel...
                              ...                        
4804    [united states–mexico barrier, legs, arms, pap...
4805                                                   []
4806    [date, love at first sight, narration, investi...
4807                                                   []
4808            [obsession, camcorder, crush, dream girl]
Name: keywords, Length: 4806, dtype: object

In [88]:
# movie['cast][0]

**We have to choose first three dictionaries, because we have to take only top three casts for each movie.**
- We have to select 'name' of cast column not the character.

In [89]:
# Defining a function 'converter2'

def converter2(obj):
  l = []
  count = 0
  for i  in ast.literal_eval(obj):
    if count != 3:
      l.append(i['name'])
      count += 1
    else:
      break
  return l

In [90]:
movies['cast'] = movies['cast'].apply(converter2)

In [91]:
movies['cast']

0        [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1           [Johnny Depp, Orlando Bloom, Keira Knightley]
2            [Daniel Craig, Christoph Waltz, Léa Seydoux]
3            [Christian Bale, Michael Caine, Gary Oldman]
4          [Taylor Kitsch, Lynn Collins, Samantha Morton]
                              ...                        
4804    [Carlos Gallardo, Jaime de Hoyos, Peter Marqua...
4805         [Edward Burns, Kerry Bishé, Marsha Dietlein]
4806           [Eric Mabius, Kristin Booth, Crystal Lowe]
4807            [Daniel Henney, Eliza Coupe, Bill Paxton]
4808    [Drew Barrymore, Brian Herzlinger, Corey Feldman]
Name: cast, Length: 4806, dtype: object

In [92]:
# movies['crew'].iloc[0]

**We have to store directors name for each movie**
- We'll store name in a list where crew has 'Director' as job.

In [93]:
# Defining a function 
def director(obj):
  l = []
  for i in ast.literal_eval(obj):
    if i['job'] == 'Director':
      l.append(i['name'])
      break
  return l

In [94]:
movies['crew'] = movies['crew'].apply(director)

In [95]:
movies['crew']

0           [James Cameron]
1          [Gore Verbinski]
2              [Sam Mendes]
3       [Christopher Nolan]
4          [Andrew Stanton]
               ...         
4804     [Robert Rodriguez]
4805         [Edward Burns]
4806          [Scott Smith]
4807          [Daniel Hsia]
4808     [Brian Herzlinger]
Name: crew, Length: 4806, dtype: object

- overview contains data in string format, convert it into list.

In [96]:
 movies['overview'] = movies['overview'].apply(lambda x : x.split())

In [97]:
movies.head(2)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]


**We have to create tags so we'll apply transformation for removing speces between name.**
- Replace space ' ' with nnothing '' using lambda function. 

In [98]:
movies['genres'] = movies['genres'].apply(lambda x : [i.replace(' ','') for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x : [i.replace(' ','') for i in x])
movies['cast'] = movies['cast'].apply(lambda x : [i.replace(' ','') for i in x])
movies['crew'] = movies['crew'].apply(lambda x : [i.replace(' ','') for i in x])

**Concatinating all the above columns to make 'tags' column.**

In [99]:
movies['tags'] = movies['genres'] + movies['overview'] + movies['keywords'] + movies['cast'] + movies['crew']

**Making a new dataframe using 'movie_id', 'title' and 'tags'**

In [100]:
df = movies[['movie_id', 'title','tags']]

In [101]:
df.head(3)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[Action, Adventure, Fantasy, ScienceFiction, I..."
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action, Captain, Barbossa..."
2,206647,Spectre,"[Action, Adventure, Crime, A, cryptic, message..."


- 'tags' columns contains list datatype, so convert it into string.
- join each list on the basis of space.

In [102]:
df['tags'] = df['tags'].apply(lambda x : ' '.join(x))

In [103]:
df.head(5)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,Action Adventure Fantasy ScienceFiction In the...
1,285,Pirates of the Caribbean: At World's End,"Adventure Fantasy Action Captain Barbossa, lon..."
2,206647,Spectre,Action Adventure Crime A cryptic message from ...
3,49026,The Dark Knight Rises,Action Crime Drama Thriller Following the deat...
4,49529,John Carter,Action Adventure ScienceFiction John Carter is...


In [104]:
df['tags'][0]

'Action Adventure Fantasy ScienceFiction In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

In [105]:
 # Converting tags' texts into lowercase
df['tags'] = df['tags'].apply(lambda x : x.lower())

In [106]:
df['tags'][0]

'action adventure fantasy sciencefiction in the 22nd century, a paraplegic marine is dispatched to the moon pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d samworthington zoesaldana sigourneyweaver jamescameron'

In [107]:
df['tags'][1]

"adventure fantasy action captain barbossa, long believed to be dead, has come back to life and is headed to the edge of the earth with will turner and elizabeth swann. but nothing is quite as it seems. ocean drugabuse exoticisland eastindiatradingcompany loveofone'slife traitor shipwreck strongwoman ship alliance calypso afterlife fighter pirate swashbuckler aftercreditsstinger johnnydepp orlandobloom keiraknightley goreverbinski"

**Text Vectorization:**

- Text vectorization refers to the process of converting text data into numerical vectors or arrays that can be processed by machine learning algorithms. Text data is usually unstructured and in a format that is difficult for algorithms to interpret. 
- By vectorizing text, we can transform it into a format that can be easily understood and used for modeling.

There are several techniques for text vectorization in machine learning, including:
1. **Bag-Of-Words**
2. **TF-IDF**
3. **Word Embeddings**

**Bag-of-words:** *This technique converts each document or sentence into a frequency count of the words in it. It creates a sparse vector, where the length of the vector is equal to the total number of unique words in the corpus.*

####But we'll use an easy technique 'Bag-of-Words' to convert texts into vectors.

**Note : When we have to recommend five movies on the basis of a movie which will choosen by user, then we'll pick five nearest vectors from that movie vector which is choosen by a user.**

- Concatinate all tags => tag1 + tag2 + tag3 +......+tag4809
- We'll pick 4000 words and aplly vectorization on it.
- Note : We'll not consider 'STOPWORDS' in these 4000 words.
- English stopwords like is, are, and, to, from, in etc

**We'll use 'CounterVectorizer' class to apply vectorization.

In [108]:
!pip install -U scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [109]:
import sklearn

In [110]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 4000, stop_words = 'english')

# stop_words('english') => it will remove all the stopwords
# max_features = 4000 means we have to apply vectorization on max 4000 words.

In [111]:
# cv.fit_transform(df['tags']) 
# => it will return sparse matrix, we have to convert it into array
vectors = cv.fit_transform(df['tags']).toarray()

In [112]:
vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [113]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0])

In [114]:
vectors.shape

(4806, 4000)

In [115]:
feature_name = cv.get_feature_names_out()
# Adjust numpy print options to display the entire array
# np.set_printoptions(threshold=np.inf)
# print(feature_name)
for feature_name in feature_name:
    print(feature_name)
# Above code or loop is used to print words one by one

000
007
10
100
11
12
13
14
15
16
17
18
18th
19
1930s
1940s
1950s
1960s
1970s
1980s
19th
19thcentury
20
200
20th
24
25
30
3d
40
50
60
60s
aaroneckhart
abandoned
abducted
abigailbreslin
abilities
ability
able
aboard
abuse
abusive
academy
accept
accepts
access
accident
accidental
accidentally
accompanied
account
accountant
accused
ace
achieve
act
action
actions
activist
activities
actor
actors
actress
acts
actual
actually
adam
adams
adamsandler
adaptation
addict
addicted
addiction
adolescence
adopted
adoption
adrienbrody
adult
adultery
adulthood
adults
advantage
adventure
adventures
advertising
advice
affair
affection
affections
afghanistan
africa
african
aftercreditsstinger
afterlife
aftermath
age
aged
agedifference
agency
agenda
agent
agents
aggressive
aging
ago
agree
agrees
ahead
aid
aided
aids
ailing
air
airplane
airport
aka
al
alabama
alaska
albert
alcohol
alcoholic
alcoholism
alecbaldwin
alex
ali
alice
alien
alieninvasion
alienlife
aliens
alike
alive
allen
alliance
allied
allies
all

It this output, we have words like 
[love, loved, loving]  have the same meaning, so we'll replace them with only one word that has the same meaning.

So we'll apply stemming, we have to import nltk library first.

In [116]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [117]:
import nltk
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [118]:
def stemming(text):
  l = []
  for i in text.split():   # to convert string into list
      l.append(ps.stem(i))
  return ' '.join(l)       # to covert list into string

  

In [119]:
df['tags'] = df['tags'].apply(stemming)

In [120]:
df['tags']

0       action adventur fantasi sciencefict in the 22n...
1       adventur fantasi action captain barbossa, long...
2       action adventur crime a cryptic messag from bo...
3       action crime drama thriller follow the death o...
4       action adventur sciencefict john carter is a w...
                              ...                        
4804    action crime thriller el mariachi just want to...
4805    comedi romanc a newlyw couple' honeymoon is up...
4806    comedi drama romanc tvmovi "signed, sealed, de...
4807    when ambiti new york attorney sam is sent to s...
4808    documentari ever sinc the second grade when he...
Name: tags, Length: 4806, dtype: object

- **We will calculate the cosine distance between two vectors instead of the Euclidean distance.**

- **The desired output vector will consist of the five vectors with the lowest distance.**
- **The cosine distance is inversely proportional to the cosine similarity, and we will import cosine similarity because if the similarity value is higher, it indicates that the movies are more similar.**

In [121]:
from sklearn.metrics.pairwise import cosine_similarity

In [122]:
similarity = cosine_similarity(vectors)
# it will return a digonal vector

In [123]:
cosine_similarity(vectors).shape 
# The program will calculate the distance of each movie vector with every other movie.

(4806, 4806)

In [124]:
similarity[0]
# The distance between the first movie and every other movie will be calculated.

array([1.        , 0.09258201, 0.06172134, ..., 0.02577696, 0.02817181,
       0.        ])

In [125]:
df[df['title'] == 'Batman Begins'].index[0]

119

**Creating Recommendation Function**

In [144]:
# Creating Recommendation function
def recommendation(movie):
    movie_index = df[df['title'] == movie].index[0]
    distances = similarity[movie_index]
    movies_list = sorted(list(enumerate(distances)), reverse = True, key = lambda x:x[1])[1:6]
    # If we only apply the sorted() function to a list of similarity values, 
    # the values will be sorted, but the corresponding index positions will be lost. 
    # To preserve the index positions, we need to first apply the enumerate() function to the list of similarity values, 
    # which will associate each value with its index. Then, we can use the sorted() function to sort 
    # the list based on the similarity values while maintaining the original index positions.

    # key = lambda x:x[1] ==> It will sort the list on the basis of 2nd index.
    # [1:6] ==> It will return only first five value.

    for i in movies_list:
      print(df.iloc[i[0]].title)
      print(df.iloc[i[0]])


**Note :** 
- If we only apply the sorted() function to a list of similarity values, the values will be sorted, but the corresponding index positions will be lost. To preserve the index positions, we need to first apply the enumerate() function to the list of similarity values, which will associate each value with its index. Then, we can use the sorted() function to sort the list based on the similarity values while maintaining the original index positions.
- key = lambda x:x[1] ==> It will sort the list on the basis of 2nd index.
- [1:6] ==> It will return only first five value.

In [143]:
recommendation('Iron Man')

Iron Man 2


ValueError: ignored

In [128]:
# Dumping Data on Streamlit using pickle
import pickle
pickle.dump(df.to_dict(),open('moviesDict.pkl','wb'))
pickle.dump(similarity, open('similarity.pkl','wb'))