# Netflix Movies Recommendation System
By Sheikh Md Abid
## Project Workflow

### 1. Importing Libraries
The Python libraries `numpy` and `pandas` are imported to handle data processing and manipulation. `ast` is imported later, where it is first needed.

In [1]:
import numpy as np
import pandas as pd

### 2. Loading the Datasets
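Four CSV files are loaded using Pandas. `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv` contain information about movies, including titles, genres, cast, crew, and more, and feed the content-based part of the recommendation system. `movies.csv` and `ratings.csv` map each `movieId` to a title and hold per-user ratings, which are used for collaborative filtering.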

In [2]:
movies = pd.read_csv("tmdb_5000_movies.csv")
credits = pd.read_csv("tmdb_5000_credits.csv")
movies_list = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")

In [3]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [4]:
ratings.head(1)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044


In [5]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [6]:
movies_list.head(1)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [7]:
movies_list = movies_list.drop('genres', axis=1)
movies_list.head(1)

Unnamed: 0,movieId,title
0,1,Toy Story (1995)


In [8]:
movies_list['title'] = movies_list['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()
movies_list.head(2)

Unnamed: 0,movieId,title
0,1,Toy Story
1,2,Jumanji


### 3. Merging Datasets
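Three merges are performed. The `ratings` data is joined to `movies_list` on `movieId` so that each rating carries a title. The `movies` and `credits` datasets are merged on the movie `title` to combine movie metadata with the corresponding cast and crew details in a single dataframe. That result is then joined to `movies_list` on `title` to attach the `movieId` needed to link ratings. Because titles are not guaranteed to be unique, title-based merges can produce duplicate rows.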
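
Merging on `title` can multiply rows when two entries share a title; the repeated Spectre rows later in this notebook look like an instance of this. The following is a minimal sketch on made-up toy frames (`movies_toy`, `credits_toy` are hypothetical), assuming that the `id` column of the movies file and the `movie_id` column of the credits file refer to the same identifier, as the `head(1)` outputs suggest for Avatar:

```python
import pandas as pd

# Toy data (hypothetical): two different entries share the title "Spectre".
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Avatar", "Spectre", "Spectre"],
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Avatar", "Spectre", "Spectre"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Merging on title pairs every "Spectre" row with every other "Spectre" row.
by_title = movies_toy.merge(credits_toy, on="title")

# Merging on the identifier keeps one row per entry;
# validate="one_to_one" raises an error if either key is duplicated.
by_id = movies_toy.merge(
    credits_toy.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
    validate="one_to_one",
)

print(len(by_title), len(by_id))
```

Whether an identifier-based merge is appropriate for the real files depends on the assumption above holding for every row, which this notebook does not check.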


In [9]:
ratings_with_movies = ratings.merge(movies_list, on='movieId')
ratings_with_movies.head(2)

Unnamed: 0,userId,movieId,rating,timestamp,title
0,1,296,5.0,1147880044,Pulp Fiction
1,3,296,5.0,1439474476,Pulp Fiction


In [10]:
movies = movies.merge(credits, on='title')
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [11]:
movies = movies.merge(movies_list, on='title')
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew,movieId
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de...",72998


In [12]:
ratings_with_movies = ratings_with_movies.drop_duplicates(subset=['userId', 'movieId', 'rating'])
ratings_with_movies.shape

(25000095, 5)

In [13]:
movies_ratings_all = movies.merge(ratings, on = 'movieId')
movies_ratings_with_newID = movies_ratings_all[['userId', 'id', 'title', 'rating']]
movies_ratings_with_newID.head(1)

Unnamed: 0,userId,id,title,rating
0,3,19995,Avatar,4.0


### 4. Data Preprocessing
The merged dataset is reduced to the columns needed for content-based recommendations: `movie_id`, `title`, `overview`, `genres`, `keywords`, `cast`, and `crew`. The ratings data is also deduplicated, filtered to users with more than 200 ratings and movies with at least 50 ratings, and pivoted into a movie-by-user table in which missing ratings are filled with 0.

In [14]:
movies_ratings_with_newID = movies_ratings_with_newID.drop_duplicates(subset=['userId', 'id', 'rating'])
movies_ratings_with_newID.shape

(12664758, 4)

In [15]:
movies = movies[['movie_id','title','overview','genres','keywords','cast','crew']]
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [16]:
rating_main = movies_ratings_with_newID
rating_main.head(1)

Unnamed: 0,userId,id,title,rating
0,3,19995,Avatar,4.0


In [17]:
x = rating_main.groupby('userId').count()['rating'] > 200
genuine_user = x[x].index

In [18]:
filtered_rating = rating_main[rating_main['userId'].isin(genuine_user)]
filtered_rating.shape

(5363482, 4)

In [19]:
y = filtered_rating.groupby('title').count()['rating']>=50
famous_movies = y[y].index

In [20]:
final_ratings = filtered_rating[filtered_rating['title'].isin(famous_movies)]
final_ratings.shape

(5354017, 4)
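
The two filters applied in the cells above (users with more than 200 ratings, then titles with at least 50 ratings) can also be written with `groupby(...).transform("count")`, which avoids building the intermediate index objects. This is only a sketch on toy data with tiny thresholds; `ratings_toy`, `min_user_ratings`, and `min_movie_ratings` are hypothetical names, not part of the notebook.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3],
    "title":  ["A", "B", "C", "A", "B", "A"],
    "rating": [5.0, 4.0, 3.0, 4.0, 2.0, 1.0],
})

min_user_ratings = 1   # the notebook uses a strict "> 200"
min_movie_ratings = 2  # the notebook uses ">= 50"

# Keep users with strictly more than min_user_ratings ratings.
user_counts = ratings_toy.groupby("userId")["rating"].transform("count")
active = ratings_toy[user_counts > min_user_ratings]

# Among the remaining rows, keep titles with at least min_movie_ratings ratings.
title_counts = active.groupby("title")["rating"].transform("count")
final_toy = active[title_counts >= min_movie_ratings]

print(final_toy)
```

As in the notebook, the title threshold is evaluated after the user filter, so a title's count only includes ratings from the retained users.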

In [21]:
user_movie_rating_table = final_ratings.pivot_table(index='id',columns='userId',values='rating')
user_movie_rating_table.fillna(0, inplace=True)
user_movie_rating_table

id \ userId,3,12,13,31,43,57,72,75,80,120,...,162481,162484,162495,162508,162512,162516,162519,162521,162533,162534
5,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3.5,0.0,0.0,0.0,0.0
12,4.0,4.0,4.0,3.0,4.0,4.5,0.0,0.0,0.0,0.0,...,4.0,3.0,1.0,5.0,4.5,4.0,0.0,4.0,4.5,4.0
13,4.0,4.0,5.0,3.0,5.0,4.0,5.0,2.0,4.0,5.0,...,4.5,3.0,5.0,5.0,0.0,4.5,0.0,3.5,4.5,2.5
14,5.0,4.0,4.0,3.0,5.0,5.0,5.0,3.5,4.0,5.0,...,4.0,3.5,4.0,0.0,4.0,5.0,5.0,0.0,4.0,4.0
16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.5,0.0,0.0,0.0,4.0,2.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
365222,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
374461,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
376659,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0
396152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 5. Handling Missing Data
To ensure clean and consistent data, rows with missing values in the selected columns are dropped. This step is crucial for maintaining the integrity of the dataset, as missing values could affect the accuracy of the recommendation system.


In [22]:
# Reassign instead of using inplace=True, since `movies` was derived by column selection
movies = movies.dropna()

### 6. Data Transformation
The columns `genres`, `keywords`, `cast`, and `crew` contain nested JSON-like structures, which need to be converted into lists of strings for further processing. This transformation is done using the `ast.literal_eval()` function, along with custom helper functions, to extract the relevant information from these nested structures.
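
To make the conversion concrete, here is a minimal sketch on a single cell value. The sample string is modeled on the `genres` output shown earlier. The `json.loads` variant assumes the stored strings are valid JSON, which looks plausible from the sample rows but is not verified here.

```python
import ast
import json

# One cell of the `genres` column, as a string (shortened sample).
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# ast.literal_eval safely evaluates the string into a list of dicts.
parsed = ast.literal_eval(raw)
names = [entry["name"] for entry in parsed]

# Alternative, assuming the string is valid JSON.
names_json = [entry["name"] for entry in json.loads(raw)]

print(names)
```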

In [23]:
import ast

In [24]:
def convert(text):
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name']) 
    return L

In [25]:
movies['genres'] = movies['genres'].apply(convert)

In [26]:
movies['keywords'] = movies['keywords'].apply(convert)
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


### 7. Processing Cast and Crew
The `cast` and `crew` columns are refined to focus on the most relevant information:

- **Cast:** The `cast` column is limited to the top 3 actors, as they typically have the most significant impact on a movie's identity and appeal.
- **Crew:** The `crew` column is filtered to retain only the director's name, as the director plays a crucial role in shaping the movie's vision and style.

These transformations ensure that the tags used for recommendations are concise and focused on the key contributors to a movie.
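
The two rules above can be sketched on toy strings as follows. The helper names `top_cast` and `directors` and the sample data are hypothetical; the notebook's own helpers appear in the cells below.

```python
import ast

# Toy cell values (hypothetical names), in the same shape as the real columns.
cast_raw = (
    '[{"name": "Actor One", "order": 0}, {"name": "Actor Two", "order": 1}, '
    '{"name": "Actor Three", "order": 2}, {"name": "Actor Four", "order": 3}]'
)
crew_raw = '[{"job": "Editor", "name": "Person A"}, {"job": "Director", "name": "Person B"}]'

def top_cast(text, n=3):
    """Return the names of the first n cast members in listed order."""
    return [member["name"] for member in ast.literal_eval(text)[:n]]

def directors(text):
    """Return the names of all crew members whose job is 'Director'."""
    return [m["name"] for m in ast.literal_eval(text) if m.get("job") == "Director"]

print(top_cast(cast_raw), directors(crew_raw))
```

Taking the first three entries assumes the cast list is already ordered by billing, which the source data appears to do but which is not checked here.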


In [27]:
def convert3(text):
    L = []
    counter = 0
    for i in ast.literal_eval(text):
        if counter < 3:
            L.append(i['name'])
        counter+=1
    return L 

In [28]:
movies['cast'] = movies['cast'].apply(convert)
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weave...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [29]:
movies['cast'] = movies['cast'].apply(lambda x:x[0:3])

In [30]:
def fetch_director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L 

In [31]:
movies['crew'] = movies['crew'].apply(fetch_director)

In [32]:
movies.sample(3)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
939,4421,G.I. Jane,A female Senator succeeds in enrolling a woman...,"[Action, Drama]","[poem, middle east, helicopter, satellite, nav...","[Demi Moore, Viggo Mortensen, Anne Bancroft]",[Ridley Scott]
3425,340816,Christmas Eve,"Hilarity, romance, and transcendence prevail a...","[Romance, Comedy]","[photographer, surgeon, orchestra, doctor, car...","[Patrick Stewart, Cheryl Hines, Gary Cole]",[Mitch Davis]
2070,617,Wild Things,When teen-socialite Kelly Van Ryan (Richards) ...,"[Crime, Drama, Mystery]","[upper class, poison, sailboat, rape, sexual a...","[Matt Dillon, Kevin Bacon, Denise Richards]",[John McNaughton]


### 8. Removing Spaces in Tags
To ensure that the vectorization process works effectively, spaces in the tags (such as actor names, genres, keywords, etc.) are removed. This step prevents issues where multi-word tags might be treated as separate tokens, which could dilute their significance in the recommendation system.
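
The effect is easy to see with a small tokenizer. The sketch below uses only the standard library and mimics `CountVectorizer`'s default behavior (lowercasing, and the default token pattern `(?u)\b\w\w+\b`). The `tokenize` helper is a hypothetical stand-in, not the vectorizer itself.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it the way the default vectorizer would."""
    return TOKEN_PATTERN.findall(text.lower())

# With spaces, the shared first name "sam" becomes a common token
# between two unrelated people.
with_spaces = tokenize("Sam Worthington Sam Mendes")

# With spaces removed, each person is a single distinct token.
collapsed = tokenize("SamWorthington SamMendes")

print(with_spaces)
print(collapsed)
```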


In [33]:
def collapse(L):
    L1 = []
    for i in L:
        L1.append(i.replace(" ",""))
    return L1

In [34]:
movies['cast'] = movies['cast'].apply(collapse)
movies['crew'] = movies['crew'].apply(collapse)
movies['genres'] = movies['genres'].apply(collapse)
movies['keywords'] = movies['keywords'].apply(collapse)

In [35]:
movies.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]


### 9. Creating the Tags Column
A new column called `tags` is created by combining the content from the `overview`, `genres`, `keywords`, `cast`, and `crew` columns into a single string. This consolidated column serves as the primary input for generating movie recommendations, as it encapsulates all the key descriptive information about each movie.
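
As a small illustration of this step, the sketch below builds a `tags` string for one made-up row (the frame `df` and its values are hypothetical). It relies on the fact that adding two pandas columns that hold Python lists concatenates the lists row by row.

```python
import pandas as pd

df = pd.DataFrame({
    "overview": ["A marine on Pandora"],
    "genres": [["Action", "ScienceFiction"]],
    "cast": [["SamWorthington"]],
})

# Split the overview into words so that every column holds a list,
# then concatenate the lists row by row.
df["tags"] = df["overview"].str.split() + df["genres"] + df["cast"]

# Join each list back into one space-separated string.
df["tags"] = df["tags"].apply(" ".join)

print(df.loc[0, "tags"])
```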


In [36]:
movies['overview'] = movies['overview'].apply(lambda x:x.split())

In [37]:
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

In [38]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


In [40]:
new_dataframe= movies.drop(columns=['overview','genres','keywords','cast','crew'])
new_dataframe.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
3,206647,Spectre,"[A, cryptic, message, from, Bond’s, past, send..."
4,49529,John Carter,"[John, Carter, is, a, war-weary,, former, mili..."


In [41]:
new_dataframe['tags'] = new_dataframe['tags'].apply(lambda x: " ".join(x))
new_dataframe.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,206647,Spectre,A cryptic message from Bond’s past sends him o...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


### 10. Vectorization
The `tags` column is vectorized using `CountVectorizer` from Scikit-Learn. This process converts the text data in the `tags` column into a numerical format that can be used for similarity calculations. The `CountVectorizer` is configured to handle a maximum of 5000 features and remove English stop words, which helps in focusing on the most meaningful terms.
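
To show what the vectorizer produces, here is a minimal sketch on three made-up documents, using the same `max_features` and `stop_words` settings. The names `docs`, `cv_toy`, and `X` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "space war on a distant planet",
    "a war in space",
    "romantic comedy in paris",
]

cv_toy = CountVectorizer(max_features=5000, stop_words="english")
X = cv_toy.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each retained term to its column index.
print(sorted(cv_toy.vocabulary_))
print(X.toarray())
```

Stop words such as "in" and "on" do not appear in the vocabulary. For a large corpus it may be worth keeping `X` sparse rather than calling `.toarray()`, since `cosine_similarity` also accepts sparse input.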


In [42]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=5000,stop_words='english')

In [43]:
vector = cv.fit_transform(new_dataframe['tags']).toarray()

In [44]:
vector.shape

(4332, 5000)

In [45]:
print(cv)

CountVectorizer(max_features=5000, stop_words='english')

The next cells belong to the collaborative-filtering side of the system: cosine similarity is computed between the rows of `user_movie_rating_table` (one row per movie, one column per user), and `recommend_cf()` uses the resulting matrix to look up movies rated similarly by the same users.


In [46]:
from sklearn.metrics.pairwise import cosine_similarity

In [47]:
similarity_scores = cosine_similarity(user_movie_rating_table)
similarity_scores.shape

(2902, 2902)

In [48]:
!pip install fuzzywuzzy python-Levenshtein



In [49]:
from fuzzywuzzy import process

In [73]:
def recommend_cf(movie_name):
    # Fuzzy-match the input against known titles; for a pandas Series,
    # extractOne returns (match, score, index label)
    movie_label = process.extractOne(movie_name, movies['title'])[2]
    # The label is not a row position (rows were dropped earlier), so use .loc
    movie_id = movies['movie_id'].loc[movie_label]

    # Row position of this movie in the rating table
    index = np.where(user_movie_rating_table.index == movie_id)[0][0]
    # Five most similar movies, skipping position 0 (the movie itself)
    similar_items = sorted(enumerate(similarity_scores[index]), key=lambda x: x[1], reverse=True)[1:6]
    recommended_movies = []
    for i in similar_items:
        temp_df = movies[movies['movie_id'] == user_movie_rating_table.index[i[0]]]
        recommendation = temp_df['title'].values[0]
        recommended_movies.append(recommendation)

    return recommended_movies

### 11. Calculating Similarity
Cosine similarity is used to measure the similarity between movies based on their vectorized `tags`. This technique calculates how similar two vectors are by determining the cosine of the angle between them, providing a measure of how closely related two movies are in terms of their tags.
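
For two vectors `u` and `v`, the score is `u·v / (‖u‖ ‖v‖)`. The sketch below computes the full pairwise matrix with NumPy on a toy count matrix. The function `cosine_sim_matrix` is a hypothetical illustration of the formula, not the scikit-learn implementation used in the notebook.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    unit = X / norms         # scale every row to unit length
    return unit @ unit.T     # dot products of unit vectors are cosines

# Toy term counts: rows are movies, columns are terms.
counts = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 2],
])
sim = cosine_sim_matrix(counts)
print(np.round(sim, 3))
```

Rows 0 and 1 share one of two terms and score `1/√2`, while row 2 shares no terms with the others and scores 0. Because the vectors are normalized, the magnitude of the counts does not affect the score.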


In [51]:
from sklearn.metrics.pairwise import cosine_similarity

In [52]:
similarity = cosine_similarity(vector)

In [53]:
similarity

array([[1.        , 0.08492078, 0.05661385, ..., 0.0238705 , 0.02632491,
        0.        ],
       [0.08492078, 1.        , 0.0625    , ..., 0.02635231, 0.        ,
        0.        ],
       [0.05661385, 0.0625    , 1.        , ..., 0.02635231, 0.        ,
        0.        ],
       ...,
       [0.0238705 , 0.02635231, 0.02635231, ..., 1.        , 0.07352146,
        0.04901431],
       [0.02632491, 0.        , 0.        , ..., 0.07352146, 1.        ,
        0.05405405],
       [0.        , 0.        , 0.        , ..., 0.04901431, 0.05405405,
        1.        ]])

### 12. Building the Recommendation Function
The `recommend_cb()` function takes a movie title as input and returns a list of the top 5 similar movies based on content similarity. It uses the similarity matrix to find the movies most similar to the input movie.
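
The lookup itself is a top-k selection over one row of the similarity matrix. Here is a minimal sketch on a made-up 4×4 matrix; `titles`, `sim_toy`, and `top_k_similar` are hypothetical names used only for illustration.

```python
import numpy as np

titles = ["A", "B", "C", "D"]
sim_toy = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

def top_k_similar(title, k=2):
    """Return the k titles most similar to `title`, excluding the title itself."""
    idx = titles.index(title)
    order = np.argsort(sim_toy[idx])[::-1]       # positions sorted by descending score
    order = [j for j in order if j != idx][:k]   # drop the query movie explicitly
    return [titles[j] for j in order]

print(top_k_similar("A"))
```

This variant excludes the query by position rather than by dropping the first sorted entry, which behaves the same when the query is the unique best match but is safer if another row ties with a score of 1.0 (for example, a duplicated movie).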


In [69]:
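def recommend_cb(movie):
    # Row position of the first case-insensitive title match; positions,
    # not index labels, line up with the rows of `similarity`
    matches = np.flatnonzero((new_dataframe['title'].str.lower() == movie.lower()).to_numpy())
    index = matches[0]
    distances = sorted(enumerate(similarity[index]), reverse=True, key=lambda x: x[1])
    content_rec = []
    # Skip position 0, which is the movie itself
    for i in distances[1:6]:
        content_rec.append(new_dataframe.iloc[i[0]].title)
    return content_rec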

### 13. Example Usage
To get movie recommendations, pass the title of a movie to `recommend_cb()` (content-based) or `recommend_cf()` (collaborative filtering). For example, to find movies similar to "Avatar", use the following code:


In [75]:
recommend_cb('avatar')

['Titan A.E.', 'Aliens', 'Small Soldiers', 'Krull', "Ender's Game"]

In [74]:
recommend_cf('avatar')

['Inception', 'Iron Man', 'WALL·E', 'Up', 'District 9']

In [61]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


In [78]:
def hybrid_recommendation(movie_name):
    content_based_rec = recommend_cb(movie_name)
    collaborative_filtering_rec = recommend_cf(movie_name)
    # Merge both lists, dropping duplicates while preserving order
    hybrid_rec = list(dict.fromkeys(content_based_rec + collaborative_filtering_rec))
    return hybrid_rec[0:7]

In [79]:
hybrid_recommendation('Avatar')

## Conclusion
This Netflix Movies Recommendation System identifies similar movies in two ways: content-based similarity over tags built from each movie's overview, genres, keywords, cast, and crew, and collaborative filtering over user ratings. Cosine similarity drives both, so recommended movies are close to the input movie either in their descriptive tags or in how the same users rated them. The hybrid function merges the two lists to give users relevant suggestions from both perspectives.
