# Project Title: Netflix Recommendation System

### Candidate Name: Mayur Kumar Sharma

### Mail ID: mayur4everyone@gmail.com

## 1. Importing libraries

In [17]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

from warnings import filterwarnings
filterwarnings('ignore')

## 2. Importing dataset

In [2]:
data = pd.read_csv("netflixData.csv")
data.head()

Unnamed: 0,Show Id,Title,Description,Director,Genres,Cast,Production Country,Release Date,Rating,Duration,Imdb Score,Content Type,Date Added
0,cc1b6ed9-cf9e-4057-8303-34577fb54477,(Un)Well,This docuseries takes a deep dive into the luc...,,Reality TV,,United States,2020.0,TV-MA,1 Season,6.6/10,TV Show,
1,e2ef4e91-fb25-42ab-b485-be8e3b23dedb,#Alive,"As a grisly virus rampages a city, a lone man ...",Cho Il,"Horror Movies, International Movies, Thrillers","Yoo Ah-in, Park Shin-hye",South Korea,2020.0,TV-MA,99 min,6.2/10,Movie,"September 8, 2020"
2,b01b73b7-81f6-47a7-86d8-acb63080d525,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...","Sabina Fedeli, Anna Migotto","Documentaries, International Movies","Helen Mirren, Gengher Gatti",Italy,2019.0,TV-14,95 min,6.4/10,Movie,"July 1, 2020"
3,b6611af0-f53c-4a08-9ffa-9716dc57eb9c,#blackAF,Kenya Barris and his family navigate relations...,,TV Comedies,"Kenya Barris, Rashida Jones, Iman Benson, Genn...",United States,2020.0,TV-MA,1 Season,6.6/10,TV Show,
4,7f2d4170-bab8-4d75-adc2-197f7124c070,#cats_the_mewvie,This pawesome documentary explores how our fel...,Michael Margolis,"Documentaries, International Movies",,Canada,2020.0,TV-14,90 min,5.1/10,Movie,"February 5, 2020"


## 3. Preparation:
### A first look at the dataset shows that the Title column needs cleaning, since some titles of movies and TV shows begin with a # character. I will come back to that. For now, let’s check whether the data contains null values.

In [3]:
print(data.isnull().sum())

Show Id                  0
Title                    0
Description              0
Director              2064
Genres                   0
Cast                   530
Production Country     559
Release Date             3
Rating                   4
Duration                 3
Imdb Score             608
Content Type             0
Date Added            1335
dtype: int64


## 4. Selecting features:
### The dataset contains null values, but before removing them, let’s select the columns that we can use to build a Netflix recommendation system.

In [5]:
data = data[["Title", "Description", "Content Type", "Genres"]]
data.head()

Unnamed: 0,Title,Description,Content Type,Genres
0,(Un)Well,This docuseries takes a deep dive into the luc...,TV Show,Reality TV
1,#Alive,"As a grisly virus rampages a city, a lone man ...",Movie,"Horror Movies, International Movies, Thrillers"
2,#AnneFrank - Parallel Stories,"Through her diary, Anne Frank's story is retol...",Movie,"Documentaries, International Movies"
3,#blackAF,Kenya Barris and his family navigate relations...,TV Show,TV Comedies
4,#cats_the_mewvie,This pawesome documentary explores how our fel...,Movie,"Documentaries, International Movies"


## As the names suggest:
1. The Title column contains the titles of movies and TV shows on Netflix
2. The Description column describes the plot of the TV show or movie
3. The Content Type column tells us whether it’s a movie or a TV show
4. The Genres column contains all the genres of the TV show or movie

## 5. Dropping the Null values

In [6]:
data = data.dropna()
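
In the null counts printed earlier, the four selected columns have no missing values, so `dropna()` leaves the row count unchanged on this dataset. On data where rows do get dropped, the surviving index labels no longer match row positions, which matters later because the similarity matrix is addressed by position. A minimal sketch of that effect, using a made-up `toy` frame (not the Netflix data):

```python
import pandas as pd

# Hypothetical three-row frame with one missing genre
toy = pd.DataFrame(
    {
        "Title": ["a", "b", "c"],
        "Genres": ["Dramas", None, "TV Comedies"],
    }
)

cleaned = toy.dropna()
print(list(cleaned.index))  # labels 0 and 2 survive, leaving a gap

# Renumber the rows so that labels and positions agree again
cleaned = cleaned.reset_index(drop=True)
print(list(cleaned.index))
```

Resetting the index after dropping rows is one way to keep label-based and position-based lookups consistent.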

## 6. Cleaning
### Now let’s clean the Title column, as it needs some data preparation:

In [None]:
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # fetch the stop-word lists on first run
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))


In [14]:
def clean(text):
    # Lowercase, then strip bracketed text, URLs, HTML tags, punctuation,
    # newlines, and any token that contains a digit
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    # Drop English stop words, then stem the remaining words
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

data["Title"] = data["Title"].apply(clean)
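
To see what the regular-expression steps do on their own, here is a small sketch that applies only those substitutions, leaving out stop-word removal and stemming so that it runs without NLTK. The helper name `strip_noise` is made up for illustration, and unlike `clean` it also collapses repeated spaces. The first three titles come from the dataset preview; the last one is invented to show the digit rule.

```python
import re
import string


def strip_noise(title: str) -> str:
    """Regex-only part of the cleaning step (no stop words, no stemming)."""
    text = str(title).lower()
    text = re.sub(r"\[.*?\]", "", text)                 # text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # URLs
    text = re.sub(r"<.*?>+", "", text)                  # HTML tags
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\n", "", text)
    text = re.sub(r"\w*\d\w*", "", text)                # tokens containing digits
    return " ".join(text.split())                       # collapse extra spaces


for raw in ["#Alive", "#cats_the_mewvie", "#AnneFrank - Parallel Stories", "Top 10 Shows"]:
    print(repr(raw), "->", repr(strip_noise(raw)))
```

Note that the underscore counts as punctuation, so `#cats_the_mewvie` collapses into a single word.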

## 7. Now let’s have a look at some samples of the Titles before moving forward:

In [15]:
data.Title.sample(10)

4975                       last paradiso
139                    princess christma
5276                              unlist
232                         air forc one
2411                         johnni test
120                            lion hous
5690    w kamau bell privat school negro
80                                    ml
2732                              listen
4388                  spi kid time world
Name: Title, dtype: object

## 8. Cosine similarity
### Now I will use the Genres column as the feature for recommending similar content to the user. I will use cosine similarity here, which measures how similar two documents are by the angle between their vector representations:

In [27]:
# Make sure every Genres value is a plain string
data["Genres"] = data["Genres"].apply(lambda x: ', '.join(x) if isinstance(x, list) else str(x))
feature = data["Genres"]

# Turn each genre string into a TF-IDF vector, then compare every pair of rows
tfidf = text.TfidfVectorizer(input="content", stop_words="english")
tfidf_matrix = tfidf.fit_transform(feature)
similarity = cosine_similarity(tfidf_matrix)
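
To make the idea of cosine similarity concrete, the sketch below computes it by hand for three genre strings from the dataset preview. It is a simplification: it uses raw word counts rather than the TF-IDF weights used above, so the numbers will differ from the matrix that follows, but the pattern is the same (no shared words gives 0, identical text gives 1). The helper names `tokens` and `cosine` are made up for this example.

```python
import math
from collections import Counter


def tokens(genres: str) -> Counter:
    """Count the lowercase words in a comma-separated genre string."""
    return Counter(genres.lower().replace(",", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


reality = tokens("Reality TV")
horror = tokens("Horror Movies, International Movies, Thrillers")
docs = tokens("Documentaries, International Movies")

print(cosine(reality, horror))  # no words in common
print(cosine(horror, docs))     # share "international" and "movies"
print(cosine(docs, docs))       # identical text
```

With raw counts, the horror and documentary rows score about 0.65 because they share two of their words.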

In [36]:
similarity

array([[1.        , 0.        , 0.        , ..., 0.32075218, 0.        ,
        0.        ],
       [0.        , 1.        , 0.30428612, ..., 0.07587812, 0.68953015,
        0.15936057],
       [0.        , 0.30428612, 1.        , ..., 0.11962968, 0.27899812,
        0.12562419],
       ...,
       [0.32075218, 0.07587812, 0.11962968, ..., 1.        , 0.25478887,
        0.        ],
       [0.        , 0.68953015, 0.27899812, ..., 0.25478887, 1.        ,
        0.110801  ],
       [0.        , 0.15936057, 0.12562419, ..., 0.        , 0.110801  ,
        1.        ]])

## 9. Setting Index:
### Now I will set the Title column as an index so that we can find similar content by giving the title of the movie or TV show as an input.

In [34]:
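# Map each title to its row position; keep the first row when a cleaned title repeats
indices = pd.Series(range(len(data)), index=data['Title'])
indices = indices[~indices.index.duplicated(keep='first')]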

## 10. Recommendation
### Using the input title "girlfriend" (titles must be given in their cleaned, lowercase form):

In [42]:
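def netFlix_recommendation(title, similarity = similarity):
    # Row position of the requested (cleaned) title
    index = indices[title]
    # Pair every row position with its similarity to the requested title
    similarity_scores = list(enumerate(similarity[index]))
    # Highest similarity first; ties keep their original row order
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Keep the top 10 (the requested title itself scores 1.0 and is not excluded)
    similarity_scores = similarity_scores[0:10]
    movieindices = [i[0] for i in similarity_scores]
    return data['Title'].iloc[movieindices]

print(netFlix_recommendation("girlfriend"))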

3                          blackaf
285                     washington
417                 arrest develop
434     astronomi club sketch show
451    aunti donna big ol hous fun
656                      big mouth
752                bojack horseman
805                   brew brother
935                       champion
937                   chappel show
Name: Title, dtype: object
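
The function above raises a `KeyError` for a title that is not in the index, and it does not exclude the requested title from its own results. The sketch below shows one possible way to handle both cases. It is not the notebook's method: it works on plain Python lists, and the name `recommend`, the toy titles, and the toy similarity matrix are all made up for illustration.

```python
def recommend(title, titles, similarity, top_n=10):
    """Return up to top_n titles most similar to `title`, excluding itself."""
    position = {}
    for pos, name in enumerate(titles):
        position.setdefault(name, pos)  # keep the first row of a repeated title
    if title not in position:
        return []  # unknown title: return nothing instead of raising
    query = position[title]
    ranked = sorted(
        (pos for pos in range(len(titles)) if pos != query),
        key=lambda pos: similarity[query][pos],
        reverse=True,
    )
    return [titles[pos] for pos in ranked[:top_n]]


# Hypothetical data: four titles and a symmetric similarity matrix
toy_titles = ["a", "b", "c", "d"]
toy_similarity = [
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]

print(recommend("a", toy_titles, toy_similarity, top_n=2))
print(recommend("missing", toy_titles, toy_similarity))
```

For title `a`, the two closest rows are `c` (0.9) and `b` (0.2), and an unknown title gives an empty list.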


## Summary:
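### Netflix's own recommendation system predicts a personalised catalogue for you based on factors such as your viewing history, the viewing history of other users with similar tastes and preferences, and the genres, categories, and descriptions of the content you watched. This notebook covers only the last of these: it recommends titles whose Genres text is most similar to that of the given title, measured by cosine similarity over TF-IDF vectors.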

# Thank You
