<center><h3><b>A Recommender System that recommends movies based on movie description.</b></h3></center>

<center><img src = "https://wallpapercave.com/wp/wp5063342.png" width="650"></center>

<html>
<style>
h2 {
  text-align: center;
} </style>
<h2>Recommendation Systems - Terminology</h2>
</html>

* **Items (also known as documents)**:
The entities a system recommends.
* **Query (also known as context)**:
The information a system uses to make recommendations.
* **Embedding**:
A mapping from a discrete set (in this case, the set of queries or the set of items to recommend) to a vector space called the embedding space. Many recommendation systems rely on learning an appropriate embedding representation of the queries and items.
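
As a toy illustration of the embedding idea, the sketch below maps three made-up item names to hand-picked 3-dimensional vectors and finds the item closest to a query item. The item names, the vectors, and the `cosine` helper are all assumptions for illustration; real systems learn the vectors from data.

```python
import numpy as np

# Hypothetical, hand-picked embeddings for three made-up items.
item_embeddings = {
    "cooking documentary": np.array([0.9, 0.1, 0.0]),
    "baking competition": np.array([0.8, 0.3, 0.1]),
    "space thriller": np.array([0.0, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query_name = "cooking documentary"
query_vec = item_embeddings[query_name]

# Score every other item against the query and pick the best match.
scores = {
    name: cosine(query_vec, vec)
    for name, vec in item_embeddings.items()
    if name != query_name
}
closest = max(scores, key=scores.get)
print(closest, round(scores[closest], 3))
```

Items that point in a similar direction in the embedding space score close to 1, which is the property the rest of this notebook relies on.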

<h2> Types of Recommendation Engines </h2>

* **Content-based Filtering**: Uses similarity between items to recommend items similar to what the user likes.
* **Collaborative Filtering**: Uses similarities between queries and items simultaneously to provide recommendations.

Here we use the description column to recommend movies. Since we recommend based on what the user has watched, this is content-based filtering.

<h3>Load CSV into a Pandas DataFrame</h3>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import missingno
netflix_data = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
netflix_data.info()

<h3>Check for missing data in the description column</h3>

In [None]:
missingno.bar(netflix_data,figsize=(12,5))

<h3>The bar plot above shows no missing data in the description column, so no missing-data handling is needed.</h3>

# **TF-IDF Based**

Term frequency-inverse document frequency (TF-IDF) is an established technique for scoring document similarity based on the importance of the words that the documents share.
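
To make the weighting concrete, here is a minimal sketch on a three-sentence toy corpus. The sentences and the `toy_*` names are assumptions for illustration and are not part of the Netflix data. It shows that a word shared by several documents ("chef") gets a lower weight than a word unique to one document ("pasta").

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "chef" appears in two documents, "pasta" in only one.
toy_docs = [
    "a chef cooks pasta",
    "a chef bakes bread",
    "astronauts explore space",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index in the matrix.
vocab = toy_vectorizer.vocabulary_
row0 = toy_matrix[0].toarray().ravel()

print("chef :", row0[vocab["chef"]])
print("pasta:", row0[vocab["pasta"]])
```

Within the first document both words occur once, so the difference in weight comes entirely from the inverse document frequency term.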

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 

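# min_df must be an int >= 1 (or a float in [0.0, 1.0]); recent scikit-learn versions reject the integer 0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')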
tfidf_matrix = tf.fit_transform(netflix_data['description'])

<h4>Cosine Similarity</h4>

Cosine similarity computes the L2-normalized dot product of vectors. It is called cosine similarity because Euclidean (L2) normalization projects the vectors onto the unit sphere, and their dot product is then the cosine of the angle between the points denoted by the vectors. This kernel is a popular choice for computing the similarity of documents represented as TF-IDF vectors.
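
The statement can be checked numerically. The sketch below uses two arbitrary example vectors (an assumption for illustration) and confirms that the dot product of the L2-normalized vectors equals the cosine computed directly from the definition.

```python
import numpy as np

# Two arbitrary example vectors, one per row.
a = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])

# Project each row onto the unit sphere (L2 normalization).
norms = np.linalg.norm(a, axis=1, keepdims=True)
unit = a / norms

# Plain dot product of the normalized rows.
dot_of_unit = unit @ unit.T

# Cosine from the definition: u.v / (|u| |v|).
cos_direct = (a @ a.T) / (norms @ norms.T)

print(np.allclose(dot_of_unit, cos_direct))
```

This is why a plain dot-product kernel is enough when the input rows are already unit length.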

In [None]:
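# Rows of tfidf_matrix are L2-normalized by default, so this dot product is the cosine similarity.
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
results = {}
for idx, row in netflix_data.iterrows():
    # Indices of the 99 highest-scoring shows, best first.
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], netflix_data['show_id'][i]) for i in similar_indices]
    # Drop the first entry, which is normally the show itself (similarity 1.0).
    results[row['show_id']] = similar_items[1:]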
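
The loop above precomputes the neighbours of every show and assumes the show itself is always ranked first. As an alternative sketch, the hypothetical helper `top_k_similar` below ranks a single similarity row on demand and excludes the query by index rather than by position. The helper name and the `demo_row` values are assumptions for illustration.

```python
import numpy as np

def top_k_similar(similarity_row, self_index, k):
    """Return (index, score) pairs for the k highest scores, excluding self_index."""
    scores = np.asarray(similarity_row, dtype=float).copy()
    scores[self_index] = -np.inf  # never recommend the query item itself
    k = min(k, scores.size - 1)
    if k <= 0:
        return []
    # argpartition finds the k best candidates without sorting the whole row.
    top = np.argpartition(-scores, k - 1)[:k]
    # Sort only those k candidates, best first.
    top = top[np.argsort(-scores[top])]
    return [(int(i), float(scores[i])) for i in top]

# Made-up similarity row for item 0 (its self-similarity is 1.0).
demo_row = np.array([1.0, 0.2, 0.7, 0.05, 0.4])
print(top_k_similar(demo_row, self_index=0, k=2))
```

Excluding by index keeps the result correct even when two shows have identical descriptions and therefore tie at the top of the ranking.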

In [None]:
def item(id):  
    return netflix_data.loc[netflix_data['show_id'] == id]['title'].tolist()[0].split(' - ')[0] 

# Just reads the results out of the dictionary.
def recommend(item_id, num):
    print("Recommending " + str(num) + " shows similar to " + item(item_id) + "...")
    print("-------")
    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

In [None]:
recommend('s1305',6)

> **From the results above, we can see that the recommendations for Chef's Table were not close to the show's theme, so the embedding representation needs improvement. We now use sentence transformers to represent the show descriptions.**

# **Sentence Transformer Based**

<h3>This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc.</h3>

In [None]:
!pip install -U sentence-transformers

With `SentenceTransformer('paraphrase-distilroberta-base-v1')` we define which sentence transformer model we want to load. In this example, we load paraphrase-distilroberta-base-v1, which is a DistilRoBERTa-base model fine-tuned on a large dataset of paraphrase sentences.

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-distilroberta-base-v1')

> Find embeddings for all show descriptions in the dataset.

<h4> <b> *** Run the cell below only if you want to compute the embeddings for each description. ***</b> <br> I have added the embeddings separately as an npy file, so you can skip to the next cell.</h4>

In [None]:
import numpy as np
descriptions = netflix_data['description'].tolist()
# encode() accepts a list of sentences and returns one embedding per description.
des_embeddings = model.encode(descriptions, show_progress_bar=True)

In [None]:
import numpy as np
des_embeddings = np.load('../input/netflix-descriptions-bert-embeddings/descriptions_embeddings.npy')
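
For reference, an `.npy` file like the one loaded above can be produced with `np.save`. The sketch below is a round trip on a small random stand-in array. The array contents, its shape, and the temporary path are assumptions for illustration, not the actual embeddings.

```python
import os
import tempfile

import numpy as np

# Stand-in for real embeddings: 4 "descriptions", 8 dimensions each.
demo_embeddings = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)

# Write to a temporary directory so the example leaves the working directory untouched.
demo_path = os.path.join(tempfile.mkdtemp(), "descriptions_embeddings.npy")
np.save(demo_path, demo_embeddings)

# Loading restores the same shape, dtype, and values.
restored = np.load(demo_path)
print(restored.shape, restored.dtype)
```

Saving the array once avoids re-encoding every description each time the notebook runs.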

<h4>For a query show ID, let's find the top five shows with the highest cosine similarity.</h4>

In [None]:
import torch
from sentence_transformers import SentenceTransformer, util

def recommend(query):
    # Compute cosine similarities between the query and all description embeddings.
    query_embedd = model.encode(query)
    cosine_scores = util.pytorch_cos_sim(query_embedd, des_embeddings)
    # Skip position 0 (normally the query show itself) and keep the next five.
    top5_matches = torch.argsort(cosine_scores, dim=-1, descending=True).tolist()[0][1:6]
    return top5_matches

show_id = 's1305'
query_show_des = netflix_data.loc[netflix_data['show_id'] == show_id]['description'].to_list()[0]
recommended_results = recommend(query_show_des)

for index in recommended_results:
    print(netflix_data.iloc[index, :])



The results above show that the top 5 recommendations for the show are:

1. **Chef's Table: France**
2. **Rotten**
3. **The Mind of a Chef**
4. **Chef's Table: BBQ**
5. **The Chef Show**

The recommendations are now closer to the query.

<center> <h3><b> Please upvote if you liked this approach. If you have improvements or suggestions, write them in the comments. </b></h3> </center>