# Recommender System

## Content Based Filtering

* uses item features to recommend other items similar to what the user likes, based on the previous actions or feedback [more](https://developers.google.com/machine-learning/recommendation/content-based/basics)

<center><img src="https://developers.google.com/machine-learning/recommendation/images/Matrix1.svg" width=600 /></center>

> recommend movies based on:
>* the genres
>* similar in types 
>* user's last watched movie 

* Cosine Similarity

In [1]:
import numpy as np
import pandas as pd
import nltk  

In [2]:
movies = pd.read_csv('movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
rating = pd.read_csv('ratings.csv')
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


In [4]:
data = pd.concat([movies, rating.loc[:, rating.columns != 'movieId']], axis=1)
data.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,1217897793
1,2.0,Jumanji (1995),Adventure|Children|Fantasy,1,1.5,1217895807
2,3.0,Grumpier Old Men (1995),Comedy|Romance,1,4.0,1217896246
3,4.0,Waiting to Exhale (1995),Comedy|Drama|Romance,1,4.0,1217896556
4,5.0,Father of the Bride Part II (1995),Comedy,1,4.0,1217896523


### Lemmatization

**Lemmatization** is the process of converting a word to its base form. 

<center>Caring [Lemmatization] -> Care</center>

<center>Caring [Stemming] -> Car</center>


> The difference between stemming and lemmatization: 
>* lemmatization considers the context and converts the word to its meaningful base form
>* stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.

#### NLTK WordnetLemmatizer

> Wordnet - large, freely, publicly available lexical database for English language 

* 'raw' WordnetLemmatizer didn't do a good job, but to improve processing POS tag (part-of-speech) can be provided as the second argument to `lemmatize()`

In [5]:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /home/lena/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/lena/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/lena/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [6]:
from nltk.stem import WordNetLemmatizer

# Without Part-of-Speech tag [POS]

# Init the Wordnet Lemmatizer
test_lemmatizer = WordNetLemmatizer() 

print(test_lemmatizer.lemmatize('cats'))
print(test_lemmatizer.lemmatize('are'))
print(test_lemmatizer.lemmatize('feet'))

# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"

# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']

# Lemmatize list of words and join
test_lemmatized_output = ' '.join([test_lemmatizer.lemmatize(w) for w in word_list])
print(test_lemmatized_output)
#> The striped bat are hanging on their foot for best

cat
are
foot
['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
The striped bat are hanging on their foot for best


In [7]:
print(nltk.pos_tag(['feet']))

print(nltk.pos_tag(nltk.word_tokenize(sentence)))

[('feet', 'NNS')]
[('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('for', 'IN'), ('best', 'JJS')]


In [8]:
# Lemmatize with POS Tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


# 1. Init Lemmatizer
lemmatizer = WordNetLemmatizer()

# 2. Lemmatize Single Word with the appropriate POS tag
word = 'feet'
print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))

# 3. Lemmatize a Sentence with the appropriate POS tag
sentence = "The striped bats are hanging on their feet for best"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
#> ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']

foot
['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']


#### spaCy Lemmatization

* spaCy is a relatively new and is billed as an industial strength NLP engine. It comes with pre-built models that can parse text and compute various NLP related features through one single call

* spaCy determines the part-of-speech tag by default and assigns the corresponding lemma. It comes with a bunch of prebuilt models where the ‘en’ we just downloaded above is one of the standard ones for english.

In [9]:
# Install spaCy (run in terminal/prompt)
import sys
!{sys.executable} -m pip install spacy

# Download spaCy's  'en' Model
!{sys.executable} -m spacy download en

[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
[38;5;2m✔ Linking successful[0m
/home/lena/RS/test_vevn/lib/python3.8/site-packages/en_core_web_sm -->
/home/lena/RS/test_vevn/lib/python3.8/site-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [10]:
import spacy

# Initialize spacy 'en' model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en', disable=['parser', 'ner'])

sentence = "The striped bats are hanging on their feet for best"

# Parse the sentence using the loaded 'en' model object `nlp`
doc = nlp(sentence)

# Extract the lemma for each token and join
" ".join([token.lemma_ for token in doc])
#> 'the strip bat be hang on -PRON- foot for good'

'the stripe bat be hang on -PRON- foot for good'

#### TextBlob Lemmatizer

* TextBlob is a powerful, fast and convenient NLP package 

In [11]:
! pip3 install textblob



In [12]:
# pip install textblob
from textblob import TextBlob, Word

# Lemmatize a word
word = 'stripes'
w = Word(word)
w.lemmatize()
#> stripe

# Lemmatize a sentence (without POS)
sentence = "The striped bats are hanging on their feet for best"
sent = TextBlob(sentence)
" ". join([w.lemmatize() for w in sent.words])
#> 'The striped bat are hanging on their foot for best'

'The striped bat are hanging on their foot for best'

In [13]:
# Define function to lemmatize each word with its POS tag
def lemmatize_with_postag(sentence):
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a', 
                "N": 'n', 
                "V": 'v', 
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]    
    lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
    return " ".join(lemmatized_list)

# Lemmatize
sentence = "The striped bats are hanging on their feet for best"
lemmatize_with_postag(sentence)
#> 'The striped bat be hang on their foot for best'

'The striped bat be hang on their foot for best'

#### Pattern Lemmatizer 

* Pattern by CLiPs is a versatile module with many useful NLP capabilities. 

In [14]:
! pip3 install wheel
! pip3 install pattern



In [16]:
import pattern
from pattern.en import lemma, lexeme

sentence = "The striped bats were hanging on their feet and ate best fishes"
" ".join([lemma(wd) for wd in sentence.split()])
#> 'the stripe bat be hang on their feet and eat best fishes'

'the stripe bat be hang on their feet and eat best fishes'

In [43]:
# Lexeme's for each word 
[lexeme(wd) for wd in sentence.split()]

[['the', 'thes', 'thing', 'thed'],
 ['stripe', 'stripes', 'striping', 'striped'],
 ['bat', 'bats', 'batting', 'batted'],
 ['be',
  'am',
  'are',
  'is',
  'being',
  'was',
  'were',
  'been',
  'am not',
  "aren't",
  "isn't",
  "wasn't",
  "weren't"],
 ['hang', 'hangs', 'hanging', 'hung'],
 ['on', 'ons', 'oning', 'oned'],
 ['their', 'theirs', 'theiring', 'theired'],
 ['feet', 'feets', 'feeting', 'feeted'],
 ['and', 'ands', 'anding', 'anded'],
 ['eat', 'eats', 'eating', 'ate', 'eaten'],
 ['best', 'bests', 'besting', 'bested'],
 ['fishes', 'fishing', 'fishesed']]

In [48]:
from pattern.en import parse

print(parse('The striped bats were hanging on their feet and ate best fishes', lemmata=True, tags=False, chunks=False))
#> The/DT/the striped/JJ/striped bats/NNS/bat were/VBD/be hanging/VBG/hang on/IN/on their/PRP$/their 
#>  feet/NNS/foot and/CC/and ate/VBD/eat best/JJ/best fishes/NNS/fish

The/DT/the striped/JJ/striped bats/NNS/bat were/VBD/be hanging/VBG/hang on/IN/on their/PRP$/their feet/NNS/foot and/CC/and ate/VBD/eat best/JJ/best fishes/NNS/fish


#### Gensim Lemmatize 

* based on the pattern package 

In [50]:
! pip3 install gensim

Collecting gensim
  Downloading gensim-3.8.3-cp38-cp38-manylinux1_x86_64.whl (24.2 MB)
[K     |████████████████████████████████| 24.2 MB 21.3 MB/s 
Collecting smart-open>=1.8.1
  Downloading smart_open-4.0.1.tar.gz (117 kB)
[K     |████████████████████████████████| 117 kB 67.2 MB/s 
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py) ... [?25ldone
[?25h  Created wheel for smart-open: filename=smart_open-4.0.1-py3-none-any.whl size=108243 sha256=ee691331d313a142bce0ead510e4064f455ff90cb0761b8706b3938e72b8340a
  Stored in directory: /home/lena/.cache/pip/wheels/8c/f9/f4/4ddd9ddee3488f48be20e9bf3108961f03ae23da29b7ed26d1
Successfully built smart-open
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.8.3 smart-open-4.0.1


In [53]:
from gensim.utils import lemmatize

sentence = "The striped bats were hanging on their feet and ate best fishes"
lemmatized_out = [wd.decode('utf-8').split('/')[0] for wd in lemmatize(sentence)]
#> ['striped', 'bat', 'be', 'hang', 'foot', 'eat', 'best', 'fish']
lemmatized_out

['striped', 'bat', 'be', 'hang', 'foot', 'eat', 'best', 'fish']

In [20]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
genres = movies['genres']

In [23]:
li = [] 

for i in range(len(genres)):
    tmp = genres[i].lower()
    tmp = tmp.split('|')
    tmp = [lemmatizer.lemmatize(word) for word in tmp]
    li.append(" ".join(tmp))

In [29]:
lemmatized_movies = pd.DataFrame(li, columns=['genres'], index=movies['title'])
lemmatized_movies.head()

Unnamed: 0_level_0,genres
title,Unnamed: 1_level_1
Toy Story (1995),adventure animation child comedy fantasy
Jumanji (1995),adventure child fantasy
Grumpier Old Men (1995),comedy romance
Waiting to Exhale (1995),comedy drama romance
Father of the Bride Part II (1995),comedy


In [27]:
#Finding based on similar movies
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
X = cv.fit_transform(lemmatized_movies["genres"]).toarray()
X.shape

(10329, 24)

In [30]:
print("Count Vector : \n",X)
print("\nNote: First row of above count vector: ",X[0])
print("\nColumns Coresponding to above count vector is :\n",cv.get_feature_names())

Count Vector : 
 [[0 1 1 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Note: First row of above count vector:  [0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Columns Coresponding to above count vector is :
 ['action', 'adventure', 'animation', 'child', 'comedy', 'crime', 'documentary', 'drama', 'fantasy', 'fi', 'film', 'genres', 'horror', 'imax', 'listed', 'musical', 'mystery', 'no', 'noir', 'romance', 'sci', 'thriller', 'war', 'western']


In [32]:
output = movies.loc[:,['movieId','title']]
output = output.join(pd.DataFrame(X, columns=cv.get_feature_names()))
output

Unnamed: 0,movieId,title,action,adventure,animation,child,comedy,crime,documentary,drama,...,listed,musical,mystery,no,noir,romance,sci,thriller,war,western
0,1,Toy Story (1995),0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),0,0,0,0,1,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10324,146684,Cosmic Scrat-tastrophe (2015),0,0,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10325,146878,Le Grand Restaurant (1966),0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10326,148238,A Very Murray Christmas (2015),0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10327,148626,The Big Short (2015),0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [35]:
#Row corresponds to a movie name
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(X) 
#Each row of matrix coressponds to similarity of a movie with all other movies (row len = 10329)
similarities

array([[1.        , 0.77459667, 0.31622777, ..., 0.4472136 , 0.        ,
        0.        ],
       [0.77459667, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.31622777, 0.        , 1.        , ..., 0.70710678, 0.        ,
        0.        ],
       ...,
       [0.4472136 , 0.        , 0.70710678, ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [39]:
uid = 18 #For user 18 lets recommend movies based on his recent watched movie

time = rating.loc[rating["userId"]==uid,["movieId","timestamp"]]
latest_movieId_watched_by_user = time.sort_values(by="timestamp",ascending=False)["movieId"].values[0]
latest_movieId_watched_by_user

8798

In [40]:
movie_index = movies.loc[movies['movieId']==latest_movieId_watched_by_user,["title"]].index[0]
output.loc[output['movieId']==8798,:]

Unnamed: 0,movieId,title,action,adventure,animation,child,comedy,crime,documentary,drama,...,listed,musical,mystery,no,noir,romance,sci,thriller,war,western
5801,8798,Collateral (2004),1,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,1,0,0


In [41]:
movie_index,"for movie id",latest_movieId_watched_by_user

(5801, 'for movie id', 8798)

In [42]:
#we need index but we are using id to find which row is crct in similarities matrix
movie_index = movies.loc[movies['movieId']==latest_movieId_watched_by_user,["title"]].index[0]
similarity_values = pd.Series(similarities[movie_index])

In [43]:
#We converted list into series in order to preserve the actual indexes of dataset even after sorting
similarity_values.sort_values(ascending=False)

2906    1.0
7942    1.0
6531    1.0
6530    1.0
7534    1.0
       ... 
4538    0.0
4536    0.0
4533    0.0
4532    0.0
0       0.0
Length: 10329, dtype: float64

In [45]:
similar_movie_indexes = list(similarity_values.sort_values(ascending=False).index)
# similar_movie_indexes

In [46]:
#Remove the already watched movie from index list
similar_movie_indexes.remove(movie_index)

In [49]:
def get_movie_by_index(idx, dataframe):
    return dataframe.index[idx]
def get_movie_by_id(mv_id, dataframe):
    return dataframe.loc[dataframe['movieId']==mv_id,['title']].values[0][0]
get_movie_by_index(8899, lemmatized_movies)

'Background to Danger (1943)'

In [51]:
print("Since u watched --->",get_movie_by_id(latest_movieId_watched_by_user, movies),"<--- We recommend you")
for i in range(15):
    print(get_movie_by_index(similar_movie_indexes[i], lemmatized_movies))

Since u watched ---> Collateral (2004) <--- We recommend you
Magnum Force (1973)
Punisher: War Zone (2008)
Thriller: A Cruel Picture (Thriller - en grym film) (1974)
Violent Cop (Sono otoko, kyôbô ni tsuki) (1989)
Elite Squad (Tropa de Elite) (2007)
Cop Land (1997)
Max Payne (2008)
Rampart (2011)
Get the Gringo (2012)
Hand Gun (1994)
Wild Card (2015)
Taken (2008)
Boondock Saints II: All Saints Day, The (2009)
Corruptor, The (1999)
Run All Night (2015)


## Collaborative Filtering

* based on users
* based on similar kind of users recommend the movies

<center><img src="https://miro.medium.com/max/700/1*DqKuqlvMREPH18ccbs5hmg.png" width=600 /></center>

> In case of movies we will find almost similar person based on his movie experience and recommend the movies which are not common between them.

In [63]:
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


In [74]:
df = movies.merge(rating)
df = df.loc[:, ["userId", "movieId", "title", "genres", "rating"]]
df_ratings = df.loc[:, ["title", "rating"]].groupby("title").mean()
genres = df["genres"]
df_ratings.head()

Unnamed: 0_level_0,rating
title,Unnamed: 1_level_1
'71 (2014),3.5
'Hellboy': The Seeds of Creation (2004),3.0
'Round Midnight (1986),2.5
'Til There Was You (1997),4.0
"'burbs, The (1989)",3.125


In [95]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
li = []
for i in range(len(genres)):
    temp = genres[i].split("|")
    for j in range(len(temp)):
        temp[j] = lemmatizer.lemmatize(temp[j])
    li.append(" ".join(temp))

In [96]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
X = cv.fit_transform(li).toarray()

genres = pd.DataFrame(X,columns=cv.get_feature_names())
df = df.iloc[:,:-2]
new_dataset = df.join(genres)

new_dataset

Unnamed: 0,userId,movieId,title,action,adventure,animation,children,comedy,crime,documentary,...,listed,musical,mystery,no,noir,romance,sci,thriller,war,western
0,2,1,Toy Story (1995),0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,5,1,Toy Story (1995),0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,8,1,Toy Story (1995),0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,11,1,Toy Story (1995),0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,14,1,Toy Story (1995),0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105334,475,148238,A Very Murray Christmas (2015),0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
105335,458,148626,The Big Short (2015),0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
105336,576,148626,The Big Short (2015),0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
105337,668,148626,The Big Short (2015),0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [97]:
users = new_dataset.drop(["movieId","title"],axis=1)
users_moviemat = users.groupby("userId").sum()
X = users_moviemat.iloc[:,:].values
users_moviemat

Unnamed: 0_level_0,action,adventure,animation,children,comedy,crime,documentary,drama,fantasy,fi,...,listed,musical,mystery,no,noir,romance,sci,thriller,war,western
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,46,31,2,3,31,31,1,45,8,25,...,0,1,13,0,2,16,25,43,10,1
2,9,10,2,3,11,3,0,11,4,5,...,0,2,2,0,0,8,5,12,0,0
3,13,9,2,5,35,12,1,36,5,3,...,0,3,4,0,0,22,3,21,3,3
4,14,17,4,6,46,18,0,76,8,3,...,0,6,10,0,6,37,3,18,16,5
5,17,22,21,21,45,6,0,19,16,6,...,0,11,3,0,0,21,6,11,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
664,36,30,2,3,26,9,1,16,8,22,...,0,6,5,0,0,20,22,22,7,0
665,73,51,7,16,60,61,0,128,25,25,...,0,5,24,0,0,31,25,102,17,5
666,34,27,10,19,83,36,1,101,19,30,...,0,2,20,0,7,26,30,51,8,4
667,16,13,3,3,37,12,0,37,10,7,...,0,1,12,0,0,20,7,17,3,0


In [98]:
from sklearn.neighbors import NearestNeighbors

classifier = NearestNeighbors()
classifier.fit(X)

NearestNeighbors()

In [101]:
def sort_movies_by_year(li):
    def merge_sort(a,l,r):
        if l==r:
            return
        mid=(l+r)//2
        merge_sort(a,l,mid)
        merge_sort(a,mid+1,r)
        merge(a,l,mid,r)

    def merge(a,l,mid,r):
        n1=mid-l+1
        n2=r-(mid+1)+1
        L=[a[i+l] for i in range(n1)]
        R=[a[i+mid+1] for i in range(n2)]
        i,j,k=0,0,l
        while(i<n1 and j<n2):
            if int(L[i][-5:-1])>int(R[j][-5:-1]) :
                a[k]=L[i]
                i+=1
            else:
                a[k]=R[j]
                j+=1
            k+=1
        while(i<n1):
            a[k]=L[i]
            i+=1
            k+=1
        while(j<n2):
            a[k]=R[j]
            j+=1
            k+=1
    merge_sort(li,0,len(li)-1)

In [102]:
uid = int(input("Enter User Id "))
li = classifier.kneighbors([X[uid-1]],n_neighbors=5,return_distance=False)
current_user = new_dataset.loc[new_dataset["userId"]==li[0][0],:]["title"].values
similar_user = new_dataset.loc[new_dataset["userId"]==li[0][1],:]["title"].values
movies_list = [movie for movie in similar_user if movie not in current_user]
sort_movies_by_year(movies_list)
for i in range(len(movies_list)):
    movies_list[i] = (movies_list[i], df_ratings['rating'][df_ratings.index == movies_list[i]].values[0])
print("Recommended Movies are: ")
movies_list

Recommended Movies are: 


[('Harry Potter and the Deathly Hallows: Part 1 (2010)', 3.7580645161290325),
 ('Black Swan (2010)', 3.9655172413793105),
 ('Devil (2010)', 3.0714285714285716),
 ('Easy A (2010)', 3.3461538461538463),
 ('Social Network, The (2010)', 3.875),
 ('Machete (2010)', 3.3636363636363638),
 ('Scott Pilgrim vs. the World (2010)', 4.0),
 ('Expendables, The (2010)', 3.125),
 ('Salt (2010)', 3.3125),
 ('Karate Kid, The (2010)', 3.3333333333333335),
 ('Inception (2010)', 4.189320388349515),
 ('Toy Story 3 (2010)', 4.142857142857143),
 ('Get Him to the Greek (2010)', 3.15),
 ('Iron Man 2 (2010)', 3.6451612903225805),
 ('Death at a Funeral (2010)', 2.75),
 ('Kick-Ass (2010)', 3.857142857142857),
 ('How to Train Your Dragon (2010)', 3.757142857142857),
 ("She's Out of My League (2010)", 3.0),
 ("Valentine's Day (2010)", 3.0),
 ('Avatar (2009)', 3.856060606060606),
 ('Invictus (2009)', 3.5833333333333335),
 ('Blind Side, The  (2009)', 3.9347826086956523),
 ('Ninja Assassin (2009)', 3.2857142857142856),


## Rating Based Filtering

* based on higher ratings 
* number of reviews 
* Correlation comes into act 

> Comparison one movie ratings given by all user with another movie ratings given by all users => correlation 

In [104]:
dataset = rating.merge(movies)
df = dataset.groupby("title")['rating'].mean()
df1= dataset.groupby("title")['rating'].count()
dataset_based_on_ratings = pd.DataFrame({"rating":df,"number of ratings":df1})
dataset_based_on_ratings

Unnamed: 0_level_0,rating,number of ratings
title,Unnamed: 1_level_1,Unnamed: 2_level_1
'71 (2014),3.500000,1
'Hellboy': The Seeds of Creation (2004),3.000000,1
'Round Midnight (1986),2.500000,1
'Til There Was You (1997),4.000000,3
"'burbs, The (1989)",3.125000,20
...,...,...
loudQUIETloud: A Film About the Pixies (2006),4.500000,1
xXx (2002),2.958333,24
xXx: State of the Union (2005),2.071429,7
¡Three Amigos! (1986),3.012500,40


In [105]:
df = dataset.loc[:,["userId","rating","title"]]
users_movie_matrix = pd.pivot_table(df,columns='title',index='userId',values='rating') 
# correlation matrix 
users_movie_matrix

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Til There Was You (1997),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...And Justice for All (1979),10 (1979),...,[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),a/k/a Tommy Chong (2005),eXistenZ (1999),loudQUIETloud: A Film About the Pixies (2006),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
664,,,,,,,,,,,...,,,,,,,,,,
665,,,,,,,,,,,...,,,,,,,,,,
666,,,,,,,,,,,...,,,,,5.0,,,,,
667,,,,,,,,,,,...,,,,,,,,,,


In [106]:
movie_watched = users_movie_matrix["Jurassic Park (1993)"]
y = users_movie_matrix["Silence of the Lambs, The (1991)"]

In [107]:
li = []
for i in range(len(users_movie_matrix.columns)):
    li.append(movie_watched.corr(users_movie_matrix.iloc[:,i]))
li = pd.Series(li)