<a href="https://colab.research.google.com/github/sahil-ansari-15/Recommendation-System/blob/main/Collaborative_filtering_Cosine_Similarity_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Business Problem**

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.  
The dataset can be downloaded from https://grouplens.org/datasets/movielens/100k/

This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998.

**Task and Approach:**

We need to work with the MovieLens dataset and build a model that recommends movies to end users.

**Step 1 :** Importing Libraries and Understanding Data

In [1]:
%matplotlib inline
# Display plots inline in Jupyter notebooks
import numpy as np   # linear algebra
import pandas as pd  # data processing and file I/O
import matplotlib.pyplot as plt # visualization and plotting
import seaborn as sns # statistical data visualization

from sklearn.metrics.pairwise import cosine_similarity  # Compute cosine similarity between samples in X and Y.
from scipy import sparse  #  sparse matrix package for numeric data.
from scipy.sparse.linalg import svds # svd algorithm

import warnings   # To avoid warning messages in the code run
warnings.filterwarnings("ignore")


**Step 2 :** Loading Data and Cross-Checking

In [2]:
Rating = pd.read_csv('Ratings.csv')  # user ratings
Movie_D = pd.read_csv('Movie details.csv', encoding='latin-1')  # movie details
User_Info = pd.read_csv('user level info.csv', encoding='latin-1')  # latin-1 tells pandas how to decode non-ASCII bytes in the file

In [3]:
Rating.shape

(100000, 4)

In [4]:
Rating.head()

Unnamed: 0,user id,item id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


* `item id` is the movie id
* The `item id` column is renamed to `movie_id` for better readability


In [5]:
Rating.columns = ['user_id', 'movie_id', 'rating', 'timestamp'] 

Renaming the columns removes the spaces from the column names.

In [6]:
Movie_D.shape

(1682, 24)

In [7]:
Movie_D.head()

Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),1-Jan-95,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),1-Jan-95,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),1-Jan-95,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),1-Jan-95,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),1-Jan-95,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


In [8]:
Movie_D.columns = ['movie_id', 'movie_title', 'release_date', 'video_release_date',
       'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation',
       'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama',
       'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
       'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

Renaming the columns removes the spaces from the column names.
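
As an alternative to typing every name by hand, the headers can be cleaned programmatically. The following is a minimal sketch on a made-up empty frame (the frame `df` and its column list are assumptions for illustration, not the notebook's data):

```python
import pandas as pd

# Hypothetical frame whose headers contain stray and embedded spaces.
df = pd.DataFrame(columns=["movie id", " Documentary ", "Film-Noir ", "Children's"])

# Strip outer whitespace first, then turn the remaining inner spaces into underscores.
df.columns = df.columns.str.strip().str.replace(" ", "_", regex=False)

print(list(df.columns))
```

Cleaning the names this way avoids accidentally leaving a leading or trailing space in a hand-typed list.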

**To get the information we need into a single dataframe, we can merge the two dataframes on the `movie_id` column, since it is common to both.**

**We can do this using the `merge()` function from the pandas library.**
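
To make the join behaviour concrete, here is a small sketch on toy data. The frames `ratings` and `movies` and their values are invented stand-ins for `Rating` and `Movie_D`:

```python
import pandas as pd

# Toy stand-ins for Rating and Movie_D (values are made up for illustration).
ratings = pd.DataFrame({
    "user_id": [1, 2, 1],
    "movie_id": [10, 10, 20],
    "rating": [5, 3, 4],
})
movies = pd.DataFrame({
    "movie_id": [10, 20, 30],
    "movie_title": ["Movie A", "Movie B", "Movie C"],
})

# merge() defaults to an inner join: only movie_id values present in both
# frames survive, so "Movie C" (never rated) does not appear in the result.
merged = pd.merge(ratings, movies, on="movie_id")
print(merged)
```

Each rating row picks up the columns of its movie, which is why the merged frame has one row per rating.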

In [9]:
Movie_Rating = pd.merge(Rating ,Movie_D,on = 'movie_id')
Movie_Rating.describe()

Unnamed: 0,user_id,movie_id,rating,timestamp,video_release_date,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
count,100000.0,100000.0,100000.0,100000.0,0.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,462.48475,425.53013,3.52986,883528900.0,,0.0001,0.25589,0.13753,0.03605,0.07182,0.29832,0.08055,0.00758,0.39895,0.01352,0.01733,0.05317,0.04954,0.05245,0.19461,0.1273,0.21872,0.09398,0.01854
std,266.61442,330.798356,1.125674,5343856.0,,0.01,0.436362,0.344408,0.186416,0.258191,0.457523,0.272144,0.086733,0.489685,0.115487,0.130498,0.224373,0.216994,0.222934,0.395902,0.33331,0.41338,0.291802,0.134894
min,1.0,1.0,1.0,874724700.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,254.0,175.0,3.0,879448700.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,447.0,322.0,4.0,882826900.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,682.0,631.0,4.0,888260000.0,,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,943.0,1682.0,5.0,893286600.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**The average rating across all ratings is about 3.5 (mean 3.53).**  
**The 25th percentile is 3, the median and the 75th percentile are both 4, and the highest rating is 5.**

=======================================================================================
## Cosine Similarities

* Until now we have looked at similarity through correlation; now we are going to use cosine similarity to find similar movies
* First, filter out the required columns from the dataset

In [10]:
Movie_cosine = Movie_Rating[['user_id','movie_id','rating']]
Movie_cosine.head()

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,63,242,3
2,226,242,5
3,154,242,3
4,306,242,5


* We are going to create a sparse matrix from the data above
* The sparse matrix is built in coordinate (COO) format, which is also called triplet format

In [11]:
data = Movie_cosine.rating
col = Movie_cosine.movie_id
row = Movie_cosine.user_id

R = sparse.coo_matrix((data, (row, col))).tocsr()
print ('{0}x{1} user by movie matrix'.format(*R.shape))

944x1683 user by movie matrix
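
The shape is one larger than the number of users and movies in each dimension because the ids start at 1 and are used directly as matrix indices. A minimal sketch with invented triplets (the names `R_toy` and `dup` are assumptions for illustration) shows this, along with how repeated coordinates are handled:

```python
import numpy as np
from scipy import sparse

# Made-up (user_id, movie_id, rating) triplets; ids start at 1, as in MovieLens.
row = np.array([1, 1, 2, 3])      # user_id
col = np.array([1, 2, 2, 4])      # movie_id
data = np.array([5, 3, 4, 2])     # rating

R_toy = sparse.coo_matrix((data, (row, col))).tocsr()

# With no explicit shape, the matrix spans indices 0..max, so 3 users and
# movie ids up to 4 give a 4x5 matrix whose row 0 and column 0 stay empty.
print(R_toy.shape)
print(R_toy.toarray())

# Repeated coordinates are summed when converting from COO to CSR.
dup = sparse.coo_matrix(([1, 2], ([0, 0], [0, 0]))).tocsr()
print(dup[0, 0])
```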


#### `(1, 1)  5` means that row 1 (user_id 1), column 1 (movie_id 1) of the matrix holds a rating of 5

In [12]:
print(R)

  (1, 1)	5
  (1, 2)	3
  (1, 3)	4
  (1, 4)	3
  (1, 5)	3
  (1, 6)	5
  (1, 7)	4
  (1, 8)	1
  (1, 9)	5
  (1, 10)	3
  (1, 11)	2
  (1, 12)	5
  (1, 13)	5
  (1, 14)	5
  (1, 15)	5
  (1, 16)	5
  (1, 17)	3
  (1, 18)	4
  (1, 19)	5
  (1, 20)	4
  (1, 21)	1
  (1, 22)	4
  (1, 23)	4
  (1, 24)	3
  (1, 25)	4
  :	:
  (943, 739)	4
  (943, 756)	2
  (943, 763)	4
  (943, 765)	3
  (943, 785)	2
  (943, 794)	3
  (943, 796)	3
  (943, 808)	4
  (943, 816)	4
  (943, 824)	4
  (943, 825)	3
  (943, 831)	2
  (943, 840)	4
  (943, 928)	5
  (943, 941)	1
  (943, 943)	5
  (943, 1011)	2
  (943, 1028)	2
  (943, 1044)	3
  (943, 1047)	2
  (943, 1067)	2
  (943, 1074)	4
  (943, 1188)	3
  (943, 1228)	3
  (943, 1330)	3


* The three arrays `data`, `row`, and `col` together make up the triplet format of the matrix

* The individual elements of the matrix can be listed in any order, and if there are multiple entries for the same position, the values provided for that position are added together.

* We use **cosine similarity** to measure the similarity between a pair of vectors

* Cosine similarity evaluates the similarity between two vectors based on the angle between them. The smaller the angle, the more similar the two vectors are

* Recall from trigonometry that the cosine function ranges from -1 to 1. Some important values to remember:

    + Cosine(0°) = 1
    + Cosine(90°) = 0
    + Cosine(180°) = -1

* If we restrict our vectors to non-negative values (as with movie ratings, usually on a 1-5 scale), then the angle between the two vectors is bounded between 0° and 90°, so the cosine similarity lies between 0 and 1
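
The points above can be checked numerically. This is a small sketch, not the notebook's code: the helper `cosine` and the vectors `a`, `b`, `c` are invented for illustration, and the result is compared against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([5.0, 3.0, 0.0])    # made-up rating vectors
b = np.array([10.0, 6.0, 0.0])   # same direction as a, twice the length
c = np.array([0.0, 0.0, 4.0])    # shares no rated item with a

print(cosine(a, b))   # parallel vectors: angle 0°, similarity close to 1
print(cosine(a, c))   # orthogonal vectors: angle 90°, similarity 0

# The library function computes the same quantity for every pair of rows.
sk = cosine_similarity(np.vstack([a, b, c]))
print(sk)
```

Note that `b` is twice as long as `a` yet scores as identical: cosine similarity looks only at direction, not magnitude.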

In [13]:
find_similarities = cosine_similarity(R.T) # Transpose so that rows are movies: this gives movie-to-movie similarity
print (find_similarities.shape)

(1683, 1683)


In [14]:
print(R.T)

  (1, 1)	5
  (2, 1)	3
  (3, 1)	4
  (4, 1)	3
  (5, 1)	3
  (6, 1)	5
  (7, 1)	4
  (8, 1)	1
  (9, 1)	5
  (10, 1)	3
  (11, 1)	2
  (12, 1)	5
  (13, 1)	5
  (14, 1)	5
  (15, 1)	5
  (16, 1)	5
  (17, 1)	3
  (18, 1)	4
  (19, 1)	5
  (20, 1)	4
  (21, 1)	1
  (22, 1)	4
  (23, 1)	4
  (24, 1)	3
  (25, 1)	4
  :	:
  (739, 943)	4
  (756, 943)	2
  (763, 943)	4
  (765, 943)	3
  (785, 943)	2
  (794, 943)	3
  (796, 943)	3
  (808, 943)	4
  (816, 943)	4
  (824, 943)	4
  (825, 943)	3
  (831, 943)	2
  (840, 943)	4
  (928, 943)	5
  (941, 943)	1
  (943, 943)	5
  (1011, 943)	2
  (1028, 943)	2
  (1044, 943)	3
  (1047, 943)	2
  (1067, 943)	2
  (1074, 943)	4
  (1188, 943)	3
  (1228, 943)	3
  (1330, 943)	3


In [15]:
print(find_similarities)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         1.         0.40238218 ... 0.         0.04718307 0.04718307]
 [0.         0.40238218 1.         ... 0.         0.07829936 0.07829936]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.04718307 0.07829936 ... 0.         1.         0.        ]
 [0.         0.04718307 0.07829936 ... 0.         0.         1.        ]]
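
Row 0 and column 0 of this matrix are all zeros, including the diagonal entry, because no movie has id 0 and that vector is empty. A minimal sketch on a toy matrix (`R_toy` and `sim` are assumed names) shows that `cosine_similarity` returns 0 rather than NaN for an all-zero vector:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-by-movie matrix with an empty row 0 and column 0,
# mimicking the unused index 0 in the notebook's matrix.
R_toy = sparse.csr_matrix(np.array([
    [0, 0, 0],
    [0, 5, 4],
    [0, 3, 0],
]))

sim = cosine_similarity(R_toy.T)   # rows of R_toy.T are movies
print(sim)
```

Every movie with at least one rating has similarity 1 with itself, while the padding index 0 has similarity 0 with everything.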


a=pd.DataFrame(find_similarities)

a.to_csv("matrix.csv")

In [16]:
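def Get_Top5_Similarmovies(model, movie_id, n=5):
    # model[movie_id] is the row of similarity scores for movie_id.
    # argsort() orders the indices by ascending score, [::-1] reverses them
    # to descending order (e.g. 1234 -> 4321), and [:n] keeps the top n.
    return model[movie_id].argsort()[::-1][:n].tolist()
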

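* The positional index of `Movie_D` starts at 0, while movie ids start at 1. The similarity matrix, in contrast, is indexed directly by movie id (its row and column 0 are unused), so the values returned by `Get_Top5_Similarmovies` are movie ids. Passing them to `Movie_D.iloc` selects the rows one position later, that is, the movies with id + 1.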
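
The difference between a positional index and a movie id can be seen in a small sketch. Everything here is invented for illustration: the matrix `sim`, the frame `movies`, and the helper `top_similar_ids`, which also skips the query movie and the unused index 0.

```python
import numpy as np
import pandas as pd

# Toy similarity matrix indexed by movie_id (index 0 is unused padding).
sim = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.2, 0.9],
    [0.0, 0.2, 1.0, 0.4],
    [0.0, 0.9, 0.4, 1.0],
])
movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "movie_title": ["Movie A", "Movie B", "Movie C"],
})

def top_similar_ids(model, movie_id, n=2):
    """Hypothetical helper: ids of the n most similar movies, excluding the query."""
    order = model[movie_id].argsort()[::-1]   # indices by descending similarity
    return [int(i) for i in order if i != movie_id and i != 0][:n]

ids = top_similar_ids(sim, 1)
print(ids)

# Look rows up by movie_id (label based) rather than by position.
by_id = movies.set_index("movie_id").loc[ids, "movie_title"].tolist()
print(by_id)

# For contrast: treating movie id 1 as a position selects the second row.
by_position = movies.iloc[[1]]["movie_title"].tolist()
print(by_position)
```

Looking titles up by `movie_id` keeps the result aligned with the similarity matrix regardless of how the frame is ordered.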

In [17]:
Movie_D.head()

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),1-Jan-95,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),1-Jan-95,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),1-Jan-95,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),1-Jan-95,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),1-Jan-95,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


* Here positional index 4 corresponds to movie id 5

In [18]:
Movie_D.iloc[4] 

movie_id                                                               5
movie_title                                               Copycat (1995)
release_date                                                    1-Jan-95
video_release_date                                                   NaN
IMDb_URL               http://us.imdb.com/M/title-exact?Copycat%20(1995)
unknown                                                                0
Action                                                                 0
Adventure                                                              0
Animation                                                              0
Childrens                                                              0
Comedy                                                                 0
Crime                                                                  1
 Documentary                                                           0
Drama                                              

In [20]:
Movie_D.iloc[Get_Top5_Similarmovies(find_similarities, 4)]

Unnamed: 0,movie_id,movie_title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Childrens,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
4,5,Copycat (1995),1-Jan-95,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0
56,57,Priest (1994),1-Jan-94,,http://us.imdb.com/M/title-exact?Priest%20(1994),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
204,205,Patton (1970),1-Jan-70,,http://us.imdb.com/M/title-exact?Patton%20(1970),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0
174,175,Brazil (1985),1-Jan-85,,http://us.imdb.com/M/title-exact?Brazil%20(1985),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
202,203,Unforgiven (1992),1-Jan-92,,http://us.imdb.com/M/title-exact?Unforgiven%20...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [21]:
def get_movieid_by_movie_name(movie_title):
    movie_id = None                          # stays None if no title matches
    listOfMovies = Movie_D['movie_title']    # all movie titles
    listOfMoviesID = Movie_D['movie_id']     # all movie ids
    for i in range(len(listOfMovies)):
        if listOfMovies[i] == movie_title:   # exact title match
            movie_id = listOfMoviesID[i]     # remember the matching movie id
    return movie_id
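
The same lookup can be written without an explicit loop by using a boolean mask. This is a sketch under assumptions: `find_movie_id` is a hypothetical helper, the `movies` frame is toy data, and returning `None` for an unknown title is a design choice made here.

```python
import pandas as pd

movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "movie_title": ["Movie A", "Movie B", "Movie C"],
})

def find_movie_id(frame, title):
    """Hypothetical helper: movie_id for an exact title match, else None."""
    matches = frame.loc[frame["movie_title"] == title, "movie_id"]
    return int(matches.iloc[0]) if not matches.empty else None

found = find_movie_id(movies, "Movie B")
missing = find_movie_id(movies, "Not In The Catalogue")
print(found, missing)
```

Returning `None` makes a missing title easy to detect before it is passed on to the similarity lookup.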


In [22]:
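def similar_movie(movie_id):
    # Note: the ids returned by Get_Top5_Similarmovies are used as row positions by iloc
    df = Movie_D.iloc[Get_Top5_Similarmovies(find_similarities, movie_id)]
    return df[["movie_id", "movie_title"]]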

In [23]:
similar_movie(get_movieid_by_movie_name('Get Shorty (1995)'))

Unnamed: 0,movie_id,movie_title
4,5,Copycat (1995)
56,57,Priest (1994)
204,205,Patton (1970)
174,175,Brazil (1985)
202,203,Unforgiven (1992)
