# <font color="orange">Movie Recommendation System </font>

<b><i>NAKKA SHEKHAR</i></b>
* [LinkedIn](https://www.linkedin.com/in/nakka-shekhar-2019a987/)
* [GitHub](https://github.com/shekhar443/MACHINE-LEARNING-PROJECT)

<b>Steps Followed:</b>
* Importing the basic libraries
* Importing and parsing the dataset into ratings and movie details
* Basic inspection of the datasets
* Create the ratings matrix of shape (m×u), with m movies as rows and u users as columns
* Subtract the mean (normalization)
* Computing SVD
* Calculate cosine similarity, sort by most similar and return the top N
* Select k principal components to represent the movies, choose a movie_id to find recommendations for, and print the top_n results


### Importing the basic libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

### Importing and parsing the dataset into ratings and movie details

In [2]:
# The files use '::' as the field separator; a multi-character separator needs the Python parser engine
ratingData = pd.read_table('ratings.dat',
                           names=['user_id', 'movie_id', 'rating', 'time'],
                           engine='python', delimiter='::', encoding='ISO-8859-1')
movieData = pd.read_table('movies.dat',
                          names=['movie_id', 'title', 'genre'],
                          engine='python', delimiter='::', encoding='ISO-8859-1')

### Basic inspection of the datasets

In [3]:
# Top 5 rows of movie data
movieData.head()

|   | movie_id | title | genre |
|---|----------|-------|-------|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [4]:
# Top 5 rows of rating data
ratingData.head()

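|   | user_id | movie_id | rating | time |
|---|---------|----------|--------|------|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |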


In [5]:
r,c=ratingData.shape
print("rating data having {} rows {} columns".format(r,c))

rating data having 1000209 rows 4 columns


In [6]:
r,c=movieData.shape
print("movie data having {} rows {} columns".format(r,c))

movie data having 3883 rows 3 columns


In [7]:
movieData.size

11649

In [8]:
ratingData.size

4000836

In [9]:
print('columns in the movie data: ',list(movieData.columns))

columns in the movie data:  ['movie_id', 'title', 'genre']


In [10]:
print('columns in the rating data: ',list(ratingData.columns))

columns in the rating data:  ['user_id', 'movie_id', 'rating', 'time']


In [11]:
len(movieData.movie_id.unique())

3883

In [12]:
len(ratingData.movie_id.unique())

3706

In [13]:
ratingData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   user_id   1000209 non-null  int64
 1   movie_id  1000209 non-null  int64
 2   rating    1000209 non-null  int64
 3   time      1000209 non-null  int64
dtypes: int64(4)
memory usage: 30.5 MB


In [14]:
movieData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  3883 non-null   int64 
 1   title     3883 non-null   object
 2   genre     3883 non-null   object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB


In [15]:
# Checking null values
def checknull(obj):
    return obj.isnull().sum()

In [16]:
movieData.apply(checknull)

movie_id    0
title       0
genre       0
dtype: int64

In [17]:
ratingData.apply(checknull)

user_id     0
movie_id    0
rating      0
time        0
dtype: int64

In [18]:
# Checking duplicate values
def checkduplicate(obj):
    return obj.duplicated().sum()

In [19]:
movieData.apply(checkduplicate)

movie_id       0
title          0
genre       3582
dtype: int64

In [20]:
ratingData.apply(checkduplicate)

user_id      994169
movie_id     996503
rating      1000204
time         541754
dtype: int64
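
The column-wise counts above are large mainly because user ids, movie ids and ratings repeat by design. A check that may be more informative is whether any (user_id, movie_id) pair occurs more than once, since a repeated pair would be silently overwritten when the rating matrix is filled. Below is a minimal sketch on a small made-up frame; `toy` and its values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy ratings; the last row repeats the (user_id, movie_id) pair (1, 10).
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 1],
    "movie_id": [10, 20, 10, 10],
    "rating":   [5, 3, 4, 2],
})

# Column-wise: counts values that already appeared earlier in the same column.
per_column = toy.apply(lambda col: col.duplicated().sum())

# Pair-wise: counts rows whose (user_id, movie_id) combination appeared before.
pair_dupes = toy.duplicated(subset=["user_id", "movie_id"]).sum()

print(per_column)
print("duplicate (user, movie) pairs:", pair_dupes)
```

On the real data the same call would be `ratingData.duplicated(subset=['user_id', 'movie_id']).sum()`.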

### Create the ratings matrix of shape (m×u)

In [21]:
ratingData.movie_id.values

array([1193,  661,  914, ...,  562, 1096, 1097], dtype=int64)

In [22]:
np.max(ratingData.movie_id.values)

3952

In [23]:
ratingData.user_id.values

array([   1,    1,    1, ..., 6040, 6040, 6040], dtype=int64)

In [24]:
np.max(ratingData.user_id.values)

6040

In [25]:
# np.zeros guarantees that unrated cells are 0; np.ndarray would leave them uninitialised
ratingMatrix = np.zeros(
    shape=(np.max(ratingData.movie_id.values), np.max(ratingData.user_id.values)),
    dtype=np.uint8)

In [26]:
ratingData.movie_id.values-1

array([1192,  660,  913, ...,  561, 1095, 1096], dtype=int64)

In [27]:
ratingData.user_id.values-1

array([   0,    0,    0, ..., 6039, 6039, 6039], dtype=int64)

In [28]:
ratingData.rating.values

array([5, 3, 3, ..., 5, 4, 4], dtype=int64)

In [29]:
ratingMatrix[ratingData.movie_id.values-1, ratingData.user_id.values-1] = ratingData.rating.values

In [30]:
print(ratingMatrix)

[[5 0 0 ... 0 0 3]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
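
To make the indexing step easier to follow, here is a small sketch that fills a movies × users matrix from three made-up ratings and reports how densely it is populated. The arrays `movie_ids`, `user_ids` and `ratings` are hypothetical stand-ins for the corresponding `ratingData` columns, and the sketch uses `np.zeros` so that unrated cells are guaranteed to be 0.

```python
import numpy as np

# Hypothetical toy data: three ratings over 3 movies and 2 users (ids start at 1).
movie_ids = np.array([1, 3, 3])
user_ids = np.array([1, 1, 2])
ratings = np.array([5, 4, 2])

# Rows are movies, columns are users; 0 means "not rated".
toy_matrix = np.zeros((movie_ids.max(), user_ids.max()), dtype=np.uint8)

# Fancy indexing writes each rating at position (movie_id - 1, user_id - 1).
toy_matrix[movie_ids - 1, user_ids - 1] = ratings

# Fraction of cells that hold an actual rating.
density = np.count_nonzero(toy_matrix) / toy_matrix.size

print(toy_matrix)
print("density:", density)
```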


### Subtract the mean (normalization)

In [31]:
np.mean(ratingMatrix)

0.15007545010322546

In [32]:
np.mean(ratingMatrix, 1)

array([1.42599338, 0.37152318, 0.23874172, ..., 0.03278146, 0.02582781,
       0.24288079])

In [33]:
np.mean(ratingMatrix, 1).shape

(3952,)

In [34]:
np.asarray(np.mean(ratingMatrix, 1))

array([1.42599338, 0.37152318, 0.23874172, ..., 0.03278146, 0.02582781,
       0.24288079])

In [35]:
np.asarray(np.mean(ratingMatrix, 1)).shape

(3952,)

In [36]:
# Subtract each movie's mean rating (row mean, unrated zeros included) from its row
normalizedMatrix = ratingMatrix - np.mean(ratingMatrix, axis=1, keepdims=True)

In [37]:
print(normalizedMatrix)

[[ 3.57400662 -1.42599338 -1.42599338 ... -1.42599338 -1.42599338
   1.57400662]
 [-0.37152318 -0.37152318 -0.37152318 ... -0.37152318 -0.37152318
  -0.37152318]
 [-0.23874172 -0.23874172 -0.23874172 ... -0.23874172 -0.23874172
  -0.23874172]
 ...
 [-0.03278146 -0.03278146 -0.03278146 ... -0.03278146 -0.03278146
  -0.03278146]
 [-0.02582781 -0.02582781 -0.02582781 ... -0.02582781 -0.02582781
  -0.02582781]
 [-0.24288079 -0.24288079 -0.24288079 ... -0.24288079 -0.24288079
  -0.24288079]]
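
The centering step can be checked on a tiny example. The sketch below assumes a made-up 2 × 3 matrix (`toy_matrix` is hypothetical) and, like the notebook, lets the zeros of unrated cells count towards each row mean. After subtraction every row should average to zero.

```python
import numpy as np

# Hypothetical movies x users matrix; 0 marks an unrated cell.
toy_matrix = np.array([[5, 0, 1],
                       [0, 0, 3]], dtype=np.uint8)

# keepdims=True keeps the means as a column of shape (2, 1) so they broadcast across each row.
row_means = toy_matrix.mean(axis=1, keepdims=True)

# uint8 minus float64 is promoted to float64, so negative values are represented correctly.
centered = toy_matrix - row_means

print(centered)
print(centered.mean(axis=1))
```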


### Computing SVD

In [38]:
normalizedMatrix.T

array([[ 3.57400662, -0.37152318, -0.23874172, ..., -0.03278146,
        -0.02582781, -0.24288079],
       [-1.42599338, -0.37152318, -0.23874172, ..., -0.03278146,
        -0.02582781, -0.24288079],
       [-1.42599338, -0.37152318, -0.23874172, ..., -0.03278146,
        -0.02582781, -0.24288079],
       ...,
       [-1.42599338, -0.37152318, -0.23874172, ..., -0.03278146,
        -0.02582781, -0.24288079],
       [-1.42599338, -0.37152318, -0.23874172, ..., -0.03278146,
        -0.02582781, -0.24288079],
       [ 1.57400662, -0.37152318, -0.23874172, ..., -0.03278146,
        -0.02582781, -0.24288079]])

In [39]:
ratingMatrix.shape[0] - 1

3951

In [40]:
np.sqrt(ratingMatrix.shape[0] - 1)

62.85698051927089

In [41]:
A = normalizedMatrix.T / np.sqrt(ratingMatrix.shape[0] - 1)
A

array([[ 0.05685934, -0.00591061, -0.00379817, ..., -0.00052152,
        -0.0004109 , -0.00386402],
       [-0.02268632, -0.00591061, -0.00379817, ..., -0.00052152,
        -0.0004109 , -0.00386402],
       [-0.02268632, -0.00591061, -0.00379817, ..., -0.00052152,
        -0.0004109 , -0.00386402],
       ...,
       [-0.02268632, -0.00591061, -0.00379817, ..., -0.00052152,
        -0.0004109 , -0.00386402],
       [-0.02268632, -0.00591061, -0.00379817, ..., -0.00052152,
        -0.0004109 , -0.00386402],
       [ 0.02504108, -0.00591061, -0.00379817, ..., -0.00052152,
        -0.0004109 , -0.00386402]])

In [42]:
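# np.linalg.svd returns the third factor already transposed, so V holds V^T here;
# full_matrices=False skips the full 6040 x 6040 U and leaves V unchanged
U, S, V = np.linalg.svd(A, full_matrices=False)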
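
`np.linalg.svd` returns the third factor already transposed, so the variable named `V` above holds V^T and each movie's component vector is a row of `V.T`. The sketch below illustrates this on a small random matrix; `toy_A`, the `_t` suffixed names and the value of `k` are hypothetical and only stand in for the real quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_A = rng.normal(size=(6, 4))  # hypothetical stand-in for A (users x movies)

# Reduced SVD: U_t is (6, 4), S_t is (4,), Vh_t is (4, 4) and equals V transposed.
U_t, S_t, Vh_t = np.linalg.svd(toy_A, full_matrices=False)

# The three factors multiply back to the original matrix.
recon = U_t @ np.diag(S_t) @ Vh_t

# Keep the first k components: one row per movie, one column per component.
k = 2
movie_repr = Vh_t.T[:, :k]

print(np.allclose(recon, toy_A))
print(movie_repr.shape)
```

Singular values come back sorted from largest to smallest, which is why slicing the first `k` columns keeps the strongest components.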

### Calculate cosine similarity, sort by most similar and return the top N

In [43]:
def similar(data, movie_id, top_n):
    index = movie_id - 1 # Movie ids start from 1
    movie_row = data[index, :]
    # Row-wise dot product of each row with itself, i.e. the squared L2 norm of every row
    magnitude = np.sqrt(np.einsum('ij, ij -> i', data, data))
    # Cosine similarity between the chosen movie and every movie
    similarity = np.dot(movie_row, data.T) / (magnitude[index] * magnitude)
    sort_indexes = np.argsort(-similarity) # Indices ordered from most to least similar
    return sort_indexes[:top_n]
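
A quick way to sanity-check the ranking is to run the same logic on a handful of 2-D vectors whose angles are obvious. The helper `top_similar` below is a hypothetical, self-contained restatement of the function above that uses `np.linalg.norm` for the row magnitudes; the vectors in `toy` are made up.

```python
import numpy as np

def top_similar(data, movie_id, top_n):
    """Return the 0-based indices of the top_n rows most cosine-similar to row movie_id - 1."""
    index = movie_id - 1                   # movie ids start from 1
    norms = np.linalg.norm(data, axis=1)   # L2 norm of every row
    similarity = data @ data[index] / (norms[index] * norms)
    return np.argsort(-similarity)[:top_n]

# Hypothetical movie vectors: row 2 points almost the same way as row 1,
# row 3 is orthogonal to it and row 4 points the opposite way.
toy = np.array([[1.0, 0.0],
                [2.0, 0.1],
                [0.0, 1.0],
                [-1.0, 0.0]])

print(top_similar(toy, movie_id=1, top_n=3) + 1)  # prints [1 2 3]
```

The queried movie always ranks first because its similarity with itself is 1, which matches the notebook output where the input movie heads its own recommendation list.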

### Select k principal components to represent the movies, choose a movie_id to find recommendations for, and print the top_n results

In [44]:
k = int(input("enter the number of principal components (k): "))
movie_id = int(input("enter the movie id: "))
top_n = int(input("top n movies: "))

sliced = V.T[:, :k] # representative data: each movie as a row of its first k components
indexes = similar(sliced, movie_id, top_n)

print(" ")
print('Recommendations for Movie {0}: \n'.format(
    movieData[movieData.movie_id == movie_id].title.values[0]))
for rec_id in indexes + 1:
    print(movieData[movieData.movie_id == rec_id].title.values[0])

enter the number of principal components (k): 10000
enter the movie id: 23
top n movies: 5
 
Recommendations for Movie Assassins (1995): 

Assassins (1995)
Boat, The (Das Boot) (1981)
Return of the Pink Panther, The (1974)
Braveheart (1995)
Guns of Navarone, The (1961)


#### <font color="red">Conclusions:</font>
* <font color="green">The recommendation system returns a list of N movies similar to a given movie.</font>
* <font color="green">It is built on collaborative filtering.</font>
* <font color="green">It takes the number of features k, a movie id and top N as input, and outputs the top N recommended movies.</font>
* <font color="green">The top N movies are selected with collaborative filtering, using SVD and cosine similarity.</font>