# Machine Learning Case Study - Netflix Movie Recommendation

<img src="https://raw.githubusercontent.com/urvipasad/Machine-Learning-Projects/master/data-original.png" align='center' height="100" width="600"><br/>

## Table of Contents

1. [Problem Statement](#section1)<br>
2. [Importing Packages](#section2)<br>
3. [Loading Data](#section3)
    - 3.1. [Importing Dataset](#section301)<br>
    - 3.2. [Description of the Dataset](#section302)<br><br>
4. [Data visualization and pre-processing](#section4)
    - 4.1 [Data Preprocessing](#section401)<br>
        - 4.1.1. [When Were The Movies Released?](#section40101)<br>
        - 4.1.2. [How Are The Ratings Distributed](#section40102)<br>
        - 4.1.3. [When Have The Movies Been Rated?](#section40103)<br>
        - 4.1.4. [How Are The Number Of Ratings Distributed For The Movies And The Users](#section40104)<br>
        - 4.1.5. [Filter Sparse Movies And Users?](#section40105)<br><br>
5. [Collaborative Filtering Recommendation Model](#section5)
    - 5.1 [Splitting the data](#section501)<br>
    - 5.2 [Model Evaluation](#section502)<br><br>
6. [Model based Collaborative Filtering](#section6)<br><br>
7. [Implementing Singular Value Decomposition](#section7)<br><br>
8. [Conclusion](#section8)<br>

<a id=section1></a>
## 1. Problem Statement

Netflix is all about connecting people to the movies they love. To help customers find those movies, it developed a world-class movie recommendation system, Cinematch. Its job is to predict whether someone will enjoy a movie based on how much they liked or disliked other movies. Netflix uses those predictions to make personal movie recommendations based on each customer's unique tastes. And while Cinematch is doing pretty well, it can always be made better.

This project aims to build a movie recommendation mechanism for Netflix. The dataset used here comes directly from Netflix. It consists of 4 text data files; each file contains over 20M rows, covering over 4K movies and 400K customers. Altogether there are over 17K movies and about 480K customers.

**Objectives:**

> - Predict the rating that a user would give to a movie that they have not yet rated.
> - Minimize the difference between predicted and actual rating (RMSE and MAPE)

**Note:** For this project I use only one data file, and from that file only the ratings from the year 2005, to reduce the dataset size. The full file has about 24 million records, which needs more computing power than my laptop provides.

<a id=section2></a>
## 2. Importing Packages

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from collections import deque
from surprise import Reader, Dataset, SVD

# To create interactive plots
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

# To compute similarities between vectors
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# To use recommender systems
import surprise as sp
# `evaluate` is no longer available in recent scikit-surprise releases; use cross_validate instead
from surprise.model_selection import cross_validate

In [2]:
#pip install scikit-surprise

<a id=section3></a>
## 3. Loading Data

The data comes from the Netflix Prize dataset on Kaggle:

https://www.kaggle.com/netflix-inc/netflix-prize-data


<a id=section301></a>
### 3.1 Importing Dataset

In [3]:
df_raw = pd.read_csv('combined_data_1.txt', header=None, names=['User', 'Rating', 'Date'], usecols=[0, 1, 2], parse_dates=["Date"])
movie_titles = pd.read_csv('movie_titles.csv', 
                           encoding = 'ISO-8859-1', 
                           header = None, 
                           names = ['Id', 'Year', 'Name'])

print('Shape Movie-Titles:\t{}'.format(movie_titles.shape))
movie_titles.sample(5)

Shape Movie-Titles:	(17770, 3)


Unnamed: 0,Id,Year,Name
13504,13505,1997.0,The Odyssey
9907,9908,2005.0,Sometimes in April
14677,14678,1990.0,Inspector Alleyn Mysteries: Set 1
15756,15757,1996.0,Temptress Moon
3280,3281,2002.0,Larryboy: Leggo My Ego


<a id=section302></a>
### 3.2 Description of the Dataset
**Attribute Information:**

The movie rating files contain over 100 million ratings from 480 thousand
randomly-chosen, anonymous Netflix customers over 17 thousand movie titles.  The
data were collected between October, 1998 and December, 2005 and reflect the
distribution of all ratings received during this period.  The ratings are on a
scale from 1 to 5 (integral) stars. To protect customer privacy, each customer
id has been replaced with a randomly-assigned id.  The date of each rating and
the title and year of release for each movie id are also provided.

In [4]:
df_raw.shape

(24058263, 3)

In [5]:
df_raw.index = np.arange(0,len(df_raw))
print('Full dataset shape: {}'.format(df_raw.shape))
print('-Dataset examples-')
print(df_raw.iloc[::5000000, :])

Full dataset shape: (24058263, 3)
-Dataset examples-
             User  Rating       Date
0              1:     NaN        NaT
5000000   2560324     4.0 2005-12-06
10000000  2271935     2.0 2005-04-11
15000000  1921803     2.0 2005-01-31
20000000  1933327     3.0 2004-11-10


In [6]:
p = df_raw.groupby('Rating')['Rating'].agg(['count'])
p

Rating,count
1.0,1118186
2.0,2439073
3.0,6904181
4.0,8085741
5.0,5506583


>- Ratings of 4 and 5 together make up about 57% of all ratings, so users seem to be fairly generous when rating a movie

In [7]:
# get movie count: each movie contributes one header row (e.g. "1:") with no rating
movie_count = df_raw['Rating'].isna().sum()
movie_count

4499

> - 4499 movies in the dataset

In [8]:
df_raw.describe(include = 'all')

Unnamed: 0,User,Rating,Date
count,24058263.0,24053760.0,24053764
unique,475257.0,,2182
top,305344.0,,2005-01-19 00:00:00
freq,4467.0,,180428
first,,,1999-11-11 00:00:00
last,,,2005-12-31 00:00:00
mean,,3.599634,
std,,1.086118,
min,,1.0,
25%,,3.0,


In [9]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24058263 entries, 0 to 24058262
Data columns (total 3 columns):
User      object
Rating    float64
Date      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 1.3+ GB


In [10]:
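# get customer count (subtract the movie-ID header rows, which also sit in the User column)
cust_count = df_raw['User'].nunique() - movie_count
cust_count

470758

> - 470758 unique customers in this data file (475257 distinct values in the `User` column minus the 4499 movie-ID rows)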

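In [11]:
# get rating count (count the Rating column so the movie-ID header rows are excluded)
rating_count = df_raw['Rating'].count()

In [12]:
rating_count

24053764

> - About **24 million ratings** are in this data file; the movie metadata file (`movie_titles.csv`) has **17,770 entries**.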

<a id=section4></a>
## 4. Data visualization and pre-processing
 
<a id=section401></a>
### 4.1 Data Preprocessing
> - The ratings file mixes two kinds of lines: a header line holding a movie ID followed by a colon (for example `1:`), and then one `User,Rating,Date` line for each rating of that movie. We need to extract the movie ID from each header line and add it as a column, so that the final dataset contains Movie ID, User ID, Rating and Date.

In [13]:
# Find empty rows to slice dataframe for each movie
tmp_movies = df_raw[df_raw['Rating'].isna()]['User'].reset_index()
movie_indices = [[index, int(movie[:-1])] for index, movie in tmp_movies.values]

# Shift the movie_indices by one to get start and endpoints of all movies
shifted_movie_indices = deque(movie_indices)
shifted_movie_indices.rotate(-1)


# Gather all dataframes
user_data = []

# Iterate over all movies
for [df_id_1, movie_id], [df_id_2, next_movie_id] in zip(movie_indices, shifted_movie_indices):
    
    # Check if it is the last movie in the file
    if df_id_1<df_id_2:
        tmp_df = df_raw.loc[df_id_1+1:df_id_2-1].copy()
    else:
        tmp_df = df_raw.loc[df_id_1+1:].copy()
        
    # Create movie_id column
    tmp_df['Movie'] = movie_id
    
    # Append dataframe to list
    user_data.append(tmp_df)

# Combine all dataframes
df = pd.concat(user_data)
del user_data, df_raw, tmp_movies, tmp_df, shifted_movie_indices, movie_indices, df_id_1, movie_id, df_id_2, next_movie_id
print('Shape User-Ratings:\t{}'.format(df.shape))
df.sample(5)

Shape User-Ratings:	(24053764, 4)


Unnamed: 0,User,Rating,Date,Movie
1332065,1134561,5.0,2005-10-12,290
19512465,78819,2.0,2003-10-13,3715
13368631,2612545,4.0,2005-07-08,2554
17440448,2439493,1.0,2004-10-09,3348
17204180,766895,5.0,2001-08-03,3315
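
The same reshaping can also be sketched without an explicit Python loop, by forward-filling the movie ID from each header row. This is a minimal alternative on toy data, assuming the same `User`, `Rating`, `Date` layout as `df_raw`; the frame `raw` and the helper name `attach_movie_ids` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def attach_movie_ids(raw):
    """Turn 'movie header + rating rows' into one tidy frame (hypothetical helper)."""
    # Header rows look like "1:" and carry no rating
    is_header = raw['Rating'].isna()
    # Parse the movie ID on header rows, leave NaN elsewhere, then carry it forward
    movie = pd.to_numeric(raw['User'].where(is_header).str.rstrip(':')).ffill()
    out = raw.loc[~is_header].copy()
    out['Movie'] = movie[~is_header].astype(int)
    return out

# Toy data in the same shape as the raw ratings file (values are illustrative)
raw = pd.DataFrame({
    'User':   ['1:', '1488844', '822109', '2:', '2059652'],
    'Rating': [np.nan, 3.0, 5.0, np.nan, 4.0],
    'Date':   pd.to_datetime([None, '2005-09-06', '2005-05-13', None, '2005-09-05']),
})

tidy = attach_movie_ids(raw)
print(tidy)
```

On the full file this avoids building one small dataframe per movie, at the cost of holding one extra column in memory while the IDs are filled in.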


<a id=section40101></a>
### 4.1.1 When Were The Movies Released?

In [14]:
data = movie_titles['Year'].value_counts().sort_index()

# Create trace
trace = go.Scatter(x = data.index,
                   y = data.values,
                   marker = dict(color = '#db0000'))
# Create layout
layout = dict(title = '{} Movies Grouped By Year Of Release'.format(movie_titles.shape[0]),
              xaxis = dict(title = 'Release Year'),
              yaxis = dict(title = 'Movies'))

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

Many movies on Netflix were released in this millennium. Whether Netflix prefers recent movies or there are simply no older movies left cannot be deduced from this plot.<br>
The decline for the rightmost point is probably caused by an **incomplete last year.**

<a id=section40102></a>
### 4.1.2 How Are The Ratings Distributed?

In [15]:
# Get data
data = df['Rating'].value_counts().sort_index(ascending=False)

# Create trace
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / df.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = 'whitesmoke'),
               y = data.values,
               marker = dict(color = 'purple'))
# Create layout
layout = dict(title = 'Distribution Of {} Netflix-Ratings'.format(df.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

Netflix movies rarely receive a rating lower than three. **Most ratings are three or four stars.**<br>
The distribution is probably biased, since people who like the movies tend to stay customers while others presumably leave the platform.

<a id=section40103></a>
### 4.1.3 When Have The Movies Been Rated?

In [16]:
# Get data
data = df['Date'].value_counts()
data.index = pd.to_datetime(data.index)
data.sort_index(inplace=True)

# Create trace
trace = go.Scatter(x = data.index,
                   y = data.values,
                   marker = dict(color = '#db0000'))
# Create layout
layout = dict(title = '{} Movie-Ratings Grouped By Day'.format(df.shape[0]),
              xaxis = dict(title = 'Date'),
              yaxis = dict(title = 'Ratings'))

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

Starting in November 2005, a strange decline in ratings can be observed. Furthermore, there are two unusual peaks in January and April 2005.

<a id=section40104></a>
### 4.1.4 How Are The Number Of Ratings Distributed For The Movies And The Users

In [68]:
##### Ratings Per Movie #####
# Get data
data = df.groupby('Movie')['Rating'].count()

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 20000,
                                  size = 100),
                     marker = dict(color = '#db0000'))
# Create layout
layout = go.Layout(title = 'Distribution Of Ratings Per Movie ',
                   xaxis = dict(title = 'Ratings Per Movie'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)



##### Ratings Per User #####
# Get data
data = df.groupby('User')['Rating'].count().clip(upper=199)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 200,
                                  size = 2),
                     marker = dict(color = '#db0000'))
# Create layout
layout = go.Layout(title = 'Distribution Of Ratings Per User (Clipped at 199)',
                   xaxis = dict(title = 'Ratings Per User'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

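Only very few movies and users have many ratings. Both distributions are strongly right-skewed: roughly 200 movies in this file collect the highest numbers of ratings, while most movies receive comparatively few. The same holds for users, where a small group of heavy raters accounts for a large share of the ratings and most users rate only a handful of movies.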

<a id=section40105></a>
### 4.1.5 Filter Sparse Movies And Users

> To reduce the dimensionality of the dataset, I filter out rarely rated movies and users who rarely rate. Since the dataset is very large, I first restrict it to ratings from the year 2005.

In [69]:
## Filter the dataset to focus on only year 2005 data
df['Year'] = df['Date'].dt.year
df = df[df.Year.isin([2005])]

In [19]:
df.shape

(12562227, 5)

In [20]:
# Filter sparse movies
min_movie_ratings = 20000
filter_movies = (df['Movie'].value_counts()>min_movie_ratings)
filter_movies = filter_movies[filter_movies].index.tolist()

# Filter sparse users
min_user_ratings = 300
filter_users = (df['User'].value_counts()>min_user_ratings)
filter_users = filter_users[filter_users].index.tolist()

# Actual filtering
df_filterd = df[(df['Movie'].isin(filter_movies)) & (df['User'].isin(filter_users))]
del filter_movies, filter_users, min_movie_ratings, min_user_ratings
print('Shape User-Ratings unfiltered:\t{}'.format(df.shape))
print('Shape User-Ratings filtered:\t{}'.format(df_filterd.shape))

Shape User-Ratings unfiltered:	(12562227, 5)
Shape User-Ratings filtered:	(141171, 5)


In [21]:
df_filterd.shape

(141171, 5)

In [22]:
df_filterd.Movie.value_counts()

4306    1264
607     1249
1905    1248
798     1239
2862    1227
2782    1202
2372    1201
2452    1186
3938    1182
2095    1182
3860    1179
3333    1173
1406    1164
571     1163
3962    1160
658     1159
4043    1157
4262    1152
2462    1146
3106    1143
1066    1132
3605    1114
2470    1109
1798    1106
1754    1106
3427    1105
1962    1105
2430    1096
3925    1091
1202    1089
        ... 
2800     793
406      791
2209     787
2200     770
2342     748
4315     745
2465     739
1719     738
357      731
3730     724
2874     723
334      703
2178     693
4345     692
1466     657
241      656
1073     642
273      620
2743     619
3197     610
528      584
3371     580
361      560
3239     560
482      527
3015     515
2457     496
3966     457
2809     401
3875     268
Name: Movie, Length: 151, dtype: int64

In [23]:
df_Final = df_filterd.sample(frac=0.2) # take only 20% of the dataset for final processing to reduce the computation time

>- The filtered dataset is still large (141171 rows), so I use only a fraction (20%) of it for building the recommendation engine.

In [24]:
df_Final.shape

(28234, 5)

In [25]:
df_Final.User.value_counts()

2349412    41
1099833    40
2039201    37
692235     36
1878120    36
1119705    35
196113     34
2000256    34
893014     34
100495     34
1658790    33
2446246    33
367727     33
2402136    33
267841     33
2519310    33
2017088    33
917571     32
1591103    32
1353938    32
427967     32
476512     32
1022452    32
166041     32
2455340    31
2634182    31
200805     31
1697790    31
1964597    31
1819462    31
           ..
1932594     5
1866323     5
769702      5
1189269     5
2056022     5
1276353     5
835265      5
1114324     5
1314869     5
237263      4
2035299     4
2198129     4
1461435     4
2232041     4
2207031     4
491531      4
1928290     4
168523      4
1984315     3
507094      3
1001129     3
768876      3
122197      2
1911845     2
1792741     2
2439493     2
1257454     1
305344      1
2627190     1
638020      1
Name: User, Length: 1505, dtype: int64

Even after restricting to 2005, the data has about 12.5 million records. Computing a correlation matrix for a recommendation system on that many ratings is not feasible on a laptop CPU and would need considerably more memory and compute. Hence I keep only the frequently rated movies and the most active users, and then take a 20% sample of the filtered data (28,234 ratings) for designing the recommendation engine.

#### Display 20 movies with highest ratings

In [26]:
df_Final.sort_values('Rating', ascending=False).head(20)

Unnamed: 0,User,Rating,Date,Movie,Year
14832704,1087451,5.0,2005-07-31,2862,2005
23555946,1813272,5.0,2005-04-20,4402,2005
1527765,751829,5.0,2005-03-22,312,2005
22981569,1734505,5.0,2005-08-11,4330,2005
792149,1860332,5.0,2005-01-30,191,2005
6283201,1740380,5.0,2005-06-28,1220,2005
20208492,1128164,5.0,2005-01-02,3860,2005
18478492,1298147,5.0,2005-04-06,3538,2005
12841224,467182,5.0,2005-01-20,2452,2005
18722838,727175,5.0,2005-07-19,3605,2005


<a id=section5></a>
## 5. Collaborative Filtering Recommendation Model

### Implementation
We will use the `df_Final` dataframe (the 20% sample of `df_filterd`), as it contains User IDs, Movie IDs and Ratings. These three elements are all we need for determining the similarity of users based on their ratings of particular movies.

First, let's do some quick data processing:

##### **Check if there any null values**

In [27]:
df_Final.isna().sum()

User      0
Rating    0
Date      0
Movie     0
Year      0
dtype: int64

In [28]:
df_Final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28234 entries, 4899402 to 16316267
Data columns (total 5 columns):
User      28234 non-null object
Rating    28234 non-null float64
Date      28234 non-null datetime64[ns]
Movie     28234 non-null int64
Year      28234 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 1.3+ MB


In [29]:
# Change the datatype for User column
df_Final['User'] = df_Final['User'].astype(int)

**Drop the Date and Year columns from our Dataframe**

In [30]:
df_Final.drop(['Date','Year'], axis = 'columns', inplace = True)

In [31]:
df_Final.head(5)

Unnamed: 0,User,Rating,Movie
4899402,1390057,4.0,985
14840009,105802,4.0,2862
4200116,1710658,4.0,798
16261971,2608542,1.0,3151
3090700,134640,3.0,571


<a id=section501></a>
### 5.1 Splitting the data

In [32]:
from sklearn import model_selection as cv
train_data, test_data = cv.train_test_split(df_Final, test_size=0.3)

In [33]:
# Extract the (User, Movie, Rating) triples as NumPy arrays, one for training and another for testing
train_data_matrix = train_data[['User', 'Movie', 'Rating']].values
test_data_matrix = test_data[['User', 'Movie', 'Rating']].values

# Check their shape
print(train_data_matrix.shape)
print(test_data_matrix.shape)

(19763, 3)
(8471, 3)
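
Note that the arrays above hold one row per rating, not one row per user. A user-item matrix, which is what similarity-based collaborative filtering normally operates on, could be built from such triples roughly as follows. This is a minimal sketch on toy data; the frame `ratings`, the fixed hold-out split and the helper name `to_user_item` are assumptions made for illustration and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

def to_user_item(triples, users, movies):
    """Pivot (User, Movie, Rating) rows into a dense users x movies array; 0 = not rated."""
    mat = triples.pivot_table(index='User', columns='Movie', values='Rating')
    # Reindex so train and test matrices share the same rows and columns
    return mat.reindex(index=users, columns=movies).fillna(0).to_numpy()

# Toy ratings (values are illustrative)
ratings = pd.DataFrame({
    'User':   [10, 10, 20, 20, 30, 30],
    'Movie':  [1, 2, 1, 3, 2, 3],
    'Rating': [5.0, 3.0, 4.0, 2.0, 1.0, 5.0],
})
users = sorted(ratings['User'].unique())
movies = sorted(ratings['Movie'].unique())

# Hold out every third rating for testing (a fixed split keeps the toy example deterministic)
test = ratings.iloc[::3]
train = ratings.drop(test.index)

train_matrix = to_user_item(train, users, movies)
test_matrix = to_user_item(test, users, movies)
print(train_matrix.shape, test_matrix.shape)
```

Because both matrices share the same shape and ordering, a prediction at position `[u, i]` can be compared directly with the held-out rating at the same position.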


Now, we use the **pairwise_distances** function from sklearn with `metric='correlation'`. It returns the correlation distance, which is 1 minus the [Pearson Correlation Coefficient](https://stackoverflow.com/questions/1838806/euclidean-distance-vs-pearson-correlation-vs-cosine-similarity), so subtracting the result from 1 gives a similarity score between -1 and 1. Undefined correlations (NaN, which occur for constant rows) are set to 0.

In [34]:
from sklearn.metrics.pairwise import pairwise_distances



### User Similarity Matrix

In [35]:
# User Similarity Matrix
user_correlation =  1-pairwise_distances(train_data, metric='correlation')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation[:4, :4])

### Item Similarity Matrix

In [37]:
# Item Similarity Matrix
item_correlation = 1 - pairwise_distances(train_data_matrix.T, metric='correlation')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation[:4, :4])

[[1.00000000e+00 3.49308019e-03 7.98423436e-03]
 [3.49308019e-03 1.00000000e+00 8.90631599e-04]
 [7.98423436e-03 8.90631599e-04 1.00000000e+00]]
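
To make the relationship between correlation distance and Pearson similarity concrete, here is a small sketch on a toy user-item matrix. The matrix `R` and its values are made up for illustration; the only library call used is `pairwise_distances`, as in the cells above.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Toy user-item matrix: 3 users x 4 movies (values are illustrative)
R = np.array([
    [5.0, 4.0, 1.0, 2.0],
    [4.0, 5.0, 2.0, 1.0],
    [1.0, 2.0, 5.0, 4.0],
])

# Comparing rows gives a user-user similarity matrix (3 x 3)
user_sim = 1 - pairwise_distances(R, metric='correlation')
# Comparing the transposed matrix gives an item-item similarity matrix (4 x 4)
item_sim = 1 - pairwise_distances(R.T, metric='correlation')

# Cross-check one entry against NumPy's Pearson correlation
assert np.isclose(user_sim[0, 1], np.corrcoef(R[0], R[1])[0, 1])
print(user_sim.round(2))
```

With a users x movies input, the user similarity matrix has one row per user and the item similarity matrix one row per movie, which is the shape the `predict` function expects.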


In [38]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        print("Mean user rating")
        # Use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
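
Written out, the two branches of `predict` compute the following. Here $r_{u,i}$ is the entry of `ratings` for user $u$ and item $i$, $\bar{r}_u$ is the mean of row $u$, and $s$ is the similarity matrix passed in. The symbols are notation introduced here for illustration and are not used elsewhere in the notebook.

```latex
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,\bigl(r_{v,i} - \bar{r}_v\bigr)}{\sum_{v} \lvert s_{u,v} \rvert}
\qquad \text{(user-based)}

\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} \lvert s_{i,j} \rvert}
\qquad \text{(item-based)}
```

Both formulas assume that `ratings` is a users x items matrix and that `similarity` is square with a matching dimension (users x users for the first, items x items for the second).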

<a id=section502></a>
### 5.2 Model Evaluation

In [39]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Function to calculate RMSE
def rmse(pred, actual):
    # Only compare entries where the actual rating is nonzero (i.e. a rating exists).
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return sqrt(mean_squared_error(pred, actual))
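
The objectives mention both RMSE and MAPE. As a small illustration of how the two metrics behave when only rated entries are compared, the sketch below evaluates them on toy matrices. The function names `masked_rmse` and `masked_mape` and all values are assumptions for this example; zeros are taken to mean "not rated", as in the `rmse` function above.

```python
import numpy as np

def masked_rmse(pred, actual):
    """Root mean squared error over the entries where `actual` holds a rating (nonzero)."""
    mask = actual != 0
    return float(np.sqrt(np.mean((pred[mask] - actual[mask]) ** 2)))

def masked_mape(pred, actual):
    """Mean absolute percentage error over rated entries, in percent."""
    mask = actual != 0
    return float(np.mean(np.abs(pred[mask] - actual[mask]) / actual[mask]) * 100)

# Toy matrices: 2 users x 3 movies, 0 = not rated (values are illustrative)
actual = np.array([[4.0, 0.0, 2.0],
                   [0.0, 5.0, 0.0]])
pred = np.array([[3.0, 2.5, 2.0],
                 [1.0, 4.0, 3.0]])

print(masked_rmse(pred, actual), masked_mape(pred, actual))
```

On a 1 to 5 rating scale, a meaningful RMSE has to lie between 0 and 4, which is a quick sanity check for any value reported by an evaluation cell.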

In [40]:
train_data_matrix[:5]

array([[1.483838e+06, 1.561000e+03, 4.000000e+00],
       [9.233250e+05, 3.385000e+03, 4.000000e+00],
       [1.190964e+06, 3.917000e+03, 3.000000e+00],
       [1.391070e+05, 2.922000e+03, 4.000000e+00],
       [9.369280e+05, 1.509000e+03, 5.000000e+00]])

In [41]:
# Predict ratings on the training data with both similarity score
user_prediction = predict(train_data_matrix, user_correlation, type='user')
item_prediction = predict(train_data_matrix, item_correlation, type='item')

# RMSE on the test data
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

Mean user rating
User-based CF RMSE: 504731.73170812946
Item-based CF RMSE: 613974.763759239


In [42]:
# RMSE on the train data
print('User-based CF RMSE: ' + str(rmse(user_prediction, train_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, train_data_matrix)))

User-based CF RMSE: 356666.2227463187
Item-based CF RMSE: 12737.153585647657


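> - We can observe that both item similarity and user similarity give a very high RMSE. Values in the hundreds of thousands are far outside the 1 to 5 rating scale, so they are not a real measure of model quality: `train_data_matrix` and `test_data_matrix` hold raw (User, Movie, Rating) triples rather than user-item matrices, so user IDs (values in the millions) enter the similarity and error computations. In addition, we do not have a rating from every user for every movie, and with such sparse data the results of memory-based collaborative filtering will not be accurate. Hence let's try model-based collaborative filtering.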

<a id=section6></a>
## 6. Model based Collaborative Filtering

In [43]:
n_users = df_Final.User.unique().shape[0]
n_movies = df_Final.Movie.unique().shape[0]
print ('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies))

Number of users = 1505 | Number of movies = 151


In [44]:
Ratings = df_Final.pivot(index = 'User', columns ='Movie', values = 'Rating').fillna(0)
Ratings.head()

User/Movie,30,175,191,197,241,273,290,299,312,313,...,4266,4302,4306,4315,4330,4345,4356,4402,4432,4472
2213,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0
2787,0.0,0.0,4.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3321,0.0,0.0,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
3604,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
5530,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,4.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0


In [45]:
from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'auto')
model_knn.fit(Ratings)

NearestNeighbors(metric='cosine')

In [46]:
Ratings.index[0]

2213

In [47]:
query_index = np.random.choice(Ratings.shape[0])
distances, indices = model_knn.kneighbors(Ratings.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 5)

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendation for {0}:\n'.format(Ratings.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, Ratings.index[indices.flatten()[i]],distances.flatten()[i]))

Recommendation for 216059:

1: 2119629, with distance of 0.588623742474115:
2: 1746378, with distance of 0.6026058909318384:
3: 530000, with distance of 0.617272400768263:
4: 362038, with distance of 0.6203151983649049:


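> - The KNN model returns, for a randomly chosen user, the users with the most similar rating vectors (smallest cosine distance). Despite the "Recommendation" label in the printout, these are neighboring users, not movies; movies rated highly by these neighbors can then be recommended to the query user. Because this approach works on a proper user-item matrix, its output is more meaningful than the memory-based attempt in section 5, although no error metric is computed for it here.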
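
One possible way to turn nearest neighbors into movie suggestions is to average the neighbors' ratings for movies the query user has not rated. The sketch below does this on a toy matrix. The frame `toy`, the helper name `recommend_from_neighbors` and the averaging rule are assumptions chosen for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy user-item matrix (rows: users, columns: movies, 0 = not rated)
toy = pd.DataFrame(
    [[5, 4, 0, 0],
     [5, 5, 4, 0],
     [4, 5, 5, 1],
     [0, 1, 0, 5]],
    index=[101, 102, 103, 104], columns=[1, 2, 3, 4], dtype=float)

knn = NearestNeighbors(metric='cosine', algorithm='brute').fit(toy.values)

def recommend_from_neighbors(user, k=2):
    """Average the k nearest users' ratings for movies `user` has not rated (hypothetical helper)."""
    row = toy.loc[user].values.reshape(1, -1)
    # Ask for k + 1 neighbors because the closest match is the user itself
    _, idx = knn.kneighbors(row, n_neighbors=k + 1)
    neighbors = toy.iloc[idx.flatten()[1:]]
    # Mean over the neighbors who actually rated the movie (zeros are treated as missing)
    scores = neighbors.replace(0, np.nan).mean()
    unseen = toy.loc[user] == 0
    return scores[unseen].dropna().sort_values(ascending=False)

recs = recommend_from_neighbors(101)
print(recs)
```

A weighted average using the cosine distances would be a natural refinement; the plain mean is used here only to keep the example short.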

In [48]:
sparsity = round(1.0 - len(df_Final) / float(n_users * n_movies), 3)
print('The sparsity level of Netflix dataset is ' +  str(sparsity * 100) + '%')

The sparsity level of Netflix dataset is 87.6%


<a id=section7></a>
## 7. Implementing Singular Value Decomposition

#### Using Ratings Data

In [49]:
n_users = df_Final.User.unique().shape[0]
n_movies = df_Final.Movie.unique().shape[0]
print ('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies))

Number of users = 1505 | Number of movies = 151


In [50]:
df_Final.shape

(28234, 3)

In [51]:
Ratings = df_Final.pivot(index = 'User', columns ='Movie', values = 'Rating').fillna(0)
Ratings.head()

User/Movie,30,175,191,197,241,273,290,299,312,313,...,4266,4302,4306,4315,4330,4345,4356,4402,4432,4472
2213,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0
2787,0.0,0.0,4.0,1.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3321,0.0,0.0,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
3604,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
5530,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,4.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0


In [52]:
Ratings.shape

(1505, 151)

We need to mean-center the data (subtract each user's mean rating) and convert it from a dataframe to a NumPy array.

In [53]:
R = Ratings.values
user_ratings_mean = np.mean(R, axis = 1)
Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)

In [54]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(Ratings_demeaned, k = 50)

In [55]:
sigma = np.diag(sigma)

In [56]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
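
# The mean-centering, truncated SVD and reconstruction steps above can be tried end to end
# on a small matrix. This is a minimal sketch: the matrix `R_toy`, the choice k = 2 and all
# values are illustrative assumptions. As in the cells above, the user mean is taken over
# all entries, including the zeros that stand for missing ratings.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy ratings matrix: 4 users x 5 movies, 0 = not rated (illustrative values)
R_toy = np.array([[5, 4, 0, 1, 0],
                  [4, 5, 1, 0, 0],
                  [0, 1, 5, 4, 4],
                  [1, 0, 4, 5, 5]], dtype=float)

# Subtract each user's mean rating
toy_mean = R_toy.mean(axis=1, keepdims=True)
R_toy_demeaned = R_toy - toy_mean

# Keep k latent factors; svds requires 1 <= k < min(R_toy.shape)
k = 2
U_toy, sigma_toy, Vt_toy = svds(R_toy_demeaned, k=k)

# Rank-k approximation, then add each user's mean back
R_toy_hat = U_toy @ np.diag(sigma_toy) @ Vt_toy + toy_mean
print(np.round(R_toy_hat, 2))
```

# Entries of `R_toy_hat` at positions where `R_toy` is 0 act as the predicted ratings for
# unseen movies. A larger k reproduces the observed matrix more closely but generalizes less.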

In [57]:
preds = pd.DataFrame(all_user_predicted_ratings, columns = Ratings.columns)
preds.head(10)

Movie,30,175,191,197,241,273,290,299,312,313,...,4266,4302,4306,4315,4330,4345,4356,4402,4432,4472
0,-0.008571,-0.555891,-0.838093,0.261516,1.959277,1.18963,0.365101,1.428813,0.490918,0.021691,...,1.788038,-0.170917,0.81345,0.11304,2.623923,0.18743,-0.397878,0.009542,-0.087347,-0.158775
1,0.769026,0.743563,3.463424,0.762272,-0.315287,0.335536,1.513265,0.248179,-0.129556,-0.131691,...,0.689324,0.102427,0.257954,-0.039828,-1.30904,0.330175,-0.165248,-0.204929,-0.17164,0.326057
2,1.315679,-0.363394,3.358475,-0.108153,-0.016885,0.594657,1.333019,0.989308,0.71852,0.849589,...,1.434426,1.16602,0.733858,0.502938,-0.75349,0.110921,0.451368,-0.027492,-1.060591,0.714683
3,0.404544,0.384412,-0.192209,0.726362,0.553045,-0.186497,0.255019,1.413969,1.066354,0.302598,...,-0.325301,0.397043,0.296945,0.484429,-0.192992,0.340746,0.386433,-0.654012,2.306228,0.49877
4,-0.124284,1.09326,3.372448,-1.247315,0.131133,0.562584,0.331978,-0.480132,2.349051,-0.325537,...,0.608733,2.310402,-0.041378,-0.204884,0.139795,0.353928,0.065645,1.994591,-0.014328,0.550427
5,1.698804,0.789926,0.369222,0.426622,0.304227,0.832006,2.100265,0.351429,-0.713347,-0.141921,...,0.442772,0.546278,-0.012677,0.103993,-0.016332,0.419524,0.215867,0.470323,-0.026349,0.505831
6,0.327181,-0.177556,-0.076259,-0.079557,0.106601,0.606226,0.362304,0.044029,0.420677,-0.727455,...,0.677307,0.717317,0.566644,-0.488427,0.256557,-0.093825,-0.501007,0.44926,-0.380306,-0.166702
7,0.933705,0.101995,-0.85238,1.169796,-0.123803,-0.022942,0.381609,0.469635,0.324103,0.811874,...,0.0071,0.282351,0.13717,0.374868,-0.725129,0.578492,0.435573,3.080806,1.418713,0.927247
8,0.325939,0.017224,4.023516,0.947962,-0.100974,0.673367,0.298717,0.14851,0.598416,0.149874,...,0.077764,2.361359,-0.017836,0.347007,-0.404625,-0.017948,-0.932376,-0.235029,0.362483,0.022763
9,0.734806,0.496694,0.209747,0.450544,-0.114254,0.339874,-0.04617,-0.106186,-0.584901,0.112646,...,-0.222436,1.754589,0.091079,1.28152,0.157573,0.00559,0.12095,1.050201,-1.155294,-0.551663


In [58]:
Ratings.reset_index(inplace = True)

In [59]:
preds.shape

(1505, 151)

In [60]:
movie_titles.head()

Unnamed: 0,Id,Year,Name
0,1,2003.0,Dinosaur Planet
1,2,2004.0,Isle of Man TT 2004 Review
2,3,1997.0,Character
3,4,1994.0,Paula Abdul's Get Up & Dance
4,5,2004.0,The Rise and Fall of ECW


In [61]:
df_Final.head(2)

Unnamed: 0,User,Rating,Movie
4899402,1390057,4.0,985
14840009,105802,4.0,2862


In [62]:
def recommend_movies(predictions, userID, movies, original_ratings, num_recommendations, Ratings):
    
    # Row position of this user in Ratings (and therefore in the predictions frame)
    user_row_number = Ratings[Ratings.User == userID].index[0]

    # Get and sort the user's predictions
    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False)
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings[original_ratings.User == (userID)]
    user_full = (user_data.merge(movies, how = 'left', left_on = 'Movie', right_on = 'Id').
                     sort_values(['Rating'], ascending=False)
                 )

    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    # The final iloc keeps the top rows and drops the last column (the predicted rating).
    recommendations = (movies[~movies['Id'].isin(user_full['Movie'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'Id',
               right_on = 'Movie').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

In [63]:
already_rated, predictions = recommend_movies(preds, 2213, movie_titles, df_Final, 20, Ratings)

User 2213 has already rated 22 movies.
Recommending highest 20 predicted ratings movies not already rated.


In [64]:
Ratings[Ratings.User==2213].index[0]

0

In [65]:
predictions

Unnamed: 0,Id,Year,Name,Movie
331,334,2005.0,The Pacifier,334.0
1169,1174,1993.0,The Sandlot,1174.0
3618,3638,2003.0,Bad Boys II,3638.0
4241,4262,1999.0,Sleepy Hollow,4262.0
847,851,1990.0,Back to the Future Part III,851.0
297,299,2001.0,Bridget Jones's Diary,299.0
453,457,2004.0,Kill Bill: Vol. 2,457.0
4206,4227,1997.0,The Full Monty,4227.0
4235,4256,1984.0,Footloose: Special Collector's Edition,4256.0
358,361,2004.0,The Phantom of the Opera: Special Edition,361.0

