## Assignment – Movie Survey

-by Qi Sun


I built this survey in Qualtrics using a 5-point Likert scale, where 5 means most favorite and 1 means least favorite. The survey also includes a 'Not Sure' option.


<img src="https://raw.githubusercontent.com/susanqisun/DAV6300/main/Screen%20Shot%202021-02-02%20at%201.47.20%20PM.png" width="500">




**Survey link:**

https://yeshiva.co1.qualtrics.com/jfe/form/SV_dd4QEIhvROaTu8S


There are six movies in the survey:

1. The Little Things,  2h 7min | Crime, Drama, Thriller | 29 January 2021 (USA)


2. The White Tiger,  2h 5min | Crime, Drama | 22 January 2021 (USA)


3. The Dig, 1h 52min | Biography, Drama, History | 29 January 2021 (USA)


4. Soul, 1h 40min | Animation, Adventure, Comedy | 25 December 2020 (USA)


5. Wonder Woman 1984,  2h 31min | Action, Adventure, Fantasy | 25 December 2020 (USA)


6. Promising Young Woman, 1h 53min | Crime, Drama, Thriller | 25 December 2020 (USA)


In [46]:
import pandas as pd
import numpy as np

# survey results
df = pd.read_csv('https://raw.githubusercontent.com/susanqisun/DAV6300/main/movie%20recommender.csv')
df


Unnamed: 0,UserID,The Little Things,The White Tiger,The Dig,Soul,Wonder Woman 1984,Promising Young Woman
0,1,3,4,5,Not Sure,1,3.0
1,2,Not Sure,Not Sure,4,5,5,4.0
2,3,5,4,5,5,3,4.0
3,4,Not Sure,Not Sure,Not Sure,3,4,5.0
4,5,5,5,4,3,4,2.0
5,6,3,2,,3,4,5.0
6,7,,1,,,4,



### 1. Data exploration


In [47]:
#Identify the Data Types
df_info = pd.DataFrame(df.dtypes,columns=['Dtype'])

#Identify the unique values
df_info['Nunique'] = df.nunique()

#check missing values for each column
df_info['MissingValues']=df.isnull().sum()

# Identify the count for each variable
df_info['Count']=df.count()

df_info

Unnamed: 0,Dtype,Nunique,MissingValues,Count
UserID,int64,7,0,7
The Little Things,object,3,1,6
The White Tiger,object,5,0,7
The Dig,object,3,2,5
Soul,object,3,1,6
Wonder Woman 1984,int64,4,0,7
Promising Young Woman,float64,4,1,6


**Findings:**
> A total of 7 people completed this survey. Four movies have missing ratings. There are also some 'Not Sure' responses. 

> People rated these movies from 1 to 5, where 1 means least favorite. The missing values and 'Not Sure' responses indicate that not everyone has watched every listed movie. There are several ways to handle them. For example, we could replace the missing and 'Not Sure' values with zero to indicate that the person didn't watch the movie, or we could use linear regression to predict the missing rating.

**How to handle missing values?**
> Here, I'll replace the 'Not Sure' values with `NaN` (missing value) and use `Singular Value Decomposition (SVD)` from the Surprise library to estimate the missing ratings. SVD is trained to minimise the prediction error on the known ratings (commonly measured by RMSE, Root Mean Square Error), so its estimates can be used to make recommendations based on users' ratings. SVD doesn't care what the movie is: it works purely on the assigned movie IDs and predicts a rating from how this user rated other movies and how other users rated this movie.
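
For reference, the model behind Surprise's `SVD` can be written as below. The notation follows the Surprise documentation rather than anything defined in this notebook: $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are the latent factor vectors, and $\lambda$ is the regularization strength.

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u$$

$$\min_{b,\,p,\,q} \sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda\left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)$$

The first line is the predicted rating for user $u$ and movie $i$; the second is the regularized squared error that training minimizes over the known ratings.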


### 2. Replace 'Not Sure' values with 'NaN'


In [51]:
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
df02 = df.replace('Not Sure', np.nan)
df02

Unnamed: 0,UserID,The Little Things,The White Tiger,The Dig,Soul,Wonder Woman 1984,Promising Young Woman
0,1,3.0,4.0,5.0,,1,3.0
1,2,,,4.0,5.0,5,4.0
2,3,5.0,4.0,5.0,5.0,3,4.0
3,4,,,,3.0,4,5.0
4,5,5.0,5.0,4.0,3.0,4,2.0
5,6,3.0,2.0,,3.0,4,5.0
6,7,,1.0,,,4,
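
After this replacement, the rating columns that contained 'Not Sure' may still have the `object` dtype. A minimal sketch of an alternative, assuming a small stand-in table named `raw` (hypothetical, with values copied from the first rows above), converts the text and the ratings in one step with `pd.to_numeric`:

```python
import pandas as pd

# Stand-in for the survey table: ratings read as text, plus 'Not Sure' answers.
raw = pd.DataFrame({
    "UserID": [1, 2, 3],
    "The Little Things": ["3", "Not Sure", "5"],
    "Soul": ["Not Sure", "5", "5"],
    "Wonder Woman 1984": [1, 5, 3],
})

movie_cols = [c for c in raw.columns if c != "UserID"]

# errors="coerce" turns anything that is not a number (such as 'Not Sure')
# into NaN and gives every rating column a numeric dtype.
numeric = raw.copy()
numeric[movie_cols] = numeric[movie_cols].apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric)
```

With numeric columns, later steps such as `mean()` or `nlargest()` can skip the missing ratings without extra type conversions.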


### 3. Movies with top 3 ratings for each user

I'll find the three highest-rated movies for each user.

In [55]:
# https://stackoverflow.com/questions/28609667/pandas-find-column-name-and-value-with-max-and-second-max-value-for-each-row

df03 = df02.copy()

def top(x):
    x.set_index('UserID', inplace=True)
    df03 = pd.DataFrame({'1st Max':[],'Max1Value':[],'2nd Max':[],'Max2Value':[],'3rd Max':[],'Max3Value':[]})
    df03.index.name='User'
    df03.loc[x.index.values[0],['1st Max', '2nd Max','3rd Max']] = x.sum().nlargest(3).index.tolist()
    df03.loc[x.index.values[0],['Max1Value', 'Max2Value','Max3Value']] = x.sum().nlargest(3).values
    return df03

df_top = df03.groupby('UserID').apply(top).reset_index(level=1, drop=True).reset_index()
df_top



Unnamed: 0,UserID,1st Max,Max1Value,2nd Max,Max2Value,3rd Max,Max3Value
0,1,The Dig,5.0,The White Tiger,4.0,The Little Things,3.0
1,2,Soul,5.0,Wonder Woman 1984,5.0,The Dig,4.0
2,3,The Little Things,5.0,The Dig,5.0,Soul,5.0
3,4,Promising Young Woman,5.0,Wonder Woman 1984,4.0,Soul,3.0
4,5,The Little Things,5.0,The White Tiger,5.0,The Dig,4.0
5,6,Promising Young Woman,5.0,Wonder Woman 1984,4.0,The Little Things,3.0
6,7,Wonder Woman 1984,4.0,The White Tiger,1.0,The Little Things,0.0
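
Note that user 7 rated only two movies, so the third entry in the last row (The Little Things, 0.0) comes from summing a missing value rather than from a real rating. A minimal sketch that skips unrated movies instead, assuming a numeric stand-in table `ratings` and a hypothetical helper `top_n` (neither is part of the notebook):

```python
import numpy as np
import pandas as pd

# Stand-in with numeric ratings for users 1, 2 and 7; NaN marks an unrated movie.
ratings = pd.DataFrame(
    {
        "The Little Things": [3, np.nan, np.nan],
        "The White Tiger": [4, np.nan, 1],
        "The Dig": [5, 4, np.nan],
        "Soul": [np.nan, 5, np.nan],
        "Wonder Woman 1984": [1, 5, 4],
        "Promising Young Woman": [3, 4, np.nan],
    },
    index=pd.Index([1, 2, 7], name="UserID"),
)

def top_n(row, n=3):
    """Return up to n (movie, rating) pairs for one user, skipping unrated movies."""
    best = row.dropna().nlargest(n)
    return list(zip(best.index, best.values))

top3 = {user: top_n(row) for user, row in ratings.iterrows()}
for user, picks in top3.items():
    print(user, picks)
```

In this sketch a user with fewer than three ratings simply gets a shorter list, so no placeholder value appears in the result.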


### 4. Average rating of each movie

Code reference: https://github.com/wwwbbb8510/baseline-rs/blob/master/item-based-collaborative-filtering.ipynb


In [100]:
df04 = df03.copy()
df04.index = np.arange(1, len(df03) + 1)
df04

Unnamed: 0,UserID,The Little Things,The White Tiger,The Dig,Soul,Wonder Woman 1984,Promising Young Woman
1,1,3.0,4.0,5.0,,1,3.0
2,2,,,4.0,5.0,5,4.0
3,3,5.0,4.0,5.0,5.0,3,4.0
4,4,,,,3.0,4,5.0
5,5,5.0,5.0,4.0,3.0,4,2.0
6,6,3.0,2.0,,3.0,4,5.0
7,7,,1.0,,,4,


In [101]:
df05 = df04.drop(['UserID'], axis=1)

df_survey = df05.stack().reset_index()
df_survey.columns=['UserId', 'movie', 'rating']

df_survey

Unnamed: 0,UserId,movie,rating
0,1,The Little Things,3
1,1,The White Tiger,4
2,1,The Dig,5
3,1,Wonder Woman 1984,1
4,1,Promising Young Woman,3
5,2,The Dig,4
6,2,Soul,5
7,2,Wonder Woman 1984,5
8,2,Promising Young Woman,4
9,3,The Little Things,5


In [102]:
df_survey['rating'] = df_survey['rating'].astype('int64') 

df_survey.dtypes

UserId     int64
movie     object
rating     int64
dtype: object

In [95]:
# average only the 'rating' column, so UserId is not averaged along with it
rating_mean = (df_survey.groupby('movie', as_index=False, sort=False)['rating']
               .mean()
               .rename(columns={'rating': 'rating_mean'}))
rating_mean.sort_values('rating_mean', ascending=False)


Unnamed: 0,movie,rating_mean
2,The Dig,4.5
0,The Little Things,4.0
4,Promising Young Woman,3.833333
5,Soul,3.8
3,Wonder Woman 1984,3.571429
1,The White Tiger,3.2


**Results:**

> 'The Dig' has the highest average rating (4.5) and 'The White Tiger' has the lowest (3.2).
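
Since each movie was rated by a different number of people, it can help to look at the averages next to the rating counts. A minimal sketch, assuming a numeric wide table `ratings` rebuilt by hand from the survey values above (NaN stands for a missing or 'Not Sure' answer):

```python
import numpy as np
import pandas as pd

nan = np.nan
# Ratings copied from the survey table; one row per user (1 to 7).
ratings = pd.DataFrame(
    {
        "The Little Things": [3, nan, 5, nan, 5, 3, nan],
        "The White Tiger": [4, nan, 4, nan, 5, 2, 1],
        "The Dig": [5, 4, 5, nan, 4, nan, nan],
        "Soul": [nan, 5, 5, 3, 3, 3, nan],
        "Wonder Woman 1984": [1, 5, 3, 4, 4, 4, 4],
        "Promising Young Woman": [3, 4, 4, 5, 2, 5, nan],
    },
    index=pd.Index(range(1, 8), name="UserID"),
)

summary = pd.DataFrame({
    "rating_mean": ratings.mean(),   # NaN values are skipped by default
    "n_ratings": ratings.count(),    # number of users who rated the movie
}).sort_values("rating_mean", ascending=False)
print(summary)
```

The counts give some context for the ranking: the top average for 'The Dig' rests on four ratings, while 'Wonder Woman 1984' was rated by all seven respondents.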


### 5. Add movie ID


In [96]:
# read movie ID
movie = pd.read_csv('https://raw.githubusercontent.com/susanqisun/DAV6300/main/movieID.csv')
movie


Unnamed: 0,movieID,movie
0,101,The Little Things
1,102,The White Tiger
2,103,The Dig
3,104,Soul
4,105,Wonder Woman 1984
5,106,Promising Young Woman


In [103]:
# merge together
df_movie = pd.merge(left=df_survey, right=movie, how='outer')

df_movie.sort_values(by='UserId')

Unnamed: 0,UserId,movie,rating,movieID
0,1,The Little Things,3,101
4,1,The White Tiger,4,102
20,1,Promising Young Woman,3,106
13,1,Wonder Woman 1984,1,105
9,1,The Dig,5,103
26,2,Soul,5,104
21,2,Promising Young Woman,4,106
14,2,Wonder Woman 1984,5,105
10,2,The Dig,4,103
27,3,Soul,5,104


In [104]:
df_movie02 = df_movie[['UserId','movieID','rating']]
df_movie02.sort_values(by='UserId')

Unnamed: 0,UserId,movieID,rating
0,1,101,3
4,1,102,4
20,1,106,3
13,1,105,1
9,1,103,5
26,2,104,5
21,2,106,4
14,2,105,5
10,2,103,4
27,3,104,5
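
An outer merge keeps rows from both tables, so a title that is missing from either one would produce a row containing NaN. The sketch below is one possible safeguard, not the notebook's method: it uses a left merge with the `validate` option of `merge` and an explicit check. The tables `survey` and `movie_ids` are small hypothetical stand-ins.

```python
import pandas as pd

# Stand-ins for df_survey and the movie ID table.
survey = pd.DataFrame({
    "UserId": [1, 2, 2],
    "movie": ["The Dig", "The Dig", "Soul"],
    "rating": [5, 4, 5],
})
movie_ids = pd.DataFrame({
    "movieID": [103, 104],
    "movie": ["The Dig", "Soul"],
})

# how="left" keeps exactly the rated rows; validate="many_to_one" raises an
# error if a movie title appears more than once in the ID table.
merged = survey.merge(movie_ids, on="movie", how="left", validate="many_to_one")

# Fail early if some title in the survey has no ID.
assert merged["movieID"].notna().all(), "some movie titles have no ID"
print(merged[["UserId", "movieID", "rating"]])
```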


### 6. Calculate the missing rating

Next, I'll estimate the missing ratings using SVD. According to the Surprise documentation, its SVD algorithm is equivalent to Probabilistic Matrix Factorization when baselines are not used. I’ll use the Surprise package, a popular Python package for building recommender systems.

https://surprise.readthedocs.io/en/stable/matrix_factorization.html
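
To make the idea concrete, here is a rough NumPy sketch of this kind of matrix factorization trained with stochastic gradient descent. It is not Surprise's implementation: the function `fit_mf`, its hyperparameters, and the small `triples` list (ratings of users 1 to 3 taken from the survey) are illustrative assumptions.

```python
import numpy as np

def fit_mf(triples, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + q_i . p_u by stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    users = sorted({u for u, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    u_idx = {u: k for k, u in enumerate(users)}
    i_idx = {i: k for k, i in enumerate(items)}

    mu = float(np.mean([r for _, _, r in triples]))   # global mean rating
    b_u = np.zeros(len(users))                        # user biases
    b_i = np.zeros(len(items))                        # item biases
    p = rng.normal(0, 0.1, (len(users), n_factors))   # user factors
    q = rng.normal(0, 0.1, (len(items), n_factors))   # item factors

    for _ in range(n_epochs):
        for u, i, r in triples:
            a, b = u_idx[u], i_idx[i]
            err = r - (mu + b_u[a] + b_i[b] + q[b] @ p[a])
            b_u[a] += lr * (err - reg * b_u[a])
            b_i[b] += lr * (err - reg * b_i[b])
            p_old = p[a].copy()
            p[a] += lr * (err * q[b] - reg * p[a])
            q[b] += lr * (err * p_old - reg * q[b])

    def predict(u, i):
        """Estimated rating for a known user and item, clipped to the 1-5 scale."""
        a, b = u_idx[u], i_idx[i]
        return float(np.clip(mu + b_u[a] + b_i[b] + q[b] @ p[a], 1, 5))

    return predict

# (user id, movie id, rating) for users 1, 2 and 3, copied from the survey.
triples = [
    (1, 101, 3), (1, 102, 4), (1, 103, 5), (1, 105, 1), (1, 106, 3),
    (2, 103, 4), (2, 104, 5), (2, 105, 5), (2, 106, 4),
    (3, 101, 5), (3, 102, 4), (3, 103, 5), (3, 104, 5), (3, 105, 3), (3, 106, 4),
]

predict = fit_mf(triples)
print(predict(1, 104))   # estimate for user 1 and movie 104, which user 1 did not rate
```

With so few ratings the estimate depends on the random initialization and the hyperparameters, so its value will not match the Surprise output in this notebook.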


In [108]:
from surprise import Reader, Dataset, SVD

# Surprise Reader API to read the dataset; the survey uses a 1-5 scale,
# which is also the Reader default
reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(df_movie02, reader)
svd = SVD()
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x12b00c090>

As we can see below, there are some missing values in the raw survey dataset.

In [134]:
# raw survey data
df04

Unnamed: 0,UserID,The Little Things,The White Tiger,The Dig,Soul,Wonder Woman 1984,Promising Young Woman
1,1,3.0,4.0,5.0,,1,3.0
2,2,,,4.0,5.0,5,4.0
3,3,5.0,4.0,5.0,5.0,3,4.0
4,4,,,,3.0,4,5.0
5,5,5.0,5.0,4.0,3.0,4,2.0
6,6,3.0,2.0,,3.0,4,5.0
7,7,,1.0,,,4,


Below are the movie names and IDs.

In [141]:
movie

Unnamed: 0,movieID,movie
0,101,The Little Things
1,102,The White Tiger
2,103,The Dig
3,104,Soul
4,105,Wonder Woman 1984
5,106,Promising Young Woman


### User 1:

For user 1, there's one missing rating.

First I need to find the movie ids that user 1 didn’t rate.

In [137]:
#https://blog.cambridgespark.com/tutorial-practical-introduction-to-recommender-systems-dbe22848392b

# get a list of all movie ids
iids = df_movie02['movieID'].unique()

# get a list of movie ids that user id 1 has rated
iids01 = df_movie02.loc[df_movie02['UserId']==1,'movieID']

# remove the movie ids that user id 1 has rated from the list of all movie ids
iids_to_pred = np.setdiff1d(iids,iids01)


Next I want to predict the rating for each of the movie ids that user 1 didn’t rate. For this I have to create another dataset with the movie ids I want to predict, in the same format as before: uid, iid, rating. The rating column here stands for the true rating, which is unknown and is not used to compute the estimate, so I'll just arbitrarily set it to 4 for every row of this test set. Let's do this, then output the prediction.


In [139]:
testset = [[1,iid,4.] for iid in iids_to_pred]
predictions = svd.test(testset)
predictions

[Prediction(uid=1, iid=104, r_ui=4.0, est=3.683747035361957, details={'was_impossible': False})]

For user 1, the missing rating is for the movie with ID 104 (Soul); the estimated rating is 3.68.

### User 2:

For user 2, there are two missing ratings.


In [140]:
# get a list of all movie ids
iids = df_movie02['movieID'].unique()

# get a list of movie ids that user id 2 has rated
iids01 = df_movie02.loc[df_movie02['UserId']==2,'movieID']

# remove the movie ids that user id 2 has rated from the list of all movie ids
iids_to_pred = np.setdiff1d(iids,iids01)

testset = [[2,iid,4.] for iid in iids_to_pred]
predictions = svd.test(testset)
predictions

[Prediction(uid=2, iid=101, r_ui=4.0, est=4.132933514290043, details={'was_impossible': False}),
 Prediction(uid=2, iid=102, r_ui=4.0, est=3.5969745738878047, details={'was_impossible': False})]

For user 2, the missing ratings are for the movies with ID 101 (The Little Things) and ID 102 (The White Tiger); the estimated ratings are 4.13 and 3.60, respectively.


### User 3:

For user 3, there are no missing ratings, so the list of predictions is empty.


In [142]:
# get a list of all movie ids
iids = df_movie02['movieID'].unique()

# get a list of movie ids that user id 3 has rated
iids01 = df_movie02.loc[df_movie02['UserId']==3,'movieID']

# remove the movie ids that user id 3 has rated from the list of all movie ids
iids_to_pred = np.setdiff1d(iids,iids01)

testset = [[3,iid,4.] for iid in iids_to_pred]
predictions = svd.test(testset)
predictions

[]

### User 4:

For user 4, there are 3 missing ratings.

In [143]:
# get a list of all movie ids
iids = df_movie02['movieID'].unique()

# get a list of movie ids that user id 4 has rated
iids01 = df_movie02.loc[df_movie02['UserId']==4,'movieID']

# remove the movie ids that user id 4 has rated from the list of all movie ids
iids_to_pred = np.setdiff1d(iids,iids01)

testset = [[4,iid,4.] for iid in iids_to_pred]
predictions = svd.test(testset)
predictions

[Prediction(uid=4, iid=101, r_ui=4.0, est=4.045680065115335, details={'was_impossible': False}),
 Prediction(uid=4, iid=102, r_ui=4.0, est=3.7591074911441344, details={'was_impossible': False}),
 Prediction(uid=4, iid=103, r_ui=4.0, est=4.134653268015945, details={'was_impossible': False})]

### User 7:

For user 7, there are 4 missing ratings.

In [144]:
# get a list of all movie ids
iids = df_movie02['movieID'].unique()

# get a list of movie ids that user id 7 has rated
iids01 = df_movie02.loc[df_movie02['UserId']==7,'movieID']

# remove the movie ids that user id 7 has rated from the list of all movie ids
iids_to_pred = np.setdiff1d(iids,iids01)

testset = [[7,iid,4.] for iid in iids_to_pred]
predictions = svd.test(testset)
predictions

[Prediction(uid=7, iid=101, r_ui=4.0, est=3.6068572877689764, details={'was_impossible': False}),
 Prediction(uid=7, iid=103, r_ui=4.0, est=3.7672593745579968, details={'was_impossible': False}),
 Prediction(uid=7, iid=104, r_ui=4.0, est=3.460695411363843, details={'was_impossible': False}),
 Prediction(uid=7, iid=106, r_ui=4.0, est=3.5491172576330245, details={'was_impossible': False})]
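
The per-user cells above repeat the same steps with a different user id each time. A minimal sketch of how the unrated movie ids could be collected for every user at once, assuming a long-format stand-in table `rated` (users 1 to 3 from the survey) and a hypothetical helper `unrated_items`:

```python
import numpy as np
import pandas as pd

# Long-format ratings for users 1, 2 and 3; only the id columns are needed here.
rated = pd.DataFrame({
    "UserId":  [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    "movieID": [101, 102, 103, 105, 106, 103, 104, 105, 106,
                101, 102, 103, 104, 105, 106],
})

def unrated_items(rated, all_items):
    """Map each user id to the sorted array of movie ids that user has not rated."""
    return {
        uid: np.setdiff1d(all_items, group["movieID"])
        for uid, group in rated.groupby("UserId")
    }

all_items = rated["movieID"].unique()
missing = unrated_items(rated, all_items)
for uid, iids in missing.items():
    print(uid, iids.tolist())
```

Each (user id, movie id) pair found this way could then be given to the fitted model, either as `[uid, iid, 4.]` rows for `svd.test` as in the cells above, or one at a time through Surprise's `svd.predict(uid, iid)`.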

## Question: Is there any benefit in standardizing ratings? How might you approach this?

Yes. Different users use the rating scale differently, and standardizing removes that per-user bias. For example, suppose user A rates two movies [1, 2] and user B rates the same two movies [2, 4]. Cosine similarity considers A and B similar even though their ratings differ, because their rating vectors point in the same direction. This is a common occurrence in the real world, and users like A are what you can call tough raters. An example would be a movie critic who always gives ratings lower than the average, but whose ranking of the items is similar to that of average raters like B.

To account for such individual rating habits, we can normalize the ratings to remove each user's bias: subtract the user's average rating (over all items that user rated) from each of that user's ratings.

Here’s what this normalization would look like:

* For user A, the rating vector [1, 2] has the average 1.5. Subtracting 1.5 from every rating would give you the vector [-0.5, 0.5].


* For user B, the rating vector [2, 4] has the average 3. Subtracting 3 from every rating would give you the vector [-1, 1].

We can see that the ratings are now adjusted to give an average of 0 for all users, which brings them all to the same level and removes their biases.
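
The normalization above can be sketched in a few lines of pandas. The table `ratings` holds only the two example users A and B, and the column names `movie_1` and `movie_2` are placeholders:

```python
import numpy as np
import pandas as pd

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Example users from the text: A is a tough rater, B an average rater.
ratings = pd.DataFrame(
    {"movie_1": [1.0, 2.0], "movie_2": [2.0, 4.0]},
    index=["A", "B"],
)

user_mean = ratings.mean(axis=1)            # per-user average (NaN would be skipped)
centered = ratings.sub(user_mean, axis=0)   # subtract each user's own average

print(centered)
print(cosine(ratings.loc["A"].to_numpy(), ratings.loc["B"].to_numpy()))
print(cosine(centered.loc["A"].to_numpy(), centered.loc["B"].to_numpy()))
```

Mean-centering removes each user's offset but not differences in spread: B's centered ratings are still twice as large as A's. If that matters, dividing each row by the user's standard deviation (a z-score) is a common further step.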

This method was explained in the article below, which also discussed some methods to handle the missing values (ratings).

https://towardsdatascience.com/the-magic-behind-recommendation-systems-c3fc44927b3c

