# Movie Ratings

In [1]:
# import libraries
import pandas as pd
import numpy as np

In [2]:
# import dataframe from movie ratings csv file
ratings = pd.read_csv('movie_ratings.csv')

# display dataframe
ratings

,Name,Alien: Romulus,Terrifier 3,Inside Out 2,Deadpool & Wolverine,Late Night with the Devil,Dune: Part Two
0,Robert,4.0,4.0,4.0,3,4.0,5
1,Emily,3.0,1.0,5.0,4,,3
2,Leslie,5.0,,2.0,4,4.0,4
3,Eason,2.0,4.0,5.0,5,,5
4,Rebecca,4.0,4.0,5.0,3,4.0,2
5,Jonathan,3.0,2.0,3.0,1,,4
6,Maria,,,3.0,4,2.0,5
7,Joseph,3.0,5.0,,5,3.0,2
8,Anthony,4.0,4.0,3.0,2,,5
9,Tesha,4.0,4.0,5.0,4,5.0,4


Since the first column, "Name", records the names of the individuals who took the survey and the remaining columns are all movie titles, I set "Name" as the index of this dataframe.

In [3]:
ratings.set_index('Name', inplace=True)

## Show the average ratings for each movie.

I use NumPy's nanmean function to average the ratings for each movie while ignoring NaN values, then round the results to two decimals.

In [4]:
# calculate the average rating for each movie (column)
# rounded to two decimals
movie_avg = ratings.apply(lambda col: np.nanmean(col).round(2))

print('Here are the average ratings for each movie:')
print(movie_avg)

Here are the average ratings for each movie:
Alien: Romulus               3.56
Terrifier 3                  3.50
Inside Out 2                 3.89
Deadpool & Wolverine         3.50
Late Night with the Devil    3.67
Dune: Part Two               3.90
dtype: float64
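
As a cross-check, pandas' own `DataFrame.mean` skips NaN values by default, so the same per-movie averages can be sketched without `apply`. The `sample` frame below is a hypothetical three-user subset of the survey (it is not loaded from movie_ratings.csv) so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the survey, built inline so the snippet is self-contained
sample = pd.DataFrame(
    {
        "Alien: Romulus": [4.0, 3.0, 5.0],
        "Terrifier 3": [4.0, 1.0, np.nan],
        "Inside Out 2": [4.0, 5.0, 2.0],
    },
    index=pd.Index(["Robert", "Emily", "Leslie"], name="Name"),
)

# DataFrame.mean works column by column and skips NaN by default (skipna=True)
builtin_movie_avg = sample.mean().round(2)

# The same calculation with np.nanmean, as in the cell above
nanmean_movie_avg = sample.apply(lambda col: np.nanmean(col).round(2))

print(builtin_movie_avg)
print(nanmean_movie_avg)
```

Both approaches should agree; passing `axis=1` to `mean` would give the per-user averages instead.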


## Show the average ratings for each user.

I again use NumPy's nanmean function to ignore NaN values, this time applied to each row (axis=1), then round the results to two decimals.

In [5]:
# calculate average ratings for each user
# round with two decimals
user_avg = ratings.apply(lambda row: np.nanmean(row).round(2), axis=1)

print('Here are the average movie ratings from each user:')
print(user_avg)

Here are the average movie ratings from each user:
Name
Robert      4.00
Emily       3.20
Leslie      3.80
Eason       4.20
Rebecca     3.67
Jonathan    2.60
Maria       3.50
Joseph      3.60
Anthony     3.60
Tesha       4.33
dtype: float64


## Normalization

The following is a movie ratings dataframe with normalized ratings **for each user**: every rating is rescaled to the range 0 to 1 using that user's own lowest and highest ratings.

In [6]:
# min-max normalization of a single value
def normalization(x, x_min, x_max):
    x_new = (x - x_min) / (x_max - x_min)
    return x_new

# use nanmin and nanmax to find the min and max of a user's ratings, ignoring NaN values,
# then call the normalization function on each rating
def new_elem(arr):
    x_min = np.nanmin(arr)
    x_max = np.nanmax(arr)
    return arr.map(lambda x: normalization(x, x_min, x_max))

# pass in each user's row and calculate the normalized ratings
normalized_ratings = ratings.apply(new_elem, axis=1)
normalized_ratings

Name,Alien: Romulus,Terrifier 3,Inside Out 2,Deadpool & Wolverine,Late Night with the Devil,Dune: Part Two
Robert,0.5,0.5,0.5,0.0,0.5,1.0
Emily,0.5,0.0,1.0,0.75,,0.5
Leslie,1.0,,0.0,0.666667,0.666667,0.666667
Eason,0.0,0.666667,1.0,1.0,,1.0
Rebecca,0.666667,0.666667,1.0,0.333333,0.666667,0.0
Jonathan,0.666667,0.333333,0.666667,0.0,,1.0
Maria,,,0.333333,0.666667,0.0,1.0
Joseph,0.333333,1.0,,1.0,0.333333,0.0
Anthony,0.666667,0.666667,0.333333,0.0,,1.0
Tesha,0.0,0.0,1.0,0.0,1.0,0.0
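
The row-wise normalization above can also be sketched with vectorized pandas operations instead of `apply` and `map`. This is only an alternative sketch, not the method used in the cell above: the `sample` frame is a hypothetical subset of the survey, and replacing a zero range with NaN is an added assumption for users who give every movie the same rating.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the survey, built inline so the snippet is self-contained
sample = pd.DataFrame(
    {
        "Alien: Romulus": [4.0, 3.0, 5.0],
        "Terrifier 3": [4.0, 1.0, np.nan],
        "Inside Out 2": [4.0, 5.0, 2.0],
        "Dune: Part Two": [5.0, 3.0, 4.0],
    },
    index=pd.Index(["Robert", "Emily", "Leslie"], name="Name"),
)

# Per-user minimum and maximum; pandas skips NaN by default
row_min = sample.min(axis=1)
row_max = sample.max(axis=1)

# Assumption: a user with identical ratings has range 0, so mark it NaN
# rather than dividing by zero
row_range = (row_max - row_min).replace(0, np.nan)

# Subtract each user's minimum and divide by that user's range, aligned on the index
normalized_sample = sample.sub(row_min, axis=0).div(row_range, axis=0)
print(normalized_sample)
```

In this sample, Emily's ratings of 3, 1, 5, 3 become 0.5, 0.0, 1.0, 0.5, and Leslie's missing rating stays NaN.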


Find the normalized average rating for each user.

In [7]:
normalized_user_avg = normalized_ratings.apply(lambda row: np.nanmean(row).round(2), axis=1)

print('Here are the normalized average movie ratings from each user:')
print(normalized_user_avg)

Here are the normalized average movie ratings from each user:
Name
Robert      0.50
Emily       0.55
Leslie      0.60
Eason       0.73
Rebecca     0.56
Jonathan    0.53
Maria       0.50
Joseph      0.53
Anthony     0.53
Tesha       0.33
dtype: float64


Find the normalized average ratings for each movie.

In [8]:
normalized_movie_avg = normalized_ratings.apply(lambda col: np.nanmean(col).round(2))

print('Here are the normalized average ratings for each movie:')
print(normalized_movie_avg)

Here are the normalized average ratings for each movie:
Alien: Romulus               0.48
Terrifier 3                  0.48
Inside Out 2                 0.65
Deadpool & Wolverine         0.44
Late Night with the Devil    0.53
Dune: Part Two               0.62
dtype: float64


Some users are strict when rating movies and are more likely to give low scores, whereas other users enjoy most movies and give higher scores.

Because different users can have different standards, normalized ratings put those standards on the same 0-to-1 scale. Additionally, normalization is useful when the dataset's distribution is unknown, since it makes no assumption about the shape of the distribution.

One disadvantage of using normalized ratings is that the calculated ratings can be affected by outliers.

From the normalization equation:

$$x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

The minimum and maximum data points are used to calculate the normalized data. If the minimum or maximum is an outlier, it can greatly change the resulting normalized ratings.
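
To make the outlier effect concrete, here is a small sketch using hypothetical ratings (not taken from the survey). The helper `min_max` is an illustrative name that applies the equation above to a whole list at once.

```python
import numpy as np

def min_max(values):
    """Apply x_new = (x - x_min) / (x_max - x_min) to every value."""
    values = np.asarray(values, dtype=float)
    x_min = np.nanmin(values)
    x_max = np.nanmax(values)
    return (values - x_min) / (x_max - x_min)

# Hypothetical user who normally rates between 3 and 5
typical = [3, 4, 4, 5]
# The same ratings plus one unusually low score
with_outlier = [1, 3, 4, 4, 5]

print(min_max(typical))       # 3 -> 0.0, 4 -> 0.5, 5 -> 1.0
print(min_max(with_outlier))  # 1 -> 0.0, 3 -> 0.5, 4 -> 0.75, 5 -> 1.0
```

A single rating of 1 moves this user's 3 from 0.0 to 0.5 and their 4 from 0.5 to 0.75, even though their opinion of those movies did not change.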

## Standardization

The following is a movie ratings dataframe with standardized ratings for each user.

The standardization equation is:
$$x_{new} = \frac{x - \mu}{\sigma}$$

Here $\mu$ represents the mean (average) of the data points, and $\sigma$ represents their standard deviation.

In [9]:
# calculate the mean of the data, ignore NaN
def rating_avg(arr):
    return arr.mean(skipna=True)

# calculate the standard deviation of the data, ignore NaN
# (pandas returns the sample standard deviation, ddof=1, by default)
def rating_std(arr):
    return arr.std(skipna=True)

# standardization of a data point
def standardization(x, avg, s_dev):
    x_new = (x - avg)/s_dev
    return x_new


# map each value with the standardization formula.
def map_standardization(arr):
    avg = rating_avg(arr)
    std = rating_std(arr)
    return arr.map(lambda x: standardization(x,avg,std))

# apply the standardization to each user (row)
standardized_ratings = ratings.apply(map_standardization, axis=1)
standardized_ratings

Name,Alien: Romulus,Terrifier 3,Inside Out 2,Deadpool & Wolverine,Late Night with the Devil,Dune: Part Two
Robert,0.0,0.0,0.0,-1.581139,0.0,1.581139
Emily,-0.13484,-1.48324,1.21356,0.53936,,-0.13484
Leslie,1.095445,,-1.643168,0.182574,0.182574,0.182574
Eason,-1.687323,-0.153393,0.613572,0.613572,,0.613572
Rebecca,0.322749,0.322749,1.290994,-0.645497,0.322749,-1.613743
Jonathan,0.350823,-0.526235,0.350823,-1.403293,,1.227881
Maria,,,-0.387298,0.387298,-1.161895,1.161895
Joseph,-0.447214,1.043498,,1.043498,-0.447214,-1.19257
Anthony,0.350823,0.350823,-0.526235,-1.403293,,1.227881
Tesha,-0.645497,-0.645497,1.290994,-0.645497,1.290994,-0.645497
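
As with normalization, the row-wise standardization can be sketched with vectorized pandas operations. This is an alternative sketch rather than the code used above: the `sample` frame reproduces only Robert's and Maria's rows from the survey table so the snippet runs on its own, and it relies on `DataFrame.std` using the sample standard deviation (ddof=1) by default.

```python
import numpy as np
import pandas as pd

# Two rows copied from the survey table, built inline so the snippet is self-contained
sample = pd.DataFrame(
    {
        "Alien: Romulus": [4.0, np.nan],
        "Terrifier 3": [4.0, np.nan],
        "Inside Out 2": [4.0, 3.0],
        "Deadpool & Wolverine": [3.0, 4.0],
        "Late Night with the Devil": [4.0, 2.0],
        "Dune: Part Two": [5.0, 5.0],
    },
    index=pd.Index(["Robert", "Maria"], name="Name"),
)

# Per-user mean and sample standard deviation; NaN values are skipped by default
row_mean = sample.mean(axis=1)
row_std = sample.std(axis=1)

# Subtract each user's mean and divide by that user's standard deviation
standardized_sample = sample.sub(row_mean, axis=0).div(row_std, axis=0)
print(standardized_sample.round(6))
```

Robert's rating of 3 for Deadpool & Wolverine comes out at about -1.581139 and Maria's rating of 3 for Inside Out 2 at about -0.387298, matching the table above.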


Find the standardized average movie ratings for each user.

In [10]:
standardized_user_avg = standardized_ratings.apply(lambda row: np.nanmean(row).round(2), axis=1)

print('Here are the standardized average movie ratings from each user:')
print(standardized_user_avg)

Here are the standardized average movie ratings from each user:
Name
Robert      0.0
Emily      -0.0
Leslie      0.0
Eason      -0.0
Rebecca     0.0
Jonathan   -0.0
Maria       0.0
Joseph     -0.0
Anthony    -0.0
Tesha       0.0
dtype: float64


Every standardized user average is 0 (the -0.0 entries are tiny negative floating-point values rounded to two decimals). This result is expected: standardization subtracts each user's own mean from that user's ratings, so each user's standardized ratings are centered on 0.
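
A short derivation shows why this holds for any user. The symbols $n$ (the number of movies the user rated) and $x_i$ (the user's $i$-th rating) are introduced here for illustration, and the derivation assumes $\sigma \neq 0$:

$$\frac{1}{n}\sum_{i=1}^{n}\frac{x_i - \mu}{\sigma} = \frac{1}{\sigma}\left(\frac{1}{n}\sum_{i=1}^{n}x_i - \mu\right) = \frac{1}{\sigma}(\mu - \mu) = 0$$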

Find the standardized average rating for each movie:

In [11]:
standardized_movie_avg = standardized_ratings.apply(lambda col: np.nanmean(col).round(2))

print('Here are the standardized average ratings for each movie:')
print(standardized_movie_avg)

Here are the standardized average ratings for each movie:
Alien: Romulus              -0.09
Terrifier 3                 -0.14
Inside Out 2                 0.24
Deadpool & Wolverine        -0.29
Late Night with the Devil    0.03
Dune: Part Two               0.24
dtype: float64


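The advantage of standardization is that it is less sensitive to outliers than normalization, because the mean and standard deviation depend on all of the data points rather than only the minimum and maximum. Standardization is most useful when the data are approximately normally distributed and is less informative when the distribution is unknown.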
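
One rough way to probe the outlier claim is to check how much a single extreme rating shrinks the gap between a 4 and a 5 under each scaling. The ratings below are hypothetical (20 scores between 3 and 5, then one added score of 1), and the helper names are illustrative. It is a sketch of the comparison, not a general proof.

```python
import numpy as np

def min_max(values):
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

def z_score(values):
    values = np.asarray(values, dtype=float)
    # ddof=1 matches the pandas default used in the standardization cell
    return (values - values.mean()) / values.std(ddof=1)

def gap_4_to_5(scaled, raw):
    """Distance between a raw 4 and a raw 5 after scaling."""
    raw = np.asarray(raw)
    return scaled[raw == 5][0] - scaled[raw == 4][0]

# Hypothetical ratings: 20 scores between 3 and 5, then one outlier of 1
base = [3, 4, 4, 5] * 5
with_outlier = base + [1]

# Ratio of the gap with the outlier to the gap without it (1.0 means unaffected)
minmax_ratio = gap_4_to_5(min_max(with_outlier), with_outlier) / gap_4_to_5(min_max(base), base)
zscore_ratio = gap_4_to_5(z_score(with_outlier), with_outlier) / gap_4_to_5(z_score(base), base)

print(round(minmax_ratio, 2))  # 0.5
print(round(zscore_ratio, 2))  # about 0.75
```

In this example the outlier halves the gap under normalization but shrinks it by only about a quarter under standardization. With very few ratings per user, as in this survey, the difference between the two methods would be smaller.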