# Movie Ratings

When we want to analyze a data set, the first step is to import our libraries and read in any CSV files.

In [145]:
import numpy as np
import pandas as pd

df = pd.read_csv('MovieRatings.csv')
df

Unnamed: 0,Name,The Batman,The Banshees of Inisherin,Sleepless in Seattle,Joe vs The Volcano,You've Got Mail
0,Lucy,4.0,3.0,4.0,3,5
1,Willow,,,5.0,4,2
2,Carmela,,,5.0,4,4
3,Hillary,3.0,,,3,4
4,Sam,5.0,5.0,3.0,4,1


Next, we calculate the average rating for each person and for each movie. We will add these values to the dataframe as a new column (one average per person) and a new row (one average per movie).

In [146]:
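# numeric_only=True skips the text 'Name' column; recent pandas versions raise an error without it
df['Average'] = df.mean(axis=1, numeric_only=True)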
df


Unnamed: 0,Name,The Batman,The Banshees of Inisherin,Sleepless in Seattle,Joe vs The Volcano,You've Got Mail,Average
0,Lucy,4.0,3.0,4.0,3,5,3.8
1,Willow,,,5.0,4,2,3.666667
2,Carmela,,,5.0,4,4,4.333333
3,Hillary,3.0,,,3,4,3.333333
4,Sam,5.0,5.0,3.0,4,1,3.6


In [147]:
# Append a summary row: a blank name followed by the mean of every numeric column
df.loc[len(df.index)] = [''] + df.mean(numeric_only=True).tolist()

In [148]:
df

Unnamed: 0,Name,The Batman,The Banshees of Inisherin,Sleepless in Seattle,Joe vs The Volcano,You've Got Mail,Average
0,Lucy,4.0,3.0,4.0,3.0,5.0,3.8
1,Willow,,,5.0,4.0,2.0,3.666667
2,Carmela,,,5.0,4.0,4.0,4.333333
3,Hillary,3.0,,,3.0,4.0,3.333333
4,Sam,5.0,5.0,3.0,4.0,1.0,3.6
5,,4.0,4.0,4.25,3.6,3.2,3.746667


As data analysts, we might want to normalize this data set. Normalization (more precisely, min-max normalization) rescales the data so that every value falls in the range zero to one: the largest data point in each column becomes one and the smallest becomes zero. This is useful for making sure that all of the data is represented on the same scale.
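As a minimal sketch of the idea, each value x is rescaled to (x - min) / (max - min). The ratings below are hypothetical and are not taken from MovieRatings.csv.

```python
import pandas as pd

# Hypothetical ratings on a 1-5 scale (illustrative only)
ratings = pd.Series([3.0, 4.0, 5.0, 4.0])

# Min-max scaling: subtract the minimum, then divide by the range
scaled = (ratings - ratings.min()) / (ratings.max() - ratings.min())

print(scaled.tolist())  # [0.0, 0.5, 1.0, 0.5]
```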

Here is how I normalized this data set in pandas. 

In [149]:
df_n = df.copy()
df_n

Unnamed: 0,Name,The Batman,The Banshees of Inisherin,Sleepless in Seattle,Joe vs The Volcano,You've Got Mail,Average
0,Lucy,4.0,3.0,4.0,3.0,5.0,3.8
1,Willow,,,5.0,4.0,2.0,3.666667
2,Carmela,,,5.0,4.0,4.0,4.333333
3,Hillary,3.0,,,3.0,4.0,3.333333
4,Sam,5.0,5.0,3.0,4.0,1.0,3.6
5,,4.0,4.0,4.25,3.6,3.2,3.746667


After copying the dataframe, we need to decide on a strategy for dealing with missing values. I decided the best method would be to fill in each missing value with the average rating for that movie. This keeps each movie's average score the same.
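A small sketch of why mean filling leaves a column's average unchanged, using the five 'The Batman' ratings from the first table (the variable names are illustrative only):

```python
import pandas as pd

# 'The Batman' ratings for Lucy, Willow, Carmela, Hillary and Sam
col = pd.Series([4.0, None, None, 3.0, 5.0])

before = col.mean()          # NaN values are skipped by default
filled = col.fillna(before)  # replace each NaN with the column mean
after = filled.mean()

print(before, after)  # 4.0 4.0
```

One trade-off to keep in mind: the filled values all sit exactly at the mean, so the spread of the column (for example `filled.std()`) is smaller than that of the observed ratings alone.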

In [150]:
mean = df_n.mean(numeric_only=True)  # skip the text 'Name' column
df_n = df_n.fillna(mean)
df_n


Unnamed: 0,Name,The Batman,The Banshees of Inisherin,Sleepless in Seattle,Joe vs The Volcano,You've Got Mail,Average
0,Lucy,4.0,3.0,4.0,3.0,5.0,3.8
1,Willow,4.0,4.0,5.0,4.0,2.0,3.666667
2,Carmela,4.0,4.0,5.0,4.0,4.0,4.333333
3,Hillary,3.0,4.0,4.25,3.0,4.0,3.333333
4,Sam,5.0,5.0,3.0,4.0,1.0,3.6
5,,4.0,4.0,4.25,3.6,3.2,3.746667


In [151]:
# Min-max scale each numeric column to the range [0, 1]
for column in ['The Batman', 'The Banshees of Inisherin', 'Sleepless in Seattle', 'Joe vs The Volcano', "You've Got Mail", 'Average']:
    df_n[column] = (df_n[column] - df_n[column].min()) / (df_n[column].max() - df_n[column].min())
    
df_n

Unnamed: 0,Name,The Batman,The Banshees of Inisherin,Sleepless in Seattle,Joe vs The Volcano,You've Got Mail,Average
0,Lucy,0.5,0.0,0.5,0.0,1.0,0.466667
1,Willow,0.5,0.5,1.0,1.0,0.25,0.333333
2,Carmela,0.5,0.5,1.0,1.0,0.75,1.0
3,Hillary,0.0,0.5,0.625,0.0,0.75,0.0
4,Sam,1.0,1.0,0.0,1.0,0.0,0.266667
5,,0.5,0.5,0.625,0.6,0.55,0.413333


We now have our normalized data set. The advantage of going through this process is that every column is on the same scale: within each column, 1 is the largest value and 0 is the smallest. A value of 0.5 lies exactly halfway between that column's minimum and maximum, and values of 0.25 and 0.75 lie a quarter and three quarters of the way along the range. Note that these are positions within the range, not the median or quartiles, which depend on how the data points are distributed. This can be extremely helpful when comparing columns whose original scales differ or are unknown.
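To make the scale concrete, the sketch below rescales the mean-filled 'The Batman' column from the tables above and then maps the scaled values back to ratings. The names `lo`, `hi`, and `restored` are illustrative and not part of the original notebook.

```python
import pandas as pd

# 'The Batman' ratings after mean fill, including the averages row
batman = pd.Series([4.0, 4.0, 4.0, 3.0, 5.0, 4.0])

lo, hi = batman.min(), batman.max()
scaled = (batman - lo) / (hi - lo)

# Invert the scaling: original = min + scaled * (max - min)
restored = lo + scaled * (hi - lo)

print(scaled.tolist())    # [0.5, 0.5, 0.5, 0.0, 1.0, 0.5]
print(restored.tolist())  # [4.0, 4.0, 4.0, 3.0, 5.0, 4.0]
```

Here a scaled value of 0.5 corresponds to a rating of 4, the midpoint between this column's minimum of 3 and maximum of 5.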

As useful as this sounds, normalization is not appropriate for every data set. Because the scale is set entirely by the minimum and maximum, a single outlier can compress all the other values into a narrow band near 0 or 1, and erratic data can become harder to interpret rather than easier. In this example we knew our ratings ran from 1 to 5, which made this a good data set for demonstrating the advantages of normalization.
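The effect of an outlier can be sketched with two hypothetical series (neither comes from MovieRatings.csv). The helper `min_max` is an illustrative name, not a pandas function.

```python
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Scale a Series to the range [0, 1]."""
    return (s - s.min()) / (s.max() - s.min())

# Hypothetical values: the second series has one extreme outlier
clean = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = pd.Series([1.0, 2.0, 3.0, 4.0, 101.0])

print(min_max(clean).tolist())         # [0.0, 0.25, 0.5, 0.75, 1.0]
print(min_max(with_outlier).tolist())  # [0.0, 0.01, 0.02, 0.03, 1.0]
```

In the second series the first four values are squeezed into the bottom 3% of the scale, so the differences between them are much harder to see.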

-- Extra Credit --

In [152]:
df_s = df.copy()
df_s

Unnamed: 0,Name,The Batman,The Banshees of Inisherin,Sleepless in Seattle,Joe vs The Volcano,You've Got Mail,Average
0,Lucy,4.0,3.0,4.0,3.0,5.0,3.8
1,Willow,,,5.0,4.0,2.0,3.666667
2,Carmela,,,5.0,4.0,4.0,4.333333
3,Hillary,3.0,,,3.0,4.0,3.333333
4,Sam,5.0,5.0,3.0,4.0,1.0,3.6
5,,4.0,4.0,4.25,3.6,3.2,3.746667


In [153]:
mean = df_s.mean(numeric_only=True)  # skip the text 'Name' column
df_s = df_s.fillna(mean)
df_s


Unnamed: 0,Name,The Batman,The Banshees of Inisherin,Sleepless in Seattle,Joe vs The Volcano,You've Got Mail,Average
0,Lucy,4.0,3.0,4.0,3.0,5.0,3.8
1,Willow,4.0,4.0,5.0,4.0,2.0,3.666667
2,Carmela,4.0,4.0,5.0,4.0,4.0,4.333333
3,Hillary,3.0,4.0,4.25,3.0,4.0,3.333333
4,Sam,5.0,5.0,3.0,4.0,1.0,3.6
5,,4.0,4.0,4.25,3.6,3.2,3.746667


In [154]:
# Standardize each numeric column: subtract its mean, divide by its standard deviation
for column in ['The Batman', 'The Banshees of Inisherin', 'Sleepless in Seattle', 'Joe vs The Volcano', "You've Got Mail", 'Average']:
    df_s[column] = (df_s[column] - df_s[column].mean()) / df_s[column].std()
    
df_s

Unnamed: 0,Name,The Batman,The Banshees of Inisherin,Sleepless in Seattle,Joe vs The Volcano,You've Got Mail,Average
0,Lucy,0.0,-1.581139,-0.3371,-1.224745,1.224745,0.1614269
1,Willow,0.0,0.0,1.0113,0.816497,-0.8164966,-0.2421403
2,Carmela,0.0,0.0,1.0113,0.816497,0.5443311,1.775695
3,Hillary,-1.581139,0.0,0.0,-1.224745,0.5443311,-1.251058
4,Sam,1.581139,1.581139,-1.6855,0.816497,-1.49691,-0.4439239
5,,0.0,0.0,0.0,0.0,3.021644e-16,1.344149e-15
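
The extra-credit cells above rescale each column by subtracting its mean and dividing by its standard deviation, which is commonly called z-score standardization. As a minimal sketch, the check below reproduces the 'The Batman' column by hand, assuming the mean-filled values shown in the tables above. The variable names are illustrative only.

```python
import pandas as pd

# 'The Batman' ratings after mean fill, including the averages row
batman = pd.Series([4.0, 4.0, 4.0, 3.0, 5.0, 4.0])

# z-score: subtract the mean, then divide by the sample standard deviation
# (pandas' Series.std() uses ddof=1 by default)
z = (batman - batman.mean()) / batman.std()

print(z.round(6).tolist())  # [0.0, 0.0, 0.0, -1.581139, 1.581139, 0.0]
```

After standardization a column has a mean of 0 and a standard deviation of 1. Tiny values such as 3.021644e-16 in the output table are floating-point rounding error and are effectively zero.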
