## WEEK 7: Aggregating Data with Movie Ratings

1. Load the ratings by user information that you collected into a pandas dataframe.
2. Show the average ratings for each user and each movie.
3. Create a new pandas dataframe, with normalized ratings for each user. Again, show the average ratings for each user and each movie.
4. Provide a text-based conclusion: explain what might be advantages and disadvantages of using normalized ratings instead of the actual ratings.
5. Create another new pandas dataframe, with standardized ratings for each user. Once again, show the average ratings for each user and each movie. (EXTRA CREDIT)

First, I will load the .csv file with the movie ratings and take a look at it.

In [1]:
import pandas as pd
import numpy as np

url = 'https://github.com/sarahbill33/dataacq/blob/main/movieratings%20-%20Sheet1.csv?raw=true'
ratings = pd.read_csv(url, index_col=0)

print(ratings)

           Nightmare Before Christmas  Coraline  Hocus Pocus  Hocus Pocus 2  \
Ali                               2.0       3.0          5.0            NaN   
Angie                             3.0       5.0          5.0            5.0   
Neal                              4.0       NaN          NaN            2.0   
Fabian                            5.0       4.0          NaN            3.0   
Mackenzie                         NaN       NaN          4.0            3.0   

           Monster House  Spirited Away  
Ali                  NaN            4.0  
Angie                3.0            NaN  
Neal                 3.0            5.0  
Fabian               NaN            5.0  
Mackenzie            3.0            NaN  


Next, I will name the index so the respondent labels are clearly identified.

In [3]:
ratings.index.rename('Respondent', inplace=True)

print(ratings)

            Nightmare Before Christmas  Coraline  Hocus Pocus  Hocus Pocus 2  \
Respondent                                                                     
Ali                                2.0       3.0          5.0            NaN   
Angie                              3.0       5.0          5.0            5.0   
Neal                               4.0       NaN          NaN            2.0   
Fabian                             5.0       4.0          NaN            3.0   
Mackenzie                          NaN       NaN          4.0            3.0   

            Monster House  Spirited Away  
Respondent                                
Ali                   NaN            4.0  
Angie                 3.0            NaN  
Neal                  3.0            5.0  
Fabian                NaN            5.0  
Mackenzie             3.0            NaN  


Now I will reshape the data from wide to long format before I do any groupings.

In [4]:
movie_list = list(ratings.columns)

ratings_long = pd.melt(ratings, value_vars = movie_list, value_name = 'Ratings', ignore_index = False)

print(ratings_long)

                              variable  Ratings
Respondent                                     
Ali         Nightmare Before Christmas      2.0
Angie       Nightmare Before Christmas      3.0
Neal        Nightmare Before Christmas      4.0
Fabian      Nightmare Before Christmas      5.0
Mackenzie   Nightmare Before Christmas      NaN
Ali                           Coraline      3.0
Angie                         Coraline      5.0
Neal                          Coraline      NaN
Fabian                        Coraline      4.0
Mackenzie                     Coraline      NaN
Ali                        Hocus Pocus      5.0
Angie                      Hocus Pocus      5.0
Neal                       Hocus Pocus      NaN
Fabian                     Hocus Pocus      NaN
Mackenzie                  Hocus Pocus      4.0
Ali                      Hocus Pocus 2      NaN
Angie                    Hocus Pocus 2      5.0
Neal                     Hocus Pocus 2      2.0
Fabian                   Hocus Pocus 2  
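As a side note, the melted column comes out with the default name `variable`. The sketch below shows one way the reshape could label it directly. The data is re-typed from the printed table above so the example runs without the download, and the column name `Movie` and the `rated_only` frame are my own illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Ratings re-typed from the printed table above (NaN = movie not rated).
ratings = pd.DataFrame(
    {
        "Nightmare Before Christmas": [2, 3, 4, 5, np.nan],
        "Coraline": [3, 5, np.nan, 4, np.nan],
        "Hocus Pocus": [5, 5, np.nan, np.nan, 4],
        "Hocus Pocus 2": [np.nan, 5, 2, 3, 3],
        "Monster House": [np.nan, 3, 3, np.nan, 3],
        "Spirited Away": [4, np.nan, 5, 5, np.nan],
    },
    index=pd.Index(["Ali", "Angie", "Neal", "Fabian", "Mackenzie"], name="Respondent"),
)

# var_name labels the melted column "Movie" (an assumed name) instead of "variable".
ratings_long = ratings.melt(var_name="Movie", value_name="Ratings", ignore_index=False)

# Optional: keep only the respondent/movie pairs that actually have a rating.
rated_only = ratings_long.dropna(subset=["Ratings"])
print(rated_only.head())
```

Dropping the unrated pairs is not required for the averages, since `mean()` skips NaN by default, but it makes the long table shorter to read.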

Now I am ready to do some groupings. First I will group by respondent to see average ratings by person.

In [16]:
respondents = ratings_long.groupby('Respondent')['Ratings'].mean()

print(respondents)

Respondent
Ali          3.500000
Angie        4.200000
Fabian       4.250000
Mackenzie    3.333333
Neal         3.500000
Name: Ratings, dtype: float64


And next I will look at average ratings by movie.

In [17]:
movies = ratings_long.groupby('variable')['Ratings'].mean()

print(movies)

variable
Coraline                      4.000000
Hocus Pocus                   4.666667
Hocus Pocus 2                 3.250000
Monster House                 3.000000
Nightmare Before Christmas    3.500000
Spirited Away                 4.666667
Name: Ratings, dtype: float64
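As a cross-check on the two groupings above, the same averages can probably be read straight off the wide table without melting. This is only a sketch: the data is re-typed from the printed table, and the names `user_means`, `movie_means`, and `long_user_means` are illustrative.

```python
import numpy as np
import pandas as pd

# Ratings re-typed from the printed table above (NaN = movie not rated).
ratings = pd.DataFrame(
    {
        "Nightmare Before Christmas": [2, 3, 4, 5, np.nan],
        "Coraline": [3, 5, np.nan, 4, np.nan],
        "Hocus Pocus": [5, 5, np.nan, np.nan, 4],
        "Hocus Pocus 2": [np.nan, 5, 2, 3, 3],
        "Monster House": [np.nan, 3, 3, np.nan, 3],
        "Spirited Away": [4, np.nan, 5, 5, np.nan],
    },
    index=pd.Index(["Ali", "Angie", "Neal", "Fabian", "Mackenzie"], name="Respondent"),
)

# Wide format: axis=1 averages across each row (per respondent),
# axis=0 averages down each column (per movie). NaN is skipped by default.
user_means = ratings.mean(axis=1)
movie_means = ratings.mean(axis=0)

# Long format, as in the cells above, for comparison.
ratings_long = ratings.melt(value_name="Ratings", ignore_index=False)
long_user_means = ratings_long.groupby("Respondent")["Ratings"].mean()

# groupby sorts its keys, so sort the wide result before comparing values.
print(np.allclose(user_means.sort_index(), long_user_means))
```

Both routes skip missing ratings, so each average is taken only over the movies a respondent actually rated.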


### To conclude the first part: the individual average ratings ranged from 3.33 (Mackenzie) to 4.25 (Fabian). The movie averages ranged from 3.00 (Monster House) to a tie at 4.67 (Hocus Pocus and Spirited Away).

Next, I will create the new dataframe with normalized data using the .agg() method.

In [25]:
url2 = 'https://github.com/sarahbill33/dataacq/blob/main/movieratings%20-%20Sheet1.csv?raw=true'
ratings_norm = pd.read_csv(url2, index_col=0)

print(ratings_norm)

           Nightmare Before Christmas  Coraline  Hocus Pocus  Hocus Pocus 2  \
Ali                               2.0       3.0          5.0            NaN   
Angie                             3.0       5.0          5.0            5.0   
Neal                              4.0       NaN          NaN            2.0   
Fabian                            5.0       4.0          NaN            3.0   
Mackenzie                         NaN       NaN          4.0            3.0   

           Monster House  Spirited Away  
Ali                  NaN            4.0  
Angie                3.0            NaN  
Neal                 3.0            5.0  
Fabian               NaN            5.0  
Mackenzie            3.0            NaN  


In [26]:
ratings_norm.index.rename('Respondent', inplace=True)

In [27]:
movie_list2 = list(ratings_norm.columns)

ratings_norml = pd.melt(ratings_norm, value_vars = movie_list2, value_name = 'Ratings', ignore_index = False)

print(ratings_norml)

                              variable  Ratings
Respondent                                     
Ali         Nightmare Before Christmas      2.0
Angie       Nightmare Before Christmas      3.0
Neal        Nightmare Before Christmas      4.0
Fabian      Nightmare Before Christmas      5.0
Mackenzie   Nightmare Before Christmas      NaN
Ali                           Coraline      3.0
Angie                         Coraline      5.0
Neal                          Coraline      NaN
Fabian                        Coraline      4.0
Mackenzie                     Coraline      NaN
Ali                        Hocus Pocus      5.0
Angie                      Hocus Pocus      5.0
Neal                       Hocus Pocus      NaN
Fabian                     Hocus Pocus      NaN
Mackenzie                  Hocus Pocus      4.0
Ali                      Hocus Pocus 2      NaN
Angie                    Hocus Pocus 2      5.0
Neal                     Hocus Pocus 2      2.0
Fabian                   Hocus Pocus 2  

In [28]:
respondents_norm = ratings_norml.groupby('Respondent')['Ratings'].agg('mean')

print(respondents_norm)

Respondent
Ali          3.500000
Angie        4.200000
Fabian       4.250000
Mackenzie    3.333333
Neal         3.500000
Name: Ratings, dtype: float64


In [30]:
movies_norm = ratings_norml.groupby('variable')['Ratings'].agg('mean')

print(movies_norm)

variable
Coraline                      4.000000
Hocus Pocus                   4.666667
Hocus Pocus 2                 3.250000
Monster House                 3.000000
Nightmare Before Christmas    3.500000
Spirited Away                 4.666667
Name: Ratings, dtype: float64
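A quick check on a subset of the ratings may help explain why these numbers look identical to the first part. The sketch below uses only Ali's and Mackenzie's ratings, re-typed from the table above. The names `subset`, `with_mean`, and `with_agg` are illustrative.

```python
import pandas as pd

# Long-format subset re-typed from the table above (rated movies only).
subset = pd.DataFrame(
    {
        "Respondent": ["Ali", "Ali", "Ali", "Ali", "Mackenzie", "Mackenzie", "Mackenzie"],
        "Ratings": [2.0, 3.0, 5.0, 4.0, 4.0, 3.0, 3.0],
    }
)

with_mean = subset.groupby("Respondent")["Ratings"].mean()
with_agg = subset.groupby("Respondent")["Ratings"].agg("mean")

# Both calls compute the same per-group average; neither rescales the ratings.
print(with_mean.equals(with_agg))
```

In other words, `.agg('mean')` is another way of asking for the group mean, so identical results are expected here.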


### To conclude the second part: the respondent and movie averages did not change :-/ so I am wondering if I did something wrong.

My best guess is that I used the wrong method for "normalization". The textbook presented the .agg() function alongside normalization, so that is what I used, but .agg('mean') computes the same group means as .mean() and never rescales the individual ratings.
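For comparison, here is a minimal sketch of one common reading of "normalized ratings": min-max scaling within each respondent, so that a person's lowest rating becomes 0 and their highest becomes 1. That this is what the assignment intends is an assumption on my part. The data is re-typed from the table above, and `ratings_minmax`, `user_means_norm`, and `movie_means_norm` are illustrative names.

```python
import numpy as np
import pandas as pd

# Ratings re-typed from the printed table above (NaN = movie not rated).
ratings = pd.DataFrame(
    {
        "Nightmare Before Christmas": [2, 3, 4, 5, np.nan],
        "Coraline": [3, 5, np.nan, 4, np.nan],
        "Hocus Pocus": [5, 5, np.nan, np.nan, 4],
        "Hocus Pocus 2": [np.nan, 5, 2, 3, 3],
        "Monster House": [np.nan, 3, 3, np.nan, 3],
        "Spirited Away": [4, np.nan, 5, 5, np.nan],
    },
    index=pd.Index(["Ali", "Angie", "Neal", "Fabian", "Mackenzie"], name="Respondent"),
)

# Per-respondent minimum and maximum (NaN is skipped by default).
row_min = ratings.min(axis=1)
row_max = ratings.max(axis=1)

# Min-max scaling per row: (rating - min) / (max - min).
# A respondent who gave every movie the same rating would yield 0/0 = NaN here.
ratings_minmax = ratings.sub(row_min, axis=0).div(row_max - row_min, axis=0)

# Averages of the normalized ratings, per respondent and per movie.
user_means_norm = ratings_minmax.mean(axis=1)
movie_means_norm = ratings_minmax.mean(axis=0)

print(ratings_minmax.round(2))
print(user_means_norm.round(3))
print(movie_means_norm.round(3))
```

With this scaling the averages do change, because each respondent's ratings are stretched onto the same 0 to 1 range before averaging. Standardized ratings (the extra-credit item) would follow the same pattern, subtracting each respondent's mean and dividing by their standard deviation instead of the minimum and range.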