# Movie Ratings Analysis

This project prepares a small dataset of movie ratings for use in a recommendation system. The dataset contains ratings of six popular movies by five users.




## Task 1: Load the Ratings into a Pandas DataFrame and Save to CSV

First, we'll create a DataFrame to store the user ratings for each movie. After creating this DataFrame, we'll save the data to a CSV file for persistent storage and future use.

Ratings are on a scale of 1 to 5. `None` marks a movie that a user has not rated; pandas stores it as `NaN`.


In [1]:
# Build the ratings DataFrame
import pandas as pd

# Given ratings data
ratings_data = {
    'American Sniper': [5, 4, None, 3, 5],
    'Edge of Tomorrow': [4, None, 4, 2, 4],
    'Groundhog Day': [3, 4, None, 2, None],
    'Jurassic World': [None, 3, 4, 2, 3],
    'Lost in Translation': [None, 4, 4, 3, None],
    'Lucy': [4, 4, None, 3, 3]
}

# Create the DataFrame with users as rows, then transpose so that movies are rows and users are columns
df_ratings = pd.DataFrame(ratings_data, index=['John', 'Logan', 'Modesto', 'Malcolm', 'Maurice']).transpose()

# Now let's save this DataFrame to a CSV file
csv_file_path = 'movies_ratings.csv'  # Define the file path
df_ratings.to_csv(csv_file_path)  # Save to CSV

csv_file_path  # Output the file path for download or reference


'movies_ratings.csv'
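
As a quick check that the saved file can be used later, the sketch below writes the same DataFrame and reads it back. It writes to a temporary directory, which is an assumption made here to keep the example from touching the project folder; `reloaded` is an illustrative name.

```python
import os
import tempfile

import pandas as pd

# Same ratings as above; None marks a movie the user has not rated.
ratings_data = {
    'American Sniper': [5, 4, None, 3, 5],
    'Edge of Tomorrow': [4, None, 4, 2, 4],
    'Groundhog Day': [3, 4, None, 2, None],
    'Jurassic World': [None, 3, 4, 2, 3],
    'Lost in Translation': [None, 4, 4, 3, None],
    'Lucy': [4, 4, None, 3, 3],
}
users = ['John', 'Logan', 'Modesto', 'Malcolm', 'Maurice']
df_ratings = pd.DataFrame(ratings_data, index=users).transpose()

# Temporary directory: an assumption for this sketch, not the notebook's path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'movies_ratings.csv')
    df_ratings.to_csv(path)
    # index_col=0 restores the movie titles as the row index.
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded.shape)  # (6, 5)
print(int(reloaded.isna().sum().sum()))  # 8 missing ratings survive the round trip
```

Without `index_col=0`, the movie titles would come back as an ordinary column rather than as the index.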

## Task 2: Calculate Average Ratings
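Next, we compute the average rating for each user and for each movie. `DataFrame.mean` skips missing values by default, so each average covers only the ratings that exist.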


In [2]:
# Calculate the average ratings for each user (column-wise average, skipping NaN values)
average_ratings_per_user = df_ratings.mean(axis=0)

# Calculate the average ratings for each movie (row-wise average, skipping NaN values)
average_ratings_per_movie = df_ratings.mean(axis=1)

average_ratings_per_user, average_ratings_per_movie


(John       4.00
 Logan      3.80
 Modesto    4.00
 Malcolm    2.50
 Maurice    3.75
 dtype: float64,
 American Sniper        4.250000
 Edge of Tomorrow       3.500000
 Groundhog Day          3.000000
 Jurassic World         3.000000
 Lost in Translation    3.666667
 Lucy                   3.500000
 dtype: float64)
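
Because missing ratings are skipped, each average rests on a different number of ratings. The following sketch, with the illustrative names `per_user` and `n_rated`, puts the count next to each user's mean so the two can be read together.

```python
import pandas as pd

ratings_data = {
    'American Sniper': [5, 4, None, 3, 5],
    'Edge of Tomorrow': [4, None, 4, 2, 4],
    'Groundhog Day': [3, 4, None, 2, None],
    'Jurassic World': [None, 3, 4, 2, 3],
    'Lost in Translation': [None, 4, 4, 3, None],
    'Lucy': [4, 4, None, 3, 3],
}
users = ['John', 'Logan', 'Modesto', 'Malcolm', 'Maurice']
df_ratings = pd.DataFrame(ratings_data, index=users).transpose()

# count() ignores NaN, so it gives the number of movies each user actually rated.
per_user = pd.DataFrame({
    'mean': df_ratings.mean(axis=0),
    'n_rated': df_ratings.count(axis=0),
})
print(per_user)
```

Modesto's average of 4.00 is based on three ratings, while Malcolm's 2.50 is based on all six, which is worth keeping in mind when comparing the two.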

## Task 3: Normalize Ratings
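To compare ratings on a consistent scale, we apply Min-Max scaling to each user's ratings, mapping that user's lowest rating to 0 and highest rating to 1. Modesto gave every rated movie a 4, so the denominator is zero and all of Modesto's normalized values are `NaN`.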


In [3]:
# Normalization of ratings (Min-Max scaling)
# Normalized_rating = (Rating - Min_rating) / (Max_rating - Min_rating)

# Define the normalization function
def normalize(ratings):
    min_rating = ratings.min(skipna=True)
    max_rating = ratings.max(skipna=True)
    # A user whose ratings are all equal has max == min, so this is 0/0 and yields NaN
    normalized_ratings = (ratings - min_rating) / (max_rating - min_rating)
    return normalized_ratings

# Apply the normalization function to each user's ratings (column-wise)
normalized_df = df_ratings.apply(normalize, axis=0)

# Show the normalized dataframe and calculate the average of the normalized ratings per user
normalized_df, normalized_df.mean(axis=0)


(                     John  Logan  Modesto  Malcolm  Maurice
 American Sniper       1.0    1.0      NaN      1.0      1.0
 Edge of Tomorrow      0.5    NaN      NaN      0.0      0.5
 Groundhog Day         0.0    1.0      NaN      0.0      NaN
 Jurassic World        NaN    0.0      NaN      0.0      0.0
 Lost in Translation   NaN    1.0      NaN      1.0      NaN
 Lucy                  0.5    1.0      NaN      1.0      0.0,
 John       0.500
 Logan      0.800
 Modesto      NaN
 Malcolm    0.500
 Maurice    0.375
 dtype: float64)
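
One possible way to avoid the all-`NaN` column is to scale against the known bounds of the rating scale (1 to 5) instead of each user's own minimum and maximum. This is a sketch of an alternative, not the method used in this notebook; `normalize_fixed_scale`, `low`, and `high` are hypothetical names.

```python
import pandas as pd

# Two users' ratings copied from the data above (NaN = not rated).
john = pd.Series([5, 4, 3, None, None, 4], dtype='float64')
modesto = pd.Series([None, 4, None, 4, 4, None], dtype='float64')

def normalize_fixed_scale(ratings, low=1.0, high=5.0):
    """Min-Max scale against the fixed rating bounds rather than per-user extremes."""
    return (ratings - low) / (high - low)

print(normalize_fixed_scale(john).tolist())     # [1.0, 0.75, 0.5, nan, nan, 0.75]
print(normalize_fixed_scale(modesto).tolist())  # [nan, 0.75, nan, 0.75, 0.75, nan]
```

The trade-off is that fixed bounds no longer adjust for individual rating habits: a 4 maps to 0.75 for every user, whether that user rates generously or harshly.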

## Task 4: Standardize Ratings
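Next, we standardize each user's ratings to have a mean of 0 and a standard deviation of 1. Many machine learning algorithms work better when features are on this scale. pandas computes the sample standard deviation (`ddof=1`) by default. As with normalization, Modesto's identical ratings give a standard deviation of zero, so that column is `NaN`.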


In [4]:
# Standardization of ratings (Z-score normalization)
# Standardized_rating = (Rating - Mean_rating) / Std_rating

# Define the standardization function
def standardize(ratings):
    mean_rating = ratings.mean(skipna=True)
    std_rating = ratings.std(skipna=True)
    standardized_ratings = (ratings - mean_rating) / std_rating
    return standardized_ratings

# Apply the standardization function to each user's ratings (column-wise)
standardized_df = df_ratings.apply(standardize, axis=0)

# Show the standardized dataframe and calculate the average of the standardized ratings per user
standardized_df, standardized_df.mean(axis=0)


(                         John     Logan  Modesto   Malcolm   Maurice
 American Sniper      1.224745  0.447214      NaN  0.912871  1.305582
 Edge of Tomorrow     0.000000       NaN      NaN -0.912871  0.261116
 Groundhog Day       -1.224745  0.447214      NaN -0.912871       NaN
 Jurassic World            NaN -1.788854      NaN -0.912871 -0.783349
 Lost in Translation       NaN  0.447214      NaN  0.912871       NaN
 Lucy                 0.000000  0.447214      NaN  0.912871 -0.783349,
 John       0.000000e+00
 Logan      3.996803e-16
 Modesto             NaN
 Malcolm    0.000000e+00
 Maurice   -5.551115e-17
 dtype: float64)
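
Two details of the output above can be checked on a single user. The sketch below uses John's four ratings to show how the choice of `ddof` changes the z-scores, and why means such as `3.996803e-16` should be compared with a tolerance rather than with `== 0`. The names `z_sample` and `z_population` are illustrative.

```python
import numpy as np
import pandas as pd

john = pd.Series([5, 4, 3, 4], dtype='float64')  # John's four ratings

# pandas uses the sample standard deviation (ddof=1) by default.
z_sample = (john - john.mean()) / john.std()
# Passing ddof=0 gives the population standard deviation instead.
z_population = (john - john.mean()) / john.std(ddof=0)

print(z_sample.round(6).tolist())      # [1.224745, 0.0, -1.224745, 0.0]
print(z_population.round(6).tolist())  # [1.414214, 0.0, -1.414214, 0.0]

# Tiny nonzero means are floating-point round-off, so compare with a tolerance.
print(bool(np.isclose(z_sample.mean(), 0.0)))  # True
```

The sample version matches John's column in the standardized table. With so few ratings per user, the two conventions differ noticeably, so it helps to state which one is in use.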

## Discussion on Normalization vs Standardization
Normalization and standardization are common preprocessing steps in data analysis. Using normalized or standardized ratings in a recommendation system has several advantages and disadvantages:

Advantages:

1.  Comparability: Normalization puts all ratings on the same scale, typically 0 to 1, which allows for comparison across different users or items regardless of the original scale.

2.  Fairness: It accounts for differences in user rating behavior. Some users may tend to give higher ratings in general (easy raters), while others might give lower ratings (tough raters). Normalization and standardization adjust for these biases.

3.  Algorithm Readiness: Many machine learning algorithms expect features to be on a similar scale for them to work correctly. Normalization and standardization are preprocessing steps to make data compatible with these algorithms.

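4.  Outlier Management: Standardization expresses each rating as a number of standard deviations from the user's mean, which makes unusual ratings easy to spot. The mean and standard deviation are themselves affected by outliers, though, and Min-Max scaling is more sensitive still because it depends only on the extreme values.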

Disadvantages:

1.  Loss of Interpretability: The original meaning of the ratings is lost, which can make it harder to interpret the data without the context of the original scale.

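2.  Data Skewness: Standardization does not require normally distributed data, but z-scores are easiest to interpret when ratings are roughly symmetric around the mean. For strongly skewed ratings, the mean and standard deviation describe the data poorly, so the standardized values can be misleading.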

3.  Sensitivity to New Data: Normalized and standardized ratings need to be recalculated when new data comes in, which can be computationally intensive for large datasets.

4.  Ignores Null Values: Missing ratings stay missing after normalization or standardization. If they are later filled with 0 in the standardized data, a non-rating is treated as if it were an average rating.

5.  Dependency on the Dataset: The normalized and standardized values are dependent on the dataset. If the dataset changes, such as adding more users or items, the scales may shift, and previously scaled values may no longer be valid.
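
The dependence on the dataset can be illustrated with Logan's ratings. In this sketch, a single new rating changes the Min-Max value of a movie that was already rated. The new movie and its rating are hypothetical and are not part of the dataset; `min_max` is an illustrative helper that mirrors the `normalize` function above.

```python
import pandas as pd

def min_max(ratings):
    """Per-user Min-Max scaling, as in the normalize function above."""
    return (ratings - ratings.min()) / (ratings.max() - ratings.min())

# Logan's five existing ratings.
logan = pd.Series({
    'American Sniper': 4.0,
    'Groundhog Day': 4.0,
    'Jurassic World': 3.0,
    'Lost in Translation': 4.0,
    'Lucy': 4.0,
})
before = min_max(logan)

# Hypothetical new rating, added only to show the effect.
logan_updated = pd.concat([logan, pd.Series({'Hypothetical Movie': 1.0})])
after = min_max(logan_updated)

print(before['Jurassic World'])           # 0.0
print(round(after['Jurassic World'], 3))  # 0.667
```

Jurassic World moves from Logan's lowest possible scaled value to about two thirds, even though Logan's opinion of it has not changed.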

In summary, normalization and standardization are powerful techniques for preprocessing data for recommendation systems, but they require careful implementation and regular updates to maintain their effectiveness. They make the ratings from different users more comparable and can improve the performance of machine learning models, but they also add a layer of complexity and can reduce the clarity of the raw data.

## Extra Credit: Calculate the Average of the Standardized Ratings
As an extra step, we calculate the average of the standardized ratings for each movie to see how they compare against each other.


In [5]:
# Calculate the average of the standardized ratings for each movie (row-wise average, skipping NaN values)
average_standardized_per_movie = standardized_df.mean(axis=1)

average_standardized_per_movie


American Sniper        0.972603
Edge of Tomorrow      -0.217251
Groundhog Day         -0.563467
Jurassic World        -1.161692
Lost in Translation    0.680042
Lucy                   0.144184
dtype: float64
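
To make the comparison easier to read, the averages can be sorted from highest to lowest. The sketch below copies the values from the output above into a Series; `ranking` is an illustrative name.

```python
import pandas as pd

# Values copied from the output above.
average_standardized_per_movie = pd.Series({
    'American Sniper': 0.972603,
    'Edge of Tomorrow': -0.217251,
    'Groundhog Day': -0.563467,
    'Jurassic World': -1.161692,
    'Lost in Translation': 0.680042,
    'Lucy': 0.144184,
})

# Highest average standardized rating first.
ranking = average_standardized_per_movie.sort_values(ascending=False)
print(ranking.index.tolist())
# ['American Sniper', 'Lost in Translation', 'Lucy',
#  'Edge of Tomorrow', 'Groundhog Day', 'Jurassic World']
```

Compared with the raw averages in Task 2, the standardized averages separate the ties: Lucy now ranks above Edge of Tomorrow, and Groundhog Day above Jurassic World. Modesto's ratings do not contribute here, because that column is entirely `NaN` after standardization.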

## Conclusion
In this project, we prepared movie ratings data for a recommendation system: we loaded the ratings into a DataFrame, saved them to CSV, computed averages per user and per movie, and then normalized and standardized each user's ratings. Normalization and standardization help reduce the effect of individual rating habits before analysis and model training.
