# Objective:
The goal of this exercise is to predict the missing ratings in the `rating.csv` dataset by iteratively re-estimating them with a regression model such as a decision tree or a neural network.

# Dataset Details:
The dataset ([download here](https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database)) contains the following columns:

- user_id: A randomly generated, non-identifiable user ID.
- anime_id: The ID of the anime rated by the user.
- rating: The rating the user assigned to the anime (range 1-10), or -1 if the user watched the anime but didn’t provide a rating.

# Steps for the Exercise:

# 1. Data Preparation:
Load and Explore the Data:

Load the dataset into a Pandas DataFrame.
Explore the data to understand the structure, missing values (rating = -1), and general distribution.

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [34]:
data = pd.read_csv("/content/rating.csv")
print(data.head())
# A rating counts as missing when it is the -1 placeholder or NaN.
missing_data = data[data['rating'].eq(-1) | data['rating'].isna()]
print("Missing data:\n", missing_data)

   user_id  anime_id  rating
0        1      20.0    -1.0
1        1      24.0    -1.0
2        1      79.0    -1.0
3        1     226.0    -1.0
4        1     241.0    -1.0
Missing data:
          user_id  anime_id  rating
0              1      20.0    -1.0
1              1      24.0    -1.0
2              1      79.0    -1.0
3              1     226.0    -1.0
4              1     241.0    -1.0
...          ...       ...     ...
5600660    52585    4793.0    -1.0
5600661    52585    4874.0    -1.0
5600663    52585    4910.0    -1.0
5600664    52585    5526.0    -1.0
5600696    52586       NaN     NaN

[1072759 rows x 3 columns]
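
The exploration step also asks for the general distribution of ratings, which the cell above does not print. A minimal sketch of such a summary follows; `summarize_ratings` and the toy frame are hypothetical stand-ins for illustration, not part of the original notebook.

```python
import pandas as pd

def summarize_ratings(df: pd.DataFrame) -> dict:
    """Count observed and missing ratings; -1 and NaN both count as missing."""
    is_missing = df["rating"].eq(-1) | df["rating"].isna()
    observed = df.loc[~is_missing, "rating"]
    return {
        "n_rows": len(df),
        "n_missing": int(is_missing.sum()),
        "share_missing": float(is_missing.mean()),
        # How often each rating value occurs among the observed ratings.
        "rating_counts": observed.value_counts().sort_index(),
    }

# Toy frame standing in for rating.csv (values are made up for illustration).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "anime_id": [20, 24, 20, 79, 24],
    "rating": [-1, 8, 10, -1, 8],
})
summary = summarize_ratings(toy)
print(summary["n_missing"], summary["share_missing"])
print(summary["rating_counts"])
```

Calling `summarize_ratings(data)` on the full DataFrame would report the same quantities for the real file.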


# 2. Regression-Based Imputation (Iterative Approach):
Setting Up the Regression Model:

For simplicity, you can use a decision tree regression model to predict the missing values.
Create a feature matrix (X) and target vector (y), where X consists of user_id, anime_id, and any other useful features (like user averages), and y is the rating.

In [35]:
# Keep only rows with an observed rating: drop the -1 placeholder and NaN.
train_data = data[data['rating'] != -1].copy()
train_data = train_data.dropna(subset=['rating'])

# Overall mean rating; not used below, but available as a fallback
# for users or anime that have no observed rating.
overall_avg_rating = train_data['rating'].mean()

# Feature engineering: per-user and per-anime mean rating.
train_data['user_avg_rating'] = train_data.groupby('user_id')['rating'].transform('mean')
train_data['anime_avg_rating'] = train_data.groupby('anime_id')['rating'].transform('mean')

# Features: user_id, anime_id, user_avg_rating, anime_avg_rating
X = train_data[['user_id', 'anime_id', 'user_avg_rating', 'anime_avg_rating']]
y = train_data['rating']

# Hold out 20% of the observed ratings to evaluate the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
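
The cell above computes the averages before splitting, so each held-out row's own rating contributes to its features. A possible alternative, sketched here under assumptions, is to split first and learn the averages from the training part only. The helper `add_average_features` and the toy data are hypothetical and not taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def add_average_features(train: pd.DataFrame, other: pd.DataFrame):
    """Attach user/anime mean ratings learned from `train` only to both frames."""
    overall = train["rating"].mean()
    user_avg = train.groupby("user_id")["rating"].mean()
    anime_avg = train.groupby("anime_id")["rating"].mean()
    out = []
    for frame in (train, other):
        frame = frame.copy()
        # Users or anime unseen in `train` fall back to the overall mean.
        frame["user_avg_rating"] = frame["user_id"].map(user_avg).fillna(overall)
        frame["anime_avg_rating"] = frame["anime_id"].map(anime_avg).fillna(overall)
        out.append(frame)
    return out[0], out[1]

# Toy observed ratings (made up for illustration).
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3, 4, 4],
    "anime_id": [10, 11, 10, 12, 11, 12, 10, 13],
    "rating":   [8, 6, 9, 7, 5, 7, 10, 4],
})
train_part, test_part = train_test_split(toy, test_size=0.25, random_state=42)
train_part, test_part = add_average_features(train_part, test_part)
print(test_part)
```

With this layout the held-out ratings never influence the features they are scored against, at the cost of needing a fallback value for unseen users or anime.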



# 3. Train a Regression Model (Decision Tree in this case):

In [36]:
# Train the Decision Tree Regressor model
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Evaluate model performance
score = model.score(X_test, y_test)
print(f"Model R-squared score: {score}")

Model R-squared score: -0.15653852553335712
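
A negative R-squared means the model does worse on the held-out rows than always predicting the mean rating. One plausible cause is that an unconstrained decision tree memorizes noise in the training data. The sketch below illustrates that effect on synthetic data, not on the anime ratings; the `max_depth` and `min_samples_leaf` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a weak linear signal buried in strong noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 2))
y = 2 * X[:, 0] + rng.normal(0, 1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Unconstrained tree: grows until every training row is fit exactly.
deep = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
# Constrained tree: few, large leaves that average the noise away.
shallow = DecisionTreeRegressor(
    max_depth=3, min_samples_leaf=20, random_state=42
).fit(X_tr, y_tr)

deep_r2 = deep.score(X_te, y_te)
shallow_r2 = shallow.score(X_te, y_te)
print(f"unconstrained: {deep_r2:.3f}, depth-limited: {shallow_r2:.3f}")
```

Whether limiting the depth helps on the real ratings would have to be checked by re-running the notebook with those settings.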


# 4. Impute Missing Values Using the Trained Model:

For each missing rating, predict it using the trained regression model.


In [37]:
# Select the rows to impute: rating is the -1 placeholder or NaN.
missing_data = data[data['rating'].eq(-1) | data['rating'].isna()].copy()

# Build the same feature columns the model was trained on.
# Caution: these averages come from the missing rows' own placeholder ratings,
# so they equal -1 (or NaN) rather than the user's or anime's observed mean.
missing_data['user_avg_rating'] = missing_data.groupby('user_id')['rating'].transform('mean')
missing_data['anime_avg_rating'] = missing_data.groupby('anime_id')['rating'].transform('mean')

X_missing = missing_data[['user_id', 'anime_id', 'user_avg_rating', 'anime_avg_rating']]
# Predict the missing ratings
predicted_ratings = model.predict(X_missing)


# Update the DataFrame with the predicted ratings for missing values
data.loc[missing_data.index, 'rating'] = predicted_ratings

# Output the updated DataFrame
print("\nUpdated DataFrame with predicted ratings:\n", data)


Updated DataFrame with predicted ratings:
          user_id  anime_id  rating
0              1      20.0     1.0
1              1      24.0     1.0
2              1      79.0     1.0
3              1     226.0     1.0
4              1     241.0     1.0
...          ...       ...     ...
5600692    52586     356.0     8.0
5600693    52586     431.0     9.0
5600694    52586     440.0     9.0
5600695    52586     523.0    10.0
5600696    52586       NaN     8.0

[5600697 rows x 3 columns]
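
The imputed ratings shown above for user 1 are all 1.0. A likely reason is that the average features for the missing rows were computed from the -1 placeholders themselves, a value the tree never saw during training. A minimal sketch of one alternative is to take the averages from the observed ratings and map them onto the missing rows. The helper `build_missing_features` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def build_missing_features(df: pd.DataFrame) -> pd.DataFrame:
    """Features for rows to impute, using averages of OBSERVED ratings only."""
    is_missing = df["rating"].eq(-1) | df["rating"].isna()
    observed = df.loc[~is_missing]
    overall = observed["rating"].mean()
    user_avg = observed.groupby("user_id")["rating"].mean()
    anime_avg = observed.groupby("anime_id")["rating"].mean()

    X_missing = df.loc[is_missing, ["user_id", "anime_id"]].copy()
    # Users or anime without any observed rating fall back to the overall mean.
    X_missing["user_avg_rating"] = X_missing["user_id"].map(user_avg).fillna(overall)
    X_missing["anime_avg_rating"] = X_missing["anime_id"].map(anime_avg).fillna(overall)
    return X_missing

# Toy frame (made up): two placeholders and one NaN among five rows.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "anime_id": [20, 24, 20, 79, 24],
    "rating":   [-1, 8, 10, -1, np.nan],
})
X_missing = build_missing_features(toy)
print(X_missing)
```

The resulting frame has the same four columns as the training features, so it could be passed to `model.predict` in place of the `X_missing` built above.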


# 5. Repeat the Process:

After updating the ratings, re-train the model on the data that now includes the imputed values and predict again. Repeating this for a few iterations is intended to refine the imputed values.

In [38]:
# Iterative imputation process
max_iterations = 5  # Set the number of iterations for improving predictions

for i in range(max_iterations):
    print(f"\nIteration: {i+1}")
    # Select the rows that are still missing (-1 placeholder or NaN).
    missing_data = data[data['rating'].eq(-1) | data['rating'].isna()].copy()

    if not missing_data.empty:
        # Rebuild the feature columns for the missing rows before predicting.
        missing_data['user_avg_rating'] = missing_data.groupby('user_id')['rating'].transform('mean')
        missing_data['anime_avg_rating'] = missing_data.groupby('anime_id')['rating'].transform('mean')
        X_missing = missing_data[['user_id', 'anime_id', 'user_avg_rating', 'anime_avg_rating']]

        # Predict the missing ratings
        predicted_ratings = model.predict(X_missing)

        # Update the DataFrame with the predicted ratings for missing values
        data.loc[missing_data.index, 'rating'] = predicted_ratings
    else:
        print("No missing data to impute in this iteration.")

    # Re-train the model with the updated data (including the imputed values)
    train_data = data[data['rating'] != -1].copy()

    #Recalculating the average rating columns after updating the DataFrame:
    train_data['user_avg_rating'] = train_data.groupby('user_id')['rating'].transform('mean') # Create 'user_avg_rating' column
    train_data['anime_avg_rating'] = train_data.groupby('anime_id')['rating'].transform('mean') # Create 'anime_avg_rating' column

    X = train_data[['user_id', 'anime_id', 'user_avg_rating', 'anime_avg_rating']]
    y = train_data['rating']

    # Re-train the model on the updated data
    if not X.empty and not y.empty:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        print(f"Model R-squared score after imputation: {score}")
    else:
        print("No training data available for retraining.")

# After completing the iterations, print the final updated values
print("\nFinal Updated DataFrame with Imputed Ratings:")
print(data[['user_id', 'anime_id', 'rating']].head())  # Print the top rows with updated ratings for review


Iteration: 1
No missing data to impute in this iteration.
Model R-squared score after imputation: 0.32265528233764806

Iteration: 2
No missing data to impute in this iteration.
Model R-squared score after imputation: 0.32265528233764806

Iteration: 3
No missing data to impute in this iteration.
Model R-squared score after imputation: 0.32265528233764806

Iteration: 4
No missing data to impute in this iteration.
Model R-squared score after imputation: 0.32265528233764806

Iteration: 5
No missing data to impute in this iteration.
Model R-squared score after imputation: 0.32265528233764806

Final Updated DataFrame with Imputed Ratings:
   user_id  anime_id  rating
0        1      20.0     1.0
1        1      24.0     1.0
2        1      79.0     1.0
3        1     226.0     1.0
4        1     241.0     1.0
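
In the recorded run every iteration reports no missing data, because step 4 already overwrote the -1 placeholders in `data`. The loop therefore only re-trains on unchanged values, and the score stays constant. The sketch below shows one way to make the loop iterate: remember which positions were originally missing, train only on the observed rows, and re-impute the masked rows each round. The function `iterative_impute`, the `tol` stopping rule, `max_depth=5`, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

FEATURES = ["user_id", "anime_id", "user_avg_rating", "anime_avg_rating"]

def iterative_impute(df, max_iterations=5, tol=1e-3):
    df = df.copy()
    df["rating"] = df["rating"].astype(float)
    # Fixed mask of positions that were missing in the ORIGINAL data.
    mask = (df["rating"].eq(-1) | df["rating"].isna()).to_numpy()
    if not mask.any():
        return df[["user_id", "anime_id", "rating"]], mask
    # Initial guess: overall mean of the observed ratings.
    df.loc[mask, "rating"] = df.loc[~mask, "rating"].mean()

    for i in range(max_iterations):
        # Averages use observed ratings plus the current imputations.
        df["user_avg_rating"] = df.groupby("user_id")["rating"].transform("mean")
        df["anime_avg_rating"] = df.groupby("anime_id")["rating"].transform("mean")
        # Train on observed rows only, so imputed values never act as targets.
        model = DecisionTreeRegressor(max_depth=5, random_state=42)
        model.fit(df.loc[~mask, FEATURES], df.loc[~mask, "rating"])
        new_values = model.predict(df.loc[mask, FEATURES])
        change = np.abs(new_values - df.loc[mask, "rating"].to_numpy()).mean()
        df.loc[mask, "rating"] = new_values
        print(f"Iteration {i + 1}: mean absolute change = {change:.4f}")
        if change < tol:  # stop once the imputations stabilize
            break
    return df[["user_id", "anime_id", "rating"]], mask

# Synthetic ratings with simple user and anime effects (made up).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "user_id": rng.integers(1, 31, size=n),
    "anime_id": rng.integers(1, 21, size=n),
})
raw = 5 + toy["user_id"] % 3 + toy["anime_id"] % 4 + rng.normal(0, 1, n)
toy["rating"] = np.clip(np.round(raw), 1, 10)
hidden = rng.random(n) < 0.2
toy.loc[hidden, "rating"] = -1  # mark about 20% of the ratings as missing

imputed, mask = iterative_impute(toy)
print(imputed.head())
```

Keeping the mask fixed is the design choice that matters here: the observed ratings stay untouched, and only the originally missing entries are revised from one iteration to the next.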


# 6. Evaluation:
Measure Accuracy:

After filling in the missing ratings, evaluate how well the model predicts ratings.
Use root mean squared error (RMSE) or mean absolute error (MAE) to measure prediction accuracy.
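
For reference, the two metrics can be written as follows, where $y_i$ is a true rating, $\hat{y}_i$ the corresponding prediction, and $n$ the number of evaluated ratings (this notation is introduced here and does not appear in the original notebook):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
$$

RMSE penalizes large errors more heavily than MAE, so RMSE is never smaller than MAE on the same predictions.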

In [40]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE of model: {rmse}")

RMSE of model: 2.496908921621391
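
The RMSE above is computed on the last test split, whose targets include imputed values, so it does not directly measure how accurate the imputations are. One common way to estimate that, sketched here under assumptions, is to hide a sample of known ratings, predict them, and compare against the true values. The function `evaluate_on_hidden`, its parameters, and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def evaluate_on_hidden(df, frac=0.2, seed=42):
    """Hide a fraction of the known ratings, predict them, return (RMSE, MAE)."""
    known = df[df["rating"].notna() & df["rating"].ne(-1)].copy()
    hidden_idx = known.sample(frac=frac, random_state=seed).index
    visible = known.drop(hidden_idx)
    hidden = known.loc[hidden_idx]

    # Averages come from the visible ratings only.
    overall = visible["rating"].mean()
    user_avg = visible.groupby("user_id")["rating"].mean()
    anime_avg = visible.groupby("anime_id")["rating"].mean()

    def features(frame):
        X = frame[["user_id", "anime_id"]].copy()
        X["user_avg_rating"] = X["user_id"].map(user_avg).fillna(overall)
        X["anime_avg_rating"] = X["anime_id"].map(anime_avg).fillna(overall)
        return X

    model = DecisionTreeRegressor(max_depth=5, random_state=seed)
    model.fit(features(visible), visible["rating"])
    pred = model.predict(features(hidden))
    rmse = float(np.sqrt(mean_squared_error(hidden["rating"], pred)))
    mae = float(mean_absolute_error(hidden["rating"], pred))
    return rmse, mae

# Synthetic ratings with about 20% placeholders (made up for illustration).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "user_id": rng.integers(1, 31, size=n),
    "anime_id": rng.integers(1, 21, size=n),
})
raw = 5 + toy["user_id"] % 3 + toy["anime_id"] % 4 + rng.normal(0, 1, n)
toy["rating"] = np.clip(np.round(raw), 1, 10)
toy.loc[rng.random(n) < 0.2, "rating"] = -1

rmse, mae = evaluate_on_hidden(toy)
print(f"RMSE on hidden ratings: {rmse:.3f}, MAE: {mae:.3f}")
```

Because the hidden ratings are real but withheld from both the features and the training targets, the resulting error is an estimate of how well the same procedure would fill in the truly missing entries.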
