# Objective:
The goal of this exercise is to predict the missing ratings in the `rating.csv` dataset by imputing them iteratively with a regression model such as a decision tree or a neural network.

# Dataset Details:
The dataset ([download here](https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database)) contains the following columns:

- `user_id`: A randomly generated user ID (non-identifiable).
- `anime_id`: The ID of the anime rated by the user.
- `rating`: The rating the user assigned to the anime (range 1-10), or -1 if the user watched the anime but didn’t provide a rating.

# Steps for the Exercise:

# 1. Data Preparation:
Load and Explore the Data:

Load the dataset into a Pandas DataFrame.
Explore the data to understand the structure, missing values (rating = -1), and general distribution.

In [None]:
import pandas as pd
data = pd.read_csv('rating.csv')
print(data.head())
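# Treat the -1 placeholder (and any NaN) as a missing rating
missing_data = data[data['rating'].eq(-1) | data['rating'].isna()]
print(f"Missing ratings: {len(missing_data)} of {len(data)}")
print("Missing data:\n", missing_data.head())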
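To make the exploration step concrete, here is a minimal sketch of how the share of missing ratings and the distribution of the real ratings could be inspected. The `toy` DataFrame is made-up data standing in for `rating.csv`, and the variable names are illustrative assumptions, not part of the exercise.

```python
import pandas as pd

# Hypothetical toy data standing in for rating.csv
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3],
    "anime_id": [10, 20, 10, 30, 20, 30],
    "rating":   [8, -1, 7, 9, -1, 6],
})

# Flag the -1 placeholder as missing
is_missing = toy["rating"].eq(-1)

# Fraction of rows without a real rating
missing_share = is_missing.mean()

# Distribution of the real ratings only (placeholder excluded)
rating_counts = toy.loc[~is_missing, "rating"].value_counts().sort_index()

# Number of missing ratings per user
missing_per_user = is_missing.groupby(toy["user_id"]).sum()

print(f"Share of missing ratings: {missing_share:.2%}")
print(rating_counts)
print(missing_per_user)
```

The same three quantities can be computed on the full dataset by replacing `toy` with `data`.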

# 2. Regression-Based Imputation (Iterative Approach):
Setting Up the Regression Model:

For simplicity, you can use a decision tree regression model to predict the missing values.
Create a feature matrix (X) and target vector (y), where X consists of user_id, anime_id, and any other useful features (like user averages), and y is the rating.

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

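# Feature engineering: per-user and per-anime average ratings.
# Mask the -1 placeholder first so it does not pull the averages down.
valid_rating = data['rating'].where(data['rating'] != -1)
global_avg = valid_rating.mean()
data['user_avg_rating'] = valid_rating.groupby(data['user_id']).transform('mean').fillna(global_avg)
data['anime_avg_rating'] = valid_rating.groupby(data['anime_id']).transform('mean').fillna(global_avg)

# Filter out missing data (where rating == -1, treated as missing)
train_data = data[data['rating'] != -1]

# Features: user_id, anime_id, user_avg_rating, anime_avg_rating
X = train_data[['user_id', 'anime_id', 'user_avg_rating', 'anime_avg_rating']]
y = train_data['rating']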

# Train-test split (to ensure good model performance evaluation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train a Regression Model (Decision Tree in this case):

In [None]:
# Train the Decision Tree Regressor model
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Evaluate model performance
score = model.score(X_test, y_test)
print(f"Model R-squared score: {score}")
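An R-squared value is easier to interpret next to a reference point. The sketch below, on synthetic data, compares a mean-predicting baseline with an unconstrained and a depth-limited decision tree. The synthetic ratings, the `max_depth=6` setting, and all variable names are assumptions chosen for illustration; none of them come from the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic ratings: one row per (user, anime) pair, built from random biases
rng = np.random.default_rng(0)
users = np.repeat(np.arange(30), 20)
animes = np.tile(np.arange(20), 30)
user_bias = rng.normal(0, 1, 30)[users]
anime_bias = rng.normal(0, 1, 20)[animes]
noise = rng.normal(0, 0.5, len(users))
rating = np.clip(np.round(6 + user_bias + anime_bias + noise), 1, 10)

X = pd.DataFrame({"user_id": users, "anime_id": animes})
y = pd.Series(rating, name="rating")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline that always predicts the mean training rating
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
# Unconstrained tree: free to memorize the training rows
full_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
# Depth-limited tree (max_depth=6 is an arbitrary illustrative choice)
shallow_tree = DecisionTreeRegressor(max_depth=6, random_state=42).fit(X_train, y_train)

baseline_r2 = baseline.score(X_test, y_test)
full_train_r2 = full_tree.score(X_train, y_train)
full_test_r2 = full_tree.score(X_test, y_test)
shallow_test_r2 = shallow_tree.score(X_test, y_test)

print(f"Baseline test R^2:      {baseline_r2:.3f}")
print(f"Full tree train R^2:    {full_train_r2:.3f}")
print(f"Full tree test R^2:     {full_test_r2:.3f}")
print(f"Shallow tree test R^2:  {shallow_test_r2:.3f}")
```

A large gap between the training and test scores of the unconstrained tree is a sign of overfitting, and a test score close to the baseline suggests the features carry little signal.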

# 4. Impute Missing Values Using the Trained Model:

For each missing rating, predict it using the trained regression model.


In [None]:
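# Select the rows whose rating is missing (-1 placeholder or NaN)
missing_data = data[data['rating'].eq(-1) | data['rating'].isna()]

# Predict the missing ratings from the same features used for training
feature_cols = ['user_id', 'anime_id', 'user_avg_rating', 'anime_avg_rating']
predicted_ratings = model.predict(missing_data[feature_cols])

# Predictions are floats, so make the column float before writing them back
data['rating'] = data['rating'].astype(float)

# Update the DataFrame with the predicted ratings for missing values
data.loc[missing_data.index, 'rating'] = predicted_ratings

# Output the updated rows
print("\nUpdated rows with predicted ratings:\n", data.loc[missing_data.index].head())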

# 5. Repeat the Process:

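After updating the ratings, re-train the model on the data that now includes the imputed values and predict the originally missing ratings again. Repeating this can refine the imputed values, but the gain is not guaranteed: an unconstrained decision tree tends to reproduce the values it was just trained on.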

In [29]:
# Iterative imputation process
max_iterations = 5  # Number of refinement passes
feature_cols = ['user_id', 'anime_id', 'user_avg_rating', 'anime_avg_rating']

# Step 4 already replaced every -1, so filtering on rating == -1 again would
# return no rows. Reuse the index of the rows selected in step 4 instead.
missing_idx = missing_data.index

for i in range(max_iterations):
    # Predict the originally missing ratings with the current model
    predicted_ratings = model.predict(data.loc[missing_idx, feature_cols])

    # Update the DataFrame with the predicted ratings for missing values
    data.loc[missing_idx, 'rating'] = predicted_ratings

    # Re-train the model on the updated data (observed and imputed ratings)
    X = data[feature_cols]
    y = data['rating']
    model.fit(X, y)


In [None]:
# After completing the iterations, print the final updated values
print("\nFinal Updated DataFrame with Imputed Ratings:")
print(data[['user_id', 'anime_id', 'rating']].head())  # Print the top rows with updated ratings for review
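The iterative procedure of step 5 can also be sketched end to end on synthetic data. This is one possible variant under stated assumptions, not the exercise's reference solution: the missing rows are fixed by a mask taken once, the average features are recomputed on every pass, the tree is depth-limited (`max_depth=4`, an arbitrary choice), and it is re-fitted on the observed rows only. The `toy` data and the `changes` list are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for rating.csv with roughly 20% of ratings set to -1
rng = np.random.default_rng(0)
n_users, n_anime = 20, 15
toy = pd.DataFrame({
    "user_id": np.repeat(np.arange(n_users), n_anime),
    "anime_id": np.tile(np.arange(n_anime), n_users),
})
toy["rating"] = rng.integers(1, 11, size=len(toy)).astype(float)
missing_mask = rng.random(len(toy)) < 0.2
toy.loc[missing_mask, "rating"] = -1.0
original_observed = toy.loc[~missing_mask, "rating"].copy()

feature_cols = ["user_id", "anime_id", "user_avg_rating", "anime_avg_rating"]

# Initial fill: replace the placeholder with the mean of the observed ratings
observed = toy["rating"].where(~missing_mask)
toy["rating"] = observed.fillna(observed.mean())

changes = []  # mean absolute change of the imputed values per pass
for _ in range(5):
    # Recompute the average features from the current (partly imputed) ratings
    toy["user_avg_rating"] = toy.groupby("user_id")["rating"].transform("mean")
    toy["anime_avg_rating"] = toy.groupby("anime_id")["rating"].transform("mean")

    # Fit on the observed rows only, then re-predict the missing rows
    model = DecisionTreeRegressor(max_depth=4, random_state=42)
    model.fit(toy.loc[~missing_mask, feature_cols], toy.loc[~missing_mask, "rating"])
    new_values = model.predict(toy.loc[missing_mask, feature_cols])

    previous = toy.loc[missing_mask, "rating"].to_numpy()
    changes.append(float(np.abs(new_values - previous).mean()))
    toy.loc[missing_mask, "rating"] = new_values

print("Mean absolute change per pass:", changes)
```

Watching `changes` shrink toward zero is one simple way to decide when further passes are no longer worthwhile.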

# 6. Evaluation:
Measure Accuracy:

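After filling in the missing ratings, evaluate how well the model predicts. The true values of the imputed ratings are unknown, so accuracy has to be measured on known ratings that were held out, such as the test split.
Use the root mean squared error (RMSE) or the mean absolute error (MAE) to measure prediction accuracy.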

In [None]:
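from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Caveat: if the model was re-fitted in step 5 on rows that include X_test,
# these scores are optimistic because the test rows are no longer unseen.
test_predictions = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
mae = mean_absolute_error(y_test, test_predictions)
print(f"RMSE of model: {rmse}")
print(f"MAE of model: {mae}")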
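One way to estimate the accuracy of the imputation itself is to hide a portion of the known ratings, impute them, and compare the result with the hidden truth. The sketch below does this on synthetic data. The 20% hiding rate, `max_depth=5`, and all names are illustrative assumptions; the average features are computed from the visible ratings only, so the hidden values do not leak into the features.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic, fully known ratings built from random user and anime biases
rng = np.random.default_rng(1)
n_users, n_anime = 25, 20
users = np.repeat(np.arange(n_users), n_anime)
animes = np.tile(np.arange(n_anime), n_users)
raw = (6 + rng.normal(0, 1.5, n_users)[users]
         + rng.normal(0, 1.5, n_anime)[animes]
         + rng.normal(0, 0.5, len(users)))
toy = pd.DataFrame({
    "user_id": users,
    "anime_id": animes,
    "rating": np.clip(np.round(raw), 1, 10),
})

# Hide about 20% of the known ratings to play the role of missing values
hidden = rng.random(len(toy)) < 0.2
visible_rating = toy["rating"].where(~hidden)
global_avg = visible_rating.mean()

# Average features computed from visible ratings only
toy["user_avg_rating"] = (
    visible_rating.groupby(toy["user_id"]).transform("mean").fillna(global_avg)
)
toy["anime_avg_rating"] = (
    visible_rating.groupby(toy["anime_id"]).transform("mean").fillna(global_avg)
)
feature_cols = ["user_id", "anime_id", "user_avg_rating", "anime_avg_rating"]

# Fit on the visible rows, impute the hidden rows
model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(toy.loc[~hidden, feature_cols], toy.loc[~hidden, "rating"])
imputed = model.predict(toy.loc[hidden, feature_cols])

# Compare the imputed values with the ratings that were hidden
truth = toy.loc[hidden, "rating"]
rmse = float(np.sqrt(mean_squared_error(truth, imputed)))
mae = float(mean_absolute_error(truth, imputed))
print(f"Imputation RMSE: {rmse:.3f}")
print(f"Imputation MAE:  {mae:.3f}")
```

On the real dataset the same idea applies: hide some ratings that are not -1, run the imputation, and score only those hidden rows.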