### Required Discussion 19:1: Building a Recommender System with SURPRISE

This discussion focuses on exploring additional algorithms in the `Surprise` library to generate recommendations. Your goal is to identify the best-performing algorithm by minimizing the (root) mean squared error under cross-validation. You will also select a dataset from the [grouplens](https://grouplens.org/datasets/movielens/) MovieLens datasets.
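
To make the evaluation criterion concrete, here is a minimal sketch of k-fold cross-validation scored with RMSE and MAE. It uses only the standard library and a global-mean baseline predictor rather than any `Surprise` algorithm. The function name `kfold_rmse_mae` and the toy ratings are hypothetical, chosen only for illustration.

```python
import math
import random


def kfold_rmse_mae(ratings, k=5, seed=0):
    """Return per-fold (RMSE, MAE) for a global-mean baseline predictor."""
    rows = list(ratings)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]  # k disjoint, roughly equal folds
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the held-out one
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        mean_rating = sum(r for _, _, r in train) / len(train)
        # The baseline predicts the training mean for every test rating
        errors = [r - mean_rating for _, _, r in test]
        rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
        mae = sum(abs(e) for e in errors) / len(errors)
        scores.append((rmse, mae))
    return scores


# Hypothetical toy data: (user, item, rating) triples on a 1-5 scale
toy_ratings = [(u, i, 1 + (u * 3 + i * 7) % 5) for u in range(10) for i in range(10)]
fold_scores = kfold_rmse_mae(toy_ratings, k=5)
mean_rmse = sum(rmse for rmse, _ in fold_scores) / len(fold_scores)
print(f"Mean RMSE over {len(fold_scores)} folds: {mean_rmse:.3f}")
```

RMSE is the square root of MSE, so ranking algorithms by either metric gives the same order. RMSE is usually reported because it is in the same units as the ratings.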

To begin, head over to [grouplens](https://grouplens.org/datasets/movielens/) and examine the different datasets available. Choose one that is easy to load into the format `Surprise` expects: user, item, and rating information. Then compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms to build your recommendations. For more information on the algorithms, see the documentation for the prediction algorithms package [here](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).
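
As a rough illustration of the user, item, rating layout, the sketch below parses a few tab-separated rows in the column order used by the `u.data` file of MovieLens 100K (user id, item id, rating, timestamp) and keeps only the first three fields. The sample rows and the helper `read_ratings` are hypothetical. In practice, `Surprise` can read such a file directly through `Reader(line_format='user item rating timestamp', sep='\t')` together with `Dataset.load_from_file`.

```python
import csv
import io

# Hypothetical rows in the layout: user id, item id, rating, timestamp
sample = "1\t10\t4\t100\n1\t20\t2\t101\n2\t10\t5\t102\n"


def read_ratings(handle):
    """Yield (user, item, rating) tuples, dropping the timestamp column."""
    for user, item, rating, _timestamp in csv.reader(handle, delimiter="\t"):
        # IDs stay as strings; only the rating is converted to a number
        yield user, item, float(rating)


ratings = list(read_ratings(io.StringIO(sample)))
print(ratings)
```

A dataset whose ratings file already has one row per (user, item, rating) needs the least reshaping before it can be handed to `Surprise`.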

Share the results of your investigation with your peers, including your cross-validation results and a basic description of your dataset.


In [1]:
# Requires the scikit-surprise package, which provides the `surprise` module:
#   pip install scikit-surprise
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise.model_selection import cross_validate

import pandas as pd

In [2]:
# Load the built-in MovieLens 100K dataset (downloaded on first use)
data = Dataset.load_builtin('ml-100k')

In [None]:
# Define algorithms to test
algorithms = {
    'SVD': SVD(),
    'NMF': NMF(),
    'KNNBasic': KNNBasic(),
    'SlopeOne': SlopeOne(),
    'CoClustering': CoClustering()
}

# Results dictionary to store performance metrics
results = {}

# Perform cross-validation for each algorithm
for name, algorithm in algorithms.items():
    print(f"Cross-validating {name}...")
    cv_results = cross_validate(algorithm, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
    
    # Store the results
    results[name] = {
        'RMSE_mean': cv_results['test_rmse'].mean(),
        'RMSE_std': cv_results['test_rmse'].std(),
        'MAE_mean': cv_results['test_mae'].mean(),
        'MAE_std': cv_results['test_mae'].std()
    }

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Convert results to DataFrame for easier analysis
results_df = pd.DataFrame({
    algo: [metrics['RMSE_mean'], metrics['MAE_mean']] 
    for algo, metrics in results.items()
}, index=['RMSE', 'MAE'])

# Display the results table
print("\nResults Summary:")
print(results_df)

# Create bar plot of results
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(results))
width = 0.35

rmse_bars = ax.bar(x - width/2, [results[algo]['RMSE_mean'] for algo in algorithms], 
                   width, label='RMSE')
mae_bars = ax.bar(x + width/2, [results[algo]['MAE_mean'] for algo in algorithms], 
                  width, label='MAE')

ax.set_xlabel('Algorithms')
ax.set_ylabel('Error')
ax.set_title('Comparison of Recommender Algorithms on MovieLens Dataset')
ax.set_xticks(x)
ax.set_xticklabels(list(algorithms.keys()))
ax.legend()

# Add error values on top of bars
def add_labels(bars):
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.3f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom')

add_labels(rmse_bars)
add_labels(mae_bars)

plt.tight_layout()
plt.show()

# Find the algorithm with the lowest mean RMSE across folds
best_algo = min(results, key=lambda name: results[name]['RMSE_mean'])
print(f"\nThe best algorithm based on RMSE is: {best_algo}")
