# Required Discussion 19:1: Building a Recommender System with SURPRISE

This discussion explores additional algorithms in the `Surprise` library for generating recommendations. Your goal is to identify the best algorithm by minimizing the mean squared error under cross-validation. You will also select a dataset from the [grouplens](https://grouplens.org/datasets/movielens/) example datasets.

To begin, head over to [grouplens](https://grouplens.org/datasets/movielens/) and examine the available datasets. Choose one that is easy to load in the form `Surprise` expects: user, item, and rating information. Then compare the performance of at least the `KNNBasic`, `SVD`, `NMF`, `SlopeOne`, and `CoClustering` algorithms. For more information on the algorithms, see the [prediction algorithms documentation](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

Share the results of your investigation with your peers, including your cross-validation results and a basic description of your dataset.


In [5]:
from surprise import Dataset, Reader, SVD, NMF, KNNBasic, SlopeOne, CoClustering
from surprise.model_selection import cross_validate

import pandas as pd

In [None]:
# Load the MovieLens 100k dataset
reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file('data/ml-100k/u.data', reader=reader)

# Create a list of algorithms to test
algorithms = {
    'SVD': SVD(),
    'NMF': NMF(),
    'KNNBasic': KNNBasic(),
    'SlopeOne': SlopeOne(),
    'CoClustering': CoClustering()
}

# Dictionary to store results
results = {}

# Perform cross-validation for each algorithm
for name, algo in algorithms.items():
    print(f"Running cross-validation for {name}...")
    cv_results = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
    results[name] = {
        'RMSE': cv_results['test_rmse'].mean(),
        'MSE': cv_results['test_rmse'].mean() ** 2,  # Approximate MSE as (mean RMSE)^2; slightly below the mean of per-fold MSEs
        'MAE': cv_results['test_mae'].mean()
    }

# Create a DataFrame to display results
results_df = pd.DataFrame(results).T
results_df = results_df.sort_values('MSE')  # Sort by MSE since that's our primary metric
print("\nFinal Results:")
print(results_df)
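
A note on the `MSE` value computed in the loop above: squaring the mean RMSE is not quite the same as averaging the per-fold MSE. The sketch below uses made-up per-fold RMSE values (hypothetical numbers, not taken from this run) to show how the two quantities relate.

```python
# Hypothetical per-fold RMSE values, for illustration only (not from the run above).
fold_rmse = [0.93, 0.94, 0.95, 0.935, 0.945]

mean_rmse = sum(fold_rmse) / len(fold_rmse)

# What the notebook reports: the square of the mean RMSE.
mse_from_mean_rmse = mean_rmse ** 2

# Mean of the per-fold MSE values (each fold's RMSE squared).
mean_fold_mse = sum(r ** 2 for r in fold_rmse) / len(fold_rmse)

# The difference is the population variance of the per-fold RMSE,
# so it is never negative and is tiny when the folds agree closely.
gap = mean_fold_mse - mse_from_mean_rmse
print(f"(mean RMSE)^2 = {mse_from_mean_rmse:.6f}")
print(f"mean fold MSE = {mean_fold_mse:.6f}")
print(f"gap           = {gap:.6f}")
```

With fold-to-fold standard deviations on the order of 0.005, the gap is far below the fourth decimal place, so the ranking is unaffected. If your installed Surprise version supports it, adding `'MSE'` to `measures` and averaging `cv_results['test_mse']` should avoid the approximation altogether.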

## Dataset & Methodology
We used the MovieLens 100k dataset and ran 5-fold cross-validation to evaluate five recommendation algorithms. Note that passing `cv=5` draws a fresh random split on each `cross_validate` call unless a seeded splitter is supplied, so the folds are not guaranteed to be identical across algorithms.
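
Giving every algorithm identical folds requires a fixed seed for the fold generator. In Surprise this is typically done by building one `KFold(n_splits=5, random_state=...)` from `surprise.model_selection` and passing it as `cv`. The pure-Python sketch below does not use Surprise; it relies on a hypothetical `make_folds` helper only to illustrate why a shared seed yields the same folds on every call.

```python
import random


def make_folds(n_ratings, n_splits, seed):
    """Hypothetical helper: shuffle rating indices with a fixed seed and deal them into folds."""
    indices = list(range(n_ratings))
    random.Random(seed).shuffle(indices)  # a private RNG, so the seed fully determines the order
    return [indices[i::n_splits] for i in range(n_splits)]


# Two separate calls with the same seed, as if made for two different algorithms.
folds_for_svd = make_folds(20, 5, seed=42)
folds_for_nmf = make_folds(20, 5, seed=42)

print("identical folds:", folds_for_svd == folds_for_nmf)
for k, fold in enumerate(folds_for_svd):
    print(f"fold {k}: {fold}")
```

Without a fixed seed, each call would shuffle differently, and part of any difference between algorithms would come from the split rather than the model.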


## Algorithm Performance Summary
Final results for all algorithms, sorted by MSE (lower is better):

| Algorithm     | RMSE    | MSE     | MAE     |
|:-------------|:--------|:--------|:--------|
| SVD          | 0.9375  | 0.8790  | 0.7379  |
| SlopeOne     | 0.9449  | 0.8928  | 0.7427  |
| CoClustering | 0.9625  | 0.9264  | 0.7534  |
| NMF          | 0.9644  | 0.9302  | 0.7590  |
| KNNBasic     | 0.9775  | 0.9556  | 0.7720  |
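
As a quick sanity check, the MSE column should equal the RMSE column squared, up to rounding of the displayed RMSE. The sketch below uses only the numbers in the table.

```python
# (RMSE, MSE) pairs copied from the results table above.
table = {
    "SVD": (0.9375, 0.8790),
    "SlopeOne": (0.9449, 0.8928),
    "CoClustering": (0.9625, 0.9264),
    "NMF": (0.9644, 0.9302),
    "KNNBasic": (0.9775, 0.9556),
}

for name, (rmse, mse) in table.items():
    diff = abs(rmse ** 2 - mse)
    print(f"{name:<13} RMSE^2={rmse ** 2:.4f}  reported MSE={mse:.4f}  diff={diff:.4f}")

# Largest disagreement between the squared RMSE and the reported MSE.
max_diff = max(abs(rmse ** 2 - mse) for rmse, mse in table.values())

# The table should already be ordered by ascending MSE.
sorted_names = sorted(table, key=lambda name: table[name][1])
print("max diff:", round(max_diff, 4))
print("sorted by MSE:", sorted_names == list(table))
```

The small differences that remain are consistent with the MSE having been computed from unrounded RMSE values.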

### SVD Performance
- Best overall performer with lowest RMSE (0.9375)
- Consistent across folds (Std Dev: 0.0044)
- Efficient processing: average fit time 0.40s, test time 0.05s

### SlopeOne Performance

- Second-best performer (RMSE: 0.9449)
- Highly consistent results (Std Dev: 0.0029)
- Moderate processing speed: fit time 0.26s, test time 0.87s


### CoClustering Performance

- Middle-range performer (RMSE: 0.9625)
- Reasonable consistency (Std Dev: 0.0052)
- Good efficiency: fit time 0.47s, test time 0.05s


### NMF Performance
- Fourth place (RMSE: 0.9644)
- Least consistent of the five (Std Dev: 0.0054)
- Moderate efficiency: fit time 0.55s, test time 0.05s

### KNNBasic Performance
- Lowest-performing algorithm (RMSE: 0.9775)
- Most consistent results (Std Dev: 0.0026)
- Slowest test time at 1.04s

## Conclusions
SVD came out ahead, offering the best balance of accuracy and computational efficiency. All five algorithms performed within a narrow band (RMSE 0.9375 to 0.9775), but SVD's combination of the lowest error and fast fit and test times makes it the most practical choice for this dataset.
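
To make the efficiency comparison concrete, the sketch below adds the average fit and test times reported in the per-algorithm sections. KNNBasic is left out because only its test time (1.04s) is reported above.

```python
# (fit seconds, test seconds) copied from the per-algorithm bullets above.
timings = {
    "SVD": (0.40, 0.05),
    "SlopeOne": (0.26, 0.87),
    "CoClustering": (0.47, 0.05),
    "NMF": (0.55, 0.05),
}

# Combined time per cross-validation fold, rounded to the precision of the inputs.
totals = {name: round(fit + test, 2) for name, (fit, test) in timings.items()}

for name, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{name:<13} {total:.2f}s")

fastest = min(totals, key=totals.get)
print("fastest overall:", fastest)
```

SlopeOne has the shortest fit time, but its test time dominates, which is why the combined figure is the fairer basis for comparison.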

### Analysis Summary

We explored the MovieLens 100K dataset, which contains 100,000 ratings from 943 users on 1,682 movies, on a 1-5 scale. We evaluated five popular algorithms with 5-fold cross-validation, and SVD stood out: it achieved the best accuracy, with an MSE of 0.879, and its RMSE varied little across folds (Std Dev: 0.0044). SlopeOne was a solid runner-up, while KNNBasic had the highest error despite its straightforward approach.

What's particularly interesting is how the computational demands varied. Although KNNBasic is conceptually simple, it had the slowest test time (1.04s), likely because of its similarity calculations. SVD, by contrast, was both fast and accurate. The gap between the best and worst algorithms was small (RMSE 0.9375 versus 0.9775), but for movie recommendations even small improvements can matter. Based on these results, SVD is the best choice here for balancing accuracy and speed.
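
For a rough sense of how large these gaps are, the sketch below expresses them in relative terms, using only the figures reported above. Treating the fold standard deviation as a noise scale is an informal heuristic assumed here, not a significance test.

```python
# RMSE and fold standard deviation, as reported in the sections above.
rmse = {"SVD": 0.9375, "SlopeOne": 0.9449, "KNNBasic": 0.9775}
std_dev = {"SVD": 0.0044, "SlopeOne": 0.0029, "KNNBasic": 0.0026}

# Relative RMSE reduction of the best algorithm over the worst.
relative_gain = (rmse["KNNBasic"] - rmse["SVD"]) / rmse["KNNBasic"]

# Gap between the top two, measured in units of the larger fold standard deviation.
top_two_gap = rmse["SlopeOne"] - rmse["SVD"]
gap_in_std = top_two_gap / max(std_dev["SVD"], std_dev["SlopeOne"])

print(f"SVD vs KNNBasic: {relative_gain:.1%} lower RMSE")
print(f"SVD vs SlopeOne: gap {top_two_gap:.4f}, about {gap_in_std:.1f} fold std devs")
```

By this measure SVD's lead over KNNBasic is a few percent, while its lead over SlopeOne is within about two fold standard deviations, so the ordering of the top two is less certain than the ordering of best and worst.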