# Trying things out
## Step 1: Download ml-latest-small and unzip

In [1]:
import os
import requests
import zipfile

# URL of the dataset
url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
# Path to save the downloaded zip file
zip_path = "ml-latest-small.zip"
# Directory to extract the contents
extract_dir = "ml-latest-small"

# Check if the directory already exists
if not os.path.exists(extract_dir):
    # Download the zip file
    print("Downloading the dataset...")
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # stop here if the download failed
    with open(zip_path, "wb") as file:
        file.write(response.content)
    
    # Unzip the file
    print("Unzipping the dataset...")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_dir)
    
    # Clean up the zip file
    os.remove(zip_path)
    print("Download and extraction complete.")
else:
    print("Dataset already exists. No download needed.")

Dataset already exists. No download needed.


## Step 2: Load rating data into pandas and clean up

In [2]:
import pandas as pd
df = pd.read_csv("./ml-latest-small/ml-latest-small/ratings.csv")
df.drop(['timestamp'], axis=1, inplace=True)
df.head()

|   | userId | movieId | rating |
|---|--------|---------|--------|
| 0 | 1 | 1 | 4.0 |
| 1 | 1 | 3 | 4.0 |
| 2 | 1 | 6 | 4.0 |
| 3 | 1 | 47 | 5.0 |
| 4 | 1 | 50 | 5.0 |


## Step 3: Load data into surprise

In [3]:
from surprise import Dataset
from surprise import Reader

# MovieLens ratings run from 0.5 to 5.0 in half-star steps
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df[["userId", "movieId", "rating"]], reader)

## Step 4: Simple KNN recommender

In [7]:
from surprise import KNNWithMeans

# To use user-based cosine similarity
sim_options = {
    "name": "cosine",
    "user_based": True,  # Compute  similarities between users
}
algo = KNNWithMeans(sim_options=sim_options)

In [8]:
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7fb661ac5150>

In [21]:
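# Scan movie ids in order until user 100's predicted rating drops below 1.5.
# Ids missing from the training set still get a fallback estimate from Surprise.
score = 5
movie = 1
while score >= 1.5:
    prediction = algo.predict(100, movie)
    score = prediction.est
    movie += 1
# movie was incremented after the last prediction, so the match is movie - 1
score, movie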

(1.2047694753577107, 179)

In [27]:
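# Scan for the first movie whose predicted rating for user 100 falls in [4.99, 5),
# skipping estimates of exactly 5.
score = 0
movie = 1
while score < 4.99 or score == 5:
    prediction = algo.predict(100, movie)
    score = prediction.est
    movie += 1
# As in the previous cell, the matching movie id is movie - 1
score, movie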

(4.993689831844592, 2296)
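
Both search loops increment `movie` once more after the prediction that ends the loop, so the id they report is one past the movie that produced the score (178 rather than 179 in the first cell). A minimal sketch of a helper that returns the matching id directly. `first_match` and its parameters are hypothetical names, and it assumes only an object with a Surprise-style `predict(uid, iid)` method whose result has an `.est` attribute:

```python
from typing import Callable, Iterable, Optional, Tuple


def first_match(
    algo,
    user_id,
    movie_ids: Iterable,
    condition: Callable[[float], bool],
) -> Optional[Tuple[object, float]]:
    """Return (movie_id, estimate) for the first movie whose estimate satisfies condition."""
    for movie_id in movie_ids:
        est = algo.predict(user_id, movie_id).est
        if condition(est):
            return movie_id, est
    return None  # no candidate satisfied the condition
```

With the fitted model, `first_match(algo, 100, range(1, 3000), lambda s: s < 1.5)` would be the equivalent of the first loop; the upper bound of 3000 is an arbitrary choice for illustration.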

## Algorithm Benchmarking

In [36]:
from surprise import SVD, SVDpp, SlopeOne, NMF, NormalPredictor, KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, CoClustering
from surprise.model_selection import cross_validate

benchmark = []

# Algorithms to compare, all with default parameters
algorithms = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), CoClustering()]

print("Attempting: ", str(algorithms), '\n\n\n')

# Iterate over all algorithms
for algorithm in algorithms:
    print("Starting: ", str(algorithm))
    # Perform 3-fold cross validation
    results = cross_validate(algorithm, data, measures=['RMSE', 'MAE'], cv=3, verbose=False)

    # Average each metric over the folds and append the algorithm's class name.
    # Series._append is a private pandas method, used here in place of the removed Series.append.
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = tmp._append(pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm']))
    benchmark.append(tmp)
    print("Done: ", str(algorithm), "\n\n")

print('\n\tDONE\n')

Attempting:  [<surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7fb612e87850>, <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x7fb664448590>, <surprise.prediction_algorithms.slope_one.SlopeOne object at 0x7fb664437b10>, <surprise.prediction_algorithms.matrix_factorization.NMF object at 0x7fb664449250>, <surprise.prediction_algorithms.random_pred.NormalPredictor object at 0x7fb661c04250>, <surprise.prediction_algorithms.knns.KNNBaseline object at 0x7fb660875610>, <surprise.prediction_algorithms.knns.KNNBasic object at 0x7fb6940e5510>, <surprise.prediction_algorithms.knns.KNNWithMeans object at 0x7fb65e52dc50>, <surprise.prediction_algorithms.knns.KNNWithZScore object at 0x7fb65e52e010>, <surprise.prediction_algorithms.co_clustering.CoClustering object at 0x7fb65e52f110>] 



Starting:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7fb612e87850>
Done:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7fb

In [37]:
surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
surprise_results

| Algorithm | test_rmse | test_mae | fit_time (s) | test_time (s) |
|-----------|-----------|----------|--------------|---------------|
| SVDpp | 0.868893 | 0.666767 | 50.728872 | 11.452713 |
| SVD | 0.879154 | 0.675766 | 0.910505 | 0.191343 |
| KNNBaseline | 0.880937 | 0.673486 | 0.212441 | 1.617064 |
| KNNWithZScore | 0.903403 | 0.685033 | 0.129243 | 1.516339 |
| KNNWithMeans | 0.905492 | 0.692235 | 0.084611 | 1.317035 |
| SlopeOne | 0.907989 | 0.695756 | 3.923813 | 6.456029 |
| NMF | 0.934452 | 0.716532 | 1.828848 | 0.20209 |
| CoClustering | 0.950586 | 0.736584 | 1.822705 | 0.226071 |
| KNNBasic | 0.956058 | 0.734074 | 0.064636 | 1.282334 |
| NormalPredictor | 1.42228 | 1.138334 | 0.077247 | 0.1346 |


### Conclusion:
Based on the results above, SVD performs nearly as well as SVD++ (RMSE 0.879 vs. 0.869, MAE 0.676 vs. 0.667) while taking far less time to fit (0.91 s vs. 50.7 s) and to test (0.19 s vs. 11.5 s). Therefore we should probably use SVD for our recommendation system.
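
To put the trade-off in relative terms, here is a small arithmetic sketch using the SVD and SVD++ rows of the benchmark table. The dictionary and variable names are just for illustration:

```python
# Figures copied from the benchmark table (means over 3-fold cross-validation).
svdpp = {"test_rmse": 0.868893, "fit_time": 50.728872, "test_time": 11.452713}
svd = {"test_rmse": 0.879154, "fit_time": 0.910505, "test_time": 0.191343}

# Relative RMSE penalty of SVD, and how many times faster it is than SVD++.
rmse_gap_pct = 100 * (svd["test_rmse"] - svdpp["test_rmse"]) / svdpp["test_rmse"]
fit_speedup = svdpp["fit_time"] / svd["fit_time"]
test_speedup = svdpp["test_time"] / svd["test_time"]

print(f"SVD RMSE is {rmse_gap_pct:.2f}% higher than SVD++")
print(f"SVD fits {fit_speedup:.1f}x faster and tests {test_speedup:.1f}x faster")
```

The RMSE penalty is on the order of one percent, while the speedup is well over an order of magnitude, which is what makes SVD the better trade-off here.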

### Credit:
This experiment roughly follows [this Google Colab notebook](https://colab.research.google.com/github/singhsidhukuldeep/Recommendation-System/blob/master/Building_Recommender_System_with_Surprise.ipynb#scrollTo=_zqHKyGm38B4), but uses my own dataset. I also made some minor changes to the parameters used and changed `tmp.append` to `tmp._append` to suppress a pandas error.
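
Since `_append` is a private pandas method, one alternative is to build each benchmark row as a plain dict and skip the Series append entirely. A minimal sketch, where `summarize` is a hypothetical helper and `results` is assumed to have the shape returned by `cross_validate` (a dict mapping metric names to per-fold values):

```python
def summarize(results, algorithm):
    """Average each cross-validation metric and attach the algorithm's class name."""
    row = {metric: sum(values) / len(values) for metric, values in results.items()}
    row["Algorithm"] = type(algorithm).__name__  # e.g. "SVD"
    return row
```

Inside the benchmark loop this would be used as `benchmark.append(summarize(results, algorithm))`; `pd.DataFrame(benchmark)` accepts a list of dicts, so the cell that builds `surprise_results` would not need to change.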

## NMF & SVD for Rec Sys: how do they work?
They actually work pretty similarly. We start with a user-rating matrix in which each row is a user, each column is a movie (for example), and each entry is that user's rating of that movie.

Most of the cells in this matrix are blank, since each person rates only a small percentage of all the movies out there. Both methods provide a way to estimate what the ratings in those blank cells would be if they were filled in.

Both methods work by decomposing the user-rating matrix into lower-dimensional matrices that, when multiplied together, produce a matrix with the same dimensions as the original user-rating matrix. More importantly, for the cells that are filled in in the original matrix, the corresponding values in the product are close to the original values. That is why we expect these methods "might" make an accurate estimate of the missing ratings (the unfilled cells in the original matrix).
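
In symbols, a common way to write this down (a sketch of the plain, bias-free model; the notation and the regularization weight $\lambda$ are my own choices, and Surprise's `SVD` additionally adds a global mean and per-user and per-item bias terms): let $R$ be the $m \times n$ user-rating matrix, $\mathcal{K}$ the set of filled cells, and $p_u, q_i \in \mathbb{R}^k$ the factor vectors for user $u$ and movie $i$, with $k$ much smaller than $m$ and $n$.

$$
\hat{r}_{ui} = p_u^\top q_i,
\qquad
\min_{P,\,Q} \sum_{(u,i) \in \mathcal{K}} \left[ \left( r_{ui} - p_u^\top q_i \right)^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \right]
$$

Only the filled cells in $\mathcal{K}$ enter the sum; the estimate for a blank cell is simply $\hat{r}_{ui}$ for a pair $(u, i) \notin \mathcal{K}$. NMF uses the same product but constrains the entries of $p_u$ and $q_i$ to be non-negative.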

The intuition is that factoring the original matrix extracts its latent factors, which encode the information behind the ratings, such as how much a user values action movies compared with other genres, or how much action a movie contains compared with other genres. Multiplying the smaller matrices back together then generates a rating from that information.
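
To make the latent-factor idea concrete, here is a toy sketch in plain Python that learns two factors per user and per movie with stochastic gradient descent and then fills in two held-out cells. It is not Surprise's implementation; the data, the function names, and the hyperparameters (`k`, `lr`, `reg`, `epochs`) are all assumptions chosen for illustration:

```python
import random


def factorize(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            # Error on one known rating, then a gradient step on both factor vectors.
            err = r - sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Estimated rating: dot product of the user's and the item's factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


# Toy data: items 0-1 are "action" movies, items 2-3 are "romance" movies.
full = [
    [5, 5, 1, 1],
    [5, 5, 1, 1],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [4, 4, 2, 2],
]
hidden = {(1, 3), (3, 0)}  # cells we pretend were never rated
ratings = [
    (u, i, r)
    for u, row in enumerate(full)
    for i, r in enumerate(row)
    if (u, i) not in hidden
]

P, Q = factorize(ratings, n_users=5, n_items=4)
for u, i in sorted(hidden):
    print(f"user {u}, item {i}: predicted {predict(P, Q, u, i):.2f}, held-out value {full[u][i]}")
```

Only the known ratings are used for training, yet the held-out cells get estimates because each user and each movie ends up described by the same two learned factors.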

NMF is generally considered faster but less accurate than SVD, but in this benchmark SVD was faster as well (0.91 s vs. 1.83 s to fit; I suspect some optimization quirk), so we should probably stick with SVD.

If this doesn't work, fall back to item-based KNN: https://www.analyticsvidhya.com/blog/2020/11/create-your-own-movie-movie-recommendation-system/
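
For the item-based fallback, the Step 4 setup should carry over with one option flipped. A minimal sketch of the options dict (`item_sim_options` is a name chosen here for illustration):

```python
# Same structure as the Step 4 options, but comparing movies instead of users.
item_sim_options = {
    "name": "cosine",
    "user_based": False,  # False makes Surprise's KNN algorithms compute item-item similarities
}
```

It would be passed the same way as before, e.g. `KNNWithMeans(sim_options=item_sim_options)`.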