## 2. Movie Recommendation using Collaborative Filtering

### Importing libraries and packages

In [1]:
import pandas as pd
import numpy as np

### Ingest

In [3]:

ratings = pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880000.0
1,1,306,3.5,1147869000.0
2,1,307,5.0,1147869000.0
3,1,665,5.0,1147879000.0
4,1,899,3.5,1147869000.0


### EDA

Total number of distinct movies that users have rated:

In [None]:
len(ratings['movieId'].unique())

54343

Total number of distinct users who have given ratings:

In [None]:
len(ratings['userId'].unique())

81292

In [None]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12572782 entries, 0 to 12572781
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  float64
dtypes: float64(2), int64(2)
memory usage: 383.7 MB
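
The counts above imply a very sparse user-by-movie matrix: 12,572,782 ratings over 81,292 × 54,343 possible pairs is roughly 0.28% density, assuming each pair is rated at most once. A minimal sketch of that calculation follows; the helper name `matrix_density` and the `toy` frame are hypothetical and only mirror the column names of `ratings`.

```python
import pandas as pd

def matrix_density(df: pd.DataFrame) -> float:
    """Fraction of the user x movie grid that holds a rating (hypothetical helper)."""
    n_users = df['userId'].nunique()
    n_movies = df['movieId'].nunique()
    # Count each (user, movie) pair once, in case of duplicate rows
    n_pairs = len(df.drop_duplicates(['userId', 'movieId']))
    return n_pairs / (n_users * n_movies)

# Toy data for illustration only; in the notebook one would pass `ratings`
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [5.0, 3.5, 4.0, 2.0],
})
density = matrix_density(toy)
print(f"density = {density:.3f}")
```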


### Model

In [5]:
pip install scikit-surprise


Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
     772.0/772.0 kB 7.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3163001 sha256=589caa7f540ee41e9b5d66b00be12964e3e2d3a285f4646b53d6876726e10109
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


In [6]:
from surprise import Dataset
from surprise import Reader
from surprise import SVD
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

In [7]:

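# Derive the rating scale from the data so that ratings below 1.0, if present, are not clipped
reader = Reader(rating_scale=(ratings['rating'].min(), ratings['rating'].max()))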
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)


In [8]:
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8933  0.9037  0.8902  0.9077  0.8880  0.8966  0.0077  
MAE (testset)     0.6882  0.6914  0.6838  0.6984  0.6856  0.6895  0.0051  
Fit time          0.98    0.70    0.70    0.71    0.86    0.79    0.11    
Test time         0.08    0.19    0.05    0.06    0.05    0.09    0.05    


{'test_rmse': array([0.89334804, 0.90373296, 0.89018585, 0.90766251, 0.88795967]),
 'test_mae': array([0.68822518, 0.69139477, 0.68383033, 0.69841885, 0.6856272 ]),
 'fit_time': (0.984285831451416,
  0.6956889629364014,
  0.6999020576477051,
  0.7073378562927246,
  0.85504150390625),
 'test_time': (0.08008670806884766,
  0.1888904571533203,
  0.05290412902832031,
  0.05820751190185547,
  0.05071210861206055)}
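
For reference, the two metrics in the table can be written out directly. This is a sketch of the standard definitions in NumPy, not Surprise's internal code; the function names and the toy ratings are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalises large errors more heavily."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the error in rating points."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))

# Toy values for illustration only
true_ratings = [5.0, 3.5, 4.0, 2.0]
predicted = [4.5, 3.5, 3.0, 3.0]

toy_rmse = rmse(true_ratings, predicted)
toy_mae = mae(true_ratings, predicted)
print(toy_rmse, toy_mae)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions, which matches the table above (0.8966 vs. 0.6895).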

### Prediction

**Predicting the rating for a user and movie pair that is already present in the data**

In [None]:
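# Note: after cross_validate, algo is fitted on the last fold's training split only.
# r_ui is the known true rating; it is stored in the result for comparison, not used for the estimate.
algo.predict(uid=2, iid=1, r_ui=3)

In [None]:
algo.predict(uid=1, iid=296, r_ui=5)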
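
For context, Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item latent factors, and by default clips the result to the rating scale. The sketch below reproduces that rule with made-up parameter values; `svd_estimate` is a hypothetical helper and none of the numbers come from the fitted model.

```python
import numpy as np

def svd_estimate(mu, b_u, b_i, p_u, q_i, rating_scale=(1.0, 5.0)):
    """Biased matrix-factorisation estimate: mu + b_u + b_i + q_i . p_u, clipped."""
    est = mu + b_u + b_i + float(np.dot(q_i, p_u))
    low, high = rating_scale
    return min(high, max(low, est))

# Made-up parameters for illustration only
mu = 3.5                      # global mean rating
b_u, b_i = 0.2, -0.1          # user and item biases
p_u = np.array([0.5, 1.0])    # user latent factors
q_i = np.array([1.0, 0.5])    # item latent factors

est = svd_estimate(mu, b_u, b_i, p_u, q_i)
clipped = svd_estimate(4.5, 0.5, 0.3, p_u, q_i)  # raw value exceeds 5.0
print(est, clipped)
```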

### Top 10 recommendations based on user ratings


In [9]:
def recommend(u_id):
  # Get an array of all movieIds
  all_movieIds = ratings['movieId'].unique()

  # Get the set of movieIds the user has already rated
  # (a set is needed: `in` on a pandas Series tests the index, not the values)
  movies_rated_by_user = set(ratings.loc[ratings['userId'] == u_id, 'movieId'])

  # Filter out the movies the user has already rated
  movies_to_recommend = [movieId for movieId in all_movieIds if movieId not in movies_rated_by_user]

  # Predict ratings for the movies the user hasn't rated yet
  predicted_ratings = [algo.predict(u_id, movieId).est for movieId in movies_to_recommend]

  # Create a DataFrame with movieIds and predicted ratings
  recommendations_df = pd.DataFrame({'movieId': movies_to_recommend, 'predicted_rating': predicted_ratings})

  # Sort the DataFrame by predicted ratings in descending order and get the top 10 recommendations
  top_recommendations = recommendations_df.sort_values(by='predicted_rating', ascending=False).head(10)

  # Print the recommendations
  print(top_recommendations[['movieId', 'predicted_rating']])


In [18]:
recommend(43)  # top 10 movies by predicted rating for the user with id 43

     movieId  predicted_rating
244       50          4.502814
75       318          4.502231
251      541          4.298504
981     1704          4.294241
0        296          4.255286
85       527          4.227149
460    27773          4.223318
1        306          4.217447
262     1148          4.204871
98      1196          4.191785
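
A variant of `recommend` that returns its result and takes the predictor as an argument is easier to test on small data. The following is only a sketch: `top_n_unseen`, `dummy_predict`, and the `toy` frame are hypothetical, and in the notebook the predictor would be something like `lambda u, m: algo.predict(u, m).est`.

```python
import pandas as pd

def top_n_unseen(ratings_df, u_id, predict_fn, n=10):
    """Return the n unseen movies with the highest predicted rating for u_id."""
    # Movies this user has already rated, as a set for fast membership tests
    seen = set(ratings_df.loc[ratings_df['userId'] == u_id, 'movieId'])
    candidates = [m for m in ratings_df['movieId'].unique() if m not in seen]
    scored = pd.DataFrame({
        'movieId': candidates,
        'predicted_rating': [predict_fn(u_id, m) for m in candidates],
    })
    return (scored.sort_values('predicted_rating', ascending=False)
                  .head(n)
                  .reset_index(drop=True))

# Toy data and a stand-in predictor, for illustration only
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 20, 30, 40],
    'rating':  [5.0, 3.5, 4.0, 2.0, 4.5],
})

def dummy_predict(u_id, movie_id):
    # Not a real model: higher movieId simply gets a higher score
    return movie_id / 10.0

top = top_n_unseen(toy, u_id=1, predict_fn=dummy_predict, n=2)
print(top)
```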


# 3. Evaluation

## 1. Content Based Recommendation
Because of the limited RAM available in Google Colab, my session crashed multiple times before the cosine similarity matrix could be computed. Switching to a GPU runtime did not resolve the issue. I also explored using Spark to handle the large dataset, but ran into difficulties implementing it.
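
To give a sense of scale, a dense n × n similarity matrix in float64 needs 8n² bytes, which is about 20 GB for n = 50,000 items. One possible workaround, which was not attempted here, is to compute similarities in row blocks and keep only the top-k neighbours per item. The sketch below assumes a dense feature matrix that fits in memory; `top_k_cosine`, the block size, and the random features are all hypothetical.

```python
import numpy as np

def top_k_cosine(features, k=5, block_size=1024):
    """Return (indices, scores) of the k most cosine-similar rows for each row.

    Requires k < number of rows. Only a (block_size x n) slice of the
    similarity matrix is held in memory at any time.
    """
    X = np.asarray(features, dtype=np.float32)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1e-12)          # unit-normalise; guard zero rows
    n = X.shape[0]
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=np.float32)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        sims = X[start:stop] @ X.T            # cosine similarity for this block
        # Exclude each item's similarity with itself
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        part = np.argpartition(-sims, k - 1, axis=1)[:, :k]   # unordered top k
        part_sim = np.take_along_axis(sims, part, axis=1)
        order = np.argsort(-part_sim, axis=1)                 # order those k
        top_idx[start:stop] = np.take_along_axis(part, order, axis=1)
        top_sim[start:stop] = np.take_along_axis(part_sim, order, axis=1)
    return top_idx, top_sim

# Random features for illustration only
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))
idx, sim = top_k_cosine(feats, k=3, block_size=16)
print(idx[:2], sim[:2])
```

The output is two n × k arrays instead of one n × n matrix, so memory grows linearly with the number of items.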

## 2. Collaborative Filtering
For the collaborative filtering method, I used the Surprise library (`scikit-surprise`) to train an SVD model that predicts the rating a user would give to a movie they have not yet rated. I evaluated the recommender with 5-fold cross-validation, which gave a mean RMSE of 0.8966 and a mean MAE of 0.6895.