
 Item and User based Collaborative Filtering


## Introduction:

**Collaborative filtering (CF)**: 

Collaborative filtering is a technique used by recommender systems: it predicts a user's preferences from the ratings given by many users. The two most common variants are user-based and item-based filtering.

**User based collaborative filtering (UBCF)** : 

Find similar users to me and recommend what they liked.

**Item based collaborative filtering (IBCF)** : 

Find similar items to those that I have previously liked.

## Data Description:
The MovieLens 100k dataset contains 100,000 ratings given by 943 users to 1,682 movies.


### Upload Data and Get An Insight

In [None]:
from google.colab import files

# Upload MovieLens 100k Dataset
files.upload()

In [None]:
import pandas

df = pandas.read_csv('ml-100k.data',header=None, sep='\t',names=["user_id", "movie_id", "rating","timestamp"])

# Get the Number of Users and Items
n_users, n_items = df['user_id'].unique().shape[0], df['movie_id'].unique().shape[0]
print(n_users, n_items)
df.head(5)

943 1682


| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


The first column is the row index, the second is the user ID, the third is the movie ID, and the fourth is the rating that user gave to that movie. This tutorial does not use the timestamp; only `user_id`, `movie_id`, and `rating` enter the calculations.
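The three columns used here can be pivoted into a user-item rating matrix, which is the kind of input the later similarity and prediction functions work on. The sketch below illustrates the idea on a few made-up triples. It is not the `loadData` helper from `utils.py`, whose internals are not shown in this tutorial, and the names `triples` and `build_rating_matrix` are hypothetical.

```python
import numpy as np

def build_rating_matrix(triples):
    """Turn (user_id, movie_id, rating) triples into a dense user-item matrix.

    IDs are shifted so that the smallest ID maps to index 0.
    Missing ratings stay 0.
    """
    triples = np.asarray(triples)
    uid_min, iid_min = triples[:, 0].min(), triples[:, 1].min()
    n_users = int(triples[:, 0].max() - uid_min + 1)
    n_items = int(triples[:, 1].max() - iid_min + 1)
    matrix = np.zeros((n_users, n_items))
    for uid, iid, rating in triples:
        matrix[uid - uid_min, iid - iid_min] = rating
    return matrix

# Toy data (made up for illustration): user_id, movie_id, rating
triples = [(1, 1, 5), (1, 2, 3), (2, 1, 4), (3, 3, 2)]
rating_matrix = build_rating_matrix(triples)
print(rating_matrix)
```

Each row is one user and each column one movie. With the real dataset the matrix would be 943 by 1682 and mostly zeros.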

## Random Recommendation

This baseline algorithm predicts a random rating drawn from the distribution of the training-set ratings, which is assumed to be normal.
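A minimal sketch of this idea, assuming the mean and standard deviation are estimated from the training ratings and that predictions are clipped to the 1 to 5 rating scale. It is not Surprise's actual implementation; `random_normal_predictions` and the toy ratings are hypothetical.

```python
import numpy as np

def random_normal_predictions(train_ratings, n_predictions, low=1.0, high=5.0, seed=0):
    """Draw predictions from a normal distribution fitted to the training ratings."""
    rng = np.random.default_rng(seed)
    mu = np.mean(train_ratings)      # estimated mean of the training ratings
    sigma = np.std(train_ratings)    # estimated standard deviation
    samples = rng.normal(mu, sigma, size=n_predictions)
    # Keep the predictions inside the valid rating range (assumed to be 1..5)
    return np.clip(samples, low, high)

# Toy training ratings (made up for illustration)
train_ratings = np.array([5, 3, 4, 1, 2, 4, 5, 3])
preds = random_normal_predictions(train_ratings, n_predictions=5)
print(preds)
```

Because the prediction ignores both the user and the item, this serves only as a baseline against which the collaborative filtering methods can be compared.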

In [None]:
from google.colab import files

# Upload utils.py
files.upload()

### Use the Surprise Library

[Surprise library](http://surpriselib.com/)

Surprise is a Python scikit for building and analyzing recommender systems.

In [None]:
# Install the scikit-surprise library
!pip install scikit-surprise



In [None]:
from utils import *
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
import warnings; warnings.simplefilter('ignore')

# Load the data
data, n_users, n_items = spr_loadData('ml-100k.data')

# Select the NormalPredictor Algorithm and Run It
algo = NormalPredictor()
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.5263  1.5172  1.5246  1.5341  1.5158  1.5236  0.0066  
Fit time          0.14    0.17    0.17    0.17    0.16    0.16    0.01    
Test time         0.25    0.13    0.23    0.13    0.13    0.18    0.06    


{'fit_time': (0.1415097713470459,
  0.16597723960876465,
  0.1653599739074707,
  0.16808438301086426,
  0.16346001625061035),
 'test_rmse': array([1.52630024, 1.51715398, 1.52460949, 1.5340634 , 1.51579938]),
 'test_time': (0.25188350677490234,
  0.1308126449584961,
  0.23431658744812012,
  0.13154888153076172,
  0.1317145824432373)}

## User-based Neighborhood method:

Imagine that we want to recommend a movie to our friend Daven. We could assume that similar people have similar taste. Suppose Daven and I have seen the same movies and rated them almost identically, except that I have seen *'Infinity War'* and Daven has not.

If I love that movie, it is reasonable to expect that he will too. In this way we have created an artificial rating based on our similarity.

Here we use the user-based nearest-neighbor algorithm, which consists of two tasks:

1. Find the nearest neighbors of user A, using a similarity function *sim* to measure how close each pair of users is.
2. Predict the rating that user A would give to every item the neighbors have consumed but A has not, and look for the item j with the best predicted rating.

In other words, we are working with a user-item matrix and filling in its missing entries.
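The two tasks above can be sketched on a toy matrix as follows. This is an illustration under simplifying assumptions, not the tutorial's implementation: it uses a plain similarity-weighted average without mean-centering, and the data, `cosine_sim`, and `k` are made up for the example.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy user-item matrix (made up): rows = users, columns = movies, 0 = not rated
ratings = np.array([
    [5.0, 4.0, 1.0, 0.0],   # user A has not rated movie 3
    [5.0, 5.0, 1.0, 5.0],   # similar taste to A
    [1.0, 2.0, 5.0, 1.0],   # rather different taste
])
target_user, target_item, k = 0, 3, 1

# Task 1: similarity between the target user and every other user
sims = {u: cosine_sim(ratings[target_user], ratings[u])
        for u in range(len(ratings)) if u != target_user}
neighbors = sorted(sims, key=sims.get, reverse=True)[:k]

# Task 2: similarity-weighted average of the neighbors' ratings for the item
weights = np.array([sims[u] for u in neighbors])
neighbor_ratings = np.array([ratings[u, target_item] for u in neighbors])
prediction = float(np.dot(weights, neighbor_ratings) / np.abs(weights).sum())
print(neighbors, prediction)
```

With `k = 1` the prediction is simply the rating of the single most similar user. Larger values of `k` blend several neighbors, weighted by their similarity.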


### Use the Surprise Library

In [None]:
from utils import *
from surprise.model_selection import cross_validate
from surprise import KNNWithMeans
import warnings; warnings.simplefilter('ignore')

# Load the data
data, n_users, n_items = spr_loadData('ml-100k.data')

# Select Algorithm KNNWithMeans and Run It
sim_options = {'name': 'cosine',
               'user_based': True  # compute similarities between users
              }
algo = KNNWithMeans(k=n_users,sim_options=sim_options)
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9626  0.9581  0.9547  0.9607  0.9659  0.9604  0.0038  
Fit time          1.19    1.21    1.24    1.23    1.23    1.22    0.02    
Test time         5.67    5.54    5.55    5.49    5.55    5.56    0.06    


{'fit_time': (1.1923482418060303,
  1.2132914066314697,
  1.2415211200714111,
  1.2330732345581055,
  1.2282624244689941),
 'test_rmse': array([0.96257679, 0.95814613, 0.95467731, 0.96066776, 0.96594893]),
 'test_time': (5.66754412651062,
  5.540053606033325,
  5.546735763549805,
  5.488779544830322,
  5.54682731628418)}

## User-based Collaborative Filtering

### Import and Installation

In [None]:
#import libraries and helper functions
from utils import *
from sklearn.metrics.pairwise import cosine_similarity
import heapq
import warnings; warnings.simplefilter('ignore')

### Cosine similarity

Cosine similarity is one example of a *sim* function; there are several other ways to measure the degree of similarity. Here we use the `cosine_similarity` function from `sklearn.metrics.pairwise`. Read more about its usage [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).
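To make the measure concrete: the cosine similarity of two rating vectors is their dot product divided by the product of their lengths. The sketch below computes it for all pairs of rows with plain NumPy. It is meant only to show what the number represents; `cosine_similarity_matrix` is a hypothetical helper, and treating all-zero rows as having similarity 0 is an assumption of this sketch.

```python
import numpy as np

def cosine_similarity_matrix(data):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    data = np.asarray(data, dtype=float)
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # all-zero rows get similarity 0 instead of NaN
    unit_rows = data / norms         # scale every row to length 1
    return unit_rows @ unit_rows.T   # dot products of unit vectors

# Toy data (made up): rows 0 and 1 point the same way, row 2 is orthogonal
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)
print(sim)
```

Rows 0 and 1 differ only in scale, so their similarity is 1. Row 2 shares no rated dimension with them, so its similarity to both is 0.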

In [None]:
def cosSimilarityUser(data):
    # Calculate the Cosine Similarity Matrix
    user_similarity = cosine_similarity(data)
    
    # Preview the Similarity Matrix
    print("Similarity Matrix Sample")
    print(user_similarity[:5, :5])
    print("Similarity Matrix Dimension")
    print(np.shape(user_similarity))
    print("=" * 120)
    
    return user_similarity

### Prediction formula
Recall from the lecture that the prediction function for user-based collaborative filtering is

![alt text](https://i.ibb.co/9YsmHNp/Screenshot-2019-05-28-15-51-36.png)
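In case the image does not render, the formula as implemented by `predictUser` in this section can be written as follows. This is a transcription of the code, on the assumption that it matches the lecture formula.

```latex
\hat{r}_{u,i} = \bar{r}_u
  + \frac{\sum_{v} \mathrm{sim}(u,v)\,\bigl(r_{v,i} - \bar{r}_v\bigr)}
         {\sum_{v} \lvert \mathrm{sim}(u,v) \rvert}
```

Here $\hat{r}_{u,i}$ is the predicted rating of user $u$ for item $i$, $\bar{r}_u$ is the mean of user $u$'s row of the rating matrix, and the sums run over all users $v$. In the code, unrated entries are stored as 0 and are included in the row mean.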


Below is an implementation of the function.


In [None]:
def predictUser(ratings, similarity, num_items):
    # The Average Rating Values for Each User
    mean_user_rating = np.repeat(np.array([ratings.mean(axis=1)]), num_items, axis=0).T

    # The Difference Between Each Rating Value and The Average Value
    ratings_diff = ratings - mean_user_rating

    # Calculate the Predicted Score
    pred = mean_user_rating + \
           np.dot(similarity, ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    
    return pred

### Recommend Items For A Given User

In [None]:
def recItemsForOneUser(pred_array, train_array, user, num_rec):
    # Change Training Array Into Sparse Matrix
    train_matrix = sp.csr_matrix(train_array)

    # Get the Item IDs in the Training Data For the Specified User
    train_items_for_user = train_matrix.getrow(user).nonzero()[1]

    # Create A Dictionary with Key-Value Pairs as ItemID-PredictedValue Pair
    pred_dict_for_user = dict(zip(np.arange(train_matrix.shape[1]), pred_array[user]))

    # Remove the Key-Value Pairs used in Training
    for iid in train_items_for_user:
        pred_dict_for_user.pop(iid)

    # Select the Top-N Items in The Sorted List
    rec_list_for_user = heapq.nlargest(num_rec, pred_dict_for_user.items(), key=lambda tup: tup[1])

    # Get the Item ID List From the Top-N Tuples
    rec_item_list = [tup[0] for tup in rec_list_for_user]
    return rec_item_list

### Metrics Calculation Precision And Recall

**Precision** measures how accurate your recommendations are, i.e. the percentage of recommended items that are relevant.

**Recall** measures how well you find all the relevant items, i.e. the percentage of relevant items that appear in your top-K recommendations. For example, a recall of 0.8 means that 80% of the relevant items were found in the top K predictions.
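A small worked example may help. The sketch below uses made-up item IDs and a hypothetical helper `precision_recall`. It counts how many recommended items are relevant and divides by the size of each list.

```python
def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against the relevant items."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example (made up): 5 recommended items, 4 items the user actually liked
recommended = [10, 20, 30, 40, 50]
relevant = [20, 50, 60, 70]
precision, recall = precision_recall(recommended, relevant)
print(precision, recall)   # 0.4 0.5
```

Two of the five recommendations are relevant (precision 2/5), and two of the four relevant items were found (recall 2/4).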

In [None]:
def Precision_and_Recall(pred_item_list, test_item_list):
    # Calculate the Number of Occurrences of Testing Item IDs in the Prediction Item ID List
    sum_relevant_item = 0
    for item in test_item_list:
        if item in pred_item_list:
            sum_relevant_item += 1

    # Calculate the Precision and Recall Value
    precision = sum_relevant_item / len(pred_item_list)
    recall = sum_relevant_item / len(test_item_list)

    return precision, recall

### Metrics Calculation Average Precision And Recall

In [None]:
def calMetrics(train_array, test_array, pred_array, at_K):
    # Get the Unique User IDs in Test Dataset
    # (the .row attribute of a COO matrix repeats a user once per rating)
    test_matrix = sp.coo_matrix(test_array)
    test_users = np.unique(test_matrix.row)
    test_matrix = test_matrix.tocsr()

    # List to Store the Precision/Recall Value for Each User
    precision_u_at_K = []
    recall_u_at_K = []

    # Loop for Each User
    for u in test_users:
        # Get the Recommendation List for the User in Consideration
        rec_list_u = recItemsForOneUser(pred_array, train_array, u, at_K)

        # Generate an Item ID List For Testing
        item_list_u = test_matrix.getrow(u).nonzero()[1]

        # Calculate the Precision and Recall Value for this User
        precision_u, recall_u = Precision_and_Recall(rec_list_u, item_list_u)

        # Save the Precision/Recall Values
        precision_u_at_K.append(precision_u)
        recall_u_at_K.append(recall_u)

    # Calculate the Average Precision/Recall Values Over All Users
    print("Precision@"+str(at_K)+": "+str(np.mean(precision_u_at_K)))
    print("Recall@"+str(at_K)+": "+str(np.mean(recall_u_at_K)))
    print("=" * 120)

### User-based kNN Recommendation Main Program

In [None]:
if __name__ == '__main__':
    # Load Data
    train, test, num_users, num_items, uid_min, iid_min = loadData(test_size=0.2)
    train_array, test_array = train.toarray(), test.toarray()
    
    # Similarity And Prediction Matrices (User)
    similarity_user_array = cosSimilarityUser(train_array)
    pred_user_array = predictUser(train_array, similarity_user_array, num_items)

    # Recommendation
    rec_list = recItemsForOneUser(pred_user_array, train_array, 257, 10)
    print("The Recommendation List for User Is: " + str(rec_list+iid_min))
    print("=" * 120)
    
    # Metrics Calculation
    calMetrics(train_array, test_array, pred_user_array, 5)

Data Preview:
   uid  iid  ratings       time
0  196  242        3  881250949
1  186  302        3  891717742
2   22  377        1  878887116
3  244   51        2  880606923
4  166  346        1  886397596
Number of Users: 943
Number of Items: 1682
Sample Data: [[5 3 4 ... 0 0 0]]
Similarity Matrix Sample
[[1.         0.16359537 0.03365039 0.07323055 0.30757548]
 [0.16359537 1.         0.08520491 0.16757297 0.07164769]
 [0.03365039 0.08520491 1.         0.19338712 0.02781491]
 [0.07323055 0.16757297 0.19338712 1.         0.03761586]
 [0.30757548 0.07164769 0.02781491 0.03761586 1.        ]]
Similarity Matrix Dimension
(943, 943)
The Recommendation List for User Is: [313 288 286 302  50 269 748 333 181 245]


## Quiz Question 1: 

There are several popular similarity functions we can use to compute the degree of similarity between users. Here we used cosine similarity; can you implement a Euclidean method?

(This is the formula for Euclidean distance.)
![alt text](https://cdn-images-1.medium.com/max/1600/1*n6kmkzjKVTOWeXDxsx2daQ.png)

(This [sklearn function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html) might be helpful for you.)
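For reference when the image is unavailable, the Euclidean distance between the rating vectors of users $u$ and $v$ can be written as below. Note that `predictUser` expects a similarity (larger means more alike), whereas this is a distance (smaller means more alike), so it has to be converted. The conversion $1/(1+d)$ is one common choice and is given here only as an assumption, not as the expected quiz answer.

```latex
d(u,v) = \sqrt{\sum_{i} \bigl(r_{u,i} - r_{v,i}\bigr)^{2}},
\qquad
\mathrm{sim}(u,v) = \frac{1}{1 + d(u,v)}
```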


In [None]:
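def EuclideanRec(train_data):
    # Write your code here: return a (num_users, num_users) similarity matrix
    raise NotImplementedError("Implement the Euclidean-based similarity")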
    


In [None]:
# You don't have to change this cell
if __name__ == '__main__':
    # Load Data
    train, test, num_users, num_items, uid_min, iid_min = loadData(test_size=0.2)
    train_array, test_array = train.toarray(), test.toarray()
    
    # Similarity And Prediction Matrices (User)
    similarity_user_array = EuclideanRec(train_array)
    pred_user_array = predictUser(train_array, similarity_user_array, num_items)

    # Recommendation
    rec_list = recItemsForOneUser(pred_user_array, train_array, 257, 10)
    print("The Recommendation List for User Is: " + str(rec_list+iid_min))
    print("=" * 120)
    
    # Metrics Calculation
    calMetrics(train_array, test_array, pred_user_array, 5)

## Item-based Collaborative Filtering:

Now imagine that, instead of focusing on Daven's friends, we focus on which items among all the options are most similar to those we know he enjoys. This approach is known as item-based collaborative filtering (IBCF).

The difference from the user-based method is that here we directly pre-calculate the similarity between co-rated items, skipping the K-neighborhood search.

This algorithm consists of two tasks:

1. Calculate the similarity among the items, for example with cosine similarity.
2. Calculate the prediction with the weighted-sum method.
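The two tasks can be sketched on a toy matrix as follows. This is an illustration under simplifying assumptions (a plain weighted sum without mean-centering, made-up ratings), not the implementation used later in this tutorial.

```python
import numpy as np

# Toy user-item matrix (made up): rows = users, columns = items, 0 = not rated
ratings = np.array([
    [5.0, 4.0, 0.0],   # target user has not rated item 2
    [4.0, 5.0, 5.0],
    [1.0, 2.0, 1.0],
    [5.0, 5.0, 4.0],
])
target_user, target_item = 0, 2

# Task 1: cosine similarity between item columns
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

# Task 2: weighted sum over the items the target user has already rated
rated = np.nonzero(ratings[target_user])[0]
weights = item_sim[target_item, rated]
prediction = float(np.dot(weights, ratings[target_user, rated]) / np.abs(weights).sum())
print(rated, prediction)
```

The prediction is an average of the target user's own ratings (5 and 4), weighted by how similar each rated item is to the target item, so it falls between 4 and 5.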



### Use the Surprise Library

In [None]:
from utils import *
from surprise import KNNWithMeans
from surprise.model_selection import cross_validate
import warnings; warnings.simplefilter('ignore')

# Load the data
data, n_users, n_items = spr_loadData('ml-100k.data')

# Select Algorithm KNNWithMeans and Run It
sim_options = {'name': 'cosine',
               'user_based': False  # compute similarities between items
              }
algo = KNNWithMeans(k=n_items,sim_options=sim_options)
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)

### Cosine similarity

In [None]:
def cosSimilarityItem(data):
    item_similarity = cosine_similarity(data.T)
    return item_similarity

### Prediction formula

Refer to the lecture notes on item-based collaborative filtering.

![alt text](https://cdn-images-1.medium.com/max/800/1*Euu92KfKBZJRwXVyqVkH9w.jpeg)
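In case the image does not render, the formula as implemented by `predictItem` in this section can be written as follows. This is a transcription of the code, on the assumption that it matches the lecture formula.

```latex
\hat{r}_{u,i} = \bar{r}_i
  + \frac{\sum_{j} \mathrm{sim}(i,j)\,\bigl(r_{u,j} - \bar{r}_j\bigr)}
         {\sum_{j} \lvert \mathrm{sim}(i,j) \rvert}
```

Here $\bar{r}_i$ is the mean of item $i$'s column of the rating matrix and the sums run over all items $j$. In the code, unrated entries are stored as 0 and are included in the column mean.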


In [None]:
def predictItem(ratings, similarity, num_users):
    # The Average Rating Values for Each Item
    mean_item_rating = np.repeat(np.array([ratings.mean(axis=0)]), num_users, axis=0)

    # The Difference Between Each Rating Value and The Average Value
    ratings_diff = ratings - mean_item_rating

    # Calculate the Predicted Score
    pred = mean_item_rating + \
           np.dot(ratings_diff, similarity) / np.abs(similarity).sum(axis=1)

    return pred


### Item kNN Main Program



In [None]:
if __name__ == '__main__':
    # Load Data
    train, test, num_users, num_items, uid_min, iid_min = loadData(test_size=0.2)
    train_array, test_array = train.toarray(), test.toarray()

    # Similarity And Prediction Matrices (Item)
    similarity_item_array = cosSimilarityItem(train_array)
    pred_item_array = predictItem(train_array, similarity_item_array, num_users)

    # Recommendation
    rec_list = recItemsForOneUser(pred_item_array, train_array, 257, 10)
    print("The Recommendation List for User Is: " + str(rec_list+iid_min))
    print("=" * 120)
    
    # Metrics Calculation
    calMetrics(train_array, test_array, pred_item_array, 5)

## Quiz Question 2: 
This tutorial computes precision and recall, which are metrics for ranking tasks.
For rating-prediction tasks, the Root-Mean-Square Error (RMSE, called RMS in this quiz) is more commonly used.

Can you change the above code to calculate the RMS value between the predicted ratings and the ground-truth (testing) ratings?

![RMS](https://www.includehelp.com/ml-ai/Images/rmse-1.jpg)


You can use `mean_squared_error` from sklearn for the RMSE calculation; see the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).
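For reference when the image is unavailable, the metric can be written as below, where $T$ denotes the set of (user, item) pairs that actually have a rating in the test set. Restricting the sum to $T$ is an assumption of this note, based on the fact that unrated entries are stored as 0 and are not real ratings.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{\lvert T \rvert}
  \sum_{(u,i) \in T} \bigl(\hat{r}_{u,i} - r_{u,i}\bigr)^{2}}
```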

In [None]:
from sklearn.metrics import mean_squared_error

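def RMS(pred_rating_array, test_rating_array):
    # Assume you have the inputs: the predicted rating array and the testing rating array
    # Write your code here: return the RMS value as a float
    raise NotImplementedError("Implement the RMS calculation")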

### Modify the following function to calculate the RMS metrics of the recommendation

In [None]:
def calMetrics(pred_array, test_array):
    # Filter the Original Prediction Array with Only Testing Items Left
    # TODO: define pred_rating_array here; it is left undefined in this template
    
    # Calculate RMS value
    RMS_Val = RMS(pred_rating_array, test_array)
    
    # Print it Out
    print("RMS: "+str(RMS_Val))

### Main Function

In [None]:
if __name__ == '__main__':
    # Load Data
    train, test, num_users, num_items, uid_min, iid_min = loadData(test_size=0.2)
    train_array, test_array = train.toarray(), test.toarray()
    
    # Similarity And Prediction Matrices (User)
    similarity_user_array = cosSimilarityUser(train_array)
    pred_user_array = predictUser(train_array, similarity_user_array, num_items)

    # RMS
    calMetrics(pred_user_array, test_array)