# Item-based Collaborative Filtering (CF)
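The goal is to find similar items: for each item, compute a weight against every other item and keep the closest ones as its neighbors.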

<img src="images/item_weight.png" alt="weight" width="500" height="200">
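
For reference, here is the item-item weight as the code in this notebook computes it. It is a Pearson-style correlation. The notation is an assumption made for this note and may differ from the figure: $\Omega_i$ is the set of users who rated item $i$, $\Omega_{ij} = \Omega_i \cap \Omega_j$, $r_{ui}$ is user $u$'s rating of item $i$, and $\bar{r}_i$ is item $i$'s average rating.

$$
w_{ij} = \frac{\sum_{u \in \Omega_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}
{\sqrt{\sum_{u \in \Omega_i} (r_{ui} - \bar{r}_i)^2}\;\sqrt{\sum_{u \in \Omega_j} (r_{uj} - \bar{r}_j)^2}}
$$

Note that the numerator sums over common users only, while each norm in the denominator sums over all users who rated that item.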

Comparison:
- User-user CF: recommend items to a user because similar users have liked those items.
- Item-item CF: recommend items to a user because this user has liked similar items in the past.
- Another perspective: to choose users to recommend item j to, look at other items j' that were liked by the same users as item j. If items j and j' are similar, then the same users like both. This is mathematically identical to user-user CF with the roles of users and items swapped.
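
The "mathematically identical" point can be illustrated with a small sketch. The ratings matrix `R`, the convention that `0` means "not rated", and the helper `pearson_weights` are all hypothetical and exist only for this example. The same function yields user-user weights on `R` and item-item weights on its transpose.

```python
import numpy as np

def pearson_weights(R, mask):
    """Similarity between rows of R; mask marks which entries are observed."""
    counts = mask.sum(axis=1, keepdims=True)
    means = (R * mask).sum(axis=1, keepdims=True) / counts
    # deviations from each row's mean, zeroed where the rating is missing
    dev = (R - means) * mask
    norms = np.sqrt((dev ** 2).sum(axis=1))
    # zeros in dev mean only commonly rated columns contribute to the numerator
    return dev @ dev.T / np.outer(norms, norms)

# toy data: rows are users, columns are items, 0 = not rated (assumption)
R = np.array([[5, 4, 0],
              [4, 5, 1],
              [1, 0, 5],
              [2, 1, 4]], dtype=float)
mask = R > 0

user_weights = pearson_weights(R, mask)      # user-user: compare rows
item_weights = pearson_weights(R.T, mask.T)  # item-item: same code, transposed

print(user_weights.shape, item_weights.shape)
```

Here items 0 and 1 come out positively correlated and items 0 and 2 negatively correlated, because the users who like item 0 tend to like item 1 and dislike item 2.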

<img src="images/difference.png" alt="difference" width="500" height="200">

## Code 
Check out **itembased.py** for clean code.

In [None]:
from __future__ import print_function, division
from builtins import range, input
# Note: you may need to update your version of future
# sudo pip install -U future

import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from datetime import datetime
from sortedcontainers import SortedList

In [None]:
# load in the data
import os
with open('user2movie.json', 'rb') as f:
    user2movie = pickle.load(f)

with open('movie2user.json', 'rb') as f:
    movie2user = pickle.load(f)

with open('usermovie2rating.json', 'rb') as f:
    usermovie2rating = pickle.load(f)

with open('usermovie2rating_test.json', 'rb') as f:
    usermovie2rating_test = pickle.load(f)


In [None]:
N = np.max(list(user2movie.keys()))+1
# the test set may contain movies the train set does not have data on
m1 = np.max(list(movie2user.keys()))
m2 = np.max([m for (u, m), r in usermovie2rating_test.items()])
M = max(m1, m2) + 1
print("N:", N, "M:", M)


In [None]:
# to find the item similarities, you have to do O(M^2 * N) calculations!
# in the "real world" you'd want to parallelize this
# note: we really only have to do half the calculations, since w_ij is symmetric
K = 20 # number of neighbors we'd like to consider
limit = 5 # only calculate a correlation if two items share more than this many users
neighbors = [] # store each item's neighbors in this list as (-w_ij, j) pairs
averages = [] # each item's average rating for later use
deviations = [] # each item's deviations for later use

# find the K closest items to item i
for i in range(M):
    users_i = movie2user[i]
    users_i_set = set(users_i)
    
    # calc avg and dev
    ratings_i = {user: usermovie2rating[(user, i)] for user in users_i}
    avg_i = np.mean(list(ratings_i.values()))
    dev_i = {user: (usermovie2rating[(user, i)]-avg_i) for user in users_i}
    dev_i_vals = np.array(list(dev_i.values()))
    sigma_i = np.sqrt(dev_i_vals.dot(dev_i_vals))
    
    # save these for later use
    averages.append(avg_i)
    deviations.append(dev_i)
    
    sl = SortedList()
    for j in range(M):
        if j != i:
            users_j = movie2user[j]
            users_j_set = set(users_j)
            common_users = (users_i_set & users_j_set)
            if len(common_users)> limit:
                # calc avg and dev
                ratings_j = {user: usermovie2rating[(user, j)] for user in users_j}
                avg_j = np.mean(list(ratings_j.values()))
                dev_j = {user: (usermovie2rating[(user, j)]-avg_j) for user in users_j}
                dev_j_vals = np.array(list(dev_j.values()))
                sigma_j = np.sqrt(dev_j_vals.dot(dev_j_vals))
                
                # calc correlation coefficient
                numerator = sum(dev_i[u]*dev_j[u] for u in common_users)
                denominator = sigma_i * sigma_j
                w_ij = numerator / denominator
                
                sl.add((-w_ij, j))
                if len(sl) > K:
                    del sl[-1]
    
    neighbors.append(sl)
    if i%1 == 0:
        print(i)

<img src="images/item_score.png" alt="score" width="500" height="200">
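
For reference, here is the score that the `predict` function in this notebook computes. The notation is an assumption made for this note and may differ from the figure: $N_i$ is the set of the $K$ nearest neighbors of item $i$, $\Psi_u$ is the set of items that user $u$ rated, and $\bar{r}_j$ is item $j$'s average rating.

$$
s(i, u) = \bar{r}_i + \frac{\sum_{j \in N_i \cap \Psi_u} w_{ij}\,(r_{uj} - \bar{r}_j)}{\sum_{j \in N_i \cap \Psi_u} |w_{ij}|}
$$

When none of item $i$'s neighbors was rated by user $u$, the denominator is zero and the prediction falls back to $\bar{r}_i$. The result is then clipped to the valid rating range $[0.5, 5]$.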


In [None]:
# i is item, u is user
def predict(i, u):
    # calc the weighted sum of dev
    numerator = 0
    denominator = 0
    for neg_w, j in neighbors[i]:
        try:
            numerator += -neg_w * deviations[j][u]
            denominator += abs(neg_w)
        except KeyError:
            pass
        
    if denominator == 0:
        prediction = averages[i]
    else:
        prediction = averages[i] + numerator/denominator
    
    prediction = min(5, prediction)
    prediction = max(0.5, prediction) # min rating is 0.5
    return prediction

In [None]:
train_predictions = []
train_targets = []
for (u, m), target in usermovie2rating.items():
    # calculate the prediction for this movie
    # keys are (user, movie), but predict expects (item, user)
    prediction = predict(m, u)
    # save the prediction and target
    train_predictions.append(prediction)
    train_targets.append(target)

test_predictions = []
test_targets = []
# same thing for test set
for (u, m), target in usermovie2rating_test.items():
    # calculate the prediction for this movie
    # keys are (user, movie), but predict expects (item, user)
    prediction = predict(m, u)

    # save the prediction and target
    test_predictions.append(prediction)
    test_targets.append(target)


# calculate the mean squared error (lower is better)
def mse(p, t):
    p = np.array(p)
    t = np.array(t)
    return np.mean((p - t)**2)

print('train mse:', mse(train_predictions, train_targets))
print('test mse:', mse(test_predictions, test_targets))

Item-based CF overfits less than user-based CF. One way to think about it: each item has more data than each user.

A user-user weight is computed from a short list of items.

An item-item weight is computed from a long list of users.

However, item-based CF can be "too accurate": it only suggests products similar to ones the user already liked, so the recommendations lack diversity. YouTube recommendations are a familiar example.
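
The data-volume argument (an item-item weight sees more ratings than a user-user weight) can be checked by comparing the average number of ratings per user with the average number per item. The sketch below uses small hypothetical stand-ins for `user2movie` and `movie2user`. On the real data, the same two lines could be run on the dictionaries loaded earlier.

```python
import numpy as np

# toy stand-ins for the notebook's dictionaries (hypothetical data)
user2movie = {0: [0, 1], 1: [0], 2: [0, 1], 3: [1], 4: [0], 5: [0, 1]}
movie2user = {0: [0, 1, 2, 4, 5], 1: [0, 2, 3, 5]}

# average length of the rating list a weight has to work with
ratings_per_user = np.mean([len(movies) for movies in user2movie.values()])
ratings_per_item = np.mean([len(users) for users in movie2user.values()])

print("avg ratings per user:", ratings_per_user)
print("avg ratings per item:", ratings_per_item)
```

With many more users than items, as in this toy data, each item accumulates more ratings than each user, so item-item correlations are estimated from more data points.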