# Key Properties of Rating Matrices

Ratings matrix: $R$ of size $m \times n$ ($m$ users and $n$ items) <br>
Rating of user $u$ for item $j$: $r_{uj}$

Ratings can be defined in a variety of ways:
- Continuous ratings: example: -10 to 10
- Interval-based ratings: example: numerical integer values from 1 to 5
- Ordinal ratings: example: "Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"
- Binary ratings: positive or negative responses
- Unary ratings: only a positive preference can be expressed; if the customer has not bought the item, this does not necessarily indicate a dislike for the item.

In [1]:
import numpy as np

In [2]:
nan = np.nan


In [3]:
ratings_matrix = np.array([[7, 6, 7, 4, 5, 4], 
                           [6, 7, nan, 4, 3, 4],
                           [nan, 3, 3, 1, 1, nan],
                           [1, 2, 2, 3, 3, 4],
                           [1, nan, 1, 2, 3, 3]])

In [4]:
print(ratings_matrix)

[[ 7.  6.  7.  4.  5.  4.]
 [ 6.  7. nan  4.  3.  4.]
 [nan  3.  3.  1.  1. nan]
 [ 1.  2.  2.  3.  3.  4.]
 [ 1. nan  1.  2.  3.  3.]]


# User-Based Neighborhood Models

$R = [r_{uj}]$ <br>
$I_u$: set of item indices for which ratings have been specified by user (row) $u$ <br>
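
As a minimal sketch of this definition, assuming missing ratings are stored as `np.nan` as in the matrix above, $I_u$ can be read off a boolean mask. The names `toy_ratings` and `items_rated_by` are hypothetical and not part of the notebook code.

```python
import numpy as np

# Same toy ratings matrix as above; np.nan marks an unspecified rating.
toy_ratings = np.array([[7, 6, 7, 4, 5, 4],
                        [6, 7, np.nan, 4, 3, 4],
                        [np.nan, 3, 3, 1, 1, np.nan],
                        [1, 2, 2, 3, 3, 4],
                        [1, np.nan, 1, 2, 3, 3]])

def items_rated_by(ratings, u):
    """Return I_u: the sorted indices of the items that user (row) u has rated."""
    return np.flatnonzero(np.isfinite(ratings[u]))

I_2 = items_rated_by(toy_ratings, 2)
print(I_2)  # indices of the items rated by user 2
```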

In [5]:
# indices of the specified (non-NaN) ratings in a 1-D ratings vector
# returns a one-element list holding a tuple of indices, so callers take element [0]
def specified_rating_indices(u):
    return list(map(tuple, np.where(np.isfinite(u))))

In [6]:
# mean rating of a user, computed over that user's specified ratings only
def mean(u):
    # a boolean mask avoids indexing with a list of index tuples,
    # which NumPy deprecated and which changes meaning in newer versions
    specified_ratings = u[np.isfinite(u)]
    m = sum(specified_ratings)/np.shape(specified_ratings)[0]
    return m

In [7]:
def all_user_mean_ratings(ratings_matrix):
    return np.array([mean(ratings_matrix[u, :]) for u in range(ratings_matrix.shape[0])])
    

In [8]:
def get_mean_centered_ratings_matrix(ratings_matrix):
    users_mean_rating = all_user_mean_ratings(ratings_matrix)
    mean_centered_ratings_matrix = ratings_matrix - np.reshape(users_mean_rating, [-1, 1])
    return mean_centered_ratings_matrix

In [9]:
mean_centered_ratings_matrix = get_mean_centered_ratings_matrix(ratings_matrix)

In [10]:
mean_centered_ratings_matrix

array([[ 1.5,  0.5,  1.5, -1.5, -0.5, -1.5],
       [ 1.2,  2.2,  nan, -0.8, -1.8, -0.8],
       [ nan,  1. ,  1. , -1. , -1. ,  nan],
       [-1.5, -0.5, -0.5,  0.5,  0.5,  1.5],
       [-1. ,  nan, -1. ,  0. ,  1. ,  1. ]])

Sometimes the mean is computed only over the items rated by both users $u$ and $v$. Here we instead compute each user's mean once, over all the items that user has rated:

$$\mu_u = \dfrac{\Sigma_{k \in I_u} r_{uk}}{|I_u|}$$ <br>
$$Sim(u, v) = Pearson(u, v) = \dfrac{\Sigma_{k \in I_u \cap I_v}(r_{uk} - \mu_u)(r_{vk} - \mu_v)}{\sqrt{\Sigma_{k \in I_u \cap I_v}(r_{uk} - \mu_u)^2}\sqrt{\Sigma_{k \in I_u \cap I_v}(r_{vk} - \mu_v)^2}}$$
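
As a worked example on the ratings matrix above, consider users 0 and 2 (0-based rows), whose means are $\mu_0 = 5.5$ and $\mu_2 = 2$. Both have rated items 1 to 4, so the mean-centered ratings on $I_0 \cap I_2$ are $(0.5, 1.5, -1.5, -0.5)$ and $(1, 1, -1, -1)$:
$$Pearson(0, 2) = \dfrac{0.5 + 1.5 + 1.5 + 0.5}{\sqrt{0.25 + 2.25 + 2.25 + 0.25}\sqrt{1 + 1 + 1 + 1}} = \dfrac{4}{2\sqrt{5}} \approx 0.894$$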

In [11]:
def pearson(u, v):
    mean_u = mean(u)
    mean_v = mean(v)
    
    specified_rating_indices_u = set(specified_rating_indices(u)[0])
    specified_rating_indices_v = set(specified_rating_indices(v)[0])
    
    mutually_specified_ratings_indices = specified_rating_indices_u.intersection(specified_rating_indices_v)
    mutually_specified_ratings_indices = list(mutually_specified_ratings_indices)
    
    u_mutually = u[mutually_specified_ratings_indices]
    v_mutually = v[mutually_specified_ratings_indices]
      
    centralized_mutually_u = u_mutually - mean_u
    centralized_mutually_v = v_mutually - mean_v

    result = np.sum(np.multiply(centralized_mutually_u, centralized_mutually_v)) 
    result = result / (np.sqrt(np.sum(np.square(centralized_mutually_u))) * np.sqrt(np.sum(np.square(centralized_mutually_v))))

    return result

In [12]:
print(pearson(ratings_matrix[1, :], ratings_matrix[2, :]))

0.9384742644069303


In [13]:
for i in range(ratings_matrix.shape[0]):
    print(pearson(ratings_matrix[i, :], ratings_matrix[2, :]))

0.8944271909999159
0.9384742644069303
1.0
-1.0
-0.8164965809277259
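
For comparison, here is a minimal sketch of the alternative convention, in which each mean is taken only over the items rated by both users. The function name `pearson_common_mean` and the variables `row_1` and `row_2` are hypothetical. On the co-rated items this variant coincides with the ordinary sample correlation, so `np.corrcoef` serves as a cross-check.

```python
import numpy as np

def pearson_common_mean(u, v):
    """Pearson similarity with both means computed over co-rated items only."""
    both = np.isfinite(u) & np.isfinite(v)   # mask of the items rated by both users
    uc = u[both] - u[both].mean()
    vc = v[both] - v[both].mean()
    return np.sum(uc * vc) / (np.sqrt(np.sum(uc ** 2)) * np.sqrt(np.sum(vc ** 2)))

# Users 1 and 2 of the toy matrix (np.nan = unrated).
row_1 = np.array([6, 7, np.nan, 4, 3, 4])
row_2 = np.array([np.nan, 3, 3, 1, 1, np.nan])

sim = pearson_common_mean(row_1, row_2)
both = np.isfinite(row_1) & np.isfinite(row_2)
reference = np.corrcoef(row_1[both], row_2[both])[0, 1]
print(sim, reference)
```

With the per-user means used in this notebook, the same pair scores about 0.938, so the two conventions can differ noticeably when users share only a few ratings.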


In [14]:
def mean_centered(u):
    return u - mean(u)

In [15]:
print(all_user_mean_ratings(ratings_matrix))

[5.5 4.8 2.  2.5 2. ]


In [16]:
def get_user_similarity_value_for(u_index, ratings_matrix):
    user_ratings = ratings_matrix[u_index, :]
    similarity_value = np.array([pearson(ratings_matrix[i, :], user_ratings) for i in range(ratings_matrix.shape[0])])
    return similarity_value

In [17]:
def get_user_similarity_matrix(ratings_matrix):
    similarity_matrix = []
    for u_index in range(ratings_matrix.shape[0]):
        similarity_value = get_user_similarity_value_for(u_index, ratings_matrix)
        similarity_matrix.append(similarity_value)
    return np.array(similarity_matrix)
    

In [18]:
user_similarity_matrix = get_user_similarity_matrix(ratings_matrix)

In [19]:
print(user_similarity_matrix)

[[ 1.          0.70066562  0.89442719 -0.8992288  -0.82199494]
 [ 0.70066562  1.          0.93847426 -0.71713717 -0.89866916]
 [ 0.89442719  0.93847426  1.         -1.         -0.81649658]
 [-0.8992288  -0.71713717 -1.          1.          0.87287156]
 [-0.82199494 -0.89866916 -0.81649658  0.87287156  1.        ]]


Let $P_u(j)$ be the set of the $k$ users closest to target user $u$ who have specified a rating for item $j$.

$$s_{uj} = r_{uj} - \mu_u$$ <br>
$$\hat{r_{uj}} = \mu_u + \dfrac{\Sigma_{v \in P_u(j)} Sim(u, v).s_{vj}}{\Sigma_{v \in P_u(j)} |Sim(u, v)|} = \mu_u + \dfrac{\Sigma_{v \in P_u(j)} Sim(u, v).(r_{vj} - \mu_v)}{\Sigma_{v \in P_u(j)} |Sim(u, v)|}$$
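
As a worked example with the toy matrix, take target user $u = 2$, item $j = 0$ and $k = 2$. The two most similar users who rated item 0 are users 1 and 0, with $Sim(2, 1) \approx 0.938$, $Sim(2, 0) \approx 0.894$ and mean-centered ratings $s_{10} = 1.2$, $s_{00} = 1.5$. With $\mu_2 = 2$:
$$\hat{r_{20}} = 2 + \dfrac{0.938 \cdot 1.2 + 0.894 \cdot 1.5}{0.938 + 0.894} \approx 3.35$$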

In [20]:
def predict(u_index, i_index, k):
    users_mean_rating = all_user_mean_ratings(ratings_matrix)
    
    similarity_value = user_similarity_matrix[u_index]
    sorted_users_similar = np.argsort(similarity_value)
    sorted_users_similar = np.flip(sorted_users_similar, axis=0)
        
    # only for this item
    users_rated_item = specified_rating_indices(ratings_matrix[:, i_index])[0]
#     print(np.array(users_rated_item))
#     print(sorted_users_similar)
    
    ranked_similar_user_rated_item = [u for u in sorted_users_similar if u in users_rated_item]
#     print(ranked_similar_user_rated_item)
    
    if k < len(ranked_similar_user_rated_item):
        top_k_similar_user = ranked_similar_user_rated_item[0:k]   
    else:
        top_k_similar_user = np.array(ranked_similar_user_rated_item)
        
#     print(top_k_similar_user)
    
    # replace with mean_centered for user
    
    ratings_in_item = mean_centered_ratings_matrix[:, i_index]
    top_k_ratings = ratings_in_item[top_k_similar_user]
    
    top_k_similarity_value = similarity_value[top_k_similar_user]
#     print(top_k_ratings)
#     print(top_k_similarity_value)
    
    r_hat = users_mean_rating[u_index] + np.sum(top_k_ratings * top_k_similarity_value)/np.sum(np.abs(top_k_similarity_value))
    return r_hat

In [21]:
print(predict(2, 0, 2))

3.3463952993809016


In [22]:
def predict_top_k_items_of_user(u_index, k_items, k_users):
    items = []
    for i_index in range(ratings_matrix.shape[1]):
        # only predict for items the user has not rated yet
        if np.isnan(ratings_matrix[u_index][i_index]):
            rating = predict(u_index, i_index, k_users)
            items.append((i_index, rating))
    # sort by predicted rating, highest first, and keep only the top k_items
    items = sorted(items, key=lambda tup: tup[1], reverse=True)
    return items[:k_items]

In [23]:
print(predict_top_k_items_of_user(2, 2, 2))

[(0, 3.3463952993809016), (5, 0.8584109681112306)]


In [24]:
user_similarity_matrix.shape

(5, 5)

## Similarity Function Variants

Cosine function on the raw ratings rather than the mean-centered ratings:
$$RawCosine(u, v) = \dfrac{\Sigma_{k \in I_u \cap I_v}r_{uk}.r_{vk}}{\sqrt{\Sigma_{k \in I_u \cap I_v}r_{uk}^2}.\sqrt{\Sigma_{k \in I_u \cap I_v}r_{vk}^2}}$$

In [25]:
def raw_cosine(u, v):
    specified_rating_indices_u = set(specified_rating_indices(u)[0])
    specified_rating_indices_v = set(specified_rating_indices(v)[0])
    
    mutually_specified_ratings_indices = specified_rating_indices_u.intersection(specified_rating_indices_v)
    mutually_specified_ratings_indices = list(mutually_specified_ratings_indices)
    
    u_mutually = u[mutually_specified_ratings_indices]
    v_mutually = v[mutually_specified_ratings_indices]
    
    result = np.sum(np.multiply(u_mutually, v_mutually)) / (np.sqrt(np.sum(np.square(u_mutually))) * np.sqrt(np.sum(np.square(v_mutually))))

    return result

In some implementations of the raw cosine, the normalization factors in the denominator are based on all the specified items and not the mutually rated items:
$$RawCosine(u, v) = \dfrac{\Sigma_{k \in I_u \cap I_v}r_{uk}.r_{vk}}{\sqrt{\Sigma_{k \in I_u}r_{uk}^2}.\sqrt{\Sigma_{k \in I_v}r_{vk}^2}}$$

In [26]:
def raw_cosine_2(u, v):
    specified_rating_indices_u = set(specified_rating_indices(u)[0])
    specified_rating_indices_v = set(specified_rating_indices(v)[0])
    
    mutually_specified_ratings_indices = specified_rating_indices_u.intersection(specified_rating_indices_v)
    mutually_specified_ratings_indices = list(mutually_specified_ratings_indices)
    
    specified_ratings_u = u[list(specified_rating_indices_u)]
    specified_ratings_v = v[list(specified_rating_indices_v)]
    
    u_mutually = u[mutually_specified_ratings_indices]
    v_mutually = v[mutually_specified_ratings_indices]
    
    result = np.sum(np.multiply(u_mutually, v_mutually)) / (np.sqrt(np.sum(np.square(specified_ratings_u))) * np.sqrt(np.sum(np.square(specified_ratings_v))))

    return result

In [27]:
user_0 = ratings_matrix[0, :]
user_2 = ratings_matrix[2, :]

print(raw_cosine(user_0, user_2))
print(raw_cosine_2(user_0, user_2))

0.9561828874675148
0.7766217620286882


In general, the Pearson correlation coefficient is preferable to the raw cosine, because mean-centering adjusts for the bias of users who rate consistently high or consistently low.

When two users have only a small number of ratings in common, the similarity function should be reduced with a discount factor to de-emphasize the importance of that user pair. This is called $significance$ $weighting$. The discount factor kicks in when the number of common ratings between the two users is less than a particular threshold $\beta$:
$$DiscountedSim(u, v) = Sim(u, v). \dfrac{\min\{|I_u \cap I_v|, \beta\}}{\beta}$$
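
For instance, users 0 and 2 of the toy matrix have $|I_0 \cap I_2| = 4$ ratings in common and $Pearson(0, 2) \approx 0.894$. With $\beta = 5$:
$$DiscountedSim(0, 2) \approx 0.894 \cdot \dfrac{4}{5} \approx 0.716$$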

In [28]:
def discounted_sim(u, v, beta):
    specified_rating_indices_u = set(specified_rating_indices(u)[0])
    specified_rating_indices_v = set(specified_rating_indices(v)[0])
    
    mutually_specified_ratings_indices = specified_rating_indices_u.intersection(specified_rating_indices_v)
    mutually_specified_ratings_indices = list(mutually_specified_ratings_indices)
    
    result = pearson(u, v) * min(len(mutually_specified_ratings_indices), beta) / beta
    
    return result

In [29]:
user_0 = ratings_matrix[0, :]
user_2 = ratings_matrix[2, :]

print(discounted_sim(user_0, user_2, 5))

0.7155417527999327


## Variants of the Prediction Function 

Standard deviation:
$$\sigma_u = \sqrt{\dfrac{\Sigma_{j \in I_u}(r_{uj} - \mu_u)^2}{|I_u| - 1}}$$

In [30]:
def standard_deviation(u):
    specified_rating_indices_u = set(specified_rating_indices(u)[0])
    specified_ratings_u = u[list(specified_rating_indices_u)]
    m = mean(u)
    
#     print(specified_ratings_u)
#     print(m)
    result = np.sqrt(np.sum(np.square(specified_ratings_u - m)) / (len(list(specified_rating_indices_u)) - 1))
    
    return result

In [31]:
print(standard_deviation(user_2))

1.1547005383792515


Standardized ratings:
$$z_{uj} = \dfrac{r_{uj} - \mu_u}{\sigma_u} = \dfrac{s_{uj}}{\sigma_u}$$

In [32]:
def get_standardized_ratings(u):
    specified_rating_indices_u = set(specified_rating_indices(u)[0])
    specified_ratings_u = u[list(specified_rating_indices_u)]
    m = mean(u)
    
    sigma = standard_deviation(u)
    
    result = (specified_ratings_u - m) / sigma

    r = []
    count = 0
    for i in range(len(u)):
        if np.isnan(u[i]):
            r.append(nan)
        else:
            r.append(result[count])
            count = count + 1
    return r

In [33]:
print(get_standardized_ratings(user_2))

[nan, 0.8660254037844387, 0.8660254037844387, -0.8660254037844387, -0.8660254037844387, nan]


In [34]:
def get_standardized_ratings_matrix(ratings_matrix):
    result = []
    for u_index in range(ratings_matrix.shape[0]):
        u = get_standardized_ratings(ratings_matrix[u_index, :])
        result.append(u)
    return np.array(result)

In [35]:
standardized_ratings_matrix = get_standardized_ratings_matrix(ratings_matrix)
print(standardized_ratings_matrix)

[[ 1.08821438  0.36273813  1.08821438 -1.08821438 -0.36273813 -1.08821438]
 [ 0.73029674  1.33887736         nan -0.4868645  -1.09544512 -0.4868645 ]
 [        nan  0.8660254   0.8660254  -0.8660254  -0.8660254          nan]
 [-1.43019388 -0.47673129 -0.47673129  0.47673129  0.47673129  1.43019388]
 [-1.                 nan -1.          0.          1.          1.        ]]
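
The same standardized matrix can also be computed in one vectorized step. The sketch below assumes NaN-coded missing ratings and relies on `np.nanmean` and `np.nanstd` (with `ddof=1` to match the $|I_u| - 1$ denominator). The variable name `toy_ratings` is only a stand-in for the 5 x 6 matrix used above.

```python
import numpy as np

toy_ratings = np.array([[7, 6, 7, 4, 5, 4],
                        [6, 7, np.nan, 4, 3, 4],
                        [np.nan, 3, 3, 1, 1, np.nan],
                        [1, 2, 2, 3, 3, 4],
                        [1, np.nan, 1, 2, 3, 3]])

# Row-wise statistics over the specified ratings only (NaN entries are ignored).
mu = np.nanmean(toy_ratings, axis=1, keepdims=True)
sigma = np.nanstd(toy_ratings, axis=1, ddof=1, keepdims=True)

# Broadcasting subtracts each user's mean and divides by each user's sigma;
# missing ratings stay NaN.
z = (toy_ratings - mu) / sigma
print(np.round(z, 4))
```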


Let $P_u(j)$ denote the set of top-k similar users of target user $u$, for which the ratings of item $j$ have been observed:
$$\hat{r_{uj}} = \mu_u + \sigma_u\dfrac{\Sigma_{v \in P_u(j)}Sim(u, v).z_{vj}}{\Sigma_{v \in P_u(j)}|Sim(u, v)|}$$

In [36]:
def predict_2(u_index, i_index, k):
    users_mean_rating = all_user_mean_ratings(ratings_matrix)
    
    similarity_value = user_similarity_matrix[u_index]
    sorted_users_similar = np.argsort(similarity_value)
    sorted_users_similar = np.flip(sorted_users_similar, axis=0)
        
    # only for this item
    users_rated_item = specified_rating_indices(ratings_matrix[:, i_index])[0]
#     print(np.array(users_rated_item))
#     print(sorted_users_similar)
    
    ranked_similar_user_rated_item = [u for u in sorted_users_similar if u in users_rated_item]
#     print(ranked_similar_user_rated_item)
    
    if k < len(ranked_similar_user_rated_item):
        top_k_similar_user = ranked_similar_user_rated_item[0:k]   
    else:
        top_k_similar_user = np.array(ranked_similar_user_rated_item)
        
#     print(top_k_similar_user)
    
    # replace with mean_centered for user
    
    ratings_in_item = standardized_ratings_matrix[:, i_index]
    top_k_ratings = ratings_in_item[top_k_similar_user]

    top_k_similarity_value = similarity_value[top_k_similar_user]
#     print(top_k_ratings)
#     print(top_k_similarity_value)

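    # sigma_u is the standard deviation of the target user's own ratings,
    # not the standard deviation of the vector of user means
    sigma = standard_deviation(ratings_matrix[u_index, :])
    
    r_hat = users_mean_rating[u_index] + sigma * np.sum(top_k_ratings * top_k_similarity_value)/np.sum(np.abs(top_k_similarity_value))
    return r_hat

In [37]:
print(round(predict_2(2, 0, 2), 3))

3.045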


One problem with the Z-score approach is that the predicted ratings might frequently fall outside the range of permissible ratings.
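
One simple remedy, sketched here as an assumption rather than something this notebook implements, is to clip each prediction to the permissible range. The function name `clip_prediction` and the bounds `r_min` and `r_max` are hypothetical; the defaults reflect the toy ratings, which lie between 1 and 7.

```python
import numpy as np

def clip_prediction(r_hat, r_min=1.0, r_max=7.0):
    """Clamp a predicted rating (scalar or array) to the interval [r_min, r_max]."""
    return np.clip(r_hat, r_min, r_max)

# Example: predictions below, inside and above the assumed rating scale.
print(clip_prediction(np.array([0.86, 3.35, 7.4])))
```

When only the ranking of items matters, out-of-range predictions can be used unclipped, since clipping can create ties between items.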

While the value of $Sim(u, v)$ was chosen to be the Pearson correlation coefficient, a commonly used practice is to amplify it by raising it to the power of $\alpha$:
$$Sim(u, v) = Pearson(u, v)^\alpha$$
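
Note that a negative correlation raised to a non-integer power is not a real number, and `np.power` returns `nan` in that case. One way around this, which is an assumption and not taken from the notebook, is to apply the exponent to the magnitude and restore the sign afterwards. The name `amplify_similarity` is hypothetical.

```python
import numpy as np

def amplify_similarity(sim, alpha):
    """Return sign(sim) * |sim|**alpha, so negative similarities stay negative."""
    return np.sign(sim) * np.power(np.abs(sim), alpha)

# A positive and a negative Pearson value taken from the outputs above.
print(amplify_similarity(0.8944271909999159, 1.2))
print(amplify_similarity(-0.8164965809277259, 1.2))
```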

In [38]:
def pearson_2(u, v, alpha):
    return np.power(pearson(u, v), alpha)

In [39]:
print(pearson_2(user_0, user_2, 1.2))

0.8746896591546224


## Impact of the Long Tail 

Just as the notion of Inverse Document Frequency (idf) exists in the information retrieval literature, one can use the notion of Inverse User Frequency in this case. If $m_j$ is the number of ratings of item $j$, and $m$ is the total number of users, then the weight $w_j$ of item $j$ is set to the following:
$$w_j = \log{\dfrac{m}{m_j}}$$ <br>
Then the Pearson correlation coefficient can be modified by weighting each item in both the numerator and the denominator:
$$Pearson(u, v) = \dfrac{\Sigma_{k \in I_u \cap I_v}w_k.(r_{uk} - \mu_u)(r_{vk} - \mu_v)}{\sqrt{\Sigma_{k \in I_u \cap I_v}w_k.(r_{uk} - \mu_u)^2}\sqrt{\Sigma_{k \in I_u \cap I_v}w_k.(r_{vk} - \mu_v)^2}}$$
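
The weights themselves can be computed in a couple of vectorized lines. This is a sketch that assumes NaN-coded missing ratings; `toy_ratings` is a hypothetical stand-in for the 5 x 6 matrix above.

```python
import numpy as np

toy_ratings = np.array([[7, 6, 7, 4, 5, 4],
                        [6, 7, np.nan, 4, 3, 4],
                        [np.nan, 3, 3, 1, 1, np.nan],
                        [1, 2, 2, 3, 3, 4],
                        [1, np.nan, 1, 2, 3, 3]])

m = toy_ratings.shape[0]                       # total number of users
m_j = np.isfinite(toy_ratings).sum(axis=0)     # number of ratings received by each item
w = np.log(m / m_j)                            # inverse user frequency weight per item

print(m_j)
print(np.round(w, 4))
```

Items rated by every user get $w_j = \log 1 = 0$ and therefore drop out of the weighted similarity entirely.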

In [40]:
def get_rated_item_indices(ratings_matrix):
    result = []
    
    for i_index in range(ratings_matrix.shape[1]):
        item = ratings_matrix[:, i_index]
        result.append(specified_rating_indices(item)[0])
    
    return result

In [41]:
rated_item_indices = get_rated_item_indices(ratings_matrix=ratings_matrix)
print(rated_item_indices)

[(0, 1, 3, 4), (0, 1, 2, 3), (0, 2, 3, 4), (0, 1, 2, 3, 4), (0, 1, 2, 3, 4), (0, 1, 3, 4)]


In [42]:
def pearson_3(u, v):
    m_j = np.array([len(list(i)) for i in rated_item_indices])
    w = np.log(ratings_matrix.shape[0] / m_j)
    
    mean_u = mean(u)
    mean_v = mean(v)
    
    specified_rating_indices_u = set(specified_rating_indices(u)[0])
    specified_rating_indices_v = set(specified_rating_indices(v)[0])
    
    mutually_specified_ratings_indices = specified_rating_indices_u.intersection(specified_rating_indices_v)
    mutually_specified_ratings_indices = list(mutually_specified_ratings_indices)
    
    u_mutually = u[mutually_specified_ratings_indices]
    v_mutually = v[mutually_specified_ratings_indices]
      
    centralized_mutually_u = u_mutually - mean_u
    centralized_mutually_v = v_mutually - mean_v

    w_k = w[mutually_specified_ratings_indices]
    
    result = np.sum(np.multiply(np.multiply(centralized_mutually_u, centralized_mutually_v), w_k))
    result = result / (np.sqrt(np.sum(np.multiply(w_k, np.square(centralized_mutually_u)))) * np.sqrt(np.sum(np.multiply(w_k, np.square(centralized_mutually_v)))))

    return result

In [43]:
print(pearson_3(user_0, user_2))

0.8944271909999159


# Item-Based Neighborhood Models 

In [44]:
mean_centered_ratings_matrix #s_uj

array([[ 1.5,  0.5,  1.5, -1.5, -0.5, -1.5],
       [ 1.2,  2.2,  nan, -0.8, -1.8, -0.8],
       [ nan,  1. ,  1. , -1. , -1. ,  nan],
       [-1.5, -0.5, -0.5,  0.5,  0.5,  1.5],
       [-1. ,  nan, -1. ,  0. ,  1. ,  1. ]])

Let $U_i$ be the set of indices of the users who have rated item $i$<br>
The $adjusted$ cosine similarity between the items (columns) $i$ and $j$ is defined as follows:
$$AdjustedCosine(i, j) =  \dfrac{\Sigma_{u \in U_i \cap U_j}s_{ui}.s_{uj}}{\sqrt{\Sigma_{u \in U_i \cap U_j}s_{ui}^2}\sqrt{\Sigma_{u \in U_i \cap U_j}s_{uj}^2}}$$
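
As a worked example, items 0 and 2 of the mean-centered matrix above are both rated by users 0, 3 and 4, with mean-centered ratings $(1.5, -1.5, -1)$ and $(1.5, -0.5, -1)$ respectively:
$$AdjustedCosine(0, 2) = \dfrac{2.25 + 0.75 + 1}{\sqrt{2.25 + 2.25 + 1}\sqrt{2.25 + 0.25 + 1}} = \dfrac{4}{\sqrt{5.5}\sqrt{3.5}} \approx 0.912$$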

In [45]:
#get i, j from mean_centered_ratings_matrix
def adjusted_cosine(i, j):
    specified_rating_indices_i = set(specified_rating_indices(i)[0])
    specified_rating_indices_j = set(specified_rating_indices(j)[0])
    
    mutually_specified_ratings_indices = specified_rating_indices_i.intersection(specified_rating_indices_j)
    mutually_specified_ratings_indices = list(mutually_specified_ratings_indices)
    
    i_mutually = i[mutually_specified_ratings_indices]
    j_mutually = j[mutually_specified_ratings_indices]
    
    result = np.sum(np.multiply(i_mutually, j_mutually)) 
    result = result / (np.sqrt(np.sum(np.square(i_mutually))) * np.sqrt(np.sum(np.square(j_mutually))))

    return result

In [46]:
item_0 = mean_centered_ratings_matrix[:, 0]
item_2 = mean_centered_ratings_matrix[:, 2]

print(adjusted_cosine(item_0, item_2))

0.9116846116771036


In [47]:
def get_item_similarity_value_for_item_index(i_index, ratings_matrix):
    mean_centered_ratings_matrix = get_mean_centered_ratings_matrix(ratings_matrix)
    
    user_ratings = mean_centered_ratings_matrix[:, i_index]
    similarity_value = np.array([adjusted_cosine(mean_centered_ratings_matrix[:, i], user_ratings) for i in range(ratings_matrix.shape[1])])
    return similarity_value

In [48]:
def get_item_similarity_matrix(ratings_matrix):  
    similarity_matrix = []
    for i_index in range(mean_centered_ratings_matrix.shape[1]):
        similarity_value = get_item_similarity_value_for_item_index(i_index, ratings_matrix)
        similarity_matrix.append(similarity_value)
    return np.array(similarity_matrix)   

In [49]:
item_similarity_matrix = get_item_similarity_matrix(ratings_matrix)
print(item_similarity_matrix)

[[ 1.          0.73508319  0.91168461 -0.84830227 -0.8124881  -0.9896203 ]
 [ 0.73508319  1.          0.87287156 -0.73391041 -0.99599886 -0.62225073]
 [ 0.91168461  0.87287156  1.         -0.8819171  -0.89442719 -0.91168461]
 [-0.84830227 -0.73391041 -0.8819171   1.          0.70567109  0.82899588]
 [-0.8124881  -0.99599886 -0.89442719  0.70567109  1.          0.73033626]
 [-0.9896203  -0.62225073 -0.91168461  0.82899588  0.73033626  1.        ]]


Let top-k most similar items to item $t$, for which the user $u$ has specified ratings, be denoted by $Q_t(u)$ <br>
The predicted rating $\hat{r_{ut}}$ of user $u$ for target item $t$ is as follows:<br>
$$\hat{r_{ut}} = \dfrac{\Sigma_{j \in Q_t(u)}AdjustedCosine(j, t).r_{uj}}{\Sigma_{j \in Q_t(u)}|AdjustedCosine(j, t)|}$$
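
As a worked example, take user $u = 2$, target item $t = 0$ and $k = 2$. Among the items rated by user 2, the two most similar to item 0 are items 2 and 1, with similarities of about $0.912$ and $0.735$ and raw ratings $r_{22} = 3$, $r_{21} = 3$:
$$\hat{r_{20}} = \dfrac{0.912 \cdot 3 + 0.735 \cdot 3}{0.912 + 0.735} = 3$$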


In [50]:
def item_based_predict(u_index, i_index, k):
    mean_centered_ratings_matrix = get_mean_centered_ratings_matrix(ratings_matrix)
    
    similarity_value = item_similarity_matrix[i_index]
    sorted_items_similar = np.argsort(similarity_value)
    sorted_items_similar = np.flip(sorted_items_similar, axis=0)
        
    # only for this item
    items_rated_by_user = specified_rating_indices(ratings_matrix[u_index, :])[0]
    print(np.array(items_rated_by_user))
    print(sorted_items_similar)
    
    ranked_similar_items = [i for i in sorted_items_similar if i in items_rated_by_user]
    print(ranked_similar_items)
    
    if k < len(ranked_similar_items):
        top_k_similar_item = ranked_similar_items[0:k]   
    else:
        top_k_similar_item = np.array(ranked_similar_items)
        
    print(top_k_similar_item)
        
    ratings_of_user = ratings_matrix[u_index, :]
    top_k_ratings = ratings_of_user[top_k_similar_item]
    
    top_k_similarity_value = similarity_value[top_k_similar_item]
    print(top_k_ratings)
    print(top_k_similarity_value)
    
    r_hat = np.sum(top_k_ratings * top_k_similarity_value)/np.sum(np.abs(top_k_similarity_value))
    return r_hat

In [51]:
print(item_based_predict(2, 0, 2))

[1 2 3 4]
[0 2 1 4 3 5]
[2, 1, 4, 3]
[2, 1]
[3. 3.]
[0.91168461 0.73508319]
2.9999999999999996


In [52]:
def predict_top_k_user_of_item(i_index, k_items, k_users):
    users = []
    for u_index in range(ratings_matrix.shape[0]):
        if np.isnan(ratings_matrix[u_index][i_index]):
            rating = predict(u_index, i_index, k_items)
            users.append((u_index, rating))
    users = sorted(users, key=lambda tup: tup[1])
    return list(reversed(users))

In [53]:
print(predict_top_k_user_of_item(2, 2, 2))

[(1, 6.013729659957365)]


## Add Metrics

In [54]:
def get_euclid_metric():
    pass

# Clustering and Neighborhood-Based Methods

In [55]:
# replace missing (NaN) entries with 0.0 so that k-means can run on a dense matrix
added_0_mean_centered_ratings_matrix = np.where(np.isnan(mean_centered_ratings_matrix), 0.0, mean_centered_ratings_matrix)
print(added_0_mean_centered_ratings_matrix)
print(mean_centered_ratings_matrix)

[[ 1.5  0.5  1.5 -1.5 -0.5 -1.5]
 [ 1.2  2.2  0.  -0.8 -1.8 -0.8]
 [ 0.   1.   1.  -1.  -1.   0. ]
 [-1.5 -0.5 -0.5  0.5  0.5  1.5]
 [-1.   0.  -1.   0.   1.   1. ]]
[[ 1.5  0.5  1.5 -1.5 -0.5 -1.5]
 [ 1.2  2.2  nan -0.8 -1.8 -0.8]
 [ nan  1.   1.  -1.  -1.   nan]
 [-1.5 -0.5 -0.5  0.5  0.5  1.5]
 [-1.   nan -1.   0.   1.   1. ]]


In [56]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(added_0_mean_centered_ratings_matrix)
print('Centers found by scikit-learn:')
print(kmeans.cluster_centers_)
pred_label = kmeans.predict(added_0_mean_centered_ratings_matrix)
print(pred_label)
# kmeans_display(added_0_mean_centered_ratings_matrix, pred_label)

Centers found by scikit-learn:
[[-1.25       -0.25       -0.75        0.25        0.75        1.25      ]
 [ 0.9         1.23333333  0.83333333 -1.1        -1.1        -0.76666667]]
[1 1 1 0 0]


In [57]:
def predict_rating_with_kmean_user(u_index, i_index):
    u_label = pred_label[u_index]

    # only users who actually rated this item can contribute to the prediction
    users_rated_item = specified_rating_indices(ratings_matrix[:, i_index])[0]

    # neighbors: the other users in the same cluster who rated the item
    neighbor = [u for u in range(len(pred_label))
                if pred_label[u] == u_label and u != u_index and u in users_rated_item]

    users_mean_rating = all_user_mean_ratings(ratings_matrix)
    similarity_value = user_similarity_matrix[u_index]

    ratings_in_item = mean_centered_ratings_matrix[:, i_index]
    neighbor_ratings = ratings_in_item[neighbor]
    neighbor_similarity_value = similarity_value[neighbor]

    r_hat = users_mean_rating[u_index] + np.sum(neighbor_ratings * neighbor_similarity_value)/np.sum(np.abs(neighbor_similarity_value))
    return r_hat
    
    

In [58]:
predict_rating_with_kmean_user(2, 0)

3.3463952993809016

K-means for item-based collaborative filtering works the same way, with items (columns) clustered instead of users (rows).

# Dimensionality Reduction and Neighborhood Methods

In this code, we use an SVD-like or PCA-like method for user-based collaborative filtering (for item-based collaborative filtering, we would use $R_f^T$ instead of $R_f$).<br> 
Starting from the initial ratings matrix $R$, we fill each missing value with the mean of the corresponding row (the mean of that user's ratings); the resulting matrix is denoted by $R_f$. <br>
We compute the $n \times n$ similarity matrix between pairs of items, which is given by $S = R_f^TR_f$ (in the PCA-like method, $S$ is the covariance matrix of $R_f$). We then diagonalize the similarity matrix as follows: <br>
$$S = P \Delta P^T$$
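
The diagonalization can be sketched with `np.linalg.eigh`, which is intended for symmetric matrices such as $S$. The small matrix `R_f` below is hypothetical illustration data, not the notebook's ratings. The sketch also checks that the eigenvalues of $S$ equal the squared singular values of $R_f$.

```python
import numpy as np

# Hypothetical filled, mean-centered matrix R_f (rows: users, columns: items).
R_f = np.array([[ 1.0, -1.0,  0.0],
                [-1.0,  1.0,  0.0],
                [ 2.0, -1.0, -1.0],
                [-2.0,  1.0,  1.0]])

S = R_f.T @ R_f                  # n x n item-item similarity matrix
delta, P = np.linalg.eigh(S)     # eigenvalues come back in ascending order

# Reorder so that the largest eigenvalue (and its eigenvector) comes first.
order = np.argsort(delta)[::-1]
delta, P = delta[order], P[:, order]

# Squared singular values of R_f, for comparison with the eigenvalues of S.
singular_values = np.linalg.svd(R_f, compute_uv=False)
print(np.round(delta, 4))
print(np.round(singular_values ** 2, 4))
```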

In [59]:
ratings_matrix = np.array([[1, 1, 1],
                           [7, 7, 7],
                           [3, 1, 1], 
                           [5, 7, 7],
                           [3, 1, nan], 
                           [5, 7, nan], 
                           [3, 1, nan], 
                           [5, 7, nan], 
                           [3, 1, nan], 
                           [5, 7, nan], 
                           [3, 1, nan], 
                           [5, 7, nan]])

We can also mean-center along each row and then along each column; this type of approach generally provides the most robust results.

In [60]:
def get_added_missing_value_with_zero(ratings_matrix):
    mean_centered_ratings_matrix = get_mean_centered_ratings_matrix(ratings_matrix)
    # replace missing (NaN) entries with 0.0, i.e. with the user's mean after centering
    return np.where(np.isnan(mean_centered_ratings_matrix), 0.0, mean_centered_ratings_matrix)

In [61]:
fixed_ratings_matrix = get_added_missing_value_with_zero(ratings_matrix=ratings_matrix)

In [62]:
fixed_ratings_matrix

array([[ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 1.33333333, -0.66666667, -0.66666667],
       [-1.33333333,  0.66666667,  0.66666667],
       [ 1.        , -1.        ,  0.        ],
       [-1.        ,  1.        ,  0.        ],
       [ 1.        , -1.        ,  0.        ],
       [-1.        ,  1.        ,  0.        ],
       [ 1.        , -1.        ,  0.        ],
       [-1.        ,  1.        ,  0.        ],
       [ 1.        , -1.        ,  0.        ],
       [-1.        ,  1.        ,  0.        ]])

In [63]:
from numpy import linalg as LA

In [64]:
# input_matrix = np.dot(fixed_ratings_matrix, np.transpose(fixed_ratings_matrix))

In [65]:
# input_matrix.shape

With the SVD $R_f^T = U S V^T$, we have $P = U$ and $\Delta = SS^T$

In [66]:
U, s, V = LA.svd(np.transpose(fixed_ratings_matrix))

In [67]:
s

array([4.50349532e+00, 1.02560385e+00, 3.64010962e-16])

In [68]:
U.shape

(3, 3)

`s` contains only the diagonal elements of $S$ (the singular values), sorted in descending order.

Compute new ratings matrix by: <br>
$$R_fP_d$$ <br>
$P_d$ is the $n \times d$ matrix containing only the columns of $P$ that correspond to the $d$ largest eigenvalues (identified by the $d$ largest values in $s$)

We can choose the fraction of information (variance) that we want to keep, for example at least 90%, by picking the smallest $k$ such that:<br>
$$\frac{\sum_{i = 1}^k \sigma_i^2}{\sum_{j = 1}^r \sigma_j^2} \geq 0.9$$
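
A minimal sketch of this criterion, using the singular values printed above. The names `sv`, `energy` and `k` are hypothetical and the 0.9 threshold is the one from the formula.

```python
import numpy as np

# Singular values reported above for the toy example.
sv = np.array([4.50349532e+00, 1.02560385e+00, 3.64010962e-16])

# energy[i] is the fraction of squared singular values kept by the first i + 1 components.
energy = np.cumsum(sv ** 2) / np.sum(sv ** 2)

# Smallest k such that the first k components retain at least 90%.
k = int(np.searchsorted(energy, 0.9) + 1)
print(np.round(energy, 4), k)
```

On this toy example the first component alone retains about 95%, so the 90% criterion keeps one component, whereas a hard threshold of 0.01 on the singular values keeps two.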

In [69]:
# compute new_ratings_matrix
# in this code, we use hard threshold 0.01
d = s[s > 0.01].shape[0]

In [70]:
U[:, 0:2]


array([[-0.75130448, -0.31970025],
       [ 0.65252078, -0.49079864],
       [ 0.0987837 ,  0.81049889]])

In [71]:
new_ratings_matrix = np.dot(fixed_ratings_matrix, U[:, 0:d])

In [72]:
new_ratings_matrix.shape

(12, 2)

In [73]:
new_ratings_matrix

array([[ 0.        ,  0.        ],
       [ 0.        ,  0.        ],
       [-1.50260895, -0.63940051],
       [ 1.50260895,  0.63940051],
       [-1.40382525,  0.17109838],
       [ 1.40382525, -0.17109838],
       [-1.40382525,  0.17109838],
       [ 1.40382525, -0.17109838],
       [-1.40382525,  0.17109838],
       [ 1.40382525, -0.17109838],
       [-1.40382525,  0.17109838],
       [ 1.40382525, -0.17109838]])

From there, we can compute the similarity matrix for user-based collaborative filtering, very similar to previous section.

## Handling problem with bias

### Maximum Likelihood Estimation (missing)

### Direct Matrix Factorization of Incomplete Data

$$R = Q \Sigma P^T$$ <br>
$$R \approx Q_d\Sigma_dP_d^T $$

In [74]:
ratings_matrix

array([[ 1.,  1.,  1.],
       [ 7.,  7.,  7.],
       [ 3.,  1.,  1.],
       [ 5.,  7.,  7.],
       [ 3.,  1., nan],
       [ 5.,  7., nan],
       [ 3.,  1., nan],
       [ 5.,  7., nan],
       [ 3.,  1., nan],
       [ 5.,  7., nan],
       [ 3.,  1., nan],
       [ 5.,  7., nan]])

In [75]:
fixed_ratings_matrix

array([[ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 1.33333333, -0.66666667, -0.66666667],
       [-1.33333333,  0.66666667,  0.66666667],
       [ 1.        , -1.        ,  0.        ],
       [-1.        ,  1.        ,  0.        ],
       [ 1.        , -1.        ,  0.        ],
       [-1.        ,  1.        ,  0.        ],
       [ 1.        , -1.        ,  0.        ],
       [-1.        ,  1.        ,  0.        ],
       [ 1.        , -1.        ,  0.        ],
       [-1.        ,  1.        ,  0.        ]])

In [76]:
# np.linalg.svd returns the transposed factor, so the rows of V are the right singular vectors
U, s, V = LA.svd(fixed_ratings_matrix)

In [77]:
Q_d = U[:, 0:2]
s_d = np.diag(s[0:2])
# take the first two rows of V (not columns) and transpose them to get the n x d matrix P_d
P_d = np.transpose(V[0:2, :])

In [78]:
R = np.dot(np.dot(Q_d, s_d), np.transpose(P_d))

In [79]:
# the third singular value is numerically zero, so the rank-2 reconstruction recovers fixed_ratings_matrix
np.allclose(R, fixed_ratings_matrix)

True

# A Regression Modeling View of Neighborhood Methods