#  **4. Recommender Systems**

## What are Embeddings in Machine Learning? 

* Embeddings are representations of values or objects like text, images and audio, designed to be consumed by ML models and **semantic search algorithms**. 
  * **Semantic Search Algorithms** $\rightarrow$ Algorithms designed to improve the accuracy and relevance of search results by understanding the intent and contextual meaning behind a user's query, rather than relying solely on matching keywords. This contrasts with lexical search, which matches the literal text. 
* Embeddings **translate objects like these into a mathematical form according to their factors or traits**: the traits each one may or may not have, and the categories they belong to. They allow models to understand the relationships between words and other objects. 

How do we represent a word to a neural net? 
Neural nets view things as vectors of floats. An image, for example, is a long row of pixels with values from 0 to 1. So we can assume that if two vectors are close to each other, the things they represent are similar. 

Imagine that, instead of comparing vectors, we had to represent each word by its position in a dictionary of all words. There is no notion of similarity, so we lose context. Take two related words, 'music' and 'concert': 'concert' would be near the beginning of the dictionary (for example [0,0,1,0,0,0,...]), whereas 'music' would be further along ([0,0,0,...,0,1,0,0]). If we only look at these dictionary positions, we can't identify any underlying context, because we're only using the words' literal form. 
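To make the dictionary problem concrete, here is a minimal sketch. The five-word vocabulary and the helper names `one_hot` and `cosine` are made up for illustration; the point is only that dictionary-style vectors make every pair of distinct words look equally unrelated.

```python
import numpy as np

# Hypothetical toy vocabulary, sorted alphabetically like a dictionary
vocab = ["cat", "concert", "dog", "music", "pet"]

def one_hot(word):
    """Return a vector with a single 1 at the word's dictionary position."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every pair of distinct words is orthogonal, whether related or not
sim_related = cosine(one_hot("music"), one_hot("concert"))
sim_unrelated = cosine(one_hot("music"), one_hot("cat"))
print(sim_related, sim_unrelated)  # 0.0 0.0
```

Both similarities are zero, so this representation cannot tell 'concert' apart from 'cat' when asked what is close to 'music'.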

If we go for lexical similarity and complete the sentence 'I like to play with my pet ____', the model could return 'car' instead of 'cat', because the two words are spelled almost the same, even though 'car' has nothing to do with what we actually want. With semantic search we would get something like 'dog' instead of 'cat': the two are not lexically similar, but the result fits the context much better. 

***So the question that we're hunting for is: What does it mean that two words are similar?***

What we assume here is that two words are similar if they're used in the same contexts, so the word 'dog' would be grouped with words like 'cat', 'pet', etc. 

This allows our networks to better understand what we're talking about. If we have a model whose job is to guess the next word in a sentence and it does so well, it means that it's able to compress the data very efficiently: we encode information about the word in a small vector, which reduces the dimensionality. 

If we run this on a large enough dataset, with enough compute, we obtain a vector for each word such that similar words have similar vectors, and it works surprisingly well given that it only ever sees raw text. (GANs, by contrast, are more oriented towards images.)  

* King - Man + Woman ≈ Queen $\rightarrow$ This is one way of seeing how embeddings capture context: relationships between words become directions in the vector space. However, aha! Embeddings are learned from human-written text, so this is also where you see the true nature of the bias of the humans behind the data. 
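The analogy can be reproduced with hand-made vectors. This is only an illustrative sketch: the 2-D embeddings below are invented (first coordinate loosely "royalty", second loosely "gender"), whereas real embeddings are learned and have many more dimensions.

```python
import numpy as np

# Invented 2-D embeddings, NOT learned from data
emb = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector arithmetic: king - man + woman
target = emb["king"] - emb["man"] + emb["woman"]

# Look for the closest word, excluding the three words used in the query
candidates = {w: v for w, v in emb.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```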

And the coolest thing, is that it gives you context, in a completely unsupervised manner. That's awesome.

*Technically, embeddings are vectors learned by ML models to capture meaningful information about the data; how close two embeddings are is typically measured with cosine similarity.*

### Awesome, nice theory, how do we create it?

If we want to plug words into a neural network, we need to find a way to transform words into numbers. 
An easy way to transform words into numbers is to assign each word a random number. 
However, as we've learned previously, we need to design the system such that we can group words and understand context. 

Let's consider the following example: 'Embeddings are great!' 
If we were to randomly assign a number to each word, we could get something like [12, -3.05, 42]. However if we get 'Embeddings are awesome!', we would get [12,-3.05, 4.2] which doesn't account for the fact that **great** and **awesome** are pretty similar. 

Bear in mind, we must also consider sarcasm, as in: 'Your kid broke the window, great'. So it would be ideal to have one number that keeps track of the positive ways in which 'great' is used, and a different number that keeps track of the negative ways. 

This sounds like a lot of work to consider all of these components, but thanks to a Neural Network, we can easily establish this system.

Steps: 
1. Create input for each unique word in the training data. 
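2. Connect each input to at least one activation function. The number of activation functions determines how many numbers (embedding values) each word gets, so more activation functions let the network keep track of more contexts.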
3. The way that we must train these NN's and optimise the cost function is on the basis of next-word prediction. 
   * If you take our previous example, 'Embeddings are great!', we would take the input 'Embeddings' and train for the word 'are'!
   * For 'are' as in input, we'd want the output for the next word to be 'great!'. 
4. Now that we have a cost function, we can optimise the weights using Back-Propagation, which we saw at the very beginning of the subject, so that the weights for similar words end up alike. 
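The steps above can be sketched as a tiny NumPy network. This is a minimal sketch under assumptions, not the exact setup from the sources: the hidden layer uses identity activations, each word gets 2 embedding values, and the names `E` and `W`, the learning rate of 0.2 and the 2000 training steps are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["embeddings", "are", "great"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# Next-word training pairs: (input word, word that follows it)
pairs = [("embeddings", "are"), ("are", "great")]
x = np.array([word_to_id[a] for a, _ in pairs])
y = np.array([word_to_id[b] for _, b in pairs])

n_words, n_dims = len(vocab), 2  # 2 activation units -> 2 numbers per word
E = rng.normal(scale=0.5, size=(n_words, n_dims))  # input weights = embeddings
W = rng.normal(scale=0.5, size=(n_dims, n_words))  # hidden -> output weights

def forward(ids):
    hidden = E[ids]                                  # one-hot input selects a row of E
    logits = hidden @ W
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)        # softmax over the vocabulary
    return hidden, probs

losses = []
for step in range(2000):
    hidden, probs = forward(x)
    losses.append(-np.log(probs[np.arange(len(y)), y]).mean())  # cross-entropy
    # Back-propagation for softmax + cross-entropy
    grad_logits = probs.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0
    grad_logits /= len(y)
    grad_W = hidden.T @ grad_logits
    grad_hidden = grad_logits @ W.T
    grad_E = np.zeros_like(E)
    np.add.at(grad_E, x, grad_hidden)
    E -= 0.2 * grad_E
    W -= 0.2 * grad_W

_, probs = forward(x)
predicted = [vocab[i] for i in probs.argmax(axis=1)]
print(predicted)   # predicted next word for "embeddings" and for "are"
print(E)           # each row is the learned 2-number embedding of one word
```

After training, the rows of `E` are the embeddings: the network never sees anything but word ids, yet the weights it needs for next-word prediction double as the word vectors.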

![Neural network for Embeddings](image-2.png)

Having similar words with similar embeddings means training a NN to process language is easier, as learning how one word is used shows us how other similar words are used. 

However, for all this talk of context, we're only training the Neural Network to predict the next word. In the case of 'Embeddings are great!', when we move from *Embeddings* to predicting the word after *are*, the model only sees *are*, so it might lose the context of the embeddings topic and simply look for the next adjective.

This is why Word2Vec was invented. It was created in order to include more context, and it does so with the following two techniques: 

1. Continuous Bag of Words $\rightarrow$ Increases the context by using the surrounding words, to predict the word in the middle. 
   So it would take the words 'Embeddings' and 'great!', in order to predict the word 'are'.
2. Skip Gram $\rightarrow$ Takes a single word and tries to predict its surrounding words. 
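The two techniques differ only in how training examples are cut out of a sentence. The sketch below builds both kinds of examples for 'Embeddings are great!'; the function names and the window size of 1 are assumptions made for illustration.

```python
sentence = ["embeddings", "are", "great"]
window = 1  # number of words taken on each side of the centre word

def cbow_pairs(tokens, window):
    """(context words -> centre word) examples, using full windows only."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, tokens[i]))
    return pairs

def skipgram_pairs(tokens, window):
    """(centre word -> one surrounding word) examples."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

print(cbow_pairs(sentence, window))
# [(['embeddings', 'great'], 'are')]
print(skipgram_pairs(sentence, window))
# [('embeddings', 'are'), ('are', 'embeddings'), ('are', 'great'), ('great', 'are')]
```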

So far, we've used very simple examples that only reference a single sentence. Real word2vec models are trained on very large text corpora (for example, all of Wikipedia), so you can see how things escalate: a much larger vocabulary on the input side, a much larger weight matrix (even if we only have a single dense layer) and finally a much larger sum to get from the activation functions to the outputs. 

So this takes a **long time** to train. However, word2vec uses a neat technique called *Negative Sampling* to save processing time. For each training example, instead of updating the weights for every word in the vocabulary, it randomly selects a small group of words that we do not want to predict (the negatives) and only updates the weights for those words and for the word we do want to predict. 
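A minimal sketch of the sampling step follows. It is a simplification under stated assumptions: the vocabulary is invented, negatives are drawn uniformly (the original word2vec implementation draws them from a smoothed word-frequency distribution), and `sample_negatives` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["embeddings", "are", "great", "music", "concert", "cat", "dog", "pet"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def sample_negatives(positive_id, k):
    """Draw k distinct word ids that are not the true target word."""
    candidates = np.array([i for i in range(len(vocab)) if i != positive_id])
    return rng.choice(candidates, size=k, replace=False)

positive = word_to_id["are"]            # true next word for "embeddings"
negatives = sample_negatives(positive, k=3)

# Only these output units are updated for this training example,
# instead of all len(vocab) of them
updated_units = np.concatenate(([positive], negatives))
print(len(updated_units), "of", len(vocab), "output units updated")
```

With a realistic vocabulary of hundreds of thousands of words, touching a handful of output units per example instead of all of them is where the saving comes from.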

Once training is finished, the learned weights give you one vector per word: a lookup table of embeddings, which is the kind of data stored in what's known as a vector database. 
 
*Sources which helped in the writing of this post:*
* ComputerPhile - Word Embeddings - https://www.youtube.com/watch?v=gQddtTdmG_8
* StatQuest - Word Embedding and Word2Vec https://www.youtube.com/watch?v=viZrOnJclY0 
---

Recommender systems are systems which recommend items (such as movies, books, ads) to users based on various information, such as their past viewing/purchasing behavior (e.g., which movies they rated high or low, which ads they clicked on), as well as optional “side information” such as demographics about the user, or information about the content of the item (e.g., its title, genre or price). Such systems are widely used by various internet companies, such as Facebook, Amazon, Netflix, Google, etc.  

The internet has changed how we consume media, products, and services. With so many options and choices, it becomes overwhelming to select the right one. That’s where Recommender Systems come in. Recommender Systems are intelligent algorithms that analyze user behaviour, preferences, and data to suggest personalized recommendations. These systems are widely used in e-commerce, streaming services, social networks, and other domains.

In this chapter, we give a brief introduction to the topic.

![Schema](image.png)

### 3 types: 
* Content-based filtering: 
  * Uses item descriptions and a user's past preferences to make recommendations. This creates a profile for each user based on the attributes of the items they have liked or interacted with. 
  * Good for transparency and user independence. 

* **Collaborative filtering:**
  * Recommendation approach that relies on the collective preferences and behaviours of users. Operates under the assumption that if users agreed in the past, they will agree in the future. 
    * Model-based filtering 
    * Memory-based filtering 
      * User-Based CF: Recommendations are made based on the preferences of similar users. 
      * Item-Based CF: Items recommended based on their similarity to items the user has liked in the past. 

* Hybrid filtering technique: 
  * Combines elements of Collaborative Filtering and Content-Based Filtering to offset the limitations of each one. 
  Implementation:
    * Separate implementation where we run CF and CBF separately and combine their predictions. 
    * Feature Combination: Integrate content-based characteristics into a collaborative approach. 
    * Unified model 
  Overall, it is usually the strongest option, as it improves accuracy and flexibility; however, it is more complex to build. 



# **COLLABORATIVE FILTERING**

![advantages](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*7uW5hLXztSu_FOmZOWpB6g.png)

## **4.1 Memory-Based Collaborative Filtering**

**Memory-based** collaborative filtering algorithms, also referred to as **Neighborhood-based** algorithms, were among the earliest algorithms developed for collaborative filtering. These algorithms are based on the fact that similar users display similar patterns of rating behavior and similar items receive similar ratings. There are two primary types of neighborhood-based algorithms:

1. **User-based** collaborative filtering: In this case, the ratings provided by similar users to a target user A are used to make recommendations for A. The predicted ratings of A are computed as the weighted average values of these “peer group” ratings for each item.
   
2. **Item-based** collaborative filtering: In order to make recommendations for target item B, the first step is to determine a set S of items, which are most similar to item B. Then, in order to predict the rating of any particular user A for item B, the ratings in set S, which are specified by A, are determined. The weighted average of these ratings is used to compute the predicted rating of user A for item B.

An important distinction between user-based collaborative filtering and item-based collaborative filtering algorithms is that **the ratings in the former case are predicted using the ratings of neighboring users**, whereas **the ratings in the latter case are predicted using the user’s own ratings on neighboring (i.e., closely related) items.** 

In user-based, **neighborhoods are defined by similarities among users** (rows of ratings matrix), whereas in the latter case, **neighborhoods are defined by similarities among items** (columns of ratings matrix). Thus, the two methods share a complementary relationship. **Nevertheless, there are considerable differences in the types of recommendations that are achieved using these two methods.**

For the purpose of subsequent discussion, we assume that the user-item ratings matrix is an incomplete $m × n$ matrix $R = [r_{uj}]$ containing $m$ users and $n$ items. It is assumed that only a small subset of the ratings matrix is specified or observed. 

Like all other collaborative filtering algorithms, neighborhood-based collaborative filtering algorithms can be formulated in one of two ways:

1. *Predicting the rating value of a user-item combination*: 
   
   This is the simplest and most primitive formulation of a recommender system. In this case, **the missing rating** $r_{uj}$ of the user $u$ for item $j$ **is predicted**.
   
2. *Determining the top-k items or top-k users*: 
   
   In most practical settings, the merchant is not necessarily looking for specific ratings values of user-item combinations. **Rather, it is more interesting to learn the top-k most relevant items for a particular user**, or the top-k **most relevant users for a particular item**. *The problem of determining the top-k items is more common than that of finding the top-k users*. This is because the former formulation is used to **present lists of recommended items to users in web-centric scenarios**. In traditional recommendation algorithms, the “top-k problem” almost *always refers to the process of finding the top-k items*, rather than the top-k users. However, the latter formulation is also useful to the merchant because it can be used to determine the best users to target with marketing efforts.
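The second formulation is usually built on top of the first: predict the missing ratings, then rank. A minimal sketch, assuming the predictions already exist (the two arrays below are invented, and `top_k_items` is a hypothetical helper):

```python
import numpy as np

# Hypothetical ratings for one user; NaN marks items the user has not rated
observed = np.array([5.0, np.nan, 3.0, np.nan, np.nan, 1.0])
# Hypothetical predictions produced by some rating-prediction model
predicted = np.array([4.8, 3.9, 3.1, 4.6, 2.0, 1.2])

def top_k_items(observed, predicted, k):
    """Indices of the k unrated items with the highest predicted rating."""
    unrated = np.flatnonzero(np.isnan(observed))
    order = np.argsort(-predicted[unrated])   # descending predicted rating
    return unrated[order[:k]]

print(top_k_items(observed, predicted, k=2))  # [3 1]
```

Items the user has already rated are excluded, since recommending something they have already seen is rarely useful.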

### **Key properties of Rating Matrices**

As discussed earlier, we assume that the ratings matrix is denoted by $R$, and it is an $m \times n$ matrix containing $m$ users and $n$ items. Therefore, the rating of user $u$ for item $j$ is denoted by $r_{uj}$. Only a small subset of the entries in the ratings matrix are typically specified. The specified entries of the matrix are referred to as the training data, whereas the unspecified entries of the matrix are referred to as the test data. This mirrors classification and regression, where all the unspecified entries belong to a single special column, which is known as the class variable or dependent variable; in a ratings matrix, the missing entries can appear in any column. Therefore, the recommendation problem can be viewed as a generalization of the problem of classification and regression.

Ratings can be defined in a variety of ways, depending on the application at hand:
1. Continuous ratings. (Height in Centimeters)
2. Interval-based ratings. (40º - 50º C)
3. Ordinal ratings. (Agree, neutral, disagree)
4. Binary ratings. (yes/no, true/false)
5. Unary ratings. (The like button)

It is noteworthy that the indirect derivation of unary ratings from customer actions is also referred to as implicit feedback, because the customer does not explicitly provide feedback.

### **Long-tail property**

The future of business is selling less of more: 
selling small quantities of a very large number of niche products, instead of relying only on a few best-sellers that sell in huge quantities. 
In the age of the internet, there are so many products which can sell, but it's only a reduced number of them which are the big boys. 
The same pattern shows up among companies, like the four horsemen: Facebook, Amazon, Apple, Google. 

The idea is related to the Pareto principle, which implies that roughly 80% of results come from 20% of causes. 

In the economy and e-commerce, it refers to a distribution of products and services in which a small percentage of products/services generate the largest part of the sales. In tech, it refers to a distribution of data in which a small amount of the data is accessed most frequently, while the rest is rarely consulted. 

![longtail](https://lh6.googleusercontent.com/NlrWR-SLfo5K7w2pFzFqcmWVVe0YNQvtbJ_iIkhGhWwDU6nnkpR0Vzh8xjenCCW2i3hg_Y1MPxNVVoc6zIgC1O14pN-CTYlr9amX1TcgmXi59ryLD5_FnKKov72Oa55HfMY-4eNAQlMrQcBaMQIMIGIEEy8Z1DvnUEFqSOy4GJYv12d8tUmH126RUX-wKw)

The distribution of ratings among items **often satisfies a property in real-world settings**, which is referred to as the **long-tail property**. According to this property, only a small fraction of the items are rated frequently. Such items are referred to as popular items. The vast majority of items are rated rarely. This results in a highly skewed distribution of the underlying ratings. An example of a skewed rating distribution is illustrated in Figure 2.1. The X-axis shows the index of the item in order of decreasing frequency, and the Y-axis shows the frequency with which the item was rated. It is evident that most of the items are rated only a small number of times. Such a rating distribution has important implications for the recommendation process:

1. In many cases, the high-frequency items tend to be relatively competitive items with little profit for the merchant. On the other hand, the lower frequency items have larger profit margins. In such cases, it may be advantageous to the merchant to recommend lower frequency items. In fact, analysis suggests that many companies, such as Amazon, make most of their profit by selling items in the long tail.
   
2. Because of the rarity of observed ratings in the long tail it is generally more difficult to provide robust rating predictions in the long tail. In fact, many recommendation algorithms have a tendency to suggest popular items rather than infrequent items. This phenomenon also has a negative impact on diversity, and users may often become bored by receiving the same set of recommendations of popular items.
   
3. The long-tailed distribution implies that the items, which are frequently rated by users, are fewer in number. **This fact has important implications for neighborhood-based collaborative filtering algorithms because the neighborhoods are often defined on the basis of these frequently rated items. In many cases, the ratings of these high-frequency items are not representative of the low-frequency items because of the inherent differences in the rating patterns of the two classes of items.** As a result, the prediction process may yield misleading results. As we will discuss in section 7.6 of Chapter 7, this phenomenon can also cause misleading evaluations of recommendation algorithms.
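To check whether a dataset has a long tail, it is enough to count ratings per item and sort. The sketch below uses invented counts for ten items; with real data the counts would come from the ratings matrix.

```python
import numpy as np

# Hypothetical number of ratings received by each of 10 items
rating_counts = np.array([3, 500, 8, 120, 2, 15, 1, 40, 5, 6])

# Sort items by decreasing popularity, as on the X-axis of a long-tail plot
sorted_counts = np.sort(rating_counts)[::-1]

# Share of all ratings captured by the top 20% most-rated items
n_head = max(1, len(sorted_counts) // 5)
head_share = sorted_counts[:n_head].sum() / sorted_counts.sum()

print(sorted_counts)
print(f"top {n_head} items hold {head_share:.1%} of the ratings")
```

In this made-up example the two most popular items hold 620 of the 700 ratings, which is the kind of skew the long-tail property describes.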


The basic idea in neighborhood-based methods is to use either user-user similarity or item-item similarity to make recommendations from a ratings matrix. The concept of a neighborhood implies that we need to determine either similar users or similar items in order to make predictions. In the following, we will discuss how neighborhood-based methods can be used to predict the ratings of specific user-item combinations. There are two basic principles used in neighborhood-based models:
- (a) **User-based models (On the left)**: Similar users have similar ratings on the same item. Therefore, if Alice and Bob have rated movies in a similar way in the past, then one can use Alice’s observed ratings on the movie Terminator to predict Bob’s unobserved ratings on this movie.
- (b) **Item-based models (On the right)**: Similar items are rated in a similar way by the same user. Therefore, Bob’s ratings on similar science fiction movies like Alien and Predator can be used to predict his rating on Terminator.


![useitem](https://www.researchgate.net/publication/355218515/figure/fig2/AS:1079169563787266@1634305482033/The-collaborative-filtering-algorithms-a-user-based-b-item-based.png)

### **4.1.1 User-based models**

In this approach, user-based neighborhoods **are defined in order to identify similar users to the target user for whom the rating predictions are being computed.** In order to **determine the neighborhood of the target user $u$, her similarity to all the other users is computed.** Therefore, a similarity function needs to be defined between the ratings specified by users. **Such a similarity computation is tricky because different users may have different scales of ratings.**

One user might be biased toward liking most items, whereas another user might be biased toward not liking most of the items. Furthermore, different users may have rated different items. Therefore, **mechanisms need to be identified to address these issues.**

For the $m \times n$ ratings matrix $R = [r_{uj}]$ with $m$ users and $n$ items, let $I_{u}$ denote the set of item indices for which ratings have been specified by user (row) $u$. For example, if the ratings of the first, third, and fifth items (columns) of user (row) $u$ are specified (observed) and the remaining are missing, then we have $I_{u} = \{1,3,5\}$.

Therefore, the set of items rated by both users $u$ and $v$ is given by $I_{u} \cap I_{v}$. For example, if user $v$ has rated the first four items, then $I_{v} = \{1,2,3,4\}$, and $I_{u} \cap I_{v} = \{1,3,5\} \cap \{1,2,3,4\} = \{1,3\}$.

**It is possible (and quite common) for $I_{u}\cap I_{v}$ to be an empty set because ratings matrices are generally sparse**. The set $I_{u}\cap I_{v}$ defines the mutually observed ratings, which are used to compute the similarity between the $u$ th and $v$ th users for neighborhood computation.
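In code, $I_{u}$ and $I_{v}$ are simply the positions of the non-missing entries in each row. A small sketch reproducing the example above, with invented rating values and a hypothetical helper `rated_items`:

```python
import numpy as np

# Rows of a hypothetical ratings matrix; NaN marks a missing rating
r_u = np.array([4.0, np.nan, 5.0, np.nan, 2.0])   # rated items 1, 3, 5
r_v = np.array([3.0, 1.0, 4.0, 2.0, np.nan])      # rated items 1, 2, 3, 4

def rated_items(row):
    """Set of 1-based item indices with an observed rating."""
    return {int(j) + 1 for j in np.flatnonzero(~np.isnan(row))}

I_u, I_v = rated_items(r_u), rated_items(r_v)
common = I_u & I_v   # mutually observed ratings
print(sorted(I_u), sorted(I_v), sorted(common))  # [1, 3, 5] [1, 2, 3, 4] [1, 3]
```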

One measure that captures the similarity $Sim(u,v)$ between the rating vectors of two users $u$ and $v$ is the **Pearson correlation coefficient**. Because $I_{u}\cap I_{v}$ represents the set of item indices for which both user $u$ and user $v$ have specified ratings, the coefficient is computed only on this set of items. 

The first step is to compute the **mean rating $\mu_{u}$** for each user $u$ using her specified ratings:

$$ \mu_{u} = \frac{\sum_{k \in I_{u}} r_{uk}}{|I_{u}|} \quad \forall u \in \{ 1, \cdots, m\}$$

Then, the **Pearson correlation coefficient** between the rows (users) u and v is defined as
follows:

$$Sim(u,v) = Pearson(u,v) = \frac{\sum_{k \in I_{u}\cap I_{v}}(r_{uk} - \mu_{u}) \cdot (r_{vk} - \mu_{v})}{\sqrt{\sum_{k \in I_{u}\cap I_{v}}(r_{uk} - \mu_{u})^{2}} \cdot \sqrt{\sum_{k \in I_{u}\cap I_{v}}(r_{vk} - \mu_{v})^{2}}} $$

> The numerator is essentially the dot product of the two users' mean-centered rating vectors (over the items both have rated), and the denominator is the product of the Euclidean norms of those same mean-centered vectors. 

> **Pearson's Coefficient quantifies the strength and direction of a linear relationship between two variables. It is a normalized measurement of the covariance of the two variables, and its value ranges from -1 to 1.**

Strictly speaking, the traditional definition of $Pearson(u,v)$ tells us that the values of $\mu_{u}$ and $\mu_{v}$ should be computed only over the items that are rated both by users $u$ and $v$. However, it is quite common (and computationally simpler) to compute each $\mu_{u}$ just once for each user $u$, over all of her specified ratings.

This also takes less compute, since the means do not have to be recomputed for every pair of users.
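The two ways of computing the means generally give slightly different coefficients. The sketch below compares them on two illustrative rating rows; the helper name `pearson` and the variable names are assumptions, and the rows are made-up example data.

```python
import numpy as np

# Ratings of two users over six items (NaN = not rated)
r_u = np.array([6.0, 7.0, np.nan, 4.0, 3.0, 4.0])
r_v = np.array([np.nan, 3.0, 3.0, 1.0, 1.0, np.nan])

common = ~np.isnan(r_u) & ~np.isnan(r_v)   # mask for the items both users rated

def pearson(ru, rv, mu_u, mu_v):
    """Pearson coefficient for given rating vectors and user means."""
    du, dv = ru - mu_u, rv - mu_v
    return float(np.sum(du * dv) / np.sqrt(np.sum(du**2) * np.sum(dv**2)))

# Variant 1: each mean is computed once over all of the user's ratings
sim_global = pearson(r_u[common], r_v[common], np.nanmean(r_u), np.nanmean(r_v))
# Variant 2 (traditional definition): means over the common items only
sim_strict = pearson(r_u[common], r_v[common],
                     r_u[common].mean(), r_v[common].mean())
print(round(sim_global, 3), round(sim_strict, 3))  # 0.938 0.971
```

The second variant agrees with `np.corrcoef` on the common items, while the first is the cheaper approximation described above.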

Cosine Similarity (computed on the raw ratings over the items both users have rated): 

$$ Cosine (u, v) = \frac{r_{u} \cdot r_{v}}{||r_{u}|| \cdot ||r_{v}||} $$

**Python Implementation**

In [1]:
import pandas as pd
import numpy as np
from numpy import isnan
from numpy import linalg as LA

In [2]:
# * We're filling in the matrix
df = pd.DataFrame(
    {
        "user_1": [7, 6, 7, 4, 5, 4],
        "user_2": [6, 7, np.nan, 4, 3, 4],
        "user_3": [np.nan, 3, 3, 1, 1, np.nan],
        "user_4": [1, 2, 2, 3, 3, 4],
        "user_5": [1, np.nan, 1, 2, 3, 3],
    },
    index=["item_1", "item_2", "item_3", "item_4", "item_5", "item_6"],
)

df = df.T

In [3]:
# * Calculate the mean for each user
df["Mean Rating"] = df.mean(axis=1)
df.head()

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6,Mean Rating
user_1,7.0,6.0,7.0,4.0,5.0,4.0,5.5
user_2,6.0,7.0,,4.0,3.0,4.0,4.8
user_3,,3.0,3.0,1.0,1.0,,2.0
user_4,1.0,2.0,2.0,3.0,3.0,4.0,2.5
user_5,1.0,,1.0,2.0,3.0,3.0,2.0


In [4]:
# * Generate the Pearson Coefficient
def sim_pearson(df, j):
    pearson_values = []
    for i in np.arange(len(df.values)):
        mask = ~isnan(df.values[i]) * ~isnan(df.values[j])
        mu_u = df["Mean Rating"].iloc[i]
        mu_v = df["Mean Rating"].iloc[j]
        ru = df.values[i][mask]
        rv = df.values[j][mask]

        numerador = np.sum((ru - mu_u) * (rv - mu_v))
        denominador = np.sqrt(np.sum((ru - mu_u) ** 2) * np.sum((rv - mu_v) ** 2))

        pearson = numerador / denominador
        pearson_values.append(pearson)
    return pearson_values

In [5]:
df["Pearson (i, 3)"] = sim_pearson(df, 2)
# * Generate the pearson similarity for user 3
# * Users 1 and 2 are the most similar to user 3; users 4 and 5 are negatively correlated with user 3
df.head()

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6,Mean Rating,"Pearson (i, 3)"
user_1,7.0,6.0,7.0,4.0,5.0,4.0,5.5,0.894427
user_2,6.0,7.0,,4.0,3.0,4.0,4.8,0.938474
user_3,,3.0,3.0,1.0,1.0,,2.0,1.0
user_4,1.0,2.0,2.0,3.0,3.0,4.0,2.5,-1.0
user_5,1.0,,1.0,2.0,3.0,3.0,2.0,-0.816497


In [6]:
def sim_cosine(df, i, j):
    # Use the item columns only: df also holds the helper columns
    # "Mean Rating" and "Pearson (i, 3)", which must not enter the dot product
    ratings = df.filter(like="item_").values
    mask = ~isnan(ratings[i]) & ~isnan(ratings[j])
    ru = ratings[i][mask]
    rv = ratings[j][mask]
    cosine = np.dot(ru, rv) / (LA.norm(ru) * LA.norm(rv))
    return cosine

In [7]:
df["Cosine (i, 3)"] = [sim_cosine(df, i, 2) for i in np.arange(len(df.values))]
# * The raw cosine tells a slightly different story: ratings are not mean-centered,
# * so users 4 and 5 now look fairly similar to user 3
df.head()

Unnamed: 0,item_1,item_2,item_3,item_4,item_5,item_6,Mean Rating,"Pearson (i, 3)","Cosine (i, 3)"
user_1,7.0,6.0,7.0,4.0,5.0,4.0,5.5,0.894427,0.956183
user_2,6.0,7.0,,4.0,3.0,4.0,4.8,0.938474,0.981399
user_3,,3.0,3.0,1.0,1.0,,2.0,1.0,1.0
user_4,1.0,2.0,2.0,3.0,3.0,4.0,2.5,-1.0,0.789352
user_5,1.0,,1.0,2.0,3.0,3.0,2.0,-0.816497,0.644658




#### Pearson's 

Pros:
- **Context Sensitivity:** Adjusts for users' rating scales, identifying patterns even if users rate differently.
- **Normalization:** Accounts for rating biases, useful when users have varying rating scales.

Cons:
- **Sparse Data Sensitivity:** Less reliable in datasets with few common ratings between users.
- **Computationally Intensive:** Requires more calculations, especially in large datasets.

#### Cosine 

Pros:
- **Efficiency:** More computationally efficient, particularly with large, sparse datasets.
- **Versatility:** Suitable for various data types and effective in high-dimensional spaces.

Cons:
- **Lack of Normalization:** Does not adjust for differences in users' rating behaviors.
- **Insensitive to Rating Scale:** Overlooks variations in user rating patterns.

The choice between Pearson's and cosine similarity depends on dataset characteristics and requirements. Pearson's is beneficial for adjusting to users' rating biases, while cosine is preferred for computational efficiency and high-dimensional data applicability.
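The normalization difference is easy to see on a toy case. In this sketch the three rating vectors are invented: two users with the same tastes but different generosity, and one with reversed preferences.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity on raw ratings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation, i.e. the cosine of the mean-centered vectors."""
    return cosine(a - a.mean(), b - b.mean())

strict = np.array([1.0, 2.0, 3.0])      # hypothetical harsh rater
generous = strict + 2.0                 # same tastes, shifted up by 2 stars
opposite = np.array([3.0, 2.0, 1.0])    # reversed preferences

print(pearson(strict, generous), cosine(strict, generous))
print(pearson(strict, opposite), cosine(strict, opposite))
```

Up to floating-point rounding, Pearson gives 1 for the shifted rater and -1 for the reversed one, while the raw cosine gives about 0.98 and 0.71: it still sees the reversed user as fairly similar, because all ratings are positive.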


**Choosing your neighbours**

The Pearson coefficient is computed between the target user and all the other users. One way of defining the peer group of the target user would be to use the set of k users with the highest Pearson coefficient with the target. However, since the number of observed ratings in the top-k peer group of a target user may vary significantly with the item at hand, the closest k users are found for the target user separately for each predicted item, such that each of these k users have specified ratings for that item. The weighted average of these ratings can be returned as the predicted rating for that item.

The main problem with this approach is that different users may provide ratings on different scales. One user might rate all items highly, whereas another user might rate all items negatively. The raw ratings, therefore, need to be mean-centered in row-wise fashion, before determining the (weighted) average rating of the peer group. The **mean-centered rating $s_{uj}$** of a user $u$ for item $j$ is defined by subtracting her mean rating from the raw rating $r_{uj}$, 

$$ s_{uj} = r_{uj} - \mu_{u} \quad \forall u \in \{1, \cdots,  m\}$$

As before, the weighted average of the mean-centered ratings of an item in the top-k peer group of target user $u$ is used to provide a mean-centered prediction. The mean rating of the target user is then added back to this prediction to provide a raw rating prediction $\hat{r}_{uj}$ of target user $u$ for item $j$. The hat notation on top of $r_{uj}$ indicates a predicted rating, as opposed to one that was already observed in the original ratings matrix. 

Let $P_{u}(j)$ be the set of $k$ closest users to target user $u$, who have specified ratings for item $j$. Users with very low or negative correlations with target user $u$ are sometimes filtered from $P_{u}(j)$ as a heuristic enhancement. Then, the overall neighborhood-based prediction function is as follows:

$$ \hat{r}_{uj} = \mu_{u} + \frac{\sum_{v \in P_{u}(j)} Sim(u,v) \cdot s_{vj}}{\sum_{v \in P_{u}(j)} |Sim(u,v)|} $$

This broader approach allows for a number of different variations in terms of how the similarity or prediction function is computed or in terms of which items are filtered out during the prediction process.
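As a worked example, the sketch below predicts the two missing ratings of user 3 in the five-user toy matrix from the Pearson example, using the $k = 2$ most similar users who rated the item and a similarity-weighted average normalized by the sum of absolute similarities. The function names `pearson` and `predict` and the choice $k = 2$ are assumptions for illustration.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame(
    [[7, 6, 7, 4, 5, 4],
     [6, 7, np.nan, 4, 3, 4],
     [np.nan, 3, 3, 1, 1, np.nan],
     [1, 2, 2, 3, 3, 4],
     [1, np.nan, 1, 2, 3, 3]],
    index=[f"user_{i}" for i in range(1, 6)],
    columns=[f"item_{j}" for j in range(1, 7)],
)
means = ratings.mean(axis=1)              # mu_u, over each user's rated items
centered = ratings.sub(means, axis=0)     # s_uj = r_uj - mu_u

def pearson(u, v):
    """Pearson similarity over the items rated by both u and v."""
    common = ratings.loc[u].notna() & ratings.loc[v].notna()
    du, dv = centered.loc[u, common], centered.loc[v, common]
    return float((du * dv).sum() / np.sqrt((du**2).sum() * (dv**2).sum()))

def predict(u, item, k=2):
    """Mean of u plus the weighted mean-centered ratings of the peer group."""
    # Peer group: the k most similar users who have rated the item
    raters = [v for v in ratings.index if v != u and pd.notna(ratings.loc[v, item])]
    sims = pd.Series({v: pearson(u, v) for v in raters}).nlargest(k)
    weighted = (sims * centered.loc[sims.index, item]).sum()
    return float(means[u] + weighted / sims.abs().sum())

print(round(predict("user_3", "item_1"), 2))  # 3.35
print(round(predict("user_3", "item_6"), 2))  # 0.86
```

The peer group for both items is users 1 and 2. Item 1 is predicted well above user 3's mean of 2.0 and item 6 well below it, so item 1 would be ranked first for this user.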

**Example: MovieLens 100K (ml-100k)**

In [8]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error

In [9]:
train_df = pd.read_csv(
    "./ml-100k/u1.base",
    sep="\t",
    header=None,
    names=["user_id", "item_id", "rating", "timestamp"],
)
test_df = pd.read_csv(
    "./ml-100k/u1.test",
    sep="\t",
    header=None,
    names=["user_id", "item_id", "rating", "timestamp"],
)

In [10]:
train_df.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,1,1,5,874965758
1,1,2,3,876893171
2,1,3,4,878542960
3,1,4,3,876893119
4,1,5,3,889751712


In [11]:
# construct the ratings matrix
ratings_matrix = pd.pivot_table(
    train_df, values="rating", index="user_id", columns="item_id"
)

In [12]:
print(ratings_matrix)

item_id  1     2     3     4     5     6     7     8     9     10    ...  \
user_id                                                              ...   
1         5.0   3.0   4.0   3.0   3.0   NaN   4.0   1.0   5.0   NaN  ...   
2         4.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   2.0  ...   
3         NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
4         NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
5         NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
...       ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   
939       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   5.0   NaN  ...   
940       NaN   NaN   NaN   2.0   NaN   NaN   4.0   5.0   3.0   NaN  ...   
941       5.0   NaN   NaN   NaN   NaN   NaN   4.0   NaN   NaN   NaN  ...   
942       NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
943       NaN   5.0   NaN   NaN   NaN   NaN   NaN   NaN   3.0   NaN  ...   

item_id  16

In [13]:
normalized_ratings_matrix = ratings_matrix.subtract(ratings_matrix.mean(axis=1), axis=0)

In this example, we use Pearson correlation to measure how similar items or users are. But if we were using cosine similarity, **we would need to deal with any missing ratings**. The usual way to do this **is to fill in the missing spots with the average rating that a user gives to all items or the average rating that all users give to an item**. Some people might just **put a 0 in the missing spots, which is okay as long as we've made sure all the data is on a common scale first**.

Additionally, we need to be careful about what a 0 means: on a raw rating scale, a 0 reads as a very negative rating rather than as a missing one, which would bias the similarity. 
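The two ways of filling the gaps can be sketched as follows. The tiny two-user matrix is invented for illustration and is not part of the MovieLens example.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; NaN marks a missing rating
ratings = pd.DataFrame(
    {"item_1": [5.0, 1.0], "item_2": [np.nan, 2.0], "item_3": [3.0, np.nan]},
    index=["user_a", "user_b"],
)
user_means = ratings.mean(axis=1)

# Option 1: replace each missing rating with that user's mean rating
filled_with_mean = ratings.apply(lambda row: row.fillna(row.mean()), axis=1)

# Option 2: mean-center first, so that 0 means "an average rating for this user"
centered_filled = ratings.sub(user_means, axis=0).fillna(0.0)

print(filled_with_mean)
print(centered_filled)
```

Both options encode the same assumption, namely that an unrated item would have received the user's average rating; the second one simply expresses it on the mean-centered scale, where putting a 0 in the gap is harmless.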

In [14]:
similarity_matrix = normalized_ratings_matrix.T.corr()

We define a function to calculate the score (rating) according to the formula we saw in the previous section. It’s worth noting that not all the items in the test set are found in the training set. In this case, we return 2.5 as a roughly neutral value (we could also use the average rating of the dataset).

In [15]:
def calculate_score(user_id, item_id):
    """
    Calculate the predicted rating for a given user and item using a weighted average approach.

    Args:
    user_id: The ID of the user.
    item_id: The ID of the item.

    Returns:
    The predicted rating for the item by the user.
    """

    # Check if the item exists in the ratings matrix; if not, return a default score of 2.5
    if item_id not in ratings_matrix.columns:
        return 2.5

    # * Get similarity scores for all users with respect to the current user, excluding the user themselves
    similarity_scores = similarity_matrix[user_id].drop(labels=user_id)

    # * Get the normalized ratings for the item, excluding the current user's rating
    normalized_ratings = normalized_ratings_matrix[item_id].drop(index=user_id)

    # * Exclude users who haven't rated the item from both similarity scores and normalized ratings
    # * (this needs the item's ratings, so it can only happen after the lookups above;
    # * isnull() is an alias of isna(), so it flags exactly the NaN entries)
    users_with_missing_ratings = normalized_ratings[normalized_ratings.isnull()].index
    similarity_scores.drop(index=users_with_missing_ratings, inplace=True)
    normalized_ratings.dropna(inplace=True)

    # * If no other users have rated items in common with the current user, return a default score
    if similarity_scores.empty:
        return 2.5

    # * Initialize accumulators for the total score and total weight
    total_score = 0
    total_weight = 0

    # * Calculate the weighted score for each user who has rated the item
    for other_user_id in normalized_ratings.index:
        if not pd.isna(similarity_scores[other_user_id]):
            total_score += (
                normalized_ratings[other_user_id] * similarity_scores[other_user_id]
            )
            total_weight += abs(similarity_scores[other_user_id])

    # * The user's average rating across all rated items (NaN entries are skipped)
    user_mean = ratings_matrix.loc[user_id].mean()

    # * If no weights are available, return the user's average rating
    if total_weight == 0:
        return user_mean

    # * Calculate the predicted rating based on the weighted average of ratings
    predicted_rating = user_mean + total_score / total_weight

    return predicted_rating

We iterate over all the user/item pairs in the test set and calculate the prediction using the function defined previously.

In [16]:
# Extract actual ratings from the test dataset
test_ratings = np.array(test_df["rating"])

# Create an iterator for user-item pairs for prediction
user_item_pairs = zip(test_df["user_id"], test_df["item_id"])

# Use list comprehension for efficient prediction of ratings
# Calculate predicted ratings for each user-item pair using the improved calculate_score function
pred_ratings = np.array(
    [calculate_score(user_id, item_id) for user_id, item_id in user_item_pairs]
)

# Calculate the root mean squared error (RMSE) between actual and predicted ratings
rmse = np.sqrt(mean_squared_error(test_ratings, pred_ratings))

# Print the RMSE to evaluate the performance of the recommendation system
print(f"Root Mean Squared Error: {rmse}")

Root Mean Squared Error: 0.9723785192883043


In [17]:
# Calculate the mean rating from the training dataset to use as the baseline
baseline_rating = train_df["rating"].mean()

# Instead of iterating to create an array of the same baseline rating,
# directly create an array filled with the baseline rating with the same length as the test dataset
baseline_ratings = np.full(test_df.shape[0], baseline_rating)

# Calculate and print the Root Mean Squared Error (RMSE) between the actual test ratings and the baseline ratings
rmse_baseline = np.sqrt(mean_squared_error(test_ratings, baseline_ratings))
print(f"Baseline RMSE: {rmse_baseline}")

Baseline RMSE: 1.1536759477860323


User-based collaborative filtering is an effective way to come up with recommendations. That being said, it suffers from sparsity: there tends to be a large number of items and a relatively small number of ratings, so most of the ratings matrix is empty, which wastes a lot of memory. Not to mention, when we start dealing with millions of users, computing all pairwise correlations becomes very expensive. To get around this issue, we select a subset of similar users, called a neighbourhood, and use only those users when computing the rating.