## Part 3: Building a Recommender System with Implicit Feedback

In this part, we will build an implicit feedback recommender system using the [implicit](https://github.com/benfred/implicit) package.

What is implicit feedback, exactly? Let's revisit collaborative filtering. In Part 1, we learned that [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) is based on the assumption that `similar users like similar things`. The user-item matrix, or "utility matrix", is the foundation of collaborative filtering. In the utility matrix, rows represent users and columns represent items.

<img src="images/utility-matrix.png" width="30%"/>

The cells of the matrix are populated by a given user's degree of preference towards an item, which can come in the form of:

1. **explicit feedback:** direct feedback towards an item (e.g., movie ratings which we explored in Part 1
2. **implicit feedback:** indirect behaviour towards an item (e.g., purchase history, browsing history, search behaviour)

Implicit feedback makes assumptions about a user's preference based on their actions towards items. Let's take Netflix for example. If you binge-watch a show and blaze through all seasons in a week, there's a high chance that you like that show. However, if you start watching a series and stop halfway through the first episode, there's suspicion to believe that you probably don't like that show.


### Step 1: Import Dependencies

We'll be using the following packages to build our implicit feedback recommender system:

- [numpy](https://numpy.org/)
- [pandas](https://pandas.pydata.org/)
- [implicit](https://github.com/benfred/implicit)
- scipy (specifically, the [csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) class)

In [3]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

import implicit

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)



In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Step 2: Load the Data

Building on our familiarity with the MovieLens dataset from Parts 1 and 2 of this experiment, we will continue to use this dataset. For this step, the required MovieLens dataset files are conveniently available in the 'data' directory.
- `ratings.csv`
- `movies.csv`


In [8]:
ratings = pd.read_csv(r"/content/drive/MyDrive/Colab Notebooks/6100/W12-RecSys/data/ratings.csv")
movies = pd.read_csv(r"/content/drive/MyDrive/Colab Notebooks/6100/W12-RecSys/data/movies.csv")

In [9]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


For this implicit feedback tutorial, we'll treat movie ratings as the number of times that a user watched a movie. For example, if Jane (a user in our database) gave `Batman` a rating of 1 and `Legally Blonde` a rating of 5, we'll assume that Jane watched Batman one time and Legally Blonde five times. (*important)

### Step 3: Transforming the Data

Similar to Part 1, we need to transform the `ratings` dataframe into a user-item matrix where rows represent users and columns represent movies. The cells of this matrix will be populated with implicit feedback: in this case, the number of times a user watched a movie.

The `create_X()` function outputs a sparse matrix **X** with four mapper dictionaries:
- **user_mapper:** maps user id to user index
- **movie_mapper:** maps movie id to movie index
- **user_inv_mapper:** maps user index to user id
- **movie_inv_mapper:** maps movie index to movie id

We need these dictionaries because they map which row and column of the utility matrix corresponds to which user ID and movie ID, respectively.

The **X** (user-item) matrix is a [scipy.sparse.csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) which stores the data sparsely.




<img src="images/user-movie-matrix.png" width="500px" align="left">

study note: This piece of code is the same as part 1

In [10]:
def create_X(df):
    """
    Generates a sparse matrix from ratings dataframe.

    Args:
        df: pandas dataframe

    Returns:
        X: sparse matrix
        user_mapper: dict that maps user id's to user indices
        user_inv_mapper: dict that maps user indices to user id's
        movie_mapper: dict that maps movie id's to movie indices
        movie_inv_mapper: dict that maps movie indices to movie id's
    """
    N = df['userId'].nunique()
    M = df['movieId'].nunique()

    user_mapper = dict(zip(np.unique(df["userId"]), list(range(N))))
    movie_mapper = dict(zip(np.unique(df["movieId"]), list(range(M))))

    user_inv_mapper = dict(zip(list(range(N)), np.unique(df["userId"])))
    movie_inv_mapper = dict(zip(list(range(M)), np.unique(df["movieId"])))

    user_index = [user_mapper[i] for i in df['userId']]
    movie_index = [movie_mapper[i] for i in df['movieId']]

    X = csr_matrix((df["rating"], (movie_index, user_index)), shape=(M, N))

    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper

In [11]:
X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper = create_X(ratings)

### Creating Movie Title Mappers

We need to interpret a movie title from its index in the user-item matrix and vice versa. Let's create 2 helper functions that make this interpretation easy:

- `get_movie_index()` - converts a movie title to movie index
    - Note that this function uses [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy)'s string matching to get the approximate movie title match based on the string that gets passed in. This means that you don't need to know the exact spelling and formatting of the title to get the corresponding movie index.
- `get_movie_title()` - converts a movie index to movie title

In [13]:
pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [14]:
from fuzzywuzzy import process

def movie_finder(title):
    all_titles = movies['title'].tolist()
    closest_match = process.extractOne(title,all_titles)
    return closest_match[0]

movie_title_mapper = dict(zip(movies['title'], movies['movieId']))
movie_title_inv_mapper = dict(zip(movies['movieId'], movies['title']))

def get_movie_index(title):
    fuzzy_title = movie_finder(title)
    movie_id = movie_title_mapper[fuzzy_title]
    movie_idx = movie_mapper[movie_id]
    return movie_idx

def get_movie_title(movie_idx):
    movie_id = movie_inv_mapper[movie_idx]
    title = movie_title_inv_mapper[movie_id]
    return title



It's time to test it out! Let's get the movie index of `Legally Blonde`.

In [15]:
get_movie_index('Legally Blonde')

3282

Let's pass this index value into `get_movie_title()`. We're expecting Legally Blonde to get returned.

In [16]:
get_movie_title(3282)

'Legally Blonde (2001)'

Great! These helper functions will be useful when we want to interpret our recommender results.

### Step 4: Building Our Implicit Feedback Recommender Model


We've transformed and prepared our data so that we can start creating our recommender model.

The [implicit](https://github.com/benfred/implicit) package is built around a linear algebra technique called [matrix factorization](https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)), which can help us discover latent features underlying the interactions between users and movies. These latent features give a more compact representation of user tastes and item descriptions. Matrix factorization is particularly useful for very sparse data and can enhance the quality of recommendations. The algorithm works by factorizing the original user-item matrix into two factor matrices:

- user-factor matrix (n_users, k)
- item-factor matrix (k, n_items)

We are reducing the dimensions of our original matrix into "taste" dimensions. We cannot interpret what each latent feature $k$ represents. However, we could imagine that one latent feature may represent users who like romantic comedies from the 1990s, while another latent feature may represent movies which are independent foreign language films.

$$X_{mn} \approx P_{mk} \times Q_{nk}^T = \hat{X}$$

<img src="images/matrix-factorization.png" width="60%"/>

In traditional matrix factorization, such as SVD, we would attempt to solve the factorization at once which can be very computationally expensive. As a more practical alternative, we can use a technique called `Alternating Least Squares (ALS)` instead. With ALS, we solve for one factor matrix at a time:

- Step 1: hold user-factor matrix fixed and solve for the item-factor matrix
- Step 2: hold item-factor matrix fixed and solve for the user-item matrix

We alternate between Step 1 and 2 above, until the dot product of the item-factor matrix and user-item matrix is approximately equal to the original X (user-item) matrix. This approach is less computationally expensive and can be run in parallel.

The [implicit](https://github.com/benfred/implicit) package implements matrix factorization using Alternating Least Squares (see docs [here](https://benfred.github.io/implicit/api/models/cpu/als.html)). Let's initiate the model using the `AlternatingLeastSquares` class.

In [17]:
model = implicit.als.AlternatingLeastSquares(factors=50)#important

  check_blas_config()


This model comes with a couple of hyperparameters that can be tuned to generate optimal results:

- factors ($k$): number of latent factors,
- regularization ($\lambda$): prevents the model from overfitting during training

In this experiment, we'll set $k = 50$ and $\lambda = 0.01$ (the default). In a real-world scenario, I highly recommend tuning these hyperparameters before generating recommendations to generate optimal results.

The next step is to fit our model with our user-item matrix.

In [18]:
model.fit(X)

  0%|          | 0/15 [00:00<?, ?it/s]

Now, let's test out the model's recommendations. We can use the model's `similar_items()` method which returns the most relevant movies of a given movie. We can use our helpful `get_movie_index()` function to get the movie index of the movie that we're interested in.

In [19]:
movie_of_interest = 'forrest gump'

movie_index = get_movie_index(movie_of_interest)
related = model.similar_items(movie_index)
related#movie_index, similarity_score

(array([314, 167,  26, 187, 308, 284, 126,  28, 516, 420], dtype=int32),
 array([1.0000001 , 0.584224  , 0.5401984 , 0.5255811 , 0.5086001 ,
        0.48962426, 0.4814141 , 0.4607454 , 0.46019232, 0.42843986],
       dtype=float32))

The output of `similar_items()` is not user-friendly. We'll need to use our `get_movie_title()` function to interpret what our results are.

In [20]:
# Print a message indicating the context, e.g., "Because you watched {movie_name}..."
print(f"Because you watched {movie_finder(movie_of_interest)}...")

# Iterate through the indices of related movies
for idx in related[0]:
    # Get the title of the recommended movie using the index
    recommended_title = get_movie_title(idx)

    # Ensure the recommended movie title is different from the original movie
    if recommended_title != get_movie_title(movie_index):
        # Print the recommended movie title
        print(recommended_title)

Because you watched Forrest Gump (1994)...
Strange Days (1995)
Now and Then (1995)
Cure, The (1995)
Client, The (1994)
To Live (Huozhe) (1994)
Batman Forever (1995)
City of Lost Children, The (Cité des enfants perdus, La) (1995)
Love and a .45 (1994)
Killing Zoe (1994)


When we treat user ratings as implicit feedback, the results look pretty good! You can test out other movies by changing the `movie_of_interest` variable.

### Step 5: Generating User-Item Recommendations

A cool feature of [implicit](https://github.com/benfred/implicit) is that you can pull personalized recommendations for a given user. Let's test it out on a user in our dataset.

In [48]:
user_id = 5#使用implicit库来为给定的用户生成个性化推荐

In [49]:
user_ratings = ratings[ratings['userId']==user_id].merge(movies[['movieId', 'title']])
user_ratings = user_ratings.sort_values('rating', ascending=False)
print(f"Number of movies rated by user {user_id}: {user_ratings['movieId'].nunique()}")

Number of movies rated by user 5: 44


User 5 watched 44 movies. Their highest rated movies are below:

In [50]:
user_ratings = ratings[ratings['userId']==user_id].merge(movies[['movieId', 'title']])
user_ratings = user_ratings.sort_values('rating', ascending=False)
top_5 = user_ratings.head()
top_5

Unnamed: 0,userId,movieId,rating,timestamp,title
37,5,590,5.0,847434747,Dances with Wolves (1990)
30,5,475,5.0,847435311,In the Name of the Father (1993)
32,5,527,5.0,847434960,Schindler's List (1993)
6,5,58,5.0,847435238,"Postman, The (Postino, Il) (1994)"
41,5,596,5.0,847435292,Pinocchio (1940)


Their lowest rated movies:

In [51]:
bottom_5 = user_ratings[user_ratings['rating']<3].tail()
bottom_5

Unnamed: 0,userId,movieId,rating,timestamp,title
26,5,380,2.0,847434748,True Lies (1994)
23,5,357,2.0,847435238,Four Weddings and a Funeral (1994)
19,5,316,2.0,847434832,Stargate (1994)
15,5,266,1.0,847435311,Legends of the Fall (1994)


Based on their preferences above, we can get a sense that user 95 likes action and crime movies from the early 1990's over light-hearted American comedies from the early 2000's. Let's see what recommendations our model will generate for user 95.

We'll use the `recommend()` method, which takes in the user index of interest and transposed user-item matrix.

In [52]:
# Transpose the user-item matrix for efficient operations
X_t = X.T.tocsr()#important

# Map the user ID to its corresponding index in the transposed matrix
user_idx = user_mapper[user_id]

# Extract the user-item matrix for the specified user
user_item_matrix = X_t[user_idx, :]

# Get movie recommendations for the user using the collaborative filtering model
recommendations = model.recommend(user_idx, user_item_matrix)

We can't interpret the results as is since movies are represented by their index. We'll have to loop over the list of recommendations and get the movie title for each movie index.

In [53]:
# Iterate through the recommended movie indices
for r in recommendations[0]:
    # Get the title of the recommended movie using the index
    recommended_title = get_movie_title(r)

    # Print the recommended movie title
    print(recommended_title)

Heat (1995)
Wallace & Gromit: A Close Shave (1995)
Timecop (1994)
Puppet Masters, The (1994)
Chasers (1994)
Big Bully (1996)
Sliver (1993)
Birdcage, The (1996)
Beautiful Girls (1996)
Star Trek: Generations (1994)


User 95's recommendations consist of action, crime, and thrillers. None of their recommendations are comedies.