# Alternating Least Squares (ALS)

Alternating Least Squares (ALS) is a matrix factorization technique widely used to build recommendation systems. It decomposes a sparse user-item interaction matrix into two low-rank factor matrices by minimizing the squared reconstruction error. Because each update splits into many small, independent least-squares problems, ALS scales well to large, sparse datasets, which is why it is widely used in large-scale collaborative filtering.

## Definition

The main idea of ALS is to hold one set of latent factors fixed while optimizing the other, and then swap their roles. The joint problem over both sets of factors is non-convex, but with one set fixed it becomes an ordinary least-squares problem that is quadratic in the remaining factors and can be solved directly in closed form.

Assume we have a user-item matrix $R$ of size $m \times n$, where $m$ is the number of users and $n$ is the number of items. We aim to find two matrices, $U$ of size $m \times k$ (user factors) and $V$ of size $n \times k$ (item factors), with $k$ much smaller than $m$ and $n$, such that $R \approx UV^T$.

The objective function of this least squares problem can be written as:

$$
\min_{U,V} \sum_{(i,j) \in \Omega} (R_{i,j} - \mathbf{u}_i \cdot \mathbf{v}_j)^2
$$

where $\Omega$ is the set of observed (user, item) pairs, $\mathbf{u}_i$ is the $i$th row of $U$ (the $i$th user's latent factors), and $\mathbf{v}_j$ is the $j$th row of $V$ (the $j$th item's latent factors). When every entry of $R$ is observed, this sum equals $||R - UV^T||^2$.
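As a small illustration of this objective, the following NumPy sketch evaluates the squared error over the observed entries only. With $U$ of size $m \times k$ and $V$ of size $n \times k$, the predicted rating matrix is computed as `U @ V.T`, so entry $(i, j)$ is $\mathbf{u}_i \cdot \mathbf{v}_j$. The toy values, the use of `np.nan` to mark missing ratings, and the function name `squared_error` are assumptions made for this example.

```python
import numpy as np

def squared_error(R, U, V):
    """Sum of squared errors over the observed (non-NaN) entries of R."""
    observed = ~np.isnan(R)      # boolean mask of known ratings
    predictions = U @ V.T        # entry (i, j) is the dot product u_i . v_j
    # Missing entries contribute zero to the error.
    residuals = np.where(observed, R - predictions, 0.0)
    return float(np.sum(residuals ** 2))

# Toy example: 3 users, 2 items, k = 1 latent factor (illustrative values).
R = np.array([[5.0, np.nan],
              [4.0, 2.0],
              [np.nan, 1.0]])
U = np.array([[2.0], [2.0], [1.0]])
V = np.array([[2.0], [1.0]])

loss = squared_error(R, U, V)
print(loss)
```

Here only the first user's rating of the first item is mispredicted (4 instead of 5), so the missing entries play no part in the loss.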

## Alternating Least Squares Process

The ALS algorithm alternates between fixing $U$ and solving for $V$, and fixing $V$ and solving for $U$. 

### Step 1: Initialization

Initialize the user factor matrix $U$ and item factor matrix $V$ with some values, often small random numbers.

### Step 2: Fix $U$, Solve for $V$

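With $U$ fixed, the objective separates across items, so each $\mathbf{v}_j$ can be computed independently. Treating the factor vectors as column vectors of length $k$ and summing over the users $i$ who have rated item $j$, the least-squares solution for each $\mathbf{v}_j$ is:

$$
\mathbf{v}_j = \left( \sum_{i} \mathbf{u}_i\mathbf{u}_i^T \right)^{-1} \sum_{i} R_{i,j}\mathbf{u}_i
$$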
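
To see where this formula comes from, one can set the gradient of the per-item squared error to zero. The short derivation below is a sketch that treats $\mathbf{u}_i$ and $\mathbf{v}_j$ as column vectors of length $k$ and sums over the users $i$ with an observed rating for item $j$:

$$
\begin{aligned}
f(\mathbf{v}_j) &= \sum_{i} \left(R_{i,j} - \mathbf{u}_i^T\mathbf{v}_j\right)^2 \\
\nabla_{\mathbf{v}_j} f &= -2 \sum_{i} \left(R_{i,j} - \mathbf{u}_i^T\mathbf{v}_j\right)\mathbf{u}_i = 0 \\
\left( \sum_{i} \mathbf{u}_i\mathbf{u}_i^T \right)\mathbf{v}_j &= \sum_{i} R_{i,j}\mathbf{u}_i
\end{aligned}
$$

Solving this $k \times k$ linear system (the normal equations) gives $\mathbf{v}_j$, provided the matrix on the left is invertible. The update for $\mathbf{u}_i$ follows by swapping the roles of users and items.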

### Step 3: Fix $V$, Solve for $U$

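With $V$ fixed, the objective separates across users, so each $\mathbf{u}_i$ can be computed independently. Treating the factor vectors as column vectors of length $k$ and summing over the items $j$ that user $i$ has rated, the least-squares solution for each $\mathbf{u}_i$ is:

$$
\mathbf{u}_i = \left( \sum_{j} \mathbf{v}_j\mathbf{v}_j^T \right)^{-1} \sum_{j} R_{i,j}\mathbf{v}_j
$$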

Repeat Steps 2 and 3 until convergence or until a predefined number of iterations is reached.
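
The three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the Spark implementation: the function name `als`, the toy rating matrix, and the convention of marking missing ratings with `np.nan` are assumptions. The sketch also adds a small ridge term `reg * I` to each linear system, which is not part of the formulas above; it is a common safeguard that keeps the systems invertible when a user or item has fewer than $k$ ratings.

```python
import numpy as np

def als(R, k=2, reg=0.1, n_iters=20, seed=0):
    """Factorize R (NaN = missing) as U @ V.T by alternating least squares."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    observed = ~np.isnan(R)
    # Step 1: initialize both factor matrices with small random numbers.
    U = 0.1 * rng.standard_normal((m, k))
    V = 0.1 * rng.standard_normal((n, k))
    ridge = reg * np.eye(k)  # assumption: small L2 term, not in the formulas above

    for _ in range(n_iters):
        # Step 2: fix U and solve a k x k linear system for each item vector v_j.
        for j in range(n):
            rows = observed[:, j]          # users who rated item j
            if not rows.any():
                continue                   # unrated item keeps its current vector
            Uj = U[rows]
            V[j] = np.linalg.solve(Uj.T @ Uj + ridge, Uj.T @ R[rows, j])
        # Step 3: fix V and solve a k x k linear system for each user vector u_i.
        for i in range(m):
            cols = observed[i]             # items rated by user i
            if not cols.any():
                continue                   # user with no ratings keeps its vector
            Vi = V[cols]
            U[i] = np.linalg.solve(Vi.T @ Vi + ridge, Vi.T @ R[i, cols])
    return U, V

# Toy 5 x 4 rating matrix (illustrative values).
R = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [1.0, 1.0, np.nan, 5.0],
              [1.0, np.nan, 5.0, 4.0],
              [np.nan, 1.0, 5.0, 4.0]])
U, V = als(R)
predictions = U @ V.T   # includes predictions for the missing entries
print(np.round(predictions, 2))
```

The two inner loops touch each user and each item independently, which is the property that lets distributed implementations spread the solves across workers.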

## Benefits of ALS

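ALS has several strengths. The updates for different users (or different items) are independent of one another, so they can be computed in parallel and distributed across machines. Only observed ratings enter the least-squares solves, so missing values need no imputation. Each update has a closed-form solution, so there is no learning rate to tune.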

## Applications of ALS

The primary application of ALS is collaborative filtering for recommendation systems: predicting which items a user will like from their own previous behavior and the behavior of other users. Movie and music recommendation on platforms such as Netflix and Spotify are frequently cited examples of this kind of problem.

## Limitations of ALS

ALS also has limitations:

- In its implicit-feedback form, ALS treats unobserved user-item pairs as (low-confidence) negative feedback, which might not always be correct. For example, a user may not have interacted with an item simply because they were not aware of its existence, not because they didn't like it.

- ALS may not handle new users or items (also known as the cold start problem) very well. Since it relies on historical user-item interactions, it can be difficult to generate recommendations for new users or items that have little interaction history.

- ALS can lead to popularity bias in recommendations. Popular items can often end up being recommended more often, while less popular or niche items may be overlooked.

- Tuning the model parameters (like the dimensionality of the factor vectors and regularization term) requires careful consideration and can be computationally intensive.
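
The first limitation concerns how unobserved entries are treated. For reference, a commonly used implicit-feedback objective is sketched below. The notation is an assumption that follows Hu, Koren, and Volinsky's "Collaborative Filtering for Implicit Feedback Datasets"; the symbols $p_{i,j}$, $c_{i,j}$, $\alpha$, and $\lambda$ are not defined elsewhere in this document.

$$
\min_{U,V} \sum_{i,j} c_{i,j}\,(p_{i,j} - \mathbf{u}_i \cdot \mathbf{v}_j)^2 + \lambda \left( \sum_{i} ||\mathbf{u}_i||^2 + \sum_{j} ||\mathbf{v}_j||^2 \right)
$$

Here $p_{i,j} = 1$ if user $i$ interacted with item $j$ and $0$ otherwise, and $c_{i,j} = 1 + \alpha R_{i,j}$ is a confidence weight. The sum runs over all pairs, so every unobserved pair is pulled toward a prediction of $0$, but only with the minimum confidence of $1$. The last term is the regularization mentioned in the final bullet. In Spark, this variant is selected with the `implicitPrefs` and `alpha` parameters of `ALS`.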

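The rest of this notebook applies ALS using Apache Spark's `pyspark.ml` implementation. For more details, see the Spark collaborative filtering guide: https://spark.apache.org/docs/latest/ml-collaborative-filtering.html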

In [1]:
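import findspark
findspark.init()  # locate the local Spark installation and add it to sys.path
from pyspark.sql import SparkSession

# Start a local Spark session; the SparkContext is available from the session
spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext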

In [2]:
from pyspark.ml.recommendation import ALS
# Load the ratings data from CSV (forward slashes work on Windows, Linux, and macOS)
ratings_data = spark.read.csv("data/movie-small/ratings.csv", header=True, inferSchema=True)

# Load the movies data from CSV
movies_data = spark.read.csv("data/movie-small/movies.csv", header=True, inferSchema=True)

# Split the data into training and test sets
(training_data, test_data) = ratings_data.randomSplit([0.8, 0.2], seed=1234)

## Create an ALS model

In [3]:
# Create an ALS model
als = ALS(maxIter=10, regParam=0.01, userCol="userId", itemCol="movieId", 
          ratingCol="rating", coldStartStrategy="drop")

# Fit the model to the training data
model = als.fit(training_data)

# Generate recommendations for all users
user_recs = model.recommendForAllUsers(10)  # Generate top 10 recommendations for each user

# Convert the recommendations to multiple rows per user, with one recommendation in each row
user_recs = user_recs.selectExpr("userId", "explode(recommendations) as recommendations")
# Split the recommendations column from {movieId, rating} into two columns, movieId and rating
user_recs = user_recs.selectExpr("userId", "recommendations.movieId as movieId", 
                                 "recommendations.rating as rating")



## Evaluate the model by computing the RMSE on the test data

In [4]:
from pyspark.ml.evaluation import RegressionEvaluator
# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test_data)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 1.1085100193424526
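
To make the metric concrete, here is a minimal plain-Python sketch of what an RMSE computation does: it averages the squared differences between actual and predicted ratings and takes the square root. The helper name `rmse_by_hand` and the sample values are made up for illustration and are not taken from the dataset or from Spark's `RegressionEvaluator`.

```python
import math

def rmse_by_hand(actual, predicted):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative ratings and predictions (made-up values).
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.0, 3.0, 4.0, 3.0]
error = rmse_by_hand(actual, predicted)
print("Root-mean-square error = " + str(error))
```

Because the errors are squared before averaging, the single prediction that is off by 2 contributes more to the result than the two predictions that are off by 1 combined.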


## Show the recommendations for a specific user

In [5]:
# Show the recommendations for a specific user
user_id = 2

user_rec = user_recs.filter(user_recs.userId == user_id)

print("Movies rated by user with id " + str(user_id))
# Show the movies rated by the user
user_ratings = ratings_data.filter(ratings_data.userId == user_id).join(movies_data, "movieId")\
    .select("title", "genres", "rating")
user_ratings.show(100,truncate=False)

# join the recommendations with the movies data to get the movie titles
user_rec = user_rec.join(movies_data, "movieId")\
    .select("title", "genres", "rating")
print("Top 10 recommendations for user with id " + str(user_id))
# Show the recommendations for a specific user
user_rec.show(truncate=False)

Movies rated by user with id 2
+----------------------------------------------------+-----------------------------------------------+------+
|title                                               |genres                                         |rating|
+----------------------------------------------------+-----------------------------------------------+------+
|Shawshank Redemption, The (1994)                    |Crime|Drama                                    |3.0   |
|Tommy Boy (1995)                                    |Comedy                                         |4.0   |
|Good Will Hunting (1997)                            |Drama|Romance                                  |4.5   |
|Gladiator (2000)                                    |Action|Adventure|Drama                         |4.0   |
|Kill Bill: Vol. 1 (2003)                            |Action|Crime|Thriller                          |4.0   |
|Collateral (2004)                                   |Action|Crime|Drama|Thriller        

## Generate top 10 user recommendations for each movie

In [6]:
# Generate top 10 user recommendations for each movie
movieRecs = model.recommendForAllItems(10)

# Convert the recommendations to multiple rows per movie, with one recommendation in each row
movieRecs = movieRecs.selectExpr("movieId", "explode(recommendations) as recommendations")
# Split the recommendations column from {userId, rating} into two columns, userId and rating
movieRecs = movieRecs.selectExpr("movieId", "recommendations.userId as userId",
                                    "recommendations.rating as rating")

# Show the recommendations for a specific movie
movie_id = 2

movie_rec = movieRecs.filter(movieRecs.movieId == movie_id)

print("Top 10 users that will be interested in the movie with id " + str(movie_id))
movie_rec.show(truncate=False)


Top 10 users that will be interested in the movie with id 2
+-------+------+---------+
|movieId|userId|rating   |
+-------+------+---------+
|2      |258   |6.898732 |
|2      |543   |5.9302106|
|2      |407   |5.830795 |
|2      |48    |5.375635 |
|2      |35    |5.1472993|
|2      |162   |4.9345474|
|2      |553   |4.8886347|
|2      |53    |4.881683 |
|2      |478   |4.8594093|
|2      |584   |4.8578053|
+-------+------+---------+

