### Understanding ALS (Alternating Least Squares)

ALS (Alternating Least Squares) is a matrix factorization algorithm for collaborative filtering, widely used in recommendation systems. Its goal is to predict the missing values in a user-item interaction matrix. It does so by decomposing the interaction matrix into two lower-dimensional matrices: one representing user preferences and the other representing item features.

#### How ALS Works:
1. **Matrix Factorization**:
    - Given a user-item interaction matrix `R` (e.g., ratings, watch ratios, etc.), ALS approximates it as the product of two matrices:
      - `U`: A matrix where each row corresponds to a user and represents their latent preferences.
      - `V`: A matrix where each row corresponds to an item and represents its latent features.
    - The goal is to minimize the reconstruction error between `R` and the product `U Vᵀ`, measured on the observed entries of `R`.

2. **Alternating Optimization**:
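    - ALS alternates between fixing one matrix (e.g., `U`) and solving for the other (e.g., `V`), and vice versa. With one matrix fixed, the problem for the other reduces to a set of regularized least-squares problems, each with a closed-form solution. This iterative process continues until convergence or until a maximum number of iterations is reached.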

3. **Handling Sparsity**:
    - Real-world interaction matrices are often sparse (i.e., most entries are missing). ALS efficiently handles this sparsity by only considering observed interactions during optimization.
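
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not Spark's implementation: the toy matrix, the helper names `als_explicit` and `observed_rmse`, the choice to treat zeros as missing entries, and the plain L2 penalty `reg` are all assumptions made for the example.

```python
import numpy as np

def als_explicit(R, mask, rank=2, reg=0.1, n_iters=20, seed=0):
    """Approximate R by U @ V.T, fitting only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)

    for _ in range(n_iters):
        # Fix V: each user's factors solve a small ridge regression
        # over the items that user actually interacted with.
        for u in range(n_users):
            obs = mask[u]
            if obs.any():
                Vo = V[obs]
                U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, obs])
        # Fix U: same idea, one ridge regression per item.
        for i in range(n_items):
            obs = mask[:, i]
            if obs.any():
                Uo = U[obs]
                V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[obs, i])
    return U, V

def observed_rmse(R, mask, U, V):
    """Reconstruction error measured on observed entries only."""
    err = (R - U @ V.T)[mask]
    return float(np.sqrt(np.mean(err ** 2)))

# Hypothetical toy ratings; 0 marks a missing interaction (an assumption).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

U0, V0 = als_explicit(R, mask, n_iters=0)   # random initial factors
U, V = als_explicit(R, mask, n_iters=20)    # factors after alternating updates

print("RMSE before:", observed_rmse(R, mask, U0, V0))
print("RMSE after: ", observed_rmse(R, mask, U, V))
```

Each inner solve touches only one user's (or one item's) observed entries, which is what makes the method cheap on sparse data and easy to distribute.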

#### Key Features of ALS:
1. **Scalability**:
    - ALS is computationally efficient and can scale to large datasets, making it suitable for big data applications.

2. **Cold-Start Handling**:
    - ALS cannot predict for users or items it has not seen during training. In Spark's implementation, `coldStartStrategy="drop"` removes the rows with `NaN` predictions from the prediction output, so that evaluation metrics stay well defined.

3. **Regularization**:
    - Regularization is applied to prevent overfitting, ensuring that the model generalizes well to unseen data.

4. **Parallelization**:
    - ALS is implemented in distributed frameworks like Apache Spark, enabling parallel computation across large clusters.
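
The cold-start behavior listed above can be illustrated without Spark. The sketch below is a plain-Python stand-in, assuming hypothetical interaction triples and a placeholder `predict` function; it only mimics the idea that a factor model returns `NaN` for an unseen user or item and that a "drop" strategy filters those rows out.

```python
import math

# Hypothetical toy interactions: (user_id, video_id, watch_ratio).
train = [(1, 10, 0.9), (1, 11, 0.4), (2, 10, 1.2)]
test = [(1, 10, 1.0), (2, 11, 0.5), (3, 10, 0.7), (2, 12, 0.3)]

known_users = {u for u, _, _ in train}
known_items = {i for _, i, _ in train}

def predict(user_id, video_id):
    """Stand-in for a factor model: NaN when a factor vector is missing."""
    if user_id not in known_users or video_id not in known_items:
        return float("nan")
    return 0.8  # placeholder score; a real model would return a dot product

scored = [(u, i, r, predict(u, i)) for u, i, r in test]

# Without dropping, a single NaN prediction makes the averaged error NaN.
mae_with_nan = sum(abs(r - p) for _, _, r, p in scored) / len(scored)

# Effect of a "drop" strategy: keep only rows with a usable prediction.
kept = [row for row in scored if not math.isnan(row[3])]
mae_kept = sum(abs(r - p) for _, _, r, p in kept) / len(kept)

print("rows scored:", len(scored), "rows kept:", len(kept))
print("MAE with NaN rows:", mae_with_nan, "MAE after drop:", mae_kept)
```

Dropping rows keeps the metric computable, but the dropped users and items still receive no recommendations, which is why cold start also appears among the limitations.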

#### Limitations of ALS:
1. **Cold-Start Problem**:
    - ALS struggles with new users or items that lack interaction data.
2. **Linear Assumptions**:
    - The algorithm predicts each interaction as the dot product of a user's and an item's latent vectors. This linear form may not capture more complex patterns in the data.

Despite its limitations, ALS remains a powerful and widely used algorithm for collaborative filtering tasks, offering a balance between simplicity, scalability, and effectiveness.

In [3]:
import os
import shutil
import pandas as pd
from pyspark.sql import SparkSession, Row
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

import sys
sys.path.append('../data')
import load_data


In [4]:
small_matrix, big_matrix, item_categories, item_features, social_network, user_features, captions = load_data.load_data()
small_matrix.drop(columns=["play_duration", "video_duration", "time", "date", "timestamp"], inplace=True)

Loading data...
Data loaded.
Cleaning data...
Data cleaned.


In [5]:
# Setup Spark.
spark = SparkSession.builder.appName("ALS").getOrCreate()
spark_df = spark.createDataFrame(small_matrix)


25/05/17 03:02:04 WARN Utils: Your hostname, pcfixe resolves to a loopback address: 127.0.1.1; using 192.168.1.3 instead (on interface enp5s0)
25/05/17 03:02:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/17 03:02:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [6]:
als = ALS(maxIter=15, regParam=0.01, userCol="user_id", itemCol="video_id", ratingCol="watch_ratio", coldStartStrategy="drop")
als.setSeed(42)



ALS_85b33fd4371a

In [7]:
model = als.fit(spark_df)

model_path = "modeld_als_sav"

if os.path.exists(model_path) and os.path.isdir(model_path):
    shutil.rmtree(model_path)

model.save(model_path)


25/05/17 03:02:59 WARN TaskSetManager: Stage 0 contains a task of very large size (2961 KiB). The maximum recommended task size is 1000 KiB.
25/05/17 03:03:00 WARN TaskSetManager: Stage 1 contains a task of very large size (2961 KiB). The maximum recommended task size is 1000 KiB.
25/05/17 03:03:03 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS
25/05/17 03:03:08 WARN MemoryManager: Total allocation exceeds 95,00% (1 020 054 720 bytes) of heap memory
Scaling row group sizes to 95,00% for 8 writers
25/05/17 03:03:08 WARN MemoryManager: Total allocation exceeds 95,00% (1 020 054 720 bytes) of heap memory
Scaling row group sizes to 84,44% for 9 writers
25/05/17 03:03:08 WARN MemoryManager: Total allocation exceeds 95,00% (1 020 054 720 bytes) of heap memory
Scaling row group sizes to 76,00% for 10 writers
25/05/17 03:03:09 WARN MemoryManager: Total allocation exceeds 95,00% (1 020 054 720 bytes) of heap memory
Scaling row group sizes to 84,44% f

In [None]:
from pyspark.ml.recommendation import ALSModel

# Load the model from disk
loaded_model = ALSModel.load(model_path)

# Generate a top-N ranked list of recommendations for each user
n = 10

user_recs = loaded_model.recommendForAllUsers(n)
user_recs.show()

top_recommendations = loaded_model.recommendForUserSubset(spark_df.select("user_id").distinct(), n)
top_recommendations.show(truncate=False)


+-------+--------------------+
|user_id|     recommendations|
+-------+--------------------+
|     14|[{5210, 3.3426616...|
|     19|[{9178, 2.511615}...|
|     21|[{9178, 2.8554924...|
|     23|[{5365, 3.2164779...|
|     24|[{7383, 2.4718285...|
|     36|[{9178, 2.476555}...|
|     37|[{9178, 2.9262342...|
|     41|[{9178, 2.697055}...|
|     51|[{9178, 3.1278403...|
|     55|[{9815, 3.145819}...|
|     64|[{9178, 3.500213}...|
|     73|[{9815, 3.4126804...|
|     75|[{9178, 2.1038976...|
|     97|[{9178, 2.691554}...|
|     98|[{9178, 2.5094604...|
|    102|[{9178, 2.2952542...|
|    120|[{9178, 2.1472466...|
|    127|[{9178, 2.082949}...|
|    129|[{9178, 3.5886247...|
|    131|[{6523, 6.544776}...|
+-------+--------------------+
only showing top 20 rows



25/05/17 03:18:13 WARN TaskSetManager: Stage 646 contains a task of very large size (2961 KiB). The maximum recommended task size is 1000 KiB.


+-------+----------------------------------------------------------------------------------------------+
|user_id|recommendations                                                                               |
+-------+----------------------------------------------------------------------------------------------+
|14     |[{5210, 3.3426616}, {9178, 3.1054316}, {7383, 2.9617243}, {4040, 2.867227}, {8298, 2.8307033}]|
|19     |[{9178, 2.511615}, {7383, 2.4508276}, {4040, 2.4472446}, {314, 2.368181}, {8298, 2.3449864}]  |
|21     |[{9178, 2.8554924}, {4040, 2.788542}, {7383, 2.7624385}, {314, 2.7037168}, {8298, 2.6912348}] |
|23     |[{5365, 3.2164779}, {9178, 3.005998}, {4040, 2.9460378}, {7383, 2.9212122}, {314, 2.8971171}] |
|24     |[{7383, 2.4718285}, {4040, 2.4301996}, {9178, 2.4197152}, {314, 2.3857317}, {8298, 2.33107}]  |
|36     |[{9178, 2.476555}, {4040, 2.3870633}, {7383, 2.3789985}, {314, 2.312498}, {8298, 2.2989814}]  |
|37     |[{9178, 2.9262342}, {7383, 2.886976}, {4040, 2

In [None]:
# Make predictions. Note: spark_df is the same data the model was fit on,
# so the metrics below describe training fit, not held-out performance.
predictions = loaded_model.transform(spark_df)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="watch_ratio", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
mae = evaluator.evaluate(predictions, {evaluator.metricName: "mae"})

# Report the evaluation metrics
print("\n\nModel evaluation:")

print("Root-mean-square error = " + str(rmse))
print("Mean absolute error = " + str(mae))
print("R2 = " + str(r2))

25/05/17 03:09:34 WARN TaskSetManager: Stage 611 contains a task of very large size (2961 KiB). The maximum recommended task size is 1000 KiB.
25/05/17 03:09:36 WARN TaskSetManager: Stage 615 contains a task of very large size (2961 KiB). The maximum recommended task size is 1000 KiB.
25/05/17 03:09:37 WARN TaskSetManager: Stage 619 contains a task of very large size (2961 KiB). The maximum recommended task size is 1000 KiB.
[Stage 619:>                                                      (0 + 24) / 24]



Model evaluation:
Root-mean-square error = 1.1829326735324064
Mean absolute error = 0.35043772919713145
R2 = 0.23951807584652962


                                                                                

### Conclusion

The ALS (Alternating Least Squares) model was implemented to generate video recommendations based on user interaction data. The model was trained on `small_matrix` and evaluated on that same data using the standard regression metrics RMSE, MAE, and R², so the figures below measure fit to the training data rather than performance on held-out interactions. The results of the evaluation are as follows:

- **Root Mean Square Error (RMSE):** 1.1829  
    This metric is the square root of the mean squared difference between predicted and actual watch ratios, so it penalizes large errors heavily. A lower RMSE is better. Here the RMSE is more than three times the MAE, which suggests that a relatively small number of large errors dominates it.

- **Mean Absolute Error (MAE):** 0.3504  
    This metric measures the average absolute difference between the predicted and actual watch ratios. An MAE of about 0.35 means a typical prediction is off by roughly a third of a full watch, which is moderate for the bulk of interactions.

- **R² Score:** 0.2395  
    The R² score measures the proportion of variance in the watch ratio that the model's predictions account for. A score of 0.2395 means the model explains only about 24% of that variance, which leaves considerable room for further optimization.
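
To make the relationship between these three metrics concrete, here is a small pure-Python sketch. The data is hypothetical and chosen only to show how one large error can push RMSE well above MAE; the function name `regression_metrics` is an assumption for illustration, not part of the notebook.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Hypothetical watch ratios: nine ordinary views and one extreme rewatch.
y_true = [1.0] * 9 + [11.0]
# Predictions are close for the ordinary views but far off for the outlier.
y_pred = [1.1] * 9 + [6.0]

rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

In this toy case nine of the ten predictions are nearly exact, yet the single miss makes RMSE several times larger than MAE. A similar gap in the results above is consistent with a heavy-tailed watch ratio, although confirming that would require inspecting the error distribution.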

The model was also used to generate top-N recommendations for users, providing personalized video suggestions. These recommendations can be leveraged to enhance user engagement and satisfaction by tailoring content to individual preferences.

#### Key Observations:
1. The model's performance metrics indicate that it provides a starting point, but there is significant scope for improvement in accuracy. Because the metrics were computed on the training data, they say nothing yet about generalization.
2. The cold-start strategy (`drop`) was set so that rows with `NaN` predictions, which arise for users or items not seen during training, are removed from the prediction output. Since evaluation used the training data, this setting should not have removed any rows here.

#### Final Thoughts:
The ALS model provides a foundation for building a recommendation system tailored to user preferences. Further refinement and experimentation, such as evaluating on held-out interactions and tuning `maxIter` and `regParam`, would be needed before it can be expected to deliver accurate, personalized recommendations.