# Chapter 1

### Latent Features 

- Some movies belong to more than one genre (a movie can be both a comedy and a romance).
- However, an individual viewer may experience such a movie as mainly one genre.
- How can we tell which genre a given person thinks the movie leans toward?
- Latent features capture this. The number of latent features is called the rank.
- The ALS algorithm decomposes the original table into 2 matrices: User-Movie = User-LatentFeature × LatentFeature-Movie
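
The decomposition above can be illustrated with a tiny hand-made example. This is only a sketch: the numbers, the variable names, and the idea that the two latent features correspond to "comedy" and "romance" are all assumptions made for illustration (real latent features are learned and carry no labels).

```python
# Hypothetical toy data: 3 users, 2 latent features, 4 movies.
user_latent = [
    [1.0, 0.0],  # user 0 responds only to latent feature 0
    [0.0, 1.0],  # user 1 responds only to latent feature 1
    [0.5, 0.5],  # user 2 responds to both features equally
]
latent_movie = [
    [5.0, 1.0, 3.0, 4.0],  # how strongly each movie expresses feature 0
    [1.0, 5.0, 3.0, 2.0],  # how strongly each movie expresses feature 1
]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# User-Movie = User-LatentFeature x LatentFeature-Movie
user_movie = matmul(user_latent, latent_movie)
rank = len(latent_movie)  # number of latent features

for row in user_movie:
    print(row)
```

The shapes follow the rule (users × rank) times (rank × movies) = (users × movies), so the rank controls how much detail the two smaller tables can hold.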

<center><img src="images/01.01.png"  style="width: 400px, height: 300px;"/></center>

# Chapter 2

### How ALS figures out recommendations

<center><img src="images/02.01.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.02.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.03.png"  style="width: 400px, height: 300px;"/></center>


- The original Rating table is decomposed into User and Product tables that are initialized with random numbers.
- The values in the User and Product tables are adjusted over a number of iterations to reduce the error (for example RMSE) between their product and the known ratings.
- The product of the resulting factor tables closely resembles the original Rating table.
- While the original table has many missing values, the product of the factor tables has a predicted value in every cell; these predictions can be used as recommendations.
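
The steps above can be sketched in plain Python. To keep the example dependency-free it uses a single latent feature (rank 1), where each alternating least-squares update has a simple closed form. The function names, the toy ratings, and the hyperparameter values are assumptions for illustration; this is not how Spark implements ALS internally.

```python
import math
import random


def rmse(ratings, u, p):
    """RMSE between observed ratings and the rank-1 predictions u[i] * p[j]."""
    sq = sum((r - u[i] * p[j]) ** 2 for (i, j), r in ratings.items())
    return math.sqrt(sq / len(ratings))


def als_rank1(ratings, n_users, n_items, n_iters=50, reg=0.01, seed=0):
    """Rank-1 ALS on a dict {(user, item): rating} of observed entries only."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(n_users)]  # random start for user factors
    p = [rng.random() for _ in range(n_items)]  # random start for product factors
    history = [rmse(ratings, u, p)]
    for _ in range(n_iters):
        # Hold product factors fixed; solve each user factor by least squares.
        for i in range(n_users):
            num = sum(r * p[j] for (a, j), r in ratings.items() if a == i)
            den = sum(p[j] ** 2 for (a, j), _r in ratings.items() if a == i) + reg
            u[i] = num / den
        # Hold user factors fixed; solve each product factor by least squares.
        for j in range(n_items):
            num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
            den = sum(u[i] ** 2 for (i, b), _r in ratings.items() if b == j) + reg
            p[j] = num / den
        history.append(rmse(ratings, u, p))
    return u, p, history


# Hypothetical ratings: 2 users x 3 products, with (0, 2) and (1, 0) missing.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 4.0, (1, 2): 5.0}
u, p, history = als_rank1(ratings, n_users=2, n_items=3)

# The product of the factors has a value in every cell, including the missing ones.
full = [[u[i] * p[j] for j in range(3)] for i in range(2)]
print("RMSE before:", history[0], "after:", history[-1])
print(full)
```

The word "alternating" refers to the two inner loops: one set of factors is held fixed while the other is solved, and then the roles swap.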

```
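import numpy as np
import pandas as pd

# U (user factors) and P (product factors) are assumed to be existing pandas DataFrames
# View left factor matrix
print(U)
# View right factor matrix
print(P)
# Multiply factor matrices
UP = np.matmul(U, P)

# Convert to pandas DataFrame
print(pd.DataFrame(UP, columns=P.columns, index=U.index))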
```

<center><img src="images/02.04.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.05.png"  style="width: 400px, height: 300px;"/></center>


### Requirements to apply pyspark ALS

- The DataFrame must be in long format (NOT wide format).
- The 2 separate tables (User table and Product table) should each have a unique id of integer data type.
- Each of these tables should be coalesced into one partition before ids are generated, so that the ids are consecutive.
- When using `monotonically_increasing_id`, make sure to cache/persist the DataFrames, or the values might change when a DataFrame is recomputed.
- The tables should then be joined to create the Rating table.
- ALS is applied to the Rating table.
- Use `nonnegative=True` to constrain the factors to non-negative values, so predictions are never negative.
- Use `rank` to specify the number of latent features you want to use.
- Use `implicitPrefs=True` along with `alpha` only when you do not have an explicit `rating` column.
- Use `maxIter` to set the number of iterations over which the factors are adjusted to reduce the error.
- Use `coldStartStrategy="drop"` to exclude from evaluation any user-product pair whose user or product was not seen in training (the model has no information about that userId or productId).
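
The first two requirements (long format and integer ids) can be illustrated without Spark. The sketch below uses plain Python and a hypothetical wide table; the user names, movie names, and variable names are made up for the example, and in Spark the same reshaping would be done with DataFrame operations.

```python
# Hypothetical wide table: one row per user, one column per movie, None = not rated.
wide = {
    "alice": {"Movie A": 5, "Movie B": None, "Movie C": 3},
    "bob": {"Movie A": None, "Movie B": 4, "Movie C": None},
}

# Assign consecutive integer ids to each distinct user and each distinct movie.
user_ids = {name: i for i, name in enumerate(sorted(wide))}
movie_names = sorted({movie for row in wide.values() for movie in row})
movie_ids = {name: i for i, name in enumerate(movie_names)}

# Long format: one (userId, movieId, rating) row per observed rating.
long_rows = [
    (user_ids[user], movie_ids[movie], rating)
    for user, row in sorted(wide.items())
    for movie, rating in sorted(row.items())
    if rating is not None
]

for row in long_rows:
    print(row)
```

Missing ratings simply have no row in the long table, which is why the long format suits a sparse Rating table.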


<center><img src="images/02.06.png"  style="width: 400px, height: 300px;"/></center>


### ALS algorithm

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ALS Example") \
    .getOrCreate()

# Sample data (User ID, Item ID, Rating, Additional Column1, Additional Column2)
data = [
    (1, 1, 5, "A", "X"),
# .............................
    (3, 2, 2, "F", "U")
]
df = spark.createDataFrame(data, ["userId", "itemId", "rating", "user_cat", "item_cat"])
# If user ID and/or item ID do not exist and only categories exist, create a unique identifier
# for each distinct category in separate DataFrames and then join them back.
# The sample data already has ids, so drop them first to avoid duplicate columns after the join.
df = df.drop("userId", "itemId")
users = df.select("user_cat").distinct().coalesce(1)  # single partition so the ids are consecutive
users = users.withColumn("userId", monotonically_increasing_id()).persist()  # persist so the ids do not change
items = df.select("item_cat").distinct().coalesce(1)  # single partition so the ids are consecutive
items = items.withColumn("itemId", monotonically_increasing_id()).persist()  # persist so the ids do not change
df = df.join(users, "user_cat", "left").join(items, "item_cat", "left")

(training_data, test_data) = df.randomSplit([0.8, 0.2]) # Split data into training and test sets

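# Train with ALS model; coldStartStrategy="drop" removes NaN predictions for users/items unseen in training
als = ALS(rank=10, maxIter=10, regParam=0.01, userCol="userId", itemCol="itemId", ratingCol="rating", coldStartStrategy="drop")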
model = als.fit(training_data)
predictions = model.transform(test_data) # Make predictions on test data
# Evaluate predictions using RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
```
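
To see why the cold-start setting matters for the RMSE step, here is a small plain-Python sketch. By default Spark's ALS produces a NaN prediction for a user or item that was not present in the training split, and a single NaN makes the whole metric NaN. The `rmse` helper and the rating/prediction pairs below are hypothetical, written only for this illustration.

```python
import math


def rmse(pairs):
    """Root mean squared error over (rating, prediction) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))


# Hypothetical (rating, prediction) pairs; the last prediction is NaN,
# standing in for a user or item that was unseen during training.
pairs = [(5.0, 4.0), (3.0, 3.0), (2.0, 4.0), (4.0, float("nan"))]

rmse_with_nan = rmse(pairs)  # the NaN propagates into the metric

# Dropping rows with NaN predictions mimics the effect of coldStartStrategy="drop".
kept = [(r, p) for r, p in pairs if not math.isnan(p)]
rmse_dropped = rmse(kept)

print(rmse_with_nan, rmse_dropped)
```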