# Chapter 1

### Latent Features 

- Some movies belong to more than one genre (a movie can be both a comedy and a romance).
- However, an individual viewer might experience such a movie as mostly one genre.
- How can we tell which genre each person thinks the movie leans toward?
- Latent features capture that. The number of latent features is called the rank.
- The ALS algorithm decomposes the original table into 2 matrices: User-Movie = User-LatentFeature X LatentFeature-Movie
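
The decomposition can be illustrated with a tiny hand-made example. This is only a sketch: the factor values below are invented, and `matmul` is a helper written here in plain Python rather than a library function. With 2 latent features (rank 2), multiplying a 2 x 2 user table by a 2 x 3 movie table gives back a 2 x 3 user-movie table.

```python
# Hypothetical factor values for 2 users, 2 latent features and 3 movies.
# Rows of user_factors: how much each user responds to each latent feature.
user_factors = [
    [1.0, 0.0],   # user 0 responds only to latent feature 0
    [0.5, 2.0],   # user 1 responds mostly to latent feature 1
]
# Rows of movie_factors: how strongly each latent feature shows in each movie.
movie_factors = [
    [4.0, 1.0, 3.0],  # latent feature 0 across the 3 movies
    [0.0, 2.0, 1.0],  # latent feature 1 across the 3 movies
]

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][f] * b[f][j] for f in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# User-Movie = User-LatentFeature X LatentFeature-Movie
ratings = matmul(user_factors, movie_factors)
for row in ratings:
    print(row)
```

The same movie (column 1) scores low for user 0 and high for user 1 because the two users weight the latent features differently.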

<center><img src="images/01.01.png"  style="width: 400px, height: 300px;"/></center>

# Chapter 2

### How ALS figures out recommendations

<center><img src="images/02.01.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.02.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.03.png"  style="width: 400px, height: 300px;"/></center>


- The original Rating table is decomposed into a User table and a Product table, both initialized with random numbers.
- The values in the User and Product tables are adjusted over a number of iterations to reduce the error (RMSE) between their product and the known ratings.
- The product of the fitted factor tables closely resembles the original Rating table.
- While the original table has many missing values, the product of the factor tables contains a predicted value for every cell; these predictions can be used as recommendations.
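
The steps above can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: it assumes rank 1 (one latent feature), so each least-squares solve reduces to a single division, and the rating values and the `reg` regularization value are made up.

```python
import random

# Toy rating table: rows are users, columns are products, None = not rated.
R = [
    [5.0, 3.0, None],
    [4.0, None, 1.0],
    [None, 3.0, 1.0],
]
n_users, n_items = len(R), len(R[0])
reg = 0.01  # small regularization term (assumed value)

random.seed(0)
u = [random.random() for _ in range(n_users)]  # random user factors
p = [random.random() for _ in range(n_items)]  # random product factors

def rmse(u, p):
    """Root mean squared error over the observed ratings only."""
    errs = [(R[i][j] - u[i] * p[j]) ** 2
            for i in range(n_users) for j in range(n_items)
            if R[i][j] is not None]
    return (sum(errs) / len(errs)) ** 0.5

rmse_before = rmse(u, p)
for _ in range(20):
    # Hold product factors fixed, solve least squares for each user factor.
    for i in range(n_users):
        seen = [j for j in range(n_items) if R[i][j] is not None]
        u[i] = sum(R[i][j] * p[j] for j in seen) / (sum(p[j] ** 2 for j in seen) + reg)
    # Hold user factors fixed, solve least squares for each product factor.
    for j in range(n_items):
        seen = [i for i in range(n_users) if R[i][j] is not None]
        p[j] = sum(R[i][j] * u[i] for i in seen) / (sum(u[i] ** 2 for i in seen) + reg)
rmse_after = rmse(u, p)

# The product u[i] * p[j] has a value in every cell, including the gaps.
predicted = [[round(u[i] * p[j], 2) for j in range(n_items)] for i in range(n_users)]
print("RMSE before:", rmse_before, "after:", rmse_after)
print(predicted)
```

The "alternating" in ALS refers to the two inner loops: one table is held fixed while the other is solved, and then the roles swap.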

```
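import numpy as np
import pandas as pd

# U and P are assumed to be pandas DataFrames holding the two factor matrices
# View left factor matrix (users x latent features)
print(U)
# View right factor matrix (latent features x products)
print(P)
# Multiply factor matrices
UP = np.matmul(U.to_numpy(), P.to_numpy())

# Convert to pandas DataFrame, labeling rows by user and columns by product
print(pd.DataFrame(UP, columns=P.columns, index=U.index))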

```

<center><img src="images/02.04.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.05.png"  style="width: 400px, height: 300px;"/></center>


### Requirements to apply pyspark ALS

- The DataFrame must be in long format (NOT wide format).
- The 2 separate tables (User table and Product table) should each have a unique id of integer data type.
- Coalesce each table into one partition before generating ids so that the ids are consecutive.
- When using `monotonically_increasing_id`, cache/persist the DataFrames, or the values might change when they are recomputed.
- Join the tables to create the Rating table.
- ALS is applied to the Rating table.
- Use `nonnegative=True` to constrain the latent factors to non-negative values.
- Use `rank` to specify the number of latent features you want to use.
- Use `implicitPrefs=True` together with `alpha` only when you have implicit feedback (for example, click counts) instead of an explicit `rating` column.
- Use `maxIter` to set the number of iterations ALS runs while reducing the error.
- Use `coldStartStrategy="drop"` to drop test rows whose user or product did not appear in the training data; otherwise their predictions are NaN and the evaluation metric becomes NaN as well.
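
The first two requirements (long format and integer ids) can be sketched without Spark. The table below is hypothetical and plain Python stands in for DataFrame operations; in PySpark the same reshaping would be done on DataFrames.

```python
# Hypothetical wide-format table: one row per user, one column per product.
wide = {
    "alice": {"Heat": 4.0, "Up": None, "Big": 5.0},
    "bob":   {"Heat": None, "Up": 3.0, "Big": None},
}

# Integer ids for users and products (ALS needs numeric id columns).
user_ids = {name: i for i, name in enumerate(sorted(wide))}
all_products = sorted({product for row in wide.values() for product in row})
product_ids = {name: i for i, name in enumerate(all_products)}

# Long format: one (userId, productId, rating) row per observed rating.
long_rows = [
    (user_ids[user], product_ids[product], rating)
    for user, row in wide.items()
    for product, rating in row.items()
    if rating is not None
]
print(long_rows)
```

Missing ratings are simply absent from the long table, which is why the long format suits sparse rating data.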


<center><img src="images/02.06.png"  style="width: 400px, height: 300px;"/></center>


# Chapter 3

### ALS algorithm

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ALS Example") \
    .getOrCreate()

# Sample data (User ID, Item ID, Rating, Additional Column1, Additional Column2)
data = [
    (1, 1, 5, "A", "X"),
# .............................
    (3, 2, 2, "F", "U")
]
df = spark.createDataFrame(data, ["userId", "itemId", "rating", "user_cat", "item_cat"])
# If user IDs and/or item IDs do not exist and only categories exist, build a unique
# identifier for each distinct category in separate DataFrames, then join them back
users = df.select("user_cat").distinct().coalesce(1)  # single partition so the ids increase consecutively
users = users.withColumn("userId", monotonically_increasing_id()).persist()  # persist so the ids do not change on recomputation
items = df.select("item_cat").distinct().coalesce(1)  # single partition so the ids increase consecutively
items = items.withColumn("itemId", monotonically_increasing_id()).persist()  # persist so the ids do not change on recomputation
# If the rating DataFrame already exists, attach the generated ids to it
# (this assumes df does not already carry userId/itemId columns)
new_rating_df = df.join(users, "user_cat", "left").join(items, "item_cat", "left")
# NOTE: If the rating DataFrame does not exist, create one
# (df1, df2 and df3 are placeholders for the user, item and interaction DataFrames)
users = df1.select("userId").distinct()
items = df2.select("itemId").distinct()
custom_rating_df = users.crossJoin(items).join(df3, ["userId", "itemId"], "left").fillna(0)

# Sparsity = (number of empty entries) / (total number of entries), as a percentage
# `ratings` is the long-format rating DataFrame (for example, df above)
numerator = ratings.count()
num_users = ratings.select("userId").distinct().count()
num_items = ratings.select("itemId").distinct().count()
denominator = num_users * num_items
sparsity = (1.0 - (numerator * 1.0) / denominator) * 100

(training_data, test_data) = df.randomSplit([0.8, 0.2]) # Split data into training and test sets

# k-fold cross-validation
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
        coldStartStrategy="drop", nonnegative=True, implicitPrefs=False)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
# Parentheses let the chained calls span several lines
param_grid = (ParamGridBuilder()
                .addGrid(als.rank, [5, 40, 80, 120])
                .addGrid(als.maxIter, [5, 100, 250, 500])
                .addGrid(als.regParam, [.05, .1, 1.5])
                .build())
cv = CrossValidator(estimator = als,
                estimatorParamMaps = param_grid,
                evaluator = evaluator,
                numFolds = 5)
len(param_grid) # Number of parameter combinations in the grid (4 * 4 * 3 = 48)
model = cv.fit(training_data) # Cross-validate every combination, then refit the best one on all training data
best_model = model.bestModel # Best Model
predictions = best_model.transform(test_data) # Make predictions on test data
rmse = evaluator.evaluate(predictions) # Evaluate predictions using RMSE
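best_model.rank # Rank (number of latent features) of the best model
# ALSModel does not expose maxIter/regParam getters, so read them from the parent
# estimator (this relies on the private _java_obj handle and may vary by Spark version)
best_model._java_obj.parent().getMaxIter()
best_model._java_obj.parent().getRegParam()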
recommendations_df = best_model.recommendForAllUsers(5) # Generate the top 5 recommendations for every user
from pyspark.sql.functions import col, explode
# Explode the recommendations column so each recommendation gets its own row
recs = recommendations_df.withColumn("itemId_rating", explode("recommendations"))
# Split the (itemId, rating) pair into separate columns; the predicted rating is
# named "prediction" so it does not clash with the observed "rating" column
recs = recs.withColumn("itemId", col("itemId_rating").getField("itemId"))\
                .withColumn("prediction", col("itemId_rating").getField("rating"))
# Only show recommendations for items the user has not rated yet
recs.join(ratings, ["userId", "itemId"], "left").filter(ratings["rating"].isNull()).show()

### NOTE: For implicit rating columns (e.g. number of clicks, binary ratings) use a custom evaluation metric (ROEM) instead of RMSE,
### or create a weighted rating column through feature engineering. Also set "implicitPrefs=True" in the ALS model params
from pyspark.ml.evaluation import Evaluator
class ROEMEvaluator(Evaluator):
    def __init__(self, userCol="userId", itemCol="itemId", ratingCol="num_clicks"):
        super().__init__()  # initialize the Params machinery of Evaluator
        self.userCol = userCol
        self.itemCol = itemCol
        self.ratingCol = ratingCol

    def isLargerBetter(self):
        return False  # a lower ROEM means better rankings, so tuning should minimize it

    def _evaluate(self, predictions):
        predictions.createOrReplaceTempView("predictions")
        # Total of the implicit rating column over all rows
        denominator = predictions.groupBy().sum(self.ratingCol).collect()[0][0]
        # Percent rank of each prediction within its user (0 = top recommendation)
        rankings = spark.sql("SELECT " + self.userCol + ", " + self.ratingCol + ", PERCENT_RANK() OVER (PARTITION BY " + self.userCol + " ORDER BY prediction DESC) AS rank FROM predictions")
        rankings.createOrReplaceTempView("rankings")
        numerator = spark.sql("SELECT SUM(" + self.ratingCol + " * rank) FROM rankings").collect()[0][0]
        performance = numerator / denominator
        return performance
evaluator = ROEMEvaluator(ratingCol="colName")  # "colName" is a placeholder for the implicit rating column

```
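
The sparsity formula used in the block above can be checked by hand on a small example. The ratings below are made up, and plain Python lists stand in for the Spark DataFrame.

```python
# Hypothetical long-format ratings: (userId, itemId, rating).
ratings = [
    (0, 0, 5.0), (0, 1, 3.0),
    (1, 0, 4.0),
    (2, 2, 1.0),
]

num_ratings = len(ratings)                     # filled cells
num_users = len({u for u, _, _ in ratings})    # distinct users
num_items = len({i for _, i, _ in ratings})    # distinct items
total_cells = num_users * num_items            # size of the full user x item table

# Share of the user x item table that has no rating, as a percentage
sparsity = (1.0 - num_ratings / total_cells) * 100
print(f"{sparsity:.2f}% of the table is empty")
```

Here 4 of the 9 possible user-item cells are filled, so 5 of 9 cells (about 55.56%) are empty.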

# Chapter 4

### ROEM

```
# ROEM is a metric for evaluating recommendation systems trained on implicit ratings with the ALS algorithm.
# ROEM stands for Rank Ordering Error Metric.
# Unfortunately, PySpark does not provide native support for ROEM.
# Here is a custom implementation of ROEM.

def ROEM(predictions, userCol="userId", itemCol="songId", ratingCol="num_plays"):
    # Create table that can be queried
    predictions.createOrReplaceTempView("predictions")
    # Sum of total number of plays of all songs
    denominator = predictions.groupBy().sum(ratingCol).collect()[0][0]
    # Compute the percent rank of each song's prediction within each user (0 = top prediction)
    spark.sql(
        "SELECT " + userCol + " , " + ratingCol + " , PERCENT_RANK() OVER (PARTITION BY " + userCol + " ORDER BY prediction DESC) AS rank FROM predictions"
    ).createOrReplaceTempView("rankings")
    # Multiplies the rank of each song by the number of plays and adds the products together
    numerator = spark.sql('SELECT SUM(' + ratingCol + ' * rank) FROM rankings').collect()[0][0]
    # Compute ROEM
    roem = numerator / denominator
    return roem
    
# Split the data into training and test sets
(training, test) = msd.randomSplit([0.8, 0.2])
#Building 5 folds within the training set.
train1, train2, train3, train4, train5 = training.randomSplit([0.2, 0.2, 0.2, 0.2, 0.2], seed = 1)
fold1 = train2.union(train3).union(train4).union(train5)
fold2 = train3.union(train4).union(train5).union(train1)
fold3 = train4.union(train5).union(train1).union(train2)
fold4 = train5.union(train1).union(train2).union(train3)
fold5 = train1.union(train2).union(train3).union(train4)

foldlist = [(fold1, train1), (fold2, train2), (fold3, train3), (fold4, train4), (fold5, train5)]

# Empty list to fill with ROEMs from each model
ROEMS = []

# Loops through all models and all folds (model_list is assumed to be a list of unfitted ALS estimators)
for model in model_list:
    for ft_pair in foldlist:
        # Fits model to fold within training data
        fitted_model = model.fit(ft_pair[0])
        # Generates predictions using fitted_model on respective CV test data
        predictions = fitted_model.transform(ft_pair[1])
        # Computes and prints the ROEM on the CV test data
        r = ROEM(predictions)
        print ("ROEM: ", r)
    # Fits model to all of training data and generates preds for test data
    v_fitted_model = model.fit(training)
    v_predictions = v_fitted_model.transform(test)
    v_ROEM = ROEM(v_predictions)
    # Adds validation ROEM to ROEM list
    ROEMS.append(v_ROEM)
    print ("Validation ROEM: ", v_ROEM)
```
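
To see what the ROEM function computes, here is a small worked example in plain Python. It is a sketch under assumptions: the rows are invented, and `percent_rank_desc` is a helper written here to mimic SQL's `PERCENT_RANK()` with `ORDER BY prediction DESC`.

```python
from collections import defaultdict

# Hypothetical predictions: (userId, num_plays, prediction).
rows = [
    (1, 10, 0.9), (1, 0, 0.5), (1, 2, 0.1),
    (2, 0, 0.8), (2, 6, 0.4), (2, 0, 0.2),
]

def percent_rank_desc(values):
    """Mimic SQL PERCENT_RANK() ... ORDER BY value DESC within one partition."""
    n = len(values)
    if n == 1:
        return [0.0]
    # (rank - 1) equals the number of strictly larger values, so ties share a rank.
    return [sum(other > v for other in values) / (n - 1) for v in values]

# Group the rows by user, like PARTITION BY userId.
by_user = defaultdict(list)
for user, plays, pred in rows:
    by_user[user].append((plays, pred))

# Numerator: plays weighted by the percent rank of the prediction.
numerator = 0.0
for items in by_user.values():
    ranks = percent_rank_desc([pred for _, pred in items])
    numerator += sum(plays * rank for (plays, _), rank in zip(items, ranks))
# Denominator: total number of plays.
denominator = sum(plays for _, plays, _ in rows)

roem = numerator / denominator
print(roem)
```

User 1's most-played song is ranked first (percent rank 0), so it adds nothing to the error; the 2 plays ranked last add 2 x 1.0. User 2's 6 plays sit in the middle and add 6 x 0.5. The result is 5 / 18, about 0.28. Lower values mean heavily played songs are ranked near the top, while a value near 0.5 is roughly what random ordering would give.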