![a](https://touchwoodsolihull.co.uk/thumbs/938x370r/2015-06/hm-1.png)

This notebook explores several approaches to the H&M recommendation problem using PySpark.
We tried the following strategies:
* Recommend previously purchased items
* Recommend items that are frequently bought together
* Recommend the most popular items
* Use ALS, Spark's module for collaborative filtering

# Read the data

In [None]:
!pip install pyspark -q

In [None]:
import pyspark
from pyspark.sql import SparkSession

import warnings
warnings.filterwarnings("ignore")

spark = SparkSession.builder.appName("H&M").getOrCreate()

In [None]:
transactions = (spark.read.format("csv")
                .option("header", "true")
                .load("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv"))
transactions.printSchema()

In [None]:
from pyspark.sql.functions import *

transactions = transactions.withColumn('t_dat', to_date("t_dat"))

# Analysis

<h3 style="font-family:Cursive;color:#ff1aff;">Keep only the last week of purchases for each customer</h3>

In [None]:
# find the latest purchase day for every customer
tmp = transactions.groupby("customer_id").agg(
    expr("max(t_dat) AS latest_date")
)

joinExpression = transactions['customer_id'] == tmp["customer_id"]

transactions = transactions.join(tmp, joinExpression).drop(tmp["customer_id"])

In [None]:
# keep the purchases made at most 6 days before the customer's latest purchase
transactions = (transactions
                .withColumn('date_diff', datediff(col("latest_date"), col("t_dat")))
                .filter("date_diff <= 6"))
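
To make the windowing explicit, the same logic can be sketched in plain Python on a few hypothetical rows (the customers, articles, and dates are invented for illustration): find each customer's latest purchase date, then keep the purchases made at most 6 days before it.

```python
from datetime import date

# Hypothetical toy transactions: (customer_id, article_id, purchase date)
rows = [
    ("c1", "a1", date(2020, 9, 1)),
    ("c1", "a2", date(2020, 9, 16)),
    ("c1", "a3", date(2020, 9, 22)),
    ("c2", "a1", date(2020, 8, 30)),
]

# Latest purchase date per customer (the groupby + max step)
latest = {}
for customer, _, day in rows:
    latest[customer] = max(latest.get(customer, day), day)

# Keep purchases made at most 6 days before that date (date_diff <= 6)
last_week = [r for r in rows if (latest[r[0]] - r[2]).days <= 6]
print(last_week)
```

Because the window is anchored to each customer's own latest purchase, a customer who last bought something months ago still keeps a full "last week" of history.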

<h3 style="font-family:Cursive;color:#ff1aff;">Recommend the items that are most frequently purchased together</h3>

In [None]:
tmp = (transactions.groupby("customer_id", "article_id")
                    .count())
tmp.orderBy("count", ascending=False).show(5)

In [None]:
tmp.cache()
tmp.createOrReplaceTempView("tmp")
paired_items = spark.sql("""
    SELECT * FROM
    (SELECT *, row_number() over
    (PARTITION BY customer_id ORDER BY count DESC) as row_index
    FROM tmp) a
    WHERE row_index <= 3
    """)

tmp.unpersist()
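
The window query above keeps, for every customer, the three articles with the highest purchase counts. Below is a minimal plain-Python sketch of the same ranking on invented data. Note that `row_number()` orders ties arbitrarily, whereas this sketch orders them by first appearance.

```python
from collections import Counter, defaultdict

# Hypothetical purchase log: one (customer_id, article_id) pair per purchase
purchases = [
    ("c1", "a1"), ("c1", "a1"),
    ("c1", "a2"),
    ("c1", "a3"), ("c1", "a3"), ("c1", "a3"),
    ("c1", "a4"),
    ("c2", "a9"),
]

# Equivalent of groupby("customer_id", "article_id").count()
counts = Counter(purchases)

# Group the (count, article) pairs by customer
per_customer = defaultdict(list)
for (customer, article), n in counts.items():
    per_customer[customer].append((n, article))

# Equivalent of row_number() OVER (... ORDER BY count DESC) with row_index <= 3
top3 = {
    customer: [article for n, article in sorted(items, key=lambda t: -t[0])[:3]]
    for customer, items in per_customer.items()
}
print(top3)
```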

In [None]:
paired_items = (paired_items.groupBy('customer_id')
                            .agg(collect_set('article_id')
                            .alias('article_id')))
paired_items.show()

<h3 style="font-family:Cursive;color:#ff1aff;">Recommend last week's most popular items</h3>

In [None]:
transactions.cache()
transactions.createOrReplaceTempView('transactions')
top12 = spark.sql("""
    SELECT article_id, COUNT(*) AS count FROM transactions
    WHERE t_dat > '2020-08-23'
    GROUP BY article_id
    ORDER BY count DESC
    LIMIT 12
    """)

In [None]:
import pandas as pd

# Collect the 12 article IDs and join them into one space-separated string
p_top12 = top12.select('article_id').toPandas()
t12 = ' '.join(p_top12['article_id'])
print(t12)

# ALS

Goal: Factorize the given ratings matrix $R$ into two factors, a user matrix $U$ and an item matrix $V$, such that $R \approx U^TV$.

Notation: 
* $u_i$: the $i$th column of the user matrix
* $v_j$: the $j$th column of the item matrix
* $r_{ij}$: the rating given to the $j$th item by the $i$th user
* $\lambda$: the regularization factor
* $n_{u_i}$: the number of items the $i$th user rated
* $n_{v_j}$: the number of times the $j$th item was rated

Objective:
$\text{argmin}_{U,V}\sum_{i,j, r_{ij}\not=0}(r_{ij}-u_i^Tv_j)^2+\lambda(\sum_{i}n_{u_i}\|u_i\|^2+\sum_{j}n_{v_j}\|v_j\|^2)$

Algorithm: Fix $U$ and treat it as a constant; the objective is then a convex function of $V$, so solve for $V$. Next, fix $V$ and solve for $U$ in the same way, and keep alternating between the two steps.
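
For reference, each of these sub-problems is a regularized least-squares problem with a closed-form solution. Setting the gradient of the objective with respect to a single $u_i$ to zero while $V$ is held fixed gives the update below. Here $I_k$ denotes the $k \times k$ identity matrix, where $k$ is the number of latent features; this symbol is introduced only for this sketch.

$$u_i = \Big(\sum_{j:\, r_{ij}\neq 0} v_j v_j^T + \lambda\, n_{u_i} I_k\Big)^{-1} \sum_{j:\, r_{ij}\neq 0} r_{ij}\, v_j$$

The update for an item vector with $U$ held fixed is symmetric:

$$v_j = \Big(\sum_{i:\, r_{ij}\neq 0} u_i u_i^T + \lambda\, n_{v_j} I_k\Big)^{-1} \sum_{i:\, r_{ij}\neq 0} r_{ij}\, u_i$$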

ALS finds a $k$-dimensional feature vector for each user and each item such that the dot product of a user's and an item's feature vectors approximates the user's rating for that item.
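
As a rough illustration, the sketch below runs the alternating updates for the special case $k = 1$, where every feature vector is a single number and each update reduces to a scalar division. It is plain Python on invented ratings, not Spark's implementation, and the helper names `solve_side` and `objective` are made up for this example.

```python
# Invented toy ratings: (user index, item index, rating)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
n_users, n_items, lam = 3, 3, 0.1


def solve_side(fixed, n_free, free_pos, fixed_pos):
    """Rank-1 closed-form update of one factor while the other is held fixed."""
    num = [0.0] * n_free  # sum of rating * fixed feature
    den = [0.0] * n_free  # sum of squared fixed features
    cnt = [0] * n_free    # number of observed ratings (n_u or n_v)
    for obs in ratings:
        free, other, r = obs[free_pos], obs[fixed_pos], obs[2]
        num[free] += r * fixed[other]
        den[free] += fixed[other] ** 2
        cnt[free] += 1
    # lam * cnt is the weighted regularization term from the objective
    return [num[a] / (den[a] + lam * cnt[a]) if cnt[a] else 0.0
            for a in range(n_free)]


def objective(u, v):
    """Squared error on observed ratings plus the weighted regularization."""
    err = sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)
    # Summing over observations weights each u_i by n_u and each v_j by n_v
    reg = sum(u[i] ** 2 + v[j] ** 2 for i, j, _ in ratings)
    return err + lam * reg


u = [1.0] * n_users
v = [1.0] * n_items
losses = [objective(u, v)]
for _ in range(10):
    v = solve_side(u, n_items, free_pos=1, fixed_pos=0)  # fix U, solve for V
    u = solve_side(v, n_users, free_pos=0, fixed_pos=1)  # fix V, solve for U
    losses.append(objective(u, v))

print(losses[0], losses[-1])
# The predicted rating of item 0 by user 0 is the product of their features
print(u[0] * v[0])
```

Since every half-step minimizes the objective exactly over one factor, the recorded loss never increases from one iteration to the next.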

ALS requires an input dataset with only three columns: a user ID column, an item ID column, and a rating column. It handles both explicit ratings (numerical ratings given by users) and implicit ratings (the strength of the interaction between a user and an item; here, the number of times the given user purchased the given item).

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [None]:
# Keep only the purchases made on 2020-09-22
tmp = transactions.filter(
    (year(col('t_dat')) == 2020)
    & (month(col('t_dat')) == 9)
    & (dayofmonth(col('t_dat')) == 22)
)
transactions.unpersist()

# Prepare the dataset
tmp = tmp.groupby('customer_id', 'article_id').count()
tmp.show(5)

In [None]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

indexer = [StringIndexer(inputCol=column, outputCol=column+"_index") 
           for column in list(set(tmp.columns) - set(['count']))]

In [None]:
import gc
gc.collect()

In [None]:
pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(tmp).transform(tmp)

(train, test) = transformed.randomSplit([0.8, 0.2])

In [None]:
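# implicitPrefs is left at its default (False), so the purchase counts
# are treated as explicit ratings
als = ALS(maxIter=5, regParam=0.09, 
          rank=25, userCol="customer_id_index",
          itemCol="article_id_index", ratingCol="count",
          coldStartStrategy="drop", nonnegative=True)

model = als.fit(train)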

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="rmse", labelCol="count", predictionCol="prediction")
predictions = model.transform(test)
rmse = evaluator.evaluate(predictions)
print("RMSE is equal to:", rmse)
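
For reference, RMSE is the square root of the mean squared difference between the observed and predicted values. A minimal plain-Python sketch on invented numbers follows; the `rmse_of` helper is hypothetical and is not part of PySpark.

```python
import math


def rmse_of(labels, predictions):
    """Root-mean-square error between two equally long sequences."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Invented purchase counts and model predictions
print(rmse_of([1, 2, 1, 3], [1.0, 1.0, 2.0, 3.0]))
```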

In [None]:
model.recommendForAllUsers(10).show(10)

Since the recommendations are given as indexes, we should map them back to the original IDs.
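
The idea is a plain dictionary lookup from index back to ID. The sketch below imitates it in plain Python on invented article IDs. It assigns indexes by descending frequency, which is `StringIndexer`'s default ordering, but the alphabetical tie-breaking used here is an assumption and may differ from what your Spark version does.

```python
from collections import Counter

# Invented article IDs as they might appear in the transactions
article_ids = ["0706016001", "0372860001", "0706016001", "0610776002"]

# Assign index 0 to the most frequent ID, 1 to the next one, and so on
freq = Counter(article_ids)
ordered = sorted(freq, key=lambda a: (-freq[a], a))
id_to_index = {a: i for i, a in enumerate(ordered)}

# Invert the mapping so that model output (indexes) can be turned back into IDs
index_to_id = {i: a for a, i in id_to_index.items()}

recommended_indexes = [0, 2]  # hypothetical model output
print([index_to_id[i] for i in recommended_indexes])
```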

In [None]:
import pandas as pd

recs = model.recommendForAllUsers(10).toPandas()
recommendations = (recs['recommendations'].apply(pd.Series) # split the column of lists into one column per recommendation
            .merge(recs, right_index = True, left_index = True) # re-attach customer_id_index
            .drop(["recommendations"], axis = 1) # drop the original list column
            .melt(id_vars = ['customer_id_index'], value_name = "recommendations") # turn the per-recommendation columns into rows
            .drop("variable", axis = 1)
            .sort_values('customer_id_index')
            .dropna())

recommendations = pd.concat([recommendations['recommendations'].apply(pd.Series), 
                             recommendations['customer_id_index']], axis = 1) # separate article indexes from predicted ratings

In [None]:
recommendations.columns = ['ArticleID_index','count','UserID_index']
transformed_subset = transformed.select('article_id', 'article_id_index', 'customer_id', 'customer_id_index')
transformed_subset = transformed_subset.toPandas()

In [None]:
# map index to id
article_map = dict(zip(transformed_subset['article_id_index'], transformed_subset['article_id']))
customer_map = dict(zip(transformed_subset['customer_id_index'], transformed_subset['customer_id']))
recommendations['article_id'] = recommendations['ArticleID_index'].map(article_map)
recommendations['customer_id'] = recommendations['UserID_index'].map(customer_map)

In [None]:
recommendations.reset_index(drop=True, inplace=True)
recommendations = recommendations[['customer_id','article_id']]
recommendations