# <center>INFO-H515 - Distributed Data Management and Scalable Analytics</center>

## <center>Project 2021-2022</center>

#### <center>Bakkali Yahya (000445166)</center>
#### <center>Hauwaert Maxime (000461714)</center>
#### <center>Marotte Nathan (000459274)</center>

# INTRODUCTION


In this notebook we will be using the following packages:
  * [PySpark](https://spark.apache.org/docs/latest/api/python/pyspark.html)
  * [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)
  * [Numpy](https://numpy.org/doc/stable/)

## Specifications of the problem
We are tasked with building a recommender system as part of the RecSys Challenge 2022, organised by Dressipi, a company focused on providing product and outfit recommendations to leading global retailers.

## Dataset
The full dataset consists of 1.1 million online retail sessions in the fashion domain, sampled from an 18-month period.

It is split into 4 CSV files:
   * candidate_items.csv : contains the candidate items for the recommender system, i.e. the item_id values our model may output as predictions.
   * item_features.csv : contains the features (for example material, type, colour, etc.) of each item, together with the value of each feature (cotton, skinny, blue, etc.).
   * train_purchases.csv : contains the purchase made at the end of each session.
   * train_sessions.csv : contains the browsing sessions. Each session is a stream of viewed items and ends when an item is purchased.

Using those 4 files, we first develop a pipeline that trims the data, then engineer features and regroup them into a single dataset that can be used by the recommender system.


### Creating the Spark session

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import numpy as np
import os

# start spark session
os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=32g pyspark-shell"
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("AppName") \
    .config('spark.ui.port', '4075')\
    .getOrCreate()


### Loading the dataset

In [None]:
candidate_items, item_features, train_purchases, train_sessions = [None] * 4
datasets = [candidate_items, item_features, train_purchases, train_sessions]

#load all datasets
def load_candidate_items():
    global candidate_items
    candidate_items = spark.read.csv("dressipi_recsys2022/candidate_items.csv", header=True)
    candidate_items = candidate_items.withColumn("item_id", candidate_items["item_id"].cast("int"))

def load_item_features():
    global item_features
    item_features = spark.read.csv("dressipi_recsys2022/item_features.csv", header=True)
    item_features = item_features.withColumn("item_id", item_features["item_id"].cast("int"))
    item_features = item_features.withColumn("feature_category_id", item_features["feature_category_id"].cast("int"))
    item_features = item_features.withColumn("feature_value_id", item_features["feature_value_id"].cast("int"))

def load_train_purchases():
    global train_purchases
    train_purchases = spark.read.csv("dressipi_recsys2022/train_purchases.csv", header=True)
    train_purchases = train_purchases.withColumn("session_id", train_purchases["session_id"].cast("int"))
    train_purchases = train_purchases.withColumn("item_id", train_purchases["item_id"].cast("int"))
    train_purchases = train_purchases.withColumn("date", train_purchases["date"].cast("timestamp"))

def load_train_sessions():
    global train_sessions
    train_sessions = spark.read.csv("dressipi_recsys2022/train_sessions.csv", header=True)
    train_sessions = train_sessions.withColumn("session_id", train_sessions["session_id"].cast("int"))
    train_sessions = train_sessions.withColumn("item_id", train_sessions["item_id"].cast("int"))
    train_sessions = train_sessions.withColumn("date", train_sessions["date"].cast("timestamp"))

def load_datasets():
    global datasets
    load_candidate_items()
    load_item_features()
    load_train_purchases()
    load_train_sessions()
    datasets = [candidate_items, item_features, train_purchases, train_sessions]

load_datasets()


### Quick look at the data

#### train_sessions.csv
This dataset represents the browsing session of a user in the store. It is made of 3 columns:
- session_id : the id of the session. It serves as a key to join the data with the other datasets
- item_id : the item viewed during the session.
- date : the date at which the item was viewed.

Please note that the browsing sessions end at the end of the day, or if an item was purchased. This means we will not find, for one session, items viewed on 2 different days.
Also, as stated on the challenge's website, there are no sessions that do not end with a purchased item in this dataset.

Here is a representation of a few of the rows in the train_sessions.csv dataset:

In [None]:
train_sessions.show(5)
print(f"There are {train_sessions.count()} rows in total.")

#### train_purchases.csv

This dataset represents the purchases made by a user in the store. It is made of 3 columns:
- session_id : the id of the session.
- item_id : the item purchased.
- date : the date at which the item was purchased.

This dataset should be used with the train_sessions dataset to have a more complete picture of the browsing experience of the user. Each session in this dataset ends with an item purchased, noted in the column item_id.

Here is a representation of a few of the rows in the train_purchases.csv dataset:

In [None]:
train_purchases.show(5)
print(f"There are {train_purchases.count()} rows in total.")

#### item_features.csv

This dataset helps us notice patterns across items. It links each item_id with its features (such as the colour of the item, the material, the type, etc.).
There are 3 columns:
- item_id : the id of the item in the store.
- feature_category_id : the id of the feature category attached to the item.
- feature_value_id : the value of that feature for the item (for example, if the feature_category_id stands for the colour, the feature_value_id could represent "red", "blue", "green", etc.).

Here is a representation of a few of the rows in the item_features.csv dataset:

In [None]:
item_features.show(5)
print(f"There are {item_features.count()} rows in total.")

#### candidate_items.csv

This dataset contains all the item_id values that are candidates for the recommender system. It consists of a single column of 4991 item_id values.

Here is a representation of a few of the rows in the candidate_items.csv dataset:

In [None]:
candidate_items.show(5)
print(f"There are {candidate_items.count()} rows in total.")

# Part 1 : Pipeline

## Data exploration

We will first check if there are missing values, NA, or Null values in the downloaded dataset.

In [None]:
print("Nb of missing values in candidate_items:", candidate_items.filter(candidate_items["item_id"].isNull()).count())
print("Nb of missing values in item_features:", item_features.filter(item_features["item_id"].isNull()).count())
print("Nb of missing values in train_purchases:", train_purchases.filter(train_purchases["item_id"].isNull()).count())
print("Nb of missing values in train_sessions:", train_sessions.filter(train_sessions["item_id"].isNull()).count())

Now, we wondered if there are sessions for which the first item viewed was purchased (therefore it will not be in train_sessions but in train_purchases), and we also made sure that there are no sessions for which there were no purchases.

In [None]:
t = [a["session_id"] for a in train_sessions.select("session_id").collect()]
p = [a["session_id"] for a in train_purchases.select("session_id").collect()]

print(f"Nb of session id in train_sessions but not in train_purchases: {len(set(t).difference(set(p)))}")
print(f"Nb of session id in train_purchases but not in train_sessions: {len(set(p).difference(set(t)))}")

Then we wanted to know how much data we were working with so we counted the number of different item_id in each dataset.

In [None]:
a = train_sessions.select("item_id").distinct().collect()
b = train_purchases.select("item_id").distinct().collect()
c = candidate_items.select("item_id").distinct().collect()
d = item_features.select("item_id").distinct().collect()
print(f"Nb of distinct item id in train_sessions: {len(a)}")
print(f"Nb of distinct item id in train_purchases: {len(b)}")
print(f"Nb of distinct item id in candidate_items: {len(c)}")
print(f"Nb of distinct item id in item_features: {len(d)}")

We cannot know if those items are the same or different, but by running the next cell we discover that there are no items in train_sessions that are not in another dataset. Therefore we know that we have at least 23496 items.

In [None]:
print(f"Nb of item id in train_sessions but not in item_features: {len(set(a).difference(set(d)))}")
print(f"Nb of item id in train_purchases but not in item_features: {len(set(b).difference(set(d)))}")
print(f"Nb of item id in candidate_items but not in item_features: {len(set(c).difference(set(d)))}")

## Feature engineering

Let's remove sessions which end up with the purchase of a non candidate item.

In [None]:
train_purchases = candidate_items.join(train_purchases, "item_id", "inner")

train_sessions = train_sessions.join(train_purchases.select("session_id"), "session_id", "inner")

Let's transform the dataset from a dataframe into an RDD.

In [None]:
# Maps every session to its items and dates
session_item_date_rdd = train_sessions.rdd.map(lambda x: (x["session_id"], (x["item_id"], x["date"]))).cache()

These lists will contain the RDDs of the features along with their names.

In [None]:
features = []
features_names = []

### Date related features

Let's create an RDD that will be used to generate the following features.

In [None]:
# Maps every session to its dates
session_date_rdd = session_item_date_rdd.mapValues(lambda x: (x[1])).cache()

#### 1) Month

The month of the year (1 to 12) in which the session took place

The reason for this feature is that fashion items are more or less likely to be bought depending on the month of the purchase. There are probably fewer swimsuits bought in December than in July, so we believe that two items bought in the same month could be similar.

In [None]:
def get_month_feature():
    return session_date_rdd.mapValues(lambda x: int(x.month))\
                            .reduceByKey(min)
    
month = get_month_feature()

features.append(month)
features_names.append("month")

print(month.take(5))

#### 2) Season (Meteorological)

The meteorological season of the year in which the session takes place.
  - Winter : December, January, and February
  - Spring : March, April, and May
  - Summer : June, July, and August
  - Fall   : September, October, and November

We add this feature even though it seems redundant with the month feature, because the month feature might be too restrictive (items bought in January and in February can also be similar because of the season, especially in clothing).


In [None]:
def get_season_feature():
    return month.mapValues(get_season)\
                .reduceByKey(min)

def get_season(month):
    if month == 12 or month <= 2: return 0
    elif 2 < month <= 5: return 1
    elif 5 < month <= 8: return 2
    elif 8 < month <= 11: return 3

season = get_season_feature()

features.append(season)
features_names.append("season")

print(season.take(5))

#### 3) Day of month

We believe that the day of the month (from 1 to 28, 29, 30 or 31, depending on the month) also has an influence on the purchases: salaries are often paid around the 1st of the month, so we might expect more purchases during that period.


In [None]:
def get_day_of_month_feature():
    return session_date_rdd.mapValues(lambda x: int(x.day))\
                            .reduceByKey(min)

day_of_month = get_day_of_month_feature()

features.append(day_of_month)
features_names.append("day_of_month")

print(day_of_month.take(5))

#### 4) Weekday

This feature indicates on which day of the week (0 to 6, starting on Monday) the session happened. We believe this may have an influence, because there is more time to browse on the weekend than on weekdays, when people are probably working.

In [None]:
def get_weekday_feature():
    # datetime.weekday() returns 0 for Monday up to 6 for Sunday
    return session_date_rdd.mapValues(lambda x: int(x.weekday()))\
                            .reduceByKey(min)

weekday = get_weekday_feature()

features.append(weekday)
features_names.append("weekday")

print(weekday.take(5))
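
Python exposes two weekday numbering conventions that are easy to mix up. The short sketch below, in plain Python with an arbitrarily chosen date, compares them: `datetime.weekday()` counts from Monday = 0, whereas `strftime("%w")` counts from Sunday = 0.

```python
from datetime import datetime

# Saturday 1 January 2022, chosen only for illustration
d = datetime(2022, 1, 1, 10, 30)

monday_based = d.weekday()            # Monday = 0 ... Sunday = 6
sunday_based = int(d.strftime("%w"))  # Sunday = 0 ... Saturday = 6

print(monday_based, sunday_based)
```

With the Monday-based convention, Saturday and Sunday are the values 5 and 6, which is what a weekend test on `(5, 6)` expects.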

#### 5) Weekend

A binary feature that tells us if the session happened on a weekend.
    
    The weekday feature may be too specific, so this coarser feature only separates weekend sessions from the others.


In [None]:
def get_weekend_feature():
    return weekday.mapValues(lambda x: int(x in (5, 6)))

weekend = get_weekend_feature()

features.append(weekend)
features_names.append("weekend")

print(weekend.take(5))

#### 6) Hour

This feature gives us the hour of the session in 24-hour format.

The justification is that there are probably specific items bought late in the night, or other items bought a few hours after the end of workday.


In [None]:
def get_hour_feature():
    return session_date_rdd.mapValues(lambda x: int(x.strftime("%H")))\
                            .reduceByKey(min)

hour = get_hour_feature()

features.append(hour)
features_names.append("hour")

print(hour.take(5))

#### 7) Day period

    It tells us at which period of the day the session began.
    
    Maybe the hour feature is too restrictive ?


In [None]:
def get_day_period_feature():
    return hour.mapValues(get_day)

def get_day(x):
    # 0: morning (6-11), 1: afternoon (12-17), 2: evening (18-21), 3: night (22-5)
    if 6 <= x < 12: return 0
    elif 12 <= x < 18: return 1
    elif 18 <= x < 22: return 2
    else: return 3
    
day_period = get_day_period_feature()

features.append(day_period)
features_names.append("day_period")

print(day_period.take(5))

#### 8) Night

    It tells us if the session began at night.
    
    Maybe the day period feature is still too restrictive ?


In [None]:
def get_night_feature():
    return day_period.mapValues(lambda x: int(x == 3))

night = get_night_feature()

features.append(night)
features_names.append("night")

print(night.take(5))

#### 9) Duration of the session

    It tells us the duration of the session.
    
    ?


In [None]:
def get_duration_feature():
    return session_date_rdd.groupByKey()\
                            .mapValues(get_session_duration)

def get_session_duration(dates):
    dates = list(dates)
    dates.sort()
    return (dates[-1] - dates[0]).total_seconds() if len(dates) >= 2 else 0.0

duration = get_duration_feature()

features.append(duration)
features_names.append("duration")

print(duration.take(5))

#### 10) Average time between consecutive item views

    It tells us the average time between consecutive item views in the session.
    
    ?


In [None]:
def get_average_time_feature():
    return session_date_rdd.groupByKey()\
                            .mapValues(get_average_time)

def get_average_time(dates):
    import datetime
    dates = sorted(list(dates))
    avgs = [dates[i+1] - dates[i] for i in range(len(dates)-1)]
    return round((sum(avgs, datetime.timedelta())/len(avgs)).total_seconds()) if len(avgs) > 0 else 0

average_time = get_average_time_feature()

features.append(average_time)
features_names.append("average_time")

print(average_time.take(5))
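
To make the two time-based features concrete, here is a small worked example in plain Python. The session below is made up for illustration; the computation mirrors the logic of `get_session_duration` and `get_average_time` without Spark.

```python
from datetime import datetime, timedelta

# Hypothetical session: three views at 10:00:00, 10:00:30 and 10:02:00,
# deliberately given out of order
views = [
    datetime(2021, 5, 1, 10, 2, 0),
    datetime(2021, 5, 1, 10, 0, 0),
    datetime(2021, 5, 1, 10, 0, 30),
]

dates = sorted(views)

# Duration: time between the first and the last view, in seconds
duration = (dates[-1] - dates[0]).total_seconds() if len(dates) >= 2 else 0.0

# Average gap: mean time between consecutive views, rounded to seconds
gaps = [dates[i + 1] - dates[i] for i in range(len(dates) - 1)]
average_gap = round((sum(gaps, timedelta()) / len(gaps)).total_seconds()) if gaps else 0

print(duration, average_gap)
```

Since the gaps add up to the total duration, the average gap is simply the duration divided by the number of views minus one, so the two features are closely linked.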

### Session related features

Let's create an RDD that will be used to generate the following features.

In [None]:
# Maps every session to its items
session_item_rdd = session_item_date_rdd.mapValues(lambda x: (x[0])).cache()

#### 11) Number of items

    It tells us the number of items seen in the session.
    
    We believe that alone it is useless but by combining it with the following features, it could become useful.

In [None]:
def get_length_feature():
    return session_item_rdd.groupByKey().mapValues(len)

length = get_length_feature()

features.append(length)
features_names.append("length")

print(length.take(5))

#### 12) Number of distinct items

    It tells us the number of distinct items viewed in the session.
    
    We believe that this feature could add a dimension to the feature "number of items".

In [None]:
def get_distinct_nb_feature():
    return session_item_rdd.groupByKey().mapValues(lambda x: len(set(x)))

distinct_nb = get_distinct_nb_feature()

features.append(distinct_nb)
features_names.append("distinct_nb")

print(distinct_nb.take(5))

### Items related features

#### 13) Most viewed item

    It tells us the most seen item in the session.
    
    The most seen item should be highly related to the purchased item.

In [None]:
def get_most_viewed_item_feature():
    return session_item_rdd.groupByKey().mapValues(get_most_viewed_item)

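def get_most_viewed_item(items):
    # Sort the items so that identical ids are next to each other
    items = list(items)
    items.sort()
    most_viewed = (None, -1)
    
    last_viewed = items[0]
    cnt = 0
    
    for item in items:
        if last_viewed != item:
            # End of a run of identical ids: keep it if it is the longest so far
            if cnt > most_viewed[1]:
                most_viewed = (last_viewed, cnt)
            # Always start counting the new item
            cnt = 1
            last_viewed = item
        else:
            cnt += 1
            
    # Do not forget the last run
    if cnt > most_viewed[1]:
        most_viewed = (last_viewed, cnt)
    
    return most_viewed[0]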
    
most_viewed_item = get_most_viewed_item_feature()

features.append(most_viewed_item)
features_names.append("most_viewed_item")

print(most_viewed_item.take(5))

#### 14) Longest viewed item

    It tells us the item that was seen the longest in the session.
    
    The longest viewed item should be highly related to the purchased item as it tells us that the user was most likely interested in it.

In [None]:
def get_longest_item_feature():
    return train_sessions.rdd.map(lambda x: (x["session_id"], (x["item_id"], x["date"]))).\
                    groupByKey().\
                    mapValues(get_longest_item)

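def get_longest_item(items):
    # Sort the views chronologically: groupByKey does not guarantee any order
    items = sorted(items, key=lambda i: i[1])
    # Viewing time of an item = time elapsed until the next view (in seconds)
    t = [(items[i+1][1] - items[i][1]).total_seconds() for i in range(len(items)-1)]
    return items[int(np.argmax(t))][0] if len(t) > 0 else items[0][0]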

longest_item = get_longest_item_feature()

features.append(longest_item)
features_names.append("longest_item")

print(longest_item.take(5))

#### 15) Last item

    It tells us the last item that was seen in the session.
    
    The last item seen in the session should be closer to the purchased item than the first one.

In [None]:
def get_last_item(items):
    # The last item is the one with the most recent view date
    return max(items, key=lambda i: i[1])[0]

def get_last_item_feature():
    return session_item_date_rdd.groupByKey()\
                                .mapValues(get_last_item)

last_item = get_last_item_feature()

features.append(last_item)
features_names.append("last_item")

print(last_item.take(5))

### Items' categories related features

Let's create two RDDs that will be used to generate the following features.

In [None]:
# Maps every item to its features
item_features_rdd = item_features.rdd.map(lambda x: (x["item_id"], (x["feature_category_id"], x["feature_value_id"])))\
                                .groupByKey()\
                                .mapValues(lambda x: [(a,b) for a, b in x])

# Maps every session to the features of its items
session_item_features_rdd = item_features_rdd.join(train_sessions.rdd.map(lambda x: (x["item_id"], x["session_id"])))\
                                            .map(lambda x: (x[1][1], x[1][0]))\
                                            .groupByKey()

#### 16) Most present number of categories
    It tells us the most frequent number of categories per item among the items seen in the session.
    
    The number of categories of an item should be closely related to its type.


In [None]:
def get_categories_nb_feature():
    return session_item_features_rdd.mapValues(get_categories_nb)

def get_categories_nb(cat):
    lst = [len(c) for c in cat]
    return max(set(lst), key=lst.count)
    
categories_nb = get_categories_nb_feature()

features.append(categories_nb)
features_names.append("categories_nb")

print(categories_nb.take(5))

#### 17 - 18 ) Most present category with its count

    It tells us the most present category in the different items seen in the session and the number of times it appears.
    
    The most present category should be highly related to the purchased item.

We generate for each session the most present category and the number of times it appears.

In [None]:
def get_most_present_category_with_count(categories_i):
    # Flatten the (category, value) pairs of all items and sort them
    categories = [cat for cat_i in categories_i for cat in cat_i]    
    categories.sort()
    
    most_viewed = (None, -1)
    last_viewed = categories[0]
    cnt = 0
    
    for category in categories:
        if last_viewed != category:
            # End of a run of identical pairs: keep it if it is the longest so far
            if cnt > most_viewed[1]:
                most_viewed = (last_viewed, cnt)
            # Always start counting the new pair
            cnt = 1
            last_viewed = category
        else:
            cnt += 1
            
    # Do not forget the last run
    if cnt > most_viewed[1]:
        most_viewed = (last_viewed, cnt)
    
    return most_viewed[0], most_viewed[1]

most_present_category_with_count = session_item_features_rdd.mapValues(get_most_present_category_with_count)

We split the RDD above in two.

Models in PySpark require numeric features. To handle this simply, we encode each (category, value) pair as $category \times 1000 + value$. We chose 1000 because every $value$ is smaller than 1000, which ensures that no two pairs collide.

For example: (category, value) : (66, 109) -> 66 000 + 109 = 66 109
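
The encoding can be checked with a few lines of plain Python. The helper names `encode` and `decode` are hypothetical and only serve this illustration; the multiplier 1000 is the one described above.

```python
BASE = 1000  # multiplier described in the text; values must stay below it

def encode(category, value):
    """Pack a (category, value) pair into a single integer."""
    if not 0 <= value < BASE:
        raise ValueError("value must be in [0, BASE) to avoid collisions")
    return category * BASE + value

def decode(code):
    """Recover the (category, value) pair from its encoded form."""
    return divmod(code, BASE)

code = encode(66, 109)
print(code, decode(code))
```

Because `decode` recovers the original pair, two different pairs can never share the same code as long as every value is below 1000.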

In [None]:
def get_most_present_category_feature():
    return most_present_category_with_count.mapValues(lambda x: x[0][0]*1000 + x[0][1])

most_present_category = get_most_present_category_feature()

features.append(most_present_category)
features_names.append("most_present_category")

print(most_present_category.take(5))

In [None]:
def get_most_present_category_count_feature():
    return most_present_category_with_count.mapValues(lambda x: x[1])


most_present_category_count = get_most_present_category_count_feature()

features.append(most_present_category_count)
features_names.append("most_present_category_count")

print(most_present_category_count.take(5))

12) Number of repetitive items

    It tells us ?
    
    ?

In [None]:
"""def get_repetitive_nb_feature():
    return session_item_rdd.groupByKey().mapValues(lambda x: len(list(x)) - len(set(x)))

repetitive_nb = get_repetitive_nb_feature()
print(repetitive_nb.take(5))"""

13) Same category


    It tells us the number of categories which appear at least twice in the items of the session.
    
    ?

In [None]:
"""def get_same_category_feature():
    return session_item_features_rdd.mapValues(get_same_category)

def get_same_category(x):
    dico = dict()
    for item in x:
        for cat in item:
            if cat in dico:
                dico[cat] += 1
            else:
                dico[cat] = 0

    res = 0
    for val in dico.values():
        if val > 0:
            res += 1

    return res

same_category = get_same_category_feature()
print(same_category.take(5))"""

14) Different category

    It tells us ?
    
    ?

In [None]:
"""def get_diff_category():
    return session_item_features_rdd.mapValues(get_different_category)

def get_different_category(x):
    dico = dict()
    for item in x:
        for cat in item:
            if cat in dico:
                dico[cat] += 1
            else:
                dico[cat] = 0

    res = 0
    for val in dico.values():
        if val == 0:
            res += 1

    return res

diff_category = get_diff_category()
print(diff_category.take(5))"""

## Joining features

All of these engineered features will be put in the same RDD with the id of the session as the key and the different features as the value.

This will be done through consecutive join's on the key.
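
The effect of the consecutive joins can be sketched in plain Python, with dictionaries standing in for the RDDs. All values below are made up, and `inner_join` is a hypothetical helper that imitates the inner-join behaviour of `RDD.join`.

```python
# Toy stand-ins for the RDDs: session_id -> value
purchases = {1: 500, 2: 501, 3: 502}
toy_features = [
    {1: 7, 2: 12, 3: 1},  # e.g. month
    {1: 2, 2: 0, 3: 0},   # e.g. season
    {1: 4, 2: 9},         # a feature that is missing for session 3
]

def inner_join(left, right):
    """Keep only the keys present on both sides, like an inner join."""
    return {k: (left[k], right[k]) for k in left if k in right}

# First join gives (label, feature); turn it into a list we can extend
joined = {k: list(v) for k, v in inner_join(purchases, toy_features[0]).items()}

# Each further join appends one feature to the list
for feature in toy_features[1:]:
    joined = {k: v[0] + [v[1]] for k, v in inner_join(joined, feature).items()}

print(joined)
```

Note that session 3 disappears from the result: with inner joins, a session is kept only if every feature is available for it.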

In [None]:
purchase_item_rdd = train_purchases.rdd.map(lambda x: (x["session_id"], x["item_id"])).cache()

def get_full_rdd():
    temp = purchase_item_rdd.join(features[0])
    for i in range(1, len(features)):
        feature = features[i]
        temp = temp.join(feature).mapValues(lambda x: list(x[0])+[x[1]])
    return temp

BIG_RDD = get_full_rdd().cache()

In [None]:
print(BIG_RDD.take(1))


# Part 2 : Feature selection


## Ranking algorithm

According to G. Bontempi's handbook "Statistical foundations of machine learning", the ranking methods follow these steps:
    
    1) Calculate for each feature its relevance with the output variable using a univariate measure.
    
    2) Sort their relevances in descending order.
    
    3) Select the top k features.

As we have a lot of categorical features as well as a categorical output, we chose to use mutual information as a univariate measure.

$$ I(X;Y) = H(Y) - H(Y|X)$$

https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg/1200px-Entropy-mutual-information-relative-entropy-relation-diagram.svg.png

First let's compute $H(Y)$ in map reduce.

$$\displaystyle \mathrm {H} (Y)=-\sum _{y\in {\mathcal {Y}}}p(y)\log _{2}p(y)$$
Based on https://en.wikipedia.org/wiki/Entropy_(information_theory)

First for each session we produce one element with its label as the key and $1$ as the value.

Then reduce by key by adding the values, this gets us the count for each label.

After that we map to get $p(y)$ by dividing the count by the total.

Finally, we map each probability to $-p(y)\log_2 p(y)$ and sum the results with a reduce.
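
As a sanity check of these steps, the same computation can be done on a tiny made-up list of labels in plain Python, with `Counter` playing the role of the map and reduce-by-key stages.

```python
from collections import Counter
from math import log2

labels = ["a", "a", "b", "c"]  # made-up purchased items, one per session

counts = Counter(labels)                              # label -> count
total = sum(counts.values())
probabilities = [c / total for c in counts.values()]  # p(y) for each label
entropy = sum(-p * log2(p) for p in probabilities)    # H(Y) in bits

print(entropy)
```

Here the probabilities are 0.5, 0.25 and 0.25, which gives an entropy of 1.5 bits.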

In [None]:
# Count the sessions actually present in BIG_RDD so that the probabilities sum to 1
total_count = BIG_RDD.count()

total_count_broadcast = spark.sparkContext.broadcast(total_count)

H_y = BIG_RDD.map(lambda x: (x[1][0], 1))\
            .reduceByKey(np.add)\
            .map(lambda x: x[1]/total_count_broadcast.value)\
            .map(lambda x: -x * np.log2(x))\
            .reduce(np.add)

Here is the value of $H(Y)$.

In [None]:
H_y

Now let's calculate $H(Y|X)$ of each feature in map reduce.

$$\displaystyle \mathrm {H} (Y|X)=-\sum _{(y,x)\in {\mathcal {Y}}\times {\mathcal {X}}}p_{Y,X}(y,x)\log _{2}{\frac {p_{Y,X}(y,x)}{p_{X}(x)}}$$
Based on https://en.wikipedia.org/wiki/Entropy_(information_theory)

RDD containing count of each (feature_index, feature_value, label)

In [None]:
# (feature_index, feature_value, label) : count
count_yx = BIG_RDD.flatMap(lambda x: [((i, x[1][i], x[1][0]), 1) for i in range(1, len(x[1]))])\
                    .reduceByKey(np.add)

RDD containing p of each (feature_index, feature_value, label)

In [None]:
# (feature_index, feature_value, label) : p
p_yx = count_yx.mapValues(lambda x: x / total_count_broadcast.value)

RDD containing count of each (feature_index, feature_value)

In [None]:
# (feature_index, feature_value) : count
count_x = count_yx.map(lambda x: ((x[0][0], x[0][1]), x[1]))\
                    .reduceByKey(np.add)

RDD containing p of each (feature_index, feature_value)

In [None]:
# (feature_index, feature_value) : p
p_x = count_x.mapValues(lambda x: x / total_count_broadcast.value)

RDD containing $H(Y|X)$ for each feature index

In [None]:
def get_h_yx_term(x):
    # x = (p(y, x), p(x)) for one (feature_value, label) combination
    p_yx = x[0]
    p_x = x[1]
    h_yx_term = - p_yx * np.log2(p_yx/p_x)
    
    return h_yx_term

H_yx = p_yx.map(lambda x: ((x[0][0], x[0][1]), x[1]))\
            .join(p_x)\
            .map(lambda x: (x[0][0], get_h_yx_term(x[1])))\
            .reduceByKey(np.add)

In [None]:
H_yx.take(20)

After that we can compute $I(X;Y)$ for each feature.

In [None]:
H_y_broadcast = spark.sparkContext.broadcast(H_y)

features_mi = H_yx.mapValues(lambda x: H_y_broadcast.value - x).collect()

Now let's sort and print the result.

In [None]:
features_mi.sort(key=lambda x: x[1], reverse=True)

print("Features ranking:")

for i in features_mi:
    print(f"\t- {features_names[i[0]-1]} ({i[1]})")
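
The whole ranking procedure can be verified on a toy dataset in plain Python. The rows below are made up: the first feature fully determines the label, while the second is independent of it, so we expect a mutual information of 1 bit and 0 bits respectively. The helper names are hypothetical.

```python
from collections import Counter
from math import log2

# Made-up rows: (label, feature_0, feature_1)
rows = [
    ("a", 0, 0),
    ("a", 0, 1),
    ("b", 1, 0),
    ("b", 1, 1),
]
n = len(rows)

def entropy(values):
    """H(Y) computed from a list of labels."""
    counts = Counter(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

def conditional_entropy(pairs):
    """H(Y|X) computed from a list of (x, y) pairs."""
    joint = Counter(pairs)
    marginal = Counter(x for x, _ in pairs)
    # p(y, x) / p(x) simplifies to count(y, x) / count(x)
    return -sum(c / n * log2(c / marginal[x]) for (x, _), c in joint.items())

h_y = entropy([r[0] for r in rows])

# Feature index -> I(X;Y) = H(Y) - H(Y|X), with index 0 reserved for the label
mutual_information = {
    i: h_y - conditional_entropy([(r[i], r[0]) for r in rows])
    for i in (1, 2)
}
ranking = sorted(mutual_information, key=mutual_information.get, reverse=True)

print(mutual_information, ranking)
```

Selecting the top k features then amounts to taking the first k entries of `ranking`.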

## Forward feature selection algorithm

According to G. Bontempi's handbook "Statistical foundations of machine learning", the forward feature selection algorithms are part of wrapping search strategies and work as follows:
    
    First, we start with no selected features.
    
    Then, we select the feature which returns the lowest generalisation error.
    
    After that, we select the feature which, with the previously selected features, returns the lowest generalisation error.
    
    The last step is repeated until there is no improvement or the maximum number of features is reached.
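
The steps above can be sketched independently of Spark. In this minimal sketch, `score` is a hypothetical callable that returns a quality measure for a list of feature indices, where higher is better (for example a validation accuracy, which is equivalent to minimising the error). The toy scoring function is made up purely to exercise the loop.

```python
def forward_selection(n_features, score, max_features=None):
    """Greedy forward selection over feature indices 0..n_features-1."""
    selected, remaining = [], list(range(n_features))
    best_score = float("-inf")
    limit = n_features if max_features is None else max_features

    while remaining and len(selected) < limit:
        # Evaluate each remaining feature together with those already chosen
        candidate_score, candidate = max(
            (score(selected + [f]), f) for f in remaining
        )
        if candidate_score <= best_score:
            break  # no improvement: stop
        best_score = candidate_score
        selected.append(candidate)
        remaining.remove(candidate)
    return selected, best_score


# Toy scoring function (made up): features 2 and 0 help,
# any other feature slightly hurts
def toy_score(indices):
    gains = {0: 0.2, 2: 0.5}
    return sum(gains.get(i, -0.05) for i in indices)

selected, best = forward_selection(4, toy_score)
print(selected, best)
```

On this toy problem the loop picks feature 2 first, then feature 0, and stops because adding any other feature lowers the score.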

Let's prepare the RDD to train the different models

In [None]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

all_features_rdd = BIG_RDD.map(lambda x: (x[1][0], x[1][1:]))

In [None]:
nb_classes = candidate_items.rdd.map(lambda x: x["item_id"]).reduce(max)+1
selected_features_indices = []
remaining_features_indices = list(range(len(features)))
improvement = True
last_accuracy = -1

while (len(selected_features_indices) < len(features)) and (improvement):
    current_accuracy = []
    selected_features_indices_broadcast = spark.sparkContext.broadcast(selected_features_indices)
    print("\nTEST")
    for current_feature_index in remaining_features_indices:
        print("-", end="")
        current_feature_index_broadcast = spark.sparkContext.broadcast(current_feature_index)
        selected_features_rdd = all_features_rdd.map(lambda x: LabeledPoint(x[0], [x[1][i] for i in selected_features_indices_broadcast.value] + [x[1][current_feature_index_broadcast.value]]))
        
        training, test = selected_features_rdd.randomSplit([0.6, 0.4], seed = 515)
        
        model = DecisionTree.trainClassifier(training, nb_classes, {}, maxDepth=2, maxBins=8)
        
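        # model.predict cannot be called inside an RDD transformation, so we
        # predict on the whole RDD of features and zip the result with the labels
        predictions = model.predict(test.map(lambda x: x.features))
        predictionAndLabel = test.map(lambda x: x.label).zip(predictions)
        accuracy = 1.0 * predictionAndLabel.filter(lambda x: x[0] == x[1]).count() / test.count()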

        current_accuracy.append((accuracy, current_feature_index))
        
    best_feature = max(current_accuracy, key=lambda x: x[0])
    
    if best_feature[0] > last_accuracy:
        last_accuracy = best_feature[0]
        best_feature_index = best_feature[1]
        selected_features_indices.append(best_feature_index)
        remaining_features_indices.remove(best_feature_index)

    else:
        improvement = False

In [None]:
selected_features_names = [features_names[i] for i in selected_features_indices]
print("The best features are: ", end="")
print(", ".join(selected_features_names))

In [None]:
selected_features_names = features_names # TODO: use only the selected features

# Part 3 : Model

The model should be trained on the data with the selected features and should return the predictions required by the competition: for each test session, the 100 candidate items with the highest chance of being purchased. This restricts the models that can be used, which is why we selected the decision tree model.
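
Returning 100 candidates per session requires a model that outputs a score for every class, not just a single prediction. As a minimal sketch of that last step, assuming we already have a vector of class probabilities and a list mapping each class index back to an item id (both are hypothetical inputs here), the top items can be extracted as follows.

```python
import heapq

def top_k_items(probabilities, index_to_item, k=100):
    """Return the k item ids with the highest predicted probability.

    probabilities[i] is the predicted probability of class i and
    index_to_item[i] is the item id behind class i (hypothetical inputs).
    """
    best = heapq.nlargest(k, range(len(probabilities)), key=probabilities.__getitem__)
    return [index_to_item[i] for i in best]

# Made-up example with 5 classes and k = 3
probs = [0.05, 0.40, 0.10, 0.30, 0.15]
items = [11, 22, 33, 44, 55]
top3 = top_k_items(probs, items, k=3)
print(top3)
```

The items come back ordered from the most to the least probable, which matches the ranked list expected by the challenge.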

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

Let's first transform our rdd which contains all the data into a dataframe which the classifier can use.

In [None]:
def collector(x):
    dico = {}
    dico["session_id"] = x[0]
    dico["item_id"] = x[1][0]
    
    for i in range(len(features_names)):
        dico[features_names[i]] = x[1][i+1]
        
    return dico
        

In [None]:
from pyspark.sql.types import Row
df = BIG_RDD.map(lambda x: Row(**collector(x))).toDF()
df = df.fillna(0)

# si = StringIndexer(inputCol="item_id", outputCol="label").fit(item_features.select("item_id").distinct())
si = StringIndexer(inputCol="item_id", outputCol="label").fit(candidate_items)
indexed_items_df = si.transform(candidate_items)
df = si.transform(df)

df = VectorAssembler(inputCols = features_names, outputCol = "features").transform(df)

df = df.select(["features", "label"])

Now we create our model.

In [None]:
decisionTree = DecisionTreeClassifier(maxDepth=2)

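Finally, we can estimate the accuracy of our model with $K$-fold cross-validation. With $K$ = 4, each fold trains on 75 % of the dataset and validates on the remaining 25 %. Note that the cell below sets `numFolds=2`, which corresponds to a 50 % / 50 % split; set `numFolds=4` to obtain the 75 % / 25 % split.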
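
The relation between the number of folds and the split sizes can be checked with a short plain-Python sketch. It uses contiguous folds for readability, whereas Spark's `CrossValidator` assigns rows to folds randomly; `kfold_indices` is a hypothetical helper, not a Spark function:

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size


for k in (2, 4):
    train, validation = next(kfold_indices(100, k))
    print(k, len(train) / 100, len(validation) / 100)
```

In general each fold trains on a fraction $(K-1)/K$ of the data and validates on $1/K$.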

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

grid = ParamGridBuilder().addGrid(decisionTree.featuresCol, ["features"]).build()
cv = CrossValidator(estimator=decisionTree, estimatorParamMaps=grid, numFolds=2, evaluator=evaluator, parallelism=4)
cvModel = cv.fit(df)

print("Average accuracy:", cvModel.avgMetrics[0])

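We can see that the accuracy of our model is quite low. This can be explained by the large number of distinct items in the dataset: the classifier has to pick the single purchased item among a very large number of possible labels. Consistently, the RecSys Challenge 2022 does not ask for a single prediction per session but for the 100 items with the highest probability.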
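
Strict accuracy only rewards the top-ranked item. A more lenient measure, sketched below in plain Python, counts a session as a success when the purchased item appears anywhere in the first `k` recommendations. This is an illustrative metric with made-up data, not the official challenge score, and `top_k_hit_rate` is a hypothetical helper:

```python
def top_k_hit_rate(ranked_lists, purchased, k=100):
    """Fraction of sessions whose purchased item is among the first k recommendations."""
    hits = sum(1 for ranked, item in zip(ranked_lists, purchased) if item in ranked[:k])
    return hits / len(purchased)


# Toy data: three sessions, each with a ranked list of three items
ranked = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
bought = [1, 1, 3]

print(top_k_hit_rate(ranked, bought, k=1))  # equivalent to strict accuracy
print(top_k_hit_rate(ranked, bought, k=2))
print(top_k_hit_rate(ranked, bought, k=3))
```

As `k` grows, the hit rate can only increase, which is why a model with low strict accuracy may still produce useful ranked lists.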

In [None]:
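# Fit a tree with default hyperparameters and score it on the data it was trained on.
# This measures training accuracy, not generalisation to unseen sessions.
decisionTree = DecisionTreeClassifier()
model = decisionTree.fit(df)

pred = model.transform(df)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
nbaccuracy = evaluator.evaluate(pred)
print(f"Training accuracy = {nbaccuracy}")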

Now we can get the predictions for the RecSys Challenge 2022's leaderboard.

First we load the specific dataset.

In [None]:
leaderboard_sessions = spark.read.csv("dressipi_recsys2022/test_leaderboard_sessions.csv", header=True)
leaderboard_sessions = leaderboard_sessions.withColumn("session_id", leaderboard_sessions["session_id"].cast("int"))
leaderboard_sessions = leaderboard_sessions.withColumn("item_id", leaderboard_sessions["item_id"].cast("int"))
leaderboard_sessions = leaderboard_sessions.withColumn("date", leaderboard_sessions["date"].cast("timestamp"))

Then we compute the values of all the different features.
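
All per-feature RDDs are keyed by `session_id` and are merged by successive joins. A minimal sketch of that accumulation pattern, using plain dictionaries instead of RDDs and hypothetical feature values, could look like this (`inner_join` is a helper written for the illustration):

```python
def inner_join(left, right):
    """Mimic an RDD inner join on dictionaries: keep only keys present on both sides."""
    return {key: (left[key], right[key]) for key in left if key in right}


# Hypothetical per-session features keyed by session_id
month = {1: 6, 2: 12}
hour = {1: 18, 2: 9}
length = {1: 5, 2: 3, 3: 7}  # session 3 has no month or hour value

features = [month, hour, length]

# The first join yields (month, hour) pairs
merged = inner_join(features[0], features[1])
for feature in features[2:]:
    # Each further join yields (accumulated_values, new_value), flattened into one list
    joined = inner_join(merged, feature)
    merged = {key: list(acc) + [new] for key, (acc, new) in joined.items()}

print(merged)
```

Because the join is an inner join, a session that is missing from any single feature (session 3 here) disappears from the merged result.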

In [None]:
# Key every view by session: (session_id, (item_id, date))
session_item_date_rdd = leaderboard_sessions.rdd.map(lambda x: (x["session_id"], (x["item_id"], x["date"]))).cache()
session_item_rdd = session_item_date_rdd.mapValues(lambda x: x[0]).cache()
session_date_rdd = session_item_date_rdd.mapValues(lambda x: x[1]).cache()

# (item_id, [(feature_category_id, feature_value_id), ...])
item_features_rdd = item_features.rdd.map(lambda x: (x["item_id"], (x["feature_category_id"], x["feature_value_id"])))\
                                .groupByKey()\
                                .mapValues(lambda x: [(a, b) for a, b in x])

# (session_id, feature lists of the items viewed in that session)
# Built once, after item_features_rdd is defined
session_item_features_rdd = item_features_rdd.join(leaderboard_sessions.rdd.map(lambda x: (x["item_id"], x["session_id"])))\
                                            .map(lambda x: (x[1][1], x[1][0]))\
                                            .groupByKey()


month = get_month_feature()
season = get_season_feature()
day_of_month = get_day_of_month_feature()
weekday = get_weekday_feature()
weekend = get_weekend_feature()
hour = get_hour_feature()
day_period = get_day_period_feature()
night = get_night_feature()
duration = get_duration_feature()
average_time = get_average_time_feature()


length = get_length_feature()
distinct_nb = get_distinct_nb_feature()


most_viewed_item = get_most_viewed_item_feature()
longest_item = get_longest_item_feature()
last_item = get_last_item_feature()


categories_nb = get_categories_nb_feature()

most_present_category_with_count = session_item_features_rdd.mapValues(get_most_present_category_with_count)

most_present_category = get_most_present_category_feature()
most_present_category_count = get_most_present_category_count_feature()



#repetitive_nb = get_repetitive_nb_feature()
#same_category = get_same_category_feature()
#diff_category = get_diff_category()

#features = [month, season, day_of_month, weekday, weekend, hour, day_period, night, duration, average_time, distinct_nb, repetitive_nb, same_category, diff_category, last_item, most_viewed_item, length, longest_item, categories_nb]
features = [month, season, day_of_month, weekday, weekend, hour, day_period, night, duration, average_time, length, distinct_nb, most_viewed_item, longest_item, last_item, categories_nb, most_present_category_with_count, most_present_category, most_present_category_count]

test_rdd = features[0].join(features[1])

for i in range(2, len(features)):
    feature = features[i]
    test_rdd = test_rdd.join(feature).mapValues(lambda x: list(x[0])+[x[1]])


In [None]:
test_rdd.take(1)

Next we transform our RDD to a dataframe which can be used by our model.

In [None]:
def test_collector(x):
    dico = {}
    dico["session_id"] = x[0]
    for i in range(len(features_name)):
        dico[features_name[i]] = x[1][i]
    return dico


test_df = test_rdd.map(lambda x: Row(**test_collector(x))).toDF()
test_df = test_df.fillna(0)
test_df = VectorAssembler(inputCols = features_name, outputCol = "features").transform(test_df)
test_df = test_df.select(["features", "session_id"])

In [None]:
test_df.show(5)

We create a new decision tree model that will be trained on the whole dataset.

In [None]:
decisionTree = DecisionTreeClassifier(maxDepth=2)
model = decisionTree.fit(df)

We get the list of items which should be predicted for the challenge.

In [None]:
possible_items = candidate_items.rdd.map(lambda x: x["item_id"]).collect()

Finally, we gather all the results we computed and save them to a CSV file in the format required by the RecSys Challenge 2022.
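
The ranking step can be illustrated on toy data. The sketch below assumes a probability vector indexed by label and a mapping from `item_id` to label index; it writes rows with the header `session_id,item_id,rank` to an in-memory buffer. The helper `top_k_rows` and all values are hypothetical:

```python
import csv
import io


def top_k_rows(session_id, probabilities, item_to_index, candidates, k=100):
    """Rank candidate items by predicted probability and return (session_id, item_id, rank) rows."""
    scored = sorted(((probabilities[item_to_index[item]], item) for item in candidates), reverse=True)
    return [(session_id, item, rank) for rank, (_, item) in enumerate(scored[:k], start=1)]


# Toy values: three candidate items and one session
item_to_index = {10: 0, 20: 1, 30: 2}
probabilities = [0.2, 0.5, 0.3]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["session_id", "item_id", "rank"])
writer.writerows(top_k_rows(7, probabilities, item_to_index, [10, 20, 30], k=2))
print(buffer.getvalue())
```

Slicing with `scored[:k]` keeps the code safe when fewer than `k` candidates are available.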

In [None]:
pred = model.transform(test_df)

In [None]:
pred.show(1)

In [None]:
from ipywidgets import IntProgress
from IPython.display import display

In [None]:
item_to_index = dict()

df_iterator = indexed_items_df.rdd.toLocalIterator()

for row in df_iterator:
    item_to_index[row["item_id"]] = int(row["label"])

In [None]:
# possible_items_indices = [item_to_index[item] for item in possible_items]

In [None]:
progressbar = IntProgress(min=0, max=pred.count())
display(progressbar)

with open("predictions3.csv", "w") as file:
    file.write("session_id,item_id,rank\n")
    for row in pred.rdd.toLocalIterator():
        probabilities = list(row["probability"])

        # Pair each candidate item with its predicted probability, most likely first
        ranked = sorted(((probabilities[item_to_index[i]], i) for i in possible_items), reverse=True)

        # Write the 100 most likely candidates with ranks 1 to 100
        for rank, (_, item_id) in enumerate(ranked[:100], start=1):
            file.write(f"{row['session_id']},{item_id},{rank}\n")

        progressbar.value += 1

In [None]:
spark.stop()