# <center>INFO-H515 - Distributed Data Management and Scalable Analytics</center>

## <center>Project 2021-2022</center>

#### <center>Bakkali Yahya (000445166)</center>
#### <center>Hauwaert Maxime (000461714)</center>
#### <center>Marotte Nathan (000459274)</center>

### <center>Video URL : https://drive.google.com/file/d/1XA5euCfwmq0Yp1CqiVUHU_7ZTO-Mpx-u/view?usp=sharing</center>

# INTRODUCTION


In this notebook we will be using the following packages:
  * [PySpark](https://spark.apache.org/docs/latest/api/python/pyspark.html)
  * [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)
  * [Numpy](https://numpy.org/doc/stable/)

## Specifications of the problem
We are tasked to construct a recommender system as part of the recsyschallenge of 2022, organised by Dressipi, a company focused on providing product and outfit recommendations to leading global retailers.

## Dataset
The full dataset consists of 1.1 million online retail sessions in the fashion domain, sampled from a 18-month period.

It is split in 4 csv files:
   * candidate_items.csv : contains the candidate items for the recommender system. This means our model will provide those item_id as output of the prediction.
   * item_features.csv : contains the features (for example material, type, colour, etc ... ) for each item, as well as the value of this feature (cotton, skinny, blue, etc ...)
   * train_purchases.csv : contains the purchases at the end of each session.
   * train_sessions.csv : contains the purchasing sessions, each session is a stream of items viewed, and when an item is purchased, the session ends.

Using those 4 files, we will first develop a pipeline to trim it, then engineer the features to regroup different features into one dataset so that it can be used for the recommender system.


### Creating the Spark session

In [1]:
import os
from pyspark.sql import SparkSession
import numpy as np

os.environ['PYSPARK_SUBMIT_ARGS'] ="--conf spark.driver.memory=16g pyspark-shell"

os.environ['HADOOP_CONF_DIR']="/etc/hadoop/conf"
os.environ['PYSPARK_PYTHON']="/usr/local/anaconda3/bin/python3"
os.environ['PYSPARK_DRIVER_PYTHON']="/usr/local/anaconda3/bin/python3"

spark = SparkSession \
   .builder \
   .master("yarn") \
   .config("spark.executor.instances","5") \
   .appName("group03") \
   .getOrCreate()

### Loading the dataset

In [2]:
candidate_items, item_features, train_purchases, train_sessions = [None] * 4
datasets = [candidate_items, item_features, train_purchases, train_sessions]

#load all datasets
def load_candidate_items():
    global candidate_items
    candidate_items = spark.read.csv("dressipi_recsys2022/candidate_items.csv", header=True)
    candidate_items = candidate_items.withColumn("item_id", candidate_items["item_id"].cast("int"))

def load_item_features():
    global item_features
    item_features = spark.read.csv("dressipi_recsys2022/item_features.csv", header=True)
    item_features = item_features.withColumn("item_id", item_features["item_id"].cast("int"))
    item_features = item_features.withColumn("feature_category_id", item_features["feature_category_id"].cast("int"))
    item_features = item_features.withColumn("feature_value_id", item_features["feature_value_id"].cast("int"))

def load_train_purchases():
    global train_purchases
    train_purchases = spark.read.csv("dressipi_recsys2022/train_purchases.csv", header=True)
    train_purchases = train_purchases.withColumn("session_id", train_purchases["session_id"].cast("int"))
    train_purchases = train_purchases.withColumn("item_id", train_purchases["item_id"].cast("int"))
    train_purchases = train_purchases.withColumn("date", train_purchases["date"].cast("timestamp"))

def load_train_sessions():
    global train_sessions
    train_sessions = spark.read.csv("dressipi_recsys2022/train_sessions.csv", header=True)
    train_sessions = train_sessions.withColumn("session_id", train_sessions["session_id"].cast("int"))
    train_sessions = train_sessions.withColumn("item_id", train_sessions["item_id"].cast("int"))
    train_sessions = train_sessions.withColumn("date", train_sessions["date"].cast("timestamp"))

def load_datasets():
    global datasets
    load_candidate_items()
    load_item_features()
    load_train_purchases()
    load_train_sessions()
    datasets = [candidate_items, item_features, train_purchases, train_sessions]

load_datasets()

### Quick look at the data

#### train_sessions.csv
This dataset represents the browsing session of a user in the store. It is made of 3 columns:
- session_id : the id of the session. It serves as a key to join the data with the other datasets
- item_id : the item viewed during the session.
- date : the date of at wich the item was viewed.

Please note that the browsing sessions end at the end of the day, or if an item was purchased. This means we will not find, for one session, items viewed on 2 different days.
Also, as stated on the challenge's website, there are no sessions that do not end with a purchased item in this dataset.

Here is a representation of a few of the rows in the train_sessions.csv dataset:

In [3]:
train_sessions.show(5)
print(f"There are {train_sessions.count()} rows in total.")

+----------+-------+--------------------+
|session_id|item_id|                date|
+----------+-------+--------------------+
|         3|   9655|2020-12-18 21:25:...|
|         3|   9655|2020-12-18 21:19:...|
|        13|  15654|2020-03-13 19:35:...|
|        18|  18316|2020-08-26 19:18:...|
|        18|   2507|2020-08-26 19:16:...|
+----------+-------+--------------------+
only showing top 5 rows

There are 4743820 rows in total.


#### train_purchases.csv

This dataset represents the purchases made by a user in the store. It is made of 3 columns:
- session_id : the id of the session.
- item_id : the item purchased.
- date : the date of at wich the item was purchased.

This dataset should be used with the train_sessions dataset to have a more complete picture of the browsing experience of the user. Each session in this dataset ends with an item purchased, noted in the column item_id.

Here is a representation of a few of the rows in the train_purchases.csv dataset:

In [4]:
train_purchases.show(5)
print(f"There are {train_purchases.count()} rows in total.")

+----------+-------+--------------------+
|session_id|item_id|                date|
+----------+-------+--------------------+
|         3|  15085|2020-12-18 21:26:...|
|        13|  18626|2020-03-13 19:36:...|
|        18|  24911|2020-08-26 19:20:...|
|        19|  12534|2020-11-02 17:16:...|
|        24|  13226|2020-02-26 18:27:...|
+----------+-------+--------------------+
only showing top 5 rows

There are 1000000 rows in total.


#### item_features.csv

This dataset helps us notice pattern in different objects. It links the item_id with their feature (such as the color of the item, the material, the type, etc ...).
There are 3 columns :
- item_id : the id of the item in the store.
- feature_id : the id of the feature attached to the item.
- value : the value of the feature for that item. (for example if the feature_category_id is the color, the feature_value_id could be a representation of "red", "blue", "green", etc ...)

Here is a representation of a few of the rows in the item_features.csv dataset:

In [5]:
item_features.show(5)
print(f"There are {item_features.count()} rows in total.")

+-------+-------------------+----------------+
|item_id|feature_category_id|feature_value_id|
+-------+-------------------+----------------+
|      2|                 56|             365|
|      2|                 62|             801|
|      2|                 68|             351|
|      2|                 33|             802|
|      2|                 72|              75|
+-------+-------------------+----------------+
only showing top 5 rows

There are 471751 rows in total.


#### candidate_items.csv

This dataset contains all the item_id that are candidate to the recommandation system. It consists of a column of 4991 item_id.

Here is a representation of a few of the rows in the candidate_items.csv dataset:

In [6]:
candidate_items.show(5)
print(f"There are {candidate_items.count()} rows in total.")

+-------+
|item_id|
+-------+
|      4|
|      8|
|      9|
|     19|
|     20|
+-------+
only showing top 5 rows

There are 4990 rows in total.


# Part 1 : Pipeline

## Data exploration

We will first check if there are missing values, NA, or Null values in the downloaded dataset.

In [7]:
print("Nb of missing values in candidate_items:", candidate_items.filter(candidate_items["item_id"].isNull()).count())
print("Nb of missing values in item_features:", item_features.filter(item_features["item_id"].isNull()).count())
print("Nb of missing values in train_purchases:", train_purchases.filter(train_purchases["item_id"].isNull()).count())
print("Nb of missing values in train_sessions:", train_sessions.filter(train_sessions["item_id"].isNull()).count())

Nb of missing values in candidate_items: 0
Nb of missing values in item_features: 0
Nb of missing values in train_purchases: 0
Nb of missing values in train_sessions: 0


Now, we wondered if there are sessions for which the first item viewed was purchased (therefore it will not be in train_sessions but in train_purchases), and we also made sure that there are no sessions for which there were no purchases.

In [8]:
t = [a["session_id"] for a in train_sessions.select("session_id").collect()]
p = [a["session_id"] for a in train_purchases.select("session_id").collect()]

print(f"Nb of session id in train_sessions but not in train_purchases: {len(set(t).difference(set(p)))}")
print(f"Nb of session id in train_purchases but not in train_sessions: {len(set(p).difference(set(t)))}")

Nb of session id in train_sessions but not in train_purchases: 0
Nb of session id in train_purchases but not in train_sessions: 0


Then we wanted to know how much data we were working with so we counted the number of different item_id in each dataset.

In [9]:
a = train_sessions.select("item_id").distinct().collect()
b = train_purchases.select("item_id").distinct().collect()
c = candidate_items.select("item_id").distinct().collect()
d = item_features.select("item_id").distinct().collect()
print(f"Nb of distinct item id in train_sessions: {len(a)}")
print(f"Nb of distinct item id in train_purchases: {len(b)}")
print(f"Nb of distinct item id in candidate_items: {len(c)}")
print(f"Nb of distinct item id in item_features: {len(d)}")

Nb of distinct item id in train_sessions: 23496
Nb of distinct item id in train_purchases: 18907
Nb of distinct item id in candidate_items: 4990
Nb of distinct item id in item_features: 23691


We cannot know if those items are the same or different, but by running the next cell we discover that item_features is complete. It contains all the items present in the other datasets. Therefore, we know that there exists 23691 items.

In [10]:
print(f"Nb of item id in train_sessions but not in item_features: {len(set(a).difference(set(d)))}")
print(f"Nb of item id in train_purchases but not in item_features: {len(set(b).difference(set(d)))}")
print(f"Nb of item id in candidate_items but not in item_features: {len(set(c).difference(set(d)))}")

Nb of item id in train_sessions but not in item_features: 0
Nb of item id in train_purchases but not in item_features: 0
Nb of item id in candidate_items but not in item_features: 0


## Feature engineering

Let's remove sessions which end up with the purchase of a non candidate item.

In [11]:
train_purchases = candidate_items.join(train_purchases, "item_id", "inner")
# ⚠️ Uncomment the following line to undersampling to reduce computing time
# train_purchases = train_purchases.sample(withReplacement=False, fraction=0.01, seed=3)
train_sessions = train_sessions.join(train_purchases.select("session_id"), "session_id", "inner")

Let's transform the dataset from a dataframe into an RDD.

In [12]:
# Maps every session to its items and dates
session_item_date_rdd = train_sessions.rdd.map(lambda x: (x["session_id"], (x["item_id"], x["date"]))).cache()

These lists will contain the RDDs of the features along with their names.

In [13]:
features = []
features_names = []

### Date related features

Let's create an RDD that will be used to generate the following features.

In [14]:
# Maps every session to its dates
session_date_rdd = session_item_date_rdd.mapValues(lambda x: (x[1])).cache()

#### 1) Month

The month of the year (1 to 12) in which the session took place

The reason for this feature is that the pieces of fashion are probably more likely to be bought depending on the month of the purchase. There is probably less swimsuits bought in December than in July, therefore we believe that if two items are bought in the same month, they could be similar.

In [15]:
def get_month_feature():
    return session_date_rdd.mapValues(lambda x: int(x.month))\
                            .reduceByKey(min)
    
month = get_month_feature()

features.append(month)
features_names.append("month")

print(month.take(5))

[(100800, 2), (270200, 9), (324400, 9), (504400, 5), (569200, 3)]


#### 2) Season (Meteorological)

The meteorological season of the year in which the session takes place.
  - Winter : December, January, and February
  - Spring : March, April, and May
  - Summer : June, July, and Augustus
  - Fall   : September, October, and November

We add this feature even though it seems redudant with the months feature, because the month feature might be too restrictive (stuff bought in january or february can also be similar because of the season, especially in clothing).


In [16]:
def get_season_feature():
    return month.mapValues(get_season)\
                .reduceByKey(min)

def get_season(month):
    if month == 12 or month <= 2: return 0
    elif 2 < month <= 5: return 1
    elif 5 < month <= 8: return 2
    elif 8 < month <= 11: return 3

season = get_season_feature()

features.append(season)
features_names.append("season")

print(season.take(5))

[(100800, 0), (270200, 3), (324400, 3), (504400, 1), (569200, 1)]


#### 3) Day of month

We believe that the day of the month from 1 to 28, 29, 30 or 31 also has an influence on the purchases, because salaries are often paid close by the 1st of the month, therefore we might expect more purchase during that time.


In [17]:
def get_day_of_month_feature():
    return session_date_rdd.mapValues(lambda x: int(x.day))\
                            .reduceByKey(min)

day_of_month = get_day_of_month_feature()

features.append(day_of_month)
features_names.append("day_of_month")

print(day_of_month.take(5))

[(100800, 4), (270200, 1), (324400, 1), (504400, 1), (569200, 31)]


#### 4) Weekday

This feature indicates what day of the week (0 to 6, starting on monday) the session happened as we believe this may have an influence, because there is more time to browse on the weekend than on week days where people are probably working.

In [18]:
def get_weekday_feature():
    return session_date_rdd.mapValues(lambda x: int(x.strftime("%w")))\
                            .reduceByKey(min)

weekday = get_weekday_feature()

features.append(weekday)
features_names.append("weekday")

print(weekday.take(5))

[(100800, 4), (270200, 2), (324400, 2), (504400, 6), (569200, 3)]


#### 5) Weekend

A binary feature to tells us if the session happened on a weekend. We are not sure if the weekday feature is too specific, this should allow the model to differentiate the situation better.

In [19]:
def get_weekend_feature():
    return weekday.mapValues(lambda x: int(x in (5, 6)))

weekend = get_weekend_feature()

features.append(weekend)
features_names.append("weekend")

print(weekend.take(5))

[(100800, 0), (270200, 0), (324400, 0), (504400, 1), (569200, 0)]


#### 6) Hour

This feature give us the 24-hour format hour of the session. The justification is that there are probably specific items bought late in the night, or other items bought a few hours after the end of workday.


In [20]:
def get_hour_feature():
    return session_date_rdd.mapValues(lambda x: int(x.strftime("%H")))\
                            .reduceByKey(min)

hour = get_hour_feature()

features.append(hour)
features_names.append("hour")

print(hour.take(5))

[(100800, 18), (270200, 18), (324400, 21), (504400, 15), (569200, 12)]


#### 7) Day period
To avoid the features of being too selective, we devided the day into 4 periods, the morning, the afternoon, the evening, and the night. We believe that items bought in the same period will share similarities, for example when seeing an advertisment when driving from/to work, or when you are browsing the internet late at night.


In [21]:
def get_day_period_feature():
    return hour.mapValues(get_day)

def get_day(x):
    if 6 < x < 12: return 0
    elif 12 < x < 18: return 1
    elif 18 < x < 22: return 2
    else: return 3
    
day_period = get_day_period_feature()

features.append(day_period)
features_names.append("day_period")

print(day_period.take(5))

[(100800, 3), (270200, 3), (324400, 2), (504400, 1), (569200, 3)]


#### 8) Night
A boolean variable to indicate if the session was during the night, or the day to further divide the dataset.


In [22]:
def get_night_feature():
    return day_period.mapValues(lambda x: int(x == 3))

night = get_night_feature()

features.append(night)
features_names.append("night")

print(night.take(5))

[(100800, 1), (270200, 1), (324400, 0), (504400, 0), (569200, 1)]


#### 9) Duration of the session

Short sessions might be straight to the point from seeing an object to buying another one, maybe it can tell us something about the item feature that was so great it was bought so fast/slow.


In [23]:
def get_duration_feature():
    return session_date_rdd.groupByKey()\
                            .mapValues(get_session_duration)

def get_session_duration(dates):
    dates = list(dates)
    dates.sort()
    return (dates[-1] - dates[0]).total_seconds() if len(dates) >= 2 else 0.0

duration = get_duration_feature()

features.append(duration)
features_names.append("duration")

print(duration.take(5))

[(100800, 0.0), (270200, 325.706), (324400, 61.3), (504400, 0.0), (569200, 0.0)]


#### 10) Average time between consecutive item views

We believe that buyers who saw a lot of items in a very short amount of times might belong to the same category of consumers (for example frequent customer, or IT-litterate people) in a way that they might share similar interests in fashion.

In [24]:
def get_average_time_feature():
    return session_date_rdd.groupByKey()\
                            .mapValues(get_average_time)

def get_average_time(dates):
    import datetime
    dates = sorted(list(dates))
    avgs = [dates[i+1] - dates[i] for i in range(len(dates)-1)]
    return round((sum(avgs, datetime.timedelta())/len(avgs)).total_seconds()) if len(avgs) > 0 else 0

average_time = get_average_time_feature()

features.append(average_time)
features_names.append("average_time")

print(average_time.take(5))

[(100800, 0), (270200, 27), (324400, 61), (504400, 0), (569200, 0)]


### Session related features

Let's create an RDD that will be used to generate the following features.

In [25]:
# Maps every session to its items
session_item_rdd = session_item_date_rdd.mapValues(lambda x: (x[0])).cache()

#### 11) Number of items

It tells us the number of items seen in the session. We believe that alone it is useless but by combining it with the following features, it could become useful.

In [26]:
def get_length_feature():
    return session_item_rdd.groupByKey().mapValues(len)

length = get_length_feature()

features.append(length)
features_names.append("length")

print(length.take(5))

[(100800, 1), (270200, 13), (324400, 2), (504400, 1), (569200, 1)]


#### 12) Number of distinct items

It tells us the number of distinct items viewed in the session. We believe that this feature could add a nuance to the feature "number of items".

In [27]:
def get_distinct_nb_feature():
    return session_item_rdd.groupByKey().mapValues(lambda x: len(set(x)))

distinct_nb = get_distinct_nb_feature()

features.append(distinct_nb)
features_names.append("distinct_nb")

print(distinct_nb.take(5))

[(100800, 1), (270200, 12), (324400, 2), (504400, 1), (569200, 1)]


### Items related features

#### 13) Most viewed item

It tells us the item that was seen the most amount of times. We believe that this item has a high probability of being the item bought at the end of the session.

In [28]:
def get_most_viewed_item_feature():
    return session_item_rdd.groupByKey().mapValues(get_most_viewed_item)

def get_most_viewed_item(items):
    items = list(items)
    items.sort()
    most_viewed = (None, -1)
    
    last_viewed = items[0]
    cnt = 0
    
    for item in items:
        if last_viewed != item:
            if cnt > most_viewed[1]:
                most_viewed = (last_viewed, cnt)
                cnt = 1
                last_viewed = item
        else:
            cnt += 1
            
    if cnt > most_viewed[1]:
        most_viewed = (last_viewed, cnt)
    
    return most_viewed[0]
    
most_viewed_item = get_most_viewed_item_feature()

features.append(most_viewed_item)
features_names.append("most_viewed_item")

print(most_viewed_item.take(5))

[(100800, 18407), (270200, 39), (324400, 7267), (504400, 7717), (569200, 8702)]


#### 14) Longest viewed item

If a customer spends a lot of time viewing an item, it might be because he is interested in its caracteristics, and therefore more likely to buy that item or a similar one.

In [29]:
def get_longest_item_feature():
    return train_sessions.rdd.map(lambda x: (x["session_id"], (x["item_id"], x["date"]))).\
                    groupByKey().\
                    mapValues(get_longest_item)

def get_longest_item(items):
    items = list(items)
    t = [items[i+1][1] - items[i][1] for i in range(len(items)-1)]
    return items[np.argmax(t)][0] if len(t) > 0 else items[0][0]

longest_item = get_longest_item_feature()

features.append(longest_item)
features_names.append("longest_item")

print(longest_item.take(5))

[(100800, 18407), (270200, 8700), (324400, 7267), (504400, 7717), (569200, 8702)]


#### 15) Last item

It tells us the last item that was seen in the session. The last item seen in the session should be closer to the purchased item than the first one.

In [30]:
def get_last_item_feature():
    return session_item_date_rdd.groupByKey()\
                                .mapValues(lambda x: max(x, key=lambda i: i[1])[0])

def get_last_item(items):
    return max(list(items), key=lambda i: i[1])[0]

last_item = get_last_item_feature()

features.append(last_item)
features_names.append("last_item")

print(last_item.take(5))

[(100800, 18407), (270200, 19819), (324400, 7267), (504400, 7717), (569200, 8702)]


#### 16) Is the most viewed item also the longest viewed item ?

We checked if the item that was most often seen is also the item that was viewed for the longest time. We believe that if those two variables point to the same item, it is very likely that the item bought is similar.

In [31]:
def get_most_longest_equality_feature():
    return most_viewed_item.join(longest_item)\
                            .mapValues(lambda x: int(x[0] == x[1]))

most_longest_equality = get_most_longest_equality_feature()

features.append(most_longest_equality)
features_names.append("most_longest_equality")

print(most_longest_equality.take(5))

[(100800, 1), (270200, 0), (324400, 1), (504400, 1), (569200, 1)]


#### 17) Is the most viewed item also the last item ?

We checked if the item that was most often seen is also the last item viewed. We believe that if those two variables point to the same item, it is very likely that the item bought is similar

In [32]:
def get_most_last_equality_feature():
    return most_viewed_item.join(last_item)\
                            .mapValues(lambda x: int(x[0] == x[1]))

most_last_equality = get_most_last_equality_feature()

features.append(most_last_equality)
features_names.append("most_last_equality")

print(most_last_equality.take(5))

[(100800, 1), (270200, 0), (324400, 1), (504400, 1), (569200, 1)]


#### 18) Is the longest viewed item also the last item ?

We checked if the item that was was viewed for the longest time is also the last item seen. We believe that if those two variables point to the same item, it is very likely that the item bought is similar

In [33]:
def get_longest_last_equality_feature():
    return longest_item.join(last_item)\
                    .mapValues(lambda x: int(x[0] == x[1]))

longest_last_equality = get_longest_last_equality_feature()

features.append(longest_last_equality)
features_names.append("longest_last_equality")

print(longest_last_equality.take(5))

[(100800, 1), (270200, 0), (324400, 1), (504400, 1), (569200, 1)]


### Items' categories related features

Let's create two RDDs that will be used to generate the following features.

In [34]:
# Maps every item to its features
item_features_rdd = item_features.rdd.map(lambda x: (x["item_id"], (x["feature_category_id"], x["feature_value_id"])))\
                                .groupByKey()\
                                .mapValues(lambda x: [(a,b) for a, b in x])

# Maps every session to the features of its items
session_item_features_rdd = item_features_rdd.join(train_sessions.rdd.map(lambda x: (x["item_id"], x["session_id"])))\
                                            .map(lambda x: (x[1][1], x[1][0]))\
                                            .groupByKey()

#### 19) Most common number of categories of all items seen

This feature tells us for each session the number of categories that appears the most. We believe that if two items have the same number of categories the are probably similar



In [35]:
def get_categories_nb_feature():
    return session_item_features_rdd.mapValues(get_categories_nb)

def get_categories_nb(cat):
    lst = [len(c) for c in cat]
    return max(set(lst), key=lst.count)
    
categories_nb = get_categories_nb_feature()

features.append(categories_nb)
features_names.append("categories_nb")

print(categories_nb.take(5))

[(1799214, 20), (295526, 24), (1798608, 18), (1255632, 18), (1212808, 25)]


#### 20 - 21) Most common category with its count
We regroup in one column the most common category id and its count so that similar categories and similar counts are closer in the search space.

We generate for each session the most present category and the number of times it appears.

In [36]:
def get_most_present_category_with_count(categories_i):
    categories = [cat for cat_i in categories_i for cat in cat_i]    
    categories.sort()
    
    most_viewed = (None, -1)
    last_viewed = categories[0]
    cnt = 0
    
    for category in categories:
        if last_viewed != category:
            if cnt > most_viewed[1]:
                most_viewed = (last_viewed, cnt)
                cnt = 1
                last_viewed = category
        else:
            cnt += 1
            
    if cnt > most_viewed[1]:
        most_viewed = (last_viewed, cnt)
    
    return most_viewed[0], most_viewed[1]

most_present_category_with_count = session_item_features_rdd.mapValues(get_most_present_category_with_count)

We split the RDD above in two.

Models in pyspark require the features to have a numeric value. So to solve easily this issue we created a simple function that return the sum of ($category$ * 1000) and the $value$. We chose 1000 as the maximum value that $value$ can take is less than 1000, this ensures no collision.

For example: (category, value) : (66, 109) -> 66 000 + 109 = 66 109

In [37]:
def get_most_present_category_feature():
    return most_present_category_with_count.mapValues(lambda x: x[0][0]*1000 + x[0][1])

most_present_category = get_most_present_category_feature()

features.append(most_present_category)
features_names.append("most_present_category")

print(most_present_category.take(5))

[(1799214, 4618), (295526, 3793), (1798608, 3793), (1255632, 4618), (1212808, 3793)]


In [38]:
def get_most_present_category_count_feature():
    return most_present_category_with_count.mapValues(lambda x: x[1])

most_present_category_count = get_most_present_category_count_feature()

features.append(most_present_category_count)
features_names.append("most_present_category_count")

print(most_present_category_count.take(5))

[(1799214, 3), (295526, 38), (1798608, 6), (1255632, 5), (1212808, 12)]


#### 22) Number of categories of the most viewed item

This variable computes the number of categories of the most viewed item. It might give us information about the bought product since it might have a similar number of categories, if the items are similar. For example if for a jacket we have 4 categories (size color cut and material), and the item bought also has 4 categories, it is possible that it is also a jacket, since gloves for example could have only 2 categories.

In [39]:
def get_nb_categories_most_viewed_item_feature():
    return most_viewed_item.map(lambda x: (x[1], x[0]))\
                            .join(item_features_rdd)\
                            .map(lambda x: (x[1][0], len(x[1][1])))

nb_categories_most_viewed_item = get_nb_categories_most_viewed_item_feature()

features.append(nb_categories_most_viewed_item)
features_names.append("nb_categories_most_viewed_item")

print(nb_categories_most_viewed_item.take(5))

[(3201600, 18), (2902600, 18), (24801, 18), (1705601, 18), (2194201, 18)]


#### 23) Number of categories of the longest viewed item

This variable computes the number of categories of the most longest viewed item. Like the previous feature, it might give us information about the bought product.

In [40]:
def get_nb_categories_longest_item_feature():
    return longest_item.map(lambda x: (x[1], x[0]))\
                        .join(item_features_rdd)\
                        .map(lambda x: (x[1][0], len(x[1][1])))

nb_categories_longest_item = get_nb_categories_longest_item_feature()

features.append(nb_categories_longest_item)
features_names.append("nb_categories_longest_item")

print(nb_categories_longest_item.take(5))

[(3201600, 18), (1062200, 18), (2123002, 18), (1904603, 18), (2862406, 18)]


#### 24) Number of categories of the last item.

This variable computes the number of categories of the last item. Like the previous feature, it might give us information about the bought product. 

In [41]:
def get_nb_categories_last_item_feature():
    return last_item.map(lambda x: (x[1], x[0]))\
                    .join(item_features_rdd)\
                    .map(lambda x: (x[1][0], len(x[1][1])))

nb_categories_last_item = get_nb_categories_last_item_feature()

features.append(nb_categories_last_item)
features_names.append("nb_categories_last_item")

print(nb_categories_last_item.take(5))

[(2357800, 25), (398614, 25), (598820, 25), (2366227, 25), (2564630, 25)]


## Joining features

All of these engineered features will be put in the same RDD with the id of the session as the key and the different features as the value.

First we add to the values of the features RDDs the index of the feature

Then we take the union of all the features RDDs.

Finally we group the elements of the union by their key, the session_id, and we create a list containing the label as well as all the features according to their index.

We end up with an RDD which has this structure.

session_id : [label, feature_1, feature_2, ...]

In [42]:
purchase_item_rdd = train_purchases.rdd.map(lambda x: (x["session_id"], x["item_id"])).cache()

def group_full_rdd(x):
    temp = [None] * (len(features_names)+1)
    for item in x:
        temp[item[0]] = item[1]
    return temp
        
def get_full_rdd():
    temp = purchase_item_rdd.mapValues(lambda x: (0, x))
    for i, feature in enumerate(features):
        temp = temp.union(feature.mapValues(lambda x: (i+1, x)))
    
    temp = temp.groupByKey().mapValues(group_full_rdd)
    return temp

BIG_RDD = get_full_rdd().cache()

In [44]:
print(BIG_RDD.take(1))

[(129978, [13269, 4, 1, 28, 3, 0, 23, 3, 1, 149.714, 30, 6, 4, 1373, 20271, 1373, 0, 1, 0, 25, 3793, 6, 25, 25, 25])]


# Part 2 : Feature selection


## Ranking algorithm

According to G. Bontempi's handbook "Statistical foundations of machine learning", the ranking methods follow these steps:
    
    1) Calculate for each feature its relevance with the output variable using a univariate measure.
    
    2) Sort their relevance in descending order.
    
    3) Select the top k features.

As we have a lot of categorical features as well as a categorical output, we chose to use mutual information as a univariate measure.

$$ I(X;Y) = H(Y) - H(Y|X)$$

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg/1200px-Entropy-mutual-information-relative-entropy-relation-diagram.svg.png" width="400">
Taken from https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Entropy-mutual-information-relative-entropy-relation-diagram.svg/1200px-Entropy-mutual-information-relative-entropy-relation-diagram.svg.png 

First let's compute $H(Y)$ in map reduce.

$$\displaystyle \mathrm {H} (Y)=-\sum _{y\in {\mathcal {Y}}}p(y)\log _{2}p(y)$$



Based on https://en.wikipedia.org/wiki/Entropy_(information_theory)

First for each session we produce one element with its label as the key and $1$ as the value.

Then reduce by key by adding the values, this gets us the count for each label.

After that we map to get $p(y)$ by dividing the count by the total number of elements.

Finally we map each $p(y)$ to $-p(y)\log _{2}p(x)$, which represents all the terms of the equation above, and we reduce them by adding them.

[^1]: lol.

In [49]:
total_count = purchase_item_rdd.count()

total_count_broadcast = spark.sparkContext.broadcast(total_count)

H_y = BIG_RDD.map(lambda x: (x[1][0], 1))\
            .reduceByKey(np.add)\
            .map(lambda x: x[1]/total_count_broadcast.value)\
            .map(lambda x: -x * np.log2(x))\
            .reduce(np.add)

Here is the value of $H(Y)$.

In [50]:
H_y

11.03202146098545

Now let's calculate $H(Y|X)$ of each feature in map reduce.

$$\displaystyle \mathrm {H} (Y|X)=-\sum _{y,x\in {\mathcal {Y}}\times {\mathcal {X}}}p_{Y,X}(y,x)\log {\frac {p_{Y,X}(y,x)}{p_{X}(x)}}$$
Based on https://en.wikipedia.org/wiki/Entropy_(information_theory)

This RDD will contain the number of elements which correspond to the same feature_index and have the same value and label.

In [51]:
# (feature_index, feature_value, label) : count
count_yx = BIG_RDD.flatMap(lambda x: [((i, x[1][i], x[1][0]), 1) for i in range(1, len(x[1]))])\
                    .reduceByKey(np.add)

This RDD will contain the $p$ of the elements which correspond to the same feature_index and have the same value and label.

In other words, this will contain the $p_{Y,X}(y,x)$ for each feature.

In [52]:
# (feature_index, feature_value, label) : p
p_yx = count_yx.mapValues(lambda x: x / total_count_broadcast.value)

This RDD will contain the number of elements which correspond to the same feature_index and have the same value.

In [53]:
# (feature_index, feature_value) : count
count_x = count_yx.map(lambda x: ((x[0][0], x[0][1]), x[1]))\
                    .reduceByKey(np.add)

This RDD will contain the $p$ of the elements which correspond to the same feature_index and have the same value.

In other words, this will contain the $p_{X}(x)$ for each feature.

In [54]:
# (feature_index, feature_value) : p
p_x = count_x.mapValues(lambda x: x / total_count_broadcast.value)

Now we will calculate $H(y|x)$ for each feature.

First we take the RDD which contains the $p_{Y,X}(y,x)$ for each feature and we join it with the RDD which contains the $p_{X}(x)$ for each feature.

Finally we map each element to $p_{Y,X}(y,x)\log {\frac {p_{Y,X}(y,x)}{p_{X}(x)}}$, which represents all the terms of the equation above, and we reduce them by adding them.

In [55]:
def get_h_yx_term(x):
    p_yx = x[0]
    p_y = x[1]
    h_yx_term = - p_yx * np.log2(p_yx/p_y)
    
    return h_yx_term

H_yx = p_yx.map(lambda x: ((x[0][0], x[0][1]), x[1]))\
            .join(p_x)\
            .map(lambda x: (x[0][0], get_h_yx_term(x[1])))\
            .reduceByKey(np.add)

Here is the value of $H(Y|X)$ for each feature.

In [56]:
H_yx.collect()

[(1, 10.075723915996821),
 (2, 10.393079295379358),
 (3, 10.662043448070813),
 (4, 10.958292053275848),
 (5, 11.018113765358375),
 (6, 10.837239056117252),
 (7, 10.999397527159932),
 (8, 11.02249534813991),
 (9, 3.57164547680823),
 (10, 9.007892033810464),
 (11, 10.785789841027354),
 (12, 10.811317578751712),
 (13, 5.260058196644147),
 (14, 5.170970290288833),
 (15, 4.694100578387632),
 (16, 11.006942193247763),
 (17, 11.005879277544992),
 (18, 11.00046796553993),
 (19, 10.336253059083456),
 (20, 10.451879782370915),
 (21, 10.830148946392256),
 (22, 10.364923349486572),
 (23, 10.423365463383101),
 (24, 10.199048539517932)]

After that we can compute $I(X;Y)$ for each feature.

In [57]:
H_y_broadcast = spark.sparkContext.broadcast(H_y)

features_mi = H_yx.mapValues(lambda x: H_y_broadcast.value - x).collect()

Now let's sort and print the result.

In [58]:
features_mi.sort(key=lambda x: x[1], reverse=True)

print("Features ranking:")

for i in features_mi:
    print(f"\t- {features_names[i[0]-1]} ({i[1]})")

Features ranking:
	- duration (7.460375984177219)
	- last_item (6.3379208825978175)
	- longest_item (5.861051170696617)
	- most_viewed_item (5.771963264341302)
	- average_time (2.0241294271749855)
	- month (0.9562975449886277)
	- nb_categories_last_item (0.8329729214675172)
	- categories_nb (0.6957684019019936)
	- nb_categories_most_viewed_item (0.667098111498877)
	- season (0.6389421656060907)
	- nb_categories_longest_item (0.608655997602348)
	- most_present_category (0.5801416786145346)
	- day_of_month (0.36997801291463617)
	- length (0.2462316199580954)
	- distinct_nb (0.22070388223373705)
	- most_present_category_count (0.20187251459319278)
	- hour (0.19478240486819764)
	- weekday (0.07372940770960135)
	- day_period (0.032623933825517426)
	- longest_last_equality (0.031553495445519886)
	- most_last_equality (0.0261421834404576)
	- most_longest_equality (0.025079267737686095)
	- weekend (0.013907695627073835)
	- night (0.009526112845538393)


## Forward feature selection algorithm

According to G. Bontempi's handbook "Statistical foundations of machine learning", the forward feature selection algorithms are part of wrapping search strategies and work as follows:
    
    First, we start with no selected features.
    
    Then, we select the feature which returns the lowest generalisation error.
    
    After that, we select the feature which, with the previously selected features, returns the lowest generalisation error.
    
    The last step is repeated until there is no improvement or the maximum number of features is reached.

Here is the pseudo code.
```python
def forward_feature_selection(ALL_FEATURES):
    remaining_features <- ALL_FEATURES
    selected_features <- []
    do 
        candidate_feature <- find_lowest_generalisation_error(remaining_features)

        if improvement:
            selected_features.add(candidate_feature)
            remaining_features.remove(candidate_feature)

    while len(remaining_features) > 0 or no_improvement

    return remaining_features
```

Let's prepare the RDD to train the different models.

In [None]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

all_features_rdd = BIG_RDD.sample(withReplacement=False, fraction=0.01, seed=3).map(lambda x: (x[1][0], x[1][1:]))

Now we can launch the forward feature selection algorithm.

# ⚠️

In order to calculate the generalisation error, the use of cross validation would have been better. But due to the large computational power needed to run the cross validation, a single split of the data was performed.

In [None]:
nb_classes = candidate_items.rdd.map(lambda x: x["item_id"]).reduce(max)+1
selected_features_indices = []
remaining_features_indices = list(range(len(features)))
improvement = True
last_accuracy = -1

while (len(selected_features_indices) < len(features)) and (improvement):
    current_accuracy = []
    selected_features_indices_broadcast = spark.sparkContext.broadcast(selected_features_indices)
    print(f"Searching feature {len(selected_features_indices)+1}: ", end="")
    for current_feature_index in remaining_features_indices:
        print("|", end="")
        current_feature_index_broadcast = spark.sparkContext.broadcast(current_feature_index)
        selected_features_rdd = all_features_rdd.map(lambda x: LabeledPoint(x[0], [x[1][i] for i in selected_features_indices_broadcast.value] + [x[1][current_feature_index_broadcast.value]]))
        
        training, test = selected_features_rdd.randomSplit([0.6, 0.4], seed = 515)
        
        model = DecisionTree.trainClassifier(training, nb_classes, {}, maxDepth=1, maxBins=2)
        
        predictions = model.predict(test.map(lambda x: x.features))
        labelsAndPredictions = test.map(lambda x: x.label).zip(predictions)
        
        accuracy = 1.0 * labelsAndPredictions.filter(lambda x: x[0] == x[1]).count() / test.count()

        current_accuracy.append((accuracy, current_feature_index))
    print()
    best_feature = max(current_accuracy, key=lambda x: x[0])
    
    if best_feature[0] > last_accuracy:
        last_accuracy = best_feature[0]
        best_feature_index = best_feature[1]
        selected_features_indices.append(best_feature_index)
        remaining_features_indices.remove(best_feature_index)

    else:
        improvement = False

Now we can print the features selected by our algorithm.

In [None]:
selected_feature_names = [features_name[i] for i in selected_feature_indices]
print(f"The best features are: ", end="")
print(", ".join(selected_feature_names))

# ⚠️
Due to the imposibility for the cluster to run the algorithm above in a reasonable amount of time, we just select the 12 best features according to the ranking algorithm.

In [75]:
selected_feature_indices = [features_mi[i][0]-1 for i in range(12)]
selected_features_names = [features_names[i] for i in selected_feature_indices]

# Part 3 : Model

The model should be trained on the data with the selected features and should returned the predictions required by the competition. For each test session, the model should return 100 candidates items with the highest chance of being purchased. This restricts the differents models that can be used. Therefore we selected the Decision Tree model.

In [76]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

Let's first transform our rdd which contains all the data into a dataframe which the classifier can use.

In [77]:
from pyspark.sql.types import Row

def collector(x):
    dico = {}
    dico["session_id"] = x[0]
    dico["item_id"] = x[1][0]
    
    for i in range(len(features_names)):
        dico[features_names[i]] = x[1][i+1]
        
    return dico

df = BIG_RDD.map(lambda x: Row(**collector(x))).toDF()
df = df.fillna(0)
si = StringIndexer(inputCol="item_id", outputCol="label").fit(candidate_items)
indexed_items_df = si.transform(candidate_items)
df = si.transform(df)
df = VectorAssembler(inputCols = selected_features_names, outputCol = "features").transform(df)
df = df.select(["features", "label"])

Now we create our model.

In [81]:
decisionTree = DecisionTreeClassifier(maxDepth=5)

Finally we can test the accuracy of our model with cross validation. We chose $K$ = 4 to split the dataset into 75 % training and 25 % testing. 

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

grid = ParamGridBuilder().addGrid(decisionTree.featuresCol, ["features"]).build()
cv = CrossValidator(estimator=decisionTree, estimatorParamMaps=grid, numFolds=4, evaluator=evaluator, parallelism=4)
cvModel = cv.fit(df)

print("Average accuracy:", cvModel.avgMetrics[0])

We can see that the accuracy of our model can seem quite low, but this can be explained by the large number of different items in the dataset. It is approxamively 100 % better than just selecting items at random. Indeed, an algorithm based only at random would have an accuracy of 2e-4. Even in the RecSys Challenge 2022, it is asked for each session to give 100 items which have the highest probability.

Now we can get the predictions for the RecSys Challenge 2022's leaderboard.

First we load the specific dataset.

In [None]:
leaderboard_sessions = spark.read.csv("dressipi_recsys2022/test_leaderboard_sessions.csv", header=True)
leaderboard_sessions = leaderboard_sessions.withColumn("session_id", leaderboard_sessions["session_id"].cast("int"))
leaderboard_sessions = leaderboard_sessions.withColumn("item_id", leaderboard_sessions["item_id"].cast("int"))
leaderboard_sessions = leaderboard_sessions.withColumn("date", leaderboard_sessions["date"].cast("timestamp"))

Then we compute the values of all the different features.

In [None]:
session_item_date_rdd = leaderboard_sessions.rdd.map(lambda x: (x["session_id"], (x["item_id"], x["date"]))).cache()
session_item_rdd = session_item_date_rdd.mapValues(lambda x: (x[0])).cache()
session_date_rdd = session_item_date_rdd.mapValues(lambda x: (x[1])).cache()

session_item_features_rdd = item_features_rdd.join(leaderboard_sessions.rdd.map(lambda x: (x["item_id"], x["session_id"])))\
                                            .map(lambda x: (x[1][1], x[1][0]))\
                                            .groupByKey()

item_features_rdd = item_features.rdd.map(lambda x: (x["item_id"], (x["feature_category_id"], x["feature_value_id"])))\
                                .groupByKey()\
                                .mapValues(lambda x: [(a,b) for a, b in x])

session_item_features_rdd = item_features_rdd.join(leaderboard_sessions.rdd.map(lambda x: (x["item_id"], x["session_id"])))\
                                            .map(lambda x: (x[1][1], x[1][0]))\
                                            .groupByKey()


month = get_month_feature()
season = get_season_feature()
day_of_month = get_day_of_month_feature()
weekday = get_weekday_feature()
weekend = get_weekend_feature()
hour = get_hour_feature()
day_period = get_day_period_feature()
night = get_night_feature()
duration = get_duration_feature()
average_time = get_average_time_feature()

length = get_length_feature()
distinct_nb = get_distinct_nb_feature()

most_viewed_item = get_most_viewed_item_feature()
longest_item = get_longest_item_feature()
last_item = get_last_item_feature()
most_longest_equality = get_most_longest_equality_feature()
most_last_equality = get_most_last_equality_feature()
longest_last_equality = get_longest_last_equality_feature()

categories_nb = get_categories_nb_feature()

most_present_category_with_count = session_item_features_rdd.mapValues(get_most_present_category_with_count)

most_present_category = get_most_present_category_feature()
most_present_category_count = get_most_present_category_count_feature()
nb_categories_most_viewed_item = get_nb_categories_most_viewed_item_feature()
nb_categories_longest_item = get_nb_categories_longest_item_feature()
nb_categories_last_item = get_nb_categories_last_item_feature()

features = [month, season, day_of_month, weekday, weekend, hour, day_period, night, duration, average_time, length, distinct_nb, most_viewed_item, longest_item, last_item, most_longest_equality, most_last_equality, longest_last_equality, categories_nb, most_present_category_with_count, most_present_category, most_present_category_count, nb_categories_most_viewed_item, nb_categories_longest_item, nb_categories_last_item]

test_rdd = features[0].join(features[1])

for i in range(2, len(features)):
    feature = features[i]
    test_rdd = test_rdd.join(feature).mapValues(lambda x: list(x[0])+[x[1]])

In [None]:
test_rdd.take(1)

Next we transform our RDD to a dataframe which can be used by our model.

In [None]:
def test_collector(x):
    dico = {}
    dico["session_id"] = x[0]
    for i in range(len(features_names)):
        dico[features_names[i]] = x[1][i]
    return dico


test_df = test_rdd.map(lambda x: Row(**test_collector(x))).toDF()
test_df = test_df.fillna(0)
test_df = VectorAssembler(inputCols = selected_features_names, outputCol = "features").transform(test_df)
test_df = test_df.select(["features", "session_id"])

In [None]:
test_df.show(5)

We create a new decision tree model that will be trained on the whole dataset.

In [None]:
decisionTree = DecisionTreeClassifier(maxDepth=2)
model = decisionTree.fit(df)

We get the list of items which should be predicted for the challenge.

In [None]:
possible_items = candidate_items.rdd.map(lambda x: x["item_id"]).collect()

Finally we gather all the results we computed and save them to a CSV file with the structures asked by the RecSys Challenge 2022.

In [None]:
pred = model.transform(test_df)

In [None]:
pred.show(1)

In [None]:
from ipywidgets import IntProgress
from IPython.display import display

In [None]:
item_to_index = dict()

df_iterator = indexed_items_df.rdd.toLocalIterator()

for row in df_iterator:
    item_to_index[row["item_id"]] = int(row["label"])

In [None]:
progressbar = IntProgress(min=0, max=pred.count())
display(progressbar)

pred_iterator = pred.rdd.toLocalIterator()
session_id = 1
with open("RecSys2022PredictionsLeaderboard.csv", "w") as file:
    file.write("session_id,item_id,rank\n")
    for row in pred_iterator:
        
        temp = list(row["probability"])

        temp = [(temp[item_to_index[i]], i) for i in possible_items]
        temp.sort(reverse=True)
        
        temp = temp[:100]
        for l in range(100):
            file.write(f"{row['session_id']},{temp[l][1]},{l+1}\n")
        
        progressbar.value += 1
        session_id += 1

In [None]:
spark.stop()

# Conclusion

## Video presentation : 

Please find the presentation of our project using the following link : https://drive.google.com/file/d/1XA5euCfwmq0Yp1CqiVUHU_7ZTO-Mpx-u/view?usp=sharing

## Issues with the cluster
As already reported by other students, the issues with the cluster really impacted our development. We weren't able to connect from our IDE to the cluster directly so we had to develop our code with very basic autocompletion on the jupyter notebook available on the cluster. This meant we couldn't work in parallel on different cells seemlessly, we had to work separately and merge our modifications which added quite an overhead to the work performance.

Also, and most importantly, the issue of availability of the cluster really impacted us hard, totalling for around 25% of our time spent developing. This means we, on average for each session of working on the project, had to dedicate 25% of our time to solve diverse issues with the cluster (slow logging in and access, restarting kernel, saving the project issue, creating backups, etc etc ...). Actually, we believe it would have been faster to not use the cluster at all and to run the computation locally on our computer. The computation would have taken longer but there would have been less friction when working on the project.

Finally, when one person is running the notebook and another student try to join the notebook, there is a message asking the other student to either erase all the changes the first student made, or reload the notebook, even without changing anything on the notebook which might have lead to data being lost.



## Feature engineered dataset
From the multiple .csv file that were available to us for the project, we developed a dataset containing 24 features described in the report. We will enumerate them here as well

**Date related**: Month, Season, Day of month, Weekday, Weekend, Hour, Day period, Night, Duration of the session, Average time between consecutive item views, Number of items, Number of distinct items, Most viewed item.

**Session related**: Number of items, Number of distinct items.

**Item related**: Most viewed item, Longest  viewed item, Last viewed item, Is the most viewed item also the longest viewed item ?, Is the most viewed item also the last item ?, Is the longest viewed item also the last item ?.

**Categories related**: Most common number of categories of all items seen, Most present category with its count, Number of categories of the most viewed item, Number of categories of the longest viewed item, Number of categories of the last item.


Here is an example of the first few rows of the dataset in a table

| session_id | item_id | month | season | day_of_month | weekday | weekend | hour | day_period | night | duration | average_time | length | distinct_nb | most_viewed_item | longest_item | last_item | most_longest_equality | most_last_equality | longest_last_equality | categories_nb | most_present_category | most_present_category_count | nb_categories_most_viewed_item | nb_categories_longest_item | nb_categories_last_item | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| 129978 | 13269 | 4 | 1 | 28 | 3 | 0 | 23 | 3 | 1 | 149 | 30 | 6 | 4 | 1373 | 20271 | 1373 | 0 | 1 | 0 | 25 | 3793 | 6 | 25 | 25 | 25 | 
| 144420 | 4400 | 5 | 1 | 2 | 6 | 1 | 13 | 1 | 0 | 53 | 53 | 2 | 1 | 16065 | 16065 | 16065 | 1 | 1 | 1 | 27 | 3793 | 2 | 27 | 27 | 27 | 
| 163676 | 7729 | 1 | 0 | 15 | 5 | 1 | 12 | 3 | 1 | 499 | 125 | 5 | 4 | 4222 | 5966 | 4222 | 0 | 1 | 0 | 24 | 3793 | 3 | 24 | 18 | 24 | 
| 187746 | 2888 | 5 | 1 | 23 | 0 | 0 | 18 | 3 | 1 | 3672 | 367 | 11 | 8 | 2499 | 9246 | 2499 | 0 | 1 | 0 | 25 | 3793 | 10 | 25 | 25 | 25 | 
| 264770 | 18969 | 5 | 1 | 30 | 0 | 0 | 18 | 3 | 1 | 211 | 30 | 8 | 6 | 15777 | 26163 | 3237 | 0 | 0 | 0 | 24 | 3793 | 8 | 24 | 24 | 24 | 

## Feature selection

To achieve this dataset used 2 different feature selection algorithms : a feature ranking algorithm as well as a forward feature selection algorithm.

For the ranking algorithm, we used the definition available in Pr. Bontempi's book : "Statistical foundations of machine learning", and we chose to use mutual information as a univariate measure.

For the forward feature selection algorithm, we used the definition available in the same book, and we chose the decision tree classifier to compute the generalisation error as it is the same model that was chosen in the following part.

As we could not get the forward feature algorithm to finish in a resonable amount of time due to the cluster, we decided to use results of the ranking algorithm to select the best features. We decided to get rid of the 12 worst features leaving us with this dataset


| session_id | item_id | duration | last_item | longest_item | most_viewed_item | average_time | month | nb_categories_last_item | categories_nb | nb_categories_most_viewed_item | season | nb_categories_longest_item | most_present_category | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| 129978 | 13269 | 149 | 1373 | 20271 | 1373 | 30 | 4 | 25 | 25 | 25 | 1 | 25 | 3793 | 
| 144420 | 4400 | 53 | 16065 | 16065 | 16065 | 53 | 5 | 27 | 27 | 27 | 1 | 27 | 3793 | 
| 163676 | 7729 | 499 | 4222 | 5966 | 4222 | 125 | 1 | 24 | 24 | 24 | 0 | 18 | 3793 | 
| 187746 | 2888 | 3672 | 2499 | 9246 | 2499 | 367 | 5 | 25 | 25 | 25 | 1 | 25 | 3793 | 
| 264770 | 18969 | 211 | 3237 | 26163 | 15777 | 30 | 5 | 24 | 24 | 24 | 1 | 24 | 3793 | 



## Model

We experimented with different models but the one we found the most suitable was the DecisionTreeClassifier from pyspark.ml.classification. 

When performing a 4-fold cross validation, the model obtained arround 2% of accuracy. This accuracy was not computed using latest state of our project due to the impossibility of running it on the cluster. This result is excellent compared to random picking which has a 2e-4 (0.0002). Therefore, we predict the data 100 times better than what we would have at random.


This model was able to generate the prediction for us to submit on the challenge website, where we got a whopping score of 0.022 (2.2%).