# LightFM application to Kaggle Ponpare challenge

## Note to self :
Use the --ip=localhost option when launching the notebook so that the output is displayed.

## Configure workspace

First, set up a local directory in which to clone the Kaggle scripts and download the data.  
Example:
- mkdir Kaggle && cd Kaggle 
- git clone  https://github.com/tdeboissiere/Kaggle.git
- Download the [data](https://www.kaggle.com/c/coupon-purchase-prediction/data)
- Copy the data to Kaggle/Ponpare/Data/Data_japanese and unzip

Once this is done, we run a series of preprocessing steps :
- Translate the data to English 
- Do a bit of data cleaning 
- Create validation sets

To this end, we first move to the Ponpare_utilities directory:

In [1]:
cd Ponpare_utilities

/home/irfulx204/mnt/tmain/Desktop/Notebook/Kaggle/Ponpare/Ponpare_utilities


## Import relevant modules

In [18]:
import create_validation as val
import preprocessing_submission as prep_sub
import preprocessing_validation as prep_val
import translate as tr
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn import preprocessing
import os
import sys
import cPickle as pickle
from datetime import date
import calendar
import scipy.io as spi
import scipy.sparse as sps
from sklearn.feature_extraction import DictVectorizer

In [9]:
tr.translate()

The files are now being translated; this takes a bit of time.  
Once this is done, we check the files have been created.

In [6]:
assert(
    os.path.isfile("../Data/Data_translated/coupon_detail_train_translated.csv"))
assert(
    os.path.isfile("../Data/Data_translated/coupon_visit_train_translated.csv"))
assert(os.path.isfile("../Data/Data_translated/coupon_list_train_translated.csv"))
assert(os.path.isfile("../Data/Data_translated/coupon_list_test_translated.csv"))
assert(os.path.isfile("../Data/Data_translated/user_list_translated.csv"))

No assertion error, all good !
Let us check the translation :

In [3]:
df = pd.read_csv("../Data/Data_translated/user_list_translated.csv")
print df.iloc[:5, :]

              REG_DATE SEX_ID  AGE WITHDRAW_DATE PREF_NAME  \
0  2012-03-28 14:14:18      f   25           NaN   unknown   
1  2011-05-18 00:41:48      f   34           NaN     tokyo   
2  2011-06-13 16:36:58      m   41           NaN     aichi   
3  2012-02-08 12:56:15      m   25           NaN   unknown   
4  2011-05-22 23:43:56      m   62           NaN  kanagawa   

                       USER_ID_hash  
0  d9dca3cb44bab12ba313eaa681f663eb  
1  560574a339f1b25e57b0221e486907ed  
2  e66ae91b978b3229f8fd858c80615b73  
3  43fc18f32eafb05713ec02935e2c2825  
4  dc6df8aa860f8db0d710ce9d4839840f  


We can see the translation worked.  
The "unknown" value is used for users whose PREF_NAME is not known.

We now create a validation set :

In [5]:
# We pick the last week of training data (dubbed week52) as our test week
# All coupons emitted during this week become our new test coupons
val.create_validation_set([2012, 6, 17], [2012, 6, 23], "week52")
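
The date-window idea behind this call can be sketched in plain Python. This is only an illustration: `split_coupons_by_week` is a hypothetical helper, and it assumes that coupons are assigned to the test week through their `DISPFROM` date, which the actual `create_validation_set` may handle differently.

```python
from datetime import date, datetime


def split_coupons_by_week(coupons, start, end):
    """Split (coupon_id, dispfrom) pairs into train and test ids.

    Hypothetical helper: coupons whose DISPFROM date falls inside
    [start, end] go to the test set, earlier ones go to the train set,
    and later ones are dropped.
    """
    start_day, end_day = date(*start), date(*end)
    train, test = [], []
    for coupon_id, dispfrom in coupons:
        day = datetime.strptime(dispfrom, "%Y-%m-%d %H:%M:%S").date()
        if day < start_day:
            train.append(coupon_id)
        elif day <= end_day:
            test.append(coupon_id)
    return train, test


# Toy coupons (made-up ids and dates)
coupons = [
    ("c1", "2012-06-10 12:00:00"),
    ("c2", "2012-06-17 00:00:00"),
    ("c3", "2012-06-23 23:59:59"),
]
train_ids, test_ids = split_coupons_by_week(
    coupons, [2012, 6, 17], [2012, 6, 23])
```

Both bounds of the window are treated as inclusive here, so a coupon displayed late on the last day still counts as a test coupon.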

Let us check the files have been created properly :

In [9]:
assert(os.path.isfile(
    "../Data/Validation/week52/coupon_detail_train_validation_week52.csv"))
assert(os.path.isfile(
    "../Data/Validation/week52/coupon_visit_train_validation_week52.csv"))
assert(os.path.isfile(
    "../Data/Validation/week52/coupon_list_train_validation_week52.csv"))
assert(os.path.isfile(
    "../Data/Validation/week52/coupon_list_test_validation_week52.csv"))
assert(
    os.path.isfile("../Data/Validation/week52/user_list_validation_week52.csv"))


All good !
We can now move on to the more interesting part : building a recommender with lightFM.

We start by creating the sparse matrices which we will use to store implicit feedback information and user/item attributes.

Let us write down 3 functions for that :

## Creation of user feature matrix

In [26]:
def build_user_feature_matrix(week_ID):
    """ Build user feature matrix
    (feat = USER_ID_hash, PREF_NAME, SEX_ID, REG_DATE year and AGE slices,
    these feat are then binarized with DictVectorizer)

    arg : week_ID (str) validation week

    """

    print "Creating user_feature matrix for LightFM"

    def age_function(age, age_low=0, age_up=100):
        """Binarize age in age slices"""
        
        if age_low <= age < age_up:
            return 1
        else:
            return 0

    def format_reg_date(row):
        """Reduce reg date to its year (the month is currently left out)"""
        
        row = row.split(" ")
        row = row[0].split("-")
        reg_date = row[0]  # + row[1]
        return reg_date

    ulist = pd.read_csv(
        "../Data/Validation/%s/user_list_validation_%s.csv" %
        (week_ID, week_ID))

    # Format REG_DATE
    ulist["REG_DATE"] = ulist["REG_DATE"].apply(format_reg_date)

    # Segment the age
    ulist["0to30"] = ulist["AGE"].apply(age_function, age_low=0, age_up=30)
    ulist["30to50"] = ulist["AGE"].apply(
        age_function,
        age_low=30,
        age_up=50)
    ulist["50to100"] = ulist["AGE"].apply(
        age_function,
        age_low=50,
        age_up=100)

    list_age_bin = [col for col in ulist.columns.values if "to" in col]
    ulist = ulist[["USER_ID_hash",
                   "PREF_NAME",
                   "SEX_ID",
                   "REG_DATE"] + list_age_bin]

    ulist = ulist.T.to_dict().values()
    vec = DictVectorizer(sparse=True)
    ulist = vec.fit_transform(ulist)
    # ulist is in csr format, make sure the type is int
    ulist = sps.csr_matrix(ulist, dtype=np.int32)

    # Save the matrix. They are already in csr format
    spi.mmwrite(
        "../Data/Validation/%s/user_feat_mtrx_%s.mtx" %
        (week_ID, week_ID), ulist)
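
To make the encoding step more concrete, here is a small sketch of how `DictVectorizer` turns rows of mixed string and numeric values into a sparse matrix. The rows below are made-up toy values, not taken from the dataset, and the variable names are only illustrative.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy user rows (made-up values, not taken from the dataset)
train_rows = [
    {"SEX_ID": "f", "PREF_NAME": "tokyo", "0to30": 1},
    {"SEX_ID": "m", "PREF_NAME": "aichi", "0to30": 0},
]

vec = DictVectorizer(sparse=True)
train_matrix = vec.fit_transform(train_rows)

# String values become one "name=value" column each,
# numeric values keep a single column holding the number
feature_names = list(vec.feature_names_)
dense_train = train_matrix.toarray().tolist()

# A category never seen during fit ("osaka") is silently ignored
new_rows = [{"SEX_ID": "f", "PREF_NAME": "osaka", "0to30": 1}]
dense_new = vec.transform(new_rows).toarray().tolist()
```

Since USER_ID_hash is a string column, the same mechanism gives every user a column of its own in the matrix built above. The "silently ignored" behaviour matters when a vectorizer fitted on one table is applied to another, as is done for the test coupons in the item feature matrix.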


## Creation of item feature matrix

In [25]:
def build_item_feature_matrix(week_ID):
    """ Build item feature matrix

    arg : week_ID (str) validation week
    """

    print "Creating item_feature matrix for LightFM"

    def binarize_function(val, val_low=0, val_up=100):
        """Function to binarize a given column in slices
        """
        if val_low <= val < val_up:
            return 1
        else:
            return 0

    # Utility to convert a date to the day of the week
    # (indexed by i in [0,1,..6], Monday = 0)
    def get_day_of_week(row):
        """Return the weekday index of a "YYYY-MM-DD hh:mm:ss" string.
        The time of the day is ignored.
        """
        row = row.split(" ")
        row = row[0].split("-")
        y, m, d = int(row[0]), int(row[1]), int(row[2])
        return date(y, m, d).weekday()

    # Load coupon data
    cpltr = pd.read_csv(
        "../Data/Validation/%s/coupon_list_train_validation_%s.csv" %
        (week_ID, week_ID))
    cplte = pd.read_csv(
        "../Data/Validation/%s/coupon_list_test_validation_%s.csv" %
        (week_ID, week_ID))

    cplte["DISPFROM_day"] = cplte["DISPFROM"].apply(get_day_of_week)
    cpltr["DISPFROM_day"] = cpltr["DISPFROM"].apply(get_day_of_week)
    cplte["DISPEND_day"] = cplte["DISPEND"].apply(get_day_of_week)
    cpltr["DISPEND_day"] = cpltr["DISPEND"].apply(get_day_of_week)

    # Binarize PRICE_RATE with the same slices for train and test coupons
    # (the slice bounds match the column names)
    for rate_low, rate_up in [(0, 50), (50, 70), (70, 100)]:
        col_name = "PRICE_%sto%s" % (rate_low, rate_up)
        cpltr[col_name] = cpltr["PRICE_RATE"].apply(
            binarize_function,
            val_low=rate_low,
            val_up=rate_up)
        cplte[col_name] = cplte["PRICE_RATE"].apply(
            binarize_function,
            val_low=rate_low,
            val_up=rate_up)

    list_quant_name = [0, 20, 40, 60, 80, 100]
    quant_step = list_quant_name[1] - list_quant_name[0]

    list_prices = cpltr["CATALOG_PRICE"].values
    list_quant = [np.percentile(list_prices, quant)
                  for quant in list_quant_name]

    for index, (quant_name, quant) in enumerate(zip(list_quant_name, list_quant)):
        if index > 0:
            cpltr["CAT_%sto%s" % (
                quant_name - quant_step,
                quant_name)] = cpltr["CATALOG_PRICE"].apply(binarize_function,
                                                            val_low=list_quant[
                                                                index -
                                                                1],
                                                            val_up=quant)
            cplte["CAT_%sto%s" % (
                quant_name - quant_step,
                quant_name)] = cplte["CATALOG_PRICE"].apply(binarize_function,
                                                            val_low=list_quant[
                                                                index -
                                                                1],
                                                            val_up=quant)

    list_prices = cpltr["DISCOUNT_PRICE"].values
    list_quant = [np.percentile(list_prices, quant)
                  for quant in list_quant_name]
    for index, (quant_name, quant) in enumerate(zip(list_quant_name, list_quant)):
        if index > 0:
            cpltr["DIS_%sto%s" % (
                quant_name - quant_step,
                quant_name)] = cpltr["DISCOUNT_PRICE"].apply(binarize_function,
                                                             val_low=list_quant[
                                                                 index -
                                                                 1],
                                                             val_up=quant)
            cplte["DIS_%sto%s" % (
                quant_name - quant_step,
                quant_name)] = cplte["DISCOUNT_PRICE"].apply(binarize_function,
                                                             val_low=list_quant[
                                                                 index -
                                                                 1],
                                                             val_up=quant)

    list_col_bin = [col for col in cplte.columns.values if "to" in col]

    # List of features
    list_feat = [
        "GENRE_NAME", "large_area_name", "small_area_name", "VALIDPERIOD", "USABLE_DATE_MON", "USABLE_DATE_TUE",
        "USABLE_DATE_WED", "USABLE_DATE_THU", "USABLE_DATE_FRI",
        "USABLE_DATE_SAT", "USABLE_DATE_SUN", "USABLE_DATE_HOLIDAY",
        "USABLE_DATE_BEFORE_HOLIDAY"] + list_col_bin

    # NA imputation
    cplte = cplte.fillna(-1)
    cpltr = cpltr.fillna(-1)

    list_col_to_str = [
        "PRICE_RATE",
        "CATALOG_PRICE",
        "DISCOUNT_PRICE",
        "DISPFROM_day",
        "DISPEND_day",
        "DISPPERIOD",
        "VALIDPERIOD"]
    cpltr[list_col_to_str] = cpltr[list_col_to_str].astype(str)
    cplte[list_col_to_str] = cplte[list_col_to_str].astype(str)

    # Reduce dataset to features of interest
    cpltr = cpltr[list_feat]
    cplte = cplte[list_feat]

    list_us = [col for col in list_feat if "USABLE" in col]
    for col in list_us:
        cpltr.loc[cpltr[col] > 0, col] = 1
        cpltr.loc[cpltr[col] < 0, col] = 0
        # Mask the test frame with its own values, not the train ones
        cplte.loc[cplte[col] > 0, col] = 1
        cplte.loc[cplte[col] < 0, col] = 0

    # Binarize categorical features
    cpltr = cpltr.T.to_dict().values()
    vec = DictVectorizer(sparse=True)
    cpltr = vec.fit_transform(cpltr)
    cplte = vec.transform(cplte.T.to_dict().values())

    cplte = sps.csr_matrix(cplte, dtype=np.int32)
    cpltr = sps.csr_matrix(cpltr, dtype=np.int32)

    # Save the matrix. They are already in csr format
    spi.mmwrite(
        "../Data/Validation/%s/train_item_feat_mtrx_%s.mtx" %
        (week_ID, week_ID), cpltr)
    spi.mmwrite(
        "../Data/Validation/%s/test_item_feat_mtrx_%s.mtx" %
        (week_ID, week_ID), cplte)
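
The slicing logic used for the price features can be sketched in isolation. This is a minimal illustration with hypothetical helper names (`binarize`, `one_hot_slices`) and illustrative bounds; it only reproduces the half-open test `val_low <= val < val_up` of `binarize_function`.

```python
def binarize(val, val_low, val_up):
    """Return 1 if val_low <= val < val_up, else 0."""
    return 1 if val_low <= val < val_up else 0


def one_hot_slices(val, edges):
    """Encode val against consecutive [low, up) slices built from edges."""
    return [binarize(val, low, up) for low, up in zip(edges[:-1], edges[1:])]


edges = [0, 50, 70, 100]  # illustrative slice bounds
mid_rate = one_hot_slices(55, edges)    # falls in the [50, 70) slice
edge_rate = one_hot_slices(50, edges)   # a lower bound belongs to its slice
top_rate = one_hot_slices(100, edges)   # the highest bound belongs to no slice
```

Because the upper bound is excluded, a value equal to the last edge gets no slice at all. With percentile-based edges, this would apply to the largest value of the training column.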


## Creation of implicit feedback user/item matrix

In [1]:
def build_user_item_mtrx(week_ID):
    """ Build user item matrix (for the train dataset)
    (sparse matrix, Mui[u,i] = 1 if user u has purchased item i, 0 otherwise)

    arg : week_ID (str) validation week
    """

    print "Creating user_item matrix for LightFM"

    # For now, only consider the detail dataset
    cpdtr = pd.read_csv(
        "../Data/Validation/%s/coupon_detail_train_validation_%s.csv" %
        (week_ID, week_ID))
    cpltr = pd.read_csv(
        "../Data/Validation/%s/coupon_list_train_validation_%s.csv" %
        (week_ID, week_ID))
    cplte = pd.read_csv(
        "../Data/Validation/%s/coupon_list_test_validation_%s.csv" %
        (week_ID, week_ID))
    ulist = pd.read_csv(
        "../Data/Validation/%s/user_list_validation_%s.csv" %
        (week_ID, week_ID))

    # Build a dict with the coupon index in cpltr
    d_ci_tr = {}
    for i in range(len(cpltr)):
        coupon = cpltr["COUPON_ID_hash"].values[i]
        d_ci_tr[coupon] = i

    # Build a dict with the user index in ulist
    d_ui = {}
    for i in range(len(ulist)):
        user = ulist["USER_ID_hash"].values[i]
        d_ui[user] = i

    # Build the user x item matrices using scipy lil_matrix
    Mui_tr = sps.lil_matrix((len(ulist), len(cpltr)), dtype=np.int8)

    # Now fill Mui_tr with the info from cpdtr
    for i in range(len(cpdtr)):
        sys.stdout.write(
            "\rProcessing row " + str(i) + "/ " + str(cpdtr.shape[0]))
        sys.stdout.flush()
        user = cpdtr["USER_ID_hash"].values[i]
        coupon = cpdtr["COUPON_ID_hash"].values[i]
        ui, ci = d_ui[user], d_ci_tr[coupon]
        Mui_tr[ui, ci] = 1
    print

    # Save the matrix in the COO format
    spi.mmwrite(
        "../Data/Validation/%s/user_item_train_mtrx_%s.mtx" %
        (week_ID, week_ID), Mui_tr)


The inline documentation of these functions is self-explanatory. We basically specify which features we want to take into account for either users or items and we consider an implicit feedback-only case where only purchases are counted as interactions.
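
As an illustration of the implicit feedback encoding, the sketch below builds a binary user x item matrix from a list of purchases in one go, instead of filling a `lil_matrix` entry by entry. `build_interactions` is a hypothetical helper and the ids are made up; it is meant as an alternative sketch, not as a replacement for the function above.

```python
import numpy as np
import scipy.sparse as sps


def build_interactions(user_ids, item_ids, purchases):
    """Return a binary CSR matrix with M[u, i] = 1 if user u bought item i.

    purchases is a list of (user_id, item_id) pairs (hypothetical input).
    """
    user_index = {user: i for i, user in enumerate(user_ids)}
    item_index = {item: j for j, item in enumerate(item_ids)}
    rows = [user_index[user] for user, _ in purchases]
    cols = [item_index[item] for _, item in purchases]
    data = np.ones(len(rows), dtype=np.int8)
    matrix = sps.coo_matrix(
        (data, (rows, cols)),
        shape=(len(user_ids), len(item_ids)))
    # Converting to CSR sums repeated purchases, so reset entries to 1
    matrix = matrix.tocsr()
    matrix.data[:] = 1
    return matrix


# Toy example: user u1 bought coupon c2 twice, user u3 bought c1 once
Mui = build_interactions(
    ["u1", "u2", "u3"],
    ["c1", "c2"],
    [("u1", "c2"), ("u3", "c1"), ("u1", "c2")])
```

Repeated purchases collapse to a single 1, which matches the "purchased or not" reading of the matrix.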

We can now create these matrices :

In [27]:
build_user_item_mtrx("week52")
build_user_feature_matrix("week52")
build_item_feature_matrix("week52")

As always, let's check the matrices were created :

In [28]:
week_ID = "week52"
assert(
    os.path.isfile("../Data/Validation/%s/user_item_train_mtrx_%s.mtx" %
                   (week_ID, week_ID)))
assert(
    os.path.isfile("../Data/Validation/%s/train_item_feat_mtrx_%s.mtx" %
                   (week_ID, week_ID)))
assert(
    os.path.isfile("../Data/Validation/%s/test_item_feat_mtrx_%s.mtx" %
                   (week_ID, week_ID)))
assert(
    os.path.isfile("../Data/Validation/%s/user_feat_mtrx_%s.mtx" %
                   (week_ID, week_ID)))


Alright !
Let us now load these matrices :

In [30]:
week_ID = "week52"
Mui_train = spi.mmread(
    "../Data/Validation/%s/user_item_train_mtrx_%s.mtx" %
    (week_ID, week_ID))
uf = spi.mmread(
    "../Data/Validation/%s/user_feat_mtrx_%s.mtx" %
    (week_ID, week_ID))
itrf = spi.mmread(
    "../Data/Validation/%s/train_item_feat_mtrx_%s.mtx" %
    (week_ID, week_ID))
itef = spi.mmread(
    "../Data/Validation/%s/test_item_feat_mtrx_%s.mtx" %
    (week_ID, week_ID))


Import lightfm and our evaluation metric (MAP@10):

In [31]:
from lightfm import LightFM
import mean_average_precision as mapr
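
For reference, mean average precision at k can be sketched in a few lines of plain Python. This is a generic version of the metric under the usual definition; whether the local `mean_average_precision` module follows exactly the same conventions (default k, handling of empty purchase lists) is an assumption.

```python
def apk(actual, predicted, k=10):
    """Average precision at k for one user."""
    if not actual:
        return 0.0
    predicted = list(predicted)[:k]
    score, num_hits = 0.0, 0
    for i, item in enumerate(predicted):
        # Count each relevant item once, at its first position
        if item in actual and item not in predicted[:i]:
            num_hits += 1
            score += num_hits / (i + 1.0)
    return score / min(len(actual), k)


def mapk(list_actual, list_predicted, k=10):
    """Mean of apk over all users."""
    scores = [apk(a, p, k) for a, p in zip(list_actual, list_predicted)]
    return sum(scores) / len(scores)


# Toy example: the first user has hits at ranks 1 and 3, the second has none
user_score = apk(["a", "b"], ["a", "x", "b"])
mean_score = mapk([["a", "b"], ["c"]], [["a", "x", "b"], ["y"]])
```

Only the order of the top k predictions matters, which is why the model scores are turned into a ranked list of coupons before scoring.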

Let us now define a function to fit a lightFM instance

In [35]:
def fit_model(week_ID, no_comp, lr, ep):
    """ Fit the lightFM model on the training data of week_ID,
    then score every test coupon for every user

    args : week_ID (str) validation test week
    no_comp, lr, ep = (int, float, int) number of components, learning rate, number of epochs for lightFM model

    returns: d_user_pred, list_user, list_coupon
    list_coupon = list of test coupons
    list_user = list of user ID
    d_user_pred : key = user, value = predicted scores for the coupons in list_coupon
    (a higher score means a better rank)

    """

    print "Fit lightfm model for %s" % week_ID

    # Load data
    Mui_train = spi.mmread(
        "../Data/Validation/%s/user_item_train_mtrx_%s.mtx" %
        (week_ID, week_ID))
    uf = spi.mmread(
        "../Data/Validation/%s/user_feat_mtrx_%s.mtx" %
        (week_ID, week_ID))
    itrf = spi.mmread(
        "../Data/Validation/%s/train_item_feat_mtrx_%s.mtx" %
        (week_ID, week_ID))
    itef = spi.mmread(
        "../Data/Validation/%s/test_item_feat_mtrx_%s.mtx" %
        (week_ID, week_ID))

    # Print shapes as a check
    print "user_features shape: %s,\nitem train features shape: %s,\nitem test features shape: %s" % (uf.shape, itrf.shape, itef.shape)

    # Load test coupon  and user lists
    cplte = pd.read_csv(
        "../Data/Validation/" +
        week_ID +
        "/coupon_list_test_validation_" +
        week_ID +
        ".csv")
    ulist = pd.read_csv(
        "../Data/Validation/" +
        week_ID +
        "/user_list_validation_" +
        week_ID +
        ".csv")
    list_coupon = cplte["COUPON_ID_hash"].values
    list_user = ulist["USER_ID_hash"].values

    # Build model
    model = LightFM(no_components=no_comp, learning_rate=lr, loss='warp')
    model.fit_partial(
        Mui_train,
        user_features=uf,
        item_features=itrf,
        epochs=ep,
        num_threads=4,
        verbose=True)

    test = sps.csr_matrix(
        (len(list_user),
         len(list_coupon)),
        dtype=np.int32)
    no_users, no_items = test.shape
    pid_array = np.arange(no_items, dtype=np.int32)

    # Create and initialise dict to store predictions
    d_user_pred = {}
    for user in list_user:
        d_user_pred[user] = []

    # Loop over users and compute predictions
    for user_id, row in enumerate(test):
        sys.stdout.write("\rProcessing user " + str(user_id)
                         + "/ " + str(len(list_user)))
        sys.stdout.flush()
        uid_array = np.empty(no_items, dtype=np.int32)
        uid_array.fill(user_id)
        predictions = model.predict(
            uid_array,
            pid_array,
            user_features=uf,
            item_features=itef,
            num_threads=4)
        user = str(list_user[user_id])
        d_user_pred[user] = predictions

    return d_user_pred, list_user, list_coupon


We also define a scoring function :

In [37]:
def score_lightFM(no_comp, lr, ep):
    """
    Score the lightFM model for mean average precision at k = 10

    args = no_comp, lr, ep (int, float, int)
    number of components, learning rate, number of epochs for lightFM model
    """

    list_score = []

    # Loop over validation weeks
    for week_ID in ["week52"]:
        # Get predictions, manually choose metric and classifier
        d_user_pred, list_user_full, list_coupon = fit_model(
            week_ID, no_comp, lr, ep)
        # Format predictions
        for index, user in enumerate(list_user_full):
            list_pred = d_user_pred[user]
            top_k = np.argsort(-list_pred)[:10]
            d_user_pred[user] = list_coupon[top_k]

        # Get actual purchase
        d_user_purchase = {}
        with open("../Data/Validation/" + week_ID + "/dict_purchase_validation_" + week_ID + ".pickle", "rb") as fp:
            d_user_purchase = pickle.load(fp)

        # Take care of users who registered during validation test week
        for key in d_user_purchase.keys():
            if key not in d_user_pred:
                d_user_pred[key] = []

        list_user = d_user_purchase.keys()
        list_actual = [d_user_purchase[key] for key in list_user]
        list_pred = [d_user_pred[key] for key in list_user]

        # Store one score per validation week
        list_score.append(mapr.mapk(list_actual, list_pred))

    list_score = np.array(list_score)
    print list_score
    print str(np.mean(list_score)) + " +/- " + str(np.std(list_score))
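
The ranking step inside `score_lightFM` relies on sorting the negated scores. A minimal sketch with made-up scores and a hypothetical helper name (`top_k_coupons`):

```python
import numpy as np


def top_k_coupons(scores, coupon_ids, k=10):
    """Return the k coupon ids with the highest scores, best first."""
    scores = np.asarray(scores, dtype=float)
    coupon_ids = np.asarray(coupon_ids)
    # argsort is ascending, so negate the scores to get a descending order
    top_idx = np.argsort(-scores)[:k]
    return coupon_ids[top_idx].tolist()


# Toy example: coupon "b" has the highest score, then "d"
ranked = top_k_coupons([0.1, 2.0, -1.0, 0.5], ["a", "b", "c", "d"], k=2)
```

Only the relative order of the scores is used, so their absolute values do not need to be interpretable.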


We can now evaluate our lightFM model !

In [39]:
no_comp, lr, ep = 10, 0.01, 10  # 10 components, 0.01 learning rate, 10 epochs
score_lightFM(no_comp, lr, ep)