# Intro

Neural Collaborative Filtering (NCF) is a neural-network-based recommender system framework proposed by He et al. (2017). They argue that a neural network can learn the user-item interaction function, the key component of collaborative filtering, directly from implicit feedback.

Our first notebook [(hakanerdem)](https://www.kaggle.com/hakanerdem/recommender-system-with-embedding-layers) covered recommender systems for cross-selling opportunities in retail marketing. This time we tackle the same problem with NCF. The Python code was developed largely thanks to excellent notebooks such as:

[fuzzywizard](https://www.kaggle.com/fuzzywizard/rec-sys-collaborative-filtering-dl-techniques#4-Matrix-Factorization-using-Deep-Learning-(Keras))

[rajmehra03](https://www.kaggle.com/rajmehra03/cf-based-recsys-by-low-rank-matrix-factorization)

The data preparation sections are the same as in the first notebook mentioned above. For detailed information about NCF, please see:

He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T. (2017), Neural Collaborative Filtering. WWW '17: Proceedings of the 26th International Conference on World Wide Web, 173–182. DOI: http://dx.doi.org/10.1145/3038912.3052569

Dataset chosen: Online Retail II. For detailed information, please visit:

https://www.kaggle.com/mashlyn/online-retail-ii-uci, [UCI Repository](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II)

# Libraries

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from tensorflow import keras
from tensorflow.keras import layers

# Data & Preparation

In [None]:
data = pd.read_csv("../input/online-retail-ii-uci/online_retail_II.csv",
                   parse_dates=["InvoiceDate"],
                   dtype={"Customer ID":"object"})

In [None]:
df = data.copy()
df.head()

Dropping rows with missing values, negative quantities, and irrelevant stock codes (test and postage entries).

In [None]:
df = df.dropna()
df = df.drop(df[df["Quantity"]<0].index)
df = df.drop(df[df["StockCode"].str.contains("TEST")].index)
df = df.drop(df[df["StockCode"]=="POST"].index)

df = df.sort_values("InvoiceDate")

# Common Functions

The first function returns the unique elements of a list. The second generates, for each purchase sequence, an input sequence and a target sequence made of the last *n_target* products. The last one draws negative samples, i.e. products a customer has not purchased.

In [None]:
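def unique(list1):
    # dict.fromkeys drops duplicates while keeping first-occurrence order,
    # so the chronological order of purchases is preserved (a set would not)
    return list(dict.fromkeys(list1))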

def generate_sequence(serie, n_target):
    input_sequence = []
    output_sequence = []
    for x in serie:
        x = unique(x)
        if len(x)>n_target:
            input_sequence.append(x[:-n_target])
            output_sequence.append(x[-n_target:])
    return input_sequence, output_sequence

def agg(x, corp, sample_size=1):
    diff = np.setdiff1d(corp, list(x))
    ind = np.random.permutation(len(diff))
    return diff[ind[:int(sample_size*len(x))]]
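
To make the input/target split concrete, here is a minimal sketch on toy data. The helper names `unique_in_order` and `split_history` are hypothetical stand-ins for the functions above, and the sketch assumes that duplicates are removed in first-occurrence order, so the held-out target is the most recently introduced distinct product.

```python
def unique_in_order(items):
    """Drop repeats while keeping first-occurrence order (assumed behaviour)."""
    return list(dict.fromkeys(items))


def split_history(histories, n_target):
    """Split each history into (all but the last n_target, last n_target)."""
    inputs, targets = [], []
    for history in histories:
        distinct = unique_in_order(history)
        # Histories that are too short to leave any input are skipped
        if len(distinct) > n_target:
            inputs.append(distinct[:-n_target])
            targets.append(distinct[-n_target:])
    return inputs, targets


# Toy purchase histories (made-up product codes)
histories = [["A", "B", "A", "C"], ["D"], ["E", "F"]]
inputs, targets = split_history(histories, n_target=1)
print(inputs)   # [['A', 'B'], ['E']]
print(targets)  # [['C'], ['F']]
```

The single-product history `["D"]` is dropped because nothing would remain as input once the target is held out.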

Generating each customer's purchase sequence and counting the distinct products it contains.

In [None]:
by_customer = df.groupby("Customer ID", as_index=False).agg(
    {"StockCode": [lambda x: list(x)]}
)
sequential_df = by_customer["StockCode"].rename(
    columns={"<lambda>":"purchase_sequence"}
)
sequential_df["CustomerID"] = by_customer["Customer ID"]
sequential_df["product_count"] = sequential_df["purchase_sequence"].apply(
    lambda x: len(unique(list(x)))
)

We choose some hyperparameter values arbitrarily. A frequency threshold, *n_frequency*, reduces the sparsity of the data: customers who bought only one or two distinct products out of a very large catalogue are dropped. This sparsity is closely related to the *cold start* problem. For detailed information, please see:

Lü, L., Medo, M., Yeung, C. H., Zhang, Y., Zhang, Z., Zhou, T., (2012), Recommender systems, Physics Reports 519, 1-49, DOI: http://dx.doi.org/10.1016/j.physrep.2012.02.006

In [None]:
n_target = 1
n_frequency = 3
corp = sequential_df.explode("purchase_sequence")["purchase_sequence"].unique()
# .copy() avoids pandas' SettingWithCopyWarning when columns are added below
frequent_df = sequential_df[(sequential_df["product_count"]>n_frequency)].copy()

input_seq, output_seq = generate_sequence(
    frequent_df["purchase_sequence"],
    n_target
    )

frequent_df["input_sequence"] = input_seq
frequent_df["output_sequence"] = output_seq
frequent_df = frequent_df[["CustomerID", "input_sequence", "output_sequence"]]
frequent_df = frequent_df.explode("input_sequence")
frequent_df["purchase"] = 1
frequent_df = frequent_df.set_index("CustomerID", drop=True)
frequent_df.head(10)
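
The effect of the frequency threshold on sparsity can be illustrated with a small, self-contained sketch. The toy `purchases` dictionary and the `density` helper are assumptions made for illustration only; density is taken here as the share of filled cells in the customer-product matrix.

```python
# Toy purchase histories keyed by customer (made-up data)
purchases = {
    "c1": ["A", "B", "A"],
    "c2": ["A", "B", "C", "D", "E"],
    "c3": ["F"],
    "c4": ["B", "C", "D", "E", "B"],
}
n_frequency = 3


def density(histories):
    """Fraction of (customer, product) pairs that have at least one purchase."""
    products = {p for h in histories.values() for p in h}
    interactions = sum(len(set(h)) for h in histories.values())
    return interactions / (len(histories) * len(products))


# Keep customers with more than n_frequency distinct products
frequent = {c: h for c, h in purchases.items() if len(set(h)) > n_frequency}

print(sorted(frequent))                        # ['c2', 'c4']
print(density(purchases), density(frequent))   # 0.5 0.9
```

Raising the threshold gives a denser matrix but discards the customers with the least history, so it trades coverage for signal.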

# Negative Sampling

Since all instances prepared so far represent positive-only feedback, we try to supply some negative information to the model. Instead of providing all non-purchased products, some negative instances are chosen randomly from products not purchased for a particular customer.

> sample_size=1

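means that one non-purchased product is drawn at random for each product in a customer's input sequence, so every customer ends up with as many negatives as positives.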

In [None]:
new_df = frequent_df.reset_index().groupby("CustomerID").agg({"input_sequence": (lambda x: list(x))})
new_df["agg"] = new_df["input_sequence"].apply(lambda y: agg(y, corp, 1))
ndf = new_df.explode("agg")[["agg"]]
ndf["purchase"] = 0
ndf = ndf.rename(columns={"agg":"input_sequence"})

pdf = frequent_df[["input_sequence", "purchase"]]

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
sample_df = pd.concat([pdf, ndf])
sample_df = sample_df.reset_index()
sample_df = sample_df.sort_values("CustomerID", ignore_index=True)

sample_df.info()  # info() prints its report and returns None
display(sample_df.head(50))
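
The role of `sample_size` can be checked on a toy catalogue. This is a minimal sketch, not the notebook's exact routine: `sample_negatives` is a hypothetical helper that uses NumPy's `Generator.choice` instead of a permutation, and it caps the draw at the number of products actually available.

```python
import numpy as np


def sample_negatives(purchased, catalogue, sample_size=1, rng=None):
    """Draw sample_size negatives per purchased product, without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    not_purchased = np.setdiff1d(catalogue, list(purchased))
    # Never ask for more negatives than there are non-purchased products
    n_draw = min(int(sample_size * len(purchased)), len(not_purchased))
    return rng.choice(not_purchased, size=n_draw, replace=False)


catalogue = np.array(["A", "B", "C", "D", "E", "F", "G", "H"])
purchased = ["A", "B", "C"]
rng = np.random.default_rng(0)

neg1 = sample_negatives(purchased, catalogue, sample_size=1, rng=rng)
neg2 = sample_negatives(purchased, catalogue, sample_size=2, rng=rng)
print(len(neg1), len(neg2))  # 3 5
```

With `sample_size=2` the request would be six negatives, but only five products remain unpurchased, so the draw is capped at five.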

# Encoding & Splitting

As a last step, we encode the customer and product identifiers as integer indices, following the [keras.io](https://keras.io/examples/structured_data/collaborative_filtering_movielens/) example. Here the data is split into training and validation sets only; a better practice is to hold out some samples in advance as test data.

In [None]:
cust_ids = sample_df["CustomerID"].unique().tolist()
cust2cust_encoded = {x: i for i, x in enumerate(cust_ids)}
cust_encoded2cust = {i: x for i, x in enumerate(cust_ids)}
prod_ids = corp
prod2prod_encoded = {x: i for i, x in enumerate(prod_ids)}
prod_encoded2prod = {i: x for i, x in enumerate(prod_ids)}
sample_df["cust"] = sample_df["CustomerID"].map(cust2cust_encoded)
sample_df["prod"] = sample_df["input_sequence"].map(prod2prod_encoded)

num_custs = len(cust2cust_encoded)
num_prods = len(prod2prod_encoded)
sample_df["purchase"] = sample_df["purchase"].values.astype(np.float32)

print(
    "Number of Customers: {}, Number of Products: {}, Purchase: {}, Not Purchase: {}".format(
        num_custs, num_prods, 1, 0
    )
)

sample_df = sample_df.sample(frac=1, random_state=52)
X = sample_df[["cust", "prod"]].values
y = sample_df["purchase"].values

train_indices = int(0.8 * sample_df.shape[0])
X_train, X_val, y_train, y_val = (X[:train_indices],
                                  X[train_indices:],
                                  y[:train_indices],
                                  y[train_indices:])
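
One way to hold out test data in advance is a three-way split. The sketch below is an assumption about how that could look, not part of the original pipeline; `three_way_split` and the 80/10/10 proportions are illustrative choices.

```python
import numpy as np


def three_way_split(X, y, val_frac=0.1, test_frac=0.1, seed=52):
    """Shuffle once, then carve out test and validation sets before training."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))


# Toy stand-ins for the encoded (cust, prod) pairs and purchase labels
X_demo = np.arange(40).reshape(20, 2)
y_demo = (np.arange(20) % 2).astype(np.float32)

train, val, test = three_way_split(X_demo, y_demo)
print(len(train[0]), len(val[0]), len(test[0]))  # 16 2 2
```

The test portion would then be touched only once, after model selection on the validation set is finished.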

# Base Model

In [None]:
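# Size the embedding tables from the encoders: encoded ids run up to
# len(encoder) - 1, which can exceed the number of ids present in sample_df
num_custs = len(cust2cust_encoded)
num_prods = len(prod2prod_encoded)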
hidden_units = (128,64)
prod_embedding_size = 8
user_embedding_size = 8

user_id_input = keras.Input(shape=(1,), name="user_id")
prod_id_input = keras.Input(shape=(1,), name="prod_id")
user_embedded = layers.Embedding(num_custs,
                                 user_embedding_size, 
                                 input_length=1,
                                 embeddings_regularizer=keras.regularizers.l2(1e-7),
                                 name="user_embedding")(user_id_input)
prod_embedded = layers.Embedding(num_prods,
                                 prod_embedding_size,
                                 input_length=1,
                                 embeddings_regularizer=keras.regularizers.l2(1e-6),
                                 name="prod_embedding")(prod_id_input)

concatenated = layers.Concatenate(name="concat")([user_embedded, prod_embedded])
out = layers.Flatten(name="flat")(concatenated)

for n_hidden in hidden_units:
    out = layers.Dense(n_hidden, activation="relu", kernel_regularizer=keras.regularizers.l2(0.001))(out)
    out = layers.Dropout(0.4)(out)

out = layers.Dense(1, activation="sigmoid", name="prediction")(out)

neural_model = keras.Model(inputs = [user_id_input, prod_id_input],
                           outputs = out, name="neural_model")
neural_model.summary()
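
In equation form, the network defined above can be summarised as follows. The symbols are notation introduced here for illustration ($\mathbf{p}_u$ and $\mathbf{q}_i$ for the 8-dimensional customer and product embeddings, $\mathbf{W}_l, \mathbf{b}_l$ for the dense-layer weights); dropout and the L2 penalties are left out for brevity.

$$
\begin{aligned}
\mathbf{z}_0 &= [\,\mathbf{p}_u \,;\, \mathbf{q}_i\,] \in \mathbb{R}^{16}, \\
\mathbf{z}_l &= \mathrm{ReLU}(\mathbf{W}_l \mathbf{z}_{l-1} + \mathbf{b}_l), \quad l = 1, 2 \quad (\text{128 and 64 units}), \\
\hat{y}_{ui} &= \sigma(\mathbf{w}^{\top} \mathbf{z}_2 + b).
\end{aligned}
$$

The output $\hat{y}_{ui}$ is read as the predicted probability that customer $u$ purchases product $i$, which is why the model is trained with binary cross-entropy over the labels $y_{ui} \in \{0, 1\}$:

$$
\mathcal{L} = -\frac{1}{N} \sum_{(u,i)} \Big[\, y_{ui} \log \hat{y}_{ui} + (1 - y_{ui}) \log (1 - \hat{y}_{ui}) \,\Big].
$$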

In [None]:
neural_model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
                     optimizer=keras.optimizers.Adam(learning_rate=0.001))

es = keras.callbacks.EarlyStopping(monitor='val_loss',
                                   mode='min',
                                   verbose=1,
                                   patience=5)

# Train on the explicit split prepared above instead of re-splitting sample_df
history = neural_model.fit(
    [X_train[:, 0], X_train[:, 1]],
    y_train,
    batch_size=256,
    epochs=50,
    callbacks=[es],
    verbose=1,
    validation_data=([X_val[:, 0], X_val[:, 1]], y_val)
    )

In [None]:
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "val"], loc="upper right")
plt.show()

# Model Evaluation

We measure model performance by scoring a set of candidate products for a customer and checking whether the held-out target product appears among the top recommendations. The same protocol as in [hakanerdem](https://www.kaggle.com/hakanerdem/recommender-system-with-embedding-layers) is followed.

In [None]:
cust_id = sample_df["CustomerID"].sample(1).iloc[0]
cust_encoder = cust2cust_encoded.get(cust_id)
purchased = frequent_df[(frequent_df.index==cust_id) & (frequent_df["purchase"]==1)]

candidates = frequent_df[~frequent_df["input_sequence"].isin(purchased["input_sequence"].values)]["input_sequence"][:49]
candidates = set(candidates).intersection(set(prod2prod_encoded.keys()))
candidates = candidates.union(set(frequent_df[frequent_df.index==cust_id]["output_sequence"].values[0]))
candidates = [[prod2prod_encoded.get(x)] for x in list(candidates)]

vals = neural_model.predict([np.array([cust_encoder] * len(candidates)), np.array(candidates)]).flatten()
top_ratings_indices = vals.argsort()[-20:][::-1]
recommended_prod_ids = [prod_encoded2prod.get(candidates[x][0]) for x in top_ratings_indices]

print("Showing recommendations for user: {}".format(cust_id))
print("====" * 12)
print("Products purchased by customer")
print("----" * 8)
print(frequent_df[frequent_df.index==cust_id])

print("----" * 8)
print("Top 20 product recommendations")
print("----" * 8)
print(recommended_prod_ids)

In [None]:
counter = 0
size = 100

for s in range(size):

    cust_id = sample_df["CustomerID"].values[s]
    cust_encoder = cust2cust_encoded.get(cust_id)
    purchased = frequent_df[(frequent_df.index==cust_id) & (frequent_df["purchase"]==1)]

    candidates = frequent_df[~frequent_df["input_sequence"].isin(purchased["input_sequence"].values)]["input_sequence"][:49]
    candidates = set(candidates).intersection(set(prod2prod_encoded.keys()))
    candidates = candidates.union(set(frequent_df[frequent_df.index==cust_id]["output_sequence"].values[0]))
    candidates = [[prod2prod_encoded.get(x)] for x in list(candidates)]

    vals = neural_model.predict([np.array([cust_encoder] * len(candidates)), np.array(candidates)]).flatten()
    top_ratings_indices = vals.argsort()[-20:][::-1]
    recommended_prod_ids = [prod_encoded2prod.get(candidates[x][0]) for x in top_ratings_indices]
    target_prod_ids = frequent_df.loc[(frequent_df.index==cust_id), "output_sequence"].values[0]

    if len(np.setdiff1d(target_prod_ids, recommended_prod_ids)) < n_target:
        counter = counter + 1

print("recall@20 for first", size, " input: ", counter/size)
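
The hit test inside the loop can be isolated into a small function. This is a sketch under assumptions: `hit_at_k` is a hypothetical helper, and it counts a hit when at least one target product is in the top `k`, which matches the loop's condition when `n_target = 1`.

```python
import numpy as np


def hit_at_k(scores, candidate_ids, target_ids, k=20):
    """Return True if any target is among the k highest-scoring candidates."""
    top = np.argsort(scores)[-k:][::-1]          # indices of the k best scores
    recommended = {candidate_ids[i] for i in top}
    return any(t in recommended for t in target_ids)


# Toy candidates with made-up model scores
candidate_ids = ["A", "B", "C", "D", "E"]
scores = np.array([0.10, 0.80, 0.30, 0.95, 0.50])

print(hit_at_k(scores, candidate_ids, ["B"], k=2))  # True
print(hit_at_k(scores, candidate_ids, ["E"], k=2))  # False
```

When reading the reported figure, note that the candidate pool holds at most 50 products, so a ranker that scores at random would already place the target in the top 20 at a rate of at least 20/50 = 0.4.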

# Critique

Criticism is more important than upvotes. The base model certainly has plenty of flaws, so please criticise it. It is also good practice to compare it with other notebooks built on the same dataset, such as my first study with embedding layers.

Some hyperparameters that should be tuned:

* Number and unit numbers of *hidden_units*
* *prod_embedding_size* and *user_embedding_size*
* regularizers, learning rate, activation functions, batch size, dropout rates
* *n_frequency*: the threshold on the number of distinct products a customer must have purchased to be kept
* *sample_size*: the number of negative samples drawn per positive sample
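
A simple starting point for tuning is to enumerate a small grid of configurations. The sketch below only generates the configurations; the value lists are illustrative assumptions (apart from the values already used in this notebook), and training one model per configuration is left out.

```python
from itertools import product

# Candidate values per hyperparameter; the first entry of each list is the
# value used in this notebook, the others are illustrative guesses
search_space = {
    "hidden_units": [(128, 64), (64, 32)],
    "embedding_size": [8, 16],
    "dropout": [0.4, 0.2],
    "learning_rate": [1e-3, 1e-4],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 16
print(configs[0])
```

Each configuration would be used to rebuild and refit the model, keeping the one with the lowest validation loss.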

Thanks in advance.