# Data description

```python
from data import Dataset

data = Dataset(csvfile='interactions.csv',
               num_test_users=500,
               sample=0.15,
               cut_item=100)

train, test = data.get_train_test_sequences()
```

Output:

```
#User: 7499	#Item: 4698
796479 49833
```


After sampling and removing cold items (fewer than 100 interactions) and cold users (fewer than 20 ratings), a small dataset was created:
* Users: 7499
* Items: 4698
* Interactions: 846,312

500 users were selected for testing. After the train/test split, the training set contains 796,479 interactions and the test set contains 49,833.
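The cold-item and cold-user filtering described above can be sketched in plain Python. This is only an illustration, not the code inside `Dataset`: the function name `filter_cold`, the `(user, item)` pair format, and the single-pass order (items first, then users) are all assumptions.

```python
from collections import Counter


def filter_cold(interactions, min_item_count=100, min_user_count=20):
    """Drop cold items, then cold users, from a list of (user, item) pairs.

    Hypothetical helper: the thresholds mirror the ones quoted in the text,
    but the filtering order and the single pass are assumptions.
    """
    # Count how often each item appears and keep only the popular enough ones.
    item_counts = Counter(item for _, item in interactions)
    kept = [(u, i) for u, i in interactions if item_counts[i] >= min_item_count]

    # Count the remaining interactions per user and drop users with too few.
    user_counts = Counter(u for u, _ in kept)
    return [(u, i) for u, i in kept if user_counts[u] >= min_user_count]


# Tiny usage example with lowered thresholds so the effect is visible.
toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "c")]
filtered = filter_cold(toy, min_item_count=2, min_user_count=1)
```

A single pass does not guarantee that every surviving item still meets the item threshold after users are removed; repeating the two steps until nothing changes would enforce both conditions at once.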

# Model

A **causal CNN** was applied to capture sequence features. To keep the convolution causal, **zero-padding** must be applied to the left (past) side of the sequence before convolution, so that each output position depends only on the current and earlier items. The following figure shows the basic architecture of the network, which contains only one convolution layer. More convolution layers can be stacked: in theory, a deeper network may produce better results, but it may also over-fit. Whether a deeper CNN is actually better is something I will test by experiment.

The loss function is **BPR loss** with **negative sampling**. Hinge loss or a pointwise loss can also be used, but BPR loss works better on this dataset.
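For reference, the BPR loss for one (positive, negative) pair is usually written as $-\log \sigma(s_{pos} - s_{neg})$, where $s_{pos}$ and $s_{neg}$ are the model scores of the observed item and the sampled negative item. A minimal sketch follows, assuming scalar scores and no regularization term; the function name `bpr_loss` is illustrative and not taken from the project code.

```python
import math


def bpr_loss(pos_score, neg_score):
    """BPR loss for a single pair: -log(sigmoid(pos_score - neg_score)).

    Computed as softplus(neg_score - pos_score) in a numerically stable form,
    so large score differences do not overflow math.exp.
    """
    x = neg_score - pos_score
    # softplus(x) = log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# The loss shrinks as the positive item is ranked further above the negative.
loss_good = bpr_loss(pos_score=3.0, neg_score=0.0)
loss_bad = bpr_loss(pos_score=0.0, neg_score=3.0)
```

In training, this value would be averaged over all sampled pairs in a batch.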

<img src="img/seq_cnn.png">
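The effect of left zero-padding can be sketched in plain Python. This is a simplified illustration on a sequence of scalars, whereas the actual model convolves over item embeddings; the function name `causal_conv1d` and the example kernel are assumptions.

```python
def causal_conv1d(sequence, kernel):
    """Causal 1D convolution (cross-correlation) over a list of floats.

    The input is left-padded with len(kernel) - 1 zeros, so output[t] depends
    only on sequence[0..t] and the output has the same length as the input.
    """
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(sequence)
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(sequence))
    ]


x = [1.0, 2.0, 3.0, 4.0]
k = [0.5, 0.5]  # hypothetical kernel: mean of the previous and current step
y = causal_conv1d(x, k)  # [0.5, 1.5, 2.5, 3.5]
```

Because the padding sits only on the left, changing a later element of `x` never changes an earlier element of `y`, which is the property that prevents the model from seeing future items.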

# Experimental study

Since this is a top-N recommendation task, the following top-N metrics were used:
* Precision@20
* Recall@20
* NDCG@20
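The three metrics above can be computed per user as in the sketch below. It assumes binary relevance (an item is either in the user's test set or not) and the common log2 discount for NDCG; the exact definitions used for the reported numbers are not stated here, so treat this as one plausible reading. The function name is hypothetical.

```python
import math


def precision_recall_ndcg_at_k(ranked_items, relevant_items, k=20):
    """Return (precision@k, recall@k, ndcg@k) for one user.

    ranked_items: recommended items, best first.
    relevant_items: the user's held-out test items (binary relevance assumed).
    """
    relevant = set(relevant_items)
    hits = [1 if item in relevant else 0 for item in ranked_items[:k]]

    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0

    # DCG discounts a hit at 0-based rank r by log2(r + 2).
    dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
    # Ideal DCG: all relevant items (at most k) placed at the top.
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg


# One hit at rank 1 out of two relevant items, evaluated at k = 2.
p, r, n = precision_recall_ndcg_at_k([1, 2, 3, 4], {1, 3}, k=2)
```

The per-user values would then be averaged over the test users to obtain a single score per metric.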

First, I compare a popularity baseline (POP), matrix factorization with BPR loss (BPR-MF), and the sequence-based CNN (Seq-CNN). The following table shows the results. BPR-MF performs only slightly better than POP (3.80% vs. 3.62% Precision@20). One reason is that the BPR-MF model was not well tuned. Seq-CNN performs much better than the POP baseline (5.34% Precision@20), because the sequence model can capture changes in users' tastes. In other words, it is time-aware.

The sequence model works better still if we change the negative-sampling distribution (Seq-CNN-PN). The more popular an item is, the more likely it is that the user knows about it, so the absence of an interaction is stronger evidence that the user is not interested. Popular items should therefore be sampled as negatives with higher probability. I sample item $i$ with probability proportional to $\log(c(i)+1)$, where $c(i)$ is the appearance count of item $i$.

|  Model        | Precision@20 | Recall@20 | NDCG@20 |
| ------------- | ------------ | --------- | ------- |
| POP           | 3.62%        | 8.15%     | 0.2596  |
| BPR-MF        | 3.80%        | 8.55%     | 0.2744  |
| Seq-CNN       | 5.34%        | 10.50%    | 0.3798  |
| Seq-CNN-PN    | 5.49%        | 11.19%    | 0.3949  |
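The popularity-weighted negative sampling behind Seq-CNN-PN can be sketched with the standard library alone. The weights follow the $\log(c(i)+1)$ rule from the text; everything else (the function names, rejecting items the user has already interacted with, and the `max_tries` safeguard) is an assumption for illustration.

```python
import math
import random
from collections import Counter


def build_negative_sampler(item_occurrences, seed=None):
    """Build a sampler that draws item i with weight log(c(i) + 1).

    item_occurrences: iterable with one entry per interaction (the item id),
    so that counting entries gives the appearance count c(i).
    """
    counts = Counter(item_occurrences)
    items = list(counts)
    weights = [math.log(counts[i] + 1) for i in items]
    rng = random.Random(seed)

    def sample_negative(user_items, max_tries=1000):
        # Rejection sampling: redraw until the item is new to this user.
        for _ in range(max_tries):
            candidate = rng.choices(items, weights=weights, k=1)[0]
            if candidate not in user_items:
                return candidate
        raise RuntimeError("no negative item found; user may have seen all items")

    return sample_negative


# Item "a" appears 100 times and "b" once, so "a" is drawn far more often.
sampler = build_negative_sampler(["a"] * 100 + ["b"], seed=0)
draws = [sampler(set()) for _ in range(2000)]
```

The logarithm dampens the popularity effect: an item seen 100 times is only about 6.7 times as likely to be drawn as an item seen once, rather than 100 times as likely.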

# TO-DO

* Deeper convolution layers?
* Wider embedding dimension?
* Different activation functions, e.g. ReLU, Tanh?
* Residual CNN?
* Conv2d instead of Conv1d?
* RNN (LSTM, GRU) does not work. Why?