While examining the source code, a bug was found that affected the evaluation of the distance to distribution. Once the bug was fixed, the following neural network configurations were tested:
* no sigmoid activation on any layer;
* sigmoid activation on hidden layers only;
* sigmoid activation on all layers.

Each configuration has 2 hidden layers with 20 neurons.
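
The three configurations differ only in where the sigmoid is applied during the forward pass. The sketch below illustrates that difference and is not the code used for the experiments: the input size of 4 (the four previous-window popularities), the single output, the 20 neurons per hidden layer, the weight initialisation, and the names `build_network`, `forward`, and `mode` are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def build_network(n_inputs, rng, hidden=(20, 20), n_outputs=1):
    # Each layer is (weights, biases); weights has one row per neuron.
    # Layer sizes and the uniform initialisation are assumptions.
    sizes = [n_inputs, *hidden, n_outputs]
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, inputs, mode):
    # mode: "none"   -> purely affine layers, no sigmoid anywhere
    #       "hidden" -> sigmoid after hidden layers, linear output
    #       "all"    -> sigmoid after every layer, output in (0, 1)
    if mode not in ("none", "hidden", "all"):
        raise ValueError(f"unknown mode: {mode}")
    values = list(inputs)
    last = len(layers) - 1
    for idx, (weights, biases) in enumerate(layers):
        values = [sum(w * v for w, v in zip(row, values)) + b
                  for row, b in zip(weights, biases)]
        if mode == "all" or (mode == "hidden" and idx != last):
            values = [sigmoid(v) for v in values]
    return values


if __name__ == "__main__":
    rng = random.Random(0)
    net = build_network(4, rng)
    sample = [2e-5, 1e-5, 3e-5, 1e-5]  # made-up popularity history
    for mode in ("none", "hidden", "all"):
        print(mode, forward(net, sample, mode))
```

Note that with `mode="none"` the whole network collapses to a single affine map of its inputs, while with `mode="all"` the prediction is confined to the interval (0, 1).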

The tests were performed on 3 datasets, which were produced from generated traces with 2 types of population.

__Case 1: single population.__<br>
Arrivals: Poisson process with a mean time between arrivals of 20.0.<br>
Popularity: Zipf with parameter 0.8.<br>
Number of items: 100 000.<br>

__Case 2: mixed 2 populations.__<br>
Arrivals: Poisson process with a mean time between arrivals of 40.0 for each population.<br>
Popularity: Zipf with parameter 0.8 for both populations, but for the second population the IDs are randomly shuffled in each time window.<br>
Number of items: 50 000 in each population.<br>
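
The two cases above can be sketched with the standard library alone. This is a minimal illustration, not the generator used for the experiments: the function names, the `shuffle_window` length, and the use of `id_offset` to keep the two populations' IDs disjoint are assumptions. A Poisson arrival process is simulated by drawing exponential inter-arrival times, and Zipf popularity by sampling rank `r` with weight `r ** -alpha`.

```python
import heapq
import itertools
import random


def zipf_cum_weights(n_items, alpha):
    # Cumulative (unnormalised) Zipf weights: rank r has weight r ** -alpha.
    return list(itertools.accumulate(
        rank ** -alpha for rank in range(1, n_items + 1)))


def generate_trace(n_requests, n_items, alpha, mean_interarrival, rng,
                   id_offset=0, shuffle_window=None):
    # Returns a list of (arrival_time, item_id) tuples in time order.
    cum = zipf_cum_weights(n_items, alpha)
    ids = list(range(n_items))  # maps popularity rank -> item ID
    t, current_window, trace = 0.0, 0, []
    for _ in range(n_requests):
        # Exponential gaps between requests give a Poisson process.
        t += rng.expovariate(1.0 / mean_interarrival)
        if shuffle_window is not None:
            window = int(t // shuffle_window)
            if window != current_window:
                # Re-assign IDs to ranks when a new time window starts.
                current_window = window
                rng.shuffle(ids)
        rank = rng.choices(range(n_items), cum_weights=cum, k=1)[0]
        trace.append((t, id_offset + ids[rank]))
    return trace


def mix_traces(*traces):
    # Merge time-sorted traces into one time-sorted trace.
    return list(heapq.merge(*traces))


if __name__ == "__main__":
    rng = random.Random(42)
    # Case 1 (scaled down): one population, mean inter-arrival time 20.0.
    case1 = generate_trace(1000, 1000, 0.8, 20.0, rng)
    # Case 2 (scaled down): two populations, mean inter-arrival time 40.0
    # each; the second one is reshuffled every (assumed) 5000 time units.
    pop_a = generate_trace(500, 500, 0.8, 40.0, rng)
    pop_b = generate_trace(500, 500, 0.8, 40.0, rng,
                           id_offset=500, shuffle_window=5000.0)
    case2 = mix_traces(pop_a, pop_b)
    print(len(case1), len(case2))
```

Merging two independent Poisson processes with a mean inter-arrival time of 40.0 each yields a combined process with a mean inter-arrival time of 20.0, so the overall request rate matches case 1.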

**Dataset 1** is generated from the case 1 population only, **Dataset 2** from the case 2 populations without keeping the class label, and **Dataset 3** from the case 2 populations keeping the class label. Each dataset consists of 6 columns: the ID of the object, the popularities in the 4 previous time windows, and the popularity in the 5th time window (which the NN should try to predict). **Dataset 3** additionally contains a column with the class label (0 or 1).

During the tests it was observed that the neural network trained better on **Dataset 2** than on **Dataset 3**, which shouldn't be the case, since **Dataset 3** contains more information. It was therefore decided to transform **Dataset 3** so that the 4 previous-window popularity columns become 8 columns. If the item's class label is 0, the first 4 of the 8 new columns hold the popularity values and the last 4 are zero. If the class label is 1, the last 4 columns hold the popularity values and the first 4 are zero.
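
The transformation can be sketched as a per-row function. The column order (ID, 4 previous popularities, target, label) and the name `expand_by_label` are assumptions for illustration, as is dropping the label column from the output once it has been encoded in the column positions.

```python
def expand_by_label(row):
    # row: [item_id, p1, p2, p3, p4, target, label]  (assumed layout)
    if len(row) != 7:
        raise ValueError("expected 7 columns")
    item_id, history, target, label = row[0], list(row[1:5]), row[5], row[6]
    zeros = [0.0] * 4
    if label == 0:
        features = history + zeros   # class 0 fills the first 4 columns
    elif label == 1:
        features = zeros + history   # class 1 fills the last 4 columns
    else:
        raise ValueError(f"unexpected class label: {label}")
    return [item_id, *features, target]


if __name__ == "__main__":
    # Made-up rows, one per class.
    print(expand_by_label([17, 3e-5, 2e-5, 4e-5, 3e-5, 2e-5, 0]))
    print(expand_by_label([42, 1e-5, 5e-5, 2e-5, 1e-5, 3e-5, 1]))
```

With this layout the network can learn separate input weights for each class, which is not possible when the label is just one more numeric column.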

The results are presented below:

## 1. No sigmoid activation
### Dataset 1. Distance to distribution.
<img src="no_sigmoid/case1_dist_plot.png" style="width: 700px;"/> 
### Dataset 1. Ordering.
<img src="no_sigmoid/case1_order_plot.png" style="width: 700px;"/> 
### Dataset 2. Distance to distribution.
<img src="no_sigmoid/case2_no_label_dist_plot.png" style="width: 700px;"/> 
### Dataset 2. Ordering.
<img src="no_sigmoid/case2_no_label_order_plot.png" style="width: 700px;"/> 
### Dataset 3. Distance to distribution.
<img src="no_sigmoid/case2_with_label_dist_plot.png" style="width: 700px;"/> 
### Dataset 3. Ordering.
<img src="no_sigmoid/case2_with_label_order_plot.png" style="width: 700px;"/>
 
As the plots show, this configuration ordered the items reasonably well in all cases, even though the predicted popularities can be far from the real popularities.

## 2. Sigmoid activation on hidden layers.
### Dataset 1. Distance to distribution.
<img src="middle_sigmoid/case1_dist_plot.png" style="width: 700px;"/> 
### Dataset 1. Ordering.
<img src="middle_sigmoid/case1_order_plot.png" style="width: 700px;"/> 
### Dataset 2. Distance to distribution.
<img src="middle_sigmoid/case2_no_label_dist_plot.png" style="width: 700px;"/> 
### Dataset 2. Ordering.
<img src="middle_sigmoid/case2_no_label_order_plot.png" style="width: 700px;"/> 
### Dataset 3. Distance to distribution.
<img src="middle_sigmoid/case2_with_label_dist_plot.png" style="width: 700px;"/> 
### Dataset 3. Ordering.
<img src="middle_sigmoid/case2_with_label_order_plot.png" style="width: 700px;"/>

This configuration also ordered the items reasonably well, but the predicted popularities show less variability and are still not very accurate.

## 3. Sigmoid activation on all layers.
### Dataset 1. Distance to distribution.
<img src="all_sigmoid/case1_dist_plot.png" style="width: 700px;"/> 
### Dataset 1. Ordering.
<img src="all_sigmoid/case1_order_plot.png" style="width: 700px;"/> 
### Dataset 2. Distance to distribution.
<img src="all_sigmoid/case2_no_label_dist_plot.png" style="width: 700px;"/> 
### Dataset 2. Ordering.
<img src="all_sigmoid/case2_no_label_order_plot.png" style="width: 700px;"/> 
### Dataset 3. Distance to distribution.
<img src="all_sigmoid/case2_with_label_dist_plot.png" style="width: 700px;"/> 
### Dataset 3. Ordering.
<img src="all_sigmoid/case2_with_label_order_plot.png" style="width: 700px;"/>

The last configuration, however, ordered the items in reverse for all 3 datasets. In addition, the predicted popularity is almost the same for every item and close to the average popularity, 1e-5.
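
As a side calculation, the value 1e-5 is what one would expect if popularities are normalized to sum to 1 over the $N = 100\,000$ items (an assumption about the normalization, consistent with the item counts above):

$$
\bar{p} = \frac{1}{N} = \frac{1}{100\,000} = 10^{-5}
$$

For a sigmoid output neuron to produce this value, its pre-activation $z$ must satisfy $\sigma(z) = \bar{p}$, which gives

$$
z = \ln\frac{\bar{p}}{1-\bar{p}} \approx -11.51,
\qquad
\sigma'(z) = \sigma(z)\,\bigl(1-\sigma(z)\bigr) \approx 10^{-5}
$$

So the output sigmoid operates deep in its saturated region, where its derivative is tiny. This may be related to the nearly constant predictions, although the tests above do not confirm it.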

After rerunning the training a few times, the neural networks were able to order the items correctly, but the predicted popularity behaved in the same way.

| Distance to distribution | Ordering |
|-|-|
| Dataset 1. Distance to distribution. | Dataset 1. Ordering. |
|<img src="all_sigmoid_fix/case1_dist_plot.png" style="width: 700px;"/>|<img src="all_sigmoid_fix/case1_order_plot.png" style="width: 700px;"/>|
| Dataset 2. Distance to distribution. | Dataset 2. Ordering. |
|<img src="all_sigmoid_fix/case2_no_label_dist_plot.png" style="width: 700px;"/>|<img src="all_sigmoid_fix/case2_no_label_order_plot.png" style="width: 700px;"/>|
| Dataset 3. Distance to distribution. | Dataset 3. Ordering. |
|<img src="all_sigmoid_fix/case2_with_label_dist_plot.png" style="width: 700px;"/>|<img src="all_sigmoid_fix/case2_with_label_order_plot.png" style="width: 700px;"/>|

Attempts to change the number of layers, the number of neurons in the hidden layers, and the learning rate produced the same behaviour.