#  Normalization test of split and metrics functions 

In this notebook we check the consistency of the split and metrics functions. We first check this using an artificial dataset and then using the movielens 100k and 1m datasets used in the recommender tests.

In [1]:
# set the environment path to find Recommenders
import sys
sys.path.append("../../")
import os

import numpy as np 
import pandas as pd
import itertools
import time 

from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_random_split, python_stratified_split
from reco_utils.dataset.numpy_splitters import numpy_stratified_split
from reco_utils.dataset.sparse import AffinityMatrix

from reco_utils.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k

print("System version: {}".format(sys.version))
print("Pandas version: {}".format(pd.__version__))

System version: 3.6.6 |Anaconda custom (64-bit)| (default, Jun 28 2018, 11:07:29) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Pandas version: 0.23.4


## 1 Artificial dataset 

For debugging purposes it is useful to generate random sparse matrices. The function `affinity_matrix()` generates a random rating matrix with a specified degree of sparsity. Realistic user/item affinity matrices are highly sparse; for example, the sparsity of the movielens datasets is $\ge 90$%, depending on the chosen data size, e.g. movielens 100k $\simeq 93$%, movielens 1m $\simeq 95$%. Once the sparsity is fixed, the remaining ratings are sampled with equal probabilities. In a realistic dataset, rating probabilities are not uniform.

In [2]:
def affinity_matrix(users, items, ratings, spars):

    '''
    Generate a random user/item affinity matrix. By increasing the likelihood of 0 elements we simulate 
    a typical recommending situation where the input matrix is highly sparse. 
    
    Args: 
        users (int): number of users (rows).
        items (int): number of items (columns).
        ratings (int): rating scale, e.g. 5 meaning ratings are from 1 to 5.
        spars (float): probability of obtaining zero. This roughly corresponds to the sparsity 
               of the generated matrix. If spars = 0 then the affinity matrix is dense. 
    
    Returns: 
        X (np array, int): sparse user/item affinity matrix 
    
    '''
    
    np.random.seed(123)

    # probability of 0 (unrated) first, then a uniform probability for each of the `ratings` values
    P = [spars] + [(1 - spars) / ratings] * ratings
    
    # generate the user/item affinity matrix. Ratings are from 1 to `ratings`, with 0s denoting unrated items
    X = np.random.choice(ratings + 1, (users, items), p=P)
    
    return X

In [5]:
# generate the random sparse matrix. In this example we choose a ~80% sparsity and the matrix dimensions are chosen 
# with the same item/user ratio (~1.78) as the movielens 100k dataset  
X = affinity_matrix(users=30,items=53, ratings= 5, spars= 0.8)
X

array([[0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [2, 0, 2, ..., 0, 3, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [6]:
# check the sparsity of the dataset

zero = (X == 0).sum()  # number of unrated items
total = X.shape[0] * X.shape[1]  # number of elements in the matrix
sparsness = zero / total * 100  # Percentage of zeros in the matrix

sparsness 

79.37106918238995

In [7]:
#number of ratings per user
rated = np.sum(X != 0, axis=1)
rated

array([ 7,  6, 13,  9, 12, 12, 11, 11, 11, 14, 10,  9, 15, 11, 15, 11,  9,
        9,  3,  9,  8, 13, 12, 11, 13, 16, 19, 10, 12,  7])

In [8]:
#Total number of rated items 
total_rated = rated.sum()
total_rated

328

In order to simulate the recommendation task, we split this dataset into train and test sets and use the latter to evaluate precision@k. Below, we compare different split strategies: the **global** split of `python_random_split` and the **local** (per-user) splits of `python_stratified_split` and `numpy_stratified_split`. In order to apply the python splitters we first need to map X to a dataframe representation.

In [10]:
def sparse_to_df(X):

        """
        Map the user/affinity matrix to a pd dataframe

        """
        m, n = X.shape #obtain the matrix dimensions: m = #users, n=#items 

        userids = []

        for i in range(1, m+1):
            userids.extend([i]*n)


        itemids = [i for i in range(1, n+1)]*m
        ratings = np.reshape(X, -1)

        #create dataframe
        results = pd.DataFrame.from_dict(
                        {
                            'user': userids,
                            'item': itemids,
                            'rating': ratings,
                         }
                    )

        #here we eliminate the missing ratings to obtain a standard form of the df as that of real dataframe.
        results = results[results.rating !=0]

        return results

In [11]:
#create a pandas df 
X_df = sparse_to_df(X)
X_df.head(10)

Unnamed: 0,user,item,rating
6,1,7,5
21,1,22,2
37,1,38,3
38,1,39,4
44,1,45,2
47,1,48,5
51,1,52,1
58,2,6,2
64,2,12,2
71,2,19,3


### 1.1 Splitting 

Splitting data in an unsupervised setting generally works differently than in the supervised one. 

Let us first consider a typical supervised learning problem, for example a binary classification problem. We are given a matrix $X^{\mu}_i$, where $\mu \in [1, m]$ is the example index and $i \in [1,n]$ is the feature index. We are also given a ground truth vector $y^{\mu}$. The matrix $X$ is generally dense and we want to select a fraction `t` of the examples for the training set and `(1-t)` for the test set. In this case the training matrix $X_{tr}$ contains the **same** number of features (columns) but only a subset of the examples (rows), and we split $y^{\mu}$ accordingly.

In the unsupervised case, no ground truth vector is provided. In the recommendation case, the user/item affinity matrix contains the ratings as training examples; the only way to verify whether a recommendation is correct is to hold out part of the ratings for the test set: the ratings are in this case the examples. For each user we can then verify whether the prediction is correct by comparing it to the test set. Since the number of ratings varies from user to user, we need to make sure that each user contributes the same proportion of train/test examples.
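The per-user split described above can be sketched on a small matrix as follows. This is only a minimal illustration, not the implementation used by `numpy_stratified_split`: the helper name `per_user_split` and the rounding rule for the number of training items are assumptions made for the example.

```python
import numpy as np

def per_user_split(X, ratio=0.75, seed=123):
    """Hypothetical helper: move (1 - ratio) of each user's ratings to the test matrix."""
    rng = np.random.default_rng(seed)
    Xtr = X.copy()
    Xtst = np.zeros_like(X)
    for u in range(X.shape[0]):
        rated_idx = np.flatnonzero(X[u])  # columns rated by user u
        # ASSUMPTION: the train count is ratio * n rounded to the nearest integer
        n_test = len(rated_idx) - int(np.around(ratio * len(rated_idx)))
        test_idx = rng.choice(rated_idx, size=n_test, replace=False)
        Xtst[u, test_idx] = X[u, test_idx]  # held-out ratings go to the test matrix
        Xtr[u, test_idx] = 0                # and are removed from the train matrix
    return Xtr, Xtst

# toy matrix: 3 users, 8 items, 0 = unrated
X_toy = np.array([[5, 0, 3, 0, 1, 4, 0, 2],
                  [0, 2, 0, 0, 4, 0, 5, 1],
                  [1, 1, 0, 0, 0, 0, 0, 3]])

Xtr_toy, Xtst_toy = per_user_split(X_toy)
print(np.sum(Xtr_toy != 0, axis=1), np.sum(Xtst_toy != 0, axis=1))
```

Every rating ends up in exactly one of the two matrices, and every user keeps ratings in both of them.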

### 1.1.1 Python Random splitter

In [12]:
#split data using the random split 
Xtr_random, Xtst_random = python_random_split(X_df, ratio = 0.75, seed= 123)

In [13]:
Xtst_random.head() 

Unnamed: 0,user,item,rating
85,2,33,2
763,15,22,5
286,6,22,5
1139,22,27,2
1362,26,38,4


In [14]:
print( 'global % of rated items in the train set', (len(Xtr_random.rating)/total_rated)*100 )
print( 'global % of rated items in the test set', (len(Xtst_random.rating)/total_rated)*100 )

global % of rated items in the train set 75.0
global % of rated items in the test set 25.0


Let us now check the per-user percentage of train/test set examples.

In [15]:
#per user % of training example 
(Xtr_random.groupby(by= 'user').rating.count()/total_rated)*100 

user
1     1.524390
2     1.219512
3     3.048780
4     1.219512
5     2.743902
6     2.134146
7     3.353659
8     2.134146
9     2.743902
10    3.048780
11    2.743902
12    2.743902
13    3.963415
14    2.134146
15    3.963415
16    2.134146
17    1.829268
18    2.134146
20    1.829268
21    2.134146
22    3.353659
23    2.134146
24    2.743902
25    3.353659
26    3.963415
27    4.573171
28    2.439024
29    1.829268
30    1.829268
Name: rating, dtype: float64

The random splitter can leave some users entirely out of the train or the test set. Above, for example, user 19 has no training examples.

In [16]:
#per user % of test example 
(Xtst_random.groupby(by= 'user').rating.count()/total_rated)*100 

user
1     0.609756
2     0.609756
3     0.914634
4     1.524390
5     0.914634
6     1.524390
8     1.219512
9     0.609756
10    1.219512
11    0.304878
13    0.609756
14    1.219512
15    0.609756
16    1.219512
17    0.914634
18    0.609756
19    0.914634
20    0.914634
21    0.304878
22    0.609756
23    1.524390
24    0.609756
25    0.609756
26    0.914634
27    1.219512
28    0.609756
29    1.829268
30    0.304878
Name: rating, dtype: float64
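Comparing the user indices of the two listings above shows which users the random split left out. The sketch below transcribes those indices by hand (the variable names are chosen for the example only) and uses plain set operations.

```python
# user indices transcribed from the train and test listings above
all_users = set(range(1, 31))
train_users = set(range(1, 19)) | set(range(20, 31))
test_users = set(range(1, 7)) | set(range(8, 12)) | set(range(13, 31))

missing_in_train = sorted(all_users - train_users)
missing_in_test = sorted(all_users - test_users)
in_both = train_users & test_users

print("users without training examples:", missing_in_train)
print("users without test examples:", missing_in_test)
print("users present in both sets:", len(in_both))
```

A per-user train/test ratio can only be formed for the users present in both sets.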

In [17]:
# mean of the per-user train/test ratio
np.mean(Xtr_random.groupby(by= 'user').rating.count()/(Xtst_random.groupby(by= 'user').rating.count()))

3.6209876543209876

In [18]:
# standard deviation of the per-user train/test ratio
np.std(Xtr_random.groupby(by= 'user').rating.count()/(Xtst_random.groupby(by= 'user').rating.count()))

2.089984900572549

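The fluctuations around the mean train/test ratio are too large: the standard deviation ($\simeq 2.09$) is more than half of the mean ($\simeq 3.62$). Note also that users missing from either the train or the test set do not contribute to these statistics.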

### 1.1.2 Python Stratified splitter 

In [19]:
# split data using the stratified split 
Xtr_strat, Xtst_strat = python_stratified_split(X_df, ratio = 0.75, seed= 123, col_user ='user', col_item= 'item')

In [20]:
Xtst_strat.head() 

Unnamed: 0,user,item,rating
47,1,48,5
51,1,52,1
71,2,19,3
91,2,39,5
125,3,20,5


In [21]:
print( 'global % of rated items in the train set', (len(Xtr_strat.rating)/total_rated)*100 )
print( 'global % of rated items in the test set', (len(Xtst_strat.rating)/total_rated)*100 )

global % of rated items in the train set 74.6951219512195
global % of rated items in the test set 25.304878048780488


In [22]:
#per user % of training example 
(Xtr_strat.groupby(by= 'user').rating.count()/total_rated)*100

user
1     1.524390
2     1.219512
3     3.048780
4     2.134146
5     2.743902
6     2.743902
7     2.439024
8     2.439024
9     2.439024
10    3.048780
11    2.439024
12    2.134146
13    3.353659
14    2.439024
15    3.353659
16    2.439024
17    2.134146
18    2.134146
19    0.609756
20    2.134146
21    1.829268
22    3.048780
23    2.743902
24    2.439024
25    3.048780
26    3.658537
27    4.268293
28    2.439024
29    2.743902
30    1.524390
Name: rating, dtype: float64

Note that the sum of the above percentages is $\simeq 74.70$%, in agreement with the global train percentage printed above.

In [23]:
#per user % of test example 
(Xtst_strat.groupby(by= 'user').rating.count()/total_rated)*100 

user
1     0.609756
2     0.609756
3     0.914634
4     0.609756
5     0.914634
6     0.914634
7     0.914634
8     0.914634
9     0.914634
10    1.219512
11    0.609756
12    0.609756
13    1.219512
14    0.914634
15    1.219512
16    0.914634
17    0.609756
18    0.609756
19    0.304878
20    0.609756
21    0.609756
22    0.914634
23    0.914634
24    0.914634
25    0.914634
26    1.219512
27    1.524390
28    0.609756
29    0.914634
30    0.609756
Name: rating, dtype: float64

This time every user appears in both the train and the test set. We can again check the mean train/test ratio and its std. In this case the fluctuations are much smaller than with the random splitter, i.e. the train/test proportion is much more uniform across users.

In [24]:
np.mean(Xtr_strat.groupby(by= 'user').rating.count()/(Xtst_strat.groupby(by= 'user').rating.count()))

2.9766666666666666

In [25]:
np.std(Xtr_strat.groupby(by= 'user').rating.count()/(Xtst_strat.groupby(by= 'user').rating.count()))

0.48814842915745305

In this case fluctuations (error bars) have a reasonable size.

### 1.1.3 Numpy stratified split 

In this case the data are split by keeping a per-user, constant percentage. 

In [26]:
Xtr_np, Xtst_np = numpy_stratified_split(X, ratio = 0.75, seed= 123)

In [27]:
#number of rated elements in the train/test set 
Xtr_rated = np.sum(Xtr_np != 0, axis=1)  # number of rated items in the train set
Xtst_rated = np.sum(Xtst_np != 0, axis=1)  # number of rated items in the test set

In [28]:
print( 'global % of rated items in the train set', (Xtr_rated.sum() / total_rated)*100 )
print( 'global % of rated items in the test set', (Xtst_rated.sum() / total_rated)*100 )

global % of rated items in the train set 74.6951219512195
global % of rated items in the test set 25.304878048780488


In [29]:
#per user percentage of training examples 
Xtr_rated/rated

array([0.71428571, 0.66666667, 0.76923077, 0.77777778, 0.75      ,
       0.75      , 0.72727273, 0.72727273, 0.72727273, 0.71428571,
       0.8       , 0.77777778, 0.73333333, 0.72727273, 0.73333333,
       0.72727273, 0.77777778, 0.77777778, 0.66666667, 0.77777778,
       0.75      , 0.76923077, 0.75      , 0.72727273, 0.76923077,
       0.75      , 0.73684211, 0.8       , 0.75      , 0.71428571])

In [30]:
#per user percentage of test examples 
Xtst_rated/rated

array([0.28571429, 0.33333333, 0.23076923, 0.22222222, 0.25      ,
       0.25      , 0.27272727, 0.27272727, 0.27272727, 0.28571429,
       0.2       , 0.22222222, 0.26666667, 0.27272727, 0.26666667,
       0.27272727, 0.22222222, 0.22222222, 0.33333333, 0.22222222,
       0.25      , 0.23076923, 0.25      , 0.27272727, 0.23076923,
       0.25      , 0.26315789, 0.2       , 0.25      , 0.28571429])

Also in this case the per-user training percentages are not exactly 75%, but they are much closer to it than with the random splitter. These fluctuations are due to rounding, since the number of ratings per user is an integer, and could in principle be reduced.
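The size of the rounding effect can be estimated directly from the number of ratings per user. The sketch below assumes that the train count of each user is `ratio * n` rounded to the nearest integer with `np.around`; this is an assumption about the splitter, not a statement about its source code, although under this assumption the fractions agree with the ones printed above.

```python
import numpy as np

# number of ratings per user, copied from the `rated` array printed earlier
rated_counts = np.array([7, 6, 13, 9, 12, 12, 11, 11, 11, 14, 10, 9, 15, 11, 15,
                         11, 9, 9, 3, 9, 8, 13, 12, 11, 13, 16, 19, 10, 12, 7])
ratio = 0.75

# ASSUMPTION: per-user train count = ratio * n rounded to the nearest integer
n_train = np.around(ratio * rated_counts)
train_fraction = n_train / rated_counts

print(train_fraction.round(4))
print("largest deviation from the target ratio:", np.abs(train_fraction - ratio).max())
print("global train percentage:", n_train.sum() / rated_counts.sum() * 100)
```

The largest deviations occur for users with few ratings, for whom one rating more or less changes the fraction the most.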

In [31]:
np.mean(Xtr_rated/Xtst_rated)

2.976666666666666

In [32]:
np.std(Xtr_rated/Xtst_rated)

0.488148429157453

Despite the different implementations, the python and numpy stratified splitters produce the same train/test statistics. The main difference is in scalability, as we show below on the movielens datasets.

### 1.2 Evaluation 

We now consider the evaluation metrics and their normalization. As an example, we consider precision@k. As a first step, we are interested in determining the maximum achievable value of precision@k; conventionally, this is set to one to denote a perfect score, in agreement with the normalization of the corresponding distribution. The definition is

$$ p_k = \frac{1}{m} \sum_{\mu=1}^m \frac{1}{k} \sum_{i=1}^{\min(k, |D_{\mu}|)} I(X_p = X_{tst})^{\mu}_i, $$

where $m$ is the number of users, $|D_{\mu}|$ denotes the number of test set items of user $\mu$, and $I()$ is known as an indicator "function", even though mathematically it is a distribution. An example of this class of distributions is the Dirac delta function $\delta(X_p - X_{tst})$. The above definition has the (known) problem of not being normalized to 1 if the number of test set elements is less than k, as can be seen explicitly from the above equation. Below we show that the maximum precision@k is often $< 1$ when working with sparse matrices.

We define a maximum achievable precision@k function in order to find out the reference scale when evaluating models. The `max_precision_at_k()` function evaluates the absolute scale in a simple and fast way by directly employing the definition of precision@k. Below we test this function with the result coming from a direct application of `precision_at_k()` and show that it is the same. Then we will use `max_precision_at_k()` for the rest of the notebook. 

In [33]:
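def max_precision_at_k(a, k):
    # a: per-user number of test items that can be recommended, already capped at k
    pk = np.mean(a / k)  # average over users of (number of possible hits) / k
    
    return pk    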
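As a worked example of the definition, the per-user test counts of the stratified split can be recovered from the test listing of section 1.1.2 (percentage × 328 / 100) and plugged into the formula by hand. The array below is transcribed from that listing; the variable names are chosen for the example only.

```python
import numpy as np

# per-user number of test ratings, recovered from the stratified test listing
test_counts = np.array([2, 2, 3, 2, 3, 3, 3, 3, 3, 4, 2, 2, 4, 3, 4,
                        3, 2, 2, 1, 2, 2, 3, 3, 3, 3, 4, 5, 2, 3, 2])
k = 5

possible_hits = np.minimum(test_counts, k)  # a user cannot score more hits than test items
max_pk = np.mean(possible_hits / k)         # 83 / 150

print(possible_hits.sum(), max_pk)
```

Only one user has as many as 5 test ratings, so almost every user contributes less than 1 to the average.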

### 1.2.1 Numpy stratified splitter

In [34]:
#return the top_k elements 

top_k = 5

def return_top_k(X_pred, k):

    top_items = np.argpartition(-X_pred , range(k), axis=1)[:, :k]  # get the top k items
        
    score_c = X_pred.copy()  # get a copy of the score matrix
    score_c[np.arange(score_c.shape[0])[:, None], top_items] = 0  # set to zero the top_k elements

    top_scores = X_pred - score_c  # set to zero all elements other than the top_k
    
    return top_scores 
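When a row of the test matrix has fewer than k non-zero entries, the top-k selection necessarily includes zeros, and these do not count as hits. The toy sketch below illustrates this with a hypothetical helper `top_k_mask()`, which uses `np.argsort` instead of `np.argpartition`; it is an illustration of the same idea, not a replacement for `return_top_k()`.

```python
import numpy as np

def top_k_mask(scores, k):
    """Hypothetical helper: keep the k largest entries of each row, zero the rest."""
    idx = np.argsort(-scores, axis=1, kind="stable")[:, :k]  # column indices of the top k
    rows = np.arange(scores.shape[0])[:, None]
    out = np.zeros_like(scores)
    out[rows, idx] = scores[rows, idx]
    return out

# toy test matrix: user 0 has three rated items, user 1 only one
toy_test = np.array([[0, 3, 0, 5, 1, 0],
                     [0, 0, 2, 0, 0, 0]])

toy_top = top_k_mask(toy_test, k=3)
toy_hits = np.sum(toy_top != 0, axis=1)  # number of usable recommendations per user

print(toy_top)
print(toy_hits)
```

Counting the non-zero entries per row therefore gives `min(k, number of test items)` for each user.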

In [35]:
#top k matrix using the numpy_splitter result
top_np = return_top_k(Xtst_np, top_k)

In [36]:
# evaluate the maximum achievable precision using the `max_precision_at_k()` function 
num_pred_np = np.sum(top_np !=0, axis=1)
print('max_precision at k=5,', max_precision_at_k(num_pred_np, top_k))

max_precision at k=5, 0.5533333333333333


Now we repeat the same calculation using the `precision_at_k()` function. In order to do that, we first need to convert the numpy prediction and test matrices to a pandas df:

In [37]:
#convert the prediction matrix to a pandas dataframe 
pred_np = sparse_to_df(top_np)
pred_np.head()

Unnamed: 0,user,item,rating
21,1,22,2
38,1,39,4
58,2,6,2
85,2,33,2
120,3,15,3


In [38]:
test_np = sparse_to_df(Xtst_np)
test_np.head()

Unnamed: 0,user,item,rating
21,1,22,2
38,1,39,4
58,2,6,2
85,2,33,2
120,3,15,3


In [39]:
eval_precision = precision_at_k(test_np, pred_np, col_user="user", col_item="item", 
                               col_rating="rating", col_prediction="rating", 
                               relevancy_method="top_k", k= top_k)
eval_precision 

0.5533333333333333

That is the same result obtained using `max_precision_at_k()`.

### 1.2.2 Python random splitter 

The `max_k()` function returns an array with the number of rated items per user, capped at $k$. It is used to build the best possible prediction with k elements; basically, this is a restriction of the original test set to at most k items per user.

In [40]:
def max_k(X,k, group):

    counts = np.asarray(X.groupby(by= group).rating.count())
    idx = np.where(counts > k)
    counts[idx] = k

    return counts 
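The same capping can be written with pandas alone, which may be easier to read. The sketch below is an assumed equivalent of `max_k()` built on `Series.clip`, applied to a made-up toy test set (the dataframe and its values are hypothetical).

```python
import pandas as pd

# hypothetical toy test set: users 1, 2 and 3 have 3, 7 and 1 test ratings
toy_df = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3],
    "item": [4, 8, 9, 1, 2, 3, 5, 6, 7, 9, 2],
    "rating": [5, 3, 4, 2, 2, 5, 1, 4, 3, 5, 4],
})
k = 5

counts = toy_df.groupby("user").rating.count()  # test ratings per user
capped = counts.clip(upper=k).to_numpy()        # at most k usable items per user
toy_max_pk = (capped / k).mean()                # (3 + 5 + 1) / 15

print(capped, toy_max_pk)
```

Note that a user with no rows in the test dataframe does not appear in the `groupby` result at all, so such a user is left out of the average instead of contributing a zero.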

In [41]:
Xp_random = max_k(Xtst_random, k=5, group= 'user')

In [42]:
max_precision_at_k(Xp_random,k=5) 

0.5785714285714285

Using the test set obtained from the random splitter one can achieve a maximum precision of $\sim 0.58$. 

### 1.2.3 Python stratified splitter 

The python stratified splitter gives the same result as the numpy one. 

In [43]:
Xp_strat= max_k(Xtst_strat, k=5, group = 'user')

In [44]:
max_precision_at_k(Xp_strat, k=5) 

0.5533333333333333

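In this case the python random splitter appears to give a higher maximum performance. Note, however, that `max_k()` averages only over the users present in the test set: users 7 and 12, who have no test examples after the random split, are left out rather than counted as zero. Averaging over all 30 users would give $81/150 = 0.54$. For the movielens dataset studied below, the random splitter gives the lower value. The maximum achievable performance is a function not only of the data sparsity but also of the per-user rating distribution. 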

## 2 Movielens 100k dataset

Below we apply the above analysis to the movielens 100k dataset. The size of the dataset enhances many of the features found above. 

In [48]:
# Select Movielens data size: 100k
MOVIELENS_DATA_SIZE = '100k'

data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=['userID','movieID','rating','timestamp']
)

# Convert to 32-bit in order to reduce memory consumption 
data.loc[:, 'rating'] = data['rating'].astype(np.int32) 

data.head()

Unnamed: 0,userID,movieID,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [49]:
total_rated_ml = data.rating.count()
total_rated_ml

100000

In [50]:
Nusers= len(data.userID.unique())
Nusers

943

In [51]:
Nitems= len(data.movieID.unique())
Nitems

1682

In [52]:
Nitems/Nusers 

1.7836691410392365
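The sparsity of the dataset follows directly from the three counts printed above. The short calculation below uses only those numbers; the variable names are chosen for the example.

```python
# counts printed above for the movielens 100k dataset
n_ratings = 100000
n_users = 943
n_items = 1682

density = n_ratings / (n_users * n_items)  # fraction of user/item pairs that are rated
sparsity_ml = (1 - density) * 100          # percentage of unrated pairs

print(round(sparsity_ml, 2))
```

This gives a sparsity of about 93.7%, considerably higher than the ~80% of the artificial matrix used in the first section.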

### 2.1 Split data

### 2.1.1. Random Splitter 

In [54]:
#split data using the random split 
Ztr_random, Ztst_random = python_random_split(data, ratio = 0.75, seed= 123)

In [55]:
Ztst_random.head()

Unnamed: 0,userID,movieID,rating,timestamp
42083,600,651,4,888451492
71825,607,494,5,883879556
99535,875,1103,5,876465144
47879,648,238,3,882213535
36734,113,273,4,875935609


In [56]:
print( 'global % of rated items in the train set', (len(Ztr_random.rating)/total_rated_ml)*100 )
print( 'global % of rated items in the test set', (len(Ztst_random.rating)/total_rated_ml)*100 )

global % of rated items in the train set 75.0
global % of rated items in the test set 25.0


In [57]:
#per user % of training example 
(Ztr_random.groupby(by= 'userID').rating.count()/total_rated_ml)*100 

userID
1      0.203
2      0.045
3      0.037
4      0.018
5      0.133
6      0.161
7      0.303
8      0.046
9      0.014
10     0.148
11     0.130
12     0.039
13     0.454
14     0.077
15     0.080
16     0.111
17     0.018
18     0.197
19     0.015
20     0.036
21     0.136
22     0.096
23     0.105
24     0.055
25     0.066
26     0.085
27     0.017
28     0.058
29     0.025
30     0.027
       ...  
914    0.018
915    0.017
916    0.237
917    0.022
918    0.073
919    0.163
920    0.017
921    0.078
922    0.100
923    0.058
924    0.060
925    0.023
926    0.015
927    0.086
928    0.022
929    0.035
930    0.048
931    0.047
932    0.173
933    0.129
934    0.136
935    0.033
936    0.116
937    0.031
938    0.080
939    0.045
940    0.084
941    0.019
942    0.059
943    0.126
Name: rating, Length: 943, dtype: float64

As can be seen, every user listed contributes well below 1% of the ratings to the training set. We can check this explicitly by printing the percentage of users contributing at least 0.1% of the ratings to the training set. 

In [58]:
( ((Ztr_random.groupby(by= 'userID').rating.count()/total_rated_ml)*100  >= 0.1).sum()/Nusers )*100  

28.738069989395548

In [59]:
#per user % of test example 
(Ztst_random.groupby(by= 'userID').rating.count()/total_rated_ml)*100 

userID
1      0.069
2      0.017
3      0.017
4      0.006
5      0.042
6      0.050
7      0.100
8      0.013
9      0.008
10     0.036
11     0.051
12     0.012
13     0.182
14     0.021
15     0.024
16     0.029
17     0.010
18     0.080
19     0.005
20     0.012
21     0.043
22     0.032
23     0.046
24     0.013
25     0.012
26     0.022
27     0.008
28     0.021
29     0.009
30     0.016
       ...  
914    0.005
915    0.009
916    0.080
917    0.013
918    0.030
919    0.054
920    0.009
921    0.032
922    0.027
923    0.016
924    0.022
925    0.009
926    0.005
927    0.034
928    0.010
929    0.014
930    0.015
931    0.014
932    0.068
933    0.055
934    0.038
935    0.006
936    0.026
937    0.009
938    0.028
939    0.004
940    0.023
941    0.003
942    0.020
943    0.042
Name: rating, Length: 943, dtype: float64

Since we are interested in evaluating the effect of the splitter on the @k metrics with $k=10$, we can find the fraction of users in the test set having at least 10 test examples. As explained in the previous section, this is closely related to the maximum achievable precision of the recommender: users with fewer than 10 test examples cannot reach a perfect score. 

In [61]:
( ( (Ztst_random.groupby(by= 'userID').rating.count() ) >= 10 ).sum()/Nusers )

0.7073170731707317
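The fraction of users with at least k test examples is a lower bound on the maximum precision@k, not the maximum itself, because users with fewer than k test examples still contribute `count / k` to the average. The toy sketch below, with made-up counts for five users, shows the difference.

```python
import numpy as np

# hypothetical per-user test counts for five users
toy_counts = np.array([12, 10, 4, 7, 25])
k = 10

frac_full = np.mean(toy_counts >= k)                    # 3 of 5 users have at least k items
toy_max_pk = np.mean(np.minimum(toy_counts, k) / k)     # (10 + 10 + 4 + 7 + 10) / 50

print(frac_full, toy_max_pk)
```

Here 60% of the users can reach a perfect score, while the maximum precision@k is 0.82.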

The maximum precision@k can be evaluated using the functions defined in the previous section.

In [65]:
Zp_random = max_k(Ztst_random, k=10, group = 'userID')
max_precision_at_k(Zp_random, k=10) 

0.8846235418875928

### 2.1.2 Python Stratified Splitter

In [67]:
start = time.time()
# split data using the stratified split 
Ztr_strat, Ztst_strat = python_stratified_split(data, ratio = 0.75, seed= 123, col_user ='userID', col_item= 'movieID')
elapsed_strat = time.time() - start
print('it took %f s to split the data' %elapsed_strat )

it took 3.073160 s to split the data


In [68]:
Ztst_strat.head()

Unnamed: 0,userID,movieID,rating,timestamp
15764,1,196,5,874965677
14792,1,103,1,878542845
8737,1,209,4,888732908
62069,1,191,5,875072956
25721,1,141,3,878542608


In [69]:
print( 'global % of rated items in the train set', (len(Ztr_strat.rating)/total_rated_ml)*100 )
print( 'global % of rated items in the test set', (len(Ztst_strat.rating)/total_rated_ml)*100 )

global % of rated items in the train set 74.992
global % of rated items in the test set 25.008000000000003


In [70]:
#per user % of training example 
(Ztr_strat.groupby(by= 'userID').rating.count()/total_rated_ml)*100 

userID
1      0.204
2      0.046
3      0.040
4      0.018
5      0.131
6      0.158
7      0.302
8      0.044
9      0.016
10     0.138
11     0.136
12     0.038
13     0.477
14     0.074
15     0.078
16     0.105
17     0.021
18     0.208
19     0.015
20     0.036
21     0.134
22     0.096
23     0.113
24     0.051
25     0.058
26     0.080
27     0.019
28     0.059
29     0.026
30     0.032
       ...  
914    0.017
915    0.020
916    0.238
917    0.026
918    0.077
919    0.163
920    0.020
921    0.082
922    0.095
923    0.056
924    0.062
925    0.024
926    0.015
927    0.090
928    0.024
929    0.037
930    0.047
931    0.046
932    0.181
933    0.138
934    0.130
935    0.029
936    0.106
937    0.030
938    0.081
939    0.037
940    0.080
941    0.016
942    0.059
943    0.126
Name: rating, Length: 943, dtype: float64

Print the percentage of users contributing at least 0.1% of the ratings to the training set. 

In [71]:
( ((Ztr_strat.groupby(by= 'userID').rating.count()/total_rated_ml)*100  >= 0.1).sum()/Nusers )*100  

29.056203605514312

In [72]:
#per user % of test example 
(Ztst_strat.groupby(by= 'userID').rating.count()/total_rated_ml)*100 

userID
1      0.068
2      0.016
3      0.014
4      0.006
5      0.044
6      0.053
7      0.101
8      0.015
9      0.006
10     0.046
11     0.045
12     0.013
13     0.159
14     0.024
15     0.026
16     0.035
17     0.007
18     0.069
19     0.005
20     0.012
21     0.045
22     0.032
23     0.038
24     0.017
25     0.020
26     0.027
27     0.006
28     0.020
29     0.008
30     0.011
       ...  
914    0.006
915    0.006
916    0.079
917    0.009
918    0.026
919    0.054
920    0.006
921    0.028
922    0.032
923    0.018
924    0.020
925    0.008
926    0.005
927    0.030
928    0.008
929    0.012
930    0.016
931    0.015
932    0.060
933    0.046
934    0.044
935    0.010
936    0.036
937    0.010
938    0.027
939    0.012
940    0.027
941    0.006
942    0.020
943    0.042
Name: rating, Length: 943, dtype: float64

Percentage of users contributing at least 0.1% of the ratings to the test set. 

In [73]:
( ((Ztst_strat.groupby(by= 'userID').rating.count()/total_rated_ml)*100  >= 0.1).sum()/Nusers )*100  

1.8027571580063628

In [74]:
( ( (Ztst_strat.groupby(by= 'userID').rating.count() ) >= 10 ).sum()/Nusers )

0.7020148462354189

In [76]:
Zp_strat = max_k(Ztst_strat, k=10, group = 'userID')
max_precision_at_k(Zp_strat, k=10) 

0.8996818663838813

### 2.1.3 Numpy stratified splitter 

In [80]:
start = time.time()
#to use standard names across the analysis 
header = {
        "col_user": "userID",
        "col_item": "movieID",
        "col_rating": "rating",
    }

#instantiate the splitter 
am = AffinityMatrix(DF = data, **header)

#obtain the sparse matrix 
Z = am.gen_affinity_matrix()
elapsed_am = time.time()- start
print('it took %f s to generate the affinity matrix' %elapsed_am)

it took 0.028296 s to generate the affinity matrix


In [81]:
start = time.time()
Ztr_np, Ztst_np = numpy_stratified_split(Z, ratio=0.75, seed=123)
elapsed_np = time.time() - start 
print('it took %f s to split the data' %elapsed_np)

it took 0.080756 s to split the data


In [82]:
#number of rated elements in the train/test set 
Ztr_np_rated = np.sum(Ztr_np != 0, axis=1)  # number of rated items in the train set
Ztst_np_rated = np.sum(Ztst_np != 0, axis=1)  # number of rated items in the test set

#number of ratings per user
Zrated = np.sum(Z != 0, axis=1)

In [83]:
print( 'global % of rated items in the train set', (Ztr_np_rated.sum() / total_rated_ml)*100 )
print( 'global % of rated items in the test set', (Ztst_np_rated.sum() / total_rated_ml)*100 )

global % of rated items in the train set 74.992
global % of rated items in the test set 25.008000000000003


In [84]:
#per user percentage of training examples 
Ztr_np_rated/Zrated

array([0.75      , 0.74193548, 0.74074074, 0.75      , 0.74857143,
       0.74881517, 0.74937965, 0.74576271, 0.72727273, 0.75      ,
       0.75138122, 0.74509804, 0.75      , 0.75510204, 0.75      ,
       0.75      , 0.75      , 0.75090253, 0.75      , 0.75      ,
       0.74860335, 0.75      , 0.74834437, 0.75      , 0.74358974,
       0.74766355, 0.76      , 0.74683544, 0.76470588, 0.74418605,
       0.75      , 0.75609756, 0.75      , 0.75      , 0.76      ,
       0.75      , 0.75438596, 0.75206612, 0.72727273, 0.74285714,
       0.75      , 0.74863388, 0.75113122, 0.74834437, 0.75      ,
       0.74074074, 0.76      , 0.75757576, 0.74883721, 0.75      ,
       0.73913043, 0.75      , 0.75      , 0.75384615, 0.76190476,
       0.7486631 , 0.75471698, 0.75324675, 0.7486911 , 0.75      ,
       0.76190476, 0.75      , 0.75268817, 0.75      , 0.75      ,
       0.73684211, 0.73333333, 0.76470588, 0.75384615, 0.7480916 ,
       0.73684211, 0.75182482, 0.75757576, 0.74358974, 0.74683

In [85]:
#per user percentage of test examples 
Ztst_np_rated/Zrated

array([0.25      , 0.25806452, 0.25925926, 0.25      , 0.25142857,
       0.25118483, 0.25062035, 0.25423729, 0.27272727, 0.25      ,
       0.24861878, 0.25490196, 0.25      , 0.24489796, 0.25      ,
       0.25      , 0.25      , 0.24909747, 0.25      , 0.25      ,
       0.25139665, 0.25      , 0.25165563, 0.25      , 0.25641026,
       0.25233645, 0.24      , 0.25316456, 0.23529412, 0.25581395,
       0.25      , 0.24390244, 0.25      , 0.25      , 0.24      ,
       0.25      , 0.24561404, 0.24793388, 0.27272727, 0.25714286,
       0.25      , 0.25136612, 0.24886878, 0.25165563, 0.25      ,
       0.25925926, 0.24      , 0.24242424, 0.25116279, 0.25      ,
       0.26086957, 0.25      , 0.25      , 0.24615385, 0.23809524,
       0.2513369 , 0.24528302, 0.24675325, 0.2513089 , 0.25      ,
       0.23809524, 0.25      , 0.24731183, 0.25      , 0.25      ,
       0.26315789, 0.26666667, 0.23529412, 0.24615385, 0.2519084 ,
       0.26315789, 0.24817518, 0.24242424, 0.25641026, 0.25316

In [86]:
(Ztst_np_rated >=10).sum()/Nusers 

0.7020148462354189

In [87]:
Z_top_np = return_top_k(Ztst_np, k=10)
# evaluate the maximum achievable precision 
Znum_pred_np = np.sum(Z_top_np !=0, axis=1)
print('max_precision at k=10,', max_precision_at_k(Znum_pred_np, 10))

max_precision at k=10, 0.8996818663838813


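To conclude, the stratified splitters achieve a higher maximum achievable precision than the random splitter ($\simeq 0.90$ versus $\simeq 0.88$). On the other hand, the python stratified splitter is more than one order of magnitude slower than the numpy stratified splitter (about 3.1 s versus about 0.1 s, including the generation of the affinity matrix).  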

## 3 Movielens 1m dataset

In [88]:
# Select Movielens data size: 1m
MOVIELENS_DATA_SIZE = '1m'

data1m = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=['userID','movieID','rating','timestamp']
)

# Convert to 32-bit in order to reduce memory consumption 
data1m.loc[:, 'rating'] = data1m['rating'].astype(np.int32) 

data1m.head()

Unnamed: 0,userID,movieID,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


### 3.1 Python Random splitter

In [91]:
#split data using the random split 
Ytr_random, Ytst_random = python_random_split(data1m, ratio = 0.75, seed= 123)

In [92]:
Yp_random = max_k(Ytst_random, k=10, group = 'userID')

In [93]:
max_precision_at_k(Yp_random, k=10) 

0.9251366120218578

### 3.2 Python stratified splitter 

In [94]:
start = time.time()
Atr_strat, Atst_strat = python_stratified_split(data1m, ratio = 0.75, seed= 123, col_user ='userID', col_item= 'movieID')
elapsed_strat = time.time() - start

In [95]:
print( 'it took %f s to split the data' %elapsed_strat)

it took 89.691625 s to split the data


In [96]:
Ap_tst = max_k(Atst_strat, k=10, group = 'userID')
max_precision_at_k(Ap_tst, k=10) 

0.9386258278145694

### 3.3 Numpy stratified splitter  

In [97]:
start = time.time()
#to use standard names across the analysis 
header = {
        "col_user": "userID",
        "col_item": "movieID",
        "col_rating": "rating",
    }

#instantiate the splitter 
am1m = AffinityMatrix(DF = data1m, **header)

#obtain the sparse matrix 
A = am1m.gen_affinity_matrix()
elapsed_affinity = time.time() - start

In [98]:
print('it took %f s to generate the affinity matrix' %elapsed_affinity )

it took 0.223092 s to generate the affinity matrix


In [99]:
start = time.time()
Atr_np, Atst_np = numpy_stratified_split(A, ratio=0.75, seed=123)
elapsed_np = time.time() - start

In [100]:
print('it took %f s to split the data' %elapsed_np )

it took 0.641507 s to split the data
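The timings printed in this notebook can be combined into a rough speed ratio between the python and the numpy stratified splitters. The numbers below are copied from the outputs above and depend on the machine the notebook was run on, so the ratios are only indicative.

```python
# timings in seconds, copied from the outputs printed in this notebook
python_strat_100k = 3.073160
numpy_100k = 0.028296 + 0.080756   # affinity matrix generation + split
python_strat_1m = 89.691625
numpy_1m = 0.223092 + 0.641507     # affinity matrix generation + split

ratio_100k = python_strat_100k / numpy_100k
ratio_1m = python_strat_1m / numpy_1m

print(round(ratio_100k), round(ratio_1m))
```

The ratio grows from roughly 28 on the 100k dataset to roughly 104 on the 1m dataset, i.e. the gap widens with the size of the data.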


In [101]:
A_top_np = return_top_k(Atst_np, k=10)
Anum_pred_np = np.sum(A_top_np !=0, axis=1)
print('max_precision at k=10,', max_precision_at_k(Anum_pred_np, 10))

max_precision at k=10, 0.9386258278145694
