# Test script
This test script takes a dataset as input and reports a performance evaluation of THE algorithm. The dataset must be in CSR format, with the first three fields of each entry being `[user_id, col_id, ratings, ... ]` and ratings between 1 and 5; otherwise you will have to modify the original Jupyter notebook. <br>
There may also be alternative functions available beyond those used in this test script. See the original Jupyter notebook for details.

Run the main Jupyter notebook, which defines all the functions. Make sure the path in the `%run` cell below is correct.

In [1]:
from datetime import datetime
datetime.now().time()     # (hour, min, sec, microsec)

datetime.time(12, 21, 43, 315943)

In [2]:
# if this way of importing another jupyter notebook fails for you
# then you can use any one of the many methods described here:
# https://stackoverflow.com/questions/20186344/ipynb-import-another-ipynb-file
%run '../src/finalcode.ipynb'

In [3]:
datetime.now().time()     # (hour, min, sec, microsec)

datetime.time(12, 21, 44, 946750)

### Setting constants

In [4]:
'''Dataset Parameters'''
################################################################################################################
DATA_PATH = '../data/ml-100k/u.data' # ml-100k data set has 100k ratings, 943 users and 1682 items
DELIMITER = "\t"               # tab separated or comma separated data format
N_RATINGS = 100000
################################################################################################################
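
Before running the pipeline, it can help to confirm that the file matches the expected layout. The sketch below is not part of the original notebook: `parse_ratings` is a hypothetical helper (not the notebook's `read_data_csr`) that reads delimiter-separated lines, keeps the first three fields, and rejects ratings outside 1 to 5. The sample rows are made up for illustration.

```python
import csv
import io

def parse_ratings(handle, delimiter="\t"):
    """Parse (user_id, item_id, rating) triples from delimiter-separated text.

    Hypothetical helper for illustration only.
    Extra columns (for example a timestamp) are ignored.
    """
    triples = []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        user_id, item_id, rating = int(row[0]), int(row[1]), float(row[2])
        if not 1 <= rating <= 5:
            raise ValueError("rating out of range 1-5: {}".format(row))
        triples.append((user_id, item_id, rating))
    return triples

# Made-up rows: user_id, item_id, rating, extra column
sample = "1\t10\t4\t0\n2\t11\t5\t0\n3\t10\t1\t0\n"
ratings = parse_ratings(io.StringIO(sample))
print(len(ratings), ratings[0])
```

To check a real file, pass an open file handle for `DATA_PATH` instead of the in-memory sample.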

In [5]:
# These parameters will be detected automatically from dataset
# -1 is for the default value
FIRST_INDEX = -1
USERS = -1
ITEMS = -1
SPARSITY = -1                  # 'p' in the equations
UNOBSERVED = 0                 # default value in matrix for unobserved ratings; prefer to keep it 0

In [6]:
# To reduce size of csr for testing purpose
# WARNING: ONLY TO BE USED FOR TESTING
# (for real run, put SIZE_REDUCTION = False)
SIZE_REDUCTION = False
#USER_LIMIT = 50
#ITEM_LIMIT = 100

In [7]:
'''Hyperparameters'''
# All the hyperparameters have default values
# To use a default, set the parameter to -1
################################################################################################################
TRAIN_TEST_SPLIT = -1                   # fraction of test ratings w.r.t. train ratings; value strictly between 0 and 1
C1 = -1                                 # probability of edges in training set going to E1
C2 = -1                                 # probability of edges in training set going to E2
RADIUS = 7                              # radius of neighborhood, radius = # edges between start and end vertex
UNPRED_RATING = 3                       # rating (normalized) used when we don't have a predicted rating between 1 - 5
THRESHOLD = 0.01                        # distance similarity threshold used for rating prediction
################################################################################################################

In [8]:
# checks on hyper parameters    
if isinstance(C1, float) and isinstance(C2, float) and (C1 > 0) and (C2 > 0) and 1 - C1 - C2 > 0:
    print('c1 = {}'.format(C1))
    print('c2 = {}'.format(C2))
    print('c3 = {}'.format(1-C1-C2))
elif (C1 == -1) and (C2 == -1):
    C1 = C2 = 0.33
    print('c1 = {} (default)'.format(C1))
    print('c2 = {} (default)'.format(C2))
    print('c3 = {} (default)'.format(1-C1-C2))
else:
    print('ERROR: Incorrect values set for C1 and C2')
    
if isinstance(RADIUS, int) and RADIUS > 0:
    print('Radius = {}'.format(RADIUS))
elif RADIUS == -1:
    print('Radius = default value as per paper')
else:
    print('ERROR: Incorrect values set for Radius')

if UNPRED_RATING >= 1 and UNPRED_RATING <= 5:
    print('Rating set for unpredicted ratings = {}'. format(UNPRED_RATING))
elif UNPRED_RATING == -1:
    UNPRED_RATING = 3
    print('Rating set for unpredicted ratings = {} (default)'. format(UNPRED_RATING))
else:
    print('ERROR: Incorrect values set for UNPRED_RATING')
    
if TRAIN_TEST_SPLIT > 0 and TRAIN_TEST_SPLIT < 1:
    print('TRAIN_TEST_SPLIT = {}'.format(TRAIN_TEST_SPLIT))
elif TRAIN_TEST_SPLIT == -1:
    TRAIN_TEST_SPLIT = 0.2
    print('TRAIN_TEST_SPLIT = 0.2 (default)')
else:
    print('ERROR: Incorrect values set for TRAIN_TEST_SPLIT')

c1 = 0.33 (default)
c2 = 0.33 (default)
c3 = 0.34 (default)
Radius = 7
Rating set for unpredicted ratings = 3
TRAIN_TEST_SPLIT = 0.2 (default)
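
The checks above all follow the same pattern: accept a valid value, substitute a default for -1, and report an error otherwise. A minimal sketch of that pattern as one function is shown below. `resolve` is a hypothetical helper, and unlike the notebook cell it raises an exception instead of printing an error and continuing.

```python
def resolve(value, default, is_valid, name):
    """Return `default` if value == -1, `value` if it is valid, else raise.

    Hypothetical helper; the notebook prints an error message instead.
    """
    if value == -1:
        return default
    if is_valid(value):
        return value
    raise ValueError("Incorrect value set for {}: {!r}".format(name, value))

# Lower-case names are used so the notebook's constants are not overwritten.
split = resolve(-1, 0.2, lambda v: 0 < v < 1, "TRAIN_TEST_SPLIT")
unpred = resolve(3, 3, lambda v: 1 <= v <= 5, "UNPRED_RATING")
print(split, unpred)
```

Raising early stops a long run from starting with an invalid setting, which matters here because later steps take hours.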


### Read and prepare the dataset

In [9]:
data_csr = read_data_csr(fname=DATA_PATH, delimiter=DELIMITER)

if SIZE_REDUCTION:
    data_csr = reduce_size_of_data_csr(data_csr)

if data_csr.shape[0] == N_RATINGS:  # gives total no of ratings read; useful for verification
    print('Reading dataset: done')
else:
    print('Reading dataset: FAILED')
    print( '# of missing ratings: ' + str(N_RATINGS - data_csr.shape[0]))
    
check_and_set_data_csr(data_csr=data_csr)

Reading dataset: done
USERS = 943
ITEMS = 1682
All users and items have at least one rating! Good!
SPARSITY (p) = 0.0290249433107
Sym matrix : p is polynomially larger than 1/n, all guarantees applicable
Check and set dataset : done
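
The printed sparsity is consistent with counting each rating twice in a symmetric square matrix of side USERS + ITEMS. This is an inference from the numbers above, not a formula confirmed by the notebook's source, and `symmetric_sparsity` is a hypothetical name.

```python
def symmetric_sparsity(n_ratings, n_users, n_items):
    """Fraction of observed entries in the (users+items) x (users+items) matrix.

    Assumption: each rating is counted twice (user-item and item-user).
    """
    n = n_users + n_items
    return 2.0 * n_ratings / (n * n)

p = symmetric_sparsity(100000, 943, 1682)
print(p)
```

With 100000 ratings, 943 users and 1682 items this gives 200000 / 2625^2, which agrees with the value of `SPARSITY (p)` printed above.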


In [10]:
[train_data_csr, test_data_csr] = generate_train_test_split_csr(data_csr=data_csr, split=TRAIN_TEST_SPLIT)

Generating train test split: done
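
For readers who want to see what such a split involves, here is a minimal sketch on plain rating triples. It assumes `split` is the fraction of all ratings held out for testing, which may differ from how `generate_train_test_split_csr` interprets it; `train_test_split_triples` and the `seed` parameter are hypothetical.

```python
import random

def train_test_split_triples(triples, split=0.2, seed=0):
    """Randomly hold out a fraction `split` of the rating triples for testing."""
    rng = random.Random(seed)      # fixed seed for a reproducible split
    shuffled = list(triples)
    rng.shuffle(shuffled)
    n_test = int(round(split * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up data: 10 users x 10 items, ratings in 1..5
triples = [(u, i, 1 + (u + i) % 5) for u in range(10) for i in range(10)]
train, test = train_test_split_triples(triples, split=0.2)
print(len(train), len(test))
```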


In [11]:
train_data_csr = normalize_ratings_csr(train_data_csr)
train_data_csr = csr_to_symmetric_csr(train_data_csr)
# the symmetric matrix obtained doesn't contain repetitions for any user-item pair
# only the item_ids are shifted: item_ids += USERS
# hence, we can safely go ahead and use this CSR matrix for the sample splitting step

Normalize ratings: done
CSR to symmetric CSR matrix: done
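
The two preparation steps can be illustrated on plain triples. The sketch below makes several assumptions that the notebook excerpt does not confirm: ratings are normalized by dividing by 5, ids are zero-based, and each rating is stored once with the item id shifted by the number of users. `normalize_and_symmetrize` is a hypothetical name.

```python
def normalize_and_symmetrize(triples, n_users, max_rating=5.0):
    """Scale ratings into (0, 1] and put users and items in one index space.

    Assumptions: division by max_rating, zero-based ids, and one stored
    entry per rating, (user, item + n_users, rating / max_rating).
    """
    return [(u, i + n_users, r / max_rating) for (u, i, r) in triples]

# Made-up example with 2 users
prepared = normalize_and_symmetrize([(0, 0, 5), (1, 2, 3)], n_users=2)
print(prepared)
```

After the shift, user ids occupy `0 .. n_users - 1` and item ids start at `n_users`, so a single vertex index identifies either kind of node.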


### Make predictions using THE algorithm 

##### Step 1: Sample splitting

In [12]:
[m1_csr, m2_csr, m3_csr] = sample_splitting_csr(data_csr=train_data_csr, c1=C1, c2=C2)

Sample splitting: done
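
Based on the hyperparameter comments, each training edge goes to E1 with probability C1, to E2 with probability C2, and to E3 otherwise. A minimal sketch of that assignment is below; `split_edges` and `seed` are hypothetical, and the notebook's `sample_splitting_csr` may be implemented differently.

```python
import random

def split_edges(edges, c1, c2, seed=0):
    """Assign each edge independently to E1, E2 or E3.

    Probabilities are c1, c2 and 1 - c1 - c2 respectively.
    """
    rng = random.Random(seed)
    e1, e2, e3 = [], [], []
    for edge in edges:
        x = rng.random()           # uniform in [0, 1)
        if x < c1:
            e1.append(edge)
        elif x < c1 + c2:
            e2.append(edge)
        else:
            e3.append(edge)
    return e1, e2, e3

edges = [(u, v) for u in range(30) for v in range(30, 40)]   # made-up edges
e1, e2, e3 = split_edges(edges, c1=0.33, c2=0.33)
print(len(e1), len(e2), len(e3))
```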


##### Step 2: Expanding the Neighborhood

In [13]:
[r_neighbor_matrix, r1_neighbor_matrix] = generate_neighbor_boundary_matrix(m1_csr)
# the neighbor boundary vector for each user u is stored as the u'th row of the neighbor matrix
# though the vector is stored as a row vector here, we will treat it as a column vector in Step 4
# Note: we might expect the neighbor matrix to be symmetric with dimensions (USERS+ITEMS)*(USERS+ITEMS)
#     : since the user-item and item-user distances should be the same
#     : but this is not the case, since there might be multiple paths between a user and an item
#     : and the random path picked for user-item and item-user may not be the same
#     : normalizing the matrix also introduces differences

Creating graph as dictionary:


100%|██████████| 2625/2625 [00:02<00:00, 1050.18it/s]


Generating neighbor boundary matrix at 7-hop distance:


100%|██████████| 2625/2625 [30:10<00:00,  1.45it/s]


Generating neighbor boundary matrix at 8-hop distance:


100%|██████████| 2625/2625 [32:02<00:00,  1.37it/s]
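
To make the notion of a neighborhood boundary concrete, the sketch below finds the vertices at exactly `radius` hops from a source using breadth-first search. It only identifies which vertices lie on the boundary; it does not reproduce the path-based values that `generate_neighbor_boundary_matrix` stores. `bfs_boundary` and the toy graph are hypothetical.

```python
from collections import deque

def bfs_boundary(graph, source, radius):
    """Return the set of vertices at exactly `radius` hops from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            continue               # do not expand beyond the boundary
        for w in graph.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return {v for v, d in dist.items() if d == radius}

# Toy path graph 0 - 1 - 2 - 3, stored as a dictionary of adjacency lists
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(bfs_boundary(graph, 0, 2))
print(bfs_boundary(graph, 0, 7))
```

In the toy graph the boundary at 7 hops is empty because no vertex is that far from the source. A radius larger than the graph's longest shortest path always yields empty boundaries, which is worth keeping in mind when choosing `RADIUS`.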


In [14]:
describe_neighbor_count(r_neighbor_matrix)

To effectively choose RADIUS value for next run of algorithm:
Showing distribution of count of neighbors for every vertex:
                 0
count  2625.000000
mean      0.001524
std       0.047794
min       0.000000
25%       0.000000
50%       0.000000
75%       0.000000
max       2.000000


In [15]:
describe_neighbor_count(r1_neighbor_matrix)

To effectively choose RADIUS value for next run of algorithm:
Showing distribution of count of neighbors for every vertex:
            0
count  2625.0
mean      0.0
std       0.0
min       0.0
25%       0.0
50%       0.0
75%       0.0
max       0.0


##### Step 3: Computing the distances

In [16]:
distance_matrix = compute_distance_matrix(r_neighbor_matrix, r1_neighbor_matrix, m2_csr)

Generating distance matrix


100%|██████████| 2625/2625 [25:49:16<00:00, 35.41s/it]   


In [19]:
distance_matrix.shape

(2625, 2625)

In [20]:
sum(sum(distance_matrix))

0.0

In [18]:
describe_distance_matrix(distance_matrix)
# The error below is expected: we did not observe any distance values between any two vertices,
# because almost no vertex has neighbors at r=7 hops and none has neighbors at r=8 hops

To effectively choose THRESHOLD value in next step:
Showing distribution of non zero (or observed) entries of distance matrix:


ValueError: Cannot describe a DataFrame without columns
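
The `ValueError` arises because there are no nonzero entries to describe. A small guard can report that case explicitly instead of failing. The sketch below works on a list of lists and is not the notebook's `describe_distance_matrix`; `summarize_nonzero` is a hypothetical helper.

```python
def summarize_nonzero(matrix):
    """Return basic statistics of the nonzero entries, or None if there are none."""
    values = [x for row in matrix for x in row if x != 0]
    if not values:
        return None                # nothing observed, nothing to describe
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }

print(summarize_nonzero([[0, 0], [0, 0]]))
print(summarize_nonzero([[0, 1.0], [3.0, 0]]))
```

A `None` result here is a signal to revisit `RADIUS` before continuing to Step 4, since every later step depends on observed distances.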

##### Step 4: Averaging datapoints to produce final estimate

In [None]:
# prefer to choose a threshold now based on describe_distance_matrix
THRESHOLD = 2

In [None]:
sim_matrix = generate_sim_matrix(distance_matrix, threshold=THRESHOLD)

In [None]:
# Prepare the test dataset using Model preparation section functions
test_data_csr = normalize_ratings_csr(test_data_csr)
test_data_csr = csr_to_symmetric_csr(test_data_csr)

In [None]:
# Getting estimates for only test data points
prediction_array = generate_averaged_prediction_array(sim_matrix, m3_csr, test_data_csr)

# To generate complete rating matrix do the following:
#prediction_matrix = generate_averaged_prediction_matrix(sim_matrix, m3_csr)
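
As a rough illustration of the averaging step, the sketch below predicts a rating for a pair (u, i) by averaging the E3 ratings of pairs (a, b) where a is within the threshold distance of u and b is within the threshold distance of i. This is one plausible reading of the step, not the notebook's `generate_averaged_prediction_array`; the function name, the dictionary-based distance lookup, and the `fallback` argument are all assumptions.

```python
def averaged_prediction(u, i, e3, dist, threshold, fallback):
    """Average E3 ratings over pairs (a, b) with a close to u and b close to i.

    `dist` maps a (vertex, vertex) pair to a distance; a missing pair is
    treated as unobserved. A vertex is taken to be at distance 0 from itself.
    """
    def close(x, y):
        if x == y:
            return True
        d = dist.get((x, y))
        return d is not None and d < threshold

    ratings = [r for (a, b, r) in e3 if close(u, a) and close(i, b)]
    if not ratings:
        return fallback            # no similar datapoints: use the fallback rating
    return sum(ratings) / len(ratings)

# Made-up data: vertex 1 is close to vertex 0, vertex 2 is not
e3 = [(0, 10, 1.0), (1, 10, 0.5), (2, 11, 0.2)]
dist = {(0, 1): 0.005, (0, 2): 0.5}
print(averaged_prediction(0, 10, e3, dist, threshold=0.01, fallback=0.6))
print(averaged_prediction(3, 12, e3, dist, threshold=0.01, fallback=0.6))
```

The fallback plays the role that `UNPRED_RATING` appears to play in the notebook: it is returned whenever no datapoint passes the threshold.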

### Evaluate the predictions

In [None]:
# We have already prepared the test data (required for our algorithm)
y_actual  = test_data_csr[:,2]
y_predict = prediction_array
# If we want, we could scale our ratings back to the 1 - 5 range for evaluation purposes
# but then the paper makes no guarantees about scaled ratings
#y_actual  = y_actual * 5
#y_predict = y_predict * 5

In [None]:
get_rmse(y_actual, y_predict)

In [None]:
get_avg_err(y_actual, y_predict)
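
For reference, the two metrics can be computed as sketched below. This assumes `get_rmse` is the usual root mean squared error and `get_avg_err` is the mean absolute error, which the notebook excerpt does not confirm; `rmse` and `mean_abs_error` are hypothetical names.

```python
import math

def rmse(y_actual, y_predict):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_actual, y_predict)) / n)

def mean_abs_error(y_actual, y_predict):
    """Mean absolute error between two equal-length sequences."""
    n = len(y_actual)
    return sum(abs(a - p) for a, p in zip(y_actual, y_predict)) / n

# Made-up normalized ratings and a constant prediction
actual = [1.0, 0.5, 0.0]
predicted = [0.5, 0.5, 0.5]
print(rmse(actual, predicted), mean_abs_error(actual, predicted))
```

Both metrics are computed here on normalized ratings, so they are on a 0 to 1 scale rather than the original 1 to 5 scale.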

In [None]:
check_mse(m1_csr, y_actual, y_predict)

In [None]:
datetime.now().time()     # (hour, min, sec, microsec)