# Test script
This sample test script runs the algorithm on the `very_small_graph.txt` dataset. It is meant to help you understand exactly how the functions operate.

The next code cell runs the main Jupyter notebook, which defines all the functions. Make sure the path in that cell is correct.

In [1]:
from datetime import datetime
datetime.now().time()     # (hour, min, sec, microsec)

datetime.time(18, 43, 0, 274879)

In [2]:
# If this way of importing another Jupyter notebook fails for you,
# use any one of the methods described here:
# https://stackoverflow.com/questions/20186344/ipynb-import-another-ipynb-file
%run '../src/finalcode.ipynb'

In [3]:
datetime.now().time()     # (hour, min, sec, microsec)

datetime.time(18, 43, 3, 434831)
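
The two timestamps printed by cells [1] and [3] bracket the `%run` cell, so their difference is roughly how long the main notebook took to load. A minimal sketch of that subtraction, using the values printed above (the variable names and the placeholder date are illustrative only):

```python
from datetime import datetime, date, time

# Timestamps copied from the outputs of cells [1] and [3]
start = time(18, 43, 0, 274879)
end = time(18, 43, 3, 434831)

# datetime.time objects cannot be subtracted directly,
# so attach both to the same (arbitrary) date first
same_day = date(2000, 1, 1)
elapsed = datetime.combine(same_day, end) - datetime.combine(same_day, start)
elapsed_seconds = elapsed.total_seconds()
print(elapsed_seconds)
```

Keeping the full `datetime.now()` value instead of calling `.time()` on it would allow the subtraction without the placeholder date.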

### Setting constants

In [4]:
FIRST_INDEX = -1
USERS = -1
ITEMS = -1
SPARSITY = -1                  # 'p' in the equations
UNOBSERVED = 0                 # default value in matrix for unobserved ratings
N_RATINGS = 7
C1 = 0                         # only to account for scale_factor in step 3
C2 = 1                         # only to account for scale_factor in step 3

RADIUS = 3                              # neighborhood radius = number of edges between start and end vertex
UNPRED_RATING = -1                      # placeholder (normalized) rating used when no prediction is available

### Read and prepare the dataset

In [5]:
m1_csr = read_data_csr(fname='../data/very_small_graph.txt', delimiter="\t")
check_and_set_data_csr(data_csr=m1_csr)

USERS = 4
ITEMS = 3
All users and items have at least one rating! Good!
SPARSITY (p) = 0.285714285714
Sym matrix : p is polynomially larger than 1/n, all guarantees applicable
Check and set dataset : done
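
The printed `SPARSITY (p)` can be sanity-checked by hand. The sketch below assumes that `p` is the fraction of observed entries in the symmetric (USERS+ITEMS) x (USERS+ITEMS) matrix, with each rating counted once as user-item and once as item-user. This reproduces the printed value, but the actual formula lives in `finalcode.ipynb` and may be written differently.

```python
USERS = 4
ITEMS = 3
N_RATINGS = 7

n = USERS + ITEMS            # vertices in the symmetric user+item matrix
observed = 2 * N_RATINGS     # assumption: each rating appears as (u, i) and (i, u)
p = observed / (n * n)       # fraction of observed entries
print(p)
```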


In [6]:
m1_csr = normalize_ratings_csr(m1_csr)          ##### REMOVE THIS CELL
m1_csr = csr_to_symmetric_csr(m1_csr)

Normalize ratings: done
CSR to symmetric CSR matrix: done


### Make predictions using THE algorithm 

##### Step 1: Sample splitting

In [7]:
# This step is skipped (not needed) for the very_small_graph.txt dataset

##### Step 2: Expanding the Neighborhood

In [8]:
[r_neighbor_matrix, r1_neighbor_matrix] = generate_neighbor_boundary_matrix(m1_csr)
# The neighbor boundary vector of each user u is stored as the u'th row of the neighbor matrix.
# Although each vector is stored as a row vector, it is treated as a column vector in Step 4.
# Note: we might expect the neighbor matrix to be symmetric, with dimensions (USERS+ITEMS) x (USERS+ITEMS),
#     : since the user-item and item-user distances should be the same.
#     : This is not the case, because there may be multiple paths between a user and an item,
#     : and the random path picked for user-item may differ from the one picked for item-user.
#     : Normalizing the matrix introduces further differences.

Creating graph as dictionary:


100%|██████████| 7/7 [00:00<00:00, 18465.49it/s]

Generating neighbor boundary matrix at 3-hop distance:



100%|██████████| 7/7 [00:00<00:00, 12793.08it/s]

Generating neighbor boundary matrix at 4-hop distance:



100%|██████████| 7/7 [00:00<00:00, 7091.82it/s]


In [9]:
describe_neighbor_count(r_neighbor_matrix)

To effectively choose RADIUS value for next run of algorithm:
Showing distribution of count of neighbors for every vertex:
              0
count  7.000000
mean   1.428571
std    0.534522
min    1.000000
25%    1.000000
50%    1.000000
75%    2.000000
max    2.000000


In [10]:
describe_neighbor_count(r1_neighbor_matrix)

To effectively choose RADIUS value for next run of algorithm:
Showing distribution of count of neighbors for every vertex:
              0
count  7.000000
mean   0.285714
std    0.487950
min    0.000000
25%    0.000000
50%    0.000000
75%    0.500000
max    1.000000
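
Step 2 looks at the vertices on the boundary of each vertex's neighborhood at `RADIUS` and `RADIUS + 1` hops (3 and 4 here). The sketch below shows only the core idea, a breadth-first search that returns the vertices exactly `radius` hops away. It is not the code of `generate_neighbor_boundary_matrix`: the function name and the toy graph are hypothetical, and it leaves out the path weights, the random choice among multiple paths, and the normalization mentioned in the comments above.

```python
from collections import deque

def boundary_at_radius(graph, source, radius):
    """Return the vertices whose shortest-path distance from source is exactly radius."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if depth[vertex] == radius:
            continue  # do not expand past the requested radius
        for neighbor in graph[vertex]:
            if neighbor not in depth:
                depth[neighbor] = depth[vertex] + 1
                queue.append(neighbor)
    return {v for v, d in depth.items() if d == radius}

# Hypothetical bipartite toy graph (NOT very_small_graph.txt):
# users 0-1 and items 2-3, stored as an adjacency dictionary
toy_graph = {
    0: [2],
    1: [2, 3],
    2: [0, 1],
    3: [1],
}

print(boundary_at_radius(toy_graph, 0, 3))
print(boundary_at_radius(toy_graph, 0, 4))
```

In this toy graph the only path from vertex 0 is 0-2-1-3, so the 4-hop boundary is empty. The many zero counts reported for `r1_neighbor_matrix` above are consistent with the same thing happening in a graph this small.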


##### Step 3: Computing the distances

In [11]:
distance_matrix = compute_distance_matrix(r_neighbor_matrix, r1_neighbor_matrix, m1_csr)
distance_matrix

Generating distance matrix


100%|██████████| 7/7 [00:00<00:00, 2327.03it/s]


array([[ 0.       ,  0.07168  ,  0.       ,  0.14336  ,  0.       ,
         0.       ,  0.       ],
       [ 0.07168  ,  0.       ,  0.0086016,  0.243712 ,  0.100352 ,
         0.100352 ,  0.100352 ],
       [ 0.       ,  0.0086016,  0.       ,  0.14336  ,  0.       ,
         0.       ,  0.       ],
       [ 0.14336  ,  0.243712 ,  0.14336  ,  0.       ,  0.14336  ,
         0.14336  ,  0.14336  ],
       [ 0.       ,  0.100352 ,  0.       ,  0.14336  ,  0.       ,
         0.       ,  0.       ],
       [ 0.       ,  0.100352 ,  0.       ,  0.14336  ,  0.       ,
         0.       ,  0.       ],
       [ 0.       ,  0.100352 ,  0.       ,  0.14336  ,  0.       ,
         0.       ,  0.       ]])

In [12]:
describe_distance_matrix(distance_matrix)

To effectively choose THRESHOLD value in next step:
Showing distribution of non zero (or observed) entries of distance matrix:
               0
count  22.000000
mean    0.121986
std     0.056814
min     0.008602
25%     0.100352
50%     0.143360
75%     0.143360
max     0.243712
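
The summary above can be cross-checked against the distance matrix printed by cell [11]. The sketch below recomputes the same statistics over the nonzero entries with the standard library. It assumes the summary uses the sample standard deviation (dividing by n - 1), which matches the printed `std`. The variable names are illustrative.

```python
import statistics

# Distinct rows of the distance matrix, copied from the output of cell [11]
row_0 = [0.0, 0.07168, 0.0, 0.14336, 0.0, 0.0, 0.0]
row_1 = [0.07168, 0.0, 0.0086016, 0.243712, 0.100352, 0.100352, 0.100352]
row_2 = [0.0, 0.0086016, 0.0, 0.14336, 0.0, 0.0, 0.0]
row_3 = [0.14336, 0.243712, 0.14336, 0.0, 0.14336, 0.14336, 0.14336]
row_4 = [0.0, 0.100352, 0.0, 0.14336, 0.0, 0.0, 0.0]   # rows 4, 5 and 6 are identical
distance_rows = [row_0, row_1, row_2, row_3, row_4, row_4, row_4]

# Keep only the nonzero (observed) entries, as the summary header states
nonzero = [d for row in distance_rows for d in row if d != 0.0]

summary = {
    "count": len(nonzero),
    "mean": statistics.mean(nonzero),
    "std": statistics.stdev(nonzero),    # sample standard deviation (n - 1)
    "min": min(nonzero),
    "50%": statistics.median(nonzero),
    "max": max(nonzero),
}
for name, value in summary.items():
    print(name, value)
```

Because the largest nonzero distance is 0.243712, any `THRESHOLD` above that value keeps every observed pair in the next step.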


##### Step 4: Averaging datapoints to produce final estimate

In [13]:
sim_matrix = generate_sim_matrix(distance_matrix, threshold=.26)
sim_matrix

Generating distance similarity matrix:


100%|██████████| 7/7 [00:00<00:00, 4688.62it/s]


array([[False,  True, False,  True, False, False, False],
       [ True, False,  True,  True,  True,  True,  True],
       [False,  True, False,  True, False, False, False],
       [ True,  True,  True, False,  True,  True,  True],
       [False,  True, False,  True, False, False, False],
       [False,  True, False,  True, False, False, False],
       [False,  True, False,  True, False, False, False]], dtype=bool)
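
The mask printed above is consistent with a simple rule: a pair counts as similar when its distance is nonzero and below `threshold`. That rule is an inference from the output, not the confirmed implementation of `generate_sim_matrix`. The sketch below applies it to the top-left 3 x 3 corner of the distance matrix from cell [11]; the helper name is hypothetical.

```python
def similarity_mask(distances, threshold):
    """Mark a pair as similar when its distance is nonzero and below threshold."""
    return [[0 < d < threshold for d in row] for row in distances]

# Top-left 3 x 3 corner of the distance matrix printed by cell [11]
corner = [
    [0.0,     0.07168,   0.0],
    [0.07168, 0.0,       0.0086016],
    [0.0,     0.0086016, 0.0],
]

loose = similarity_mask(corner, threshold=0.26)   # same threshold as cell [13]
tight = similarity_mask(corner, threshold=0.05)   # hypothetical stricter value
print(loose)
print(tight)
```

With `threshold=0.26` this reproduces the corresponding corner of `sim_matrix`. The stricter value keeps only the pair at distance 0.0086016, which shows how the threshold trades the number of neighbors against their closeness.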

In [14]:
prediction_array = generate_averaged_prediction_array(sim_matrix, m1_csr, m1_csr)
prediction_array

Generating prediction array:


100%|██████████| 7/7 [00:00<00:00, 7875.57it/s]


array([ 0.2       ,  0.2       ,  0.73333333,  0.2       ,  0.2       ,
        0.73333333,  0.73333333])

In [15]:
prediction_matrix = generate_averaged_prediction_matrix(sim_matrix, m1_csr)
prediction_matrix

Generating prediction matrix:


100%|██████████| 7/7 [00:00<00:00, 656.55it/s]


array([[ 0.2       ,  0.73333333,  0.2       ,  0.73333333,  0.2       ,
         0.2       ,  0.2       ],
       [ 0.73333333,  0.66666667,  0.73333333,  0.70909091,  0.73333333,
         0.73333333,  0.73333333],
       [ 0.2       ,  0.73333333,  0.2       ,  0.73333333,  0.2       ,
         0.2       ,  0.2       ],
       [ 0.73333333,  0.70909091,  0.73333333,  0.76      ,  0.73333333,
         0.73333333,  0.73333333],
       [ 0.2       ,  0.73333333,  0.2       ,  0.73333333,  0.2       ,
         0.2       ,  0.2       ],
       [ 0.2       ,  0.73333333,  0.2       ,  0.73333333,  0.2       ,
         0.2       ,  0.2       ],
       [ 0.2       ,  0.73333333,  0.2       ,  0.73333333,  0.2       ,
         0.2       ,  0.2       ]])

### Evaluate the predictions

In [16]:
# We have already prepared the test data (required for our algorithm)
test_data_csr = m1_csr
y_actual  = test_data_csr[:,2]
y_predict = prediction_array
# If we wanted, we could scale the ratings back to the 1-5 range for evaluation purposes,
# but the paper makes no guarantees about scaled ratings
#y_actual  = y_actual * 5
#y_predict = y_predict * 5

In [17]:
get_rmse(y_actual, y_predict)

0.4700557211144026

In [18]:
get_avg_err(y_actual, y_predict)

0.38095238095238093
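
For reference, the two error measures can be written out in a few lines. This sketch assumes that `get_rmse` is the usual root-mean-square error and that `get_avg_err` is the mean absolute error; the second is a guess from the function name. The ratings below are hypothetical and are not the notebook's data.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mean_abs_error(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical normalized ratings, for illustration only
y_true = [0.2, 0.4, 1.0, 0.6]
y_pred = [0.2, 0.8, 0.6, 0.6]

print(rmse(y_true, y_pred))
print(mean_abs_error(y_true, y_pred))
```

RMSE squares each error before averaging, so it penalizes a few large misses more heavily than the mean absolute error does.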

In [19]:
check_mse(m1_csr, y_actual, y_predict)

MSE Upper bound: 0.870550563296
MSE of predictions: 0.220952380952
As per the discussion in the paper, MSE is bounded by O((pn)**(-1/5))
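
Both numbers above can be reproduced from values already printed in this notebook. The sketch below assumes the upper bound is evaluated as `(p * n) ** (-1/5)` with the constant hidden by the O(.) notation taken as 1, and with `n = USERS + ITEMS`. This matches the printed bound, and the printed MSE is the square of the RMSE from cell [17].

```python
SPARSITY = 0.285714285714      # p, as printed by check_and_set_data_csr
USERS = 4
ITEMS = 3
n = USERS + ITEMS              # assumption: n counts both users and items

# Bound from the printed message, with the O(.) constant taken as 1
mse_upper_bound = (SPARSITY * n) ** (-1 / 5)

rmse = 0.4700557211144026      # output of get_rmse in cell [17]
mse = rmse ** 2

print(mse_upper_bound)
print(mse)
print(mse <= mse_upper_bound)
```

Here `p * n` is about 2, so the bound is roughly `2 ** (-0.2)`, which is loose for a graph this small.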


In [20]:
datetime.now().time()     # (hour, min, sec, microsec)

datetime.time(18, 43, 4, 169284)