Import the modules used throughout this Jupyter notebook.

In [None]:
# Built and tested on Python 2
import numpy as np
from tqdm import *

# Model Notations
$u$ = user <br>
$i$ = item <br>
$M$ = symmetric rating matrix of size $n \times n$ (usually the dataset) <br>
$E$ = set of pairs $(u,i)$ such that user $u$ has rated item $i$ in matrix $M$ (intuitively, $E$ is the set of edges between users and items) <br>
$p$ = sparsity of $M$, i.e. the fraction of observed entries (# observed ratings in $M$ / total # entries in $M$) <br>
$r$ = radius, i.e. the distance (in number of edges) between user $u$ and item $i$ at the neighborhood boundary (see Step 2) <br>
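
As a small illustration of the notation, the sketch below derives $E$ and $p$ from a toy matrix. It assumes unobserved ratings are stored as `np.nan`; that encoding and the names `observed_edges`, `sparsity` and `M_toy` are illustrative choices, not taken from the notebook's own code.

```python
import numpy as np

def observed_edges(M):
    """Return the set E of index pairs (u, i) whose rating is observed (not NaN)."""
    rows, cols = np.nonzero(~np.isnan(M))
    return set(zip(rows.tolist(), cols.tolist()))

def sparsity(M):
    """Return p = (# observed ratings) / (total # entries in M)."""
    return float(np.count_nonzero(~np.isnan(M))) / M.size

# Toy symmetric matrix: NaN marks an unobserved rating (assumed encoding).
M_toy = np.array([[np.nan, 0.5, np.nan],
                  [0.5, np.nan, 1.0],
                  [np.nan, 1.0, np.nan]])
E_toy = observed_edges(M_toy)
p_toy = sparsity(M_toy)
```

Because $M$ is symmetric, each undirected edge appears twice in `E_toy`, once as $(u,i)$ and once as $(i,u)$.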

# Model preparation
We first look at a function that converts our asymmetric rating matrix to a symmetric matrix, and another function that normalizes the ratings to the range [0,1].
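
A minimal sketch of these two helpers is given here, under stated assumptions: the asymmetric matrix `R` has users as rows and items as columns, unobserved ratings are `np.nan`, and normalization is plain min-max scaling over the observed entries. The function names `symmetrize` and `normalize_ratings` are hypothetical and may differ from the notebook's own implementation.

```python
import numpy as np

def symmetrize(R):
    """Embed an (n_users x n_items) matrix R into a symmetric square matrix.

    Users take indices 0..n_users-1 and items take the indices after them.
    The user-user and item-item blocks stay unobserved (NaN).
    """
    n_users, n_items = R.shape
    n = n_users + n_items
    M = np.full((n, n), np.nan)
    M[:n_users, n_users:] = R
    M[n_users:, :n_users] = R.T
    return M

def normalize_ratings(M):
    """Min-max scale the observed entries of M to [0, 1]; NaNs stay NaN."""
    lo, hi = np.nanmin(M), np.nanmax(M)
    if hi == lo:
        # All observed ratings are equal; map them to 0 to avoid dividing by zero.
        return np.where(np.isnan(M), np.nan, 0.0)
    return (M - lo) / (hi - lo)

# Toy example: 2 users x 2 items, ratings on a 1-5 scale.
R_toy = np.array([[5.0, np.nan],
                  [1.0, 3.0]])
M_sym = symmetrize(R_toy)
M_norm = normalize_ratings(M_sym)
```

With this layout, user $u$ and item $i$ of `R` become vertices `u` and `n_users + i` of the symmetric matrix.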

# Algorithm Details
As per the paper: *We present and discuss details of each step of the algorithm, which primarily involves computing pairwise distances (or similarities) between vertices.*

### Step 1: Sample Splitting
Partition the rating matrix into three parts. The following are excerpts from the paper:
- *Each edge in $E$ is independently placed into $E_1, E_2,$ or $E_3$, with probabilities $c_1, c_2,$ and $1 - c_1 - c_2$ respectively. Matrices $M_1, M_2$, and $M_3$ contain information from the subset of the data in $M$ associated to $E_1, E_2$, and $E_3$ respectively.*
- *$M_1$ is used to define local neighborhoods of each vertex (in step 2), $M_2$ is used to compute similarities of these neighborhoods (in step 3), and $M_3$ is used to average over datapoints for the final estimate (in step 4)*
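
The splitting rule quoted above can be sketched as follows. This is an illustrative version, not necessarily the notebook's implementation: it assumes $M$ is symmetric with `np.nan` for unobserved entries, and the function name `sample_split`, the `seed` argument and the example values of $c_1$ and $c_2$ are assumptions.

```python
import numpy as np

def sample_split(M, c1, c2, seed=0):
    """Split the observed entries of a symmetric matrix M into M1, M2, M3.

    Each undirected edge (u, i) with u <= i is assigned independently to part
    1, 2 or 3 with probabilities c1, c2 and 1 - c1 - c2, then mirrored so that
    every part stays symmetric. Entries not assigned to a part are NaN there.
    """
    rng = np.random.RandomState(seed)
    parts = [np.full(M.shape, np.nan) for _ in range(3)]
    # Upper triangle of the observation mask: one entry per undirected edge.
    rows, cols = np.nonzero(np.triu(~np.isnan(M)))
    labels = rng.choice(3, size=len(rows), p=[c1, c2, 1.0 - c1 - c2])
    for k in range(3):
        r, c = rows[labels == k], cols[labels == k]
        parts[k][r, c] = M[r, c]
        parts[k][c, r] = M[r, c]
    return parts

# Toy symmetric matrix with four undirected edges.
M_toy = np.full((4, 4), np.nan)
for a, b, rating in [(0, 2, 0.2), (0, 3, 0.9), (1, 2, 0.5), (1, 3, 0.7)]:
    M_toy[a, b] = M_toy[b, a] = rating
M1, M2, M3 = sample_split(M_toy, c1=0.3, c2=0.3, seed=1)
```

Sampling on the upper triangle only keeps both copies of an undirected edge in the same part.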

### Step 2: Expanding the Neighborhood
We do the following in this step:
- Radius $r$ is to be tuned using cross-validation. Its default value, as per the paper, is $r = \frac{6\ln(1/p)}{8\ln(c_1pn)}$.
- Use matrix $M_1$ to build the neighborhood based on radius $r$.
- Build a BFS tree rooted at each vertex to get the product of ratings along the path from user to item, such that
  - each vertex (user or item) in a path from the user to a boundary item is unique
  - the path chosen is the shortest path (in number of edges) between the user and the boundary item
  - in case of multiple paths (or trees), choose any one path (i.e. any one tree) at random
- Normalize the product of ratings by the total number of items at the boundary.
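
For reference, the default radius from the first bullet can be evaluated as in the sketch below. Rounding the result up to a whole number of hops, with a minimum of one, is an assumption made here because a BFS depth must be an integer; the name `default_radius` is hypothetical.

```python
import math

def default_radius(p, n, c1):
    """Evaluate r = 6 ln(1/p) / (8 ln(c1 * p * n)) and round up to >= 1 hop.

    The rounding rule is an assumption of this sketch, not a statement of
    how the paper or the notebook discretizes r.
    """
    growth = c1 * p * n
    if growth <= 1.0:
        raise ValueError("c1 * p * n must exceed 1 for the formula to be positive")
    r = 6.0 * math.log(1.0 / p) / (8.0 * math.log(growth))
    return max(1, int(math.ceil(r)))

r_sparse = default_radius(p=0.001, n=10000, c1=0.5)
r_dense = default_radius(p=0.01, n=10000, c1=0.5)
```

Sparser data yields a larger radius, since more hops are needed before the neighborhood contains enough vertices.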

The resulting $N_{u,r}$ is the $r$-hop neighborhood vector of user $u$: each element is the product of ratings along the path from the user to a boundary item, or zero for vertices that are not on the boundary. $\tilde{N_{u,r}}$ is the normalized vector.
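
The neighborhood expansion can be sketched with a plain breadth-first search. This is an illustrative version under assumptions: $M_1$ uses `np.nan` for unobserved entries, and when several shortest paths reach the same vertex the first one discovered (in index order) is kept, rather than a randomly chosen one. The name `neighborhood_vector` is hypothetical.

```python
import numpy as np
from collections import deque

def neighborhood_vector(M1, u, r):
    """Return the normalized r-hop neighborhood vector of vertex u.

    Entry i holds the product of ratings along the BFS-tree path from u to i
    when i lies exactly r hops from u, and 0 otherwise. The vector is divided
    by the number of boundary vertices.
    """
    n = M1.shape[0]
    depth = {u: 0}
    product = {u: 1.0}
    queue = deque([u])
    while queue:
        a = queue.popleft()
        if depth[a] == r:
            continue  # do not expand past the boundary
        for b in np.nonzero(~np.isnan(M1[a]))[0]:
            b = int(b)
            if b not in depth:  # first visit fixes the shortest tree path
                depth[b] = depth[a] + 1
                product[b] = product[a] * M1[a, b]
                queue.append(b)
    N = np.zeros(n)
    boundary = [v for v, d in depth.items() if d == r]
    for v in boundary:
        N[v] = product[v]
    return N / len(boundary) if boundary else N

# Toy graph: users 0 and 1, items 2 and 3.
M1_toy = np.full((4, 4), np.nan)
for a, b, rating in [(0, 2, 0.5), (0, 3, 0.8), (1, 2, 0.4)]:
    M1_toy[a, b] = M1_toy[b, a] = rating
N_0_1 = neighborhood_vector(M1_toy, 0, 1)
N_0_2 = neighborhood_vector(M1_toy, 0, 2)
```

Because each vertex is recorded only on its first visit, every tree path is a shortest path and visits each vertex at most once.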


### Step 3: Computing the distances
The distance between two users is computed (using matrix $M_2$) with the following formula (only $dist_1$ is implemented for now):

$$ dist_1(u,v) = \left(\frac{1 - c_1p}{c_2p}\right) (\tilde{N_{u,r}} - \tilde{N_{v,r}})^T M_2 (\tilde{N_{u,r+1}} - \tilde{N_{v,r+1}}) $$
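
A direct transcription of this formula into NumPy might look like the sketch below. It assumes the four normalized neighborhood vectors have already been computed and that unobserved entries of $M_2$ (stored as `np.nan`) contribute zero to the product; the function name `dist1` and its argument names are illustrative.

```python
import numpy as np

def dist1(N_u_r, N_v_r, N_u_r1, N_v_r1, M2, p, c1, c2):
    """Scaled bilinear form of the r-hop and (r+1)-hop vector differences.

    Computes ((1 - c1*p) / (c2*p)) * (N_u_r - N_v_r)^T M2 (N_u_r1 - N_v_r1).
    """
    M2_filled = np.nan_to_num(M2)  # unobserved entries contribute 0 (assumption)
    scale = (1.0 - c1 * p) / (c2 * p)
    return scale * np.dot(N_u_r - N_v_r, np.dot(M2_filled, N_u_r1 - N_v_r1))

# Toy check with two vertices and a single observed edge of rating 1.
M2_toy = np.array([[np.nan, 1.0],
                   [1.0, np.nan]])
d_uv = dist1(np.array([1.0, 0.0]), np.array([0.0, 0.0]),
             np.array([0.0, 1.0]), np.array([0.0, 0.0]),
             M2_toy, p=0.5, c1=0.5, c2=0.5)
d_uu = dist1(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
             np.array([0.0, 1.0]), np.array([0.0, 1.0]),
             M2_toy, p=0.5, c1=0.5, c2=0.5)
```

When both users have identical neighborhood vectors, the differences vanish and the distance is zero.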

### Step 4: Averaging datapoints to produce final estimate
Average over nearby data points based on the distance (similarity) threshold $n_n$ (using matrix $M_3$). $n_n$ is to be tuned using cross-validation. Mathematically (from the paper):

$$ \hat{F_{u,v}} = \frac{1}{\mid E_{uv1} \mid} \sum_{(a,b) \in E_{uv1}} M_3(a,b) $$
*where $E_{uv1}$ denotes the set of undirected edges $(a, b)$ such that $(a, b) \in E_3$ and both $dist_1(u, a)$ and $dist_1(v, b)$ are less than $n_n$*
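
The averaging step can be sketched as below, assuming a precomputed square array `dist` holding $dist_1$ for every vertex pair and `np.nan` for unobserved entries of $M_3$. The function name `estimate_rating`, the `threshold` argument (standing in for $n_n$) and the NaN return value for an empty neighborhood are assumptions of this sketch.

```python
import numpy as np

def estimate_rating(u, v, dist, M3, threshold):
    """Average M3(a, b) over observed (a, b) with a near u and b near v.

    "Near" means dist[u, a] < threshold and dist[v, b] < threshold.
    Returns NaN when no observed entry qualifies.
    """
    near_u = np.nonzero(dist[u] < threshold)[0]
    near_v = np.nonzero(dist[v] < threshold)[0]
    block = M3[np.ix_(near_u, near_v)]
    observed = ~np.isnan(block)
    if not observed.any():
        return np.nan
    return float(block[observed].mean())

# Toy setup: users 0 and 1 are close, items 2 and 3 are close.
dist_toy = np.full((4, 4), 10.0)
np.fill_diagonal(dist_toy, 0.0)
dist_toy[0, 1] = dist_toy[1, 0] = 0.1
dist_toy[2, 3] = dist_toy[3, 2] = 0.05
M3_toy = np.full((4, 4), np.nan)
M3_toy[0, 3] = M3_toy[3, 0] = 0.4
M3_toy[1, 2] = M3_toy[2, 1] = 0.6
F_hat = estimate_rating(0, 2, dist_toy, M3_toy, threshold=0.2)
```

Here the unobserved rating of user 0 for item 2 is estimated from the two observed ratings among their neighbors.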

# Other important functions

### Data Handling functions

### Substitute functions
Functions that can be used in place of the algorithm-specific implementations for testing purposes.

### Evaluation
We evaluate our recommendation algorithm using RMSE (root mean square error). <br>
According to the paper, if the sparsity $p$ is polynomially larger than $n^{-1}$, i.e. if $p = n^{-1 + \epsilon}$ for some $\epsilon > 0$, then we can safely use the $dist_1$ distance computation formula, and the MSE is bounded by $O((pn)^{-1/5})$.
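
A minimal RMSE helper is sketched here for concreteness. Skipping pairs where either value is `np.nan` is an assumption, chosen because the estimator may fail to produce a prediction for some entries; the name `rmse` is illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over pairs where both values are present (not NaN)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    valid = ~np.isnan(y_true) & ~np.isnan(y_pred)
    if not valid.any():
        return np.nan
    return float(np.sqrt(np.mean((y_true[valid] - y_pred[valid]) ** 2)))

err_full = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
err_skip = rmse([1.0, np.nan], [2.0, 5.0])
```

Note that the bound quoted above is on the MSE, which is the square of this quantity.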

# Test Script / Experiment
The following Jupyter notebook cells call the functions defined in the cells above to run experiments on a recommendation dataset.