## Random Projection

### Loading the data

We will load a user-rating dataset in which each line has the form

userId,movieId,rating,timestamp

The file begins with a single header line.

The timestamps are useful, but for now we need only userId, movieId and rating.

In [None]:
import numpy as np
import pandas as pd

ratings = pd.read_csv("/Users/debapriyo/Dropbox/data/ml-latest/ratings.csv")

ratings
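
Since only three of the four columns are needed, the timestamp can be skipped at parse time. The sketch below is an illustration, not the notebook's own loading step: it reads a tiny in-memory stand-in for the file (the rows are made up) and uses the `usecols` argument of `pd.read_csv` to keep just the columns of interest.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for ratings.csv (made-up rows, same header layout).
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
    "2,1,3.5,1445714835\n"
)

# usecols drops the timestamp column while parsing, which saves memory on large files.
sample = pd.read_csv(sample_csv, usecols=["userId", "movieId", "rating"])

print(sample.columns.tolist())
print(sample.shape)
```

The same `usecols` argument could be passed when reading the real file.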

## Creating a sparse matrix of the ratings

Instead of a dense 2-D array, we will store the ratings in a sparse matrix, with one row per item (movie) and one column per user.

In [None]:
from scipy.sparse import csr_matrix

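# Toy example: vv[k] is stored at row ii[k], column jj[k]
ii = np.array([1,2,1,2,4])
jj = np.array([1,1,2,3,3])
vv = np.array([1,1,2,3,2])

# Shape is inferred from the largest indices, so row 0 and column 0 stay empty
R2 = csr_matrix((vv, (ii,jj)))

print(R2)
print(R2.todense())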

In [None]:
from scipy.sparse import csr_matrix

users = ratings["userId"].values.astype(int)
movies = ratings["movieId"].values.astype(int)
vals = ratings["rating"].values

# IDs start at 1, so subtract 1 to get 0-based row (movie) and column (user) indices
R = csr_matrix((vals,(movies-1,users-1)))

print("The data has %d items with %d dimensions (users)." %(R.shape[0], R.shape[1]))
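
Because the shape is inferred from the largest ID, any movie ID that never appears in the data still gets a row, and that row is entirely empty. The sketch below uses made-up IDs and ratings (the names `movie_ids`, `user_ids`, `toy_ratings` and `toy` are hypothetical) to show how to count stored entries, measure density, and spot empty rows.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up ratings: movie IDs 3 and 4 never occur, so their rows stay empty.
movie_ids = np.array([1, 2, 1, 5])
user_ids = np.array([1, 1, 3, 2])
toy_ratings = np.array([4.0, 3.5, 5.0, 2.0])

toy = csr_matrix((toy_ratings, (movie_ids - 1, user_ids - 1)))

# Fraction of cells that actually hold a rating.
density = toy.nnz / (toy.shape[0] * toy.shape[1])

# Number of stored ratings per row (per movie).
ratings_per_movie = toy.getnnz(axis=1)

print(toy.shape)
print(density)
print(ratings_per_movie)
```

Empty rows matter later on: two empty rows are at distance zero from each other, both before and after projection.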

## Random projection -- estimating the dimensions

Let us first estimate how many dimensions we will need for certain error bounds. 

In [None]:
from sklearn.random_projection import johnson_lindenstrauss_min_dim
from sklearn import random_projection

for ep in [0.1, 0.15, 0.2, 0.25, 0.3, 0.4]:
    min_dim = johnson_lindenstrauss_min_dim(n_samples=R.shape[0], eps=ep)
    print("Minimum number of dimensions for epsilon = %f is %d." %(ep, min_dim))
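
To see where these numbers come from, here is a minimal sketch of the bound that the scikit-learn documentation gives for `johnson_lindenstrauss_min_dim`, namely n_components >= 4 ln(n_samples) / (eps^2/2 - eps^3/3). The helper name `jl_min_dim` and the sample count of 50,000 are assumptions made for illustration. Note that the bound depends only on the number of samples and on eps, not on the original number of dimensions.

```python
import math

def jl_min_dim(n_samples, eps):
    """Hypothetical helper: documented JL bound, truncated to an integer."""
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return int(4 * math.log(n_samples) / denominator)

# Illustrative sample count; the real value would be R.shape[0].
for eps in (0.1, 0.2, 0.4):
    print(eps, jl_min_dim(50_000, eps))
```

Halving eps roughly quadruples the required dimension, which is why small error bounds are expensive.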

### Performing the random projection

A GaussianRandomProjection (which takes similar arguments) would be expensive to compute for this data on this machine, because its projection matrix is dense. We will use a sparse random projection instead.

In [None]:
# Projection with epsilon = 0.1
transformer = random_projection.SparseRandomProjection(eps=0.1)
RP = transformer.fit_transform(R)
print(RP.shape)
print(RP)
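
The following sketch illustrates the kind of matrix a sparse random projection uses, following the distribution described in the scikit-learn documentation for `SparseRandomProjection`: with s = 1/density, each entry is -sqrt(s/k) or +sqrt(s/k) with probability 1/(2s) each, and 0 otherwise, where k is the target dimension and the default density is 1/sqrt(n_features). The helper `sparse_projection_matrix` and the sizes are assumptions for illustration. It builds a dense NumPy array for simplicity, so it is not how the library stores the matrix.

```python
import numpy as np

def sparse_projection_matrix(n_features, n_components, rng):
    """Hypothetical helper: draw a sparse-style random projection matrix."""
    s = np.sqrt(n_features)            # s = 1 / density, density = 1 / sqrt(n_features)
    scale = np.sqrt(s / n_components)  # makes E[entry^2] = 1 / n_components
    probs = [1 / (2 * s), 1 - 1 / s, 1 / (2 * s)]
    return rng.choice([-scale, 0.0, scale], size=(n_features, n_components), p=probs)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2000))              # made-up data: 20 points in 2000 dimensions
P = sparse_projection_matrix(2000, 500, rng)
Y = X @ P                                    # projected points in 500 dimensions

# Ratio of projected to original distance for one pair of points.
ratio = np.linalg.norm(Y[0] - Y[1]) / np.linalg.norm(X[0] - X[1])
print(P.shape, np.mean(P != 0), ratio)
```

Most entries of `P` are zero, so the projection can be applied cheaply, yet the distance ratio should stay close to 1.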

## Testing effectiveness

We test the effectiveness of the random projection by comparing distances between items in the original space with the distances between the same items in the reduced space.

First we select some pairs of items at random.

In [None]:
n = 2000

randpairs = np.random.randint(R.shape[0],size=(n,2))
print(randpairs)

### Computing distances

Let us define the distance between two vectors as the Euclidean norm of their difference.

In [None]:
def dist(v1, v2):
    return np.linalg.norm(v1-v2)

v1 = np.array([1,1])
v2 = np.array([0,1])

print(dist(v1,v2))
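
For rows of a sparse matrix, the same distance can be computed without converting to dense arrays. This is a sketch of one possible alternative, not the approach the notebook uses: the helper name `sparse_dist` is hypothetical, and it relies on `scipy.sparse.linalg.norm`, which returns the Frobenius norm by default.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import norm as sparse_norm

def sparse_dist(row1, row2):
    """Hypothetical helper: Euclidean distance between two 1 x d sparse rows."""
    return sparse_norm(row1 - row2)

# Same two vectors as above, stored as rows of a sparse matrix.
M = csr_matrix(np.array([[1.0, 1.0], [0.0, 1.0]]))
d = sparse_dist(M[0], M[1])
print(d)
```

Staying sparse avoids allocating one full-length dense vector per row, which may matter when there are many users.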

### Distance in the original dimension

In [None]:
import time

dist_original = np.zeros(n)
total_time = 0.0
for i, pair in enumerate(randpairs):
    tick = time.time()
    # Densify the two item rows, then take their Euclidean distance
    v1 = R[pair[0]].todense()
    v2 = R[pair[1]].todense()
    dist_original[i] = dist(v1,v2)
    tock = time.time()
    total_time = total_time + tock - tick
print("Total time: ", (total_time) , " seconds.")

### Distance in the reduced dimension

In [None]:
dist_reduced = np.zeros(n)
total_time = 0.0
for i, pair in enumerate(randpairs):
    tick = time.time()
    # Same computation as before, now on the projected rows
    v1 = RP[pair[0]].todense()
    v2 = RP[pair[1]].todense()
    dist_reduced[i] = dist(v1,v2)
    tock = time.time()
    total_time = total_time + tock - tick
print("Total time: ", (total_time) , " seconds.")
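
The per-pair loop can also be replaced by a single vectorised computation over all pairs. The sketch below is one possible way to do this with SciPy sparse operations. The helper `pairwise_row_distances` and the toy matrix are assumptions for illustration, and timings on the real data are not claimed here.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pairwise_row_distances(M, pairs):
    """Hypothetical helper: Euclidean distance between rows pairs[k, 0] and pairs[k, 1]."""
    diff = M[pairs[:, 0]] - M[pairs[:, 1]]   # one sparse row of differences per pair
    sq = diff.multiply(diff).sum(axis=1)     # squared distance per pair
    return np.sqrt(np.asarray(sq).ravel())

# Toy matrix with an empty third row.
M = csr_matrix(np.array([[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]]))
pairs = np.array([[0, 1], [0, 2], [2, 2]])
distances = pairwise_row_distances(M, pairs)
print(distances)
```

The same function could be applied to both the original and the projected matrix with the same array of pairs.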

### How close are we after reducing dimension?

In [None]:
error = dist_original - dist_reduced

# Relative error, skipping pairs whose original distance is zero
nonzero = dist_original != 0
nz_error = np.abs(error[nonzero] / dist_original[nonzero])

print("Mean absolute eps: ", np.mean(nz_error))
print("Max eps: ", np.max(nz_error))

# Fraction of the nonzero-distance pairs whose relative error exceeds epsilon = 0.1
indicator = nz_error > 0.1
print(np.mean(indicator))
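
The Johnson-Lindenstrauss guarantee is usually stated for squared distances: (1 - eps) ||u - v||^2 <= ||p(u) - p(v)||^2 <= (1 + eps) ||u - v||^2. The sketch below checks that form directly. The helper `fraction_outside_jl_band` and the toy distances are made up for illustration and are not results from the ratings data.

```python
import numpy as np

def fraction_outside_jl_band(d_orig, d_red, eps):
    """Hypothetical helper: share of pairs whose squared-distance ratio leaves [1 - eps, 1 + eps]."""
    d_orig = np.asarray(d_orig, dtype=float)
    d_red = np.asarray(d_red, dtype=float)
    mask = d_orig > 0                          # skip zero-distance pairs
    ratio_sq = (d_red[mask] / d_orig[mask]) ** 2
    outside = (ratio_sq < 1 - eps) | (ratio_sq > 1 + eps)
    return outside.mean()

# Made-up distances: the second pair is distorted too much, the third is skipped.
toy_orig = np.array([10.0, 5.0, 0.0, 2.0])
toy_red = np.array([10.3, 4.0, 0.0, 2.0])
violations = fraction_outside_jl_band(toy_orig, toy_red, eps=0.1)
print(violations)
```

On the real data this could be called with `dist_original`, `dist_reduced` and the eps passed to the transformer.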