CF should avoid calculating full matrix when providing recommendations #406

Closed
rcurtin opened this issue Feb 19, 2015 · 2 comments

rcurtin commented Feb 19, 2015

Right now, the first line of CF::GetRecommendations() reads

rating = w * h

which has the issue that the system must have enough RAM to hold the entire dense rating matrix (one entry per item/user pair), even when only a handful of recommendations are needed. Then, we run tree-based kNN on the rating matrix, which is high-dimensional, so the search will be slow:

// Calculate the neighborhood of the queried users.
// This should be a templatized option.
neighbor::AllkNN a(rating, query);
arma::mat resultingDistances; // Temporary storage.
a.Search(numUsersForSimilarity, neighborhood, resultingDistances);
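For a sense of scale, here is a quick back-of-the-envelope calculation (the sizes are entirely hypothetical, not from any real dataset) of what materializing the dense rating matrix costs:

```python
# Hypothetical sizes, for illustration only.
n_items = 100_000
n_users = 1_000_000
bytes_per_double = 8

# rating = w * h materializes every (item, user) entry as a double.
ram_gb = n_items * n_users * bytes_per_double / 1e9
print(ram_gb)  # 800.0 GB for the rating matrix alone
```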

But this isn't necessary. Note that what we are trying to do is find the most similar users (columns), but we have decomposed the input matrix X = W * H. (H is the matrix that holds the user preferences, depending on how you look at it.)

Now, some quick linear algebra gives us that X.col(i) = W * H.col(i). But remember, we are looking for the nearest neighbors of X.col(i), so this is equivalent to the nearest neighbors of H.col(i). Why aren't we searching for the nearest neighbors in the H matrix, then?

A patch for this ticket should also include some information on the speedup obtained (in either a test program or the cf executable), and verification that the module provides the same results (perhaps through the already written tests).

@rcurtin rcurtin added this to the mlpack 1.1.0 milestone Feb 19, 2015

rcurtin commented Feb 24, 2015

My linear algebra is wrong. X.col(i) = W * H.col(i), but it is not true that d(X.col(i), X.col(j)) = d(H.col(i), H.col(j)) unless we make some assumptions about W, which we can't do. Oops. Self-assigning until I figure out how this can be done.
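For concreteness, a toy counterexample (made-up values) showing that the equivalence fails when W is, e.g., not orthonormal. Here W stretches the first latent dimension, so distances in H-space and X-space disagree:

```python
import math

# W scales the first latent dimension by 2; h_i and h_j are two user
# columns of H. All values are made up for illustration.
W = [[2.0, 0.0],
     [0.0, 1.0]]
h_i = [1.0, 0.0]
h_j = [0.0, 1.0]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_H = dist(h_i, h_j)                        # sqrt(2)
d_X = dist(matvec(W, h_i), matvec(W, h_j))  # sqrt(5), since X = W * H
print(d_H, d_X)  # the two distances differ
```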

@rcurtin rcurtin self-assigned this Feb 24, 2015
@rcurtin rcurtin removed the D: easy label Feb 24, 2015

rcurtin commented Mar 25, 2015

d(X.col(i), X.col(j)) = d(W H.col(i), W H.col(j)).

For the L2 distance (which is fine for now), we can show that this is the Mahalanobis distance between H.col(i) and H.col(j) with M^{-1} = W^T W. Decompose M^{-1} = L L^T (Cholesky decomposition), and then multiply H by L^T to obtain H' (this takes O(r^2 n) time). Then, once this is done,

d(X.col(i), X.col(j)) = d(H'.col(i), H'.col(j))

and each distance calculation takes only O(r) time. We can use simple nearest neighbor search out of the box on H', then.
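A minimal pure-Python sketch of this transformation, using a toy 3x2 W (items x rank, with all values made up) and two user columns of H, verifying that the item-space distance equals the distance between the transformed rank-space columns:

```python
import math

# Toy decomposition: W is 3x2 (items x rank, r = 2); h_i, h_j are two
# user columns of H. All values are made up for illustration.
W = [[1.0, 2.0],
     [0.5, 1.0],
     [2.0, 0.5]]
h_i = [1.0, 3.0]
h_j = [2.0, 1.0]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Gram matrix G = W^T W (r x r); this is the M^{-1} above.
G = [[sum(W[k][row] * W[k][col] for k in range(3)) for col in range(2)]
     for row in range(2)]

# Cholesky factorization G = L L^T, written out for the 2x2 case.
l11 = math.sqrt(G[0][0])
l21 = G[1][0] / l11
l22 = math.sqrt(G[1][1] - l21 * l21)
LT = [[l11, l21],
      [0.0, l22]]  # L^T

# H' = L^T H: each user column is transformed once, O(r^2 n) total.
hp_i = matvec(LT, h_i)
hp_j = matvec(LT, h_j)

d_item_space = dist(matvec(W, h_i), matvec(W, h_j))  # O(items) per pair
d_rank_space = dist(hp_i, hp_j)                      # O(r) per pair
print(d_item_space, d_rank_space)  # equal, so kNN can run on H'
```

After this one-time O(r^2 n) preprocessing step, any out-of-the-box Euclidean nearest neighbor search on H' recovers the same neighbors as searching the full rating matrix, at O(r) rather than O(items) per distance evaluation.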
