Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse matrix equality efficient #480

Merged

Conversation

rahulpalamuttam
Copy link
Contributor

This is a pull request that originates from a requirement in Spark for a more efficient SparseMatrix equality check. When comparing two CSCMatrices, use activeKeysIterator instead of keysIterator. This will return and check the non-zero-value keys in the SparseMatrix. An extra conditional is introduced to make sure the active size of both SparseMatrices are the same.

I tried to to implement the case where equality needed to be checked between a SparseMatrix and DenseMatrix. However, I couldn't find a way of doing this without traversing the DenseMatrix at least once.

@jkbradley
Copy link

Note: This PR was from [https://github.com/apache/spark/pull/8960] and [https://issues.apache.org/jira/browse/SPARK-10906]

case _ => false
override def equals(p1: Any) = (this, p1) match {
case (x: CSCMatrix[V], p1: CSCMatrix[_]) =>
x.rows == p1.rows && x.cols == p1.cols && x.activeSize == p1.activeSize &&

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is buggy. If there is an implicit 0 in one matrix and a corresponding explicit 0 in the other, then this will fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkbradley
Please see my comment below.

@jkbradley
Copy link

If one matrix is dense, I think it's OK to take time linear in the size of the dense matrix (since you have to anyways). The key optimization is for 2 sparse matrices.

@rahulpalamuttam
Copy link
Contributor Author

I do not think it is possible to have an explicit zero in the CSCmatrix.
Take a look at the following lines of code in the CSCMatrix class :

Conditional in update function :

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/CSCMatrix.scala#L79-L84

Locate function which utilizes binarySearch :

https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/CSCMatrix.scala#L148-L152

So if the element is not found then ind will be < 0. If v is zero, then we update nothing, if it isn't then we update the array value. I have updated the PR with an additional test that shows that explicit zeros become implicit.

@jkbradley
Copy link

Is that still true if you use the constructor? I don't think it is.

@rahulpalamuttam
Copy link
Contributor Author

@jkbradley
I have updated the test using CSCMatrix instead of CSCMatrix.Builder. - is this what you had in mind?
If so then this test passes as well.

@dlwh
Copy link
Member

dlwh commented Jan 6, 2016

i think he means the new CSCMatrix constructor that takes an array.

I'm ok with that possible equality breakage so long as we add a warning to
the constructor to CSCMatrix that you can't pass in an Array with zeros in
it and expect equality to work.

-- David

On Wed, Jan 6, 2016 at 2:11 PM, rahul palamuttam notifications@github.com
wrote:

I have updated the test using CSCMatrix instead of CSCMatrix.Builder. - is
this what you had in mind?
If so then this test passes as well.


Reply to this email directly or view it on GitHub
#480 (comment).

@jkbradley
Copy link

@dlwh In MLlib, we'd like equality to ensure semantic equality. Would you mind if we did the more careful check in Breeze? If you think it's too expensive, then we can implement it in MLlib instead.

@rahulpalamuttam As @dlwh said, I meant that MLlib sometimes creates a Breeze CSCMatrix using a constructor which does not check for explicit zeros. Because of that, we need to assume Breeze matrices could contain explicit zeros. Note also that the update() method could create an explicit 0 if you set an existing entry to 0.

@dlwh
Copy link
Member

dlwh commented Jan 6, 2016

ok sure. in that case we probably need a loop that walks over columns in
parallel, doing an operation similar to the loop that SparseVector's inner
product uses on each column.

https://github.com/dlwh/breeze/blob/4f10a4199f00e026827eb01dfcb06cdd1e558952/math/src/main/scala/breeze/linalg/operators/SparseVectorOps.scala#L891-L891

Probably only copy the "smallVectors" for starters unless you're feeling
ambitious.

-- David

On Wed, Jan 6, 2016 at 2:32 PM, jkbradley notifications@github.com wrote:

@dlwh https://github.com/dlwh In MLlib, we'd like equality to ensure
semantic equality. Would you mind if we did the more careful check in
Breeze? If you think it's too expensive, then we can implement it in MLlib
instead.

@rahulpalamuttam https://github.com/rahulpalamuttam As @dlwh
https://github.com/dlwh said, I meant that MLlib sometimes creates a
Breeze CSCMatrix using a constructor which does not check for explicit
zeros. Because of that, we need to assume Breeze matrices could contain
explicit zeros. Note also that the update() method could create an explicit
0 if you set an existing entry to 0.


Reply to this email directly or view it on GitHub
#480 (comment).

@dlwh
Copy link
Member

dlwh commented Jan 8, 2016

(lemme know if you need help with it!)

On Wed, Jan 6, 2016 at 2:39 PM, David Hall david.lw.hall@gmail.com wrote:

ok sure. in that case we probably need a loop that walks over columns in
parallel, doing an operation similar to the loop that SparseVector's inner
product uses on each column.

https://github.com/dlwh/breeze/blob/4f10a4199f00e026827eb01dfcb06cdd1e558952/math/src/main/scala/breeze/linalg/operators/SparseVectorOps.scala#L891-L891

Probably only copy the "smallVectors" for starters unless you're feeling
ambitious.

-- David

On Wed, Jan 6, 2016 at 2:32 PM, jkbradley notifications@github.com
wrote:

@dlwh https://github.com/dlwh In MLlib, we'd like equality to ensure
semantic equality. Would you mind if we did the more careful check in
Breeze? If you think it's too expensive, then we can implement it in MLlib
instead.

@rahulpalamuttam https://github.com/rahulpalamuttam As @dlwh
https://github.com/dlwh said, I meant that MLlib sometimes creates a
Breeze CSCMatrix using a constructor which does not check for explicit
zeros. Because of that, we need to assume Breeze matrices could contain
explicit zeros. Note also that the update() method could create an explicit
0 if you set an existing entry to 0.


Reply to this email directly or view it on GitHub
#480 (comment).

@rahulpalamuttam
Copy link
Contributor Author

@dlwh
Thank you. So I tried to implement the check based on the small inner product implementation. I had a tough time trying to figure out an analogous implementation for "indexAt". The recent commit uses activeIterator. This iterator also accesses the data array values linearly. I figured this would be more efficient rather than using the apply function which does a binary search for the value based on the given key.

I ended up using a sequence of while loops as you suggested over both iterators making sure to have additional while loops to skip over the zeros. The last pair of while loops you see is to ensure that an iterator that still has elements only contain zeros.

Not sure if there are any hidden implications and or edge cases here that I didn't consider.

while(ykeyval._2 == 0 && yIter.hasNext) ykeyval = yIter.next()
if(xkeyval != ykeyval) return false
}
if(xIter.hasNext == true && yIter.hasNext == false){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd rather (xIter.hasNext && !yIter.hasNext) (and on the next one)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed with latest commit

@dlwh
Copy link
Member

dlwh commented Jan 8, 2016

nitpick, then lgtm!

x.rows == y.rows && x.cols == y.cols &&
keysIterator.forall(k => x(k) == y(k))
case _ =>
return false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh this should be super.==(p1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh - this is in Matrix.scala not CSCMatrix.scala so I am not sure you want to use the equals method of the super class. Do you want to override the equals method in CSCMatrix? It doesn't have an explicit implementation in CSCMatrix.scala

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, blah, sorry. maybe move the CSC/CSC check to CSC itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@dlwh
Copy link
Member

dlwh commented Jan 8, 2016

well, one other check

@rahulpalamuttam
Copy link
Contributor Author

How about this? I also went back to matching the input matrix, instead of matching the pair of matrices.
Also I was not exactly sure why the initial implementation matches for Matrix[_], instead of Matrix[V], but I kept the former.

@dlwh
Copy link
Member

dlwh commented Jan 8, 2016

lgtm, thanks!

dlwh added a commit that referenced this pull request Jan 8, 2016
@dlwh dlwh merged commit 231c53a into scalanlp:master Jan 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants