
[LogisticRegressionMG] Support standardization with no data modification #5724

Merged

Conversation


@lijinf2 (Contributor) commented Jan 18, 2024

The key idea is to modify coefficients in linearFwd to get the same predictions, and modify the gradients in linearBwd to get the same gradients.
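For intuition, here is a minimal NumPy sketch (an illustration, not the cuML code itself) of why the forward-pass side of this works: standardizing the data and rescaling the coefficients produce identical linear predictions, so `X` never has to be modified.

```python
# Minimal NumPy sketch of the equivalence behind the linearFwd idea
# (illustration only, not the cuML implementation).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # raw, unmodified data
w = rng.normal(size=3)        # coefficients in standardized space
b = 0.5                       # intercept

mean, std = X.mean(axis=0), X.std(axis=0)

# Explicit standardization (modifies the data).
z_standardized = ((X - mean) / std) @ w + b

# Equivalent forward pass on the raw data: scale the coefficients by 1/std
# and fold the mean term into the intercept.
w_scaled = w / std
z_unmodified = X @ w_scaled + (b - mean @ w_scaled)

assert np.allclose(z_standardized, z_unmodified)
```

The backward pass applies the same chain rule in reverse, which is why the gradients in linearBwd need the matching rescaling.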


@copy-pr-bot (bot) commented Jan 18, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions bot added the Cython / Python and CUDA/C++ labels Jan 18, 2024
@lijinf2 added the improvement and non-breaking labels Jan 18, 2024
@lijinf2 marked this pull request as ready for review January 19, 2024 01:46
@lijinf2 requested review from a team as code owners January 19, 2024 01:46

@cjnolet (Member) left a comment


Hey @lijinf2. Your changes look great for the most part. I'd still like to see the MG standardization pieces pulled out eventually. Most of my feedback is on the quality of the pytests, but I think the fixes should be straightforward.

true,
raft::mul_op(),
stream);
raft::resource::sync_stream(*(this->handle_p));
Member

We shouldn't need to sync here since we're not copying anything to host to be read directly after.

Contributor Author

Good point. Revised.

SimpleDenseMat<T> mean_mat(mean_vector, 1, D);

// calculate mean
rmm::device_uvector<T> ones(num_rows, stream);
Member

It would be really nice if we could encapsulate this normalization computation so that it can be reused in RAFT. I understand now the complexities involved in refactoring. For example, I often forget about SimpleMat because it's buried so deep in cuML's other APIs (and only used in the qn solvers).

Still, we are going to start moving some of the MNMG primitives over to RAFT, hopefully in the not-so-distant future. This also includes k-means and whatnot.

Contributor Author

Sounds good. It seems this needs a PR to RAFT and then a PR to cuML to revise this part.
I'm thinking of getting it done in the next release. Created an issue to track this: #5739.
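For readers following the C++ snippet above: the `ones` vector is a standard way to express a column mean as a single matrix-vector product (GEMV). A minimal NumPy sketch of that pattern (my reading of the context shown; the actual cuML/RAFT kernel calls differ):

```python
# Column means computed as a matrix-vector product with a ones vector,
# mirroring the ones/mean_vector pattern in the C++ snippet above
# (assumed reading, not the actual cuML/RAFT calls).
import numpy as np

X = np.arange(20, dtype=np.float64).reshape(5, 4)  # num_rows x D
ones = np.ones(X.shape[0])

mean_vector = (ones @ X) / X.shape[0]  # shape (D,), one entry per feature
assert np.allclose(mean_vector, X.mean(axis=0))
```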

"max_iter": max_iter,
}

X_origin = np.array(
Member

This is going to lead to tests that are brittle and hard to maintain. Please generate this data and compute the naive version of the expected result (you can do the standardization up front and use single-GPU logistic regression). Please do this instead of hardcoding values. We seldom have to resort to that, and only when a reasonable way to achieve a naive test doesn't exist.

Contributor Author

Tried generating a random sparse matrix of any size for classification. Please check!
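A sketch of the naive-reference pattern the review asks for (assumed shapes and parameters; sklearn's CPU `LogisticRegression` stands in here for the single-GPU estimator):

```python
# Sketch of a naive expected result: generate data, standardize up front, and
# fit a reference logistic regression instead of hardcoding coefficients.
# (Illustration only; the PR's actual test helpers and tolerances differ.)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=4, random_state=0
)

X_std = StandardScaler().fit_transform(X)
ref = LogisticRegression(max_iter=1000).fit(X_std, y)

# The MG model fit with standardization enabled on the raw X should then
# agree with ref's predictions/coefficients within tolerance.
```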

"max_iter": max_iter,
}

X = np.array(
Member

Please see below for comments on hardcoding these values and generating larger test cases.

Contributor Author

Revised.

)
@pytest.mark.parametrize("datatype", [np.float32])
@pytest.mark.parametrize("delayed", [False])
@pytest.mark.parametrize("ncol_and_nclasses", [(2, 2), (6, 4)])
Member

Can we test for a few different variations here please? Even just a couple higher numbers like (100, 10) would help.

Contributor Author

Sure! Just added (100, 10).
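For example, the updated parametrization could look like this (an assumed exact list; the PR may order it differently):

```python
@pytest.mark.parametrize("ncol_and_nclasses", [(2, 2), (6, 4), (100, 10)])
```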

datatype,
)

X_origin = np.ascontiguousarray(X_origin.T)
Member

Please generate larger arrays for testing, especially when sparse. It doesn't have to be massive, but larger than 4x5 would be helpful (like 100x25 or 1000x100).

Contributor Author

Yeah, added a function to generate a sparse matrix of any size for multi-class classification.
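Such a generator might look roughly like this (a sketch under assumptions: scipy-style CSR output and uniform random labels; `make_sparse_classification` is a hypothetical name, not the helper actually added in the PR):

```python
# Hypothetical sparse data generator for multi-class tests (sketch only; the
# PR's actual helper may differ).
import numpy as np
import scipy.sparse as sp

def make_sparse_classification(n_rows, n_cols, n_classes, density=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X = sp.random(n_rows, n_cols, density=density, format="csr",
                  random_state=seed, dtype=np.float32)
    y = rng.integers(0, n_classes, size=n_rows).astype(np.float32)
    return X, y

X, y = make_sparse_classification(1000, 100, n_classes=10)
```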

@cjnolet (Member) left a comment

LGTM! Thanks for creating the issue to pull out the normalization pieces.

@lijinf2 requested a review from a team as a code owner February 2, 2024 17:47
@github-actions bot added the ci label Feb 2, 2024
@github-actions bot removed the ci label Feb 2, 2024
@raydouglass removed the request for review from a team February 5, 2024 15:32
@raydouglass (Member)

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

@raydouglass merged commit dc02a3f into rapidsai:branch-24.02 Feb 5, 2024
53 of 54 checks passed
@lijinf2 deleted the fea_lrmg_std_no_data_modify branch March 5, 2024 18:28
Labels
CUDA/C++ · Cython / Python (Cython or Python issue) · improvement (Improvement / enhancement to an existing function) · non-breaking (Non-breaking change)