[MRG+2] MultiOutputClassifier #6127
Conversation
Thanks a lot for the PR. Could you remove the main in the test file, as all tests are run by nose?
    def test_multi_target_init_with_random_forest():
        ''' test if multi_target initilizes correctly
Could you change the docstrings of all test* functions to comments please?
@rvraghav93, I just had a chat with @hugobowne about doing these changes and fixing the unit test failures. Might submit a PR soon if that's ok...?
You mean a PR to @hugobowne's branch, right? Raising another PR to scikit-learn is not necessary :)
Also a few points - the file MultiOneRest is empty? This PR seems to have only the tests. And I think the preferred filename would be
Please see if this approach could be followed instead?
I suppose the file MultiOneRest.py was committed as an executable and hence shows in the diff as empty. Had faced this issue before :)
Ah, that's new to me!
Wacky issue! @rvraghav93, I think the preferred filename should be multi_one_vs_rest.py or something along these lines, as this PR deals specifically with 'one-versus-all classification models' -- in particular, it doesn't deal with regressors at all. Moreover, it should probably be generalized to deal with all classification models (this will be an easy extension). @rvraghav93, we had completed it before you suggested your approach. After fixing all necessary issues, I suggest we i) generalize to deal with all classification models and ii) leave regressors for a different PR. Thoughts?
Yes, like @mblondel suggests here, we should have
Indeed. A regression meta-estimator could be done in a separate PR. And thanks for your patience!
@@ -0,0 +1 @@
"""MultiOneVsRestClassifier===========================This module includes several classes that extend base estimators to multi-target estimators. Most sklearn estimators use a response matrix to train a target functionwith a single output variable. I.e. typical estimators use the training set X to estimate a target function f(X) that predicts a single Y. The purpose of this class is to extend estimatorsto be able to estimate a series of target functions (f1,f2,f3...,fn)that are trained on a single X predictor matrix to predict a seriesof reponses (y1,y2,y3...,yn)."""#Author: Hugo Bowne-Anderson <hugobowne@gmail.com>#Author: Chris Rivera <chris.richard.rivera@gmail.com>#Author: Michael Williamson#License: BSD 3 clauseimport arrayimport numpy as npimport warningsimport scipy.sparse as spfrom sklearn.base import BaseEstimator, ClassifierMixinfrom sklearn.base import clone, is_classifierfrom sklearn.base import MetaEstimatorMixin, is_regressorfrom sklearn.preprocessing import LabelBinarizerfrom sklearn.metrics.pairwise import euclidean_distancesfrom sklearn.utils import check_random_statefrom sklearn.utils.validation import _num_samplesfrom sklearn.utils.validation import check_consistent_lengthfrom sklearn.utils.validation import check_is_fittedfrom sklearn.externals.joblib import Parallelfrom sklearn.externals.joblib import delayedfrom sklearn.multiclass import OneVsRestClassifierclass MultiOneVsRestClassifier(): """ Converts any classifer estimator into a multi-target classifier estimator. This class fits and predicts a series of one-versus-all models to response matrix Y, which has n_samples and p_target variables, on the predictor Matrix X with n_samples and m_feature variables. This allows for multiple target variable classifications. For each target variable (column in Y), a separate OneVsRestClassifier is fit. See the base OneVsRestClassifier Class in sklearn.multiclass for more details. 
Parameters ---------- estimator : estimator object An estimator object implementing `fit` & `predict_proba`. n_jobs : int, optional, default: 1 The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. Note that parallel processing only occurs if there is multiple classes within each target variable. It does each target variable in y in series. Attributes __________ estimator: Sklearn estimator: The base estimator used to constructe the model. """ def __init__(self, estimator=None, n_jobs=1): self.estimator = estimator self.n_jobs = n_jobs def fit(self, X, y): """ Fit the model to data. Creates a seperate model for each Response column. Parameters ---------- X : (sparse) array-like, shape = [n_samples, n_features] Data. y : (sparse) array-like, shape = [n_samples, p_targets] Multi-class targets. An indicator matrix turns on multilabel classification. Returns ------- self """ # check to see that the data is numeric # check to see that the X and y have the same number of rows. # Calculate the number of classifiers self._num_y = y.shape[1] ## create a dictionary to hold the estimators. self.estimators_ ={} for i in range(self._num_y): # init a new classifer for each and fit it. estimator = clone(self.estimator) #make a fresh clone ovr = OneVsRestClassifier(estimator,self.n_jobs) self.estimators_[i] = ovr.fit(X,y[:, i]) return self def predict(self, X): """Predict multi-class multiple target variable using a model trained for each target variable. Parameters ---------- X : (sparse) array-like, shape = [n_samples, n_features] Data. Returns ------- y : dict of [sparse array-like], shape = {predictors: n_samples} or {predictors: [n_samples, n_classes], n_predictors}. Predicted multi-class targets across multiple predictors. 
Note: entirely separate models are generated for each predictor. """ # check to see if the fit has been performed check_is_fitted(self, 'estimators_') results = {} for label, model_ in self.estimators_.iteritems(): results[label] = model_.predict( X) return(results) def predict_proba(self, X): """Probability estimates. This returns prediction probabilites for each class for each label in the form of a dictionary. Parameters ---------- X : array-like, shape = [n_samples, n_features] Returns ------- prob_dict (dict) A dictionary containing n_label sparse arrays with shape = [n_samples, n_classes]. Each row in the array contains the the probability of the sample for each class in the model, where classes are ordered as they are in `self.classes_`. """ # check to see whether the fit has occured. check_is_fitted(self, 'estimators_') results ={} for label, model_ in self.estimators_.iteritems(): results[label] = model_.predict_proba(X) return(results) def score(self, X, Y): """"Returns the mean accuracy on the given test data and labels. Parameters ---------- X : array-like, shape = [n_samples, n_features] Y : (sparse) array-like, shape = [n_samples, p_targets] Returns ------- scores (np.array) Array of p_target floats of the mean accuracy of each estimator_.predict wrt. y. 
""" check_is_fitted(self, 'estimators_') # Score the results for each function results =[] for i in range(self._num_y): estimator = self.estimators_[i] results.append(estimator.score(X,Y[:,i])) return results def get_params(self): '''returns the parameters of the estimator.''' return self.estimator.get_params() def set_params(self, params): """sets the params for the estimator.""" self.estimator.set_params(params) def __repr__(self): return 'MultiOneVsRestClassifier( %s )' %self.estimator.__repr__() @property def multilabel_(self): """returns a vector of whether each classifer is a multilabel classifier in tuple for """ return [(label, model_.multilabel_) for label, model_ in self.estimators_.iteritems()] @property def classes_(self): return [(label, model_.label_binarizer_) for label, model_ in self.estimators_.iteritems()] |
I assume you did not mean to commit an empty file :P
@MechCoder i definitely didn't ! the diff is empty for some wacky reason but the file is not! can you confirm this?
Try:
- chmod -x the file
- Add the file again and commit
- git config core.fileMode false
- Add the file again and commit, if needed (I think you won't need to)
- Squash all the commits
- Force push
Or you could just copy over the code to multioutput.py, remove this file and force push, because that is what you are ultimately going to do anyway.
Indeed :P
my thinking exactly @MechCoder
@hugobowne I've changed the title to WIP. Let us know once you finish up the
@hugobowne Thanks a lot for letting me know about this. I did pull the code from your branch. I would be happy to help in any way. One doubt - even if I add a commit to this code, to continue working on this PR, I need to push to this branch, for which I won't have the access rights. Any help is appreciated! Thanks.
I have just done simple modifications and refactored the code a little. It is at this branch. Please do look at it, though I haven't added any new functionality. Thanks.
Thanks for your patience, all. A quick note on workflow: perhaps @MechCoder or @rvraghav93 could suggest best practice given the following: I won't have much time to contribute in the upcoming weeks and @maniteja123 is going to work on the MultiOutputClassifier -- in this case, is it i) best for him to issue PRs to my branch, or ii) should I give him collaborator access to my branch so that I don't need to merge etc. (in which case this may all move more quickly)? Is there a common practice for this?
The common way to do this is as a PR to your branch, as you had suggested. But if you don't mind giving him access to your repository, you can go ahead, as it would indeed speed things up :)
I agree
Hi all. I have just now merged @james-nichols' PR into my branch. I then tried to squash commits but think I may have completely bungled it -- I used this as a guide: http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html. Thoughts? @rvraghav93 @MechCoder @maniteja123: I have given you collaborator rights to my sklearn fork, so please feel free to work on the branch -- I would suggest that you shoot me an email when working on it and I will do the same. Collaborator on code @MrChristophRivera can also field questions when I'm unable to.
(force-pushed from ba0db84 to bae109a)
Ok, I just attempted to squash again. Let me know how it's looking. Apologies for rookie errors!
    predicts a single Y. The purpose of this class is to extend estimators
    to be able to estimate a series of target functions (f1,f2,f3...,fn)
    that are trained on a single X predictor matrix to predict a series
    of reponses (y1,y2,y3...,yn).
this paragraph looks great but it should belong to an example and not here, I think
Left it here for now. Will make a note of it.
    forest_.fit(X, y[:, i])
    assert_equal(list(forest_.predict(X)), list(predictions[:, i]))
    assert_almost_equal(list(forest_.predict_proba(X)),
                        list(predict_proba[:, :, i]), decimal=1)
decimal=1 is small; can't you go further? Can you get the exact result with an appropriate random_state?
I have changed it to assert_array_equal now and the test succeeds.
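The reviewer's suggestion rests on a general property: a RandomForestClassifier with a fixed random_state is fully deterministic, so exact assertions can replace loose tolerances like decimal=1. A hedged sketch with made-up toy data (not the PR's test):

```python
import numpy as np
from numpy.testing import assert_array_equal
from sklearn.ensemble import RandomForestClassifier

# Made-up toy data; the label depends only on the first feature
rng = np.random.RandomState(0)
X = rng.rand(30, 3)
y = (X[:, 0] > 0.5).astype(int)

# Two identically seeded forests produce bit-for-bit identical outputs,
# so assert_array_equal holds with no decimal tolerance at all
a = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
b = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
assert_array_equal(a.predict(X), b.predict(X))
assert_array_equal(a.predict_proba(X), b.predict_proba(X))
```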
@TomDLT I have done all the changes. I also did go through the whole code for any errors in documentation or tests. Hopefully, I have addressed all the comments. |
@@ -45,18 +45,18 @@ def __init__(self, estimator, n_jobs=1):

        def fit(self, X, y, sample_weight=None):
            """ Fit the model to data.
-           Fits a seperate model for each output variable.
+           Fit a seperate model for each output variable.
separate
Can you squash into 2 commits?
Yeah, I should do something like
Yes, at the end you should have only two commits: hugobowne's work and yours.
Just a doubt: when I rebase with
Just squash your last 12 into one: git rebase -i HEAD~12
I have one local commit also, so it should be 13, right?
(force-pushed from 2ea42ca to 2c8dd4e)
Yes.
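The squash workflow discussed above (git rebase -i HEAD~N, then squash all but the first commit) can be demonstrated end to end in a throwaway repository. This is an illustrative sketch, not the PR's actual history; it uses GIT_SEQUENCE_EDITOR to rewrite the rebase todo list non-interactively, and assumes GNU sed.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo

# Make four throwaway commits c1..c4
for i in 1 2 3 4; do echo "$i" > f; git add f; git commit -qm "c$i"; done

# Squash the last 3 commits into one: keep the first 'pick' in the todo
# list, turn the remaining two into 'squash'. GIT_EDITOR=true accepts the
# default combined commit message.
GIT_SEQUENCE_EDITOR="sed -i '2,\$s/^pick/squash/'" GIT_EDITOR=true \
    git rebase -i HEAD~3

git rev-list --count HEAD   # prints 2: the base commit plus the squashed one
```

In the interactive version you would simply edit the pick/squash lines by hand, which is what the gitready guide linked earlier walks through.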
@TomDLT Please merge if you are happy! Thanks!
This looks really good to me! Just one detail:
Actually not for META_ESTIMATORS, but I am not sure if we should add it in common tests or in test_multioutput.py |
@maniteja123 Could you just add a test to check for NotFittedError?
(force-pushed from 2c8dd4e to 29ee54a)
@MechCoder I added a simple test for NotFittedError when predict, predict_proba and score are called.
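The behaviour those tests pin down can be sketched against the estimator this PR eventually became, MultiOutputClassifier in current scikit-learn. A hedged illustration with made-up data, not the PR's actual test code:

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

clf = MultiOutputClassifier(LogisticRegression())

# Calling predict before fit must raise NotFittedError
raised = False
try:
    clf.predict(np.zeros((2, 3)))
except NotFittedError:
    raised = True
assert raised
```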
Merging with master. Thanks for your perseverance! 🍷 🍷
Thanks @maniteja123 and @hugobowne
We forgot to update whatsnew.rst for this. Could you do that?
Yeah, sure. Shall I push it to this branch itself?
And thank you so much @MechCoder @rvraghav93 @TomDLT and everyone else for all the help and for bearing patiently with my doubts, and sincere thanks to @hugobowne for letting me work on this. I am again sorry for taking so much of your time in reviewing this multiple times.
Yes, please push it here. I'll cherry-pick it.
@MechCoder Sorry for the delay. This is the commit
TODO for this PR