
How can we create pool of classifiers? #167

Closed
sara-eb opened this issue Sep 21, 2019 · 10 comments


sara-eb commented Sep 21, 2019

As you have mentioned in your examples, a BaggingClassifier or a RandomForestClassifier is itself considered a pool of classifiers.

I am wondering whether it is possible to create a pool of classifiers that combines traditional ensemble methods such as RF and AdaBoost with single classifiers such as SVM and kNN?

Thanks


Menelau commented Sep 24, 2019

@sara-eb Hello,

Sorry for the delayed response. The library accepts any list of classifiers as the pool of classifiers, so it does accept a combination of ensemble methods and single-classifier models. There are two ways of doing that:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification()
rf = RandomForestClassifier(n_estimators=10).fit(X, y)
adaboost = AdaBoostClassifier(n_estimators=10).fit(X, y)
svm = SVC().fit(X, y)
tree = DecisionTreeClassifier().fit(X, y)

pool1 = [rf, adaboost, svm, tree]
pool2 = rf.estimators_ + adaboost.estimators_ + [svm, tree]
```

In this case, pool1 is a pool composed of 4 estimators (although the random forest and AdaBoost are each composed of multiple base estimators, the DS method treats each of them as a single one). pool2 treats each member of the random forest/AdaBoost as a single, independent model instead of their combination, so the DS model sees a pool composed of 22 models (10 coming from rf, 10 from adaboost, 1 SVM and 1 decision tree).
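A quick way to confirm those counts, as a sketch using only scikit-learn (the fitted objects mirror the snippet above; dataset size and random seeds are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
adaboost = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)
svm = SVC().fit(X, y)
tree = DecisionTreeClassifier().fit(X, y)

# Option 1: each ensemble counts as one pool member
pool1 = [rf, adaboost, svm, tree]
# Option 2: every base estimator becomes an independent pool member
pool2 = rf.estimators_ + adaboost.estimators_ + [svm, tree]

print(len(pool1), len(pool2))  # 4 22
```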

You may also want to check our heterogeneous example, in which we use classifiers of different types in the pool: https://deslib.readthedocs.io/en/latest/auto_examples/example_heterogeneous.html#sphx-glr-auto-examples-example-heterogeneous-py


sara-eb commented Sep 24, 2019

@Menelau Thank you very much sir,
Your explanation is very clear.

Thanks again


sara-eb commented Oct 9, 2019

@Menelau I created a pool of classifiers for my data, including a random forest with 200 estimators and an AdaBoost classifier with 600 decision trees, and I am using faiss as the knn_classifier.

```python
pool_classifiers = [model_ada, model_rf]
knorae = KNORAE(pool_classifiers=pool_classifiers,
                knn_classifier=knn_type)
print("Fitting KNORAE on X_DSEL dataset")
knorae.fit(X_DSEL, y_DSEL)

print("Saving the dynamic selection model in", ds_model_outdir)
outfile = ds_model_outdir + 'KNORAE_rfE200_adaDT600.joblib'
print(outfile)
dump(knorae, outfile)
```

Since my validation (i.e., DSEL) dataset has quite a large number of samples, I was trying to fit the DS model on the validation data and save it for later prediction on the test dataset. However, saving it fails with:

TypeError: can't pickle module objects

What could be the reason?


Menelau commented Oct 10, 2019

@sara-eb Hello,

I have a feeling it happens because of the information stored in the faiss kNN, but I'm not sure. I will investigate and get back to you as soon as possible.

You can try using dill instead of pickle to save the model: https://pypi.org/project/dill/. I believe that should work for you.


sara-eb commented Oct 12, 2019

@Menelau

Thanks for the recommendation.
I installed dill and tried to save the model with it:

```python
pickle_filename = ds_model_outdir + 'KNORAE_rfE200_adaDT600.pkl'
pickle.dump(knorae, open(pickle_filename, 'wb'))
```

However, I am still getting an error:

TypeError: can't pickle SwigPyObject objects

I trained the RandomForest classifier in parallel; can this be the reason?


Menelau commented Oct 13, 2019

@sara-eb ,

Parallel random forest shouldn't be a problem at all. I dug deeper into this issue and found a problem with the serialization of the Faiss kNN: the index computed by faiss needs to be converted to a string before it is written to a file (see facebookresearch/faiss#914).

So I prepared a workaround with functions for saving and loading DS models that should solve this problem (save_ds, load_ds). They check whether faiss is being used for the kNN calculation in the DS model and, if so, do the conversions before saving/loading. I added the code in this gist: https://gist.github.com/Menelau/0cde51c3622be6313fd96b4dffb17996
Can you check whether this workaround solves your problem?

Now I will look into adding a saving/loading functionality for the DS methods to DESlib (one that handles the Faiss kNN automatically) as soon as possible.


sara-eb commented Oct 27, 2019

@Menelau Thank you very much sir, it works perfectly.
I appreciate it.


sara-eb commented Nov 10, 2019

@Menelau I am now facing a new issue when scoring on the test set. What could be the reason?

```python
score = knorae.score(X_test, y_test)
  File "/home/esara/deslib-env/lib/python3.6/site-packages/sklearn/base.py", line 357, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/home/esara/deslib-env/lib/python3.6/site-packages/deslib/base.py", line 440, in predict
    distances, neighbors = self._get_region_competence(X_DS)
  File "/home/esara/deslib-env/lib/python3.6/site-packages/deslib/base.py", line 381, in _get_region_competence
    return_distance=True)
  File "/home/esara/deslib-env/lib/python3.6/site-packages/deslib/util/faiss_knn_wrapper.py", line 112, in kneighbors
    dist, idx = self.index_.search(X, n_neighbors)
AttributeError: 'numpy.ndarray' object has no attribute 'search'
```


Menelau commented Nov 13, 2019

Hello,

How did you load the ds model? Did you use the load_ds function I provided in the gist: https://gist.github.com/Menelau/0cde51c3622be6313fd96b4dffb17996 ?

I believe the error is in the way you are loading the DS model. In order to save the Faiss model, its index is converted to a numpy array so that it can be pickled; the self.index_ attribute is the one containing the index, so it is the one serialized in the save_ds function. Then, in order to load the model back, the numpy array needs to be converted back into a Faiss index, which is what the load_ds function in the gist performs.
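A tiny illustration of the symptom (using a hypothetical stand-in array; no faiss needed): if the serialized index is unpickled but never converted back, the DS model ends up calling .search on a plain numpy array, which is exactly the AttributeError in the traceback above.

```python
import pickle

import numpy as np

# Stand-in for the serialized Faiss index that save_ds stores
serialized_index = np.zeros(16, dtype='uint8')

# Loading with plain pickle gives back a numpy array, not a Faiss index
loaded = pickle.loads(pickle.dumps(serialized_index))

# A numpy array has no .search method; load_ds must convert it back first
print(hasattr(loaded, 'search'))  # False
```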


sara-eb commented Nov 14, 2019

@Menelau Thanks a lot sir; sorry, I did not realize that I needed to reload the model, since the old model object was still in memory.
Thank you very much for pointing that out.

@Menelau Menelau closed this as completed Nov 29, 2019