[MRG] Sparse input support for decision tree and forest #3173

Merged 24 commits on Nov 21, 2014
Commits (24)
c03c01a
ENH Bring sparse input support to tree-based methods
arjoly Apr 3, 2014
0bc8a98
FIX+ENH add min_weight_fraction_split support for sparse splitter
arjoly Sep 30, 2014
7cb9a5c
Re-organize code dense splitter then sparse splitter
arjoly Sep 30, 2014
0e86dfd
Simplify call to extract_nnz making it a method
arjoly Sep 30, 2014
9dd87ad
ENH while -> for loop
arjoly Sep 30, 2014
62893e3
ENH reduce number of parameters
arjoly Sep 30, 2014
70226d0
FIX min_weight_fraction_split with random splitter
arjoly Sep 30, 2014
6cd9333
FIX min_weight_leaf in best sparse splitter
arjoly Sep 30, 2014
306924b
ENH remove spurious code
arjoly Sep 30, 2014
cb9b741
cosmit
arjoly Sep 30, 2014
c6af5c6
ENH adaboost should accept c and fortran array
arjoly Sep 30, 2014
cb11511
COSMIT simplify function call
arjoly Sep 30, 2014
1c26cec
ENH expand ternary operator
arjoly Sep 30, 2014
3047cd6
Revert previous version
arjoly Sep 30, 2014
38183c8
ENH move utils near its use
arjoly Sep 30, 2014
06701fc
ENH add a benchmark script for sparse input data
arjoly Sep 30, 2014
9f3f5bb
Extract non zero value extraction constant
arjoly Sep 30, 2014
24281b1
Lower number of trees
arjoly Sep 30, 2014
c31d565
wip benchmark
arjoly Oct 1, 2014
bf98916
Temporarily allows to set algorithm switching through an environment …
arjoly Oct 6, 2014
2838a14
Benchmark: Add more estimators + uncomment text
arjoly Oct 20, 2014
8b3b071
FIX duplicate type coercion + DOC fix inversion between csc and csr
arjoly Oct 20, 2014
4f423d6
Remove constant print
arjoly Nov 6, 2014
ab57964
COSMIT add Base prefix to DenseSplitter and SparseSplitter
arjoly Nov 17, 2014
97 changes: 97 additions & 0 deletions benchmarks/bench_20newsgroups.py
@@ -0,0 +1,97 @@
from __future__ import print_function, division
from time import time
import argparse
import numpy as np

from sklearn.dummy import DummyClassifier

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.metrics import accuracy_score
from sklearn.utils.validation import check_array

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

ESTIMATORS = {
    "dummy": DummyClassifier(),
Review comment (Member): I would add a linear model such as LogisticRegression and MultinomialNB as common baselines for text classification here.

"random_forest": RandomForestClassifier(n_estimators=100,
max_features="sqrt",
min_samples_split=10),
"extra_trees": ExtraTreesClassifier(n_estimators=100,
max_features="sqrt",
min_samples_split=10),
"logistic_regression": LogisticRegression(),
"naive_bayes": MultinomialNB(),
"adaboost": AdaBoostClassifier(n_estimators=10),
}


###############################################################################
# Data

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument('-e', '--estimators', nargs="+", required=True,
                        choices=ESTIMATORS)
    args = vars(parser.parse_args())

    data_train = fetch_20newsgroups_vectorized(subset="train")
    data_test = fetch_20newsgroups_vectorized(subset="test")
    X_train = check_array(data_train.data, dtype=np.float32,
                          accept_sparse="csc")
    X_test = check_array(data_test.data, dtype=np.float32, accept_sparse="csr")
    y_train = data_train.target
    y_test = data_test.target

    print("20 newsgroups")
    print("=============")
    print("X_train.shape = {0}".format(X_train.shape))
    print("X_train.format = {0}".format(X_train.format))
    print("X_train.dtype = {0}".format(X_train.dtype))
    print("X_train density = {0}"
          "".format(X_train.nnz / np.product(X_train.shape)))
    print("y_train {0}".format(y_train.shape))
    print("X_test {0}".format(X_test.shape))
    print("X_test.format = {0}".format(X_test.format))
    print("X_test.dtype = {0}".format(X_test.dtype))
    print("y_test {0}".format(y_test.shape))
    print()

    print("Classifier Training")
    print("===================")
    accuracy, train_time, test_time = {}, {}, {}
    for name in sorted(args["estimators"]):
        clf = ESTIMATORS[name]
        try:
            clf.set_params(random_state=0)
        except (TypeError, ValueError):
            pass

        print("Training %s ... " % name, end="")
        t0 = time()
        clf.fit(X_train, y_train)
        train_time[name] = time() - t0
        t0 = time()
        y_pred = clf.predict(X_test)
        test_time[name] = time() - t0
        accuracy[name] = accuracy_score(y_test, y_pred)
        print("done")

    print()
    print("Classification performance:")
    print("===========================")
    print()
    print("%s %s %s %s" % ("Classifier ", "train-time", "test-time",
                           "Accuracy"))
    print("-" * 44)
    for name in sorted(accuracy, key=accuracy.get):
        print("%s %s %s %s" % (name.ljust(16),
                               ("%.4fs" % train_time[name]).center(10),
                               ("%.4fs" % test_time[name]).center(10),
                               ("%.4f" % accuracy[name]).center(10)))

    print()
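A possible invocation, inferred from the argparse setup above rather than stated anywhere in the PR: `python benchmarks/bench_20newsgroups.py -e random_forest extra_trees adaboost`, where the `-e`/`--estimators` names must come from the ESTIMATORS dict.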
2 changes: 1 addition & 1 deletion doc/modules/ensemble.rst
@@ -108,7 +108,7 @@ construction. The prediction of the ensemble is given as the averaged
prediction of the individual classifiers.

As other classifiers, forest classifiers have to be fitted with two
-arrays: an array X of size ``[n_samples, n_features]`` holding the
+arrays: a sparse or dense array X of size ``[n_samples, n_features]`` holding the
training samples, and an array Y of size ``[n_samples]`` holding the
target values (class labels) for the training samples::

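The literal block that follows this paragraph in ensemble.rst is collapsed above. As a hedged illustration only (not the exact snippet from the docs), fitting a forest on sparse input could look like the following sketch; the toy data and parameter values are invented for illustration:

# Illustrative sketch (not the collapsed ensemble.rst snippet): after this PR,
# forests accept scipy.sparse matrices for X.
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.ensemble import RandomForestClassifier

X = csc_matrix([[0., 0.], [1., 1.]])   # CSC is the preferred sparse format for fit
y = [0, 1]
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)
print(clf.predict(csr_matrix([[1., 1.]])))  # CSR is preferred for predict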
18 changes: 12 additions & 6 deletions doc/modules/tree.rst
@@ -90,10 +90,10 @@ Classification
:class:`DecisionTreeClassifier` is a class capable of performing multi-class
classification on a dataset.

-As other classifiers, :class:`DecisionTreeClassifier` take as input two
-arrays: an array X of size ``[n_samples, n_features]`` holding the training
-samples, and an array Y of integer values, size ``[n_samples]``, holding
-the class labels for the training samples::
+As other classifiers, :class:`DecisionTreeClassifier` takes as input two arrays:
+an array X, sparse or dense, of size ``[n_samples, n_features]`` holding the
+training samples, and an array Y of integer values, size ``[n_samples]``,
+holding the class labels for the training samples::

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
@@ -157,7 +157,7 @@ a PDF file (or any other supported file type) directly in Python::

After being fitted, the model can then be used to predict new values::

->>> clf.predict(iris.data[0, :])
+>>> clf.predict(iris.data[:1, :])
array([0])
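(The change above replaces the 1-D row ``iris.data[0, :]`` with the 2-D slice ``iris.data[:1, :]``, shape ``(1, n_features)``, so the doctest passes ``predict`` input in the expected ``(n_samples, n_features)`` layout.)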

.. figure:: ../auto_examples/tree/images/plot_iris_001.png
@@ -195,7 +195,6 @@ instead of integer values::
>>> clf.predict([[1, 1]])
array([ 0.5])


.. topic:: Examples:

* :ref:`example_tree_plot_tree_regression.py`
@@ -337,6 +336,13 @@ Tips on practical use
* All decision trees use ``np.float32`` arrays internally.
If training data is not in this format, a copy of the dataset will be made.

+* If the input matrix X is very sparse, it is recommended to convert to sparse
+  ``csc_matrix`` before calling fit and sparse ``csr_matrix`` before calling
+  predict. Training time can be orders of magnitude faster for a sparse
+  matrix input compared to a dense matrix when features have zero values in
+  most of the samples.



.. _tree_algorithms:

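To make the new tip added above concrete, here is a minimal hedged sketch of the recommended csc-before-fit / csr-before-predict pattern; the synthetic data and its dimensions are invented for illustration:

# Illustrative sketch of the tip: CSC input for fit, CSR input for predict.
import numpy as np
from scipy import sparse
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = sparse.rand(1000, 500, density=0.01, format="csc", random_state=rng)  # very sparse input
y = rng.randint(0, 2, size=1000)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)  # fit on a csc_matrix
predictions = clf.predict(X.tocsr())                    # predict on a csr_matrix
print(predictions[:10])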