[WIP] Random forest wrapper #621
Conversation
This PR depends on #407, right?
Hi, Saloni --
Great to see this change coming along fast! I know it's early, so I didn't want to go into too many detailed bits. High-level feedback would be:
- When in doubt, let's match the sklearn interfaces/package layouts/class names/etc. So we should do RandomForestClassifier and RandomForestRegressor classes. If we can do it cleanly, they'd just be slim wrappers around an underlying base class like this one (RandomForest) that supports both approaches.
- We should be careful about which functions modify state on the self object. Only the constructor, fit, and very clear "setter" functions should modify state; otherwise we get unexpected side effects.
- Should break out the unrelated changes from this PR.
- Would be great to add tests super early on and commit them, even if most are failing. That will definitely speed up development.
Thanks!
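As a rough illustration of the layout suggested above, a shared base class with slim sklearn-style wrappers might look like the sketch below. The class and parameter names follow sklearn conventions but are assumptions for illustration, not cuml's final API:

```python
# Hypothetical sketch of the suggested layout: a shared RandomForest base
# class with slim task-specific wrappers. Names are illustrative only.

class RandomForest:
    """Shared implementation; only __init__, fit, and explicit setters
    should mutate state."""

    def __init__(self, n_estimators=10, max_depth=None, max_features=None,
                 min_samples_split=2, bootstrap=True):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.min_samples_split = min_samples_split
        self.bootstrap = bootstrap


class RandomForestClassifier(RandomForest):
    """Slim classification wrapper; task-specific logic would layer on top."""


class RandomForestRegressor(RandomForest):
    """Slim regression wrapper; task-specific logic would layer on top."""
```

Both wrappers stay thin: shared hyperparameters and validation live in the base class, so classification and regression only diverge where the task actually requires it.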
@@ -0,0 +1,60 @@
import pytest
These are probably part of a separate change, right? I'd suggest separating them out, keeping them in separate branches on your machine, since it's otherwise easy to accidentally have a commit that spans both changes and gets hard to disentangle later.
Saloni based her PR on the branch of PR #407, so that's why you see so many commits; once that is merged, those commits should go away from this PR. That said, I might recommend not basing new PRs on branches of open PRs, which I mentioned offline to @Salonijain27.
Yes, I believe I did. I can close this PR and create a new one based on branch-0.8.
I fixed this branch by merging the 0.8 branch into it.
# min_rows_per_node in cuml = min_samples_split in sklearn
# max_leaves
def __init__(self, n_estimators=25, max_depth=None, max_features=None, min_rows_per_node=None, bootstrap=True):
Ideally we'd make the names and defaults match sklearn. So n_estimators=10, and change min_rows_per_node to min_samples_split, unless there's a blocker to one of those.
Do we need a type arg? Classifier vs. regressor e.g.?
At the moment we only have the classifier, but I will add that argument in.
class Randomforest(Base):
Probably RandomForest to match sklearn style. Maybe this is the base class, then we add RandomForestClassifier and RandomForestRegressor in future PRs?
@@ -0,0 +1,213 @@
import ctypes
This should probably be in ensemble/random_forest to match sklearn.
self.min_rows_per_node = min_rows_per_node
self.bootstrap = bootstrap

def _get_ctype_ptr(self, obj):
Not sure I understand this. Is this an idiom used elsewhere? Since it doesn't involve self, it seems like it should be in a utility function in a shared module somewhere.
It is in base.pyx; seems like a copy-paste issue. PR #612 moves it to a utility function instead indeed! We're thinking along the same lines :)
@Salonijain27 @JohnZed here you can see the new shiny input utility function that will deal with converting any input type and do the corresponding checks needed: https://github.com/rapidsai/cuml/blob/b49981e06b6a629557e89d4be8cded4bca2ca6c7/python/cuml/utils/input_utils.py
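A shared helper along the lines discussed above might be a minimal sketch like this. The device_ctypes_pointer attribute is an assumption based on numba's device array API; the function name is illustrative, not the one PR #612 actually uses:

```python
# Hypothetical shared utility: extract the raw device pointer from a
# numba GPU array, so individual models don't each reimplement it.
# device_ctypes_pointer is numba's device-array attribute (assumed here).

def get_ctype_ptr(gpu_array):
    # numba device arrays expose their raw pointer as a ctypes value
    return gpu_array.device_ctypes_pointer.value
```

Since the helper never touches self, it can live in a shared utils module and be imported by every model that needs to hand raw pointers to the C++ layer.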
def _get_column_ptr(self, obj):
    return self._get_ctype_ptr(obj._column._data.to_gpu_array())

def fit(self, X):
fit should probably take a labels or y param?
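A minimal sketch of the suggested signature, assuming plain array-like inputs (the attribute names are illustrative, not cuml's final API):

```python
import numpy as np

class RandomForest:
    """Sketch only: fit takes labels y, sklearn-style, and returns self."""

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        # record shape and dtype at fit time for later input validation
        self.n_rows_, self.n_cols_ = X.shape
        self.dtype_ = X.dtype
        # ... actual training would happen here ...
        return self  # sklearn convention: fit returns self
```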
input_ptr = self._get_ctype_ptr(X_m)

cdef cumlHandle* handle_ = <cumlHandle*> <size_t> self.handle.getHandle()
self.labels_ = cudf.Series(np.zeros(self.n_rows, dtype=np.int32))
So this is just a placeholder right now, right?
cdef uintptr_t input_ptr
if (isinstance(X, cudf.DataFrame)):
    self.gdf_datatype = np.dtype(X[X.columns[0]]._column.dtype)
Yeah, definitely a utility function for this would be nice.
You are talking about PR #612 ;) that is the point of that PR.
cdef uintptr_t input_ptr
if (isinstance(X, cudf.DataFrame)):
    self.gdf_datatype = np.dtype(X[X.columns[0]]._column.dtype)
Not sure that predict should set anything on self. I think it's surprising if predict changes any internal state since you'll often generate one instance and call predict many times on it.
So it should probably be more like checking that the dtype here matches the expected self.dtype
Having a self.dtype (instead of the legacy, badly named gdf_datatype) for all models to be able to check inputs is being standardized in PR #612. This PR can either follow the example there if that one is merged first, or I can change this in that PR if this one makes it first.
I can edit it to follow the PR #612
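The validate-don't-mutate pattern discussed above could be sketched like this, using numpy stand-ins rather than cudf (names are illustrative, not cuml's API):

```python
import numpy as np

class RandomForest:
    """Sketch of the dtype-checking pattern: predict validates input
    against the dtype captured at fit time instead of mutating self."""

    def fit(self, X, y):
        X = np.asarray(X)
        self.dtype = X.dtype  # captured once, at fit time
        return self

    def predict(self, X):
        X = np.asarray(X)
        # validate rather than overwrite: predict leaves self untouched,
        # so one fitted instance can serve many predict calls safely
        if X.dtype != self.dtype:
            raise TypeError(
                "expected dtype %s, got %s" % (self.dtype, X.dtype))
        # ... real inference would go here; placeholder output only ...
        return np.zeros(len(X), dtype=np.int32)
```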
input_ptr = self._get_ctype_ptr(X_m)

cdef cumlHandle* handle_ = <cumlHandle*> <size_t> self.handle.getHandle()
clust_mat = numba_utils.row_matrix(self.cluster_centers_)
Maybe copypasta from another algorithm?
Yes, sorry, I changed it locally and forgot to update the branch.