
unstable result with Decision Tree Classifier #12188

Open
Ichaab opened this issue Sep 28, 2018 · 8 comments

Comments

@Ichaab

Ichaab commented Sep 28, 2018

Description

I have developed java code that produces a decision tree without any pruning strategy; the decision rule used is the default majority rule. I then switched to python for its simplicity. The problem is the randomness in DecisionTreeClassifier. Although splitter is set to "best" and max_features=None, so that all features are used, and random_state to 1, I don't end up with the same result that the java code generates. Exactly the same training and test data sets are used for python and java. How can I eliminate all randomness so as to obtain the same result as the java code?
Help please.

Steps/Code to Reproduce
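(No reproduction code was attached here; a minimal sketch of the setup described above, using synthetic data in place of the reporter's files, might look like this:)

```python
# Minimal sketch of the reported setup. The data here is synthetic and
# hypothetical; the reporter's actual train/test files are not attached.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.RandomState(0)
X_train = rng.rand(100, 5)          # continuous features, as described
y_train = rng.randint(0, 2, 100)
X_test = rng.rand(40, 5)
y_test = rng.randint(0, 2, 40)

# splitter='best' and max_features=None are the defaults, so every
# feature is considered at every split; random_state is fixed to 1.
clf = DecisionTreeClassifier(splitter='best', max_features=None,
                             random_state=1)
clf.fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))
```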

Expected Results

The same decision tree as produced by the java code.

Actual Results

A different confusion matrix each time, even with random_state fixed to 1.

Versions

Windows-8.1-6.3.9600-SP0
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.20.dev0

@Ichaab
Author

Ichaab commented Sep 28, 2018

@jnothman @glouppe

@jnothman
Member

I can't think of a short answer, but I don't claim to be an expert on tree algorithms. What kind of data are you testing with?

@amueller
Member

amueller commented Oct 1, 2018

> different confusion matrix in each time I fixed random_state to 1.

Can you elaborate on what you mean by that? With random_state fixed to 1, the results should always be the same. Not fixing the random state should give the same trees except for tie-breaking, IIRC.
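A quick sanity check of the first claim (a sketch, using scikit-learn's bundled iris data rather than the reporter's files): fitting the same model twice with the same random_state should produce identical predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit the same model twice with random_state fixed.
preds = []
for _ in range(2):
    clf = DecisionTreeClassifier(criterion='entropy', random_state=1)
    clf.fit(X, y)
    preds.append(clf.predict(X))

# With a fixed seed, the two runs are identical.
assert np.array_equal(preds[0], preds[1])
print("deterministic with fixed random_state")
```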

@Ichaab
Author

Ichaab commented Oct 4, 2018

I am using continuous features, so numerical data. In my python code I define the decision tree as:
DT = DecisionTreeClassifier(random_state=1), so all features are used, as in the java code. As for the samples, I define the same training and test data sets without using train_test_split. I think all conditions are the same as those that produce the java decision tree.
Each time I modify the random_state, the result changes as well. How can I fix this so as to obtain the same confusion matrix as in java? In other words, how do I remove the randomness in the building of the decision tree in python?

Thank you a lot for your help.

@amueller
Member

amueller commented Oct 4, 2018

Can you provide code that shows non-deterministic behavior in DecisionTreeClassifier with random_state set? That shouldn't happen.

@Ichaab
Author

Ichaab commented Oct 8, 2018

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from imblearn.metrics import classification_report_imbalanced

datatra = pd.read_csv('./glass1-1tra.txt', sep='\t', header=0)
datatst = pd.read_csv('./glass1-1tst.txt', sep='\t', header=0)
data = pd.read_csv('./glass1.txt', sep='\t', header=0)
nbr_col = data.shape[1] - 1

# .ix is deprecated; .iloc does the same positional indexing
X = data.iloc[:, 0:nbr_col].values
Y = data.iloc[:, nbr_col].values

X1_train = datatra.iloc[:, 0:nbr_col].values
Y1_train = datatra.iloc[:, nbr_col].values

X1_tst = datatst.iloc[:, 0:nbr_col].values
Y1_tst = datatst.iloc[:, nbr_col].values

clfs = {
    'entropy': DecisionTreeClassifier(criterion='entropy')
}

def classifieur(clfs, X1, Y1, X2, Y2):
    for clf_name in clfs:
        if clf_name == 'entropy':
            DT = DecisionTreeClassifier(criterion='entropy', random_state=1)
        DT.fit(X1, Y1)
        YDT = DT.predict(X2)
        target_names = ['class -', 'class +']
        print(confusion_matrix(Y2, YDT))
        print(classification_report_imbalanced(Y2, YDT, target_names=target_names))

classifieur(clfs, X1_train, Y1_train, X1_tst, Y1_tst)

In fact I run a 10-fold cross validation manually, that is, for each database I have 10 folds, each containing a train set and a test set. For the first fold, for example, I obtain the same confusion matrix as my java code when I fix random_state to 1; for the second fold, the same results are obtained when random_state=5. So there is no single random_state I can set that always reproduces the confusion matrix obtained without any randomness (using all examples and all features).
Is there any solution, please, to obtain the same result as java for all 10 folds?

@Ichaab
Author

Ichaab commented Oct 8, 2018

@ngoix
Contributor

ngoix commented Oct 9, 2018

The randomness comes from ties on the features to split on. What you describe is expected, as you will obtain different trees for different random_state parameters. Or for different implementations, even with the same random_state parameter. See #12259 for a discussion to remove this randomness.
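A small illustration of this tie-breaking (a sketch; duplicating a column is an artificial way to force a tie): when two features give exactly the same impurity reduction, which one the tree splits on can depend on random_state.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
x = rng.rand(200, 1)
X = np.hstack([x, x])             # two identical features: every split is tied
y = (x[:, 0] > 0.5).astype(int)

# Inspect which feature the root node splits on under each seed. Both
# features are equally good, so the choice may vary with random_state
# (implementation-dependent), while any single seed is reproducible.
for seed in (1, 2, 3, 4, 5):
    clf = DecisionTreeClassifier(random_state=seed).fit(X, y)
    print(seed, clf.tree_.feature[0])
```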

5 participants