
unstable result with Decision Tree Classifier #12188

Open
Ichaab opened this issue Sep 28, 2018 · 8 comments

Comments

@Ichaab

Ichaab commented Sep 28, 2018

Description

I have developed java code that produces a decision tree without any pruning strategy; the decision rule used is the default majority rule. I then switched to python for its simplicity. The problem is the randomness in DecisionTreeClassifier. Although splitter is set to "best" and max_features=None, so that all features are used, and random_state to 1, I don't end up with the same result that the java code generates. Exactly the same training and test data sets are used for python and java. How can I eliminate all randomness so as to obtain the same result as the java code?
Help please.

Steps/Code to Reproduce
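(No reproduction code was attached here; a minimal sketch of the setup described above, using synthetic data in place of the reporter's files, might look like this:)

```python
# Minimal sketch of the reported setup. The data here is synthetic and
# hypothetical; the reporter's actual train/test files are not attached.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.RandomState(0)
X_train = rng.rand(100, 5)          # continuous features, as described
y_train = rng.randint(0, 2, 100)
X_test = rng.rand(40, 5)
y_test = rng.randint(0, 2, 40)

# splitter='best' and max_features=None are the defaults, so every
# feature is considered at every split; random_state is fixed to 1.
clf = DecisionTreeClassifier(splitter='best', max_features=None,
                             random_state=1)
clf.fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))
```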

Expected Results

The same decision tree as produced by the java code.

Actual Results

A different confusion matrix each time, even with random_state fixed to 1.

Versions

Windows-8.1-6.3.9600-SP0
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.20.dev0

@Ichaab
Author

Ichaab commented Sep 28, 2018

@jnothman @glouppe

@jnothman
Member

I can't think of a short answer, but I don't claim to be an expert on tree algorithms. What kind of data are you testing with?

@amueller
Member

amueller commented Oct 1, 2018

> different confusion matrix in each time I fixed random_state to 1.

Can you elaborate on what you mean by that? With random_state fixed to 1, the results should always be the same. Not fixing the random state should give the same trees except for tie-breaking, IIRC.
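A quick sanity check of the first claim (a sketch, using scikit-learn's bundled iris data rather than the reporter's files): fitting the same model twice with the same random_state should produce identical predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit the same model twice with random_state fixed.
preds = []
for _ in range(2):
    clf = DecisionTreeClassifier(criterion='entropy', random_state=1)
    clf.fit(X, y)
    preds.append(clf.predict(X))

# With a fixed seed, the two runs are identical.
assert np.array_equal(preds[0], preds[1])
print("deterministic with fixed random_state")
```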

@Ichaab
Author

Ichaab commented Oct 4, 2018

I am using continuous features, so numerical data. In my python code I define the decision tree as:
DT = DecisionTreeClassifier(random_state=1), so all features are used, as in the java code. As for the samples, I define the same training and test data sets without using train_test_split. I think all conditions are the same as those that produce the java decision tree.
Each time I modify the random_state, the result changes as well. How can I fix this so as to obtain the same confusion matrix as in java? In other words, how do I remove the randomness in the building of the decision tree in python?

Thank you a lot for your help.

@amueller
Member

amueller commented Oct 4, 2018

Can you provide code that shows non-deterministic behavior in DecisionTreeClassifier with random_state set? That shouldn't happen.

@Ichaab
Author

Ichaab commented Oct 8, 2018

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from imblearn.metrics import classification_report_imbalanced

datatra = pd.read_csv('./glass1-1tra.txt', sep='\t', header=0)
datatst = pd.read_csv('./glass1-1tst.txt', sep='\t', header=0)
data = pd.read_csv('./glass1.txt', sep='\t', header=0)
nbr_col = data.shape[1] - 1

# .ix is deprecated; .iloc does the same positional indexing
X = data.iloc[:, 0:nbr_col].values
Y = data.iloc[:, nbr_col].values

X1_train = datatra.iloc[:, 0:nbr_col].values
Y1_train = datatra.iloc[:, nbr_col].values

X1_tst = datatst.iloc[:, 0:nbr_col].values
Y1_tst = datatst.iloc[:, nbr_col].values

clfs = {
    'entropy': DecisionTreeClassifier(criterion='entropy')
}

def classifieur(clfs, X1, Y1, X2, Y2):
    for clf_name in clfs:
        if clf_name == 'entropy':
            DT = DecisionTreeClassifier(criterion='entropy', random_state=1)
        DT.fit(X1, Y1)
        YDT = DT.predict(X2)
        target_names = ['class -', 'class +']
        print(confusion_matrix(Y2, YDT))
        print(classification_report_imbalanced(Y2, YDT, target_names=target_names))

classifieur(clfs, X1_train, Y1_train, X1_tst, Y1_tst)

In fact I run a 10-fold cross validation manually, that is, for each database I have 10 folds, each containing a train set and a test set. For the first fold, for example, I obtain the same confusion matrix as my java code when I fix random_state to 1; for the second fold, the same results are obtained when random_state=5. So there is no single random_state I can set that always reproduces the confusion matrix obtained without any randomness (using all examples and all features).
Is there any solution, please, to obtain the same result as java for all 10 folds?

@Ichaab
Author

Ichaab commented Oct 8, 2018

@ngoix
Contributor

ngoix commented Oct 9, 2018

The randomness comes from ties on the features to split on. What you describe is expected, as you will obtain different trees for different random_state parameters. Or for different implementations, even with the same random_state parameter. See #12259 for a discussion to remove this randomness.
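A small illustration of this tie-breaking (a sketch; duplicating a column is an artificial way to force a tie): when two features give exactly the same impurity reduction, which one the tree splits on can depend on random_state.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
x = rng.rand(200, 1)
X = np.hstack([x, x])             # two identical features: every split is tied
y = (x[:, 0] > 0.5).astype(int)

# Inspect which feature the root node splits on under each seed. Both
# features are equally good, so the choice may vary with random_state
# (implementation-dependent), while any single seed is reproducible.
for seed in (1, 2, 3, 4, 5):
    clf = DecisionTreeClassifier(random_state=seed).fit(X, y)
    print(seed, clf.tree_.feature[0])
```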

5 participants