Unstable result with DecisionTreeClassifier #12188
Comments
I can't think of a short answer, but I don't claim to be an expert on tree algorithms. What kind of data are you testing with?
Can you elaborate on what you mean by that? If you fix random_state to 1, the results should always be the same. Not fixing the random state should result in the same trees except for tie-breaking, IIRC.
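To illustrate the point about reproducibility, here is a minimal sketch (using the iris dataset rather than the reporter's glass data, which is not available here): two classifiers built with the same fixed random_state produce identical trees and therefore identical predictions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two classifiers with the same fixed random_state...
a = DecisionTreeClassifier(criterion='entropy', random_state=1).fit(X, y)
b = DecisionTreeClassifier(criterion='entropy', random_state=1).fit(X, y)

# ...produce identical predictions on the same data.
print("identical:", (a.predict(X) == b.predict(X)).all())
```

Running this should print `identical: True`; the randomness only matters when the seed is left unset or when equally good splits tie.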
I am using continuous features, so numerical data. In my Python code I define the decision tree as follows. Thank you very much for your help.
Can you provide code that shows the non-deterministic behavior?
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from imblearn.metrics import classification_report_imbalanced

datatra = pd.read_csv('./glass1-1tra.txt', sep='\t', header=0)
datatst = pd.read_csv('./glass1-1tst.txt', sep='\t', header=0)
data = pd.read_csv('./glass1.txt', sep='\t', header=0)
nbr_col = data.shape[1] - 1

# .ix is deprecated; use .iloc for position-based indexing
X = data.iloc[:, 0:nbr_col].values
Y = data.iloc[:, nbr_col].values
X1_train = datatra.iloc[:, 0:nbr_col].values
Y1_train = datatra.iloc[:, nbr_col].values
X1_tst = datatst.iloc[:, 0:nbr_col].values
Y1_tst = datatst.iloc[:, nbr_col].values

clfs = {
    'entropy': DecisionTreeClassifier(criterion='entropy')
}

def classifieur(clfs, X1, Y1, X2, Y2):
    for clf_name in clfs:
        if clf_name == 'entropy':
            DT = DecisionTreeClassifier(criterion='entropy', random_state=1)
            DT.fit(X1, Y1)
            YDT = DT.predict(X2)
            target_names = ['class -', 'class +']
            print(confusion_matrix(Y2, YDT))
            print(classification_report_imbalanced(Y2, YDT, target_names=target_names))

classifieur(clfs, X1_train, Y1_train, X1_tst, Y1_tst)

In fact, I run a 10-fold cross-validation manually: for each database I have 10 folders, each containing a train set and a test set. For the first folder, for example, I obtain the same confusion matrix as my Java code when I fix random_state to 1; for the second folder, the same results are obtained with random_state=5. So I cannot set a single random_state that always reproduces the confusion matrix obtained without any randomness (using all examples and all features).
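As an aside, the manual 10-folder setup described above can also be expressed with scikit-learn's built-in cross-validation. This is only a sketch on the iris dataset (the reporter's glass files are not available here), showing the general pattern rather than the exact experiment:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion='entropy', random_state=1)

# 10-fold cross-validation handled by scikit-learn instead of manual train/test files
scores = cross_val_score(clf, X, y, cv=10)
print(len(scores))  # one accuracy score per fold
```

With pre-split files, the folds are frozen on disk; cross_val_score instead re-derives them from one dataset, which makes the random_state of the classifier the only remaining source of variation.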
The randomness comes from ties on the features to split on. What you describe is expected, as you will obtain different trees for different random_state values.
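The tie-breaking effect can be demonstrated with a small synthetic example: if two columns are exact copies of each other, every split on one is exactly as good as the same split on the other, and which column the tree records depends on random_state. This is an illustrative construction, not the reporter's data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
x = rng.rand(100, 1)
X = np.hstack([x, x])               # two identical columns: a perfect tie at every split
y = (x[:, 0] > 0.5).astype(int)     # label depends only on the (shared) feature value

# With tied features, the feature chosen at each split can vary with random_state.
for seed in (1, 2, 3):
    t = DecisionTreeClassifier(random_state=seed).fit(X, y)
    print(seed, t.tree_.feature[0])  # index of the feature used at the root: 0 or 1
```

The predictions are identical either way; only the arbitrary choice among tied features varies, which is why two implementations can build different-looking trees that behave the same.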
Description
I have developed a Java program that produces a decision tree without any pruning strategy; the decision rule used is also the default majority rule. I then opted to use Python for its simplicity. The problem is the randomness in DecisionTreeClassifier. Although splitter is set to "best", max_features to None (so all features are used), and random_state to 1, I do not end up with the same result that the Java code generates. Exactly the same training and test data sets are used for Python and Java. How can I eliminate all randomness to obtain the same result as the Java code, please?
Help please.
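For context, a minimal sketch (on the iris dataset, since the glass data is not available here) of the settings the description mentions. Within scikit-learn, fixing random_state makes the tree fully reproducible; it does not guarantee the same tree as an independent Java implementation, which may break ties between equally good splits differently:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# splitter='best' and max_features=None consider every feature at every split,
# but ties between equally good splits are still broken using random_state.
params = dict(criterion='entropy', splitter='best', max_features=None, random_state=1)
a = DecisionTreeClassifier(**params).fit(X, y)
b = DecisionTreeClassifier(**params).fit(X, y)

# Identical split-feature sequence: the tree is deterministic within scikit-learn.
print(np.array_equal(a.tree_.feature, b.tree_.feature))
```

This should print `True`; any remaining mismatch with the Java output therefore comes from differing tie-breaking (or floating-point) behavior between the two implementations, not from unfixed randomness in scikit-learn.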
Steps/Code to Reproduce
Expected Results
The same decision tree as produced by the Java code
Actual Results
A different confusion matrix each time, even with random_state fixed to 1.
Versions
Windows-8.1-6.3.9600-SP0
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.20.dev0