[![Open in Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/justmarkham/scikit-learn-tips/master?filepath=notebooks%2F25_decision_tree_pruning.ipynb)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/justmarkham/scikit-learn-tips/blob/master/notebooks/25_decision_tree_pruning.ipynb)

# 🤖⚡ scikit-learn tip #25 ([video](https://www.youtube.com/watch?v=ioQ2Ahi-I_M&list=PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6&index=25))

New in scikit-learn 0.22: Pruning of decision trees to avoid overfitting!

- Uses cost-complexity pruning
- Increase `ccp_alpha` to prune more aggressively (default value is 0, which means no pruning)
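
For context, here is a rough sketch of the objective behind cost-complexity pruning, following the scikit-learn user guide (the symbols are not part of the original tip). For a tree $T$, pruning looks for the subtree that minimizes:

$$R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|$$

Here $R(T)$ is the total impurity of the leaves, $|\widetilde{T}|$ is the number of leaves, and $\alpha$ corresponds to `ccp_alpha`. With $\alpha = 0$ there is no penalty on tree size, so nothing gets pruned.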

See example 👇

In [1]:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain')
df['Sex'] = df['Sex'].map({'male':0, 'female':1})

In [2]:
features = ['Pclass', 'Fare', 'Sex', 'Parch']
X = df[features]
y = df['Survived']

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

In [4]:
# default tree has 331 nodes
dt = DecisionTreeClassifier(random_state=0)
dt.fit(X, y).tree_.node_count

331

In [5]:
cross_val_score(dt, X, y, cv=5, scoring='accuracy').mean()

0.8036281463812692

In [6]:
# pruned tree has 121 nodes
dt = DecisionTreeClassifier(ccp_alpha=0.001, random_state=0)
dt.fit(X, y).tree_.node_count

121

In [7]:
# pruning slightly improved the cross-validated accuracy (from about 0.804 to 0.808)
cross_val_score(dt, X, y, cv=5, scoring='accuracy').mean()

0.8081162513338773
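
The value `ccp_alpha=0.001` above was picked by hand. As a minimal sketch of one way to search for it instead, the code below asks the tree for its candidate alphas with `cost_complexity_pruning_path` and cross-validates over them with `GridSearchCV`. It assumes a stand-in dataset bundled with scikit-learn (`load_breast_cancer`) so that it runs without a download, and the names `X_demo`, `y_demo`, `full_tree`, and `candidate_alphas` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# stand-in dataset bundled with scikit-learn (assumption: not the Titanic data above)
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# unpruned tree, kept for comparison
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# effective alphas at which the full tree would lose another subtree
path = full_tree.cost_complexity_pruning_path(X_demo, y_demo)

# drop the last alpha (it prunes down to a single node)
# and clip any tiny negative values caused by floating-point round-off
candidate_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))

# cross-validate over the candidate alphas
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'ccp_alpha': candidate_alphas},
    cv=5,
    scoring='accuracy',
)
grid.fit(X_demo, y_demo)

best_alpha = grid.best_params_['ccp_alpha']
print(best_alpha)
print(full_tree.tree_.node_count, grid.best_estimator_.tree_.node_count)
```

Note that the candidate alphas come from a tree fit on all of the data, while each cross-validation fold fits its own tree, so treat the grid as a reasonable set of starting values rather than an exact one.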

### Want more tips? [View all tips on GitHub](https://github.com/justmarkham/scikit-learn-tips) or [Sign up to receive 2 tips by email every week](https://scikit-learn.tips) 💌

© 2020 [Data School](https://www.dataschool.io). All rights reserved.