# Working with Regression Trees in Python

## Learning Objectives
Decision Trees are among the most popular approaches to supervised machine learning. They use an inverted tree-like structure to model the relationship between independent variables and a dependent variable. A tree with a continuous dependent variable is known as a **Regression Tree**. By the end of this tutorial, you will have learned:

+ How to import, explore and prepare data
+ How to build a Regression Tree model
+ How to visualize the structure of a Regression Tree
+ How to prune a Regression Tree
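
To make the idea of a Regression Tree concrete, here is a minimal sketch on made-up data (the arrays and the names `X_toy`, `y_toy` and `stump` are illustrative assumptions, not part of the income data used later). It fits a tree with a single split and shows that each leaf predicts the mean of the training targets that fall into it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one feature, two clearly separated groups
X_toy = np.array([[1], [2], [3], [10], [11], [12]])
y_toy = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])

# A depth-1 tree makes a single split, producing two leaves
stump = DecisionTreeRegressor(max_depth = 1, random_state = 1234)
stump.fit(X_toy, y_toy)

# Each prediction is the mean target of the leaf the sample lands in
low_pred, high_pred = stump.predict([[2], [11]])
print(low_pred, y_toy[:3].mean())
print(high_pred, y_toy[3:].mean())
```

Deeper trees work the same way: they keep splitting the data into smaller groups, and every leaf still predicts the average target of its group.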

## 1. Collect the Data

In [None]:
import pandas as pd
income = pd.read_csv("income.csv")
income.head()

## 2. Explore the Data

In [None]:
income.info()

In [None]:
income.describe()

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
ax = sns.boxplot(data = income, x = 'Education', y = 'Salary')

In [None]:
ax = sns.boxplot(data = income, x = 'Education', y = 'Age')

In [None]:
ax = sns.scatterplot(data = income, 
                     x = 'Age', 
                     y = 'Salary', 
                     hue = 'Education', 
                     style = 'Education', 
                     s = 150)
# Move the legend outside the plot without overwriting the axes variable
ax.legend(bbox_to_anchor = (1.02, 1), loc = 'upper left');

## 3. Prepare the Data

In [None]:
y = income[['Salary']]

In [None]:
X = income[['Age', 'Education']]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size = 0.6,
                                                    stratify = X['Education'],
                                                    random_state = 1234) 

In [None]:
X_train.shape, X_test.shape

In [None]:
X_train.head()

In [None]:
X_train = pd.get_dummies(X_train)
X_train.head()

In [None]:
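# Encode the test set, then align its columns with the training set so both
# have identical features even if a category is missing from one split
X_test = pd.get_dummies(X_test).reindex(columns = X_train.columns, fill_value = 0)
X_test.head()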

## 4. Train and Evaluate the Regression Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 1234)

In [None]:
model = regressor.fit(X_train, y_train)

In [None]:
model.score(X_test, y_test)

In [None]:
y_test_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_test_pred)

## 5. Visualize the Regression Tree

In [None]:
from sklearn import tree
plt.figure(figsize = (15,15))
tree.plot_tree(model, 
               feature_names = list(X_train.columns),
               filled = True);

In [None]:
plt.figure(figsize = (15,15))
tree.plot_tree(model, 
               feature_names = list(X_train.columns), 
               filled = True,
               max_depth = 1);

In [None]:
importance = model.feature_importances_
importance

In [None]:
feature_importance = pd.Series(importance, index = X_train.columns)
feature_importance.sort_values().plot(kind = 'bar')
plt.ylabel('Importance');

## 6. Prune the Regression Tree

In [None]:
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)

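The gap between the training and test scores suggests the fully grown tree is overfitting. Let's get the list of effective alphas along the cost-complexity pruning path for the training data.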

In [None]:
path = regressor.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
list(ccp_alphas)
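
For reference, the quantities behind these values can be written as follows. This is a sketch using the notation of the scikit-learn documentation, none of which is defined elsewhere in this tutorial: $R(T)$ is the total impurity of the leaves of tree $T$, $|\widetilde{T}|$ is its number of leaves, $t$ is an internal node, and $T_t$ is the branch rooted at $t$.

```latex
R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|
\qquad
\alpha_{\mathrm{eff}}(t) = \frac{R(t) - R(T_t)}{|\widetilde{T}_t| - 1}
```

The effective alpha of a node is the value of $\alpha$ at which collapsing its branch into a single leaf no longer increases $R_\alpha$. Pruning repeatedly removes the branch with the smallest effective alpha, and each removal contributes one entry to `ccp_alphas`.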

We remove the maximum effective alpha because it corresponds to the trivial tree with just one node.

In [None]:
ccp_alphas = ccp_alphas[:-1]
list(ccp_alphas)
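
A quick way to check that the largest alpha really yields a single-node tree is sketched below. It uses synthetic data, and the names `X_demo`, `y_demo`, `path_demo` and `root_only` are assumptions for illustration; the same check could be run on `X_train` and `y_train`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data standing in for the tutorial's training set
rng = np.random.default_rng(1234)
X_demo = rng.uniform(0, 10, size = (60, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size = 60)

# Pruning path of the fully grown tree
path_demo = DecisionTreeRegressor(random_state = 1234).cost_complexity_pruning_path(X_demo, y_demo)

# Refit with the largest effective alpha so that every split is pruned away
root_only = DecisionTreeRegressor(random_state = 1234,
                                  ccp_alpha = path_demo.ccp_alphas[-1])
root_only.fit(X_demo, y_demo)

# Number of nodes left, and the distinct values the pruned tree predicts
print(root_only.tree_.node_count)
print(np.unique(root_only.predict(X_demo)))
```

A tree with only a root node predicts the training mean for every sample, so it carries no information about the features and is not worth evaluating.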

Next, we train several trees using the different values for alpha.

In [None]:
train_scores, test_scores = [], []
for alpha in ccp_alphas:
    regressor_ = DecisionTreeRegressor(random_state = 1234, ccp_alpha = alpha)
    model_ = regressor_.fit(X_train, y_train)
    train_scores.append(model_.score(X_train, y_train))
    test_scores.append(model_.score(X_test, y_test))

In [None]:
plt.plot(ccp_alphas, 
         train_scores, 
         marker = "o", 
         label = 'train_score', 
         drawstyle = "steps-post")
plt.plot(ccp_alphas, 
         test_scores, 
         marker = "o", 
         label = 'test_score', 
         drawstyle = "steps-post")
plt.legend()
plt.title('R-squared by alpha');

In [None]:
test_scores

In [None]:
ix = test_scores.index(max(test_scores))
best_alpha = ccp_alphas[ix]
best_alpha
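
Choosing alpha by its test-set score lets the test set influence model selection, so the final test score can look optimistic. One common alternative is to choose alpha by cross-validation on the training data only. The sketch below shows this on synthetic data. The data, the variable names, and the choice of 5 folds are assumptions for illustration and are not part of the tutorial.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data standing in for the income data
X_syn, y_syn = make_regression(n_samples = 150, n_features = 3,
                               noise = 20.0, random_state = 1234)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn,
                                          train_size = 0.6,
                                          random_state = 1234)

# Candidate alphas from the pruning path, minus the single-node tree;
# clipping guards against tiny negative values from floating-point error
path_cv = DecisionTreeRegressor(random_state = 1234).cost_complexity_pruning_path(X_tr, y_tr)
alphas_cv = np.clip(path_cv.ccp_alphas[:-1], 0, None)

# Mean 5-fold cross-validated R-squared for each alpha, training data only
cv_means = [
    cross_val_score(DecisionTreeRegressor(random_state = 1234, ccp_alpha = a),
                    X_tr, y_tr, cv = 5).mean()
    for a in alphas_cv
]
best_alpha_cv = alphas_cv[int(np.argmax(cv_means))]

# Refit on the full training set and score once on the held-out test set
final_model = DecisionTreeRegressor(random_state = 1234,
                                    ccp_alpha = best_alpha_cv).fit(X_tr, y_tr)
print(best_alpha_cv, final_model.score(X_te, y_te))
```

With this approach the test set is used only once, after alpha has been fixed, so its score is a fairer estimate of performance on new data.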

In [None]:
regressor_ = DecisionTreeRegressor(random_state = 1234, ccp_alpha = best_alpha)
model_ = regressor_.fit(X_train, y_train)

In [None]:
model_.score(X_train, y_train)

In [None]:
model_.score(X_test, y_test)

In [None]:
plt.figure(figsize = (15,15))
tree.plot_tree(model_, 
               feature_names = list(X_train.columns),
               filled = True);