# Assignment6-R-Tree-based

# Assignment 6 - Tree-based approaches
# 1.1 Overview of the steps
1. Load the data and get an overview of the data
2. Learn and assess Classification Trees
3. Learn and assess Regression Trees
4. Learn and assess Regression Bagging (Trees) and Random Forests
5. Learn and assess Regression Boosting (Trees)
# 1.2 Steps in detail
## 1.2.1 Load the data and get an overview of the data
Load the data file `Carseats.csv`.
In these data, the `Sales` of carseats is a quantitative response variable. Get an overview of the variables [here](https://rdrr.io/cran/ISLR/man/Carseats.html).

In [113]:
import numpy.random
import pandas as pd
from IPython.display import display, Markdown
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
%matplotlib notebook
from statsmodels.formula.api import ols
import scipy.stats  # import the submodule explicitly; corrmat() below calls scipy.stats.pearsonr
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree as sklearn_plot_tree


default_figsize=(8, 6)
default_alpha = .05
image_format = 'png'

In [2]:
carseats_df = pd.read_csv('../ISLR/data/Carseats.csv', index_col=[0])

Display the number of variables (the predictors plus the response `Sales`) and their names:

In [3]:
print(len(carseats_df.columns))
print(carseats_df.columns)

11
Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'ShelveLoc', 'Age', 'Education', 'Urban', 'US'],
      dtype='object')


In [4]:
carseats_df.describe(include='all')

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
count,400.0,400.0,400.0,400.0,400.0,400.0,400,400.0,400.0,400,400
unique,,,,,,,3,,,2,2
top,,,,,,,Medium,,,Yes,Yes
freq,,,,,,,219,,,282,258
mean,7.496325,124.975,68.6575,6.635,264.84,115.795,,53.3225,13.9,,
std,2.824115,15.334512,27.986037,6.650364,147.376436,23.676664,,16.200297,2.620528,,
min,0.0,77.0,21.0,0.0,10.0,24.0,,25.0,10.0,,
25%,5.39,115.0,42.75,0.0,139.0,100.0,,39.75,12.0,,
50%,7.49,125.0,69.0,5.0,272.0,117.0,,54.5,14.0,,
75%,9.32,135.0,91.0,12.0,398.5,131.0,,66.0,16.0,,


Display the number of data points:

In [5]:
len(carseats_df)

400

Display the data in a table

> Top 20 rows are shown.

In [6]:
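n = 20
# DataFrame.info() prints its report itself and returns None,
# so wrapping it in display() additionally echoes `None` in the output.
display(carseats_df.info(verbose=True))
display(carseats_df.head(n))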

<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 1 to 400
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Sales        400 non-null    float64
 1   CompPrice    400 non-null    int64  
 2   Income       400 non-null    int64  
 3   Advertising  400 non-null    int64  
 4   Population   400 non-null    int64  
 5   Price        400 non-null    int64  
 6   ShelveLoc    400 non-null    object 
 7   Age          400 non-null    int64  
 8   Education    400 non-null    int64  
 9   Urban        400 non-null    object 
 10  US           400 non-null    object 
dtypes: float64(1), int64(7), object(3)
memory usage: 37.5+ KB


None

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
1,9.5,138,73,11,276,120,Bad,42,17,Yes,Yes
2,11.22,111,48,16,260,83,Good,65,10,Yes,Yes
3,10.06,113,35,10,269,80,Medium,59,12,Yes,Yes
4,7.4,117,100,4,466,97,Medium,55,14,Yes,Yes
5,4.15,141,64,3,340,128,Bad,38,13,Yes,No
6,10.81,124,113,13,501,72,Bad,78,16,No,Yes
7,6.63,115,105,0,45,108,Medium,71,15,Yes,No
8,11.85,136,81,15,425,120,Good,67,10,Yes,Yes
9,6.54,132,110,0,108,124,Medium,76,10,No,No
10,4.69,132,113,0,131,124,Medium,76,17,No,Yes


Compute the pairwise correlation of the predictors in the data set.

In [7]:
def corrmat(df, render=display):
    """Does not do symbol-coded chart."""
    def pearsonr_pval(x,y):
        return scipy.stats.pearsonr(x,y)[1]
    render(Markdown('Pearson:'))
    corr = df.corr(method='pearson')
    render(corr)
    render(Markdown('P values:'))
    render(df.corr(method=pearsonr_pval))
    render(Markdown('Pearson (chart):'))
    fig, ax = plt.subplots(figsize=default_figsize)
    sns.heatmap(corr.round(2), ax=ax, annot=True, vmax=1, vmin=-1, center=0, cmap='vlag')
    plt.show()
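
One caveat about a p-value table built this way: when `DataFrame.corr` receives a callable, pandas fills the diagonal with 1 without calling it, so the 1.0 entries on the diagonal are placeholders rather than p-values. A minimal sketch on made-up data that masks the diagonal instead (the frame `demo_df` and the helper `pvalue_matrix` are hypothetical names, not part of the assignment):

```python
import numpy as np
import pandas as pd
import scipy.stats


def pvalue_matrix(df):
    """Pairwise Pearson p-values with the diagonal masked as NaN."""
    pvals = df.corr(method=lambda x, y: scipy.stats.pearsonr(x, y)[1])
    values = pvals.to_numpy(copy=True)
    np.fill_diagonal(values, np.nan)  # the diagonal is a placeholder, not a p-value
    return pd.DataFrame(values, index=pvals.index, columns=pvals.columns)


# Made-up data: 'c' is built from 'a', 'b' is independent noise
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({'a': rng.normal(size=50), 'b': rng.normal(size=50)})
demo_df['c'] = demo_df['a'] + rng.normal(scale=0.1, size=50)

raw = demo_df.corr(method=lambda x, y: scipy.stats.pearsonr(x, y)[1])
masked = pvalue_matrix(demo_df)
print(masked)
```

The off-diagonal entries are unchanged, so the masked table can be read exactly like the original one.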

> In R, `Carseats` is a data frame. Applying `[,-(10:11)]` and `[,-7]` to it drops the 7th, 10th and 11th columns (`ShelveLoc`, `Urban` and `US`). Indexing in R is one-based and indexing in Python is zero-based, so the Python code below drops positions 6, 9 and 10.
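
Dropping by position depends on the column order. A sketch of a position-independent alternative, assuming the goal is simply to keep the numeric columns (the small `demo_df` frame below is made up for illustration and only mimics a few `Carseats` columns):

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Sales': [9.5, 11.22, 10.06],
    'Price': [120, 83, 80],
    'ShelveLoc': ['Bad', 'Good', 'Medium'],
    'Urban': ['Yes', 'Yes', 'Yes'],
})

# Zero-based positions 2 and 3 correspond to R's one-based columns 3 and 4
by_position = demo_df.drop(demo_df.columns[[2, 3]], axis=1)

# Same result without counting columns: keep only the numeric dtypes
by_dtype = demo_df.select_dtypes(include='number')

print(list(by_dtype.columns))
```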

In [8]:
carseats_quantitative_df = carseats_df.drop(carseats_df.columns[[6, 9, 10]], axis=1)
print(carseats_quantitative_df.columns)
corrmat(carseats_quantitative_df)

Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'Age', 'Education'],
      dtype='object')


Pearson:

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education
Sales,1.0,0.064079,0.151951,0.269507,0.050471,-0.444951,-0.231815,-0.051955
CompPrice,0.064079,1.0,-0.080653,-0.024199,-0.094707,0.584848,-0.100239,0.025197
Income,0.151951,-0.080653,1.0,0.058995,-0.007877,-0.056698,-0.00467,-0.056855
Advertising,0.269507,-0.024199,0.058995,1.0,0.265652,0.044537,-0.004557,-0.033594
Population,0.050471,-0.094707,-0.007877,0.265652,1.0,-0.012144,-0.042663,-0.106378
Price,-0.444951,0.584848,-0.056698,0.044537,-0.012144,1.0,-0.102177,0.011747
Age,-0.231815,-0.100239,-0.00467,-0.004557,-0.042663,-0.102177,1.0,0.006488
Education,-0.051955,0.025197,-0.056855,-0.033594,-0.106378,0.011747,0.006488,1.0


P values:

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education
Sales,1.0,0.2009398,0.00231,4.377677e-08,0.3139816,7.618187e-21,3e-06,0.299944
CompPrice,0.2009398,1.0,0.107257,0.6294282,0.05842996,4.502047e-38,0.045118,0.615355
Income,0.00230967,0.1072571,1.0,0.2391048,0.875206,0.2579181,0.925816,0.256598
Advertising,4.377677e-08,0.6294282,0.239105,1.0,6.904797e-08,0.3743306,0.9276,0.502875
Population,0.3139816,0.05842996,0.875206,6.904797e-08,1.0,0.8086862,0.394778,0.033425
Price,7.618187e-21,4.502047e-38,0.257918,0.3743306,0.8086862,1.0,0.041103,0.814826
Age,2.78895e-06,0.04511817,0.925816,0.9275997,0.3947781,0.04110296,1.0,0.897076
Education,0.2999442,0.6153555,0.256598,0.5028753,0.03342466,0.814826,0.897076,1.0


Pearson (chart):

[Figure: annotated heatmap of the Pearson correlation matrix]

Plot the response against its most correlated predictor.

In [9]:
def fit_lr(x, y):
    X = sm.add_constant(x)
    return sm.OLS(y, X).fit()

def plot(x, y, xlab, ylab, mod_fit=None, alpha=default_alpha):
    fig, ax = plt.subplots(figsize=default_figsize)
    ax.plot(x, y, 'yo')
    if mod_fit:
        X = sm.add_constant(x)
        regr = mod_fit.predict(X)
        ax.plot(x, regr, 'k')
        prediction = mod_fit.get_prediction(X)
        frame = prediction.summary_frame(alpha=alpha)
        zipped = pd.concat([x, frame.mean_ci_lower, frame.mean_ci_upper], axis=1)
        zipped.sort_values(x.name, inplace=True)
        ax.fill_between(zipped[x.name], zipped[frame.mean_ci_lower.name], zipped[frame.mean_ci_upper.name], color='k', alpha=.3)
    ax.set_xlabel(xlab)
    ax.set_ylabel(ylab)
    fig.show()

def format_pearsonr(values):
    return f'R = {values[0]}, p < {values[1]}'

def fit_lr_plot_full(x, y, xlab=None, ylab=None):
    mod_fit = fit_lr(x, y)
    print(format_pearsonr(scipy.stats.pearsonr(x, y)))
    # Use the caller's ylab when given (previously it was silently ignored)
    plot(x, y, xlab or getattr(x, 'name', 'x'), ylab or getattr(y, 'name', 'y'), mod_fit)
    return mod_fit

In [10]:
carseats_df['Price']


1      120
2       83
3       80
4       97
5      128
      ... 
396    128
397    120
398    159
399     95
400    120
Name: Price, Length: 400, dtype: int64

In [11]:
fit_lr_plot_full(carseats_df['Price'], carseats_df['Sales'])  # x = predictor, y = response

R = -0.4449507278465725, p < 7.61818701191294e-21


[Figure: scatter plot of `Sales` and `Price` with the fitted regression line and its confidence band]

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fd16565cd60>

## Interpret the results.

### Correlation matrix.

The absolute value of every pairwise correlation is below 0.6, so no two of the analysed variables are strongly correlated.

The largest correlation, 0.58, is between `Price` and `CompPrice`. This is expected, since a store's price is tied to the prices charged by its competitors.

The correlation between `Price` and `Sales` is somewhat weaker at -0.44. Its sign agrees with basic economics: the lower the price, the more units are sold.

There are also 2 correlations with absolute value below 0.3:
- 0.27 between `Advertising` and `Sales`. This is expected, because advertising has some effect on sales;
- -0.23 between `Age` and `Sales`. An interesting observation: younger people seem to buy more car seats, likely because they are more often bringing up children.

### Scatter plot.

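The correlation between `Sales` and `Price` discussed above is confirmed by the chart: the fitted regression line slopes downwards. Because the correlation is only moderate, the points scatter widely around the line. The steepness of the line is not a measure of correlation strength: the slope equals the correlation coefficient scaled by the ratio of the two standard deviations, so it depends on the units of the variables.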
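
The link between the correlation coefficient and the slope can be checked by hand. For a simple least-squares regression of y on x, the slope is r · s_y / s_x. The sketch below plugs in the rounded summary statistics printed earlier in this notebook, so it is an approximate check and not a refit of the model:

```python
# Summary statistics copied from the describe() and correlation outputs above
r = -0.444951          # Pearson correlation between Sales and Price
sd_sales = 2.824115    # standard deviation of Sales
sd_price = 23.676664   # standard deviation of Price

# Least-squares slope of Sales regressed on Price: b = r * s_y / s_x
slope_sales_on_price = r * sd_sales / sd_price
# Regressing the other way round rescales the same r differently
slope_price_on_sales = r * sd_price / sd_sales

print(f'Sales ~ Price slope: {slope_sales_on_price:.4f}')
print(f'Price ~ Sales slope: {slope_price_on_sales:.4f}')
```

The two slopes differ by orders of magnitude although they describe the same correlation, which is why the steepness alone says little about the strength of the relationship.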

## 1.2.2 Learn and assess Classification Trees

Predict whether `Sales` is high using the predictors.

In [12]:
carseats_backup_df = carseats_df.copy()

In [13]:
high = carseats_backup_df['Sales'].transform(lambda x: 'No' if x <= 8 else 'Yes')
carseats_df = carseats_backup_df.copy()
carseats_df['High'] = high
print(carseats_df.columns)

Index(['Sales', 'CompPrice', 'Income', 'Advertising', 'Population', 'Price',
       'ShelveLoc', 'Age', 'Education', 'Urban', 'US', 'High'],
      dtype='object')


In the R version of the assignment, the `tree()` function fits a classification tree. Here scikit-learn's `DecisionTreeClassifier` is used instead to predict `High` using all variables
but `Sales`.

In [46]:
def print_tree(tree, x_df, y_df):
    # NB: this lists every candidate column; the columns that really occur in
    # splits can be read from tree.tree_.feature.
    print('Variables actually used in tree construction:', x_df.columns)
    n_leaves = tree.get_n_leaves()
    print('Number of terminal nodes:', n_leaves)
    predicted = tree.predict(x_df)
    degrees_of_freedom = len(x_df) - n_leaves
    # True where the predicted label differs from the actual one
    misclassified_mask = predicted != y_df.squeeze()
    # NB: this is the number of misclassified rows, a 0/1-loss stand-in for the
    # deviance reported by R's summary(), not the deviance itself.
    residual_sum_of_squares = int(misclassified_mask.sum())
    print(f'Residual mean deviance: {residual_sum_of_squares / degrees_of_freedom} = {residual_sum_of_squares} / {degrees_of_freedom}')
    misclassified_count = residual_sum_of_squares
    print(f'Misclassification error rate: {misclassified_count / len(x_df)} = {misclassified_count} / {len(x_df)}')
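
The summary above approximates two of the quantities that R reports. A minimal sketch of how they could be computed more faithfully with scikit-learn, assuming the deviance is defined as -2 times the summed log of the leaf proportion of each training row's true class; the data are synthetic and all names below are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Carseats set
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Features that occur in at least one split; leaf nodes carry a negative index
split_features = clf.tree_.feature
used_features = sorted({int(f) for f in split_features[split_features >= 0]})

# Probability that each training row's leaf assigns to the row's true class.
# It is never zero on the training data, because the row itself sits in that leaf.
proba = clf.predict_proba(X)
class_index = np.searchsorted(clf.classes_, y)
p_true = proba[np.arange(len(y)), class_index]

deviance = -2.0 * np.log(p_true).sum()
mean_deviance = deviance / (len(y) - clf.get_n_leaves())

print('Features used in splits:', used_features)
print('Residual mean deviance:', mean_deviance)
```

For a tree whose leaves are all pure, every `p_true` equals 1 and the deviance is 0, which agrees with the 0 reported for the fully grown tree in this notebook.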

> One-hot encoding is needed before using `DecisionTreeClassifier`, because it only accepts numeric features.

In [47]:
categorical_columns = ['ShelveLoc', 'Urban', 'US']

tree_df = carseats_df.copy()
one_hot_df = pd.get_dummies(carseats_df[categorical_columns])
tree_df[one_hot_df.columns] = one_hot_df
high_columns = list(c for c in tree_df.columns if c.startswith('High'))
tree_df.drop(categorical_columns, axis=1, inplace=True)
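
The same encoding can be done in a single call by passing `columns=` to `pd.get_dummies`. The sketch below uses a made-up three-row frame. The `drop_first=True` variant is an optional choice shown only for comparison, not something the assignment asks for; it removes one redundant dummy per variable:

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Price': [120, 83, 80],
    'ShelveLoc': ['Bad', 'Good', 'Medium'],
    'Urban': ['Yes', 'No', 'Yes'],
})

# Replaces the listed columns with their dummy columns in one step
encoded = pd.get_dummies(demo_df, columns=['ShelveLoc', 'Urban'])

# Optional: drop the first level of each variable to avoid redundant columns
encoded_compact = pd.get_dummies(demo_df, columns=['ShelveLoc', 'Urban'], drop_first=True)

print(list(encoded.columns))
print(list(encoded_compact.columns))
```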

In [49]:
tree_carseats = DecisionTreeClassifier(splitter='best', criterion='gini')
x_df = tree_df.drop([*high_columns, 'Sales'], axis=1)
y_df = tree_df[high_columns]
tree_carseats.fit(x_df, y_df)
print_tree(tree_carseats, x_df, y_df)

Variables actually used in tree construction: Index(['CompPrice', 'Income', 'Advertising', 'Population', 'Price', 'Age',
       'Education', 'ShelveLoc_Bad', 'ShelveLoc_Good', 'ShelveLoc_Medium',
       'Urban_No', 'Urban_Yes', 'US_No', 'US_Yes'],
      dtype='object')
Number of terminal nodes: 61
Residual mean deviance: 0.0 = 0 / 339
Misclassification error rate: 0.0 = 0 / 400


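> There are more variables than in the `R` counterpart because of the one-hot encoding.

> Due to differences in implementation, there are many more terminal nodes (leaves) and a 0% training error. By default `DecisionTreeClassifier` keeps splitting until every leaf is pure, so a perfect fit on the training data is expected and points to overfitting rather than to good predictive accuracy. I compared the documentation of R's `tree` and Python's `DecisionTreeClassifier`, but couldn't reproduce the R result the way I could in previous assignments.

> One-hot encoding may also contribute to the differences between the R and Python results: R's `tree` splits factor predictors directly on groups of levels, whereas scikit-learn needs numeric dummy columns.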
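
A sketch of how stopping rules change the picture. The settings `min_samples_leaf=5`, `min_samples_split=10` and `criterion='entropy'` are assumptions meant to loosely resemble the defaults of R's `tree` (`mincut = 5`, `minsize = 10`, deviance-based splits); they are not an exact equivalent. The data are synthetic, and a held-out half is used to show that a perfect training score says little about new data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Carseats features (400 rows, some label noise)
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Default settings: keep splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Assumed rough analogue of R's stopping rules (see the note above)
small_tree = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=5,
                                    min_samples_split=10, random_state=0).fit(X_train, y_train)

for name, model in [('full', full_tree), ('constrained', small_tree)]:
    print(name,
          'leaves:', model.get_n_leaves(),
          'train accuracy:', model.score(X_train, y_train),
          'test accuracy:', model.score(X_test, y_test))
```

The fully grown tree fits its training half perfectly by construction; the gap between its training and test accuracy is the quantity to watch.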

In [121]:
def plot_tree(tree, x_df, y_df, filename_no_ext):
    fig, ax = plt.subplots(figsize=default_figsize)
    # class_names must follow the sorted order of tree.classes_ ('No', 'Yes'),
    # not the order in which the labels first appear in y_df.
    class_full_names = [f'{y_df.columns[0]}:{c}' for c in tree.classes_]
    sklearn_plot_tree(tree, ax=ax, class_names=class_full_names, feature_names=list(tree.feature_names_in_), filled=True)
    fig.show()
    filename = f'{filename_no_ext}.{image_format}'
    # High dpi keeps the many small nodes legible in the saved file
    fig.savefig(filename, dpi=1000, format=image_format)
    display(Markdown(f'#### For better resolution please, see the file `{filename}`'))

In [122]:
plot_tree(tree_carseats, x_df, y_df, 'tree_carseats')

[Figure: plot of the fitted classification tree]

#### For better resolution please, see the file `tree_carseats.png`

> The assignment has 2 visualisations of the tree:
> - brief line-based, without much information;
> - detailed, colorful, nice and bottom-aligned.
>
> I have only one visualisation because I am fully satisfied with it.
> The only thing missing from it is the bottom alignment of the leaf nodes, but I don't see the point of that.
>
> I also checked other visualisations and didn't find anything better. Besides, my tree is several times larger than the assignment's tree, and tree visualisation is a known problem in InfoVis.
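
For completeness, scikit-learn does offer a rough counterpart of the brief line-based view: `sklearn.tree.export_text`. A minimal sketch on the bundled iris data, used here only as a stand-in for the Carseats tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Indented, line-based rendering of the fitted tree
text_view = export_text(clf, feature_names=list(iris.feature_names))
print(text_view)
```

For a tree as large as the one above, both `export_text` and scikit-learn's `plot_tree` accept a `max_depth` argument, which restricts the output to the top levels.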