# Tree-based Methods

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
%matplotlib inline
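# 'seaborn-white' was renamed 'seaborn-v0_8-white' in matplotlib 3.6; use whichever is available.
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')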
RANDOM_STATE = 42

## Random Forest Trees

In [56]:
df = pd.read_csv('https://raw.githubusercontent.com/sukhjitsehra/datasets/master/CP322/Hitters.csv').dropna()
df.info()

In [57]:
# Drop the target and the three categorical columns; the model predicts log-salary.
X = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1)
y = np.log(df.Salary)

### Question 1: 

Complete the following class to fit, predict and score a decision tree. 

<!--
BEGIN QUESTION
name: q1
points: 5
-->

In [58]:
class DecisionTreeModel:
    def __init__(self, X_train, X_test, y_train, y_test, max_depth=None, max_leaf_nodes=None, min_samples_split=2, min_samples_leaf=1):
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        self.tree = DecisionTreeRegressor(
            max_depth=max_depth,
            max_leaf_nodes=max_leaf_nodes,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=RANDOM_STATE,
        )
        
    def fit(self):
        ...
        return
    
    def predict(self):
        self.y_pred = ...
        # Return the predictions so calculate_metrics_regression can use them.
        return self.y_pred
    
    def calculate_metrics_regression(self):
        y_pred = self.predict()
        mse = ...
        return mse
        
    def plot(self):
        plt.figure(figsize=(15,10))
        tree.plot_tree(self.tree, filled=True)
        plt.show()
    
    def plot_importance(self):
        # Feature importances as percentages, indexed by feature name.
        importance = pd.DataFrame({'Importance': self.tree.feature_importances_ * 100}, index=self.X_train.columns)
        importance.sort_values('Importance', ascending=True).plot(kind='barh', color='r', legend=False)
        plt.xlabel('Variable Importance')


In [None]:
grader.check("q1")

In [60]:
# Hold out 10% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=RANDOM_STATE)

### Question 2:

Use the class above to define a model whose hyperparameters, chosen by trial and error, give a test MSE lower than 0.24.


Note: In decision tree algorithms, hyperparameters control the model complexity and help prevent overfitting. The four hyperparameters accepted by the class above are explained below. In general, try several values for each and compare the resulting models using cross-validation:

1. `max_depth`: This hyperparameter controls the maximum depth of the decision tree, where depth is the length of the longest path from the root node to a leaf node. A value that is too high can lead to overfitting, while a value that is too low may cause underfitting. A value between 3 and 10 is a common starting range, depending on the complexity of the problem and the size of the dataset: a small dataset usually calls for a smaller value, while a large, complex dataset may need a larger one.

2. `max_leaf_nodes`: This hyperparameter controls the maximum number of leaf nodes in the decision tree and can be used as an alternative to `max_depth` to control the complexity of the tree. When it is set, scikit-learn grows the tree in best-first fashion, splitting the nodes that reduce impurity the most. If both `max_depth` and `max_leaf_nodes` are set, both limits are enforced, so the more restrictive one determines the size of the tree. A value between 5 and 50 is often used as a starting point, but the optimal value depends on the size of the dataset and the complexity of the problem.

3. `min_samples_split`: This hyperparameter sets the minimum number of samples a node must contain before it can be split. A node with fewer than `min_samples_split` samples is not split and becomes a leaf node. This helps prevent overfitting and also affects the shape of the decision tree. The scikit-learn default is 2, which is a typical starting value; larger values, for example between 2% and 10% of the total number of samples, make the tree more conservative. scikit-learn also accepts a float, which it interprets as a fraction of the training samples.

4. `min_samples_leaf`: This hyperparameter sets the minimum number of training samples that must end up in each leaf node. A split is only considered if it leaves at least `min_samples_leaf` samples in both the left and the right branch. This also helps prevent overfitting and affects the shape of the decision tree. It is normally set to a smaller value than `min_samples_split`. A typical starting value is 1 or 2 (the scikit-learn default is 1), but it can be increased depending on the size of the dataset and the complexity of the problem.
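
The effect of these hyperparameters can be illustrated with a minimal sketch. It assumes synthetic data from `make_regression` as a stand-in (it is not the Hitters dataset), and the names `X_demo`, `y_demo` and `candidates` are hypothetical. For each candidate setting it reports the 5-fold cross-validated MSE together with the depth and leaf count of the fitted tree; the values that work well on the assignment data may differ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)

# Hypothetical candidate settings to compare.
candidates = [
    {"max_depth": 2},
    {"max_depth": 5},
    {"max_depth": None},
    {"max_depth": 5, "max_leaf_nodes": 8},
    {"max_depth": None, "min_samples_leaf": 10},
]

results = []
for params in candidates:
    model = DecisionTreeRegressor(random_state=42, **params)
    # scikit-learn reports negated MSE so that larger is better; flip the sign back.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
    cv_mse = -scores.mean()
    # Refit on all of the demo data to inspect the size of the resulting tree.
    model.fit(X_demo, y_demo)
    results.append((params, cv_mse, model.get_depth(), model.get_n_leaves()))

for params, cv_mse, depth, leaves in results:
    print(f"{params}: CV MSE={cv_mse:.1f}, depth={depth}, leaves={leaves}")
```

Comparing the rows shows how each limit caps the tree: the setting with both `max_depth` and `max_leaf_nodes` never exceeds either limit, and the unconstrained tree grows far more leaves than the constrained ones.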

<!--
BEGIN QUESTION
name: q2
points: 5
-->

In [61]:
fit_decision = ...
fit_decision.fit()
fit_decision.predict()
mse = fit_decision.calculate_metrics_regression()

In [None]:
grader.check("q2")

In [62]:
fit_decision.plot_importance()