# Decision Tree Regression
Decision Tree Regression is a non-linear regression technique that partitions the data into subsets based on feature values. Each internal node represents a decision based on a feature, each branch represents the outcome of the decision, and each leaf node represents the predicted value. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

## Advantages:
- Interpretability: Decision trees are easy to interpret and visualize. Each decision in the tree represents a simple rule.
- Non-Parametric: They do not require the data to follow any particular distribution.
- Handling Nonlinear Relationships: Capable of capturing complex, nonlinear relationships between features and the target variable.
- Feature Importance: Provide a measure of feature importance, helping to identify the most influential features.

## Disadvantages:
- Overfitting: Trees can become overly complex and overfit the training data, leading to poor generalization on new data.
- Unstable: Small changes in the data can lead to significantly different tree structures.
- Bias: Can be biased if some classes dominate; require balancing or resampling techniques.

## Use Cases:
- Finance: Predicting stock prices or credit scoring.
- Healthcare: Diagnosing diseases based on symptoms and test results.
- Retail: Predicting sales based on historical data and marketing efforts.

## Scaling (not necessary)
Decision Tree Regression does not require feature scaling. The algorithm is not sensitive to the magnitude of the features.

## Encoding (necessary)
If categorical data is present, it must be encoded to numerical values.

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from scipy.stats import uniform, loguniform

# Read Dataset

In [2]:
df = pd.read_csv('50_StartUp_dataset.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit,Florida,New York
0,0,165349.2,136897.8,471784.1,192261.83,0.0,1.0
1,1,162597.7,151377.59,443898.53,191792.06,0.0,0.0
2,2,153441.51,101145.55,407934.54,191050.39,1.0,0.0
3,3,144372.41,118671.85,383199.62,182901.99,0.0,1.0
4,4,142107.34,91391.77,366168.42,166187.94,1.0,0.0


# get X , Y

In [3]:
x=df.drop('Profit',axis=1)
y=df['Profit']

## Get train, test and valid data

In [4]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=.1, random_state=42)
x_train, x_valid, y_train, y_valid=train_test_split(x_train,y_train,test_size=.1, random_state=42)

In [5]:
print('x_train shape =',x_train.shape)
print('x_test shape =',x_test.shape)
print('x_valid shape =',x_valid.shape)
print('y_train shape =',y_train.shape)
print('y_test shape =',y_test.shape)
print('y_valid shape =',y_valid.shape)

x_train shape = (40, 6)
x_test shape = (5, 6)
x_valid shape = (5, 6)
y_train shape = (40,)
y_test shape = (5,)
y_valid shape = (5,)


# Train

## Grid Search

In [14]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

tree_reg = DecisionTreeRegressor()

params = {
    'criterion': ['squared_error', 'absolute_error'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': [None, 'sqrt', 'log2']
}

param_grid = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 5, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'max_features': [None, 'auto', 'sqrt', 'log2'],
    'max_leaf_nodes': [None, 10, 20, 30, 40, 50, 100],
    'min_impurity_decrease': [0.0, 0.01, 0.02, 0.05],
}


grid_search = GridSearchCV(tree_reg, params, scoring='r2', cv=5, n_jobs=-1)

# Train the grid search
grid_search.fit(x_train, y_train)  

196000 fits failed out of a total of 784000.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
108000 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\PC\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\PC\anaconda3\Lib\site-packages\sklearn\base.py", line 1466, in wrapper
    estimator._validate_params()
  File "c:\Users\PC\anaconda3\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\PC\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameter

In [15]:
print("Best Hyperparameter Index:", grid_search.best_index_)
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validated Score:", grid_search.best_score_)

Best Hyperparameter Index: 1103
Best Hyperparameters: {'criterion': 'squared_error', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': 50, 'min_impurity_decrease': 0.02, 'min_samples_leaf': 1, 'min_samples_split': 5, 'splitter': 'random'}
Best Cross-Validated Score: 0.9712907607988983


In [25]:
# Get the model with best hyperparameters
model = grid_search.best_estimator_
y_pred = model.predict(x_test)

## Randomized Search

In [17]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV

tree_reg = DecisionTreeRegressor()

params = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 5, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4, 6],
    'max_features': [None, 'sqrt', 'log2']
}

param_dist = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'min_samples_split': [2, 5, 10, 15, 20, 25, 30],
    'min_samples_leaf': [1, 2, 4, 6, 8, 10, 12],
    'max_features': [None, 'auto', 'sqrt', 'log2'],
    'max_leaf_nodes': [None, 10, 20, 30, 40, 50, 100, 200],
    'min_impurity_decrease': [0.0, 0.01, 0.02, 0.05, 0.1],
}

random_search = RandomizedSearchCV(tree_reg, params, scoring='r2', n_iter=10, cv=5, n_jobs=-1, random_state=42)

# Train the random search
random_search.fit(x_train, y_train)

In [187]:
# print("Best Hyperparameter Index:", random_search.best_index_)
# print("Best Hyperparameters:", random_search.best_params_)
# print("Best Cross-Validated Score:", random_search.best_score_)

In [188]:
# model = random_search.best_estimator_
# y_pred = model.predict(x_test)

## Train DecisionTreeRegressor without search

In [18]:
from sklearn.tree import DecisionTreeRegressor
model=DecisionTreeRegressor(max_depth=5, criterion='squared_error')
# model=SVR(C = 0.1, kernel='linear', gamma='auto', epsilon=0.1, degree=3)
model.fit(x_train, y_train)

# Check overfiiting

In [19]:
y_train_pred=model.predict(x_train)
r2_score(y_train_pred , y_train)

0.9994914810725622

In [20]:
y_valid_pred=model.predict(x_valid)
r2_score(y_valid_pred , y_valid)

0.917971350315208

# Evaluate model

In [21]:
y_pred = model.predict(x_test)

## r2_score

In [22]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
r2

0.9721553385948084

## mean_squared_error

In [30]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
mse

688688761.1342258

## mean_absolute_error

In [31]:
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
mae

23671.388383268113