# Extra Trees Regression
Extra Trees Regression (Extremely Randomized Trees) is an ensemble learning method that aggregates the results of multiple unpruned regression trees. The trees are trained using different parts of the same training set and use the average of the predictions from individual trees. Extra Trees introduces randomness by selecting random splits for each tree node.

## Advantages:
- Reduces Overfitting: Less likely to overfit compared to individual decision trees due to averaging multiple trees.
- Robustness: Handles a large number of features well and can handle both numerical and categorical data.
- Fast Training: Can be faster to train than Random Forests since it uses random splits rather than optimal splits.

## Disadvantages:
- Complexity: The ensemble model is more complex and less interpretable than individual trees.
- Computationally Intensive: Can be computationally intensive due to training multiple trees.
- Randomness: The introduction of extra randomness can sometimes lead to less accurate predictions compared to Random Forests.

## Use Cases:
- Regression Problems: Effective for any general regression task where robust and accurate predictions are needed.
- High-dimensional Data: Suitable for datasets with a large number of features.
- Predictive Modeling: Used in finance, healthcare, and other fields where predictive modeling is required.

## Scaling (not necessary)
No, scaling is not necessary for Extra Trees Regression since tree-based methods are invariant to the scale of the features.

## Encoding (necessary)
Encoding categorical data to numerical is essential, as Random Forests can only handle numerical input.

# Import Libraries

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error


# Read Dataset

In [5]:
df = pd.read_csv('50_StartUp_dataset.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit,Florida,New York
0,0,165349.2,136897.8,471784.1,192261.83,0.0,1.0
1,1,162597.7,151377.59,443898.53,191792.06,0.0,0.0
2,2,153441.51,101145.55,407934.54,191050.39,1.0,0.0
3,3,144372.41,118671.85,383199.62,182901.99,0.0,1.0
4,4,142107.34,91391.77,366168.42,166187.94,1.0,0.0


# get X , Y

In [6]:
x=df.drop('Profit',axis=1)
y=df['Profit']

## Get train, test and valid data

In [7]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=.1, random_state=42)
x_train, x_valid, y_train, y_valid=train_test_split(x_train,y_train,test_size=.1, random_state=42)

In [8]:
print('x_train shape =',x_train.shape)
print('x_test shape =',x_test.shape)
print('x_valid shape =',x_valid.shape)
print('y_train shape =',y_train.shape)
print('y_test shape =',y_test.shape)
print('y_valid shape =',y_valid.shape)

x_train shape = (40, 6)
x_test shape = (5, 6)
x_valid shape = (5, 6)
y_train shape = (40,)
y_test shape = (5,)
y_valid shape = (5,)


# Train

## Grid Search

In [19]:
from sklearn.model_selection import GridSearchCV

extra_trees_reg = ExtraTreesRegressor(random_state=42)

params = {
    'n_estimators': [50, 100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid = {
    'n_estimators': [50, 100, 200, 500, 1000],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 4, 8],
    'bootstrap': [True, False]
}


grid_search = GridSearchCV(extra_trees_reg, params, scoring='r2', cv=5, n_jobs=-1)

# Train the grid search
grid_search.fit(x_train, y_train)  

In [15]:
print("Best Hyperparameter Index:", grid_search.best_index_)
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validated Score:", grid_search.best_score_)

Best Hyperparameter Index: 2
Best Hyperparameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best Cross-Validated Score: 0.9546654369405101


In [25]:
# Get the model with best hyperparameters
model = grid_search.best_estimator_
y_pred = model.predict(x_test)

## Randomized Search

In [17]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV

extra_trees_reg = ExtraTreesRegressor(random_state=42)

params = {
    'n_estimators': [50, 100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_dist = {
    'n_estimators': np.arange(50, 1001, 50),
    'max_depth': [None] + list(np.arange(10, 101, 10)),
    'min_samples_split': np.arange(2, 21, 2),
    'min_samples_leaf': np.arange(1, 21, 2),
    'bootstrap': [True, False]
}

random_search = RandomizedSearchCV(extra_trees_reg, params, scoring='r2', n_iter=10, cv=5, n_jobs=-1, random_state=42)

# Train the random search
random_search.fit(x_train, y_train)

In [20]:
print("Best Hyperparameter Index:", random_search.best_index_)
print("Best Hyperparameters:", random_search.best_params_)
print("Best Cross-Validated Score:", random_search.best_score_)

Best Hyperparameter Index: 8
Best Hyperparameters: {'n_estimators': 200, 'min_samples_split': 4, 'min_samples_leaf': 1, 'max_depth': 80, 'bootstrap': True}
Best Cross-Validated Score: 0.9713219497481103


In [188]:
# model = random_search.best_estimator_
# y_pred = model.predict(x_test)

## Train ExtraTreesRegressor without search

In [21]:

model=ExtraTreesRegressor(bootstrap=True, max_depth=80, min_samples_split=4,n_estimators=200, random_state=42)
# model=SVR(C = 0.1, kernel='linear', gamma='auto', epsilon=0.1, degree=3)
model.fit(x_train, y_train)

# Check overfiiting

In [24]:
from sklearn.metrics import r2_score
y_train_pred=model.predict(x_train)
r2_score(y_train_pred , y_train)

0.9892146530090006

In [25]:
y_valid_pred=model.predict(x_valid)
r2_score(y_valid_pred , y_valid)

0.9627016959243828

# Evaluate model

In [26]:
y_pred = model.predict(x_test)

## r2_score

In [27]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
r2

0.8655323285863459

## mean_squared_error

In [28]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
mse

91584873.43007998

## mean_absolute_error

In [29]:
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
mae

5537.032574155879