# Extra Trees Classifier
The Extra Trees Classifier (Extremely Randomized Trees) is an ensemble learning method that combines the predictions of many randomized decision trees. Random Forests search for the best threshold on each candidate feature at every node; Extra Trees instead draw a random threshold for each candidate feature and keep the best of those random splits. By default, each tree is also trained on the full training set rather than on a bootstrap sample. The extra randomness usually lowers variance at the cost of a small increase in bias, which often improves generalization.
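To make the split rule concrete, here is a minimal sketch of how a single node could choose its split. It is a simplified illustration, not scikit-learn's implementation; the helper names `gini` and `extra_trees_split` and the toy data are assumptions made for this example.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def extra_trees_split(X, y, n_candidates, rng):
    """Choose one split in the Extra Trees style (simplified sketch).

    For each randomly chosen candidate feature, draw ONE threshold uniformly
    between the feature's min and max in this node, then keep the candidate
    with the lowest weighted Gini impurity.
    """
    n_samples, n_features = X.shape
    k = min(n_candidates, n_features)
    features = rng.choice(n_features, size=k, replace=False)
    best = None  # (weighted impurity, feature index, threshold)
    for f in features:
        lo, hi = X[:, f].min(), X[:, f].max()
        if lo == hi:  # constant feature, nothing to split on
            continue
        t = rng.uniform(lo, hi)  # random threshold, no search for the optimum
        left = X[:, f] <= t
        if left.all() or not left.any():  # guard against an empty child
            continue
        n_left = left.sum()
        score = (n_left * gini(y[left])
                 + (n_samples - n_left) * gini(y[~left])) / n_samples
        if best is None or score < best[0]:
            best = (score, int(f), float(t))
    return best

# Hypothetical toy data: only feature 2 carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 2] > 0).astype(int)

score, feature, threshold = extra_trees_split(X, y, n_candidates=5, rng=rng)
print(f"feature={feature}, threshold={threshold:.3f}, weighted gini={score:.3f}")
```

A Random Forest node would instead scan every possible threshold of each candidate feature, which is where the extra training cost comes from.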

## Advantages
- Reduced Variance: The increased randomness helps to reduce the model's variance, making it less likely to overfit.
- Faster Training: Due to random splits, training can be faster compared to methods that seek optimal splits.
- Feature Importance: Provides insights into the importance of features.
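- Handles Missing Values: Support depends on the implementation. Recent scikit-learn releases accept NaN values in `ExtraTreesClassifier`; older releases require imputation first.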
- Parallel Computation: Like Random Forests, Extra Trees can be parallelized for faster computations.
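Two of the points above, feature importance and parallel training, can be sketched in a few lines. The synthetic data from `make_classification` is an assumption used only for illustration; `n_jobs=-1` asks scikit-learn to use all available cores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (hypothetical): with shuffle=False the first 3 columns are informative
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# n_jobs=-1 builds the trees in parallel on all available cores
clf = ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)

# Impurity-based importances, one value per feature, summing to 1
importances = clf.feature_importances_
for idx in importances.argsort()[::-1]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Impurity-based importances can be biased toward high-cardinality features, so they are best read as a rough ranking rather than an exact measure.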

## Disadvantages
- Interpretability: More complex and harder to interpret compared to single decision trees.
- Randomness: Because split thresholds are drawn at random, accuracy can be slightly lower than with methods that search for optimal splits.
- Computationally Intensive: Requires significant memory and computational power for large datasets.

## Use Cases
- Finance: Fraud detection, risk assessment, and credit scoring.
- Marketing: Customer segmentation, churn prediction, and recommendation systems.
- Healthcare: Disease prediction and patient outcome analysis.
- E-commerce: Inventory forecasting and product recommendation.

## Scaling (not needed)
Extra Trees do not require feature scaling because they are based on decision tree algorithms, which are not sensitive to the scale of the features.
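One way to check this claim is to compare cross-validated accuracy with and without a `StandardScaler`. The sketch below assumes scikit-learn's bundled `load_breast_cancer` dataset as a stand-in for the CSV file used later in this notebook; the two datasets may not be identical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): scikit-learn's bundled breast cancer data
X, y = load_breast_cancer(return_X_y=True)

raw_model = ExtraTreesClassifier(n_estimators=100, random_state=42)
scaled_model = make_pipeline(
    StandardScaler(),
    ExtraTreesClassifier(n_estimators=100, random_state=42),
)

# Same folds and same seed for both, so only the scaling step differs
raw_scores = cross_val_score(raw_model, X, y, cv=5, scoring="accuracy")
scaled_scores = cross_val_score(scaled_model, X, y, cv=5, scoring="accuracy")

print(f"unscaled: {raw_scores.mean():.4f}")
print(f"scaled:   {scaled_scores.mean():.4f}")
```

The two means should be close. Small differences can still appear because of floating-point rounding in the drawn thresholds.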

## Encoding (necessary)
Categorical data needs to be encoded into numerical values.
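A common way to do this in scikit-learn is a `ColumnTransformer` with a `OneHotEncoder` inside a `Pipeline`. The sketch below uses a small made-up table; the column names `age`, `plan`, and `churned` are hypothetical and not part of the dataset used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one numeric and one categorical feature
df_toy = pd.DataFrame({
    "age": [25, 47, 35, 52, 23, 41, 60, 33],
    "plan": ["basic", "premium", "basic", "premium",
             "basic", "standard", "premium", "standard"],
    "churned": [0, 1, 0, 1, 0, 0, 1, 0],
})
X_toy = df_toy[["age", "plan"]]
y_toy = df_toy["churned"]

# One-hot encode the categorical column, pass the numeric column through unchanged
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"])],
    remainder="passthrough",
)

pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", ExtraTreesClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X_toy, y_toy)

# handle_unknown="ignore" encodes the unseen category "gold" as all zeros
new_row = pd.DataFrame({"age": [30], "plan": ["gold"]})
pred = pipe.predict(new_row)
print(pred)
```

Keeping the encoder inside the pipeline means it is fitted only on training data, which avoids leaking information during cross-validation.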

# Import library

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import StandardScaler

# Read data

In [2]:
df = pd.read_csv('Breast_Cancer.csv')
# Separate the features from the target column
x = df.drop('diagnosis', axis=1)
y = df['diagnosis']

In [3]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Scale data

In [4]:
# Optional for Extra Trees: tree splits do not depend on feature scale
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Train

## Grid Search

In [5]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

et_clf = ExtraTreesClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(et_clf, param_grid, scoring='accuracy', cv=5, n_jobs=-1)

# Train the grid search
grid_search.fit(x_train, y_train)  

In [6]:
print("Best Hyperparameter Index:", grid_search.best_index_)
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validated Score:", grid_search.best_score_)

Best Hyperparameter Index: 9
Best Hyperparameters: {'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 50}
Best Cross-Validated Score: 0.9692307692307693


In [7]:
# Get the model with best hyperparameters
model = grid_search.best_estimator_
y_pred = model.predict(x_test)
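Beyond the single best configuration, the full search results are available in `cv_results_`. The sketch below shows one way to rank them with pandas. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV and uses a deliberately small grid to keep the run short.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in dataset (assumption): scikit-learn's bundled breast cancer data
X, y = load_breast_cancer(return_X_y=True)

# Small illustrative grid: 2 x 2 = 4 candidates
small_grid = {"n_estimators": [50, 100], "min_samples_leaf": [1, 2]}
search = GridSearchCV(
    ExtraTreesClassifier(random_state=42),
    small_grid, scoring="accuracy", cv=3, n_jobs=-1,
)
search.fit(X, y)

# One row per candidate; best_index_ is the row index of the winner
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results.sort_values("rank_test_score")[columns])
```

Looking at `std_test_score` next to the mean helps judge whether the top candidates differ meaningfully or only by noise.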

## Randomized Search

In [8]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

et_clf = ExtraTreesClassifier(random_state=42)

param_dist = {
    'n_estimators': np.arange(50, 200, 10),
    'max_depth': [None] + list(np.arange(10, 50, 10)),
    'min_samples_split': np.arange(2, 20, 2),
    'min_samples_leaf': np.arange(1, 10, 1)
}

random_search = RandomizedSearchCV(et_clf, param_dist, scoring='accuracy', n_iter=50, cv=5, n_jobs=-1, random_state=42)

# Train the random search
random_search.fit(x_train, y_train)

In [9]:
print("Best Hyperparameter Index:", random_search.best_index_)
print("Best Hyperparameters:", random_search.best_params_)
print("Best Cross-Validated Score:", random_search.best_score_)

Best Hyperparameter Index: 40
Best Hyperparameters: {'n_estimators': 120, 'min_samples_split': 8, 'min_samples_leaf': 1, 'max_depth': 10}
Best Cross-Validated Score: 0.9626373626373625


In [10]:
model = random_search.best_estimator_
y_pred = model.predict(x_test)
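`RandomizedSearchCV` also accepts `scipy.stats` distributions in place of fixed arrays, so each iteration samples a fresh value. The following is a sketch under assumptions: it uses scikit-learn's bundled `load_breast_cancer` data instead of the CSV, and fewer iterations and folds than the cell above to keep it quick.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Stand-in dataset (assumption): scikit-learn's bundled breast cancer data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# randint(low, high) samples integers from low (inclusive) to high (exclusive)
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": [None, 10, 20, 30, 40],
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=42),
    param_distributions,
    n_iter=10, scoring="accuracy", cv=3, n_jobs=-1, random_state=42,
)
search.fit(X_train, y_train)

print("Best Hyperparameters:", search.best_params_)
print("Best Cross-Validated Score:", search.best_score_)
```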

## Train ExtraTreesClassifier without search

In [11]:
from sklearn.ensemble import ExtraTreesClassifier
# Reuse the best hyperparameters found by the grid search
model = ExtraTreesClassifier(min_samples_leaf=2, n_estimators=50, random_state=42)
model.fit(x_train, y_train)
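The cells above compute predictions but never score them on the held-out data. A minimal evaluation sketch follows; it assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for the CSV, so the numbers it prints are not the notebook's own results.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in dataset (assumption): scikit-learn's bundled breast cancer data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = ExtraTreesClassifier(min_samples_leaf=2, n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Score the model on data it has not seen during training
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Test accuracy:", test_accuracy)
print("Confusion matrix:")
print(cm)
print(classification_report(y_test, y_pred))
```

For a medical dataset like this one, the per-class recall in the report is often more informative than overall accuracy.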