### **Extra Trees Regressor** (Extremely Randomized Trees Regressor)

The **Extra Trees Regressor** is an ensemble method for regression tasks. It averages a collection of decision trees; by default each tree is trained on the entire dataset (no bootstrap sampling), with extra randomness injected into how split thresholds are chosen.

#### **How It Works**
1. **Ensemble of Trees**:
   - Similar to Random Forest, it aggregates predictions from multiple decision trees.
   - Trees are built independently, and their predictions are averaged to produce the final output.

2. **Random Splits**:
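   - Random Forest searches for the best threshold on each candidate feature according to a criterion (squared error for regression, Gini for classification). Extra Trees instead draws one random threshold per candidate feature and keeps the best of those random candidates.
   - This increases diversity among the trees, which tends to lower the variance of the averaged prediction and reduce overfitting.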

3. **Bootstrapping**:
   - Unlike Random Forest, Extra Trees uses the entire dataset for every tree by default (`bootstrap=False` in scikit-learn), although bootstrapping can be enabled.
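
To make the random-split step concrete, here is a minimal NumPy sketch of choosing one split in the Extra Trees style. The function name `pick_random_split` and the weighted-variance score are illustrative assumptions, not scikit-learn's internal implementation.

```python
import numpy as np

def pick_random_split(X, y, n_candidate_features, rng):
    """Illustrative sketch: one random threshold per candidate feature, keep the best."""
    n_samples, n_features = X.shape
    candidates = rng.choice(n_features, size=n_candidate_features, replace=False)
    best = None
    for j in candidates:
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:
            continue  # a constant feature cannot be split
        # Random threshold instead of an exhaustive search over all cut points
        threshold = rng.uniform(lo, hi)
        left = X[:, j] <= threshold
        right = ~left
        # Weighted variance of the two children (lower is better)
        score = (left.sum() * y[left].var() + right.sum() * y[right].var()) / n_samples
        if best is None or score < best[0]:
            best = (score, int(j), threshold)
    return best

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=50)

score, feature, threshold = pick_random_split(X_demo, y_demo, n_candidate_features=3, rng=rng)
print(f"split on feature {feature} at {threshold:.3f} (weighted child variance {score:.3f})")
```

A Random Forest tree would instead scan the possible cut points of each candidate feature and keep the one with the lowest score, which costs more per node.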

#### **Key Features**
- **Extra Randomness**:
  - Candidate features can be randomly subsampled at each node (via `max_features`), and split thresholds are drawn at random.
- **Bias-Variance Tradeoff**:
  - Extra Trees focuses on reducing variance by randomizing splits, often at the cost of slightly increased bias.

#### **Advantages**
- **Efficient Training**: Because thresholds are drawn at random rather than searched exhaustively, training is often faster than with Random Forest.
- **Good Generalization**: Averaging many decorrelated trees lowers variance, so it often generalizes well, including on high-dimensional data.
- **Robust to Noise**: Random thresholds make individual trees less likely to fit noise in the training set.

#### **Disadvantages**
- **Less Interpretability**: Increased randomness makes it harder to interpret compared to standard decision trees.
- **Not Always Optimal**: May perform worse than Random Forest for some datasets where well-optimized splits are crucial.

#### **Common Use Cases**
- Predicting continuous values in tasks like:
  - House price prediction
  - Stock market analysis
  - Energy consumption forecasting

#### **Implementation in Python**
Here’s an example of how to use **Extra Trees Regressor** with Scikit-Learn:

```python
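from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate a synthetic regression dataset (replace with your own X, y)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)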

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Extra Trees Regressor
et_regressor = ExtraTreesRegressor(n_estimators=100, random_state=42)

# Train the model
et_regressor.fit(X_train, y_train)

# Make predictions
y_pred = et_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
```

#### **Key Parameters**
- `n_estimators`: Number of trees in the forest (default = 100).
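- `max_features`: Number (or fraction) of features to consider at each split (default = 1.0 in recent scikit-learn releases, meaning all features; older releases used `"auto"`).
- `max_depth`: Maximum depth of trees (default = None, meaning nodes are expanded until leaves are pure or too small to split).
- `min_samples_split`: Minimum samples required to split a node (default = 2).
- `min_samples_leaf`: Minimum samples required at a leaf node (default = 1).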
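
These parameters are usually tuned together. The following is a small sketch using `GridSearchCV` on synthetic data; the grid values and dataset are assumptions chosen only to keep the example quick, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for a real dataset
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Illustrative grid: fraction of features per split and minimum leaf size
param_grid = {
    "max_features": [0.5, 1.0],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)
```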

#### **Difference from Random Forest**
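- **Random Forest** searches for the best threshold on each candidate feature at every node, while Extra Trees draws thresholds at random and keeps the best of those random candidates.
- Random Forest trains each tree on a bootstrap sample by default; Extra Trees uses the full training set.
- Extra Trees is often faster to train but may have higher bias.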
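
A quick way to see these differences is to fit both models on the same data. The sketch below uses synthetic data and small forests as assumptions; which model scores higher depends on the dataset, so no particular outcome should be expected.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

models = {
    "ExtraTrees": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # 3-fold cross-validated R^2 for each ensemble
    scores = cross_val_score(model, X, y, cv=3, scoring="r2")
    results[name] = scores.mean()
    print(f"{name}: bootstrap={model.bootstrap}, mean R^2={scores.mean():.3f}")
```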

---

### **Isolation Forest**

The **Isolation Forest** is an anomaly detection algorithm based on the idea of isolating data points. Unlike density-based or distance-based approaches, it focuses on isolating anomalies rather than profiling "normal" data points.

#### **How Isolation Forest Works**
1. **Random Partitioning**:
   - The algorithm randomly selects a feature and a random split value between the feature’s minimum and maximum values.
   - This process creates a "tree" structure where data points are progressively isolated by splits.

2. **Isolation Depth**:
   - Normal points require more splits to be isolated (deeper in the tree) because they tend to cluster together.
   - Anomalies are isolated with fewer splits because they are far from other points or in low-density regions.

3. **Ensemble of Trees**:
   - Multiple trees are built to improve robustness.
   - The average path length (depth of isolation) across all trees is used to score how anomalous a data point is.

4. **Anomaly Scoring**:
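   - In the original formulation the score lies between 0 and 1, and a score close to 1 indicates an anomaly.
   - A score well below 0.5 suggests a normal point; if every point scores around 0.5, no distinct anomalies stand out.
   - scikit-learn reports scores with a different sign convention (see **Anomaly Score** below).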
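
The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it grows one random path per "tree" on the full dataset (no subsampling), and the helper names are hypothetical. The score uses the normalization from the original Isolation Forest paper, `s = 2 ** (-E[h(x)] / c(n))`.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def isolation_depth(point, X, rng, max_depth=100):
    """Count the random splits needed to isolate `point` (which must be a row of X)."""
    depth = 0
    while len(X) > 1 and depth < max_depth:
        j = rng.integers(X.shape[1])              # random feature
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:
            break                                 # constant feature: stop, for simplicity
        split = rng.uniform(lo, hi)               # random split value in [min, max)
        # Keep only the side of the split that contains the point
        keep = X[:, j] < split if point[j] < split else X[:, j] >= split
        X = X[keep]
        depth += 1
    return depth

def average_path_length(n):
    """c(n): expected path length used to normalize depths for n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(mean_depth, n):
    """Paper-style score: near 1 means anomalous, well below 0.5 means normal."""
    return 2.0 ** (-mean_depth / average_path_length(n))

rng = np.random.default_rng(0)
cluster = rng.normal(size=(200, 2))
X_demo = np.vstack([cluster, [[0.0, 0.0]], [[8.0, 8.0]]])
inlier, outlier = X_demo[-2], X_demo[-1]

n_trees = 100
inlier_depth = np.mean([isolation_depth(inlier, X_demo, rng) for _ in range(n_trees)])
outlier_depth = np.mean([isolation_depth(outlier, X_demo, rng) for _ in range(n_trees)])

print("mean depth  inlier:", inlier_depth, " outlier:", outlier_depth)
print("score       inlier:", anomaly_score(inlier_depth, len(X_demo)),
      " outlier:", anomaly_score(outlier_depth, len(X_demo)))
```

The point far from the cluster is cut off after only a few splits, so its mean depth is small and its score is pushed toward 1.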

#### **Advantages**
- **Efficient**:
  - Works well with large datasets: each tree is built on a small subsample, so training cost grows roughly linearly with the number of samples.
- **No Assumptions**:
  - Does not assume any distribution or cluster structure in the data.
- **Handles High Dimensions**:
  - Performs effectively even with high-dimensional data.

#### **Disadvantages**
- **Randomness**:
  - Results can vary slightly due to random splits unless a seed is fixed.
- **Limited Interpretability**:
  - The process of isolation is less interpretable than other methods like distance-based clustering.

#### **Common Use Cases**
- **Fraud Detection**: Identifying unusual transactions in finance.
- **Intrusion Detection**: Spotting anomalous behavior in network traffic.
- **Manufacturing**: Detecting defective products or rare events.

#### **Implementation in Python**

Here’s an example using **Scikit-Learn**:

```python
from sklearn.ensemble import IsolationForest
import numpy as np

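# Example dataset (seeded so the results are reproducible)
rng = np.random.default_rng(42)
X = rng.random((100, 2))  # Normal data, uniform in [0, 1)
X_outliers = rng.random((10, 2)) + 2  # Anomalies, shifted to [2, 3)
X = np.vstack((X, X_outliers))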

# Initialize Isolation Forest
isoforest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)

# Fit the model
isoforest.fit(X)

# Predict anomalies
predictions = isoforest.predict(X)  # -1 for anomalies, 1 for normal points
anomaly_scores = isoforest.decision_function(X)

# Results
print("Predictions:", predictions)
print("Anomaly Scores:", anomaly_scores)
```

#### **Key Parameters**
- **`n_estimators`**: Number of trees in the forest (default = 100).
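- **`max_samples`**: Number of samples drawn to train each tree. Can be an integer, a fraction of the data, or `"auto"` (the default, which uses `min(256, n_samples)`).
- **`contamination`**: The expected proportion of anomalies in the data. Either `"auto"` (the default) or a float in the range (0, 0.5].
- **`max_features`**: Number (or fraction) of features drawn to train each tree (default = 1.0, meaning all features).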

#### **Anomaly Score**
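- In scikit-learn, `decision_function` returns a shifted anomaly score for each point: negative values are flagged as anomalies and positive values as normal points.
- The more negative the score, the more likely the point is an anomaly.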
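
The sketch below shows how scikit-learn's scoring methods relate to each other, using a small synthetic dataset as an assumption. `score_samples` returns the negated paper-style score, `decision_function` subtracts the fitted `offset_` from it, and `predict` labels points with a negative decision value as `-1`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 clustered points plus one obvious outlier at (8, 8)
X_demo = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])

model = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)

raw = model.score_samples(X_demo)          # lower = more abnormal
shifted = model.decision_function(X_demo)  # raw score minus model.offset_
labels = model.predict(X_demo)             # -1 for anomalies, 1 for normal points

print("offset_:", model.offset_)           # fixed at -0.5 when contamination="auto"
print("outlier decision value:", shifted[-1], "label:", labels[-1])
```

Setting `contamination` to a float changes `offset_` so that roughly that fraction of the training points receives a negative decision value.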

---

### **Examples**

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest

import warnings
warnings.filterwarnings("ignore")
```

```python
df = pd.read_csv(r'C:\Users\DAI.STUDENTSDC\Desktop\Machine Learning\Data\Data Sets\milk.csv', index_col=0)

iso = IsolationForest(
    n_estimators=25,
    contamination=0.05,
    random_state=24
)

# fit_predict returns -1 for outliers and 1 for inliers
pred = iso.fit_predict(df)

print('Unique')
print(np.unique(pred, return_counts=True))

print('Outliers')
df[pred == -1]
```

```
Unique
(array([-1,  1]), array([ 2, 23], dtype=int64))
Outliers
```

| Animal | water | protein | fat | lactose | ash |
|---|---|---|---|---|---|
| SEAL | 46.4 | 9.7 | 42.0 | 0.0 | 0.85 |
| DOLPHIN | 44.9 | 10.6 | 34.9 | 0.9 | 0.53 |

```python
df = pd.read_csv(r'C:\Users\DAI.STUDENTSDC\Desktop\Machine Learning\Data\Cases\Recency Frequency Monetary\rfm_data_customer.csv', index_col=2)

iso = IsolationForest(
    n_estimators=25,
    contamination=0.05,
    random_state=24
)

# Note: customer_id is still a column here, so it is used as a feature;
# identifier columns are usually dropped before fitting.
pred = iso.fit_predict(df)

print('Unique')
print(np.unique(pred, return_counts=True))

print('Outliers')
df[pred == -1]
```

```
Unique
(array([-1,  1]), array([ 2000, 37999], dtype=int64))
Outliers
```

| most_recent_visit | customer_id | revenue | number_of_orders | recency_days |
|---|---|---|---|---|
| 2005-11-05 | 16677 | 220 | 5 | 422 |
| 2006-10-15 | 2718 | 1633 | 16 | 78 |
| 2006-11-25 | 27531 | 1804 | 18 | 37 |
| 2006-11-19 | 18357 | 1561 | 18 | 43 |
| 2006-12-21 | 2036 | 216 | 3 | 11 |
| ... | ... | ... | ... | ... |
| 2005-08-29 | 8581 | 381 | 4 | 490 |
| 2006-09-29 | 34982 | 1998 | 19 | 94 |
| 2006-03-16 | 25267 | 1464 | 16 | 291 |
| 2006-02-25 | 5468 | 1576 | 15 | 310 |
