**_After obtaining `output2`, I separated the Boolean columns, `normalized_distance`, and `similarity_score` from the rest of the dataset. I then ran the model on both datasets and achieved better results with the dataset containing the specified columns._**
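
The column split described above can be sketched roughly as follows. This is a minimal illustration on a made-up frame: the values and the column names `is_flagged` and `other_feature` are assumptions, and only `normalized_distance` and `similarity_score` are names taken from the text.

```python
import pandas as pd

# Toy frame; the values and the names `is_flagged` / `other_feature` are hypothetical.
df = pd.DataFrame({
    "is_flagged": [True, False, True],
    "normalized_distance": [0.12, 0.80, 0.45],
    "similarity_score": [0.91, 0.10, 0.55],
    "other_feature": [3, 7, 5],
})

# Boolean columns plus the two named numeric columns form the first dataset.
bool_cols = df.select_dtypes(include="bool").columns.tolist()
selected_cols = bool_cols + ["normalized_distance", "similarity_score"]

selected = df[selected_cols]                 # dataset with the specified columns
remainder = df.drop(columns=selected_cols)   # everything else

print(selected.columns.tolist())
print(remainder.columns.tolist())
```

Each of the two frames can then be passed to the same search loop to compare scores.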

#### Output 3
**Best parameters:** `{'n_estimators': 150, 'max_samples': 128, 'contamination': 0.1602582555413908, 'max_features': 0.9, 'bootstrap': False}`
**Best score:** `0.719502379676749`


In the third iteration, I introduced a new value for `contamination` (`0.1602582555413908`) and increased `n_estimators` to `150`. This resulted in a significant improvement, with the best score reaching `0.719502379676749`.

#### Output 4
**Best parameters:** `{'n_estimators': 150, 'max_samples': 64, 'contamination': 0.1, 'max_features': 0.9, 'bootstrap': False}`
**Best score:** `0.7301869661156942`

In the fourth iteration, I reduced `max_samples` from `128` to `64` and set `contamination` to `0.1`, which led to an even better score of `0.7301869661156942`. Because both parameters changed, this result suggests, but does not isolate, that a smaller sample size benefits the model's performance.

In the final iteration, I re-ran the search and confirmed the best parameters from the previous iteration. The score remained at `0.7301869661156942`, indicating that the model had reached the best configuration within the searched grid for the given dataset.
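
As a rough illustration of what refitting with the confirmed parameters looks like, here is a sketch on synthetic data. The random data, its shape, and the `random_state` are assumptions; only the hyperparameters come from Output 4.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the real feature matrix (assumption: 200 rows, 5 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))

# Hyperparameters reported under Output 4.
model = IsolationForest(
    n_estimators=150,
    max_samples=64,
    contamination=0.1,
    max_features=0.9,
    bootstrap=False,
    random_state=42,
)
labels = model.fit_predict(X_demo)  # 1 = inlier, -1 = outlier

outlier_fraction = float(np.mean(labels == -1))
print(f"flagged as outliers: {outlier_fraction:.2%}")
```

With `contamination=0.1`, the decision threshold is derived from the training scores, so roughly 10% of the training rows end up labelled `-1`.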

### Conclusion
Through each iteration, I systematically explored different hyperparameter combinations and derived insights from previous results. This iterative approach allowed me to progressively improve the model's performance, ultimately raising the best score from `0.6556688332880479` to `0.7301869661156942`, an absolute gain of about `0.075` (roughly 11% relative).

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import KFold
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import sys

In [2]:
# Load data (replace with your dataset)
data = pd.read_csv(r"C:\Users\Tanmay\V-Patrol\work\final.csv")
data.drop(columns=["Unnamed: 0", "Unnamed: 0.1", "Unnamed: 0.2", "Unnamed: 0.3", "Unnamed: 0.4", "Unnamed: 0.5", "Unnamed: 0.6", "Unnamed: 0.7"], inplace=True)

In [None]:
# Convert boolean columns to int64
bool_columns = data.select_dtypes(include='bool').columns
data[bool_columns] = data[bool_columns].astype('int64')

In [None]:
X = data.values

In [None]:


# Define the parameter grid with sampled values
param_grid = {
    'n_estimators': [100, 150],
    'max_samples': [32, 64, 128],
    'contamination': [0.1],
    'max_features': [0.75, 0.90, 1.0],  # float 1.0 = all features; the integer 1 would mean a single feature
    'bootstrap': [False]
}


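def hybrid_scorer(scores):
    # Min-max normalize the anomaly scores to [0, 1]
    norm_scores = (scores - scores.min()) / (scores.max() - scores.min())
    # Split the scores into two clusters and measure how well separated they are
    # (random_state fixed so repeated runs give the same clustering)
    cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(norm_scores.reshape(-1, 1))
    cluster_score = silhouette_score(norm_scores.reshape(-1, 1), cluster_labels)

    return cluster_score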

# Cross-validation setup
kf = KFold(n_splits=2, shuffle=True, random_state=42)

# Store results
best_score = -np.inf
best_params = None
results = []

# Nested for loops to iterate over parameter combinations
for n_estimators in param_grid['n_estimators']:
    for max_samples in param_grid['max_samples']:
        for contamination in param_grid['contamination']:
            for max_features in param_grid['max_features']:
                for bootstrap in param_grid['bootstrap']:
                    # Initialize the model with the current parameters
                    model = IsolationForest(
                        n_estimators=n_estimators,
                        max_samples=max_samples,
                        contamination=contamination,
                        max_features=max_features,
                        bootstrap=bootstrap,
                        random_state=42
                    )

                    # Perform cross-validation
                    fold_scores = []
                    for train_index, test_index in kf.split(X):
                        X_train, X_test = X[train_index], X[test_index]
                        model.fit(X_train)
                        scores = model.decision_function(X_test)
                        score = hybrid_scorer(scores)
                        fold_scores.append(score)

                    # Calculate mean score for the current parameter combination
                    mean_score = np.mean(fold_scores)
                    results.append({
                        'params': {
                            'n_estimators': n_estimators,
                            'max_samples': max_samples,
                            'contamination': contamination,
                            'max_features': max_features,
                            'bootstrap': bootstrap
                        },
                        'mean_score': mean_score
                    })

                    print(f"Parameters: {n_estimators}, {max_samples}, {contamination}, {max_features}, {bootstrap}")
                    print(f"Mean Score: {mean_score}")

                    # NaN never compares equal to anything, so use np.isnan instead of `== np.nan`
                    if np.isnan(mean_score):
                        print("Mean score is NaN. Stopping the search.")
                        sys.exit()

                    # Update best score and parameters if current mean score is better
                    if mean_score > best_score:
                        best_score = mean_score
                        best_params = {
                            'n_estimators': n_estimators,
                            'max_samples': max_samples,
                            'contamination': contamination,
                            'max_features': max_features,
                            'bootstrap': bootstrap
                        }

# Print the best parameters and score
print("Best parameters:", best_params)
print("Best score:", best_score)

# Print all results
for result in results:
    print(result)
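
The nested loops above can also be written as a single loop over the Cartesian product of the grid. The sketch below uses only the standard library and a smaller grid with the same keys. The `evaluate` function is a hypothetical stand-in for the cross-validated scorer, rigged to return NaN for one combination so the NaN handling is exercised.

```python
import itertools
import math

# A small grid with the same keys as the search above (values chosen for illustration).
param_grid = {
    "n_estimators": [100, 150],
    "max_samples": [32, 64, 128],
    "contamination": [0.1],
    "max_features": [0.75, 0.9],
    "bootstrap": [False],
}

def evaluate(params):
    """Hypothetical scorer; not the real cross-validation. Returns NaN for one combination."""
    if params["n_estimators"] == 100 and params["max_samples"] == 32:
        return float("nan")
    return 1.0 / params["max_samples"] + params["max_features"] / 100

best_score, best_params = -math.inf, None
names = list(param_grid)

# itertools.product yields every combination, replacing the five nested loops.
for values in itertools.product(*param_grid.values()):
    params = dict(zip(names, values))
    score = evaluate(params)
    # NaN is not equal to anything, including itself, so test it with isnan.
    if math.isnan(score):
        continue
    if score > best_score:
        best_score, best_params = score, params

print("Best parameters:", best_params)
print("Best score:", best_score)
```

Skipping NaN scores with `continue` keeps the search running. Exiting instead, as the loop above does, is the stricter choice when a NaN signals a problem worth investigating.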