Students:

Joana Rodrigues - 20240603    
Maria Francisca - 20240346     
Rui Reis - 20240854      
Tomás Silva - 20230982     
Victor Pita - 20240596        

# Project Part 4 - Clustering

Predicting agreement reached with clustering

# Index 
1. [Imports](#imports)   
    1.1. [Import Libraries](#importlibraries)    
    1.2. [Import Data files](#importfiles)
  

`Accident Date` Injury date of the claim   
`Age at Injury` Age of injured worker when the injury occurred.  
`Alternative Dispute Resolution` Adjudication processes external to the Board.   
`Assembly Date` The date the claim was first assembled.   
`Attorney/ Representative` Is the claim being represented by an Attorney?   
`Average Weekly Wage` The wage used to calculate workers’ compensation, disability, or Paid Leave wage replacement benefits.      
`Birth Year` The reported year of birth of the injured worker.   
`C-2 Date` Date of receipt of the Employer's Report of Work-Related Injury/Illness or equivalent (formerly Form C-2).   
`C-3 Date` Date Form C-3 (Employee Claim Form) was received.   
`Carrier Name` Name of primary insurance provider responsible for providing workers’ compensation coverage to the injured worker’s employer.   
`Carrier Type` Type of primary insurance provider responsible for providing workers’ compensation coverage.   
`Claim Identifier` Unique identifier for each claim, assigned by WCB.   
`County of Injury` Name of the New York County where the injury occurred.   
`COVID-19 Indicator` Indication that the claim may be associated with COVID-19.   
`District Name` Name of the WCB district office that oversees claims for that region or area of the state.   
`First Hearing Date` Date the first hearing was held on a claim at a WCB hearing location. A blank date means the claim has not yet had a hearing held.    
`Gender` The reported gender of the injured worker.   
`IME-4 Count` Number of IME-4 forms received per claim. The IME-4 form is the “Independent Examiner's Report of Independent Medical Examination” form.   
`Industry Code` NAICS code and descriptions are available at: https://www.naics.com/search-naics-codes-by-industry/.   
`Industry Code Description` 2-digit NAICS industry code description used to classify businesses according to their economic activity.   
`Medical Fee Region` Approximate region where the injured worker would receive medical service.   
`OIICS Nature of Injury Description` The OIICS nature of injury codes & descriptions are available at https://www.bls.gov/iif/oiics_manual_2007.pdf.   
`WCIO Cause of Injury Code` The WCIO cause of injury codes & descriptions are at https://www.wcio.org/Active%20PNC/WCIO_Cause_Table.pdf   
`WCIO Cause of Injury Description` See description of field above.    
`WCIO Nature of Injury Code` The WCIO nature of injury codes & descriptions are available at https://www.wcio.org/Active%20PNC/WCIO_Nature_Table.pdf   
`WCIO Nature of Injury Description` See description of field above.   
`WCIO Part Of Body Code` The WCIO part of body codes & descriptions are available at https://www.wcio.org/Active%20PNC/WCIO_Part_Table.pdf   
`WCIO Part Of Body Description` See description of field above.   
`Zip Code` The reported ZIP code of the injured worker’s home address.   
`Agreement Reached` Binary variable: Yes if there is an agreement without the involvement of the WCB -> unknown at the start of a claim.   
`WCB Decision` Multiclass variable: Decision of the WCB relative to the claim: “Accident” means that claim refers to workplace accident, “Occupational Disease” means illness from the workplace. -> requires WCB deliberation so it is unknown at start of claim.    
`Claim Injury Type` Main target variable: Deliberation of the WCB relative to benefits awarded to the claim. Numbering indicates severity   

<hr>
<a class="anchor" id="imports">
    
# 1. Import
</a>

<hr>
<a class="anchor" id="importlibraries">
    
## 1.1. Import libraries
</a>

In [1]:
import pandas as pd 
import numpy as np 
import warnings
import itertools
import math
import time
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler,LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from xgboost import XGBClassifier
from sklearn.feature_selection import SelectKBest, f_classif
import pickle
import category_encoders as ce
from collections import Counter
from scipy.stats import chi2_contingency
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

In [2]:
warnings.filterwarnings('ignore')

<hr>
<a class="anchor" id="importfiles">
    
## 1.2. Import data files
</a>

In [3]:
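# Load the cleaned train and test sets; the context manager closes each file after reading
with open("df_train_cleaned.pkl", 'rb') as f:
    df_train = pickle.load(f)
with open("df_test_cleaned.pkl", 'rb') as f:
    df_test = pickle.load(f)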
# features_selected = pickle.load(open("all_features.pkl", 'rb')) # features selected in the previous notebook

In [4]:
df_train.drop(columns =[ 'Claim Injury Type'], inplace = True)

Separation of the target from the other variables.        
The target in this notebook is 'Agreement Reached', a binary variable that is cast to boolean, so no label encoding is needed. 'Claim Injury Type' is dropped because it is the main target variable of the project.

In [5]:
# Absolute and relative frequency of each class of the target
value_counts = df_train['Agreement Reached'].value_counts()
percentages = df_train['Agreement Reached'].value_counts(normalize=True) * 100

# Combine counts and percentages into a DataFrame
value_summary = pd.DataFrame({
    'Count': value_counts,
    'Percentage (%)': percentages})

print(value_summary)

                    Count  Percentage (%)
Agreement Reached                        
False              541909       95.353801
True                26405        4.646199


In [6]:
df_train['Agreement Reached'] = df_train['Agreement Reached'].astype(bool)

In [7]:
X = df_train.drop(columns=['Agreement Reached'])  # Features
y = df_train['Agreement Reached']  # Target column

### Function that defines the feature types      
(Same function as in the previous notebooks)

In [8]:
def identify_feature_types(df):
    # Identifying date features
    date_features = [column for column in df.columns if 'Date' in column]
    
    # Identifying categorical (object) features initially
    categorical_features = df.select_dtypes(include=['object']).columns.tolist()
    
    # Identifying boolean features
    boolean_features = df.select_dtypes(include=['bool']).columns.tolist()
    
    # Identifying numerical features (integers and floats), excluding those with 'Code', 'County', 'Carrier', 'Decision', 'Indicator', or 'Grouped' in their name
    numerical_features = [
        column for column in df.select_dtypes(include=['int64', 'float64']).columns
        if 'Code' not in column and 'County' not in column and 'Carrier' not in column and 'Decision' not in column and 'Indicator' not in column  and 'Grouped' not in column
    ]
    
    # Adding features with 'Code', 'County', 'Carrier', 'Decision', or 'Grouped' in the name to categorical features, even if they are numerical
    categorical_features.extend([
        column for column in df.columns if 'Code' in column or 'County' in column or 'Carrier' in column or 'Decision' in column or 'Grouped' in column
    ])

    # Removing duplicates in case any feature is accidentally added twice
    categorical_features = list(set(categorical_features))
    
    return {
        'date_features': date_features,
        'numerical_features': numerical_features,
        'categorical_features': categorical_features,
        'boolean_features': boolean_features
    }
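
As a condensed illustration of the name-based rule used above, the sketch below applies the same idea to a toy DataFrame. The column values and the shortened `name_markers` tuple are hypothetical and only serve the example; this is not a replacement for `identify_feature_types`.

```python
import pandas as pd

# Toy data (hypothetical values) with one column per kind of feature
toy = pd.DataFrame({
    "Age at Injury": [30, 45],
    "Industry Code": [44.0, 62.0],
    "Gender": pd.Series(["M", "F"], dtype=object),
    "Accident Date": pd.to_datetime(["2021-01-05", "2022-03-10"]),
})

# Shortened list of name markers, for illustration only
name_markers = ("Code", "County", "Carrier")

def has_marker(column):
    """Return True when the column name contains one of the markers."""
    return any(marker in column for marker in name_markers)

# Numeric dtype AND no marker in the name -> numerical feature
numeric_cols = toy.select_dtypes(include=["int64", "float64"]).columns
numerical = [c for c in numeric_cols if not has_marker(c)]

# Object dtype OR a marker in the name -> categorical feature
object_cols = toy.select_dtypes(include=["object"]).columns
categorical = sorted(set(object_cols) | {c for c in toy.columns if has_marker(c)})

print("numerical:", numerical)
print("categorical:", categorical)
```

Here `Industry Code` is stored as a float but is treated as categorical because of its name, which is the behaviour the function relies on.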


### Division of features
This division is made because different feature selection methods are used for the different types of features.

In [9]:
# Extract numerical features based on df_train
feature_types = identify_feature_types(X)

metric_features = feature_types['numerical_features']

# Keep only the numerical features of the training and test sets
feat_x_train_num = X[metric_features]

feat_x_val_num = df_test[metric_features]


### Correlation with target feature
This evaluates the correlation between each feature and the target

In [10]:
def numeric_features(features_df, target_series, threshold=0.2, correlation_threshold=0.8):
    # Ensure target_series index aligns with features_df
    target_series = pd.Series(target_series, index=features_df.index)

    # Calculate correlation between features and target
    correlation_matrix = features_df.corrwith(target_series)

    # Convert correlation results to a DataFrame
    correlation_df = correlation_matrix.reset_index()
    correlation_df.columns = ['Feature', 'Correlation_with_Target']

    # Filter features based on absolute correlation threshold
    relevant_features = correlation_df[correlation_df['Correlation_with_Target'].abs() > threshold]

    # Identify features that are highly correlated with each other
    correlation_matrix = features_df.corr()

    # Create a list to keep track of features to drop
    features_to_drop = []

    # Iterate over the upper triangle of the correlation matrix to find highly correlated pairs
    for i in range(len(correlation_matrix.columns)):
        for j in range(i):
            if abs(correlation_matrix.iloc[i, j]) > correlation_threshold:
                feature_1 = correlation_matrix.columns[i]
                feature_2 = correlation_matrix.columns[j]
                
                # Remove the feature with the lower correlation to the target
                # (computed on the feature columns, not on the correlation matrix)
                corr_1 = abs(features_df[feature_1].corr(target_series.astype(float)))
                corr_2 = abs(features_df[feature_2].corr(target_series.astype(float)))
                if corr_1 < corr_2:
                    features_to_drop.append(feature_1)
                else:
                    features_to_drop.append(feature_2)

    # Remove duplicates from the features_to_drop list
    features_to_drop = list(set(features_to_drop))

    # Feature set with the highly inter-correlated features removed
    final_relevant_features = relevant_features[~relevant_features['Feature'].isin(features_to_drop)]

    # Note: the threshold-filtered set is returned; final_relevant_features is computed but not used
    return relevant_features
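
The threshold step of this selection can be illustrated on synthetic data. The sketch below is a minimal example, assuming a made-up feature that tracks a 0/1 target and a second feature that is pure noise; the column names and the random data are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): one informative feature, one noise feature
rng = np.random.default_rng(0)
target = pd.Series(rng.integers(0, 2, size=500))
features = pd.DataFrame({
    "informative": target * 2.0 + rng.normal(0, 0.5, size=500),
    "noise": rng.normal(0, 1, size=500),
})

# Pearson correlation of each column with the 0/1 target
corr = features.corrwith(target.astype(float))

# Keep the features whose absolute correlation exceeds the threshold
threshold = 0.2
selected = corr[corr.abs() > threshold].index.tolist()

print(corr.round(3))
print("selected:", selected)
```

With a binary target, the Pearson correlation computed this way is the point-biserial correlation, so it measures how far apart the two class means of a feature are relative to its spread.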


### Function that handles the missing values      
For the numerical columns, missing values are replaced with the median of the training set. Because the outliers were not removed, the median is a more robust choice than the mean.    
For the categorical columns, missing values are replaced with the mode of the training set. Initially, we considered using the mode for each type of injury, but as there are few missing values at this stage, we went for the simpler approach.

In [11]:
def handle_missing_values(X_train, X_val, df_test):
    # Identify feature types
    feature_types = identify_feature_types(X_train)
    
    # For numerical columns: fill NaN with the median value (to avoid outlier influence)
    numerical_features = feature_types['numerical_features']
    
    for column in numerical_features:
        if column in X_train.columns:
            # Use the median value of the training set for filling missing values
            median_value = X_train[column].median()  # Using median to handle potential outliers
            X_train[column] = X_train[column].fillna(median_value)
            X_val[column] = X_val[column].fillna(median_value)
            df_test[column] = df_test[column].fillna(median_value)

    # For categorical columns: fill NaN with the mode (most frequent value)
    categorical_features = feature_types['categorical_features']
    
    for column in categorical_features:
        if column in X_train.columns:
            # Use the mode value of the training set for filling missing values
            mode_value = X_train[column].mode()[0]  # Find the mode (most frequent value) for each categorical column
            X_train[column] = X_train[column].fillna(mode_value)
            X_val[column] = X_val[column].fillna(mode_value)
            df_test[column] = df_test[column].fillna(mode_value)

    # print("Columns with missing values in X_train:")
    # print(X_train.columns[X_train.isna().any()].tolist())  # Using .any() to check for any NaNs in each column
    # print("Columns with missing values in X_val:")
    # print(X_val.columns[X_val.isna().any()].tolist())
    # print("Columns with missing values in df_test:")
    # print(df_test.columns[df_test.isna().any()].tolist())
    return X_train, X_val, df_test
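
The key point of the function above is that the fill values are computed on the training data only and then reused for the other sets. A minimal sketch on toy data (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy training and validation sets with missing values (hypothetical data)
train = pd.DataFrame({
    "wage": [100.0, 200.0, np.nan, 10_000.0],
    "gender": pd.Series(["M", "F", "F", None], dtype=object),
})
val = pd.DataFrame({
    "wage": [np.nan, 300.0],
    "gender": pd.Series([None, "M"], dtype=object),
})

# Statistics come from the training set only
median_wage = train["wage"].median()      # robust to the 10,000 outlier
mean_wage = train["wage"].mean()          # pulled up by the outlier
mode_gender = train["gender"].mode()[0]   # most frequent category

fill_values = {"wage": median_wage, "gender": mode_gender}
train_filled = train.fillna(fill_values)
val_filled = val.fillna(fill_values)      # validation reuses the training statistics

print(f"median={median_wage}, mean={mean_wage:.2f}, mode={mode_gender}")
print(val_filled)
```

The single outlier moves the mean far away from the typical wage while the median stays at a representative value, which is the reason given above for preferring the median.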


### Scaler
The scaler chosen was MinMaxScaler because, even though there are some outliers in the data, the log transformation applied to some features reduced these differences, so the outliers are not extreme. Furthermore, this scaler works well with data that does not follow a normal distribution.

In [12]:
def scaler(X_train, X_val, df_test):
    metric_features = [col for col in X_train.columns if pd.api.types.is_numeric_dtype(X_train[col])]

    # Scale numeric features
    scaler = MinMaxScaler()
    X_train_scaled = X_train.copy()
    X_val_scaled = X_val.copy()
    X_test_scaled = df_test.copy()

    X_train_scaled[metric_features] = scaler.fit_transform(X_train[metric_features])
    X_val_scaled[metric_features] = scaler.transform(X_val[metric_features])
    X_test_scaled[metric_features] = scaler.transform(X_test_scaled[metric_features])

    return X_train_scaled, X_val_scaled, X_test_scaled
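
One consequence of fitting the scaler on the training data only is that validation or test values outside the training range fall outside [0, 1]. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (hypothetical values)
train = np.array([[0.0], [5.0], [10.0]])
val = np.array([[2.5], [12.0]])   # 12.0 is above the training maximum

minmax = MinMaxScaler()
train_scaled = minmax.fit_transform(train)  # learns min=0 and max=10 from train
val_scaled = minmax.transform(val)          # reuses the training min and max

print(train_scaled.ravel())
print(val_scaled.ravel())
```

The value 12.0 is mapped above 1 because the minimum and maximum come from the training set, which is expected and avoids leaking information from the validation data.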

### Evaluation of model
This function is used to check whether changes made to the models improved them or not.
It shows whether each class is being predicted well and, if not, whether it is being predicted at all.

In [13]:
def evaluate_model_predictions(y_pred_val, y_val):
    # # Calculate performance metrics for training and testing
    # train_accuracy = accuracy_score(y_train, y_pred_train)
    # train_f1_macro = f1_score(y_train, y_pred_train, average='macro')
    
    test_accuracy = accuracy_score(y_val, y_pred_val)
    test_f1_macro = f1_score(y_val, y_pred_val, average='macro')

    # Display results
    # print(f"Accuracy of train: {train_accuracy:.4f}")
    print(f"Accuracy of test: {test_accuracy:.4f}")
    # print(f"F1 Macro (Train): {train_f1_macro:.4f}")
    print(f"\033[1mF1 Macro (Test)\033[0m: {test_f1_macro:.4f}")
    
    print("\nClassification Report for Validation Data:")
    print(classification_report(y_val, y_pred_val))

    print("\nConfusion Matrix for Validation Data:")
    print(confusion_matrix(y_val, y_pred_val))
    return test_accuracy, test_f1_macro
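
To see why macro F1 is reported next to accuracy, consider a baseline that always predicts False. The sketch below uses the class counts from the 'Agreement Reached' summary above; the baseline itself is only an illustration and is not one of the models in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the 'Agreement Reached' value counts
n_false, n_true = 541909, 26405
y_true = np.concatenate([np.zeros(n_false, dtype=int), np.ones(n_true, dtype=int)])

# Baseline (illustration only): always predict the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 sets the F1 of the never-predicted class to 0 without a warning
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy: {acc:.4f}")
print(f"macro F1: {f1_macro:.4f}")
```

The baseline reaches roughly 95% accuracy while its macro F1 stays below 0.5, because the True class is never predicted. Any model compared on accuracy alone should therefore be read against this baseline.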

### Combined preprocessing and feature selection

In [14]:
def preprocessing(X_train, X_val, X_test, y_train):
    # Data preprocessing
    handle_missing_values(X_train, X_val, X_test)
    X_train_scaled, X_val_scaled, X_test_scaled = scaler(X_train, X_val, X_test)

    # Identify feature types
    feature_types = identify_feature_types(X_train_scaled)  # Use training data to identify features
    num_features = feature_types['numerical_features']
    
    # Feature selection (numerical features only)
    relevant_num_features = numeric_features(X_train_scaled[num_features], y_train)['Feature']

    # Keep only the selected numerical features
    selected_features = list(relevant_num_features)
    X_train_selected = X_train_scaled[selected_features]
    X_val_selected = X_val_scaled[selected_features]
    X_test_selected = X_test_scaled[selected_features]

    return X_train_selected, X_val_selected,X_test_selected
    

### Clustering Function
The following function uses clustering in a supervised way to predict the target variable 'Agreement Reached'. It splits the data with stratified k-fold cross-validation, applies the preprocessing steps (handling missing values, scaling, and feature selection), and then fits one single-cluster model per class (KMeans or a Gaussian mixture) on the training data. Each validation sample is assigned to the class whose centroid is nearest (KMeans) or whose density gives it the higher log-likelihood (GMM). Performance is reported on the last validation fold. Finally, the models from the last fold predict the labels for the test set, and the function returns these predictions together with the validation accuracy and macro F1.

In [15]:
def clustering_agreement_supervised(X, y, X_test, clustering_type='kmeans'):
    skf = StratifiedKFold(n_splits=3)
    timer = []

    for train_index, val_index in skf.split(X, y):
        # Split data
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        begin = time.perf_counter()
        X_train_selected, X_val_selected,X_test_selected = preprocessing(X_train, X_val, X_test, y_train)

        # Supervised Clustering
        # Separate by class
        X_train_class_0 = X_train_selected[y_train == 0]
        X_train_class_1 = X_train_selected[y_train == 1]

        # Apply clustering separately for each class
        if clustering_type == 'kmeans':
            clustering_model_class_0 = KMeans(n_clusters=1, random_state=42)
            clustering_model_class_1 = KMeans(n_clusters=1, random_state=42)
        elif clustering_type == 'gmm':
            clustering_model_class_0 = GaussianMixture(n_components=1, random_state=42)
            clustering_model_class_1 = GaussianMixture(n_components=1, random_state=42)
        else:
            raise ValueError("Only 'kmeans' and 'gmm' clustering are supported for supervised clustering.")

        clustering_model_class_0.fit(X_train_class_0)
        clustering_model_class_1.fit(X_train_class_1)

        # Assign clusters to validation data
        if clustering_type == 'gmm':
            val_log_prob_0 = clustering_model_class_0.score_samples(X_val_selected)
            val_log_prob_1 = clustering_model_class_1.score_samples(X_val_selected)
            y_pred_val = np.where(val_log_prob_0 > val_log_prob_1, 0, 1)
        else:  # For KMeans
            # transform() returns shape (n_samples, 1); flatten to 1-D before comparing
            val_distances_0 = clustering_model_class_0.transform(X_val_selected).ravel()
            val_distances_1 = clustering_model_class_1.transform(X_val_selected).ravel()
            y_pred_val = np.where(val_distances_0 < val_distances_1, 0, 1)

        end = time.perf_counter()
        timer.append(end - begin)

    print(f"Average fold duration: {np.mean(timer):.2f}s")
    print("Clustering completed.")
    test_accuracy, test_f1_macro = evaluate_model_predictions(y_pred_val, y_val)

    # Test Predictions
    if clustering_type == 'gmm':
        test_log_prob_0 = clustering_model_class_0.score_samples(X_test_selected)
        test_log_prob_1 = clustering_model_class_1.score_samples(X_test_selected)
        y_pred_test = np.where(test_log_prob_0 > test_log_prob_1, 0, 1)
    else:  # For KMeans
        # transform() returns shape (n_samples, 1); flatten to 1-D before comparing
        test_distances_0 = clustering_model_class_0.transform(X_test_selected).ravel()
        test_distances_1 = clustering_model_class_1.transform(X_test_selected).ravel()
        y_pred_test = np.where(test_distances_0 < test_distances_1, 0, 1)

    return y_pred_test, test_accuracy, test_f1_macro
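
With `n_clusters=1`, each KMeans model reduces to the mean of its class, so the KMeans branch behaves like a nearest-centroid classifier. The sketch below illustrates this on synthetic two-dimensional data; the data and the points in `X_new` are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (hypothetical): class 0 around (0, 0), class 1 around (3, 3)
rng = np.random.default_rng(42)
X0 = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X1 = rng.normal(loc=3.0, scale=1.0, size=(50, 2))

# One single-cluster model per class
km0 = KMeans(n_clusters=1, n_init=1, random_state=42).fit(X0)
km1 = KMeans(n_clusters=1, n_init=1, random_state=42).fit(X1)

# The single centroid is the mean of the class
centroid_is_mean = np.allclose(km0.cluster_centers_[0], X0.mean(axis=0))

# Classify new points by the nearest class centroid
X_new = np.array([[0.1, -0.2], [2.9, 3.2]])
d0 = km0.transform(X_new).ravel()  # distance to the class-0 centroid
d1 = km1.transform(X_new).ravel()  # distance to the class-1 centroid
y_pred = np.where(d0 < d1, 0, 1)

print("centroid equals class mean:", centroid_is_mean)
print("predictions:", y_pred)
```

Because only the two class means matter, this approach ignores class frequencies and the shape of each class, which helps explain the high recall and low precision for the True class in the results.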


In [16]:
y_pred_test_kmeans, accuracy_kmeans, f1_kmeans = clustering_agreement_supervised(X, y, df_test)

Average fold duration: 10.34s
Clustering completed.
Accuracy of test: 0.7753
F1 Macro (Test): 0.5557

Classification Report for Validation Data:
              precision    recall  f1-score   support

       False       0.99      0.78      0.87    180636
        True       0.14      0.78      0.24      8802

    accuracy                           0.78    189438
   macro avg       0.57      0.78      0.56    189438
weighted avg       0.95      0.78      0.84    189438

Confusion Matrix for Validation Data:
[[140025  40611]
 [  1956   6846]]


In [17]:
y_pred_test_gmm, accuracy_gmm, f1_gmm = clustering_agreement_supervised(X, y, df_test, clustering_type='gmm')

Average fold duration: 6.63s
Clustering completed.
Accuracy of test: 0.7290
F1 Macro (Test): 0.5364

Classification Report for Validation Data:
              precision    recall  f1-score   support

       False       0.99      0.72      0.84    180636
        True       0.14      0.91      0.24      8802

    accuracy                           0.73    189438
   macro avg       0.57      0.81      0.54    189438
weighted avg       0.95      0.73      0.81    189438

Confusion Matrix for Validation Data:
[[130105  50531]
 [   806   7996]]


In [18]:
def gmm(X, y, X_test, threshold =0.7):
    skf = StratifiedKFold(n_splits=3)
    timer = []

    for train_index, val_index in skf.split(X, y):
        # Split data
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        begin = time.perf_counter()
        X_train_selected, X_val_selected,X_test_selected = preprocessing(X_train, X_val, X_test, y_train)
        
        # GMM clustering: one single-component model per class
        gmm_class_0 = GaussianMixture(n_components=1, random_state=42)
        gmm_class_1 = GaussianMixture(n_components=1, random_state=42)

        gmm_class_0.fit(X_train_selected[y_train == 0])
        gmm_class_1.fit(X_train_selected[y_train == 1])

        # Validation likelihoods under each class model
        val_probs_0 = np.exp(gmm_class_0.score_samples(X_val_selected))
        val_probs_1 = np.exp(gmm_class_1.score_samples(X_val_selected))
        val_probs_total = val_probs_0 + val_probs_1

        # Normalize so that the two class scores sum to 1
        val_probs_1_normalized = val_probs_1 / val_probs_total

        y_pred_val = np.where(val_probs_1_normalized >= threshold, 1, 0)

        end = time.perf_counter()
        timer.append(end - begin)

    test_accuracy, test_f1_macro= evaluate_model_predictions(y_pred_val, y_val)

    print(f"Average fold duration: {np.mean(timer):.2f}s")
    print("Clustering completed.")

    # Test Predictions
    test_probs_0 = np.exp(gmm_class_0.score_samples(X_test_selected))
    test_probs_1 = np.exp(gmm_class_1.score_samples(X_test_selected))
    test_probs_total = test_probs_0 + test_probs_1

    # Normalize probabilities
    test_probs_1_normalized = test_probs_1 / test_probs_total

    # Apply threshold
    y_pred_test = np.where(test_probs_1_normalized >= threshold, 1, 0)

    return y_pred_test, test_accuracy, test_f1_macro
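
The normalization above exponentiates the log-likelihoods returned by `score_samples`. For very negative log-likelihoods both exponentials can underflow to zero, giving 0/0. A minimal sketch of an equivalent computation in log space is shown below, assuming equal class weights as in the function above. The log-likelihood values are made up, and `scipy.special.expit` is the logistic function.

```python
import numpy as np
from scipy.special import expit

# Stand-ins for the outputs of score_samples (hypothetical log-likelihoods)
log_p0 = np.array([-5.0, -800.0, -2.0])
log_p1 = np.array([-6.0, -790.0, -2.0])

# Naive version: exp(-800) and exp(-790) underflow to 0, so the ratio is 0/0
with np.errstate(invalid="ignore"):
    naive = np.exp(log_p1) / (np.exp(log_p0) + np.exp(log_p1))

# Log-space version: p1 / (p0 + p1) = 1 / (1 + exp(log_p0 - log_p1))
stable = expit(log_p1 - log_p0)

threshold = 0.9
y_pred = np.where(stable >= threshold, 1, 0)

print("naive: ", naive)
print("stable:", stable)
print("pred:  ", y_pred)
```

Both versions agree wherever the naive one is defined; the log-space form additionally returns a usable value for the second sample, where the naive form yields `nan` and the threshold comparison would silently predict 0.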


In [19]:
y_pred_test_gmm80, accuracy_gmm80, f1_gmm80 = gmm(X, y, df_test, threshold = 0.8)

Accuracy of test: 0.7470
F1 Macro (Test): 0.5422

Classification Report for Validation Data:
              precision    recall  f1-score   support

       False       0.99      0.74      0.85    180636
        True       0.14      0.84      0.24      8802

    accuracy                           0.75    189438
   macro avg       0.56      0.79      0.54    189438
weighted avg       0.95      0.75      0.82    189438

Confusion Matrix for Validation Data:
[[134101  46535]
 [  1395   7407]]
Average fold duration: 5.60s
Clustering completed.


In [20]:
y_pred_test_gmm85, accuracy_gmm85, f1_gmm85 = gmm(X, y, df_test, threshold = 0.85)

Accuracy of test: 0.8659
F1 Macro (Test): 0.5842

Classification Report for Validation Data:
              precision    recall  f1-score   support

       False       0.97      0.89      0.93    180636
        True       0.16      0.46      0.24      8802

    accuracy                           0.87    189438
   macro avg       0.57      0.67      0.58    189438
weighted avg       0.93      0.87      0.89    189438

Confusion Matrix for Validation Data:
[[159982  20654]
 [  4747   4055]]
Average fold duration: 6.62s
Clustering completed.


In [21]:
y_pred_test_gmm90, accuracy_gmm90, f1_gmm90 = gmm(X, y, df_test, threshold = 0.90)

Accuracy of test: 0.9012
F1 Macro (Test): 0.5827

Classification Report for Validation Data:
              precision    recall  f1-score   support

       False       0.96      0.93      0.95    180636
        True       0.17      0.30      0.22      8802

    accuracy                           0.90    189438
   macro avg       0.57      0.61      0.58    189438
weighted avg       0.93      0.90      0.91    189438

Confusion Matrix for Validation Data:
[[168105  12531]
 [  6189   2613]]
Average fold duration: 5.34s
Clustering completed.


In [22]:
def kmeans(X, y, X_test, threshold=0.7):
    skf = StratifiedKFold(n_splits=2)
    timer = []

    for train_index, val_index in skf.split(X, y):
        # Split data
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        begin = time.perf_counter()

        X_train_selected, X_val_selected,X_test_selected = preprocessing(X_train, X_val, X_test, y_train)

        # KMeans clustering
        kmeans_class_0 = KMeans(n_clusters=1, random_state=42)
        kmeans_class_1 = KMeans(n_clusters=1, random_state=42)

        kmeans_class_0.fit(X_train_selected[y_train == 0])
        kmeans_class_1.fit(X_train_selected[y_train == 1])

        # Validation distances
        val_distances_0 = kmeans_class_0.transform(X_val_selected).flatten()
        val_distances_1 = kmeans_class_1.transform(X_val_selected).flatten()

        # Convert distances to "probabilities" (inverse distance)
        val_inverse_dist_0 = 1 / (val_distances_0 + 1e-8)
        val_inverse_dist_1 = 1 / (val_distances_1 + 1e-8)
        val_total_inverse_dist = val_inverse_dist_0 + val_inverse_dist_1

        val_probs_1 = val_inverse_dist_1 / val_total_inverse_dist

        # Apply threshold
        y_pred_val = np.where(val_probs_1 >= threshold, 1, 0)

        end = time.perf_counter()
        timer.append(end - begin)

    print(f"Fold duration: {end - begin:.2f}s")
    test_accuracy, test_f1_macro=evaluate_model_predictions(y_pred_val, y_val)

    print(f"Average fold duration: {np.mean(timer):.2f}s")
    print("Clustering completed.")

    # Test Predictions
    test_distances_0 = kmeans_class_0.transform(X_test_selected).flatten()
    test_distances_1 = kmeans_class_1.transform(X_test_selected).flatten()

    # Convert distances to probabilities
    test_inverse_dist_0 = 1 / (test_distances_0 + 1e-8)
    test_inverse_dist_1 = 1 / (test_distances_1 + 1e-8)
    test_total_inverse_dist = test_inverse_dist_0 + test_inverse_dist_1

    test_probs_1 = test_inverse_dist_1 / test_total_inverse_dist

    # Apply threshold
    y_pred_test = np.where(test_probs_1 >= threshold, 1, 0)

    return y_pred_test, test_accuracy, test_f1_macro
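
The inverse-distance score used above simplifies algebraically: (1/d1) / (1/d0 + 1/d1) = d0 / (d0 + d1). The threshold rule is therefore a condition on the ratio of the two distances. The sketch below checks this on made-up distances; it is an illustration, not part of the pipeline.

```python
import numpy as np

# Hypothetical distances to the class-0 and class-1 centroids
d0 = np.array([1.0, 2.0, 3.0])
d1 = np.array([3.0, 2.0, 1.0])

# Score as computed in the function above (without the 1e-8 safeguard)
inv0, inv1 = 1 / d0, 1 / d1
p1 = inv1 / (inv0 + inv1)

# Simplified form of the same score
p1_simplified = d0 / (d0 + d1)

# Predict class 1 when the score reaches the threshold
threshold = 0.7
y_pred = np.where(p1 >= threshold, 1, 0)

# Equivalent rule on the distances: d1 <= d0 * (1 - threshold) / threshold
y_pred_ratio = np.where(d1 <= d0 * (1 - threshold) / threshold, 1, 0)

print("scores:", p1)
print("predictions:", y_pred)
```

With a threshold of 0.7, a sample is labelled True only when it is at least about 2.3 times closer to the True centroid than to the False centroid, so the score is a distance ratio rather than a calibrated probability.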

In [23]:
y_pred_test_kmeans70, accuracy_kmeans70, f1_kmeans70 = kmeans(X, y, df_test, threshold = 0.7)

Fold duration: 4.33s
Accuracy of test: 0.8476
F1 Macro (Test): 0.5902

Classification Report for Validation Data:
              precision    recall  f1-score   support

       False       0.98      0.86      0.92    270954
        True       0.17      0.59      0.27     13203

    accuracy                           0.85    284157
   macro avg       0.57      0.73      0.59    284157
weighted avg       0.94      0.85      0.88    284157

Confusion Matrix for Validation Data:
[[233040  37914]
 [  5381   7822]]
Average fold duration: 4.37s
Clustering completed.


In [24]:
y_pred_test_kmeans75, accuracy_kmeans75, f1_kmeans75 = kmeans(X, y, df_test, threshold = 0.75)
# Increasing the threshold beyond this makes the model stop predicting the True class

Fold duration: 4.50s
Accuracy of test: 0.8479
F1 Macro (Test): 0.5904

Classification Report for Validation Data:
              precision    recall  f1-score   support

       False       0.98      0.86      0.92    270954
        True       0.17      0.59      0.27     13203

    accuracy                           0.85    284157
   macro avg       0.57      0.73      0.59    284157
weighted avg       0.94      0.85      0.88    284157

Confusion Matrix for Validation Data:
[[233116  37838]
 [  5386   7817]]
Average fold duration: 4.37s
Clustering completed.


## Clustering model assessment

In [25]:
model_results = {
    'Model': [
        'GMM_Model', 'GMM_Model_80', 'GMM_Model_85', 'GMM_Model_90',
        'KMeans_Model', 'KMeans_Model_70', 'KMeans_Model_75'
    ],
    'Test Accuracy': [
        accuracy_gmm, accuracy_gmm80, accuracy_gmm85, accuracy_gmm90,
        accuracy_kmeans, accuracy_kmeans70, accuracy_kmeans75
    ],
    'Test F1 Score': [
        f1_gmm, f1_gmm80, f1_gmm85, f1_gmm90,
        f1_kmeans, f1_kmeans70, f1_kmeans75
    ]
}

# Create DataFrame with model results
results_models = pd.DataFrame(model_results)

# Set the 'Model' column as the index
results_models.set_index('Model', inplace=True)

# Display the DataFrame with the results
results_models


                 Test Accuracy  Test F1 Score
Model                                        
GMM_Model             0.729004       0.536370
GMM_Model_80          0.746988       0.542244
GMM_Model_85          0.865914       0.584231
GMM_Model_90          0.901181       0.582749
KMeans_Model          0.775299       0.555716
KMeans_Model_70       0.847637       0.590215
KMeans_Model_75       0.847887       0.590389

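The test-set predictions of the GMM with a 0.90 threshold are added to the test data, so that 'Agreement Reached' can be used as a feature in the prediction model.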

In [26]:
df_test = pickle.load(open("df_test_cleaned.pkl", 'rb'))

# Add the prediction column to the DataFrame
df_test['Agreement Reached'] = y_pred_test_gmm90

# Save the updated DataFrame back to the same pickle file
with open('df_test_cleaned.pkl', 'wb') as file:
    pickle.dump(df_test, file)  # the file is closed automatically on exit

print("Predictions have been added to the DataFrame and saved to the pickle file.")

Predictions have been added to the DataFrame and saved to the pickle file.
