# **Assignment No. 2**
# *Supervised Learning: Predicting 1 Year Survival of Patients with Hepatocellular Carcinoma*

The main goal of this assignment is to use supervised learning to predict whether a patient will survive 1 year after being diagnosed with Hepatocellular Carcinoma (HCC).

# **Data Summary**

This hepatocellular carcinoma dataset consists of patient data from 165 former patients of the Hospital and University Centre of Coimbra (Portugal). The dataset contains 49 features selected according to the EASL-EORTC (European Association for the Study of the Liver - European Organization for Research and Treatment of Cancer) Clinical Practice Guidelines. The target variable, "Class", is the survival of each patient at 1 year and is represented as 'Dies' or 'Lives'.

# **First Step -> Install dependencies**

In [None]:
!pip install pandas numpy matplotlib seaborn

In [None]:
!pip install -U ydata-profiling

In [None]:
!pip install fancyimpute

In [None]:
!pip install scikit-learn imbalanced-learn

# **Second Step -> Import Libraries**

In [None]:

import math
import random
import numpy as np
import pandas as pd
import seaborn as sns
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sb
from fancyimpute import KNN
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score, confusion_matrix, roc_curve, auc, classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

%matplotlib inline

# **Data Exploration**

Let's analyze our data!

In [None]:
hcc_data = pd.read_csv("hcc_dataset.csv")
hcc_data.head()

The data contains categorical values, so we replace them with numerical codes. We do this encoding by hand; a library encoder could handle this step as well, but we preferred this approach.

In [None]:
# Read the CSV file into a DataFrame
df =pd.read_csv("hcc_dataset.csv")

# Replace "Lives" with "1" in all columns
df.replace("Lives", "1", inplace=True)

# Replace "Dies" with "0" in all columns
df.replace("Dies", "0", inplace=True)

# Replace "Yes" with "1" in all columns
df.replace("Yes", "1", inplace=True)

# Replace "No" with "0" in all columns
df.replace("No", "0", inplace=True)

# Replace "True" with "1" in all columns
df.replace("True", "1", inplace=True)

# Replace "False" with "0" in all columns
df.replace("False", "0", inplace=True)

# Replace "Male" with "1" in all columns
df.replace("Male", "1", inplace=True)

# Replace "Female" with "0" in all columns
df.replace("Female", "0", inplace=True)

# Replace "Active" with "0" in all columns
df.replace("Active", "0", inplace=True)

# Replace "Restricted" with "1" in all columns
df.replace("Restricted", "1", inplace=True)

# Replace "Ambulatory" with "2" in all columns
df.replace("Ambulatory", "2", inplace=True)

# Replace "Selfcare" with "3" in all columns
df.replace("Selfcare", "3", inplace=True)

# Replace "Disabled" with "4" in all columns
df.replace("Disabled", "4", inplace=True)

# Replace "None" with "1" in all columns
df.replace("None", "1", inplace=True)

# Replace "Grade I/II" with "2" in all columns
df.replace("Grade I/II", "2", inplace=True)

# Replace "Grade III/IV" with "3" in all columns
df.replace("Grade III/IV", "3", inplace=True)

# Replace "Mild" with "2" in all columns
df.replace("Mild", "2", inplace=True)

# Replace "Moderate/Severe" with "3" in all columns
df.replace("Moderate/Severe", "3", inplace=True)

# Write the modified DataFrame back to a CSV file
df.to_csv("modified_file.csv", index=False)
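
The one-call-per-value replacements above can also be written as per-column dictionaries, which keeps each encoding next to the column it applies to. The sketch below is illustrative only: the toy DataFrame and the `encodings` dictionary are assumptions, with column names and codes taken from the cell above.

```python
import pandas as pd

# Toy frame standing in for a few columns of hcc_dataset.csv (made-up rows)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Symptoms": ["Yes", "No", "?"],
    "PS": ["Active", "Disabled", "Restricted"],
    "Class": ["Lives", "Dies", "Lives"],
})

# Per-column encodings, using the same codes as the replacements above
encodings = {
    "Gender": {"Male": 1, "Female": 0},
    "Symptoms": {"Yes": 1, "No": 0},
    "PS": {"Active": 0, "Restricted": 1, "Ambulatory": 2, "Selfcare": 3, "Disabled": 4},
    "Class": {"Lives": 1, "Dies": 0},
}

encoded = toy.copy()
for column, mapping in encodings.items():
    # Values without an entry in the mapping (such as "?") are left untouched
    encoded[column] = encoded[column].map(lambda v, m=mapping: m.get(v, v))

print(encoded)
```

Scoping each mapping to one column avoids accidentally rewriting a value such as "None" or "Mild" in a column where it means something else.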

In [None]:
#Preview data
hcc_data = pd.read_csv("modified_file.csv")
hcc_data.head()

Now that all our categorical values have been replaced, we create dummy variables for gender so that the algorithm does not mistakenly interpret these values as having a specific order or magnitude.

In [None]:
# Create dummy variables for gender
dummy_gender = pd.get_dummies(hcc_data['Gender'], prefix='male', dummy_na=False)

# Map 'false' to 0 and 'true' to 1
dummy_gender = dummy_gender.astype(int)

# Concatenate the dummy variables with the original DataFrame
hcc_data = pd.concat([hcc_data, dummy_gender], axis=1)

# Drop the original 'Gender' column
hcc_data.drop(['Gender'], axis=1, inplace=True)
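
Because `Gender` is already encoded as 0/1, the two dummy columns produced above always sum to 1, so one of them carries no extra information. A minimal sketch on a toy column (the values are made up) shows the difference between keeping both columns and keeping only one with `drop_first=True`; whether dropping one is preferable depends on the model being used.

```python
import pandas as pd

# Toy 0/1 gender column (illustrative values, not taken from the dataset)
toy = pd.DataFrame({"Gender": [1, 0, 1, 1]})

# Same call as above: one column per observed value
both = pd.get_dummies(toy["Gender"], prefix="male").astype(int)

# drop_first=True drops the first level and keeps a single indicator column
single = pd.get_dummies(toy["Gender"], prefix="male", drop_first=True).astype(int)

print(both.columns.tolist())
print(single.columns.tolist())
```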


In [None]:
#Examine dataset shape
hcc_data.shape

Our dataset now has 51 columns, which means the two new dummy columns were created and the original "Gender" column was removed.

Now it's time to see how many missing values we have.

In [None]:
hcc_data_modified = hcc_data.replace('?', np.nan)

In [None]:
missing_data_summary = hcc_data_modified.isnull().sum()
print(missing_data_summary)
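
Raw counts are easier to judge when expressed as a share of the rows. The following sketch computes the percentage of missing entries per column on a small made-up frame; the same expression could be applied to `hcc_data_modified`.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a few columns with gaps
toy = pd.DataFrame({
    "AFP": [95.0, np.nan, 5.8, np.nan],
    "Ferritin": [np.nan, np.nan, np.nan, 16.0],
    "Age": [67, 62, 78, 77],
})

# isnull().mean() is the fraction of missing entries in each column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```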

We have a lot of missing data. We will address that problem later.

Next, we want to see the percentage of patients who live and who die, so we create a pie chart.

In [None]:
# Map 0 to 'Dies' and 1 to 'Lives'
hcc_data['Class'] = hcc_data['Class'].map({0: 'Dies', 1: 'Lives'})

# Calculate percentage of "Lives" and "Dies"
class_counts = hcc_data['Class'].value_counts(normalize=True) * 100

# Plotting
plt.figure(figsize=(8, 6))
plt.pie(class_counts, labels=class_counts.index, autopct='%1.1f%%', startangle=140, colors=['skyblue', 'lightcoral'], textprops={'color': 'black'})
plt.title('Percentages of survival at 1 year', color='black')  # Set title color to black
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

61.8% of the patients in our dataset survive 1 year after being diagnosed with HCC. The remaining 38.2% is still more than 1/3 of the patients, which shows how dangerous the disease is.

Let's evaluate our data without the rows that have missing values.

In [None]:
# Drop every row that has at least one missing value
hcc_data_cleaned = hcc_data_modified.dropna().copy()

# Check the unique values in the 'Class' column
print(hcc_data_cleaned['Class'].unique())

# Make sure 'Class' is categorical
hcc_data_cleaned['Class'] = hcc_data_cleaned['Class'].astype('category')

# Plot pairwise relationships; the trailing semicolon suppresses the text output
sb.pairplot(hcc_data_cleaned, hue='Class');

As expected, very little data remains, because almost every feature has missing values.

# **Data Preprocessing**

Above, we had to drop the rows with missing values. Let's deal with those missing values now.

# Imputation 

For the formerly categorical features, we fill the missing values with the most common value in each column.

In [None]:
columns_to_impute = ['Symptoms','Alcohol','HBsAg','HBeAg','HBcAb','HCVAb','Cirrhosis','Endemic','Smoking','Diabetes','Obesity','Hemochro','AHT','CRI','HIV','NASH','Varices','Spleno','PHT','PVT','Metastasis','Hallmark','PS','Encephalopathy','Ascites']
df_filled_specific = hcc_data_modified.copy()
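for column in columns_to_impute:
    # Take the mode from the frame where '?' is already NaN, so '?' can never be chosen
    most_common_value = hcc_data_modified[column].mode()[0]
    df_filled_specific[column] = df_filled_specific[column].fillna(most_common_value)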
df_filled_specific.head()

# K-Nearest Neighbors

For the remaining missing values, we use KNN imputation.

In [None]:
# Run KNN imputation using the fancyimpute package (KNN is imported above)

# Use the 3 nearest rows that have a feature to fill in each row's missing value
# fit_transform returns a NumPy array, which we store as a pandas DataFrame
hcc_filled = pd.DataFrame(KNN(k=3).fit_transform(df_filled_specific))


In [None]:
# The column names were lost during imputation, so restore the headings and the index
hcc_filled.columns = df_filled_specific.columns
hcc_filled.index = df_filled_specific.index
hcc_filled.head()
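
How the neighbour-based fill behaves is easiest to see on a tiny example. The sketch below uses scikit-learn's `KNNImputer` rather than fancyimpute's `KNN`; this is a substitution made here so that the example depends only on scikit-learn, and the two implementations are not guaranteed to produce identical values. The toy frame is made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Row 2 is missing "b"; its nearest rows on "a" are rows 1 and 0
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 10.0],
    "b": [1.0, 2.0, np.nan, 10.0],
})

# Each missing entry is replaced by the mean of that feature over the
# n_neighbors closest rows that have a value for it
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled)
```

Passing `columns` and `index` when building the DataFrame restores the labels in the same step, since the imputer returns a bare NumPy array.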

Now that we don't have missing values anymore it's time to start the work!

## Pandas Profiling Report


Creating a Profile Report will help us see every correlation and detail about each feature.

In [None]:
profile = ProfileReport(hcc_filled, title='Pandas Profiling Report', explorative=True)
profile.to_file("hcc_filled.html")

# Correlation Matrix

Let's see the correlation among the dataset features.

In [None]:
# Basic correlogram demonstrates minimal correlation among the dataset features
corr = hcc_filled.corr()
corr.style.background_gradient(cmap='coolwarm')
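
With around 50 columns, the coloured matrix is hard to scan by eye. As a hedged sketch, the strongest pairwise correlations can be listed by keeping only the upper triangle of the matrix and sorting by absolute value. The toy columns below are synthetic and merely stand in for `hcc_filled`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Synthetic data: one strongly related pair and one unrelated column
toy = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=200),
    "unrelated": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) -> correlation and sort by absolute strength
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs.head())
```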

Let's create a summary of our features so that it is easier for readers to follow our work.

**Summary of the 49 features:**

| Feature | Encoding, range, or unit |
|---|---|
| Gender | 1=Male; 0=Female |
| Symptoms | 1=Yes; 0=No |
| Alcohol | 1=Yes; 0=No |
| Hepatitis B Surface Antigen | 1=Yes; 0=No |
| Hepatitis B e Antigen | 1=Yes; 0=No |
| Hepatitis B Core Antibody | 1=Yes; 0=No |
| Hepatitis C Virus Antibody | 1=Yes; 0=No |
| Cirrhosis | 1=Yes; 0=No |
| Endemic Countries | 1=Yes; 0=No |
| Smoking | 1=Yes; 0=No |
| Diabetes | 1=Yes; 0=No |
| Obesity | 1=Yes; 0=No |
| Hemochromatosis | 1=Yes; 0=No |
| Arterial Hypertension | 1=Yes; 0=No |
| Chronic Renal Insufficiency | 1=Yes; 0=No |
| Human Immunodeficiency Virus | 1=Yes; 0=No |
| Nonalcoholic Steatohepatitis | 1=Yes; 0=No |
| Esophageal Varices | 1=Yes; 0=No |
| Splenomegaly | 1=Yes; 0=No |
| Portal Hypertension | 1=Yes; 0=No |
| Portal Vein Thrombosis | 1=Yes; 0=No |
| Liver Metastasis | 1=Yes; 0=No |
| Radiological Hallmark | 1=Yes; 0=No |
| Age at diagnosis | 20-93 |
| Grams of Alcohol per day | Grams/day |
| Packs of cigarettes per year | Packs/year |
| Performance Status\* | [0,1,2,3,4,5] |
| Encephalopathy degree\* | [1,2,3] |
| Ascites degree\* | [1,2,3] |
| International Normalised Ratio\* | 0.84-4.82 |
| Alpha-Fetoprotein (AFP) | ng/mL |
| Haemoglobin | g/dL |
| Mean Corpuscular Volume (MCV) | fl |
| Leukocytes | G/L |
| Platelets | G/L |
| Albumin | mg/dL |
| Total Bilirubin | mg/dL |
| Alanine transaminase (ALT) | U/L |
| Aspartate transaminase (AST) | U/L |
| Gamma glutamyl transferase (GGT) | U/L |
| Alkaline phosphatase (ALP) | U/L |
| Total Proteins (TP) | g/dL |
| Creatinine | mg/dL |
| Number of Nodules | 0-5 |
| Major dimension of nodule | cm |
| Direct Bilirubin | mg/dL |
| Iron | mcg/dL |
| Oxygen Saturation | % |
| Ferritin | ng/mL |
| Class Attribute | 1=Lives; 0=Dies |


Now, we are going to see our full data and look for outliers.

In [None]:
# Check the unique values in the 'Class' column
print(hcc_filled['Class'].unique())

# Make sure 'Class' is categorical
hcc_filled['Class'] = hcc_filled['Class'].astype('category')

# Now, try plotting again
sb.pairplot(hcc_filled, hue='Class')

We can see that we have a good amount of outliers, so our next step is to remove them!

Let's start by using the Isolation Forest (iForest) algorithm because it is a popular method for anomaly detection, particularly for identifying outliers in a dataset.

In [None]:
# Initialize the Isolation Forest model
iso_forest = IsolationForest(contamination=0.1, random_state=42)

# Fit the model
iso_forest.fit(hcc_filled)

# Predict the anomalies
hcc_filled['anomaly'] = iso_forest.predict(hcc_filled)

# Filter the outliers
hcc_filled_cleaned = hcc_filled[hcc_filled['anomaly'] == 1].drop(columns='anomaly')

hcc_filled_cleaned.shape
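
To make the behaviour of `IsolationForest` concrete, the sketch below runs it on synthetic two-dimensional data with one point placed far from the rest. The data and the `contamination=0.05` value are assumptions chosen for illustration. Note that `contamination` sets the share of rows that will be flagged, so the value of 0.1 used above marks roughly 10% of the rows as anomalies whether or not they are genuinely unusual.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 100 synthetic points clustered near the origin plus one point placed far away
inliers = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X_toy = np.vstack([inliers, [[12.0, 12.0]]])

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(X_toy)  # 1 = inlier, -1 = flagged as anomaly

print("flagged rows:", np.where(labels == -1)[0])
```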

After reviewing the data again, we notice that some outliers remain, so this time we remove them manually. Manual removal needs a reason behind each exclusion, so we document the justifications as comments in the next code cell.

In [None]:
# 1. Excluding AFP values greater than 3000 ng/mL
hcc_filled_cleaned = hcc_filled_cleaned[hcc_filled_cleaned['AFP'] <= 3000]
# Justification: AFP levels exceeding 3000 ng/mL are exceptionally high and likely indicate advanced or metastatic HCC, which may not be representative of the typical population.

# 2. Excluding Packs_year values greater than 500
hcc_filled_cleaned = hcc_filled_cleaned[hcc_filled_cleaned['Packs_year'] <= 500]
# Justification: Packs_year values exceeding 500 are exceptionally high and suggest heavy smoking, which may not be representative of the typical population and can have a significant impact on health outcomes including HCC.

# 3. Excluding Grams_day values greater than 400
hcc_filled_cleaned = hcc_filled_cleaned[hcc_filled_cleaned['Grams_day'] <= 400]
# Justification: Grams_day values exceeding 400 are exceptionally high and suggest severe alcohol abuse, which is a significant risk factor for liver diseases including HCC. Excluding such extreme cases helps focus the analysis on typical patterns.

# 4. Excluding encephalopathy values greater than 3.0
hcc_filled_cleaned = hcc_filled_cleaned[hcc_filled_cleaned['Encephalopathy'] <= 3.0]
# Justification: encephalopathy values exceeding 3.0 indicate severe hepatic encephalopathy, which is a serious complication of advanced liver disease including HCC. Excluding such extreme cases helps focus the analysis on typical patterns.

# 5. Excluding ALT values greater than 400 IU/L
hcc_filled_cleaned = hcc_filled_cleaned[hcc_filled_cleaned['ALT'] <= 400]
# Justification: ALT levels exceeding 400 IU/L suggest severe liver damage, possibly from advanced liver disease including HCC. Excluding such extreme cases helps focus the analysis on typical patterns.

# 6. Excluding ferritin values greater than 2000 ng/mL
hcc_filled_cleaned = hcc_filled_cleaned[hcc_filled_cleaned['Ferritin'] <= 2000]
# Justification: ferritin levels exceeding 2000 ng/mL are exceptionally high and could indicate various health issues including liver disease. Excluding such extreme cases helps focus the analysis on typical patterns.

# 7. Excluding GGT values above 1300 IU/L
hcc_filled_cleaned = hcc_filled_cleaned[hcc_filled_cleaned['GGT'] <= 1300]
# Justification: GGT levels exceeding 1300 IU/L are exceptionally high and indicate severe liver dysfunction, likely due to advanced liver disease such as HCC.

# 8. Excluding AST values above 500 IU/L
hcc_filled_cleaned = hcc_filled_cleaned[hcc_filled_cleaned['AST'] <= 500]
# Justification: AST levels exceeding 500 IU/L are significantly elevated and suggest severe liver damage, likely due to advanced liver disease such as HCC.

# 9. Excluding Total_Bil values above 30 mg/dL
hcc_filled_cleaned = hcc_filled_cleaned[hcc_filled_cleaned['Total_Bil'] <= 30]
# Justification: Total bilirubin levels exceeding 30 mg/dL are markedly elevated and indicate impaired liver function, likely due to advanced liver disease such as HCC.

# 10. Excluding the patient with HIV
hcc_filled_cleaned = hcc_filled_cleaned[hcc_filled_cleaned['HIV'] != 1]
# Justification: While HIV status may not directly relate to liver function, considering the patient with HIV along with extreme liver function test values as outliers helps ensure that the analysis focuses on typical patterns within the dataset.

# 11. Excluding the patient with a Dir_Bil level above 20 mg/dL
hcc_filled_cleaned = hcc_filled_cleaned[hcc_filled_cleaned['Dir_Bil'] <= 20]
# Justification: Direct bilirubin (Dir_Bil) levels exceeding 20 mg/dL are considered elevated and may indicate liver dysfunction or bile duct obstruction. With a Dir_Bil level of 30 mg/dL, this patient's level is significantly higher than the normal range, suggesting a substantial deviation from typical values. Excluding this patient helps ensure that the analysis focuses on typical patterns within the dataset and avoids skewing the results due to extreme values.

# After excluding outliers, you may want to reset the index if needed
hcc_filled_cleaned.reset_index(drop=True, inplace=True)

hcc_filled_cleaned.shape
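
The same kind of rule-based filtering can be driven from a dictionary of upper bounds, which also makes it easy to report how many rows each rule removes. This is only a sketch: the toy rows are made up, and `upper_bounds` reuses three of the thresholds from the cell above.

```python
import pandas as pd

# Made-up rows standing in for hcc_filled_cleaned
toy = pd.DataFrame({
    "AFP": [95.0, 4500.0, 5.8, 2440.0],
    "ALT": [34.0, 58.0, 420.0, 16.0],
    "Ferritin": [100.0, 300.0, 150.0, 80.0],
})

# Upper bounds per column (a subset of the thresholds used above)
upper_bounds = {"AFP": 3000, "ALT": 400, "Ferritin": 2000}

mask = pd.Series(True, index=toy.index)
removed_per_rule = {}
for column, bound in upper_bounds.items():
    within = toy[column] <= bound
    # Number of rows this rule would drop if it were applied on its own
    removed_per_rule[column] = int((~within).sum())
    mask &= within

filtered = toy[mask].reset_index(drop=True)
print(removed_per_rule)
print(filtered.shape)
```

Keeping the per-rule counts next to the justifications makes it clear which thresholds actually change the dataset and which remove nothing.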


Now that we have removed the outliers, let's see what our data looks like.

In [None]:
# Check the unique values in the 'Class' column
print(hcc_filled_cleaned['Class'].unique())

# Make sure 'Class' is categorical
hcc_filled_cleaned['Class'] = hcc_filled_cleaned['Class'].astype('category')

# Now, try plotting again
sb.pairplot(hcc_filled_cleaned, hue='Class')

# **Data Modeling and Data Evaluation**

Now let's model our data and evaluate it using different classifiers, different balancing methods, and a parameter grid for each classifier.

In [None]:
# Load your data
X = hcc_filled_cleaned.drop(columns=['Class'])
y = hcc_filled_cleaned['Class']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define classifiers
classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Naive Bayes": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(algorithm='SAMME')
}

# Define balancing methods
balancing_methods = {
    'None': None,
    'Under-sampling': RandomUnderSampler(random_state=42),
    'SMOTE': SMOTE(random_state=42),
    'ADASYN': ADASYN(random_state=42)
}

# Grid parameters for each classifier
grid_params = {
    "Decision Tree": {'max_depth': [3, 5, 7, None]},
    "K-Nearest Neighbors": {'n_neighbors': [3, 5, 7]},
    "Random Forest": {'n_estimators': [50, 100, 200]},
    "Gradient Boosting": {'n_estimators': [50, 100, 200], 'learning_rate': [0.05, 0.1, 0.5]},
    "AdaBoost": {'n_estimators': [50, 100, 200], 'learning_rate': [0.05, 0.1, 0.5]}
}
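
With an imbalanced target and a small dataset, a plain random split can leave the test set with a different class ratio than the training set. As a hedged illustration, the sketch below adds `stratify=` to `train_test_split` on synthetic labels with roughly the 62/38 split reported earlier; the arrays are made up and the call above is otherwise unchanged.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 62/38 class split (illustrative only)
y_toy = np.array([1] * 62 + [0] * 38)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions close to equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("train share of class 1:", round(y_tr.mean(), 3))
print("test share of class 1:", round(y_te.mean(), 3))
```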

Let's also create helper functions to compute the precision and to plot the confusion matrix and the ROC curve.

In [None]:
def evaluate_classifier(clf, X_train, y_train, X_test, y_test):
    clf.fit(X_train, y_train)
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    
    train_precision = precision_score(y_train, y_train_pred, average='weighted')
    test_precision = precision_score(y_test, y_test_pred, average='weighted')
    
    return train_precision, test_precision, y_test_pred

def plot_metrics(classifier_name, y_test, y_test_pred):
    cm = confusion_matrix(y_test, y_test_pred)
    sns.heatmap(cm, annot=True, cmap="Blues", fmt="d")
    plt.title(f"{classifier_name} - Confusion Matrix")
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()
    
    fpr, tpr, _ = roc_curve(y_test, y_test_pred)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{classifier_name} (AUC = {roc_auc:.2f})')
    plt.title("ROC Curve")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend(loc="lower right")
    plt.plot([0, 1], [0, 1], linestyle='--', color='r')
    plt.show()
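
`plot_metrics` above passes hard class predictions to `roc_curve`, which yields a curve with at most three points. ROC curves are usually computed from scores or class probabilities so that every threshold contributes a point. The sketch below contrasts the two on synthetic data; the dataset and the choice of `LogisticRegression` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative only)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Hard labels describe a single operating point, so the curve has at most three vertices
fpr_hard, tpr_hard, _ = roc_curve(y_te, clf.predict(X_te))

# Probabilities for the positive class let every threshold contribute a point
scores = clf.predict_proba(X_te)[:, 1]
fpr_soft, tpr_soft, _ = roc_curve(y_te, scores)
roc_auc = auc(fpr_soft, tpr_soft)

print("points from hard labels:", len(fpr_hard))
print("points from probabilities:", len(fpr_soft))
print("AUC from probabilities:", round(roc_auc, 3))
```

All classifiers used in this notebook expose `predict_proba`, so the same change could be applied by passing the positive-class probabilities to the plotting helper.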

# Decision Tree

In [None]:
# Evaluating Decision Tree Classifier
results1 = []

for balance_name, sampler in balancing_methods.items():
    print(f"Evaluating Decision Tree with {balance_name} balancing method...\n")
    
    if sampler:
        X_resampled, y_resampled = sampler.fit_resample(X, y)
    else:
        X_resampled, y_resampled = X, y

    grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid=grid_params["Decision Tree"], scoring='accuracy', cv=5)
    grid_search.fit(X_resampled, y_resampled)
    best_classifier = grid_search.best_estimator_
    
    print(f"Best Parameters: {grid_search.best_params_}")
    
    # Training and evaluating on test set
    train_precision, test_precision, y_test_pred = evaluate_classifier(best_classifier, X_train, y_train, X_test, y_test)
    
    print(f"Train Precision: {train_precision:.4f}")
    print(f"Test Precision: {test_precision:.4f}")
    
    plot_metrics("Decision Tree", y_test, y_test_pred)
    
    # Performing cross-validation for precision
    precision_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='precision_weighted')
    mean_precision = precision_scores.mean()
    std_precision = precision_scores.std()
    
    # Performing cross-validation for accuracy
    accuracy_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='accuracy')
    mean_accuracy = accuracy_scores.mean()
    std_accuracy = accuracy_scores.std()
    
    results1.append((balance_name, "Decision Tree", mean_precision, std_precision, mean_accuracy, std_accuracy))
    
    print(f"Mean Precision (CV): {mean_precision:.4f}")
    print(f"Std Precision (CV): {std_precision:.4f}")
    print(f"Mean Accuracy (CV): {mean_accuracy:.4f}")
    print(f"Std Accuracy (CV): {std_accuracy:.4f}\n")

# Compiling results in a DataFrame
results1_df = pd.DataFrame(results1, columns=['Balancing', 'Classifier', 'Mean Precision', 'Std Precision', 'Mean Accuracy', 'Std Accuracy'])
print(results1_df)
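
In the loop above, the sampler is fitted on the full `X, y` before `GridSearchCV` and `cross_val_score` split the data, so resampled rows (or synthetic points derived from them) can land in both the training and the validation folds, which tends to make the cross-validated scores optimistic. A common remedy is to resample only the training portion of each fold. The sketch below illustrates the idea with plain random oversampling from `sklearn.utils.resample`, swapped in for SMOTE/ADASYN so that the example depends on scikit-learn alone; the synthetic data and the `oversample_minority` helper are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic imbalanced data (about 70% / 30%), purely illustrative
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, weights=[0.7, 0.3], random_state=0
)

def oversample_minority(X, y, random_state=0):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    if n_extra == 0:
        return X, y
    X_extra, y_extra = resample(
        X[y == minority], y[y == minority],
        replace=True, n_samples=n_extra, random_state=random_state,
    )
    return np.vstack([X, X_extra]), np.concatenate([y, y_extra])

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X_toy, y_toy):
    # Balance the training fold only; the validation fold keeps its original distribution
    X_bal, y_bal = oversample_minority(X_toy[train_idx], y_toy[train_idx])
    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_bal, y_bal)
    fold_scores.append(clf.score(X_toy[val_idx], y_toy[val_idx]))

print("per-fold accuracy:", np.round(fold_scores, 3))
```

imbalanced-learn also provides `imblearn.pipeline.Pipeline`, which accepts a sampler as a step so that `GridSearchCV` applies the resampling inside each training fold.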


For the Decision Tree classifier, ADASYN gave the highest mean accuracy (0.722857), with a mean precision of 0.715118. SMOTE reached a slightly higher mean precision (0.729394) but a lower mean accuracy (0.698319).

# K-Nearest Neighbors

In [None]:
# Evaluating K-Nearest Neighbors
results2 = []

for balance_name, sampler in balancing_methods.items():
    print(f"Evaluating K-Nearest Neighbors with {balance_name} balancing method...\n")
    
    if sampler:
        X_resampled, y_resampled = sampler.fit_resample(X, y)
    else:
        X_resampled, y_resampled = X, y

    grid_search = GridSearchCV(KNeighborsClassifier(), param_grid=grid_params["K-Nearest Neighbors"], scoring='accuracy', cv=5)
    grid_search.fit(X_resampled, y_resampled)
    best_classifier = grid_search.best_estimator_
    
    print(f"Best Parameters: {grid_search.best_params_}")
    
    # Training and evaluating on test set
    train_precision, test_precision, y_test_pred = evaluate_classifier(best_classifier, X_train, y_train, X_test, y_test)
    
    print(f"Train Precision: {train_precision:.4f}")
    print(f"Test Precision: {test_precision:.4f}")
    
    plot_metrics("K-Nearest Neighbors", y_test, y_test_pred)
    
    # Performing cross-validation for precision
    precision_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='precision_weighted')
    mean_precision = precision_scores.mean()
    std_precision = precision_scores.std()
    
    # Performing cross-validation for accuracy
    accuracy_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='accuracy')
    mean_accuracy = accuracy_scores.mean()
    std_accuracy = accuracy_scores.std()
    
    results2.append((balance_name, "K-Nearest Neighbors", mean_precision, std_precision, mean_accuracy, std_accuracy))
    
    print(f"Mean Precision (CV): {mean_precision:.4f}")
    print(f"Std Precision (CV): {std_precision:.4f}")
    print(f"Mean Accuracy (CV): {mean_accuracy:.4f}")
    print(f"Std Accuracy (CV): {std_accuracy:.4f}\n")

# Compiling results in a DataFrame
results2_df = pd.DataFrame(results2, columns=['Balancing', 'Classifier', 'Mean Precision', 'Std Precision', 'Mean Accuracy', 'Std Accuracy'])
print(results2_df)


For the K-Nearest Neighbors classifier, ADASYN gave the highest mean precision (0.679572), with a mean accuracy of 0.670420, although using no balancing gave the highest mean accuracy (0.711000).

# Random Forest

In [None]:
# Evaluating Random Forest
results3 = []

for balance_name, sampler in balancing_methods.items():
    print(f"Evaluating Random Forest with {balance_name} balancing method...\n")
    
    if sampler:
        X_resampled, y_resampled = sampler.fit_resample(X, y)
    else:
        X_resampled, y_resampled = X, y

    grid_search = GridSearchCV(RandomForestClassifier(), param_grid=grid_params["Random Forest"], scoring='accuracy', cv=5)
    grid_search.fit(X_resampled, y_resampled)
    best_classifier = grid_search.best_estimator_
    
    print(f"Best Parameters: {grid_search.best_params_}")
    
    # Training and evaluating on test set
    train_precision, test_precision, y_test_pred = evaluate_classifier(best_classifier, X_train, y_train, X_test, y_test)
    
    print(f"Train Precision: {train_precision:.4f}")
    print(f"Test Precision: {test_precision:.4f}")
    
    plot_metrics("Random Forest", y_test, y_test_pred)
    
    # Performing cross-validation for precision
    precision_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='precision_weighted')
    mean_precision = precision_scores.mean()
    std_precision = precision_scores.std()
    
    # Performing cross-validation for accuracy
    accuracy_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='accuracy')
    mean_accuracy = accuracy_scores.mean()
    std_accuracy = accuracy_scores.std()
    
    results3.append((balance_name, "Random Forest", mean_precision, std_precision, mean_accuracy, std_accuracy))
    
    print(f"Mean Precision (CV): {mean_precision:.4f}")
    print(f"Std Precision (CV): {std_precision:.4f}")
    print(f"Mean Accuracy (CV): {mean_accuracy:.4f}")
    print(f"Std Accuracy (CV): {std_accuracy:.4f}\n")

# Compiling results in a DataFrame
results3_df = pd.DataFrame(results3, columns=['Balancing', 'Classifier', 'Mean Precision', 'Std Precision', 'Mean Accuracy', 'Std Accuracy'])
print(results3_df)

For the Random Forest classifier, SMOTE gave the highest mean precision (0.884804), with a mean accuracy of 0.831933, while ADASYN gave the highest mean accuracy (0.850084).

# Gradient Boosting

In [None]:
# Evaluating Gradient Boosting
results4 = []

for balance_name, sampler in balancing_methods.items():
    print(f"Evaluating Gradient Boosting with {balance_name} balancing method...\n")
    
    if sampler:
        X_resampled, y_resampled = sampler.fit_resample(X, y)
    else:
        X_resampled, y_resampled = X, y

    grid_search = GridSearchCV(GradientBoostingClassifier(), param_grid=grid_params["Gradient Boosting"], scoring='accuracy', cv=5)
    grid_search.fit(X_resampled, y_resampled)
    best_classifier = grid_search.best_estimator_
    
    print(f"Best Parameters: {grid_search.best_params_}")
    
    # Training and evaluating on test set
    train_precision, test_precision, y_test_pred = evaluate_classifier(best_classifier, X_train, y_train, X_test, y_test)
    
    print(f"Train Precision: {train_precision:.4f}")
    print(f"Test Precision: {test_precision:.4f}")
    
    plot_metrics("Gradient Boosting", y_test, y_test_pred)
    
    # Performing cross-validation for precision
    precision_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='precision_weighted')
    mean_precision = precision_scores.mean()
    std_precision = precision_scores.std()
    
    # Performing cross-validation for accuracy
    accuracy_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='accuracy')
    mean_accuracy = accuracy_scores.mean()
    std_accuracy = accuracy_scores.std()
    
    results4.append((balance_name, "Gradient Boosting", mean_precision, std_precision, mean_accuracy, std_accuracy))
    
    print(f"Mean Precision (CV): {mean_precision:.4f}")
    print(f"Std Precision (CV): {std_precision:.4f}")
    print(f"Mean Accuracy (CV): {mean_accuracy:.4f}")
    print(f"Std Accuracy (CV): {std_accuracy:.4f}\n")

# Compiling results in a DataFrame
results4_df = pd.DataFrame(results4, columns=['Balancing', 'Classifier', 'Mean Precision', 'Std Precision', 'Mean Accuracy', 'Std Accuracy'])
print(results4_df)

We can see that SMOTE had better results with a mean precision of 0.855822 and a mean accuracy of 0.820168 compared to other balancing methods for the Gradient Boosting classifier.

# Naive Bayes

In [None]:
# Evaluating Naive Bayes
results5 = []

for balance_name, sampler in balancing_methods.items():
    print(f"Evaluating Naive Bayes with {balance_name} balancing method...\n")
    
    if sampler:
        X_resampled, y_resampled = sampler.fit_resample(X, y)
    else:
        X_resampled, y_resampled = X, y

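    # GaussianNB has no entry in grid_params, so there is nothing to tune;
    # use a fresh classifier rather than the grid search left over from the previous cell
    best_classifier = GaussianNB()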
    
    
    # Training and evaluating on test set
    train_precision, test_precision, y_test_pred = evaluate_classifier(best_classifier, X_train, y_train, X_test, y_test)
    
    print(f"Train Precision: {train_precision:.4f}")
    print(f"Test Precision: {test_precision:.4f}")
    
    plot_metrics("Naive Bayes", y_test, y_test_pred)
    
    # Performing cross-validation for precision
    precision_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='precision_weighted')
    mean_precision = precision_scores.mean()
    std_precision = precision_scores.std()
    
    # Performing cross-validation for accuracy
    accuracy_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='accuracy')
    mean_accuracy = accuracy_scores.mean()
    std_accuracy = accuracy_scores.std()
    
    results5.append((balance_name, "Naive Bayes", mean_precision, std_precision, mean_accuracy, std_accuracy))
    
    print(f"Mean Precision (CV): {mean_precision:.4f}")
    print(f"Std Precision (CV): {std_precision:.4f}")
    print(f"Mean Accuracy (CV): {mean_accuracy:.4f}")
    print(f"Std Accuracy (CV): {std_accuracy:.4f}\n")

# Compiling results in a DataFrame
results5_df = pd.DataFrame(results5, columns=['Balancing', 'Classifier', 'Mean Precision', 'Std Precision', 'Mean Accuracy', 'Std Accuracy'])
print(results5_df)

We can see that SMOTE had better results with a mean precision of 0.858534 and a mean accuracy of 0.831765 compared to other balancing methods for the Naive Bayes classifier.

# AdaBoost

In [None]:
# Evaluating AdaBoost
results6 = []

for balance_name, sampler in balancing_methods.items():
    print(f"Evaluating AdaBoost with {balance_name} balancing method...\n")
    
    if sampler:
        X_resampled, y_resampled = sampler.fit_resample(X, y)
    else:
        X_resampled, y_resampled = X, y

    grid_search = GridSearchCV(AdaBoostClassifier(algorithm='SAMME'), param_grid=grid_params["AdaBoost"], scoring='accuracy', cv=5)
    grid_search.fit(X_resampled, y_resampled)
    best_classifier = grid_search.best_estimator_
    
    print(f"Best Parameters: {grid_search.best_params_}")
    
    # Training and evaluating on test set
    train_precision, test_precision, y_test_pred = evaluate_classifier(best_classifier, X_train, y_train, X_test, y_test)
    
    print(f"Train Precision: {train_precision:.4f}")
    print(f"Test Precision: {test_precision:.4f}")
    
    plot_metrics("AdaBoost", y_test, y_test_pred)
    
    # Performing cross-validation for precision
    precision_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='precision_weighted')
    mean_precision = precision_scores.mean()
    std_precision = precision_scores.std()
    
    # Performing cross-validation for accuracy
    accuracy_scores = cross_val_score(best_classifier, X_resampled, y_resampled, cv=5, scoring='accuracy')
    mean_accuracy = accuracy_scores.mean()
    std_accuracy = accuracy_scores.std()
    
    results6.append((balance_name, "AdaBoost", mean_precision, std_precision, mean_accuracy, std_accuracy))
    
    print(f"Mean Precision (CV): {mean_precision:.4f}")
    print(f"Std Precision (CV): {std_precision:.4f}")
    print(f"Mean Accuracy (CV): {mean_accuracy:.4f}")
    print(f"Std Accuracy (CV): {std_accuracy:.4f}\n")

# Compiling results in a DataFrame
results6_df = pd.DataFrame(results6, columns=['Balancing', 'Classifier', 'Mean Precision', 'Std Precision', 'Mean Accuracy', 'Std Accuracy'])
print(results6_df)

We can see that SMOTE had better results with a mean precision of 0.871511 and a mean accuracy of 0.820504 compared to other balancing methods for the AdaBoost classifier.

# Evaluation

In [None]:
# Concatenate all result DataFrames, discarding the per-classifier indices
all_results_df = pd.concat(
    [results1_df, results2_df, results3_df, results4_df, results5_df, results6_df],
    ignore_index=True,
)

# Number the rows from 1 instead of 0
all_results_df.index = all_results_df.index + 1

# Print the combined DataFrame
print(all_results_df)

In [None]:
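# Scores hard-coded from the runs above (this overwrites the concatenated
# `all_results_df`), so the chart can be redrawn without re-training the models.
all_results_df = pd.DataFrame({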
    'Balancing': [
        'None', 'Under-sampling', 'SMOTE', 'ADASYN',
        'None', 'Under-sampling', 'SMOTE', 'ADASYN',
        'None', 'Under-sampling', 'SMOTE', 'ADASYN',
        'None', 'Under-sampling', 'SMOTE', 'ADASYN',
        'None', 'Under-sampling', 'SMOTE', 'ADASYN',
        'None', 'Under-sampling', 'SMOTE', 'ADASYN'
    ],
    'Classifier': [
        'Decision Tree', 'Decision Tree', 'Decision Tree', 'Decision Tree',
        'K-Nearest Neighbors', 'K-Nearest Neighbors', 'K-Nearest Neighbors', 'K-Nearest Neighbors',
        'Random Forest', 'Random Forest', 'Random Forest', 'Random Forest',
        'Gradient Boosting', 'Gradient Boosting', 'Gradient Boosting', 'Gradient Boosting',
        'Naive Bayes', 'Naive Bayes', 'Naive Bayes', 'Naive Bayes',
        'AdaBoost', 'AdaBoost', 'AdaBoost', 'AdaBoost'
    ],
    'Mean Precision': [
        0.585129, 0.595180, 0.729394, 0.715118,
        0.644376, 0.590119, 0.675738, 0.679572,
        0.583558, 0.694524, 0.884804, 0.868511,
        0.687385, 0.632222, 0.855822, 0.810854,
        0.682726, 0.608214, 0.858534, 0.814208,
        0.673444, 0.733333, 0.871511, 0.840535
    ],
    'Mean Accuracy': [
        0.619000, 0.557143, 0.698319, 0.722857,
        0.711000, 0.585714, 0.663361, 0.670420,
        0.702333, 0.700000, 0.831933, 0.850084,
        0.693667, 0.600000, 0.820168, 0.792437,
        0.709667, 0.628571, 0.831765, 0.792437,
        0.710667, 0.728571, 0.820504, 0.798319
    ]
})

# Melt the dataframe to have a long format suitable for seaborn
melted_df = pd.melt(all_results_df, id_vars=['Balancing', 'Classifier'], value_vars=['Mean Accuracy'])

# Create a grouped bar chart
plt.figure(figsize=(14, 8))
# `errorbar=None` replaces the `ci=None` argument, deprecated since seaborn 0.12
sns.barplot(x='Classifier', y='value', hue='Balancing', data=melted_df, errorbar=None)
plt.xlabel('Classifier')
plt.ylabel('Mean Accuracy')
plt.title('Comparison of Mean Accuracy for Different Classifiers and Balancing Methods')
plt.legend(title='Balancing Method')
plt.xticks(rotation=45)
plt.show()


Overall, after evaluating all six classifiers with each balancing method, Random Forest combined with oversampling emerged as the top-performing model. With SMOTE it reached the highest mean precision (0.884804), and with ADASYN the highest mean accuracy (0.850084), of all classifier and balancing combinations tested.
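The ranking can be checked directly against the results table. The following is a minimal sketch rather than part of the original pipeline: the names `precision`, `accuracy` and `ranking_df` are introduced here for illustration, and the scores are copied from the table used for the bar chart.

```python
import pandas as pd

balancing = ["None", "Under-sampling", "SMOTE", "ADASYN"]

# Mean cross-validation scores per classifier, listed in the order of `balancing`
# (values copied from the results table; the variable names are illustrative).
precision = {
    "Decision Tree": [0.585129, 0.595180, 0.729394, 0.715118],
    "K-Nearest Neighbors": [0.644376, 0.590119, 0.675738, 0.679572],
    "Random Forest": [0.583558, 0.694524, 0.884804, 0.868511],
    "Gradient Boosting": [0.687385, 0.632222, 0.855822, 0.810854],
    "Naive Bayes": [0.682726, 0.608214, 0.858534, 0.814208],
    "AdaBoost": [0.673444, 0.733333, 0.871511, 0.840535],
}
accuracy = {
    "Decision Tree": [0.619000, 0.557143, 0.698319, 0.722857],
    "K-Nearest Neighbors": [0.711000, 0.585714, 0.663361, 0.670420],
    "Random Forest": [0.702333, 0.700000, 0.831933, 0.850084],
    "Gradient Boosting": [0.693667, 0.600000, 0.820168, 0.792437],
    "Naive Bayes": [0.709667, 0.628571, 0.831765, 0.792437],
    "AdaBoost": [0.710667, 0.728571, 0.820504, 0.798319],
}

# One row per (classifier, balancing method) combination
rows = [
    {
        "Classifier": clf,
        "Balancing": bal,
        "Mean Precision": precision[clf][i],
        "Mean Accuracy": accuracy[clf][i],
    }
    for clf in precision
    for i, bal in enumerate(balancing)
]
ranking_df = pd.DataFrame(rows)

# Row with the highest score for each metric
best_precision = ranking_df.loc[ranking_df["Mean Precision"].idxmax()]
best_accuracy = ranking_df.loc[ranking_df["Mean Accuracy"].idxmax()]

print("Best precision:", best_precision["Classifier"], "+", best_precision["Balancing"])
print("Best accuracy: ", best_accuracy["Classifier"], "+", best_accuracy["Balancing"])
```

Selecting the best row per metric, instead of a single overall winner, makes it visible when precision and accuracy favour different balancing methods.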

# **Conclusion**

In our project, we developed six distinct Machine Learning models to determine which one was the most suitable for our problem and dataset. Random Forest achieved the best scores overall: the highest mean accuracy with ADASYN (0.850084) and the highest mean precision with SMOTE (0.884804). Most models also reached better mean accuracy and mean precision when combined with the oversampling methods SMOTE and ADASYN. The exception is K-Nearest Neighbors, whose mean accuracy was highest without any balancing.
We are satisfied with the models and the outcomes we obtained. This work helped us gain a better understanding of the workflow of a data science and machine learning project, as well as improve our knowledge of various Python data science libraries.

Looking at each classifier in turn, the Decision Tree improves in both mean precision and mean accuracy when paired with SMOTE or ADASYN, which suggests that synthetic oversampling helps mitigate the class imbalance. K-Nearest Neighbors (KNN) gains slightly in mean precision with oversampling, but its mean accuracy is highest without any balancing (0.711000).

Random Forest does not lead in every setting. Without balancing, its mean precision (0.583558) is the lowest of the six models, and with under-sampling AdaBoost scores higher on both metrics. Combined with SMOTE or ADASYN, however, it outperforms all other classifiers, which is consistent with the robustness to overfitting usually attributed to its ensemble approach. Gradient Boosting, Naive Bayes, and AdaBoost also perform strongly on oversampled data, each reaching its highest precision and accuracy with SMOTE. Under-sampling gives mixed results: it lowers both scores for KNN, Gradient Boosting, and Naive Bayes relative to no balancing, while AdaBoost improves slightly.

For future work, we recommend further hyperparameter tuning, additional ensemble methods, and feature engineering to improve model performance. Overall, the study shows that addressing class imbalance with suitable preprocessing has a large effect on the scores of these models, and that the choice of balancing method matters as much as the choice of classifier.
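The per-classifier effect of oversampling can be quantified as the change in each score between the unbalanced run and the SMOTE run. This is a minimal sketch using only the standard library. The dictionaries `no_balancing`, `smote` and `gains` are illustrative names, and the scores are copied from the results table.

```python
# (mean precision, mean accuracy) per classifier, copied from the results table.
no_balancing = {
    "Decision Tree": (0.585129, 0.619000),
    "K-Nearest Neighbors": (0.644376, 0.711000),
    "Random Forest": (0.583558, 0.702333),
    "Gradient Boosting": (0.687385, 0.693667),
    "Naive Bayes": (0.682726, 0.709667),
    "AdaBoost": (0.673444, 0.710667),
}
smote = {
    "Decision Tree": (0.729394, 0.698319),
    "K-Nearest Neighbors": (0.675738, 0.663361),
    "Random Forest": (0.884804, 0.831933),
    "Gradient Boosting": (0.855822, 0.820168),
    "Naive Bayes": (0.858534, 0.831765),
    "AdaBoost": (0.871511, 0.820504),
}

# Change in (precision, accuracy) when moving from no balancing to SMOTE
gains = {
    clf: (
        smote[clf][0] - no_balancing[clf][0],
        smote[clf][1] - no_balancing[clf][1],
    )
    for clf in no_balancing
}

for clf, (d_precision, d_accuracy) in gains.items():
    print(f"{clf:<20} precision {d_precision:+.4f}   accuracy {d_accuracy:+.4f}")
```

A negative value means the classifier scored lower with SMOTE than without balancing. These are differences between cross-validation means and do not account for the reported standard deviations.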