### Hard voting ensemble classifier

A hard voting ensemble classifier is an ensemble learning method that combines several base classifiers and predicts the class that receives the most votes. Each base classifier is trained on the same training set and predicts a class label; the ensemble outputs the label chosen by the majority.
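
To make the voting rule concrete, here is a minimal sketch in plain Python. The function name `hard_vote` and the toy predictions are made up for illustration and are not part of this notebook's pipeline. Ties are broken in favour of the smallest label, which mirrors the ascending-order tie rule described in the scikit-learn user guide.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label per sample.

    predictions: one list of predicted labels per classifier,
    all of the same length (one entry per sample).
    """
    majority = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Tie-break (assumption for this sketch): pick the smallest label.
        majority.append(min(label for label, c in counts.items() if c == top))
    return majority


# Hypothetical predictions of three classifiers on four samples.
clf_a = [1, 0, 1, 1]
clf_b = [1, 1, 0, 1]
clf_c = [0, 0, 1, 1]

print(hard_vote([clf_a, clf_b, clf_c]))  # [1, 0, 1, 1]
```

Each sample's label is decided independently, so one confident classifier cannot override two others that agree.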

## Importing the necessary libraries

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import config.ConnectionConfig as cc
from pyspark.sql import SparkSession
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
import joblib

## Spark session

In [32]:
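# Project-specific helpers from config/ConnectionConfig.py
cc.setupEnvironment()
spark = cc.startLocalCluster("UFC_Logistic_Regression_Training")
spark.getActiveSession()
# getOrCreate() reuses an already-running session if one exists
spark = SparkSession.builder.appName("UFC_Fights").getOrCreate()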

## Data retrieval

In [33]:
total_df = spark.read.csv("../processed_data/fight_total.csv", header=True, inferSchema=True)
total_df.show()
# select all columns except the fighter1 and fighter2 columns
data = total_df.drop('fighter1', 'fighter2')
# convert to a pandas dataframe
data = data.toPandas()
data

## Splitting data into training and testing sets

In [34]:
X = data.drop('outcome', axis = 1)
y = data['outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.01, random_state = 42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

## Training different models for the ensemble

## Random Forest Classifier

Random Forest is an ensemble learning method for classification and regression. It trains a large number of decision trees, each on a bootstrap sample of the training data with a random subset of features considered at each split, and combines their predictions into a final prediction: a vote over the trees for classification (scikit-learn averages the trees' class probabilities) and a mean for regression. It is used in a wide range of applications, including image classification, text classification, and bioinformatics.
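
The idea of training many trees and combining their predictions can be sketched in plain Python. This is a deliberately simplified illustration, not scikit-learn's implementation: the trees are one-split stumps on a single made-up feature, and the per-split feature subsampling of a real random forest is omitted. All names and data here are hypothetical.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-split tree on 1-D data: returns (threshold, left, right)."""
    xs = sorted({x for x, _ in points})
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]
    best = None
    for t in candidates:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    return best[1:]


def predict_stump(stump, x):
    t, left, right = stump
    return left if x <= t else right


def fit_forest(points, n_trees, seed):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points])
            for _ in range(n_trees)]


def predict_forest(forest, x):
    """Combine the stumps by majority vote."""
    votes = Counter(predict_stump(s, x) for s in forest)
    return votes.most_common(1)[0][0]


# Toy data: small values belong to class 0, large values to class 1.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
forest = fit_forest(data, n_trees=25, seed=42)
print(predict_forest(forest, 0.0), predict_forest(forest, 5.0))
```

Because every stump sees a different bootstrap sample, the individual thresholds differ, and the vote smooths out those differences.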

In [35]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42, max_depth=10, min_samples_leaf=2, min_samples_split=5, n_estimators=100)

rf.fit(X_train, y_train)

print(f"Accuracy: {rf.score(X_test, y_test)}")

## Neural network

MLP stands for multi-layer perceptron, a type of feedforward artificial neural network, meaning that the connections between the nodes do not form a cycle. The nodes are organized in layers: an input layer, one or more hidden layers, and an output layer. The layers are fully connected: every node in one layer is connected to every node in the next layer, so the input layer feeds the first hidden layer and the last hidden layer feeds the output layer.
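
The layer-by-layer computation can be sketched as a forward pass in plain Python. The weights, biases, and input below are arbitrary illustrative numbers, not values learned by the model in this notebook. The sketch also shows why the choice of activation matters: with an identity activation, stacked layers collapse into a single linear map, whereas a non-linearity such as ReLU changes the result.

```python
def identity(z):
    return z


def relu(z):
    return max(0.0, z)


def dense(x, W, b, activation):
    """One fully connected layer: activation(W @ x + b)."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


# Hypothetical parameters: 2 inputs -> 2 hidden nodes -> 1 output.
W1, b1 = [[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]
W2, b2 = [[1.0, 1.0]], [0.5]


def forward(x, activation):
    hidden = dense(x, W1, b1, activation)
    return dense(hidden, W2, b2, identity)[0]


x = [1.0, 2.0]
out_identity = forward(x, identity)
out_relu = forward(x, relu)

# With identity activation the network equals one linear layer:
# W = W2 @ W1 = [1.5, 1.0] and b = W2 @ b1 + b2 = -0.5.
out_linear = 1.5 * x[0] + 1.0 * x[1] - 0.5

print(out_identity, out_linear, out_relu)  # 3.0 3.0 4.0
```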

In [36]:
from sklearn.neural_network import MLPClassifier

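# Note: activation='identity' keeps the hidden layers linear, so this network
# can only learn a linear decision boundary.
mlp = MLPClassifier(hidden_layer_sizes=(5, 5), solver='adam', max_iter=20000, activation='identity', learning_rate='adaptive', learning_rate_init=0.01, random_state=42)

mlp.fit(X_train, y_train)

# Report both scores; the other models in this notebook are scored on the test set.
print(f"Training accuracy: {mlp.score(X_train, y_train)}")
print(f"Test accuracy: {mlp.score(X_test, y_test)}")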

## Logistic Regression

In [37]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_clf = LogisticRegression(fit_intercept=True, random_state=42, C=10, penalty='l2', solver='saga')

log_clf.fit(X_train_scaled, y_train)

print(f"Accuracy: {log_clf.score(X_test_scaled, y_test)}")


## Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm for classification and regression. For classification, it finds the hyperplane that separates the classes with the largest margin, that is, the largest distance between the hyperplane and the closest training points of each class (the support vectors). With a non-linear kernel the separation happens in a transformed feature space; the model below uses a linear kernel.
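
As a small worked example of the margin, consider a linear decision function f(x) = w·x + b. Points with f(x) = +1 or f(x) = -1 lie on the margin boundaries, and the distance between those two boundaries is 2 / ||w||. The weights and points below are hypothetical and chosen only so the arithmetic is easy to follow; they are not fitted to the UFC data.

```python
import math


def decision_function(w, b, x):
    """Signed score w . x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the two margin boundaries: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical separating hyperplane x1 + x2 - 3 = 0.
w, b = [1.0, 1.0], -3.0

score_a = decision_function(w, b, [1.0, 1.0])  # on the negative margin
score_b = decision_function(w, b, [2.0, 2.0])  # on the positive margin

print(score_a, score_b)  # -1.0 1.0
print(margin_width(w))
```

Among all hyperplanes that separate the data, a maximum-margin classifier picks the one for which this width is largest.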

In [38]:
from sklearn.svm import SVC


svm_clf = SVC(random_state=42, C=1, gamma='scale', kernel='linear', probability=True)

svm_clf.fit(X_train_scaled, y_train)

print(f"Accuracy: {svm_clf.score(X_test_scaled, y_test)}")

## Ensemble Classifier --> Hard Voting

In hard voting, the predicted class label for a sample is the label predicted by the majority of the classifiers. In soft voting, the predicted class label is the one with the highest average predicted probability across the classifiers (optionally weighted), which requires every classifier to provide class probabilities.
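
The two rules can disagree, which the following plain-Python sketch illustrates for a single sample. The probabilities are invented for illustration, and the helper names `soft_vote` and `hard_vote_from_probas` are hypothetical rather than library functions.

```python
def soft_vote(probas, weights=None):
    """Average the per-classifier class probabilities and take the argmax.

    probas: one [p(class 0), p(class 1), ...] list per classifier.
    """
    weights = weights or [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [sum(w * p[k] for w, p in zip(weights, probas)) / total
           for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


def hard_vote_from_probas(probas):
    """Each classifier votes for its most probable class; majority wins."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in probas]
    # Iterating in sorted order breaks ties in favour of the smallest label.
    return max(sorted(set(labels)), key=labels.count)


# Two classifiers lean slightly towards class 0, one is confident in class 1.
probas = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]

hard_label = hard_vote_from_probas(probas)
soft_label, avg = soft_vote(probas)

print(hard_label)  # 0
print(soft_label)  # 1
```

Hard voting counts two votes against one, while soft voting lets the confident classifier outweigh two weak opinions because the averaged probability of class 1 is higher.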

In [40]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[('rf', rf), ('mlp', mlp), ('log_clf', log_clf), ('svm_clf', svm_clf)],
    voting='hard'  # matches this section: majority vote over predicted labels
)

voting_clf.fit(X_train_scaled, y_train)

print(f"Accuracy: {voting_clf.score(X_test_scaled, y_test)}")

print(f"Actual: {y_test.values}")

print(f"Predicted: {voting_clf.predict(X_test_scaled)}")
# joblib.dump(voting_clf, 'models/ufc_voting_clf.pkl')
accuracy_score(y_test, voting_clf.predict(X_test_scaled))

## Hyperparameter tuning for each model

### Random Forest

In [41]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

grid_search.fit(X_train, y_train)

print(grid_search.best_params_)

print(f"Accuracy: {grid_search.score(X_test, y_test)}")

print(f"Actual: {y_test.values}")

print(f"Predicted: {grid_search.predict(X_test)}")
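
To get a feel for the cost of the search above, the number of model fits can be counted before running it. This sketch simply enumerates the same random forest grid with `itertools.product`; the variable names `n_candidates` and `n_fits` are illustrative only.

```python
from itertools import product

# Same grid as in the random forest tuning cell.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv = 3  # number of cross-validation folds

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_candidates = len(candidates)  # 3 * 5 * 3 * 3 = 135
n_fits = n_candidates * cv      # 135 * 3 = 405

print(n_candidates, n_fits)  # 135 405
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the whole training set, so the grid size grows the runtime multiplicatively with every parameter that is added.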

### Neural Network

In [42]:
param_grid = {
    'hidden_layer_sizes': [(50, 50, 50), (50, 100, 50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive'],
}

mlp = MLPClassifier(random_state=42, max_iter=1000)

grid_search = GridSearchCV(estimator=mlp, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

grid_search.fit(X_train, y_train)

print(grid_search.best_params_)

print(f"Accuracy: {grid_search.score(X_test, y_test)}")

print(f"Actual: {y_test.values}")

print(f"Predicted: {grid_search.predict(X_test)}")

### Logistic Regression

In [43]:
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}

log_clf = LogisticRegression(fit_intercept=True, random_state=42)

grid_search = GridSearchCV(estimator=log_clf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

grid_search.fit(X_train_scaled, y_train)

print(grid_search.best_params_)

print(f"Accuracy: {grid_search.score(X_test_scaled, y_test)}")

print(f"Actual: {y_test.values}")

print(f"Predicted: {grid_search.predict(X_test_scaled)}")

### Support Vector Machine

In [None]:
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto']
}

svm_clf = SVC(random_state=42, probability=True)

grid_search = GridSearchCV(estimator=svm_clf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

grid_search.fit(X_train_scaled, y_train)

print(grid_search.best_params_)

print(f"Accuracy: {grid_search.score(X_test_scaled, y_test)}")

print(f"Actual: {y_test.values}")


## Hyperparameter tuning for the ensemble classifier

In [None]:
param_grid = {
    'voting': ['hard', 'soft'],
}

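# Note: if the tuning cells above were run, rf, mlp, log_clf and svm_clf are the
# untuned estimators created there; VotingClassifier clones and refits them, so
# the best parameters found by the grid searches are not carried over here.
voting_clf = VotingClassifier(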
    estimators=[('rf', rf), ('mlp', mlp), ('log_clf', log_clf), ('svm_clf', svm_clf)],
    voting='hard'
)
voting_clf_soft = VotingClassifier(
    estimators=[('rf', rf), ('mlp', mlp), ('log_clf', log_clf), ('svm_clf', svm_clf)],
    voting='soft'
)
#grid_search = GridSearchCV(estimator=voting_clf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

#grid_search.fit(X_train_scaled, y_train)
voting_clf_soft.fit(X_train_scaled, y_train)
voting_clf.fit(X_train_scaled, y_train)
#print(grid_search.best_params_)

#print(f"Accuracy: {grid_search.score(X_test_scaled, y_test)}")

#print(f"Actual: {y_test.values}")

print(f"Accuracy on training set (hard): {voting_clf.score(X_train_scaled, y_train)}")
print(f"Accuracy on training set (soft): {voting_clf_soft.score(X_train_scaled, y_train)}")

print(f"Accuracy on test set (hard): {voting_clf.score(X_test_scaled, y_test)}")
print(f"Accuracy on test set (soft): {voting_clf_soft.score(X_test_scaled, y_test)}")