## Random Forest Classifier

Because the dataset is large and mixes categorical and numerical features, Random Forest is a reasonable first model: it can capture non-linear relationships and feature interactions, and it needs little preprocessing beyond encoding the categorical columns as numbers (no feature scaling is required).

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    log_loss
)
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv('path_to_your_dataset.csv')

# Preprocessing steps (assuming they have been done)

# Encode categorical variables
label_encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])

# Split the data into features and target
X = df.drop('player_victory', axis=1)
y = df['player_victory']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_classifier.predict(X_test)
y_pred_proba = rf_classifier.predict_proba(X_test)[:, 1]

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
conf_matrix = confusion_matrix(y_test, y_pred)
logloss = log_loss(y_test, y_pred_proba)

# Print the metrics
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')
print(f'ROC-AUC Score: {roc_auc:.2f}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Log Loss: {logloss:.2f}')

To evaluate the classifier comprehensively, we report several metrics, each of which captures a different aspect of performance. The code above prints the following binary-classification metrics:

Accuracy: The proportion of true results (both true positives and true negatives) among the total number of cases examined.

Precision: The ratio of true positives to all positive predictions. Precision is a measure of the accuracy of the positive predictions.

Recall (Sensitivity): The ratio of true positives to all actual positives. Recall measures the ability of the classifier to find all the positive samples.

F1 Score: The harmonic mean of precision and recall. An F1 score balances the trade-off between precision and recall.

ROC-AUC Score: The area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate across all classification thresholds. It summarizes how well the model ranks positive samples above negative ones.

Confusion Matrix: A table used to describe the performance of a classification model on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.

Log Loss (Cross-Entropy Loss): Measures the performance of a classification model where the prediction is a probability between 0 and 1. The loss increases as the predicted probability diverges from the actual label.
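
The definitions above can be checked by hand on a small example. The sketch below uses hypothetical toy labels and probabilities (the `*_toy` arrays are made up for illustration and are not taken from the dataset), computes each metric from the raw counts, and compares the results with the scikit-learn functions used earlier.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    log_loss,
)

# Hypothetical toy data: true labels and predicted probabilities of class 1
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_prob_toy = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1])
y_pred_toy = (y_prob_toy >= 0.5).astype(int)  # hard labels at a 0.5 threshold

# Count the four outcome types directly
tp = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))
tn = int(np.sum((y_true_toy == 0) & (y_pred_toy == 0)))
fp = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))
fn = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))

# Metrics written out from their definitions
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
logloss_manual = -np.mean(
    y_true_toy * np.log(y_prob_toy) + (1 - y_true_toy) * np.log(1 - y_prob_toy)
)

# Side-by-side comparison with scikit-learn
print("accuracy ", accuracy_manual, accuracy_score(y_true_toy, y_pred_toy))
print("precision", precision_manual, precision_score(y_true_toy, y_pred_toy))
print("recall   ", recall_manual, recall_score(y_true_toy, y_pred_toy))
print("f1       ", f1_manual, f1_score(y_true_toy, y_pred_toy))
print("log loss ", logloss_manual, log_loss(y_true_toy, y_prob_toy))
```

Note that accuracy, precision, recall, and F1 depend on the chosen threshold (0.5 here), whereas log loss is computed from the probabilities themselves.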

## Results Visualizations

### Confusion Matrix

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# y_test (true labels) and y_pred (model predictions) come from the evaluation cell above

conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

The confusion matrix is a table that compares the model's predictions with the true labels. For a binary classifier it has four cells:

True Positives (TP): The cases in which the model correctly predicted the positive class.

True Negatives (TN): The cases in which the model correctly predicted the negative class.

False Positives (FP): The cases in which the model incorrectly predicted the positive class (also known as a "Type I error").

False Negatives (FN): The cases in which the model incorrectly predicted the negative class (also known as a "Type II error").

The heatmap visualization of the confusion matrix uses color to emphasize the different values, with darker colors typically representing higher numbers. This visualization makes it easy to see the proportion of correct and incorrect predictions.
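
When reading the heatmap it helps to know which cell is which. As a minimal sketch on hypothetical toy labels (the `*_toy` arrays are illustrative, not from the dataset), the snippet below shows that scikit-learn puts true labels on the rows and predicted labels on the columns, and also row-normalizes the matrix so that each row shows proportions rather than counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, with 0 as the negative and 1 as the positive class
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 0, 0, 1, 0, 0, 0])

# Rows are true labels, columns are predicted labels, both sorted as [0, 1],
# so the binary matrix is laid out as [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Divide each row by its total to get per-class proportions
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
print(cm_normalized)
```

Passing `cm_normalized` to `sns.heatmap` with `fmt='.2f'` instead of `fmt='d'` would plot proportions, which can be easier to compare when the classes differ in size.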

### ROC Curve

In [None]:
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(10, 7))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

True Positive Rate (TPR): Also known as recall, it is the ratio of TP to the sum of TP and FN.

False Positive Rate (FPR): It is the ratio of FP to the sum of FP and TN.

The area under the ROC curve (AUC) is a measure of the model's ability to distinguish between the classes. An AUC of 0.5 suggests no discrimination (i.e., random chance), while an AUC of 1.0 indicates perfect discrimination.
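
One way to make the AUC concrete is its ranking interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count as one half). The sketch below checks this on hypothetical toy scores (the `*_toy` arrays are illustrative, not model output) against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy labels and predicted probabilities of the positive class
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_prob_toy = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1])

pos_scores = y_prob_toy[y_true_toy == 1]
neg_scores = y_prob_toy[y_true_toy == 0]

# Compare every positive score with every negative score
diff = pos_scores[:, None] - neg_scores[None, :]
auc_pairwise = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc_sklearn = roc_auc_score(y_true_toy, y_prob_toy)
print(auc_pairwise, auc_sklearn)
```

The pairwise comparison is quadratic in the number of samples, so it is only meant as an illustration; `roc_auc_score` is the practical choice on a real test set.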

### Precision-Recall Curve

In [None]:
from sklearn.metrics import precision_recall_curve

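# Use separate names so the scalar precision/recall computed earlier are not overwritten
pr_precision, pr_recall, _ = precision_recall_curve(y_test, y_pred_proba)

plt.figure(figsize=(10, 7))
plt.plot(pr_recall, pr_precision, color='blue', lw=2, label='Precision-Recall curve')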
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="upper right")
plt.show()

The Precision-Recall curve shows the trade-off between precision and recall at different thresholds. A large area under the curve indicates both high recall and high precision: high precision means few false positives among the positive predictions, and high recall means few false negatives among the actual positives.

Precision: The ratio of TP to the sum of TP and FP.

Recall: The ratio of TP to the sum of TP and FN.

This curve is particularly useful when the classes are very imbalanced. Unlike the ROC curve, the Precision-Recall curve focuses on the performance of the positive class, because neither precision nor recall takes true negatives into account.
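
The curve can be summarized by a single number, the average precision, and judged against the baseline of a classifier that scores at random, whose precision is roughly the fraction of positive samples. The following is a minimal sketch on hypothetical toy data (the `*_toy` arrays are illustrative, not model output).

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical toy labels and predicted probabilities of the positive class
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_prob_toy = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1])

# One (precision, recall) pair per threshold, plus a final (1.0, 0.0) end point
pr_precision, pr_recall, pr_thresholds = precision_recall_curve(y_true_toy, y_prob_toy)

# Average precision summarizes the curve as a weighted mean of the precisions
ap = average_precision_score(y_true_toy, y_prob_toy)

# Baseline: the share of positive samples in the data
baseline = y_true_toy.mean()

print(f"Average precision: {ap:.3f}")
print(f"Baseline (positive rate): {baseline:.3f}")
```

On an imbalanced dataset the baseline is far below 0.5, so the average precision should be compared with the positive rate rather than with 0.5.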

### Feature Importance

In [None]:
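import numpy as np

# Importance scores from the trained forest, sorted from most to least important
importances = rf_classifier.feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = X_train.columns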

plt.figure(figsize=(15, 7))
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices], color="r", align="center")
plt.xticks(range(X_train.shape[1]), feature_names[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.show()

Feature importance assigns a score to each feature: the higher the score, the more that feature contributes to the model's predictions. It is not a separate class; tree-based classifiers in scikit-learn expose these scores through the fitted model's `feature_importances_` attribute. For a Random Forest the scores are impurity-based, and here we read them from the trained classifier to identify the top features in our dataset.

Higher Bar: Indicates that the feature is more important for the model when making predictions.

Lower Bar: Indicates that the feature is less important.

In the bar chart, each bar represents a feature in the dataset, and the height of the bar corresponds to its importance score. The bars are sorted from most to least important, which makes it easy to see which features have the most impact on the model's predictions.
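
With many features the bar chart can be hard to read, so a ranked table is a useful companion. The sketch below is self-contained and runs on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `rf_demo`, and the `feature_i` columns are hypothetical stand-ins for `X_train`, `y_train`, and `rf_classifier`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 6 features, of which 3 carry signal
X_array, y_demo = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)
demo_columns = [f"feature_{i}" for i in range(X_array.shape[1])]
X_demo = pd.DataFrame(X_array, columns=demo_columns)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Pair each score with its column name and sort from most to least important
importance_table = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importance_table)
print("Sum of importances:", importance_table.sum())
```

The scores are normalized to sum to 1, so each value can be read as a share of the total importance. Impurity-based scores can favor features with many distinct values, so they are best treated as a rough ranking rather than an exact measure.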