# Objective and Forecast 


	- Business Problem: Predict a favorable (profitable) bet on the winner between two fighters at a UFC event. 
	- ML Problem: Binary classification (either one fighter wins or the other)
	- Metrics: Confusion matrix, Log Loss, and ROC curve (with AUC)

### The Data and Sources 

For a more detailed breakdown of the data, please scroll to the EDA section. This section will contain a quick preface of the structure of the data, and any outside sources to be credited. 

#### The Features 

There are 118 feature columns, each falling into one of the following categories:

	- Identification variables: fighter names, dates, locations
	- Fighters' physical attributes: reach, height, weight, gender, weight class
	- Fighters' skill attributes: significant strike percentage, submissions and takedowns, win-loss record, win-loss streaks, ranking
	- Bout-specific details: total fight time, who won and how they won (e.g. Red - Unanimous Decision, or Blue - KO/TKO)

#### The Target

What we're trying to predict is the winner of each bout, given in the train set as the 'Winner' column with two class levels, 'Red' and 'Blue'. These are the colors of the corners in the UFC octagon, so each fighter is labeled a 'Red' fighter or a 'Blue' fighter according to their corner.

Before training, I encoded these values as 0 and 1, with 'Red' being 0. Because 'Red' winners outnumber 'Blue' winners (2859 to 2037 in the raw data), there were more 0's than 1's, and I had to keep in mind that features that seem less significant might simply be skewed towards the 0 side. 
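The encoding step can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual cell; the `target` column name and the `winner_codes` mapping are names chosen here for illustration.

```python
import pandas as pd

# Toy stand-in for the 'Winner' column; the real values come from ufc-master.csv.
toy = pd.DataFrame({"Winner": ["Red", "Blue", "Red", "Red", "Blue"]})

# Red -> 0, Blue -> 1, matching the encoding described above.
winner_codes = {"Red": 0, "Blue": 1}
toy["target"] = toy["Winner"].map(winner_codes)

# Share of each class, to see how lopsided the target is.
class_share = toy["target"].value_counts(normalize=True)
print(class_share)
```

Checking the class shares right after encoding makes the imbalance explicit before any feature is judged against the target.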

### Quick Note 

## Overview of the Project

### What I worked with - Insights from EDA

	- With 118 numerical and categorical features, the dimensionality was high and had to be reduced. Many of the columns were circumstantial (if one was nonzero, the others were not) and had a large number of nulls 
	- The dataset is imbalanced, skewing in favor of fighters from the Red corner
	- A few of the continuous variables, like height and reach, are collinear, and so are many of the ordinal variables, which creates a lot of noise  

### What I tried - Approaches and Models

I tried a total of 7 feature engineering approaches and 3 types of model.

Feature engineering approaches included: 

	- Looking at all columns (ordinal, encoded categorical, and continuous) and using those that have the highest correlation with the target column
	- Finding a mix of ordinal and continuous features that are skewed towards one outcome, or normally distributed
	- Reducing dimensionality by consolidating variables unique to each fighter into one variable indexing the difference

Models:

	- Ensemble methods like Random Forest, Gradient Boosting, etc. 
		- Able to handle imbalanced data, and data that skews one way
		- Able to handle complex datasets with high dimensionality, robust to outliers 
		- Particularly sensitive to features with more probabilistic pull on the outcome
	- Probabilistic models like Naive Bayes
		- Produce a probabilistic estimate 
		- Easily interpretable
		- Fast and useful for real-time prediction
	- SVM
		- I thought that, if there was some sort of linear correlation, an SVM might be able to handle the high dimensionality through composition analysis
		- This ended up being a bust...


### What worked and how I could tell - The best model, features, and hyperparameters

The best arrangement of features was actually a mix of the approaches. I picked a number of features that were normally distributed or heavily favored one outcome, and I also created a new variable called `perf_diff`, the difference between the two fighters' cumulative performance scores (each computed by aggregating the fighter's significant strike percentage, total takedown percentage, and total submissions attempted). 
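A difference feature like `perf_diff` might be built along these lines. This is only a sketch: the column names are placeholders rather than confirmed dataset columns, and the unweighted sum inside `performance_score` is an assumption, since the exact aggregation is not spelled out above.

```python
import pandas as pd

# Made-up rows; the column names are placeholders, not confirmed dataset columns.
bouts = pd.DataFrame({
    "R_sig_str_pct": [0.50, 0.40],
    "R_td_pct": [0.30, 0.60],
    "R_sub_att": [1.0, 0.0],
    "B_sig_str_pct": [0.45, 0.55],
    "B_td_pct": [0.20, 0.10],
    "B_sub_att": [0.0, 2.0],
})

def performance_score(df, corner):
    """Assumed aggregation: an unweighted sum of the three stats for one corner."""
    return (
        df[f"{corner}_sig_str_pct"]
        + df[f"{corner}_td_pct"]
        + df[f"{corner}_sub_att"]
    )

# Positive values favor the Red corner, negative values favor Blue.
bouts["perf_diff"] = performance_score(bouts, "R") - performance_score(bouts, "B")
print(bouts["perf_diff"])
```

Collapsing six per-fighter columns into one signed difference is what reduces the dimensionality here: the model sees a single number per bout instead of two parallel sets of stats.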

The best model ended up being a random forest classifier - I figured this out by using four particular metrics:

	- Accuracy - the number of correct predictions as a proportion of all predictions. Worst is 0, best is 1. 
	- Precision - interpreted using the confusion matrix. The number of true positive predictions as a proportion of all positive predictions. Worst is 0, best is 1.  
	- F1-score - the harmonic mean of precision and recall (true positive predictions as a proportion of all actual positives). Worst is 0, best is 1. 
	- Log Loss - a loss metric particularly effective for probabilistic outcomes. Best is 0; it has no upper bound, and a model that always predicts 0.5 scores about 0.693 (ln 2). 
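To make the four definitions concrete, here is a small sketch that computes each one by hand with NumPy on made-up labels and probabilities. It illustrates the formulas only and is not the evaluation code used in the notebook; the 0.5 cut-off and the clipping constant `eps` are assumptions.

```python
import numpy as np

# Made-up ground truth and predicted probabilities of class 1 ('Blue').
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.6, 0.7, 0.4, 0.9, 0.1])
y_pred = (y_prob >= 0.5).astype(int)  # assumed default cut-off of 0.5

# Confusion-matrix cells.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
f1 = 2 * precision * recall / (precision + recall)

# Binary log loss; clipping avoids log(0) for overconfident predictions.
eps = 1e-15
p = np.clip(y_prob, eps, 1 - eps)
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(accuracy, precision, recall, f1, log_loss)
```

The toy data is chosen so that no denominator is zero; real code would need to guard against a model that never predicts the positive class.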

I used these metrics to discern the best hyperparameters for my particular instance of this algorithm. 

Finally, I used the shape of the ROC curve to set a probability threshold that gets the most out of my model. 
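One common way to read a threshold off the ROC curve is to pick the point that maximizes TPR minus FPR (Youden's J statistic). The sketch below does that with plain NumPy on made-up scores. The choice of Youden's J is an assumption for illustration, since the exact selection rule is not stated here, and `roc_points` is a hypothetical helper.

```python
import numpy as np

# Made-up labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])

def roc_points(y_true, y_prob):
    """Return candidate thresholds with the TPR and FPR obtained at each one."""
    thresholds = np.unique(y_prob)[::-1]  # highest threshold first
    positives = np.sum(y_true == 1)
    negatives = np.sum(y_true == 0)
    tpr, fpr = [], []
    for t in thresholds:
        pred = y_prob >= t
        tpr.append(np.sum(pred & (y_true == 1)) / positives)
        fpr.append(np.sum(pred & (y_true == 0)) / negatives)
    return thresholds, np.array(tpr), np.array(fpr)

thresholds, tpr, fpr = roc_points(y_true, y_prob)

# Youden's J: the threshold where the curve is farthest above the diagonal.
best_threshold = thresholds[np.argmax(tpr - fpr)]
print(best_threshold)
```

For a betting use case a different rule, such as one weighted by payout, may fit better; the scan over thresholds stays the same and only the quantity being maximized changes.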

# EDA and Feature Engineering Concerns 

In [1]:
import numpy as np 
import pandas as pd
from helpers import * 
import warnings
from pandas.errors import SettingWithCopyWarning

warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)
warnings.simplefilter(action="ignore", category=FutureWarning)
ufc = pd.read_csv('ufc-master.csv')

In [2]:
ufc.head()

Unnamed: 0,R_fighter,B_fighter,R_odds,B_odds,R_ev,B_ev,date,location,country,Winner,...,finish_details,finish_round,finish_round_time,total_fight_time_secs,r_dec_odds,b_dec_odds,r_sub_odds,b_sub_odds,r_ko_odds,b_ko_odds
0,Thiago Santos,Johnny Walker,-150.0,130,66.666667,130.0,2021-10-02,"Las Vegas, Nevada, USA",USA,Red,...,,5.0,5:00,1500.0,800.0,900.0,2000.0,1600.0,-110.0,175.0
1,Alex Oliveira,Niko Price,170.0,-200,170.0,50.0,2021-10-02,"Las Vegas, Nevada, USA",USA,Blue,...,,3.0,5:00,900.0,450.0,350.0,700.0,1100.0,550.0,120.0
2,Misha Cirkunov,Krzysztof Jotko,110.0,-130,110.0,76.923077,2021-10-02,"Las Vegas, Nevada, USA",USA,Blue,...,,3.0,5:00,900.0,550.0,275.0,275.0,1400.0,600.0,185.0
3,Alexander Hernandez,Mike Breeden,-675.0,475,14.814815,475.0,2021-10-02,"Las Vegas, Nevada, USA",USA,Red,...,Punch,1.0,1:20,80.0,175.0,900.0,500.0,3500.0,110.0,1100.0
4,Joe Solecki,Jared Gordon,-135.0,115,74.074074,115.0,2021-10-02,"Las Vegas, Nevada, USA",USA,Blue,...,,3.0,5:00,900.0,165.0,200.0,400.0,1200.0,900.0,600.0


In [3]:
ufc.shape

(4896, 119)

## Handling Nulls 

Initially, I tried to handle nulls in a balanced way by filling null values with aggregate statistics (the mean for a continuous feature's nulls, and the mode for categorical features). I also tried to fill nulls in proportion to the trends in that particular feature or in other related features. Ultimately, however, I found it was best to simply drop the null values entirely, because doing so didn't affect any of the features' trends or omit a significant amount of information. 
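The two options compared above, imputing versus dropping, look roughly like this in pandas. The frame and its column names (`reach`, `stance`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one continuous and one categorical column (placeholder names).
toy = pd.DataFrame({
    "reach": [180.0, np.nan, 175.0, 185.0],
    "stance": ["Orthodox", "Southpaw", None, "Orthodox"],
})

# Option 1: impute with the mean (continuous) and the mode (categorical).
imputed = toy.copy()
imputed["reach"] = imputed["reach"].fillna(imputed["reach"].mean())
imputed["stance"] = imputed["stance"].fillna(imputed["stance"].mode()[0])

# Option 2: drop every row that contains a null.
dropped = toy.dropna()

print(imputed)
print(len(toy), "rows before dropping,", len(dropped), "after")
```

Comparing the row count before and after `dropna` is a quick way to check how much information dropping would actually cost.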

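I also found the rank variables useless because they are mostly null: more than half of the values in each rank column are missing, as the check below shows.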

In [4]:
bad_cols = []

# Collect the columns with more than half their values as nulls

for col in ufc.columns:
    if ufc[col].isna().sum() > ufc.shape[0] // 2:
        bad_cols.append(col) 

In [5]:
bad_cols

['B_match_weightclass_rank',
 'R_match_weightclass_rank',
 "R_Women's Flyweight_rank",
 "R_Women's Featherweight_rank",
 "R_Women's Strawweight_rank",
 "R_Women's Bantamweight_rank",
 'R_Heavyweight_rank',
 'R_Light Heavyweight_rank',
 'R_Middleweight_rank',
 'R_Welterweight_rank',
 'R_Lightweight_rank',
 'R_Featherweight_rank',
 'R_Bantamweight_rank',
 'R_Flyweight_rank',
 'R_Pound-for-Pound_rank',
 "B_Women's Flyweight_rank",
 "B_Women's Featherweight_rank",
 "B_Women's Strawweight_rank",
 "B_Women's Bantamweight_rank",
 'B_Heavyweight_rank',
 'B_Light Heavyweight_rank',
 'B_Middleweight_rank',
 'B_Welterweight_rank',
 'B_Lightweight_rank',
 'B_Featherweight_rank',
 'B_Bantamweight_rank',
 'B_Flyweight_rank',
 'B_Pound-for-Pound_rank',
 'finish_details']

We'll drop these columns entirely.
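Dropping them is a single `DataFrame.drop` call. The sketch below repeats the null check on a toy frame built from the first four rows shown in the `head()` output; `mostly_null` and `cleaned` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Two columns from the first four rows of the head() output above.
toy = pd.DataFrame({
    "R_odds": [-150.0, 170.0, 110.0, -675.0],
    "finish_details": [np.nan, np.nan, np.nan, "Punch"],
})

# Same rule as above: flag columns where more than half the values are null.
mostly_null = [
    col for col in toy.columns
    if toy[col].isna().sum() > toy.shape[0] // 2
]

# drop() returns a new frame; the original is left untouched.
cleaned = toy.drop(columns=mostly_null)
print(mostly_null, list(cleaned.columns))
```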

## Imbalanced Data  

In [6]:
ufc.Winner.value_counts()

Winner
Red     2859
Blue    2037
Name: count, dtype: int64

To correct this imbalance, we will have to resample with replacement. Some of the models we'll be looking at can handle imbalanced data, but balancing the classes helps both the models and the accuracy of the insights from EDA. 
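A minimal sketch of resampling with replacement, here done by oversampling the minority class with `DataFrame.sample`. Whether the minority is oversampled or the majority undersampled is an assumption, and the toy frame and fixed `random_state` are illustrative only.

```python
import pandas as pd

# Toy frame with the same kind of imbalance: more Red winners than Blue.
toy = pd.DataFrame({
    "Winner": ["Red"] * 6 + ["Blue"] * 4,
    "feature": range(10),
})

counts = toy["Winner"].value_counts()
majority_label = counts.idxmax()

majority = toy[toy["Winner"] == majority_label]
minority = toy[toy["Winner"] != majority_label]

# Draw minority rows with replacement until both classes are the same size.
upsampled = minority.sample(n=len(majority), replace=True, random_state=0)

balanced = pd.concat([majority, upsampled], ignore_index=True)
print(balanced["Winner"].value_counts())
```

If the resampled data is later used for model training, resampling only the training split keeps duplicated rows from leaking into the evaluation set.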