# UFC prediction

The purpose of this notebook is to predict the result of UFC fights (UFC 245 is the latest event as I am writing this introduction). 

But first, let's talk a little about the context of this dataset.  
The UFC is currently the biggest mixed martial arts organization in the world in terms of viewership and prestige. UFC fighters are usually considered the best in the world; they often come from other organisations and get promoted after a long road. People have always wanted to predict the winner of this kind of fight, and even more so today as betting on such events is growing fast. Being able to predict the final result of a UFC fight, with probabilities, could help with betting decisions and make fights even more exciting to watch. 

The special thing to keep in mind about this dataset is that each row compiles both fighters' statistics UP UNTIL THIS FIGHT (this is very important to understand!). 
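To make the "up until this fight" idea concrete, here is a minimal sketch on made-up numbers. The `toy` frame and its `strikes_landed` column are hypothetical (they are not columns of the dataset), and the sketch assumes that a pre-fight average is simply the mean over all earlier fights, which leaves a debut with NaN.

```python
import pandas as pd

# Hypothetical per-fight log for one fighter, oldest fight first
toy = pd.DataFrame({
    "fighter": ["A", "A", "A"],
    "strikes_landed": [10, 20, 30],
})

# Pre-fight average: mean of all EARLIER fights only, so the current fight is excluded
toy["avg_strikes_before_fight"] = (
    toy.groupby("fighter")["strikes_landed"]
       .transform(lambda s: s.shift(1).expanding().mean())
)
print(toy)
```

The first row has no history, so its average is NaN, which is the same pattern as the NaN values of debuting fighters in the dataset.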

In [None]:
from IPython.display import Image
Image("/kaggle/input/ufc245/UFC 245.jpg")

In [None]:
import re
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.display.max_columns = None
pd.options.display.max_rows = None
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

In [None]:
df = pd.read_csv('/kaggle/input/ufcdata/data.csv')

b_age = df['B_age']  # we move B_age so that it sits among the other B features
df.drop(['B_age'], axis = 1, inplace = True)
df.insert(76, "B_age", b_age)

df_fe = df.copy() #  We make a copy of the dataframe for the feature engineering part later

df.head(5)

In [None]:
print(df.shape)
len(df[df['Winner'] == 'Draw'])

The last fight (and UFC event) recorded in this dataset took place on 8 November 2019.

In [None]:
last_fight = df.loc[0, ['date']]
print(last_fight)

# Data Cleaning

Before April 2001, there were almost no rules in the UFC (no judges, no time limits, no rounds, etc.). From that date, the UFC started applying a set of rules known as the "Unified Rules of Mixed Martial Arts", in accordance with the New Jersey State Athletic Control Board in the United States. Therefore, we delete all fights that took place before this major change in the UFC's rules. 
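The date filter can be written as a plain string comparison when the dates are stored as `YYYY-MM-DD` text, because that format sorts in chronological order. A slightly more explicit variant parses the column first; the sketch below assumes ISO-formatted date strings and uses a hypothetical `toy` frame rather than the real data.

```python
import pandas as pd

# Hypothetical fight dates, stored as ISO strings
toy = pd.DataFrame({"date": ["1999-07-16", "2001-03-31", "2001-05-04", "2019-11-09"]})
limit_date = pd.Timestamp("2001-04-01")

# Parse the strings once, then compare real timestamps
toy["date"] = pd.to_datetime(toy["date"])
recent = toy[toy["date"] > limit_date]
print(recent)
```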

In [None]:
limit_date = '2001-04-01'
df = df[(df['date'] > limit_date)]
print(df.shape)

In [None]:
print("Total NaN in dataframe :" , df.isna().sum().sum())
print("Total NaN in each column of the dataframe")
na = []
for index, col in enumerate(df):
    na.append((index, df[col].isna().sum())) 
na_sorted = na.copy()
na_sorted.sort(key = lambda x: x[1], reverse = True) 

for i in range(len(df.columns)):
    print(df.columns[na_sorted[i][0]],":", na_sorted[i][1], "NaN")

Most NaN values come from fighters who are new to the UFC and have no statistics yet. Their statistics are NaN for their first fight and, given how the dataset is built, are only filled in from their second fight onwards. 

As there are only a few of them, we replace the NaN values of weight, height, age and reach with the median of each feature.  
We also replace the NaN values of 'B_Stance' and 'R_Stance' with the mode, which in our case is the "Orthodox" stance. 
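For readers who prefer plain pandas, the same idea (median for numeric columns, mode for stances) can be sketched as follows. This is only an illustration on a hypothetical `toy` frame with invented values; the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names come from the dataset
toy = pd.DataFrame({
    "R_Reach_cms": [180.0, np.nan, 190.0, 170.0],
    "R_Stance": ["Orthodox", "Southpaw", None, "Orthodox"],
})

# Numeric feature: fill missing values with the median of the observed values
toy["R_Reach_cms"] = toy["R_Reach_cms"].fillna(toy["R_Reach_cms"].median())
# Categorical feature: fill missing values with the most frequent category
toy["R_Stance"] = toy["R_Stance"].fillna(toy["R_Stance"].mode()[0])
print(toy)
```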

In [None]:
from sklearn.impute import SimpleImputer

imp_features = ['R_Weight_lbs', 'R_Height_cms', 'B_Height_cms', 'R_age', 'B_age', 'R_Reach_cms', 'B_Reach_cms']
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')

for feature in imp_features:
    imp_feature = imp_median.fit_transform(df[feature].values.reshape(-1,1))
    df[feature] = imp_feature

imp_stance = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_R_stance = imp_stance.fit_transform(df['R_Stance'].values.reshape(-1,1))
imp_B_stance = imp_stance.fit_transform(df['B_Stance'].values.reshape(-1,1))
df['R_Stance'] = imp_R_stance
df['B_Stance'] = imp_B_stance

In [None]:
# we recount on the dataframe itself, so that the imputation above is taken into account
print('Number of features with NaN values :', df.isna().any().sum())

We delete every row that has a NaN in one of the na_features columns. We also drop the Referee and location columns as they are of no use (no predictive power) for the model we want to build. 

In [None]:
na_features = ['B_avg_BODY_att', 'R_avg_BODY_att']
df.dropna(subset = na_features, inplace = True)

df.drop(['Referee', 'location'], axis = 1, inplace = True)

In [None]:
print(df.shape)
print("Total NaN in dataframe :" , df.isna().sum().sum())

# Feature Engineering

In [None]:
df.info()

There are 135 quantitative features and 8 categorical features. 
Let's see which features are categorical features.

In [None]:
list(df.select_dtypes(include=['object', 'bool']))

We delete the B_draw and R_draw columns as all their values are 0 (constant). We keep date, which we still need to filter the dataframe, as well as the rest of the categorical features. 
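Constant columns can also be detected programmatically rather than by inspection. The sketch below runs on a hypothetical `toy` frame (the values are invented) and treats a column as constant when it holds a single distinct value.

```python
import pandas as pd

# Hypothetical values; two columns never vary
toy = pd.DataFrame({
    "B_draw": [0, 0, 0],
    "R_draw": [0, 0, 0],
    "B_wins": [3, 0, 7],
})

# A column with one distinct value carries no information for the model
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)
toy = toy.drop(columns=constant_cols)
```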

In [None]:
print(df['B_draw'].value_counts())
print(df['R_draw'].value_counts())
df.drop(['B_draw', 'R_draw'], axis=1, inplace=True)

We also delete every fight that ended in a draw, as we don't want to add a third class to our target variable. As we saw earlier, only 83 fights in the dataset were draws. Now we only have a winner and a loser. We also delete the 'Catch Weight' rows from the weight_class feature as those fights are anecdotal. 

In [None]:
df = df[df['Winner'] != 'Draw']
df = df[df['weight_class'] != 'Catch Weight']

In [None]:
plt.figure(figsize=(50, 40))
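# we restrict the correlation to numeric and boolean columns (recent pandas versions raise an error on text columns)
corr_matrix = df.select_dtypes(include=['number', 'bool']).corr(method = 'pearson').abs()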
sns.heatmap(corr_matrix, annot=True)

In [None]:
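# we keep the upper triangle only, so that each pair of features appears once (np.bool no longer exists in recent NumPy)
sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1)
                 .astype(bool))
                 .stack()
                 .sort_values(ascending=False))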
print(sol[0:10])

We can notice that the most highly correlated variables are the "attempted" and "landed" ones. Intuitively, the more strikes a fighter attempts, the more strikes he actually lands on his opponent.  
For instance, the higher a fighter's average number of significant head strikes (landed and attempted), the higher his average number of significant strikes overall.  

However, we can't keep only the "landed" variables because we need to know the fighter's accuracy (on leg kicks, head or body strikes, submissions, clinches, etc.). Maybe he attempts a lot of strikes but has a hard time hitting his opponent.  
Later, we will try to transform some features to reduce the number of variables and build a lighter, easier-to-read dataframe. The goal is to keep the main information while avoiding dependence between features. But first, we are going to feed all the dataframe features to the model. 
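One possible transformation, shown here only as a sketch, is to combine an "attempted" and a "landed" column into a single accuracy ratio. `B_avg_BODY_att` appears in this notebook; the `B_avg_BODY_landed` name is an assumption about the dataset, and `B_BODY_acc` is a hypothetical new feature. The values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "B_avg_BODY_att": [10.0, 0.0, 8.0],
    "B_avg_BODY_landed": [6.0, 0.0, 2.0],  # assumed column name
})

# Avoid dividing by zero when a fighter never attempted a body strike
attempts = toy["B_avg_BODY_att"].replace(0, np.nan)
# Hypothetical feature: share of attempted body strikes that landed (0 when nothing was attempted)
toy["B_BODY_acc"] = (toy["B_avg_BODY_landed"] / attempts).fillna(0.0)
print(toy)
```

A ratio like this keeps the accuracy information while replacing two strongly correlated columns with one.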

We can also notice that the number of rounds is highly correlated with the title bout flag. The rule is that a title bout lasts a maximum of 5 rounds, against only 3 for a non-title bout. However, the UFC changed its rules to allow some non-title bouts to last 5 rounds (those on the main card, actually). 

# Data Preprocessing

As we only need each fighter's most recent fights to get his latest statistics and feed them into our model, we don't need to keep the older fights of active fighters; with this approach they should not make any difference to the model's performance.  
In other words, the train/test split strategy is the following:
1. We train our model on each fighter's last fight, t (index 0 in the code below). 
2. We then test our model on each fighter's previous fight, t-1 (index 1). 
3. We drop the fights at moments t-2, t-3, etc.
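The selection of each fighter's most recent fights can also be sketched in a vectorized way. The example below uses a hypothetical `toy` frame ordered from most recent to oldest, like the dataset; it ranks the fights of every fighter regardless of the corner, with rank 0 for the most recent fight and rank 1 for the one before.

```python
import pandas as pd

# Hypothetical fights, most recent first
toy = pd.DataFrame({
    "date": ["2019-11-09", "2019-06-01", "2018-12-29", "2018-03-03"],
    "R_fighter": ["A", "A", "B", "A"],
    "B_fighter": ["B", "C", "C", "B"],
})

# One row per (fight, fighter) so that both corners are taken into account
long = toy.reset_index().melt(
    id_vars=["index", "date"],
    value_vars=["R_fighter", "B_fighter"],
    value_name="fighter",
)
long = long.sort_values("date", ascending=False, kind="mergesort")
# rank 0 = most recent fight of that fighter, rank 1 = the one before, and so on
long["rank"] = long.groupby("fighter").cumcount()

last_fights = sorted(int(i) for i in long.loc[long["rank"] == 0, "index"].unique())
previous_fights = sorted(int(i) for i in long.loc[long["rank"] == 1, "index"].unique())
print(last_fights, previous_fights)
```

In this toy example one fight ends up in both selections, because it is the most recent fight of one fighter and the second most recent of the other. The same overlap may occur with the real data, so it is worth checking how many rows the two sets share.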

In [None]:
# i = position of the fight in the fighter's history: 0 is the last (most recent) fight, -1 is the first one
def select_fight_row(df, name, i): 
    # filter df on the fighter's name and reset the index of this temporary dataframe;
    # rows stay ordered from the most recent fight to the oldest one
    df_temp = df[(df['R_fighter'] == name) | (df['B_fighter'] == name)].reset_index(drop=True)
    if i >= len(df_temp):  # if we are looking for a fight that doesn't exist, we return nothing
        return None
    return df_temp.iloc[i, :].values

select_fight_row(df, 'Amanda Nunes', 0) #  we get the last fight of Amanda Nunes

In [None]:
# get all active UFC fighters (according to the limit_date parameter)
def list_fighters(df, limit_date):
    df_temp = df[df['date'] > limit_date]
    set_R = set(df_temp['R_fighter'])
    set_B = set(df_temp['B_fighter'])
    fighters = list(set_R.union(set_B))
    return fighters

In [None]:
fighters = list_fighters(df, '2017-01-01')
print(len(fighters))

We build one DataFrame with the last fight of every active UFC fighter and another DataFrame with the second-to-last fight of every active UFC fighter. 

In [None]:
def build_df(df, fighters, i):      
    # we look up each fighter's fight only once and skip fighters who have no fight at position i
    rows = [select_fight_row(df, name, i) for name in fighters]
    arr = [row for row in rows if row is not None]
    cols = [col for col in df] 
    df_fights = pd.DataFrame(data=arr, columns=cols)
    df_fights.drop_duplicates(inplace=True)  # the same fight is selected once for each of its two fighters
    df_fights['title_bout'] = df_fights['title_bout'].astype(int)  # True -> 1, False -> 0
    df_fights.drop(['R_fighter', 'B_fighter', 'date'], axis=1, inplace=True)
    return df_fights

df_train = build_df(df, fighters, 0)
df_test = build_df(df, fighters, 1)

In [None]:
df_train.head(5)

In [None]:
print(df_train.shape)
print(df_test.shape)

The train and test sets are pretty well balanced. We don't need to apply sampling techniques here.

In [None]:
print(len(df_train[df_train['Winner'] == 'Blue']))
print(len(df_train[df_train['Winner'] == 'Red']))
print(len(df_test[df_test['Winner'] == 'Blue']))
print(len(df_test[df_test['Winner'] == 'Red']))

Let's create a column transformer to encode each categorical feature.  
We use an ordinal encoder rather than a one-hot encoder, because one-hot encoding tends to lower the accuracy of a Random Forest model when the number of dummy features gets high. Here is a good explanation why: https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769.  
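To see the difference in shape between the two encodings, here is a small sketch that uses pandas as a stand-in for the scikit-learn encoders (it is not the transformer used in this notebook). The weight classes listed are just sample values.

```python
import pandas as pd

weight_class = pd.Series(["Lightweight", "Welterweight", "Heavyweight", "Lightweight"])

# Ordinal-style encoding: a single integer column, codes follow the sorted category names
ordinal = weight_class.astype("category").cat.codes
# One-hot encoding: one column per distinct category
one_hot = pd.get_dummies(weight_class)

print(ordinal.tolist())
print(one_hot.shape)
```

With many categories, the one-hot version spreads the information of one feature over many sparse columns, which is the situation described in the linked article.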

In [None]:
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.compose import make_column_transformer

preprocessor = make_column_transformer((OrdinalEncoder(), ['weight_class', 'B_Stance', 'R_Stance']), remainder='passthrough')

# If the winner is from the Red corner, Winner label will be encoded as 1, otherwise it will be 0 (Blue corner)
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(df_train['Winner'])
y_test = label_encoder.transform(df_test['Winner'])

X_train, X_test = df_train.drop(['Winner'], axis=1), df_test.drop(['Winner'], axis=1)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

# Random Forest Model

Random Forest is a tree-based model and hence does not require feature scaling. Its computations are not based on distances (Euclidean or otherwise) but on per-feature thresholds, so normalizing the data is unnecessary.
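As a quick sanity check of this claim, the sketch below fits a single decision tree (used here as a stand-in for the forest) on invented data, once on the raw features and once after rescaling and shifting them. The data and the scaling factors are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label depends only on the first feature
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [7.0, 2.0], [8.0, 5.0], [9.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[2.5, 3.0], [7.5, 3.0]])

# Rescale each feature with a positive factor and a shift (the order of the values is preserved)
scale, shift = np.array([100.0, 0.01]), np.array([-50.0, 3.0])
X_scaled, X_new_scaled = X * scale + shift, X_new * scale + shift

pred_raw = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X_new)
pred_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y).predict(X_new_scaled)
print(pred_raw, pred_scaled)
```

Both trees make the same predictions, since a split only depends on the ordering of the values of a feature and not on their magnitude.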

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score

# Random Forest composed of 100 decision trees. We optimized the parameters with a grid search combined with cross-validation
random_forest = RandomForestClassifier(n_estimators=100, 
                                       criterion='entropy', 
                                       max_depth=10, 
                                       min_samples_split=2,
                                       min_samples_leaf=1, 
                                       random_state=0)

model = Pipeline([('encoding', preprocessor), ('random_forest', random_forest)])
model.fit(X_train, y_train)

# We use 5-fold cross-validation to get a more stable accuracy estimate (lower variance)
accuracies = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5)
print('Accuracy mean : ', accuracies.mean())
print('Accuracy standard deviation : ', accuracies.std())

y_pred = model.predict(X_test)
print('Testing accuracy : ', accuracy_score(y_test, y_pred), '\n')

target_names = ["Blue","Red"]
print(classification_report(y_test, y_pred, labels=[0,1], target_names=target_names))

We get a mean accuracy of 0.53 with cross_val_score and an f1-score of 0.7 on the test set. The gap between the cross-validation and test scores may be due to the model doing well on this particular test set without generalizing well. At least the mean cross-validated accuracy is higher than a random choice (0.5), which is a good point as MMA is a very uncertain sport. 
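Keep in mind that accuracy and f1-score are different metrics, so they are not directly comparable. The sketch below computes both from hypothetical confusion counts (these numbers are invented and are not the notebook's results), with Red taken as the positive class.

```python
# Hypothetical confusion counts for the positive class (Red); not taken from the notebook's output
tp, fp, fn, tn = 60, 30, 20, 10

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(f1, 3))
```

With these counts the f1-score of the positive class is clearly higher than the accuracy, because true negatives count towards accuracy but not towards f1.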

In [None]:
#from sklearn.model_selection import GridSearchCV
#parameters = [{'random_forest__n_estimators': [10, 50, 100, 500, 1000],
#               'random_forest__criterion': ['gini', 'entropy'],
#               'random_forest__max_depth': [5, 10, 50],
#               'random_forest__min_samples_split': [2, 3, 4],
#               'random_forest__min_samples_leaf': [1, 2, 3],
#              }]
#model = Pipeline([('encoding', preprocessor), ('random_forest', RandomForestClassifier())])

#grid_search = GridSearchCV(estimator=model, param_grid=parameters, scoring='accuracy', cv=5, n_jobs=-1)
#grid_search = grid_search.fit(X_train, y_train)
#best_accuracy = grid_search.best_score_

#best_params = grid_search.best_params_
#print('Best accuracy : ', best_accuracy)
#print('Best parameters : ', best_params)

Recall that Blue => 0 and Red => 1 for the target value (Winner).
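This mapping comes from the fact that LabelEncoder assigns integer codes following the sorted order of the class labels. A pure-Python sketch that mimics this behaviour on hypothetical Winner values:

```python
labels = ["Red", "Blue", "Red", "Red", "Blue"]  # hypothetical Winner values

# Sorted unique labels receive the codes 0, 1, ... in that order
classes = sorted(set(labels))
mapping = {name: code for code, name in enumerate(classes)}
encoded = [mapping[name] for name in labels]
print(mapping, encoded)
```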

In [None]:
from sklearn.metrics import confusion_matrix

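# With Blue = 0 (negative) and Red = 1 (positive), scikit-learn lays out the confusion matrix as:
# [[TN FP]
#  [FN TP]]
# rows are the actual classes and columns are the predicted classes
cm = confusion_matrix(y_test, y_pred) 
ax = plt.subplot()
sns.heatmap(cm, annot = True, ax = ax, fmt = "d")
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')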
ax.set_title("Confusion Matrix")
ax.xaxis.set_ticklabels(['Blue', 'Red'])
ax.yaxis.set_ticklabels(['Blue', 'Red'])

In [None]:
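# The column transformer outputs the encoded columns first, followed by the remaining columns in their original order,
# so we rebuild the feature names in that same order to match feature_importances_
encoded_cols = ['weight_class', 'B_Stance', 'R_Stance']
feature_names = encoded_cols + [col for col in X_train if col not in encoded_cols]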
feature_importances = model['random_forest'].feature_importances_
indices = np.argsort(feature_importances)[::-1]
n = 30 # maximum feature importances displayed
idx = indices[0:n] 
std = np.std([tree.feature_importances_ for tree in model['random_forest'].estimators_], axis=0)

#for f in range(n):
#    print("%d. feature %s (%f)" % (f + 1, feature_names[idx[f]], feature_importances[idx[f]])) 

plt.figure(figsize=(30, 8))
plt.title("Feature importances")
plt.bar(range(n), feature_importances[idx], color="r", yerr=std[idx], align="center")
plt.xticks(range(n), [feature_names[i] for i in idx], rotation = 45) 
plt.xlim([-1, n]) 
plt.show()

No single feature is significant enough to explain the model's results by itself. Almost all variables have an impact on the model, even if only a very small one.  
But we can notice that age, takedown attempts, the current losing streak (a fighter on a long losing streak has a higher chance of losing again), clinch strikes landed and the significant strikes landed by opponents play a slightly bigger role, which makes sense.  


Let's now visualize how a single Decision Tree of the Forest is built. 

In [None]:
from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image

tree_estimator = model['random_forest'].estimators_[10]
export_graphviz(tree_estimator, 
                out_file='tree.dot', 
                filled=True, 
                rounded=True)
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])
Image(filename = 'tree.png')

## Predictions

Let's make predictions for the next UFC event, UFC 245 (15/12/2019), whose main card features a much-anticipated fight between Kamaru Usman and Colby Covington.

In [None]:
def predict(df, pipeline, blue_fighter, red_fighter, weightclass, rounds, title_bout=False): 
    
    # We build two one-row dataframes, one per fighter, holding the fighter's most recent fight
    f1 = df[(df['R_fighter'] == blue_fighter) | (df['B_fighter'] == blue_fighter)].reset_index(drop=True)[:1]
    f2 = df[(df['R_fighter'] == red_fighter) | (df['B_fighter'] == red_fighter)].reset_index(drop=True)[:1]
    
    # Depending on the corner the fighter had in that last fight, we keep only his own statistics (not the opponent's),
    # then we rename the columns with re.sub() to match the corner given in the parameters
    if f1.loc[0, 'R_fighter'] == blue_fighter:
        result1 = f1.filter(regex='^R_', axis=1).copy()  # he was in the red corner: we keep the red corner stats
        result1.rename(columns = lambda x: re.sub('^R_', 'B_', x), inplace=True)  # "B_" prefix because he is now in the blue corner
    else: 
        result1 = f1.filter(regex='^B_', axis=1).copy()
    if f2.loc[0, 'R_fighter'] == red_fighter:
        result2 = f2.filter(regex='^R_', axis=1).copy()
    else:
        result2 = f2.filter(regex='^B_', axis=1).copy()
        result2.rename(columns = lambda x: re.sub('^B_', 'R_', x), inplace=True)
        
    fight = pd.concat([result1, result2], axis = 1)  # we concatenate the blue and red fighter dataframes (in columns)
    fight.drop(['R_fighter', 'B_fighter'], axis = 1, inplace = True)  # we remove fighter names
    fight.insert(0, 'title_bout', int(title_bout))  # we add title_bout (encoded as 0/1), weight class and number of rounds
    fight.insert(1, 'weight_class', weightclass)
    fight.insert(2, 'no_of_rounds', rounds)
    
    # We align the column order with the training data when the pipeline exposes it (scikit-learn >= 1.0)
    if hasattr(pipeline, 'feature_names_in_'):
        fight = fight[list(pipeline.feature_names_in_)]
    
    pred = pipeline.predict(fight)
    proba = pipeline.predict_proba(fight)
    if pred[0] == 1:  # 1 means the Red corner wins
        print("The predicted winner is", red_fighter, 'with a probability of', round(proba[0][1] * 100, 2), "%")
    else:
        print("The predicted winner is", blue_fighter, 'with a probability of', round(proba[0][0] * 100, 2), "%")
    return proba

In [None]:
predict(df, model, 'Kamaru Usman', 'Colby Covington', 'Welterweight', 5, True) 

In [None]:
predict(df, model, 'Max Holloway', 'Alexander Volkanovski', 'Featherweight', 5, True) 

In [None]:
predict(df, model, 'Amanda Nunes', 'Germaine de Randamie', "Women's Bantamweight", 5, True)

In [None]:
predict(df, model, 'Jose Aldo', 'Marlon Moraes', 'Bantamweight', 3, False)

In [None]:
predict(df, model, 'Urijah Faber', 'Petr Yan', 'Bantamweight', 3, False)