In [8]:
# Fight Picker
## Tyler Aden

## Introduction

This is the workflow for a fight prediction application. This notebook includes:

* Introduction
    * Background
    * Objective
* Data Requirements
    * Web Scraping
    * Data Cleansing and Blending
    * Finalized Dataset
* Libraries Used
* EDA
    * Statistical Overview
    * Heatmap and Correlation
    * Statistical Tests
    * Distribution Plots
* Feature Selection
    * Feature Significance
* Modelling the Data
    * Logistic Regression
    * Random Forest
    * Neural Network
* Conclusion
* Improvements
* Citation

## Background

This project is heavily influenced by [MMA Predictor](https://github.com/jasonchanhku/UFC-MMA-Predictor/), which provides much of the framework for what I'm trying to accomplish. However, I think a few key elements of the original project can be expanded and improved. Generally, the idea is to generate more realistic betting odds between a **Favourite** and an **Underdog**, to gain an edge over Vegas, and to discover insights that experts may have missed.

## Objective
The objective of this project is to build a model that:
* Predicts with better than 50% accuracy
* Predicts better than always choosing the favourite

Further analysis of betting odds versus model performance will also be done.
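
The second target can be made concrete as a baseline accuracy: the share of fights the favourite actually won. The snippet below is a minimal sketch, assuming a `Label` column whose values are `Favourite` or `Underdog`, as in the fight data previewed later in this notebook. The helper name `favourite_baseline_accuracy` and the toy labels are hypothetical, made up for illustration.

```python
import pandas as pd


def favourite_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy of a strategy that always picks the favourite.

    This equals the fraction of fights labelled 'Favourite'.
    """
    return float((labels == "Favourite").mean())


# Toy labels, invented for illustration (not real results)
toy_labels = pd.Series(["Favourite", "Favourite", "Underdog", "Favourite"])
baseline = favourite_baseline_accuracy(toy_labels)
print(f"Always-favourite baseline accuracy: {baseline:.2f}")
```

Any model accuracy would then be compared against both 0.50 and this baseline, computed on the same set of fights.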

## Data Requirements

Two datasets are needed, both scraped from public sources:

### UFC Fighters Database

This [dataset](http://www.fightmetric.com/statistics/fighters) contains fight stats for all UFC fighters.

### Dataset preview

In [9]:
import pandas as pd
fighter_db = pd.read_csv('https://raw.githubusercontent.com/tkaden/FightPicker/master/data/UFC_Fighters_Database.csv')
fighter_db.head()

Unnamed: 0,NAME,Weight,WeightClass,REACH,SLPM,SAPM,STRA,STRD,TD,TDA,TDD,SUBA
0,Tom Aaron,155,lightweight,71,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Danny Abbadi,155,lightweight,71,3.29,4.41,0.38,0.57,0.0,0.0,0.77,0.0
2,David Abbott,265,heavyweight,77,1.35,3.55,0.3,0.38,1.07,0.33,0.66,0.0
3,Shamil Abdurakhimov,235,heavyweight,76,2.53,2.53,0.45,0.58,1.36,0.22,0.77,0.3
4,Hiroyuki Abe,145,featherweight,70,1.71,3.11,0.36,0.63,0.0,0.0,0.33,0.0


### Fight History

We also need a dataset that contains the fight odds, as they will be one of the most important variables in our analysis.

Odds are available at www.betmma.tips, which is also a highly valuable resource for statistics. fightmetric.com is another great resource.

Source(s):
* http://www.fightmetric.com/statistics/events/completed
* http://www.betmma.tips/mma_betting_favorites_vs_underdogs.php

This project will also include web scraping code to gather this data and clean it.

In [10]:
fights_db = pd.read_csv('https://raw.githubusercontent.com/tkaden/FightPicker/master/data/UFC_Fights.csv')
fights_db.head()

Unnamed: 0,RecordID,Events,Fighter1,Fighter2,Winner,fighter1_odds,fighter2_odds,F1 or F2,Label,Combine,Favourite,Underdog
0,1,UFC 159 - Jones vs. Sonnen,Jon Jones,Chael Sonnen,Jon Jones,1.13,9.0,1,Favourite,Favourite 1,Jon Jones,Chael Sonnen
1,2,UFC 159 - Jones vs. Sonnen,Michael Bisping,Alan Belcher,Michael Bisping,1.57,4.5,1,Favourite,Favourite 1,Michael Bisping,Alan Belcher
2,3,UFC 159 - Jones vs. Sonnen,Roy Nelson,Cheick Kongo,Roy Nelson,1.43,3.2,1,Favourite,Favourite 1,Roy Nelson,Cheick Kongo
3,4,UFC 159 - Jones vs. Sonnen,Phil Davis,Vinny Magalhaes,Phil Davis,1.36,3.55,1,Favourite,Favourite 1,Phil Davis,Vinny Magalhaes
4,5,UFC 159 - Jones vs. Sonnen,Pat Healy,Jim Miller,Pat Healy,3.4,1.4,1,Underdog,Underdog 1,Jim Miller,Pat Healy


## Data Cleansing and Blending

The two datasets above were cleansed and blended together using the following process.

### Feature Mapping

Note that each feature `x` is the difference between the Favourite's value and the Underdog's value. Hence, if the feature is positive, the favourite has the higher value for that feature. For most features this is an advantage over the underdog, though not where lower is better, such as strikes absorbed (SAPM).



$\text{Feature } X_i = X_{\text{favourite}} - X_{\text{underdog}}$
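
The mapping above can be sketched in pandas as follows. This is an illustrative sketch rather than the cleansing code actually used for this project: the helper name `add_deltas` and the toy values are hypothetical, and it assumes fighter names are unique in the fighter table. Only the column names (`NAME`, `Favourite`, `Underdog`, `REACH`, `SLPM`) follow the datasets previewed above.

```python
import pandas as pd

# Toy stand-ins for the two datasets; all values are made up
fighters = pd.DataFrame({
    "NAME": ["Fighter A", "Fighter B"],
    "REACH": [76, 72],
    "SLPM": [4.0, 3.0],
})
fights = pd.DataFrame({
    "Favourite": ["Fighter A"],
    "Underdog": ["Fighter B"],
    "Label": ["Favourite"],
})


def add_deltas(fights: pd.DataFrame, fighters: pd.DataFrame, stat_cols: list) -> pd.DataFrame:
    """Append <stat>_delta columns computed as favourite minus underdog."""
    # Index stats by fighter name (assumes names are unique)
    stats = fighters.set_index("NAME")[stat_cols]
    # Look up each side's stats; fighters missing from the table yield NaN
    fav = stats.reindex(fights["Favourite"]).to_numpy()
    dog = stats.reindex(fights["Underdog"]).to_numpy()
    deltas = pd.DataFrame(
        fav - dog,
        columns=[f"{col}_delta" for col in stat_cols],
        index=fights.index,
    )
    return pd.concat([fights, deltas], axis=1)


blended = add_deltas(fights, fighters, ["REACH", "SLPM"])
print(blended)
```

Here `REACH_delta` is 76 - 72 = 4 and `SLPM_delta` is 4.0 - 3.0 = 1.0, so both are positive because the toy favourite has the higher value on each stat.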

## Finalized Dataset

The following are the response variable and the 10 features used in the dataset. Note that each feature has a suffix of **delta** because it has undergone the feature mapping described above.

* Label - The response variable: whether the Favourite or the Underdog won
* REACH - Fighter's reach (probably the least important feature)
* SLPM - Significant Strikes Landed per Minute
* STRA - Significant Striking Accuracy
* SAPM - Significant Strikes Absorbed per Minute
* STRD - Significant Strike Defence (the % of the opponent's strikes that did not land)
* TD - Average Takedowns Landed per 15 minutes
* TDA - Takedown Accuracy
* TDD - Takedown Defence (the % of the opponent's takedown attempts that did not land)
* SUBA - Average Submissions Attempted per 15 minutes
* Odds - Difference between the two fighters' decimal odds for that specific matchup
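
As a worked example of the odds feature, the sketch below applies the same favourite-minus-underdog mapping to the decimal odds of Jon Jones (1.13) and Chael Sonnen (9.0) from the fight history preview. It also converts decimal odds to an implied win probability using the standard relation `1 / decimal_odds`, which ignores the bookmaker's margin. Both function names are hypothetical helpers, not part of the project code.

```python
def odds_delta(favourite_odds: float, underdog_odds: float) -> float:
    """Favourite's decimal odds minus the underdog's, rounded to 2 decimals."""
    return round(favourite_odds - underdog_odds, 2)


def implied_probability(decimal_odds: float) -> float:
    """Win probability implied by decimal odds, ignoring the bookmaker's margin."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds are expected to be greater than 1")
    return 1.0 / decimal_odds


jones_vs_sonnen = odds_delta(1.13, 9.0)
print(jones_vs_sonnen)  # -7.87

# Implied probabilities for each side of the same matchup
print(round(implied_probability(1.13), 3), round(implied_probability(9.0), 3))
```

The resulting -7.87 matches the `Odds_delta` shown for that fight in the cleansed data. A large negative value means the favourite's odds are much shorter than the underdog's.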

In [11]:
df = pd.read_csv('https://raw.githubusercontent.com/tkaden/FightPicker/master/data/Cleansed_Data.csv')
df = df.drop('Sum_delta', axis=1)
df.head()

Unnamed: 0,Events,Favourite,Underdog,Label,REACH_delta,SLPM_delta,SAPM_delta,STRA_delta,STRD_delta,TD_delta,TDA_delta,TDD_delta,SUBA_delta,Odds_delta
0,UFC 159 - Jones vs. Sonnen,Jon Jones,Chael Sonnen,Favourite,10,1.17,0.9,0.12,0.03,-1.56,-0.07,0.28,0.2,-7.87
1,UFC 159 - Jones vs. Sonnen,Leonard Garcia,Cody McKenzie,Underdog,-3,1.03,2.29,-0.1,-0.15,-2.2,0.01,0.28,-2.0,1.4
2,UFC Fight Night 34 - Saffiedine vs. Lim,Mairbek Taisumov,Tae Hyun Bang,Favourite,2,0.54,0.08,0.05,-0.05,1.75,0.44,0.28,-0.5,-2.89
3,UFC Fight Night 91 - McDonald vs. Lineker,Cody Pfister,Scott Holtzman,Underdog,4,-3.15,-0.85,-0.24,-0.06,0.55,-0.27,-0.58,-0.4,6.89
4,UFC Fight Night 91 - McDonald vs. Lineker,Matthew Lopez,Rani Yahya,Underdog,2,0.02,0.86,0.13,-0.06,-0.08,0.51,0.37,-0.5,0.81


## Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
import scipy.stats as stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score, cross_val_predict
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score, classification_report, make_scorer, accuracy_score
import warnings
import time
warnings.filterwarnings('ignore')
%matplotlib inline

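# Progress bar
def log_progress(sequence, every=None, size=None, name='Items'):
    """Yield items from `sequence` while displaying an ipywidgets progress bar.

    `every` sets how often the bar is refreshed; it must be given when
    `sequence` has no length (i.e. it is an iterator) and `size` is not set.
    """
    from ipywidgets import IntProgress, HTML, VBox
    from IPython.display import display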

    is_iterator = False
    if size is None:
        try:
            size = len(sequence)
        except TypeError:
            is_iterator = True
    if size is not None:
        if every is None:
            if size <= 200:
                every = 1
            else:
                every = int(size / 200)     # every 0.5%
    else:
        assert every is not None, 'sequence is iterator, set every'

    if is_iterator:
        progress = IntProgress(min=0, max=1, value=1)
        progress.bar_style = 'info'
    else:
        progress = IntProgress(min=0, max=size, value=0)
    label = HTML()
    box = VBox(children=[label, progress])
    display(box)

    index = 0
    try:
        for index, record in enumerate(sequence, 1):
            if index == 1 or index % every == 0:
                if is_iterator:
                    label.value = '{name}: {index} / ?'.format(
                        name=name,
                        index=index
                    )
                else:
                    progress.value = index
                    label.value = u'{name}: {index} / {size}'.format(
                        name=name,
                        index=index,
                        size=size
                    )
            yield record
    except:
        progress.bar_style = 'danger'
        raise
    else:
        progress.bar_style = 'success'
        progress.value = index
        label.value = "{name}: {index}".format(
            name=name,
            index=str(index or '?')
        )

# Creating Dummies
def create_dummies(df, column_name):
    """Create dummy columns (one-hot encoding) from a single column

    Usage
    ------

    train = create_dummies(train,"Age")
    """
    dummies = pd.get_dummies(df[column_name],prefix=column_name)
    df = pd.concat([df,dummies],axis=1)
    return df