<a href="https://colab.research.google.com/github/wrcarpenter/MMA-Handicapping-Model/blob/main/Code/UFC_Bout_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing

In [32]:
#Importing
import requests
import csv 
import re
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
from google.colab import files
import datetime
from datetime import date
from pytz import timezone
eastern = timezone('US/Eastern')
import threading 
from concurrent.futures import ThreadPoolExecutor
import multiprocessing
cores = multiprocessing.cpu_count()

# Read in Dataset


In [22]:
source = 'https://raw.githubusercontent.com/wrcarpenter/MMA-Handicapping-Model/main/Data/ufcBouts_v3.csv'  
df      = pd.read_csv(source, header=0) 
df_orig = df.copy()  # preserve an independent copy of the raw data, just in case

# Data Summary 

Review data composition for general understanding and debugging purposes. The current construction of the raw dataset should contain 29 columns. 
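
The checks described above can also be collected into one small helper so they are easy to rerun after each scrape. This is a minimal sketch, not the notebook's own method: `summarize` and the toy frame are hypothetical, and only the `name` and `fight_link` column names come from this notebook.

```python
import pandas as pd

EXPECTED_COLUMNS = 29  # column count stated in the text above


def summarize(frame: pd.DataFrame, expected_cols: int = EXPECTED_COLUMNS) -> dict:
    """Return basic composition facts for a bout-level dataset."""
    return {
        "n_rows": len(frame),
        "n_cols": frame.shape[1],
        "cols_ok": frame.shape[1] == expected_cols,
        "unique_fighters": frame["name"].nunique(),
        "unique_fights": frame["fight_link"].nunique(),
    }


# Toy data (made up): two bouts, each recorded once per fighter.
toy = pd.DataFrame({
    "name": ["A", "B", "A", "C"],
    "fight_link": ["f1", "f1", "f2", "f2"],
})
summary = summarize(toy, expected_cols=2)
print(summary)
```

On the real data the call would be `summarize(df)`. A `cols_ok` value of `False` suggests the scrape changed shape and the cleaning steps below may need another look.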

In [None]:
# Explore the dataset 
print(df.columns)
# Columns in dataset 
print(df.shape[1]) # raw dataset should have 29 columns 
# Unique fighters
print("Number of Unique Fighters: ", len(pd.unique(df['name'])))  # unique fighters
# Number of fights (rows in dataset)
print("Total Fights Recorded: ", len(df))
# Number of unique fights in the dataset 
print("Number of Unique Fights: ", len(pd.unique(df['fight_link'])))  


# Create Variables 

First, decide which fights in the dataset are upcoming and which are completed. The safest approach is probably to keep only completed fights, where all stats are present, because it is unclear to me how the records for newer fights get updated.
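
One possible way to separate completed fights from upcoming ones is sketched here. It assumes, without confirmation from the data, that an upcoming bout either has an event date on or after a chosen cutoff or has no recorded `fighter_result`. The helper `keep_completed`, its `as_of` parameter, and the toy rows are hypothetical.

```python
import pandas as pd


def keep_completed(frame: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Keep rows whose event is before `as_of` and whose result is recorded."""
    event_date = pd.to_datetime(frame["event_date"], errors="coerce")
    result = frame["fighter_result"]
    has_result = result.notna() & (result.astype(str).str.strip() != "")
    # Unparseable dates become NaT, compare as False, and are dropped too.
    return frame[(event_date < pd.Timestamp(as_of)) & has_result].copy()


# Toy data (made up): two completed bouts and one scheduled bout.
toy = pd.DataFrame({
    "event_date": ["2020-01-18", "2021-07-10", "2999-01-01"],
    "fighter_result": ["W", "L", None],
})
completed = keep_completed(toy, as_of="2024-01-01")
print(completed)
```

Requiring both conditions is deliberately conservative: a past event with no result, or a future event with a placeholder result, is excluded either way.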

Need to handle missing data: birthdays, reach, height, etc.
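
Before dropping rows, it can help to see how much is missing in each field. The sketch below converts the `"--"` placeholder (the marker used for missing birthdays in the cell further down) into `NaN` and counts the gaps per column. The `reach` column name and the toy values are assumptions for illustration, and `mark_missing` is a hypothetical helper.

```python
import pandas as pd


def mark_missing(frame: pd.DataFrame, columns: list, placeholder: str = "--") -> pd.DataFrame:
    """Return a copy where `placeholder` entries in `columns` are set to NaN."""
    out = frame.copy()
    # mask() replaces every cell where the condition is True with NaN.
    out[columns] = out[columns].mask(out[columns].eq(placeholder))
    return out


# Toy data (made up): one missing birthday, two missing reach values.
toy = pd.DataFrame({
    "dob": ["01/15/1990", "--", "03/02/1988"],
    "reach": ["84", "--", "--"],
})
cleaned = mark_missing(toy, ["dob", "reach"])
missing_counts = cleaned[["dob", "reach"]].isna().sum()
print(missing_counts)
```

Keeping the rows and marking the gaps leaves the choice between dropping and imputing open for later.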

Clean the data for each variable, then begin the calculations that generate inputs for the model.

In [None]:
# Eliminate any missing DOB (can fill this in later)
df = df[df['dob'] != "--"].copy()  # drop rows with a missing birthday; .copy() avoids SettingWithCopyWarning below
print(len(df_orig) - len(df))  # number of rows dropped (865 obs currently)
# Parse dob
df['dob'] = df['dob'].replace(',', '', regex=True)  # remove commas
df['dob'] = df['dob'].replace('--', '', regex=True)  # remove any remaining '--' placeholders
df['dob'] = pd.to_datetime(df['dob'], format="%m/%d/%Y")
# Parse event date
df['event_date'] = df['event_date'].replace(',', '', regex=True)  # remove commas
df['event_date'] = pd.to_datetime(df['event_date'], format="%m/%d/%Y")
# Sort by fighter, then by event date
df = df.sort_values(by=['fighter_profile', 'event_date'], ascending=True)
# Show results
# df.loc[df['name'] == 'Jon Jones', ['event_date', 'event', 'total_fights']]

## Variable Creation

* Height
* Reach
* Years on roster (i.e., years since first recorded fight)
* Current age
* Number of previous recorded fights
* Weeks since last fight
* Result of last fight
* Result of second-to-last fight
* Current win streak?
* Current loss streak?
* Longest loss streak
* Longest win streak
* Lost by KO in last 4 fights
* Ever lost by KO?
* Ever lost by submission?
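
Several of the history variables in this list can be built with grouped shifts, so that each row only sees fights that happened before it. The following is a minimal sketch under stated assumptions rather than the notebook's implementation: `add_history_features`, the output column names, and the toy data are hypothetical. The input columns `fighter_profile`, `event_date`, `fighter_result`, and `method`, and the KO method labels, are the ones used elsewhere in this notebook.

```python
import pandas as pd

# Method labels counted as a knockout, matching the 'ko' flag in this notebook.
KO_METHODS = ["KO/TKO", "TKO - Doctor's Stoppage", "Could Not Continue"]


def add_history_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Add per-fighter features that use only fights before the current row."""
    out = frame.sort_values(["fighter_profile", "event_date"]).reset_index(drop=True)
    fighter = out["fighter_profile"]

    # Results of the previous two fights (NaN when there is no such fight).
    out["last_result"] = out.groupby("fighter_profile")["fighter_result"].shift(1)
    out["second_last_result"] = out.groupby("fighter_profile")["fighter_result"].shift(2)

    # Win streak entering the fight: every non-win starts a new block,
    # wins are counted within each block, then shifted by one fight.
    won = out["fighter_result"].eq("W").astype(int)
    block = (1 - won).groupby(fighter).cumsum()
    streak_after = won.groupby([fighter, block]).cumsum()
    out["win_streak"] = streak_after.groupby(fighter).shift(1).fillna(0).astype(int)

    # KO losses, shifted so the current fight is never counted.
    ko_loss = (out["fighter_result"].eq("L") & out["method"].isin(KO_METHODS)).astype(int)
    prior_ko_loss = ko_loss.groupby(fighter).shift(1)
    out["ko_loss_last4"] = (
        prior_ko_loss.groupby(fighter)
        .transform(lambda s: s.rolling(4, min_periods=1).max())
        .fillna(0)
        .astype(int)
    )
    out["ever_ko_loss"] = prior_ko_loss.fillna(0).groupby(fighter).cummax().astype(int)
    return out


# Toy data (made up): fighter "a" has seven bouts, fighter "b" has two.
toy = pd.DataFrame({
    "fighter_profile": ["a"] * 7 + ["b"] * 2,
    "event_date": pd.to_datetime([
        "2015-01-01", "2016-01-01", "2017-01-01", "2018-01-01",
        "2019-01-01", "2020-01-01", "2021-01-01",
        "2019-06-01", "2020-06-01",
    ]),
    "fighter_result": ["W", "L", "W", "W", "W", "W", "W", "L", "W"],
    "method": ["Decision", "KO/TKO", "Submission", "KO/TKO", "Decision",
               "Decision", "Decision", "KO/TKO", "Decision"],
})
features = add_history_features(toy)
print(features[["fighter_profile", "win_streak", "ko_loss_last4", "ever_ko_loss"]])
```

Shifting before aggregating is the key design choice: it keeps the outcome of the fight being predicted out of its own features. Loss streaks and submission losses would presumably follow the same pattern with the condition swapped.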

# Other data 

UFC Stats provides additional performance data that is not necessarily available from other sites. This is in-fight data such as number of knockdowns, total significant strikes, etc.

At the moment, this kind of data is not used in the model. The main focus is accumulating data from other sites (such as Tapology) to enhance testing.

In [25]:
# Current age of fighter at the event date (years)
# Recent pandas versions reject np.timedelta64(1, 'Y'), so convert days to years instead
df['current_age'] = (df['event_date'] - df['dob']).dt.days / 365.25

# Number of previous recorded fights in data 
df['ones'] = 1
df['prev_num_fights'] = df.groupby(['fighter_profile'])['ones'].cumsum() - 1
# df.drop(columns=['total_fights'])

# Weeks since previous fight
df['wks_since_last_fight'] = df.groupby(['fighter_profile'])['event_date'].diff()  # total time between date values
df['wks_since_last_fight'] = df['wks_since_last_fight'] / np.timedelta64(1, 'W')   # this includes some NaN 
 
# Result of last fight 
# Need to break down the results into some kind of numeric encoding 

# Main fight results (binary variable)
df['win']        = df['fighter_result'].eq('W').astype(int)
df['no_contest'] = df['fighter_result'].eq('NC').astype(int)
df['draw']       = df['fighter_result'].eq('D').astype(int)
# Granular result metrics (if betting on KO win or something else)
# there are some other fields included in these results
df['ko']    = df['method'].isin(['KO/TKO', "TKO - Doctor's Stoppage", 'Could Not Continue']).astype(int)
df['sub']   = df['method'].eq('Submission').astype(int)
df['dec']   = df['method'].isin(['Decision - Unanimous', 'Decision - Split', 'Decision']).astype(int)

# Result of second to last fight  
# KOed in last four fights?    true or false
# Subed in last four fights?   true or false

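# Testing results 
# display() is needed for the first frame because only a cell's last expression is shown
display(df.loc[df['name'] == 'Jon Jones', ['event_date', 'event', 'prev_num_fights']])
# 'ko' is the column created above; 'ko_win' does not exist and would raise a KeyError
df.loc[df['name'] == 'Conor McGregor', ['event_date', 'event', 'prev_num_fights', 'wks_since_last_fight', 'win', 'ko']]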


In [None]:
# print() the first two so all three outputs are shown, not just the last expression
print(df['fighter_result'].value_counts())  # win, loss, no-contest, draw
print(df['method'].value_counts())
df.columns

## Exporting Data


In [None]:
# Mount Google Drive so the /content/drive paths below exist
from google.colab import drive
drive.mount('/content/drive')
# Write the model dataset to Drive
df.to_excel('/content/drive/MyDrive/MMA Model/Data/ufcbouts_model_v1.xlsx')
df.to_csv('/content/drive/MyDrive/MMA Model/Data/ufcbouts_model_v1.csv')
# Download files if needed 
files.download('/content/drive/MyDrive/MMA Model/Data/ufcbouts_model_v1.csv')
files.download('/content/drive/MyDrive/MMA Model/Data/ufcbouts_model_v1.xlsx')