# 3. Merging UFCStats Files


This notebook merges the per-fight statistics files scraped from ufcstats.com into a single table. The result is the dataset later used to train and test machine learning models that predict the outcome of UFC fights.

Here's how it works, broken down into sections:

1. **Importing necessary libraries**: The first cell imports the packages used across the project, including pandas, numpy, matplotlib, seaborn, and os. Only pandas, numpy, and os are actually used in this section.

2. **Setting the working directory**: `os.chdir` moves to the project folder so that the relative `data/...` paths used below resolve correctly.

3. **Loading the data files**: Every .csv file in the `fight_totals3` and `sig_strikes3` subfolders is read. The first set holds the overall totals for each fight and the second holds the significant-strike breakdown (head, body, leg, distance, clinch, ground). The files in each folder are concatenated into one data frame (`fight_totals` and `sig_strike_agg`), with a `fight_id` column taken from the file name to identify the specific fight.

4. **Data Cleaning**: Before the merge, the leftover `Unnamed: 0` column is dropped from both data frames and the duplicate `details` column is dropped from `sig_strike_agg`, so that the columns the two frames still share can serve as merge keys. After the merge, rows with a missing `Fighter_B` or `Winner` are dropped, the remaining missing values are filled with zero, and infinite values are replaced with zero.

5. **Merge the dataframes**: The two dataframes are inner-joined on their common columns (`event_url`, `Fighter_A`, `Fighter_B`, `event_title`, `fight_id`) to form a new dataframe `fight_DF`.

6. **Saving the dataframes**: Interim and final dataframes are saved to .csv files for future use.

The outputs are three .csv files under `data/final/aggregates`: `All_Fight_Totals.csv`, `All_Sig_Strikes.csv`, and the cleaned, merged `Fight_DF.csv`, which is ready for further analysis or model training.

The import cell also pulls in Selenium and BeautifulSoup for web scraping, but nothing in this section uses them. They are presumably needed by other notebooks in the project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as mtick
import sqlite3
import seaborn as sns
from matplotlib.pyplot import figure
from bs4 import BeautifulSoup
import time
import requests
import shutil
import datetime
from scipy.stats import norm
import warnings
warnings.filterwarnings('ignore')
import json
from random import randint
import random
import os
os.chdir('/Users/travisroyce/Library/CloudStorage/OneDrive-Personal/Data Science/Personal_Projects/Sports/UFC_Prediction_V2')
from cmath import nan
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
import pickle
from sklearn.metrics import fbeta_score

In [2]:
# List the fight_totals files and sig_strike files, both from ufcstats.com
fight_totals_files = os.listdir('data/ufc_stats/fight_totals3')
sig_strike_files = os.listdir('data/ufc_stats/sig_strikes3')
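
`os.listdir` returns every entry in a folder, not only the .csv files, and the later cells recover the fight id by slicing a fixed number of characters off the file name. A minimal sketch of a more explicit version follows. `fight_id_from_filename` and `csv_files` are hypothetical helpers, not part of the original notebook, and the example file names are taken from the output printed later in this section.

```python
from pathlib import Path


def csv_files(names):
    """Keep only .csv entries and return them in a stable, sorted order."""
    return sorted(name for name in names if name.endswith('.csv'))


def fight_id_from_filename(filename, suffix):
    """Return the fight id encoded in a name such as '<id>_totals.csv'."""
    stem = Path(filename).stem  # drop the '.csv' extension
    if not stem.endswith(suffix):
        raise ValueError(f'unexpected file name: {filename}')
    return stem[:-len(suffix)]


# '.DS_Store' stands in for any non-CSV entry a folder listing might contain
example_names = ['89820fae001dd151_totals.csv', '.DS_Store', 'df33799f117000cb_totals.csv']
ids = [fight_id_from_filename(name, '_totals') for name in csv_files(example_names)]
print(ids)
```

Checking the suffix explicitly means a misnamed file raises an error instead of silently producing a wrong `fight_id`.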


In [3]:
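# Make master fight_totals dataframe
# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once
frames = []

for file in fight_totals_files:
    # defined before the try block so the error message can always use it
    file2 = file.replace('.csv', '')
    try:
        df = pd.read_csv('data/ufc_stats/fight_totals3/' + file)
        # add fight id column (file name minus the trailing '_totals')
        df['fight_id'] = file2[:-7]
        frames.append(df)
        # print confirmation
        print(file2)
    except Exception:
        # report files that cannot be read and carry on
        print('error: ' + file2)

fight_totals = pd.concat(frames).reset_index(drop=True)

fight_totals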

89820fae001dd151_totals
df33799f117000cb_totals
18aff757c54687f1_totals
d019250cc6d93527_totals
4655930eb83446c7_totals
821c27f0dbb27e86_totals
59bb17087b12ad35_totals
cc6934588298958e_totals
64eb4ae171613218_totals
296cc69bcc3c1635_totals
1fe58cdab57d233b_totals
11706648d34ff3e8_totals
23c6a428df1569bc_totals
32b0e450b11b32fe_totals
d4d6c5ff6bef93ce_totals
c1322d09e8b6efca_totals
3c552514bb9b5c98_totals
e67e15cc6578d1c3_totals
eed8c9955cad1e30_totals
0e9091311ca565ce_totals
387bdcc2ca8709c1_totals
f680e6ebe3bdfe3e_totals
caf3ca7fc0195412_totals
382d626d45c36f14_totals
de2069fea664c4b7_totals
db68ff7bf2487971_totals
37a7dc68f3a0e65d_totals
c3e3d5ef06a45616_totals
43712d71cec9bde6_totals
8857bf28823b1b2d_totals
c4e16d57dd9a1b39_totals
246f38c7855a6400_totals
7ec61e2b0728e6d4_totals
87126427cfcaee52_totals
ae07b35f2797242e_totals
357f438c15e1326f_totals
da26301ae3e7a9b1_totals
07018cb7ae879456_totals
8a7b0cd5ad9ca4fe_totals
e857ca0bc3b9fdcd_totals
534c07488396e124_totals
63c08ecea35e8bee

Unnamed: 0.1,Unnamed: 0,Fighter_A,Fighter_B,A_Kd,B_Kd,A_Sig_strike_land,A_Sig_strike_att,B_Sig_strike_land,B_Sig_strike_att,A_Sig_strike_percent,...,B_Ctrl_time_min,B_Ctrl_time_sec,A_Ctrl_time_tot,B_Ctrl_time_tot,details,event_title,event_url,date,Winner,fight_id
0,0,Rick Story,Martin Kampmann,0,0,61,170,38,147,0.35,...,4,55.0,37.0,295.0,Steve English ...,UFC 139: Shogun vs Henderson,http://www.ufcstats.com/event-details/0ec82142...,"November 19, 2011","Martin Kampmann ""The Hitman""",89820fae001dd151
1,0,Enrique Barzola,Kyle Bochniak,0,0,55,165,41,141,0.33,...,0,27.0,18.0,27.0,Dave Hagen ...,UFC on FOX: Maia vs. Condit,http://www.ufcstats.com/event-details/cfbccfed...,"August 27, 2016","Kyle Bochniak ""Killer B""",df33799f117000cb
2,0,Belal Muhammad,Takashi Sato,0,0,49,106,29,89,0.46,...,0,38.0,151.0,38.0,Rear Naked Choke,UFC 242: Khabib vs. Poirier,http://www.ufcstats.com/event-details/a79bfbc0...,"September 07, 2019","Belal Muhammad ""Remember the Name""",18aff757c54687f1
3,0,Alex Caceres,Edwin Figueroa,0,1,55,89,41,81,0.61,...,2,20.0,364.0,140.0,Two Points Deducted: Low Blows by Caceres ...,UFC 143: Diaz vs Condit,http://www.ufcstats.com/event-details/df2cf66d...,"February 04, 2012","Edwin Figueroa ""El Feroz""",d019250cc6d93527
4,0,TJ Grant,Carlo Prater,0,0,68,113,27,61,0.60,...,0,35.0,414.0,35.0,Brian Costello ...,UFC on FUEL TV: Korean Zombie vs Poirier,http://www.ufcstats.com/event-details/8caca585...,"May 15, 2012",TJ Grant,4655930eb83446c7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7351,0,Alessio Di,Chirico Zak,0,1,53,157,54,139,0.33,...,0,0.0,0.0,0.0,Derek Cleary ...,UFC Fight Night: Smith vs. Rakic,http://www.ufcstats.com/event-details/e29cf523...,"August 29, 2020",Zak Cummings,0a7517a16d2d9db6
7352,0,Miguel Baeza,Hector Aldana,1,0,35,61,8,42,0.57,...,0,9.0,0.0,9.0,Kick to Leg At Distance,UFC Fight Night: Joanna vs. Waterson,http://www.ufcstats.com/event-details/0941df56...,"October 12, 2019","Miguel Baeza ""Caramel Thunder""",1d26ceb1c995655e
7353,0,Andrea Lee,Natalia Silva,0,0,43,125,70,174,0.34,...,0,0.0,0.0,0.0,David Ginsberg ...,UFC 292: Sterling vs. O'Malley,http://www.ufcstats.com/event-details/2719f300...,"August 19, 2023",Natalia Silva,8c7b5b21b3532027
7354,0,Jon Fitch,Chris Wilson,0,0,40,76,22,52,0.52,...,0,8.0,478.0,8.0,Cecil Peoples ...,UFC 82: Pride of a Champion,http://www.ufcstats.com/event-details/598a58db...,"March 01, 2008",Jon Fitch,029cd254fe02d4b8


In [4]:
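# Make master sig_strikes dataframe
# DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once
frames = []
for file in sig_strike_files:
    # defined before the try block so the error message can always use it
    file2 = file.replace('.csv', '')
    try:
        df = pd.read_csv('data/ufc_stats/sig_strikes3/' + file)
        # add fight id column (file name minus the trailing '_sigstrikes')
        df['fight_id'] = file2[:-11]
        frames.append(df)
        print(file2)
    except Exception:
        # report files that cannot be read and carry on
        print('error: ' + file2)

# the index is not reset here, so every row keeps index 0 from its source file
sig_strike_agg = pd.concat(frames)

sig_strike_agg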

4a9b3b97fa2bbca9_sigstrikes
a637cd1012af870f_sigstrikes
221928c8441779e4_sigstrikes
d8f64b001cd590fa_sigstrikes
7a72d7448159fb59_sigstrikes
6b532f492e98d22f_sigstrikes
9eb92b670eebdcd4_sigstrikes
3a42dd84680f444b_sigstrikes
d4c314616eac8f8b_sigstrikes
f2916c10b7ecfcee_sigstrikes
9a8964b148cb5923_sigstrikes
72c3e5eacde4f0e5_sigstrikes
7233b2a56438d77a_sigstrikes
1bf1a6fc9c9acc26_sigstrikes
67a3b8dc3c0c3e45_sigstrikes
1a81573425c585fb_sigstrikes
91a5480f48bfaa71_sigstrikes
43313a340ff0d2b5_sigstrikes
a80162f4c8842da3_sigstrikes
b589aba75770bc6b_sigstrikes
b88b170a7fe40544_sigstrikes
e5ac7343b6d45188_sigstrikes
cfbaeb80b5ff7c00_sigstrikes
6be073563a421bca_sigstrikes
96740aa476055413_sigstrikes
b12bc15f384ecc42_sigstrikes
61d6b591430b6e25_sigstrikes
fa2bed87e5d6c568_sigstrikes
4f9939af9a1d40a3_sigstrikes
81c22cfa22002ea4_sigstrikes
92d0c39f0df77c7e_sigstrikes
e4acc3bb3c3e27d1_sigstrikes
4f6f7f1ad8954bb6_sigstrikes
f5a038c18f7a44fc_sigstrikes
0fe4906754b2d0f9_sigstrikes
33719a2d9c69ed17_sig

Unnamed: 0.1,Unnamed: 0,Fighter_A,Fighter_B,A_Head_Strikes_land,A_Head_Strikes_att,B_Head_Strikes_land,B_Head_Strikes_att,A_Head_Strikes_percent,B_Head_Strikes_percent,A_Body_Strikes_land,...,A_Ground_Strikes_land,A_Ground_Strikes_att,B_Ground_Strikes_land,B_Ground_Strikes_att,A_Ground_Strikes_percent,B_Ground_Strikes_percent,details,event_title,event_url,fight_id
0,0,Thales Leites,Ryan Jensen,1,8,8,20,0.125000,0.400000,1,...,1,1,3,5,1.000000,0.600000,Armbar From Bottom Guard,UFC 74: Respect,http://www.ufcstats.com/event-details/a5c53b3d...,4a9b3b97fa2bbca9
0,0,Alex Oliveira,Piotr Hallmann,32,95,26,53,0.336842,0.490566,7,...,2,4,4,4,0.500000,1.000000,Punch to Head At Distance,UFC Fight Night: Belfort vs Henderson 3,http://www.ufcstats.com/event-details/5345f866...,a637cd1012af870f
0,0,Ed Herman,Chris Price,5,9,1,1,0.555556,1.000000,0,...,4,7,0,0,0.571429,,Armbar From Back Control,UFC Fight Night: Evans vs Salmon,http://ufcstats.com/event-details/13c4313ed0f7...,221928c8441779e4
0,0,Steve Cantwell,Razak Al-Hassan,9,22,5,46,0.409091,0.108696,4,...,2,3,0,0,0.666667,,Armbar From Mount Technical Submission,UFC Fight Night - Fight for the Troops,http://www.ufcstats.com/event-details/30cd319d...,d8f64b001cd590fa
0,0,Anthony Johnson,Yoshiyuki Yoshida,9,19,0,0,0.473684,,1,...,0,0,0,0,,,Punch to Head At Distance,UFC 104: Machida vs Shogun,http://ufcstats.com/event-details/7c0847d3854a...,7a72d7448159fb59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,0,Raphael Assuncao,Cody Garbrandt,4,25,7,24,0.160000,0.291667,2,...,0,0,0,0,,,Punch to Head At Distance,UFC 250: Nunes vs. Spencer,http://www.ufcstats.com/event-details/4c12aa7c...,0ddf834a47533d15
0,0,Jim Miller,David Baron,10,21,5,14,0.476190,0.357143,2,...,11,15,0,0,0.733333,,Rear Naked Choke,UFC 89: Bisping vs Leben,http://www.ufcstats.com/event-details/312f47c3...,9b8c29f50d452025
0,0,Gabriel Gonzaga,Chris Tuchscherer,20,25,0,1,0.800000,0.000000,1,...,20,25,0,1,0.800000,0.000000,Punches to Head On Ground,UFC 102: Couture vs Nogueira,http://www.ufcstats.com/event-details/c6a33ff1...,4d3297af1b7889ba
0,0,Joe Lauzon,Marcin Held,17,58,20,70,0.293103,0.285714,3,...,0,1,0,3,0.000000,0.000000,Derek Cleary ...,UFC Fight Night: Rodriguez vs. Penn,http://www.ufcstats.com/event-details/46effbd1...,00720c46f864fe03


In [5]:
fight_totals.to_csv('data/final/aggregates/All_Fight_Totals.csv', index=False)
sig_strike_agg.to_csv('data/final/aggregates/All_Sig_Strikes.csv', index=False)

In [6]:
# get the columns that are in both dataframes
common_cols = list(set(fight_totals.columns).intersection(sig_strike_agg.columns))
common_cols

['event_url',
 'Fighter_B',
 'details',
 'Fighter_A',
 'Unnamed: 0',
 'event_title',
 'fight_id']

In [7]:
# drop unnamed
fight_totals = fight_totals.drop(columns=['Unnamed: 0'])
sig_strike_agg = sig_strike_agg.drop(columns=['Unnamed: 0'])

In [8]:
# drop details from sig_strike_agg
sig_strike_agg = sig_strike_agg.drop(columns=['details'])

In [9]:
common_cols = list(set(fight_totals.columns).intersection(sig_strike_agg.columns))
common_cols

['event_url', 'Fighter_B', 'Fighter_A', 'event_title', 'fight_id']
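
Because a `set` does not preserve order, the list above can come out in a different order from run to run. That does not affect the merge, but a stable order makes the output easier to compare between runs. One possible order-preserving variant is sketched below on two empty toy frames. The column names are borrowed from this notebook, while `left` and `right` are placeholder names.

```python
import pandas as pd

left = pd.DataFrame(columns=['Fighter_A', 'Fighter_B', 'A_Kd',
                             'event_title', 'event_url', 'fight_id'])
right = pd.DataFrame(columns=['Fighter_A', 'Fighter_B', 'A_Head_Strikes_land',
                              'event_title', 'event_url', 'fight_id'])

# keep the left frame's column order instead of relying on set ordering
common_cols = [col for col in left.columns if col in right.columns]
print(common_cols)
```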

In [10]:
# merge the two dataframes, using the common columns as the key
fight_DF = pd.merge(fight_totals, sig_strike_agg, on=common_cols, how= 'inner')
fight_DF

Unnamed: 0,Fighter_A,Fighter_B,A_Kd,B_Kd,A_Sig_strike_land,A_Sig_strike_att,B_Sig_strike_land,B_Sig_strike_att,A_Sig_strike_percent,B_Sig_strike_percent,...,B_Clinch_Strikes_land,B_Clinch_Strikes_att,A_Clinch_Strikes_percent,B_Clinch_Strikes_percent,A_Ground_Strikes_land,A_Ground_Strikes_att,B_Ground_Strikes_land,B_Ground_Strikes_att,A_Ground_Strikes_percent,B_Ground_Strikes_percent
0,Rick Story,Martin Kampmann,0,0,61,170,38,147,0.35,0.25,...,5,16,0.560000,0.312500,0,0,4,4,,1.000000
1,Enrique Barzola,Kyle Bochniak,0,0,55,165,41,141,0.33,0.29,...,3,5,0.800000,0.600000,0,0,0,0,,
2,Belal Muhammad,Takashi Sato,0,0,49,106,29,89,0.46,0.32,...,0,0,0.750000,,4,4,0,0,1.000000,
3,Alex Caceres,Edwin Figueroa,0,1,55,89,41,81,0.61,0.50,...,5,5,1.000000,1.000000,1,2,10,18,0.500000,0.555556
4,TJ Grant,Carlo Prater,0,0,68,113,27,61,0.60,0.44,...,2,3,0.857143,0.666667,10,12,0,1,0.833333,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7347,Alessio Di,Chirico Zak,0,1,53,157,54,139,0.33,0.38,...,0,1,1.000000,0.000000,0,0,0,0,,
7348,Miguel Baeza,Hector Aldana,1,0,35,61,8,42,0.57,0.19,...,0,0,,,6,10,0,0,0.600000,
7349,Andrea Lee,Natalia Silva,0,0,43,125,70,174,0.34,0.40,...,0,0,,,0,0,0,0,,
7350,Jon Fitch,Chris Wilson,0,0,40,76,22,52,0.52,0.42,...,6,8,0.250000,0.750000,18,22,0,0,0.818182,
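
An inner join silently drops any fight that appears in only one of the two frames. To see which rows fail to match, one option is an outer join with `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The sketch below uses toy data with made-up ids and is not part of the original notebook.

```python
import pandas as pd

totals = pd.DataFrame({
    'fight_id': ['a1', 'b2', 'c3'],
    'Fighter_A': ['X', 'Y', 'Z'],
    'A_Kd': [0, 1, 0],
})
strikes = pd.DataFrame({
    'fight_id': ['a1', 'b2', 'd4'],
    'Fighter_A': ['X', 'Y', 'W'],
    'A_Head_Strikes_land': [5, 9, 2],
})
keys = ['fight_id', 'Fighter_A']

# the outer join keeps unmatched rows; '_merge' records where each row came from
audit = pd.merge(totals, strikes, on=keys, how='outer', indicator=True)
counts = audit['_merge'].value_counts()
unmatched = audit[audit['_merge'] != 'both']

print(counts)
print(unmatched[keys + ['_merge']])
```

On the real data, the same call with `fight_totals`, `sig_strike_agg`, and `common_cols` would list the fights that the inner join leaves out.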


In [11]:
# check for nulls
fight_DF.isnull().sum()

Fighter_A                      0
Fighter_B                     25
A_Kd                           0
B_Kd                           0
A_Sig_strike_land              0
                            ... 
A_Ground_Strikes_att           0
B_Ground_Strikes_land          0
B_Ground_Strikes_att           0
A_Ground_Strikes_percent    2503
B_Ground_Strikes_percent    3297
Length: 80, dtype: int64

In [12]:
# drop when Fighter_B is null
fight_DF = fight_DF.dropna(subset=['Fighter_B'])

In [13]:
missing = fight_DF.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing

A_Total_Strikes_percent         26
B_Total_Strikes_percent         36
details                         71
A_Head_Strikes_percent          71
B_Distance_Strikes_percent     104
A_Distance_Strikes_percent     115
B_Head_Strikes_percent         126
A_Ctrl_time_sec                181
B_Ctrl_time_sec                181
A_Ctrl_time_tot                181
B_Ctrl_time_tot                181
A_Body_Strikes_percent         777
B_Body_Strikes_percent         908
B_Leg_Strikes_percent         1428
A_Leg_Strikes_percent         1429
A_Clinch_Strikes_percent      1724
B_Clinch_Strikes_percent      1747
A_Takedown_percent            2327
A_Ground_Strikes_percent      2494
B_Takedown_percent            2729
B_Ground_Strikes_percent      3288
A_Sub_Success_Percent         5216
B_Sub_Success_Percent         5787
dtype: int64

In [14]:
# drop when Winner is missing
fight_DF = fight_DF.dropna(subset=['Winner'])

In [15]:
# change all the nulls to 0
fight_DF = fight_DF.fillna(0)

In [16]:
# change all the inf to 0
fight_DF = fight_DF.replace([np.inf, -np.inf], 0)
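
The effect of the two fill steps can be seen on a toy frame. The sketch assumes the `_percent` columns were computed upstream as landed divided by attempted. That matches the rows shown above, where a missing percentage always sits next to zero attempts, but this section does not confirm it. The third toy row (3 landed out of 0 attempted) cannot occur in real data and is included only to produce an `inf`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'A_Ground_Strikes_land': [0, 4, 3],
    'A_Ground_Strikes_att': [0, 4, 0],
})
# assumed upstream definition: accuracy = landed / attempted
# 0/0 yields NaN and 3/0 yields inf
toy['A_Ground_Strikes_percent'] = (
    toy['A_Ground_Strikes_land'] / toy['A_Ground_Strikes_att']
)

# same two steps as the notebook: nulls to 0, then +/- inf to 0
cleaned = toy.fillna(0).replace([np.inf, -np.inf], 0)
print(cleaned)
```

After the fill, a fighter with no attempts is indistinguishable from one who attempted and landed nothing, which may matter when these columns are used as model features.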

Save Fight_DF, the eventual dataset we train/test on.

In [17]:
fight_DF.to_csv('data/final/aggregates/Fight_DF.csv', index=False)

# Check for Completeness

In [18]:
fight_DF = pd.read_csv('data/final/aggregates/Fight_DF.csv')

In [19]:
fight_DF.head()

Unnamed: 0,Fighter_A,Fighter_B,A_Kd,B_Kd,A_Sig_strike_land,A_Sig_strike_att,B_Sig_strike_land,B_Sig_strike_att,A_Sig_strike_percent,B_Sig_strike_percent,...,B_Clinch_Strikes_land,B_Clinch_Strikes_att,A_Clinch_Strikes_percent,B_Clinch_Strikes_percent,A_Ground_Strikes_land,A_Ground_Strikes_att,B_Ground_Strikes_land,B_Ground_Strikes_att,A_Ground_Strikes_percent,B_Ground_Strikes_percent
0,Rick Story,Martin Kampmann,0,0,61,170,38,147,0.35,0.25,...,5,16,0.56,0.3125,0,0,4,4,0.0,1.0
1,Enrique Barzola,Kyle Bochniak,0,0,55,165,41,141,0.33,0.29,...,3,5,0.8,0.6,0,0,0,0,0.0,0.0
2,Belal Muhammad,Takashi Sato,0,0,49,106,29,89,0.46,0.32,...,0,0,0.75,0.0,4,4,0,0,1.0,0.0
3,Alex Caceres,Edwin Figueroa,0,1,55,89,41,81,0.61,0.5,...,5,5,1.0,1.0,1,2,10,18,0.5,0.555556
4,TJ Grant,Carlo Prater,0,0,68,113,27,61,0.6,0.44,...,2,3,0.857143,0.666667,10,12,0,1,0.833333,0.0


In [20]:
len(fight_DF)

7327
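
A row count alone does not show whether any fight appears twice or whether any nulls survived the round trip through .csv. A small summary along the following lines could make the check more explicit. `completeness_report` is a hypothetical helper and the toy frame is made up for illustration.

```python
import pandas as pd


def completeness_report(df, id_col='fight_id'):
    """Summarise row count, id uniqueness, and remaining null cells."""
    return {
        'rows': len(df),
        'unique_ids': int(df[id_col].nunique()),
        'duplicate_ids': int(df[id_col].duplicated().sum()),
        'null_cells': int(df.isnull().sum().sum()),
    }


toy = pd.DataFrame({
    'fight_id': ['a1', 'b2', 'b2'],
    'Winner': ['X', None, 'Y'],
})
report = completeness_report(toy)
print(report)
```

Applied to `fight_DF`, a clean result would show `rows` equal to `unique_ids` and both `duplicate_ids` and `null_cells` at zero.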

In [21]:
# filter to fights with Jailton Almeida as Fighter_A or Fighter_B
ja_DF = fight_DF[(fight_DF['Fighter_A']=='Jailton Almeida') | (fight_DF['Fighter_B']=='Jailton Almeida')]
ja_DF

Unnamed: 0,Fighter_A,Fighter_B,A_Kd,B_Kd,A_Sig_strike_land,A_Sig_strike_att,B_Sig_strike_land,B_Sig_strike_att,A_Sig_strike_percent,B_Sig_strike_percent,...,B_Clinch_Strikes_land,B_Clinch_Strikes_att,A_Clinch_Strikes_percent,B_Clinch_Strikes_percent,A_Ground_Strikes_land,A_Ground_Strikes_att,B_Ground_Strikes_land,B_Ground_Strikes_att,A_Ground_Strikes_percent,B_Ground_Strikes_percent
2961,Jailton Almeida,Danilo Marques,0,0,30,51,0,0,0.58,0.0,...,0,0,0.0,0.0,30,50,0,0,0.6,0.0
5012,Jailton Almeida,Parker Porter,0,0,18,34,0,0,0.52,0.0,...,0,0,0.0,0.0,17,33,0,0,0.515152,0.0
5574,Jairzinho Rozenstruik,Jailton Almeida,0,0,0,2,4,7,0.0,0.57,...,0,0,0.0,0.0,0,0,4,5,0.0,0.8
5823,Jailton Almeida,Anton Turkalj,0,0,17,24,1,1,0.7,1.0,...,0,0,0.0,0.0,17,21,0,0,0.809524,0.0
5957,Shamil Abdurakhimov,Jailton Almeida,0,0,1,1,45,57,1.0,0.78,...,0,0,0.0,0.0,0,0,44,52,0.0,0.846154
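
The same spot check can be repeated for other fighters by wrapping the filter in a function. `fights_for` is a hypothetical helper, shown here on a three-row toy frame whose names are borrowed from the tables above.

```python
import pandas as pd


def fights_for(df, fighter):
    """Return the rows where the fighter appears in either corner."""
    mask = (df['Fighter_A'] == fighter) | (df['Fighter_B'] == fighter)
    return df[mask]


toy = pd.DataFrame({
    'Fighter_A': ['Jailton Almeida', 'Jairzinho Rozenstruik', 'Rick Story'],
    'Fighter_B': ['Danilo Marques', 'Jailton Almeida', 'Martin Kampmann'],
})
ja_toy = fights_for(toy, 'Jailton Almeida')
print(ja_toy)
```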
