# 3. Merging 


This section loads the per-fight statistics scraped from ufcstats.com, merges them into a single table, and cleans the result. The output is the dataset later used to train and test machine learning models that predict the outcome of UFC fights.

Here's how it works, broken down into sections:

1. **Importing necessary libraries**: The notebook imports pandas, numpy, matplotlib, seaborn, os, and other packages for data processing, visualization, web scraping, and file handling. This section itself relies only on pandas, numpy, and os.

2. **Setting the working directory**: The current directory is changed to the project folder so that the relative `data/...` paths used below resolve.

3. **Loading the data files**: The .csv files in the `fight_totals3` and `sig_strikes3` subfolders are loaded. Judging by their columns, they hold per-fight totals (knockdowns, significant strikes, control time, and so on) and a breakdown of significant strikes by target and position (head, body, leg, distance, clinch, ground). Each set is combined into one dataframe (`fight_totals` and `sig_strike_agg`), with a `fight_id` column taken from the file name to identify the fight.

4. **Removing redundant columns**: The columns shared by both dataframes are listed. The leftover index column `Unnamed: 0` is dropped from both, and `details` is dropped from `sig_strike_agg` so that it is not used as a merge key.

5. **Merging the dataframes**: The two dataframes are inner-joined on the remaining common columns (`event_url`, `fight_id`, `Fighter_A`, `Fighter_B`, `event_title`) to form a new dataframe, `fight_DF`.

6. **Handling missing values**: Rows with no `Fighter_B` or no `Winner` are dropped. The remaining nulls are filled with 0, and infinite values are replaced with 0.

7. **Saving the dataframes**: The two combined dataframes and the final `fight_DF` are saved to .csv files for future use.

The outputs are three .csv files: the two combined inputs (`All_Fight_Totals.csv` and `All_Sig_Strikes.csv`) and the cleaned, merged `Fight_DF.csv`, ready for further analysis or machine learning.

The import cell also loads Selenium and BeautifulSoup for web scraping, but nothing in this section uses them. They are presumably used in sections that are not shown here.

In [2]:
# standard library
import datetime
import json
import os
import pickle
import random
import shutil
import sqlite3
import time
import warnings
from cmath import nan
from random import randint

# third-party
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import numpy as np
import pandas as pd
import requests
import seaborn as sns
from bs4 import BeautifulSoup
from matplotlib.pyplot import figure
from scipy.stats import norm
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from sklearn.metrics import fbeta_score
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion

%matplotlib inline
warnings.filterwarnings('ignore')

# work from the project root so the relative data/ paths below resolve
os.chdir('/Users/travisroyce/Library/CloudStorage/OneDrive-Personal/Data Science/Personal_Projects/Sports/UFC_Prediction_V2')

In [3]:
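# list the per-fight fight_totals and sig_strikes files, both scraped from ufcstats.com
# keep only .csv files so stray entries (e.g. .DS_Store on macOS) are skipped
fight_totals_files = [f for f in os.listdir('data/ufc_stats/fight_totals3') if f.endswith('.csv')]
sig_strike_files = [f for f in os.listdir('data/ufc_stats/sig_strikes3') if f.endswith('.csv')]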


In [5]:
# Make master fight_totals dataframe
frames = []

for file in fight_totals_files:
    file2 = file.replace('.csv','')
    df = pd.read_csv('data/ufc_stats/fight_totals3/'+file)
    # add fight id column: the file name minus its 7-character suffix
    df['fight_id'] = file2[:-7]
    frames.append(df)
# DataFrame.append was removed in pandas 2.0, so concatenate once instead
fight_totals = pd.concat(frames, ignore_index=True)

fight_totals

,Unnamed: 0,Fighter_A,Fighter_B,A_Kd,B_Kd,A_Sig_strike_land,A_Sig_strike_att,B_Sig_strike_land,B_Sig_strike_att,A_Sig_strike_percent,...,B_Ctrl_time_min,B_Ctrl_time_sec,A_Ctrl_time_tot,B_Ctrl_time_tot,details,event_title,event_url,date,Winner,fight_id
0,0,Rick Story,Martin Kampmann,0.0,0.0,61.0,170.0,38.0,147.0,0.35,...,4.0,55.0,37.0,295.0,Steve English ...,UFC 139: Shogun vs Henderson,http://www.ufcstats.com/event-details/0ec82142...,"November 19, 2011","Martin Kampmann ""The Hitman""",89820fae001dd151
1,0,Enrique Barzola,Kyle Bochniak,0.0,0.0,55.0,165.0,41.0,141.0,0.33,...,0.0,27.0,18.0,27.0,Dave Hagen ...,UFC on FOX: Maia vs. Condit,http://www.ufcstats.com/event-details/cfbccfed...,"August 27, 2016","Kyle Bochniak ""Killer B""",df33799f117000cb
2,0,Belal Muhammad,Takashi Sato,0.0,0.0,49.0,106.0,29.0,89.0,0.46,...,0.0,38.0,151.0,38.0,Rear Naked Choke,UFC 242: Khabib vs. Poirier,http://www.ufcstats.com/event-details/a79bfbc0...,"September 07, 2019","Belal Muhammad ""Remember the Name""",18aff757c54687f1
3,0,Alex Caceres,Edwin Figueroa,0.0,1.0,55.0,89.0,41.0,81.0,0.61,...,2.0,20.0,364.0,140.0,Two Points Deducted: Low Blows by Caceres ...,UFC 143: Diaz vs Condit,http://www.ufcstats.com/event-details/df2cf66d...,"February 04, 2012","Edwin Figueroa ""El Feroz""",d019250cc6d93527
4,0,TJ Grant,Carlo Prater,0.0,0.0,68.0,113.0,27.0,61.0,0.60,...,0.0,35.0,414.0,35.0,Brian Costello ...,UFC on FUEL TV: Korean Zombie vs Poirier,http://www.ufcstats.com/event-details/8caca585...,"May 15, 2012",TJ Grant,4655930eb83446c7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5879,0,Brian Ebersole,TJ Waldburger,0.0,1.0,39.0,59.0,18.0,41.0,0.66,...,1.0,44.0,495.0,104.0,Jon Bilyk ...,UFC on FX: Maynard vs Guida,http://www.ufcstats.com/event-details/50593489...,"June 22, 2012","Brian Ebersole ""Bad Boy""",69fd05c73b645a62
5880,0,Alessio Di,Chirico Zak,0.0,1.0,53.0,157.0,54.0,139.0,0.33,...,0.0,0.0,0.0,0.0,Derek Cleary ...,UFC Fight Night: Smith vs. Rakic,http://www.ufcstats.com/event-details/e29cf523...,"August 29, 2020",Zak Cummings,0a7517a16d2d9db6
5881,0,Miguel Baeza,Hector Aldana,1.0,0.0,35.0,61.0,8.0,42.0,0.57,...,0.0,9.0,0.0,9.0,Kick to Leg At Distance,UFC Fight Night: Joanna vs. Waterson,http://www.ufcstats.com/event-details/0941df56...,"October 12, 2019","Miguel Baeza ""Caramel Thunder""",1d26ceb1c995655e
5882,0,Andrea Lee,Natalia Silva,0.0,0.0,43.0,125.0,70.0,174.0,0.34,...,0.0,0.0,0.0,0.0,David Ginsberg ...,UFC 292: Sterling vs. O'Malley,http://www.ufcstats.com/event-details/2719f300...,"August 19, 2023",Natalia Silva,8c7b5b21b3532027
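
The loading loop above can also be wrapped in a small helper. The following is a minimal sketch, run on throwaway files, that gathers the per-file frames in a list and concatenates them once. The `load_fight_csvs` helper and the `<fight_id>_totals.csv` file-name pattern are assumptions made for illustration; the notebook only shows that a 7-character suffix is sliced off the file name.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_fight_csvs(folder, suffix_len):
    """Read every CSV in `folder` and tag its rows with a fight_id from the file name."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(path)
        # path.stem is the file name without ".csv"; strip the trailing suffix
        df["fight_id"] = path.stem[:-suffix_len]
        frames.append(df)
    # a single concat avoids copying a growing dataframe on every iteration
    return pd.concat(frames, ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    # hypothetical file names of the form "<fight_id>_totals.csv"
    for fight_id in ["89820fae001dd151", "df33799f117000cb"]:
        pd.DataFrame({"A_Kd": [0.0]}).to_csv(Path(tmp) / f"{fight_id}_totals.csv", index=False)
    demo = load_fight_csvs(tmp, suffix_len=len("_totals"))

print(demo)
```

Passing the suffix length as a parameter lets the same helper serve both folders, which use different suffix lengths (7 and 11) in this notebook.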


In [9]:
# Make master sig_strike_agg dataframe
frames = []
for file in sig_strike_files:
    file2 = file.replace('.csv','')
    df = pd.read_csv('data/ufc_stats/sig_strikes3/'+file)
    # add fight id column: the file name minus its 11-character suffix
    df['fight_id'] = file2[:-11]
    frames.append(df)
# DataFrame.append was removed in pandas 2.0, so concatenate once instead
sig_strike_agg = pd.concat(frames, ignore_index=True)

sig_strike_agg

,Unnamed: 0,Fighter_A,Fighter_B,A_Head_Strikes_land,A_Head_Strikes_att,B_Head_Strikes_land,B_Head_Strikes_att,A_Head_Strikes_percent,B_Head_Strikes_percent,A_Body_Strikes_land,...,A_Ground_Strikes_land,A_Ground_Strikes_att,B_Ground_Strikes_land,B_Ground_Strikes_att,A_Ground_Strikes_percent,B_Ground_Strikes_percent,details,event_title,event_url,fight_id
0,0,Thales Leites,Ryan Jensen,1.0,8.0,8.0,20.0,0.125000,0.400000,1.0,...,1.0,1.0,3.0,5.0,1.000000,0.6,Armbar From Bottom Guard,UFC 74: Respect,http://www.ufcstats.com/event-details/a5c53b3d...,4a9b3b97fa2bbca9
1,0,Alex Oliveira,Piotr Hallmann,32.0,95.0,26.0,53.0,0.336842,0.490566,7.0,...,2.0,4.0,4.0,4.0,0.500000,1.0,Punch to Head At Distance,UFC Fight Night: Belfort vs Henderson 3,http://www.ufcstats.com/event-details/5345f866...,a637cd1012af870f
2,0,Steve Cantwell,Razak Al-Hassan,9.0,22.0,5.0,46.0,0.409091,0.108696,4.0,...,2.0,3.0,0.0,0.0,0.666667,,Armbar From Mount Technical Submission,UFC Fight Night - Fight for the Troops,http://www.ufcstats.com/event-details/30cd319d...,d8f64b001cd590fa
3,0,Alexis Davis,Sarah Kaufman,18.0,77.0,37.0,90.0,0.233766,0.411111,7.0,...,0.0,0.0,7.0,7.0,,1.0,Armbar From Bottom,UFC 186: Johnson vs Horiguchi,http://www.ufcstats.com/event-details/997b4f52...,6b532f492e98d22f
4,0,Michael Chiesa,Mitch Clarke,48.0,113.0,36.0,85.0,0.424779,0.423529,9.0,...,11.0,12.0,0.0,0.0,0.916667,,Doug Crosby ...,UFC Fight Night: Mendes vs Lamas,http://www.ufcstats.com/event-details/16d09e80...,9eb92b670eebdcd4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5878,0,Clay Guida,Tatsuya Kawajiri,18.0,39.0,18.0,48.0,0.461538,0.375000,7.0,...,18.0,19.0,0.0,0.0,0.947368,,Ben Cartlidge ...,UFC Fight Night: Minotauro vs Nelson,http://www.ufcstats.com/event-details/d632d156...,a2e1e35e4e84e08d
5879,0,Raphael Assuncao,Cody Garbrandt,4.0,25.0,7.0,24.0,0.160000,0.291667,2.0,...,0.0,0.0,0.0,0.0,,,Punch to Head At Distance,UFC 250: Nunes vs. Spencer,http://www.ufcstats.com/event-details/4c12aa7c...,0ddf834a47533d15
5880,0,Jim Miller,David Baron,10.0,21.0,5.0,14.0,0.476190,0.357143,2.0,...,11.0,15.0,0.0,0.0,0.733333,,Rear Naked Choke,UFC 89: Bisping vs Leben,http://www.ufcstats.com/event-details/312f47c3...,9b8c29f50d452025
5881,0,Gabriel Gonzaga,Chris Tuchscherer,20.0,25.0,0.0,1.0,0.800000,0.000000,1.0,...,20.0,25.0,0.0,1.0,0.800000,0.0,Punches to Head On Ground,UFC 102: Couture vs Nogueira,http://www.ufcstats.com/event-details/c6a33ff1...,4d3297af1b7889ba


In [10]:
fight_totals.to_csv('data/final/aggregates/All_Fight_Totals.csv', index=False)
sig_strike_agg.to_csv('data/final/aggregates/All_Sig_Strikes.csv', index=False)

In [11]:
# get the columns that are in both dataframes
common_cols = list(set(fight_totals.columns).intersection(sig_strike_agg.columns))
common_cols

['event_url',
 'fight_id',
 'Fighter_B',
 'event_title',
 'Unnamed: 0',
 'details',
 'Fighter_A']

In [12]:
# drop unnamed
fight_totals = fight_totals.drop(columns=['Unnamed: 0'])
sig_strike_agg = sig_strike_agg.drop(columns=['Unnamed: 0'])
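
One likely source of the `Unnamed: 0` column is a CSV that was written with its index and then read back without declaring it. This is an assumption, since the scraping step that wrote the per-fight files is not shown here. A small in-memory sketch of the effect:

```python
import io

import pandas as pd

df = pd.DataFrame({"Fighter_A": ["Rick Story"], "Fighter_B": ["Martin Kampmann"]})

# to_csv writes the index by default, under an empty header name
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False leaves the index out, so nothing extra comes back
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))     # ['Unnamed: 0', 'Fighter_A', 'Fighter_B']
print(list(without_index.columns))  # ['Fighter_A', 'Fighter_B']
```

The `to_csv(..., index=False)` calls used when saving in this section avoid adding another such column.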

In [13]:
# drop details from sig_strike_agg
sig_strike_agg = sig_strike_agg.drop(columns=['details'])

In [14]:
common_cols = list(set(fight_totals.columns).intersection(sig_strike_agg.columns))
common_cols

['event_url', 'fight_id', 'Fighter_B', 'event_title', 'Fighter_A']
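
Because `common_cols` is built from a `set`, its order is not guaranteed to be the same from one Python session to the next. If a stable order matters for reproducible output, one option is a list comprehension that follows the left dataframe's column order. The sketch below uses empty toy frames with a few of the column names seen above:

```python
import pandas as pd

# toy frames that only carry column names
left = pd.DataFrame(columns=["Fighter_A", "Fighter_B", "A_Kd", "event_url", "fight_id"])
right = pd.DataFrame(columns=["Fighter_A", "Fighter_B", "A_Head_Strikes_land", "event_url", "fight_id"])

# keep the left frame's column order instead of a set's arbitrary order
common_cols = [col for col in left.columns if col in right.columns]
print(common_cols)  # ['Fighter_A', 'Fighter_B', 'event_url', 'fight_id']
```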

In [15]:
# merge the two dataframes, using the common columns as the key
fight_DF = pd.merge(fight_totals, sig_strike_agg, on=common_cols, how= 'inner')
fight_DF

,Fighter_A,Fighter_B,A_Kd,B_Kd,A_Sig_strike_land,A_Sig_strike_att,B_Sig_strike_land,B_Sig_strike_att,A_Sig_strike_percent,B_Sig_strike_percent,...,B_Clinch_Strikes_land,B_Clinch_Strikes_att,A_Clinch_Strikes_percent,B_Clinch_Strikes_percent,A_Ground_Strikes_land,A_Ground_Strikes_att,B_Ground_Strikes_land,B_Ground_Strikes_att,A_Ground_Strikes_percent,B_Ground_Strikes_percent
0,Rick Story,Martin Kampmann,0.0,0.0,61.0,170.0,38.0,147.0,0.35,0.25,...,5.0,16.0,0.560000,0.312500,0.0,0.0,4.0,4.0,,1.000000
1,Enrique Barzola,Kyle Bochniak,0.0,0.0,55.0,165.0,41.0,141.0,0.33,0.29,...,3.0,5.0,0.800000,0.600000,0.0,0.0,0.0,0.0,,
2,Belal Muhammad,Takashi Sato,0.0,0.0,49.0,106.0,29.0,89.0,0.46,0.32,...,0.0,0.0,0.750000,,4.0,4.0,0.0,0.0,1.000000,
3,Alex Caceres,Edwin Figueroa,0.0,1.0,55.0,89.0,41.0,81.0,0.61,0.50,...,5.0,5.0,1.000000,1.000000,1.0,2.0,10.0,18.0,0.500000,0.555556
4,TJ Grant,Carlo Prater,0.0,0.0,68.0,113.0,27.0,61.0,0.60,0.44,...,2.0,3.0,0.857143,0.666667,10.0,12.0,0.0,1.0,0.833333,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5875,Brian Ebersole,TJ Waldburger,0.0,1.0,39.0,59.0,18.0,41.0,0.66,0.43,...,4.0,5.0,0.750000,0.800000,19.0,22.0,1.0,1.0,0.863636,1.000000
5876,Alessio Di,Chirico Zak,0.0,1.0,53.0,157.0,54.0,139.0,0.33,0.38,...,0.0,1.0,1.000000,0.000000,0.0,0.0,0.0,0.0,,
5877,Miguel Baeza,Hector Aldana,1.0,0.0,35.0,61.0,8.0,42.0,0.57,0.19,...,0.0,0.0,,,6.0,10.0,0.0,0.0,0.600000,
5878,Andrea Lee,Natalia Silva,0.0,0.0,43.0,125.0,70.0,174.0,0.34,0.40,...,0.0,0.0,,,0.0,0.0,0.0,0.0,,
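
The merged dataframe has fewer rows than either input (its last index is 5878, against 5882 and 5881 above), so the inner join silently discarded some fights. A minimal sketch of how such rows could be found, using toy frames and `fight_id` alone as the key: an outer merge with `indicator=True` labels where each row came from, and `validate="one_to_one"` raises an error if a key is duplicated on either side. The toy data is made up for illustration.

```python
import pandas as pd

totals = pd.DataFrame({"fight_id": ["a", "b", "c"], "A_Kd": [0.0, 1.0, 0.0]})
strikes = pd.DataFrame({"fight_id": ["a", "b", "d"], "A_Head_Strikes_land": [1.0, 32.0, 9.0]})

# the outer join keeps every row; the _merge column says which side it came from
audit = pd.merge(
    totals, strikes, on="fight_id", how="outer",
    indicator=True, validate="one_to_one",
)

# rows that an inner join would have dropped
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["fight_id", "_merge"]])

# the inner join keeps only the fights present in both frames
inner = pd.merge(totals, strikes, on="fight_id", how="inner")
print(len(inner))  # 2
```

On the real data, the same check could be run with `on=common_cols` to see whether the dropped fights are missing from one folder or simply differ in one of the key columns.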


In [16]:
# check for nulls
fight_DF.isnull().sum()

Fighter_A                      0
Fighter_B                     20
A_Kd                           0
B_Kd                           0
A_Sig_strike_land              0
                            ... 
A_Ground_Strikes_att           0
B_Ground_Strikes_land          0
B_Ground_Strikes_att           0
A_Ground_Strikes_percent    2096
B_Ground_Strikes_percent    2558
Length: 80, dtype: int64

In [17]:
# drop when Fighter_B is null
fight_DF = fight_DF.dropna(subset=['Fighter_B'])

In [18]:
missing = fight_DF.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing

A_Total_Strikes_percent         11
B_Total_Strikes_percent         15
details                         15
A_Distance_Strikes_percent      34
B_Distance_Strikes_percent      38
A_Head_Strikes_percent          40
B_Head_Strikes_percent          54
A_Body_Strikes_percent         498
B_Body_Strikes_percent         587
A_Leg_Strikes_percent         1000
B_Leg_Strikes_percent         1004
A_Clinch_Strikes_percent      1314
B_Clinch_Strikes_percent      1331
A_Takedown_percent            1864
A_Ground_Strikes_percent      2089
B_Takedown_percent            2153
B_Ground_Strikes_percent      2552
A_Sub_Success_Percent         4325
B_Sub_Success_Percent         4664
dtype: int64
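
Nearly all of the columns with missing values are `_percent` columns. In the merged output above, the empty percentages sit next to zero attempts (for example, 0.0 ground strikes landed out of 0.0 attempted), which is what a 0/0 division produces. How the scraper computed these columns is not shown here, so the sketch below is only an illustration of that effect:

```python
import pandas as pd

demo = pd.DataFrame({
    "A_Ground_Strikes_land": [4.0, 0.0],
    "A_Ground_Strikes_att": [4.0, 0.0],
})

# 4/4 gives 1.0, while 0/0 gives NaN rather than raising an error
demo["A_Ground_Strikes_percent"] = (
    demo["A_Ground_Strikes_land"] / demo["A_Ground_Strikes_att"]
)
print(demo)
```

Filling these values with 0 therefore treats "no attempts" the same as "attempts, none successful".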

In [19]:
# drop when Winner is missing
fight_DF = fight_DF.dropna(subset=['Winner'])

In [20]:
# change all the nulls to 0
fight_DF = fight_DF.fillna(0)

In [21]:
# change all the inf to 0
fight_DF = fight_DF.replace([np.inf, -np.inf], 0)
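
A quick way to confirm that the two cleanup steps left no nulls or infinities behind is to test the numeric columns with `np.isfinite`. The sketch below applies the same `fillna` and `replace` calls to a toy frame; the column names are shortened for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"land": [4.0, 0.0, 3.0], "att": [4.0, 0.0, 0.0]})
demo["percent"] = demo["land"] / demo["att"]  # 1.0, NaN (0/0), inf (3/0)

# the same two steps as above: nulls to 0, then infinities to 0
clean = demo.fillna(0).replace([np.inf, -np.inf], 0)

# every numeric value should now be finite
numeric = clean.select_dtypes(include="number")
all_finite = bool(np.isfinite(numeric.to_numpy()).all())
print(clean["percent"].tolist(), all_finite)  # [1.0, 0.0, 0.0] True
```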

Save `fight_DF`, the dataset that will eventually be used for training and testing.

In [22]:
fight_DF.to_csv('data/final/aggregates/Fight_DF.csv', index=False)