## Scraping Fighter Biography Info from UFC.com

Each fighter's bio page can be reached directly by URL:

- Example: https://www.ufc.com/athlete/kevin-holland
- All active athletes: https://www.ufc.com/athletes/all/active

This notebook combines web scraping and data manipulation. It pulls fighter bio data from the official UFC website and reshapes it into a table suitable for data analysis or machine learning. Here is a breakdown:

1. **Importing Necessary Libraries**: The script starts by importing necessary Python packages and setting some configurations. Packages include pandas for data manipulation, numpy for numerical operations, matplotlib and seaborn for data visualization, BeautifulSoup and requests for web scraping, and a few others.

2. **Directory Settings**: Sets the working directory where data will be fetched and stored.

3. **Loading Dataset**: Loads a CSV file called "All_Fight_Totals.csv", which contains fight data, including the fighter names in the `Fighter_A` and `Fighter_B` columns.

4. **Preparing Fighter Names**: Lists out unique fighter names, removes duplicates and forms URLs to each fighter's bio page on the UFC website.

5. **Scraping Bio Data**: The function `get_ufc_bios` scrapes each fighter's bio page. It fetches the page, extracts the bio labels and their corresponding values, and builds a one-row data frame from them. Each fighter's bio is saved to its own CSV file.

6. **Handling Errors**: If fetching or parsing a page fails, the URL that caused the error is saved to a separate CSV file in the `bios_with_errors` folder.

7. **Fetching Missing Bio Data**: The function `download_missing_bios` first lists the bios that are already saved and the ones recorded as errors, then attempts to download every bio that is in neither list.

8. **Consolidating the Data**: After all available bios are downloaded, the per-fighter CSV files are loaded and combined into one master dataframe, which is then written to a CSV file called "All_Fighter_Bios.csv".

The result is a single CSV file holding the bio data of every fighter whose page could be scraped, ready for data analysis or model building.

In [1]:
# Load Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as mtick
import sqlite3
import seaborn as sns
from matplotlib.pyplot import figure
from bs4 import BeautifulSoup
import time
import requests     # to fetch fighter pages
import shutil       # to save files locally
import datetime
from scipy.stats import norm
import warnings
warnings.filterwarnings('ignore')
import xgboost
from xgboost import XGBClassifier
from random import randint
import random
import os

# set the working directory to the project root (machine-specific path)
os.chdir('/Users/travisroyce/Library/CloudStorage/OneDrive-Personal/Data Science/Personal_Projects/Sports/UFC_Prediction_V2')

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from cmath import nan


#### List all Fighters

In [2]:
df = pd.read_csv('data/final/aggregates/All_Fight_Totals.csv')

In [3]:
# list of all fighter_A and fighter_B names
fighter_names = list(df['Fighter_A'].unique()) + list(df['Fighter_B'].unique())

# remove duplicates
fighter_names = list(set(fighter_names))

len(fighter_names)

2174

In [4]:
fighter_name_df = pd.DataFrame(fighter_names, columns = ['Fighter_Name'])
fighter_name_df

Unnamed: 0,Fighter_Name
0,Kalib Starnes
1,Spike Carlyle
2,Jan Blachowicz
3,Brodie Farber
4,Alexander Volkanovski
...,...
2169,Kyle Bochniak
2170,Mike Ciesnolevicz
2171,Alan Omer
2172,Benoit Saint


In [5]:
fighter_name_df['ufc_url'] = 'https://www.ufc.com/athlete/' + fighter_name_df['Fighter_Name'].str.lower().str.replace(' ', '-')
fighter_name_df.head()

Unnamed: 0,Fighter_Name,ufc_url
0,Kalib Starnes,https://www.ufc.com/athlete/kalib-starnes
1,Spike Carlyle,https://www.ufc.com/athlete/spike-carlyle
2,Jan Blachowicz,https://www.ufc.com/athlete/jan-blachowicz
3,Brodie Farber,https://www.ufc.com/athlete/brodie-farber
4,Alexander Volkanovski,https://www.ufc.com/athlete/alexander-volkanovski
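
The download log further down shows that the lowercase-and-hyphenate rule produces failing URLs for some names, for example ones containing an apostrophe (`brendan-o'reilly`). Below is a minimal sketch of a stricter slug builder. The helper names `name_to_slug` and `name_to_url` are hypothetical, and the normalisation rules (strip accents, drop apostrophes and periods) are assumptions rather than UFC.com's documented URL scheme, so the resulting URLs still need to be checked against the site.

```python
import re
import unicodedata

BASE_URL = 'https://www.ufc.com/athlete/'

def name_to_slug(name):
    """Hypothetical helper: lowercase ASCII slug with hyphens between words."""
    # decompose accented characters, then drop anything that is not ASCII
    ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    # remove apostrophes and periods outright (assumed rule)
    ascii_name = re.sub(r"['.]", '', ascii_name.lower())
    # collapse every other run of non-alphanumeric characters into one hyphen
    return re.sub(r'[^a-z0-9]+', '-', ascii_name).strip('-')

def name_to_url(name):
    """Hypothetical helper: full bio URL for a fighter name."""
    return BASE_URL + name_to_slug(name)

print(name_to_url("Brendan O'Reilly"))
```

It could be applied to the whole table with `fighter_name_df['Fighter_Name'].map(name_to_url)`.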


#### Function to scrape fighter bio information from UFC.com

In [6]:
# Test the scraping steps on a single fighter before wrapping them in a function

url= 'https://www.ufc.com/athlete/robert-whittaker'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
bio = soup.find('div', class_='c-bio__info-details').text

fighter = soup.find('h1', class_='hero-profile__name').text

# get every c-bio__label and c-bio__text within bio
labels = soup.find_all('div', class_='c-bio__label')
texts = soup.find_all('div', class_='c-bio__text')

# create empty lists to store labels and texts
label_list = []
text_list = []

# loop through labels and texts and append to lists
for label in labels:
    label_list.append(label.text)

for text in texts:
    text_list.append(text.text)

# create dataframe from lists, with label as column names
fighter_bio_df = pd.DataFrame([label_list, text_list])
fighter_bio_df.columns = fighter_bio_df.iloc[0]

# drop 1st row
fighter_bio_df = fighter_bio_df.drop(fighter_bio_df.index[0])



# add fighter name column
fighter_bio_df['fighter'] = fighter

# replace any '\n' with ''
fighter_bio_df = fighter_bio_df.replace('\n', '', regex=True)

fighter_bio_df.to_csv('data/final/fighters/' + fighter + '.csv')
fighter_bio_df



Unnamed: 0,Status,Place of Birth,Trains at,Fighting style,Age,Height,Weight,Octagon Debut,Reach,Leg reach,fighter
1,Active,"Otahuhu, Australia","PMA, Padstow NSW, Australia",Brazilian Jiu-Jitsu,32,72.0,196.0,"Dec. 16, 2012",73.5,43.0,Robert Whittaker
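
The cell above builds the row from two separate lists, so it relies on `labels` and `texts` having the same length and order. A sketch of a small guard for that pairing step follows; `build_bio_row` is a hypothetical helper, and it collapses all repeated whitespace rather than only removing `'\n'` as the original does.

```python
def build_bio_row(label_list, text_list, fighter):
    """Hypothetical helper: pair scraped labels with their values in one dict."""
    if len(label_list) != len(text_list):
        # a missing value would silently shift every later column
        raise ValueError(
            f'{len(label_list)} labels but {len(text_list)} values for {fighter!r}'
        )

    def clean(value):
        # collapse newlines and repeated whitespace left over from the HTML
        return ' '.join(value.split())

    row = {clean(label): clean(text) for label, text in zip(label_list, text_list)}
    row['fighter'] = clean(fighter)
    return row

row = build_bio_row(['Status', 'Age'], ['\nActive\n', '\n32\n'], '\nRobert Whittaker\n')
print(row)
```

Passing the result to `pd.DataFrame([row])` gives the same kind of one-row frame, with the labels as column names.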


In [7]:
def get_ufc_bios(url):
    try:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        bio = soup.find('div', class_='c-bio__info-details').text

        fighter = soup.find('h1', class_='hero-profile__name').text

        # get every c-bio__label and c-bio__text within bio
        labels = soup.find_all('div', class_='c-bio__label')
        texts = soup.find_all('div', class_='c-bio__text')

        # create empty lists to store labels and texts
        label_list = []
        text_list = []

        # loop through labels and texts and append to lists
        for label in labels:
            label_list.append(label.text)

        for text in texts:
            text_list.append(text.text)

        # create dataframe from lists, with label as column names
        fighter_bio_df = pd.DataFrame([label_list, text_list])
        fighter_bio_df.columns = fighter_bio_df.iloc[0]

        # drop 1st row
        fighter_bio_df = fighter_bio_df.drop(fighter_bio_df.index[0])



        # add fighter name column
        fighter_bio_df['fighter'] = fighter

        # replace any '\n' with ''
        fighter_bio_df = fighter_bio_df.replace('\n', '', regex=True)

        fighter_bio_df.to_csv('data/final/fighters/' + fighter + '.csv')
        return fighter_bio_df

    except Exception:
        # If anything fails (request, missing page element, file write),
        # record the URL in the bios_with_errors folder

        print('Error with ' + str(url))
        data = {'fighter': [url]}
        df = pd.DataFrame(data)
        # the error file is named after the URL slug, e.g. 'kevin-holland.csv'
        url2 = url.replace('https://www.ufc.com/athlete/', '')
        df.to_csv('data/final/fighters/bios_with_errors/' + url2 + '.csv')


In [8]:
# test
get_ufc_bios('https://www.ufc.com/athlete/Max-Holloway')

Unnamed: 0,Status,Place of Birth,Trains at,Fighting style,Age,Height,Weight,Octagon Debut,Reach,Leg reach,fighter
1,Active,"Waianae, United States",Hawaii Elite MMA - Hawaii,Muay Thai,31,71.0,162.2,"Feb. 04, 2012",69.0,42.0,Max Holloway
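
`get_ufc_bios` calls `requests.get(url)` once, so a temporary network failure is recorded as an error in the same way as a page that does not exist. The sketch below shows one way to retry a fetch a few times before giving up. `fetch_with_retry` is a hypothetical helper that takes the fetch function as an argument, and the demo uses a fake fetch function instead of a real request.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Hypothetical helper: call fetch(url), retrying when it raises."""
    if retries < 1:
        raise ValueError('retries must be at least 1')
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                # out of attempts: let the caller see the last error
                raise
            # wait a little longer after each failed attempt
            time.sleep(delay * (attempt + 1))

# demo with a fake fetch that fails twice before succeeding
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('temporary failure')
    return 'page for ' + url

result = fetch_with_retry(flaky_fetch, 'https://www.ufc.com/athlete/max-holloway', delay=0)
print(result, 'after', len(calls), 'attempts')
```

With `requests`, the fetch argument could be `lambda u: requests.get(u, timeout=10)`; the 10-second timeout is an arbitrary choice.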


In [15]:
def download_missing_bios():
    # bios already saved: file names without the '.csv' extension
    working_bios_folder = os.listdir('data/final/fighters/')
    working_bios = [x[:-4] for x in working_bios_folder]

    # URLs that failed earlier; these files are named by URL slug
    # (e.g. 'joseph-gigliotti'), not by Fighter_Name
    bios_with_errors_folder = os.listdir('data/final/fighters/bios_with_errors/')
    bios_with_errors = [x[:-4] for x in bios_with_errors_folder]

    all_downloaded_bios = working_bios + bios_with_errors
    un_downloaded_bios = fighter_name_df[~fighter_name_df['Fighter_Name'].isin(all_downloaded_bios)]

    # drop any rows with NaN values
    un_downloaded_bios = un_downloaded_bios.dropna()

    # download every bio that has not been downloaded yet
    i = 0
    fighters = len(un_downloaded_bios['ufc_url'])

    for url in un_downloaded_bios['ufc_url']:
        try:
            # get_ufc_bios catches its own errors and prints 'Error with ...',
            # so 'Done with ...' is printed for failed URLs as well
            get_ufc_bios(url)
            print('Done with ' + url + ' ' + str(i) + ' of ' + str(fighters))
            i += 1

        except Exception:
            print('Error with ' + url)
            i += 1


In [16]:
download_missing_bios()

Done with https://www.ufc.com/athlete/jan-blachowicz 0 of 376
Error with https://www.ufc.com/athlete/joseph-gigliotti
Done with https://www.ufc.com/athlete/joseph-gigliotti 1 of 376
Error with https://www.ufc.com/athlete/anjos-rafael
Done with https://www.ufc.com/athlete/anjos-rafael 2 of 376
Error with https://www.ufc.com/athlete/rodrigo-de
Done with https://www.ufc.com/athlete/rodrigo-de 3 of 376
Error with https://www.ufc.com/athlete/ji-yeon
Done with https://www.ufc.com/athlete/ji-yeon 4 of 376
Error with https://www.ufc.com/athlete/brendan-o'reilly
Done with https://www.ufc.com/athlete/brendan-o'reilly 5 of 376
Done with https://www.ufc.com/athlete/ian-garry 6 of 376
Done with https://www.ufc.com/athlete/viviane-araujo 7 of 376
Error with https://www.ufc.com/athlete/silva-paul
Done with https://www.ufc.com/athlete/silva-paul 8 of 376
Error with https://www.ufc.com/athlete/felipe-dos
Done with https://www.ufc.com/athlete/felipe-dos 9 of 376
Error with https://www.ufc.com/athlete/ma

In [17]:
len(os.listdir('data/final/fighters/'))

2500

In [18]:
len(os.listdir('data/final/fighters/bios_with_errors/'))

698
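
The error files are named by URL slug, while `download_missing_bios` compares file names against `Fighter_Name`, so recorded errors never match and those URLs are attempted again on every run. Below is a sketch of a filter that matches saved bios by name and recorded errors by URL. `pending_urls` is a hypothetical helper and the file lists in the demo are made up.

```python
import os

BASE_URL = 'https://www.ufc.com/athlete/'

def pending_urls(name_to_url, bio_files, error_files):
    """Hypothetical helper: URLs with neither a saved bio nor a recorded error.

    name_to_url -- dict mapping Fighter_Name to its ufc_url
    bio_files   -- file names in data/final/fighters/ (named by fighter name)
    error_files -- file names in bios_with_errors/ (named by URL slug)
    """
    saved_names = {os.path.splitext(f)[0] for f in bio_files if f.endswith('.csv')}
    error_urls = {BASE_URL + os.path.splitext(f)[0] for f in error_files if f.endswith('.csv')}
    return [
        url for name, url in name_to_url.items()
        if name not in saved_names and url not in error_urls
    ]

urls = {
    'Max Holloway': BASE_URL + 'max-holloway',
    'Joseph Gigliotti': BASE_URL + 'joseph-gigliotti',
    'Ian Garry': BASE_URL + 'ian-garry',
}
todo = pending_urls(urls, ['Max Holloway.csv', 'bios_with_errors'], ['joseph-gigliotti.csv'])
print(todo)
```

In the notebook, the mapping could come from `dict(zip(fighter_name_df['Fighter_Name'], fighter_name_df['ufc_url']))` and the two file lists from `os.listdir`.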

#### Master DataFrame for Fighter Bios

In [21]:
fighter_bio_files = os.listdir('data/final/fighters/')
fighter_bio_files = [x for x in fighter_bio_files if x.endswith('.csv')]

# combine all fighter bios into one dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
fighter_bio_df = pd.concat(
    [pd.read_csv('data/final/fighters/' + file) for file in fighter_bio_files]
)

fighter_bio_df = fighter_bio_df.reset_index(drop=True)
fighter_bio_df.head()

Unnamed: 0.1,Unnamed: 0,Status,Place of Birth,Age,Height,Weight,Octagon Debut,fighter,Reach,Leg reach,Trains at,Fighting style
0,1,Not Fighting,"Martinsburg, United States",43.0,76.0,155.0,"Jan. 23, 2008",Corey Hill,,,,
1,1,Not Fighting,"Rio De Janeiro, Brazil",39.0,66.0,135.0,"Feb. 22, 2016",Augusto Mendes,65.0,36.0,,
2,1,Active,"Soure, Brazil",34.0,65.0,124.0,"Jun. 03, 2017",Deiveson Figueiredo,68.0,38.0,Team Figueiredo,Boxing
3,1,Not Fighting,"East Paulo Alto, United States",52.0,68.0,185.0,"Jul. 16, 1999",Eugene Jackson,,,,
4,1,Not Fighting,"Honolulu, United States",51.0,71.0,170.0,"Aug. 21, 2004",Ronald Jhun,,,,
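
The `Unnamed: 0` column, which is 1 in every row, is the row index that each per-fighter file was saved with and that `read_csv` then reads back as data. Below is a sketch of one way to avoid it, using in-memory CSV text as a stand-in for the per-fighter files; the sample rows are abbreviated.

```python
import io
import pandas as pd

# stand-ins for two per-fighter files, each saved with its index as the first column
file_a = io.StringIO(',Status,Age,fighter\n1,Active,32,Robert Whittaker\n')
file_b = io.StringIO(',Status,Age,Reach,fighter\n1,Active,31,69.0,Max Holloway\n')

# index_col=0 reads the saved index back as an index instead of a data column
frames = [pd.read_csv(f, index_col=0) for f in (file_a, file_b)]

# ignore_index=True renumbers the rows; columns missing from a file become NaN
combined = pd.concat(frames, ignore_index=True)

print(list(combined.columns))
```

Writing the result with `to_csv(..., index=False)` keeps the index out of the combined file as well.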


In [22]:
fighter_bio_df.head(3)

Unnamed: 0.1,Unnamed: 0,Status,Place of Birth,Age,Height,Weight,Octagon Debut,fighter,Reach,Leg reach,Trains at,Fighting style
0,1,Not Fighting,"Martinsburg, United States",43.0,76.0,155.0,"Jan. 23, 2008",Corey Hill,,,,
1,1,Not Fighting,"Rio De Janeiro, Brazil",39.0,66.0,135.0,"Feb. 22, 2016",Augusto Mendes,65.0,36.0,,
2,1,Active,"Soure, Brazil",34.0,65.0,124.0,"Jun. 03, 2017",Deiveson Figueiredo,68.0,38.0,Team Figueiredo,Boxing


In [None]:
fighter_bio_df.to_csv('data/final/aggregates/All_Fighter_Bios.csv')