## Scraping Fighter Biography Info from UFC.com

Each fighter's bio can be reached directly by URL:

- Example: https://www.ufc.com/athlete/kevin-holland
- All active athletes: https://www.ufc.com/athletes/all/active

This notebook scrapes fighter bio data from the official UFC website and reshapes it into a table suitable for data analysis or machine learning. Here is a breakdown:

1. **Importing Necessary Libraries**: The script starts by importing necessary Python packages and setting some configurations. Packages include pandas for data manipulation, numpy for numerical operations, matplotlib and seaborn for data visualization, BeautifulSoup and requests for web scraping, and a few others.

2. **Directory Settings**: Sets the working directory where data will be fetched and stored.

3. **Loading Dataset**: Loads a CSV file called "All_Fight_Totals.csv", which contains the fight records, including the fighter names in the `Fighter_A` and `Fighter_B` columns.

4. **Preparing Fighter Names**: Lists out unique fighter names, removes duplicates and forms URLs to each fighter's bio page on the UFC website.

5. **Scraping Bio Data**: The function `get_ufc_bios` scrapes each fighter's bio page. It fetches the page, extracts the bio labels and their corresponding values, and builds a one-row data frame from them. Each fighter's bio is saved to its own CSV file.

6. **Handling Errors**: If fetching or parsing a page raises an error, the URL is recorded in its own CSV file in a separate `bios_with_errors` folder.

7. **Fetching Missing Bio Data**: Defines another function, `download_missing_bios`, which compares the fighter list against the files already on disk and scrapes the fighters that have not been downloaded yet.

8. **Consolidating the Data**: After all available bios are downloaded, it loads the per-fighter CSV files and combines them into a master dataframe. Finally, it writes the combined dataframe to a CSV file called "All_Fighter_Bios.csv".

The result is a single CSV file of fighter bios, ready for data analysis or model building. Fighters whose constructed URL does not lead to a usable bio page (683 in the run shown below) are logged as errors rather than included.

In [2]:
# Load packages

# Standard library
import os
import time
import shutil       # to save files locally
import random
import datetime
import sqlite3
import warnings
from cmath import nan
from random import randint

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as mtick
from matplotlib.pyplot import figure
import seaborn as sns
from scipy.stats import norm

# Modelling
import xgboost
from xgboost import XGBClassifier

# Web scraping
import requests     # to fetch pages
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

warnings.filterwarnings('ignore')

# Project root; every relative path below is resolved against it
os.chdir('C:/Users/Travis/OneDrive/Data Science/Personal_Projects/Sports/UFC_Prediction_V2')


#### List all Fighters

In [3]:
df = pd.read_csv('data/final/aggregates/All_Fight_Totals.csv')

In [4]:
# list of all fighter_A and fighter_B names
fighter_names = list(df['Fighter_A'].unique()) + list(df['Fighter_B'].unique())

# remove duplicates
fighter_names = list(set(fighter_names))

len(fighter_names)

3086

In [5]:
fighter_name_df = pd.DataFrame(fighter_names, columns = ['Fighter_Name'])
fighter_name_df

Unnamed: 0,Fighter_Name
0,Raoni Barcelos
1,Miesha Tate
2,Francisco Figueiredo
3,Chris Cope
4,Brian Stann
...,...
3081,Geronimo dos
3082,Zelim Imadaev
3083,Ben Rothwell
3084,Nick Catone


In [6]:
fighter_name_df['ufc_url'] = 'https://www.ufc.com/athlete/' + fighter_name_df['Fighter_Name'].str.lower().str.replace(' ', '-')
fighter_name_df.head()

Unnamed: 0,Fighter_Name,ufc_url
0,Raoni Barcelos,https://www.ufc.com/athlete/raoni-barcelos
1,Miesha Tate,https://www.ufc.com/athlete/miesha-tate
2,Francisco Figueiredo,https://www.ufc.com/athlete/francisco-figueiredo
3,Chris Cope,https://www.ufc.com/athlete/chris-cope
4,Brian Stann,https://www.ufc.com/athlete/brian-stann
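
The URL above is built by lowercasing the name and replacing spaces with hyphens, so names with accents, punctuation, or repeated spaces may produce URLs that do not resolve. The following is a minimal sketch of a stricter slug builder. `name_to_slug` is a hypothetical helper, not part of the notebook, and the rules UFC.com actually uses for its slugs are not known here, so treat the output as a guess to be checked against the site.

```python
import re
import unicodedata


def name_to_slug(name: str) -> str:
    """Hypothetical helper: turn a display name into a URL-style slug."""
    # Decompose accented characters and drop the combining marks (é -> e)
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with a single hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower())
    # Remove hyphens left over at either end
    return slug.strip("-")


print(name_to_slug("Raoni Barcelos"))
print(name_to_slug("  José  Aldo "))
```

For names that contain only letters and single spaces, this gives the same result as the `str.lower().str.replace(' ', '-')` expression used in the cell above.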


#### Function to scrape fighter bio information from UFC.com

In [7]:
# Test the scraping steps on a single page before wrapping them in a function

url = 'https://www.ufc.com/athlete/robert-whittaker'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
bio = soup.find('div', class_='c-bio__info-details').text

fighter = soup.find('h1', class_='hero-profile__name').text

# get every c-bio__label and c-bio__text within bio
labels = soup.find_all('div', class_='c-bio__label')
texts = soup.find_all('div', class_='c-bio__text')

# create empty lists to store labels and texts
label_list = []
text_list = []

# loop through labels and texts and append to lists
for label in labels:
    label_list.append(label.text)

for text in texts:
    text_list.append(text.text)

# create dataframe from lists, with label as column names
fighter_bio_df = pd.DataFrame([label_list, text_list])
fighter_bio_df.columns = fighter_bio_df.iloc[0]

# drop 1st row
fighter_bio_df = fighter_bio_df.drop(fighter_bio_df.index[0])



# add fighter name column
fighter_bio_df['fighter'] = fighter

# replace any '\n' with ''
fighter_bio_df = fighter_bio_df.replace('\n', '', regex=True)

fighter_bio_df.to_csv('data/final/fighters/' + fighter + '.csv')
fighter_bio_df



Unnamed: 0,Status,Place of Birth,Trains at,Fighting style,Age,Height,Weight,Octagon Debut,Reach,Leg reach,fighter
1,Active,"Otahuhu, Australia","PMA, Padstow NSW, Australia",Brazilian Jiu-Jitsu,31,72.0,186.0,"Dec. 16, 2012",73.5,43.0,Robert Whittaker
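
The labels and values are collected by two separate `find_all` calls and matched by position, so a page where one list is longer than the other could shift values into the wrong columns. Below is a minimal sketch of a guard for that case, using plain strings in place of parsed tags. `pair_bio_fields` is a hypothetical helper and the sample strings are made up for illustration.

```python
def pair_bio_fields(labels, texts):
    """Hypothetical helper: pair bio labels with values, refusing mismatched lists."""
    if len(labels) != len(texts):
        raise ValueError(f"{len(labels)} labels but {len(texts)} values")
    # Strip surrounding whitespace and newlines from both sides of each pair
    return {label.strip(): text.strip() for label, text in zip(labels, texts)}


# Made-up sample strings, not scraped data
labels = ["Status", "Age\n", "Height"]
texts = ["Active", "31", "\n72.00\n"]

fields = pair_bio_fields(labels, texts)
print(fields)
```

A dictionary like this can be passed to `pd.DataFrame([fields])` to get a one-row dataframe with the labels as column names.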


In [8]:
def get_ufc_bios(url):
    """Scrape one fighter's bio page and save it as a CSV named after the fighter."""
    try:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')

        # raises AttributeError when the page has no bio section (e.g. a bad URL),
        # which sends the URL to the error branch below
        bio = soup.find('div', class_='c-bio__info-details').text

        fighter = soup.find('h1', class_='hero-profile__name').text

        # collect every c-bio__label and c-bio__text on the page
        label_list = [label.text for label in soup.find_all('div', class_='c-bio__label')]
        text_list = [text.text for text in soup.find_all('div', class_='c-bio__text')]

        # create dataframe from lists, with labels as column names
        fighter_bio_df = pd.DataFrame([label_list, text_list])
        fighter_bio_df.columns = fighter_bio_df.iloc[0]

        # drop 1st row (the labels, now used as the header)
        fighter_bio_df = fighter_bio_df.drop(fighter_bio_df.index[0])

        # add fighter name column
        fighter_bio_df['fighter'] = fighter

        # replace any '\n' with ''
        fighter_bio_df = fighter_bio_df.replace('\n', '', regex=True)

        fighter_bio_df.to_csv('data/final/fighters/' + fighter + '.csv')
        return fighter_bio_df

    except Exception:
        # On any error, record the URL in the bios_with_errors folder,
        # in a file named after the URL slug (e.g. 'jason-von.csv')
        print('Error with ' + str(url))
        slug = url.replace('https://www.ufc.com/athlete/', '')
        df = pd.DataFrame({'fighter': [url]})
        df.to_csv('data/final/fighters/bios_with_errors/' + slug + '.csv')
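
`get_ufc_bios` records every failure the same way, so a temporary network problem looks identical to a page that does not exist. One possible refinement is to retry the fetch a few times with a pause before giving up. The sketch below is an assumption, not part of the notebook: `fetch_with_retry` and its parameters are hypothetical names, and a stand-in function replaces the real request so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Hypothetical helper: call fetch(url), retrying on any exception.

    fetch is any callable that takes a URL, for example
    lambda u: requests.get(u, timeout=10).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            if attempt < attempts:
                # wait a little longer after each failed attempt
                time.sleep(delay * attempt)
    # every attempt failed: re-raise the most recent error
    raise last_error


# Demonstration with a stand-in fetcher that fails twice, then succeeds
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_retry(
    flaky_fetch, "https://www.ufc.com/athlete/kevin-holland", delay=0
)
print(result, len(calls))
```

Passing the fetch function in as an argument keeps the retry logic independent of `requests`, which also makes it easy to test offline.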


In [9]:
# test
get_ufc_bios('https://www.ufc.com/athlete/Max-Holloway')

Unnamed: 0,Status,Place of Birth,Trains at,Fighting style,Age,Height,Weight,Octagon Debut,Reach,Leg reach,fighter
1,Active,"Waianae, United States",Hawaii Elite MMA - Hawaii,Muay Thai,30,71.0,144.5,"Feb. 04, 2012",69.0,42.0,Max Holloway


In [10]:
def download_missing_bios():
    # bios already saved; files are named after the fighter name scraped from the page
    working_bios_folder = os.listdir('data/final/fighters/')
    working_bios = [x[:-4] for x in working_bios_folder]

    # URLs that failed earlier; files are named after the URL slug (e.g. 'jason-von')
    bios_with_errors_folder = os.listdir('data/final/fighters/bios_with_errors/')
    bios_with_errors = [x[:-4] for x in bios_with_errors_folder]

    # a fighter is already handled if the name matches a saved bio
    # or the URL slug matches a recorded error
    slugs = fighter_name_df['ufc_url'].str.replace('https://www.ufc.com/athlete/', '', regex=False)
    handled = fighter_name_df['Fighter_Name'].isin(working_bios) | slugs.isin(bios_with_errors)
    un_downloaded_bios = fighter_name_df[~handled]

    # drop rows with a missing name or URL
    un_downloaded_bios = un_downloaded_bios.dropna()

    # download every bio that is still missing
    fighters = len(un_downloaded_bios['ufc_url'])

    for i, url in enumerate(un_downloaded_bios['ufc_url']):
        get_ufc_bios(url)
        print('Done with ' + url + ' ' + str(i) + ' of ' + str(fighters))
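
`get_ufc_bios` names saved bios after the fighter's display name but names error records after the URL slug, so the bookkeeping has to compare each list against the matching key. A minimal sketch of that comparison with plain Python sets follows. `pending_fighters` is a hypothetical helper, and the sample names are only illustrative.

```python
def pending_fighters(name_to_url, saved_names, failed_slugs,
                     base="https://www.ufc.com/athlete/"):
    """Hypothetical helper: return {name: url} for fighters not yet processed."""
    saved = set(saved_names)
    failed = set(failed_slugs)
    pending = {}
    for name, url in name_to_url.items():
        slug = url.replace(base, "")
        # skip fighters with a saved bio (matched by name)
        # or a recorded error (matched by URL slug)
        if name not in saved and slug not in failed:
            pending[name] = url
    return pending


# Illustrative inputs only
fighters = {
    "Raoni Barcelos": "https://www.ufc.com/athlete/raoni-barcelos",
    "Miesha Tate": "https://www.ufc.com/athlete/miesha-tate",
    "Jason Von": "https://www.ufc.com/athlete/jason-von",
}
todo = pending_fighters(
    fighters, saved_names=["Raoni Barcelos"], failed_slugs=["jason-von"]
)
print(todo)
```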


In [11]:
download_missing_bios()

Done with https://www.ufc.com/athlete/raoni-barcelos 0 of 2744
Done with https://www.ufc.com/athlete/miesha-tate 1 of 2744
Done with https://www.ufc.com/athlete/brian-stann 2 of 2744
Done with https://www.ufc.com/athlete/chris-weidman 3 of 2744
Error with https://www.ufc.com/athlete/jason-von
Done with https://www.ufc.com/athlete/jason-von 4 of 2744
Error with https://www.ufc.com/athlete/mara-romero
Done with https://www.ufc.com/athlete/mara-romero 5 of 2744
Done with https://www.ufc.com/athlete/kyle-stewart 6 of 2744
Done with https://www.ufc.com/athlete/tyler-diamond 7 of 2744
Done with https://www.ufc.com/athlete/elvis-mutapcic 8 of 2744
Error with https://www.ufc.com/athlete/tatsuya-mizuno
Done with https://www.ufc.com/athlete/tatsuya-mizuno 9 of 2744
Error with https://www.ufc.com/athlete/andre-amado
Done with https://www.ufc.com/athlete/andre-amado 10 of 2744
Error with https://www.ufc.com/athlete/silva-mara
Done with https://www.ufc.com/athlete/silva-mara 11 of 2744
Done with ht

In [12]:
len(os.listdir('data/final/fighters/'))

2404

In [13]:
len(os.listdir('data/final/fighters/bios_with_errors/'))

683
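
`os.listdir` returns subfolders as well as files, so the first count above includes the `bios_with_errors` folder itself in addition to the bio CSVs. A minimal sketch of counting only CSV files is shown below. `count_csv_files` is a hypothetical helper, and the example builds a throwaway folder so it does not touch the project data.

```python
import os
import tempfile


def count_csv_files(folder):
    """Hypothetical helper: count regular files in folder that end in .csv."""
    return sum(
        1
        for entry in os.scandir(folder)
        if entry.is_file() and entry.name.lower().endswith(".csv")
    )


# Build a small folder that mimics the layout: two CSVs plus one subfolder
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "bios_with_errors"))
    for name in ("Max Holloway.csv", "Robert Whittaker.csv"):
        open(os.path.join(root, name), "w").close()
    listed = len(os.listdir(root))      # counts the subfolder too
    csv_count = count_csv_files(root)   # counts only the CSV files

print(listed, csv_count)
```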

#### Master DataFrame for fighter bios

In [14]:
fighter_bio_files = os.listdir('data/final/fighters/')
fighter_bio_files = [x for x in fighter_bio_files if x.endswith('.csv')]

# read every fighter bio and combine them into one dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
bio_frames = [pd.read_csv('data/final/fighters/' + file) for file in fighter_bio_files]
fighter_bio_df = pd.concat(bio_frames)

fighter_bio_df = fighter_bio_df.reset_index(drop=True)
fighter_bio_df.head()

Unnamed: 0.1,Unnamed: 0,Status,Place of Birth,Fighting style,Age,Height,Weight,Octagon Debut,Reach,Leg reach,fighter,Trains at
0,1,Not Fighting,"Parrish, United States",MMA,32.0,72.0,155.0,"Jul. 30, 2019",78.0,42.0,Aalon Cruz,
1,1,Not Fighting,"Newport Beach, United States",,47.0,75.0,231.0,"Nov. 17, 2000",,,Aaron Brink,
2,1,,"Tillsonburg, Canada",MMA,29.0,74.0,185.5,"Dec. 11, 2022",73.5,41.0,Aaron Jeffery,
3,1,Active,"Houston, United States",Muay Thai,32.0,69.0,135.0,"May. 24, 2014",71.0,40.0,Aaron Phillips,Headkicks MMA
4,1,Not Fighting,"Tell City, United States",,40.0,68.0,156.0,"May. 10, 2002",,,Aaron Riley,


In [15]:
fighter_bio_df.head(3)

Unnamed: 0.1,Unnamed: 0,Status,Place of Birth,Fighting style,Age,Height,Weight,Octagon Debut,Reach,Leg reach,fighter,Trains at
0,1,Not Fighting,"Parrish, United States",MMA,32.0,72.0,155.0,"Jul. 30, 2019",78.0,42.0,Aalon Cruz,
1,1,Not Fighting,"Newport Beach, United States",,47.0,75.0,231.0,"Nov. 17, 2000",,,Aaron Brink,
2,1,,"Tillsonburg, Canada",MMA,29.0,74.0,185.5,"Dec. 11, 2022",73.5,41.0,Aaron Jeffery,


In [17]:
fighter_bio_df.to_csv('data/final/aggregates/All_Fighter_Bios.csv')
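
The `Octagon Debut` column in the combined file holds text such as "Jul. 30, 2019", which usually needs converting to a date before analysis. The sketch below assumes every value follows the "Mon. DD, YYYY" pattern seen in the outputs above, which has not been checked across the full dataset. `parse_debut` is a hypothetical helper.

```python
from datetime import date

# Map three-letter month abbreviations to month numbers
MONTHS = {
    abbr: number
    for number, abbr in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_debut(text):
    """Hypothetical helper: parse text like 'Dec. 16, 2012' into a date."""
    month_part, day_part, year_part = text.replace(",", " ").split()
    # Drop the trailing period and keep the first three letters of the month
    month = MONTHS[month_part.rstrip(".")[:3]]
    return date(int(year_part), month, int(day_part))


print(parse_debut("Dec. 16, 2012"))
print(parse_debut("May. 24, 2014"))
```

Values that do not match the pattern raise a `ValueError` or `KeyError`, which makes unexpected formats visible instead of silently producing wrong dates.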