# Scraping BestFightOdds.com

This notebook scrapes event and odds information for UFC (Ultimate Fighting Championship) fights, for use in data analysis or predictive modeling. The odds come from bestfightodds.com, and the list of completed events comes from the UFC Stats website. It uses Pandas for data manipulation, BeautifulSoup for HTML parsing, and Selenium for browser automation.


PROBLEM with BestFightOdds: many fights do not appear on the card the site lists them under (as far as I can tell), so its event organization is unreliable.


Here's a high-level flow of what this script does:

1. Imports the necessary libraries and sets the working directory.

2. Loads event data from an existing CSV file ("Final_Hand_Done_BFO_Urls_with UFC_Stats_Urls.csv").

3. Extracts the event URLs from that file and builds a list of unique URLs.

4. Compares these against the completed UFC events listed on the UFC Stats website. Any new event not covered by the CSV is identified, and its bestfightodds.com URL is looked up through a Google search.

5. Flags any errors in fetching these URLs; depending on the issue, manual intervention may be needed.

6. Downloads the odds for each event. Earlier versions sent a GET request with user-agent headers to avoid being blocked; the page is now loaded through Selenium. The tables are read into a Pandas DataFrame for further processing.

7. Once the odds for all URLs are downloaded, scrapes the odds-change data by driving a browser through each URL with Selenium WebDriver.

8. Summarizes the collection process and saves all data to CSV files for further analysis.

9. Outputs: (in data/final/...)
- 'events/All_Events_V1.csv'
- 'events/odds_per_event/' + event_name + '.csv'
- 'odds/All_Odds_by_Fighter_V1.csv'
- 'odds/All_Odds_by_Fighter_V2.csv'
- 'odds/odds_changes/' + name + '.csv'
- 'aggregates/All_Odds_Changes_V1.csv'
- 'odds/All_Odds_by_Fighter_WithChange.csv'




In [5]:
# Load Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as mtick
import seaborn as sns
from matplotlib.pyplot import figure
from bs4 import BeautifulSoup
import time
import requests     
import shutil      
import datetime
from scipy.stats import norm
from random import randint
import random
import os
os.chdir('/Users/travisroyce/Library/CloudStorage/OneDrive-Personal/Data Science/Personal_Projects/Sports/UFC_Prediction_V2')
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## Excel-fixed data:

Initially, I matched events up by hand in Excel, so we load that file first. From here on, a function scrapes the data for any new events.

In [6]:
# get the current path
path = os.getcwd()
path

'/Users/travisroyce/Library/CloudStorage/OneDrive-Personal/Data Science/Personal_Projects/Sports/UFC_Prediction_V2'

In [7]:
# load excel-fixed data, which we will update with the scraped data
fight_odds_urls = pd.read_csv('data/final/events/Final_Hand_Done_BFO_Urls_with UFC_Stats_Urls.csv')

print(fight_odds_urls.shape)
fight_odds_urls.head(3)

(543, 11)


Unnamed: 0,event_title,event_url,event_odds_url,Short_title,Short_Title_2 (vs. to vs),Name1,Name2,event_name,event_date,event_id,bfo_url
0,UFC Fight Night: Thompson vs. Holland,http://ufcstats.com/event-details/b23388ff8ac6...,https://www.bestfightodds.com/events/ufc-fight...,Thompson vs. Holland,Thompson vs Holland,thompson,holland,,,,
1,UFC Fight Night: Nzechukwu vs. Cutelaba,http://ufcstats.com/event-details/012fc7cd0779...,https://www.bestfightodds.com/events/ufc-fight...,Nzechukwu vs. Cutelaba,Nzechukwu vs Cutelaba,nzechukwu,cutelaba,,,,
2,UFC 281: Adesanya vs. Pereira,http://ufcstats.com/event-details/b3b6e80b7d5f...,https://www.bestfightodds.com/events/ufc-281-2529,Adesanya vs. Pereira,Adesanya vs Pereira,adesanya,pereira,,,,


In [8]:
# get all unique UFC Stats event urls from the excel-fixed data
event_urls = fight_odds_urls['event_url'].unique()
print(f' {len(event_urls)} events in excel-fixed data')

# keep only the event id (the last path segment of each url)
event_urls = [url.split('/')[-1] for url in event_urls]

 543 events in excel-fixed data


In [9]:

def get_last_events_ufcstats():
    """
    Retrieves information about completed UFC events from the UFC Stats website.
    Returns a DataFrame with columns: event_name, event_date, event_url, event_id.
    """
    url = 'http://www.ufcstats.com/statistics/events/completed'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Extract event links and names
    events = soup.find_all('a', class_='b-link b-link_style_black')
    event_links = [link.get('href') for link in events]
    event_names = [link.text.strip() for link in events]

    # Extract event dates
    event_dates = soup.find_all('span', class_='b-statistics__date')
    event_dates = [date.text.strip() for date in event_dates][1:]

    # Create DataFrame
    df = pd.DataFrame({'event_name': event_names, 'event_date': event_dates, 'event_url': event_links})
    df['event_date'] = pd.to_datetime(df['event_date'])
    df = df.sort_values(by='event_date', ascending=False).reset_index(drop=True)
    df['event_id'] = df['event_url'].apply(lambda x: x.split('/')[-1])
    
    return df 

In [10]:
last_events = get_last_events_ufcstats()
last_events

Unnamed: 0,event_name,event_date,event_url,event_id
0,UFC Fight Night: Fiziev vs. Gamrot,2023-09-23,http://www.ufcstats.com/event-details/c945adc2...,c945adc22c2bfe8f
1,UFC Fight Night: Grasso vs. Shevchenko 2,2023-09-16,http://www.ufcstats.com/event-details/8fa2b065...,8fa2b06572365321
2,UFC 293: Adesanya vs. Strickland,2023-09-09,http://www.ufcstats.com/event-details/ece28074...,ece280745f8727b8
3,UFC Fight Night: Gane vs. Spivac,2023-09-02,http://www.ufcstats.com/event-details/ef61d9f5...,ef61d9f5176b3200
4,UFC Fight Night: Holloway vs. The Korean Zombie,2023-08-26,http://www.ufcstats.com/event-details/89a40703...,89a407032911e27e
5,UFC 292: Sterling vs. O'Malley,2023-08-19,http://www.ufcstats.com/event-details/2719f300...,2719f300b0439039
6,UFC Fight Night: Luque vs. Dos Anjos,2023-08-12,http://www.ufcstats.com/event-details/d2fa318f...,d2fa318f34d0aadc
7,UFC Fight Night: Sandhagen vs. Font,2023-08-05,http://www.ufcstats.com/event-details/6f81b6de...,6f81b6de2557739a
8,UFC 291: Poirier vs. Gaethje 2,2023-07-29,http://www.ufcstats.com/event-details/ccd58ff7...,ccd58ff71e260ed5
9,UFC Fight Night: Aspinall vs. Tybura,2023-07-22,http://www.ufcstats.com/event-details/1174782e...,1174782eacde9b0c


In [11]:
last_events_urls = last_events['event_id'].unique()

# get events in last_events that are not in fight_odds_urls
new_events = [event for event in last_events_urls if event not in event_urls]
new_events

['c945adc22c2bfe8f']
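
The comparison above can also be written with a set, which avoids scanning the whole list of known ids for every event. A minimal sketch: `event_id_from_url` and `find_new_event_ids` are hypothetical helpers, not part of the original notebook, and the two ids are copied from the table above purely as sample input.

```python
def event_id_from_url(url):
    """Return the last path segment of a UFC Stats event url."""
    return url.rstrip('/').split('/')[-1]


def find_new_event_ids(latest_ids, known_urls):
    """Return the ids in latest_ids that are absent from known_urls, keeping order."""
    known_ids = {event_id_from_url(u) for u in known_urls}
    return [i for i in latest_ids if i not in known_ids]


# sample input: one event is already known, one is new
known = ['http://ufcstats.com/event-details/8fa2b06572365321']
latest = ['c945adc22c2bfe8f', '8fa2b06572365321']
print(find_new_event_ids(latest, known))  # ['c945adc22c2bfe8f']
```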

Identify Events to Download

In [12]:
# copy the slice so later column assignments do not trigger SettingWithCopyWarning
missing_events = last_events[last_events['event_id'].isin(new_events)].copy()
# strip event names
missing_events['event_name'] = missing_events['event_name'].str.strip()
missing_events


Unnamed: 0,event_name,event_date,event_url,event_id
0,UFC Fight Night: Fiziev vs. Gamrot,2023-09-23,http://www.ufcstats.com/event-details/c945adc2...,c945adc22c2bfe8f


Find BestFightOdds url for event

In [13]:
driver = webdriver.Chrome(path + '/chromedriver')

## NOTE: FIGHT ODDS NOT SCRAPING? (Sep21)

- This uses an event-based search, but in my experience BestFightOdds handles event searches poorly and works better with a fighter-first search.

In [14]:
# find the rows where event_odds_url is empty in the fight_odds_urls dataframe
missing= fight_odds_urls[fight_odds_urls['event_odds_url'].isnull()]
missing


Unnamed: 0,event_title,event_url,event_odds_url,Short_title,Short_Title_2 (vs. to vs),Name1,Name2,event_name,event_date,event_id,bfo_url


In [15]:
def get_bfo_url(event_name):
    """
    Input is an event name from missing_events.
    Output is the url of the first Google result for that event on bestfightodds.com.
    Uses the Selenium `driver` opened above.
    """

    driver.get('https://www.google.com/')
    # wait 3 sec for the page to load
    time.sleep(3)

    # find the search box
    # (the find_element_by_* helpers were removed in Selenium 4.3, so use By locators)
    search = driver.find_element(By.NAME, 'q')
    # wait 1 sec
    time.sleep(1)

    # search 'site:bestfightodds.com "events" <event name>'
    search.send_keys('site:bestfightodds.com ' + '"events"' + ' ' + event_name)
    search.send_keys(Keys.RETURN)

    # find div id="search"
    search_results = driver.find_element(By.ID, 'search')
    # get all links in div
    links = search_results.find_elements(By.TAG_NAME, 'a')
    # click the first link
    links[0].click()
    # wait 2 sec
    time.sleep(2)

    # get the url of the page we landed on
    url = driver.current_url
    return url

## MUST Scrape Individually from BFO

In [32]:
get_bfo_url('UFC 291')

'https://www.bestfightodds.com/events/ufc-291-2900'

In [16]:
missing_events['event_name'][0]

'UFC Fight Night: Fiziev vs. Gamrot'

In [17]:
get_bfo_url(missing_events['event_name'][0])

'https://www.bestfightodds.com/?fbclid=IwAR1laQi226VVNOy4o6hO7kljG_hjE9TgvaEXGUp1KDFvglE12RxdmCspRX4'
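
The result above is the BFO home page rather than an event page, so the search can silently return a wrong url. One way to catch this is to check the url before storing it. A minimal sketch using only the standard library: `is_bfo_event_url` is a hypothetical helper, and the assumption that every event page has the form `/events/<slug>` is based only on the urls seen in this notebook.

```python
from urllib.parse import urlparse

# hosts accepted as BestFightOdds (assumption: only these two forms occur)
BFO_HOSTS = ('bestfightodds.com', 'www.bestfightodds.com')


def is_bfo_event_url(url):
    """True if url is a bestfightodds.com page of the form /events/<slug>."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split('/') if s]
    return (parts.netloc.lower() in BFO_HOSTS
            and len(segments) == 2
            and segments[0] == 'events')


print(is_bfo_event_url('https://www.bestfightodds.com/events/ufc-291-2900'))  # True
print(is_bfo_event_url('https://www.bestfightodds.com/?fbclid=abc'))          # False
```

A url that fails this check could be recorded as 'error' so that the row is fixed by hand later.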

In [18]:
# for each event in missing_events, get the bfo url and add it to the dataframe

bfo_urls = []
for event in missing_events['event_name']:
    try:
        bfo_url = get_bfo_url(event)
        bfo_urls.append(bfo_url)
    except Exception as e:
        # record the failure so the row can be fixed by hand later
        bfo_urls.append('error')
        print(f'error with {event}: {e}')

# assign() returns a new dataframe, which avoids SettingWithCopyWarning
missing_events = missing_events.assign(bfo_url=bfo_urls)
missing_events


Unnamed: 0,event_name,event_date,event_url,event_id,bfo_url
0,UFC Fight Night: Fiziev vs. Gamrot,2023-09-23,http://www.ufcstats.com/event-details/c945adc2...,c945adc22c2bfe8f,https://www.bestfightodds.com/?fbclid=IwAR1laQ...


In [19]:
# rename bfo_url to event_odds_url
missing_events = missing_events.rename(columns={'bfo_url': 'event_odds_url'})
# rename event_name to event_title
missing_events = missing_events.rename(columns={'event_name': 'event_title'})
# drop event_id
missing_events = missing_events.drop(columns=['event_id'])

In [20]:
# update fight_odds_urls with missing_events
fight_odds_urls = pd.concat([fight_odds_urls, missing_events], axis=0)
fight_odds_urls = fight_odds_urls.reset_index(drop=True)
fight_odds_urls.head(3)

Unnamed: 0,event_title,event_url,event_odds_url,Short_title,Short_Title_2 (vs. to vs),Name1,Name2,event_name,event_date,event_id,bfo_url
0,UFC Fight Night: Thompson vs. Holland,http://ufcstats.com/event-details/b23388ff8ac6...,https://www.bestfightodds.com/events/ufc-fight...,Thompson vs. Holland,Thompson vs Holland,thompson,holland,,,,
1,UFC Fight Night: Nzechukwu vs. Cutelaba,http://ufcstats.com/event-details/012fc7cd0779...,https://www.bestfightodds.com/events/ufc-fight...,Nzechukwu vs. Cutelaba,Nzechukwu vs Cutelaba,nzechukwu,cutelaba,,,,
2,UFC 281: Adesanya vs. Pereira,http://ufcstats.com/event-details/b3b6e80b7d5f...,https://www.bestfightodds.com/events/ufc-281-2529,Adesanya vs. Pereira,Adesanya vs Pereira,adesanya,pereira,,,,


In [21]:
# save to csv
fight_odds_urls.to_csv('data/final/events/Final_Hand_Done_BFO_Urls_with UFC_Stats_Urls.csv', index=False)

#### Scraping BFO via Selenium (formerly Requests & Headers)

In [30]:
# BFO must now be scraped via Selenium; then we grab the tables.

url = 'https://www.bestfightodds.com/events/ufc-282-2621'
# driver already open
driver.get(url)
# wait 4 seconds
time.sleep(4)
# get the tables
tables = pd.read_html(driver.page_source)
# show the first table
tables[0]

# The following is the old approach, from when we used fake headers
# header = {
#   "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
#   "X-Requested-With": "XMLHttpRequest"
# }
# r = requests.get(url, headers = header)
# print(r)
# dfs = pd.read_html(r.text)
# dfs[0]

Unnamed: 0,Jan Brachowicz
0,Magomed Ankalaev
1,Over 1½ rounds
2,Under 1½ rounds
3,Over 2½ rounds
4,Under 2½ rounds
...,...
1725,Under 7½ fights go the distance
1726,Over 8½ fights go the distance
1727,Under 8½ fights go the distance
1728,Over 9½ fights go the distance


## Scraping BFO Data

In [23]:
fight_odds_urls = pd.read_csv('data/final/events/Final_Hand_Done_BFO_Urls_with UFC_Stats_Urls.csv')

In [24]:
all_event_odds_urls = fight_odds_urls.event_odds_url.unique()
len(all_event_odds_urls)

520

In [25]:
len('https://www.bestfightodds.com/events/')

37
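
The prefix length is presumably meant for slicing the event slug out of a url. A sketch of that step, assuming every event url starts with the same prefix; `BFO_EVENT_PREFIX` and `event_slug` are hypothetical names introduced for illustration.

```python
BFO_EVENT_PREFIX = 'https://www.bestfightodds.com/events/'


def event_slug(url):
    """Strip the common prefix and return the event slug."""
    if not url.startswith(BFO_EVENT_PREFIX):
        raise ValueError(f'not a BFO event url: {url}')
    return url[len(BFO_EVENT_PREFIX):]


print(event_slug('https://www.bestfightodds.com/events/ufc-281-2529'))  # ufc-281-2529
```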

In [31]:
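def get_odds(url):
    """
    Load a BFO event page in the open Selenium driver, read the odds table,
    and save it to data/final/events/odds_per_event/<event_id>.csv.
    """
    driver.get(url)
    # wait 4 seconds
    time.sleep(4)

    # get the tables; the second one holds the odds
    tables = pd.read_html(driver.page_source)
    data = tables[1]
    data.rename(columns={'Unnamed: 0': 'fighter'}, inplace=True)
    data['event_odds_url'] = url
    # the fifth '/'-separated piece of the url is the event slug
    event_name = url.split('/')[4]
    data['event_id'] = event_name
    data.to_csv('data/final/events/odds_per_event/' + event_name + '.csv', index=False)
    return data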



In [32]:
get_odds(all_event_odds_urls[1]).head(2)

Unnamed: 0,fighter,DraftKings,BetMGM,Caesars,BetRivers,FanDuel,PointsBet,Unibet,Bet365,BetWay,Props,Props.1,Props.2,event_odds_url,event_id
0,Derrick Lewis,+210▲,+180▲,+200▲,+180▲,+198▲,,+180▲,+180▲,+163▲,,116.0,,https://www.bestfightodds.com/events/ufc-fight...,ufc-fight-night-215-2633
1,Sergey Spivak,-250▼,-225▼,-240▼,-230▼,-250▼,,-230▼,-222▼,-200▼,,116.0,,https://www.bestfightodds.com/events/ufc-fight...,ufc-fight-night-215-2633


In [33]:
# drop nan
all_event_odds_urls = all_event_odds_urls[~pd.isnull(all_event_odds_urls)]
all_event_odds_urls

array(['https://www.bestfightodds.com/events/ufc-fight-night-214-2607',
       'https://www.bestfightodds.com/events/ufc-fight-night-215-2633',
       'https://www.bestfightodds.com/events/ufc-281-2529',
       'https://www.bestfightodds.com/events/ufc-fight-night-214ting-championship-2612',
       'https://www.bestfightodds.com/events/ufc-fight-night-213-2606',
       'https://www.bestfightodds.com/events/ufc-281-2586',
       'https://www.bestfightodds.com/events/ufc-fight-night-212-2579',
       'https://www.bestfightodds.com/events/ufc-vegas-57-2489',
       'https://www.bestfightodds.com/events/ufc-fight-night-211-2569',
       'https://www.bestfightodds.com/events/ufc-279-2541',
       'https://www.bestfightodds.com/events/ufc-fight-night-gane-vs-tuivasa-2518',
       'https://www.bestfightodds.com/events/ufc-278-usman-vs-edwards-2-2545',
       'https://www.bestfightodds.com/events/ufc-fight-night-vera-vs-cruz-2552',
       'https://www.bestfightodds.com/events/ufc-on-espn-santo

In [34]:
# Initialize a counter variable to keep track of the number of URLs processed
n = 0

# Initialize lists to store URLs that resulted in errors and those that were processed successfully
errors = []
complete = []

# Get a list of filenames from the 'data/final/events/odds_per_event/' directory
files = os.listdir('data/final/events/odds_per_event/')

# Strip the file extension to get the event identifiers already on disk
file_list = [os.path.splitext(f)[0] for f in files]

# Loop through each URL in the list of URLs pointing to event odds pages
for url in all_event_odds_urls:
    # The event identifier is the last path segment of the URL
    url_file_name = url.split('/')[-1]

    # Only scrape events that do not have a saved file yet
    if url_file_name not in file_list:
        try:
            # Attempt to get odds data from the URL using the 'get_odds' function
            get_odds(url)

            # Print a message indicating the progress (number of URLs processed out of total)
            print(f'{n} / {len(all_event_odds_urls)}')
        except Exception as e:
            # Print an error message if an exception occurs while getting odds data
            print(f'ERROR! {n} / {len(all_event_odds_urls)}: {e}')

            # Append the problematic URL to the 'errors' list
            errors.append(url)

        # Increment the counter whether or not the scrape succeeded
        n += 1
    else:
        # If the event identifier is already in 'file_list', print a message indicating that it exists
        print(f'{url_file_name} already exists')

        # Append the event identifier to the 'complete' list
        complete.append(url_file_name)

# Print the total number of errors encountered during the execution
print(f' total errors: {len(errors)}')


ufc-fight-night-214-2607 already exists
ufc-fight-night-215-2633 already exists
ufc-281-2529 already exists
ufc-fight-night-214ting-championship-2612 already exists
ufc-fight-night-213-2606 already exists
ufc-281-2586 already exists
ufc-fight-night-212-2579 already exists
ufc-vegas-57-2489 already exists
ufc-fight-night-211-2569 already exists
ufc-279-2541 already exists
ufc-fight-night-gane-vs-tuivasa-2518 already exists
ufc-278-usman-vs-edwards-2-2545 already exists
ufc-fight-night-vera-vs-cruz-2552 already exists
ufc-on-espn-santos-vs-hill-2534 already exists
ufc-277-pena-vs-nunes-2-2517 already exists
ufc-fight-night-blaydes-vs-aspinall-2519 already exists
ufc-on-abc-ortega-vs-rodriguez-2524 already exists
ufc-on-espn-dos-anjos-vs-fiziev-2501 already exists
ufc-276-adesanya-vs-cannonier-2478 already exists
ufc-on-espn-tsarukyan-vs-gamrot-2574 already exists
ufc-fight-night-kattar-vs-emmett-2523 already exists
ufc-275-teixeira-vs-prochazka-2561 already exists
ufc-fight-night-211-244
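
The loop above follows a resume pattern: an event is scraped only if its CSV is not on disk yet. The selection step can be isolated into a small function, which makes it easy to check without opening a browser. A sketch: `pending_urls` is a hypothetical helper, and the temporary directory stands in for `data/final/events/odds_per_event/`.

```python
import tempfile
from pathlib import Path


def pending_urls(urls, out_dir):
    """Return the urls whose '<slug>.csv' file does not exist in out_dir yet."""
    done = {p.stem for p in Path(out_dir).glob('*.csv')}
    return [u for u in urls if u.rstrip('/').split('/')[-1] not in done]


with tempfile.TemporaryDirectory() as tmp:
    # pretend one event was already scraped on an earlier run
    (Path(tmp) / 'ufc-281-2529.csv').write_text('fighter\n')
    urls = [
        'https://www.bestfightodds.com/events/ufc-281-2529',
        'https://www.bestfightodds.com/events/ufc-279-2541',
    ]
    todo = pending_urls(urls, tmp)

print(todo)  # ['https://www.bestfightodds.com/events/ufc-279-2541']
```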

### Append Together

In [35]:
# Get list of all files in 'data/final/events/odds_per_event/' directory
files = os.listdir('data/final/events/odds_per_event/')

# Create an empty list to store all the event odds data frames
all_event_odds = []

# Loop through each file in the directory
for file in files:
    # Read the csv file and store it as a pandas data frame
    df = pd.read_csv('data/final/events/odds_per_event/' + file)
    # Append the data frame to the list of all event odds data frames
    all_event_odds.append(df)

# Concatenate all the data frames in the list into a single data frame
all_event_odds = pd.concat(all_event_odds)

# Print the shape (dimensions) of the final data frame and display the first few rows
print(all_event_odds.shape)
all_event_odds.head()

(369317, 17)


Unnamed: 0,fighter,DraftKings,BetMGM,Caesars,BetRivers,FanDuel,PointsBet,Unibet,BetWay,5D,Ref,Props,Props.1,Props.2,event_odds_url,event_id,Bet365
0,Johny Hendricks,,,,,,,,,-220▲,-225▼,,35.0,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,
1,Robbie Lawler,,,,,,,,,+200▼,+190▲,,35.0,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,
2,Over 4½ rounds,,,,,,,,,+105▲,-105▲,,,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,
3,Under 4½ rounds,,,,,,,,,-125▼,-125▼,,,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,
4,Fight goes to decision,,,,,,,,,+110▲,,,,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,


Add Event Names & URLs to Event_Odds df

In [36]:
def get_event_title(event_odds_url):
    # first, filter the fight_odds_urls dataframe to only include rows with a matching event_odds_url
    data = fight_odds_urls[fight_odds_urls['event_odds_url'] == event_odds_url]
    # next, extract the event_title value from the filtered dataframe
    event_name = data.event_title.values[0]
    # finally, return the event_title value
    return event_name

def get_ufcstats_url(event_odds_url):
    # similar to the previous function, filter the fight_odds_urls dataframe and extract the event_url value
    data = fight_odds_urls[fight_odds_urls['event_odds_url'] == event_odds_url]
    event_url = data.event_url.values[0]
    # return the event_url value
    return event_url

In [37]:
all_event_odds['event_name'] = all_event_odds.apply(lambda row: get_event_title(row['event_odds_url']), axis=1)

In [38]:
all_event_odds['event_ufcstats_url'] = all_event_odds.apply(lambda row: get_ufcstats_url(row['event_odds_url']), axis=1)

In [39]:
all_event_odds.head(3)

Unnamed: 0,fighter,DraftKings,BetMGM,Caesars,BetRivers,FanDuel,PointsBet,Unibet,BetWay,5D,Ref,Props,Props.1,Props.2,event_odds_url,event_id,Bet365,event_name,event_ufcstats_url
0,Johny Hendricks,,,,,,,,,-220▲,-225▼,,35.0,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,,UFC 181: Hendricks vs Lawler II,http://ufcstats.com/event-details/dfdd0c5dd0d4...
1,Robbie Lawler,,,,,,,,,+200▼,+190▲,,35.0,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,,UFC 181: Hendricks vs Lawler II,http://ufcstats.com/event-details/dfdd0c5dd0d4...
2,Over 4½ rounds,,,,,,,,,+105▲,-105▲,,,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,,UFC 181: Hendricks vs Lawler II,http://ufcstats.com/event-details/dfdd0c5dd0d4...
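
The two `apply` calls above filter `fight_odds_urls` once per row, which is slow on a few hundred thousand rows. A lookup built once and applied with `Series.map` should produce the same columns. A sketch on toy data: the two frames below are made-up stand-ins for `fight_odds_urls` and `all_event_odds`, and keeping the first match per url is an assumption that mirrors `.values[0]`.

```python
import pandas as pd

# toy stand-ins for fight_odds_urls and all_event_odds (made-up values)
fight_odds_urls = pd.DataFrame({
    'event_odds_url': ['bfo/a', 'bfo/b', 'bfo/b'],
    'event_title': ['Event A', 'Event B', 'Event B (dup)'],
    'event_url': ['stats/a', 'stats/b', 'stats/b2'],
})
all_event_odds = pd.DataFrame({
    'fighter': ['F1', 'F2', 'F3'],
    'event_odds_url': ['bfo/a', 'bfo/a', 'bfo/b'],
})

# one row per odds url, keeping the first match like .values[0] does
lookup = (fight_odds_urls
          .drop_duplicates(subset='event_odds_url', keep='first')
          .set_index('event_odds_url'))

# map each row's odds url to its title and UFC Stats url in one pass
all_event_odds['event_name'] = all_event_odds['event_odds_url'].map(lookup['event_title'])
all_event_odds['event_ufcstats_url'] = all_event_odds['event_odds_url'].map(lookup['event_url'])
print(all_event_odds)
```

One difference to keep in mind: a url with no match becomes NaN here, whereas the `apply` version raises an IndexError.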


In [40]:
all_event_odds.to_csv('data/final/odds/All_Odds_by_Fighter_V1.csv', index=False)

In [41]:
all_event_odds['5D'] = all_event_odds['5D'].str[:-1]
all_event_odds['5D'] = all_event_odds['5D'].astype(float)
all_event_odds.head(3)

Unnamed: 0,fighter,DraftKings,BetMGM,Caesars,BetRivers,FanDuel,PointsBet,Unibet,BetWay,5D,Ref,Props,Props.1,Props.2,event_odds_url,event_id,Bet365,event_name,event_ufcstats_url
0,Johny Hendricks,,,,,,,,,-220.0,-225▼,,35.0,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,,UFC 181: Hendricks vs Lawler II,http://ufcstats.com/event-details/dfdd0c5dd0d4...
1,Robbie Lawler,,,,,,,,,200.0,+190▲,,35.0,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,,UFC 181: Hendricks vs Lawler II,http://ufcstats.com/event-details/dfdd0c5dd0d4...
2,Over 4½ rounds,,,,,,,,,105.0,-105▲,,,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,,UFC 181: Hendricks vs Lawler II,http://ufcstats.com/event-details/dfdd0c5dd0d4...


In [42]:
all_event_odds['Ref'] = all_event_odds['Ref'].str[:-1]
all_event_odds['Ref'] = all_event_odds['Ref'].astype(float)
all_event_odds.head(3)

Unnamed: 0,fighter,DraftKings,BetMGM,Caesars,BetRivers,FanDuel,PointsBet,Unibet,BetWay,5D,Ref,Props,Props.1,Props.2,event_odds_url,event_id,Bet365,event_name,event_ufcstats_url
0,Johny Hendricks,,,,,,,,,-220.0,-225.0,,35.0,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,,UFC 181: Hendricks vs Lawler II,http://ufcstats.com/event-details/dfdd0c5dd0d4...
1,Robbie Lawler,,,,,,,,,200.0,190.0,,35.0,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,,UFC 181: Hendricks vs Lawler II,http://ufcstats.com/event-details/dfdd0c5dd0d4...
2,Over 4½ rounds,,,,,,,,,105.0,-105.0,,,,https://www.bestfightodds.com/events/ufc-181-h...,ufc-181-hendricks-vs-lawler-2-853,,UFC 181: Hendricks vs Lawler II,http://ufcstats.com/event-details/dfdd0c5dd0d4...
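
Slicing off the last character assumes every value ends with a trend arrow. A regex-based parser is a little more forgiving when the arrow is missing. A minimal sketch: `parse_american_odds` is a hypothetical helper, and the accepted formats are an assumption based on the values shown in the tables above.

```python
import re

# optional sign followed by digits at the start of the string
_ODDS_RE = re.compile(r'^\s*([+-]?\d+)')


def parse_american_odds(value):
    """Convert strings like '+210▲' or '-250▼' to float; NaN if nothing parses."""
    if not isinstance(value, str):
        return float('nan')
    match = _ODDS_RE.match(value)
    return float(match.group(1)) if match else float('nan')


print(parse_american_odds('+210▲'))  # 210.0
print(parse_american_odds('-250▼'))  # -250.0
print(parse_american_odds('-105'))   # -105.0
```

It could be applied to a column with `all_event_odds['5D'].map(parse_american_odds)`.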


In [43]:
# Choose columns to keep
cols = ['fighter', '5D', 'Ref', 'event_odds_url', 'event_ufcstats_url', 'event_id', 'event_name']
all_event_odds = all_event_odds[cols]
all_event_odds.head()

Unnamed: 0,fighter,5D,Ref,event_odds_url,event_ufcstats_url,event_id,event_name
0,Johny Hendricks,-220.0,-225.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
1,Robbie Lawler,200.0,190.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
2,Over 4½ rounds,105.0,-105.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
3,Under 4½ rounds,-125.0,-125.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
4,Fight goes to decision,110.0,,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II


In [44]:
# get shape
all_event_odds.shape

(369317, 7)

In [45]:
all_event_odds.to_csv('data/final/odds/All_Odds_by_Fighter_V2.csv', index=False)


## Summary
- All odds by fighter are available at:
  - data/final/odds/All_Odds_by_Fighter_V2.csv
- You can combine this with the UFC Stats data using 'event_ufcstats_url'

# Add All Events Df

Saved in final/events

In [46]:
all_event_odds = pd.read_csv('data/final/odds/All_Odds_by_Fighter_V2.csv')

In [47]:
# events are located in data/ufc_stats/events2

folder = 'data/ufc_stats/events2/'

# aggregate all events into one dataframe
all_events = []

for file in os.listdir(folder):
    df = pd.read_csv(folder + file)
    all_events.append(df)

all_events = pd.concat(all_events)

print(all_events.shape)
all_events.head()

(5887, 19)


Unnamed: 0.1,Unnamed: 0,W/L,Weight class,Method,Round,Time,Fighter1,Fighter2,F1_Kd,F2_Kd,F1_Str,F2_Str,F1_Td,F2_Td,F1_Sub,F2_Sub,fight_num,event_id,fight_link
0,0,win,Middleweight,KO/TKO Punch,2,3:33,Israel Adesanya,Robert Whittaker,2,0,40,32,0,0,0,0,1,3cf68c1d17f66af7,http://www.ufcstats.com/fight-details/2556b752...
1,1,win,Lightweight,U-DEC,3,5:00,Dan Hooker,Al Iaquinta,1,0,98,37,0,0,0,0,2,3cf68c1d17f66af7,http://www.ufcstats.com/fight-details/0697d552...
2,2,win,Heavyweight,SUB Arm Triangle,2,3:14,Serghei Spivac,Tai Tuivasa,0,0,23,21,6,0,1,0,3,3cf68c1d17f66af7,http://www.ufcstats.com/fight-details/8cd7ca0e...
3,3,win,Welterweight,S-DEC,3,5:00,Dhiego Lima,Luke Jumeau,0,0,32,24,2,0,0,0,4,3cf68c1d17f66af7,http://www.ufcstats.com/fight-details/fd0fd9a2...
4,4,win,Heavyweight,KO/TKO Punch,1,2:10,Yorgan De Castro,Justin Tafa,1,0,4,6,0,0,0,0,5,3cf68c1d17f66af7,http://www.ufcstats.com/fight-details/9dfac33c...


In [48]:
all_events.to_csv('data/final/events/All_Events_V1.csv', index=False)

In [49]:
all_events = pd.read_csv('data/final/events/All_Events_V1.csv')

# Add Odds Changes to Odds

In [50]:
# create a function to extract odds data from a given URL
# (note: this redefines the Selenium-based get_odds above with a requests-based version)
def get_odds(url):
    
    # add headers to avoid getting blocked while making requests
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    
    # make a GET request to the URL with the provided headers
    r = requests.get(url, headers=headers)
    
    # scrape the HTML tables present in the response content
    dfs = pd.read_html(r.text)
    
    # extract the relevant data from the second table
    data = dfs[1]
    
    # rename the first column as "fighter" for better readability
    data.rename(columns={'Unnamed: 0': 'fighter'}, inplace=True)
    
    # add the event odds URL as a new column in the dataframe
    data['event_odds_url'] = url
    
    # extract the event name from the URL and add it as a new column in the dataframe
    event_name = url.split('/')[4]
    data['event_id'] = event_name
    
    # uncomment the following line to save the extracted data as a CSV file
    # data.to_csv('data/ufc_BestFightOdds/odds_by_event/' + event_name + '.csv', index=False)
    
    # return the dataframe with the extracted odds data
    return data

In [51]:
def get_bfodds(event_url):
    """Scrape the changes in odds for one BFO event url."""
    driver = webdriver.Chrome()
    try:
        driver.get(event_url)

        # click show all
        # (the find_element_by_* helpers were removed in Selenium 4.3, so use By locators)
        driver.find_element(By.CLASS_NAME, 'event-swing-expand').click()
        # wait 2 seconds
        time.sleep(2)

        # the first half of the text is the odds changes, the second half is the names.
        # split the text by new line so we can match them up
        event_swing_container = driver.find_element(By.ID, 'event-swing-container')
        lines = event_swing_container.text.split('\n')

        # get the names. You can identify them because they are more than 5 characters long
        names = [x for x in lines if len(x) > 5]

        # get the odds changes, which contain a '%'
        odds_changes = [x for x in lines if '%' in x]

        # get the event name: the h1 inside the table header
        event_name = driver.find_element(By.CLASS_NAME, 'table-header')
        event = event_name.find_element(By.TAG_NAME, 'h1').text

        # get the url
        url = driver.current_url
    finally:
        # always shut the browser down, even if scraping fails
        driver.quit()

    # create a dataframe
    df = pd.DataFrame({'names': names, 'odds_changes': odds_changes, 'event': event, 'event_url': url})
    return df
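
`get_bfodds` builds its DataFrame from two lists that must have the same length; if they differ, pandas raises an error. Separating the text parsing from the browser code makes that step checkable on its own. A sketch: `parse_swings` is a hypothetical helper, and the sample text is invented to match the layout described in the comments (odds changes first, then names). It adds one guard that the notebook does not have: a line containing '%' is never treated as a name.

```python
def parse_swings(text):
    """Split swing-container text into (name, change) pairs.

    Lines containing '%' are odds changes; other lines longer than
    5 characters are fighter names. Raises ValueError on a count mismatch.
    """
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    changes = [line for line in lines if '%' in line]
    names = [line for line in lines if len(line) > 5 and '%' not in line]
    if len(names) != len(changes):
        raise ValueError(f'{len(names)} names but {len(changes)} odds changes')
    return list(zip(names, changes))


# invented sample in the assumed layout: changes first, then names
sample = '-56%\n-43%\nUrijah Faber\nAnthony Pettis'
print(parse_swings(sample))  # [('Urijah Faber', '-56%'), ('Anthony Pettis', '-43%')]
```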

In [52]:
all_odds_eventlist = all_event_odds['event_odds_url'].unique().tolist()

In [53]:
# Initialize lists to keep track of completed, failed, and newly completed events
already_complete = []
failed = []
newly_complete = []

# List the files already saved once, instead of on every iteration
done_files = set(os.listdir('data/final/odds/odds_changes/'))

# Iterate through all events in the all_odds_eventlist
for url in all_odds_eventlist:
    # Extract the event name from the URL
    event_name = url.split('/')[-1]

    # Skip events whose odds changes were already saved
    if event_name + '.csv' in done_files:
        print('already done with ' + event_name)
        already_complete.append(event_name)
        continue

    try:
        # Get the odds changes for the event
        df = get_bfodds(url)

        # Save the odds changes to a CSV file
        df.to_csv('data/final/odds/odds_changes/' + event_name + '.csv', index=False)
        newly_complete.append(event_name)
        print('done with ' + event_name)
    except Exception as e:
        # If there's an error, add the event to the failed list
        failed.append(event_name)
        print('failed on ' + event_name + ': ' + str(e))

# Print the summary of completed, newly completed, and failed events
print("")
print('already done with ' + str(len(already_complete)) + ' events')
print('newly done with ' + str(len(newly_complete)) + ' events')
print('failed on ' + str(len(failed)) + ' events')

already done with ufc-181-hendricks-vs-lawler-2-853
already done with ufc-on-fox-7-henderson-vs-melendez-634
already done with ufc-fight-night-122-bisping-vs-gastelum-1363
already done with ufc-150-henderson-vs-edgar-ii-547
already done with ufc-253-adesanya-vs-costa-1952
already done with ufc-226-miocic-vs-cormier-1447
already done with ufc-fight-night-27-condit-vs-kampmann-ii-692
already done with ufc-fight-night-103-rodriguez-vs-penn-1219
already done with ufc-258-usman-vs-burns-2033
already done with ufc-224-nunes-vs-pennington-1453
already done with ufc-fight-night-134-shogun-vs-smith-1507
already done with ufc-177-dillashaw-vs-soto-849
already done with ufc-76-knockout-12
already done with ufc-on-espn-dos-anjos-vs-fiziev-2501
already done with ufc-fight-night-43-te-huna-vs-marquardt-823
already done with ufc-fight-night-182-felder-vs-dos-anjos-1981
already done with ufc-on-fox-26-lawler-vs-dos-anjos-1365
already done with ufc-270-2322
already done with ufc-87-seek-and-destroy-57
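
The skip-if-already-saved pattern above can also be wrapped in a function so it is easier to re-run and test. This is a minimal sketch, not the notebook's own code: `scrape_missing_events` and its `fetch` argument are hypothetical names, with `fetch` standing in for `get_bfodds`.

```python
from pathlib import Path

import pandas as pd


def scrape_missing_events(urls, fetch, out_dir):
    """Fetch and save one CSV per event, skipping events already on disk.

    `fetch` is any callable that takes a URL and returns a DataFrame
    (a hypothetical stand-in for the notebook's get_bfodds).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Read the directory once instead of once per URL
    done = {p.stem for p in out_dir.glob('*.csv')}

    already_complete, newly_complete, failed = [], [], []
    for url in urls:
        name = url.rstrip('/').split('/')[-1]
        if name in done:
            already_complete.append(name)
            continue
        try:
            fetch(url).to_csv(out_dir / f'{name}.csv', index=False)
        except Exception as exc:
            # Keep going, but remember why this event failed
            failed.append((name, repr(exc)))
        else:
            newly_complete.append(name)
    return already_complete, newly_complete, failed
```

Returning the error text alongside each failed event name makes it easier to tell a blocked request from a parsing problem when deciding which events need manual intervention.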


In [54]:
# aggregate all odds changes into one dataframe
all_odds_changes = []

for file in os.listdir('data/final/odds/odds_changes/'):
    df = pd.read_csv('data/final/odds/odds_changes/' + file)
    # append
    all_odds_changes.append(df)

all_odds_changes = pd.concat(all_odds_changes)

print(all_odds_changes.shape)
all_odds_changes.to_csv('data/final/aggregates/All_Odds_Changes_V1.csv', index=False)
all_odds_changes

(11680, 4)


Unnamed: 0,names,odds_changes,event,event_url
0,Urijah Faber,-56%,UFC 181: HENDRICKS VS. LAWLER 2,https://www.bestfightodds.com/events/ufc-181-h...
1,Anthony Pettis,-43%,UFC 181: HENDRICKS VS. LAWLER 2,https://www.bestfightodds.com/events/ufc-181-h...
2,Clay Collard,-40%,UFC 181: HENDRICKS VS. LAWLER 2,https://www.bestfightodds.com/events/ufc-181-h...
3,Todd Duffee,-21%,UFC 181: HENDRICKS VS. LAWLER 2,https://www.bestfightodds.com/events/ufc-181-h...
4,Robbie Lawler,-20%,UFC 181: HENDRICKS VS. LAWLER 2,https://www.bestfightodds.com/events/ufc-181-h...
...,...,...,...,...
19,Nate Diaz,+35%,UFC 202: DIAZ VS. MCGREGOR 2,https://www.bestfightodds.com/events/ufc-202-d...
20,Donald Cerrone,+43%,UFC 202: DIAZ VS. MCGREGOR 2,https://www.bestfightodds.com/events/ufc-202-d...
21,Sabah Homasi,+44%,UFC 202: DIAZ VS. MCGREGOR 2,https://www.bestfightodds.com/events/ufc-202-d...
22,Elizabeth Phillips,+67%,UFC 202: DIAZ VS. MCGREGOR 2,https://www.bestfightodds.com/events/ufc-202-d...


### Edit Event_odds 

Remove alternate lines (prop bets such as round totals and "Fight goes to decision") so that mainly fighter rows remain.

In [55]:
all_event_odds = pd.read_csv('data/final/odds/All_Odds_by_Fighter_V2.csv')
all_event_odds.head(3)

Unnamed: 0,fighter,5D,Ref,event_odds_url,event_ufcstats_url,event_id,event_name
0,Johny Hendricks,-220.0,-225.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
1,Robbie Lawler,200.0,190.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
2,Over 4½ rounds,105.0,-105.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II


In [56]:
all_options = all_event_odds['fighter'].value_counts()
all_options = pd.DataFrame(all_options)
all_options.reset_index(inplace=True)
delete_options = all_options.head(120)
delete_options

Unnamed: 0,index,fighter
0,Any other result,103634
1,Fight goes to decision,4932
2,Fight doesn't go to decision,4932
3,Fight is not a draw,4927
4,Fight is a draw,4927
...,...,...
115,Fight won't go 2:30 round 2,200
116,Exactly one fight goes the distance,185
117,Under 1½ fights go the distance,185
118,Over 1½ fights go the distance,185


In [57]:
delete_options_list = delete_options['index'].tolist()
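
The cells above rely on `reset_index()` producing a column named `index`. Since pandas 2.0, `value_counts()` names its result `count` and names the index after the original column, so `delete_options['index']` may raise a `KeyError` on newer versions. A version-independent way to get the same list is sketched below; `most_common_labels` is a hypothetical helper name.

```python
import pandas as pd


def most_common_labels(labels, n):
    """Return the n most frequent values of a Series, most frequent first."""
    # value_counts() sorts by frequency (descending); the labels are its index
    return labels.value_counts().head(n).index.tolist()
```

With the notebook's data this would be called as `most_common_labels(all_event_odds['fighter'], 120)`.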

In [58]:
# delete rows when fighter is in delete_options_list
all_event_odds = all_event_odds[~all_event_odds['fighter'].isin(delete_options_list)]
all_event_odds.head(2)

Unnamed: 0,fighter,5D,Ref,event_odds_url,event_ufcstats_url,event_id,event_name
0,Johny Hendricks,-220.0,-225.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
1,Robbie Lawler,200.0,190.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II


In [59]:
# drop rows whose 'fighter' label contains a prop-bet keyword (case-sensitive)
prop_keywords = ['round', 'decision', 'submission', 'knockout', 'draw',
                 'scorecards', 'majority', 'Not', 'wins', 'losses']
for keyword in prop_keywords:
    all_event_odds = all_event_odds[~all_event_odds['fighter'].str.contains(keyword)]
all_event_odds


Unnamed: 0,fighter,5D,Ref,event_odds_url,event_ufcstats_url,event_id,event_name
0,Johny Hendricks,-220.0,-225.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
1,Robbie Lawler,200.0,190.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
62,Hendricks points handicap -5½,120.0,,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
63,Lawler points handicap +5½,-150.0,,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
64,Hendricks has fastest KO of the Night,800.0,,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II
...,...,...,...,...,...,...,...
369208,Uda has fastest Submission of the Night,1000.0,,https://www.bestfightodds.com/events/ufc-202-d...,http://ufcstats.com/event-details/5da4e8dc02e5...,ufc-202-diaz-vs-mcgregor-2-1143,UFC 202: Diaz vs. McGregor 2
369210,Vettori has fastest Submission of the Night,850.0,,https://www.bestfightodds.com/events/ufc-202-d...,http://ufcstats.com/event-details/5da4e8dc02e5...,ufc-202-diaz-vs-mcgregor-2-1143,UFC 202: Diaz vs. McGregor 2
369231,Exactly nine fights end in KO/TKO,9338.0,,https://www.bestfightodds.com/events/ufc-202-d...,http://ufcstats.com/event-details/5da4e8dc02e5...,ufc-202-diaz-vs-mcgregor-2-1143,UFC 202: Diaz vs. McGregor 2
369285,Over 8½ fights end in KO/TKO,8400.0,,https://www.bestfightodds.com/events/ufc-202-d...,http://ufcstats.com/event-details/5da4e8dc02e5...,ufc-202-diaz-vs-mcgregor-2-1143,UFC 202: Diaz vs. McGregor 2
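
The keyword filter can also be expressed as a single mask built from one combined pattern. The sketch below is an illustration, not the notebook's method: `drop_rows_matching` is a hypothetical helper, and the `case` switch is an addition that the original cells do not have.

```python
import re

import pandas as pd


def drop_rows_matching(df, column, keywords, case=True):
    """Drop rows where `column` contains any of `keywords` as a substring."""
    # Escape each keyword so it is matched literally, then OR them together
    pattern = '|'.join(re.escape(k) for k in keywords)
    # na=False treats missing labels as "no match", so those rows are kept
    mask = df[column].str.contains(pattern, case=case, na=False)
    return df[~mask]
```

The output above still contains rows such as "Hendricks points handicap -5½" and "Over 8½ fights end in KO/TKO", so the keyword list would probably need extending before only fighter names are left. Which keywords to add is left open here.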


### Add Odds Change to Odds

In [60]:
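def get_odds_change_from_db(event_odds_url, fighter):
    """
    Look up a fighter's odds change for one event in all_odds_changes.

    Returns the first matching 'odds_changes' value (a string such as '-31%'),
    or np.nan when no row matches.
    """
    matches = all_odds_changes.loc[
        (all_odds_changes['event_url'] == event_odds_url)
        & (all_odds_changes['names'] == fighter),
        'odds_changes'
    ]
    # An empty selection means this event/fighter pair has no odds change
    return matches.iloc[0] if not matches.empty else np.nan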

In [61]:
# test
get_odds_change_from_db('https://www.bestfightodds.com/events/ufc-100-137', 'Brock Lesnar')

'-31%'

In [62]:
# Create odds_change column
all_event_odds["odds_change"] = all_event_odds.apply(
    lambda row: get_odds_change_from_db(row["event_odds_url"], row["fighter"]),
    axis=1
)
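
Applying the lookup row by row rescans `all_odds_changes` once per row, which can be slow on a frame of this size. A left merge on the two key columns should give the same column in one pass. This is a sketch under assumptions: `attach_odds_change` is a hypothetical name, and it assumes the column names shown in the outputs above (`event_url`, `names`, `odds_changes` on one side, `event_odds_url` and `fighter` on the other).

```python
import pandas as pd


def attach_odds_change(event_odds, odds_changes):
    """Left-join the odds change onto each (event URL, fighter) row."""
    lookup = (
        odds_changes
        # Keep the first match per key, mirroring .values[0] in the lookup
        .drop_duplicates(subset=['event_url', 'names'], keep='first')
        .rename(columns={'event_url': 'event_odds_url',
                         'names': 'fighter',
                         'odds_changes': 'odds_change'})
        [['event_odds_url', 'fighter', 'odds_change']]
    )
    # how='left' keeps every odds row; unmatched rows get NaN
    return event_odds.merge(lookup, on=['event_odds_url', 'fighter'], how='left')
```

One difference from the `apply` version: `merge` returns a frame with a fresh 0..n-1 index, so the original row labels are not preserved.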

In [63]:
# get rid of %
all_event_odds['odds_change'] = all_event_odds['odds_change'].str.replace('%', '')
# turn into float
all_event_odds['odds_change'] = all_event_odds['odds_change'].astype(float)
all_event_odds.head(3)

Unnamed: 0,fighter,5D,Ref,event_odds_url,event_ufcstats_url,event_id,event_name,odds_change
0,Johny Hendricks,-220.0,-225.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II,48.0
1,Robbie Lawler,200.0,190.0,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II,-20.0
62,Hendricks points handicap -5½,120.0,,https://www.bestfightodds.com/events/ufc-181-h...,http://ufcstats.com/event-details/dfdd0c5dd0d4...,ufc-181-hendricks-vs-lawler-2-853,UFC 181: Hendricks vs Lawler II,
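
The percent-to-float conversion above fails with a `ValueError` if any scraped value is not a clean number. A more forgiving variant is sketched here; `percent_to_float` is a hypothetical helper, and turning unparseable values into NaN is a choice made for this example, not something the notebook does.

```python
import pandas as pd


def percent_to_float(values):
    """Convert strings like '-56%' or '+35%' to floats; bad values become NaN."""
    # Strip the trailing percent sign; missing values stay missing
    stripped = values.str.rstrip('%')
    # errors='coerce' turns anything unparseable into NaN instead of raising
    return pd.to_numeric(stripped, errors='coerce').astype(float)
```

With the notebook's data this would be `all_event_odds['odds_change'] = percent_to_float(all_event_odds['odds_change'])`.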


In [64]:
# save to csv
all_event_odds.to_csv('data/final/odds/All_Odds_by_Fighter_WithChange.csv', index=False)

In [65]:
print(f' This was last run: {datetime.datetime.now()}')

 This was last run: 2023-09-26 12:21:55.522832
