### Purpose

The goal of this notebook is to merge the wonderful rankings data compiled at https://www.kaggle.com/martj42/ufc-rankings with my match data.

In [47]:
import pandas as pd
from datetime import timedelta
import numpy as np
import os 


C:\Users\matth\OneDrive\github\tiger-millionaire\event_scraper\scraper_helpers


# 1. Load the Data

### Load the match data

In [52]:
dir_path = os.path.dirname(os.path.realpath('scraped_event.csv'))

In [54]:
#match_df = pd.read_csv("../data/ufc-master.csv")

match_df = pd.read_csv('scraper_helpers/scraped_event.csv') #We want to add ranks to the scraped event right?

#Let's put all the labels in the dataframe: 0 = Red wins, 1 = Blue wins.
#Using .loc avoids the chained assignment that triggers SettingWithCopyWarning.
match_df['label'] = ''
match_df.loc[match_df['Winner'] == 'Red', 'label'] = 0
match_df.loc[match_df['Winner'] == 'Blue', 'label'] = 1

#If the winner is not Red or Blue we can remove the row.
match_df = match_df[match_df['Winner'].isin(['Red', 'Blue'])].copy()

#Make sure label is numeric
match_df['label'] = pd.to_numeric(match_df['label'], errors='coerce')

#Let's fix the date
match_df['date'] = pd.to_datetime(match_df['date'])


### Load the rankings data

In [55]:
rankings_df = pd.read_csv("../data/rankings_history.csv")
rankings_df['date'] = pd.to_datetime(rankings_df['date'])

In [4]:
weightclass_list = rankings_df.weightclass.unique()
print(weightclass_list)

['Pound-for-Pound' 'Flyweight' 'Bantamweight' 'Featherweight'
 'Lightweight' 'Welterweight' 'Middleweight' 'Light Heavyweight'
 'Heavyweight' "Women's Bantamweight" "Women's Strawweight"
 "Women's Featherweight" "Women's Flyweight"]


The merged dataframe will contain all of the columns from the match dataframe. It will also contain the following new columns:

* B_Pound-for-Pound_rank
* B_Flyweight_rank
* B_Bantamweight_rank
* B_Featherweight_rank
* B_Lightweight_rank
* B_Welterweight_rank
* B_Middleweight_rank
* B_Light Heavyweight_rank
* B_Heavyweight_rank
* B_Women's Bantamweight_rank
* B_Women's Strawweight_rank
* B_Women's Featherweight_rank
* B_Women's Flyweight_rank
* R_Pound-for-Pound_rank
* R_Flyweight_rank
* R_Bantamweight_rank
* R_Featherweight_rank
* R_Lightweight_rank
* R_Welterweight_rank
* R_Middleweight_rank
* R_Light Heavyweight_rank
* R_Heavyweight_rank
* R_Women's Bantamweight_rank
* R_Women's Strawweight_rank
* R_Women's Featherweight_rank
* R_Women's Flyweight_rank

* R_match_weightclass_rank
* B_match_weightclass_rank

* better_rank

The first batch of columns holds the current rank of the fighter in each weightclass. I decided to do it this way, as opposed to having a single 'rank' column matched to the weightclass of the fight, because a fighter can be ranked in multiple weightclasses, and that might give them an advantage that should be discoverable. The 'R_' or 'B_' prefix refers to the red or blue fighter.

R_match_weightclass_rank and B_match_weightclass_rank are the rank of the fighter in the weightclass that the current match is taking place in.

better_rank will be one of {Red, Blue, neither}, denoting the higher ranked fighter.
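As a small illustration of the naming scheme, the per-weightclass column names can be generated from the weightclass list. The list below is shortened for brevity, and `rank_columns` is a name introduced only for this sketch.

```python
# Weightclass names as printed from rankings_df above (shortened here for brevity)
weightclasses = ["Pound-for-Pound", "Flyweight", "Women's Flyweight"]

# One rank column per corner and weightclass, e.g. "R_Flyweight_rank"
rank_columns = [f"{corner}_{wc}_rank" for corner in ("B", "R") for wc in weightclasses]
print(rank_columns)
```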


# 2. Now... How do we combine the two dataframes?

We have date information in both dataframes, so I will use that. We will get a list of all dates in the rankings dataframe. For each match, we will look up the most recent rankings published before the date of the match and check whether either fighter's name appears in them.
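A minimal sketch of that date lookup, assuming the rankings dates are available as a sorted array. The dates below are a shortened, illustrative list, and `latest_rankings_before` is a hypothetical helper, not part of this notebook.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative list of rankings dates (must be sorted ascending)
ranking_dates = pd.to_datetime(["2013-02-04", "2013-02-11", "2013-02-18"]).values

def latest_rankings_before(match_date, dates=ranking_dates):
    """Return the most recent rankings date strictly before match_date, or None."""
    target = pd.Timestamp(match_date).to_datetime64()
    # side="left" counts the dates that are strictly earlier than the match date
    pos = np.searchsorted(dates, target, side="left")
    if pos == 0:
        return None  # the match predates all of the ranking data
    return pd.Timestamp(dates[pos - 1])

print(latest_rankings_before("2013-02-12"))
print(latest_rankings_before("2012-01-01"))
```

A match held on a rankings date uses the previous set of rankings, and a match later than every rankings date falls back to the latest set.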

First, let's get a list of the dates for which we have ranking data.


In [5]:
print(rankings_df.columns)

Index(['date', 'weightclass', 'fighter', 'rank'], dtype='object')


In [6]:
date_list = rankings_df.date.unique()
display(date_list)

array(['2013-02-04T00:00:00.000000000', '2013-02-11T00:00:00.000000000',
       '2013-02-18T00:00:00.000000000', '2013-02-25T00:00:00.000000000',
       '2013-03-04T00:00:00.000000000', '2013-03-18T00:00:00.000000000',
       '2013-04-08T00:00:00.000000000', '2013-04-15T00:00:00.000000000',
       '2013-04-22T00:00:00.000000000', '2013-04-29T00:00:00.000000000',
       '2013-05-28T00:00:00.000000000', '2013-06-10T00:00:00.000000000',
       '2013-06-18T00:00:00.000000000', '2013-07-08T00:00:00.000000000',
       '2013-07-29T00:00:00.000000000', '2013-08-05T00:00:00.000000000',
       '2013-08-19T00:00:00.000000000', '2013-08-30T00:00:00.000000000',
       '2013-09-02T00:00:00.000000000', '2013-09-06T00:00:00.000000000',
       '2013-09-23T00:00:00.000000000', '2013-10-11T00:00:00.000000000',
       '2013-10-21T00:00:00.000000000', '2013-10-28T00:00:00.000000000',
       '2013-11-08T00:00:00.000000000', '2013-11-11T00:00:00.000000000',
       '2013-11-18T00:00:00.000000000', '2013-12-03

In [7]:
print(min(date_list))

2013-02-04T00:00:00.000000000


In [8]:
max_date = max(date_list)
print(max_date)

2021-04-26T00:00:00.000000000


We have matchup data that goes back a few years earlier than the ranking data, but that isn't a big deal. We just have to write code that won't return an error if it can't find appropriate ranking data.
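One way to get that behaviour without a row-by-row search is sketched below with `pd.merge_asof` followed by an ordinary left merge. This is an alternative to the approach used in this notebook, not what it actually runs; the two small dataframes are made-up stand-ins for `rankings_df` and `match_df`, and only the red corner is shown.

```python
import pandas as pd

# Toy stand-ins for rankings_df and match_df (values are illustrative only)
rankings = pd.DataFrame({
    "date": pd.to_datetime(["2013-02-04", "2013-02-04", "2013-02-11"]),
    "weightclass": ["Heavyweight"] * 3,
    "fighter": ["Fighter A", "Fighter B", "Fighter A"],
    "rank": [1, 2, 3],
})
matches = pd.DataFrame({
    "date": pd.to_datetime(["2012-12-01", "2013-02-08", "2013-03-01"]),
    "weight_class": ["Heavyweight"] * 3,
    "R_fighter": ["Fighter A", "Fighter B", "Fighter B"],
})

# Step 1: attach the most recent rankings date strictly before each match
rank_dates = (rankings[["date"]].drop_duplicates().sort_values("date")
              .rename(columns={"date": "rank_date"}))
matches = pd.merge_asof(matches.sort_values("date"), rank_dates,
                        left_on="date", right_on="rank_date",
                        allow_exact_matches=False)

# Step 2: exact left join on (rankings date, weightclass, fighter)
lookup = rankings.rename(columns={"date": "rank_date", "weightclass": "weight_class",
                                  "fighter": "R_fighter",
                                  "rank": "R_match_weightclass_rank"})
merged = matches.merge(lookup, how="left",
                       on=["rank_date", "weight_class", "R_fighter"])
print(merged)
```

Matches that predate the ranking data get a missing `rank_date`, so the second merge leaves their rank as NaN instead of raising. A fighter who has dropped out of the latest rankings before a match also ends up with NaN.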

Let's try to look smart and see if we can figure this out using a lambda function

In [9]:
display(rankings_df.head())

|  | date | weightclass | fighter | rank |
|---|---|---|---|---|
| 0 | 2013-02-04 | Pound-for-Pound | Anderson Silva | 1 |
| 1 | 2013-02-04 | Pound-for-Pound | Jon Jones | 2 |
| 2 | 2013-02-04 | Pound-for-Pound | Georges St-Pierre | 3 |
| 3 | 2013-02-04 | Pound-for-Pound | Jose Aldo | 4 |
| 4 | 2013-02-04 | Pound-for-Pound | Benson Henderson | 5 |


In [10]:
display(match_df.columns)

Index(['R_fighter', 'B_fighter', 'R_odds', 'B_odds', 'R_ev', 'B_ev', 'date',
       'location', 'country', 'Winner',
       ...
       'B_td_attempted_bout', 'R_td_pct_bout', 'B_td_pct_bout',
       'R_sub_attempts_bout', 'B_sub_attempts_bout', 'R_pass_bout',
       'B_pass_bout', 'R_rev_bout', 'B_rev_bout', 'label'],
      dtype='object', length=138)

In [11]:
test_date_list = match_df.date.unique()
display(test_date_list)


array(['2021-06-26T00:00:00.000000000'], dtype='datetime64[ns]')

In [12]:
def return_rank(fighter_name, date, wc):
    """Return the fighter's rank in weightclass wc from the most recent rankings
    before date, or '' if the fighter is unranked or no rankings exist yet."""
    rank = ''
    #Upcoming events fall after the last rankings date, so use the latest rankings
    is_upcoming = (max_date - date).total_seconds() < 0
    #Most recent rankings date before the match; None means the match predates all ranking data
    previous_d = max_date if is_upcoming else None
    #date_list is assumed to be sorted from oldest to newest
    for d in date_list:
        time_dif = (d - date).total_seconds()
        print(time_dif)  #debug output
        if time_dif > -1 or is_upcoming:
            if previous_d is not None:
                temp_rankings_df = rankings_df[(rankings_df['date'] == previous_d)
                                               & (rankings_df['weightclass'] == wc)
                                               & (rankings_df['fighter'] == fighter_name)]
                #A non-empty result means the fighter is ranked, so keep that rank
                if len(temp_rankings_df) > 0:
                    rank = int(temp_rankings_df.iloc[0]['rank'])
            break
        previous_d = d
    if isinstance(rank, int):
        print(rank)  #debug output
    return rank


In [13]:
match_df['B_match_weightclass_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         x['weight_class']),axis=1)

-264729600.0
5
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [14]:
match_df['R_match_weightclass_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         x['weight_class']),axis=1)

-264729600.0
4
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
15
-264729600.0
-264729600.0


In [15]:
match_df['R_Women\'s Flyweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Women\'s Flyweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [16]:
match_df['R_Women\'s Featherweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Women\'s Featherweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [17]:
match_df['R_Women\'s Strawweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Women\'s Strawweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [18]:
match_df['R_Women\'s Bantamweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Women\'s Bantamweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
15
-264729600.0
-264729600.0


In [19]:
match_df['R_Heavyweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Heavyweight'),axis=1)

-264729600.0
4
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [20]:
match_df['R_Light Heavyweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Light Heavyweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [21]:
match_df['R_Middleweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Middleweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [22]:
match_df['R_Welterweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Welterweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [23]:
match_df['R_Lightweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Lightweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [24]:
match_df['R_Featherweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Featherweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [25]:
match_df['R_Bantamweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Bantamweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [26]:
match_df['R_Flyweight_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Flyweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [27]:
match_df['R_Pound-for-Pound_rank'] = match_df.apply(lambda x: return_rank(x['R_fighter'],
                                                                         x['date'],
                                                                         'Pound-for-Pound'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [28]:
match_df['B_Women\'s Flyweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Women\'s Flyweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [29]:
match_df['B_Women\'s Featherweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Women\'s Featherweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [30]:
match_df['B_Women\'s Strawweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Women\'s Strawweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [31]:
match_df['B_Women\'s Bantamweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Women\'s Bantamweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [32]:
match_df['B_Heavyweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Heavyweight'),axis=1)

-264729600.0
5
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [33]:
match_df['B_Light Heavyweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Light Heavyweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [34]:
match_df['B_Middleweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Middleweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [35]:
match_df['B_Welterweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Welterweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [36]:
match_df['B_Lightweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Lightweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [37]:
match_df['B_Featherweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Featherweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [38]:
match_df['B_Bantamweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Bantamweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [39]:
match_df['B_Flyweight_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Flyweight'),axis=1)

-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0


In [40]:
match_df['B_Pound-for-Pound_rank'] = match_df.apply(lambda x: return_rank(x['B_fighter'],
                                                                         x['date'],
                                                                         'Pound-for-Pound'),axis=1)



-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
-264729600.0
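The per-weightclass cells above all follow the same pattern, so they could be collapsed into a loop over the corners and the weightclass list. The sketch below uses toy data and a hypothetical `lookup_rank` stand-in with the same signature as `return_rank`; in the notebook the loop would run over `weightclass_list` and call `return_rank` on `match_df`.

```python
import pandas as pd

# Toy stand-ins for match_df and weightclass_list (values are illustrative only)
matches = pd.DataFrame({"R_fighter": ["Fighter A"], "B_fighter": ["Fighter B"],
                        "date": pd.to_datetime(["2021-06-26"])})
weightclasses = ["Heavyweight", "Women's Flyweight"]
toy_ranks = {("Fighter A", "Heavyweight"): 4, ("Fighter B", "Heavyweight"): 5}

def lookup_rank(fighter_name, date, wc):
    """Hypothetical stand-in with the same signature as return_rank."""
    return toy_ranks.get((fighter_name, wc), '')

# Build one "<corner>_<weightclass>_rank" column per corner and weightclass
for corner in ("R", "B"):
    for wc in weightclasses:
        matches[f"{corner}_{wc}_rank"] = matches.apply(
            lambda row: lookup_rank(row[f"{corner}_fighter"], row["date"], wc),
            axis=1)

print(matches.filter(like="_rank").iloc[0].to_dict())
```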


In [41]:
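def return_better_rank(r_rank, b_rank):
    #An empty string means the fighter is unranked in this weightclass
    if r_rank == '' and b_rank == '':
        return 'neither'
    if r_rank == '':
        return 'Blue'
    if b_rank == '':
        return 'Red'
    r_rank = int(r_rank)
    b_rank = int(b_rank)
    #A lower number is a better rank
    if r_rank < b_rank:
        return 'Red'
    if b_rank < r_rank:
        return 'Blue'
    #Equal ranks: neither fighter is better ranked
    return 'neither'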

In [42]:
match_df['better_rank'] = match_df.apply(lambda x: return_better_rank(x['R_match_weightclass_rank'],
                                                                         x['B_match_weightclass_rank']),axis=1)
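For larger match files, the same comparison could be done without a row-wise `apply`. The following is a sketch on toy ranks, assuming (as in `return_rank`) that an empty string marks an unranked fighter; tied ranks are reported as 'neither'.

```python
import numpy as np
import pandas as pd

# Toy ranks; '' marks an unranked fighter
df = pd.DataFrame({"R_match_weightclass_rank": [4, '', 15, ''],
                   "B_match_weightclass_rank": [5, 7, '', '']})

# Convert '' to NaN, then treat unranked as infinitely low
# so that any ranked fighter wins the comparison
r = pd.to_numeric(df["R_match_weightclass_rank"], errors="coerce").fillna(np.inf)
b = pd.to_numeric(df["B_match_weightclass_rank"], errors="coerce").fillna(np.inf)

# A lower number is a better rank; ties and two unranked fighters give 'neither'
df["better_rank"] = np.select([r < b, b < r], ["Red", "Blue"], default="neither")
print(df)
```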

In [43]:
display(match_df.head())

Unnamed: 0,R_fighter,B_fighter,R_odds,B_odds,R_ev,B_ev,date,location,country,Winner,...,B_td_attempted_bout,R_td_pct_bout,B_td_pct_bout,R_sub_attempts_bout,B_sub_attempts_bout,R_pass_bout,B_pass_bout,R_rev_bout,B_rev_bout,label
0,Ciryl Gane,Alexander Volkov,,,,,2021-06-26,"Las Vegas, Nevada, USA",USA,Blue,...,,,,,,,,,,1
1,Tanner Boser,Ovince Saint Preux,,,,,2021-06-26,"Las Vegas, Nevada, USA",USA,Blue,...,,,,,,,,,,1
2,Raoni Barcelos,Timur Valiev,,,,,2021-06-26,"Las Vegas, Nevada, USA",USA,Blue,...,,,,,,,,,,1
3,Andre Fili,Daniel Pineda,,,,,2021-06-26,"Las Vegas, Nevada, USA",USA,Blue,...,,,,,,,,,,1
4,Tim Means,Nicolas Dalby,,,,,2021-06-26,"Las Vegas, Nevada, USA",USA,Blue,...,,,,,,,,,,1


In [44]:
#test = (match_df.iloc[1384])

In [45]:
#display(test[['R_fighter', 'R_match_weightclass_rank', 'B_fighter', 'B_match_weightclass_rank', 'date', 'better_rank']])


In [46]:
match_df.drop(columns=['label'], inplace=True)
match_df.to_csv('scraper_helpers/scraped_event_with_ranks.csv', index=False)

### Take a quick look at how the better ranked fighter does:

In [None]:
"""temp_df = match_df[match_df['better_rank']=='Red'].copy()
red_favorite_count = (len(temp_df))
temp_df = temp_df[temp_df['Winner']=='Red']
red_winner_count = len(temp_df)

red_pct = (red_winner_count / red_favorite_count)

temp_df = match_df[match_df['better_rank']=='Blue'].copy()
blue_favorite_count = (len(temp_df))
temp_df = temp_df[temp_df['Winner']=='Blue']
blue_winner_count = len(temp_df)

blue_pct = (blue_winner_count / blue_favorite_count)
print('When Red has the better rank they win ', "{:.2f}".format(red_pct*100), '% of the time')
print('When Blue has the better rank they win ', "{:.2f}".format(blue_pct*100), '% of the time')
"""

In [None]:
#print(blue_favorite_count)
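The commented-out calculation above divides by the number of bouts in which each corner has the better rank, which would raise a ZeroDivisionError if that count is zero (plausible with a single scraped event). A grouped mean sidesteps this, since corners with no bouts simply do not appear. The data below is made up for illustration and does not reflect real outcomes.

```python
import pandas as pd

# Toy results (illustrative only, not real outcomes)
results = pd.DataFrame({
    "better_rank": ["Red", "Red", "Blue", "neither", "Red"],
    "Winner":      ["Red", "Blue", "Blue", "Red", "Red"],
})

# Keep only bouts where one fighter is better ranked,
# then average the "better ranked fighter won" flag per corner
ranked = results[results["better_rank"].isin(["Red", "Blue"])]
favorite_won = ranked["Winner"] == ranked["better_rank"]
win_pct = favorite_won.groupby(ranked["better_rank"]).mean() * 100

for corner, pct in win_pct.items():
    print(f"When {corner} has the better rank they win {pct:.2f}% of the time")
```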