## Phase I Project Proposal 

### What Makes an MMA Fighter Good?

#### Name: Christopher David, DS 3000


### Problem Motivation

Mixed Martial Arts (MMA) is a dynamic and complex sport that requires a combination of physical attributes, technical skill, and strategic intelligence. This raises the question: what characteristics make an MMA fighter successful? While attributes such as height, reach, and stance play a role in a fighter's effectiveness, other aspects like match strategy and fight intelligence might be equally important, if not more so. This project explores the relationship between physical attributes and fight outcomes to determine which factors contribute most to success in the octagon. By analyzing historical fight data, I hope to identify trends that can be used to predict fight results and to provide recommendations for aspiring MMA athletes. I also want to test whether certain physical traits provide an advantage in specific matchups.

## Data Collection

To gather the necessary data, I plan to use a combination of web scraping and API integration. Specifically, I will scrape the UFC statistics website and supplement this data with the MMA API provided by API-Sports. This approach should give me comprehensive and up-to-date information on fighter statistics, fight results, and rankings.

### UFC Statistics

The UFC maintains a statistics section on its website that presents fighter data in a structured table format. This makes it an ideal source for extracting information about fighters' physical attributes and fight records.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Url for scraping
url = 'http://ufcstats.com/statistics/fighters'

response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

# locates the table
mma_stats_table = soup.find('table', class_='b-statistics__table')

# header of the table (makes sure DF has proper column names) 
headers = [header.text.strip() for header in mma_stats_table.find_all('th')]

# table rows 
rows = []

# skips the first two <tr> rows, which hold headers rather than fighter data
for row in mma_stats_table.find_all('tr')[2:]: 
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    
    if cols: 
        rows.append(cols)

# creates the data frame, printing the column names as the headers 
df = pd.DataFrame(rows, columns=headers)

# Output 
print(df)


        First          Last              Nickname     Ht.       Wt.  Reach  \
0         Tom         Aaron                            --  155 lbs.     --   
1       Danny        Abbadi          The Assassin  5' 11"  155 lbs.     --   
2     Nariman       Abbasov             Bayraktar   5' 8"  155 lbs.  66.0"   
3       David        Abbott                  Tank   6' 0"  265 lbs.     --   
4       Hamdy    Abdelwahab            The Hammer   6' 2"  264 lbs.  72.0"   
5      Mansur   Abdul-Malik                         6' 2"  185 lbs.  79.0"   
6      Shamil  Abdurakhimov                 Abrek   6' 3"  235 lbs.  76.0"   
7    Hiroyuki           Abe               Abe Ani   5' 6"  145 lbs.     --   
8      Daichi           Abe                        5' 11"  170 lbs.  71.0"   
9        Papy         Abedi               Makambo  5' 11"  185 lbs.     --   
10    Ricardo         Abreu               Demente  5' 11"  185 lbs.     --   
11    Klidson         Abreu            White Bear   6' 0"  205 l

### Data Utilization and Multiple Page Issues

The data above has been scraped to the best of my ability. I still need to figure out how to retrieve multiple pages, and I plan to go to office hours to find a solution. The data is well suited to the questions stated earlier because it provides each fighter's height, weight, and reach as well as win and loss records. Going forward, I plan to drop the Nickname and Belt columns, since I do not expect to draw meaningful correlations from them.

I will also use the MMA API in the future and will need to integrate it with the code above. The API will let me bring in data such as head-to-head statistics and weight class rankings.

For now, I will use the data above to run analyses such as characterizing relationships between features and fitting regressions. These should show not only whether physical features correlate with wins but also which features are associated with particular match outcomes. I believe this will let me draw well-supported conclusions, although I will need to investigate machine learning techniques further to do so.
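The planned feature selection and correlation check could look roughly like the sketch below. It assumes the height, weight, and reach columns have already been converted to numbers. The rows are invented placeholders rather than real fighter data, and `win_rate` is a hypothetical derived column that does not exist in the scraped table.

```python
import pandas as pd

# Invented placeholder rows in the shape of the scraped table (NOT real fighter data).
sample = pd.DataFrame({
    "First": ["Fighter", "Fighter", "Fighter", "Fighter"],
    "Last": ["A", "B", "C", "D"],
    "Nickname": ["", "", "", ""],
    "Ht.": [70.0, 72.0, 68.0, 74.0],      # inches
    "Wt.": [155.0, 170.0, 145.0, 185.0],  # pounds
    "Reach": [70.0, 74.0, 67.0, 76.0],    # inches
    "W": [10, 15, 5, 20],
    "L": [5, 5, 5, 0],
    "D": [0, 0, 0, 0],
    "Belt": ["", "", "", ""],
})

# Drop the columns that are not expected to carry useful signal.
features = sample.drop(columns=["Nickname", "Belt"])

# Hypothetical target: wins divided by total recorded fights.
# Fighters with zero recorded fights get NaN instead of a division error.
total_fights = features["W"] + features["L"] + features["D"]
features["win_rate"] = features["W"] / total_fights.where(total_fights > 0)

# Pearson correlation of each physical attribute with the win rate.
correlations = features[["Ht.", "Wt.", "Reach"]].corrwith(features["win_rate"])
print(correlations)
```

Correlation only measures linear association, so a regression or a comparison within weight classes would still be needed before drawing conclusions about any single attribute.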

In [2]:
def get_ufc_table(letter) :
    """ Gets and returns the table of UFC fighters for a given letter of the English alphabet.
    Args:
        letter (str) : a single-character string consisting of one of the 26 English letters.
    Returns:
        ufc_df (pd.DataFrame): an unclean dataframe containing the known stats for every fighter with a last name 
        that starts with a given English letter. It has the following set of columns:
            First (str) : the given name of a fighter.
            Last (str) : the family name of a fighter.
            Nickname (str): the nickname of a fighter, which may not exist.
            Ht. (str): the recorded height of a fighter, given in US Customary units in the format feet' inches" (for example 5' 11").
            Wt. (str): the recorded weight of a fighter, given in US Customary pounds.
            Reach (str): the recorded max length for a fighter's attacks, given in US Customary inches.
            Stance (str): the recorded favored stance of a fighter, which may not exist.
            W (str): the number of recorded fights won by a fighter.
            L (str): the number of recorded fights lost by a fighter.
            D (str): the number of recorded fights tied by a fighter.
            Belt (bool): whether a fighter possesses one of the UFC
    """
    parameters = {'char': letter, 'page': 'all'}
    ufc_response = requests.get('http://ufcstats.com/statistics/fighters', params=parameters)
    ufc_soup = BeautifulSoup(ufc_response.text, "html.parser")
    # locates the table
    fun_mma_stats_table = ufc_soup.find('table', class_='b-statistics__table')

    # header of the table (makes sure DF has proper column names) 
    ufc_headers = [ufc_header.text.strip() for ufc_header in fun_mma_stats_table.find_all('th')]

    # table rows 
    ufc_rows = []

    # skips the first two <tr> rows, which hold headers rather than fighter data
    for ufc_row in fun_mma_stats_table.find_all('tr')[2:]: 
        ufc_cols = ufc_row.find_all('td')
        ufc_cols = [ufc_col.text.strip() for ufc_col in ufc_cols]
        if ufc_cols: 
            ufc_rows.append(ufc_cols)
    # creates the data frame, printing the column names as the headers 
    ufc_df = pd.DataFrame(ufc_rows, columns=ufc_headers)
    return ufc_df
get_ufc_table('b').head()

Unnamed: 0,First,Last,Nickname,Ht.,Wt.,Reach,Stance,W,L,D,Belt
0,Kantharaj,Agasa,Kannadiga,--,135 lbs.,--,,12,3,0,
1,Kantharaj,Agasa,Kannadiga,--,135 lbs.,--,,12,3,0,
2,Kantharaj,Agasa,Kannadiga,--,135 lbs.,--,,12,3,0,
3,Kantharaj,Agasa,Kannadiga,--,135 lbs.,--,,12,3,0,
4,Kantharaj,Agasa,Kannadiga,--,135 lbs.,--,,12,3,0,


In [3]:
def get_all_ufc_fighters() :
    """ Gets every UFC fighter's stats in a single dataframe.
    Returns:
        merged_ufc_df (pd.DataFrame) : the merged dataframe containing every fighter. It has the same format as 
        get_ufc_table's return dataframe.
    """
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    # fetch one table per letter, then concatenate once instead of growing a dataframe inside the loop
    letter_tables = [get_ufc_table(letter) for letter in alphabet]
    merged_ufc_df = pd.concat(letter_tables, axis=0, ignore_index=True)
    return merged_ufc_df
all_fighters = get_all_ufc_fighters()
all_fighters.head()

Unnamed: 0,First,Last,Nickname,Ht.,Wt.,Reach,Stance,W,L,D,Belt
0,Kantharaj,Agasa,Kannadiga,--,135 lbs.,--,,12,3,0,
1,Kantharaj,Agasa,Kannadiga,--,135 lbs.,--,,12,3,0,
2,Kantharaj,Agasa,Kannadiga,--,135 lbs.,--,,12,3,0,
3,Kantharaj,Agasa,Kannadiga,--,135 lbs.,--,,12,3,0,
4,Kantharaj,Agasa,Kannadiga,--,135 lbs.,--,,12,3,0,


In [4]:
def clean_ufc_df_row(df_row) :
    """ Cleans and returns a given fighter's row of a dataframe.
    Args : 
        df_row (pd.Series): the row of a get_ufc_table style dataframe to clean.
    Returns :
        df_row (pd.Series): the cleaned row, with the following modifications: 
            Ht., Wt., and Reach are now float representations of their preceding forms;
            Stances that are blank are now noted as Unrecorded;
            W, L, D are now int representations of their preceding forms.
    """
    if df_row['Ht.'] != '--' :
        (feet, inches) = df_row['Ht.'].split(' ')
        height = float(feet[:-1]) * 12.0
        height += float(inches[:-1])
        df_row['Ht.'] = height
    else :
        df_row['Ht.'] = float('nan')
    if df_row['Wt.'] != '--' :
        df_row['Wt.'] = float(df_row['Wt.'][:df_row['Wt.'].find(' ')])
    else :
        df_row['Wt.'] = float('nan')
    if df_row['Reach'] != '--' :
        df_row['Reach'] = float(df_row['Reach'][:-1])
    else :
        df_row['Reach'] = float('nan')
    if df_row['Stance'] == '' :
        df_row['Stance'] = 'Unrecorded'
    df_row['W'] = int(df_row['W'])
    df_row['L'] = int(df_row['L'])
    df_row['D'] = int(df_row['D'])
    return df_row
def clean_ufc_df(ufc_df) :
    """ Cleans and returns a given UFC DataFrame using df.apply.
    Args :
        ufc_df (pd.DataFrame): the dataframe to be cleaned; uses get_ufc_table columns.
    Returns :
        ufc_df (pd.DataFrame): the cleaned dataframe using clean_ufc_df_row.
    """
    return ufc_df.apply(clean_ufc_df_row, axis=1)
all_fighters = clean_ufc_df(all_fighters)
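
The row-by-row cleaning above could also be expressed with vectorized pandas string methods, which tend to be faster on large tables. The following is a minimal sketch of that alternative, not the method used in this project. The `raw` frame holds hand-typed strings in the scraped format, and `to_inches` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

# Hand-typed strings in the same format as the scraped table (illustrative only).
raw = pd.DataFrame({
    "Ht.": ["5' 11\"", "6' 0\"", "--"],
    "Wt.": ["155 lbs.", "265 lbs.", "--"],
    "Reach": ["--", "72.0\"", "66.0\""],
})


def to_inches(height):
    """Convert strings like 5' 11" to total inches; values that do not match become NaN."""
    # Named groups become the columns 'feet' and 'inches'; non-matching rows are NaN.
    parts = height.str.extract(r"(?P<feet>\d+)'\s*(?P<inches>\d+)").astype(float)
    return parts["feet"] * 12.0 + parts["inches"]


cleaned = pd.DataFrame({
    "Ht.": to_inches(raw["Ht."]),
    # errors="coerce" turns the "--" placeholder into NaN
    "Wt.": pd.to_numeric(raw["Wt."].str.replace(" lbs.", "", regex=False), errors="coerce"),
    "Reach": pd.to_numeric(raw["Reach"].str.rstrip('"'), errors="coerce"),
})
print(cleaned)
```

Whole-column operations like these avoid calling a Python function once per fighter, at the cost of being a little less explicit than the row-wise version.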