# UFC Retirement Age - Preprocessing and Training Data Development

## 1. Contents
* [1. Contents](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [2. Sourcing and Loading](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [2a. Import relevant libraries](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [2b. Load previously explored DataFrame](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [2c. Preliminary exploration of data](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [3. Additional Data Wrangling](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [3a. Datatype Conversion](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [3b. Cleanup of `country`](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [4. Encoding Categorical Variables](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [4a. Removing `finish_details`](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [4b. Dummy encoding every categorical variable except `fighter`](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [5. Scaling numerical variables](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [6. Combine all preprocessed DataFrames](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [6a. Drop original columns](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [6b. Concatenate preprocessed DataFrames](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [6c. Save DataFrame](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [7. Final Thoughts](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)

## 2. Sourcing and Loading

**2a. Import relevant libraries**

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import kaggle as kg
import missingno as msno
import statsmodels.api as sm
import scipy.stats
from matplotlib.lines import Line2D
from kaggle.api.kaggle_api_extended import KaggleApi
from statsmodels.graphics.api import abline_plot
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model, preprocessing
from zipfile import ZipFile
from scipy import stats
from scipy.stats import t
from scipy.stats import ttest_ind
from scipy.stats import mannwhitneyu
from numpy.random import seed
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

**2b. Load previously explored DataFrame**

In [2]:
df = pd.read_csv('df.csv', index_col=0)

**2c. Preliminary exploration of data**

In [3]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8816 entries, 0 to 8815
Data columns (total 68 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   date                        8816 non-null   object 
 1   fighter                     8816 non-null   object 
 2   odds                        8816 non-null   int64  
 3   country                     8816 non-null   object 
 4   Winner                      8816 non-null   bool   
 5   title_bout                  8816 non-null   bool   
 6   weight_class                8816 non-null   object 
 7   gender                      8816 non-null   object 
 8   current_lose_streak         8816 non-null   int64  
 9   current_win_streak          8816 non-null   int64  
 10  draw                        8816 non-null   int64  
 11  avg_SIG_STR_landed          7431 non-null   float64
 12  avg_SIG_STR_pct             7694 non-null   float64
 13  avg_SUB_ATT                 7627 

In [4]:
print('Shape: ' + str(df.shape))

Shape: (8816, 68)


In [5]:
df.head()

| | date | fighter | odds | country | Winner | title_bout | weight_class | gender | current_lose_streak | current_win_streak | ... | total_title_bouts_dif | win_by_Submission_dif | Height_cms_dif | Reach_cms_dif | age_dif | avg_SIG_STR_landed_dif | avg_SUB_ATT_dif | avg_TD_landed_dif | ko_dif | adj_wins_dif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-10-03 | Holly Holm | -125 | United Arab Emirates | True | False | Women's Bantamweight | FEMALE | 0 | 1 | ... | 5 | -1 | -2.54 | 2.54 | 6 | -3.41 | 0.0 | 0.22 | 1 | 3 |
| 1 | 2020-10-03 | Irene Aldana | 103 | United Arab Emirates | False | False | Women's Bantamweight | FEMALE | 0 | 2 | ... | -5 | 1 | 2.54 | -2.54 | -6 | 3.41 | 0.0 | -0.22 | -1 | -3 |
| 2 | 2020-10-03 | Yorgan De Castro | -265 | United Arab Emirates | False | False | Heavyweight | MALE | 1 | 0 | ... | 0 | 0 | 0.0 | -2.54 | 8 | -0.74 | 0.0 | 0.0 | 1 | 4 |
| 3 | 2020-10-03 | Carlos Felipe | 205 | United Arab Emirates | True | False | Heavyweight | MALE | 1 | 0 | ... | 0 | 0 | 0.0 | 2.54 | -8 | 0.74 | 0.0 | 0.0 | -1 | -4 |
| 4 | 2020-10-03 | Germaine de Randamie | -150 | United Arab Emirates | True | False | Women's Bantamweight | FEMALE | 1 | 0 | ... | 1 | 0 | 7.62 | 5.08 | 5 | -0.24 | -0.7 | -2.6 | 2 | 9 |


## 3. Additional Data Wrangling

**3a. Datatype conversion**

In [6]:
df.dtypes[df.dtypes == 'object']

date                 object
fighter              object
country              object
weight_class         object
gender               object
Stance               object
finish_details       object
finish_round_time    object
dtype: object

Of these object columns, `date` can be converted to a datetime, and `finish_round_time` (a `M:SS` clock string) can be converted to a float number of minutes.

In [7]:
df['date'] = pd.to_datetime(df['date'])
df.date.dtypes

dtype('<M8[ns]')

In [8]:
df['finish_round_time'].value_counts(dropna=False)

5:00    4108
NaN      618
4:59      56
2:38      42
1:54      34
        ... 
0:11       2
3:55       2
4:00       2
0:09       2
0:05       2
Name: finish_round_time, Length: 294, dtype: int64

In [9]:
# Prefix '00:' so each 'M:SS' string parses as HH:MM:SS, then express it in minutes
df['finish_round_time'] = pd.to_timedelta('00:' + df['finish_round_time'])
df['finish_round_time'] = df['finish_round_time'].dt.total_seconds() / 60

In [10]:
df['finish_round_time'].value_counts(dropna=False)

5.000000    4108
NaN          618
4.983333      56
2.633333      42
1.900000      34
            ... 
4.000000       2
4.066667       2
0.083333       2
3.333333       2
0.150000       2
Name: finish_round_time, Length: 294, dtype: int64
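
As a sanity check on this conversion, here is a minimal sketch on a toy Series (the values are illustrative, not taken from `df`). It assumes the clock strings are always `M:SS`, and it shows that missing values pass through as `NaN` rather than raising an error.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['finish_round_time']: "M:SS" strings plus one missing value
clock = pd.Series(["5:00", "4:59", "2:38", np.nan, "0:05"])

# Prefix "00:" so each value parses as HH:MM:SS, then convert seconds to minutes
as_timedelta = pd.to_timedelta("00:" + clock)
minutes = as_timedelta.dt.total_seconds() / 60

print(minutes)
```

A round lasts five minutes, so every converted value should fall between 0 and 5; a quick `minutes.dropna().between(0, 5).all()` is one way to confirm that on the real column.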

**3b. Cleanup of `country`**

In [11]:
df['country'].value_counts()

 USA                     4900
 Brazil                   800
 Canada                   674
USA                       360
 United Kingdom           330
 Australia                320
United Arab Emirates      144
 Sweden                   144
 Mexico                   140
 China                    122
 Germany                  108
 Japan                    106
 Singapore                 90
 Russia                    72
 New Zealand               66
 United Arab Emirates      58
 Netherlands               50
 South Korea               48
 Poland                    46
 Ireland                   38
 Czech Republic            26
 Croatia                   26
 Chile                     26
 Denmark                   26
 Uruguay                   26
 Philippines               24
 Argentina                 24
Brazil                     22
Name: country, dtype: int64

In [12]:
df['country'] = df['country'].str.lstrip()
df['country'].value_counts()

USA                     5260
Brazil                   822
Canada                   674
United Kingdom           330
Australia                320
United Arab Emirates     202
Sweden                   144
Mexico                   140
China                    122
Germany                  108
Japan                    106
Singapore                 90
Russia                    72
New Zealand               66
Netherlands               50
South Korea               48
Poland                    46
Ireland                   38
Czech Republic            26
Chile                     26
Uruguay                   26
Croatia                   26
Denmark                   26
Argentina                 24
Philippines               24
Name: country, dtype: int64
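
`str.lstrip()` is enough here because the stray whitespace in `country` is all leading. The sketch below, on toy values, shows one way to check that nothing is left over; the trailing-space entry `"Brazil "` is a hypothetical case added for illustration and does not appear in the output above.

```python
import pandas as pd

# Toy values; "Brazil " (trailing space) is hypothetical, added to show the difference
country = pd.Series([" USA", "USA", " Brazil", "Brazil ", " United Arab Emirates"])

left_only = country.str.lstrip()   # removes leading whitespace only
both_sides = country.str.strip()   # removes leading and trailing whitespace

# Any value that still changes under strip() carries leftover whitespace
still_padded = left_only[left_only != left_only.str.strip()]

print(left_only.nunique(), both_sides.nunique(), len(still_padded))
```

If `still_padded` is empty on the real column, `lstrip()` and `strip()` give the same result.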

## 4. Encoding Categorical Variables

**4a. Removing `finish_details`**

In [13]:
df.dtypes[df.dtypes == 'object']

fighter           object
country           object
weight_class      object
gender            object
Stance            object
finish_details    object
dtype: object

In [14]:
# `columns=` already names the axis, so `axis=1` is redundant
df.drop(columns=['finish_details'], inplace=True)

In [15]:
df.dtypes[df.dtypes == 'object']

fighter         object
country         object
weight_class    object
gender          object
Stance          object
dtype: object

**4b. Dummy encoding every categorical variable except `fighter`**

In [16]:
df['Stance'] = df['Stance'].str.rstrip()
df['Stance'].value_counts(dropna=False)

Orthodox       6684
Southpaw       1783
Switch          344
Open Stance       5
Name: Stance, dtype: int64

In [17]:
dummy_cols = [column for column in df.dtypes[df.dtypes == 'object'].index if column != 'fighter']
dummy_cols

['country', 'weight_class', 'gender', 'Stance']

In [18]:
df_dummy = pd.get_dummies(df[dummy_cols], columns=dummy_cols, drop_first=True)
df_dummy.head()

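| | country_Australia | country_Brazil | country_Canada | country_Chile | country_China | country_Croatia | country_Czech Republic | country_Denmark | country_Germany | country_Ireland | ... | weight_class_Middleweight | weight_class_Welterweight | weight_class_Women's Bantamweight | weight_class_Women's Featherweight | weight_class_Women's Flyweight | weight_class_Women's Strawweight | gender_MALE | Stance_Orthodox | Stance_Southpaw | Stance_Switch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |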
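
With `drop_first=True`, the alphabetically first level of each column becomes the implicit baseline, which is why there is no `gender_FEMALE` or `Stance_Open Stance` column above. A minimal sketch on a toy frame (illustrative rows, not taken from `df`) makes the effect visible; `dtype=int` is passed explicitly because newer pandas versions return boolean dummies by default.

```python
import pandas as pd

# Toy frame with two of the categorical columns encoded above
toy = pd.DataFrame({
    "gender": ["FEMALE", "MALE", "MALE"],
    "Stance": ["Southpaw", "Orthodox", "Open Stance"],
})

# drop_first=True drops the first (sorted) level of each column
encoded = pd.get_dummies(toy, columns=["gender", "Stance"], drop_first=True, dtype=int)

print(encoded)
```

A row whose dummies for a column are all zero belongs to the dropped baseline level, as with the `Open Stance` fighter in the last toy row.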


In [19]:
df.dtypes[df.dtypes == 'bool']

Winner         bool
title_bout     bool
better_rank    bool
dtype: object

In [20]:
bool_cols = [column for column in df.dtypes[df.dtypes == 'bool'].index]
df[bool_cols] = df[bool_cols].astype(int)
df['Winner']

0       1
1       0
2       0
3       1
4       1
       ..
8811    0
8812    1
8813    0
8814    0
8815    1
Name: Winner, Length: 8816, dtype: int32
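
The `int32` result above comes from `astype(int)` using the platform's default integer, which can differ between machines. As an alternative sketch (not the method used in the cell above), `select_dtypes` can pick out the boolean columns and an explicit `'int64'` makes the result the same everywhere. The toy frame is illustrative.

```python
import pandas as pd

# Toy frame with two boolean columns and one integer column
toy = pd.DataFrame({
    "Winner": [True, False],
    "title_bout": [False, False],
    "odds": [-125, 103],
})

# Select boolean columns by dtype, then cast them to a fixed-width integer
bool_cols = toy.select_dtypes(include="bool").columns.tolist()
toy[bool_cols] = toy[bool_cols].astype("int64")

print(toy.dtypes)
```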

## 6. Combine Preprocessed DataFrames

**6a. Drop original columns**

In [21]:
df.drop(columns=dummy_cols, inplace=True)

In [22]:
df.columns

Index(['date', 'fighter', 'odds', 'Winner', 'title_bout',
       'current_lose_streak', 'current_win_streak', 'draw',
       'avg_SIG_STR_landed', 'avg_SIG_STR_pct', 'avg_SUB_ATT', 'avg_TD_landed',
       'avg_TD_pct', 'longest_win_streak', 'losses', 'total_rounds_fought',
       'total_title_bouts', 'win_by_Decision_Majority',
       'win_by_Decision_Split', 'win_by_Decision_Unanimous', 'win_by_KO/TKO',
       'win_by_Submission', 'win_by_TKO_Doctor_Stoppage', 'wins', 'Height_cms',
       'Reach_cms', 'Weight_lbs', 'age', 'better_rank', 'finish',
       'finish_round', 'finish_round_time', 'total_fight_time_secs', 'kd_bout',
       'sig_str_landed_bout', 'sig_str_attempted_bout', 'sig_str_pct_bout',
       'tot_str_landed_bout', 'tot_str_attempted_bout', 'td_landed_bout',
       'td_attempted_bout', 'td_pct_bout', 'sub_attempts_bout', 'pass_bout',
       'rev_bout', 'win_pct', 'adj_wins', 'current_lose_streak_dif',
       'current_win_streak_dif', 'longest_win_streak_dif', 'wins_dif',

**6b. Concatenate preprocessed DataFrames**

In [23]:
df = pd.concat([df, df_dummy], axis=1)
df.columns

Index(['date', 'fighter', 'odds', 'Winner', 'title_bout',
       'current_lose_streak', 'current_win_streak', 'draw',
       'avg_SIG_STR_landed', 'avg_SIG_STR_pct',
       ...
       'weight_class_Middleweight', 'weight_class_Welterweight',
       'weight_class_Women's Bantamweight',
       'weight_class_Women's Featherweight', 'weight_class_Women's Flyweight',
       'weight_class_Women's Strawweight', 'gender_MALE', 'Stance_Orthodox',
       'Stance_Southpaw', 'Stance_Switch'],
      dtype='object', length=103)
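
`pd.concat(..., axis=1)` aligns rows on the index, so it only works as a simple side-by-side join when both frames share the same row labels. That holds here because `df_dummy` was built from `df`, but the sketch below (toy frames, hypothetical values) shows a cheap guard and what a mismatch would look like.

```python
import pandas as pd

base = pd.DataFrame({"odds": [-125, 103, -265]}, index=[0, 1, 2])
dummies = pd.DataFrame({"gender_MALE": [0, 0, 1]}, index=[0, 1, 2])

# Guard: fail early if the two frames no longer share the same row labels
assert base.index.equals(dummies.index)
combined = pd.concat([base, dummies], axis=1)

# Mismatched labels: the indices are unioned and the gaps are filled with NaN
shifted = dummies.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([base, shifted], axis=1)

print(combined.shape, misaligned.shape)
```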

**6c. Save DataFrame**

In [24]:
df.to_csv('df.csv')
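
One thing to keep in mind when reloading this file: CSV stores no dtype information, so the `date` column converted in 3a comes back as text unless it is parsed again. A minimal sketch of the round trip, using an in-memory buffer instead of a file and a toy frame:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-03", "2020-09-26"]),
    "odds": [-125, 103],
})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading back without parse_dates leaves `date` as plain strings
buffer.seek(0)
plain = pd.read_csv(buffer, index_col=0)

# parse_dates restores the datetime dtype
buffer.seek(0)
parsed = pd.read_csv(buffer, index_col=0, parse_dates=["date"])

print(plain["date"].dtype, parsed["date"].dtype)
```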

## 7. Final Thoughts

Because I will have multiple dependent variables later on, I will defer the train-test split until the Modeling section, where I can properly assess each dependent variable in its own dedicated section.
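
As a rough sketch of what that later step could look like, the loop below makes one split per target with scikit-learn's `train_test_split`. The column names `feature_1`, `feature_2`, `target_a`, and `target_b` are hypothetical placeholders, since the actual dependent variables are chosen in the Modeling section, and the data is random.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; all column names are hypothetical placeholders
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_1": rng.normal(size=8),
    "feature_2": rng.normal(size=8),
    "target_a": rng.normal(size=8),
    "target_b": rng.normal(size=8),
})

targets = ["target_a", "target_b"]
X = toy.drop(columns=targets)

# One split per dependent variable; a fixed random_state keeps the rows identical
splits = {}
for target in targets:
    X_train, X_test, y_train, y_test = train_test_split(
        X, toy[target], test_size=0.25, random_state=42
    )
    splits[target] = (X_train, X_test, y_train, y_test)

print({name: parts[0].shape for name, parts in splits.items()})
```

Reusing the same `random_state` means every target is evaluated on the same held-out rows, which keeps the per-target results comparable.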