# UFC Retirement Age - Preprocessing and Training Data Development

## 1. Contents
* [1. Contents](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [2. Sourcing and Loading](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [2a. Import relevant libraries](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [2b. Load previously explored DataFrame](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [2c. Preliminary exploration of data](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [3. Additional Data Wrangling](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [3a. Datatype conversion](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [3b. Cleanup of `country`](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [4. Encoding Categorical Variables](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [4a. Removing `finish_details`](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [4b. Dummy encoding every categorical variable except `fighter`](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [5. Scaling Numerical Variables](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [6. Combine All Preprocessed DataFrames](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [6a. Drop original columns](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [6b. Concatenate preprocessed DataFrames](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
 * [6c. Save DataFrame](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)
* [7. Final Thoughts](#_UFC_Retirement_Age_-_Preprocessing_and_Training_Data_Development)

## 2. Sourcing and Loading

**2a. Import relevant libraries**

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import kaggle as kg
import missingno as msno
import statsmodels.api as sm
import scipy.stats
from matplotlib.lines import Line2D
from kaggle.api.kaggle_api_extended import KaggleApi
from statsmodels.graphics.api import abline_plot
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model, preprocessing
from zipfile import ZipFile
from scipy import stats
from scipy.stats import t
from scipy.stats import ttest_ind
from scipy.stats import mannwhitneyu
from numpy.random import seed
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

**2b. Load previously explored DataFrame**

In [2]:
df = pd.read_csv('df.csv', index_col=0)

**2c. Preliminary exploration of data**

In [3]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8774 entries, 0 to 8773
Data columns (total 68 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   date                        8774 non-null   object 
 1   fighter                     8774 non-null   object 
 2   odds                        8774 non-null   int64  
 3   country                     8774 non-null   object 
 4   Winner                      8774 non-null   bool   
 5   title_bout                  8774 non-null   bool   
 6   weight_class                8774 non-null   object 
 7   gender                      8774 non-null   object 
 8   current_lose_streak         8774 non-null   int64  
 9   current_win_streak          8774 non-null   int64  
 10  draw                        8774 non-null   int64  
 11  avg_SIG_STR_landed          7389 non-null   float64
 12  avg_SIG_STR_pct             7652 non-null   float64
 13  avg_SUB_ATT                 7585 

In [4]:
print('Shape: ' + str(df.shape))

Shape: (8774, 68)


In [5]:
df.head()

Unnamed: 0,date,fighter,odds,country,Winner,title_bout,weight_class,gender,current_lose_streak,current_win_streak,...,total_title_bouts_dif,win_by_Submission_dif,Height_cms_dif,Reach_cms_dif,age_dif,avg_SIG_STR_landed_dif,avg_SUB_ATT_dif,avg_TD_landed_dif,ko_dif,adj_wins_dif
0,2020-09-19,Colby Covington,-335,USA,True,False,Welterweight,MALE,1,0,...,-5,-2,5.08,-5.08,-6,1.79,-0.2,3.8,-4,-20
1,2020-09-19,Tyron Woodley,260,USA,False,False,Welterweight,MALE,2,0,...,5,2,-5.08,5.08,6,-1.79,0.2,-3.8,4,20
2,2020-09-19,Khamzat Chimaev,-400,USA,True,False,Middleweight,MALE,0,2,...,0,-4,2.54,-5.08,-6,5.57,1.5,2.47,0,-16
3,2020-09-19,Gerald Meerschaert,300,USA,False,False,Middleweight,MALE,1,0,...,0,4,-2.54,5.08,6,-5.57,-1.5,-2.47,0,16
4,2020-09-19,Johnny Walker,-125,USA,True,False,Light Heavyweight,MALE,2,0,...,0,-1,2.54,7.62,-1,0.69,-0.6,-0.7,2,0


## 3. Additional Data Wrangling

**3a. Datatype conversion**

In [6]:
df.dtypes[df.dtypes == 'object']

date                 object
fighter              object
country              object
weight_class         object
gender               object
Stance               object
finish_details       object
finish_round_time    object
dtype: object

Of these, `date` can be converted to a datetime, and `finish_round_time` (stored as `M:SS` strings) can be converted to a float number of minutes.

In [7]:
df['date'] = pd.to_datetime(df['date'])
df.date.dtypes

dtype('<M8[ns]')

In [8]:
df['finish_round_time'].value_counts(dropna=False)

5:00    4108
NaN      576
4:59      56
2:38      42
1:54      34
        ... 
3:55       2
3:20       2
0:37       2
4:04       2
0:09       2
Name: finish_round_time, Length: 294, dtype: int64

In [9]:
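# Prefix '00:' so each 'M:SS' string parses as HH:MM:SS; NaN values become NaT
df['finish_round_time'] = pd.to_timedelta('00:' + df['finish_round_time'])
# Express the timedelta as fractional minutes (e.g. 4:59 -> 4.983333)
df['finish_round_time'] = df['finish_round_time'] / pd.Timedelta(minutes=1)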

In [10]:
df['finish_round_time'].value_counts(dropna=False)

5.000000    4108
NaN          576
4.983333      56
2.633333      42
1.900000      34
            ... 
4.000000       2
4.066667       2
0.083333       2
3.333333       2
0.150000       2
Name: finish_round_time, Length: 294, dtype: int64
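
The `M:SS` to minutes conversion can be sanity-checked on a few values. The following is a minimal sketch, assuming the raw entries are `M:SS` strings with missing values stored as NaN. The sample series `raw` is hypothetical (its values are copied from the counts above), and `.dt.total_seconds() / 60` is used here as an equivalent way to get fractional minutes from a timedelta.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same "M:SS" format as finish_round_time
raw = pd.Series(["5:00", "4:59", "2:38", np.nan, "0:09"])

# Prefix "00:" so pandas parses each value as HH:MM:SS; NaN stays missing (NaT)
as_timedelta = pd.to_timedelta("00:" + raw)

# Convert the timedeltas to fractional minutes
minutes = as_timedelta.dt.total_seconds() / 60
print(minutes.round(6).tolist())
```

The missing entry stays missing after the conversion, which is consistent with the 576 NaN values appearing both before and after the transformation.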

**3b. Cleanup of `country`**

In [11]:
df['country'].value_counts()

 USA                     4900
 Brazil                   800
 Canada                   674
USA                       360
 United Kingdom           330
 Australia                320
 Sweden                   144
 Mexico                   140
 China                    122
 Germany                  108
 Japan                    106
United Arab Emirates      102
 Singapore                 90
 Russia                    72
 New Zealand               66
 United Arab Emirates      58
 Netherlands               50
 South Korea               48
 Poland                    46
 Ireland                   38
 Chile                     26
 Denmark                   26
 Czech Republic            26
 Uruguay                   26
 Croatia                   26
 Argentina                 24
 Philippines               24
Brazil                     22
Name: country, dtype: int64

In [12]:
df['country'] = df['country'].str.lstrip()
df['country'].value_counts()

USA                     5260
Brazil                   822
Canada                   674
United Kingdom           330
Australia                320
United Arab Emirates     160
Sweden                   144
Mexico                   140
China                    122
Germany                  108
Japan                    106
Singapore                 90
Russia                    72
New Zealand               66
Netherlands               50
South Korea               48
Poland                    46
Ireland                   38
Uruguay                   26
Czech Republic            26
Chile                     26
Croatia                   26
Denmark                   26
Philippines               24
Argentina                 24
Name: country, dtype: int64
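
The duplicate labels came from leading whitespace only, so `str.lstrip()` was enough to merge them. A slightly more defensive variant is `str.strip()`, which also removes trailing whitespace. The sketch below uses a hypothetical sample series (not the real data) to show the difference.

```python
import pandas as pd

# Hypothetical sample: the same labels with leading or trailing whitespace
country = pd.Series([" USA", "USA", " Brazil", "Brazil", "USA "])

left_only = country.str.lstrip()   # removes leading whitespace only
both_sides = country.str.strip()   # removes leading and trailing whitespace

# Number of distinct labels left after each cleanup
print(left_only.nunique(), both_sides.nunique())
```

In this toy sample, `lstrip()` still leaves `"USA "` as a separate label, while `strip()` collapses everything to two countries.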

## 4. Encoding Categorical Variables

**4a. Removing `finish_details`**

In [13]:
df.dtypes[df.dtypes == 'object']

fighter           object
country           object
weight_class      object
gender            object
Stance            object
finish_details    object
dtype: object

In [14]:
df.drop(columns=['finish_details'], inplace=True)

In [15]:
df.dtypes[df.dtypes == 'object']

fighter         object
country         object
weight_class    object
gender          object
Stance          object
dtype: object

**4b. Dummy encoding every categorical variable except `fighter`**

In [16]:
dummy_cols = [column for column in df.dtypes[df.dtypes == 'object'].index if column != 'fighter']
dummy_cols

['country', 'weight_class', 'gender', 'Stance']

In [17]:
df_dummy = pd.get_dummies(df[dummy_cols], drop_first=True)
df_dummy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8774 entries, 0 to 8773
Data columns (total 41 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   country_Australia                   8774 non-null   uint8
 1   country_Brazil                      8774 non-null   uint8
 2   country_Canada                      8774 non-null   uint8
 3   country_Chile                       8774 non-null   uint8
 4   country_China                       8774 non-null   uint8
 5   country_Croatia                     8774 non-null   uint8
 6   country_Czech Republic              8774 non-null   uint8
 7   country_Denmark                     8774 non-null   uint8
 8   country_Germany                     8774 non-null   uint8
 9   country_Ireland                     8774 non-null   uint8
 10  country_Japan                       8774 non-null   uint8
 11  country_Mexico                      8774 non-null   uint8
 12  countr
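
With `drop_first=True`, the first category of each column (in sorted order) is dropped and acts as the baseline, which is why `country_Argentina` does not appear among the columns above. A minimal sketch on a hypothetical toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "gender": ["MALE", "FEMALE", "MALE"],
    "Stance": ["Orthodox", "Southpaw", "Switch"],
})

full = pd.get_dummies(toy)                      # one indicator column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each column dropped

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level per variable avoids perfectly collinear indicator columns, which matters for linear models; a row of all zeros for a variable then means the baseline category.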

## 5. Scaling Numerical Variables

In [18]:
scaler_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
scaler_cols

['odds',
 'current_lose_streak',
 'current_win_streak',
 'draw',
 'avg_SIG_STR_landed',
 'avg_SIG_STR_pct',
 'avg_SUB_ATT',
 'avg_TD_landed',
 'avg_TD_pct',
 'longest_win_streak',
 'losses',
 'total_rounds_fought',
 'total_title_bouts',
 'win_by_Decision_Majority',
 'win_by_Decision_Split',
 'win_by_Decision_Unanimous',
 'win_by_KO/TKO',
 'win_by_Submission',
 'win_by_TKO_Doctor_Stoppage',
 'wins',
 'Height_cms',
 'Reach_cms',
 'Weight_lbs',
 'age',
 'finish',
 'finish_round',
 'finish_round_time',
 'total_fight_time_secs',
 'kd_bout',
 'sig_str_landed_bout',
 'sig_str_attempted_bout',
 'sig_str_pct_bout',
 'tot_str_landed_bout',
 'tot_str_attempted_bout',
 'td_landed_bout',
 'td_attempted_bout',
 'td_pct_bout',
 'sub_attempts_bout',
 'pass_bout',
 'rev_bout',
 'win_pct',
 'adj_wins',
 'current_lose_streak_dif',
 'current_win_streak_dif',
 'longest_win_streak_dif',
 'wins_dif',
 'losses_dif',
 'total_rounds_fought_dif',
 'total_title_bouts_dif',
 'win_by_Submission_dif',
 'Height_cms_d

In [19]:
df[scaler_cols].describe()

Unnamed: 0,odds,current_lose_streak,current_win_streak,draw,avg_SIG_STR_landed,avg_SIG_STR_pct,avg_SUB_ATT,avg_TD_landed,avg_TD_pct,longest_win_streak,...,total_title_bouts_dif,win_by_Submission_dif,Height_cms_dif,Reach_cms_dif,age_dif,avg_SIG_STR_landed_dif,avg_SUB_ATT_dif,avg_TD_landed_dif,ko_dif,adj_wins_dif
count,8774.0,8774.0,8774.0,8774.0,7389.0,7652.0,7585.0,7584.0,7565.0,8774.0,...,8774.0,8774.0,8774.0,8774.0,8774.0,6692.0,6888.0,6886.0,8774.0,8774.0
mean,-25.310691,0.535674,0.948142,0.00661,30.037174,0.449731,0.505933,1.280899,0.316491,2.100296,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,277.718132,0.812976,1.501904,0.086483,19.753582,0.11275,0.644255,1.27483,0.235006,2.051145,...,1.684915,1.759636,6.433477,8.83017,5.161365,21.683464,0.854898,1.697168,2.078698,14.218828
min,-1700.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-16.0,-13.0,-33.02,-187.96,-17.0,-128.222222,-6.0,-11.0,-17.0,-112.0
25%,-200.0,0.0,0.0,0.0,16.0,0.39,0.0,0.333333,0.145,1.0,...,0.0,-1.0,-5.08,-5.08,-3.0,-11.38125,-0.4,-0.909773,-1.0,-6.0
50%,-110.0,0.0,0.0,0.0,28.388889,0.45,0.333333,1.0,0.311667,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,173.0,1.0,1.0,0.0,41.909091,0.51,0.8,1.933333,0.459167,3.0,...,0.0,1.0,5.08,5.08,3.0,11.38125,0.4,0.909773,1.0,6.0
max,1300.0,7.0,16.0,2.0,154.0,1.0,7.0,12.5,1.0,16.0,...,16.0,13.0,33.02,187.96,17.0,128.222222,6.0,11.0,17.0,112.0


In [20]:
scaler = StandardScaler()
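# Standardize each numeric column; NaNs are ignored when fitting and kept in the output
df_scaled = scaler.fit_transform(df[scaler_cols])
# Keep the original row index so the later column-wise concat aligns row by row
df_scaled = pd.DataFrame(df_scaled, columns=scaler_cols, index=df.index)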
df_scaled.describe()

Unnamed: 0,odds,current_lose_streak,current_win_streak,draw,avg_SIG_STR_landed,avg_SIG_STR_pct,avg_SUB_ATT,avg_TD_landed,avg_TD_pct,longest_win_streak,...,total_title_bouts_dif,win_by_Submission_dif,Height_cms_dif,Reach_cms_dif,age_dif,avg_SIG_STR_landed_dif,avg_SUB_ATT_dif,avg_TD_landed_dif,ko_dif,adj_wins_dif
count,8774.0,8774.0,8774.0,8774.0,7389.0,7652.0,7585.0,7584.0,7565.0,8774.0,...,8774.0,8774.0,8774.0,8774.0,8774.0,6692.0,6888.0,6886.0,8774.0,8774.0
mean,1.065619e-16,-1.387108e-15,-3.947656e-16,-1.205289e-15,-4.438601e-16,2.786131e-16,1.217451e-15,-3.431971e-16,-3.993721e-16,-1.932617e-15,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,1.000057,1.000057,1.000057,1.000057,1.000068,1.000065,1.000066,1.000066,1.000066,1.000057,...,1.000057,1.000057,1.000057,1.000057,1.000057,1.000075,1.000073,1.000073,1.000057,1.000057
min,-6.03052,-0.6589424,-0.6313293,-0.07644029,-1.520697,-3.989017,-0.7853522,-1.004827,-1.346825,-1.024021,...,-9.496571,-7.388313,-5.132821,-21.287327,-3.29389,-5.913806,-7.018888,-6.481857,-8.178663,-7.877329
25%,-0.6290524,-0.6589424,-0.6313293,-0.07644029,-0.7106622,-0.5297976,-0.7853522,-0.7433367,-0.7297783,-0.5364608,...,0.0,-0.568332,-0.789665,-0.575333,-0.581275,-0.524921,-0.467926,-0.536092,-0.481098,-0.422
50%,-0.3049644,-0.6589424,-0.6313293,-0.07644029,-0.08344799,0.002390055,-0.2679243,-0.2203567,-0.02052916,-0.04890051,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.7141125,0.5711768,0.03452997,-0.07644029,0.6010414,0.5345777,0.4564748,0.5118152,0.6071564,0.4386598,...,0.0,0.568332,0.789665,0.575333,0.581275,0.524921,0.467926,0.536092,0.481098,0.422
max,4.772415,7.951892,10.02242,23.0507,6.275885,4.880777,10.08063,8.801047,2.90867,6.776944,...,9.496571,7.388313,5.132821,21.287327,3.29389,5.913806,7.018888,6.481857,8.178663,7.877329
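
The `std` row reads 1.000057 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The two differ by a factor of sqrt(n / (n - 1)), which for n = 8774 is about 1.000057; columns with fewer non-null values show slightly larger figures. The sketch below reproduces this on a hypothetical five-row column with one missing value.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical column with one missing value
col = pd.DataFrame({"x": [1.0, 2.0, 4.0, 7.0, np.nan]})

# NaN is disregarded when fitting and passed through by transform
scaled = pd.DataFrame(StandardScaler().fit_transform(col), columns=col.columns)

n = scaled["x"].count()                    # non-null count
sample_std = scaled["x"].std()             # pandas default, ddof=1
population_std = scaled["x"].std(ddof=0)   # what the scaler normalizes to 1

print(n, sample_std, population_std, np.sqrt(n / (n - 1)))
print(np.sqrt(8774 / 8773))                # ratio for the full dataset
```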


## 6. Combine All Preprocessed DataFrames

**6a. Drop original columns**

In [21]:
df.drop(columns=scaler_cols + dummy_cols, inplace=True)

In [22]:
df.columns

Index(['date', 'fighter', 'Winner', 'title_bout', 'better_rank'], dtype='object')

**6b. Concatenate preprocessed DataFrames**

In [23]:
df = pd.concat([df, df_dummy, df_scaled], axis=1)
df.columns

Index(['date', 'fighter', 'Winner', 'title_bout', 'better_rank',
       'country_Australia', 'country_Brazil', 'country_Canada',
       'country_Chile', 'country_China',
       ...
       'total_title_bouts_dif', 'win_by_Submission_dif', 'Height_cms_dif',
       'Reach_cms_dif', 'age_dif', 'avg_SIG_STR_landed_dif', 'avg_SUB_ATT_dif',
       'avg_TD_landed_dif', 'ko_dif', 'adj_wins_dif'],
      dtype='object', length=104)
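
Note that `pd.concat(..., axis=1)` aligns rows by index label, not by position. The concatenation above relies on all three frames sharing the same index (here, presumably the default 0 to 8773 range). The sketch below, using hypothetical toy frames, shows what happens when the indexes differ and one way to realign them.

```python
import pandas as pd

# Hypothetical frames: `left` has a non-default index, `right` a fresh RangeIndex
left = pd.DataFrame({"a": [10, 20, 30]}, index=[5, 6, 7])
right = pd.DataFrame({"b": [0.1, 0.2, 0.3]})  # index 0, 1, 2

# Labels do not overlap, so the result is the union of both indexes, padded with NaN
misaligned = pd.concat([left, right], axis=1)

# Giving `right` the same index as `left` restores row-by-row alignment
aligned = pd.concat([left, right.set_index(left.index)], axis=1)

print(misaligned.shape, aligned.shape)
```

Checking the row count (or the NaN count) after a column-wise concat is a cheap guard against this kind of silent misalignment.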

**6c. Save DataFrame**

In [24]:
df.to_csv('df.csv')

## 7. Final Thoughts

Since I will have multiple dependent variables later on, I will defer the train-test split to the Modeling section, where each dependent variable can be assessed in its own dedicated section.
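
For reference, one common pattern once the split is made is to fit the scaler on the training rows only and reuse it on the test rows, so that test-set statistics do not influence the transformation. This is a hedged sketch rather than the approach used in this notebook, which scales the full dataset; the feature matrix `X`, target `y`, and column names below are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix and target; names are placeholders
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(rng.normal(size=100), name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training rows only, then apply the same transform to both parts
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```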