# UFC data analysis and exploration

The goal of this project is to explore and analyze the data captured from a Postgres database.

Then I'll pick the features that are most likely to predict the winners of the next event, apply data analysis and feature engineering, and save the result into MongoDB.

Once the data is saved in the MongoDB instance, I'll retrieve it and apply some statistical models and visualizations.

In [1]:
import pandas as pd
import pymongo

## Fetch data from Postgres

In [2]:
master_df = pd.read_csv( '../data/ufc-master.csv' )
upcoming_df = pd.read_csv( '../data/upcoming-event.csv' )
master_df_columns = list( master_df.columns )
upcoming_df_columns = list( upcoming_df.columns )

## Explore raw data

Check whether both datasets have the same columns:

In [3]:
sorted( master_df_columns ) == sorted( upcoming_df_columns )

True
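
A `True` here only confirms that the column names match. If the comparison ever returned `False`, a set difference would show which columns are missing on each side. The following is a minimal sketch; the helper name `column_diff` and the toy frames are illustrative and not part of the original notebook.

```python
import pandas as pd

def column_diff(left, right):
    """Return (columns only in left, columns only in right), each sorted."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return sorted(left_cols - right_cols), sorted(right_cols - left_cols)

# Toy frames standing in for master_df and upcoming_df
a = pd.DataFrame({'R_fighter': ['x'], 'B_fighter': ['y'], 'Winner': ['Red']})
b = pd.DataFrame({'R_fighter': ['x'], 'B_fighter': ['y'], 'R_odds': [-150]})

only_a, only_b = column_diff(a, b)
print('Only in a:', only_a)
print('Only in b:', only_b)
```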

Next, look at the shape of each dataset:

In [4]:
print( 'Master df shape:', master_df.shape )
print( 'Upcoming df shape:', upcoming_df.shape )

Master df shape: (4896, 119)
Upcoming df shape: (9, 119)


The `shape` property shows that the number of columns is enormous (119).

I'll put on the expert's hat and pick the columns that I think will help the most
to build a model that predicts the winners of the next upcoming UFC event.

Uncomment the next cell to see the different columns.

In [5]:
# master_df_columns

The 'Winner' column is the target I want to predict.

As we can see, the master df has this information, but the upcoming df does not.

In [6]:
columns = ['B_fighter', 'R_fighter', 'Winner']

print( 'Master winners' )
print( master_df.loc[0:4, columns] )
print( '\nUpcoming winners' )
print( upcoming_df.loc[0:4, columns] )

Master winners
         B_fighter            R_fighter Winner
0    Johnny Walker        Thiago Santos    Red
1       Niko Price        Alex Oliveira   Blue
2  Krzysztof Jotko       Misha Cirkunov   Blue
3     Mike Breeden  Alexander Hernandez    Red
4     Jared Gordon          Joe Solecki   Blue

Upcoming winners
          B_fighter        R_fighter  Winner
0  Marina Rodriguez   Mackenzie Dern     NaN
1      Jared Gooden      Randy Brown     NaN
2   Matheus Nicolau      Tim Elliott     NaN
3    Mariya Agapova      Sabina Mazo     NaN
4    Felipe Colares  Chris Gutierrez     NaN


Summary statistics for the numeric columns:

In [7]:
master_df.describe()

statistic,R_odds,B_odds,R_ev,B_ev,no_of_rounds,B_current_lose_streak,B_current_win_streak,B_draw,B_avg_SIG_STR_landed,B_avg_SIG_STR_pct,...,B_Flyweight_rank,B_Pound-for-Pound_rank,finish_round,total_fight_time_secs,r_dec_odds,b_dec_odds,r_sub_odds,b_sub_odds,r_ko_odds,b_ko_odds
count,4895.0,4896.0,4895.0,4896.0,4896.0,4896.0,4896.0,4896.0,3966.0,4131.0,...,95.0,35.0,4274.0,4274.0,4093.0,4077.0,3847.0,3835.0,3847.0,3834.0
mean,-117.640449,66.030637,94.827397,167.083323,3.181985,0.477941,0.875408,0.010621,26.308553,0.444741,...,8.473684,9.485714,2.408049,652.313758,294.064745,416.544027,843.010138,1064.543155,514.231869,647.257173
std,268.881452,247.803928,82.843409,136.944643,0.571515,0.769386,1.311379,0.108333,20.935885,0.121332,...,4.259763,4.300283,0.996643,357.911423,230.583958,306.571299,550.126761,627.285034,413.622768,458.846643
min,-1700.0,-1200.0,5.882353,8.333333,3.0,0.0,0.0,0.0,0.0,0.0,...,1.0,2.0,1.0,5.0,-440.0,-200.0,-370.0,-1250.0,-550.0,-275.0
25%,-255.0,-145.0,39.215686,68.965517,3.0,0.0,0.0,0.0,5.61,0.3875,...,5.0,5.0,1.0,297.0,167.0,225.0,435.0,590.0,240.0,325.0
50%,-150.0,130.0,66.666667,130.0,3.0,0.0,0.0,0.0,24.759615,0.45,...,8.0,10.0,3.0,900.0,250.0,349.0,720.0,975.0,435.0,548.5
75%,126.5,220.0,126.5,220.0,3.0,1.0,1.0,0.0,39.075,0.51,...,12.0,13.5,3.0,900.0,400.0,525.0,1200.0,1400.0,700.0,880.75
max,775.0,1300.0,775.0,1300.0,5.0,6.0,12.0,2.0,154.0,1.0,...,15.0,15.0,5.0,1500.0,2200.0,2600.0,4665.0,4785.0,2675.0,3200.0


In [8]:
master_df.dtypes

R_fighter      object
B_fighter      object
R_odds        float64
B_odds          int64
R_ev          float64
               ...   
b_dec_odds    float64
r_sub_odds    float64
b_sub_odds    float64
r_ko_odds     float64
b_ko_odds     float64
Length: 119, dtype: object

As we can see, most features are numeric. The non-numeric ones can be encoded or converted to categorical values.
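
One possible way to encode the non-numeric columns is sketched below. It assumes a binary target derived from `Winner` and one-hot encoding for multi-valued columns via `pd.get_dummies`; the column name `winner_is_red` and the toy rows are illustrative, and the notebook itself does not commit to a particular encoding.

```python
import pandas as pd

# Toy rows using column names and values seen in this notebook
df = pd.DataFrame({
    'Winner': ['Red', 'Blue', 'Blue'],
    'weight_class': ['Light Heavyweight', 'Welterweight', 'Middleweight'],
    'R_age': [37, 33, 34],
})

# Binary target: 1 if the red corner won, 0 otherwise
df['winner_is_red'] = (df['Winner'] == 'Red').astype(int)

# One-hot encode a multi-valued categorical column
encoded = pd.get_dummies(df.drop(columns='Winner'), columns=['weight_class'])
print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order between weight classes, at the cost of one extra column per distinct value.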

## Select the most interesting features

In [9]:
columns = [
  'R_fighter',
  'B_fighter',
  'gender',
  'country',
  'Winner',
  'weight_class',
  'R_current_lose_streak',
  'B_current_lose_streak',
  'R_losses',
  'B_losses',
  'R_wins',
  'B_wins',
  'R_Height_cms',
  'R_Reach_cms',
  'R_Weight_lbs',
  'B_Height_cms',
  'B_Reach_cms',
  'B_Weight_lbs',
  'better_rank',
  'R_age',
  'B_age',
]

master_df = master_df.loc[:, columns]
upcoming_df = upcoming_df.loc[:, columns]

## Clean data

Rename the columns to lowercase:

In [10]:
master_df.rename( inplace = True, 
                  columns = { column : column.lower() for column in columns } )
upcoming_df.rename( inplace = True, 
                  columns = { column : column.lower() for column in columns } )
master_df.columns

Index(['r_fighter', 'b_fighter', 'gender', 'country', 'winner', 'weight_class',
       'r_current_lose_streak', 'b_current_lose_streak', 'r_losses',
       'b_losses', 'r_wins', 'b_wins', 'r_height_cms', 'r_reach_cms',
       'r_weight_lbs', 'b_height_cms', 'b_reach_cms', 'b_weight_lbs',
       'better_rank', 'r_age', 'b_age'],
      dtype='object')

As the checks below show, the data is already pretty clean, because:
 - There are no duplicate rows or duplicate features
 - There are no missing or null values
 - There are no constant features

Remove duplicate rows:

In [11]:
print("Length original dataframe: ", len( master_df ))
duplicate_rows_df = master_df[master_df.duplicated()]
print("Number of duplicate rows: ", len( duplicate_rows_df ))
master_df = master_df.drop_duplicates()
print("Length without duplicates: ", len( master_df ))

Length original dataframe:  4896
Number of duplicate rows:  0
Length without duplicates:  4896


Look for missing values:

In [12]:
master_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4896 entries, 0 to 4895
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   r_fighter              4896 non-null   object 
 1   b_fighter              4896 non-null   object 
 2   gender                 4896 non-null   object 
 3   country                4896 non-null   object 
 4   winner                 4896 non-null   object 
 5   weight_class           4896 non-null   object 
 6   r_current_lose_streak  4896 non-null   int64  
 7   b_current_lose_streak  4896 non-null   int64  
 8   r_losses               4896 non-null   int64  
 9   b_losses               4896 non-null   int64  
 10  r_wins                 4896 non-null   int64  
 11  b_wins                 4896 non-null   int64  
 12  r_height_cms           4896 non-null   float64
 13  r_reach_cms            4896 non-null   float64
 14  r_weight_lbs           4896 non-null   int64  
 15  b_he

In [13]:
master_df.isnull().sum()

r_fighter                0
b_fighter                0
gender                   0
country                  0
winner                   0
weight_class             0
r_current_lose_streak    0
b_current_lose_streak    0
r_losses                 0
b_losses                 0
r_wins                   0
b_wins                   0
r_height_cms             0
r_reach_cms              0
r_weight_lbs             0
b_height_cms             0
b_reach_cms              0
b_weight_lbs             0
better_rank              0
r_age                    0
b_age                    0
dtype: int64

Look for constant features:

In [14]:
master_df.nunique()

r_fighter                1348
b_fighter                1591
gender                      2
country                    28
winner                      2
weight_class               13
r_current_lose_streak       8
b_current_lose_streak       7
r_losses                   19
b_losses                   16
r_wins                     28
b_wins                     26
r_height_cms               22
r_reach_cms                50
r_weight_lbs               33
b_height_cms               23
b_reach_cms                59
b_weight_lbs               36
better_rank                 3
r_age                      28
b_age                      29
dtype: int64
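
In the output above every column has at least two distinct values, so nothing needs to be dropped. If a constant column did show up, it could be removed along these lines. This is a sketch only: `drop_constant_columns` is a hypothetical helper and the toy frame (including its `sport` column) is made up for illustration.

```python
import pandas as pd

def drop_constant_columns(df):
    """Return a copy of df without columns that hold a single unique value."""
    counts = df.nunique(dropna=False)  # count NaN as a value too
    return df.loc[:, counts > 1].copy()

# Toy frame: 'sport' is constant, the other columns are not
toy = pd.DataFrame({
    'gender': ['MALE', 'FEMALE', 'MALE'],
    'sport': ['mma', 'mma', 'mma'],
    'r_age': [37, 33, 34],
})

cleaned = drop_constant_columns(toy)
print(cleaned.columns.tolist())
```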

Check for duplicate features:

In [15]:
print( 'Length features: ', len( master_df.T ) )
duplicate_rows_df = master_df.T[master_df.T.duplicated()]
print( 'Number of duplicate features: ', len( duplicate_rows_df ) )
df_dropped = master_df.T.drop_duplicates()
print( 'Length features without duplicates: ', len( df_dropped ) )
master_df = df_dropped.T.convert_dtypes()
master_df.head()

Length features:  21
Number of duplicate features:  0
Length features without duplicates:  21


index,r_fighter,b_fighter,gender,country,winner,weight_class,r_current_lose_streak,b_current_lose_streak,r_losses,b_losses,...,b_wins,r_height_cms,r_reach_cms,r_weight_lbs,b_height_cms,b_reach_cms,b_weight_lbs,better_rank,r_age,b_age
0,Thiago Santos,Johnny Walker,MALE,USA,Red,Light Heavyweight,3,0,8,2,...,5,187.96,193.04,205,198.12,208.28,205,Red,37,29
1,Alex Oliveira,Niko Price,MALE,USA,Blue,Welterweight,2,2,8,5,...,6,180.34,193.04,170,182.88,193.04,170,neither,33,32
2,Misha Cirkunov,Krzysztof Jotko,MALE,USA,Blue,Middleweight,1,1,4,5,...,9,190.5,195.58,205,185.42,195.58,185,neither,34,32
3,Alexander Hernandez,Mike Breeden,MALE,USA,Red,Lightweight,1,1,3,1,...,0,175.26,182.88,155,177.8,177.8,155,neither,29,32
4,Joe Solecki,Jared Gordon,MALE,USA,Blue,Lightweight,0,0,0,3,...,5,175.26,177.8,155,175.26,172.72,145,neither,28,33


## Save the data into mongo instance

In [16]:
import math
import time

class MongoDB:
  """Thin wrapper around a pymongo client with chunked inserts."""

  def __init__(self):
    mongo_str = 'mongodb://mongo:mongo@localhost'
    self.clientInstance = pymongo.MongoClient(mongo_str)

  def client(self):
    return self.clientInstance

  def __send_mongo(self, df, collection):
    # insert_many rejects an empty list, so skip empty slices
    chunk = df.to_dict('records')
    if chunk:
      collection.insert_many(chunk)

  def __send(self, df, collection, it=3, per=1):
    # Insert df in at most `it` chunks, sleeping `per` seconds after each one
    N = len(df)
    # ceil with a minimum of 1 so the loop always advances, even when N < it
    chunk_size = max(1, math.ceil(N / it))
    for start in range(0, N, chunk_size):
      df_slice = df.iloc[start:start + chunk_size]
      self.__send_mongo(df_slice, collection)
      time.sleep(per)

  def send(self, df, collection):
    self.__send(df, collection)

  def exec(self, collection, pipeline):
    # Run an aggregation pipeline against the given collection
    return collection.aggregate(pipeline)
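
The chunked transfer in `__send` boils down to slicing the frame into a few pieces and inserting each one. The slicing step can be sketched on its own, without a database. The generator name `iter_record_chunks` and the toy frame are assumptions made for illustration.

```python
import math
import pandas as pd

def iter_record_chunks(df, n_chunks=3):
    """Yield lists of row dicts, splitting df into at most n_chunks pieces."""
    if len(df) == 0:
        return  # nothing to send
    size = max(1, math.ceil(len(df) / n_chunks))
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size].to_dict('records')

# Toy frame with 7 rows
toy = pd.DataFrame({'r_fighter': list('abcdefg'), 'r_age': range(7)})
chunks = list(iter_record_chunks(toy))
print([len(c) for c in chunks])
```

Each yielded list has the shape that `collection.insert_many` expects, so the same loop could feed a real collection.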

In [30]:
mongodb = MongoDB()
mongodb_ufc = mongodb.client()['ufc']
mongodb_ufc.master.drop()
mongodb_ufc.upcoming.drop()
mongodb_ufc.master \
    .insert_many( master_df.to_dict( 'records' ) )
upcoming_df.drop( columns='winner', 
                 inplace=True, 
                 errors='ignore' )
mongodb_ufc.upcoming \
    .insert_many( upcoming_df.to_dict( 'records' ) )

<pymongo.results.InsertManyResult at 0x7fc1439d5db0>

## Recover data from mongo instance
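
This section is still empty in the notebook. A minimal sketch of how the stored documents could be turned back into a DataFrame follows. It relies on `collection.find()` returning an iterable of dicts; the helper `documents_to_df` is hypothetical, and a plain list stands in for the cursor so the example runs without a MongoDB server.

```python
import pandas as pd

def documents_to_df(documents):
    """Build a DataFrame from an iterable of MongoDB-style dicts."""
    df = pd.DataFrame(list(documents))
    # MongoDB adds an `_id` field on insert; it is not a feature
    return df.drop(columns='_id', errors='ignore')

# Stand-in for the result of mongodb_ufc.master.find()
fake_cursor = [
    {'_id': 1, 'r_fighter': 'Thiago Santos', 'b_fighter': 'Johnny Walker', 'winner': 'Red'},
    {'_id': 2, 'r_fighter': 'Alex Oliveira', 'b_fighter': 'Niko Price', 'winner': 'Blue'},
]

recovered_df = documents_to_df(fake_cursor)
print(recovered_df)
```

With a live connection the call would presumably be `documents_to_df(mongodb_ufc.master.find())`.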

## Add features
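
The notebook does not yet say which features will be added. One common option for paired red/blue columns is the difference between the two corners, sketched here as an assumption rather than the author's plan. The helper `add_diff_features` is hypothetical; the toy row reuses values from the first fight shown earlier.

```python
import pandas as pd

def add_diff_features(df, names=('height_cms', 'reach_cms', 'age', 'losses')):
    """Add `<name>_diff` = red value minus blue value for each paired column."""
    out = df.copy()
    for name in names:
        out[f'{name}_diff'] = out[f'r_{name}'] - out[f'b_{name}']
    return out

# Toy row taken from the first fight in the cleaned master frame
toy = pd.DataFrame({
    'r_height_cms': [187.96], 'b_height_cms': [198.12],
    'r_reach_cms': [193.04], 'b_reach_cms': [208.28],
    'r_age': [37], 'b_age': [29],
    'r_losses': [8], 'b_losses': [2],
})

features = add_diff_features(toy)
print(features.filter(like='_diff'))
```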

## Apply statistical models

## Data visualization