# Cleaning the 2022 Fight Data
## Summary
The CSV file we'll be working with is a subset of semi-cleaned data from the larger web scraping project. The objective here is to analyze and visualize patterns in how fights are won, clustered by weight class and ruleset. For this analysis, a few things should be touched up before continuing. Once cleaned, the data will be moved to Tableau to create an interactive dashboard.

## Items to fix:
* NaN values within the 'Method' and 'Weight' columns replaced with an empty string.
* Creation of a column that shows whether a fight was won by points, regardless of score.
* Removing the 'Pts: ' substring from the score in 'Method'.
* Removing non-numeric characters from the 'Weight' column, with an exception for the absolute weight class ('ABS').
* Calculating the point differential in rows where we have the score.

## Deficiencies with the data:
* This sample data seems to have reasonable data for the ADCC competitions, but other prominent tournaments (namely IBJJF) are lacking. Results may not reflect the broader trends in BJJ and are specific to the data we *do* have.
* Typos - We are working with human inputs here and have to assume accuracy in all of the fields, since I don't have a way to validate the true results, spelling, or weight classes of the various fighters.
* Bias toward wins - In a perfect data scenario, each win for a fighter is a loss for another fighter. Since we don't have equal numbers of wins and losses, I assume there is a bias toward people entering their favorite fighters' wins over their losses.
* Bias toward popularity - BJJ Heroes maintains good records for prominent fighters, but pages are limited for less popular fighters.
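
The win/loss imbalance mentioned above can be measured directly. The sketch below is only an illustration: it uses the five 'W/L' values that appear in this notebook's `df.head()` output as a stand-in for the full file, and the names `sample`, `counts`, and `imbalance` are made up for the example.

```python
import pandas as pd

# Stand-in for the real data: the five 'W/L' values shown in df.head().
sample = pd.DataFrame({'W/L': ['W', 'W', 'L', 'L', 'W']})

# Count how many rows are recorded as wins and as losses.
counts = sample['W/L'].value_counts()

# In a perfectly balanced dataset this difference would be zero.
imbalance = counts.get('W', 0) - counts.get('L', 0)
print(counts.to_dict(), imbalance)
```

Running the same two lines against `df['W/L']` would show how far the full file is from the balanced case.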


In [20]:
import pandas as pd

df = pd.read_csv('2022_data.csv')

In [21]:
df['Method'] = df['Method'].fillna('')
df['Weight'] = df['Weight'].fillna('')

In [22]:
# Flag fights decided on points: either the bare label 'Points' or a 'Pts' score.
is_points = df['Method'].eq('Points') | df['Method'].str.contains('Pts', regex=False)
df['Decided by Points'] = is_points.astype(float)  # stored as 0.0 / 1.0

In [23]:
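# Drop the 'Pts: ' prefix so scored fights keep only the score, e.g. '3x1'.
# (Series.iteritems() was removed in pandas 2.0; the .str accessor avoids the loop.)
df['Method'] = df['Method'].str.replace('Pts: ', '', regex=False)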

In [24]:
# Remove the 'KG' unit suffix from the weight values.
df['Weight'] = df['Weight'].str.replace('KG', '', regex=False)

In [25]:
# Remove the letter 'O' from the weight values; 'ABS' contains no 'O' and is unaffected.
df['Weight'] = df['Weight'].str.replace('O', '', regex=False)
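
After the weight cleanup, a quick check can list any values that are still not a plain number, the 'ABS' label, or an empty string. This is a minimal sketch rather than part of the original notebook: the first five weights are taken from the `df.head()` output, while '85KG' is a made-up leftover added to show a failing case.

```python
import pandas as pd

# Five values from df.head(), plus one hypothetical uncleaned value.
weights = pd.Series(['85', '95', '70', '67', 'ABS', '85KG'])

# Accept an empty string, digits only, or the absolute class label.
is_clean = weights.str.fullmatch(r'\d*|ABS')

# Anything left over still needs attention.
leftovers = weights[~is_clean]
print(leftovers.tolist())
```

Applying the same pattern to `df['Weight']` would surface any formats the two replacements did not anticipate.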

In [26]:
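# Series.iteritems() was removed in pandas 2.0; items() is the replacement.
for index, value in df['Method'].items():
    # Treat values containing digits as candidate scores such as '3x1'.
    if any(char.isdigit() for char in value):
        nums = value.split('x')
        if len(nums) == 2:
            points1, points2 = int(nums[0]), int(nums[1])
            # Rows without a parsable score keep NaN in 'Point Diff'.
            df.loc[index, 'Point Diff'] = abs(points1 - points2)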
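
As an alternative to the loop, the point differential could be computed with a regular expression. The sketch below is an assumption about how that might look, not the notebook's method: it is stricter than the loop, because it only accepts values that are exactly of the form `<digits>x<digits>`. The five 'Method' values are the ones shown in the `df.head()` output.

```python
import pandas as pd

# 'Method' values from df.head(), after the 'Pts: ' prefix has been removed.
methods = pd.Series(['Armbar', '3x1', 'EBI/OT', 'Referee Decision', '4x2'])

# Capture both sides of a score; rows that do not match become NaN.
scores = methods.str.extract(r'^(\d+)x(\d+)$').apply(pd.to_numeric)

# Absolute difference between the two scores, NaN where there is no score.
point_diff = (scores[0] - scores[1]).abs()
print(point_diff.tolist())
```

Because non-matching rows become NaN instead of raising, a value that contains an 'x' and digits but is not a clean score would be skipped here rather than stopping the cell with a `ValueError`.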

In [27]:
print(df.head())

            Fighter         Opponent W/L            Method       Competition  \
0  Claudio Calasans  Wellington Paes   W            Armbar     Curitiba SPNG   
1   Guilherme Bacha     Artur Gareev   W               3x1  ACB World Champ.   
2     Adam Benayoun    Kieran Kichuk   L            EBI/OT      Emerald City   
3     Adam Benayoun      Gavin Corbe   L  Referee Decision       NoGi Worlds   
4     Adam Benayoun    Miha Perhavec   W               4x2       Austin WNGO   

  Weight Stage  Year  Decided by Points  Point Diff  
0     85     F  2022                0.0         NaN  
1     95    4F  2022                1.0         2.0  
2     70     F  2022                0.0         NaN  
3     67    4F  2022                0.0         NaN  
4    ABS    SF  2022                1.0         2.0  


In [19]:
df.to_csv('cleaned_2022_data.csv', index=False)
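
One thing to keep in mind when reading the saved file back into pandas: with the default settings, `read_csv` interprets empty fields as NaN, so the empty strings written above come back as missing values. The sketch below demonstrates this with an in-memory buffer; the second row, including the name 'Hypothetical Fighter', is invented purely for illustration.

```python
import io
import pandas as pd

# First row is from df.head(); the second row is a made-up fight with a blank 'Method'.
frame = pd.DataFrame({
    'Fighter': ['Claudio Calasans', 'Hypothetical Fighter'],
    'Method': ['Armbar', ''],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Default read: the empty field is parsed as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string.
buffer.seek(0)
kept_blank = pd.read_csv(buffer, keep_default_na=False)

print(default_read['Method'].isna().sum(), kept_blank['Method'].tolist())
```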