## Capstone: Using Machine Learning to Score Powerlifting Meets

### The Problem: 
Powerlifting has grown widely in popularity over the past 10 years. The increase in interest has led to an increase in the number of cash prizes available for lifters. With higher stakes, some have called into question the validity of the method for determining who is the "Best Lifter" across weight classes.

**Intro to Powerlifting:** Powerlifting is a one-rep-max sport. Lifters are given three attempts at each of three lifts: the squat, the bench press, and the deadlift. Three judges decide whether each attempt is good or failed. A lifter's best attempt at each lift is added to their **total**, which is then used to determine their rank/score.

**Scoring Powerlifting:** Lifters are ranked by highest total within weight classes, **divisions** (separated by age group), and equipment level. For each division and equipment level there is also a "Best Lifter" award across weight classes. Best Lifter is determined by the highest **Wilks Score.**

Aside: Powerlifting equipment consists of shirts, suits, and wraps that store **elastic energy** as the lifter moves through the lift. Imagine pulling back a rubber band: the force with which the rubber band springs back into place comes from that stored energy. Equipment applies that force to the lift, allowing the lifter to lift more weight.
From here on, lifting without equipment will be referred to as **Raw** and lifting with equipment will be referred to as **Single-ply**.

**What is the Wilks Score?:** The Wilks score is a way of scoring across weight classes, created by Robert Wilks in 1994. A coefficient calculated from bodyweight is multiplied by the total lifted (or, in the case of single-lift events, just the best lift from that event). The Wilks coefficient looks like this:

<img src = "https://wikimedia.org/api/rest_v1/media/math/render/svg/97274a51d2d3213a9ab7f8a98d91fd6eb0c23e81">

Where "x" is the lifter's bodyweight and coefficients a-f are values specific to each gender. (For more information, check out the Wilks <a href="https://en.wikipedia.org/wiki/Wilks_Coefficient">Wikipedia</a> page.)
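As a rough illustration, the formula can be written as a short Python function. The coefficient values below are the men's values as commonly published for the 1994 formula; treat them as an assumption and check them against the linked page (the women's set is listed there as well). The function names are made up for this sketch.

```python
# Men's coefficients a-f, as commonly published for the 1994 Wilks formula.
# Assumption: verify these against the linked Wikipedia page before relying on them.
MEN_COEFFS = (-216.0475144, 16.2606339, -0.002388645,
              -0.00113732, 7.01863e-06, -1.291e-08)


def wilks_coefficient(bodyweight_kg, coeffs=MEN_COEFFS):
    """Return 500 / (a + b*x + c*x^2 + d*x^3 + e*x^4 + f*x^5)."""
    denominator = sum(c * bodyweight_kg ** power
                      for power, c in enumerate(coeffs))
    return 500.0 / denominator


def wilks_score(total_kg, bodyweight_kg, coeffs=MEN_COEFFS):
    """Scale the total lifted by the bodyweight-based coefficient."""
    return total_kg * wilks_coefficient(bodyweight_kg, coeffs)


# Example: a 100 kg lifter with a 500 kg total.
example_score = wilks_score(500.0, 100.0)
```

Because the coefficient shrinks as bodyweight grows, a lighter lifter needs a smaller total than a heavier lifter to reach the same score.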

**What is Wrong with the Wilks Score?**

In his article, <a href="https://www.strongerbyscience.com/whos-the-most-impressive-powerlifter/">Who's The Most Impressive Powerlifter?</a>, Greg Nuckols outlines the problems with the Wilks score. They are as follows:
1. It’s biased against middleweights.
2. It’s based on mixed raw/equipped data (with no way to account for it).
3. It’s overcomplicated for a relationship just between bodyweight and weight lifted (one might say it's overfit).
4. It hasn’t been updated since its creation (while new world records have been set).
5. It's based on totals.
6. **Bonus (not in the article):** It only accounts for the relationship between bodyweight and weight lifted, while age and equipment also affect what a person will lift.

## The Solution:

Create a model (using more data and more features) to predict the total a person will lift. The score will be the difference between the total the person actually lifted, and the predicted total. 
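A minimal sketch of the scoring idea, using made-up numbers and a straight-line fit on bodyweight as a stand-in for the real model (the actual model and its features are not shown here):

```python
import numpy as np

# Made-up lifters: bodyweight (kg) and the total they actually lifted (kg).
bodyweight_kg = np.array([60.0, 75.0, 90.0, 105.0, 120.0])
actual_total = np.array([400.0, 520.0, 560.0, 640.0, 650.0])

# Stand-in model (an assumption for illustration): least-squares line
# predicting total from bodyweight only.
slope, intercept = np.polyfit(bodyweight_kg, actual_total, deg=1)
predicted_total = slope * bodyweight_kg + intercept

# The score is how far the lifter landed above or below the prediction.
residual_score = actual_total - predicted_total

# The "Best Lifter" is whoever beat their prediction by the most.
best_lifter = int(np.argmax(residual_score))
```

A positive score means the lifter outperformed what the model expected for someone with their characteristics; a negative score means they fell short.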

## Executive Summary

**Goal:** 

The goal of this project is to use machine learning principles to create a "fair" metric for scoring powerlifters across ages, genders, weight classes, and levels of equipment.

**The Data:**

Data was taken from the <a href="www.openpowerlifting.org">OpenPowerlifting</a> website. OpenPowerlifting offers results from every federation. Data for this project was selected from the USAPL/IPF federations. These federations were chosen for three reasons: familiarity with their rules, they are the same federations used by Wilks/Nuckols, and they implement drug testing. Use of performance-enhancing drugs affects a lifter's ability, and non-drug-tested federations offer no way to measure it, so it would throw off the model's predictions. It was easier to select federations in which it can be assumed that no lifter is using such drugs.

**Metrics:** 

Various metrics were used in assessing the scoring system. The first was to recreate the work done by Paul M. Vanderburgh and Alan M. Batterham in their paper *Validation of the Wilks powerlifting formula*, in which they looked at the correlation between bodyweight and Wilks-adjusted scores to see if there were any trends.
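The idea behind this check can be sketched as follows, assuming a plain Pearson correlation (the paper's exact procedure may differ) and made-up scores. A score that fully removes the effect of bodyweight should show a correlation near zero.

```python
import numpy as np


def bodyweight_correlation(bodyweight_kg, score):
    """Pearson correlation between bodyweight and a bodyweight-adjusted score."""
    return float(np.corrcoef(bodyweight_kg, score)[0, 1])


# Made-up data for illustration only.
bodyweight_kg = np.array([60.0, 75.0, 90.0, 105.0, 120.0])
biased_score = 2.0 * bodyweight_kg + 10.0                 # rises with bodyweight
balanced_score = np.array([1.0, -1.0, 0.0, -1.0, 1.0])    # no linear trend

biased_r = bodyweight_correlation(bodyweight_kg, biased_score)
balanced_r = bodyweight_correlation(bodyweight_kg, balanced_score)
```

Note that a Pearson correlation only detects linear trends, so a score that favors both extremes of the bodyweight range could still come out near zero.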

The next method was to simulate 10,000 meets and select a winner of each using the machine learning model and also using the Wilks score. The distributions of these winners' bodyweight, age, and equipment were compared to the distributions of the lifters in the data (figuring that a fair metric would produce winners that follow the distribution of the lifters).
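One way such a simulation could look is sketched below. How meets were actually composed is not described here, so the meet size, the random sampling of entrants, and the synthetic lifter pool are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic lifter pool (illustrative only): bodyweight and some score,
# which could be a Wilks score or a residual score.
n_lifters = 500
bodyweight_kg = rng.normal(85.0, 15.0, n_lifters)
score = rng.normal(0.0, 1.0, n_lifters)


def simulate_winner_bodyweights(bodyweight_kg, score, n_meets=10_000,
                                meet_size=30, rng=rng):
    """Draw random meets and record the bodyweight of each meet's top scorer."""
    winners = np.empty(n_meets)
    for i in range(n_meets):
        # Pick entrants without replacement so nobody enters a meet twice.
        entrants = rng.choice(bodyweight_kg.size, size=meet_size, replace=False)
        best = entrants[np.argmax(score[entrants])]
        winners[i] = bodyweight_kg[best]
    return winners


# A smaller run keeps the example fast; the project used 10,000 meets.
winner_bodyweights = simulate_winner_bodyweights(bodyweight_kg, score, n_meets=1_000)
```

Running this once per scoring method gives two sets of winners whose bodyweight distributions can each be compared with the full pool.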

These distributions were analyzed with histograms, distplots, quantile-quantile plots, and using the Kolmogorov-Smirnov statistic. 
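For reference, the two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the empirical cumulative distributions of two samples. The sketch below computes it directly with NumPy; `scipy.stats.ks_2samp` is a library alternative that also reports a p-value. The sample values are made up.

```python
import numpy as np


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples (0 = identical)."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    points = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, points, side="right") / a.size
    cdf_b = np.searchsorted(b, points, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Illustrative bodyweights: all lifters vs. winners skewed toward heavier lifters.
all_lifters = [60, 67, 74, 83, 93, 105, 120]
heavy_winners = [105, 110, 120, 125]
gap = ks_statistic(all_lifters, heavy_winners)
```

A smaller statistic means the winners' distribution sits closer to the distribution of all lifters.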

**Findings:** 

The residual scores had a much lower correlation with bodyweight than the Wilks-adjusted scores did.

For men, the bodyweight distribution of the residuals winners was much more similar to the distribution of the lifters than the distribution of the Wilks winners was. The KS test reflected this as well.

For women, the distributions of residuals winners, Wilks winners, and lifters were all much closer together, making it harder to discern which is more "fair." The Wilks winners' distribution scored better on the KS test, but only by a marginal amount.

For both men and women, the split of Raw vs. Single-ply lifters among the winners was closer to the split within the data when winners were scored by residuals.

**Limitations/Assumptions:**

<u>Limitations</u>:
1. Null values reduced the amount of usable data from 75,000 to 50,000. 
2. Unorganized entry scheme for values restricted the amount of information that the dataset could give. 
3. There is no clear way to measure the "fair"-ness of a scoring coefficient. The solution to this problem is not simply to get the highest score possible. Others could find fault with the methods used.

<u>Assumptions</u>:
1. All weight values (for lifts and for bodyweight) were entered in kilograms.
2. Everyone in single-ply was using uniform equipment.
3. Every lifter has complied with the rules of USAPL/IPF and does not take performance enhancing drugs.
4. All data is entered correctly.

## Exploratory Data Analysis

Data was collected from the OpenPowerlifting <a href= "https://github.com/sstangl/openpowerlifting/">GitHub</a> repository. The data was organized by federation and then by meet. Each meet folder contained a csv titled "meet.csv", which had data specific to that meet, and a csv called "lifters.csv", which had data about every lifter in the meet. The first task was to join the two csvs so that the meet data appeared in every lifter's row (most importantly, this made sure that every lifter had a date associated with their lifts). The second task was to concatenate all the meets into one big dataframe. That dataframe was then saved as a single csv so that it could be used in analysis and modeling. The code for this can be viewed <a href = "https://git.generalassemb.ly/sophiazwilson/powerlifting-project/blob/master/Part%201-Get%20Data.ipynb">here</a>. (Note that this was originally run on an AWS instance, so the clone of the OpenPowerlifting data is not available.)
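The two tasks could be sketched with pandas roughly as follows. This is not the notebook's code; it assumes that each "meet.csv" holds a single row and that meet folders sit somewhere below a root folder. The function names and the output filename are made up.

```python
from pathlib import Path

import pandas as pd


def load_meet(meet_dir):
    """Return one meet's lifters with the meet-level columns on every row."""
    meet_dir = Path(meet_dir)
    meet = pd.read_csv(meet_dir / "meet.csv")
    lifters = pd.read_csv(meet_dir / "lifters.csv")
    # Assumption: meet.csv has exactly one row, so copy its values
    # (for example the meet date) onto each lifter's row.
    for column in meet.columns:
        lifters[column] = meet[column].iloc[0]
    return lifters


def load_all_meets(root_dir):
    """Stack every meet found below root_dir into one dataframe."""
    meet_files = sorted(Path(root_dir).rglob("meet.csv"))
    frames = [load_meet(path.parent) for path in meet_files]
    return pd.concat(frames, ignore_index=True)


# Hypothetical usage:
# combined = load_all_meets("openpowerlifting/meet-data/usapl")
# combined.to_csv("all_meets.csv", index=False)
```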

Since the features of interest, namely BodyweightKg, Age, Sex, Equipment, and weight lifted (TotalKg), were already known, much of the EDA focused on cleaning and imputation.

The first rows to drop were those of lifters who had no data for any of the lifts or for the total.
- Reasoning: Since the modeling was focused on how much a person lifted, rows with no data about this would give no information. Sometimes someone registers for a meet and doesn't show up. This is most likely what happened in these rows, so they don't need to be in the data set.
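In pandas, this kind of filter can be expressed with `dropna` and `how="all"`, which removes a row only when every listed column is null. The best-lift column names and the sample rows below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Assumed column names for the best lifts; TotalKg is named in the text.
lift_columns = ["BestSquatKg", "BestBenchKg", "BestDeadliftKg", "TotalKg"]

df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "BestSquatKg": [180.0, np.nan, np.nan],
    "BestBenchKg": [110.0, np.nan, 95.0],
    "BestDeadliftKg": [220.0, np.nan, np.nan],
    "TotalKg": [510.0, np.nan, np.nan],
})

# Lifter B has nothing recorded at all (a likely no-show) and is dropped.
# Lifter C has at least one lift recorded and is kept.
cleaned = df.dropna(subset=lift_columns, how="all")
```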

The next nulls to deal with were those in the BodyweightKg column. Fortunately, there is a column called WeightClassKg, which records the weight class the lifter competed in. In most cases these weight classes are an upper limit, meaning the lifter did not weigh even a kilogram more than the class limit. (Rules state that someone who weighs 64 kg does not qualify for the 63 kg weight class and must compete in the 72 kg weight class.)
So if BodyweightKg is null but WeightClassKg is not, the weight class is a close upper-bound estimate of what the person weighs. This added about 5,000 values to the BodyweightKg column.

Rows where both BodyweightKg and WeightClassKg were null were dropped.
- Reasoning: Bodyweight is the basis for all scoring metrics when it comes to weight lifting events. Bodyweight has been proven again and again to have an effect on how much a person can lift. Without Bodyweight, or any way of estimating that information, there is nothing to be done. 

Rows where BodyweightKg was null and WeightClassKg had a "+" were also dropped.
- Reasoning: Values in the WeightClassKg column with a "+" in them mean "everyone above this weight competes in this class." Using the weight class as the BodyweightKg in these cases is not as close a guess as it is with the other weight classes.
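The bodyweight rules above can be sketched together in pandas. The sample rows are made up, and the sketch assumes WeightClassKg is stored as text (for example "63" or "120+").

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "BodyweightKg": [61.8, np.nan, np.nan, np.nan],
    "WeightClassKg": ["63", "72", "120+", np.nan],
})

# Open-ended classes such as "120+" have no upper limit to borrow.
is_open_ended = df["WeightClassKg"].str.contains("+", regex=False, na=False)

# Turn the remaining class labels into numbers; anything else becomes NaN.
class_limit = pd.to_numeric(df["WeightClassKg"].where(~is_open_ended),
                            errors="coerce")

# Fill a missing bodyweight with the class limit, then drop rows that
# still have no bodyweight (no weight class at all, or a "+" class).
df["BodyweightKg"] = df["BodyweightKg"].fillna(class_limit)
df = df.dropna(subset=["BodyweightKg"]).reset_index(drop=True)
```

In this example the first row keeps its recorded bodyweight, the second is filled with 72.0, and the last two rows are dropped.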

The next nulls to attempt to fill in were those in the Age column. The dataframe originally had only 3,762 values for age, but many more could be imputed. The first way to impute them was to extract the year from the date of the meet and subtract the value in the BirthYear column from it. This brought the Age column up to 28,090 values.
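A small sketch of this imputation, assuming the meet date lives in a column called `Date` (an assumed name) and using made-up rows. Since only the year is used, the result can be off by one for lifters who had not yet had their birthday on the meet date.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": ["2017-03-04", "2016-11-12", "2017-03-04"],
    "BirthYear": [1990, np.nan, 1975],
    "Age": [np.nan, 31.0, np.nan],
})

# Year of the meet minus year of birth gives an approximate age.
meet_year = pd.to_datetime(df["Date"]).dt.year
approximate_age = meet_year - df["BirthYear"]

# Only fill ages that are missing; recorded ages are left untouched.
df["Age"] = df["Age"].fillna(approximate_age)
```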

The next step was to try to get age information from the divisions column, since most divisions (other than open) are restricted to a range of ages (a range that's never wider than 10 years). This was done by taking the average age in each division and storing the averages in a dictionary keyed by division, which was then used to fill in missing ages. Some keys were still null, so for the null keys whose division had an age range, values were entered into the dictionary by hand.
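The dictionary approach could look something like this in pandas. The division names, the ages, and the hand-entered value are hypothetical; the sketch leaves the open division unfilled because it has no age range to draw on.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Division": ["Master 1", "Master 1", "Master 1", "Junior", "Junior", "Open"],
    "Age": [42.0, 46.0, np.nan, np.nan, np.nan, np.nan],
})

# Average known age per division; divisions with no known ages come out as NaN.
division_age = df.groupby("Division")["Age"].mean().to_dict()

# Hand-entered value for a division whose average was null but which has a
# defined age range (hypothetical number used purely for illustration).
division_age["Junior"] = 21.5

# Fill missing ages from the dictionary; "Open" stays null.
df["Age"] = df["Age"].fillna(df["Division"].map(division_age))
```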

The next step was to look at the "Best" Lifts columns and the TotalKg columns. 