---
Title: "Data Cleaning and Imputation by Austin Hayes"

Author:
  - Name: Austin Hayes

  -  Email: ahayes65@charlotte.edu

  -  Affiliation: University of North Carolina at Charlotte

Date: June 26, 2025

Description: Using NFL Scouting Combine Event scores from 2004 - 2023, we will learn about data cleaning and imputation in Python.

Categories:
  - Interpreting findings
  - Ethics
  - Importing and Reading data
  - Data Cleaning
  - Data Science
  - Pandas
  - Data Imputation


### Data

This dataset comes from the SCORE Network Data Repository. Its authors are Shane Hauk, Michael Schuckers, and Robin Lock.

Visit the original data page here: https://data.scorenetwork.org/football/nfl-draft-combine.html

The data set contains 6128 rows and 8 columns. Each row represents a player at the NFL Scouting Combine between 2004 and 2023.

Download data:

The file is available in the [Intro to Data Cleaning and Imputation by Austin Hayes](https://github.com/schuckers/Charlotte_SCORE_Summer25/tree/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes) folder: [nfl_combine.csv](https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes/nfl_combine.csv)

---

### Variables and their Descriptions:


<details>
<summary><b>Variable Descriptions</b></summary>

| Variable | Description | 
|----|-------------|
| position | Playing position of the player |
| Round | Round the player was drafted in |
| forty | 40-yard dash time (seconds) |
| vertical | Vertical jump height (inches) |
| bench_reps | Bench press repetitions at 225 pounds |
| broad_jump | Broad jump distance (inches) |
| three_cone | Three-cone drill time (seconds) |
| shuttle | 20-yard shuttle time (seconds) |

</details>

---

## Learning Goals

- Learn the three C's of data cleaning
- Understand the basic principles of data cleaning
- Understand data imputation
- Use pandas for data cleaning
- Consider the ethics of data cleaning

---

# Data Cleaning and the "Three C's"

Data cleaning is an essential part of data science and sports analytics. Raw datasets are rarely in a correct or readable format, so you usually need to clean them before you can analyze them. This brings us to the "three C's", which give basic guidance on what to look for during the data cleaning process.

The three C's stand for:
- Consistent
- Complete
- Correct


_Consistent_: The dataset is formatted in a standardized way. Every column has the appropriate data type, and text values have no inconsistent casing.

_Complete_: No missing data is left unaddressed. Each missing value has been filled with a reasonable substitute (imputation, which we will discuss later) or removed, depending on the context (we will discuss the ethics of this later).

_Correct_: Incorrect entries have been fixed, such as misspelled items, implausible outliers, and illogical values like negative ages.
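
As a rough illustration of the three checks, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are hypothetical and are not taken from the combine data; each column contains one deliberate problem.

```python
import pandas as pd

# Hypothetical toy data (not from the combine file) with deliberate problems
toy = pd.DataFrame({
    "position": ["QB", "qb ", "WR", "RB"],   # inconsistent casing and whitespace
    "forty": [4.79, 4.60, None, -4.50],      # one missing value, one impossible time
})

# Consistent: count distinct labels before and after standardizing the text
raw_labels = toy["position"].nunique()
clean_labels = toy["position"].str.strip().str.upper().nunique()

# Complete: count missing values in each column
missing_per_column = toy.isna().sum()

# Correct: flag values that cannot be real (a sprint time must be positive)
impossible_rows = toy[toy["forty"] <= 0]

print(raw_labels, clean_labels)
print(missing_per_column)
print(impossible_rows)
```

If the number of distinct labels drops after standardizing, the column had a consistency problem. The other two checks point to the rows that need a decision about completeness or correctness.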


Let's break these down by diving into the football combine data!

---

# Getting Started

Before we get started, let's import pandas and NumPy so we can read in our data.

NOTE: If one of the import statements does not work, you may need to install the library. Visit the links below for installation instructions.

Pandas: https://pandas.pydata.org/docs/getting_started/install.html

NumPy: https://numpy.org/install/

Now, let's import our libraries:

In [2]:
# Import necessary libraries

# Import NumPy for numerical operations
import numpy as np

# Import pandas for data manipulation and analysis
import pandas as pd

In [3]:
# Read in the NFL Combine data and store it in a DataFrame
combine_data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes/nfl_combine.csv')
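
After loading a file, a quick shape check confirms that pandas read what you expected. The sketch below works offline: it assumes a small in-memory CSV (`sample_csv`, a hypothetical stand-in built from the first five rows displayed in this module) instead of the real download.

```python
import io
import pandas as pd

# Stand-in for the real file: the first five rows shown in this module, as CSV text
sample_csv = io.StringIO(
    "position,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle\n"
    "QB,,4.79,30.5,,110.0,7.66,4.41\n"
    "RB,,4.5,34.0,21.0,121.0,7.09,4.3\n"
    "QB,,4.6,,,,,\n"
    "QB,,4.95,30.0,,119.0,7.44,4.34\n"
    "WR,,4.78,38.0,,118.0,,4.45\n"
)

# read_csv accepts a file-like object as well as a path or URL
sample = pd.read_csv(sample_csv)

# shape is a (rows, columns) tuple; empty CSV fields are read as NaN
n_rows, n_cols = sample.shape
print(n_rows, n_cols)
print(sample.columns.tolist())
```

Running `combine_data.shape` on the full dataset should match the 6128 rows and 8 columns described in the Data section.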

---

# Consistency and Correctness

Let's start by examining the first 5 rows of data to get a better look.

In [4]:
# Reveal first 5 rows of the dataset
combine_data.head()

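|   | position | Round | forty | vertical | bench_reps | broad_jump | three_cone | shuttle |
|---|----------|-------|-------|----------|------------|------------|------------|---------|
| 0 | QB | NaN | 4.79 | 30.5 | NaN | 110.0 | 7.66 | 4.41 |
| 1 | RB | NaN | 4.50 | 34.0 | 21.0 | 121.0 | 7.09 | 4.30 |
| 2 | QB | NaN | 4.60 | NaN | NaN | NaN | NaN | NaN |
| 3 | QB | NaN | 4.95 | 30.0 | NaN | 119.0 | 7.44 | 4.34 |
| 4 | WR | NaN | 4.78 | 38.0 | NaN | 118.0 | NaN | 4.45 |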


Interesting! Even in just the first few rows, there are a lot of missing values.

Before we decide what to do with these missing values, let's check consistency. We will look at the data type of each column and make sure it matches the values the column holds.

In [8]:
# Check the data types of each column to ensure they are consistent with the expected data types
combine_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6128 entries, 0 to 6127
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   position    6128 non-null   object 
 1   Round       4023 non-null   float64
 2   forty       5751 non-null   float64
 3   vertical    4965 non-null   float64
 4   bench_reps  4308 non-null   float64
 5   broad_jump  4891 non-null   float64
 6   three_cone  3985 non-null   float64
 7   shuttle     4093 non-null   float64
dtypes: float64(7), object(1)
memory usage: 383.1+ KB


Nice! It looks like we don't need to change any of the data types for this dataset.
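
For comparison, here is a minimal sketch of what a fix could look like if a numeric column had been read in as text. The `messy` DataFrame and its "DNP" entry are hypothetical and do not come from the combine data.

```python
import pandas as pd

# Hypothetical example: a timing column stored as text because of a stray entry
messy = pd.DataFrame({"forty": ["4.79", "4.5", "DNP", "4.95"]})
was_numeric = pd.api.types.is_numeric_dtype(messy["forty"])

# errors="coerce" converts anything that is not a number into NaN
messy["forty"] = pd.to_numeric(messy["forty"], errors="coerce")
is_numeric = pd.api.types.is_numeric_dtype(messy["forty"])

print(was_numeric, is_numeric)
print(messy["forty"].isna().sum())
```

Note that coercion silently turns the text entry into a missing value, so it is worth counting the new NaN values and recording why they appeared.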

Let's go ahead and check the _correctness_ and consistency of the dataset by inspecting some of the rows. To do this, let's take a look at the first 50 rows.

After running the code block below, examine the rows to make sure the values are correctly formatted and there are no absurd outliers. We also need to check that there are no misspelled positions or capitalization issues.

In [None]:
# Let's check the first 50 rows of the dataset to ensure correctness and consistency.
# This will help us identify any potential issues with the data formatting or inconsistencies.
combine_data.head(50)

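|   | position | Round | forty | vertical | bench_reps | broad_jump | three_cone | shuttle |
|---|----------|-------|-------|----------|------------|------------|------------|---------|
| 0 | QB | NaN | 4.79 | 30.5 | NaN | 110.0 | 7.66 | 4.41 |
| 1 | RB | NaN | 4.50 | 34.0 | 21.0 | 121.0 | 7.09 | 4.30 |
| 2 | QB | NaN | 4.60 | NaN | NaN | NaN | NaN | NaN |
| 3 | QB | NaN | 4.95 | 30.0 | NaN | 119.0 | 7.44 | 4.34 |
| 4 | WR | NaN | 4.78 | 38.0 | NaN | 118.0 | NaN | 4.45 |
| 5 | OT | NaN | 5.41 | 28.0 | 20.0 | 104.0 | 8.24 | 4.61 |
| 6 | WR | NaN | 4.76 | 31.5 | NaN | 116.0 | 7.41 | NaN |
| 7 | WR | NaN | 4.73 | 38.0 | NaN | 128.0 | 7.10 | 4.09 |
| 8 | WR | NaN | 4.59 | 28.5 | NaN | 111.0 | 7.24 | 4.27 |
| 9 | RB | NaN | 4.71 | 31.5 | 16.0 | 108.0 | NaN | NaN |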


Looks quite good! There are no capitalization issues in the position column and all of the numeric values are in float (decimal) format.
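
Scanning rows by eye only goes so far. A complementary check is to list the distinct position labels and count how many would change after standardizing. The sketch below uses only the ten position labels displayed above, stored in a hypothetical `positions` Series.

```python
import pandas as pd

# Position labels from the ten rows displayed above
positions = pd.Series(
    ["QB", "RB", "QB", "QB", "WR", "OT", "WR", "WR", "WR", "RB"],
    name="position",
)

# Sorted unique labels make misspellings and casing variants easy to spot
labels = sorted(positions.unique())

# Labels that would change if whitespace were stripped and text upper-cased
standardized = positions.str.strip().str.upper()
needs_fixing = positions[positions != standardized]

print(labels)
print(positions.value_counts())
print(len(needs_fixing))
```

On the full dataset, the same idea applies to `combine_data['position']`. An unexpected label, or a label with a very small count, is a good candidate for a closer look.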

This is a large dataset with over 6,000 rows, so checking every row by eye is impractical. It is better to examine it using .describe() from the pandas library. This shows summary statistics for each numeric column, which will tell us if there are any alarming outliers.

In [9]:
# Check the statistical summary of the dataset to identify any outliers or anomalies
combine_data.describe()

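|       | Round | forty | vertical | bench_reps | broad_jump | three_cone | shuttle |
|-------|-------|-------|----------|------------|------------|------------|---------|
| count | 4023.0 | 5751.0 | 4965.0 | 4308.0 | 4891.0 | 3985.0 | 4093.0 |
| mean | 3.832712 | 4.764519 | 32.838933 | 20.721681 | 114.98487 | 7.265425 | 4.402817 |
| std | 1.931859 | 0.305037 | 4.259617 | 6.394058 | 9.325569 | 0.40629 | 0.26369 |
| min | 1.0 | 4.22 | 17.5 | 2.0 | 74.0 | 6.28 | 3.75 |
| 25% | 2.0 | 4.53 | 30.0 | 16.0 | 109.0 | 6.97 | 4.21 |
| 50% | 4.0 | 4.68 | 33.0 | 21.0 | 116.0 | 7.17 | 4.36 |
| 75% | 5.0 | 4.96 | 36.0 | 25.0 | 121.0 | 7.51 | 4.56 |
| max | 7.0 | 6.05 | 46.5 | 49.0 | 147.0 | 9.04 | 5.56 |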


This looks very good; none of the minimum or maximum values seem unrealistic. The only number that stands out to me is the bench press maximum of 49 reps. This means that somebody pressed 225 pounds 49 times; that's insane!

FUN FACT: Stephen Paea is the player who had those 49 bench reps at the combine. He was a DT out of Oregon State and was drafted 53rd overall by the Chicago Bears in 2011.
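
The same outlier check can be written as code. This is a minimal sketch, assuming plausibility limits that were picked only for illustration (the `limits` dictionary is hypothetical, not an official range). The `sample` DataFrame holds the `forty` and `bench_reps` values from the ten rows displayed earlier.

```python
import numpy as np
import pandas as pd

# forty and bench_reps values from the ten rows displayed earlier
sample = pd.DataFrame({
    "forty": [4.79, 4.5, 4.6, 4.95, 4.78, 5.41, 4.76, 4.73, 4.59, 4.71],
    "bench_reps": [np.nan, 21, np.nan, np.nan, np.nan, 20, np.nan, np.nan, np.nan, 16],
})

# Assumed plausibility limits, chosen for illustration only
limits = {"forty": (4.0, 6.5), "bench_reps": (0, 60)}

suspicious = {}
for column, (low, high) in limits.items():
    values = sample[column]
    # Missing values are not outliers, so only recorded values are tested
    out_of_range = values.notna() & ~values.between(low, high)
    suspicious[column] = int(out_of_range.sum())

print(suspicious)

# idxmax skips NaN and returns the row label of the largest value
top_bench_row = sample["bench_reps"].idxmax()
print(sample.loc[top_bench_row])
```

Looking up the row behind an extreme value, as `idxmax` does here, is how a surprising maximum such as 49 bench reps can be traced back to a specific player and confirmed as real.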

Finally, we need to examine the completeness of the dataset. Before we address this, we need to talk about the ethical decision-making involved with missing data.

---


# Ethics of Data Completeness



We need to consider WHY the data is missing in the first place. Was the value meant to be empty? Did someone forget to record certain data points? Sometimes it can be extremely hard to tell, so it is crucial to document the assumptions you're making. Transparency is key for your interpretations: readers need to understand your thinking process, because it may affect how they use the dataset. Context is everything with data completeness. With that being said, let's start tackling this dataset.



# Completeness

Building on the previous section, there are several ways to approach this issue, but let's start by asking: Why is the data missing? If it was left out intentionally, why?

In this case, any missing data is likely due to non-participation. In recent years, more and more athletes have started to skip certain drills or the combine altogether. The explanation also depends on which variable we are referring to. For example, most of the missing numeric values are more than likely related to non-participation. However, a missing `Round` value most likely means that the player went undrafted. Missing `position` values would be a little harder to interpret. A player may have had no official position coming out of the combine, or may have played multiple positions in college. A perfect example is Travis Hunter in the 2025 draft, who played both cornerback and wide receiver in his college career. The `.info()` output above reports 6128 non-null `position` values, so this column appears to be complete, but it is still worth confirming how many rows have missing positions.
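
To see how much is missing in each column, one option is to count the missing values and express them as a percentage. The sketch below is illustrative only and uses the first five rows displayed earlier; the `sample` and `summary` names are hypothetical.

```python
import numpy as np
import pandas as pd

# First five rows as displayed earlier in this module
sample = pd.DataFrame({
    "position": ["QB", "RB", "QB", "QB", "WR"],
    "Round": [np.nan] * 5,
    "forty": [4.79, 4.5, 4.6, 4.95, 4.78],
    "vertical": [30.5, 34.0, np.nan, 30.0, 38.0],
    "bench_reps": [np.nan, 21.0, np.nan, np.nan, np.nan],
    "broad_jump": [110.0, 121.0, np.nan, 119.0, 118.0],
    "three_cone": [7.66, 7.09, np.nan, 7.44, np.nan],
    "shuttle": [4.41, 4.3, np.nan, 4.34, 4.45],
})

# isna() marks missing cells; sum() counts them and mean() gives the fraction
missing_count = sample.isna().sum()
missing_share = sample.isna().mean().mul(100).round(1)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary.sort_values("missing", ascending=False))
```

A table like this makes it easier to document assumptions column by column, since different columns can be missing for different reasons.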

Before getting to that, let's keep it simple and create a separate DataFrame for the missing values.