---
Title: "Data Cleaning and Imputation by Austin Hayes"

Author:
  - Name: Austin Hayes
  - Email: ahayes65@charlotte.edu
  - Affiliation: University of North Carolina at Charlotte

Date: June 26, 2025

Description: Using NFL Scouting Combine event scores from 2004 to 2023, we will learn about data cleaning and imputation in Python.

Categories:
  - Interpreting findings
  - Ethics
  - Importing and Reading data
  - Data Cleaning
  - Data Science
  - Pandas
  - Data Imputation


### Data

This dataset is from the SCORE Network Data Repository. Its authors are Shane Hauk, Michael Schuckers, and Robin Lock.

Visit the original data page here: https://data.scorenetwork.org/football/nfl-draft-combine.html

The data set contains 6128 rows and 8 columns. Each row represents a player at the NFL Scouting Combine between 2004 and 2023.

Download data: 

Available in the [Intro to Data Cleaning and Imputation by Austin Hayes](https://github.com/schuckers/Charlotte_SCORE_Summer25/tree/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes) GitHub folder: [nfl_combine.csv](https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes/nfl_combine.csv)

---

### Variables and their Descriptions:


<details>
<summary><b>Variable Descriptions</b></summary>

| Variable | Description | 
|----|-------------|
| position | Playing position of the player |
| Round | Round player was drafted in |
| forty | 40-yard dash time (seconds) |
| vertical | Vertical jump height (inches) |
| bench_reps | Number of bench press repetitions at 225 pounds |
| broad_jump | Broad jump distance (inches) |
| shuttle | 20-yard shuttle time (seconds) |

</details>

---

## Learning Goals

- Learn the "three C's" of data cleaning
- Understand the basic principles of data cleaning
- Understand data imputation
- Use pandas for data cleaning
- Consider the ethics of data cleaning

---

# Data Cleaning and the "Three C's"

Data cleaning is an essential part of data science and sports analytics. Raw datasets are rarely in a correct or readable format, so you usually need to clean them before using them. The "three C's" provide basic guidance on what to look for during the data cleaning process.

The three C's stand for:
- Consistent
- Complete
- Correct


_Consistent_: The dataset is formatted in a standardized way. Every value is stored in its appropriate data type, and there are no inconsistencies in text casing.

_Complete_: There is no missing data. All missing values have been addressed by filling them with a reasonable substitute (imputation, which we will discuss later) and/or by removing the data, depending on the context (we will discuss the ethics of this later).

_Correct_: Invalid data entries have been fixed, such as misspelled items, implausible outliers, and illogical values such as negative ages.
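
As a rough illustration of the three checks, the sketch below runs each one on a tiny made-up table. The rows, the values, and the 3-to-7-second plausibility range are assumptions chosen for illustration; they are not taken from the combine data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made-up values, not from the combine dataset)
toy = pd.DataFrame({
    "position": ["QB", "qb ", "WR", "wr"],
    "forty": [4.8, np.nan, 4.4, -4.5],
})

# Consistent: strip stray whitespace and standardize text casing
toy["position"] = toy["position"].str.strip().str.upper()

# Complete: count the missing values in each column
missing_counts = toy.isna().sum()

# Correct: flag values outside an assumed plausible range (3 to 7 seconds)
implausible = toy[(toy["forty"] < 3) | (toy["forty"] > 7)]

print(toy["position"].unique())
print(missing_counts)
print(implausible)
```

Note that the missing time is not flagged by the range check, because comparisons against a missing value evaluate to false. Completeness and correctness need separate checks.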


Let's break these down by diving into the football combine data! We'll get started by importing the necessary libraries.

In [None]:
# Import necessary libraries

# Import NumPy for numerical operations
import numpy as np

# Import pandas for data manipulation and analysis
import pandas as pd

In [None]:
# Read in the NFL Combine data and store it in a DataFrame
combine_data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes/nfl_combine.csv')
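
Once the data is loaded, a quick first look helps decide what cleaning is needed. The following is a minimal sketch, assuming a hypothetical helper named `quick_summary` (it is not part of pandas) and a small stand-in table with made-up values so that the example runs without a download.

```python
import numpy as np
import pandas as pd

def quick_summary(df):
    """Return each column's dtype and missing-value count (hypothetical helper)."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
    })

# Stand-in table with made-up values; column names follow the variable table
sample = pd.DataFrame({
    "position": ["QB", "WR", "RB"],
    "forty": [4.8, np.nan, 4.5],
    "vertical": [30.5, 36.0, np.nan],
})

print(sample.shape)  # (number of rows, number of columns)
summary = quick_summary(sample)
print(summary)
```

The same call, `quick_summary(combine_data)`, could be applied to the full dataset to see which columns contain missing values.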