---
Title: "Data Cleaning and Imputation by Austin Hayes"

Author:
  - Name: Austin Hayes

  -  Email: ahayes65@charlotte.edu

  -  Affiliation: University of North Carolina at Charlotte

Date: June 26, 2025

Description: Using NFL Scouting Combine Event scores from 2004 - 2023, we will learn about data cleaning and imputation in Python.

Categories:
  - Interpreting findings
  - Ethics
  - Importing and Reading data
  - Data Cleaning
  - Data Science
  - Pandas
  - Data Imputation


### Data

This dataset is from the SCORE Network Data Repository. The authors are Shane Hauk, Michael Schuckers, and Robin Lock.

Visit the original data page here: https://data.scorenetwork.org/football/nfl-draft-combine.html

The data set contains 6128 rows and 8 columns. Each row represents a player at the NFL Scouting Combine between 2004 and 2023.

Download data: 

Available in the [Intro to Data Cleaning and Imputation by Austin Hayes](https://github.com/schuckers/Charlotte_SCORE_Summer25/tree/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes) folder: [nfl_combine.csv](https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes/nfl_combine.csv)

---

### Variables and their Descriptions:


<details>
<summary><b>Variable Descriptions</b></summary>

| Variable | Description |
|----|-------------|
| position | Playing position of the player |
| Round | Round in which the player was drafted (missing if undrafted) |
| forty | 40-yard dash time (seconds) |
| vertical | Vertical jump height (inches) |
| bench_reps | Number of bench press reps at 225 pounds |
| broad_jump | Broad jump distance (inches) |
| three_cone | Three-cone drill time (seconds) |
| shuttle | 20-yard shuttle time (seconds) |

</details>

In [1]:
#Import numpy library for numerical operations
import numpy as np

#Import pandas library for data manipulation and analysis
import pandas as pd

# Import linear regression model
from sklearn.linear_model import LinearRegression

### 1. Why would someone want to change the datatype of a column during the data cleaning process?

_ANSWER_: A column's datatype may need to change for two main reasons. First, the column may have been read in with the wrong type. A shoe size column, for example, should be read as a category and not as a numerical quantity, so its datatype should be an object, NOT a float. Second, a different datatype may be needed to fill missing values in a specific way. Both reasons applied to the 'Round' column in this module: the round in which a player is drafted is a category, NOT a quantity or measure, and we needed to fill its NA values with 'Undrafted'.
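As a minimal sketch of that second reason, the example below converts a float column to the object datatype and then fills its missing values with a text label. The `toy` DataFrame and its values are hypothetical and only stand in for the real 'Round' column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: draft rounds stored as floats, NaN for undrafted players
toy = pd.DataFrame({"Round": [1.0, 3.0, np.nan, 7.0, np.nan]})

# Treat the round as a category by storing it as an object column
toy["Round"] = toy["Round"].astype(object)

# Missing entries can now be filled with a text label
toy["Round"] = toy["Round"].fillna("Undrafted")

print(toy["Round"].tolist())
print(toy["Round"].dtype)
```

After the conversion, the column mixes round numbers and the 'Undrafted' label, which only makes sense because the values are treated as categories rather than measurements.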

### 2. Re-read the ORIGINAL dataset and split it into complete and missing datasets like we did earlier. Find the median value of the 'bench_reps' column in the missing dataset. Then use that median value to fill in ALL of the NA values within that column in the missing dataset. Check the number of NA values to make sure they were filled in correctly.

HINT: use .median()

In [15]:
# Re-Read original dataset
combine_data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes/nfl_combine.csv')

# Split the dataset into complete and missing datasets
# Complete dataset contains rows with no missing values
complete_data = combine_data.dropna()

# Missing dataset contains rows with at least one missing value
missing_data = combine_data[combine_data.isna().any(axis=1)]
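The same split can be checked on a small example. The sketch below uses a hypothetical `toy` DataFrame (not the combine data) to confirm that the complete and missing pieces together cover every row exactly once. Taking an explicit `.copy()` of the missing piece is an optional extra step that makes it safe to modify later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the combine dataset
toy = pd.DataFrame({
    "forty": [4.5, 4.8, np.nan, 4.6],
    "shuttle": [4.3, np.nan, 4.4, 4.2],
})

# Rows with no missing values
toy_complete = toy.dropna()

# Rows with at least one missing value; .copy() gives an independent DataFrame
toy_missing = toy[toy.isna().any(axis=1)].copy()

# Together the two pieces should account for every row exactly once
print(len(toy_complete), len(toy_missing), len(toy))
```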

In [16]:
missing_data.head()

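| | position | Round | forty | vertical | bench_reps | broad_jump | three_cone | shuttle |
|---|---|---|---|---|---|---|---|---|
| 0 | QB | NaN | 4.79 | 30.5 | NaN | 110.0 | 7.66 | 4.41 |
| 1 | RB | NaN | 4.5 | 34.0 | 21.0 | 121.0 | 7.09 | 4.3 |
| 2 | QB | NaN | 4.6 | NaN | NaN | NaN | NaN | NaN |
| 3 | QB | NaN | 4.95 | 30.0 | NaN | 119.0 | 7.44 | 4.34 |
| 4 | WR | NaN | 4.78 | 38.0 | NaN | 118.0 | NaN | 4.45 |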


In [17]:
# Store the original count of NAs in the 'bench_reps' column
original_na_count = missing_data['bench_reps'].isna().sum()

# Print that count
print(original_na_count)

1820


In [18]:
# Work on an explicit copy so the assignment below does not trigger a SettingWithCopyWarning
missing_data = missing_data.copy()

# Fill in all NA values in the 'bench_reps' column with the median value
missing_data['bench_reps'] = missing_data['bench_reps'].fillna(missing_data['bench_reps'].median())


In [19]:
# Store the new count of NAs in the 'bench_reps' column
new_na_count = missing_data['bench_reps'].isna().sum()

# Print that count
print(new_na_count)

0


In [20]:
missing_data.head()

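| | position | Round | forty | vertical | bench_reps | broad_jump | three_cone | shuttle |
|---|---|---|---|---|---|---|---|---|
| 0 | QB | NaN | 4.79 | 30.5 | 20.0 | 110.0 | 7.66 | 4.41 |
| 1 | RB | NaN | 4.5 | 34.0 | 21.0 | 121.0 | 7.09 | 4.3 |
| 2 | QB | NaN | 4.6 | NaN | 20.0 | NaN | NaN | NaN |
| 3 | QB | NaN | 4.95 | 30.0 | 20.0 | 119.0 | 7.44 | 4.34 |
| 4 | WR | NaN | 4.78 | 38.0 | 20.0 | 118.0 | NaN | 4.45 |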


### 3. What does data completeness mean? List some appropriate ways you could achieve data completeness.

Data completeness means that there are NO missing data points in the dataset: every NA value has been addressed appropriately. Some ways to achieve completeness are imputation, dropping all rows with missing values (when there are only a few and the context allows it), and changing a column's datatype and filling in values (in special cases).
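To make the trade-off concrete, here is a small sketch comparing two of those options on a hypothetical `toy` DataFrame (the values are made up for illustration). Both approaches produce a complete dataset, but they keep very different numbers of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns
toy = pd.DataFrame({
    "vertical": [30.5, 34.0, np.nan, 38.0],
    "bench_reps": [np.nan, 21.0, 18.0, np.nan],
})

# Count missing values per column before cleaning
print(toy.isna().sum())

# Option 1: drop every row that has a missing value
dropped = toy.dropna()

# Option 2: impute each column with its own median
imputed = toy.fillna(toy.median())

# Both results are complete, but only imputation keeps all four rows
print(dropped.isna().sum().sum(), len(dropped))
print(imputed.isna().sum().sum(), len(imputed))
```

Here dropping rows leaves a single player, which is why dropping is usually reserved for cases where only a small share of rows is affected.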

### 4. Which one of the three C's was not implemented in the following dataset? What could be done to fix it? 

| Position | Bench_Reps |
|----------|------------|
| TIGHT END |    35     |
| Quarterback |    25   |
| Tight end |    32     |
| Quarterback |   23    |

Data consistency was not implemented in this case. The 'Position' entries are written with only the first letter capitalized, except for the first tight end entry, which is in all caps. 'TIGHT END' should be changed to 'Tight end' to match the other tight end entry.
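One possible way to apply that fix in pandas is sketched below, using the four rows from the table in this question. Using `str.capitalize()` is just one option we assume here; any rule works as long as it is applied to every entry.

```python
import pandas as pd

# The four rows from the table in this question
positions = pd.DataFrame({
    "Position": ["TIGHT END", "Quarterback", "Tight end", "Quarterback"],
    "Bench_Reps": [35, 25, 32, 23],
})

# str.capitalize() upper-cases the first character and lower-cases the rest
positions["Position"] = positions["Position"].str.strip().str.capitalize()

# Only two distinct spellings should remain
print(positions["Position"].unique())
```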

### 5. Using the missing and complete datasets you created in question 2, create a linear regression model to predict shuttle times for players who, we presume, did NOT participate in the shuttle drill during the combine. Use all of the other drill results to predict the shuttle time. Then impute those estimated values back into the filtered missing dataset used for the predictions. You don't need to combine it back with the original missing dataset and complete dataset. Remember, the model cannot take any rows with missing values outside of the target variable.

HINT: Train the model with the complete dataset.

In [21]:
# Create the linear regression model object
shuttle_linear_model = LinearRegression()

# Train the model with the complete dataset
# Predictor variables
x = complete_data[['forty', 'vertical', 'bench_reps', 'broad_jump', 'three_cone']]

# Target variable
y = complete_data['shuttle']

# Train the model
shuttle_linear_model.fit(x, y)

| Parameter | Value |
|----|-------------|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


In [None]:
# Filter the missing dataset to rows where only the target variable ('shuttle') is missing, so the model can predict it
# .copy() makes an independent DataFrame so the later assignment does not trigger a SettingWithCopyWarning
other_columns = ['position', 'Round', 'vertical', 'bench_reps', 'broad_jump', 'three_cone', 'forty']
missing_shuttle = missing_data[missing_data['shuttle'].isna() & missing_data[other_columns].notna().all(axis=1)].copy()

In [23]:
# Keep track of the original amount of missing values in the 'shuttle' column
original_na_count_shuttle = missing_shuttle['shuttle'].isna().sum()

# Print it out
print(original_na_count_shuttle)

53


In [24]:
# Find the predicted values for the missing 'shuttle' times
predicted_shuttles = shuttle_linear_model.predict(X=missing_shuttle[['forty', 'vertical', 'bench_reps', 'broad_jump', 'three_cone']])

print(predicted_shuttles)

[4.36371139 4.24244765 4.42425888 4.3116177  4.29202458 4.197173
 4.31227038 4.35342086 4.26412493 4.30976974 4.31280139 4.26332657
 4.23184012 4.28134427 4.25817686 4.21911288 4.26067976 4.72066396
 4.36788421 4.32732297 4.18259469 4.25742081 4.31637178 4.29113604
 4.34338106 4.44786136 4.48824024 4.5417698  4.428744   4.15561068
 4.33461252 4.18843949 4.37017418 4.5690632  4.08047258 4.11382861
 4.35780069 4.3293518  4.33816163 4.18082136 4.7347528  4.28003361
 4.53665499 4.3037725  4.13827586 4.17533022 4.41727606 4.23253345
 4.31291876 4.66141209 4.46148789 4.78114697 4.27786502]


NOTE: Your numbers may differ slightly.

In [26]:
# Impute those values back into the missing dataset
missing_shuttle.loc[:,'shuttle'] = predicted_shuttles

In [None]:
# Check that values have been imputed
missing_shuttle.sample(53)

| | position | Round | forty | vertical | bench_reps | broad_jump | three_cone | shuttle |
|---|---|---|---|---|---|---|---|---|
| 3791 | DT | 1.0 | 5.07 | 30.0 | 29.0 | 111.0 | 7.46 | 4.54177 |
| 5280 | CB | 3.0 | 4.52 | 34.0 | 11.0 | 120.0 | 6.81 | 4.17533 |
| 1506 | WR | 6.0 | 4.37 | 33.0 | 17.0 | 125.0 | 7.08 | 4.263327 |
| 4238 | ILB | 2.0 | 5.05 | 29.0 | 20.0 | 111.0 | 6.97 | 4.357801 |
| 1379 | WR | 4.0 | 4.51 | 36.0 | 20.0 | 123.0 | 7.09 | 4.264125 |
| 3213 | WR | 3.0 | 4.49 | 36.0 | 20.0 | 123.0 | 7.08 | 4.257421 |
| 3953 | CB | 5.0 | 4.32 | 33.5 | 20.0 | 125.0 | 6.83 | 4.155611 |
| 741 | RB | 4.0 | 4.48 | 34.0 | 20.0 | 119.0 | 6.88 | 4.197173 |
| 1892 | RB | 5.0 | 4.53 | 36.5 | 20.0 | 118.0 | 7.13 | 4.281344 |
| 4763 | CB | 1.0 | 4.37 | 35.5 | 20.0 | 126.0 | 6.92 | 4.180821 |


In [29]:
# Let's take a look and double check that there are no more missing values in the 'shuttle' column
print(missing_shuttle['shuttle'].isna().sum())

0


Nice!
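If you did want to write the predictions straight back into a single DataFrame instead of a filtered copy, a boolean mask with `.loc` is one way to do it. The sketch below is not part of the module's solution. It uses hypothetical synthetic data in which shuttle is an exact linear function of forty (an assumption made only so the result is easy to verify).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: shuttle is an exact linear function of forty
rng = np.random.default_rng(0)
forty = rng.uniform(4.3, 5.2, size=50)
toy = pd.DataFrame({"forty": forty, "shuttle": 0.5 * forty + 2.0})

# Hide the shuttle time for the first five players (labels 0 through 4)
toy.loc[:4, "shuttle"] = np.nan

# Train only on rows where the target is known
known = toy[toy["shuttle"].notna()]
model = LinearRegression().fit(known[["forty"]], known["shuttle"])

# Predict only for rows where the target is missing, then write back with .loc
mask = toy["shuttle"].isna()
toy.loc[mask, "shuttle"] = model.predict(toy.loc[mask, ["forty"]])

# Confirm that no missing shuttle times remain
print(toy["shuttle"].isna().sum())
```

Because the synthetic relationship has no noise, the imputed values match the hidden ones. With real combine data the predictions are only estimates.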