---
Title: "Data Cleaning and Imputation by Austin Hayes"

Author:
  - Name: Austin Hayes

  -  Email: ahayes65@charlotte.edu

  -  Affiliation: University of North Carolina at Charlotte

Date: June 26, 2025

Description: Using NFL Scouting Combine Event scores from 2004 - 2023, we will learn about data cleaning and imputation in Python.

Categories:
  - Interpreting findings
  - Ethics
  - Importing and Reading data
  - Data Cleaning
  - Data Science
  - Pandas
  - Data Imputation


### Data

This dataset is from the SCORE Network Data Repository. The authors are Shane Hauk, Michael Schuckers, and Robin Lock.

Visit the original data page here: https://data.scorenetwork.org/football/nfl-draft-combine.html

The data set contains 6128 rows and 8 columns. Each row represents a player at the NFL Scouting Combine between 2004 and 2023.

Download data: 

Available on the [Intro to Data Cleaning and Imputation by Austin Hayes](https://github.com/schuckers/Charlotte_SCORE_Summer25/tree/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes): [nfl_combine.csv](https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes/nfl_combine.csv)

---

### Variables and their Descriptions:


<details>
<summary><b>Variable Descriptions</b></summary>

| Variable | Description |
|----------|-------------|
| position | Playing position of the player |
| Round | Round in which the player was drafted |
| forty | 40-yard dash time (seconds) |
| vertical | Vertical jump height (inches) |
| bench_reps | Bench press repetitions at 225 pounds |
| broad_jump | Broad jump distance (inches) |
| three_cone | Three-cone drill time (seconds) |
| shuttle | 20-yard shuttle time (seconds) |

</details>

---

## Learning Goals

- Learn the three C's of data cleaning
- Understand the basic principles of data cleaning
- Understand data imputation
- Use pandas for data cleaning
- Consider the ethics of data cleaning

---

# Data Cleaning and the "Three C's"

Data cleaning is an essential part of data science and sports analytics. Most datasets need some cleaning before use because they rarely arrive in a correct, readable format. This brings us to the "three C's", a basic checklist of what to look for during the data cleaning process.

The three C's stand for:
- Consistent
- Complete
- Correct


_Consistent_: The dataset is formatted in a standardized way. Every column has the appropriate data type, and there are no inconsistencies in text casing.

_Complete_: There is no missing data. All missing values have been addressed, either by filling them with a reasonable substitute (imputation, which we will discuss later) or by removing the affected data, depending on the context (we will discuss the ethics of this later).

_Correct_: Incorrect data entries have been fixed, such as misspelled items, implausible outliers, and illogical values such as negative ages.
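As a rough illustration of the three C's, the sketch below runs one quick check for each on a tiny made-up table. The rows and the `toy` DataFrame are invented for this example and are not taken from the combine data.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the combine data)
toy = pd.DataFrame({
    "position": ["QB", "qb", "WR", "RB"],
    "forty": [4.79, 4.60, None, -4.50],
})

# Consistent: how many spellings does each position have once casing is ignored?
spellings_per_position = toy.groupby(toy["position"].str.upper())["position"].nunique()
inconsistent_positions = spellings_per_position[spellings_per_position > 1].index.tolist()

# Complete: count the missing values in each column
missing_counts = toy.isna().sum()

# Correct: a sprint time can never be negative
impossible_rows = toy[toy["forty"] < 0]

print("Positions with mixed casing:", inconsistent_positions)
print("Missing forty values:", missing_counts["forty"])
print("Rows with impossible times:", len(impossible_rows))
```

Each check only reports a problem; deciding how to fix it still depends on the context of the data.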


Let's break these down by diving into the football combine data!

---

# Getting Started

Before we get started let's import pandas so we can read in our data.

NOTE: If one of the import statements does not work, you may need to install the library. Visit the links below for installation instructions.

Pandas: https://pandas.pydata.org/docs/getting_started/install.html 

Numpy: https://numpy.org/install/

Now, let's import our libraries:

In [1]:
# Import necessary libraries

#Import numpy library for numerical operations
import numpy as np

#Import pandas library for data manipulation and analysis
import pandas as pd

In [2]:
# Read in the NFL Combine data and store it in a DataFrame
combine_data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes/nfl_combine.csv')

---

# Consistency and Correctness

Let's start by examining the first 5 rows of data to get a better look.

In [3]:
# Reveal first 5 rows of the dataset
combine_data.head()

Unnamed: 0,position,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle
0,QB,,4.79,30.5,,110.0,7.66,4.41
1,RB,,4.5,34.0,21.0,121.0,7.09,4.3
2,QB,,4.6,,,,,
3,QB,,4.95,30.0,,119.0,7.44,4.34
4,WR,,4.78,38.0,,118.0,,4.45


Interesting! Looks like in just the first few rows, there are a lot of missing values.

Before deciding what to do with these missing values, let's check consistency. We'll look at the data type of each column and make sure it matches the values the column holds.

In [4]:
# Check the data types of each column to ensure they are consistent with the expected data types
combine_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6128 entries, 0 to 6127
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   position    6128 non-null   object 
 1   Round       4023 non-null   float64
 2   forty       5751 non-null   float64
 3   vertical    4965 non-null   float64
 4   bench_reps  4308 non-null   float64
 5   broad_jump  4891 non-null   float64
 6   three_cone  3985 non-null   float64
 7   shuttle     4093 non-null   float64
dtypes: float64(7), object(1)
memory usage: 383.1+ KB


Nice! It looks like we don't need to change any of the data types for this dataset.
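If a numeric column had instead been read in as text, one common fix is `pd.to_numeric`. The sketch below uses a hypothetical column containing a stray `"DNP"` entry; this value is invented for illustration and does not appear in the combine data.

```python
import pandas as pd

# Hypothetical column that was read in as text because of a stray entry
raw = pd.DataFrame({"forty": ["4.79", "4.50", "DNP", "4.95"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
raw["forty"] = pd.to_numeric(raw["forty"], errors="coerce")

print(raw["forty"].dtype)
print(raw["forty"].isna().sum())
```

Coercing creates new missing values, so it is worth counting them afterwards and noting why they appeared.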

Let's go ahead and check the _correctness_ and consistency of the dataset by inspecting some of the rows. To do this, let's take a look at the first 50 rows.

After running the code block below, examine the rows to make sure the values are correctly formatted and there are no absurd outliers. We also need to check that there are no misspelled positions or capitalization issues.

In [5]:
# Let's check the first 50 rows of the dataset to ensure correctness and consistency.
# This will help us identify any potential issues with the data formatting or inconsistencies.
combine_data.head(50)

Unnamed: 0,position,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle
0,QB,,4.79,30.5,,110.0,7.66,4.41
1,RB,,4.5,34.0,21.0,121.0,7.09,4.3
2,QB,,4.6,,,,,
3,QB,,4.95,30.0,,119.0,7.44,4.34
4,WR,,4.78,38.0,,118.0,,4.45
5,OT,,5.41,28.0,20.0,104.0,8.24,4.61
6,WR,,4.76,31.5,,116.0,7.41,
7,WR,,4.73,38.0,,128.0,7.1,4.09
8,WR,,4.59,28.5,,111.0,7.24,4.27
9,RB,,4.71,31.5,16.0,108.0,,


Looks quite good! There are no capitalization issues in the position column and all of the numeric values are in float (decimal) format.
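Had the position labels been inconsistent, a small amount of string cleanup would usually be enough. The following is a minimal sketch on made-up labels, assuming that stray whitespace and mixed casing are the only problems.

```python
import pandas as pd

# Made-up labels with casing and whitespace problems
labels = pd.Series(["QB", "qb ", " Wr", "WR", "RB"])

# Strip surrounding whitespace, then standardize to upper case
cleaned = labels.str.strip().str.upper()

print("Unique labels before:", labels.nunique())
print("Unique labels after:", cleaned.nunique())
print(sorted(cleaned.unique()))
```

Comparing the number of unique labels before and after is a quick way to confirm that the cleanup merged the duplicates.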

This is a massive dataset with over 6000 rows so it may be best to examine it using .describe() from the Pandas library. This will allow us to see statistical information about each column, which will tell us if there are any alarming outliers.

In [6]:
# Check the statistical summary of the dataset to identify any outliers or anomalies
combine_data.describe()

Unnamed: 0,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle
count,4023.0,5751.0,4965.0,4308.0,4891.0,3985.0,4093.0
mean,3.832712,4.764519,32.838933,20.721681,114.98487,7.265425,4.402817
std,1.931859,0.305037,4.259617,6.394058,9.325569,0.40629,0.26369
min,1.0,4.22,17.5,2.0,74.0,6.28,3.75
25%,2.0,4.53,30.0,16.0,109.0,6.97,4.21
50%,4.0,4.68,33.0,21.0,116.0,7.17,4.36
75%,5.0,4.96,36.0,25.0,121.0,7.51,4.56
max,7.0,6.05,46.5,49.0,147.0,9.04,5.56


This looks very good; none of the maximum or minimum values seem unrealistic. The only number that stands out to me is the bench press maximum of 49 reps. This means that somebody holds the record with 49 reps at 225 pounds; that's insane!

FUN FACT: Stephen Paea is the player who had those 49 bench reps at the combine. He was a DT out of Oregon State and was drafted 53rd overall by the Chicago Bears in 2011.
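Reading a summary table by eye works for a handful of columns. A more systematic option is the interquartile range (IQR) rule, sketched below on made-up bench press counts. The 1.5 multiplier is a common convention, not something taken from this module.

```python
import pandas as pd

# Made-up bench press counts; 49 mirrors the maximum seen in the summary above
bench = pd.Series([16, 18, 20, 21, 21, 22, 25, 26, 49], name="bench_reps")

q1 = bench.quantile(0.25)
q3 = bench.quantile(0.75)
iqr = q3 - q1

# Values far outside the middle 50% of the data get flagged for review
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
flagged = bench[(bench < lower) | (bench > upper)]

print(flagged.tolist())
```

A flagged value is only a candidate for review. As the 49-rep example shows, an extreme value can be perfectly genuine.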

Finally, we need to examine the completeness of the dataset. Before we address this, we need to talk about the ethical decision-making involved with missing data.

---


# Ethics of Data Completeness



We need to consider WHY the data is missing in the first place. Was it meant to be empty? Did someone forget to record certain data points? Sometimes it can be extremely hard to tell, so it is crucial to document the assumptions you're making. Transparency is key: readers need to understand your thought process because it may affect how they use the dataset. Context is everything with data completeness. With that being said, let's start tackling this dataset.



# Completeness

Building upon the previous sections, there are several ways to approach this issue, but let's start by asking: Why is the data missing? If it was intentional, why?

In this case, any missing data is likely due to non-participation. In recent years, more and more athletes have started to skip certain drills or the combine altogether. It also depends on which variable we are referring to. For example, most of the missing numerical values are more than likely related to non-participation. However, a missing 'Round' number most likely means that the player went undrafted. Missing 'position' values are a little harder to interpret. A player may have had no official position coming out of the combine, or may have played multiple positions in college. A perfect example is Travis Hunter in the 2025 draft, who played both cornerback and wide receiver in college. With that in mind, we may want to look at how many rows have missing positions.

Before getting to that, let's make it simple and split the data. We will make one dataframe of missing values and the other will have complete records only.

In [7]:
# Make a copy of the dataset with no missing values
complete_combine_data = combine_data.dropna()

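# Make a dataframe of the rows with at least one missing value
# (.copy() gives an independent DataFrame, so later column assignments do not trigger a SettingWithCopyWarning)
missing_combine_data = combine_data[combine_data.isnull().any(axis=1)].copy()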

Let's take a look at the volume of data for each dataframe.

In [8]:
# Display the shapes of the complete and missing dataframes to understand the extent of missing data
complete_combine_data.shape, missing_combine_data.shape

((1930, 8), (4198, 8))

That is a TON of data with missing values. Let's start by examining the rows with any missing position groups.

In [9]:
missing_combine_data[missing_combine_data['position'].isnull()]

Unnamed: 0,position,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle


That's great! There are no players with missing positions! However, let's just double check that all of them are properly formatted.

In [10]:
# Search every instance of a unique value in order to ensure that all positions are properly formatted
missing_combine_data['position'].unique()

array(['QB', 'RB', 'WR', 'OT', 'C', 'OG', 'TE', 'FB', 'OL', 'OLB', 'CB',
       'DT', 'S', 'DE', 'ILB', 'DL', 'EDGE', 'LB', 'DB'], dtype=object)

Looks good! Moving on to the numeric variables, let's see if there are any rows missing AT LEAST 4 of their combine drill values.

In [11]:
missing_combine_data[missing_combine_data[['forty','vertical','bench_reps','broad_jump','three_cone','shuttle']].isna().sum(axis=1) >= 4]

Unnamed: 0,position,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle
2,QB,,4.60,,,,,
12,WR,,4.50,,,,,
24,QB,,4.95,,,,,
25,OG,,5.10,,32.0,,,
42,OG,5.0,5.22,,26.0,,,
...,...,...,...,...,...,...,...,...
6118,CB,7.0,4.51,,18.0,,,
6122,DT,4.0,4.89,,,,,
6124,CB,5.0,4.26,42.0,,,,
6125,DE,3.0,,,15.0,,,


Wow, there are 1,123 players who had no data for the majority of the combine drills! These are most likely players who skipped certain drills in an attempt to prevent injury and/or only did the drills that are necessary for their position. It is difficult to decide what to do with these players.

Some of the possible options we could take are: drop all rows with missing data, drop ONLY the rows with majority of their data missing, mark the missing data as a certain value, OR change all of the missing drill values to the mean/median. 

Making this decision means weighing the pros and cons of each approach to find the most faithful representation of the data. You would most likely not want to drop all of the rows with missing values in this case, because that is the majority of the dataset and would obscure the truth by eliminating a lot of recorded drill values. Dropping players who did not compete in most of the drills could be a good idea, but that would mean ignoring the drills that they DID do. On one hand, replacing missing values with a single value such as the mean wouldn't shift each column's average much. On the other hand, it would not be a truly accurate representation, because if those players DID participate in the drill, their results might have been drastically different.
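To make the trade-off of mean filling concrete, here is a small sketch on made-up 40-yard dash times. It suggests that filling with the mean leaves the average where it was while narrowing the spread of the column.

```python
import numpy as np
import pandas as pd

# Made-up 40-yard dash times with two missing entries
forty = pd.Series([4.4, 4.6, np.nan, 4.8, 5.0, np.nan])

# Fill the gaps with the mean of the observed values
filled = forty.fillna(forty.mean())

print("Mean before and after:", round(forty.mean(), 3), round(filled.mean(), 3))
print("Std before and after: ", round(forty.std(), 3), round(filled.std(), 3))
```

The smaller standard deviation is the cost of this approach: every filled value sits exactly at the center of the distribution.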

Believe it or not, we may need to do several of these approaches. The best approach may be a somewhat subjective one but let's go with this one:

- To address players we presume went undrafted, we will change the 'Round' column to the object data type and fill all of its missing values with 'Undrafted'.
- Players missing ALL of their drill values will be dropped from the dataset, since we only care about players who participated in the combine drills.
- Any other missing drill value will be replaced with the mean of its column in order to preserve the drill values that were recorded.


Ideally, we would not want to replace all of the missing values with the mean because it does not accurately represent those players' TRUE drill results. However, in order to keep the drill columns as a consistent float data type, this is most likely the best decision. The reason we changed the data type of 'Round' and not the drill columns is the nature of 'Round'. The round a player was drafted in is NOT a measurement. It can be viewed as a categorical variable, much like a shoe size. 'Round' may be represented by a number, but it is read as a category that indicates when a player was drafted.
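One way to keep this kind of replacement transparent is to record which values were originally missing before filling them. The sketch below does this on a made-up table; the `_was_missing` suffix is a hypothetical naming choice, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up drill results with gaps
toy = pd.DataFrame({
    "forty": [4.5, np.nan, 4.9],
    "vertical": [np.nan, 33.0, 30.0],
})
drill_cols = ["forty", "vertical"]

# Record which cells were missing BEFORE filling anything in
for col in drill_cols:
    toy[col + "_was_missing"] = toy[col].isna()

# Fill each drill column with its own mean
toy[drill_cols] = toy[drill_cols].fillna(toy[drill_cols].mean())

print(toy)
```

Anyone using the cleaned data can then filter on the flag columns to separate recorded results from filled-in ones.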

Let's start by dropping all players who, we assume, did not participate in the draft combine.

In [12]:
# Drop rows that are missing all of the combine drill values
missing_combine_data = missing_combine_data.dropna(subset=['forty', 'vertical','bench_reps','broad_jump','three_cone','shuttle'], how='all')

# Check that they are no longer present
missing_combine_data[missing_combine_data[['forty', 'vertical','bench_reps','broad_jump','three_cone','shuttle']].isnull().all(axis=1)]

Unnamed: 0,position,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle


Nice! Now, let's change the data type of the 'Round' column and replace all missing values with 'Undrafted'.

In [13]:
# Change the column data type to the object type
missing_combine_data['Round'] = missing_combine_data['Round'].astype('object')

# Check that the data type has been changed
missing_combine_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4053 entries, 0 to 6127
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   position    4053 non-null   object 
 1   Round       1995 non-null   object 
 2   forty       3821 non-null   float64
 3   vertical    3035 non-null   float64
 4   bench_reps  2378 non-null   float64
 5   broad_jump  2961 non-null   float64
 6   three_cone  2055 non-null   float64
 7   shuttle     2163 non-null   float64
dtypes: float64(6), object(2)
memory usage: 285.0+ KB


It is now an object data type. Next, let's replace the missing values.

In [14]:
# Replace the missing values in the 'Round' column with 'Undrafted'
missing_combine_data['Round'] = missing_combine_data['Round'].fillna('Undrafted')

In [15]:
# Let's check a sample of the data to ensure that the missing values have been replaced correctly
missing_combine_data.sample(50)

Unnamed: 0,position,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle
1857,QB,Undrafted,4.95,29.0,,108.0,7.25,4.43
4146,DE,3.0,4.92,,26.0,,,
2286,QB,Undrafted,5.08,27.0,,105.0,7.32,4.45
5266,S,1.0,4.46,35.5,19.0,128.0,,
3888,DT,1.0,5.12,,,,,
4628,ILB,5.0,4.66,,20.0,113.0,,
1829,WR,2.0,4.6,39.0,,127.0,7.0,4.0
1718,QB,6.0,4.84,30.5,,106.0,7.33,4.23
194,OG,Undrafted,5.51,27.5,,99.0,,
213,WR,Undrafted,4.46,37.0,,124.0,7.21,4.22


Looks great! Now we need to enter the rest of the missing values as the mean of their respective columns.

In [16]:
# Replace the missing values in each drill column with the mean of that column.
# This fills the gaps while keeping every drill result that was actually recorded.
drill_columns = ['forty', 'vertical', 'bench_reps', 'broad_jump', 'three_cone', 'shuttle']

for column in drill_columns:
    missing_combine_data[column] = missing_combine_data[column].fillna(missing_combine_data[column].mean())

Let's check a sample of the data to take a look at the results of our operations.

In [17]:
missing_combine_data.sample(50)

Unnamed: 0,position,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle
3340,OLB,7.0,4.71,32.733213,24.0,106.0,7.275134,4.411082
1506,WR,6.0,4.37,33.0,17.0,125.0,7.08,4.411082
4767,CB,Undrafted,4.69,33.5,20.0,118.0,7.01,4.2
1690,QB,4.0,4.95,28.5,20.044996,112.0,7.22,4.39
855,WR,Undrafted,4.46,33.5,20.044996,115.086795,7.275134,4.411082
3428,DT,Undrafted,4.91,32.5,35.0,115.0,7.23,4.2
3372,OLB,2.0,4.57,38.5,26.0,129.0,7.275134,4.0
3112,OG,Undrafted,5.44,26.0,25.0,95.0,7.97,4.91
1063,RB,Undrafted,4.7,35.0,18.0,117.0,6.85,4.49
231,OG,4.0,4.93,32.733213,20.044996,115.086795,7.275134,4.411082


Nice! Finally, let's go ahead and double-check that there are no more rows with missing values!

In [18]:
# Count up missing values in each column to ensure that all missing values have been addressed
missing_combine_data.isna().sum()

position      0
Round         0
forty         0
vertical      0
bench_reps    0
broad_jump    0
three_cone    0
shuttle       0
dtype: int64

Beautiful! Now let's complete the final step of this process and combine the complete and missing datasets back together!

In [19]:
# Let's combine the complete and missing datasets back together
Final_combine_data = pd.concat([complete_combine_data, missing_combine_data], ignore_index=True)

Let's check it out!

In [20]:
Final_combine_data.tail(50)

Unnamed: 0,position,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle
5933,CB,5.0,4.33,39.5,20.044996,132.0,6.48,3.94
5934,CB,2.0,4.5,32.733213,20.044996,115.086795,7.275134,4.411082
5935,CB,1.0,4.44,32.733213,20.044996,115.086795,7.275134,4.411082
5936,LB,5.0,4.762214,32.733213,21.0,115.086795,7.275134,4.411082
5937,S,Undrafted,4.762214,35.0,16.0,120.0,7.275134,4.411082
5938,S,Undrafted,4.52,32.733213,20.044996,124.0,7.0,4.411082
5939,LB,Undrafted,4.76,40.5,20.044996,133.0,7.09,4.55
5940,DT,5.0,4.762214,32.733213,29.0,115.086795,7.275134,4.411082
5941,EDGE,2.0,4.55,35.0,20.044996,122.0,7.275134,4.45
5942,S,Undrafted,4.762214,35.0,15.0,125.0,6.89,4.411082


In [21]:
# Examine the column data types of the final combined dataset to ensure they are consistent
Final_combine_data.info()

# print the statistical summary of the final combined dataset
print(Final_combine_data.describe())

# Print the shape of the final combined dataset to confirm that it has been successfully created
print(Final_combine_data.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5983 entries, 0 to 5982
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   position    5983 non-null   object 
 1   Round       5983 non-null   object 
 2   forty       5983 non-null   float64
 3   vertical    5983 non-null   float64
 4   bench_reps  5983 non-null   float64
 5   broad_jump  5983 non-null   float64
 6   three_cone  5983 non-null   float64
 7   shuttle     5983 non-null   float64
dtypes: float64(6), object(2)
memory usage: 374.1+ KB
             forty     vertical   bench_reps   broad_jump   three_cone  \
count  5983.000000  5983.000000  5983.000000  5983.000000  5983.000000   
mean      4.764430    32.820944    20.532236   115.003473     7.268667   
std       0.299064     3.880485     5.434014     8.431625     0.331600   
min       4.220000    17.500000     2.000000    74.000000     6.280000   
25%       4.530000    30.500000    18.000000   111.000000   

In [22]:
# Confirm there are no missing values in the final combined dataset
print(Final_combine_data.isnull().sum())

position      0
Round         0
forty         0
vertical      0
bench_reps    0
broad_jump    0
three_cone    0
shuttle       0
dtype: int64


The data, much like the title of this section, is now 'complete'!
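As a variation on the column-mean fill used in this section, missing values can be filled with the mean of each player's position group, since linemen and receivers tend to post very different times. This is a different technique from the one applied above, shown here only as a sketch on made-up numbers.

```python
import numpy as np
import pandas as pd

# Made-up times for two position groups, each with one missing value
toy = pd.DataFrame({
    "position": ["WR", "WR", "WR", "OT", "OT", "OT"],
    "forty": [4.40, 4.50, np.nan, 5.20, 5.30, np.nan],
})

# Column-wide mean, as used in this section
overall_fill = toy["forty"].fillna(toy["forty"].mean())

# Position-specific mean: transform("mean") returns one group mean per row
group_means = toy.groupby("position")["forty"].transform("mean")
by_position_fill = toy["forty"].fillna(group_means)

print(pd.DataFrame({"overall": overall_fill, "by_position": by_position_fill}))
```

With the column-wide mean, both missing players receive the same time; with group means, each receives a value typical of their own position.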

---


# Data Imputation


Data Imputation is an extremely important topic in data cleaning. You may not know what it is yet, but you have actually seen it in this module thus far!

_Data Imputation_: The process of filling in missing data points using estimated/substitute values. 

Do you remember when we replaced all of the missing combine drill values with each column's mean? That's data imputation!

However, that is not the only way to do data imputation. You can also use models to estimate what values to impute. If you are new to data modeling or have no experience with it, I'll walk you through it. Modeling is just a way of using existing data to estimate what a value might be. There are separate models for categorical and numerical estimates. Today, we will only cover numerical modeling with linear regression. This means that we will be estimating a numerical outcome for one of our variables, such as 40-yard dash time.


Let's go ahead and start by redoing some steps to set up our missing values dataset from earlier.

In [23]:
# Re-read the original dataset to make sure we are working with the original data
combine_data = pd.read_csv('https://raw.githubusercontent.com/schuckers/Charlotte_SCORE_Summer25/refs/heads/main/Data%20for%20Modules/Data%20for%20Intro%20to%20Data%20Cleaning%20and%20Imputation%20by%20Austin%20Hayes/nfl_combine.csv')

# Split the dataset into a complete and a missing dataset
# Make a copy of the dataset with no missing values
complete_combine_data_2 = combine_data.dropna()

# Make a dataframe of the missing values
missing_combine_data_2 = combine_data[combine_data.isnull().any(axis=1)]

Now we need to import the necessary libraries for modeling.

Here are some links to download the libraries, if you don't have them downloaded yet.

Sklearn: https://scikit-learn.org/stable/install.html

In [24]:
# Import necessary libraries for modeling
# train_test_split is used to split the dataset into training and testing sets to train the model. 
from sklearn.model_selection import train_test_split

# Import linear regression model
from sklearn.linear_model import LinearRegression


---

# Linear Regression Model for Imputation


As stated earlier, linear regression models are used to estimate a numerical outcome based on other variables. In this lesson I will show how to impute data using this model. 

Linear regression models: Models that assume a linear relationship between the predictor variables and the outcome, meaning the expected change in the outcome is constant for every one-unit change in a predictor. In other words, if you plot a predictor on the x-axis and the outcome on the y-axis, the points should roughly follow a straight line: as the predictor increases, the outcome is expected to move steadily in one direction.
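To see what "a constant change per unit" looks like in numbers, here is a minimal sketch that fits a straight line to made-up data. It uses NumPy's `polyfit` rather than the scikit-learn model used in this module, purely to keep the example short.

```python
import numpy as np

# Made-up data that follows y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial (a straight line); the result is slope, then intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to estimate the outcome for a new x value
prediction = slope * 5.0 + intercept

print(round(slope, 3), round(intercept, 3), round(prediction, 3))
```

The slope is the constant change in y for each one-unit increase in x, which is exactly the assumption a linear regression model makes.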

Every model needs to be trained on complete data before it can make estimates. Normally this requires splitting the data into training and testing sets. There are various splits you can do, such as 50/50, 40/60, 30/70, etc. However, in this case, we have effectively already done the split because we want to use ALL of the complete rows to estimate the missing ones. We could split the complete data and both train and test the model on it, but here we only care about imputing the rows with missing values.
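For readers curious about the split this module skips, the sketch below shows one way it could look. The data is synthetic and stands in for the complete rows; the 70/30 split, the random seeds, and the choice of mean absolute error are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for complete rows (NOT the combine data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 4.7 + 0.2 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(scale=0.05, size=200)

# Hold out 30% of the rows so the model is tested on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"Mean absolute error on held-out rows: {mae:.3f}")
```

A held-out error like this gives a rough sense of how far off imputed values might be, which is useful context when reporting them.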

In this example, we will use each player's other combine drill results to predict their missing 40-yard dash time.

In [25]:
# Create the linear regression model object
linear_model = LinearRegression()


# Note: scikit-learn expects the predictors as a 2D table (rows = players, columns = features),
# so we select the columns with double brackets to get a DataFrame.
# We will use the complete dataset to train the model.
x = complete_combine_data_2[['vertical', 'bench_reps', 'broad_jump', 'three_cone', 'shuttle']]

# Unlike x, y can be 1D, so single brackets (a Series) are enough.
# The model learns from the existing 40-yard dash times in the 'forty' column.
y = complete_combine_data_2['forty']

# Fit the model to the training data
linear_model.fit(x, y)

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


In [26]:
missing_combine_data_2.columns

Index(['position', 'Round', 'forty', 'vertical', 'bench_reps', 'broad_jump',
       'three_cone', 'shuttle'],
      dtype='object')

Now that we have built and trained the model, let's have it estimate a 40-yard dash time for every player who is missing one.

In [27]:
# Gather ONLY the players who are missing a 40-yard dash time.
# The model cannot predict from rows with missing predictors, so every drill column
# used as a predictor ('vertical', 'bench_reps', 'broad_jump', 'three_cone', 'shuttle') must be present.
# This filter also requires 'position' and 'Round' to be present, although the model does not use them.
# .copy() makes missing_fortys an independent DataFrame that is safe to modify later.
missing_fortys = missing_combine_data_2[missing_combine_data_2['forty'].isna() & missing_combine_data_2[['position', 'Round', 'vertical', 'bench_reps', 'broad_jump', 'three_cone', 'shuttle']].notna().all(axis=1)].copy()

# Use the rows we just selected to predict the missing 40-yard dash times
predicted_fortys = linear_model.predict(X=missing_fortys[['vertical', 'bench_reps', 'broad_jump', 'three_cone', 'shuttle']])

print(predicted_fortys)

[5.07133559 4.71745076 4.87740177 4.45985711 4.64774051 5.14412653
 5.08528085 5.09802023 5.09834207 4.9223477  4.51912772 4.64512984
 4.55149641 4.47374944 4.57628351]


Nice! Since there are 15 predicted values, only 15 players had no missing data EXCEPT for their 40-yard dash times.

Let's fill in the data and combine it back into the missing dataset. Before we begin, let's print the number of NA values in each column so we can compare before and after and make sure we insert these values correctly.

In [28]:
# Print the number of NA values per column before imputing, so we can compare afterwards.
print('Pre-existing amount of NA rows: ', missing_combine_data_2.isna().sum())

Pre-existing amount of NA rows:  position         0
Round         2105
forty          377
vertical      1163
bench_reps    1820
broad_jump    1237
three_cone    2143
shuttle       2035
dtype: int64


As you can see, there are 377 missing values in the 'forty' column.

Now let's impute the values in the missing fortys dataset.

In [29]:
# fill NA values in the missing fortys 
missing_fortys.loc[:,'forty'] = predicted_fortys

In [30]:
# Make sure that the values have been filled in correctly
missing_fortys.sample(15)

Unnamed: 0,position,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle
2469,OG,1.0,5.144127,26.5,35.0,105.0,7.65,4.62
313,C,4.0,5.071336,32.0,21.0,96.0,7.57,4.45
2444,RB,2.0,4.459857,40.0,11.0,126.0,7.07,4.29
5610,LB,5.0,4.473749,38.0,17.0,123.0,6.89,4.14
2971,OL,2.0,5.098342,26.5,33.0,105.0,7.53,4.58
3004,OL,5.0,4.922348,30.5,25.0,112.0,7.5,4.51
2607,OL,2.0,5.085281,28.0,29.0,109.0,7.77,4.62
2283,TE,5.0,4.717451,33.0,18.0,114.0,7.12,4.33
2412,C,2.0,4.877402,30.5,21.0,108.0,7.29,4.4
2449,RB,6.0,4.647741,35.0,15.0,120.0,7.13,4.51


Nice! All of the predicted 40-yard dash times have been filled in. Now let's remove the rows with missing 40-yard dash times that we have been working on from the missing dataset and replace them with their updated counterparts.

In [31]:
# Drop the rows with missing 40-yard dash times that we have been working on from the missing dataset.
# Assigning the result (instead of using inplace=True) avoids pandas' SettingWithCopyWarning.
missing_combine_data_2 = missing_combine_data_2.drop(index=missing_fortys.index)

# Now we add the updated rows with filled-in 40-yard dash times.
missing_combine_data_2 = pd.concat([missing_combine_data_2, missing_fortys])


In [32]:
# Check that the number of missing 'forty' values has dropped by 15, the size of the missing_fortys dataframe
missing_combine_data_2.isna().sum()

position         0
Round         2105
forty          362
vertical      1163
bench_reps    1820
broad_jump    1237
three_cone    2143
shuttle       2035
dtype: int64

Perfect! All 15 updated rows are now in the missing values dataset. There were 377 NA values in the 'forty' column; now there are 362. We could go a step further and merge the complete and missing dataframes back together just like we did earlier in the non-imputation section of this module, but you will instead be asked to build upon the missing dataset in the review questions below.

NOTE: We can also check that the values have been inserted correctly by using a filter similar to the one we used to make the missing_fortys dataframe. However, this time we will include the 'forty' column in the list of columns we want to be non-null because we filled in their estimated values. Take a look at the code below.

In [34]:
missing_combine_data_2[missing_combine_data_2[['forty','position', 'Round', 'vertical', 'bench_reps', 'broad_jump', 'three_cone', 'shuttle']].notna().all(axis=1)]

Unnamed: 0,position,Round,forty,vertical,bench_reps,broad_jump,three_cone,shuttle
313,C,4.0,5.071336,32.0,21.0,96.0,7.57,4.45
2283,TE,5.0,4.717451,33.0,18.0,114.0,7.12,4.33
2412,C,2.0,4.877402,30.5,21.0,108.0,7.29,4.4
2444,RB,2.0,4.459857,40.0,11.0,126.0,7.07,4.29
2449,RB,6.0,4.647741,35.0,15.0,120.0,7.13,4.51
2469,OG,1.0,5.144127,26.5,35.0,105.0,7.65,4.62
2607,OL,2.0,5.085281,28.0,29.0,109.0,7.77,4.62
2665,OT,2.0,5.09802,28.0,27.0,108.0,7.77,4.69
2971,OL,2.0,5.098342,26.5,33.0,105.0,7.53,4.58
3004,OL,5.0,4.922348,30.5,25.0,112.0,7.5,4.51


Nice! All 15 rows are present with their estimated 40-yard dash times.
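The drop-and-concatenate steps above can also be collapsed into a single assignment with a boolean mask. The sketch below shows the idea on a made-up table with a single predictor; it is one possible alternative, not the approach used in this module.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data in which forty = 6.0 - 0.04 * vertical exactly
toy = pd.DataFrame({
    "vertical": [30.0, 32.0, 34.0, 36.0, 38.0],
    "forty": [4.80, 4.72, np.nan, 4.56, np.nan],
})
predictors = ["vertical"]

# Boolean mask marking the rows where the target is known
known = toy["forty"].notna()

# Train only on rows with a recorded 40-yard dash time
model = LinearRegression().fit(toy.loc[known, predictors], toy.loc[known, "forty"])

# Predict only for the rows missing a time and write the results back in place
toy.loc[~known, "forty"] = model.predict(toy.loc[~known, predictors])

print(toy)
```

Because the values are written back by row label, the rows keep their original order and no concatenation step is needed.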

--- 

# Review Questions

Still using the data we have used throughout this module, answer the following questions to review your knowledge!

### 1. Why would someone want to change the datatype of a column during the data cleaning process?

### 2. Re-read the ORIGINAL dataset and split it into complete and missing datasets like we did earlier. Find the median value of the 'bench_reps' column. Then, use that median value to fill in ALL of the NA values within that column. Check the number of NA values to make sure they are filled in correctly. 

HINT: use .median()

### 3. What does data completeness mean? List some appropriate ways you could achieve data completeness.

### 4. Which one of the three C's was not implemented in the following dataset? What could be done to fix it? 

| Position | Bench_Reps |
|----------|------------|
| TIGHT END |    35     |
| Quarterback |    25   |
| Tight End |    32     |
| Quarterback |   23    |

### 5. Use a linear regression model to predict 'bench_reps' for players who, we presume, did NOT participate in benching during the combine.