# Activity: Perform feature engineering 

## **Introduction**


As you're learning, data professionals working on modeling projects use feature engineering to help them determine which attributes in the data can best predict certain measures.

In this activity, you are working for a firm that provides insights to the National Basketball Association (NBA), a professional North American basketball league. You will help NBA managers and coaches identify which players are most likely to thrive in the high-pressure environment of professional basketball and help the team be successful over time.

To do this, you will analyze a subset of data that contains information about NBA players and their performance records. You will conduct feature engineering to determine which features will most effectively predict whether a player's NBA career will last at least five years. The insights gained then will be used in the next stage of the project: building the predictive model.


## **Step 1: Imports** 


Start by importing `pandas`.

In [1]:
# Import pandas
import pandas as pd

The dataset is a .csv file named `nba-players.csv`. It consists of performance records for a subset of NBA players. As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# RUN THIS CELL TO IMPORT YOUR DATA.

# Save in a variable named `data`.

### YOUR CODE HERE ###

data = pd.read_csv("nba-players.csv", index_col=0)

<details><summary><h4><strong>Hint 1</strong></h4></summary>

The `read_csv()` function from `pandas` allows you to read in data from a csv file and load it into a DataFrame.
    
</details>

<details><summary><h4><strong>Hint 2</strong></h4></summary>

Call `read_csv()` and pass in the name of the csv file as a string, followed by `index_col=0`, to use the first column from the csv as the index in the DataFrame.
    
</details>
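As a minimal sketch of what `index_col=0` changes, the following reads a small in-memory CSV twice. The CSV text and player names are hypothetical stand-ins, not the lab's `nba-players.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV text; the first, unnamed column holds the row labels.
csv_text = ",name,gp\n0,Player A,36\n1,Player B,35\n"

# index_col=0 uses that first column as the index instead of a data column.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without index_col, pandas keeps the column and names it "Unnamed: 0".
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Passing `index_col=0` keeps the row labels out of the feature columns, so they cannot be mistaken for a predictor later.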

## **Step 2: Data exploration** 

Display the first 10 rows of the data to get a sense of what it entails.

In [3]:
# Display first 10 rows of data
data.head(10)

Unnamed: 0,name,gp,min,pts,fgm,fga,fg,3p_made,3pa,3p,...,fta,ft,oreb,dreb,reb,ast,stl,blk,tov,target_5yrs
0,Brandon Ingram,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,...,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0
1,Andrew Harrison,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,...,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0
2,JaKarr Sampson,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,...,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0
3,Malik Sealy,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,...,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1
4,Matt Geiger,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,...,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1
5,Tony Bennett,75,11.4,3.7,1.5,3.5,42.3,0.3,1.1,32.5,...,0.5,73.2,0.2,0.7,0.8,1.8,0.4,0.0,0.7,0
6,Don MacLean,62,10.9,6.6,2.5,5.8,43.5,0.0,0.1,50.0,...,1.8,81.1,0.5,1.4,2.0,0.6,0.2,0.1,0.7,1
7,Tracy Murray,48,10.3,5.7,2.3,5.4,41.5,0.4,1.5,30.0,...,0.8,87.5,0.8,0.9,1.7,0.2,0.2,0.1,0.7,1
8,Duane Cooper,65,9.9,2.4,1.0,2.4,39.2,0.1,0.5,23.3,...,0.5,71.4,0.2,0.6,0.8,2.3,0.3,0.0,1.1,0
9,Dave Johnson,42,8.5,3.7,1.4,3.5,38.3,0.1,0.3,21.4,...,1.4,67.8,0.4,0.7,1.1,0.3,0.2,0.0,0.7,0


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

There is a function in the `pandas` library that can be called on a DataFrame to display the first n number of rows, where n is a number of your choice. 
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Call the `head()` function and pass in 10.
</details>

Display the number of rows and the number of columns to get a sense of how much data is available to you.

In [4]:
# Display number of rows and number of columns
data.shape

(1340, 21)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

DataFrames in `pandas` have an attribute that can be called to get the number of rows and columns as a tuple.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

You can call the `shape` attribute.
</details>
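The following sketch shows `head()` and `shape` together on a small, hypothetical DataFrame (the values are illustrative and are not the lab data).

```python
import pandas as pd

# Hypothetical toy DataFrame standing in for the lab data.
toy = pd.DataFrame({"gp": [36, 35, 74], "pts": [7.4, 7.2, 5.2]})

# head(n) returns the first n rows as a new DataFrame.
first_two = toy.head(2)

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```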

**Question:** What do you observe about the number of rows and the number of columns in the data?

The data contains 1,340 rows and 21 columns.

Now, display all column names to get a sense of the kinds of metadata available about each player. Use the `columns` attribute of the DataFrame.


In [5]:
# Display all column names
data.columns

Index(['name', 'gp', 'min', 'pts', 'fgm', 'fga', 'fg', '3p_made', '3pa', '3p',
       'ftm', 'fta', 'ft', 'oreb', 'dreb', 'reb', 'ast', 'stl', 'blk', 'tov',
       'target_5yrs'],
      dtype='object')

The following table provides a description of the data in each column. This metadata comes from the data source, which is listed in the references section of this lab.

<center>

|Column Name|Column Description|
|:---|:-------|
|`name`|Name of NBA player|
|`gp`|Number of games played|
|`min`|Number of minutes played per game|
|`pts`|Average number of points per game|
|`fgm`|Average number of field goals made per game|
|`fga`|Average number of field goal attempts per game|
|`fg`|Average percent of field goals made per game|
|`3p_made`|Average number of three-point field goals made per game|
|`3pa`|Average number of three-point field goal attempts per game|
|`3p`|Average percent of three-point field goals made per game|
|`ftm`|Average number of free throws made per game|
|`fta`|Average number of free throw attempts per game|
|`ft`|Average percent of free throws made per game|
|`oreb`|Average number of offensive rebounds per game|
|`dreb`|Average number of defensive rebounds per game|
|`reb`|Average number of rebounds per game|
|`ast`|Average number of assists per game|
|`stl`|Average number of steals per game|
|`blk`|Average number of blocks per game|
|`tov`|Average number of turnovers per game|
|`target_5yrs`|1 if career duration >= 5 yrs, 0 otherwise|

</center>

Next, display a summary of the data to get additional information about the DataFrame, including the types of data in the columns.

In [6]:
# Use .info() to display a summary of the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1340 entries, 0 to 1339
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         1340 non-null   object 
 1   gp           1340 non-null   int64  
 2   min          1340 non-null   float64
 3   pts          1340 non-null   float64
 4   fgm          1340 non-null   float64
 5   fga          1340 non-null   float64
 6   fg           1340 non-null   float64
 7   3p_made      1340 non-null   float64
 8   3pa          1340 non-null   float64
 9   3p           1340 non-null   float64
 10  ftm          1340 non-null   float64
 11  fta          1340 non-null   float64
 12  ft           1340 non-null   float64
 13  oreb         1340 non-null   float64
 14  dreb         1340 non-null   float64
 15  reb          1340 non-null   float64
 16  ast          1340 non-null   float64
 17  stl          1340 non-null   float64
 18  blk          1340 non-null   float64
 19  tov   

**Question:** Based on the preceding tables, which columns are numerical and which columns are categorical?

Numerical columns (quantitative values — either int64 or float64):

gp (Games played)

min (Minutes)

pts (Points)

fgm (Field goals made)

fga (Field goals attempted)

fg (Field goal percentage)

3p_made (3-pointers made)

3pa (3-pointers attempted)

3p (3-point percentage)

ftm (Free throws made)

fta (Free throws attempted)

ft (Free throw percentage)

oreb (Offensive rebounds)

dreb (Defensive rebounds)

reb (Total rebounds)

ast (Assists)

stl (Steals)

blk (Blocks)

tov (Turnovers)

target_5yrs (Target variable indicating whether the career lasted at least 5 years; it is stored as int64, although its values 0 and 1 encode a binary class)

Categorical column (non-numerical / object type):

name (Player's name)
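One way to split columns by type programmatically, rather than reading the `info()` output by eye, is `select_dtypes()`. This is a sketch on a hypothetical toy DataFrame whose column names mirror a few of the lab's columns.

```python
import pandas as pd

# Hypothetical toy DataFrame; names mirror the lab's columns, values are made up.
toy = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "gp": [36, 35],
    "pts": [7.4, 7.2],
    "target_5yrs": [0, 1],
})

# Columns with integer or float dtypes.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (here, the text column).
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

A dtype check only reports how values are stored. A column such as `target_5yrs` is numeric in storage but still represents a class label.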

### Check for missing values

Now, review the data to determine whether it contains any missing values. Display the number of missing values in each column: use `isna()` to check whether each value in the data is missing, then use `sum()` to aggregate the number of missing values per column.


In [7]:
# Display the number of missing values in each column
missing_values = data.isna().sum()
missing_values

name           0
gp             0
min            0
pts            0
fgm            0
fga            0
fg             0
3p_made        0
3pa            0
3p             0
ftm            0
fta            0
ft             0
oreb           0
dreb           0
reb            0
ast            0
stl            0
blk            0
tov            0
target_5yrs    0
dtype: int64

**Question:** What do you observe about the missing values in the columns? 

There are no missing values in any of the columns. Every column in the dataset has complete data for all 1,340 player records.

**Question:** Why is it important to check for missing values?

Checking for missing values is important because they can negatively impact the performance and accuracy of machine learning models. Models may not be able to process incomplete data properly, leading to errors, biases, or incorrect predictions. Identifying missing values allows data professionals to handle them appropriately — for example, by imputing, removing, or flagging them — to ensure data quality and model reliability.
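The lab data has no missing values, so the handling options mentioned above are not exercised in this activity. As a hedged sketch, here is how counting, removing, imputing, and flagging might look on a hypothetical DataFrame with gaps; mean imputation is only one of several possible choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
toy = pd.DataFrame({"pts": [7.4, np.nan, 5.2], "ast": [1.9, 3.7, np.nan]})

# Count missing values per column, as in the lab.
missing_per_column = toy.isna().sum()

# Option 1: remove any row that contains a missing value.
dropped = toy.dropna()

# Option 2: impute each column's missing values with that column's mean.
imputed = toy.fillna(toy.mean())

# Option 3: flag missingness in a separate indicator column.
flagged = toy.assign(pts_missing=toy["pts"].isna())

print(missing_per_column)
```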

## **Step 3: Statistical tests** 



Next, use a statistical technique to check the class balance in the data. To understand how balanced the dataset is in terms of class, display the percentage of values that belong to each class in the target column. In this context, class 1 indicates an NBA career duration of at least five years, while class 0 indicates an NBA career duration of less than five years.

In [8]:
# Display percentage (%) of values for each class in the target column
class_distribution = data['target_5yrs'].value_counts(normalize=True) * 100
class_distribution

1    62.014925
0    37.985075
Name: target_5yrs, dtype: float64

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

In `pandas`, `value_counts(normalize=True)` can be used to calculate the relative frequency (proportion) of each distinct value in a specific column of a DataFrame.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

After `value_counts(normalize=True)`, multiplying by `100` converts the frequencies into percentages (%).
</details>

**Question:** What do you observe about the class balance in the target column?

The dataset is somewhat imbalanced, with 62% of players having careers lasting at least five years (class 1) and 38% having careers shorter than five years (class 0). While this isn’t an extreme imbalance, it's still worth noting for modeling.

**Question:** Why is it important to check class balance?

It is important to check class balance because imbalanced datasets can bias machine learning models toward the majority class, leading to misleading performance metrics. For example, a model might predict the majority class very well but perform poorly on the minority class — which may be the more critical one to predict correctly. Understanding class distribution helps inform decisions about model choice, evaluation metrics (like precision, recall, and F1-score), and whether techniques like resampling or class weighting are needed.
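To illustrate why class balance affects how metrics are read, the sketch below computes the accuracy that a trivial "always predict the majority class" rule would achieve. The target values are hypothetical; with the lab's split, the same calculation would give roughly 62%.

```python
import pandas as pd

# Hypothetical target column: three players in class 1, two in class 0.
target = pd.Series([1, 1, 1, 0, 0], name="target_5yrs")

# Percentage of rows in each class, as in the lab.
class_pct = target.value_counts(normalize=True) * 100

# Accuracy (%) of always predicting the most common class.
majority_baseline = class_pct.max()

print(class_pct)
print(f"Majority-class baseline accuracy: {majority_baseline:.1f}%")
```

A model is only informative if its accuracy clearly exceeds this baseline, which is one reason to also report precision, recall, and F1-score.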

## **Step 4: Results and evaluation** 


Now, perform feature engineering, with the goal of identifying and creating features that will serve as useful predictors for the target variable, `target_5yrs`. 

### Feature selection

The following table contains descriptions of the data in each column:

<center>

|Column Name|Column Description|
|:---|:-------|
|`name`|Name of NBA player|
|`gp`|Number of games played|
|`min`|Number of minutes played per game|
|`pts`|Average number of points per game|
|`fgm`|Average number of field goals made per game|
|`fga`|Average number of field goal attempts per game|
|`fg`|Average percent of field goals made per game|
|`3p_made`|Average number of three-point field goals made per game|
|`3pa`|Average number of three-point field goal attempts per game|
|`3p`|Average percent of three-point field goals made per game|
|`ftm`|Average number of free throws made per game|
|`fta`|Average number of free throw attempts per game|
|`ft`|Average percent of free throws made per game|
|`oreb`|Average number of offensive rebounds per game|
|`dreb`|Average number of defensive rebounds per game|
|`reb`|Average number of rebounds per game|
|`ast`|Average number of assists per game|
|`stl`|Average number of steals per game|
|`blk`|Average number of blocks per game|
|`tov`|Average number of turnovers per game|
|`target_5yrs`|1 if career duration >= 5 yrs, 0 otherwise|

</center>

**Question:** Which columns would you select and avoid selecting as features, and why? Keep in mind the goal is to identify features that will serve as useful predictors for the target variable, `target_5yrs`. 

Columns to select:

gp (Games played): Players who have played more games may have more experience, and this could be an indicator of career longevity.

pts (Points): A key performance indicator that could help predict player career longevity — players who score more might be more valuable and have longer careers.

fgm (Field goals made): This is tied to scoring ability, which is essential in evaluating player performance and might contribute to career longevity.

fg (Field goal percentage): Players who are more efficient with their shooting are likely to have better careers.

3p_made (Three-point field goals made) and 3p (Three-point percentage): The ability to make three-pointers is important in modern basketball, and successful players in this area tend to have longer careers.

ftm (Free throws made) and ft (Free throw percentage): Free throw ability is a key skill in basketball, and efficiency here could correlate with career longevity.

oreb (Offensive rebounds), dreb (Defensive rebounds), and reb (Total rebounds): Rebounding is a critical aspect of a player’s value and longevity in the league, so these features are likely to be useful.

ast (Assists): Playmaking ability can be an indicator of a player’s overall value, and those with higher assists may have longer careers.

stl (Steals) and blk (Blocks): Defensive skills are important for a player's value over the long term, so these features might be useful as predictors.

tov (Turnovers): Turnovers may indicate decision-making ability and could be predictive of a player’s longevity if they consistently have low turnovers.

Columns to avoid:

name: The name of the player is a categorical feature and doesn't provide any relevant predictive value for career duration.

target_5yrs: This is the target variable and should not be used as a feature in prediction models because it represents the outcome you're trying to predict.

Summary:
I would select columns that reflect a player's performance (points, assists, shooting efficiency, rebounds, etc.) and playing time (games played and minutes per game). These features are likely to have the most predictive value for whether a player's career will last at least five years. Columns like "name" are irrelevant for prediction and should be excluded.

Next, select the columns you want to proceed with. Make sure to include the target column, `target_5yrs`. Display the first few rows to confirm they are as expected.

In [9]:
# Select the columns to proceed with and save the DataFrame in new variable `selected_data`.
selected_data = data[['gp', 'min', 'pts', 'fgm', 'fga', 'fg', '3p_made', '3pa', '3p', 
                       'ftm', 'fta', 'ft', 'oreb', 'dreb', 'reb', 'ast', 'stl', 'blk', 'tov', 'target_5yrs']]

# Display the first few rows.
selected_data.head()

Unnamed: 0,gp,min,pts,fgm,fga,fg,3p_made,3pa,3p,ftm,fta,ft,oreb,dreb,reb,ast,stl,blk,tov,target_5yrs
0,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,1.6,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0
1,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,2.6,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0
2,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,0.9,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0
3,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,0.9,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1
4,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,1.3,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the materials about feature selection and selecting a subset of a DataFrame.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Use two pairs of square brackets, and place the names of the columns you want to select inside the innermost brackets. 

</details>

<details>
<summary><h4><strong>Hint 3</strong></h4></summary>

There is a function in `pandas` that can be used to display the first few rows of a DataFrame. Make sure to specify the column names with spelling that matches what's in the data. Use quotes to represent each column name as a string. 
</details>
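When nearly every column is kept, an alternative to listing the columns is to drop the unwanted ones. The sketch below, on a hypothetical toy DataFrame, shows that both approaches can produce the same result.

```python
import pandas as pd

# Hypothetical toy DataFrame; names mirror a few of the lab's columns.
toy = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "gp": [36, 35],
    "pts": [7.4, 7.2],
    "target_5yrs": [0, 1],
})

# Approach 1: list the columns to keep inside double square brackets.
by_listing = toy[["gp", "pts", "target_5yrs"]]

# Approach 2: drop the columns to exclude.
by_dropping = toy.drop(columns=["name"])

print(by_listing.equals(by_dropping))
```

Listing columns makes the selection explicit, while dropping is shorter when only one or two columns are excluded.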

### Feature transformation

An important aspect of feature transformation is feature encoding. If there are categorical columns that you want to use as features, those columns should be transformed to be numerical.
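As a sketch of feature encoding, the following one-hot encodes a categorical column with `pd.get_dummies()`. The `position` column is hypothetical; the lab dataset has no such column and is used here only to have something to encode.

```python
import pandas as pd

# Hypothetical data: `position` is NOT a column in the lab dataset.
toy = pd.DataFrame({
    "position": ["guard", "center", "guard"],
    "pts": [7.4, 4.5, 5.2],
})

# One-hot encoding: one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=["position"], dtype=int)

print(encoded)
```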

**Question:** Why is feature transformation important to consider? Are there any transformations necessary for the features you want to use?

Feature Transformation Importance:
Feature transformation is essential in machine learning because most algorithms require numerical input to perform calculations and make predictions. Transforming raw data into a form that can be understood by machine learning models can significantly impact the model's performance. Key reasons include:

Consistency with Model Requirements: Many machine learning algorithms (such as regression, decision trees, or neural networks) expect numerical inputs. Categorical variables, like "name" or "team," must be converted into numerical representations using techniques like one-hot encoding or label encoding.

Improved Model Performance: Some models may struggle with non-numeric data because they can't process categorical variables directly. Feature encoding allows the model to understand the data's structure and relationships, leading to better performance.

Comparable Feature Scales: Transformations like scaling or normalization give features comparable magnitudes, so that no single feature dominates simply because of its units. This is especially useful when using algorithms sensitive to the scale of data, such as K-Nearest Neighbors (KNN) or Support Vector Machines (SVM).

Handling Missing Data: Feature transformation techniques can also help deal with missing or inconsistent data (e.g., filling missing values, encoding missing categories), which helps ensure a more robust model.

Are Transformations Necessary for the Features in Our Data?
Upon reviewing the dataset, the following transformations might be necessary:

Categorical Data: The only categorical feature in the data, in our case, is the "name" column, which we do not need for predictions. We will not include this column in the feature set.

Numerical Data: The remaining features in the dataset, such as points, assists, field goals, and rebounds, are already numeric. However, we might need scaling or normalization because the features have different ranges (for example, in the rows displayed earlier, "stl" stays near or below 1, while "ft" is a percentage between roughly 67 and 88).

Scaling: It's beneficial to scale numerical features (e.g., using Min-Max scaling or Standard scaling) to make the data suitable for models sensitive to feature scales (like logistic regression or SVM).

Missing Values: We’ve already checked for missing values and found none, so we don't need to worry about imputing missing data.
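The two scaling approaches mentioned above can be sketched with plain `pandas` arithmetic. The values below are hypothetical, and in a real project the scaling statistics would normally be computed on the training split only.

```python
import pandas as pd

# Hypothetical features on very different scales.
toy = pd.DataFrame({"stl": [0.4, 1.1, 0.5], "ft": [69.9, 76.5, 67.0]})

# Min-max scaling: rescale each column to the range [0, 1].
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Standard scaling: subtract the column mean, divide by the population standard deviation.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(min_max)
print(standardized)
```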

### Feature extraction

Display the first few rows of the selected data for reference. The table containing descriptions of the data is repeated here:

<center>

|Column Name|Column Description|
|:---|:-------|
|`name`|Name of NBA player|
|`gp`|Number of games played|
|`min`|Number of minutes played per game|
|`pts`|Average number of points per game|
|`fgm`|Average number of field goals made per game|
|`fga`|Average number of field goal attempts per game|
|`fg`|Average percent of field goals made per game|
|`3p_made`|Average number of three-point field goals made per game|
|`3pa`|Average number of three-point field goal attempts per game|
|`3p`|Average percent of three-point field goals made per game|
|`ftm`|Average number of free throws made per game|
|`fta`|Average number of free throw attempts per game|
|`ft`|Average percent of free throws made per game|
|`oreb`|Average number of offensive rebounds per game|
|`dreb`|Average number of defensive rebounds per game|
|`reb`|Average number of rebounds per game|
|`ast`|Average number of assists per game|
|`stl`|Average number of steals per game|
|`blk`|Average number of blocks per game|
|`tov`|Average number of turnovers per game|
|`target_5yrs`|1 if career duration >= 5 yrs, 0 otherwise|

</center>

In [10]:
# Display the first few rows of the selected_data DataFrame for reference.
selected_data.head()

Unnamed: 0,gp,min,pts,fgm,fga,fg,3p_made,3pa,3p,ftm,fta,ft,oreb,dreb,reb,ast,stl,blk,tov,target_5yrs
0,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,1.6,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0
1,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,2.6,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0
2,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,0.9,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0
3,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,0.9,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1
4,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,1.3,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1


**Question:** Which columns lend themselves to feature extraction?

In this context, feature extraction involves creating new features from the existing ones, potentially combining multiple columns to generate more meaningful features for modeling. Some columns in the dataset already provide valuable performance metrics, but there are others that could be combined or transformed into new features to better predict the target variable (target_5yrs).

Here are some columns that lend themselves to feature extraction:

Field Goal Statistics:

fgm (field goals made) and fga (field goals attempted) could be combined into a Field Goal Percentage column:

Field Goal Percentage (fg%) = fgm / fga (defined only when fga is greater than zero).

Three-Point Statistics:

3p_made (three-point field goals made) and 3pa (three-point attempts) could be transformed into a 3P% (three-point shooting percentage):

Three-Point Percentage (3P%) = 3p_made / 3pa.

Free Throw Statistics:

ftm (free throws made) and fta (free throw attempts) could be used to create a Free Throw Percentage (FT%):

Free Throw Percentage (FT%) = ftm / fta.

Rebound Statistics:

oreb (offensive rebounds) and dreb (defensive rebounds) could be combined into a Total Rebounds column:

Total Rebounds (Reb) = oreb + dreb.

Efficiency Metrics:

pts (points per game) and min (minutes per game) can be used to calculate a Points per Minute (PPM) metric:

Points per Minute (PPM) = pts / min.

Player’s Role and Contribution:

A Player Efficiency Rating (PER) or other metrics could be derived by combining several of the player's stats (e.g., assists, steals, blocks, turnovers, etc.).

Why is feature extraction important?
Feature extraction is important because it allows us to reduce the complexity of the data, making it more manageable and potentially more predictive. By combining or transforming the features, we may uncover hidden relationships between the features and the target variable, which could improve model accuracy.
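Ratio features such as the ones listed above are undefined when the denominator is zero. The sketch below shows one way to guard against that. The column name `3p_rate` and the choice to record `0.0` for players with no attempts are assumptions made for illustration, not part of the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second player has no three-point attempts.
toy = pd.DataFrame({
    "3p_made": [0.5, 0.0, 0.1],
    "3pa": [2.1, 0.0, 0.5],
    "pts": [7.4, 4.5, 5.7],
    "min": [27.4, 11.5, 11.6],
})

# Replace zero attempts with NaN so the division yields NaN instead of inf or 0/0.
attempts = toy["3pa"].replace(0, np.nan)

# Assumption: treat "no attempts" as a rate of 0.0 (leaving NaN is another option).
toy["3p_rate"] = (toy["3p_made"] / attempts).fillna(0.0)

# Points per minute, as described above.
toy["ppm"] = toy["pts"] / toy["min"]

print(toy[["3p_rate", "ppm"]])
```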

Extract two features that you think would help predict `target_5yrs`. Then, create a new variable named 'extracted_data' that contains features from 'selected_data', as well as the features being extracted.

In [11]:
# Extract features and create new variables
extracted_data = selected_data.copy()

# Calculate Field Goal Percentage (fg%)
extracted_data['fg_percentage'] = extracted_data['fgm'] / extracted_data['fga']

# Calculate Points per Minute (PPM)
extracted_data['ppm'] = extracted_data['pts'] / extracted_data['min']

# Display the first few rows to confirm the changes
extracted_data.head()

Unnamed: 0,gp,min,pts,fgm,fga,fg,3p_made,3pa,3p,ftm,...,oreb,dreb,reb,ast,stl,blk,tov,target_5yrs,fg_percentage,ppm
0,36,27.4,7.4,2.6,7.6,34.7,0.5,2.1,25.0,1.6,...,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0,0.342105,0.270073
1,35,26.9,7.2,2.0,6.7,29.6,0.7,2.8,23.5,2.6,...,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0,0.298507,0.267658
2,74,15.3,5.2,2.0,4.7,42.2,0.4,1.7,24.4,0.9,...,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0,0.425532,0.339869
3,58,11.6,5.7,2.3,5.5,42.6,0.1,0.5,22.6,0.9,...,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1,0.418182,0.491379
4,48,11.5,4.5,1.6,3.0,52.4,0.0,0.1,0.0,1.3,...,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1,0.533333,0.391304


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the materials about feature extraction.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Use the function `copy()` to make a copy of a DataFrame. To access a specific column from a DataFrame, use a pair of square brackets and place the name of the column as a string inside the brackets.

</details>

<details>
<summary><h4><strong>Hint 3</strong></h4></summary>

Use a pair of square brackets to create a new column in a DataFrame. The columns in DataFrames are series objects, which support elementwise operations such as multiplication and division. Be sure the column names referenced in your code match the spelling of what's in the DataFrame.
</details>

Now, to prepare for the Naive Bayes model that you will build in a later lab, clean the extracted data and ensure it is concise. Naive Bayes involves an assumption that features are independent of each other given the class. To satisfy that criterion, if certain features are aggregated to yield new features, it may be necessary to remove those original features. Therefore, drop the columns that were used to extract new features.

**Note:** There are other types of models that do not involve independence assumptions, so this would not be required in those instances. In fact, keeping the original features may be beneficial.
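A quick, informal way to see how strongly features move together is a correlation matrix. The sketch below uses the `pts`, `min`, `fgm`, and `fga` values from the first five rows displayed earlier. Pairwise correlation is only a rough proxy for the independence assumption, not a test of it.

```python
import pandas as pd

# Values copied from the first five rows shown earlier in this lab.
toy = pd.DataFrame({
    "pts": [7.4, 7.2, 5.2, 5.7, 4.5],
    "min": [27.4, 26.9, 15.3, 11.6, 11.5],
    "fgm": [2.6, 2.0, 2.0, 2.3, 1.6],
    "fga": [7.6, 6.7, 4.7, 5.5, 3.0],
})

# Derived feature built from two of the original columns.
toy["ppm"] = toy["pts"] / toy["min"]

# Pairwise Pearson correlations between all columns.
corr = toy.corr()
print(corr.round(2))

# Drop a source column once it has been folded into the derived feature.
reduced = toy.drop(columns=["min"])
```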

In [12]:
# Remove the columns that are no longer needed
extracted_data = extracted_data.drop(['fgm', 'fga', 'min'], axis=1)

# Display the first few rows to confirm the columns have been dropped
extracted_data.head()

Unnamed: 0,gp,pts,fg,3p_made,3pa,3p,ftm,fta,ft,oreb,dreb,reb,ast,stl,blk,tov,target_5yrs,fg_percentage,ppm
0,36,7.4,34.7,0.5,2.1,25.0,1.6,2.3,69.9,0.7,3.4,4.1,1.9,0.4,0.4,1.3,0,0.342105,0.270073
1,35,7.2,29.6,0.7,2.8,23.5,2.6,3.4,76.5,0.5,2.0,2.4,3.7,1.1,0.5,1.6,0,0.298507,0.267658
2,74,5.2,42.2,0.4,1.7,24.4,0.9,1.3,67.0,0.5,1.7,2.2,1.0,0.5,0.3,1.0,0,0.425532,0.339869
3,58,5.7,42.6,0.1,0.5,22.6,0.9,1.3,68.9,1.0,0.9,1.9,0.8,0.6,0.1,1.0,1,0.418182,0.491379
4,48,4.5,52.4,0.0,0.1,0.0,1.3,1.9,67.4,1.0,1.5,2.5,0.3,0.3,0.4,0.8,1,0.533333,0.391304


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the materials about feature extraction.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

There are functions in the `pandas` library that remove specific columns from a DataFrame and that display the first few rows of a DataFrame.
</details>

<details>
<summary><h4><strong>Hint 3</strong></h4></summary>

Use the `drop()` function and pass in a list of the names of the columns you want to remove. By default, calling this function will result in a new DataFrame that reflects the changes you made. The original DataFrame is not automatically altered. You can reassign `extracted_data` to the result, in order to update it. 

Use the `head()` function to display the first few rows of a DataFrame.
</details>

Next, export the extracted data as a new .csv file. You will use this in a later lab. 

In [13]:
# Export the extracted data to a .csv file
extracted_data.to_csv('extracted_data.csv', index=False)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

There is a function in the `pandas` library that exports a DataFrame as a .csv file. 
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Use the `to_csv()` function to export the DataFrame as a .csv file. 
</details>

<details>
<summary><h4><strong>Hint 3</strong></h4></summary>

Call the `to_csv()` function on `extracted_data`, and pass in the name that you want to give to the resulting .csv file. Specify the file name as a string, and make sure to include `.csv` as the file extension. Also, pass in the parameter `index` set to `False` (or `0`), so that when the export occurs, the row indices from the DataFrame are not treated as an additional column in the resulting file.
</details>
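To see the effect of `index=False` without writing a file, the sketch below exports a hypothetical DataFrame to an in-memory buffer and reads it back.

```python
import io
import pandas as pd

# Hypothetical stand-in for the extracted data.
toy = pd.DataFrame({"gp": [36, 35], "ppm": [0.27, 0.26]})

# Write CSV text to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# With index=False the header contains only the data columns.
header = csv_text.splitlines()[0]
print(header)

# Reading the text back gives the same columns, with no extra index column.
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded.columns.tolist())
```

If the index were written, reading the file back without `index_col=0` would add an `Unnamed: 0` column.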

## **Considerations**


**What are some key takeaways that you learned during this lab? Consider the process you followed and what tasks were performed during each step, as well as important priorities when training data.**

Data Exploration and Preprocessing: The lab emphasized the importance of thoroughly exploring the dataset before any modeling. This includes checking for missing values, analyzing column types (categorical vs numerical), and understanding the class balance. Identifying issues in the data early helps in deciding how to handle them.

Feature Selection and Engineering: Feature selection is crucial in identifying the most relevant columns for the target variable. I learned to select and transform data into features that help predict the target variable. Feature extraction, for example, helps combine existing features to create more meaningful predictors, which is important for enhancing the model's predictive power.

Data Cleaning: It is important to clean the data by dropping unnecessary columns, especially after feature extraction, to ensure the data is concise and ready for model training. Naive Bayes assumes that features are independent of each other given the class, so removing redundant features that may conflict with that assumption is crucial.

Data Export and Future Use: Once the data is cleaned and prepared, it’s essential to save the final version for later use. Exporting the data into a .csv format allows us to easily share it, and it can be used in later stages of model development.

**What summary would you provide to stakeholders? Consider key attributes to be shared from the data, as well as upcoming project plans.**

In this phase of the project, we prepared the NBA player data to predict whether a player will have a career duration of at least five years based on their performance metrics. Here's a summary of the key findings and steps:

Data Insights: We found that the dataset includes various player performance statistics such as points, assists, field goal percentage, and more. After analyzing the data, we selected the most relevant features for predicting career longevity, focusing on statistics like the number of games played, minutes, points, and field goals made.

Data Quality: The data did not contain missing values, which simplified preprocessing. We ensured the features used were numerical and aligned with the assumptions of models like Naive Bayes.

Feature Engineering: We extracted new features by combining existing ones to create meaningful predictors for the target variable (target_5yrs), representing career longevity.

Data Preparation: After extracting new features, we cleaned the dataset by dropping redundant columns to ensure concise and usable data.

Next Steps: We have now exported the prepared dataset for use in future modeling. The next steps involve using this data to build and evaluate predictive models, starting with Naive Bayes.