# Naive Bayes model 🤖

## Introduction

* You work for a firm that provides insights for management and coaches in the National Basketball Association (NBA), a professional basketball league in North America. The league is interested in retaining players who can last in the high-pressure environment of professional basketball and help the team be successful over time. Build a model that predicts whether a player will have an NBA career lasting five years or more. 

* The data for this activity consists of performance statistics from each player's rookie year. There are 1,340 observations, and each observation in the data represents a different player in the NBA. Your target variable is a Boolean value that indicates whether a given player will last in the league for five years or more.

## Import Packages 🚢

In [1]:
# operational 
import numpy as np
import pandas as pd

# modeling & evaluation
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score, confusion_matrix, ConfusionMatrixDisplay

#visualization
import matplotlib.pyplot as plt

### Load the dataset

In [2]:
data = pd.read_csv('nba_players_data.csv')
data.head()

     fg    3p    ft  reb  ast  stl  blk  tov  target_5yrs  total_points  efficiency
0  34.7  25.0  69.9  4.1  1.9  0.4  0.4  1.3            0         266.4    0.270073
1  29.6  23.5  76.5  2.4  3.7  1.1  0.5  1.6            0         252.0    0.267658
2  42.2  24.4  67.0  2.2  1.0  0.5  0.3  1.0            0         384.8    0.339869
3  42.6  22.6  68.9  1.9  0.8  0.6  0.1  1.0            1         330.6    0.491379
4  52.4   0.0  67.4  2.5  0.3  0.3  0.4  0.8            1         216.0    0.391304


## Model preparation 🦾

### Isolate your target and predictor variables

In [3]:
# Define the y (target) variable.
y = data[['target_5yrs']]

# Define the X (predictor) variables.
X = data.drop('target_5yrs',  axis=1)

# Display the first 10 rows of your target data.
print(y.head(10))

# Display the first 10 rows of your predictor variables.
print(X.head(10))

   target_5yrs
0            0
1            0
2            0
3            1
4            1
5            0
6            1
7            1
8            0
9            0
     fg    3p    ft  reb  ast  stl  blk  tov  total_points  efficiency
0  34.7  25.0  69.9  4.1  1.9  0.4  0.4  1.3         266.4    0.270073
1  29.6  23.5  76.5  2.4  3.7  1.1  0.5  1.6         252.0    0.267658
2  42.2  24.4  67.0  2.2  1.0  0.5  0.3  1.0         384.8    0.339869
3  42.6  22.6  68.9  1.9  0.8  0.6  0.1  1.0         330.6    0.491379
4  52.4   0.0  67.4  2.5  0.3  0.3  0.4  0.8         216.0    0.391304
5  42.3  32.5  73.2  0.8  1.8  0.4  0.0  0.7         277.5    0.324561
6  43.5  50.0  81.1  2.0  0.6  0.2  0.1  0.7         409.2    0.605505
7  41.5  30.0  87.5  1.7  0.2  0.2  0.1  0.7         273.6    0.553398
8  39.2  23.3  71.4  0.8  2.3  0.3  0.0  1.1         156.0    0.242424
9  38.3  21.4  67.8  1.1  0.3  0.2  0.0  0.7         155.4    0.435294


### Perform a split operation on your data 

Divide your data into a training set (75% of data) and test set (25% of data).

In [4]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### Print the shape of each output 

In [5]:
print('The shape of x train split is ',x_train.shape)
print('The shape of x test split is ', x_test.shape)
print('The shape of y train split is ', y_train.shape)
print('The shape of y test split is ', y_test.shape)

The shape of x train split is  (1005, 10)
The shape of x test split is  (335, 10)
The shape of y train split is  (1005, 1)
The shape of y test split is  (335, 1)


**Question:** How many rows are in each of the outputs?

Each training DataFrame contains 1,005 rows, while each test DataFrame contains 335 rows. Additionally, there are 10 columns in each X DataFrame, with only one column in each y DataFrame.
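These row counts follow directly from `test_size=0.25`. As a rough check, assuming the 1,340 total rows implied by the printed shapes and that scikit-learn rounds the test share up to a whole row, the arithmetic can be sketched as:

```python
import math

n_rows = 1340      # 1,005 + 335, taken from the printed shapes
test_size = 0.25

# The test set gets the requested share, rounded up to a whole row.
n_test = math.ceil(test_size * n_rows)
# The training set gets whatever is left.
n_train = n_rows - n_test

print(n_train, n_test)  # 1005 335
```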

## Model building 🤖

### Fit your model to your training data and predict on your test data

In [6]:
# Instantiate the Gaussian Naive Bayes classifier.
nb = GaussianNB()

# Fit the model on your training data. y_train is a single-column DataFrame,
# so flatten it to 1-D, the label shape scikit-learn expects.
model = nb.fit(x_train, y_train.values.ravel())

# Apply your model to predict on your test data.
y_pred = model.predict(x_test)
y_pred


array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
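
To give a feel for what `GaussianNB` does when it predicts, here is a minimal pure-Python sketch of the idea: each predictor is treated as normally distributed within each class, and the predicted class is the one with the highest log prior plus summed per-feature log-likelihoods. The function names, the two features, and the per-class means and variances below are all hypothetical values chosen for illustration; they are not fitted on the NBA data, and this is not scikit-learn's actual implementation.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_one(x, params, priors):
    """Return the class whose log prior + summed log-likelihoods is largest."""
    scores = {}
    for label, feature_params in params.items():
        score = math.log(priors[label])
        # "Naive" assumption: features are independent given the class,
        # so their log-likelihoods simply add up.
        for value, (mean, var) in zip(x, feature_params):
            score += gaussian_log_pdf(value, mean, var)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical (mean, variance) per feature for each class.
params = {
    0: [(2.0, 1.0), (40.0, 25.0)],
    1: [(4.0, 1.0), (45.0, 25.0)],
}
priors = {0: 0.5, 1: 0.5}

label_low = predict_one([2.1, 39.0], params, priors)   # close to class 0's means
label_high = predict_one([4.2, 46.0], params, priors)  # close to class 1's means
print(label_low, label_high)
```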

## Results and evaluation 🧪


### Leverage metrics to evaluate your model's performance

In [7]:
# scikit-learn's metric functions take the true labels first, then the predictions.
# Your accuracy score.
print('The Accuracy score is %.3f' % accuracy_score(y_test, y_pred))

# Your precision score.
print('The Precision score is %.3f' % precision_score(y_test, y_pred))

# Your recall score.
print('The Recall score is %.3f' % recall_score(y_test, y_pred))

# Your F1 score.
print('The F1 score is %.3f' % f1_score(y_test, y_pred))

The Accuracy score is 0.654
The Precision score is 0.838
The Recall score is 0.548
The F1 score is 0.663


**Question:** What is the accuracy score for your model, and what does this tell you about the success of the model's performance?



The accuracy score for this model is 0.654, or 65.4% accurate: the model classifies roughly two out of every three players in the test set correctly.

**Question:** What are the precision and recall scores for your model, and what do they mean? Is one of these scores more accurate than the other?


Precision and recall are both useful for evaluating a model's predictive capability because each accounts for a different kind of error: precision is penalized by false positives, and recall by false negatives.

With the true labels passed first and the predictions second, as scikit-learn's metric functions expect (swapping the two arguments exchanges precision and recall), the model has a precision of 0.838: when it predicts that a player will last five years or more, it is right about 84% of the time. The recall of 0.548 is considerably lower: the model identifies only about 55% of the players who actually go on to play five years or more, so it produces many false negatives. Neither score is more "accurate" than the other; they measure different things, and together they give a better assessment of model performance than accuracy does alone.
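
The effect of argument order can be illustrated with a small pure-Python sketch. The helper `precision_recall` and the eight labels below are hypothetical and exist only for illustration; they are not the NBA data.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical labels: 2 true positives, 2 false negatives, 1 false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 0, 1]

precision, recall = precision_recall(y_true, y_hat)
# Passing the arguments in the wrong order swaps the two scores.
swapped_precision, swapped_recall = precision_recall(y_hat, y_true)

print(round(precision, 3), round(recall, 3))                  # 0.667 0.5
print(round(swapped_precision, 3), round(swapped_recall, 3))  # 0.5 0.667
```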

**Question:** What is the F1 score of your model, and what does this score mean?

The F1 score is the harmonic mean of precision and recall, so it gives a combined assessment of how well this model delivers predictions. In this case, the F1 score is 0.663, which suggests moderate predictive power in this model.
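
As a quick sanity check, the F1 score can be recomputed by hand from the two values printed above (0.838 and 0.548). Because F1 is symmetric in precision and recall, the result does not depend on which of the two is which. The inputs are already rounded, so this is only an approximate check.

```python
precision = 0.838
recall = 0.548

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 3))  # 0.663
```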

### Gain clarity with the confusion matrix

In [8]:
# Construct and display your confusion matrix.
# Rows are the true labels and columns are the predicted labels.
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
dspl = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=model.classes_)
dspl.plot()
plt.show()

**Question:** What do you notice when observing your confusion matrix, and does this correlate to any of your other calculations?


- The top-left to bottom-right diagonal of the confusion matrix holds the correct predictions (true negatives and true positives); their share of all predictions is the accuracy.

- True positives far outnumber false positives. This ratio is why the precision score is high (0.838).

- False negatives are almost as numerous as true positives, which explains the lower recall score (0.548).
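
The link between the matrix cells and the metrics can be sketched in a few lines. scikit-learn lays a binary confusion matrix out as `[[TN, FP], [FN, TP]]`, with true labels on the rows and predicted labels on the columns. The counts below are an assumption: they were chosen because they sum to the 335 test rows and reproduce the printed scores after rounding, but the notebook's actual matrix may differ.

```python
# Illustrative counts only (assumed, not read from the notebook's output).
tn, fp = 105, 22
fn, tp = 94, 114

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # correct predictions on the diagonal
precision = tp / (tp + fp)     # share of predicted positives that are right
recall = tp / (tp + fn)        # share of actual positives that are found
f1 = 2 * tp / (2 * tp + fp + fn)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.654 0.838 0.548 0.663
```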

## Considerations

**How would you summarize your findings to stakeholders?**

- The model created provides some value in predicting an NBA player's chances of playing for five years or more.
- Notably, the model's positive predictions are more reliable than its coverage: when it predicts that a player will last five years or more it is usually right, but it misses a large share of the players who do go on to have long careers.