Baseline for classification
---

In [1]:
import pandas as pd

# Load data
data_df = pd.read_csv('heart-numerical.csv')

# First five rows
data_df.head()

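|   | age | trestbps | chol | thalach | oldpeak | ca | disease |
|---|---|---|---|---|---|---|---|
| 0 | 63 | 145 | 233 | 150 | 2.3 | 0 | absence |
| 1 | 67 | 160 | 286 | 108 | 1.5 | 3 | presence |
| 2 | 67 | 120 | 229 | 129 | 2.6 | 2 | presence |
| 3 | 37 | 130 | 250 | 187 | 3.5 | 0 | absence |
| 4 | 41 | 130 | 204 | 172 | 1.4 | 0 | absence |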


In [2]:
import numpy as np

# Create X/y arrays
X = data_df.drop('disease', axis=1).values
y = data_df.disease.values

print('X:', X.shape, X.dtype)
print('y:', y.shape, y.dtype)

# Print labels
labels = np.unique(y)
print('Labels:', labels)

X: (303, 6) float64
y: (303,) object
Labels: ['absence' 'presence']


In [3]:
from sklearn.model_selection import train_test_split

# Split data
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0)

print('Train set:', X_tr.shape, y_tr.shape)
print('Test set:', X_te.shape, y_te.shape)

Train set: (212, 6) (212,)
Test set: (91, 6) (91,)


## The 'most frequent' baseline

In [4]:
# Count the number of entries labeled with 'absence'
n_absence = np.sum(y_tr == 'absence')

print('Total absence:', n_absence)

Total absence: 117


In [5]:
# Probability of 'absence'
p_absence = n_absence / len(y_tr)

print('Probability of absence: {:.2f}'.format(p_absence))

Probability of absence: 0.55


In [6]:
# Probability of 'absence' on the test set
p_absence_te = np.sum(y_te == 'absence') / len(y_te)
print('Probability of absence: {:.2f}'.format(p_absence_te))

Probability of absence: 0.52


## Multiple classes

In [9]:
# Compute distribution using Pandas
pd.Series(y_tr).value_counts() / len(y_tr)

absence     0.551887
presence    0.448113
dtype: float64

In [10]:
# Same distribution using the normalize parameter
pd.Series(y_tr).value_counts(normalize=True)

absence     0.551887
presence    0.448113
dtype: float64

## Scikit-learn DummyClassifier

In [11]:
from sklearn.dummy import DummyClassifier

# Create the dummy classifier
dummy = DummyClassifier(strategy='most_frequent')

# Fit it
dummy.fit(None, y_tr)

# Compute test accuracy
accuracy = dummy.score(None, y_te)
print('Accuracy: {:.2f}'.format(accuracy))

Accuracy: 0.52


## Confusion matrix

In [12]:
# "Most-frequent" predictions
y_pred_absence = dummy.predict(X_te)
print('Predicted:', y_pred_absence[:5], '..')
print('True labels:', y_te[:5], '..')

Predicted: ['absence' 'absence' 'absence' 'absence' 'absence'] ..
True labels: ['absence' 'absence' 'presence' 'absence' 'presence'] ..


In [13]:
from sklearn.metrics import confusion_matrix

# Confusion matrix
matrix = confusion_matrix(y_true=y_te, y_pred=y_pred_absence)
print(matrix)

[[47  0]
 [44  0]]


In [14]:
# Confusion matrix as a DataFrame
matrix_df = pd.DataFrame(
    matrix, 
    columns=['pred: absence', 'pred: presence'],
    index=['true: absence', 'true: presence']
)
matrix_df

|   | pred: absence | pred: presence |
|---|---|---|
| true: absence | 47 | 0 |
| true: presence | 44 | 0 |
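
With 'presence' as the positive class, the four cells of this matrix correspond to the counts of true negatives, false positives, false negatives and true positives. The snippet below is a minimal sketch that hard-codes the matrix printed above and assumes scikit-learn's default label ordering (sorted, so 'absence' comes first). The names `tn`, `fp`, `fn` and `tp` are introduced here for illustration.

```python
import numpy as np

# Confusion matrix of the "most-frequent" baseline, copied from the output above
# (rows: true labels, columns: predicted labels, order: absence, presence)
matrix = np.array([[47, 0],
                   [44, 0]])

# With 'presence' as the positive class, flattening row by row gives
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = matrix.ravel()

print('tn:', tn, 'fp:', fp, 'fn:', fn, 'tp:', tp)
```

The baseline never predicts 'presence', so both `tp` and `fp` are zero.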


### Precision
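Intuitively, precision answers the question "Of all the points we predict as positive, how many are actually positive?". With $tp$ the number of true positives and $fp$ the number of false positives, the formula is $$ \text{precision} = \frac{tp}{tp + fp} $$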

In [16]:
from sklearn.metrics import precision_score

precision_score(y_true=y_te, y_pred=y_pred_absence, pos_label='presence')

0.0

In our case, the precision is undefined because our "most-frequent" baseline never predicts 'presence', so $tp + fp = 0$. Scikit-learn warns about this and returns 0.0 instead.

However, we can compute it for an "always predicts presence" baseline:



In [17]:
# Precision of the "always predicts presence" baseline
y_pred_presence = np.full_like(y_te, fill_value='presence')
precision_score(y_true=y_te, y_pred=y_pred_presence, pos_label='presence')

0.4835164835164835

### Recall
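Intuitively, recall answers the question "Of all the points that are actually positive, how many do we predict as positive?". With $tp$ the number of true positives and $fn$ the number of false negatives, the formula is $$ \text{recall} = \frac{tp}{tp + fn} $$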

In [18]:
from sklearn.metrics import recall_score

recall_score(y_true=y_te, y_pred=y_pred_absence, pos_label='presence')

0.0

In [19]:
# Recall of the "always predicts presence" baseline
recall_score(y_true=y_te, y_pred=y_pred_presence, pos_label='presence')

1.0
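
As a sanity check, both values can be reproduced by hand from the class counts. This is a minimal sketch that hard-codes the test set counts reported above (47 'absence' and 44 'presence' points) for the "always predicts presence" baseline; the variable names are chosen here for illustration.

```python
# Counts for the "always predicts presence" baseline on the test set,
# with 'presence' as the positive class (47 absence, 44 presence points)
tp = 44  # presence points predicted as presence
fp = 47  # absence points predicted as presence
fn = 0   # presence points predicted as absence

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print('Precision: {:.4f}'.format(precision))
print('Recall: {:.4f}'.format(recall))
```

The baseline catches every positive point, hence the perfect recall, but fewer than half of its positive predictions are correct.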

### F1 Score
The F1 score is a way to combine the precision and recall metrics into a single score. The formula is
$$ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$

In [20]:
from sklearn.metrics import f1_score

f1_score(y_true=y_te, y_pred=y_pred_presence, pos_label='presence')

0.6518518518518518
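
The same value follows directly from the formula. This is a small sketch that plugs in the precision and recall of the "always predicts presence" baseline computed earlier (44/91 and 1.0).

```python
# Precision and recall of the "always predicts presence" baseline
precision = 44 / 91
recall = 1.0

# The F1 score is the harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)

print('F1 score: {:.4f}'.format(f1))
```

Because the harmonic mean is pulled toward the smaller of the two values, the perfect recall cannot fully compensate for the low precision.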

> Note: Scikit-learn also provides an fbeta_score() function with a "beta" parameter that sets the balance between recall and precision in the score. You can read more about it on the sklearn.metrics.fbeta_score documentation page.
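
To illustrate the effect of "beta", the sketch below evaluates the "always predicts presence" baseline with two different values. It does not reuse the notebook's `y_te` array. Instead, it assumes a stand-in label array with the same class counts as the test set (47 'absence', 44 'presence'), and the names `y_true`, `y_pred`, `f2` and `f_half` are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Stand-in for the test labels: 47 'absence' and 44 'presence' points
y_true = np.array(['absence'] * 47 + ['presence'] * 44)

# "Always predicts presence" baseline
y_pred = np.full(len(y_true), 'presence')

# beta > 1 gives more weight to recall, beta < 1 gives more weight to precision
f2 = fbeta_score(y_true, y_pred, beta=2, pos_label='presence')
f_half = fbeta_score(y_true, y_pred, beta=0.5, pos_label='presence')

print('F2: {:.2f}'.format(f2))
print('F0.5: {:.2f}'.format(f_half))
```

Since this baseline has perfect recall but low precision, the recall-oriented F2 score is higher than the F1 score, and the precision-oriented F0.5 score is lower.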

## Classification report

In [21]:
from sklearn.metrics import classification_report

report = classification_report(y_true=y_te, y_pred=y_pred_presence)
print(report)

              precision    recall  f1-score   support

     absence       0.00      0.00      0.00        47
    presence       0.48      1.00      0.65        44

   micro avg       0.48      0.48      0.48        91
   macro avg       0.24      0.50      0.33        91
weighted avg       0.23      0.48      0.32        91



> Note: In this table, support corresponds to the number of points in each class. Micro, macro and weighted averages refer to different ways to combine the results when there are multiple classes.
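
The averages in the precision column can be reproduced by hand. This is a minimal sketch that hard-codes the per-class precision and support of the "always predicts presence" baseline from the report above; the dictionary and variable names are chosen here for illustration.

```python
# Per-class precision and support for the "always predicts presence" baseline
precision = {'absence': 0.0, 'presence': 44 / 91}
support = {'absence': 47, 'presence': 44}
n_total = sum(support.values())

# Macro average: unweighted mean over the classes
macro = sum(precision.values()) / len(precision)

# Weighted average: mean over the classes, weighted by their support
weighted = sum(precision[c] * support[c] for c in precision) / n_total

# Micro average: total true positives over total predictions, pooled over
# classes (0 correct 'absence' + 44 correct 'presence' out of 91 predictions)
micro = (0 + 44) / n_total

print('micro: {:.2f}, macro: {:.2f}, weighted: {:.2f}'.format(
    micro, macro, weighted))
```

The three values match the precision column of the report: 0.48, 0.24 and 0.23.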