|   | Actual: True | Actual: False |
|---|---|---|
| **Predicted: True** | Correct! | **Type 1 Error** |
| **Predicted: False** | **Type 2 Error** | Correct! |

- **True Positive**: A positive class observation (1) is correctly classified as positive by the model.
- **False Positive**: A negative class observation (0) is incorrectly classified as positive (a Type 1 error).
- **True Negative**: A negative class observation is correctly classified as negative.
- **False Negative**: A positive class observation is incorrectly classified as negative (a Type 2 error).
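
To make the four definitions concrete, here is a minimal sketch that tallies them from a pair of label lists. The helper name `count_outcomes` and the toy labels are made up for illustration; they are not part of the lesson's data.

```python
def count_outcomes(y_true, y_pred, positive=1):
    """Tally TP, FP, TN, FN for a binary problem (illustrative helper)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct only if the observation really is positive
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: correct only if the observation really is negative
            counts["TN" if actual != positive else "FN"] += 1
    return counts


# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = count_outcomes(y_true, y_pred)
print(counts)
```

Every observation lands in exactly one of the four buckets, so the counts always add up to the number of observations.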

Load logistic regression, numpy, pandas, and the train/test split function.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
import numpy as np
import pandas as pd

Return to the Wisconsin breast cancer data. Clean it up as we did before.

In [None]:
column_names = ['id',
                'clump_thickness',
                'cell_size_uniformity',
                'cell_shape_uniformity',
                'marginal_adhesion',
                'single_epithelial_size',
                'bare_nuclei',
                'bland_chromatin',
                'normal_nucleoli',
                'mitoses',
                'class']

bcw = pd.read_csv('../assets/datasets/breast-cancer-wisconsin.csv',
                 names=column_names, na_values=['?'])

bcw.head(10)

In [None]:
bcw.dropna(inplace=True)
print(bcw.shape)
bcw.head(8)

Create a percentage score across the predictor columns for simplicity in this lesson. Each of the nine predictors is scored from 1 to 10, so the largest possible row total is 90.

In [None]:
# Select every column in column_names except "class" and "id", keeping the original column order
subset_mask = [col for col in column_names if col not in ('class', 'id')]
subset_mask

In [None]:
bcw[subset_mask].sum(axis=1)/90.  # axis=1 sums across the predictor columns of each row; 9 predictors * max score 10 = 90

In [None]:
bcw['metrics_pct'] = bcw[subset_mask].sum(axis=1)/90.
bcw['class'] = bcw['class'].map(lambda x: 0 if x == 2 else 1) # Recode the class labels: 2 (benign) -> 0 (healthy), 4 (malignant) -> 1 (cancer)
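
The scoring and recoding steps can be sketched on a tiny made-up frame. The column names `feature_a` to `feature_c` and the values are hypothetical; only the 1 to 10 scale and the 2/4 class codes mirror the lesson's data.

```python
import pandas as pd

# Hypothetical toy data: three features scored 1-10, plus the raw 2/4 class codes
toy = pd.DataFrame({
    "feature_a": [1, 10, 5],
    "feature_b": [1, 10, 4],
    "feature_c": [1, 10, 6],
    "class": [2, 4, 2],
})
feature_cols = ["feature_a", "feature_b", "feature_c"]

# Largest possible row total = number of features * maximum score per feature
max_total = len(feature_cols) * 10

# axis=1 adds up the feature columns within each row
toy["metrics_pct"] = toy[feature_cols].sum(axis=1) / max_total

# Recode the class labels: 2 -> 0, 4 -> 1
toy["class"] = toy["class"].map({2: 0, 4: 1})

print(toy[["metrics_pct", "class"]])
```

Because each feature's minimum score is 1 rather than 0, the smallest value this score can take is 0.1, not 0.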

In [None]:
# Notice our class and new metrics_pct
bcw.head(10)

In [None]:
print('Patients with cancer:', np.sum(bcw[['class']].values))

Split into a 67% training set and a 33% testing set:
>```
>X = metrics_pct (predictor)
>Y = class (non-cancer:0 vs cancer:1)
>```

In [None]:
metrics_pct = np.array(bcw.metrics_pct.values)
metrics_pct = metrics_pct[:, np.newaxis]

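# stratify keeps the cancer/healthy proportions the same in the training and test sets
# (passing the labels as a 1D array avoids a shape warning when fitting)
X_train, X_test, Y_train, Y_test = train_test_split(metrics_pct, bcw['class'].values,
                                                    test_size=0.33, stratify=bcw['class'].values,
                                                    random_state=77)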
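
As a quick illustration of what `stratify` does, the sketch below splits a made-up, imbalanced label vector and counts the positives on each side. The data and variable names are hypothetical and unrelated to the breast cancer set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 negatives (0) and 30 positives (1)
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)   # placeholder single-column predictor

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Both splits keep the original 30% share of positives
print("train positives:", y_train.sum(), "of", len(y_train))
print("test positives: ", y_test.sum(), "of", len(y_test))
```

Without `stratify`, the share of positives in each split would vary with the random seed, which matters most when one class is rare.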

Fit the logistic regression on the training data

In [None]:
logreg = LogisticRegression(random_state=77)
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)

Look at the confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix

conmat = np.array(confusion_matrix(Y_test, Y_pred, labels=[1,0]))

confusion = pd.DataFrame(conmat, index=['has_cancer', 'is_healthy'],
                         columns=['predicted_cancer','predicted_healthy'])

print(confusion)
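
The `labels=[1,0]` argument controls the row and column order of the matrix. A small sketch on made-up labels (not the model's predictions) shows the difference from the default ordering:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Default: labels sorted ascending, so the layout is [[TN, FP], [FN, TP]]
default_order = confusion_matrix(y_true, y_pred)

# labels=[1, 0]: positive class first, so the layout is [[TP, FN], [FP, TN]]
positive_first = confusion_matrix(y_true, y_pred, labels=[1, 0])

print(default_order)
print(positive_first)
```

In both cases rows are the actual classes and columns are the predicted classes; only the order of the classes changes.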

Calculate true positives, false positives, true negatives, and false negatives from the confusion matrix

In [None]:
# .loc selects by label: row label first, then column label (.ix was removed from pandas)
TP = confusion.loc['has_cancer', 'predicted_cancer']
FP = confusion.loc['is_healthy', 'predicted_cancer']
TN = confusion.loc['is_healthy', 'predicted_healthy']
FN = confusion.loc['has_cancer', 'predicted_healthy']

# list() is needed in Python 3, where zip returns a lazy iterator
print(list(zip(['True Positives','False Positives','True Negatives','False Negatives'],
               [TP, FP, TN, FN])))

## Check

- People with cancer:  ??
- People without cancer: ??

Calculate the accuracy with the accuracy_score() function from sklearn

In [None]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(Y_test, Y_pred)
print(acc)

Show that the accuracy is equivalent to: (True Positives + True Negatives) / Total

In [None]:
print((TP + TN) / float(len(Y_test)))
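
The parentheses in the accuracy formula matter, because division binds tighter than addition. A minimal sketch with made-up counts (not the model's results), assuming Python 3 division:

```python
# Hypothetical counts, not results from the model above
TP, FP, TN, FN = 3, 2, 4, 1
total = TP + FP + TN + FN

accuracy = (TP + TN) / total        # correct: 7 / 10
no_parentheses = TP + TN / total    # wrong: evaluates as 3 + (4 / 10)

print(accuracy, no_parentheses)
```

The second value is greater than 1, which is a quick sign that it cannot be a proportion.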

Create the classification report with the classification_report() function

In [None]:
from sklearn.metrics import classification_report

cls_rep = classification_report(Y_test, Y_pred)
print(cls_rep)
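
If the numbers in the report are needed for further calculation rather than just printing, `classification_report` can return a dictionary instead of a string (the `output_dict` parameter, available in scikit-learn 0.20 and later). A sketch on made-up labels:

```python
from sklearn.metrics import classification_report

# Hypothetical labels: TP = 3, FN = 1, FP = 2, TN = 4
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

report = classification_report(y_true, y_pred, output_dict=True)

# The per-class entries are keyed by the class label converted to a string
for label in ("0", "1"):
    row = report[label]
    print(label, row["precision"], row["recall"], row["f1-score"], row["support"])
```

The `support` column is the number of actual observations in each class, which is why it sums to the size of the test set.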

Show that the precision (for 1 vs 0) is equivalent to: True Positives / (True Positives + False Positives)

In [None]:
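# Precision for class 1 (cancer): of everything predicted as cancer, how much really is cancer?
print(float(TP) / (TP + FP))

# Precision for class 0 (healthy): of everything predicted as healthy, how much really is healthy?
print(float(TN) / (TN + FN))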
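
The same two numbers can be cross-checked with scikit-learn's `precision_score`, using `pos_label` to choose which class is treated as "positive". The labels below are made up for illustration.

```python
from sklearn.metrics import precision_score

# Hypothetical labels: TP = 3, FN = 1, FP = 2, TN = 4
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision_pos = precision_score(y_true, y_pred, pos_label=1)   # TP / (TP + FP) = 3 / 5
precision_neg = precision_score(y_true, y_pred, pos_label=0)   # TN / (TN + FN) = 4 / 5

print(precision_pos, precision_neg)
```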

Show that the recall (for 1 vs 0) is equivalent to: True Positives / (True Positives + False Negatives)

In [None]:
# How many members of each class did we "recall" correctly?
# Recall for class 1 (cancer)
print(float(TP) / (TP + FN))

# Recall for class 0 (healthy)
print(float(TN) / (TN + FP))
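
Recall for the positive class is often called sensitivity, and recall for the negative class is often called specificity. A plain-Python sketch on made-up labels, where `recall_for` is a hypothetical helper rather than a library function:

```python
def recall_for(label, y_true, y_pred):
    """Share of observations that truly are `label` and were also predicted as `label`."""
    preds_for_label = [pred for actual, pred in zip(y_true, y_pred) if actual == label]
    return sum(pred == label for pred in preds_for_label) / len(preds_for_label)


# Hypothetical labels: TP = 3, FN = 1, FP = 2, TN = 4
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

sensitivity = recall_for(1, y_true, y_pred)   # TP / (TP + FN) = 3 / 4
specificity = recall_for(0, y_true, y_pred)   # TN / (TN + FP) = 4 / 6

print(sensitivity, specificity)
```

Note that recall divides by the number of actual members of a class, while precision divides by the number of predictions for that class.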

Show that the F1-score is equivalent to: 2 * (Precision * Recall) / (Precision + Recall)

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

In [None]:
# F1-score for class 1 (cancer)
pos_precision = float(TP) / (TP + FP)
pos_recall = float(TP) / (TP + FN)
print(2. * (pos_precision * pos_recall) / (pos_precision + pos_recall))

# F1-score for class 0 (healthy)
neg_precision = float(TN) / (TN + FN)
neg_recall = float(TN) / (TN + FP)
print(2. * (neg_precision * neg_recall) / (neg_precision + neg_recall))
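
Substituting the definitions of precision and recall into the F1 formula gives an equivalent form written directly in counts: 2 * TP / (2 * TP + FP + FN). A small sketch with made-up counts (not the model's results) shows that the two forms agree:

```python
# Hypothetical counts, not results from the model above
TP, FP, TN, FN = 3, 2, 4, 1

precision = TP / (TP + FP)
recall = TP / (TP + FN)

# F1 as the harmonic mean of precision and recall
f1_from_rates = 2 * (precision * recall) / (precision + recall)

# The same quantity written in terms of the raw counts
f1_from_counts = 2 * TP / (2 * TP + FP + FN)

print(f1_from_rates, f1_from_counts)
```

True negatives do not appear in either form, so the F1-score for a class ignores how well the other class is handled.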