# scikit-learn API Experimentation
## Accuracy

Accuracy is the number of correct predictions divided by the total number of predictions.

There are a few different ways to compute accuracy.

In [1]:
# Load Titanic Data
%cd -q ../projects/titanic
%run LoadTitanicData.py
%cd -q -

# X: features
# y: target variable
print('X Shape: ', X.shape)
print('y Shape: ', y.shape)
print('X columns:\n', X.columns.values)
print('y name:',y.name)

X Shape:  (891, 11)
y Shape:  (891,)
X columns:
 ['PassengerId' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Ticket' 'Fare'
 'Cabin' 'Embarked']
y name: Survived


In [2]:
# Build a model
drop_fields = ['Name', 'Sex', 'Ticket', 'Cabin', 
               'Embarked', 'PassengerId', 'Age']

# Remove all non-numeric fields, plus PassengerId and Age (1st iteration only)
X = X.drop(drop_fields, axis=1)
X.dtypes

# create train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, stratify=y, random_state=10)

# Create instance of LogisticRegression estimator
from sklearn.linear_model import LogisticRegression
base_model = LogisticRegression()

# Build Model on training data
# fit() returns the fitted estimator itself (the same object as base_model)
model_info = base_model.fit(X_train, y_train)

### accuracy_score()

In [3]:
# Make Predictions
predictions = base_model.predict(X_test)

# Compute accuracy using sklearn
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))

# Compute accuracy manually to be sure we understand accuracy_score()
print((y_test == predictions).mean())

0.6305970149253731
0.6305970149253731


We see that accuracy_score() gave us the same value as taking the mean of the element-wise comparison between y_test and predictions.

Using mean() to compute the fraction of True values in a boolean collection may seem confusing at first, but it is a commonly used idiom: True counts as 1 and False as 0, so the mean is the number of True values divided by the total number of values.

To fully understand what is happening, it is often helpful to look at the data types of the objects.  Let's look at the last line of code above in more detail.

In [4]:
import numpy as np  # needed for np.unique() below if not already imported

print('Predictions Collection Type: ', type(predictions))
print('Predictions Value Type:      ', predictions.dtype)
print('Predictions Values:          ', np.unique(predictions))
print('y_test Collection Type:      ', type(y_test))
print('y_test Value Type:           ', y_test.dtype)
print('y_test Values:               ', y_test.unique())
print('Comparison Collection Type:  ', type(predictions == y_test))
print('Comparison Value Type:       ', (predictions == y_test).dtype)
print('Comparison Values:           ', (predictions == y_test).unique())
print('Accuracy:                     {:.4f}'.format((y_test == predictions).mean()))

Predictions Collection Type:  <class 'numpy.ndarray'>
Predictions Value Type:       int64
Predictions Values:           [0 1]
y_test Collection Type:       <class 'pandas.core.series.Series'>
y_test Value Type:            int64
y_test Values:                [0 1]
Comparison Collection Type:   <class 'pandas.core.series.Series'>
Comparison Value Type:        bool
Comparison Values:            [ True False]
Accuracy:                     0.6306


We see that:
* predictions, returned from predict(), is a numpy array of integers
* y_test, the response (or target) variable for the test set, is a Pandas Series
* element-wise comparisons between a numpy array and a Pandas Series of the same length are allowed
* this comparison results in a Pandas Series of type bool
* taking the mean of a boolean collection gives the fraction of True values
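
The points above can be reproduced on a small scale. The following is a minimal sketch using made-up labels (y_true_toy and y_pred_toy are hypothetical and unrelated to the Titanic data), assuming numpy and pandas are installed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels, not the Titanic data
y_true_toy = pd.Series([1, 0, 1, 1, 0], index=[10, 11, 12, 13, 14])
y_pred_toy = np.array([1, 0, 0, 1, 1])

# Series == ndarray compares element by element, by position,
# and returns a boolean Series that keeps the Series index
matches = (y_true_toy == y_pred_toy)
print(type(matches).__name__, matches.dtype)

# True counts as 1 and False as 0, so mean() is (number of True) / (length)
n_correct = matches.sum()
accuracy = matches.mean()
print(n_correct, len(matches), accuracy)
```

Here three of the five predictions match, so the mean of the boolean Series is 3/5.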

### Confusion Matrix

For binary classification problems such as this, the following terms are often used:
<pre>
TP = True  Positive
FP = False Positive  (also called a Type I error)
TN = True  Negative
FN = False Negative  (also called a Type II error)
</pre>

In this example, "positive" means "Survived".

TP means we predicted survived and that passenger did survive.  
FP means we predicted survived and the passenger did not survive.  
TN means we predicted not-survived and the passenger did not survive.  
FN means we predicted not-survived and the passenger did survive.

Accuracy is the number of correct predictions (true negatives plus true positives) divided by the total number of predictions:
(TN + TP) / (TN + FP + FN + TP)
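
As a quick arithmetic check of the formula, here is a minimal sketch with made-up counts. The function name accuracy_from_counts and the numbers passed to it are illustrative assumptions; they are not part of sklearn and are not the Titanic results.

```python
def accuracy_from_counts(tn, fp, fn, tp):
    """Return (TN + TP) / (TN + FP + FN + TP) for the four cell counts."""
    total = tn + fp + fn + tp
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tn + tp) / total


# Hypothetical counts: 85 correct predictions out of 100
example_accuracy = accuracy_from_counts(tn=50, fp=10, fn=5, tp=35)
print(example_accuracy)  # 0.85
```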


For a binary classification problem, sklearn arranges these four counts in a [Confusion Matrix](http://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix), with actual classes as rows and predicted classes as columns:

In [5]:
# sklearn confusion matrix
print("","TN", "FP\n","FN TP")

 TN FP
 FN TP


In [6]:
# For instructional purposes, let's derive the confusion matrix ourselves
TN = ((y_test == 0) & (predictions == 0)).sum()
FP = ((y_test == 0) & (predictions == 1)).sum()
FN = ((y_test == 1) & (predictions == 0)).sum()
TP = ((y_test == 1) & (predictions == 1)).sum()

my_confusion_matrix = np.array([[TN, FP],[FN, TP]])
print(my_confusion_matrix)

[[126  39]
 [ 60  43]]


In [7]:
# Use sklearn to compute the confusion_matrix
from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(y_test, predictions)
print(confusion)

[[126  39]
 [ 60  43]]


We see that our hand-coded and sklearn confusion matrix results are the same.
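
For binary problems, the four cells can also be unpacked in one step by flattening the matrix with ravel(), which yields them in the row-major order TN, FP, FN, TP. A minimal sketch on a small made-up label set (y_true_toy and y_pred_toy are hypothetical and unrelated to the Titanic data):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical toy labels, not the Titanic data
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes,
# so flattening gives TN, FP, FN, TP for labels 0 and 1
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(tn, fp, fn, tp)

# Accuracy from the cells agrees with accuracy_score()
acc_from_cells = (tn + tp) / (tn + fp + fn + tp)
print(acc_from_cells, accuracy_score(y_true_toy, y_pred_toy))
```

Unpacking this way avoids indexing mistakes such as swapping FP and FN, but it only works when the matrix is 2x2.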

In [8]:
# Compute Accuracy from Confusion Matrix
print('Accuracy: {:.4f}'.format((TN+TP)/(TN + FP + FN + TP)))

Accuracy: 0.6306


We see that we got the same accuracy as before, about 63.1%.
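
One more route, not used in the cells above, is the classifier's own score() method, which for sklearn classifiers returns the mean accuracy on the data it is given. The sketch below uses synthetic data rather than the Titanic set; X_toy, y_toy, and toy_model are hypothetical names introduced only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic two-feature data with a simple linear decision rule (an assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

toy_model = LogisticRegression().fit(X_toy, y_toy)

# score() predicts internally and compares against the labels it is given
via_score = toy_model.score(X_toy, y_toy)
via_metric = accuracy_score(y_toy, toy_model.predict(X_toy))
print(via_score, via_metric)
```

With the notebook's objects, the equivalent call would be base_model.score(X_test, y_test).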