TPV, FPV, AUC, confusion matrix, NDCG

**The choice of metrics influences how the performance of machine learning algorithms is measured and compared. It determines how you weight the importance of different characteristics in the results, and ultimately which algorithm you choose.**

# Evaluation Metrics
- Classification Accuracy
- AUC – ROC
- Confusion Matrix
- NDCG

-------------------------------

## Dataset in use:

Metrics are demonstrated for classification machine learning problems.

**Pima Indians Diabetes Data Set: https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes**

This is a binary classification problem where all of the input variables are numeric.

**Attributes:**

1. Number of times pregnant 
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
3. Diastolic blood pressure (mm Hg) 
4. Triceps skin fold thickness (mm) 
5. 2-Hour serum insulin (mu U/ml) 
6. Body mass index (weight in kg/(height in m)^2) 
7. Diabetes pedigree function 
8. Age (years) 
9. Class variable (0 or 1) 

**Description:**

The diagnostic, binary-valued variable investigated is whether the
patient shows signs of diabetes according to World Health Organization
criteria. Pima Indians are a group of Native Americans living near Phoenix, Arizona, USA.

In the earlier study cited in the dataset description, the ADAP algorithm (an adaptive learning routine that generates and executes digital analogs of perceptron-like devices) makes a real-valued prediction between
0 and 1. This was transformed into a binary decision using a cutoff of
0.448. Using 576 training instances, the sensitivity and specificity
of that algorithm were both 76% on the remaining 192 instances.
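
The cutoff step described above can be sketched as follows. The scores are made-up values for illustration, not ADAP outputs; only the 0.448 cutoff comes from the description, and the use of `>=` rather than `>` is an assumption.

```python
import numpy as np

# Hypothetical real-valued predictions in [0, 1] (illustrative, not ADAP output)
scores = np.array([0.10, 0.44, 0.45, 0.80])

cutoff = 0.448  # cutoff quoted in the dataset description
# Assumption: scores at or above the cutoff are labelled positive (class 1)
decisions = (scores >= cutoff).astype(int)

print(decisions)  # [0 0 1 1]
```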

Now let's have a look at the dataset.

In [1]:
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
dataframe.head()

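|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |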


Convert the dataframe into a NumPy array

In [2]:
array = dataframe.values
array

array([[   6.   ,  148.   ,   72.   , ...,    0.627,   50.   ,    1.   ],
       [   1.   ,   85.   ,   66.   , ...,    0.351,   31.   ,    0.   ],
       [   8.   ,  183.   ,   64.   , ...,    0.672,   32.   ,    1.   ],
       ..., 
       [   5.   ,  121.   ,   72.   , ...,    0.245,   30.   ,    0.   ],
       [   1.   ,  126.   ,   60.   , ...,    0.349,   47.   ,    1.   ],
       [   1.   ,   93.   ,   70.   , ...,    0.315,   23.   ,    0.   ]])

Separate the input attributes (X) from the class label (Y)

In [3]:
X = array[:,0:8] # [preg, plas, pres, skin, test, mass, pedi, age]
Y = array[:,8] # [class]

All metrics are demonstrated with the same algorithm, logistic regression. A 10-fold cross-validation test harness is used for each metric, because this is the most likely scenario in which you will use different algorithm evaluation metrics.

The `cross_val_score` function is used to report the performance for each metric. All scores follow the convention that a larger score is better, so when sorted in ascending order the best score comes last.
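
As a rough sketch of this harness, the snippet below runs the same model under several `scoring` strings. It uses synthetic data from `make_classification` as a stand-in for the diabetes data (an assumption, so that it runs without a download), and `max_iter=1000` is an illustrative setting rather than one taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the diabetes data: 768 rows, 8 numeric features
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=7)

kfold = KFold(n_splits=10)  # 10 consecutive folds, no shuffling
model = LogisticRegression(max_iter=1000)

# cross_val_score returns one score per fold; `scoring` selects the metric
for scoring in ("accuracy", "roc_auc", "neg_log_loss"):
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring=scoring)
    print(f"{scoring}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Log loss is reported negated (`neg_log_loss`), which is how scikit-learn keeps the "larger is better" convention for metrics that are naturally minimized.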

------------

## 1. Classification Accuracy
= the number of correct predictions made as a ratio of all predictions made.

It is really only suitable when there is an equal number of observations in each class (which is rarely the case) and when all predictions and prediction errors are equally important (which is often not the case).
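
A minimal illustration of why class imbalance makes accuracy misleading, using made-up labels (not the diabetes data):

```python
import numpy as np

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Accuracy = correct predictions / all predictions
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.9, even though every positive case is missed
```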

In [4]:
# Cross Validation Classification Accuracy

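# Folds are not shuffled, so no random_state is passed
# (recent scikit-learn versions raise an error if random_state is set without shuffle=True)
kfold = model_selection.KFold(n_splits=10)
# liblinear was the default solver in older scikit-learn releases
model = LogisticRegression(solver='liblinear')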
scoring = 'accuracy'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("Accuracy: %.3f%% (%.3f%%)" %(results.mean()*100, results.std()*100))

Accuracy: 76.951% (4.841%)


-------------------------------

## 2. Area Under ROC Curve (AUC)
= performance metric for binary classification problems

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/ROC_curves.svg/1280px-ROC_curves.svg.png" width="500" height="500" style="margin-left: 5px;" alt="ROC curves">

The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random.

In [5]:
# Cross Validation Classification ROC AUC

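# Folds are not shuffled, so no random_state is passed
# (recent scikit-learn versions raise an error if random_state is set without shuffle=True)
kfold = model_selection.KFold(n_splits=10)
# liblinear was the default solver in older scikit-learn releases
model = LogisticRegression(solver='liblinear')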
scoring = 'roc_auc'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))

AUC: 0.824 (0.041)


The mean AUC of 0.824 is well above 0.5 and reasonably close to 1, suggesting some skill in the predictions.
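
One way to read AUC is as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks that reading against `roc_auc_score` on made-up labels and scores (illustrative values, not model output from this notebook).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, y_score))
```

Here 8 of the 9 pairs are ordered correctly, so both values are 8/9.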

------

## 3. Confusion Matrix
= presentation of the accuracy of a model with two or more classes.

<img src="https://de.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/60900/versions/11/screenshot.png" width="500" height="500" style="margin-left: 1px;" alt="Confusion matrix">

It creates a table that presents predictions on the x-axis and actual outcomes on the y-axis. Each cell of the table holds the number of predictions made by a machine learning algorithm for that combination of predicted and actual class.

For example, a machine learning algorithm can predict 0 or 1 and each prediction may actually have been a 0 or 1. Predictions for 0 that were actually 0 appear in the cell for prediction=0 and actual=0, whereas predictions for 0 that were actually 1 appear in the cell for prediction=0 and actual=1. And so on.
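
The cell layout can be checked on a tiny hand-made example (the labels below are hypothetical). In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only
actual    = [0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 1, 0, 1, 1, 1]

# Rows = actual class, columns = predicted class
cm = confusion_matrix(actual, predicted)
print(cm)
# [[2 1]
#  [1 3]]
```

The top-right cell (actual=0, prediction=1) holds the single false positive, and the bottom-left cell (actual=1, prediction=0) holds the single false negative.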

Let's calculate a confusion matrix for a set of predictions made by a model on a test set.

In [6]:
# Confusion matrix on a single held-out test set (train/test split, not cross-validation)

from sklearn.metrics import confusion_matrix

test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
# liblinear was the default solver in older scikit-learn releases
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
matrix = confusion_matrix(Y_test, predicted)  # rows: actual class, columns: predicted class
print(matrix)

[[141  21]
 [ 41  51]]


Although the array is printed without headings, rows correspond to the actual classes and columns to the predicted classes. The majority of the predictions fall on the diagonal of the matrix (141 + 51 = 192 correct predictions out of 254).
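
The rates usually read off a confusion matrix can be derived directly from the four cells. The sketch below uses the matrix printed above and assumes scikit-learn's layout (rows = actual, columns = predicted, class 1 as the positive class).

```python
import numpy as np

# Matrix from the output above: rows = actual class, columns = predicted class
matrix = np.array([[141, 21],
                   [41, 51]])

tn, fp, fn, tp = matrix.ravel()

accuracy = (tp + tn) / matrix.sum()  # share of all predictions that are correct
sensitivity = tp / (tp + fn)         # true positive rate (recall)
specificity = tn / (tn + fp)         # true negative rate
precision = tp / (tp + fp)           # share of predicted positives that are correct

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f}")
# accuracy=0.756 sensitivity=0.554 specificity=0.870 precision=0.708
```

On this split the model is noticeably better at recognizing negatives than positives, which a single accuracy figure would not reveal.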

--------------

## 4. Normalized Discounted Cumulative Gain (NDCG)
= measures the performance of a recommendation system based on the graded relevance of the recommended entities. It varies from 0.0 to 1.0, with 1.0 representing the ideal ranking of the entities. This metric is commonly used in information retrieval and to evaluate the performance of web search engines.

<img src="http://imageclef.org/system/files/ndcg.png" width="385" height="94" style="margin-top: 1px;" alt="NDCG formula">
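
A minimal sketch of the computation, assuming the common formulation with a linear gain and a log2 position discount. The helper names `dcg` and `ndcg` are illustrative and are not the API of the `measures` library used below; absolute DCG values may differ from that library's output if it uses a different logarithm base, but the normalized ratio is unaffected by a constant factor in the discount.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance divided by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))

def ndcg(reference, hypothesis):
    """DCG of the hypothesis ordering divided by the DCG of the ideal ordering."""
    ideal = sorted(reference, reverse=True)
    return dcg(hypothesis) / dcg(ideal)

# Graded relevance scores listed in ranked order
ideal_order = [9, 4, 4, 2, 2, 2, 1, 1, 1, 1]
swapped_top = [4, 4, 2, 9, 2, 2, 1, 1, 1, 1]  # most relevant item pushed to rank 4

print(ndcg(ideal_order, ideal_order))            # 1.0
print(round(ndcg(ideal_order, swapped_top), 4))  # 0.8255
```

Because the discount grows with position, misplacing a highly relevant item near the top costs much more than swapping two items near the bottom.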

In [7]:
# Download a customized library 'measures.py' for NDCG calculation
!rm -f measures.py
!wget https://raw.githubusercontent.com/telescopeuser/ranking_measures/master/measures.py
# !wget https://github.com/telescopeuser/ranking_measures/blob/master/measures.py

--2018-01-13 23:49:49--  https://raw.githubusercontent.com/telescopeuser/ranking_measures/master/measures.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7814 (7.6K) [text/plain]
Saving to: ‘measures.py’


2018-01-13 23:49:49 (921 KB/s) - ‘measures.py’ saved [7814/7814]



In [8]:
import measures
reference_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
predicted_list = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11]
print(measures.find_rankdcg(reference_list, predicted_list))

0.9600280799041041


### More ranking examples:

In [9]:
import measures

def test_measures(reference, hypothesis):
    """
    Runs all rank-ordering evaluation measures on given pair of lists.
    """

    print("\t DCG:\t\t\t{0}".format(measures.find_dcg(hypothesis)))
    print("\t nDCG:\t\t\t{0}".format(measures.find_ndcg(reference, hypothesis)))
    print("\t Precision:\t\t{0}".format(measures.find_precision(reference, hypothesis)))
    print("\t Precision at k:\t{0}".format(measures.find_precision_k(reference, hypothesis, len(reference))))
    print("\t Average precision:\t{0}".format(measures.find_average_precision(reference, hypothesis)))
    print("\t RankDCG:\t\t{0}".format(measures.find_rankdcg(reference, hypothesis)))

#Defining test cases
L1 = [9, 4, 4, 2, 2, 2, 1, 1, 1, 1]
L2 = [9, 4, 4, 2, 2, 1, 2, 1, 1, 1]
L3 = [4, 4, 2, 9, 2, 2, 1, 1, 1, 1]
L4 = [1, 4, 4, 2, 2, 2, 9, 1, 1, 1]
L5 = [1, 4, 4, 2, 2, 2, 1, 1, 1, 9]
L6 = [1, 1, 1, 1, 2, 2, 2, 4, 4, 9]

print('list L1:', L1)
print('list L2:', L2)
print('list L3:', L3)
print('list L4:', L4)
print('list L5:', L5)
print('list L6:', L6)
print()


#Testing:
print("1. Perfect ordering: (L1, L1)")
test_measures(L1, L1)
print()

print("2. Slightly worse case (low ranks): (L1, L2)")
test_measures(L1, L2)
print()

print("3. Further worsened case (high ranks): (L1, L3)")
test_measures(L1, L3)
print()

print("4. Placement of a high rank element into the low rank 'subgroup': (L1, L4)")
test_measures(L1, L4)
print()

print("5. Case #4 but with further misplacement of the high rank element inside the low rank 'subgroup': (L1, L5)")
test_measures(L1, L5)
print()

print("6. The worst case (reverse ordering): (L1, L6)")
test_measures(L1, L6)
print()

list L1: [9, 4, 4, 2, 2, 2, 1, 1, 1, 1]
list L2: [9, 4, 4, 2, 2, 1, 2, 1, 1, 1]
list L3: [4, 4, 2, 9, 2, 2, 1, 1, 1, 1]
list L4: [1, 4, 4, 2, 2, 2, 9, 1, 1, 1]
list L5: [1, 4, 4, 2, 2, 2, 1, 1, 1, 9]
list L6: [1, 1, 1, 1, 2, 2, 2, 4, 4, 9]

1. Perfect ordering: (L1, L1)
	 DCG:			24.68463499685107
	 nDCG:			1.0
	 Precision:		1.0
	 Precision at k:	1.0
	 Average precision:	1.0
	 RankDCG:		1.0

2. Slightly worse case (low ranks): (L1, L2)
	 DCG:			24.651635001444305
	 nDCG:			0.9986631361812328
	 Precision:		0.8
	 Precision at k:	0.8
	 Average precision:	0.8875396825396826
	 RankDCG:		0.9749999999999999

3. Further worsened case (high ranks): (L1, L3)
	 DCG:			20.377809293434577
	 nDCG:			0.825526052786849
	 Precision:		0.7
	 Precision at k:	0.7
	 Average precision:	0.45464285714285707
	 RankDCG:		0.6500000000000002

4. Placement of a high rank element into the low rank 'subgroup': (L1, L4)
	 DCG:			16.990261445443263
	 nDCG:			0.6882929987666679
	 Precision:		0.8
	 Precision at k:	0.8
	 Ave

---