### How to calculate AUC from scratch (no ML package used)

AUC stands for Area Under the Receiver Operating Characteristic (ROC) Curve. It is often used to measure how well a classification model separates the two classes. A useful property of AUC is that it summarizes the result across all possible cut-offs, so it does not depend on first deciding the optimal cut-off threshold.

This is a great video explaining AUC:
http://www.dataschool.io/roc-curves-and-auc-explained/

This notebook calculates AUC without relying on any ML package (only NumPy is used for array arithmetic). The result is tested on a sample dataset and compared to the result calculated by scikit-learn.

In [61]:
import numpy as np

### Step 1: calculate the true positive rate and false positive rate for each user-defined cut-off threshold

- TP rate = TP / (TP + FN)
- FP rate = FP / (FP + TN)
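
As a quick sanity check of these two formulas, the sketch below counts the four confusion-matrix cells by hand at a single threshold. The labels, scores, and the 0.5 cut-off are made-up illustrative values, not data from this notebook, and a score equal to the threshold is assumed to count as a positive prediction.

```python
# Hypothetical toy data: 1 = positive class, 0 = negative class.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.8, 0.4, 0.2, 0.1]
threshold = 0.5  # assumed cut-off, chosen only for illustration

# Count each confusion-matrix cell; score >= threshold means "predicted positive".
tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)

tp_rate = tp / (tp + fn)  # 2 of the 3 actual positives are caught
fp_rate = fp / (fp + tn)  # 1 of the 4 actual negatives is flagged

print("TP={} FN={} FP={} TN={}".format(tp, fn, fp, tn))
print("TP rate={:.3f} FP rate={:.3f}".format(tp_rate, fp_rate))
```

Repeating this count for many thresholds gives the (FP rate, TP rate) points that make up the ROC curve.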

In [9]:
def true_pos_rate(actual_score_array, prediction_score_array, threshold):
    """Return TP / (TP + FN), predicting positive when score >= threshold."""

    TP = 0
    FN = 0

    if len(actual_score_array) != len(prediction_score_array):
        raise ValueError("actual array and score array need to have the same length!")

    for y, y_hat in zip(actual_score_array, prediction_score_array):
        # only actual positives (y == 1) contribute to the true positive rate
        if y == 1 and y_hat >= threshold:
            TP += 1
        elif y == 1:
            FN += 1

    return TP * 1.0 / (TP + FN)

In [10]:
def false_pos_rate(actual_score_array, prediction_score_array, threshold):
    """Return FP / (FP + TN), predicting positive when score >= threshold."""

    FP = 0
    TN = 0

    if len(actual_score_array) != len(prediction_score_array):
        raise ValueError("actual array and score array need to have the same length!")

    for y, y_hat in zip(actual_score_array, prediction_score_array):
        # only actual negatives (y == 0) contribute to the false positive rate
        if y == 0 and y_hat >= threshold:
            FP += 1
        elif y == 0:
            TN += 1

    return FP * 1.0 / (FP + TN)

### Step 2: use the calculated true and false positive rates to calculate the AUC
`threshold_list` is defined by the user; the step size can be as small as needed.
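
The area itself can be approximated with the trapezoid rule: each pair of neighbouring ROC points contributes a strip whose width is the change in FP rate and whose height is the average of the two TP rates. The sketch below is a minimal, self-contained illustration of that idea; `trapezoid_area` is a hypothetical helper name and the ROC points are made up, so it is not the function used in this notebook.

```python
def trapezoid_area(fpr_points, tpr_points):
    """Area under the piecewise-linear curve through (fpr, tpr) points.

    Hypothetical helper for illustration only. Points are sorted by FP rate
    (ties broken by TP rate) before the strips are summed.
    """
    points = sorted(zip(fpr_points, tpr_points))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # strip width times the average height of its two ends
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up ROC points, including the end points (0, 0) and (1, 1).
fpr_points = [0.0, 0.0, 0.5, 1.0]
tpr_points = [0.0, 0.5, 1.0, 1.0]
print(trapezoid_area(fpr_points, tpr_points))
```

A finer threshold grid adds more points to the curve, which is why a smaller step size generally brings the approximation closer to the exact area.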

In [11]:
def auc_cal(actual_score_array, prediction_score_array, threshold_list):
    # threshold_list is expected in increasing order, so tpr and fpr
    # decrease along the lists; it should start at or below the lowest score
    tpr_list = []
    fpr_list = []

    # generate a list of tpr and a list of fpr, one entry per threshold
    for th in threshold_list:
        tpr_list.append(true_pos_rate(actual_score_array, prediction_score_array, th))
        fpr_list.append(false_pos_rate(actual_score_array, prediction_score_array, th))

    # drop in tpr between consecutive thresholds (the last point drops to 0)
    tpr_list_subtract = [tpr_list[i] - tpr_list[i + 1] for i in range(len(tpr_list) - 1)]
    tpr_list_subtract.append(tpr_list[-1])

    # drop in fpr between consecutive thresholds (the last point drops to 0)
    fpr_list_subtract = [fpr_list[i] - fpr_list[i + 1] for i in range(len(fpr_list) - 1)]
    fpr_list_subtract.append(fpr_list[-1])

    # lower tpr of each strip: the next threshold's tpr, and 0 for the final strip
    tpr_remove_the_largest = tpr_list[1:] + [0.0]

    # area of each strip = triangle (0.5 * tpr drop * fpr drop) + rectangle (lower tpr * fpr drop)
    auc = (0.5 * np.array(tpr_list_subtract).dot(np.array(fpr_list_subtract))
           + np.array(tpr_remove_the_largest).dot(np.array(fpr_list_subtract)))

    return auc

### Test the result on randomly generated dummy datasets

In [59]:
import numpy as np
from sklearn.metrics import roc_auc_score

# create a dummy prediction set and dummy threshold list
y_true = np.random.randint(2, size=100)
print("y_actual are:")
print(y_true)
y_scores = np.random.random(100)
print("y_scores are:")
print(y_scores)
threshold_list = np.linspace(0,1,1000,endpoint=False)

y_actual are:
[1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1
 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1
 0 0 1 0 1 0 1 0 0 1 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 1]
y_scores are:
[ 0.02170709  0.79349882  0.15634267  0.41640273  0.83401201  0.16093844
  0.57689871  0.90580256  0.17949482  0.2368067   0.05524144  0.26409155
  0.27863757  0.10994987  0.744952    0.57874802  0.13196928  0.51246108
  0.3633696   0.4788283   0.66882581  0.30107823  0.34787405  0.5724247
  0.41520884  0.46827663  0.25271672  0.53814267  0.21144478  0.67944208
  0.80328112  0.43497804  0.18578928  0.55703032  0.426147    0.72973414
  0.02622692  0.86876235  0.60307218  0.15187998  0.65145489  0.86844425
  0.11961548  0.24071914  0.62225687  0.58213272  0.85592275  0.63668841
  0.20810597  0.71954138  0.18401001  0.56739185  0.02934028  0.78666915
  0.11736302  0.62143267  0.48719313  0.81749175  0.56467944  0.32878933
  0.65248722  0.9553407   0.46015237  0

In [60]:
# calculate AUC using the scikit-learn package
print("AUC using scikit-learn package is: {}".format(roc_auc_score(y_true, y_scores)))

# calculate AUC using the raw functions above
print("AUC using raw functions I built is: {}".format(auc_cal(y_true, y_scores, threshold_list)))

AUC using scikit-learn package is: 0.5163636363636364
AUC using raw functions I built is: 0.5165656565656566
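
The two values above agree to about three decimal places. One plausible reason for the remaining gap is the fixed grid of 1000 thresholds: when a positive and a negative score fall between the same two neighbouring thresholds, the grid cannot tell which of them ranks higher. A minimal sketch of one way around this, assuming the distinct observed scores themselves are used as cut-offs; `exact_auc` is a hypothetical name, the example data is made up, and a score equal to the threshold counts as a positive prediction.

```python
import numpy as np


def exact_auc(y_true, y_scores):
    """Trapezoidal AUC with one threshold per distinct score (illustrative sketch)."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)

    # Start at (0, 0), then lower the threshold one distinct score at a time.
    fpr, tpr = [0.0], [0.0]
    for th in np.unique(y_scores)[::-1]:
        predicted_pos = y_scores >= th
        tpr.append(np.sum(predicted_pos & (y_true == 1)) / n_pos)
        fpr.append(np.sum(predicted_pos & (y_true == 0)) / n_neg)

    # Sum the trapezoids between consecutive ROC points.
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area


# Made-up example: 3 of the 4 positive/negative pairs are ranked correctly.
print(exact_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The trade-off is one pass over the data per distinct score, which is fine for 100 samples but slow for large datasets.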
