## Compute performance metrics for the given data 5_a.csv

  Note 1: in this data, the number of positive points >> the number of negative points

  Note 3: you need to derive the class labels from the given score

  $y^{pred} = \text{[0 if } y_{score} < 0.5 \text{ else 1]}$


 - Compute Confusion Matrix

 - Compute F1 Score

 - Compute AUC Score: compute a set of thresholds, compute the TPR and FPR for each threshold, and then use `numpy.trapz(tpr_array, fpr_array)`

   https://stackoverflow.com/q/53603376/4084039

   https://stackoverflow.com/a/39678975/4084039

   Note: it should be `numpy.trapz(tpr_array, fpr_array)`, not `numpy.trapz(fpr_array, tpr_array)`

 - Compute Accuracy Score
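
The AUC step in the task list can be sketched as follows. This is a minimal illustration, not the assignment's reference solution: the function name `auc_from_scores` is hypothetical, and using every distinct score as a threshold is an assumption. Thresholds are visited from highest to lowest so that the FPR values grow along the array, which is what the trapezoidal rule expects on its x-axis.

```python
import numpy as np

def auc_from_scores(y_true, y_score):
    """Area under the ROC curve from 0/1 labels and scores (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    # Assumption: every distinct score is a candidate threshold, highest first
    thresholds = np.sort(np.unique(y_score))[::-1]

    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)

    # Start at (fpr, tpr) = (0, 0): nothing is predicted positive yet
    tpr_array, fpr_array = [0.0], [0.0]
    for threshold in thresholds:
        y_pred = y_score >= threshold
        tpr_array.append(np.sum(y_pred & (y_true == 1)) / n_pos)
        fpr_array.append(np.sum(y_pred & (y_true == 0)) / n_neg)

    # Newer NumPy releases name this function trapezoid; older ones only have trapz
    trapezoid = getattr(np, "trapezoid", None) or np.trapz
    # y-values (tpr) first, x-values (fpr) second
    return float(trapezoid(tpr_array, fpr_array))

print(auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The loop recomputes the predictions for every threshold, so its cost grows quadratically with the number of rows. That is acceptable for a dataset of this size, but a cumulative-sum approach would scale better.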

In [67]:
import numpy as np
import pandas as pd

df_5_a = pd.read_csv('./Instructions/5_a.csv')
print(df_5_a.shape)
df_5_a.head()


(10100, 2)


Unnamed: 0,y,proba
0,1.0,0.637387
1,1.0,0.635165
2,1.0,0.766586
3,1.0,0.724564
4,1.0,0.889199


In [68]:
# df_5_a_small = pd.read_csv('Instructions/5_a.csv', nrows=10 )
# df_5_a_small.head(10)
# print(df_5_a_small.to_numpy())

In [69]:
df_5_a['y_predicted'] = np.where(df_5_a['proba'] >= 0.5, float(1), float(0))
df_5_a.head()


Unnamed: 0,y,proba,y_predicted
0,1.0,0.637387,1.0
1,1.0,0.635165,1.0
2,1.0,0.766586,1.0
3,1.0,0.724564,1.0
4,1.0,0.889199,1.0


In [70]:
# Check whether any 'proba' falls below the 0.5 threshold.
# None does, so every row is classified as y_predicted = 1
# df = df_5_a.loc[df_5_a['proba'] < 0.5]
# df.head()

In [71]:
# print(df_5_a.to_numpy())
# Select columns by name rather than by position
actual_y_train_arr = df_5_a['y'].values
print('actual_y_train_arr ', actual_y_train_arr)
predicted_y_arr = df_5_a['y_predicted'].values
print('predicted_y_arr ', predicted_y_arr)

actual_y_train_arr  [1. 1. 1. ... 1. 1. 1.]
predicted_y_arr  [1. 1. 1. ... 1. 1. 1.]


In [72]:
import numpy as np

def get_confusion_matrix(true_y_classes_array, predicted_y_classes_array):
  # np.unique returns the sorted labels, i.e. [0 1] for binary data,
  # which would put 'True Negative' in the top-left cell.
  # The task wants 'True Positive' in the top-left cell, so reverse to [1 0].
  unique_classes = np.unique(true_y_classes_array)[::-1]

  # Square matrix of zeros; 2 x 2 for a binary-label dataset.
  # Rows index the predicted class, columns index the actual class.
  confusion_matrix = np.zeros((len(unique_classes), len(unique_classes)))

  for i in range(len(unique_classes)):
    for j in range(len(unique_classes)):
      confusion_matrix[i, j] = np.sum(
          (true_y_classes_array == unique_classes[j])
          & (predicted_y_classes_array == unique_classes[i])
      )

  return confusion_matrix

confusion_matrix_5_a = get_confusion_matrix(actual_y_train_arr, predicted_y_arr)
print(confusion_matrix_5_a)

# The matrix layout is [[TP, FP], [FN, TN]]
true_positive = int(confusion_matrix_5_a[0][0])
false_positive = int(confusion_matrix_5_a[0][1])
false_negative = int(confusion_matrix_5_a[1][0])
true_negative = int(confusion_matrix_5_a[1][1])

[[10000.   100.]
 [    0.     0.]]


### Explanations and notes on the above confusion matrix function

![img](https://i.imgur.com/1A3Izpg.png)

#### Note: `unique_classes[0]` is 1 and `unique_classes[1]` is 0

### First row of the final confusion_matrix

`confusion_matrix[0, 0]` (i, j = 0, 0) holds the number of points, counted with `np.sum()`, for which the following condition is True. These are the True Positives:

`(true_y_classes_array == unique_classes[0]) & (predicted_y_classes_array == unique_classes[0])`

Similarly, `confusion_matrix[0, 1]` (i, j = 0, 1) holds the number of points for which the following condition is True. These are the False Positives:

`(true_y_classes_array == unique_classes[1]) & (predicted_y_classes_array == unique_classes[0])`

---

### Second row of the final confusion_matrix

`confusion_matrix[1, 0]` (i, j = 1, 0) holds the number of points for which the following condition is True. These are the False Negatives:

`(true_y_classes_array == unique_classes[0]) & (predicted_y_classes_array == unique_classes[1])`

Similarly, `confusion_matrix[1, 1]` (i, j = 1, 1) holds the number of points for which the following condition is True. These are the True Negatives:

`(true_y_classes_array == unique_classes[1]) & (predicted_y_classes_array == unique_classes[1])`
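
To make the indexing concrete, here is a small sketch that applies the same double loop to a toy example. The six labels are made up purely for illustration and are not taken from 5_a.csv.

```python
import numpy as np

# Toy labels (illustrative only, not from the dataset)
y_true = np.array([1, 1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 1])

classes = np.array([1, 0])  # reversed order, so class 1 comes first
cm = np.zeros((2, 2), dtype=int)

for i, predicted_class in enumerate(classes):   # rows: predicted class
    for j, actual_class in enumerate(classes):  # columns: actual class
        cm[i, j] = np.sum((y_true == actual_class) & (y_pred == predicted_class))

print(cm)
```

Reading the result row by row gives TP = 3, FP = 1 on the first row and FN = 1, TN = 1 on the second.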

In [73]:
# Sanity check: the counts in the confusion matrix should add up
# to the number of rows in the original dataframe
sum_all_elements_of_confusion_matrix = confusion_matrix_5_a.sum()
print(sum_all_elements_of_confusion_matrix == df_5_a.shape[0])

True


In [74]:
# Testing my custom confusion_matrix result with scikit-learn
from sklearn.metrics import confusion_matrix
sklearn_confusion_matrix = confusion_matrix(actual_y_train_arr, predicted_y_arr)
print(sklearn_confusion_matrix)

[[    0   100]
 [    0 10000]]


In [75]:
# Verifying individual elements of the Confusion matrix from my custom result with scikit-learn
tn, fp, fn, tp = confusion_matrix(actual_y_train_arr, predicted_y_arr).ravel()
print(tn, fp, fn, tp)
print(true_negative, false_positive, false_negative, true_positive)

0 100 0 10000
0 100 0 10000


### From the above, we can see that the values of the confusion matrix match between scikit-learn and our custom implementation
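
The two printed matrices hold the same four counts in different positions. scikit-learn puts the actual classes on the rows with labels in sorted order [0, 1], whereas the custom function puts the predicted classes on the rows with labels in the order [1, 0]. A minimal sketch of the conversion, using the scikit-learn output as a hard-coded array:

```python
import numpy as np

# scikit-learn layout: rows = actual, columns = predicted, labels [0, 1]
# i.e. [[TN, FP], [FN, TP]]
sklearn_layout = np.array([[0, 100],
                           [0, 10000]])

# Reverse both axes to get label order [1, 0], then transpose so that
# rows = predicted and columns = actual, i.e. [[TP, FP], [FN, TN]]
custom_layout = sklearn_layout[::-1, ::-1].T
print(custom_layout)
```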

---

## F1 Score

![img](https://i.imgur.com/ZPntYB0.jpg)

![Imgur](https://imgur.com/qy5Fesd.jpg)
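
As a worked example, assuming the standard definitions of precision, recall and F1, the counts from the confusion matrix (TP = 10000, FP = 100, FN = 0) give:

$$
\text{precision} = \frac{TP}{TP + FP} = \frac{10000}{10100} = \frac{100}{101}, \qquad \text{recall} = \frac{TP}{TP + FN} = \frac{10000}{10000} = 1
$$

$$
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2 \cdot \frac{100}{101}}{\frac{100}{101} + 1} = \frac{200}{201} \approx 0.99502
$$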


In [76]:
# The function below works only for a binary confusion matrix
# laid out as [[TP, FP], [FN, TN]]
def get_f1_accuracy(binary_conf_matrix):
    true_positive = binary_conf_matrix[0][0]
    false_positive = binary_conf_matrix[0][1]
    false_negative = binary_conf_matrix[1][0]
    true_negative = binary_conf_matrix[1][1]

    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)

    f1_score = (2 * (precision * recall)) / (precision + recall)

    sum_all_elements_of_confusion_matrix = np.concatenate(binary_conf_matrix).sum()

    accuracy = (true_positive + true_negative) / sum_all_elements_of_confusion_matrix

    return f1_score, accuracy


print(get_f1_accuracy(confusion_matrix_5_a))


(0.9950248756218906, 0.9900990099009901)


In [None]:
# Now verifying the above result with that of sk-learn
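
One way this check could look is sketched below, using `f1_score` and `accuracy_score` from `sklearn.metrics`. To keep the snippet self-contained, the label arrays are rebuilt from the confusion-matrix counts (an assumption made for illustration); in the notebook itself, `actual_y_train_arr` and `predicted_y_arr` could be passed in directly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Rebuild label arrays from the confusion-matrix counts:
# 10000 actual positives, 100 actual negatives, every point predicted as 1
y_true = np.concatenate([np.ones(10000, dtype=int), np.zeros(100, dtype=int)])
y_pred = np.ones(10100, dtype=int)

sklearn_f1 = f1_score(y_true, y_pred)
sklearn_accuracy = accuracy_score(y_true, y_pred)
print(sklearn_f1, sklearn_accuracy)
```

Both values should agree with the custom F1 score and accuracy up to floating-point rounding.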
