
**[ASSIGNMENT 5] : COMPUTING PERFORMANCE METRICS WITHOUT SKLEARN**

# Compute performance metrics for the given Y and Y_score without sklearn

In [1]:
import numpy as np
import pandas as pd
# other than these two you should not import any other packages


## A. Compute performance metrics for the given data '5_a.csv'
 <pre>  <b>Note 1:</b> in this data, the number of positive points >> the number of negative points
   <b>Note 2:</b> use pandas or numpy to read the data from <b>5_a.csv</b>
   <b>Note 3:</b> you need to derive the class labels from the given scores</pre> $y^{pred}= \text{[0 if y_score < 0.5 else 1]}$

<pre>
<ol>
<li> Compute Confusion Matrix </li>
<li> Compute F1 Score </li>
<li> Compute AUC Score, you need to compute different thresholds and for each threshold compute tpr,fpr and then use<br> numpy.trapz(tpr_array, fpr_array) <a href='https://stackoverflow.com/q/53603376/4084039'>https://stackoverflow.com/q/53603376/4084039</a>, <a href='https://stackoverflow.com/a/39678975/4084039'>https://stackoverflow.com/a/39678975/4084039</a> <br>Note: it should be numpy.trapz(tpr_array, fpr_array) not numpy.trapz(fpr_array, tpr_array)
Note- Make sure that you arrange your probability scores in descending order while calculating AUC</li>
<li> Compute Accuracy Score </li>
</ol>
</pre>

In [26]:
# Read the data from 5_a.csv
df_a=pd.read_csv('5_a.csv')
df_a.head()

Unnamed: 0,y,proba
0,1.0,0.637387
1,1.0,0.635165
2,1.0,0.766586
3,1.0,0.724564
4,1.0,0.889199


In [3]:
# write your code here for task A

In [27]:
# Deriving class labels from the given score

df_a['y_pred'] = [0 if prob < 0.5 else 1 for prob in list(df_a.proba) ]
df_a.head()

Unnamed: 0,y,proba,y_pred
0,1.0,0.637387,1
1,1.0,0.635165,1
2,1.0,0.766586,1
3,1.0,0.724564,1
4,1.0,0.889199,1


### 1. Computing Confusion Matrix


In [5]:
def Find_TpFpFnTn(y, predicted):
  '''
    given actual class labels and predicted class labels (both 0/1)
    returns TP, FP, FN, TN counts
  '''
  tp, tn, fp, fn = 0, 0, 0, 0
  # distinct loop names so the parameter y is not shadowed
  for y_true, y_pred in zip(y, predicted):
    if y_true == 0 and y_pred == 0:
      tn = tn + 1
    elif y_true == 0 and y_pred == 1:
      fp = fp + 1
    elif y_true == 1 and y_pred == 0:
      fn = fn + 1
    elif y_true == 1 and y_pred == 1:
      tp = tp + 1

  return tp, fp, fn, tn
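
The same four counts can also be obtained without a Python-level loop. The following is a minimal sketch, assuming the labels and predictions are strictly 0/1; `confusion_counts` is a hypothetical helper name, not part of the assignment.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary 0/1 labels using boolean masks."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn

# Tiny hand-checkable example: 2 TP, 1 FP, 1 FN, 1 TN
counts = confusion_counts([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
print(counts)
```

Each mask is evaluated over the whole array at once, which tends to matter when the counts are recomputed for many thresholds.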

In [6]:
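def Create_ConfusionMatrix(y, predicted):
  '''
    given actual class labels and predicted class labels
    returns a 2x2 confusion matrix laid out as [[TP, FP], [FN, TN]]
  '''
  tp, fp, fn, tn = Find_TpFpFnTn(y, predicted)
  # rows: predicted positive / predicted negative; columns: actual positive / actual negative
  return np.array([[tp, fp],[fn, tn]])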

In [7]:
# computing a confusion matrix
confusionMatrix = Create_ConfusionMatrix(df_a.y, df_a.y_pred)
print(f'Confusion Matrix : \n{confusionMatrix}')

Confusion Matrix : 
[[10000   100]
 [    0     0]]


### 2. Compute F1 Score 

In [8]:
def Compute_F1Score(y, predicted):
  '''
  given actual class labels and predicted class labels
  returns the F1 Score
  '''
  tp, fp, fn, tn = Find_TpFpFnTn(y, predicted)

  # precision = TP / (TP + FP), recall = TP / (TP + FN)
  precision_score = tp/(tp+fp)
  recall_score = tp/(tp+fn)

  # F1 is the harmonic mean of precision and recall
  return (2 * precision_score * recall_score) / (precision_score + recall_score)

In [9]:
# Computing F1 Score
f1_score = Compute_F1Score(df_a.y, df_a.y_pred)
print(f'F1 Score : {f1_score:.6f}')

F1 Score : 0.995025
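
As a cross-check, F1 can be computed directly from the counts in the confusion matrix above (TP = 10000, FP = 100, FN = 0), using the algebraically equivalent form 2·TP / (2·TP + FP + FN). This is a small sketch; `f1_from_counts` is a hypothetical helper, and returning 0.0 when the denominator is zero is an assumed convention.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), equal to the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    # Assumed convention: report 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0

f1_a = f1_from_counts(tp=10000, fp=100, fn=0)
print(f"{f1_a:.6f}")  # 0.995025
```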


### 3. Compute AUC Score

In [23]:
  
def ComputeAUCscore(y, probabilities):
    '''
    given actual class labels and predicted probability scores
    returns the AUC Score
    '''
    # every distinct score is a candidate threshold, visited in descending order
    thresholds = sorted(set(probabilities), reverse=True)

    tpr_list, fpr_list = [], []

    for threshold in thresholds :
      tpr_thresh, fpr_thresh = Calculate_TprFprThresh(y, probabilities, threshold)
      tpr_list.append(tpr_thresh)
      fpr_list.append(fpr_thresh)

    # area under the ROC curve: TPR integrated over FPR
    return np.trapz(tpr_list, fpr_list)

In [20]:

def Calculate_TprFprThresh(y, probabilities, threshold):
  '''
    given actual class labels, predicted probability scores and a threshold
    calculates TPR and FPR for the given threshold
  '''
  y_pred_thresh = [0 if prob < threshold else 1 for prob in probabilities ]
  tp, fp, fn, tn = Find_TpFpFnTn(y, y_pred_thresh)

  tpr = tp / (tp + fn)  # true positive rate = TP / (TP + FN)
  fpr = fp / (fp + tn)  # false positive rate = FP / (FP + TN)

  return tpr, fpr


In [24]:
# computing AUC score
auc_score = ComputeAUCscore(df_a.y, df_a.proba)
print(f'AUC Score : {auc_score:.6f} ')

AUC Score : 0.488299 
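
Recomputing the confusion counts for every threshold costs roughly O(n) work per threshold. A one-pass alternative is sketched below: sort by descending score, take cumulative sums of positives and negatives, and apply the trapezoid rule to TPR over FPR. It assumes 0/1 labels with at least one point of each class; `roc_auc_sketch` is a hypothetical name, and the trapezoid rule is written out by hand rather than calling a library routine.

```python
import numpy as np

def roc_auc_sketch(y_true, scores):
    """Area under the ROC curve via sorting and cumulative sums."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Sort points by descending score (stable sort keeps ties together).
    order = np.argsort(-scores, kind="mergesort")
    y_sorted = y_true[order]
    s_sorted = scores[order]

    # Last index of each run of equal scores: one ROC point per distinct threshold.
    distinct = np.where(np.diff(s_sorted))[0]
    idx = np.r_[distinct, y_sorted.size - 1]

    tps = np.cumsum(y_sorted)[idx]   # true positives at each threshold
    fps = (idx + 1) - tps            # false positives at each threshold

    # Prepend the (0, 0) point so the curve starts at the origin.
    tpr = np.r_[0.0, tps / tps[-1]]
    fpr = np.r_[0.0, fps / fps[-1]]

    # Trapezoid rule: integrate TPR with respect to FPR.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

auc_example = roc_auc_sketch([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_example)  # 0.75
```

Because FPR is non-decreasing along the descending-threshold sweep, the result always falls between 0 and 1.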


### 4. Compute Accuracy Score

In [15]:
def Compute_AccuracyScore(y, predicted):
  '''
  given actual class labels and predicted class labels
  returns the Accuracy Score
  '''

  tp, fp, fn, tn = Find_TpFpFnTn(y, predicted)

  # fraction of all points that were classified correctly
  return (tp + tn) / (tp + fp + fn + tn)

In [16]:
# Computing Accuracy Score
acc_score = Compute_AccuracyScore(df_a.y, df_a.y_pred)
print(f'Accuracy Score : {acc_score:.6f}')

Accuracy Score : 0.990099




## B. Compute performance metrics for the given data '5_b.csv'
<pre>
   <b>Note 1:</b> in this data, the number of positive points << the number of negative points
   <b>Note 2:</b> use pandas or numpy to read the data from <b>5_b.csv</b>
   <b>Note 3:</b> you need to derive the class labels from the given scores</pre> $y^{pred}= \text{[0 if y_score < 0.5 else 1]}$

<pre>
<ol>
<li> Compute Confusion Matrix </li>
<li> Compute F1 Score </li>
<li> Compute AUC Score, you need to compute different thresholds and for each threshold compute tpr,fpr and then use               numpy.trapz(tpr_array, fpr_array) <a href='https://stackoverflow.com/q/53603376/4084039'>https://stackoverflow.com/q/53603376/4084039</a>, <a href='https://stackoverflow.com/a/39678975/4084039'>https://stackoverflow.com/a/39678975/4084039</a>
Note- Make sure that you arrange your probability scores in descending order while calculating AUC</li>
<li> Compute Accuracy Score </li>
</ol>
</pre>

In [25]:
# Read the data from 5_b.csv
df_b=pd.read_csv('5_b.csv')
df_b.head()

Unnamed: 0,y,proba
0,0.0,0.281035
1,0.0,0.465152
2,0.0,0.352793
3,0.0,0.157818
4,0.0,0.276648


In [None]:
# write your code here for task B

In [28]:
# Deriving class labels from the given score

df_b['y_pred'] = [0 if prob < 0.5 else 1 for prob in list(df_b.proba) ]
df_b.head()

Unnamed: 0,y,proba,y_pred
0,0.0,0.281035,0
1,0.0,0.465152,0
2,0.0,0.352793,0
3,0.0,0.157818,0
4,0.0,0.276648,0


### 1. Computing Confusion Matrix


In [29]:
# computing a confusion matrix
confusionMatrix = Create_ConfusionMatrix(df_b.y, df_b.y_pred)
print(f'Confusion Matrix : \n{confusionMatrix}')

Confusion Matrix : 
[[  55  239]
 [  45 9761]]


### 2. Compute F1 Score 

In [30]:
# Computing F1 Score
f1_score = Compute_F1Score(df_b.y, df_b.y_pred)
print(f'F1 Score : {f1_score:.6f}')

F1 Score : 0.279188


### 3. Compute AUC Score

In [32]:
# computing AUC score
auc_score = ComputeAUCscore(df_b.y, df_b.proba)
print(f'AUC Score : {auc_score:.6f} ')

AUC Score : 0.937757 


### 4. Compute Accuracy Score

In [31]:
# Computing Accuracy Score
acc_score = Compute_AccuracyScore(df_b.y, df_b.y_pred)
print(f'Accuracy Score : {acc_score:.6f}')

Accuracy Score : 0.971881
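
The gap between the accuracy and the F1 score on this imbalanced data can be reproduced from the confusion matrix counts reported above (TP = 55, FP = 239, FN = 45, TN = 9761). The sketch below also computes the accuracy of a hypothetical baseline that always predicts the negative class, which is an illustration and not part of the assignment.

```python
# Counts taken from the confusion matrix for 5_b.csv shown above.
tp, fp, fn, tn = 55, 239, 45, 9761
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Hypothetical baseline: always predict 0, so every actual negative is correct.
baseline_accuracy = (fp + tn) / total

print(f"accuracy          : {accuracy:.6f}")           # 0.971881
print(f"precision         : {precision:.6f}")          # 0.187075
print(f"recall            : {recall:.6f}")             # 0.550000
print(f"F1                : {f1:.6f}")                 # 0.279188
print(f"baseline accuracy : {baseline_accuracy:.6f}")  # 0.990099
```

The always-negative baseline scores higher on accuracy than the model while finding no positives, which is why F1 and AUC are the more informative metrics here.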


## C. Compute the best probability threshold (similar to the ROC curve computation) that gives the lowest value of metric <b>A</b> for the given data
<br>

You will predict the label of each data point as: $y^{pred}= \text{[0 if y_score < threshold  else 1]}$

$A = 500 \times \text{number of false negatives} + 100 \times \text{number of false positives}$

<pre>
   <b>Note 1:</b> in this data you can see number of negative points > number of positive points
   <b>Note 2:</b> use pandas or numpy to read the data from <b>5_c.csv</b>
</pre>

In [33]:
# read data from 5_c.csv
df_c=pd.read_csv('5_c.csv')
df_c.head()

Unnamed: 0,y,prob
0,0,0.458521
1,0,0.505037
2,0,0.418652
3,0,0.412057
4,0,0.375579


In [34]:
 # write your code for task C

### Finding Best threshold

In [43]:
 
 
def BestThreshold(y, probabilities):
  '''
  Find the threshold that minimizes metric A, given actual class labels and probability scores
  returns the best threshold, the minimum A, and the FN and FP counts at that threshold
  '''
  thresholds = sorted(set(probabilities), reverse=True)
  minA = float('inf')
  bestThreshold, FN, FP = None, None, None

  for threshold in thresholds:
    y_pred = [0 if prob < threshold else 1 for prob in probabilities]

    tp, fp, fn, tn = Find_TpFpFnTn(y, y_pred)

    # evaluate A at every threshold; filtering on fn < fp could skip the true minimum
    A = CalculateMetricA(fn, fp)
    if A < minA:
      minA = A
      bestThreshold = threshold
      FN, FP = fn, fp

  return bestThreshold, minA, FN, FP

In [44]:
def CalculateMetricA(FN, FP):
  '''
  returns value of metric A
  '''
  return (500 * FN) + (100 * FP)


In [45]:
bestThreshold, metric, FN, FP = BestThreshold(df_c.y, df_c.prob)
print(f'Best Threshold : {bestThreshold:.6f} with metric A value : {metric} having FN={FN} and FP={FP}')

Best Threshold : 0.230039 with metric A value : 141000 having FN=78 and FP=1020
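
The threshold search above rebuilds the predictions for every candidate threshold. A faster variant is sketched below under the assumption of 0/1 labels: sort the positive and negative scores separately, then count false negatives (positives scoring below the threshold) and false positives (negatives scoring at or above it) with `np.searchsorted`. The name `best_threshold_sketch` and the `fn_cost` / `fp_cost` parameters are hypothetical.

```python
import numpy as np

def best_threshold_sketch(y_true, scores, fn_cost=500, fp_cost=100):
    """Return (threshold, cost, fn, fp) minimizing fn_cost*FN + fp_cost*FP."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)

    thresholds = np.unique(scores)  # distinct candidate thresholds, ascending
    pos_scores = np.sort(scores[y_true == 1])
    neg_scores = np.sort(scores[y_true == 0])

    # FN: positives with score < threshold (they are predicted 0).
    fn = np.searchsorted(pos_scores, thresholds, side="left")
    # FP: negatives with score >= threshold (they are predicted 1).
    fp = neg_scores.size - np.searchsorted(neg_scores, thresholds, side="left")

    cost = fn_cost * fn + fp_cost * fp
    best = int(np.argmin(cost))  # first (lowest) threshold among ties
    return float(thresholds[best]), int(cost[best]), int(fn[best]), int(fp[best])

result = best_threshold_sketch(
    [0, 0, 1, 1, 0, 1],
    [0.1, 0.4, 0.35, 0.8, 0.7, 0.6],
)
print(result)  # (0.35, 200, 0, 2)
```

When several thresholds tie for the minimum cost, this sketch returns the lowest one, whereas a descending loop with a strict `<` comparison keeps the highest; the minimum cost itself is the same.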



## D. Compute performance metrics (for regression) for the given data 5_d.csv
<pre>    <b>Note 1:</b> <b>5_d.csv</b> has two columns, Y and predicted_Y; both are real-valued features
    <b>Note 2:</b> use pandas or numpy to read the data from <b>5_d.csv</b>
<ol>
<li> Compute Mean Square Error </li>
<li> Compute MAPE: https://www.youtube.com/watch?v=ly6ztgIkUxk</li>
<li> Compute R^2 error: https://en.wikipedia.org/wiki/Coefficient_of_determination#Definitions </li>
</ol>
</pre>

In [46]:
df_d=pd.read_csv('5_d.csv')
df_d.head()

Unnamed: 0,y,pred
0,101.0,100.0
1,120.0,100.0
2,131.0,113.0
3,164.0,125.0
4,154.0,152.0


In [47]:
 # write your code for task 5d

### 1. Compute Mean Square Error

In [48]:
df_d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157200 entries, 0 to 157199
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   y       157200 non-null  float64
 1   pred    157200 non-null  float64
dtypes: float64(2)
memory usage: 2.4 MB


*No missing values.*

In [49]:
def Compute_MeanSquareError(actual, predicted):
  '''
  returns Mean Squared Error value for given actual and predicted values
  '''
  return np.mean((actual - predicted)**2)

In [50]:
# Computing Mean Squared Error
mse = Compute_MeanSquareError(df_d.y, df_d.pred)
print(f'Mean Squared Error : {mse:.6f}')

Mean Squared Error : 177.165700


### 2. Compute Mean Absolute Percentage Error(MAPE)

In [51]:
def Compute_MAPE(actual, predicted):
  '''
  returns Mean Absolute Percentage Error value for given actual and predicted values
  '''
  return (np.mean(np.abs(actual - predicted) / actual ))*100

In [52]:
mape = Compute_MAPE(df_d.y, df_d.pred)
print(f'Mean Absolute Percentage Error : {mape :.2f} %')

Mean Absolute Percentage Error : inf %


*Compute_MAPE() returns inf. The likely cause is zeros in the actual values, since each term divides by the corresponding actual value.*

In [53]:
(df_d == 0).any()

y       True
pred    True
dtype: bool

In [54]:
(df_d < 0).any()

y       False
pred     True
dtype: bool

*As shown above, zeros are present in the actual values, and no actual value is negative. Dividing by each actual value individually therefore fails, so Compute_MAPE() is modified to divide by the mean of the actual values instead:*
<br>
error = actual value - predicted value <br>
Let **mean_a** = average of the **N** actual values, so **N** * **mean_a** = sum of the **N** actual values (which equals the sum of their absolute values, since none is negative).<br>

modified_MAPE = (1/**N**) * (sum of all **|error|**) / **mean_a** <br>
              = (sum of all **|error|**) / (sum of all actual values)<br>
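
The effect of a single zero in the actual values can be seen on a tiny made-up example (the three values below are illustrative, not taken from 5_d.csv). The per-point form divides by zero and returns inf, while the modified form stays finite as long as the actual values do not sum to zero.

```python
import numpy as np

# Illustrative values only; the first actual value is zero on purpose.
actual = np.array([0.0, 100.0, 200.0])
predicted = np.array([10.0, 90.0, 220.0])

abs_err = np.abs(actual - predicted)  # [10, 10, 20]

# Per-point MAPE: 10 / 0 is inf, so the mean is inf as well.
with np.errstate(divide="ignore"):
    plain_mape = np.mean(abs_err / actual) * 100

# Modified MAPE: total absolute error over the total of the actual values.
modified_mape = abs_err.sum() / actual.sum() * 100

print(plain_mape)               # inf
print(f"{modified_mape:.2f}%")  # 13.33%
```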


In [60]:
def Compute_MAPEmodified(actual, predicted):
  '''
  returns the modified Mean Absolute Percentage Error for given actual and predicted values:
  total absolute error divided by the total of the actual values, as a percentage
  '''
  return (np.sum(np.abs(actual - predicted)) / np.sum(actual))*100

In [56]:
mape = Compute_MAPEmodified(df_d.y, df_d.pred)
print(f'Mean Absolute Percentage Error : {mape :.2f}%')

Mean Absolute Percentage Error : 12.91%


### 3. Compute R^2 error

In [59]:
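def Compute_R2Error(actual, predicted):
  '''
  returns the R-Squared (coefficient of determination) value for given actual and predicted values
  '''
  # residual sum of squares: squared error of the predictions
  SSres = np.sum((actual - predicted)**2)
  mean_y = np.mean(actual)
  # total sum of squares: squared deviation of the actual values from their mean
  SStot = np.sum((actual - mean_y)**2)
  return 1 - (SSres / SStot)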


In [58]:
r2 = Compute_R2Error(df_d.y, df_d.pred)
print(f'R-Squared Error : {r2:.6f}')

R-Squared Error : 0.956358
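
To put the value in context, R² compares the model's squared error with that of a baseline that always predicts the mean of the actual values: a perfect prediction scores 1 and the mean-only baseline scores 0. The sketch below illustrates this on made-up numbers; `r2_sketch` is a hypothetical name and assumes the actual values are not all identical.

```python
import numpy as np

def r2_sketch(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot (assumes SS_tot is non-zero)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # error of the predictions
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # error of the mean-only baseline
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_sketch(y, y)                            # predictions equal the actual values
mean_only = r2_sketch(y, np.full_like(y, y.mean()))  # always predict the mean

print(perfect)    # 1.0
print(mean_only)  # 0.0
```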
