# Metrics Onboarding Notebook 
#### Author : [@Achintya](https://github.com/AchintyaX)

In this notebook we calculate the metrics that we monitor in the ML weekly review, along with the metrics used to evaluate the performance of our model before deployment.

## Importing the necessary libraries 

In [1]:
import sqlite3
import json
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, accuracy_score, classification_report, confusion_matrix, precision_score, recall_score

### Loading the data from the tog job
The data that we retrieve from a tog job is a SQLite file. The records are encoded in a standard way inside a `data` table.

In [2]:
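def load_sqlite(path):
    # open the tog job's SQLite file and read the whole `data` table
    cnx = sqlite3.connect(path)
    try:
        df = pd.read_sql_query("SELECT * FROM data", cnx)
    finally:
        cnx.close()  # release the file handle even if the query fails
    return df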

In [None]:
df = load_sqlite('982.sqlite')
df
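Before querying the `data` table, it can help to confirm what the file actually contains. The following is a minimal sketch using only the standard library; `describe_sqlite` is a hypothetical helper name, not part of the tog tooling.

```python
import sqlite3
from contextlib import closing


def describe_sqlite(path):
    """Return {table_name: [column names]} for every table in the SQLite file."""
    with closing(sqlite3.connect(path)) as cnx:
        # sqlite_master is SQLite's built-in catalogue of schema objects
        tables = [
            row[0]
            for row in cnx.execute("SELECT name FROM sqlite_master WHERE type='table'")
        ]
        # PRAGMA table_info yields one row per column; index 1 is the column name
        return {
            table: [col[1] for col in cnx.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }


# Usage (path taken from the cell above):
# describe_sqlite('982.sqlite')
```

If the file follows the convention described above, the result should include a `data` table with `data` and `tag` among its columns.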

## Extracting Labels and Predictions 
Inside `df` we have the `tag` and `data` columns. <br>
The `tag` column contains the intent tags labelled by the Ops team (ground-truth labels). <br>
The `data` column contains the predictions from our bot (predictions). <br>
Both are in JSON format, so we need to parse the JSON to extract the prediction and the label. <br>
The code block below does that. Feel free to explore the data further.

In [4]:
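def predictions_labels(row):
    # prediction: `data` is a JSON object; a missing key falls back to `_empty_transcript_`
    try:
        prediction = json.loads(row['data'])['filter']['predicted_intent']
    except KeyError:
        prediction = '_empty_transcript_'
    # label: `tag` is a JSON list; the first entry's `type` is the tagged intent
    label = json.loads(row['tag'])[0]['type']
    return prediction, label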
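To make the expected JSON shapes concrete, here is a small sketch that runs the same parsing logic on two hand-written rows. The key paths (`filter` / `predicted_intent` and `[0]['type']`) come from the function above; the row contents are hypothetical and only illustrate the structure.

```python
import json


def predictions_labels(row):
    # same logic as the notebook cell above, repeated so this sketch runs on its own
    try:
        prediction = json.loads(row['data'])['filter']['predicted_intent']
    except KeyError:
        prediction = '_empty_transcript_'
    label = json.loads(row['tag'])[0]['type']
    return prediction, label


# Hypothetical rows: one with a prediction, one where `data` has no `filter` key
row_with_prediction = {
    'data': json.dumps({'filter': {'predicted_intent': '_confirm_'}}),
    'tag': json.dumps([{'type': '_confirm_'}]),
}
row_without_prediction = {
    'data': json.dumps({}),
    'tag': json.dumps([{'type': '_callback_'}]),
}

print(predictions_labels(row_with_prediction))     # ('_confirm_', '_confirm_')
print(predictions_labels(row_without_prediction))  # ('_empty_transcript_', '_callback_')
```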

In [None]:
labels = []
predictions = []

for _, row in df.iterrows():
    prediction, label = predictions_labels(row)
    predictions.append(prediction)
    labels.append(label)

df['prediction'] = predictions
df['label'] = labels 
df

Check the list of labels present in our tog job:

In [6]:
df.label.unique()

array(['_silence_', '_callback_', '_purchased_', '_ood_', '_hindi_',
       '_pending_', '_cancel_', '_uninterested_', '_confirm_',
       '_who_is_this_', 'two_wheeler'], dtype=object)

## Distribution of Intents 
We can divide the intents into 2 groups:
1. Inscope Intents - intents which are in the use case of the bot.
2. Out of Scope Intents - intents which are outside the use case of the bot. These are represented by `_oos_`, but there could be other intents which are aliased to `_oos_`.

Inside the Inscope Intents we have another group of intents called `smalltalk intents`. <br>
- Smalltalk Intents: these intents are part of every bot and cover the cases where the user makes small talk with the bot. Generally the following intents are in smalltalk:
    - `_confirm_`
    - `_cancel_`
    - `_repeat_`
    - `_greeting_`
    
<br>
The list can expand depending on the client 
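
The grouping described above can be sketched as a small helper. This is only an illustration: the alias map (treating `_ood_` as an alias of `_oos_`) is an assumption, and `OOS_ALIASES`, `normalize_intent` and `split_intents` are hypothetical names, not part of the existing codebase.

```python
# Assumption: `_ood_` is treated as an alias of `_oos_`; extend this map per client.
OOS_ALIASES = {'_ood_': '_oos_'}

SMALLTALK = ('_confirm_', '_cancel_', '_repeat_', '_greeting_')


def normalize_intent(intent, aliases=OOS_ALIASES):
    """Map an aliased intent to its canonical name; other intents pass through."""
    return aliases.get(intent, intent)


def split_intents(intents, smalltalk=SMALLTALK):
    """Group canonical intent names into oos / smalltalk / other inscope intents."""
    groups = {'oos': [], 'smalltalk': [], 'inscope_without_smalltalk': []}
    for intent in sorted({normalize_intent(i) for i in intents}):
        if intent == '_oos_':
            groups['oos'].append(intent)
        elif intent in smalltalk:
            groups['smalltalk'].append(intent)
        else:
            groups['inscope_without_smalltalk'].append(intent)
    return groups


print(split_intents(['_ood_', '_confirm_', '_callback_', '_cancel_']))
```

Normalizing aliases before grouping keeps every out-of-scope variant under a single `_oos_` name, so later filters only need to check one value.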

In [7]:
# drop the _silence_ intent: it only marks silent audio

df = df[df['label'] != '_silence_'].reset_index(drop=True)

smalltalk_intents = ['_confirm_', '_cancel_', '_repeat_', '_greeting_']
inscope_intents = [i for i in df.label.unique().tolist() if i not in ['_oos_', '_ood_']]
inscope_without_smalltalk = [i for i in inscope_intents if i not in smalltalk_intents]
oos = ['_oos_']

def scorer(df, intents=None, average="weighted"):
    # optionally keep only the rows whose ground-truth label is in `intents`
    if intents:
        df = df[df['label'].isin(intents)].reset_index(drop=True)
    score = {}
    # `average` is forwarded to sklearn (default: support-weighted mean over labels)
    score['precision'] = precision_score(df['label'], df['prediction'], average=average, zero_division=0)
    score['recall'] = recall_score(df['label'], df['prediction'], average=average, zero_division=0)
    return score
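
To make the `weighted` averaging used by `scorer` concrete, here is a pure-Python sketch of the same idea: per-label precision and recall, averaged with each label weighted by its support (the number of ground-truth rows with that label). It is an illustration on hypothetical toy data, not a replacement for the sklearn functions; `weighted_precision_recall` is a hypothetical name.

```python
from collections import Counter


def weighted_precision_recall(labels, predictions):
    """Support-weighted precision and recall over the labels in the ground truth."""
    support = Counter(labels)          # ground-truth rows per label
    predicted = Counter(predictions)   # predicted rows per label
    true_pos = Counter(l for l, p in zip(labels, predictions) if l == p)
    total = sum(support.values())
    if total == 0:
        # nothing to score, e.g. a filter that matched no rows
        return {'precision': 0.0, 'recall': 0.0}
    precision = sum(
        # a label that is never predicted contributes 0 (like zero_division=0)
        support[c] * (true_pos[c] / predicted[c] if predicted[c] else 0.0)
        for c in support
    ) / total
    # support[c] * (true_pos[c] / support[c]) simplifies to true_pos[c]
    recall = sum(true_pos.values()) / total
    return {'precision': precision, 'recall': recall}


# Hypothetical toy data: one `_confirm_` row is mispredicted as `_cancel_`
toy_labels = ['_confirm_', '_confirm_', '_cancel_', '_callback_']
toy_predictions = ['_confirm_', '_cancel_', '_cancel_', '_callback_']
print(weighted_precision_recall(toy_labels, toy_predictions))
# {'precision': 0.875, 'recall': 0.75}
```

Labels that appear only in the predictions have zero support, so they add nothing to the weighted sum. Weighted recall over all rows reduces to the fraction of rows predicted correctly.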

## Metric Calculation
As mentioned in the [doc](https://docs.google.com/document/d/1txL6Dq5qQdfvYxdU_3Z-Z6Rfu0WV_A_SgQz4EdVDEFM/edit#heading=h.6mcxiaz7ktcb), these are the metrics that we monitor in the ML weekly review:
1. Inscope Precision - precision of all inscope intents
2. Inscope Recall - recall of all inscope intents
3. Smalltalk Precision - precision of only smalltalk intents
4. Smalltalk Recall - recall of only smalltalk intents
5. Inscope Precision without Smalltalk
6. Inscope Recall without Smalltalk
7. Slot Capture Rate
8. Slot Retry Rate

The details and procedure for calculating slot related metrics can be found [here](https://github.com/Vernacular-ai/onboarding/blob/master/ml/slot-reporting/slot-evaluation-and-reporting.ipynb)

In [8]:
print("Overall Precision is {} and overall recall is {}".format(scorer(df)['precision'], scorer(df)['recall']))
print("Smalltalk Precision is {} and Smalltalk Recall is {}".format(scorer(df, smalltalk_intents)['precision'], scorer(df, smalltalk_intents)['recall']))
print("Inscope Precision is {} and Inscope Recall is {}".format(scorer(df, inscope_intents)['precision'], scorer(df, inscope_intents)['recall']))
print("Inscope Precision without smalltalk is {} and Inscope Recall without Smalltalk is {}".format(scorer(df, inscope_without_smalltalk)['precision'], scorer(df, inscope_without_smalltalk)['recall']))
print("OOS precision is {} and OOS recall is {}".format(scorer(df, oos)['precision'], scorer(df, oos)['recall']))

Overall Precision is 0.9623137108792846 and overall recall is 0.8114754098360656
Smalltalk Precision is 1.0 and Smalltalk Recall is 0.875
Inscope Precision is 0.9703972082723714 and Inscope Recall is 0.8181818181818182
Inscope Precision without smalltalk is 0.9946581196581197 and Inscope Recall without Smalltalk is 0.8162393162393162
OOS precision is 0.0 and OOS recall is 0.0




## Evaluating Model Performance 
To evaluate the performance of our model we generally use the classification report packaged with sklearn. Since our SLU is a classification model, the key metric we look at is the `F1 score`. <br>
We use the classification report because it lets us analyze the performance of each intent.

In [9]:
print(classification_report(df['label'], df['prediction'], zero_division=0))

                    precision    recall  f1-score   support

        _callback_       1.00      0.82      0.90       110
          _cancel_       0.36      1.00      0.53         4
         _confirm_       0.15      0.75      0.25         4
_empty_transcript_       0.00      0.00      0.00         0
           _hindi_       1.00      0.87      0.93        63
      _interested_       0.00      0.00      0.00         0
             _ood_       0.00      0.00      0.00         2
  _other_language_       0.00      0.00      0.00         0
         _pending_       1.00      0.68      0.81        25
       _purchased_       1.00      0.60      0.75        10
    _uninterested_       1.00      1.00      1.00        15
     _who_is_this_       1.00      1.00      1.00         1
       two_wheeler       0.88      0.70      0.78        10
          why_what       0.00      0.00      0.00         0

          accuracy                           0.81       244
         macro avg       0.53      0.5
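
The `f1-score` column is the harmonic mean of the precision and recall on the same row. As a quick sanity check, the sketch below recomputes it from the rounded precision and recall values shown in the report; `f1_from_precision_recall` is a hypothetical helper name.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the report rows above
report_rows = {
    '_callback_': (1.00, 0.82),
    '_cancel_': (0.36, 1.00),
    '_confirm_': (0.15, 0.75),
}
for intent, (p, r) in report_rows.items():
    print(intent, round(f1_from_precision_recall(p, r), 2))
# _callback_ 0.9
# _cancel_ 0.53
# _confirm_ 0.25
```

Because F1 is a harmonic mean, it is pulled towards the lower of the two values: `_cancel_` has perfect recall but a low F1 because its precision is poor.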

