# K-Fold testing and Confusion Matrix of a Watson Assistant workspace.

This example notebook shows how to set up a K-fold cross-validation test for Watson Assistant and run it.

It demonstrates a technique to programmatically train and evaluate the intent recognition performance for a workspace in <a href="https://www.ibm.com/watson/developercloud/assistant/api/v1/" target="_blank" rel="noopener noreferrer">Watson Assistant</a>.

At a high level, intents are purposes or goals expressed in a user's input, such as answering a question or processing a bill payment. By recognizing the intent expressed in a customer's input, the Assistant service can choose the correct dialog flow for responding to it.

This notebook will demonstrate how the Watson Assistant API can be directly accessed to programmatically train the workspace on intents. This is an alternative to the GUI tool typically used to train a workspace.

By managing the training process programmatically, the intent recognition performance can be reliably tested with a truly blind test set.

This notebook runs on Python 3.5. Check the kernel shown at the top right; if it is not Python 3.5, go to the menu **Kernel > Change Kernel** and select **Python 3.5**.

Tips:
* Code cells are identifiable by their `In [ ]:` prefix in the margin
* To execute the cells in the notebook, select the cell and click the run button, or hit Ctrl-Enter.
* Cells which have not been executed before will have empty brackets, while executed cells will have a sequence number within, e.g. `In [13]`
* Cell execution result displays below the cell
* To clear all execution statuses and outputs, use the `Cell/All Output/Clear` menu.

Then execute the cell (Ctrl-Enter or run button)

## Table of contents

1. [Install and import packages](#setup)
2. [Authenticate to the Watson Assistant Service](#authenticate)
3. [Import the data as a pandas DataFrame](#import)
4. [Generate Folds and split the data set for training and testing](#foldandsplit)
5. [Create the workspaces](#workspaces)
6. [Check workspace status](#status)
7. [Run the test](#runtest)
8. [Clean up](#clean)
9. [Understanding the results you got](#understanding)
10. [Building a confusion matrix](#matrix)<br>
[Summary and next steps](#Summary-and-next-steps)

## <a id="setup"></a> Step 1. Install and import packages

Install and import the necessary packages.

In [None]:
!pip install --upgrade watson_developer_cloud
!pip install --upgrade numpy
!pip install --upgrade pandas
!pip install --upgrade scikit-learn
!pip install --upgrade matplotlib

## <a id="authenticate"></a>Step 2. Authenticate to the Watson Assistant service

Sign up for the Watson Assistant service (formerly Watson Conversation) and enter your credentials.

1. Sign up for [Watson Assistant](https://console.bluemix.net/catalog/services/conversation) in IBM Cloud.
1. On your Watson Assistant service page, click **Service credentials**.
1. Find your credentials. 
1. Add your service URL and API key to the next cell and run the cell.

Tips:
* Watson Studio and Watson Assistant must be in the same IBM Cloud region (US South, for instance).


The values for the `ctx` object come from the service credentials window. For this example, create a new service for testing, because the test can generate up to 10 workspaces.

You also have to set the language of your workspaces, as a two-letter language code. Check the <A HREF="https://console.bluemix.net/docs/services/conversation/lang-support.html#supported-languages">supported languages</A> page for the codes you can use.

The `number_of_folds` is set to `5` so you can test this with a free conversation service. For production testing, you should set the `number_of_folds` to `10`. 

In [None]:
ctx = {
  "url": "https://gateway.watsonplatform.net/conversation/api",
  "apikey": "<API KEY>"
}

language = 'en'
VERSION = '2018-07-10'

number_of_folds = 5

In [None]:
import pandas as pd
import numpy as np
from watson_developer_cloud import AssistantV1
from sklearn.model_selection import KFold

conversation = AssistantV1( 
    iam_apikey=ctx.get('apikey'),
    url=ctx.get('url'),
    version=VERSION)

## <a id="import"></a>Step 3. Import the data as a pandas DataFrame

The data consists of sample user questions and the assigned intents. 

**For notebooks running on IBM Data Science Experience:**

To get the data and load it into a pandas DataFrame:

* Select the code cell below, and **delete all its content**
* Open the data panel on the right using the 1001 button icon  (top right)
* Drop your file with your intents and user examples.
* From the data panel on the right use context menu on the added file choose **Insert to code > Insert Pandas DataFrame** 

Some code should be generated, which creates a `df_data_1` pandas DataFrame. If the name is different, change the variable name back to `df_data_1`.

**For Python notebook servers**
1. Uncomment and modify the code stub to load data from your server's filesystem. 
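
On a plain Python notebook server, loading the data might look like the sketch below. It assumes a CSV file with a header row and the two columns `example` and `intent` that the later cells read. The inline sample text stands in for your file, and the path `intents.csv` mentioned in the comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for a file on disk; with a real file you would call
# pd.read_csv('intents.csv') instead (hypothetical path).
sample_csv = io.StringIO(
    "example,intent\n"
    "I want to pay my bill,pay_bill\n"
    "What are your opening hours?,opening_hours\n"
    "Can I pay by card?,pay_bill\n"
)

df_data_1 = pd.read_csv(sample_csv)

# The later cells expect exactly these two columns.
missing = {'example', 'intent'} - set(df_data_1.columns)
if missing:
    raise ValueError('Missing columns: {}'.format(sorted(missing)))

print(df_data_1.shape)
```

Checking the column names up front gives a clearer error than a `KeyError` raised later while the folds are being built.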

In [None]:

import sys
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client_93c7a4746f1e4132864e7ef0a2d31c48 = ibm_boto3.client(service_name='s3',
<>
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_1 = pd.read_csv(body)
df_data_1.head()



Assign the DataFrame to `examples`, the name used in the rest of the notebook:

In [None]:
# Make sure this uses the variable above. The number will vary in the inserted code.
try:
    examples = df_data_1
except NameError as e:
    print('Error: Setup is incorrect or incomplete.\n')
    print('Follow the instructions to insert the pandas DataFrame above, and edit to')
    print('make the generated df_data_# variable match the variable used here.')
    raise

##  <a id="foldandsplit"></a>Step 4. Generate Folds and split the data set for training and testing.

Next, the questions are shuffled and split evenly into `number_of_folds` buckets.

In [None]:
bucket = np.arange(len(examples))
folds = []

kf = KFold(n_splits=number_of_folds, shuffle=True)
for train_index, test_index in kf.split(bucket):
    train, test = bucket[train_index], bucket[test_index]
    fold = { 
        'test': test,
        'train': train
    }
    folds.append(fold)
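
To see what the split guarantees, here is a small self-contained sketch on ten toy row positions. The fixed `random_state` is an assumption added only to make the shuffle repeatable; the cell above does not set one.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(10)  # ten toy row positions
# random_state is an assumption, used here only for repeatable folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

seen_in_test = []
for train_index, test_index in kf.split(indices):
    # Within a fold, train and test never share a row.
    assert set(train_index).isdisjoint(test_index)
    seen_in_test.extend(test_index.tolist())
    print('train size = {}, test size = {}'.format(len(train_index), len(test_index)))

# Across all folds, every row is tested exactly once.
assert sorted(seen_in_test) == list(range(10))
```

Because each row lands in exactly one test set, every question is classified by a workspace that never saw it during training.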

Create intents for fold:
This function goes through the rows of a fold's training set and builds the `intents[]` list used when the workspace is created.


In [None]:
def createIntents(train_list=None):
    # Group the training examples by intent name.
    grouped = {}
    for i in train_list:
        question = examples.iloc[i]['example']
        intent = examples.iloc[i]['intent']

        # Use the exact intent name as the key. A substring test would
        # confuse intents such as 'pay' and 'pay_bill'.
        grouped.setdefault(intent, []).append({'text': question})

    # One entry per intent, in the shape expected by create_workspace.
    return [{'intent': intent, 'examples': texts}
            for intent, texts in grouped.items()]
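
The shape of the resulting `intents` list is easiest to see on a tiny example. The sketch below is self-contained: the toy DataFrame and the helper name `group_examples` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'example': ['pay my bill', 'opening hours?', 'settle my invoice'],
    'intent':  ['pay_bill', 'opening_hours', 'pay_bill'],
})

def group_examples(frame, row_positions):
    """Hypothetical helper: group the selected rows by intent."""
    grouped = {}
    for pos in row_positions:
        row = frame.iloc[pos]
        grouped.setdefault(row['intent'], []).append({'text': row['example']})
    return [{'intent': name, 'examples': texts} for name, texts in grouped.items()]

payload = group_examples(toy, [0, 1, 2])
print(payload)
```

Each intent appears once, with all of its training questions collected under `examples`.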

##  <a id="workspaces"></a>Step 5. Create the workspaces.
Using the 'train' part of each fold, we create one workspace per fold (`number_of_folds` in total) and keep the workspace ID of each. Testing does not start straight away, because each workspace needs time to train.

In [None]:
workspaces = []

dialog_nodes = [
    {
     'dialog_node': 'anything_else',
     'description': 'Required to stop the endless loop error.',
     'conditions': 'anything_else',
     'parent': None, 
     'previous_sibling': None,
     'output': {'text': {'values': ['OK'], 'selection_policy' : 'sequential'}}, 
     'context': None,
     'metadata': None,
     'go_to': None
    }
]

i = 0
for fold in folds:
    intents = createIntents(fold['train'])
    response = conversation.create_workspace(name='Fold {}'.format(i),
                                         description='K-Fold Testing workspace.',
                                         language=language,
                                         intents=intents,
                                         dialog_nodes=dialog_nodes,
                                         metadata={}).get_result()
    workspaces.append(response['workspace_id'])
    print('Created workspace fold {}: {}'.format(i,response['workspace_id']))
    i = i + 1

## <a id="status"></a>Step 6. Check workspace status. 
Before you start testing, make sure that all workspaces have finished training.

In [None]:
i = 0
for workspace in workspaces:
    response = conversation.get_workspace(workspace_id=workspace, export=False).get_result()
    print('Fold: {}. Workspace: {}.  Status: {}'.format(i,workspace, response['status']))
    i = i + 1
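
Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. The following is a minimal sketch, not part of the original notebook: `wait_until_available` and `fake_status` are hypothetical names, and the fake status source only simulates a service that finishes training on the third poll.

```python
import time

def wait_until_available(get_status, workspace_ids, interval=10, timeout=600):
    """Poll until every workspace reports 'Available' or the timeout expires.

    get_status is any callable that maps a workspace ID to its status string.
    Returns True if all workspaces became available, False on timeout.
    """
    deadline = time.monotonic() + timeout
    pending = set(workspace_ids)
    while pending:
        # Keep only the workspaces that are still not ready.
        pending = {ws for ws in pending if get_status(ws) != 'Available'}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True

# Demonstration with a fake status source (an assumption for this sketch).
calls = {}
def fake_status(ws):
    calls[ws] = calls.get(ws, 0) + 1
    return 'Available' if calls[ws] >= 3 else 'Training'

ready = wait_until_available(fake_status, ['ws-a', 'ws-b'], interval=0, timeout=5)
print(ready)
```

With the client from Step 2, `get_status` could be a small function that returns `conversation.get_workspace(workspace_id=ws, export=False).get_result()['status']`, the same call the cell above makes.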

## <a id="runtest"></a>Step 7. Run the test.

**IMPORTANT!** Do not run the next piece until the status is `Available` for all workspaces above. 

Now walk through each workspace and test each fold. Generate a final report, and save test information. 

In [None]:
import json
import time

wsid = 0
col_list = [ 'Question', 'Expected Intent', 'Matched', 'Found@',
            'I1', 'C1', 'I2', 'C2', 'I3', 'C3', 'I4', 'C4', 'I5', 'C5',
            'I6', 'C6', 'I7', 'C7', 'I8', 'C8', 'I9', 'C9', 'I10', 'C10'
           ]

fold_results = []
print('Running Folds')

for fold in folds:
    results = []
    workspace = workspaces[wsid]
    test_set = fold['test']

    print('Fold {}. Questions = {}'.format(wsid, len(test_set)))

    counter = 0
    for t in test_set:
        question = examples.iloc[t]['example']
        expected_intent = examples.iloc[t]['intent']
        
        msg = {'text': question }
        
        if counter % 10 == 0: 
            print('X',end='')
        else:
            print('.',end='')
        counter = counter + 1
        
        try: 
            response = conversation.message(workspace_id=workspace, input=msg, alternate_intents=True).get_result()
        except:
            print('E', end='')
            time.sleep(5)
            try:
                response = conversation.message(workspace_id=workspace, input=msg, alternate_intents=True).get_result()
            except:
                print('E', end='')
                time.sleep(5)
                response = conversation.message(workspace_id=workspace, input=msg, alternate_intents=True).get_result()
                
        intents = response['intents']
        
        found_at = [i for i,_ in enumerate(intents) if _['intent'] == expected_intent]
        if found_at == []: 
            found_at = ''
        else:
            found_at = found_at[0]

        row = { 
                'Question': question,
                'Expected Intent': expected_intent,
                'Matched': any(x['intent'] == expected_intent for x in intents),
                'Found@': found_at,
              }
        
        i = 1
        for intent in intents:
            row['I{}'.format(i)] = intent['intent']
            row['C{}'.format(i)] = intent['confidence']
            i = i + 1
        
        results.append(row)
    print('')
    
    fold_result = pd.DataFrame(results, columns=col_list)
    fold_result.to_csv('fold{}.csv'.format(wsid))
    wsid = wsid + 1
    
print('')
print('Done')
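
The nested `try`/`except` blocks above retry a failed call twice with a five-second pause. The same idea can be written as a small helper; this is a sketch under that assumption, and both `call_with_retries` and the `flaky` stand-in are hypothetical names used only for illustration.

```python
import time

def call_with_retries(fn, attempts=3, delay=5):
    """Call fn(); on an exception, wait `delay` seconds and try again.

    The exception from the final attempt is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)

# Demonstration with a stand-in that fails twice, then succeeds.
state = {'calls': 0}
def flaky():
    state['calls'] += 1
    if state['calls'] < 3:
        raise RuntimeError('temporary failure')
    return {'intents': []}

result = call_with_retries(flaky, attempts=3, delay=0)
print(result, state['calls'])
```

In the test loop, `fn` would be a `lambda` wrapping the `conversation.message(...).get_result()` call, so the retry policy lives in one place.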

## <a id="clean"></a>Step 8. Clean up.
When you are done, remove the workspaces from your test service.

In [None]:
for workspace in workspaces:
    response = conversation.delete_workspace(workspace_id=workspace).get_result()
    print('Deleted: {}'.format(workspace))


## <a id="understanding"></a>Step 9. Understanding the results you got.

At this point you have `number_of_folds` CSV files called `foldX.csv` (where `X` = `0` to `number_of_folds - 1`). The next step is to load these and get the average across all folds.

Fields in the reports:

---

| Field | Description | 
| :-| :-|
| Question | The question that was used to test with.     |
| Expected Intent | The intent that was expected to be returned. |
| Matched  | This will be `True` if the expected intent shows up in the top 10 intents returned. |
| Found@ | The position at which the expected intent was found. 0 = Top. Empty if it was not found. | 
| I**x** | Intent returned at rank **x** |
| C**x** | Confidence of the intent returned at rank **x** |

--- 
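
As a worked example of these fields, the sketch below builds one report row from a made-up intents list shaped like `response['intents']` in Step 7. The intent names and confidence values are invented for illustration.

```python
# Made-up intents list, shaped like response['intents'] in Step 7.
intents = [
    {'intent': 'opening_hours', 'confidence': 0.61},
    {'intent': 'pay_bill', 'confidence': 0.27},
    {'intent': 'cancel_order', 'confidence': 0.05},
]
expected_intent = 'pay_bill'

ranked = [entry['intent'] for entry in intents]
matched = expected_intent in ranked                      # exact name match
found_at = ranked.index(expected_intent) if matched else ''

row = {'Expected Intent': expected_intent, 'Matched': matched, 'Found@': found_at}
for rank, entry in enumerate(intents, start=1):
    row['I{}'.format(rank)] = entry['intent']
    row['C{}'.format(rank)] = entry['confidence']

print(row)
```

Here the expected intent is returned, but only in second place, so `Matched` is `True` while `Found@` is `1`: the top answer was still wrong.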

First load the reports into memory. For this example, we only look at the top intent returned. For production, however, you would also examine the top 5 intents to see why a particular question failed.

In [None]:
reports = [] 
#failures = [] 
drop_fields = [ 'C2', 'C3','C4','C5','C6','C7','C8','C9','C10',
                'I2','I3','I4','I5','I6','I7','I8','I9','I10',
                'Found@', 'Matched'
              ]
            
for x in range(number_of_folds):
    df = pd.read_csv('fold{}.csv'.format(x),header=0,index_col=0)
    df.drop(drop_fields, axis=1, inplace=True)
    df = df.rename(columns={'Expected Intent': 'true_value', 'I1': 'predicted_value', 'C1': 'confidence'})
#    dz = df[df['true_value']!= df['predicted_value']]
    reports.append(df)
#    failures.append(dz)
    
# example
reports[0].head(10)

#display(failures[0],failures[1],failures[2],failures[3],failures[4])


Now get the overall details of the reports. These are just some sample metrics. 

In [None]:
records = []
for index, report in enumerate(reports):
    # Rows where the top intent was right (tp) or wrong (fp).
    tp = report[report['true_value'] == report['predicted_value']]
    fp = report[report['true_value'] != report['predicted_value']]

    record = {
        'Report': 'report {}'.format(index+1),
        'Total': len(report),
        'Correct': len(tp),
        'Incorrect': len(fp),
        'Accuracy': len(tp) / len(report),
        'Avg Positive Confidence': tp['confidence'].mean(),
        'Avg Negative Confidence': fp['confidence'].mean()
    }
    records.append(record)

columns = ['Report','Total','Correct','Incorrect','Accuracy','Avg Positive Confidence', 'Avg Negative Confidence']

df = pd.DataFrame(records, columns=columns)

record = [{ 'Report': 'Total/Average',
           'Total': df['Total'].sum(),
           'Correct': df['Correct'].sum(),
           'Incorrect': df['Incorrect'].sum(),
           'Accuracy': df['Accuracy'].mean(),
           'Avg Positive Confidence': df['Avg Positive Confidence'].mean(),
           'Avg Negative Confidence': df['Avg Negative Confidence'].mean()
}]

dft = pd.DataFrame(record, columns=columns)

# DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions.
df = pd.concat([df, dft])

df

As a rough guide, an accuracy above 0.7 in the report is a good result. Ideally, the average confidence of correct answers is high, while the average confidence of wrong answers is clearly lower.
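
The per-fold totals can hide individual intents that perform badly. A minimal sketch of a per-intent breakdown follows, using a made-up report with the same `true_value` and `predicted_value` columns as the reports above; the intent names are invented.

```python
import pandas as pd

toy_report = pd.DataFrame({
    'true_value':      ['pay_bill', 'pay_bill', 'opening_hours', 'opening_hours', 'cancel_order'],
    'predicted_value': ['pay_bill', 'opening_hours', 'opening_hours', 'opening_hours', 'pay_bill'],
})

# 1 where the top intent was right, 0 where it was wrong.
correct = (toy_report['true_value'] == toy_report['predicted_value']).astype(int)

per_intent = (
    correct.groupby(toy_report['true_value'])
           .agg(['sum', 'count'])
           .rename(columns={'sum': 'Correct', 'count': 'Total'})
)
per_intent['Accuracy'] = per_intent['Correct'] / per_intent['Total']

# Weakest intents first.
print(per_intent.sort_values('Accuracy'))
```

Sorting by accuracy puts the intents that most need more or better training examples at the top.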

## <a id="matrix"></a>Step 10. Building a confusion matrix.

In this example, we will take the combined results and create a confusion matrix. This will allow us to see where one intent may be interfering with another. 

We start by combining the reports.

In [None]:
# Combine all fold reports into one DataFrame.
# DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions.
df = pd.concat(reports, ignore_index=True)


For the confusion matrix, we need to specify the class labels (the intents) to include. For this demo, we take the unique intents from the data.

In [None]:
class_names = examples['intent'].unique()

print('\n'.join(class_names))

This next piece of code creates the confusion matrix plot, and is taken from a <A HREF="http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py">scikit-learn example</A>.

In [None]:
%matplotlib inline
import numpy as np
import itertools
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import matplotlib as mpl

mpl.rcParams['figure.figsize'] = (10,10)

def plot_confusion_matrix(cm, classes=None,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
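    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Convert counts to per-row fractions (share of each true label).
        # Rows without any samples are left at zero.
        row_sums = cm.sum(axis=1)[:, np.newaxis]
        cm = cm.astype('float') / np.where(row_sums == 0, 1, row_sums)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Two decimals for fractions, plain integers for counts.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")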

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

This last piece of code will display the confusion matrix. 

In [None]:
cm_true = df['true_value'].tolist()
cm_predicted = df['predicted_value'].tolist()
cnf_matrix = confusion_matrix(cm_true, cm_predicted,labels=class_names)

np.set_printoptions(precision=2)

plt.figure(figsize=(15,15))
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

#plt.figure(figsize=(15,15))
#plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True, title='Normalized Confusion matrix')
#plt.show()
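
With many intents the plot gets hard to read, so it can help to list the most frequent confusions as text. The sketch below is self-contained and uses made-up labels and predictions rather than the notebook's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ['cancel_order', 'opening_hours', 'pay_bill']
y_true = ['pay_bill', 'pay_bill', 'pay_bill', 'opening_hours', 'cancel_order', 'cancel_order']
y_pred = ['pay_bill', 'cancel_order', 'cancel_order', 'opening_hours', 'cancel_order', 'pay_bill']

cm = confusion_matrix(y_true, y_pred, labels=labels)

# Zero the diagonal so only misclassifications remain.
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Collect (count, true label, predicted label) for every non-zero cell.
confused = [
    (int(errors[i, j]), labels[i], labels[j])
    for i, j in zip(*np.nonzero(errors))
]
confused.sort(reverse=True)

for count, true_label, predicted_label in confused:
    print('{} x {} predicted as {}'.format(count, true_label, predicted_label))
```

Pairs at the top of this list are the intents most likely to have overlapping training examples.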

## Summary and next steps
You've learned how to use the Watson Assistant API to train and evaluate the service. Try adding your own user questions and intents data and see how Watson does!

Learn more:
- <a href="https://www.ibm.com/watson/developercloud/assistant/api/v1/" target="_blank" rel="noopener noreferrer">Watson Assistant API reference</a>
- <a href="https://github.com/watson-developer-cloud/python-sdk" target="_blank" rel="noopener noreferrer">Watson Assistant Python SDK</a>

### Authors
Laurent Vincent.