# Demo: Model Training and Inference with NLU Classifications

In this notebook, we'll be looking at the most recent feature released in IBM Watson Natural Language Understanding (NLU): **The option to train custom single-label vs. multi-label models**. In addition, we'll go through a couple of *best practices for getting the most out of the model predictions according to individual use cases*.

For a broad overview of training a custom classification model using NLU, see this [demo](https://github.com/watson-developer-cloud/doc-tutorial-downloads/blob/master/natural-language-understanding/custom_classifications_example.ipynb).

Most of our setup is taken from that demo notebook, with each cell duplicated to keep the single-label and multi-label models separate. This means that we'll create two NLU instances, but only for demo purposes: the NLU free tier allows only one custom model at a time, so you can choose to run only the cells related to single-label or multi-label models.

#### Requirements:
- `ibm_watson`
- `scikit-learn` (imported as `sklearn`)
- `numpy`

## 0. Install requirements

If you don't have the libraries required, uncomment the following cell to install them:

In [88]:
# !pip install ibm_watson
# !pip install scikit-learn
# !pip install numpy


## 1. Set up the NLU Service


See the following for authenticating to Watson services: https://cloud.ibm.com/docs/watson?topic=watson-iam. The auto-generated service credentials created along with your NLU service instance will suffice.
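
To keep credentials out of the notebook itself, one option is to read them from environment variables. The sketch below assumes two hypothetical variable names, `NLU_APIKEY` and `NLU_URL`, which are chosen for illustration and are not defined by this notebook; it falls back to the placeholder strings used in the setup cells.

```python
import os

# Hypothetical environment variable names, chosen for illustration only.
API_KEY_VAR = "NLU_APIKEY"
URL_VAR = "NLU_URL"

def load_nlu_credentials(env=os.environ):
    """Return (api_key, url), falling back to the notebook's placeholder strings."""
    api_key = env.get(API_KEY_VAR, "[INSERT YOUR API KEY HERE]")
    url = env.get(URL_VAR, "[INSERT YOUR NLU URL HERE]")
    return api_key, url

api_key, url = load_nlu_credentials()
print("API key configured:", not api_key.startswith("[INSERT"))
```

The returned values could then be assigned to `single_api_key` and `single_url` (or their multi-label counterparts) instead of pasting the credentials into a cell.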


In [15]:
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

#### Single-label setup

In [31]:
# Add your NLU credentials here
single_api_key = "[INSERT YOUR API KEY HERE]"
single_url = "[INSERT YOUR NLU URL HERE]"

In [32]:
single_auth = IAMAuthenticator(single_api_key)
single_nlu = NaturalLanguageUnderstandingV1(version='2022-08-10', authenticator=single_auth)
single_nlu.set_service_url(single_url)

print("Successfully connected with the NLU service for our single-label model")

Successfully connected with the NLU service for our single-label model


#### Multi-label setup

In [48]:
# Add your NLU credentials here
multi_api_key = "[INSERT YOUR API KEY HERE]"
multi_url = "[INSERT YOUR NLU URL HERE]"

In [49]:
multi_auth = IAMAuthenticator(multi_api_key)
multi_nlu = NaturalLanguageUnderstandingV1(version='2022-08-10', authenticator=multi_auth)
multi_nlu.set_service_url(multi_url)

print("Successfully connected with the NLU service for our multi-label model")

Successfully connected with the NLU service for our multi-label model


## 2. Creating Training Data

Before creating our training data, let's define `single-label` vs. `multi-label`. 
- **Single-label**: Each example in the dataset can have *only one label*:

```
{
        "text": "How hot is it today?",
        "labels": ["temperature"]
}
```

- **Multi-label**: Each example in the dataset can have *more than one label*:

```
{
        "text": "How hot is it today?",
        "labels": ["temperature", "question", "assistance"]
}
```

If all of the examples in our dataset are single-label, then we'd need to train a single-label model. Conversely, if we expect multiple labels per example, then we'd train a multi-label model. Notice how the input data format doesn't change regardless of whether the dataset is multi-label or single-label.
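
The rule above can be sketched as a small check over the dataset. `infer_model_type` is a hypothetical helper, not part of the NLU SDK; the strings it returns match the `model_type` values used later in this notebook.

```python
def infer_model_type(examples):
    """Return "single_label" when every example has exactly one label, else "multi_label"."""
    if all(len(example["labels"]) == 1 for example in examples):
        return "single_label"
    return "multi_label"

single_example = [{"text": "How hot is it today?", "labels": ["temperature"]}]
multi_example = [
    {"text": "How hot is it today?", "labels": ["temperature", "question", "assistance"]}
]

print(infer_model_type(single_example))  # single_label
print(infer_model_type(multi_example))   # multi_label
```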

For our demo, we'll use a single-label dataset with three labels in total, one of which has fewer examples than the other two, to represent a real case of an *imbalanced dataset*. This generally occurs when we only have a small amount of data for a type of example that doesn't occur often, such as extreme weather conditions.

For our toy dataset, we'll have:
- 8 examples for the label `temperature`
- 8 examples for the label `conditions`
- 5 examples for the label `emergencies`


**NOTE: A minimum of 5 examples per class is required to train a model!**
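
A quick pre-flight check for this requirement might look like the sketch below. `labels_below_minimum` is a hypothetical helper (not part of the NLU SDK), and the sample data is made up for illustration.

```python
from collections import Counter

MIN_EXAMPLES_PER_LABEL = 5  # minimum stated in the note above

def labels_below_minimum(examples, minimum=MIN_EXAMPLES_PER_LABEL):
    """Return {label: count} for every label with fewer than `minimum` examples."""
    counts = Counter(label for example in examples for label in example["labels"])
    return {label: n for label, n in counts.items() if n < minimum}

# Made-up sample: 8 "temperature" examples but only 4 "emergencies" examples.
sample = (
    [{"text": f"temperature example {i}", "labels": ["temperature"]} for i in range(8)]
    + [{"text": f"emergency example {i}", "labels": ["emergencies"]} for i in range(4)]
)

print(labels_below_minimum(sample))  # {'emergencies': 4}
```

An empty dictionary means every label meets the minimum.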

In [20]:
training_data = [
    {
        "text": "How hot is it today?",
        "labels": ["temperature"]
    },
    {
        "text": "Is it hot outside?",
        "labels": ["temperature"]
    },
    {
        "text": "Will it be uncomfortably hot?",
        "labels": ["temperature"]
    },
    {
        "text": "Will it be sweltering?",
        "labels": ["temperature"]
    },
    {
        "text": "How cold is it today?",
        "labels": ["temperature"]
    },
    {
        "text": "What's the real-feel?",
        "labels": ["temperature"]
    },
    {
        "text": "Is it freezing?",
        "labels": ["temperature"]
    },
    {
        "text": "Is it warm outside?",
        "labels": ["temperature"]
    },
    {
        "text": "Will we get snow?",
        "labels": ["conditions"]
    },
    {
        "text": "Are we expecting sunny conditions?",
        "labels": ["conditions"]
    },
    {
        "text": "Is it overcast?",
        "labels": ["conditions"]
    },
    {
        "text": "Will it be cloudy?",
        "labels": ["conditions"]
    },
    {
        "text": "Will there be hail tomorrow?",
        "labels": ["conditions"]
    },
    {
        "text": "Will there be a blizzard tonight?",
        "labels": ["conditions"]
    },
    {
        "text": "Is it snowing right now?",
        "labels": ["conditions"]
    },
    {
        "text": "Is it going to rain?",
        "labels": ["conditions"]
    },
    {
        "text": "Has there been a crash?",
        "labels": ["emergencies"]
    },
    {
        "text": "Is there a wildfire?",
        "labels": ["emergencies"]
    },
    {
        "text": "Are the roads blocked?",
        "labels": ["emergencies"]
    },
    {
        "text": "Is someone missing?",
        "labels": ["emergencies"]
    },
    {
        "text": "I need help!",
        "labels": ["emergencies"]
    }
]

In [21]:
# Save Training data in a file
import json

training_data_filename = 'training_data.json'

with open(training_data_filename, 'w', encoding='utf-8') as f:
    json.dump(training_data, f, indent=4)

print('Data successfully saved locally in ' + training_data_filename)

Data successfully saved locally in training_data.json



## 3. How to Train an NLU Classifications Model: Single-label vs. Multi-label

To train an NLU Classifications model using the data created above, use the `create_classifications_model` method. To specify whether the model is single-label or multi-label, pass a dictionary with a `model_type` key to the `training_parameters` argument.

- To create a single-label model:
```
nlu.create_classifications_model(...,
                                training_parameters={"model_type": "single_label"},
                                ...)
```

- To create a multi-label model:
```
nlu.create_classifications_model(...,
                                training_parameters={"model_type": "multi_label"},
                                ...)
```


**NOTE: The training cells below start a training job for the model and return the model information, but training continues after the cell has finished running. To check the status of the model, run the cell under `Checking the status of the models` and look at the `status` key in the model information.**


To view all functionality, you can also look over the NLU API documentation: https://cloud.ibm.com/apidocs/natural-language-understanding?code=python.


### Single-label training

In [79]:
with open(training_data_filename, 'r') as file:
    single_label_model = single_nlu.create_classifications_model(language='en', 
                                                          training_data=file, 
                                                          training_parameters={"model_type": "single_label"}, 
                                                          training_data_content_type='application/json', 
                                                          name='MySingleLabelClassificationsModel', model_version='1.0.1').get_result()

    print("Created a NLU Single Label Classifications model:")
    print(json.dumps(single_label_model, indent=4))

Created a NLU Single Label Classifications model:
{
    "name": "MySingleLabelClassificationsModel",
    "user_metadata": null,
    "language": "en",
    "description": null,
    "model_version": "1.0.1",
    "version": "1.0.1",
    "workspace_id": null,
    "version_description": null,
    "status": "starting",
    "notices": [],
    "model_id": "16d9969c-232f-437e-96de-5436b867b366",
    "features": [
        "classifications"
    ],
    "created": "2022-08-23T17:18:15Z",
    "last_trained": "2022-08-23T17:18:15Z",
    "last_deployed": null
}


### Multi-label training

In [80]:
with open(training_data_filename, 'r') as file:
    multi_label_model = multi_nlu.create_classifications_model(language='en', 
                                                          training_data=file, 
                                                          training_parameters={"model_type": "multi_label"}, 
                                                          training_data_content_type='application/json', 
                                                          name='MyMultiLabelClassificationsModel', model_version='1.0.1').get_result()
    print("Created a NLU Multi Label Classifications model:")
    print(json.dumps(multi_label_model, indent=4))

Created a NLU Multi Label Classifications model:
{
    "name": "MyMultiLabelClassificationsModel",
    "user_metadata": null,
    "language": "en",
    "description": null,
    "model_version": "1.0.1",
    "version": "1.0.1",
    "workspace_id": null,
    "version_description": null,
    "status": "starting",
    "notices": [],
    "model_id": "d014a8a4-c566-49bf-8a64-635bdb42e4a7",
    "features": [
        "classifications"
    ],
    "created": "2022-08-23T17:18:16Z",
    "last_trained": "2022-08-23T17:18:16Z",
    "last_deployed": null
}


### Checking the status of the models

Right after creation, the value of `status` is `starting` (as in the outputs above). While the model is training, the value of `status` will be `training`. When the model is done training and ready to use, the value of `status` will be `available`.
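
Rather than re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below is a minimal illustration: `wait_until_available`, `poll_seconds` and `max_polls` are hypothetical names, and a made-up status sequence stands in for the service so the code runs without credentials. It does not handle any failure statuses the service may report.

```python
import time

def wait_until_available(get_status, poll_seconds=30, max_polls=40):
    """Poll get_status() until it returns "available"; return the last status seen."""
    status = get_status()
    for _ in range(max_polls):
        if status == "available":
            break
        time.sleep(poll_seconds)
        status = get_status()
    return status

# Stand-in for the service: a fixed sequence of statuses.
fake_statuses = iter(["starting", "training", "training", "available"])
print(wait_until_available(lambda: next(fake_statuses), poll_seconds=0))  # available
```

With the real service, `get_status` could be `lambda: single_nlu.get_classifications_model(model_id=single_model_id).get_result()["status"]`.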

In [50]:
single_model_id = single_label_model['model_id']
single_model_to_view = single_nlu.get_classifications_model(model_id=single_model_id).get_result()
multi_model_id = multi_label_model['model_id']
multi_model_to_view = multi_nlu.get_classifications_model(model_id=multi_model_id).get_result()

print("Information about the created Single-label NLU Classifications model:")
print(json.dumps(single_model_to_view, indent=4))
print("Information about the created Multi-label NLU Classifications model:")
print(json.dumps(multi_model_to_view, indent=4))

Information about the created Single-label NLU Classifications model:
{
    "name": "MySingleLabelClassificationsModel",
    "user_metadata": null,
    "language": "en",
    "description": null,
    "model_version": "1.0.1",
    "version": "1.0.1",
    "workspace_id": null,
    "version_description": null,
    "status": "available",
    "notices": [],
    "model_id": "16d9969c-232f-437e-96de-5436b867b366",
    "features": [
        "classifications"
    ],
    "created": "2022-08-23T17:18:15Z",
    "last_trained": "2022-08-23T17:18:15Z",
    "last_deployed": "2022-08-23T17:25:11Z"
}
Information about the created Multi-label NLU Classifications model:
{
    "name": "MyMultiLabelClassificationsModel",
    "user_metadata": null,
    "language": "en",
    "description": null,
    "model_version": "1.0.1",
    "version": "1.0.1",
    "workspace_id": null,
    "version_description": null,
    "status": "available",
    "notices": [],
    "model_id": "d014a8a4-c566-49bf-8a64-635bdb42e4a7",
  

## 4. How to Use a Trained NLU Classifications Model for Analysis

Once the NLU Classifications model is fully trained, the `status` in the cell above changes to `available`, indicating that the model can be used for analysis (training takes a few minutes to complete). Once it is ready, use the `analyze` method, passing in text, HTML, or a public webpage URL.


In [52]:
from ibm_watson.natural_language_understanding_v1 import Features, ClassificationsOptions

text = "is there lightning today?" #"What is the expected high for today?"

### Single-label predictions

In [35]:
single_pred = single_nlu.analyze(text=text, features=Features(classifications=ClassificationsOptions(model=single_model_id))).get_result()

print("Analysis response from trained Single Label NLU Classifications model:")
print(json.dumps(single_pred, indent=4))

Analysis response from trained Single Label NLU Classifications model:
{
    "usage": {
        "text_units": 1,
        "text_characters": 25,
        "features": 1
    },
    "language": "en",
    "classifications": [
        {
            "confidence": 0.441537,
            "class_name": "conditions"
        },
        {
            "confidence": 0.418804,
            "class_name": "temperature"
        },
        {
            "confidence": 0.139659,
            "class_name": "emergencies"
        }
    ]
}


As we can see, the confidence scores are normalized, meaning that they will add up to 1.

### Multi-label predictions

In [80]:
multi_pred = multi_nlu.analyze(text=text, features=Features(classifications=ClassificationsOptions(model=multi_model_id))).get_result()

print("Analysis response from trained Multi Label NLU Classifications model:")
print(json.dumps(multi_pred, indent=4))

Analysis response from trained Multi Label NLU Classifications model:
{
    "usage": {
        "text_units": 1,
        "text_characters": 25,
        "features": 1
    },
    "language": "en",
    "classifications": [
        {
            "confidence": 0.391887,
            "class_name": "conditions"
        },
        {
            "confidence": 0.357356,
            "class_name": "temperature"
        },
        {
            "confidence": 0.169718,
            "class_name": "emergencies"
        }
    ]
}


In the multi-label case, we can see that each score represents how aligned the text is with each label independently of each other, and therefore the scores don't add up to 1. 
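
The difference between the two responses can be checked by summing the confidence scores copied from the outputs above. This is only an arithmetic check on those numbers, not a call to the NLU API.

```python
# Confidence scores copied from the two analysis responses above.
single_scores = {"conditions": 0.441537, "temperature": 0.418804, "emergencies": 0.139659}
multi_scores = {"conditions": 0.391887, "temperature": 0.357356, "emergencies": 0.169718}

single_total = sum(single_scores.values())
multi_total = sum(multi_scores.values())

print(f"single-label total: {single_total:.6f}")  # 1.000000
print(f"multi-label total:  {multi_total:.6f}")   # 0.918961
```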

## 5. How to select a good prediction threshold

We managed to get predictions from our models - great! But they only give us a confidence score for each potential label. *The final prediction is made by the user.*

In the single-label case, we can just choose the class with the highest confidence score/probability, but what happens in the multi-label case?
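
For the single-label case, picking the top-scoring class can be sketched as follows. The classifications are copied from the single-label response above, and `top_label` is a hypothetical helper name.

```python
# Classifications copied from the single-label analysis response.
single_classifications = [
    {"confidence": 0.441537, "class_name": "conditions"},
    {"confidence": 0.418804, "class_name": "temperature"},
    {"confidence": 0.139659, "class_name": "emergencies"},
]

def top_label(classifications):
    """Return the class name with the highest confidence score."""
    return max(classifications, key=lambda c: c["confidence"])["class_name"]

print(top_label(single_classifications))  # conditions
```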

### Multi-label final predictions

Looking at the multi-label model predictions, we could take the labels that have a confidence score above a given threshold, say `0.33` (so we'd take `conditions` and `temperature`, and discard the other labels).

In [81]:
threshold = 0.33
multi_final_pred = [pred for pred in multi_pred["classifications"] if pred["confidence"] > threshold]
multi_final_pred

[{'confidence': 0.391887, 'class_name': 'conditions'},
 {'confidence': 0.357356, 'class_name': 'temperature'}]

But what if we end up having too many unwanted labels per example (false positives) - and in this case, what if the right answer was **just** `conditions` instead of `conditions` and `temperature`?


In this case, we can *increase* the confidence threshold to a number that yields more accurate results.

In [82]:
threshold = 0.39
multi_final_pred = [pred for pred in multi_pred["classifications"] if pred["confidence"] > threshold]
multi_final_pred

[{'confidence': 0.391887, 'class_name': 'conditions'}]

### Choosing the right threshold

So how do we choose the right prediction threshold? By exploring multiple potential thresholds within a range and choosing the one that **produces the highest score on the *test set* for a metric of our choosing.**

For example, let's choose the `micro f1-score` as a metric for our case, and tackle the multi-label case.

In [1]:
from sklearn.metrics import f1_score



In `sklearn`, there are different types of `f1-scores` we can choose from (more on that in their [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)), including `weighted`, `micro` and `macro` f1 scores. For our case, we'll choose the `micro` f1 score because it calculates the metric globally, by counting the total true positives, false negatives and false positives across all the potential labels/classes, rather than averaging per-class scores.
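
For reference, the micro-averaged score can be written in terms of the counts summed over all labels, where $TP$, $FP$ and $FN$ are the total true positives, false positives and false negatives:

$$
F_1^{\text{micro}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

By contrast, the `macro` score is the unweighted mean of the per-class F1 scores, and the `weighted` score is the mean of the per-class F1 scores weighted by the number of true instances of each class.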

#### Test set
We'll create some dummy test data and dummy test predictions. Note that the relationship between text and labels may not always make sense; this is for demonstration purposes. Imagine we are trying to classify whether our input text should be labeled as `temperature`, `conditions` or `emergencies`.

In [138]:
test_data = [
    {
        "text": "Is it dry or is it drizzling today?",
        "labels": ["temperature", "conditions"]
    },
    {
        "text": "Should I wear pants or shorts?",
        "labels": ["temperature"]
    },
    {
        "text": "Is it hailing outside?",
        "labels": ["conditions"]
    },
    {
        "text": "Is it humid and cold and raining?",
        "labels": ["conditions", "temperature"]
    },
    {
        "text": "Did you hear about the accident?",
        "labels": ["emergencies"]
    },
    {
        "text": "Are there any survivors from the storm?",
        "labels": ["emergencies", "conditions"]
    }
]

test_labels = [data["labels"] for data in test_data]

We'll generate predictions for the test set...

In [85]:
test_preds = [multi_nlu.analyze(text=text["text"], features=Features(classifications=ClassificationsOptions(model=multi_model_id))).get_result() for text in test_data]
test_preds[0]

{'usage': {'text_units': 1, 'text_characters': 35, 'features': 1},
 'language': 'en',
 'classifications': [{'confidence': 0.543259, 'class_name': 'temperature'},
  {'confidence': 0.352541, 'class_name': 'conditions'},
  {'confidence': 0.029271, 'class_name': 'emergencies'}]}

... And we'll define a method that calculates our final predictions given model confidence scores and a threshold...

In [116]:
def compute_final_predictions(model_preds, threshold):
    """Given a set of probabilities/confidence scores output by our model, return the final predicted labels
    that have a confidence score above a given threshold.
    """  
    # Extract the class name and confidence score from the prediction object 
    model_preds = [pred["classifications"] for pred in model_preds]
    
    # Only keep the predictions above a threshold
    model_preds = [[pred_obj for pred_obj in pred_obj_list if pred_obj["confidence"] > threshold] for pred_obj_list in model_preds]
    
    # Extract the class names
    final_preds = [[pred_obj["class_name"] for pred_obj in pred_obj_list] for pred_obj_list in model_preds]
    return final_preds

... so that we can get to our final predictions!

In [164]:
threshold = 0.33
final_preds = compute_final_predictions(test_preds, threshold)
final_preds

[['temperature', 'conditions'],
 [],
 ['temperature'],
 ['temperature', 'conditions'],
 ['emergencies'],
 ['emergencies']]

A key detail is that `sklearn` expects multi-label predictions (and labels) to be in a matrix of `0`s and `1`s, so we'll do a final transformation of the outputs using `MultiLabelBinarizer`.

In [152]:
from sklearn.preprocessing import MultiLabelBinarizer

label_names = [["temperature", "conditions", "emergencies"]]
MLB = MultiLabelBinarizer().fit(label_names)

In [165]:
# Transform both our true labels and the model predictions
y_true = MLB.transform(test_labels)
y_pred = MLB.transform(final_preds)
y_pred

array([[1, 0, 1],
       [0, 0, 0],
       [0, 0, 1],
       [1, 0, 1],
       [0, 1, 0],
       [0, 1, 0]])
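
The column order of the matrix above is `conditions`, `emergencies`, `temperature`, because `MultiLabelBinarizer` sorts the class names when it is fit. A pure-Python sketch of the same transformation reproduces `y_pred`; the helper `binarize` is hypothetical and only mimics this behavior for illustration.

```python
def binarize(label_lists, classes):
    """Turn lists of label names into 0/1 rows, one column per sorted class name."""
    columns = sorted(classes)
    return [[1 if name in labels else 0 for name in columns] for labels in label_lists]

# Final predictions at threshold 0.33, copied from the output above.
final_preds = [
    ["temperature", "conditions"],
    [],
    ["temperature"],
    ["temperature", "conditions"],
    ["emergencies"],
    ["emergencies"],
]

rows = binarize(final_preds, ["temperature", "conditions", "emergencies"])
for row in rows:
    print(row)
```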

For a threshold of `0.33`, our `micro f1-score` is...

In [166]:
f1_score(y_true, y_pred, average="micro")

0.75
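
This value can be reproduced by hand from the true-positive, false-positive and false-negative counts. In the sketch below, `y_pred` is copied from the output above, and `y_true` is written out by binarizing `test_labels` with the same column order (`conditions`, `emergencies`, `temperature`).

```python
# Columns: conditions, emergencies, temperature
y_true = [[1, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 0, 0], [0, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 0]]

# Flatten both matrices into (true, predicted) pairs, one per example/label cell.
pairs = [(t, p) for true_row, pred_row in zip(y_true, y_pred)
         for t, p in zip(true_row, pred_row)]

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted and correct
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted but wrong
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed labels

micro_f1 = 2 * tp / (2 * tp + fp + fn)
print(tp, fp, fn, micro_f1)  # 6 1 3 0.75
```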

Great - now let's try to see if we can find a better threshold that gives us a higher `f1-score`!

In [158]:
from functools import partial
import numpy as np

def compute_f1_with_threshold(threshold, test_labels, test_preds):
    """Compute the final predictions given model confidence scores, then use them to calculate the f1 score.

    `test_labels` must already be binarized (e.g. `y_true`); `test_preds` are the raw model responses.
    """
    final_preds = compute_final_predictions(test_preds, threshold)
    y_pred = MLB.transform(final_preds)
    f1_with_threshold = f1_score(test_labels, y_pred, average="micro")
    return f1_with_threshold

def compute_optimal_threshold(metric_func, test_labels, test_preds):
    """Compute an optimal threshold that maximizes a given metric function (such as the f1-score),
        given a set of test labels and model confidence scores
    """
    eval_metric_func = partial(metric_func, test_labels=test_labels, test_preds=test_preds)

    print("Evaluating 1001 threshold values between 0 and 1...")
    # Build the threshold grid once and score every candidate
    thresholds = np.linspace(0, 1, 1001)
    results = [(threshold, eval_metric_func(threshold)) for threshold in thresholds]
    # max() returns the first maximum, so ties resolve to the lowest threshold
    opt_thresh, _ = max(results, key=lambda x: x[1])
    print(f'Found an optimal threshold that maximized our metric!: {opt_thresh}')
    return opt_thresh

In [159]:
opt_thresh = compute_optimal_threshold(compute_f1_with_threshold, y_true, test_preds)

Evaluating 1001 threshold values between 0 and 1...
Found an optimal threshold that maximized our metric!: 0.225


It looks like the threshold that maximizes the `f1-score` for our test set and model probabilities is `0.225`. Let's give that one a try!

In [160]:
final_preds_w_opt_thresh = compute_final_predictions(test_preds, opt_thresh)
final_preds_w_opt_thresh = MLB.transform(final_preds_w_opt_thresh)
final_preds_w_opt_thresh

array([[1, 0, 1],
       [1, 0, 1],
       [0, 0, 1],
       [1, 0, 1],
       [0, 1, 0],
       [1, 1, 0]])

In [162]:
f1_score(y_true, final_preds_w_opt_thresh, average="micro")

0.8421052631578948

Nice! As we can see, our `f1-score` is noticeably higher with a threshold of `0.225` (about `0.84`) than with our initial `0.33` threshold (`0.75`). Similar logic can be applied to the single-label case for binary classification (choosing 1 label between 2 classes), which is left as an exercise to the reader. Happy training!
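
As a starting point for that exercise, here is a minimal pure-Python sketch of a threshold sweep for the binary case. Everything in it is an assumption made for illustration: the labels and positive-class scores are invented, and `f1_binary` and `best_threshold` are hypothetical helpers rather than part of `sklearn` or the NLU SDK.

```python
def f1_binary(y_true, y_pred):
    """F1 score for the positive class, computed from TP/FP/FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, positive_scores, steps=101):
    """Sweep evenly spaced thresholds in [0, 1]; return (threshold, f1) with the highest f1."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    scored = [
        (f1_binary(y_true, [score > th for score in positive_scores]), th)
        for th in thresholds
    ]
    # max() keeps the first maximum, so ties resolve to the lowest threshold
    best_f1, best_th = max(scored, key=lambda pair: pair[0])
    return best_th, best_f1

# Invented data: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0]
positive_scores = [0.905, 0.605, 0.355, 0.405, 0.205, 0.105]

threshold, score = best_threshold(y_true, positive_scores)
print(f"best threshold: {threshold:.2f}, f1: {score:.3f}")
```

An example is predicted as positive when its positive-class confidence exceeds the threshold, and as negative otherwise.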