# Analyze questions using NLU with a custom model

This notebook demonstrates Python code for using IBM Watson Natural Language Understanding (NLU) to extract entities from text using a custom language model:

- Step 0: Add project token to notebook
- Step 1: Load questions from a file saved as an asset in the project
- Step 2: Analyze questions using NLU
- Step 3: Count entities
- Step 4: New twist: *Normalize results*
- Step 5: Save results in Watson Studio project

## Step 0: Add project token

To save NLU results as .csv assets in our Watson Studio project, we need a _project token_.

Follow the steps in this topic: [Adding a project token](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/token.html?audience=wdp&context=data)

## Step 1: Load questions

Use the Data panel to insert the questions from the project asset file "so_questions_watson-studio_2019-August.csv" into the notebook code as a pandas DataFrame.

See also: [Load data](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/load-and-access-data.html?audience=wdp&context=data#conns)

In [19]:
questions = list( df_data_1["Questions"] )

In [22]:
print( questions[0][:80] + "..." + "\n\n" + questions[1][:80] + "..." + "\n\n" + questions[2][:80] + "..." )

I am doing PCA on CIFAR 10 image on IBM WATSON Studio Free version so I uploaded...

I m taking the IBM Certification on Coursera and the instructions are to create ...

I am following a tutorial that requires me to select the visual recognition modu...


## Step 2: Analyze questions using NLU with a custom model

### Natural Language Understanding API key and URL
1. From the **Services** menu in Watson Studio, right-click "Watson Services" and then open the link in a new browser tab
2. In the new Watson services tab, from the **Action** menu beside the Natural Language Understanding instance, select "Manage in IBM Cloud"
3. In the service details page that opens, click **Service credentials**, then expand credentials to view them, and then copy the apikey and URL

### Custom language model ID
1. On the **Versions** page in your Knowledge Studio workspace, expand the **Deployed Models** list
2. Copy the **Model ID**

In [23]:
apikey = "" # <-- PASTE YOUR APIKEY HERE
url    = "" # <-- PASTE YOUR SERVICE URL HERE
custom_model_id = "" # <-- PASTE THE MODEL ID FROM KNOWLEDGE STUDIO HERE

In [None]:
!pip install --upgrade "ibm-watson>=3.1.2"

In [25]:
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import Features, EntitiesOptions
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
# ibm-watson 4.0 and later authenticate with an authenticator object; the iam_apikey argument is no longer accepted
nlu = NaturalLanguageUnderstandingV1( version='2019-07-12', authenticator=IAMAuthenticator( apikey ) )
nlu.set_service_url( url )

Analyze the first 250 characters of the first question as a reminder of what the results look like.

In [31]:
print( "\n" + questions[0][:250] + "...\n" )
nlu.analyze( text=questions[0][:250], features=Features( entities=EntitiesOptions( model=custom_model_id ) ) ).get_result()


I am doing PCA on CIFAR 10 image on IBM WATSON Studio Free version so I uploaded the python file for downloading the CIFAR10 on the studio pic below. But when I trying to import cache the following error is showing. pic below- After spending some tim...



{'usage': {'text_units': 1, 'text_characters': 250, 'features': 1},
 'language': 'en',
 'entities': [{'type': 'action',
   'text': 'import',
   'disambiguation': {'subtype': ['NONE']},
   'count': 1,
   'confidence': 0.999524},
  {'type': 'tech',
   'text': 'python',
   'disambiguation': {'subtype': ['NONE']},
   'count': 1,
   'confidence': 0.986331},
  {'type': 'action',
   'text': 'downloading',
   'disambiguation': {'subtype': ['NONE']},
   'count': 1,
   'confidence': 0.939268},
  {'type': 'tech',
   'text': 'Studio',
   'disambiguation': {'subtype': ['NONE']},
   'count': 1,
   'confidence': 0.716415}]}
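
A response shaped like the one above can be regrouped by entity type before any counting. The following is a minimal sketch, assuming a result dictionary with an `entities` list as shown; `group_entities` and `min_confidence` are hypothetical names for illustration and are not part of the NLU SDK.

```python
from collections import defaultdict

# Sample response copied from the output above
# (the 'usage' and 'disambiguation' fields are omitted for brevity)
sample_result = {
    "language": "en",
    "entities": [
        {"type": "action", "text": "import", "count": 1, "confidence": 0.999524},
        {"type": "tech", "text": "python", "count": 1, "confidence": 0.986331},
        {"type": "action", "text": "downloading", "count": 1, "confidence": 0.939268},
        {"type": "tech", "text": "Studio", "count": 1, "confidence": 0.716415},
    ],
}

def group_entities(result, min_confidence=0.0):
    """Group entity text by entity type, keeping entities at or above min_confidence."""
    grouped = defaultdict(list)
    for entity in result.get("entities", []):
        if entity.get("confidence", 0.0) >= min_confidence:
            grouped[entity["type"]].append(entity["text"])
    return dict(grouped)

print(group_entities(sample_result))
print(group_entities(sample_result, min_confidence=0.9))
```

With a threshold of 0.9, the lower-confidence "Studio" entity is dropped while the other three entities are kept.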

In [41]:
def analyze_questions( questions ):
    entity_types = [ "action", "docs", "obj", "persona", "tech" ]
    results = []
    for question_txt in questions:
        result = nlu.analyze( text=question_txt, features=Features( entities=EntitiesOptions( model=custom_model_id ) ) ).get_result()
        # Collect the entity text found for each entity type defined in the custom model
        result_entities = { entity_type : [] for entity_type in entity_types }
        if "entities" in result:
            for entity in result["entities"]:
                # Skip any entity type that is not listed above instead of raising a KeyError
                if entity["type"] in result_entities:
                    result_entities[ entity["type"] ].append( entity["text"] )
            results.append( { "text" : question_txt, **result_entities } )
    return results

In [42]:
nlu_results = analyze_questions( questions )

In [43]:
import json
print( json.dumps( nlu_results, indent=3 ) )

[
   {
      "text": "I am doing PCA on CIFAR 10 image on IBM WATSON Studio Free version so I uploaded the python file for downloading the CIFAR10 on the studio pic below. But when I trying to import cache the following error is showing. pic below- After spending some time on google I find a solution but I can t understand it. link the solution is as follows - Click the Add Data icon Shows the Add Data icon and then browse the script file or drag it into your notebook sidebar. Click in an empty code cell in your notebook and then click the Insert to code link below the file. Take the returned string and write to a file in the file system that comes with the runtime session. To import the classes to access the methods in a script in your notebook use the following command For Python from python file name import class name I can t understand this line and write to a file in the file system that comes with the runtime session. Where can I find the file that comes with runtime session? Whe

## Step 3: Count entities

To perform analysis and create nice charts, we need to tally results.

In [44]:
from collections import OrderedDict

def count_tags( results_list ):
    # Tally lowercased tags across a list of tag lists, most frequent first
    tags = {}
    for tag_arr in results_list:
        for tag in tag_arr:
            tag = tag.lower()
            if tag != "":
                tags[tag] = tags.get( tag, 0 ) + 1
    return OrderedDict( sorted( tags.items(), key=lambda x: x[1], reverse=True ) )

def count_words( results_list, entity_type ):
    # Tally lowercased entity text of one entity type across all results, most frequent first
    entities = {}
    for entry in results_list:
        for word in entry[entity_type]:
            word = word.lower()
            if word != "":
                entities[word] = entities.get( word, 0 ) + 1
    return OrderedDict( sorted( entities.items(), key=lambda x: x[1], reverse=True ) )
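
The same tally can be expressed with `collections.Counter`. This is a sketch of an alternative, not the notebook's own method: `count_words_counter` and the `sample_results` list are hypothetical, and the sample only mimics the shape of the list returned by `analyze_questions`.

```python
from collections import Counter, OrderedDict

def count_words_counter(results_list, entity_type):
    """Tally lowercased, non-empty entity text for one entity type, most frequent first."""
    counts = Counter(
        word.lower()
        for entry in results_list
        for word in entry[entity_type]
        if word != ""
    )
    # most_common() sorts by count, keeping first-seen order for ties
    return OrderedDict(counts.most_common())

# Hypothetical results in the shape returned by analyze_questions
sample_results = [
    {"text": "q1", "action": ["import", "downloading"], "tech": ["python", "Studio"]},
    {"text": "q2", "action": ["Import", "create"], "tech": ["Python"]},
]

for word, count in count_words_counter(sample_results, "action").items():
    print(word, count)
```

Because the text is lowercased first, "import" and "Import" are counted together.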

In [45]:
action_counts  = count_words( nlu_results, "action" )
obj_counts     = count_words( nlu_results, "obj" )
tech_counts    = count_words( nlu_results, "tech" )

In [46]:
action_counts

OrderedDict([('create', 8),
             ('import', 3),
             ('add', 3),
             ('access', 2),
             ('downloading', 1),
             ('select', 1),
             ('selecting', 1),
             ('connection', 1),
             ('credential', 1)])

In [47]:
obj_counts

OrderedDict([('project', 13),
             ('notebook', 5),
             ('methods', 1),
             ('account', 1),
             ('free trial', 1),
             ('dataframe', 1)])

In [48]:
tech_counts

OrderedDict([('python', 4),
             ('studio', 2),
             ('jupyter', 2),
             ('object storage', 1),
             ('ibm cloud', 1),
             ('pandas', 1)])

## Step 4: Normalize results

There are some noisy results above.  For example, in `action_counts`, "select" and "selecting" are counted as separate entities.  But for our analysis purposes, those both refer to the same action.  Instead of being counted separately, they should be counted together.

#### Dictionary files

To train the custom language model, we created dictionary files that looked like this:

[Action words](https://raw.githubusercontent.com/spackows/CASCON-2017_Analyzing_chat/master/custom-language-model/dictionaries/action.csv)

```
lemma,poscode,surface
select,2,select,selecting
create,2,create,creating
train,2,train,training
load,2,load,loading,upload,uploading
sign up,2,sign up,sign-up,signup,register,registering
import,2,import,importing,imported
...
```

Given that we went to the trouble of creating those dictionaries, let's use them to *normalize* results.  For example, count "select" and "selecting" as two instances of the same action.

#### Method: _lookup_

The way we'll use those dictionary files to normalize results is this:

From the dictionaries, create a look-up structure of important words that we can use to map any `surface` form back to the `lemma` form.

For example, using the action words dictionary, "loading", "upload", and "uploading" should all map back to: "load".
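
The mapping described above can be sketched with a few rows from the action words sample. This is an illustration under assumptions: `build_lookup` and `sample_dictionary` are hypothetical names, the rows are read from an inline string rather than downloaded, and every surface form from the third column onward is mapped to the lemma.

```python
import csv
import io

# Rows copied from the action words dictionary sample shown above
sample_dictionary = """lemma,poscode,surface
select,2,select,selecting
create,2,create,creating
load,2,load,loading,upload,uploading
sign up,2,sign up,sign-up,signup,register,registering
"""

def build_lookup(csv_text):
    """Map every surface form (third column onward) to the lemma in the first column."""
    lookup = {}
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for row in reader:
        if not row:
            continue
        lemma = row[0].lower()
        for surface in row[2:]:
            # Keep the first lemma seen for a surface form
            lookup.setdefault(surface.lower(), lemma)
    return lookup

sample_lookup = build_lookup(sample_dictionary)
for word in ["loading", "upload", "uploading", "register"]:
    print(word, "->", sample_lookup[word])
```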

#### Other methods

We could use *stemming* or *lemmatization* libraries. But why?  We already have these dictionaries of words we care about, so let's just use those!
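
To illustrate the trade-off, here is a deliberately naive suffix stripper compared with a dictionary look-up. It only stands in for the idea of stemming; real stemming and lemmatization libraries are more sophisticated. `naive_stem` is hypothetical, and the small `sample_lookup` table holds a few entries of the kind the dictionaries produce.

```python
def naive_stem(word):
    """Strip one common English suffix; a toy stand-in, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A few surface-to-lemma entries in the style of the action words dictionary
sample_lookup = {
    "selecting": "select",
    "creating": "create",
    "uploading": "load",
    "register": "sign up",
}

for word in ["selecting", "creating", "uploading", "register"]:
    stem = naive_stem(word)
    lemma = sample_lookup.get(word, word)
    print(f"{word}: stem={stem!r}, lookup={lemma!r}")
```

Suffix stripping handles "selecting", but it cannot know that "uploading" should count as "load" or that "register" should count as "sign up". Those are choices encoded in the dictionaries.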

In [52]:
import urllib.request
import re

def readSource( url ):
    # Download the dictionary file and return its lines as decoded strings
    with urllib.request.urlopen( url ) as content:
        return [ line.decode( "utf-8" ) for line in content ]

def addLookups( lines_arr, lookup_dict ):
    # Start at index 1 to skip the header row
    for line in lines_arr[1:]:
        line = re.sub( r"\s+$", "", line )
        arr = line.split( "," )
        lemma = arr[0].lower()
        # Columns 0 and 1 hold the lemma and poscode; column 2 repeats the lemma in the rows shown above
        for variant in arr[3:]:
            variant = variant.lower()
            if variant not in lookup_dict:
                lookup_dict[ variant ] = lemma
    return lookup_dict

def readCustomDictionaries( url_arr ):
    lookup_dict = {}
    for url in url_arr:
        lines_arr = readSource( url )
        lookup_dict = addLookups( lines_arr, lookup_dict )
    return lookup_dict

In [53]:
action_dict_url = "https://raw.githubusercontent.com/spackows/CASCON-2017_Analyzing_chat/master/custom-language-model/dictionaries/action.csv"
obj_dict_url    = "https://raw.githubusercontent.com/spackows/CASCON-2017_Analyzing_chat/master/custom-language-model/dictionaries/obj.csv"
tech_dict_url   = "https://raw.githubusercontent.com/spackows/CASCON-2017_Analyzing_chat/master/custom-language-model/dictionaries/tech.csv"

In [65]:
lookup_struct = readCustomDictionaries( [ action_dict_url, obj_dict_url, tech_dict_url ] )
lookup_struct

{'selecting': 'select',
 'creating': 'create',
 'training': 'train',
 'loading': 'load',
 'upload': 'load',
 'uploading': 'load',
 'sign-up': 'sign up',
 'signup': 'sign up',
 'register': 'sign up',
 'registering': 'sign up',
 'importing': 'import',
 'imported': 'import',
 'adding': 'add',
 'recovering': 'recover',
 'changing': 'change',
 'building': 'build',
 'login': 'log in',
 'logging in': 'log in',
 'sign in': 'log in',
 'signing in': 'log in',
 'sign-in': 'log in',
 'signin': 'log in',
 'connecting': 'connect',
 'connection': 'connect',
 'connections': 'connect',
 'deploying': 'deploy',
 'setting up': 'set up',
 'setup': 'set up',
 'set-up': 'set up',
 'editing': 'edit',
 'exceeds': 'exceed',
 'exceeded': 'exceed',
 'exceeding': 'exceed',
 'exporting': 'export',
 'analyzing': 'analyze',
 'downloading': 'download',
 'accessing': 'access',
 'acess': 'access',
 'saving': 'save',
 'initiating': 'initiate',
 'preparing': 'prepare',
 'requesting': 'request',
 'writing': 'write',
 'rena

In [66]:
def normalize( word, lookup_struct ):
    if word in lookup_struct:
        return lookup_struct[word]
    else:
        return word

In [67]:
def count_words2( results_list, entity_type, lookup_struct ):
    # Same tally as count_words, but each word is mapped to its lemma before counting
    entities = {}
    for entry in results_list:
        for word in entry[entity_type]:
            word = normalize( word.lower(), lookup_struct )
            if word != "":
                entities[word] = entities.get( word, 0 ) + 1
    return OrderedDict( sorted( entities.items(), key=lambda x: x[1], reverse=True ) )

In [68]:
action_counts2 = count_words2( nlu_results, "action", lookup_struct )

In [69]:
action_counts

OrderedDict([('create', 8),
             ('import', 3),
             ('add', 3),
             ('access', 2),
             ('downloading', 1),
             ('select', 1),
             ('selecting', 1),
             ('connection', 1),
             ('credential', 1)])

In [70]:
action_counts2

OrderedDict([('create', 8),
             ('import', 3),
             ('add', 3),
             ('access', 2),
             ('select', 2),
             ('download', 1),
             ('connect', 1),
             ('credential', 1)])
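
The difference between the two outputs can be reproduced by merging the raw counts through the look-up table. This is a minimal sketch using values copied from the outputs above; `normalize_counts`, `sample_counts`, and `sample_lookup` are hypothetical names, and only the look-up entries needed for this example are included.

```python
from collections import OrderedDict

# Counts copied from the action_counts output above
sample_counts = OrderedDict([
    ("create", 8), ("import", 3), ("add", 3), ("access", 2), ("downloading", 1),
    ("select", 1), ("selecting", 1), ("connection", 1), ("credential", 1),
])

# Only the look-up entries needed here, copied from the lookup_struct output above
sample_lookup = {"selecting": "select", "downloading": "download", "connection": "connect"}

def normalize_counts(counts, lookup):
    """Merge counts whose keys map to the same lemma, most frequent first."""
    merged = {}
    for word, count in counts.items():
        lemma = lookup.get(word, word)
        merged[lemma] = merged.get(lemma, 0) + count
    return OrderedDict(sorted(merged.items(), key=lambda item: item[1], reverse=True))

for word, count in normalize_counts(sample_counts, sample_lookup).items():
    print(word, count)
```

"select" and "selecting" are merged into a single entry with a count of 2, which matches `action_counts2`.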

## Step 5: Save results in a .csv file

In [63]:
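import pandas as pd

def createDataFrameCSV( counts ):
    # Build a two-column table of entities and their counts, serialized as CSV text
    data = { "Entities": list( counts.keys() ),  "Counts": list( counts.values() ) }
    return pd.DataFrame( data=data ).to_csv()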

In [64]:
project.save_data( "so_entities-action_watson-studio_2019-August.csv", createDataFrameCSV( action_counts2 ), overwrite=True )
project.save_data( "so_entities-obj_watson-studio_2019-August.csv",    createDataFrameCSV( obj_counts     ), overwrite=True )
project.save_data( "so_entities-tech_watson-studio_2019-August.csv",   createDataFrameCSV( tech_counts    ), overwrite=True )

{'file_name': 'so_entities-tech_watson-studio_2019-August.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'nluworkshopproj-donotdelete-pr-kxjtz2yxb1sovi',
 'asset_id': '8e1c13b5-fc71-4225-b8f8-8d0904278d46'}