# Natural Language Classifier with ICD-10 - Medical classification
## A Think2018 Lab - Python Application in Watson Studio / DSX - Jupyter / IPython Notebook

### LAB OVERVIEW:  
This application was built to demonstrate IBM's Watson Natural Language Classifier (NLC).  It uses the Watson Python SDK to create the classifier, list classifiers, and classify the input text.  We also make use of the freely available ICD-10 API which, given an ICD-10 code, returns a name and description.  ICD-10 is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list by the World Health Organization (WHO).  This lab and data set are for educational purposes only. 

https://www.ibm.com/watson/services/natural-language-classifier/ 
https://www.ibm.com/watson/developercloud/natural-language-classifier/api/v1


### LAB GOALS
- Overview of ICD-10 classification system
- Background on IBM Natural Language Classifier (NLC) and how to leverage it in classification use cases
- Familiarity with DSX/Watson Studio as a test bench for NLC

### PRIOR WORKS:  
With thanks, this lab leverages other code and works.  WATSON SDK: the client library for using the IBM Watson services in Python, available in pip as watson-developer-cloud: https://github.com/watson-developer-cloud/python-sdk from https://github.com/watson-developer-cloud . The lab also leverages code and methods framed by Steve Martinelli (https://developer.ibm.com/code/author/stevemar/), a Development Manager focused on delivering Cognitive Journeys that empower developers worldwide. 

### TRAINING DATA:  
The data set we will be using, `ICD-10-GT-AA.csv`, contains a subset of ICD-10 entries. ICD-10 is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems. In short, it is a medical classification list by the World Health Organization (WHO) that contains codes for: diseases, signs and symptoms, abnormal findings, complaints, social circumstances, and external causes of injury or diseases. Hospitals and insurance companies alike could save time and money by leveraging Watson to properly tag the most accurate ICD-10 codes.

The full ICD-10 set is quite large and can take 60+ minutes to train.  A smaller data set, like this one https://github.com/rustyoldrake/Harry_Potter_Sorting_Hat_Simple (about 200 rows and 4 classes), can take about 10 minutes. If you want to train and experiment in a shorter period, use a smaller set.
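Since training time grows with the size of the set, it can help to check how many rows and classes a file has before uploading it. The snippet below is a minimal sketch, assuming the training data is a headerless CSV with the example text in the first column and a single class label in the second. The helper name `summarize_training_csv` and the sample rows (with placeholder class labels) are made up for illustration and are not taken from `ICD-10-GT-AA.csv`.

```python
import csv
import io
from collections import Counter


def summarize_training_csv(csv_text):
    """Return (row_count, Counter of examples per class) for text,class rows."""
    per_class = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or incomplete lines
        per_class[row[1].strip()] += 1
    return sum(per_class.values()), per_class


# Made-up rows with placeholder class labels (ASSUMPTION: text,class layout)
sample = (
    "pain in the left shoulder,CLASS_A\n"
    "\"shoulder hurts, cannot lift arm\",CLASS_A\n"
    "\n"
    "broken nose after a fall,CLASS_B\n"
)

total_rows, per_class = summarize_training_csv(sample)
print(total_rows, dict(per_class))
```

To run it against a real file, read the file contents into a string first and pass that string to the helper.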

ICD-10 IS NOT PART OF ANY IBM OFFERING - it is a classification of the WHO and is used for informational and educational purposes only. 


### RELATED WORKS

Similar notebook here:  https://github.com/mamoonraja/call-center-think18/blob/master/notebooks/Step3-natural-language-classifier.ipynb 
Notebook 3 – Natural Language Classifier (NLC) - IBM Watson Natural Language Classifier uses machine learning algorithms to return the top matching predefined classes for short text input.  You create and train a classifier to connect predefined classes to example texts so that the service can apply those classes to new inputs.

https://github.com/rustyoldrake/ICD-10-NLC-Python-LAB This application was built to demonstrate IBM's Watson Natural Language Classifier (NLC). The data set we will be using, ICD-10-GT-AA.csv, contains a subset of ICD-10 entries. ICD-10 is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems. 

Blog - a proof of concept that explores using the IBM Watson NL Classifier with ICD-10 codes as ground truth, to help a nurse or doctor narrow down to the most likely codes:  https://dreamtolearn.com/ryan/r_journey_to_watson/17


In [None]:
!pip install watson_developer_cloud
import watson_developer_cloud as watson

# CONNECT TO EXISTING SERVICE - Natural Language Classifier (NLC)

IBM Watson Natural Language Classifier uses machine learning algorithms to return the top matching predefined classes for short text input.

### For this lab - we have created a pre-trained model that classifies a SUBSET of the ICD
"the ICD-10 Clinical Modification (ICD-10-CM) used in the US has some 93,000 codes compared to the ~16,000 within the international version". Because of the size of the set, the training data was initially broken up into 5 chunks.

https://github.com/rustyoldrake/IBM_Watson_NLC_ICD10_Health_Codes
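The source does not say how the five chunks were produced. As one possible approach, the sketch below splits a list of training rows into contiguous pieces of near-equal size. The function name `split_into_chunks` and the sample rows are hypothetical.

```python
def split_into_chunks(rows, n_chunks):
    """Split rows into n_chunks contiguous pieces whose sizes differ by at most one."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # the first `extra` chunks take one additional row each
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


# Hypothetical stand-in for training rows
rows = ["row %d" % i for i in range(12)]
chunks = split_into_chunks(rows, 5)
print([len(c) for c in chunks])
```

A contiguous split keeps each class's rows together if the file is sorted by code; shuffling the rows before splitting would be the alternative if each chunk should cover all classes.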


In [None]:
## These are TEMPORARY 'burner' creds for NLC - this one for ICD-10

credentials_nlc = {
    "classifier_id": "f7e6f0x306-nlc-940",
    "url": "https://gateway.watsonplatform.net/natural-language-classifier/api",
    "username": "99698e63-e402-4bd2-96ab-86f494737b78",
    "password": "DhqKfygHynwW" 
}

In [None]:
natural_language_classifier = watson.NaturalLanguageClassifierV1(
    username = credentials_nlc['username'],
    password = credentials_nlc['password'])

# Natural Language Classifier (NLC)


In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os
import requests
from watson_developer_cloud import NaturalLanguageClassifierV1
import pixiedust
import pandas as pd


DATA_SET = 'data/ICD-10-GT-AA.csv'

def create_dataframe(classifier_output):
    # classifier_output is the JSON string returned by classify_text
    result = {}
    result['class_type'] = []
    result['confidence'] = []
    for d in json.loads(classifier_output)['classes']:
        result['class_type'].append(d['class_name'])
        result['confidence'].append(d['confidence'])
    return pd.DataFrame(data = result)

    
def classify_text(input_text):
    # send the text to the classifier, get back an ICD code
    classifier_output = natural_language_classifier.classify(credentials_nlc['classifier_id'], input_text)
    # get the ICD name based on ICD code
    icd_code, icd_output = get_ICD_code_info(classifier_output)
    # format results
    classifier_output = json.dumps(classifier_output, indent=4)
    icd_output = json.dumps(icd_output, indent=4)
    return icd_output, classifier_output

def get_ICD_code_info(result):
    base_url = "http://www.icd10api.com/?"
    code = result["top_class"]
    query_string = "s=" + code + "&desc=short&r=json"
    resp = requests.get(base_url + query_string)
    return code, resp.json()

def map_types(output):
    result = {}
    for d in output:
        result[d['Name']] = d['Description']
    return result
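`get_ICD_code_info` above builds its query string by concatenation, which works for plain codes but does not escape special characters. The sketch below shows an alternative using the standard library's `urlencode` with the same three parameters (`s`, `desc`, `r`). The helper name `build_icd_url` is hypothetical, and the block only builds the URL; it makes no network request.

```python
from urllib.parse import urlencode


def build_icd_url(code, base_url="http://www.icd10api.com/"):
    """Build the lookup URL for a code, escaping the query parameters."""
    query = urlencode({"s": code, "desc": "short", "r": "json"})
    return base_url + "?" + query


url = build_icd_url("M75")
print(url)
```

The same effect can be had by passing the dictionary to `requests.get` through its `params` argument, which performs the encoding itself.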

## Test 1 - Shoulder Pain

In [None]:
[icd_output, classifier_output] = classify_text('I injured my left shoulder')

hash_map = map_types(json.loads(icd_output)['Search'])
    
df = create_dataframe(classifier_output)
display(df)
print('Top result is: ', hash_map[json.loads(classifier_output)['top_class']])
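The classifier output carries a list of classes, each with a `class_name` and a `confidence`, plus a `top_class`. As a small sketch of how the runners-up could be inspected without pandas, the helper below sorts that list and keeps the top few. The name `top_classes` is hypothetical, and the sample response is hand-written with placeholder labels and made-up confidences, not real service output.

```python
def top_classes(classifier_response, n=3):
    """Return the n highest-confidence (class_name, confidence) pairs."""
    ranked = sorted(
        classifier_response["classes"],
        key=lambda c: c["confidence"],
        reverse=True,
    )
    return [(c["class_name"], c["confidence"]) for c in ranked[:n]]


# Hand-written response using only the keys the code above relies on
sample_response = {
    "top_class": "CLASS_A",
    "classes": [
        {"class_name": "CLASS_B", "confidence": 0.25},
        {"class_name": "CLASS_A", "confidence": 0.5},
        {"class_name": "CLASS_C", "confidence": 0.125},
    ],
}

best_two = top_classes(sample_response, n=2)
print(best_two)
```

Note that `classify_text` returns JSON strings, so its second return value would need `json.loads` before being passed to a helper like this.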

### Shoulder pain

https://www.nuemd.com/icd-10/codes/shoulder%20pain  (third party site) 

#### M75 Family codes include:
- Adhesive capsulitis of shoulder
- Rotator cuff tear or rupture, not specified as traumatic
- Bicipital tendinitis
- Calcific tendinitis of shoulder
- Impingement syndrome of shoulder


## Test 2 - Nose Injury

In [None]:
[icd_output, classifier_output] = classify_text('I broke my nose')

hash_map = map_types(json.loads(icd_output)['Search'])
    
df = create_dataframe(classifier_output)
display(df)
print('Top result is: ', hash_map[json.loads(classifier_output)['top_class']])

## Test 3 - Switch Classifiers - Harry Potter Fun

In [None]:
credentials_nlc = {
    "classifier_id": "340008x87-nlc-967",
    "url": "https://gateway.watsonplatform.net/natural-language-classifier/api",
    "username": "d52a6d7e-e8cd-4dbf-816b-6a7ac1f2dd71",
    "password": "OZD3ulJMbZaN"     
}

In [None]:
natural_language_classifier = NaturalLanguageClassifierV1(
    username = credentials_nlc['username'],
    password = credentials_nlc['password'])

## Bad and Evil

In [None]:
classify_text('I am bad and mean')

## Brave and Noble and Good

In [None]:
print(classify_text('I am noble and good and brave'))

## Stephen Hawking

In [None]:
print(classify_text('Stephen William Hawking CH CBE FRS FRSA (8 January 1942 – 14 March 2018) was an English theoretical physicist, cosmologist, author, and Director of Research at the Centre for Theoretical Cosmology within the University of Cambridge.[15][16] His scientific works included a collaboration with Roger Penrose on gravitational singularity theorems in the framework of general relativity and the theoretical prediction that black holes emit radiation, often called Hawking radiation. Hawking was the first to set out a theory of cosmology explained by a union of the general theory of relativity and quantum mechanics. He was a vigorous supporter of the many-worlds interpretation of quantum mechanics'))

In [None]:
# A few other blogs and links on Harry Potter 
# https://dreamtolearn.com/ryan/data_analytics_viz/97
# https://dreamtolearn.com/ryan/data_analytics_viz/98
# https://github.com/rustyoldrake/Harry_Potter_Sorting_Hat_Simple


## CREATE NEW SERVICE - Natural Language Classifier (NLC)  (( NOT RUN ))

IBM Watson Natural Language Classifier uses machine learning algorithms to return the top matching predefined classes for short text input.

You create and train a classifier to connect predefined classes to example texts so that the service can apply those classes to new inputs.

https://www.ibm.com/watson/services/natural-language-classifier/

https://www.ibm.com/watson/developercloud/natural-language-classifier/api/v1

Unlike most Watson Services, NLC does NOT offer a 'Lite/Free' version.

In [None]:
def create_classifier():
    # fetch all classifiers associated with the NLC instance
    result = natural_language_classifier.list_classifiers()
    # for the purposes of this demo, we handle only one classifier
    # return the first one found
    if len(result['classifiers']) > 0:
        return result['classifiers'][0]
    else:
        # if none found, create a new classifier; change the name in metadata as needed
        with open(DATA_SET, 'rb') as training_data:
            metadata = '{"name": "ICD_classifier", "language": "en"}'
            classifier = natural_language_classifier.create_classifier(
                metadata=metadata,
                training_data=training_data
            )
        return classifier
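A newly created classifier has to finish training before it can classify text, and training can take from minutes to over an hour depending on the data set. The sketch below shows one way to wait for it: a generic polling loop around a caller-supplied `get_status` function. Everything here is an assumption for illustration: the helper name `wait_until_available`, its parameters, and the status strings `"Available"` and `"Failed"`. In practice `get_status` would wrap whichever SDK call reports the classifier's status.

```python
import time


def wait_until_available(get_status, poll_seconds=30, max_polls=200, sleep=time.sleep):
    """Poll get_status() until it reports "Available"; return False on timeout."""
    for _ in range(max_polls):
        status = get_status()
        if status == "Available":  # ASSUMED status string for a trained classifier
            return True
        if status == "Failed":  # ASSUMED status string for a failed training run
            raise RuntimeError("classifier training failed")
        sleep(poll_seconds)
    return False


# Demo with a fake status source and a no-op sleep, so nothing actually waits
fake_statuses = iter(["Training", "Training", "Available"])
ready = wait_until_available(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(ready)
```

Passing `sleep` as a parameter keeps the loop testable; the default is the real `time.sleep`.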



### For more information 

Python Notebook leveraging Cloud Object Storage (Call Center Use Case)
https://github.com/mamoonraja/call-center-think18/blob/master/notebooks/Step3-natural-language-classifier.ipynb


