# Entity Extraction and Document Classification

## 1. Setup

To prepare your environment, you need to install some packages and enter credentials for the Watson services.

## 1.1 Install the necessary packages

You need the latest versions of these packages:

- Watson Developer Cloud: a client library for Watson services.
- NLTK: a leading platform for building Python programs that work with human language data.

### Install the Watson Developer Cloud package: 

In [1]:
!pip install --upgrade watson-developer-cloud

Collecting watson-developer-cloud
  Downloading https://files.pythonhosted.org/packages/41/35/9c98ba1056163641c97f1416e882679c2da941abb95d37311b90980c5293/watson-developer-cloud-1.3.5.tar.gz (192kB)
    100% |████████████████████████████████| 194kB 2.9MB/s eta 0:00:01
Requirement not upgraded as not directly required: requests<3.0,>=2.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Requirement not upgraded as not directly required: python_dateutil>=2.5.3 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Collecting autobahn>=0.10.9 (from watson-developer-cloud)
  Downloading https://files.pythonhosted.org/packages/71/0c/af6e79f0cc23668454f4a73ec10b96bdb5b7f172509179da4fc92bce4142/autobahn-18.5.2-py2.py3-none-any.whl (299kB)
    100% |████████████████████████████████| 307kB 2.5MB/s eta 0:00:01
Collecting Twisted>=13.2.0 (from watson-developer-cloud)
  Downloading https://files.pythonhosted

### Install NLTK:

In [2]:
!pip install --upgrade nltk

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/50/09/3b1755d528ad9156ee7243d52aa5cd2b809ef053a0f31b53d92853dd653a/nltk-3.3.0.zip (1.4MB)
    100% |████████████████████████████████| 1.4MB 725kB/s eta 0:00:01
Requirement not upgraded as not directly required: six in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from nltk)
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/d1/ab/40/3bceea46922767e42986aef7606a600538ca80de6062dc266c
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.4
    Uninstalling nltk-3.2.4:
      Successfully uninstalled nltk-3.2.4
Successfully installed nltk-3.3


### Install IBM Cloud Object Storage Client: 

In [3]:
!pip install ibm-cos-sdk

Requirement not upgraded as not directly required: ibm-cos-sdk in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages
Requirement not upgraded as not directly required: ibm-cos-sdk-s3transfer==2.*,>=2.0.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from ibm-cos-sdk)
Requirement not upgraded as not directly required: ibm-cos-sdk-core==2.*,>=2.0.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from ibm-cos-sdk)
Requirement not upgraded as not directly required: jmespath<1.0.0,>=0.7.1 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from ibm-cos-sdk-core==2.*,>=2.0.0->ibm-cos-sdk)
Requirement not upgraded as not directly required: docutils>=0.10 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from ibm-cos-sdk-core==2.*,>=2.0.0->ibm-cos-sdk)
Requirement not upgraded as not directly required: python-dateutil<3.0.0,>=2.1 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from ibm-cos-sdk-core==2.*,>=2.0.0->ibm-cos-sdk)
Re

### Now restart the kernel by choosing Kernel > Restart. 

## 1.2 Import packages and libraries
Import the packages and libraries that you'll use:

In [4]:
import json
import watson_developer_cloud
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 \
  import Features, EntitiesOptions, KeywordsOptions
    
import ibm_boto3
from botocore.client import Config

import re
import nltk
import datetime
from nltk import word_tokenize,sent_tokenize,ne_chunk

import numpy as np

import unicodedata



## 2. Configuration
Set the configurable items of the notebook in the cells below.

### 2.1 Add your service credentials from IBM Cloud for the Watson services
Create a Watson Natural Language Understanding (NLU) service on IBM Cloud, then insert the username and password from its service credentials in the following cell. Do not change the value of the version field.
Run the cell.

In [5]:
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2018-03-23',
    username="",
    password="")
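
Rather than pasting credentials into the cell, one option is to read them from environment variables. The sketch below is illustrative only: the variable names `NLU_USERNAME` and `NLU_PASSWORD` and the helper `read_nlu_credentials` are assumptions, not part of the notebook or of the Watson SDK.

```python
import os

def read_nlu_credentials(env=None):
    """Return (username, password) from the environment, or raise if missing.

    NLU_USERNAME and NLU_PASSWORD are hypothetical variable names.
    """
    env = os.environ if env is None else env
    username = env.get("NLU_USERNAME", "")
    password = env.get("NLU_PASSWORD", "")
    if not username or not password:
        raise ValueError("Set NLU_USERNAME and NLU_PASSWORD before running the cell")
    return username, password
```

The two returned values can then be passed as the `username` and `password` arguments in the cell above.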

### 2.2 Add your service credentials for Object Storage
Create an Object Storage service on IBM Cloud. To access a file in Object Storage, you need the service's authentication credentials. Replace the contents of the following cell with your own credentials, assigned to `credentials_1`.

In [6]:

# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials_1 = {
    'IBM_API_KEY_ID': '',
    'IAM_SERVICE_ID': '',
    'ENDPOINT': 'https://s3.eu-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.eu-gb.bluemix.net/oidc/token',
    'BUCKET': '',
    'FILE': 'form-doc-1.txt'
}


### 2.3 Global Variables
Set the names of the sample document and the two configuration files to read from Object Storage.

In [7]:
sampleText='form-doc-1.txt'
ConfigFileName_Entity='config_entity_extract.txt'
ConfigFileName_Classify= 'config_legaldocs.txt'

### 2.4 Configure and download required NLTK packages
Download the 'punkt' tokenizer and the 'averaged_perceptron_tagger' model, which NLTK uses for tokenization and part-of-speech (POS) tagging.

In [8]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /home/dsxuser/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/dsxuser/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## 3. Persistence and Storage

### 3.1 Configure Object Storage Client

In [9]:
cos = ibm_boto3.client('s3',
                    ibm_api_key_id=credentials_1['IBM_API_KEY_ID'],
                    ibm_service_instance_id=credentials_1['IAM_SERVICE_ID'],
                    ibm_auth_endpoint=credentials_1['IBM_AUTH_ENDPOINT'],
                    config=Config(signature_version='oauth'),
                    endpoint_url=credentials_1['ENDPOINT'])

def get_file(filename):
    '''Retrieve file from Cloud Object Storage'''
    fileobject = cos.get_object(Bucket=credentials_1['BUCKET'], Key=filename)['Body']
    return fileobject

def load_string(fileobject):
    '''Load the file contents into a Python string'''
    text = fileobject.read()
    return text

def put_file(filename, filecontents):
    '''Write file to Cloud Object Storage'''
    resp = cos.put_object(Bucket=credentials_1['BUCKET'], Key=filename, Body=filecontents)
    return resp
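
`get_object` returns the object body as a stream of bytes, so the text has to be decoded before it is used. Here is a minimal sketch of that step, using `io.BytesIO` as a stand-in for the Object Storage body. The helper name `read_text` is hypothetical, and decoding with `utf-8-sig` is one way to drop a leading byte order mark if the file has one.

```python
import io

def read_text(fileobject, encoding="utf-8-sig"):
    """Read a file-like object and return its contents as str."""
    data = fileobject.read()
    if isinstance(data, bytes):
        # utf-8-sig behaves like utf-8 but silently removes a leading BOM.
        data = data.decode(encoding)
    return data

# In-memory stand-in for the body returned by Object Storage.
body = io.BytesIO(b"\xef\xbb\xbfPURCHASE AGREEMENT\n")
print(read_text(body))
```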

## 4. Input Data
Read the sample document for entity extraction from Object Storage, then read the configuration files that define the augmented entity-value pairs and the classification rules.

In [10]:
text_file= load_string(get_file(sampleText))
if isinstance(text_file, bytes):
    text_file = text_file.decode('utf-8') 
print(text_file)

﻿PURCHASE AGREEMENT

THIS IS A LEGALLY BINDING CONTRACT BETWEEN
PURCHASER AND SELLER.
IF YOU DO NOT UNDERSTAND IT, SEEK LEGAL ADVICE.

1. PARTIES TO CONTRACT - PROPERTY. Purchaser and Seller acknowledge that Broker is ABC
is not the limited agent of both parties to this transaction as outlined in Section III of the Agency
Agreement Addendum as authorized by Purchaser and Seller.

, XYZ hereinafter referred to as

Purchaser, offers and agrees to purchase from , UVW
hereinafter referred to as Seller, upon the tenns and conditions set forth, the property legally described as:

also known as

2. EARNEST MONEY DEPOSIT. Earnest Money in the amount of ($ )
DOLLARS Cash Check ,
unless otherwise noted herein, shall be deposited into the trust account of the listing selling

broker on the next legal banking day after acceptance of this offer.

Other earnest money provisions:

3. PURCHASE PRICE. The total purchase price is to be ($ )
DOLLARS

After earnest money herein

In [11]:
config_entity = load_string(get_file(ConfigFileName_Entity)).decode('utf-8')
print(config_entity)

{
	"configuration":{
		"class":{

				"stages" : [
					{
					"name": "Intro",
					"steps": [
					{
						"type":"text",
						"tag":"Landlord",
						"regex": "Chunk: {<NNP> <NNP> (<IN> <VBP>)?}"
					},

					{
						"type":"text",
						"tag":"Tenant",
						"regex": "Chunk: {<NNP> <NNP> (<IN> <VBP>)?}"
					},
					{

						"type":"date",
						"tag":"Date",
						"regex1":"\\d+/\\d+/\\d+"
					}

					]

					},

					{
					"name": "Term",
					"steps":[
					{
						"term_type": "Fixed",
						"type":"date",
						"tag":"beginning on",
						"regex":"\\d+/\\d+/\\d+"
					},

					{
						"term_type": "Fixed",
						"type":"date",
						"tag":"ending on",
						"regex":"\\d+/\\d+/\\d+"
					},

					{
						"term_type": "Month",
						"type":"date",
						"tag":"beginning on",
						"regex":"\\d+/\\d+/\\d+"
					}
					]


					},

					{
					"name": "Rent",
					"steps":[
					{
						"type":"amount",
						"tag"
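
The `date` steps shown in this configuration pair a tag such as `beginning on` with the pattern `\d+/\d+/\d+`. As a rough sketch of how such a step could be applied, the function below searches for the tag followed by the pattern and keeps the match only if it parses as a day/month/year date. The function name `find_tagged_date` and the sample sentence are made up for illustration.

```python
import re
from datetime import datetime

def find_tagged_date(tag, pattern, text, fmt="%d/%m/%Y"):
    """Return the date string that follows `tag` in `text`, or None."""
    match = re.search(re.escape(tag.lower()) + r"\s+(" + pattern + ")", text.lower())
    if match is None:
        return None
    candidate = match.group(1)
    try:
        datetime.strptime(candidate, fmt)  # raises ValueError if not a real date
    except ValueError:
        return None
    return candidate

# Invented sentence, not taken from the sample document.
sample = "The lease term is fixed, beginning on 01/02/2018 and ending on 31/01/2019."
print(find_tagged_date("beginning on", r"\d+/\d+/\d+", sample))
print(find_tagged_date("ending on", r"\d+/\d+/\d+", sample))
```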

In [12]:
config_class = load_string(get_file(ConfigFileName_Classify)).decode('utf-8')
print(config_class)

{
	"configuration":{
		"classification":{
			"stages":[
				{
					"doctype":"Rental",
					"entities":[
						{
							"tag":"Lease Term",
							"text":"lease term"
						},
						
						{
							"tag":"Rent",
							"text":"Rent"
						},
						{
							"tag":"Security Deposit",
							"text":"Security Deposit"
						}
					
					]
				},
				{
					"doctype":"Purchase",
					"entities":[
						{
							"tag":"PARTIES TO CONTRACT - PROPERTY",
							"text":"PARTIES TO CONTRACT - PROPERTY"
						},
						
						{
							"tag":"EARNEST MONEY DEPOSIT",
							"text":"EARNEST MONEY DEPOSIT"
						},
						{
							"tag":"PURCHASE PRICE",
							"text":"PURCHASE PRICE"
						}
					
					]
				}
				
			]
		
		}
	
	}
}
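
The classification file lists, for each `doctype`, the phrases that section 6 looks for in a document. Here is a minimal sketch of reading that structure into a dictionary, using a shortened inline copy of the configuration; `doctype_markers` is a hypothetical helper name.

```python
import json

# Shortened copy of the classification configuration, for illustration.
sample_config = """
{"configuration": {"classification": {"stages": [
  {"doctype": "Rental",
   "entities": [{"tag": "Rent", "text": "Rent"},
                {"tag": "Security Deposit", "text": "Security Deposit"}]},
  {"doctype": "Purchase",
   "entities": [{"tag": "PURCHASE PRICE", "text": "PURCHASE PRICE"}]}
]}}}
"""

def doctype_markers(config_text):
    """Map each doctype to its lower-cased marker phrases."""
    stages = json.loads(config_text)["configuration"]["classification"]["stages"]
    return {
        stage["doctype"]: [e["text"].strip().lower() for e in stage["entities"]]
        for stage in stages
    }

print(doctype_markers(sample_config))
```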


## 5. Entity Extraction
Extract the required entities from the document and use them to augment the results returned by Watson NLU.

### 5.1 Entities Extracted by Watson NLU

In [13]:
def analyze_using_NLU(analysistext):
    """ Call Watson Natural Language Understanding service to obtain analysis results.
    """
    response = natural_language_understanding.analyze( 
        text=analysistext,
        features=Features(keywords=KeywordsOptions()))
    response = [r['text'] for r in response['keywords']]
    return response
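
`analyze_using_NLU` keeps only the `text` field of each keyword. The sketch below applies the same list comprehension to a hand-written dictionary shaped like the `keywords` part of a response. The sample values are invented for illustration and are not real service output, and `keyword_texts` is a hypothetical helper name.

```python
# Invented sample shaped like the 'keywords' part of an NLU response.
sample_response = {
    "keywords": [
        {"text": "purchase price", "relevance": 0.93},
        {"text": "earnest money", "relevance": 0.87},
    ]
}

def keyword_texts(response):
    """Return the text of each keyword in an NLU-style response dict."""
    return [k["text"] for k in response.get("keywords", [])]

print(keyword_texts(sample_response))
```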

### 5.2 Extract Entity-Value 
Custom utility functions for entity extraction that augment the results of the Watson NLU API call.

In [14]:
def POS_tagging(text):
    """ Generate Part of speech tagging of the text.
    """
    sent = re.sub(r'\n',' ',text)
    words = nltk.word_tokenize(sent)
    POSofText = nltk.tag.pos_tag(words)
    return POSofText


entval= dict()
def text_extract(reg, tag,text):
    """ Use Chunking to extract text from sentence
    """
    entities = list()
    chunkParser= nltk.RegexpParser(reg)
    chunked= chunkParser.parse(POS_tagging(text))
    #print(chunked)
    for subtree in chunked.subtrees():
        if subtree.label() == 'Chunk':
            #print(subtree.leaves())
            entities.append(subtree.leaves())
    #print(entities)
    for i in range(len(entities)):
        for j in range(len(entities[i])):
            #print(entities[i][j][0].lower())
            if tag.strip().lower() in entities[i][j][0].lower():
                #print(entities[i])
                entval.update({tag: find_NNP(entities[i],tag)})
    return entval


def find_NNP(ent, tag):
    """ Find NNP POS tags
    """
    e= ent
    for i in range(len(e)):
        if (tag not in e[i]) and (e[i][1] == 'NNP'):
            return e[i][0]



def checkValid(date):
    #f= datetime.datetime.strftime(date)
    try:
        datetime.datetime.strptime(date.strip(),"%d/%m/%Y")
        return 1
    except ValueError as err:
        print(err)
        return 0
    
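def date_extract(reg, tag, text, stage_name):
    """ Find a date that follows the tag and return it as {tag: date} if valid
    """
    d= dict()
    tag_lower= tag.lower()
    dates=re.findall(re.escape(tag_lower)+' '+reg,text.lower())
    if not dates:
        return d
    # Drop the leading tag; str.strip(tag) would strip characters, not the prefix.
    temp= dates[0][len(tag_lower):].strip()
    if checkValid(temp) == 1:
        d.update({tag_lower:temp})
    return d

def amt_extract(reg,tag,text):
    """ Return every substring of the text that matches the amount pattern
    """
    amt= re.findall(reg,text)
    return amt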
    
entities_req= list()
def entities_required(text,step, types):
    """ Extracting entities required from configuration file
    """
    configjson= json.loads(config_entity)
    for i in range(len(step)):
        if step[i]['type'] == types:
            entities_req.append(str(step[i]['tag']))
            #entities_req.append([c['tag'] for c in configjson['configuration']['class'][i]['steps'][j]])
    return entities_req

# entlist= list()
def extract_entities(config,text):
    """ Extracts entity-value pairs
    """
    configjson= json.loads(config)
    stages= configjson['configuration']['class']['stages']
    ent= dict()  # returned as-is if no stage or step matches
    for stage in stages:
        # Only these stages contain 'text' steps that are resolved by chunking.
        if stage['name'] in ('Intro', 'Parties to Contract'):
            for step in stage['steps']:
                if step['type'] == 'text':
                    ent = text_extract(step['regex'],step['tag'],text)
                #elif step['type'] == 'date':
                    #dates= date_extract(step['regex1'],step['tag'],text, stage['name'])
    return ent


      

In [15]:

extract_entities(config_entity, text_file)

{'Broker': 'ABC', 'Purchaser': 'XYZ', 'Seller': 'UVW'}
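
`text_extract` relies on NLTK chunking: a grammar such as `Chunk: {<NNP> <NNP> (<IN> <VBP>)?}` groups two consecutive proper nouns, and the proper noun that is not the tag itself is taken as the tag's value. The sketch below shows that idea on hand-tagged tokens, so it needs no tagger download. The POS tags and the helper name `tag_values` are assumptions for illustration, and real `pos_tag` output for the sample document may differ.

```python
import nltk

grammar = "Chunk: {<NNP> <NNP> (<IN> <VBP>)?}"
# Hand-written (word, POS) pairs standing in for nltk.pos_tag output.
tagged = [("Broker", "NNP"), ("ABC", "NNP"), ("is", "VBZ"),
          ("not", "RB"), ("the", "DT"), ("agent", "NN")]

def tag_values(tagged_tokens, tags, grammar=grammar):
    """Map each tag to the first other proper noun in the chunk that contains it."""
    found = {}
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk"):
        leaves = subtree.leaves()
        words = [w.lower() for w, _ in leaves]
        for tag in tags:
            if tag.lower() in words:
                # Take the first proper noun in the chunk that is not the tag.
                value = next((w for w, pos in leaves
                              if pos == "NNP" and w.lower() != tag.lower()), None)
                found[tag] = value
    return found

print(tag_values(tagged, ["Broker", "Purchaser"]))
```

Unlike `text_extract`, this sketch matches the tag against whole words rather than substrings and returns a fresh dictionary instead of updating a module-level one.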

## 6. Document Classification
Classify the document based on the entities extracted in the previous step.

In [16]:
def entities_required_classification(text,config):
    """ Extracting entities from configuration file
    """
    entities_req= list()
    configjson= json.loads(config)
    for stages in configjson['configuration']['classification']['stages']:
        class_req= stages['doctype']
        entities_req.append([[c['text'],class_req] for c in stages['entities']])
    return entities_req
#entities_required_classification(text2,config1)

In [17]:
def classify_text(text, entities,config):
    """ Classify type of document from list of entities(NLU + Configuration file)
    """
    entities_req= entities_required_classification(text,config)
    # Call NLU once; the keywords do not depend on the doctype being tested.
    res= analyze_using_NLU(text)
    text_lower= text.lower()
    for doctype_entities in entities_req:
        if not doctype_entities:
            continue
        # Normalise the marker phrases configured for this doctype.
        temp= [e[0].strip().lower() for e in doctype_entities]
        if all(x in text_lower for x in temp) and any(str(y) in text_lower for y in res):
            return doctype_entities[-1][1]
    # No doctype matched.
    return None

In [18]:
def doc_classify(text,config,config1):
    """ Classify type of Document
    """
    entities= analyze_using_NLU(text)
    temp= extract_entities(config,text)
    for k,v in temp.items():
        entities.append(k)
    #print(entities)
    entities= [e.lower() for e in entities]
    entities= [e.strip() for e in entities]
    entities= set(entities)
    ret=classify_text(text,entities,config1)
    return ret

In [19]:
doc_classify(text_file,config_entity,config_class)

'Purchase'
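
The core rule applied by `classify_text` can be summarised as: return the first doctype whose configured phrases all occur in the lower-cased text. Here is a simplified sketch of that rule without the NLU call; the function name `first_matching_doctype` and the abbreviated sample text are illustrative.

```python
def first_matching_doctype(text, markers_by_doctype):
    """Return the first doctype whose marker phrases all occur in text, else None."""
    lowered = text.lower()
    for doctype, markers in markers_by_doctype.items():
        if markers and all(m.lower() in lowered for m in markers):
            return doctype
    return None

markers = {
    "Rental": ["lease term", "rent", "security deposit"],
    "Purchase": ["parties to contract - property",
                 "earnest money deposit", "purchase price"],
}
# Abbreviated stand-in for the sample purchase agreement.
sample = ("1. PARTIES TO CONTRACT - PROPERTY. ... "
          "2. EARNEST MONEY DEPOSIT. ... 3. PURCHASE PRICE. ...")
print(first_matching_doctype(sample, markers))
```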