# Text Engineering

We will use a few text-processing techniques to find rare terms, with the goal of increasing coverage and effectiveness.

The first technique is the same as in the "Exploring and Processing" notebook: reduce the text, then analyze it.

In [1]:
import pandas as pd
from sklearn.metrics import confusion_matrix
import numpy as np
from collections import OrderedDict
from IPython.display import Markdown as md
import re
import ast

In [2]:
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.

#These are dummy values for example:

credentials = {
    'IAM_SERVICE_ID': 'iam-ServiceId-1340asfdasdfavno',
    'IBM_API_KEY_ID': 'apikeyapikeyapikeyapikeyapikeyapikeyapikey',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.ng.bluemix.net/oidc/token',
    'BUCKET': 'watsonassistantaugmentedanalytics-donotdelete-pr-asdfafasdfasfasd'
}

In [3]:
# DOWNLOAD THE FILE
from ibm_botocore.client import Config
import ibm_boto3

def download_file_cos(credentials, local_file_name, key):
    """Download the object `key` from the COS bucket to `local_file_name`."""
    cos = ibm_boto3.client(
        service_name='s3',
        ibm_api_key_id=credentials['IBM_API_KEY_ID'],
        ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
        ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
        config=Config(signature_version='oauth'),
        endpoint_url=credentials['ENDPOINT'],
    )
    try:
        cos.download_file(Bucket=credentials['BUCKET'], Key=key, Filename=local_file_name)
    except Exception as e:
        # Report the error message rather than printing the Exception class itself
        print('Download failed:', e)
    else:
        print('File Downloaded')

In [4]:
download_file_cos(credentials,'annotation.xlsx','All.xlsx')
annotation_file = 'annotation.xlsx'

File Downloaded


In [5]:
column_list = [
    '',
    'log_id',
    'conversation_id',
    'timestamp',
    'customer_id',
    'utterance_text',
    'response_text',
    'top_intent',
    'top_confidence',
    'intent_2',
    'intent_2_confidence',
    'confidence_gap',
    'intent_3',
    'intent_3_confidence',
    'entities',
    'is_escalated',
    'is_convered',
    'not_convered_cause',
    'dialog_flow',
    'dialog_stack',
    'dialog_request_counter',
    'dialog_turn_counter',
    'correctness',
    'helpfulness',
    'root_cause',
    'correct_intent',
    'new_intent',
    'add_train',
    'missed_entity',
    'new_entity',
    'new_entity_value',
    'new_dialog_logic',
    'wrong_dialog_node',
    'no_dialog_node_triggered'
]

annotated_data = pd.read_excel(annotation_file, sheet_name='Not_Covered', names=column_list)

### Convert to Lowercase

In [6]:
# Lowercase every token; split()/join() also collapses repeated whitespace
annotated_data['utterance_text'] = annotated_data['utterance_text'].apply(lambda text: " ".join(word.lower() for word in text.split()))

### Correct Spelling

In [7]:
#Install textblob library
!pip install textblob



In [8]:
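# Import the library and use the 'correct' function.
# Note: the result is only displayed, not assigned back to the column.
# The output below shows correct() rewriting domain terms such as 'cin' and 'csi'.
from textblob import TextBlob
annotated_data['utterance_text'].apply(lambda x: str(TextBlob(x).correct()))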

0          how do i start the process for child support?
1                                       contact numbers?
2                                       contact numbers?
3                                        contact numbers
4                                       contact numbers?
5                                         how can i pay?
6      what if the mother is not sure who the father is?
7                            i can't hear over the phone
8                                         where are you?
9      can a parent take custody of the child instead...
10     can any other agency handle child support enfo...
11     i have some child support issues, but i am dea...
12                           can i pay with credit card?
13                                           i need a in
14                                   i can't login to si
15                                do you accept plastic?
16                                are you a real person?
17                             

### Remove Punctuation

In [9]:
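# regex=True is required on recent pandas versions, where str.replace defaults to literal matching
annotated_data['utterance_text'] = annotated_data['utterance_text'].str.replace(r'[^\w\s]', '', regex=True)
annotated_data['utterance_text'].head(3)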

0    how do i start the process for child support
1                                 contact numbers
2                                 contact numbers
Name: utterance_text, dtype: object
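
To see why later outputs contain tokens such as `cant`, here is a minimal standard-library sketch of the same pattern applied to a few strings. The first two strings come from the outputs above; the third is made up for illustration, and `strip_punctuation` is a hypothetical helper name, not part of the notebook.

```python
import re

# Same character class as the notebook cell: anything that is neither a
# word character (\w) nor whitespace (\s) is deleted.
PUNCTUATION = re.compile(r"[^\w\s]")


def strip_punctuation(text):
    """Delete every character that is not a word character or whitespace."""
    return PUNCTUATION.sub("", text)


samples = [
    "contact numbers?",
    "i can't hear over the phone",
    "e-mail me",  # illustrative only, not from the data set
]
cleaned = [strip_punctuation(s) for s in samples]
print(cleaned)
```

Apostrophes and hyphens are removed without inserting a space, so contractions and hyphenated words are fused into single tokens.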

### Remove Stop Words

In [10]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dsxuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
stop = set(stopwords.words('english'))  # a set makes the membership test faster than a list
annotated_data['utterance_text'] = annotated_data['utterance_text'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))

In [12]:
annotated_data['utterance_text'].head(5)

0    start process child support
1                contact numbers
2                contact numbers
3                contact numbers
4                contact numbers
Name: utterance_text, dtype: object
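
One side effect worth keeping in mind: an utterance made up entirely of stop words is reduced to an empty string, which is probably what leaves the double space in the joined `words` string further down. The sketch below illustrates this with a tiny hand-picked stop list (an assumption for the example; the notebook uses NLTK's full English list) and a hypothetical helper `remove_stop_words`.

```python
# A tiny, hand-picked stop list for illustration only; the notebook uses
# NLTK's much longer stopwords.words('english').
STOP_WORDS = {"how", "do", "i", "the", "for", "where", "are", "you"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


utterances = [
    "how do i start the process for child support",
    "where are you",
]
reduced = [remove_stop_words(u) for u in utterances]
print(reduced)

# Utterances that collapse to an empty string can be dropped before analysis.
non_empty = [u for u in reduced if u]
print(non_empty)
```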

In [14]:
#Import library
from textblob import Word

import nltk
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /home/dsxuser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/dsxuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Lemmatizing

In [15]:
annotated_data_lem = annotated_data['utterance_text'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

### Generate N-Gram

We don't go very far with N-grams in this notebook; future versions may include a predictive algorithm.

In [17]:
from textblob import TextBlob

In [18]:
df_annotated_data_lem = pd.DataFrame(data=annotated_data_lem)
words = ' '.join(df_annotated_data_lem['utterance_text'].tolist())

In [19]:
words

'start process child support contact number contact number contact number contact number pay mother sure father cant hear phone  parent take custody child instead making child support payment agency handle child support enforcement case child support issue deaf go communicating attorney general office pay credit card need cin cant login csi accept plastic real person dont take amex pay online want contest payment login child support pay amex dont want pay need help need help need help hiba help need person need help many licensing agency involved 3456 find office located payment accept talk live assistant garrett complete acknowledgment paternity form live assistant wont able hear phone 1234 ken paxton live assistant live person garrett located want talk person live assistant zahra pay mail accept plastic accept bank note 1234 talk live assistant garrett cant login csi office help havent received child support pay cash located hour document needed office attorney general child support 

In [20]:
TextBlob(words).ngrams(3)

[WordList(['start', 'process', 'child']),
 WordList(['process', 'child', 'support']),
 WordList(['child', 'support', 'contact']),
 WordList(['support', 'contact', 'number']),
 WordList(['contact', 'number', 'contact']),
 WordList(['number', 'contact', 'number']),
 WordList(['contact', 'number', 'contact']),
 WordList(['number', 'contact', 'number']),
 WordList(['contact', 'number', 'contact']),
 WordList(['number', 'contact', 'number']),
 WordList(['contact', 'number', 'pay']),
 WordList(['number', 'pay', 'mother']),
 WordList(['pay', 'mother', 'sure']),
 WordList(['mother', 'sure', 'father']),
 WordList(['sure', 'father', 'cant']),
 WordList(['father', 'cant', 'hear']),
 WordList(['cant', 'hear', 'phone']),
 WordList(['hear', 'phone', 'parent']),
 WordList(['phone', 'parent', 'take']),
 WordList(['parent', 'take', 'custody']),
 WordList(['take', 'custody', 'child']),
 WordList(['custody', 'child', 'instead']),
 WordList(['child', 'instead', 'making']),
 WordList(['instead', 'making', 

In [1]:
#DEBUG
#df_annotated_data_lem

In [22]:
#importing the function
from sklearn.feature_extraction.text import CountVectorizer

# create the transform
vectorizer = CountVectorizer(ngram_range=(3,3))
# tokenizing
vectorizer.fit(df_annotated_data_lem['utterance_text'])
# encode document
vector = vectorizer.transform(df_annotated_data_lem['utterance_text'])
# summarize & generating output
print(vectorizer.vocabulary_)
print(vector.toarray())


{'start process child': 73, 'process child support': 68, 'mother sure father': 56, 'cant hear phone': 10, 'parent take custody': 64, 'take custody child': 78, 'custody child instead': 23, 'child instead making': 12, 'instead making child': 46, 'making child support': 54, 'child support payment': 17, 'agency handle child': 4, 'handle child support': 39, 'child support enforcement': 14, 'support enforcement case': 75, 'child support issue': 16, 'support issue deaf': 76, 'issue deaf go': 47, 'deaf go communicating': 24, 'go communicating attorney': 38, 'communicating attorney general': 20, 'attorney general office': 7, 'pay credit card': 66, 'cant login csi': 11, 'dont take amex': 30, 'want contest payment': 86, 'dont want pay': 32, 'many licensing agency': 55, 'licensing agency involved': 49, 'talk live assistant': 81, 'complete acknowledgment paternity': 21, 'acknowledgment paternity form': 2, 'wont able hear': 94, 'able hear phone': 0, 'want talk person': 92, 'accept bank note': 1, 'ha

### TF-IDF

Term frequency (TF): the ratio of the count of a word to the length of the sentence.

Inverse document frequency (IDF): the log of the ratio of the total number of rows to the number of rows in which a word is present.

TL;DR:

IDF: the rareness of a term
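
As a small worked example, the sketch below uses the smoothed IDF variant that scikit-learn applies when `smooth_idf=True`, namely `ln((1 + n) / (1 + df)) + 1`. The value `n = 193` is the number of rows reported for `word_count_vector`; the document frequencies are assumptions chosen for illustration. A document frequency of 28 happens to reproduce the weight printed for `pay` further down, but that figure is inferred by working backwards, not stated in the notebook.

```python
import math


def smooth_idf(n_documents, document_frequency):
    """Smoothed IDF as computed by scikit-learn when smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


n_documents = 193  # rows in word_count_vector

# Assumed document frequencies, from very rare to present in every row.
for df in (1, 2, 28, 193):
    print(df, round(smooth_idf(n_documents, df), 6))
```

The rarer the term, the higher its IDF; a term that appears in every row gets the minimum weight of 1.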


In [23]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
#instantiate CountVectorizer()
cv=CountVectorizer()
 
# this steps generates word counts for the words in your docs
word_count_vector=cv.fit_transform(df_annotated_data_lem['utterance_text'])

In [24]:
word_count_vector.shape

(193, 130)

In [25]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

#### The lower the IDF value of a word, the less unique it is to any particular document.

In [27]:
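# print idf values
# Note: on scikit-learn >= 1.0 use cv.get_feature_names_out(); get_feature_names() was removed in 1.2.
# The column holds IDF weights (the name "tf_idf_weights" is kept to match the output below).
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(), columns=["tf_idf_weights"])
 
# sort ascending: the most common (lowest-IDF) terms come first
df_idf.sort_values(by=['tf_idf_weights'])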

         tf_idf_weights
pay            2.900562
child          2.935654
support        2.935654
live           3.434645
need           3.495269
office         3.495269
want           3.559808
contact        3.559808
accept         3.628801
general        3.782952


In [28]:
# count matrix
count_vector=cv.transform(df_annotated_data_lem['utterance_text'])
 
# tf-idf scores
tf_idf_vector=tfidf_transformer.transform(count_vector)

#### How important each word is within a given user utterance.

In [29]:
# On scikit-learn >= 1.0 use cv.get_feature_names_out() instead
feature_names = cv.get_feature_names()
 
#get tfidf vector for first document
first_document_vector=tf_idf_vector[0]
 
#print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False).head(5)

            tfidf
start    0.643564
process  0.596755
support  0.338902
child    0.338902
1234     0.000000
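
To make these scores less of a black box, here is a hedged reconstruction of the first document's vector. `TfidfTransformer` multiplies the raw term count by the smoothed IDF and then scales the row to unit L2 length (`norm='l2'`). The document frequencies below are assumptions inferred by working backwards from the printed weights; they are not stated in the notebook.

```python
import math


def smooth_idf(n_documents, document_frequency):
    """Smoothed IDF as computed by scikit-learn when smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


n_documents = 193  # rows in word_count_vector

# Assumed document frequencies (inferred, not taken from the notebook).
document_frequency = {"start": 1, "process": 2, "child": 27, "support": 27}

# First utterance after reduction: "start process child support".
term_counts = {"start": 1, "process": 1, "child": 1, "support": 1}

# Raw tf * idf for each term, then L2 normalisation of the whole row.
raw = {
    term: count * smooth_idf(n_documents, document_frequency[term])
    for term, count in term_counts.items()
}
norm = math.sqrt(sum(value * value for value in raw.values()))
tfidf = {term: value / norm for term, value in raw.items()}

for term, score in sorted(tfidf.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:8s} {score:.6f}")
```

Under these assumptions the rare terms `start` and `process` dominate the utterance, while the very common `child` and `support` contribute much less.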


Much of the more advanced NLP is done for you in Watson Assistant, e.g. determining which words in a phrase are nouns and how confident it is that it can answer the question. After running this notebook, we can run 