# Introduction

This tutorial will introduce you to one of the Natural Langrage Processing approaches called Speech Analytics Model by VERINT. In the class we've learned several useful libraries and algorithms that can help us to do specific tasks. You may also find some of them in this tutorial but the goal is to understand the whole model structure and different role each part plays and how does it contributes to the process of solving business issue. We will apply this model to solve a call driver analysis problem in this tutorial.

**Please do not share this notebook or the data with others.**

## Natural Langrage Processing and Speech Analytics Model
According to the Wikipedia, the definition of the NLP is "A field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages." https://en.wikipedia.org/wiki/Natural_language_processing. 

Processing natrual language is hard. The same word can express different meanings in different contents. (Think about the subtle irony). Apart from sentiment analysis, translation is another common application of NLP practice. Recently, Google announced a breakthough in Neural Machine Translation which reduces the translation errors by an average of 60% compared to Google's phrase-based production system. You can find more details about NMT here: https://arxiv.org/pdf/1609.08144v2.pdf

In practise, when we want to analysis speech data, the Speech Analytics Model would be a handy framwork to follow. It includes three levels: Keyword Spotting, Content Categorization and Root Cause Analytics. (please refer to the graph below) Of course in real business problem we need to taylor the model to make it more suitable for the case.

![title](speech.png)

## Background Introduction
Nowadays almost all the companies have their call centers. The customer service team over the phone are supposed to solve the enquiries and problems from the customers. Company A receives increasing incoming call as well as lots of complains about increasing waiting time. (shows below)

The company wants to know why people are calling so that they can optimize their business practice to reduce the call volumn as well as the calling time. In this tutorial we are going to do a small call driver analysis.

![title](background.png)


# Tutorial content

In this tutorial, we will follow the basic speech analytics model's structure, optimize it and apply the apporach on the call driver analysis case.

We will be working on a small interaction sample data (a small fraction of monthly data) from a telecom company. The general apporach could be apply to different business cases with corresponding adjustments.

We will cover the folowing topics in this tutorial:
* Installing the libraries
* Loading and understand the data
* Keyword spotting (Named-entity recognition, topic modeling)
* Content categorization (SVM, RNN-LSTMs)

# Installing the libraries

Before getting started, you'll need to install the various libraries that we will use. You can install lda, nltk, scikit-learn, Keras and TensorFlow using pip:

    $ pip install lda
$ sudo pip install -U nltk
    $ pip install -U scikit-learn
$ sudo pip install keras

if you have anaconda installed, please visit the link https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#pip-installation and
https://www.tensorflow.org/versions/r0.10/get_started/os_setup.html#anaconda-installation to
choose the right install method of keras and TensorFlow for your system.
Here I use

    $ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.11.0rc1-py2-none-any.whl
$ sudo pip install --upgrade $TF_BINARY_URL
    
sudo pip will not install the packages into local anaconda site-packages so that it cannot be found.

    $ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.10.0-py2-none-any.whl
$ sudo pip install --ignore-installed --upgrade $TF_BINARY_URL


    LDA: https://pypi.python.org/pypi/lda
    NLTK: http://www.nltk.org/install.html
    Scikit-learn: http://scikit-learn.org/stable/install.html
    Keras: https://keras.io/#installation

In [12]:
import re
import csv
import lda
import time
import nltk
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras.preprocessing.text as pre

from collections import Counter
from sklearn import preprocessing
from nltk.corpus import stopwords
from keras.models import Sequential
from keras.preprocessing import sequence
from nltk.tokenize import wordpunct_tokenize
from keras.utils.np_utils import to_categorical
# from keras.layers.recurrent import GRU as GRU
from sklearn.feature_extraction.text import CountVectorizer
from keras.layers import Activation, Embedding, Dense, LSTM, Dropout


# Loading and understand the data

Now that we've installed and loaded the libraries, let's loat the data example to see how's the data looks like. The data is stored in csv format so we can load it into data frame for further process

In [2]:
df = pd.read_csv('sampleData.csv')
print df.head()

   iInteractionId    iContactId  \
0    6.280000e+18  6.280000e+18   
1    6.280000e+18  6.280000e+18   
2    6.280000e+18  6.280000e+18   
3    6.280000e+18  6.280000e+18   
4    6.280000e+18  6.280000e+18   

                                             Content  
0    i used to get mine i'm speaking with a sound...  
1    hello the payments that sort of them speakin...  
2    so i just need them you here overgrown about...  
3    so i usually david since right how are you s...  
4    the money once it's not come on the phone th...  


The table has three columns, iInteractionID, iContactId and Content. The first two are mainly used as primary key that helps to find the specific interaction. The conversation content that is converted from the voice record stores in content column.
Let's find one conversation record to look in to the details.

In [3]:
content = list(df.ix[:,2])
print content[10]


  i used to get mine else we can look hello yes a c yes yeah okay oh really oh i thought i know you know saddam open unfortunately what's your name sorry misspoke into most of and you're looking at getting building connected so rights no i don't know yeah what was your mother sorry what was your hollywood trace right are you calling from phone in the same rights okay i understand let me look i'll go see society juices for a total of march which really hard why is that correct all what's the street interest cold now do you have a number on this i said what you want to sorry yes really slow yeah yeah they typically don't update please number is which doesn't surprise me limited cookbooks okay no it doesn't work on this issue is so i need to see if i can check what the what i'm interested it's probably so he said he saw it's the main road reachable i used to check what's your address and so i'm okay one sec i'll just see what options it has you know what he is you know that's good so well



We can tell from the text above that it does not make much sense. There are many oral words as well as mis-interpreted words. These chaos make it very diffecult to identify the main topic of the conversation. Although we can still make mostly correct judgement on the topic after reading the test above but there're tens of thousands of transactions per month and the company want to analysis one year's data. 

The call drive analysis is actually a classification on natural language test without any predifined classes. So our first step is to find thouse potential classes. Before that, we need to pre-process the data and remove those noise as possible.
The noise could be mis-interpreted words, stopwords, oral words or in this case: words that has no business meaning.


In [4]:
def _getWordList(content):
    docs = []
    rep = {r"phone number": "phonenumber",
           r"line number": "linenumber",
           r"call back": "callback",
           r"ring back": "callback",
           r"cell phone": "phonenumber",
           r"mobil number": "phonenumber",
           r"mobile number": "phonenumber",
           r"ticket number": "ticketnumber",
           r"account number": "accountnumber",
           r"number account": "accountnumber",
           r"email address": "email",
           r"late payment": "latepayment",
           r"make payment": "makepayment",
           r"put payment": "makepayment",
           r"made payment": "madepayment",
           r"bill payment": "billpayment",
           r"land line": "landline",
           r" talk ": " speak ",
           r" thank ": " thanks ",
           r"calling": "call",
           r"paying" : "pay",
           r"trying" : "try",
           r"anything" : "something"
           }
    rep = dict((re.escape(k), v) for k, v in rep.iteritems())
    pattern = re.compile("|".join(rep.keys()))
    for i in range(len(content)):
        line = pattern.sub(lambda m: rep[re.escape(m.group(0))], content[i])   
        line = re.sub(r'[^\w\s]','',line)            
        docs.append(line)
    stop_words = set(stopwords.words('english'))
    stop_words.update(['~','.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}', 'anyway', 'hold', 'make', 'zero', 'first', 'hi', 'much', 'still', 'sort', 'alright', 'take', 'name', 'sure', 'would', 'look', 'got', 'six', 'back', 'put', 'give', 'nine', 'want', 'let', 'got', 'oh', 'yeah', 'yeh', 'yep', 'ah', 'like', 'know', 'right', 'think', 'okay', 'get', 'see', 'okay','well','three','two'])
    stop_words.update(['else','sorry','said','going','thanks','hundr', 'id','seventeen','cant', 'pop', 'sorri','im', 'dont', 'ill', 'went', 'long','help', 'richard','youll', 'set','full', 'old', 'thank', 'call', 'shall', 'put', 'speak','bit','one', 'need','say','five','go','might','four','tell','told','us','around','mom','either','use','me','five','bye','mind','keep'])        
    stop_words.update(['also', 'ive', 'youve', "yes", 'thats', 'im', 'dont', 'already', 'cant', 'us', 'ill', 'youre', 'could',
                       'go', 'actually', 'anyway', 'hold', 'make', 'zero', 'first', 'hi', 'much', 
                       'still', 'sort', 'alright', 'take', 'name', 'sure', 'would', 'look', 
                       'got', 'six', 'back', 'put', 'give', 'nine', 'want', 'let', 'got', 'oh', 
                       'yeah', 'yeh', 'yep', 'ah', 'like', 'know', 'right', 'think', 'okay', 
                       'get', 'see', 'okay','well','three','two', 'put', 'speak', 'four', 'five', 'litle',
                       'ok', 'gonna', 'whats', 'nine', 'eight', 'one', 'theres', 'please', 'good',
                       'able', 'forty', 'youll', 'hundred' ])               
    # Prepare docs an vocabulary removing stop words that we defined  
    ndocs = []
    word_list = []
    for doc in docs:
        tokens = wordpunct_tokenize(doc)
        wl = [i for i in tokens if i.lower() not in stop_words]
        ndocs.append(" ".join(wl))
        word_list += wl
    return list(set(word_list)), ndocs, stop_words
        

Now we have a customized lexicon, and preprocessed document list as well as the defined stop_words. This is a unigram methond. From the regular expression above you can find out that we join two words together. Why don't do a bigram analysis instead? You can try yourself and explore the reason.

## NER
The simplest key word spotting is Named-entity recognition. There are many ways to implement, I find that the nltk.pos_tag and nltk.FreqDist can be a good choice. We've learned pos in the class, the FreqDist will return a distribution with frequency.



In [5]:
def ner_process(document, stopword):
    text = nltk.word_tokenize(document)
    tagged = nltk.pos_tag(text)
    word_tag_fd = nltk.FreqDist(tagged)
    entity_names = []
    for (wt, _) in word_tag_fd.most_common():
        word = wt[0].replace(" ", "").strip()
        if (wt[1] == 'NN' or 'NNS' or 'VBN') and (len(word) > 1):
            if stopword.count(word) == 0:
                entity_names.append(word)
    counts = Counter(entity_names)
    return counts



In [6]:
vocab, docs, stop_words = _getWordList(content)
document = ' '.join(str(e) for e in docs)
counts = ner_process(document, list(stop_words))
print sorted(counts.iteritems(), key=lambda (k, v): (-v, k))[:5]


[('modem', 17), ('mean', 12), ('phonenumber', 12), ('twelve', 12), ('ask', 11)]


The individual keyword with frequency is not enough for us to interpret the topics. It only provides the infomation of the most frequent words that people are mentioned. So it's natural for us to find a way to generate the most frequent topic of the interactions.


## LDA



In natural language processing, Latent Dirichlet Allocation (LDA) is a generative probabilistic model of a corpus that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

The generative process is as follows:
Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for a corpus **D** consisting of **M** documents each of length **N**i:


LDA assumes the following generative process for each document w in a corpus D:
1. Choose N ∼ Poisson(ξ).
2. Choose θ ∼ Dir(α) where Dir(α) is the Dirichlet distribution for parameter α.
3. For each of the N words wn:
    (a) Choose a topic zn ∼ Multinomial(θ).
    (b) Choose a word wn from p(wn |zn,β), a multinomial probability conditioned on the topic zn.



Find more here: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Now we can perform a topic modeling on the docs, we are using lda (Latent Dirichlet Allocation ) package here.
This process can be regarded as unsupervised classification.


In [7]:
# Topic modeling function.
def _topicModels(vocab, docs, n_topics, g_num_top_words):
    cv = CountVectorizer(vocabulary = vocab)
    X = cv.fit_transform(docs).toarray();
    X.shape
    model = lda.LDA(n_topics=n_topics, n_iter=200, random_state=1)
    model.fit(X)
    topic_word = model.topic_word_
    
    num_top_words = g_num_top_words
    for i, topic_dist in enumerate(topic_word):
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-num_top_words:-1]
        print "Topic {}: {}".format(i, ' '.join(topic_words))
    return X, model    

def _plotTopics(x, vocab, model, n_topics, g_num_top_words):        
    #
    # Plotting and visualizing
    num_top_words = g_num_top_words
    num_topics = n_topics
    fontsize_base = 100 / np.max(x)
    topic_word = model.topic_word_
    
    for t, topic_dist in enumerate(topic_word):
        plt.subplot(1, num_topics, t + 1)
        plt.ylim(0, num_top_words + 0.5)
        plt.xticks([])
        plt.yticks([])
        plt.title('Topic #{}'.format(t))
        top_words_idx = np.argsort(x[:,t])[::-1] 
        #print top_words_idx
        top_words_idx = top_words_idx[:num_top_words]
        #print top_words_idx
        #top_words = vocab[top_words_idx]
        top_words = np.array(vocab)[np.argsort(topic_dist)][:-num_top_words:-1] 
        top_words_shares = x[top_words_idx, t]
        for i, (word, share) in enumerate(zip(top_words, top_words_shares)):
            fontsize = (fontsize_base*share) if (fontsize_base*share) > 12 else 12
            fontsize = 18 if fontsize > 18 else fontsize
            plt.text(0.3, num_top_words-i-0.5, word, fontsize=fontsize)

    plt.show()

In [8]:
n_topics = 10
n_words = 21
# Build the model
topic, model = _topicModels(vocab, docs, n_topics, n_words)

# _plotTopics(topic, vocab, model, n_topics, n_words)
# print "finished"

Topic 0: something check try mean time phone broadband thing problem card yet maybe didnt moment great may different find connected read
Topic 1: modem connection check working internet connected broadband cable getting box problem next using lets connect technician line correct point tried
Topic 2: number account something new send address phone seven system phonenumber wanted supposed double since correct speaking minute accounts ahead problem
Topic 3: really moment phone working mobile interest work different local second landline lot understand home itll details possible used reason phonenumber
Topic 4: cool come looking today people things probably stuff little pretty speaking time contract double twenty place takes couple kind basically
Topic 5: broadband connection landline address line connected order property done course thing work accountnumber house people modem installation customer speed questions
Topic 6: something try mean come find fine thing time doesnt point even mone



Now we have a clearer picture of what's going on here. For example, the topic 1 may be conversations about broadband internet modem connection.  Topic 8 maybe something about the payment. Taking business domain knowledge into account, we can draw a issue tree for this problem. 

My attempts lead me to the structure like this:


![title](issueTree.png)

You can certainly have different version of issue tree based on your understanding. Have a try.

Based on the issue tree we developed, now we have the tags that we need for the supervised classification. We can perform structured classification method. Here I choose to devide the problem into small parts and solve them seperately. 


## Dynamic Learning

We now have the tags, but without any training data. How do we get the training set for further classifications? 
First I examed the result from unsupervised classification by manually go through the content of the conversation and to check if the tag is correct. Then I use the sample (about 50 interactions for each class) to predict on larger dataset using SVM. Manually check the correctness of the tag and put those with high confidence into the training set.
And then we keep training on more unseen dataset and continually add samples with high confidence into training set. The process is called active(dynamic) learning.

You can try to generate your own training set. Here I provide mine training sample for the broadband class.


In [9]:
BBtraining = pd.read_csv('sampleWithID.csv')
print BBtraining.head(20)

         iInteractionId           iContactId  \
0   6280185542163380000  6280185542163380000   
1   6280199883101110000  6280199883101110000   
2   6280207102876130000  6280207102876130000   
3   6280208284040360000  6280208284040360000   
4   6280213798738520000  6280213798738520000   
5   6280216255459810000  6280216255459810000   
6   6281409350169860000  6281409350169860000   
7   6281422432600400000  6281422432600400000   
8   6281432392669410000  6281432392669410000   
9   6281432981054750000  6281432981054750000   
10  6283515662236180000  6283515662236180000   
11  6283518934986590000  6283518934986590000   
12  6283521370247720000  6283521370247720000   
13  6283526129071490000  6283526129071490000   
14  6280197061242590000  6280197061242590000   
15  6281413456133420000  6281413456133420000   
16  6281416020254070000  6281416020254070000   
17  6281430657477450000  6281430657477450000   
18  6281431937377700000  6281431937377700000   
19  6288744213730370854  628874356089534

I used SVM for the classification and get the result as follows:

    Fold 1 Out of Sample Accuracy = 0.8406755
    Fold 2 Out of Sample Accuracy = 0.8759399

![title](SVM.png)

Since we've already learned SVM in the class, I'll leave this part to yourself to try it out.

# RNN-LSTMs (alternative attempt)

Basic neural network assume all inputs and outputs are independent. Recurrent neural network's basic idea is to make use of sequential information.

![title](RNN.png)

    X: A token (word) as a vector
    O: Output label
    S : Memory, computed from the past memory and current word

However it still have issues like vanishing gradients when working on natural language problems since the meaning is heavily rely on the context.
    
LSTMs (Long Short Term Memories) have a different formula which add "residual information" to the next state instead of just transforming each state. The key idea of LSTMs is a "constant stream" flows through the entire chain. 

![title](LSTM.png)

The basic LSTMs also has input gate, forget/update gate and output gate. These gates together decide how the latest input will affect the model and to drop what information and what should be the output.

If you want to expolore more about RNN and LSTM, CS224D-Deep Learning for Natural Language Processing offered by Stanford NLP group would be a nice resource. You can find the lecture notes here:
http://cs224d.stanford.edu/syllabus.html

Here I will provide a rough version of LSTMs code application. This may take long time! (around 15 minutes for each epoch)
Try to write your own code and improve the accuracy.


In [14]:
def base_filter():
    f = string.punctuation
    f = f.replace("'", '')
    f += '\t\n'
    return f
    
    
# extract data and format them as the input to the model
def load_data(filepath):
    data = []
    classes = []
    rep = {r"phone number": "phonenumber",
           r"line number": "linenumber",
           r"call back": "callback",
           r"ring back": "callback",
           r"cell phone": "phonenumber",
           r"mobil number": "phonenumber",
           r"mobile number": "phonenumber",
           r"ticket number": "ticketnumber",
           r"account number": "accountnumber",
           r"number account": "accountnumber",
           r"email address": "email",
           r"late payment": "latepayment",
           r"make payment": "makepayment",
           r"put payment": "makepayment",
           r"made payment": "madepayment",
           r"bill payment": "billpayment",
           r"land line": "landline",
           r" talk ": " speak ",
           r" thank ": " thanks ",
           r"calling": "call",
           r"paying" : "pay",
           r"trying" : "try",
           r"anything" : "something"
           }
    rep = dict((re.escape(k), v) for k, v in rep.iteritems())
    pattern = re.compile("|".join(rep.keys()))
    stop_words = set(stopwords.words('english'))
    stop_words.update(['also', 'ive', 'youve', "yes", 'thats', 'im', 'dont', 'already', 'cant', 'us', 'ill', 'youre', 'could',
                       'go', 'actually', 'anyway', 'hold', 'make',  'first', 'hi', 'much', 
                       'still', 'sort', 'alright', 'take', 'name', 'sure', 'would', 'look', 
                       'got', 'back', 'put', 'give', 'want', 'let', 'got', 'oh', 
                       'yeah', 'yeh', 'yep', 'ah', 'like', 'know', 'right', 'think', 'okay', 
                       'get', 'see', 'okay','well','three','two', 'put', 'speak', 'four', 'five', 'litle',
                       'ok', 'gonna', 'whats', 'nine', 'eight', 'one', 'theres', 'please', 'good',
                       'able', 'forty', 'youll', 'hundred'])    
    token = pre.Tokenizer(nb_words= None, filters=base_filter(), lower=True, split=" ")
    with open (filepath, 'rb') as f:
        reader = csv.reader(f)
        next(reader) #ignore first row
        for row in reader:
            line = pattern.sub(lambda m: rep[re.escape(m.group(0))], row[2])
            line = re.sub(r'[^\w\s]','',line)
            tokens = wordpunct_tokenize(line)
            wl = [i for i in tokens if i.lower() not in stop_words]
            wl = ' '.join(wl)
            token.fit_on_texts(wl)
            list_of_lists = token.texts_to_sequences(wl)
            flattened = [val for sublist in list_of_lists for val in sublist]
            classes.append(row[3])
            data.append(flattened)
    array = np.asarray(data)
    array = sequence.pad_sequences(array)
    labeler = preprocessing.LabelEncoder()
    labels = set(classes)
    labeler.fit(list(labels))
    nb_classes = len(set(labels))
    label = labeler.transform(classes)
    label = to_categorical(label, nb_classes)
    return array, label, nb_classes

def run_model(Xtrain, xtrain, Xtest, xtest, nb_classes):
    results = []
    max_features = 1500
    embedding_dim = 32
    batch_size = 32
    epochs = 1 # to save time I set it to 1
    #modes = ['cpu', 'mem', 'gpu']
    modes = ['cpu']
    for mode in modes:
        print('Testing mode: consume_less="{}"'.format(mode))
    
        model = Sequential()  
        model.add(Embedding(max_features, embedding_dim, dropout=0.2))
        model.add(LSTM(embedding_dim, dropout_W=0.2, dropout_U=0.2, return_sequences=True))
        model.add(LSTM(16, return_sequences=True))
        model.add(LSTM(8))
        model.add(Dense(16, activation = 'tanh'))
        model.add(Dense(nb_classes, activation='sigmoid'))
        model.compile(loss='categorical_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])
        start_time = time.time()
        history = model.fit(Xtrain, xtrain,
                            batch_size=batch_size,
                            nb_epoch=epochs,
                            validation_data=(Xtest, xtest))
        average_time_per_epoch = (time.time() - start_time) / epochs
        results.append((history, average_time_per_epoch))
    return results, modes


In [15]:
base_filter()
Xtrain, xtrain, nb_classes = load_data("BBXtrain.csv")
Xtest, xtest, nb_classes = load_data("SampleWithID.csv")
results, modes = run_model(Xtrain, xtrain, Xtest, xtest, nb_classes)

Testing mode: consume_less="cpu"
Train on 208 samples, validate on 408 samples
Epoch 1/1


    Is the result good enough? 
    Why is that? 
    Could do provide a better solution?




In [None]:
# try it out here

# Conclusion

As long as we get the model (either SVM or LSTM) with high performance, we can then apply it on all interaction data set. Thus it's much easier for the company to understand why customer are calling and which part of their business need to be optimized for higher efficiency.
This is one way to perform the data driven business strategy developing. The algorithm that we chose could be replaced in different application domain, but the logic remains the same. We also need to know how to chose between different algorithms to better fit the business case.

# Reference

[1] Blei, David M. "Latent Dirichlet Allocation - Definitions." Machine Learning. N.p., Jan. 2003. Web. 30 Oct. 2016.
    https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
    
[2] By Unrolling We Simply Mean That We Write out the Network for the Complete Sequence. For Example, If the Sequence We Care about Is a Sentence Of 5 Words, the Network Would Be Unrolled into a 5-layer Neural Network, One Layer for Each Word. The Formulas That Govern the Computation Happening in A RNN Are as Follows:. "Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs." WildML. N.p., 17 Sept. 2015. Web. 30 Oct. 2016.
    http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
    
[3] Olah, Christopher. "Understanding LSTM Networks." -- Colah's Blog. August 27, 2015. Accessed October 30, 2016. 
        http://colah.github.io/posts/2015-08-Understanding-LSTMs/
    