# Auto Complete Server - Conversation Corpus Analysis & Prototype

The goal of this project is to build a relevant, robust, and fast auto-completion server that provides completion suggestions to customer service agents. The auto-completion model will be trained on a provided set of customer/agent conversations.

This notebook will be used to explore the data and start prototyping the future components of our auto-complete server.

In [236]:
import json
import pandas as pd
import numpy as np
import nltk
import datrie
import string
import re
import operator

## Load Data

The conversation JSON file is loaded into a dataframe in which each row represents one message belonging to a given issue.

In [3]:
sample_file = '../data/sample_conversations.json'
 
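with open(sample_file, encoding='utf8') as f:
    sample_json = json.load(f)

# pd.json_normalize is the public entry point since pandas 1.0;
# pd.io.json.json_normalize is no longer available in recent releases.
df = pd.json_normalize(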
    data=sample_json,
    record_path=['Issues', 'Messages'],
    meta=[
        ['Issues', 'CompanyGroupId'],
        ['Issues', 'IssueId']
    ]
)
 
df = df[['Issues.IssueId', 'Issues.CompanyGroupId', 'IsFromCustomer', 'Text']]
df.rename(columns={'Issues.IssueId': 'IssueId', 'Issues.CompanyGroupId': 'CompanyGroupId'},
          inplace=True)

Since our interest is in providing auto-completion suggestions to the customer service agent, we need to focus on the agents' own messages, which are tagged with `IsFromCustomer = False`.

In [72]:
from_agent = df.IsFromCustomer == False
df[from_agent].sample(5)

| | IssueId | CompanyGroupId | IsFromCustomer | Text |
|---|---|---|---|---|
| 20383 | 13500001 | 1 | False | OK let me update you account information. |
| 352 | 310001 | 40001 | False | could you please provide me with your new addr... |
| 7429 | 5050001 | 20001 | False | Great. Have a safe flight |
| 5627 | 3860001 | 40001 | False | Thank you for that information. I need to make... |
| 11672 | 8270001 | 50001 | False | Have you tried rebooting the system |


It will be useful later to split the message text into sentences. This will enable us to easily measure sentence frequencies.

In [159]:
# Note: the creation of the sentences dataframe might not be scalable for a corpus of a billion messages
# with this implementation. A clever usage of pandas stack() and joins might enable us to do it in a more
# efficient fashion.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
columns = ['IssueId', 'CompanyGroupId', 'IsFromCustomer', 'SentenceId', 'Sentence']
# Collect the rows in a list and build the dataframe once: DataFrame.append copied the
# whole frame on every call and was removed in pandas 2.0.
rows = []
for message in df.itertuples():
    sentences = tokenizer.tokenize(message.Text)
    for i, sentence in enumerate(sentences):
        rows.append([message.IssueId, message.CompanyGroupId, message.IsFromCustomer, i, sentence])
df_sentences = pd.DataFrame(rows, columns=columns)
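
The scalability concern raised in the cell's comment can be addressed without an explicit Python loop over rows. The following is a minimal sketch, not the notebook's implementation: it assumes pandas 0.25 or later (for `explode`), and `naive_sentence_split` and `explode_sentences` are hypothetical names. The regex splitter is only a stand-in so the example runs without NLTK data; in the notebook, `tokenizer.tokenize` would be passed instead.

```python
import re

import pandas as pd


def naive_sentence_split(text):
    """Hypothetical stand-in for the punkt tokenizer: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def explode_sentences(messages, tokenize=naive_sentence_split):
    """Return one row per sentence, numbering sentences within each message."""
    out = messages.copy()
    # One list of sentences per message.
    out["Sentence"] = out["Text"].map(tokenize)
    # One row per list element; messages with no sentence yield NaN and are dropped.
    out = out.explode("Sentence").dropna(subset=["Sentence"])
    # The original index is repeated after explode, so it identifies the message.
    out["SentenceId"] = out.groupby(level=0).cumcount()
    return out.drop(columns="Text").reset_index(drop=True)


# Toy messages, not taken from the corpus.
toy_messages = pd.DataFrame({
    "IssueId": [1, 1],
    "CompanyGroupId": [1, 1],
    "IsFromCustomer": [True, False],
    "Text": ["Hi. My flight is late!", "Great. Have a safe flight"],
})
toy_sentences = explode_sentences(toy_messages)
print(toy_sentences)
```

This sketch assumes the input dataframe has a unique index, which is the case for the default index produced when the JSON is loaded.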

In [195]:
from_agent = df_sentences.IsFromCustomer == False
df_sentences[from_agent].Sentence.value_counts().head(10)

Is there anything else I can help you with today?    285
Great.                                               268
Okay.                                                204
Thank you.                                           197
Have a great day.                                    165
You're welcome.                                      158
Is there anything else I can help you with?          139
Have a great day                                     138
Is there anything else I can assist you with?        133
Have a great day!                                     85
Name: Sentence, dtype: int64

## How many sentences?

In [307]:
df_sentences[from_agent].shape

(16508, 5)

## n-gram Frequencies

### Unigrams

In [309]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokens = []
for message in df.itertuples():
    sentences = tokenizer.tokenize(message.Text)
    for sentence in sentences:
        tokens += nltk.word_tokenize(sentence)

In [31]:
def ngram_freq(n):
    ngrams = nltk.ngrams(tokens, n)
    fdist = nltk.FreqDist(ngrams)
    return fdist.most_common(5) 

In [32]:
ngram_freq(1)

[(('.',), 12362),
 (('I',), 10388),
 (('you',), 8612),
 (('the',), 5465),
 (('to',), 4748)]

### Vocabulary size?

In [322]:
unigrams = nltk.ngrams(tokens, 1)
fdist = nltk.FreqDist(unigrams)
v = fdist.B()
print("{:,}".format(v))

6,416


In [324]:
print("Bigrams space size = {:,}".format(v**2))
print("Trigrams space size = {:,}".format(v**3))

Bigrams space size = 41,165,056
Trigrams space size = 264,114,999,296
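
These sizes are upper bounds: a corpus of N tokens contains at most N - n + 1 n-gram occurrences, so only a small fraction of the theoretical space can ever be observed. The sketch below compares the two quantities on a toy token list; `ngram_sparsity` is a hypothetical helper, and the toy tokens stand in for the notebook's `tokens` variable.

```python
from collections import Counter


def ngram_sparsity(tokens, n):
    """Return (distinct observed n-grams, theoretical space V**n, fill ratio)."""
    vocab_size = len(set(tokens))
    # zip over n shifted views of the token list yields every n-gram in order.
    observed = Counter(zip(*(tokens[i:] for i in range(n))))
    space = vocab_size ** n
    return len(observed), space, len(observed) / space


# Toy token list, not taken from the corpus.
toy_tokens = "Is there anything else I can help you with ? Is there anything else".split()
distinct, space, ratio = ngram_sparsity(toy_tokens, 2)
print(distinct, space, ratio)
```

Running `ngram_sparsity(tokens, 2)` and `ngram_sparsity(tokens, 3)` on the real token list would show how sparsely the corpus fills the bigram and trigram spaces.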


### Bigrams

In [33]:
ngram_freq(2)

[(('.', 'I'), 2629),
 (('I', 'can'), 1737),
 (('you', 'with'), 1355),
 (('Thank', 'you'), 1275),
 (('anything', 'else'), 1076)]

### Trigrams

In [34]:
ngram_freq(3)

[(('there', 'anything', 'else'), 1041),
 (('else', 'I', 'can'), 950),
 (('anything', 'else', 'I'), 946),
 (('Is', 'there', 'anything'), 915),
 (('help', 'you', 'with'), 797)]

These results give a good idea of what the auto-completion is expected to do. For example, if a representative types "Is there", the auto-completion should suggest "Is there anything else I can help you with?"

### n-grams

In [35]:
n = 9
ngram_freq(n)

[(('Is', 'there', 'anything', 'else', 'I', 'can', 'help', 'you', 'with'), 500),
 (('there', 'anything', 'else', 'I', 'can', 'help', 'you', 'with', 'today'),
  326),
 (('.', 'Is', 'there', 'anything', 'else', 'I', 'can', 'help', 'you'), 312),
 (('anything', 'else', 'I', 'can', 'help', 'you', 'with', 'today', '?'), 305),
 (('Is', 'there', 'anything', 'else', 'I', 'can', 'assist', 'you', 'with'),
  246)]

As expected, frequencies decrease as the n-gram gets longer. However, completing a prefix with a full sentence may be more useful than completing it with the most frequent bigram, even though the bigram has the higher count. When mixing n-grams of different lengths, we therefore need some way to adjust their scores: they cannot be compared on raw frequency alone.
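
One simple way to make such scores comparable is to weight each count by the n-gram length, so that a long completion is rewarded for the typing it saves. This is only an illustrative heuristic under that assumption, not a method chosen by this project; `ngram_counts`, `length_weighted_scores`, and the default weight are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of length n in the token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def length_weighted_scores(tokens, lengths, weight=lambda n: n):
    """Score each n-gram as count * weight(n); the weight function is an assumption."""
    scores = {}
    for n in lengths:
        for gram, count in ngram_counts(tokens, n).items():
            scores[gram] = count * weight(n)
    return scores


# Toy token list, not taken from the corpus.
toy_tokens = ("Is there anything else I can help you with ? " * 3
              + "Is there a problem ?").split()
bigram = ("Is", "there")
ninegram = ("Is", "there", "anything", "else", "I", "can", "help", "you", "with")

raw_bigram = ngram_counts(toy_tokens, 2)[bigram]
raw_ninegram = ngram_counts(toy_tokens, 9)[ninegram]
scores = length_weighted_scores(toy_tokens, lengths=(2, 9))
print(raw_bigram, raw_ninegram, scores[bigram], scores[ninegram])
```

On raw counts the bigram wins, while the length-weighted score ranks the nine-word completion first. Other weights (for example a logarithmic one) can be passed through the `weight` parameter.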

## Build Trie Data Structure

In [199]:
sentences_freq = df_sentences[from_agent].Sentence.value_counts()
sentences_freq.head(10)

Is there anything else I can help you with today?    285
Great.                                               268
Okay.                                                204
Thank you.                                           197
Have a great day.                                    165
You're welcome.                                      158
Is there anything else I can help you with?          139
Have a great day                                     138
Is there anything else I can assist you with?        133
Have a great day!                                     85
Name: Sentence, dtype: int64

In [200]:
sentences_freq.shape

(9845,)

Let's build the trie! Each key is a sentence, and its value is the frequency of that sentence in the corpus.

In [221]:
trie = datrie.Trie(string.printable)
for sentence, freq in sentences_freq.items():
    trie[sentence] = freq
trie.save('sentences_freq.trie')

In [253]:
def generate_completions(prefix):
    # trie.items(prefix) returns (sentence, frequency) pairs for every key starting with prefix.
    completions = trie.items(prefix)
    # Sort by frequency, highest first, and keep at most five suggestions.
    ranked = sorted(completions, key=operator.itemgetter(1), reverse=True)
    return ranked[:5]
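
To illustrate the prefix lookup without depending on `datrie`, here is a small standard-library sketch. It is a stand-in, not the trie used in this notebook: `SortedPrefixIndex` and `top_completions` are hypothetical names, and a sorted key list with binary search replaces the trie. The sample frequencies are copied from the sentence counts shown earlier.

```python
import bisect
import heapq


class SortedPrefixIndex:
    """Hypothetical stand-in for a trie: sorted keys plus binary search."""

    def __init__(self, freqs):
        self._freqs = dict(freqs)
        self._keys = sorted(self._freqs)

    def items(self, prefix):
        """Return (key, value) pairs for every key starting with prefix."""
        # Keys sharing a prefix are contiguous in sorted order.
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for i in range(start, len(self._keys)):
            key = self._keys[i]
            if not key.startswith(prefix):
                break
            matches.append((key, self._freqs[key]))
        return matches


def top_completions(index, prefix, k=5):
    """Return the k most frequent completions, highest frequency first."""
    return heapq.nlargest(k, index.items(prefix), key=lambda kv: kv[1])


sample_freqs = {
    "Is there anything else I can help you with today?": 285,
    "Great.": 268,
    "Have a great day.": 165,
    "Is there anything else I can help you with?": 139,
    "Have a great day": 138,
    "Have a great day!": 85,
}
index = SortedPrefixIndex(sample_freqs)
print(top_completions(index, "Have"))
```

`heapq.nlargest` avoids sorting every match when only a handful of suggestions are needed, which matters most for short prefixes that match many sentences.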

In [261]:
generate_completions("H")

[]

## Quick and Dirty Performance Test

In [278]:
import platform
print(
    platform.uname(),
    platform.processor(),
    platform.python_version(),
    platform.python_implementation()
)

uname_result(system='Darwin', node='macgregor.local', release='17.7.0', version='Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64', machine='x86_64', processor='i386') i386 3.6.4 CPython


In [286]:
%%timeit -n 100 -r 7
generate_completions("H")

100 loops, best of 7: 1.97 ms per loop


In [285]:
%%timeit -n 100 -r 7
generate_completions("W")

100 loops, best of 7: 1.98 ms per loop


In [287]:
%%timeit -n 100 -r 7
generate_completions("How")

100 loops, best of 7: 111 µs per loop
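
The timings above show one-character prefixes taking roughly twenty times longer than a three-character one, plausibly because many more candidates have to be collected and sorted. One possible mitigation, sketched here as an assumption rather than a design decision of this project, is to precompute the top suggestions for every short prefix. `build_prefix_cache`, `complete`, and their parameters are hypothetical; the sample frequencies are copied from the sentence counts shown earlier.

```python
import heapq
from collections import defaultdict


def build_prefix_cache(freqs, max_prefix_len=2, k=5):
    """Precompute the k most frequent sentences for every prefix up to max_prefix_len."""
    buckets = defaultdict(list)
    for sentence, freq in freqs.items():
        for length in range(1, min(max_prefix_len, len(sentence)) + 1):
            buckets[sentence[:length]].append((sentence, freq))
    return {
        prefix: heapq.nlargest(k, items, key=lambda kv: kv[1])
        for prefix, items in buckets.items()
    }


def complete(prefix, cache, fallback):
    """Serve short prefixes from the cache; defer to the fallback lookup otherwise."""
    if prefix in cache:
        return cache[prefix]
    return fallback(prefix)


sample_freqs = {
    "Great.": 268,
    "Have a great day.": 165,
    "Have a great day": 138,
    "Have a great day!": 85,
}
cache = build_prefix_cache(sample_freqs, max_prefix_len=2, k=2)
print(complete("H", cache, fallback=lambda p: []))
```

In the notebook, `generate_completions` could be passed as `fallback` so that longer prefixes still go through the trie. The cache trades memory and build time for lookup latency and has to be rebuilt whenever the frequencies change.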
