## Verification

The goal of this notebook is to verify our most prominent clusters. We verify them by building a second set of clusters with non-negative matrix factorization (NMF) and then checking that the most prominent clusters according to LDA are also present among the NMF clusters.

By "most prominent clusters" I mean clusters in which at least 2 words have a probability above 0.05 for that cluster. For details, see the theme_analysis folder.
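
A minimal sketch of that prominence rule, assuming a topic-by-word matrix whose rows are already normalized to probabilities. The function name `prominent_topics` and the toy numbers are hypothetical; the actual LDA analysis lives in the theme_analysis folder.

```python
def prominent_topics(topic_word_probs, threshold=0.05, min_words=2):
    """Return indices of topics with at least `min_words` words above `threshold`."""
    prominent = []
    for topic_idx, probs in enumerate(topic_word_probs):
        # Count the words whose probability is strictly above the threshold.
        n_strong = sum(1 for p in probs if p > threshold)
        if n_strong >= min_words:
            prominent.append(topic_idx)
    return prominent


# Hypothetical topic-word probabilities, for illustration only.
toy_probs = [
    [0.40, 0.30, 0.20, 0.10],  # four words above 0.05
    [0.94, 0.02, 0.02, 0.02],  # only one word above 0.05
    [0.06, 0.06, 0.04, 0.84],  # three words above 0.05
]
print(prominent_topics(toy_probs))  # [0, 2]
```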

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

I read in the raw data and convert it into a list of strings, with 'zzz' in place of '.' and 'xxx' in place of '->'. This is a workaround: sklearn's tokenizer treats '.' and '->' as word separators, but here they are part of a single token.
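
To see why the substitution matters, here is a small sketch that applies the default `token_pattern` documented for sklearn's text vectorizers using only the `re` module. The example transition is illustrative, and the placeholders `zzz` and `xxx` are the ones used in `doc_combine()` further down.

```python
import re

# Default `token_pattern` documented for scikit-learn's text vectorizers.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

raw = "ivr.entry->ivr.exit"           # one transition in readable form
encoded = "ivrzzzentryxxxivrzzzexit"  # the same transition after substitution

print(re.findall(TOKEN_PATTERN, raw))      # ['ivr', 'entry', 'ivr', 'exit']
print(re.findall(TOKEN_PATTERN, encoded))  # ['ivrzzzentryxxxivrzzzexit']
```

Without the placeholders, the transition is broken into four unrelated words; with them, it survives as one feature.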

In [2]:
def data_to_path(pre_df, qty):
    #INPUT: pre_df, dataframe with a 'Path' column
    #INPUT: qty, number of rows to read in (438982 for the total dataframe)
    #OUTPUT: paths, list of lists of words

    paths = []
    for i in range(0, qty):
        # Join the words of each event with '.', then split the path on '->'
        paths.append(pre_df['Path'][i].replace(' ', '.').replace('->', ' ').split())
    return paths

In [3]:
pre_df = pd.read_csv('../../data/Top_Traversals_demo-1daybehavior_20140401.csv')
docs = data_to_path(pre_df, 438982)

In [4]:
print(docs)

In [5]:
def doc_combine(words_list):
    #INPUT: list of lists of words (output of data_to_path() function)
    #OUTPUT: list of strings, one per document, of space-separated word transitions

    result_list = []
    for doc in words_list:
        # Pair each word with its successor; a one-word document yields no pairs
        zip_list = zip(doc[:-1], doc[1:])
        single_string = ''
        for val in zip_list:
            # Encode '.' as 'zzz' and the transition arrow as 'xxx'
            single_string += val[0].replace('.', 'zzz')\
            + 'xxx' + val[1].replace('.', 'zzz') + ' '
        result_list.append(single_string)
    return result_list

In [6]:
docs = doc_combine(docs)
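
As a quick illustration of what the two preprocessing steps produce, here is a condensed sketch that applies both to a single path string. `encode_path` is a hypothetical helper, not part of the notebook, and the raw path is made up; real values come from the 'Path' column. Unlike `doc_combine()`, it does not leave a trailing space.

```python
def encode_path(path):
    """Condensed, single-string version of data_to_path() + doc_combine()."""
    # Step 1 (data_to_path): spaces inside an event become '.', '->' separates events.
    events = path.replace(' ', '.').replace('->', ' ').split()
    # Step 2 (doc_combine): one token per consecutive pair of events.
    pairs = zip(events[:-1], events[1:])
    return ' '.join(a.replace('.', 'zzz') + 'xxx' + b.replace('.', 'zzz')
                    for a, b in pairs)


# Hypothetical raw path, for illustration only.
example = encode_path('ivr entry->ivr proactive balance->ivr exit')
print(example)
# ivrzzzentryxxxivrzzzproactivezzzbalance ivrzzzproactivezzzbalancexxxivrzzzexit
```

A path with a single event has no transitions, so it encodes to an empty string.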

Verify that docs contains the expected number of documents.

In [7]:
len(docs)

438982

## Convert our documents into a TF-IDF matrix

In [8]:
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)
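
For intuition about the weights in that matrix, here is a pure-Python sketch of the weighting that, to my understanding, matches `TfidfVectorizer`'s defaults: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It splits on whitespace instead of using sklearn's tokenizer, and `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2 row normalization (whitespace tokens)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1.0 + n_docs) / (1.0 + d)) + 1.0
           for tok, d in df.items()}
    rows = []
    for toks in tokenized:
        weights = {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # An empty document has no tokens and therefore an empty row.
        rows.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return rows


# Hypothetical encoded documents: 'axxxb' is shared, the other tokens are not.
toy_rows = tfidf_rows(['axxxb bxxxc', 'axxxb bxxxd'])
print(toy_rows[0])
```

The transition that appears in both documents receives a lower weight than the ones unique to a single document, which is the property NMF relies on below.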

## Instantiate and Fit Model

In [9]:
nmf = NMF(n_components=30, random_state=42, alpha=.1, l1_ratio=0).fit(tfidf)

In [18]:
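def get_top_words(model, feature_names, n_top_words):
    #INPUT: fitted model, vectorizer feature names, number of words per topic
    #OUTPUT: dict mapping a topic label to a list holding one list of top words
    top_words = {}
    for topic_idx, topic in enumerate(model.components_):
        key = "Topic #{}:".format(topic_idx)
        # argsort is ascending, so the reversed slice takes the n largest weights
        top_words[key] = [[feature_names[i].replace('zzz', '.').replace('xxx', '->')
                           for i in topic.argsort()[:-n_top_words - 1:-1]]]
    return top_words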

In [19]:
top_words = get_top_words(nmf, tfidf_vectorizer.get_feature_names(), 5)

## Output results

In [25]:
for key, value in top_words.items():
    print(key)
    for l in value:
        for event in l:
            print(event)
    print('\n\n\n')

Topic #3:
agent.view.payment.history->agent.view.payment.history
agent.view.payment.history->agent.exit
agent.pay.by.phone.success->agent.view.payment.history
agent.view.statements->agent.view.payment.history
agent.view.payment.history->agent.pay.by.phone.success




Topic #23:
redeem.rewards
journey.entry->reward
reward->web.entry
webstc.view
redeem.rewards->web.exit




Topic #16:
ivr.disp.completed.call->ivr.exit
ivr.entry->ivr.proactive.balance
ivr.proactive.balance->ivr.disp.completed.call
ivr.exit->ivr.entry
ivr.exit->journey.exit




Topic #25:
webevent.view.transactions.and.details.success->webevent.view.payment.activity
webevent.view.payment.activity->webevent.view.transactions.and.details.success
webevent.view.account.summary.success->webevent.view.payment.activity
webevent.view.payment.activity->web.exit
webevent.view.payment.activity->webevent.login




Topic #9:
agent.view.statements->agent.view.statements
agent.financial.adjustments->agent.view.statements
tsys.financial.a

## Prominent Clusters in LDA and Their Corresponding Topics in NMF

Horrendous IVR: Topic 8 <br>
Mobile Disengagement: Topic 11 <br>
Mobile Users: Topic 4 <br>
Web Logins and Deleting Payments: Topic 15 <br>
The Shallow Web: Topic 17 <br>
Statements Statements Statements: Topic 0 <br>
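
The mapping above was made by reading the top transitions of each topic. One way to make such a comparison repeatable is to score the overlap of top transitions between an LDA theme and each NMF topic. The sketch below uses Jaccard similarity, which is my own choice and not the method used in this notebook; `jaccard`, `best_matches`, and the toy topics are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of transitions."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


def best_matches(lda_topics, nmf_topics):
    """For each LDA theme, return (best NMF topic key, Jaccard score)."""
    matches = {}
    for name, lda_words in lda_topics.items():
        scored = [(jaccard(lda_words, nmf_words), key)
                  for key, nmf_words in nmf_topics.items()]
        score, key = max(scored)  # highest overlap wins
        matches[name] = (key, score)
    return matches


# Hypothetical top transitions, for illustration only.
toy_lda = {'theme A': ['a->b', 'b->c', 'c->d']}
toy_nmf = {
    'Topic #0:': ['x->y', 'y->z'],
    'Topic #1:': ['a->b', 'b->c', 'q->r'],
}
print(best_matches(toy_lda, toy_nmf))  # {'theme A': ('Topic #1:', 0.5)}
```

A low best score for a prominent LDA theme would flag a cluster that NMF does not reproduce.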

## Results
All of the prominent clusters in LDA have a corresponding NMF cluster. This is good news. While the matched clusters differ slightly in their top transitions, the overall themes appear in both models, so we can treat these clusters as legitimate.