# PROJECT INTRODUCTION

This intent detection problem uses text extracted from Enron emails. (It would be interesting to examine how closely the text relates to the company's fraud in the early 2000s, which I had the chance to study in my accounting fraud class.) The dataset consists of sentences labelled "yes" or "no", depending on whether they communicate some form of intent. This intent could fall under categories such as "request", "propose" or "commit". My task is to build a mathematical/statistical model that determines whether a sentence carries such intent, and for model 1.0 I'll take two approaches to see what results I can obtain.

1. Since the classification is already done, we break the dataset up into the "yes" and "no" categories and attempt to cluster words that are common to both classes. Our interest, of course, remains focused on the words we obtain in the "yes" category, but we will examine the ones in "no" as well. We study the frequency and, more importantly, the difference in frequency of the words between the two classes, and extend this analysis to the phrases in the sentences to see if we can obtain any insights.


2. After doing so, we will use some prebuilt models to see if they are effective at detecting commonalities between all the different "yes" sentences, and check them against our test dataset. Following this, we will try more mathematically intensive deep learning models (which are common for such datasets) to see if accuracy changes.

In [1]:
from collections import Counter
import operator
import pandas as pd
import string

import plotly.graph_objects as go
import chart_studio.plotly as py
from plotly.subplots import make_subplots
import plotly.express as px

%matplotlib inline

# PREPARING THE DATASET

In [2]:
def prepare_data(file_name):
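    # read the whole file, closing it as soon as we are done
    with open(file_name, "r") as data:
        items = data.read().split("\n")
    # drop blank lines (e.g. the trailing newline) so the tab split below never fails
    items = [line for line in items if line]

    punctuation = str.maketrans("","", string.punctuation)
    corpus_map = {"YES":[], "NO":[]}
    intent_map = {"YES":[], "NO":[]}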
    
    #reading file line by line, adding data to a dictionary
    for line in items:  
        intent = line.split("\t")[0]
        sentence = line.split("\t")[1]
        corpus_map[intent.upper()].append(sentence.translate(punctuation))
        intent_map[intent.upper()].append(sentence.split(" "))
    print("NUMBER OF POSITIVE INTENTS:",len(corpus_map['YES']), "NUMBER OF NO INTENTS:" ,len(corpus_map['NO']))
    
    #convert to lower case, remove punctuation
    no = [x.lower().translate(punctuation) for sublist in intent_map['NO'] for x in sublist]
    yes = [x.lower().translate(punctuation) for sublist in intent_map['YES'] for x in sublist]
    
    #obtain frequency of words
    no_freq = sorted(Counter(no).items(), key=operator.itemgetter(1), reverse = True)
    yes_freq = sorted(Counter(yes).items(), key=operator.itemgetter(1), reverse = True)

    #obtain dataframe for frequency
    no_df = pd.DataFrame(no_freq, columns =["word", "frequency"])
    yes_df = pd.DataFrame(yes_freq, columns = ["word", "frequency"])

    #create y_train and return
    x = corpus_map['YES'] + corpus_map['NO']
    y = [1]*len(corpus_map['YES']) + [0]*len(corpus_map['NO'])
    return x,y, yes_df, no_df
  

In [3]:
x_train, y_train, yes_df, no_df = prepare_data("/home/rvp/Dropbox/intent_detection/enron_train.txt")
x_test, y_test, yes_test, no_test = prepare_data("/home/rvp/Dropbox/intent_detection/enron_test.txt")

NUMBER OF POSITIVE INTENTS: 1719 NUMBER OF NO INTENTS: 1938
NUMBER OF POSITIVE INTENTS: 309 NUMBER OF NO INTENTS: 683


In [4]:
"No of unique words in each category"
no_df.shape, yes_df.shape

((5463, 2), (3631, 2))

# CURSORY GLANCE OF DATASET

Our dataset is small, and a cursory glance at it reveals some key insights. (This is only possible because the data is small, relatively clean, and labelled. Reading through the data does not scale to huge datasets, but glancing at a few samples is always highly encouraged.)

We notice recurring phrases like "please ____" and sentences phrased as questions.
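As a rough way to check these two surface cues, we could count them per class. The sketch below is only illustrative: the `samples` list and the `surface_cues` helper are hypothetical, and the check must run on the raw sentences, since `prepare_data` strips punctuation (including question marks) from the text it returns.

```python
import re

# Hypothetical toy sentences; the real data would come from enron_train.txt
samples = [
    ("yes", "Please send me the report by Friday."),
    ("yes", "Could we meet tomorrow to discuss?"),
    ("no", "The meeting was held last week."),
    ("no", "He said they were off that day."),
]

# "please" followed by at least one more word, in any letter case
please_pattern = re.compile(r"\bplease\s+\w+", re.IGNORECASE)

def surface_cues(sentence):
    """Return simple boolean cues for one raw (unstripped) sentence."""
    return {
        "has_please": bool(please_pattern.search(sentence)),
        "is_question": sentence.rstrip().endswith("?"),
    }

# Tally how often each cue fires in each class
counts = {label: {"has_please": 0, "is_question": 0} for label in ("yes", "no")}
for label, sentence in samples:
    for cue, present in surface_cues(sentence).items():
        counts[label][cue] += int(present)

print(counts)
```

On the real data, a large gap between the "yes" and "no" tallies would suggest these cues are worth keeping as explicit features.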

# ANALYSING COMMON WORDS AND UNIQUE WORDS

In [5]:
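# outer merge keeps words that are in the top 500 of only one class;
# fillna(1) replaces the missing count with 1 so the ratios below never divide by zero
common_words = pd.merge(yes_df.head(500), no_df.head(500), on='word', how='outer').fillna(1)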
common_words.columns = ['Word', 'yes_intent', 'no_intent']
common_words['yes_ratio'] = common_words['yes_intent']/common_words['no_intent']
common_words['no_ratio'] = common_words['no_intent']/common_words['yes_intent']
common_words[common_words['yes_ratio']>3]

Unnamed: 0,Word,yes_intent,no_intent,yes_ratio,no_ratio
4,please,546.0,139.0,3.928058,0.254579
22,call,199.0,39.0,5.102564,0.195980
23,discuss,193.0,23.0,8.391304,0.119171
36,let,114.0,32.0,3.562500,0.280702
42,could,104.0,34.0,3.058824,0.326923
58,meet,73.0,9.0,8.111111,0.123288
69,lets,54.0,17.0,3.176471,0.314815
77,join,48.0,12.0,4.000000,0.250000
78,talk,46.0,15.0,3.066667,0.326087
86,schedule,42.0,12.0,3.500000,0.285714


In [6]:
common_words[common_words['no_ratio']>3]

Unnamed: 0,Word,yes_intent,no_intent,yes_ratio,no_ratio
147,was,24.0,79.0,0.303797,3.291667
150,day,23.0,78.0,0.294872,3.391304
186,receive,16.0,56.0,0.285714,3.500000
223,no,14.0,50.0,0.280000,3.571429
224,he,14.0,50.0,0.280000,3.571429
251,they,13.0,67.0,0.194030,5.153846
260,their,13.0,40.0,0.325000,3.076923
295,business,11.0,34.0,0.323529,3.090909
310,she,10.0,35.0,0.285714,3.500000
321,off,10.0,36.0,0.277778,3.600000


# PLOTTING THE INSIGHTS

In [18]:
"""Plot the common words with plotly: counts as bars, relative ratios as lines"""

import plotly.io as pio
pio.renderers.default = "browser"
def generate_figure(data): 
    
    fig = make_subplots(specs=[[{"secondary_y": True}]])


    fig.add_trace(go.Bar(x=data['Word'],
                     y =data['yes_intent'],  name = 'Intent'),
                     secondary_y = False)
    
    fig.add_trace(go.Bar(x=data['Word'],
                     y =data['no_intent'],  name = 'No Intent'),
                     secondary_y = False)
    
    fig.add_trace(go.Scatter(x=data['Word'],
                     y =data['yes_ratio'],  name = 'Intent Ratio'),
                     secondary_y = True)
    
    fig.add_trace(go.Scatter(x=data['Word'],
                     y =data['no_ratio'],  name = 'No Intent Ratio'),
                     secondary_y = True)
    
    
    fig.update_yaxes(title_text="<b>WORD COUNT</b>", secondary_y=False)
    fig.update_yaxes(title_text="<b>WORD RELATIVE RATIO</b>", secondary_y=True)

    return fig

generate_figure(common_words).show()


# VECTORIZING THE DATASET

While the insights we have gathered so far reflect a trend in the words used in emails having "intent", we must formalize the process of identifying these emails. To do so, we will use statistical models to see if they are effective at identifying the intent we are looking for, and in case the results disappoint us, we shall turn to the even more mathematically complex neural networks to detect trends for us.

The first step in this process is to vectorize the dataset, as words are meaningless to these statistical models and must be converted to numbers. We take the complete corpus and vectorize it using both the count and the tfidf methods, because an initial examination of our data has revealed that some words are more common in the "yes" set, while a few others are more common in the "no" set. The tfidf method is not aware of the classes, but it down-weights terms that appear in most sentences, such as "the" and "you", so that rarer and potentially more informative terms carry relatively more weight.
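To make the weighting concrete, here is a minimal pure-Python sketch of the inverse document frequency part of tfidf. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, which is what scikit-learn's `TfidfVectorizer` uses by default; the toy corpus and the `smoothed_idf` helper are made up for illustration.

```python
import math

# Hypothetical toy corpus: "the" appears in every sentence, "please" in only one
toy_corpus = [
    "please call the office",
    "the report was late",
    "the team will discuss the plan",
]
tokenized = [doc.split() for doc in toy_corpus]
n_docs = len(tokenized)

def smoothed_idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for tokens in tokenized if word in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

for word in ("the", "please"):
    print(word, round(smoothed_idf(word), 3))
```

A word present in every sentence gets the minimum weight of 1, while a word seen in a single sentence gets a noticeably larger one.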

In [8]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [9]:
"""We shall attempt to implement a trigram count model, and then a tfidf model to see what trends they
reveal about the data"""

vectorizer = CountVectorizer(ngram_range=(3,3), stop_words = ['to', 'you', 'and', 'the', 'a'])
tfidf_vectorizer = TfidfVectorizer(norm = 'l2', ngram_range=(3,3), stop_words = ['to', 'you', 'and', 'the', 'a'])

xtrain_count = vectorizer.fit_transform(x_train)
xtrain_tfidf = tfidf_vectorizer.fit_transform(x_train)

xtest_count = vectorizer.transform(x_test)
xtest_tfidf = tfidf_vectorizer.transform(x_test)

In [10]:
xtrain_count

<3657x37939 sparse matrix of type '<class 'numpy.int64'>'
	with 41883 stored elements in Compressed Sparse Row format>
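The shape and stored-element count printed above already tell us how sparse the trigram matrix is. The arithmetic below uses only those three reported numbers; the variable names are ours.

```python
# Numbers copied from the sparse matrix summary above
n_rows, n_cols, n_stored = 3657, 37939, 41883

# Fraction of matrix entries that are actually stored (nonzero)
density = n_stored / (n_rows * n_cols)

# Average number of distinct trigrams stored per sentence
avg_per_row = n_stored / n_rows

print(f"density: {density:.6f}")
print(f"average stored trigrams per sentence: {avg_per_row:.1f}")
```

Only about 0.03 percent of the entries are nonzero, and there are roughly ten times more features than training sentences, which is worth keeping in mind when reading the train and test accuracies that follow.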

# TRAINING THE MODEL(S)

In [11]:
models = {"Naive Bayes": [("naive_bayes", MultinomialNB())], "SVC": [("SVC", LinearSVC())], 
         "Random Forest": [("RF",RandomForestClassifier(n_estimators=1000, max_depth = 500))]}
predicted = dict()
accuracy = dict()
test_predicted = dict()
test_accuracy = dict()

parameters = {'Naive Bayes': {}, 
              'SVC': {"SVC__tol": [1e-04, 1e-05], "SVC__loss": ['hinge', 'squared_hinge']}, 
              'Random Forest': {"RF__n_estimators": [200,500], "RF__max_depth": [500, 1000]} }

for model in models: 
    
    print("The %s model is being trained."%model)
    
    pipe = Pipeline(models[model])
    params = parameters[model]
    clf = GridSearchCV(pipe, param_grid = params, n_jobs = -1, cv = 5).fit(xtrain_count, y_train)
    clf_tfidf = GridSearchCV(pipe, param_grid = params, cv = 5).fit(xtrain_tfidf, y_train)
    print("Best parameters for %s on GridSearch in train data: "%(model), clf.best_params_)
    
    #train predictions, will make loop later
    predicted[model, 'count'] = clf.predict(xtrain_count)
    predicted[model, 'tfidf'] = clf_tfidf.predict(xtrain_tfidf)
    accuracy[model, 'count'] = accuracy_score(y_train, predicted[model, 'count'])
    accuracy[model, 'tfidf'] = accuracy_score(y_train, predicted[model, 'tfidf'])
    
    #test predictions, will make loop later
    test_predicted[model, 'count'] = clf.predict(xtest_count)
    test_predicted[model, 'tfidf'] = clf_tfidf.predict(xtest_tfidf)
    test_accuracy[model, 'count'] = accuracy_score(y_test, test_predicted[model, 'count'])
    test_accuracy[model, 'tfidf'] = accuracy_score(y_test, test_predicted[model, 'tfidf'])
    print("Train Accuracy for %s is: %1.11f for count and %1.11f for tfidf"%(model, 
                        accuracy[model, 'count'], accuracy[model, 'tfidf']))
    print("Test Accuracy for %s is: %1.11f for count and %1.11f for tfidf"%(model, 
                        test_accuracy[model, 'count'], test_accuracy[model, 'tfidf']))
    
    
print("Train Accuracy", accuracy) 
print("Test Accuracy", test_accuracy)

The Naive Bayes model is being trained.
Best parameters for Naive Bayes on GridSearch in train data:  {}
Train Accuracy for Naive Bayes is: 0.99808586273 for count and 0.99835931091 for tfidf
Test Accuracy for Naive Bayes is: 0.74193548387 for count and 0.74697580645 for tfidf
The SVC model is being trained.
Best parameters for SVC on GridSearch in train data:  {'SVC__loss': 'hinge', 'SVC__tol': 0.0001}
Train Accuracy for SVC is: 0.99863275909 for count and 0.99863275909 for tfidf
Test Accuracy for SVC is: 0.76612903226 for count and 0.75302419355 for tfidf
The Random Forest model is being trained.
Best parameters for Random Forest on GridSearch in train data:  {'RF__max_depth': 1000, 'RF__n_estimators': 500}
Train Accuracy for Random Forest is: 0.99671862182 for count and 0.77932731747 for tfidf
Test Accuracy for Random Forest is: 0.74193548387 for count and 0.74495967742 for tfidf
Train Accuracy {('Naive Bayes', 'count'): 0.9980858627290129, ('Naive Bayes', 'tfidf'): 0.99835931091058

# ENSEMBLING THE RESULTS

In [15]:
"""We will now take the results obtained from each of the models and merge them to see if the accuracy is higher.
We currently have three models. Note that the threshold used below (x > 2) labels a sentence as intent only when
all three agree; taking the opinion of two or more would mean using x >= 2."""

# Results for the Count Vector

merged_results = list(zip(test_predicted['Naive Bayes', 'count'], 
                          test_predicted['Random Forest', 'count'], test_predicted['SVC', 'count']))
added_results = [sum(x) for x in merged_results]
final_results = [1 if x > 2 else 0 for x in added_results]
print("Accuracy when combining count results:",accuracy_score(final_results, y_test))

Accuracy when combining count results: 0.7399193548387096


In [16]:
# Results for the tfidf Vector

merged_results = list(zip(test_predicted['Naive Bayes', 'tfidf'], 
                          test_predicted['Random Forest', 'tfidf'], test_predicted['SVC', 'tfidf']))
added_results = [sum(x) for x in merged_results]
final_results = [1 if x > 2 else 0 for x in added_results]
print("Accuracy when combining the tfidf results:",accuracy_score(final_results, y_test))

Accuracy when combining the tfidf results: 0.7429435483870968


In [17]:
# Results for the count and tfidf vectors combined
# (x > 5 below requires all six predictions to agree)

merged_results = list(zip(test_predicted['Naive Bayes', 'count'], test_predicted['Random Forest', 'count'], 
                          test_predicted['SVC', 'count'], test_predicted['Naive Bayes', 'tfidf'], 
                          test_predicted['Random Forest', 'tfidf'], test_predicted['SVC', 'tfidf']))
added_results = [sum(x) for x in merged_results]
final_results = [1 if x > 5 else 0 for x in added_results]
print("Accuracy when combining the count and tfidf results:",accuracy_score(final_results, y_test))

Accuracy when combining the count and tfidf results: 0.7389112903225806
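The cells above threshold the summed votes at `x > 2` (and `x > 5`), which amounts to requiring unanimity. For comparison, a strict majority vote could be sketched as follows. The `majority_vote` helper and the toy predictions are hypothetical, and we have not run this variant on the Enron data, so no accuracy is claimed for it.

```python
def majority_vote(*prediction_lists):
    """Label 1 when strictly more than half of the models predict 1."""
    n_models = len(prediction_lists)
    return [1 if 2 * sum(votes) > n_models else 0
            for votes in zip(*prediction_lists)]

# Hypothetical predictions from three models on four sentences
nb_pred = [1, 0, 1, 0]
rf_pred = [1, 0, 0, 0]
svc_pred = [1, 1, 1, 0]

majority = majority_vote(nb_pred, rf_pred, svc_pred)

# The rule used in the cells above, for comparison: all three must agree
unanimous = [1 if sum(votes) > 2 else 0
             for votes in zip(nb_pred, rf_pred, svc_pred)]

print("majority: ", majority)
print("unanimous:", unanimous)
```

The two rules differ on the third sentence, where two of the three models vote for intent. With an even number of models (such as the six count and tfidf predictions), ties count as "no" under this helper.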


# INITIAL INFERENCES

After preprocessing, examining, and running the data against some benchmark classification models, we notice that while training accuracy is nearly 100 percent for all but one configuration, test accuracy sits only around 74 to 77 percent. Some enhancements in the results came from parameter tuning with grid search in sklearn, cleaning the data of punctuation and stop words, using bi-grams and tri-grams while vectorizing, and ensembling the results at the end. However, the models still overfit, and we need to explore other methods that might improve the identification of intent. One such option is a neural network, which might pick up patterns in the largely sparse matrix that the other machine learning models have so far missed.
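To put the test accuracies in context, it helps to compare them with a trivial baseline that always predicts the larger class. The sketch below uses only the test-set class counts and the best test accuracy printed earlier in this notebook; the variable names are ours.

```python
# Test-set class counts, from the prepare_data output above
n_yes_test, n_no_test = 309, 683

# Accuracy of always predicting the majority class ("no")
baseline_accuracy = max(n_yes_test, n_no_test) / (n_yes_test + n_no_test)

# Best test accuracy reported above (SVC with count vectors)
best_reported = 0.76612903226

print(f"majority-class baseline: {baseline_accuracy:.4f}")
print(f"gain of best model over baseline: {best_reported - baseline_accuracy:.4f}")
```

Because the test set is imbalanced, accuracy alone can look better than it is, so per-class metrics may be a more informative yardstick for future iterations.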

# DEEP LEARNING

# THINGS TO EXPLORE

1. Data Resampling
2. Error Analysis (seeing where the mistakes are and if accuracy can be improved)
3. Other methods for encoding
4. Hyperparameter tuning using gridsearch
5. Closer analysis of the data to check for normalization
6. Fixing the random state (for reproducible results)
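As a starting point for the error analysis item, one option is to tabulate the four outcome types and pull out the misclassified sentences for manual reading. This is a minimal sketch on made-up data; `confusion_counts` and `misclassified` are hypothetical helpers, not part of the notebook above.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = intent)."""
    pairs = Counter(zip(y_true, y_pred))
    return {
        "tp": pairs[(1, 1)],
        "fp": pairs[(0, 1)],
        "fn": pairs[(1, 0)],
        "tn": pairs[(0, 0)],
    }

def misclassified(sentences, y_true, y_pred):
    """Return (sentence, true label, predicted label) for every mistake."""
    return [(s, t, p) for s, t, p in zip(sentences, y_true, y_pred) if t != p]

# Hypothetical sentences, labels, and predictions
toy_sentences = ["please call me", "he was off that day",
                 "lets meet monday", "they receive it daily"]
toy_true = [1, 0, 1, 0]
toy_pred = [1, 0, 0, 1]

print(confusion_counts(toy_true, toy_pred))
for sentence, truth, guess in misclassified(toy_sentences, toy_true, toy_pred):
    print(f"true={truth} predicted={guess}: {sentence}")
```

With the real outputs, the same helpers could be called on `x_test`, `y_test`, and any entry of `test_predicted` to see which kinds of sentences the models get wrong.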