# PROJECT INTRODUCTION

This intent detection problem uses text extracted from Enron emails. (It would be interesting to examine how closely the text relates to the company's fraud in the early 2000s, which I had the chance to study in my accounting fraud class.) The dataset consists of sentences labelled "yes" or "no", depending on whether they communicate some form of intent. This intent could fall under categories such as "request", "propose" or "commit". My task is to build a mathematical/statistical model that determines whether a sentence has such intent, and for model 1.0 I'll take two approaches to see what results I can obtain.

1. Since the sentences are already labelled, we split the dataset into the "yes" and "no" classes and look at the most common words in each class. Our interest, of course, remains focused on the words in the "yes" class, but we will examine the ones in "no" as well. We study the frequency of the words and, more importantly, the difference in frequency between the two classes, and then extend this analysis to the phrases in the sentences to see if we can obtain any insights.


2. After doing so, we will try some prebuilt models to see whether they are effective at detecting commonalities among the different "yes" sentences, and check them against our test dataset. Following this, we will try more mathematically intensive deep learning models (which are commonly applied to such datasets) to see if accuracy changes.

In [1]:
from collections import Counter
import operator
import pandas as pd
import string

# PREPPING THE DATASET

In [2]:
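# Each line of the training file is "<label>\t<sentence>"
intent_map = {"YES": [], "NO": []}  # label -> sentences split into words
corpus_map = {"YES": [], "NO": []}  # label -> raw sentences
# Use a context manager so the file is closed, and skip any blank lines
with open("/home/rvp/Dropbox/intent_detection/enron_train.txt", "r") as train:
    items = [line for line in train.read().split("\n") if line]
for line in items:
    intent, sentence = line.split("\t")[:2]
    corpus_map[intent.upper()].append(sentence)
    intent_map[intent.upper()].append(sentence.split(" "))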

In [3]:
"47 percent of our data is yes, while 53 percent is no"
len(intent_map["YES"]), len(intent_map['NO'])

(1719, 1938)

# CURSORY GLANCE OF DATASET

Our dataset is small, and a cursory glance at it reveals some key insights. (This is only possible because the data is small, relatively clean and labelled. Reading through the data does not scale to huge datasets, but glancing at a few samples is always highly encouraged.)

We notice recurring phrases such as "please ____" and sentences that end as questions.
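A quick way to check these two impressions is to count, per class, how many sentences contain the word "please" and how many end with a question mark. The sketch below uses a few made-up sentences in place of the real `corpus_map`; the helper name `pattern_rates` and the toy data are hypothetical.

```python
import string


def pattern_rates(sentences):
    """Share of sentences containing the word 'please' and share ending in '?'."""
    total = len(sentences)
    if total == 0:
        return {"please": 0.0, "question": 0.0}
    with_please = 0
    with_question = 0
    for sentence in sentences:
        # Strip surrounding punctuation so "Please," still counts as "please".
        words = [w.strip(string.punctuation) for w in sentence.lower().split()]
        with_please += "please" in words
        with_question += sentence.rstrip().endswith("?")
    return {"please": with_please / total, "question": with_question / total}


# Toy stand-in for corpus_map (assumption: not taken from the Enron data).
toy_corpus = {
    "YES": ["Please call me tomorrow.", "Can we meet on Friday?", "Let's discuss the contract."],
    "NO": ["The meeting was yesterday.", "They received the report.", "Was he in the office?"],
}

rates = {label: pattern_rates(sentences) for label, sentences in toy_corpus.items()}
for label, rate in rates.items():
    print(label, rate)
```

Such a check has to run on the raw sentences: once punctuation is stripped from the words, the question marks are no longer available.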

In [79]:
"flatten list. We remove punctuation marks from the words, but knowing that "

punctuation = str.maketrans("","", string.punctuation)
no = [x.lower().translate(punctuation) for sublist in intent_map['NO'] for x in sublist]
yes = [x.lower().translate(punctuation) for sublist in intent_map['YES'] for x in sublist]

"obtain frequency of words"
no_freq = sorted(Counter(no).items(), key=operator.itemgetter(1), reverse = True)
yes_freq = sorted(Counter(yes).items(), key=operator.itemgetter(1), reverse = True)

"obtain dataframe for frequency"
no_df = pd.DataFrame(no_freq, columns =["word", "frequency"])
yes_df = pd.DataFrame(yes_freq, columns = ["word", "frequency"])

#no_df.set_index('word', inplace = True)
#yes_df.set_index('word', inplace = True)

x_train = corpus_map['YES'] + corpus_map['NO']
y_train = [1]*len(corpus_map['YES']) + [0]*len(corpus_map['NO'])
len(x_train), len(y_train)

(3657, 3657)

In [60]:
"No of unique words in each category"
no_df.shape, yes_df.shape

((5463, 2), (3631, 2))

In [61]:
unique_to_no = set(no) - set(yes)
unique_to_yes = set(yes) - set(no)

# ANALYSING COMMON WORDS AND UNIQUE WORDS

In [81]:
common_words = pd.merge(yes_df.head(500), no_df.head(500), on='word', how='outer').fillna(1)
common_words.columns = ['Word', 'yes_intent', 'no_intent']
common_words['yes_ratio'] = common_words['yes_intent']/common_words['no_intent']
common_words['no_ratio'] = common_words['no_intent']/common_words['yes_intent']
common_words[common_words['yes_ratio']>3]

|  | Word | yes_intent | no_intent | yes_ratio | no_ratio |
|---|---|---|---|---|---|
| 4 | please | 546.0 | 139.0 | 3.928058 | 0.254579 |
| 22 | call | 199.0 | 39.0 | 5.102564 | 0.195980 |
| 23 | discuss | 193.0 | 23.0 | 8.391304 | 0.119171 |
| 36 | let | 114.0 | 32.0 | 3.562500 | 0.280702 |
| 42 | could | 104.0 | 34.0 | 3.058824 | 0.326923 |
| 58 | meet | 73.0 | 9.0 | 8.111111 | 0.123288 |
| 69 | lets | 54.0 | 17.0 | 3.176471 | 0.314815 |
| 77 | join | 48.0 | 12.0 | 4.000000 | 0.250000 |
| 78 | talk | 46.0 | 15.0 | 3.066667 | 0.326087 |
| 86 | schedule | 42.0 | 12.0 | 3.500000 | 0.285714 |


In [82]:
common_words[common_words['no_ratio']>3]

|  | Word | yes_intent | no_intent | yes_ratio | no_ratio |
|---|---|---|---|---|---|
| 147 | was | 24.0 | 79.0 | 0.303797 | 3.291667 |
| 150 | day | 23.0 | 78.0 | 0.294872 | 3.391304 |
| 186 | receive | 16.0 | 56.0 | 0.285714 | 3.500000 |
| 223 | no | 14.0 | 50.0 | 0.280000 | 3.571429 |
| 224 | he | 14.0 | 50.0 | 0.280000 | 3.571429 |
| 251 | they | 13.0 | 67.0 | 0.194030 | 5.153846 |
| 260 | their | 13.0 | 40.0 | 0.325000 | 3.076923 |
| 295 | business | 11.0 | 34.0 | 0.323529 | 3.090909 |
| 310 | she | 10.0 | 35.0 | 0.285714 | 3.500000 |
| 321 | off | 10.0 | 36.0 | 0.277778 | 3.600000 |
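The ratios above come from merging only the top 500 words of each class and filling missing counts with 1, so a word ranked just outside the other class's top 500 is treated as if it occurred once there. One possible alternative, sketched here on toy tokens rather than the real `yes` and `no` lists, is to compare the full counters with add-one smoothing. The function name `smoothed_ratios` and the smoothing constant `alpha` are assumptions made for illustration, not part of the analysis above.

```python
from collections import Counter


def smoothed_ratios(yes_tokens, no_tokens, alpha=1.0):
    """Return {word: (yes_count + alpha) / (no_count + alpha)} over the full vocabulary."""
    yes_counts = Counter(yes_tokens)
    no_counts = Counter(no_tokens)
    vocab = set(yes_counts) | set(no_counts)
    # A Counter returns 0 for unseen words, so alpha keeps the ratio finite.
    return {
        word: (yes_counts[word] + alpha) / (no_counts[word] + alpha)
        for word in vocab
    }


# Toy stand-ins for the flattened `yes` and `no` token lists (hypothetical data).
toy_yes = ["please", "call", "please", "meet", "the"]
toy_no = ["the", "was", "they", "the", "please"]

ratios = smoothed_ratios(toy_yes, toy_no)
for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {ratio:.2f}")
```

Words with a ratio well above 1 lean towards "yes" and words well below 1 lean towards "no", using the true count from each class instead of a placeholder.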


# PLOTTING THE INSIGHTS

In [94]:
"""Add seaborn plots for the common words"""

'Add seaborn plots for the common words'

# VECTORIZING THE DATASET

While the insights gathered so far reflect a trend in the words used in emails that have "intent", we must formalize the process of identifying these emails. To do so, we will use statistical models to see if they are effective at identifying the intent we are looking for, and if the results are disappointing, we will turn to the more mathematically complex neural networks to detect trends for us.

The first step in this process is to vectorize the dataset, because these statistical models cannot work with raw words, which must be converted to numbers. We take the complete corpus and vectorize it using both the count and the TF-IDF methods, because an initial examination of our data has revealed that some words are more common in the "yes" set, while a few others are more common in the "no" set. TF-IDF automatically gives more weight to words that occur in fewer sentences, which is where these informative words tend to fall, as opposed to words like "the" and "you", which are scattered everywhere.
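To make the weighting concrete, the sketch below computes a smoothed inverse document frequency by hand on three toy sentences. It uses the formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which scikit-learn documents as its default when `smooth_idf=True`. The toy sentences, the whitespace tokenisation and the helper name `smoothed_idf` are assumptions, and the row normalisation that `TfidfVectorizer` applies afterwards is left out.

```python
import math
from collections import Counter


def smoothed_idf(documents):
    """Return {term: ln((1 + n) / (1 + df)) + 1} for whitespace-split terms."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


# Hypothetical sentences, not taken from the Enron data.
toy_docs = [
    "please call the office",
    "the report was late",
    "please send the report",
]

idf = smoothed_idf(toy_docs)
for term in ("the", "please", "call"):
    print(f"{term:8s} {idf[term]:.3f}")
```

A word that appears in every sentence ("the") gets the minimum weight of 1, while a word that appears in only one sentence ("call") gets the largest. The weighting never looks at the labels, so it favours rare words in general, not specifically the words that separate "yes" from "no".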

In [8]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [77]:
"""We shall attempt to implement a unigram model, and then a tfidf model to see what trends they
reveal about the data"""

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(x_train)

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(x_train)
X_tfidf.shape

(3657, 6617)

In [96]:
#testing the analyzer
analyze = vectorizer.build_analyzer()
analyze('Please add Tricia Spence to this email') == (["please", "add", "tricia", "spence", "to", "this", "email"])

True
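The check passes because the default analyzer lowercases the text and keeps tokens of two or more word characters. A rough standard-library approximation, assuming the default `token_pattern` of `(?u)\b\w\w+\b` with no stop-word list and unigrams only, shows a side effect worth knowing: single-character tokens such as "I" and "a" are dropped, as is punctuation. The function name `approx_analyze` is hypothetical and is not part of scikit-learn.

```python
import re

# Assumed to mirror CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def approx_analyze(text):
    """Lowercase the text and return tokens of at least two word characters."""
    return TOKEN_PATTERN.findall(text.lower())


email_tokens = approx_analyze("Please add Tricia Spence to this email")
question_tokens = approx_analyze("Can I get a copy?")
print(email_tokens)
print(question_tokens)
```

The second call returns `['can', 'get', 'copy']`, so the question mark noticed during the cursory glance is not visible to a model built on these default features.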