## Enron Email Classification using Machine Learning

The data-cleaning notebook for the Enron email dataset is available at:

[https://www.kaggle.com/ankur561999/data-cleaning-enron-email-dataset](https://www.kaggle.com/ankur561999/data-cleaning-enron-email-dataset)

In [68]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os

### Import necessary libraries

In [69]:
import matplotlib.pyplot as plt
import re
import string
import time
pd.set_option('display.max_rows', 50)

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop = stopwords.words('english')

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate

from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\GIGABYTE\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Load Data

In [70]:
df = pd.read_csv("C:/Users/GIGABYTE/Documents/Env/test/cleaned_data.csv")

# view first 5 rows of the dataframe
df.head()
print(len(df))

489236


### Data Pre-processing

#### Remove Folders
Remove folders that contain too few e-mails, because they would not give the classifier enough examples to learn from. Folders holding very few e-mails were probably created but never really used.

In [71]:
def remove_folders(emails, n):
    # returns only the e-mails whose folder contains more than 'n' e-mails
    email_count = dict(emails['X-Folder'].value_counts())
    small_folders = [key for key, val in email_count.items() if val<=n]
    emails = emails.loc[~emails['X-Folder'].isin(small_folders)]
    return emails

In [72]:
n = 600
df = remove_folders(df, n)
print("Total folders: ", len(df['X-Folder'].unique()))
print("df.shape: ", df.shape)
print(df)

Total folders:  21
df.shape:  (442527, 3)
                             subject    X-Folder  \
0                                Re:  'sent mail   
1                           Re: test  'sent mail   
2                          Re: Hello  'sent mail   
3                          Re: Hello  'sent mail   
4       Re: PRC review - phone calls  'sent mail   
...                              ...         ...   
489231      Trade with John Lavorato  sent items   
489232                    Gas Hedges  sent items   
489233              RE: CONFIDENTIAL  sent items   
489234     Calgary Analyst/Associate  sent items   
489235              RE: ali's essays  sent items   

                                                     body  
0       Traveling to have a business meeting takes the...  
1                          test successful.  way to go!!!  
2                     Let's shoot for Tuesday at 11:45.    
3       Greg,\n\n How about either next Tuesday or Thu...  
4                        any morn

**Combine subject and body columns**

In [73]:
df['text'] = df['subject'] + " " + df['body']

In [74]:
# drop the columns 'subject' and 'body'
df.drop(['subject','body'], axis=1, inplace=True)

In [75]:
df.head()

| | X-Folder | text |
|---|---|---|
| 0 | 'sent mail | Re: Traveling to have a business meeting takes... |
| 1 | 'sent mail | Re: test test successful. way to go!!! |
| 2 | 'sent mail | Re: Hello Let's shoot for Tuesday at 11:45. |
| 3 | 'sent mail | Re: Hello Greg,\n\n How about either next Tues... |
| 4 | 'sent mail | Re: PRC review - phone calls any morning betwe... |


Next, preprocess the text:
- Lowercase all words
- Replace runs of new lines with a single space
- Replace punctuation (including commas) with spaces
- Collapse extra white space, including tabs
- Remove stopwords

In [76]:
def preprocess(x):
    # lowercasing all the words
    x = x.lower()
    
    # remove extra new lines
    x = re.sub(r'\n+', ' ', x)
    
    # removing (replacing with empty spaces actually) all the punctuations
    x = re.sub("["+string.punctuation+"]", " ", x)
    
    # remove extra white spaces
    x = re.sub(r'\s+', ' ', x)
    
    return x
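
As a quick check of what `preprocess` does, the sketch below re-declares the same function and applies it to a made-up subject-plus-body string (the sample text is an assumption, not a row from the dataset). It wraps the punctuation set in `re.escape`, a small defensive variation on the notebook's pattern. The function collapses whitespace but does not strip it, so a trailing space can survive.

```python
import re
import string

def preprocess(x):
    # lowercase, then turn newlines and punctuation into spaces
    x = x.lower()
    x = re.sub(r'\n+', ' ', x)
    x = re.sub("[" + re.escape(string.punctuation) + "]", " ", x)
    # collapse runs of whitespace into a single space (no strip)
    x = re.sub(r'\s+', ' ', x)
    return x

sample = "Re: Hello Greg,\n\n How about Tuesday?"  # hypothetical e-mail text
cleaned = preprocess(sample)
print(repr(cleaned))  # 're hello greg how about tuesday '
```

If the trailing space matters downstream, calling `.strip()` on the result would remove it; the later `split()`-based stopword step already discards it.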

In [77]:
start = time.time()
df.loc[:,'text'] = df.loc[:, 'text'].map(preprocess)

# remove stopwords
df.loc[:, 'text'] = df.loc[:, 'text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
end = time.time()
print("Execution time (sec): ",(end - start))



Execution time (sec):  159.38078689575195
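
Part of the running time above may come from the stopword filter: `stop` is a Python list, so every `word not in stop` test scans the list. A possible variation, sketched here with a tiny hand-written stop list instead of the NLTK one (the list and the helper name `remove_stopwords` are illustrative assumptions), converts the stop words to a set first so that each lookup is a hash lookup.

```python
# Small illustrative stop list; the notebook uses nltk's stopwords.words('english').
stop_list = ["the", "is", "a", "to", "on", "for"]
stop_set = frozenset(stop_list)  # hash-based membership tests

def remove_stopwords(text, stop_words=stop_set):
    # keep only the tokens that are not stop words
    return " ".join(word for word in text.split() if word not in stop_words)

texts = ["the meeting is on tuesday", "send a reply to greg"]
cleaned = [remove_stopwords(t) for t in texts]
print(cleaned)  # ['meeting tuesday', 'send reply greg']
```

In the notebook this would amount to writing `stop = set(stopwords.words('english'))`; the resulting text is the same, only the lookup structure changes.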


In [78]:
df

| | X-Folder | text |
|---|---|---|
| 0 | 'sent mail | traveling business meeting takes fun trip espe... |
| 1 | 'sent mail | test test successful way go |
| 2 | 'sent mail | hello let shoot tuesday 11 45 |
| 3 | 'sent mail | hello greg either next tuesday thursday phillip |
| 4 | 'sent mail | prc review phone calls morning 10 11 30 |
| ... | ... | ... |
| 489231 | sent items | trade john lavorato trade oil spec hedge ng jo... |
| 489232 | sent items | gas hedges position alberta term book send pos... |
| 489233 | sent items | confidential 2 original message doucet dawn se... |
| 489234 | sent items | calgary analyst associate analyst rank stephan... |


- Select the 10 folders with the fewest e-mails (among those that passed the size filter) as the classes to predict.
- Only 10 folders are used because training time and computational cost are very high.

In [79]:
start = time.time()
folders_dict = dict(df['X-Folder'].value_counts().sort_values()[0:10])
# folders_dict = dict(df['X-Folder'].value_counts().sort_values())
data = df[df['X-Folder'].isin(folders_dict.keys())]
end = time.time()
print("Execution time (sec): ",(end - start))

Execution time (sec):  0.02993154525756836


In [80]:
folders_dict.keys()

dict_keys(['tufco', 'esvl', 'calendar', 'management', 'deal discrepancies', 'bill williams iii', 'california', 'tw-commercial group', 'logistics', 'schedule crawler'])

In [81]:
data

| | X-Folder | text |
|---|---|---|
| 6956 | california | caiso notice summer 2001 generation rfb market... |
| 6957 | california | ca iso cal px information related 2000 market ... |
| 6958 | california | caiso notification update inter sc trades adju... |
| 6959 | california | update mif meeting presentations iso website u... |
| 6960 | california | mif presentations presentations market issues ... |
| ... | ... | ... |
| 488688 | calendar | duke westcoast transaction dial number 216 090... |
| 488689 | calendar | duke westcoast transaction sent behalf peter k... |
| 488690 | calendar | updated edcc ecc pricing discussion would like... |
| 488691 | calendar | aes project tolling interest mtg derek dennist... |


In [82]:
# check number of rows in the 'data' dataframe
print("Number of instances: ", data.shape[0])
data.to_csv('preprocessed.csv', index=False)

Number of instances:  9378


In [83]:
data = pd.read_csv("preprocessed.csv")

**Encode class labels**

In [84]:
data

| | X-Folder | text |
|---|---|---|
| 0 | california | caiso notice summer 2001 generation rfb market... |
| 1 | california | ca iso cal px information related 2000 market ... |
| 2 | california | caiso notification update inter sc trades adju... |
| 3 | california | update mif meeting presentations iso website u... |
| 4 | california | mif presentations presentations market issues ... |
| ... | ... | ... |
| 9373 | calendar | duke westcoast transaction dial number 216 090... |
| 9374 | calendar | duke westcoast transaction sent behalf peter k... |
| 9375 | calendar | updated edcc ecc pricing discussion would like... |
| 9376 | calendar | aes project tolling interest mtg derek dennist... |


In [85]:
data.iloc[3][0]

'california'

In [86]:
data['X-Folder'].value_counts()

X-Folder
schedule crawler       1396
logistics              1170
tw-commercial group    1150
california             1014
bill williams iii      1004
deal discrepancies      878
management              799
calendar                700
esvl                    663
tufco                   604
Name: count, dtype: int64

In [87]:
def label_encoder(data):
    class_le = LabelEncoder()
    # apply label encoder on the 'X-Folder' column
    y = class_le.fit_transform(data['X-Folder'])
    return y


def label_encoder1(data):
    class_le = LabelEncoder()
    # apply label encoder on the 'X-Folder' column
    y_encoded = class_le.fit_transform(data['X-Folder'])
    
    # create a dictionary mapping original labels to encoded labels
    label_mapping = {label: encoded_label for label, encoded_label in zip(data['X-Folder'], y_encoded)}
    d={}
    for i in label_mapping.items():
        d[i[1]]=i[0]
    
    return d

In [88]:
y = label_encoder(data)
input_data = data['text']

In [89]:
y1= label_encoder1(data)
print(y1)

{2: 'california', 1: 'calendar', 5: 'logistics', 8: 'tufco', 6: 'management', 4: 'esvl', 9: 'tw-commercial group', 3: 'deal discrepancies', 0: 'bill williams iii', 7: 'schedule crawler'}
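
The same code-to-folder mapping can also be read directly from a fitted `LabelEncoder`, without zipping labels and codes by hand. The sketch below uses a toy label list (an assumption for illustration) and relies on the encoder's `classes_` attribute, which is sorted so that position `i` holds the label encoded as `i`.

```python
from sklearn.preprocessing import LabelEncoder

folders = ["california", "calendar", "tufco", "calendar"]  # toy labels

class_le = LabelEncoder()
y = class_le.fit_transform(folders)

# classes_ is sorted, and position i holds the label encoded as i
code_to_label = {int(code): str(label) for code, label in enumerate(class_le.classes_)}
print(code_to_label)  # {0: 'calendar', 1: 'california', 2: 'tufco'}

# inverse_transform maps predicted codes straight back to folder names
print(class_le.inverse_transform([2, 0]))
```

Keeping the fitted encoder around (or saving it next to the model) would make the hard-coded dictionary used later for decoding predictions unnecessary.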


In [117]:
# Convert y1 to a DataFrame
df_y1 = pd.DataFrame.from_dict(y1, orient='index', columns=['Value'])
# df_y1_sort = df_y1.sort_values()
print(df_y1)
df_y1.to_csv('label.csv', index=False)

                 Value
2           california
1             calendar
5            logistics
8                tufco
6           management
4                 esvl
9  tw-commercial group
3   deal discrepancies
0    bill williams iii
7     schedule crawler


In [90]:
input_data

0       caiso notice summer 2001 generation rfb market...
1       ca iso cal px information related 2000 market ...
2       caiso notification update inter sc trades adju...
3       update mif meeting presentations iso website u...
4       mif presentations presentations market issues ...
                              ...                        
9373    duke westcoast transaction dial number 216 090...
9374    duke westcoast transaction sent behalf peter k...
9375    updated edcc ecc pricing discussion would like...
9376    aes project tolling interest mtg derek dennist...
9377    transfer enron direct contracts ed marking inc...
Name: text, Length: 9378, dtype: object

In [91]:
type(input_data)

pandas.core.series.Series

## 1. Bag-of-Words

In [92]:
start = time.time()
vectorizer = CountVectorizer(min_df=5, max_features=5000)


X = vectorizer.fit_transform(input_data)
import pickle
# Save the vectorizer to a file
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
end = time.time()
print("Execution time (sec): ",(end - start))

Execution time (sec):  0.6467018127441406


In [93]:
start = time.time()
X = X.toarray()
print("X.shape: ",X.shape)
end = time.time()
print("Execution time (sec): ",(end - start))

X.shape:  (9378, 5000)
Execution time (sec):  0.0703892707824707


In [94]:
import pickle
# Save the vectorizer to a file
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)


In [95]:
# caiso notice summer 2001 generation rfb market participants california iso initiating request bids effort obtain 3 000 mw new generation resources allow iso operate iso control area meet applicable reliability criteria peak demand conditions summer period 2001 iso seeks acquire generation resources rfb one year agreements also consider bids require iso commitment summer periods 2002 2003 responses proposing one year arrangements prove insufficient meet iso requirements rfb attached email posted iso web site http www1 caiso com clientserv stakeholders inquiries regarding rfb directed writing electronically brian theaker noted first page rfb fuller director client relations summer generation rfb doc summer generation rfb doc

for i in X[0]:
    print(i,end=" ")

0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [None]:
# create dataframe to store results
# f1_data = {
#     'Algorithm': ['Gaussian NB', 'Multinomial NB','Decision Tree','SVM','AdaBoost','ANN'],
#     'BoW': ''
# }
# f1_df = pd.DataFrame(f1_data)

# jaccard_data = {
#     'Algorithm': ['Gaussian NB', 'Multinomial NB', 'Decision Tree','SVM','AdaBoost','ANN'],
#     'BoW': ''
# }
# jacc_df = pd.DataFrame(jaccard_data)

# acc_data = {
#     'Algorithm': ['Gaussian NB', 'Multinomial NB','Decision Tree','SVM','AdaBoost','ANN'],
#     'BoW': ''
# }
# acc_df = pd.DataFrame(acc_data)
# acc_df

### Training and Evaluation

In [96]:
# models = [GaussianNB(), MultinomialNB(), DecisionTreeClassifier(), LinearSVC(), 
#           AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=5),
#          MLPClassifier(hidden_layer_sizes=(10,))]

# names = ["Gaussian NB", "Multinomial NB", "Decision Tree", "SVM", "AdaBoost", "ANN"]

# models = [ LinearSVC()]

# names = ["SVM"]

# jacc_scores = []
# acc_scores = []
# f1_scores = []
# exec_times = []

# for model, name in zip(models, names):
#     print(name)
#     start = time.time()
#     scoring = {
#         'acc': 'accuracy',
#         'f1_mac': 'f1_macro',
#         'jacc_mac': 'jaccard_macro'
#     }
    
#     scores = cross_validate(model, X, y, cv=10, n_jobs=4, scoring=scoring)
#     training_time = (time.time() - start)
#     print("accuracy: ", scores['test_acc'].mean())
#     print("f1_score: ", scores['test_f1_mac'].mean())
#     print("Jaccard_index: ", scores['test_jacc_mac'].mean())
#     print("time (sec): ", training_time)
#     print("\n")

    
    # jacc_scores.append(scores['test_jacc_mac'].mean())
    # acc_scores.append(scores['test_acc'].mean())
    # f1_scores.append(scores['test_f1_mac'].mean())
    # exec_times.append(training_time)
    
# acc_df['BoW'] = acc_scores
# jacc_df['BoW'] = jacc_scores
# f1_df['BoW'] = f1_scores
# acc_df['time'] = exec_times
# acc_df

# print(name)

start = time.time()
scoring = {
    'acc': 'accuracy',
    'f1_mac': 'f1_macro',
    'jacc_mac': 'jaccard_macro'
}

model =  LinearSVC()
# print(model)
name ="SVM"
print("start")
scores = cross_validate(model, X, y, cv=10, n_jobs=4, scoring=scoring)
model.fit( X, y)
print(model)

training_time = (time.time() - start)
print("accuracy: ", scores['test_acc'].mean())
print("f1_score: ", scores['test_f1_mac'].mean())
print("Jaccard_index: ", scores['test_jacc_mac'].mean())
print("time (sec): ", training_time)
print("\n")

start




LinearSVC()
accuracy:  0.8784361467551707
f1_score:  0.8747947764622414
Jaccard_index:  0.7886337857022863
time (sec):  7.126962900161743
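
One possible refinement, not part of the run above: the vectorizer here is fitted on the whole dataset before cross-validation, so the vocabulary (and, for Tf-Idf, the idf weights) is chosen with knowledge of the held-out folds. Wrapping the vectorizer and the classifier in a scikit-learn pipeline lets `cross_validate` re-fit both on each training fold. The sketch uses a tiny made-up corpus and `cv=3`; the texts and labels are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for the e-mail texts (hypothetical data)
texts = ["gas pipeline schedule", "pipeline capacity gas", "gas transport schedule",
         "meeting room calendar", "calendar invite meeting", "lunch meeting calendar"]
labels = [0, 0, 0, 1, 1, 1]

# The vectorizer is re-fitted on the training part of every fold
pipe = make_pipeline(CountVectorizer(), LinearSVC())
scoring = {'acc': 'accuracy', 'f1_mac': 'f1_macro'}
scores = cross_validate(pipe, texts, labels, cv=3, scoring=scoring)
print(scores['test_acc'].mean(), scores['test_f1_mac'].mean())
```

A pipeline also accepts the sparse matrix directly, so the `toarray()` step is not needed for `LinearSVC`, and the fitted pipeline can be saved as a single object instead of separate vectorizer and model files.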






In [97]:
print(model.coef_)
print( model.classes_)
print(y)

[[-1.73110663e-01 -1.77862206e-01 -1.39754544e-02 ... -2.14101093e-01
   7.52005721e-03 -4.16945116e-02]
 [ 8.84928069e-02 -1.18742997e-01  7.30902095e-02 ... -2.46104349e-01
  -1.11205151e-01 -7.75818882e-02]
 [ 2.32864937e-02  5.36048638e-03  2.60208521e-18 ... -1.62545623e-02
  -5.66045605e-02 -2.91526386e-03]
 ...
 [-4.93503022e-02 -7.34093988e-02 -6.77626358e-21 ... -1.65374275e-03
   0.00000000e+00  0.00000000e+00]
 [ 9.51686225e-03  5.28276788e-02 -4.22984812e-03 ... -3.96012700e-03
   0.00000000e+00  0.00000000e+00]
 [-1.19841977e-01  1.87043734e-01  2.95832606e-02 ...  8.46106682e-02
  -7.62252119e-02  5.54327291e-02]]
[0 1 2 3 4 5 6 7 8 9]
[2 2 2 ... 1 1 1]


In [98]:
print(y1)
# y_pred = model.predict([X[0]])
# y1[y_pred[0]]


{2: 'california', 1: 'calendar', 5: 'logistics', 8: 'tufco', 6: 'management', 4: 'esvl', 9: 'tw-commercial group', 3: 'deal discrepancies', 0: 'bill williams iii', 7: 'schedule crawler'}


In [99]:
d= {2: 'california', 1: 'calendar', 5: 'logistics', 8: 'tufco', 6: 'management', 4: 'esvl', 9: 'tw-commercial group', 3: 'deal discrepancies', 0: 'bill williams iii', 7: 'schedule crawler'}
y_pred = model.predict([X[0]])
d[y_pred[0]]

'california'

In [100]:
y_pred = model.predict([X[0]])
y_pred[0]
y1[y_pred[0]]

'california'

In [101]:
X[0]

array([0, 1, 0, ..., 0, 0, 0], dtype=int64)

In [102]:
from sklearn.svm import LinearSVC
import joblib
joblib.dump(model, 'linear_svc_model.pkl')


['linear_svc_model.pkl']

# test

In [104]:
X_test_word = """ Phillip,
 Below is the issues & to do list as we go forward with documenting the 
requirements for consolidated physical/financial positions and transport 
trade capture. What we need to focus on is the first bullet in Allan's list; 
the need for a single set of requirements. Although the meeting with Keith, 
on Wednesday,  was informative the solution of creating a infinitely dynamic 
consolidated position screen, will be extremely difficult and time 
consuming.  Throughout the meeting on Wednesday, Keith alluded to the 
inability to get consensus amongst the traders on the presentation of the 
consolidated position, so the solution was to make it so that a trader can 
arrange the position screen to their liking (much like Excel). What needs to 
happen on Monday from 3 - 5 is a effort to design a desired layout for the 
consolidated position screen, this is critical. This does not exclude 
building a capability to create a more flexible position presentation for the 
future, but in order to create a plan that can be measured we need firm 
requirements. Also, to reiterate that the goals of this project is a project 
plan on consolidate physical/financial positions and transport trade capture. 
The other issues that have been raised will be capture as projects on to 
themselves, and will need to be prioritised as efforts outside of this 
project.

I have been involved in most of the meetings and the discussions have been 
good. I believe there has been good communication between the teams, but now 
we need to have focus on the objectives we set out to solve."""
X_test_word


" Phillip,\n Below is the issues & to do list as we go forward with documenting the \nrequirements for consolidated physical/financial positions and transport \ntrade capture. What we need to focus on is the first bullet in Allan's list; \nthe need for a single set of requirements. Although the meeting with Keith, \non Wednesday,  was informative the solution of creating a infinitely dynamic \nconsolidated position screen, will be extremely difficult and time \nconsuming.  Throughout the meeting on Wednesday, Keith alluded to the \ninability to get consensus amongst the traders on the presentation of the \nconsolidated position, so the solution was to make it so that a trader can \narrange the position screen to their liking (much like Excel). What needs to \nhappen on Monday from 3 - 5 is a effort to design a desired layout for the \nconsolidated position screen, this is critical. This does not exclude \nbuilding a capability to create a more flexible position presentation for the \nf

In [105]:
import pickle
def preprocess(x):
    # lowercasing all the words
    x = x.lower()
    
    # remove extra new lines
    x = re.sub(r'\n+', ' ', x)
    
    # removing (replacing with empty spaces actually) all the punctuations
    x = re.sub("["+string.punctuation+"]", " ", x)
    
    # remove extra white spaces
    x = re.sub(r'\s+', ' ', x)
    
    return x
# Load the vectorizer from file
with open('vectorizer.pkl', 'rb') as f:
    vectorizer_xtest = pickle.load(f)

# vectorizer_xtest.fixed_vocabulary_ = True
# `vocabulary` is the constructor argument (None here); the fitted mapping is `vocabulary_`
print(vectorizer.vocabulary)
print(vectorizer_xtest.vocabulary_)  # dict

# print(vectorizer_xtest._validate_vocabulary)
# print(vectorizer_xtest.fixed_vocabulary_ )

start = time.time()
X_test=preprocess(X_test_word)
print(X_test) 
# vectorizer_xtest = CountVectorizer(min_df=5, max_features=5000)
# X_test = vectorizer.fit_transform([X_test])
X_test = pd.Series([X_test])
print(X_test)

# Encode the Document
# X_test = vectorizer.transform(X_test)
X_test = vectorizer_xtest.transform(X_test)


X_test = X_test.toarray()
print(X_test)
print("X.shape: ",X_test.shape)
end = time.time()
print("Execution time (sec): ",(end - start))


None
 phillip below is the issues to do list as we go forward with documenting the requirements for consolidated physical financial positions and transport trade capture what we need to focus on is the first bullet in allan s list the need for a single set of requirements although the meeting with keith on wednesday was informative the solution of creating a infinitely dynamic consolidated position screen will be extremely difficult and time consuming throughout the meeting on wednesday keith alluded to the inability to get consensus amongst the traders on the presentation of the consolidated position so the solution was to make it so that a trader can arrange the position screen to their liking much like excel what needs to happen on monday from 3 5 is a effort to design a desired layout for the consolidated position screen this is critical this does not exclude building a capability to create a more flexible position presentation for the future but in order to create a plan that can 

In [106]:
# Load the model from file
# with open('linear_svc_model.pkl', 'rb') as f:
#     loaded_model = pickle.load(f)
import joblib

loaded_model = joblib.load('linear_svc_model.pkl')

# Use the model to predict
predictions = loaded_model.predict([X_test[0]])

d= {2: 'california', 1: 'calendar', 5: 'logistics', 8: 'tufco', 6: 'management', 4: 'esvl', 9: 'tw-commercial group', 3: 'deal discrepancies', 0: 'bill williams iii', 7: 'schedule crawler'}
print(d[predictions[0]])

calendar


In [28]:
# save the results
acc_df.to_csv("accuracy.csv", index=False)
f1_df.to_csv("f1_score.csv", index=False)
jacc_df.to_csv("jacc_score.csv", index=False)

## 2. Bag-of-Words Bigram
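
With `ngram_range=(2,2)` the vectorizer counts only two-word sequences, so single words no longer appear as features. A minimal sketch of what a bigram is, using plain whitespace tokenisation (an assumption; `CountVectorizer` applies its own token pattern and drops one-character tokens by default):

```python
def bigrams(text):
    # pair each token with its successor
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(bigrams("gas hedges position alberta"))
# ['gas hedges', 'hedges position', 'position alberta']
```

Because each distinct word pair is far rarer than the words it contains, the bigram feature space is sparser than the unigram one for the same `max_features` budget.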

In [36]:
start = time.time()
vectorizer = CountVectorizer(min_df=5, max_features=5000, ngram_range=(2,2))
X = vectorizer.fit_transform(input_data)

X = X.toarray()
print("X.shape: ",X.shape)

end = time.time()
print("Execution time (sec): ",(end - start))

X.shape:  (13586, 5000)
Execution time (sec):  7.333747625350952


### Training and Evaluation

In [37]:
# note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4)
models = [GaussianNB(), MultinomialNB(), DecisionTreeClassifier(), LinearSVC(), 
          AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=5),
         MLPClassifier(hidden_layer_sizes=(10,))]

names = ["Gaussian NB", "Multinomial NB", "Decision Tree", "SVM", "AdaBoost", "ANN"]

jacc_scores = []
acc_scores = []
f1_scores = []
exec_times = []

for model, name in zip(models, names):
    print(name)
    start = time.time()
    scoring = {
        'acc': 'accuracy',
        'f1_mac': 'f1_macro',
        'jacc_mac': 'jaccard_macro'
    }
    scores = cross_validate(model, X, y, cv=10, n_jobs=4, scoring=scoring)
    training_time = (time.time() - start)
    print("accuracy: ", scores['test_acc'].mean())
    print("f1_score: ", scores['test_f1_mac'].mean())
    print("Jaccard_index: ", scores['test_jacc_mac'].mean())
    print("time (sec): ", training_time)
    print("\n")
    
    jacc_scores.append(scores['test_jacc_mac'].mean())
    acc_scores.append(scores['test_acc'].mean())
    f1_scores.append(scores['test_f1_mac'].mean())
    exec_times.append(training_time)
    
acc_df['BoWBi'] = acc_scores
jacc_df['BoWBi'] = jacc_scores
f1_df['BoWBi'] = f1_scores
acc_df['BoWBi_time'] = exec_times
acc_df

Gaussian NB
accuracy:  0.5833930454364673
f1_score:  0.5621651556732388
Jaccard_index:  0.4068105548950894
time (sec):  11.270399570465088


Multinomial NB
accuracy:  0.6374178145803735
f1_score:  0.6170933752131809
Jaccard_index:  0.4707424107547659
time (sec):  42.18202495574951


Decision Tree
accuracy:  0.5911941987145101
f1_score:  0.5797069341612804
Jaccard_index:  0.4317389725588492
time (sec):  217.36713671684265


SVM
accuracy:  0.6324125098481621
f1_score:  0.619196736013025
Jaccard_index:  0.47206075684293436
time (sec):  21.832029104232788


AdaBoost
accuracy:  0.5783132360383674
f1_score:  0.5652402591763923
Jaccard_index:  0.41793531112035415
time (sec):  410.09173607826233


ANN
accuracy:  0.6169565575484877
f1_score:  0.604338923367805
Jaccard_index:  0.4571185691855167
time (sec):  1108.7890048027039




| | Algorithm | BoW | time | BoWBi | BoWBi_time |
|---|---|---|---|---|---|
| 0 | Gaussian NB | 0.585233 | 11.612147 | 0.583393 | 11.2704 |
| 1 | Multinomial NB | 0.737743 | 42.235202 | 0.637418 | 42.182025 |
| 2 | Decision Tree | 0.662005 | 84.62758 | 0.591194 | 217.367137 |
| 3 | SVM | 0.737451 | 31.174071 | 0.632413 | 21.832029 |
| 4 | AdaBoost | 0.667892 | 424.459035 | 0.578313 | 410.091736 |
| 5 | ANN | 0.73539 | 879.533137 | 0.616957 | 1108.789005 |


In [38]:
# save the results
acc_df.to_csv("accuracy.csv", index=False)
f1_df.to_csv("f1_score.csv", index=False)
jacc_df.to_csv("jacc_score.csv", index=False)

## 3. Tf-Idf (Term Frequency - Inverse Document Frequency)
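
Tf-Idf scales each term count by how rare the term is across documents. The sketch below computes it by hand for two toy tokenised documents (made-up data), following the smoothed-idf, L2-normalised variant that `TfidfVectorizer` uses by default: idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t.

```python
import math
from collections import Counter

docs = [["gas", "gas", "schedule"], ["gas", "meeting"]]  # toy tokenised documents
n_docs = len(docs)

# document frequency: in how many documents each term occurs
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # raw count times smoothed idf, then L2-normalise the weighted vector
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(docs[0])
print(vec)
```

Here "gas" occurs in every document, so its idf is exactly 1, while "schedule" occurs in only one and gets a higher idf; "gas" still ends up with the larger weight in the first document because it appears twice.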

In [20]:
start = time.time()
vectorizer = TfidfVectorizer(min_df=5, max_features=5000)
X = vectorizer.fit_transform(input_data)

X = X.toarray()
print("X.shape: ",X.shape)

end = time.time()
print("Execution time (sec): ",(end - start))

X.shape:  (13586, 5000)
Execution time (sec):  2.365476369857788


### Training and Evaluation

In [25]:
# note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4)
models = [GaussianNB(), MultinomialNB(), DecisionTreeClassifier(), LinearSVC(), 
          AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=5),
         MLPClassifier(hidden_layer_sizes=(10,))]

names = ["Gaussian NB", "Multinomial NB", "Decision Tree", "SVM", "AdaBoost", "ANN"]

jacc_scores = []
acc_scores = []
f1_scores = []
exec_times = []

for model, name in zip(models, names):
    print(name)
    start = time.time()
    scoring = {
        'acc': 'accuracy',
        'f1_mac': 'f1_macro',
        'jacc_mac': 'jaccard_macro'
    }
    scores = cross_validate(model, X, y, cv=10, n_jobs=4, scoring=scoring)
    training_time = (time.time() - start)
    print("accuracy: ", scores['test_acc'].mean())
    print("f1_score: ", scores['test_f1_mac'].mean())
    print("Jaccard_index: ", scores['test_jacc_mac'].mean())
    print("time (sec): ", training_time)
    print("\n")
    
    jacc_scores.append(scores['test_jacc_mac'].mean())
    acc_scores.append(scores['test_acc'].mean())
    f1_scores.append(scores['test_f1_mac'].mean())
    exec_times.append(training_time)
    
acc_df['TfIdf'] = acc_scores
jacc_df['TfIdf'] = jacc_scores
f1_df['TfIdf'] = f1_scores
acc_df['TfIdf_time'] = exec_times
acc_df

Gaussian NB
accuracy:  0.6093018127120674
f1_score:  0.5877402363957523
Jaccard_index:  0.44084640698807825
time (sec):  10.88046383857727


Multinomial NB
accuracy:  0.7368567808999297
f1_score:  0.6967070564788325
Jaccard_index:  0.5701299709091912
time (sec):  6.833428621292114


Decision Tree
accuracy:  0.649639451602311
f1_score:  0.6336756930392328
Jaccard_index:  0.48894297941690895
time (sec):  95.03275084495544


SVM
accuracy:  0.7947884663526091
f1_score:  0.7771822256420796
Jaccard_index:  0.6613918628186176
time (sec):  7.603848695755005


AdaBoost
accuracy:  0.6595013226610141
f1_score:  0.6386553507155176
Jaccard_index:  0.4976178552852552
time (sec):  411.92880725860596


ANN
accuracy:  0.7534232591104306
f1_score:  0.7344935635448607
Jaccard_index:  0.6074949481016497
time (sec):  996.1522567272186




| | Algorithm | BoW | time | BoWBi | BoWBi_time | TfIdf | TfIdf_time |
|---|---|---|---|---|---|---|---|
| 0 | Gaussian NB | 0.585233 | 11.612147 | 0.583393 | 11.2704 | 0.609302 | 10.880464 |
| 1 | Multinomial NB | 0.737743 | 42.235202 | 0.637418 | 42.182025 | 0.736857 | 6.833429 |
| 2 | Decision Tree | 0.662005 | 84.62758 | 0.591194 | 217.367137 | 0.649639 | 95.032751 |
| 3 | SVM | 0.737451 | 31.174071 | 0.632413 | 21.832029 | 0.794788 | 7.603849 |
| 4 | AdaBoost | 0.667892 | 424.459035 | 0.578313 | 410.091736 | 0.659501 | 411.928807 |
| 5 | ANN | 0.73539 | 879.533137 | 0.616957 | 1108.789005 | 0.753423 | 996.152257 |


In [26]:
# save the results
acc_df.to_csv("accuracy.csv", index=False)
f1_df.to_csv("f1_score.csv", index=False)
jacc_df.to_csv("jacc_score.csv", index=False)

In [27]:
jacc_df

| | Algorithm | BoW | BoWBi | TfIdf |
|---|---|---|---|---|
| 0 | Gaussian NB | 0.413084 | 0.406811 | 0.440846 |
| 1 | Multinomial NB | 0.577062 | 0.470742 | 0.57013 |
| 2 | Decision Tree | 0.497672 | 0.431739 | 0.488943 |
| 3 | SVM | 0.58716 | 0.472061 | 0.661392 |
| 4 | AdaBoost | 0.505863 | 0.417935 | 0.497618 |
| 5 | ANN | 0.584506 | 0.457119 | 0.607495 |
