![Enron](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Logo_de_Enron.svg/300px-Logo_de_Enron.svg.png)

# Mail from the boss

One of the most common problems companies face is CEO impersonation attacks. This is a form of fraud where a scammer sends an email purporting to be from the CEO or another senior executive of a company, commonly asking the finance team to make a payment to a third party.

In this challenge we're going to use machine learning to automatically detect whether an email from a person is legitimate. For that purpose we're going to use the corpus of emails from Enron. You can learn more about the [Enron](https://en.wikipedia.org/wiki/Enron_scandal) scandal on Wikipedia.

Now imagine that you're an Enron employee in the year 2000 and you receive an email from the boss, [Ken Lay](https://en.wikipedia.org/wiki/Kenneth_Lay), asking you to pay a huge invoice. Fortunately you have access to the email server's files and you can use machine learning to find out if the email is genuine.

Use the following Colab notebook to create a feature vector of 3000 features for each email in the Enron dataset. Populate each feature with the number of times each of the 3000 most frequent English words appears in the email.

Afterwards train a classifier to detect Ken's emails and provide a confusion matrix of the performance.

For your convenience we have already labeled Ken's emails in the dataset and extracted the body of the emails in a separate column.
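
The count features described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `CountVectorizer` with a fixed vocabulary; the four-word vocabulary and the two email bodies are made up for the example and stand in for the 3000-word list and the Enron corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins: a 4-word vocabulary instead of the 3000-word list,
# and two made-up email bodies instead of the Enron corpus.
toy_vocabulary = ["invoice", "meeting", "pay", "the"]
toy_bodies = [
    "Please pay the invoice before the meeting",
    "The meeting moved. The invoice can wait.",
]

# A fixed vocabulary pins each column to one word, so every email maps to a
# vector of the same length regardless of which words it contains.
vectorizer = CountVectorizer(vocabulary=toy_vocabulary)
counts = vectorizer.fit_transform(toy_bodies).toarray()

# One row per email, one column per vocabulary word, values are raw counts.
print(counts)
```

Note that the default tokenizer of both `CountVectorizer` and `TfidfVectorizer` ignores one-character tokens, so a dictionary entry such as "a" always gets a zero count unless `token_pattern` is changed.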

In [0]:
import pandas as pd

In [0]:
# Skip malformed rows; on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
emails = pd.read_csv('https://storage.googleapis.com/bewica-challenge/emails_from_boss_small.csv', engine='python', on_bad_lines='skip')

## Below is the full dataset from Enron. We suggest using it only to check performance at the end of the exercise.
# emails = pd.read_csv('https://storage.googleapis.com/bewica-challenge/emails_from_boss.csv', engine='python', on_bad_lines='skip')

In [3]:
emails[emails.from_boss].body.tail(10)

59339    Okay, I've already had questions about stateme...
59340    FYI - Thought you might be interested in Gary ...
59905    Today, Enron hosted a conference call to give ...
59926    Today we announced another positive developmen...
60063    As you have heard, Dynegy is terminating the m...
60131    Many of you have asked whether you should come...
61143    I want to remind you about our All-Employee Me...
61150    Today we announced the appointment of Jeff McM...
61342    I know that this is a difficult time for all o...
61343    Today we released additional information about...
Name: body, dtype: object

In [0]:
dictionary = pd.read_csv("https://storage.googleapis.com/bewica-challenge/most_common_3000_words.txt", header=None, names=['word'])

In [5]:
dictionary.head(10)

| | word |
|-------|-------|
| 0 | a |
| 1 | abandon |
| 2 | ability |
| 3 | able |
| 4 | abortion |
| 5 | about |
| 6 | above |
| 7 | abroad |
| 8 | absence |
| 9 | absolute |


In [6]:
emails.columns

Index(['Unnamed: 0', 'file', 'message', 'user', 'type', 'email_from',
       'email_to', 'xfrom', 'xto', 'date', 'subject', 'body', 'from_boss'],
      dtype='object')

In [7]:
# keep only subject and body since we're training a classifier based on text/vocabulary
# other features such as 'message', 'date' (especially time of day) or the email's recipient could also be useful
emails = emails.drop(columns=['Unnamed: 0', 'file', 'message', 'user', 'type', 'email_from', 'email_to', 'xfrom', 'xto', 'date'])

# remove the subject too, since taking it into account would require a bit more work
# it might only add signals like "boss tends to hit the reply button", which may not be that relevant with a 3k vocabulary
# so we drop this feature for now and keep only the body
# fill NaNs with an empty string (assuming it's an email with no body, which might itself reflect boss behaviour)
emails = emails.drop(columns=['subject']).fillna('')
emails.head()

| | body | from_boss |
|-------|-------|-------|
| 0 | Here is our forecast | False |
| 1 | Traveling to have a business meeting takes the... | False |
| 2 | test successful. way to go!!! | False |
| 3 | Randy, Can you send me a schedule of the salar... | False |
| 4 | Let's shoot for Tuesday at 11:45. | False |


In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn import metrics

train, test = train_test_split(emails, test_size=0.2, random_state=3)

In [9]:
train.head()

| | body | from_boss |
|-------|-------|-------|
| 37648 | | False |
| 26623 | I ran across this article in the Yomiuri Newsp... | False |
| 44908 | Hi Larry,Thanks for completing the diamond inf... | False |
| 56754 | Attached is a copy of our California Separatio... | False |
| 46065 | Today I spoke with Jean Calhoun of the Arizona... | False |


In [10]:
print('Train data:\n', train.groupby('from_boss').count())
print('\nTest data:\n', test.groupby('from_boss').count())

Train data:
             body
from_boss       
False      52221
True         153

Test data:
             body
from_boss       
False      13054
True          40


In [11]:
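# TF-IDF weights over the fixed 3000-word dictionary (a reweighting of raw word counts),
# followed by a linear SVM with a hand-picked C
pipeline = Pipeline([('vect', TfidfVectorizer(vocabulary=dictionary['word'])),
                     ('clf', LinearSVC(C=1000))])
pipeline.fit(train['body'], train['from_boss'])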

Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
  ...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [12]:
predictions = pipeline.predict(test['body'])
print(metrics.classification_report(test['from_boss'], predictions))

cm = metrics.confusion_matrix(test['from_boss'], predictions)
print(cm)

             precision    recall  f1-score   support

      False       1.00      1.00      1.00     13054
       True       0.95      0.97      0.96        40

avg / total       1.00      1.00      1.00     13094

[[13052     2]
 [    1    39]]


# Commentary:
Results look good on the small dataset for a quick model: on the boss class, precision is 0.95 and recall is 0.97.
## Model comments
- __Dropping everything but body text__ still gives good results.
- __No effort went into tuning__ classifier parameters (C=1000 was set by hand, everything else was kept at defaults) or trying other classifiers; for an actual model this should be done.
- Even though there's a __heavy class imbalance__ (around 330 times more non-boss emails), the classifier works well without doing anything special about it, but this should probably be the __first thing to double-check/improve in the model__.
- __Next improvements__: we could try adding as features the subject (more text data), the recipient (typical recipients of boss emails), the date (the boss might never write an email before 10am, or never on a Saturday), or other properties of the email such as whether it has attachments or which server/domain sent it.
- Other improvements could be checks for typical spam/scam flags such as money-related signs/talk (£, $), but that would be for a model that detects 'dangerous boss emails', not just 'is this email from my boss'.
- We assumed NaN in the body text meant an empty message (and therefore filled it with an empty string).
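
One way to address the class imbalance mentioned above is to weight the classes inversely to their frequency. The sketch below is an assumption about how this could be tried, not something that was run on the Enron data: it uses the label counts from the small training split and scikit-learn's `class_weight='balanced'` option to show how much more each boss email would weigh.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

# Label counts copied from the small-dataset training split (0 = non-boss, 1 = boss).
n_non_boss, n_boss = 52221, 153
labels = np.array([0] * n_non_boss + [1] * n_boss)

# 'balanced' weights are n_samples / (n_classes * count_per_class), so the
# rare boss class gets a proportionally larger penalty for each mistake.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=labels)
print(dict(zip(["non_boss", "boss"], weights.round(3))))

# The same weighting can be requested directly on the classifier.
weighted_clf = LinearSVC(C=1000, class_weight="balanced")
```

Whether this helps here would have to be checked on the test split, since weighting the rare class more heavily typically trades fewer false negatives for more false positives.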

## Train/test split comments
__The split between train and test data is pretty good__ with 'random_state=3' given the imbalanced nature of the data (both parts have a similar ratio of non-boss to boss emails), so no further effort was made here on the small dataset.

__Train sample split__

| from_boss | occurrences |
|-------|-------|
| False | 52221 |
| True  | 153   |

*ratio non_boss/boss emails = 341*

__Test sample split__

| from_boss | occurrences |
|-------|-------|
| False | 13054 |
| True  | 40   |

*ratio non_boss/boss emails = 326*
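
Rather than relying on a lucky `random_state`, the class ratio can be kept equal in both parts by stratifying the split. A minimal sketch, assuming a made-up toy frame (`toy_emails`) with the same column names as the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with the same column names as the notebook:
# 1000 emails, 10 of them from the boss (1%).
toy_emails = pd.DataFrame({
    "body": [f"email {i}" for i in range(1000)],
    "from_boss": [i < 10 for i in range(1000)],
})

# stratify= keeps the share of boss emails the same in both parts.
toy_train, toy_test = train_test_split(
    toy_emails, test_size=0.2, random_state=3, stratify=toy_emails["from_boss"]
)

print("boss emails in train:", toy_train["from_boss"].sum())
print("boss emails in test: ", toy_test["from_boss"].sum())
```

With stratification the 1% share of boss emails is preserved in both the 800-row training part and the 200-row test part.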

## Test results comments
Only 3 misclassifications (1 false negative and 2 false positives) over 13094 test samples.

__Confusion matrix on test sample__

| | Predicted False | Predicted True |
|-------|-------|-------|
| Actual False | 13052 | 2 |
| Actual True  | 1   | 39 |
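
The precision and recall in the classification report follow directly from these four counts. As a worked check, using the values printed by `metrics.confusion_matrix` for the small test sample (rows are actual classes, columns are predicted classes):

```python
# Counts from the confusion matrix printed for the small test sample
# (rows = actual, columns = predicted).
tn, fp = 13052, 2
fn, tp = 1, 39

precision = tp / (tp + fp)   # share of predicted-boss emails that really are from the boss
recall = tp / (tp + fn)      # share of real boss emails that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```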

In [13]:
# @title Now with the full dataset
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
emails_full = pd.read_csv('https://storage.googleapis.com/bewica-challenge/emails_from_boss.csv', engine='python', on_bad_lines='skip')
emails_full = emails_full.drop(columns=['Unnamed: 0', 'file', 'message', 'user', 'type', 'email_from', 'email_to', 'xfrom', 'xto', 'date', 'subject']).fillna('')
emails_full.head()

| | body | from_boss |
|-------|-------|-------|
| 0 | Here is our forecast | False |
| 1 | Traveling to have a business meeting takes the... | False |
| 2 | test successful. way to go!!! | False |
| 3 | Randy, Can you send me a schedule of the salar... | False |
| 4 | Let's shoot for Tuesday at 11:45. | False |


In [14]:
train_full, test_full = train_test_split(emails_full, test_size=0.2, random_state=3)
print('Train data:\n', train_full.groupby('from_boss').count())
print('\nTest data:\n', test_full.groupby('from_boss').count())

Train data:
              body
from_boss        
False      673421
True         1156

Test data:
              body
from_boss        
False      168350
True          295


In [15]:
pipeline_full = Pipeline([('vect', TfidfVectorizer(vocabulary=dictionary['word'])),
                       ('clf', LinearSVC(C=1000)),])
pipeline_full.fit(train_full['body'], train_full['from_boss'])

predictions = pipeline_full.predict(test_full['body'])

print(metrics.classification_report(test_full['from_boss'], predictions))

cm = metrics.confusion_matrix(test_full['from_boss'], predictions)
print(cm)

             precision    recall  f1-score   support

      False       1.00      1.00      1.00    168350
       True       0.79      0.99      0.88       295

avg / total       1.00      1.00      1.00    168645

[[168274     76]
 [     2    293]]


# Full dataset Commentary:
Results are not as good on the full dataset with the same quick model: precision on the boss class drops from 0.95 to 0.79 while recall stays high at 0.99, so more tuning is needed.
## Model comments
Same model as used on the small dataset.
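
The tuning mentioned above could start with a cross-validated search over the regularisation parameter `C`. The following is only a sketch under assumptions: the toy corpus and labels are made up to keep the example self-contained, and the grid values are arbitrary; on the real data `train_full['body']` and `train_full['from_boss']` would take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the real training data
# (1 = from the boss, 0 = not from the boss).
toy_bodies = (
    ["today we announced quarterly results to all employees"] * 6
    + ["lunch on tuesday works for me"] * 6
    + ["please send me the salary schedule"] * 6
)
toy_labels = [1] * 6 + [0] * 12

search_pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Try several regularisation strengths and score on F1 of the boss class,
# which is more informative than accuracy when one class is rare.
search = GridSearchCV(
    search_pipeline,
    param_grid={"clf__C": [0.1, 1, 10, 100, 1000]},
    scoring="f1",
    cv=3,
)
search.fit(toy_bodies, toy_labels)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring on F1 (or on precision alone) matters here because plain accuracy is already close to 1.00 for any model that mostly predicts the majority class.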

## Train/test split comments
__The split between train and test data is still pretty good__ with 'random_state=3'; this dataset is even more imbalanced than the small one (around 580 times more non-boss emails).

__Train sample split__

| from_boss | occurrences |
|-------|-------|
| False | 673421 |
| True  | 1156   |

*ratio non_boss/boss emails ≈ 583*

__Test sample split__

| from_boss | occurrences |
|-------|-------|
| False | 168350 |
| True  | 295   |

*ratio non_boss/boss emails ≈ 571*

## Test results comments
78 misclassifications (2 false negatives and 76 false positives) over 168645 test samples.

__Confusion matrix on test sample__

| | Predicted False | Predicted True |
|-------|-------|-------|
| Actual False | 168274 | 76 |
| Actual True  | 2   | 293 |

## Final comment
When the model misclassifies, __most errors are false positives__ on both the small and the full dataset (2 of 3 and 76 of 78 errors, respectively).
Given the __nature of the problem__ (testing whether an email is actually from our boss) this __might be dangerous__, since the model claims that an email is from your boss when it really isn't. This model would need some __more work to be usable__.
We could also use a classifier that yields a probability instead of a hard classification, and then set a safer, more conservative threshold for 'is this email from my boss'.
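
The thresholding idea can be sketched as follows. This is an illustration under assumptions, not the model used above: it swaps `LinearSVC` for `LogisticRegression` (which exposes `predict_proba`), uses a made-up toy corpus, and the helper `flag_as_boss` and the 0.9 threshold are hypothetical choices.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = from the boss, 0 = not from the boss).
toy_bodies = [
    "today we announced the appointment of a new officer",
    "i want to remind you about our all employee meeting",
    "today we released additional information to employees",
    "let's shoot for tuesday at lunch",
    "can you send me the salary schedule",
    "test successful way to go",
]
toy_labels = [1, 1, 1, 0, 0, 0]
incoming = [
    "today we announced an all employee meeting",
    "can you pay this invoice today",
]

proba_pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
proba_pipeline.fit(toy_bodies, toy_labels)

# Column 1 of predict_proba is the estimated probability of class 1 (boss).
boss_proba = proba_pipeline.predict_proba(incoming)[:, 1]


def flag_as_boss(probabilities, threshold):
    """Accept an email as genuine only if its probability reaches the threshold."""
    return np.asarray(probabilities) >= threshold


default_flags = flag_as_boss(boss_proba, 0.5)   # equivalent to a plain predict()
strict_flags = flag_as_boss(boss_proba, 0.9)    # more conservative: fewer emails accepted

print(boss_proba.round(3), default_flags, strict_flags)
```

Raising the threshold can only reduce the number of emails accepted as genuine, which lowers false positives at the cost of more false negatives; the right trade-off would have to be chosen on held-out data.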