## Identifying Fraud at Enron Using Emails and Financial Data

**This project explores the email dataset of Enron Corp, known worldwide for a massive corporate fraud that led to the company's bankruptcy. We attempt to find patterns in the data and classify emails in order to detect fraudulent ones.**

#### 1. Introduction

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud.
In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data for top executives.
For this project, predictive models were built using the scikit-learn, NumPy, and pandas modules in Python. The targets of the predictions were persons of interest (POIs), defined as "individuals who were indicted, reached a settlement, or plea deal with the government, or testified in exchange for prosecution immunity".
Financial compensation data and aggregate email statistics from the Enron Corpus were used as features for prediction.

**The goal of this project is to build a prediction model to identify persons of interest (POIs).**

In [1]:
#Importing necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pickle

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("wcukierski/enron-email-dataset")

print("Path to dataset files:", path)

from os import walk

f = []
for (dirpath, dirnames, filenames) in walk(path):
    f.extend(filenames)
    break

print(f)

Path to dataset files: /Users/sara/.cache/kagglehub/datasets/wcukierski/enron-email-dataset/versions/2
['emails.csv']
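
The download above yields a single `emails.csv` file of raw messages. As a minimal sketch of how one raw message could be split into headers and body, the example below uses Python's standard `email` module on a made-up message. The message text and the `parse_message` helper are hypothetical and only illustrate the idea; how the raw text is stored in the CSV is not shown in this notebook and would need to be checked first.

```python
from email import message_from_string

# Hypothetical raw message in the usual header/blank-line/body layout;
# it is NOT taken from the dataset.
raw_message = (
    "Message-ID: <example-1@example.com>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Quarterly numbers\n"
    "\n"
    "Please see the attached figures.\n"
)


def parse_message(raw):
    """Split one raw email string into a few headers and the body text."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


parsed = parse_message(raw_message)
print(parsed["from"], "->", parsed["to"])
print(parsed["subject"])
```

Applied row by row, a helper like this would give structured sender and recipient fields of the kind that aggregate email statistics are built from.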


In [3]:
import pickle

# Load the pickled dictionary of per-person financial and email features
with open("final_project_dataset.pkl", "rb") as infile:
    enron_data = pickle.load(infile)
enron_data

{'METTS MARK': {'salary': 365788,
  'to_messages': 807,
  'deferral_payments': 'NaN',
  'total_payments': 1061827,
  'loan_advances': 'NaN',
  'bonus': 600000,
  'email_address': 'mark.metts@enron.com',
  'restricted_stock_deferred': 'NaN',
  'deferred_income': 'NaN',
  'total_stock_value': 585062,
  'expenses': 94299,
  'from_poi_to_this_person': 38,
  'exercised_stock_options': 'NaN',
  'from_messages': 29,
  'other': 1740,
  'from_this_person_to_poi': 1,
  'poi': False,
  'long_term_incentive': 'NaN',
  'shared_receipt_with_poi': 702,
  'restricted_stock': 585062,
  'director_fees': 'NaN'},
 'BAXTER JOHN C': {'salary': 267102,
  'to_messages': 'NaN',
  'deferral_payments': 1295738,
  'total_payments': 5634343,
  'loan_advances': 'NaN',
  'bonus': 1200000,
  'email_address': 'NaN',
  'restricted_stock_deferred': 'NaN',
  'deferred_income': -1386055,
  'total_stock_value': 10623258,
  'expenses': 11200,
  'from_poi_to_this_person': 'NaN',
  'exercised_stock_options': 6680544,
  'from_

In [4]:
print("Number of people in the Enron dataset:",len(enron_data))
print(type(enron_data))

Number of people in the Enron dataset: 146
<class 'dict'>


In [5]:
print(enron_data.keys())
print(enron_data.values())

dict_keys(['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HAEDICKE MARK E', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'NOLES JAMES L', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R', 'LOWRY CHARLES P', 'COLWELL WESLEY', 'MULLER MARK S', 'JACKSON CHARLENE R', 'WESTFAHL RICHARD K', 'WALTERS GARETH W', 'WALLS JR ROBERT H', 'KITCHEN LOUISE', 'CHAN RONNIE', 'BELFER ROBERT', 'SHANKMAN JEFFREY A', 'WODRASKA JOHN', 'BERGSIEKER RICHARD P', 'URQUHART JOHN A', 'BIBI PHILIPPE A', 'RIEKER PAULA H', 'WHALEY DAVID A', 'BECK SALLY W', 'HAUG DAVID L', 'ECHOLS JOHN B', 'MENDELSOHN JOHN', 'HICKERSON GARY J', 'CLINE KENNETH W', 'LEWIS RICHARD', 'HAYES ROBERT E', 'KOPPER MICHAEL J', 'LEFF DANIEL P', 'LAVORATO JOHN J', 'BERBERIAN DAVID', 'DETMERING TIMOTHY J', 'WAKEHAM JOHN', 'POWERS WILLIAM', 'GOLD JOSEPH', 'BANNANTINE JAMES M', 'DUNCAN JOHN H', 'SHAPIRO RICHARD S', 'SHERRIFF JOHN R', 'SHELBY REX', 'LEMA

In [6]:
enron_df = pd.DataFrame.from_records(list(enron_data.values()))
enron_df.head()

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,loan_advances,bonus,email_address,restricted_stock_deferred,deferred_income,total_stock_value,...,from_poi_to_this_person,exercised_stock_options,from_messages,other,from_this_person_to_poi,poi,long_term_incentive,shared_receipt_with_poi,restricted_stock,director_fees
0,365788.0,807.0,,1061827.0,,600000.0,mark.metts@enron.com,,,585062,...,38.0,,29.0,1740.0,1.0,False,,702.0,585062,
1,267102.0,,1295738.0,5634343.0,,1200000.0,,,-1386055.0,10623258,...,,6680544.0,,2660303.0,,False,1586055.0,,3942714,
2,170941.0,,,211725.0,,350000.0,steven.elliott@enron.com,,-400729.0,6678735,...,,4890344.0,,12961.0,,False,,,1788391,
3,,764.0,,,,,bill.cordes@enron.com,,,1038185,...,10.0,651850.0,12.0,,0.0,False,,58.0,386335,
4,243293.0,1045.0,,288682.0,,1500000.0,kevin.hannon@enron.com,,-3117011.0,6391065,...,32.0,5538001.0,32.0,11350.0,21.0,True,1617011.0,1035.0,853064,


In [7]:
employees = pd.Series(list(enron_data.keys()))
employees.head()

0          METTS MARK
1       BAXTER JOHN C
2      ELLIOTT STEVEN
3    CORDES WILLIAM R
4      HANNON KEVIN P
dtype: object

In [8]:
enron_df.set_index(employees, inplace=True)
enron_df.head()

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,loan_advances,bonus,email_address,restricted_stock_deferred,deferred_income,total_stock_value,...,from_poi_to_this_person,exercised_stock_options,from_messages,other,from_this_person_to_poi,poi,long_term_incentive,shared_receipt_with_poi,restricted_stock,director_fees
METTS MARK,365788.0,807.0,,1061827.0,,600000.0,mark.metts@enron.com,,,585062,...,38.0,,29.0,1740.0,1.0,False,,702.0,585062,
BAXTER JOHN C,267102.0,,1295738.0,5634343.0,,1200000.0,,,-1386055.0,10623258,...,,6680544.0,,2660303.0,,False,1586055.0,,3942714,
ELLIOTT STEVEN,170941.0,,,211725.0,,350000.0,steven.elliott@enron.com,,-400729.0,6678735,...,,4890344.0,,12961.0,,False,,,1788391,
CORDES WILLIAM R,,764.0,,,,,bill.cordes@enron.com,,,1038185,...,10.0,651850.0,12.0,,0.0,False,,58.0,386335,
HANNON KEVIN P,243293.0,1045.0,,288682.0,,1500000.0,kevin.hannon@enron.com,,-3117011.0,6391065,...,32.0,5538001.0,32.0,11350.0,21.0,True,1617011.0,1035.0,853064,
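
Cells 6 to 8 build the frame from the dictionary values and then attach the names as the index in a separate step. A possible shortcut, sketched here on a made-up two-person dictionary with the same nested shape, is `pd.DataFrame.from_dict` with `orient="index"`, which uses the outer keys as row labels directly. The names `toy_data` and `toy_df` are illustrative only.

```python
import pandas as pd

# Hypothetical two-person excerpt with the same nested-dict shape as enron_data.
toy_data = {
    "PERSON A": {"salary": 365788, "bonus": 600000, "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": 1500000, "poi": True},
}

# orient="index" makes the outer keys row labels and the inner keys columns.
toy_df = pd.DataFrame.from_dict(toy_data, orient="index")
print(toy_df)
```

This avoids relying on the keys and values being iterated in the same order, although for a plain `dict` that order is consistent anyway.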


In [9]:
enron_df.describe()

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,loan_advances,bonus,email_address,restricted_stock_deferred,deferred_income,total_stock_value,...,from_poi_to_this_person,exercised_stock_options,from_messages,other,from_this_person_to_poi,poi,long_term_incentive,shared_receipt_with_poi,restricted_stock,director_fees
count,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,...,146.0,146.0,146.0,146.0,146.0,146,146.0,146.0,146.0,146.0
unique,95.0,87.0,40.0,126.0,5.0,42.0,112.0,19.0,45.0,125.0,...,58.0,102.0,65.0,93.0,42.0,2,53.0,84.0,98.0,18.0
top,,,,,,,,,,,...,,,,,,False,,,,
freq,51.0,60.0,107.0,21.0,142.0,64.0,35.0,128.0,97.0,20.0,...,60.0,44.0,60.0,53.0,60.0,128,80.0,60.0,36.0,129.0
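
The summary above reports `count`, `unique`, `top`, and `freq` rather than means and quantiles, and every count is 146. That is consistent with the columns being stored as `object` dtype, with missing values written as the string `'NaN'` (as in the raw dictionary shown earlier). A minimal sketch on a made-up frame shows why such placeholders are invisible to `isna()` and how they could be counted instead:

```python
import pandas as pd

# Hypothetical frame mixing numbers with the string placeholder 'NaN'.
toy_df = pd.DataFrame(
    {
        "salary": [365788, "NaN", 170941],
        "bonus": [600000, 1200000, "NaN"],
        "loan_advances": ["NaN", "NaN", "NaN"],
    }
)

# The string 'NaN' is an ordinary Python object, so isna() does not flag it.
true_missing = toy_df.isna().sum()

# Comparing against the literal string counts the placeholders instead.
placeholder_missing = (toy_df == "NaN").sum()

print(true_missing)
print(placeholder_missing)
```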


In [10]:
enron_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, METTS MARK to GLISAN JR BEN F
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   salary                     146 non-null    object
 1   to_messages                146 non-null    object
 2   deferral_payments          146 non-null    object
 3   total_payments             146 non-null    object
 4   loan_advances              146 non-null    object
 5   bonus                      146 non-null    object
 6   email_address              146 non-null    object
 7   restricted_stock_deferred  146 non-null    object
 8   deferred_income            146 non-null    object
 9   total_stock_value          146 non-null    object
 10  expenses                   146 non-null    object
 11  from_poi_to_this_person    146 non-null    object
 12  exercised_stock_options    146 non-null    object
 13  from_messages              146 non-null    object

In [11]:
enron_df.shape

(146, 21)

In [12]:
poi_count = enron_df.groupby('poi').size()
print(poi_count)
print('Total number of non-POIs in the dataset: ',poi_count.iloc[0])
print('Total number of POIs in the dataset: ',poi_count.iloc[1])

poi
False    128
True      18
dtype: int64
Total number of non-POIs in the dataset:  128
Total number of POIs in the dataset:  18
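
With 18 POIs among 146 records, the classes are heavily imbalanced. The short calculation below, using only the counts printed above, shows the POI share and the accuracy a trivial model would reach by always predicting "non-POI". It is a worked arithmetic example, not part of the modelling pipeline.

```python
# Counts taken from the groupby output above.
n_non_poi = 128
n_poi = 18
n_total = n_non_poi + n_poi

poi_share = n_poi / n_total
majority_baseline_accuracy = n_non_poi / n_total

print(f"POI share: {poi_share:.3f}")
print(f"Accuracy of always predicting non-POI: {majority_baseline_accuracy:.3f}")
```

Because the trivial baseline is already close to 88% accurate, accuracy alone says little here; metrics such as precision and recall on the POI class are usually more informative for data this skewed.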


In [13]:
enron_df.dtypes

salary                       object
to_messages                  object
deferral_payments            object
total_payments               object
loan_advances                object
bonus                        object
email_address                object
restricted_stock_deferred    object
deferred_income              object
total_stock_value            object
expenses                     object
from_poi_to_this_person      object
exercised_stock_options      object
from_messages                object
other                        object
from_this_person_to_poi      object
poi                            bool
long_term_incentive          object
shared_receipt_with_poi      object
restricted_stock             object
director_fees                object
dtype: object

In [14]:
enron_df_original = enron_df.copy()
enron_df_original.head()

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,loan_advances,bonus,email_address,restricted_stock_deferred,deferred_income,total_stock_value,...,from_poi_to_this_person,exercised_stock_options,from_messages,other,from_this_person_to_poi,poi,long_term_incentive,shared_receipt_with_poi,restricted_stock,director_fees
METTS MARK,365788.0,807.0,,1061827.0,,600000.0,mark.metts@enron.com,,,585062,...,38.0,,29.0,1740.0,1.0,False,,702.0,585062,
BAXTER JOHN C,267102.0,,1295738.0,5634343.0,,1200000.0,,,-1386055.0,10623258,...,,6680544.0,,2660303.0,,False,1586055.0,,3942714,
ELLIOTT STEVEN,170941.0,,,211725.0,,350000.0,steven.elliott@enron.com,,-400729.0,6678735,...,,4890344.0,,12961.0,,False,,,1788391,
CORDES WILLIAM R,,764.0,,,,,bill.cordes@enron.com,,,1038185,...,10.0,651850.0,12.0,,0.0,False,,58.0,386335,
HANNON KEVIN P,243293.0,1045.0,,288682.0,,1500000.0,kevin.hannon@enron.com,,-3117011.0,6391065,...,32.0,5538001.0,32.0,11350.0,21.0,True,1617011.0,1035.0,853064,


In [15]:
# Convert every column to numeric; values that cannot be parsed (the 'NaN' strings, email addresses) become NaN, then 0
enron_df = enron_df.apply(pd.to_numeric, errors='coerce').fillna(0)
enron_df.head()

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,loan_advances,bonus,email_address,restricted_stock_deferred,deferred_income,total_stock_value,...,from_poi_to_this_person,exercised_stock_options,from_messages,other,from_this_person_to_poi,poi,long_term_incentive,shared_receipt_with_poi,restricted_stock,director_fees
METTS MARK,365788.0,807.0,0.0,1061827.0,0.0,600000.0,0.0,0.0,0.0,585062.0,...,38.0,0.0,29.0,1740.0,1.0,False,0.0,702.0,585062.0,0.0
BAXTER JOHN C,267102.0,0.0,1295738.0,5634343.0,0.0,1200000.0,0.0,0.0,-1386055.0,10623258.0,...,0.0,6680544.0,0.0,2660303.0,0.0,False,1586055.0,0.0,3942714.0,0.0
ELLIOTT STEVEN,170941.0,0.0,0.0,211725.0,0.0,350000.0,0.0,0.0,-400729.0,6678735.0,...,0.0,4890344.0,0.0,12961.0,0.0,False,0.0,0.0,1788391.0,0.0
CORDES WILLIAM R,0.0,764.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1038185.0,...,10.0,651850.0,12.0,0.0,0.0,False,0.0,58.0,386335.0,0.0
HANNON KEVIN P,243293.0,1045.0,0.0,288682.0,0.0,1500000.0,0.0,0.0,-3117011.0,6391065.0,...,32.0,5538001.0,32.0,11350.0,21.0,True,1617011.0,1035.0,853064.0,0.0
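
To make the conversion step concrete, here is a small sketch on a made-up series showing what `errors="coerce"` followed by `fillna(0)` does to numbers, `'NaN'` placeholders, and free text. The values are illustrative only.

```python
import pandas as pd

# Hypothetical column mixing numbers, the 'NaN' placeholder, and text.
toy = pd.Series([365788, "NaN", "mark.metts@enron.com", -1386055], dtype="object")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_numeric = pd.to_numeric(toy, errors="coerce")

# fillna(0) then replaces those NaN values with 0.
filled = as_numeric.fillna(0)
print(filled.tolist())
```

One consequence worth keeping in mind is that a missing value and a genuine zero become indistinguishable after this step, and the `email_address` column is reduced to all zeros.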


In [16]:
# email_address holds no numeric information (all zeros after coercion), so drop it
enron_df.drop(columns=['email_address'], inplace=True)
enron_df.head()

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,loan_advances,bonus,restricted_stock_deferred,deferred_income,total_stock_value,expenses,from_poi_to_this_person,exercised_stock_options,from_messages,other,from_this_person_to_poi,poi,long_term_incentive,shared_receipt_with_poi,restricted_stock,director_fees
METTS MARK,365788.0,807.0,0.0,1061827.0,0.0,600000.0,0.0,0.0,585062.0,94299.0,38.0,0.0,29.0,1740.0,1.0,False,0.0,702.0,585062.0,0.0
BAXTER JOHN C,267102.0,0.0,1295738.0,5634343.0,0.0,1200000.0,0.0,-1386055.0,10623258.0,11200.0,0.0,6680544.0,0.0,2660303.0,0.0,False,1586055.0,0.0,3942714.0,0.0
ELLIOTT STEVEN,170941.0,0.0,0.0,211725.0,0.0,350000.0,0.0,-400729.0,6678735.0,78552.0,0.0,4890344.0,0.0,12961.0,0.0,False,0.0,0.0,1788391.0,0.0
CORDES WILLIAM R,0.0,764.0,0.0,0.0,0.0,0.0,0.0,0.0,1038185.0,0.0,10.0,651850.0,12.0,0.0,0.0,False,0.0,58.0,386335.0,0.0
HANNON KEVIN P,243293.0,1045.0,0.0,288682.0,0.0,1500000.0,0.0,-3117011.0,6391065.0,34039.0,32.0,5538001.0,32.0,11350.0,21.0,True,1617011.0,1035.0,853064.0,0.0
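
Since the stated goal is to predict the `poi` flag from the remaining columns, a natural next step is to separate the label from the features. The sketch below does this on a made-up two-row frame; `features` and `labels` are hypothetical names, and the notebook may organise this differently.

```python
import pandas as pd

# Hypothetical cleaned frame with the same kind of columns as enron_df.
toy_df = pd.DataFrame(
    {
        "salary": [365788.0, 243293.0],
        "bonus": [600000.0, 1500000.0],
        "poi": [False, True],
    },
    index=["PERSON A", "PERSON B"],
)

# Everything except the target column is treated as a feature.
features = toy_df.drop(columns=["poi"])

# Encode the boolean target as 0/1 integers.
labels = toy_df["poi"].astype(int)

print(features.shape, labels.tolist())
```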
