Starter code for exploring the Enron dataset (emails + finances); loads up the dataset (pickled dict of dicts).

The dataset has the form: enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }

{features_dict} is a dictionary of features associated with that person.

You should explore features_dict as part of the mini-project, but here's an example to get you started:

`enron_data["SKILLING JEFFREY K"]["bonus"] == 5600000`

In [54]:
import pickle
import pandas as pd

In [None]:
# Open the file in a context manager so the handle is closed after loading.
with open("../final_project/final_project_dataset.pkl", "rb") as f:
    enron_data = pickle.load(f)

In [12]:
# Note the object is a dictionary

enron_data.keys()

dict_keys(['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HAEDICKE MARK E', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'NOLES JAMES L', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R', 'LOWRY CHARLES P', 'COLWELL WESLEY', 'MULLER MARK S', 'JACKSON CHARLENE R', 'WESTFAHL RICHARD K', 'WALTERS GARETH W', 'WALLS JR ROBERT H', 'KITCHEN LOUISE', 'CHAN RONNIE', 'BELFER ROBERT', 'SHANKMAN JEFFREY A', 'WODRASKA JOHN', 'BERGSIEKER RICHARD P', 'URQUHART JOHN A', 'BIBI PHILIPPE A', 'RIEKER PAULA H', 'WHALEY DAVID A', 'BECK SALLY W', 'HAUG DAVID L', 'ECHOLS JOHN B', 'MENDELSOHN JOHN', 'HICKERSON GARY J', 'CLINE KENNETH W', 'LEWIS RICHARD', 'HAYES ROBERT E', 'KOPPER MICHAEL J', 'LEFF DANIEL P', 'LAVORATO JOHN J', 'BERBERIAN DAVID', 'DETMERING TIMOTHY J', 'WAKEHAM JOHN', 'POWERS WILLIAM', 'GOLD JOSEPH', 'BANNANTINE JAMES M', 'DUNCAN JOHN H', 'SHAPIRO RICHARD S', 'SHERRIFF JOHN R', 'SHELBY REX', 'LEMA

The aggregated Enron email + financial dataset is stored in a dictionary, where each key in the dictionary is a person’s name and the value is a dictionary containing all the features of that person.
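As a rough illustration of that nested layout, the sketch below builds a tiny stand-in dictionary and walks it. The names and values in `toy_data` are made up for the example and are not taken from the Enron data.

```python
# Toy stand-in for enron_data: same nested layout, hypothetical names and values.
toy_data = {
    "DOE JANE A": {"salary": 100000, "bonus": "NaN", "poi": False},
    "ROE RICHARD": {"salary": 250000, "bonus": 50000, "poi": True},
}

# Outer keys are people; each value is that person's feature dictionary.
for name, features in toy_data.items():
    print(name, sorted(features))

# A single feature is reached with two successive lookups.
roe_bonus = toy_data["ROE RICHARD"]["bonus"]

# Number of people, and number of features for one person.
n_people = len(toy_data)
n_features = len(toy_data["DOE JANE A"])
```

The same two `len` calls are what the cells below apply to the real dataset.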

In [24]:
enron_data['METTS MARK']

{'bonus': 600000,
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'mark.metts@enron.com',
 'exercised_stock_options': 'NaN',
 'expenses': 94299,
 'from_messages': 29,
 'from_poi_to_this_person': 38,
 'from_this_person_to_poi': 1,
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 1740,
 'poi': False,
 'restricted_stock': 585062,
 'restricted_stock_deferred': 'NaN',
 'salary': 365788,
 'shared_receipt_with_poi': 702,
 'to_messages': 807,
 'total_payments': 1061827,
 'total_stock_value': 585062}

How many data points (people) are in the dataset?

In [36]:
len(enron_data)

146

For each person, how many features are available?

In [37]:
len(enron_data['METTS MARK'])

21

It is more practical to work with a pandas DataFrame than with a nested dictionary.

In [59]:
enron_df = pd.DataFrame.from_dict(enron_data, orient='index')

# orient : {'columns', 'index'}, default 'columns'
# The "orientation" of the data. If the keys of the passed dict should be the columns of the
# resulting DataFrame, pass 'columns' (default). Otherwise, if the keys should be rows, pass 'index'.
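To make the effect of `orient` concrete, here is a small sketch that builds both orientations from the same dictionary. `toy_data` and its contents are hypothetical and only mimic the shape of the real dataset.

```python
import pandas as pd

# Hypothetical two-person dict with three features each.
toy_data = {
    "DOE JANE A": {"salary": 100000, "bonus": 20000, "poi": False},
    "ROE RICHARD": {"salary": 250000, "bonus": 50000, "poi": True},
}

# orient="index": outer keys (people) become rows, features become columns.
by_index = pd.DataFrame.from_dict(toy_data, orient="index")

# orient="columns" (the default): outer keys (people) become columns.
by_columns = pd.DataFrame.from_dict(toy_data, orient="columns")

print(by_index.shape)    # people x features
print(by_columns.shape)  # features x people
```

With one row per person, `orient="index"` gives the layout that most later steps (filtering by person, selecting a feature column) expect.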

In [60]:
enron_df.head()

Unnamed: 0,salary,to_messages,deferral_payments,total_payments,loan_advances,bonus,email_address,restricted_stock_deferred,deferred_income,total_stock_value,...,from_poi_to_this_person,exercised_stock_options,from_messages,other,from_this_person_to_poi,poi,long_term_incentive,shared_receipt_with_poi,restricted_stock,director_fees
ALLEN PHILLIP K,201955.0,2902.0,2869717.0,4484442,,4175000.0,phillip.allen@enron.com,-126027.0,-3081055.0,1729541,...,47.0,1729541.0,2195.0,152.0,65.0,False,304805.0,1407.0,126027.0,
BADUM JAMES P,,,178980.0,182466,,,,,,257817,...,,257817.0,,,,False,,,,
BANNANTINE JAMES M,477.0,566.0,,916197,,,james.bannantine@enron.com,-560222.0,-5104.0,5243487,...,39.0,4046157.0,29.0,864523.0,0.0,False,,465.0,1757552.0,
BAXTER JOHN C,267102.0,,1295738.0,5634343,,1200000.0,,,-1386055.0,10623258,...,,6680544.0,,2660303.0,,False,1586055.0,,3942714.0,
BAY FRANKLIN R,239671.0,,260455.0,827696,,400000.0,frank.bay@enron.com,-82782.0,-201641.0,63014,...,,,,69.0,,False,,,145796.0,


Again: For each person, how many features are available?

In [64]:
enron_df.columns

Index(['salary', 'to_messages', 'deferral_payments', 'total_payments',
       'loan_advances', 'bonus', 'email_address', 'restricted_stock_deferred',
       'deferred_income', 'total_stock_value', 'expenses',
       'from_poi_to_this_person', 'exercised_stock_options', 'from_messages',
       'other', 'from_this_person_to_poi', 'poi', 'long_term_incentive',
       'shared_receipt_with_poi', 'restricted_stock', 'director_fees'],
      dtype='object')

In [65]:
len(enron_df.columns)

21

In [66]:
enron_df.shape

(146, 21)

The `poi` feature records whether the person is a person of interest (POI), according to our definition. How many POIs are there in the E+F dataset? In other words, count the number of entries where `data['poi'] == 1`.

In [68]:
sum(enron_df['poi'] == 1)

18
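The count works because `True` is treated as 1 when summed. A minimal sketch of two equivalent ways to count a boolean column, using a made-up frame (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame with a boolean 'poi' column.
toy_df = pd.DataFrame(
    {"poi": [True, False, False, True, False]},
    index=["A", "B", "C", "D", "E"],
)

# Summing a boolean Series counts the True entries.
n_poi = int(toy_df["poi"].sum())

# value_counts gives the counts for both classes at once.
counts = toy_df["poi"].value_counts()
print(n_poi)
print(counts)
```

`value_counts` is handy when the size of the non-POI group matters as well, for example to see how imbalanced the classes are.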

What is the total value of the stock belonging to James Prentice? Note that if Prentice had a middle name, we could not directly look for the index `PRENTICE JAMES`. We are going to suppose that is the case.

In [87]:
enron_df['total_stock_value'].head()

ALLEN PHILLIP K        1729541
BADUM JAMES P           257817
BANNANTINE JAMES M     5243487
BAXTER JOHN C         10623258
BAY FRANKLIN R           63014
Name: total_stock_value, dtype: object
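The `dtype: object` above is consistent with missing values being stored as the string `'NaN'` (as in the `METTS MARK` record) rather than as a numeric NaN. Below is a minimal sketch of one way to convert such columns, assuming a made-up two-person dictionary; `toy_data`, `numeric_cols`, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical records that use the string 'NaN' for missing values.
toy_data = {
    "DOE JANE A": {"salary": 100000, "bonus": "NaN",
                   "email_address": "jane.doe@example.com"},
    "ROE RICHARD": {"salary": "NaN", "bonus": 50000,
                    "email_address": "NaN"},
}
toy_df = pd.DataFrame.from_dict(toy_data, orient="index")

cleaned = toy_df.copy()

# Coerce numeric columns: anything unparseable becomes a real NaN.
numeric_cols = ["salary", "bonus"]
for col in numeric_cols:
    cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")

# For the text column, blank out the placeholder string only.
cleaned["email_address"] = cleaned["email_address"].mask(
    cleaned["email_address"] == "NaN"
)

print(cleaned.dtypes)
```

After a conversion like this, methods such as `isna()`, `mean()`, and `describe()` treat the gaps as missing data instead of as text.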

In [109]:
enron_df.loc['ALLEN PHILLIP K']

salary                                        201955
to_messages                                     2902
deferral_payments                            2869717
total_payments                               4484442
loan_advances                                    NaN
bonus                                        4175000
email_address                phillip.allen@enron.com
restricted_stock_deferred                    -126027
deferred_income                             -3081055
total_stock_value                            1729541
expenses                                       13868
from_poi_to_this_person                           47
exercised_stock_options                      1729541
from_messages                                   2195
other                                            152
from_this_person_to_poi                           65
poi                                            False
long_term_incentive                           304805
shared_receipt_with_poi                       

In [89]:
enron_df.loc['ALLEN PHILLIP K', 'total_stock_value']

1729541

In [91]:
enron_df.index

Index(['ALLEN PHILLIP K', 'BADUM JAMES P', 'BANNANTINE JAMES M',
       'BAXTER JOHN C', 'BAY FRANKLIN R', 'BAZELIDES PHILIP J', 'BECK SALLY W',
       'BELDEN TIMOTHY N', 'BELFER ROBERT', 'BERBERIAN DAVID',
       ...
       'WASAFF GEORGE', 'WESTFAHL RICHARD K', 'WHALEY DAVID A',
       'WHALLEY LAWRENCE G', 'WHITE JR THOMAS E', 'WINOKUR JR. HERBERT S',
       'WODRASKA JOHN', 'WROBEL BRUCE', 'YEAGER F SCOTT', 'YEAP SOON'],
      dtype='object', length=146)

Because we are assuming that we do not know whether the index entry includes a middle initial, we search the index for the surname rather than looking up `PRENTICE JAMES` directly.

In [93]:
# Collect the positional indices of the rows whose name contains the string 'PRENTICE'.

ind_prentice = [i for i, s in enumerate(enron_df.index) if 'PRENTICE' in s]

In [105]:
enron_df.iloc[ind_prentice,:]['total_stock_value']

PRENTICE JAMES    1095040
Name: total_stock_value, dtype: object
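An alternative to the list comprehension is a boolean mask built with the index's string methods. The sketch below is one possible variant, not the approach used in the cells above; `toy_df`, its names, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame indexed by "LASTNAME FIRSTNAME MIDDLEINITIAL".
toy_df = pd.DataFrame(
    {"total_stock_value": [10, 20, 30]},
    index=["DOE JANE A", "ROE RICHARD", "ROEBUCK ALEX B"],
)

# Match on the surname followed by a space, so that "ROE" does not
# also select "ROEBUCK" or a first name that contains "ROE".
mask = toy_df.index.str.startswith("ROE ")
matches = toy_df.loc[mask, "total_stock_value"]
print(matches)
```

Anchoring the match at the start of the name is stricter than a plain substring test, which would also return any row whose name merely contains the search string.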

How many email messages do we have from Wesley Colwell to persons of interest?

In [108]:
enron_data['COLWELL WESLEY']['from_this_person_to_poi']

11