## P5 Machine Learning Mini-Project: Explore Enron Data


"The Enron fraud is a big, messy and totally fascinating story about corporate malfeasance of nearly every imaginable type. The Enron email and financial datasets are also big, messy treasure troves of information, which become much more useful once you know your way around them a bit. We’ve combined the email and finance data into a single dataset, which you’ll explore in this mini-project."

Getting started:
```
Clone this git repository: https://github.com/udacity/ud120-projects
Open the starter code: datasets_questions/explore_enron_data.py
```

In [2]:
#!/usr/bin/python

""" 
    Starter code for exploring the Enron dataset (emails + finances);
    loads up the dataset (pickled dict of dicts).

    The dataset has the form:
    enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }

    {features_dict} is a dictionary of features associated with that person.
    You should explore features_dict as part of the mini-project,
    but here's an example to get you started:

    enron_data["SKILLING JEFFREY K"]["bonus"] = 5600000
    
"""
import pickle

# Open in binary mode and close the file once the pickle is loaded.
with open("../final_project/final_project_dataset.pkl", "rb") as f:
    enron_data = pickle.load(f)

In [3]:
type(enron_data)

dict

The aggregated Enron email + financial dataset is stored in a dictionary, where each key in the dictionary is a person’s name and the value is a dictionary containing all the features of that person.
The email + finance (E+F) data dictionary is stored as a pickle file, which is a handy way to store and load python objects directly. Use datasets_questions/explore_enron_data.py to load the dataset.
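
As a minimal sketch of what storing and loading Python objects directly looks like, the round trip below pickles a tiny dict of dicts to a temporary file and reads it back. The toy record and the temporary file are assumptions for illustration only; the real dataset is the `final_project_dataset.pkl` file mentioned above.

```python
import os
import pickle
import tempfile

# Toy record shaped like the real data (a dict of dicts).
toy_data = {"SKILLING JEFFREY K": {"bonus": 5600000, "poi": True}}

# Create a temporary file path to hold the pickle.
fd, path = tempfile.mkstemp(suffix=".pkl")
os.close(fd)
try:
    with open(path, "wb") as f:   # pickles are written in binary mode
        pickle.dump(toy_data, f)
    with open(path, "rb") as f:   # and read back in binary mode
        restored = pickle.load(f)
finally:
    os.remove(path)               # clean up the temporary file

print(restored == toy_data)
```

The loaded object is an ordinary dict again, so everything that follows (indexing, `len()`, conversion to a DataFrame) works on it unchanged.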

How many data points (people) are in the dataset?
146

In [4]:
len(enron_data)

146

For each person, how many features are available?

In [24]:
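# Count the features of any one person; this form works in Python 2 and 3.
# Equivalent for a named person: len(enron_data['ALLEN PHILLIP K'])
len(next(iter(enron_data.values())))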

21

The “poi” feature records whether the person is a person of interest, according to our definition. How many POIs are there in the E+F dataset?

In [45]:
import pandas as pd

enron = pd.DataFrame(enron_data)
enron = enron.transpose()
sum(enron['poi'])

18

We compiled a list of all POI names (in ../final_project/poi_names.txt) and associated email addresses (in ../final_project/poi_email_addresses.py).

How many POIs were there in total? (Use the names file, not the email addresses, since many folks have more than one address and a few didn’t work for Enron, so we don’t have their emails.)

`35 POIs, according to poi_names.txt`

As you can see, we have many of the POIs in our E+F dataset, but not all of them. Why is that a potential problem?

We will return to this later to explain how a POI could end up not being in the Enron E+F dataset, so you fully understand the issue before moving on.

There are a few things you could say here, but our main thought is about having enough data to really learn the patterns. In general, more data is better: having only 18 POIs in the dataset doesn't give you many examples to learn from.

#### Query the Dataset 1

Like any dict of dicts, individual people/features can be accessed like so:

enron_data["LASTNAME FIRSTNAME"]["feature_name"]
or, sometimes 
enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"]["feature_name"]
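
Because some keys carry a middle initial and others do not, it can help to search for the exact key before indexing. The sketch below is one way to do that on a toy subset; `find_keys` and `get_feature` are hypothetical helpers written for this example, not part of the starter code, and the three values are the ones quoted in this write-up.

```python
# Toy subset of the dataset, using values quoted in this write-up.
sample = {
    "PRENTICE JAMES": {"total_stock_value": 1095040},
    "COLWELL WESLEY": {"from_this_person_to_poi": 11},
    "SKILLING JEFFREY K": {"exercised_stock_options": 19250000},
}

def find_keys(data, last_name):
    """Return the sorted keys whose first token matches last_name (hypothetical helper)."""
    target = last_name.upper()
    return sorted(key for key in data if key.split()[0] == target)

def get_feature(data, name, feature):
    """Look up one feature, returning the string "NaN" when it is absent."""
    return data[name].get(feature, "NaN")

# Find the exact key first, then index into the inner dict.
skilling_key = find_keys(sample, "skilling")[0]
print(skilling_key)
print(get_feature(sample, skilling_key, "exercised_stock_options"))
print(get_feature(sample, "PRENTICE JAMES", "total_stock_value"))
```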

What is the total value of the stock belonging to James Prentice?

In [74]:
enron['total_stock_value'].loc['PRENTICE JAMES']

1095040

#### Query the Dataset 2
How many email messages do we have from Wesley Colwell to persons of interest?
11

In [76]:
#enron.columns
enron['from_this_person_to_poi'].loc['COLWELL WESLEY']


11

#### Query the Dataset 3
What’s the value of stock options exercised by Jeffrey K Skilling?

19250000

In [77]:
enron['exercised_stock_options'].loc['SKILLING JEFFREY K']

19250000

### Research the Enron Fraud
In the coming lessons, we’ll talk about how the best features are often motivated by our human understanding of the problem at hand. In this case, that means knowing a little about the story of the Enron fraud.

If you have an hour and a half to spare, “Enron: The Smartest Guys in the Room” is a documentary that gives an amazing overview of the story. Alternatively, there are plenty of archival newspaper stories that chronicle the rise and fall of Enron.

Which of these schemes was Enron not involved in?

* selling assets to shell companies at the end of each month, and buying them back at the beginning of the next month to hide accounting losses
* causing electrical grid failures in California
* illegally obtaining a government report that enabled them to corner the market on frozen concentrated orange juice futures
* conspiring to give a Saudi prince expedited American citizenship
* a plan in collaboration with Blockbuster movies to stream movies over the internet


Of these three individuals (Lay, Skilling and Fastow), who took home the most money (largest value of “total_payments” feature)?

How much money did that person get?
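
One way to answer this kind of question without scanning a sorted listing is to restrict the comparison to the people of interest. The sketch below assumes a hypothetical helper, `top_by_feature`, and a toy dict built from `total_payments` values that appear in this section's sorted output. It also illustrates why restricting the names matters: the `TOTAL` entry, which appears to be an aggregate row rather than a person, would otherwise come out on top.

```python
def top_by_feature(data, feature, names=None):
    """Return (name, value) for the largest non-missing value of `feature`.

    If `names` is given, only those people are compared. Hypothetical helper.
    """
    candidates = data if names is None else names
    best = None
    for name in candidates:
        value = data[name].get(feature, "NaN")
        if value == "NaN":          # skip missing values
            continue
        if best is None or value > best[1]:
            best = (name, value)
    return best

# Toy subset; values are taken from the sorted total_payments output.
sample = {
    "LAY KENNETH L": {"total_payments": 103559793},
    "SKILLING JEFFREY K": {"total_payments": 8682716},
    "LOCKHART EUGENE E": {"total_payments": "NaN"},
    "TOTAL": {"total_payments": 309886585},
}

unrestricted = top_by_feature(sample, "total_payments")
restricted = top_by_feature(
    sample, "total_payments", names=["LAY KENNETH L", "SKILLING JEFFREY K"]
)
print(unrestricted)
print(restricted)
```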

In [79]:
eso = enron["exercised_stock_options"]

In [105]:
enron.loc['SKILLING JEFFREY K']

bonus                                        5600000
deferral_payments                                NaN
deferred_income                                  NaN
director_fees                                    NaN
email_address                jeff.skilling@enron.com
exercised_stock_options                     19250000
expenses                                       29336
from_messages                                    108
from_poi_to_this_person                           88
from_this_person_to_poi                           30
loan_advances                                    NaN
long_term_incentive                          1920000
other                                          22122
poi                                             True
restricted_stock                             6843672
restricted_stock_deferred                        NaN
salary                                       1111258
shared_receipt_with_poi                         2042
to_messages                                   

In [103]:
enron.loc['FASTOW ANDREW S']

bonus                                        1300000
deferral_payments                                NaN
deferred_income                             -1386055
director_fees                                    NaN
email_address                andrew.fastow@enron.com
exercised_stock_options                          NaN
expenses                                       55921
from_messages                                    NaN
from_poi_to_this_person                          NaN
from_this_person_to_poi                          NaN
loan_advances                                    NaN
long_term_incentive                          1736055
other                                         277464
poi                                             True
restricted_stock                             1794412
restricted_stock_deferred                        NaN
salary                                        440698
shared_receipt_with_poi                          NaN
to_messages                                   

In [167]:
enron.loc['LAY KENNETH L']

bonus                                      7000000
deferral_payments                           202911
deferred_income                            -300000
director_fees                                  NaN
email_address                kenneth.lay@enron.com
exercised_stock_options                   34348384
expenses                                     99832
from_messages                                   36
from_poi_to_this_person                        123
from_this_person_to_poi                         16
loan_advances                             81525000
long_term_incentive                        3600000
other                                     10359729
poi                                           True
restricted_stock                          14761694
restricted_stock_deferred                      NaN
salary                                     1072321
shared_receipt_with_poi                       2411
to_messages                                   4273
total_payments                 

In [97]:
for i in eso.index:
    print(i)

ALLEN PHILLIP K
BADUM JAMES P
BANNANTINE JAMES M
BAXTER JOHN C
BAY FRANKLIN R
BAZELIDES PHILIP J
BECK SALLY W
BELDEN TIMOTHY N
BELFER ROBERT
BERBERIAN DAVID
BERGSIEKER RICHARD P
BHATNAGAR SANJAY
BIBI PHILIPPE A
BLACHMAN JEREMY M
BLAKE JR. NORMAN P
BOWEN JR RAYMOND M
BROWN MICHAEL
BUCHANAN HAROLD G
BUTTS ROBERT H
BUY RICHARD B
CALGER CHRISTOPHER F
CARTER REBECCA C
CAUSEY RICHARD A
CHAN RONNIE
CHRISTODOULOU DIOMEDES
CLINE KENNETH W
COLWELL WESLEY
CORDES WILLIAM R
COX DAVID
CUMBERLAND MICHAEL S
DEFFNER JOSEPH M
DELAINEY DAVID W
DERRICK JR. JAMES V
DETMERING TIMOTHY J
DIETRICH JANET R
DIMICHELE RICHARD G
DODSON KEITH
DONAHUE JR JEFFREY M
DUNCAN JOHN H
DURAN WILLIAM D
ECHOLS JOHN B
ELLIOTT STEVEN
FALLON JAMES B
FASTOW ANDREW S
FITZGERALD JAY L
FOWLER PEGGY
FOY JOE
FREVERT MARK A
FUGH JOHN L
GAHN ROBERT S
GARLAND C KEVIN
GATHMANN WILLIAM D
GIBBS DANA R
GILLIS JOHN
GLISAN JR BEN F
GOLD JOSEPH
GRAMM WENDY L
GRAY RODNEY
HAEDICKE MARK E
HANNON KEVIN P
HAUG DAVID L
HAYES ROBERT E
HAYSLETT RODERIC

In [166]:
enron.sort_values(by = 'total_payments',ascending =False,na_position='last')['total_payments']

SHERRICK JEFFREY B               NaN
MCDONALD REBECCA                 NaN
MORAN MICHAEL P                  NaN
MCCARTY DANNY J                  NaN
CORDES WILLIAM R                 NaN
PIRO JIM                         NaN
CLINE KENNETH W                  NaN
CHRISTODOULOU DIOMEDES           NaN
CHAN RONNIE                      NaN
LOWRY CHARLES P                  NaN
POWERS WILLIAM                   NaN
SCRIMSHAW MATTHEW                NaN
LOCKHART EUGENE E                NaN
FOWLER PEGGY                     NaN
LEWIS RICHARD                    NaN
HAYSLETT RODERICK J              NaN
GATHMANN WILLIAM D               NaN
GILLIS JOHN                      NaN
WHALEY DAVID A                   NaN
HUGHES JAMES A                   NaN
WROBEL BRUCE                     NaN
TOTAL                      309886585
LAY KENNETH L              103559793
FREVERT MARK A              17252530
BHATNAGAR SANJAY            15456290
LAVORATO JOHN J             10425757
SKILLING JEFFREY K           8682716
M

How many folks in this dataset have a quantified salary? What about a known email address?

In [164]:
(enron['email_address']!='NaN').sum()

111

In [165]:
(enron['salary']!= 'NaN').sum()

95
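
The comparisons above work because missing values in this dataset are stored as the string `"NaN"`, not as a floating-point NaN. The same count can be taken straight from the dict of dicts, as in the sketch below. `count_known` is a hypothetical helper, and `DOE JANE` is an invented person added only to give the toy data a missing salary; the other values come from the records shown earlier.

```python
def count_known(data, feature):
    """Count people whose `feature` is present and not the string "NaN"."""
    return sum(
        1 for person in data.values()
        if person.get(feature, "NaN") != "NaN"
    )

# Toy data: three real records quoted above plus one invented person.
sample = {
    "SKILLING JEFFREY K": {"salary": 1111258,
                           "email_address": "jeff.skilling@enron.com"},
    "FASTOW ANDREW S": {"salary": 440698,
                        "email_address": "andrew.fastow@enron.com"},
    "LAY KENNETH L": {"salary": 1072321,
                      "email_address": "kenneth.lay@enron.com"},
    # Hypothetical entry with a missing salary.
    "DOE JANE": {"salary": "NaN",
                 "email_address": "jane.doe@example.com"},
}

known_salaries = count_known(sample, "salary")
known_emails = count_known(sample, "email_address")
print(known_salaries, known_emails)
```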

A Python dictionary can’t be read directly into an sklearn classification or regression algorithm. Instead, sklearn needs a numpy array or a list of lists, where each inner list is one data point and its elements are the features of that point.

We’ve written some helper functions (featureFormat() and targetFeatureSplit() in tools/feature_format.py) that can take a list of feature names and the data dictionary, and return a numpy array.

In the case when a feature does not have a value for a particular person, this function will also replace the feature value with 0 (zero).
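
To make the conversion concrete, here is a simplified sketch of the idea. It is not the actual code in tools/feature_format.py: `to_feature_rows` and `split_target` are hypothetical stand-ins, and treating the first column as the target is a convention assumed for this example. `DOE JANE` is an invented person with missing financial data.

```python
def to_feature_rows(data, feature_names):
    """Convert a dict of dicts into a list of lists, one row per person.

    Missing values (the string "NaN" or an absent key) become 0.0.
    """
    rows = []
    for name in sorted(data):            # sorted keys give a stable row order
        person = data[name]
        row = []
        for feature in feature_names:
            value = person.get(feature, "NaN")
            row.append(0.0 if value == "NaN" else float(value))
        rows.append(row)
    return rows

def split_target(rows):
    """Split each row into its first column (target) and the rest (features)."""
    targets = [row[0] for row in rows]
    features = [row[1:] for row in rows]
    return targets, features

sample = {
    "SKILLING JEFFREY K": {"poi": True, "salary": 1111258, "bonus": 5600000},
    "FASTOW ANDREW S": {"poi": True, "salary": 440698, "bonus": 1300000},
    # Hypothetical non-POI with no financial data.
    "DOE JANE": {"poi": False, "salary": "NaN", "bonus": "NaN"},
}

rows = to_feature_rows(sample, ["poi", "salary", "bonus"])
targets, features = split_target(rows)
print(targets)
print(features)
```

Note that after this step a missing value and a genuine zero are indistinguishable, which is worth keeping in mind when interpreting features with many missing entries.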

#### Missing POIs 1 (optional)
As you saw a little while ago, not every POI has an entry in the dataset (e.g. Michael Krautz). That’s because the dataset was created using the financial data you can find in final_project/enron61702insiderpay.pdf, which is missing some POIs (those absences propagated through to the final dataset). On the other hand, for many of these “missing” POIs, we do have emails.

While it would be straightforward to add these POIs and their email information to the E+F dataset, and just put “NaN” for their financial information, this could introduce a subtle problem. You will walk through that here.

How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments? What percentage of people in the dataset as a whole is this?

In [171]:
(enron['total_payments'] == 'NaN').sum() / float(len(enron))

0.14383561643835616

Adding in the new POIs in this example, none of whom we have financial information for, introduces a subtle problem: an algorithm can pick up our lack of financial information about them as a clue that they’re POIs. Another way to think about this is that there’s now a difference in how we generated the data for our two classes: non-POIs all come from the financial spreadsheet, while many POIs get added in by hand afterwards. That difference can trick us into thinking we have better performance than we do. Suppose you use your POI detector to decide whether a new, unseen person is a POI, and that person isn’t on the spreadsheet. Then all their financial data would be “NaN”, but the person is very likely not a POI (there are many more non-POIs than POIs in the world, and even at Enron). You’d be likely to accidentally identify them as a POI anyway!
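
The effect can be made visible by comparing how often a feature is missing within each class before and after such an augmentation. The sketch below uses entirely made-up people and numbers, and `nan_rate_by_class` is a hypothetical helper. It only illustrates the mechanism and says nothing about the real rates in the E+F dataset.

```python
def nan_rate_by_class(data, feature):
    """Return {True: rate, False: rate}: the fraction of POIs and of
    non-POIs whose `feature` is the string "NaN"."""
    counts = {True: [0, 0], False: [0, 0]}   # label -> [missing, total]
    for person in data.values():
        label = bool(person["poi"])
        counts[label][1] += 1
        if person.get(feature, "NaN") == "NaN":
            counts[label][0] += 1
    return {
        label: (missing / float(total) if total else 0.0)
        for label, (missing, total) in counts.items()
    }

# Made-up dataset: every POI has a value, one of four non-POIs does not.
toy = {
    "POI 1": {"poi": True, "total_payments": 100},
    "POI 2": {"poi": True, "total_payments": 200},
    "NON-POI 1": {"poi": False, "total_payments": 300},
    "NON-POI 2": {"poi": False, "total_payments": 400},
    "NON-POI 3": {"poi": False, "total_payments": 500},
    "NON-POI 4": {"poi": False, "total_payments": "NaN"},
}
before = nan_rate_by_class(toy, "total_payments")

# Add two hypothetical POIs by hand, with no financial information.
augmented = dict(toy)
augmented["ADDED POI 1"] = {"poi": True, "total_payments": "NaN"}
augmented["ADDED POI 2"] = {"poi": True, "total_payments": "NaN"}
after = nan_rate_by_class(augmented, "total_payments")

print(before)
print(after)
```

In this toy case the missing rate among POIs jumps from 0.0 to 0.5 while the non-POI rate stays at 0.25, so "missing financial data" starts to look like evidence of being a POI purely because of how the rows were added.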

This goes to say that, when generating or augmenting a dataset, you should be exceptionally careful if your data are coming from different sources for different classes. It can easily lead to the type of bias or mistake that we showed here. There are ways to deal with this. For example, you wouldn’t have to worry about this problem if you used only email data: in that case, discrepancies in the financial data wouldn’t matter because financial features aren’t being used. There are also more sophisticated ways of estimating how much of an effect these biases can have on your final answer; those are beyond the scope of this course.

For now, the takeaway message is to be very careful about introducing features that come from different sources depending on the class! It’s a classic way to accidentally introduce biases and mistakes.