Starter code for exploring the Enron dataset (emails + finances);
loads up the dataset (pickled dict of dicts).

The dataset has the form:
enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }

{features_dict} is a dictionary of features associated with that person.
You should explore features_dict as part of the mini-project,
but here's an example to get you started:

enron_data["SKILLING JEFFREY K"]["bonus"] = 5600000

In [5]:
import pickle

In [6]:
with open("../final_project/final_project_dataset.pkl", "rb") as f:
    enron_data = pickle.load(f)

The aggregated Enron email + financial dataset is stored in a dictionary, where each key in the dictionary is a person’s name and the value is a dictionary containing all the features of that person.
The email + finance (E+F) data dictionary is stored as a pickle file, which is a handy way to store and load Python objects directly. Use datasets_questions/explore_enron_data.py to load the dataset.
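To make the "pickled dict of dicts" idea concrete, here is a minimal sketch that writes a tiny stand-in dictionary to a temporary pickle file and reads it back. The toy entries (including the hypothetical person "DOE JANE") and the temporary path are assumptions for illustration only; they are not part of the real dataset.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real dataset: a dict of dicts keyed by person name.
toy_data = {
    "SKILLING JEFFREY K": {"bonus": 5600000, "poi": True},
    "DOE JANE": {"bonus": "NaN", "poi": False},  # hypothetical person
}

# Write the toy dictionary to a temporary pickle file (binary mode).
tmp_path = os.path.join(tempfile.mkdtemp(), "toy_dataset.pkl")
with open(tmp_path, "wb") as f:
    pickle.dump(toy_data, f)

# Read it back; pickle files should be opened in binary mode ("rb").
with open(tmp_path, "rb") as f:
    loaded = pickle.load(f)

print(len(loaded))
print(loaded["SKILLING JEFFREY K"]["bonus"])
```

Opening the file inside a `with` block also makes sure it is closed again once loading is done.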

## How many data points (people) are in the dataset?

In [11]:
print(len(enron_data.keys()))

146


## For each person, how many features are available?

In [19]:
len(enron_data['ALLEN PHILLIP K'].keys())

21

The “poi” feature records whether the person is a person of interest, according to our definition. 
## How many POIs are there in the E+F dataset?

In other words, count the number of entries in the dictionary where
data[person_name]["poi"]==1

In [78]:
poi = sum(enron_data[person]['poi'] == 1 for person in enron_data)
print('Person of Interest in E+F dataset: {}'.format(poi))

Person of Interest in E+F dataset: 18


We compiled a list of all POI names (in ../final_project/poi_names.txt) and associated email addresses (in ../final_project/poi_email_addresses.py).

## How many POIs were there in total?

(Use the names file, not the email addresses, since many folks have more than one address and a few didn’t work for Enron, so we don’t have their emails.)

In [79]:
filepath = '../final_project/poi_names.txt'
with open(filepath) as names_file:
    lines = names_file.read().split('\n')

# Skip the first two lines (header) and the last entry
# (empty, assuming the file ends with a newline).
persons = lines[2:-1]
poi = len(persons)
print('Person of Interest: {}'.format(poi))

Person of Interest: 35


## Problems with Incomplete data.

As you can see, we have many of the POIs in our E+F dataset, but not all of them. 

## Why is that a potential problem?

We will return to this later to explain how a POI could end up not being in the Enron E+F dataset, so you fully understand the issue before moving on.

Thanks for completing that! There are a few things you could say here, but our main thought is about having enough data to really learn the patterns. In general, more data is better; having only 18 data points doesn't give you many examples to learn from.

## Query the dataset 1

Like any dict of dicts, individual people/features can be accessed like so:

enron_data["LASTNAME FIRSTNAME"]["feature_name"]
or, sometimes 
enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"]["feature_name"]

What is the total value of the stock belonging to James Prentice?

##### Notes:
Lastname, Firstname and Middle Initial all in CAPS.

In [95]:
james_prentice_total_stock_value = enron_data['PRENTICE JAMES']['total_stock_value']
print('Stock value of James Prentice: {}'.format(james_prentice_total_stock_value))


Stock value of James Prentice: 1095040


## Query the dataset 2

Like any dict of dicts, individual people/features can be accessed like so:

enron_data["LASTNAME FIRSTNAME"]["feature_name"]

### How many email messages do we have from Wesley Colwell to persons of interest?

In [101]:
enron_data['COLWELL WESLEY']['from_this_person_to_poi']

11

## Query the dataset 3

Like any dict of dicts, individual people/features can be accessed like so:

enron_data["LASTNAME FIRSTNAME"]["feature_name"]

or

enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"]["feature_name"]

### What’s the value of stock options exercised by Jeffrey K Skilling?

In [105]:
enron_data['SKILLING JEFFREY K']['exercised_stock_options']

19250000

## Research the Enron Fraud

In the coming lessons, we’ll talk about how the best features are often motivated by our human understanding of the problem at hand. In this case, that means knowing a little about the story of the Enron fraud.

If you have an hour and a half to spare, “Enron: The Smartest Guys in the Room” is a documentary that gives an amazing overview of the story. Alternatively, there are plenty of archival newspaper stories that chronicle the rise and fall of Enron.

Which of these schemes was Enron not involved in?

```
a. selling assets to shell companies at the end of each month, and buying them back at the beginning of the next month to hide accounting losses
b. causing electrical grid failures in California
c. illegally obtaining a government report that enabled them to corner the market on frozen concentrated orange juice futures
d. conspiring to give a Saudi prince expedited American citizenship
e. a plan in collaboration with Blockbuster movies to stream movies over the internet
```

Answer: c and d

## Who was the CEO of Enron during most of the time that fraud was being perpetrated?

Answer: Jeffrey Skilling

Enron Chairman

## Who was chairman of the Enron board of directors?

Answer: Kenneth Lay

## Who was CFO (chief financial officer) of Enron during most of the time that fraud was going on?

Answer: Andrew Fastow

Unfilled Features

## For nearly every person in the dataset, not every feature has a value. How is it denoted when a feature doesn’t have a well-defined value?

Answer: NaN (stored as the string "NaN")
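As a rough sketch of how you might check which features are missing most often, the snippet below counts "NaN" values per feature. It runs on a small made-up dictionary (the people and values are hypothetical), so the counts say nothing about the real dataset.

```python
from collections import Counter

# Hypothetical toy data in the same dict-of-dicts shape as the E+F dataset.
toy_data = {
    "PERSON A": {"salary": 100000, "email_address": "a@example.com", "bonus": "NaN"},
    "PERSON B": {"salary": "NaN", "email_address": "NaN", "bonus": "NaN"},
    "PERSON C": {"salary": 250000, "email_address": "c@example.com", "bonus": 5000},
}

# Count, for every feature, how many people have the placeholder string "NaN".
missing = Counter()
for features in toy_data.values():
    for name, value in features.items():
        if value == "NaN":
            missing[name] += 1

for name in sorted(missing):
    print("{}: {} missing".format(name, missing[name]))
```

The same loop should work on `enron_data` once it is loaded, since it only relies on the dict-of-dicts layout.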

Dealing with Unfilled Features

## How many folks in this dataset have a quantified salary? What about a known email address?



In [119]:
quantified_salaries = sum(person['salary'] != 'NaN' for person in enron_data.values())
known_email_address = sum(person['email_address'] != 'NaN' for person in enron_data.values())
print("Number of quantified salary: {}".format(quantified_salaries))
print("Number of known_email_address: {}".format(known_email_address))

Number of quantified salary: 95
Number of known_email_address: 111


## Dict-to-array Conversion

A Python dictionary can’t be read directly into an sklearn classification or regression algorithm; instead, it needs a numpy array or a list of lists, where each inner list is one data point and its elements are the features of that point.

We’ve written some helper functions (featureFormat() and targetFeatureSplit() in tools/feature_format.py) that can take a list of feature names and the data dictionary, and return a numpy array.

In the case when a feature does not have a value for a particular person, this function will also replace the feature value with 0 (zero).
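To illustrate the idea, here is a minimal sketch of a dict-to-rows conversion that replaces "NaN" with 0. It is not the actual code in tools/feature_format.py: the function name `to_feature_rows`, the plain list-of-lists output, and the toy data are all assumptions made for this example.

```python
def to_feature_rows(data_dict, feature_names):
    """Hypothetical helper: one row per person, with "NaN" replaced by 0.0."""
    rows = []
    for person in sorted(data_dict):  # sorted for a reproducible row order
        row = []
        for name in feature_names:
            value = data_dict[person][name]
            row.append(0.0 if value == "NaN" else float(value))
        rows.append(row)
    return rows


# Hypothetical toy data in the same shape as the E+F dataset.
toy_data = {
    "PERSON A": {"poi": True, "salary": 100000, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": 5000},
}

rows = to_feature_rows(toy_data, ["poi", "salary", "bonus"])

# Treat the first column as the label and the rest as features.
labels = [row[0] for row in rows]
features = [row[1:] for row in rows]

print(labels)
print(features)
```

Putting the label first in the feature list makes it easy to split each row into a target value and the remaining features.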

Missing POIs 1 (optional)

As you saw a little while ago, not every POI has an entry in the dataset (e.g. Michael Krautz). That’s because the dataset was created using the financial data you can find in final_project/enron61702insiderpay.pdf, which is missing some POIs (those absences propagated through to the final dataset). On the other hand, for many of these “missing” POIs, we do have emails.

While it would be straightforward to add these POIs and their email information to the E+F dataset, and just put “NaN” for their financial information, this could introduce a subtle problem. You will walk through that here.
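One way to picture the kind of problem this might cause: if every newly added POI has "NaN" financials, the share of POIs with missing values jumps, and "NaN" itself can start to look like a clue for being a POI. The sketch below uses made-up people and numbers (all hypothetical) and a hypothetical helper `nan_rate`; it only illustrates the effect and says nothing about the real dataset.

```python
def nan_rate(data_dict, poi_flag):
    """Fraction of people with the given poi flag whose total_payments is "NaN"."""
    group = [p for p in data_dict.values() if p["poi"] == poi_flag]
    missing = sum(1 for p in group if p["total_payments"] == "NaN")
    return missing / float(len(group))


# Hypothetical toy dataset: two POIs and two non-POIs.
toy_data = {
    "POI 1": {"poi": True, "total_payments": 1000},
    "POI 2": {"poi": True, "total_payments": 2000},
    "NON-POI 1": {"poi": False, "total_payments": 500},
    "NON-POI 2": {"poi": False, "total_payments": "NaN"},
}
before = nan_rate(toy_data, True)

# Add two more POIs that have emails but no financial data.
augmented = dict(toy_data)
augmented["POI 3"] = {"poi": True, "total_payments": "NaN"}
augmented["POI 4"] = {"poi": True, "total_payments": "NaN"}
after = nan_rate(augmented, True)

print("POI NaN rate before: {:.2f}".format(before))
print("POI NaN rate after: {:.2f}".format(after))
```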

## How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments? What percentage of people in the dataset as a whole is this?

In [154]:
number_of_people_with_nan_total_payments = sum(person['total_payments'] == 'NaN' for person in enron_data.values())
print("number_of_people_with_nan_total_payments: {}".format(number_of_people_with_nan_total_payments))

people = len(enron_data)
print('total people: {}'.format(people))
# Multiply by 100.0 so the division is floating-point even under Python 2.
print('Percentage of people: {:.1f}%'.format(100.0 * number_of_people_with_nan_total_payments / people))

number_of_people_with_nan_total_payments: 21
total people: 146
Percentage of people: 14.4%


## How many POIs in the E+F dataset have “NaN” for their total payments? What percentage of POIs as a whole is this?