# Connect Intensive - Machine Learning Nanodegree
# Lesson 2: Datasets and Questions - Enron Dataset Mini-Project

## Objectives
  - Work through the Datasets and Questions lesson from [ud-120](https://www.udacity.com/course/intro-to-machine-learning--ud120) within the Jupyter Notebook environment
  - Introduce [the `pickle` module](https://docs.python.org/2/library/pickle.html) from the Python Standard Library for preserving data structures in Python
  - Practice loading data from different directories
  - Review stacking and unstacking in `pandas` to [reshape a `DataFrame`](http://pandas.pydata.org/pandas-docs/stable/reshaping.html)
  - Explore and understand data sets, particularly those with many missing values
  
## Prerequisites
  - You should be familiar with `pandas` (lesson 1 in the [UConnect-Intensive repo](https://github.com/ryanjferrin/UConnect-Intensive))

## Background

The Enron fraud is a big, messy and totally fascinating story about corporate malfeasance of nearly every imaginable type. The Enron email and financial datasets are also big, messy treasure troves of information, which become much more useful once you know your way around them a bit. The Udacity team has combined the email and finance data into a single dataset, which you’ll explore in this mini-project.
 
 
In the online course, [Katie Malone](http://blog.udacity.com/2016/04/women-in-machine-learning-katie-malone.html) discusses the Enron scandal, introduces the Enron email corpus, and defines and identifies a person of interest (POI).

**POI**: Indicted, settled without admitting guilt, testified in exchange for immunity.

**Goal**: Use machine learning to answer the question:  

**"Can we identify patterns in the emails of people who were POIs?"**




## Getting Started
The Datasets and Questions lesson in the Data Modeling section is based on Enron finance and email data. These datasets are located in the [**ud120-projects** repo](https://github.com/udacity/ud120-projects) on GitHub. Please clone the **ud120-projects** repo to your local machine. 

## Pickle
[The `pickle` module](https://docs.python.org/2/library/pickle.html) is a fast, efficient way to preserve (or pickle) data structures (*e.g.* dictionaries, lists, tuples, sets), without having to structure or organize your output file yourself. `pickle` knows how to serialize (or write) those structures to a file. You can un-pickle the data structures using `pickle.load()`.


More details regarding `pickle` can be found in [this reference on Serializing Python Objects](http://www.diveintopython3.net/serializing.html) from Dive Into Python 3.
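
As a quick illustration of pickling and un-pickling, here is a minimal sketch that round-trips a small dictionary through an in-memory buffer. The dictionary contents and the names `record`, `buffer`, and `restored` are made up for this example; a real file opened in binary mode would be used the same way.

```python
import io
import pickle

# Toy dictionary shaped like one entry of a person -> features mapping (values are made up)
record = {"PERSON A": {"salary": 100, "poi": False}}

# Serialize into an in-memory binary buffer; a file opened with "wb" works the same way
buffer = io.BytesIO()
pickle.dump(record, buffer)

# Rewind the buffer and un-pickle the data structure
buffer.seek(0)
restored = pickle.load(buffer)

print(restored == record)
```

The restored object is a new dictionary with the same contents as the original.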

**Run** the cell below to import the `pickle` module. (**shift + enter** or **shift + return**)

In [1]:
try:
    import pickle
    print("Successfully imported pickle!")
except ImportError:
    print("Could not import pickle")

Successfully imported pickle!


## Update working directory 

The [Magic Function](http://ipython.readthedocs.io/en/stable/interactive/tutorial.html#magics-explained) `%cd "..."` changes the working directory in the Jupyter Notebook.

**Update** the Magic Function `%cd "..."` in the cell below to indicate the correct path of the **ud120-projects** directory on your local machine.

Then **run** the cell below to load the Enron data!

In [2]:
# Ensure that you write the full path, up to and including "ud120-projects"
%cd "C:/Users/Ryan/Desktop/ud120-projects" 

try:
    # Pickle files should be opened in binary mode ("rb"); Python 3 requires it
    with open("final_project/final_project_dataset.pkl", "rb") as f:
        enron_data = pickle.load(f)
    print("Enron dataset loaded successfully!")
except IOError:
    print("Enron dataset not loaded")

C:\Users\Ryan\Desktop\ud120-projects
Enron dataset loaded successfully!


## Transforming data from Dictionary to DataFrame
The variable `enron_data` is a dictionary object. Dictionaries are not displayed as nicely as `pandas` `DataFrame` objects within the Jupyter Notebook environment. Let's convert `enron_data` to a `DataFrame` object. 

Use either [the method `pandas.DataFrame.from_dict()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html) or [the constructor `pandas.DataFrame()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html); both work here.
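
To see how the two options relate, here is a small sketch on a made-up nested dictionary (`toy` and its values are assumptions, not the Enron data). It also shows the `orient="index"` argument of `from_dict()`, which places the outer keys in the rows instead of the columns.

```python
import pandas as pd

# Made-up nested dictionary: outer keys are people, inner keys are features
toy = {
    "PERSON A": {"salary": 100, "bonus": 10},
    "PERSON B": {"salary": 200, "bonus": 20},
}

# Both calls put the outer keys (people) in the columns
by_constructor = pd.DataFrame(toy)
by_from_dict = pd.DataFrame.from_dict(toy)

# orient="index" puts the outer keys (people) in the rows instead
by_index = pd.DataFrame.from_dict(toy, orient="index")

print(by_constructor)
print(by_index)
```

The notebook keeps the default orientation so that the reshaping step later on can be practiced.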

**Run** the cell below to:
  - import `pandas` and `display`.
  - set some display options.
  - create a `DataFrame` object for the Enron data.
  - display the Enron data.

In [3]:
try:
    import pandas as pd
    print("Successfully imported pandas! (Version {})".format(pd.__version__))
    pd.options.display.max_rows = 10
except ImportError:
    print("Could not import pandas!")

try:
    from IPython.display import display
    print("Successfully imported display from IPython.display!")
except ImportError:
    print("Could not import display from IPython.display")
    
enron_df = pd.DataFrame.from_dict(enron_data)
display(enron_df)

Successfully imported pandas! (Version 0.19.2)
Successfully imported display from IPython.display!


Unnamed: 0,ALLEN PHILLIP K,BADUM JAMES P,BANNANTINE JAMES M,BAXTER JOHN C,BAY FRANKLIN R,BAZELIDES PHILIP J,BECK SALLY W,BELDEN TIMOTHY N,BELFER ROBERT,BERBERIAN DAVID,...,WASAFF GEORGE,WESTFAHL RICHARD K,WHALEY DAVID A,WHALLEY LAWRENCE G,WHITE JR THOMAS E,WINOKUR JR. HERBERT S,WODRASKA JOHN,WROBEL BRUCE,YEAGER F SCOTT,YEAP SOON
bonus,4175000,,,1200000,400000,,700000,5249999,,,...,325000,,,3000000,450000,,,,,
deferral_payments,2869717,178980,,1295738,260455,684694,,2144013,-102500,,...,831299,,,,,,,,,
deferred_income,-3081055,,-5104,-1386055,-201641,,,-2334434,,,...,-583325,-10800,,,,-25000,,,,
director_fees,,,,,,,,,3285,,...,,,,,,108579,,,,
email_address,phillip.allen@enron.com,,james.bannantine@enron.com,,frank.bay@enron.com,,sally.beck@enron.com,tim.belden@enron.com,,david.berberian@enron.com,...,george.wasaff@enron.com,dick.westfahl@enron.com,,greg.whalley@enron.com,thomas.white@enron.com,,john.wodraska@enron.com,,scott.yeager@enron.com,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
salary,201955,,477,267102,239671,80818,231330,213999,,216582,...,259996,63744,,510364,317543,,,,158403,
shared_receipt_with_poi,1407,,465,,,,2639,5521,,,...,337,,,3920,,,,,,
to_messages,2902,,566,,,,7315,7991,,,...,400,,,6019,,,,,,
total_payments,4484442,182466,916197,5634343,827696,860136,969068,5501630,102500,228474,...,1034395,762135,,4677574,1934359,84992,189583,,360300,55097


## Stacking, unstacking, and rearranging

The data is not yet in the conventional layout: the variables (features) should be in the columns and the unique instances (people) should be in the rows.

The following methods can be used to resolve this issue.

- [`pandas.DataFrame.stack()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.stack.html) 
- [`pandas.DataFrame.unstack()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html)

First, we need to `stack` the current column indices, moving them to the innermost level of the row index.

**Run** the cell below to see the results of calling `enron_df.stack()`


In [4]:
enron_df.stack()

bonus              ALLEN PHILLIP K           4175000
                   BADUM JAMES P                 NaN
                   BANNANTINE JAMES M            NaN
                   BAXTER JOHN C             1200000
                   BAY FRANKLIN R             400000
                                              ...   
total_stock_value  WINOKUR JR. HERBERT S         NaN
                   WODRASKA JOHN                 NaN
                   WROBEL BRUCE               139130
                   YEAGER F SCOTT           11884758
                   YEAP SOON                  192758
dtype: object

We see that the result of `enron_df.stack()` is a `Series` object, where the innermost (rightmost) level of the index is the person's name in the Enron data set, while the outermost (leftmost) level of the index is the feature. If we call `unstack()` on the resulting `Series` without specifying a level, we'll just revert to the original `DataFrame`.

**Run** the cell below to see the result of calling `enron_df.stack().unstack()`


In [5]:
enron_df.stack().unstack()

Unnamed: 0,ALLEN PHILLIP K,BADUM JAMES P,BANNANTINE JAMES M,BAXTER JOHN C,BAY FRANKLIN R,BAZELIDES PHILIP J,BECK SALLY W,BELDEN TIMOTHY N,BELFER ROBERT,BERBERIAN DAVID,...,WASAFF GEORGE,WESTFAHL RICHARD K,WHALEY DAVID A,WHALLEY LAWRENCE G,WHITE JR THOMAS E,WINOKUR JR. HERBERT S,WODRASKA JOHN,WROBEL BRUCE,YEAGER F SCOTT,YEAP SOON
bonus,4175000,,,1200000,400000,,700000,5249999,,,...,325000,,,3000000,450000,,,,,
deferral_payments,2869717,178980,,1295738,260455,684694,,2144013,-102500,,...,831299,,,,,,,,,
deferred_income,-3081055,,-5104,-1386055,-201641,,,-2334434,,,...,-583325,-10800,,,,-25000,,,,
director_fees,,,,,,,,,3285,,...,,,,,,108579,,,,
email_address,phillip.allen@enron.com,,james.bannantine@enron.com,,frank.bay@enron.com,,sally.beck@enron.com,tim.belden@enron.com,,david.berberian@enron.com,...,george.wasaff@enron.com,dick.westfahl@enron.com,,greg.whalley@enron.com,thomas.white@enron.com,,john.wodraska@enron.com,,scott.yeager@enron.com,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
salary,201955,,477,267102,239671,80818,231330,213999,,216582,...,259996,63744,,510364,317543,,,,158403,
shared_receipt_with_poi,1407,,465,,,,2639,5521,,,...,337,,,3920,,,,,,
to_messages,2902,,566,,,,7315,7991,,,...,400,,,6019,,,,,,
total_payments,4484442,182466,916197,5634343,827696,860136,969068,5501630,102500,228474,...,1034395,762135,,4677574,1934359,84992,189583,,360300,55097


We need to `unstack` the *outermost* level of the index, but by default, the function will `unstack` the *innermost* level of the index.
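
The difference between the two calls can be sketched on a tiny made-up `DataFrame` (the labels and values in `toy_df` are assumptions for illustration). Passing `0` to `unstack()` selects the outermost index level, so the features move to the columns.

```python
import pandas as pd

# Made-up frame in the same layout as the Enron data: features in rows, people in columns
toy_df = pd.DataFrame(
    {"PERSON A": [10, 100], "PERSON B": [20, 200]},
    index=["bonus", "salary"],
)

# stack() moves the column labels (people) to the innermost level of the row index
stacked = toy_df.stack()

# unstack() with no argument pivots the innermost level back: same layout as toy_df
round_trip = stacked.unstack()

# unstack(0) pivots the outermost level (features) instead: people end up in the rows
reshaped = stacked.unstack(0)

print(reshaped)
```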

**Run** the cell below *once* to correctly `stack` and `unstack` the Enron `DataFrame` to move the instances (names) to rows and features (variables) to columns. 

In [6]:
enron_df = enron_df.stack().unstack(0)
display(enron_df)

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
ALLEN PHILLIP K,4175000,2869717,-3081055,,phillip.allen@enron.com,1729541,13868,2195,47,65,...,304805,152,False,126027,-126027,201955,1407,2902,4484442,1729541
BADUM JAMES P,,178980,,,,257817,3486,,,,...,,,False,,,,,,182466,257817
BANNANTINE JAMES M,,,-5104,,james.bannantine@enron.com,4046157,56301,29,39,0,...,,864523,False,1757552,-560222,477,465,566,916197,5243487
BAXTER JOHN C,1200000,1295738,-1386055,,,6680544,11200,,,,...,1586055,2660303,False,3942714,,267102,,,5634343,10623258
BAY FRANKLIN R,400000,260455,-201641,,frank.bay@enron.com,,129142,,,,...,,69,False,145796,-82782,239671,,,827696,63014
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WINOKUR JR. HERBERT S,,,-25000,108579,,,1413,,,,...,,,False,,,,,,84992,
WODRASKA JOHN,,,,,john.wodraska@enron.com,,,,,,...,,189583,False,,,,,,189583,
WROBEL BRUCE,,,,,,139130,,,,,...,,,False,,,,,,,139130
YEAGER F SCOTT,,,,,scott.yeager@enron.com,8308552,53947,,,,...,,147950,True,3576206,,158403,,,360300,11884758


# Exercises
Use `pandas` to answer questions regarding the Enron dataset. If you're uncertain how to do something, feel free to ask questions, look through [the `pandas` documentation](http://pandas.pydata.org/pandas-docs/stable/api.html), or refer to the code examples above!

You can check your solutions to each of these exercises by entering your answer in the corresponding quiz in the "Datasets and Questions" lesson. The corresponding quiz is listed in parentheses after each exercise, so you know where to go to check your answers.

Answer options for Questions 1-6 (Types of Data quizzes):

- numerical: numerical values (numbers)
- categorical: a limited number of discrete values (categories)
- time series: temporal values (dates, timestamps)
- text: words
- other

## Question 1
What type of data is salary info? (Quiz: Types of Data 1)

- numerical


## Question 2
What type of data is job title? (Quiz: Types of Data 2)

- categorical 


## Question 3
What type of data is timestamps on emails? (Quiz: Types of Data 3)

- time series 


## Question 4
What type of data is contents of emails? (Quiz: Types of Data 4)

- text


## Question 5
What type of data is number of emails sent by a given person? (Quiz: Types of Data 5)

- numerical


## Question 6
What type of data is to/from fields of emails? (Quiz: Types of Data 6)

- text 

## Question 7
How many data points (people) are in the data set? (Quiz: Size of the Enron Dataset)

In [7]:
len(enron_df)

146

## Question 8
For each person, how many features are available? (Quiz: Features in the Enron Dataset)

In [8]:
len(enron_df.columns)

21

## Question 9
How many Persons of Interest (POIs) are in the dataset? (Quiz: Finding POIs in the Enron Data)

In [9]:
len(enron_df[enron_df['poi']])

18

## Question 10
We compiled a list of all POI names (in `final_project/poi_names.txt`) and associated email addresses (in `final_project/poi_email_addresses.py`).

How many POIs were there in total? Use the **names** file, not the email addresses, since many folks have more than one address and a few didn’t work for Enron, so we don’t have their emails. (Quiz: How Many POIs Exist?)

**Hint:** Open up the `poi_names.txt` file to see the file format:
  - the first line is a link to a USA Today article
  - the second line is blank
  - subsequent lines have the format: `(•) Lastname, Firstname`
      - the dot `•` is either "y" (for yes) or "n" (for no), describing if the emails for that POI are available

In [10]:
poi_namefile = "final_project/poi_names.txt"

poi_have_emails = []
poi_last_names = []
poi_first_names = []

with open(poi_namefile, 'r') as f:
    # read the USA Today link
    f.readline()
    
    # read the blank line
    f.readline()
    
    # for each remaining line, append information to the lists
    for line in f:
        name = line.split(" ")
        # name is a list of strings: ["(•)" , "Lastname," , "Firstname\n"]
        poi_have_emails.append(name[0][1] == "y")
        poi_last_names.append(name[1][:-1])
        poi_first_names.append(name[2][:-1])

        
len(poi_first_names)     

35
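
The same line format can also be parsed with a regular expression. The sketch below is one possible alternative, not the course's solution; the sample lines and the names `pattern` and `parsed` are made up for illustration.

```python
import re

# Hypothetical sample lines in the "(y/n) Lastname, Firstname" format
sample_lines = ["(y) Doe, Jane\n", "(n) Roe, Richard\n"]

# Flag in parentheses, then the last name up to the comma, then the first name
pattern = re.compile(r"^\((y|n)\)\s+([^,]+),\s+(.+)$")

parsed = []
for line in sample_lines:
    match = pattern.match(line.strip())
    if match:
        flag, last, first = match.groups()
        parsed.append({"has_emails": flag == "y", "last": last, "first": first})

print(len(parsed))
```

Calling `strip()` before matching also copes with a final line that has no trailing newline, a case in which slicing with `[:-1]` would cut off the last letter of the name.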

## Question 11
What might be a problem with having some POIs missing from our dataset? (Quiz: Problems with Incomplete Data)

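There are many potential answers to this question. One problem is bias: if the POIs missing from the dataset differ systematically from the ones that are included, a model trained on this data can learn patterns that do not hold for POIs in general.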

## Question 12
What is the total value of the stock belonging to James Prentice? (Quiz: Query The Dataset 1)

In [11]:
enron_df[enron_df.index.str.contains("Prentice",case=False)]['total_stock_value']

PRENTICE JAMES    1095040
Name: total_stock_value, dtype: object

## Question 13
How many email messages do we have from Wesley Colwell to persons of interest? (Quiz: Query The Dataset 2)

In [12]:
enron_df[enron_df.index.str.contains("Colwell",case=False)]['from_this_person_to_poi']

COLWELL WESLEY    11
Name: from_this_person_to_poi, dtype: object

## Question 14
What's the value of stock options exercised by Jeffrey K Skilling? (Quiz: Query The Dataset 3)

In [13]:
enron_df[enron_df.index.str.contains("Skilling",case=False)]['exercised_stock_options']

SKILLING JEFFREY K    19250000
Name: exercised_stock_options, dtype: object

**Questions 15-18 can be answered by some basic online research** 

## Question 15
Which of these schemes was Enron **not** involved in? (Quiz: Research the Enron Fraud)

- selling assets to shell companies at the end of each month, and buying them back at the beginning of the next month to hide accounting losses
- causing electrical grid failures in California
- **illegally obtained a government report that enabled them to corner the market on frozen concentrated orange juice futures**
- **conspiring to give a Saudi prince expedited American citizenship**
- a plan in collaboration with Blockbuster movies to stream movies over the internet

## Question 16
Who was the CEO of Enron during most of the time that fraud was being perpetrated? (Quiz: Enron CEO)

- Jeffrey Skilling 

## Question 17
Who was chairman of the Enron board of directors? (Quiz: Enron Chairman)

- Kenneth Lay

## Question 18
Who was CFO (chief financial officer) of Enron during most of the time that fraud was going on? (Quiz: Enron CFO)

- Andrew Fastow

## Question 19
Of these three individuals (Lay, Skilling and Fastow), who took home the most money (largest value of “total_payments” feature)? (Quiz: Follow The Money)

How much money did that person get?

In [14]:
enron_df[enron_df.index.str.contains("Skilling", case=False) | \
         enron_df.index.str.contains("Lay", case=False) | \
         enron_df.index.str.contains("Fastow", case=False)]['total_payments']

FASTOW ANDREW S         2424083
LAY KENNETH L         103559793
SKILLING JEFFREY K      8682716
Name: total_payments, dtype: object
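
The three `str.contains()` calls can also be combined into a single regular expression. This is a sketch on a small `Series` built from values shown in this notebook, where `total_payments` stands in for the column of `enron_df`. Substring matching can pick up unintended names, so it is worth checking which rows were selected.

```python
import pandas as pd

# Values copied from the outputs shown in this notebook
total_payments = pd.Series(
    {
        "ALLEN PHILLIP K": 4484442,
        "FASTOW ANDREW S": 2424083,
        "LAY KENNETH L": 103559793,
        "SKILLING JEFFREY K": 8682716,
    }
)

# One pattern with alternation replaces three separate str.contains() calls
mask = total_payments.index.str.contains("skilling|lay|fastow", case=False)
selected = total_payments[mask]

# idxmax() returns the index label (name) of the largest value
top_name = selected.idxmax()
print(top_name, selected.max())
```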

## Question 20
For nearly every person in the dataset, not every feature has a value. How is it denoted when a feature doesn’t have a well-defined value? (Quiz: Unfilled Features)

In [15]:
display(enron_df)

Unnamed: 0,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
ALLEN PHILLIP K,4175000,2869717,-3081055,,phillip.allen@enron.com,1729541,13868,2195,47,65,...,304805,152,False,126027,-126027,201955,1407,2902,4484442,1729541
BADUM JAMES P,,178980,,,,257817,3486,,,,...,,,False,,,,,,182466,257817
BANNANTINE JAMES M,,,-5104,,james.bannantine@enron.com,4046157,56301,29,39,0,...,,864523,False,1757552,-560222,477,465,566,916197,5243487
BAXTER JOHN C,1200000,1295738,-1386055,,,6680544,11200,,,,...,1586055,2660303,False,3942714,,267102,,,5634343,10623258
BAY FRANKLIN R,400000,260455,-201641,,frank.bay@enron.com,,129142,,,,...,,69,False,145796,-82782,239671,,,827696,63014
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WINOKUR JR. HERBERT S,,,-25000,108579,,,1413,,,,...,,,False,,,,,,84992,
WODRASKA JOHN,,,,,john.wodraska@enron.com,,,,,,...,,189583,False,,,,,,189583,
WROBEL BRUCE,,,,,,139130,,,,,...,,,False,,,,,,,139130
YEAGER F SCOTT,,,,,scott.yeager@enron.com,8308552,53947,,,,...,,147950,True,3576206,,158403,,,360300,11884758


## Question 21
How many folks in this dataset have a quantified salary? What about a known email address? (Quiz: Dealing with Unfilled Features)

In [16]:
print("There are {} people with known salaries.".format(len(enron_df[enron_df['salary'] != 'NaN'])))
print("There are {} people with known e-mail addresses.".format(len(enron_df[enron_df['email_address'] != 'NaN'])))

There are 95 people with known salaries.
There are 111 people with known e-mail addresses.
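
Because missing values in this dataset are stored as the string `"NaN"` rather than as a true missing value, `pandas` methods such as `count()` will not skip them automatically. The sketch below uses a three-row excerpt of the values displayed above (`toy_df` is a stand-in for `enron_df`) and shows one possible way to convert the placeholder first.

```python
import pandas as pd

# Three-row excerpt of the values displayed above; missing entries are the string "NaN"
toy_df = pd.DataFrame(
    {
        "salary": [201955, "NaN", 477],
        "email_address": [
            "phillip.allen@enron.com",
            "NaN",
            "james.bannantine@enron.com",
        ],
    },
    index=["ALLEN PHILLIP K", "BADUM JAMES P", "BANNANTINE JAMES M"],
)

# Approach used in the notebook: compare against the placeholder string
known_salaries = int((toy_df["salary"] != "NaN").sum())

# Alternative: turn the placeholder into a real missing value, then count non-missing entries
salary_numeric = pd.to_numeric(toy_df["salary"], errors="coerce")
emails = toy_df["email_address"].mask(toy_df["email_address"] == "NaN")

print(known_salaries, salary_numeric.count(), emails.count())
```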


## More Magic Functions

In the Jupyter Notebook `lesson 1`, we introduced our first [Magic Function](http://ipython.readthedocs.io/en/stable/interactive/tutorial.html#magics-explained):

`%matplotlib inline`

That allowed us to generate plots using `matplotlib.pyplot` that appeared directly within our Jupyter Notebook! Earlier in this notebook, we also introduced the Magic Function:

`%cd "..."`

Replacing the ellipsis in the above Magic Function with the path of the **ud120-projects** repo allowed us to change the working directory for this Jupyter Notebook session. Let's learn one last Magic Function:

`%load filename.py`

This allows us to quickly load Python code from the file `filename.py` into the current cell.

**Run** the cell below to execute the Magic Function `%load tools/feature_format.py` from the **ud120-projects** repo. What happens?

In [17]:
# %load tools/feature_format.py
#!/usr/bin/python

""" 
    A general tool for converting data from the
    dictionary format to an (n x k) python list that's 
    ready for training an sklearn algorithm

    n--no. of key-value pairs in dictionary
    k--no. of features being extracted

    dictionary keys are names of persons in dataset
    dictionary values are dictionaries, where each
        key-value pair in the dict is the name
        of a feature, and its value for that person

    In addition to converting a dictionary to a numpy 
    array, you may want to separate the labels from the
    features--this is what targetFeatureSplit is for

    so, if you want to have the poi label as the target,
    and the features you want to use are the person's
    salary and bonus, here's what you would do:

    feature_list = ["poi", "salary", "bonus"] 
    data_array = featureFormat( data_dictionary, feature_list )
    label, features = targetFeatureSplit(data_array)

    the line above (targetFeatureSplit) assumes that the
    label is the _first_ item in feature_list--very important
    that poi is listed first!
"""


import numpy as np

def featureFormat( dictionary, features, remove_NaN=True, remove_all_zeroes=True, remove_any_zeroes=False, sort_keys = False):
    """ convert dictionary to numpy array of features
        remove_NaN = True will convert "NaN" string to 0.0
        remove_all_zeroes = True will omit any data points for which
            all the features you seek are 0.0
        remove_any_zeroes = True will omit any data points for which
            any of the features you seek are 0.0
        sort_keys = True sorts keys by alphabetical order. Setting the value as
            a string opens the corresponding pickle file with a preset key
            order (this is used for Python 3 compatibility, and sort_keys
            should be left as False for the course mini-projects).
        NOTE: first feature is assumed to be 'poi' and is not checked for
            removal for zero or missing values.
    """


    return_list = []

    # Key order - first branch is for Python 3 compatibility on mini-projects,
    # second branch is for compatibility on final project.
    if isinstance(sort_keys, str):
        import pickle
        keys = pickle.load(open(sort_keys, "rb"))
    elif sort_keys:
        keys = sorted(dictionary.keys())
    else:
        keys = dictionary.keys()

    for key in keys:
        tmp_list = []
        for feature in features:
            try:
                dictionary[key][feature]
            except KeyError:
                print("error: key {} not present".format(feature))
                return
            value = dictionary[key][feature]
            if value=="NaN" and remove_NaN:
                value = 0
            tmp_list.append( float(value) )

        # Logic for deciding whether or not to add the data point.
        append = True
        # exclude 'poi' class as criteria.
        if features[0] == 'poi':
            test_list = tmp_list[1:]
        else:
            test_list = tmp_list
        ### if all features are zero and you want to remove
        ### data points that are all zero, do that here
        if remove_all_zeroes:
            append = False
            for item in test_list:
                if item != 0 and item != "NaN":
                    append = True
                    break
        ### if any features for a given data point are zero
        ### and you want to remove data points with any zeroes,
        ### handle that here
        if remove_any_zeroes:
            if 0 in test_list or "NaN" in test_list:
                append = False
        ### Append the data point if flagged for addition.
        if append:
            return_list.append( np.array(tmp_list) )

    return np.array(return_list)


def targetFeatureSplit( data ):
    """ 
        given a numpy array like the one returned from
        featureFormat, separate out the first feature
        and put it into its own list (this should be the 
        quantity you want to predict)

        return targets and features as separate lists

        (sklearn can generally handle both lists and numpy arrays as 
        input formats when training/predicting)
    """

    target = []
    features = []
    for item in data:
        target.append( item[0] )
        features.append( item[1:] )

    return target, features






You should see that the Magic Function `%load tools/feature_format.py` copies the contents of `feature_format.py` into the same cell, below the magic command. Additionally, the Magic Function `%load tools/feature_format.py` is now commented out. Note that you've **copied** the contents of `feature_format.py` into the cell with the Magic Function, but you need to run the cell a second time to actually **execute** the code.
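
To make the docstring in the loaded file more concrete, here is a simplified, self-contained sketch of the kind of conversion it describes. It is not the course's `featureFormat` or `targetFeatureSplit`; it is a stripped-down reimplementation on a made-up dictionary (`toy_data`) that converts `"NaN"` to `0.0`, drops rows whose non-label features are all zero, and splits off the first feature as the label.

```python
import numpy as np

# Made-up data in the same person -> feature dictionary format
toy_data = {
    "PERSON A": {"poi": False, "salary": 100.0, "bonus": "NaN"},
    "PERSON B": {"poi": True, "salary": "NaN", "bonus": "NaN"},
    "PERSON C": {"poi": False, "salary": 300.0, "bonus": 30.0},
}
feature_list = ["poi", "salary", "bonus"]  # label first, as the docstring requires

rows = []
for name in sorted(toy_data):
    values = [
        0.0 if toy_data[name][feature] == "NaN" else float(toy_data[name][feature])
        for feature in feature_list
    ]
    # Keep the row only if at least one non-label feature is nonzero
    if any(value != 0.0 for value in values[1:]):
        rows.append(values)

data_array = np.array(rows)

# First column is the label; the remaining columns are the features
labels = data_array[:, 0]
features = data_array[:, 1:]
print(data_array.shape)
```

In this toy example, PERSON B is dropped because both requested features are missing, even though that row is labeled as a POI.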

If you want to try another `%load` Magic Function, **run** the cell below to load the contents of `poi_email_addresses.py` to this Notebook.

In [18]:
# %load final_project/poi_email_addresses.py
def poiEmails():
    email_list = ["kenneth_lay@enron.net",    
            "kenneth_lay@enron.com",
            "klay.enron@enron.com",
            "kenneth.lay@enron.com", 
            "klay@enron.com",
            "layk@enron.com",
            "chairman.ken@enron.com",
            "jeffreyskilling@yahoo.com",
            "jeff_skilling@enron.com",
            "jskilling@enron.com",
            "effrey.skilling@enron.com",
            "skilling@enron.com",
            "jeffrey.k.skilling@enron.com",
            "jeff.skilling@enron.com",
            "kevin_a_howard.enronxgate.enron@enron.net",
            "kevin.howard@enron.com",
            "kevin.howard@enron.net",
            "kevin.howard@gcm.com",
            "michael.krautz@enron.com",
            "scott.yeager@enron.com",
            "syeager@fyi-net.com",
            "scott_yeager@enron.net",
            "syeager@flash.net",
            "joe'.'hirko@enron.com", 
            "joe.hirko@enron.com", 
            "rex.shelby@enron.com", 
            "rex.shelby@enron.nt", 
            "rex_shelby@enron.net",
            "jbrown@enron.com",
            "james.brown@enron.com", 
            "rick.causey@enron.com", 
            "richard.causey@enron.com", 
            "rcausey@enron.com",
            "calger@enron.com",
            "chris.calger@enron.com", 
            "christopher.calger@enron.com", 
            "ccalger@enron.com",
            "tim_despain.enronxgate.enron@enron.net", 
            "tim.despain@enron.com",
            "kevin_hannon@enron.com", 
            "kevin'.'hannon@enron.com", 
            "kevin_hannon@enron.net", 
            "kevin.hannon@enron.com",
            "mkoenig@enron.com", 
            "mark.koenig@enron.com",
            "m..forney@enron.com",
            "ken'.'rice@enron.com", 
            "ken.rice@enron.com",
            "ken_rice@enron.com", 
            "ken_rice@enron.net",
            "paula.rieker@enron.com",
            "prieker@enron.com", 
            "andrew.fastow@enron.com", 
            "lfastow@pdq.net", 
            "andrew.s.fastow@enron.com", 
            "lfastow@pop.pdq.net", 
            "andy.fastow@enron.com",
            "david.w.delainey@enron.com", 
            "delainey.dave@enron.com", 
            "'delainey@enron.com", 
            "david.delainey@enron.com", 
            "'david.delainey'@enron.com", 
            "dave.delainey@enron.com", 
            "delainey'.'david@enron.com",
            "ben.glisan@enron.com", 
            "bglisan@enron.com", 
            "ben_f_glisan@enron.com", 
            "ben'.'glisan@enron.com",
            "jeff.richter@enron.com", 
            "jrichter@nwlink.com",
            "lawrencelawyer@aol.com", 
            "lawyer'.'larry@enron.com", 
            "larry_lawyer@enron.com", 
            "llawyer@enron.com", 
            "larry.lawyer@enron.com", 
            "lawrence.lawyer@enron.com",
            "tbelden@enron.com", 
            "tim.belden@enron.com", 
            "tim_belden@pgn.com", 
            "tbelden@ect.enron.com",
            "michael.kopper@enron.com",
            "dave.duncan@enron.com", 
            "dave.duncan@cipco.org", 
            "duncan.dave@enron.com",
            "ray.bowen@enron.com", 
            "raymond.bowen@enron.com", 
            "'bowen@enron.com",
            "wes.colwell@enron.com",
            "dan.boyle@enron.com",
            "cloehr@enron.com", 
            "chris.loehr@enron.com"
        ]
    return email_list


A few last helpful hints about Magic Functions:
  - You can learn more about Magic Functions at any time with the Magic Function `%magic`
  - You can get the list of available Magic Functions by the Magic Function `%lsmagic`
  - You can learn about any Magic Function by typing a question mark after it, *e.g.* `%load?`
  
**Try it below!**

In [19]:
%magic

# Optional Exercises -- Missing POIs
As you saw a little while ago, not every POI has an entry in the dataset (e.g. Michael Krautz). That’s because the dataset was created using the financial data you can find in `../final_project/enron61702insiderpay.pdf`, which is missing some POIs (those absences propagated through to the final dataset). On the other hand, for many of these “missing” POIs, we do have emails.

While it would be straightforward to add these POIs and their email information to the E+F (email + finance) dataset, and just put “NaN” for their financial information, this could introduce a subtle problem. You will walk through that here.

Again, you can check your solutions to each of these exercises by entering your answer in the corresponding quiz in the "Datasets and Questions" lesson. The corresponding quiz is listed in parentheses after each exercise, so you know where to go to check your answers.

## Question 1
How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments? What percentage of people in the dataset as a whole is this? (Quiz: Missing POIs 1 (Optional))

In [20]:
print("There are {} people with \'NaN\' for their total payments, or {:.2f}% of the dataset".format(\
        len(enron_df[enron_df['total_payments']=='NaN']),\
        100.0*len(enron_df[enron_df['total_payments']=='NaN'])/len(enron_df)))

There are 21 people with 'NaN' for their total payments, or 14.38% of the dataset


## Question 2
How many POIs in the E+F dataset have “NaN” for their total payments? What percentage of POIs as a whole is this? (Quiz: Missing POIs 2 (Optional))

In [21]:
poi_df = enron_df[enron_df['poi']==True]

print("There are {} POIs with \'NaN\' for their total payments, or {:.2f}% of all POIs".format(\
        len(poi_df[poi_df['total_payments']=='NaN']),\
        100.0*len(poi_df[poi_df['total_payments']=='NaN'])/len(poi_df)))

There are 0 POIs with 'NaN' for their total payments, or 0.00% of all POIs


## Question 3

If a machine learning algorithm were to use total_payments as a feature, would you expect it to associate a “NaN” value with POIs or non-POIs? (Quiz: Missing POIs 3 (Optional))

- non-POIs



## Question 4

If you added in, say, 10 more data points which were all POIs, and put “NaN” for the total payments for those folks, the numbers you just calculated would change.

What is the new number of people in the dataset? What is the new number of folks with “NaN” for total payments? (Quiz: Missing POIs 4 (Optional))

In [22]:
print("With 10 more POIs in the dataset, there would be {} total people.".format(len(enron_df)+10))
print("If those 10 POIs had \'NaN\' for \'total_payments\', then {} total people would have \'NaN\' for this field.".\
      format(len(enron_df[enron_df['total_payments']=='NaN'])+10))

With 10 more POIs in the dataset, there would be 156 total people.
If those 10 POIs had 'NaN' for 'total_payments', then 31 total people would have 'NaN' for this field.


## Question 5

What is the new number of POIs in the dataset? What is the new number of POIs with “NaN” for total_payments? (Quiz: Missing POIs 5 (Optional))

In [23]:
print("With 10 more POIs in the dataset, there would be {} total POIs.".format(len(poi_df)+10))
print("If those 10 POIs had \'NaN\' for \'total_payments\', then {} total POIs would have \'NaN\' for this field.".\
      format(len(poi_df[poi_df['total_payments']=='NaN'])+10))

With 10 more POIs in the dataset, there would be 28 total POIs.
If those 10 POIs had 'NaN' for 'total_payments', then 10 total POIs would have 'NaN' for this field.


## Question 6

Once the new data points are added, do you think a supervised classification algorithm might interpret “NaN” for total_payments as a clue that someone is a POI? (Quiz: Missing POIs 6 (Optional))

- Yes
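
The arithmetic behind this answer can be sketched directly from the counts computed in the exercises above. The variable and function names below are made up for illustration; the point is to compare the share of POIs among people with “NaN” total payments before and after the hypothetical 10 POIs are added.

```python
# Counts taken from the answers computed above
total_people, nan_people = 146, 21
total_pois, nan_pois = 18, 0
added = 10  # hypothetical extra POIs, all with "NaN" for total_payments


def share_of_pois_among_nan(n_nan_pois, n_nan_people):
    """Fraction of the people with "NaN" total_payments who are POIs."""
    return n_nan_pois / float(n_nan_people)


before = share_of_pois_among_nan(nan_pois, nan_people)
after = share_of_pois_among_nan(nan_pois + added, nan_people + added)

# Overall share of POIs in the enlarged dataset, for comparison
base_rate_after = (total_pois + added) / float(total_people + added)

print("{:.2f} {:.2f} {:.2f}".format(before, after, base_rate_after))
```

Before the addition, no one with “NaN” total payments is a POI. Afterwards, POIs make up a larger share of the “NaN” group than of the dataset as a whole, so a classifier could pick up on the missing value itself as a signal.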