# Connect Intensive - Machine Learning Nanodegree
# Lesson 02: Working with the Enron Data Set

## Objectives
  - Work through the Datasets and Questions lesson from [ud-120](https://www.udacity.com/course/intro-to-machine-learning--ud120) within the Jupyter Notebook environment
  - Introduce [the `pickle` module](https://docs.python.org/2/library/pickle.html) from the Python Standard Library for preserving data structures in Python
  - Practice loading data from different directories
  - Review stacking and unstacking in `pandas` to [reshape a `DataFrame`](http://pandas.pydata.org/pandas-docs/stable/reshaping.html)
  

## Background Information
Before you start working through this Jupyter Notebook, you should probably watch the following videos from ud-120 (they're all conveniently compiled in your Classroom under the Data Modeling section).
  - Introduction
  - What is a POI
  - Accuracy vs. Training Set Size
  - Downloading Enron Data
  - Types of Data (Quizzes 1-6)
  - Enron Dataset Mini-Project Video
  
[Katie Malone](http://blog.udacity.com/2016/04/women-in-machine-learning-katie-malone.html) provides background on the Enron scandal, introduces the email corpus, defines a person-of-interest (POI), and poses an interesting ML question to motivate the project ("Can we identify patterns in the emails of people who were POIs?"). The videos provide excellent insight into the workflow for a Machine Learnist or Data Scientist.

This Jupyter Notebook is intended to provide a friendly guide through the "Datasets and Questions" lesson... but if you're feeling pretty confident about your Python skills, consider going off-script! Try to work through the lesson on your own -- you may encounter some snags, and you can always refer back to this Notebook if you need a little push forward.

## Getting Started
The Datasets and Questions lesson in the Data Modeling section draws from Enron finance and email data. These datasets are found in the [**ud120-projects** repo](https://github.com/udacity/ud120-projects) on GitHub. Please fork and clone the **ud120-projects** repo to your local machine if you haven't done so already (if you need a refresher on how to fork and clone from the command line, [check out this link from GitHub](https://help.github.com/articles/fork-a-repo/)). *Be sure to keep track of where you save the local clone of this repo on your machine!* We'll need the location of that directory in just a bit.

## In a `pickle`
Suppose you're working with Python and you assemble nice data structures (*e.g.* dictionaries, lists, tuples, sets...) that you'll want to re-use in Python at a later time. [The `pickle` module](https://docs.python.org/2/library/pickle.html) is a fast, efficient way to preserve (or pickle) those data structures without you needing to worry about how to structure or organize your output file. One nice thing about using `pickle` is that the data structures you store can be arbitrarily complex: you can have nested data structures (*e.g.* lists of tuples as the values in a dictionary) and `pickle` will know exactly how to serialize (or write) those structures to file! The next time you're in Python, you can un-pickle the data structures using `pickle.load()` and pick up right where you left off!

For a better explanation of `pickle` than I could hope to put together, please check out [this reference on Serializing Python Objects](http://www.diveintopython3.net/serializing.html) from Dive Into Python 3.
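To make the round trip concrete, here is a minimal sketch that pickles a nested structure to a file and loads it back. The data and the file name `example.pkl` are made up for illustration; the only real API used is `pickle.dump()` and `pickle.load()`.

```python
import os
import pickle
import tempfile

# A nested structure: a dictionary whose values are lists of tuples.
original = {
    "alice": [(1, "a"), (2, "b")],
    "bob": [(3, "c")],
}

# Write to a throwaway file (hypothetical name) in a temporary directory.
# Pickle files are binary, so we open with "wb" and "rb".
path = os.path.join(tempfile.mkdtemp(), "example.pkl")
with open(path, "wb") as f:
    pickle.dump(original, f)

# Later (or in another Python session), read the structure back.
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored object is an equal copy, not the same object.
print(restored == original)
print(restored is original)
```

Notice that the nesting (tuples inside lists inside a dictionary) survives without any extra effort on our part.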

**Run** the cell below to import the `pickle` module. (Don't forget, **shift + enter** or **shift + return** runs the active cell in a Jupyter Notebook)

In [None]:
try:
    import pickle
    print("Successfully imported pickle!")
except ImportError:
    print("Could not import pickle")

## The `path` to success
Do you remember where you cloned the **ud120-projects** repo to your local machine? We need that information now! The **ud120-projects** directory contains a lot of folders and files, one of which is the Enron data within a Python dictionary, preserved using the `pickle` module. However, we need to tell this Jupyter Notebook where it can find the **ud120-projects** repo on our local machine. In the cell below, I have a string variable (`ud_120_path`) that contains the path of the **ud120-projects** directory on my machine. Chances are, you didn't save the directory to the same place on your machine (although you may have, in which case, hello fellow nick!)

**Update** the variable "`ud_120_path`" in the cell below to reflect the correct path of the **ud120-projects** directory on your local machine.

Then **run** the cell below to load the Enron data!
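If the load fails, a quick check like the one sketched here can help confirm whether the path is the problem. The variable names and the value `/path/to/ud120-projects` are placeholders I made up, so swap in your own location. `os.path.join()` builds the path with the right separator for your operating system, and `os.path.isfile()` reports whether the file is really there.

```python
import os

# Placeholder location (an assumption): replace with your own clone.
example_repo_path = "/path/to/ud120-projects"

# os.path.join inserts the path separator for us, so there is no need
# to worry about leading or trailing slashes.
example_data_file = os.path.join(
    example_repo_path, "final_project", "final_project_dataset.pkl"
)

if os.path.isfile(example_data_file):
    print("Found: " + example_data_file)
else:
    print("Not found: " + example_data_file)
```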

In [None]:
# Be sure to write the full path, up to and including "ud120-projects"
# (but don't end the string with a "/")
ud_120_path = "/Users/thomas/Udacity/Connect/ud120-projects" # change this!

try:
    # Pickle files are binary, so open in "rb" mode (required in Python 3)
    with open(ud_120_path + "/final_project/final_project_dataset.pkl", "rb") as f:
        enron_data = pickle.load(f)
    print("Enron data loaded successfully!")
except IOError:
    print("No such file or directory! (Is there a problem with the path?)")

## From Dictionary to DataFrame
At this point, the variable `enron_data` is a dictionary object. Dictionaries are not displayed as nicely as `pandas` `DataFrame` objects within the Jupyter Notebook environment. So let's convert `enron_data` to a `DataFrame` object! In the Jupyter Notebook lesson-01, we saw how to construct a `DataFrame` from a .csv file... we simply used the function `pandas.read_csv()`. Fortunately, it's just as easy to create a `DataFrame` object from a dictionary object: we could use [the method `pandas.DataFrame.from_dict()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html) or simply use [the constructor `pandas.DataFrame()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) -- either one works!
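As a small illustration of "either one works", here is a sketch on a made-up dictionary of dictionaries (the names `PERSON A` and `PERSON B` and all the values are invented, not taken from the Enron data). Pay attention to which keys end up as column labels.

```python
import pandas as pd

# Toy data shaped like enron_data: outer keys are people,
# inner keys are features. All values are made up.
toy_data = {
    "PERSON A": {"salary": 100, "bonus": 10},
    "PERSON B": {"salary": 200, "bonus": 20},
}

from_constructor = pd.DataFrame(toy_data)
from_method = pd.DataFrame.from_dict(toy_data)

# With default arguments, both routes should build the same frame.
print(from_constructor.equals(from_method))
print(from_constructor)
```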

**Run** the cell below to:
  - import `pandas` and `display`.
  - set some display options.
  - create a `DataFrame` object for the Enron data.
  - display the Enron data.

In [None]:
try:
    import pandas as pd
    print("Successfully imported pandas! (Version {})".format(pd.__version__))
    pd.options.display.max_rows = 10
except ImportError:
    print("Could not import pandas!")

try:
    from IPython.display import display
    print("Successfully imported display from IPython.display!")
except ImportError:
    print("Could not import display from IPython.display")
    
enron_df = pd.DataFrame.from_dict(enron_data)
display(enron_df)

## Stacking, unstacking, and rearranging

Oh no, it looks like we created our DataFrame the wrong way! Each row of the `DataFrame` should correspond to a unique instance or input, while each column of the `DataFrame` should correspond to a unique feature or variable. The functions [`pandas.DataFrame.stack()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.stack.html) and [`pandas.DataFrame.unstack()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html) will come to the rescue! First, we need to `stack` the current column indices, moving them to the innermost level of the row index.

**Run** the cell below to see the results of calling `enron_df.stack()`

In [None]:
enron_df.stack()

We see that the result of `enron_df.stack()` is a `Series` object, where the innermost (rightmost) level of the index is the person's name in the Enron data set, while the outermost (leftmost) level of the index is the feature. If we call `unstack()` on the resulting `Series` without specifying a level, we'll just revert to the original `DataFrame`.

**Run** the cell below to see the result of calling `enron_df.stack().unstack()`

In [None]:
enron_df.stack().unstack()

The trick is, we need to `unstack` the *outermost* level of the index, but by default, the function will `unstack` the *innermost* level of the index.
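Here is a sketch of the difference between the two levels, using a tiny made-up frame (people as columns, features as rows, values invented). Calling `unstack()` with no argument moves the innermost level back to the columns, while `unstack(0)` moves the outermost level instead.

```python
import pandas as pd

# Toy frame in the "wrong" orientation: people are columns.
toy_df = pd.DataFrame(
    {
        "PERSON A": {"salary": 100, "bonus": 10},
        "PERSON B": {"salary": 200, "bonus": 20},
    }
)

# stack() gives a Series with a two-level index: (feature, person).
stacked = toy_df.stack()

# Default: unstack the innermost level (person), back where we started.
round_trip = stacked.unstack()

# Level 0: unstack the outermost level (feature), so features
# become the columns and people become the rows.
flipped = stacked.unstack(0)

print(sorted(round_trip.columns))
print(sorted(flipped.columns))
```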

**Run** the cell below *once* to correctly `stack` and `unstack` the Enron `DataFrame` to move the instances (names) to rows and features (variables) to columns. *Be careful!* If you run this cell an even number of times, you will lose your changes to the `DataFrame`... can you see why?

In [None]:
enron_df = enron_df.stack().unstack(0)
display(enron_df)

Great! Now that our `DataFrame` has the features and instances in the correct orientation, we can start to explore the data. But before we dive into the exercises, I'll leave you with one last reference for [reshaping `Series` and `DataFrame` objects in `pandas`](http://pandas.pydata.org/pandas-docs/stable/reshaping.html).
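For what it's worth, `stack()` followed by `unstack(0)` is not the only way to get this orientation. The sketch below, on made-up data, compares it with the transpose attribute `.T` and with the `orient="index"` argument of `pandas.DataFrame.from_dict()`. On a frame without truly missing values I would expect all three to agree; with real missing values, `stack()` may drop entries in some `pandas` versions, so treat this as an illustration and not a guarantee.

```python
import pandas as pd

# Toy data (invented): outer keys are people, inner keys are features.
toy_data = {
    "PERSON A": {"salary": 100, "bonus": 10},
    "PERSON B": {"salary": 200, "bonus": 20},
}

# Route 1: the stack/unstack approach from this lesson.
via_stack = pd.DataFrame(toy_data).stack().unstack(0)

# Route 2: transpose the frame.
via_transpose = pd.DataFrame(toy_data).T

# Route 3: tell from_dict that the outer keys are row labels.
via_orient = pd.DataFrame.from_dict(toy_data, orient="index")

for frame in (via_stack, via_transpose, via_orient):
    print(sorted(frame.index), sorted(frame.columns))
```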

# Exercises
Now it's your turn to play with `pandas` to answer questions using the Enron data set. If you're not sure how to do something, feel free to ask questions, look through [the `pandas` documentation](http://pandas.pydata.org/pandas-docs/stable/api.html), or refer to the code examples above!

You can check your solutions to each of these exercises by entering your answer in the corresponding Quiz in the "Datasets and Questions" lesson. I put the corresponding quizzes in parentheses after each exercise, so you know where to go to check your answers.

## Question 1
How many data points (people) are in the data set? (Quiz: Size of the Enron Dataset)

In [None]:
len(enron_df)

## Question 2
For each person, how many features are available? (Quiz: Features in the Enron Dataset)

In [None]:
len(enron_df.columns)

## Question 3
How many Persons of Interest (POIs) are in the dataset? (Quiz: Finding POIs in the Enron Data)

In [None]:
len(enron_df[enron_df['poi']])

## Question 4
We compiled a list of all POI names (in `../final_project/poi_names.txt`) and associated email addresses (in `../final_project/poi_email_addresses.py`).

How many POIs were there in total? Use the **names** file, not the email addresses, since many folks have more than one address and a few didn’t work for Enron, so we don’t have their emails. (Quiz: How Many POIs Exist?)

**Hint:** Open up the `poi_names.txt` file to see the file format:
  - the first line is a link to a USA Today article
  - the second line is blank
  - subsequent lines have the format: `(•) Lastname, Firstname`
      - the dot `•` is either "y" (for yes) or "n" (for no), describing if the emails for that POI are available
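If you'd like to experiment with the line format before touching the real file, here is a minimal sketch that parses two made-up sample lines with a regular expression. The names in `sample_lines` are placeholders, not entries from `poi_names.txt`, and the pattern is just one reasonable way to match the format described above.

```python
import re

# Made-up lines in the format "(y) Lastname, Firstname".
sample_lines = ["(y) Lastname, Firstname\n", "(n) Doe, Jane\n"]

# Group 1: "y" or "n"; group 2: last name; group 3: first name.
line_pattern = re.compile(r"^\((y|n)\)\s+([^,]+),\s+(.+)$")

parsed = []
for line in sample_lines:
    match = line_pattern.match(line.strip())
    if match is None:
        continue  # skip lines that do not fit the format
    has_email, last_name, first_name = match.groups()
    parsed.append((has_email == "y", last_name, first_name))

print(parsed)
```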

In [None]:
poi_namefile = ud_120_path + "/final_project/poi_names.txt"

poi_have_emails = []
poi_last_names = []
poi_first_names = []

with open(poi_namefile, 'r') as f:
    # read the USA Today link
    usa_today_url = f.readline()
    
    # read the blank line
    f.readline()
    
    # for each remaining line, append information to the lists
    for line in f:
        # strip() removes the trailing newline (if there is one)
        name = line.strip().split(" ")
        # name is a list of strings: ["(•)" , "Lastname," , "Firstname"]
        poi_have_emails.append(name[0][1] == "y")
        poi_last_names.append(name[1][:-1])  # drop the trailing comma
        poi_first_names.append(name[2])
        
len(poi_first_names)

## Question 5
What might be a problem with having some POIs missing from our dataset? (Quiz: Problems with Incomplete Data)

This is more of a "free response" thought question -- we don't really expect you to answer this using code.

## Question 6
What is the total value of the stock belonging to James Prentice? (Query The Dataset 1)

In [None]:
enron_df[enron_df.index.str.contains("Prentice",case=False)]['total_stock_value']

## Question 7
How many email messages do we have from Wesley Colwell to persons of interest? (Query The Dataset 2)

In [None]:
enron_df[enron_df.index.str.contains("Colwell",case=False)]['from_this_person_to_poi']

## Question 8
What's the value of stock options exercised by Jeffrey K Skilling? (Query The Dataset 3)

In [None]:
enron_df[enron_df.index.str.contains("Skilling",case=False)]['exercised_stock_options']

**Questions 9-12 can be answered by some Google sleuthing** 

## Question 9
Which of these schemes was Enron **not** involved in? (Quiz: Research the Enron Fraud)

- selling assets to shell companies at the end of each month, and buying them back at the beginning of the next month to hide accounting losses
- causing electrical grid failures in California
- illegally obtaining a government report that enabled them to corner the market on frozen concentrated orange juice futures
- conspiring to give a Saudi prince expedited American citizenship
- a plan in collaboration with Blockbuster movies to stream movies over the internet

## Question 10
Who was the CEO of Enron during most of the time that fraud was being perpetrated? (Quiz: Enron CEO)

## Question 11
Who was chairman of the Enron board of directors? (Quiz: Enron Chairman)

## Question 12
Who was CFO (chief financial officer) of Enron during most of the time that fraud was going on? (Quiz: Enron CFO)

## Question 13
Of the CEO, Chairman of the Board, and the CFO, who took home the most money? How much money was it? (Quiz: Follow the Money)

*Hint:* Which of the three individuals has the largest value in the `'total_payments'` feature?

In [None]:
enron_df[(enron_df.index.str.contains("Skilling", case=False)) | \
         (enron_df.index.str.contains("Lay", case=False)) | \
         (enron_df.index.str.contains("Fastow", case=False))]['total_payments']

## Question 14
For nearly every person in the dataset, not every feature has a value. How is it denoted when a feature doesn’t have a well-defined value? (Quiz: Unfilled Features)

In [None]:
display(enron_df)

## Question 15
How many folks in this dataset have a quantified salary? What about a known email address? (Quiz: Dealing with Unfilled Features)
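Because the data set stores missing values as the string `"NaN"` and not as a true missing value, there are a couple of ways to count the known entries. The sketch below uses a made-up three-person frame (all names and values invented) to compare a direct string comparison with converting the strings to `numpy.nan` and calling `count()`.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) using the string "NaN" for missing data.
toy_df = pd.DataFrame(
    {
        "salary": [100, "NaN", 300],
        "email_address": ["a@example.com", "NaN", "NaN"],
    },
    index=["PERSON A", "PERSON B", "PERSON C"],
)

# Option 1: compare against the string directly and sum the True values.
known_salaries = (toy_df["salary"] != "NaN").sum()

# Option 2: convert the strings to real missing values, then count()
# reports the number of non-missing entries in every column at once.
cleaned = toy_df.replace("NaN", np.nan)
known_counts = cleaned.count()

print(known_salaries)
print(known_counts)
```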

In [None]:
print("There are {} people with known salaries.".format(len(enron_df[enron_df['salary'] != 'NaN'])))
print("There are {} people with known e-mail addresses.".format(len(enron_df[enron_df['email_address'] != 'NaN'])))

## More Magic Functions!

In the Jupyter Notebook `lesson-01.ipynb`, we introduced our first [Magic Function](http://ipython.readthedocs.io/en/stable/interactive/tutorial.html#magics-explained):

`%matplotlib inline`

That allowed us to generate plots using `matplotlib.pyplot` that appeared directly within our Jupyter Notebook! Let's learn another Magic Function:

`%load filename.py`

This allows us to quickly load Python code from the file `filename.py` into the current cell. We can try it to load the contents of `"../tools/feature_format.py"` into this Jupyter Notebook. However, suppose you have a Python `string` object with the path and desired filename... let's call it `target_file`. Unfortunately, `%load target_file` does not work with a `string` object... so here's what I came up with.

**Run** the cell below to print the magic command we need to load `feature_format.py` from the ud120-projects repo:

In [None]:
print("%load " + ud_120_path + "/tools/feature_format.py")

Now **copy + paste** the output from running the above cell into the empty cell below, and then **run** the cell below. What happens?

In [None]:
# %load /Users/nick/Documents/Udacity/ud120-projects/tools/feature_format.py
#!/usr/bin/python

""" 
    A general tool for converting data from the
    dictionary format to an (n x k) python list that's 
    ready for training an sklearn algorithm

    n--no. of key-value pairs in dictionary
    k--no. of features being extracted

    dictionary keys are names of persons in dataset
    dictionary values are dictionaries, where each
        key-value pair in the dict is the name
        of a feature, and its value for that person

    In addition to converting a dictionary to a numpy 
    array, you may want to separate the labels from the
    features--this is what targetFeatureSplit is for

    so, if you want to have the poi label as the target,
    and the features you want to use are the person's
    salary and bonus, here's what you would do:

    feature_list = ["poi", "salary", "bonus"] 
    data_array = featureFormat( data_dictionary, feature_list )
    label, features = targetFeatureSplit(data_array)

    the line above (targetFeatureSplit) assumes that the
    label is the _first_ item in feature_list--very important
    that poi is listed first!
"""


import numpy as np

def featureFormat( dictionary, features, remove_NaN=True, remove_all_zeroes=True,\
                  remove_any_zeroes=False, sort_keys = False):
    """ convert dictionary to numpy array of features
        remove_NaN = True will convert "NaN" string to 0.0
        remove_all_zeroes = True will omit any data points for which
            all the features you seek are 0.0
        remove_any_zeroes = True will omit any data points for which
            any of the features you seek are 0.0
        sort_keys = True sorts keys by alphabetical order. Setting the value as
            a string opens the corresponding pickle file with a preset key
            order (this is used for Python 3 compatibility, and sort_keys
            should be left as False for the course mini-projects).
        NOTE: first feature is assumed to be 'poi' and is not checked for
            removal for zero or missing values.
    """


    return_list = []

    # Key order - first branch is for Python 3 compatibility on mini-projects,
    # second branch is for compatibility on final project.
    if isinstance(sort_keys, str):
        import pickle
        keys = pickle.load(open(sort_keys, "rb"))
    elif sort_keys:
        keys = sorted(dictionary.keys())
    else:
        keys = dictionary.keys()

    for key in keys:
        tmp_list = []
        for feature in features:
            try:
                dictionary[key][feature]
            except KeyError:
                print("error: key ", feature, " not present")
                return
            value = dictionary[key][feature]
            if value=="NaN" and remove_NaN:
                value = 0
            tmp_list.append( float(value) )

        # Logic for deciding whether or not to add the data point.
        append = True
        # exclude 'poi' class as criteria.
        if features[0] == 'poi':
            test_list = tmp_list[1:]
        else:
            test_list = tmp_list
        ### if all features are zero and you want to remove
        ### data points that are all zero, do that here
        if remove_all_zeroes:
            append = False
            for item in test_list:
                if item != 0 and item != "NaN":
                    append = True
                    break
        ### if any features for a given data point are zero
        ### and you want to remove data points with any zeroes,
        ### handle that here
        if remove_any_zeroes:
            if 0 in test_list or "NaN" in test_list:
                append = False
        ### Append the data point if flagged for addition.
        if append:
            return_list.append( np.array(tmp_list) )

    return np.array(return_list)


def targetFeatureSplit( data ):
    """ 
        given a numpy array like the one returned from
        featureFormat, separate out the first feature
        and put it into its own list (this should be the 
        quantity you want to predict)

        return targets and features as separate lists

        (sklearn can generally handle both lists and numpy arrays as 
        input formats when training/predicting)
    """

    target = []
    features = []
    for item in data:
        target.append( item[0] )
        features.append( item[1:] )

    return target, features






You should see that the Magic Function `%load filename.py` copies the contents of `filename.py` into the same cell, below the magic command. Additionally, the Magic Function `%load filename.py` is now commented out! Note that you've **copied** the contents of `filename.py` into the cell with the Magic Function `%load filename.py`, but you need to run the cell a second time to actually **execute** the code.

If you want to try another `%load` Magic Function, **run** the cell below to generate another `%load` for the file `poi_email_addresses.py`, then **copy + paste** the output from that cell into another cell. **Run** the resulting magic function to load the contents of `poi_email_addresses.py` to this Notebook! 

In [None]:
print("%load " + ud_120_path + "/final_project/poi_email_addresses.py")

A couple last helpful hints about Magic Functions:
  - You can learn more about Magic Functions at any time with the Magic Function `%magic`
  - You can get the list of available Magic Functions by the Magic Function `%lsmagic`
  - You can learn about any Magic Function by typing a question mark after it, *e.g.* `%load?`
  
**Try it below!**

In [None]:
%magic

# Optional Exercises -- Missing POIs
As you saw a little while ago, not every POI has an entry in the dataset (e.g. Michael Krautz). That’s because the dataset was created using the financial data you can find in `../final_project/enron61702insiderpay.pdf`, which is missing some POIs (those absences propagated through to the final dataset). On the other hand, for many of these “missing” POIs, we do have emails.

While it would be straightforward to add these POIs and their email information to the E+F dataset, and just put “NaN” for their financial information, this could introduce a subtle problem. You will walk through that here.

Again, you can check your solutions to each of these exercises by entering your answer in the corresponding Quiz in the "Datasets and Questions" lesson. I put the corresponding quizzes in parentheses after each exercise, so you know where to go to check your answers.

## Question 1
How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments? What percentage of people in the dataset as a whole is this? (Quiz: Missing POIs 1 (Optional))

In [None]:
print("There are {} people with \'NaN\' for their total payments, or {:.2f}% of the dataset".format(\
        len(enron_df[enron_df['total_payments']=='NaN']),\
        100.0*len(enron_df[enron_df['total_payments']=='NaN'])/len(enron_df)))

## Question 2
How many POIs in the E+F dataset have “NaN” for their total payments? What percentage of POIs as a whole is this? (Quiz: Missing POIs 2 (Optional))

In [None]:
poi_df = enron_df[enron_df['poi']==True]

print("There are {} POIs with \'NaN\' for their total payments, or {:.2f}% of all POIs".format(\
        len(poi_df[poi_df['total_payments']=='NaN']),\
        100.0*len(poi_df[poi_df['total_payments']=='NaN'])/len(poi_df)))

## Question 3

If a machine learning algorithm were to use total_payments as a feature, would you expect it to associate a “NaN” value with POIs or non-POIs? (Quiz: Missing POIs 3 (Optional))

(Think about your answers from Questions 1 and 2 to answer this)

## Question 4

If you added in, say, 10 more data points which were all POIs, and put “NaN” for the total payments for those folks, the numbers you just calculated would change.

What is the new number of people in the dataset? What is the new number of folks with “NaN” for total payments? (Quiz: Missing POIs 4 (Optional))

In [None]:
print("With 10 more POIs in the dataset, there would be {} total people.".format(len(enron_df)+10))
print("If those 10 POIs had \'NaN\' for \'total_payments\', then {} total people would have \'NaN\' for this field.".\
      format(len(enron_df[enron_df['total_payments']=='NaN'])+10))

## Question 5

What is the new number of POIs in the dataset? What is the new number of POIs with NaN for total_payments? (Quiz: Missing POIs 5 (Optional))

In [None]:
print("With 10 more POIs in the dataset, there would be {} total POIs.".format(len(poi_df)+10))
print("If those 10 POIs had \'NaN\' for \'total_payments\', then {} total POIs would have \'NaN\' for this field.".\
      format(len(poi_df[poi_df['total_payments']=='NaN'])+10))

## Question 6

Once the new data points are added, do you think a supervised classification algorithm might interpret “NaN” for total_payments as a clue that someone is a POI? (Quiz: Missing POIs 6 (Optional))

(Think about your answers from Questions 1-5 to answer this)
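To see the effect in miniature, here is a sketch on a made-up data set (2 POIs and 6 non-POIs with invented payment values, nothing from the real Enron data). It computes the fraction of `"NaN"` total payments within each class before and after adding 10 POIs whose total payments are all `"NaN"`. The helper `nan_rate_by_class` is a name I invented for this illustration.

```python
import pandas as pd

# Toy data (invented): 2 POIs with known payments, 6 non-POIs
# of whom 2 have "NaN" for total_payments.
toy_df = pd.DataFrame(
    {
        "poi": [True, True] + [False] * 6,
        "total_payments": [500, 700, 100, "NaN", 300, "NaN", 200, 400],
    }
)


def nan_rate_by_class(df):
    """Fraction of rows with "NaN" total_payments, per class label."""
    is_missing = df["total_payments"] == "NaN"
    labels = df["poi"].map({True: "POI", False: "non-POI"})
    return is_missing.groupby(labels).mean()


before = nan_rate_by_class(toy_df)

# Add 10 hypothetical POIs with no financial information.
extra = pd.DataFrame({"poi": [True] * 10, "total_payments": ["NaN"] * 10})
after = nan_rate_by_class(pd.concat([toy_df, extra], ignore_index=True))

print(before)
print(after)
```

In this toy example the `"NaN"` rate among POIs jumps from 0 to 10 out of 12, while the rate among non-POIs stays the same. That is the kind of shift a classifier could pick up on, even though it comes from how the data was assembled and says nothing about the people themselves.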