# Lab 4: Functions and Visualizations

Welcome to lab 4! This week, we'll learn about functions and the table method `apply` from [Section 7.1](https://ucsd-dsc10.gitbooks.io/textbook/content/chapters/07/1/applying-a-function-to-a-column.html).  We'll also learn about visualization from [Chapter 6](https://ucsd-dsc10.gitbooks.io/textbook/content/chapters/06/visualization.html).

First, set up the tests and imports by running the cell below.

In [1]:
import numpy as np
from datascience import *

# These lines set up graphing capabilities.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

from client.api.notebook import Notebook
ok = Notebook('lab04.ok')
_ = ok.auth(inline=True)

Assignment: Functions and Visualizations
OK, version v1.14.15

Successfully logged in as ykk@ucsb.edu


## 1. Functions and CEO Incomes

Let's start with a real data analysis task.  We'll look at the 2015 compensation of CEOs at the 100 largest companies in California.  The data were compiled for a Los Angeles Times analysis [here](http://spreadsheets.latimes.com/california-ceo-compensation/), and ultimately came from [filings](https://www.sec.gov/answers/proxyhtf.htm) mandated by the SEC from all publicly-traded companies.  Two companies have two CEOs, so there are 102 CEOs in the dataset.

We've copied the data in raw form from the LA Times page into a file called `raw_compensation.csv`.  (The page notes that all dollar amounts are in millions of dollars.)

In [None]:
raw_compensation = Table.read_table('raw_compensation.csv')
raw_compensation

**Question 1.** We want to compute the average of the CEOs' pay. Try running the cell below.

In [None]:
np.average(raw_compensation.column("Total Pay"))

You should see an error. **Before proceeding, make sure to comment out the line so there's no error!** Let's examine why this error occurred by looking at the values in the "Total Pay" column. Use the `type` function and set `total_pay_type` to the type of the first item from the "Total Pay" column.
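
As a small illustration of what `type` reports, the sketch below compares a number written as text with an actual number. Both values are made up for the example and are not taken from `raw_compensation`.

```python
# Hypothetical values for illustration; they are not from the dataset.
text_value = "3.5"     # looks like a number, but is text
number_value = 3.5     # an actual number

text_type = type(text_value)       # the type of a piece of text
number_type = type(number_value)   # the type of a decimal number

# Arithmetic such as averaging only makes sense for numbers.
average_of_numbers = (number_value + 4.5) / 2

print(text_type, number_type, average_of_numbers)
```

The two values print almost identically, which is why checking the type is often the quickest way to tell them apart.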

In [None]:
total_pay_type = ...
total_pay_type

In [None]:
_ = ok.grade('q1_1')

**Question 2.** You should have found that the values in the "Total Pay" column are strings (text). It doesn't make sense to take the average of text values, so we need to convert them to numbers first. Extract the first value in the "Total Pay" column.  It's Mark Hurd's pay in 2015, in *millions* of dollars.  Call it `mark_hurd_pay_string`.

In [None]:
mark_hurd_pay_string = ...
mark_hurd_pay_string

In [None]:
_ = ok.grade('q1_2')

**Question 3.** Convert `mark_hurd_pay_string` to a number of *dollars*.  The string method `strip` will be useful for removing the dollar sign; it removes the specified characters from the start and end of a string.  For example, the value of `"100%".strip("%")` is the string `"100"`.  You'll also need the function `float`, which converts a string that looks like a number to an actual number.  Last, remember that the answer should be in dollars, not millions of dollars.
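
Here is a minimal sketch of how `strip` and `float` fit together. The string `"$2.5"` and the names below are hypothetical and chosen only for illustration; the pattern, not the values, is the point.

```python
# The example from the text: strip removes "%" from the ends of the string.
percent_text = "100%"
stripped = percent_text.strip("%")    # still a string
as_number = float(stripped)           # now a number

# A made-up amount in millions of dollars (not from the dataset).
price_text = "$2.5"
price_in_millions = float(price_text.strip("$"))
price_in_dollars = price_in_millions * 1000000

print(stripped, as_number, price_in_dollars)
```

Note that `strip` returns a new string and leaves the original unchanged, so its result needs to be used or given a name.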

In [None]:
mark_hurd_pay = ...
mark_hurd_pay

In [None]:
_ = ok.grade('q1_3')

To compute the average pay, we need to do this for every CEO.  But that looks like it would involve copying this code 102 times.

This is where functions come in.  Later in this lab, we'll define a new function, giving a name to the expression that converts "total pay" strings to numeric values.  Then we'll see the payoff: we can call that function on every pay string in the dataset at once.

## 2. Defining functions

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50.  (No percent sign.)

A function definition has a few parts.

##### `def`
It always starts with `def` (short for **def**ine):

    def

##### Name
Next comes the name of the function.  Let's call our function `to_percentage`.
    
    def to_percentage

##### Signature
Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  `to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

    def to_percentage(proportion)

We put a colon after the signature to tell Python it's over.

    def to_percentage(proportion):

##### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing a triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
    
    
##### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function.  We can write anything we could write anywhere else.  First let's give a name to the number we multiply a proportion by to get a percentage.

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        
In Python, everything inside the function body has to be indented (i.e., each line starts with the same leading whitespace, conventionally four spaces).

##### `return`
The special instruction `return` in a function's body tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor
        
If you don't include the `return` statement, you will not be able to use the result of this function.
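
To see what a missing `return` does, the sketch below defines two hypothetical functions that differ only in their last line. A call to a function without `return` evaluates to the special value `None`, so the computed result is lost.

```python
def double_without_return(number):
    """Computes twice the number but forgets to return it."""
    doubled = number * 2


def double_with_return(number):
    """Returns twice the number."""
    doubled = number * 2
    return doubled


lost_result = double_without_return(3)   # None: nothing was returned
kept_result = double_with_return(3)      # 6

print(lost_result, kept_result)
```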

**Question 1.** Define the `to_percentage` function in the cell below.

*Hint:* If you are not sure what to do, read the instructions in the preceding cell carefully.

In [None]:
def ...
    """ ... """
    ... = ...
    return ...

Now, call this function that you defined above to convert the proportion `0.2` to a percentage.  Name the result of this function (the resulting percentage) `twenty_percent`. 

_Quick Aside_: Normally you should not name a variable after the _value_ it holds: a variable's value can change, so its name should reflect its meaning or usage. In this case, since we are writing just a quick test and are not using this variable anywhere else, we'll use this unconventional name.

In [None]:
twenty_percent = ...
twenty_percent

In [None]:
_ = ok.grade('q2_1')

Like the built-in functions, you can use named values (also known as _variables_) as arguments to your function.

**Question 2.** Use `to_percentage` again to convert the proportion named `a_proportion` (defined below) to a percentage called `a_percentage`.

*Note:* You don't need to define `to_percentage` again!  Just like other named things, functions stick around after you define them.

In [None]:
a_proportion = 2**(.5) / 2
a_percentage = ...
a_percentage

In [None]:
_ = ok.grade('q2_2')

Here's something important about functions: the names assigned within a function body are only accessible within the function body. Once the function has returned, those names are gone.  So even though you defined `factor = 100` inside `to_percentage` above and then called `to_percentage`, you cannot refer to `factor` anywhere except inside the body of `to_percentage`:

*Note:* The next cell intentionally causes an error. Keep that in mind if you use Run All!

In [None]:
# You should see an error when you run this.  (If you don't, you might
# have defined factor somewhere above.)
factor

**Before proceeding, comment out the line in the above cell to remove the error!**
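
The scoping rule described above can also be observed without stopping the notebook. The sketch below is an illustration only: the function `scale_up` and the name `multiplier` are hypothetical, and it relies on `try`/`except`, which this lab has not introduced, to catch the error instead of displaying it.

```python
def scale_up(amount):
    """Multiplies an amount by a factor defined inside the function."""
    multiplier = 3          # this name exists only while the body runs
    return amount * multiplier


total = scale_up(4)

# Referring to the local name outside the function raises a NameError.
try:
    multiplier
    multiplier_is_visible = True
except NameError:
    multiplier_is_visible = False

print(total, multiplier_is_visible)
```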

Let's write another function to ensure that you understand what is going on.

**Question 3.** Define a function called ``square``. It should take a single float as its argument. (You can call that argument whatever you want.) It should return the square of the argument.

In [None]:
def ...
    return ...

In [None]:
_ = ok.grade('q2_3')

As we've seen with the built-in functions, functions can also take strings (or arrays, or tables) as arguments, and they can return those things, too.

**Question 4.** Define a function called `concatenate_two_strings`.  It should take two strings as its arguments.  (You can call the arguments whatever you want.)  It should return a string that is the same as joining the two strings end-to-end. For example, `concatenate_two_strings("snow", "ball")` should return `"snowball"`.

*Hint:* Remember that `+` works on strings too! `"snow" + "ball"` will have the result `"snowball"`.

In [None]:
def ...(..., ...):
    ...
    
# An example call to your function.  (It's often helpful to run
# an example call from time to time while you're writing a function,
# to see how it currently works.)
concatenate_two_strings("snow", "ball")

In [None]:
_ = ok.grade('q2_4')

##### Calls on calls on calls
Just as you write a series of lines to build up a complex computation, it's useful to define a series of small functions that build on each other.  Since you can write any code inside a function's body, you can call other functions you've written.

If a function is like a recipe, defining a function in terms of other functions is like having a recipe for cake telling you to follow another recipe to make the frosting, and another to make the sprinkles.  This makes the cake recipe shorter and clearer, and it avoids having a bunch of duplicated frosting recipes.  It's a foundation of productive programming.
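
As a sketch of the recipe idea, the two hypothetical functions below build on each other: `box_volume` does not repeat the multiplication for the base, it calls `rectangle_area` instead. The names and numbers are made up for illustration.

```python
def rectangle_area(width, height):
    """Returns the area of a rectangle."""
    return width * height


def box_volume(width, height, depth):
    """Returns the volume of a box, reusing rectangle_area for the base."""
    return rectangle_area(width, height) * depth


volume = box_volume(2, 3, 4)
print(volume)
```

If the way an area is computed ever needs to change, only `rectangle_area` has to be edited.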

For example, suppose you want to count the total number of characters in three strings. One way to do that is to concatenate all of them and count the size of the resulting string.

**Question 5.** Write a function called `len_three_strings`.  It should take three strings as its arguments. (You can call them whatever you like.) Inside the definition of the function, you should make two function calls to `concatenate_two_strings`. Finally, `len_three_strings` should return a number: the total number of characters in the three strings.

*Hint:* The function `len` takes a string as its argument and returns the number of characters in it.

In [None]:
def ...
    ...

In [None]:
_ = ok.grade('q2_5')

Let's now combine what we learned about functions with tables.

The `movies_by_year` dataset in the textbook has information about movie sales in recent years. Suppose you'd like to get the year with the 5th-highest total gross movie sales. You might do this:

In [None]:
movies_by_year = Table.read_table("movies_by_year.csv")
rank = 5
fifth_from_top_movie_year = movies_by_year.sort("Total Gross", descending=True).column("Year").item(rank-1)
fifth_from_top_movie_year

After writing this, you realize you also wanted to print out the 2nd and 3rd-highest years.  Instead of copying your code, you decide to put it in a function.  Since the rank varies, you make that an argument to your function.
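
The same move, turning a fixed value into an argument, can be sketched on a plain Python list rather than a table. Everything here is hypothetical: the function `kth_largest`, the list `scores`, and the use of the built-in `sorted` in place of the table method `sort`.

```python
def kth_largest(values, k):
    """Returns the kth-largest item of a list (k = 1 means the largest)."""
    ordered = sorted(values, reverse=True)   # largest first
    return ordered[k - 1]                    # positions start at 0


scores = [70, 95, 82, 60, 88]   # made-up numbers

second = kth_largest(scores, 2)
third = kth_largest(scores, 3)
print(second, third)
```

Because the rank is an argument, asking for a different rank means changing one number in the call rather than copying the whole expression.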

**Question 6.** Write a function called `get_kth_top_movie_year`.  It should take a single argument, the rank of the year (like 2, 3, or 5 in the above examples).  It should return the year with the kth-highest total gross movie sales.

In [None]:
def ...
    # Our solution used 2 lines.
    ...
    ...

# Example calls to your function:
get_kth_top_movie_year(3)

In [None]:
_ = ok.grade('q2_6')

## 3. `apply`ing functions

Defining a function is a lot like giving a name to a value with `=`.  In fact, a function is a value just like the number 1 or the text "the"!

For example, we can make a new name for the built-in function `max` if we want:

In [None]:
our_name_for_max = max
our_name_for_max(2, 6)

The old name for `max` is still around:

In [None]:
max(2, 6)

Try just writing `max` or `our_name_for_max` (or the name of any other function) in a cell, and run that cell.  Python will print out a (very brief) description of the function.

In [None]:
max

Why is this useful?  Since functions are just values, it's possible to pass them as arguments to other functions.  Here's a simple but not-so-practical example: we can make an array of functions.

In [None]:
make_array(max, np.average, are.equal_to)

**Question 1.** Make an array containing any 3 other functions you've seen.  Call it `some_functions`.

In [None]:
some_functions = ...
some_functions

In [None]:
_ = ok.grade('q3_1')

Working with functions as values can lead to some funny-looking code.  For example, see if you can figure out why this works:

In [None]:
make_array(max, np.average, are.equal_to).item(0)(4, -2, 7)
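
One way to read an expression like the one above is in stages: first pick a function out of a collection, then call it. The sketch below shows the stages separately. It uses a plain Python list and built-in functions as a stand-in for `make_array`, which is an assumption made to keep the example self-contained.

```python
# A collection whose items are functions, not the results of calling them.
functions = [max, min, abs]

# Stage 1: pick out the first function. No call happens yet.
chosen = functions[0]

# Stage 2: call the function that was picked.
result = chosen(4, -2, 7)

# Both stages written as one expression.
same_result = functions[0](4, -2, 7)

print(result, same_result)
```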

Next we will see some more practical examples of working with functions as values. But first, let's resolve some unfinished business. Let's define a function that converts a pay string that looks like "$100.00" in millions of dollars to a float in dollars.

**Question 2.** Define a function called `convert_pay_string_to_number`. It should take a single argument, which is a string that looks like "$100.00" in millions of dollars, and return a float which is the value of the pay in dollars.

*Hint:* Remember what you did in Question 3 of Part 1?

In [None]:
def ...
    """Converts a pay string like '$100' (in millions) to a number of dollars."""
    ...

In [None]:
_ = ok.grade('q1_4')

Let's try using our function `convert_pay_string_to_number`.

In [None]:
convert_pay_string_to_number('$42')

In [None]:
convert_pay_string_to_number(mark_hurd_pay_string)

In [None]:
# We can also compute Safra Catz's pay in the same way:
convert_pay_string_to_number(raw_compensation.where("Name", are.containing("Safra")).column("Total Pay").item(0))

Next, as promised, we will learn how to use this function on every element in a column of a table at once. Let's take a look at the table method `apply`.

`apply` calls a function many times, once on *each* element in a column of a table.  It produces an array of the results.  Here we use `apply` to convert every CEO's pay to a number, using the function you defined:

In [None]:
raw_compensation.apply(convert_pay_string_to_number, "Total Pay")

Here's an illustration of what that did:

<img src="apply.png"/>

Note that we didn't write something like `convert_pay_string_to_number()` or `convert_pay_string_to_number("Total Pay")`.  The job of `apply` is to call the function we give it, so instead of calling `convert_pay_string_to_number` ourselves, we just write its name as an argument to `apply`.
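
To make the idea concrete, here is a rough sketch of what a function like `apply` does with the function it receives. This is not the `datascience` library's implementation: `apply_to_each` and `add_exclamation` are hypothetical, and a plain list stands in for a table column.

```python
def apply_to_each(function, values):
    """Calls function once per value and collects the results in a list."""
    return [function(value) for value in values]


def add_exclamation(text):
    """Returns the text with an exclamation mark added."""
    return text + "!"


# The function is passed by name, without parentheses after it.
shouted = apply_to_each(add_exclamation, ["hi", "wow"])
print(shouted)
```

Writing `add_exclamation()` in the call would instead try to call the function right away, with no argument, before `apply_to_each` ever ran.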

**Question 3.** Using `apply`, make a table that's a copy of `raw_compensation` with one more column called "Total Pay (\$)".  It should be the result of applying `convert_pay_string_to_number` to the "Total Pay" column, as we did above.  Call the new table `compensation`.

In [None]:
compensation = raw_compensation.with_column(
    "Total Pay ($)",
    ...)
compensation

In [None]:
_ = ok.grade('q3_2')

Now that we have the pay in numbers, we can compute things about them.

**Question 4.** Compute the average total pay of the CEOs in the dataset.

In [None]:
average_total_pay = ...
average_total_pay

In [None]:
_ = ok.grade('q3_3')

**Question 5.** Companies pay executives in a variety of ways: directly in cash; by granting stock or other "equity" in the company; or with ancillary benefits (like private jets).  Compute the proportion of each CEO's pay that was cash.  (Your answer should be an array of numbers, one for each CEO in the dataset.)


*Note:* You will see a warning pop up in a pink box.  DON'T BE ALARMED!!!!!!!  We'll get to that in a sec.

In [None]:
cash_proportion = ...
cash_proportion

In [None]:
_ = ok.grade('q3_4')

So what was that warning about?  It says something about "`true_divide`."  As a hint, let's look at the last rows of our `compensation` table.

In [None]:
compensation.take[-5:]

Notice anything strange? 

**Question 6.** Why did we get the "`true_divide`" warning from above?  Assign either 1, 2, 3, or 4 to the name `apply_q5` below. 
1. The proportion would be 0.  Python decides that only nonzero numbers are worth calculating.
2. The `Ratio of CEO pay to average industry worker pay` rounds to 0, so Python knows something is strange.
3. The calculation is dividing by 0, so Python doesn't know what to do.
4. The proportion is a `nan` (not-a-number), and of course Python can't divide a `nan`.

In [None]:
apply_q5 = ...

In [None]:
_ = ok.grade('q3_5')

A lot of real data is messy.  It might contain zeros, empty values, or `nan`'s (not-a-number) that we need to watch out for when performing calculations! 
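
Why special values need care can be seen in plain Python, without any table. The sketch below is an illustration under stated assumptions: it uses Python's built-in `/` on ordinary floats, which raises an error on division by zero (NumPy array division instead issues a warning and produces `nan` or `inf`), and the value `missing` is made up.

```python
import math

# Plain Python refuses to divide by zero.
try:
    0.0 / 0.0
    outcome = "no error"
except ZeroDivisionError:
    outcome = "ZeroDivisionError"

# A nan value quietly spreads through later arithmetic.
missing = float("nan")
propagated = missing + 1
propagated_is_nan = math.isnan(propagated)

# nan is not equal to anything, not even itself.
equals_itself = (missing == missing)

print(outcome, propagated_is_nan, equals_itself)
```

Because `nan` never compares equal to anything, `math.isnan` (or `np.isnan` for arrays) is the usual way to detect it.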

If you're interested in going deeper, try comparing certain outputs of NumPy's version of division, `np.divide`, versus the default Python division operator, `/`.

In [None]:
print(np.divide(... , ...))
print(... / ...)

**Before proceeding, if there are any errors, remember to fix the code in the cell above so that there are no errors!**

Alright, back to our CEOs.  Check out the "% Change" column in `compensation`.  It shows the percentage increase in the CEO's pay from the previous year.  For CEOs with no previous year on record, it instead says "(No previous year)".  The values in this column are *strings*, not numbers, so like the "Total Pay" column, it's not usable without a bit of extra work.

Given your current pay and the percentage increase from the previous year, you can compute your previous year's pay.  For example, if your pay is 100 dollars this year, and that's an increase of 50 percent from the previous year, then your previous year's pay was $\frac{100}{1 + \frac{50}{100}}$ dollars, or around 66.67 dollars.
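
The arithmetic above can be written as a small function. This is a sketch only: the name `previous_pay` is hypothetical, and it assumes the percentage has already been converted from a string to a number, which the "% Change" column does not do for you.

```python
def previous_pay(current_pay, percent_increase):
    """Returns last year's pay, given this year's pay and the percent increase."""
    return current_pay / (1 + percent_increase / 100)


# The example from the text: 100 dollars after a 50 percent increase.
example = previous_pay(100, 50)

# A second made-up case that comes out even: 150 dollars after 50 percent.
even_case = previous_pay(150, 50)

print(round(example, 2), even_case)
```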

**Question 7.** Create a new table called `with_previous_compensation`.  It should be a copy of `compensation`, but with the "(No previous year)" CEOs filtered out, and with an extra column called "2014 Total Pay ($)".  That column should have each CEO's pay in 2014.

*Hint:* This question takes several steps, but each one is still something you've seen before.  Take it one step at a time, using as many lines as you need.  You can print out your results after each step to make sure you're on the right track.

*Hint 2:* You'll need to define a function.  You can do that just above your other code.

In [None]:
# For reference, our solution involved more than just this one line of code
...

with_previous_compensation = ...
with_previous_compensation

In [None]:
_ = ok.grade('q3_6')

**Question 8.** What was the average pay of these CEOs in 2014?

In [None]:
average_pay_2014 = ...
average_pay_2014

In [None]:
_ = ok.grade('q3_7')

**Question 9.** According to this [data](https://www1.salary.com/Safra-A-Catz-Salary-Bonus-Stock-Options-for-Oracle-Corp.html), Safra Catz made a total of \$108,282,333 in fiscal year 2018. Earlier, we computed how much Safra Catz made in 2015. What is the growth rate of Safra Catz's total pay from 2015 to 2018?

*Hint:* The formula for growth rate can be found [here](https://www.inferentialthinking.com/chapters/03/2/1/Growth.html).
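
As a sketch, assuming the usual definition of growth rate as the change divided by the initial value, the calculation looks like this. The function name `growth_rate` and the numbers are hypothetical and unrelated to the CEO data.

```python
def growth_rate(initial, changed):
    """Returns the growth rate from initial to changed, as a proportion."""
    return (changed - initial) / initial


# Made-up values: growing from 50 to 125 is a growth rate of 1.5 (150%).
rate = growth_rate(50, 125)

# A decrease gives a negative growth rate.
shrink = growth_rate(200, 150)

print(rate, shrink)
```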

In [None]:
safra_catz_pay_2015 = ...
safra_catz_pay_2018 = ...
safra_catz_pay_growth_rate = ...
safra_catz_pay_growth_rate

In [None]:
_ = ok.grade('q3_9')

This is an increase of over 100%!

## 4. Histograms
Earlier, we computed the average pay among the CEOs in our 102-CEO dataset.  The average doesn't tell us everything about the amounts CEOs are paid, though.  Maybe just a few CEOs make the bulk of the money, even among these 102.

We can use a *histogram* to display more information about a set of numbers.  The table method `hist` takes a single argument, the name of a column of numbers.  It produces a histogram of the numbers in that column. Read more about histograms [here](https://www.inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html).

**Question 1.** Make a histogram of the pay of the CEOs in `compensation`.

In [None]:
...

Before proceeding, we highly recommend that you carefully read [the section on histograms](https://www.inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html#Differences-Between-Bar-Charts-and-Histograms). After reading, you should understand that a histogram is very different from a bar chart: the **area** of each bar, not its height, is proportional to the number of entries in the **bin**. If you are unclear about the terminology in the previous sentence, read it again!
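
The area rule can be turned into a short calculation. The sketch below assumes the vertical axis is measured in percent per unit, as in the textbook's histograms. The function `count_in_bin` and all of the numbers are hypothetical.

```python
def count_in_bin(height_percent_per_unit, bin_width, total_entries):
    """Estimates how many entries fall in one bin of a histogram."""
    # Area of the bar = height * width = percent of all entries in the bin.
    percent_in_bin = height_percent_per_unit * bin_width
    return percent_in_bin * total_entries / 100


# A bar of height 4 (percent per unit) over a bin of width 5 covers 20%
# of the data; with 200 entries in total, that is 40 entries.
estimated = count_in_bin(4, 5, 200)
print(estimated)
```

A bar that is twice as wide but equally tall therefore represents twice as many entries.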

**Question 2.** Looking at the histogram, how many CEOs made more than \$30 million?  (Answer the question by filling in your answer manually.  You'll have to do a bit of arithmetic; feel free to use Python as a calculator.)

In [None]:
num_ceos_more_than_30_million = ...

**Question 3.** Answer the same question with code.  *Hint:* Use the table method `where` and the property `num_rows`.

In [None]:
num_ceos_more_than_30_million_2 = ...
num_ceos_more_than_30_million_2

In [None]:
_ = ok.grade('q4_3')

Run the next cell if you want to see how far off you were.

In [None]:
percent_diff = abs(num_ceos_more_than_30_million - num_ceos_more_than_30_million_2) / num_ceos_more_than_30_million_2
print("Your guess was only", percent_diff * 100, "% off!")

Great job! :D You're finished with lab 4! Be sure to...

* **run all the tests** (the next cell has a shortcut for that), and
* **run the last cell to submit your work**.


In [2]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
 > Suite 1 > Case 1

>>> str(mark_hurd_pay_string)
NameError: name 'mark_hurd_pay_string' is not defined

# Error: expected
#     '$53.25 '
# but got
#     Traceback (most recent call last):
#       ...
#     NameError: name 'mark_hurd_pay_string' is not defined

Run only this test case with "python3 ok -q q1_2 --suite 1 --case 1"
---------------------------------------------------------------------
Test summary
    Passed: 0
    Failed: 1
[k..........] 0.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
 > Suite 1 > Case 1

>>> import math
>>> math.isclose(average_total_pay, 11445294.117647059, rel_tol = 1)
NameError: name 'average_total_pay' is not defined

# Error: expected
#     True
# but got
#     Traceback (most 

In [3]:
_ = ok.submit()

Saving notebook... Saved 'lab04.ipynb'.
Submit... 100% complete
Submission successful for user: ykk@ucsb.edu
URL: https://okpy.org/ucsb/int5/fa19/lab04/submissions/gp8JRG

