In [None]:
import otter
grader = otter.Notebook()

# Lab 4: Defining Functions

Welcome to lab 4! This week, we'll learn about functions and the table method `apply` from [Applying a Function to a Column](https://www.inferentialthinking.com/chapters/08/1/applying-a-function-to-a-column.html).  

First, set up the tests and imports by running the cell below.

In [None]:
import numpy as np
from datascience import *

# These lines set up graphing capabilities.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## 1. Functions and CEO Incomes

Let's start with a real data analysis task.  We'll look at the 2015 compensation of CEOs at the 100 largest companies in California.  The data were compiled for a Los Angeles Times analysis [here](http://spreadsheets.latimes.com/california-ceo-compensation/), and ultimately came from [filings](https://www.sec.gov/answers/proxyhtf.htm) mandated by the SEC from all publicly-traded companies.  Two companies have two CEOs, so there are 102 CEOs in the dataset.

We've copied the data in raw form from the LA Times page into a file called `raw_compensation.csv`.  (The page notes that all dollar amounts are in millions of dollars.)

In [None]:
raw_compensation = Table.read_table('raw_compensation.csv')
raw_compensation

**Question 1.1.** We want to compute the average of the CEOs' pay. Try running the cell below.

In [None]:
np.average(raw_compensation.column("Total Pay"))

You should see a TypeError. **Before proceeding, make sure to comment out the line so there's no error!** Let's examine why this error occurred by looking at the values in the "Total Pay" column. Use the `type` function and set `total_pay_type` to the type of _the first item_ from the "Total Pay" column.

In [None]:
total_pay_type = ...
total_pay_type

In [None]:
grader.check("q1_1")

**Question 1.2.** You should have found that the values in the "Total Pay" column are strings (text). It doesn't make sense to take the average of text values, so we need to convert them to numbers first. Extract the first value in the "Total Pay" column.  It's Mark Hurd's pay in 2015, in *millions* of dollars.  Call it `mark_hurd_pay_string`.

In [None]:
mark_hurd_pay_string = ...
mark_hurd_pay_string

In [None]:
grader.check('q1_2')

**Question 1.3.** Convert `mark_hurd_pay_string` to a number of *dollars*.  The string method `strip` will be useful for removing the dollar sign; it removes a specified character from the start or end of a string.  For example, the value of `"100%".strip("%")` is the string `"100"` (because `strip` will remove the `%` that's at the end).  

You'll also need the function `float`, which converts a string that looks like a number to an actual number.  Last, remember that the answer should be in *dollars*, **not** millions of dollars.
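As a small illustration of the two tools described above, here is a sketch that uses the `"100%"` example from the text. The variable names are made up for this example and are not part of the lab.

```python
# strip removes the given character only from the two ends of a string.
percent_text = "100%"
stripped_text = percent_text.strip("%")   # the string "100"
inner_kept = "%10%0%".strip("%")          # the % in the middle stays

# float turns a string that looks like a number into an actual number.
as_number = float(stripped_text)

print(stripped_text, inner_kept, as_number)
```

Note that `stripped_text` is still a string; only the call to `float` produces a number you can do arithmetic with.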

In [None]:
mark_hurd_pay = ...
mark_hurd_pay

In [None]:
grader.check('q1_3')

To compute the average pay, we need to do this for every CEO.  But that looks like it would involve copying this code 102 times.

Any time you want to copy/paste something a bunch of times, don't do it!

This is where functions come in.  Later in this lab, we'll define a new function, giving a name to the expression that converts "total pay" strings to numeric values.  Then we'll see the payoff: we can call that function on every pay string in the dataset at once.

For now, let's learn how to create, or as commonly said, *define*, new functions.

## 2. Defining functions

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50.  (No percent sign.)

A **function definition** has a few parts:
* a keyword `def`
* a name of the function
* a function "signature"
* highly recommended documentation of the function
* a body of the function
* a `return` statement.

Let's look closely at each part.

#### `def`
It always starts with `def` (short for **def**ine):

    def

#### Name
Next comes the name of the function. When you are creating functions, you get to name them. Remember that meaningful function names make your code easier to read and understand.  Let's call our function `to_percentage`.
    
    def to_percentage

#### Signature / declaration
Next comes something called the *signature* of the function.  It tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  

> *Definition alert*: 
> * **Parameter** is a variable defined in the function signature/declaration.
> * **Argument** is the *actual value* of this variable that gets passed to the function during the *function call*. 

`to_percentage` should take one argument, and we'll call that argument `proportion` since it should represent a proportion that we want to convert.

    def to_percentage(proportion)
    
*Note:* it is up to us what to call the parameters; just like with the other variables that we create, we get to name the parameters in the function signature.
    
If we had more parameters, we would separate them with commas. Note that the parameters are always included within the opening `(` and closing `)` set of parentheses.

We put a colon after the signature to tell Python that we are finished with the signature.

    def to_percentage(proportion):

#### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing a triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""

The triple-quoted string can span multiple lines. For example, we could have also written it as follows (helpful when you have a more complicated signature):

    def to_percentage(proportion):
        """
        Converts a proportion to a percentage.
        Input: proportion to convert
        Return: percentage
        """


    
#### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function and every line **must be indented with a tab**.  Any lines that are *not* indented and left-aligned with the `def` statement are considered to be *outside the function*. 

Some notes about the body of the function:
- We can write code that we would write anywhere else.  
- We use the _parameters_ defined in the function signature. We can do this because we assume that when we call the function, values (which we call *arguments*) are already assigned to those parameters.
- We generally avoid referencing variables defined *outside* the function. If you would like to reference variables outside of the function, pass them through as arguments!


Now, let's give a name to the number we multiply a proportion by to get a percentage:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100

#### `return`
The special instruction `return` is part of the function's body and tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 multiplied by the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor
        
`return` only makes sense in the context of a function, and **can never be used outside of a function**. `return` is usually the last line of a function's body, because Python stops executing the body as soon as it reaches a `return` statement.

*Side note:* a more complicated function can contain multiple `return` statements, but in this class, we will constrain ourselves to a single `return` at the end of the function body.

**Important:**  `return` inside a function tells Python what value the function evaluates to. However, there are other functions, like `print`, that have no `return` value. For example, `print` simply prints the provided arguments to the console. 

`return` and `print` are **very** different. Do not use them interchangeably!

*Note:* If a function does not have an explicit `return` statement, then Python automatically returns `None` to indicate that there was no value to return.

-----

That's it! We just finished defining a new function called `to_percentage`. This name now refers to the set of steps that we included in the function body, and just like other named things, functions stick around after you define them.

Now, in order to use/execute `to_percentage`, we need to **_call_ this function** and provide some specific _values_ as its **input arguments**, which will be substituted for the **parameters** that we defined in the function signature.

For example, to convert 0.99 using our function, we would call it as follows: 
`to_percentage(.99)`. If we want to save the result of this function call, we would need to assign its return value to a variable, which you'll get to do in the example below.
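To see parameters and arguments side by side in a different setting, here is a minimal sketch. The function `price_with_tax` and the numbers are hypothetical and not part of this lab.

```python
def price_with_tax(price, tax_rate):      # parameters: price and tax_rate
    """Returns the price after adding tax at the given rate."""
    tax = price * tax_rate
    return price + tax

# A value defined outside the function is handed in as an argument
# rather than being referenced directly inside the body.
sales_tax_rate = 0.25

# The arguments 80 and sales_tax_rate are substituted for the parameters,
# and the return value is saved under the name total.
total = price_with_tax(80, sales_tax_rate)
print(total)
```

The call itself is an expression, so its value can be assigned to a name just like the result of any built-in function.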

#### Define `to_percentage`

**Question 2.1.** Define the `to_percentage` function in the cell below.

*Hint:* If you are not sure what to do, read the instructions in the preceding cell carefully.

In [None]:
def ...
    """ ... """
    ... = ...
    return ...

Now, call this function that you defined above to convert the proportion `0.2` to a percentage.  Name the result of this function (the resulting percentage) `twenty_percent`. 

*Quick Aside:* Note that you should not name your variable based on the _value_ it holds: variable values can change, so its name should reflect the meaning or usage of the variable. In this case, since we are writing just a quick test, and we are not using this variable anywhere else, we'll use this unconventional name.

In [None]:
twenty_percent = ...
twenty_percent

In [None]:
grader.check('q2_1')

Just as with the built-in functions, you can use named values (also known as *variables*) as arguments to your function.

**Question 2.2.** Use `to_percentage` again to convert the proportion named `a_proportion` (defined below) to a percentage called `a_percentage`.

*Note:* You don't need to define `to_percentage` again!  Just like other named things, functions stick around after you define them.

In [None]:
a_proportion = 2**(.5) / 2
a_percentage = ...
a_percentage

In [None]:
grader.check('q2_2')

Here's something important about functions: **the names assigned within a function body are only accessible within the function body**. Once the function has returned, those names are gone.  So even though you defined `factor = 100` inside `to_percentage` above and then called `to_percentage`, you cannot refer to `factor` anywhere except inside the body of `to_percentage`:

*Note:* Keep in mind that this cell intentionally causes an error if you want to use Run All!

In [None]:
# You should see an error when you run this.  (If you don't, 
# you might have defined factor somewhere above.)
factor

**Before proceeding, comment out the line in the above cell to remove the error!**
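If you would like to check the scoping rule without stopping the notebook, one option is to catch the error. This is only a sketch: the function `to_fraction` is made up for the example, and `try`/`except` is a Python feature we have not covered in this class.

```python
def to_fraction(percentage):
    """Converts a percentage to a proportion."""
    divisor = 100          # this name exists only inside the function body
    return percentage / divisor

half = to_fraction(50)

# Looking up the local name from outside the function raises a NameError.
try:
    divisor
    lookup_failed = False
except NameError:
    lookup_failed = True

print(half, lookup_failed)
```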

Let's write another function to ensure that you understand what is going on.

#### Define `square`

**Question 2.3.** Define a function called ``square``. It should take a single float as its argument. (You can call that argument whatever you want.) It should return the square of the argument.

*Hint:* In the previous labs, you have seen a couple of different ways to square a number: some included an exponentiation operator, others used a built-in function.

In [None]:
def ...
    return ...

In [None]:
grader.check('q2_3')

As we've seen with the built-in functions, functions can also take strings (or arrays, or tables) as arguments, and they can return those things, too.

#### Define a function for strings

**Question 2.4.** Define a function called `concatenate_two_strings`.  It should take two strings as its arguments.  (You can call the arguments whatever you want.)  It should return a string that is the same as joining the two strings end-to-end. For example, `concatenate_two_strings("snow", "ball")` should return the string `"snowball"`.

*Hint:* Remember that `+` works on strings too! 

In [None]:
def ...(..., ...):
    ...
    
# An example call to your function.  (It's often helpful to run
# an example call from time to time while you're writing a function,
# to see how it currently works.)
concatenate_two_strings("snow", "ball")

In [None]:
grader.check('q2_4')

#### Calls on calls on calls
Just as you write a series of lines to build up a complex computation, it's useful to define a series of small functions that build on each other.  Since you can write any code inside a function's body, you can call other functions you've written.

If a function is like a recipe, defining a function in terms of other functions is like having a recipe for cake telling you to follow another recipe to make the frosting, and another to make the sprinkles.  This makes the cake recipe shorter and clearer, and it avoids having a bunch of duplicated frosting recipes.  It's a foundation of productive programming.
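Here is a minimal sketch of one function built out of another. The names `double` and `quadruple` are hypothetical and are not used elsewhere in this lab.

```python
def double(number):
    """Returns twice the given number."""
    return number * 2

def quadruple(number):
    """Returns four times the given number by doubling it twice."""
    return double(double(number))

print(quadruple(3))
```

The inner call `double(number)` is evaluated first, and its return value becomes the argument of the outer call.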

For example, suppose you want to count the total number of characters in three strings. One way to do that is to concatenate all of them and count the size of the resulting string.

**Question 2.5.** Write a function called `len_three_strings`.  It should take three strings as its arguments. (You can call them whatever you like.) Inside the definition of the function, you should make _two function calls_ to `concatenate_two_strings`. Finally, `len_three_strings` should return a number, which is the total number of characters in the three strings.

*Hint:* The function `len` takes a string as its argument and returns the number of characters in it.

In [None]:
def ...
    ...

In [None]:
grader.check('q2_5')

### Using functions with tables

Let's now combine what we learned about functions with tables.

Functions can also encapsulate code that *displays output* instead of computing a value. For example, if you call `print` inside a function, and then call that function, something will get printed.

The `movies_by_year` dataset in the textbook has information about movie sales in recent years.  Suppose you'd like to display the year with the 5th-highest total gross movie sales, printed in a human-readable way.  You might do this:

In [None]:
movies_by_year = Table.read_table("movies_by_year.csv")
rank = 5
fifth_from_top_movie_year = movies_by_year.sort("Total Gross", descending=True).column("Year").item(rank-1)
print("Year number", rank, "for total gross movie sales was:", fifth_from_top_movie_year)

After writing this, you realize you also wanted to print out the 2nd and 3rd-highest years.  Instead of copying your code, you decide to put it in a function.  Since the rank varies, you make that an argument to your function.

**Question 2.6.** Write a function called `print_kth_top_movie_year`.  It should take a single argument, the rank of the year (like 2, 3, or 5 in the above examples).  It prints out the year with the kth-highest total gross movie sales.

Note that you can write any number of lines of code in the function body. Don't be restricted by our template!

In [None]:
def print_kth_top_movie_year(k):
    movies_by_year = Table.read_table("movies_by_year.csv")
    ...
    print(...)

# Example calls to your function:
print_kth_top_movie_year(2)
print_kth_top_movie_year(3)

In [None]:
grader.check('q2_6')

### `print` is not the same as `return`
The `print_kth_top_movie_year(k)` function prints the year with the kth-highest total gross movie sales! However, since the function does not return any value, we cannot use that year in later computations after we call it. Let's look at an example of another function that prints a value but does not return it.

In [None]:
def print_number_five():
    print(5)

In [None]:
print_number_five()

However, if we try to use the output of `print_number_five()`, we see that the value `5` is printed but we get a TypeError when we try to add the number 2 to it!

In [None]:
print_number_five_output = print_number_five()
print_number_five_output + 2

**Before proceeding, make sure to comment out the line above so there's no error**

It may seem that `print_number_five()` is returning a value, 5. In reality, it just displays the number 5 to you without giving you the actual value! If your function prints out a value without returning it and you try to use that value, you will run into errors, so be careful!
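For contrast, here is a sketch of two functions that look similar when called in a cell but behave differently when their results are used. Both names are hypothetical and are not part of the lab.

```python
def print_number_six():
    print(6)        # displays 6, but the call itself evaluates to None

def return_number_six():
    return 6        # the call evaluates to the number 6

printed_result = print_number_six()
returned_result = return_number_six()

# Only the returned value can be used in further arithmetic.
print(returned_result + 2)
print(printed_result is None)
```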

If you are curious to see what's stored inside the `print_number_five_output` you can print its value, just like you would print the contents of any other variable. If you run into errors and see this value in your output, you might want to double-check if you are using `print` when you needed to use `return` instead.

In [None]:
### Reveal what's stored inside the print_number_five_output variable
...

## 3. `apply`ing functions

### 3.1 Function names as arguments

Defining a function is a lot like giving a name to a value with `=`.  In fact, a function is a value (an object, to be exact) just like the number 1 or the text "the"!

For example, we can make a new name for the built-in function `max` if we want:

In [None]:
our_name_for_max = max
our_name_for_max(2, 6)

The old name for `max` is still around:

In [None]:
max(2, 6)

Try just writing `max` or `our_name_for_max` (or the name of any other function) in a cell, and run that cell.  Python will print out a (very brief) description of the function.

In [None]:
max

Now try writing `?max` or `?our_name_for_max` (or the name of any other function) in a cell, and run that cell.  An information box should show up at the bottom of your screen with a longer description of the function.

In [None]:
?our_name_for_max

Let's look at what happens when we set `max` to a non-function value. You'll notice that a `TypeError` will occur when you try calling `max`. Things like **integers and strings are not callable**. Look out for any functions that might have been renamed when you encounter this type of error. Such errors might also occur when you think that you are calling a function when, in fact, the name refers to a different value.

In [None]:
max = 6
max(2, 6)

**Before proceeding, make sure to comment out the line so there's no error**

In [None]:
# This cell resets max to the built-in function. 
# Just run this cell, don't change its contents
import builtins
max = builtins.max

Why is this useful?  Since functions are just values, _it's possible to pass them as arguments to other functions_.  Here's a simple but not-so-practical example: we can make an array of functions.

In [None]:
make_array(max, np.average, are.equal_to)

**Question 3.1.** Make an array containing any 3 other functions you've seen.  Call it `some_functions`.

In [None]:
some_functions = ...
some_functions

In [None]:
grader.check('q3_1')

Working with functions as values can lead to some funny-looking code.  For example, see if you can figure out why this works:

In [None]:
make_array(max, np.average, are.equal_to).item(0)(4, -2, 7)

Next we will see some more practical examples of working with functions as values. But first, let's resolve some unfinished business. Let's define a function that converts a pay string that looks like "$100.00" in millions of dollars to a float in dollars.

**Question 3.2.** Define a function called `convert_pay_string_to_number`. It should take a single argument, which is a string that looks like "$100.00" in millions of dollars, and return a float which is the value of the pay in dollars.

*Hint:* Remember what you did in Question 1.3?

In [None]:
def ...
    """Converts a pay string like '$100' (in millions) to a number of dollars."""
    ...

In [None]:
grader.check('q1_4')

Let's try using our function `convert_pay_string_to_number`.

In [None]:
convert_pay_string_to_number('$42')

In [None]:
convert_pay_string_to_number(mark_hurd_pay_string)

In [None]:
# We can also compute Safra Catz's pay in the same way:
convert_pay_string_to_number(raw_compensation.where("Name", are.containing("Safra")).column("Total Pay").item(0))

### 3.2 `apply`ing functions to table columns

Next, as promised, we will learn how to use this function on every element in a column of a table at once. Let's take a look at the table method `apply`.

> `apply` calls a function many times, once on *each* element in a column of a table.  It produces/returns an array of the results.  

In the example below, the table that we are working with is `raw_compensation` and we use `apply` on the column `"Total Pay"` to convert every CEO's pay to a number, using the `convert_pay_string_to_number` function you defined:

In [None]:
raw_compensation.apply(convert_pay_string_to_number, "Total Pay")

Here's an illustration of what that did:

<img src="apply.png"/>

Note that we didn't write something like `convert_pay_string_to_number()` or `convert_pay_string_to_number("Total Pay")`.  The job of `apply` is to call the function we give it, so instead of calling `convert_pay_string_to_number` ourselves, we **just write its name as an argument** to `apply`.
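To make the idea of passing a function's name concrete, here is a plain-Python sketch of roughly what `apply` does. It is not the actual implementation in the `datascience` library: the helper `apply_to_each` is hypothetical, it returns a list instead of an array, and it uses a `for` loop, which we have not covered yet.

```python
def apply_to_each(function, values):
    """Calls function once on each value and collects the results."""
    results = []
    for value in values:
        results.append(function(value))   # the helper makes the call for us
    return results

def count_characters(text):
    """Returns the number of characters in a string."""
    return len(text)

# We pass the function's name, with no parentheses after it.
lengths = apply_to_each(count_characters, ["Oracle", "Apple", "Intel"])
print(lengths)
```

Writing `count_characters()` here instead would try to call the function right away with no argument, which is an error.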

**Question 3.3.** Using `apply`, make a table that's a copy of `raw_compensation` with one more column called "Total Pay (\$)".  It should be the result of applying `convert_pay_string_to_number` to the "Total Pay" column, as we did above.  Call the new table `compensation`.

*Note:* If you are still confused as to why you need to use "the result of applying `convert_pay_string_to_number` to the "Total Pay" column", ask yourself:
1. how can I make a new table using `with_column`? what does `with_column` expect as input?
1. what does `apply` return?
1. how do I put them together?

Don't hesitate to ask a mentor for clarification or an additional explanation.

In [None]:
compensation = raw_compensation.with_column(
    "Total Pay ($)",
    ...)
compensation

In [None]:
grader.check('q3_2')

Now that we have the pay in numbers, we can compute things about them.

**Question 3.4.** Compute the average total pay of the CEOs in the dataset.

In [None]:
average_total_pay = ...
average_total_pay

In [None]:
grader.check('q3_3')

**Question 3.5.** Companies pay executives in a variety of ways: directly in cash; by granting stock or other "equity" in the company; or with ancillary benefits (like private jets).  Compute the proportion of each CEO's pay that was cash.  (Your answer should be an array of numbers, one for each CEO in the dataset.)

*Hint:* Take a look at the other columns in the `compensation` table and see if any of them would be useful.

First, apply the `convert_pay_string_to_number` function to convert the "Cash Pay" column from strings to numbers.


In [None]:
total_cash = ...

Second, compute the proportion using the correct columns.

*Note:* You will see a warning pop up in a pink box.  DON'T BE ALARMED!!!!!!!  We'll get to that in a sec.

In [None]:
cash_proportion = ...
cash_proportion

In [None]:
grader.check('q3_4')

So what was that warning about?  It says something about "`true_divide`."  As a hint, let's look at the last rows of our `compensation` table.

In [None]:
compensation.take[-5:]

Notice anything strange? 

The warning is Python's cryptic way of telling you that you're dividing a number by zero. As you can see, the last item in the `Total Pay ($)` column is 0.

A lot of real data is messy.  It might contain zeros, empty values, or `nan`'s (not-a-number) that we need to watch out for when performing calculations! 
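One common way to watch out for a zero denominator is to check for it before dividing. The sketch below is only an illustration: the function `safe_proportion` is hypothetical, and it uses an `if` statement, which we have not covered in this class.

```python
def safe_proportion(part, whole):
    """Returns part / whole, or nan (not-a-number) when whole is zero."""
    if whole == 0:
        return float("nan")
    return part / whole

print(safe_proportion(2, 8))
print(safe_proportion(2, 0))
```

Returning `nan` marks the result as missing instead of stopping the computation with an error.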

If you're interested in going deeper, try comparing certain outputs of NumPy's version of division, `np.divide`, versus the default Python division operator, `/`.

In [None]:
# This line doesn't raise an error; NumPy shows a warning and gives inf.
print(np.divide(5, 0))
# This line raises a ZeroDivisionError.
print(5 / 0)

**Before proceeding, if there are any errors, remember to fix the code in the cell above so that there are no errors!**

Alright, back to our CEOs.  Check out the `% Change` column in `compensation`.  It shows the percentage increase in the CEO's pay from the previous year.  Given their current pay and the percentage increase from the previous year, you can compute the previous year's pay.  For example, if your pay is 100 dollars this year, and that's an increase of 50 percent from the previous year, then your previous year's pay was $\frac{100}{1 + \frac{50}{100}}$ dollars, or around 66.67 dollars.
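The arithmetic in that example can be written as a short function. This is a sketch with a made-up name, `pay_before_increase`, and it takes the percentage increase as a plain number such as 50, not as a string.

```python
def pay_before_increase(current_pay, percent_increase):
    """Returns the earlier pay, given current pay and the percent increase."""
    return current_pay / (1 + percent_increase / 100)

previous = pay_before_increase(100, 50)
print(round(previous, 2))
```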

For CEOs with no previous year on record, it instead says "(No previous year)".  The values in this column are *strings*, not numbers, so like the "Total Pay" column, it's not usable without a bit of extra work.

**Question 3.6.** Create a new table called `with_previous_compensation`.  It should be a copy of `compensation`, but with the "(No previous year)" CEOs filtered out, and with an extra column called `2014 Total Pay ($)`.  That column should have each CEO's pay in 2014.

*Hint:* This question takes several steps, but each one is still something you've seen before.  Take it one step at a time, using as many lines as you need.  You can print out your results after each step to make sure you're on the right track.

*Hint 2:* You'll need to define the function `percent_string_to_num`.  You can do that just above your other code as shown in the template.

*Hint 3:* We've provided a structure that you can use to get to the answer. However, if it's confusing, feel free to delete the current structure and approach the problem your own way!

In [None]:
# Definition to turn percent to number
def percent_string_to_num(percent_string):
    """Converts a percentage string to a number.
    For example, the percentage string '25%' should be converted to the float 0.25.
    """
    return ...

# Compensation table where there is a previous year
having_previous_year = ...

# Get the percent changes as numbers instead of strings
# We're still working off the table having_previous_year
percent_changes = ...

# Calculate the previous year's pay
# We're still working off the table having_previous_year
previous_pay = ...

# Put the previous pay column into the having_previous_year table
with_previous_compensation = ...

with_previous_compensation

In [None]:
grader.check('q3_6')

**Question 3.7.** What was the average pay of these CEOs in 2014?

In [None]:
average_pay_2014 = ...
average_pay_2014

In [None]:
grader.check('q3_7')

**Question 3.8.** According to [this data](https://www1.salary.com/Safra-A-Catz-Salary-Bonus-Stock-Options-for-Oracle-Corp.html), Safra Catz made a total of $108,282,333 in fiscal year 2018. Earlier, we computed how much Safra Catz made in 2015. What is the overall growth rate of Safra Catz's total pay from 2015 to 2018?

*Hint:* The formula for growth rate can be found [here](https://www.inferentialthinking.com/chapters/03/2/1/Growth.html).

*Hint:* Safra Catz's name is `Safra A. Catz*` in the table.
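As a reminder of the definition from the linked textbook section, the growth rate is the change divided by the initial value. Here is a sketch with made-up numbers (not the CEO data) and a hypothetical function name.

```python
def growth_rate(initial, changed):
    """Returns the relative change from initial to changed."""
    return (changed - initial) / initial

# Made-up values: a quantity that grows from 40 to 100.
example_rate = growth_rate(40, 100)
print(example_rate)
```

A growth rate of 1.5 corresponds to an increase of 150%.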

In [None]:
safra_catz_pay_2015 = ...
safra_catz_pay_2018 = ...
safra_catz_pay_growth_rate = ...
safra_catz_pay_growth_rate

In [None]:
grader.check('q3_9')

This is an increase of over 100%!

**Why is `apply` useful?**

For operations like arithmetic, or the functions in the NumPy library, you don't need to use `apply`, because they automatically work on each element of an array.  But there are many things that don't.  The string manipulation we did in today's lab is one example.  Since you can write any code you want in a function, `apply` gives you total control over how you operate on data.

Great job! :D You're finished with lab 4! 

* Make sure you **save the notebook** first, 
* Then go up to the `Kernel` menu and select `Restart & Clear Output` (make sure the notebook is saved first, because otherwise, you will lose all your work!). 
* Now, go to `Cell -> Run All`. Carefully look through your notebook and verify that all computations execute correctly. You should see **no errors**; if there are any errors, make sure to correct them before you submit the notebook.
* Then, go to `File -> Download as -> Notebook` and download the notebook to your own computer. ([Please verify](https://ucsb-ds.github.io/ds1-f20/troubleshooting/#i-downloaded-the-notebook-file-but-it-saves-as-the-ipynbjson-extension-so-whenever-i-upload-it-to-gradescope-it-fails) that it got saved as an .ipynb file.)
* Upload the notebook to [Gradescope](https://www.gradescope.com/).

