<center><h1>Functions</h1></center>

#1. Creating a function from scratch

There are two general methods for creating a function. The first is creating a function from scratch, which I'll go over below.

##Something missing from Python

We know that Python has a `sum` function. But it has no built-in equivalent for taking the product of a list of numbers (the closest thing, `math.prod`, is tucked away in the `math` module). I defined that function for you in chapter 5. Let's go over it again in detail.

I'll start with the simplest function definition possible for taking the product of a list of floats.

In [None]:
def float_mult(number_list):
  #hidden from you: number_list = whatever you gave me
  result = 1.0
  for number in number_list:  #fancier version of for i in range(n):
    result *= number
  return result
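As a quick sanity check, the sketch below compares `float_mult` against the standard library's `math.prod` (available since Python 3.8). The list `sample` is just a value I made up for illustration.

```python
import math

def float_mult(number_list):
  result = 1.0
  for number in number_list:
    result *= number
  return result

sample = [.2, .5, .01]  # made-up example values

ours = float_mult(sample)
theirs = math.prod(sample)  # standard library version (Python 3.8+)

# compare with a tolerance, since floats can differ in the last digits
print(math.isclose(ours, theirs))
```

If the two agree, you can feel better that the hand-rolled loop is doing the right thing.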

#2. Function signature line

Let's break the code above into pieces.

* **def**. This stands for *define*. When Python sees it, it expects a certain form to follow.

* **float_mult**. This is a name you get to make up. It is the name of your function. It should be a unique name that does not clash with other function names or with any variables you have defined.

* **()**. The left and right parens are required. They contain the function parameters. Even if the function has no parameters, you still need the left and right parens.

* **number_list**. This is the parameter to your function. You get to name it. If you have more than one parameter, separate the names by commas.

* **:**. The pesky colon ends the first line. This first line has the jargony name *signature* line.
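To make the signature pieces concrete, here is a small sketch with two made-up functions (the names `say_hello` and `add_two` are mine, purely for illustration). One takes no parameters and still needs the parens. The other takes two parameters separated by a comma.

```python
def say_hello():  # no parameters, but the parens are still required
  return 'hello'

def add_two(first, second):  # two parameters, separated by a comma
  return first + second

greeting = say_hello()  # the call needs the parens too
total = add_two(3, 4)
print(greeting, total)
```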

#3. Function body

As with a loop or an if-statement, everything that follows the signature line and is indented is considered part of the function body.

I hope you recognize the first 3 lines in the function body as multiplying a list of numbers together in a variable called `result`. I could do the same thing without the function, right? Check it out.

In [None]:
result = 1.
for number in [.2, .5, .01]:
  result *= number


In [None]:
result

0.001

##What about that return line?

After the loop, you can see I have **`return result`**. What's up with that? We will have to dig a little deeper into the function concept to understand it.

First, check this out:

In [None]:
z = float_mult([.2, .5, .01])  #this is a function call
z

0.001

##What steps happened above?

The first line has the jargony name *function call*. You are "calling" on the function to do its thing. The function won't do anything on its own. It just waits patiently for you to call it.

When Python sees a function call, the first thing it does is assign values to the function parameters. The jargony name is that it *binds* the parameters. I think the best way to think of this is as a set of hidden assignment statements. Python knows what the parameters are so, hidden from you, it does:
<pre>
number_list = argument value
</pre>
In our case, the argument value is a raw list of floats. So Python does this:
<pre>
number_list = [.2, .5, .01]
</pre>
If there were more than one parameter, Python would look for more than one argument so it could assign them all. I think this is one of the most confusing things about functions. These hidden assignments are made before anything else happens.

Now Python goes ahead and executes the loop in the body. It will end up with `.001` in the variable `result`.

Now the function is ready to give a result back to you. You called on it to do some work, it did that work, and now wants to report what it did. The `return` statement does that reporting. If you forget to include it, the function will kind of throw up its hands and give you a `None` value. So a missing return statement won't cause an error, but it will return something to you that is almost certainly not useful, the special Python value `None`.
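Here is a minimal sketch of what a forgotten `return` looks like. The function name `float_mult_no_return` is made up for this illustration. It is the same loop as before, just with the last line left off.

```python
def float_mult_no_return(number_list):
  result = 1.0
  for number in number_list:
    result *= number
  # oops: no return statement, so the work is thrown away

z = float_mult_no_return([.2, .5, .01])
print(z)  # prints None
```

No error, but `z` is not the product you were hoping for.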

Let's look at a more complicated example that always seems to be confusing at first.

In [None]:
x = [.1, .9, 2.3, 4.]
z = float_mult(x)
z

0.8280000000000001

So same deal. First Python does the hidden assignment.


<pre>
number_list = x
</pre>

This goes back to week 1. To carry out this assignment, we have to do an MR (memory retrieve) operation on x. When we do that, we end up with:

<pre>
number_list = [.1, .9, 2.3, 4.]  #after doing MR on x
</pre>

Then it operates as normal: it executes the loop and returns the result.

##What can you infer about Python hiding things?

I hope you can infer that you do not need to know the function parameter names to use the function. You do have to supply a value for each parameter. But you do not need to know the parameter names. Python takes care of that for you.
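A small sketch to back that up. The variable names `prices` and `whatever_i_want` are made up. The point is that none of them match `number_list`, and the function does not care.

```python
def float_mult(number_list):
  result = 1.0
  for number in number_list:
    result *= number
  return result

prices = [2.0, 3.0]
whatever_i_want = [2.0, 3.0]

a = float_mult(prices)           # hidden: number_list = prices
b = float_mult(whatever_i_want)  # hidden: number_list = whatever_i_want
c = float_mult([2.0, 3.0])       # hidden: number_list = [2.0, 3.0]
print(a, b, c)  # prints 6.0 6.0 6.0
```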

##Let's practice

I would like you to help me create a function that does a guarded divide of 2 floats. If the denominator is 0, the function just returns 0. Otherwise it returns the division.

In [None]:
def guarded_divide(num, denom):
  #num and denom will be assigned values here - hidden

  result = 0  #in case denom is 0
  if denom != 0:
    result = num/denom

  return result

Test it out.

In [None]:
x = 3
y = 4

#now call your function to divide x by y
guarded_divide(x,y)

0.75

In [None]:
x = 3
y = 0

#now call your function to divide x by y
guarded_divide(x,y)

0

BTW, what would happen without the guard?

In [None]:
x/y

ZeroDivisionError: division by zero
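If you are curious, you can look at that error without it stopping your notebook by wrapping the division in `try`/`except`. This is a side sketch, not something the guarded divide needs. The names `answer` and `message` are made up for illustration.

```python
x = 3
y = 0

try:
  answer = x / y  # this line raises the error
except ZeroDivisionError as error:
  answer = None         # nothing sensible to store
  message = str(error)  # the text Python attaches to the error

print(message)
```

The guard in `guarded_divide` avoids the error in the first place, which is usually simpler.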

#4. Creating a function from existing code

Here is the situation. You have written some code in your notebook to carry out some step in your analysis. Now you find you need to repeat the code for a new situation. So you copy the code and paste it in a new cell.

If you have to do this more than once, you might think to yourself, why don't I package the code into a function? Then I can just call the function for each new situation.

This "functionizing" of code is perhaps the most frequent case, especially in data science, where you are writing code to explore a problem rather than to develop an app to sell in the app store. You are writing code on the fly to meet an immediate goal.

Let me see if I can motivate this with some code taken from a previous chapter.

In [None]:
our_seed = 1234  #if we all use this we should get same random data

import numpy as np  #powerful library for manipulating data
rsgen = np.random.RandomState(our_seed)  #we are only going to use numpy's random number generator for now

shuffled_table = gothic_sentences.sample(frac=1, random_state=rsgen).reset_index(drop=True)

len(shuffled_table)

19579*.7  #split point

training_table = shuffled_table[:13705].reset_index(drop=True)  #.7
testing_table = shuffled_table[13705:].reset_index(drop=True)   #.3

##Step 1: copy and paste into function body

I'll just create a function and paste the code above into the body. Then make sure I indent it.



In [None]:
def hold_out_v1():
  our_seed = 1234  #if we all use this we should get same random data

  import numpy as np  #powerful library for manipulating data
  rsgen = np.random.RandomState(our_seed)  #we are only going to use numpy's random number generator for now

  shuffled_table = gothic_sentences.sample(frac=1, random_state=rsgen).reset_index(drop=True)

  len(shuffled_table)

  19579*.7  #split point

  training_table = shuffled_table[:13705].reset_index(drop=True)  #.7
  testing_table = shuffled_table[13705:].reset_index(drop=True)   #.3

##Step 2: identify parameters

This is definitely the hardest step. You have to think about what might change in different situations. Let's take one example. The original table is `gothic_sentences` in the code above, right? But if we leave that as is, this code only works for the gothic authors problem. What if we are working with a different dataset? This code will not work.

I propose that we make the original table a parameter! What do you think? Then we can call the function with different datasets and have it do its thing. So here is version 2.

In [None]:
def hold_out_v2(original_table):
  our_seed = 1234  #if we all use this we should get same random data

  import numpy as np  #powerful library for manipulating data
  rsgen = np.random.RandomState(our_seed)  #we are only going to use numpy's random number generator for now

  shuffled_table = original_table.sample(frac=1, random_state=rsgen).reset_index(drop=True)

  len(shuffled_table)

  19579*.7  #split point

  training_table = shuffled_table[:13705].reset_index(drop=True)  #.7
  testing_table = shuffled_table[13705:].reset_index(drop=True)   #.3

You can see that I first decided on a parameter name, and then went through and replaced `gothic_sentences` with `original_table`. I wish Colab gave me the ability to do this with one command, i.e., find all occurrences of X and replace each with Y just within this function. But that does not seem possible. Because of this, I often miss an occurrence and have to debug.

What else do you see that is a candidate for making a parameter? The cut percentage looks like a possible thing we might change from dataset to dataset. Or even with the same dataset if we want to explore the effects of different splits. So let's do that. It's a little bit more involved but we can do it.

In [None]:
def hold_out_v3(original_table, training_percentage):
  our_seed = 1234  #if we all use this we should get same random data

  import numpy as np  #powerful library for manipulating data
  rsgen = np.random.RandomState(our_seed)  #we are only going to use numpy's random number generator for now

  shuffled_table = original_table.sample(frac=1, random_state=rsgen).reset_index(drop=True)

  cut_point = int(training_percentage * len(shuffled_table))

  training_table = shuffled_table[:cut_point].reset_index(drop=True)
  testing_table = shuffled_table[cut_point:].reset_index(drop=True)
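To see what the `cut_point` line is doing, here is a sketch on a plain Python list instead of a table (the list `rows` is made up for illustration). Slicing with `[:cut_point]` and `[cut_point:]` splits the list into two pieces with nothing lost and nothing duplicated.

```python
rows = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']  # 9 made-up rows
training_percentage = .7

# .7 * 9 is roughly 6.3; int() chops off the fraction
cut_point = int(training_percentage * len(rows))

training_rows = rows[:cut_point]  # everything before the cut
testing_rows = rows[cut_point:]   # everything from the cut onward

print(cut_point, len(training_rows), len(testing_rows))  # prints 6 6 3
```

The same slicing idea carries over to the shuffled table in the function.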

Any others? Well, `our_seed` has the flavor of a parameter to me. It feels like something we might want to change. Let's do it.

In [None]:
def hold_out_v4(original_table, training_percentage, the_seed):
  #hidden is assignment of 3 parameters to actual values
  import numpy as np  #powerful library for manipulating data
  rsgen = np.random.RandomState(the_seed)  #we are only going to use numpy's random number generator for now

  shuffled_table = original_table.sample(frac=1, random_state=rsgen).reset_index(drop=True)

  cut_point = int(training_percentage * len(shuffled_table))

  training_table = shuffled_table[:cut_point].reset_index(drop=True)
  testing_table = shuffled_table[cut_point:].reset_index(drop=True)

##Step 3. Figure out what to return

This looks a little more complicated. We really want to return 2 things, `training_table` and `testing_table`. How about packaging them up into a list? And then return that.

In [None]:
def hold_out(original_table, training_percentage, the_seed):
  import numpy as np  #powerful library for manipulating data
  rsgen = np.random.RandomState(the_seed)  #we are only going to use numpy's random number generator for now

  shuffled_table = original_table.sample(frac=1, random_state=rsgen).reset_index(drop=True)

  cut_point = int(training_percentage * len(shuffled_table))

  training_table = shuffled_table[:cut_point].reset_index(drop=True)
  testing_table = shuffled_table[cut_point:].reset_index(drop=True)

  return [training_table, testing_table]

##Step 4. Test it

Make some calls on your new function to make sure it looks like it works as you expect.

In [None]:
import pandas as pd

In [None]:
gothic_sentences = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vQqRwyE0ceZREKqhuaOw8uQguTG6Alr5kocggvAnczrWaimXE8ncR--GC0o_PyVDlb-R6Z60v-XaWm9/pub?output=csv',
                          encoding='utf-8')


In [None]:
table_list = hold_out(gothic_sentences, .7, 1234)

training_table = table_list[0]  #unpackage
testing_table = table_list[1]   #unpackage

In [None]:
training_table.head()

Unnamed: 0,id,text,author
0,id18824,"An indefinite sense of awe, which at first sig...",EAP
1,id27368,"Surely, man had never before so terribly alter...",EAP
2,id06142,"Why don't you laugh at Oliver's grandfather, w...",HPL
3,id25016,I lay upon the grass surrounded by a darkness ...,MWS
4,id09465,"She stood erect in a singularly fragile canoe,...",EAP


In [None]:
testing_table.head()

Unnamed: 0,id,text,author
0,id05790,"In the confusion attending my fall, I did not ...",EAP
1,id27140,and day Like a thin exhalation melt away Both ...,MWS
2,id21851,Yet I remember ah how should I forget? the dee...,EAP
3,id16429,"But my enthusiasm was checked by my anxiety, a...",MWS
4,id17842,The traces of light wheels were evident; and a...,EAP
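Eyeballing `head()` is a good start. If you want something a bit firmer, here is a sketch of a few checks on a tiny made-up table (`toy_table` and the check names are my own, not part of the chapter code). It assumes pandas and numpy are installed, and it repeats the `hold_out` definition so the cell can run on its own.

```python
import numpy as np
import pandas as pd

def hold_out(original_table, training_percentage, the_seed):
  rsgen = np.random.RandomState(the_seed)
  shuffled_table = original_table.sample(frac=1, random_state=rsgen).reset_index(drop=True)
  cut_point = int(training_percentage * len(shuffled_table))
  training_table = shuffled_table[:cut_point].reset_index(drop=True)
  testing_table = shuffled_table[cut_point:].reset_index(drop=True)
  return [training_table, testing_table]

# tiny made-up table, just for checking
toy_table = pd.DataFrame({'id': list(range(10)), 'text': list('abcdefghij')})

table_list = hold_out(toy_table, .5, 1234)
train = table_list[0]
test = table_list[1]

# every row should land in exactly one of the two tables
sizes_ok = len(train) + len(test) == len(toy_table)
ids_ok = sorted(train['id'].tolist() + test['id'].tolist()) == list(range(10))

# the same seed should give the same split again
again = hold_out(toy_table, .5, 1234)
repeatable = train.equals(again[0]) and test.equals(again[1])

print(sizes_ok, ids_ok, repeatable)
```

All three checks should come back `True`. If one does not, something in the function needs debugging.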


##A cool alternative

Python will allow you to unbundle a list into separate variables. Check this out.

In [None]:
training_table, testing_table = hold_out(gothic_sentences, .7, 1234)  #slicker unpackaging

In [None]:
training_table, testing_table, foo = hold_out(gothic_sentences, .7, 1234)  #slicker unpackaging

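ValueError: not enough values to unpack (expected 3, got 2)

The number of variables on the left has to match the number of items in the list that comes back. Here I asked for 3 but `hold_out` only returns 2, so Python complains.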
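The same unbundling works with any function that returns a list. Here is a minimal sketch with a made-up helper, `min_and_max` (not from any library), showing both a matching unpack and a mismatched one.

```python
def min_and_max(number_list):  # hypothetical helper, for illustration only
  return [min(number_list), max(number_list)]

# 2 variables for a 2-item list: works
smallest, largest = min_and_max([4, 9, 2])
print(smallest, largest)  # prints 2 9

# 3 variables for a 2-item list: Python raises a ValueError
try:
  smallest, largest, extra = min_and_max([4, 9, 2])
except ValueError as error:
  message = str(error)
  print(message)
```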

In [None]:
training_table.head()

Unnamed: 0,id,text,author
0,id18824,"An indefinite sense of awe, which at first sig...",EAP
1,id27368,"Surely, man had never before so terribly alter...",EAP
2,id06142,"Why don't you laugh at Oliver's grandfather, w...",HPL
3,id25016,I lay upon the grass surrounded by a darkness ...,MWS
4,id09465,"She stood erect in a singularly fragile canoe,...",EAP


In [None]:
testing_table.head()

Unnamed: 0,id,text,author
0,id05790,"In the confusion attending my fall, I did not ...",EAP
1,id27140,and day Like a thin exhalation melt away Both ...,MWS
2,id21851,Yet I remember ah how should I forget? the dee...,EAP
3,id16429,"But my enthusiasm was checked by my anxiety, a...",MWS
4,id17842,The traces of light wheels were evident; and a...,EAP


##Test on new dataset to check generality

In [None]:
import pandas as pd
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vRpZtv1ZFa7Am4j7U8S4JDYTuGkAp3rEyVh7riN8nVLEcaos_wgoAyJiRiE1oe8aITeex8BG-z6Sj5-/pub?output=csv'
tweet_table = pd.read_csv(url) 

In [None]:
training_table, testing_table = hold_out(tweet_table, .7, 1234)

In [None]:
training_table.head()

Unnamed: 0,author,text
0,0,nove :) #flylondon #friends #sunglasses #love
1,0,can't wait anymore ! ðð only a day left ...
2,0,i am thankful for passion. #thankful #positive
3,0,blending with nature&gt;&gt;link in bio#shoot2...
4,0,@user tragic. #wolves


In [None]:
testing_table.head()

Unnamed: 0,author,text
0,0,save thousands $$ free search x logins x broke...
1,0,@user #kindðyou rð@user @user @user @use...
2,0,a beautiful day to explore this of ours âï¸...
3,0,shit guys ó¾´ó¾´ is lie ó¾ó¾
4,0,sushi time! ð£ #sushi #nomnomnom #photoofthe...


#5. What now?

Your first question might be: where do you save your functions? You have built a couple of cool functions, `float_mult` and `hold_out`. What if you want to use them in the future?

**Easiest answer**: store them in a GitHub gist. If you keep your gists open in a separate browser tab, you will have easy access to your functions. You will have to copy the function definition into a new cell and execute it. Then you can use it.

**More difficult answer**: eventually you may want to store them in your own library, just as I have been doing with the puddles library. Then you can load your library following the steps I used to load puddles. The trickiest part is creating a repository and then creating a .py file. But seriously, it is not that tricky. You can do it!

#6. Is that all there is to know about functions?

Uh, no. They can get quite complicated. For instance, remember the form of the `float_mult` function I gave you in chapter 5? Here it is.

In [None]:
def float_mult(number_list: list) -> float:
  assert isinstance(number_list, list), f'number_list should be a list but is instead a {type(number_list)}'
  assert all([isinstance(item, float) for item in number_list]), f'number_list must contain all floats'

  result = 1.
  for number in number_list:  #fancier version of for i in range(n):
    result *= number
  return result

##Fireproofing

All the extra pieces above are focused solely on fireproofing the function. You can leave them off, as we have seen. What they do is try to help someone that is attempting to call the function.

* **def float_mult(number_list: list) -> float:**. This tells the caller that you expect the value for `number_list` to be a Python list. It also tells the caller that the function will return a Python float. The jargony name for these is *type hints*.

*   **assert isinstance(number_list, list), f'number_list should be a list but is instead a {type(number_list)}'**. This is yet more checking on the value given to `number_list`. It again checks to make sure it is a list. If it is not, it causes a Python error. The really cool part about it is that you get to roll your own error messages. If you hate Python's error messages, here is your chance to do better :)

* **assert all([isinstance(item, float) for item in number_list]), f'number_list must contain all floats'**. Yet another check on the value in `number_list`. We have established it is a list. But is it all floats? If not, give the caller an error message that is useful.

If the asserts don't cause errors, then you can feel pretty good that you have legit data in `number_list`.
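One thing worth knowing: Python itself does not check type hints when the code runs. They are notes for the reader (and for tools that read them). That is why the asserts are carrying the real weight. Here is a sketch with a made-up function, `hinted_only`, that has the hints but no asserts.

```python
def hinted_only(number_list: list) -> float:  # hints, but no asserts
  return number_list  # just hands back whatever it was given

# a string is not a list, and the result is not a float,
# yet Python runs this without raising any error
odd_result = hinted_only('not a list at all')
print(odd_result)
```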


In [None]:
float_mult([.4, .6, .9])  #good call

0.216

In [None]:
float_mult(.6)  #bad call

AssertionError: number_list should be a list but is instead a <class 'float'>

In [None]:
float_mult([.4, '.6', .9])  #bad call

AssertionError: number_list must contain all floats
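A gotcha to watch for: the second assert is strict. An int like `2` is not a float as far as `isinstance` is concerned, so a list like `[2, 3.0]` would be rejected. Below is a sketch of that check, followed by one possible relaxed variant. The name `number_mult` and its error messages are my own invention, not the chapter 5 function.

```python
values = [2, 3.0]
checks = [isinstance(item, float) for item in values]
print(checks)  # prints [False, True]

def number_mult(number_list: list) -> float:  # hypothetical relaxed variant
  assert isinstance(number_list, list), f'number_list should be a list but is instead a {type(number_list)}'
  # accept ints as well as floats by passing a tuple of types
  assert all(isinstance(item, (int, float)) for item in number_list), 'number_list must contain only ints or floats'

  result = 1.
  for number in number_list:
    result *= number
  return result

product = number_mult(values)
print(product)  # prints 6.0
```

Whether you want the strict or the relaxed version depends on how picky you want to be with your callers.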