<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/11_writing_functions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Writing Functions

We've already seen how functions are used and have written some of our own functions. This notebook will now explain these concepts in a little bit more detail. 

Lots of our code cells do things ephemerally - if we write a for loop or print out a bunch of information, that code is limited to the code cell it was written in. Writing a function allows us to store a procedure/operation in memory and reuse it. Moreover, writing several functions which work together is a key to writing structured *programs*. 

You may have noticed that the word *program* pops up here and there. And while we can argue that a single function we have made is technically a program, programs are usually a series of many different functions. You've seen how NLTK groups functions into different modules for different things, and this is pretty common among different packages. While we aren't going to be making modules or packages, we can start looking at developing a set of cooperative functions as a means to create an NLP pipeline. 


# Syntax of a function

You've already seen how to write a function. We define a function using `def` and then assign a function name, much the same as we name variables. Functions are similar to `if statements` and `for loops` in that they have a header, followed by a colon, followed by an indented block which contains the body of the function. 

```
def function_name():
  stuff function does

```


In [None]:
# make a useless function
def printer(x):
  print(x)

Note that when I defined `printer()` above, I included an `x` in the parentheses of the function. And, inside the function, I have asked the function to do something with `x` (i.e., use `print()` on `x`). 

By declaring my function with the `x`, I am including a required `argument` in my function. The function will *not* work unless the user includes a value for `x`. You can basically think of `x` as a required variable that must be supplied. For example:

In [None]:
# supply a string to our printer() function. 
printer('this dream isn\'t feeling sweet')

In [None]:
# supply nothing to our printer function. 
printer()

If you run the cell above you get:

```
TypeError: printer() missing 1 required positional argument: 'x'
```

In other words, the function is complaining that it was not provided with the data it needs to work.

You can also provide default values for arguments in functions. For example, we could make our default argument for `printer()` a string which tells the user that they have not entered a value:



In [None]:
def printer(x = 'please enter a string'):
  print(x)

Now we can run the function without an argument, although it's a bit silly. In some ways it would be better to provide an error which teaches the user they need to enter a string to use the function properly. 

In [None]:
# we don't get an error anymore, but still isn't overly useful :)
printer()
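
The idea of giving the user a helpful error, rather than a silent default, could be sketched roughly as follows. This is only one possible approach: the name `strict_printer()` is hypothetical (it is not part of this notebook), and using `raise TypeError(...)` is an assumption about how you might want the function to complain.

```python
# hypothetical variant of printer() which insists on a string
def strict_printer(x = None):
  # isinstance() checks whether x is a str
  # the default of None also fails this check, so a missing argument is caught too
  if not isinstance(x, str):
    # raise stops the function and shows this message to the user
    raise TypeError('please enter a string, e.g. strict_printer("hello")')
  print(x)

# works like printer() when given a string
strict_printer('hello')

# strict_printer() or strict_printer(5) would both stop with the TypeError above
```

The error message now tells the user what the function expects, instead of printing a placeholder as if nothing went wrong.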

## Function variable scope

Another benefit of using functions is that they help us avoid accidentally overwriting variables or running into other issues with variable names. This introduces us to the concept of variable scope. The main distinction in variable scope is between **global** scope and **local** scope. Essentially, global scope refers to your entire Python environment, and something which is global can be accessed anywhere and from anything. Local scope, on the other hand, restricts variables to the functions and classes they belong to.

For example, if I declare a variable in the next code cell, that variable is available to other code cells:

In [None]:
# defining a variable here is in global scope
ribs = "And I've never felt more alone, It feels so scary getting old"

In [None]:
# which I can access in this code cell
[i.upper() for i in ribs.split()]

However, if I define a variable in local scope, such as within a function, that variable is only available in that function. 

In [None]:
def nin_fragile():
  # note that I am assigning a variable just like you do anywhere else
  fragile = 'She reads the minds of all the people as they pass her by'
  print(fragile)

In [None]:
# function works great
nin_fragile()

In [None]:
# but if I try to access `fragile` here?
print(fragile)

I get a `NameError` (`name 'fragile' is not defined`), meaning there is no variable named `fragile` available in the context I asked for it in (i.e., in the global scope). 

So, using functions allows us to avoid all kinds of problems *and* perform some good tricks when it comes to variables. Perhaps most specifically, I could run a for loop in function A and another for loop in function B using the same loop variable, and I would not need to worry about any cross contamination of the variable names between the functions. (Well, in theory at least!). Scope is a bit more complicated than I've presented it here, but I just want to make sure you see the difference, and also understand that one benefit of functions is to help us keep our global environment free of stray variables. 
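
To make the point about shared loop variable names concrete, here is a small sketch. The two functions `count_long_words()` and `shout_words()` are hypothetical examples made up for this illustration. Both use `word` as their loop variable, and there is also a global variable named `word`.

```python
# function A: counts words longer than four characters
def count_long_words(text):
  total = 0
  for word in text.split():
    if len(word) > 4:
      total += 1
  return total

# function B: uppercases every word, using the same loop variable name
def shout_words(text):
  shouted = []
  for word in text.split():
    shouted.append(word.upper())
  return shouted

# a global variable which happens to share the name `word`
word = 'global'

print(count_long_words('never felt more alone'))
print(shout_words('getting old'))

# the global `word` was not changed by either function
print(word)
```

Each function keeps its own local `word`, so neither loop interferes with the other or with the global variable.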

## Building a less useless function

We can make much better functions than `printer()`, which is just a shell around the pre-existing `print()` function in Python. Almost anything we've already seen in Python can be done within a function, such as including conditional statements, looping, list comprehensions, and so on. 

We can also use arguments not just as data which is passed to the function, but also as flags and switches the user can use to control the way a function works. Let's explore this now. 


Let's write a function called `pre_process1()` which allows for a variety of text cleaning options. We'll start by asking our function to lowercase a string.

In [None]:
# pre_process1 returns a lowercased string
def pre_process1(x):
  x = x.lower()
  return x

In [None]:
# test the function:
pre_process1('WELLINGTON')

### `return`

`pre_process1()` includes a `return` statement, which means the function will output a particular value (or set of values). In the case of `pre_process1()`, the output is the lowercased string. We'll continue using `return` statements so that our functions can pass data to and from one another. 
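
The difference between printing a value and returning it can be seen in a quick comparison. The two function names below are hypothetical and exist only for this sketch.

```python
# prints the result, but hands nothing back to the caller
def lower_and_print(x):
  print(x.lower())

# hands the result back to whoever called the function
def lower_and_return(x):
  return x.lower()

printed = lower_and_print('WELLINGTON')
returned = lower_and_return('WELLINGTON')

# printed is None, because the function has no return statement
print(printed)

# returned holds the lowercased string, ready to be used elsewhere
print(returned)
```

Only the version with `return` gives us something we can store in a variable and pass along to another function.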

Let's now add another operation to our function, removing punctuation.

In [None]:
def pre_process2(x):
  punctuation = '.;"!\'[]{}:><-_?'
  x = x.lower()
  x = ''.join([char for char in x if char not in punctuation])
  return x

In [None]:
pre_process2('We\'re never done with killing time. Can I kill it with you?')

### Arguments controlling conditionals

Now, let's provide the option for the user to choose exactly how the string will be pre-processed. We can do so by adding more arguments and some conditional logic inside the function. I add two arguments to the function, `lower` and `remove_punc`. I set the default values of both to `True`. 

Then I use a series of conditional `if` statements to determine if the string should be processed in different ways. Because the defaults are `True`, both steps will run unless the user explicitly sets the arguments to `False`.

In [None]:
# define with default values of True for lower and remove_punc arguments
def pre_process3(x, lower = True, remove_punc = True):
  # remember, this effectively says if lower == True
  if lower:
    x = x.lower()
  
  # remember, this effectively says if remove_punc == True
  if remove_punc:
    # this variable won't be declared unless the flag is set to True
    punctuation = '.;"!\'[]{}:><-_?'
    x = ''.join([char for char in x if char not in punctuation])
  return x

In [None]:
# use the function to only remove punc - capitals still remain
pre_process3('We\'re never done with killing time. Can I kill it with you?', lower = False)

In [None]:
# or to only lowercase - punctuation still remains
pre_process3('We\'re never done with killing time. Can I kill it with you?', remove_punc = False)

In [None]:
# default behaviour if we don't set the flags:
pre_process3('We\'re never done with killing time. Can I kill it with you?')
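
The hand-typed `punctuation` string used above leaves out some characters, such as commas and parentheses. One possible alternative, shown here only as a sketch, is the `string.punctuation` constant from Python's standard library. The function name `pre_process_std()` is hypothetical, and swapping in `string.punctuation` is my own substitution rather than something the earlier cells do.

```python
import string

# hypothetical variant of pre_process3() using the standard library's punctuation list
def pre_process_std(x, lower = True, remove_punc = True):
  if lower:
    x = x.lower()
  if remove_punc:
    # string.punctuation holds the ASCII punctuation characters
    x = ''.join([char for char in x if char not in string.punctuation])
  return x

print(pre_process_std('Well, (maybe) we can!'))
```

Note that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters would still need separate handling.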

# Chaining functions

Now that we have our cute little preprocessing function, let's use it within *another* function. For instance, we could develop a tokenizer function which pre-processes each token of a string. Let's use good old `string.split()` for now. 

In [None]:
# a simple tokenizer on whitespace
def tokenizer(x):
  x = x.split()
  return x

In [None]:
# let's save our target string to a variable
target = "What if you could look right through the cracks? Would you find yourself? Find yourself afraid to see?"

In [None]:
# our function works!
tokenizer(target)

Okay, now let's use our functions together. We've already been doing this in for loops and other repeated actions, and here is another example, where I use the `pre_process3()` function on each token returned by the `tokenizer()` function:

In [None]:
# pre process each token, one at a time
for token in tokenizer(target):
  token = pre_process3(token)
  print(token)

We printed the results above, which isn't ideal. Let's make yet *another* function which uses our `tokenizer()` and `pre_process3()` functions. I need to do a bit more here to make the output make sense. 

In the first line of the function, I create my list of tokens.

Because this is a list, I need to either loop or use a list comprehension to get all of my values back. 

Then if I want everything as a string I need to use `' '.join()` on that list comprehension to get it all back. As you can probably imagine, there are a variety of ways things can go wrong during this process :)


In [None]:
# make my pipeline function
def pipeline(x):
  # first create a tokenized list
  tokens = tokenizer(x)

  # then use a list comprehension to apply pre_process3 to each individual token
  processed = [pre_process3(token) for token in tokens]
  return ' '.join(processed)

In [None]:
# it works
pipeline(target)
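
As written, `pipeline()` always uses the default settings of `pre_process3()`. If you wanted the user to keep control over those settings, one option might be to give the pipeline the same arguments and forward them. The sketch below assumes that design; the name `pipeline_flags()` is hypothetical, and the two helper functions are repeated in compact form only so the cell runs on its own.

```python
# compact copies of the helpers defined earlier in this notebook
def tokenizer(x):
  return x.split()

def pre_process3(x, lower = True, remove_punc = True):
  if lower:
    x = x.lower()
  if remove_punc:
    punctuation = '.;"!\'[]{}:><-_?'
    x = ''.join([char for char in x if char not in punctuation])
  return x

# hypothetical pipeline which forwards its flags to pre_process3()
def pipeline_flags(x, lower = True, remove_punc = True):
  tokens = tokenizer(x)
  # each token is processed using whatever flags the user chose
  processed = [pre_process3(token, lower = lower, remove_punc = remove_punc) for token in tokens]
  return ' '.join(processed)

# keep capitals, but still remove punctuation
print(pipeline_flags('Would you find yourself?', lower = False))
```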

## Adding defenses to our program

Our program is doing okay for itself eh? But, what happens if we pass something that isn't a string?

In [None]:
pipeline(123)

We get an error because all of our functions are assuming a string is being entered. However, a user might not know this, or we might do something somewhere along the lines that doesn't play nice with strings. 

One way to address this is to add checks to our function which ensure the input is what we want. There are a variety of ways to do this: `assert` statements are one method, and the one explained in Chapter 4 of the NLTK book. There are other methods we can use as well, such as `try` and `except` statements. 
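
As a rough illustration of `assert` and `try`/`except`, consider the sketch below. The function `tokenizer_checked()` is a hypothetical name made up for this example, and the wording of the messages is arbitrary.

```python
# hypothetical tokenizer which checks its input with assert
def tokenizer_checked(x):
  # if the condition is False, Python stops with an AssertionError and shows the message
  assert isinstance(x, str), 'tokenizer_checked() needs a string'
  return x.split()

# try/except lets us catch the error and respond, instead of the program stopping
try:
  tokenizer_checked(123)
except AssertionError as error:
  print('Something went wrong:', error)

# a string passes the check as normal
print(tokenizer_checked('find yourself'))
```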

We could also be less fancy and write a simple `if` statement to check the type and prevent the program from proceeding. 

In [None]:
def pipeline2(x):
  # before doing anything else, is x a str?
  if type(x) != str:
    print('Please enter a string!')
  else:
    tokens = tokenizer(x)
    processed = [pre_process3(token) for token in tokens]
    return ' '.join(processed)

In [None]:
pipeline2(target)

In [None]:
# We tell the user they need to enter a string. 
pipeline2(1337)

As I said, there are a number of other ways to inform the user of what values to enter. But perhaps for this course and other applications your "user" is really just "you", so writing your functions defensively is a way to help you help yourself. 

# Writing a `load_txt` program

Let's create a function which loads texts for us, so that we can avoid typing the `open()` syntax each time we want to load a text. Why? We might find that we are constantly opening files from different locations, and a simple function lets us do this the same way in many different places. For example, we have been reading in `.txt` files from our Google Drive using `open(text).read()`. A simple function can help us do this more programmatically and allow for some more pre-processing. 


Note that in the program below I include a `docstring`, which is the triple-quoted description inside the function. You can see these descriptions in different programming environments. In Colab, you can type the function name, put your cursor in the brackets of the function, and push tab; the context menu should pop up showing you the docstring. 


In [None]:
# create a program which loads in and reads files.
def load_txt(path):
  """opens and returns a text"""
  text = open(path).read()
  return text

In [None]:
# put your cursor in the brackets and push tab - it will show you the details of this function.
# (don't run the cell or else it will give you an error) 
load_txt() 
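
Outside of the Colab pop-up, a docstring can also be read with plain Python. The sketch below uses a hypothetical function, `count_tokens()`, so that it runs without needing any file. `__doc__` and `help()` are both built into Python.

```python
# hypothetical function with a one-line docstring
def count_tokens(text):
  """Return the number of whitespace-separated tokens in text."""
  return len(text.split())

# the docstring is stored on the function itself
print(count_tokens.__doc__)

# help() prints the function signature together with the docstring
help(count_tokens)
```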

I'll test my program on a text named `mood_ring.txt` located in my Google Drive (you can download this file from the [LING 226 GitHub](https://github.com/scskalicky/LING-226-vuw/tree/main/texts))

In [None]:
# mount your drive before running this cell. 
load_txt('/content/drive/MyDrive/mood_ring.txt')

Sweet, it works. We could then add a feature to `load_txt` that lowercases the words, right? And in fact we could do this in one line:

In [None]:
def load_txt(path):
  """opens and returns a text"""
  text = open(path).read().lower() # read in and lowercase the string
  return text

In [None]:
load_txt('/content/drive/MyDrive/mood_ring.txt')

As we did with `pre_process3()` earlier, we could then *further* revise the function to make lower-casing optional by adding a toggle argument:

In [None]:
def load_txt(path, lower = True):
  """opens and returns a text"""
  if lower:   # add condition for lowercasing
    text = open(path).read().lower()
  else: 
    text = open(path).read()
  return text

In [None]:
# default will return lower cased
load_txt('/content/drive/MyDrive/mood_ring.txt')

In [None]:
# but we can turn that off if we like
load_txt('/content/drive/MyDrive/mood_ring.txt', lower = False)

A helper function such as `load_txt()` is worthwhile when developing programs because it saves you a lot of time retyping, but also because it brings consistency to your program and reduces error. Knowing that you can call `load_txt()` multiple times in a program and that each time it will work the same way means you don't have to worry about inconsistencies or even typos. Also, you can change the definition of `load_txt()` just one time and the change will apply everywhere `load_txt()` is called. 
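
One detail the versions of `load_txt()` above leave out is closing the file after reading it. A common Python pattern for this is the `with` statement. The sketch below shows how that might look; the name `load_txt_safe()`, the `encoding = 'utf-8'` default, and the throwaway demo file are all assumptions made for illustration and are not part of the notebook's own code.

```python
import os
import tempfile

# hypothetical variant of load_txt() which closes the file when done
def load_txt_safe(path, lower = True, encoding = 'utf-8'):
  """opens a text file, closes it again, and returns its contents"""
  # the with block closes the file automatically, even if reading fails
  with open(path, encoding = encoding) as f:
    text = f.read()
  if lower:
    text = text.lower()
  return text

# write a small throwaway file so this example runs without Google Drive
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w', encoding = 'utf-8') as f:
  f.write('Mood Ring')

print(load_txt_safe(demo_path))
print(load_txt_safe(demo_path, lower = False))
```

Because the change lives inside the helper, every place that calls the function would pick up the safer behaviour at once.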

The next notebook after this one will continue exploring how to chain functions together (and some other stuff). However, you might find it worthwhile to take a moment here and create two small functions which work with one another. For example, you could try creating a function which returns the ten most frequent words in a text (using `nltk.FreqDist()`) as well as a tokenizer function, or any other combination of things. 