# Apply, Part 2 - Applying your functions
File(s) needed: doctors.csv (just for the final example)

Now that we know how to write simple value-returning functions, we can use the `apply()` dataframe method to use them on our data. We will start by writing a simple function to return the square of a number.

In [8]:
import math
# Function to return the square of a number
def sq_num(my_num):
    return my_num**2

In [9]:
# Try out the new function
sq_num(2)

4

In [10]:
# We can use it like any other function
my_num = 7
print(f"The square of the number is {sq_num(my_num)}.")

The square of the number is 49.


# Basics of Apply()
When using a function like that with pandas, we will usually want to apply it to the rows or columns of a dataframe. Let's create a simple dataframe we can use for experimentation.

In [1]:
import pandas as pd

df = pd.DataFrame({'age':[12,18,25,40,65],
                   'ht':[58,68,70.5,70.5,69.5], 
                   'fav_temp_F':[75, 71, 72, 68, 77]})
df

,age,ht,fav_temp_F
0,12,58.0,75
1,18,68.0,71
2,25,70.5,72
3,40,70.5,68
4,65,69.5,77


## Apply over a Series
A Series here means either a single row or a single column of the dataframe. Use the `apply()` method built into the pandas Series to call our function on each element.

https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html 

In [None]:
# Confirm that both the first row and column are type Series


In [11]:
# use apply on the age column
# pass the function itself, without parentheses, to apply()
df['age'].apply(sq_num)

0     144
1     324
2     625
3    1600
4    4225
Name: age, dtype: int64

## Apply over a DataFrame
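With `DataFrame.apply()`, each entire column or row (a Series) is passed as the first argument of the function. A one-argument function like `sq_num`, which only uses arithmetic operators, then works element-wise on that Series, like the other vectorized operations we've seen.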

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

In [None]:
# The axis option specifies by column (0) or by row (1)
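
A small sketch of how the `axis` option changes what the function receives. The helper `col_range` is a hypothetical example function (not from the book); it needs the whole Series at once, which makes it easy to see whether a column or a row was passed in.

```python
import pandas as pd

df = pd.DataFrame({'age': [12, 18, 25, 40, 65],
                   'ht': [58, 68, 70.5, 70.5, 69.5],
                   'fav_temp_F': [75, 71, 72, 68, 77]})

# Hypothetical helper: largest value minus smallest value of a Series
def col_range(s):
    return s.max() - s.min()

# axis=0 (the default): the function is called once per column
by_col = df.apply(col_range, axis=0)

# axis=1: the function is called once per row
by_row = df.apply(col_range, axis=1)

print(by_col)
print(by_row)
```

With `axis=0` the result has one value per column (indexed by column name); with `axis=1` it has one value per row (indexed by row label).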


# Keyword arguments
The functions we've used to this point have taken one argument and done something to it. When we apply them across a column, for example, the function is automatically run on every element in the column. What if we have another argument (or more) to pass to the function other than the data? We can use **_keyword arguments_** in the function call to make it work.

Keyword arguments are often referred to as **_kwargs_**.

We have already done this many times with other functions and methods, but we can also do it with our custom functions. Let's look at a slightly more advanced version of the temperature conversion function from the previous notebook as an example. In this version, we tell the function which direction the conversion goes by specifying the current temperature's scale.

In [12]:
# Generic temperature conversion function specifying temp scale
def temp_conv(temp_in, scale):
    if scale.upper() == "C":
        temp = (temp_in * 1.8) + 32
    elif scale.upper() == "F":
        temp = (temp_in - 32)/1.8
    else:
        temp = -999
    return round(temp, 1)


In [None]:
# use the function with a single value


What if we want to convert the temperatures in our df dataframe?

In [15]:
# call the temp_conv function - produces an error
df['fav_temp_F'].apply(temp_conv)

TypeError: temp_conv() missing 1 required positional argument: 'scale'

In [None]:
# call the temp_conv function with the added keyword argument
df['fav_temp_F'].apply(temp_conv,scale='f')

In [16]:
# add the converted temperatures as a new column in df.
# Do you remember how to do that?
df['fav_temp_C']=df['fav_temp_F'].apply(temp_conv,scale='f')
df

,age,ht,fav_temp_F,fav_temp_C
0,12,58.0,75,23.9
1,18,68.0,71,21.7
2,25,70.5,72,22.2
3,40,70.5,68,20.0
4,65,69.5,77,25.0
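
As a side note, `Series.apply()` also accepts extra positional arguments through its `args` parameter, which takes a tuple. The sketch below, which assumes the same `temp_conv` function and uses a standalone Series in place of the dataframe column, suggests that both styles give the same result; the keyword form is usually easier to read.

```python
import pandas as pd

def temp_conv(temp_in, scale):
    if scale.upper() == "C":
        temp = (temp_in * 1.8) + 32
    elif scale.upper() == "F":
        temp = (temp_in - 32) / 1.8
    else:
        temp = -999
    return round(temp, 1)

# Standalone copy of the fav_temp_F column
temps_f = pd.Series([75, 71, 72, 68, 77], name='fav_temp_F')

# Extra argument passed by keyword
by_keyword = temps_f.apply(temp_conv, scale='f')

# Same extra argument passed positionally; note the one-element tuple
by_position = temps_f.apply(temp_conv, args=('f',))

print(by_keyword.equals(by_position))
```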


# Vectorizing functions
When we `apply()` a function on the dataframe, we pass the entire axis (i.e., column or row) as an argument. If we have multiple columns passed, for example, pandas will automatically process them in an element-wise fashion. That is, pandas will automatically vectorize them.

Let's write a simple function that calculates height in inches per year of age for each person in our data.


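Sometimes a function will not work element-wise on its own. In that case, there are two options:
- rewrite the function to handle the parts of the column separately (see section 9.3.2.1 of the book).
- vectorize the function to make it work the way we expect. Vectorizing can also make it run faster.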



In [17]:
# height in inches per year of age
def ht_per_year(h,a):
    return h/a

In [18]:
# try it out on our dataframe
ht_per_year(df['ht'],df['age'])

0    4.833333
1    3.777778
2    2.820000
3    1.762500
4    1.069231
dtype: float64

Most of the time that is exactly what we want. But what if the function is a little more complex? Let's change the function so the calculation is not done if the age is over 40.

In [22]:
# modified height in inches per year of age
def ht(h,a):
    import numpy as np
    if a<=40:
        return h/a
    else:
        return np.nan

In [25]:
# try the new version on our dataframe - error
ht(df['ht'],df['age'])

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [None]:
# what if we run it with numbers instead of dataframe columns?
ht(56,28)
ht(56,41)

The modified function works with numbers but not with our dataframe columns. To make it work the way we expect on columns, we use the `vectorize` function from `numpy`. We pass the function we want to vectorize, and `np.vectorize` creates a new function. 

In [29]:
# vectorize the function
import numpy as np
ht_year_vec=np.vectorize(ht)
ht_year_vec(df['ht'],df['age'])

array([4.83333333, 3.77777778, 2.82      , 1.7625    ,        nan])

This works well when someone else wrote the function and we can't modify its source code. If we are writing our own functions, we can use a Python **_decorator_** to vectorize the function when it is created.
- A decorator extends the behavior of a function without modifying the function.

To use the vectorize decorator, we place `@np.vectorize` on the line right before the function definition.

In [30]:
# decorator version
@np.vectorize  # only the function defined right after this is affected
def v_ht_year(h,a):
    import numpy as np
    if a<=40:
        return h/a
    else:
        return np.nan

In [31]:
# Now we can use this vectorized version without having to create
# a separate vectorized function first
v_ht_year(df['ht'],df['age'])

array([4.83333333, 3.77777778, 2.82      , 1.7625    ,        nan])
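
For comparison, here is a rough sketch of the other option: rewriting the function so it handles whole columns directly. It uses `np.where` as one possible technique, not necessarily the approach taken in the book, and the name `ht_where` is made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [12, 18, 25, 40, 65],
                   'ht': [58, 68, 70.5, 70.5, 69.5],
                   'fav_temp_F': [75, 71, 72, 68, 77]})

# Hypothetical rewrite: no if/else on a whole Series.
# np.where picks h/a where the condition holds and NaN elsewhere.
def ht_where(h, a):
    return np.where(a <= 40, h / a, np.nan)

result = ht_where(df['ht'], df['age'])
print(result)
```

Because the comparison and the division both operate on entire columns, no `np.vectorize` wrapper is needed here.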

In [None]:
# add it to the dataframe
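
One way to complete this step, as a sketch: assign the array returned by the vectorized function to a new column. The column name `ht_per_year` is an arbitrary choice, and the dataframe and function are repeated so the cell runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [12, 18, 25, 40, 65],
                   'ht': [58, 68, 70.5, 70.5, 69.5],
                   'fav_temp_F': [75, 71, 72, 68, 77]})

@np.vectorize
def v_ht_year(h, a):
    if a <= 40:
        return h / a
    else:
        return np.nan

# The returned array has one value per row, so it can be assigned directly
df['ht_per_year'] = v_ht_year(df['ht'], df['age'])
print(df)
```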


# Lambda functions
You may be thinking that the functions we have used so far are awfully simple to justify all this trouble just to use them with a dataframe. And you would be correct. Of course, the point is **_how_** this is done, regardless of the complexity of the function. Most of the time you would write a custom function and vectorize it only if it were more complex. Your data will determine that.

However, for simple functions, there is a way to _write the function code directly into the `apply` method_. That technique is called a **_lambda function_**. In Python, a lambda function is limited to a single expression, so it is used for short, one-line functions.

Let's do a conversion of the height from inches to cm as a lambda function. We use the keyword `lambda` inside the apply method. The convention is to refer to the parameter as x. The result could also be saved as a new column in df, just as we did with the converted temperatures.

In [32]:
# height conversion as a lambda function
df['ht'].apply(lambda x: x*2.54)

0    147.32
1    172.72
2    179.07
3    179.07
4    176.53
Name: ht, dtype: float64

In [None]:
# Of course, that is not necessarily the best or easiest way to do 
# that particular calculation.
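
For a calculation this simple, plain column arithmetic is usually the easier route, since pandas already applies arithmetic operators element-wise. A short sketch comparing the two, where the column name `ht_cm` is chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'age': [12, 18, 25, 40, 65],
                   'ht': [58, 68, 70.5, 70.5, 69.5],
                   'fav_temp_F': [75, 71, 72, 68, 77]})

# Lambda version, called once per element
via_apply = df['ht'].apply(lambda x: x * 2.54)

# Direct arithmetic on the whole column, no apply() needed
via_arith = df['ht'] * 2.54

# Save the result as a new column
df['ht_cm'] = via_arith

# Largest difference between the two approaches
print((via_apply - via_arith).abs().max())
```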


Here is the example from the book that uses a more sophisticated (but short) function. This uses regex to extract names from each element.

In [38]:
# load the data
# raw string (r'...') so the backslashes in the Windows path are not treated as escapes
docs=pd.read_csv(r'..\MIS-3335\data\doctors.csv')
docs

,doctor
0,William Hartnell (1963-66)
1,Patrick Troughton (1966-69)
2,Jon Pertwee (1970 74)
3,Tom Baker (1974-81)
4,Peter Davison (1982-84)
5,Colin Baker (1984-86)
6,Sylvester McCoy (1987-89)
7,Paul McGann (1996)
8,Christopher Eccleston (2005)
9,David Tennant (2005-10)


In [40]:
# create and use a regex pattern to extract just the letters (and spaces)
# Remember that the group() method returns the matching string
import re
# raw string keeps \w and \s as regex tokens: word, whitespace, word
p=re.compile(r'\w+\s+\w+')
docs['name']=docs['doctor'].apply(lambda x: p.match(x).group())

In [41]:
docs

,doctor,name
0,William Hartnell (1963-66),William Hartnell
1,Patrick Troughton (1966-69),Patrick Troughton
2,Jon Pertwee (1970 74),Jon Pertwee
3,Tom Baker (1974-81),Tom Baker
4,Peter Davison (1982-84),Peter Davison
5,Colin Baker (1984-86),Colin Baker
6,Sylvester McCoy (1987-89),Sylvester McCoy
7,Paul McGann (1996),Paul McGann
8,Christopher Eccleston (2005),Christopher Eccleston
9,David Tennant (2005-10),David Tennant
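
To see what the lambda does for a single element, the sketch below runs the same pattern on one string. It also shows a caveat: `match()` returns `None` when the text does not start with two words, and calling `.group()` on `None` raises an error. The helper `first_two_words` is a hypothetical, more defensive variant and is not part of the book's example.

```python
import re

p = re.compile(r'\w+\s+\w+')

# One element, as the lambda would see it
m = p.match('Tom Baker (1974-81)')
print(m.group())

# No match at the start of the string, so match() returns None
print(p.match('(1974-81)'))

# Hypothetical helper that returns None instead of raising on a failed match
def first_two_words(text):
    found = p.match(text)
    return found.group() if found else None

print(first_two_words('Jon Pertwee (1970 74)'))
```

A helper like this could be passed to `apply()` in place of the lambda if some rows might not follow the expected format.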
