# When To Not Use Apply On A DataFrame

We just learned how, by using `.apply()`, we can apply any function to a groupby object.  The `.apply()` method can also be used on dataframes, to apply custom functions row-wise (and column-wise).

However, if we can use a built-in DataFrame method or vector math (that is, adding, subtracting, multiplying, or dividing entire columns at once), we should. `.apply()` is a last resort because it calls a Python function once per row or element, which is usually the slowest option.
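To make the difference concrete, here is a minimal sketch on a made-up DataFrame (the `price` and `quantity` columns are hypothetical and not part of our data sets). Both lines compute the same totals; the first calls a Python function for every row, the second multiplies the two columns in one step.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({"price": [2.0, 3.5, 4.0], "quantity": [10, 4, 7]})

# Row-wise apply: the lambda is called once per row.
total_apply = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Vector math: the two whole columns are multiplied at once.
total_vector = df["price"] * df["quantity"]

print(total_vector.tolist())  # [20.0, 14.0, 28.0]
```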

We will introduce some examples of using `.apply` below and then we will show how there are better ways.

<a name=introtoapply></a>
# Introducing Apply For DataFrames

The .apply() method we are going to learn about is exactly like the .apply() we learned about in our groupby notebook.  The only difference is that this is the method for dataframes instead of groupby objects.

The docs are here: <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html>
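As a quick sketch of what "row-wise" and "column-wise" mean, the `axis` argument controls what gets passed to our function. The small DataFrame below is made up purely for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series.
col_ranges = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_ranges.to_dict())  # {'a': 2, 'b': 20}
print(row_sums.tolist())     # [11, 22, 33]
```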

We are going to go back to our baby boy name data for our first example. So, let's load the data!

<a name=examplebabyname></a>
# An Example Of Using .apply() On The Baby Boy Name Data:

### We will see there is a better way to do this without using apply

In [1]:
import pandas as pd
import os

filepath = os.path.join(os.getcwd(), 'data', 'Most_Popular_Baby_Boy_Names__1980-2013.csv')
name_data = pd.read_csv(filepath)

In [3]:
name_data.head()

| Unnamed: 0 | Rank | Year | Name | Frequency |
|---|---|---|---|---|
| 0 | 1 | 1980 | Michael | 3886 |
| 1 | 2 | 1980 | Jason | 2389 |
| 2 | 3 | 1980 | Christopher | 2273 |
| 3 | 4 | 1980 | Matthew | 2112 |
| 4 | 5 | 1980 | David | 2088 |


### Let's take a look at these name values - we will use .unique() to do so:

In [4]:
name_data['Name'].unique()

array(['Michael', 'Jason', 'Christopher', 'Matthew', 'David', 'James',
       'Robert', 'Daniel', 'John', 'Joseph', 'Brian', 'Joshua', 'Ryan',
       'Nicholas', 'Eric', 'Justin', 'Timothy', 'William', 'Anthony',
       'Andrew', 'Adam', 'Thomas', 'Kevin', 'Steven', 'Jeremy',
       'Jonathan', 'Brandon', 'Kyle', 'Jospeh', 'Jonathon', 'Jacob',
       'Zachary', 'Alexander', 'Tyler', 'Austin', 'Dylan', 'Benjamin',
       'Ethan', 'Christian', 'Samuel', 'Zachery', 'Nathan', 'Jack',
       'DANIEL', 'JACOB', 'MICHAEL', 'ALEXANDER', 'JOSHUA', 'ANTHONY',
       'MATTHEW', 'JOSEPH', 'ETHAN', 'ANDREW', 'RYAN', 'CHRISTOPHER',
       'NICHOLAS', 'DAVID', 'WILLIAM', 'NATHAN', 'JOHN', 'BENJAMIN',
       'JAMES', 'TYLER', 'LOGAN', 'JACK', 'NOAH', 'JONATHAN', 'ANGEL',
       'CHRISTIAN', 'JAYDEN', 'AIDEN', 'JACKSON', 'MASON', 'LUCAS',
       'LIAM', 'GABRIEL', 'ELIJAH', 'HENRY'], dtype=object)

### Most likely, we don't want Michael and MICHAEL to count as two different names - let's fix that using `.apply()`, lambda, and `.lower()`!

The basic idea is that we want to lowercase all of the names, so that every 'Michael' and every 'MICHAEL' becomes 'michael'. Now every Michael will be counted as the same name! We can lowercase a string using the `.lower()` method; let's review this below.

#### A quick review of the lower method:

In [5]:
name = 'MICHAEL'
print(name.lower())

michael


### Now, let's use `.apply()` to _apply_ the `.lower()` method to all the names in the 'Name' column
We have to wrap it in a function though! The `.apply()` method expects a callable that takes each value as its argument, so we cannot hand it `.lower()` by itself; a named function or a _lambda_ function will do.

As a review, let's first write a function that will convert a string to lower case.

In [6]:
def lower_string(my_string):
    return my_string.lower()

lower_string('MICHAEL')

'michael'

### Let's now apply this function to the entire column of names

In [7]:
name_data['Name'].apply(lower_string)

0          michael
1            jason
2      christopher
3          matthew
4            david
          ...     
845         nathan
846         andrew
847          henry
848          david
849           jack
Name: Name, Length: 850, dtype: object

### We can also use a lambda function:

In [8]:
name_data['Name'].apply(lambda x: x.lower())

0          michael
1            jason
2      christopher
3          matthew
4            david
          ...     
845         nathan
846         andrew
847          henry
848          david
849           jack
Name: Name, Length: 850, dtype: object
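Any callable that accepts a single value works here. As a small sketch on a stand-in Series (not the real `name_data`), `str.lower` referenced on the `str` type itself, rather than on one particular string, is an ordinary callable and gives the same result as the lambda.

```python
import pandas as pd

# Small stand-in for name_data['Name']; the values are for illustration only.
names = pd.Series(["Michael", "MICHAEL", "Jason"])

via_lambda = names.apply(lambda x: x.lower())
via_builtin = names.apply(str.lower)  # str.lower(x) is the same as x.lower()

print(via_builtin.tolist())  # ['michael', 'michael', 'jason']
```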

## Using `%%timeit` to Measure the Speed of Our Code

We can use the `%%timeit` magic command. This is a special command that exists in Jupyter and IPython only. More magic commands can be found here: https://ipython.readthedocs.io/en/stable/interactive/magics.html

In [9]:
%%timeit
lower_case_names = name_data['Name'].apply(lambda x: x.lower())

344 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
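Since `%%timeit` only exists in Jupyter and IPython, here is a rough sketch of a similar measurement in a plain Python script using the standard library `timeit` module. The stand-in Series and the choice of 100 repetitions are assumptions for illustration; the timings will differ from the ones shown in this notebook.

```python
import timeit

import pandas as pd

# Hypothetical stand-in data, not the real baby name file.
names = pd.Series(["Michael", "MICHAEL", "Jason"] * 100)

# timeit.timeit calls the function `number` times and returns the total seconds.
number = 100
total_seconds = timeit.timeit(lambda: names.apply(lambda x: x.lower()), number=number)

print(f"{total_seconds / number * 1e6:.1f} µs per loop")
```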


## A Better Way Than using `.apply()` (Less Code, and Faster):
We can access the string methods of a column that contains strings by using the `str` attribute (very similar to how we accessed the datetime properties using the `dt` attribute). We can then call the `lower()` method from the `str` attribute.

In [10]:
%%timeit
lower_case_names = name_data['Name'].str.lower()

288 µs ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
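Speed is not the only benefit. As a sketch on a made-up Series with one missing name (our real data may not contain any), the `str` accessor leaves missing values alone, while the lambda version fails because a missing value has no `.lower()` method.

```python
import pandas as pd

# Hypothetical names with one missing entry; dtype=object keeps the None as-is.
names = pd.Series(["Michael", "MICHAEL", None], dtype=object)

# The str accessor lowercases the strings and leaves the missing entry missing.
lowered = names.str.lower()

# The lambda version tries to call .lower() on None and raises AttributeError.
try:
    names.apply(lambda x: x.lower())
    apply_failed = False
except AttributeError:
    apply_failed = True

print(lowered.nunique())  # 1 -> 'Michael' and 'MICHAEL' now count as one name
print(apply_failed)       # True
```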


<a name=examplelaborsheet></a>
# Another Example On Our Labor Sheet Data

In [11]:
filepath = os.path.join(os.getcwd(), 'data', 'ShiftManagerApp_LaborSheet.csv')

labor_sheet_data = pd.read_csv(filepath, parse_dates=[['Date', 'Ending_Hour'], 'Timestamp'])
labor_sheet_data.head()

| Unnamed: 0 | Date_Ending_Hour | Store_ID | Manager | Projected_Sales | Sales | DT_TTL | Car_Count | KVS_Total | Scheduled_People | Actual_People | Reason_for_Labor_Diff | Reason_for_High_TTLs | Manager_Entering_Data | Timestamp | OEPE | Park_Percentage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-23 08:00:00 | 4462 | JillianA | 540.0 | 420.0 | 170.0 | | 100.0 | | | | | | 2017-01-23 09:52:14 | | |
| 1 | 2017-02-05 06:00:00 | 4462 | ZoeyD | 90.0 | 155.0 | 114.0 | | 78.0 | | | | | | 2017-02-05 11:30:48 | | |
| 2 | 2017-02-05 07:00:00 | 4462 | JessicaB | 173.0 | 182.0 | 106.0 | | 81.0 | | | | | | 2017-02-05 11:35:48 | | |
| 3 | 2017-02-05 08:00:00 | 4462 | JessicaB | 333.0 | 311.0 | 102.0 | | 55.0 | | | | | | 2017-02-05 11:52:05 | | |
| 4 | 2017-02-05 09:00:00 | 4462 | JessicaB | 594.0 | 598.0 | 155.0 | | 106.0 | | | | | | 2017-02-05 11:59:35 | | |


## We Want To Calculate A Simple Score For Each Hour

The formula for this score will be:

`(Sales - Projected_Sales)/100 - (Scheduled_People + (Sales - Projected_Sales)/50) - (Timestamp - Date_Ending_Hour)/15`

Here `(Timestamp - Date_Ending_Hour)` is the delay measured in minutes.

Let's calculate this score using `apply()` and then again using vector math.

In [12]:
def hour_score(x):
    score = (x['Sales'] - x['Projected_Sales'])/100 - \
    (x["Scheduled_People"] + (x['Sales'] - x['Projected_Sales'])/50) - \
    (x["Timestamp"] - x['Date_Ending_Hour']).total_seconds()/(60*15)
    return score

### Now, let's `%%timeit`

In [13]:
%%timeit
labor_sheet_data.apply(hour_score, axis=1)

1.84 s ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Now, let's time a better implementation

In [14]:
%%timeit
(labor_sheet_data['Sales'] - labor_sheet_data['Projected_Sales'])/100 - \
    (labor_sheet_data["Scheduled_People"] + (labor_sheet_data['Sales'] - labor_sheet_data['Projected_Sales'])/50) - \
    (labor_sheet_data["Timestamp"] - labor_sheet_data['Date_Ending_Hour']).dt.total_seconds()/(60*15)

3.1 ms ± 354 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
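Before trusting the faster version, it is worth checking that both approaches give the same numbers. The sketch below repeats the same expression as the two cells above on a tiny made-up DataFrame; the `Scheduled_People` values are hypothetical, and the intermediate names `diff` and `delay` are introduced only for readability.

```python
import pandas as pd

# Tiny stand-in for labor_sheet_data; Scheduled_People values are hypothetical.
toy = pd.DataFrame({
    "Sales": [420.0, 155.0],
    "Projected_Sales": [540.0, 90.0],
    "Scheduled_People": [5.0, 3.0],
    "Date_Ending_Hour": pd.to_datetime(["2017-01-23 08:00:00", "2017-02-05 06:00:00"]),
    "Timestamp": pd.to_datetime(["2017-01-23 09:52:14", "2017-02-05 11:30:48"]),
})

def hour_score(x):
    # Row-wise version: x is a single row of the DataFrame.
    diff = x["Sales"] - x["Projected_Sales"]
    delay = (x["Timestamp"] - x["Date_Ending_Hour"]).total_seconds() / (60 * 15)
    return diff / 100 - (x["Scheduled_People"] + diff / 50) - delay

row_wise = toy.apply(hour_score, axis=1)

# Column-math version: every step works on whole columns at once.
diff = toy["Sales"] - toy["Projected_Sales"]
delay = (toy["Timestamp"] - toy["Date_Ending_Hour"]).dt.total_seconds() / (60 * 15)
vectorized = diff / 100 - (toy["Scheduled_People"] + diff / 50) - delay

# The two results should agree up to floating-point rounding.
print(((row_wise - vectorized).abs() < 1e-9).all())
```

Note the one difference between the two versions: a single row gives us a `Timedelta` with a `.total_seconds()` method, while a whole column needs the `.dt` accessor first.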


# The Lesson Is - Always Write Data Transformation Formulas As "Column Math" If You Can

# In Class Exercise
Please create a cell below and use `.apply()` to manipulate the data in the `labor_sheet_data` DataFrame. Try to compare an `.apply()` approach to a non-apply approach.

## Question or Comments About This Notebook?
Feel free to contact me via my LinkedIn: https://www.linkedin.com/in/william-j-henry <br>
You can also email me at will@henryanalytics.com <br>