# Using Apply On A DataFrame

## Notebook Outline:
* [Introduction to .apply()](#introtoapply)
* [An example on our baby boy name data](#examplebabyname)
* [Another example on our labor sheet data](#examplelaborsheet)

<a name=introtoapply></a>
# Introduction to .apply()

The .apply() method we are going to learn about works just like the .apply() we learned about in our groupby lectures. The only difference is that this is the method for DataFrames (and their individual columns, which are Series) instead of groupby objects.

The docs are here: <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html>
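
As a minimal sketch of the difference, here is .apply() on a single column and on a whole DataFrame. The `toy` DataFrame and its values are made up for illustration and are not part of the course data. Calling .apply() on one column passes each value to the function, while calling it on the DataFrame passes each whole column.

```python
import pandas as pd

# Hypothetical toy data, not part of the course datasets
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Series.apply: the function receives one value at a time
doubled = toy['a'].apply(lambda value: value * 2)
print(doubled.tolist())  # [2, 4, 6]

# DataFrame.apply: the function receives one whole column at a time
# (axis=0 is the default)
column_max = toy.apply(lambda column: column.max())
print(column_max)
```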

We are going to go back to our baby boy name data for our first example. So, let's load the data!

<a name=examplebabyname></a>
# An example of using .apply() on our baby boy name data:

In [12]:
import pandas as pd

filepath = ('/Users/yuzhang/Dropbox/Academia/Lecturer/I&C_SCI_X426.62/Assignments/Most_Popular_Baby_Boy_Names__Illinois_1980-2013.csv')
nameData = pd.read_csv(filepath)

#### Let's take a look at these name values again - we will use .unique() to do so:

In [13]:
nameData['Name'].unique()

array(['Michael', 'Jason', 'Christopher', 'Matthew', 'David', 'James',
       'Robert', 'Daniel', 'John', 'Joseph', 'Brian', 'Joshua', 'Ryan',
       'Nicholas', 'Eric', 'Justin', 'Timothy', 'William', 'Anthony',
       'Andrew', 'Adam', 'Thomas', 'Kevin', 'Steven', 'Jeremy',
       'Jonathan', 'Brandon', 'Kyle', 'Jospeh', 'Jonathon', 'Jacob',
       'Zachary', 'Alexander', 'Tyler', 'Austin', 'Dylan', 'Benjamin',
       'Ethan', 'Christian', 'Samuel', 'Zachery', 'Nathan', 'Jack',
       'DANIEL', 'JACOB', 'MICHAEL', 'ALEXANDER', 'JOSHUA', 'ANTHONY',
       'MATTHEW', 'JOSEPH', 'ETHAN', 'ANDREW', 'RYAN', 'CHRISTOPHER',
       'NICHOLAS', 'DAVID', 'WILLIAM', 'NATHAN', 'JOHN', 'BENJAMIN',
       'JAMES', 'TYLER', 'LOGAN', 'JACK', 'NOAH', 'JONATHAN', 'ANGEL',
       'CHRISTIAN', 'JAYDEN', 'AIDEN', 'JACKSON', 'MASON', 'LUCAS',
       'LIAM', 'GABRIEL', 'ELIJAH', 'HENRY'], dtype=object)

#### Most likely, we don't want Michael and MICHAEL to count as two different names - let's fix that using .apply(), lambda, and .lower()!

The basic idea here is that we want to lowercase all of the names, so that every 'Michael' and every 'MICHAEL' becomes 'michael'. Now every Michael will be counted as the same name! We can lowercase a string using the .lower() method; let's review this below.

In [14]:
name = 'MICHAEL'
print(name.lower())

# Note you can call .lower() directly on the string as well:
print('MICHAEL'.lower())

michael
michael


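#### Now let's use .apply() to _apply_ the .lower() method to all the names in the 'Name' column
One way to do this is to wrap it in a lambda function. The .apply() method expects a _function_ that it can call on each value. Writing .lower() on its own is a method call on one specific string, so we need a function that takes the string as its argument and calls .lower() on it.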

As a review, let's first write a function that will convert a string to lower case.

In [15]:
def lowerString(aString):
    # Use a local name that does not shadow the function's own name
    lowered = aString.lower()
    return lowered

lowerString('MICHAEL')

'michael'

#### Let's now apply this function to the entire column of names

In [16]:
nameData['Name'].apply(lowerString)

0          michael
1            jason
2      christopher
3          matthew
4            david
          ...     
845         nathan
846         andrew
847          henry
848          david
849           jack
Name: Name, Length: 850, dtype: object

#### Now, let's do the same, but use a _lambda_ function instead!

In [17]:
nameData['Name'].apply(lambda x: x.lower())

0          michael
1            jason
2      christopher
3          matthew
4            david
          ...     
845         nathan
846         andrew
847          henry
848          david
849           jack
Name: Name, Length: 850, dtype: object
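
Since .apply() accepts any callable, there are other spellings that should give the same result. The sketch below uses a small made-up `names` Series as a stand-in for `nameData['Name']` and compares the lambda with two alternatives: passing `str.lower` directly, and using the vectorized `.str.lower()` accessor, which needs no .apply() at all.

```python
import pandas as pd

# Hypothetical sample standing in for nameData['Name']
names = pd.Series(['Michael', 'MICHAEL', 'Jacob', 'JACOB'])

with_lambda = names.apply(lambda x: x.lower())
with_function = names.apply(str.lower)  # str.lower is itself a function of one string
with_accessor = names.str.lower()       # vectorized string method, no apply needed

# All three approaches produce the same values
print(with_lambda.tolist() == with_function.tolist() == with_accessor.tolist())  # True
print(with_accessor.nunique())  # 2
```

One practical difference: the `.str` accessor passes missing values (NaN) through unchanged, whereas the lambda would raise an error on them because a float has no .lower() method.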

#### Now let's actually update the values in the dataset:

In [18]:
nameData.loc[:, 'Name'] = nameData['Name'].apply(lambda x: x.lower())

#### Now we are free to groupby the names!

In [19]:
nameData.groupby('Name')['Rank'].mean().sort_values().head(5)

Name
michael    1.705882
jacob      3.833333
daniel     4.441176
matthew    5.176471
liam       7.666667
Name: Rank, dtype: float64
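
To see why the lowercasing step matters for the groupby, here is a small sketch on made-up data. The `toy` DataFrame and its ranks are illustrative only. Before normalizing, each spelling forms its own group; afterwards the spellings merge and their ranks are averaged together.

```python
import pandas as pd

# Hypothetical toy data with mixed-case duplicates
toy = pd.DataFrame({
    'Name': ['Michael', 'MICHAEL', 'Jacob', 'JACOB'],
    'Rank': [1, 3, 2, 4],
})

# Before: 'Michael' and 'MICHAEL' are treated as different names
before = toy.groupby('Name')['Rank'].mean()
print(len(before))  # 4

# After lowercasing, the two spellings fall into the same group
toy['Name'] = toy['Name'].apply(lambda x: x.lower())
after = toy.groupby('Name')['Rank'].mean()
print(len(after))  # 2
print(after)
```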

<a name=examplelaborsheet></a>
# Another example of apply on our labor sheet data

In [20]:
filepath = ('/Users/yuzhang/Dropbox/Academia/Lecturer/I&C_SCI_X426.62/Assignments/LaborSheetData.csv')
laborSheetData = pd.read_csv(filepath, parse_dates=[[2, 3], 13])
laborSheetData.head()

Unnamed: 0,Date_Hour,Store,Manager,Projected Sales,Sales,DT TTL,KVS Total,Hrs +/- (Actual vs Need),Accum Hrs +/- (Act vs Need),Reason for +/- Labor,Reason for High TTL's,Manager Entering Data,TimeStamp
0,2017-02-04 07:00:00,4007,Monica H,281.0,296.0,192,117,-1.0,1.0,,,,2017-02-04 07:10:00
1,2017-02-04 08:00:00,4007,David H,670.0,347.0,111,97,2.39,2.39,,,,2017-02-04 08:04:00
2,2017-02-03 09:00:00,4007,David H,784.0,649.0,143,73,0.18,0.18,,,,2017-02-03 18:41:00
3,2017-02-04 10:00:00,4007,David H,852.0,750.0,158,85,0.75,3.25,,,,2017-02-04 10:06:00
4,2017-02-05 05:00:00,4007,Zoey D,54.0,108.0,104,45,-1.5,-1.5,,,,2017-02-05 10:57:00


#### Let's calculate the difference between 'TimeStamp' and 'Date_Hour' using .apply() and lambda.
Because we are calling .apply() on the whole DataFrame, we have to specify the axis along which the function is applied. This can be confusing, so said another way: if you want to apply the function to every column, use axis=0; if you want to apply the function to every row, use axis=1.
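
A quick sketch of the two axis settings on a tiny made-up DataFrame (the `toy` name and its columns `x` and `y` are illustrative). With axis=0 the function receives each column in turn; with axis=1 it receives each row as a Series indexed by column name, which is why we can look up values like `row['y']` inside the lambda.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# axis=0 (the default): the function is called once per column
per_column = toy.apply(lambda col: col.sum(), axis=0)
print(per_column.tolist())  # [6, 60]

# axis=1: the function is called once per row
per_row = toy.apply(lambda row: row['y'] - row['x'], axis=1)
print(per_row.tolist())  # [9, 18, 27]
```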


In [21]:
laborSheetData['lateEnteringData'] = laborSheetData.apply(lambda x: x['TimeStamp'] - x['Date_Hour'], axis=1)

In [22]:
laborSheetData.head(3)

Unnamed: 0,Date_Hour,Store,Manager,Projected Sales,Sales,DT TTL,KVS Total,Hrs +/- (Actual vs Need),Accum Hrs +/- (Act vs Need),Reason for +/- Labor,Reason for High TTL's,Manager Entering Data,TimeStamp,lateEnteringData
0,2017-02-04 07:00:00,4007,Monica H,281.0,296.0,192,117,-1.0,1.0,,,,2017-02-04 07:10:00,0 days 00:10:00
1,2017-02-04 08:00:00,4007,David H,670.0,347.0,111,97,2.39,2.39,,,,2017-02-04 08:04:00,0 days 00:04:00
2,2017-02-03 09:00:00,4007,David H,784.0,649.0,143,73,0.18,0.18,,,,2017-02-03 18:41:00,0 days 09:41:00
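
For a simple column difference like this, subtracting the two datetime columns directly should give the same result without a row-by-row .apply(). The sketch below uses a made-up two-row `toy` DataFrame whose values are copied from the output above. The `minutes_late` name is an assumption added for illustration.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the labor sheet columns
toy = pd.DataFrame({
    'Date_Hour': pd.to_datetime(['2017-02-04 07:00:00', '2017-02-03 09:00:00']),
    'TimeStamp': pd.to_datetime(['2017-02-04 07:10:00', '2017-02-03 18:41:00']),
})

# Row-by-row, as in the lecture
row_wise = toy.apply(lambda x: x['TimeStamp'] - x['Date_Hour'], axis=1)

# Vectorized: subtract the whole columns at once
vectorized = toy['TimeStamp'] - toy['Date_Hour']
print(row_wise.tolist() == vectorized.tolist())  # True

# Timedeltas can be converted to plain numbers, e.g. minutes
minutes_late = vectorized.dt.total_seconds() / 60
print(minutes_late.tolist())  # [10.0, 581.0]
```

The row-wise .apply() is still the more general tool when the per-row logic cannot be written as a column operation.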
