# Lab 09: Applying Functions to DataFrames

## #1
Write a transforming function implementing the formula `y = 5x - 2`

In [1]:
def my_function(x):
    y = 5 * x - 2
    return y

Test it...

In [2]:
my_function(3)

13

In [3]:
my_function(-11.3)

-58.5

## #2
Create a numeric Pandas Series using the `pd.Series` call we've seen before. Use values of your choice.

In [5]:
import pandas as pd
s = pd.Series([5, 3, 8])

## #3
Apply your transforming function from #1 to this Series.

In [6]:
s.apply(my_function)

0    23
1    13
2    38
dtype: int64

## #4
Repeat #3, but apply `y = 5x - 2` as a lambda function instead of a *named* function.

Lambda functions are created using `lambda` instead of `def`.

In [7]:
s.apply(lambda x: 5 * x - 2)

0    23
1    13
2    38
dtype: int64
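For a formula this simple, `apply` isn't strictly necessary, because arithmetic operators on a Series already work element by element. Here is a minimal sketch (rebuilding the same example Series so the snippet stands alone) comparing the two approaches:

```python
import pandas as pd

s = pd.Series([5, 3, 8])

# Arithmetic on a Series is applied to every element at once,
# so simple formulas need no explicit function.
vectorized = 5 * s - 2

# The lambda-based apply produces the same values.
applied = s.apply(lambda x: 5 * x - 2)

print(vectorized.equals(applied))
```

The vectorized form is usually faster on large Series; `apply` earns its keep when the logic can't be written as plain arithmetic.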

## #5
Apply this lambda function to your Series: `lambda x: [x, x + 1]`.
What type of data is now in each element?
Why did that happen?

In [8]:
s.apply(lambda x: [x, x + 1])

0    [5, 6]
1    [3, 4]
2    [8, 9]
dtype: object

Now each element of our Series is a list.
This is because our function converts scalars (the inputs) into lists (the outputs).
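To check this directly, we can apply `type` to the result. The sketch below also shows one possible variation that is not part of the lab question: if the function returns a Series instead of a list, `apply` builds a DataFrame with one column per returned value.

```python
import pandas as pd

s = pd.Series([5, 3, 8])
lists = s.apply(lambda x: [x, x + 1])

# Every element is a Python list, which is why the dtype is "object".
element_types = lists.apply(type)
print(element_types.unique())

# Returning a Series instead of a list makes apply produce a DataFrame,
# with one column per returned value (columns are labelled 0 and 1).
expanded = s.apply(lambda x: pd.Series([x, x + 1]))
print(expanded)
```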

## #6
Open the flights data.
Create and apply a function such that the result is a Series with one element for each row.
*Hint: take a look at the chart from lecture.*

In [9]:
flights = pd.read_csv('../data/flights.csv')
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01 05:00:00
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01 05:00:00
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01 05:00:00
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01 05:00:00
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01 06:00:00


For this question, any reducing function is fine.
I wrote a function to return either dep_delay or arr_delay for each row, whichever is larger.

In [12]:
def get_largest_delay(row):
    # Return whichever delay was larger (arrival or departure)
    return max(row['dep_delay'], row['arr_delay'])

In [15]:
delays = flights.apply(get_largest_delay, axis=1)
delays.head()

0    11.0
1    20.0
2    33.0
3    -1.0
4    -6.0
dtype: float64
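The same row-wise reduction can also be written without a custom function by selecting the two columns and calling `.max(axis=1)`. The sketch below uses a small made-up DataFrame (`toy` is hypothetical, not the real flights data) with one missing value. Python's built-in `max` effectively returns its first argument whenever either value is NaN, whereas the pandas method skips missing values by default.

```python
import pandas as pd

# Hypothetical miniature of the flights data; the last row has no arr_delay.
toy = pd.DataFrame({
    'dep_delay': [2.0, -1.0, 7.0],
    'arr_delay': [11.0, -18.0, float('nan')],
})

# Reduce each row to its larger delay. Missing values are skipped
# by default (skipna=True), so the last row falls back to dep_delay.
largest = toy[['dep_delay', 'arr_delay']].max(axis=1)
print(largest)
```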

## #7
Create a function that takes in a row (a Series) and returns a string with "carrier" and "flight" combined into one, separated by a dash. For example, running your function on the first row should yield "UA-1545".

In [20]:
def combine_flight_carrier(row):
    carrier = row['carrier']
    flight = row['flight']
    # Flight is an integer by default – we need to convert it
    # to a string.
    flight = str(flight)
    combined = carrier + '-' + flight
    return combined

Let's test it on the first row.

In [21]:
first_row = flights.loc[0]
combine_flight_carrier(first_row)

'UA-1545'

## #8
Apply your function from #7 to every row, and use it to create a new column "carrier_flight_code".

In [22]:
flights['carrier_flight_code'] = flights.apply(combine_flight_carrier, axis=1)
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour,carrier_flight_code
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01 05:00:00,UA-1545
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01 05:00:00,UA-1714
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01 05:00:00,AA-1141
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01 05:00:00,B6-725
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01 06:00:00,DL-461
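As an aside, string columns can also be concatenated directly, without a row-wise `apply`. This is a sketch on a tiny made-up DataFrame (`toy` is hypothetical) that assumes, as in the flights data, that "flight" is stored as integers:

```python
import pandas as pd

# Hypothetical two-row stand-in for the flights data.
toy = pd.DataFrame({'carrier': ['UA', 'B6'], 'flight': [1545, 725]})

# Convert the integer column to strings, then concatenate element-wise.
codes = toy['carrier'] + '-' + toy['flight'].astype(str)
print(codes)
```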


## #9
Suppose we learn that there are some systematic errors in the data. "dep_delay" is understated by 2 minutes for all flights leaving from LGA and overstated by 1 minute for flights leaving from EWR.
It's correct for flights leaving from JFK.
Use a custom function and an `apply` to create a new column, "real_dep_delay", adjusting for these inaccuracies.

In [23]:
def adjust_dep_delay(row):
    dep_delay = row['dep_delay']
    # Take different action based on the origin.
    if row['origin'] == 'LGA':
        dep_delay = dep_delay + 2
    elif row['origin'] == 'EWR':
        dep_delay = dep_delay - 1
    # Other cases must be JFK, so we don't need to
    # adjust them in any way.
    return dep_delay

In [26]:
flights['real_dep_delay'] = flights.apply(adjust_dep_delay, axis=1)
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,...,tailnum,origin,dest,air_time,distance,hour,minute,time_hour,carrier_flight_code,real_dep_delay
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,...,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01 05:00:00,UA-1545,1.0
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,...,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01 05:00:00,UA-1714,6.0
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,...,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01 05:00:00,AA-1141,2.0
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,...,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01 05:00:00,B6-725,-1.0
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,...,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01 06:00:00,DL-461,-4.0
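Another way to express the same correction is as a lookup table of offsets per origin, applied with `Series.map`. The sketch below is one possible alternative, not the approach the question asks for; the `toy` data and the `corrections` dictionary are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row stand-in for the flights data.
toy = pd.DataFrame({
    'origin': ['EWR', 'LGA', 'JFK'],
    'dep_delay': [2.0, 4.0, 2.0],
})

# Minutes to add per origin; origins not listed (JFK) need no change.
corrections = {'LGA': 2, 'EWR': -1}

# map() looks up each origin; unlisted origins become NaN, so fill with 0.
offset = toy['origin'].map(corrections).fillna(0)
toy['real_dep_delay'] = toy['dep_delay'] + offset
print(toy)
```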


## #10
Subset your flights data to the first 10,000 rows -- the next question is fairly computationally intense, and this will make it run faster.

In [27]:
# .loc[:10000] is label-inclusive (10,001 rows); .iloc[:10000] gives exactly 10,000.
flights_sample = flights.loc[:10000]
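One detail worth checking when subsetting: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end. A small sketch on a made-up 20-row DataFrame (`toy` is hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical DataFrame with the default 0..19 integer index.
toy = pd.DataFrame({'value': range(20)})

by_label = toy.loc[:10]      # labels 0 through 10, end label included
by_position = toy.iloc[:10]  # positions 0 through 9, end excluded

print(len(by_label), len(by_position))
```

For a speed-up subset like this one, a single extra row makes no practical difference, but the distinction matters whenever an exact row count is required.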

## #11
You've been assigned to summarize the data in a very particular way:
get the modal value of columns whose names start with "dep", and get the mean value of columns whose names start with "arr".
Do this using an applied function.
*Hint: remember you can check the name of a Series using its `.name` attribute*

Start by creating a function to summarize a column based on its name.

In [28]:
def summarize_col(col):
    if col.name.startswith('dep'):
        return col.mode()
    if col.name.startswith('arr'):
        return col.mean()
    # If neither condition is met,
    # return None.
    return None

Apply that function to every column in our data.

In [31]:
summary = flights_sample.apply(summarize_col, axis=0)
summary

year                           None
month                          None
day                            None
dep_time                    [556.0]
sched_dep_time                 None
dep_delay              [-5.0, -4.0]
arr_time                    1518.12
sched_arr_time                 None
arr_delay                  0.709241
carrier                        None
flight                         None
tailnum                        None
origin                         None
dest                           None
air_time                       None
distance                       None
hour                           None
minute                         None
time_hour                      None
carrier_flight_code            None
real_dep_delay                 None
dtype: object

Drop the `None`s for a more concise summary.

In [32]:
summary.dropna()

dep_time          [556.0]
dep_delay    [-5.0, -4.0]
arr_time          1518.12
arr_delay        0.709241
dtype: object

## #12
How could you have done this without `apply`?
There are a few ways.
Are any as concise as `apply`?

One possible way would be to loop through your columns and see which match, then use Series summary functions on them.

In [36]:
for column_name in flights_sample:
    if column_name.startswith('dep'):
        # Select the current column and get its mode(s) as a list.
        modes = flights_sample[column_name].mode().tolist()
        print(column_name, modes)
        print('-------')
    elif column_name.startswith('arr'):
        # Select the current column and get its mean, rounded for display.
        mean = flights_sample[column_name].mean()
        print(column_name, round(mean, 2))
        print('-------')

dep_time [556.0]
-------
dep_delay [-5.0, -4.0]
-------
arr_time 1518.12
-------
arr_delay 0.71
-------


While this method is not much more verbose than using `apply`, I'd argue it's harder to understand.
For highly custom aggregations like this one, `apply` is the best option in my opinion.
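A third option, sketched here as one possibility, is to select the matching columns with `DataFrame.filter` and call the summary method on each group. The `toy` DataFrame below is made up for illustration and only mimics the relevant columns of the flights data.

```python
import pandas as pd

# Hypothetical four-row stand-in for the flights data.
toy = pd.DataFrame({
    'dep_time': [517.0, 533.0, 517.0, 544.0],
    'dep_delay': [2.0, 4.0, 2.0, -1.0],
    'arr_time': [830.0, 850.0, 923.0, 1004.0],
    'arr_delay': [11.0, 20.0, 33.0, -18.0],
    'carrier': ['UA', 'UA', 'AA', 'B6'],
})

# Columns whose names start with "dep": one row per modal value.
dep_modes = toy.filter(regex='^dep').mode()

# Columns whose names start with "arr": one mean per column.
arr_means = toy.filter(regex='^arr').mean()

print(dep_modes)
print(arr_means)
```

This is concise, but it produces two separate results rather than one combined summary, so whether it beats `apply` depends on what you need downstream.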