# Lab 8.1. Apply

### Learning about apply is fundamental to the data cleaning process. It also builds on a key programming concept: writing functions. apply takes a function and “applies” it (i.e., runs it) to each row or column of a dataframe, or to each element of a single column. If you’ve programmed before, the concept should be familiar: it is similar to writing a for loop over the rows or columns and calling the function on each one, except that apply handles the iteration for you. This is usually preferred over a hand-written Python for loop because the code is shorter and is often faster than looping over rows by hand, although pandas’ native vectorized methods are typically faster still.
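To make the comparison with a for loop concrete, here is a minimal sketch on a tiny Series. The helper `double` and the sample values are made up for illustration and are not part of the lab data.

```python
import pandas as pd

def double(x):
    # Hypothetical helper used only for this illustration
    return x * 2

values = pd.Series([1, 2, 3])

# Explicit loop: call the function on each element and collect the results
looped = pd.Series([double(v) for v in values], index=values.index)

# apply: the same function calls, without writing the loop yourself
applied = values.apply(double)

print(applied.tolist())
print(applied.equals(looped))
```

Both approaches call `double` once per element and produce the same Series; apply simply hides the iteration.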



## This lab will cover:

1. Functions
2. Applying functions on pandas dataframes

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
from datetime import datetime

# 1. Functions in Python

In [4]:
# The following function takes x as an argument, squares it and returns the result

def my_sqare_function(x):
    squared= x**2
    
    return squared

In [5]:
my_sqare_function(3)

9

In [6]:
my_sqare_function(45)

2025

In [7]:
# The following function takes x,y as arguments, computes the average and returns the result

def my_average_function(x,y):
    avg= (x+y)/2
    
    return avg

In [8]:
my_average_function(4,6)

5.0

In [9]:
my_average_function(15,45)

30.0

In [10]:
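# The following function takes a timestamp string and returns its first 10 characters (the YYYY-MM-DD date)

def my_date_extractor(input_string):
    date_part = input_string[0:10]
    return date_part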

In [11]:
date='2019-06-21 00:00:00+00:00'

In [12]:
my_date_extractor(date)

'2019-06-21'

# 1.1. Lambda functions in Python


## Sometimes the function passed to apply is simple enough that there is no need to define a separate named function. A lambda function lets you write it inline as a single expression.

## Lambda functions are very useful for processing data in pandas-based environments.

In [13]:
lambda_add = lambda x, y: x + y

In [14]:
lambda_add(3,4)

7

In [15]:
lambda_add(5,7)

12

In [16]:
lambda_mean= lambda x,y:(x+y)/2

In [17]:
lambda_mean(0,10)

5.0

In [18]:
lambda_date_extractor = lambda date:date[0:10]

In [19]:
lambda_date_extractor('2019-06-21 00:00:00+00:00')

'2019-06-21'
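The sketch below places a named function next to its lambda equivalent and then passes a lambda inline to another function, which is the typical use. The names `date_part`, `date_part_lambda` and `timestamps` are illustrative only.

```python
# A named function and the equivalent lambda (both names are illustrative)
def date_part(text):
    return text[0:10]

date_part_lambda = lambda text: text[0:10]

sample = '2019-06-21 00:00:00+00:00'
same_result = date_part(sample) == date_part_lambda(sample)
print(same_result)

# A lambda is most useful inline, passed straight to another function
timestamps = ['2019-06-21 00:00:00+00:00', '2019-05-07 02:00:00+00:00']
dates = list(map(lambda text: text[0:10], timestamps))
print(dates)
```

Assigning a lambda to a name, as in the cells above, works but gains little over `def`; the inline form is where lambdas save typing.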

# 2. Apply functions in pandas

## Now that we know how to write functions, how do we use them in pandas? When working with dataframes, you will usually want to run a function across the rows or columns of your data.
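As a small illustration of running a function across columns versus rows, the sketch below uses `DataFrame.apply` with its `axis` argument on a made-up dataframe. The helper `value_range` and the data are assumptions for this example only.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def value_range(values):
    # Receives a whole column (axis=0) or a whole row (axis=1) as a Series
    return values.max() - values.min()

# axis=0 (the default): the function is called once per column
per_column = df.apply(value_range, axis=0)

# axis=1: the function is called once per row
per_row = df.apply(value_range, axis=1)

print(per_column)
print(per_row)
```

With `axis=0` the result has one entry per column; with `axis=1` it has one entry per row.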

In [20]:
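# Note: parse_dates=True only attempts to parse the index, so the date.utc column is still read as plain strings
air_quality_no2 = pd.read_csv('https://raw.githubusercontent.com/thousandoaks/BEMM458/master/data/air_quality_no2_long.csv',parse_dates=True)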

In [21]:
air_quality_no2

Unnamed: 0,city,country,date.utc,location,parameter,value,unit
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³
...,...,...,...,...,...,...,...
2063,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³
2064,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³
2065,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³
2066,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³


## 2.1. Using Apply with your functions 

In [22]:
### apply is useful to process dataframes using your own functions

In [23]:
air_quality_no2['valuesquared']=air_quality_no2['value'].apply(my_sqare_function)

In [24]:
air_quality_no2

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,valuesquared
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³,400.00
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³,475.24
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³,702.25
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³,620.01
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³,457.96
...,...,...,...,...,...,...,...,...
2063,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,676.00
2064,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,256.00
2065,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00
2066,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00
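A function passed to apply can also take extra parameters. The sketch below assumes a hypothetical helper `scale` with a `factor` parameter and passes it through the `args` argument of `Series.apply`; the sample values are made up.

```python
import pandas as pd

def scale(x, factor):
    # Hypothetical helper: multiply a single value by a given factor
    return x * factor

values = pd.Series([20.0, 26.5])

# Each element is passed as x; the extra positional argument comes from args
scaled = values.apply(scale, args=(2,))

print(scaled.tolist())
```

This avoids wrapping the function in a lambda just to fix one of its parameters.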


In [25]:
### For performance and code maintainability, it is advisable to use native pandas methods whenever possible
air_quality_no2['value'].pow(2)

0       400.00
1       475.24
2       702.25
3       620.01
4       457.96
         ...  
2063    676.00
2064    256.00
2065    361.00
2066    361.00
2067    529.00
Name: value, Length: 2068, dtype: float64
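To illustrate the point about native methods, the sketch below compares apply with a vectorized equivalent for both squaring and date extraction. It uses a two-row sample frame built by hand (an assumption, so the example runs without downloading the lab data).

```python
import pandas as pd

sample = pd.DataFrame({
    'date.utc': ['2019-06-21 00:00:00+00:00', '2019-06-20 23:00:00+00:00'],
    'value': [20.0, 26.5],
})

# Squaring: apply with a lambda versus the native ** operator
squared_apply = sample['value'].apply(lambda v: v ** 2)
squared_native = sample['value'] ** 2

# Date extraction: apply with a lambda versus the native .str slicing accessor
date_apply = sample['date.utc'].apply(lambda text: text[0:10])
date_native = sample['date.utc'].str[0:10]

print(squared_native.tolist())
print(date_native.tolist())
```

The results match in both cases; the native versions run on the whole column at once rather than calling a Python function per element.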

In [26]:
air_quality_no2['date']=air_quality_no2['date.utc'].apply(my_date_extractor)

In [27]:
air_quality_no2

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,valuesquared,date
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³,400.00,2019-06-21
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³,475.24,2019-06-20
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³,702.25,2019-06-20
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³,620.01,2019-06-20
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³,457.96,2019-06-20
...,...,...,...,...,...,...,...,...,...
2063,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,676.00,2019-05-07
2064,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,256.00,2019-05-07
2065,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00,2019-05-07
2066,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00,2019-05-07


## 2.2. Using Apply with built-in functions

## apply also works with existing functions that you did not write yourself, such as NumPy's np.sqrt

In [28]:
air_quality_no2['value'].apply(np.sqrt)

0       4.472136
1       4.669047
2       5.147815
3       4.989990
4       4.626013
          ...   
2063    5.099020
2064    4.000000
2065    4.358899
2066    4.358899
2067    4.795832
Name: value, Length: 2068, dtype: float64
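NumPy functions such as `np.sqrt` can also be called directly on a Series, without apply. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

values = pd.Series([16.0, 25.0, 36.0])

# Through apply
via_apply = values.apply(np.sqrt)

# Calling the NumPy function on the Series itself; the result is still a Series
direct = np.sqrt(values)

print(direct.tolist())
```

Both forms return the same values, and the direct call keeps the original index.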

## 2.3. Using Apply with lambda functions

In [29]:
# Let's extract the year-month-day part (the first 10 characters) from the date.utc column
air_quality_no2['date.utc'].apply(lambda date:date[0:10])

0       2019-06-21
1       2019-06-20
2       2019-06-20
3       2019-06-20
4       2019-06-20
           ...    
2063    2019-05-07
2064    2019-05-07
2065    2019-05-07
2066    2019-05-07
2067    2019-05-07
Name: date.utc, Length: 2068, dtype: object

In [30]:
air_quality_no2['value'].apply(lambda value:value**2)

0       400.00
1       475.24
2       702.25
3       620.01
4       457.96
         ...  
2063    676.00
2064    256.00
2065    361.00
2066    361.00
2067    529.00
Name: value, Length: 2068, dtype: float64

In [31]:
air_quality_no2['year-month-day']=air_quality_no2['date'].apply(lambda date:datetime.strptime(date, '%Y-%m-%d'))

In [32]:
air_quality_no2

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,valuesquared,date,year-month-day
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³,400.00,2019-06-21,2019-06-21
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³,475.24,2019-06-20,2019-06-20
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³,702.25,2019-06-20,2019-06-20
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³,620.01,2019-06-20,2019-06-20
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³,457.96,2019-06-20,2019-06-20
...,...,...,...,...,...,...,...,...,...,...
2063,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,676.00,2019-05-07,2019-05-07
2064,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,256.00,2019-05-07,2019-05-07
2065,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00,2019-05-07,2019-05-07
2066,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00,2019-05-07,2019-05-07


In [33]:
air_quality_no2['year']=air_quality_no2['year-month-day'].apply(lambda x:x.year)

In [34]:
air_quality_no2['month']=air_quality_no2['year-month-day'].apply(lambda x:x.month)

In [35]:
air_quality_no2

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,valuesquared,date,year-month-day,year,month
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³,400.00,2019-06-21,2019-06-21,2019,6
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³,475.24,2019-06-20,2019-06-20,2019,6
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³,702.25,2019-06-20,2019-06-20,2019,6
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³,620.01,2019-06-20,2019-06-20,2019,6
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³,457.96,2019-06-20,2019-06-20,2019,6
...,...,...,...,...,...,...,...,...,...,...,...,...
2063,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,676.00,2019-05-07,2019-05-07,2019,5
2064,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,256.00,2019-05-07,2019-05-07,2019,5
2065,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00,2019-05-07,2019-05-07,2019,5
2066,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00,2019-05-07,2019-05-07,2019,5
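The year and month columns above can also be derived without apply. The sketch below uses `pd.to_datetime` and the `.dt` accessor on a small hand-built sample (an assumption, so it runs without the lab data); the column names mirror the ones used in this lab.

```python
import pandas as pd

sample = pd.DataFrame({'date': ['2019-06-21', '2019-06-20', '2019-05-07']})

# Parse the whole column of strings into datetimes in one call
sample['year-month-day'] = pd.to_datetime(sample['date'], format='%Y-%m-%d')

# The .dt accessor exposes datetime components for the whole column
sample['year'] = sample['year-month-day'].dt.year
sample['month'] = sample['year-month-day'].dt.month

print(sample)
```

This follows the earlier advice to use native pandas methods where they exist, while producing the same year and month values.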
