# Lab 8. Functions in Pandas

### So far we have used Python functionality and external libraries to fetch, transform and visualize datasets.
### In real scenarios you will often need to write new functionality yourself.


## In this lab you will learn:

1. Functions in Python
2. Lambda functions
3. Applying functions to pandas dataframes






In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
from datetime import datetime

## 1. Functions in Python


### Functions are one of the cornerstones of programming. They provide a way to reuse code. If you’ve ever copy-pasted lines of code just to change a few parameters, then turning those lines of code into a function not only makes your code more readable, but also prevents you from making mistakes later on. Every time code is copy-pasted, it adds another place to look if a correction is needed, and puts that burden on the programmer. When you use a function, you need to make a correction only once, and it will be applied every time the function is called.

### Refer to the following URL for more information: https://swcarpentry.github.io/python-novice-inflammation/08-func/index.html


In [4]:
# The following function takes x as an argument, raises it to the power of two and returns the result

def my_power_function(x):
    powered = x ** 2

    return powered

In [5]:
my_power_function(3)

9

In [6]:
my_power_function(45)

2025

#### The function definition opens with the keyword def followed by the name of the function (my_power_function) and a parenthesized list of parameter names (x). The body of the function — the statements that are executed when it runs — is indented below the definition line. The body concludes with a return keyword followed by the return value.

#### When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.
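As a small illustration of the anatomy described above, the sketch below defines a hypothetical function, `rectangle_area` (not part of the lab), with two parameters and a return statement. It also shows what happens when a function has no return statement: the call gives back `None`.

```python
# Hypothetical example (not from the lab): a function with two parameters.
def rectangle_area(width, height):
    """Return the area of a rectangle."""
    area = width * height   # body: runs every time the function is called
    return area             # sends the result back to the caller


def print_area(width, height):
    """Print the area; there is no return statement here."""
    print(rectangle_area(width, height))


result = rectangle_area(3, 4)   # 3 is assigned to width, 4 to height
missing = print_area(3, 4)      # prints 12, but the call itself returns None

print(result)   # 12
print(missing)  # None
```

Forgetting the return statement is a common source of unexpected `None` values, so it is worth checking first when a function seems to produce nothing.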

In [7]:
# The following function takes x,y as arguments, computes the average and returns the result

def my_average_function(x, y):
    avg = (x + y) / 2

    return avg

In [8]:
my_average_function(4,6)

5.0

In [9]:
my_average_function(15,45)

30.0

In [10]:
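# The following function keeps the first 10 characters of a string (the YYYY-MM-DD part of a timestamp)
def my_date_extractor(input_string):
    date_part = input_string[0:10]
    return date_part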
    

In [11]:
date='2019-06-21 00:00:00+00:00'

In [12]:
my_date_extractor(date)

'2019-06-21'

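## 2. Lambda functions in Python


### Sometimes a function is simple enough that there is no need to define it separately with def. A lambda function is a small anonymous function written as a single expression, which is handy when passing a function to the pandas apply method (see section 3).

#### Lambda functions are extremely useful for processing data in pandas-based workflows.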
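A minimal sketch of the idea: a lambda is a function written as a single expression, so the two definitions below behave the same. The names `multiply_with_def` and `multiply_with_lambda` are hypothetical and used only for comparison.

```python
# A regular function defined with def ...
def multiply_with_def(x, y):
    return x * y

# ... and the same logic as a lambda (parameters before the colon, result after it).
multiply_with_lambda = lambda x, y: x * y

print(multiply_with_def(2, 3))      # 6
print(multiply_with_lambda(2, 3))   # 6

# A lambda can also be passed directly to another function without naming it.
# Here the built-in sorted() uses it to sort words by their length.
words = ['pandas', 'np', 'python']
by_length = sorted(words, key=lambda word: len(word))
print(by_length)  # ['np', 'pandas', 'python']
```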

In [13]:
lambda_add = lambda x, y: x + y

In [14]:
lambda_add(3,4)

7

In [15]:
lambda_add(5,7)

12

In [16]:
lambda_mean = lambda x, y: (x + y) / 2

In [17]:
lambda_mean(0,10)

5.0

In [18]:
lambda_date_extractor = lambda date: date[0:10]

In [19]:
lambda_date_extractor('2019-06-21 00:00:00+00:00')

'2019-06-21'

## 3. Apply functions to a pandas dataframe

### Now that we know how to write a function, how do we use it in pandas? When working with dataframes, you will usually want to run a function across the rows or columns of your data.
### Pandas provides the apply method for exactly this purpose.

### Learning about apply is fundamental to the data cleaning process. Apply takes a function and runs it on each element of a column (Series.apply) or on each row or column of a dataframe (DataFrame.apply). If you have programmed before, the concept should be familiar: it is similar to writing a for loop over the rows or columns and calling the function each time. Apply is usually more concise than an explicit for loop and often faster than iterating over rows by hand, although the vectorized operations of pandas and NumPy are typically faster still when they are available.
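The comparison with a for loop can be sketched on a tiny, hand-made dataframe (hypothetical toy data, not the lab dataset). The helper names `double` and `value_range` are illustrative only. The second half shows how the `axis` argument of `DataFrame.apply` decides whether the function receives columns or rows.

```python
import pandas as pd

# Hypothetical toy data, not the lab dataset.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def double(x):
    return x * 2

# Explicit loop over one column ...
looped = []
for value in df['a']:
    looped.append(double(value))

# ... and the equivalent call with apply (element by element on a Series).
applied = df['a'].apply(double)
print(looped)            # [2, 4, 6]
print(applied.tolist())  # [2, 4, 6]

# On a dataframe, axis chooses what the function receives:
# axis=0 (the default) passes each column, axis=1 passes each row.
def value_range(series):
    return series.max() - series.min()

per_column = df.apply(value_range, axis=0)
per_row = df.apply(value_range, axis=1)
print(per_column.tolist())  # [2, 20]
print(per_row.tolist())     # [9, 18, 27]
```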


### 3.1. Let's load some data

In [20]:
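# Note: parse_dates=True only tries to parse the index, so the date.utc column is still loaded as text
air_quality_no2 = pd.read_csv('../data/air_quality_no2_long.csv', parse_dates=True)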

In [21]:
air_quality_no2

Unnamed: 0,city,country,date.utc,location,parameter,value,unit
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³
...,...,...,...,...,...,...,...
2063,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³
2064,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³
2065,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³
2066,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³


### 3.2. Using apply with your own functions

In [22]:
# We first define a function
def my_power_function(x):
    powered = x ** 2

    return powered

In [23]:
air_quality_no2['value'].apply(my_power_function)

0       400.00
1       475.24
2       702.25
3       620.01
4       457.96
         ...  
2063    676.00
2064    256.00
2065    361.00
2066    361.00
2067    529.00
Name: value, Length: 2068, dtype: float64

In [24]:
# Save the result as a new column
air_quality_no2['powered'] = air_quality_no2['value'].apply(my_power_function)

In [25]:
air_quality_no2

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,powered
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³,400.00
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³,475.24
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³,702.25
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³,620.01
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³,457.96
...,...,...,...,...,...,...,...,...
2063,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,676.00
2064,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,256.00
2065,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00
2066,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00


In [26]:
# Function that takes the first 10 characters of a string (the YYYY-MM-DD part of a timestamp)
def my_date_extractor(input_string):
    date_part = input_string[0:10]
    return date_part
    

In [27]:
air_quality_no2['date.utc'].apply(my_date_extractor)

0       2019-06-21
1       2019-06-20
2       2019-06-20
3       2019-06-20
4       2019-06-20
           ...    
2063    2019-05-07
2064    2019-05-07
2065    2019-05-07
2066    2019-05-07
2067    2019-05-07
Name: date.utc, Length: 2068, dtype: object

In [28]:
air_quality_no2['date'] = air_quality_no2['date.utc'].apply(my_date_extractor)

In [29]:
air_quality_no2

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,powered,date
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³,400.00,2019-06-21
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³,475.24,2019-06-20
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³,702.25,2019-06-20
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³,620.01,2019-06-20
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³,457.96,2019-06-20
...,...,...,...,...,...,...,...,...,...
2063,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,676.00,2019-05-07
2064,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,256.00,2019-05-07
2065,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00,2019-05-07
2066,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00,2019-05-07
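String slicing works here because date.utc is stored as text. As an alternative sketch, assuming every value follows the same timestamp format as the rows shown above, `pd.to_datetime` can convert the text into real timestamps, after which the `.dt` accessor exposes the individual date parts. The small sample series below is typed in by hand from the first rows of the table, not loaded from the file.

```python
import pandas as pd

# Hand-made sample copied from the first rows shown above
# (assumption: the whole column uses this format).
sample = pd.Series(['2019-06-21 00:00:00+00:00', '2019-06-20 23:00:00+00:00'])

# Approach used in the lab: keep the first 10 characters of each string.
sliced = sample.apply(lambda text: text[0:10])

# Alternative: parse the strings as timestamps, then format them back as dates.
parsed = pd.to_datetime(sample)
formatted = parsed.dt.strftime('%Y-%m-%d')

print(sliced.tolist())          # ['2019-06-21', '2019-06-20']
print(formatted.tolist())       # ['2019-06-21', '2019-06-20']
print(parsed.dt.hour.tolist())  # [0, 23]
```

Parsing is more work than slicing, but it gives access to parts such as the hour or the month without further string handling.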


### 3.3. Using apply with existing functions

### Apply also works with functions that already exist, such as Python built-ins or NumPy functions like np.sqrt.

In [30]:
air_quality_no2['value'].apply(np.sqrt)

0       4.472136
1       4.669047
2       5.147815
3       4.989990
4       4.626013
          ...   
2063    5.099020
2064    4.000000
2065    4.358899
2066    4.358899
2067    4.795832
Name: value, Length: 2068, dtype: float64
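NumPy functions such as `np.sqrt` also accept a whole Series, so for simple numeric operations the function can be called on the column directly, without apply. A minimal sketch on a hand-made sample (the three values are taken from rows shown above):

```python
import numpy as np
import pandas as pd

values = pd.Series([20.0, 16.0, 19.0], name='value')

with_apply = values.apply(np.sqrt)   # calls np.sqrt through apply
vectorized = np.sqrt(values)         # calls np.sqrt on the whole column at once

# Both give the same numbers (compared with a tolerance for floating point).
same_sqrt = np.allclose(with_apply, vectorized)
print(same_sqrt)  # True

# The same holds for arithmetic: values ** 2 matches apply with a squaring function.
same_square = np.allclose(values ** 2, values.apply(lambda v: v ** 2))
print(same_square)  # True
```

On large columns the direct call is usually the faster option, which is why apply is best kept for logic that has no vectorized equivalent.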

In [31]:
# np.sqrt returns the square root, so the new column is named accordingly
air_quality_no2['sqrt'] = air_quality_no2['value'].apply(np.sqrt)

In [32]:
air_quality_no2

Unnamed: 0,city,country,date.utc,location,parameter,value,unit,powered,date,sqrt
0,Paris,FR,2019-06-21 00:00:00+00:00,FR04014,no2,20.0,µg/m³,400.00,2019-06-21,4.472136
1,Paris,FR,2019-06-20 23:00:00+00:00,FR04014,no2,21.8,µg/m³,475.24,2019-06-20,4.669047
2,Paris,FR,2019-06-20 22:00:00+00:00,FR04014,no2,26.5,µg/m³,702.25,2019-06-20,5.147815
3,Paris,FR,2019-06-20 21:00:00+00:00,FR04014,no2,24.9,µg/m³,620.01,2019-06-20,4.989990
4,Paris,FR,2019-06-20 20:00:00+00:00,FR04014,no2,21.4,µg/m³,457.96,2019-06-20,4.626013
...,...,...,...,...,...,...,...,...,...,...
2063,London,GB,2019-05-07 06:00:00+00:00,London Westminster,no2,26.0,µg/m³,676.00,2019-05-07,5.099020
2064,London,GB,2019-05-07 04:00:00+00:00,London Westminster,no2,16.0,µg/m³,256.00,2019-05-07,4.000000
2065,London,GB,2019-05-07 03:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00,2019-05-07,4.358899
2066,London,GB,2019-05-07 02:00:00+00:00,London Westminster,no2,19.0,µg/m³,361.00,2019-05-07,4.358899


### 3.4. Using apply with lambda functions

In [33]:
# Let's extract year, month and day from the date.utc column
air_quality_no2['date.utc'].apply(lambda date: date[0:10])

0       2019-06-21
1       2019-06-20
2       2019-06-20
3       2019-06-20
4       2019-06-20
           ...    
2063    2019-05-07
2064    2019-05-07
2065    2019-05-07
2066    2019-05-07
2067    2019-05-07
Name: date.utc, Length: 2068, dtype: object

In [34]:
air_quality_no2['value'].apply(lambda value: value ** 2)

0       400.00
1       475.24
2       702.25
3       620.01
4       457.96
         ...  
2063    676.00
2064    256.00
2065    361.00
2066    361.00
2067    529.00
Name: value, Length: 2068, dtype: float64
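A lambda is also a convenient way to adapt a function that needs more than one argument, since apply passes only the element itself by default. The sketch below uses a hypothetical helper, `to_power`, and a hand-made sample of values. Forwarding an extra keyword argument through `Series.apply` is shown as an equivalent option.

```python
import pandas as pd

# Hypothetical helper with a second parameter.
def to_power(x, exponent):
    return x ** exponent

values = pd.Series([20.0, 16.0, 19.0], name='value')

# Wrap the call in a lambda to fix exponent=3 ...
cubed_lambda = values.apply(lambda v: to_power(v, 3))

# ... or let apply forward the keyword argument to the function.
cubed_kwargs = values.apply(to_power, exponent=3)

print(cubed_lambda.tolist())              # [8000.0, 4096.0, 6859.0]
print(cubed_lambda.equals(cubed_kwargs))  # True
```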