In [2]:
import pandas as pd
import matplotlib
path_data = '../../../assets/data/'
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import numpy as np

Population = pd.read_csv(path_data + 'Canada-Population.csv')

# Manipulation and Transformation of Tabular data

In data science, we often need to perform complex transformations and calculations on our data. So far we have seen some examples of creating new columns by applying functions to existing columns or to other arrays. All of those functions took arrays as their arguments. But frequently we will want to transform the entries in a column with a function that doesn't take an array as its argument.


For example, it might take just one number as its argument, as in the function `categorize_score` defined below.

In [3]:
#the function to categorize scores
def categorize_score(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    else:
        return 'F'

In [4]:
categorize_score(17)

'F'

In [5]:
categorize_score(97)

'A'

In [6]:
categorize_score(76)

'C'

The function `categorize_score` simply returns a letter grade given a score. Let's consider a DataFrame containing information about students and their scores in a particular exam. To use this function on many scores at once, we will have to be able to *refer* to the function itself, without actually calling it. Analogously, we might show a cake recipe to a chef and ask her to use it to bake 6 cakes. In that scenario, we are not using the recipe to bake any cakes ourselves; our role is merely to refer the chef to the recipe. Similarly, we can ask pandas to call `categorize_score` on every number in a column.

First, we create the dataframe `students` with a column for Name and Score. 

In [7]:
students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 92, 78, 64]
})
students

Unnamed: 0,Name,Score
0,Alice,85
1,Bob,92
2,Charlie,78
3,David,64


## Applying a Function to a Column

To convert each score to its letter grade, we will use the `apply` method of a pandas Series (a single column of a DataFrame). The `apply` method calls a function on each element of the column and returns a new Series of the return values. To indicate which function to call, just name it (without quotation marks or parentheses). The column name used to select the input values, `'Score'`, is a string and must still appear within quotation marks.

In [8]:
students['Score'].apply(categorize_score)

0    B
1    A
2    C
3    D
Name: Score, dtype: object

What we have done here is `apply` the function `categorize_score` to each value in the `Score` column of the DataFrame `students`. The output is a Series of the corresponding return values of the function. For example, 85 became 'B', 92 became 'A', and so on.

This Series, which has the same length as the original `Score` column of `students`, can be used as the values in a new column called `Letter Grade` alongside the existing `Name` and `Score` columns.

In [9]:
students['Letter Grade'] = students['Score'].apply(categorize_score)
students

Unnamed: 0,Name,Score,Letter Grade
0,Alice,85,B
1,Bob,92,A
2,Charlie,78,C
3,David,64,D
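
To make the behaviour of `apply` concrete, here is a minimal sketch that compares it with calling the function on each score by hand. The names `scores`, `applied`, and `by_hand` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

def categorize_score(score):
    # Same thresholds as the function defined earlier
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    else:
        return 'F'

scores = pd.Series([85, 92, 78, 64], name='Score')

# apply calls categorize_score once per element and collects the results
applied = scores.apply(categorize_score)

# The same values, built by hand with a list comprehension
by_hand = [categorize_score(s) for s in scores]

print(list(applied) == by_hand)  # True
```

In both cases the function itself is unchanged; `apply` simply takes care of the looping and returns the results with the original index.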


### Functions as Values
We've seen that Python has many kinds of values. For example, `6` is a number value, `"cake"` is a text value, `pd.DataFrame()` is an empty DataFrame, and `students` is a name for a DataFrame value (since we defined it above).

In Python, every function, including `categorize_score`, is also a value. It helps to think about recipes again. A recipe for cake is a real thing, distinct from cakes or ingredients, and you can give it a name like "Ani's cake recipe." When we defined `categorize_score` with a `def` statement, we actually did two separate things: we created a function that converts a score to a letter grade, and we gave it the name `categorize_score`.

We can refer to any function by writing its name, without the parentheses or arguments necessary to actually call it. We did this when we called `apply` above.  When we write a function's name by itself as the last line in a cell, Python produces a text representation of the function, just like it would print out a number or a string value.

In [10]:
categorize_score

<function __main__.categorize_score(score)>

Notice that we did not write `"categorize_score"` with quotes (which is just a piece of text), or `categorize_score()` (which is a function call, and an invalid one at that).  We simply wrote `categorize_score` to refer to the function.

Just like we can define new names for other values, we can define new names for functions.  For example, suppose we want to refer to our function as `categorize` instead of `categorize_score`.  We can just write this:

In [11]:
categorize = categorize_score

Now `categorize` is a name for a function.  It's the same function as `categorize_score`, so the printed value is exactly the same.

categorize
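
The idea that a function is a value like any other can be checked directly. The sketch below uses a compact rewrite of `categorize_score` with the same thresholds, and `call_on_each` is a hypothetical helper introduced only for this illustration.

```python
def categorize_score(score):
    # Compact version of the function above, same grade thresholds
    for cutoff, letter in [(90, 'A'), (80, 'B'), (70, 'C'), (60, 'D')]:
        if score >= cutoff:
            return letter
    return 'F'

# A second name bound to the very same function object
categorize = categorize_score
print(categorize is categorize_score)  # True
print(categorize.__name__)             # categorize_score

def call_on_each(func, values):
    # Hypothetical helper: receives a function as an argument and calls it
    return [func(v) for v in values]

print(call_on_each(categorize, [17, 97, 76]))  # ['F', 'A', 'C']
```

Passing `categorize` to `call_on_each` mirrors what happens when a function name is handed to `apply`: the receiver decides when, and on what, to call it.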

Let us see another application of `apply`.

- When called on a DataFrame, apply() takes an argument that's common in pandas: **axis**.
- The **axis** argument can be either 0 or 1. By default, it's 0.
- When axis = 0, the function is applied to each column, one column at a time.
- When axis = 1, the function is applied to each row, one row at a time.
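
As a rough sketch of the difference between the two settings, the example below applies Python's built-in `max` to a small numeric DataFrame. The name `marks` is an assumption made for this illustration.

```python
import pandas as pd

marks = pd.DataFrame({
    'Score': [85, 92, 78, 64],
    'Quiz': [95, 70, 91, 72]
})

# axis=0 (the default): max receives each column in turn
col_max = marks.apply(max, axis=0)
print(col_max.tolist())  # [92, 95]

# axis=1: max receives each row in turn
row_max = marks.apply(max, axis=1)
print(row_max.tolist())  # [95, 92, 91, 72]
```

With axis=0 the result has one entry per column, and with axis=1 it has one entry per row.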

## Applying Functions to Rows

Sometimes, we may need to apply a function to a row instead of a column. Let's see an example. For this example, we will redefine our initial `students` dataframe with an additional column. 

In [12]:
students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 92, 78, 64],
    'Quiz': [95, 70, 91, 72]
})
students

Unnamed: 0,Name,Score,Quiz
0,Alice,85,95
1,Bob,92,70
2,Charlie,78,91
3,David,64,72


- Suppose the 'Quiz' is worth 10% of the final grade.
- Let's create a column called **Final_Grade**. To do this, we need to compute:

> (0.9) * (Score) + (0.1) * (Quiz)

- This formula relies on two columns. With axis = 0, the function sees only one column at a time, so it doesn't have all of the necessary information.

- Note that when applying a function to rows, we call apply() on the whole DataFrame with axis = 1, not on a single column.

In [13]:
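def computeFinalGrade(row):
    # Each row arrives as a Series indexed by column name
    Score = row['Score']
    Quiz = row['Quiz']

    # The exam score counts for 90% and the quiz for 10%
    final_grade = (0.9*Score) + (0.1*Quiz)
    return final_grade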

students['Final_Grade'] = students.apply(computeFinalGrade, axis=1)
students

Unnamed: 0,Name,Score,Quiz,Final_Grade
0,Alice,85,95,86.0
1,Bob,92,70,89.8
2,Charlie,78,91,79.3
3,David,64,72,64.8
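
Two common variations on the row-wise pattern are sketched below, under the assumption that the quiz weight might need to change. `weighted_grade` and its `quiz_weight` parameter are hypothetical names, not part of the original function. Extra arguments are forwarded through the `args` parameter of `DataFrame.apply`, and short formulas can be written inline as a lambda.

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 92, 78, 64],
    'Quiz': [95, 70, 91, 72]
})

def weighted_grade(row, quiz_weight):
    # quiz_weight is a hypothetical extra parameter for this sketch
    return (1 - quiz_weight) * row['Score'] + quiz_weight * row['Quiz']

# args supplies the positional arguments that follow the row itself
final_grade = students.apply(weighted_grade, axis=1, args=(0.1,))

# The same formula written inline as a lambda
final_grade_lambda = students.apply(
    lambda row: 0.9 * row['Score'] + 0.1 * row['Quiz'], axis=1
)

print(final_grade.round(1).tolist())
```

With a quiz weight of 0.1, both versions should reproduce the `Final_Grade` values computed by `computeFinalGrade`.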


## Example: Canada Population

One of the most prevalent applications of the `apply` function is in data preprocessing. While we will explore data preprocessing techniques in greater detail in subsequent chapters, this section provides a brief example to illustrate how `apply` can be utilized for preprocessing tasks.

We will take Canada's population dataset (in the years 2021 to 2022) for preprocessing.

In [14]:
Population

Unnamed: 0,Geography,Jan 2021,Jan 2022
0,Canada,38058291,38567576
1,Newfoundland and Labrador,525895,528977
2,Prince Edward Island,159240,164195
3,Nova Scotia,990025,1010460
4,New Brunswick,784950,798656
5,Quebec,8550561,8613999
6,Ontario,14772726,14999441
7,Manitoba,1383854,1400663
8,Saskatchewan,1166348,1171031
9,Alberta,4418338,4465537


If we try to calculate the population growth from 2021 to 2022 directly, the subtraction raises a `TypeError`.

In [15]:
try:
    Population['Jan 2022'] - Population['Jan 2021']
except TypeError as err:
    print(err)

unsupported operand type(s) for -: 'str' and 'str'


Let's examine the type of each column to understand why the arithmetic operation is causing an error. This issue arises because `pandas` has interpreted the columns as strings instead of numbers, and arithmetic operations cannot be performed on string data types.

To diagnose and resolve this, we can use the `dtypes` attribute of the DataFrame, which will show us the data type of each column.

In [16]:
print(Population.dtypes)

Geography    object
Jan 2021     object
Jan 2022     object
dtype: object


In this case, both 'Jan 2022' and 'Jan 2021' are of type object, which typically indicates string data. Because the values in these two columns contain commas, `pandas` interprets them as strings rather than numbers.
To perform arithmetic operations, we need to remove the commas and convert these columns to a numeric data type.

Let's define a function `to_number()` for removing ',' and converting to `int`. Then we will apply the function to both 'Jan 2022' and 'Jan 2021'. 

In [17]:
def to_number(str_count):
    num_count = int(str_count.replace(',',''))
    return num_count


Population['Jan 2022'] = Population['Jan 2022'].apply(to_number)
Population['Jan 2021'] = Population['Jan 2021'].apply(to_number)
Population

Unnamed: 0,Geography,Jan 2021,Jan 2022
0,Canada,38058291,38567576
1,Newfoundland and Labrador,525895,528977
2,Prince Edward Island,159240,164195
3,Nova Scotia,990025,1010460
4,New Brunswick,784950,798656
5,Quebec,8550561,8613999
6,Ontario,14772726,14999441
7,Manitoba,1383854,1400663
8,Saskatchewan,1166348,1171031
9,Alberta,4418338,4465537


In [18]:
print(Population.dtypes)

Geography    object
Jan 2021      int64
Jan 2022      int64
dtype: object
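
For comparison, here is a sketch of two other ways to reach the same integer columns without writing a custom function. The CSV text is an assumption about how the raw file is laid out (quoted numbers with commas as thousands separators); only two provinces are included, using the values shown above.

```python
import io
import pandas as pd

# Assumed raw layout: quoted numbers with commas as thousands separators
raw_csv = (
    'Geography,Jan 2021,Jan 2022\n'
    'Prince Edward Island,"159,240","164,195"\n'
    'Nova Scotia,"990,025","1,010,460"\n'
)

# Option 1: read the numbers as text, then clean with vectorized string methods
as_text = pd.read_csv(io.StringIO(raw_csv))
cleaned = as_text['Jan 2022'].str.replace(',', '', regex=False).astype(int)

# Option 2: let read_csv strip the separators while parsing
parsed = pd.read_csv(io.StringIO(raw_csv), thousands=',')

print(cleaned.tolist())             # [164195, 1010460]
print(parsed['Jan 2021'].tolist())  # [159240, 990025]
```

The `apply` approach remains useful when the cleaning logic is more involved than a single string replacement.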


Now that both the 'Jan 2022' and 'Jan 2021' columns are of integer type, we can calculate the percentage population growth over the year.

In [21]:
Population['%Growth'] = (Population['Jan 2022'] - Population['Jan 2021'])/Population['Jan 2021']*100
Population

Unnamed: 0,Geography,Jan 2021,Jan 2022,%Growth
0,Canada,38058291,38567576,1.338171
1,Newfoundland and Labrador,525895,528977,0.586049
2,Prince Edward Island,159240,164195,3.111655
3,Nova Scotia,990025,1010460,2.064089
4,New Brunswick,784950,798656,1.746098
5,Quebec,8550561,8613999,0.741916
6,Ontario,14772726,14999441,1.534686
7,Manitoba,1383854,1400663,1.214651
8,Saskatchewan,1166348,1171031,0.40151
9,Alberta,4418338,4465537,1.068252


### Vectorized operation vs Row-wise `apply`

You could also calculate the '%Growth' column with a row-wise `apply`. That approach yields the same results, but the vectorized expression used above is preferred: it is simpler to write and typically more efficient in both running time and memory usage, because pandas performs the arithmetic on whole columns at once instead of calling a Python function once per row.

So, when is row-wise `apply` appropriate? In brief, use it when a vectorized operation is either infeasible or too complex to implement.
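
The two approaches can be compared side by side. This is a minimal sketch on a two-row subset of the population data; `pop` and `growth_percent` are illustrative names.

```python
import pandas as pd

pop = pd.DataFrame({
    'Geography': ['Canada', 'Quebec'],
    'Jan 2021': [38058291, 8550561],
    'Jan 2022': [38567576, 8613999]
})

def growth_percent(row):
    # Row-wise version: called once for every row
    return (row['Jan 2022'] - row['Jan 2021']) / row['Jan 2021'] * 100

row_wise = pop.apply(growth_percent, axis=1)

# Vectorized version: one expression over whole columns
vectorized = (pop['Jan 2022'] - pop['Jan 2021']) / pop['Jan 2021'] * 100

# Both give the same numbers, up to floating-point rounding
print((row_wise - vectorized).abs().max() < 1e-9)  # True
```

The results agree; the difference lies in how much Python-level work is done per row, which tends to matter more as the number of rows grows.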