# Transforming the data

In [None]:
# importing essential packages
import pandas as pd
import math

# 1. `.map()` method

- `map()` is a handy tool for processing and transforming iterables of values
- it is a convenient way to perform element-wise transformations and other operations related to data cleaning
- the built-in `map()` takes a function object and an iterable (list, tuple, dictionary, set, or Series) as arguments; the pandas `Series.map()` method takes only the function, because the Series itself is the iterable
- any function, built-in or user-defined, that takes one argument and returns a value can be used with `map()`
- the built-in `map()` returns an iterator (do not worry about this concept for now)
- the resulting values (an iterator) can be passed to the `list()` function or the `set()` function to create a list or a set

Example:

`map(function, iterable)`

To extract the result, we can use, for example:

`list(map(function, iterable))`

or 

`set(map(function, iterable))`
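
The iterator behaviour mentioned above can be sketched as follows. This is a minimal illustration, assuming a small hypothetical list `words` that is not part of the original material: the object returned by `map()` computes its values only when it is consumed, and it can be consumed only once.

```python
# Hypothetical sample data, used only for illustration
words = ["alpha", "beta", "gamma"]

lengths = map(len, words)        # nothing is computed yet
print(type(lengths).__name__)    # the result is a map object, not a list

first_pass = list(lengths)       # consuming the iterator computes the lengths
second_pass = list(lengths)      # the iterator is now exhausted

print(first_pass)
print(second_pass)
```

This is why the examples in this section wrap `map()` in `list()` or `set()` right away.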

### Example 1.1 - Example using For Loops

In [None]:
# Run this code
our_list = ['This', 'is', 'the', 'first', 'example']

In [None]:
# Step 1: initialize an empty list "result_loop" that will store our results later
# Step 2: get the length of each word in the list "our_list"
# Step 3: append the result to the list "result_loop"
# Step 4: print out the result

result_loop = []

for word in our_list:
    result_loop.append(len(word))

print(result_loop)

[4, 2, 3, 5, 7]


### Example 1.2 - Example using .map() function

In [None]:
# Run this code
our_list = ['This', 'is', 'the', 'first', 'example']

In [None]:
# Step 1: Use map() to get the length of each word in our_list
# Step 2: Pass the result to the list() function to create a list of the resulting values
# Step 3: Assign the result to the variable "result_map" and print it on the screen

result_map = list(map(len, our_list))
print(result_map)

[4, 2, 3, 5, 7]


In the above example, `map()` iterates over `our_list`, applies the function `len` to each element, and `list()` collects the resulting string lengths into a new list.

Which one do you think is neater and shorter?

```python
result_loop = []

for word in our_list:
    result_loop.append(len(word))

print(result_loop)
```
vs. 

```python
result_map = list(map(len, our_list))
print(result_map)
```

A `map()` call is often shorter and more declarative than the equivalent for loop. It also builds new values instead of modifying the original sequence in place (provided the function you pass has no side effects), which tends to lead to fewer errors. Note that `map()` returns an iterator over the results, which is why we wrap it in `list()`.

Feel free to check out [this thread](https://stackoverflow.com/questions/1975250/when-should-i-use-a-map-instead-of-a-for-loop#:~:text=4%20Answers&text=map%20is%20useful%20when%20you,loop%20and%20constructing%20a%20list.) on Stack Overflow, where the advantages of using `map()` over a for loop are discussed.
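
As a small, hedged illustration of the point about the original sequence, the sketch below uses a hypothetical list `words` and the built-in string method `str.upper`. It only shows the common case where the mapped function returns new values; a function with side effects could still modify its inputs.

```python
# Hypothetical sample data, used only for illustration
words = ["this", "is", "fine"]

# map() builds new values; str.upper returns a new string for each element
shouted = list(map(str.upper, words))

print(shouted)
print(words)   # the original list is unchanged
```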




### Task 1
Now, let's create a function `triple` and a pandas Series `numbers` that will be our iterable.

In [None]:
# Run this code 
def triple(x):
    return x * 3

In [None]:
# Run this code
import pandas as pd
numbers = pd.Series([15, 4, 8, 45, 36, 7])

In [None]:
# TASK 1 >>>> Apply map() with the function triple to the pandas Series 'numbers' and store the result in the variable result_2
#             Print result_2 (the result should be the numbers multiplied by 3)
#             Think about the 3 different steps performed in Example 1.2

### Start your code below ###
numbers = pd.Series([15, 4, 8, 45, 36, 7])
def triple(x):
    return x * 3
result_2 = list(map(triple, numbers))
result_2

[45, 12, 24, 135, 108, 21]
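
The solution above uses the built-in `map()`. As an alternative sketch, a pandas Series also has its own `.map()` method, which takes only the function and returns a new Series instead of an iterator. The variable name `tripled` is chosen here for illustration.

```python
import pandas as pd

def triple(x):
    return x * 3

numbers = pd.Series([15, 4, 8, 45, 36, 7])

# Series.map applies the function to each element and returns a new Series
tripled = numbers.map(triple)
print(tripled.tolist())
```

Because the result is already a Series, there is no need to wrap it in `list()` unless a plain list is required.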

# 2. `.filter()` method

- similar to `map()`, but instead of an arbitrary function, `filter()` takes a Boolean-valued function (a function that returns True or False based on the input data) and an iterable (list, tuple, dictionary, set, or Series) as arguments
- returns the items of the input data for which the Boolean-valued function returns `True`
- the Boolean-valued function can be a user-defined function


Imagine there is a list with positive and negative numbers

In [None]:
# Run this code
list_mixed = [-1,0,2,24,-42,-5,30,99]

In [None]:
# Run this code
def criteria(x): 
    return x >= 0

With the help of `filter()` and our own user-defined function, we can filter out the negative values and keep only the non-negative ones (zero included).

In [None]:
list_positive = list(filter(criteria, list_mixed))
print(list_positive)

[0, 2, 24, 30, 99]


In [None]:
def negative(x):
    return x < 0
list_negative = list(filter(negative, list_mixed))
print(list_negative)

[-1, -42, -5]
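
To make explicit what `filter()` does, the sketch below compares it with an equivalent list comprehension on the same data. The variable names `kept_by_filter` and `kept_by_comprehension` are chosen here for illustration.

```python
list_mixed = [-1, 0, 2, 24, -42, -5, 30, 99]

def criteria(x):
    return x >= 0

# filter() keeps the items for which criteria(item) is True
kept_by_filter = list(filter(criteria, list_mixed))

# The same selection written as a list comprehension
kept_by_comprehension = [x for x in list_mixed if criteria(x)]

print(kept_by_filter)
print(kept_by_filter == kept_by_comprehension)
```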


# 3. `.apply()` method

- this method applies a function along an axis of the DataFrame $^{1}$ 
- it also works elementwise but is suited to more complex functions and operations
- it also accepts a user-defined function that applies a transformation or aggregation to a DataFrame (or Series)

You can find a nice comparison of the `.map()` and `.apply()` methods, and when to use each, in [this thread on Stack Overflow](https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas) 
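
The phrase "along an axis" can be sketched with a toy DataFrame. The data and the names `df`, `column_max`, and `row_sum` below are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
column_max = df.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(column_max.to_dict())
print(row_sum.tolist())
```

When `.apply()` is called on a single column (a Series), as in the example that follows, the function receives one value at a time instead.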

### Example 3.1

In [None]:
# Run this code
students = [(1, 'Robert', 30, 'Slovakia', 26),
           (2, 'Jana', 29, 'Sweden' , 27),
           (3, 'Martin', 31, 'Sweden', 26),
           (4, 'Kristina', 26,'Germany' , 30),
           (5, 'Peter', 33, 'Austria' , 22),
           (6, 'Nikola', 25, 'USA', 23),
           (7, 'Renato', 35, 'Brazil', 26)]

students_1 = pd.DataFrame(students, columns= ['student_id', 'first_name', 'age', 'country', 'score'])
print(students_1)

   student_id first_name  age   country  score
0           1     Robert   30  Slovakia     26
1           2       Jana   29    Sweden     27
2           3     Martin   31    Sweden     26
3           4   Kristina   26   Germany     30
4           5      Peter   33   Austria     22
5           6     Nikola   25       USA     23
6           7     Renato   35    Brazil     26


In [None]:
# Run this code to create a regular function

def score_func(x): 
    if x < 25: 
        return "Retake" 
    else: 
        return "Pass"

score_func(26)

'Pass'

In [None]:
# Use .apply() along with score_func to label each score as "Pass" or "Retake"
students_1['result'] = students_1.score.apply(score_func)
print(students_1)

   student_id first_name  age   country  score  result
0           1     Robert   30  Slovakia     26    Pass
1           2       Jana   29    Sweden     27    Pass
2           3     Martin   31    Sweden     26    Pass
3           4   Kristina   26   Germany     30    Pass
4           5      Peter   33   Austria     22  Retake
5           6     Nikola   25       USA     23  Retake
6           7     Renato   35    Brazil     26    Pass


### Example 4.0

As we already know, regular functions are created using the `def` keyword. This type of function can have any number of arguments and expressions.

In [None]:
# Example of regular function
def multi_add(x):
    return x * 2 + 5

In [None]:
result_1 = multi_add(5)
print(result_1)

15


# 4. Lambda Function
- an anonymous function (it means it can be defined without a name)
- the def keyword is not necessary with a lambda function
- lambda functions can have any number of parameters, but the function body can **contain only one expression** (multiple statements are not allowed in the body of a lambda function), so they are used for *one-line expressions*
- it returns a function object which can be assigned to variable

General syntax: `lambda x: x`
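
As a minimal sketch of the points above, the block below defines a lambda with two parameters next to its `def` equivalent, and passes another lambda directly as an argument. The names `add`, `add_regular`, `names`, and `by_length` are chosen here for illustration.

```python
# A lambda with two parameters; the body is a single expression
add = lambda x, y: x + y

# The equivalent regular function
def add_regular(x, y):
    return x + y

print(add(2, 3), add_regular(2, 3))

# Lambdas are often passed directly as arguments, for example as a sort key
names = ["Kristina", "Jana", "Robert"]
by_length = sorted(names, key=lambda name: len(name))
print(by_length)
```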




### Example 4.1

In [None]:
our_lambda = lambda x: x * 2 + 5
print(our_lambda(5))

15


This simple lambda function takes an input `x` (in our case the number 5), multiplies it by `2`, and adds `5`. <br>

Lambda functions are commonly used together with the `.apply()` method and can be really useful. <br>

### Example 4.2

Imagine the scores of students above have not been correctly recorded and we need to multiply them by 10. 

Use a lambda function along with `apply()` and assign the result back to the specific column of the dataset ('score'). 

In [None]:
students_1.score = students_1.score.apply(lambda x: x * 10)
print(students_1)

   student_id first_name  age   country  score  result
0           1     Robert   30  Slovakia    260    Pass
1           2       Jana   29    Sweden    270    Pass
2           3     Martin   31    Sweden    260    Pass
3           4   Kristina   26   Germany    300    Pass
4           5      Peter   33   Austria    220  Retake
5           6     Nikola   25       USA    230  Retake
6           7     Renato   35    Brazil    260    Pass


### Task 2

In [None]:
# TASK 2 >>>> Use the .apply() method on the column 'country' along with a lambda to make the words uppercase
#             Do not forget to assign the result back to this column

### Start your code below ###
students_1['country'] = students_1['country'].apply(lambda x: x.upper())
print(students_1['country'])

0    SLOVAKIA
1      SWEDEN
2      SWEDEN
3     GERMANY
4     AUSTRIA
5         USA
6      BRAZIL
Name: country, dtype: object


We can use a lambda function to simplify Example 3.1 like this:

In [None]:
# Run this code
students = [(1, 'Robert', 30, 'Slovakia', 26),
           (2, 'Jana', 29, 'Sweden' , 27),
           (3, 'Martin', 31, 'Sweden', 26),
           (4, 'Kristina', 26,'Germany' , 30),
           (5, 'Peter', 33, 'Austria' , 22),
           (6, 'Nikola', 25, 'USA', 23),
           (7, 'Renato', 35, 'Brazil', 26)]

students_1 = pd.DataFrame(students, columns= ['student_id', 'first_name', 'age', 'country', 'score'])

In [None]:
# A lambda function is used instead of the user-defined function "score_func"
# (x >= 25 matches score_func, which returns "Retake" only for scores below 25)
students_1['result'] = students_1.score.apply(lambda x: "Pass" if x >= 25 else "Retake")

# Replace each age with an age-group label, based on the 'age' column
students_1['age'] = students_1.age.apply(lambda x: "Under 30" if x < 30 else "30 or over")
print(students_1)

   student_id first_name         age   country  score  result
0           1     Robert  30 or over  Slovakia     26    Pass
1           2       Jana    Under 30    Sweden     27    Pass
2           3     Martin  30 or over    Sweden     26    Pass
3           4   Kristina    Under 30   Germany     30    Pass
4           5      Peter  30 or over   Austria     22  Retake
5           6     Nikola    Under 30       USA     23  Retake
6           7     Renato  30 or over    Brazil     26    Pass


Did you know that we can combine `map()` and `filter()`? Since `filter()` returns an iterable containing only the items that meet the criteria, its output can serve as the input to `map()`.

A negative argument to `math.sqrt()` raises a `ValueError`, so we filter out the negative numbers before applying `math.sqrt()`.


In [None]:
# Run this code
list_mixed = [-1,0,2,24,-42,-5,30,99]

In [None]:
# Run this code
def criteria(x): 
    return x >= 0

In [None]:
import math
list_sqrt = list(map(math.sqrt, filter(criteria, list_mixed)))
print(list_sqrt)

[0.0, 1.4142135623730951, 4.898979485566356, 5.477225575051661, 9.9498743710662]
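
The sketch below illustrates why the filtering step matters: without it, `math.sqrt()` raises a `ValueError` on the first negative value. It also shows the same pipeline with a lambda in place of the named function `criteria`. The names `raised` and `roots` are chosen here for illustration.

```python
import math

list_mixed = [-1, 0, 2, 24, -42, -5, 30, 99]

# Without filtering, the first negative value stops the computation
try:
    list(map(math.sqrt, list_mixed))
    raised = False
except ValueError:
    raised = True
print(raised)

# Filtering first (here with a lambda instead of a named function) avoids the error
roots = list(map(math.sqrt, filter(lambda x: x >= 0, list_mixed)))
print(len(roots))
```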


### Optional: Task 3

In [None]:
# TASK 3 >>>> With the help of map() and filter(),
#             round the numbers greater than 5 in the list "list_sqrt" to the nearest whole number.
#             To round a number, you can use round().
#             Don't forget to write your own user-defined function as the criteria to filter out the undesired numbers


### Start your code below ###
def criteria(x): 
    return x > 5

list_round = list(map(round, filter(criteria, list_sqrt)))
print(list_round)

[5, 10]
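
Note that `round()` rounds to the nearest integer rather than always rounding up. The sketch below contrasts it with `math.ceil()`, which is not used in the original material and is shown here only for comparison.

```python
import math

value = 5.477225575051661

nearest = round(value)      # rounds to the nearest integer
upward = math.ceil(value)   # always rounds up to the next integer

print(nearest, upward)

# In Python 3, ties are rounded to the nearest even integer
print(round(2.5), round(3.5))
```

If the intent is to always round up, `math.ceil` could be passed to `map()` in place of `round`.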


# References

$^{1}$ pandas. pandas.DataFrame.apply. [ONLINE] Available at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html#pandas-dataframe-apply. [Accessed 14 September 2020].

Stack Overflow. Difference between map, applymap and apply methods in Pandas. [ONLINE] Available at: https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas. [Accessed 14 September 2020].

Material adapted for RBI internal purposes with full permissions from original authors. [Source](https://github.com/zatkopatrik/authentic-data-science) 