
**Topics**:
 - Useful Functions
 - DataFrames aggregation (groupby and agg)
 - Apply, Applymap, map
 - Exercises


In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [2]:
df_pokemons = pd.read_csv(r'C:\Users\vitor.silva\Desktop\Estudo Python\Python  - Digital\pokemon_data.csv')

# Pandas (Part III)

## Useful Functions


Missing values show up routinely in real data. They have different origins and, depending on the origin, call for different treatments. To start, let's learn how to detect them and how to apply the simplest treatment, exclusion, which is not always the most appropriate one.

In [3]:
# Count the missing values in each column
df_pokemons.isnull().sum()

#               0
Name            0
Type 1          0
Type 2        386
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64

In [6]:
# Drop every row that contains at least one missing value (NaN)
df_pokemons_sem_na = df_pokemons.dropna()
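
To make the cost of exclusion concrete, here is a minimal sketch on a hypothetical three-row frame (`toy` is made up for illustration and is not the Pokémon file). It shows that `dropna()` removes a whole row as soon as one of its fields is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only "Type 2" has gaps
toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle"],
    "Type 1": ["Grass", "Fire", "Water"],
    "Type 2": ["Poison", np.nan, np.nan],
})

missing_per_column = toy.isnull().sum()  # number of missing values per column
toy_no_na = toy.dropna()                 # drops any row containing a missing value

print(missing_per_column)
print(len(toy), "rows before,", len(toy_no_na), "rows after dropna()")
```

By the same logic, in the Pokémon data above, where `Type 2` has 386 missing values out of 800 rows, a plain `dropna()` would discard those 386 rows even though every other column is complete.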

In [5]:
# Flag each row that duplicates an earlier row
df_pokemons.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
795    False
796    False
797    False
798    False
799    False
Length: 800, dtype: bool

In [7]:
# Drop duplicate rows, keeping the first occurrence of each
df_pokemons.drop_duplicates(keep = 'first')

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True
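
Since the Pokémon data shows no duplicated rows in the output above, the effect of these two methods is easier to see on a small example. The sketch below uses a hypothetical frame (`toy`, made up for illustration) that contains one repeated row.

```python
import pandas as pd

# Hypothetical toy data with one repeated row
toy = pd.DataFrame({
    "Name": ["Pikachu", "Pikachu", "Eevee"],
    "Type 1": ["Electric", "Electric", "Normal"],
})

is_dup = toy.duplicated()                    # True for rows that repeat an earlier row
deduped = toy.drop_duplicates(keep="first")  # keeps the first occurrence of each row

print(is_dup.tolist())
print(deduped)
```

Note that the original index labels are kept after dropping, so the remaining rows here are labeled 0 and 2.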


In [8]:
# List the unique values of a column (most useful for qualitative/categorical columns)
df_pokemons['Type 1'].unique()

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [15]:
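# Replace values across the whole DataFrame (here the booleans become the Portuguese strings 'verdadeiro' = true and 'falso' = false)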
df_pokemons.replace({True : 'verdadeiro' , False : 'falso'})

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,falso
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,falso
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,falso
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,falso
4,4,Charmander,Fire,,39,52,43,60,50,65,1,falso
...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,verdadeiro
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,verdadeiro
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,verdadeiro
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,verdadeiro


## Group By & Aggregate

Aggregation means computing summary statistics per group instead of over the whole dataset.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png" height=600 width=600>

First, the data is split into subsets that share the same value in one or more grouping columns ```(split)```. Next, a function is applied to each subset ```(apply)```; in the figure's example, that function is the sum. Finally, the per-group results are combined into a single output ```(combine)```.

<img src="https://www.softwaretestingclass.com/wp-content/uploads/2013/06/sql_group_by_with_aggregate_function.gif" height=450 width=450>

Here the rows are grouped by department, and the salaries are summed within each department.
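
The department example can be sketched in pandas as follows. The `employees` frame and its values are hypothetical, made up only to mirror the picture; the pattern is the same split, apply, combine sequence described above.

```python
import pandas as pd

# Hypothetical employee data, made up to mirror the department/salary picture
employees = pd.DataFrame({
    "department": ["Sales", "Sales", "IT", "IT", "HR"],
    "salary": [3000, 2500, 4000, 4500, 2000],
})

# split by department, apply sum to each group's salaries, combine into one Series
salary_by_department = employees.groupby("department")["salary"].sum()
print(salary_by_department)
```

Selecting `["salary"]` before calling `sum()` restricts the aggregation to that one column, which keeps the result easy to read.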

In [9]:
# Sum of every numeric column, grouped by Type 1
df_pokemons.groupby('Type 1').sum()

Type 1,#,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
Bug,23080,3925,4897,4880,3717,4471,4256,222,0
Dark,14302,2071,2740,2177,2314,2155,2361,125,2
Dragon,15180,2666,3588,2764,3099,2843,2657,124,12
Electric,15994,2631,3040,2917,3961,3243,3718,144,4
Fairy,7642,1260,1046,1117,1335,1440,826,70,1
Fighting,9824,1886,2613,1780,1434,1747,1784,91,0
Fire,17025,3635,4408,3524,4627,3755,3871,167,5
Flying,2711,283,315,265,377,290,410,22,2
Ghost,15568,2062,2361,2598,2539,2447,2059,134,2
Grass,24141,4709,5125,4956,5425,4930,4335,235,3


In [10]:
# Group by more than one column and take the mean of each numeric column
df_pokemons.groupby(['Type 1' , 
                     'Legendary']).mean()

Type 1,Legendary,#,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
Bug,False,334.492754,56.884058,70.971014,70.724638,53.869565,64.797101,61.681159,3.217391
Dark,False,451.517241,64.655172,86.862069,68.689655,70.62069,67.827586,73.689655,3.965517
Dark,True,604.0,98.0,110.5,92.5,133.0,94.0,112.0,5.0
Dragon,False,447.35,72.65,103.4,78.15,72.9,77.4,72.35,3.75
Dragon,True,519.416667,101.083333,126.666667,100.083333,136.75,107.916667,100.833333,4.083333
Electric,False,358.05,57.325,66.125,65.425,86.275,72.325,82.275,3.275
Electric,True,418.0,84.5,98.75,75.0,127.5,87.5,106.75,3.25
Fairy,False,432.875,70.875,57.1875,63.875,75.25,83.875,45.4375,4.0
Fairy,True,716.0,126.0,131.0,95.0,131.0,98.0,99.0,6.0
Fighting,False,363.851852,69.851852,96.777778,65.925926,53.111111,64.703704,66.074074,3.37037


In [11]:
# Aggregate every column with each of the functions in the list
df_pokemons.groupby('Type 1').agg(['min', np.median, 'max'])

Type 1,#,#,#,HP,HP,HP,Attack,Attack,Attack,Defense,Defense,Defense,Sp. Atk,Sp. Atk,Sp. Atk,Sp. Def,Sp. Def,Sp. Def,Speed,Speed,Speed,Generation,Generation,Generation,Legendary,Legendary,Legendary
,min,median,max,min,median,max,min,median,max,min,median,max,min,median,max,min,median,max,min,median,max,min,median,max,min,median,max
Bug,10,291.0,666,1,60.0,86,10,65.0,185,30,60.0,230,10,50.0,135,20,60.0,230,5,60.0,160,1,3.0,6,False,0.0,False
Dark,197,509.0,717,35,65.0,126,50,88.0,150,30,70.0,125,30,65.0,140,30,65.0,130,20,70.0,125,2,5.0,6,False,0.0,True
Dragon,147,443.5,718,41,80.0,125,50,113.5,180,35,90.0,130,30,105.0,180,30,90.0,150,40,90.0,120,1,4.0,6,False,0.0,True
Electric,25,403.5,702,20,60.0,90,30,65.0,123,15,65.0,115,35,95.0,165,32,79.5,110,35,88.0,140,1,4.0,6,False,0.0,True
Fairy,35,669.0,716,35,78.0,126,20,52.0,131,28,66.0,95,40,75.0,131,40,79.0,154,15,45.0,99,1,6.0,6,False,0.0,True
Fighting,56,308.0,701,30,70.0,144,35,100.0,145,30,70.0,95,20,40.0,140,30,63.0,110,25,60.0,118,1,3.0,6,False,0.0,False
Fire,4,289.5,721,38,70.0,115,30,84.5,160,37,64.0,140,15,85.0,159,40,67.5,154,20,78.5,126,1,3.0,6,False,0.0,True
Flying,641,677.5,715,40,79.0,85,30,85.0,115,35,75.0,80,45,103.5,125,40,80.0,90,55,116.0,123,5,5.5,6,False,0.5,True
Ghost,92,487.0,711,20,59.5,150,30,66.0,165,30,72.5,145,30,65.0,170,33,75.0,135,20,60.5,130,1,4.0,6,False,0.0,True
Grass,1,372.0,673,30,65.5,123,27,70.0,132,30,66.0,131,24,75.0,145,30,66.0,129,10,58.5,145,1,3.5,6,False,0.0,True


In [30]:
# Pass a dict to choose which functions are applied to each column
df_pokemons.groupby('Type 1').agg({'HP': ['max', 'min'],
                                   'Sp. Atk': [np.std, np.median]})

Type 1,HP,HP,Sp. Atk,Sp. Atk
,max,min,std,median
Bug,86,1,26.697055,50.0
Dark,126,35,33.200952,65.0
Dragon,125,41,42.25736,105.0
Electric,90,20,29.74034,95.0
Fairy,126,35,28.548462,75.0
Fighting,144,30,28.159345,40.0
Fire,115,38,30.042121,85.0
Flying,85,40,34.769479,103.5
Ghost,150,20,32.561217,65.0
Grass,123,30,27.244864,75.0


## Apply, Applymap, map


Suppose we need to rename every Pokémon by adding the prefix "Pokemon" to its name, separated by an underscore.
**Ex.: Pokemon_Bulbasaur**

### Def Functions


A function is a block of code that takes some parameters and returns a value. For example, a function can receive the number of hours an employee worked and the employee's hourly value, and return the salary for that month.

To define a function we use the ```def``` keyword followed by the function's name; in this example it is called calculate_salary. It receives 2 parameters, the number of hours worked and the hourly value, and hands back the calculated salary through the ```return``` keyword.
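
The salary function described above is never written out, so here is a minimal sketch of what it could look like. The parameter names `hours_worked` and `hourly_value` and the sample numbers are assumptions chosen for illustration.

```python
def calculate_salary(hours_worked, hourly_value):
    # Monthly salary = hours worked in the month times the value of each hour
    salary = hours_worked * hourly_value
    return salary

# Hypothetical example: 160 hours at 25.0 per hour
monthly_salary = calculate_salary(160, 25.0)
print(monthly_salary)
```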


Let's write a function that prefixes "Pokemon_" to any string and returns the new string.

In [12]:
def add_pokemon(string):
    nova_string = "Pokemon_"+string
    return nova_string

In [13]:
add_pokemon('Bulbasaur')

'Pokemon_Bulbasaur'


Now we just need to apply this function to every Pokémon in our DataFrame. How would we do this?


One option is a ```for``` loop. A more concise option is the ```map``` method.

### Map


To the ```map``` method we pass the function we want to apply.


The ```map``` method iterates over a column item by item and applies that function to each one. Example: take the square root of each element. Note that a function for this task already exists, ```np.sqrt``` in NumPy; writing our own with ```def``` is only needed for custom logic that no existing function covers.

In [14]:
# Take the square root of each element of the HP column
df_pokemons['HP'].map(np.sqrt)

0      6.708204
1      7.745967
2      8.944272
3      8.944272
4      6.244998
         ...   
795    7.071068
796    7.071068
797    8.944272
798    8.944272
799    8.944272
Name: HP, Length: 800, dtype: float64
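
Returning to the prefix example, the sketch below compares the ```for``` option with ```map``` using the `add_pokemon` function defined earlier. The `names` Series is a hypothetical stand-in for `df_pokemons['Name']`, so the snippet runs without the CSV file.

```python
import pandas as pd

def add_pokemon(string):
    # Same function as defined earlier, repeated so the snippet runs on its own
    return "Pokemon_" + string

# Hypothetical stand-in for df_pokemons['Name']
names = pd.Series(["Bulbasaur", "Ivysaur", "Venusaur"], name="Name")

# Option 1: explicit for loop, collecting the results in a list
prefixed_loop = []
for name in names:
    prefixed_loop.append(add_pokemon(name))

# Option 2: map applies the function element by element and returns a Series
prefixed_map = names.map(add_pokemon)
print(prefixed_map.tolist())
```

On the real data, the equivalent call would presumably be `df_pokemons['Name'].map(add_pokemon)`, with the result assigned back to the column if the change should be kept.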

### Apply

What if we want to apply a function to every column at once, or to every row?<br>
Use ```apply```.

In [15]:
# Applying the max function to each column
df_pokemons.max()

#                          721
Name          Zygarde50% Forme
Type 1                   Water
HP                         255
Attack                     190
Defense                    230
Sp. Atk                    194
Sp. Def                    230
Speed                      180
Generation                   6
Legendary                 True
dtype: object

In [16]:
# Applying the max function to each row
df_pokemons.max(axis=1)

0       65.0
1       80.0
2      100.0
3      123.0
4       65.0
       ...  
795    719.0
796    719.0
797    720.0
798    720.0
799    721.0
Length: 800, dtype: float64


Note that the maximum value of each row has been returned.

However, not every calculation is available as a built-in method like ```max```. Suppose we want to know the difference between the maximum and the minimum of each attribute of our dataframe.

In [17]:
def max_min(x):
    # Range of the values in x: largest minus smallest
    difference = x.max() - x.min()
    return difference

In our example, x is a column of the dataframe. To run this calculation on each column, we use ```apply```.

In [18]:
# Applying the function to each of the selected columns
df_pokemons.loc[: , ['HP' , 'Sp. Atk']].apply(max_min)

HP         254
Sp. Atk    184
dtype: int64
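
The same function can also be applied row by row by passing `axis=1`. The sketch below is a minimal illustration on a hypothetical three-row frame (`toy`), using the `HP` and `Sp. Atk` values of the first three Pokémon shown earlier.

```python
import pandas as pd

def max_min(x):
    # x is a Series: a column when axis=0, a row when axis=1
    return x.max() - x.min()

# Hypothetical toy frame with the HP / Sp. Atk values of the first three rows
toy = pd.DataFrame({"HP": [45, 60, 80], "Sp. Atk": [65, 80, 100]})

range_per_column = toy.apply(max_min)       # axis=0 is the default: one result per column
range_per_row = toy.apply(max_min, axis=1)  # axis=1: one result per row

print(range_per_column)
print(range_per_row)
```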

### Applymap


What if we want to apply a transformation to every element of the dataframe at once? <br>
Use ```applymap```.

In [43]:
df_pokemons.loc[: , ['HP' , 'Sp. Atk' , 'Attack' , 'Defense']].applymap(np.sqrt)

Unnamed: 0,HP,Sp. Atk,Attack,Defense
0,6.708204,8.062258,7.000000,7.000000
1,7.745967,8.944272,7.874008,7.937254
2,8.944272,10.000000,9.055385,9.110434
3,8.944272,11.045361,10.000000,11.090537
4,6.244998,7.745967,7.211103,6.557439
...,...,...,...,...
795,7.071068,10.000000,10.000000,12.247449
796,7.071068,12.649111,12.649111,10.488088
797,8.944272,12.247449,10.488088,7.745967
798,8.944272,13.038405,12.649111,7.745967
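
One caveat for newer installations: `DataFrame.applymap` was deprecated in pandas 2.1 in favor of `DataFrame.map`, which does the same element-wise job. The sketch below, on a hypothetical frame (`toy`) of perfect squares, picks whichever method the installed version provides.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values (perfect squares so the result is easy to check)
toy = pd.DataFrame({"HP": [4, 9], "Attack": [16, 25]})

# Use DataFrame.map where it exists (pandas 2.1+), otherwise fall back to applymap
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
roots = elementwise(np.sqrt)
print(roots)
```

For a function like `np.sqrt` that already works on whole arrays, calling `np.sqrt(toy)` directly gives the same values; the element-wise methods matter mostly for custom Python functions.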
