### DS102 | In Class Practice Week 2A - Pandas & Numpy III
<hr>

## Learning Objectives
At the end of the lesson, you will be able to:

- create a new column using arithmetic operations

- create a new column by using `Series.apply()` with a function

- create a `GroupBy` object, an **intermediate** datatype that holds the grouped records before they are aggregated, and aggregate only one column

- use a `GroupBy` object to aggregate data, retrieving the `mean()`, `min()` and `max()` of aggregated records

### Datasets Required for this In Class
1. `pokemon.csv`

#### Import `pandas`

In [2]:
# import pandas and numpy
import pandas as pd
import numpy as np

#### Read from CSV to `df`

Read the dataset from `pokemon.csv` to a `df`.

In [3]:
# Read pokemon.csv to a df.
df = pd.read_csv("pokemon.csv")

Find out key information about the `df` using `df.info()`. You will be able to see:

- the number of records in the `df`
- the column names and column datatypes of the `df`

In [4]:
# Use df.info() to show key properties of the df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 8 columns):
Pokemon_ID    800 non-null int64
Name          800 non-null object
Type          800 non-null object
HP            800 non-null int64
Attack        800 non-null int64
Defense       800 non-null int64
Speed         800 non-null int64
Legendary     800 non-null bool
dtypes: bool(1), int64(5), object(2)
memory usage: 44.6+ KB


### Create a new Column in a DataFrame using one value

To create a new column that holds a single value, simply specify the name of the column and its corresponding value.

> **Question**: Create a new column `Mean Attack`. This column will have the mean or average of the `Attack` score of all Pokémon.

To get the mean of the `Attack` column, use `df['Attack'].mean()`. Assigning this single value to a new column propagates it to every row in the `df`.

In [5]:
df.head()

Unnamed: 0,Pokemon_ID,Name,Type,HP,Attack,Defense,Speed,Legendary
0,1,Bulbasaur,Grass,45,49,49,45,False
1,2,Ivysaur,Grass,60,62,63,60,False
2,3,Venusaur,Grass,80,82,83,80,False
3,3,VenusaurMega Venusaur,Grass,80,100,123,80,False
4,4,Charmander,Fire,39,52,43,65,False


In [6]:
# First, get the mean attack of all Pokémon
df['Attack'].mean()

79.00125

In [7]:
# Assign the mean to a new column 'Mean Attack'

df['Mean Attack'] = df['Attack'].mean()
df.head()

Unnamed: 0,Pokemon_ID,Name,Type,HP,Attack,Defense,Speed,Legendary,Mean Attack
0,1,Bulbasaur,Grass,45,49,49,45,False,79.00125
1,2,Ivysaur,Grass,60,62,63,60,False,79.00125
2,3,Venusaur,Grass,80,82,83,80,False,79.00125
3,3,VenusaurMega Venusaur,Grass,80,100,123,80,False,79.00125
4,4,Charmander,Fire,39,52,43,65,False,79.00125


### Create a new Column in a DataFrame using arithmetic operations

To create a new column as a sum of other columns, select each column with `df['column_name']` and add the selections together. The addition is performed row by row, so each record gets its own sum.

> **Question**: The total score, or `Total` of a Pokémon is calculated using the sum of the Pokémon's `HP`, `Attack`, `Defense` and `Speed` score. Create a new column in the `df` called `Total` that adds up the four columns.

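In [8]:
# Create a column Total that represents the sum of the Pokémon's 
# HP, Attack, Defense and Speed score
df['Total'] = df['HP'] + df['Attack'] + df['Defense'] + df['Speed']
df.head()

Unnamed: 0,Pokemon_ID,Name,Type,HP,Attack,Defense,Speed,Legendary,Mean Attack,Total
0,1,Bulbasaur,Grass,45,49,49,45,False,79.00125,188
1,2,Ivysaur,Grass,60,62,63,60,False,79.00125,245
2,3,Venusaur,Grass,80,82,83,80,False,79.00125,325
3,3,VenusaurMega Venusaur,Grass,80,100,123,80,False,79.00125,383
4,4,Charmander,Fire,39,52,43,65,False,79.00125,199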
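
As a side note, the same row-wise total can also be computed by selecting the stat columns and calling `sum(axis=1)`. The sketch below is only an illustration on a small hand-typed `toy` DataFrame (the name `toy` and the list `stat_columns` are assumptions made for this example); the three rows are copied from the first three Pokémon shown above.

```python
import pandas as pd

# Toy rows copied from the first three Pokémon in the dataset
toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur"],
    "HP": [45, 60, 80],
    "Attack": [49, 62, 82],
    "Defense": [49, 63, 83],
    "Speed": [45, 60, 80],
})

# Hypothetical helper list naming the columns to add up
stat_columns = ["HP", "Attack", "Defense", "Speed"]

# axis=1 sums across the selected columns of each row
toy["Total"] = toy[stat_columns].sum(axis=1)
print(toy[["Name", "Total"]])
```

This form is convenient when the list of columns is long, because only the list needs to change.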

### Create a new Column in a DataFrame using a function
To apply a function to all records in the column, use `Series.apply()`. In this case, normalise the `Speed` score to a value from $0$ to $1$ using the formula:

$$
\text{Normalised Speed}_x = \frac{Speed_x - Speed_{min}}{Speed_{max} - Speed_{min}}
$$

First obtain the maximum and minimum value of the `Speed` column.

In [9]:
# Exercise: Obtain the maximum value of the 'Speed' column
# 
max_speed = df['Speed'].max()
print(max_speed)

180


In [10]:
# Exercise: Obtain the minimum value of the 'Speed' column
#
min_speed = df['Speed'].min()
print(min_speed)

5
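
As a quick worked check of the formula, take Bulbasaur's `Speed` of $45$ together with the maximum of $180$ and the minimum of $5$ obtained above:

$$
\text{Normalised Speed} = \frac{45 - 5}{180 - 5} = \frac{40}{175} \approx 0.2286
$$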


Next, use a function to calculate the normalised speed. In Python, you can assign two variables in one line by separating the names on the LHS and the values on the RHS of the `=` symbol with a comma `,`.

In [11]:
def calculate_normalised_speed(x):
    min_speed, max_speed = df['Speed'].min(), df['Speed'].max()     
    return (x - min_speed) / (max_speed - min_speed)

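In [13]:
# Example: normalise a single Speed value of 99.
# Only the last expression in a cell is displayed, so this result is not shown.
calculate_normalised_speed(99)

df.head()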

Unnamed: 0,Pokemon_ID,Name,Type,HP,Attack,Defense,Speed,Legendary,Mean Attack,Total
0,1,Bulbasaur,Grass,45,49,49,45,False,79.00125,188
1,2,Ivysaur,Grass,60,62,63,60,False,79.00125,245
2,3,Venusaur,Grass,80,82,83,80,False,79.00125,325
3,3,VenusaurMega Venusaur,Grass,80,100,123,80,False,79.00125,383
4,4,Charmander,Fire,39,52,43,65,False,79.00125,199


Finally, use `apply` to create a new column, using an existing column as the input.

In [14]:
df['Speed'].apply(calculate_normalised_speed).head()

0    0.228571
1    0.314286
2    0.428571
3    0.428571
4    0.342857
Name: Speed, dtype: float64

In [15]:

df['Normalised Speed'] = df['Speed'].apply(calculate_normalised_speed)

# df['Normalised Speed'] = calculate_normalised_speed(df['Speed'])

df.head()

Unnamed: 0,Pokemon_ID,Name,Type,HP,Attack,Defense,Speed,Legendary,Mean Attack,Total,Normalised Speed
0,1,Bulbasaur,Grass,45,49,49,45,False,79.00125,188,0.228571
1,2,Ivysaur,Grass,60,62,63,60,False,79.00125,245,0.314286
2,3,Venusaur,Grass,80,82,83,80,False,79.00125,325,0.428571
3,3,VenusaurMega Venusaur,Grass,80,100,123,80,False,79.00125,383,0.428571
4,4,Charmander,Fire,39,52,43,65,False,79.00125,199,0.342857
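
For a formula made only of arithmetic, the same result can usually be obtained without `apply()`, because arithmetic on a `Series` is performed element by element. The following is a minimal sketch on a hand-typed `toy` DataFrame (an assumption for illustration, not the real dataset); the values 5 and 180 are included so that the minimum and maximum match those found above.

```python
import pandas as pd

# Toy Speed values; 5 and 180 are the dataset's minimum and maximum
toy = pd.DataFrame({"Speed": [5, 45, 60, 80, 65, 180]})

min_speed, max_speed = toy["Speed"].min(), toy["Speed"].max()

# Arithmetic on a Series is applied element-wise, so no apply() is needed
toy["Normalised Speed"] = (toy["Speed"] - min_speed) / (max_speed - min_speed)
print(toy)
```

`apply()` remains the more general tool when the function contains logic, such as `if` statements, that cannot be written as plain column arithmetic.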


### Data Aggregation using `groupby()`

In this next part, we want to find out some properties of each `Type` of Pokémon. In **every** one of the exercises, the first step is to **isolate the columns of interest**. Then, use `groupby()`, which returns a `DataFrameGroupBy` object. This is just a fancy way of saying:
> I have now collected the rows for each unique value of the grouping column. How would you like me to aggregate the target data?
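
To make the idea of an intermediate object more concrete, the sketch below groups a small hand-typed `toy` DataFrame (an assumption for illustration) and inspects the result before any aggregation is applied. The names `toy` and `toy_gb` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Charmander", "Lapras"],
    "Type": ["Grass", "Grass", "Fire", "Water"],
    "HP": [45, 60, 39, 130],
})

toy_gb = toy.groupby("Type")

# Nothing has been aggregated yet; the object only records
# which rows belong to which group
print(toy_gb.ngroups)

# Iterating yields (group key, sub-DataFrame) pairs
for type_name, group in toy_gb:
    print(type_name, group["Name"].tolist())
```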

#### `groupby()` followed by `size()`

**Q:** How many Pokémon of each type are there in the dataset? Filter for only Pokémon that are of the `Grass`, `Fire` or `Water` type and have a `Pokemon_ID` between `1` and `151` (inclusive).

In [17]:
# First create a copy of the df and perform the filter
#
grass_fire_water_df = df.copy()

# Define the types and the Pokemon_ID values to keep
req_types = ['Grass', 'Fire', 'Water']
required = list(range(1,152))

# Filter for the required types AND the Pokemon_ID conditions
grass_fire_water_df = grass_fire_water_df[grass_fire_water_df['Type'].isin(req_types) &
                                          grass_fire_water_df['Pokemon_ID'].isin(required)]
grass_fire_water_df.tail()


Unnamed: 0,Pokemon_ID,Name,Type,HP,Attack,Defense,Speed,Legendary,Mean Attack,Total,Normalised Speed
141,130,GyaradosMega Gyarados,Water,95,155,109,81,False,79.00125,440,0.434286
142,131,Lapras,Water,130,85,80,60,False,79.00125,355,0.314286
145,134,Vaporeon,Water,130,65,60,65,False,79.00125,320,0.342857
147,136,Flareon,Fire,65,130,60,65,False,79.00125,320,0.342857
158,146,Moltres,Fire,90,100,90,90,True,79.00125,370,0.485714
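
For a numeric range, `Series.between()` is a possible alternative to building a list with `range()`; both ends are included by default. The sketch below uses a hand-typed `toy` DataFrame (an assumption for illustration), and the names `in_range`, `is_req_type` and `filtered` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pokemon_ID": [1, 4, 151, 152, 7],
    "Type": ["Grass", "Fire", "Psychic", "Grass", "Water"],
})

req_types = ["Grass", "Fire", "Water"]

# between() is inclusive on both ends by default
in_range = toy["Pokemon_ID"].between(1, 151)
is_req_type = toy["Type"].isin(req_types)

# Combine the two boolean masks with & to keep rows that satisfy both
filtered = toy[is_req_type & in_range]
print(filtered)
```

Storing each condition in its own variable keeps the final filter short and makes each condition easy to check on its own.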


Perform the `groupby()` function. Since we want unique `Type`s, group by the column `Type`.

In [18]:
# Perform the groupby function
grass_fire_water_df_gb = grass_fire_water_df.groupby('Type')
grass_fire_water_df_gb

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001ED78242EF0>

Finally, use the `size()` function to find the number of Pokémon belonging to each `Type`.

In [24]:
# How many Pokémon of each type are there? Filter for only 
# Pokemon with Grass, Fire and Water type and those with a Pokemon_ID between 1 and 151
grass_fire_water_df_gb.size().reset_index(name = 'No. of Pokemon')

Unnamed: 0,Type,No. of Pokemon
0,Fire,14
1,Grass,13
2,Water,31


In [25]:
grass_fire_water_df.groupby('Type').size().reset_index(name = 'Column Name')

Unnamed: 0,Type,Column Name
0,Fire,14
1,Grass,13
2,Water,31
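
When only the counts of a single column are needed, `Series.value_counts()` gives similar information. This is a minimal sketch on a hand-typed `toy` DataFrame (an assumption for illustration); note that the result is ordered by count, whereas `groupby().size()` is ordered by the group key.

```python
import pandas as pd

toy = pd.DataFrame({"Type": ["Water", "Fire", "Water", "Grass", "Water", "Fire"]})

# value_counts() counts each unique value, most frequent first
counts = toy["Type"].value_counts()
print(counts)
```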


#### `groupby()` followed by `max()`

**Q:** What is the maximum HP for each of the types of Pokémon?

In [27]:
# Perform the groupby(), select the HP column and use the max() function to get the maximum HP
grass_fire_water_df.groupby('Type')['HP'].max().reset_index(name = 'Max HP')

# Optional: group by more than one column, e.g. Type and Legendary
# pokemon_types = df.copy()
# r = pokemon_types[['Type','Legendary','HP']].groupby(['Type','Legendary']).max()
# r

Unnamed: 0,Type,Max HP
0,Fire,90
1,Grass,95
2,Water,130


#### `groupby()` followed by `min()`

**Q:** What is the minimum HP for each of the types of Pokémon?

In [160]:
# Perform the groupby(), select the HP column and use the min() function to get the minimum HP
grass_fire_water_df.groupby('Type')['HP'].min().reset_index(name = 'Min HP')

Unnamed: 0,Type,Min HP
0,Fire,38
1,Grass,45
2,Water,20
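
The `mean()` function from the learning objectives follows the same pattern as `min()` and `max()`, and several aggregates can be requested at once with `agg()`. The sketch below uses a hand-typed `toy` DataFrame (an assumption for illustration, not the real dataset); the names `mean_hp` and `summary` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Type": ["Grass", "Grass", "Fire", "Fire", "Water"],
    "HP": [45, 60, 39, 65, 130],
})

# mean() works the same way as min() and max()
mean_hp = toy.groupby("Type")["HP"].mean().reset_index(name="Mean HP")
print(mean_hp)

# agg() accepts a list of function names and returns one column per function
summary = toy.groupby("Type")["HP"].agg(["min", "max", "mean"]).reset_index()
print(summary)
```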


**Credits**
- [Pokemon with stats, Kaggle](https://www.kaggle.com/abcsds/pokemon) for the dataset
<hr>
`HWA-DS102-INCLASS-2A-201905`