# Challenge 2 - ANOVA

In statistics, **Analysis of Variance (ANOVA)** is also used to analyze the differences among group means. The difference between a t-test and ANOVA is that the former is used to compare two groups, whereas the latter is used to compare three or more groups. [Read more about the difference between t-test and ANOVA](http://b.link/anova24).

From the ANOVA test, you receive two numbers. The first is the **F-value**, the ratio of the variance between groups to the variance within groups. If the F-value exceeds a critical value, the null hypothesis can be rejected. That critical F-value depends on the total number of subjects and the number of subject groups in your experiment. In [this table](http://b.link/eda14) you can find the critical values of the F distribution. **If you are confused by the massive F-distribution table, don't worry. Skip the F-value for now and study it at a later time. In this challenge you only need to look at the p-value.**
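As an alternative to reading the table, the critical value can be computed directly. The sketch below is only an illustration: the group count, subject count, and significance level are hypothetical numbers, not values from this challenge.

```python
from scipy import stats

# Hypothetical experiment size, chosen only for illustration
n_groups = 3      # number of subject groups (k)
n_subjects = 30   # total number of subjects (N)
alpha = 0.05      # significance level

dfn = n_groups - 1           # between-group degrees of freedom
dfd = n_subjects - n_groups  # within-group degrees of freedom

# Critical value: an observed F-value above this rejects the null hypothesis at level alpha
f_critical = stats.f.ppf(1 - alpha, dfn, dfd)
print(f"Critical F({dfn}, {dfd}) at alpha={alpha}: {f_critical:.2f}")
```

Changing `n_groups` or `n_subjects` shows how the threshold moves with the size of the experiment.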

The p-value is another number yielded by ANOVA which already takes the number of total subjects and the number of experiment groups into consideration. **Typically if your p-value is less than 0.05, you can declare the null-hypothesis is rejected.**
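To see how the p-value already accounts for the group and subject counts, the following sketch recomputes it from the F-value and the degrees of freedom. The three groups are made-up toy numbers, not the pokemon data.

```python
from scipy import stats

# Toy data: three hypothetical groups (not taken from the pokemon dataset)
group_a = [310, 405, 525, 480, 390]
group_b = [309, 405, 534, 500, 420]
group_c = [195, 205, 395, 250, 300]

result = stats.f_oneway(group_a, group_b, group_c)

k = 3   # number of groups
n = 15  # total number of observations

# Survival function of the F distribution: P(F >= observed statistic)
p_manual = stats.f.sf(result.statistic, k - 1, n - k)

print(result.pvalue, p_manual)
```

Both printed numbers should agree, since the p-value is the upper-tail probability of the F distribution with `k - 1` and `n - k` degrees of freedom.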

In this challenge, we want to understand whether the `Total` value differs significantly among the various types of pokemons, i.e. Grass vs Poison vs Fire vs Dragon... There are many types of pokemons, which makes this a perfect use case for ANOVA.

In [1]:
# Import libraries
import pandas as pd

from scipy import stats

In [3]:
# Load the data:

df = pd.read_csv("Pokemon.csv", index_col=0)

df.head()

| # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |


**To achieve our goal, we use three steps:**

1. **Extract the unique values of the pokemon types.**

1. **Select dataframes for each unique pokemon type.**

1. **Conduct ANOVA analysis across the pokemon types.**

#### First let's obtain the unique values of the pokemon types. These values should be extracted from Type 1 and Type 2 aggregated. Assign the unique values to a variable called `unique_types`.

*Hint: the correct number of unique types is 19 including `NaN`. You can disregard `NaN` in next step.*

In [4]:
# Your code here


unique_types = pd.concat([df["Type 1"], df["Type 2"]], axis=0).unique().tolist()

len(unique_types) # you should see 19

19

#### Second we will create a list named `pokemon_totals` to contain the `Total` values of each unique type of pokemons.

Why do we use a list instead of a dictionary to store the pokemon `Total` values? Because ANOVA only tells us whether there is a significant difference among the group means; it does not tell us which group(s) differ. Therefore, we don't need to know which `Total` belongs to which pokemon type.

*Hints:*

* Loop through `unique_types` and append the selected type's `Total` to `pokemon_totals`.
* Skip the `NaN` value in `unique_types`. `NaN` is a `float` variable which you can find out by using `type()`. The valid pokemon type values are all of the `str` type.
* At the end, the length of your `pokemon_totals` should be 18.
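The `NaN` hint above can be checked on a small example. This is a minimal sketch on a hypothetical list (`sample_types` is made up here and only mimics `unique_types`).

```python
import numpy as np

# Hypothetical list mimicking `unique_types`: strings plus one NaN
sample_types = ["Grass", "Fire", np.nan, "Water"]

# NaN is a float, while the valid type names are str
print(type(np.nan))

# Keep only the string entries, which drops NaN
valid_types = [t for t in sample_types if isinstance(t, str)]
print(valid_types)
```

Filtering on `isinstance(t, str)` is one option; `pd.notna(t)` would work as well.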

In [5]:
unique_types

['Grass',
 'Fire',
 'Water',
 'Bug',
 'Normal',
 'Poison',
 'Electric',
 'Ground',
 'Fairy',
 'Fighting',
 'Psychic',
 'Rock',
 'Ghost',
 'Ice',
 'Dragon',
 'Dark',
 'Steel',
 'Flying',
 nan]

In [6]:
unique_types = [unique_type for unique_type in unique_types if isinstance(unique_type, str)]  # Drop NaN, which is a float

In [8]:
pokemon_totals = []

# Your code here

for unique_type in unique_types:
    # Select the pokemons whose primary or secondary type matches
    is_type = (df["Type 1"] == unique_type) | (df["Type 2"] == unique_type)
    pokemon_totals.append(df.loc[is_type, "Total"])

len(pokemon_totals) # you should see 18

18

In [9]:
pokemon_totals[0]

#
1      318
2      405
3      525
3      625
43     320
      ... 
710    335
711    494
711    494
711    494
711    494
Name: Total, Length: 95, dtype: int64

#### Now we run the ANOVA test on `pokemon_totals`.

*Hints:*

* To conduct ANOVA, you can use `scipy.stats.f_oneway()`. Here's the [reference](http://b.link/scipy44).

* What if `f_oneway` throws an error because it does not accept `pokemon_totals` as a list? The trick is to add a `*` in front of `pokemon_totals`, e.g. `stats.f_oneway(*pokemon_totals)`. This unpacks the list and passes each list item as a separate argument to `f_oneway`.
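The `*` unpacking described in the hint can be illustrated on a small example. The groups below are hypothetical stand-ins for `pokemon_totals`, not the real data.

```python
from scipy import stats

# Hypothetical groups standing in for `pokemon_totals`
toy_groups = [
    [318, 405, 525, 625],
    [309, 405, 534, 634],
    [314, 405, 530, 630],
]

# Unpacking the list with * ...
unpacked = stats.f_oneway(*toy_groups)

# ... is equivalent to passing each group as its own argument
explicit = stats.f_oneway(toy_groups[0], toy_groups[1], toy_groups[2])

print(unpacked.statistic == explicit.statistic)
```

Unpacking is convenient here because the number of groups (18 pokemon types) is too large to write out by hand.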

In [10]:
pokemon_totals

[#
 1      318
 2      405
 3      525
 3      625
 43     320
       ... 
 710    335
 711    494
 711    494
 711    494
 711    494
 Name: Total, Length: 95, dtype: int64,
 #
 4      309
 5      405
 6      534
 6      634
 6      634
       ... 
 662    382
 663    499
 667    369
 668    507
 721    600
 Name: Total, Length: 64, dtype: int64,
 #
 7      314
 8      405
 9      530
 9      630
 54     320
       ... 
 689    500
 690    320
 692    330
 693    500
 721    600
 Name: Total, Length: 126, dtype: int64,
 #
 10     195
 11     205
 12     395
 13     195
 14     205
       ... 
 637    550
 649    600
 664    200
 665    213
 666    411
 Name: Total, Length: 72, dtype: int64,
 #
 16     251
 17     349
 18     479
 18     579
 19     253
       ... 
 667    369
 668    507
 676    472
 694    289
 695    481
 Name: Total, Length: 102, dtype: int64,
 #
 1      318
 2      405
 3      525
 3      625
 13     195
       ... 
 569    474
 590    294
 591    464
 690    320


In [11]:
stats.f_oneway(*pokemon_totals)

F_onewayResult(statistic=6.6175382960055344, pvalue=2.6457458815984803e-15)

#### Interpret the ANOVA test result. Is the difference significant?

In [12]:
# Your comment here

# The p-value (about 2.6e-15) is far below 0.05, so we reject the null hypothesis
# that all pokemon types share the same mean `Total`. At least one type's mean
# `Total` differs significantly from the others.