# Challenge 2 - ANOVA

In statistics, **Analysis of Variance (ANOVA)** is also used to analyze the differences among group means. The difference between the t-test and ANOVA is that the former is used to compare two groups, whereas the latter is used to compare three or more groups. [Read more about the difference between t-test and ANOVA](http://b.link/anova24).

The ANOVA test returns two numbers. The first is the **F-value**, which is compared against a critical value to decide whether the null hypothesis can be rejected. The critical F-value varies with the total number of subjects and the number of subject groups in your experiment. You can find the critical values of the F distribution in [this table](http://b.link/eda14). **If you are confused by the massive F-distribution table, don't worry. Skip the F-value for now and study it at a later time. In this challenge you only need to look at the p-value.**
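As an alternative to reading the table, the critical value can be computed directly. The following is a minimal sketch, assuming a significance level of 0.05 and a hypothetical experiment with 3 groups and 30 subjects in total (illustrative numbers, not taken from the Pokemon data). It uses `scipy.stats.f.ppf`, the inverse CDF of the F distribution.

```python
from scipy import stats

# Hypothetical experiment: these numbers are for illustration only
n_groups = 3
n_total = 30
alpha = 0.05

# Degrees of freedom between groups and within groups
dfn = n_groups - 1
dfd = n_total - n_groups

# The critical value is the point that leaves `alpha` probability in the upper tail
f_critical = stats.f.ppf(1 - alpha, dfn, dfd)
print(f"Critical F({dfn}, {dfd}) at alpha={alpha}: {f_critical:.3f}")
```

An F-value above this threshold would lead to rejecting the null hypothesis at the chosen significance level.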

The p-value is the other number returned by ANOVA. It already takes the total number of subjects and the number of experiment groups into account. **Typically, if your p-value is less than 0.05, you can reject the null hypothesis.**
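To see how the two numbers relate, here is a small sketch on made-up data (the three groups below are invented for illustration). It assumes the p-value reported by `scipy.stats.f_oneway` is the upper-tail probability of the F distribution at the observed F-value, and checks that with `scipy.stats.f.sf`.

```python
from scipy import stats

# Made-up measurements for three groups (not from the Pokemon data)
group_a = [20, 22, 19, 24, 21]
group_b = [28, 30, 27, 26, 29]
group_c = [21, 23, 20, 22, 24]

result = stats.f_oneway(group_a, group_b, group_c)

# Degrees of freedom: (groups - 1) and (total observations - groups)
dfn = 3 - 1
dfd = 15 - 3

# Upper-tail probability of the F distribution at the observed F-value
p_from_f = stats.f.sf(result.statistic, dfn, dfd)

alpha = 0.05
reject_null = result.pvalue < alpha
print(result.statistic, result.pvalue, p_from_f, reject_null)
```

Because the p-value already accounts for both degrees of freedom, comparing it with 0.05 replaces the table lookup.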

In this challenge, we want to understand whether the `Total` value differs significantly among the various Pokemon types, e.g. Grass vs Poison vs Fire vs Dragon. There are many Pokemon types, which makes this a perfect use case for ANOVA.

In [40]:
# Import libraries
import pandas as pd
import scipy.stats as st

In [4]:
# Load the data:
df = pd.read_csv('Pokemon.csv')
df.head(5)

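| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |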


**To achieve our goal, we use three steps:**

1. **Extract the unique values of the pokemon types.**

1. **Select dataframes for each unique pokemon type.**

1. **Conduct ANOVA analysis across the pokemon types.**

#### First let's obtain the unique values of the pokemon types. These values should be extracted from Type 1 and Type 2 aggregated. Assign the unique values to a variable called `unique_types`.

*Hint: the correct number of unique types is 19 including `NaN`. You can disregard `NaN` in next step.*

In [27]:
# Your code here
unique_types = list(set(list(df['Type 1'].unique()) + list(df['Type 2'].unique())))
print(unique_types)
len(unique_types) # you should see 19

['Electric', nan, 'Dragon', 'Ice', 'Bug', 'Fighting', 'Fairy', 'Dark', 'Poison', 'Flying', 'Grass', 'Ghost', 'Water', 'Psychic', 'Normal', 'Rock', 'Ground', 'Fire', 'Steel']


19
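The same idea can be expressed with pandas alone. This is a sketch on a small hypothetical DataFrame (`toy` and its rows are made up for illustration), assuming the two type columns are simply stacked before taking unique values.

```python
import pandas as pd

# Hypothetical miniature of the Pokemon table, with missing second types
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Water", "Grass"],
    "Type 2": ["Poison", None, None, "Flying"],
})

# Stack both columns into one Series, then take the unique values
all_types = pd.concat([toy["Type 1"], toy["Type 2"]])
unique_with_nan = all_types.unique()

# dropna() removes the missing entry before deduplicating
unique_without_nan = all_types.dropna().unique()
print(len(unique_with_nan), len(unique_without_nan))
```

Dropping the missing values at this stage would make the later `NaN` check unnecessary, at the cost of not matching the hinted count of 19.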

#### Second, we will create a list named `pokemon_totals` to contain the `Total` values of each unique Pokemon type.

Why do we use a list instead of a dictionary to store the `Total` values? Because ANOVA only tells us whether there is a significant difference among the group means; it does not tell us which group(s) are significantly different. Therefore, we don't need to know which `Total` values belong to which Pokemon type.

*Hints:*

* Loop through `unique_types` and append the selected type's `Total` values to `pokemon_totals`.
* Skip the `NaN` value in `unique_types`. `NaN` is a `float` variable which you can find out by using `type()`. The valid pokemon type values are all of the `str` type.
* At the end, the length of your `pokemon_totals` should be 18.
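The `NaN` hint can be checked in isolation. The sketch below uses a made-up list (`candidates`) and shows two ways to skip `NaN`: testing for `str`, as the hint suggests, or relying on the fact that `NaN` is not equal to itself.

```python
# A float NaN, as pandas uses for missing values in a column of strings
nan_value = float("nan")
candidates = ["Grass", nan_value, "Fire"]

print(type(nan_value))         # the float type
print(nan_value == nan_value)  # False: NaN never equals itself

# Option 1: keep only string entries, following the hint
valid_types = [t for t in candidates if isinstance(t, str)]

# Option 2: keep entries that equal themselves, which excludes NaN
also_valid = [t for t in candidates if t == t]
print(valid_types, also_valid)
```

Both filters give the same result here; the `isinstance` version states the intent more directly.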

In [45]:
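# First attempt: this stores one summed Total per type. ANOVA needs the individual
# values of each group, so the next cell keeps a list of values per type instead.
pokemon_totals = []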

for ptype in unique_types:
    
    if ptype == ptype: #checking if not NaN
        pokemon_totals.append(df.loc[df['Type 1'] == ptype,'Total'].sum() + df.loc[df['Type 2'] == ptype,'Total'].sum())

print(pokemon_totals)
len(pokemon_totals) # you should see 18

[22242, 27088, 17763, 27326, 24916, 16637, 23506, 24657, 45837, 39703, 20096, 54066, 42938, 41011, 26050, 29552, 29895, 23843]


18

In [58]:
pokemon_totals = []

# Each of the 18 elements is a list of all Total values for one Pokemon type
# (one group per type), so the result is a list of lists.
for ptype in unique_types:
    if ptype == ptype:  # NaN is not equal to itself, so this skips NaN
        # A Pokemon counts toward a type if it matches either Type 1 or Type 2
        pokemon_totals.append(list(df.loc[df['Type 1'] == ptype, 'Total']) + list(df.loc[df['Type 2'] == ptype, 'Total']))

print(pokemon_totals)

# No separate pokemon_groups variable is needed: pokemon_totals already holds one list per type

len(pokemon_totals) # you should see 18

[[320, 485, 325, 465, 330, 480, 490, 525, 580, 205, 280, 365, 510, 610, 360, 580, 295, 475, 575, 405, 405, 263, 363, 523, 405, 535, 540, 440, 520, 520, 520, 520, 520, 295, 497, 428, 275, 405, 515, 580, 580, 289, 481, 431, 330, 460, 319, 472, 471, 680], [300, 420, 600, 490, 590, 300, 420, 600, 700, 600, 700, 600, 700, 680, 780, 300, 410, 600, 700, 320, 410, 540, 485, 680, 680, 660, 700, 700, 300, 452, 600, 600, 634, 610, 540, 630, 340, 520, 680, 680, 680, 680, 300, 420, 600, 494, 362, 521, 245, 535], [455, 580, 250, 450, 330, 305, 300, 480, 580, 290, 410, 530, 580, 525, 530, 480, 305, 395, 535, 305, 485, 485, 304, 514, 475, 525, 535, 430, 334, 494, 594, 510, 520, 660, 700, 700, 362, 521], [195, 205, 395, 195, 205, 395, 495, 285, 405, 305, 450, 500, 500, 600, 265, 390, 250, 390, 390, 290, 465, 500, 600, 505, 500, 600, 195, 205, 395, 205, 385, 269, 414, 266, 456, 236, 400, 400, 194, 384, 224, 424, 424, 424, 424, 244, 474, 515, 310, 380, 500, 260, 360, 485, 325, 475, 315, 495, 319, 472, 30

18

#### Now we run ANOVA test on `pokemon_totals`.

*Hints:*

* To conduct ANOVA, you can use `scipy.stats.f_oneway()`. Here's the [reference](http://b.link/scipy44).

* What if `f_oneway` throws an error because it does not accept `pokemon_totals` as a list? The trick is to add a `*` in front of `pokemon_totals`, e.g. `st.f_oneway(*pokemon_totals)`. This unpacks the list and passes each item as a separate argument to `f_oneway`.
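The effect of `*` can be seen on a small made-up example (the `groups` list below is invented for illustration): unpacking the list gives the same result as passing each group by hand.

```python
from scipy import stats

# Made-up data: a list of three groups, each a list of measurements
groups = [
    [20, 22, 19, 24, 21],
    [28, 30, 27, 26, 29],
    [21, 23, 20, 22, 24],
]

# The star unpacks the outer list into three positional arguments
unpacked = stats.f_oneway(*groups)

# Equivalent call with the groups written out explicitly
explicit = stats.f_oneway(groups[0], groups[1], groups[2])

print(unpacked)
print(explicit)
```

Unpacking is what makes the call practical with 18 groups, where writing each argument out would be tedious.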

In [59]:
# Your code here
# Goal: test whether the mean Total differs significantly among the Pokemon types.
# Each inner list of pokemon_totals is passed to f_oneway as one group.

st.f_oneway(*pokemon_totals)

F_onewayResult(statistic=6.6175382960055344, pvalue=2.6457458815984803e-15)

#### Interpret the ANOVA test result. Is the difference significant?

In [None]:
# Your comment here
'''
The p-value (about 2.6e-15) is very small, far below 0.05 and also below 0.01, so I reject the null hypothesis
that all group means are the same. I conclude that the mean 'Total' of at least one Pokemon type is significantly
different from the others.
'''