# Bonus Challenge 2 - ANOVA

In statistics, **Analysis of Variance (ANOVA)** is used to analyze the differences among group means. The difference between a t-test and ANOVA is that the former is used to compare two groups, whereas the latter is used to compare three or more groups. [Read more about the difference between t-test and ANOVA](http://b.link/anova24).

From the ANOVA test, you receive two numbers. The first number is called the **F-value**, which is compared against a critical value to decide whether your null-hypothesis can be rejected. The critical F-value that rejects the null-hypothesis varies according to the total number of subjects and the number of subject groups in your experiment. In [this table](http://b.link/eda14) you can find the critical values of the F distribution. **If you are confused by the massive F-distribution table, don't worry. Skip the F-value for now and study it at a later time. In this challenge you only need to look at the p-value.**

The p-value is the other number yielded by ANOVA. It already takes the total number of subjects and the number of experiment groups into consideration. **Typically, if your p-value is less than 0.05, you can declare the null-hypothesis rejected.**
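To make the two numbers concrete, here is a minimal sketch on three small made-up groups. The group values and variable names are illustrative assumptions, not part of the pokemon data. Instead of reading the table, it looks up the critical F-value with `scipy.stats.f.ppf`, using the number of groups and the total number of subjects.

```python
from scipy import stats

# Three made-up groups of five subjects each (illustrative values only)
group_a = [20, 21, 19, 22, 20]
group_b = [30, 29, 31, 32, 28]
group_c = [25, 26, 24, 25, 27]

# One-way ANOVA returns the F-value and the p-value
f_value, p_value = stats.f_oneway(group_a, group_b, group_c)

# Critical F-value at the 0.05 level:
# numerator df = groups - 1, denominator df = subjects - groups
n_groups = 3
n_subjects = len(group_a) + len(group_b) + len(group_c)
critical_f = stats.f.ppf(0.95, dfn=n_groups - 1, dfd=n_subjects - n_groups)

# Both decision rules lead to the same conclusion
print(f_value > critical_f, p_value < 0.05)
```

Comparing the F-value with the critical value and comparing the p-value with 0.05 are two views of the same decision, which is why this challenge can rely on the p-value alone.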

In this challenge, we want to understand whether the `Total` value differs significantly among the various pokemon types, e.g. Grass vs Poison vs Fire vs Dragon... There are many pokemon types, which makes this a perfect use case for ANOVA. Use Ironhack's database to load the pokemon data (db: pokemon, table: pokemon_stats).

In [1]:
# Import libraries
import pandas as pd

In [2]:
# Load the data:
pokemon = pd.read_csv(r'C:\Users\radek\IronHack\IronRadek\Week5\Day1\Pokemon.csv')

**To achieve our goal, we use three steps:**

1. **Extract the unique values of the pokemon types.**

1. **Select dataframes for each unique pokemon type.**

1. **Conduct ANOVA analysis across the pokemon types.**

#### First, let's obtain the unique values of the pokemon types. These values should be extracted from the `Type 1` and `Type 2` columns combined. Assign the unique values to a variable called `unique_types`.

*Hint: the correct number of unique types is 19 including `NaN`. You can disregard `NaN` in the next step.*

In [8]:
# Your code here
unique_types= set(pokemon['Type 1']).union(set(pokemon['Type 2']))
len(unique_types) # you should see 19

19

In [9]:
unique_types

{'Bug',
 'Dark',
 'Dragon',
 'Electric',
 'Fairy',
 'Fighting',
 'Fire',
 'Flying',
 'Ghost',
 'Grass',
 'Ground',
 'Ice',
 'Normal',
 'Poison',
 'Psychic',
 'Rock',
 'Steel',
 'Water',
 nan}
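As an alternative to the set union, the two columns can be stacked with `pd.concat` and deduplicated with `Series.unique()`. The sketch below runs on a tiny made-up DataFrame (the `toy` data and its values are assumptions for illustration only) and shows that the missing value is counted once.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the pokemon table (illustrative values only)
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Water"],
    "Type 2": ["Poison", np.nan, np.nan],
})

# Stack both type columns into one Series, then keep the unique values
toy_unique_types = pd.concat([toy["Type 1"], toy["Type 2"]]).unique()

# Valid type names are strings; the missing value is not
toy_valid_types = [t for t in toy_unique_types if isinstance(t, str)]

print(len(toy_unique_types), sorted(toy_valid_types))
```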

#### Second, we will create a list named `pokemon_totals` to contain the `Total` values of each unique pokemon type.

Why do we use a list instead of a dictionary to store the pokemon `Total` values? Because ANOVA only tells us whether there is a significant difference among the group means; it does not tell us which group(s) are significantly different. Therefore, we don't need to know which `Total` belongs to which pokemon type.

*Hints:*

* Loop through `unique_types` and append the selected type's `Total` to `pokemon_totals`.
* Skip the `NaN` value in `unique_types`. `NaN` is a `float`, which you can verify with `type()`. All valid pokemon type values are of type `str`.
* At the end, the length of your `pokemon_totals` should be 18.
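The hints above can be sketched on a tiny made-up DataFrame. The `toy` data, its values, and the variable names are assumptions for illustration. This variant keeps every row of each type rather than a sample.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the pokemon table (illustrative values only)
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Water", "Grass"],
    "Type 2": ["Poison", np.nan, np.nan, np.nan],
    "Total": [318, 309, 314, 405],
})
toy_types = {"Grass", "Poison", "Fire", "Water", np.nan}

toy_totals = []
for pokemon_type in toy_types:
    # NaN is a float, so only str values are real type names
    if not isinstance(pokemon_type, str):
        continue
    # A pokemon belongs to a type if either type column matches
    mask = (toy["Type 1"] == pokemon_type) | (toy["Type 2"] == pokemon_type)
    toy_totals.append(toy.loc[mask, "Total"])

print(len(toy_totals))
```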

In [35]:
pokemon_totals = []

# Your code here
# Take a random sample of 20 pokemons from each type
for type_pok in unique_types:
    # Skip NaN, which is a float rather than a str
    if isinstance(type_pok, str):
        is_type = (pokemon['Type 1'] == type_pok) | (pokemon['Type 2'] == type_pok)
        total_pokemon = pokemon[is_type]['Total'].sample(20)
        pokemon_totals.append(total_pokemon)

len(pokemon_totals) # you should see 18

18

In [36]:
type(pokemon_totals)
type(pokemon_totals[1])

pandas.core.series.Series

In [37]:
pokemon_totals

[62     455
 72     305
 114    455
 321    474
 527    618
 503    300
 334    280
 599    465
 713    580
 681    510
 437    534
 231    500
 504    490
 279    630
 232    600
 593    405
 277    405
 310    460
 559    528
 701    580
 Name: Total, dtype: int64,
 460    424
 427    600
 589    508
 330    330
 229    600
 513    535
 498    625
 455    350
 661    440
 717    600
 749    448
 748    325
 331    430
 416    580
 685    340
 659    489
 410    300
 456    495
 89     465
 777    470
 Name: Total, dtype: int64,
 130    520
 214    490
 527    618
 430    600
 587    425
 165    600
 376    500
 639    370
 357    470
 756    288
 68     310
 336    510
 271    600
 413    700
 335    410
 757    482
 217    405
 218    455
 162    680
 391    425
 Name: Total, dtype: int64,
 55     265
 684    483
 250    330
 597    509
 611    292
 493    600
 39     505
 708    600
 314    266
 794    600
 376    500
 32     300
 239    450
 523    510
 354    560
 424    770
 371

#### Now we run the ANOVA test on `pokemon_totals`.

*Hints:*

* To conduct ANOVA, you can use `scipy.stats.f_oneway()`. Here's the [reference](http://b.link/scipy44).

* What if `f_oneway` throws an error because it does not accept `pokemon_totals` as a list? The trick is to add a `*` in front of `pokemon_totals`, e.g. `stats.f_oneway(*pokemon_totals)`. This unpacks the list and supplies each list item as a separate argument to `f_oneway`.
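The `*` trick from the hint can be checked on a small made-up list of groups (the values and variable names here are illustrative assumptions). Unpacking the list gives the same result as passing each group by hand.

```python
from scipy import stats

# A list of three made-up groups (illustrative values only)
groups = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [10, 11, 12, 13],
]

# The star unpacks the list into separate positional arguments
result_unpacked = stats.f_oneway(*groups)

# Equivalent call with every group written out explicitly
result_explicit = stats.f_oneway(groups[0], groups[1], groups[2])

print(result_unpacked.pvalue == result_explicit.pvalue)
```

Unpacking is convenient when the number of groups is not known in advance, as with the 18 pokemon types here.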

In [38]:
# Your code here
from scipy.stats import f_oneway

f_oneway(*pokemon_totals)

F_onewayResult(statistic=2.8097727481068944, pvalue=0.00019292491644731343)

#### Interpret the ANOVA test result. Is the difference significant?

In [None]:
# Your comment here
'''
The obtained p-value is very small (0.00019), which means that the mean Total differs significantly among the pokemon types. This is logical because a pokemon's type influences its properties and thus its final score.
'''