# Challenge 2 - ANOVA

In statistics, **Analysis of Variance (ANOVA)** is also used to analyze the differences among group means. The difference between the t-test and ANOVA is that the former is used to compare two groups, whereas the latter is used to compare three or more groups. [Read more about the difference between t-test and ANOVA](http://b.link/anova24).

From the ANOVA test, you receive two numbers. The first is the **F-value**, which indicates whether your null hypothesis can be rejected: you reject it when the F-value exceeds a critical value. That critical F-value varies with the total number of subjects and the number of subject groups in your experiment. In [this table](http://b.link/eda14) you can find the critical values of the F distribution. **If you are confused by the massive F-distribution table, don't worry. Skip the F-value for now and study it at a later time. In this challenge you only need to look at the p-value.**
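
As an alternative to reading the table, the critical value can also be computed. The following is a minimal sketch, assuming a hypothetical experiment with 3 groups and 30 subjects in total at a significance level of 0.05 (illustrative numbers, not taken from the pokemon data). It uses `scipy.stats.f.ppf`, the inverse CDF of the F distribution.

```python
from scipy import stats

# Hypothetical experiment (illustrative numbers, not from the pokemon data)
n_groups = 3      # number of groups
n_total = 30      # total number of subjects
alpha = 0.05      # significance level

# Degrees of freedom for a one-way ANOVA
df_between = n_groups - 1        # numerator
df_within = n_total - n_groups   # denominator

# Critical F-value: the null hypothesis is rejected when F exceeds it
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
print(f_critical)
```

The printed value should match the table entry for 2 and 27 degrees of freedom.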

The p-value is the other number yielded by ANOVA. It already takes the total number of subjects and the number of experiment groups into consideration. **Typically, if your p-value is less than 0.05, you can reject the null hypothesis.**
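
To see how the two numbers relate, here is a small sketch on made-up data (the three groups below are hypothetical). The p-value is the upper-tail probability of the F distribution at the observed F-value, which is why it already reflects the number of groups and subjects.

```python
from scipy import stats

# Three hypothetical groups of measurements (made-up numbers)
group_a = [4.1, 5.0, 4.8, 5.3, 4.6]
group_b = [5.9, 6.2, 5.5, 6.8, 6.1]
group_c = [4.9, 5.1, 5.6, 5.2, 4.7]

f_value, p_value = stats.f_oneway(group_a, group_b, group_c)

# Recompute the p-value from the F-value and the degrees of freedom
df_between = 3 - 1     # number of groups - 1
df_within = 15 - 3     # total subjects - number of groups
p_from_f = stats.f.sf(f_value, df_between, df_within)

print(f_value, p_value, p_from_f)
print(p_value < 0.05)  # decision rule at the usual 0.05 threshold
```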

In this challenge, we want to understand whether there are significant differences in the `Total` value among the various pokemon types, e.g. Grass vs Poison vs Fire vs Dragon... There are many pokemon types, which makes this a perfect use case for ANOVA.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# Load the data:
pokemon = pd.read_csv('pokemon.csv')
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


**To achieve our goal, we use three steps:**

1. **Extract the unique values of the pokemon types.**

1. **Select dataframes for each unique pokemon type.**

1. **Conduct ANOVA analysis across the pokemon types.**

#### First let's obtain the unique values of the pokemon types. These values should be extracted from Type 1 and Type 2 aggregated. Assign the unique values to a variable called `unique_types`.

*Hint: the correct number of unique types is 19 including `NaN`. You can disregard `NaN` in the next step.*
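
One possible way to collect the unique values from both columns in a single step is sketched below. It runs on a tiny hypothetical DataFrame (`toy` is not the challenge data) and, unlike the hint above, it drops `NaN` right away, so the count would be one lower than the hint states.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the pokemon DataFrame
toy = pd.DataFrame({
    'Type 1': ['Grass', 'Fire', 'Water', 'Bug'],
    'Type 2': ['Poison', np.nan, np.nan, 'Poison'],
})

# Stack both columns into one Series, drop NaN, then keep the distinct values
toy_types = pd.concat([toy['Type 1'], toy['Type 2']]).dropna().unique()
print(sorted(toy_types))
```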

In [3]:
# Your code here
# Get the unique values of Type 1 and Type 2 separately
type1_unique = pokemon['Type 1'].unique()
type2_unique = pokemon['Type 2'].unique()

# unique() returns a NumPy array
type(type1_unique)

numpy.ndarray

In [4]:
# Combine the unique values of Type 1 and Type 2 into one array
unique_types = np.concatenate((type1_unique, type2_unique))
# Converting to a set removes the types that appear in both columns
unique_types = set(unique_types)
len(unique_types) # you should see 19

19

#### Second, we will create a list named `pokemon_totals` to contain the `Total` values of each unique pokemon type.

Why do we use a list instead of a dictionary to store the pokemon `Total` values? Because ANOVA only tells us whether there is a significant difference among the group means. It does not tell us which group(s) are significantly different. Therefore, we don't need to know which `Total` belongs to which pokemon type.

*Hints:*

* Loop through `unique_types` and append the selected type's `Total` to `pokemon_totals`.
* Skip the `NaN` value in `unique_types`. `NaN` is a `float`, which you can confirm by using `type()`. The valid pokemon type values are all of the `str` type.
* At the end, the length of your `pokemon_totals` should be 18.

In [5]:
[(pokemon['Type 1']=='Grass')|(pokemon['Type 2']=='Poison')] 
# On pandas Series, '|' is an element-wise logical OR: a row is True if at least one
# of the two conditions is True for that row, and False otherwise.
# The outer square brackets only wrap the resulting boolean Series in a list.

# Here a row is True if Type 1 equals 'Grass' or Type 2 equals 'Poison'.
# Rows 0, 1, 2 and 3 are True because Type 1 is 'Grass' and Type 2 is 'Poison',
# but one of the two conditions would be enough because this is an OR comparison.

[0       True
 1       True
 2       True
 3       True
 4      False
        ...  
 795    False
 796    False
 797    False
 798    False
 799    False
 Length: 800, dtype: bool]
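
The element-wise behaviour of `|` can be checked on a small example. The sketch below uses two hypothetical Series (`type1` and `type2` are made up, not the challenge columns). Note that Python's `or` keyword does not work here, because the truth value of a whole Series is ambiguous.

```python
import pandas as pd

# Hypothetical columns with three rows each
type1 = pd.Series(['Grass', 'Bug', 'Fire'])
type2 = pd.Series(['Poison', 'Poison', None])

# Each comparison yields a boolean Series; '|' combines them row by row.
# The parentheses are required because '|' binds more tightly than '=='.
mask = (type1 == 'Grass') | (type2 == 'Poison')
print(mask.tolist())
```

The first row satisfies both conditions, the second only the `Poison` condition, and the third neither.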

In [6]:
#shows the dataframe with the rows that have Type 1 equal to 'Grass' or Type 2 equal to 'Poison'
pokemon[(pokemon['Type 1']=='Grass')|(pokemon['Type 2']=='Poison')].head(10)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
16,13,Weedle,Bug,Poison,195,40,35,30,20,20,50,1,False
17,14,Kakuna,Bug,Poison,205,45,25,50,25,25,35,1,False
18,15,Beedrill,Bug,Poison,395,65,90,40,45,80,75,1,False
19,15,BeedrillMega Beedrill,Bug,Poison,495,65,150,40,15,80,145,1,False
48,43,Oddish,Grass,Poison,320,45,50,55,75,65,30,1,False
49,44,Gloom,Grass,Poison,395,60,65,70,85,75,40,1,False


In [7]:
# Select the Total values of the rows where Type 1 is 'Grass' OR Type 2 is 'Poison'
pokemon['Total'][(pokemon['Type 1']=='Grass')|(pokemon['Type 2']=='Poison')][:10]  # first 10 rows of the pandas Series

0     318
1     405
2     525
3     625
16    195
17    205
18    395
19    495
48    320
49    395
Name: Total, dtype: int64

In [8]:
print(unique_types)

{'Normal', nan, 'Water', 'Fighting', 'Ice', 'Electric', 'Ground', 'Flying', 'Psychic', 'Bug', 'Poison', 'Rock', 'Ghost', 'Steel', 'Dragon', 'Fairy', 'Grass', 'Fire', 'Dark'}


In [9]:
# Sets are unordered, so the position of nan can differ between runs.
# In the output above, the first item is 'Normal', which is a str
type(list(unique_types)[0])

str

In [10]:
# The second item is nan, which is a float; every other element is a str
type(list(unique_types)[1])

float

In [11]:
pokemon_totals = []
for poke_type in unique_types:
    # Only strings are valid types; this skips nan, which is a float
    if isinstance(poke_type, str):
        # Total values of every pokemon whose Type 1 OR Type 2 matches poke_type
        total = pokemon['Total'][(pokemon['Type 1']==poke_type)|(pokemon['Type 2']==poke_type)]
        pokemon_totals.append(total)


len(pokemon_totals) # you should see 18

18

#### Now we run the ANOVA test on `pokemon_totals`.

*Hints:*

* To conduct ANOVA, you can use `scipy.stats.f_oneway()`. Here's the [reference](http://b.link/scipy44).

* What if `f_oneway` throws an error because it does not accept `pokemon_totals` as a list? The trick is to add a `*` in front of `pokemon_totals`, e.g. `stats.f_oneway(*pokemon_totals)`. This unpacks the list and supplies each list item as a separate argument to `f_oneway`.
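
The effect of the `*` can be seen in a short sketch. The `groups` list below is hypothetical and only stands in for `pokemon_totals`.

```python
from scipy import stats

# Hypothetical groups stored in a list, like pokemon_totals
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]

# These two calls are equivalent: '*' passes each list item as its own argument
unpacked = stats.f_oneway(*groups)
explicit = stats.f_oneway(groups[0], groups[1], groups[2])

print(unpacked.statistic == explicit.statistic)
print(unpacked.pvalue == explicit.pvalue)
```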

In [12]:
# Your code here
# 'stats' was already imported from scipy in the first cell
stats.f_oneway(*pokemon_totals)

F_onewayResult(statistic=6.617538296005535, pvalue=2.6457458815984803e-15)

#### Interpret the ANOVA test result. Is the difference significant?

In [13]:
# Your comment here
# The p-value (about 2.6e-15) is far lower than 0.05, so we reject the null hypothesis that
# all pokemon types have the same mean Total. The difference among the types is significant.