# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [1]:
# import numpy and pandas
import pandas as pd
import numpy as np


# Challenge 1 - Independent Sample T-tests

In this challenge, we will be using the Pokemon dataset. Before applying statistical methods to this data, let's first examine the data.

To load the data, run the code below.

In [2]:
# Run this code:

pokemon = pd.read_csv('../pokemon.csv')

Let's start off by looking at the `head` function in the cell below.

In [3]:
# Your code here:
pokemon.head()


Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


The first thing we would like to do is compare the legendary Pokemon to the regular Pokemon. To do this, we should examine the data further. How many legendary and non-legendary Pokemon are there?

In [9]:
# Your code here:
filter_legendary = pokemon["Legendary"] == True

num_legendary_pokemons = len(pokemon[filter_legendary])
num_non_legendary_pokemons = len(pokemon[~filter_legendary])

print(f"Legendary Pokemon: {num_legendary_pokemons}")
print(f"Non-legendary Pokemon: {num_non_legendary_pokemons}")


Legendary Pokemon: 65
Non-legendary Pokemon: 735


Compute the mean and standard deviation of the total points for both legendary and non-legendary Pokemon.

In [12]:
# Your code here:
mean_legendary_pokemon_total = pokemon[filter_legendary]["Total"].mean()
mean_non_legendary_pokemon_total = pokemon[~filter_legendary]["Total"].mean()
stdev_legendary_pokemon_total = pokemon[filter_legendary]["Total"].std()
stdev_non_legendary_pokemon_total = pokemon[~filter_legendary]["Total"].std()

print(f"Mean total points of legendary Pokemon: {mean_legendary_pokemon_total}")
print(f"Mean total points of non-legendary Pokemon: {mean_non_legendary_pokemon_total}")
print(f"Standard deviation of total points of legendary Pokemon: {stdev_legendary_pokemon_total}")
print(f"Standard deviation of total points of non-legendary Pokemon: {stdev_non_legendary_pokemon_total}")


Mean total points of legendary Pokemon: 637.3846153846154
Mean total points of non-legendary Pokemon: 417.21360544217686
Standard deviation of total points of legendary Pokemon: 60.93738905315344
Standard deviation of total points of non-legendary Pokemon: 106.76041745713005


Computing the means may give us a clue about how the statistical test will turn out; however, it does not prove whether there is a significant difference between the two groups.

In the cell below, use the `ttest_ind` function in `scipy.stats` to compare the total points for legendary and non-legendary Pokemon. Since we do not have any information about the population, assume the variances are not equal.

In [14]:
# Your code here:
from scipy.stats import ttest_ind

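# Note: ttest_ind defaults to equal_var=True (pooled variance), which is what produced
# the output below; pass equal_var=False to run Welch's test for unequal variances.
t_stat, p_value = ttest_ind(pokemon[filter_legendary]["Total"], pokemon[~filter_legendary]["Total"])

print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")

t-statistic: 16.386116965872425
p-value: 3.0952457469652825e-52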
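
The prompt asks for unequal variances, while the t-statistic above matches the pooled-variance form. As a rough sketch, both statistics can be recomputed from the summary statistics printed earlier using only the standard library. The helper name `two_sample_t` is hypothetical (it is not a `scipy` function), and the sketch returns only the statistic and degrees of freedom, not a p-value.

```python
import math

def two_sample_t(mean1, std1, n1, mean2, std2, n2, equal_var=True):
    """Return (t, df) for a two-sample t-test computed from summary statistics."""
    var1, var2 = std1 ** 2, std2 ** 2
    if equal_var:
        # Student's test: pool the two sample variances
        df = n1 + n2 - 2
        pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
        se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    else:
        # Welch's test: keep the variances separate
        a, b = var1 / n1, var2 / n2
        se = math.sqrt(a + b)
        df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return (mean1 - mean2) / se, df

# Summary statistics (mean, standard deviation, count) copied from the outputs above
legendary = (637.3846153846154, 60.93738905315344, 65)
regular = (417.21360544217686, 106.76041745713005, 735)

pooled_t, pooled_df = two_sample_t(*legendary, *regular, equal_var=True)
welch_t, welch_df = two_sample_t(*legendary, *regular, equal_var=False)

print(f"Pooled t: {pooled_t:.3f} (df = {pooled_df})")
print(f"Welch t:  {welch_t:.3f} (df = {welch_df:.1f})")
```

The pooled statistic should reproduce the value of about 16.39 reported above, while the Welch statistic comes out larger (about 25.83) because the smaller legendary group also has the smaller variance. Either way the difference is far from zero, so the conclusion should not change.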


What do you conclude from this test? Write your conclusions below.

In [6]:
# Your conclusions here:
# The t-statistic is very large and the p-value is very small.
# Our hypotheses are:
# - H0: the mean total points of legendary Pokemon equal the mean total points of non-legendary Pokemon.
# - H1: the mean total points of legendary Pokemon are greater than those of non-legendary Pokemon.

# At a significance level of 0.05 (95% confidence), the p-value is far below 0.05, so we can reject H0
# and conclude that legendary Pokemon have more total points than non-legendary Pokemon.
# (ttest_ind reports a two-sided p-value by default; halving it for the one-sided H1 keeps it far below 0.05.)

How about we try to compare the different types of pokemon? In the cell below, list the types of Pokemon from column `Type 1` and the count of each type.

In [19]:
# Your code here:
pokemon.groupby("Type 1").count()["#"].sort_values(ascending=False)


Type 1
Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Rock         44
Electric     44
Ground       32
Dragon       32
Ghost        32
Dark         31
Poison       28
Steel        27
Fighting     27
Ice          24
Fairy        17
Flying        4
Name: #, dtype: int64

Since water is the largest group of Pokemon, compare the mean and standard deviation of water Pokemon to all other Pokemon.

In [20]:
# Your code here:
# Note: I use the mean and standard deviation of the Total column
filter_water = pokemon["Type 1"] == "Water"

water_pokemon_total_mean = pokemon[filter_water]["Total"].mean()
water_pokemon_total_stdev = pokemon[filter_water]["Total"].std()

non_water_pokemon_total_mean = pokemon[~filter_water]["Total"].mean()
non_water_pokemon_total_stdev = pokemon[~filter_water]["Total"].std()

print(f"Mean total points of water Pokemon: {water_pokemon_total_mean}")
print(f"Standard deviation of total points of water Pokemon: {water_pokemon_total_stdev}")

print(f"Mean total points of non-water Pokemon: {non_water_pokemon_total_mean}")
print(f"Standard deviation of total points of non-water Pokemon: {non_water_pokemon_total_stdev}")

Mean total points of water Pokemon: 430.45535714285717
Standard deviation of total points of water Pokemon: 113.1882660643146
Mean total points of non-water Pokemon: 435.85901162790697
Standard deviation of total points of non-water Pokemon: 121.0916823020807


Perform a hypothesis test comparing the mean of total points for water Pokemon to all non-water Pokemon. Assume the variances are equal. 

In [21]:
# Your code here:
from scipy.stats import ttest_ind

t_stat, p_value = ttest_ind(pokemon[filter_water]["Total"], pokemon[~filter_water]["Total"])

print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")


t-statistic: -0.4418547448849676
p-value: 0.6587140317488793


Write your conclusion below.

In [10]:
# Your conclusions here:

# The absolute value of the t-statistic is small and the p-value is large.

# Our hypotheses are:
# - H0: the mean total points of water Pokemon equal the mean total points of non-water Pokemon.
# - H1: the mean total points of water Pokemon differ from the mean total points of non-water Pokemon.

# At a significance level of 0.05 (95% confidence), the p-value is greater than 0.05, so we cannot reject H0.
# We conclude that the total points of water Pokemon are not significantly different from those of other types.


# Challenge 2 - Matched Pairs Test

In this challenge we will compare dependent samples of data describing our Pokemon. Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our hypothesis is that the defense and attack scores are equal. In the cell below, import the `ttest_rel` function from `scipy.stats` and compare the two columns to see if there is a statistically significant difference between them.

In [24]:
# Your code here:
from scipy.stats import ttest_rel

t_stat, p_value = ttest_rel(pokemon["Defense"], pokemon["Attack"])

print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")

t-statistic: -4.325566393330478
p-value: 1.7140303479358558e-05


Describe the results of the test in the cell below.

In [26]:
# Your conclusions here:

# The absolute value of the t-statistic is large and the p-value is very small.

# Our hypotheses are:
# - H0: the mean attack and mean defense of the Pokemon are equal.
# - H1: the mean attack and mean defense of the Pokemon are different.

# We set the significance level at 0.05 (95% confidence).

# Since the p-value is below the significance level, we can reject H0:
# the conclusion is that attack and defense are significantly different.

# Since the t-statistic is negative, we can also say that defense is lower than attack on average.


We are also curious about whether there is a significant difference between the mean of special defense and the mean of special attack. Perform the hypothesis test in the cell below.

In [27]:
# Your code here:
t_stat, p_value = ttest_rel(pokemon["Sp. Def"], pokemon["Sp. Atk"])

print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")

t-statistic: -0.853986188453353
p-value: 0.3933685997548122


Describe the results of the test in the cell below.

In [29]:
# Your conclusions here:

# The absolute value of the t-statistic is small and the p-value is large.

# Our hypotheses are:
# - H0: the mean special attack and the mean special defense of the Pokemon are equal.
# - H1: the mean special attack differs from the mean special defense.

# At a significance level of 0.05 we draw this conclusion:
# - Since the p-value is greater than the significance level, we cannot reject H0. Across all
#   Pokemon, we find no significant difference between special defense and special attack.


As you may recall, a two sample matched pairs test can also be expressed as a one sample test of the difference between the two dependent columns.

Import the `ttest_1samp` function and perform a one sample t-test of the difference between defense and attack. Test the hypothesis that the difference between the means is zero. Confirm that the results of the test are the same.

In [36]:
# Your code here:
from scipy.stats import ttest_1samp

# Test 1 sample: difference between pokemon defense and attack against a mean of 0
pokemon_def_vs_attack = list(pokemon["Defense"] - pokemon["Attack"])
t_stat_one_sample, p_value_one_sample = ttest_1samp(pokemon_def_vs_attack, 0)

# Test 2 samples: pokemon defense vs pokemon attack
t_stat_two_sample, p_value_two_sample = ttest_rel(pokemon["Defense"], pokemon["Attack"])

# Notice the results are the same:
print(f"One-sample t-statistic: {t_stat_one_sample}")
print(f"Two-sample t-statistic: {t_stat_two_sample}")

print(f"One-sample p-value: {p_value_one_sample}")
print(f"Two-sample p-value: {p_value_two_sample}")


One-sample t-statistic: -4.325566393330478
Two-sample t-statistic: -4.325566393330478
One-sample p-value: 1.7140303479358558e-05
Two-sample p-value: 1.7140303479358558e-05
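
To see why the two tests agree, here is a small standard-library sketch of the arithmetic behind a matched pairs test: take the per-row differences, then divide their mean by their standard error. It uses only the five rows shown by `pokemon.head()` above, so the resulting statistic is illustrative and is not the full-dataset result.

```python
import math
import statistics

# Attack and Defense for the five rows shown by pokemon.head() above
attack = [49, 62, 82, 100, 52]
defense = [49, 63, 83, 123, 43]

# A matched pairs test is a one-sample test on the per-row differences
diffs = [d - a for d, a in zip(defense, attack)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
std_diff = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)

# t = mean difference / standard error of the mean difference
t_stat = mean_diff / (std_diff / math.sqrt(n))

print(f"Differences: {diffs}")
print(f"t = {t_stat:.3f} with {n - 1} degrees of freedom")
```

Because the paired statistic depends on the two columns only through their differences, testing the differences against a mean of zero is the same computation.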


# Bonus Challenge - The Chi-Square Test

The Chi-Square test is used to determine whether there is a statistically significant difference between observed and expected frequencies. In other words, we test whether two categorical variables are related or independent. The test is an alternative to Fisher's exact test and is typically used when sample sizes are larger; with a large enough sample, both tests produce similar results. Read more about the Chi-Squared test [here](https://en.wikipedia.org/wiki/Chi-squared_test).

In the cell below, create a contingency table using `pd.crosstab` comparing whether a Pokemon is legendary or not and whether the Type 1 of a Pokemon is water or not.

In [45]:
# Your code here:
df = pd.crosstab(pokemon["Legendary"], pokemon["Type 1"])
df

Legendary,Bug,Dark,Dragon,Electric,Fairy,Fighting,Fire,Flying,Ghost,Grass,Ground,Ice,Normal,Poison,Psychic,Rock,Steel,Water
False,69,29,20,40,16,27,47,2,30,67,28,22,96,28,43,40,23,108
True,0,2,12,4,1,0,5,2,2,3,4,2,2,0,14,4,4,4


Perform a chi-squared test using the `chi2_contingency` function in `scipy.stats`. You can read the documentation of the function [here](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html).

In [54]:
# Your code here:
from scipy.stats import chi2_contingency

chi2_stat, p_value, dof, expected_vals = chi2_contingency(df, correction=False)

print(f"Chi-square statistic: {chi2_stat}")
print(f"p-value: {p_value}")
print(f"Degrees of freedom: {dof}")
#print(f"Expected values: {expected_vals}")

Chi-square statistic: 90.4204913058596
p-value: 5.118547414721704e-12
Degrees of freedom: 17
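
The prompt asks specifically about water versus non-water, whereas the table above covers all 18 types. As a hedged sketch, the counts can be collapsed into a 2x2 table (4 of the 65 legendary and 108 of the 735 non-legendary Pokemon are Water) and the Pearson chi-square statistic computed by hand with the standard library. No continuity correction is applied, mirroring `correction=False` in the cell above.

```python
import math

# 2x2 counts derived from the outputs above:
# 65 legendary and 735 non-legendary in total; 4 and 108 of them are Water.
observed = [
    [108, 735 - 108],  # non-legendary: water, not water
    [4, 65 - 4],       # legendary:     water, not water
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

# With 1 degree of freedom the chi-square tail probability has a closed form
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}")
```

On these counts the statistic is about 3.62 with a p-value slightly above 0.05, so for water versus non-water alone the test would not reject independence at the 5% level, even though the all-types test above does.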


At a 95% confidence level, should we reject the null hypothesis?

In [19]:
# Your answer here:

# The hypotheses:
# - H0: a Pokemon's type and its legendary status are not related.
# - H1: a Pokemon's type is related to its legendary status.

# The significance level is 0.05 (95% confidence).

# Since the p-value is below the significance level, we can reject the null hypothesis.
# Conclusion: a Pokemon's type is related to whether it is legendary.