# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [1]:
# Import numpy, pandas, and the other libraries used in this lab

import numpy as np
import math
import pandas as pd
import matplotlib.pyplot as plt
import random

from scipy.stats import uniform
from scipy.stats import norm
from scipy.stats import expon

# Challenge 1 - Independent Sample T-tests

In this challenge, we will be using the Pokemon dataset. Before applying statistical methods to this data, let's first examine the data.

To load the data, run the code below.

In [2]:
# Run this code:

pokemon = pd.read_csv('../pokemon.csv')

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


In [5]:
for col in pokemon.columns:
    print(pokemon[col].unique())

[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
 235 236 237 238 239 240 241 242 243 244 245 246 24

Let's start off by looking at the `head` function in the cell below.

In [3]:
# Your code here:
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


The first thing we would like to do is compare the legendary Pokemon to the regular Pokemon. To do this, we should examine the data further. What is the count of legendary vs. non-legendary Pokemon?

In [9]:
# Your code here:

# Boolean masks on the Legendary column
# (avoid naming a variable `filter`, which shadows the Python built-in)
is_legendary = pokemon['Legendary']
pokemon_leg = pokemon[is_legendary]

is_regular = ~pokemon['Legendary']
pokemon_reg = pokemon[is_regular]

# Count of legendary
cnt_leg = pokemon_leg["Legendary"].count()
cnt_leg

65

In [10]:
# Count of regular
cnt_reg = pokemon_reg["Legendary"].count()
cnt_reg

735
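
Both counts can also be read off in a single step. The sketch below uses a small made-up frame (`toy` is hypothetical and is not the Pokemon data) to show `value_counts` on a boolean column.

```python
import pandas as pd

# Hypothetical stand-in for the Pokemon frame; only the column name matches the lab.
toy = pd.DataFrame({"Legendary": [False, False, True, False, True]})

# value_counts tallies every distinct value of the column at once.
counts = toy["Legendary"].value_counts()

# Convert to a plain dict keyed by the boolean flag.
counts_by_flag = counts.to_dict()
print(counts_by_flag)
```

On the real data the same call would be `pokemon["Legendary"].value_counts()`.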

Compute the mean and standard deviation of the total points for both legendary and non-legendary Pokemon.

In [11]:
# Your code here:
# Calculating the mean
pokemon_leg["Total"].mean()

637.3846153846154

In [12]:
pokemon_reg["Total"].mean()

417.21360544217686

In [13]:
# Calculating the Standard Deviation
pokemon_leg["Total"].std()

60.93738905315346

In [14]:
pokemon_reg["Total"].std()

106.76041745713022

The means give us a clue about how the statistical test may turn out; however, they do not prove that there is a significant difference between the two groups.

In the cell below, use the `ttest_ind` function in `scipy.stats` to compare the total points for legendary and non-legendary Pokemon. Since we do not have any information about the population, assume the variances are not equal.

In [15]:
# Your code here:
from scipy.stats import ttest_ind

ttest_ind(pokemon_leg["Total"], pokemon_reg["Total"], equal_var=False)

Ttest_indResult(statistic=25.8335743895517, pvalue=9.357954335957446e-47)
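
As a cross-check, the same Welch test can be run from the summary statistics alone. This is a minimal sketch that assumes the means, standard deviations, and counts printed in the earlier cells; the variable names are chosen here for illustration.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the cells above (pandas .std() uses ddof=1).
leg_mean, leg_std, leg_n = 637.3846153846154, 60.93738905315346, 65
reg_mean, reg_std, reg_n = 417.21360544217686, 106.76041745713022, 735

# equal_var=False requests Welch's t-test, matching the ttest_ind call above.
welch = ttest_ind_from_stats(
    leg_mean, leg_std, leg_n,
    reg_mean, reg_std, reg_n,
    equal_var=False,
)
print(welch.statistic, welch.pvalue)
```

The statistic should agree with the `ttest_ind` result above up to floating-point rounding.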

What do you conclude from this test? Write your conclusions below.

In [6]:
# Your conclusions here:
# Since the p-value is very low (about 9.4e-47, far below 0.05), we reject the H0 hypothesis,
# which states that legendary and non-legendary Pokemon have the same mean total points.

How about we try to compare the different types of pokemon? In the cell below, list the types of Pokemon from column `Type 1` and the count of each type.

In [20]:
# Your code here:
unique_types = pd.unique(pokemon['Type 1'])
unique_types

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [21]:
type_count = pokemon.groupby(["Type 1"])["Type 1"].count()
type_count

Type 1
Bug          69
Dark         31
Dragon       32
Electric     44
Fairy        17
Fighting     27
Fire         52
Flying        4
Ghost        32
Grass        70
Ground       32
Ice          24
Normal       98
Poison       28
Psychic      57
Rock         44
Steel        27
Water       112
Name: Type 1, dtype: int64

Since water is the largest group of Pokemon, compare the mean and standard deviation of water Pokemon to all other Pokemon.

In [23]:
# Your code here:
# Mean and std for Pokemon = Water:

pokemon.loc[pokemon["Type 1"] == "Water", "Total"].mean()

430.45535714285717

In [24]:
pokemon.loc[pokemon["Type 1"] == "Water", "Total"].std()

113.18826606431458

In [25]:
# Mean and std for Pokemon != Water
pokemon.loc[pokemon["Type 1"] != "Water", "Total"].mean()

435.85901162790697

In [26]:
pokemon.loc[pokemon["Type 1"] != "Water", "Total"].std()

121.09168230208066

In [27]:
pokemon_water = pokemon.loc[pokemon["Type 1"] == "Water"]
pokemon_water

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
9,7,Squirtle,Water,,314,44,48,65,50,64,43,1,False
10,8,Wartortle,Water,,405,59,63,80,65,80,58,1,False
11,9,Blastoise,Water,,530,79,83,100,85,105,78,1,False
12,9,BlastoiseMega Blastoise,Water,,630,79,103,120,135,115,78,1,False
59,54,Psyduck,Water,,320,50,52,48,65,50,55,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
724,656,Froakie,Water,,314,41,56,40,62,44,71,6,False
725,657,Frogadier,Water,,405,54,63,52,83,56,97,6,False
726,658,Greninja,Water,Dark,530,72,95,67,103,71,122,6,False
762,692,Clauncher,Water,,330,50,53,62,58,63,44,6,False


In [28]:
pokemon_nonwater = pokemon.loc[pokemon["Type 1"] != "Water"]
pokemon_nonwater

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


Perform a hypothesis test comparing the mean of total points for water Pokemon to all non-water Pokemon. Assume the variances are equal. 

In [29]:
# Your code here:

from scipy.stats import ttest_ind

ttest_ind(pokemon_water["Total"], pokemon_nonwater["Total"], equal_var=True)

Ttest_indResult(statistic=-0.4418547448849676, pvalue=0.6587140317488793)
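
With `equal_var=True` the test pools the two sample variances. The sketch below works through that arithmetic by hand, assuming the means and standard deviations printed above and the group sizes from the type counts (112 water, 800 - 112 = 688 others); the variable names are illustrative.

```python
import math
from scipy.stats import t as t_dist

# Summary statistics copied from the cells above.
water_mean, water_std, water_n = 430.45535714285717, 113.18826606431458, 112
other_mean, other_std, other_n = 435.85901162790697, 121.09168230208066, 688

# Pooled variance: each sample variance weighted by its degrees of freedom.
dof = water_n + other_n - 2
pooled_var = ((water_n - 1) * water_std**2 + (other_n - 1) * other_std**2) / dof

# Standard error of the difference in means, then the t statistic.
std_err = math.sqrt(pooled_var * (1 / water_n + 1 / other_n))
t_stat = (water_mean - other_mean) / std_err

# Two-sided p-value from the survival function of the t distribution.
p_value = 2 * t_dist.sf(abs(t_stat), dof)
print(t_stat, p_value)
```

Both numbers should match the `ttest_ind` output above up to rounding.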

Write your conclusion below.

In [10]:
# Your conclusions here:
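# The p-value (about 0.66) is well above 0.05, so we fail to reject the H0 hypothesis:
# there is no significant evidence that the mean total points of water and non-water Pokemon differ.
# Failing to reject H0 does not prove that H0 is true.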
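
The decision rule used in both conclusions so far can be written down explicitly. This is a small sketch; `decide` is a hypothetical helper and the significance level of 0.05 is an assumption corresponding to 95% confidence.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the decision for a hypothesis test at significance level alpha."""
    if p_value < alpha:
        return "reject H0"
    # A large p-value is a lack of evidence against H0, not proof of H0.
    return "fail to reject H0"

# p-values reported by the two ttest_ind calls above
print(decide(9.357954335957446e-47))  # legendary vs. non-legendary
print(decide(0.6587140317488793))     # water vs. non-water
```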


# Challenge 2 - Matched Pairs Test

In this challenge we will compare dependent samples of data describing our Pokemon. Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our hypothesis is that the defense and attack scores are equal. In the cell below, import the `ttest_rel` function from `scipy.stats` and compare the two columns to see if there is a statistically significant difference between them.

In [31]:
# Your code here:

# Defining the samples:
pokemon_attack = pokemon["Attack"]
pokemon_attack

pokemon_defense = pokemon["Defense"]
pokemon_defense

0       49
1       63
2       83
3      123
4       43
      ... 
795    150
796    110
797     60
798     60
799    120
Name: Defense, Length: 800, dtype: int64

In [33]:
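from scipy.stats import ttest_ind

# Note: ttest_ind treats Attack and Defense as independent samples (Welch's test here).
# Both scores come from the same Pokemon, so the matched pairs test requested above is ttest_rel.
ttest_ind(pokemon_attack, pokemon_defense, equal_var=False)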

Ttest_indResult(statistic=3.2417640740423126, pvalue=0.0012124374824544375)
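
Attack and Defense are measured on the same Pokemon, which is the situation `ttest_rel` (named in the task above) is designed for. The sketch below uses synthetic data, not the Pokemon columns, to illustrate how a paired test and an independent test can differ when the two columns are correlated; all names and parameters are made up for the example.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)

# Synthetic paired scores: both columns share a per-row baseline, so they are correlated.
baseline = rng.normal(70, 30, size=200)
score_a = baseline + rng.normal(5, 5, size=200)
score_b = baseline + rng.normal(0, 5, size=200)

# Paired test: works on the row-wise differences score_a - score_b.
paired = ttest_rel(score_a, score_b)

# Independent (Welch) test: ignores which rows belong together.
unpaired = ttest_ind(score_a, score_b, equal_var=False)

print(paired.statistic, unpaired.statistic)
```

On the lab data the paired call would be `ttest_rel(pokemon_attack, pokemon_defense)`.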

Describe the results of the test in the cell below.

In [32]:
# Your conclusions here:
# The p-value (about 0.0012) is below 0.05, so we reject the H0 hypothesis that the mean attack and defense scores are equal.

Ttest_indResult(statistic=3.2417640740423126, pvalue=0.0012123980547321454)

We are also curious about whether there is a significant difference between the mean of special defense and the mean of special attack. Perform the hypothesis test in the cell below.

In [35]:
# Your code here:

pokemon_sp_attack = pokemon["Sp. Atk"]
pokemon_sp_attack

pokemon_sp_defense = pokemon["Sp. Def"]
pokemon_sp_defense

0       65
1       80
2      100
3      120
4       50
      ... 
795    150
796    110
797    130
798    130
799     90
Name: Sp. Def, Length: 800, dtype: int64

In [36]:
# Note: ttest_ind treats the two columns as independent; the paired alternative is ttest_rel.
ttest_ind(pokemon_sp_attack, pokemon_sp_defense, equal_var=False)

Ttest_indResult(statistic=0.6041290031014401, pvalue=0.5458458438771813)

Describe the results of the test in the cell below.

In [14]:
# Your conclusions here:
# The p-value (about 0.55) is above 0.05, so we cannot reject the H0 hypothesis that the two means are the same.


As you may recall, a two sample matched pairs test can also be expressed as a one sample test of the difference between the two dependent columns.

Import the `ttest_1samp` function and perform a one sample t-test of the difference between defense and attack. Test the hypothesis that the difference between the means is zero. Confirm that the results of the test are the same.
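
The equivalence described above can be checked numerically. This is a minimal sketch on synthetic paired data (the arrays `first` and `second` are made up, not taken from the Pokemon dataset).

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

rng = np.random.default_rng(1)

# Synthetic paired columns standing in for two scores measured on the same rows.
first = rng.normal(75, 30, size=100)
second = first + rng.normal(3, 10, size=100)

# Matched pairs test on the two columns.
paired = ttest_rel(first, second)

# One-sample test on the row-wise differences, with H0: mean difference is zero.
one_sample = ttest_1samp(first - second, 0)

print(np.isclose(paired.statistic, one_sample.statistic))
print(np.isclose(paired.pvalue, one_sample.pvalue))
```

For the lab data, the corresponding one-sample call would be `ttest_1samp(pokemon_defense - pokemon_attack, 0)`.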

In [39]:
# Your code here:
from scipy.stats import ttest_1samp

# Getting the mean for one of the samples:
pokemon_attack.mean()

79.00125

In [40]:
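# One-sample test of pokemon_defense against the mean of pokemon_attack.
# Note: this treats the attack mean as a fixed constant. The test of the paired
# difference would be ttest_1samp(pokemon_defense - pokemon_attack, 0).
ttest_1samp(pokemon_defense, (pokemon_attack.mean()))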

Ttest_1sampResult(statistic=-4.679124590910434, pvalue=3.3829080435248473e-06)

In [None]:
# Again the p-value is very low (about 3.4e-06), so we reject the H0 hypothesis that the means are equal
# (the same conclusion as with the two-sample test).

# Bonus Challenge - The Chi-Square Test

The Chi-Square test is used to determine whether there is a statistically significant difference in frequencies. In other words, we are testing whether there is a relationship between categorical variables or whether the variables are independent. This test is an alternative to Fisher's exact test and is used in scenarios where sample sizes are larger; with a large enough sample size, both tests produce similar results. Read more about the Chi-Squared test [here](https://en.wikipedia.org/wiki/Chi-squared_test).

In the cell below, create a contingency table using `pd.crosstab` comparing whether a Pokemon is legendary or not and whether the Type 1 of a Pokemon is water or not.
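
One possible shape for such a table is sketched below on a tiny made-up frame. The rows in `toy` are hypothetical; only the column names follow the Pokemon dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the Pokemon dataset.
toy = pd.DataFrame({
    "Type 1": ["Water", "Fire", "Water", "Grass", "Water", "Rock"],
    "Legendary": [False, False, True, False, False, True],
})

# Boolean column: is the primary type water?
is_water = (toy["Type 1"] == "Water").rename("Water")

# Rows: legendary or not. Columns: water or not.
table = pd.crosstab(toy["Legendary"], is_water)
print(table)
```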

In [17]:
# Your code here:



Perform a chi-squared test using the `chi2_contingency` function in `scipy.stats`. You can read the documentation of the function [here](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html).
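
The call pattern is sketched below on a hypothetical 2x2 table of counts. The numbers in `observed` are invented for illustration and are not the Pokemon counts.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (two binary categories).
observed = np.array([[30, 10],
                     [20, 40]])

# Returns the statistic, the p-value, the degrees of freedom, and the
# expected counts under independence. For 2x2 tables SciPy applies
# Yates' continuity correction by default (correction=True).
chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
print(expected)
```

The p-value is then compared with the chosen significance level, as in the t-tests above.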

In [18]:
# Your code here:



At the 95% confidence level (a significance level of 0.05), should we reject the null hypothesis?

In [19]:
# Your answer here:

