                                   Use Python to conduct a hypothesis test
                                                May 17, 2023

### Hypothesis testing - t-test

A t test is a statistical test that is used to compare the means of two groups. 

The t test assumes your data:

* are independent
* are (approximately) normally distributed
* have a similar amount of variance within each group being compared (a.k.a. homogeneity of variance)
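
Here is a minimal sketch of how the last two assumptions could be checked before running the test. It uses synthetic, made-up measurements rather than the iris data, and the choice of the Shapiro-Wilk and Levene tests is my own assumption about how one might check them; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic measurements for two hypothetical groups (not the iris data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1.5, scale=0.2, size=20)
group_b = rng.normal(loc=5.5, scale=0.5, size=20)

# Shapiro-Wilk: a small p-value suggests the sample is not normally distributed.
shapiro_a = stats.shapiro(group_a)
shapiro_b = stats.shapiro(group_b)

# Levene: a small p-value suggests the two groups have unequal variances.
levene_result = stats.levene(group_a, group_b)

print("Shapiro p-values:", shapiro_a.pvalue, shapiro_b.pvalue)
print("Levene p-value:", levene_result.pvalue)
```

Independence cannot be tested this way; it depends on how the data were collected.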



#### problem statement:
You want to know whether the mean petal length of iris flowers differs according to their species. You find two different species of irises growing in a garden and measure 25 petals of each species.

I can use Python to simulate taking a random sample of 20 petal lengths from each species, and conduct a
two-sample t-test based on the sample data.

Before I begin with the exercises and analyzing the data, I need to import all libraries and extensions
required for this programming exercise. I will be using pandas and scipy stats for operations.


In [1]:
import pandas as pd
import scipy.stats as stats

In [2]:
df = pd.read_csv('flower.data.csv')

# drop the unnecessary index column
df.drop('Unnamed: 0', axis = 1 , inplace=True)
df.head()

|   | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 2 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 3 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 4 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
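
An `Unnamed: 0` column usually appears when a CSV was saved together with its row index. As a small sketch of an alternative to dropping it afterwards, the column can be read as the index directly with `index_col=0`. The CSV text below is made up for illustration and is not the contents of `flower.data.csv`.

```python
import io
import pandas as pd

# A tiny made-up CSV whose first column is an unnamed saved index.
csv_text = ",Petal.Length,Species\n0,1.4,setosa\n1,5.1,virginica\n"

# Read normally: the unnamed column is loaded as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())      # ['Unnamed: 0', 'Petal.Length', 'Species']

# index_col=0 uses that column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())  # ['Petal.Length', 'Species']
```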


#### Organize my data

To start, I filter the data frame to get the rows for the species setosa and
virginica.

First, name a new variable: df_setosa. Then, use the relational operator for equals (==) to select the
rows whose Species column matches.

In [3]:
df_setosa = df[df['Species'] == 'setosa']

Next, name another variable: df_virginica. Follow the same procedure to select the matching rows from
the Species column.

In [4]:
df_virginica = df[df['Species'] == 'virginica']

# reset the index so the rows are numbered from 0
df_virginica = df_virginica.reset_index(drop=True)
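
To make the filtering step concrete, here is a small sketch on a toy data frame. The values are made up; only the column names mirror the notebook.

```python
import pandas as pd

# Toy data with made-up values, using the same column names as the notebook.
toy = pd.DataFrame({
    "Petal.Length": [1.4, 5.1, 1.3, 6.0],
    "Species": ["setosa", "virginica", "setosa", "virginica"],
})

# The comparison yields a boolean Series: True where the species matches.
mask = toy["Species"] == "virginica"

# Indexing with the mask keeps only the True rows; the original labels survive.
toy_virginica = toy[mask]
print(toy_virginica.index.tolist())   # [1, 3]

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
toy_virginica = toy_virginica.reset_index(drop=True)
print(toy_virginica.index.tolist())   # [0, 1]
```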

#### Simulate random sampling

Now that I have organized the data, I use the sample() function to take a random sample of
20 petal lengths from each species. First, name a new variable: sampled_setosa. Then, enter the
arguments of the sample() function.

* **n**: Your sample size is 20.
* **replace**: Defaults to False; choose True when you are sampling with replacement.
* **random_state**: Choose an arbitrary number for the random seed: how about 12000.

When we set random_state, the output will be the same random sample every time. It’s random in the sense that it’s not predictable from the input. But it’s reproducible, because the sample will be the same if I run it again, and should be the same if you run it too.

Note that any integer works as the argument to random_state: a different number gives a different sample, but no seed is better than another.
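
A quick sketch of what these arguments do, on a toy data frame with made-up values:

```python
import pandas as pd

# Toy frame with made-up values standing in for one species.
toy = pd.DataFrame({"Petal.Length": [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]})

# Same seed, same rows: the draw is reproducible.
first = toy.sample(n=3, random_state=12000)
second = toy.sample(n=3, random_state=12000)
print(first.equals(second))   # True

# With replace=True a row may be drawn more than once,
# so n is allowed to exceed the number of rows.
with_replacement = toy.sample(n=12, replace=True, random_state=12000)
print(len(with_replacement))  # 12
```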

In [5]:
sampled_setosa = df_setosa.sample(n = 20 , random_state = 12000)

Now, I name another variable: sampled_virginica. Follow the same procedure, but this time choose
a different number for the random seed: how about 22560.

In [6]:
sampled_virginica = df_virginica.sample(n = 20 , random_state = 22560)

#### Compute the sample means

I now have two random samples of 20 petal lengths, one sample for each species. Next, I use mean() to
compute the mean petal length for both setosa and virginica.

In [7]:
sampled_setosa['Petal.Length'].mean()

1.47

In [8]:
sampled_virginica['Petal.Length'].mean()

5.569999999999999

setosa has a mean petal length of about 1.47, while virginica has a mean petal length
of about 5.57.

Based on my sample data, the observed difference between the mean petal lengths of
setosa and virginica is about 4.10 (5.57 - 1.47).

**Note**: At this point, I might be tempted to conclude that virginica has longer petals on average than setosa.
However, due to sampling variability, this observed difference might simply
be due to chance, rather than an actual difference in the corresponding population means. A
hypothesis test can help me determine whether or not my results are statistically significant.
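
Sampling variability is easy to see in a small simulation. The sketch below uses only the standard library and a made-up population (not the iris data): every sample is drawn from the same population, yet the sample means differ.

```python
import random
import statistics

# A made-up "population" of 1,000 petal lengths (not the iris data).
rng = random.Random(42)
population = [rng.gauss(5.5, 0.5) for _ in range(1000)]

# Draw five independent samples of 20 and record each sample mean.
sample_means = [
    statistics.mean(rng.sample(population, 20))
    for _ in range(5)
]

# The means differ from sample to sample even though the population is fixed.
for m in sample_means:
    print(round(m, 3))
```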

### Conduct a hypothesis test

Now that I have organized the data and simulated random sampling, I am ready to conduct the
hypothesis test. Recall that the two-sample t-test is the standard approach for comparing the
means of two independent samples. Let’s review the steps for conducting a hypothesis test:
1. State the null hypothesis and the alternative hypothesis
2. Choose a significance level
3. Find the p-value
4. Reject or fail to reject the null hypothesis

#### Step 1: State the null hypothesis and the alternative hypothesis

The null hypothesis is a statement that is assumed to be true unless there is convincing evidence
to the contrary. The alternative hypothesis is a statement that contradicts the null hypothesis,
and is accepted as true only if there is convincing evidence for it.

In a two-sample t-test, the null hypothesis states that there is no difference between the means of
your two groups. The alternative hypothesis states the contrary claim: there is a difference between
the means of your two groups.

We use H0 to denote the null hypothesis, and HA to denote the alternative hypothesis.

* **H0**: There is no difference in the mean petal length between SETOSA and VIRGINICA
* **HA**: There is a difference in the mean petal length between SETOSA and VIRGINICA

#### Step 2: Choose a significance level
    
The significance level is the threshold at which I will consider a result statistically significant.
It is the probability of rejecting the null hypothesis when it is actually true. For this analysis I use the conventional level of 5%, or 0.05.
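
The meaning of the significance level can be illustrated with a simulation. In this sketch both samples are always drawn from the same synthetic population, so the null hypothesis is true by construction; the share of tests that still reject it should land near 0.05. The population parameters and the number of trials are arbitrary choices of mine.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
alpha = 0.05
n_trials = 2000
false_rejections = 0

for _ in range(n_trials):
    # Both samples come from the same population, so H0 is true by construction.
    x = rng.normal(loc=5.0, scale=1.0, size=20)
    y = rng.normal(loc=5.0, scale=1.0, size=20)
    if stats.ttest_ind(a=x, b=y, equal_var=True).pvalue < alpha:
        false_rejections += 1

# Fraction of tests that rejected a true null hypothesis.
rate = false_rejections / n_trials
print(rate)
```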

#### Step 3: Find the p-value
    
The p-value is the probability of observing results as extreme as, or more extreme than, those observed,
assuming the null hypothesis is true.

Based on my sample data, the difference between the mean petal lengths of SETOSA
and VIRGINICA is about 4.10. The null hypothesis claims that this difference is due to
chance. The p-value is the probability of observing an absolute difference in sample means of
4.10 or greater if the null hypothesis is true. If this outcome is very unlikely
(in particular, if my p-value is less than the significance level of 5%), then I will reject the null
hypothesis.
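
One way to build intuition for this definition is a permutation test, which is a different technique from the t-test used in this notebook. The sketch below shuffles made-up measurements between two hypothetical groups and counts how often the shuffled difference in means is at least as large as the observed one.

```python
import random
import statistics

# Made-up measurements for two hypothetical groups (not the iris data).
group_a = [1.4, 1.5, 1.3, 1.6, 1.4, 1.5]
group_b = [1.5, 1.7, 1.4, 1.6, 1.8, 1.5]
observed = abs(statistics.mean(group_a) - statistics.mean(group_b))

rng = random.Random(0)
pooled = group_a + group_b
n_a = len(group_a)
n_shuffles = 5000
as_extreme = 0

for _ in range(n_shuffles):
    # Under the null hypothesis the group labels are interchangeable,
    # so shuffle the pooled values and split them into two new groups.
    rng.shuffle(pooled)
    diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
    # Small tolerance guards against floating-point rounding in the comparison.
    if diff >= observed - 1e-12:
        as_extreme += 1

# Share of shuffles with a difference at least as large as the observed one.
p_value = as_extreme / n_shuffles
print(p_value)
```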




#### Two-Sample t test

* Observations come from two separate populations (separate species), so we perform a **two-sample t test.**
* We don’t care about the direction of the difference, only whether there is a difference, so we choose to use a **two-tailed t test.**

#### Performing a t test

**scipy.stats.ttest_ind()**  For a **two-sample t-test**, I can use scipy.stats.ttest_ind() to compute the p-value. This function includes the following arguments:

* **a**: Observations from the first sample.
* **b**: Observations from the second sample.
* **equal_var**: A boolean (True/False) that indicates whether the population
variances of the two groups are assumed to be equal. True, the default, runs the standard
(Student's) t-test, which pools the two sample variances. False runs Welch's t-test, which
does not assume equal variances and is the safer choice when that assumption is in doubt.

Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.
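
To see what equal_var changes, here is a sketch that runs both variants on synthetic groups with deliberately different spreads. The data are made up and are not the iris samples.

```python
import numpy as np
from scipy import stats

# Synthetic groups with deliberately different spreads (not the iris data).
rng = np.random.default_rng(1)
narrow = rng.normal(loc=1.5, scale=0.2, size=20)
wide = rng.normal(loc=1.8, scale=0.8, size=20)

# equal_var=True: Student's t-test, which pools the two sample variances.
student = stats.ttest_ind(a=narrow, b=wide, equal_var=True)

# equal_var=False: Welch's t-test, which keeps the variances separate.
welch = stats.ttest_ind(a=narrow, b=wide, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

Because the two groups have the same size, the t statistics coincide; Welch's test uses fewer degrees of freedom, so its p-value is the larger of the two.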

Now I’m ready to write my code and enter the relevant arguments:

* a: The first sample refers to the iris petal length data for SETOSA, which is stored in
the Petal.Length column of the variable sampled_setosa.
* b: The second sample refers to the iris petal length data for VIRGINICA, which is stored
in the Petal.Length column of the variable sampled_virginica.
* equal_var: Set to True, because the t test as described at the start assumes that both groups have a
similar variance (homogeneity of variance).

In [9]:
stats.ttest_ind(a = sampled_setosa['Petal.Length'] , b = sampled_virginica['Petal.Length'] , equal_var=True)

Ttest_indResult(statistic=-31.100840266683495, pvalue=1.2294574474813836e-28)

* The t value: -31.10. Note that it’s negative; this is fine! The sign only reflects the order of the samples (setosa minus virginica). In most cases, we only care about the absolute value, or the distance from 0.
* The p value: about 1.23e-28 (i.e. 27 zeros after the decimal point before the first nonzero digit). This is the probability of seeing a t value at least this far from 0 if the null hypothesis were true.
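
To show where these two numbers come from, the sketch below computes the pooled-variance t statistic and the two-tailed p-value by hand and compares them with ttest_ind. The measurements are made up for illustration, so the results differ from the output above.

```python
import math
import numpy as np
from scipy import stats

# Made-up measurements for two hypothetical groups (not the iris data).
a = np.array([1.4, 1.5, 1.3, 1.6, 1.4, 1.5])
b = np.array([5.1, 5.9, 5.6, 6.0, 5.4, 5.8])
n_a, n_b = len(a), len(b)

# Pooled variance: weighted average of the two sample variances (ddof=1).
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

# t = difference in means divided by its standard error.
t_manual = (a.mean() - b.mean()) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-tailed p-value: probability of |t| at least this large under H0.
dof = n_a + n_b - 2
p_manual = 2 * stats.t.sf(abs(t_manual), df=dof)

result = stats.ttest_ind(a=a, b=b, equal_var=True)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```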

#### Step 4: Reject or fail to reject the null hypothesis

To draw a conclusion, compare the p-value with the significance level.

* If the p-value is less than the significance level, I conclude there is a statistically significant
difference in the mean iris petal length between SETOSA and VIRGINICA. In other
words, I reject the null hypothesis H0.

* If the p-value is greater than the significance level, I conclude there is not a statistically
significant difference in the mean iris petal length between SETOSA and VIRGINICA. In
other words, I fail to reject the null hypothesis H0.

My p-value of roughly 1.2e-28 is far below the significance level of 0.05, or 5% (**p-value < 0.05**). So, I **reject the
null hypothesis**, and conclude that there is a **statistically significant difference between the mean
iris petal length of the two species SETOSA and VIRGINICA**.
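
The decision rule can be written as a tiny helper. The function name `decide` is hypothetical and only serves this illustration; the first call uses the p-value reported by ttest_ind above.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the decision for a hypothesis test at significance level alpha."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(1.2294574474813836e-28))  # reject H0
print(decide(0.20))                    # fail to reject H0
```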


#### Conclusion:

The difference in mean petal length between the sampled iris species setosa (M = 1.47) and virginica (M = 5.57) was statistically significant (t(38) = −31.10; p < 0.05).
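
A results sentence of this kind usually draws on per-group descriptive statistics. Here is a minimal sketch of computing the mean, standard deviation, and count per group with groupby, using a toy data frame with made-up values (only the column names mirror the notebook).

```python
import pandas as pd

# Toy data with made-up values, using the same column names as the notebook.
toy = pd.DataFrame({
    "Petal.Length": [1.4, 1.6, 1.5, 5.0, 6.0, 5.5],
    "Species": ["setosa"] * 3 + ["virginica"] * 3,
})

# Mean, sample standard deviation (ddof=1), and sample size for each species.
summary = toy.groupby("Species")["Petal.Length"].agg(["mean", "std", "count"])
print(summary.round(3))
```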