<a href="https://colab.research.google.com/github/gitmystuff/DTSC4050/blob/main/Week_06-Hypothesis_Testing/Week_06_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 06 - Hypothesis Testing with t-tests

## Getting Started

* Colab: get the notebook from the gitmystuff DTSC4050 repository
* Save a copy in Drive
* Remove "Copy of" from the notebook name
* Edit the name
* Take attendance
* Clean up the Colab Notebooks folder
* Submit the shared link

## Instructions

**Statistical Methods for Data Science and Analysis: Hypothesis Testing with t-tests**

**Objective:** To understand and apply the principles of hypothesis testing using t-tests.

**Instructions:**

1. **Review the Completed Example:** Carefully examine the provided completed example. Pay close attention to:
    * How the data is generated.
    * The stated null hypothesis (H0) and alternative hypothesis (H1).
    * The calculation of the t-statistic and p-value using `scipy.stats`.
    * The calculation of the degrees of freedom.
    * The calculation of the critical value.
    * The interpretation of the results and the conclusion drawn.

2. **Analyze the Remaining Examples:** For each of the 10 numbered examples:

    a. **Inspect the Data:** Look at the generated data for each group (or single group in the case of a one-sample test).  Think about what the data represents and formulate your own research question.

    b. **State Your Hypotheses:** Based on your research question, write down the null hypothesis (H0) and the alternative hypothesis (H1) for each example.  Remember:
    * The null hypothesis is a statement of no effect or no difference.
    * The alternative hypothesis is what you are trying to find evidence for.  It can be one-tailed (directional) or two-tailed (non-directional).  In these examples, we will use two-tailed tests.

    c. **Calculate the Test Statistic and P-value:** Use the appropriate `scipy.stats` function (`ttest_ind` for two independent samples, `ttest_1samp` for one sample) to calculate the t-statistic and p-value.

    d. **Calculate the Degrees of Freedom:** Determine the correct degrees of freedom for each test.

    e. **Calculate the Critical Value:** Calculate the critical value using `scipy.stats.t.ppf`. Remember to use a two-tailed test (alpha/2) and the correct degrees of freedom.

    f. **Make a Decision:** Compare the absolute value of the calculated t-statistic to the critical value.  Also, compare the p-value to the significance level (alpha = 0.05, unless otherwise stated).  For a two-tailed test with the same degrees of freedom, the two comparisons always lead to the same decision, so each one is a check on the other.

    * If the absolute value of the t-statistic is *greater* than the critical value (equivalently, the p-value is *less than* alpha), you *reject* the null hypothesis.  This means there is statistically significant evidence to support the alternative hypothesis.
    * Otherwise, you *fail to reject* the null hypothesis. This means there is not enough evidence to support the alternative hypothesis. It does not prove that the null hypothesis is true.

    g. **State Your Conclusion:** Write a conclusion in the context of the problem.  For example: "We reject the null hypothesis and conclude that there is a statistically significant difference between the means of the two groups."  Or, "We fail to reject the null hypothesis and conclude that there is no statistically significant difference between the means of the two groups."

3. **Submit Your Work:**  Submit the shared link to this notebook including your hypotheses, calculations (t-statistic, p-value, degrees of freedom, critical value), decision (reject or fail to reject H0), and conclusion for each of the 10 examples.
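
The decision rule in step 2f can be sketched as a small helper. This is a minimal illustration, not part of the assignment: the function name `two_tailed_decision` and the t-statistics passed to it are hypothetical, and the two-tailed p-value is computed directly from the t-distribution with `stats.t.sf`. It shows that the critical-value rule and the p-value rule agree.

```python
import scipy.stats as stats

def two_tailed_decision(t_statistic, degrees_of_freedom, alpha=0.05):
    """Return (reject by critical value, reject by p-value) for a two-tailed t-test."""
    # Critical value: upper alpha/2 quantile of the t-distribution
    critical_value = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)
    # Two-tailed p-value: area in both tails beyond |t|
    p_value = 2 * stats.t.sf(abs(t_statistic), degrees_of_freedom)
    reject_by_critical = abs(t_statistic) > critical_value
    reject_by_p = p_value < alpha
    return bool(reject_by_critical), bool(reject_by_p)

# Hypothetical t-statistics with 38 degrees of freedom (two groups of 20)
for t in (-3.1, -1.2, 0.4, 2.5):
    by_critical, by_p = two_tailed_decision(t, 38)
    print(f"t = {t:5.1f}  reject by critical value: {by_critical}  reject by p-value: {by_p}")
```

Both columns should match on every row, which is why checking the two conditions together is a consistency check rather than two separate requirements.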

**Key Concepts to Remember:**

* **Null Hypothesis (H0):** A statement of no effect or no difference.
* **Alternative Hypothesis (H1):** A statement of an effect or a difference.
* **T-statistic:** How far the observed sample mean (or the difference between two sample means) lies from the value stated in H0, measured in units of standard error.
* **P-value:** The probability of observing a test statistic at least as extreme as the one calculated, assuming the null hypothesis is true.
* **Degrees of Freedom:**  Determined by the sample size(s) and used to select the appropriate t-distribution.
* **Critical Value:** The t-value that the absolute value of the t-statistic must exceed for the result to be statistically significant at the chosen alpha.
* **Alpha (Significance Level):** The probability of rejecting the null hypothesis when it is actually true (Type I error).  Commonly set to 0.05.
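
To connect these definitions to the numbers `scipy.stats` returns, the sketch below computes a two-sample t-statistic and p-value by hand and compares them with `stats.ttest_ind`. It assumes the default pooled-variance (equal variances) form of `ttest_ind`, which is the form whose degrees of freedom are `len(group1) + len(group2) - 2`. The seed, the names `group_a` and `group_b`, and the generated values are hypothetical and only for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # hypothetical seed, for illustration only
group_a = rng.normal(20, 3, 20)
group_b = rng.normal(22, 3, 20)

n_a, n_b = len(group_a), len(group_b)
degrees_of_freedom = n_a + n_b - 2

# Pooled variance: the two sample variances (ddof=1) weighted by their degrees of freedom
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / degrees_of_freedom

# Standard error of the difference between the two sample means
standard_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# t-statistic: difference in means expressed in standard errors
t_manual = (group_a.mean() - group_b.mean()) / standard_error

# Two-tailed p-value: probability of a |t| at least this large under H0
p_manual = 2 * stats.t.sf(abs(t_manual), degrees_of_freedom)

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)
print("t (manual, scipy):", t_manual, t_scipy)
print("p (manual, scipy):", p_manual, p_scipy)
```

The manual and `scipy` values should agree up to floating-point rounding.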



## Review

Example 1:
* Define H0 and H1. Example:
    * H0: The means of the two groups are equal.
    * H1: The means of the two groups are not equal.
* Calculate the t-statistic, p-value, and critical value, then state a conclusion. Example calculation:
    * `t_statistic, p_value = stats.ttest_ind(group1, group2)`
    * `alpha = 0.05`
    * `degrees_of_freedom = len(group1) + len(group2) - 2`
    * `critical_value = stats.t.ppf(1 - alpha/2, degrees_of_freedom)`

Example 2:
* Define H0 and H1. Example:
    * H0: The population mean is 50.
    * H1: The population mean is not 50.
* Calculate the t-statistic, p-value, and critical value, then state a conclusion. Example calculation:
    * `t_statistic, p_value = stats.ttest_1samp(data, 50)`
    * `alpha = 0.05`
    * `degrees_of_freedom = len(data) - 1`
    * `critical_value = stats.t.ppf(1 - alpha/2, degrees_of_freedom)`
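
The Completed Example further down covers the two-sample case only, so here is a minimal one-sample sketch that follows the Example 2 steps end to end. The seed, the variable names `sample` and `hypothesized_mean`, and the generated values are assumptions for illustration; they are not the assignment data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)  # hypothetical seed, for illustration only
sample = rng.normal(50, 5, 25)   # 25 hypothetical observations
hypothesized_mean = 50           # value stated in H0

# H0: The population mean is 50.
# H1: The population mean is not 50.
t_statistic, p_value = stats.ttest_1samp(sample, hypothesized_mean)

alpha = 0.05
degrees_of_freedom = len(sample) - 1  # n - 1 for a one-sample t-test
critical_value = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)  # two-tailed

# Hand calculation of the same statistic: (mean - mu0) / (s / sqrt(n))
t_manual = (sample.mean() - hypothesized_mean) / (sample.std(ddof=1) / np.sqrt(len(sample)))

reject = abs(t_statistic) > critical_value
print("T-statistic:", t_statistic, "(manual:", t_manual, ")")
print("P-value:", p_value)
print("Degrees of freedom:", degrees_of_freedom)
print("Critical value:", critical_value)
print("Reject H0" if reject else "Fail to reject H0")
```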

## Set the Seed

In [None]:
# set seed
import time
import numpy as np
import random

def generate_user_seed():
    # Get current time in nanoseconds (more granular)
    nanoseconds = time.time_ns()

    # Add a small random component to further reduce collision chances
    random_component = random.randint(0, 1000)  # Adjust range as needed

    # Combine them (XOR is a good way to mix values)
    seed = nanoseconds ^ random_component

    # Ensure the seed is within the valid range for numpy's seed
    seed = seed % (2**32)  # Modulo to keep it within 32-bit range

    return seed

user_seed = generate_user_seed()
print(user_seed)
np.random.seed(user_seed)

473075039
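
As a quick check of what seeding does, the sketch below draws from `np.random.normal` twice with the same seed and once with a different one. The seed values 123 and 124 are arbitrary placeholders. Note that the Create the Data cell later calls `np.random.seed(1)`, so the ten numbered examples are the same on every run, while the Completed Example depends on the seed printed above.

```python
import numpy as np

np.random.seed(123)  # arbitrary placeholder seed
first_draw = np.random.normal(10, 2, 5)

np.random.seed(123)  # same seed again: the generator restarts from the same state
second_draw = np.random.normal(10, 2, 5)

np.random.seed(124)  # different seed: a different sequence
other_draw = np.random.normal(10, 2, 5)

print("Same seed, same values:", np.array_equal(first_draw, second_draw))
print("Different seed, same values:", np.array_equal(first_draw, other_draw))
```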


## Completed Example

In [None]:
import numpy as np
import scipy.stats as stats

# Completed Example:
group1 = np.random.normal(10, 2, 30)  # Mean 10, std dev 2, 30 samples
group2 = np.random.normal(12, 2, 30)  # Mean 12, std dev 2, 30 samples

# Null Hypothesis (H0): The means of the two groups are equal.
# Alternative Hypothesis (H1): The means of the two groups are not equal.

t_statistic, p_value = stats.ttest_ind(group1, group2) #Independent samples t-test

alpha = 0.05
degrees_of_freedom = len(group1) + len(group2) - 2 #Degrees of freedom for a two sample t-test
critical_value = stats.t.ppf(1 - alpha/2, degrees_of_freedom) #Two tailed test

print("Completed Example:")
# print("Group 1:", group1)
# print("Group 2:", group2)
print("T-statistic:", t_statistic)
print("P-value:", p_value)
print("Critical Value:", critical_value)
print("Alpha:", alpha)

if abs(t_statistic) > critical_value and p_value < alpha:
    print("Reject the null hypothesis. The means are significantly different.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in means.")
print("-" * 50)

Completed Example:
T-statistic: -2.969201681099344
P-value: 0.004335895638967814
Critical Value: 2.0017174841452356
Alpha: 0.05
Reject the null hypothesis. The means are significantly different.
--------------------------------------------------


## Create the Data

In [None]:
import numpy as np
import scipy.stats as stats
import pandas as pd  # Import pandas for data storage

# Dictionary to store the data for each example
example_data = {}

np.random.seed(1)
for i in range(1, 11):
    if i % 2 == 0:
        data = np.random.normal(50, 5, 25)
        example_data[f'example_{i}'] = {'data': data}  # Store the data
        print(f"Example {i}")
        print("Data:", data)

    else:
        group1 = np.random.normal(20, 3, 20)
        group2 = np.random.normal(22, 3, 20)
        example_data[f'example_{i}'] = {'group1': group1, 'group2': group2}  # Store the data
        print(f"Example {i}")
        print("Group 1:", group1)
        print("Group 2:", group2)

    print("-" * 50)

# --- Saving the data to a CSV file ---
data_list = []
for example_name, arrays in example_data.items():  # 'arrays' avoids reusing the name 'data'
    if 'data' in arrays:  # One-sample test
        df = pd.DataFrame(arrays['data'], columns=['data'])
        df['example'] = example_name
        data_list.append(df)
    else:  # Two-sample test
        df1 = pd.DataFrame(arrays['group1'], columns=['group1'])
        df2 = pd.DataFrame(arrays['group2'], columns=['group2'])
        df = pd.concat([df1, df2], axis=1)
        df['example'] = example_name
        data_list.append(df)
final_df = pd.concat(data_list, ignore_index=True)
final_df.to_csv('hypothesis_testing_data.csv', index=False)

Example 1
Group 1: [24.87303609 18.16473076 18.41548474 16.78109413 22.59622289 13.09538391
 25.23443529 17.7163793  20.95711729 19.25188887 24.38632381 13.81957787
 19.03274839 18.84783694 23.40130833 16.7003262  19.48271538 17.36642475
 20.12664124 21.74844564]
Group 2: [18.69814247 25.43417113 24.70477216 23.50748302 24.70256785 19.94881642
 21.63132932 19.1926917  21.19633576 23.5910664  19.92501774 20.80973942
 19.9384819  19.46438308 19.98626161 21.9620062  18.64806895 22.70324709
 26.97940653 24.22613248]
--------------------------------------------------
Example 2
Data: [49.04082224 45.56185518 46.26420853 58.46227301 50.25403877 46.81502177
 50.95457742 60.50127568 50.60079476 53.08601555 51.5008516  48.23875077
 44.28740901 48.25328639 48.95552883 52.93311596 54.19491707 54.65551041
 51.42793663 54.42570582 46.2280103  56.26434078 52.5646491  48.50953582
 52.44259073]
--------------------------------------------------
Example 3
Group 1: [19.77328486 23.39488816 24.55945045 26

## Solve Problems

In [None]:
# Example of accessing the data
import pandas as pd
data = pd.read_csv('hypothesis_testing_data.csv')

# Accessing the data for Example 3:
example = 'example_3'
print(example)
example_3_data = data[data['example'] == example]
group1 = example_3_data['group1'].dropna().values  # dropna() is a safeguard: NaNs in this column come from the one-sample examples' rows
group2 = example_3_data['group2'].dropna().values

print(group1)
print(group2)

example_3
[19.77328486 23.39488816 24.55945045 26.55672622 15.81051099 15.66765858
 18.48660241 20.48011121 22.62850676 20.94690484 13.93339635 19.08138796
 22.48392393 20.69028421 22.28603354 19.33301557 19.39772579 20.55968417
 21.23015494 20.59489916]
[22.35702594 19.98801314 23.13269136 22.36546381 25.38845172 25.59675364
 22.55546925 20.87414515 20.08380878 23.27048306 22.23202021 20.96843897
 22.13079057 20.13999747 24.0940961  20.65861431 25.67352311 23.21047493
 23.78073557 18.71526446]


In [None]:
# Accessing the data for Example 2:
example = 'example_2'
print(example)
example_2_data = data[data['example'] == example]
data_2 = example_2_data['data'].values
print(data_2)

example_2
[49.04082224 45.56185518 46.26420853 58.46227301 50.25403877 46.81502177
 50.95457742 60.50127568 50.60079476 53.08601555 51.5008516  48.23875077
 44.28740901 48.25328639 48.95552883 52.93311596 54.19491707 54.65551041
 51.42793663 54.42570582 46.2280103  56.26434078 52.5646491  48.50953582
 52.44259073]
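
The two access patterns above can be wrapped in one helper that works for both kinds of example. This is a sketch under assumptions: `get_example_arrays` is a hypothetical name, and the `toy` frame is made-up data that only mimics the column layout written to `hypothesis_testing_data.csv` (`group1`, `group2`, `data`, `example`).

```python
import numpy as np
import pandas as pd

def get_example_arrays(df, example_name):
    """Return {column name: values} for the columns that hold data for one example."""
    rows = df[df['example'] == example_name]
    arrays = {}
    for column in ('data', 'group1', 'group2'):
        if column in rows.columns:
            values = rows[column].dropna().values  # drop NaN padding from the combined layout
            if len(values) > 0:
                arrays[column] = values
    return arrays

# Tiny made-up frame with the same column layout as the CSV
toy = pd.DataFrame({
    'group1': [19.5, 21.0, np.nan, np.nan],
    'group2': [22.1, 23.4, np.nan, np.nan],
    'example': ['example_1', 'example_1', 'example_2', 'example_2'],
    'data': [np.nan, np.nan, 49.0, 51.5],
})

print(get_example_arrays(toy, 'example_1'))  # two-sample example: group1 and group2
print(get_example_arrays(toy, 'example_2'))  # one-sample example: data only
```

With the real file, the same call would be `get_example_arrays(data, 'example_3')` after `data = pd.read_csv('hypothesis_testing_data.csv')`. The keys of the returned dictionary indicate whether a one-sample or two-sample test applies.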
