###  What is the difference between a t-test and a z-test? Provide an example scenario where you would use each type of test.

Both t-tests and z-tests are statistical tests used to make inferences about population parameters based on sample data. However, they have some differences in terms of when they are used and the assumptions they make.

**T-test:**
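A t-test is used when the population standard deviation is unknown and has to be estimated from the sample, which matters most when the sample size is relatively small (typically less than 30). In that case the t-distribution's heavier tails account for the extra uncertainty in the estimated standard deviation. It is particularly suited for situations where you're comparing two groups or means. There are different types of t-tests, including: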

1. **Independent Samples T-test:** This is used to compare the means of two independent groups. For example, you might use an independent samples t-test to determine if there's a significant difference in the average scores of a control group and an experimental group exposed to a new teaching method.

2. **Paired Samples T-test:** This is used when you have two related groups (e.g., before and after measurements) and you want to determine if there's a significant difference in their means. An example might be testing whether a new drug treatment leads to a significant change in patients' blood pressure from before to after treatment.

**Z-test:**
A z-test, on the other hand, is used when you know the population standard deviation and have a larger sample size (typically more than 30). It's based on the standard normal distribution (z-distribution) and is used to test hypotheses about population means.

A common example scenario for using a z-test is when you're testing whether the mean weight of a certain population is significantly different from a known value. Let's say you know that the population mean weight of a particular animal species is 150 grams, and you've collected a sample of 100 individuals from that species. If you know the population standard deviation of weights, you could use a z-test to determine if the sample mean weight significantly differs from the known population mean.
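
The scenario above can be sketched in Python using only the standard library. This is a minimal illustration, not a definitive implementation: the sample mean (153 g) and the population standard deviation (12 g) are hypothetical values chosen for the example, since the text does not specify them, and `one_sample_z_test` is a helper name introduced here.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_std_dev, n):
    """Return the z statistic and two-tailed p-value for a one-sample z-test."""
    standard_error = pop_std_dev / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Two-tailed p-value: probability of a |Z| at least this large under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical inputs: the sample mean and sigma are assumptions, not from the text
z_stat, p_val = one_sample_z_test(sample_mean=153, pop_mean=150, pop_std_dev=12, n=100)
print(f"z = {z_stat:.2f}, two-tailed p-value = {p_val:.4f}")
```

With these assumed inputs the z statistic is 2.5, which exceeds the two-tailed critical value of about 1.96 at the 5% significance level, so the null hypothesis would be rejected.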

**In Summary:**
- Use a t-test when the sample size is small (typically < 30) or when the population standard deviation is unknown. It's suitable for comparing means of two groups (independent or paired samples).
- Use a z-test when the sample size is large (typically > 30) and you know the population standard deviation. It's suitable for testing hypotheses about population means.

### Differentiate between one-tailed and two-tailed tests.

One-tailed and two-tailed tests are concepts related to hypothesis testing in statistics. They determine the directionality of the effect you're testing and impact how you interpret the results of your statistical analysis.

**One-Tailed Test:**
In a one-tailed test, you're testing for the presence of an effect in only one direction. This means you're interested in determining if a parameter (such as a population mean) is either significantly greater or significantly smaller than a certain value, but not both. One-tailed tests are often used when there is a specific hypothesis or expectation about the direction of the effect.

For example, let's say you're testing whether a new drug increases the average response time of participants. Your null hypothesis (H0) might be that the drug has no effect, and your alternative hypothesis (Ha) would state that the drug increases response time. In this case, you're only interested in the possibility of an increase, so you'd perform a one-tailed test.

**Two-Tailed Test:**
In a two-tailed test, you're testing for the presence of an effect in both directions. This means you're interested in determining if a parameter is significantly different from a certain value, regardless of whether it's greater or smaller. Two-tailed tests are used when you don't have a specific expectation about the direction of the effect, or when you want to be more conservative in your analysis.

Continuing with the drug example, let's say you're testing whether a new drug has any effect on participants' response time. Your null hypothesis (H0) would be that the drug has no effect, and your alternative hypothesis (Ha) would simply state that there is an effect, without specifying the direction. In this case, you'd perform a two-tailed test because you're interested in detecting any significant change in response time, whether it's an increase or a decrease.
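
The practical difference shows up in the p-value. The sketch below compares the two for the same test statistic; the value `z_stat = 1.8` is a hypothetical number chosen so that the two tests disagree, and a z statistic is assumed purely to keep the example within the standard library.

```python
from statistics import NormalDist

alpha = 0.05
z_stat = 1.8  # hypothetical test statistic, chosen for illustration

std_normal = NormalDist()
# One-tailed (Ha: the effect is an increase): area in the upper tail only
p_one_tailed = 1 - std_normal.cdf(z_stat)
# Two-tailed (Ha: any effect): area in both tails
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z_stat)))

print(f"One-tailed p-value: {p_one_tailed:.4f}, reject H0: {p_one_tailed < alpha}")
print(f"Two-tailed p-value: {p_two_tailed:.4f}, reject H0: {p_two_tailed < alpha}")
```

The same evidence rejects the null hypothesis in the one-tailed test but not in the two-tailed test, which is why the direction of the alternative hypothesis should be fixed before looking at the data.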

###  Explain the concept of Type 1 and Type 2 errors in hypothesis testing. Provide an example scenario for each type of error.

In hypothesis testing, Type 1 and Type 2 errors are two possible mistakes that can occur when making decisions about a null hypothesis (H0) and an alternative hypothesis (Ha). These errors are inversely related, meaning that reducing the probability of one type of error often increases the probability of the other.

**Type 1 Error (False Positive):**
A Type 1 error occurs when you reject a null hypothesis that is actually true. In other words, you conclude that there is an effect or a significant result when, in reality, there is no effect in the population. The probability of committing a Type 1 error is denoted by the symbol "α" (alpha), and it's also called the significance level.

**Example Scenario for Type 1 Error:**
Suppose a pharmaceutical company is testing a new drug to reduce blood pressure. The null hypothesis (H0) states that the drug has no effect on blood pressure. If the researchers perform a statistical test and find a significant decrease in blood pressure among participants, they might conclude that the drug is effective and reject the null hypothesis. However, if the drug in fact has no effect and the observed decrease arose only from chance variation in the sample, they have committed a Type 1 error by falsely claiming the drug is effective when it's not.

**Type 2 Error (False Negative):**
A Type 2 error occurs when you fail to reject a null hypothesis that is actually false. In other words, you conclude that there is no effect or no significant result when, in reality, there is an effect in the population. The probability of committing a Type 2 error is denoted by the symbol "β" (beta).

**Example Scenario for Type 2 Error:**
Continuing with the drug example, let's say the drug does indeed have a significant effect on blood pressure. If the researchers perform a statistical test but fail to find a significant decrease in blood pressure among participants, they might fail to reject the null hypothesis. In this case, they have committed a Type 2 error by failing to identify the true effectiveness of the drug.

**Relationship between Type 1 and Type 2 Errors:**
There's an inherent trade-off between Type 1 and Type 2 errors. For a fixed sample size, decreasing the probability of a Type 1 error (by setting a lower significance level) tends to increase the probability of a Type 2 error. A related quantity is the "power" of a statistical test, which is the probability of correctly rejecting a false null hypothesis (1 - β). Increasing the sample size raises power without raising α.
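
The trade-off can be made concrete for a one-sided z-test. The following is a sketch under stated assumptions: the true effect (2 units), population standard deviation (10), and sample size (50) are hypothetical, and `type2_error_rate` is a helper name introduced here.

```python
import math
from statistics import NormalDist


def type2_error_rate(alpha, true_effect, pop_std_dev, n):
    """Beta for a one-sided z-test of H0: mu = mu0 against Ha: mu > mu0,
    when the true mean is mu0 + true_effect."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha)
    # How many standard errors the true mean sits above mu0
    shift = true_effect / (pop_std_dev / math.sqrt(n))
    # H0 is not rejected when the z statistic falls below the critical value
    return std_normal.cdf(z_critical - shift)


# Hypothetical study: true effect of 2 units, sigma = 10, n = 50
betas = {}
for alpha in (0.10, 0.05, 0.01):
    betas[alpha] = type2_error_rate(alpha, true_effect=2, pop_std_dev=10, n=50)
    print(f"alpha = {alpha:.2f}  beta = {betas[alpha]:.3f}  power = {1 - betas[alpha]:.3f}")
```

As α is lowered from 0.10 to 0.01, β rises and power falls, with everything else held fixed.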

### Explain Bayes's theorem with an example.

Bayes's theorem is a fundamental concept in probability theory and statistics that allows you to update your beliefs or probabilities based on new evidence. It's particularly useful for making decisions in uncertain situations where you have some prior knowledge and you want to incorporate new information to make more informed decisions.

![Bayes's Theorem Formula.png](attachment:0a0032e1-601d-48ed-9cc1-853ca8bdfb1a.png)

**Example Scenario: Medical Test**

Let's use a medical test scenario to explain Bayes's theorem:

Suppose you're a doctor and you have a patient who tested positive for a certain disease using a diagnostic test. You know that the test has a false positive rate of 5% (meaning it incorrectly identifies healthy people as having the disease 5% of the time) and a false negative rate of 10% (meaning it incorrectly identifies sick people as healthy 10% of the time). The overall prevalence of the disease in the population is 2%.

Now, you want to determine the probability that your patient actually has the disease given that they tested positive (i.e., P(Disease|Positive)).

Let's define:
- A: The patient has the disease.
- B: The patient tested positive on the diagnostic test.

We want to find P(A|B), which is the probability that the patient has the disease given that they tested positive.

Using Bayes's theorem:

          P(A|B) = (P(B|A) * P(A))/P(B)

Where:
- P(B|A) is the probability of testing positive given that the patient has the disease. This is the true positive rate, which is (1 - false negative rate).
- P(A) is the prior probability that the patient has the disease, which is 2% or 0.02.
- P(B) is the overall probability of testing positive.

The overall probability of testing positive can be calculated using the probabilities of true positives and false positives:

    P(B) = P(B|A) * P(A) + P(B|~A) * P(~A)
    P(B) = (1 - false negative rate) * P(A) + false positive rate * P(~A)
    P(B) = (1 - 0.10) * 0.02 + 0.05 * 0.98
    P(B) = 0.018 + 0.049 = 0.067

Now you can calculate P(A|B):

    P(A|B) = (P(B|A) * P(A)) / P(B)
    P(A|B) = (1 - 0.10) * 0.02 / 0.067
    P(A|B) ≈ 0.269

So, even though the patient tested positive for the disease, the probability that they actually have the disease is only about 26.9%. This shows how Bayes's theorem allows you to update your beliefs based on new evidence, taking into account both the prior probability and the test's accuracy.
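
The calculation above can be wrapped in a small function to see how strongly the result depends on the prior. This is a minimal sketch: `posterior_probability` is a name introduced here, and the 10% and 30% prevalence values are hypothetical additions for comparison; only the 2% case comes from the example.

```python
def posterior_probability(prior, false_positive_rate, false_negative_rate):
    """P(disease | positive test) via Bayes's theorem."""
    sensitivity = 1 - false_negative_rate  # P(B|A)
    # P(B): true positives plus false positives
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive


# The 2% prevalence is from the example; 10% and 30% are hypothetical
posteriors = {}
for prevalence in (0.02, 0.10, 0.30):
    posteriors[prevalence] = posterior_probability(
        prevalence, false_positive_rate=0.05, false_negative_rate=0.10
    )
    print(f"prevalence = {prevalence:.0%}  P(disease | positive) = {posteriors[prevalence]:.3f}")
```

With the same test, a higher prevalence gives a much higher probability that a positive result is a true positive.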

### What is a confidence interval? How to calculate the confidence interval, explain with an example.

A confidence interval is a range of values that is constructed around a sample statistic (usually a mean or proportion) to estimate the true population parameter with a certain level of confidence. It provides a measure of the uncertainty associated with the estimate and indicates how much the sample estimate is likely to vary from the actual population parameter.

Confidence intervals are commonly used in statistics to convey the precision of an estimate. The confidence level is typically expressed as a percentage, such as 95% or 99%. It describes the long-run behavior of the procedure: if you repeated the sampling many times and built an interval each time, about that percentage of the intervals would contain the true population parameter.

**Calculation of Confidence Interval:**

The formula for calculating a confidence interval for a population mean (assuming the population standard deviation is known) is given by:

![Confidence interval.png](attachment:acb2bc3f-3002-486c-83d3-5a1af814a56f.png)
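
In case the image does not render, the formula can be written out as below. This assumes the image shows the standard z-interval for a mean, which is the form used in the example that follows. Here x̄ is the sample mean, z_{α/2} is the critical value of the standard normal distribution for the chosen confidence level, σ is the population standard deviation, and n is the sample size.

```latex
\text{CI} = \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}
```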

**Example Scenario:**

Suppose we're interested in estimating the average height of a certain population of individuals. We collect a random sample of 100 individuals and measure their heights. The sample mean height is 170 cm, and we know from previous studies that the population standard deviation is 8 cm. We want to calculate a 95% confidence interval for the true population mean height.

1. Identify the critical value for the chosen confidence level:
For a 95% confidence level and a large sample size (n > 30), you can use the standard normal distribution (z-distribution). The critical value for a 95% confidence interval is approximately 1.96.

2. Plug in the values into the confidence interval formula: 
   
   Confidence Interval = 170 ± 1.96 * 8 / sqrt(100)

   Confidence Interval = 170 ± 1.568

3. Calculate the interval:

   Lower limit: 170 - 1.568 = 168.432

   Upper limit: 170 + 1.568 = 171.568

Therefore, the 95% confidence interval for the average height of the population is approximately 168.43 cm to 171.57 cm. This means that you can be 95% confident that the true population mean height falls within this interval based on the information from your sample. Keep in mind that the width of the confidence interval depends on the chosen confidence level, the variability of the data, and the sample size. A higher confidence level will result in a wider interval, because a wider range is needed to capture the true mean more often.
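
The effect of the confidence level on the interval width can be checked with a short sketch that uses the numbers from the height example. It relies only on the standard library; the 90% and 99% levels are added here for comparison and are not part of the original example.

```python
import math
from statistics import NormalDist

sample_mean = 170   # cm
pop_std_dev = 8     # cm, treated as known
n = 100

intervals = {}
for confidence_level in (0.90, 0.95, 0.99):
    # Critical value that leaves (1 - confidence_level) / 2 in each tail
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin_of_error = z_critical * pop_std_dev / math.sqrt(n)
    intervals[confidence_level] = (sample_mean - margin_of_error, sample_mean + margin_of_error)
    lower, upper = intervals[confidence_level]
    print(f"{confidence_level:.0%} CI: ({lower:.2f}, {upper:.2f})")
```

The 95% line reproduces the interval computed by hand, and the intervals widen as the confidence level rises.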

### Use Bayes' Theorem to calculate the probability of an event occurring given prior knowledge of the event's probability and new evidence. Provide a sample problem and solution.

**Sample Problem: Probability of a Rare Disease**

Suppose there is a rare disease that affects 1 in 10,000 people in a population. We have developed a diagnostic test for this disease, which has an accuracy rate of 98% for both true positives (correctly identifying a sick person) and true negatives (correctly identifying a healthy person). However, there is still a 2% chance of false positives (healthy person being identified as sick) and a 2% chance of false negatives (sick person being identified as healthy).

Now, we want to calculate the probability that a person who tested positive for the disease actually has the disease.

In [2]:
# A: The person has the disease
# B: The person tested positive on the diagnostic test

P_A = 1 / 10000  
P_B_given_A = 0.98               # probability of testing positive given having the disease
P_B_given_not_A = 0.02           # probability of testing positive given not having the disease

# overall probability of testing positive (P(B))
P_B = P_B_given_A * P_A + P_B_given_not_A * (1 - P_A)

# probability of having the disease given testing positive (P(A|B))
P_A_given_B = (P_B_given_A * P_A) / P_B

print(f"The probability that a person who tested positive for the disease actually has the disease: {P_A_given_B:.6f}")

The probability that a person who tested positive for the disease actually has the disease: 0.004877

So even with a 98% accurate test, a positive result implies only about a 0.49% chance of actually having the disease. Because the disease is so rare, false positives among the many healthy people far outnumber true positives among the few sick people.


###  Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5. Interpret the results.

In [3]:
import scipy.stats as stats

sample_mean = 50
sample_std_dev = 5
confidence_level = 0.95
sample_size = 100  # assumed: the question does not give a sample size

critical_value = stats.norm.ppf(1 - (1 - confidence_level) / 2)

margin_of_error = critical_value * (sample_std_dev / (sample_size ** 0.5))

lower_limit = sample_mean - margin_of_error
upper_limit = sample_mean + margin_of_error

print(f"95% Confidence Interval: ({lower_limit:.2f}, {upper_limit:.2f})")

95% Confidence Interval: (49.02, 50.98)


We are 95% confident that the true population mean falls between 49.02 and 50.98, based on the given sample data. More precisely, if we were to repeat the process of collecting samples and calculating confidence intervals, approximately 95% of those intervals would contain the true population mean.
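
The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under stated assumptions: the population is taken to be normal with a true mean of 50 and a known standard deviation of 5, the sample size of 100 matches the assumption in the cell above, and the seed and number of trials are arbitrary choices.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(42)          # fixed seed so the run is repeatable
true_mean, pop_std_dev = 50, 5   # hypothetical normal population
n, n_trials = 100, 2000

z_critical = NormalDist().inv_cdf(0.975)
margin_of_error = z_critical * pop_std_dev / math.sqrt(n)

covered = 0
for _ in range(n_trials):
    # Draw a fresh sample and build a 95% interval around its mean
    sample = [rng.gauss(true_mean, pop_std_dev) for _ in range(n)]
    sample_mean = sum(sample) / n
    if sample_mean - margin_of_error <= true_mean <= sample_mean + margin_of_error:
        covered += 1

coverage = covered / n_trials
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95, though it will vary slightly with the seed and the number of trials.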

###  What is the margin of error in a confidence interval? How does sample size affect the margin of error? Provide an example of a scenario where a larger sample size would result in a smaller margin of error

The margin of error in a confidence interval is the half-width of the interval: the amount added to and subtracted from the sample estimate to form the upper and lower limits. At the chosen confidence level, it bounds how far the sample estimate is expected to differ from the true population parameter. A smaller margin of error indicates a more precise estimate, while a larger margin of error indicates a less precise estimate.

How Sample Size Affects the Margin of Error:

As the sample size increases:

1. The margin of error decreases: A larger sample provides more information about the population and reduces the variability of the estimate. For a mean, the margin of error shrinks in proportion to 1/sqrt(n), so quadrupling the sample size halves it.

2. The estimate becomes more precise: At the same confidence level, a larger sample gives a narrower confidence interval, so the estimate is pinned down more tightly around the true population parameter.

Example:

Suppose we are conducting a survey to estimate the average income of a specific group of professionals in a city. We decide to collect data from two different sample sizes: one with 100 individuals and another with 1000 individuals.

For the sake of simplicity, let's assume that the true average income of this group is 70,000 dollars and the standard deviation is 10,000 dollars.

In [6]:
import scipy.stats as stats

true_population_mean = 70000 
population_std_dev = 10000   

sample_size_1 = 100
sample_size_2 = 1000

confidence_level = 0.95

critical_value = stats.norm.ppf(1 - (1 - confidence_level) / 2)

margin_of_error_1 = critical_value * (population_std_dev / (sample_size_1 ** 0.5))
margin_of_error_2 = critical_value * (population_std_dev / (sample_size_2 ** 0.5))

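# For illustration, each interval is centered on the true mean;
# in practice it would be centered on the observed sample mean
confidence_interval_1 = (true_population_mean - margin_of_error_1, true_population_mean + margin_of_error_1)
confidence_interval_2 = (true_population_mean - margin_of_error_2, true_population_mean + margin_of_error_2)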

print(f"Sample Size = {sample_size_1}")
print(f"Margin of Error: {margin_of_error_1:.2f}")
print(f"Confidence Interval: {confidence_interval_1}")

print("\n")

print(f"Sample Size = {sample_size_2}")
print(f"Margin of Error: {margin_of_error_2:.2f}")
print(f"Confidence Interval: {confidence_interval_2}")

Sample Size = 100
Margin of Error: 1959.96
Confidence Interval: (68040.03601545995, 71959.96398454005)


Sample Size = 1000
Margin of Error: 619.80
Confidence Interval: (69380.20496769543, 70619.79503230457)
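
The same relationship can be turned around to ask how large a sample is needed for a given margin of error. The sketch below assumes a known population standard deviation, as in the example above; the target margin of 500 dollars is a hypothetical choice, and `required_sample_size` is a helper name introduced here.

```python
import math
from statistics import NormalDist


def required_sample_size(pop_std_dev, target_margin, confidence_level=0.95):
    """Smallest n whose margin of error is at most target_margin (known sigma)."""
    z_critical = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Solve z * sigma / sqrt(n) <= target_margin for n and round up
    return math.ceil((z_critical * pop_std_dev / target_margin) ** 2)


# The target margin of 500 dollars is a hypothetical choice
n_needed = required_sample_size(pop_std_dev=10000, target_margin=500)
print(f"Sample size needed for a margin of error of 500: {n_needed}")
```

Because the margin of error falls with the square root of the sample size, halving the margin requires roughly four times as many observations.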


###  Calculate the z-score for a data point with a value of 75, a population mean of 70, and a population standard deviation of 5. Interpret the results.

In [7]:
data_point = 75
population_mean = 70
population_std_dev = 5

z_score = (data_point - population_mean) / population_std_dev

print(f"The z-score for the data point is: {z_score:.2f}")

The z-score for the data point is: 1.00


The calculated z-score of 1.00 means that the data point with a value of 75 lies exactly 1 standard deviation above the population mean of 70. The z-score expresses how far a data point is from the mean in units of the standard deviation, which makes values measured on different scales comparable.
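
If the population is also assumed to be approximately normal (an assumption the question does not state), the z-score can be converted into a percentile. A minimal sketch using the standard library:

```python
from statistics import NormalDist

z_score = (75 - 70) / 5

# Assumption: the population is approximately normally distributed
share_below = NormalDist().cdf(z_score)

print(f"z = {z_score:.2f}, share of the population below this value: {share_below:.4f}")
```

Under that assumption, a z-score of 1 places the data point above roughly 84% of the population.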

###  In a study of the effectiveness of a new weight loss drug, a sample of 50 participants lost an average of 6 pounds with a standard deviation of 2.5 pounds. Conduct a hypothesis test to determine if the drug is significantly effective at a 95% confidence level using a t-test.

In [8]:
import scipy.stats as stats

sample_size = 50
sample_mean = 6
sample_std_dev = 2.5
confidence_level = 0.95

# H0: mean weight loss = 0; Ha: mean weight loss != 0 (two-tailed test)
t_statistic = (sample_mean - 0) / (sample_std_dev / (sample_size ** 0.5))

degrees_of_freedom = sample_size - 1

critical_t_value = stats.t.ppf(1 - (1 - confidence_level) / 2, df=degrees_of_freedom)

if abs(t_statistic) > critical_t_value:
    conclusion = "reject the null hypothesis"
else:
    conclusion = "fail to reject the null hypothesis"

print(f"T-Statistic: {t_statistic:.2f}")
print(f"Critical T-Value: {critical_t_value:.2f}")
print(f"Conclusion: Based on the test, we {conclusion} at the {confidence_level:.0%} confidence level.")

T-Statistic: 16.97
Critical T-Value: 2.01
Conclusion: Based on the test, we reject the null hypothesis at the 95% confidence level.


###  In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95% confidence interval for the true proportion of people who are satisfied with their job.

In [9]:
import scipy.stats as stats

sample_proportion = 0.65
sample_size = 500
confidence_level = 0.95

critical_value = stats.norm.ppf(1 - (1 - confidence_level) / 2)

margin_of_error = critical_value * (sample_proportion * (1 - sample_proportion) / sample_size) ** 0.5

lower_limit = sample_proportion - margin_of_error
upper_limit = sample_proportion + margin_of_error

print(f"95% Confidence Interval: ({lower_limit:.4f}, {upper_limit:.4f})")

95% Confidence Interval: (0.6082, 0.6918)


### A researcher is testing the effectiveness of two different teaching methods on student performance. Sample A has a mean score of 85 with a standard deviation of 6, while sample B has a mean score of 82 with a standard deviation of 5. Conduct a hypothesis test to determine if the two teaching methods have a significant difference in student performance using a t-test with a significance level of 0.01.

In [14]:
import scipy.stats as stats

mean_A = 85
std_dev_A = 6
sample_size_A = 100  # assumed: the question does not give sample sizes

mean_B = 82
std_dev_B = 5
sample_size_B = 100  # assumed, as above

alpha = 0.01

# pooled standard deviation
pooled_std_dev = ((std_dev_A**2 * (sample_size_A - 1)) + (std_dev_B**2 * (sample_size_B - 1))) / (sample_size_A + sample_size_B - 2)
pooled_std_dev = pooled_std_dev ** 0.5

t_statistic = (mean_A - mean_B) / (pooled_std_dev * (1/sample_size_A + 1/sample_size_B) ** 0.5)

degrees_of_freedom = sample_size_A + sample_size_B - 2

critical_t_value = stats.t.ppf(1 - alpha / 2, df=degrees_of_freedom)

if abs(t_statistic) > critical_t_value:
    conclusion = "reject the null hypothesis"
else:
    conclusion = "fail to reject the null hypothesis"

print(f"T-Statistic: {t_statistic:.2f}")
print(f"Critical T-Value: {critical_t_value:.2f}")
print(f"Conclusion: Based on the test, we {conclusion} at the {alpha:.0%} significance level.")

T-Statistic: 3.84
Critical T-Value: 2.60
Conclusion: Based on the test, we reject the null hypothesis at the 1% significance level.
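
The pooled test above assumes that both groups share the same variance. If that assumption is in doubt, Welch's t-test is a common alternative. The sketch below computes its statistic and approximate degrees of freedom with the standard library only; it is shown as a variation, not as the method the question asks for, and the sample sizes of 100 are the same assumption made in the cell above.

```python
import math

mean_a, std_a, n_a = 85, 6, 100   # n = 100 per group is assumed, not given
mean_b, std_b, n_b = 82, 5, 100

var_term_a = std_a ** 2 / n_a
var_term_b = std_b ** 2 / n_b

# Welch's t statistic uses each group's own variance instead of a pooled one
welch_t = (mean_a - mean_b) / math.sqrt(var_term_a + var_term_b)

# Welch-Satterthwaite approximation for the degrees of freedom
welch_df = (var_term_a + var_term_b) ** 2 / (
    var_term_a ** 2 / (n_a - 1) + var_term_b ** 2 / (n_b - 1)
)

print(f"Welch t-statistic: {welch_t:.2f}, approximate df: {welch_df:.1f}")
```

With equal sample sizes the Welch statistic matches the pooled one, and only the degrees of freedom drop slightly, so the conclusion is unchanged here.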


### A population has a mean of 60 and a standard deviation of 8. A sample of 50 observations has a mean of 65. Calculate the 90% confidence interval for the true population mean.

In [16]:
import scipy.stats as stats

sample_mean = 65
population_std_dev = 8
sample_size = 50
confidence_level = 0.90

# The population standard deviation is known, so the z-distribution applies
critical_value = stats.norm.ppf(1 - (1 - confidence_level) / 2)

margin_of_error = critical_value * (population_std_dev / (sample_size ** 0.5))

lower_limit = sample_mean - margin_of_error
upper_limit = sample_mean + margin_of_error

print(f"90% Confidence Interval: ({lower_limit:.2f}, {upper_limit:.2f})")

90% Confidence Interval: (63.14, 66.86)


### In a study of the effects of caffeine on reaction time, a sample of 30 participants had an average reaction time of 0.25 seconds with a standard deviation of 0.05 seconds. Conduct a hypothesis test to determine if the caffeine has a significant effect on reaction time at a 90% confidence level using a t-test.

In [17]:
import scipy.stats as stats

sample_mean = 0.25
sample_std_dev = 0.05
sample_size = 30
confidence_level = 0.90

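# H0: mean = 0. The question gives no baseline reaction time,
# so the null value of 0 is an assumption of this setup
t_statistic = (sample_mean - 0) / (sample_std_dev / (sample_size ** 0.5))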

degrees_of_freedom = sample_size - 1

critical_t_value = stats.t.ppf(1 - (1 - confidence_level) / 2, df=degrees_of_freedom)

if abs(t_statistic) > critical_t_value:
    conclusion = "reject the null hypothesis"
else:
    conclusion = "fail to reject the null hypothesis"

print(f"T-Statistic: {t_statistic:.2f}")
print(f"Critical T-Value: {critical_t_value:.2f}")
print(f"Conclusion: Based on the test, we {conclusion} at the {confidence_level:.0%} confidence level.")

T-Statistic: 27.39
Critical T-Value: 1.70
Conclusion: Based on the test, we reject the null hypothesis at the 90% confidence level.
