# Q1: What is the difference between a t-test and a z-test? Provide an example scenario where you would use each type of test.

T-tests and z-tests are both statistical hypothesis tests used to make inferences about population parameters, such as population means. However, they are suited for different situations and differ primarily in how they handle sample size and knowledge of population standard deviation.

Here are the key differences between t-tests and z-tests:

### Z-Test:
1. Z-tests are used when the population standard deviation is known. They are also commonly applied to large samples (typically greater than 30), where the sample standard deviation is a close approximation to it. The test statistic is compared against the standard normal distribution.
2. Strictly, a z-test requires the population standard deviation. When it is known, the standard error of the mean is exact rather than estimated, so no extra uncertainty enters the test statistic.
#### Example: 
We are conducting quality control for a manufacturing process that produces light bulbs. We have historical data indicating that the population standard deviation of bulb lifetimes is 5 hours. We take a sample of 100 bulbs from a production batch and want to test if the mean bulb lifetime is still 1000 hours as expected.

Test: One-sample z-test (since we know the population standard deviation and have a large enough sample size for the z-test to be appropriate).
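
The light bulb scenario can be sketched in code as follows. The hypothesised mean, population standard deviation, and sample size come from the example; the observed sample mean of 998.8 hours is a hypothetical value, since the example does not give one.

```python
import math
from scipy import stats

# Values taken from the scenario above
mu_0 = 1000.0   # hypothesised mean lifetime (hours)
sigma = 5.0     # known population standard deviation (hours)
n = 100         # sample size

# Hypothetical observed sample mean (not given in the scenario)
sample_mean = 998.8

# z statistic: distance of the sample mean from mu_0, in standard errors
standard_error = sigma / math.sqrt(n)
z_stat = (sample_mean - mu_0) / standard_error

# Two-tailed p-value from the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z_stat))

print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
```

With these assumed numbers the p-value falls below 0.05, so the test would reject the claim that the mean lifetime is still 1000 hours.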


### T-Test:
1. T-tests are generally used when the population standard deviation is unknown, which matters most when the sample size is small (typically less than 30). The t-distribution has heavier tails than the normal distribution, which accounts for the extra uncertainty from estimating the standard deviation from the sample; the test still assumes the data are approximately normal. As the sample size grows, the t-distribution approaches the normal distribution.
2. T-tests do not require knowledge of the population standard deviation. They use the sample standard deviation to estimate the standard error of the sample mean.

#### Example:
We want to test if a new teaching method improves student test scores compared to the old method. We take a random sample of 25 students and record their scores before and after the new method is implemented. We do not know the population standard deviation for student scores.

Test: Paired t-test (to compare the means of two related samples, before and after the new method). This is equivalent to a one-sample t-test on the 25 before/after score differences.
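
For the before/after scenario above, a minimal sketch with SciPy might look like this. The scores are hypothetical, and only 8 students are used instead of 25 to keep the example short. It also shows that a paired t-test gives the same result as a one-sample t-test on the score differences.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for 8 students (the scenario above uses 25)
before = np.array([62, 70, 58, 75, 66, 80, 71, 64])
after = np.array([68, 74, 57, 81, 70, 85, 70, 69])

# Paired t-test on the two related samples
paired = stats.ttest_rel(after, before)

# Equivalent one-sample t-test on the differences against a mean of 0
differences = after - before
one_sample = stats.ttest_1samp(differences, popmean=0)

print(f"Mean difference: {differences.mean():.2f}")
print(f"Paired t-test:      t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"One-sample t-test:  t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4f}")
```

Both calls report a two-tailed p-value by default.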


# Q2: Differentiate between one-tailed and two-tailed tests.

One-tailed and two-tailed tests are types of hypothesis tests used in statistics to make decisions about population parameters based on sample data. They differ in terms of the directionality of the test and the alternative hypotheses they consider.

#### One-Tailed Test:

1. Directionality: In a one-tailed test, you are interested in detecting an effect or difference in only one direction (either greater than or less than) but not both.

2. Alternative Hypotheses:

       Left-Tailed Test: In a left-tailed or lower-tailed test, the alternative hypothesis (H1) states that the population parameter is less than a specified value.
       
       Right-Tailed Test: In a right-tailed or upper-tailed test, the alternative hypothesis (H1) states that the population parameter is greater than a specified value.
       
       
3. Rejection Region: The rejection region (critical region) for a one-tailed test is located in only one tail of the distribution (either the left or right tail), depending on the direction of the test.

4. Example Scenarios: A pharmaceutical company wants to test if a new drug increases the average lifespan of patients. They use a one-tailed test with the alternative hypothesis that the drug increases lifespan (right-tailed).

#### Two-Tailed Test:

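1. Directionality: In a two-tailed test, you are interested in detecting an effect or difference in either direction (greater than or less than). Because the significance level is split between the two tails, it is more conservative than a one-tailed test at the same significance level.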

2. Alternative Hypothesis:

       The alternative hypothesis (H1) for a two-tailed test typically states that the population parameter is not equal to a specified value.
       
       Written out, μ ≠ μ0 means "μ > μ0 or μ < μ0", so a two-tailed test covers both directions at once. An alternative that states a single direction (μ > μ0 alone, or μ < μ0 alone) is a one-tailed test.
       

3. Rejection Region: The rejection region (critical region) for a two-tailed test is divided between both tails of the distribution, centered around the null hypothesis value.

4. Example Scenarios: A manufacturer wants to test if a new production process affects the mean weight of their products. They use a two-tailed test with the alternative hypothesis that the mean weight is not equal to a specified value.
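
The difference between the two kinds of test can be seen by computing p-values for the same test statistic. This is a small sketch assuming a z statistic of 1.8, which is a hypothetical value chosen for illustration, and a significance level of 0.05.

```python
from scipy import stats

z_stat = 1.8    # hypothetical test statistic
alpha = 0.05    # significance level

# Right-tailed: area in the upper tail only
p_right = stats.norm.sf(z_stat)

# Left-tailed: area in the lower tail only
p_left = stats.norm.cdf(z_stat)

# Two-tailed: area in both tails beyond |z|
p_two = 2 * stats.norm.sf(abs(z_stat))

for name, p in [("right-tailed", p_right), ("left-tailed", p_left), ("two-tailed", p_two)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name:>12}: p = {p:.4f} -> {decision}")
```

With this statistic the right-tailed test rejects the null hypothesis while the two-tailed test does not, because the two-tailed p-value is twice as large.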

# Q3: Explain the concept of Type 1 and Type 2 errors in hypothesis testing. Provide an example scenario for each type of error.

In hypothesis testing, Type I and Type II errors are the two ways a test can reach the wrong decision about the null hypothesis. The probability of a Type I error is the significance level α. The probability of a Type II error is usually written β, and 1 − β is the power of the test. For a fixed sample size, lowering one of these error probabilities tends to raise the other.

1. Type I Error (False Positive):

Definition: Type I error occurs when we reject a null hypothesis that is actually true. In other words, it is a false positive result.

Example Scenario: Suppose a medical researcher is testing a new drug to determine if it's effective in treating a particular condition. The null hypothesis (H0) is that the drug has no effect, and the alternative hypothesis (H1) is that the drug is effective. If, based on the sample data, the researcher mistakenly concludes that the drug is effective (rejects the null hypothesis) when, in fact, it has no effect, that would be a Type I error. This error might lead to the drug being approved for use when it should not be.

2. Type II Error (False Negative):

Definition: Type II error occurs when we fail to reject a null hypothesis that is actually false. In other words, it is a false negative result.

Example Scenario: Consider a quality control scenario in a manufacturing plant. The null hypothesis (H0) is that the production process is within acceptable limits, and the alternative hypothesis (H1) is that the process is out of control. If, based on the sample data, the quality control team fails to detect a problem (fails to reject the null hypothesis) when, in fact, the process is producing defective products, that would be a Type II error. This error might result in continued production of defective items, leading to quality issues.
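
The two error probabilities can be computed directly for a simple case. The sketch below assumes a right-tailed one-sample z-test; all numbers (the means, standard deviation, and sample size) are hypothetical and chosen only for illustration.

```python
import math
from scipy import stats

alpha = 0.05      # chosen Type I error rate (significance level)
mu_0 = 100.0      # mean under H0 (hypothetical)
mu_true = 100.5   # true mean when H1 holds (hypothetical)
sigma = 2.0       # known population standard deviation (hypothetical)
n = 25            # sample size (hypothetical)

# Right-tailed test: reject H0 when the z statistic exceeds this value
z_crit = stats.norm.ppf(1 - alpha)

# When H1 is true, the z statistic is centred on this shift instead of 0
shift = (mu_true - mu_0) / (sigma / math.sqrt(n))

# Type II error: z stays below z_crit even though H1 is true
beta = stats.norm.cdf(z_crit - shift)
power = 1 - beta

print(f"P(Type I error)  = {alpha:.3f}")
print(f"P(Type II error) = {beta:.3f}")
print(f"Power            = {power:.3f}")
```

Raising `n` or the gap between `mu_true` and `mu_0` increases the shift, which lowers the Type II error probability while α stays fixed.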

# Q4: Explain Bayes's theorem with an example.

Bayes' theorem describes how to update the probability of a hypothesis (an event or proposition) based on new evidence or information. It is especially useful in situations where we want to make inferences or predictions in the presence of uncertainty.

Bayes' theorem can be expressed mathematically as:

P(A | B) = $\frac{P(B | A) * P(A)}{P(B)}$

Where:
P(A | B) is the conditional probability of event A occurring given that event B has occurred.
P(B | A) is the conditional probability of event B occurring given that event A has occurred.
P(A) is the prior probability of event A.
P(B) is the marginal probability of event B, that is, the overall probability of observing B.

To illustrate Bayes' theorem, let's walk through an example:

Example: Medical Diagnosis. Imagine a medical scenario where a patient is being tested for a rare disease. The disease, "D," is relatively uncommon in the population, with a prevalence of 1%. The diagnostic test for the disease is not perfect; it can produce both false positives and false negatives.

P(D): Prior probability of having the disease = 1% or 0.01.
P(¬D): Prior probability of not having the disease = 99% or 0.99.
P(+ | D): Probability of testing positive given that the patient has the disease = 95% or 0.95 (sensitivity).
P(+ | ¬D): Probability of testing positive given that the patient does not have the disease = 5% or 0.05 (false positive rate).

Now, we want to calculate the probability that a patient actually has the disease (P(D | +)) given that they tested positive.

Using Bayes' theorem:

P(D | +) = $\frac{P(+ | D) * P(D)}{P(+)}$

We can calculate P(+) using the law of total probability:

P(+)=P(+∣D)⋅P(D)+P(+∣¬D)⋅P(¬D)

Substituting the values:

P(D | +) = $\frac{0.95 * 0.01}{0.95 * 0.01 + 0.05 * 0.99}$

Calculating this:

P(D∣+)≈0.16

So, even if a patient tests positive for the disease, there is only a 16% chance that they actually have the disease. This illustrates how Bayes' theorem allows us to update our beliefs or probabilities based on new information (in this case, the positive test result) and the prior probabilities. It also demonstrates that in situations with imperfect tests, a positive result does not necessarily imply a high probability of having the condition.
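
The calculation above can be checked with a few lines of Python. The function name `posterior_given_positive` is just an illustrative helper, not part of any library.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """Return P(D | +) using Bayes' theorem and the law of total probability."""
    # P(+) = P(+ | D) * P(D) + P(+ | not D) * P(not D)
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    # P(D | +) = P(+ | D) * P(D) / P(+)
    return sensitivity * prior / p_positive


# Values from the medical diagnosis example above
p_disease_given_positive = posterior_given_positive(
    prior=0.01, sensitivity=0.95, false_positive_rate=0.05
)
print(f"P(D | +) = {p_disease_given_positive:.4f}")
```

Changing `prior` shows how strongly the result depends on the prevalence of the disease: the rarer the disease, the lower the probability that a positive result is a true positive.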

# Q5: What is a confidence interval? How to calculate the confidence interval, explain with an example.

A confidence interval is a statistical concept used to estimate a range of values within which a population parameter, such as a mean or proportion, is likely to fall. It provides a measure of the uncertainty or precision associated with a sample estimate. In simpler terms, it gives us a range of values that we are reasonably confident the true population parameter lies within.

Here's how to calculate a confidence interval with an example:

#### Step 1: Collect Data

Let's say we want to calculate a 95% confidence interval for the average height of a certain population. We collect a random sample of 100 individuals from that population and measure their heights.

#### Step 2: Calculate Sample Statistics

Calculate the sample mean (x̄) and the sample standard deviation (s) from our data. These are our point estimates for the population parameters.

#### Step 3: Choose a Confidence Level

Decide on the confidence level that we want to use. A common choice is a 95% confidence level, which means that the procedure used to build the interval captures the true population parameter in 95% of repeated samples.

#### Step 4: Determine the Critical Value

The critical value corresponds to the chosen confidence level and the distribution we're working with. For normally distributed data and a 95% confidence level, we will use the z-table or a calculator to find the critical z-value. For instance, if we're using a standard normal distribution (mean = 0, standard deviation = 1), the critical z-value for a 95% confidence level is approximately 1.96.

#### Step 5: Calculate the Margin of Error

The margin of error (MOE) is the maximum amount by which our sample estimate is likely to differ from the true population parameter. It is calculated by multiplying the critical value by the standard error of the sample mean. The formula for the margin of error is:

MOE = Critical Value (Z) × (Standard Deviation of Sample / √Sample Size)

In our example, if the critical value (Z) is 1.96, and the standard deviation of the sample is, let's say, 2 inches, and the sample size is 100, then:

MOE = 1.96 × (2 / √100) = 1.96 × (2 / 10) = 0.392 inches

#### Step 6: Calculate the Confidence Interval

Now that we have the margin of error, we can calculate the confidence interval. The confidence interval is constructed by taking our sample mean (x̄) and adding and subtracting the margin of error:

Confidence Interval = x̄ ± MOE

In our example, if the sample mean (x̄) is 65 inches:

Confidence Interval = 65 ± 0.392

So, the 95% confidence interval for the average height of the population is approximately 64.608 to 65.392 inches.

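This means that we can be 95% confident that the true average height of the population falls within this range based on our sample data. More precisely, if we repeated the sampling many times, about 95% of the intervals built this way would contain the true average height.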
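
The six steps can be sketched in Python using the numbers from the example (sample mean of 65 inches, sample standard deviation of 2 inches, sample size of 100). The critical value is looked up with `scipy.stats.norm.ppf` instead of a z-table, so it is slightly more precise than 1.96.

```python
import math
from scipy import stats

# Steps 1-2: sample statistics from the example above
sample_mean = 65.0   # inches
sample_std = 2.0     # inches
n = 100

# Step 3: confidence level
confidence = 0.95

# Step 4: two-sided critical value from the standard normal distribution
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)

# Step 5: margin of error = critical value * standard error
moe = z_crit * sample_std / math.sqrt(n)

# Step 6: confidence interval
lower, upper = sample_mean - moe, sample_mean + moe

print(f"Critical value: {z_crit:.4f}")
print(f"Margin of error: {moe:.4f}")
print(f"Confidence interval: ({lower:.3f}, {upper:.3f})")
```

The result should agree with the hand calculation to three decimal places.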

# Q6. Use Bayes' Theorem to calculate the probability of an event occurring given prior knowledge of the event's probability and new evidence. Provide a sample problem and solution.

The formula for Bayes' Theorem is as follows:

#### P(A | B) =  $\frac{ P(B | A)⋅P(A) }{P(B)}$

Where:

1. P(A∣B) is the probability of event A occurring given the evidence B.

2. P(B∣A) is the probability of evidence B occurring given that A has occurred.

3. P(A) is the prior probability of event A.

4. P(B) is the marginal probability of evidence B, that is, the overall probability of observing B.



Let's work through a sample problem to illustrate how to use Bayes' Theorem:

Sample Problem:

Suppose you are a doctor, and you want to determine the probability that a patient has a certain disease (D) given the results of a diagnostic test (T). You have the following information:

1. The probability that a person has the disease (prior probability): P(D)=0.02 (2% of the population has the disease).
2. The probability that the test is positive given that a person has the disease: P(T∣D)=0.95 (the test is accurate and gives a positive result 95% of the time for those with the disease).
3. The probability that the test is positive given that a person does not have the disease: P(T∣¬D)=0.10 (the test can produce false positives, so it's positive 10% of the time for healthy individuals).

You want to find P(D∣T), the probability that a patient has the disease given that the test is positive.

Solution:

We can use Bayes' Theorem to calculate P(D∣T): 

P(D∣T)=  $\frac{P(T | D)⋅P(D)}{P(T)}$
 

First, we need to calculate P(T), the overall probability of testing positive, which can be calculated using the law of total probability:

P(T) = P(T | D) * P(D) + P(T | ¬D) * P(¬D)

We know P(D) = 0.02, and P(¬D) (the probability of not having the disease) is 1 − P(D) = 1 − 0.02 = 0.98.

Now, we can calculate P(T):

P(T) = (0.95 * 0.02) + (0.10 * 0.98)
     =0.019 + 0.098
     =0.117

Now that we have P(T), we can calculate P(D | T) using Bayes' Theorem:

P(D | T) =  $\frac{P(T∣D)⋅P(D)}{P(T)}$
         =  $\frac{0.95⋅0.02}{0.117}$
         ≈0.162
         
So, the probability that a patient has the disease given that the test is positive is approximately 0.162, or 16.2%. This demonstrates how Bayes' Theorem can help update our beliefs about the probability of an event based on new evidence.
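
One way to sanity-check this result is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical population of 100,000 and applies the same three rates from the problem; the population size is an arbitrary choice that keeps all counts whole numbers.

```python
# Hypothetical population size, chosen so that all counts are whole numbers
population = 100_000

# 2% of the population has the disease
diseased = population * 2 // 100
healthy = population - diseased

# 95% of diseased people test positive (true positives)
true_positives = diseased * 95 // 100

# 10% of healthy people test positive (false positives)
false_positives = healthy * 10 // 100

# Among everyone who tests positive, the share that actually has the disease
all_positives = true_positives + false_positives
p_disease_given_positive = true_positives / all_positives

print(f"True positives:  {true_positives}")
print(f"False positives: {false_positives}")
print(f"P(D | T) = {p_disease_given_positive:.3f}")
```

The false positives outnumber the true positives by roughly five to one, which is why the probability stays low despite the positive test.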

# Q7. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5. Interpret the results.

In [13]:
import numpy as np
from scipy import stats

confidence = 0.95
significance_level = 1 - confidence
sample_mean = 50
std = 5
sample_size = 50 # assume sample_size = 50

z_score = stats.norm.ppf(1 - significance_level/2)

margin_of_error = z_score * std/np.sqrt(sample_size)
upper_limit = sample_mean + margin_of_error
lower_limit = sample_mean - margin_of_error

print("With 95% confidence interval and mean of 50 and standard deviation of 5, range is:({:.4f}, {:.4f})".format(lower_limit, upper_limit))

With 95% confidence interval and mean of 50 and standard deviation of 5, range is:(48.6141, 51.3859)


### Interpretation:
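The interval (48.6141, 51.3859) tells us that we can be 95% confident that the population mean lies between 48.6141 and 51.3859. The question does not give a sample size, so this result depends on the assumed sample size of 50; a larger sample would give a narrower interval.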

# Q8. What is the margin of error in a confidence interval? How does sample size affect the margin of error? Provide an example of a scenario where a larger sample size would result in a smaller margin of error.

The margin of error (MOE) in a confidence interval is a measure of the precision or uncertainty associated with the estimate of a population parameter (e.g., mean, proportion) based on a sample of data. It quantifies the range within which the true population parameter is likely to fall with a certain level of confidence.

The margin of error is calculated using the following formula for different types of estimates:

### For population means:

Margin of Error = $Z*\frac{\sigma}{\sqrt{n}}$

where 

Z is the critical value from the standard normal distribution or t-distribution, 

σ is the population standard deviation (or sample standard deviation, if the population standard deviation is unknown), and 

n is the sample size.

### For population proportions:

Margin of Error = $Z*\sqrt{\frac{p(1-p)}{n}}$

where 

Z is the critical value,

p is the sample proportion, and

n is the sample size. 

The margin of error is inversely proportional to the square root of the sample size. In other words, as the sample size increases, the margin of error decreases. This means that larger sample sizes result in smaller margins of error, which leads to more precise estimates.

Let's consider a scenario where a larger sample size would result in a smaller margin of error:

Suppose we are conducting a political poll to estimate the proportion of voters in a city who support a particular candidate. We have two options for sample sizes:

Option 1: We survey 200 randomly selected voters.
Option 2: We survey 1,000 randomly selected voters.

In this case, the larger sample size (Option 2) will result in a smaller margin of error compared to the smaller sample size (Option 1). The reason is that with a larger sample size, we collect more data points, and the variability in our estimate decreases. Because the margin of error shrinks with the square root of the sample size, surveying five times as many voters cuts it by a factor of about 2.2 (√5). As a result, we can be more confident that our estimate of the proportion of supporters is closer to the true proportion in the entire city.
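
The effect can be made concrete with a short calculation. The sketch below assumes a 95% confidence level and a sample proportion of 0.5 in both polls; neither value is given in the scenario, and 0.5 is chosen because it gives the widest margin of error.

```python
import math
from scipy import stats

confidence = 0.95   # assumed confidence level
p = 0.5             # assumed sample proportion (worst case for the margin of error)

z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)


def margin_of_error(p, n, z):
    """Margin of error for a proportion: z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)


moe_200 = margin_of_error(p, 200, z_crit)     # Option 1
moe_1000 = margin_of_error(p, 1000, z_crit)   # Option 2

print(f"n = 200:  margin of error = {moe_200:.4f}")
print(f"n = 1000: margin of error = {moe_1000:.4f}")
print(f"Ratio: {moe_200 / moe_1000:.3f}")
```

Under these assumptions the margin of error drops from about 7 percentage points to about 3.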

# Q9. Calculate the z-score for a data point with a value of 75, a population mean of 70, and a population standard deviation of 5. Interpret the results.

In [4]:
from scipy import stats

data_point = 75
mean = 70
std = 5

z_score = (data_point - mean)/std
print(f"Z score: {z_score}")

Z score: 1.0


### Interpretation:
The z-score of 1 tells us that the data point with a value of 75 is exactly 1 standard deviation above the population mean (μ = 70). In other words, the data point is higher than the average value, but a z-score of 1 is not an extreme one.

# Q10. In a study of the effectiveness of a new weight loss drug, a sample of 50 participants lost an average of 6 pounds with a standard deviation of 2.5 pounds. Conduct a hypothesis test to determine if the drug is significantly effective at a 95% confidence level using a t-test.

In [14]:
import numpy as np
from scipy import stats

sample_size = 50
sample_avg_weight_reduction = 6
pop_avg_weight_reduction = 0
std = 2.5
confidence_level = 0.95
significance_level = 1 - confidence_level
degree_of_freedom = sample_size - 1

#H0: avg weight reduction is 0 pounds (the drug has no effect)
#H1: avg weight reduction is not 0 pounds (two-tailed test)
t_value = (sample_avg_weight_reduction - pop_avg_weight_reduction)/(std/np.sqrt(sample_size))
critical_value = stats.t.ppf(q = (1 - significance_level/2), df = degree_of_freedom)

print(f't value : {t_value}')
print(f'Critical Value : {critical_value}')
# Two-tailed decision: compare the magnitude of t with the critical value
if abs(t_value) > critical_value:
    print('Reject Null Hypothesis')
else:
    print('Fail to reject Null Hypothesis')

t value : 16.970562748477143
Critical Value : 2.009575234489209
Reject Null Hypothesis


# Q11. In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95% confidence interval for the true proportion of people who are satisfied with their job.

In [17]:
import numpy as np

# Sample proportion
sample_proportion = 0.65

# Sample size
sample_size = 500

# Z-score for a 95% confidence interval (standard normal distribution)
z_score = 1.96  # You can also use stats.norm.ppf(0.975) from scipy.stats

# Calculate the margin of error
margin_of_error = z_score * np.sqrt((sample_proportion * (1 - sample_proportion)) / sample_size)

# Calculate the lower and upper bounds of the confidence interval
lower_bound = sample_proportion - margin_of_error
upper_bound = sample_proportion + margin_of_error

# Print the results
print("Sample Proportion:", sample_proportion)
print("Sample Size:", sample_size)
print("With 95% Confidence Interval, proportion of people who are satisfied with their job will lie in between: ({:.4f}, {:.4f})".format(lower_bound, upper_bound))

Sample Proportion: 0.65
Sample Size: 500
With 95% Confidence Interval, proportion of people who are satisfied with their job will lie in between: (0.6082, 0.6918)


# Q12. A researcher is testing the effectiveness of two different teaching methods on student performance. Sample A has a mean score of 85 with a standard deviation of 6, while sample B has a mean score of 82 with a standard deviation of 5. Conduct a hypothesis test to determine if the two teaching methods have a significant difference in student performance using a t-test with a significance level of 0.01.

In [6]:
import numpy as np
from scipy import stats

mean_a = 85
std_a = 6
n_A = 50 # assuming sample size is 50

mean_b = 82
std_b = 5
n_B = 50 # assuming sample size is 50
degree_of_freedom = n_A + n_B - 2

significance_level = 0.01
confidence = 1- significance_level

t_num = mean_a - mean_b
t_den = np.sqrt((std_a**2/n_A) + (std_b**2/n_B))
t_value = t_num/t_den

# H0: There is no significant difference in the performance of the two groups (both means are equal)
# H1: There is a difference in the performance of the two groups (the means are not equal)
critical_value = stats.t.ppf(1- significance_level/2, degree_of_freedom)

print(f"t value: {t_value}")
print(f"Critical Value: {critical_value}")

# Two-tailed test: compare the magnitude of t with the critical value
if abs(t_value) > critical_value:
    print('Reject Null Hypothesis')
else:
    print('Fail to reject Null Hypothesis')

t value: 2.716072381275556
Critical Value: 2.626931094814024
Reject Null Hypothesis
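
As a cross-check, SciPy can run the same test directly from the summary statistics with `scipy.stats.ttest_ind_from_stats`. The sketch below keeps the assumption made in the cell above that each group has 50 students, since the question does not give the sample sizes. Setting `equal_var=True` matches the degrees of freedom n_A + n_B - 2 used above.

```python
from scipy import stats

n_assumed = 50             # assumed size of each group (not given in the question)
significance_level = 0.01

# Two-sample t-test computed from means, standard deviations and sample sizes
result = stats.ttest_ind_from_stats(
    mean1=85, std1=6, nobs1=n_assumed,
    mean2=82, std2=5, nobs2=n_assumed,
    equal_var=True,
)

print(f"t statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.4f}")
print("Reject H0" if result.pvalue < significance_level else "Fail to reject H0")
```

The t statistic should match the value computed by hand, and comparing the two-tailed p-value with the significance level gives the same decision as comparing |t| with the critical value.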


# Q13. A population has a mean of 60 and a standard deviation of 8. A sample of 50 observations has a mean of 65. Calculate the 90% confidence interval for the true population mean.

In [2]:
import numpy as np
from scipy import stats

pop_mean = 60
std = 8
sample_size = 50
sample_mean = 65
confidence = 0.9

standard_error = std/np.sqrt(sample_size)
# Two-sided 90% interval: 5% in each tail, so use the 95th percentile
z_score = stats.norm.ppf(1 - (1 - confidence)/2)
margin_of_error = z_score*standard_error
upper_limit = sample_mean + margin_of_error
lower_limit = sample_mean - margin_of_error

print(f"Standard Error {standard_error:.4f}")
print(f"Margin of Error {margin_of_error:.4f}")
print(f"True Population mean lies in between ({lower_limit:.4f}, {upper_limit:.4f})")

Standard Error 1.1314
Margin of Error 1.8609
True Population mean lies in between (63.1391, 66.8609)


# Q14. In a study of the effects of caffeine on reaction time, a sample of 30 participants had an average reaction time of 0.25 seconds with a standard deviation of 0.05 seconds. Conduct a hypothesis test to determine if the caffeine has a significant effect on reaction time at a 90% confidence level using a t-test.

In [1]:
import numpy as np
from scipy import stats

sample_size = 30
sample_mean = 0.25
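mean = 0.25 # assumed hypothesised mean: the question gives no baseline reaction time, so this choice makes t = 0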
std = 0.05
confidence = 0.90
alpha = 1-confidence
degree_of_freedom = sample_size - 1

#H0: average reaction time is 0.25
#H1: average reaction time is not 0.25 (two-tailed test)

t_value = (sample_mean - mean)/(std/np.sqrt(sample_size))
# Two-tailed critical value: alpha is split across both tails
critical_value = stats.t.ppf(1 - alpha/2, degree_of_freedom)

print(f"t-value: {t_value}")
print(f"Critical Value: {critical_value:.4f}")

if abs(t_value) > critical_value:
    print('Reject Null Hypothesis')
else:
    print('Fail to reject null hypothesis')

t-value: 0.0
Critical Value: 1.6991
Fail to reject null hypothesis
