Q1. Both t-tests and z-tests are statistical tests used to make inferences about population parameters based on sample data. However, they are used under different circumstances and rest on different assumptions.

T-test:
A t-test is used when the sample size is small (typically less than 30) and the population standard deviation is unknown. It's often used to compare means of two groups and determine whether the observed difference between them is statistically significant. There are different types of t-tests, such as the independent samples t-test (comparing means of two independent groups) and the paired samples t-test (comparing means of related samples).

Z-test:
A z-test is used when the population standard deviation is known and the sample is either drawn from a normal population or large enough (typically more than 30 observations) for the sample mean to be approximately normally distributed by the central limit theorem. It's commonly used when comparing a sample mean to a known population mean.
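
One way to see why sample size matters is to compare critical values. The sketch below is illustrative only: the sample sizes are hypothetical, and it assumes a two-tailed test at the 95% level. The t critical value is larger than the z critical value for small samples and approaches it as the sample grows.

```python
from scipy.stats import norm, t

confidence_level = 0.95
upper_quantile = (1 + confidence_level) / 2  # 0.975 for a two-tailed 95% test

z_critical = norm.ppf(upper_quantile)

# Hypothetical sample sizes, chosen only for illustration
sample_sizes = [5, 30, 1000]
t_criticals = [t.ppf(upper_quantile, df=n - 1) for n in sample_sizes]

print(f"z critical value: {z_critical:.3f}")
for n, t_crit in zip(sample_sizes, t_criticals):
    print(f"n = {n:>4}: t critical value = {t_crit:.3f}")
```

With only 5 observations the t-test demands noticeably stronger evidence than the z-test; by 1000 observations the two are nearly identical.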

In [1]:
# T-test Example:

import numpy as np
from scipy.stats import ttest_ind

group_a_scores = np.array([85, 92, 88, 78, 95])
group_b_scores = np.array([75, 80, 88, 82, 90])

t_statistic, p_value = ttest_ind(group_a_scores, group_b_scores)
alpha = 0.05

if p_value < alpha:
    print("Reject null hypothesis: There is a significant difference in means.")
else:
    print("Fail to reject null hypothesis: No significant difference in means.")

Fail to reject null hypothesis: No significant difference in means.


In [2]:
# Z-test Example:

from scipy.stats import norm

population_mean = 150
population_stddev = 10
sample_mean = 155
sample_size = 50

z_score = (sample_mean - population_mean) / (population_stddev / (sample_size**0.5))
alpha = 0.05

# Two-tailed p-value, matching the "significantly different" conclusion below
p_value = 2 * (1 - norm.cdf(abs(z_score)))

if p_value < alpha:
    print("Reject null hypothesis: Sample mean is significantly different from population mean.")
else:
    print("Fail to reject null hypothesis: No significant difference from population mean.")

Reject null hypothesis: Sample mean is significantly different from population mean.


Q2.Differentiate between one-tailed and two-tailed tests

One-Tailed Test:
A one-tailed test is used when you're specifically interested in whether the observed effect is in one direction (either significantly larger or significantly smaller) based on your hypothesis. In other words, you're only concerned about deviations from the null hypothesis in one direction. The critical region is located in one tail of the distribution.

Two-Tailed Test:
A two-tailed test is used when you're interested in deviations from the null hypothesis in both directions. You want to determine whether the observed effect is significantly different from the null hypothesis, regardless of whether it's larger or smaller. The critical region is divided between both tails of the distribution.
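
The placement of the critical region can be made concrete with critical values. This is a minimal sketch assuming a standard normal test statistic and alpha = 0.05 (both chosen here for illustration): the one-tailed test puts all of alpha in one tail, while the two-tailed test splits it in half.

```python
from scipy.stats import norm

alpha = 0.05

# One-tailed (right): the whole 5% sits in the upper tail
one_tailed_critical = norm.ppf(1 - alpha)

# Two-tailed: 2.5% in each tail
two_tailed_critical = norm.ppf(1 - alpha / 2)

print(f"One-tailed critical z: {one_tailed_critical:.3f}")
print(f"Two-tailed critical z: +/-{two_tailed_critical:.3f}")
```

The one-tailed cutoff is lower, so an effect in the hypothesized direction is easier to detect, at the cost of ignoring effects in the opposite direction.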

One-Tailed Test Example:
    
Let's say we have two groups of students and want to test whether Group A's mean score is significantly higher than Group B's mean score. Our hypotheses would be:

Null hypothesis (H0): The mean score of Group A is less than or equal to the mean score of Group B.

Alternative hypothesis (H1): The mean score of Group A is greater than the mean score of Group B.

In [3]:
from scipy.stats import ttest_ind

group_a_scores = np.array([85, 92, 88, 78, 95])
group_b_scores = np.array([75, 80, 88, 82, 90])

t_statistic, p_value = ttest_ind(group_a_scores, group_b_scores)

alpha = 0.05
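tail = 'right'  # Specify the direction of the test

# ttest_ind returns a two-tailed p-value; halve it for a one-tailed test,
# and reject only if the statistic points in the hypothesized direction.
if tail == 'right' and t_statistic > 0 and p_value / 2 < alpha: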
    print("Reject null hypothesis: Group A mean is significantly higher.")
else:
    print("Fail to reject null hypothesis.")

Fail to reject null hypothesis.


Two-Tailed Test Example:
Let's consider the same groups of students, but now we want to test whether there's a significant difference in the mean scores between the two groups, regardless of direction.

Null hypothesis (H0): The mean score of Group A is equal to the mean score of Group B.

Alternative hypothesis (H1): The mean score of Group A is not equal to the mean score of Group B.

In [4]:
import numpy as np
from scipy.stats import ttest_ind

group_a_scores = np.array([85, 92, 88, 78, 95])
group_b_scores = np.array([75, 80, 88, 82, 90])

t_statistic, p_value = ttest_ind(group_a_scores, group_b_scores)

alpha = 0.05

if p_value < alpha:
    print("Reject null hypothesis: There is a significant difference in means.")
else:
    print("Fail to reject null hypothesis: No significant difference in means.")

Fail to reject null hypothesis: No significant difference in means.


Q3. Type 1 Error (False Positive):

A Type 1 error occurs when we reject the null hypothesis when it is actually true. In other words, we conclude that there is a significant effect or difference when, in reality, there is none. The probability of committing a Type 1 error is denoted as alpha (α), which is the significance level we choose.

Type 2 Error (False Negative):
A Type 2 error occurs when we fail to reject the null hypothesis when it is actually false. This means we fail to detect a significant effect or difference that does exist. The probability of committing a Type 2 error is denoted as beta (β), and the complementary value (1 - β) is called the power of the test, which represents the ability of the test to correctly reject the null hypothesis when it is false.
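
Both error rates can be estimated by simulation. The sketch below is not a definitive method: the population values, the alternative mean of 103, the seed, and the helper `two_tailed_p_values` are all hypothetical choices made for illustration. It runs many two-tailed z-tests, first with the null hypothesis true and then with it false.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical setup for illustration
null_mean = 100
population_stddev = 10
sample_size = 50
alpha = 0.05
n_trials = 2000


def two_tailed_p_values(samples):
    """Two-tailed z-test p-value for each row of samples against null_mean."""
    standard_error = population_stddev / np.sqrt(sample_size)
    z_scores = (samples.mean(axis=1) - null_mean) / standard_error
    return 2 * norm.sf(np.abs(z_scores))


# Case 1: the null is true, so every rejection is a Type 1 error
null_samples = rng.normal(null_mean, population_stddev, size=(n_trials, sample_size))
type1_rate = np.mean(two_tailed_p_values(null_samples) < alpha)

# Case 2: the null is false (hypothetical true mean of 103),
# so every failure to reject is a Type 2 error
alt_samples = rng.normal(103, population_stddev, size=(n_trials, sample_size))
type2_rate = np.mean(two_tailed_p_values(alt_samples) >= alpha)

print(f"Estimated Type 1 error rate: {type1_rate:.3f}")
print(f"Estimated Type 2 error rate: {type2_rate:.3f}")
print(f"Estimated power: {1 - type2_rate:.3f}")
```

The Type 1 rate should land near alpha, while the Type 2 rate depends on how far the true mean is from the null value and on the sample size.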



In [5]:
# Type 1 Error Example:

import numpy as np
from scipy.stats import norm

population_mean = 100  
population_stddev = 10
sample_mean = 105  
sample_size = 50

# Assuming a two-tailed test
z_score = (sample_mean - population_mean) / (population_stddev / (sample_size**0.5))
alpha = 0.05

p_value = 2 * (1 - norm.cdf(abs(z_score)))

if p_value < alpha:
    print("Reject null hypothesis. If the null hypothesis is actually true, this rejection is a Type 1 error.")
else:
    print("Fail to reject null hypothesis. No Type 1 error is possible in this case.")

Reject null hypothesis. If the null hypothesis is actually true, this rejection is a Type 1 error.


In [6]:
# Type 2 Error Example:

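# Reuses population_stddev and sample_size from the previous cell.
population_mean_disease = 120  # true mean under the alternative hypothesis
sample_mean_test = 115         # treated here as the decision cutoff: fail to reject below it

# beta = P(sample mean < cutoff | true mean = 120), the chance of missing a real effect
z_score = (sample_mean_test - population_mean_disease) / (population_stddev / (sample_size**0.5))
beta = norm.cdf(z_score)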

power = 1 - beta

print("Probability of Type 2 error:", beta)
print("Power of the test:", power)

Probability of Type 2 error: 0.0002034760087224789
Power of the test: 0.9997965239912775


Q4. Explain Bayes's theorem with an example.


Bayes' theorem gives the probability of an “event” given information from a “test”. Events and tests are different things: a test for liver disease is separate from the event of actually having liver disease. A positive test does not guarantee the event, and when the event is rare, a larger share of positive results are false positives.


Examples of Bayes’ Theorem
Bayesian inference, which is derived directly from Bayes’ theorem, is applied in many fields, including medicine, science, philosophy, engineering, sports, and law. Example: for a medical test, Bayes’ theorem gives the probability that a person who tests positive actually has the disease, taking into account both how common the disease is and how accurate the test is.
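
A minimal sketch of the medical-test example, worked with counts rather than probabilities. Every number here (population size, prevalence, sensitivity, false positive rate) is hypothetical and chosen only to illustrate the calculation.

```python
# Hypothetical numbers, chosen only to illustrate the calculation
population = 10_000
prevalence = 0.01           # 1% of people have the disease
sensitivity = 0.90          # P(positive test | disease)
false_positive_rate = 0.05  # P(positive test | no disease)

diseased = population * prevalence
healthy = population - diseased

true_positives = diseased * sensitivity
false_positives = healthy * false_positive_rate

# Bayes' theorem in count form: share of positive tests that are true positives
p_disease_given_positive = true_positives / (true_positives + false_positives)

print(f"True positives: {true_positives:.0f}")
print(f"False positives: {false_positives:.0f}")
print(f"P(disease | positive test): {p_disease_given_positive:.4f}")
```

Because the disease is rare in this setup, false positives outnumber true positives even though the test is fairly accurate.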

Q5. A confidence interval is a range of values that is used to estimate the unknown true value of a population parameter, such as the mean or the proportion, based on a sample from that population. It provides a measure of the uncertainty associated with the sample estimate. The confidence interval indicates the range within which we can reasonably expect the true population parameter to fall, with a specified level of confidence.

For example, if you calculate a 95% confidence interval for the mean, it means that if you were to repeat the sampling process and construct confidence intervals from different samples, about 95% of those intervals would contain the true population mean.
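
The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under stated assumptions: the population (normal, mean 50, standard deviation 5), the sample size of 20, the number of trials, and the seed are all hypothetical.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

# Hypothetical population, used only for this simulation
true_mean = 50
true_stddev = 5
sample_size = 20
confidence_level = 0.95
n_trials = 1000

t_score = t.ppf((1 + confidence_level) / 2, df=sample_size - 1)

# Draw many samples and build one t-based interval per sample
samples = rng.normal(true_mean, true_stddev, size=(n_trials, sample_size))
means = samples.mean(axis=1)
margins = t_score * samples.std(axis=1, ddof=1) / np.sqrt(sample_size)

covered = (means - margins <= true_mean) & (true_mean <= means + margins)
coverage = covered.mean()

print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should come out close to 0.95, which is what the confidence level promises about the procedure rather than about any single interval.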

Calculating a Confidence Interval:

In [7]:
import numpy as np
from scipy.stats import t

scores = np.array([85, 90, 78, 92, 88, 75, 80, 85, 89, 95])


In [9]:
print(scores)

[85 90 78 92 88 75 80 85 89 95]


In [11]:
sample_mean = np.mean(scores)
sample_stddev = np.std(scores, ddof=1)  
sample_size = len(scores)
confidence_level = 0.95
t_score = t.ppf((1 + confidence_level) / 2, df=sample_size - 1)

# Calculate the margin of error
margin_of_error = t_score * (sample_stddev / np.sqrt(sample_size))

# Calculate the confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print("Sample Mean:", sample_mean)
print("Confidence Interval:", confidence_interval)


Sample Mean: 85.7
Confidence Interval: (81.12507026629194, 90.27492973370806)


Q6. Bayes' Theorem is a fundamental concept in probability theory that helps us update our beliefs about the probability of an event occurring based on new evidence.
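
Written as a formula, with A as the event (having the disease) and B as the evidence (a positive test), the theorem takes the form used by the function below, where P(A) corresponds to `prior_A`, P(B | A) to `p_B_given_A`, and P(B | ¬A) to `p_B_given_not_A`:

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)},
\qquad P(\neg A) = 1 - P(A)
```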

In [1]:
def bayes_theorem(prior_A, p_B_given_A, p_B_given_not_A):
    p_not_A = 1 - prior_A
    p_A_given_B = (p_B_given_A * prior_A) / ((p_B_given_A * prior_A) + (p_B_given_not_A * p_not_A))
    return p_A_given_B

# Given data
prior_A = 0.01  # Prior probability of having the disease
p_B_given_A = 0.95  # Sensitivity of the diagnostic tool
p_B_given_not_A = 0.03  # False positive rate of the diagnostic tool

# Calculate probability using Bayes' Theorem
probability_with_disease = bayes_theorem(prior_A, p_B_given_A, p_B_given_not_A)
print(f"The probability of actually having the disease given a positive test result is: {probability_with_disease:.4f}")

The probability of actually having the disease given a positive test result is: 0.2423


Q7. To calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5, you'll need to use the formula for confidence intervals along with the appropriate z-score from the standard normal distribution.

The formula to calculate a confidence interval for the mean is:

Confidence Interval = Sample Mean ± Margin of Error
Margin of Error = z × (Sample Standard Deviation / √(Sample Size))

In [None]:
import numpy as np
from scipy.stats import norm

sample_mean = 50
sample_stddev = 5
sample_size = 100  # Assumed value: the question does not state a sample size
confidence_level = 0.95

standard_error = sample_stddev / np.sqrt(sample_size)

z_score = norm.ppf((1 + confidence_level) / 2)

margin_of_error = z_score * standard_error


confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print("Sample Mean:", sample_mean)
print("Confidence Interval:", confidence_interval)


Q8. The margin of error (MoE) in a confidence interval (CI) is a measure of the uncertainty or range around a sample statistic (such as the sample mean or proportion) that we estimate from a sample. It reflects the potential variability between the sample statistic and the true population parameter. The margin of error is typically expressed as a plus/minus value and is used to quantify the level of confidence we have in our estimate.

A larger margin of error indicates a wider interval around the estimate, which means that our estimate is less precise. Conversely, a smaller margin of error indicates a narrower interval, implying a more precise estimate.

The sample size directly affects the margin of error. Generally, as the sample size increases, the margin of error decreases. This is because larger sample sizes tend to provide more information about the population, leading to more accurate estimates.
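
The relationship can be sketched numerically. Assuming a known population standard deviation (15 here, a hypothetical value) and a 95% confidence level, the margin of error is z × σ / √n, so quadrupling the sample size halves it.

```python
import numpy as np
from scipy.stats import norm

population_stddev = 15  # hypothetical value for illustration
confidence_level = 0.95
z_score = norm.ppf((1 + confidence_level) / 2)

# Each sample size is four times the previous one
sample_sizes = [25, 100, 400]
margins = [z_score * population_stddev / np.sqrt(n) for n in sample_sizes]

for n, moe in zip(sample_sizes, margins):
    print(f"n = {n:>3}: margin of error = {moe:.2f}")
```

This square-root relationship is why large gains in precision require disproportionately larger samples.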

In [4]:
import numpy as np
from scipy.stats import norm

true_mean = 100
population_stddev = 15
confidence_level = 0.95
desired_margin_of_error = 2

z_score = norm.ppf(1 - (1 - confidence_level) / 2)
required_sample_size = np.ceil((z_score * population_stddev / desired_margin_of_error) ** 2)

print(f"Required sample size for a margin of error of {desired_margin_of_error}: {required_sample_size}")


Required sample size for a margin of error of 2: 217.0


In [5]:
sample_size = int(required_sample_size)
sample_data = np.random.normal(true_mean, population_stddev, sample_size)

# Calculate the sample mean and standard error
sample_mean = np.mean(sample_data)
standard_error = population_stddev / np.sqrt(sample_size)

# Calculate the confidence interval
margin_of_error = z_score * standard_error
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print(f"Sample mean: {sample_mean:.2f}")
print(f"Confidence Interval: ({confidence_interval[0]:.2f}, {confidence_interval[1]:.2f})")
print(f"Actual margin of error: {margin_of_error:.2f}")

Sample mean: 98.98
Confidence Interval: (96.99, 100.98)
Actual margin of error: 2.00


Q9. The z-score measures how many standard deviations a data point is away from the mean of a distribution.

In [7]:
data_point = 75
population_mean = 70
population_stddev = 5

z_score = (data_point - population_mean) / population_stddev

print(f"The z-score for the data point {data_point} is: {z_score:.2f}")

The z-score for the data point 75 is: 1.00


Q10. To test whether the weight loss drug is significantly effective, we set up the null and alternative hypotheses and perform a one-sample t-test on the sample data. The null hypothesis (H0) is that the mean weight loss is 0; the alternative hypothesis (H1) is that it differs from 0. The t-test assesses whether the observed sample mean is significantly different from this hypothesized population mean.

In [10]:
import numpy as np
from scipy.stats import t

# Given data
sample_mean = 6 
sample_stddev = 2.5  
sample_size = 50  
hypothesized_mean = 0 

t_statistic = (sample_mean - hypothesized_mean) / (sample_stddev / np.sqrt(sample_size))

degrees_of_freedom = sample_size - 1

alpha = 0.05
t_critical = t.ppf(1 - alpha / 2, degrees_of_freedom)
p_value = 2 * (1 - t.cdf(np.abs(t_statistic), degrees_of_freedom))


if p_value < alpha:
    print("Reject the null hypothesis. The drug is significantly effective.")
else:
    print("Fail to reject the null hypothesis. There is no significant evidence of drug effectiveness.")

print(f"t-statistic: {t_statistic:.4f}")
print(f"Degrees of freedom: {degrees_of_freedom}")
print(f"Critical t-value: {t_critical:.4f}")
print(f"P-value: {p_value:.4f}")

Reject the null hypothesis. The drug is significantly effective.
t-statistic: 16.9706
Degrees of freedom: 49
Critical t-value: 2.0096
P-value: 0.0000
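
The cell above uses a two-tailed p-value. Since "effective" is a directional claim, a right-tailed version (H1: mean weight loss > 0) is also reasonable. The sketch below reuses the same summary statistics; the choice of a one-tailed alternative is an assumption, not something the question states.

```python
import numpy as np
from scipy.stats import t

# Same summary statistics as the cell above
sample_mean = 6
sample_stddev = 2.5
sample_size = 50
hypothesized_mean = 0
alpha = 0.05

t_statistic = (sample_mean - hypothesized_mean) / (sample_stddev / np.sqrt(sample_size))
degrees_of_freedom = sample_size - 1

# Right-tailed test: all of alpha sits in the upper tail
t_critical_one_tailed = t.ppf(1 - alpha, degrees_of_freedom)
p_value_one_tailed = t.sf(t_statistic, degrees_of_freedom)

if p_value_one_tailed < alpha:
    print("Reject the null hypothesis in favor of a positive mean weight loss.")
else:
    print("Fail to reject the null hypothesis.")

print(f"t-statistic: {t_statistic:.4f}")
print(f"One-tailed critical t-value: {t_critical_one_tailed:.4f}")
```

With a t-statistic this large, the one-tailed and two-tailed tests reach the same conclusion.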


Q11. In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95% confidence interval for the true proportion of people who are satisfied with their job.

In [12]:
import math

# Given data
sample_proportion = 0.65  
sample_size = 500
confidence_level = 0.95


z_critical = 1.96  

margin_of_error = z_critical * math.sqrt((sample_proportion * (1 - sample_proportion)) / sample_size)

lower_bound = sample_proportion - margin_of_error
upper_bound = sample_proportion + margin_of_error

print(f"95% Confidence Interval: ({lower_bound:.4f}, {upper_bound:.4f})")


95% Confidence Interval: (0.6082, 0.6918)


Q13. To calculate the 90% confidence interval for the true population mean based on a sample, we can use the t-distribution since the population standard deviation is not known. 

In [14]:
import numpy as np
from scipy.stats import t

# Given data
sample_mean = 65
sample_stddev = 8
sample_size = 50
confidence_level = 0.90


degrees_of_freedom = sample_size - 1

t_critical = t.ppf(1 - (1 - confidence_level) / 2, degrees_of_freedom)

margin_of_error = t_critical * (sample_stddev / np.sqrt(sample_size))

lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

print(f"90% Confidence Interval: ({lower_bound:.4f}, {upper_bound:.4f})")



90% Confidence Interval: (63.1032, 66.8968)


Q14. The following one-sample, two-tailed t-test checks whether caffeine has a significant effect on reaction time, using a 90% confidence level (alpha = 0.10):

In [15]:
import numpy as np
from scipy.stats import t

# Given data
sample_mean = 0.25  
sample_stddev = 0.05  
sample_size = 30  
hypothesized_mean = 0.28 
confidence_level = 0.90  

t_statistic = (sample_mean - hypothesized_mean) / (sample_stddev / np.sqrt(sample_size))


degrees_of_freedom = sample_size - 1

p_value = 2 * (1 - t.cdf(np.abs(t_statistic), degrees_of_freedom))

# Determine if the results are statistically significant
alpha = 1 - confidence_level
if p_value < alpha:
    result = "Reject the null hypothesis. Caffeine has a significant effect on reaction time."
else:
    result = "Fail to reject the null hypothesis. There is no significant evidence of caffeine effect on reaction time."

print(result)
print(f"t-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

Reject the null hypothesis. Caffeine has a significant effect on reaction time.
t-statistic: -3.2863
P-value: 0.0027
