In [None]:
Q1. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation
of 5 using Python. Interpret the results.

The question does not state a sample size, so the code below assumes \( n = 50 \); a different \( n \) would change the width of the interval. Given the sample mean and standard deviation, the 95% confidence interval is computed with the formula:

\[ \text{Confidence interval} = \bar{x} \pm z_{\alpha/2} \times \frac{s}{\sqrt{n}} \]

Where:
- \( \bar{x} \) is the sample mean,
- \( s \) is the sample standard deviation,
- \( n \) is the sample size,
- \( z_{\alpha/2} \) is the critical value from the standard normal distribution corresponding to the desired confidence level.

Let's calculate it using Python:

import numpy as np
from scipy.stats import norm

# Given data
sample_mean = 50  # Sample mean
sample_std_dev = 5  # Sample standard deviation
sample_size = 50  # Sample size (assumed; not given in the question)

# Confidence level
confidence_level = 0.95

# Calculate critical value (z_alpha/2)
alpha = 1 - confidence_level
z_alpha_2 = norm.ppf(1 - alpha/2)  # Two-tailed test

# Calculate margin of error
margin_of_error = z_alpha_2 * (sample_std_dev / np.sqrt(sample_size))

# Calculate confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

# Print the results
print("Results:")
print(f"Sample mean: {sample_mean}")
print(f"Sample standard deviation: {sample_std_dev}")
print(f"Sample size: {sample_size}")
print(f"Confidence level: {confidence_level}")
print(f"Critical value (z_alpha/2): {z_alpha_2}")
print(f"Margin of error: {margin_of_error}")
print(f"95% Confidence interval: ({lower_bound}, {upper_bound})")

Interpretation of the results:
- With the assumed sample size of 50, the calculated 95% confidence interval is approximately (48.61, 51.39).
- This means that we are 95% confident that the true population mean lies within the interval (48.61, 51.39).
- In other words, if we were to repeatedly sample from the population and calculate the confidence interval for each sample, approximately 95% of those intervals would contain the true population mean.
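
As a cross-check on the manual formula, SciPy's `norm.interval` can return the same bounds in one call. The sketch below wraps it in a helper; the function name `mean_confidence_interval` is hypothetical, and the sample size of 50 is an assumption because the question does not provide one.

```python
import numpy as np
from scipy.stats import norm


def mean_confidence_interval(mean, std_dev, n, confidence=0.95):
    """Return (lower, upper) for a z-based confidence interval of the mean."""
    standard_error = std_dev / np.sqrt(n)
    # norm.interval returns the central interval holding `confidence` probability
    lower, upper = norm.interval(confidence, loc=mean, scale=standard_error)
    return float(lower), float(upper)


# n = 50 is an assumed sample size; the question does not state one
lower, upper = mean_confidence_interval(50, 5, 50)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Passing the standard error as `scale` is what makes this an interval for the mean rather than for individual observations.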

In [None]:
Q2. Conduct a chi-square goodness of fit test to determine if the distribution of colors of M&Ms in a bag
matches the expected distribution of 20% blue, 20% orange, 20% green, 10% yellow, 10% red, and 20%
brown. Use Python to perform the test with a significance level of 0.05.

To conduct a chi-square goodness of fit test in Python, we can follow these steps:

1. Define the observed frequencies: Count the number of each color observed in the M&M bag.
2. Define the expected frequencies: Calculate the expected count for each color based on the expected distribution.
3. Compute the chi-square statistic: Compare the observed frequencies to the expected frequencies using the chi-square formula.
4. Determine the degrees of freedom: Calculate the degrees of freedom, which is equal to the number of categories minus 1.
5. Find the critical chi-square value: Determine the critical chi-square value corresponding to the chosen significance level and degrees of freedom.
6. Compare the calculated chi-square statistic with the critical chi-square value and determine whether to reject the null hypothesis.

Let's perform these steps in Python:

from scipy.stats import chi2
import numpy as np

# Define observed frequencies (example data; the question does not give counts)
observed_counts = np.array([18, 22, 15, 8, 11, 26])  # Counts for blue, orange, green, yellow, red, brown

# Define expected frequencies (20% blue, orange, green, brown; 10% yellow, red)
expected_counts = np.array([0.2, 0.2, 0.2, 0.1, 0.1, 0.2]) * observed_counts.sum()

# Compute chi-square statistic
chi_square_statistic = np.sum((observed_counts - expected_counts)**2 / expected_counts)

# Define degrees of freedom
degrees_of_freedom = len(observed_counts) - 1

# Find critical chi-square value
significance_level = 0.05
critical_chi_square_value = chi2.ppf(1 - significance_level, degrees_of_freedom)

# Print the results
print("Chi-square Goodness of Fit Test:")
print(f"Observed Counts: {observed_counts}")
print(f"Expected Counts: {expected_counts}")
print(f"Chi-square Statistic: {chi_square_statistic}")
print(f"Degrees of Freedom: {degrees_of_freedom}")
print(f"Critical Chi-square Value: {critical_chi_square_value}")

# Compare chi-square statistic with critical value
if chi_square_statistic > critical_chi_square_value:
    print("Reject the null hypothesis: The distribution of colors of M&Ms in the bag does not match the expected distribution.")
else:
    print("Fail to reject the null hypothesis: There is no evidence that the distribution of colors of M&Ms in the bag differs from the expected distribution.")

In this Python code:
- We define the observed frequencies, which are the counts of each color observed in the M&M bag.
- We define the expected frequencies based on the expected distribution of colors.
- We compute the chi-square statistic using the formula for the chi-square test.
- We determine the degrees of freedom.
- We find the critical chi-square value corresponding to the chosen significance level and degrees of freedom.
- We compare the calculated chi-square statistic with the critical value to determine whether to reject the null hypothesis. If the chi-square statistic exceeds the critical value, we reject the null hypothesis, indicating that the observed distribution does not match the expected distribution. Otherwise, we fail to reject the null hypothesis.
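
The same test can also be run with a p-value instead of a critical value. The sketch below uses `scipy.stats.chisquare` on the same example counts; these counts are illustrative data, not values supplied by the question.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts for blue, orange, green, yellow, red, brown
observed_counts = np.array([18, 22, 15, 8, 11, 26])
expected_proportions = np.array([0.2, 0.2, 0.2, 0.1, 0.1, 0.2])
expected_counts = expected_proportions * observed_counts.sum()

# chisquare uses (number of categories - 1) degrees of freedom by default
statistic, p_value = chisquare(f_obs=observed_counts, f_exp=expected_counts)

print(f"Chi-square statistic: {statistic:.2f}")
print(f"P-value: {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```

Comparing the p-value to 0.05 leads to the same decision as comparing the statistic to the critical value, since both are derived from the same chi-square distribution.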

In [None]:
Q3. Use Python to calculate the chi-square statistic and p-value for a contingency table with the following
data:
             Group A    Group B
Outcome 1       20         15
Outcome 2       10         25
Outcome 3       15         20
Interpret the results of the test.

To calculate the chi-square statistic and p-value for a contingency table in Python, we can use the `chi2_contingency` function from the `scipy.stats` module. This function computes the chi-square statistic, p-value, degrees of freedom, and expected frequencies for a contingency table.

Here's how to perform the calculation and interpret the results:

from scipy.stats import chi2_contingency

# Define the contingency table
contingency_table = [[20, 15],
                     [10, 25],
                     [15, 20]]

# Perform the chi-square test
chi2_statistic, p_value, dof, expected = chi2_contingency(contingency_table)

# Print the results
print("Chi-square Test Results:")
print(f"Chi-square Statistic: {chi2_statistic}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
for i, row in enumerate(expected):
    for j, val in enumerate(row):
        print(f"Expected frequency for Outcome {i+1} in Group {chr(65+j)}: {val}")

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is evidence of an association between the groups and outcomes.")
else:
    print("Fail to reject the null hypothesis: There is no evidence of an association between the groups and outcomes.")

In this Python code:
- We define the contingency table where rows represent outcomes and columns represent groups.
- We use the `chi2_contingency` function to perform the chi-square test on the contingency table.
- The function returns the chi-square statistic, p-value, degrees of freedom, and expected frequencies.
- We print the results including the chi-square statistic, p-value, degrees of freedom, and expected frequencies.
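- Finally, we interpret the results by comparing the p-value to the significance level (0.05 in this case). If the p-value is less than the significance level, we reject the null hypothesis and conclude that there is evidence of an association between the groups and outcomes. Otherwise, we fail to reject the null hypothesis. For this table the statistic is about 5.83 with 2 degrees of freedom and the p-value is about 0.054, which is slightly above 0.05, so we fail to reject the null hypothesis: the data do not provide sufficient evidence of an association.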
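
To make the expected frequencies less of a black box, the sketch below reproduces the calculation by hand: each expected count is the row total times the column total divided by the grand total. The variable names are illustrative only.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[20, 15],
                     [10, 25],
                     [15, 20]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count under independence: (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# Pearson chi-square statistic summed over all cells
statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(statistic, dof)  # upper-tail probability

print(f"Expected frequencies:\n{expected}")
print(f"Chi-square statistic: {statistic:.4f}, dof: {dof}, p-value: {p_value:.4f}")
```

Because every row has the same total here, all three rows share the same expected counts, and Outcome 3 contributes nothing to the statistic.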

In [None]:
Q4. A study of the prevalence of smoking in a population of 500 individuals found that 60 individuals
smoked. Use Python to calculate the 95% confidence interval for the true proportion of individuals in the
population who smoke.

To calculate the 95% confidence interval for the true proportion of individuals in the population who smoke, we can use the formula for the confidence interval of a proportion:

\[ \text{Confidence interval} = \hat{p} \pm z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

Where:
- \( \hat{p} \) is the sample proportion (in decimal form),
- \( z_{\alpha/2} \) is the critical value from the standard normal distribution corresponding to the desired confidence level,
- \( n \) is the sample size.

Let's calculate it using Python:

import numpy as np
from scipy.stats import norm

# Given data
sample_proportion = 60 / 500  # Sample proportion
sample_size = 500  # Sample size

# Confidence level
confidence_level = 0.95

# Calculate critical value (z_alpha/2)
alpha = 1 - confidence_level
z_alpha_2 = norm.ppf(1 - alpha/2)  # Two-tailed test

# Calculate margin of error
margin_of_error = z_alpha_2 * np.sqrt((sample_proportion * (1 - sample_proportion)) / sample_size)

# Calculate confidence interval
lower_bound = sample_proportion - margin_of_error
upper_bound = sample_proportion + margin_of_error

# Print the results
print("Results:")
print(f"Sample proportion: {sample_proportion}")
print(f"Sample size: {sample_size}")
print(f"Confidence level: {confidence_level}")
print(f"Critical value (z_alpha/2): {z_alpha_2}")
print(f"Margin of error: {margin_of_error}")
print(f"95% Confidence interval: ({lower_bound}, {upper_bound})")

This code calculates the 95% confidence interval for the true proportion of individuals in the population who smoke using the given sample proportion and sample size. It uses the inverse of the cumulative distribution function (`norm.ppf`) to calculate the critical value for the standard normal distribution. The resulting interval is approximately (0.0915, 0.1485), so we are 95% confident that between about 9.2% and 14.8% of the population smokes.
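
The formula above is the normal-approximation (Wald) interval. As a hedged comparison, the sketch below also computes the Wilson score interval, a different technique that is often preferred when the proportion is far from 0.5. The helper names `wald_interval` and `wilson_interval` are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def wald_interval(successes, n, confidence=0.95):
    """Normal-approximation interval: p_hat +/- z * sqrt(p_hat(1 - p_hat) / n)."""
    p_hat = successes / n
    z = norm.ppf(1 - (1 - confidence) / 2)
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval, which shifts the center slightly towards 0.5."""
    p_hat = successes / n
    z = norm.ppf(1 - (1 - confidence) / 2)
    denominator = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denominator
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denominator
    return center - margin, center + margin


wald_lower, wald_upper = wald_interval(60, 500)
wilson_lower, wilson_upper = wilson_interval(60, 500)
print(f"Wald:   ({wald_lower:.4f}, {wald_upper:.4f})")
print(f"Wilson: ({wilson_lower:.4f}, {wilson_upper:.4f})")
```

With a sample of 500 the two intervals are close, which suggests the normal approximation is reasonable for this question.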

In [None]:
Q5. Calculate the 90% confidence interval for a sample of data with a mean of 75 and a standard deviation
of 12 using Python. Interpret the results.

As in Q1, the question does not state a sample size, so the code below assumes \( n = 50 \). Given the sample mean and standard deviation, the 90% confidence interval is computed with the formula:

\[ \text{Confidence interval} = \bar{x} \pm z_{\alpha/2} \times \frac{s}{\sqrt{n}} \]

Where:
- \( \bar{x} \) is the sample mean,
- \( s \) is the sample standard deviation,
- \( n \) is the sample size,
- \( z_{\alpha/2} \) is the critical value from the standard normal distribution corresponding to the desired confidence level.

Let's calculate it using Python:

import numpy as np
from scipy.stats import norm

# Given data
sample_mean = 75  # Sample mean
sample_std_dev = 12  # Sample standard deviation
sample_size = 50  # Sample size (assumed; not given in the question)

# Confidence level
confidence_level = 0.90

# Calculate critical value (z_alpha/2)
alpha = 1 - confidence_level
z_alpha_2 = norm.ppf(1 - alpha/2)  # Two-tailed test

# Calculate margin of error
margin_of_error = z_alpha_2 * (sample_std_dev / np.sqrt(sample_size))

# Calculate confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

# Print the results
print("Results:")
print(f"Sample mean: {sample_mean}")
print(f"Sample standard deviation: {sample_std_dev}")
print(f"Sample size: {sample_size}")
print(f"Confidence level: {confidence_level}")
print(f"Critical value (z_alpha/2): {z_alpha_2}")
print(f"Margin of error: {margin_of_error}")
print(f"90% Confidence interval: ({lower_bound}, {upper_bound})")

Interpretation of the results:
- With the assumed sample size of 50, the calculated 90% confidence interval is approximately (72.21, 77.79).
- This means that we are 90% confident that the true population mean lies within the interval (72.21, 77.79).
- In other words, if we were to repeatedly sample from the population and calculate the confidence interval for each sample, approximately 90% of those intervals would contain the true population mean.
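
Because the standard deviation comes from the sample, a t-based interval is a common alternative to the z-based one. The sketch below compares the two; the sample size of 50 is again an assumption, and the t-interval is shown as a comparison rather than as the method the question requires.

```python
import numpy as np
from scipy.stats import norm, t

sample_mean = 75
sample_std_dev = 12
sample_size = 50  # assumed; the question does not state a sample size
confidence_level = 0.90

standard_error = sample_std_dev / np.sqrt(sample_size)
upper_tail = 1 - (1 - confidence_level) / 2

# Critical values from the standard normal and from t with n - 1 degrees of freedom
z_critical = norm.ppf(upper_tail)
t_critical = t.ppf(upper_tail, df=sample_size - 1)

z_interval = (sample_mean - z_critical * standard_error,
              sample_mean + z_critical * standard_error)
t_interval = (sample_mean - t_critical * standard_error,
              sample_mean + t_critical * standard_error)

print(f"z-based 90% CI: ({z_interval[0]:.2f}, {z_interval[1]:.2f})")
print(f"t-based 90% CI: ({t_interval[0]:.2f}, {t_interval[1]:.2f})")
```

The t-based interval is slightly wider, reflecting the extra uncertainty from estimating the standard deviation; the gap shrinks as the sample size grows.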

In [None]:
Q6. Use Python to plot the chi-square distribution with 10 degrees of freedom. Label the axes and shade the
area corresponding to a chi-square statistic of 15.

To plot the chi-square distribution with 10 degrees of freedom in Python and shade the area corresponding to a chi-square statistic of 15, we can use the `scipy.stats` module to generate the distribution and matplotlib for plotting. Here's how we can do it:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

# Define the degrees of freedom
df = 10

# Generate x values for the chi-square distribution
x = np.linspace(0, 30, 1000)  # Adjust the range based on the desired visualization

# Calculate the chi-square probability density function (PDF) for the given degrees of freedom
pdf = chi2.pdf(x, df)

# Plot the chi-square distribution
plt.plot(x, pdf, label=f'Chi-square distribution (df={df})')

# Shade the area corresponding to a chi-square statistic of 15
plt.fill_between(x, pdf, where=(x >= 15), color='lightblue', alpha=0.5, label='Area where chi-square >= 15')

# Add labels and title
plt.xlabel('Chi-square statistic')
plt.ylabel('Probability density function (PDF)')
plt.title('Chi-square Distribution')

# Add legend
plt.legend()

# Show plot
plt.grid(True)
plt.show()

This code will generate a plot of the chi-square distribution with 10 degrees of freedom and shade the area corresponding to a chi-square statistic of 15. Adjust the range of x values (`np.linspace`) based on your specific visualization needs.
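
The shaded region is the upper-tail probability, which can also be reported as a number. The sketch below computes it with `chi2.sf` and, as an independent check, by integrating the density with `scipy.integrate.quad`.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import chi2

df = 10
statistic = 15

# Survival function: probability of a value at least as large as the statistic
tail_area = chi2.sf(statistic, df)

# Independent check: integrate the PDF from the statistic to infinity
integrated_area, _ = quad(chi2.pdf, statistic, np.inf, args=(df,))

print(f"Shaded area via chi2.sf: {tail_area:.4f}")
print(f"Shaded area via integration: {integrated_area:.4f}")
```

This area is the p-value that a chi-square statistic of 15 would have in a test with 10 degrees of freedom.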

In [None]:
Q7. A random sample of 1000 people was asked if they preferred Coke or Pepsi. Of the sample, 520
preferred Coke. Calculate a 99% confidence interval for the true proportion of people in the population who
prefer Coke.

To calculate the 99% confidence interval for the true proportion of people in the population who prefer Coke, we can use the formula for the confidence interval of a proportion:

\[ \text{Confidence interval} = \hat{p} \pm z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \]

Where:
- \( \hat{p} \) is the sample proportion (in decimal form),
- \( z_{\alpha/2} \) is the critical value from the standard normal distribution corresponding to the desired confidence level,
- \( n \) is the sample size.

Let's calculate it using Python:

import numpy as np
from scipy.stats import norm

# Given data
sample_proportion = 520 / 1000  # Sample proportion
sample_size = 1000  # Sample size

# Confidence level
confidence_level = 0.99

# Calculate critical value (z_alpha/2)
alpha = 1 - confidence_level
z_alpha_2 = norm.ppf(1 - alpha/2)  # Two-tailed test

# Calculate margin of error
margin_of_error = z_alpha_2 * np.sqrt((sample_proportion * (1 - sample_proportion)) / sample_size)

# Calculate confidence interval
lower_bound = sample_proportion - margin_of_error
upper_bound = sample_proportion + margin_of_error

# Print the results
print("Results:")
print(f"Sample proportion: {sample_proportion}")
print(f"Sample size: {sample_size}")
print(f"Confidence level: {confidence_level}")
print(f"Critical value (z_alpha/2): {z_alpha_2}")
print(f"Margin of error: {margin_of_error}")
print(f"99% Confidence interval: ({lower_bound}, {upper_bound})")

This code calculates the 99% confidence interval for the true proportion of people in the population who prefer Coke using the given sample proportion and sample size. It uses the inverse of the cumulative distribution function (`norm.ppf`) to calculate the critical value for the standard normal distribution. The resulting interval is approximately (0.4793, 0.5607). Because the interval contains 0.5, the sample does not establish at the 99% level that a majority of the population prefers Coke.

In [None]:
Q8. A researcher hypothesizes that a coin is biased towards tails. They flip the coin 100 times and observe
45 tails. Conduct a chi-square goodness of fit test to determine if the observed frequencies match the
expected frequencies of a fair coin. Use a significance level of 0.05.

To conduct a chi-square goodness of fit test in Python to determine if the observed frequencies match the expected frequencies of a fair coin, we can follow these steps:

1. Define the null and alternative hypotheses:
   - Null hypothesis (\(H_0\)): The coin is fair (the observed frequencies match the expected frequencies).
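   - Alternative hypothesis (\(H_1\)): The coin is not fair (the observed frequencies do not match the expected frequencies). The chi-square goodness of fit test is non-directional, so on its own it cannot show that the bias is towards tails specifically.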

2. Choose the significance level (\(\alpha\)). Here, it's given as 0.05.

3. Calculate the expected frequencies for a fair coin (50 tails and 50 heads out of 100 flips).

4. Compute the chi-square statistic using the formula:
   \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]
   where:
   - \( O_i \) is the observed frequency,
   - \( E_i \) is the expected frequency.

5. Determine the degrees of freedom, which is equal to the number of categories minus 1 (here, 2 categories: tails and heads).

6. Find the critical chi-square value corresponding to the chosen significance level and degrees of freedom.

7. Compare the calculated chi-square statistic with the critical chi-square value and determine whether to reject the null hypothesis.

Let's implement this in Python:

from scipy.stats import chi2

# Given data
observed_tails = 45
total_flips = 100

# Expected frequencies for a fair coin
expected_tails = total_flips / 2
expected_heads = total_flips / 2

# Compute the chi-square statistic
observed_heads = total_flips - observed_tails
chi2_statistic = ((observed_tails - expected_tails)**2 / expected_tails
                  + (observed_heads - expected_heads)**2 / expected_heads)

# Degrees of freedom
degrees_of_freedom = 1

# Significance level
alpha = 0.05

# Find the critical chi-square value
critical_chi2_value = chi2.ppf(1 - alpha, degrees_of_freedom)

# Print the results
print("Chi-square Goodness of Fit Test:")
print(f"Observed Tails: {observed_tails}")
print(f"Expected Tails (for a fair coin): {expected_tails}")
print(f"Total flips: {total_flips}")
print(f"Chi-square Statistic: {chi2_statistic}")
print(f"Degrees of Freedom: {degrees_of_freedom}")
print(f"Critical Chi-square Value: {critical_chi2_value}")

# Compare chi-square statistic with critical value
if chi2_statistic > critical_chi2_value:
    print("Reject the null hypothesis: The observed frequencies differ from those expected of a fair coin.")
else:
    print("Fail to reject the null hypothesis: There is no evidence that the coin is unfair.")

In this Python program:
- We calculate the chi-square statistic using the observed and expected frequencies.
- We determine the critical chi-square value for a significance level of 0.05 and degrees of freedom equal to 1 (since there are two categories: tails and heads).
- We compare the calculated chi-square statistic with the critical value to determine whether to reject the null hypothesis. Here the statistic is 1.0, which is below the critical value of about 3.84, so we fail to reject the null hypothesis: the data provide no evidence that the coin is unfair. Note that the goodness of fit test is non-directional, and with 45 tails out of 100 the sample shows fewer tails than a fair coin would be expected to produce, so it gives no support for a bias towards tails.
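
Since the researcher's hypothesis is directional, a one-sided binomial calculation is a useful companion to the goodness of fit test. The sketch below runs both; the binomial check is an added illustration, not part of what the question asks for.

```python
from scipy.stats import binom, chisquare

observed_tails = 45
total_flips = 100
observed_heads = total_flips - observed_tails

# Non-directional goodness of fit test; equal expected counts are the default
gof_statistic, gof_p_value = chisquare([observed_tails, observed_heads])

# Directional check: probability of 45 or more tails if the coin is fair
one_sided_p_value = binom.sf(observed_tails - 1, total_flips, 0.5)

print(f"Chi-square statistic: {gof_statistic:.2f}, p-value: {gof_p_value:.4f}")
print(f"One-sided P(tails >= {observed_tails}): {one_sided_p_value:.4f}")
```

A large one-sided p-value means that seeing at least 45 tails is entirely ordinary for a fair coin, so neither calculation supports a bias towards tails.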

In [None]:
Q9. A study was conducted to determine if there is an association between smoking status (smoker or
non-smoker) and lung cancer diagnosis (yes or no). The results are shown in the contingency table below.
Conduct a chi-square test for independence to determine if there is a significant association between
smoking status and lung cancer diagnosis.

                  Lung Cancer: Yes          Lung Cancer: No
Smoker                  60                       140
Non-smoker              30                       170
Use a significance level of 0.05.

To conduct a chi-square test for independence in Python to determine if there is a significant association between smoking status and lung cancer diagnosis, we can follow these steps:

1. Define the null and alternative hypotheses:
   - Null hypothesis (\(H_0\)): There is no association between smoking status and lung cancer diagnosis.
   - Alternative hypothesis (\(H_1\)): There is an association between smoking status and lung cancer diagnosis.

2. Choose the significance level (\(\alpha\)). Here, it's given as 0.05.

3. Create a contingency table with the observed frequencies.

4. Use the `chi2_contingency` function from the `scipy.stats` module to perform the chi-square test for independence.

5. Obtain the chi-square statistic, p-value, degrees of freedom, and expected frequencies.

6. Compare the p-value to the significance level and determine whether to reject the null hypothesis.

Let's implement this in Python:

from scipy.stats import chi2_contingency

# Define the contingency table
contingency_table = [[60, 140],
                     [30, 170]]

# Perform the chi-square test for independence
chi2_statistic, p_value, dof, expected = chi2_contingency(contingency_table)

# Print the results
print("Chi-square Test Results:")
print(f"Chi-square Statistic: {chi2_statistic}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
for i, row in enumerate(expected):
    for j, val in enumerate(row):
        print(f"Expected frequency for Row {i+1}, Column {j+1}: {val}")

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is an association between smoking status and lung cancer diagnosis.")
else:
    print("Fail to reject the null hypothesis: There is no evidence of an association between smoking status and lung cancer diagnosis.")

In this Python code:
- We define the contingency table with the observed frequencies.
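- We use the `chi2_contingency` function to perform the chi-square test for independence. For a 2x2 table, this function applies Yates' continuity correction by default (`correction=True`), so the statistic is slightly smaller than the uncorrected Pearson value.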
- We print the chi-square statistic, p-value, degrees of freedom, and expected frequencies.
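- Finally, we interpret the results by comparing the p-value to the significance level. If the p-value is less than the significance level, we reject the null hypothesis and conclude that there is an association between smoking status and lung cancer diagnosis. Otherwise, we fail to reject the null hypothesis. For this table the statistic is about 12.06 with 1 degree of freedom and the p-value is about 0.0005, well below 0.05, so we reject the null hypothesis and conclude that smoking status and lung cancer diagnosis are associated.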
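
For 2x2 tables, `chi2_contingency` applies Yates' continuity correction unless told otherwise. The sketch below runs the test both ways to show how much the correction changes the statistic for this table.

```python
from scipy.stats import chi2_contingency

contingency_table = [[60, 140],
                     [30, 170]]

# Default behaviour: continuity correction is applied when dof == 1
corrected = chi2_contingency(contingency_table)
# Plain Pearson chi-square statistic without the correction
uncorrected = chi2_contingency(contingency_table, correction=False)

corrected_stat, corrected_p = corrected[0], corrected[1]
uncorrected_stat, uncorrected_p = uncorrected[0], uncorrected[1]

print(f"With Yates' correction:    chi2 = {corrected_stat:.4f}, p = {corrected_p:.6f}")
print(f"Without Yates' correction: chi2 = {uncorrected_stat:.4f}, p = {uncorrected_p:.6f}")
```

Both versions lead to the same decision at the 0.05 level here; the choice matters more when counts are small or the p-value sits near the threshold.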

In [None]:
Q10. A study was conducted to determine if the proportion of people who prefer milk chocolate, dark chocolate, or white chocolate is different in the U.S. versus the U.K. A random sample of 500 people from the U.S. and a random sample of 500 people from the U.K. were surveyed. The results are shown in the contingency table below. Conduct a chi-square test for independence to determine if there is a significant association between chocolate preference and country of origin.

                            Milk Chocolate             Dark Chocolate                White Chocolate
U.S. (n=500)                   200                         150                            150
U.K. (n=500)                   225                         175                            100
Use a significance level of 0.01.

To conduct a chi-square test for independence in Python to determine if there is a significant association between chocolate preference and country of origin, we can follow the same steps as in the previous question:

1. Define the null and alternative hypotheses.
2. Choose the significance level (\(\alpha\)).
3. Create a contingency table with the observed frequencies.
4. Use the `chi2_contingency` function from the `scipy.stats` module to perform the chi-square test for independence.
5. Obtain the chi-square statistic, p-value, degrees of freedom, and expected frequencies.
6. Compare the p-value to the significance level and determine whether to reject the null hypothesis.

Let's implement this in Python:

from scipy.stats import chi2_contingency

# Define the contingency table
contingency_table = [[200, 150, 150],
                     [225, 175, 100]]

# Perform the chi-square test for independence
chi2_statistic, p_value, dof, expected = chi2_contingency(contingency_table)

# Print the results
print("Chi-square Test Results:")
print(f"Chi-square Statistic: {chi2_statistic}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
for i, row in enumerate(expected):
    for j, val in enumerate(row):
        print(f"Expected frequency for Row {i+1}, Column {j+1}: {val}")

# Interpret the results
alpha = 0.01
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant association between chocolate preference and country of origin.")
else:
    print("Fail to reject the null hypothesis: There is no significant association between chocolate preference and country of origin.")

In this Python code:
- We define the contingency table with the observed frequencies.
- We use the `chi2_contingency` function to perform the chi-square test for independence.
- We print the chi-square statistic, p-value, degrees of freedom, and expected frequencies.
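- Finally, we interpret the results by comparing the p-value to the significance level. If the p-value is less than the significance level, we reject the null hypothesis and conclude that there is a significant association between chocolate preference and country of origin. Otherwise, we fail to reject the null hypothesis. For this table the statistic is about 13.39 with 2 degrees of freedom and the p-value is about 0.0012, which is below 0.01, so we reject the null hypothesis: chocolate preference differs significantly between the two countries.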
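
A significant result does not say which categories differ. One simple follow-up, sketched below, is to look at each cell's contribution \( (O - E)^2 / E \) to the statistic. The row and column labels are added for readability and are not part of the SciPy output.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[200, 150, 150],
                     [225, 175, 100]])
row_labels = ["U.S.", "U.K."]
column_labels = ["Milk", "Dark", "White"]

statistic, p_value, dof, expected = chi2_contingency(observed)

# Per-cell contribution to the chi-square statistic
contributions = (observed - expected) ** 2 / expected

for i, row_label in enumerate(row_labels):
    for j, column_label in enumerate(column_labels):
        print(f"{row_label:5s} {column_label:6s} "
              f"observed={observed[i, j]:4d} expected={expected[i, j]:7.1f} "
              f"contribution={contributions[i, j]:.3f}")

print(f"Total chi-square statistic: {contributions.sum():.4f}")
```

In this table the white chocolate column accounts for most of the statistic, so the difference between countries is driven mainly by that category.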

In [None]:
Q11. A random sample of 30 people was selected from a population with an unknown mean and standard
deviation. The sample mean was found to be 72 and the sample standard deviation was found to be 10.
Conduct a hypothesis test to determine if the population mean is significantly different from 70. Use a
significance level of 0.05.

To conduct a hypothesis test to determine if the population mean is significantly different from a given value (in this case, 70), we can use a one-sample t-test since the population standard deviation is unknown. Here are the steps we'll take:

1. **Set up hypotheses**:
   - Null hypothesis (\(H_0\)): The population mean is equal to 70 (\(\mu = 70\)).
   - Alternative hypothesis (\(H_1\)): The population mean is not equal to 70 (\(\mu \neq 70\)).

2. **Choose the significance level (\(\alpha\))**: Given as 0.05.

3. **Calculate the test statistic**:
   - We'll use the one-sample t-test formula:
     \[ t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}} \]
   where:
     - \(\bar{x}\) is the sample mean,
     - \(\mu_0\) is the hypothesized population mean,
     - \(s\) is the sample standard deviation,
     - \(n\) is the sample size.

4. **Determine the critical value**: We'll find the critical t-value from the t-distribution with \(n-1\) degrees of freedom.

5. **Make a decision**:
   - If the absolute value of the test statistic is greater than the critical value, we reject the null hypothesis.
   - If the absolute value of the test statistic is less than or equal to the critical value, we fail to reject the null hypothesis.

Let's perform these steps in Python:

from scipy.stats import t

# Given data
sample_mean = 72
sample_std_dev = 10
sample_size = 30
population_mean = 70
significance_level = 0.05

# Calculate the test statistic
t_statistic = (sample_mean - population_mean) / (sample_std_dev / (sample_size ** 0.5))

# Find the critical t-value
critical_value = t.ppf(1 - significance_level / 2, df=sample_size - 1)

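# Report the test statistic and critical value
print(f"t-statistic: {t_statistic:.4f}")
print(f"Critical t-value: {critical_value:.4f}")

# Make a decision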
if abs(t_statistic) > critical_value:
    print("Reject the null hypothesis: The population mean is significantly different from 70.")
else:
    print("Fail to reject the null hypothesis: The population mean is not significantly different from 70.")

In this Python code:
- We calculate the test statistic using the provided formula for the one-sample t-test.
- We find the critical t-value using the percent point function (`t.ppf`) from the t-distribution.
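- Based on the comparison of the test statistic and the critical value, we make a decision to reject or fail to reject the null hypothesis. If the absolute value of the test statistic is greater than the critical value, we reject the null hypothesis, indicating that the population mean is significantly different from 70. Otherwise, we fail to reject the null hypothesis. Here the test statistic is about 1.095 and the critical value with 29 degrees of freedom is about 2.045, so we fail to reject the null hypothesis: the sample does not provide sufficient evidence that the population mean differs from 70.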
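
The same decision can be reached with a p-value rather than a critical value. The sketch below computes the two-sided p-value from the t-distribution using only the summary statistics given in the question.

```python
from math import sqrt

from scipy.stats import t

sample_mean = 72
sample_std_dev = 10
sample_size = 30
hypothesized_mean = 70
significance_level = 0.05

degrees_of_freedom = sample_size - 1
t_statistic = (sample_mean - hypothesized_mean) / (sample_std_dev / sqrt(sample_size))

# Two-sided p-value: probability of a |t| at least this large under H0
p_value = 2 * t.sf(abs(t_statistic), df=degrees_of_freedom)

print(f"t-statistic: {t_statistic:.4f}")
print(f"Two-sided p-value: {p_value:.4f}")
print("Reject H0" if p_value < significance_level else "Fail to reject H0")
```

Doubling the upper-tail probability is what makes the test two-sided, matching the alternative hypothesis that the mean is not equal to 70.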