Q1. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation
of 5 using Python. Interpret the results.

ANS:
  
To calculate the 95% confidence interval for a population mean from a sample's mean and standard deviation, you can use the formula:

Confidence Interval = Mean ± (Z * (Standard Deviation / √(Sample Size)))

Where:
- Mean: The mean of the sample data (given as 50 in this case).
- Z: The critical value from the standard normal distribution for the desired confidence level. For a 95% confidence level, Z is approximately 1.96.
- Standard Deviation: The standard deviation of the sample data (given as 5 in this case).
- Sample Size: The number of data points in the sample. The question does not give a sample size, so we assume n = 30 for illustration; the width of the interval depends on this choice.

Now let's calculate the 95% confidence interval using Python:

```python
import math

# Given data
mean = 50
std_deviation = 5
sample_size = 30  # Assumed for illustration; replace with the actual sample size when it is known.

# 95% confidence level corresponds to Z = 1.96
Z = 1.96

# Calculate the margin of error
margin_of_error = Z * (std_deviation / math.sqrt(sample_size))

# Calculate the confidence interval
lower_bound = mean - margin_of_error
upper_bound = mean + margin_of_error

# Print the confidence interval
print(f"95% Confidence Interval: ({lower_bound:.2f}, {upper_bound:.2f})")
```

Interpretation:
The 95% confidence interval for the population mean is approximately (48.21, 51.79), given the assumed sample size of 30. This means that we are 95% confident that the true population mean lies within this interval. In other words, if we were to take many samples and calculate a confidence interval from each, we would expect about 95% of these intervals to contain the true population mean and about 5% to miss it.
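
The "95% of intervals" reading can be checked with a small simulation. The sketch below is only an illustration: the population values (`true_mean`, `sigma`), the number of trials, and the random seed are assumptions chosen to mirror the numbers in this question, and the standard deviation is treated as known so that the Z formula applies exactly.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

true_mean = 50    # assumed population mean (hypothetical)
sigma = 5         # assumed known population standard deviation
n = 30            # sample size of each simulated study
z = 1.96          # critical value for 95% confidence
n_trials = 2000   # number of simulated samples

# Margin of error is the same for every sample because sigma is treated as known
margin = z * sigma / np.sqrt(n)

# Draw n_trials samples of size n and compute each sample mean
sample_means = rng.normal(true_mean, sigma, size=(n_trials, n)).mean(axis=1)

# An interval contains the true mean when the sample mean is within one margin of it
covered = np.abs(sample_means - true_mean) <= margin
coverage = covered.mean()

print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95, which is what the confidence level promises about the procedure rather than about any single interval.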

Q2. Conduct a chi-square goodness of fit test to determine if the distribution of colors of M&Ms in a bag
matches the expected distribution of 20% blue, 20% orange, 20% green, 10% yellow, 10% red, and 20%
brown. Use Python to perform the test with a significance level of 0.05.

ANS:
    
To conduct a chi-square goodness-of-fit test in Python, you can use `scipy.stats.chisquare`. Set up the observed counts and the expected counts, run the test, and compare the p-value with the significance level. The question does not supply observed counts, so the code below uses illustrative values:

```python
import scipy.stats as stats

# Observed counts per color (blue, orange, green, yellow, red, brown).
# These are illustrative values; replace them with the counts from your own bag.
observed_freq = [18, 22, 15, 11, 12, 22]

# Expected proportions for each color (must sum to 1)
expected_props = [0.2, 0.2, 0.2, 0.1, 0.1, 0.2]

# Total number of M&Ms in the bag (sample size)
total_count = sum(observed_freq)

# Convert the expected proportions into expected counts for this sample size
expected_freq = [prop * total_count for prop in expected_props]

# Perform the chi-square test
chi2_stat, p_value = stats.chisquare(f_obs=observed_freq, f_exp=expected_freq)

# Print the results
print(f"Chi-square statistic: {chi2_stat:.2f}")
print(f"P-value: {p_value:.4f}")

# Check for statistical significance at a 0.05 significance level
alpha = 0.05
if p_value <= alpha:
    print("Reject H0: the color distribution differs significantly from the expected distribution.")
else:
    print("Fail to reject H0: the data are consistent with the expected distribution.")
```

In this example, the observed counts are stored in `observed_freq` and the expected counts in `expected_freq`. The test is performed with `scipy.stats.chisquare()`, which returns the chi-square statistic and the p-value.

If the p-value is less than or equal to the significance level (0.05 in this case), we reject the null hypothesis and conclude that the distribution of colors in the bag differs from the expected distribution. If the p-value is greater than 0.05, we fail to reject the null hypothesis: the data are consistent with the expected distribution, although this does not prove that the distribution matches exactly. For the illustrative counts used here, the statistic is 2.35 with 5 degrees of freedom and the p-value is about 0.80, so we fail to reject the null hypothesis.
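
To see what `chisquare()` computes, the statistic can also be worked out by hand as the sum of (observed - expected)^2 / expected over the categories. The sketch below reuses the illustrative counts from the answer above; the variable names are only for this example.

```python
import numpy as np
from scipy import stats

# Illustrative counts from the answer above (blue, orange, green, yellow, red, brown)
observed = np.array([18, 22, 15, 11, 12, 22])
proportions = np.array([0.2, 0.2, 0.2, 0.1, 0.1, 0.2])

# Expected counts for a bag of this size
expected = proportions * observed.sum()

# Chi-square statistic: sum of (O - E)^2 / E over all categories
chi2_manual = np.sum((observed - expected) ** 2 / expected)

# Degrees of freedom: number of categories minus one
dof = len(observed) - 1

# p-value: right-tail probability of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, dof)

print(f"Chi-square statistic: {chi2_manual:.2f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_manual:.4f}")
```

The values should agree with the output of `stats.chisquare()` for the same inputs.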

Q3. Use Python to calculate the chi-square statistic and p-value for a contingency table with the following
data:
    
| Outcome   | Group A | Group B |
|-----------|---------|---------|
| Outcome 1 | 20      | 15      |
| Outcome 2 | 10      | 25      |
| Outcome 3 | 15      | 20      |

Interpret the results of the test.

ANS:
    
 To calculate the chi-square statistic and p-value for a contingency table in Python, you can use the `scipy.stats` library once again. The contingency table represents the observed frequencies of the outcomes for two groups (Group A and Group B). Here's how you can perform the chi-square test and interpret the results:

```python
import numpy as np
import scipy.stats as stats

# Create the contingency table
observed = np.array([[20, 15],
                     [10, 25],
                     [15, 20]])

# Perform the chi-square test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Print the results
print(f"Chi-square statistic: {chi2_stat:.2f}")
print(f"P-value: {p_value:.4f}")

# Check for statistical significance at a 0.05 significance level
alpha = 0.05
if p_value <= alpha:
    print("There is a significant association between outcome and group.")
else:
    print("There is no significant association between outcome and group.")
```

In this example, we create a 3x2 contingency table of the observed frequencies for each outcome in Group A and Group B. The test is performed with `scipy.stats.chi2_contingency()`, which returns the chi-square statistic, the p-value, the degrees of freedom (dof), and the expected frequencies under the assumption of independence.

The null hypothesis is that the outcome is independent of the group. If the p-value is less than or equal to the significance level (0.05 in this case), we reject the null hypothesis and conclude that there is a significant association between outcome and group. If the p-value is greater than 0.05, we fail to reject the null hypothesis.

Interpretation:
The test gives a chi-square statistic of about 5.83 with 2 degrees of freedom and a p-value of about 0.054. Because the p-value is slightly above 0.05, we fail to reject the null hypothesis: the differences in outcome frequencies between the two groups could be due to random chance. The result is close to the threshold, so it is better read as insufficient evidence of an association than as evidence that outcome and group are independent.
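
The expected frequencies that `chi2_contingency()` returns come from the row and column totals: each expected cell is (row total × column total) / grand total. A minimal sketch of that calculation for the table above, with variable names chosen only for this example:

```python
import numpy as np
from scipy import stats

observed = np.array([[20, 15],
                     [10, 25],
                     [15, 20]])

# Marginal totals
row_totals = observed.sum(axis=1)   # one total per outcome
col_totals = observed.sum(axis=0)   # one total per group
grand_total = observed.sum()

# Expected counts under independence: (row total * column total) / grand total
expected_manual = np.outer(row_totals, col_totals) / grand_total

# Chi-square statistic and degrees of freedom, (rows - 1) * (columns - 1)
chi2_manual = np.sum((observed - expected_manual) ** 2 / expected_manual)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

print("Expected counts:")
print(expected_manual)
print(f"Chi-square statistic: {chi2_manual:.2f}")
print(f"Degrees of freedom: {dof_manual}")
print(f"P-value: {p_manual:.4f}")
```

Every row has the same total here, so each row has the same expected counts (15 in Group A and 20 in Group B).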

Q4. A study of the prevalence of smoking in a population of 500 individuals found that 60 individuals
smoked. Use Python to calculate the 95% confidence interval for the true proportion of individuals in the
population who smoke.

ANS:
    
To calculate the 95% confidence interval for the true proportion of individuals in the population who smoke, you can use the formula for the confidence interval of a proportion. The formula is:

Confidence Interval = (p_hat - Z * sqrt((p_hat * (1 - p_hat)) / n), p_hat + Z * sqrt((p_hat * (1 - p_hat)) / n))

Where:
- p_hat: The sample proportion of individuals who smoke (p_hat = 60 / 500 in this case).
- Z: The critical value from the standard normal distribution for the desired confidence level. For a 95% confidence level, Z is approximately 1.96.
- n: The sample size (the number of individuals surveyed, 500 in this case).

Let's calculate the 95% confidence interval using Python:

```python
import math
import scipy.stats as stats  # needed for stats.norm.ppf below

# Given data
sample_size = 500
smokers = 60
confidence_level = 0.95

# Calculate the sample proportion of smokers
p_hat = smokers / sample_size

# Calculate the critical value (Z-score) for the given confidence level
Z = stats.norm.ppf(1 - (1 - confidence_level) / 2)

# Calculate the standard error of the proportion
standard_error = math.sqrt((p_hat * (1 - p_hat)) / sample_size)

# Calculate the confidence interval
lower_bound = p_hat - Z * standard_error
upper_bound = p_hat + Z * standard_error

# Make sure the bounds are within [0, 1]
lower_bound = max(0, lower_bound)
upper_bound = min(1, upper_bound)

# Print the confidence interval
print(f"95% Confidence Interval for the true proportion of smokers: ({lower_bound:.4f}, {upper_bound:.4f})")
```

Interpretation:
The 95% confidence interval for the true proportion of individuals in the population who smoke is approximately (0.0915, 0.1485), around the sample proportion of 0.12. This means that we are 95% confident that the true proportion of smokers in the population lies within this interval. In other words, if we were to take many samples and calculate a confidence interval from each, we would expect about 95% of these intervals to contain the true proportion of smokers and about 5% to miss it.
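
The formula above is the normal-approximation (Wald) interval, which is why the code has to clip the bounds to [0, 1]. As a cross-check, the sketch below computes the Wilson score interval for the same data. This is an alternative method and not the one the answer uses; it needs only the standard library, and the variable names are chosen for this example.

```python
import math
from statistics import NormalDist

smokers = 60
n = 500
confidence_level = 0.95

p_hat = smokers / n

# Two-sided critical value from the standard normal distribution
z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)

# Wilson score interval: the center is pulled slightly toward 0.5
denominator = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denominator
half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denominator

wilson_lower = center - half_width
wilson_upper = center + half_width

print(f"Wilson 95% interval: ({wilson_lower:.4f}, {wilson_upper:.4f})")
```

With a sample of 500 the two methods give similar intervals; the difference matters more for small samples or proportions near 0 or 1.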

Q5. Calculate the 90% confidence interval for a sample of data with a mean of 75 and a standard deviation
of 12 using Python. Interpret the results.

ANS:
    
To calculate the 90% confidence interval for a population mean from a sample's mean and standard deviation, you can use the same formula as in Q1:

Confidence Interval = Mean ± (Z * (Standard Deviation / √(Sample Size)))

Where:
- Mean: The mean of the sample data (given as 75 in this case).
- Z: The critical value from the standard normal distribution for the desired confidence level. For a 90% confidence level, Z is approximately 1.645.
- Standard Deviation: The standard deviation of the sample data (given as 12 in this case).
- Sample Size: The number of data points in the sample. It is not given, so we again assume n = 30 for illustration.

Let's calculate the 90% confidence interval using Python:

```python
import math

# Given data
mean = 75
std_deviation = 12
sample_size = 30  # Assumed for illustration; replace with the actual sample size when it is known.

# 90% confidence level corresponds to Z = 1.645
Z = 1.645

# Calculate the margin of error
margin_of_error = Z * (std_deviation / math.sqrt(sample_size))

# Calculate the confidence interval
lower_bound = mean - margin_of_error
upper_bound = mean + margin_of_error

# Print the confidence interval
print(f"90% Confidence Interval: ({lower_bound:.2f}, {upper_bound:.2f})")
```

Interpretation:
The 90% confidence interval for the population mean is approximately (71.40, 78.60), given the assumed sample size of 30. This means that we are 90% confident that the true population mean lies within this interval. In other words, if we were to take many samples and calculate a confidence interval from each, we would expect about 90% of these intervals to contain the true population mean and about 10% to miss it. A 90% interval is narrower than a 95% interval computed from the same data, because a lower confidence level uses a smaller critical value (1.645 instead of 1.96); the narrower range comes at the cost of less confidence that it contains the true mean.
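
To make the link between confidence level and interval width concrete, the sketch below computes the interval at 90%, 95%, and 99% for the same data. It keeps the assumed sample size of 30 and takes the critical values from the standard library's `NormalDist` instead of hard-coding them.

```python
import math
from statistics import NormalDist

mean = 75
std_deviation = 12
sample_size = 30  # assumed, as in the answer above

standard_error = std_deviation / math.sqrt(sample_size)

intervals = {}
for confidence_level in (0.90, 0.95, 0.99):
    # Two-sided critical value for this confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * standard_error
    intervals[confidence_level] = (mean - margin, mean + margin)
    print(f"{confidence_level:.0%}: Z = {z:.3f}, "
          f"interval = ({mean - margin:.2f}, {mean + margin:.2f})")
```

The interval widens as the confidence level rises, so the 90% interval is the narrowest of the three.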

Q6. Use Python to plot the chi-square distribution with 10 degrees of freedom. Label the axes and shade the
area corresponding to a chi-square statistic of 15.

ANS:
    
To plot the chi-square distribution with 10 degrees of freedom in Python, you can use the `scipy.stats` module to generate the distribution and the `matplotlib` library to create the plot. To shade the area corresponding to a chi-square statistic of 15, you can use the `fill_between` function from `matplotlib` to fill the area under the curve between the specified x-values.

Here's the Python code to achieve this:

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Degrees of freedom for the chi-square distribution
df = 10

# Create a range of x-values for the chi-square distribution
x = np.linspace(0, 30, 1000)

# Calculate the probability density function (PDF) for the chi-square distribution
pdf = stats.chi2.pdf(x, df)

# Plot the chi-square distribution
plt.plot(x, pdf, label=f"Degrees of Freedom = {df}")
plt.xlabel("Chi-Square Statistic")
plt.ylabel("Probability Density Function")
plt.title("Chi-Square Distribution with 10 Degrees of Freedom")

# Shade the area corresponding to a chi-square statistic of 15
x_fill = np.linspace(0, 15, 1000)
plt.fill_between(x_fill, stats.chi2.pdf(x_fill, df), alpha=0.5, label="Area for Chi-Square Statistic of 15", color='orange')

plt.legend()
plt.grid()
plt.show()
```

In this code, we use `scipy.stats.chi2.pdf()` to calculate the probability density function (PDF) of the chi-square distribution with 10 degrees of freedom for a range of x-values. We then use `matplotlib` to plot the PDF and label the axes. The `fill_between` function is used to shade the area under the curve between x-values of 0 and 15, representing the chi-square statistic of 15.

When you run this code, it displays the chi-square density with 10 degrees of freedom, with the area to the left of 15 shaded in orange. That shaded area is the cumulative probability P(X ≤ 15), roughly 0.87. The unshaded right tail beyond 15, roughly 0.13, is the p-value that a test statistic of 15 would have. To shade that tail instead, build `x_fill` from 15 to 30.
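
The areas on either side of the statistic can be computed directly, which helps when reading the plot. A short sketch using the cumulative distribution function and the survival function of `scipy.stats.chi2`:

```python
from scipy import stats

df = 10
chi2_value = 15

# Area under the density to the left of 15: P(X <= 15)
left_area = stats.chi2.cdf(chi2_value, df)

# Area to the right of 15: P(X > 15), the p-value of a statistic of 15
right_area = stats.chi2.sf(chi2_value, df)

print(f"P(X <= 15) = {left_area:.4f}")
print(f"P(X > 15)  = {right_area:.4f}")
```

The two areas sum to 1, so either one determines the other.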

Q7. A random sample of 1000 people was asked if they preferred Coke or Pepsi. Of the sample, 520
preferred Coke. Calculate a 99% confidence interval for the true proportion of people in the population who
prefer Coke.

ANS:
    
To calculate the 99% confidence interval for the true proportion of people in the population who prefer Coke, you can use the same formula for the confidence interval of a proportion as shown in previous answers:

Confidence Interval = (p_hat - Z * sqrt((p_hat * (1 - p_hat)) / n), p_hat + Z * sqrt((p_hat * (1 - p_hat)) / n))

Where:
- p_hat: The sample proportion of people who prefer Coke (p_hat = 520 / 1000 in this case).
- Z: The critical value from the standard normal distribution for the desired confidence level. For a 99% confidence level, Z is approximately 2.576.
- n: The sample size (number of people in the sample).

Let's calculate the 99% confidence interval using Python:

```python
import math
import scipy.stats as stats  # needed for stats.norm.ppf below

# Given data
sample_size = 1000
coke_preferred = 520
confidence_level = 0.99

# Calculate the sample proportion of people who prefer Coke
p_hat = coke_preferred / sample_size

# Calculate the critical value (Z-score) for the given confidence level
Z = stats.norm.ppf(1 - (1 - confidence_level) / 2)

# Calculate the standard error of the proportion
standard_error = math.sqrt((p_hat * (1 - p_hat)) / sample_size)

# Calculate the confidence interval
lower_bound = p_hat - Z * standard_error
upper_bound = p_hat + Z * standard_error

# Make sure the bounds are within [0, 1]
lower_bound = max(0, lower_bound)
upper_bound = min(1, upper_bound)

# Print the confidence interval
print(f"99% Confidence Interval for the true proportion of people who prefer Coke: ({lower_bound:.4f}, {upper_bound:.4f})")
```

Interpretation:
The 99% confidence interval for the true proportion of people in the population who prefer Coke is approximately (0.4793, 0.5607), around the sample proportion of 0.52. This means that we are 99% confident that the true proportion of people who prefer Coke lies within this interval. In other words, if we were to take many samples and calculate a confidence interval from each, we would expect about 99% of these intervals to contain the true proportion and about 1% to miss it. Because the interval includes 0.5, the data do not show at the 99% level that a majority prefers Coke.
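
Since the same proportion formula appears in more than one answer, it can be wrapped in a small helper. The function name `proportion_ci` and its signature are made up for this sketch; it uses the standard library's `NormalDist` for the critical value, so it does not depend on SciPy.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence_level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    # Two-sided critical value for the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clip to [0, 1] because a proportion cannot fall outside that range
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Coke preference: 520 of 1000, 99% confidence
coke_lower, coke_upper = proportion_ci(520, 1000, confidence_level=0.99)
print(f"Coke, 99%: ({coke_lower:.4f}, {coke_upper:.4f})")

# Smoking prevalence: 60 of 500, 95% confidence
smoke_lower, smoke_upper = proportion_ci(60, 500, confidence_level=0.95)
print(f"Smokers, 95%: ({smoke_lower:.4f}, {smoke_upper:.4f})")
```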

Q8. A researcher hypothesizes that a coin is biased towards tails. They flip the coin 100 times and observe
45 tails. Conduct a chi-square goodness of fit test to determine if the observed frequencies match the
expected frequencies of a fair coin. Use a significance level of 0.05.

ANS:
    
    
To conduct a chi-square goodness-of-fit test in Python to determine if the observed frequencies match the expected frequencies of a fair coin, you can use the `scipy.stats` library. The test compares the observed frequencies (45 tails and 55 heads) with the expected frequencies for a fair coin (50 tails and 50 heads) to see if there is a significant difference between the observed and expected outcomes.

Here's how you can perform the chi-square test in Python:

```python
import scipy.stats as stats

# Given data
observed_tails = 45
total_flips = 100

# Expected frequency for tails in a fair coin
expected_tails = total_flips * 0.5

# Calculate the expected frequency for heads (for a fair coin)
expected_heads = total_flips * 0.5

# Observed and expected frequencies as arrays
observed_freq = [observed_tails, total_flips - observed_tails]
expected_freq = [expected_tails, expected_heads]

# Perform the chi-square test
chi2_stat, p_value = stats.chisquare(f_obs=observed_freq, f_exp=expected_freq)

# Print the results
print(f"Chi-square statistic: {chi2_stat:.2f}")
print(f"P-value: {p_value:.4f}")

# Check for statistical significance at a 0.05 significance level
alpha = 0.05
if p_value <= alpha:
    print("Reject H0: the observed frequencies differ significantly from those of a fair coin.")
else:
    print("Fail to reject H0: the observed frequencies are consistent with a fair coin.")
```

In this code, we calculate the expected frequencies for tails and heads under the assumption of a fair coin. Then we use `scipy.stats.chisquare()` to perform the test, which returns the chi-square statistic and the p-value.

If the p-value is less than or equal to the significance level (0.05 in this case), we reject the null hypothesis and conclude that the observed frequencies differ significantly from those of a fair coin. Otherwise, we fail to reject the null hypothesis. Here the statistic is 1.00 with 1 degree of freedom and the p-value is about 0.317, so we fail to reject the null hypothesis: the observed frequencies are consistent with a fair coin. Note that the chi-square test is non-directional, so on its own it cannot show a bias towards tails specifically. In this sample tails came up fewer than 50 times, so the data give no support to the researcher's hypothesis.
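
Because the researcher's hypothesis is directional (biased towards tails), a one-sided test is a useful complement to the chi-square test. The sketch below uses the exact binomial test, which is a different technique from the one the question asks for. It assumes a SciPy version that provides `scipy.stats.binomtest` (added in SciPy 1.7).

```python
from scipy import stats

observed_tails = 45
total_flips = 100

# One-sided exact binomial test.
# H0: P(tails) = 0.5, H1: P(tails) > 0.5
result = stats.binomtest(observed_tails, n=total_flips, p=0.5, alternative="greater")
one_sided_p = result.pvalue

print(f"One-sided p-value: {one_sided_p:.4f}")

alpha = 0.05
if one_sided_p <= alpha:
    print("Reject H0: the data support a bias towards tails.")
else:
    print("Fail to reject H0: no evidence of a bias towards tails.")
```

With only 45 tails in 100 flips the one-sided p-value is large, which matches the conclusion from the chi-square test.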

Q9. A study was conducted to determine if there is an association between smoking status (smoker or
non-smoker) and lung cancer diagnosis (yes or no). The results are shown in the contingency table below.
Conduct a chi-square test for independence to determine if there is a significant association between
smoking status and lung cancer diagnosis.

| Smoking status | Lung Cancer: Yes | Lung Cancer: No |
|----------------|------------------|-----------------|
| Smoker         | 60               | 140             |
| Non-smoker     | 30               | 170             |


Use a significance level of 0.05.

ANS:
    
   To conduct a chi-square test for independence between smoking status and lung cancer diagnosis, you can use the `scipy.stats` library in Python. The chi-square test for independence is used to determine if there is a significant association between two categorical variables.

Here's how you can perform the chi-square test in Python:

```python
import numpy as np
import scipy.stats as stats

# Create the contingency table
observed = np.array([[60, 140],
                     [30, 170]])

# Perform the chi-square test for independence
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Print the results
print(f"Chi-square statistic: {chi2_stat:.2f}")
print(f"P-value: {p_value:.4f}")

# Check for statistical significance at a 0.05 significance level
alpha = 0.05
if p_value <= alpha:
    print("There is a significant association between smoking status and lung cancer diagnosis.")
else:
    print("There is no significant association between smoking status and lung cancer diagnosis.")
```

In this code, we create the 2x2 contingency table of observed frequencies for lung cancer diagnosis (Yes/No) by smoking status (Smoker/Non-smoker). We then use `scipy.stats.chi2_contingency()` to perform the chi-square test for independence, which returns the chi-square statistic, the p-value, the degrees of freedom (dof), and the expected frequencies under the assumption of independence. For a 2x2 table this function applies Yates' continuity correction by default.

If the p-value is less than or equal to the significance level (0.05 in this case), we reject the null hypothesis and conclude that there is a significant association between smoking status and lung cancer diagnosis. Otherwise, we fail to reject the null hypothesis. Here the statistic is about 12.06 with 1 degree of freedom and the p-value is about 0.0005, well below 0.05, so we reject the null hypothesis: smoking status and lung cancer diagnosis are significantly associated in this sample.
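
For a 2x2 table, `chi2_contingency()` applies Yates' continuity correction unless told otherwise, so its statistic is slightly smaller than a hand calculation with the plain Pearson formula. The sketch below runs the test both ways using the function's `correction` parameter.

```python
import numpy as np
from scipy import stats

observed = np.array([[60, 140],
                     [30, 170]])

# Default behaviour: Yates' continuity correction is applied (dof = 1)
chi2_yates, p_yates, dof, _ = stats.chi2_contingency(observed)

# Plain Pearson chi-square without the correction
chi2_plain, p_plain, _, _ = stats.chi2_contingency(observed, correction=False)

print(f"With Yates' correction: chi2 = {chi2_yates:.2f}, p = {p_yates:.4f}")
print(f"Without correction:     chi2 = {chi2_plain:.2f}, p = {p_plain:.4f}")
```

Both versions lead to the same decision at the 0.05 level for this table.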

Q10. A study was conducted to determine if the proportion of people who prefer milk chocolate, dark
chocolate, or white chocolate is different in the U.S. versus the U.K. A random sample of 500 people from
the U.S. and a random sample of 500 people from the U.K. were surveyed. The results are shown in the
contingency table below. Conduct a chi-square test for independence to determine if there is a significant
association between chocolate preference and country of origin.

| Country      | Milk Chocolate | Dark Chocolate | White Chocolate |
|--------------|----------------|----------------|-----------------|
| U.S. (n=500) | 200            | 150            | 150             |
| U.K. (n=500) | 225            | 175            | 100             |

Use a significance level of 0.01.

ANS:
    
To conduct a chi-square test for independence to determine if there is a significant association between chocolate preference and country of origin, you can use the `scipy.stats` library in Python. The chi-square test for independence is used to compare the observed frequencies in the contingency table with the expected frequencies assuming independence between the two categorical variables.

Here's how you can perform the chi-square test in Python:

```python
import numpy as np
import scipy.stats as stats

# Create the contingency table
observed = np.array([[200, 150, 150],
                     [225, 175, 100]])

# Perform the chi-square test for independence
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Print the results
print(f"Chi-square statistic: {chi2_stat:.2f}")
print(f"P-value: {p_value:.4f}")

# Check for statistical significance at a 0.01 significance level
alpha = 0.01
if p_value <= alpha:
    print("There is a significant association between chocolate preference and country of origin.")
else:
    print("There is no significant association between chocolate preference and country of origin.")
```

In this code, we create the 2x3 contingency table of observed frequencies for chocolate preference (Milk/Dark/White) in the U.S. and the U.K. We then use `scipy.stats.chi2_contingency()` to perform the chi-square test for independence, which returns the chi-square statistic, the p-value, the degrees of freedom (dof), and the expected frequencies under the assumption of independence.

If the p-value is less than or equal to the significance level (0.01 in this case), we reject the null hypothesis and conclude that there is a significant association between chocolate preference and country of origin. Otherwise, we fail to reject the null hypothesis. Here the statistic is about 13.39 with 2 degrees of freedom and the p-value is about 0.0012, which is below 0.01, so we reject the null hypothesis: chocolate preference differs significantly between the U.S. and the U.K. samples.
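
A significant result says that preference and country are associated, but not which cells are responsible. One common follow-up, sketched here as an optional extra, is to look at the Pearson residuals (observed - expected) / sqrt(expected); their squares sum to the chi-square statistic.

```python
import numpy as np
from scipy import stats

# Rows: U.S., U.K.  Columns: milk, dark, white
observed = np.array([[200, 150, 150],
                     [225, 175, 100]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

# Pearson residuals: positive means more observations than expected
residuals = (observed - expected) / np.sqrt(expected)

# Each cell's contribution to the chi-square statistic
contributions = residuals ** 2

print("Pearson residuals:")
print(np.round(residuals, 2))
print("Contribution of each cell to the statistic:")
print(np.round(contributions, 2))
```

The largest residuals are in the white chocolate column, so most of the difference between the two countries comes from that category.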

Q11. A random sample of 30 people was selected from a population with an unknown mean and standard
deviation. The sample mean was found to be 72 and the sample standard deviation was found to be 10.
Conduct a hypothesis test to determine if the population mean is significantly different from 70. Use a
significance level of 0.05.

ANS:
    
    
 To conduct a hypothesis test to determine if the population mean is significantly different from 70, you can use a one-sample t-test in Python. The t-test is appropriate when the sample size is small, and the population standard deviation is unknown.

The null hypothesis (H0) is that the population mean is equal to 70, and the alternative hypothesis (H1) is that the population mean is not equal to 70 (a two-tailed test).

Here's how you can perform the one-sample t-test in Python:

```python
import scipy.stats as stats

# Given data
sample_mean = 72
sample_std_dev = 10
sample_size = 30
population_mean = 70
significance_level = 0.05

# Calculate the t-statistic
t_stat = (sample_mean - population_mean) / (sample_std_dev / (sample_size ** 0.5))

# Calculate the degrees of freedom
degrees_of_freedom = sample_size - 1

# Calculate the critical t-value for the given significance level and degrees of freedom
critical_t = stats.t.ppf(1 - (significance_level / 2), degrees_of_freedom)

# Calculate the p-value (two-tailed test)
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), degrees_of_freedom))

# Print the results
print(f"T-statistic: {t_stat:.2f}")
print(f"Critical t-value (two-tailed): {critical_t:.3f}")
print(f"P-value: {p_value:.4f}")

# Check for statistical significance at the 0.05 significance level
if p_value <= significance_level:
    print("The population mean is significantly different from 70.")
else:
    print("There is no significant difference between the population mean and 70.")
```

In this code, we calculate the t-statistic using the formula for a one-sample t-test, and then calculate the degrees of freedom. We also find the critical t-value for the given significance level and degrees of freedom using `scipy.stats.t.ppf()`. Finally, we calculate the p-value for the two-tailed test using `scipy.stats.t.cdf()` and compare it to the significance level to reach a conclusion.

If the p-value is less than or equal to the significance level (0.05 in this case), we reject the null hypothesis and conclude that the population mean is significantly different from 70. Otherwise, we fail to reject the null hypothesis. Here the t-statistic is about 1.10 with 29 degrees of freedom, which is smaller than the critical value of about 2.045, and the p-value is about 0.28. We therefore fail to reject the null hypothesis: the sample does not provide significant evidence that the population mean differs from 70.
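
The same decision can be read off a confidence interval: a two-tailed test at the 0.05 level rejects H0 exactly when the hypothesized mean falls outside the 95% t-interval for the mean. A minimal sketch of that check, with variable names chosen for this example:

```python
import math
from scipy import stats

sample_mean = 72
sample_std_dev = 10
sample_size = 30
hypothesized_mean = 70

standard_error = sample_std_dev / math.sqrt(sample_size)
dof = sample_size - 1

# Two-sided critical t-value for 95% confidence
t_crit = stats.t.ppf(0.975, dof)

ci_lower = sample_mean - t_crit * standard_error
ci_upper = sample_mean + t_crit * standard_error
contains_hypothesized = ci_lower <= hypothesized_mean <= ci_upper

print(f"95% t-interval for the mean: ({ci_lower:.2f}, {ci_upper:.2f})")
print(f"Interval contains {hypothesized_mean}: {contains_hypothesized}")
```

The interval contains 70, which agrees with failing to reject the null hypothesis.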