# Module Three Discussion: Confidence Intervals and Hypothesis Testing

This notebook contains the step-by-step directions for your Module Three discussion. Run the steps in order, because some steps depend on the outputs of earlier steps. Once you have completed the steps in this notebook, be sure to answer the questions about this activity in the discussion for this module.

Reminder: If you have not already reviewed the discussion prompt, please do so before beginning this activity. That will give you an idea of the questions you will need to answer with the outputs of this script.


## Objectives
> Use the link in the Jupyter Notebook activity to access your Python script. Once you have made your calculations, complete this discussion. The script will output answers to the questions given below. You must attach your Python script output as an HTML file and respond to the questions below.

In this discussion, you will apply the statistical concepts and techniques covered in this week's reading to calculate a confidence interval and perform hypothesis testing for a manufacturing process.

The manufacturing process at a factory produces ball bearings that are sold to automotive manufacturers. The factory wants to estimate the average diameter of an in-demand ball bearing to ensure that it is manufactured within the specifications. Suppose they plan to collect a sample of 50 ball bearings and measure their diameters to construct 90% and 99% confidence intervals for the average diameter of ball bearings produced from this manufacturing process.

The sample of size 50 was generated using Python's numpy module. This data set will be unique to you, and therefore your answers will be unique as well. Run Step 1 in the Python script to generate your unique sample data. Check to make sure your sample data is shown in your attachment.

In your initial post, address the following items. Be sure to answer the questions about both confidence intervals and hypothesis testing.

- In the Python script, you used the sample data to construct a 90% and a 99% confidence interval for the average diameter of ball bearings produced from this manufacturing process. These confidence intervals were created using the Normal distribution based on the assumption that the population standard deviation is known and the sample size is sufficiently large. Report these confidence intervals rounded to two decimal places. See Step 2 in the Python script.
- Interpret both confidence intervals. Make sure to be detailed and precise in your interpretation.
- It has been claimed from previous studies that the average diameter of ball bearings from this manufacturing process is 2.30 cm. Based on the sample of 50 that you collected, is there evidence to suggest that the average diameter is greater than 2.30 cm? Perform a hypothesis test for the population mean at alpha = 0.01.

In your initial post, address the following items:

- Define the null and alternative hypotheses for this test in mathematical terms and in words.
- Report the level of significance.
- Include the test statistic and the P-value. See Step 3 in the Python script. (Note that the z-test method used in Step 3 returns a two-tailed P-value. You must report the correct P-value based on the alternative hypothesis.)
- Provide your conclusion and interpretation of the results. Should the null hypothesis be rejected? Why or why not?

In your follow-up posts to other students, review your peers' calculations and provide some analysis and interpretation:

- How do their confidence intervals compare with yours?
- If the population standard deviation is unknown and the sample size is not sufficiently large, would you still use the Normal distribution to calculate these confidence intervals, or would you choose another distribution? If the latter, which distribution would you choose?

Remember to attach your Python output and respond to all questions in your initial and follow-up posts. Be sure to clearly communicate your ideas using appropriate terminology.

To complete this assignment, review the Discussion Rubric.


## Initial post (due Thursday)
_____________________________________________________________________________________________________________________________________________________

### Step 1: Generating sample data
This block of Python code will generate a sample of size 50 that you will use in this discussion. Because the values are drawn at random, your sample will be unique and therefore your answers will be unique as well. The numpy module in Python allows you to create a data set using a Normal distribution. Note that the mean and standard deviation were chosen for you. The data set will be saved in a pandas dataframe that will be used in later calculations.

Click the block of code below and hit the **Run** button above. 

In [1]:
import pandas as pd
import numpy as np
import math
import scipy.stats as st

# create 50 randomly chosen values from a Normal distribution. (arbitrarily using mean=2.48 and standard deviation=0.50). 
diameters = np.random.normal(2.4800,0.500,50)

# convert the array into a dataframe with the column name "diameters" using pandas library.
diameters_df = pd.DataFrame(diameters, columns=['diameters'])
diameters_df = diameters_df.round(2)

# print the dataframe (note that the index of dataframe starts at 0).
print("Diameters data frame\n")
print(diameters_df)

Diameters data frame

    diameters
0        2.30
1        2.74
2        2.32
3        1.95
4        2.11
5        2.60
6        2.07
7        2.35
8        2.56
9        2.40
10       1.50
11       2.50
12       2.03
13       3.54
14       2.84
15       3.30
16       2.40
17       1.87
18       1.93
19       2.74
20       1.82
21       2.18
22       3.66
23       1.72
24       1.77
25       2.30
26       2.19
27       2.72
28       2.02
29       1.87
30       2.50
31       2.90
32       1.61
33       3.08
34       2.07
35       2.08
36       2.25
37       2.95
38       2.25
39       2.53
40       3.50
41       2.34
42       2.90
43       2.46
44       2.43
45       2.60
46       2.19
47       2.99
48       2.45
49       2.59


### Step 2: Constructing confidence intervals
You will assume that the population standard deviation is known and that the sample size is sufficiently large. Then you will use the Normal distribution to construct these confidence intervals. You will use the submodule scipy.stats to construct confidence intervals using your sample data. 

Click the block of code below and hit the **Run** button above. 

In [2]:
# Python methods that calculate confidence intervals require the sample mean and the standard error as inputs.

# calculate the sample mean
mean = diameters_df['diameters'].mean()

# input the population standard deviation, which was given in Step 1.
std_deviation = 0.5000

# calculate standard error = standard deviation / sqrt(n)   where n is the sample size.
stderr = std_deviation/math.sqrt(len(diameters_df['diameters']))

# construct a 90% confidence interval.
conf_int_90 = st.norm.interval(0.90, mean, stderr)
print("90% confidence interval (unrounded) =", conf_int_90)
print("90% confidence interval (rounded) = (", round(conf_int_90[0], 2), ",", round(conf_int_90[1], 2), ")")
print("")

# construct a 99% confidence interval.
conf_int_99 = st.norm.interval(0.99, mean, stderr)
print("99% confidence interval (unrounded) =", conf_int_99)
print("99% confidence interval (rounded) = (", round(conf_int_99[0], 2), ",", round(conf_int_99[1], 2), ")")

90% confidence interval (unrounded) = (2.3030912846323326, 2.5357087153676674)
90% confidence interval (rounded) = ( 2.3 , 2.54 )

99% confidence interval (unrounded) = (2.2372613632281553, 2.6015386367718447)
99% confidence interval (rounded) = ( 2.24 , 2.6 )
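
The intervals above can be cross-checked by hand with the formula sample mean ± z* × (standard deviation / sqrt(n)), where z* is the two-sided critical value of the standard Normal distribution. The following is a minimal sketch that uses only the Python standard library. The helper name z_confidence_interval is hypothetical, and 2.4194 is an illustrative sample mean taken as the midpoint of the intervals printed above, so your own numbers will differ.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, pop_std, n, confidence):
    """Return (lower, upper) for a z-based confidence interval for the mean.

    Hypothetical helper for illustration; assumes the population standard
    deviation is known, as in Step 2.
    """
    # standard error = standard deviation / sqrt(n)
    std_err = pop_std / math.sqrt(n)
    # two-sided critical value, e.g. about 1.645 for 90% and about 2.576 for 99%
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * std_err
    return sample_mean - margin, sample_mean + margin


# Illustrative input: midpoint of the intervals printed in the sample output.
example_mean = 2.4194

ci_90 = z_confidence_interval(example_mean, 0.5, 50, 0.90)
ci_99 = z_confidence_interval(example_mean, 0.5, 50, 0.99)

# Format with two decimals so trailing zeros are kept (2.30 rather than 2.3).
print(f"90% confidence interval = ({ci_90[0]:.2f}, {ci_90[1]:.2f})")
print(f"99% confidence interval = ({ci_99[0]:.2f}, {ci_99[1]:.2f})")
```

Writing the margin of error out this way also shows why the 99% interval is wider: only the critical value changes, while the standard error stays the same.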


### Step 3: Performing hypothesis testing for the population mean
Since you were given the population standard deviation in Step 1 and the sample size is sufficiently large, you can use the z-test for population means. The ztest method in the statsmodels.stats.weightstats submodule runs the z-test. The inputs to this method are the sample data (the diameters column of the dataframe) and the value of the mean under the null hypothesis. The outputs are the test statistic and, by default, the two-tailed P-value.
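
For reference, the z statistic for a population mean is (sample mean - null value) / (standard deviation / sqrt(n)). The sketch below works through that arithmetic by hand with the known population standard deviation from Step 1. The helper name z_statistic is hypothetical, and 2.4194 is an illustrative sample mean taken as the midpoint of the Step 2 intervals. Note that statsmodels' ztest estimates the standard deviation from the sample rather than taking a known population value, so the statistic it reports can differ slightly from this hand calculation.

```python
import math


def z_statistic(sample_mean, null_value, std_dev, n):
    """Hypothetical helper: z = (sample mean - null value) / (std dev / sqrt(n))."""
    return (sample_mean - null_value) / (std_dev / math.sqrt(n))


# Illustrative inputs: sample mean from the Step 2 sample output,
# null value 2.30, known population standard deviation 0.5, n = 50.
z_known_sigma = z_statistic(2.4194, 2.30, 0.5, 50)
print("z statistic with known standard deviation =", round(z_known_sigma, 2))
```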

Click the block of code below and hit the **Run** button above. 

In [3]:
from statsmodels.stats.weightstats import ztest

# run z-test hypothesis test for population mean. The value under the null hypothesis is 2.30.
test_statistic, p_value = ztest(x1=diameters_df['diameters'], value=2.30)

print("z-test hypothesis test for population mean")
print("test-statistic =", round(test_statistic,2))
print("two tailed p-value =",round(p_value,4))

z-test hypothesis test for population mean
test-statistic = 1.73
two tailed p-value = 0.0844
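
Because the alternative hypothesis in this activity is one-sided (mean greater than 2.30), the two-tailed P-value has to be converted before it is compared with alpha. The following is a minimal sketch of that conversion using only the standard library. The helper name upper_tail_p_value is hypothetical, and the inputs are the rounded values from the sample output above, so your own numbers will differ.

```python
from statistics import NormalDist


def upper_tail_p_value(test_statistic, two_tailed_p):
    """Hypothetical helper: P-value for the alternative 'mean > null value'.

    The two-tailed P-value counts both tails, so the upper tail is half of it
    when the statistic is positive, and the remainder when it is negative.
    """
    if test_statistic >= 0:
        return two_tailed_p / 2
    return 1 - two_tailed_p / 2


# Illustrative values copied from the sample output above.
z = 1.73
p_two_tailed = 0.0844
alpha = 0.01

p_upper = upper_tail_p_value(z, p_two_tailed)
print("upper-tailed p-value =", round(p_upper, 4))
print("reject the null hypothesis at alpha = 0.01?", p_upper < alpha)

# Cross-check from the standard Normal distribution: P(Z > z).
# This agrees only approximately because z was rounded to two decimals.
p_direct = 1 - NormalDist().cdf(z)
```

As an alternative to converting by hand, ztest also accepts an alternative argument (for example alternative="larger") that makes it return the one-sided P-value directly.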


## End of initial post
Attach the HTML output to your initial post in the Module Three discussion. The HTML output can be downloaded by clicking **File**, then **Download as**, then **HTML**. Be sure to answer all questions about this activity in the Module Three discussion.

## Follow-up posts (due Sunday)
Return to the Module Three discussion to answer the follow-up questions in your response posts to other students. There are no Python scripts to run for your follow-up posts.