
# Objective: To learn and understand confidence intervals in detail

## Confidence interval

**Confidence interval** is a range of values within which we estimate the true population parameter to lie with a certain level of confidence. It is a statistical tool used in inferential statistics to quantify the uncertainty associated with sample estimates and to make inferences about population parameters based on limited sample data.

In simpler terms, a confidence interval provides a way to express the precision of an estimate. It tells us how confident we can be that the true population parameter falls within a certain range based on the data we have collected from a sample.


### Constructing a confidence interval

Constructing a confidence interval involves using sample data to estimate a range of values within which the true population parameter is likely to lie with a specified level of confidence. The confidence interval provides a measure of the uncertainty associated with the sample estimate. Here's a step-by-step guide on how to construct a confidence interval:

**Step 1: Collect Sample Data**

Gather a random sample of data from the population of interest. The sample should be representative and unbiased to ensure that the estimates are valid.

**Step 2: Choose a Confidence Level**

Decide on the desired level of confidence for the interval. Common choices are 90%, 95%, and 99%. A 95% confidence level is the most common choice; it means that the procedure used to build the interval captures the true population parameter in about 95% of repeated samples.

**Step 3: Calculate the Sample Statistic**

Compute the sample statistic of interest. The sample statistic could be the sample mean, sample proportion, sample standard deviation, or any other relevant statistic depending on the research question.

**Step 4: Determine the Standard Error**

Calculate the standard error (SE) of the sample statistic. The standard error quantifies the variability of the sample statistic and is typically derived from the sample data and the sample size.

**Step 5: Find the Critical Value**

Determine the critical value from the appropriate probability distribution (e.g., t-distribution or z-distribution) based on the chosen confidence level and the degrees of freedom (for t-distribution). The critical value defines the margins of the confidence interval.

**Step 6: Compute the Margin of Error**

Multiply the standard error by the critical value to obtain the margin of error. The margin of error represents the maximum distance from the sample statistic within which the true population parameter is likely to fall.

**Step 7: Calculate the Confidence Interval**

Add and subtract the margin of error from the sample statistic to construct the confidence interval. The confidence interval represents the range of values within which we are confident that the true population parameter lies.

The general formula for a confidence interval is:

**Confidence Interval** = Sample Statistic ± (Critical Value * Standard Error)

The width of the confidence interval is determined by the margin of error and the level of confidence. A wider confidence interval implies greater uncertainty, while a narrower interval indicates higher precision in estimating the population parameter.
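The seven steps above can be sketched as a small helper. This is only an illustration: the function name `confidence_interval` and the summary statistics (mean 50, standard deviation 8, n = 64) are made up for the sketch and are not taken from any dataset in this notebook.

```python
import math
from scipy import stats


def confidence_interval(estimate, standard_error, critical_value):
    """Return (lower, upper) = estimate ± critical_value * standard_error."""
    margin_of_error = critical_value * standard_error
    return estimate - margin_of_error, estimate + margin_of_error


# Hypothetical summary statistics, for illustration only
sample_mean = 50.0
sample_sd = 8.0
n = 64

se = sample_sd / math.sqrt(n)      # Step 4: standard error = 8 / 8 = 1.0
z_crit = stats.norm.ppf(0.975)     # Step 5: two-sided 95% critical value
lower, upper = confidence_interval(sample_mean, se, z_crit)  # Steps 6 and 7
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Keeping the critical value as a parameter lets the same helper serve both the t-based and z-based examples that follow.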


### Let's see some examples

## Example-1: Constructing a Confidence Interval for the Population Mean

- Suppose we want to estimate the average height of students in a university. We take a random sample of 100 students and measure their heights. The sample mean height is 170 cm, and the sample standard deviation is 5 cm. We want to construct a 95% confidence interval for the population mean height.

**Step 1: Sample statistics**

- Sample size = 100 students
- Sample mean height = 170 cm
- Sample standard deviation = 5 cm


**Step 2: Choose a Confidence Level**

- We choose a 95% confidence level, which means we want to be 95% confident that the true population mean height lies within our constructed interval.

**Step 3: Calculate the Sample Statistic**

- The sample mean height is 170 cm.

**Step 4: Determine the Standard Error**

- The standard error (SE) of the sample mean is calculated using the formula:

- SE = sample standard deviation / √(sample size) = 5 / √100 = 5 / 10 = 0.5


**Step 5: Find the Critical Value**

- For a 95% confidence level and a sample size of 100, we use the t-distribution to find the critical value. The t-distribution table or statistical software gives us a critical value of approximately 1.984 (rounded to three decimal places).

**Step 6: Compute the Margin of Error**

- The margin of error is calculated by multiplying the standard error by the critical value:

**Margin of Error** = Critical Value * Standard Error = 1.984 * 0.5 = 0.992

**Step 7: Calculate the Confidence Interval**

Now, we can construct the confidence interval:

**Confidence Interval** = Sample Mean ± Margin of Error = 170 ± 0.992 = [169.008, 170.992]

Interpretation:
We are 95% confident that the true population mean height of students in the university lies within the interval [169.008 cm, 170.992 cm].

In other words, we estimate that the average height of all students in the university falls within this range with a 95% level of confidence based on our sample data. Note that if we were to take multiple random samples and construct confidence intervals from each, we expect approximately 95% of those intervals to contain the true population mean height.
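As a quick check, the hand calculation in Example-1 can be reproduced with `scipy.stats`. The snippet below uses only the summary statistics stated in the example (n = 100, mean 170 cm, standard deviation 5 cm); the variable names are chosen for this sketch.

```python
import math
from scipy import stats

n = 100
sample_mean = 170.0
sample_sd = 5.0
confidence_level = 0.95

standard_error = sample_sd / math.sqrt(n)                       # 5 / 10 = 0.5
t_critical = stats.t.ppf((1 + confidence_level) / 2, df=n - 1)  # about 1.984
margin_of_error = t_critical * standard_error
lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

print(f"t critical value: {t_critical:.3f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Rounded to three decimals, this gives the same interval as the manual steps, [169.008, 170.992].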


In [None]:
#Loading libraries
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats

# Loading Data

In [None]:
df = sns.load_dataset('titanic')

In [None]:
df.head()

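|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True |  | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False |  | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True |  | Southampton | no | True |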


In [None]:
# shape of dataset
df.shape

(891, 15)

# Example-2: Calculation of Confidence interval using t-score

Suppose we want to estimate the average fare of passengers. We take a random sample of 100 passengers and record their fares. We will construct a 95% confidence interval for the population mean fare from the dataset.

## Step 1: Collect Sample Data

In [None]:
# Select random 100 fare samples
random_100_fare_samples = df['fare'].sample(n=100, random_state=42)
random_100_fare_samples

709    15.2458
439    10.5000
840     7.9250
720    33.0000
39     11.2417
        ...   
281     7.8542
712    52.0000
338     8.0500
327    13.0000
321     7.8958
Name: fare, Length: 100, dtype: float64

In [None]:
# Calculate the mean of the random_fare_samples
mean_fare = random_100_fare_samples.mean()

# Calculate the standard deviation of the random_fare_samples
std_dev_fare = random_100_fare_samples.std()

# Display the mean and standard deviation
print("Mean Fare:", mean_fare)
print("Standard Deviation Fare:", std_dev_fare)

Mean Fare: 31.533
Standard Deviation Fare: 42.62158033286078


## Step 2: Choose a Confidence Level
We choose a 95% confidence level, indicating that we want to be 95% confident that the true population mean fare lies within the confidence interval.


In [None]:
# Define the confidence level (e.g., 95% confidence interval)
confidence_level = 0.95

## Step 3: Calculate the Sample Statistic
Calculate sample mean fare


In [None]:
print("Mean Fare:", mean_fare)

Mean Fare: 31.533


## Step 4: Determine the Standard Error
The standard error (SE) of the sample mean is calculated using the formula:

SE = sample standard deviation / √(sample size)


In [None]:
# Calculate the sample size (which is 100 in this case)
sample_size = len(random_100_fare_samples)

# Calculate the standard error
standard_error = std_dev_fare / np.sqrt(sample_size)
print("standard error is:", standard_error)

standard error is: 4.262158033286078


## Step 5: Find the Critical Value
Determine the critical value from the appropriate probability distribution (e.g., t-distribution or z-distribution) based on the chosen confidence level and the degrees of freedom (for t-distribution). The critical value defines the margins of the confidence interval.


Let's find the critical value using the t-score. The t-score represents the number of standard errors the sample mean is away from the true population mean. The critical value is used to determine the margin of error in the confidence interval.


In [None]:
# To find the critical value from the t-distribution, first calculate the degrees of freedom
degrees_of_freedom = sample_size - 1

# Calculate the critical value (t-score) using the confidence level and degrees of freedom
critical_value = stats.t.ppf((1 + confidence_level) / 2, df=degrees_of_freedom)
print("Critical value:", critical_value)

Critical value: 1.9842169515086827


*Code explanation*:

The `stats.t.ppf` function evaluates the percent-point function (the inverse of the cumulative distribution function) of the t-distribution. It takes two main arguments:

The first argument is a cumulative probability, which in this case is `(1 + confidence_level) / 2`. For a 95% confidence level this equals 0.975: the central 95% of the distribution plus the 2.5% in the lower tail, which leaves 2.5% in the upper tail.

The second argument is the degrees of freedom (`df`), which we calculated earlier.

So, `stats.t.ppf((1 + confidence_level) / 2, df=degrees_of_freedom)` returns the t-score corresponding to the desired `confidence_level` and `degrees_of_freedom`.

The critical value is used to calculate the margin of error, which is then used to build the confidence interval in the steps that follow.
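To make the role of `ppf` concrete, the sketch below checks that `stats.t.cdf` undoes `stats.t.ppf`, and that the area between the negative and positive critical values equals the confidence level. The names `dof`, `upper_cum_prob`, `round_trip`, and `central_area` are introduced only for this illustration.

```python
from scipy import stats

confidence_level = 0.95
dof = 99  # degrees of freedom for a sample of 100

# Cumulative probability at the upper critical value: 1 - alpha / 2
upper_cum_prob = (1 + confidence_level) / 2   # 0.975

t_crit = stats.t.ppf(upper_cum_prob, df=dof)

# cdf is the inverse of ppf, so this recovers 0.975
round_trip = stats.t.cdf(t_crit, df=dof)

# The t-distribution is symmetric, so the area between -t_crit and +t_crit
# is the confidence level itself
central_area = stats.t.cdf(t_crit, df=dof) - stats.t.cdf(-t_crit, df=dof)

print("t critical value:", t_crit)
print("cdf(t_crit):", round_trip)
print("central area:", central_area)
```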

## Step 6: Compute the Margin of Error
Multiply the standard error by the critical value to obtain the margin of error. The margin of error represents the maximum distance from the sample statistic within which the true population parameter is likely to fall.

### **Margin of Error** = Critical Value * Standard Error


In [None]:
# Calculate the margin of error
margin_of_error = critical_value * standard_error
print("margin_of_error=", margin_of_error)

margin_of_error= 8.457046219655144


## Step 7: Calculate the Confidence Interval
Add and subtract the margin of error from the sample statistic to construct the confidence interval. The confidence interval represents the range of values within which we are confident that the true population parameter lies.

The general formula for a confidence interval is:

### **Confidence Interval** = Sample Statistic ± (Critical Value * Standard Error)

The width of the confidence interval is determined by the margin of error and the level of confidence. A wider confidence interval implies greater uncertainty, while a narrower interval indicates higher precision in estimating the population parameter.


In [None]:
# Calculate the lower and upper bounds of the confidence interval
lower_bound = mean_fare - margin_of_error
upper_bound = mean_fare + margin_of_error


# Display the confidence interval
print("95% Confidence Interval:", (lower_bound, upper_bound))

95% Confidence Interval: (23.075953780344857, 39.99004621965514)


The confidence interval (23.075953780344857, 39.99004621965514) is a range of plausible values for the true population mean of the fares at the 95% confidence level.

Specifically, it means that if we were to take many samples of the same size from the population and construct a confidence interval from each sample, approximately 95% of those intervals would contain the true population mean. It does not mean that 95% of sample means fall between 23.08 and 39.99.

In this sense, we are 95% confident that the true population mean of the fares lies within this interval.

It is essential to note that the confidence interval gives us a range of possible values for the population mean, but it does not tell us the probability that a particular value is the true population mean. Instead, it quantifies the uncertainty associated with the sample estimate.

In summary, with a 95% confidence level, we estimate that the true population mean of the fares lies between 23.08 and 39.99.
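The repeated-sampling meaning of a 95% confidence level can be checked with a small simulation. The sketch below does not use the Titanic data: it assumes a synthetic, right-skewed population drawn from an exponential distribution, and the names `population`, `n_trials`, and `coverage` are chosen for the illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed population (an assumption, not the Titanic fares)
population = rng.exponential(scale=30.0, size=100_000)
true_mean = population.mean()

n = 100
confidence_level = 0.95
n_trials = 2000
t_crit = stats.t.ppf((1 + confidence_level) / 2, df=n - 1)

hits = 0
for _ in range(n_trials):
    sample = rng.choice(population, size=n, replace=False)
    sample_mean = sample.mean()
    standard_error = sample.std(ddof=1) / np.sqrt(n)
    lower = sample_mean - t_crit * standard_error
    upper = sample_mean + t_crit * standard_error
    if lower <= true_mean <= upper:
        hits += 1  # this interval captured the true mean

coverage = hits / n_trials
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land near 0.95. For a skewed population like this one it is typically a little below the nominal level, because the t-interval is only approximate when the data are not normal.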

In [None]:
# Display the confidence interval
print("Confidence Interval:", (lower_bound, upper_bound))


# calculation of population mean
print("Population mean:",df['fare'].mean())

Confidence Interval: (23.075953780344857, 39.99004621965514)
Population mean: 32.204207968574636
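SciPy can also produce the same interval in one call with `stats.t.interval`. The sketch below plugs in the sample mean and standard deviation printed earlier in this example, rather than re-drawing the sample, so it runs without loading the dataset. The confidence level is passed positionally because its keyword name differs between SciPy versions.

```python
from scipy import stats

# Summary statistics printed earlier in this example
mean_fare = 31.533
std_dev_fare = 42.62158033286078
sample_size = 100

standard_error = std_dev_fare / sample_size ** 0.5

# loc is the sample mean, scale is the standard error
lower_bound, upper_bound = stats.t.interval(
    0.95, df=sample_size - 1, loc=mean_fare, scale=standard_error
)
print("95% Confidence Interval:", (lower_bound, upper_bound))
```

This reproduces the interval of roughly (23.08, 39.99) obtained from the step-by-step calculation, and the population mean of about 32.20 falls inside it.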


---

# Example-3: Calculation of Confidence interval using z-score

Suppose we want to estimate the average fare of passengers. We take a random sample of 100 passengers and record their fares. We will construct a 99% confidence interval for the population mean fare from the dataset.

## Step 1: Collect Sample Data

In [None]:
# Select random 100 fare samples
Random_100_fare_samples = df['fare'].sample(n=100, random_state=40)
Random_100_fare_samples

246      7.7750
588      8.0500
472     27.7500
71      46.9000
654      6.7500
         ...   
56      10.5000
158      8.6625
601      7.8958
223      7.8958
299    247.5208
Name: fare, Length: 100, dtype: float64

In [None]:
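# Note: the statistics below are computed from `random_100_fare_samples`
# (the Example-2 sample, random_state=42), not from `Random_100_fare_samples`
# drawn above, which is why the outputs match Example 2.

# Calculate the mean of the sample
Mean_fare = random_100_fare_samples.mean()

# Calculate the standard deviation of the sample
Std_dev_fare = random_100_fare_samples.std()

# Display the mean and standard deviation
print("Mean Fare:", Mean_fare)
print("Standard Deviation Fare:", Std_dev_fare)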

Mean Fare: 31.533
Standard Deviation Fare: 42.62158033286078


## Step 2: Choose a Confidence Level
We choose a 99% confidence level, which means we want to be 99% confident that the true population mean fare lies within our constructed interval.


In [None]:
# Define the confidence level (here, a 99% confidence interval)
Confidence_level = 0.99

## Step 3: Calculate the Sample Statistic
Calculate sample mean fare

In [None]:
print("Mean Fare:", Mean_fare)

Mean Fare: 31.533


## Step 4: Determine the Standard Error
The standard error (SE) of the sample mean is calculated using the formula:

SE = sample standard deviation / √(sample size)


In [None]:
# Calculate the sample size (which is 100 in this case)
Sample_size = len(Random_100_fare_samples)

# Calculate the standard error
Standard_error = Std_dev_fare / np.sqrt(Sample_size)
print("Standard error is:", Standard_error)

Standard error is: 4.262158033286078


## Step 5: Find the Critical Value
Determine the critical value from the appropriate probability distribution. Let's find the critical value using the z-score.


In [None]:
# Calculate the critical value (z-score) using the confidence level (assuming normal distribution)
Critical_value = stats.norm.ppf((1 + Confidence_level) / 2)
print("Critical_value:", Critical_value)

Critical_value: 2.5758293035489004


*Code explanation*: Let's see the code for calculating the critical value (z-score) using the confidence level.

The critical value (z-score) is a standard score that corresponds to a specific level of confidence in a normal distribution. It is used to determine the margin of error in constructing a confidence interval. The critical value is based on the desired confidence level, which is often denoted as (1 - α), where α is the significance level (the probability of a type I error).

In the context of a two-tailed confidence interval, the middle (1 - α) of the distribution lies between the two critical values, which leaves α/2 in each tail. We therefore want the z-score whose cumulative probability is 1 - α/2. For example, for a 95% confidence level (α = 0.05), we want the z-score that corresponds to 1 - 0.05/2 = 0.975, which is the area to the left of the upper critical value.

Here's a step-by-step explanation of the code:

1. `Confidence_level`: This is the desired confidence level for the confidence interval, expressed as a decimal. In this example it is 0.99; a 95% confidence level would be 0.95.

2. `(1 + Confidence_level) / 2`: This equals 1 - α/2, the cumulative probability at the upper critical value. For a 95% confidence level it gives (1 + 0.95) / 2 = 0.975; for the 99% level used here it gives 0.995.

3. `stats.norm.ppf(...)`: This function is from the `scipy.stats` module and is used to find the critical value (z-score) for the given cumulative probability. The `ppf` stands for "percent point function," which is the inverse of the cumulative distribution function (CDF). In other words, it calculates the value of the random variable for which the CDF equals the given probability.

4. `Critical_value`: Finally, the critical value returned by `stats.norm.ppf(...)` is the z-score that marks the cutoff points for the confidence interval.

By using the critical value along with the standard error, you can construct the confidence interval around the sample mean. The margin of error is determined by multiplying the critical value with the standard error, and the lower and upper bounds of the confidence interval are then calculated by subtracting and adding the margin of error to the sample mean, respectively.
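As a small illustration of the explanation above, the loop below computes the two-sided z critical value for the three common confidence levels and confirms that `(1 + level) / 2` and `1 - alpha / 2` are the same probability. The dictionary `z_values` is introduced only for this sketch.

```python
import math
from scipy import stats

z_values = {}
for level in (0.90, 0.95, 0.99):
    alpha = 1 - level
    # Both expressions give the cumulative probability at the upper cutoff
    assert math.isclose((1 + level) / 2, 1 - alpha / 2)
    z_values[level] = stats.norm.ppf(1 - alpha / 2)
    print(f"{level:.0%} confidence -> z = {z_values[level]:.3f}")
```

The loop prints the familiar values 1.645, 1.960, and 2.576.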

## Step 6: Compute the Margin of Error
Multiply the standard error by the critical value to obtain the margin of error. The margin of error represents the maximum distance from the sample statistic within which the true population parameter is likely to fall.

### **Margin of Error** = Critical Value * Standard Error


In [None]:
# Calculate the margin of error
Margin_of_error = Critical_value * Standard_error
print("Margin of error:", Margin_of_error)

Margin of error: 10.97859155849463


##  Step 7: Calculate the Confidence Interval

The general formula for a confidence interval is:

### **Confidence Interval** = Sample Statistic ± (Critical Value * Standard Error)

The width of the confidence interval is determined by the margin of error and the level of confidence. A wider confidence interval implies greater uncertainty, while a narrower interval indicates higher precision in estimating the population parameter.

In [None]:
# Calculate the lower and upper bounds of the confidence interval
Lower_bound = Mean_fare - Margin_of_error
Upper_bound = Mean_fare + Margin_of_error

# Display the confidence interval
print("99 % Confidence Interval:", (Lower_bound, Upper_bound))

99 % Confidence Interval: (20.554408441505373, 42.51159155849463)


This means that we are 99% confident that the true population mean of fares falls within this interval. In other words, if we were to take many random samples of the same size from the population and construct a 99% confidence interval from each sample, approximately 99 out of 100 of those intervals would contain the true population mean.
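The same z-based interval can be obtained in one call with `stats.norm.interval`. The sketch below uses the sample mean and standard error printed in the steps above, so it does not need the dataset; the confidence level is passed positionally because its keyword name differs between SciPy versions.

```python
from scipy import stats

# Values printed in the steps above
Mean_fare = 31.533
Standard_error = 4.262158033286078

# loc is the sample mean, scale is the standard error
Lower_bound, Upper_bound = stats.norm.interval(
    0.99, loc=Mean_fare, scale=Standard_error
)
print("99% Confidence Interval:", (Lower_bound, Upper_bound))
```

This matches the interval of roughly (20.55, 42.51) computed step by step.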

# When to use a z-score and when to use a t-score?

- Use the t-score method when the population standard deviation is unknown and is estimated from the sample. This matters most when the sample size is small (n < 30).
- Use the z-score method when the population standard deviation is known, or as an approximation when the sample size is large (n ≥ 30), where the t and z critical values are nearly identical.

It's important to note that the z-score method assumes the sampling distribution of the mean is normal with a known standard error, while the t-score method accounts for the extra uncertainty introduced by estimating the population standard deviation from the sample. Choosing the appropriate method ensures that confidence intervals and hypothesis tests have the coverage and error rates they claim.
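To see why the choice matters mainly for small samples, the sketch below compares the 95% t critical value with the z critical value for a few sample sizes. The sample sizes and the dictionary `t_values` are chosen only for illustration.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # two-sided 95% z critical value

t_values = {}
for n in (5, 10, 30, 100, 1000):
    t_values[n] = stats.t.ppf(0.975, df=n - 1)
    ratio = t_values[n] / z_crit
    print(f"n={n:>4}: t={t_values[n]:.3f}  z={z_crit:.3f}  t/z={ratio:.3f}")
```

The t critical value is always larger than the z value, so t-based intervals are wider, but the gap shrinks as n grows: it is substantial at n = 5 and nearly gone by n = 1000.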

# What is the difference between a Point Estimate and a Confidence Interval?

Ans:

Point estimation gives us a single value as an estimate of a population parameter. The method of moments and maximum likelihood estimation are two common ways to derive point estimators for population parameters.

A confidence interval gives us a range of values that is likely to contain the population parameter. It is generally preferred because it also conveys how reliable the estimate is. The long-run proportion of such intervals that contain the parameter is called the confidence level or confidence coefficient and is written as 1 - α, where α is the level of significance.

# Why is the sample mean the statistic most often used to calculate confidence intervals?

The sample mean is commonly used as the test statistic to calculate confidence intervals because it is an unbiased estimator of the population mean, and it has desirable statistical properties that make it a reliable estimator for the population parameter.

Here are some reasons why the sample mean is preferred as the test statistic for constructing confidence intervals:

1. Unbiased Estimator: The sample mean is an unbiased estimator of the population mean. It means that, on average, the sample mean will be equal to the true population mean. This property makes the sample mean a good point estimator to estimate the unknown population parameter.

2. Efficiency: For many common populations, the sample mean has low variability (a small standard error) compared with other unbiased estimators of the population mean. As a result, it provides more precise estimates, leading to narrower confidence intervals.

3. Central Limit Theorem: For large samples, the sample mean is approximately normally distributed, thanks to the Central Limit Theorem. This theorem states that, for a population with finite variance, the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution. This property allows us to use the standard normal (z) or Student's t-distribution to calculate confidence intervals.

4. Simplicity: The sample mean is straightforward to calculate and interpret, making it a convenient choice for practical purposes.

5. Widely Used: The sample mean is widely used in statistical analysis, and there are well-established statistical tests and methods associated with it. This makes it easier to apply and compare results across different studies and experiments.

6. Minimizes Variance: Among linear unbiased estimators of the population mean, the sample mean has the smallest variance, and for normally distributed data it has the smallest variance among all unbiased estimators.

It's important to note that while the sample mean is commonly used, there are scenarios where other test statistics might be more appropriate. For instance, in situations where the data is not normally distributed or when estimating other population parameters (e.g., population proportion), different test statistics may be used (e.g., sample median, sample proportion). The choice of the test statistic depends on the specific research question and the underlying assumptions of the data.
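The Central Limit Theorem point in the list above can be illustrated with a short simulation. The sketch assumes a synthetic exponential population (strongly right-skewed), not the Titanic data, and measures how the skewness of the sample means shrinks toward 0, the skewness of a normal distribution, as the sample size grows. The names `n_repeats` and `skews` are chosen for the illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_repeats = 10_000  # number of samples drawn for each sample size

skews = {}
for n in (2, 30, 200):
    # Each row is one sample of size n from a skewed (exponential) population
    samples = rng.exponential(scale=1.0, size=(n_repeats, n))
    sample_means = samples.mean(axis=1)
    skews[n] = stats.skew(sample_means)
    print(f"n={n:>3}: skewness of sample means = {skews[n]:.3f}")
```

The skewness falls steadily as n increases, which is why normal-based critical values become reasonable for the sample mean even when the underlying data are far from normal.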

## Conclusion:

In this study, we have acquired a comprehensive understanding of constructing confidence intervals through a step-by-step approach. We applied this knowledge to practical examples and successfully built confidence intervals using both t-score and z-score methods.

### Thank you for reading till the end.

## -Raviteja Padala


https://www.linkedin.com/in/raviteja-padala/         
