# Confidence Interval

https://towardsdatascience.com/a-complete-guide-to-confidence-interval-and-examples-in-python-ff417c5cb593

As the name suggests, a confidence interval is a range of values. It is built around the best estimate of a population parameter and, ideally, contains the true value of that parameter. The confidence level is expressed as a percentage. A 95% confidence interval is the most common, but you can use other levels such as 97%, 90%, 75%, or even 99% if your research demands it. Let’s understand it with an example:

“In a sample of 659 parents with toddlers, about 85%, stated they use a car seat for all travel with their toddler. From these results, a 95% confidence interval was provided, going from about 82.3% up to 87.7%.”

This statement means we are 95% confident that the population proportion who use a car seat for all travel with their toddler lies between 82.3% and 87.7%. More precisely, if we drew many different samples of the same size and built an interval from each one in the same way, about 95% of those intervals would contain the true population proportion.
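The repeated-sampling reading of a confidence interval can be checked with a small simulation. This is only a sketch: the "true" proportion `true_p`, the number of repetitions `n_repeats`, and the random seed are assumptions chosen for illustration, not values from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
true_p = 0.85                    # assumed "true" population proportion
n = 659                          # sample size, as in the car seat example
n_repeats = 10_000               # number of simulated surveys (assumption)
z = 1.96                         # z-score for a 95% confidence level

# Simulate n_repeats surveys and compute the sample proportion of each one
p_hat = rng.binomial(n, true_p, size=n_repeats) / n

# Build a 95% interval around every simulated sample proportion
se = np.sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - z * se
upper = p_hat + z * se

# Share of intervals that contain the assumed true proportion
coverage = np.mean((lower <= true_p) & (true_p <= upper))
print(f"Share of intervals containing the true proportion: {coverage:.3f}")
```

The printed share should land close to 0.95. Any single interval either contains the true value or it does not; the 95% describes how often the procedure succeeds over many samples.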

#### Remember, 95% confidence interval does not mean 95% probability

The reason the confidence interval is so popular and useful is that we usually cannot collect data from the entire population. In the example above, we could not get information from all parents with toddlers. We had to calculate the result from 659 parents and use it to estimate the proportion in the overall population. So, it is reasonable to allow for a margin of error and report a range. That’s why we use a confidence interval, which is a range.

For a 95% confidence level, the z-score is 1.96.
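For other confidence levels, the matching z-score can be looked up from the standard normal distribution. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `z_for_confidence` is made up for this example.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided z-score for a confidence level given as a fraction, e.g. 0.95.

    Hypothetical helper for illustration.
    """
    alpha = 1 - level                            # probability left outside the interval
    return NormalDist().inv_cdf(1 - alpha / 2)   # split alpha evenly between both tails

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} -> z = {z_for_confidence(level):.3f}")
```

For a level of 0.95 this gives approximately 1.96, the value used throughout this notebook.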

Standard Error for population proportion = sqrt(population_proportion * (1 - population_proportion) / number_of_observations)

Standard Error for mean = standard_deviation / sqrt(number_of_observations)

As per the statement, the proportion of sampled parents who use a car seat for all travel with their toddlers is 85%. So, this is our best estimate of the population proportion. We need to add the margin of error to it. To calculate the margin of error we need the z-score and the standard error. I am going to calculate a 95% CI, so the z-score is 1.96, and the standard error comes from the formula for the population proportion given above. Plugging in all the values:

0.85 +/- 1.96 * sqrt(0.85 * (1 - 0.85) / 659)

The confidence interval runs from 82.3% to 87.7%.
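The same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch that uses only the numbers from the quoted statement (85% of 659 parents) and the 95% z-score; the variable names are chosen for illustration.

```python
import math

p_hat = 0.85      # sample proportion from the quoted statement
n = 659           # number of parents surveyed
z = 1.96          # z-score for a 95% confidence level

se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the proportion
margin = z * se                            # margin of error
lcb = p_hat - margin                       # lower limit of the CI
ucb = p_hat + margin                       # upper limit of the CI
print(f"{lcb:.1%} to {ucb:.1%}")           # prints 82.3% to 87.7%
```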

In [48]:
import pandas as pd
import numpy as np
df = pd.read_csv('./data/heart2.csv')
df

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1
1021,60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
1022,47,1,0,110,275,0,0,118,1,1.0,1,1,2,0
1023,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1


In [49]:
print(df.columns)

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')


The last column of the data is ‘target’. It says whether a person has heart disease (1) or not (0). Near the beginning, we have a ‘sex’ column as well. The line below adds a ‘Sex1’ column with the readable labels ‘Male’ and ‘Female’.


In [50]:
df['Sex1'] = df.sex.replace({1: "Male", 0: "Female"})

We do not need all the columns in the dataset. We will only use the ‘target’ column, which records whether a person has heart disease, and the ‘Sex1’ column we just created. Make a DataFrame with only these two columns and drop all the null values.


In [51]:
dx = df[["target", "Sex1"]].dropna()

We need the number of females who have heart disease. The line of code below gives the number of males and females with and without heart disease.

In [52]:
pd.crosstab(dx.target, dx.Sex1)

Sex1,Female,Male
target,,
0,86,413
1,226,300


## CI for the population Proportion

In [16]:
#Calculate the female population proportion with heart disease
p_fm = 226/(86+226)
p_fm

0.7243589743589743

In [15]:
#size of the female population
n = 86 + 226
n

312

In [14]:
#The size of the female sample is 312. Calculate the standard error
se_female = np.sqrt(p_fm * (1 - p_fm) / n)
se_female

0.02529714756803247

In [18]:
z_score = 1.96
lcb = p_fm - z_score* se_female #lower limit of the CI
ucb = p_fm + z_score* se_female #upper limit of the CI

print(f"The confidence interval is {lcb} and {ucb}")

The confidence interval is 0.6747765651256307 and 0.773941383592318


In [19]:
import statsmodels.api as sm
sm.stats.proportion_confint(n * p_fm, n)

(0.6747774762140357, 0.773940472503913)

Is the population proportion of females with heart disease the same as the population proportion of males with heart disease? If they are the same, then the difference in both the population proportions will be zero.


In [23]:
p_male = 300/(413+300)  #male population proportion
n = 300+413             #total male population
p_male

0.42075736325385693

In [24]:
se_male = np.sqrt(p_male * (1 - p_male) / n)
se_male

0.018488486410836908

In [21]:
se_diff = np.sqrt(se_female**2 + se_male**2)

In [25]:
d = p_fm - p_male         #difference in the sample proportions
lcb = d - 1.96 * se_diff  #lower limit of the CI
ucb = d + 1.96 * se_diff  #upper limit of the CI
print(f"The confidence interval is {lcb:.3f} and {ucb:.3f}")

The confidence interval is 0.242 and 0.365


The CI runs from about 0.242 to 0.365. This range does not contain 0; both limits are above zero. So the data do not support the claim that the population proportion of females with heart disease is the same as the population proportion of males with heart disease. In this dataset, the proportion is higher among females.
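The steps for the difference in proportions can be wrapped into one small function. This is a sketch rather than a library routine: `diff_proportion_ci` is a hypothetical helper name, and the counts passed to it are the ones from the crosstab output above. It works from the raw counts, so no intermediate rounding is involved.

```python
import math

def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """CI for p1 - p2, given x1 successes out of n1 and x2 out of n2.

    Hypothetical helper for illustration.
    """
    p1, p2 = x1 / n1, x2 / n2
    se1 = math.sqrt(p1 * (1 - p1) / n1)    # standard error of the first proportion
    se2 = math.sqrt(p2 * (1 - p2) / n2)    # standard error of the second proportion
    se_diff = math.sqrt(se1**2 + se2**2)   # standard error of the difference
    d = p1 - p2
    return d - z * se_diff, d + z * se_diff

# Females: 226 of 312 with heart disease; males: 300 of 713
lcb, ucb = diff_proportion_ci(226, 312, 300, 713)
print(f"{lcb:.3f} to {ucb:.3f}")
```

If both limits have the same sign, zero lies outside the interval, which is the check used in the text.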

## Calculation of CI of mean

#### We will calculate the confidence interval of the mean cholesterol level of the female population.

In [27]:
df.groupby("Sex1").agg({"chol": ["mean", "std", "size"]})  #string names avoid deprecation warnings in newer pandas

,chol,chol,chol
,mean,std,size
Sex1,,,
Female,261.455128,64.466781,312
Male,239.237027,43.155535,713


In [28]:
mean_fe = 261.455128  #mean cholesterol of female, from the table above
sd = 64.466781        #standard deviation for female, from the table above
n = 312               #Total number of female
z = 1.96              #z-score for a 95% confidence level
se = sd /np.sqrt(n)
print(f"{se:.4f}")

3.6497

In [29]:
lcb = mean_fe - z* se  #lower limit of the CI
ucb = mean_fe + z* se  #upper limit of the CI
print(f"The confidence interval is {lcb:.2f} and {ucb:.2f}")

The confidence interval is 254.30 and 268.61


That means we are 95% confident that the true mean cholesterol of the female population lies between 254.30 and 268.61.
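Instead of typing summary statistics by hand, the interval can also be computed straight from the raw values. The sketch below assumes a helper named `mean_ci` (not part of any library) and runs on synthetic numbers generated only for illustration; they are not the heart dataset.

```python
import numpy as np

def mean_ci(values, z=1.96):
    """z-based CI for the mean of a 1-D sample (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    sd = values.std(ddof=1)        # sample standard deviation
    se = sd / np.sqrt(n)           # standard error of the mean
    return mean - z * se, mean + z * se

# Synthetic cholesterol-like values, made up for this example
rng = np.random.default_rng(1)
sample = rng.normal(loc=260, scale=65, size=300)

lcb, ucb = mean_ci(sample)
print(f"The confidence interval is {lcb:.2f} and {ucb:.2f}")
```

On the notebook's data, the same function could be called as `mean_ci(df.loc[df.Sex1 == "Female", "chol"])`.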

### Calculation of CI of The Difference in Mean


There are two approaches to calculate the CI for the difference in the mean of two populations: the pooled approach and the unpooled approach.


Both approaches need a simple random sample and an approximately normal distribution. If the sample is large, a normal distribution is not necessary.
There is one more assumption for a pooled approach. That is, the variance of the two populations is the same or almost the same.
If the variance is not the same, the unpooled approach is more appropriate.
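The two approaches differ only in how the standard error of the difference is computed. The sketch below writes out both textbook formulas side by side; the function names `se_unpooled` and `se_pooled` are made up for this example, and the inputs are the summary statistics from the groupby table above.

```python
import math

def se_unpooled(sd1, n1, sd2, n2):
    """Standard error of the difference in means without assuming equal variances."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

def se_pooled(sd1, n1, sd2, n2):
    """Standard error of the difference in means assuming a common variance."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)

# Female: sd 64.466781, n 312; Male: sd 43.155535, n 713
print(f"unpooled: {se_unpooled(64.466781, 312, 43.155535, 713):.4f}")
print(f"pooled:   {se_pooled(64.466781, 312, 43.155535, 713):.4f}")
```

When the two groups have the same standard deviation and the same size, both functions return the same value; the further apart the variances are, the more the two results diverge.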


In [30]:
n1 = 312                  #number of females, from the table above
n2 = 713                  #number of males, from the table above
mean_female = 261.455128
mean_male = 239.237027
sd_female = 64.466781
sd_male = 43.155535

As we can see, the standard deviations of the two groups are different. So, the variances must be different as well.
So, for this example, the unpooled approach will be more appropriate.

In [31]:
sem_female = sd_female / np.sqrt(n1)
sem_male = sd_male / np.sqrt(n2)

In [32]:
mean_d = mean_female - mean_male

Using the formula for the unpooled approach, calculate the standard error of the difference. It is the square root of the sum of the two squared standard errors of the mean:


In [34]:
sem_d = np.sqrt(sem_female**2 + sem_male**2)


In [36]:
#Finally, construct the CI for the difference in mean
lcb = mean_d - 1.96*sem_d  #lower limit of the CI
ucb = mean_d + 1.96*sem_d  #upper limit of the CI
print(f"The confidence interval is {lcb:.2f} and {ucb:.2f}")

The confidence interval is 14.39 and 30.04

The lower and upper limits of the confidence interval are about 14.39 and 30.04. The interval does not contain zero, so the data suggest that the mean cholesterol of the female population is higher than the mean cholesterol of the male population.