## Question on Confidence Interval

In [None]:
import numpy as np
import pandas as pd
import scipy.stats  # import the submodule explicitly so scipy.stats.* resolves on all SciPy versions
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# A company wants to estimate the average time spent by customers on their website per session. They collect a random 
# sample of 100 customers and find that the sample mean time spent is 4.5 minutes with a sample standard deviation of 1.2 
# minutes. Calculate a 99% confidence interval for the true population mean time spent on the website per session.

x_bar = 4.5
s = 1.2
n = 100
alpha = 0.01

t_score = scipy.stats.t.ppf(1-alpha/2,n-1)
lower_bound = x_bar - t_score * s / (n**0.5)
upper_bound = x_bar + t_score * s / (n**0.5)
print(f"The 99% confidence interval for the true population mean time spent on the website per session is ({lower_bound:.2f}, {upper_bound:.2f}) minutes.")

## Solution-1

When the population standard deviation is known, the margin of error is:

Margin of Error = z*(sigma/√n)
where z is the z-score associated with our level of confidence (99% in this case), sigma is the population standard deviation (unknown here), and n is the sample size (100 in this case).

Since we don't know the population standard deviation, we estimate it with the sample standard deviation and replace the z-score with a t-score. The t-score for our level of confidence and degrees of freedom (df = n - 1) is:

t = t.ppf(1 - alpha/2, df=n-1)
where alpha is 1 - our level of confidence (0.01 in this case).

Now we can calculate the margin of error using:

Margin of Error = t*(s/√n)
where s is the sample standard deviation (1.2 in this case).

Finally, we can calculate the confidence interval using:

Confidence Interval = x̄ ± Margin of Error
where x̄ is the sample mean (4.5 in this case).

Putting it all together, we get:

Margin of Error = t*(s/√n) = 2.626*(1.2/√100) = 0.315
Confidence Interval = x̄ ± Margin of Error = 4.5 ± 0.315 = [4.185, 4.815]
Therefore, we can be 99% confident that the true population mean time spent on the website per session falls between 4.185 and 4.815 minutes.
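
As a cross-check on the arithmetic above, the same interval can be obtained in a single call. This is a minimal sketch, assuming SciPy's `scipy.stats.t.interval` with the confidence level passed positionally; the variable names `standard_error`, `t_crit`, and `margin` are illustrative.

```python
import math
from scipy import stats

x_bar, s, n = 4.5, 1.2, 100      # sample mean, sample std, sample size (from the problem)
confidence = 0.99

# Manual route: critical t value times the standard error
standard_error = s / math.sqrt(n)
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
margin = t_crit * standard_error

# One-call route: t.interval returns the (lower, upper) endpoints
lower, upper = stats.t.interval(confidence, n - 1, loc=x_bar, scale=standard_error)

print(f"margin of error = {margin:.3f}")
print(f"interval = ({lower:.3f}, {upper:.3f})")
```

Both routes should agree, since `t.interval` is built from the same quantiles of the t-distribution.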

In [None]:
# A car manufacturer is interested in estimating the mean gas mileage of their new SUV. A sample of 25 SUVs is taken, and 
# their mean gas mileage is found to be 28.6 miles per gallon with a standard deviation of 2.8 miles per gallon. Calculate 
# a 95% confidence interval for the true mean gas mileage of the SUV.

x_bar = 28.6
s = 2.8
n = 25
alpha = 0.05

t_score = scipy.stats.t.ppf(1-alpha/2,n-1)
ci = x_bar + np.array([-1, 1]) * t_score * s / (n**0.5)
print(f"95% Confidence interval: {ci}")

In [None]:
# A climate research organization wants to estimate the average temperature of a certain country. They collect temperature 
# data for 2613 days but, due to certain limitations, they only have information about the average temperature for 2508 
# days. The organization assumes that the population follows a normal distribution and wants to estimate the population 
# mean temperature with a 95% confidence interval.
# a) Treat the standard deviation of the full data set as the known population standard deviation. Apply the z procedure.
# b) The population standard deviation is not given. Apply the t procedure.

ind_data = pd.read_csv("ind_temp.csv")

samples = []
stds = []
alpha = 0.05

for i in range(200):
    d = ind_data["AverageTemperature"].dropna().sample(50).values
    stds.append(d.std(ddof=1))  # sample standard deviation (NumPy defaults to ddof=0)
    samples.append(d.tolist())
    
samples = np.array(samples)
x_bar = samples.mean(axis = 1)
s = np.mean(stds)
sns.kdeplot(x_bar)

t_score = scipy.stats.t.ppf(1-alpha/2,49)
ci = x_bar.mean() + np.array([-1,1])*t_score*s/(50**0.5)
print("Interval for 95% confidence(t-procedure):",ci)
print("Actual data mean temperature", ind_data.AverageTemperature.mean())

In [None]:
std = ind_data["AverageTemperature"].std()
z_score = scipy.stats.norm.ppf(1-alpha/2)
ci = x_bar.mean() + np.array([-1,1]) * z_score * std/(50**0.5)
print("Interval for 95% confidence(z-procedure):",ci)
print("Actual data mean temperature", ind_data.AverageTemperature.mean())

In [None]:
# Task 1: The sales manager of a used car company wants to know the average selling price of all used BMW 
# cars. The analyst can collect only a sample of 500 car sales in the area. Since the company will use this estimate 
# to plan its sales strategy, the sample mean should be a good approximation of the population mean. What 
# level of confidence will satisfy the sales manager? What +/- interval is acceptable?
# Task 2: In addition to the price of the car, the manager now also wants to know the average mileage that the cars have 
# been driven. The manager does not have the population standard deviation for the mileage, and mileage data is 
# available only for the 25 cars that they have sold so far. How should the analyst calculate a 95% 
# confidence interval with only 25 samples?
# Task 3: The manager is not happy with either interval (from Tasks 1 and 2) because both are too wide. 
# The manager now asks the analyst to estimate the average price of the car (as in Task 1) but within a 
# bound of 750 from the mean at a 95% confidence level. How many samples does the analyst have to collect to achieve 
# this bound?
# Task 4: Conversely, one week after starting data collection, the analyst has collected only 420 
# samples, although 540 samples are needed for a bound of 1,000. If the analysis has to be done now, what is the 
# tightest bound achievable at a 95% confidence level?

bmw = pd.read_csv("bmw.csv")
bmw

<h3> Task1 </h3>
To determine the level of confidence the sales manager is going to be satisfied with, we need to consider the level of risk he is willing to take. For example, if the sales manager is willing to take a risk of being incorrect 5% of the time, he would want a 95% confidence interval.

To calculate the interval estimate, we need to calculate the mean and standard deviation of the sample, as well as the sample size, and the level of confidence.
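
To make the trade-off concrete, the sketch below prints the z critical value for a few common confidence levels. A higher confidence level gives a larger critical value and therefore a wider interval for the same data. The levels listed are illustrative and are not taken from the task statement.

```python
from scipy import stats

# Illustrative confidence levels (an assumption, not part of the task statement)
levels = [0.90, 0.95, 0.99]

# Two-sided critical value: the (1 - alpha/2) quantile of the standard normal
z_values = {level: stats.norm.ppf(1 - (1 - level) / 2) for level in levels}

for level, z in z_values.items():
    print(f"{level:.0%} confidence -> z = {z:.3f}")
```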

In [None]:
sample_size = 500
alpha = 0.05

sample = bmw.sample(sample_size,random_state=42)
x_bar = sample["price"].mean()
sigma = sample["price"].std()

z_score = scipy.stats.norm.ppf(1-alpha/2)
ci = x_bar + np.array([-1,1]) * z_score * sigma/(sample_size**0.5)
print("Interval for 95% confidence:",ci)

<h3> Task2 </h3>
For Task 2, we want a 95% confidence interval for the average mileage of BMW cars sold by the used car company. We have only 25 mileage data points, and we do not know the population standard deviation.

Because the population standard deviation is unknown and the sample is small, we use the t-distribution instead of the standard normal (z) distribution. The critical value for the 95% confidence level comes from the t-distribution with 24 degrees of freedom (n-1).
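
The effect of switching distributions can be seen by comparing the two critical values directly. A small sketch using `scipy.stats.norm.ppf` and `scipy.stats.t.ppf`; the name `widening` is just an illustrative label for the ratio of the two values.

```python
from scipy import stats

alpha = 0.05
n = 25

z_crit = stats.norm.ppf(1 - alpha / 2)          # standard normal critical value
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # t critical value with 24 degrees of freedom

# The t value is larger, so the t interval is wider for the same s and n
widening = t_crit / z_crit
print(f"z = {z_crit:.3f}, t = {t_crit:.3f}, ratio = {widening:.3f}")
```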

First, we calculate the sample mean and sample standard deviation of the 25 sample data points for mileage:

In [None]:
sample_size = 25
sample = bmw.sample(sample_size,random_state=1)
x_bar = sample["mileage"].mean()
s = sample["mileage"].std()

t_score = scipy.stats.t.ppf(1-alpha/2,sample_size-1)
ci = x_bar + np.array([-1,1])*t_score*s/(sample_size**0.5)
print("Interval for 95% confidence:",ci)

<h3> Task3 </h3>

The sample size required to estimate the price within a margin of 750 depends on the following:

- The bound that the interval must fall within, denoted B. In Task 3, B = 750.
- The confidence level (1−𝛼). In Task 3, this is 95%.
- An estimate of the variance (or standard deviation) of the population.

The minimum number of required samples to estimate the population mean μ is:

$$ n = \dfrac{Z^2 _{\alpha / 2} \sigma^2}{B^2} $$

where

- $n$ = sample size
- $Z_{\alpha / 2}$ = z-score for the desired confidence level (95%)
- $\sigma$ = standard deviation of the population (unknown in this case)
- $B$ = margin of error (750)

To calculate the sample size, we need to estimate the standard deviation. We can use the standard deviation of the sample as an estimate for the population standard deviation.
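
The formula can be wrapped in a small helper. This is a sketch only: the function name `required_sample_size` and the value `sigma_guess` are hypothetical and chosen purely for illustration, not taken from the BMW data.

```python
import math
from scipy import stats

def required_sample_size(sigma, bound, confidence=0.95):
    """Smallest whole n with z * sigma / sqrt(n) <= bound."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / bound) ** 2)

# Hypothetical standard deviation, chosen only to illustrate the formula
sigma_guess = 11_000
n_needed = required_sample_size(sigma_guess, bound=750)
print(n_needed)
```

Because B is squared in the denominator, halving the bound roughly quadruples the required sample size. The result is rounded up, since a fractional sample cannot be collected.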

First, let's calculate the sample standard deviation:

In [None]:
sample_mean = []
sample_std = []

for i in range(500):
    sample = bmw["price"].sample(n=50)
    sample_mean.append(sample.mean())
    sample_std.append(sample.std())
    
s = np.mean(sample_std)
B = 750
alpha = 0.05
z_score = scipy.stats.norm.ppf(1-alpha/2)
 
n = int(np.ceil(((z_score * s) / B) ** 2))  # round up: the sample size must be a whole number
print(n)

<h3> Task4 </h3>
Conversely, one week after starting data collection, the analyst has collected only 420 samples, although 857 samples are needed for B = 750 (from Task 3). If the analysis has to be done now, what is the tightest bound achievable at a 95% confidence level?


This is the converse of Task 3. Solving the sample-size formula for B gives the bound:

$$ n = \dfrac{Z^2 _{\alpha / 2} \sigma^2}{B^2} $$

$$ B = Z _{\alpha / 2} \dfrac{\sigma}{\sqrt n} $$
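
The rearranged formula can be checked numerically. A minimal sketch, assuming a hypothetical standard deviation `sigma_guess` (not computed from the BMW data) and an illustrative helper named `achievable_bound`:

```python
import math
from scipy import stats

def achievable_bound(sigma, n, confidence=0.95):
    """Margin of error B = z * sigma / sqrt(n) for a given sample size."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return z * sigma / math.sqrt(n)

# Hypothetical standard deviation, used only for illustration
sigma_guess = 11_000
bound_420 = achievable_bound(sigma_guess, n=420)

# Round trip: plugging the bound back into the sample-size formula recovers n
z = stats.norm.ppf(0.975)
n_back = (z * sigma_guess / bound_420) ** 2
print(f"B = {bound_420:.1f}, n recovered = {n_back:.1f}")
```

The round trip confirms that the two formulas are the same relationship solved for different unknowns.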

In [None]:
sample_size = 420
sample_std = bmw["price"].sample(sample_size).std()
z_score = 1.96

B = z_score * sample_std/(sample_size)**0.5
print(B)