# Mid-term Exam (sample) 12/11/2025

## Task 1. Estimation — proportion of AI users

Instructions: compute a point estimate and a 95% confidence interval for the proportion of developers who report using AI tools at least once per week. Based on the interval, state whether the data support the claim that the proportion is ≥ 60%.

Data source: Stack Overflow Developer Survey (e.g. 2023/2024, available on Kaggle/GitHub).

The column with AI usage information is named 'AISelect'. Select the categories corresponding to 'at least once per week' (adjust to the exact labels in the file).

Report requirements:
- Point estimate (p̂). 
- 95% confidence interval for p.
- Interpret your results.
- Check the absolute and relative (%) error levels. If the error exceeds 5%, treat the study as a pilot study and plan the minimum sample size required to achieve a 3% error level.
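
The precision check in the last requirement can be sketched as follows. This is a minimal, dependency-free sketch, assuming the normal-approximation (Wald) margin of error and reading the 3% target as an absolute margin on the proportion. The function names and the example numbers (a proportion of 0.5 from 100 respondents) are hypothetical and are not survey results.

```python
import math
from statistics import NormalDist


def proportion_margin(p_hat, n, confidence=0.95):
    """Absolute and relative margin of error for a proportion (Wald formula)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return margin, margin / p_hat


def min_sample_size_proportion(p_hat, target_margin, confidence=0.95):
    """Smallest n whose Wald margin of error does not exceed target_margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / target_margin ** 2)


# Hypothetical pilot numbers, for illustration only.
margin, rel_margin = proportion_margin(p_hat=0.5, n=100)
print(f"margin = {margin:.4f}, relative = {rel_margin:.1%}")
print(min_sample_size_proportion(p_hat=0.5, target_margin=0.03))
```

If the 3% target is meant as a relative error instead, `target_margin` would be `0.03 * p_hat`.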

In [None]:
import kagglehub
import pandas as pd
import scipy.stats as ss
import numpy as np
# https://www.kaggle.com/datasets/stackoverflow/stack-overflow-2023-developers-survey

path = kagglehub.dataset_download("stackoverflow/stack-overflow-2023-developers-survey")

import os

for file in os.listdir(path):
    print(file)

df = pd.read_csv(os.path.join(path, 'survey_results_public.csv'))
df_clean = df[df['AISelect'].notna()]
n = len(df_clean)
x = len( df_clean[df_clean['AISelect'] == "Yes"])

#a) 
p_hat = x / n
print(p_hat)

#b)
# 95% confidence interval for a proportion (normal approximation)

SE = np.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
interval = ss.norm.interval(0.95, loc=p_hat, scale=SE)
interval

README_2023.txt
so_survey_2023.pdf
survey_results_public.csv
survey_results_schema.csv
0.44379525536244074
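
To answer the question about the 60% claim, the interval's upper bound can be compared with the threshold. The sketch below uses the Wilson score interval, which is a substitute for the normal approximation rather than the method prescribed by the task. The function names and the counts (44 of 100) are hypothetical. With a point estimate near 0.44, as printed above, the interval would have to be very wide to reach 0.60.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


def claim_at_least_is_plausible(interval, threshold):
    """True if some value >= threshold lies inside the interval."""
    return interval[1] >= threshold


# Hypothetical counts, for illustration only.
ci = wilson_interval(successes=44, n=100)
print(ci, claim_at_least_is_plausible(ci, 0.60))
```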

## Task 2. Estimation – Mean of Daily Steps

Using the sample of Fitbit Fitness Tracker Data, import the data into Python and estimate the mean number of daily steps (TotalSteps) among the participants. Report both a point estimate and a 95% confidence interval.

Perform the following steps:

- Import the dataset and display basic information (number of observations and variables).
- Extract the TotalSteps column as a sample and remove missing values.
- Interpret your results.
- Check the absolute and relative (%) error levels. If the error exceeds 5%, treat the study as a pilot study and plan the minimum sample size required to achieve a 3% error level.

In [None]:
import kagglehub
import numpy as np
import pandas as pd
import scipy.stats as ss
# Download latest version
path = kagglehub.dataset_download("dellakoovakkattu/bellabeatdailyactivity")

import os
from statsmodels.stats.weightstats import DescrStatsW

for file in os.listdir(path):
    print(file)

df = pd.read_csv(os.path.join(path, 'dailyActivity_merged.csv'))
steps = df['TotalSteps'].dropna()
mean_of_steps = np.mean(steps)
n = len(steps)
std_sample = np.std(steps, ddof=1)  
sem = std_sample / np.sqrt(n)      
CI = ss.t.interval(0.95, df=n-1, loc=mean_of_steps, scale=sem)

ds = DescrStatsW(steps)
CI2 = ds.tconfint_mean(alpha=0.05)
mean_of_steps,std_sample, CI, CI2

dailyActivity_merged.csv


(7637.9106382978725,
 5087.150741753411,
 (7312.2847523334, 7963.536524262345),
 (7312.2847523334, 7963.536524262345))
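
The error-level check for the mean can be sketched from the numbers printed above. This sketch assumes the 3% target is relative to the sample mean and uses the normal quantile instead of the t quantile for planning, which is a common simplification for large samples. The helper `min_sample_size_mean` is a hypothetical name. The relative error works out to about 4.3%, so the sample-size step is shown for illustration.

```python
import math
from statistics import NormalDist

# Values copied from the cell output above.
mean_steps = 7637.9106382978725
std_steps = 5087.150741753411
ci_low, ci_high = 7312.2847523334, 7963.536524262345

# Absolute error is the half-width of the interval; relative error divides by the mean.
half_width = (ci_high - ci_low) / 2
relative_error = half_width / mean_steps
print(f"half-width = {half_width:.1f} steps, relative error = {relative_error:.2%}")


def min_sample_size_mean(std, target_margin, confidence=0.95):
    """Smallest n with z * std / sqrt(n) <= target_margin (normal quantile)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * std / target_margin) ** 2)


# Target: 3% of the sample mean (assumed reading of the 3% error level).
n_required = min_sample_size_mean(std_steps, 0.03 * mean_steps)
print(n_required)
```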

## Task 3. Probability — bug reports

A small software project receives bug reports. Historical data suggest that the number of bug reports arriving per day can be modelled as a Poisson random variable with a mean of 12 reports per day.

a) What is the probability that no bugs will be found in a single day?

b) Calculate the tail probability that more than 20 bugs will be found in a single day.

c) Consider a 7‑day period. What is the distribution of the total number of reports in 7 days? Compute the exact probability that the 7‑day total is ≥ 90. 

d) Use a normal approximation for the 7‑day total (show mean and variance) and compute the approximate probability that the total ≥ 90. Compare with the exact Poisson result and comment on the quality of the approximation.

e) Short interpretation: is observing ≥ 90 reports in a week a realistic concern? When would you prefer the normal approximation vs exact Poisson?
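
For parts c) and d), the setup can be written out as a short derivation. It assumes the daily counts $X_1, \dots, X_7$ are independent, which the task does not state explicitly. The normal approximation is shown without a continuity correction.

$$
S = \sum_{i=1}^{7} X_i \sim \mathrm{Poisson}(7 \cdot 12) = \mathrm{Poisson}(84), \qquad \mathbb{E}[S] = \operatorname{Var}(S) = 84
$$

$$
P(S \ge 90) = 1 - P(S \le 89) \approx 1 - \Phi\!\left(\frac{90 - 84}{\sqrt{84}}\right)
$$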

In [25]:
from scipy.stats import poisson, norm
import numpy as np
lamb = 12
# a)
no_bugs = poisson.pmf(0, lamb)
print(no_bugs)
# b)
above_20 = 1 - poisson.cdf(20,lamb)
print(above_20)
# c)
lamb2 = lamb * 7
above_90_in_7 = 1 - poisson.cdf(89,lamb2)
print(above_90_in_7)

# d) normal approximation: for a Poisson total, mean = variance = lamb2
mean = lamb2
std = np.sqrt(lamb2)
z_90 = (90 - mean) / std
above_90_in_7_norm = 1 - norm.cdf(z_90, loc=0, scale=1)
print(above_90_in_7_norm)



6.14421235332821e-06
0.011597737214807502
0.2703554047293071
0.25634538013096164
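
One way to judge the quality of the approximation in d) is to add a continuity correction, evaluating the normal CDF at 89.5 instead of 90. The sketch below is dependency-free and is not part of the original solution; `poisson_cdf` is a hypothetical helper that sums the Poisson probabilities in log space.

```python
import math
from statistics import NormalDist

lam_week = 12 * 7  # weekly rate: 7 days at 12 reports per day


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), with each term computed in log space."""
    return sum(math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
               for i in range(k + 1))


exact = 1 - poisson_cdf(89, lam_week)

# Normal approximation with matching mean and variance.
approx = NormalDist(mu=lam_week, sigma=math.sqrt(lam_week))
plain = 1 - approx.cdf(90)        # no continuity correction
corrected = 1 - approx.cdf(89.5)  # with continuity correction

print(exact, plain, corrected)
print(abs(plain - exact), abs(corrected - exact))
```

The corrected value lands closer to the exact Poisson probability than the uncorrected one, which is the usual argument for applying the correction when a discrete count is approximated by a continuous distribution.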
