# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [5]:
# Import NumPy, pandas, and the statistical helpers used in this lab
import numpy as np
import pandas as pd
from scipy.stats import trim_mean, mode, skew, gaussian_kde, pearsonr, spearmanr, beta
from scipy.stats import ttest_ind, norm, t, f_oneway, sem
from statsmodels.stats.weightstats import ztest

# Challenge 1 - Exploring the Data

In this challenge, we will examine all salaries of employees of the City of Chicago. We will start by loading the dataset and examining its contents.

In [6]:
# Run this code:
salaries = pd.read_csv('../data/Current_Employee_Names__Salaries__and_Position_Titles.csv')

# Your code here:
salaries.head()


Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
0,"AARON, JEFFERY M",SERGEANT,POLICE,F,Salary,,101442.0,
1,"AARON, KARINA",POLICE OFFICER (ASSIGNED AS DETECTIVE),POLICE,F,Salary,,94122.0,
2,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,F,Salary,,101592.0,
3,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,F,Salary,,110064.0,
4,"ABASCAL, REECE E",TRAFFIC CONTROL AIDE-HOURLY,OEMC,P,Hourly,20.0,,19.86


Examine the `salaries` dataset using the `head` function, as in the cell above.
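
Beyond `head`, it can help to check which columns are missing for which kind of employee. The sketch below re-types the five preview rows shown above into a small DataFrame (the name `preview` is hypothetical and is not part of the lab) and counts missing values and pay types. It only illustrates the pattern on those five rows; the full dataset may behave differently.

```python
import numpy as np
import pandas as pd

# The five rows printed by salaries.head(), re-typed by hand for illustration only.
preview = pd.DataFrame({
    "Name": ["AARON, JEFFERY M", "AARON, KARINA", "AARON, KIMBERLEI R",
             "ABAD JR, VICENTE M", "ABASCAL, REECE E"],
    "Salary or Hourly": ["Salary", "Salary", "Salary", "Salary", "Hourly"],
    "Typical Hours": [np.nan, np.nan, np.nan, np.nan, 20.0],
    "Annual Salary": [101442.0, 94122.0, 101592.0, 110064.0, np.nan],
    "Hourly Rate": [np.nan, np.nan, np.nan, np.nan, 19.86],
})

# Count missing values in each column.
missing = preview.isna().sum()
print(missing)

# Count how many employees are paid a salary and how many are paid by the hour.
pay_type_counts = preview["Salary or Hourly"].value_counts()
print(pay_type_counts)

# Keep only hourly workers and select their hourly rate.
hourly_rates = preview.loc[preview["Salary or Hourly"] == "Hourly", "Hourly Rate"]
print(hourly_rates)
```

In these rows, salaried employees have no `Hourly Rate` and the hourly employee has no `Annual Salary`, which matters when choosing the column for an analysis of hourly workers.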

In [7]:
# Your code here
from scipy import stats

# Convert annual salary to an hourly-equivalent wage in a new "Hourly Wage" column.
# We assume a full-time worker works 2080 hours per year (40 hours/week × 52 weeks),
# so we divide the annual salary by 2080.
salaries['Hourly Wage'] = salaries['Annual Salary'] / 2080

# Remove missing values with dropna() so that only valid numbers enter the analysis.
# In the preview above, the hourly employee has no "Annual Salary" (NaN),
# so such rows are dropped here and the result covers salaried employees.
hourly_wages = salaries['Hourly Wage'].dropna()

# Calculate the sample mean (average hourly wage)
mean = np.mean(hourly_wages)

# Calculate standard deviation
# This calculates the sample standard deviation.
# We use ddof=1 to apply Bessel’s correction, which is standard when working with a sample (not population).
std = np.std(hourly_wages, ddof=1)

# Sample size
# This gets the number of data points (workers) in our dataset.
n = len(hourly_wages)

# Standard error of the mean
# The standard error estimates how much the sample mean would vary from one sample to the next.
# It is calculated as: SE = standard deviation / √n
se = std / np.sqrt(n)

# Compute the 95% confidence interval
# stats.t.interval(...) calculates the confidence interval using the t-distribution.
# We pass:
# confidence = 0.95 (for a 95% interval)
# df = n - 1 (degrees of freedom)
# loc = mean (center of the distribution)
# scale = se (spread of the distribution)
confidence = 0.95
interval = stats.t.interval(confidence, df=n-1, loc=mean, scale=se)

# Print the result with two decimal places.
# interval[0] is the lower bound, interval[1] is the upper bound.
print(f"95% confidence interval for the hourly wage: ${interval[0]:.2f} to ${interval[1]:.2f}")


# Check whether $30/hr lies within the confidence interval we just calculated.
# This helps answer the question: "Is $30/hr a plausible value for the average hourly wage?"
if interval[0] <= 30 <= interval[1]:
    print("$30/hr is within the 95% confidence interval.")
else:
    print("$30/hr is NOT within the 95% confidence interval.")

95% confidence interval for the hourly wage: $41.60 to $41.85
$30/hr is NOT within the 95% confidence interval.


# Challenge 2
This is a placeholder so that the AI corrector can find the correct exercise for feedback.

# Challenge 3 - Constructing Confidence Intervals

We will test whether the hourly wage of all hourly workers is significantly different from $30/hr.

In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. Is $30/hr within that interval?
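
Checking whether $30/hr lies inside a 95% confidence interval is closely related to a two-sided one-sample t-test at the 5% level. The sketch below illustrates that link with `scipy.stats.ttest_1samp`. It uses synthetic, normally distributed data as a stand-in for the hourly rates (an assumption, not the Chicago dataset), so the numbers it prints say nothing about the real answer.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for hourly rates; NOT the Chicago data.
rng = np.random.default_rng(0)
hourly_rates = rng.normal(loc=33.0, scale=12.0, size=500)

# 95% confidence interval for the mean, based on the t-distribution.
confidence = 0.95
ci_low, ci_high = stats.t.interval(
    confidence,
    df=len(hourly_rates) - 1,
    loc=np.mean(hourly_rates),
    scale=stats.sem(hourly_rates),
)

# Two-sided one-sample t-test of H0: the population mean equals 30.
t_stat, p_value = stats.ttest_1samp(hourly_rates, popmean=30)

# The two decisions agree: H0 is rejected at the 5% level
# exactly when 30 lies outside the 95% interval.
rejects_at_5pct = p_value < 0.05
outside_interval = not (ci_low <= 30 <= ci_high)

print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"Reject H0 at 5%: {rejects_at_5pct}, 30 outside CI: {outside_interval}")
```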

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level, the number of rows minus 1 for the degrees of freedom, the sample mean for the location parameter, and the standard error for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.
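
The recipe above can be sketched as follows. The `wages` array holds made-up values chosen only for illustration. The block also shows that `scipy.stats.sem` (which uses `ddof=1` by default) matches the manual formula, and that `t.interval` matches the mean plus or minus the t critical value times the standard error.

```python
import numpy as np
from scipy import stats

# Made-up hourly wages, for illustration only.
wages = np.array([19.86, 22.50, 35.60, 28.10, 41.25, 30.00, 26.75, 33.40])

n = len(wages)
mean = wages.mean()

# Standard error two ways: by hand and with scipy.stats.sem (ddof=1 by default).
se_manual = wages.std(ddof=1) / np.sqrt(n)
se_scipy = stats.sem(wages)

# 95% confidence interval: confidence level, degrees of freedom, location, scale.
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se_scipy)

# The same interval built by hand from the t critical value.
t_crit = stats.t.ppf(0.975, df=n - 1)
manual_low = mean - t_crit * se_scipy
manual_high = mean + t_crit * se_scipy

print(f"Standard error: {se_scipy:.4f}")
print(f"95% CI: {low:.2f} to {high:.2f}")
```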

In [9]:
sample_100 = hourly_wages.sample(100, random_state=42)

mean_100 = np.mean(sample_100)
std_100 = np.std(sample_100, ddof=1)
n_100 = len(sample_100)
se_100 = std_100 / np.sqrt(n_100)

interval_100 = stats.t.interval(confidence, df=n_100-1, loc=mean_100, scale=se_100)

print(f"95% confidence interval for the sample of 100: ${interval_100[0]:.2f} to ${interval_100[1]:.2f}")

if interval_100[0] <= 30 <= interval_100[1]:
    print("$30/hr is within the 95% confidence interval of the sample.")
else:
    print("$30/hr is NOT within the 95% confidence interval of the sample.")

95% confidence interval for the sample of 100: $37.78 to $41.38
$30/hr is NOT within the 95% confidence interval of the sample.


This is fine if we have data on thousands of workers. But what if we have data on only 100 workers?

Sample 100 workers and reconstruct the 95% confidence interval. Is the interval wider or narrower, and why?
Do you still capture the $30/hr mark in this case?

In [10]:
### Reflection
# The confidence interval based on a sample of 100 workers ($37.78 to $41.38) is **wider** than the interval calculated from the full dataset ($41.60 to $41.85). The sample is smaller, so the standard error (standard deviation / √n) is larger and the estimate of the true mean is less certain.
# Confidence intervals from small samples are therefore less precise. Whether $30/hr falls within the interval depends on the random sample drawn. For this sample (random_state=42), the printed result above shows that $30/hr is NOT within the interval.
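
To illustrate how the interval width depends on sample size, here is a minimal sketch on synthetic data. The gamma-distributed `population` and the helper `ci_width` are assumptions made for this example and are not part of the lab. Because the standard error scales with 1/√n, the width should shrink by roughly a factor of √10 each time the sample grows tenfold.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed "wages"; NOT the Chicago data.
rng = np.random.default_rng(42)
population = rng.gamma(shape=9.0, scale=4.5, size=20_000)


def ci_width(sample, confidence=0.95):
    """Return the width of the t-based confidence interval for the mean."""
    low, high = stats.t.interval(
        confidence,
        df=len(sample) - 1,
        loc=np.mean(sample),
        scale=stats.sem(sample),
    )
    return high - low


# Draw samples of increasing size without replacement and record each width.
widths = {}
for n in (100, 1_000, 10_000):
    sample = rng.choice(population, size=n, replace=False)
    widths[n] = ci_width(sample)
    print(f"n = {n:>6}: interval width = ${widths[n]:.2f}")
```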