### Assignment # 09 - Point Estimate and Interval Estimate (Confidence Interval)

A random survey of enrollment at **35** community colleges across the United States yielded the following figures: 

6,414; 1,550; 2,109; 9,350; 21,828; 4,300; 5,944; 5,722; 2,825; 2,044;

5,481; 5,200; 5,853; 2,750; 10,012; 6,357; 27,000; 9,414; 7,681; 3,200; 

17,500; 9,200; 7,380; 18,314; 6,557; 13,713; 17,768; 7,493; 2,771; 2,861; 

1,263; 7,285; 28,165; 5,080; 11,622

Perform point estimate and interval estimate with **95% confidence level** using **t-distribution**.

Since the population variance is unknown, we use the t-distribution instead of the normal distribution.

In [1]:
# Import Python packages
import pandas as pd
import numpy as np
from scipy.stats import t

### Step 0 - Data Preprocessing 

Process the raw data into a list of integers. To calculate descriptive statistics, Python needs to work with a list of numbers.
### Note:
Don't build the list by hand-typing the numbers. Write code to automate the data preparation.

In [2]:
# make each line of numbers a string object and then concatenate them together 
# The end result is one single string containing 35 numbers separated by ";"

data_1 = "6,414; 1,550; 2,109; 9,350; 21,828; 4,300; 5,944; 5,722; 2,825; 2,044; " 
data_2 = "5,481; 5,200; 5,853; 2,750; 10,012; 6,357; 27,000; 9,414; 7,681; 3,200; "
data_3 = "17,500; 9,200; 7,380; 18,314; 6,557; 13,713; 17,768; 7,493; 2,771; 2,861; "
data_4 = "1,263; 7,285; 28,165; 5,080; 11,622"
data = data_1 + data_2 + data_3 + data_4
data

'6,414; 1,550; 2,109; 9,350; 21,828; 4,300; 5,944; 5,722; 2,825; 2,044; 5,481; 5,200; 5,853; 2,750; 10,012; 6,357; 27,000; 9,414; 7,681; 3,200; 17,500; 9,200; 7,380; 18,314; 6,557; 13,713; 17,768; 7,493; 2,771; 2,861; 1,263; 7,285; 28,165; 5,080; 11,622'

In [3]:
# Convert the single string to a list of strings using the split() function
# With no argument, split() separates on whitespace, so each item keeps its trailing ";"
data1 = data.split()
print(data1)


['6,414;', '1,550;', '2,109;', '9,350;', '21,828;', '4,300;', '5,944;', '5,722;', '2,825;', '2,044;', '5,481;', '5,200;', '5,853;', '2,750;', '10,012;', '6,357;', '27,000;', '9,414;', '7,681;', '3,200;', '17,500;', '9,200;', '7,380;', '18,314;', '6,557;', '13,713;', '17,768;', '7,493;', '2,771;', '2,861;', '1,263;', '7,285;', '28,165;', '5,080;', '11,622']


Create a list of integers from the list of strings using a list comprehension or a for loop. Make sure to remove the ";" and "," characters first and then convert the strings to integers.

In [4]:
# Use a for loop: strip the ";" and "," characters, then convert to int
cleaned_data = []
for item in data1:
    digits = item.replace(";", "").replace(",", "")
    cleaned_data.append(int(digits))
print(cleaned_data)

[6414, 1550, 2109, 9350, 21828, 4300, 5944, 5722, 2825, 2044, 5481, 5200, 5853, 2750, 10012, 6357, 27000, 9414, 7681, 3200, 17500, 9200, 7380, 18314, 6557, 13713, 17768, 7493, 2771, 2861, 1263, 7285, 28165, 5080, 11622]
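
The same cleaning can also be written as a list comprehension, the other option mentioned above. This is a minimal sketch rather than the required form: it assumes the raw string uses ";" between values and "," as a thousands separator, and the names `raw` and `cleaned` are illustrative only.

```python
# Raw survey figures as one string (same values as in the cells above).
raw = (
    "6,414; 1,550; 2,109; 9,350; 21,828; 4,300; 5,944; 5,722; 2,825; 2,044; "
    "5,481; 5,200; 5,853; 2,750; 10,012; 6,357; 27,000; 9,414; 7,681; 3,200; "
    "17,500; 9,200; 7,380; 18,314; 6,557; 13,713; 17,768; 7,493; 2,771; 2,861; "
    "1,263; 7,285; 28,165; 5,080; 11,622"
)

# Split on ";" explicitly, trim the surrounding spaces,
# drop the thousands separator, and convert each token to int.
cleaned = [int(token.strip().replace(",", "")) for token in raw.split(";")]

print(len(cleaned), cleaned[:3])
```

Splitting on ";" directly means the separator never ends up inside a token, so only the "," has to be removed afterwards.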


### Step 1 - Calculate and Display the Sample Size and Sample Mean

In [5]:
# Calculate and display the sample size
sample_size = len(cleaned_data)
print("Sample size = ", sample_size)

Sample size =  35


In [6]:
# Calculate and display the sample mean
import math
mean = sum(cleaned_data)/len(cleaned_data)
mean = float(math.ceil(mean))
print("Sample mean = ", mean)

Sample mean =  8629.0


The point estimate of the mean enrollment at US community colleges is **8629** students (the sample mean of about 8628.74, rounded up).

### Step 2 - Calculate and Display the Sample Standard Deviation & Sample Standard Error

Sample Standard Deviation $S=\sqrt{\dfrac{1}{n-1}\sum\limits_{i=1}^n (X_i-\bar{X})^2}$

Sample Standard Error = $\dfrac{S}{\sqrt{n}}$

Note: The default **Delta Degrees of Freedom (ddof)** for NumPy's std function is 0, which applies to population data. For sample data, we need to specify **ddof=1**.

For the enrollment data, we round the statistics up to whole numbers (no decimal places).
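
To see what the `ddof` argument changes, here is a small sketch on a hypothetical toy sample (not the enrollment data). The variable names are illustrative; `statistics.stdev` from the standard library is used only as a cross-check, since it also divides by n - 1.

```python
import math
import statistics

import numpy as np

# Hypothetical toy sample, chosen so the arithmetic is easy to follow:
# mean = 5 and the squared deviations sum to 32.
toy = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = np.std(toy)        # ddof=0 (default): divides by n
sample_sd = np.std(toy, ddof=1)    # ddof=1: divides by n - 1
standard_error = sample_sd / math.sqrt(len(toy))

# The standard-library sample standard deviation agrees with ddof=1.
assert math.isclose(sample_sd, statistics.stdev(toy))

print(population_sd, sample_sd, standard_error)
```

The `ddof=1` result is always the larger of the two, because the same sum of squared deviations is divided by n - 1 instead of n.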


In [7]:
# Calculate and display the sample standard deviation using NumPy's std function.
# ddof=1 divides by n - 1, which is what the sample formula above requires.
Std_dev = np.std(cleaned_data, ddof=1)
Std_dev = float(math.ceil(Std_dev))
print("Sample Standard Deviation = ", Std_dev)

Sample Standard Deviation =  6944.0


In [8]:
# Calculate and display the sample standard error
Std_error = np.std(cleaned_data, ddof=1)/np.sqrt(np.size(cleaned_data))
Std_error = float(math.ceil(Std_error))
print("Sample Standard Error = ", Std_error)

Sample Standard Error =  1174.0


### Step 3 - Calculate t Critical Value using t-Distribution 

$\alpha$ = 1 - Confidence Level = 1 - 95% = 0.05

$\dfrac{\alpha}{2}$ = 0.025

n (sample size) = 35

df (degrees of freedom) = n - 1 = 35 - 1 = 34

We will use the PPF (Percent Point Function, the inverse of the CDF) of the t-distribution in Python's scipy.stats to calculate the t critical value $t_{0.025,34}$.

In [9]:
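# Calculate and display the t critical value using the ppf method of scipy.stats.t
# df = degrees of freedom
# The confidence level is 95%; q is the lower-tail probability alpha/2,
# so ppf returns a negative value and we take its absolute value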
confidence = .95
q = (1- confidence)/2 
df = sample_size-1
t_critical_value = t.ppf(q, df)
t_critical_value = abs(t_critical_value)
t_critical_value = round(t_critical_value, 2)
print("t critical value = ", t_critical_value)

t critical value =  2.03
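
The cell above asks for the lower-tail quantile and then takes its absolute value. Because the t-distribution is symmetric about zero, the upper-tail quantile gives the same critical value directly. A short sketch, with illustrative variable names:

```python
from scipy.stats import t

confidence = 0.95
alpha = 1 - confidence
df = 34  # n - 1 for the 35 colleges

lower_tail = t.ppf(alpha / 2, df)      # negative quantile at probability 0.025
upper_tail = t.ppf(1 - alpha / 2, df)  # positive quantile at probability 0.975

print(lower_tail, upper_tail)
```

Either form works; using `1 - alpha / 2` simply avoids the extra `abs()` step.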


### Step 4 - Calculate the Margin of Error

Margin of Error = t Critical Value * Sample Standard Error = $t_{\alpha/2,n-1}\left(\dfrac{S}{\sqrt{n}}\right)$

In [10]:
# Calculate and display the margin of error
# Note: unlike the earlier steps, this step rounds down (math.floor)
Margin_Error =  Std_error * t_critical_value
Margin_Error = float(math.floor(Margin_Error))
print("Margin of Error = ", Margin_Error)

Margin of Error =  2383.0


### Step 5 - Calculate Lower and Upper Limit of the Confidence Interval

Lower Limit = Sample Mean - Margin of Error

Upper Limit = Sample Mean + Margin of Error

In [11]:
# Calculate and display the lower limit
lower_limit = mean - Margin_Error
print("Lower Limit = ", lower_limit)

Lower Limit =  6246.0


In [12]:
# Calculate and display the upper limit
upper_limit = mean + Margin_Error
print("Upper Limit = ", upper_limit)

Upper Limit =  11012.0


### Step 6 - Now We Have the 95% Confidence Interval
Confidence Interval ($\sigma$ unknown) = $\bar{X} \pm t_{\alpha/2,n-1}\left(\dfrac{S}{\sqrt{n}}\right)$ = Sample Mean $\pm$ Margin of Error

In [13]:
print(f"The 95% Confidence Interval = ({lower_limit}, {upper_limit})")

The 95% Confidence Interval = (6246.0, 11012.0)
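
As a cross-check, the interval can also be computed in one call without rounding any intermediate value. This is a sketch under the assumption that `scipy.stats.sem` (which uses `ddof=1` by default) and `scipy.stats.t.interval` are acceptable substitutes for the step-by-step calculation; the names `enrollment`, `low`, and `high` are illustrative.

```python
import numpy as np
from scipy import stats

# Enrollment figures from the survey (same 35 values as above).
enrollment = [
    6414, 1550, 2109, 9350, 21828, 4300, 5944, 5722, 2825, 2044,
    5481, 5200, 5853, 2750, 10012, 6357, 27000, 9414, 7681, 3200,
    17500, 9200, 7380, 18314, 6557, 13713, 17768, 7493, 2771, 2861,
    1263, 7285, 28165, 5080, 11622,
]

n = len(enrollment)
sample_mean = np.mean(enrollment)
standard_error = stats.sem(enrollment)  # sample SD (ddof=1) divided by sqrt(n)

# Two-sided 95% interval from the t-distribution with n - 1 degrees of freedom.
low, high = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=standard_error)

print(low, high)
```

Because nothing is rounded along the way, this interval comes out slightly wider than the one above, which is built from a rounded mean, standard error, critical value, and margin of error.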
