<a href="https://colab.research.google.com/github/stswee/IntroCompStatsHSSP2023/blob/main/Class_Code/Intro_to_Comp_Statistics_Day_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Computational Statistics (HSSP 2023 Edition)
## Day 4: Bootstrapping and Data Analysis

In this notebook, we will run bootstrapping programs and perform basic data analysis.

### Activity 1: Rolling 2 Dice

Problem Statement:

You roll two four-sided dice repeatedly and want to test whether the two rolls are associated. You gather the following data:

In [None]:
import numpy as np

data = np.array([[15, 28, 30, 12], [22, 41, 10, 8], [30, 42, 28, 19], [31, 20, 10, 10]])
print(data)

Each entry can be read as a coordinate (row, column). For example, the 12 in the top-right corner is at (1, 4): it is the number of times we rolled a 1 on the first die and a 4 on the second.
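
As a quick check of the coordinate convention, the sketch below looks up a cell of the table. NumPy indices start at 0, so the coordinate (1, 4) corresponds to `data[0, 3]`. The helper name `count_for` is hypothetical and only used here for illustration.

```python
import numpy as np

# Same table as above: rows = first die, columns = second die
data = np.array([[15, 28, 30, 12], [22, 41, 10, 8], [30, 42, 28, 19], [31, 20, 10, 10]])

def count_for(table, first, second):
    """Return how often (first, second) was rolled; die faces are 1-based."""
    return table[first - 1, second - 1]

print(count_for(data, 1, 4))  # top-right cell: 12
print(data.sum())             # total number of rolls: 356
```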

Compute the exact and estimated (bootstrap) p-value.

In [None]:
# By hand calculations
from scipy.stats import chi2
# Initialize totals
row_total = np.zeros(data.shape[0])
column_total = np.zeros(data.shape[1])
sample_size = sum(sum(data))

for i in range(len(row_total)):
  row_total[i] = sum(data[i,:])

for j in range(len(column_total)):
  column_total[j] = sum(data[:, j])

print(sample_size)

# Alternatively, for column total, can use the code below
# column_total = sum(data)

# Determine expected counts array
expected_counts = np.zeros(data.shape)

for i in range(len(row_total)):
  for j in range(len(column_total)):
    expected_counts[i, j] = row_total[i] * column_total[j] / sample_size

# Calculate chi-square test statistic
test_statistic = 0
for i in range(data.shape[0]):
  for j in range(data.shape[1]):
    test_statistic += (data[i, j] - expected_counts[i,j])**2 / (expected_counts[i, j])

# Determine degrees of freedom
df = (data.shape[0] - 1) * (data.shape[1] - 1)

# Determine p-value
pval = chi2.sf(test_statistic, df)  # survival function = 1 - CDF, more accurate for small p-values

print('Test statistic is : ' + str(test_statistic))
print('p value : ' + str(pval))
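
The loops above can also be written with array operations. The following is a minimal sketch, assuming the same `data` table; the name `dof` is chosen here for the degrees of freedom and is not part of the original cell.

```python
import numpy as np
from scipy.stats import chi2

data = np.array([[15, 28, 30, 12], [22, 41, 10, 8], [30, 42, 28, 19], [31, 20, 10, 10]])

# Marginal totals and overall sample size
row_total = data.sum(axis=1)
column_total = data.sum(axis=0)
sample_size = data.sum()

# Expected count in cell (i, j) under independence: row_total[i] * column_total[j] / n
expected_counts = np.outer(row_total, column_total) / sample_size

# Pearson chi-square statistic, summed over all cells
test_statistic = ((data - expected_counts) ** 2 / expected_counts).sum()

# Degrees of freedom and upper-tail p-value
dof = (data.shape[0] - 1) * (data.shape[1] - 1)
pval = chi2.sf(test_statistic, dof)

print('Test statistic is : ' + str(test_statistic))
print('p value : ' + str(pval))
```

`np.outer` builds the whole table of products of row and column totals in one step, which replaces the nested loop over `i` and `j`.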

In [None]:
# Using stats package
from scipy import stats
test_statistic, pval = stats.chi2_contingency(data)[0:2]
print('Test statistic is : ' + str(test_statistic))
print('p value : ' + str(pval))

In [None]:
# Bootstrap
# Initialize totals
row_total = np.zeros(data.shape[0])
column_total = np.zeros(data.shape[1])
sample_size = sum(sum(data))

for i in range(len(row_total)):
  row_total[i] = sum(data[i,:])

for j in range(len(column_total)):
  column_total[j] = sum(data[:, j])

# Estimate null distribution
row_dist = row_total / sum(row_total)
column_dist = column_total / sum(column_total)

In [None]:
import random

# Function definition to draw from distribution
def draw_integers(integers, probabilities, n):
    random_integers = random.choices(integers, probabilities, k = n)
    return random_integers

In [None]:
# Run bootstrap
B = 10000 # Number of samples (for B = 10000, about 90 seconds)

# Storage for array of test statistics
boot_test_statistics = np.empty(B)

for b in range(B):
  # Draw from estimated null distribution
  x = np.array(draw_integers([1, 2, 3, 4], row_dist, sample_size))
  y = np.array(draw_integers([1, 2, 3, 4], column_dist, sample_size))

  # Get contingency table
  boot_contingency_table = np.histogram2d(x, y, bins=[np.max(x), np.max(y)])[0]

  # Calculate test statistic
  boot_test_statistic = stats.chi2_contingency(boot_contingency_table)[0]

  # Store test statistic
  boot_test_statistics[b] = boot_test_statistic

# Bootstrap p-value: share of simulated statistics at least as extreme as the observed one
pval = (np.sum(boot_test_statistics >= test_statistic) + 1) / (B + 1)
print(pval)
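
Drawing the two rolls independently and cross-tabulating them is equivalent to drawing the whole table from a multinomial distribution whose cell probabilities are the products of the estimated row and column probabilities. The sketch below uses that equivalence to simulate all tables at once, which is typically much faster. It is an alternative formulation, not the cell above; the helper `chi_square_statistic`, the seed, and the smaller `B` are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

data = np.array([[15, 28, 30, 12], [22, 41, 10, 8], [30, 42, 28, 19], [31, 20, 10, 10]])
n = data.sum()
row_dist = data.sum(axis=1) / n
column_dist = data.sum(axis=0) / n

def chi_square_statistic(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

# Under independence, cell (i, j) has probability row_dist[i] * column_dist[j]
null_probs = np.outer(row_dist, column_dist).ravel()

# Simulate B tables from the estimated null distribution
B = 2000
tables = rng.multinomial(n, null_probs, size=B).reshape(B, *data.shape)
boot_stats = np.array([chi_square_statistic(t) for t in tables])

# Bootstrap p-value with the usual +1 correction
observed = chi_square_statistic(data)
pval = (np.sum(boot_stats >= observed) + 1) / (B + 1)
print(pval)
```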

### Activity 2: Performing Data Analysis

Research Question: Is there a statistically significant association between salary and gender, education level, and years of experience?

Getting Dataset from Kaggle:

1. Sign in to Kaggle (make a free account if you have not already)

2. Go to https://www.kaggle.com/datasets/mohithsairamreddy/salary-data and download the dataset

3. Unzip the file and load dataset into Google Colab

In [None]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from scipy.stats import norm
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # Suppress warnings

In [None]:
# Load in data
df = pd.read_csv("Salary_Data.csv")

# Rename columns
df = df.rename(columns = {"Education Level" : "Education", "Years of Experience" : "Experience"})

# Clean data
df = df.dropna()
df.loc[df['Education'] == "Bachelor's", 'Education'] = "Bachelor's Degree"
df.loc[df['Education'] == "Master's", 'Education'] = "Master's Degree"
df.loc[df['Education'] == "phD", 'Education'] = "PhD"

# Print unique values for categorical variables
print(df['Gender'].unique())
print(df['Education'].unique())

In [None]:
# View first 5 rows of dataset
df.head()

First, we will fit an ordinary least-squares (OLS) regression with several predictors (an extension of simple linear regression) using all of the data points. We will then calculate confidence intervals for the coefficients.
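
To see what least squares does with a categorical predictor, here is a minimal sketch on made-up, noise-free numbers (not the Kaggle data). The category is turned into a 0/1 indicator column with one level as the baseline, which is also how coefficient names such as `Gender[T.Male]` arise in the model output. All variable names and values below are invented for illustration.

```python
import numpy as np

# Made-up data: salary = 30 + 5 * experience, plus 10 for group "B"
experience = np.array([1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
group = np.array(["A", "A", "A", "B", "B", "B"])
salary = 30.0 + 5.0 * experience + 10.0 * (group == "B")

# Design matrix: intercept, indicator for group B (A is the baseline), experience
X = np.column_stack([
    np.ones_like(experience),
    (group == "B").astype(float),
    experience,
])

# Least-squares fit: minimizes the sum of squared residuals
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
print(coef)  # intercept, group-B offset, experience slope
```

Because the toy data contain no noise, the fit recovers the intercept 30, the group offset 10, and the slope 5 up to floating-point error.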

In [None]:
# Perform multiple linear regression
ols_model = sm.ols(formula = 'Salary ~ Gender + Education + Experience', data = df)
results = ols_model.fit()

# Display results (use .iloc for position-based access to the coefficients)
print("Intercept: ", results.params.iloc[0])
print()
print("Gender Slope:\n", format(results.params.iloc[1:3]))
print()
print("Education Slope:\n", format(results.params.iloc[3:6]))
print()
print("Years Slope:\n", format(results.params.iloc[6]))
print()

In [None]:
# Plot
# Adjust these values
gender = "Male"
education = "PhD"

# Select actual values and make predictions
subset = df[(df['Gender'] == gender) & (df["Education"] == education)]
x = subset.iloc[:, :-1]  # all columns except Salary (the last column)
y_actual = subset.iloc[:, -1]
y_pred = results.predict(x)  # reuse the fitted model instead of refitting

# Plot results
plt.scatter(x['Experience'], y_actual)
plt.plot(x['Experience'], y_pred, linewidth = 1, color = "red")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.title("Gender = " + gender + ", Education = " + education)
plt.show()

In [None]:
# Intercept and Slope 95% confidence intervals
results.conf_int(alpha = 0.05)

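Next, we will use bootstrapping: we resample rows of the data with replacement and refit the model on each resample to see how much the coefficients vary. Here each resample contains half as many rows as the original dataset, so the resulting intervals will be wider than those from full-size resamples. Let's start with a single resample.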
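
The resampling step on its own can be sketched with a small made-up table (the `toy` DataFrame below is invented for illustration). Because rows are drawn with replacement, the same row may appear more than once in a resample.

```python
import pandas as pd

# Made-up data standing in for the salary dataset
toy = pd.DataFrame({
    "Experience": [1, 2, 3, 4, 5, 6],
    "Salary": [40, 45, 52, 58, 61, 70],
})

# Draw half as many rows as the original, with replacement
size_percentage = 0.5
resample = toy.sample(n=int(toy.shape[0] * size_percentage), replace=True)

print(resample)
print("All rows distinct:", resample.index.is_unique)
```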

In [None]:
# Sample a portion of the dataset
size_percentage = 0.5
sample = df.sample(n = int(df.shape[0]*size_percentage), replace = True)

# Perform linear regression on sample
ols_model_boot = sm.ols(formula = 'Salary ~ Gender + Education + Experience', data = sample)
results_boot = ols_model_boot.fit()

# Display results (use .iloc for position-based access to the coefficients)
print("Intercept: ", results_boot.params.iloc[0])
print()
print("Gender Slope:\n", format(results_boot.params.iloc[1:3]))
print()
print("Education Slope:\n", format(results_boot.params.iloc[3:6]))
print()
print("Years Slope:\n", format(results_boot.params.iloc[6]))
print()

If you rerun the code above, you will get different values each time. We can make the results reproducible by setting a seed, which is done by passing a fixed value to the random_state parameter.
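
The effect of the seed can be checked on a small made-up table (the `toy` DataFrame is invented for illustration): two draws with the same `random_state` select exactly the same rows.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})

# Same seed twice, then a different seed
first = toy.sample(n=5, replace=True, random_state=0)
second = toy.sample(n=5, replace=True, random_state=0)
third = toy.sample(n=5, replace=True, random_state=1)

print(first.equals(second))  # True: identical seed, identical resample
print(first.index.tolist(), third.index.tolist())
```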

In [None]:
# Sample a portion of the dataset
size_percentage = 0.5
sample = df.sample(n = int(df.shape[0]*size_percentage), replace = True, random_state = 0)

# Perform linear regression on sample
ols_model_boot = sm.ols(formula = 'Salary ~ Gender + Education + Experience', data = sample)
results_boot = ols_model_boot.fit()

# Display results (use .iloc for position-based access to the coefficients)
print("Intercept: ", results_boot.params.iloc[0])
print()
print("Gender Slope:\n", format(results_boot.params.iloc[1:3]))
print()
print("Education Slope:\n", format(results_boot.params.iloc[3:6]))
print()
print("Years Slope:\n", format(results_boot.params.iloc[6]))
print()

To build a bootstrap 95% confidence interval and display a histogram of the results, we first set up storage for the fitted coefficients. We then repeatedly resample the dataset, refit the model, and record the coefficients. Finally, we plot the results and compute the confidence interval.
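
The same three steps (store, resample and refit, take percentiles) can be sketched on a simpler statistic. The example below builds a percentile interval for the mean of synthetic data; the data, seed, and variable names are assumptions made for illustration. It also resamples at full size, whereas the regression cell draws half as many rows.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
values = rng.normal(loc=50.0, scale=10.0, size=200)  # synthetic data

# Storage for the bootstrap statistics
B = 1000
boot_means = np.empty(B)

# Resample with replacement and record the statistic each time
for b in range(B):
    resample = rng.choice(values, size=values.size, replace=True)
    boot_means[b] = resample.mean()

# Percentile interval: cut off alpha/2 in each tail
alpha = 0.05
lower, upper = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
print(lower, upper)
```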

In [None]:
# Coefficient names, used as the column order of the results
param_names = ['Intercept', 'Gender[T.Male]', 'Gender[T.Other]', 'Education[T.High School]',
               "Education[T.Master's Degree]", 'Education[T.PhD]', 'Experience']

# List to collect one row of coefficients per resample
boot_rows = []

# Number of iterations
B = 1000 # (1000 takes about 45 seconds to run, 10000 about 6 minutes)

# Perform bootstrapping
for b in range(B):

  # Sample a portion of the dataset
  size_percentage = 0.5
  sample = df.sample(n = int(df.shape[0]*size_percentage), replace = True, random_state = b) # Set random state to b for reproducibility

  # Perform linear regression on sample
  ols_model_boot = sm.ols(formula = 'Salary ~ Gender + Education + Experience', data = sample)
  results_boot = ols_model_boot.fit()

  # Record results (DataFrame.append was removed in pandas 2.0)
  boot_rows.append(results_boot.params.to_dict())

  # Keep track of progress for every 100 iterations
  if (b % 100 == 0):
    print(b)

# Dataframe to store results; a coefficient missing from a resample becomes NaN
boot_results = pd.DataFrame(boot_rows, columns = param_names)

In [None]:
# Basic histogram of intercept
plt.hist(boot_results.iloc[:,0])
plt.xlabel(boot_results.columns[0])
plt.ylabel("Frequency")

In [None]:
# Code generated by ChatGPT
def plot_histogram_with_fit(data, num_bins=10, xlabel=None, ylabel=None, title=None):
    """
    Plots a histogram with a normal distribution fit overlay.

    Parameters:
        data (list or numpy array): The input data for the histogram and fit.
        num_bins (int): Number of bins for the histogram. Default is 10.
        xlabel (str): Label for the x-axis. Default is None.
        ylabel (str): Label for the y-axis. Default is None.
        title (str): Title for the plot. Default is None.

    Returns:
        None
    """
    # Create the histogram
    n, bins, patches = plt.hist(data, bins=num_bins, alpha=0.7, edgecolor='black', color='skyblue')

    # Get the mean and standard deviation of the data
    mean = np.mean(data)
    std = np.std(data)

    # Generate normal distribution data points for the overlay
    x = np.linspace(min(data), max(data), 100)
    y = norm.pdf(x, mean, std) * len(data) * (bins[1] - bins[0])  # Scaling to match the histogram

    # Plot the normal distribution fit
    plt.plot(x, y, 'r-', label='Normal Fit')

    # Set labels and title
    if xlabel:
        plt.xlabel(xlabel)
    if ylabel:
        plt.ylabel(ylabel)
    if title:
        plt.title(title)

    # Show the plot with the legend
    plt.legend()
    plt.show()

plot_histogram_with_fit(boot_results.iloc[:,0], num_bins=10, xlabel=boot_results.columns[0], ylabel='Frequency', title='Distribution of ' + boot_results.columns[0])


In [None]:
# Plot all histograms
for i in range(boot_results.shape[1]):
  plot_histogram_with_fit(boot_results.iloc[:,i], num_bins=10, xlabel=boot_results.columns[i], ylabel='Frequency', title='Distribution of ' + boot_results.columns[i])


In [None]:
# Generate confidence intervals
# Select significance level
alpha = 0.05

for i in range(boot_results.shape[1]):
  print(str(round(1 - alpha, 2)*100) + "% Confidence interval for " + str(boot_results.columns[i]) + ": [" +
  str(np.nanpercentile(boot_results.iloc[:,i], alpha/2 * 100)) + ", " +
  str(np.nanpercentile(boot_results.iloc[:,i], 100 - alpha/2 * 100)) + "]")

# Compare with original result
# Intercept and Slope 95% confidence intervals
results.conf_int(alpha)