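1. Simple Linear Regression predicts the value of an outcome variable Y from a single predictor variable X by fitting the straight line that best matches the observed data points. The model is Y = beta_0 + beta_1 * X + error, where beta_0 is the intercept, beta_1 is the slope, and the error term is normally distributed with mean 0 and standard deviation sigma.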

In [None]:
#1
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

beta_0 = 5   # Intercept
beta_1 = 2   # Slope
sigma = 1    # Standard deviation of the error term

np.random.seed(0)  # Make the simulated noise reproducible
X = np.linspace(0, 10, 50)

# Theoretical line with no noise
Y_true = beta_0 + beta_1 * X

# Observed outcomes: the line plus normally distributed errors
error = norm.rvs(0, sigma, size=X.shape)
Y = Y_true + error

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(X, Y_true, label="True Line (no error)", color="green", linestyle="--")
plt.scatter(X, Y, color="blue", alpha=0.6, label="Observed Y (with error)")
plt.xlabel("Predictor (X)")
plt.ylabel("Outcome (Y)")
plt.legend()
plt.title("Simple Linear Regression Model with Random Noise")
plt.show()

In [None]:
#2
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf 
import plotly.express as px

np.random.seed(0)
X = np.linspace(0, 10, 50)
beta_0 = 5
beta_1 = 2
sigma = 1 
error = np.random.normal(0, sigma, size=X.shape)
Y = beta_0 + beta_1 * X + error

df = pd.DataFrame({'x': X, 'Y': Y})

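# Specify the model: Y is the outcome, x is the predictor
model_data_specification = smf.ols("Y ~ x", data=df)

# Estimate the intercept and slope by ordinary least squares
fitted_model = model_data_specification.fit()

# In a notebook only the last expression in a cell is displayed;
# wrap any of these in print() to see it
fitted_model.summary()            # full regression report
fitted_model.summary().tables[1]  # coefficient table (estimates, std err, p-values)
fitted_model.params               # estimated intercept and slope as a pandas Series
fitted_model.params.values        # the same estimates as a NumPy array
fitted_model.rsquared             # proportion of variation in Y explained by the model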

df['Data'] = 'Data' 

fig = px.scatter(df, x='x', y='Y', color='Data', trendline='ols', title='Y vs. x')

fig.add_scatter(x=df['x'], y=fitted_model.fittedvalues, line=dict(color='blue'), name="trendline='ols'")

fig.show(renderer="png")

In [None]:
#3
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import plotly.express as px

np.random.seed(0)
X = np.linspace(0, 10, 50)
beta_0 = 5
beta_1 = 2
sigma = 1
error = np.random.normal(0, sigma, size=X.shape)
Y = beta_0 + beta_1 * X + error

df = pd.DataFrame({'x': X, 'Y': Y})

model_data_specification = smf.ols("Y ~ x", data=df)
fitted_model = model_data_specification.fit()

df['True_Y'] = beta_0 + beta_1 * df['x']

df['Data'] = 'Data'
fig = px.scatter(df, x='x', y='Y', color='Data', trendline='ols', title='Y vs. x with True Line and Fitted Line')

fig.add_scatter(x=df['x'], y=df['True_Y'], mode='lines', line=dict(color='green', dash='dash'), name="True Line")
fig.add_scatter(x=df['x'], y=fitted_model.fittedvalues, line=dict(color='blue'), name="Fitted Line")

fig.show(renderer="png")

4. fitted_model.fittedvalues applies the estimated intercept and slope to each x value in the data, giving the predicted Y value for every observation.
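
As a quick check of this, the sketch below fits a model to simulated data and rebuilds the predictions by hand from the estimated coefficients. The simulated data and the name manual_predictions are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (assumed for illustration): Y = 5 + 2x + noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
df = pd.DataFrame({"x": x, "Y": 5 + 2 * x + rng.normal(0, 1, size=x.shape)})

fitted_model = smf.ols("Y ~ x", data=df).fit()

# Rebuild the predictions by hand from the estimated coefficients
intercept = fitted_model.params["Intercept"]
slope = fitted_model.params["x"]
manual_predictions = intercept + slope * df["x"]

# The hand-computed values agree with fittedvalues up to floating-point error
matches = np.allclose(manual_predictions, fitted_model.fittedvalues)
print(matches)
```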

5. The ordinary least squares method finds the line that best fits the data by minimizing the sum of squared differences between the observed and predicted Y values. Squaring makes every residual count as a non-negative amount, so positive and negative residuals cannot cancel each other out, and it penalizes large deviations more heavily than small ones.
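
The minimization can be sketched with NumPy alone. The example below uses the standard closed-form estimates for a single predictor and then shows that moving either coefficient away from them makes the sum of squared residuals larger. The simulated data, the helper sum_squared_residuals, and the size of the nudges are all illustrative assumptions.

```python
import numpy as np

# Simulated data (assumed for illustration): Y = 5 + 2x + noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 5 + 2 * x + rng.normal(0, 1, size=x.shape)

def sum_squared_residuals(intercept, slope):
    """Sum of squared vertical distances between the data and a candidate line."""
    residuals = y - (intercept + slope * x)
    return np.sum(residuals ** 2)

# Closed-form least squares estimates for one predictor
slope_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_hat = y.mean() - slope_hat * x.mean()

sse_best = sum_squared_residuals(intercept_hat, slope_hat)
# Nudging either coefficient away from the estimate increases the total
sse_shifted_slope = sum_squared_residuals(intercept_hat, slope_hat + 0.1)
sse_shifted_intercept = sum_squared_residuals(intercept_hat + 0.5, slope_hat)

print(sse_best, sse_shifted_slope, sse_shifted_intercept)
```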

6. R^2 represents the proportion of variation in Y explained by the model. It is calculated by comparing the variation left unexplained by the model to the total variation in Y: R^2 = 1 - (sum of squared residuals) / (total sum of squared deviations of Y from its mean). Higher R^2 values mean the fitted line accounts for more of the variation in Y. The squared correlation between Y and the predicted values gives the same R^2, which reinforces that it measures how closely the predictions track Y.
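
Both ways of getting R^2 can be compared directly. This is a minimal sketch on simulated data (an assumption for illustration), using np.polyfit for the least squares line rather than statsmodels.

```python
import numpy as np

# Simulated data (assumed for illustration): Y = 5 + 2x + noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 5 + 2 * x + rng.normal(0, 1, size=x.shape)

# Least squares line; np.polyfit returns the slope first, then the intercept
slope_hat, intercept_hat = np.polyfit(x, y, 1)
y_hat = intercept_hat + slope_hat * x

# Route 1: one minus unexplained variation over total variation
ss_total = np.sum((y - y.mean()) ** 2)
ss_residual = np.sum((y - y_hat) ** 2)
r_squared = 1 - ss_residual / ss_total

# Route 2: squared correlation between observed and predicted values
r_squared_from_corr = np.corrcoef(y, y_hat)[0, 1] ** 2

print(r_squared, r_squared_from_corr)
```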

7. The Simple Linear Regression model assumes a linear relationship, which may not fit well if the data appears curved. It also assumes constant error variance, but if the spread of errors changes across X, this assumption is violated.

8. The null hypothesis (H0) states that there is no linear association between eruption duration and waiting time, meaning the slope is 0. A low p-value for the slope would lead us to reject H0 and conclude there is evidence of a linear relationship. A high p-value would mean there is insufficient evidence to reject H0, which is not the same as showing that no association exists.
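
To see how the slope p-value behaves, the sketch below runs the test on two simulated data sets, one with a real linear association and one without. It uses scipy.stats.linregress, whose pvalue is for the null hypothesis that the slope is zero. The data are stand-ins, not the Old Faithful measurements, and every number in the simulation is an arbitrary assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(40, 100, 100)  # stand-in for waiting times (assumed values)

# Case 1: a real linear association (the slope 0.07 is an arbitrary choice)
y_related = -1.5 + 0.07 * x + rng.normal(0, 0.5, size=x.shape)
# Case 2: no association, which is the situation H0 describes
y_unrelated = 3.5 + rng.normal(0, 0.5, size=x.shape)

result_related = stats.linregress(x, y_related)
result_unrelated = stats.linregress(x, y_unrelated)

# A tiny p-value in case 1 is evidence against H0
print("related:   slope", result_related.slope, "p-value", result_related.pvalue)
# In case 2 the estimated slope is near 0; a large p-value would only mean
# we failed to find evidence against H0
print("unrelated: slope", result_unrelated.slope, "p-value", result_unrelated.pvalue)
```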

9. For short wait times (<62, <64, <66 minutes), we test if there's still a significant relationship between waiting and duration. A low p-value would indicate a consistent relationship even within short waits, while a high p-value suggests that the association seen in the full data may not apply to shorter waits.

In [None]:
#10
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

old_faithful = sns.load_dataset("geyser")

some_threshold = 71  # Assumed cutoff in minutes for a "long" wait; adjust to match the assignment
long_wait = old_faithful['waiting'] > some_threshold

n_bootstrap_samples = 1000
bootstrap_slopes = []

for _ in range(n_bootstrap_samples):
    sample = old_faithful[long_wait].sample(n=160, replace=True)
    model = smf.ols('duration ~ waiting', data=sample).fit()
    bootstrap_slopes.append(model.params['waiting'])

plt.hist(bootstrap_slopes, bins=30, edgecolor='k')
plt.title("Bootstrapped Sampling Distribution of Slope Coefficients")
plt.xlabel("Slope Coefficient")
plt.ylabel("Frequency")
plt.show()

beta_0 = 1.65
beta_1 = 0
sigma = 0.37
simulated_slopes = []

for _ in range(n_bootstrap_samples):
    x = old_faithful[long_wait]['waiting']
    y_simulated = beta_0 + beta_1 * x + np.random.normal(0, sigma, size=len(x))
    sim_data = pd.DataFrame({'waiting': x, 'duration': y_simulated})
    sim_model = smf.ols('duration ~ waiting', data=sim_data).fit()
    simulated_slopes.append(sim_model.params['waiting'])

plt.hist(simulated_slopes, bins=30, edgecolor='k')
plt.title("Sampling Distribution of Slope Coefficients (Null Hypothesis)")
plt.xlabel("Slope Coefficient")
plt.ylabel("Frequency")
plt.show()

ci_lower = np.percentile(bootstrap_slopes, 2.5)
ci_upper = np.percentile(bootstrap_slopes, 97.5)
print(f"95% Bootstrapped Confidence Interval for Slope: [{ci_lower}, {ci_upper}]")

contains_zero = ci_lower <= 0 <= ci_upper
print(f"Does the 95% CI contain 0? {'Yes' if contains_zero else 'No'}")

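# Compare the null-simulated slopes with the slope fitted to the observed long-wait data
observed_slope = smf.ols('duration ~ waiting', data=old_faithful[long_wait]).fit().params['waiting']
simulated_p_value = np.mean(np.abs(simulated_slopes) >= abs(observed_slope))
print(f"Simulated p-value: {simulated_p_value}")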


11. The indicator model compares average differences between "long" and "short" waits, while the continuous model shows finer linear trends. A low p-value indicates a significant difference between "long" and "short" waits; a high p-value suggests no significant difference.
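
The link between the indicator model and group averages can be checked numerically. In the sketch below, least squares with a 0/1 indicator gives an intercept equal to the "short" group mean and an indicator coefficient equal to the difference in group means. The data are simulated stand-ins (group sizes, means, and spread are assumptions), and np.linalg.lstsq is used in place of statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated stand-in data (assumed): 60 "short" waits and 40 "long" waits
is_long = np.array([0] * 60 + [1] * 40)
duration = np.where(is_long == 1, 4.3, 2.1) + rng.normal(0, 0.4, size=is_long.size)

# Design matrix: a column of ones (intercept) and the 0/1 indicator
design = np.column_stack([np.ones(is_long.size), is_long])
coefficients, *_ = np.linalg.lstsq(design, duration, rcond=None)
intercept_hat, indicator_coef = coefficients

mean_short = duration[is_long == 0].mean()
mean_long = duration[is_long == 1].mean()

# The intercept is the "short" mean; the indicator coefficient is the gap
print(intercept_hat, mean_short)
print(indicator_coef, mean_long - mean_short)
```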

12. Model 1's histogram best supports the normality assumption if it closely follows the bell-shaped curve. Models 2 and 3's histograms likely show deviations from normality due to skewness or heavy tails. Model 4's histogram likely appears bimodal, indicating non-normal residuals due to the indicator variable.

13. 
A. The permutation test shuffles labels between “short” and “long” groups to assess if the observed mean difference could occur by chance under the null hypothesis.
B. The bootstrap confidence interval resamples within each group, with replacement, to estimate how much the observed mean difference would vary from sample to sample. Unlike the indicator variable model, which uses regression and its normal-error assumptions to test the difference, both methods are non-parametric and rely only on the observed data.
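
Both procedures in A and B can be sketched in a few lines of NumPy. The groups below are simulated stand-ins for the "short" and "long" durations, and the group sizes, means, and number of repetitions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated stand-in data (assumed): durations for "short" and "long" waits
short_group = rng.normal(2.1, 0.4, size=60)
long_group = rng.normal(4.3, 0.4, size=40)
observed_diff = long_group.mean() - short_group.mean()

n_reps = 2000

# A. Permutation test: shuffle the pooled values, then split into groups again
pooled = np.concatenate([short_group, long_group])
permuted_diffs = np.empty(n_reps)
for i in range(n_reps):
    shuffled = rng.permutation(pooled)
    new_short = shuffled[:short_group.size]
    new_long = shuffled[short_group.size:]
    permuted_diffs[i] = new_long.mean() - new_short.mean()
# Share of shuffles with a difference at least as extreme as the observed one
p_value = np.mean(np.abs(permuted_diffs) >= abs(observed_diff))

# B. Bootstrap CI: resample with replacement separately within each group
boot_diffs = np.empty(n_reps)
for i in range(n_reps):
    boot_long = rng.choice(long_group, size=long_group.size, replace=True)
    boot_short = rng.choice(short_group, size=short_group.size, replace=True)
    boot_diffs[i] = boot_long.mean() - boot_short.mean()
ci_lower, ci_upper = np.percentile(boot_diffs, [2.5, 97.5])

print("permutation p-value:", p_value)
print("95% bootstrap CI for the mean difference:", ci_lower, ci_upper)
```

The permutation test answers whether a difference this large could plausibly arise when the labels do not matter, while the bootstrap interval describes the plausible size of the difference itself.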

14. ALL FINE