
<h1 style="text-align: center;">AAE 722 - Homework 1</h1>
<h3 style="text-align: center;">Name: Gary Sun</h3>
<h3 style="text-align: center;">Date: September 22, 2025</h3>

---



# Question 1. Replicating the Regression Table

### 1.1 Replicating the Regression Table
Use the `Advertising` dataset (ISLR/ISLP). Fit `sales ~ TV + radio + newspaper` and show the regression table.


In [38]:
import pandas as pd
import statsmodels.formula.api as smf

# Load Advertising dataset from URL
url = "https://www.statlearning.com/s/Advertising.csv"
Advertising = pd.read_csv(url, index_col=0)
print(Advertising.head(5))

model = smf.ols("sales ~ TV + radio + newspaper", data=Advertising).fit()
results_df = pd.DataFrame({
    "Coefficient": model.params,
    "Std. error": model.bse,
    "t-statistic": model.tvalues,
    "p-value": model.pvalues
})
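# Round the numeric columns, then overwrite the p-value column with formatted strings
results_df = results_df.round(4)

def format_p_value(p):
    """Show very small p-values as "<0.0001" instead of a rounded 0.0."""
    if p < 0.0001:
        return "<0.0001"
    return f"{p:.4f}"

results_df["p-value"] = model.pvalues.apply(format_p_value)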

display(results_df)


      TV  radio  newspaper  sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9


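| | Coefficient | Std. error | t-statistic | p-value |
|-----------|-------------|------------|-------------|---------|
| Intercept | 2.9389 | 0.3119 | 9.4223 | <0.0001 |
| TV | 0.0458 | 0.0014 | 32.8086 | <0.0001 |
| radio | 0.1885 | 0.0086 | 21.8935 | <0.0001 |
| newspaper | -0.001 | 0.0059 | -0.1767 | 0.8599 |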


### 1.2 Hypotheses for the p-values

For each predictor, we test the null hypothesis that its regression coefficient equals zero, i.e., that it has no linear association with sales once the other two media are held fixed. Formally:

- **TV**  
  - H₀: β_TV = 0 (TV advertising has no effect on sales)  
  - H₁: β_TV ≠ 0 (TV advertising affects sales)  
  - Interpretation: Under H₀, increasing TV ad spending would not change product sales. From the regression results, the p-value is < 0.0001, so we strongly reject H₀. This indicates that TV advertising has a significant positive effect on sales.

- **Radio**  
  - H₀: β_radio = 0 (Radio advertising has no effect on sales)  
  - H₁: β_radio ≠ 0 (Radio advertising affects sales)  
  - Interpretation: Under H₀, spending more on radio ads would not impact sales. The regression shows a p-value < 0.0001, so we strongly reject H₀. Radio advertising is also a significant and positive predictor of sales.

- **Newspaper**  
  - H₀: β_newspaper = 0 (Newspaper advertising has no effect on sales)  
  - H₁: β_newspaper ≠ 0 (Newspaper advertising affects sales)  
  - Interpretation: Under H₀, newspaper ad spending has no relationship with sales. The regression output gives a high p-value (≈ 0.86), so we fail to reject H₀. This suggests that newspaper advertising does not have a statistically significant effect on sales in this dataset.
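
The decision rule used above can be sketched numerically. The snippet below recomputes a t-statistic as coefficient divided by standard error and converts a t-statistic into a two-sided p-value. It is only a rough sketch: `two_sided_p_normal` is a hypothetical helper, and it uses the standard normal distribution as a stand-in for the t distribution, which seems reasonable here only because the residual degrees of freedom are large (196, assuming the usual 200-row Advertising data).

```python
import math

def two_sided_p_normal(t_stat):
    """Two-sided p-value for a test statistic under a standard normal approximation."""
    # P(|Z| >= |t|) = erfc(|t| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(t_stat) / math.sqrt(2))

# Values copied from the rounded regression table above
coef_radio, se_radio = 0.1885, 0.0086
t_radio = coef_radio / se_radio      # close to the reported 21.8935 (inputs are rounded)

t_newspaper = -0.1767                # reported t-statistic for newspaper
p_newspaper = two_sided_p_normal(t_newspaper)

print(f"radio t-statistic from rounded values: {t_radio:.2f}")
print(f"newspaper approximate p-value: {p_newspaper:.4f}")
```

The approximate newspaper p-value lands close to the reported 0.8599; the exact value in the table comes from the t distribution that statsmodels uses.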


### 1.3 Interpreting the Results

Based on the regression results:

- **TV advertising** has a strong and statistically significant positive association with sales. The coefficient (≈ 0.046) means that, holding radio and newspaper spending constant, each additional unit of TV advertising spending is associated with an increase of about 0.046 units in sales. In the ISLR data, budgets are in thousands of dollars and sales in thousands of units, so this is roughly 46 more units sold per additional 1,000 dollars. The p-value < 0.0001 confirms that this effect is highly significant.

- **Radio advertising** also shows a strong, positive, and statistically significant relationship with sales. The coefficient (≈ 0.189) indicates that, holding TV and newspaper spending constant, each additional unit of radio spending is associated with an increase of about 0.189 units in sales (roughly 189 more units sold per additional 1,000 dollars). Again, the p-value < 0.0001 confirms significance.

- **Newspaper advertising**, however, does not show a statistically significant relationship with sales. The coefficient is very close to zero (≈ –0.001), and the p-value (≈ 0.86) is far above conventional significance levels. This suggests that newspaper spending does not meaningfully predict sales in this dataset.

**Conclusion:** Sales are strongly associated with TV and radio advertising, while newspaper advertising shows no statistically detectable effect once TV and radio spending are accounted for.
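
To make the coefficient interpretation concrete, the sketch below plugs the rounded estimates from the table into the fitted equation. The function name `predict_sales` is hypothetical, and because the coefficients are rounded to four decimals the outputs only approximate what the fitted model itself would return.

```python
# Rounded coefficient estimates copied from the regression table above
COEFS = {"Intercept": 2.9389, "TV": 0.0458, "radio": 0.1885, "newspaper": -0.001}

def predict_sales(tv, radio, newspaper):
    """Fitted value: intercept plus coefficient times spending for each medium."""
    return (COEFS["Intercept"]
            + COEFS["TV"] * tv
            + COEFS["radio"] * radio
            + COEFS["newspaper"] * newspaper)

# First row of the dataset: TV=230.1, radio=37.8, newspaper=69.2 (observed sales 22.1)
baseline = predict_sales(230.1, 37.8, 69.2)

# Raise one budget by 10 units while holding the other two fixed
tv_effect = predict_sales(240.1, 37.8, 69.2) - baseline
newspaper_effect = predict_sales(230.1, 37.8, 79.2) - baseline

print(f"fitted sales for row 1: {baseline:.2f}")
print(f"change from +10 TV: {tv_effect:.3f}")
print(f"change from +10 newspaper: {newspaper_effect:.3f}")
```

A ten-unit TV increase moves the fitted value by ten times the TV coefficient, while the same increase in newspaper spending barely changes it.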


# Question 2. K-Nearest Neighbors (KNN)

The **K-Nearest Neighbors (KNN)** method is a non-parametric, instance-based learning approach. Its central idea is that predictions for a new observation are made based on the outcomes of the *k* most similar observations in the training data.

#### KNN Classifier
- **How it works**: For classification, KNN looks at the *k* nearest neighbors (based on a distance metric, usually Euclidean distance). The new observation is assigned to the class that appears most frequently among those neighbors.  
- **Example**: If most of the nearest neighbors prefer *Coke* rather than *Tea*, the new person is predicted to prefer *Coke*.  
- **Key idea**: Decision is made by **majority vote**.

#### KNN Regressor
- **How it works**: For regression, KNN again finds the *k* nearest neighbors. Instead of voting, it predicts the outcome as the **average (or weighted average)** of the neighbors’ values.  
- **Example**: To predict a new person’s income, KNN takes the average income of the most similar *k* individuals.  
- **Key idea**: Prediction is made by **averaging**.
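
The two rules described above (majority vote and averaging) can be sketched in a few lines of plain Python. The toy data, the feature meanings (age, drinks per week), and the function names are all made up for illustration; a real analysis would more likely use a library implementation.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Indices of the k training points closest to query (Euclidean distance)."""
    dists = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in dists[:k]]

def knn_classify(train_X, train_y, query, k):
    """Majority vote among the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k):
    """Unweighted average of the k nearest neighbors' outcomes."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)

# Hypothetical training data: (age, drinks per week)
X = [(25, 3), (30, 4), (35, 5), (50, 1), (55, 2), (60, 1)]
drink = ["Coke", "Coke", "Coke", "Tea", "Tea", "Tea"]   # labels for classification
income = [40, 45, 50, 70, 75, 80]                       # outcomes for regression

new_person = (32, 4)
predicted_drink = knn_classify(X, drink, new_person, k=3)
predicted_income = knn_regress(X, income, new_person, k=3)
print(predicted_drink, predicted_income)
```

Both predictions use the same three neighbors; only the final aggregation step differs.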

#### Differences
| Aspect | KNN Classification | KNN Regression |
|--------|--------------------|----------------|
| Output | A **category/label** (e.g., Coke vs. Tea) | A **continuous value** (e.g., income, house price) |
| Rule   | Majority vote among neighbors | Average value of neighbors |
| Error metric | Classification error rate | Mean squared error (MSE), RMSE |

#### Additional Notes
- **Choice of k**: Small *k* can make the model too sensitive (high variance), while large *k* can smooth out important local patterns (high bias).  
- **Feature scaling**: Important, since KNN relies on distance. Variables with large scales can dominate the distance calculation.  
- **“Lazy” algorithm**: KNN does not train a model in advance; it simply stores the data and defers computation until prediction time.
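
The feature-scaling point can be illustrated with a small sketch. The data below (income in dollars, age in years) are invented, and `standardize` is a hypothetical helper that z-scores each feature using the training mean and population standard deviation.

```python
import math
from statistics import mean, pstdev

# Hypothetical training points: (income in dollars, age in years)
train = [(50_000, 25), (50_500, 60), (60_000, 40), (40_000, 45)]
query = (50_400, 26)

def nearest_index(points, q):
    """Index of the single nearest training point (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))

def standardize(points, q):
    """Z-score each feature using the training mean and standard deviation."""
    columns = list(zip(*points))
    means = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]

    def scale(p):
        return tuple((v - m) / s for v, m, s in zip(p, means, sds))

    return [scale(p) for p in points], scale(q)

raw_neighbor = nearest_index(train, query)            # income dominates the distance
scaled_train, scaled_query = standardize(train, query)
scaled_neighbor = nearest_index(scaled_train, scaled_query)

print("nearest (raw):   ", train[raw_neighbor])
print("nearest (scaled):", train[scaled_neighbor])
```

On the raw scale the nearest neighbor is the 60-year-old with a similar income; after standardizing, it becomes the 25-year-old, whose age is almost identical to the query's.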


# Question 3. Analysis of Linear vs. Cubic Regression Models

### (a) Training RSS when the true relationship is linear
If the true relationship between $X$ and $Y$ is linear, the linear model is the correct specification.  
The cubic model is more flexible: it contains the linear model as the special case in which the coefficients on $X^2$ and $X^3$ are zero.  
Because least squares minimizes training RSS, the cubic fit can always match the linear fit on the training data and will usually do slightly better by fitting noise.  
Therefore, the training RSS of the cubic regression will be **less than or equal to** the training RSS of the linear regression.

### (b) Test RSS when the true relationship is linear
On the test data, the linear model is expected to perform better because it matches the true data-generating process.  
The cubic model may overfit the noise in the training set, which increases test error.  
Thus, the **test RSS of the linear model is expected to be lower** than that of the cubic regression.

### (c) Training RSS when the true relationship is not linear
If the true relationship is not linear, the linear model is misspecified.  
The cubic regression is more flexible and can capture at least some of the nonlinearity in the data.  
Therefore, the training RSS of the cubic regression will be **less than or equal to** the training RSS of the linear regression.

### (d) Test RSS when the true relationship is not linear
For the test data, the outcome depends on how nonlinear the true relationship is.  
- If the true function is strongly nonlinear, the cubic regression is likely to generalize better and achieve a lower test RSS.  
- If the true function is only mildly nonlinear (close to linear), the cubic regression may overfit the training data and perform worse on the test set.  

Therefore, when the true relationship is not linear, the comparison of test RSS between the two models is **uncertain** and depends on the degree of nonlinearity.
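
The arguments in parts (a) through (d) can be checked with a small simulation. This is a sketch under assumed settings, not part of the assignment: the function name `rss_comparison`, the two true functions, the noise level, and the sample sizes are all illustrative choices. It fits a linear and a cubic polynomial with `np.polyfit` and reports training and test RSS for each.

```python
import numpy as np

def rss_comparison(true_f, n=100, noise_sd=1.0, seed=0):
    """Fit degree-1 and degree-3 polynomials; return train/test RSS for each."""
    rng = np.random.default_rng(seed)
    x_tr, x_te = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
    y_tr = true_f(x_tr) + rng.normal(scale=noise_sd, size=n)
    y_te = true_f(x_te) + rng.normal(scale=noise_sd, size=n)

    out = {}
    for name, degree in [("linear", 1), ("cubic", 3)]:
        coefs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
        out[name] = {
            "train_rss": float(np.sum((y_tr - np.polyval(coefs, x_tr)) ** 2)),
            "test_rss": float(np.sum((y_te - np.polyval(coefs, x_te)) ** 2)),
        }
    return out

# Assumed true functions: one exactly linear, one strongly cubic
linear_truth = rss_comparison(lambda x: 1 + 2 * x)
cubic_truth = rss_comparison(lambda x: 1 + 2 * x - 1.5 * x ** 3)

for label, res in [("linear truth", linear_truth), ("cubic truth", cubic_truth)]:
    for model, vals in res.items():
        print(f"{label:13s} {model:6s} train={vals['train_rss']:.1f} test={vals['test_rss']:.1f}")
```

The cubic training RSS is never larger than the linear one in either scenario, as argued in (a) and (c). The test RSS comparison under a linear truth depends on the random draw, so rerunning with different `seed` values is a reasonable way to see the expected pattern from (b) on average.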



# Question 4. t-statistic without intercept


In [79]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

df = pd.DataFrame({"x": x, "y": y})

In [85]:
# Q4(a): y ~ x, no intercept 
m_4a = smf.ols("y ~ 0 + x", data=df).fit()

coef = m_4a.params["x"]
se   = m_4a.bse["x"]
tval = m_4a.tvalues["x"]
pval = m_4a.pvalues["x"]

def fmt_p(p):
    # Report very small p-values as "<0.0001" rather than a rounded 0.0000
    return "<0.0001" if p < 1e-4 else f"{p:.4f}"

res_4a = pd.DataFrame({
    "Coefficient":[coef],
    "Std. error":[se],
    "t-statistic":[tval],
    "p-value":[fmt_p(pval)]
}, index=["x (no intercept)"]).round(4)

display(res_4a)

| | Coefficient | Std. error | t-statistic | p-value |
|------------------|-------------|------------|-------------|---------|
| x (no intercept) | 1.9762 | 0.1169 | 16.8984 | <0.0001 |


### 4(a) Interpretation

The estimated coefficient is $\hat{\beta} = 1.9762$ with a standard error of $0.1169$.  
The corresponding $t$-statistic is $16.8984$, and the associated $p$-value is less than $0.0001$.

This very small $p$-value means we **strongly reject** the null hypothesis $H_0: \beta = 0$.  
In other words, $x$ is a highly significant predictor of $y$ in this setting.

Overall, the results are consistent with the data-generating process $y = 2x + \varepsilon$: the slope is significantly different from zero and very close to the true value of 2.


In [87]:
# Q4(b): x ~ y, no intercept 
m_4b = smf.ols("x ~ 0 + y", data=df).fit()

coef_b = m_4b.params["y"]
se_b   = m_4b.bse["y"]
tval_b = m_4b.tvalues["y"]
pval_b = m_4b.pvalues["y"]

res_4b = pd.DataFrame({
    "Coefficient":[coef_b],
    "Std. error":[se_b],
    "t-statistic":[tval_b],
    "p-value":[fmt_p(pval_b)]
}, index=["y (no intercept)"]).round(4)

display(res_4b)

| | Coefficient | Std. error | t-statistic | p-value |
|------------------|-------------|------------|-------------|---------|
| y (no intercept) | 0.3757 | 0.0222 | 16.8984 | <0.0001 |


### 4(b) Report

The estimated coefficient is $\hat{\beta} = 0.3757$, with standard error $SE(\hat{\beta}) = 0.0222$, $t$-statistic $t = 16.8984$, and $p$-value $p < 0.0001$.

Since the p-value is extremely small, we reject the null hypothesis 

$$H_0: \beta = 0.$$

This indicates that $y$ is a statistically significant predictor of $x$ when no intercept is included.  
However, the estimated slope (0.376) is not the reciprocal of the true slope ($1/2 = 0.5$). This is expected: the no-intercept regression of $x$ on $y$ estimates $E[xy]/E[y^2]$, which for $y = 2x + \varepsilon$ with $x$ and $\varepsilon$ independent standard normals equals $2/(4+1) = 0.4$. Reversing the regression (predicting $x$ from $y$) therefore does not simply invert the slope from predicting $y$ from $x$.


### 4(c) Relationship between (a) and (b)

In part (a), we regressed $y$ on $x$ and obtained an estimate of 

$$\hat{\beta}_{yx} \approx 1.98,$$ 

which is close to the true slope of $2$ in the data-generating process.  

In part (b), we reversed the regression and regressed $x$ on $y$, obtaining 

$$\hat{\beta}_{xy} \approx 0.38.$$

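Although one might expect these coefficients to be exact reciprocals (i.e., $2$ vs. $0.5$), they are not. The two regressions minimize different residuals: regressing $y$ on $x$ minimizes squared errors in $y$ and gives $\hat{\beta}_{yx} = \sum x_i y_i / \sum x_i^2$, while regressing $x$ on $y$ minimizes squared errors in $x$ and gives $\hat{\beta}_{xy} = \sum x_i y_i / \sum y_i^2$. Their product is $(\sum x_i y_i)^2 / (\sum x_i^2 \sum y_i^2) \le 1$, with equality only when every point lies exactly on a line through the origin.  

Therefore, the two slopes are related but are reciprocals only in the noise-free case; here their product is about $0.74$ rather than $1$. Both regressions show a strong linear association, but the interpretation depends on which variable is treated as the predictor.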
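
A quick arithmetic check ties the two slopes to the shared $t$-statistic. For regression through the origin, the product of the two slopes equals $(\sum x_i y_i)^2 / (\sum x_i^2 \sum y_i^2)$, and rearranging the closed-form $t$ from part 4(d) gives the same quantity as $t^2 / (t^2 + n - 1)$. The sketch below uses only the rounded values reported in (a) and (b), so agreement is expected only to about three decimals; the variable names are illustrative.

```python
# Values reported in parts (a) and (b), rounded to four decimals
beta_yx = 1.9762      # slope from y ~ 0 + x
beta_xy = 0.3757      # slope from x ~ 0 + y
t_stat = 16.8984      # t-statistic shared by both regressions
n = 100

# Product of the two slopes: (sum x*y)^2 / (sum x^2 * sum y^2)
slope_product = beta_yx * beta_xy

# The same quantity recovered from the t-statistic
r2_from_t = t_stat**2 / (t_stat**2 + (n - 1))

# Population value of the reverse slope when y = 2x + eps with unit variances
reverse_slope_population = 2 / (2**2 + 1)

print(f"product of slopes:        {slope_product:.4f}")
print(f"t^2 / (t^2 + n - 1):      {r2_from_t:.4f}")
print(f"population reverse slope: {reverse_slope_population:.1f}")
```

The slopes would be reciprocals only if this product were exactly 1, which would require a perfect fit.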


### 4(d) Algebraic form of the $t$-statistic (no intercept)

Let

$$
S_{xx}=\sum_{i=1}^n x_i^2,\quad
S_{yy}=\sum_{i=1}^n y_i^2,\quad
S_{xy}=\sum_{i=1}^n x_i y_i .
$$

For regression **through the origin** (no intercept), the OLS slope estimator and its standard error are

$$
\hat{\beta}=\frac{S_{xy}}{S_{xx}},\qquad
SE(\hat{\beta})=\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}},\qquad
\hat{\sigma}^2=\frac{\mathrm{RSS}}{n-1},
$$

where

$$
\mathrm{RSS}=\sum_{i=1}^n (y_i-\hat{\beta} x_i)^2 .
$$

**Step 1. Expand RSS.**

$$
\mathrm{RSS}
=\sum (y_i^2-2\hat{\beta} x_i y_i+\hat{\beta}^2 x_i^2)
= S_{yy}-2\hat{\beta} S_{xy}+\hat{\beta}^2 S_{xx}.
$$

**Step 2. Substitute $\hat{\beta}=S_{xy}/S_{xx}$.**

$$
\mathrm{RSS}
= S_{yy}-2\frac{S_{xy}}{S_{xx}}S_{xy}
 +\left(\frac{S_{xy}}{S_{xx}}\right)^2 S_{xx}
= S_{yy}-\frac{S_{xy}^2}{S_{xx}}.
$$

Hence

$$
\hat{\sigma}^2=\frac{1}{n-1}\!\left(S_{yy}-\frac{S_{xy}^2}{S_{xx}}\right),
\qquad
SE(\hat{\beta})
=\sqrt{\frac{1}{n-1}\cdot\frac{S_{yy}-S_{xy}^2/S_{xx}}{S_{xx}} }.
$$

**Step 3. Form the $t$-statistic.**

$$
\begin{aligned}
t
&= \frac{\hat{\beta}}{SE(\hat{\beta})}
= \frac{S_{xy}/S_{xx}}
{\sqrt{\dfrac{1}{n-1}\cdot\dfrac{S_{yy}-S_{xy}^2/S_{xx}}{S_{xx}}}} \\
&= \frac{\sqrt{n-1}\,S_{xy}}{\sqrt{S_{xx}S_{yy}-S_{xy}^2}}.
\end{aligned}
$$

So

$$
\boxed{t=\frac{\sqrt{n-1}\,\sum_{i=1}^n x_i y_i}{\sqrt{\left(\sum_{i=1}^n x_i^2\right)\left(\sum_{i=1}^n y_i^2\right)-\left(\sum_{i=1}^n x_i y_i\right)^2}}}.
$$

*Note:* the residual degrees of freedom are $n-1$ because only one parameter (the slope) is estimated when there is no intercept.


In [127]:
# Q4(d): verify Eq. 3.38 and algebraic t 
n   = len(df)
sxy = np.sum(x*y)
sxx = np.sum(x**2)
syy = np.sum(y**2)

beta_hat = sxy / sxx                    # Eq. 3.38
resid    = y - x*beta_hat
sigma2   = np.sum(resid**2) / (n-1)     # no-intercept -> df = n-1
SE_beta  = np.sqrt(sigma2) / np.sqrt(sxx)
t1       = beta_hat / SE_beta           # conventional

t2 = (np.sqrt(n-1)*sxy) / np.sqrt(sxx*syy - sxy**2)  # algebraic form

print("beta_hat (Eq. 3.38):", beta_hat)
print("t (beta/SE):        ", t1)
print("t (algebraic form): ", t2)
print("|t1 - t2| =", abs(t1 - t2))


beta_hat (Eq. 3.38): 1.9762423774420508
t (beta/SE):         16.898417063035097
t (algebraic form):  16.898417063035094
|t1 - t2| = 3.552713678800501e-15



### 4(e) Show $t_{y\sim x} = t_{x\sim y}$ (no intercept)
Using the algebraic form above, swap $x$ and $y$ and note the expression is symmetric; hence the $t$-statistics are equal. We can also confirm numerically below.


In [130]:
# t for x ~ y
beta_xy = np.sum(x * y) / np.sum(y**2)
resid_xy = x - y * beta_xy
sigma2_xy = np.sum(resid_xy**2) / (len(x) - 1)
SE_xy = np.sqrt(sigma2_xy) / np.sqrt(np.sum(y**2))
t_xy1 = beta_xy / SE_xy

# algebraic form after swapping (same numeric value)
sxy = np.sum(x * y)
sxx = np.sum(x**2)
syy = np.sum(y**2)
t_xy2 = (np.sqrt(len(x) - 1) * sxy) / np.sqrt(sxx * syy - sxy**2)

print("t for x~y (conventional):", t_xy1)
print("t (algebraic, swapped):  ", t_xy2)
print("difference |t_xy1 - t_xy2| =", abs(t_xy1 - t_xy2))


t for x~y (conventional): 16.898417063035104
t (algebraic, swapped):   16.898417063035094
difference |t_xy1 - t_xy2| = 1.0658141036401503e-14


In [132]:
# Q4(f): with intercept (use default formula with intercept) 
m_y_on_x = smf.ols("y ~ x", data=df).fit()
m_x_on_y = smf.ols("x ~ y", data=df).fit()

print("t-stat for y~x (coef x):", m_y_on_x.tvalues["x"])
print("t-stat for x~y (coef y):", m_x_on_y.tvalues["y"])


t-stat for y~x (coef x): 16.734055202403045
t-stat for x~y (coef y): 16.734055202403038
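
One way to see why the two $t$-statistics also agree when an intercept is included: for simple regression with an intercept, the slope's $t$-statistic can be written as $t = r\sqrt{n-2}/\sqrt{1-r^2}$, where $r$ is the sample correlation, and $r$ is symmetric in $x$ and $y$. The sketch below checks this in plain Python. It draws its own data with the standard-library `random` module (an assumption for self-containment), so the numbers differ from the notebook output above, and the function names are hypothetical.

```python
import math
import random

random.seed(1)
n = 100
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * xi + random.gauss(0, 1) for xi in x]

def centered_sums(u, v):
    """Centered sums of squares and cross-products for u and v."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    suu = sum((ui - u_bar) ** 2 for ui in u)
    svv = sum((vi - v_bar) ** 2 for vi in v)
    suv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    return u_bar, v_bar, suu, svv, suv

def slope_t_with_intercept(u, v):
    """t-statistic for the slope when regressing v on u with an intercept."""
    u_bar, v_bar, suu, _, suv = centered_sums(u, v)
    slope = suv / suu
    intercept = v_bar - slope * u_bar
    rss = sum((vi - intercept - slope * ui) ** 2 for ui, vi in zip(u, v))
    se = math.sqrt(rss / (len(u) - 2) / suu)   # df = n - 2 with an intercept
    return slope / se

def t_from_correlation(u, v):
    """Same t-statistic written in terms of the sample correlation r."""
    _, _, suu, svv, suv = centered_sums(u, v)
    r = suv / math.sqrt(suu * svv)
    return r * math.sqrt(len(u) - 2) / math.sqrt(1 - r * r)

t_y_on_x = slope_t_with_intercept(x, y)
t_x_on_y = slope_t_with_intercept(y, x)
t_corr = t_from_correlation(x, y)
print(t_y_on_x, t_x_on_y, t_corr)
```

All three values agree up to floating-point error, mirroring the matching statsmodels output for `y ~ x` and `x ~ y`.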
