In [3]:
import numpy as np 
import pandas as pd 
from matplotlib.pyplot import subplots 
import statsmodels.api as sm 

In [8]:
from statsmodels.stats.outliers_influence \
import variance_inflation_factor as VIF
from statsmodels.stats.anova import anova_lm

In [9]:
from ISLP import load_data 
from ISLP.models import (ModelSpec as MS, 
                        summarize, 
                        poly)

### Conceptual Questions

### Question 1. 
Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be in terms of 'sales', 'TV', 'radio', and 'newspaper'. 

### Answer.

For TV $H_0$: $\beta_1 = 0$
For radio $H_0$: $\beta_2 = 0$
For newspaper $H_0$: $\beta_3 = 0$

The p-values for TV and radio are both $p<0.0001$, so we reject the null hypothesis for these variables: there is strong evidence that TV and radio spending are each associated with sales, holding the other two media fixed. The p-value for newspaper is $p=0.8599$, so we fail to reject $H_0$: given TV and radio spending, there is no evidence that newspaper spending is associated with sales. Note that a p-value is not the probability that the null hypothesis is true. It is the probability of observing a t-statistic at least as extreme as the one obtained if $H_0$ were true.

### Question 2. 
Carefully explain the differences between the KNN classifier and KNN regression methods. 

### Answer.
The K-Nearest Neighbors (KNN) classifier is used when the response is categorical (qualitative). It estimates the conditional probability of each class $j$ as the fraction of the $K$ nearest neighbors of the test observation $x_o$ that belong to class $j$, and assigns $x_o$ to the class with the largest estimated probability.

$$P(Y = j \mid X = x_o) = \frac{1}{K} \sum_{i \in N_o} I(y_i = j)$$

where: 
$$
\begin{array}{ll}
    \bullet & Y \text{ is the class label.} \\
    \bullet & x_o \text{ is the test observation (the point for which we want to predict a class).} \\
    \bullet & K \text{ is the number of nearest neighbors we are considering.} \\
    \bullet & N_o \text{ is the set of indices of the } K \text{ nearest neighbors of } x_o. \\
    \bullet & y_i \text{ is the class label of the } i \text{th training sample.} \\
    \bullet & I(y_i = j) \text{ is an indicator function that returns:} \\
    & \quad \text{1 if } y_i = j \text{ (i.e., if the } i \text{th neighbor belongs to class } j). \\
    & \quad \text{0 otherwise.}
\end{array}
$$

KNN regression is used when the response is continuous (quantitative). Instead of assigning a class, it predicts a numerical value by averaging (or sometimes weighting) the responses of the $K$ nearest neighbors.

$$
\hat{f}(x) = \frac{1}{K} \sum_{i \in N_o} y_i
$$

where: 
$$
\begin{array}{ll}
    \bullet & \hat{f}(x) \text{ is the predicted output for the test observation } x. \\
    \bullet & K \text{ is the number of nearest neighbors.} \\
    \bullet & N_o \text{ is the set of indices of the } K \text{ nearest neighbors of } x. \\
    \bullet & y_i \text{ is the observed value of the } i \text{th nearest neighbor.} \\
    \bullet & \text{The formula computes the mean (or weighted mean) of the } y_i \text{ values of the } K \text{ nearest neighbors.}
\end{array}
$$
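The two formulas above can be sketched side by side in NumPy. This is a minimal illustration, assuming Euclidean distance and an unweighted average; the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example and are not from the text.

```python
import numpy as np

def k_nearest(X_train, x0, k):
    """Indices N_o of the k training points closest to x0 (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    return np.argsort(dists)[:k]

def knn_classify(X_train, y_train, x0, k):
    """Estimate P(Y = j | X = x0) by class fractions among the neighbors."""
    idx = k_nearest(X_train, x0, k)
    classes, counts = np.unique(y_train[idx], return_counts=True)
    probs = dict(zip(classes.tolist(), (counts / k).tolist()))
    return classes[np.argmax(counts)], probs

def knn_regress(X_train, y_train, x0, k):
    """Predict f(x0) as the plain average of the neighbors' responses."""
    idx = k_nearest(X_train, x0, k)
    return y_train[idx].mean()

# Toy data (hypothetical): one predictor, a class label and a numeric response.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
labels = np.array([0, 0, 1, 1, 1])
values = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
x0 = np.array([1.4])

pred_class, class_probs = knn_classify(X_train, labels, x0, k=3)
pred_value = knn_regress(X_train, values, x0, k=3)
print(pred_class, class_probs, pred_value)
```

Both functions share the same neighbor search; only the final step differs (a majority vote versus an average), which is the whole distinction between the two methods.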




### Question 3. 
Suppose we have a data set with five predictors, $X_1$=GPA, $X_2$=IQ, $X_3$=Level (1 for College, 0 for High School), $X_4$=Interaction between GPA and IQ, and $X_5$=Interaction between GPA and Level. The response is starting salary after graduation in thousands of dollars. Suppose we use least squares to fit the model and get $\hat{\beta}_0$=50, $\hat{\beta}_1$=20, $\hat{\beta}_2$=0.07, $\hat{\beta}_3$=35, $\hat{\beta}_4$=0.01, $\hat{\beta}_5$=-10.

a. Which answer is correct and why? 

i. For a fixed value of IQ and GPA, high school graduates earn more, on average, than college graduates?

Ans.
False in general. The fitted model is 
$\hat{Y} = 50 + 20X_1 + 0.07X_2 + 35X_3 + 0.01X_4 - 10X_5$

where $X_4 = X_1X_2$ and $X_5 = X_1X_3$, and the indicator for the level is 
$X_3 =\begin{cases} 1 & College\\
                     0 &  High\hspace{0.1cm}School 
       \end{cases}$

For a high school graduate $X_3 = 0$ and therefore $X_5 = 0$, so the model is 
$\hat{Y} = 50 + 20X_1 + 0.07X_2 + 0.01X_4$

For a college graduate $X_3 = 1$ and $X_5 = X_1$, so the model is 
$\hat{Y} = 85 + 10X_1 + 0.07X_2 + 0.01X_4$

For fixed IQ and GPA, the difference (college minus high school) is $35 - 10X_1$. This is positive when GPA is below 3.5 and negative when GPA is above 3.5, so high school graduates earn more only when GPA exceeds 3.5.

ii. For a fixed value of IQ and GPA, college graduates earn more, on average, than high school graduates?  

Ans. False in general, by the same logic: college graduates earn more only when GPA is below 3.5.

iii. For a fixed value of IQ and GPA, high school graduates earn, on average, more than college graduates provided that the GPA is high enough.

Ans. True. This is the correct answer, since $35 - 10X_1 < 0$ when GPA is above 3.5. As a check, set IQ = 100 and GPA = 4 for both graduates.

For the college grad:
$\hat{Y} = 50 + 20\times4 + 0.07\times100 + 35\times1 + 0.01\times400 - 10\times4 = 50 + 80 + 7 + 35 + 4 - 40 = 136$

For the high school grad: 
$\hat{Y} = 50 + 20\times4 + 0.07\times100 + 0.01\times400 = 50 + 80 + 7 + 4 = 141$

iv. For a fixed value of IQ and GPA, college graduates earn, on average, more than high school graduates provided that the GPA is high enough.

Ans. False. A high GPA favors the high school graduate, as shown in (iii). College graduates earn more only when GPA is low. For example, with IQ = 100 and GPA = 1:

For the college grad:
$\hat{Y} = 50 + 20\times1 + 0.07\times100 + 35\times1 + 0.01\times100 - 10\times1 = 50 + 20 + 7 + 35 + 1 - 10 = 103$

For the high school grad: 
$\hat{Y} = 50 + 20\times1 + 0.07\times100 + 0.01\times100 = 50 + 20 + 7 + 1 = 78$
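The arithmetic for this question is easy to check in code. The sketch below hard-codes the fitted coefficients from the question; the function names `predicted_salary` and `college_premium` are hypothetical helpers introduced only for this check. For fixed GPA and IQ, the college-minus-high-school gap works out to $35 - 10 \times \text{GPA}$, which changes sign at GPA = 3.5.

```python
def predicted_salary(gpa, iq, college):
    """Fitted starting salary (thousands of dollars) from the Question 3 model."""
    level = 1 if college else 0
    return (50 + 20 * gpa + 0.07 * iq + 35 * level
            + 0.01 * gpa * iq      # X4: GPA x IQ interaction
            - 10 * gpa * level)    # X5: GPA x Level interaction

def college_premium(gpa, iq=100):
    """College minus high school prediction at the same GPA and IQ."""
    return predicted_salary(gpa, iq, True) - predicted_salary(gpa, iq, False)

for gpa in (1.0, 3.5, 4.0):
    print(f"GPA = {gpa}: college premium = {college_premium(gpa):.2f}")
```

The premium does not depend on IQ because IQ enters both predictions identically and cancels in the difference.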

### Question 4. 

I collect a set of data ($n$=100 observations) containing a single predictor and quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression 
$Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 +\epsilon$.

a. Suppose that the true relationship between $X$ and $Y$ is linear, $Y = \beta_0 + \beta_1X + \epsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer. 

The RSS due to linear regression will be 

$$\begin{align*}
RSS &= \sum_{i=1}^ne_i^2 \\ 
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\ 
&= \sum_{i=1}^n((\beta_0 + \beta_1X_i + \epsilon_i) - (\hat{\beta}_0 + \hat{\beta}_1X_i))^2 \\
&= \sum_{i=1}^n((\beta_0 - \hat{\beta}_0) + (\beta_1 - \hat{\beta}_1)X_i+ \epsilon_i)^2 \\
\end{align*}
$$

The RSS due to cubic regression will be 

$$\begin{align*}
RSS &= \sum_{i=1}^ne_i^2 \\ 
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\ 
&= \sum_{i=1}^n((\beta_0 + \beta_1X_i + \epsilon_i) - (\hat{\beta}_0 + \hat{\beta}_1X_i + \hat{\beta}_2X_i^2 + \hat{\beta}_3X_i^3))^2 \\
&= \sum_{i=1}^n((\beta_0 - \hat{\beta}_0) + (\beta_1 - \hat{\beta}_1)X_i - \hat{\beta}_2X_i^2 - \hat{\beta}_3X_i^3+ \epsilon_i)^2\\
\end{align*}
$$

If we let $\beta_2 = \beta_3 = 0$ in the cubic model such that 

$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$ 

we can recover the linear model as 

$Y = \beta_0 + \beta_1 X + \epsilon$. 

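For the training data, least squares minimizes the residual sum of squares, and the cubic fit minimizes it over a larger parameter space (the parameter vector $\hat{\beta}$ is 4 by 1 instead of 2 by 1). Minimizing over a larger set of possible solutions can't give a higher minimum RSS: at worst the cubic fit reproduces the linear fit by setting the extra coefficients to 0, and it can usually reduce the training RSS further by fitting some of the noise. So even though the true relationship is linear, we expect the training RSS of the cubic regression to be less than or equal to that of the linear regression. 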


## Fitting Both Models by Ordinary Least Squares

We want to solve:

### Linear fit

$$
\hat{\beta}_{\text{lin}}
=
\arg\min_{(\beta_0,\;\beta_1)}
\;\|\,y - X_{\text{lin}}\,\beta\|^2.
$$

### Cubic fit

$$
\hat{\beta}_{\text{cub}}
=
\arg\min_{(\beta_0,\;\beta_1,\;\beta_2,\;\beta_3)}
\;\|\,y - X_{\text{cub}}\,\beta\|^2.
$$



In [10]:
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 3, 6, 9], dtype=float)

# Construct the design matrix for the linear model (np.column_stack expects a single tuple of columns)
X_lin = np.column_stack((np.ones_like(x), x))
X_lin

array([[1., 1.],
       [1., 2.],
       [1., 3.],
       [1., 4.]])

In [12]:
X_cub = np.column_stack((np.ones_like(x), x, x**2, x**3))
X_cub

array([[ 1.,  1.,  1.,  1.],
       [ 1.,  2.,  4.,  8.],
       [ 1.,  3.,  9., 27.],
       [ 1.,  4., 16., 64.]])

In [14]:
# Solve both least squares problems; lstsq is numerically more stable than
# explicitly inverting X.T @ X
beta_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
beta_cub, *_ = np.linalg.lstsq(X_cub, y, rcond=None)

# compute training RSS for each 
y_lin_hat = X_lin @ beta_lin
RSS_lin = np.sum((y - y_lin_hat)**2)

y_cub_hat = X_cub @ beta_cub
RSS_cub = np.sum((y - y_cub_hat)**2)

print(f"Training RSS (linear) = {RSS_lin:.6f}")
print(f"Training RSS (cubic) = {RSS_cub:.6f}")

Training RSS (linear) = 1.200000
Training RSS (cubic) = 0.000000


b. Answer (a) using test $RSS$ instead. 

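Answer. When we switch to test $RSS$ there is no guarantee that the cubic model will have equal or lower $RSS$ than the linear model. Since the true relationship is linear, we would expect the linear model to have the lower test $RSS$ on average: the cubic model has no bias to remove, and its extra coefficients only add variance, which is **overfitting**. On any single test set, however, either model could come out ahead.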

### Overfitting:

When the cubic model uses extra coefficients $\beta_2$ and $\beta_3$ it can chase some random noise in the training set that doesn't generalize. This might reduce the training $RSS$ but might not improve predictions on new data. Sometimes the cubic model might capture the true relationship (especially if it's nonlinear!), giving it a lower $RSS$ than the linear model. But these extra parameters can cause overfitting, causing the test $RSS$ to go up for the cubic model. Therefore, unlike training $RSS$, the test $RSS$ can increase, decrease, or be approximately equal when adding parameters. 

### Bias-Variance Trade-Off: 

$Test\:MSE(\hat{f}) = \underbrace{Bias^2[\hat{f}(x_0)]}_\text{systematic deviation} + \underbrace{Var[\hat{f}(x_0)]}_\text{variance} + \text{irreducible error}$

Adding parameters like $\beta_2, \beta_3$ usually reduces bias but increases variance. Therefore, if the increase in variance outweighs the reduction in squared bias, the test error goes up. On the training set, the extra variance does not cause you to do worse, since more parameters allow you to fit at least as well, so training RSS never rises. 
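A small simulation can illustrate parts (a) and (b) together. This is only a sketch under assumed settings that are not from the text: a true line $f(x) = 1 + 2x$, noise standard deviation 1, predictors drawn uniformly on $[-2, 2]$, $n = 100$ training points as in the question, and 500 repetitions. It reports test mean squared error, which is test RSS divided by the test-set size.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    # Assumed linear truth for this illustration
    return 1.0 + 2.0 * x

n_train, n_test, n_reps, sigma = 100, 1000, 500, 1.0
train_rss = np.zeros((n_reps, 2))   # columns: linear fit, cubic fit
test_mse = np.zeros((n_reps, 2))

for r in range(n_reps):
    x_tr = rng.uniform(-2, 2, n_train)
    y_tr = f_true(x_tr) + rng.normal(0, sigma, n_train)
    x_te = rng.uniform(-2, 2, n_test)
    y_te = f_true(x_te) + rng.normal(0, sigma, n_test)
    for j, deg in enumerate((1, 3)):
        coef = np.polyfit(x_tr, y_tr, deg)   # least squares polynomial fit
        train_rss[r, j] = np.sum((y_tr - np.polyval(coef, x_tr)) ** 2)
        test_mse[r, j] = np.mean((y_te - np.polyval(coef, x_te)) ** 2)

# Training RSS: the cubic fit is never worse in any repetition.
cubic_never_worse_on_train = np.all(train_rss[:, 1] <= train_rss[:, 0] + 1e-9)
# Test error: averaged over repetitions, the linear fit does better.
mean_test_lin, mean_test_cub = test_mse.mean(axis=0)

print("cubic training RSS <= linear in every repetition:", cubic_never_worse_on_train)
print(f"mean test MSE  linear: {mean_test_lin:.4f}  cubic: {mean_test_cub:.4f}")
```

The gap in average test error is small because the cubic model only pays for two unnecessary coefficients, but it is in the direction the bias-variance argument predicts.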



c. Suppose that the true relationship between $X$ and $Y$ is not linear, but we don't know how far it is from linear. Consider the training $RSS$ for the linear regression, and also the training $RSS$ for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

Answer. 

Since the linear model is a special case of the cubic model with $\beta_2=\beta_3=0$, the cubic fit can replicate the linear fit or achieve a better fit by adjusting the additional parameters. The cubic regression has at least the same flexibility on the training set as linear regression. Therefore, regardless of how non-linear the true relationship is, $RSS_{cubic} \leq RSS_{linear}$. If the true relationship is only slightly non-linear, the higher order terms with coefficients $\beta_2,\beta_3$ might not help much, and the training RSS for the linear and cubic fits could be similar. The cubic model still cannot do worse on the training set because it can always revert to $\beta_2=\beta_3=0$. If the relationship is very non-linear (high curvature), then we expect a clearly lower training $RSS$ for the cubic model because the extra polynomial terms can track that curvature and reduce the sum of squared residuals. 

d. Answer (c) using test rather than training $RSS$. 

A more flexible model that includes polynomial expansions (Section 3.3.2) can better approximate a genuinely non-linear relationship if the sample size is large enough to reliably estimate the additional terms. In that scenario, the linear model often has higher bias and tends to produce a higher test RSS. However, flexibility can lead to overfitting, especially with limited or noisy data, thus increasing variance and potentially hurting test RSS. As a result, there is no guarantee that the polynomial model will outperform the linear model on new data; sometimes, the simpler linear approach yields better generalization precisely because it avoids chasing noisy patterns in the training set.
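To make the "no guarantee" point concrete, the sketch below compares average test MSE for a linear and a cubic fit under two hypothetical truths: a strongly curved one ($f(x) = x^3$ with 100 training points) and a barely curved one ($f(x) = x + 0.02x^2$ with only 20 training points). The function `mean_test_mse`, both truths, the noise level, and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np

def mean_test_mse(f_true, n_train, sigma, n_reps=500, n_test=1000, seed=0):
    """Average test MSE of degree-1 and degree-3 least squares fits."""
    rng = np.random.default_rng(seed)
    mse = np.zeros((n_reps, 2))   # columns: linear fit, cubic fit
    for r in range(n_reps):
        x_tr = rng.uniform(-2, 2, n_train)
        y_tr = f_true(x_tr) + rng.normal(0, sigma, n_train)
        x_te = rng.uniform(-2, 2, n_test)
        y_te = f_true(x_te) + rng.normal(0, sigma, n_test)
        for j, deg in enumerate((1, 3)):
            coef = np.polyfit(x_tr, y_tr, deg)
            mse[r, j] = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
    return mse.mean(axis=0)

# Strong curvature, plenty of data: the linear fit is badly biased.
strong_lin, strong_cub = mean_test_mse(lambda x: x ** 3, n_train=100, sigma=1.0)

# Very mild curvature, little data: the cubic fit mostly adds variance.
mild_lin, mild_cub = mean_test_mse(lambda x: x + 0.02 * x ** 2, n_train=20, sigma=1.0)

print(f"strong curvature  linear: {strong_lin:.3f}  cubic: {strong_cub:.3f}")
print(f"mild curvature    linear: {mild_lin:.3f}  cubic: {mild_cub:.3f}")
```

Which model wins depends on how the bias removed by the cubic terms compares with the variance they add, and that is exactly the information the question withholds.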

### Question 5. 
Consider the fitted values that result from performing linear regression without an intercept. In this setting, the $i$th fitted value takes the form

$$
\hat{y}_i = x_i\hat{\beta} 
$$
where 
$$
\hat{\beta} = (\sum_{i=1}^n x_i y_i) / (\sum_{i'=1}^n x_{i'}^2). 
$$

Show that we can write
$$
\hat{y}_i = \sum_{i'=1}^na_{i'}y_{i'}.
$$

What is $a_{i'}$?

Note: We interpret this result by saying that the fitted values from linear regression are linear combinations of the response values.

### Answer.

Let's first restate the question, because the way it is presented in the text is a little confusing. 

We have $n$ data points $(x_1, y_1),\ldots,(x_n, y_n)$. We fit a no-intercept model $y\approx\beta x$. Let's use distinct index variables in the numerator and denominator so there is no confusion, as there can be in the text: 

$\hat{\beta} = (\sum_{i'=1}^n x_{i'} y_{i'}) / (\sum_{m=1}^n x_{m}^2).$ 

For each data point, $i$, the fitted value is $\hat{y}_i = x_i\hat{\beta}$. We want to prove $\hat{y}_i$ is a linear combination of all the $y_{i'}$ such that 

$$\hat{y}_i = \sum_{i'=1}^na_{i'}y_{i'}.$$

Let's go through this: 
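$$\begin{align*}
\hat{y}_i &= x_i\frac{\sum_{i'=1}^n x_{i'} y_{i'}}{\sum_{m=1}^n x_m^2} \\
&=\sum_{i'=1}^n \frac{x_i x_{i'}}{\sum_{m=1}^n x_m^2}y_{i'}
\end{align*}
$$

Moving $x_i$ and the denominator inside the sum is valid because neither depends on the summation index $i'$. Let $a_{i'}=\frac{x_i x_{i'}}{\sum_{m=1}^n x_m^2}$ then

$$\begin{align*}
\hat{y}_i &=\sum_{i'=1}^n \frac{x_i x_{i'}}{\sum_{m=1}^n x_m^2}y_{i'} \\
&= \sum_{i'=1}^n a_{i'}y_{i'}.
\end{align*}
$$

Note that $a_{i'}$ also depends on $i$: each fitted value $\hat{y}_i$ has its own set of weights.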
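The result can be checked numerically by collecting the weights $x_i x_{i'} / \sum_m x_m^2$ into a matrix and applying it to the response vector. The sketch reuses the small `x` and `y` arrays from Question 4 as arbitrary example data; the matrix name `A` is just a label chosen here.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 6.0, 9.0])

# Direct computation: no-intercept slope, then fitted values
beta_hat = np.sum(x * y) / np.sum(x ** 2)
y_hat = x * beta_hat

# Weight matrix: A[i, i'] = x_i * x_i' / sum_m x_m^2
A = np.outer(x, x) / np.sum(x ** 2)
y_hat_from_weights = A @ y   # each fitted value is a weighted sum of all y values

print(np.allclose(y_hat, y_hat_from_weights))
```

Row $i$ of `A` holds the weights $a_{i'}$ for the fitted value $\hat{y}_i$, so the matrix product reproduces the sum $\sum_{i'} a_{i'} y_{i'}$ for every $i$ at once.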

### Question 6.

Using (3.4), argue that in the case of simple linear regression, the least squares line always passes through the point $(\bar{x}, \bar{y})$.

### Answer.

We know from eq. (3.4) in ISLP that 

$$\hat{\beta}_0 = \bar{y}- \hat{\beta}_1\bar{x}$$

and evaluating the fitted line at $x = \bar{x}$: 

$$
\begin{align*}
\hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 x\\
&=\hat{\beta}_0 + \hat{\beta}_1 \bar{x} \\
&=(\bar{y} - \hat{\beta}_1\bar{x}) +\hat{\beta}_1 \bar{x}\\
&=\bar{y}
\end{align*}
$$

Therefore, the least squares line must pass through $(\bar{x}, \bar{y})$.
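A quick numerical check of this property, using simulated data (the true intercept 3.0, slope -1.5, seed, and sample size are arbitrary choices made for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 3.0 - 1.5 * x + rng.normal(size=50)

# Least squares estimates from eq. (3.4)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Evaluate the fitted line at the mean of x
y_at_xbar = beta0_hat + beta1_hat * x.mean()
print(np.isclose(y_at_xbar, y.mean()))
```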

### Question 7. 

It is claimed in the text that in the case of simple linear regression of $Y$ onto $X$, the $R^2$ statistic (3.17) is equal to the square of the correlation between $X$ and $Y$ (3.18). Prove that this is the case. For simplicity, you may assume $\bar{x}=\bar{y}=0$. 


### Answer. (first solution)

Let's first define $R^2$ statistic and correlation between $X$ and $Y$. 

$$
\begin{align*}
R^2 &= \frac{TSS - RSS}{TSS} \\
&= \frac{\sum_{i=1}^n(y_i - \bar{y})^2 -\sum_{i=1}^n(y_i - \hat{y}_i)^2}{\sum_{i=1}^n(y_i - \bar{y})^2}
\end{align*}
$$

$$ Corr(X, Y) = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n(y_i - \bar{y})^2}}.$$

If $\bar{x} = \bar{y} = 0$ then we have 

$$
\begin{align*}
R^2 &= \frac{\sum_{i=1}^ny_i^2 - \sum_{i=1}^n(y_i - \hat{y}_i)^2}{\sum_{i=1}^ny_i^2} \\
&= \frac{\sum_{i=1}^ny_i^2 - \sum_{i=1}^n(y_i^2 - 2\hat{y}_iy_i +\hat{y}_i^2)}{\sum_{i=1}^ny_i^2}\\
&=\frac{\sum_{i=1}^n(2\hat{y}_iy_i -\hat{y}_i^2)}{\sum_{i=1}^ny_i^2}.
\end{align*}
$$

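We know that, since $\bar{x} = \bar{y} = 0$, the intercept is $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 0$ and the fitted values are

$$\hat{y}_i = \hat{\beta}x_i, \qquad \hat{\beta} = \frac{\sum_{i'=1}^n x_{i'}y_{i'}}{\sum_{m=1}^n x_{m}^2}.$$

Since $\hat{\beta}$ does not depend on the summation index $i$, we can pull it out of each sum:

$$\sum_{i=1}^n \hat{y}_i y_i = \hat{\beta}\sum_{i=1}^n x_{i}y_{i} = \hat{\beta}^2\sum_{i=1}^n x_{i}^2 = \sum_{i=1}^n \hat{y}_i^2,$$

where the middle step uses $\sum_{i=1}^n x_i y_i = \hat{\beta}\sum_{i=1}^n x_i^2$.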

Plugging this back into the $R^2$ equation

$$
\begin{align*}
R^2 &=\frac{\sum_{i=1}^n(2\hat{y}_i y_i -\hat{y}_i^2)}{\sum_{i=1}^ny_i^2}\\
&=\frac{\sum_{i=1}^n(2\hat{y}_i^2 - \hat{y}_i^2)}{\sum_{i=1}^n y_i^2}\\
&=\frac{\sum_{i=1}^n\hat{y}_{i}^2}{\sum_{i=1}^ny_i^2}.\\
\end{align*}
$$

The correlation for $\bar{x} = \bar{y}=0$
$$
\begin{align*}
Corr(X, Y) &= \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n(y_i - \bar{y})^2}} \\
&=\frac{\sum_{i=1}^nx_i y_i}{\sqrt{\sum_{i=1}^nx_i^2}\sqrt{\sum_{i=1}^ny_i^2}}. \\
\end{align*}
$$

Let's now square this result
$$
\begin{align*}
Corr(X, Y)^2 &= \frac{(\sum_{i=1}^nx_i y_i)^2}{{\sum_{i=1}^nx_i^2}{\sum_{i=1}^ny_i^2}}\\
\end{align*}
$$

And since $\sum_{i=1}^n\hat{y}_{i}^2 =\sum_{i=1}^nx_i^2 \left( \frac{\sum_{j=1}^n x_j y_j}{\sum_{j=1}^n x_j^2}\right)^2=\frac{(\sum_{j=1}^n x_j y_j)^2}{\sum_{j=1}^n x_j^2}$ then we have 

$$
\begin{align*}
Corr(X, Y)^2 &= \frac{(\sum_{i=1}^nx_i y_i)^2}{{\sum_{i=1}^nx_i^2}}\cdot
\frac{1}{\sum_{i=1}^ny_i^2}\\
&=\frac{\sum_{i=1}^n\hat{y}_i^2}{\sum_{i=1}^n y_i^2}\\
&= R^2.
\end{align*}
$$



### Answer. (second solution)

We know that 

$$R^2 = \frac{TSS - RSS}{TSS} = \frac{ESS}{TSS}.$$

where total sum of squares (TSS) 

$$TSS = \sum_{i=1}^n (y_i - \bar{y})^2,$$

residual sum of squares (RSS)

$$RSS = \sum_{i=1}^n(y_i - \hat{y}_i)^2,$$

explained sum of squares (ESS)

$$ESS = \sum_{i=1}^n(\hat{y}_i - \bar{y})^2.$$

We want to prove that $ESS = TSS - RSS$. We can first decompose $y_i - \bar{y}$ as

$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}).$$

Since $TSS = \sum_{i=1}^n(y_i - \bar{y})^2$ and using our decomposition 

$$
\begin{align*}
(y_i - \bar{y})^2 &= [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2 \\
&=(y_i - \hat{y}_i)^2 + 2(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + (\hat{y}_i - \bar{y})^2.\\
\end{align*}
$$

And now let's sum over $i$

$$
\begin{align*}
\underbrace{\sum_{i=1}^n(y_i - \bar{y})^2}_\text{TSS} = \underbrace{\sum_{i=1}^n(y_i - \hat{y}_i)^2}_\text{RSS} + 2\sum_{i=1}^n(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \underbrace{\sum_{i=1}^n(\hat{y}_i - \bar{y})^2}_\text{ESS}.\\
\end{align*}
$$

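So we need to prove that $\sum_{i=1}^n(y_i - \hat{y}_i)(\hat{y}_i - \bar{y})=0$. Writing $e_i = y_i - \hat{y}_i$ and $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, the cross term is a linear combination of $\sum_{i=1}^n e_i$ and $\sum_{i=1}^n e_i x_i$, and in ordinary least squares both sums are zero. This can be shown using partial derivatives or a geometric projection argument. We start with $\sum_{i=1}^n(y_i - \hat{y}_i) = 0$.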


If we assume that the linear model is 

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i.$$

The prediction (fitted model) is 

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i.$$

The residual of the $i$th observation is 

$$e_i = y_i - \hat{y}_i.$$

We want to choose $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the sum of the squares of residuals 

$$
\begin{align*}
f(\hat{\beta}_0, \hat{\beta}_1) &= \sum_{i=1}^n (y_i - \hat{y}_i)^2 \\
&=\sum_{i=1}^n (y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i))^2.
\end{align*}
$$

To minimize $f$, we can take the partial derivative of $f$ with respect to $\hat{\beta}_0$ and set it equal to zero. 

$$
\begin{align*}
\frac{\partial f}{\partial \hat{\beta}_0} = -2\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1x_i) &= 0\\
\implies \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1x_i) &= 0
\end{align*}
$$

Recall the fitted value for the $i$th observation is 

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_i$$

therefore

$$\sum_{i=1}^n (y_i - \hat{y}_i) = 0.$$

In matrix terms, if you write the regression model as:

$$
Y = X\beta + \varepsilon,
$$

with $X$ including a column of ones (for the intercept), then the residual vector

$$
e = Y - \hat{Y}
$$

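is orthogonal to every column of $X$. In particular, it is orthogonal to the column of ones, which implies:

$$
\mathbf{1}^\top e = \sum_{i=1}^{n} e_i = 0.
$$

It is also orthogonal to the column of predictor values, so $\sum_{i=1}^n e_i x_i = 0$. Combining the two, the cross term vanishes:

$$
\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = (\hat{\beta}_0 - \bar{y})\sum_{i=1}^n e_i + \hat{\beta}_1\sum_{i=1}^n e_i x_i = 0.
$$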
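The orthogonality of the residuals and the resulting sum-of-squares decomposition can be confirmed numerically. The sketch below uses simulated data with an arbitrary true line (intercept 1, slope 2) and fits it with `np.linalg.lstsq`; the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=40)
y = 1.0 + 2.0 * x + rng.normal(size=40)

# Design matrix with a column of ones for the intercept
X = np.column_stack((np.ones_like(x), x))
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
e = y - y_hat                       # residual vector

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum(e ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)
cross = np.sum(e * (y_hat - y.mean()))

print("X^T e   =", X.T @ e)         # both entries are zero up to rounding
print("cross   =", cross)           # zero up to rounding
print("TSS, RSS + ESS =", tss, rss + ess)
```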


So we have proved that 

$$ESS = TSS - RSS$$

and so we can write $R^2$ as 


$$
\begin{align*}
R^2 &= \frac{ESS}{TSS}\\
&=\frac{\sum_{i=1}^n(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n(y_i - \bar{y})^2}\\
&=\frac{\sum_{i=1}^n\hat{y}_i^2}{\sum_{i=1}^ny_i^2}.\\
\end{align*}
$$

Recall that with $\bar{x}=\bar{y}=0$ the intercept is $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = 0$, so

$$\hat{y}_i = \hat{\beta}x_i$$ 

where $\hat{\beta}=\frac{\sum_{j=1}^n x_j y_j}{\sum_{j=1}^n x_j^2}$.

Therefore 

$$
\begin{align*}
\sum_{i=1}^n\hat{y}_i^2 &= \frac{(\sum_{j=1}^n x_j y_j)^2}{(\sum_{j=1}^n x_j^2)^2}\sum_{i=1}^nx_i^2\\
&= \frac{(\sum_{j=1}^n x_j y_j)^2}{\sum_{j=1}^n x_j^2}.
\end{align*}
$$

Substituting into the equation for $R^2$

$$R^2 = \frac{\frac{(\sum_{j=1}^n x_j y_j)^2}{\sum_{j=1}^n x_j^2}}{\sum_{i=1}^n y_i^2}=\frac{(\sum_{j=1}^n x_j y_j)^2}{(\sum_{j=1}^n x_j^2)(\sum_{i=1}^n y_i^2)}.$$


The correlation for $\bar{x} = \bar{y}=0$
$$
\begin{align*}
Corr(x, y) &= \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n(y_i - \bar{y})^2}} \\
&=\frac{\sum_{i=1}^nx_i y_i}{\sqrt{\sum_{i=1}^nx_i^2}\sqrt{\sum_{i=1}^ny_i^2}}. \\
\end{align*}
$$

Let's now square this result
$$
\begin{align*}
Corr(x, y)^2 &= \frac{(\sum_{i=1}^nx_i y_i)^2}{{\sum_{i=1}^nx_i^2}{\sum_{i=1}^ny_i^2}}\\
\end{align*}
$$

These are identical expressions and so 

$$R^2 = Corr(x, y)^2.$$
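As a sanity check on the algebra, the identity can be verified on simulated data. The data-generating line, seed, and sample size below are arbitrary; the data are centered first so that $\bar{x}=\bar{y}=0$, matching the assumption used in the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=60)
y = 0.5 * x + rng.normal(size=60)

# Center both variables so that their means are zero
x = x - x.mean()
y = y - y.mean()

# With centered data the intercept is zero and the slope is sum(xy) / sum(x^2)
beta_hat = np.sum(x * y) / np.sum(x ** 2)
y_hat = beta_hat * x

r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum(y ** 2)   # (TSS - RSS) / TSS
corr = np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))

print(r_squared, corr ** 2)
```

The hand-computed correlation also agrees with `np.corrcoef(x, y)[0, 1]`, which can serve as an independent cross-check.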