# ISLP - Chapter 12 - Exercise 7
### Author: pzuehlke

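To prove that $ 1 - r_{ij} $ is proportional to the squared Euclidean distance
between standardized observations $ i $ and $ j $, let $ \mathbf{x} $ and
$ \mathbf{y} $ be the two vectors in $ \mathbb{R}^p $ corresponding to these
observations. Write $ \bar{x} $ for the mean of the entries of $ \mathbf{x} $
and $ \mathbf{1} $ for the all-ones vector in $ \mathbb{R}^p $. By the formula
for the standard deviation,
$$
    \operatorname{std\,dev}(\mathbf{x}) = \sqrt{\frac{1}{p}\sum_{k=1}^{p} (x_k - \bar{x})^2}
    = \frac{1}{\sqrt{p}}\Vert \mathbf{x} - \bar{x}\,\mathbf{1} \Vert\,.
$$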
Thus, if $ \mathbf{x} $ and $ \mathbf{y} $ both have mean zero and standard
deviation $ 1 $, then
$$ \Vert \mathbf{x} \Vert = \Vert \mathbf{y} \Vert = \sqrt{p}\,. $$
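Therefore, since $ \mathbf{x} \cdot \mathbf{y} = \Vert \mathbf{x} \Vert \, \Vert \mathbf{y} \Vert
\cos \measuredangle (\mathbf{x} , \mathbf{y}) = p \cos \measuredangle (\mathbf{x} , \mathbf{y}) $,
$$
    \Vert \mathbf{x} - \mathbf{y} \Vert^2 = \Vert \mathbf{x} \Vert^2
        + \Vert \mathbf{y} \Vert^2 - 2\,\mathbf{x} \cdot \mathbf{y}
        = 2p - 2p \cos \measuredangle (\mathbf{x} , \mathbf{y})
        = 2p\,\big(1 - \cos \measuredangle (\mathbf{x} , \mathbf{y})\big) = 2p\,(1 - r_{ij})\,,
$$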
because, by the definition of correlation for mean-zero vectors,
$$
    r_{ij} = \frac{\mathbf{x} \cdot \mathbf{y}}{\Vert \mathbf{x} \Vert \, \Vert \mathbf{y} \Vert }
        = \cos \measuredangle (\mathbf{x} , \mathbf{y}) \,. \tag*{$ \blacksquare $}
$$
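
Before turning to real data, the identity can be sanity-checked on two random vectors. The following is only a minimal sketch: the dimension `p = 6`, the seed, and the helper name `standardize` are arbitrary choices made for this illustration and are not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
p = 6                           # arbitrary dimension chosen for this illustration


def standardize(v):
    """Center v and scale it to unit standard deviation (1/p convention)."""
    return (v - v.mean()) / v.std()  # np.std defaults to ddof=0


x = standardize(rng.normal(size=p))
y = standardize(rng.normal(size=p))

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation between x and y
lhs = np.sum((x - y) ** 2)    # squared Euclidean distance
rhs = 2 * p * (1 - r)         # right-hand side of the identity

print(np.linalg.norm(x), np.sqrt(p))  # the norm of x should equal sqrt(p)
print(lhs, rhs)                       # the two sides should agree up to rounding
```

Both printed pairs should coincide up to floating-point error, whatever seed or dimension is used.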

Now let's verify this identity on the `USArrests` dataset:

In [8]:
import numpy as np
import pandas as pd
from statsmodels.datasets import get_rdataset
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

In [9]:
arrests = get_rdataset("USArrests").data
arrests.info()  # info() prints its summary itself and returns None
n, p = arrests.shape  # n = 50, p = 4

<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, Alabama to Wyoming
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Murder    50 non-null     float64
 1   Assault   50 non-null     int64  
 2   UrbanPop  50 non-null     int64  
 3   Rape      50 non-null     float64
dtypes: float64(2), int64(2)
memory usage: 2.0+ KB


In [10]:
arrests.head(10)

rownames,Murder,Assault,UrbanPop,Rape
Alabama,13.2,236,58,21.2
Alaska,10.0,263,48,44.5
Arizona,8.1,294,80,31.0
Arkansas,8.8,190,50,19.5
California,9.0,276,91,40.6
Colorado,7.9,204,78,38.7
Connecticut,3.3,110,77,11.1
Delaware,5.9,238,72,15.8
Florida,15.4,335,80,31.9
Georgia,17.4,211,60,25.8


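The strategy is simple: we standardize the _rows_ of the dataset (each state's
four measurements) with `StandardScaler`, then compute the correlation matrix
and the squared distances between rows, and verify the relationship with
$ p = 4 $. Since `StandardScaler` standardizes columns and divides by the
population standard deviation (the $ 1/p $ convention used above), we apply it
to the transposed data.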

In [11]:
scaler = StandardScaler()
X = scaler.fit_transform(arrests.T).T  # transpose to standardize the _rows_, then transpose back
print(X.shape)

(50, 4)
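
For reference, the same row-wise standardization can be sketched in plain NumPy. This is an illustration and not the code used in this notebook: it hard-codes the first three rows of the table above, and the names `rows`, `Z` and `Z_sample` are introduced here only for the example.

```python
import numpy as np

# First three rows of the table above (Alabama, Alaska, Arizona);
# columns are Murder, Assault, UrbanPop, Rape.
rows = np.array([
    [13.2, 236, 58, 21.2],
    [10.0, 263, 48, 44.5],
    [8.1, 294, 80, 31.0],
])
p = rows.shape[1]

# Standardize each row: subtract its mean, divide by its population std (ddof=0).
row_means = rows.mean(axis=1, keepdims=True)
row_stds = rows.std(axis=1, ddof=0, keepdims=True)
Z = (rows - row_means) / row_stds

print(np.linalg.norm(Z, axis=1))  # each norm should be sqrt(p) = 2, up to rounding

# With the sample convention (ddof=1) the norms become sqrt(p - 1) instead.
Z_sample = (rows - row_means) / rows.std(axis=1, ddof=1, keepdims=True)
print(np.linalg.norm(Z_sample, axis=1))
```

The choice of convention only changes the constant of proportionality: with `ddof=1` every row has squared norm $ p - 1 $, so the same argument gives $ 2(p - 1)(1 - r_{ij}) $ instead of $ 2p\,(1 - r_{ij}) $.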


Let's verify that each row now has norm $ \sqrt{p} = \sqrt{4} = 2 $:

In [12]:
print(np.linalg.norm(X, axis=1))

[2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2.]


Now let's compute the correlation coefficients and squared distances:

In [14]:
correlation_matrix = np.corrcoef(X, rowvar=True)
squared_distances = pairwise_distances(X, metric="euclidean")**2
print(correlation_matrix.shape)
print(squared_distances.shape)

(50, 50)
(50, 50)
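
Both matrices can also be obtained from the Gram matrix of the standardized rows, which mirrors the expansion of $ \Vert \mathbf{x} - \mathbf{y} \Vert^2 $ used in the proof. The sketch below is a NumPy-only alternative to `np.corrcoef` and `pairwise_distances`, not the method used in this notebook; it again hard-codes the first three rows of the table, and the names `G`, `R` and `D2` are chosen for this example.

```python
import numpy as np

# Alabama, Alaska, Arizona rows (Murder, Assault, UrbanPop, Rape) from the table above.
rows = np.array([
    [13.2, 236, 58, 21.2],
    [10.0, 263, 48, 44.5],
    [8.1, 294, 80, 31.0],
])
p = rows.shape[1]

# Row-wise standardization with the population (ddof=0) standard deviation.
Z = (rows - rows.mean(axis=1, keepdims=True)) / rows.std(axis=1, keepdims=True)

G = Z @ Z.T            # Gram matrix: G[i, j] is the dot product of rows i and j
R = G / p              # correlations, because every row has norm sqrt(p)
sq_norms = np.diag(G)  # squared norms of the rows, all equal to p
D2 = sq_norms[:, None] + sq_norms[None, :] - 2 * G  # squared Euclidean distances

print(np.allclose(R, np.corrcoef(Z)))    # R agrees with np.corrcoef
print(np.allclose(D2, 2 * p * (1 - R)))  # the identity from the proof
print(np.round(D2, 3))
```

Because each row is standardized independently of the others, the off-diagonal entries of `D2` should match the top-left $ 3 \times 3 $ block of the matrices printed further below (0.073, 0.011 and 0.082).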


Finally, we compare $ 2p\,(1 - r_{ij}) $ to $ \Vert\mathbf{x}_i - \mathbf{x}_j\Vert^2 $ for each $ i,\,j $:

In [16]:
A = 2 * p * (1 - correlation_matrix)
print(np.round(A, 3))

[[0.    0.073 0.011 ... 0.259 3.3   0.087]
 [0.073 0.    0.082 ... 0.521 3.915 0.26 ]
 [0.011 0.082 0.    ... 0.196 3.052 0.051]
 ...
 [0.259 0.521 0.196 ... 0.    1.821 0.047]
 [3.3   3.915 3.052 ... 1.821 0.    2.399]
 [0.087 0.26  0.051 ... 0.047 2.399 0.   ]]


In [18]:
B = squared_distances
print(np.round(B, 3))

[[0.    0.073 0.011 ... 0.259 3.3   0.087]
 [0.073 0.    0.082 ... 0.521 3.915 0.26 ]
 [0.011 0.082 0.    ... 0.196 3.052 0.051]
 ...
 [0.259 0.521 0.196 ... 0.    1.821 0.047]
 [3.3   3.915 3.052 ... 1.821 0.    2.399]
 [0.087 0.26  0.051 ... 0.047 2.399 0.   ]]


The two matrices look identical after rounding; let's confirm that they agree entrywise up to floating-point tolerance:

In [21]:
np.allclose(A, B, rtol=1e-5, atol=1e-8)

True