In [1]:
import numpy as np
import pandas as pd

# Exploring Covariance

In [2]:
X_train = np.array(
    [
        [1, 2, 3],
        [4, 5, 6],
    ]
)

X_train, X_train.T

(array([[1, 2, 3],
        [4, 5, 6]]),
 array([[1, 4],
        [2, 5],
        [3, 6]]))

In [3]:
"""
In this case, NumPy interprets:

* Each row as a variable (so 2 variables)
* Each column as an observation (so 3 observations)

The diagonal elements (both 1.0) are the variances of each variable, and the off-diagonal elements
(also both 1.0) are the covariances between variables 1 and 2.
"""
np.cov(X_train)

array([[1., 1.],
       [1., 1.]])

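Every entry is 1.0 because the second row is just the first row plus a constant (3): both variables have the same variance, and they deviate from their means in exactly the same way.
When the covariance equals the product of the two standard deviations, as it does here, the correlation is exactly 1 and the variables are perfectly linearly related.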
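
To make the numbers above less of a black box, here is a minimal sketch that rebuilds the same matrix from the definition of sample covariance (deviations from the mean, multiplied pairwise, divided by $n - 1$). The names `centered`, `n_obs`, and `manual_cov` are illustrative and not part of the original notebook.

```python
import numpy as np

X_train = np.array([[1, 2, 3], [4, 5, 6]])

# Centre each variable (row) by subtracting its own mean.
centered = X_train - X_train.mean(axis=1, keepdims=True)

# Sample covariance: products of deviations summed over observations,
# divided by (n - 1), which is np.cov's default normalisation.
n_obs = X_train.shape[1]
manual_cov = centered @ centered.T / (n_obs - 1)

print(manual_cov)
print(np.allclose(manual_cov, np.cov(X_train)))
```

Both rows centre to `[-1, 0, 1]`, which is why all four entries come out identical.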

In [5]:
"""
When you transpose, you're telling NumPy:

* Each row is a variable (now 3 variables)
* Each column is an observation (now 2 observations)
"""

# The numpy.cov docs say:
# Each row of m represents a variable, and each column a single observation of all those variables.
np.cov(X_train.T)

array([[4.5, 4.5, 4.5],
       [4.5, 4.5, 4.5],
       [4.5, 4.5, 4.5]])
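
If transposing feels error-prone, `np.cov` also accepts a `rowvar` argument. The sketch below (the variable name `cov_by_columns` is illustrative) shows that `rowvar=False` treats columns as variables and gives the same result as passing the transpose.

```python
import numpy as np

X_train = np.array([[1, 2, 3], [4, 5, 6]])

# rowvar=False: each column is a variable, each row is an observation.
cov_by_columns = np.cov(X_train, rowvar=False)

print(cov_by_columns.shape)
print(np.allclose(cov_by_columns, np.cov(X_train.T)))
```

Note that with only two observations per variable, any two variables fall on a straight line, so a matrix of identical entries is unavoidable here and says little about the data.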

## Real-world example

In [6]:
"""
The columns are:

1. sepal length
2. sepal width
3. petal length
4. petal width
5. class
"""
s = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

df = pd.read_csv(s, header=None, encoding='utf-8')
df.head()

     0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa


In [9]:
# Take the first 10 rows (all Iris-setosa) and encode the labels: setosa -> 0, anything else -> 1.
y = df.iloc[0:10, 4].values
y = np.where(y == 'Iris-setosa', 0, 1)

# extract sepal length and petal length (columns 0 and 2).
X = df.iloc[0:10, [0, 2]].values

In [10]:
X, y

(array([[5.1, 1.4],
        [4.9, 1.4],
        [4.7, 1.3],
        [4.6, 1.5],
        [5. , 1.4],
        [5.4, 1.7],
        [4.6, 1.4],
        [5. , 1.5],
        [4.4, 1.4],
        [4.9, 1.5]]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))

In [13]:
# We use the transpose because in np.cov, rows represent variables.
np.cov(X.T)

array([[0.08488889, 0.01888889],
       [0.01888889, 0.01166667]])

Variances (diagonal elements):

- Variable 1 (sepal length) has a variance of 0.08488889
- Variable 2 (petal length) has a variance of 0.01166667
- Variable 1 has a noticeably higher variance than Variable 2 (about 7.3 times higher)


Covariance (off-diagonal elements):

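- The covariance between Variables 1 and 2 is 0.01888889
- This positive value indicates the variables tend to move in the same direction
- Its magnitude depends on the units of both variables, so on its own it does not say how strong the relationship is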
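
A short sketch of why the raw covariance is hard to interpret: rescaling one variable changes the covariance but not the correlation coefficient. The ten rows are the values printed above; the conversion to millimetres is a hypothetical rescaling that assumes the measurements are in centimetres, and `X_mm` and the `cov_*`/`corr_*` names are illustrative.

```python
import numpy as np

# Sepal length and petal length for the first 10 rows shown above.
X = np.array([
    [5.1, 1.4], [4.9, 1.4], [4.7, 1.3], [4.6, 1.5], [5.0, 1.4],
    [5.4, 1.7], [4.6, 1.4], [5.0, 1.5], [4.4, 1.4], [4.9, 1.5],
])

# Hypothetical rescaling: express petal length in mm instead of cm.
X_mm = X * np.array([1.0, 10.0])

cov_cm = np.cov(X, rowvar=False)[0, 1]
cov_mm = np.cov(X_mm, rowvar=False)[0, 1]
corr_cm = np.corrcoef(X, rowvar=False)[0, 1]
corr_mm = np.corrcoef(X_mm, rowvar=False)[0, 1]

print(f"covariance:  {cov_cm:.5f} -> {cov_mm:.5f}")
print(f"correlation: {corr_cm:.5f} -> {corr_mm:.5f}")
```

The covariance grows by the same factor of 10 as the rescaled variable, while the correlation stays put.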

In [19]:
_cov_matrix = np.cov(X.T)

# Extract standard deviations (square root of variances).
std_devs = np.sqrt(np.diag(_cov_matrix))
std_devs, np.square(std_devs)

(array([0.29135698, 0.10801234]), array([0.08488889, 0.01166667]))

In [20]:
"""
The outer product of vectors a and b is a matrix C where:
    C[i,j] = a[i] * b[j] for all i, j
"""
# Create outer product of standard deviations.
outer_std = np.outer(std_devs, std_devs)
outer_std

array([[0.08488889, 0.03147015],
       [0.03147015, 0.01166667]])

In [22]:
# Divide covariance matrix by outer product to get correlation matrix.
corr_matrix = _cov_matrix / outer_std
corr_matrix

array([[1.        , 0.60021603],
       [0.60021603, 1.        ]])

In [23]:
np.corrcoef(X.T)

array([[1.        , 0.60021603],
       [0.60021603, 1.        ]])
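
The three manual steps above amount to $r_{ij} = C_{ij} / (\sigma_i \sigma_j)$. As a rough sketch, they can be wrapped in one small function; `cov_to_corr` is a hypothetical helper name, not a NumPy function, and the input matrix is copied from the `np.cov(X.T)` output above.

```python
import numpy as np

def cov_to_corr(cov):
    """Convert a covariance matrix to a correlation matrix (hypothetical helper)."""
    cov = np.asarray(cov, dtype=float)
    std = np.sqrt(np.diag(cov))        # standard deviations from the diagonal
    return cov / np.outer(std, std)    # divide C[i, j] by std[i] * std[j]

cov = np.array([
    [0.08488889, 0.01888889],
    [0.01888889, 0.01166667],
])
corr = cov_to_corr(cov)
print(corr)
```

The result agrees with `np.corrcoef(X.T)` up to the rounding of the printed covariance values.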

In [25]:
np.square(np.corrcoef(X.T))

array([[1.        , 0.36025929],
       [0.36025929, 1.        ]])

**Diagonal elements (1.0):** These represent the correlation of each variable with itself, which is always exactly 1. This is expected and doesn't provide additional information.

**Off-diagonal elements (0.60021603):** This is the **Pearson correlation coefficient** between the two variables. The value 0.60021603 indicates:

1. **Direction:** The positive value indicates a positive correlation: when one variable increases, the other tends to increase as well.
2. **Strength:** A value of 0.60 is typically considered a moderate to strong positive correlation. A common rule of thumb (the exact cut-offs vary by field) is:

   - 0.0-0.19: Very weak
   - 0.2-0.39: Weak
   - 0.4-0.59: Moderate
   - 0.6-0.79: Strong
   - 0.8-1.0: Very strong

3. **Shared variance:** The square of the correlation coefficient ($r^2$) is approximately 0.36, meaning about 36% of the variance in one variable is accounted for by a linear relationship with the other variable.


---

A correlation of 0.60 does not mean the variables move together 60% of the time; it measures how closely the points follow a straight line.
They're definitely related, but not perfectly linearly (which would be 1.0).
There are other factors (and noise) influencing each variable beyond their relationship with each other.
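
One way to see what "shared variance" means is to fit a straight line and compare its $R^2$ with the squared correlation. This is a sketch using `np.polyfit` on the ten rows printed earlier; the variable names are illustrative.

```python
import numpy as np

sepal_length = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])

# Least-squares line: petal_length ~ slope * sepal_length + intercept.
slope, intercept = np.polyfit(sepal_length, petal_length, deg=1)
predicted = slope * sepal_length + intercept

# R^2 = 1 - (unexplained variation) / (total variation).
ss_res = np.sum((petal_length - predicted) ** 2)
ss_tot = np.sum((petal_length - petal_length.mean()) ** 2)
r_squared_fit = 1.0 - ss_res / ss_tot

r = np.corrcoef(sepal_length, petal_length)[0, 1]
print(f"R^2 from the fit: {r_squared_fit:.4f}")
print(f"r squared:        {r ** 2:.4f}")
```

For a simple linear regression with one predictor, the two quantities coincide, which is where the "about 36%" figure comes from.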

In [26]:
from scipy import stats

In [28]:
X.shape, X.T.shape

((10, 2), (2, 10))

In [33]:
# Extract the two variables
var1 = X.T[0, :]  # First row/variable.
var2 = X.T[1, :]  # Second row/variable.

# Calculate correlation coefficient and p-value.
r, p = stats.pearsonr(var1, var2)

print(f"Correlation coefficient: {r:.3f}")
print(f"p-value: {p:.3f}") # Two-tailed p-value
print(f"is p low? {p <= 0.05}")

Correlation coefficient: 0.600
p-value: 0.067
is p low? False


The p-value interpretation:

- If $p \leq 0.05$ (a common threshold): The correlation is called statistically significant. Data like these would be unlikely if the true correlation were zero, so we reject that null hypothesis.
- If $p > 0.05$: The correlation is not statistically significant. The data are compatible with a true correlation of zero, although they do not show that it is zero.

Here $p = 0.067$: if the true correlation in the population were zero, a sample of 10 observations would show a correlation at least this strong (in either direction) about 6.7% of the time.

A correlation of 0.60 represents a moderate to strong relationship.
It might still be practically meaningful even if not statistically significant.
The lack of statistical significance is more about confidence in the result than the size of the effect.
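
To illustrate how the p-value depends on sample size rather than on the size of the effect alone, the sketch below recomputes it from the standard t-statistic $t = r\sqrt{(n-2)/(1-r^2)}$ with $n - 2$ degrees of freedom. This assumes roughly bivariate normal data; `p_value_from_r` is a hypothetical helper, not a SciPy function.

```python
import numpy as np
from scipy import stats

def p_value_from_r(r, n):
    """Two-tailed p-value for Pearson's r from n observations (hypothetical helper)."""
    df = n - 2
    t_stat = r * np.sqrt(df / (1.0 - r ** 2))
    return 2.0 * stats.t.sf(abs(t_stat), df)

sepal_length = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])

r, p_scipy = stats.pearsonr(sepal_length, petal_length)
p_manual = p_value_from_r(r, len(sepal_length))
print(f"manual: {p_manual:.3f}, scipy: {p_scipy:.3f}")

# Same correlation, different (hypothetical) sample sizes.
for n in (10, 30, 100):
    print(f"n = {n:3d}: p = {p_value_from_r(0.6, n):.6f}")
```

The same $r = 0.6$ that misses the 0.05 threshold with 10 observations falls well below it with 30.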

## Why a Two-Tailed Test for Correlation?

**No Assumed Direction of Association:**

- In most research scenarios, we don't have an a priori reason to assume the correlation will be specifically positive or negative
- The null hypothesis is that there is no correlation in the population ($\rho = 0$)
- The alternative hypothesis is that there is a correlation ($\rho \neq 0$), without specifying the direction


**Detecting Both Positive and Negative Relationships:**

- A two-tailed test allows us to detect significant correlations in either direction
- This is important because either a positive or negative correlation could be theoretically meaningful


**Statistical Conservatism:**

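- At a fixed significance level, a two-tailed test requires stronger evidence than a one-tailed test to call an effect in any one direction significant
- Both tests have the same nominal Type I error rate; the two-tailed test guards against the inflated false-positive rate that results from picking the direction after looking at the data
- One would generally prefer two-tailed tests unless there is strong justification for a directional hypothesis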
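
As a rough illustration of the difference, the sketch below runs both tests on the ten rows used earlier. It assumes a SciPy version in which `stats.pearsonr` accepts the `alternative` keyword (1.9 or later, to the best of my knowledge); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

sepal_length = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])

# Default: two-sided alternative (correlation differs from zero).
r, p_two_sided = stats.pearsonr(sepal_length, petal_length)

# One-sided alternative: correlation is greater than zero.
_, p_greater = stats.pearsonr(sepal_length, petal_length, alternative="greater")

print(f"two-sided p: {p_two_sided:.3f}")
print(f"one-sided p: {p_greater:.3f}")
```

Because the observed correlation is positive, the one-sided p-value is half the two-sided one and drops below 0.05, so the choice of test changes the verdict here. That is why the direction has to be justified before seeing the data.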