Sascha Spors,
Professorship Signal Theory and Digital Signal Processing,
Institute of Communications Engineering (INT),
Faculty of Computer Science and Electrical Engineering (IEF),
University of Rostock,
Germany

# Data Driven Audio Signal Processing - A Tutorial with Computational Examples

Master Course #24512

- lecture: https://github.com/spatialaudio/data-driven-audio-signal-processing-lecture
- tutorial: https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise

Feel free to contact lecturer frank.schultz@uni-rostock.de

# Principal Component Analysis (PCA)

Example in 2D: the data matrix holds M samples (rows) of 2 features (columns).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import svd, diagsvd
from statsmodels.multivariate.pca import PCA

## Create Data and Plot in Original Coordinate System

In [None]:
rng = np.random.default_rng(1)  # be careful when changing seed
# SVD U and V vectors might then need reflections
# to match with statsmodels results

# construct 2 features somehow linearly related drawn from normal PDF
# PCA assumes mean-free columns, we can design them this way:
mean = [0, 0]  # mean-free data
cov = [[2, 0.8], [0.8, 1]]
M = 200  # no of samples
x, y = rng.multivariate_normal(mean, cov, M).T
X = np.array([x, y]).T
# center data to origin, remove mean
X = X - np.mean(X, axis=0)
N = X.shape[1]

print("rank", np.linalg.matrix_rank(X))
print("X.shape", X.shape)

# index for specific data points to plot with specific colors
# this helps to identify potential reflections of U and V space vectors
di1 = 12
di2 = 91
# print(np.where(X[:, 0] > 4))
# print(np.where(X[:, 1] > 2.5))

plt.plot(X[:, 0], X[:, 1], "x")
plt.plot(X[di1, 0], X[di1, 1], "C1x")
plt.plot(X[di2, 0], X[di2, 1], "C3x")
plt.axis("square")
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.title("original data in original coordinate system")
plt.grid(True)

## Calculate Principal Component Analysis (PCA)

In [None]:
# we work on matrix X directly (so standardize=False), and we already
# made it mean free (so demean=False), normalize=False to give us
# data that is nicely connected to SVD data
# from statsmodels.multivariate.pca
pca = PCA(X, ncomp=2, standardize=False, demean=False, normalize=False)
# using SVD
[U, s, Vh] = svd(X, full_matrices=False)
S = diagsvd(s, N, N)
V = Vh.conj().T

# we use abbreviation PC for 'principal component'

pcs = U @ S  # known as PC, PC signals, factors, scores
pcl = V  # known as PC loadings, PC coefficients
# note that sometimes V.T is called loadings, coefficients
# check if statsmodels and our manual SVD are same:
print(np.allclose(pca.scores, pcs))
print(np.allclose(pca.coeff.T, V))
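
The remark about reflections when changing the seed goes back to the sign ambiguity of the SVD: flipping the sign of a left singular vector together with its matching right singular vector leaves the product unchanged, but flips the corresponding PC signal. A minimal sketch of this, assuming small random demo data `A` (hypothetical, not the notebook's `X`):

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # hypothetical demo data, not the notebook's X
A = rng_demo.standard_normal((6, 2))
A = A - np.mean(A, axis=0)  # mean-free columns

U_a, s_a, Vh_a = np.linalg.svd(A, full_matrices=False)

# flip the sign of the 2nd left and the 2nd right singular vector
flip = np.array([1.0, -1.0])
U_b = U_a * flip  # scales the columns of U
Vh_b = Vh_a * flip[:, None]  # scales the rows of Vh, i.e. the columns of V

# both factorizations reproduce the same matrix
same_matrix = np.allclose(U_a @ np.diag(s_a) @ Vh_a, U_b @ np.diag(s_a) @ Vh_b)

# but the 2nd PC signal changes its sign
scores_a = U_a * s_a
scores_b = U_b * s_a
print(same_matrix)
print(np.allclose(scores_b[:, 1], -scores_a[:, 1]))
```

Two PCA implementations may therefore return scores and loadings that differ by a sign per component while describing the same decomposition.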

## Indicate the Directions of the PCs

In [None]:
plt.plot(X[:, 0], X[:, 1], "x")
plt.plot(X[di1, 0], X[di1, 1], "C1x")
plt.plot(X[di2, 0], X[di2, 1], "C3x")

# draw direction of first PC axis -> this is the first col vec of V
plt.plot(
    [0, V[0, 0]],
    [0, V[1, 0]],
    "C6",
    lw=3,
    label="direction of 1st PC == 1st col in V, 1st right sing vec",
)
# draw direction of second PC axis -> this is the second col vec of V
plt.plot(
    [0, V[0, 1]],
    [0, V[1, 1]],
    "C7",
    lw=3,
    label="direction of 2nd PC == 2nd col in V, 2nd right sing vec",
)

plt.axis("square")
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.title("data in feature coordinate system")
plt.legend()
plt.grid(True)

## Plot in PC Coordinate System

In [None]:
plt.plot(pcs[:, 0], pcs[:, 1], "x")
plt.plot(pcs[di1, 0], pcs[di1, 1], "C1x")
plt.plot(pcs[di2, 0], pcs[di2, 1], "C3x")

# draw direction of first PC axis
plt.plot([0, 1], [0, 0], "C6", lw=3, label="direction of 1st PC")
# draw direction of second PC axis
# works as we have no reflection in U/V
plt.plot([0, 0], [0, 1], "C7", lw=3, label="direction of 2nd PC")

plt.axis("square")
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("data in PC coordinate system")
plt.legend()
plt.grid(True)

In [None]:
phi_deg = -np.arccos(V[0, 0]) * 180 / np.pi
print("rotate the data cloud by ", phi_deg, "deg")
print("to get to the here shown PC coordinate system")
# note that the chosen example does not involve a reflection!
# therefore only rotation is involved
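
Whether a 2x2 orthogonal matrix such as V describes a pure rotation can be checked with its determinant: +1 indicates a proper rotation, -1 indicates that a reflection is involved. The sketch below uses a hypothetical helper `describe_orthogonal_2d` and hand-built test matrices (assumptions for illustration, not part of the notebook). It returns the angle of the first column against the feature-1 axis; the data cloud is rotated by the negative of this angle to reach the PC coordinate system.

```python
import numpy as np


def describe_orthogonal_2d(Q):
    """Return (kind, angle_deg) for a 2x2 orthogonal matrix Q (hypothetical helper)."""
    det = np.linalg.det(Q)
    kind = "rotation" if det > 0 else "rotation plus reflection"
    # angle of the first column with respect to the first coordinate axis
    angle_deg = np.degrees(np.arctan2(Q[1, 0], Q[0, 0]))
    return kind, angle_deg


phi = np.radians(30)
R = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
F = R @ np.diag([1.0, -1.0])  # same first axis, second axis flipped

kind_R, angle_R = describe_orthogonal_2d(R)
kind_F, angle_F = describe_orthogonal_2d(F)
print(kind_R, angle_R)
print(kind_F, angle_F)
```

Using `arctan2` instead of `arccos` keeps the sign of the angle, which matters once the first column of the matrix points into another quadrant.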

For a rank-r data matrix, the data is rotated (and potentially reflected along some axes) such that the largest variance (largest data spread) occurs along the axis of PC 1, whereas the smallest variance occurs along the axis of the r-th PC (the last PC, here PC 2).
Generally, var(PC1) ≥ var(PC2) ≥ ... ≥ var(PCr), following the descending order of the singular values in the SVD. Recall how we calculated `pcs`...
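
Since the PC signals are the columns of U scaled by the singular values, and the columns of U have unit length, the unbiased variance of the i-th PC signal is s_i^2 / (M - 1) for mean-free data. A minimal sketch, assuming demo data `A` drawn with a fresh seed (hypothetical, not the notebook's `X`):

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # hypothetical demo data
M_demo = 500
A = rng_demo.multivariate_normal([0, 0], [[2, 0.8], [0.8, 1]], M_demo)
A = A - np.mean(A, axis=0)  # mean-free columns

U_d, s_d, Vh_d = np.linalg.svd(A, full_matrices=False)
scores = U_d * s_d  # PC signals, same as U_d @ np.diag(s_d)

var_from_scores = np.var(scores, axis=0, ddof=1)
var_from_s = s_d**2 / (M_demo - 1)

print(var_from_scores)
print(var_from_s)
```

The descending order of the singular values thus directly gives the descending order of the PC variances.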

## Variances

In [None]:
print(np.var(pcs[:, 0], ddof=1))
print(np.var(pcs[:, 1], ddof=1))

We compare this to the variances of the original features:

In [None]:
print(np.var(X[:, 0], ddof=1))
print(np.var(X[:, 1], ddof=1))

Note that we don't lose variance; it is just distributed differently.

In [None]:
print(np.var(pcs[:, 0], ddof=1) + np.var(pcs[:, 1], ddof=1))
print(np.var(X[:, 0], ddof=1) + np.var(X[:, 1], ddof=1))

That's the fundamental idea of shaping the data to re-sort variances, here using the special case of PCA, i.e. projecting the data onto the vectors of an orthonormal basis (the columns of V). The result is U S, i.e. our well-known orthonormal column space vectors in the U matrix of the SVD, weighted by the singular values. The variance of each PC signal is therefore determined by its corresponding singular value. Recall how we calculated `pcs`...

1st PC signal explains 82.37 % of total variance:

In [None]:
np.var(pcs[:, 0], ddof=1) / (
    np.var(pcs[:, 0], ddof=1) + np.var(pcs[:, 1], ddof=1)
)

2nd PC signal explains the remaining 17.63 % of total variance: 

In [None]:
np.var(pcs[:, 1], ddof=1) / (
    np.var(pcs[:, 0], ddof=1) + np.var(pcs[:, 1], ddof=1)
)
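
The two ratios above can also be obtained directly from the singular values, since the common factor 1 / (M - 1) cancels: the i-th PC signal explains s_i^2 / sum(s^2) of the total variance. A minimal sketch, assuming demo data `A` (hypothetical, so the printed percentages differ from the ones stated for the notebook's `X`):

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # hypothetical demo data
M_demo = 500
A = rng_demo.multivariate_normal([0, 0], [[2, 0.8], [0.8, 1]], M_demo)
A = A - np.mean(A, axis=0)  # mean-free columns

U_d, s_d, Vh_d = np.linalg.svd(A, full_matrices=False)
scores = U_d * s_d

# explained variance ratio from the singular values
ratio_from_s = s_d**2 / np.sum(s_d**2)

# same ratio from the variances of the PC signals
var_scores = np.var(scores, axis=0, ddof=1)
ratio_from_var = var_scores / np.sum(var_scores)

print(ratio_from_s * 100, "%")
print(ratio_from_var * 100, "%")
```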

## Truncated SVD

Rank reduction: approximate X using only the first `r_des` singular values and their singular vectors.

In [None]:
r_des = 1

# we could make a sum of outer products, i.e. sum of rank-1 matrices
X_rank_red = np.zeros((M, N))
for i in range(r_des):
    # either
    # X_rank_red += S[i,i] * U[:,i][:,None] @ V[:,i][:,None].T
    # or
    X_rank_red += s[i] * np.outer(U[:, i], V[:, i])

# we might also use the PC signals and set intended loadings to zero
X_rank_red2 = np.zeros((M, N))
pcl_rank_red = np.copy(pcl)
pcl_rank_red[:, r_des:] = 0
X_rank_red2 = pcs @ pcl_rank_red.conj().T

np.allclose(X_rank_red, X_rank_red2)
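
The quality of such a rank reduction can be read from the discarded singular values: the Frobenius norm of the approximation error equals the square root of the sum of their squares. A minimal sketch, assuming demo data `A` with three features (hypothetical, chosen so that more than one singular value is discarded):

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # hypothetical demo data with 3 features
A = rng_demo.standard_normal((50, 3))
A = A - np.mean(A, axis=0)  # mean-free columns

U_d, s_d, Vh_d = np.linalg.svd(A, full_matrices=False)

r_keep = 1  # number of singular values to keep
A_r = (U_d[:, :r_keep] * s_d[:r_keep]) @ Vh_d[:r_keep, :]

err_fro = np.linalg.norm(A - A_r, ord="fro")
err_from_s = np.sqrt(np.sum(s_d[r_keep:] ** 2))

print(np.linalg.matrix_rank(A_r))
print(err_fro, err_from_s)
```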

In [None]:
plt.plot(X_rank_red2[:, 0], X_rank_red2[:, 1], "x")
plt.plot(X_rank_red2[di1, 0], X_rank_red2[di1, 1], "C1x", ms=10)
plt.plot(X_rank_red2[di2, 0], X_rank_red2[di2, 1], "C3x", ms=10)
plt.axis("square")
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("new feature 1 for rank reduction")
plt.ylabel("new feature 2 for rank reduction")
plt.title("rank {0:d} approximation of data".format(r_des))
plt.grid(True)

## Dimensionality Reduction after PCA

In [None]:
dim_des = 1
X_dim_red = np.zeros((M, dim_des))
X_dim_red = pcs[:, :dim_des]

X_dim_red_plot = np.zeros((M, N))
X_dim_red_plot[:, :dim_des] = pcs[:, :dim_des]

In [None]:
# for convenience we plot data here in 2D plot
# note however that X_dim_red_plot[:,1] is precisely zero if
# dim_des = 1 was chosen
# but the dimensionality reduction actually yields a matrix with smaller
# dimension, cf. X_dim_red
plt.plot(X_dim_red_plot[:, 0], X_dim_red_plot[:, 1], "x")
plt.plot(X_dim_red_plot[di1, 0], X_dim_red_plot[di1, 1], "C1x")
plt.plot(X_dim_red_plot[di2, 0], X_dim_red_plot[di2, 1], "C3x")

# draw direction of first PC axis
plt.plot([0, 1], [0, 0], "C6", lw=3, label="direction of 1st PC")
# draw direction of second PC axis
plt.plot([0, 0], [0, 1], "C7", lw=3, label="direction of 2nd PC")

plt.axis("square")
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("PCA data dimensionality reduction to {0:d}D".format(dim_des))
plt.legend()
plt.grid(True)

print("reduced to dimension", X_dim_red.shape)
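
Dimensionality reduction and rank reduction are closely linked: mapping the reduced PC signals back to the feature coordinate system with the kept loadings yields the rank-reduced approximation of the data. A minimal sketch, assuming demo data `A` and a hypothetical variable `k` for the number of kept PCs:

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # hypothetical demo data
A = rng_demo.multivariate_normal([0, 0], [[2, 0.8], [0.8, 1]], 200)
A = A - np.mean(A, axis=0)  # mean-free columns

U_d, s_d, Vh_d = np.linalg.svd(A, full_matrices=False)
V_d = Vh_d.T

k = 1  # number of kept PCs
scores_k = A @ V_d[:, :k]  # reduced data, shape (200, k)
A_back = scores_k @ V_d[:, :k].T  # back in feature coordinates, shape (200, 2)

# rank-1 approximation as outer product of the first singular vectors
A_rank_1 = s_d[0] * np.outer(U_d[:, 0], V_d[:, 0])

print(scores_k.shape, A_back.shape)
print(np.allclose(A_back, A_rank_1))
```

The reduced matrix needs only `k` columns per sample; the kept loadings are stored once and allow the approximate reconstruction.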

## Copyright

- the notebooks are provided as [Open Educational Resources](https://en.wikipedia.org/wiki/Open_educational_resources)
- feel free to use the notebooks for your own purposes
- the text is licensed under [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/)
- the code of the IPython examples is licensed under the [MIT license](https://opensource.org/licenses/MIT)
- please attribute the work as follows: *Frank Schultz, Data Driven Audio Signal Processing - A Tutorial Featuring Computational Examples, University of Rostock* ideally with relevant file(s), github URL https://github.com/spatialaudio/data-driven-audio-signal-processing-exercise, commit number and/or version tag, year.