Principal Component Analysis (PCA) is <font color='blue'>a classical method for dimension reduction.

<font color='blue'>It uses the first several **principal components**, statistical features that explain most of the variation of an $m \times n$ data matrix $\mathbf{X}$, to describe the large-scale data matrix $\mathbf{X}$ economically.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy import linalg as LA

We will introduce PCA with an image processing example.

In [None]:
def generate_test_image(m, n):
    X = np.zeros((m, n))
    # generate a rectangle
    X[25:80, 25:80] = 1
    # generate a triangle
    for i in range(25, 80, 1):
        X[i+80:160, 100+i-1] = 2
    # generate a circle centered at (135, 53) with radius 30
    for i in range(m):
        for j in range(n):
            if ((i - 135)*(i - 135) + (j - 53)*(j - 53) <= 900):
                X[i, j] = 3
    return X

X = generate_test_image(200, 200)

In [None]:
imgplot = plt.imshow(X, cmap='gray')
plt.title('Original Test Image');

In [None]:
m = X.shape[0] # num of rows
n = X.shape[1] # num of columns

Set each row as a variable, with observations in the columns. Denote the covariance matrix of $\mathbf{X}$ as $\mathbf{C}$, where the size of $\mathbf{C}$ is $m \times m$. $\mathbf{C}$ is a matrix whose $(i,j)^{th}$ entry is the covariance between the $i^{th}$ row and $j^{th}$ row of the matrix $\mathbf{X}$.
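
As a small check of this definition, the sketch below builds the covariance matrix by hand on a toy array and compares it with `np.cov`, which treats each row as a variable by default. The array `A` and the names `A_centered`, `C_manual` and `C_numpy` are illustrative and not part of the image example.

```python
import numpy as np

# Toy data: 3 variables (rows), 5 observations (columns); values are arbitrary.
A = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
              [2.0, 1.0, 4.0, 3.0, 6.0],
              [0.0, 1.0, 0.0, 1.0, 0.0]])

# Subtract each row's mean, then form the sample covariance (divisor n - 1).
A_centered = A - A.mean(axis=1, keepdims=True)
C_manual = A_centered.dot(A_centered.T) / (A.shape[1] - 1)

# np.cov uses rows as variables and the n - 1 divisor by default.
C_numpy = np.cov(A)

print(C_manual.shape)                   # (3, 3)
print(np.allclose(C_manual, C_numpy))   # True
```

The result is symmetric, with the variance of each row on the diagonal.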

In [None]:
X = np.asarray(X, dtype=np.float64)
C = np.cov(X)

In [None]:
np.linalg.matrix_rank(C)

<font color='blue'>Performing principal component analysis decomposes the matrix $\mathbf{C}$ into:
$\mathbf{C} = \mathbf{L}\mathbf{P}\mathbf{L}^{\top},$ <br>  where $\mathbf{P}$ is a diagonal matrix $\mathbf{P}=\text{diag}(\lambda_1,\lambda_2,\dots,\lambda_m)$, with $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_m \geq 0$ being the eigenvalues of matrix $\mathbf{C}$.

<font color='blue'>The matrix $\mathbf{L}$ is an orthogonal matrix whose columns are the eigenvectors of matrix $\mathbf{C}$.
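
The two properties stated here, orthogonality of $\mathbf{L}$ and the eigenvector relation $\mathbf{C}\mathbf{v} = \lambda\mathbf{v}$, can be checked numerically. The following is a minimal sketch on a small random covariance matrix; `B`, `C_toy` and the seed are arbitrary choices for illustration.

```python
import numpy as np

# Small symmetric positive semi-definite matrix built from arbitrary toy data.
rng = np.random.RandomState(0)
B = rng.randn(4, 10)
C_toy = np.cov(B)

# eigh is intended for symmetric (Hermitian) matrices.
eigvals, eigvecs = np.linalg.eigh(C_toy)

# Orthogonality: L^T L = I.
is_orthogonal = np.allclose(eigvecs.T.dot(eigvecs), np.eye(4))

# Each column v of L satisfies C v = lambda v.
is_eigpair = all(
    np.allclose(C_toy.dot(eigvecs[:, i]), eigvals[i] * eigvecs[:, i])
    for i in range(4)
)

# Decomposition: C = L P L^T with P = diag(eigenvalues).
is_decomposed = np.allclose(eigvecs.dot(np.diag(eigvals)).dot(eigvecs.T), C_toy)

print(is_orthogonal, is_eigpair, is_decomposed)
```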

In [None]:
P, L = LA.eigh(C)

The function `LA.eigh` returns the eigenvalues in ascending order in `P`, so we reverse `P` and the columns of `L` to put the largest eigenvalue first.

In [None]:
P = P[::-1]
L = L[:,::-1]

In [None]:
np.allclose(L.dot(np.diag(P)).dot(L.T), C)

In [None]:
plt.semilogy(P, '-o')
plt.xlim([1, P.shape[0]])
plt.xlabel('eigenvalue index')
plt.ylabel('eigenvalue in a log scale')
plt.title('Eigenvalues of Covariance Matrix');

<font color='blue'>The $i^{th}$ **principal component** is given by the $i^{th}$ row of $\mathbf{V}$,
$\mathbf{V} =\mathbf{L}^{\top} \mathbf{X}.$
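
One consequence of this definition is that the rows of $\mathbf{V}$ are mutually uncorrelated and the variance of the $i^{th}$ row equals $\lambda_i$, because $\text{cov}(\mathbf{L}^{\top}\mathbf{X}) = \mathbf{L}^{\top}\mathbf{C}\mathbf{L} = \mathbf{P}$. A minimal sketch on toy data follows; the `mixing` matrix and the seed are arbitrary assumptions made only to produce correlated rows.

```python
import numpy as np

rng = np.random.RandomState(1)
# Toy data: 3 correlated variables (rows), 500 observations (columns).
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.5, 0.2]])
A = mixing.dot(rng.randn(3, 500))

eigvals, eigvecs = np.linalg.eigh(np.cov(A))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # largest eigenvalue first

# Principal components: rows of V = L^T A.
V_toy = eigvecs.T.dot(A)

# The covariance of V is diagonal, with the eigenvalues on the diagonal.
C_V = np.cov(V_toy)
print(np.allclose(C_V, np.diag(eigvals)))  # True
```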


In [None]:
V = L.T.dot(X)

In [None]:
V.shape

<font color='blue'>To approximate $\mathbf{X}$, we use the $k$ eigenvectors with the largest eigenvalues:
$\mathbf{X} \approx \mathbf{L[:, 1:k]}\mathbf{L[:, 1:k]}^{\top} \mathbf{X}.$

Denote the approximated $\mathbf{X}$ as $\tilde{\mathbf{X}} = \mathbf{L[:, 1:k]}\mathbf{L[:, 1:k]}^{\top} \mathbf{X}$. <font color='blue'>When $k = m$, $\tilde{\mathbf{X}}$ should be the same as $\mathbf{X}$.
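
The approximation can be wrapped in a small helper to see how the error behaves as $k$ grows. This is a sketch under stated assumptions: `reconstruct` is a hypothetical helper name, the data is random toy data, and the error is measured in the Frobenius norm. Note that the 1-based range $1\!:\!k$ in the formula corresponds to the Python slice `[:, :k]`.

```python
import numpy as np

def reconstruct(A, eigvecs, k):
    """Project the columns of A onto the first k columns of eigvecs."""
    L_k = eigvecs[:, :k]            # m x k, the k leading eigenvectors
    return L_k.dot(L_k.T).dot(A)    # m x n approximation of A

rng = np.random.RandomState(2)
A = rng.randn(6, 40)  # arbitrary toy data: 6 rows, 40 columns

eigvals, eigvecs = np.linalg.eigh(np.cov(A))
eigvecs = eigvecs[:, ::-1]  # order eigenvectors by decreasing eigenvalue

# Frobenius-norm error of the approximation for k = 1, ..., m.
errors = [np.linalg.norm(A - reconstruct(A, eigvecs, k)) for k in range(1, 7)]
print(errors)
```

The error never increases as more eigenvectors are added, and it vanishes (up to rounding) when $k = m$ because $\mathbf{L}\mathbf{L}^{\top}$ is then the identity.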

In [None]:
k = 200
# L[:, :k] selects the first k columns (indices 0 to k-1)
X_tilde = L[:, :k].dot(L[:, :k].T).dot(X)

In [None]:
np.allclose(X_tilde, X)

In [None]:
plt.imshow(X_tilde, cmap='gray')
plt.title('Approximated Image with full rank');

<font color='blue'>The proportion of total variance due to the $i^{th}$ principal component is given by the ratio $\frac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_m}.$

These proportions sum to $1$ over all components.

<font color='blue'>Since $\lambda_i$ is the $i^{th}$ diagonal entry of $\mathbf{P}$, we have $\sum_{i}\frac{\lambda_i}{\text{trace}(\mathbf{P})} = 1$, where $\text{trace}(\mathbf{P})$ is the sum of the diagonal entries of $\mathbf{P}$.
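
These proportions also give a simple rule for choosing $k$: keep the smallest number of components whose cumulative share of variance reaches a chosen level. The sketch below uses hypothetical eigenvalues and an arbitrary 85% threshold, both assumptions for illustration.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted from largest to smallest.
eigvals = np.array([5.0, 3.0, 1.0, 0.6, 0.4])

proportion = eigvals / eigvals.sum()   # share of variance per component
cumulative = np.cumsum(proportion)     # share explained by the first k components

# Smallest k whose cumulative share reaches the threshold.
threshold = 0.85
k = int(np.searchsorted(cumulative, threshold) + 1)

print(k)  # 3
```

Here the first two components explain about 80% of the variance, so a third is needed to pass 85%.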

In [None]:
(P/P.sum()).sum()

In [None]:
plt.plot((P/P.sum()).cumsum(), '-o')
plt.title('Cumulative Sum of the Proportion of Total Variance')
plt.xlabel('index')
plt.ylabel('Proportion');

In [None]:
# use the first k eigenvectors and principal components for k = 10, 20, 30, 60
X_tilde_10 = L[:, :10].dot(V[:10, :])
X_tilde_20 = L[:, :20].dot(V[:20, :])
X_tilde_30 = L[:, :30].dot(V[:30, :])
X_tilde_60 = L[:, :60].dot(V[:60, :])

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 12))
ax1.imshow(X_tilde_10, cmap='gray')
ax1.set(title='Approximated Image with k = 10')
ax2.imshow(X_tilde_20, cmap='gray')
ax2.set(title='Approximated Image with k = 20')
ax3.imshow(X_tilde_30, cmap='gray')
ax3.set(title='Approximated Image with k = 30')
ax4.imshow(X_tilde_60, cmap='gray')
ax4.set(title='Approximated Image with k = 60');

Moving forward, we do not have to do PCA by hand. Luckily, [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) has an implementation that we can use.
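
One difference in convention is worth noting: scikit-learn's `PCA` expects observations in rows and variables in columns, the transpose of the layout used for the image above, and it centers each column before decomposing. The NumPy-only sketch below mimics that convention on toy data via the singular value decomposition. It is not scikit-learn's actual code, and the names `explained_variance` and `explained_variance_ratio` are chosen here only to mirror the library's attributes.

```python
import numpy as np

rng = np.random.RandomState(3)
# Toy data in the scikit-learn layout: 100 observations (rows), 4 variables (columns).
A = rng.randn(100, 4).dot(np.diag([3.0, 2.0, 1.0, 0.5]))

# Center each column, then take the SVD of the centered data.
A_centered = A - A.mean(axis=0)
U, S, Vt = np.linalg.svd(A_centered, full_matrices=False)

# Variance explained by each component, largest first.
explained_variance = S**2 / (A.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Same quantities from the eigendecomposition of the covariance matrix.
eigvals = np.linalg.eigh(np.cov(A, rowvar=False))[0][::-1]

print(np.allclose(explained_variance, eigvals))  # True
```

The rows of `Vt` play the role of the eigenvectors, so the SVD route and the covariance route agree up to the sign of each direction.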

In [None]:
symbol = ['IBM','MSFT', 'FB', 'T', 'INTC', 'ABX','NEM', 'AU', 'AEM', 'GFI']

start = "2015-09-01"
end = "2016-11-01"

portfolio_returns = get_pricing(symbol, start_date=start, end_date=end, fields="price").pct_change()[1:]

In [None]:
from sklearn.decomposition import PCA
num_pc = 2

X = np.asarray(portfolio_returns)
n, m = X.shape
print('The number of timestamps is {}.'.format(n))
print('The number of stocks is {}.'.format(m))

pca = PCA(n_components=num_pc) # number of principal components
pca.fit(X)

percentage =  pca.explained_variance_ratio_
percentage_cum = np.cumsum(percentage)
print('{0:.2f}% of the variance is explained by the first {1} PCs'.format(percentage_cum[-1]*100, num_pc))

pca_components = pca.components_

In [None]:
x = np.arange(1,len(percentage)+1,1)

plt.subplot(1, 2, 1)
plt.bar(x, percentage*100, align = "center")
plt.title('Contribution of principal components',fontsize = 16)
plt.xlabel('principal components',fontsize = 16)
plt.ylabel('percentage',fontsize = 16)
plt.xticks(x,fontsize = 16) 
plt.yticks(fontsize = 16)
plt.xlim([0, num_pc+1])

plt.subplot(1, 2, 2)
plt.plot(x, percentage_cum*100,'ro-')
plt.xlabel('principal components',fontsize = 16)
plt.ylabel('percentage',fontsize = 16)
plt.title('Cumulative contribution of principal components',fontsize = 16)
plt.xticks(x,fontsize = 16) 
plt.yticks(fontsize = 16)
plt.xlim([1, num_pc])
plt.ylim([50,100]);

<font color='blue'>From these principal components we can construct "statistical risk factors", similar to more conventional common risk factors. These should give us an idea of how much of the portfolio's returns come from some unobservable statistical feature.

In [None]:
factor_returns = X.dot(pca_components.T)
factor_returns = pd.DataFrame(columns=["factor 1", "factor 2"], 
                              index=portfolio_returns.index,
                              data=factor_returns)
factor_returns.head()

In [None]:
factor_exposures = pd.DataFrame(index=["factor 1", "factor 2"], 
                                columns=portfolio_returns.columns,
                                data = pca.components_).T

In [None]:
factor_exposures

In [None]:
labels = factor_exposures.index
data = factor_exposures.values

In [None]:
plt.subplots_adjust(bottom = 0.1)
plt.scatter(
    data[:, 0], data[:, 1], marker='o', s=300, c='m')
plt.title('Scatter Plot of Coefficients of PC1 and PC2')
plt.xlabel('factor exposure of PC1')
plt.ylabel('factor exposure of PC2')

for label, x, y in zip(labels, data[:, 0], data[:, 1]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle = '->', connectionstyle='arc3,rad=0')
    );

<font color='blue'>Creating statistical risk factors allows us to further break down the returns of a portfolio to get a better idea of the risk.

This can be used as an additional step after performance attribution with more common risk factors, such as those in the [Quantopian Risk Model](https://www.quantopian.com/risk-model), to try to account for additional unknown risks.
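
To make the idea of breaking down portfolio returns concrete, the sketch below splits the returns of a portfolio into a part driven by the statistical factors and a residual. Everything in it is an assumption for illustration: the returns are simulated from two hidden factors plus noise, the portfolio is equal-weighted, and the returns are centered before projecting.

```python
import numpy as np

rng = np.random.RandomState(4)
n_days, n_assets, n_factors = 250, 6, 2

# Hypothetical daily returns driven by two hidden factors plus noise.
hidden = rng.randn(n_days, n_factors) * 0.01
loadings = rng.randn(n_factors, n_assets)
returns = hidden.dot(loadings) + rng.randn(n_days, n_assets) * 0.002

# Leading principal directions of the centered returns (rows of `components`).
centered = returns - returns.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
components = Vt[:n_factors]                    # n_factors x n_assets

# Statistical factor returns and an equal-weighted portfolio.
factor_returns = centered.dot(components.T)    # n_days x n_factors
weights = np.ones(n_assets) / n_assets
portfolio = centered.dot(weights)

# Split portfolio returns into a factor-driven part and a residual.
portfolio_exposure = components.dot(weights)   # exposure to each factor
explained = factor_returns.dot(portfolio_exposure)
residual = portfolio - explained
share_explained = 1.0 - residual.var() / portfolio.var()

print('{0:.2f}% of portfolio variance comes from the factors'.format(share_explained * 100))
```

Because the explained and residual parts are orthogonal, their variances add up to the portfolio variance, so `share_explained` lies between 0 and 1.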

<b>References:</b>
- Datta, B.N., 2010. *Numerical linear algebra and applications*. SIAM.
- Qian, E.E., Hua, R.H. and Sorensen, E.H., 2007. *Quantitative equity portfolio management: modern techniques and applications*. CRC Press.