# Linear Algebra in Python

1. Introduction to Linear Algebra
1. Matrix-Vector Equations
1. Eigenvalues and Eigenvectors
1. Principal Component Analysis

The content is taken from "Linear Algebra for Data Science in R" on DataCamp. The aim is to translate that material and its R code into Python.

## Principal Component Analysis

The main purpose of PCA is to reduce the number of dimensions in a dataset, giving a simpler dataset that is easier to analyse.

In [2]:
import pandas as pd
import numpy as np

# load dataset
draft_df = pd.read_csv("data/combine.csv")
draft_df.head()

Unnamed: 0,Player,Pos,School,College,Ht,Wt,40yd,Vertical,Bench,Broad Jump,3Cone,Shuttle,Drafted (tm/rnd/yr)
0,Dante Pettis\PettDa00,WR,Washington,College Stats,6-0,186,,,,,,,San Francisco 49ers / 2nd / 44th pick / 2018
1,Kemoko Turay\TuraKe00,EDGE,Rutgers,College Stats,6-5,253,4.65,,,,,,Indianapolis Colts / 2nd / 52nd pick / 2018
2,Josh Adams\AdamJo03,RB,Notre Dame,College Stats,6-2,213,,,18.0,,,,
3,Ola Adeniyi,EDGE,Toledo,,6-2,248,4.83,31.5,26.0,,7.21,4.28,
4,Jordan Akins\AkinJo00,TE,Central Florida,College Stats,6-3,249,,,,,,,Houston Texans / 3rd / 98th pick / 2018


One of the important things that principal component analysis can do is reduce redundancy in your dataset. In its simplest form, redundancy occurs when two variables are correlated.

The Pearson correlation coefficient is a number between -1 and 1. Coefficients near zero indicate little or no linear relationship between two variables (which is not the same as independence), while coefficients near -1 or 1 indicate that the two variables are strongly linearly related.

If the covariance between two columns of a matrix is large and positive, then when one variable goes up, the other tends to go up as well.
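
The link between covariance and correlation can be checked on a small synthetic example. The sketch below uses made-up data (the arrays `x` and `y` are illustrative and not taken from the combine dataset) and shows that the Pearson coefficient is the covariance divided by the product of the two standard deviations.

```python
import numpy as np

# Synthetic, illustrative data: y is roughly a linear function of x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# Sample covariance (n - 1 in the denominator) between x and y.
cov_xy = np.cov(x, y)[0, 1]

# Pearson correlation = covariance / (std_x * std_y), with the same ddof.
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Unlike the covariance, the correlation does not depend on the units of the two variables, which makes it easier to compare across pairs of columns.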

In [3]:
# select columns that we will analyse
cols = ["Ht","Wt","40yd", "Vertical","Bench", "Broad Jump", "3Cone", "Shuttle"]
combine = draft_df[cols]
combine.head()

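# remove rows that have NA in the columns that we will be analysing
# (copy so that the column assignment below works on an independent DataFrame)
combine = combine.dropna().copy()

# convert height from a "feet-inches" string to an integer code,
# e.g. "5-11" -> 511 and "6-1" -> 601 (the inches are zero-padded to two digits)
def convert_ht(ht):
    feet, inches = ht.split("-")
    return int(feet + inches.zfill(2))

combine['Ht'] = combine['Ht'].apply(convert_ht)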
combine.head()

Unnamed: 0,Ht,Wt,40yd,Vertical,Bench,Broad Jump,3Cone,Shuttle
5,511,192,4.38,35.0,14.0,127.0,6.71,3.98
7,601,298,5.34,26.5,27.0,99.0,7.81,4.71
10,605,256,4.67,31.0,17.0,113.0,7.34,4.38
11,602,198,4.34,41.0,16.0,131.0,6.56,4.03
12,604,257,4.87,30.0,20.0,118.0,7.12,4.23
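
The `convert_ht` encoding turns `5-11` into 511 and `6-1` into 601, so a 2-inch difference shows up as a gap of 90. One possible alternative, sketched here under the assumption that every height is a `feet-inches` string, is to convert to total inches. The helper name `height_to_inches` is hypothetical, and using it would change the numbers shown in the rest of this section.

```python
def height_to_inches(ht: str) -> int:
    """Convert a 'feet-inches' string such as '6-2' to total inches."""
    feet, inches = ht.split("-")
    return int(feet) * 12 + int(inches)

# Illustrative heights matching values that appear in the table above.
heights = ["5-11", "6-1", "6-5"]
heights_in = [height_to_inches(h) for h in heights]
print(heights_in)  # [71, 73, 77]
```

With a DataFrame, the same helper could be applied with `combine['Ht'].apply(height_to_inches)`.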


In [4]:
# centre each column, i.e. subtract the column mean from every value
# (this centres the data but does not scale it to unit variance)
for col in cols:
    combine[col] = combine[col] - combine[col].mean()

In [5]:
combine.head()

Unnamed: 0,Ht,Wt,40yd,Vertical,Bench,Broad Jump,3Cone,Shuttle
5,-80.045455,-56.5,-0.403258,2.784091,-5.712121,12.113636,-0.484697,-0.4325
7,9.954545,49.5,0.556742,-5.715909,7.287879,-15.886364,0.615303,0.2975
10,13.954545,7.5,-0.113258,-1.215909,-2.712121,-1.886364,0.145303,-0.0325
11,10.954545,-50.5,-0.443258,8.784091,-3.712121,16.113636,-0.634697,-0.3825
12,12.954545,8.5,0.086742,-2.215909,0.287879,3.113636,-0.074697,-0.1825


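We will now compute the covariance-variance matrix of the centred data. Let $A$ be the centred data as a matrix. The covariance-variance matrix is

$B = \frac{A^TA}{n-1}$

where $n$ is the number of rows of $A$. Dividing by $n-1$ rather than $n$ is the degrees-of-freedom correction for a sample.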

In [6]:
A = combine.to_numpy()
# get the covariance-variance matrix, B.
B = np.dot(np.transpose(A), A) / (len(A) -1)

B is symmetric: `B[1,2]` and `B[2,1]` hold the same value, which is the covariance between the columns of A at (zero-based) indices 1 and 2, that is `Wt` and `40yd`.

In [9]:
print(B[1,2])
print(B[2,1])
# np.cov on two columns returns a 2x2 matrix: the variances are on the diagonal,
# and the covariance is in the top-right and bottom-left entries.
print(np.cov(A[:,1], A[:,2]))

13.69820610687023
13.69820610687023
[[2.13124427e+03 1.36982061e+01]
 [1.36982061e+01 1.09460300e-01]]


`B[1,1]` is the variance of the column of A at index 1 (`Wt`). Because B was computed with $n-1$ in the denominator, `np.var` needs `ddof=1` (delta degrees of freedom) to give the same value.

In [8]:
print(B[1,1])
print(np.var(A, axis=0, ddof=1))

2131.2442748091603
[9.76226926e+02 2.13124427e+03 1.09460300e-01 1.98862335e+01
 4.38248901e+01 9.15976752e+01 1.85073953e-01 7.67227099e-02]
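
The element-by-element checks above can be extended to the whole matrix. The following sketch uses synthetic data (the array `X` is made up, standing in for the combine measurements) and compares the centred product with `np.cov` called with `rowvar=False`, which treats columns as variables.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))      # 50 observations, 4 variables

Xc = X - X.mean(axis=0)           # centre each column
n = Xc.shape[0]
B_manual = Xc.T @ Xc / (n - 1)    # covariance-variance matrix by hand

B_numpy = np.cov(X, rowvar=False) # np.cov centres the data itself

# The two matrices agree, and the diagonal holds the column variances.
print(np.allclose(B_manual, B_numpy))
print(np.allclose(np.diag(B_manual), X.var(axis=0, ddof=1)))
```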


### Eigenvalues of the cov-var matrix

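With B computed as above, we can find its eigenvalues. The eigenvectors of the covariance-variance matrix are the principal components (the directions of greatest variance in the data), and each eigenvalue is the variance of the data along the corresponding direction.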

In [12]:
eigenvals = np.linalg.eig(B)[0]
eigenvals

array([2.46495265e+03, 7.27723807e+02, 4.56758097e+01, 2.07071372e+01,
       4.01448843e+00, 5.69153316e-02, 9.10768532e-03, 1.13397932e-02])

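The largest eigenvalue comes first. Note that `np.linalg.eig` does not guarantee any particular ordering (the last two values above are not in descending order), so sort the eigenvalues if the order matters. Let's see how much the first value is as a percentage of the sum of all the values.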

In [13]:
eigenvals[0] / sum(eigenvals)

0.7553902524092113

From this, we see that the first eigenvalue is 75.5% of the total. In other words, 75.5% of the variability in the data is explained by the first principal component, whose variance is this first eigenvalue.
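
The same calculation can be repeated for every component. The sketch below copies the eigenvalues printed above into a literal array, sorts them in descending order, and computes the share of variance for each component along with the running total. The variable names are illustrative.

```python
import numpy as np

# Eigenvalues copied from the output above (the last two are not in order).
eigenvals = np.array([2.46495265e+03, 7.27723807e+02, 4.56758097e+01, 2.07071372e+01,
                      4.01448843e+00, 5.69153316e-02, 9.10768532e-03, 1.13397932e-02])

# Sort in descending order so that position k is the k-th principal component.
eigenvals_sorted = np.sort(eigenvals)[::-1]

# Share of the total variance per component, and the cumulative share.
explained_ratio = eigenvals_sorted / eigenvals_sorted.sum()
cumulative = np.cumsum(explained_ratio)

print(np.round(explained_ratio[:3], 4))
print(np.round(cumulative[:3], 4))
```

On these values the first two components together account for roughly 98% of the variance, which is the kind of figure typically used to decide how many components to keep.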

## Not doing PCA by hand

Everything above was done with lower-level code in two steps:

1. Centring the data
2. Calculating the principal components

However, this is not really necessary. We can use scikit-learn to get the PCA. In R you would use `scale` and `prcomp`.


In [20]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# scikit-learn's PCA centres the data itself but does not scale it to unit variance.
# Here StandardScaler only subtracts the column means (with_std=False),
# which matches the manual calculation above.
# Use with_std=True (the default) to also divide by the standard deviation.
A_scaled = StandardScaler(with_std=False).fit_transform(A)
A_scaled

array([[-8.00454545e+01, -5.65000000e+01, -4.03257576e-01, ...,
         1.21136364e+01, -4.84696970e-01, -4.32500000e-01],
       [ 9.95454545e+00,  4.95000000e+01,  5.56742424e-01, ...,
        -1.58863636e+01,  6.15303030e-01,  2.97500000e-01],
       [ 1.39545455e+01,  7.50000000e+00, -1.13257576e-01, ...,
        -1.88636364e+00,  1.45303030e-01, -3.25000000e-02],
       ...,
       [ 1.09545455e+01, -5.15000000e+01, -2.33257576e-01, ...,
         6.11363636e+00, -3.04696970e-01, -1.82500000e-01],
       [ 9.95454545e+00, -1.25000000e+01, -1.83257576e-01, ...,
         2.11363636e+00,  1.85303030e-01,  6.75000000e-02],
       [ 1.29545455e+01,  9.50000000e+00, -3.25757576e-03, ...,
        -8.86363636e-01, -2.04696970e-01, -1.25000000e-02]])

In [29]:
# run PCA
# n_components limits how many components are returned; a small number (e.g. 3) is often enough.
# However, in this case we keep them all.

pca = PCA()
# pca = PCA(n_components=3)

# fit returns the fitted PCA estimator itself, not the transformed data
principalComponents = pca.fit(A_scaled)

# to see the explained variance (the eigenvalues)
print(f"explained variance(eigenvalues): {principalComponents.explained_variance_}")

# normally though you would use PCA(n_components=3) and then analyse those principal components

array([2.46495265e+03, 7.27723807e+02, 4.56758097e+01, 2.07071372e+01,
       4.01448843e+00, 5.69153316e-02, 1.13397932e-02, 9.10768532e-03])
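
A common next step is to keep only a few components and work with the projected data. The sketch below assumes scikit-learn is installed and uses synthetic data (the array `X` is made up). It also illustrates that `PCA` centres its input internally, so raw and pre-centred data give the same explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Synthetic data: 60 observations, 5 variables with different spreads and a non-zero mean.
X = rng.normal(size=(60, 5)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5]) + 100.0

# Keep the first three principal components and project the data onto them.
pca = PCA(n_components=3)
scores = pca.fit_transform(X)

print(scores.shape)                    # (60, 3)
print(pca.explained_variance_ratio_)   # share of variance per kept component

# PCA subtracts the column means itself, so centring beforehand changes nothing.
pca_centred = PCA(n_components=3).fit(X - X.mean(axis=0))
print(np.allclose(pca.explained_variance_, pca_centred.explained_variance_))
```

If the columns are on very different scales, standardising them first (for example with `StandardScaler()`) is a separate choice that does change the result.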

### Note:
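PCA is not actually a feature selection tool. It builds new features (the principal components) as linear combinations of all the original columns, rather than choosing a subset of them.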
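
To illustrate the note, the sketch below uses synthetic data (the array `X` is made up) and the eigenvectors of its covariance-variance matrix. Each principal component is a weighted combination of the original columns, and the variance of each projected column equals the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: 100 observations of 3 variables, the first two strongly correlated.
base = rng.normal(size=(100, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(100, 1)),
               2 * base + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

Xc = X - X.mean(axis=0)
B = Xc.T @ Xc / (len(Xc) - 1)

# eigh is intended for symmetric matrices and returns ascending eigenvalues,
# so reorder to put the largest first.
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Scores: the data expressed in the principal-component basis.
scores = Xc @ vecs

print(vecs[:, 0])   # weights of the first component on every original column
print(np.allclose(scores.var(axis=0, ddof=1), vals))
```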