### Truncated SVD example

Truncated Singular Value Decomposition (SVD) is a matrix factorization technique that factors a matrix M into three matrices U, Σ, and V, so that M = UΣVᵀ, and then keeps only the largest k singular values together with their singular vectors. It is closely related to PCA: SVD factorizes the data matrix itself, whereas PCA is classically defined through the eigendecomposition of the covariance matrix. In practice, SVD is typically used under the hood to find the principal components of a matrix.
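The link between the two factorizations can be checked numerically. The sketch below uses a small random matrix (an assumption for illustration, not the notebook's data) and compares the eigenvalues of the covariance matrix with the squared singular values of the centered data divided by n − 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # hypothetical data, not the USArrests set
Xc = X - X.mean(axis=0)                 # PCA assumes column-centered data

# PCA route: eigenvalues of the covariance matrix, sorted in descending order
cov = np.cov(Xc, rowvar=False)          # np.cov uses the n - 1 denominator
eigvals = np.linalg.eigvalsh(cov)[::-1]

# SVD route: singular values of the centered data matrix
s = np.linalg.svd(Xc, compute_uv=False)
var_from_svd = s**2 / (Xc.shape[0] - 1)

print(np.allclose(eigvals, var_from_svd))
```

If the data are not centered first, the two routes generally disagree, which is the main practical difference between PCA and a plain SVD.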

Like PCA, truncated SVD is used to reduce the number of dimensions in a matrix.
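As a minimal sketch of what "truncating" means, the example below keeps the top k singular triplets of a small random matrix (hypothetical data, with k chosen arbitrarily) and rebuilds a rank-k approximation. The Frobenius-norm error of that approximation equals the norm of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))             # hypothetical matrix for illustration

U, s, Vh = np.linalg.svd(M, full_matrices=False)

k = 2                                   # number of components to keep (arbitrary)
M_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]   # rank-k approximation of M

# Frobenius error of the approximation vs. norm of the dropped singular values
err = np.linalg.norm(M - M_k)
dropped = np.sqrt(np.sum(s[k:] ** 2))
print(err, dropped)
```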

A couple of good videos on this topic:

1. https://www.youtube.com/watch?v=P5mlg91as1c

2. https://www.youtube.com/watch?v=UyAfmAZU_WI

In [1]:
import pandas as pd
import numpy as np
import json
import zipfile
import os
import time

from sklearn import preprocessing
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture


from sklearn.datasets import load_iris
from sklearn.decomposition import TruncatedSVD
from scipy import linalg

### Download the data to run Truncated SVD

In [2]:
# Work inside a "content" folder under the current directory
content_dir = os.path.join(os.getcwd(), "content")
os.makedirs(content_dir, exist_ok=True)
os.chdir(content_dir)

# IPython shell escape: capture the Kaggle CLI output as a list of lines
download_results = !kaggle datasets download -d deepakg/usarrests

# Treat a "Downloading" line in the output as a sign that a new archive was fetched
download_status = any("Downloading" in s for s in download_results)
print("New version was downloaded" if download_status else "No new version available")
print(download_status)

No new version available
False


In [3]:
usarrest_data = pd.read_csv("usarrests.zip")
usarrest_data.rename(columns={"Unnamed: 0":"State"},inplace=True)
usarrest_data.head()

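|   | State | Murder | Assault | UrbanPop | Rape |
|---|-------|--------|---------|----------|------|
| 0 | Alabama | 13.2 | 236 | 58 | 21.2 |
| 1 | Alaska | 10.0 | 263 | 48 | 44.5 |
| 2 | Arizona | 8.1 | 294 | 80 | 31.0 |
| 3 | Arkansas | 8.8 | 190 | 50 | 19.5 |
| 4 | California | 9.0 | 276 | 91 | 40.6 |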


In [4]:
x_input = preprocessing.scale(usarrest_data.iloc[:,1:].to_numpy())
x_input.shape

(50, 4)
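With its default arguments, `preprocessing.scale` centers each column and divides it by its standard deviation. The sketch below writes that step out by hand with NumPy on the five rows shown by `head()`; because it uses only those five rows, the numbers will not match `x_input`, which is scaled over all 50 states.

```python
import numpy as np

# The five rows shown by usarrest_data.head(): Murder, Assault, UrbanPop, Rape
sample = np.array([
    [13.2, 236, 58, 21.2],
    [10.0, 263, 48, 44.5],
    [8.1, 294, 80, 31.0],
    [8.8, 190, 50, 19.5],
    [9.0, 276, 91, 40.6],
])

# Column-wise standardization: subtract the mean, divide by the (population) std
scaled = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(scaled.mean(axis=0).round(6))     # every column is centered at zero
print(scaled.std(axis=0).round(6))      # every column has unit spread
```

Scaling matters here because Assault is measured on a much larger numeric range than Murder or Rape and would otherwise dominate the first component.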

#### Calculate Truncated SVD

#### Calculate first using SciPy

In [5]:
%time U, s, Vh = linalg.svd(x_input,full_matrices=False)

CPU times: user 5.4 ms, sys: 1.23 ms, total: 6.63 ms
Wall time: 6.59 ms


In [60]:
print (U.shape)
print(s.shape)
print(Vh.shape)

(50, 4)
(4,)
(4, 4)
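The shapes above follow from `full_matrices=False`, which returns the "thin" factorization. The sketch below contrasts thin and full shapes on a random stand-in with the same dimensions (hypothetical data); it uses `np.linalg.svd` rather than `scipy.linalg.svd`, which accepts the same `full_matrices` argument.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # hypothetical stand-in with the same shape

U_thin, s, Vh = np.linalg.svd(X, full_matrices=False)
U_full, _, _ = np.linalg.svd(X, full_matrices=True)

# Thin SVD keeps only min(n_rows, n_cols) columns of U
print(U_thin.shape, U_full.shape)

# The thin factors are enough to rebuild X (up to floating-point rounding)
X_rebuilt = U_thin @ np.diag(s) @ Vh
print(np.allclose(X, X_rebuilt))
```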


In [6]:
print(s)

[11.13607107  7.0347891   4.22234047  2.94474182]


_The singular values s (the diagonal of Σ) show that the first two components carry most of the weight: their squares account for about 87% of the total. So it should be acceptable to ignore the last two components._
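To put a number on "most of the weight", the squared singular values can be normalized so that they sum to one. The sketch below does this with the values printed above; for centered data each share is the fraction of total variance carried by that component.

```python
import numpy as np

# Singular values copied from the printed output above
s = np.array([11.13607107, 7.0347891, 4.22234047, 2.94474182])

share = s**2 / np.sum(s**2)             # fraction of total variance per component
top_two = share[:2].sum()               # combined share of the first two components

print(share.round(4))
print(round(float(top_two), 4))
```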

In [7]:
print(Vh)

[[-0.53589947 -0.58318363 -0.27819087 -0.54343209]
 [ 0.41818087  0.1879856  -0.87280619 -0.16731864]
 [-0.34123273 -0.26814843 -0.37801579  0.81777791]
 [ 0.6492278  -0.74340748  0.13387773  0.08902432]]


_The rows of Vh above show which features the first two components weight most heavily. Just like PCA, the first component puts most of its weight on the crime features (Murder, Assault, and Rape), while the second component puts most of its weight on urban population._
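Reading the weights is easier when each row of Vh is paired with the feature names. The sketch below does that for the first two rows, copied from the output above; the feature order is assumed to match the column order of the data frame.

```python
import numpy as np

# Assumed to match the column order of usarrest_data after dropping State
features = ["Murder", "Assault", "UrbanPop", "Rape"]

# First two rows of Vh, copied from the printed output
Vh_top = np.array([
    [-0.53589947, -0.58318363, -0.27819087, -0.54343209],
    [ 0.41818087,  0.1879856 , -0.87280619, -0.16731864],
])

for i, row in enumerate(Vh_top, start=1):
    weights = {name: round(float(w), 3) for name, w in zip(features, row)}
    print(f"component {i}: {weights}")

# Feature with the largest absolute weight in each component
dominant = [features[j] for j in np.abs(Vh_top).argmax(axis=1)]
print(dominant)
```

Only the magnitudes matter for this reading; the overall sign of each row is arbitrary.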

In [8]:
U[:5]

array([[-0.08850212,  0.16111249, -0.10521861,  0.0530665 ],
       [-0.17511901,  0.15255799,  0.48314515, -0.14893782],
       [-0.15832905, -0.10603826,  0.01297404, -0.2834384 ],
       [ 0.0126993 ,  0.15917987,  0.02713511, -0.06208045],
       [-0.22664907, -0.2193291 ,  0.14175948, -0.11613802]])
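The rows of U are unit-scaled coordinates; the reduced representation of the data comes from scaling them by the singular values, which is the same as projecting the data onto the rows of Vh. A minimal sketch of that identity, on a random standardized stand-in for `x_input` (hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # hypothetical stand-in for x_input
X = (X - X.mean(axis=0)) / X.std(axis=0)

U, s, Vh = np.linalg.svd(X, full_matrices=False)

scores_from_U = U * s                   # scale each column of U by its singular value
scores_from_X = X @ Vh.T                # project the data onto the right singular vectors

print(np.allclose(scores_from_U, scores_from_X))

k = 2                                   # keep the first two components
reduced = scores_from_U[:, :k]          # one 2-D point per row of X
print(reduced.shape)
```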

### Calculate using scikit-learn TruncatedSVD

In [13]:
svd = TruncatedSVD(n_components=x_input.shape[1]-1)
svd_transformed = svd.fit_transform(x_input)

In [14]:
svd.explained_variance_

array([2.48024158, 0.98976515, 0.35656318])

In [15]:
svd.explained_variance_ratio_

array([0.62006039, 0.24744129, 0.0891408 ])

_As can be seen above, the first two components explain about 87% of the variance._
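One common way to turn these ratios into a choice of dimensionality is to take the smallest number of components whose cumulative ratio reaches a target. The sketch below uses the ratios printed above; the 0.85 threshold is a hypothetical value chosen for illustration, not something fixed by the method.

```python
import numpy as np

# Values copied from svd.explained_variance_ratio_ above
ratio = np.array([0.62006039, 0.24744129, 0.0891408])

cumulative = np.cumsum(ratio)           # running total of explained variance
print(cumulative.round(4))

threshold = 0.85                        # hypothetical target, chosen for illustration
# Index of the first cumulative value that reaches the threshold, plus one
k = int(np.searchsorted(cumulative, threshold) + 1)
print(k)
```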

In [71]:
svd.components_

array([[ 0.53589947,  0.58318363,  0.27819087,  0.54343209],
       [ 0.41818087,  0.1879856 , -0.87280619, -0.16731864],
       [-0.34123273, -0.26814843, -0.37801579,  0.81777791]])
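The first row of `svd.components_` is the negative of the first row of `Vh` from SciPy. That is expected: singular vectors are only defined up to sign, since flipping a row of Vh together with the matching column of U leaves the product unchanged. The sketch below checks, using the two printed arrays, that the results agree row by row once the sign is accounted for.

```python
import numpy as np

# First three rows of Vh from the SciPy output
Vh_top = np.array([
    [-0.53589947, -0.58318363, -0.27819087, -0.54343209],
    [ 0.41818087,  0.1879856 , -0.87280619, -0.16731864],
    [-0.34123273, -0.26814843, -0.37801579,  0.81777791],
])

# svd.components_ from the scikit-learn output
components = np.array([
    [ 0.53589947,  0.58318363,  0.27819087,  0.54343209],
    [ 0.41818087,  0.1879856 , -0.87280619, -0.16731864],
    [-0.34123273, -0.26814843, -0.37801579,  0.81777791],
])

# The sign of each row-wise dot product shows whether that row was flipped
signs = np.sign(np.sum(Vh_top * components, axis=1))
print(signs)

# After undoing the flips, the two results coincide
print(np.allclose(components, signs[:, None] * Vh_top))
```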