# Decomposition

This example walks through the main methods of a decomposition model, using PCA on the Iris dataset.

In [107]:
from vertica_ml_python.learn.datasets import load_iris
iris = load_iris()
print(iris)

,SepalLengthCm,Species,PetalWidthCm,PetalLengthCm,SepalWidthCm
0.0,4.30,Iris-setosa,0.10,1.10,3.00
1.0,4.40,Iris-setosa,0.20,1.40,2.90
2.0,4.40,Iris-setosa,0.20,1.30,3.00
3.0,4.40,Iris-setosa,0.20,1.30,3.20
4.0,4.50,Iris-setosa,0.30,1.30,2.30
,...,...,...,...,...


<object>  Name: iris, Number of rows: 150, Number of columns: 5


Let's fit a PCA model on the four flower measurements.

In [109]:
from vertica_ml_python.learn.decomposition import PCA
model = PCA("public.PCA_iris")
model.fit("public.iris", ["PetalWidthCm", "PetalLengthCm", "SepalLengthCm", "SepalWidthCm"])



columns
index|    name     |  mean  |   sd   
-----+-------------+--------+--------
  1  |petalwidthcm | 1.19867| 0.76316
  2  |petallengthcm| 3.75867| 1.76442
  3  |sepallengthcm| 5.84333| 0.82807
  4  |sepalwidthcm | 3.05400| 0.43359


singular_values
index| value  |explained_variance|accumulated_explained_variance
-----+--------+------------------+------------------------------
  1  | 2.05544|      0.92462     |            0.92462           
  2  | 0.49218|      0.05302     |            0.97763           
  3  | 0.28022|      0.01719     |            0.99482           
  4  | 0.15389|      0.00518     |            1.00000           


principal_components
index|  PC1   |  PC2   |  PC3   |  PC4   
-----+--------+--------+--------+--------
  1  | 0.35884|-0.07471| 0.54906| 0.75112
  2  | 0.85657|-0.17577| 0.07252|-0.47972
  3  | 0.36159| 0.65654|-0.58100| 0.31725
  4  |-0.08227| 0.72971| 0.59642|-0.32409


counters
   counter_name   |counter_value
------------------+-------------
ac

Fitting the model stores new attributes on it, such as the predictor columns and the input relation, which makes the other methods easier to call.

In [110]:
model.X

['"PetalWidthCm"', '"PetalLengthCm"', '"SepalLengthCm"', '"SepalWidthCm"']

In [111]:
model.input_relation

'public.iris'

These attributes are reused when you call the model's other methods. A model can also expose further attributes. For PCA, the 'components', 'explained_variance' and 'mean' attributes describe the fitted model.

In [112]:
model.components

,PC1,PC2,PC3,PC4
1.0,0.358843926248216,-0.0747064701350342,0.549060910726603,0.751120560380823
2.0,0.856572105290528,-0.175767403428654,0.0725240754869635,-0.47971898732994
3.0,0.36158967738145,0.656539883285831,-0.580997279827618,0.31725454716854
4.0,-0.0822688898922142,0.729712371326497,0.596418087938103,-0.324094352417966


<object>
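
As a quick sanity check on the components printed above, the sketch below verifies that the four component vectors are orthonormal (unit length and mutually perpendicular). It uses plain Python on values copied from the output rather than any library call; the helper names `column` and `dot` are introduced here for illustration only.

```python
# Principal components copied from the model.components output above.
# Each inner list is one table row (one input column);
# each position within a row is one component (PC1..PC4).
components = [
    [0.358843926248216, -0.0747064701350342, 0.549060910726603, 0.751120560380823],
    [0.856572105290528, -0.175767403428654, 0.0725240754869635, -0.47971898732994],
    [0.36158967738145, 0.656539883285831, -0.580997279827618, 0.31725454716854],
    [-0.0822688898922142, 0.729712371326497, 0.596418087938103, -0.324094352417966],
]


def column(matrix, j):
    """Return column j of a list-of-rows matrix (hypothetical helper)."""
    return [row[j] for row in matrix]


def dot(u, v):
    """Dot product of two equal-length sequences (hypothetical helper)."""
    return sum(a * b for a, b in zip(u, v))


n = len(components[0])

# Gram matrix of the component vectors; for an orthonormal set it is the identity.
gram = [[dot(column(components, i), column(components, j)) for j in range(n)]
        for i in range(n)]

# Largest absolute difference between the Gram matrix and the identity.
max_deviation = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
                    for i in range(n) for j in range(n))
print(max_deviation)
```

A deviation close to zero means each component has unit norm and no two components overlap, which is what lets the transformed columns be read as independent directions.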

In [113]:
model.explained_variance

,value,explained_variance,accumulated_explained_variance
1.0,2.05544174529956,0.924616207174268,0.924616207174268
2.0,0.492182457659266,0.0530155678505351,0.977631775024803
3.0,0.280221177097939,0.0171851395250068,0.99481691454981
4.0,0.153892907978245,0.00518308545018961,0.999999999999999


<object>
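
The three columns above are related: for the numbers printed here, squaring each entry of 'value' and dividing by the sum of the squares reproduces 'explained_variance', and a running sum gives 'accumulated_explained_variance'. The sketch below recomputes both in plain Python. This relationship is inferred from the printed output, not taken from the library's documentation.

```python
# 'value' column copied from the model.explained_variance output above.
singular_values = [2.05544174529956, 0.492182457659266,
                   0.280221177097939, 0.153892907978245]

# Share of the total squared value carried by each component.
squared = [v ** 2 for v in singular_values]
total = sum(squared)
explained = [s / total for s in squared]

# Running sum of the shares.
accumulated = []
running = 0.0
for share in explained:
    running += share
    accumulated.append(running)

for value, share, acc in zip(singular_values, explained, accumulated):
    print(value, share, acc)
```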

In [114]:
model.mean

,name,mean,sd
1.0,petalwidthcm,1.19866666666667,0.763160741700841
2.0,petallengthcm,3.75866666666667,1.76442041995226
3.0,sepallengthcm,5.84333333333333,0.828066127977863
4.0,sepalwidthcm,3.054,0.433594311362174


<object>

Let's look at the generated SQL code.

In [115]:
print(model.deploySQL())

APPLY_PCA("PetalWidthCm", "PetalLengthCm", "SepalLengthCm", "SepalWidthCm" USING PARAMETERS model_name = 'public.PCA_iris', match_by_pos = 'true', cutoff = 1)


It is also possible to deploy the inverse PCA.

In [116]:
print(model.deployInverseSQL())

APPLY_INVERSE_PCA("PetalWidthCm", "PetalLengthCm", "SepalLengthCm", "SepalWidthCm" USING PARAMETERS model_name = 'public.PCA_iris', match_by_pos = 'true')
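
The generated strings are SQL expressions, not complete statements. The sketch below shows one way to wrap the first expression in a query, assuming (as in Vertica's SQL reference) that APPLY_PCA is a transform function invoked with an OVER () clause. The variable names are illustrative, and the code only builds the string; it does not connect to a database.

```python
# Expression copied from the deploySQL() output above.
apply_pca = (
    'APPLY_PCA("PetalWidthCm", "PetalLengthCm", "SepalLengthCm", "SepalWidthCm" '
    "USING PARAMETERS model_name = 'public.PCA_iris', "
    "match_by_pos = 'true', cutoff = 1)"
)

# Relation the model was fitted on (model.input_relation above).
input_relation = "public.iris"

# Assumption: the transform function is called with an empty OVER () clause.
query = "SELECT {} OVER () FROM {}".format(apply_pca, input_relation)
print(query)
```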


You can also use the 'to_vdf' method to get the model vDataFrame and specify the number of components to keep.

In [117]:
model.to_vdf(n_components = 2)

,col1,col2
0.0,-3.22520044627498,-0.503279909485424
1.0,-2.88795856533563,-0.57079802633159
2.0,-2.98184266485391,-0.480250048856075
3.0,-2.99829644283235,-0.334307574590776
4.0,-2.85221108156639,-0.932865367469544
,...,...


<object>  Name: pca_publiciris, Number of rows: 150, Number of columns: 2

Alternatively, use 'cutoff' to specify the minimal cumulative explained variance to retain.

In [118]:
model.to_vdf(cutoff = 0.8)

,col1
0.0,-3.22520044627498
1.0,-2.88795856533563
2.0,-2.98184266485391
3.0,-2.99829644283235
4.0,-2.85221108156639
,...


<object>  Name: col1, Number of rows: 150, dtype: float
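
Only one column comes back because the first component alone already explains about 92% of the variance, which is above the 0.8 cutoff. The sketch below illustrates that selection rule in plain Python, assuming the cutoff keeps the smallest number of leading components whose accumulated explained variance reaches the threshold. The function name `components_for_cutoff` is hypothetical and is not part of the library.

```python
# 'accumulated_explained_variance' column copied from model.explained_variance.
accumulated = [0.924616207174268, 0.977631775024803,
               0.99481691454981, 0.999999999999999]


def components_for_cutoff(accumulated_variance, cutoff):
    """Smallest number of leading components reaching the cutoff.

    Hypothetical helper: falls back to all components when the cutoff
    is never reached (for example because of floating-point rounding).
    """
    for count, acc in enumerate(accumulated_variance, start=1):
        if acc >= cutoff:
            return count
    return len(accumulated_variance)


for cutoff in (0.8, 0.95, 0.99):
    print(cutoff, components_for_cutoff(accumulated, cutoff))
```

With these values, a cutoff of 0.8 keeps one component, 0.95 keeps two and 0.99 keeps three.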

You can keep key columns in the output so that the result can be joined back to the main relation.

In [119]:
model.to_vdf(cutoff = 0.8, 
             key_columns = ["PetalWidthCm", 
                            "PetalLengthCm", 
                            "SepalLengthCm", 
                            "SepalWidthCm"])

,PetalWidthCm,PetalLengthCm,SepalLengthCm,SepalWidthCm,col1
0.0,0.10,1.10,4.30,3.00,-3.22520044627498
1.0,0.20,1.40,4.40,2.90,-2.88795856533563
2.0,0.20,1.30,4.40,3.00,-2.98184266485391
3.0,0.20,1.30,4.40,3.20,-2.99829644283235
4.0,0.30,1.30,4.50,2.30,-2.85221108156639
,...,...,...,...,...


<object>  Name: pca_publiciris, Number of rows: 150, Number of columns: 5
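
With the input columns shown next to 'col1', the transformation can be checked by hand. The sketch below recomputes the first two scores of the first row in plain Python from the 'mean' and 'components' values printed earlier. It assumes each value is centered on its column mean (without dividing by 'sd') and then projected onto each component; that reading is inferred from the numbers in this example rather than from the library's documentation.

```python
# Column means copied from model.mean, in the order used to fit the model:
# PetalWidthCm, PetalLengthCm, SepalLengthCm, SepalWidthCm.
means = [1.19866666666667, 3.75866666666667, 5.84333333333333, 3.054]

# Rows of model.components (one row per input column, one entry per component).
components = [
    [0.358843926248216, -0.0747064701350342, 0.549060910726603, 0.751120560380823],
    [0.856572105290528, -0.175767403428654, 0.0725240754869635, -0.47971898732994],
    [0.36158967738145, 0.656539883285831, -0.580997279827618, 0.31725454716854],
    [-0.0822688898922142, 0.729712371326497, 0.596418087938103, -0.324094352417966],
]

# First row of the output above, in the same column order.
row = [0.10, 1.10, 4.30, 3.00]

# Assumption: center only, no scaling by the standard deviation.
centered = [x - m for x, m in zip(row, means)]

# Project the centered row onto the first two components.
scores = [
    sum(c * comp_row[k] for c, comp_row in zip(centered, components))
    for k in range(2)
]
print(scores)
```

The two results agree with the 'col1' and 'col2' values reported for the first row by `to_vdf(n_components = 2)`, to within rounding of the printed inputs.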