# Truncated Singular Value Decomposition (TSVD)
The TSVD algorithm is a linear dimensionality reduction algorithm that works well for datasets in which samples are correlated in large groups. Unlike PCA, TSVD does not center the data before computation.
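
To make the difference from PCA concrete, here is a minimal NumPy sketch. It is an illustration only, not the cuML implementation, and `truncated_svd` is a hypothetical helper name introduced for this example. It applies the same decomposition once to the raw data (TSVD) and once to mean-centered data (what PCA does), using toy data whose columns have a non-zero mean.

```python
import numpy as np

def truncated_svd(X, k):
    """Project X onto its top-k right singular vectors (hypothetical helper)."""
    # full_matrices=False gives the thin SVD: X = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:k]  # shape (k, n_features)
    return X @ components.T, s[:k], components

rng = np.random.default_rng(0)
# toy data with a non-zero mean so that centering matters
X = rng.normal(loc=5.0, scale=1.0, size=(200, 6))

# TSVD: decompose the raw, uncentered data
reduced_tsvd, s_tsvd, _ = truncated_svd(X, k=2)

# PCA-style: subtract the column means first, then apply the same decomposition
X_centered = X - X.mean(axis=0)
reduced_pca, s_pca, _ = truncated_svd(X_centered, k=2)

print(reduced_tsvd.shape, reduced_pca.shape)
# the leading singular value of the uncentered data also absorbs the column means
print(s_tsvd[0] > s_pca[0])
```

Because the uncentered decomposition also has to account for the column means, its leading component tends to point toward the mean of the data, whereas the centered version describes only the variation around that mean.
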
The TSVD model is implemented in the cuML library and accepts the following parameters:
1. n_components: the number of top K singular vectors to keep in the output. n_components must be <= the number of columns.
2. algorithm: selects the algorithm to use: 'jacobi' or 'full'. Jacobi is much faster because it corrects iteratively, but it is less accurate.
3. n_iter: if algorithm = 'jacobi', this sets the number of iterations.
4. tol: if algorithm = 'jacobi', this sets the convergence tolerance.
5. random_state: set a random state if the results should be reproducible across multiple runs.

The methods that can be used with the TSVD model:
1. fit: fits the TSVD model to the dataset
2. fit_transform: fits the TSVD model to the dataset and performs dimensionality reduction on it
3. transform: performs dimensionality reduction on a dataset using the fitted model
4. inverse_transform: maps a dimensionally reduced dataset back to the original feature space (an approximation of the original dataset)
5. get_params: returns the values of the parameters set for the TSVD model
6. set_params: allows the user to set the parameters of the TSVD model
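
The relationship between these methods can be sketched in plain NumPy. The class below, `TinyTSVD`, is a hypothetical stand-in written for this illustration; it is not the cuML or sklearn code and it ignores the `algorithm`, `n_iter`, `tol`, and `random_state` parameters. It shows that `inverse_transform` gives a rank-k approximation of the input rather than an exact copy.

```python
import numpy as np

class TinyTSVD:
    """Illustrative NumPy stand-in for the fit/transform workflow (not cuML code)."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        # thin SVD of the uncentered data; keep the top-k right singular vectors
        _, s, Vt = np.linalg.svd(X, full_matrices=False)
        self.singular_values_ = s[:self.n_components]
        self.components_ = Vt[:self.n_components]
        return self

    def transform(self, X):
        # project onto the retained components: shape (n_samples, k)
        return X @ self.components_.T

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def inverse_transform(self, X_reduced):
        # map back to the original feature space; this is a rank-k approximation
        return X_reduced @ self.components_

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))

model = TinyTSVD(n_components=3)
X_reduced = model.fit_transform(X)
X_restored = model.inverse_transform(X_reduced)

# the squared reconstruction error equals the energy in the discarded singular values
all_singular_values = np.linalg.svd(X, compute_uv=False)
error = np.linalg.norm(X - X_restored) ** 2
print(np.isclose(error, np.sum(all_singular_values[3:] ** 2)))
```

The fewer components are kept, the more of the singular value spectrum is discarded and the further the restored data is from the original.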

The model accepts only NumPy arrays or cuDF dataframes as input. To convert your dataset to cuDF format, please read the cuDF documentation at https://rapidsai.github.io/projects/cudf/en/latest/. For additional information on the TSVD model, please refer to the documentation at https://rapidsai.github.io/projects/cuml/en/0.6.0/api.html#truncated-svd

In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD as skTSVD
from cuml import TruncatedSVD as cumlTSVD
import cudf
import os

## Helper Functions

In [None]:
# load the mortgage dataset if it is present, otherwise raise an error
import gzip
# change the path of the mortgage dataset if you have saved it in a different directory
def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz'):
    if os.path.exists(cached):
        print('use mortgage data')
        with gzip.open(cached) as f:
            X = np.load(f)
        # sample nrows rows with replacement; the upper bound of randint is exclusive
        X = X[np.random.randint(0, X.shape[0], nrows), :ncols]
    else:
        # raise an exception if the dataset is not present
        raise FileNotFoundError('Please download the required dataset or check the path')
    df = pd.DataFrame({'fea%d'%i:X[:,i] for i in range(X.shape[1])})
    return df
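
The loader above expects a gzip-compressed `.npy` file. As a rough sketch of that file format, the snippet below writes a small random array through a gzip stream and reads it back the same way. The array contents and the file name `example.npy.gz` are made up for illustration and have nothing to do with the real mortgage data.

```python
import gzip
import os
import tempfile

import numpy as np

# hypothetical stand-in data; the real mortgage dataset is not generated here
rng = np.random.default_rng(0)
fake = rng.random((1000, 50))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.npy.gz')  # illustrative file name

    # write the array in .npy format through a gzip stream
    with gzip.open(path, 'wb') as f:
        np.save(f, fake)

    # read it back through a gzip stream, as the loader does
    with gzip.open(path) as f:
        loaded = np.load(f)

# sample rows with replacement and keep the first ncols columns
nrows, ncols = 16, 40
rows = np.random.randint(0, loaded.shape[0], nrows)  # upper bound is exclusive
sample = loaded[rows, :ncols]
print(sample.shape)
```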


In [None]:
# this function checks if the results obtained from two different libraries (sklearn and cuml)
# are approximately equal, i.e. their mean squared error is below the threshold
from sklearn.metrics import mean_squared_error
def array_equal(a,b,threshold=5e-3,with_sign=True):
    a = to_nparray(a)
    b = to_nparray(b)
    if not with_sign:
        a,b = np.abs(a),np.abs(b)
    error = mean_squared_error(a,b)
    res = error<threshold
    return res

# this function converts a numpy array, numpy float, pandas dataframe, or cudf dataframe/series to a numpy array
def to_nparray(x):
    if isinstance(x,np.ndarray) or isinstance(x,pd.DataFrame):
        return np.array(x)
    elif isinstance(x,np.float64):
        return np.array([x])
    elif isinstance(x,cudf.DataFrame) or isinstance(x,cudf.Series):
        return x.to_pandas().values
    return x
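
Singular vectors are only defined up to sign, so two libraries can return components that differ by a factor of -1 while describing the same decomposition. The sketch below uses plain NumPy rather than the helpers above, with a made-up toy matrix, to illustrate why comparing absolute values (the idea behind `with_sign=False`) can be useful.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))

_, s, Vt = np.linalg.svd(X, full_matrices=False)
components = Vt[:2]

# flipping the sign of a component yields an equally valid decomposition
flipped = components.copy()
flipped[0] *= -1

# a signed comparison reports a large error for the flipped component
mse_signed = np.mean((components - flipped) ** 2)
# comparing absolute values ignores the sign difference
mse_unsigned = np.mean((np.abs(components) - np.abs(flipped)) ** 2)

print(mse_signed > 5e-3)
print(mse_unsigned == 0.0)
```

In this toy case the signed error is well above the default threshold of `5e-3`, even though the two sets of components span the same directions.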

## Create the dataset and define model parameters

In [None]:
%%time
# nrows = number of samples
# ncols = number of features of each sample

nrows = 2**22
ncols = 40

X = load_data(nrows,ncols)
print('data',X.shape)

In [None]:
# define the value of some of the model parameters
n_components = 10
random_state = 42

## Run sklearn tsvd on the dataset

In [None]:
%%time
# use the sklearn tsvd to reduce the dimensionality of the dataset
algorithm='arpack'
tsvd_sk = skTSVD(n_components=n_components,algorithm=algorithm, 
            random_state=random_state)
# fit the sklearn tsvd model on the dataset and return the dimensionally reduced dataset
result_sk = tsvd_sk.fit_transform(X)

## Run cuml tsvd model on the dataset 

In [None]:
%%time
# convert pandas dataframe to cudf dataframe
X = cudf.DataFrame.from_pandas(X)

In [None]:
%%time
# use the cuml tsvd model to reduce the dimensionality of the dataset
algorithm='full'
tsvd_cuml = cumlTSVD(n_components=n_components,algorithm=algorithm, 
            random_state=random_state)
# fit the cuml tsvd model on the dataset and return the dimensionally reduced dataset
result_cuml = tsvd_cuml.fit_transform(X)

## Compare the value of the attributes obtained from the sklearn and cuml tsvd models

In [None]:
# obtain attributes of the sklearn and cuml tsvd and check to see if they are equal
for attr in ['singular_values_','components_']:
    passed = array_equal(getattr(tsvd_sk,attr),getattr(tsvd_cuml,attr),threshold=0.1)
    # larger error margin due to different algorithms: arpack vs full
    message = 'compare tsvd: cuml vs sklearn {:>25} {}'.format(attr,'equal' if passed else 'NOT equal')
    print(message)

In [None]:
# compare the reduced matrix
passed = array_equal(result_sk,result_cuml,threshold=0.1)
# larger error margin due to different algorithms: arpack vs full
message = 'compare tsvd: cuml vs sklearn transformed results %s' % ('equal' if passed else 'NOT equal')
print(message)