# Efficient Model Training & Predicting

A common problem when working with large datasets in Python is efficiency, in both time and space. The usual remedies are to add computing power (which can be expensive) or to move the pipeline to a distributed framework like Spark / PySpark (which can be a hassle and expensive to set up). These problems come up often in industry settings, especially when a task requires aggregating a large amount of user data, as in clustering and recommendation systems. Feeding all this data into inefficient models is cumbersome, since the computer will most likely run out of memory or take a long time to execute.
We can address this by optimizing how we train and predict with sklearn models. Instead of passing in dense vectors for training and predicting, we pass in sparse vectors. As you will see below, this drastically reduces both the time and the memory used for training and predicting with the model.  

### What is a Sparse Matrix?
There are two forms of matrices: sparse and dense. A matrix is sparse if the majority of its elements are zero, whereas a matrix is dense if the majority of its elements are non-zero.  

You will most commonly see sparse matrices when working with recommendation engines, encodings (for example, one-hot encoding), vectorizing, etc. Working with sparse matrices is beneficial in applied ML because it reduces the memory needed to store the data and avoids wasted computation. You'll see later on that scikit-learn handles sparse matrices well, which reduces the memory and time needed to train and predict on large datasets. Operations using standard dense-matrix structures and algorithms are slow and inefficient when applied to large sparse matrices, as processing and memory are wasted on the zeros [1]. Sparse data is easily compressible and requires less storage. Some very large sparse matrices are infeasible to manipulate using standard dense-matrix algorithms [1].  

A matrix is generally stored as a two-dimensional array. Each entry in the array holds the value of one element, addressed by its row and column index in the matrix. For an m × n matrix, the amount of memory required to store the matrix in this format is proportional to m × n (disregarding the fact that the dimensions of the matrix also need to be stored) [1]. In the case of a sparse matrix, substantial memory savings can be realized by storing only the non-zero entries [1]. The proportion of non-zero entries in the matrix determines how much memory and time is saved.  
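To make the storage argument concrete, here is a minimal sketch of how the CSR (compressed sparse row) format keeps only the non-zero entries, followed by a rough memory comparison. The small matrix, the helper name `csr_nbytes`, and the 1% density are illustrative assumptions, not part of the tutorial's dataset.

```python
import numpy as np
from scipy import sparse

# A small 3 x 4 matrix in which most entries are zero.
dense = np.array([
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
], dtype=np.float64)

csr = sparse.csr_matrix(dense)

# CSR keeps three flat arrays instead of the full 2-D grid.
print(csr.data)     # the non-zero values, row by row: 3, 1, 2
print(csr.indices)  # the column index of each stored value: 2, 0, 3
print(csr.indptr)   # where each row starts in data/indices: 0, 1, 1, 3

def csr_nbytes(m):
    """Bytes used by the three arrays backing a CSR matrix (hypothetical helper)."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

# Memory comparison on a larger matrix with about 1% non-zero entries.
big = sparse.random(2000, 1000, density=0.01, format="csr", random_state=0)
dense_bytes = big.toarray().nbytes  # 2000 * 1000 * 8 bytes for float64
sparse_bytes = csr_nbytes(big)
print(dense_bytes, sparse_bytes)
```

The dense footprint grows with m × n, while the CSR footprint grows with the number of non-zero entries plus one pointer per row, which is why the saving depends on how sparse the data really is.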

In [1]:
import random
import pandas as pd
import scipy
import uuid

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

## Generate Data

We will begin by synthetically generating a dataset with a large number of zeros using the random and pandas libraries. The pivot_table function in pandas produces NaN for every user / product combination that never occurred, and we replace those NaN values with zeros using fillna. For the purposes of this tutorial, we'll create a DataFrame with 50,000 rows and 1,001 columns. The final column holds a category, and the remaining 1,000 are the features used to train a classifier to predict that category.
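As a toy illustration of the pivot step described above (the three-row `long_df` is a made-up example, not the tutorial's data), going from one row per interaction to one column per product looks like this:

```python
import pandas as pd

# Long format: one row per (user, product) interaction.
long_df = pd.DataFrame({
    "user": ["u1", "u2", "u3"],
    "product": [2, 1, 2],
    "value": [10, 20, 30],
})

# Wide format: one row per user, one column per product.
# Combinations that never occurred come out as NaN, so fill them with 0.
wide_df = long_df.pivot_table(
    values="value", index="user", columns="product"
).fillna(0)

print(wide_df)
```

Each user here touches a single product, so every row of the wide table has exactly one non-zero entry; this is the same mechanism that makes the generated dataset sparse.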

In [2]:
def generate_data(id_n, prd_n):
    '''
    This function will generate a dataframe in which most values are zero
    
    params:
        id_n (Integer) : The number of users you want in the DF
        prd_n (Integer) : The number of products
    
    returns:
        A dataframe with prd_n + 1 columns where the majority of the
        values are 0
    
    example:
        df = generate_data(
            id_n = 10000,
            prd_n = 1000
        )
    '''
    
    dct = {
        'user': [uuid.uuid4() for _ in range(id_n)], 
        'product': [random.randint(1, prd_n) for _ in range(id_n)],
        'value' : [random.randint(1, 100) for _ in range(id_n)]
    }
    d = pd.DataFrame(dct)
    
    # convert to a pivot table, replace the nan's with 0's
    df = d.pivot_table(
        values = 'value', index = 'user', columns = 'product'
    ).fillna(0)
    
    categories = ['category1', 'category2', 'category3', 'category4', 'category5']
    df['category'] = [random.choice(categories) for _ in range(df.shape[0])]
    return df

In [3]:
%%time 
df = generate_data(
    id_n = 50000,
    prd_n = 1000
)

CPU times: user 1.25 s, sys: 653 ms, total: 1.9 s
Wall time: 2.18 s


In [4]:
df.shape

(50000, 1001)

In [5]:
ft_cols = [x for x in df.columns if x != 'category']

In [25]:
%time vectors = df[ft_cols].values

CPU times: user 90.9 ms, sys: 208 ms, total: 299 ms
Wall time: 490 ms


In [7]:
%time sparse_vectors = scipy.sparse.csr_matrix(df[ft_cols].astype(float).values)

CPU times: user 631 ms, sys: 280 ms, total: 911 ms
Wall time: 1.06 s
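The conversion above goes through a dense pivot table first. When the data starts out in long format, one alternative is to build the sparse matrix directly from the (user, product, value) triples. The following is a sketch under assumptions: `long_df` is a hypothetical stand-in shaped like the intermediate DataFrame `d` inside `generate_data`, and the approach is not the one used in this tutorial.

```python
import pandas as pd
from scipy import sparse

# Hypothetical long-format interactions, shaped like `d` inside generate_data.
long_df = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u1"],
    "product": [2, 1, 2, 3],
    "value": [10.0, 20.0, 30.0, 5.0],
})

# Map each user / product label to a consecutive integer position.
row_idx, users = pd.factorize(long_df["user"], sort=True)
col_idx, products = pd.factorize(long_df["product"], sort=True)

# COO takes (value, (row, col)) triples; convert to CSR for fast row access.
# Note: duplicate (row, col) pairs are summed here, whereas pivot_table
# averages them by default.
mat = sparse.coo_matrix(
    (long_df["value"].to_numpy(), (row_idx, col_idx)),
    shape=(len(users), len(products)),
).tocsr()

print(mat.toarray())
```

This avoids ever allocating the full users × products grid, which matters most when the dense pivot table itself would not fit in memory.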


Now that we have a dense matrix and a sparse matrix, we can train a model on both to see which one performs better. I'll investigate this with the SVC model from scikit-learn, because support vector machines are known to have high training and prediction latency owing to their design. The scikit-learn docs note this for their implementation [2].

## SVC 

In [8]:
X = vectors
y = df['category'].values
# test_size = 0.7 holds out 70% of the rows, so the model trains on the remaining 30%
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.7)

In [9]:
%%time
svc = SVC()
svc.fit(x_train, y_train)

CPU times: user 3min 21s, sys: 4.57 s, total: 3min 26s
Wall time: 3min 37s


SVC()

## SVC - Sparse

In [10]:
X = sparse_vectors
y = df['category'].values
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.7)

In [11]:
%%time
sparse_svc = SVC()
sparse_svc.fit(x_train, y_train)

CPU times: user 5.26 s, sys: 62.3 ms, total: 5.32 s
Wall time: 5.36 s


SVC()

In [12]:
185 / 5.91

31.302876480541453

As the results above show, SVC trained much faster on sparse vectors than on dense ones. The ratio computed above works out to roughly 31x (the wall times printed for these particular runs, 3min 37s versus 5.36 s, give a ratio closer to 40x), which is quite substantial. Because a sparse matrix stores only the non-zero entries, it also reduces the memory needed to hold the training data, although memory use was not measured here.
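Outside a notebook, where the `%%time` magic is unavailable, a comparison like the one above can be sketched with `time.perf_counter`. The snippet below fits both models on the same rows so the timings are like for like. The data is a small synthetic stand-in (600 × 200, about 1% non-zero) chosen so it runs quickly, and `timed_fit` is a hypothetical helper; on data this small the gap may be negligible or even reversed.

```python
import time

import numpy as np
from scipy import sparse
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Small synthetic stand-in for the tutorial data: about 1% non-zero features.
X_sparse = sparse.random(600, 200, density=0.01, format="csr", random_state=0)
X_dense = X_sparse.toarray()
y = rng.integers(0, 5, size=X_dense.shape[0])

def timed_fit(X, y):
    """Fit a default SVC and return (model, elapsed seconds)."""
    start = time.perf_counter()
    model = SVC().fit(X, y)
    return model, time.perf_counter() - start

# Same rows and labels for both fits; only the storage format differs.
dense_model, dense_secs = timed_fit(X_dense, y)
sparse_model, sparse_secs = timed_fit(X_sparse, y)
print(f"dense: {dense_secs:.3f}s  sparse: {sparse_secs:.3f}s")
```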

## Predict SVM

In [13]:
# generate dataset to predict on 
pred_df = generate_data(
    id_n = 10000,
    prd_n = 1000
)

In [14]:
sparse_pred_mat = scipy.sparse.csr_matrix(pred_df[ft_cols].astype(float).values)

In [15]:
%%time
# regular matrix prediction
pred_df['mat_pred'] = pred_df[ft_cols].apply(lambda x : svc.predict(x.values[None])[0], axis = 1)

CPU times: user 1min 4s, sys: 460 ms, total: 1min 5s
Wall time: 1min 5s


In [16]:
%%time
# sparse matrix prediction
pred_df['sparse_pred'] = [sparse_svc.predict(row)[0] for row in sparse_pred_mat]

CPU times: user 14.4 s, sys: 43.3 ms, total: 14.4 s
Wall time: 14.6 s


In [17]:
65/16.5

3.9393939393939394

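As a note, an SVC trained on sparse data will also accept dense input at prediction time. The reverse does not hold in current scikit-learn releases: an SVC trained on dense data raises a ValueError when given sparse input (unless a callable kernel is used), so convert the input to match how the model was trained. 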
Based on the performance above, the model predicts roughly 4x faster when given sparse matrices. This is a substantial improvement, especially when thinking about the run time and computing costs of having a model in production. It makes both real-time and batch predictions easier and more efficient to implement in production.
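For batch prediction specifically, the row-by-row loops used in the timing cells are not required: `predict` accepts the whole matrix in one call and returns one label per row. A minimal sketch on small synthetic data (the shapes, densities, and random labels are assumptions for illustration only):

```python
import numpy as np
from scipy import sparse
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = sparse.random(300, 50, density=0.05, format="csr", random_state=1)
y_train = rng.integers(0, 3, size=300)
X_new = sparse.random(100, 50, density=0.05, format="csr", random_state=2)

model = SVC().fit(X_train, y_train)

# Row-by-row prediction, mirroring the loop used in the timing cells.
row_preds = np.array(
    [model.predict(X_new[i:i + 1])[0] for i in range(X_new.shape[0])]
)

# One call on the whole matrix returns one label per row.
batch_preds = model.predict(X_new)

# Both approaches should agree; the batch call avoids per-row Python overhead.
print(np.array_equal(row_preds, batch_preds))
```

The row-by-row form is still the relevant benchmark for real-time serving, where requests arrive one at a time.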

### Concluding Remarks
As you might've noticed, this solution most likely won't have a large impact on your application if you are not working with a large and sparse dataset. When the dataset doesn't have a lot of zeros in it, converting it to a sparse matrix and running the calculations won't have the same impact (if any at all). When this is the case, my advice would be to try parallelizing your code to run on multiple processes or threads. This will help reduce the time spent on generating predictions and training.
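One way to parallelize prediction is to split the rows into chunks and predict the chunks concurrently. The sketch below uses joblib (a scikit-learn dependency) with a thread-based backend; `predict_in_chunks`, the chunk count, and the synthetic dense data are all illustrative assumptions. Whether threads give a real speedup depends on the model and environment, and a process-based backend (dropping `prefer="threads"`) is an alternative worth measuring.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import SVC

def predict_in_chunks(model, X, n_chunks=4, n_jobs=2):
    """Split the rows of X into chunks and predict them in parallel."""
    bounds = np.linspace(0, X.shape[0], n_chunks + 1, dtype=int)
    chunks = [X[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
    parts = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(model.predict)(chunk) for chunk in chunks
    )
    # Chunks are returned in submission order, so row order is preserved.
    return np.concatenate(parts)

# Dense synthetic data, standing in for a dataset without many zeros.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = rng.integers(0, 3, size=400)

model = SVC().fit(X, y)
parallel_preds = predict_in_chunks(model, X)
print(parallel_preds.shape)
```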

---