# Efficient Model Training & Predicting

Problem Outline: A common problem when working with large datasets in Python is efficiency, in both time and space complexity. The usual solutions are either to increase the computing power available (which can be expensive) or to move the pipeline to a distributed framework like Spark / PySpark (which can be a hassle and expensive to set up). These problems come up often in industry settings, especially when a task requires aggregating a large amount of user data, as in clustering or recommendation systems. Feeding all this data into inefficient models can be cumbersome, since the computer will most likely run out of memory.

Problem Solution:
We can solve this problem by optimizing the way we train and predict with our models: instead of passing in dense vectors for training and predicting, we pass in sparse vectors. As you will see below, this can drastically reduce both the time and the memory used for training and predicting with the model.
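To make the memory argument concrete, here is a minimal sketch comparing the footprint of a dense NumPy array with its CSR equivalent. The toy matrix (1,000 x 1,000, roughly 0.1% nonzero) and all variable names are assumptions made for illustration; they are not part of the dataset used later in this notebook.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Hypothetical toy matrix: 1,000 x 1,000 with at most 1,000 nonzero entries.
dense = np.zeros((1000, 1000), dtype=np.float64)
rows = rng.integers(0, 1000, size=1000)
cols = rng.integers(0, 1000, size=1000)
dense[rows, cols] = rng.integers(1, 101, size=1000)

# Convert to Compressed Sparse Row format, which stores only the nonzeros.
csr = sparse.csr_matrix(dense)

# The dense array stores every cell, zeros included.
dense_bytes = dense.nbytes

# CSR keeps three arrays: nonzero values, their column indices, and row pointers.
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"dense:  {dense_bytes:,} bytes")
print(f"sparse: {sparse_bytes:,} bytes")
```

The saving comes entirely from not storing zeros, so it shrinks as the matrix becomes denser.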

Dense Matrix vs Sparse Matrix:

- batch prediction
- real time prediction

In [1]:
import random
import numpy as np
import pandas as pd
import scipy.sparse  # import the submodule explicitly; a bare "import scipy" may not load it
import uuid
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

## Generate Data

In [2]:
def generate_data(id_n, prd_n):
    '''
    Generate a dataframe in which most of the values are 0.
    
    params:
        id_n (Integer) : The number of users you want in the DF
        prd_n (Integer) : The number of products
    
    returns:
        A dataframe with up to prd_n + 1 columns (one per product that
        appears, plus 'category') where the majority of the values are 0
    
    example:
        df = generate_data(
            id_n = 10000,
            prd_n = 1000
        )
    '''
    
    dct = {
        'user': [uuid.uuid4() for _ in range(id_n)], 
        'product': [random.randint(1, prd_n) for _ in range(id_n)],
        'value' : [random.randint(1, 100) for _ in range(id_n)]
    }
    d = pd.DataFrame(dct)
    
    # convert to a pivot table, replace the nan's with 0's
    df = d.pivot_table(
        values = 'value', index = 'user', columns = 'product'
    ).fillna(0)
    
    categories = ['category1', 'category2', 'category3', 'category4', 'category5']
    df['category'] = [random.choice(categories) for _ in range(df.shape[0])]
    return df

In [3]:
%%time 
df = generate_data(
    id_n = 100000,
    prd_n = 1000
)

CPU times: user 3.11 s, sys: 2.11 s, total: 5.22 s
Wall time: 6.87 s


In [4]:
df.shape

(100000, 1001)

In [5]:
ft_cols = [x for x in df.columns if x != 'category']

In [6]:
%time dense_vectors = df[ft_cols].values

CPU times: user 168 ms, sys: 375 ms, total: 542 ms
Wall time: 984 ms


In [7]:
%time sparse_vectors = scipy.sparse.csr_matrix(df[ft_cols].astype(float).values)

CPU times: user 1.57 s, sys: 924 ms, total: 2.5 s
Wall time: 3.3 s


## SVC - Dense

In [None]:
X = dense_vectors
y = df['category'].values
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.7)

In [None]:
%%time
dense_svc = SVC()
dense_svc.fit(x_train, y_train)

## SVC - Sparse

In [8]:
X = sparse_vectors
y = df['category'].values
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.7)

In [9]:
%%time
sparse_svc = SVC()
sparse_svc.fit(x_train, y_train)

CPU times: user 24.5 s, sys: 1.17 s, total: 25.6 s
Wall time: 27 s


SVC()

## Predict SVM
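
The two prediction modes listed in the outline, batch and real time, can be sketched as follows. This is a self-contained illustration on hypothetical toy data (200 users x 50 products, about 5% nonzero), not the notebook's actual `sparse_svc` run, and the variable names are assumptions.

```python
import numpy as np
from scipy import sparse
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical toy data: values 1..100 kept in about 5% of the cells, 0 elsewhere.
values = rng.integers(1, 101, size=(200, 50))
mask = rng.random((200, 50)) < 0.05
X = sparse.csr_matrix((values * mask).astype(float))
y = rng.choice(["category1", "category2", "category3"], size=200)

# Train on the sparse matrix directly; no densifying step is needed.
model = SVC().fit(X, y)

# Batch prediction: score many rows in a single call.
batch_preds = model.predict(X[:100])

# Real-time prediction: score one incoming row.
# Slicing with a range (0:1) keeps the input 2-D, which predict expects.
single_pred = model.predict(X[0:1])

print(batch_preds[:5], single_pred)
```

One thing to watch for: an `SVC` fitted on dense data rejects sparse input at prediction time, so it is simplest to keep the training and prediction inputs in the same format.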

As you might have noticed, this solution most likely won't have a large impact on your application if you are not working with a large, sparse dataset. When the dataset doesn't contain many zeros, converting it to a sparse matrix and running the calculations won't have the same impact (if any at all). In that case, my advice would be to try parallelizing your code to run on multiple process pools / threads. This can help reduce the time spent generating predictions and training.
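
For the dense case, one possible way to parallelize batch prediction is to split the rows into chunks and score each chunk in a separate worker. The sketch below uses `joblib`, which is an assumption on my part (it is not used elsewhere in this notebook, though it is installed as a scikit-learn dependency). The helper `predict_in_parallel`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical dense data: 400 rows x 20 features with essentially no zeros.
X = rng.normal(size=(400, 20))
y = rng.choice(["category1", "category2"], size=400)

model = SVC().fit(X, y)


def predict_in_parallel(model, X, n_chunks=4, n_jobs=2):
    """Split X into row chunks and predict each chunk in its own worker."""
    chunks = np.array_split(X, n_chunks)
    results = Parallel(n_jobs=n_jobs)(
        delayed(model.predict)(chunk) for chunk in chunks
    )
    # Chunks come back in submission order, so concatenating restores row order.
    return np.concatenate(results)


parallel_preds = predict_in_parallel(model, X)
print(parallel_preds[:5])
```

Starting workers and shipping the model to them has a cost of its own, so this is only likely to pay off when each chunk is large enough for prediction time to dominate that overhead.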