# Speeding Up Analysis

In this notebook we will go over how we can speed up our analysis by distributing it over multiple CPU cores.

First, some terminology and history on CPUs:
- A central processing unit (CPU) refers to the component in a computer that executes instructions. 
- Once upon a time, each computer had only one CPU, but those days are long gone. 
- When you buy a CPU today, it typically has multiple *cores*. 
Each core is an individual unit capable of executing instructions, 
so it is totally valid if we want to call each core 'a CPU'.
- Further complicating matters, modern CPUs typically come with a feature called *simultaneous multithreading (SMT)*, also known as *hyperthreading*,
which allows each CPU core to pretend to be multiple logical CPUs. 
- In resource monitoring, *CPU* refers to these logical CPUs. 
If your computer has, say, an [Intel i7-11700 processor](https://ark.intel.com/content/www/us/en/ark/products/212279/intel-core-i711700-processor-16m-cache-up-to-4-90-ghz.html), 
it comes with 8 physical cores, each capable of running two logical CPUs. 
As a result, both Windows and Linux will report that your computer has 16 CPUs.
- When it comes to computing, what matters is the number of actual CPU cores. 
The department's SCRP cluster does *not* have SMT enabled, 
so each logical CPU is an actual CPU core.
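
To check these numbers on your own machine, a minimal sketch using only the standard library is shown below. It assumes a recent Python 3; `os.sched_getaffinity` is only available on some platforms (notably Linux), so the code falls back to the logical count elsewhere. On a cluster managed by a scheduler, the second number may be smaller than the first.

```python
import os

# Logical CPUs reported by the operating system
# (this count includes SMT/hyperthreading siblings)
logical = os.cpu_count() or 1

# CPUs this process is actually allowed to run on.
# The call is not available everywhere, so fall back to the logical count.
if hasattr(os, "sched_getaffinity"):
    usable = len(os.sched_getaffinity(0))
else:
    usable = logical

print(f"Logical CPUs: {logical}, usable by this process: {usable}")
```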

## A. Scikit-Learn Models

Scikit-learn algorithms are often based on libraries written in other programming languages. Some of these algorithms are fast to train (e.g. linear regression) while others are slow (e.g. support vector machine). For the latter, the [Intel Extension for Scikit-learn](https://intel.github.io/scikit-learn-intelex/index.html) provides drop-in replacements that can [significantly speed up](https://intel.github.io/scikit-learn-intelex/acceleration.html) training. You can find the list of supported algorithms [here](https://intel.github.io/scikit-learn-intelex/algorithms.html).

First, let us generate some artificial data:

In [1]:
import pandas as pd
import numpy as np

# 20K samples with binary target and 10 features
var_num = 10
X = np.random.rand(20000,var_num)
y = np.where(np.sum(X,axis=1)>var_num*0.5,1,0)

Next, we will train scikit-learn's default support vector classifier and time it with Jupyter's magic command `%%time`:

In [2]:
%%time

#Original Scikit Learn SVC
from sklearn.svm import SVC
svc = SVC()
svc.fit(X,y)

CPU times: user 1.83 s, sys: 108 ms, total: 1.93 s
Wall time: 2.08 s


SVC()

Finally, let us train the Intel replacement of SVC. This is done by placing the following two lines of code *before* we import scikit-learn:
```python
from sklearnex import patch_sklearn
patch_sklearn()
```

In [3]:
%%time

# Scikit Learn Intel Extension
from sklearnex import patch_sklearn
patch_sklearn()

# Intel version of SVC
from sklearn.svm import SVC
svc = SVC()
svc.fit(X,y)

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


CPU times: user 2.09 s, sys: 396 ms, total: 2.49 s
Wall time: 1.18 s


SVC()

Switching to the Intel replacement cut the wall time from 2.08 s to 1.18 s, a speedup of roughly 1.8x. Because scikit-learn's default SVC scales roughly quadratically with the number of samples while the Intel replacement scales close to linearly, the speedup becomes more pronounced as we scale up the task.

### Running on Compute Node

We will now run the same code on a compute node, which allows us to vary the number of CPU cores we use. I have put the code in `../Examples/speed-up.py`. We will make use of three convenient commands provided on the cluster:
- `compute [command]` runs a command with four CPU cores.
- `compute-1 [command]` does the same with one CPU core.
- `compute-16 [command]` does the same with 16 CPU cores.

To run a shell command in a notebook, prepend the line with `!`.

In [11]:
# Scikit-learn SVC
!compute-1 python ../Examples/speed-up.py -N 50000
!compute-16 python ../Examples/speed-up.py -N 50000

Available CPUs: 1
Generating 50000 samples...
Mode: Scikit-learn
10.864s
Available CPUs: 16
Generating 50000 samples...
Mode: Scikit-learn
10.902s


As we can see from above, scikit-learn's default SVC is incapable of making use of more than one CPU core.

Now the Intel replacement:

In [12]:
# Intel Extension for Scikit-learn
!compute-1 python ../Examples/speed-up.py sklearnex -N 50000
!compute python ../Examples/speed-up.py sklearnex -N 50000
!compute-16 python ../Examples/speed-up.py sklearnex -N 50000

Available CPUs: 1
Generating 50000 samples...
Mode: Intel Extension for Scikit-learn
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
3.886s
Available CPUs: 4
Generating 50000 samples...
Mode: Intel Extension for Scikit-learn
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
2.721s
Available CPUs: 16
Generating 50000 samples...
Mode: Intel Extension for Scikit-learn
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
2.707s


Not only is the Intel replacement faster on a single CPU core, it is also able to use multiple cores. That said, the algorithm does not use the additional cores very efficiently: going from 4 to 16 cores barely changes the run time. As we will see in the next section, it often makes more sense to parallelize the cross validation process than the model algorithm itself.
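
One way to quantify this is parallel efficiency: the speedup relative to one core, divided by the number of cores. The snippet below applies that definition to the wall times printed above; an efficiency of 100% would mean perfect scaling.

```python
# Wall times in seconds, copied from the Intel extension runs above
# key: number of CPU cores, value: wall time
timings = {1: 3.886, 4: 2.721, 16: 2.707}

baseline = timings[1]
efficiency = {}
for cores, seconds in timings.items():
    speedup = baseline / seconds          # how many times faster than one core
    efficiency[cores] = speedup / cores   # fraction of ideal linear scaling
    print(f"{cores:>2} cores: speedup {speedup:.2f}x, "
          f"efficiency {efficiency[cores]:.0%}")
```

Efficiency comes out at roughly 36% on 4 cores and 9% on 16 cores.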

## B. GridSearchCV

K-Fold Cross Validation is an *embarrassingly parallel* task, 
meaning that it can be easily decomposed into many identical subtasks.
These subtasks can be run concurrently to speed up the task.
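
To make the decomposition concrete, here is a minimal pure-Python sketch of how K-fold cross validation splits the data. `kfold_indices` is a hypothetical helper written for illustration, not scikit-learn's `KFold`; it uses contiguous, unshuffled folds. Each `(train, test)` pair defines a subtask that needs nothing from the other folds, which is what makes them safe to run concurrently.

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds (train, test) index pairs."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for i in range(n_folds):
        stop = start + base + (1 if i < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        folds.append((train, test))
        start = stop
    return folds

folds = kfold_indices(10, 5)
for i, (train, test) in enumerate(folds):
    print(f"fold {i}: train on {len(train)} samples, test on {test}")
```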

First, let us run `GridSearchCV` in its default settings, 
which only uses one CPU core:

In [17]:
%%time

# Remove the Intel Replacement
from sklearnex import unpatch_sklearn
unpatch_sklearn()

# GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
parameters = {'C':np.logspace(0.001,10,16)}
svc = SVC()
gscv = GridSearchCV(svc,parameters,cv=5)
gscv.fit(X, y)    

CPU times: user 1min 6s, sys: 447 ms, total: 1min 6s
Wall time: 1min 6s


GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': array([1.00230524e+00, 4.65157470e+00, 2.15873832e+01, 1.00184377e+02,
       4.64943307e+02, 2.15774441e+03, 1.00138251e+04, 4.64729242e+04,
       2.15675096e+05, 1.00092146e+06, 4.64515275e+06, 2.15575797e+07,
       1.00046062e+08, 4.64301407e+08, 2.15476543e+09, 1.00000000e+10])})

To make use of the additional CPU cores, we should increase `n_jobs` accordingly:

In [18]:
%%time

# GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
parameters = {'C':np.logspace(0.001,10,16)}
svc = SVC()
gscv = GridSearchCV(svc,parameters,n_jobs=4,cv=5)
gscv.fit(X, y)    

CPU times: user 1.15 s, sys: 26.4 ms, total: 1.18 s
Wall time: 19.1 s


GridSearchCV(cv=5, estimator=SVC(), n_jobs=4,
             param_grid={'C': array([1.00230524e+00, 4.65157470e+00, 2.15873832e+01, 1.00184377e+02,
       4.64943307e+02, 2.15774441e+03, 1.00138251e+04, 4.64729242e+04,
       2.15675096e+05, 1.00092146e+06, 4.64515275e+06, 2.15575797e+07,
       1.00046062e+08, 4.64301407e+08, 2.15476543e+09, 1.00000000e+10])})

Note that with four workers the wall time dropped from 1 min 6 s to 19.1 s, a speedup of roughly 3.5x.
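
To see why a grid search parallelizes so well, it helps to count the independent fits. With the 16 values of `C` and `cv=5` used above, there are 80 (candidate, fold) pairs, each of which can be fitted without reference to the others. The sketch below counts them and, under the idealized assumption that every fit takes the same time, shows the minimum number of sequential rounds for a few values of `n_jobs`.

```python
import math
from itertools import product

n_candidates = 16   # number of C values in the grid above
n_folds = 5         # cv=5

# Every (candidate, fold) pair is one independent fit
tasks = list(product(range(n_candidates), range(n_folds)))
print(f"{len(tasks)} independent fits")

rounds = {}
for n_jobs in (1, 4, 16):
    # Idealized assumption: equal-length fits and no scheduling overhead
    rounds[n_jobs] = math.ceil(len(tasks) / n_jobs)
    print(f"n_jobs={n_jobs:>2}: at least {rounds[n_jobs]} sequential rounds")
```

In practice the fits are not equally long, and by default `GridSearchCV` also refits the best candidate on the full data afterwards, so the observed speedup tends to fall short of this ideal.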


### Running on Compute Node


As before, let us try running the code on a compute node. 

First, the default grid search. Because the default uses only 
one CPU core, there is no speedup from having more CPUs:

In [9]:
!compute python ../Examples/speed-up.py gridsearch
!compute-16 python ../Examples/speed-up.py gridsearch

Available CPUs: 4
Generating 20000 samples...
Mode: Default GridSearchCV
89.597s
Available CPUs: 16
Generating 20000 samples...
Mode: Default GridSearchCV
88.163s


To make use of the additional CPU cores, we should increase `n_jobs` accordingly:

In [3]:
!compute python ../Examples/speed-up.py gridsearch-4
!compute-16 python ../Examples/speed-up.py gridsearch-16

Available CPUs: 4
Generating 20000 samples...
Mode: GridSearchCV with n_jobs=4
24.863s
Available CPUs: 16
Generating 20000 samples...
Mode: GridSearchCV with n_jobs=16
9.221s


Is it worthwhile to use `sklearnex` with a parallelized `GridSearchCV`? It appears to pay off mainly when training is quite time-consuming:

In [22]:
# 20K samples. Intel Extension.
!compute python ../Examples/speed-up.py gridsearch-4-sklearnex
!compute-16 python ../Examples/speed-up.py gridsearch-16-sklearnex

Available CPUs: 4
Generating 20000 samples...
Mode: GridSearchCV with n_jobs=4 and Intel Extension for Scikit-learn
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
18.225s
Available CPUs: 16
Generating 20000 samples...
Mode: GridSearchCV with n_jobs=16 and Intel Extension for Scikit-learn
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
12.754s


In [19]:
# 50K samples. Default scikit-learn.
!compute python ../Examples/speed-up.py -N 50000 gridsearch-4
!compute-16 python ../Examples/speed-up.py -N 50000 gridsearch-16

Available CPUs: 4
Generating 50000 samples...
Mode: GridSearchCV with n_jobs=4
102.673s
Available CPUs: 16
Generating 50000 samples...
Mode: GridSearchCV with n_jobs=16
33.389s


In [26]:
# 50K samples. Intel Extension.
!compute python ../Examples/speed-up.py -N 50000 gridsearch-4-sklearnex
!compute-16 python ../Examples/speed-up.py -N 50000 gridsearch-16-sklearnex

Available CPUs: 4
Generating 50000 samples...
Mode: GridSearchCV with n_jobs=4 and Intel Extension for Scikit-learn
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
38.576s
Available CPUs: 16
Generating 50000 samples...
Mode: GridSearchCV with n_jobs=16 and Intel Extension for Scikit-learn
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
20.851s
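
Reading the four pairs of runs side by side makes the pattern easier to see. The snippet below only reorganizes the wall times printed above and divides the default scikit-learn time by the Intel extension time for each configuration; a ratio above 1 means the extension was faster.

```python
# Wall times in seconds, copied from the compute-node runs above
# key: (samples, n_jobs), value: (default scikit-learn, Intel extension)
timings = {
    (20000, 4): (24.863, 18.225),
    (20000, 16): (9.221, 12.754),
    (50000, 4): (102.673, 38.576),
    (50000, 16): (33.389, 20.851),
}

ratios = {key: default / intel for key, (default, intel) in timings.items()}
for (samples, n_jobs), ratio in ratios.items():
    verdict = "Intel faster" if ratio > 1 else "Intel slower"
    print(f"N={samples}, n_jobs={n_jobs:>2}: "
          f"default/Intel time ratio {ratio:.2f} ({verdict})")
```

On these numbers the extension helps in three of the four configurations and is slower only for 20K samples with 16 workers, where each individual fit is already short.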


## C. Multi-Processing

Finally, for algorithms that do not have a built-in method to run on multiple CPU cores, 
we can run multiple copies of identical code with the `multiprocessing` package. 
We could use this to run, for example, our own implementation of bootstrapping or 
cross validation.

The simplest way to do so is:
1. Define a function that contains code you want to run multiple copies of:
```python
def f(data):
    # Do something with data; this placeholder just passes it through
    result = data
    return result
```
2. Use the following code to run the function:
```python
from multiprocessing import Pool

# N and list_of_data are placeholders, explained below
with Pool(N) as p:
    results = p.map(f, list_of_data)
```
    - `N` is the number of workers. 
        - Because workers work concurrently, we typically want to set it to the number of CPU cores.
        - The exception is, if the algorithm itself is capable of using
        multiple CPU cores, we might want to set the number of workers to
        a number smaller than the number of CPU cores.
    - `list_of_data` is a list of data. Each element in this list will result in a subtask being created
    and executed by a worker.
    - `results` is a list that contains the result returned by each subtask.
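
Putting the two steps together, a minimal self-contained sketch might look like the following. The squaring workload and the `run` helper are placeholders for illustration. The `if __name__ == "__main__":` guard is needed on platforms that start workers by spawning a fresh interpreter, and is harmless elsewhere.

```python
from multiprocessing import Pool

def f(data):
    # Placeholder workload: square the input
    return data * data

def run(n_workers, list_of_data):
    # Each element of list_of_data becomes one subtask;
    # map returns the results in the same order as the inputs
    with Pool(n_workers) as p:
        return p.map(f, list_of_data)

if __name__ == "__main__":
    print(run(2, [1, 2, 3, 4]))
```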
    
First, let us try running four subtasks with one worker:

In [29]:
%%time 

from multiprocessing import Pool
import numpy as np
from sklearn.utils import resample
from sklearn.svm import SVC

def reg(id):
    # Bootstrap data and run SVC
    boot_X,boot_y = resample(X,y)
    svc = SVC()
    svc.fit(boot_X,boot_y)
    return svc.score(boot_X,boot_y)

# Multiprocessing pool with one worker
with Pool(1) as p:
    results = p.map(reg,np.arange(4))
    
results

CPU times: user 8.54 ms, sys: 11.9 ms, total: 20.4 ms
Wall time: 14.3 s


[0.99475, 0.9945, 0.99365, 0.99445]

Now we will run four subtasks with four workers. We should see essentially linear speedup:

In [30]:
%%time 

from multiprocessing import Pool
import numpy as np
from sklearn.utils import resample
from sklearn.svm import SVC

def reg(id):
    # Bootstrap data and run SVC
    boot_X,boot_y = resample(X,y)
    svc = SVC()
    svc.fit(boot_X,boot_y)
    return svc.score(boot_X,boot_y)

# Multiprocessing pool with four workers
with Pool(4) as p:
    results = p.map(reg,np.arange(4))
    
results

CPU times: user 4.81 ms, sys: 28 ms, total: 32.8 ms
Wall time: 3.93 s


[0.99475, 0.99475, 0.99475, 0.99475]
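
Notice that the four scores in the last output are identical, and equal to the first score of the one-worker run. A likely cause, assuming the worker processes are created by forking (as is common on Linux), is that each worker inherits a copy of the parent's global NumPy random state, so every worker draws the same bootstrap sample. One way to avoid this is to seed a separate generator from the task argument. The sketch below uses NumPy only; `bootstrap_indices` is a hypothetical helper for illustration. With `sklearn.utils.resample`, the analogous change would be to pass the task id as `random_state`.

```python
from multiprocessing import Pool

import numpy as np

def bootstrap_indices(seed, n_samples=10):
    # Use a generator seeded per task instead of the inherited global state
    rng = np.random.default_rng(seed)
    # Sample row indices with replacement
    return rng.integers(0, n_samples, size=n_samples)

if __name__ == "__main__":
    with Pool(4) as p:
        samples = p.map(bootstrap_indices, range(4))
    # Each task now gets a different, reproducible bootstrap sample
    for seed, idx in enumerate(samples):
        print(seed, idx)
```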