# Introduction to using GPUs for Analytics

Speaker: [Randy Zwitch](https://github.com/randyzwitch), Senior Director of Community at [OmniSci](https://www.omnisci.com/) <br>
PyData PHL: https://www.meetup.com/PyData-PHL/events/268253667/ <br>
Feb 18, 2020 <br>

This notebook demonstrates some of the basic principles for using GPUs to accelerate computations. It is not intended to be a primer on machine learning; rather, the intent is to help users gain an intuition about code that can be parallelized in general, then show the speedup from moving computation from CPU to GPU.


### 0. Example Data

In [None]:
import pandas as pd

#1 month of bikeshare data from Baywheels (SF)
#~295k records: not that large, but useful as an example
#full dataset: https://s3.amazonaws.com/baywheels-data/index.html
baywheels_df = pd.read_csv("https://s3.amazonaws.com/baywheels-data/202001-baywheels-tripdata.csv.zip", low_memory=False)
baywheels_df.shape

In [None]:
baywheels_df.head()

## CPU, single-threaded computation

### 1. For Loop

While __for loops__ are very powerful, in an interpreted language like Python they are rarely the _fastest_ way to perform a calculation, especially in analytics, where the same calculation is applied to thousands, millions, or billions of data elements.

For loops can read and write global variables; because anything can change these global variables, the interpreter can't assume anything about the data in terms of optimizations. The extreme flexibility of for loops works against their performance.

In [None]:
%%timeit

minutes = []
for val in baywheels_df["duration_sec"]:
    minutes.append(val / 60)
    
baywheels_df["duration_min_loop"] = minutes

### 2. List Comprehension

A list comprehension is the more common way in Python to apply a function over an array. Because a list comprehension has a smaller surface area of what it can do than a for loop, the code _can be_ more specialized (but this depends on the Python implementation). An implementation that knows the size of the input (`baywheels_df["duration_sec"]`) could also allocate the full memory for the output list at once; CPython does not do this for comprehensions, and gets its gain mainly from appending with a dedicated bytecode instruction instead of looking up and calling `minutes.append` on every iteration.

In this example, the list comprehension is roughly 15% faster than a for loop. You can still do some weird things with list comprehensions (run functions that only have side-effects, filtering, etc.), but they are still more predictable than a for loop.

In [None]:
%%timeit
baywheels_df["duration_min_lc"] = [x/60 for x in baywheels_df["duration_sec"]]
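
To reproduce the loop vs. list comprehension comparison outside a notebook, here is a minimal sketch using the standard-library `timeit` module. The `durations_sec` list is synthetic stand-in data (an assumption, not the Baywheels column), and the helper names `with_loop` and `with_comprehension` are hypothetical. Timings will vary by machine and Python version.

```python
import timeit

# Synthetic stand-in for the duration_sec column (assumed values, not the real data)
durations_sec = list(range(60, 60 + 295_000))

def with_loop(values):
    """Build the output list with an explicit for loop and append."""
    minutes = []
    for val in values:
        minutes.append(val / 60)
    return minutes

def with_comprehension(values):
    """Build the same output list with a list comprehension."""
    return [val / 60 for val in values]

# Take the best of 3 repeats, each running the function 3 times
loop_time = min(timeit.repeat(lambda: with_loop(durations_sec), number=3, repeat=3))
comp_time = min(timeit.repeat(lambda: with_comprehension(durations_sec), number=3, repeat=3))
print(f"for loop: {loop_time:.4f}s  list comprehension: {comp_time:.4f}s")
```

Both functions return identical lists; only the time taken to build them differs.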

### 3. Using NumPy/pandas "vectorization"

Extending the idea of "smaller surface area" functions being able to be more optimized at the expense of flexibility, "vectorized" calculations have the same properties as list comprehensions (i.e. input/output size known at call time). NumPy goes one step further, being written in C, which allows for a 65x speedup (49.8 milliseconds vs. 759 _microseconds_).

However, once you "drop to C" (or other compiled languages), you eventually run out of ability to go any faster. One thread eventually hits the maximum amount of work possible for a given CPU clock speed.

In [None]:
%%timeit
baywheels_df["duration_min_vec"] = baywheels_df["duration_sec"] / 60
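
As a small illustration of what "vectorized" means here, the sketch below applies the same division to a synthetic NumPy array (assumed values, not the Baywheels data), once element by element in the interpreter and once as a single array expression whose loop runs inside NumPy's compiled code. It only checks that the two approaches agree; it makes no claim about the speedup on any particular machine.

```python
import numpy as np

# Synthetic stand-in for the duration_sec column (assumed values)
durations_sec = np.arange(60, 60 + 295_000, dtype=np.int64)

# Element by element: the Python interpreter handles every value
minutes_lc = [val / 60 for val in durations_sec]

# Vectorized: one expression, the loop over elements happens in compiled code
minutes_vec = durations_sec / 60

print(minutes_vec.dtype, minutes_vec.shape)
print(np.allclose(minutes_vec, minutes_lc))
```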

### 4. Using multiprocessing

If you run out of single-threaded performance, the answer _must be_ MOAR THREADS, right? Not necessarily...

In the example below, I use `multiprocessing` to create 6 workers to process this data. Because Python threads are limited by the [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock), `multiprocessing` _forks_ the main environment (i.e. makes separate copies of it). This copying guarantees the safety of the parallel operation, but it is expensive, so you need to be doing A LOT of work to make it worth it. Some IO operations, database calls, etc. can be worth parallelizing with `multiprocessing` (as Python isn't the limiting factor for these operations).

In addition to making copies of the environment, we are still running interpreted Python (in parallel) vs. compiled C. Single-threaded NumPy/pandas/C outperforms Python by such a large margin that parallelism isn't enough to overcome the interpreted Python code, even with 6x as many workers running.

In [None]:
from multiprocessing import Pool

def to_minute(x):
    return x/60

p = Pool(6) #have a 6-core demo machine

In [None]:
%%timeit
baywheels_df["duration_min_pool"] = p.map(to_minute, baywheels_df["duration_sec"])
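
One way to give each worker "a lot of work" per task is to hand it a whole chunk of the data instead of relying on per-element calls. The following is only a sketch under stated assumptions: `to_minutes_chunk`, `split_into_chunks`, and `parallel_minutes` are hypothetical helper names, not part of the original notebook, and whether this beats the single-threaded vectorized version depends on the machine and the size of the data.

```python
from multiprocessing import Pool

def to_minutes_chunk(chunk):
    """Convert one chunk of durations (seconds) to minutes."""
    return [val / 60 for val in chunk]

def split_into_chunks(values, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional element
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks

def parallel_minutes(values, workers=6):
    """Process one chunk per worker, then flatten the results in order."""
    chunks = split_into_chunks(values, workers)
    with Pool(workers) as pool:
        results = pool.map(to_minutes_chunk, chunks)
    return [minute for chunk in results for minute in chunk]
```

In a script, `parallel_minutes(list(baywheels_df["duration_sec"]))` would normally be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.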

## GPU Examples

### Why can GPUs be useful for analytics?

Many analytics use cases are highly parallelizable. When calculating sales by zip code, average salary by age, or any other "group by" type of operation, the results of one group aren't determined by or based on any other group. GPUs are set up to apply simple arithmetic functions (often called 'kernels') to thousands of data elements in parallel.

That GPUs usually have a lower clock speed than CPUs is beside the point; the massive parallelism of GPUs far overshadows the lower amount of work a single thread might be doing. Think of a CPU vs. a GPU as similar to a car vs. a bus...

A car might be able to drive around a track with 4 people inside it in 100 seconds. It will take the car 1000 seconds to move 40 people around the track (ignoring loading times). A bus might drive a lap in 200 seconds (half as fast), but carry 40 people. In this case, it takes the bus 200 seconds to move 40 people, __5x faster in clock time__ than the car. GPUs can be viewed as maximizing _throughput_, not the speed of a single process like a CPU.

Of course, in the case of CPUs vs. GPUs, the throughput disparity is even higher than a car vs. a bus. Even in this consumer laptop, I have 640 GPU cores vs. 6 CPU cores; a high-end GPU might have 4000-5000 GPU cores, and computations can be parallelized across multiple GPUs.
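
The car vs. bus arithmetic above can be written out as a short calculation. `time_to_move` is a hypothetical helper that uses only the numbers from the analogy and, as in the text, ignores loading times.

```python
import math

def time_to_move(people, capacity, lap_seconds):
    """Seconds needed to move everyone around the track, ignoring loading time."""
    laps = math.ceil(people / capacity)
    return laps * lap_seconds

# Throughput: moving 40 people
car_seconds = time_to_move(people=40, capacity=4, lap_seconds=100)
bus_seconds = time_to_move(people=40, capacity=40, lap_seconds=200)
print(car_seconds, bus_seconds, car_seconds / bus_seconds)  # 1000 200 5.0

# Latency: moving a single person, where the car wins
print(time_to_move(1, 4, 100), time_to_move(1, 40, 200))  # 100 200
```

The second line of output is the other half of the analogy: for a single passenger the car is twice as fast, which is why a small amount of work may not benefit from a GPU.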


### 5. CuPy

NumPy has been indispensable in bringing Python to the scientific community. By mixing typed arrays and high-performance code written in C, NumPy overcomes a lot of the issues with using interpreted programming languages for analysis (while not bogging the user down with the compile-debug-run workload).

Because of NumPy's success, its API has been implemented in other array-focused libraries. [CuPy](https://cupy.chainer.org/) continues this tradition, allowing NumPy-like syntax to run against NVIDIA GPUs.

In the example below, even doing only a minimal amount of work on the GPU (array of 295k elements) is __13x faster__ compared to the "vectorized" CPU NumPy pandas example!

In [None]:
#Check that an NVIDIA GPU is running locally
!nvidia-smi

In [None]:
import cupy as cp
import numpy as np

#Bring data from pandas dataframe to NVIDIA GPU
#THIS COPYING IS NOT FREE! Like multiprocessing, need to make sure you are doing "enough work" to make this worthwhile
#Copying to GPU however is orders of magnitude faster than forking new Python processes
duration_sec_cp = cp.asarray(baywheels_df["duration_sec"])

In [None]:
%%timeit

#Same basic operation as the NumPy pandas example, except way faster
#Still not saturating the GPU for calculations, so 13x performance improvement is smaller than theoretically possible
#The more intense the workload, such as linear algebra operations, the bigger the improvement will be
duration_min_cp =  duration_sec_cp/60
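
A caveat when timing GPU code: CuPy launches GPU kernels asynchronously, so a wall-clock timer can stop before the GPU has finished the work. The sketch below is a hypothetical `time_call` helper (not part of CuPy or of the original notebook) that accepts an optional synchronization callable. The demonstration runs on the CPU only, so it works without a GPU; the commented lines show how it could be used with CuPy, assuming CuPy and an NVIDIA GPU are available.

```python
import time

def time_call(fn, sync=None, repeats=100):
    """Return mean wall-clock seconds per call of fn.

    If given, sync is called once before the clock starts and once before
    it stops, so that queued asynchronous work is included in the timing.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / repeats

# CPU-only demonstration on synthetic data (assumed values)
data = list(range(295_000))
cpu_seconds = time_call(lambda: [x / 60 for x in data], repeats=5)
print(f"{cpu_seconds:.6f} s per call")

# With CuPy and an NVIDIA GPU, the GPU cell could be timed as (assumption):
#   import cupy as cp
#   gpu_seconds = time_call(lambda: duration_sec_cp / 60,
#                           sync=cp.cuda.Stream.null.synchronize)
```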

### 6. cuDF

Of course, if you're not doing linear algebra, there's no reason to use CuPy (even if it's a drop-in replacement for NumPy). [cuDF](https://github.com/rapidsai/cudf) attempts to mimic pandas for operations, so that moving from CPU to GPU is as frictionless as possible.

In [None]:
import cudf

#Like transferring data with CuPy, this is not a "free" operation
#Need to make sure enough work will be done to make the memory transfer worth it
baywheels_df_gpu = cudf.from_pandas(baywheels_df)

In [None]:
#Can validate the type of the object, since it can be confusing!
type(baywheels_df_gpu)
#baywheels_df_gpu.head()

In [None]:
%%timeit

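#Divide the GPU-resident column; dividing baywheels_df["duration_sec"] instead would run on the CPU and copy the result to the GPU each time
#For one small column this can still end up slower than CPU, likely because there isn't "enough work" vs highly optimized NumPy code
baywheels_df_gpu["duration_min_gpu"] = baywheels_df_gpu["duration_sec"] / 60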

## Takeaways

In this talk, I've intentionally kept the discussion high-level to help build intuition around operations that can be parallelized. These toy examples don't necessarily highlight the maximum performance you would see moving a CPU workload to a GPU, but they show that very little GPU-specific knowledge is needed to get started.

Once you move beyond CuPy and cuDF as drop-in replacements for NumPy and pandas, things can become more complex. Writing CUDA kernels using Numba or PyCUDA leads to code that necessarily starts to look lower-level than most Python you will see. You will also need a strong understanding of associative and commutative operations, as well as how to keep GPU threads synchronized for different steps of a kernel. This is of course beyond the scope of this beginner talk, but if you are interested in learning more, both [NVIDIA](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html) and [Numba](https://numba.pydata.org/numba-doc/latest/cuda/kernels.html) provide great learning materials.
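
To give a feel for why associativity matters, here is a CPU-only sketch of a pairwise ("tree") reduction, the general shape a parallel reduction takes. `tree_reduce` is a hypothetical helper that runs sequentially in plain Python; it is not a GPU kernel, and it assumes a non-empty input. Because it keeps elements in their original order, it illustrates associativity only, not commutativity.

```python
from functools import reduce
import operator

def tree_reduce(op, values):
    """Combine values pairwise, level by level, as a parallel reduction would."""
    values = list(values)
    while len(values) > 1:
        # Each pair at a given level could be combined by a separate thread
        paired = [op(values[i], values[i + 1]) for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            # Carry an unpaired last element forward to the next level
            paired.append(values[-1])
        values = paired
    return values[0]

data = [1, 2, 3, 4, 5, 6, 7, 8]

# Addition is associative: regrouping does not change the answer
print(reduce(operator.add, data), tree_reduce(operator.add, data))  # 36 36

# Subtraction is not: the left-to-right and pairwise groupings disagree
print(reduce(operator.sub, data), tree_reduce(operator.sub, data))  # -34 0
```

Integer addition gives the same result under either grouping, so it can be split safely across threads; subtraction does not, so a naive parallel version would silently return a different answer.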