# parallelization and `gpu` analytics

## the back-story: parallelization

the code we write is (generally speaking) sequential.

```python
x = 100
y = x * 4
print(x + y)
for i in range(x):
    print(i ** 2)
```

programming this way matches the way we read and write, and the way we tend to break problems down logically. we write out an ordered list of deterministic steps, the computer executes them in exactly that order, and we let 'er rip

at the lowest levels, the machinery implementing our programs also performs those basic computational actions sequentially. these calculations are sent to the central processing unit (`cpu`), which assigns each calculation to one of its little local workers (a "core")

sometimes, though, a thing we tell the computer to do is kind of a dud, and the `cpu` / core sits waiting on some *other* thing (loading something from RAM, talking to another part of the computer, etc) to complete before it can go on.

computers *hate* this stuff. they are incredibly impatient. they're *busy*, don't you get that?

it's really just that they want to please, don't take it too personally

and it turns out that we actually do this... a lot. especially in data science, where many of the things we do are *embarrassingly parallel* -- that is, we are asking a computer to do simple things over and over and over again with slightly different parameters or conditions. some examples:

**hyperparameter optimization**: for my model, try each of these N hyperparameter sets

```py
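for hyperparams in hyperparameter_list:
    # `sklearn.mymodel` is a placeholder for whatever estimator you are tuning
    clf = sklearn.mymodel(**hyperparams)
    # fit on the training features and labels only
    clf.fit(X_train, y_train)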
```

**k-fold cross validation**: for each fold, train a model

```python
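kf = sklearn.model_selection.KFold(n_splits=10)
# `split` yields arrays of row indices, not the rows themselves
for train_idx, test_idx in kf.split(X):
    clf = sklearn.mymodel()  # placeholder estimator
    clf.fit(X[train_idx], y[train_idx])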
```

**vectorized computations**: I need to calculate a gradient w.r.t. each of these N dimensions

```python
for i in range(len(w)):
    grad[i] = gradient_wrt(loss, i, w)
```

**many tasks on the same data**: for each of these 1000 random forests, train on this same dataset

```python
for forest in random_forests:
    forest.fit(X_train, y_train)
```

**one task on many pieces of data**: for each of these million `mnist` digit images, calculate the loss of my convolutional neural net model

```python
for (image, label) in mnist_images:
    total_loss += my_loss(mymodel, image, label)
```

**iterations**: after you get my cnn loss, backpropagate changes to my coefficients and do it all over again... a few million times

```python
for i in range(1_000_000):
    mymodel.forward_prop()
    mymodel.backward_prop()
    mymodel.summary()
```

**every matrix multiplication step**: there are a few of these

```python
for epoch in epochs:
    for batch in batches:
        for layer in layers:
            for node in layer:
                batch * node.weights + node.biases
```

**example preparation masochism**: multiply this linear regression matrix against the `iris` dataset every day for the rest of my life

of course, the problem is neither new nor unique to data scientists, so over time computer engineers have dedicated tons of time and resources to making extremely efficient use of our precious and limited computational resources. this has included:

##### bit-level parallelism

increasing the number of bits a single computation can act on. this makes numeric computations with "large" numbers much faster (or equivalently, makes the definition of "large" much larger)

<table>
    <tr>
        <th>8 bits</th>
        <th>16 bits</th>
        <th>32 bits</th>
        <th>64 bits</th>
    </tr>
    <tr>
        <td><img width="50px" src="https://cdn.shopify.com/s/files/1/1137/2142/products/mega-man-8-bit-megaman-jump-vinyl-wall-decal-poster-detail.jpg?v=1516991481"></td>
        <td><img width="100px" src="https://vermillion95.files.wordpress.com/2014/11/16bit-megaman.png"></td>
        <td><img width="200px" src="https://videochums.com/article/digging-up-the-mega-man-legends-series.jpg"></td>
        <td><img width="400px" src="https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/screen-shot-2018-10-03-at-5-46-34-pm-1538603301.png?resize=480:*"></td>
    </tr>
</table>
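
to make the word-size point a bit more concrete, here is a rough sketch of how a machine that can only add 8 bits at a time would have to add two 16-bit numbers: two additions plus a carry, where a wider machine needs only one. the helper names (`add_8bit`, `add_16bit_with_8bit_alu`) are made up for illustration and don't correspond to any real instruction set.

```python
def add_8bit(a, b, carry_in=0):
    """add two 8-bit values; return (8-bit result, carry out)"""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add_16bit_with_8bit_alu(x, y):
    """add two 16-bit values using only 8-bit additions (hypothetical helper)"""
    # step 1: add the low bytes
    low, carry = add_8bit(x & 0xFF, y & 0xFF)
    # step 2: add the high bytes, plus the carry from step 1
    high, _ = add_8bit(x >> 8, y >> 8, carry)
    # overflow past 16 bits is dropped, as it would be in hardware
    return (high << 8) | low


# two 8-bit steps here; a 16-bit (or wider) processor does this in one
result = add_16bit_with_8bit_alu(300, 500)
print(result)
```

the same logic scales up: the wider the word, the fewer steps it takes to handle a "large" number.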

##### instruction level parallelism

if we break up every computation (aka "thing the `cpu` can do") into distinct parts (e.g. fetch instructions, decode them, execute them, store results in memory, and write the results to a local register), and have different physical components of the processor do each of those tasks, we can overlap several computations at a time (this is called *pipelining*):

+ fetch the 1st instruction
+ decode the 1st, fetch the 2nd
+ execute the 1st, decode the 2nd, fetch the 3rd
+ ...
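
the overlap in the list above can be sketched as a tiny simulation. this is a toy model, assuming only three stages and that every stage takes exactly one clock cycle (real processors have more stages and plenty of complications); `pipeline_schedule` is a name invented for this example.

```python
STAGES = ["fetch", "decode", "execute"]


def pipeline_schedule(n_instructions, stages=STAGES):
    """return, for each clock cycle, which instruction occupies which stage"""
    n_cycles = n_instructions + len(stages) - 1
    schedule = []
    for cycle in range(n_cycles):
        active = {}
        for stage_idx, stage in enumerate(stages):
            # instruction k enters stage s at cycle k + s
            instruction = cycle - stage_idx
            if 0 <= instruction < n_instructions:
                active[stage] = instruction + 1  # 1-based, like the list above
        schedule.append(active)
    return schedule


for cycle, active in enumerate(pipeline_schedule(4), start=1):
    print(cycle, active)
```

in this model four instructions finish in 6 cycles rather than the 12 it would take to run each one start to finish before beginning the next.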

##### task parallelism

increase the number of cores you have to work with (multiprocessing), or create cores that are internally capable of working on multiple processes at the same time (multithreading).

we can multithread pretty easily in `python`. let's look at the performance of two functions called sequentially and then as two separate threads

```python
def pow_2(n):
    return n * n

def pow_10(n):
    return n * n * n * n * n * n * n * n * n * n
```

```python
%%timeit -n1_000
n2 = pow_2(1)
n10 = pow_10(1)
```

```
391 ns ± 12.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```


and now let's see what happens if we *thread* those calculations (that is, tell our `cpu` to work on both things at the same time instead of sequentially)

```python
import threading
```

```python
%%timeit -n1_000
thread1 = threading.Thread(target=pow_2, args=(1,))
thread2 = threading.Thread(target=pow_10, args=(1,))
thread1.start()
thread2.start()
thread1.join()
thread2.join()
```

```
215 µs ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```


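the threaded version is roughly 550 times *slower* (391 ns vs 215 µs, nearly 3 orders of magnitude). the threading test timing *includes* the time it takes to build and start the `threading.Thread` objects, and for calculations this tiny that overhead swamps any benefit. on top of that, standard `cpython` only lets one thread execute `python` bytecode at a time (the global interpreter lock), so threads don't speed up pure-`python` number crunching anyway. threading is easy to write; making it pay off takes tasks that are big enough, or that spend most of their time waiting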
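
where threads *do* help in `python` is the "core sits waiting on some other thing" situation from earlier. the sketch below fakes that waiting with `time.sleep`; `slow_lookup` and the 0.2 second waits are stand-ins made up for this example, not a real workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_lookup(seconds):
    """stand-in for a task that mostly waits (disk, network, database)"""
    time.sleep(seconds)
    return seconds


waits = [0.2, 0.2, 0.2, 0.2]

# one after the other: the waits add up
start = time.perf_counter()
sequential_results = [slow_lookup(w) for w in waits]
sequential_time = time.perf_counter() - start

# one thread per task: the waits overlap
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(waits)) as pool:
    threaded_results = list(pool.map(slow_lookup, waits))
threaded_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

the results are identical either way; only the wall-clock time changes, because the four waits happen at the same time instead of back to back.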

## single-machine limits

the advances above occurred slowly but surely from the 1960s on into the 2010s, and were focused on squeezing every last drop of performance out of single `cpu`s. people were also trying to make those `cpu`s faster

and they have had exponential success... up until very recently

<br><img src="http://www.extremetech.com/wp-content/uploads/2012/02/CPU-Scaling.jpg" width="600px"></img>

### go tall

for a long time, our computer's ability to generate parallelizable tasks far outstripped the capabilities of even the fastest `cpu`s. as a result, the answer to whether or not we needed more `cpu` power was pretty much always the same:

<div align="center"><img src="https://i.imgur.com/0fWjvQ0.png" width="500px"></div>

### go wide

having more cores on my single computer's processor was and still is great -- I'm able to do everything faster.

but what if I had two computers...

<br><div align="center"><img src="http://i208.photobucket.com/albums/bb106/Allscifi/Sherlock/Sherlock17.png" width="800px"></div>

the decision to scale *outward* instead of *upward* in computational resources has been the driving force behind a ton of the most powerful advances in computational methods. the move to "the cloud" has largely been about this exact concept: I can dynamically update my resources (size of storage, or number of `cpu` cores, and therefore number of simultaneous computations I can execute) by adding additional networked computers.

we will revisit this approach in the next lecture (`hadoop` and `spark`, in particular), so we won't belabor it now

but there is one other option

## go game

<br><div align="center"><img src="https://mygaming.co.za/news/wp-content/uploads/2017/06/AcerForGaming-01.jpg" width="900px"></div>

the increasingly graphical nature of computations -- frame-by-frame rendering of screens for video games, desktops, mobile phones, webpages, etc -- meant that `cpu`s were spending a lot of their time performing these specific niche computations. this incentivized engineers and companies (e.g. nvidia) to sink research and development into constructing extremely efficient hardware and software for offloading that subset of calculations to a new processing unit, and the `g`raphics `p`rocessing `u`nit (or `gpu`) was born.

because the actions that the `gpu` performs are much simpler, it is possible for the `gpu` architecture to be much simpler as well. this allows us to put **a lot** more cores on one `gpu`:

<br><div align="center"><img src="https://www.nextron.no/en/TEMPLATE/NEXTRON_NY/grafikk/gpu-computing-cores.jpg" width="900px"></div>

and the performance shows:

<br><div align="center"><img src="https://www.karlrupp.net/wp-content/uploads/2013/06/flops-per-cycle-sp.png" width="900px"></div>

in addition, the actual cost of `gpu` analytics -- while not cheap by any stretch -- is considerably less:

<br><div align="center"><img src="https://cdn-images-1.medium.com/max/2000/1*Hsu_MSC58ZR2Dl7QvT8Ycg.png" width="1000px"></div>

back to what we can do with `gpu`s, though. take, for example, the 3d world created in a video game. internally, a single frame is represented by a handful of points (vectors in some space) and rules for drawing the frame based on those vectors.

frame-to-frame changes (e.g. moves made via the controller) are translations, rescalings, or rotations of those vectors, and each of those actions can be (and is) expressed as a matrix multiplication

<br><div align="center"><img src="https://www.alanzucconi.com/wp-content/uploads/2016/02/2D_affine_transformation_matrix.svg_.png" width="500px"></div>
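
here is a small sketch of that idea in 2d. it uses plain `python` lists so it runs anywhere (in practice you would reach for `numpy`, or let the `gpu` do it), and it relies on the standard trick of *homogeneous coordinates*: tacking an extra 1 onto each point so that a translation also becomes a matrix multiplication. all function names here are invented for the example.

```python
import math


def matmul(a, b):
    """multiply two matrices stored as lists of rows"""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def translation(tx, ty):
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def scaling(sx, sy):
    return [[sx, 0, 0], [0, sy, 0], [0, 0, 1]]


# the point (1, 0) as a column vector in homogeneous coordinates
point = [[1], [0], [1]]

# rotate 90 degrees, then double the size, then shift right by 3:
# the three steps collapse into a single matrix
transform = matmul(translation(3, 0), matmul(scaling(2, 2), rotation(math.pi / 2)))
moved = matmul(transform, point)
print([round(row[0], 6) for row in moved])
```

the point ends up at (3, 2), up to floating point rounding. the payoff is that the combined `transform` is computed once and can then be applied to every point in the frame, which is exactly the kind of repetitive work a `gpu` is built for.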

there are a few other things I can think of that are basically just matrix multiplication:

<br><div align="center"><img src="https://cdn-images-1.medium.com/max/1600/1*UKIHA2AHtB9WPG-KrfwSZg.png" width="500px"></div>

to be serious, though, `gpu`s excel at exactly the sorts of things we tend to do in data science and analytics:

+ floating point arithmetic
+ dense linear algebra
+ doing the same calculation on many different data points (`S`ingle `I`nstruction `M`ultiple `D`ata, `SIMD`)
    + training or predicting on many records
    + monte carlo simulations
    + hyperparameter tuning

the current age of `gpu` data science was kicked off in the late 2000s, in particular with [this paper](https://ai.stanford.edu/~ang/papers/icml09-LargeScaleUnsupervisedDeepLearningGPU.pdf) in which Andrew Ng et al. demonstrated speedups of up to 70x on state of the art unsupervised learning methods when using `gpu`s instead of `cpu`s.

for deep learning, being 70x faster doesn't necessarily excite as much as the ability to train or predict on 70x as much **data** in the allocated time.

**<div align="center">what are your questions so far?</div>**

## applications and frameworks

there have been quite a few software products put out that attempt to leverage `gpu` performance for analytical tasks.

### `gpu` databases

many of the parallelization actions we talked about in our database lectures are reproducible with `gpu`s. for example, we could take a large aggregation query, distribute portions of the underlying data to different `gpu` cores, perform those sub-calculations in parallel, and then aggregate the partial results.
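
the split / aggregate / combine pattern might look something like the sketch below. to be clear about the assumptions: no `gpu` is involved, threads stand in for `gpu` cores, the "query" is a simple average, and `parallel_mean` is a hypothetical helper rather than anything a real `gpu` database exposes.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(chunk):
    """per-worker step: reduce one chunk to a (sum, count) pair"""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_workers=4):
    """hypothetical helper: an AVG query done as split / aggregate / combine

    assumes `values` has at least one row
    """
    # 1. distribute: split the rows into (at most) one chunk per worker
    chunk_size = -(-len(values) // n_workers)  # ceiling division
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    # 2. sub-calculations: each worker aggregates its own chunk
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_aggregate, chunks))
    # 3. combine: merge the small partial results into the final answer
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


sales = list(range(1, 1001))
print(parallel_mean(sales))
```

note that each worker returns a sum *and* a count rather than its own average: averages of chunks can't be safely combined when the chunks are different sizes, but sums and counts always can.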

the current players in the `gpu` database field are

+ in memory (*extremely* fast, all records held in memory, data sizes up to several terabytes)
    + [`kinetica`](https://www.kinetica.com/gpu-database/)
    + [`omnisci` (formerly `mapd`)](https://www.omnisci.com/)
    + [`brytlyt`](https://www.brytlyt.com/) (this is actually in `gpu` ram, so even faster in theory)
+ on disk (still very fast, but can handle data size beyond memory limits)
    + [`sqream db`](https://sqream.com/product/)
    + [`blazingdb`](https://blazingdb.com/#/)
+ nosql
    + [`blazegraph` aka amazon `neptune`](https://www.blazegraph.com/), a high-performance graph database that implements `gpu` acceleration (i.e. isn't *wholly* a `gpu` database)

### bitcoin mining

I won't go into this much, but a lot of the recent demand for (and research into) `gpu`s has been driven by bitcoin mining. if you're interested in starting, for the low-low price of only $27K you can get this beautiful rig off etsy of all places:

<a href="https://www.etsy.com/listing/597628679/6-nvidia-titan-v-gpu-mining-rig?gpla=1&gao=1&&utm_source=google&utm_medium=cpc&utm_campaign=shopping_us_a-electronics_and_accessories-computers_and_peripherals-computers&utm_custom1=2bbe5f9a-af0b-4e93-a156-633d1986d872&utm_content=go_304501835_22746077915_78727306955_aud-360480980339:pla-106554054755_c__597628679&gclid=CjwKCAiAz7TfBRAKEiwAz8fKOIv-ffgVc6AtHspdiyc2xwj3UT54R0kgR6u3pjhDodAHSHhJ03pW5RoCq5wQAvD_BwE"><img src="https://i.etsystatic.com/17221025/r/il/08a712/1441922516/il_570xN.1441922516_4vgi.jpg" width="500px"></a>

### analytics frameworks

the real reason we are talking about `gpu` analytics is the ability to utilize `gpu`s in our code (e.g. instructing our deep neural net to be trained on our `gpu`).

there is a lot of software required to convert our extremely high-level `python` code down to instructions consumable by your `gpu`

*it doesn't **have** to be `nvidia` products, but they currently dominate the market*

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1M3LZQRI8nfCscnyL_h7xjKi4i9e8lo1t"></div>

unless you are planning on writing your code in `c++` (and writing it better than google's world-class fleet of software engineers), you are probably going to spend most of your deep learning programming time in the `tensorflow` and `keras` region. we will cover these two libraries in more detail in the next lecture.

for now it suffices to say that there are a fair number of moving pieces between you and a `gpu` (just as there are between you and a `cpu`!) and any analytics framework you work with will sit somewhere along this spectrum.

for example:

+ [`gunrock`](https://gunrock.github.io/docs/), a graph algorithm execution library using `gpu` acceleration. you can write `c++` or relatively inconvenient `python` code to perform graph calculations on **enormous** graphs. current state of the art
+ [`tensorflow`](https://www.tensorflow.org/), google's `python` `api` for graph calculation execution, which compiles down to various lower-level `gpu` libraries (as well as `tpu` libraries)
+ [`pytorch`](https://pytorch.org/), an open-source alternative (to `tensorflow`) framework for deep learning
+ [`apache mxnet`](https://mxnet.apache.org/), another open-source alternative (to `tensorflow`) framework for deep learning, emphasizing distributed computing in addition to `gpu` analytics

## actually using `gpu`s

let's start with the most basic question -- do *you* have a `gpu`? if so, what kind is it?

**<div align="center">exercise: determine if you have a `gpu` on your laptop</div>**

+ OS is `windows`
    + pre-10: [check your device manager](https://www.addictivetips.com/windows-tips/check-dedicated-gpu/) for a second "display adapter"
    + 10: you can [do the same as above](https://www.addictivetips.com/windows-tips/check-dedicated-gpu/), or you can get more info with [the directx diagnostic tool](https://www.techjunkie.com/check-graphics-card-windows-10/)
+ OS is `mac`
    + launch the "System information" app, select "Graphics/Displays", and [see what items you might have](https://s3.amazonaws.com/quantstart/media/images/qs-cuda-1-0004.png) (you probably only have one)
    + if you have a `gpu` with `bus` type "Built-In", you have what is called an integrated `gpu` -- it's built into the `cpu` itself. this isn't what we're looking for
+ OS is `*nix`
    + `lspci | grep ' VGA ' | cut -d" " -f 1 | xargs -i lspci -v -s {}`

+ did anyone have a dedicated `gpu`? was it an `nvidia` `gpu`?
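
if you would rather ask from `python`, one rough approach is to look for `nvidia-smi`, the command line tool that ships with the `nvidia` driver. this is only a sketch: it says nothing about non-`nvidia` cards, a missing tool doesn't strictly prove there is no card (the driver might just not be installed), and `list_nvidia_gpus` is a name made up for this example.

```python
import shutil
import subprocess


def list_nvidia_gpus():
    """return one line per nvidia gpu reported by nvidia-smi, or [] if none found"""
    # nvidia-smi ships with the nvidia driver; no binary on the PATH
    # usually means no (usable) nvidia gpu
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L lists the gpus attached to the system
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


gpus = list_nvidia_gpus()
print(gpus if gpus else "no nvidia gpu detected")
```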

most of us won't have a truly amazing dedicated `gpu` on our personal laptop. getting to the state of the art is not cheap! even a relatively straightforward laptop containing a number of current top-flight `gpu`s will cost [several thousand dollars](https://www.amazon.com/gp/offer-listing/B07D4B2VLP/ref=olp_twister_child?ie=UTF8&mv_size_name=0)

fortunately we don't need to buy a `gpu` machine outright -- we can just rent one from `aws`.

go to https://aws.amazon.com/ec2/instance-types/ and check out the "accelerated computing" options -- specifically types P2 and P3

the main differences are the type of card used, whether or not there is a "link", and the memory (system and `gpu`). just to compare the cards:

| `ec2` type | `gpu` card | `gpu` cores | tensor cores | tflops (double precision), higher is better |
|-|-|-|-|-|
| p2 | NVIDIA K80 | 2496 | 0 | 1.87 |
| p3 | NVIDIA Tesla V100 | 5120 | 640 | 7 |

so the newer `p3`s with the tesla v100 cards are a good deal better and have additional functionality in the form of tensor cores (built to accelerate specific common operations in deep learning algorithms)

*note: performance numbers taken from [here](https://www.microway.com/knowledge-center-articles/comparison-of-nvidia-geforce-gpus-and-nvidia-tesla-gpus/)*

given the different card qualities, the actual makeup and price of the instance types is

| Model | GPUs | GPU Mem (GiB) | vCPU | Mem (GiB) | link | on demand USD / hr | spot USD / hr | fractional savings |
|-|-|-|-|-|-|-|-|-|
| p2.xlarge | 1 | 12 | 4 | 61 | none | 0.9 | 0.27 | 0.70 |
| p3.2xlarge | 1 | 16 | 8 | 61 | none | 3.06 | 1.2784 | 0.58 |
| p2.8xlarge | 8 | 96 | 32 | 488 | none | 7.2 | 2.16 | 0.70 |
| p3.8xlarge | 4 | 64 | 32 | 244 | NVLink | 12.24 | 4.6116 | 0.62 |
| p2.16xlarge | 16 | 192 | 64 | 732 | none | 14.4 | 14.4 | 0.00 |
| p3.16xlarge | 8 | 128 | 64 | 488 | NVLink | 24.48 | 7.344 | 0.70 |
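
the "fractional savings" column is just `1 - spot / on demand`. the sketch below recomputes it for a few rows and prices out a job; the prices are copied from the table above (and will have drifted since), while the 8-hour job length is an arbitrary number picked for illustration.

```python
# (on demand USD / hr, spot USD / hr), copied from the table above
prices = {
    "p2.xlarge": (0.9, 0.27),
    "p3.2xlarge": (3.06, 1.2784),
    "p3.8xlarge": (12.24, 4.6116),
}


def fractional_savings(on_demand, spot):
    """fraction of the on-demand price saved by paying the spot price"""
    return 1 - spot / on_demand


def job_cost(hourly_rate, hours):
    """cost in USD of keeping one instance up for `hours` hours"""
    return hourly_rate * hours


for model, (on_demand, spot) in prices.items():
    savings = fractional_savings(on_demand, spot)
    # the 8-hour job length is an arbitrary example, not from the table
    print(model, round(savings, 2), round(job_cost(spot, 8), 2))
```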

we will spin these up, but given the cost we will be very selective about exactly when and why we do! and we will **shut them down** as soon as possible when we do!

## `tpu`s

google thinks `gpu`s are cool and all, but they kind of have their own thing going

back in 2013, google recognized that its demand for `gpu` calculations was growing too fast (in their blog, they say they predicted they would need to *double* their capacity, which is wild). the way they chose to address this was by designing their *own* processor -- an alternative to `cpu`s or `gpu`s -- which was not *accidentally* good at deep learning applications like a `gpu`, but was *intentionally* good at it.

they called their processor a `t`ensor `p`rocessing `u`nit, or `tpu`

this type of hardware designed to perform a very context-specific set of calculations is an `A`pplication-`S`pecific `I`ntegrated `C`ircuit, or [`ASIC`](https://en.wikipedia.org/wiki/Application-specific_integrated_circuit). in the case of `tpu`s, the entire unit has separate components, each of which does only one of the following:

1. matrix multiplication / convolution
1. aggregation of matrices
1. *activation* functions (this is an essential step in neural networks; we will cover in the deep learning lecture)
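
to see why those three steps cover so much of deep learning, here is one neural network layer written as exactly that sequence. this is a loose sketch in plain `python`, not a description of how a `tpu` is actually wired: reading "aggregation" as adding a bias to the matrix product is my interpretation, `relu` is just one common activation function, and all of the numbers are made up.

```python
def matmul(a, b):
    """1. matrix multiplication: lists of rows in, list of rows out"""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_bias(matrix, bias):
    """2. aggregation: combine the product with a bias row (an interpretation)"""
    return [[value + b for value, b in zip(row, bias)] for row in matrix]


def relu(matrix):
    """3. activation: relu, one common choice, zeroes out negative values"""
    return [[max(0.0, value) for value in row] for row in matrix]


# one record with two features, a layer with three nodes (made-up numbers)
inputs = [[1.0, 2.0]]
weights = [[0.5, -1.0, 0.0], [0.25, 0.5, -2.0]]
bias = [0.0, 0.5, 1.0]

outputs = relu(add_bias(matmul(inputs, weights), bias))
print(outputs)
```

stack a few of these layers and repeat for every batch and every epoch, and you have the nested loops from the start of the lecture.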

the results I'm about to show come from [a google blog post](https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu), so *caveat emptor*, but they are pretty compelling (and use some truly awesome units)

<br><div align="center"><img src="https://storage.googleapis.com/gweb-cloudblog-publish/images/tpu-3gpcs.max-1200x1200.PNG" width="800px"></div>

and also:

<br><div align="center"><img src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/tpu-6tlel.PNG" width="800px"></div>

for the interested, here are some useful links

+ https://kids.kiddle.co/Central_processing_unit (the eli5 version of CPUs, extremely helpful for starting out!)
+ https://www.geeksforgeeks.org/multithreading-python-set-1/
+ https://en.wikipedia.org/wiki/Parallel_computing
    + https://en.wikipedia.org/wiki/Central_processing_unit#Parallelism
    + https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)
+ https://scicomp.stackexchange.com/questions/943/what-kinds-of-problems-lend-themselves-well-to-gpu-computing
+ https://www.quora.com/What-kind-of-math-is-a-graphics-card-better-at-than-a-CPU
+ https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664