# Moore's Law - Machine Learning Research Edition
> A post about how Moore's law is driving machine learning research

- toc: false 
- badges: false
- comments: false
- categories: [jupyter]
- image: images/gpu.jpg
- author: Mathias Lechner

OpenAI recently released a [blog post](https://openai.com/blog/ai-and-efficiency/) showing that advances in algorithmic efficiency for training neural nets have outpaced Moore's law. 
In particular, the number of transistors in silicon chips [doubles every two years](https://newsroom.intel.com/wp-content/uploads/sites/11/2018/05/moores-law-electronics.pdf), whereas the efficiency of training a neural net to a given accuracy level doubles every 16 to 17 months. 
While this progress is impressive, I argue that Moore's law is itself one of the main drivers of machine learning research. Thus, the advances in neural network training efficiency stand on the shoulders of Moore's law.
In essence, I am stating the following law:

> The amount of hyperparameter & architecture tuning that can be done for a fixed budget **doubles** every two years
> 
> *Moore's Law - Machine Learning Research Edition*


By *budget* I mean time, money and compute resources.
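
As a rough illustration, the law can be written as an exponential in time. The sketch below assumes a clean doubling every two years; the function names `tuning_capacity` and `annual_growth` and the baseline of 100 runs are made up for this example. It also converts the two doubling periods quoted above (24 months for transistors, roughly 16 months for algorithmic efficiency) into growth factors per year.

```python
def tuning_capacity(years_elapsed, base_runs=100.0, doubling_years=2.0):
    """Tuning runs affordable on a fixed budget after `years_elapsed` years.

    Assumes capacity doubles every `doubling_years` years (the law above).
    `base_runs` is a hypothetical number of runs affordable today.
    """
    return base_runs * 2.0 ** (years_elapsed / doubling_years)


def annual_growth(doubling_months):
    """Growth factor per year implied by a doubling period given in months."""
    return 2.0 ** (12.0 / doubling_months)


for years in (0, 2, 4, 10):
    print(f"after {years:2d} years: {tuning_capacity(years):7.0f} runs")

# Doubling periods quoted above: 24 months (Moore's law)
# and roughly 16 months (algorithmic efficiency).
print(f"hardware:   x{annual_growth(24):.2f} per year")
print(f"algorithms: x{annual_growth(16):.2f} per year")
```

The point of the sketch is only that a fixed budget compounds: under the assumed two-year doubling, the same money buys 32 times as many tuning runs after a decade.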

## Machine Learning Research
If we look at the methodology of machine learning research, we notice an iterative paradigm composed of the following steps:

1. We have an idea.
2. We test the idea.
3. Based on the results, we refine and improve the idea.

Once this iterative process yields noteworthy results, the idea and corresponding test results get distilled into a research paper.

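Let's say we want to speed up our research. 
Common sense tells us that we can speed up any process by getting rid of its bottlenecks.
The dominant bottleneck in the procedure above is step 2: testing an idea means training and evaluating models, which usually takes far longer than coming up with the idea or interpreting the results.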
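
To make the bottleneck concrete, here is a minimal sketch of the loop above as a random search over a single hyperparameter under a fixed compute budget. Everything in it is an assumption for illustration: the toy loss function, the budget units, and the names `toy_validation_loss` and `random_search`. Random search also does not refine ideas adaptively, so step 3 is reduced to remembering the best result so far.

```python
import math
import random


def toy_validation_loss(log10_lr):
    """Stand-in for step 2: a made-up loss with its minimum at lr = 1e-3."""
    return (log10_lr + 3.0) ** 2


def random_search(compute_budget, cost_per_trial, seed=0):
    """Run as many random trials as the budget allows.

    Returns (number of trials, best loss found).
    """
    rng = random.Random(seed)
    n_trials = int(compute_budget // cost_per_trial)
    best = math.inf
    for _ in range(n_trials):
        log10_lr = rng.uniform(-6.0, 0.0)      # step 1: propose an idea
        loss = toy_validation_loss(log10_lr)   # step 2: test it (the expensive part)
        best = min(best, loss)                 # step 3: keep the best result so far
    return n_trials, best


# Same budget, but hardware that is twice as fast halves the cost per trial.
trials_now, best_now = random_search(compute_budget=100.0, cost_per_trial=10.0)
trials_later, best_later = random_search(compute_budget=100.0, cost_per_trial=5.0)
print(trials_now, trials_later)
```

Because both calls use the same seed, the second search repeats the first ten trials and then adds ten more, so its best loss can only be equal or lower.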

We could run more machine learning experiments by simply buying more, or faster, compute units.
But what if our budget is limited? How can we speed up our research then?

The answer is simple: wait.
Yes! Moore's law tells us that every two years, we get roughly twice the compute performance for the same budget.
For instance, here is a plot of how the 32-bit floating-point performance of Nvidia GPUs has increased over the past decade:

In [5]:
#hide
import pandas as pd
import altair as alt
# Peak float32 throughput (Teraflop/s) of selected Nvidia GPUs.
# Built from a list of records, since DataFrame.append was removed in pandas 2.0.
columns = ['GPU', 'Year', 'Gen', 'Memory', 'Compute', 'Tensor cores', 'Node']
records = [
    ('GTX 580',     pd.Timestamp(2010, 11, 1), 'Fermi',   '1.5GB',   1.581, False, '40nm'),
    ('GTX 680',     pd.Timestamp(2012, 2, 1),  'Kepler',  '2GB',     3.250, False, '28nm'),
    ('K40',         pd.Timestamp(2013, 10, 1), 'Kepler',  '12GB',    5.046, False, '28nm'),
    ('Titan Black', pd.Timestamp(2014, 2, 1),  'Kepler',  '6GB',     5.645, False, '28nm'),
    ('K80',         pd.Timestamp(2014, 11, 1), 'Kepler',  '2x12GB',  8.226, False, '28nm'),
    ('GTX 980 Ti',  pd.Timestamp(2015, 6, 1),  'Maxwell', '6GB',     6.060, False, '28nm'),
    ('M40',         pd.Timestamp(2015, 11, 1), 'Maxwell', '12GB',    6.844, False, '28nm'),
    ('GTX 1080',    pd.Timestamp(2016, 5, 1),  'Pascal',  '8GB',     8.873, False, '16nm'),
    ('P100',        pd.Timestamp(2016, 4, 1),  'Pascal',  '16GB',    10.61, False, '16nm'),
    ('GTX 1080Ti',  pd.Timestamp(2017, 3, 1),  'Pascal',  '11GB',    11.34, False, '16nm'),
    ('Titan V',     pd.Timestamp(2017, 12, 1), 'Volta',   '12GB',    14.90, True,  '12nm'),
    ('V100',        pd.Timestamp(2018, 3, 1),  'Volta',   '16/32GB', 14.13, True,  '12nm'),
    ('T4',          pd.Timestamp(2018, 9, 13), 'Turing',  '16GB',    8.141, True,  '12nm'),
    ('RTX 2080 Ti', pd.Timestamp(2018, 11, 1), 'Turing',  '11GB',    13.45, True,  '12nm'),
    ('Titan RTX',   pd.Timestamp(2018, 12, 1), 'Turing',  '24GB',    16.31, True,  '12nm'),
    ('A100',        pd.Timestamp(2020, 5, 1),  'Ampere',  '40GB',    19.49, True,  '7nm'),
]
df = pd.DataFrame(records, columns=columns)

 

df

| | GPU | Year | Gen | Memory | Compute | Tensor cores | Node |
|---|---|---|---|---|---|---|---|
| 0 | GTX 580 | 2010-11-01 | Fermi | 1.5GB | 1.581 | False | 40nm |
| 1 | GTX 680 | 2012-02-01 | Kepler | 2GB | 3.25 | False | 28nm |
| 2 | K40 | 2013-10-01 | Kepler | 12GB | 5.046 | False | 28nm |
| 3 | Titan Black | 2014-02-01 | Kepler | 6GB | 5.645 | False | 28nm |
| 4 | K80 | 2014-11-01 | Kepler | 2x12GB | 8.226 | False | 28nm |
| 5 | GTX 980 Ti | 2015-06-01 | Maxwell | 6GB | 6.06 | False | 28nm |
| 6 | M40 | 2015-11-01 | Maxwell | 12GB | 6.844 | False | 28nm |
| 7 | GTX 1080 | 2016-05-01 | Pascal | 8GB | 8.873 | False | 16nm |
| 8 | P100 | 2016-04-01 | Pascal | 16GB | 10.61 | False | 16nm |
| 9 | GTX 1080Ti | 2017-03-01 | Pascal | 11GB | 11.34 | False | 16nm |

In [4]:
#hide_input
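# Scatter plot: release date vs. peak float32 throughput, colored by GPU generation
points = alt.Chart(df).mark_circle(size=60).encode(
    x=alt.X('Year:T', scale=alt.Scale(domain=(pd.Timestamp(2010,1,1), pd.Timestamp(2021,1,1)))),
    y=alt.Y('Compute', axis=alt.Axis(title="float32 Teraflop/s")),
    color='Gen',
    tooltip=['Compute', 'Node', 'Gen', 'Memory', 'Tensor cores']
).properties(
    width=600,
    height=400
)

# Label each point with the GPU name, offset slightly to the right of the marker
text = points.mark_text(
    align='left',
    baseline='middle',
    dx=7
).encode(
    text='GPU'
)

# Layer the labels on top of the points
points + text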

This chart does not even include the gains from mixed-precision methods and other tricks that achieve higher performance by sacrificing arithmetic precision. For instance, Nvidia's latest A100 can reach 156 Teraflop/s when using the slightly less precise [TensorFloat32](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/) format.
If we compare the TensorFloat32 throughput of the A100 to the float32 throughput of the GTX 580 used by Alex Krizhevsky to train [AlexNet](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf), we see a roughly 100x jump in compute performance in a little under ten years.
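
The arithmetic behind that comparison can be checked with a few lines. This is a sketch that only looks at two endpoints (the GTX 580 and the A100, with the release dates and throughput figures used above), so it says nothing about price or about the GPUs in between; the helper name `doubling_time_years` is made up for this example.

```python
import math
from datetime import date


def doubling_time_years(start_value, end_value, start_date, end_date):
    """Average doubling time implied by two (date, value) endpoints."""
    years = (end_date - start_date).days / 365.25
    doublings = math.log2(end_value / start_value)
    return years / doublings


gtx580_fp32 = 1.581   # Teraflop/s, float32, from the data above
a100_fp32 = 19.49     # Teraflop/s, float32, from the data above
a100_tf32 = 156.0     # Teraflop/s, TensorFloat32 figure quoted above

start, end = date(2010, 11, 1), date(2020, 5, 1)

speedup_tf32 = a100_tf32 / gtx580_fp32
fp32_doubling = doubling_time_years(gtx580_fp32, a100_fp32, start, end)
tf32_doubling = doubling_time_years(gtx580_fp32, a100_tf32, start, end)

print(f"speedup with TF32:     {speedup_tf32:.0f}x")
print(f"float32 doubling time: {fp32_doubling:.1f} years")
print(f"TF32 doubling time:    {tf32_doubling:.1f} years")
```

With these two endpoints, plain float32 throughput doubles roughly every two and a half years, while counting the reduced-precision TensorFloat32 format brings the doubling time down to about a year and a half.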


## But what about larger models and datasets?

Of course, the statement about the doubling of hyperparameter & architecture tuning assumes that the size of the datasets and of the networks does not change dramatically.
However, given that ImageNet is still the de facto standard computer vision benchmark (special credit to Prof. Fei-Fei Li), and that the top-performing networks in 2020 ([EfficientNet](https://arxiv.org/pdf/1905.11946.pdf)) are smaller than their 2016 counterparts (ResNet/[ResNeXt](https://arxiv.org/pdf/1611.05431.pdf)), this assumption holds, at least for image classification.