This repository has been archived by the owner on Aug 15, 2019. It is now read-only.

Add block matmul #1212

Merged 13 commits into tensorflow:master on Aug 9, 2018

Conversation

@aman-tiwari (Contributor) commented Aug 6, 2018

Description

Currently: ~1.4x speedup on a 512x512 matrix-matrix matmul.

In response to tensorflow/tfjs#582


For repository owners only:

Please remember to apply all applicable tags to your pull request.
Tags: FEATURE, BREAKING, BUG, PERF, DEV, DOC, SECURITY

For more info see: https://github.com/tensorflow/tfjs/blob/master/DEVELOPMENT.md


This change is Reviewable

@aman-tiwari (Contributor, Author)
This is the benchmarking code (removed from the PR):

  it('benchmark matmul sq matrix', async done => {
    const backend = tf.ENV.backend as MathBackendCPU;
    const bs = [32, 48, 64, (64 / 2) + 64, 128, (128 / 2) + 128];
    const ns = [64, 128, 192, 256, 239, 398, 512];
    const RUNS = 20;
    for (const n of ns) {
      const a = tf.randomUniform([n, n]) as tf.Tensor2D;
      const b = tf.randomUniform([n, n]) as tf.Tensor2D;
      // Warmup.
      backend.matMulNaive(a, b, false, false).dataSync();

      let res: tf.Tensor|null = null;
      const start = now();
      for (let i = 0; i < RUNS; i++) {
        res = backend.matMulNaive(a, b, false, false);
      }
      res!.dataSync();
      const naiveTime = (now() - start) / RUNS;
      console.log(`N: ${n}\t ${naiveTime.toFixed(2)}ms`);

      for (const blockSize of bs) {
        backend.blockSize = blockSize;
        const a = tf.randomUniform([n, n]) as tf.Tensor2D;
        const b = tf.randomUniform([n, n]) as tf.Tensor2D;
        // Warmup.
        backend.matMul(a, b, false, false).dataSync();

        let res: tf.Tensor|null = null;
        const start = now();
        for (let i = 0; i < RUNS; i++) {
          res = backend.matMul(a, b, false, false);
        }
        res!.dataSync();
        const elapsed = (now() - start) / RUNS;
        const speedup = (naiveTime / elapsed).toFixed(2);
        console.log(
            `mul BS: ${blockSize}\t ${elapsed.toFixed(2)} ms\t speedup: ${
                speedup}x\t diff:${(elapsed - naiveTime).toFixed(2)}ms`);
        await tf.nextFrame();
      }
    }
    done();
  });
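For readers unfamiliar with the technique being benchmarked: a cache-blocked ("tiled") matmul walks the matrices in blockSize x blockSize tiles so each tile of the inputs and output stays cache-resident while its partial products accumulate, instead of streaming whole rows and columns through cache on every inner product. A minimal standalone sketch follows; the function name, flat row-major Float32Array layout, and loop order are illustrative assumptions, not the PR's actual MathBackendCPU code.

```typescript
// Sketch of a cache-blocked matmul for square n x n matrices stored
// flat in row-major order. Hypothetical helper, for illustration only.
function matMulBlocked(
    a: Float32Array, b: Float32Array, n: number,
    blockSize = 48): Float32Array {
  const out = new Float32Array(n * n);  // zero-initialized accumulator
  // Tile all three loops; each (i0, k0, j0) triple processes one
  // blockSize x blockSize tile of a, b, and out.
  for (let i0 = 0; i0 < n; i0 += blockSize) {
    for (let k0 = 0; k0 < n; k0 += blockSize) {
      for (let j0 = 0; j0 < n; j0 += blockSize) {
        const iMax = Math.min(i0 + blockSize, n);
        const kMax = Math.min(k0 + blockSize, n);
        const jMax = Math.min(j0 + blockSize, n);
        for (let i = i0; i < iMax; i++) {
          for (let k = k0; k < kMax; k++) {
            const aik = a[i * n + k];  // hoisted out of the j loop
            for (let j = j0; j < jMax; j++) {
              out[i * n + j] += aik * b[k * n + j];
            }
          }
        }
      }
    }
  }
  return out;
}
```

The benchmark above sweeps `backend.blockSize` over several candidates precisely because the best tile size depends on the cache hierarchy; per the discussion below, 48 came out on top.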

@dsmilkov (Contributor) left a comment

Reviewed 1 of 3 files at r1, 1 of 2 files at r2, 1 of 1 files at r3, 1 of 1 files at r4.
Reviewable status: 0 of 1 approvals obtained

@aman-tiwari (Contributor, Author)

Benchmarking across matrix sizes, we found the best block size for the cache-blocked matrix multiply to be 48. We pay a 0.5-1 ms penalty on small matrices but gain hundreds of milliseconds on large ones.

@dsmilkov dsmilkov changed the title WIP: Add block matmul Add block matmul Aug 9, 2018
@dsmilkov (Contributor) left a comment

Reviewed 1 of 1 files at r5.
Reviewable status: :shipit: complete! 1 of 1 approvals obtained

@dsmilkov dsmilkov merged commit dbb1b81 into tensorflow:master Aug 9, 2018