This repository was archived by the owner on Nov 15, 2022. It is now read-only.

Regarding the performance of tensorwise #17

@justanhduc

Description

I found out that tensorwise actually just runs a Python for loop over the constituents of the nested tensors. I benchmarked tensorwise against map, a list comprehension, and an explicit for loop. (Un)surprisingly, tensorwise performs much slower than all of them. Here is the benchmark:
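To make the claim concrete, here is a rough sketch of what a tensorwise-style decorator amounts to under this reading. The names are illustrative, not the actual nestedtensor internals: the point is only that the wrapped function is called once per constituent in a Python-level loop, so per-call overhead dominates when there are many small tensors.

```python
def tensorwise_sketch(fn):
    """Illustrative stand-in for nt.tensorwise: apply fn pairwise
    over the constituents of two nested containers."""
    def wrapper(a, b):
        # One Python-level call of `fn` per constituent pair --
        # nothing is batched or fused across the loop.
        return [fn(a_, b_) for a_, b_ in zip(a, b)]
    return wrapper


@tensorwise_sketch
def add(x, y):
    return x + y
```

With this sketch, `add([1, 2], [3, 4])` returns `[4, 6]`: each pair is handled by a separate interpreted call, which is exactly the overhead the benchmark below measures.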

import torch as T
import nestedtensor as nt

crit = lambda x, y: T.mean((x - y) ** 2)


@nt.tensorwise()
def loss_nt(a, b):
    return crit(a, b)


def loss_map(a, b):
    return sum(map(crit, a, b)) / len(a)


def loss_for(a, b):
    return sum([crit(a_, b_) for a_, b_ in zip(a, b)]) / len(a)


def loss_expfor(a, b):
    loss = []
    for a_, b_ in zip(a, b):
        loss.append(crit(a_, b_))
    return sum(loss) / len(loss)


p1 = T.arange(64 * 5000 * 3).cuda().view(64, 5000, 3).float()
p2 = T.arange(64 * 5000 * 3).cuda().view(64, 5000, 3).float()

p1_list = list(p1[:, None])
p2_list = list(p2[:, None])

p1_nt = nt.as_nested_tensor(p1_list).cuda()
p2_nt = nt.as_nested_tensor(p2_list).cuda()

start = T.cuda.Event(enable_timing=True)
end = T.cuda.Event(enable_timing=True)

for i in range(100):
    start.record()
    loss_nt(p1_nt, p2_nt)
    end.record()
    T.cuda.synchronize()
    total_nt = start.elapsed_time(end)

    start.record()
    loss_map(p1_list, p2_list)
    end.record()
    T.cuda.synchronize()
    total_map = start.elapsed_time(end)

    start.record()
    loss_for(p1_list, p2_list)
    end.record()
    T.cuda.synchronize()
    total_for = start.elapsed_time(end)

    start.record()
    crit(p1, p2)  # batched baseline: a single fused op over the dense tensors
    end.record()
    T.cuda.synchronize()
    total = start.elapsed_time(end)

    start.record()
    loss_expfor(p1_list, p2_list)
    end.record()
    T.cuda.synchronize()
    total_expfor = start.elapsed_time(end)

    print(i, total_nt, total_map, total_for, total_expfor, total)

Is this because tensorwise is not implemented in C++ yet?
If the current implementation of tensorwise is final, then I wonder whether tensorwise is intended only for convenience rather than for performance?
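For comparison, when every constituent has the same shape (as in the benchmark above, where each slice is (1, 5000, 3)), the per-constituent loop can be replaced by one batched reduction. A minimal sketch of that equivalence, shown with NumPy here purely for illustration; the torch analogue is the `crit(p1, p2)` baseline already in the benchmark:

```python
import numpy as np


def loss_loop(a, b):
    # Per-constituent MSE, averaged over constituents -- the pattern
    # that loss_map / loss_for / a tensorwise-style loop all follow.
    return sum(((x - y) ** 2).mean() for x, y in zip(a, b)) / len(a)


def loss_batched(a, b):
    # One fused reduction over the stacked batch. Because every
    # constituent has the same number of elements, the overall mean
    # equals the average of the per-constituent means.
    return ((np.stack(a) - np.stack(b)) ** 2).mean()
```

For equal-shaped constituents the two agree numerically, but `loss_batched` launches one kernel instead of `len(a)` of them, which is where the gap in the timings above comes from.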

Metadata

Labels: perf (performance related issues)