The tensor-based computation of exponentiation and logarithmic operations is much slower than using NumPy #143800

@yxma2015

🐛 Describe the bug

Hi there, hope this message finds you well.
I have encountered a significant performance issue: exponentiation (torch.exp()) and logarithm (torch.log()) on PyTorch tensors are much slower than their NumPy counterparts. I believe this is a real issue. All tests below ran on the CPU; I did not use a GPU.
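One way to look at the per-call cost in isolation is to time both libraries on a single scalar, which is how loss_5() calls them. The following is only a minimal sketch and not part of the original reproduction: the helper name `time_scalar_ops` and the iteration count are hypothetical, and the timings it returns will vary by machine.

```python
import time

import numpy as np
import torch


def time_scalar_ops(n_iter=10_000):
    """Time log(exp(x)) on one scalar with NumPy and with PyTorch (CPU)."""
    x_np = np.float32(0.5)
    x_t = torch.tensor(0.5)

    # NumPy: each call works on a single float32 scalar.
    start = time.perf_counter()
    for _ in range(n_iter):
        y_np = np.log(np.exp(x_np))
    t_np = time.perf_counter() - start

    # PyTorch: each call dispatches an operator on a 0-dim tensor.
    start = time.perf_counter()
    for _ in range(n_iter):
        y_t = torch.log(torch.exp(x_t))
    t_torch = time.perf_counter() - start

    return t_np, t_torch, float(y_np), float(y_t)


if __name__ == "__main__":
    t_np, t_torch, _, _ = time_scalar_ops()
    print(f"NumPy: {t_np:.4f}s  PyTorch: {t_torch:.4f}s")
```

This measures only call overhead on scalars. It does not include the autograd bookkeeping that loss_5() also incurs when the input is a parameter that requires gradients.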

The issue lies in the loss_5() function. On my machine, the NumPy implementation (loss_5_numpy() in the example below) took 23 seconds, whereas the PyTorch implementation (loss_5()) took 781 seconds.

# -*-coding:utf-8 -*-

import numpy as np
import tqdm
from sklearn.decomposition import NMF
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
import torch
import torch.nn as nn
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity
def init_graph(low_dim_x):
    n_spot = low_dim_x.shape[0]
    n_neighbor = 15
    init_W = cosine_similarity(low_dim_x)
    """cos_init = np.zeros((n_spot, n_spot))
    for i in range(n_spot):
        vec = init_W[i, :]
        distance = vec.argsort()[:: -1]
        for t in range(n_neighbor + 1):
            y = distance[t]
            cos_init[i, y] = init_W[i, y]"""
    return init_W
def spectral_clustering(x: np.ndarray, n_cluster: int) -> list:
    r"""Cluster the rows of x with spectral clustering.

    Args:
        x (np.ndarray): feature matrix $x \in R^{N \times D}$
        n_cluster (int): cluster number

    Returns:
        list: one list of 1-based sample indices per cluster
    """
    model = SpectralClustering(n_clusters=n_cluster,
                               assign_labels='discretize',
                               random_state=0).fit(x)
    labels = model.labels_
    partition = [[] for i in range(n_cluster)]
    for i in range(x.shape[0]):
        partition[labels[i]].append(i + 1)

    """grids = np.zeros((x.shape[0],x.shape[0]))
    for i in range(x.shape[0]):
        for j in range(x.shape[0]):
            if model.labels_[i] == model.labels_[j]:
                grids[i,j] = 1"""
    return partition
def get_laplace_matrix(x):
    #x = x + np.eye(x.shape[0])
    degree_matrix = np.zeros((x.shape[0], x.shape[0]))
    for i in range(x.shape[0]):
        degree_matrix[i, i] = sum(x[i, :])
    lap = degree_matrix - x
    #lap = lap + 0.01*np.eye(lap.shape[0])
    return lap
def nmf_ini(x: np.ndarray, rank: int) -> tuple:
    """Initialize the factors W and H from a truncated SVD of x.

    The NMF (non-negative matrix factorization) initialization is kept
    below as a disabled string for reference.

    Args:
        x (np.ndarray): matrix X to be factorized
        rank (int): number of components to keep

    Returns:
        tuple: (W, H) whose product approximates X
    """
    """model = NMF(n_components=dimension, init='random', random_state=0, max_iter=500)
    w = model.fit_transform(x)
    h = model.components_"""
    u, s, v = np.linalg.svd(x, full_matrices=False)
    w_ini = u[:,:rank]
    h_ini = np.diag(s[:rank])@v[:rank,:]

    return w_ini, h_ini

class MVFC(nn.Module):
    def __init__(self, parameters):
        super(MVFC, self).__init__()
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.gene_number = nn.Parameter(
            torch.tensor(parameters['gene_number']), requires_grad=False)
        self.spot_number = nn.Parameter(
            torch.tensor(parameters['spot_number']), requires_grad=False)
        self.feature_dimension = nn.Parameter(
            torch.tensor(parameters['feature_dimension']), requires_grad=False)

        self.alpha = nn.Parameter(
            torch.tensor(parameters['alpha']), requires_grad=False)
        self.beta = nn.Parameter(
            torch.tensor(parameters['beta']), requires_grad=False)
        self.gamma = nn.Parameter(
            torch.tensor(parameters['gamma']), requires_grad=False)
        self.eta = nn.Parameter(
            torch.tensor(parameters['eta']), requires_grad=False)
        self.epochs = nn.Parameter(
            torch.tensor(parameters['epochs']), requires_grad=False)


        self.base_spot = nn.Parameter(torch.rand((self.spot_number, self.feature_dimension),
                                                 dtype=torch.float32)
                                      )
        self.base_spot_g = nn.Parameter(torch.rand((self.gene_number, self.feature_dimension),
                                                 dtype=torch.float32))

        self.feature_fusion = nn.Parameter(torch.rand((self.feature_dimension,
                                                       self.spot_number),
                                                      dtype = torch.float32 ) )


        self.affinity_graph = nn.Parameter(torch.rand((self.spot_number,
                                                       self.spot_number),
                                           dtype=torch.float32))


    def objective_function(self,
                           w1,
                           w2,
                           lap_w2,
                           lap_w1):
        """

        Args:
            input:

        Returns:

        """

        loss_component = self.compute_loss(w1 = w1,
                                           w2 = w2,
                                           lap_w2 = lap_w2,lap_w1=lap_w1)
        return loss_component

    def initialize(self, w1,w2):
        print("model initializing...")
        with torch.no_grad():
            n_components = int(self.feature_dimension.detach())
            w, h = nmf_ini(w1.to("cpu").detach().numpy(),n_components)
            w = torch.from_numpy(w).float().to(self.device)
            h = torch.from_numpy(h).float().to(self.device)
            self.base_spot_g.data, self.feature_fusion.data = w, h

            w, h = nmf_ini(w2.to("cpu").detach().numpy(), n_components)
            w = torch.from_numpy(w).float().to(self.device)
            h = torch.from_numpy(h).float().to(self.device)
            self.base_spot.data, self.feature_fusion.data = w, h

            w1.to(self.device)
            w2.to(self.device)
        print("model initialized...")


    def compute_loss(self,w1,w2,lap_w2,lap_w1):
        # TODO
        loss = torch.zeros(6,dtype=torch.float32)
        # ST NMF
        loss[0] = self.loss_0(w1=w1)
        # spatial NMF
        loss[1] = self.loss_1(w2=w2)
        # penalty
        #loss[2] = self.loss_2()
        # lpp
        loss[3] = self.loss_3(lap_w2=lap_w2, lap_w1=lap_w1)
        # affinity graph
        loss[4] = self.loss_4()
        # contrastive loss
        loss[5] = self.loss_5(w2)
        return loss

    def loss_0(self,w1):
        return torch.norm(w1 - self.base_spot_g @ self.feature_fusion  )
    # self representation
    """def loss_0(self, w1):
        return torch.norm(w1 - w1 @ (self.feature_fusion + self.sr_gene))"""
    def loss_1(self,w2):
        return self.alpha * torch.norm(w2 - self.base_spot @ self.feature_fusion)
    def loss_2(self):
        return self.beta*torch.norm(self.affinity_graph,p=1)
    def loss_3(self, lap_w2, lap_w1):
        return self.gamma * torch.trace(self.feature_fusion @ lap_w2 @ self.feature_fusion.T)


    def loss_4(self):
        return self.eta * torch.norm(self.feature_fusion - self.feature_fusion @ self.affinity_graph)

    def loss_5(self,w2):
        contrastive_loss = 0
        for i in range(self.affinity_graph.shape[0]):
            denominator = torch.sum(
                torch.exp(self.affinity_graph[i,:])) - torch.exp(self.affinity_graph[i,i])
            for j in torch.where(w2 != 0)[0]:
                numerator = torch.exp(self.affinity_graph[i,j])
                contrastive_loss += -torch.log(numerator / denominator)
        return contrastive_loss
    def loss_5_numpy(self, w2):
        contrastive_loss = 0
        for i in range(self.affinity_graph.shape[0]):
            affinity = self.affinity_graph.to("cpu").detach().numpy()
            denominator = (np.sum(
                np.exp(affinity[i, :])) - np.exp(affinity[i, i]))
            for j in torch.where(w2 != 0)[0]:
                numerator = np.exp(affinity[i,j])
                contrastive_loss += -np.log(numerator / denominator)
        self.affinity_graph.to(self.device)
        return torch.tensor(contrastive_loss.astype(np.float32))







    def forward(self,w1,w2, lap_w2,lap_w1):

        self.feature_fusion.data = torch.nn.functional.relu(self.feature_fusion.data)
        self.base_spot_g.data = torch.nn.functional.relu(self.base_spot_g.data)
        self.base_spot.data = torch.nn.functional.relu(self.base_spot.data)
        self.affinity_graph.data = torch.nn.functional.relu(self.affinity_graph.data)
        self.affinity_graph.data =(self.affinity_graph.data + self.affinity_graph.data.T)/2

        return self.objective_function(w1,w2,lap_w2,lap_w1)

# test
def test(w1, w2, parameters):
    w1_cos = init_graph(w1.T)
    lap_w2 = get_laplace_matrix(w2).astype(np.float32)
    lap_w1 = get_laplace_matrix(w1_cos).astype(np.float32)

    model = MVFC(parameters=parameters)
    model.affinity_graph.data = torch.from_numpy(w1_cos.astype(np.float32))
    model = model.to(model.device)



    w1 = torch.from_numpy(w1)
    w2 = torch.from_numpy(w2)
    lap_w2 = torch.from_numpy(lap_w2)
    lap_w1 = torch.from_numpy(lap_w1)

    w1 = w1.to(model.device)
    w2 = w2.to(model.device)
    lap_w2 = lap_w2.to(model.device)
    lap_w1 = lap_w1.to(model.device)


    model.initialize(w1, w2)
    print("the model is built!")
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_history = np.zeros((model.epochs, 6))
    for k in range(model.epochs):
        optimizer.zero_grad()
        loss = model.forward(w1,w2,lap_w2,lap_w1)
        loss_history[k,:] = loss.detach().numpy()[:]
        loss = torch.sum(loss)
        print(f"\rEpoch {k + 1}'s loss is:{loss}",end=" ")
        #model.affinity_graph = nn.Parameter(torch.clamp(model.affinity_graph,min=0))
        """model.feature_fusion = nn.Parameter(torch.clamp(model.feature_fusion, min=0))
        model.sr_gene = nn.Parameter(torch.clamp(model.sr_gene, min=0))
        model.sr_spatial = nn.Parameter(torch.clamp(model.sr_spatial, min=0))"""
        loss.backward()
        optimizer.step()

    print("optimized end!")
    # clustering
    #partition = spectral_clustering(model.feature_fusion.detach().numpy(), 11)

    return (model.affinity_graph.to("cpu").detach().numpy(),
            model.feature_fusion.to("cpu").detach().numpy(),
            loss_history,
            model.base_spot_g.to("cpu").detach().numpy(),
            model.base_spot.to("cpu").detach().numpy())




w1 = np.random.normal(loc=1,scale=0.1,size=(20,100))
w2 = np.random.normal(loc=1,scale=0.1,size=(100,100))
parameters = {
    # Use the GPU only when CUDA is available (the condition was inverted).
    "device": "cuda:0" if torch.cuda.is_available() else "cpu",
    "gene_number": w1.shape[0],
    "feature_dimension": 10,
    "alpha": 0.8,
    "beta": 0.8,
    "gamma": 0.8,
    "eta": 0.8,
    "spot_number": w1.shape[1],
    "epochs": 10,
    "n_cluster":10

    }
import time
start = time.time()
test(w1, w2, parameters)
end = time.time()
print(end - start)
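
For comparison with loss_5() above, the double loop can be replaced by a few whole-tensor operations. This is a minimal sketch under stated assumptions, not the reporter's code: the function names `loss_5_loop` and `loss_5_vectorized` are hypothetical, the affinity matrix is passed as an argument instead of being read from self.affinity_graph, and w2 is assumed to have no more rows than the affinity matrix has columns. It uses the identity -log(exp(a_ij) / d_i) = log(d_i) - a_ij and counts how often each index j is produced by torch.where(w2 != 0)[0].

```python
import torch


def loss_5_loop(affinity, w2):
    """Reference version: same double loop as loss_5() in the report."""
    loss = torch.zeros(())
    rows = torch.where(w2 != 0)[0]  # row index of every nonzero entry of w2
    for i in range(affinity.shape[0]):
        denom = torch.exp(affinity[i, :]).sum() - torch.exp(affinity[i, i])
        for j in rows:
            loss = loss - torch.log(torch.exp(affinity[i, j]) / denom)
    return loss


def loss_5_vectorized(affinity, w2):
    """Same value computed with whole-tensor operations (assumed rewrite)."""
    n_cols = affinity.shape[1]
    rows = torch.where(w2 != 0)[0]
    # counts[j] = how many times index j appears in the inner loop
    counts = torch.bincount(rows, minlength=n_cols).to(affinity.dtype)
    exp_a = torch.exp(affinity)
    # d_i = sum_k exp(a_ik) - exp(a_ii)
    denom = exp_a.sum(dim=1) - torch.diagonal(exp_a)
    # sum_i sum_j counts[j] * (log d_i - a_ij)
    return counts.sum() * torch.log(denom).sum() - (affinity @ counts).sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    a = torch.rand(6, 6, requires_grad=True)
    w = torch.rand(6, 6)
    w[w < 0.5] = 0.0
    print(loss_5_loop(a, w).item(), loss_5_vectorized(a, w).item())
```

Both functions stay differentiable with respect to the affinity matrix, so the vectorized form could in principle be dropped into compute_loss(); whether it closes the gap to the NumPy timing on the reporter's machine has not been measured here.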

Versions

Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i7-13700
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 1
BogoMIPS: 4223.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 576 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 24 MiB (12 instances)
L3 cache: 30 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Vulnerable: No microcode
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] numpy-groupies==0.11.2
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] torch==2.5.1
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] triton==3.1.0
[conda] Could not collect

cc @msaroufim @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

Metadata

Assignees

No one assigned

    Labels

    module: cpu - CPU specific problem (e.g., perf, algorithm)
    module: performance - Issues related to performance, either of kernel code or framework glue
    needs reproduction - Ensure you have actionable steps to reproduce the issue. Someone else needs to confirm the repro.
    triaged - This issue has been looked at a team member, and triaged and prioritized into an appropriate module
