## Benchmarking Fermi Microarchitecture

Tristan Overney, Clément Humbert November 7, 2014

#### 1 Goals

The goal of this research is to expose details of the microarchitecture implemented by Nvidia Fermi cards, such as pipeline length, instruction latencies, and scheduling patterns.

#### 2 Methods

To achieve our goals, we used a series of specially crafted CUDA kernels. These usually contain large batches of dependent instructions, which we time with the clock64() function provided by CUDA.
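As an illustration of this approach, the following is a minimal sketch of such a timing kernel, not the kernels actually used for the measurements. The kernel name `timeIntMul`, the batch size `kNumMults`, the launch configuration, and the output buffers are all assumptions made for the example. Each thread reads clock64() before and after a chain of integer multiplications in which every operation depends on the previous result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical batch size: dependent multiplications per thread.
constexpr int kNumMults = 1024;
// Hypothetical launch configuration: a single block, so that all
// threads run on one SM and read the same cycle counter.
constexpr int kThreads = 64;

__global__ void timeIntMul(int multiplier, long long *start,
                           long long *end, int *sink)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int x = tid + 1;

    long long t0 = clock64();
    // Dependent chain: each multiplication consumes the previous result.
    #pragma unroll 16
    for (int i = 0; i < kNumMults; ++i) {
        x *= multiplier;
    }
    long long t1 = clock64();

    start[tid] = t0;
    end[tid]   = t1;
    sink[tid]  = x;  // store the result so the chain is not optimized away
}

int main()
{
    long long *dStart, *dEnd;
    int *dSink;
    cudaMalloc(&dStart, kThreads * sizeof(long long));
    cudaMalloc(&dEnd,   kThreads * sizeof(long long));
    cudaMalloc(&dSink,  kThreads * sizeof(int));

    // The multiplier is a runtime argument so the compiler cannot fold it.
    timeIntMul<<<1, kThreads>>>(3, dStart, dEnd, dSink);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    long long hStart[kThreads], hEnd[kThreads];
    cudaMemcpy(hStart, dStart, sizeof(hStart), cudaMemcpyDeviceToHost);
    cudaMemcpy(hEnd,   dEnd,   sizeof(hEnd),   cudaMemcpyDeviceToHost);

    // Per-thread start, end, total cycles, and cycles per multiplication.
    for (int i = 0; i < kThreads; ++i) {
        long long total = hEnd[i] - hStart[i];
        printf("%d %lld %lld %lld %.2f\n", i, hStart[i], hEnd[i], total,
               (double)total / kNumMults);
    }

    cudaFree(dStart);
    cudaFree(dEnd);
    cudaFree(dSink);
    return 0;
}
```

In a sketch like this the compiler is free to move or merge instructions around the clock reads, so it is worth inspecting the generated machine code (for example with `cuobjdump -sass`) before trusting the numbers. The measured interval also includes loop overhead and the cost of the clock reads themselves.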

The benchmark programs were run on a machine equipped with an Nvidia GeForce GTX 580.

### 3 Integer multiplication benchmarking

This section contains the results obtained with the methods described above, using large batches of integer multiplications.

#### 3.1 Integer multiplication: thread start times



#### 3.2 Integer multiplication: thread end times against thread IDs



#### 3.3 Total thread running time against thread IDs



#### 3.4 Thread running times divided by number of multiplications



#### 4 Interpretation

#### 5 Conclusion