*Seminar in Computer Architecture*

# Seminar 1.1 - Introduction and Basics

<hr>

The ongoing goal is to build fundamentally better architectures from processors to accelerators and memories. 

There are **four key current directions**:

- security, reliability, safety
- energy efficiency (memory-centric/data-centric arch.)
- low-latency, predictability
- AI/ML, genomics, medicine, and health-specific archs.

Computer architecture is commonly understood as being restricted to the SW/HW interface and the micro-architecture. The expanded view covers every layer of the **transformation hierarchy**:

- *Problem*
    - **Algorithm**
    - **Program/Language**
    - **System Software**
    - **SW/HW interface**
    - **Micro-Architecture**
    - **Logic**
    - **Devices**
- *Electrons*

The expanded view helps understand the seamless working of machines.

### Axiom

1. To achieve the highest energy efficiency and performance: **We must take the expanded view of Computer Architecture**.

2. There is plenty of room for improvement at both the top and the bottom of the stack, and even more when the layers communicate with each other (cross-layer optimization).

This implies **co-design across the hierarchy** (algorithms to devices), which helps **specialize as much as possible within the design goals**.

<u>Examples:</u>
- data-centric arch. for low energy & high perf. (proc. in mem/DRAM, NVM, unified mem/storage)
- toleration of bit flips, secure memory
- ML/AI arch. (algorithm/arch. co-design) 
- data-aware arch.

<hr>

## 1 - Why Study Computer Architecture?

### The Why

Computer architecture is the science and art of **designing computing platforms** (hardware, interface, system SW, and programming model) to **achieve a set of design goals** (highest performance on some workloads, longest battery life at a given form factor, best average perf. across known workloads at the best perf./cost ratio).

Different platforms and different goals usually rely on the same architectural fundamentals.

More recently, new types of platforms have been created to handle workloads in some areas more efficiently, such as data processing in ML. One example is the systolic array ([WIKI](https://en.wikipedia.org/wiki/Systolic_array)): in parallel computer architectures, a **systolic array** is a homogeneous network of tightly coupled data processing units (DPUs) called cells or nodes. Each node or DPU independently computes a partial result as a function of the data received from its upstream neighbors, stores the result within itself, and passes it downstream.
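
The dataflow described above can be made concrete with a small cycle-by-cycle simulation. The sketch below assumes an output-stationary systolic array computing a matrix product: each cell keeps its own accumulator, receives one value from its left neighbor and one from its upper neighbor, and forwards both unchanged on the next cycle. The function name `systolic_matmul` and the skewed input schedule are illustrative choices, not taken from the seminar.

```python
def systolic_matmul(A, B):
    """Simulate an n x m grid of cells computing C = A @ B (A is n x k, B is k x m)."""
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # partial result stored inside each cell
    a_reg = [[0] * m for _ in range(n)]  # A value each cell forwards to the right
    b_reg = [[0] * m for _ in range(n)]  # B value each cell forwards downward

    for t in range(n + m + k - 2):       # cycles until the last operands meet
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                # Left edge: row i of A enters skewed by i cycles.
                if j == 0:
                    s = t - i
                    a_in = A[i][s] if 0 <= s < k else 0
                else:
                    a_in = a_reg[i][j - 1]   # from the left neighbor
                # Top edge: column j of B enters skewed by j cycles.
                if i == 0:
                    s = t - j
                    b_in = B[s][j] if 0 <= s < k else 0
                else:
                    b_in = b_reg[i - 1][j]   # from the upper neighbor
                acc[i][j] += a_in * b_in     # multiply-accumulate in place
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return acc
```

Every cell only ever talks to its direct neighbors, which is what keeps the wiring short and the array easy to scale.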

> The science and art of designing, selecting, and interconnecting hardware components and designing the HW/SW interface to create a computing system that meets functional, performance, energy consumption, costs, and other specific goals.
>
> **goal**: to enable better systems (faster, cheaper, smaller), new applications, and better solutions to problems

### Current paradigm shift

There are many difficult problems that motivate and cause the shift (e.g. data-intensive applications, power/energy/thermal constraints, complexity, technology scaling, memory bottleneck, security/privacy).

Changing problems, algorithms, and programs/languages, driven by users, **should inform low-level design** (runtime system (VM, OS, MM), ISA, etc.).

## 2 - Some Cross-Layer Designs

**Goal**: To allow the system to expose certain characteristics of the hardware to the upper layers of the transformation stack.

### eXpressive Memory (X-Mem) interfaces

![xmem](images/Xmem.png)

#### Heterogeneous-Reliability Memory

#### EDEN: Data-Aware Efficient DNN Inference

#### SMASH: SW/HW Indexing Acceleration

#### Virtual Block Interface

#### Intel Optane Persistent Memory

A type of non-volatile memory (more capacity can be integrated on the same stick than with DRAM).

#### Phase Change Memory

#### Cerebras' Wafer Scale Engine

Largest ML accelerator chip (400k cores, 1.2 trillion transistors, 46,225 mm^2) -- highly parallelized, high data consumption.

#### UPMEM Processing-in-DRAM Engine

The processing-in-DRAM engine consists of standard DIMM modules in which a large number of DPU processors are combined with DRAM chips. These modules replace standard DIMMs and provide a large amount of compute and memory bandwidth.

#### Samsung Function-in-Memory DRAM

## 3 - Processing in Memory: Two Approaches

### 3.1 Processing near Memory

> **Definition**: A compute unit is positioned closer to the memory arrays.

<u>Representative papers:</u>
- Specialized processing in memory: Scalable Processing-In-Memory Accelerator for Parallel Graph Processing, ISCA, 2015
- Simple processing in memory: PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture, ISCA, 2015
- Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks, ASPLOS, 2018

<u>Other examples:</u> accelerating GPU operation, accelerating linked data structures, accelerating runahead execution, FPGA-based processing near memory.

To enable more efficient processing, we first need to understand data movement bottlenecks. This is the purpose of the DAMOV analysis methodology and workloads (paper: "DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks").

### 3.2 Processing using Memory

> **Definition**: The memory itself is used to perform a computation.

<u>Examples:</u> in-DRAM processing (bits in the same lines can be combined to perform computation; SIMDRAM: A Framework for Bit-Serial; LISA: Increasing Connectivity in DRAM).
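
To illustrate what bit-serial computation on whole memory rows can look like, here is a minimal sketch. It assumes a vertical data layout in which row `i` holds bit `i` of every operand, and it models row-wide operations with Python's bitwise operators on integers. All names (`to_bit_rows`, `majority`, `bit_serial_add`) are hypothetical, and the sketch does not reproduce the actual SIMDRAM mechanism or any DRAM timing.

```python
def to_bit_rows(values, width):
    """Transpose integers into rows: row i packs bit i of every value (one column each)."""
    rows = []
    for i in range(width):
        row = 0
        for col, v in enumerate(values):
            row |= ((v >> i) & 1) << col
        rows.append(row)
    return rows


def from_bit_rows(rows, count):
    """Inverse of to_bit_rows: rebuild `count` integers from the bit rows."""
    return [
        sum(((row >> col) & 1) << i for i, row in enumerate(rows))
        for col in range(count)
    ]


def majority(a, b, c):
    """Row-wide majority of three rows, applied to every column at once."""
    return (a & b) | (a & c) | (b & c)


def bit_serial_add(rows_a, rows_b):
    """Add all columns in parallel, one bit position per step (a ripple-carry adder)."""
    carry = 0
    out = []
    for a, b in zip(rows_a, rows_b):
        out.append(a ^ b ^ carry)      # sum bit for every column
        carry = majority(a, b, carry)  # carry bit for every column
    out.append(carry)                  # final carry becomes the top result bit
    return out
```

The number of steps depends on the operand width, not on the number of operands: every column of the row is processed in the same step.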

Prominent in AI with Tensor Processing Units (TPU).

## 4 - Reliability and Security

### RowHammer

The <u>DRAM RowHammer vulnerability</u> (predictable induction of bit flips in commodity DRAM chips) has to be addressed. It was the **first example of how a simple hardware failure mechanism can create a widespread system security vulnerability**.

> **RowHammer**: the induction of bit flips in the rows neighboring a memory row that is repeatedly activated ('hammered'). It is a type of **disturbance error**.
>
> **Consequence**: one can take over an otherwise-secure system (privilege escalation)

Newer DRAM chips are more vulnerable to RowHammer. There are chips today whose weakest cells fail after only 4,800 hammers. Chips of newer DRAM technology nodes can exhibit RowHammer bit flips 1) in **more rows** and 2) **farther away from the victim row**.
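
The effect can be pictured with a deliberately simplified model. The sketch below assumes that every activation of an aggressor row adds one unit of disturbance to each row within a `blast_radius`, and that a row counts as flipped once its disturbance reaches a threshold. The class `ToyDram`, its parameters, and the equal weighting of nearer and farther rows are assumptions made for illustration; real cells fail probabilistically and depend on data patterns and timing.

```python
class ToyDram:
    """Toy disturbance model: not a description of real DRAM physics."""

    def __init__(self, num_rows, hammer_threshold, blast_radius=1):
        self.num_rows = num_rows
        self.hammer_threshold = hammer_threshold  # hammers before a row flips
        self.blast_radius = blast_radius          # how far the disturbance reaches
        self.disturbance = [0] * num_rows
        self.flipped_rows = set()

    def activate(self, aggressor):
        """Activate one row and disturb its neighbors on both sides."""
        for distance in range(1, self.blast_radius + 1):
            for victim in (aggressor - distance, aggressor + distance):
                if 0 <= victim < self.num_rows:
                    self.disturbance[victim] += 1
                    if self.disturbance[victim] >= self.hammer_threshold:
                        self.flipped_rows.add(victim)

    def refresh(self):
        """Refresh restores cell charge, so accumulated disturbance is cleared."""
        self.disturbance = [0] * self.num_rows
```

With `hammer_threshold=4800` (the figure quoted above), hammering one row between two refreshes is enough to flip both of its neighbors in this model; a larger `blast_radius` mimics newer chips in which more distant rows are affected as well.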

Existing mitigation mechanisms are *not* effective, though newer proposals such as BlockHammer exist (it blocks rapidly accessed rows).
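
The idea of blocking rapidly accessed rows can be sketched as a simple rate limiter. The code below is not the published BlockHammer design: it swaps in a plain per-row counter that is cleared once per window, and the class name `RowThrottle` and the parameter `max_activations_per_window` are hypothetical.

```python
from collections import Counter


class RowThrottle:
    """Refuse activations of a row once it has been activated too often in a window."""

    def __init__(self, max_activations_per_window):
        self.limit = max_activations_per_window
        self.counts = Counter()  # activations per row in the current window

    def request(self, row):
        """Return True if the activation may proceed, False if it is blocked."""
        if self.counts[row] >= self.limit:
            return False
        self.counts[row] += 1
        return True

    def new_window(self):
        """Called once per window (e.g. per refresh interval) to reset the counts."""
        self.counts.clear()
```

If the limit is set below the number of hammers needed to flip a bit, no row can be hammered enough within one window, while rows that are accessed at a normal rate are never delayed.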

### Meltdown and Spectre

Someone can steal secret data from the system even though:
- the program and data are perfectly correct
- the hardware behaves according to the specification
- there are no software vulnerabilities/bugs

Speculative execution leaves traces of secret data in the processor's cache (internal storage). A malicious program can **inspect the contents of the cache to infer secret data** that is not supposed to be accessed. It can also force another program to speculatively execute code that leaves traces of secret data.
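
The inference step can be shown with a purely simulated cache. The sketch below assumes a toy cache that reports a fixed latency for hits and misses, and a victim whose speculative access touches the cache line indexed by a secret byte. Nothing here measures real hardware timing; `ToyCache`, the latency constants, and both functions are hypothetical names used for illustration.

```python
HIT_LATENCY = 1     # assumed latency of a cached access (arbitrary units)
MISS_LATENCY = 100  # assumed latency of an uncached access (arbitrary units)


class ToyCache:
    def __init__(self):
        self.lines = set()

    def flush(self):
        """Evict everything, so that any later hit reveals a recent access."""
        self.lines.clear()

    def access(self, line):
        """Return the simulated latency and bring the line into the cache."""
        latency = HIT_LATENCY if line in self.lines else MISS_LATENCY
        self.lines.add(line)
        return latency


def victim_speculative_access(cache, secret_byte):
    """Models a speculative load whose address depends on the secret.
    The architectural result is discarded, but the cache line stays."""
    cache.access(secret_byte)


def attacker_recover_byte(cache, trigger_victim):
    """Flush, let the victim run, then time every candidate line."""
    cache.flush()
    trigger_victim()
    latencies = {line: cache.access(line) for line in range(256)}
    return min(latencies, key=latencies.get)  # the only fast line is the secret
```

The attacker never reads the secret directly: the value is recovered only from which line is fast, which is why correct software and specification-compliant hardware do not prevent the leak.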

## 5 - More Demanding Workloads

Applications are straining architectures. 

<u>Example:</u> Genome analysis
- No machine can read a full genome in one pass: current sequencing machines produce small, randomly located fragments (reads) of the original DNA sequence. The challenge is read mapping, i.e. locating each read in a reference sequence in order to rebuild the full sequence. **Analysis is bottlenecked in read mapping.**
- Example of solution: Nanopore Sequencing Technology
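
To give a feel for why read mapping is expensive, here is a minimal seed-and-verify sketch. It assumes a reference string, an index of all its substrings of length `k` (k-mers), and a Hamming-distance check for each candidate position. The function names and the `max_mismatches` parameter are illustrative assumptions; production mappers also handle insertions, deletions, and far larger indexes.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) of the best placement of the read, or None."""
    best = None
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):      # candidate locations for this seed
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best
```

Each read triggers several index lookups followed by a character-by-character comparison, and this work is repeated for every read of a sequencing run, which is why the step tends to be dominated by data movement and comparisons.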

## Evolutions

New computing paradigms (processing in/near memory, neuromorphic computing), new accelerators (AI/ML, graph analytics, genome analysis), new memories and storage systems (non-volatile main memory).