# ARM Assembly Language Programming

**Peng-Sheng Chen** 

Fall, 2017

#### Introduction

- The ARM processor is easy to program at the assembly level
- In this study, we will
  - Look at ARM assembly language programming at the user level
  - See how to write simple programs which will run on an ARM development board / ARM emulator

#### **Outline**

- Data processing instructions
- Data transfer instructions
- Control flow instructions
- Writing simple assembly language programs

#### Data processing instructions

- Enable the programmer to perform arithmetic and logical operations on data values in registers
- Rules that apply
  - All operands are 32 bits wide and come from registers or are specified as literals in the instruction itself
  - The result, if there is one, is 32 bits wide and is placed in a register
    - (An exception: long multiply instructions produce a 64-bit result)
  - Each of the operand registers and the result register is independently specified in the instruction
    - (That is, the ARM uses a '3-address' format for these instructions)

#### Simple Register Operands

The semicolon here indicates that everything to the right of it is a comment and should be ignored by the assembler

#### **Arithmetic Operations**

- These instructions perform binary arithmetic on two 32-bit operands
- The carry-in, when used, is the current value of the C bit in the CPSR

| Instruction | Operands | Effect |
|-----|------------|-----------------------|
| ADD | r0, r1, r2 | r0 := r1 + r2         |
| ADC | r0, r1, r2 | r0 := r1 + r2 + C     |
| SUB | r0, r1, r2 | r0 := r1 - r2         |
| SBC | r0, r1, r2 | r0 := r1 - r2 + C - 1 |
| RSB | r0, r1, r2 | r0 := r2 - r1         |
| RSC | r0, r1, r2 | r0 := r2 - r1 + C - 1 |
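
The carry semantics in the table can be checked with a small Python model. This is only a sketch: the helper names `adc`, `sbc`, `rsc` and the constant `MASK` are hypothetical, and every result is truncated to 32 bits.

```python
MASK = 0xFFFFFFFF  # results are 32 bits wide


def adc(r1, r2, c):
    """Model of ADC: r1 + r2 + C, truncated to 32 bits."""
    return (r1 + r2 + c) & MASK


def sbc(r1, r2, c):
    """Model of SBC: r1 - r2 + C - 1, truncated to 32 bits."""
    return (r1 - r2 + c - 1) & MASK


def rsc(r1, r2, c):
    """Model of RSC: r2 - r1 + C - 1 (operands reversed)."""
    return (r2 - r1 + c - 1) & MASK


print(adc(1, 2, 1))  # 4
print(sbc(5, 3, 1))  # 2 (C set means "no borrow")
print(sbc(5, 3, 0))  # 1 (C clear subtracts one more)
print(rsc(3, 5, 1))  # 2
```

Note that for subtraction the C bit acts as an inverted borrow: with C = 1 the plain difference is produced.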

#### **Bit-Wise Logical Operations**

 These instructions perform the specified boolean logic operation on each bit pair of the input operands

| Instruction    | Effect                |
|----------------|-----------------------|
| AND r0, r1, r2 | r0 := r1 AND r2       |
| ORR r0, r1, r2 | r0 := r1 OR r2        |
| EOR r0, r1, r2 | r0 := r1 XOR r2       |
| BIC r0, r1, r2 | r0 := r1 AND (NOT r2) |

- BIC stands for 'bit clear'
- Every '1' in the second operand clears the corresponding bit in the first

#### **Example: BIC Instruction**

Assume that r1, r2, and r0 are 16-bit registers

r1 = 1111111111111111

r2 = 0000000001100101

BIC r0, r1, r2

r0 = 1111111110011010
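
The bit-clear operation can be reproduced in Python as a quick check. This is a minimal sketch: the function name `bic` and its `width` parameter are hypothetical, and the 16-bit width follows the assumption of this example rather than the real 32-bit registers.

```python
def bic(r1, r2, width=16):
    """Model of BIC: r1 AND (NOT r2), restricted to `width` bits."""
    mask = (1 << width) - 1
    return r1 & ~r2 & mask


r1 = 0b1111111111111111
r2 = 0b0000000001100101
r0 = bic(r1, r2)
print(format(r0, "016b"))  # 1111111110011010
```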

#### Register Movement Operations

 These instructions ignore the first operand, which is omitted from the assembly language format, and simply move the second operand to the destination

| Instruction | Effect       |
|------------|--------------|
| MOV r0, r2 | r0 := r2     |
| MVN r0, r2 | r0 := NOT r2 |

#### **Comparison Operations**

 These instructions do not produce a result, but just set the condition code bits (N, Z, C, and V) in the CPSR according to the selected operation

| Instruction | Operands | Meaning | Effect |
|-----|--------|-----------------|---------------------|
| CMP | r1, r2 | compare         | set cc on r1 - r2   |
| CMN | r1, r2 | compare negated | set cc on r1 + r2   |
| TST | r1, r2 | bit test        | set cc on r1 AND r2 |
| TEQ | r1, r2 | test equal      | set cc on r1 XOR r2 |

#### **Immediate Operands**

 If we wish to add a constant to a register, we can replace the second source operand with an immediate value

```
ADD r3, r3, #1    ; r3 := r3 + 1
AND r8, r7, #0xff ; r8 := r7[7:0]
```

- A constant is preceded by '#'
- A hexadecimal constant is written by putting '0x' after the '#'

# **Shifted Register Operands (1)**

 These instructions allow the second register operand to be subject to a shift operation before it is combined with the first operand

```
ADD r3, r2, r1, LSL #3 ; r3 := r2 + 8 * r1
```

- They are still single ARM instructions, executed in a single clock cycle
- Most processors offer shift operations as separate instructions, but the ARM combines them with a general ALU operation in a single instruction
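
The effect of combining a shift with an ALU operation can be sketched in Python. The helper names `lsl` and `add_shifted` are hypothetical, and the register values are arbitrary sample data.

```python
MASK = 0xFFFFFFFF  # registers are 32 bits wide


def lsl(value, amount):
    """Logical shift left, keeping only the low 32 bits."""
    return (value << amount) & MASK


def add_shifted(rn, rm, shift):
    """Model of: ADD rd, rn, rm, LSL #shift."""
    return (rn + lsl(rm, shift)) & MASK


r1, r2 = 5, 100
r3 = add_shifted(r2, r1, 3)  # ADD r3, r2, r1, LSL #3
print(r3)  # 140, that is 100 + 8 * 5
```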

## **Shifted Register Operands (2)**

| Mnemonic | Meaning | Description |
|-----|-----------------------|---------------------------------------------------------|
| LSL | logical shift left by 0 to 31 | Fill the vacated bits at the LSB of the word with zeros |
| ASL | arithmetic shift left | A synonym for LSL                                       |



**LSL #5** 

## **Shifted Register Operands (3)**

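| Mnemonic | Meaning | Description |
|-----|-----------------------|---------------------------------------------------------|
| LSR | logical shift right by 0 to 32 | Fill the vacated bits at the MSB of the word with zeros |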



**LSR #5** 

#### **Shifted Register Operands (4)**

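| Mnemonic | Meaning | Description |
|-----|-----------------------|---------------------------------------------------------|
| ASR | arithmetic shift right by 0 to 32 | Fill the vacated bits at the MSB of the word with zeros (source operand is positive) |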



ASR #5 ;positive operand

## **Shifted Register Operands (5)**

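| Mnemonic | Meaning | Description |
|-----|-----------------------|---------------------------------------------------------|
| ASR | arithmetic shift right by 0 to 32 | Fill the vacated bits at the MSB of the word with ones (source operand is negative) |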



ASR #5 ;negative operand

#### **Unsigned Integer**

- 16 bit

#### Signed Integer (2's Complement)

```
 32767 => 0111111111111111   (largest positive number)
     4 => 0000000000000100
     1 => 0000000000000001
     0 => 0000000000000000
    -1 => 1111111111111111
    -4 => 1111111111111100
-32767 => 1000000000000001
-32768 => 1000000000000000   (smallest negative number)
```

#### **Example 1**

```
 8 => 0000000000001000
```

- Shift right 2 bits
  - 0000000000000010 (LSR)
  - 0000000000000010 (ASR)

#### Example 2

```
-8 => 1111111111111000
```

- Shift right 2 bits
  - 0011111111111110 (LSR)
  - 1111111111111110 (ASR)
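
The difference between the two right shifts can be reproduced with a short Python sketch. The 16-bit width matches these examples (real ARM registers are 32 bits), and the names `lsr`, `asr`, `WIDTH` and `MASK` are hypothetical.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def lsr(value, amount):
    """Logical shift right: vacated MSBs are filled with zeros."""
    return (value & MASK) >> amount


def asr(value, amount):
    """Arithmetic shift right: vacated MSBs are filled with the sign bit."""
    value &= MASK
    shifted = value >> amount
    if value >> (WIDTH - 1):  # negative operand
        shifted |= MASK & ~(MASK >> amount)  # set the top `amount` bits
    return shifted


for n in (8, -8):
    bits = n & MASK  # two's complement bit pattern
    print(format(bits, "016b"),
          format(lsr(bits, 2), "016b"),
          format(asr(bits, 2), "016b"))
# 0000000000001000 0000000000000010 0000000000000010
# 1111111111111000 0011111111111110 1111111111111110
```

For a positive operand the two shifts agree; for a negative operand only ASR preserves the sign, so -8 becomes -2.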

## **Shifted Register Operands (6)**

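| Mnemonic | Meaning | Description |
|-----|-----------------------|---------------------------------------------------------|
| ROR | rotate right by 0 to 32 | The bits which fall off the LSB of the word are used to fill the vacated bits at the MSB of the word |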



**ROR #5** 

# **Shifted Register Operands (7)**

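| Mnemonic | Meaning | Description |
|-----|-----------------------|---------------------------------------------------------|
| RRX | rotate right extended by 1 place | The vacated bit (bit 31) is filled with the old value of the C flag and the operand is shifted one place to the right |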



RRX
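
The two rotate operations can be modelled in Python as follows. This is a sketch under the assumption of 32-bit registers; the names `ror` and `rrx` are hypothetical, and `rrx` also returns the bit shifted out, which is where the new carry would come from if the flags are updated.

```python
MASK = 0xFFFFFFFF


def ror(value, amount):
    """Rotate right: bits leaving the LSB re-enter at the MSB."""
    amount %= 32
    value &= MASK
    return ((value >> amount) | (value << (32 - amount))) & MASK


def rrx(value, carry_in):
    """Rotate right extended by one place through the carry.

    Returns (result, bit_shifted_out).
    """
    value &= MASK
    result = (carry_in << 31) | (value >> 1)
    return result, value & 1


print(hex(ror(0x0000001F, 5)))  # 0xf8000000
print(rrx(0x00000002, 1))       # (2147483649, 0), that is 0x80000001
```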

## **Shifted Register Operands (8)**

- It is possible to use a register value to specify the number of bits the second operand should be shifted by
- Ex:

```
ADD r5, r5, r3, LSL r2 ; r5:=r5+r3*2^r2
```

Only the bottom 8 bits of r2 are significant

#### **Setting the Condition Codes**

- Any data processing instruction can set the condition codes (N, Z, C, and V) if the programmer wishes it to
- Ex: 64-bit addition

```
; 64-bit addition: (r3:r2) := (r3:r2) + (r1:r0)
ADDS r2, r2, r0 ; 32-bit carry out -> C
ADC  r3, r3, r1 ; C is added into high word
```

Adding 'S' to the opcode stands for 'Set condition codes'
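
The carry chain of the 64-bit addition can be traced in Python. This is a minimal sketch: `adds`, `adc` and `add64` are hypothetical helpers, and only the C flag is modelled.

```python
MASK = 0xFFFFFFFF


def adds(a, b):
    """ADDS: 32-bit add that also returns the carry out (C flag)."""
    total = a + b
    return total & MASK, total >> 32


def adc(a, b, c):
    """ADC: 32-bit add including the carry in."""
    return (a + b + c) & MASK


def add64(r3, r2, r1, r0):
    """(r3:r2) := (r3:r2) + (r1:r0), high word first in each pair."""
    r2, carry = adds(r2, r0)  # low words, carry out -> C
    r3 = adc(r3, r1, carry)   # high words plus C
    return r3, r2


print(add64(0x1, 0xFFFFFFFF, 0x0, 0x1))  # (2, 0)
```

The low-word addition overflows, so the carry propagates into the high word and the 64-bit result is 0x00000002_00000000.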

## Multiplies (1)

- A special form of the data processing instruction supports multiplication
- Some important differences
  - Immediate second operands are not supported
  - The result register must not be the same as the first source register
  - If the 'S' bit is set, the C flag is meaningless

```
MUL r4, r3, r2 ; r4 := (r3 x r2)[31:0]
```

## Multiplies (2)

The multiply-accumulate instruction

```
MLA r4, r3, r2, r1 ; r4 := (r3 x r2 + r1) [31:0]
```

- In some cases, it is more efficient to use a short series of data processing instructions
- Ex: multiply r0 by 3

```
MOV r1, #3     ; move 3 to r1
MUL r3, r0, r1 ; r3 := r0 x 3
```

OR

```
ADD r3, r0, r0, LSL #1 ;r3:= r0 + r0 x 2
```

#### Multiplies (3)

Ex: multiply r0 by 2

```
MOV r1, #2     ; move 2 to r1
MUL r3, r0, r1 ; r3 := r0 x 2
```

OR

```
MOV r3, r0, LSL #1 ; r3 := r0 x 2
```

Ex: multiply r0 by 35

```
MOV r1, #35    ; move 35 to r1
MUL r3, r0, r1 ; r3 := r0 x 35
```

OR

```
ADD r0, r0, r0, LSL #2 ; r0' := 5 x r0
RSB r0, r0, r0, LSL #3 ; r0'':= 7 x r0'
```
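
The shift-and-add sequences above can be checked against ordinary multiplication. The sketch below models each instruction as one Python statement; the function names `times3` and `times35` are hypothetical.

```python
MASK = 0xFFFFFFFF


def times3(r0):
    """ADD r3, r0, r0, LSL #1"""
    return (r0 + (r0 << 1)) & MASK


def times35(r0):
    """Two instructions: multiply by 5, then by 7."""
    r0 = (r0 + (r0 << 2)) & MASK  # ADD r0, r0, r0, LSL #2 -> 5 x r0
    r0 = ((r0 << 3) - r0) & MASK  # RSB r0, r0, r0, LSL #3 -> 7 x (5 x r0)
    return r0


print(times3(11))  # 33
print(times35(7))  # 245
```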

#### Addressing mode

- The ARM data transfer instructions are all based around register-indirect addressing
  - Base-plus-offset addressing
  - Base-plus-index addressing

```
LDR r0, [r1] ; r0 := mem32[r1]
STR r0, [r1] ; mem32[r1] := r0
```

Register-indirect addressing

#### **Data Transfer Instructions**

- Move data between ARM registers and memory
- Three basic forms of data transfer instruction
  - Single register load and store instructions
  - Multiple register load and store instructions
  - Single register swap instructions

# Single Register Load and Store Instructions (1)

- These instructions provide the most flexible way to transfer a single data item between an ARM register and memory
- The data item may be a byte, a 16-bit half-word, or a 32-bit word

```
LDR r0, [r1] ; r0 := mem32[r1]
STR r0, [r1] ; mem32[r1] := r0
```

Register-indirect addressing

# Single Register Load and Store Instructions (2)

| Instruction | Description | Operation |
|-------|-------------------------------------------|--------------------------------|
| LDR   | Load a word into register                 | Rd ← mem32[address]             |
| STR   | Store a word in register into memory      | mem32[address] ← Rd             |
| LDRB  | Load a byte into register                 | Rd ← mem8[address]              |
| STRB  | Store a byte in register into memory      | mem8[address] ← Rd              |
| LDRH  | Load a half-word into register            | Rd ← mem16[address]             |
| STRH  | Store a half-word in register into memory | mem16[address] ← Rd             |
| LDRSB | Load a signed byte into register          | Rd ← signExtend(mem8[address])  |
| LDRSH | Load a signed half-word into register     | Rd ← signExtend(mem16[address]) |

#### **Endianness Example**



**Assembly Language, CSIE, CCU** 


#### Base-plus-offset Addressing (1)

#### Pre-indexed addressing mode

 It allows one base register to be used to access a number of memory locations which are in the same area of memory

```
LDR r0, [r1, #4] ; r0 := mem32[r1 + 4]
```

#### Base-plus-offset Addressing (2)

- Auto-indexing (Preindex with writeback)
  - No extra time
  - The time and code space cost of the extra instruction are avoided

```
LDR r0, [r1, #4]! ; r0 := mem32[r1 + 4]
                  ; r1 := r1 + 4
```

The exclamation mark "!" indicates that the instruction should update the base register after initiating the data transfer

#### Base-plus-offset Addressing (3)

- Post-indexed addressing mode
  - The exclamation "!" is not needed

```
LDR r0, [r1], #4 ; r0 := mem32[r1]
                 ; r1 := r1 + 4
```

```
i = 0;
while (i < n) {
    /* do some operation on A[i] */
    i++;
}
```

### **Application**

The array `table` starts at memory address 0x100 and holds A[0], A[1], A[2], ... in consecutive words.

```
        ADR r1, table
LOOP    LDR r0, [r1]        ; r0 := mem32[r1]
        ADD r1, r1, #4      ; r1 := r1 + 4
        ; do some operation on A[i] (that is, r0)
        ...
```

```
        ADR r1, table
LOOP    LDR r0, [r1], #4    ; r0 := mem32[r1]
                            ; r1 := r1 + 4
        ; do some operation on A[i] (that is, r0)
        ...
```

#### **Example**

Pre-indexed addressing mode

```
LDR r0, [r1, #8] ; r0 := mem32[r1 + 8]
```

Auto-indexing (Preindex with writeback)

```
LDR r0, [r1, #8]! ; r0 := mem32[r1 + 8]
                  ; r1 := r1 + 8
```

Post-indexed addressing mode

```
LDR r0, [r1], #8 ; r0 := mem32[r1]
                 ; r1 := r1 + 8
```
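
The three addressing modes differ only in which address is used and whether the base register changes. The Python sketch below makes that explicit; memory is modelled as a dictionary from address to word, the memory contents are made-up sample values, and the helpers `ldr_pre` and `ldr_post` are hypothetical.

```python
def ldr_pre(mem, regs, rd, rn, offset, writeback=False):
    """LDR rd, [rn, #offset] or, with writeback, LDR rd, [rn, #offset]!"""
    address = regs[rn] + offset
    regs[rd] = mem[address]
    if writeback:
        regs[rn] = address


def ldr_post(mem, regs, rd, rn, offset):
    """LDR rd, [rn], #offset"""
    regs[rd] = mem[regs[rn]]
    regs[rn] += offset


mem = {0x100: 10, 0x104: 20, 0x108: 30}  # sample data

regs = {"r0": 0, "r1": 0x100}
ldr_pre(mem, regs, "r0", "r1", 8)
print(regs["r0"], hex(regs["r1"]))  # 30 0x100

regs = {"r0": 0, "r1": 0x100}
ldr_pre(mem, regs, "r0", "r1", 8, writeback=True)
print(regs["r0"], hex(regs["r1"]))  # 30 0x108

regs = {"r0": 0, "r1": 0x100}
ldr_post(mem, regs, "r0", "r1", 8)
print(regs["r0"], hex(regs["r1"]))  # 10 0x108
```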

## Multiple Register Load and Store Instructions (1)

- Enable large quantities of data to be transferred more efficiently
- They are used for procedure entry and exit to save and restore workspace registers
- Copy blocks of data around memory

```
LDMIA r1, {r0, r2, r5} ; r0 := mem32[r1]
                       ; r2 := mem32[r1 + 4]
                       ; r5 := mem32[r1 + 8]
```

The base register r1 should be word-aligned

### Multiple Register Load and Store Instructions (2)

| Instruction | Description |
|-----|--------------------------|
| LDM | Load multiple registers  |
| STM | Store multiple registers |

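| Addressing mode      | Description | Starting address | End address | Rn!    |
|----------------------|-------------|------------------|-------------|--------|
| IA (increment after)  | Address is incremented after each transfer  | Rn               | Rn+4*N-4    | Rn+4*N |
| IB (increment before) | Address is incremented before each transfer | Rn+4             | Rn+4*N      | Rn+4*N |
| DA (decrement after)  | Address is decremented after each transfer  | Rn-4*N+4         | Rn          | Rn-4*N |
| DB (decrement before) | Address is decremented before each transfer | Rn-4*N           | Rn-4        | Rn-4*N |

N is the number of registers in the list; the Rn! column gives the value of the base register after writeback.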

Addressing mode for multiple register load and store instructions
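
The address ranges in the table can be computed with a few lines of Python. This is a sketch only: the function `ldm_addresses` is hypothetical, `n` is the number of registers in the list, and the sample base addresses are chosen for illustration.

```python
def ldm_addresses(mode, rn, n):
    """Return (word addresses, lowest first; Rn after writeback)."""
    if mode == "IA":
        start, new_rn = rn, rn + 4 * n
    elif mode == "IB":
        start, new_rn = rn + 4, rn + 4 * n
    elif mode == "DA":
        start, new_rn = rn - 4 * n + 4, rn - 4 * n
    elif mode == "DB":
        start, new_rn = rn - 4 * n, rn - 4 * n
    else:
        raise ValueError("unknown addressing mode: " + mode)
    return [start + 4 * i for i in range(n)], new_rn


for mode, base in (("IA", 0x100), ("IB", 0x100), ("DA", 0x114), ("DB", 0x114)):
    addresses, new_rn = ldm_addresses(mode, base, 3)
    print(mode, [hex(a) for a in addresses], hex(new_rn))
# IA ['0x100', '0x104', '0x108'] 0x10c
# IB ['0x104', '0x108', '0x10c'] 0x10c
# DA ['0x10c', '0x110', '0x114'] 0x108
# DB ['0x108', '0x10c', '0x110'] 0x108
```

The lowest-numbered register in the list always uses the first (lowest) address returned, whichever mode is chosen.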

### Example (1)

LDMIA r0, {r1, r2, r3}
OR
LDMIA r0, {r1-r3}

| address | data |
|---------|------|
| 0x100   | 10   |
| 0x104   | 20   |
| 0x108   | 30   |
| 0x10C   | 40   |
| 0x110   | 50   |
| 0x114   | 60   |

r0 = 0x100

r1 := 10 r2 := 20 r3 := 30

### Example (2)



LDMIA r0!, {r1, r2, r3} with r0 = 0x100 and the same memory contents:

r1 := 10

r2 := 20

r3 := 30

r0 := 0x10C

### Example (3)



LDMIB r0!, {r1, r2, r3} with r0 = 0x100 and the same memory contents:

r1 := 20

r2 := 30

r3 := 40

r0 := 0x10C

### Example (4)

LDMDA r0!, {r1, r2, r3}

| address | data |
|---------|------|
| 0x100   | 10   |
| 0x104   | 20   |
| 0x108   | 30   |
| 0x10C   | 40   |
| 0x110   | 50   |
| 0x114   | 60   |



r0 = 0x114 (before the instruction)

r1 := 40 r2 := 50 r3 := 60

r0 := 0x108

### Example (5)

LDMDB r0!, {r1, r2, r3}

| address | data |
|---------|------|
| 0x100   | 10   |
| 0x104   | 20   |
| 0x108   | 30   |
| 0x10C   | 40   |
| 0x110   | 50   |
| 0x114   | 60   |



r0 = 0x114 (before the instruction)

r1 := 30 r2 := 40 r3 := 50

r0 := 0x108

### Example (6)

STMIA r0, {r1, r2, r3}
OR
STMIA r0, {r1-r3}

r0 = 0x100

| address | data |
|---------|------|
| 0x100   | 1    |
| 0x104   | 2    |
| 0x108   | 3    |
| 0x10C   | 40   |
| 0x110   | 50   |
| 0x114   | 60   |

### Example (7)



r1 := 1

r2 := 2

r3 := 3



### Example (8)



r1 := 1 r2 := 2

r3 := 3



### Example (9)



r1 := 1

r2 := 2

r3 := 3



### Example (10)



r1 := 1

r2 := 2

r3 := 3

r0 := 0x110



## Multiple Register Load and Store Instructions (3)

- Base register used to determine where memory access should occur
  - 4 different addressing modes allow increment and decrement inclusive or exclusive of the base register location.
  - Base register can be optionally updated following the transfer by appending it with an '!'
  - Lowest register number is always transferred to/from lowest memory location accessed

#### **Application**

#### Copy a block of memory

```
        ; r9  holds the start address of the source data
        ; r10 holds the start address of the destination
        ; r11 holds the end address of the source data
LOOP    LDMIA r9!, {r0-r7}
        STMIA r10!, {r0-r7}
        CMP   r9, r11
        BNE   LOOP
```
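
The behaviour of this loop can be traced with a Python model. This is a sketch under stated assumptions: memory is a dictionary from address to word, the function name `block_copy` and the sample addresses are hypothetical, and, like the assembly loop, it only terminates when the source length is a multiple of 8 words (32 bytes).

```python
def block_copy(mem, src, dst, end, words=8):
    """Copy mem[src:end] to dst, `words` registers at a time."""
    r9, r10, r11 = src, dst, end
    while True:
        # LDMIA r9!, {r0-r7}: load 8 words, then advance r9
        block = [mem[r9 + 4 * i] for i in range(words)]
        r9 += 4 * words
        # STMIA r10!, {r0-r7}: store 8 words, then advance r10
        for i, word in enumerate(block):
            mem[r10 + 4 * i] = word
        r10 += 4 * words
        # CMP r9, r11 / BNE LOOP
        if r9 == r11:
            return r10


mem = {0x1000 + 4 * i: i * i for i in range(16)}  # 16 source words
final_dst = block_copy(mem, 0x1000, 0x2000, 0x1040)
print(hex(final_dst))                      # 0x2040
print([mem[0x2000 + 4 * i] for i in range(4)])  # [0, 1, 4, 9]
```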






### **Application: Stack Operations**

- ARM uses multiple load-store instructions to operate the stack
  - POP: multiple load instructions
  - PUSH: multiple store instructions

#### The Stack (1)

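- The stack grows either upward or downward
  - Ascending, 'A': addresses increase as the stack grows
  - Descending, 'D': addresses decrease as the stack grows
- Full stack, 'F': sp points to the last used address in the stack
- Empty stack, 'E': sp points to the first unused address in the stack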

### The Stack (2)

The mapping between the stack and block copy views of the multiple load and store instructions

| Addressing mode | Description | POP   | =LDM  | PUSH  | =STM  |
|------|-----|-------|-------|-------|-------|
| FA   | Full ascending   | LDMFA | LDMDA | STMFA | STMIB |
| FD   | Full descending  | LDMFD | LDMIA | STMFD | STMDB |
| EA   | Empty ascending  | LDMEA | LDMDB | STMEA | STMIA |
| ED   | Empty descending | LDMED | LDMIB | STMED | STMDA |


### The Stack (3)

- The stack type to be used is given by the postfix to the instruction:
  - STMFD/LDMFD: Full Descending stack
  - STMFA/LDMFA: Full Ascending stack
  - STMED/LDMED: Empty Descending stack
  - STMEA/LDMEA: Empty Ascending stack
- These stack mnemonics are pseudo instructions: the assembler maps each one to the equivalent IA/IB/DA/DB form
- Note: ARM Compilers will always use a Full descending stack



| Addressing mode | Description | POP   | =LDM  | PUSH  | =STM  |
|------|-----|-------|-------|-------|-------|
| FA   | Full ascending | LDMFA | LDMDA | STMFA | STMIB |

PUSH r1-r3 with STMFA, starting from:

r1 := 1 r2 := 2 r3 := 3

sp := 0x100



LDMFA sp! , {r4, r5, r6}
OR
LDMFA sp! , {r4-r6}

Before: sp := 0x10C

| address | data |
|---------|------|
| 0x100   | 10   |
| 0x104   | 1    |
| 0x108   | 2    |
| 0x10C   | 3    |
| 0x110   | 50   |
| 0x114   | 60   |

After: r4 := 1 r5 := 2 r6 := 3 sp := 0x100
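
The full ascending push and pop shown here can be modelled in Python. This is only a sketch: `push_fa` and `pop_fa` are hypothetical helpers, memory is a dictionary from address to word, and the initial contents follow the table of this example.

```python
def push_fa(mem, sp, values):
    """STMFA sp!, {...} (same as STMIB): increment, then store."""
    for value in values:  # lowest register goes to the lowest address
        sp += 4
        mem[sp] = value
    return sp


def pop_fa(mem, sp, count):
    """LDMFA sp!, {...} (same as LDMDA): returns (values, new sp)."""
    values = [mem[sp - 4 * (count - 1 - i)] for i in range(count)]
    return values, sp - 4 * count


mem = {0x100: 10, 0x104: 20, 0x108: 30, 0x10C: 40, 0x110: 50, 0x114: 60}
sp = push_fa(mem, 0x100, [1, 2, 3])  # r1, r2, r3
print(hex(sp))                        # 0x10c
(r4, r5, r6), sp = pop_fa(mem, sp, 3)
print(r4, r5, r6, hex(sp))            # 1 2 3 0x100
```

Because sp points at the last used word (full) and grows toward higher addresses (ascending), a push must increment before storing and a pop must read before decrementing.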

## Single Register Swap Instructions (1)

- Allow a value in a register to be exchanged with a value in memory
- Effectively do both a load and a store operation in one instruction
- They are little used in user-level programs
- Atomic operation
  - During the operation, no other instruction may read or write the memory location being accessed
- Application
  - Implement semaphores (multi-threaded / multiprocessor environment)

## Single Register Swap Instructions (2)

SWP{B} Rd, Rm, [Rn]

| Instruction | Description | Operation |
|------|---------------|-----------------|
| SWP  | Word exchange | tmp = mem32[Rn]<br>mem32[Rn] = Rm<br>Rd = tmp |
| SWPB | Byte exchange | tmp = mem8[Rn]<br>mem8[Rn] = Rm<br>Rd = tmp |

#### Example

r0: 123456

r1: 111111

r2: 0x108

| address | data |
|---------|------|
| 0x100   | 10   |
| 0x104   | 20   |
| 0x108   | 30   |



SWP r0, r1, [r2]



r0: 30

r1: 111111

r2: 0x108

| address | data   |
|---------|--------|
| 0x100   | 10     |
| 0x104   | 20     |
| 0x108   | 111111 |
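
The swap can be replayed in Python using the values of this example. The sketch below models only the data movement; the function name `swp` is hypothetical, and the atomicity of the real instruction is not modelled.

```python
def swp(mem, regs, rd, rm, rn):
    """Model of: SWP rd, rm, [rn]"""
    tmp = mem[regs[rn]]        # tmp = mem32[Rn]
    mem[regs[rn]] = regs[rm]   # mem32[Rn] = Rm
    regs[rd] = tmp             # Rd = tmp


regs = {"r0": 123456, "r1": 111111, "r2": 0x108}
mem = {0x100: 10, 0x104: 20, 0x108: 30}
swp(mem, regs, "r0", "r1", "r2")
print(regs["r0"], mem[0x108])  # 30 111111
```

Reading the old memory value into a temporary first means the instruction also works when Rd and Rm are the same register, which gives a true exchange.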


# Load an Address into Register (1)

- The ADR (load address into register) instruction loads a register with a 32-bit address
- Example
  - ADR r0,table
  - Loads register r0 with the 32-bit address of "table"



# Load an Address into Register (2)

- ADR is a pseudo instruction
- The assembler translates a pseudo instruction into a sequence of appropriate real instructions
- The assembler translates ADR into a single ADD or SUB instruction that loads the address into a register



#### **Control Flow Instructions**

Determine which instructions get executed next

```
        B     LABEL
        ...
LABEL   ...

        MOV   r0, #0        ; initialize counter
LOOP    ; do something here
        ...
        ADD   r0, r0, #1    ; increment loop counter
        CMP   r0, #10       ; compare with limit
        BNE   LOOP          ; repeat if not equal
                            ; else fall through
```

#### **Branch Conditions**

| Branch | Interpretation   | Normal uses                                       |
|--------|------------------|---------------------------------------------------|
| B      | Unconditional    | Always take this branch                           |
| BAL    | Always           | Always take this branch                           |
| BEQ    | Equal            | Comparison equal or zero result                   |
| BNE    | Not equal        | Comparison not equal or non-zero result           |
| BPL    | Plus             | Result positive or zero                           |
| BMI    | Minus            | Result minus or negative                          |
| BCC    | Carry clear      | Arithmetic operation did not give carry-out       |
| BLO    | Lower            | Unsigned comparison gave lower                    |
| BCS    | Carry set        | Arithmetic operation gave carry-out               |
| BHS    | Higher or same   | Unsigned comparison gave higher or same           |
| BVC    | Overflow clear   | Signed integer operation; no overflow occurred    |
| BVS    | Overflow set     | Signed integer operation; overflow occurred       |
| BGT    | Greater than     | Signed integer comparison gave greater than       |
| BGE    | Greater or equal | Signed integer comparison gave greater or equal   |
| BLT    | Less than        | Signed integer comparison gave less than          |
| BLE    | Less or equal    | Signed integer comparison gave less than or equal |
| BHI    | Higher           | Unsigned comparison gave higher                   |
| BLS    | Lower or same    | Unsigned comparison gave lower or same            |
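
How each condition relates to the N, Z, C and V flags after a CMP can be sketched in Python. The flag tests follow the standard ARM condition code definitions; the helper names `cmp_flags`, `CONDITIONS` and `branch_taken` are hypothetical, and operands are treated as 32-bit patterns.

```python
MASK = 0xFFFFFFFF


def cmp_flags(a, b):
    """Flags (N, Z, C, V) produced by CMP a, b, that is a - b."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    n = result >> 31                    # sign bit of the result
    z = int(result == 0)                # result is zero
    c = int(a >= b)                     # no borrow (unsigned a >= b)
    v = ((a ^ b) & (a ^ result)) >> 31  # signed overflow
    return n, z, c, v


CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "CS": lambda n, z, c, v: c == 1,
    "CC": lambda n, z, c, v: c == 0,
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,
    "LS": lambda n, z, c, v: c == 0 or z == 1,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "AL": lambda n, z, c, v: True,
}
CONDITIONS["HS"] = CONDITIONS["CS"]  # synonyms
CONDITIONS["LO"] = CONDITIONS["CC"]


def branch_taken(cond, a, b):
    """Would B<cond> be taken after CMP a, b?"""
    return CONDITIONS[cond](*cmp_flags(a, b))


minus_one = -1 & MASK  # 0xFFFFFFFF
print(branch_taken("LT", minus_one, 1))  # True: signed -1 < 1
print(branch_taken("HI", minus_one, 1))  # True: unsigned 0xFFFFFFFF > 1
```

The same bit pattern compares differently depending on whether the signed (GT, GE, LT, LE) or unsigned (HI, HS, LO, LS) conditions are used.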

#### **Branch Instructions**

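| Instruction | Description | Operation |
|-----|-----------------|----------------------------------------------------------------------|
| B   | Branch | PC = label |
| BL  | Branch with link | PC = label<br>LR = address of the first instruction after BL |
| BX  | Branch and exchange state | PC = Rm & 0xfffffffe, T = Rm & 1 |
| BLX | Branch with link and exchange state | PC = label, T = 1<br>PC = Rm & 0xfffffffe, T = Rm & 1<br>LR = address of the first instruction after BLX |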

# **Branch and Link Instructions (1)**

The BL instruction saves the return address into r14 (lr)

```
        BL    subroutine    ; branch to subroutine
        CMP   r1, #5        ; return to here
        MOVEQ r1, #0
        ...
```

```
subroutine                  ; subroutine entry point
        ...
        MOV   pc, lr        ; return
```

#### **Example**

```
0x100       BL    subroutine    ; branch to subroutine; r14 (lr) := 0x104
0x104       CMP   r1, #5        ; return to here
            MOVEQ r1, #0
            ...
subroutine                      ; subroutine entry point
            ...
            MOV   pc, lr        ; return; r15 (pc) := 0x104
```

# **Branch and Link Instructions (2)**

#### Problem

 If a subroutine wants to call another subroutine, the original return address, r14, will be overwritten by the second BL instruction

#### **Problem**

```
        BL    SUB1          ; branch to subroutine SUB1; r14 = 0x80
0x80    SUB   r1, r2, #100
        ...
SUB1    MOV   r0, r1
        BL    SUB2          ; r14 = 0x104, the old value 0x80 is lost
0x104   ADD   r1, r2, r3
        MOV   pc, r14       ; copy r14 into r15 to return
                            ; (r14 is still 0x104, so SUB1 cannot return to 0x80)
SUB2    ...
        MOV   pc, r14       ; copy r14 into r15 to return
```

# **Branch and Link Instructions (2)**

#### Solution

- Push r14 onto a stack
- The subroutine will often require some work registers; the old values in these registers can be saved at the same time using a store multiple instruction

# **Branch and Link Instructions (3)**

```
        BL    SUB1                  ; branch to subroutine SUB1
        ...
SUB1    STMFD r13!, {r0-r2, r14}    ; save work & link registers
        BL    SUB2
        ...
        LDMFD r13!, {r0-r2, pc}     ; restore work registers and return
```

```
SUB2    ...
        MOV   pc, r14               ; copy r14 into r15 to return
```


# **Conditional Execution (1)**

- One of the ARM's most interesting features is that each instruction is conditionally executed
- To make an instruction conditional, append the appropriate condition code to its mnemonic



# **Conditional Execution (2)**

The conditional execution code is faster and smaller than the equivalent code using branches, shown here:

```
; if ((a==b) && (c==d)) e++;
; a is in register r0
; b is in register r1
; c is in register r2
; d is in register r3
; e is in register r4
        CMP   r0, r1
        BNE   LABEL1
        CMP   r2, r3
        BNE   LABEL1
        ADD   r4, r4, #1
LABEL1
```

The same test written with conditional execution instead of branches:

```
; if ((a==b) && (c==d)) e++;
; a is in register r0
; b is in register r1
; c is in register r2
; d is in register r3
; e is in register r4
   CMP r0, r1
   CMPEQ r2, r3
   ADDEQ r4, r4, #1
```
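
Why three instructions are enough can be seen by tracking the Z flag step by step. The Python sketch below is a model only; the function name `increment_if_both_equal` is hypothetical and just the Z flag is represented.

```python
def increment_if_both_equal(a, b, c, d, e):
    """Model of CMP / CMPEQ / ADDEQ for: if ((a==b) && (c==d)) e++;"""
    z = (a == b)      # CMP r0, r1 sets Z when a == b
    if z:             # CMPEQ r2, r3 executes only when Z is set;
        z = (c == d)  # otherwise Z keeps its (clear) value
    if z:             # ADDEQ r4, r4, #1 executes only when Z is set
        e += 1
    return e


print(increment_if_both_equal(1, 1, 2, 2, 0))  # 1
print(increment_if_both_equal(1, 2, 2, 2, 0))  # 0
print(increment_if_both_equal(1, 1, 2, 3, 0))  # 0
```

When the first comparison fails, the second one is skipped and leaves Z clear, so the final ADDEQ is skipped as well; this is what implements the logical AND.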

# **Conditional Execution (3)**

- Predicate
- Real products
  - Partial predication support
    - SPARC, Alpha, ELF
  - Full predication support
    - IA-64, XScale, TIC6, ARM

# **Conditional Execution (4)**



```
cmp.ne p0, p1, r1, 0;  // branch b: set predicate register
add r2, r3, r4 (p0);  // if p0 is true, r2 = r3 + r4
add r2, r2, 1 (p1);  // if p1 is true, r2 = r2 + 1;
sub r5, r2, r6;
```

#### An example of IA64

# **Supervisor Calls (1)**

- SWI: SoftWare Interrupt
- The supervisor calls are implemented in system software
  - They are probably different from one ARM system to another
  - Most ARM systems implement a common subset of calls in addition to any specific calls required by the particular application

```
; This routine sends the character in the bottom
; byte of r0 to the user's display device

SWI SWI_WriteC ; output r0[7:0]
```

# **Supervisor Calls (2)**

```
; This routine returns control from a user program
; back to the monitor program

SWI SWI_Exit ; return to monitor
```

# **Jump Tables (1)**

 A programmer sometimes wants to call one of a set of subroutines, the choice depending on a value computed by the program

**Note**: slow when the list is long, and all subroutines are equally frequent

```
        BL    JUMPTAB
        ...
JUMPTAB CMP   r0, #0
        BEQ   SUB0
        CMP   r0, #1
        BEQ   SUB1
        CMP   r0, #2
        BEQ   SUB2
        ...
```
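
The cost difference between a compare-and-branch chain and a table lookup can be illustrated with a Python analogy. Everything here is hypothetical: the handlers `sub0` to `sub2` stand in for the subroutines, and each entry of the `JUMPTAB` list plays the role of one memory word holding a subroutine address.

```python
def sub0():
    return "SUB0"


def sub1():
    return "SUB1"


def sub2():
    return "SUB2"


def dispatch_chain(index):
    """Compare-and-branch chain: cost grows with the position in the list."""
    if index == 0:
        return sub0()
    if index == 1:
        return sub1()
    if index == 2:
        return sub2()
    raise IndexError("no subroutine for index %d" % index)


JUMPTAB = [sub0, sub1, sub2]  # one table entry per subroutine


def dispatch_table(index):
    """Table lookup: one bounds check, then a single indexed call."""
    if not 0 <= index < len(JUMPTAB):
        raise IndexError("no subroutine for index %d" % index)
    return JUMPTAB[index]()


print(dispatch_chain(2), dispatch_table(2))  # SUB2 SUB2
```

The table version takes the same time for every entry, which is why it is preferred when the list is long and all subroutines are equally frequent.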

# **Jump Tables (2)**

 "DCD" directive instructs the assembler to reserve a word of store and to initialize it to the value of the expression to the right



#### Writing Simple Assembly Language Programs (ARM ADS)

| Label      | Directive / instruction | Operands                      |
|------------|-------------------------|-------------------------------|
|            | AREA                    | HelloW, CODE, READONLY        |
| SWI_WriteC | EQU                     | &0                            |
| SWI_Exit   | EQU                     | &11                           |
|            | ENTRY                   |                               |
| START      | ADR                     | r1, TEXT                      |
| LOOP       | LDRB                    | r0, [r1], #1                  |
|            | CMP                     | r0, #0                        |
|            | SWINE                   | SWI_WriteC                    |
|            | BNE                     | LOOP                          |
|            | SWI                     | SWI_Exit                      |
| TEXT       | =                       | "Hello World", &0a, &0d, 0    |
|            | END                     |                               |

**AREA**: chunks of data or code that are manipulated by the linker

**EQU**: give a symbolic name to a numeric constant (\*)

**DCB**: allocate one or more bytes of memory and define initial runtime content of memory (=)

**ENTRY**: The first instruction to be executed within an application is marked by the ENTRY directive. An application can contain only a single entry point.

#### **General Assembly Form (ARM ADS)**

label <whitespace> instruction <whitespace> ;comment

- The three sections are separated by at least one whitespace character (a space or a tab)
- Actual instructions never start in the first column, since they must be preceded by whitespace, even if there is no label
- All three sections are optional

# Backup

# Using the Barrel Shifter: The Second Operand



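(Figure: operand 2 passes through the barrel shifter before entering the ALU; the ALU produces the result.)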

# **Example: Pipelines (1)**

- Laundry Example
- 4 loads of clothes
  - Washer takes 30 minutes
  - Dryer takes 40 minutes
  - "Folder" takes 20 minutes









# **Sequential Laundry**



- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?

#### Pipelined Laundry: Start work ASAP



Pipelined laundry takes 3.5 hours for 4 loads
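
Both laundry totals can be reproduced with a small scheduling model in Python. This is a sketch: the function names are hypothetical, stage times are in minutes, and each stage is assumed to handle one load at a time.

```python
STAGES = [30, 40, 20]  # washer, dryer, "folder" (minutes)


def sequential_minutes(loads, stages):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads, stages):
    """A load enters a stage as soon as it and the stage are both free."""
    stage_free_at = [0] * len(stages)
    for _ in range(loads):
        t = 0  # time at which this load is ready for the next stage
        for s, duration in enumerate(stages):
            start = max(t, stage_free_at[s])
            t = start + duration
            stage_free_at[s] = t
    return stage_free_at[-1]


print(sequential_minutes(4, STAGES) / 60)  # 6.0 hours
print(pipelined_minutes(4, STAGES) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: once the pipeline is full, the slowest stage (the dryer) sets the rate.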



# **Pipelined Design**

- Prevalent in today's processor implementations
- More pipeline stages
  - Improve throughput
  - Help to increase clock frequency




# Pipelines (1)

# Pipelines (2)



- fetch: fetch the instruction from memory
- dec: decode it to see what sort of instruction it is
- reg: access any operands that may be required from the register bank
- ALU: combine the operands to form the result or a memory address
- mem: access memory for a data operand, if necessary
- res: write the result back to the register bank

# More Pipeline stages, Better Performance?

- Pentium 3: 10
- Pentium 4 (Old): 20
- Pentium 4 (Prescott): 31
- Next-Generation Micro-Architecture (NGMA): 14

### **Pipeline Hazards**

```
Ex:
    add r1, r2, #10    ; write r1
    sub r3, r1, #20    ; use r1

instruction                                  time ->
1   fetch  dec    reg    ALU    mem    res
2          fetch  dec    -------stall-------  reg    ALU    mem    res
```

Read-after-write pipeline hazard

## Pipelined Branch Behavior



Wrong instructions in pipeline need to be flushed (thrown away)


#### **Solutions**

- Stall pipeline until branch resolved
- Branch prediction
  - A misprediction incurs a big penalty
- Q: Can we remove branch instructions?
  - Conditional execution: operations based on the value of a Boolean source operand
  - Drawbacks
    - Affects the instruction cache
    - Increases the critical path length