# Chapter 04: awk Tutorial
___

`awk` is more a pattern-matching "language" than a single-purpose tool. It qualifies as a language because it supports
* __variable definitions__,
* __branch/loop statements__, and
* __function definitions__.

Whereas `sed` processes a file one line at a time, `awk` processes it one "record" at a time. Records are delimited by the record separator __RS__, which defaults to a newline, so by default a record is simply a line.
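
To make the idea of a record concrete, here is a minimal sketch that sets `RS` to a semicolon, so that a single line of input is split into three records. The input string and the variable name `records` are made up for illustration.

```shell
# One physical line, but three records once RS is ";".
records=$(printf 'a b;c d;e f' | awk 'BEGIN { RS = ";" } { print NR ": " $1 }')
echo "$records"
```

Each record is still split into fields in the usual way, which is why `$1` yields the first word of every record.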

### A small `awk` example 

In [17]:
cp /etc/passwd test
cat test | head -5

root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync


### `awk` syntax
```
awk -F FS 'PATTERN{ACTION}' FILENAMES
awk -f SCRIPT-FILE FILENAMES
```

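Here `PATTERN` can be a regular-expression match such as `$7 ~ /bash/`, or any other expression such as a comparison. `ACTION` runs for every record on which `PATTERN` is true; if `ACTION` is omitted, the matching record is printed.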

In [88]:
awk -F: '$7!~/bash/{print $1}' test | head -5

daemon
bin
sys
sync
games


In [89]:
awk -F":" '$0!~/bash/{print $1}' test | wc -l

44


In [91]:
awk -F":" '$7~/bash|zsh/' test

root:x:0:0:root:/root:/bin/bash
postgres:x:116:124:PostgreSQL administrator,,,:/var/lib/postgresql:/bin/bash
bio:x:1000:1000:Ricky Woo,,,:/home/bio:/bin/bash
hadoop:x:1001:1001::/home/hadoop:/bin/zsh
couchdb:x:127:137:CouchDB Administrator,,,:/var/lib/couchdb:/bin/bash
oprofile:x:128:138:OProfile JIT user,,,:/var/lib/oprofile:/bin/bash
biotmp:x:1004:1000::/home/biotmp:/bin/bash


In [93]:
awk -F":" '$7=="/bin/bash"' test | head -3

root:x:0:0:root:/root:/bin/bash
postgres:x:116:124:PostgreSQL administrator,,,:/var/lib/postgresql:/bin/bash
bio:x:1000:1000:Ricky Woo,,,:/home/bio:/bin/bash


In [94]:
awk -F":" '$3>1000' test

nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
hadoop:x:1001:1001::/home/hadoop:/bin/zsh
biotmp:x:1004:1000::/home/biotmp:/bin/bash


In [96]:
awk 'BEGIN{FS=":"}{if ($3>$4) print }' test

usbmux:x:103:46:usbmux daemon,,,:/home/usbmux:/bin/false
speech-dispatcher:x:110:29:Speech Dispatcher,,,:/var/run/speech-dispatcher:/bin/sh
hplip:x:114:7:HPLIP system user,,,:/var/run/hplip:/bin/false
biotmp:x:1004:1000::/home/biotmp:/bin/bash
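
The second form in the syntax above, `awk -f SCRIPT-FILE`, is not exercised by the cells in this section. The following is a small sketch of it, assuming a temporary script file created with `mktemp` and two made-up `passwd`-style lines as input.

```shell
# Put the awk program in a file instead of on the command line.
script=$(mktemp)
cat > "$script" <<'EOF'
# Print the login name of every account whose shell ends in "bash".
BEGIN { FS = ":" }
$7 ~ /bash$/ { print $1 }
EOF

bash_users=$(printf 'root:x:0:0:root:/root:/bin/bash\nsync:x:4:65534:sync:/bin:/bin/sync\n' | awk -f "$script")
echo "$bash_users"
rm -f "$script"
```

Keeping the program in a file avoids shell quoting problems and lets the script carry comments.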


## Built-in variables

| Variable | Definition |
| --- | --- |
| ARGC | Number of command-line arguments, counting the command name itself but not the options or the program text |
| ARGV | Array of command-line arguments, indexed from 0 to `ARGC-1`; `ARGV[0]` is the command name |
| ENVIRON | Array of environment variables, indexed by variable name |
| FILENAME | Name of the current input file |
| NF | Number of fields in the current record |
| NR | Number of records read so far, across all input files |
| OFS | Output field separator (a space by default) |
| ORS | Output record separator (a newline by default) |
| FNR | Number of records read so far in the current input file |
| RS | Record separator (a newline by default) |
| FS | Field separator (whitespace by default) |
| SUBSEP | Separator that joins the subscripts of a "multidimensional" array |


In [11]:
cp /etc/passwd test



In [32]:
awk 'BEGIN { print ARGV[1] }' test

test


In [80]:
head -6 test | awk 'NR%2==0{print}'

daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
games:x:5:60:games:/usr/games:/usr/sbin/nologin


In [14]:
head test | awk '{print FNR, NR}'

1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
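
With a single input, `FNR` and `NR` are always equal, as above. They differ only when `awk` reads more than one file. The sketch below assumes two small scratch files, `first.txt` and `second.txt`, created in a temporary directory just for this demonstration.

```shell
tmpdir=$(mktemp -d)
printf 'a\nb\n'    > "$tmpdir/first.txt"
printf 'c\nd\ne\n' > "$tmpdir/second.txt"

# FNR restarts at 1 for each file; NR keeps counting across files.
fnr_nr=$(cd "$tmpdir" && awk '{ print FILENAME, FNR, NR }' first.txt second.txt)
echo "$fnr_nr"
rm -r "$tmpdir"
```

The first record of `second.txt` is reported with `FNR` equal to 1 and `NR` equal to 3.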


## Three kinds of rule blocks

| Blocks | Description |
| --- | --- |
| `BEGIN{...}` | Runs once, before any input is read |
| `{...}` | Runs for each record |
| `END{...}` | Runs once, after all input has been processed |

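Suppose the file contains n records. `BEGIN` then corresponds to a "record 0", before any input has been processed, and `END` corresponds to a "record n+1", after every record has been traversed.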

In [36]:
awk '
BEGIN {
    print "UserName\tShell"
    print "====================";
    FS = ":";
    OFS = "\t";
}
$7=="/bin/bash" {
    print $1, $7;
}
END {
    print "--------------------";
}' test

UserName	Shell
====================
root	/bin/bash
postgres	/bin/bash
bio	/bin/bash
couchdb	/bin/bash
oprofile	/bin/bash
biotmp	/bin/bash
--------------------


In [37]:
cat > numbers.txt <<EOF
3,5,6,7
2,3,1,0
4,5,6,9
2,3,4,4
2,2,1,0
4,5,0,9
EOF



In [42]:
cat numbers.txt

3,5,6,7
2,3,1,0
4,5,6,9
2,3,4,4
2,2,1,0
4,5,0,9


In [69]:
awk -F"," '{
    x+=$2+$3;
    a[NR]=$2+$3
}
END{
    y=x/NR;
    for(i in a){if(y<a[i]) z++;}
    print "The average is ", y
    print "The numbers greater than average: ", z
}' numbers.txt

The average is  6.83333
The numbers greater than average:  3


In [68]:
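awk -F"," '
BEGIN {
    n=0
    # First pass: read the file with getline to compute the average.
    # Comparing with > 0 stops at end of file (0) and on a read error (-1).
    while ((getline < "numbers.txt") > 0) {
        x+=$2+$3;
        i++;
    }
    close("numbers.txt");
    avg = x/i;
}
{
    # Second pass: count the records whose $2+$3 exceeds the average.
    if ($2+$3 > avg) n++;
}
END {
    printf("The average is %f\n", avg);
    printf("The number of $2+$3 greater than avg: %d\n", n);
}' numbers.txt

The average is 6.833333
The number of $2+$3 greater than avg: 3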


In [46]:
awk -F"," '
{
    a[NR] = $2;    # collect the second field of every record
}
END{
    asort(a, c);   # asort() is a gawk extension; sorted values land in c[1..NR]
    for (i=1;i<=NR;i++) print c[i]
}' numbers.txt

2
3
3
5
5
5


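These examples show that `awk` supports variables and arrays. However, `awk` only supports one-dimensional arrays, and their subscripts are actually strings rather than numbers. The so-called multidimensional arrays are only simulated: the subscripts `i,j` are joined into a single string with `SUBSEP`, as the following example shows: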

In [16]:
awk 'BEGIN {
    for(i=1;i<4;i++){
        for(j=i;j<4;j++) a[i,j]=i*j;
    }
    for(k in a) {
        split(k, subs, SUBSEP)
        print subs[1], "-", subs[2], ":", a[k];
    }
}'

2 - 2 : 4
2 - 3 : 6
3 - 3 : 9
1 - 1 : 1
1 - 2 : 2
1 - 3 : 3


## Conditional statements and loop statements

### `if`-condition

In [63]:
awk -F"," -v i=1 '{
    if($4 == 0) {
        z[i] = $1+$2+$3;
        i++;
    }
}
END{
    for (i in z) {
        print i, z[i]
    }
}' numbers.txt

1 6
2 5


### `for`-loop

In [5]:
awk 'BEGIN{
    for (i=0;i<5;i++) {
        a[i] = i;
    }
    for (i in a) {
        print i, a[i];
    }
}'

4 4
0 0
1 1
2 2
3 3
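
The output above starts at index 4 because the order in which `for (i in a)` visits the keys is unspecified. When the order matters and the keys are consecutive integers, a counting loop is the usual alternative. A minimal sketch, with the array contents chosen only for illustration:

```shell
ordered=$(awk 'BEGIN {
    n = 5
    for (i = 0; i < n; i++) a[i] = i * i
    # Counting loop: visits the keys 0..n-1 in numeric order.
    for (i = 0; i < n; i++) printf "%s%d:%d", (i ? " " : ""), i, a[i]
    print ""
}')
echo "$ordered"
```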


In [3]:
VAR=10 awk 'BEGIN{
        print ENVIRON["VAR"];
        print $VAR
}'

10
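
In the cell above, `print $VAR` prints an empty line. Inside an `awk` program, `$VAR` is not a shell expansion. It means "the field whose number is stored in the awk variable `VAR`"; that variable is unset, so the expression is `$0`, which is empty in `BEGIN`. Two common ways to pass a shell value into `awk` are sketched below. The awk variable name `var` is arbitrary.

```shell
VAR=10

# 1. Assign an awk variable on the command line with -v.
from_v=$(awk -v var="$VAR" 'BEGIN { print var + 1 }')

# 2. Export the value into the environment of awk and read it from ENVIRON.
from_env=$(VAR="$VAR" awk 'BEGIN { print ENVIRON["VAR"] + 1 }')

echo "$from_v $from_env"
```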



### `while`-loop

In [5]:
awk -v i=1 'BEGIN{
    while (i < 10) {
        j=1;
        while (j < 10) {
            printf("%dx%d=%d", i, j, i*j);
            printf("\t");
            j++;
        }
        print "\n";
        i++;
    }
}'

1x1=1	1x2=2	1x3=3	1x4=4	1x5=5	1x6=6	1x7=7	1x8=8	1x9=9	

2x1=2	2x2=4	2x3=6	2x4=8	2x5=10	2x6=12	2x7=14	2x8=16	2x9=18	

3x1=3	3x2=6	3x3=9	3x4=12	3x5=15	3x6=18	3x7=21	3x8=24	3x9=27	

4x1=4	4x2=8	4x3=12	4x4=16	4x5=20	4x6=24	4x7=28	4x8=32	4x9=36	

5x1=5	5x2=10	5x3=15	5x4=20	5x5=25	5x6=30	5x7=35	5x8=40	5x9=45	

6x1=6	6x2=12	6x3=18	6x4=24	6x5=30	6x6=36	6x7=42	6x8=48	6x9=54	

7x1=7	7x2=14	7x3=21	7x4=28	7x5=35	7x6=42	7x7=49	7x8=56	7x9=63	

8x1=8	8x2=16	8x3=24	8x4=32	8x5=40	8x6=48	8x7=56	8x8=64	8x9=72	

9x1=9	9x2=18	9x3=27	9x4=36	9x5=45	9x6=54	9x7=63	9x8=72	9x9=81	



### `do`-loop

In [50]:
awk 'BEGIN{
    total=0;
    i=0;
    do
    {
        total+=i;
        i++;
    } while(i<=100)
    print total;
}'

5050


## Arithmetic functions
| Function name | Description |
| --- | --- |
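| `atan2(y,x)` | Arctangent of `y/x`, in radians |
| `cos(x)` | Cosine of `x` (in radians) |
| `sin(x)` | Sine of `x` (in radians) |
| `exp(x)` | Exponential function, $e$ raised to the power `x` |
| `log(x)` | Natural logarithm (base $e$) of `x` |
| `sqrt(x)` | Square root of `x` |
| `abs(x)` | Absolute value of `x`; this is not built in, but it is easy to define: `function abs(x) { return x < 0 ? -x : x }` |
| `int(x)` | Truncate `x` to an integer |
| `rand()`  | Return a random number between 0 and 1, excluding 1 |
| `srand([expr])` | Set the random seed, usually together with `rand()`; if the argument is omitted, the current time is used as the seed |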
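
A quick sketch of the deterministic functions from the table. The `abs` function here is user-defined, since `awk` does not provide one, and the variable name `math_demo` is arbitrary.

```shell
math_demo=$(awk '
function abs(x) { return x < 0 ? -x : x }   # user-defined, not a built-in
BEGIN {
    # int() truncates toward zero; atan2(0, -1) is pi.
    print int(3.9), int(-3.9), sqrt(16), exp(0), log(1), atan2(0, -1), abs(-2.5)
}')
echo "$math_demo"
```

`print` formats non-integer numbers with `OFMT`, which is `%.6g` by default, so pi is shown with six significant digits.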

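## Built-in string functions

| Function name | Description | Example |
| :--- | --- | --- |
| `gsub(r,s)` | Replace every match of r in \$0 with s; returns the number of replacements | `awk 'gsub(/name/, "xingming") {print $0}' temp` |
| `gsub(r,s,t)` | Replace every match of r in t with s | `awk '{gsub(/name/, "xingming", $3); print}' temp` |
| `index(s,t)` | Return the position of the first occurrence of string t in s, or 0 if there is none | `awk 'BEGIN {print index("Sunny", "ny")}'` |
| `length(s)` | Return the length of s | `awk '{print length($2)}' temp` |
| `match(s,r)` | Test whether s contains a match for the pattern r; returns the position of the match, or 0 if there is none | `awk '$1=="J.Lulu" {print match($1, "u")}' temp` |
| `split(s,a,fs)` | Split s into the array a using the separator fs; returns the number of elements | `awk 'BEGIN {print split("12#345#6789", myarray, "#")}'` |
| `sprintf(fmt,exp)`  | Return exp formatted according to fmt | `awk '{a=sprintf("%10s%20s", $2, $3); print a}' test` |
| `sub(r,s)` | Replace the leftmost longest match of r in \$0 with s (only the first match is replaced) | `awk '{sub(/bio/, "student"); print $0}' test` |
| `substr(s,p)` | Return the suffix of s that starts at position p | `awk '{print substr($7, 3)}' test` |
| `substr(s,p,n)` | Return the substring of s that starts at position p and is at most n characters long | `awk '{print substr($7, 3, 3)}' test` |
| `tolower(s)` | Return s converted to lowercase | `awk 'BEGIN{ print tolower("TEST")}'` |
| `toupper(s)` | Return s converted to uppercase | `awk 'BEGIN{ print toupper("test")}'` |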
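
The examples in the table depend on a file named `temp` that is not shown here. As a self-contained alternative, the following sketch runs several of the functions on literal strings inside a `BEGIN` block; the sample strings are made up, and the expected value is noted in a comment on each line.

```shell
string_demo=$(awk 'BEGIN {
    s = "hello world"
    print index(s, "world")                            # 7
    print length(s)                                    # 11
    print substr(s, 7, 3)                              # wor
    print toupper(substr(s, 1, 5))                     # HELLO
    n = gsub(/o/, "0", s)                              # s is modified in place
    print n, s                                         # 2 hell0 w0rld
    print match("foobar", /o+/), RSTART, RLENGTH       # 2 2 2
    print split("12#345#6789", parts, "#"), parts[2]   # 3 345
}')
echo "$string_demo"
```

Note that `match` also sets `RSTART` and `RLENGTH`, and that positions in `awk` strings start at 1.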

## I/O functions
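<table>
<tr><td>Function name</td><td>Description</td><td>Example</td></tr>
<tr><td>command | getline [var]</td><td>Read the next line of a command's output into var (or into the current record if var is omitted)</td><td>awk 'BEGIN{while(("ls -l" | getline text) &gt; 0) print text}'</td></tr>
<tr><td>getline [var]</td><td>Read the next record of the input file into var (or into the current record if var is omitted)</td><td>awk 'BEGIN{while((getline) &gt; 0) print NF, $0 }' test</td></tr>
<tr><td>close(name)</td><td>Close an open file or pipe</td><td>Use with care</td></tr>
<tr><td>system(command)</td><td>Run a shell command and return its exit status</td><td>awk 'BEGIN{ system("uname -r") }'</td></tr>
</table>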
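
The following sketch combines `getline` from a command with `close`. The command string is an arbitrary example. Comparing the result of `getline` with `> 0` matters, because `getline` returns 0 at end of input and -1 on error, and a bare `while (cmd | getline)` would loop forever in the error case.

```shell
io_demo=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    # Read the command output line by line until end of input or an error.
    while ((cmd | getline line) > 0) count++
    close(cmd)            # release the pipe once it has been read
    print count, line     # line still holds the last line that was read
}')
echo "$io_demo"
```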

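### Control statements

| Statements | Description |
| --- | --- |
| break | Inside a `while` or `for` loop, exits the loop. |
| continue | Inside a `while` or `for` loop, moves on to the next iteration. |
| next | Reads the next input record and returns to the top of the script, skipping any remaining processing of the current record. |
| exit | Leaves the main input loop and transfers control to `END`, if an `END` rule exists. If there is no `END` rule, or if `exit` is used inside `END`, the script terminates. |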
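
A small sketch of `next` and `exit` working together, run on four made-up records in the same comma-separated layout as `numbers.txt`:

```shell
control_demo=$(printf '3,5,6,7\n2,3,1,0\n4,5,6,9\n2,3,4,4\n' | awk -F, '
$4 == 0 { next }     # skip records whose fourth field is 0
$1 == 2 { exit }     # stop reading input; the END rule still runs
{ sum += $1 }
END { print "sum:", sum, "records read:", NR }')
echo "$control_demo"
```

The second record is skipped by `next`, and the fourth record triggers `exit` before it is added, so only the first and third records contribute to the sum.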

## User-defined Function

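Here is an example of a function definition. Note that functions must be defined at the top level of the program, outside the `BEGIN`, main, and `END` blocks, because they are global. Variables assigned inside a function are global as well, as the example shows: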

In [32]:
awk '
function t(a) {
    b = a;
}
BEGIN {
    print "Before function run: b = ", b;
    t("hello"); 
    print "After function run: b = ", b;
}
'

Before function run: b =  
After function run: b =  hello
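
Only a function's parameters are local. A common convention for getting a local variable is therefore to declare it as an extra parameter that callers never pass. The sketch below contrasts the two cases; the function and variable names are invented for illustration.

```shell
scope_demo=$(awk '
function set_global(a)        { g = a }                  # g leaks into the global scope
function set_local(a,   tmp)  { tmp = a; return tmp }    # tmp is an extra, local parameter
BEGIN {
    set_global("hello")
    r = set_local("world")
    print "g=" g, "tmp=" tmp, "r=" r
}')
echo "$scope_demo"
```

After both calls, `g` is visible in `BEGIN` while `tmp` is still empty there.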


If you want to know more about `awk`, please refer to [awk cheat sheet](http://www.catonmat.net/download/awk.cheat.sheet.pdf).

## <font color="blue">Exercise</font>
1. Can you sort the file `numbers.txt` above numerically by its second field, in descending order, using `awk`?
2. Compute the row sums, row products, column sums, and column products.
3. Compute the average score for each person:
```
Tom     Music   86.5
Jerry   Math    97
Kitty   English 64
Tom     Math    77
Jerry   English 33
Kitty   Math    66
Kitty   Music   99
Jerry   Music   44
Tom     English 88    
Tom     History 77
Kitty   Geography 99
```

## Concluding remarks

In this tutorial we have introduced some of the important concepts in `awk`:
1. Some built-in variables
2. Variable definition
3. Conditional and loop statements
4. Built-in functions
5. User-defined function
6. Arrays in `awk`
7. Advanced usage for `awk`