
# Chapter 04: `sed`, `awk` and Perl regular expressions

___




## SECTION ONE: `sed` tutorial



`sed` is a stream editor: it reads an input stream (a text file or the output of a pipeline) line by line and applies text transformations to it.

Here is the basic command synopsis for `sed`:
```bash
sed SCRIPT INPUTFILE
```

The full synopsis also accepts options in front of the script:
```bash
sed [OPTIONS] SCRIPT INPUTFILE
```
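
As a minimal sketch of the two kinds of input mentioned above, the snippet below runs the same script once on a file and once on a pipeline. The sample data and the temporary file are assumptions made for illustration only; the `-n` option and the `p` command are covered later in this section.

```bash
# Create a small sample file (hypothetical data, not one of the tutorial files).
tmpfile=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$tmpfile"

# 1) Input from a file: print only line 2.
from_file=$(sed -n -e '2p' "$tmpfile")

# 2) Input from a pipeline: the same script, with no INPUTFILE argument.
from_pipe=$(printf 'alpha\nbeta\ngamma\n' | sed -n -e '2p')

echo "$from_file"   # beta
echo "$from_pipe"   # beta

rm -f "$tmpfile"
```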

In [3]:
sed --help

Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...

  -n, --quiet, --silent
                 suppress automatic printing of pattern space
  -e script, --expression=script
                 add the script to the commands to be executed
  -f script-file, --file=script-file
                 add the contents of script-file to the commands to be executed
  --follow-symlinks
                 follow symlinks when processing in place
  -i[SUFFIX], --in-place[=SUFFIX]
                 edit files in place (makes backup if SUFFIX supplied)
  -l N, --line-length=N
                 specify the desired line-wrap length for the `l' command
  --posix
                 disable all GNU extensions.
  -r, --regexp-extended
                 use extended regular expressions in the script.
  -s, --separate
                 consider files as separate rather than as a single continuous
                 long stream.
  -u, --unbuffered
                 load minimal amount

### Synopsis
```bash
sed -n -e '{COMMAND}' FILENAME
sed -n -e 'ADDR1{COMMAND}' FILENAME
sed -n -e 'ADDR1,ADDR2{COMMAND}' FILENAME
```

<font color='red'>NOTE</font>: The option `-e` is not compulsory, but we recommend using it, especially when you have multiple commands (scripts) to run.

In [4]:
sed -e 'p' re/numbers.txt
#sed -e '' re/numbers.txt



* Without `-n` to suppress the auto-print, each line is printed twice: once by the auto-print and once by the `p` command. Adding `-n` leaves only the output of `p`.
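
The doubling can be checked with a short sketch. The three-word sample below is an assumption that stands in for `re/numbers.txt`, so the snippet runs without the tutorial files.

```bash
# Sample input standing in for re/numbers.txt (one word per line).
sample=$(printf 'zero\none\ntwo\n')

# Without -n: auto-print emits each line once, and 'p' emits it again.
with_autoprint=$(printf '%s\n' "$sample" | sed -e 'p')

# With -n: only the explicit 'p' produces output.
without_autoprint=$(printf '%s\n' "$sample" | sed -n -e 'p')

printf '%s\n' "$with_autoprint"      # zero zero one one two two (one per line)
printf '%s\n' "$without_autoprint"   # zero one two (one per line)
```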

In [5]:
sed -n -e 'p' re/numbers.txt

zero
one
two
three
four
five
six
seven
eight
nine

### Addresses

1. When no address is given, the command is executed on every input line;
2. When only one address is given, the command is executed on each line that matches it;
3. When two addresses are given, the command is executed on every line in the range between them (both ends included).



#### <font color='blue'>Exercise</font>
Have a look at the manual of `sed` and work out what each of the following commands does:
```bash
sed -n -e '0~5p' re/numbers.txt
sed -n -e '1,+5p' re/numbers.txt
sed -n -e '1,~5p' re/numbers.txt
sed -n -e '1,1p' re/numbers.txt
```

* If no address is given, the command is executed on all the lines.

In [6]:
sed -n -e 'p' re/numbers.txt

zero
one
two
three
four
five
six
seven
eight
nine

* With a single address, the command is executed only on the line with that number.

In [7]:
sed -n -e '3p' re/numbers.txt

two


* The address can also be a regular expression, in which case the command is executed on every matching line.

In [8]:
sed -n -e '/^t/p' re/numbers.txt

two
three


* The two-address format can be `NUM1,NUM2`:

In [9]:
sed -n -e '3,5p' re/numbers.txt

two
three
four


In [10]:
sed -n -e '0,4p' re/numbers.txt

sed: -e expression #1, char 4: invalid usage of line address 0


* With the `NUM1,NUM2` format, `ADDR1` cannot be 0.
* But if `ADDR2` is a regular expression, `ADDR1` can be 0 (a GNU extension). The regular expression is then allowed to match on the very first line.

In [11]:
sed -n -e '0,/^z/p' re/numbers.txt

zero
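
A small sketch, assuming GNU `sed` (the `0,/REGEX/` form is a GNU extension), shows how `0,/REGEX/` differs from `1,/REGEX/`. The four-word input is made up so that the snippet is self-contained.

```bash
# Hypothetical sample input; the first line starts with "z".
nums=$(printf 'zero\none\ntwo\nthree\n')

# 0,/^z/ : the regex may already match on line 1, so the range ends there.
zero_start=$(printf '%s\n' "$nums" | sed -n -e '0,/^z/p')

# 1,/^z/ : the regex is only tested from line 2 onwards; no later line
# starts with "z", so the range runs to the end of the input.
one_start=$(printf '%s\n' "$nums" | sed -n -e '1,/^z/p')

printf '%s\n' "$zero_start"   # zero
printf '%s\n' "$one_start"    # zero one two three (one per line)
```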


The two addresses can also both be regular expressions, `/REGEX1/,/REGEX2/`:

In [12]:
sed -n -e '/^o/,/^o/p' re/numbers.txt

one
two
three
four
five
six
seven
eight
nine

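The range above runs to the end of the file because the end address is only tested from the line after the one that opened the range, and no later line starts with `o`. To print only the first line starting with `o`, we can group two commands: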

In [13]:
sed -n -e '/^o/{p;q}' re/numbers.txt

one


* Do NOT forget the braces in `{p;q}`. Without them the address `/^o/` applies only to `p`, while `q` is executed on every line, so `sed` quits on the first line before anything is printed.

In [14]:
sed -n -e '/^o/p;q' re/numbers.txt



In [15]:
sed -n -e '5,+6p' re/numbers.txt

four
five
six
seven
eight
nine

In [16]:
sed -n -e '3,~6!p' re/numbers.txt

zero
one
six
seven
eight
nine
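
The last two cells use the GNU address forms `ADDR1,+N` and `ADDR1,~N`. The sketch below, which assumes GNU `sed` and rebuilds the ten-word input inline, spells out what each form selects. In cell 15 the range `5,+6` would reach line 11, so it simply stops at the last line of the file.

```bash
# Inline copy of the ten words, standing in for re/numbers.txt.
nums=$(printf 'zero\none\ntwo\nthree\nfour\nfive\nsix\nseven\neight\nnine\n')

# ADDR1,+N : ADDR1 and the N lines that follow it (lines 5 to 7 here).
plus=$(printf '%s\n' "$nums" | sed -n -e '5,+2p')

# ADDR1,~N : from ADDR1 to the next line whose number is a multiple of N
# (lines 3 to 6 here).
multiple=$(printf '%s\n' "$nums" | sed -n -e '3,~6p')

# '!' after the address inverts the selection: every line except 3 to 6.
inverted=$(printf '%s\n' "$nums" | sed -n -e '3,~6!p')

printf '%s\n' "$plus"       # four five six (one per line)
printf '%s\n' "$multiple"   # two three four five (one per line)
printf '%s\n' "$inverted"   # zero one six seven eight nine (one per line)
```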

### Options

`sed` accepts the following options:

| Option | Description |
| --- | --- |
| -n, --quiet, --silent | Suppress automatic printing of pattern space. |
| -e SCRIPT, --expression=SCRIPT | Add the script to the commands to be executed. |
| -f SCRIPT-FILE, --file=SCRIPT-FILE | Use the scripts in SCRIPT-FILE. |
| -i[SUFFIX], --in-place[=SUFFIX] | Edit files in place (backup if SUFFIX is supplied). |
|  -l N, --line-length=N | Specify the desired line-wrap length for command `l`. |
| -r, --regexp-extended | Use extended regular expressions (ERE) in the script. |
| -s, --separate | Consider files as separate rather than as a single continuous stream. |
| -u, --unbuffered | Load minimal amounts of data from the input files and flush the output buffers more often. |
| -z, --null-data | Separate lines by NUL characters. |
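
As a hedged illustration of two options from the table, `-i` with a backup suffix and `-f`, the sketch below works on throwaway files in a temporary directory. The file names `demo.txt` and `edit.sed` are made up for this example.

```bash
workdir=$(mktemp -d)
printf 'zero\none\ntwo\n' > "$workdir/demo.txt"

# -i.bak edits demo.txt in place and keeps the original as demo.txt.bak.
sed -i.bak -e 's/one/ONE/' "$workdir/demo.txt"
edited=$(cat "$workdir/demo.txt")
backup=$(cat "$workdir/demo.txt.bak")

# -f reads the commands from a script file, one command per line:
# delete line 2, then upper-case the word "zero".
printf '2d\ns/zero/ZERO/\n' > "$workdir/edit.sed"
from_script=$(sed -f "$workdir/edit.sed" "$workdir/demo.txt.bak")

printf '%s\n' "$edited"        # zero ONE two (one per line)
printf '%s\n' "$from_script"   # ZERO two (one per line)

rm -rf "$workdir"
```

Running `-i` without a suffix overwrites the file with no backup, so the suffix form is the safer habit while experimenting.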



### Commands for `sed` SCRIPTS

`sed` has many different commands:

##### 1. Commands that take no address

| Command | Description | Example |
| --- | --- | --- |
| : label | Label for the `b`, `t` and `T` commands | : loop |
| #comment | The comment extends until the next newline | |
| } | The closing brace of a `{ }` block | {;} |

##### 2. Commands that take zero or one address

| Command | Description | Example |
| --- | --- | --- | 
| = | Print the current input line number | `sed -n '1,~5=' test` |
| i \TEXT   | Insert TEXT before the current address | `sed '1~5i \NEW' test` |
| a \TEXT   | Append TEXT after the current address | `sed '1~5a \NEW' test` |
| q [EXIT_CODE] | Quit immediately without any further processing; the pattern space is still printed if auto-print is enabled. | `sed -e '1~5p' -e '14q' test` |
| Q [EXIT_CODE] | Quit immediately without any further processing and without printing the pattern space | `sed -e '1~5p' -e '14Q' test` |
| r FILENAME | Append the whole content of FILENAME after the current line | `sed -n -e '1~5p;1~5r test2' test` |
| R FILENAME | Append one line read from FILENAME after the current line; each call reads the next line of the file | `sed -n -e '1~5p;1~5R test2' test` |

##### 3. Commands that accept an address range

| Command | Description | Example |
| --- | --- | --- |
| c \TEXT   | Replace the selected lines with TEXT | `sed '1,5c \NEW' test` |
| d | Delete the pattern space and start a new cycle. | `sed -n -e '1,5d;p' test`  |
| D | Delete the first line in the pattern space if the pattern space contains a newline. | `sed -e '1,5{N;D}' test` |
| n | If auto-print is enabled, print the pattern space; then replace it with the next line of input. | `sed -e 'n;p' test` |
| N | Append the next line to the pattern space. | `sed -n -e 'N;p' test` |
| g | Copy the hold space to the pattern space. |                        |
| G | Append the hold space to the pattern space. | `sed -n -e '1!G;h;$p' test3` |
| h | Copy the pattern space to the hold space. | `sed -n -e '1!G;h;$p' test3`  |
| H | Append the pattern space to the hold space. |                       |
| x | Exchange the pattern space and the hold space. |                    |
| b LABEL | Branch to LABEL; if LABEL is omitted, branch to end of script.|        |
| t LABEL | If  a  s///  has done a successful substitution since the last input line was  read  and  since the  last t or T command, then branch to LABEL; if LABEL is omitted, branch to end of script. |             |
| T LABEL | If  NO  s///  has done a successful substitution since the last input line was  read  and  since the  last t or T command, then branch to LABEL; if LABEL is omitted, branch to end of script.|              |
| s/REGEX/REPLACE/ | Replace the text matched by REGEX with REPLACE. |                    |
| y/SRC/DEST/  | Transliterate each char in SRC to the corresponding char in DEST. Ranges such as `a-z` are NOT supported; every char must be listed. | `sed 'y/abc/ABC/' test` |
| p |   Print the current pattern space.             | `sed -n -e '1~5N;p' test` |
| P |   Print the first line of the current pattern space.   | `sed -n -e '1~5N;P' test` |
| w FILENAME | Write the pattern space into FILENAME.|  `sed -n -e '1~5w test3' test`  |
| W FILENAME | Write the first line of the pattern space into the FILENAME. |  `sed -e '1~5N;W test3' test` |
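
The lowercase and uppercase variants in the table are easy to mix up. The following sketch contrasts `n` with `N` and `p` with `P` on a six-word input that is invented for this example; it is meant as an illustration, not as the only way to use these commands.

```bash
# Hypothetical six-line input (an even number of lines keeps the pairs complete).
nums=$(printf 'zero\none\ntwo\nthree\nfour\nfive\n')

# n: replace the pattern space with the next line, then 'p' prints it.
# With -n this prints every second line.
every_second=$(printf '%s\n' "$nums" | sed -n -e 'n;p')

# N: append the next line, so the pattern space holds two lines;
# s/\n/ / then joins each pair onto one line.
pairs=$(printf '%s\n' "$nums" | sed -e 'N;s/\n/ /')

# P: print only up to the first newline of the two-line pattern space.
first_of_pair=$(printf '%s\n' "$nums" | sed -n -e 'N;P')

printf '%s\n' "$every_second"    # one three five (one per line)
printf '%s\n' "$pairs"           # "zero one", "two three", "four five"
printf '%s\n' "$first_of_pair"   # zero two four (one per line)
```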


<font color="red">NOTE</font>: We can add "!" after the address and before any command. This negates the address: the command is then executed on every line that the address does NOT select.


### <font color='blue'>Exercise</font>
#### 1. What is the difference between `sed -n -e '1,5{p;N;D}' test` and `sed -n -e '1,5{p;N;d}' test`?
#### 2. Here is a complete example of *commify* (Add commas as thousands separator)

(1) This is the file `numbers.txt`:
```
1
12
123
1234
12345
123456
1234567
12345678
123456789
1234567890
1234567890.1234
+1234567890.1234
-1234567890.1234
$1234567890.1234
```

(2) `sed` command:
```bash
sed -r ':a; s@(^|[^0-9.])([0-9]+)([0-9]{3})@\1\2,\3@g;t a' numbers.txt
```
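
To see how the loop works, the command can be wrapped in a small shell function and fed one number at a time. This is a sketch assuming GNU `sed`; the function name `commify` is a hypothetical helper introduced only for this illustration.

```bash
# Hypothetical helper that wraps the command above so it reads from stdin.
commify() {
  sed -r ':a; s@(^|[^0-9.])([0-9]+)([0-9]{3})@\1\2,\3@g; t a'
}

# Each pass of the loop inserts one comma into the number:
#   1234567 -> 1234,567 -> 1,234,567
# 't a' jumps back to ':a' as long as the last s@@@ changed something.
printf '1234567\n' | commify            # 1,234,567

# [^0-9.] keeps a match from starting right after a decimal point,
# so the fractional digits are left alone.
printf '$1234567890.1234\n' | commify   # $1,234,567,890.1234
```

Using `@` as the delimiter of `s` is only a matter of taste here; any character that does not occur in the pattern or the replacement would do.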

#### 3. Rewrite the file so that every number in it has exactly two decimal places.
```
name marks grade
abc 50.5 CB
def 45 CC
ghhi 55 CA
jkl 85 A
mno 75.0 BA
pqr 77 BA
stu 89.50 A
```

#### 4. A file contains the numbers 1 to 100, one per line. Rewrite the file so that all the numbers are printed on one single line, separated by TABs.

In [17]:
sed '1,+5!c\NEW' re/numbers.txt

zero
one
two
three
four
five
NEW
NEW
NEW
NEW


In [28]:
sed -n -e '1!G;$!h;${s/\n/ /g;p}' re/numbers.txt

nine eight seven six five four three two one zero


Here is the explanation of the above command:
* On every line except the first, `1!G` appends the hold space to the pattern space, so the current line ends up in front of all the earlier lines.
* On every line except the last, `$!h` copies the pattern space back to the hold space.
* On the last line, `s/\n/ /g` replaces all the newlines with spaces, and `p` prints the pattern space.
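
The steps above can be traced on a three-line input. The input `a`, `b`, `c` is an assumption chosen to keep the trace short.

```bash
# Trace of the reversal (the hold space starts empty):
#   line 1: pattern "a"                 -> $!h -> hold "a"
#   line 2: 1!G -> pattern "b\na"       -> $!h -> hold "b\na"
#   line 3: 1!G -> pattern "c\nb\na"    -> s/\n/ /g -> "c b a" -> p
reversed=$(printf 'a\nb\nc\n' | sed -n -e '1!G;$!h;${s/\n/ /g;p}')
echo "$reversed"   # c b a

# Without the substitution the newlines are kept, so the lines are printed
# in reverse order, one per line.
reversed_lines=$(printf 'a\nb\nc\n' | sed -n -e '1!G;h;$p')
printf '%s\n' "$reversed_lines"   # c b a (one per line)
```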

In [18]:
sed -n -e '1h;1!H;${g;s/\n/ /g;p}' re/numbers.txt

zero one two three four five six seven eight nine

Here is the explanation:
* For line 1, `1h` will copy the pattern space to the hold space (`1!H` is skipped on this line).
* For the other lines, `1!H` will append the pattern space to the hold space.
* For the last line `$`, `g` will copy the hold space to the pattern space, and `s/\n/ /g` will replace all the newlines with a space, and then print out the pattern space.
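
The same accumulate-then-join pattern works with any separator. As a minimal sketch on an invented three-line input, the variant below joins the lines with commas instead of spaces.

```bash
# Trace on a three-line input:
#   line 1: 1h  -> hold "a"
#   line 2: 1!H -> hold "a\nb"
#   line 3: 1!H -> hold "a\nb\nc", then g copies it back,
#           s/\n/,/g joins the lines and p prints the result.
joined=$(printf 'a\nb\nc\n' | sed -n -e '1h;1!H;${g;s/\n/,/g;p}')
echo "$joined"   # a,b,c
```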