# Lesson 01: Basics of Awk

If you haven't read the Awk [man page](https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/awk.1.html), you should start there. It's helpful! Some highlights: 

> awk - pattern-directed scanning and processing language

> `awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file ... ]`

> _Awk_ scans each input _file_ for lines that match any of a set of patterns specified literally in _prog_ or in one or more files specified as __-f__ _progfile_.

> With each pattern there can be an associated action that will be performed when a line of a _file_ matches the pattern.

> Each line is matched against the pattern portion of every pattern-action statement; the associated action is performed for each matched pattern

> A pattern-action statement has the form `pattern {action}`.

> A missing `{ action }` means print the line; a missing pattern always matches. 

I created a simple example file to demonstrate basic Awk:

In [1]:
cat data/letters.txt

a
bb
ccc
dddd
ggg
hh
i

### A Basic Pattern

If we match lines longer than two characters and use the implicit print action, we get:

In [2]:
awk 'length($0) > 2' data/letters.txt

ccc
dddd
ggg


`$0` is a built-in variable that contains the line.
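
A pattern can be any expression, including ones combined with `&&` or `||`. As a small sketch, the following keeps lines that are longer than one character but shorter than four. The input is piped in with `printf` as a stand-in for `data/letters.txt`, so the example runs without the file.

```shell
# Inline copy of the letters file (an assumption, so the example is self-contained).
# With no action given, awk prints every line for which the pattern is true.
printf 'a\nbb\nccc\ndddd\nggg\nhh\ni\n' | awk 'length($0) > 1 && length($0) < 4'
```

This should print `bb`, `ccc`, `ggg`, and `hh`, one per line.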

### A Basic Function

If we leave out a pattern, we will match every line. A trivial action would be to print each line:

In [3]:
awk '{ print }' data/letters.txt

a
bb
ccc
dddd
ggg
hh
i


Using the `length` function as our action, we can get the length of each line:

In [4]:
awk '{ print length }' data/letters.txt

1
2
3
4
3
2
1


Called without an argument, `length` implicitly acts on the whole line. We can be more explicit if we want:

In [5]:
awk '{ print length($0) }' data/letters.txt

1
2
3
4
3
2
1


Awk has two special patterns, `BEGIN` and `END`, for executing code before the file input begins and after it is complete.

In [6]:
awk 'BEGIN { print "HI" } { print $0 } END { print "BYE!" }' data/letters.txt

HI
a
bb
ccc
dddd
ggg
hh
i
BYE!
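
`BEGIN` and `END` are handy for setting up and reporting a running total. Here is a minimal sketch that counts lines and characters. The variable name `total` is my own choice, `NR` is Awk's built-in count of records read so far, and the input is an inline copy of the letters file.

```shell
printf 'a\nbb\nccc\ndddd\nggg\nhh\ni\n' | awk '
    BEGIN { total = 0 }             # runs once, before any input is read
          { total += length($0) }   # runs for every input line
    END   { print NR " lines, " total " characters" }   # runs once, after the last line
'
```

For this input it should report `7 lines, 16 characters`.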


We can have more control over printing by using `printf`. The following example comes from the [GNU Awk User's Guide](https://www.gnu.org/software/gawk/manual/html_node/Printf-Examples.html). 

In [4]:
awk 'BEGIN { printf "%-10s %s\n", "Name", "Number"
             printf "%-10s %s\n", "----", "------" }
           { printf "%-10s %s\n", $1, $2 }' data/mail-data

Name       Number
----       ------
Amelia     555-5553
Anthony    555-3412
Becky      555-7685
Bill       555-1675
Broderick  555-0542
Camilla    555-2912
Fabius     555-1234
Julie      555-6699
Martin     555-6480
Samuel     555-3430
Jean-Paul  555-2127
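
`printf` also handles numeric formats. The sketch below uses made-up data (item, quantity, unit price), not a file from this lesson, to show `%d` with a width and `%f` with a fixed number of decimals.

```shell
# Hypothetical three-column input: name, quantity, unit price.
# %-8s left-justifies in 8 columns, %3d right-justifies an integer in 3,
# and %6.2f prints a number in 6 columns with 2 decimal places.
printf 'apples 3 0.5\npears 12 1.25\n' |
    awk '{ printf "%-8s %3d %6.2f\n", $1, $2, $2 * $3 }'
```

The last column is the quantity multiplied by the unit price, so the two rows should end in `1.50` and `15.00`.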


### Combining Patterns and Functions

Of course, patterns and actions can be combined so that the action is only performed when the pattern matches.

From the man page:

> A pattern-action statement has the form

> ```pattern { action }```

We can print the length of all lines longer than 2 characters.

In [7]:
awk 'length($0) > 2 { print length($0) }' data/letters.txt

3
4
3


Actually, we don't have to limit Awk to just one pattern! We can have arbitrarily many pattern-action statements separated by a semicolon or a newline:

In [8]:
awk 'length($0) > 2 { print "Long:  " length($0) }; length($0) < 2 { print "Short: " length($0) }' data/letters.txt

Short: 1
Long:  3
Long:  4
Long:  3
Short: 1
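
The patterns don't have to be mutually exclusive. As the man page says, the action is performed for each matched pattern, so a line that matches two patterns triggers both actions, in the order they are written. A small sketch with inline input:

```shell
# "ccc" satisfies both patterns, so it is reported twice.
printf 'a\nbb\nccc\n' | awk '
    length($0) > 1 { print $0 ": more than one character" }
    length($0) > 2 { print $0 ": more than two characters" }
'
```

Here `a` matches neither pattern and prints nothing, `bb` matches only the first, and `ccc` matches both.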


### Multiple Fields

Awk is designed for easy handling of data with multiple fields per row. The field delimiter can be specified with the `-F` option.

Here's a simple space-delimited file:

In [9]:
awk '{ print }' data/field_data.txt

Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.


If we specify the field separator, we can print the second field from each row:

In [10]:
awk -F " " '{ print $2 }' data/field_data.txt

are
are
is
so


We don't get an error if a line doesn't have the referenced field; it just shows up as blank:

In [11]:
awk -F " " '{ print $4 }' data/field_data.txt




you.
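
If the blank lines are a problem, the built-in variable `NF` (the number of fields on the current line) can be used as a guard. A minimal sketch, using two lines of the poem as inline input:

```shell
# Print the fourth field only when the line has at least four fields;
# otherwise report how many fields were found.
printf 'Roses are red,\nAnd so are you.\n' | awk '
    NF >= 4 { print $4 }
    NF <  4 { print "(only " NF " fields)" }
'
```

The first line has three fields and the second has four, so this should print `(only 3 fields)` followed by `you.`.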


When the separator is more than a single character, it is interpreted as a regular expression.

In [12]:
awk -F "((so )?are|is) " '{print "Field 1: " $1 "\nField 2: " $2}' data/field_data.txt

Field 1: Roses 
Field 2: red,
Field 1: Violets 
Field 2: blue,
Field 1: Sugar 
Field 2: sweet,
Field 1: And 
Field 2: you.
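
A more everyday use of a regular-expression separator is splitting on any of several delimiters. The sketch below uses a made-up input line and a separator that matches a comma or a semicolon followed by optional spaces.

```shell
# [,;] matches either delimiter; " *" swallows any spaces that follow it.
printf 'red, green;blue\n' | awk -F '[,;] *' '{ print $1 "|" $2 "|" $3 }'
```

This should print `red|green|blue`, with the delimiters and the space after the comma removed.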


### Regular Expressions

Patterns can also be regular expressions, not just comparisons that use built-in functions. From the man page:

> Regular expressions are as defined in [re_format(7)](https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man7/re_format.7.html#//apple_ref/doc/man/7/re_format).  Isolated regular expressions in a pattern apply to the entire line.

We can use a regular expression to find all the words in the Unix words file (`/usr/share/dict/words`) with five vowels in a row.

In [26]:
awk '/[aeiou]{5}/' /usr/share/dict/words

cadiueio
Chaouia
euouae
Guauaenok
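
An isolated regular expression is tested against the whole line. To test a single field instead, Awk provides the `~` (matches) and `!~` (does not match) operators. A small sketch, using the poem from the previous section as inline input:

```shell
# Print the first word of every line whose first field starts with A or S.
printf 'Roses are red,\nViolets are blue,\nSugar is sweet,\nAnd so are you.\n' |
    awk '$1 ~ /^[AS]/ { print $1 }'
```

Only `Sugar` and `And` begin with one of those letters, so those are the two words printed.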


### Passing Variables into a Program

The `-v` option for Awk allows us to pass variables into the program. For example, we could use it to hard code constants.

In [13]:
awk -v pi=3.1415 'BEGIN { print pi }'

3.1415


You can also use `-v` to pass Bash variables as Awk variables. Quoting the Bash variable keeps a value containing spaces intact:

In [14]:
awk -v user="$USER" 'BEGIN { print user }'

tdhopper
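
A variable passed with `-v` can be used anywhere in the program, including in a pattern. In this sketch, `min` is a hypothetical shell variable holding a length threshold, and the input is piped in inline.

```shell
min=3   # hypothetical threshold, set in the shell

# The shell value is handed to awk as the awk variable "min",
# which the pattern then compares against each line length.
printf 'a\nbb\nccc\ndddd\n' | awk -v min="$min" 'length($0) >= min'
```

With `min=3` this should print `ccc` and `dddd`. Changing the shell variable changes the filter without editing the Awk program.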


### If-else Statements

If-else statements in Awk are of the form:

    if (condition) then-body [else else-body]
  
For example:

In [25]:
printf "1\n2\n3\n4\n" | awk '{
    if ($1 % 2 == 0) print $1, "is even"
    else print $1, "is odd"
}'

1 is odd
2 is even
3 is odd
4 is even
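
The else-body can itself be an `if` statement, which gives the usual `else if` chain. A minimal sketch with made-up thresholds and inline input:

```shell
# Classify each number; the first condition that holds wins.
printf '1\n15\n150\n' | awk '{
    if ($1 < 10)       print $1, "is small"
    else if ($1 < 100) print $1, "is medium"
    else               print $1, "is large"
}'
```

This should print `1 is small`, `15 is medium`, and `150 is large`.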


### Looping

Awk includes several looping statements: `while`, `do while`, and `for`.

They take the expected C-ish syntax. A `do while` loop checks its condition after the body, so the body always runs at least once.

In [5]:
awk 'BEGIN {
    i = 0
    while (i < 5) { print i; i += 1 }
}'

0
1
2
3
4


In [13]:
awk 'BEGIN {
    i = 0
    do { print i; i += 1 } while (i < 0)
}'

0


In [17]:
awk 'BEGIN {
    for (i = 0; i < 5; i++) print i
}'

0
1
2
3
4
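
Loops are not limited to `BEGIN` blocks. A common idiom is to walk over the fields of each line, using the built-in `NF` (number of fields) as the upper bound and `$i` to read field number `i`. A small sketch with one line of inline input:

```shell
# Fields are numbered from 1, so the loop runs from 1 to NF inclusive.
printf 'Roses are red,\n' | awk '{ for (i = 1; i <= NF; i++) print i ": " $i }'
```

This should print `1: Roses`, `2: are`, and `3: red,` on separate lines.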


`for` can also loop through the keys of an array, which we will see in the next lesson.