# Awk

Notebook for `awk` and its various features. Awk is good for text processing and data extraction from text files and data streams.

In [None]:
awk

Awk can be used to parse text files and split tokens.

The `-F` option sets the field separator, which is also available as the built-in variable `FS`. Whitespace is the default `FS`. For a typical CSV file like `MOCK_DATA.csv`, using `,` is generally fine. The cell below prints the first field of each record and pipes the output to `head` to limit it to the first 10 rows.

Once awk splits the record using the field separator, `$1` is the first field, `$2` the second and so on. `$0` is the whole record.


In [None]:
awk -F',' '{print $1}' MOCK_DATA.csv | head -10
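
As a minimal sketch of field splitting, the following uses a small inline sample rather than `MOCK_DATA.csv` (the sample rows and the shell variable names are made up for illustration). It suggests that `-F,` and `-v FS=,` behave the same way, and shows how `$1`, `$2`, and `$0` relate to the record.

```shell
# Hypothetical two-line sample; not taken from MOCK_DATA.csv.
sample='id,name,genre
1,Ada,Music'

# -F, sets the field separator from the command line.
with_F=$(printf '%s\n' "$sample" | awk -F, '{print $2}')

# -v FS=, assigns the same built-in variable explicitly.
with_FS=$(printf '%s\n' "$sample" | awk -v FS=, '{print $2}')

# $0 is the whole record; $1 is its first field.
whole=$(printf '%s\n' "$sample" | awk -F, 'NR==2 {print $1 " <- " $0}')

echo "$with_F"
echo "$with_FS"
echo "$whole"
```

Both of the first two commands print the second column, so either spelling can be used; `-F` is simply shorter.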

To skip the first line, use `NR>1` (NR = current record number):

In [None]:
awk -F, 'NR>1 {print $1}' MOCK_DATA.csv | head -10
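
To make the role of `NR` more concrete, here is a small sketch on made-up inline data (not `MOCK_DATA.csv`). It prints the record number next to the second field, so it is visible that the header, record 1, is the one being skipped.

```shell
# Hypothetical sample with a header row; not taken from MOCK_DATA.csv.
sample='id,name
1,Ada
2,Grace'

# NR>1 is a pattern: the action only runs for records after the first.
no_header=$(printf '%s\n' "$sample" | awk -F, 'NR>1 {print NR ": " $2}')

echo "$no_header"
```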

Awk also has a built-in variable called `NF` (the number of fields in the current record).

Not all rows have the same number of fields. Why is that?

In [None]:
awk -F, '{print NF}' MOCK_DATA.csv | head -10

This example reveals a challenge for awk: the field separator also appears inside the data itself. Newer versions of GNU awk (5.3+) support `--csv`; otherwise such files are a bit more painful to process (or can be re-exported as TSV).
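
The problem can be reproduced on a tiny made-up sample (the rows below are hypothetical, not from `MOCK_DATA.csv`). This sketch assumes plain comma splitting with `-F,` and shows the field count changing when a quoted value contains a comma.

```shell
# Hypothetical sample: the last row has a comma inside a quoted field.
sample='id,name,city
1,Ada,London
2,"Hopper, Grace",Arlington'

# Splitting on every comma treats the quoted comma as a separator too,
# so the last record reports one field more than the others.
counts=$(printf '%s\n' "$sample" | awk -F, '{print NF}')

echo "$counts"
```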

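In GNU awk versions before 5.3 (4.0 and later), `FPAT` can be used to keep quoted text together. It describes what a field looks like rather than what separates fields: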

In [None]:
awk -F, -v FPAT='"[^"]*"|[^,]*' 'NF!=8' MOCK_DATA.csv

^ No output means no record has a field count other than 8.

Awk can also do pattern matching on records and fields. The first cell below prints the records that match both `Music` and `Wine` anywhere in the line; the second prints only the records whose third field is exactly `Music`:

In [None]:
awk -F, '/Music/ && /Wine/' MOCK_DATA.csv | head -10

In [None]:
awk -F, '$3 == "Music"' MOCK_DATA.csv | head -10
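
The difference between matching a whole record and testing a single field can be sketched on made-up data (the rows below are hypothetical, not from `MOCK_DATA.csv`). The third command uses the standard `~` operator, which is not used elsewhere in this notebook, to apply a regex to one field only.

```shell
# Hypothetical sample; not taken from MOCK_DATA.csv.
sample='1,Ada,Music
2,Grace,Music & Wine
3,Alan,Wine'

# Record-level regexes combined with &&: both must match somewhere in $0.
both=$(printf '%s\n' "$sample" | awk -F, '/Music/ && /Wine/ {print $1}')

# Exact string comparison on one field: "Music & Wine" does not qualify.
exact=$(printf '%s\n' "$sample" | awk -F, '$3 == "Music" {print $1}')

# Regex match restricted to the third field with the ~ operator.
field_re=$(printf '%s\n' "$sample" | awk -F, '$3 ~ /Music/ {print $1}')

echo "both: $both"
echo "exact: $exact"
echo "field_re: $field_re"
```

With `==` the field must equal the string exactly, while a regex (on the record or on a field) matches any occurrence of the pattern.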

In [None]:
awk -W version