### Understanding AWK


Exercises taken from a variety of different sources. 

### Intro

_I'll use `awk`, AWK, awk interchangeably in this tutorial._

<hr>

# <center> ! </center>

If you're looking for a simpler approach to text/pattern processing (searching, extracting), I'd recommend sticking to `grep` (and `sed` for replacing).

<br>

<hr>


#### How does AWK work?

First of all, `awk` isn't just a function (or a tool) but a programming language. It can process instructions as simple or as complex as we want. It has a steep learning curve, and I don't recommend learning it in depth unless you have a strong reason (or unless you're stubborn, like me).

AWK works best with formatted text (e.g., tables) that has a fixed number of columns and a consistent separator (e.g., comma for CSV files, tab for TSV).

It reads one line at a time, performs an operation on it and prints the result to the screen.

On each line it splits the columns on a separator character (e.g., comma, space or tab) and assigns each column to a variable: `$N` (N is a number between 1 and the total number of columns in the table).

Let's start by looking at a simple table: the first column holds the row names and the remaining columns hold values.

In [372]:
head cols.txt

rowA	1	1	9
rowB	2	7	10
file3	3	6	20
file4	4	5	99
line_a	12	13	144
line_b	15	16	177


In [373]:
# A simple instruction: print only values from column 2
awk '{print $2}' cols.txt

1
2
3
4
12
15


In [374]:
# In this example we assume there is no header (column names) in the table. 
# If the table had column names we can add an instruction to skip the first `n` lines.

awk 'NR>1 {print $2}' cols.txt #Skips the first line

2
3
4
12
15


In [375]:
## The field (column) variable is indicated by the symbol $. A value of 0 means the whole line.
# In this example, although we cannot see it, awk actually prints line by line instead of the
# whole table like `cat` does.

awk '{print $0}' cols.txt

rowA	1	1	9
rowB	2	7	10
file3	3	6	20
file4	4	5	99
line_a	12	13	144
line_b	15	16	177


In [376]:
# Print the first and third columns:
awk '{print $1,$3}' cols.txt

rowA 1
rowB 7
file3 6
file4 5
line_a 13
line_b 16


In [377]:
# Print the first and third columns separated by a tab
awk '{print $1 "\t" $3}' cols.txt

rowA	1
rowB	7
file3	6
file4	5
line_a	13
line_b	16


In [378]:
# Print the 1st,3rd and 4th columns separated by a comma (useful to create csv tables)
awk '{print $1 "," $3","$4}' cols.txt # The spacing in the commands doesn't matter much, but 
# using spaces helps to keep things tidy.

rowA,1,9
rowB,7,10
file3,6,20
file4,5,99
line_a,13,144
line_b,16,177


### regex (regular expressions)

*they can save your life (and make you look cool!) <sup>[citation needed]</sup>*

https://www.xkcd.com/208/

\- - - <br>

Print lines that match a pattern.

Some basic rules:

* The pattern goes between forward slashes: `/pattern/`
* `^pattern` matches only at the START of the text
* `pattern$` matches only at the END of the text
* `p[abc]ttern` matches **a** or **b** or **c** between *p* and *ttern*
    * Possible matches would be: pattern, pbttern, pcttern
    * but not combinations of the letters in the brackets: <strike>p**ab**ttern, p**bc**ttern</strike>
* `pat1|pat2` matches either of the patterns separated by the vertical bar.
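
The rules above can be tried on a tiny scratch file. This is only a sketch: the temporary file `demo` and the variable names are made up for illustration, and the file simply repeats the row names of `cols.txt`.

```shell
# Scratch file with the same row names as cols.txt (hypothetical temp file)
demo=$(mktemp)
printf 'rowA\nrowB\nfile3\nfile4\nline_a\nline_b\n' > "$demo"

# ^ anchors the match at the start: names beginning with "r"
starts_with_r=$(awk '/^r/' "$demo")
# $ anchors the match at the end: names ending in a digit
ends_with_digit=$(awk '/[0-9]$/' "$demo")
# [AB] matches exactly one of the listed characters
a_or_b=$(awk '/row[AB]/' "$demo")
# | matches either alternative
file_or_line=$(awk '/^file|^line/' "$demo")

printf '%s\n' "$starts_with_r" "---" "$ends_with_digit" "---" "$a_or_b" "---" "$file_or_line"
rm -f "$demo"
```

Each variable holds the matching lines, so the four rules can be compared side by side.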

https://www.digitalocean.com/community/tutorials/using-grep-regular-expressions-to-search-for-text-patterns-in-linux

https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

In [379]:
# If the first column has an 'a' (anywhere)
awk '$1 ~ /a/ {print $0}' cols.txt

line_a	12	13	144


In [380]:
# We can also omit the print command. When a condition has no action,
# awk's default action is to print the whole line.
awk '$1 ~ /a/' cols.txt

line_a	12	13	144


In [381]:
# Look for an "r" at the START of the text:
awk '$1 ~ /^r/' cols.txt

rowA	1	1	9
rowB	2	7	10


In [382]:
# Look for an "a" at the END of the first column.
# The pattern rules are pretty much the same as in other regex tools (grep, perl, R)
awk '$1 ~ /a$/' cols.txt

line_a	12	13	144


In [383]:
# We can negate a match with !~
# Print rows whose first column doesn't contain w or n:
awk '$1 !~ /[wn]/' cols.txt

file3	3	6	20
file4	4	5	99


In [384]:
## And print specific columns
# Print the 2nd and 4th columns of rows whose first column doesn't contain w or n:
awk '$1 !~ /[wn]/ {print $2,$4}' cols.txt

3 20
4 99


In [385]:
## A wide search
# Print rows that contain a 9 in ANY column
awk '/9/ {print $0}' cols.txt

rowA	1	1	9
file4	4	5	99


In [386]:
## A wide search can also be negated
# Print rows that DON'T contain a 9 in ANY column
awk '!/9/ {print $0}' cols.txt

rowB	2	7	10
file3	3	6	20
line_a	12	13	144
line_b	15	16	177


<hr>

### exact matches

In [387]:
## Using == we can look for columns that have an EXACT match. 
# Since we're doing a more direct search, we need to specify a column.
awk '$4 == "99" {print $0}' cols.txt

file4	4	5	99


In [388]:
# Negate a search
awk '$4 != "99" {print $0}' cols.txt

rowA	1	1	9
rowB	2	7	10
file3	3	6	20
line_a	12	13	144
line_b	15	16	177


In [390]:
## Print specific columns with a defined format
awk '$4 == "99" {print $1","$4}' cols.txt

file4,99


In [391]:
## Even use multiple conditions with && (AND)
awk '$3 == "5" && $2 == "4" {print $1 "," $2 "," $3}' cols.txt

file4,4,5


In [393]:
# || (OR)
awk '$3 == "5" || $4 == "3" {print $0}' cols.txt

file4	4	5	99


In [394]:
## Combine exact and patterns matching:
# Print lines whose first column has an "a" or whose 4th column is 99
awk '$1 ~ /a/ || $4 == "99" {print $0}' cols.txt

file4	4	5	99
line_a	12	13	144


In [395]:
## Relational operators also work:
awk '$2 > 3 {print $0}' cols.txt

file4	4	5	99
line_a	12	13	144
line_b	15	16	177


In [396]:
#### Watch out, we can modify the values of a column if we're not careful:
## This searches for lines whose second column is 1
awk '$2 == 1 {print $0}' cols.txt

rowA	1	1	9


In [397]:
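## == is important. A single '=' is an assignment: it sets the second column to 1 on every line.
# Modifying a field also rebuilds the line with the output separator (a space by default), so the tabs are lost.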
awk '$2 = 1 {print $0}' cols.txt

rowA 1 1 9
rowB 1 7 10
file3 1 6 20
file4 1 5 99
line_a 1 13 144
line_b 1 16 177


In [398]:
#### The order of the operations matters: && binds tighter than ||.
# The instructions are evaluated as:
## Either 
# a) Find an 'a' on the first column 
# OR
# b) Find a 99 on the 4th column AND a value greater than 1 on the second  column.
awk '$1 ~ /a/ || $4 == "99" && $2 > 1{print $0}' cols.txt

file4	4	5	99
line_a	12	13	144


In [399]:
## Either 
# a) Find an 'a' on the first column AND a value greater than 1 on the second column.
# OR
# b) Find a 99 on the 4th column 

awk '$1 ~ /a/ && $2 > 1 || $4 == "99" {print $0}' cols.txt

file4	4	5	99
line_a	12	13	144


In [400]:
## We can use parentheses to group logical operations together.

awk '$1 ~ /a/ && ( $2 > 1 || $4 == "99" ) {print $0}' cols.txt

line_a	12	13	144


<hr>

### Arithmetic Operations

### + - * /

A powerful feature of `awk` is that it can perform arithmetic as part of the parsing process. This can be combined with pattern matching.

In [401]:
## Let awk tell you how much 1 +1 is: 
awk '{ print 1 + 1 }' cols.txt

## awk operates on a line-by-line basis, 
# we're instructing awk to, on each line, print the sum of 1 + 1.

2
2
2
2
2
2


In [402]:
## Now print it only once:
# We'll review the structure of an awk program later:
awk ' END { print 1 + 1 }' cols.txt

2


In [403]:
awk ' BEGIN { print 1 + 1 }' cols.txt

2


In [404]:
## We can sum values of columns:
awk '{ print $2 + $3 }' cols.txt

2
9
9
9
25
31


In [405]:
# Print the individual values per column and add a third column with the sum.
awk '{ print $2, $3, $2 + $3 }' cols.txt

1 1 2
2 7 9
3 6 9
4 5 9
12 13 25
15 16 31


In [406]:
## Use pattern matching and logic operations to select the lines whose values we want to sum:
awk '$1 ~ /a/ && ( $2 > 1 || $4 == "99" ) {print $0}' cols.txt

line_a	12	13	144


In [407]:
## Sum columns 2 and 3:
awk '$1 ~ /a/ && ( $2 > 1 || $4 == "99" ) {print $2 + $3}' cols.txt

25


#### Cumulative sums.

Use of variables

In [408]:
## We can store values in variables for later use:
# Create a variable to keep track of how many lines we've processed: 
awk 'BEGIN {counter=0} {counter += 1} {print $0,counter}' cols.txt

rowA	1	1	9 1
rowB	2	7	10 2
file3	3	6	20 3
file4	4	5	99 4
line_a	12	13	144 5
line_b	15	16	177 6


In [409]:
# We can skip the initialization of the variable:
awk '{counter += 1} {print $0 "\t" counter}' cols.txt

rowA	1	1	9	1
rowB	2	7	10	2
file3	3	6	20	3
file4	4	5	99	4
line_a	12	13	144	5
line_b	15	16	177	6


In [410]:
## Combine with pattern matching:
# If column 2 is 1:
## a) print the line.
## b) aggregate the values of column 4
## At the end of the run, print the cumulative sum of column 4 values for rows whose second column is 1.
awk '$2 == "1" { print; sum += $4 } END { print sum }' cols.txt

rowA	1	1	9
9


In [411]:
# Initialize a counter to keep track of the number of occurrences and append it after a tab.
# At the end of the run, leave some blank lines and print the cumulative sum of column 4 for those lines.
awk 'BEGIN {counter=0} $2 == "1" {counter+=1; print $0 "\t" counter; sum += $4;  } END { print "\n"; print sum }' cols.txt

rowA	1	1	9	1


9


#### Means.

Use of the counter to calculate means of a column.

In [412]:
# Initialize a counter to keep track of the number of occurrences and append it after a tab.
# At the end of the run, leave some blank lines and print the mean (sum divided by the counter).
awk 'BEGIN {counter=0} $2 == "1" {counter+=1; print $0 "\t" counter; sum += $4;  } END { print "\n"; print sum/counter }' cols.txt

rowA	1	1	9	1


9
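
One caveat with `sum/counter`: if no row matches, the counter stays at 0 and the division fails. A minimal sketch of a guard is shown below. The function name `mean_col4`, the awk variable `want` and the scratch file are assumptions made for this example, not part of the original exercises.

```shell
# Scratch copy of part of the table (hypothetical temp file)
demo=$(mktemp)
printf 'rowA\t1\t1\t9\nrowB\t2\t7\t10\nfile3\t3\t6\t20\nfile4\t4\t5\t99\n' > "$demo"

# mean_col4 VALUE: mean of column 4 over the rows whose column 2 equals VALUE
mean_col4() {
    awk -v want="$1" '
        $2 == want {n++; sum += $4}
        END {if (n > 0) print sum / n; else print "no matching rows"}
    ' "$demo"
}

mean_match=$(mean_col4 1)      # only rowA matches
mean_none=$(mean_col4 42)      # nothing matches, so no division is attempted
echo "$mean_match"
echo "$mean_none"
rm -f "$demo"
```

Checking the counter in the END block keeps the one-liner usable on files where the condition may never be true.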


In [413]:
## Calculate column means:
awk 'BEGIN {counter=0} {print $0; tot2 +=$2;tot3 +=$3; tot4 +=$4; counter += 1 } END { print "\n"; print "Means" "\t" tot2/counter "\t" tot3/counter "\t" tot4/counter }' cols.txt

rowA	1	1	9
rowB	2	7	10
file3	3	6	20
file4	4	5	99
line_a	12	13	144
line_b	15	16	177


Means	6.16667	8	76.5


#### More complex examples.



In [414]:
#### awk reads one line at a time.
## The following code evaluates every line in this way:
# a) Print the line.
# b) If the second column is 1, add it to the cumulative sum 'sum'.
# c) If the third column is less than or equal to 2, add it to 'sum2'.
# d) If the row name has an "a", add the value of the 4th column to 'sum3'.
# END: after the last line, print the three cumulative sums.
awk '{print $0} $2 == 1 { sum += $2 } $3 <= 2 {sum2 += $3} $1 ~ /a/ {sum3 += $4} END { print "\n" "\t" sum "\t"  sum2 "\t"  sum3 }' cols.txt

rowA	1	1	9
rowB	2	7	10
file3	3	6	20
file4	4	5	99
line_a	12	13	144
line_b	15	16	177

	1	1	144


### AWK programs basic structure

As you might've noticed by now, awk is line oriented. It processes one line at a time, so we can do different things on each line.

The basic idea of an awk program is that an `instruction` (a condition; awk's documentation calls it a *pattern*) is followed by an `{action}`:

> ``awk '$2 > 1 {print $0}' cols.txt`` <br>
> awk *__If__ the second column is greater than 1, __then__ {print the whole line}* 

Either part can be omitted, but not both. Without an instruction the action runs on every line; without an action awk prints every line that satisfies the instruction.
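
As a small sketch of the three forms (condition only, action only, and both together), the commands below run on a scratch file. The file and variable names are assumptions for this example.

```shell
# Scratch copy of the first rows of the table (hypothetical temp file)
demo=$(mktemp)
printf 'rowA\t1\t1\t9\nrowB\t2\t7\t10\nfile3\t3\t6\t20\n' > "$demo"

# Condition only: the default action prints the whole matching line
condition_only=$(awk '$2 > 1' "$demo")
# Action only: with no condition the action runs on every line
action_only=$(awk '{print $1}' "$demo")
# Condition and action together
both=$(awk '$2 > 1 {print $1, $2 + $3}' "$demo")

printf '%s\n' "$condition_only" "---" "$action_only" "---" "$both"
rm -f "$demo"
```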

Multiple actions within a block can be separated by a semicolon (;)

> ``awk 'instruction {action1;action2}'``


`awk` also has two special blocks (both optional): BEGIN and END.

#### BEGIN

It specifies an instruction and action to be performed BEFORE the start of the program (ie, before reading the lines). This can be useful to start variables or print something. 

The basic format for this block has to be specified using "BEGIN" then the action (in this case BEGIN is the instruction):

> ``awk 'BEGIN {a=0} {a+=1; print a}' cols.txt`` <br>
> *Start with a=0. For each line in cols.txt increase the value of a by 1 and print it.*

#### END

Specifies the actions to perform AFTER the program has run (i.e., after all the lines have been read). For example, to avoid printing each time a line is read, we can print a single output at the end.

> ``awk 'BEGIN {a=0} {a+=1} END {print a}' cols.txt`` <br>
> *Start with a=0. For each line in cols.txt increase the value of a by 1. After reading all the lines, print the final value of a.*

In principle we can run awk with only one block of instructions (either BEGIN, END or the unnamed instruction that is executed on each line). 


> ``awk 'BEGIN {print "Start of program"}' cols.txt`` <br>
> ``awk 'END {print "End of program"}' cols.txt`` <br>



In [415]:
# AWK
awk 'BEGIN {print "BOF (beginning of file)"} {print "do something every line"} END {print "EOF (end of file)"}' cols.txt

BOF (beginning of file)
do something every line
do something every line
do something every line
do something every line
do something every line
do something every line
EOF (end of file)


### Summary of AWK Commands:


* if    `awk '{if (NR%2==1) {print "odd"} else {print "even"}}' cols.txt`
* while `awk 'BEGIN { i=1; while (i <= 10) {print "The square of ", i, " is ", i*i;i = i+1;}}' cols.txt`
* for   `awk 'BEGIN { for (i=1; i <= 10; i++) {print "The square of ", i, " is ", i*i;}}' cols.txt`
* length `awk '{print length($1)}' cols.txt`


Others:

* break
* continue
* print [ expression-list ] [ > expression ]
* next 
* exit
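
Three of the statements listed above (`next`, `exit` and `break`) are sketched below on a small table with a header. The file contents and variable names are invented for illustration.

```shell
# Scratch table with a header line (hypothetical temp file)
demo=$(mktemp)
printf 'name\tvalue\nrowA\t1\nrowB\t2\nfile3\t3\nfile4\t4\n' > "$demo"

# next: skip the header line and move on to the next record
no_header=$(awk 'NR == 1 {next} {print $1}' "$demo")
# exit: stop reading as soon as a value above 2 is seen
first_big=$(awk 'NR > 1 && $2 > 2 {print $1; exit}' "$demo")
# break: leave a loop early (here: stop at the first square above 20)
squares=$(awk 'BEGIN {for (i = 1; i <= 10; i++) {if (i * i > 20) break; print i * i}}')

printf '%s\n' "$no_header" "---" "$first_big" "---" "$squares"
rm -f "$demo"
```

`next` and `exit` act on the stream of input lines, while `break` (like `continue`) only affects the loop it sits in.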

<hr>

### Passing variables to awk

<br>
# <center> ! </center>

You can skip this section if you're going to use awk for one-liners outside of shell scripts or less complicated stuff.
<br><br>

<hr>

When using awk inside shell scripts there's a special situation we should be aware of. Imagine you write a script to print out user defined columns using awk or just passing a variable to the program.

In the shell, variables are also referenced with the *$* sign, so it would seem logical to write the awk program passing the variable name as is, for example:

> ``Column=1`` # variable indicating which column to print. <br>
> ``echo $Column`` # print the variable <br>
> `` awk '{print $$Column}' cols.txt `` # Call awk, pass the variable to print the column. <br>

As you can see this will result in an error:

In [416]:
Column=1



In [417]:
# This will result in an error:
awk '{print $$Column}' cols.txt 

awk: illegal field $(), name "Column"
 input record number 1, file cols.txt
 source line number 1


Since `awk` also uses **$** (for its fields) and the program sits inside single quotes, the shell never expands `$Column`: awk receives the text `$$Column` as is. To let the shell substitute the value, we close the single quotes just before the shell variable and reopen them right after it:

In [418]:
awk '{print $'$Column'}' cols.txt 

rowA
rowB
file3
file4
line_a
line_b
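
An alternative that avoids opening and closing the quotes is awk's `-v` option, which assigns an awk variable before the program starts. The sketch below assumes a scratch file and an awk variable named `col`; both names are chosen only for this example.

```shell
# Scratch copy of the first rows of the table (hypothetical temp file)
demo=$(mktemp)
printf 'rowA\t1\t1\t9\nrowB\t2\t7\t10\n' > "$demo"

Column=3
# -v copies the shell value into the awk variable "col";
# inside the program, $col then means "the field numbered col"
third=$(awk -v col="$Column" '{print $col}' "$demo")
echo "$third"
rm -f "$demo"
```

With `-v` the awk program stays inside a single pair of quotes, which tends to be easier to read in longer scripts.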


### Recommendation: default variables

Set up default variables when writing shell scripts. This way if the input is empty or not set, it will take on a default value.

The format is as follows: ``${variable:-defaultvalue}``

For example:

>``column=2 #Choose any value `` <br>
>``column=${column:-1} #Set column as the same variable. If empty, it will take on the default value of 1.`` <br>



In [420]:
## Example with an empty input. If no input is given, the default value is used.
column=
echo $column
column=${column:-1}
echo $column


1


In [421]:
## Example with a valid input. The value chosen is NOT overwritten.
column=2
echo $column
column=${column:-1}
echo $column

2
2
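
Putting both ideas together, a script can accept the column as an optional argument. The function name `print_col`, the awk variable `col` (passed with awk's `-v` option) and the scratch file are assumptions for this sketch.

```shell
# print_col FILE [COLUMN]: print one column, defaulting to column 1
print_col() {
    file=$1
    column=${2:-1}   # use the second argument, or 1 if it is empty or unset
    awk -v col="$column" '{print $col}' "$file"
}

# Scratch copy of the first rows of the table (hypothetical temp file)
demo=$(mktemp)
printf 'rowA\t1\t1\t9\nrowB\t2\t7\t10\n' > "$demo"

default_col=$(print_col "$demo")      # no column given: falls back to 1
second_col=$(print_col "$demo" 2)     # explicit column
printf '%s\n' "$default_col" "---" "$second_col"
rm -f "$demo"
```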


<hr>

## Back to AWK

<br>

### Positional Variables

The variables we've used so far are user-defined, that is, we decide the value that the variable will take. 

awk's positional variables are built in and are accessed with the dollar sign:

In the following code, `a` is a user-defined variable: we assign it the value 0 and it increases by one on each line. The positional variable `$1` points to the first column.

In [82]:
awk 'BEGIN {a=0} {a+=1; print $1 "\t" a}' cols.txt

rowA	1
rowB	2
file3	3
file4	4
line_a	5
line_b	6


Positional variables can be modified:

In the following line we rename the rows to their position by replacing the value of `$1` with `a`:

In [422]:
awk 'BEGIN {a=0} {a+=1; print $1=a,$2,$3,$4}' cols.txt # The output can be redirected to a new text file appending `> newfile.txt`

1 1 1 9
2 2 7 10
3 3 6 20
4 4 5 99
5 12 13 144
6 15 16 177


### Quick Summary:

The 8 variables below are not user-defined but built into awk. Only the fields are called with `$`; the others are used by name.

* `$N` (a number). Indicates a specific column. `$0` is the whole line.
* `NF` (Number of fields). Indicates the number of fields (columns) in the current line.
* `NR` (Number of records). Indicates the number of records (lines) read so far.
* `FS` (Field separator). Indicates which character to use as **column** separator when reading the file.
* `OFS` (Output field separator). Indicates which character to use as **column** separator when printing the output.
* `RS` (Record separator). Indicates which character to use as **line** separator when reading the file.
* `ORS` (Output record separator). Indicates which character to use as **line** separator when printing the output.
* `FILENAME`. Name of the file being read.

Using these variables along with conditions (if,else) can have interesting results.
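
As a quick sketch of several of these variables working together, the command below reads a comma-separated scratch file and writes a tab-separated report. The file contents and the variable name `report` are made up for the example.

```shell
# Scratch CSV file (hypothetical temp file)
demo=$(mktemp)
printf 'rowA,1,1,9\nrowB,2,7,10\nfile3,3,6,20\n' > "$demo"

# FS and OFS are set once in BEGIN; NR and NF are updated for every record.
report=$(awk 'BEGIN {FS = ","; OFS = "\t"}
              {print "line " NR, NF " fields", "last=" $NF}
              END {print "total", NR}' "$demo")
echo "$report"
rm -f "$demo"
```

Note that `NR`, `NF`, `FS` and `OFS` are written without `$`, while `$NF` uses the value of `NF` as a field number.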


### Field Separator

By default awk splits lines on whitespace. If the file we're parsing uses a different separator (e.g., a comma-separated file) we can change it with the `-F` option or the `FS` variable:

In [424]:
## I converted cols.txt to a csv file with vim:
# Used > :%s/\s\+/,/g < to convert
head -n 3 cols.txt
echo "---"
head -n 3 cols.csv

rowA	1	1	9
rowB	2	7	10
file3	3	6	20
---
rowA,1,1,9
rowB,2,7,10
file3,3,6,20


In [425]:
## With the default separator each line is a single field, so $2, $3 and $4 are empty:
awk 'BEGIN {a=0} {a+=1; print $1=a,$2,$3,$4}' cols.csv

1   
2   
3   
4   
5   
6   


In [426]:
## But we can switch the field separator to a ","
awk -F, 'BEGIN {a=0} {a+=1; print $1=a,$2,$3,$4}' cols.csv

1 1 1 9
2 2 7 10
3 3 6 20
4 4 5 99
5 12 13 144
6 15 16 177


In [427]:
## Or set the FS variable
awk 'BEGIN {a=0;FS=","} {a+=1; print $1=a,$2,$3,$4}' cols.csv


1 1 1 9
2 2 7 10
3 3 6 20
4 4 5 99
5 12 13 144
6 15 16 177


In [431]:
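# Try to pick the field separator per line. Note that /":"/ also looks for the quotes,
# so it never matches here and every line ends up split on commas.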
awk ' { if ($0 ~ /":"/) {FS=":";} else {FS=",";$0=$0} print $3 }' fubar.txt


7
6

13



In [432]:
# The above script prints column 3 from rows separated with a comma
cat fubar.txt

rowA:1:1:9
rowB,2,7,10
file3,3,6,20
file4:4:5:99
line_a,12,13,144
line_b:15:16:177
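
For a file that mixes separators like `fubar.txt`, two possible approaches are sketched below: a bracket expression as field separator, or choosing `FS` per line and reassigning `$0` so the line is split again. The scratch file and variable names are assumptions for this example.

```shell
# Scratch file mixing ":" and "," as separators (hypothetical temp file)
demo=$(mktemp)
printf 'rowA:1:1:9\nrowB,2,7,10\nfile3,3,6,20\nfile4:4:5:99\n' > "$demo"

# A bracket expression as field separator: split on either "," or ":"
third_any=$(awk -F'[,:]' '{print $3}' "$demo")

# Per-line choice: reassigning $0 forces the line to be re-split with the new FS
third_per_line=$(awk '{ FS = ($0 ~ /:/) ? ":" : ","; $0 = $0; print $3 }' "$demo")

printf '%s\n' "$third_any" "---" "$third_per_line"
rm -f "$demo"
```

The `$0 = $0` step matters because a change to `FS` only takes effect the next time a line is split.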


### Output Field Separator

When printing outputs there is a difference between printing 

> ``awk '{print $1 $2}' cols.txt``

And printing 

> ``awk '{print $1,$2}' cols.txt``

Using a space concatenates the output into a single field. Using the comma prints two fields with the Output Field Separator (OFS) between them.

By default the OFS is a space, but it can be changed:

In [433]:
awk 'BEGIN {OFS="\t"} {print $1,$2}' cols.txt
echo "---"
awk 'BEGIN {OFS=","} {print $1,$2}' cols.txt
echo "---"
awk 'BEGIN {OFS=":"} {print $1,$2}' cols.txt

rowA	1
rowB	2
file3	3
file4	4
line_a	12
line_b	15
---
rowA,1
rowB,2
file3,3
file4,4
line_a,12
line_b,15
---
rowA:1
rowB:2
file3:3
file4:4
line_a:12
line_b:15
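
One detail worth a small sketch: `OFS` is only applied when awk builds the output from separate fields. Printing `$0` untouched keeps the original separators, while assigning to any field (even to itself) rebuilds the line with `OFS`. File and variable names below are assumptions for the example.

```shell
# Scratch copy of the first rows of the table (hypothetical temp file)
demo=$(mktemp)
printf 'rowA\t1\t1\t9\nrowB\t2\t7\t10\n' > "$demo"

# print $0 alone keeps the original tabs: OFS is not applied
unchanged=$(awk 'BEGIN {OFS = ","} {print $0}' "$demo")
# Assigning to a field rebuilds $0 using OFS
rebuilt=$(awk 'BEGIN {OFS = ","} {$1 = $1; print $0}' "$demo")

printf '%s\n' "$unchanged" "---" "$rebuilt"
rm -f "$demo"
```

This gives a compact way to convert a whole tab-separated table to CSV without listing every column.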


### Number of Fields

The `NF` variable tells us the number of fields (columns) in the current line. In the END block most implementations keep the value from the last line read.

In [434]:
awk 'END  {print NF}' cols.txt

4


In [435]:
## We can take advantage of NF and print the last column of a file
# by using NF as the field number:
awk '{print $NF}' cols.txt

9
10
20
99
144
177
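
Since `NF` is an ordinary number, it can also be used in arithmetic and in loops. The sketch below, with made-up file and variable names, prints the second-to-last column and sums every value of a row regardless of how many columns there are.

```shell
# Scratch copy of the first rows of the table (hypothetical temp file)
demo=$(mktemp)
printf 'rowA\t1\t1\t9\nrowB\t2\t7\t10\n' > "$demo"

# $(NF - 1) is the second-to-last column
penultimate=$(awk '{print $(NF - 1)}' "$demo")
# Loop from column 2 to NF to add up every value of a row
row_sums=$(awk '{total = 0; for (i = 2; i <= NF; i++) total += $i; print $1, total}' "$demo")

printf '%s\n' "$penultimate" "---" "$row_sums"
rm -f "$demo"
```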


# <center> ! </center>

Some older AWK implementations have a limit of 99 fields in a single line ($1 to $99); current ones such as GNU awk (`gawk`) do not. Other programming languages (like Perl) don't have such limits to handle multiple fields/columns.

<hr>

### Number of Records

The `NR` variable tells us the number of records (lines) read so far. In the END block it equals the total number of lines.

This is useful if we were reading a file with a header (column names) and wish to skip the first line.

In [436]:
## We can skip n lines with a condition for the NR value:
awk '{ if (NR > 1) {print NR, $0;}}' cols.txt 

2 rowB	2	7	10
3 file3	3	6	20
4 file4	4	5	99
5 line_a	12	13	144
6 line_b	15	16	177


In [437]:
# We get the same effect just by indicating the condition (without the If/else statement)
awk 'NR>1 {print NR, $0}' cols.txt

2 rowB	2	7	10
3 file3	3	6	20
4 file4	4	5	99
5 line_a	12	13	144
6 line_b	15	16	177


### Record Separator

AWK reads one line at a time and separates each line into columns (fields).

The `RS` variable indicates the character used to separate records (lines).

By default the end of line (EOL) is "\n". If the document has any other type of end-of-line character we specify it with RS in the BEGIN block. 

In [438]:
## Define the record separator as a space. Print the record number and then the first four columns.
# cols.csv contains no spaces, so the whole file is read as a single record. Its newlines
# count as whitespace for the default field separator, so each original line becomes a column:
# the file is now a 1x6 table.
awk 'BEGIN {RS=" "} {print NR,$1,$2,$3,$4} END {print"\n";print "Rows:",NR,"Columns:",NF}' cols.csv

1 rowA,1,1,9 rowB,2,7,10 file3,3,6,20 file4,4,5,99


Rows: 1 Columns: 6


In [439]:
awk '{print NR,$1,$2,$3,$4} END {print"\n";print "Rows:",NR,"Columns:",NF}' cols.csv

1 rowA,1,1,9   
2 rowB,2,7,10   
3 file3,3,6,20   
4 file4,4,5,99   
5 line_a,12,13,144   
6 line_b,15,16,177   


Rows: 6 Columns: 1


In [440]:
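# If we use the field separator (tab) as the record separator we read one value per record.
# The last value of a line and the first of the next (e.g. "9\nrowB") are not separated by a tab,
# so they form a single record that prints on two lines.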
awk 'BEGIN {RS="\t"} {print NR,$0} END {print"\n";print "Rows:",NR,"Columns:",NF}' cols.txt

1 rowA
2 1
3 1
4 9
rowB
5 2
6 7
7 10
file3
8 3
9 6
10 20
file4
11 4
12 5
13 99
line_a
14 12
15 13
16 144
line_b
17 15
18 16
19 177



Rows: 19 Columns: 1


### Output Record Separator

The `ORS` variable indicates the character used to separate records (lines) __in the output__.

By default the end of line (EOL) is "\n". If we want the output to have any other type of end-of-line character we specify it with ORS in the BEGIN block. 

In [96]:
awk 'BEGIN {ORS=";"} {print $0}' cols.txt

rowA	1	1	9;rowB	2	7	10;file3	3	6	20;file4	4	5	99;line_a	12	13	144;line_b	15	16	177;

### Filename

The `FILENAME` variable indicates the name of the file being read.

In [97]:
awk '{print $0} END {print"\n";print "Rows:",NR,"Columns:",NF, "File:", FILENAME}' cols.txt

rowA	1	1	9
rowB	2	7	10
file3	3	6	20
file4	4	5	99
line_a	12	13	144
line_b	15	16	177


Rows: 6 Columns: 4 File: cols.txt


### Associative Arrays

Many programming languages would count occurrences of an event with two arrays: one for the names and the other for the counts, linked together by a shared index (i.e., a table with one column for the name and another for the number).

`AWK` overcomes this issue using associative arrays:

In [None]:
### Count the number of files that a user has.

## Bad awk

ls -l . | \
  awk 'BEGIN {number_of_users=0;} { \
          if (NF>7) { user=0; for (i=1; i<=number_of_users; i++) {\
         if (username[i] == $3) {user=i;}} \
          if (user == 0) {username[++number_of_users]=$3;user=number_of_users;}count[user]++;} \
          }\
        END {for (i=1; i<=number_of_users; i++) {print count[i], username[i]}}' 

In [None]:
# Quick and dirty solution
ls -l | awk '{print $3}' | sort | uniq -c | sort -nr

In [None]:
## Good awk
ls -l | awk '{username[$3]++;} END { for (i in username) {print username[i], i; }}'

In [106]:
## Better awk programming

ls -l | awk 'BEGIN {username[""]=0;}{username[$3]++;}END {for (i in username) {if (i != "") {print username[i], i;}}}'

9 jrm


All arrays in AWK are associative arrays. 

When adding entries and counts in the arrays the counts are incremented by invoking:

`entry[variable]++;`
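
A self-contained sketch of the counting idiom, using an invented list of owners and file names instead of `ls -l` output so that the result is reproducible:

```shell
# Made-up "owner file" pairs (hypothetical temp file)
demo=$(mktemp)
printf 'alice a.txt\nbob b.txt\nalice c.txt\ncarol d.txt\nalice e.txt\nbob f.txt\n' > "$demo"

# files[$1]++ creates the entry on first sight and increments it afterwards.
# "for (name in files)" visits the keys in no guaranteed order, so sort the result.
counts=$(awk '{files[$1]++} END {for (name in files) print files[name], name}' "$demo" | sort -k1,1nr -k2,2)
echo "$counts"
rm -f "$demo"
```

The final `sort` is there only to make the order predictable; awk itself does not promise any particular order when looping over an array.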

## Applying awk to real biological data.

The following is a SAM file with information of sequence reads aligned to the Arabidopsis genome. The format of this type of data is constant, so we can apply awk commands over these files with ease and these scripts can work for more than one SAM file.

You can read more about the format here: https://en.wikipedia.org/wiki/SAM_(file_format)

In [None]:
# First let's take a look at how the file looks
head -n 3 bowtie2_genome.sam

The file contains 5346701 entries (alignments or lines): `wc -l bowtie2_genome.sam`

`awk 'END {print NR}' bowtie2_genome.sam # Prints the NR of the last line, which is effectively the number of lines.`


#### RNAME 

Let's count how many reads are mapped to the different chromosomes. To do this we use the third column of the SAM file.

In [441]:
## One way to do it is to pipe different commands: `cut` the file for column 3 (the field containing the chromosome name), then `sort` and count the unique values.

SECONDS=0; head -n 5346701 bowtie2_genome.sam | cut -f 3 | sort | uniq -c | sort -nr; echo $SECONDS;

1410685 Chr1
1094098 Chr5
1058436 Chr3
886535 Chr4
864493 Chr2
22383 mitochondria
10071 chloroplast
119


In [442]:
## Using awk and associative arrays we get the same result in much less time. 
SECONDS=0; head -n 5346701 bowtie2_genome.sam | \
    awk '{chromosome[$3]++;} END { for (i in chromosome) {print chromosome[i], i; } }';  \
echo $SECONDS;

10071 chloroplast
1410685 Chr1
864493 Chr2
1058436 Chr3
886535 Chr4
1094098 Chr5
22383 mitochondria
59


awk read and processed the __5346701 lines in ~59 seconds__, while piping ("|") the column through `sort` and `uniq` took about twice as long (~119 seconds). 

#### MAPQ

The 5th column of the SAM format gives us the mapping quality (MAPQ). It is defined as $-10 \log_{10} \Pr\{\text{mapping position is wrong}\}$, rounded to the nearest integer.

See more about qualities here: http://www.acgt.me/blog/2014/12/16/understanding-mapq-scores-in-sam-files-does-37-42

A value of 255 indicates that the mapping quality is not available.
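
Inverting the definition gives the error probability for a given quality $Q$: $p = 10^{-Q/10}$. As a rough sketch, awk can tabulate it for a few example qualities (the values 10, 20 and 30 are picked only for illustration):

```shell
# Error probability p = 10^(-Q/10) for MAPQ values 10, 20 and 30
probs=$(awk 'BEGIN {for (q = 10; q <= 30; q += 10) printf "%d\t%.4f\n", q, 10 ^ (-q / 10)}')
echo "$probs"
```

So a MAPQ of 20 corresponds to a 1 in 100 chance that the mapping position is wrong, and a MAPQ of 30 to 1 in 1000.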

Using awk we can:


* filter out reads with low quality 

In [450]:
## Print reads with an available MAPQ; count the reads without one (MAPQ = 255).
#SECONDS=0; head -n 5346701 bowtie2_genome.sam | \
SECONDS=0; head -n 100 bowtie2_genome.sam |\
    awk '{ if ($5 != 255) {print $0} else {qual[$5]++} }  END { for (i in qual) {print i, qual[i]; } } ';\
    echo $SECONDS;

J00113:234:HGCCMBBXX:6:1101:5690:1297	16	Chr1	2338748	32	42M	*	0	0	AATACAAAGCTGGGAAAAGTCAAAGCTTAAGAGAGTTGATAC	JJJJJFJA7FFFFFJJJJFJFJFFFJJFFJAFFJJJJFJFFJ	AS:i:84	XS:i:74	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:42	YT:Z:UU
J00113:234:HGCCMBBXX:6:1101:5690:1297	272	Chr5	22971914	32	1S20M1D21M	*	0	0	AATACAAAGCTGGGAAAAGTCAAAGCTTAAGAGAGTTGATAC	JJJJJFJA7FFFFFJJJJFJFJFFFJJFFJAFFJJJJFJFFJ	AS:i:74	XS:i:74	XN:i:0	XM:i:0	XO:i:1	XG:i:1	NM:i:1	MD:Z:20^A21	YT:Z:UU
J00113:234:HGCCMBBXX:6:1101:27630:1297	0	Chr4	7439507	31	42M	*	0	0	CGTAAGTTGAATACCTATGACATACCTATGAAACAAACGAAT	AFJJJJ<J7<<FJAA<FF-F--<F7<-F77F-FAFJJAFJ<A	AS:i:84	XS:i:78	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:42	YT:Z:UU
J00113:234:HGCCMBBXX:6:1101:27630:1297	256	Chr4	7434582	31	42M	*	0	0	CGTAAGTTGAATACCTATGACATACCTATGAAACAAACGAAT	AFJJJJ<J7<<FJAA<FF-F--<F7<-F77F-FAFJJAFJ<A	AS:i:78	XS:i:78	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:29A12	YT:Z:UU
J00113:234:HGCCMBBXX:6:1101:9171:1314	0	Chr3	5580741	38	42M	*	0	0	CATATGAGGTCTCTTCATGCCTCTGGTGAATG

In [451]:
## Only print lines with an available MAPQ.
## Count the lines whose MAPQ is 255 and print the count at the end.
head -n 10 bowtie2_genome.sam | \
awk '{if ($5 == 255) {qual[$5]++} else {print $0}} END { for (i in qual) {print i, qual[i]}}' 

255 10


In [452]:
## Distribution of qualities
#SECONDS=0; head -n 100 bowtie2_genome.sam | \
SECONDS=0; head -n 5346701 bowtie2_genome.sam | \
    awk '{qual[$5]++;} END { for (i in qual) {print i, qual[i]; } }';  \
echo $SECONDS;

40 49377
31 179607
32 110361
33 121616
34 116134
35 87025
37 61411
38 66742
39 97631
255 4456797
61


In [454]:
awk '{if (NR%2==1) {print "odd"} else {print "even"}}' cols.txt

odd
even
odd
even
odd
even


### Redirection

awk can write to files directly by using `>` inside the program. For example, we may want to send reads without a mapping quality to one file and reads with a mapping quality to another file.

In [None]:
## This one-liner filters a SAM file by MAPQ values:

# Creates a file for reads without available quality.
# Creates a file for reads with quality.
## Algorithm:
# Gets the file name and removes the suffix (.sam).
# Creates variables for the output file names.
# For every line checks if the quality is 255 or not.
# If so, sends the read to one file (low); if not, sends the read to another file (pass).
# Prints a summary of how many reads have no available quality.

awk 'BEGIN{file=ARGV[1]; sub(/\.sam$/,"", file); pass=file"_okMAPQ.sam"; low=file"_bad.sam"}\
     {if ($5 == 255) {print $0 > low;qual[$5]++} else {print $0>pass}}\
     END { for (i in qual) {print i, qual[i]}}' SAMtest200.sam


In [None]:
wc -l SAMtest200_bad.sam # Should match the count printed above

In [None]:
## We can add some cool features to this script. 

# We can create a file with the distribution of qualities
awk 'BEGIN{file=ARGV[1]; sub(/\.sam$/,"", file); \
            pass=file"_okMAPQ.sam"; low=file"_bad.sam";distr=file"_MAPQdistribution.txt";} \
    {if ($5 == 255) {print $0 > low;qual[$5]++} else {print $0>pass;qual[$5]++}} \
    END { for (i in qual) {print i, qual[i] > distr}}' SAMtest200.sam


In [236]:
## Add a filter for low quality reads (including reads with MAPQ=255)

awk 'BEGIN{file=ARGV[1]; sub(/\.sam$/,"", file); \
            pass=file"_okMAPQ.sam"; low=file"_bad.sam";distr=file"_MAPQdistribution.txt";} \
    {if ($5 == 255 || $5 < 37) {print $0 > low;qual[$5]++} else {print $0>pass;qual[$5]++}} \
    END { for (i in qual) {print i, qual[i] > distr}}' SAMtest200.sam



In [241]:
## Add a filter for low quality reads (MAPQ < 37) (including reads with MAPQ=255) AND 
# reads from non genomic locations (chloroplast or mitochondria)

awk 'BEGIN{file=ARGV[1]; sub(/\.sam$/,"", file); \
            pass=file"_okMAPQ.sam"; low=file"_bad.sam";distr=file"_MAPQdistribution.txt";} \
    {if ($5 == 255 || $5 < 37 || $3 ~ /chl|mit/ ) {print $0 > low;qual[$5]++} else {print $0>pass;qual[$5]++}} \
    END { for (i in qual) {print i, qual[i] > distr}}' SAMtest200.sam
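
To try the redirection logic without a real alignment file, here is a minimal sketch on a toy input. Everything about the toy file is an assumption: it has only the first five SAM columns, the read names and values are invented, and the output suffixes `_ok.sam` and `_bad.sam` are chosen for the example.

```shell
workdir=$(mktemp -d)
sam="$workdir/toy.sam"
# Five tab-separated columns are enough here: QNAME FLAG RNAME POS MAPQ (made-up values)
printf 'r1\t0\tChr1\t100\t40\nr2\t0\tChr2\t200\t255\nr3\t16\tchloroplast\t300\t38\nr4\t0\tChr3\t400\t12\n' > "$sam"

awk 'BEGIN {
         base = ARGV[1]; sub(/\.sam$/, "", base)   # strip only a trailing ".sam"
         pass = base "_ok.sam"; low = base "_bad.sam"
     }
     {
         if ($5 == 255 || $5 < 37 || $3 ~ /chl|mit/) print > low
         else print > pass
     }
     END {close(low); close(pass)}' "$sam"

ok_reads=$(cut -f 1 "$workdir/toy_ok.sam")
bad_reads=$(cut -f 1 "$workdir/toy_bad.sam")
printf '%s\n' "$ok_reads" "---" "$bad_reads"
rm -rf "$workdir"
```

Anchoring the suffix with `\.sam$` keeps the rest of the path untouched, which matters when the directory name itself contains a dot.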



<hr>

## Processing fasta files 

In [2]:
grep ">" promoters.fa | wc -l

     109


In [483]:
# Count number of entries in the fasta file
#awk ' {if ($1 ~ />/) {print $0} else {} } ' promoters.fa
awk ' BEGIN {c=0} {if ($1 ~ />/) {c+=1;} } END {print c} ' promoters.fa 


109


In [None]:
## Convert a FASTA file to uppercase
########
awk 'BEGIN{c=0; file=ARGV[1]; sub(/\.fa$/,"", file); toFile=file"_upper.fa";}\
            {if ($1 ~ />/) {c+=1;print $0;} else {print toupper($0)} } END {print c, toFile} ' promoters.fa 

In [484]:
## Save results to file
awk 'BEGIN{c=0; file=ARGV[1]; sub(/\.(fa|fasta)$/,"", file); toFile=file"_upper.fa";}\
      {if ($1 ~ />/) {c+=1;print $0 > toFile;} else {print toupper($0)> toFile} } END {print c, toFile} ' promoters.fa 
########            

109 promoters_upper.fa


### Searching for patterns in fasta files.

We can use conditions to split a fasta file based on patterns (filtering by either the header names or the sequences).

More on regular expressions: https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

In [458]:
## Filter a FASTA file: header names with pattern go to one file, genes with no match go to another file.
########
awk 'BEGIN{m=0;nom=0; file=ARGV[1]; sub(/\.(fa|fasta)$/,"", file); passFile=file"_ok.fa"; noFile=file"_no.fa"}\
      {if ($1 ~ />/) { tmp=$0;}\
      else { if (tmp ~ /Gen02/) {m+=1; print tmp "\n" toupper($0) > passFile}\
                               else {nom+=1; print tmp "\n" toupper($0) > noFile} }}\
                               END {print " Genes with match:",m,RS,"Genes with no match:",nom} ' promoters.fa 
########            

 Genes with match: 10 
 Genes with no match: 99


In [459]:
## Filter a FASTA file: sequences with pattern go to one file, sequences with no match go to another file.
########
awk 'BEGIN{m=0;nom=0; file=ARGV[1]; sub(/\.(fa|fasta)$/,"", file); passFile=file"_ok.fa"; noFile=file"_no.fa"}\
      {if ($1 ~ />/) { tmp=$0;}\
      else { if (toupper($0) ~ /CCC.{1,4}GGG/) {m+=1; print tmp "\n" toupper($0) > passFile}\
                               else {nom+=1; print tmp "\n" toupper($0) > noFile} }}\
                               END {print " Genes with match:",m,RS,"Genes with no match:",nom} ' promoters.fa 
# The pattern using grep would be "CCC.\{1,4\}GGG" with slashes between the intervals:
#grep -i "CCC.\{1,4\}GGG" promoters.fa  -B 1 | grep ">" | wc -l
########            

 Genes with match: 20 
 Genes with no match: 89
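
Both filters above keep only the line that follows a header, so they assume every record has a single sequence line. For FASTA files whose sequences are wrapped over several lines, one possible approach is to join each record into one line first. A minimal sketch on a made-up file (`wrapped.fa` and its contents are assumptions for illustration):

```shell
# Work in a throwaway directory; the file below is made-up example data.
workdir=$(mktemp -d)
cat > "$workdir/wrapped.fa" <<'EOF'
>Gen01 promoter
acgtac
ggta
>Gen02 promoter
ttga
ccgg
aa
EOF

# Join the wrapped sequence lines of each record into a single uppercase line.
linear=$(awk '
    /^>/ { if (seq != "") print seq   # flush the previous record
           print; seq = ""; next }    # then print the header unchanged
         { seq = seq toupper($0) }    # append this chunk of sequence
    END  { if (seq != "") print seq } # flush the last record
' "$workdir/wrapped.fa")
printf '%s\n' "$linear"
```

The linearized output has exactly two lines per record, which is the layout the filters above expect.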


### FASTQ processing

FASTQ files have a special format. Each read is stored as a record of 4 lines:

* header: starts with an @.
* read sequence: nucleotides (including N).
* quality separator: starts with a +. It may repeat the header information (without the @).
* quality: one character per base, encoding a Phred score.


We can tell which of the four lines we are on from the remainder $n$ of the line number divided by 4 (`NR%4` in awk):

* If $n$=1: header
* If $n$=2: read
* If $n$=3: + separator
* If $n$=0: quality 
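
As a small illustration of this remainder rule, the sketch below counts reads and computes the mean read length on a made-up two-read file (`tiny.fastq` is an assumption, not one of the files used in this tutorial):

```shell
# Made-up FASTQ with two reads of length 6 and 4.
workdir=$(mktemp -d)
cat > "$workdir/tiny.fastq" <<'EOF'
@read1
ACGTAC
+
IIIIII
@read2
ACGT
+
IIII
EOF

stats=$(awk '
    NR % 4 == 1 { reads++ }               # header line
    NR % 4 == 2 { bases += length($0) }   # sequence line
    END { printf "%d reads, mean length %.1f\n", reads, bases / reads }
' "$workdir/tiny.fastq")
echo "$stats"
```

This relies on every record having exactly four lines, which is the layout described above.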

In [460]:
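## Count the headers and print those found in the first 7 lines.
# Note: /@/ matches an @ anywhere in the first field, and quality strings may also contain @; NR%4==1 is a stricter test.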
awk ' BEGIN {c=0} {if ($1 ~ /@/) {c+=1; if (NR < 8) {print $0};} } END {print c} ' testSeq.fastq

@SRR1463325.1 HS2:447:C2DFYACXX:5:1101:1336:2178 length=59
@SRR1463325.2 HS2:447:C2DFYACXX:5:1101:1364:2181 length=59
200000


In [486]:
awk ' {print (NR "\t" NR%4 "\t" $0)}' testQ32.fastq

1	1	@SRR1463325.1 HS2:447:C2DFYACXX:5:1101:1336:2178 length=59
2	2	ATGTTAGTAACCGAACCTTCTTCAAAAAGGGCTAAGGGATAAGCTACATACGCAATAAA
3	3	+SRR1463325.1 HS2:447:C2DFYACXX:5:1101:1336:2178 length=59
4	0	BBBFFFBFFFF0FF0FFBBB0BFFFFIFBF0BBF<B<BF<BB<FFIFFBBBBF######
5	1	@SRR1463325.2 HS2:447:C2DFYACXX:5:1101:1364:2181 length=59
6	2	ACGCATTTATTAGATAAAAGGTCGACGCGGGCTCTGCCCGTTGCTCTGATGATTCATGA
7	3	+SRR1463325.2 HS2:447:C2DFYACXX:5:1101:1364:2181 length=59
8	0	BBBFFFFFFFFFFFFFIIIFIFIIF'B7FFFBBBFF'7BFBFBFBBFBB<B7<7'0<B<
9	1	@SRR1463325.3 HS2:447:C2DFYACXX:5:1101:1499:2208 length=59
10	2	AGGACCTCTTTAGTATTTTTGTTGATGACCAAAGCACCAGCACCTACAACATGAGAAGC
11	3	+SRR1463325.3 HS2:447:C2DFYACXX:5:1101:1499:2208 length=59
12	0	BBBFFFFFFFFFFFFFIIIIIIIIIIIIIIIIIIIFIIIFIIIFFIIIIIIIIFFIIIB
13	1	@SRR1463325.4 HS2:447:C2DFYACXX:5:1101:1648:2157 length=59
14	2	NTGTAGAATCTATGTTGAATCACCATTTAGCAGGGCTACTAGGACTTGGGTCCCTTTCT
15	3	+SRR1463325.4 HS2:447:C2DFYACXX:5:1101:1648:2157 length=59
16	0	#0<BFFFFFFFFFIIIIIIIII

In [462]:
### Read the file in awk and reproduce it record by record (the separator line is reduced to a bare +).
awk 'BEGIN {separator="+"} {\
       if(NR%4==1) {header=$0}\
             else { if(NR%4==2) {sequence=$0}\
                else { if(NR%4==0) {quality=$0; print (header "\n" sequence "\n" separator "\n" quality )  }}};\
            }' testQ32.fastq

@SRR1463325.1 HS2:447:C2DFYACXX:5:1101:1336:2178 length=59
ATGTTAGTAACCGAACCTTCTTCAAAAAGGGCTAAGGGATAAGCTACATACGCAATAAA
+
BBBFFFBFFFF0FF0FFBBB0BFFFFIFBF0BBF<B<BF<BB<FFIFFBBBBF######
@SRR1463325.2 HS2:447:C2DFYACXX:5:1101:1364:2181 length=59
ACGCATTTATTAGATAAAAGGTCGACGCGGGCTCTGCCCGTTGCTCTGATGATTCATGA
+
BBBFFFFFFFFFFFFFIIIFIFIIF'B7FFFBBBFF'7BFBFBFBBFBB<B7<7'0<B<
@SRR1463325.3 HS2:447:C2DFYACXX:5:1101:1499:2208 length=59
AGGACCTCTTTAGTATTTTTGTTGATGACCAAAGCACCAGCACCTACAACATGAGAAGC
+
BBBFFFFFFFFFFFFFIIIIIIIIIIIIIIIIIIIFIIIFIIIFFIIIIIIIIFFIIIB
@SRR1463325.4 HS2:447:C2DFYACXX:5:1101:1648:2157 length=59
NTGTAGAATCTATGTTGAATCACCATTTAGCAGGGCTACTAGGACTTGGGTCCCTTTCT
+
#0<BFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFIIIFFII
@SRR1463325.5 HS2:447:C2DFYACXX:5:1101:1776:2228 length=59
AGCCTCTTTCCGATCTTCTCAACTCCAAGGCTCTCAACGAACTTCCTCACTTCATCATC
+
<<0<BFBFBFB0BF<FFFBFFFB<B0BB<<FBFFBFBFBF<FBBFFFFBB<707<B<<7
@SRR1463325.6 HS2:447:C2DFYACXX:5:1101:1956:2235 length=59
AGAGTCAATAATTTT

In [463]:
### Filter by Identifier.
awk 'BEGIN {separator="+";\
     file=ARGV[1]; gsub(/.fa|.fasta|.fastq/,"", file); passFile=file"_IDfilter.fq"; noFile=file"_no.fq"}\
     {\
           if(NR%4==1) {header=$0}\
                 else { if(NR%4==2) {sequence=$0}\
                    else { if(NR%4==0) {quality=$0; \
                    if (header ~ /^@SRR1463325\.1 /) {\
                        print (header "\n" sequence "\n" separator "\n" quality)>passFile\
                        }\
          }}};\
     }\
     END {print passFile}' testQ32.fastq

testQ32_IDfilter.fq


In [464]:
### Print only sequences
awk 'BEGIN {separator="+";\
     file=ARGV[1]; gsub(/.fa|.fasta|.fastq/,"", file); passFile=file"_ok.fq"; noFile=file"_no.fq"}\
     {\
           if(NR%4==1) {header=$0}\
                 else { if(NR%4==2) {sequence=$0}\
                    else { if(NR%4==0) {quality=$0; \
                        print (sequence)}\
                        }};\
     }\
     END {print passFile}' testQ32.fastq

ATGTTAGTAACCGAACCTTCTTCAAAAAGGGCTAAGGGATAAGCTACATACGCAATAAA
ACGCATTTATTAGATAAAAGGTCGACGCGGGCTCTGCCCGTTGCTCTGATGATTCATGA
AGGACCTCTTTAGTATTTTTGTTGATGACCAAAGCACCAGCACCTACAACATGAGAAGC
NTGTAGAATCTATGTTGAATCACCATTTAGCAGGGCTACTAGGACTTGGGTCCCTTTCT
AGCCTCTTTCCGATCTTCTCAACTCCAAGGCTCTCAACGAACTTCCTCACTTCATCATC
AGAGTCAATAATTTTATATGAGGAACTACTGAACTCAATCACTTGCTGCCGTTACTCTT
NTGTTTGAGGGGGAGGTCATAAGCGTCTATACCGTAAAATAGATTTTCGACGAAATGCA
CTAAGGGTGGGTTGATAACCCACAGCAGAAGGCATTCTACCCAATAAGGCGGATACCTC
testQ32_ok.fq


In [465]:
### Filter by sequence.
awk 'BEGIN {separator="+";\
     file=ARGV[1]; gsub(/.fa|.fasta|.fastq/,"", file); passFile=file"_SequenceFilter.fq"; noFile=file"_no.fq"}\
     {\
           if(NR%4==1) {header=$0}\
                 else { if(NR%4==2) {sequence=$0}\
                    else { if(NR%4==0) {quality=$0; \
                    if (sequence ~ /[TA]..TTTT/) {\
                        print (header "\n" sequence "\n" separator "\n" quality)>passFile\
                        }\
          }}};\
     }\
     END {print passFile}' testQ32.fastq

testQ32_SequenceFilter.fq


<hr>

### Random sampling FASTA files

In [None]:
# Read a fasta file in awk
awk 'BEGIN{c=0; file=ARGV[1]; gsub(/.fa/,"", file); toFile=file"_upper.fa";}\
            {if ($1 ~ />/) {c+=1;print $0;} else {print toupper($0)} } END {print c, toFile} ' promoters.fa 

In [487]:
## Get the number of lines. This will be the range.
total=`wc -l promoters.fa | awk -F " " '{print $1 }'`
echo $total

218


In [468]:
# Use rand() to generate a random number in [0, 1).
# Multiply by a constant M to get a number in [0, M); int() keeps the integer part.
# Include 'srand(1);' right at the start to seed the random function with a fixed value.
echo "" | awk 'BEGIN {i=0;while (i++<10) {print (int(rand()*100)) } exit;}'

84
39
78
79
91
19
33
76
27
55
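
To draw integers from an arbitrary range rather than from 0 to M, the same scaling can be shifted by the minimum. A minimal sketch, assuming illustrative values `min=5`, `max=8` and a fixed seed (the individual counts depend on the awk implementation, so only the range check is meaningful):

```shell
# Draw n integers in [min, max] and tally how often each value appears.
draws=$(awk -v min=5 -v max=8 -v n=1000 -v seed=1 'BEGIN {
    srand(seed)
    for (i = 1; i <= n; i++) {
        v = min + int(rand() * (max - min + 1))   # integer in [min, max]
        if (v < min || v > max) bad++
        count[v]++
    }
    for (v = min; v <= max; v++) print v, count[v] + 0
    print "out of range:", bad + 0
}')
echo "$draws"
```

Because rand() never returns 1, `int(rand() * (max - min + 1))` stays between 0 and `max - min`, so no draw falls outside the range.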


In [469]:
# Script to generate random numbers and count how many times each one is sampled.
echo "" | awk 'BEGIN {srand(); i=0;while (i++<10)\
         {x=int(rand()*10 + 0.5);y[x]++;} \
         for (i=0;i<=10;i++) {printf("%d\t%d\n",y[i],i);}exit;}'

0	0
1	1
0	2
0	3
1	4
2	5
1	6
2	7
1	8
0	9
2	10


In [470]:
## The random number generator has no built-in way to sample without replacement,
# so values can repeat (here 29 is drawn twice).
echo "" | awk 'BEGIN { ranVar[""]=0;srand(1); for (i = 1; i <= 10; i++) { v=int(36 * rand()+1); ranVar[v]++}} \
               END {for (i in ranVar) {if (i != "") {print ranVar[i], i;}}}'


1 28
1 8
2 29
1 10
1 13
1 15
1 31
1 33
1 20


In [471]:
## We can create our own random number generator without replacement.
# Instead of 29 appearing twice, we get 18. We can use associative arrays to prove this works:
maxV=36
N=11
minV=${minV:-1}
minV=1
#######
## Algorithm: 
### srand() for seed.
### Start a loop to generate N random numbers
### 1. On each iteration generate a random integer between MIN and MAX:
###### This works as follows:
######### Call the rand() function (this generates a number in [0, 1)).
######### Multiply it by the size of the range (MAX - MIN + 1), add MIN, and keep the integer part with int().
######### With MIN = 0 this reduces to a number between 0 and MAX.
### 2a. If the value already exists, go back one step in the counter and generate a new value.
### 2b. If the value hasn't been generated yet, save it into an array.
#########
echo "" | awk 'BEGIN { if ('$N' > '$maxV'-'$minV') {print ("Number of unique values to be generated exceeds the possible range"); exit;} srand(1); \
                    for (i = 1; i <= '$N'; i++) {\
                    v=int('$minV'+rand()*('$maxV'-'$minV'+1)); if(v in ranVar) i-- ; else { ranVar[v]++ }}}\
                    END {for (i in ranVar) { print ranVar[i], i;} }'


1 23
1 28
1 8
1 29
1 10
1 13
1 15
1 18
1 31
1 33
1 20
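
The same rejection idea can also be written with awk variables passed through `-v` instead of splicing shell variables into the program text, which keeps the quoting simpler. A sketch under the same settings (N=11, range 1 to 36); the names `seen` and `have` are illustrative, and the exact values drawn depend on the awk implementation:

```shell
N=11; minV=1; maxV=36
picked=$(awk -v n="$N" -v min="$minV" -v max="$maxV" -v seed=1 'BEGIN {
    if (n > max - min + 1) {
        print "cannot draw " n " unique values from a range of " (max - min + 1) > "/dev/stderr"
        exit 1
    }
    srand(seed)
    while (have < n) {
        v = min + int(rand() * (max - min + 1))            # candidate in [min, max]
        if (!(v in seen)) { seen[v] = 1; have++; print v }  # keep only new values
    }
}')
echo "$picked"
```

Counting accepted values in `have` replaces the `i--` step: the loop simply runs until enough distinct values have been collected.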


In [472]:
## We can use this program to randomly sample a table.
file="cols.txt"
maxV=`wc -l $file | awk -F " " '{print $1 }'`
echo $maxV
######
N=3 #Number of lines to sample
minV=${minV:-1}
minV=1
### We can now use this random list to sample a file. 
awk 'BEGIN { if ('$N' > '$maxV'-'$minV') {print ("Number of unique values to be generated exceeds the possible range"); exit;} srand(1); \
                    for (i = 1; i <= '$N'; i++) {\
                    v=int('$minV'+rand()*('$maxV'-'$minV'+1)); if(v in ranVar) i-- ; else { ranVar[v]++ }}}\
                    {if (NR in ranVar) {print ($0)} }' $file
                    #END {for (i in ranVar) { print ranVar[i], i;} }'


6
file3	3	6	20
line_a	12	13	144
line_b	15	16	177
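
An alternative that avoids retries is selection sampling (often attributed to Knuth as Algorithm S). This is a different technique from the rejection loop above: each line is kept with probability (lines still needed) / (lines not yet read), so exactly n lines are printed, in their original order. A sketch on a made-up table (`table.txt` is hypothetical):

```shell
# Build a made-up six-line, tab-separated table.
workdir=$(mktemp -d)
printf 'row%d\t%d\n' 1 10 2 20 3 30 4 40 5 50 6 60 > "$workdir/table.txt"
total=$(wc -l < "$workdir/table.txt")

sample=$(awk -v n=3 -v total="$total" -v seed=1 '
    BEGIN { srand(seed) }
    # Keep this line with probability (still needed) / (lines left, this one included).
    rand() * (total - NR + 1) < n - kept { print; kept++ }
' "$workdir/table.txt")
echo "$sample"
```

The total number of lines must be known in advance, which is why `wc -l` is run first, as in the cells above.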


In [476]:
## We can use this program to randomly sample a FASTA file.
# This program introduces a modification in the algorithm:
## Only even line numbers (the sequence lines, assuming one sequence line per record) are accepted, so at most Max/2 records can be sampled.
file="promoters.fa"
maxV=`wc -l $file | awk -F " " '{print $1 }'`
echo $maxV
######
N=4 #Max number of numbers being sampled
minV=${minV:-1}
minV=0
### We can now use this random list to sample a file. 
awk 'BEGIN { maxPossible=int( '$maxV'/2) + 1;\
        file=ARGV[1]; gsub(/.fa/,"", file); toFile=file"_RanSample.fa";\
        if ('$N' > maxPossible-'$minV') {print ("Number of unique values to be generated exceeds the possible range"); exit;}\
        srand(1); \
        for (i = 1; i <= '$N'; i++) {\
            v=int('$minV'+rand()*('$maxV'-'$minV'+1));\
                  if(v==0 || v%2==1 || v in ranVar) i-- ; else { ranVar[v]++ ; print (v)}}}\
        {if ($1 ~ />/) { tmp=$0;} ;\
        if (NR in ranVar) {print tmp "\n" toupper($0) > toFile }} END {print toFile}' $file



218
184
86
174
168
promoters_RanSample.fa
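
Another way to look at the same problem is to draw record numbers instead of line numbers, so that no draw has to be rejected for landing on a header line. A sketch on a made-up five-record file (`toy.fa`, `chosen` and `rec` are illustrative names; which records come out depends on the awk implementation):

```shell
# Made-up FASTA file with five single-line records.
workdir=$(mktemp -d)
cat > "$workdir/toy.fa" <<'EOF'
>geneA
acgt
>geneB
ttga
>geneC
ccgg
>geneD
gatt
>geneE
tacc
EOF

records=$(grep -c '^>' "$workdir/toy.fa")
fa_sample=$(awk -v n=2 -v records="$records" -v seed=1 '
    BEGIN {
        srand(seed)
        while (have < n) {                     # assumes n <= records
            k = 1 + int(rand() * records)      # record index in [1, records]
            if (!(k in chosen)) { chosen[k] = 1; have++ }
        }
    }
    /^>/ { rec++; if (rec in chosen) print; next }   # header: advance the record counter
    (rec in chosen) { print toupper($0) }            # sequence line(s) of a chosen record
' "$workdir/toy.fa")
echo "$fa_sample"
```

Because records are identified by counting headers, this version does not depend on the header sitting on an odd line.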


<hr>

### Random sampling FASTQ files

In [478]:
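## We can use this program to randomly sample a FASTQ file.
# This program introduces a modification in the algorithm:
## Only line numbers that are multiples of 4 (the quality line that closes each record) are accepted, so at most Max/4 records can be sampled.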
file="testQ32.fastq"
maxV=`wc -l $file | awk -F " " '{print $1 }'`
echo $maxV
######
N=4 #Max number of numbers being sampled
minV=${minV:-1}
minV=1
### We can now use this random list to sample a file. 
awk 'BEGIN { maxPossible=int( '$maxV'/4) + 1;\
             separator="+";\
        file=ARGV[1]; gsub(/.fq|.fastq/,"", file); toFile=file"_RanSample.fq";\
        if ('$N' > maxPossible-'$minV') {print ("Number of unique values to be generated exceeds the possible range"); exit;}\
        srand(); \
        for (i = 1; i <= '$N'; i++) {\
            v=int('$minV'+rand()*('$maxV'-'$minV'+1));\
                  if(v%4!=0 || v in ranVar) i-- ; else { ranVar[v]++ ; print (v)}}}\
    {\
        if(NR%4==1) {header=$0}\
                 else { if(NR%4==2) {sequence=$0}\
                    else { if(NR%4==0) {quality=$0;\
        if (NR in ranVar) {print (header "\n" sequence "\n" separator "\n" quality ) > toFile}\
       }}}\
    }' $file



32
16
12
20
24
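
After sampling, it can be worth checking that the output is still well-formed FASTQ, for example that no header line carries an extra leading field. A minimal sketch of such a check, using a hypothetical helper `check_fastq` and two made-up files (`good.fq`, `bad.fq`):

```shell
# Two made-up files: one valid record, and one with a prefixed header and a short quality line.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/good.fq"
printf '4\t@read1\nACGT\n+\nIII\n' > "$workdir/bad.fq"

# check_fastq FILE: print one message per problem found, or "ok" if there are none.
check_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": header does not start with @"; bad++ }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator does not start with +"; bad++ }
        NR % 4 == 0 && length($0) != seqlen { print "line " NR ": quality and sequence lengths differ"; bad++ }
        END {
            if (NR % 4 != 0) { print "line count is not a multiple of 4"; bad++ }
            if (!bad) print "ok"
        }
    ' "$1"
}

good_report=$(check_fastq "$workdir/good.fq")
bad_report=$(check_fastq "$workdir/bad.fq")
printf '%s\n---\n%s\n' "$good_report" "$bad_report"
```

The check only covers the four-line layout described earlier; it does not validate the characters used in the sequence or quality lines.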



### More thorough guides and tutorials:

* http://www.grymoire.com/Unix/Awk.html

### Regular expressions:

* https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/


