Advanced UNIX commands used in bioinformatics

Autor: Luiz Carlos
Data: 31/12/2021

Bash is a Unix shell and command language written by Brian Fox for the GNU Project as a free software replacement for the Bourne shell. First released in 1989, it has been used as the default login shell for most Linux distributions. A version is also available for Windows, via the Windows Subsystem for Linux.¹

Redirect outputs from command line into a file - Part 2

# Showing output and overwritting a existing file using pipe (|) and tee
command | tee file.txt

# Showing output and append it to a existing file
command | tee -a file.txt

# Showing output and append it to a existing file
command | tee -a file.txt

Visualize files - Part 2

more and less

Those are a *nix command line used to display the contents of a file in a console.
The basic usage of more command is to run the command against a file as shown below:

more filename.ext 

less filename.ext

Option for less and more:

-g               # highlight serch, last match  
-G               # highlight serch  
-s               # squeze long lines  
-S               # chop-long-lines

MOVING:

e or ▼           # forward one line  
y or ▲           # backward one line  
f or space       # forward one window  
b                # backward one windown

SEARCHING:

/pattern         # search for a parttern forward  
?pattern         # search for a pattern backward  
&pattern         # display only matching lines

JUMPING

g - ESC-<        # Go to first line in file (or line N).
10g              # Jump to a specific line forward.
G - ESC->        # Go to last line in file (or line N).
10G              # Jump to a specific line backward.

In nano

ctrl + shift + - and choose the line number

Exit

q                # quit reading file

Uses of regular expression - regex

*       # Any kind of character one or more times
?       # Any kind of character one time
[0-9]   # Any number character
[a-z]   # Any alphabetic character
{pattern1,pattern2,etc.}.png

Usage:

ls *
ls *.png
ls [0-9]*.png           # find a match which have a number repeated mulpliple times. Such as 111.png, 90876.png etc.
ls [a-z]*.png           # The same as above, but now with alphabetical characteres.
ls ?mage.png            # Find a match which starts with any character only one time folowed by mage.png
ls {file1,file2}.png

Global regular Expression print - GREP

Description:

grep searches for PATTERNS in each FILE. PATTERNS is one or more patterns separated by newline characters, and grep prints each line that matches a pattern. Typically PATTERNS should be quoted when grep is used in a shell command.

A FILE of “-” stands for standard input. If no FILE is given, recursive searches examine the working directory, and nonrecursive searches read standard input.

In addition, the variant programs egrep, fgrep and rgrep are the same as grep -E, grep -F, and grep -r, respectively. These variants are deprecated, but are provided for backward compatibility.

Sintax:

grep [OPTION] PATTERNS [FILE]
grep [OPTION] -e PATTERNS [FILE]
grep [OPTION] -f PATTERN_FILE [FILE]

Options:

-e PATTERNS, --regexp=PATTERNS 
    Use PATTERNS as the patterns. If this option is used multiple times or is combined with the -f (--file) option, 
    search for all patterns given. This option can be used to protect a pattern beginning with “-”.
-f FILE, --file=FILE
    Obtain patterns from FILE, one per line.  If this option is used multiple times or is combined with the -e  (--regexp)  option,
    search  for  all  patterns  given.  The empty file contains zero patterns, and therefore matches nothing.
-i, --ignore-case
    Ignore case distinctions in patterns and input data, so that characters that differ only in case  match each other.
-v, --invert-match
    Invert the sense of matching, to select non-matching lines.
-c, --count
    Suppress normal output; instead print a count of matching lines for each input file. With the -v, count non-matching lines.
-o, --only-matching
    Print  only  the matched (non-empty) parts of a matching line, with each such part on a separate output line.
-n, --line-number
    Prefix each line of output with the 1-based line number within its input file.
-P, --perl-regexp
    Interpret  PATTERNS  as  Perl-compatible regular expressions (PCREs). This option is experimental when combined with 
    the -z (--null-data) option, and grep -P may warn of unimplemented features.

Usage:

grep pattern file.ext

# find a pattern in all files.txt
grep "pattern" *.txt

# return the line number of the pattern
grep -n "pattern" file.ext

# Counts the occurences of the pattern
grep -c "pattern" file.ext

# to ignore case sensitive
grep -i "PatTern" file.ext

# to grep all except the pattern
grep -v "pattern" filename

# to grep multiple patterns
grep "[I,D]"

# to grep multiple patterns with extended expression
egrep "pattern1|parttern2" file
grep -e "pattern1" -e "pattern2" file

# Seqching from pattern with range of number
grep --regexp=ID={0..100}_{1..3} file

# to show x lines Before pattern
grep -B 10 "pattern" *.txt

# to show x lines after pattern
grep -A 10 "pattern" *.txt

Usage with --perl-regexp patterns

\d = digit character
\D = non digit character
\t = tab delimited
\n = end line, equal to $
\w = word
\W = non word
\s = whitespace
\S = non whitespace
.  = any character
+  = 1 or more character
*  = zero or more characteres

grep -P "^ID=\d+" 
grep -P "ID=\d+\tName:\w" file.ext

Stream Editor - SED

SED command in UNIX stands for stream editor and it can perform lots of functions on file like searching, find and replace, insertion or deletion.

Usage:

sed OPTIONS... [SCRIPT] [INPUTFILE...]

options:

-i ------> change the file in place
-e ------> print without changing the file
-n ------> show just the result of the command
s -------> replace a pattern for another
p -------> p at the end, prints
g -------> g at the end, change all accurrences

Sed command is mostly used to replace the text in a file.

# Replacing or substituting string not in place: 
sed 's/strig_to_find/new_string/' file.txt

# Replacing or substituting string in place: 
sed -i 's/strig_to_find/new_string/' file.txt

# Replacing a string between lines 3 and 9: 
sed '3,9s/strig_to_find/new_string/' file.txt

# Delete whole line with the string specified on the line
sed -i '/string/d' file.txt

# insert a string at the beginning of each line
sed 's/^/string/' file.txt

# replace all occurrences of the strings “marcos”, “luiz”, “karol” by “friends”
sed 's/marcos\|luiz\|karol/friends/g' file.txt

# Insert the string ‘#’ at the beggining of lines 1 to 8
sed -i '1,8s/^/#/' file.txt

# Substitute “foo” by “bar” onle at lines with “##”
sed '/##/s/foo/bar/g' arquivo.txt

# prints only the lines each starts with the string ‘http’
sed -n '/^http/p' file.txt

# print only the line 9
sed -n '9p' file.txt

# Print lines from line 6 to 9 
sed -n '6,9p' file.txt

# Removing blank lines from a file
sed -i '/^$/d' file.txt

link for more exemples

Subseting a file with sed

sed -n -e '10,1000p' input.txt > output.txt

# -n means don't print each line by default. 
# -e means execute the next argument as a sed script. 
# 10,1000p is the subseting, from line 10 until line 100 (inclusive)
# p iqual to print those lines

AWK

Awk is a pattern scanning and processing language for manipulating data reports. Awk allows us to use variables, numerif funcions, string functions, and logical operators.

With this utility is possible to write programs in form of statements that define text patterns to be searched. By default it reads standard input and writes standard output.

Useful to:

Transform data
Produce formated reports
Performs condicional and loops
Perform arithmetic and strings operations

Filds in awk

awk represent a line as a record and is represented as NR1, NR2, NR3 till length of file. And each record is splited in fild by whitespace and stored variable called $1, $2, $3 ... and $0 is a entire line.

Options:

-F or --field-separator=fs modify the fild of separator.

Syntax:

awk [options] "pattern {action}" input_file.ext > output_file
awk -F: '/pattern/ { print $0 }' input_file.ext > output_file

# In case below, no pattern was used, than the action will be aplied to all record/lines.
awk '{sum += $1 }; END { print sum }' input_file.ext > output_file

The 1 at the end of the program, is a condition (always true) with no action, so it executes the default action for every line, printing the line (which may have been modified by the previous action in braces or not).

awk '/^>/ {$0= "> new name"}1' teste.txt > teste.txt
awk '/^>/ {$0= "> new name"} {print $0}' teste.txt > teste.txt # this has the same effect.

Usage:

# Searches for a pattern in a file, and prints the line
awk "/pattern/ {print}" input_file

# Searches for a pattern in a file, and prints the fild 1 and 4
awk "/pattern/ {print $1, $4}" input_file

# Print the line if the fild 1 ==A and fild2 == T
awk '$1 == "A" && $2 =="T"' input_file

# Divide the fild1 by 4 and print it
awk '{x=$1/4; print x}' input_file

# If the file has a line 3, print the fild 1 and 3
awk '{if(NR==3) print $1, $2}' input_file

# If the file has a line 7, print a substring of all line, from 2 to end
awk '{if(NR==7) print substr($0, 2, 100)}' input_file

# declare a variable, and exceute a if statement
awk '{var=10; if(length($1) > var) print $0}' input_file

awk uses in fast files

# rename a fasfa header
awk '/^>/ {$0= "> new name"}1' teste.fa > teste.txt
# the 1 is a pattern; it evaluates to true. The action is not specified, so the default action is invoked, which is {print}

# Print only one sequence from a multifasta file
awk 'BEGIN {RS=">"} /chr1/ {print ">"$0}' teste.fa > teste.txt


awk '{if (RS!=">") print $0}'

Explanation:

BEGIN ---------- This tells the awk to execute the immediately following code in curly brackets at the beginning.
{RS=”>”} ------- Record separator  (Since every sequence starts with a “>” sign. This will separate the sequences records from the file)
/Chr2/ --------- keyword to search in the entire record
{print “>”$0} -- $0 is the current record (From “Chr1” to the entire sequence till next “>”). 
Add “>” at the beginning to get the standard identitifer which is not included in $0.

seqkit

Searches for sequences in a multifasta file, and retrieves thoses which sequences matches the ids/or full name in the a file with a list of ids.

-f/--pattern-file for a file of patterns
-n/--by-name for matching full name instead of just ID.

conda install -c bioconda seqkit
 
seqkit grep -n -f retrons_ids.txt sequences.fasta > retrons_sequences.fasta
seqkit grep -f retrons_ids.txt sequences.fasta > retrons_sequences.fasta

Compare sorted files with COMM

Description:
Compare sorted files FILE1 and FILE2 line by line.

Syntax:
comm [OPTION]... FILE1 FILE2

# With no options, produce three-column output:
# Column one contains lines unique to FILE1, 
# column two contains lines unique to FILE2, and 
# column three contains lines common to both files.
comm fileA.txt fileB.txt

# suppress column 1 (lines unique to FILE1) 
comm -1 fileA.txt fileB.txt

# suppress column 2 (lines unique to FILE2) 
comm -2 fileA.txt fileB.txt

# suppress column 3 (lines that appear in both files)  
comm -3 fileA.txt fileB.txt

# Show the differences from two outputs
comm -3 <(zcat SRR11940311.1/*_1.fastq.gz | grep "@" | cut -d " " -f 1) <(zcat SRR11940311.1/*_2.fastq.gz | grep "@" | cut -d " " -f 1)

# Print only lines present in both file1 and file2.
comm -12 fileA.txt fileB.txt

Diff

# Command Substitution: $(...)
diff $(ls) $(ls)

# Process Substitution: <(...)
diff <(cat file1 | cut -f 1) <(cat file2 | cut -f 1)

CHMOD

This command modify files permissions, controling who can access files, search directories, and run scripts.

Usage: chmod [mode] [operator] [permission] file.ext

Mode ugoa:

These letters "ugoa" in the command chmod definy which of the user will have ther permissions modified.

u : Owner of the file (user);
g : Members of the group the file belongs (Groups);
o : any other user (others);
a : all user of the system (all).

If none is passed to the command, the system will uses all as default.

Operators:

With the command chmod, has to be used with a operator, to specify the kind of permission which will be given.

+ : Grants the permission. The permission is added to the existing permissions;
– : remove permissions specified;
= : Set a permission and remove others.

Values:

r : The read permission.
w : The write permission.
x : The execute permission.

Setting And Modifying Permissions

Checking the permission

ls -l file.txt

output: I separated them to a better visualization:

- rwx rw- r-- 1 luiz luiz 6kb Jan 12 14:38 file.txt
d rwx rwx rw- 1 luiz luiz 6k Jan 12 14:38 data

where the first fild "-" means the file is not a directory, like in the seconda exemple, represented by "d".

The seconda fild "rwx", represent the permissions for the user;
The third fild "rw-", represents the permissions for others;
And the forth fild "r--", represents the permissions to all.

Setting a permissions

# removing permission to write and execute to group and others
chmod go-wx file.txt

# set only permission to read and write to group and others
chmod go=rw file.txt

# Add permission to execute to group and others
chmod go+x file.txt

For Loop in linux

Loop over a string

sample="1 2 3 4"; for i in $sample; do echo "wellcome $i times"; done

Over a array:

first declare an array of string with type.
Then Iterate the string array using for loop.
The doble quote around ${StringArray[@]}, force the whitespaced variable to be read as a simgle variable.

StringArray=("Linux Mint" "Fedora" "Red Hat Linux" "Ubuntu" "Debian" ); for val in "${StringArray[@]}"; do echo $val; done

Loop over a file

array=("seq1", "seq2", "seq3") for i in "${array[@]}"; do wget -N "ftp://ftp.patricbrc.org/genomes/$i/$i.fna"; done

Running scripts

# simple run
bash script.sh

# Run in second plan
bash script.sh &

# redirect both stdout and stderr to file.log.txt
bash script.sh >& script.log.txt

# to run the scrip withou hang up, redirect both stdout and stderr to file.log.txt and run in second plan
nohup sh script.sh >& script.log.txt &

Running scripts with a conda env setup

This kind of running is required when we need to activate a conda env, where all the programs needed to run the script are.

Exemple of script

I do recommend not use any kind of characters as (_ , - , /) in the script name.

At the top of the script, starts as follow:

#!/usr/bin/env bash                # shabang sign - Tells the operating system which interpreter to use. 
set -o errexit                     # This will exit, if any erro happens.
conda activate genomics            # This will activate the env we need to execute the script, change for the one you have.

command 1
command 2
command 3

Before run any command, let's ussume we have a env called genomics, and there are all the programs needed to execute the following script.

# Note: this script starts as showed above.

# running in interactive mode.
bash -i script2.sh

# redirect both stdout and stderr to file.log.txt
bash -i script2.sh >& script2.log.txt


# Now, let's assume we don't wrote in the script to activate genomics
# We can open the terminal and do as follow:
conda activate genomics
./script2.sh >& script2.log.txt    # This comand will run the script interacting if the current env.

Creating alias to open files in wsl

.bashrc is a login scripts which load a user's personal preferences. Upon creation of their account, a .bashrc file with default settings is copied into a user's home directory. The user can then modify that file to customize their session. They can modify environment variables, load modules, create aliases and activate Python virtual environments. Users can also modify .bashrc to change aesthetic things like the terminal color scheme and what their bash prompt displays. To edit your .bashrc file use a command line editor like vim or nano:

Creating a alias "open" by writing it into the .bashrc file:

echo "alias open='wslview'" >> ~/.bashrc

WE can edit the .bashrc too from nano.

nano ~/.bashrc

# with the file opened:
"alias open='wslview'"

Activating alias

source ~/.bashrc

Removing alias:

unalias alias_name

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

3.2-bash_Unix_commands-Advanced.md

3.2-bash_Unix_commands-Advanced.md

Advanced UNIX commands used in bioinformatics

Redirect outputs from command line into a file - Part 2

Visualize files - Part 2

more and less

Uses of regular expression - regex

Global regular Expression print - GREP

Stream Editor - SED

Subseting a file with sed

AWK

awk uses in fast files

seqkit

Compare sorted files with COMM

Diff

CHMOD

Setting And Modifying Permissions

For Loop in linux

Loop over a string

Over a array:

Loop over a file

Running scripts

Running scripts with a conda env setup

Exemple of script

Creating alias to open files in wsl

Creating a alias "open" by writing it into the .bashrc file:

Activating alias

Removing alias:

Files

3.2-bash_Unix_commands-Advanced.md

Latest commit

History

3.2-bash_Unix_commands-Advanced.md

File metadata and controls

Advanced UNIX commands used in bioinformatics

Redirect outputs from command line into a file - Part 2

Visualize files - Part 2

more and less

Uses of regular expression - regex

Global regular Expression print - GREP

Stream Editor - SED

Subseting a file with sed

AWK

awk uses in fast files

seqkit

Compare sorted files with COMM

Diff

CHMOD

Setting And Modifying Permissions

For Loop in linux

Loop over a string

Over a array:

Loop over a file

Running scripts

Running scripts with a conda env setup

Exemple of script

Creating alias to open files in wsl

Creating a alias "open" by writing it into the .bashrc file:

Activating alias

Removing alias: