# Linux commands with bash

**Note:** I would encourage you to try out at least some of these commands at the Linux command line using a Jupyter terminal. There are instructions for accessing the terminal at the end of the video/slides associated with this Notebook. Just make sure that you are in an appropriate subdirectory before you start (checking this is, in itself, good practice!).

## Moving and listing
Our first three Linux commands are `pwd` (print working directory), `cd` (change directory) and `ls` (list directory contents). We also use `echo`, which we are already familiar with:

In [1]:
%%bash
echo 'current directory:'
pwd
echo
echo 'home directory:'
cd
pwd
echo
echo 'data directory:'
cd biocomp1-2022/data/
pwd
echo
echo 'Contents of data directory:'
ls

current directory:
/d/user6/as004/biocomp-spring_2023/session11

home directory:
/d/user6/as004

data directory:
/d/user6/as004/biocomp1-2022/data

Contents of data directory:
12e8.h
1CS4.npz
A0A0G2RR03.fasta
A0A0G2RZ64.fasta
aa_types.txt
add.txt
AF316817.gb
atoms.txt
bacteria.txt
bubbles.txt
chain_ids.txt
chain_ids_with_errors.txt
clever_birds.txt
codons.txt
common_scientific.txt
common.txt
corvids.txt
dna_adj.csv
dna.txt
emdb.db
FA8_HUMAN.fasta
filenames.txt
garden_birds.txt
genes.gb
hAPP.clustal
hAPP.phylip
HLA-B1542.txt
HLA-B1550.txt
ICTV2015.csv
integers.txt
multi_seqs.txt
names1.txt
names2.txt
names3.txt
names4.txt
number_rows.txt
numbers.txt
numbers_with_errors.txt
P00451_1.gb
P03437.fasta
pdb3vun.ent
pdb_chains2.txt
pdb_chains.txt
pdb_counts.txt
PDB_data.csv
PDB_growth.csv
pdb_species2.txt
pdb_species.txt
penguinpox.fasta
plot_data.txt
s1.txt
sample1.txt
sample2.txt
seq1.txt
seq2.txt
seq_3code.txt
seq_long.txt
seq_n.txt
seq_ss_n2.txt
seq_ss_n.txt
seq_ss.txt
seqs_with_ids.txt
sp
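
As a small illustration of how `cd` and `pwd` interact, the sketch below moves around a throwaway directory tree using both absolute and relative paths. The directory names (`project`, `data`) are hypothetical and exist only for this example; `mktemp -d` is used so that nothing in your course folders is touched.

```bash
# Create a temporary directory tree (names are made up for this sketch).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/project/data"

# Absolute path: works from anywhere.
cd "$demo_dir/project/data"
data_dir=$(pwd)
echo "in data:      $data_dir"

# '..' is a relative path meaning "the parent of the current directory".
cd ..
parent_dir=$(pwd)
echo "after cd ..:  $parent_dir"

# A relative path is resolved from wherever we are now.
cd data
back_in_data=$(pwd)
echo "after cd data: $back_in_data"

# Clean up: leave the directory before deleting it.
cd /
rm -r "$demo_dir"
```

Because relative paths depend on the current directory, running `pwd` first (as the cell above does) is a cheap way to avoid surprises.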

Next, `ls` is invoked with different parameters:
- `ls -aC` (list all [`a`] entries in columns [`C`]) 
- `ls -ltr` (long listing [`l`] sorted by modification time [`t`] reversed [`r`], hence most recent last)
- `ls -lShrF` (long listing [`l`] sorted by size [`S`], human readable [`h`] and reversed [`r`], hence largest last, with a type indicator such as `/` appended to names [`F`])

The `-a` flag is only worth using if you are interested in files that start with a dot, as these are usually hidden (certain configuration files such as `.cshrc` are like that). The `-h` flag is useful if you have big files in the directory you are listing. For example, `7111680` (bytes) becomes `6.8M` (megabytes).

In [2]:
%%bash
cd ../data/
echo 'ls -aC:'
ls -aC
echo
echo 'ls -ltr *.fasta:'
ls -ltr *.fasta
echo
echo 'ls -lShrF:'
ls -lShrF

ls -aC:
.			   genes.gb		    pdb_species2.txt
..			   hAPP.clustal		    pdb_species.txt
12e8.h			   hAPP.phylip		    penguinpox.fasta
1CS4.npz		   HLA-B1542.txt	    plot_data.txt
A0A0G2RR03.fasta	   HLA-B1550.txt	    Q8WZ42.fasta
A0A0G2RZ64.fasta	   ICTV2015.csv		    s1.txt
aa_types.txt		   integers.txt		    sample1.txt
add.txt			   .ipynb_checkpoints	    sample2.txt
AF316817.gb		   multi_seqs.txt	    seq1.txt
atoms.txt		   names1.txt		    seq2.txt
bacteria.txt		   names2.txt		    seq_3code.txt
bubbles.txt		   names3.txt		    seq_long.txt
chain_ids.txt		   names4.txt		    seq_n.txt
chain_ids_with_errors.txt  number_rows.txt	    seq_ss_n2.txt
clever_birds.txt	   numbers.txt		    seq_ss_n.txt
codons.txt		   numbers_with_errors.txt  seq_ss.txt
common_scientific.txt	   P00451_1.gb		    seqs_with_ids.txt
common.txt		   P03437.fasta		    species1.txt
corvids.txt		   P42858.fasta		    species2.txt
dna_adj.csv		   pdb3vun.ent		    sub.txt
dna.txt			   pdb_chains2.txt	    taxonomy.txt
emdb.db		
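
To see the effect of `-a` in isolation, here is a minimal sketch that creates one ordinary file and one dot file in a temporary directory and lists them with and without the flag. The file names are hypothetical, chosen only for the demonstration.

```bash
# Work in a throwaway directory so no real files are affected.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Hypothetical file names: one ordinary file, one hidden (dot) file.
touch visible.txt .hidden_config

echo 'ls:'
plain_listing=$(ls)        # dot files are left out
echo "$plain_listing"

echo
echo 'ls -a:'
all_listing=$(ls -a)       # includes ., .. and .hidden_config
echo "$all_listing"

# Clean up.
cd /
rm -r "$demo_dir"
```

The plain listing shows only `visible.txt`, whereas `ls -a` also shows `.hidden_config` together with the special entries `.` (this directory) and `..` (its parent), just as in the `ls -aC` output above.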

## Working with files

There are two files in the `data` directory called `sample1.txt` and `sample2.txt`. Both contain **different versions** of the *Lorem Ipsum* dummy text (the standard dummy text of the printing and typesetting industry since the 1500s!). Here `diff` is used to print out the differences. Appending `|| true` ensures that the cell ends with a zero exit status; without it, `diff` returns a non-zero status (because the files are different), which Jupyter reports as a `CalledProcessError`.

In [3]:
%%bash
pwd
cd ../data
diff sample1.txt sample2.txt || true

/d/user6/as004/biocomp-spring_2023/session11
2c2
< sed do eiusmod tempor incididunt ut labor et dolore magna aliqua. 
---
> sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
3a4
> nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in 
5d5
< nulla pariatur. Excepteur sint occaecat cupidatat non proident, 
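
The exit-status behaviour described above can be checked directly. The following sketch uses two tiny made-up files (not the course files) and records the status of `diff` in three situations: identical input, differing input, and differing input with `|| true` appended. The `cmd || status=$?` idiom is just one way to capture a status without aborting the script.

```bash
# Two small, made-up files in a temporary directory.
demo_dir=$(mktemp -d)
printf 'alpha\nbeta\n'  > "$demo_dir/a.txt"
printf 'alpha\ngamma\n' > "$demo_dir/b.txt"

# Comparing a file with itself: no differences, status 0.
same_status=0
diff "$demo_dir/a.txt" "$demo_dir/a.txt" > /dev/null || same_status=$?

# Comparing different files: differences found, status 1.
diff_status=0
diff "$demo_dir/a.txt" "$demo_dir/b.txt" > /dev/null || diff_status=$?

# With '|| true' the overall status of the line is that of 'true', i.e. 0.
diff "$demo_dir/a.txt" "$demo_dir/b.txt" > /dev/null || true
masked_status=$?

echo "identical files:        $same_status"
echo "different files:        $diff_status"
echo "different, with ||true: $masked_status"

rm -r "$demo_dir"
```

A status of 1 from `diff` therefore means "the files differ", not "something went wrong"; genuine trouble (such as a missing file) is signalled by a status greater than 1.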


The following command uses `grep` to find **all occurrences of the string "flu"**, ignoring case (the `-i` flag), in files within the `data` subdirectory. To see the same output with syntax highlighting, copy and paste the command into a Jupyter terminal.

In [4]:
%%bash
grep -i flu ../data/*

../data/A0A0G2RR03.fasta:>tr|A0A0G2RR03|A0A0G2RR03_9INFA Hemagglutinin (Fragment) OS=Influenza A virus (A/Pusan/22/2002(H1N1)) OX=232836 GN=HA PE=3 SV=1
../data/A0A0G2RZ64.fasta:>tr|A0A0G2RZ64|A0A0G2RZ64_9INFA Hemagglutinin (Fragment) OS=Influenza A virus (A/Indiana/07/2010(H3N2)) OX=1318026 GN=HA PE=3 SV=1
../data/AF316817.gb:DEFINITION  Influenza A virus (A/Athens/2/98 (H3N2)) hemagglutinin gene,
../data/AF316817.gb:SOURCE      Influenza A virus (A/Athens/2/1998(H3N2))
../data/AF316817.gb:  ORGANISM  Influenza A virus (A/Athens/2/1998(H3N2))
../data/AF316817.gb:            Orthomyxoviridae; Influenzavirus A.
../data/AF316817.gb:            neuraminidase sequences from recent human influenza type A (H3N2)
../data/AF316817.gb:                     /organism="Influenza A virus (A/Athens/2/1998(H3N2))"
../data/AF316817.gb:                     /note="H3N2; similar to influenza A virus (A/South
../data/common_scientific.txt:European perch: Perca fluviatilis
Binary file ../data/emdb.db match
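
To make the effect of `-i` concrete, the sketch below searches three small made-up files (their names and contents are assumptions for this example only). It also uses two further standard `grep` flags that the notebook does not cover: `-c` to count matching lines and `-l` to print only the names of files that contain a match.

```bash
# Three small, made-up files in a temporary directory.
demo_dir=$(mktemp -d)
printf 'Influenza A virus\nHomo sapiens\n' > "$demo_dir/one.txt"
printf 'Perca fluviatilis\nFLU season\n'   > "$demo_dir/two.txt"
printf 'Corvus corax\n'                    > "$demo_dir/three.txt"

# Case-sensitive: only the lowercase "flu" in "fluviatilis" matches.
sensitive=$(grep -c flu "$demo_dir/two.txt")

# Case-insensitive: "FLU season" now matches as well.
insensitive=$(grep -ci flu "$demo_dir/two.txt")

# -l lists just the files that contain at least one match.
matching_files=$(cd "$demo_dir" && grep -il flu *.txt)

echo "case-sensitive matching lines:   $sensitive"
echo "case-insensitive matching lines: $insensitive"
echo "files containing a match:"
echo "$matching_files"

rm -r "$demo_dir"
```

Listing file names with `-l` can be a convenient first step when a search over a whole directory returns a lot of lines, as it does above.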

Write a single line of Linux that prints out the **first 3 lines** of all `.fasta` files in subdirectory `biocomp1/data` using the `head` command:

In [9]:
%%bash
head -n 3 ../data/*.fasta 

==> ../data/A0A0G2RR03.fasta <==
>tr|A0A0G2RR03|A0A0G2RR03_9INFA Hemagglutinin (Fragment) OS=Influenza A virus (A/Pusan/22/2002(H1N1)) OX=232836 GN=HA PE=3 SV=1
MKAKLLVLLCTFTATYADTICIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDSHNGKLCL
LKGIAPLQLGNCSVAGWILGNPECELLISKESWSYIVETPNPENGTCYPGYFADYEELRE

==> ../data/A0A0G2RZ64.fasta <==
>tr|A0A0G2RZ64|A0A0G2RZ64_9INFA Hemagglutinin (Fragment) OS=Influenza A virus (A/Indiana/07/2010(H3N2)) OX=1318026 GN=HA PE=3 SV=1
MKTIIALSYILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQ
SSSTGEICDSPHQILDGKNCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPD

==> ../data/FA8_HUMAN.fasta <==
>sp|P00451|FA8_HUMAN Coagulation factor VIII OS=Homo sapiens GN=F8 PE=1 SV=1
MQIELSTCFFLCLLRFCFSATRRYYLGAVELSWDYMQSDLGELPVDARFPPRVPKSFPFN
TSVVYKKTLFVEFTDHLFNIAKPRPPWMGLLGPTIQAEVYDTVVITLKNMASHPVSLHAV

==> ../data/P03437.fasta <==
>sp|P03437|HEMA_I68A0 Hemagglutinin OS=Influenza A virus (strain A/Aichi/2/1968 H3N2) GN=HA PE=1 SV=1
MKTIIALSYIFCLALGQDLPGNDNSTATLCLGHHAVPNGTLVKTITDDQIEVTNATELVQ
SSSTG

## Sorting the contents of files

Here `sort` is used to print out the contents of a file **sorted alphabetically**, meaning that the lines are compared as text rather than by their numeric value:

In [10]:
%%bash
sort ../data/numbers.txt

0.932
1153.04
1.63
-187.0
2347.105
25.307
-2749.655
31.33333
-32.78
39.2
-4.1
4.2
-422.343
5.65
5.912
61.5
780.4592
8.0
-8205.9
87.612
928.7
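
The difference between comparing lines as text and comparing them as numbers is easiest to see on a very small input. The sketch below uses four made-up values and sets `LC_ALL=C` so that the text comparison is a plain byte-by-byte one. That setting is an assumption made for reproducibility here: the output above appears to come from a locale-aware comparison that gives little weight to minus signs and decimal points, so its ordering differs from what `LC_ALL=C` would give.

```bash
# Four made-up values, one per line.
numbers=$(printf '10\n9\n-2\n1.5\n')

# Text comparison, character by character: "10" sorts before "9"
# because the character '1' comes before '9'.
text_sorted=$(printf '%s\n' "$numbers" | LC_ALL=C sort)

# Numeric comparison: 9 sorts before 10.
numeric_sorted=$(printf '%s\n' "$numbers" | LC_ALL=C sort -n)

echo 'sort (as text):'
echo "$text_sorted"
echo
echo 'sort -n (as numbers):'
echo "$numeric_sorted"
```

In the text sort the order is `-2`, `1.5`, `10`, `9`; with `-n` it becomes `-2`, `1.5`, `9`, `10`.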


Here the task is to modify the previous command so that it prints out the file in **reverse numerical order**. We will do this in two stages. Firstly, we will consult the manual (using the `man` command in combination with `grep`) to find out how to perform a reverse numerical sort: 

In [11]:
%%bash
man sort | grep reverse
man sort | grep numeric

       -r, --reverse
              reverse the result of comparisons
              consider only blanks and alphanumeric characters
       -g, --general-numeric-sort
              compare according to general numerical value
       -h, --human-numeric-sort
       -n, --numeric-sort
              compare according to string numerical value
              sort according to WORD: general-numeric  -g,  human-numeric  -h,
              month -M, numeric -n, random -R, version -V


Secondly, we will perform the sort: 

In [12]:
%%bash
sort -rn ../data/numbers.txt

2347.105
1153.04
928.7
780.4592
87.612
61.5
39.2
31.33333
25.307
8.0
5.912
5.65
4.2
1.63
0.932
-4.1
-32.78
-187.0
-422.343
-2749.655
-8205.9
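
The manual excerpt above lists both `-n` and `-g`. For plain decimal numbers like those in `numbers.txt` they give the same order, but they can differ when values are written in scientific notation. The sketch below illustrates this with three made-up values; the behaviour described is that of GNU `sort`, and other implementations may differ.

```bash
# Three made-up values; 1e3 means 1000 in scientific notation.
values=$(printf '1e3\n20\n3\n')

# -n reads only the leading plain number, so "1e3" is treated as 1.
n_sorted=$(printf '%s\n' "$values" | LC_ALL=C sort -n)

# -g understands the exponent, so "1e3" is treated as 1000.
g_sorted=$(printf '%s\n' "$values" | LC_ALL=C sort -g)

echo 'sort -n:'
echo "$n_sorted"
echo
echo 'sort -g:'
echo "$g_sorted"
```

If your data may contain values such as `1e3` or `2.5E-4`, `-g` is the safer choice; otherwise `-n` is sufficient.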


## Putting it all together

Finally (if you have time), **write five Linux commands** that:
- Perform an ascending numerical sort on file `~/biocomp1/data/numbers.txt` and redirect the sorted output to a file called `sorted_temp.txt` in the current working directory
- Print out the last 3 lines of `sorted_temp.txt`
- Print out the text `Line count:`
- Print out the number of lines in file `sorted_temp.txt` (not necessarily on the same line as the text)
- Delete file `sorted_temp.txt`.

In [34]:
%%bash
sort -n ../data/numbers.txt > sorted_temp.txt
tail -n 3 sorted_temp.txt
echo "Line count:"
wc -l sorted_temp.txt
rm sorted_temp.txt

928.7
1153.04
2347.105
Line count:
21 sorted_temp.txt
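
As an aside, the same information can be obtained without a temporary file by connecting commands with a pipe. This is only a sketch of an alternative, not the intended answer to the exercise, and it uses a small made-up input file rather than `numbers.txt`. Reading the file through `<` makes `wc -l` print the count alone, without the file name.

```bash
# A small, made-up input file.
demo_file=$(mktemp)
printf '3.5\n-1\n20\n7\n' > "$demo_file"

# Sort numerically and pass the result straight to tail: no temp file needed.
last_three=$(sort -n "$demo_file" | tail -n 3)

# With input redirection, wc has no file name to report.
line_count=$(wc -l < "$demo_file")

echo "$last_three"
echo "Line count: $line_count"

rm "$demo_file"
```

Sorting does not change the number of lines, so counting the lines of the original file gives the same answer as counting those of the sorted copy.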
