One way to quickly view text files is to print their contents to standard output (your terminal window) using cat. Let's take a look at the subsetA_CID_51351709_similar.txt file located in the /Udata2/View folder. First, use cd to change your directory to the folder:
user@computer:~$ cd /home/user/Desktop/Udata2/View
We can then use the cat utility to print the contents of a file to standard output:
user@computer:~$ cat subsetA_CID_51351709_similar.txt
CC.CC1CCC2CC(=C(CC2C1)C=C)C=C 154673141 232.400
CC1CCCCC1C2=CC=CCC2 154546688 176.300
CC1CCCC2C1=CC=C3C2C=CC=C3 154307178 198.300
CCC1=CC=C2C(C1)C(CCC2(C)C)(C)C 154280750 218.380
CC1C(=C)CCC2C1=CC=CC2 154236167 160.250
Another common use of cat is to concatenate files. For example, let's say we wanted to combine the three subset*.txt files and redirect (>) the output to a new file named combined_test.txt:
user@computer:~$ cat subset*.txt > combined_test.txt
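To make the concatenation step easier to try safely, here is a minimal sketch that runs in a throwaway directory. The three subset*_demo.txt files and their contents are hypothetical stand-ins, not the workshop data; the point is only that the shell expands subset*.txt in sorted order before cat runs, so the combined file follows that order.

```shell
# Work in a temporary directory so no real data is touched
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-ins for the three subset files (not the workshop data)
printf 'A1\nA2\n' > subsetA_demo.txt
printf 'B1\n'     > subsetB_demo.txt
printf 'C1\nC2\n' > subsetC_demo.txt

# The shell expands subset*.txt to the matching names in sorted order,
# then cat writes their contents, one file after another, to the new file
cat subset*.txt > combined_test.txt

# Read the combined file back to inspect it
combined=$(cat combined_test.txt)
echo "$combined"
```

Note that > overwrites combined_test.txt if it already exists, so it is worth choosing an output name that does not match the subset*.txt pattern.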
We can display the contents of files in reverse line order using tac. Compare tac to our example above with cat:
user@computer:~$ tac subsetA_CID_51351709_similar.txt
CC1C(=C)CCC2C1=CC=CC2 154236167 160.250
CCC1=CC=C2C(C1)C(CCC2(C)C)(C)C 154280750 218.380
CC1CCCC2C1=CC=C3C2C=CC=C3 154307178 198.300
CC1CCCCC1C2=CC=CCC2 154546688 176.300
CC.CC1CCC2CC(=C(CC2C1)C=C)C=C 154673141 232.400
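As a small sketch of what tac does, the example below uses a hypothetical three-line file (lines_demo.txt, created only for illustration). It shows that tac reverses the order of lines, not the characters within them, and that applying tac twice gives back the original order.

```shell
# Create a small hypothetical file in a temporary directory
workdir=$(mktemp -d)
printf 'first\nsecond\nthird\n' > "$workdir/lines_demo.txt"

# tac prints the last line first; each line itself is unchanged
reversed=$(tac "$workdir/lines_demo.txt")
echo "$reversed"

# Reversing twice restores the original line order
restored=$(tac "$workdir/lines_demo.txt" | tac)
echo "$restored"
```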
Instead of using cat or tac first, it is usually a good idea to look at a preview of the file, so we don't unnecessarily print hundreds or thousands of lines to standard output. For example, the file CID_51351709_similar.txt has ~500 lines. We can use head to print the first 10 lines of the file:
user@computer:~$ head CID_51351709_similar.txt
CC.CC1CCC2CC(=C(CC2C1)C=C)C=C 154673141 232.400
CC1CCCCC1C2=CC=CCC2 154546688 176.300
CC1CCCC2C1=CC=C3C2C=CC=C3 154307178 198.300
CCC1=CC=C2C(C1)C(CCC2(C)C)(C)C 154280750 218.380
CC1C(=C)CCC2C1=CC=CC2 154236167 160.250
CC(=C)C1CCC2CCC=CC2C1 153974464 176.300
CC1CCC(C2C1C(C=CC2C)C)C 153940528 192.340
[2H]C1C=CC2=CC=C3C(C(C(C(C3(C2(C1([2H])[2H])[2H])[2H])([2H])[2H])([2H])C)([2H])[2H])([2H])[2H] 153707997 212.390
[2H]C1C=C2C(C(C(C(C2(C(C1([2H])[2H])([2H])[2H])[2H])([2H])[2H])([2H])[2H])([2H])C)([2H])C 153707996 176.360
CC1CCC(CC1)C2=CC(C(C=C2)C)C 153688226 204.350
Further, we can specify the number of lines that head prints to standard output by adding the -n option followed by a number. For example, the first 4 lines:
user@computer:~$ head -n4 CID_51351709_similar.txt
CC.CC1CCC2CC(=C(CC2C1)C=C)C=C 154673141 232.400
CC1CCCCC1C2=CC=CCC2 154546688 176.300
CC1CCCC2C1=CC=C3C2C=CC=C3 154307178 198.300
CCC1=CC=C2C(C1)C(CCC2(C)C)(C)C 154280750 218.380
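The two forms of head can be compared on a file whose contents are easy to predict. This is a sketch that assumes the seq utility is available; numbers_demo.txt is a hypothetical 500-line file of consecutive integers, not part of the workshop data.

```shell
# Build a hypothetical 500-line file containing the numbers 1 to 500
workdir=$(mktemp -d)
seq 1 500 > "$workdir/numbers_demo.txt"

# With no options, head prints the first 10 lines
default_preview=$(head "$workdir/numbers_demo.txt")

# With -n 4, head prints only the first 4 lines
first_four=$(head -n 4 "$workdir/numbers_demo.txt")
echo "$first_four"
```

Writing the option as -n 4 or -n4 gives the same result.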
The tail program is similar to head, but it prints the last lines of files. For example, print the last 5 lines of CID_51351709_similar.txt:
user@computer:~$ tail -n5 CID_51351709_similar.txt
CC1CCC(C2C1=CCC(=C2)C)C(C)C 519298 204.350
CCCC1C(C(CC2C1=CC=CC2)C)C 186376 204.350
CC1=C[C@H]2[C@H](CC1)C(=C)CC[C@@H]2C(C)C 92313 204.350
CC1=CC2C(CC1)C(=C)CCC2C(C)C 15094 204.350
CC1=CCCC2(C1CC(CC2)C(=C)C)C 10123 204.350
Note that tail maintains the original order of the file, so the last 5 lines are not reversed. If we want to reverse the output, one way to do this is to pipe the tail results to tac:
user@computer:~$ tail -n5 CID_51351709_similar.txt | tac
CC1=CCCC2(C1CC(CC2)C(=C)C)C 10123 204.350
CC1=CC2C(CC1)C(=C)CCC2C(C)C 15094 204.350
CC1=C[C@H]2[C@H](CC1)C(=C)CC[C@@H]2C(C)C 92313 204.350
CCCC1C(C(CC2C1=CC=CC2)C)C 186376 204.350
CC1CCC(C2C1=CCC(=C2)C)C(C)C 519298 204.350
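The difference between tail alone and tail piped to tac is easier to see on numbered lines. The sketch below again assumes seq is available and uses a hypothetical file of the integers 1 to 500 rather than the workshop data.

```shell
# Build a hypothetical 500-line file containing the numbers 1 to 500
workdir=$(mktemp -d)
seq 1 500 > "$workdir/numbers_demo.txt"

# tail keeps the file's own order: lines 496 through 500
last_five=$(tail -n 5 "$workdir/numbers_demo.txt")
echo "$last_five"

# Piping to tac reverses those five lines: 500 down to 496
last_five_reversed=$(tail -n 5 "$workdir/numbers_demo.txt" | tac)
echo "$last_five_reversed"
```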
Attribution
Some content in this workshop has been adapted and derived from the Software Carpentry The Unix Shell lesson (CC BY 4.0 license for Instructional Materials and MIT License for programs and code examples). We reused a few parts of their lesson, including some descriptions and command examples, but included our own specific datasets and use-case workflows. We have maintained the MIT license for program and code examples. Molecular and bibliographic dataset examples were retrieved from NCBI via their EDirect utility and are credited to NCBI and NLM. Please see the NCBI Website and Data Usage Policies and Disclaimers for more information regarding the data.