# Text Processing
|command|function|
|-|-|
|cat| Concatenate files and print on the standard output|
|sort| Sort lines of text files|
|uniq| Report or omit repeated lines|
|cut| Remove sections from each line of files|
|paste| Merge lines of files|
|join| Join lines of two files on a common field|
|comm| Compare two sorted files line by line|
|diff| Compare files line by line|
|patch| Apply a diff file to an original|
|tr| Translate or delete characters|
|sed| Stream editor for filtering and transforming text|
|aspell| Interactive spellchecker|

### cat
Extra flags:
- `-A` displays nonprinting characters
    - can be used to find tabs or trailing spaces
- `-n` numbers lines
- `-s` suppresses repeated blank lines

In [1]:
!cat foo.txt

Roses are red


Violets are blue


In [2]:
!cat -ns foo.txt

     1	Roses are red
     2	
     3	Violets are blue
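
A minimal sketch of the `-A` tip above, assuming GNU cat (not every cat implementation has `-A`). The sample line is made up: it has a tab between the first two words and one trailing space.

```shell
# Hypothetical sample text: a tab after "Roses" and a trailing space after "red".
sample=$(printf 'Roses\tare red \n' | cat -A)

# Tabs show up as ^I and each line end as $, so the stray space before $ is visible.
echo "$sample"
```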


### sort

In [6]:
!cat foo.txt

zebra
cats
dogs



In [7]:
!sort foo.txt


cats
dogs
zebra


|Option| Description |
|-|-|
|-b| By default, sorting is performed on the entire line, starting with the first character in the line. This option causes sort to ignore leading  spaces in lines and calculates sorting based on the first non-whitespace character on the line. |
|-f|Make sorting case-insensitive.|
|-n| Perform sorting based on the numeric evaluation of a string. Using this option allows sorting  to be performed on numeric values rather than alphabetic values.|
|-r|  Sort in reverse order. Results are in descending rather than ascending order.|
|-k|  Sort based on a key field located from field1 to field2 (given as -k field1[,field2]) rather than the entire line.|
|-m|  Treat each argument as the name of a presorted  file. Merge multiple files into a single sorted result without performing any additional sorting.|
|-o|  Send sorted output to file rather than standard output.|
|-t|  Define the field-separator character. By default fields are separated by spaces or tabs.|
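
The options above can be combined. The following is a sketch on made-up colon-delimited data (the names and counts are illustrative only), with `LC_ALL=C` set so the ordering does not depend on the local collation rules.

```shell
# Made-up sample data in the form name:count
data=$(printf 'pear:10\napple:9\nfig:100\n')

# -t sets ':' as the separator, -k 2,2 restricts the key to field 2,
# and the trailing n makes the comparison numeric.
by_count=$(printf '%s\n' "$data" | LC_ALL=C sort -t ':' -k 2,2n)
echo "$by_count"

# Same key without n: field 2 is compared as text, so 100 sorts before 9.
by_text=$(printf '%s\n' "$data" | LC_ALL=C sort -t ':' -k 2,2)
echo "$by_text"
```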

### uniq
uniq only removes duplicate lines that are adjacent to each other, so for uniq to do its job the input must be sorted first.

In [9]:
!cat foo.txt

apple
banana
apple
orange
apple


In [11]:
!uniq foo.txt

apple
banana
apple
orange
apple


In [13]:
!sort foo.txt | uniq -c

      3 apple
      1 banana
      1 orange


|Option| Description|
|-|-|
|-c| Output each line preceded by the number of times it occurs.|
|-d| Output only repeated lines, rather than unique lines.|
|-f n|  Ignore n leading fields in each line. Fields are separated by whitespace as they are in sort; however, unlike sort, uniq has no option for setting an alternate field separator.|
|-i|  Ignore case during the line comparisons.|
|-s n|  Skip (ignore) the leading n characters of each line.|
|-u|  Output only unique lines, that is, lines that are not repeated.|
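
A short sketch contrasting `-d` and `-u`, reusing the fruit list from the example above. The variable names are only for illustration.

```shell
# Same words as the uniq example above.
words=$(printf 'apple\nbanana\napple\norange\napple\n')

# -d: one copy of each line that occurs more than once.
repeated=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -d)
echo "$repeated"

# -u: only the lines that occur exactly once.
singles=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -u)
echo "$singles"
```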

## Slicing and Dicing
The next three programs we will discuss are used to peel columns of text 
out of files and recombine them in useful ways.

### cut—Remove Sections from Each Line of Files
The cut program is used to extract a section of text from a line and output the extracted section to standard output. It can accept multiple file arguments or input from standard input.


|Option| Long option| Description|
|-|-|-|
|-c list| --characters=list| Extract the portion of the line defined by list. The list may consist of one or more comma-separated numerical ranges.|
|-f list| --fields=list| Extract one or more fields from the line as defined by list. The list may contain one or more fields or field  ranges separated by commas.|
|-d delim| --delimiter=delim| When -f is specified, use delim as the field delimiting character. By default, fields must be separated by  a single tab character.|
||--complement| Extract the entire line of text,  except for those portions specified by -c and/or -f.|

In [18]:
!cat -A foo.txt

a^I12/07/2020$
b^I02/30/2021$
c^I03/12/2022$


In [19]:
!cut -f 2 foo.txt

12/07/2020
02/30/2021
03/12/2022


In [20]:
!cut -f 2 foo.txt | cut -c 7-10

2020
2021
2022


In [36]:
!cut -f 2 foo.txt | cut -d '/' -f 1

12
02
03
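
The field list can also name several fields or an open-ended range. A small sketch on a made-up comma-separated record (the record itself is an assumption, not one of the files above):

```shell
# Made-up comma-separated record.
record='a,12/07/2020,pizza,large'

# Fields 1 and 3, using ',' instead of the default tab delimiter.
picked=$(printf '%s\n' "$record" | cut -d ',' -f 1,3)
echo "$picked"

# An open-ended range: field 2 through the end of the line.
rest=$(printf '%s\n' "$record" | cut -d ',' -f 2-)
echo "$rest"
```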


### paste—Merge Lines of Files
The paste command does the opposite of cut. Rather than extracting a 
column of text from a file, it adds one or more columns of text to a file.

In [37]:
!cat foo.txt

a	12/07/2020
b	02/30/2021
c	03/12/2022


In [38]:
!cat bar.txt

a	12/07/2020
b	02/30/2021
c	03/12/2022


In [39]:
!paste foo.txt bar.txt

a	12/07/2020	a	12/07/2020
b	02/30/2021	b	02/30/2021
c	03/12/2022	c	03/12/2022
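
paste joins columns with a tab by default; `-d` changes the delimiter. A sketch using two hypothetical files created in a throwaway directory:

```shell
# Work in a temporary directory so no real files are touched.
tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/keys.txt"            # hypothetical file names
printf 'pizza\nrice\nsoup\n' > "$tmp/foods.txt"

# -d ',' places a comma between the pasted columns instead of a tab.
pasted=$(paste -d ',' "$tmp/keys.txt" "$tmp/foods.txt")
echo "$pasted"

rm -r "$tmp"
```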


### join—Join Lines of Two Files on a Common Field
The join program joins data from two files based on a shared key field. Both files must be sorted on that field.

In [41]:
!cat foo.txt

a	12/07/2020
b	02/30/2021
c	03/12/2022


In [42]:
!cat bar.txt

a	pizza
b	rice
c	soup


In [43]:
!join bar.txt foo.txt

a pizza 12/07/2020
b rice 02/30/2021
c soup 03/12/2022
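
join expects both inputs to be sorted on the join field. The sketch below sorts an unsorted file first and uses `-t` to join comma-separated data. The file names and contents are made up for illustration.

```shell
tmp=$(mktemp -d)
printf 'c,soup\na,pizza\nb,rice\n' > "$tmp/foods.csv"   # deliberately unsorted
printf 'a,2020\nb,2021\nc,2022\n' > "$tmp/years.csv"    # already sorted

# Sort on the key (field 1), then join on it with ',' as the separator.
LC_ALL=C sort -t ',' -k 1,1 "$tmp/foods.csv" > "$tmp/foods.sorted"
joined=$(LC_ALL=C join -t ',' "$tmp/foods.sorted" "$tmp/years.csv")
echo "$joined"

rm -r "$tmp"
```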


## Comparing Text

### comm—Compare Two Sorted Files Line by Line
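The comm program compares two text files and displays the lines that are 
unique to each one and the lines they have in common. The output has three columns: lines found only in the first file, lines found only in the second file, and lines found in both.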

In [47]:
!cat foo.txt

a	pizza
b	rice
c	soup
d	waffle


In [48]:
!cat bar.txt

a	pizza
b	rice
c	soup
e	chicken


In [49]:
!comm foo.txt bar.txt

		a	pizza
		b	rice
		c	soup
d	waffle
	e	chicken
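
comm can suppress any of its output columns with `-1`, `-2`, and `-3`. A sketch on two made-up sorted files (names and contents are illustrative):

```shell
tmp=$(mktemp -d)
printf 'pizza\nrice\nsoup\nwaffle\n' > "$tmp/one.txt"    # both files are sorted
printf 'chicken\npizza\nrice\nsoup\n' > "$tmp/two.txt"

# -12 hides columns 1 and 2, leaving only the lines common to both files.
common=$(LC_ALL=C comm -12 "$tmp/one.txt" "$tmp/two.txt")
echo "$common"

# -23 hides columns 2 and 3, leaving only the lines unique to the first file.
only_first=$(LC_ALL=C comm -23 "$tmp/one.txt" "$tmp/two.txt")
echo "$only_first"

rm -r "$tmp"
```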


### diff—Compare Files Line by Line
Like the comm program, diff is used to detect the differences between files. However, diff is a much more complex tool. It is often used by software developers to examine changes between different versions of program source code.

In [50]:
!diff foo.txt bar.txt

4c4
< d	waffle
---
> e	chicken


In [52]:
!diff -c foo.txt bar.txt

*** foo.txt	2022-01-20 23:10:17.733310000 -0500
--- bar.txt	2022-01-20 23:12:23.317395600 -0500
***************
*** 1,4 ****
  a	pizza
  b	rice
  c	soup
! d	waffle
--- 1,4 ----
  a	pizza
  b	rice
  c	soup
! e	chicken


|Indicator| Meaning|
|-|-|
|blank| A line shown for context. It does not indicate a difference between the two files.|
|-| A line deleted. This line will appear in the first file but not in the second file.|
|+| A line added. This line will appear in the second file but not in the first file.|
|!| A line changed. The two versions of the line will be displayed, each in its respective section of the change group.|

### patch—Apply a diff to an Original
The patch program is used to apply changes to text files. It accepts output from diff and is generally used to convert older-version files into newer versions.

Using diff/patch offers two significant advantages.
- The diff file is small, compared to the full size of the source tree.
- The diff file concisely shows the change being made, allowing reviewers of the patch to quickly evaluate it.


`diff -Naur old_file new_file > diff_file`

`patch < diff_file`

In [53]:
!cat foo.txt

a	pizza
b	rice
c	soup
d	waffle


In [54]:
!cat bar.txt

a	pizza
b	rice
c	soup
e	chicken


In [55]:
!diff -Naur foo.txt bar.txt > diff_file

In [57]:
!cat diff_file

--- foo.txt	2022-01-20 23:10:17.733310000 -0500
+++ bar.txt	2022-01-20 23:12:23.317395600 -0500
@@ -1,4 +1,4 @@
 a	pizza
 b	rice
 c	soup
-d	waffle
+e	chicken


In [58]:
!patch < diff_file

patching file foo.txt


In [59]:
!cat foo.txt

a	pizza
b	rice
c	soup
e	chicken


## Editing on the Fly

### tr—Transliterate or Delete Characters
The tr program is used to transliterate characters. We can think of this as a sort of character-based search-and-replace operation.


tr accepts two arguments: a set of characters to convert from and a corresponding set of characters to convert to. Character sets 
may be expressed in one of three ways.
- An enumerated list. For example, ABCDEFGHIJKLMNOPQRSTUVWXYZ.
- A character range. For example, A-Z.
- POSIX character classes. For example, [:upper:]

In [60]:
!echo "lowercase letters" | tr a-z A-Z

LOWERCASE LETTERS


In [62]:
!echo "lowercase letters" | tr '[:lower:]' A

AAAAAAAAA AAAAAAA


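In addition to transliteration, tr can delete characters from the input stream with the -d option and squeeze runs of repeated characters down to a single instance with the -s option. The following example uses -s.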

In [63]:
!echo "aaabbbccc" | tr -s ab

abccc
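
For deletion, tr takes the `-d` option and a single set. A minimal sketch on made-up input strings:

```shell
# -d deletes every character in the set; here, all digits.
no_digits=$(echo "a1b22c333" | tr -d '0-9')
echo "$no_digits"

# A POSIX class works too; [:blank:] matches spaces and tabs.
no_blanks=$(echo "lower case" | tr -d '[:blank:]')
echo "$no_blanks"
```

The sets are quoted so the shell does not try to expand them as filename patterns.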


### sed—Stream Editor for Filtering and Transforming Text
In general, the way sed works is that it is given either a single editing command (on the command line) or the name of a script file containing multiple commands, and it then performs these commands upon each line in the stream of text.

In [64]:
!echo "front" | sed 's/front/back/'

back


|Address| Description|
|-|-|
|n| A line number where n is a positive integer.|
|$| The last line.|
|/regexp/| Lines matching a POSIX basic regular expression. Note that the regular expression is delimited by slash characters. Optionally, the regular expression may be delimited by an alternate character,  by specifying the expression with \cregexpc, where c is the alternate  character.|
|addr1,addr2| A range of lines from addr1 to addr2, inclusive. Addresses may be any of the single address forms listed earlier.|
|first~step| Match the line represented by the number first and then each subsequent line at step intervals. For example, 1~2 refers to each odd numbered line, and 5~5 refers to the fifth line and every fifth line thereafter.|
|addr1,+n| Match addr1 and the following n lines.|
|addr!| Match all lines except addr, which may be any of the forms listed earlier.|


|Command| Description|
|-|-|
|=| Output the current line number.|
|a| Append text after the current line.|
|d| Delete the current line.|
|i| Insert text in front of the current line.|
|p| Print the current line. By default, sed prints every line and only edits lines that match a specified address within the file. The default behavior can be overridden by specifying the -n option.|
|q| Exit sed without processing any more lines. If the -n option is not specified, output the current line.|
|Q| Exit sed without processing any more lines.|
|s/regexp/replacement/| Substitute the contents of replacement wherever regexp is found. replacement may include the special character &, which is equivalent to the text matched by regexp. In addition, replacement may include the sequences \1 through \9, which are the contents of the corresponding subexpressions in regexp. After the trailing slash following replacement, an optional flag may be specified to modify the s command’s behavior.|
|y/set1/set2/| Perform transliteration by converting characters from set1 to the corresponding characters in set2. Note that unlike tr, sed requires that both sets be of the same length.|
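
A sketch tying the address and command tables together on made-up input. The last command uses `|` as the delimiter of the s command so the slashes in the date need no escaping.

```shell
# Made-up four-line input.
text=$(printf 'one\ntwo\nthree\nfour\n')

# Address range with p; -n suppresses the default output, so only lines 2-3 print.
middle=$(printf '%s\n' "$text" | sed -n '2,3p')
echo "$middle"

# $ addresses the last line; d deletes it.
no_last=$(printf '%s\n' "$text" | sed '$d')
echo "$no_last"

# Back references: \1, \2, \3 are the three \( \) groups, reordered in the replacement.
swapped=$(echo "12/07/2020" | sed 's|\([0-9]*\)/\([0-9]*\)/\([0-9]*\)|\3-\1-\2|')
echo "$swapped"
```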

### aspell—Interactive Spellchecker
The aspell program is an interactive spelling checker.

`aspell check textfile`

It can also be used to check code files such as HTML or C.

In [66]:
!echo "The quick brown fox jimped over the laxy dog." > foo.txt

In [67]:
!cat foo.txt

The quick brown fox jimped over the laxy dog.


In [68]:
!aspell check foo.txt

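(aspell check opens a full-screen interactive session, so its terminal control codes do not render in a notebook cell. The fragments that survive show suggestions such as "jumped" and "comped". Run the command in a terminal to step through the suggestions.)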

# Summing Up

- Slicing and dicing
- Comparing text
- Editing text