# Manipulate text files in bash

Several command-line utilities are available from the bash shell to read and manipulate text files, such as head, tail, more, less, sort, join, wc, and uniq. Here we are going to use some of them.

    cd $HOME/SE_data/exercise
    jupyter-notebook 03_bashinter_osgeo.ipynb


## Pattern matching

Create a small file from the first 1000 lines of a large file:

In [12]:
! head -1000 txt/aver_month_nuts3_fire.asc > input.txt

Read/explore the input.txt file

In [1]:
! head input.txt

NUTS YYYY MM 0 BAREA
BG311 2005 04 2 0.282594
BG311 2006 11 2 0.600812
BG311 2007 01 3 65.8331
BG311 2007 02 3 9.78246
BG311 2007 04 2 44.4997
BG311 2007 06 2 30.5861
BG311 2007 07 2 5534.21
BG312 2005 04 3 10.6419
BG312 2006 10 2 0.293182


In [1]:
! tail input.txt

DE425 2000 06 4 3.1973
DE425 2000 07 4 0.724873
DE425 2000 08 4 4.67528
DE425 2000 09 3 0.194243
DE425 2001 04 2 0.0724194
DE425 2001 05 2 0.66708
DE425 2001 08 2 0.0421668
DE425 2002 02 2 0.0125149
DE425 2002 03 2 0.492932
DE425 2002 04 4 1.06466


Count the lines, words, and characters in input.txt

In [3]:
! wc input.txt

 1000  5000 24535 input.txt
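
The three numbers printed by wc are the line, word, and byte counts. As a minimal sketch, the options below select a single count and show the explicit -n form of head and tail. The sample rows are hypothetical; they only mimic the layout of input.txt.

```bash
# Build a tiny sample table with the same layout as input.txt (hypothetical values).
sample=$(mktemp)
cat > "$sample" <<'EOF'
NUTS YYYY MM 0 BAREA
AA001 2005 04 2 0.5
AA001 2006 11 2 1.5
AA002 2007 01 3 2.5
EOF

# wc prints lines, words and bytes; -l and -w select a single count.
n_lines=$(wc -l < "$sample")
n_words=$(wc -w < "$sample")
echo "lines: $n_lines, words: $n_words"

# head -n N prints the first N lines; tail -n +2 prints from line 2 onward,
# which is a common way to skip a header line.
first_line=$(head -n 1 "$sample")
n_records=$(tail -n +2 "$sample" | wc -l)
echo "header: $first_line"
echo "records without header: $n_records"

rm -f "$sample"
```

Reading the file through "<" keeps wc from printing the file name after the count.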


Search for a word in a file

In [4]:
! grep "2007" input.txt

BG311 2007 01 3 65.8331
BG311 2007 02 3 9.78246
BG311 2007 04 2 44.4997
BG311 2007 06 2 30.5861
BG311 2007 07 2 5534.21
BG312 2007 01 3 114.535
BG312 2007 02 3 17.3247
BG312 2007 03 3 322.063
BG312 2007 04 3 521.189
BG312 2007 05 3 4.13178
BG312 2007 06 2 5.94132
BG312 2007 07 2 1117.64
BG313 2007 01 2 205.374
BG313 2007 02 2 103.186
BG313 2007 03 2 394.699
BG313 2007 04 5 36.6292
BG313 2007 05 3 7.68499
BG313 2007 07 3 267.507
BG313 2007 08 2 8.23134
BG314 2007 03 2 31.1266
BG314 2007 05 3 49.9327
BG314 2007 06 2 5.50835
BG314 2007 07 2 1243.67
BG314 2007 08 2 70.1061
BG315 2007 01 3 97.1538
BG315 2007 02 3 72.1268
BG315 2007 03 3 1825.58
BG315 2007 04 3 709.847
BG315 2007 05 4 37.3087
BG315 2007 07 4 2595.44
BG321 2007 01 2 34.6964
BG321 2007 02 2 15.8478
BG321 2007 03 2 128.72
BG321 2007 04 2 44.5743
BG321 2007 05 4 36.5963
BG321 2007 06 3 0.207622
BG321 2007 07 3 736.828
BG321 2007 08 3 16.4847
BG322 2007 01 2 25.6421
BG322 2007 02 3 2.47755
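
Note that grep "2007" matches the string anywhere in the line, not only in the year column. The sketch below compares three ways of counting on hypothetical sample rows, built so that "2007" also appears in the last column. It assumes awk is available, which is not used elsewhere in this tutorial.

```bash
# Hypothetical rows: "2007" appears as a year, inside a number, and as a value.
sample=$(mktemp)
cat > "$sample" <<'EOF'
NUTS YYYY MM 0 BAREA
AA001 2007 04 2 0.5
AA002 2006 11 2 0.12007
AA003 2006 01 3 2007
EOF

# -c prints the number of matching lines instead of the lines themselves.
any_match=$(grep -c "2007" "$sample")        # the string anywhere in the line
word_match=$(grep -c -w "2007" "$sample")    # only as a whole word
# awk splits each line into fields, so the test can target column 2 alone.
year_match=$(awk '$2 == 2007 {n++} END {print n+0}' "$sample")

echo "anywhere: $any_match, whole word: $word_match, year column: $year_match"
rm -f "$sample"
```

The loops later in this tutorial use grep " $year " with surrounding spaces, which is a lighter way of reducing such false matches.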


## Sorting a file
I want to find a command that can sort the input.txt table by the year column (YYYY).

In [5]:
! man -k  sort

alphasort (3)        - scan a directory for matching entries
apt-sortpkgs (1)     - Utility to sort package index files
bsearch (3)          - binary search of a sorted array
bunzip2 (1)          - a block-sorting file compressor, v1.0.8
bzip2 (1)            - a block-sorting file compressor, v1.0.8
comm (1)             - compare two sorted files line by line
qsort (3)            - sort an array
qsort_r (3)          - sort an array
sort (1)             - sort lines of text files
tsort (1)            - perform topological sort
versionsort (3)      - scan a directory for matching entries
XConsortium (7)      - X Consortium information


One of the last lines contains:
sort (1) - sort lines of text files
So I will look up how to use the sort command:

In [6]:
! man sort

SORT(1)                          User Commands                         SORT(1)

NNAAMMEE
       sort - sort lines of text files

SSYYNNOOPPSSIISS
       ssoorrtt [_O_P_T_I_O_N]... [_F_I_L_E]...
       ssoorrtt [_O_P_T_I_O_N]... _-_-_f_i_l_e_s_0_-_f_r_o_m_=_F

DDEESSCCRRIIPPTTIIOONN
       Write sorted concatenation of all FILE(s) to standard output.

       With no FILE, or when FILE is -, read standard input.

       Mandatory  arguments  to  long  options are mandatory for short options
       too.  Ordering options:

       --bb, ----iiggnnoorree--lleeaaddiinngg--bbllaannkkss
              ignore leading blanks

       --dd, ----ddiiccttiioonnaarryy--oorrddeerr
              consider only blanks and alphanumeric characters

       --ff, ----iiggnnoorree--ccaassee
              fold lower case to upper case characters

       --gg, ---

The -k option identifies the column (key) used for sorting:\
sorting based on column number 2 only ( -k 2,2 )\
sorting based on column number 2 and then, for ties, on column number 1 ( -k 2,2 -k 1,1 )\
See man sort again for more options such as -n and -g\
Alphanumeric sorting:

In [8]:
! sort -k 2,2 input.txt

DE121 1997 05 2 0.232016
DE122 1997 05 2 0.0637817
DE124 1997 03 2 1.28501
DE125 1997 05 2 0.107349
DE128 1997 04 2 0.340913
DE129 1997 03 2 0.297982
DE12A 1997 03 2 0.0815152
DE123 1998 04 2 0.434829
DE124 1998 03 2 0.0796515
DE12A 1998 05 2 0.345525
DE412 1998 05 3 1.22806
DE412 1998 06 2 0.0130324
DE412 1998 07 2 0.0611725
DE412 1998 08 2 0.913361
DE412 1998 09 2 0.0146242
DE413 1998 03 2 0.108482
DE413 1998 04 2 0.0403881
DE413 1998 05 3 0.0507124
DE413 1998 08 2 0.267596
DE414 1998 03 2 0.363849
DE414 1998 05 3 0.347403
DE414 1998 06 4 2.88662
DE414 1998 07 3 0.0392562
DE414 1998 08 4 0.49688
DE415 1998 02 3 0.20879
DE415 1998 03 3 0.0162351
DE415 1998 05 3 1.74778
DE415 1998 06 3 0.0208709
DE415 1998 07 3 0.27388
DE415 1998 08 3 1.21651
DE416 1998 06 3 0.0236748
DE416 1998 08 3 0.260007
DE417 1998 05 2 0.0931233
DE417 1998 06 2 0.0117269
DE417 1998 08 2 0.227174
DE418 1998 03 4 3.13756
DE418 1998 04 3 0.163528
DE418 1998 05 3 2.43519
DE418 19

General numerical sorting

In [9]:
! sort -k 2,2 -g  input.txt

NUTS YYYY MM 0 BAREA
DE121 1997 05 2 0.232016
DE122 1997 05 2 0.0637817
DE124 1997 03 2 1.28501
DE125 1997 05 2 0.107349
DE128 1997 04 2 0.340913
DE129 1997 03 2 0.297982
DE12A 1997 03 2 0.0815152
DE123 1998 04 2 0.434829
DE124 1998 03 2 0.0796515
DE12A 1998 05 2 0.345525
DE412 1998 05 3 1.22806
DE412 1998 06 2 0.0130324
DE412 1998 07 2 0.0611725
DE412 1998 08 2 0.913361
DE412 1998 09 2 0.0146242
DE413 1998 03 2 0.108482
DE413 1998 04 2 0.0403881
DE413 1998 05 3 0.0507124
DE413 1998 08 2 0.267596
DE414 1998 03 2 0.363849
DE414 1998 05 3 0.347403
DE414 1998 06 4 2.88662
DE414 1998 07 3 0.0392562
DE414 1998 08 4 0.49688
DE415 1998 02 3 0.20879
DE415 1998 03 3 0.0162351
DE415 1998 05 3 1.74778
DE415 1998 06 3 0.0208709
DE415 1998 07 3 0.27388
DE415 1998 08 3 1.21651
DE416 1998 06 3 0.0236748
DE416 1998 08 3 0.260007
DE417 1998 05 2 0.0931233
DE417 1998 06 2 0.0117269
DE417 1998 08 2 0.227174
DE418 1998 03 4 3.13756
DE418 1998 04 3 0.163528
DE418 1998 

String numerical sorting

In [10]:
! sort -k 2,2 -n  input.txt

NUTS YYYY MM 0 BAREA
DE121 1997 05 2 0.232016
DE122 1997 05 2 0.0637817
DE124 1997 03 2 1.28501
DE125 1997 05 2 0.107349
DE128 1997 04 2 0.340913
DE129 1997 03 2 0.297982
DE12A 1997 03 2 0.0815152
DE123 1998 04 2 0.434829
DE124 1998 03 2 0.0796515
DE12A 1998 05 2 0.345525
DE412 1998 05 3 1.22806
DE412 1998 06 2 0.0130324
DE412 1998 07 2 0.0611725
DE412 1998 08 2 0.913361
DE412 1998 09 2 0.0146242
DE413 1998 03 2 0.108482
DE413 1998 04 2 0.0403881
DE413 1998 05 3 0.0507124
DE413 1998 08 2 0.267596
DE414 1998 03 2 0.363849
DE414 1998 05 3 0.347403
DE414 1998 06 4 2.88662
DE414 1998 07 3 0.0392562
DE414 1998 08 4 0.49688
DE415 1998 02 3 0.20879
DE415 1998 03 3 0.0162351
DE415 1998 05 3 1.74778
DE415 1998 06 3 0.0208709
DE415 1998 07 3 0.27388
DE415 1998 08 3 1.21651
DE416 1998 06 3 0.0236748
DE416 1998 08 3 0.260007
DE417 1998 05 2 0.0931233
DE417 1998 06 2 0.0117269
DE417 1998 08 2 0.227174
DE418 1998 03 4 3.13756
DE418 1998 04 3 0.163528
DE418 1998 
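
On the year column the three orderings look almost identical, because every year has four digits. The difference shows up on a column with values of different lengths. The sketch below is only an illustration on hypothetical rows; it sorts on column 5 and assumes the C locale for the plain comparison. The helper name codes is an assumption of this example.

```bash
# Hypothetical rows; column 5 mixes plain decimals and scientific notation.
sample=$(mktemp)
cat > "$sample" <<'EOF'
AA003 2006 01 3 10.5
AA001 2007 04 2 9.25
AA002 2006 11 2 1e2
AA004 2005 03 2 0.75
EOF

# Helper for this example: keep column 1 and join the lines with spaces.
codes() { cut -d' ' -f1 | paste -sd' ' -; }

lexical=$(LC_ALL=C sort -k 5,5 "$sample" | codes)   # character by character
numeric=$(sort -k 5,5 -n "$sample" | codes)         # 1e2 is read as 1
general=$(sort -k 5,5 -g "$sample" | codes)         # 1e2 is read as 100
# Two keys: year as a number, then the code in column 1 to break ties.
by_year=$(sort -k 2,2n -k 1,1 "$sample" | codes)

echo "lexical: $lexical"
echo "numeric: $numeric"
echo "general: $general"
echo "by year: $by_year"
rm -f "$sample"
```

A key is written as -k start,end, so sorting on several columns needs one -k per column.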

Save the result of a command in a file with the ">" symbol

In [11]:
! sort -k 2,2 -g input.txt > input_s.txt
! wc -l input_s.txt

1000 input_s.txt


Which are the first and last years of observations?

## Append the command result to a file
Add the result of a command to the already existing output file (here input_s.txt) with the ">>" symbol

In [13]:
! sort -k 3,3 -g input.txt >> input_s.txt
! wc -l input_s.txt

2000 input_s.txt
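
The difference between the two redirections can be checked on a throwaway file. This is a minimal sketch using a temporary file created with mktemp, not one of the tutorial files.

```bash
out=$(mktemp)

echo "first"  > "$out"    # > creates the file, or empties it if it exists
echo "second" > "$out"    # the previous content is overwritten
after_overwrite=$(wc -l < "$out")

echo "third" >> "$out"    # >> keeps the content and adds to the end
after_append=$(wc -l < "$out")

echo "lines after >: $after_overwrite, lines after >>: $after_append"
contents=$(cat "$out")
echo "$contents"
rm -f "$out"
```

This is why rerunning a cell that uses ">>" keeps growing the output file unless the file is removed first.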


## Concatenate commands
We want to count how many observations exist for the year 2007 in input.txt.
Chain commands with the pipe "|" symbol: search for the word "2007", then count the lines, words, and characters of the result.

In [14]:
! grep "2007" input.txt | wc

    189     945    4539
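
A pipe can chain more than two commands. As a sketch, the pipeline below counts the observations of every year in one pass with cut, sort, and uniq. It assumes columns separated by a single space, and the sample rows are hypothetical.

```bash
sample=$(mktemp)
cat > "$sample" <<'EOF'
NUTS YYYY MM 0 BAREA
AA001 2005 04 2 0.5
AA001 2007 11 2 1.5
AA002 2007 01 3 2.5
AA003 2006 02 2 3.5
AA003 2007 03 2 4.5
EOF

# 1. tail -n +2    skip the header line
# 2. cut -d' ' -f2 keep only the year column
# 3. sort          bring equal years together (uniq needs sorted input)
# 4. uniq -c       collapse repeated lines and prefix each with its count
# 5. awk           print "year count" without the padding added by uniq
per_year=$(tail -n +2 "$sample" | cut -d' ' -f2 | sort | uniq -c | awk '{print $2, $1}')
echo "$per_year"
rm -f "$sample"
```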


## Use the variable
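Define the value of a variable, then print it by putting the $ symbol in front of the variable name. Each line starting with ! runs in its own shell, so a variable defined on one ! line is no longer set on the next one; define and use the variable inside the same %%bash cell.

In [16]:
%%bash
var=21
echo $var

21

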
Define the value of the variable using the result of a command

In [21]:
%%bash
var=$(grep "2007" input.txt | wc -l)
echo $var

189
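
A few more details about variables, as a hedged sketch: the assignment takes no spaces around "=", braces mark where the variable name ends, and $(( )) performs integer arithmetic. The counts 1000 and 189 are the ones obtained above; the file name is hypothetical.

```bash
year=2007                    # no spaces around the = sign
label="fire_${year}.txt"     # braces delimit the name (hypothetical file name)

n_total=1000                 # lines in input.txt, from the wc output
n_year=189                   # lines matching 2007, from the grep count

# Integer arithmetic: the division result is truncated, not rounded.
share=$(( 100 * n_year / n_total ))

echo "$label"
echo "${share}% of the lines match $year"
```

Double quotes around "$var" keep the value as a single word even if it contains spaces.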


## For loop
In computer science, a for-loop (or simply for loop) is a control flow statement for specifying iteration, which allows code to be executed repeatedly (source https://en.wikipedia.org/wiki/For_loop).
We want to automatically count how many observations exist in the years 2007, 2006 and 2005 in the input.txt file.
To solve this task we can loop a variable over a list of words or numbers.

In [23]:
%%bash 
for var in 2005 2006 2007; do
    grep $var input.txt 
done 

BG311 2005 04 2 0.282594
BG312 2005 04 3 10.6419
BG313 2005 03 2 48.0927
BG314 2005 03 4 2.21985
BG315 2005 03 2 125.772
BG315 2005 04 2 95.5232
BG315 2005 06 2 3.70607
BG321 2005 03 3 59.201
BG321 2005 04 2 0.562725
BG322 2005 03 3 6.33855
BG322 2005 04 2 5.45605
BG323 2005 03 3 8.98824
BG323 2005 04 2 3.04655
BG324 2005 03 2 0.97613
BG324 2005 04 2 2.06119
BG325 2005 04 2 1.34959
BG325 2005 07 2 27.4231
BG331 2005 01 2 0.901016
BG331 2005 03 2 2.26082
BG331 2005 04 2 13.956
BG331 2005 06 2 0.0643451
BG332 2005 04 2 4.7556
BG333 2005 03 2 28.8306
BG333 2005 04 4 4.97396
BG341 2005 03 3 621.824
BG341 2005 04 4 9.6126
BG341 2005 06 2 1.6579
BG343 2005 03 2 3.71163
BG343 2005 04 2 4.73784
BG344 2005 01 3 4.74859
BG344 2005 03 2 7.04468
BG344 2005 04 4 138.517
BG411 2005 03 2 1.01033
BG411 2005 04 2 28.9588
BG412 2005 02 3 54.2146
BG412 2005 03 3 55.5808
BG412 2005 04 3 221.366
BG412 2005 05 2 0.676884
BG413 2005 03 4 51.1086
BG413 2005 04 5 33.3638
BG413 2005 05 3 2.01096
BG413 2005 06 3

And now we count:

In [24]:
%%bash 
for var in 2005 2006 2007; do
    grep $var input.txt | wc -l 
done 

121
280
189


Now we want to automatically count, and save in a file, how many observations exist from year 2000 to 2008 in the input.txt file. For this, use the C-style for loop over a sequence of numbers.

In [26]:
%%bash 
rm -f input_wc.txt
for ((var=2000 ; var<=2008 ; var++)); do
    grep $var input.txt | wc -l  >> input_wc.txt  
done 

In [27]:
! head input_wc.txt

62
34
48
93
46
121
280
189
2
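
The file input_wc.txt holds only the counts, so the year of each line has to be inferred from its position. A possible variation, sketched on hypothetical rows, writes the year next to its count and uses brace expansion {2004..2007} instead of the C-style loop.

```bash
sample=$(mktemp)
cat > "$sample" <<'EOF'
NUTS YYYY MM 0 BAREA
AA001 2005 04 2 0.5
AA001 2007 11 2 1.5
AA002 2007 01 3 2.5
AA003 2007 03 2 4.5
EOF

result=$(mktemp)
: > "$result"                       # start from an empty result file
for year in {2004..2007}; do
    # grep -c prints the number of matching lines; it exits non-zero when
    # nothing matches, so "|| true" keeps strict shells from stopping here.
    n=$(grep -c " $year " "$sample" || true)
    echo "$year $n" >> "$result"
done

table=$(cat "$result")
echo "$table"
rm -f "$sample" "$result"
```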


## If condition in a for loop
As in the previous exercise, we want to automatically count how many observations exist from year 2000 to 2008, but not for the year 2003, and this time in the full txt/aver_month_nuts3_fire.asc file. For this, use the C-style for loop over a sequence of numbers together with an if condition.

In [29]:
%%bash
rm -f no2003output.txt
for ((year=2000 ; year<=2008 ; year++)); do
if [ $year != 2003 ] ; then
    grep " $year " txt/aver_month_nuts3_fire.asc  | wc -l  >> no2003output.txt      
fi
done
cat no2003output.txt

2778
2643
2641
2894
2837
3011
775
0
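
The same exclusion can be written in other ways. The sketch below is one alternative, not the notebook's own code: it compares the years as integers with -eq and uses continue to jump to the next iteration. It only collects the processed years, so it needs no input file.

```bash
processed=""
for ((year=2000; year<=2008; year++)); do
    # -eq compares integers, whereas != in the cell above compares strings.
    if [ "$year" -eq 2003 ]; then
        continue                 # skip the rest of the body for this year
    fi
    processed="$processed $year"
done
processed=${processed# }         # drop the leading space
echo "$processed"
```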


I need to know which was the biggest fire in each year and print it. I can sort on column 5 with the sort command so that the largest fire ends up in the last position.

In [31]:
%%bash
for ((year=2000 ; year<=2008 ; year++)); do
if [ $year != 2003 ] ; then
    grep " $year " txt/aver_month_nuts3_fire.asc  | sort -k 5,5 -g | tail -1       
fi
done

GR253 2000 07 3 23216.3
PT117 2001 08 464 7226.27
PT118 2002 08 448 14574.7
PT150 2004 07 5 16599
PT164 2005 08 114 41830.3
ES114 2006 08 554 52093.4
BG422 2007 08 4 12972.6
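
An equivalent formulation, sketched here on hypothetical rows, reverses the sort with -r and takes the first line with head -n 1.

```bash
sample=$(mktemp)
cat > "$sample" <<'EOF'
NUTS YYYY MM 0 BAREA
AA001 2005 04 2 0.5
AA002 2005 05 3 120.75
AA003 2006 01 2 3.5
AA004 2006 02 2 17.25
EOF

largest=$(
    for year in 2005 2006; do
        # -g sorts column 5 as numbers, -r puts the largest value first.
        grep " $year " "$sample" | sort -k 5,5 -g -r | head -n 1
    done
)
echo "$largest"
rm -f "$sample"
```

Without -g the value 3.5 would sort after 17.25, because the comparison would be done character by character.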


**Exercise**

Perform the same loop but exclude the years from 2002 to 2004.
Use "man test" to see the options for the if condition.
Search the web for "if statement with multiple condition bash".

## Checking the flow statement
How can I check that the results are correct and that I'm using the correct variables?
By printing the variable while the loop runs and, if needed, also writing it to the file.

In [36]:
%%bash
rm -f no2003output.txt
for ((year=2000 ; year<=2008 ; year++)); do
if [ $year != 2003 ] ; then
    echo processing year $year 
    grep " $year " txt/aver_month_nuts3_fire.asc  | wc -l >> no2003output.txt    
fi
done


processing year 2000
processing year 2001
processing year 2002
processing year 2004
processing year 2005
processing year 2006
processing year 2007
processing year 2008


In [37]:
! head no2003output.txt

2778
2643
2641
2894
2837
3011
775
0


I can also run a command manually and compare the results.

In [39]:
! grep " 2007 " txt/aver_month_nuts3_fire.asc  | wc -l 
! grep " 2002 " txt/aver_month_nuts3_fire.asc  | wc -l 

775
2641
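
Besides printing variables with echo, bash can trace the commands itself. As a small sketch, set -x prints every command, with its variables already expanded, before running it. The three input rows are hypothetical and produced with printf.

```bash
set -x   # from here on, each command is printed to stderr with a leading +
for year in 2006 2007; do
    count=$(printf 'A 2006 1\nB 2007 2\nC 2007 3\n' | grep -c " $year ")
    echo "year $year: $count observations"
done
set +x   # switch the tracing off again
```

The trace shows the actual value of $year in each grep call, which makes a wrong variable easy to spot.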


In [42]:
%%bash
time for ((year=2000 ; year<=2008 ; year++)); do
if [ $year != 2003 ] ; then
    echo year $year 
    grep " $year " txt/aver_month_nuts3_fire.asc  | wc 
fi
done

year 2000
   2778   13890   67910
year 2001
   2643   13215   64585
year 2002
   2641   13205   64534
year 2004
   2894   14470   70924
year 2005
   2837   14185   69493
year 2006
   3011   15055   73972
year 2007
    775    3875   19026
year 2008
      0       0       0



real	0m0.033s
user	0m0.029s
sys	0m0.015s


## Debugging
The shell reports error messages and a non-zero exit status in case of syntax errors, incorrect commands, or nonexistent files.
The most common errors are shown below, starting from this working example:

In [44]:
%%bash
for ((year=2000 ; year<=2008 ; year++)); do
    grep " $year " txt/aver_month_nuts3_fire.asc  | wc -l     
done

2778
2643
2641
3078
2894
2837
3011
775
0


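Run the script and look at the resulting errors.

Syntax error: unexpected end of file. The loop was not closed with done, and after the bash error Jupyter also reports a Python CalledProcessError, which only repeats that the bash cell failed.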

In [49]:
%%bash
for ((year=2000 ; year<=2008 ; year++)); do
    grep " $year " txt/aver_month_nuts3_fire.asc  | wc -l

bash: line 3: syntax error: unexpected end of file


CalledProcessError: Command 'b'for ((year=2000 ; year<=2008 ; year++)); do\n    grep " $year " txt/aver_month_nuts3_fire.asc  | wc -l\n'' returned non-zero exit status 2.

Bash: syntax error near unexpected token `('

The error is near the brackets. Often it is just an extra space or a bracket that has not been closed; here there is a space between the two opening brackets of the loop.

In [50]:
%%bash 
for ( (year=2000 ; year<=2008 ; year++)); do
    grep " $year " txt/aver_month_nuts3_fire.asc  | wc -l  
done   

bash: line 1: syntax error near unexpected token `('
bash: line 1: `for ( (year=2000 ; year<=2008 ; year++)); do'


CalledProcessError: Command 'b'for ( (year=2000 ; year<=2008 ; year++)); do\n    grep " $year " txt/aver_month_nuts3_fire.asc  | wc -l  \ndone   \n'' returned non-zero exit status 2.

Command not found: the command name is misspelled (grap instead of grep). Use "man -k" to search for the command that you need.

In [51]:
%%bash
for ((year=2000 ; year<=2008 ; year++)); do
    grap " $year " txt/aver_month_nuts3_fire.asc  | wc -l     
done

0
0
0
0
0
0
0
0
0


bash: line 2: grap: command not found
bash: line 2: grap: command not found
bash: line 2: grap: command not found
bash: line 2: grap: command not found
bash: line 2: grap: command not found
bash: line 2: grap: command not found
bash: line 2: grap: command not found
bash: line 2: grap: command not found
bash: line 2: grap: command not found
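
Whether a command name exists can be tested before the loop runs. A minimal sketch, using the shell builtin command -v; the function name check_cmd is an assumption of this example.

```bash
# Report whether a command name can be found by the shell.
check_cmd() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found, check the spelling or try: man -k <keyword>"
        return 1
    fi
}

check_cmd grep
check_cmd grap || true   # the misspelled name from the example above
```

Note that the loop above still printed a 0 for every year: wc counted the empty input it received from the failed command.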


Invalid command option: "wc: invalid option -- 'k'". Read the manual of the wc command carefully.

In [54]:
import warnings; warnings.simplefilter('ignore')

In [55]:
%%bash
for ((year=2000 ; year<=2008 ; year++)); do
    grep " $year " txt/aver_month_nuts3_fire.asc  | wc -k
done

wc: invalid option -- 'k'
Try 'wc --help' for more information.
wc: invalid option -- 'k'
Try 'wc --help' for more information.
wc: invalid option -- 'k'
Try 'wc --help' for more information.
wc: invalid option -- 'k'
Try 'wc --help' for more information.
wc: invalid option -- 'k'
Try 'wc --help' for more information.
wc: invalid option -- 'k'
Try 'wc --help' for more information.
wc: invalid option -- 'k'
Try 'wc --help' for more information.
wc: invalid option -- 'k'
Try 'wc --help' for more information.
wc: invalid option -- 'k'
Try 'wc --help' for more information.


CalledProcessError: Command 'b'for ((year=2000 ; year<=2008 ; year++)); do\n    grep " $year " txt/aver_month_nuts3_fire.asc  | wc -k\ndone\n'' returned non-zero exit status 1.

The file or directory does not exist: search for the correct file and directory by using "cd", "pwd", and "ls".

In [53]:
%%bash
for ((year=2000 ; year<=2008 ; year++)); do
    grep " $year " ../aver_month_nuts3_fire.asc  | wc -l     
done

0
0
0
0
0
0
0
0
0


grep: ../aver_month_nuts3_fire.asc: No such file or directory
grep: ../aver_month_nuts3_fire.asc: No such file or directory
grep: ../aver_month_nuts3_fire.asc: No such file or directory
grep: ../aver_month_nuts3_fire.asc: No such file or directory
grep: ../aver_month_nuts3_fire.asc: No such file or directory
grep: ../aver_month_nuts3_fire.asc: No such file or directory
grep: ../aver_month_nuts3_fire.asc: No such file or directory
grep: ../aver_month_nuts3_fire.asc: No such file or directory
grep: ../aver_month_nuts3_fire.asc: No such file or directory
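
A missing file can be caught once, before the loop, instead of nine times inside it. The sketch below wraps the count in a function that first tests whether the file is readable; the function name count_year and the temporary demo file are assumptions of this example.

```bash
# Print how many lines of a file contain " year "; fail early if the
# file cannot be read.
count_year() {
    local file=$1 year=$2
    if [ ! -r "$file" ]; then
        echo "count_year: cannot read '$file' (working directory: $(pwd))" >&2
        return 1
    fi
    # grep -c exits non-zero when the count is 0; "|| true" treats that as fine.
    grep -c " $year " "$file" || true
}

demo=$(mktemp)
printf 'A 2007 1\nB 2006 2\n' > "$demo"
count_year "$demo" 2007
count_year ../aver_month_nuts3_fire.asc 2007 || echo "stopping: input file missing"
rm -f "$demo"
```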


**Remove processed files**

In [3]:
! rm input_s.txt  input.txt input_wc.txt no2003output.txt