Regular expressions can also be used to extract data. The script file regex_bash_script.sh contains a number of the commands needed to extract data from a web page:
- `curl`
- `egrep`
- `cut`
- `sed`
- The `curl` command is used to retrieve URLs and their HTML content. You can save the output as a text file and use it later to fetch the data.
- The `egrep` command fetches the parts of the data that match the pattern specified inside quotes. Type `egrep --help` in a terminal on Kali Linux to see more options. In the script I have used both `egrep` and `egrep -o`. The `-o` flag stands for "only matching": it prints only the matched part of each line instead of the whole line.
- The `cut` command is used (as the name says) to cut/skip data from a particular point and keep the rest, or a portion, of the data. There are several ways of using the `cut` tool, but the most useful I have found is the `cut -d` method. The `-d` flag stands for "delimiter"; use `cut --help` in the console for more options.
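A quick sketch of `cut -d`: the delimiter splits each line into numbered fields, and `-f` picks which field to keep.

```shell
# Split on '/' and keep field 3 (fields: 1="https:", 2="", 3="example.com").
echo "https://example.com/page1" | cut -d '/' -f 3
# → example.com
```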
- The `sed` command is used to replace parts of the data: white space, upper/lower case letters, numbers, words, symbols, etc. Use the `sed` command followed by a substitution expression of the form `"s/old/new/"`: after the first forward slash place the text that needs to be replaced, and right after the second slash place the text that replaces it. For example, `sed "s/A/a/"` replaces a capital `A` with a lowercase `a`. To replace white space with nothing, use `sed "s/ //"`.
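The substitutions above can be sketched as follows; note that plain `s/.../.../` replaces only the first match on each line, while adding the `g` flag replaces every match:

```shell
# Replace the first lowercase 'a' on the line with a capital 'A'.
echo "data extraction" | sed "s/a/A/"
# → dAta extraction

# Remove all spaces (the g flag makes the substitution global).
echo "some raw data" | sed "s/ //g"
# → somerawdata
```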