# Grabbing Web Pages

Accessible Web pages are files (usually in HTML format) that reside on servers which
accept HTTP requests from clients connected to the Internet. Browsers are software
applications that send HTTP requests and display the received Web pages. Using
Perl, Python, or Ruby, you can automate HTTP requests. For each language, the
easiest way to make an HTTP request is to use a module that comes bundled as a
standard component of the language.

In [1]:
import urllib.request
import urllib.error

# The first URL is valid; the second is deliberately invalid.
urls = [
    'http://www.julesberman.info/factoids/batch.htm',
    'http://www.julesberman.info/factoids/xxxxx.htm',
]
for url in urls:
    req = urllib.request.Request(url)
    try:
        response = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("The server couldn't fulfill the request.")
        print('Error code:', e.code)
    except urllib.error.URLError as e:
        print('We failed to reach a server.')
        print('Reason:', e.reason)
    else:
        # Reuse the response already received rather than requesting it again.
        print(response.read().decode('utf-8', errors='replace'))

## Script Algorithm: Grabbing Web Pages

1. Import the module that makes HTTP requests.
2. Make the HTTP request.
3. If the request returns the Web page, print the page. Otherwise, print a message
indicating the request was unsuccessful.

## Analysis: Grabbing Web Pages

Perl, Python, and Ruby all provide modules for HTTP transactions. Each
language’s module has its own peculiar syntax. Still, the basic operation is the same:
your script initiates an HTTP request for a Web file at a specific network address (the
URL, or Uniform Resource Locator). A response is received, and the Web page is
retrieved, if possible. Otherwise, the response will contain some information indicating
why the page could not be retrieved.

In the example script, two Web pages were requested. The first is located at
http://www.julesberman.info/factoids/batch.htm, and is a valid URL. The second is located
at http://www.julesberman.info/factoids/xxxxx.htm, and is an invalid address.

You can see that, with a little effort, you can use this basic script to collect and
examine a large number of Web pages. With a little more effort, you can write your own
spider software that searches for Web addresses and iteratively collects information
from Web links within Web pages.
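
The link-collecting step of such a spider can be sketched with Python's standard `html.parser` and `urllib.parse` modules. This is only an illustration, not a complete spider: the names `LinkCollector` and `extract_links`, the sample page content, and the `index.htm` base address are all assumptions made for the example, and the page text is supplied as a string rather than fetched over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URLs found in the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return every link found in html_text, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links


# Hypothetical page content, used here instead of a live request.
sample_page = (
    '<html><body><a href="batch.htm">Batch</a> '
    '<a href="http://example.org/">Other</a></body></html>'
)
# The base address (index.htm) is also hypothetical.
print(extract_links(sample_page, "http://www.julesberman.info/factoids/index.htm"))
```

A spider would pass each retrieved page to a function like `extract_links`, then request the addresses it returns, keeping track of pages already visited.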

# CGI Script for Searching the Neoplasm Classification

Here are the steps for using CGI scripts:
1. Get yourself a server account with access to a “public_html” directory and a
“cgi-bin” subdirectory. This is usually accomplished by paying a commercial
ISP (Internet Service Provider) for a Web account, or by asking your company
or academic sponsor for an account. When you get your account, the provider
will explain to you how you can deposit, via FTP, Web pages (that you create)
onto the public_html directory. The provider will also explain how you can
deposit your CGI scripts onto the cgi-bin subdirectory, how you can assign
settings to your CGI scripts that restrict access to certain sets of users,
and whether there are limitations on the kinds
of scripts permitted on the server (e.g., specific versions of a language might be
required by the server, and the server may be set up for one language and not
another).
2. Create a Web page that creates an HTML form. Almost every HTML book
contains information about forms. Forms are HTML objects that accept user
input and send the input to a designated server. Text boxes and radio buttons
are commonly encountered form objects. They can be created in just a
few lines of HTML code. You will put the Web page in your public_html
directory. This Web page will be accessible to anyone in the world who happens
to know the Web address of the HTML page. Your server manager will
provide you with the Web address of your public_html directory, and the
complete address of the Web page is simply the HTML file name appended
to the directory address.
3. Create a script that sits in the cgi-bin subdirectory of your server. The
script's address is specified in the form that you included in your Web
page. When anyone viewing your Web page enters information in the form
and submits it (usually by clicking on a button in the form), the
information is sent to your server-side script and processed.

This describes the basic steps for a CGI script. With a little imagination, you can see
the enormous power of this approach. The best thing about CGI is that you do not
need to learn another language. You simply apply the programming skills you have
already mastered.

The neoplasm taxonomy is an example of a medical nomenclature that is easy to
parse and search, and from which output can easily be produced in a preferred format.
We can use the neoplasm taxonomy to search for neoplasm terms that match words and
phrases submitted on a Web page. This will be our introductory CGI script.

HTML text for client (requesting) web page:

<html>
<head>
<title>post</title>
</head>
<body>
<br><form name="sender" method="GET"
action="http://www.julesberman.info/cgi-bin/neopull.py">
<br><center><input type="text" name="tx" size=38
maxlength=48 value="">
<input type="submit" name="bx" value="SUBMIT"></center>
</form>
<br><br>
</body>
</html>
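
Because the form uses the GET method, the browser sends the field names and values as a query string appended to the address in the form's `action` attribute. The following sketch shows that encoding with Python's standard `urllib.parse` module, assuming the user typed "rhabdoid" into the text box. It illustrates what the browser does; it is not part of the CGI script itself.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Address taken from the form's action attribute.
action = "http://www.julesberman.info/cgi-bin/neopull.py"

# Field names come from the form: "tx" is the text box, "bx" the submit button.
# The value "rhabdoid" is an assumed user entry.
fields = {"tx": "rhabdoid", "bx": "SUBMIT"}

# A GET submission appends the URL-encoded fields to the action address.
request_url = action + "?" + urlencode(fields)
print(request_url)

# On the receiving side, the query string can be decoded back into fields.
received = parse_qs(urlsplit(request_url).query)
print(received["tx"][0])
```

The server-side script only needs the value of `tx`; the `bx` field simply records which button was pressed.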

In [None]:
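# Note: the cgi and cgitb modules were removed from the standard library in
# Python 3.13, so this script assumes an earlier Python 3 on the server.
import cgi, re, sys
import cgitb; cgitb.enable()  # show tracebacks in the browser while debugging

# HTTP header, followed by the blank line that separates it from the page
print("Content-type: text/html")
print()
print("<html><head><title>Sample CGI Script</title></head><body>")
form = cgi.FieldStorage()
message = form.getvalue("tx", "(no message)")
# Accept the query only if it consists entirely of letters and spaces
term_check = re.fullmatch(r'[A-Za-z ]+', message)
if not term_check:
    print("<br>Only alphabetic letters and spaces are permitted in the query box")
    print("</body></html>")
    sys.exit()
print("<br>Your query term is " + message + "<br>")
with open("neoself", "r") as in_text:
    for line in in_text:
        query_match = re.search(message, line)
        if query_match:
            # Put each pipe-delimited field on its own line in the browser
            line = re.sub(r'\|', "<br>", line)
            print("<br>" + line + "<br>")
# Close the page, as described in the final step of the script algorithm
print("</body></html>")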

## Script Algorithm: CGI Script for Searching the Neoplasm Classification

1. Create a very simple Web page, consisting of a simple form containing a text box
for user input (Figure 16.1). The form will contain the URL (Uniform Resource
Locator, or Web address) for the cgi-bin where your CGI script resides.
2. Upload the HTML document (your Web page) to the public_html directory
on your Web server. Clients will send requests by entering information on the
HTML document.
3. Create a script that you will upload to the cgi-bin of your server, which has the
address specified in the Web page form (steps 1 and 2). The script will execute
steps 4–8 when it receives a request from a client.
4. Capture the character string sent by the Web page, using command syntax
specific to your preferred programming language, and place the text into a
string object.
5. Print out the HTML header of the Web page that will be returned to the client
(the user, sitting at a browser, somewhere on planet Earth, and looking at
your Web page).
6. Process the text that the user sent to the CGI script. In this case, the information
will be matched against every line in the neoself document, a 17+ megabyte
(MB) collection of neoplasm terms that we previously created in Chapter 11.
The neoself document must be deposited onto the server’s cgi-bin.
7. Parse through every line of the neoself document. When a line that contains
the string entered by the Web user is encountered, it is printed.
8. Print the HTML tags that mark the end of the Web page.
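
Steps 6 and 7 can be tried without a Web server by isolating the search in a function. The sketch below is a simplified stand-in for the CGI script, not a copy of it: the function name `search_terms` is hypothetical, the sample lines are made up for illustration (the real neoself file's layout may differ), and a plain substring test replaces the regular-expression search, which is equivalent once the query is known to contain only letters and spaces.

```python
import re


def search_terms(query, lines):
    """Return an HTML fragment for each line that contains the query string."""
    # Accept only letters and spaces, as the CGI script does.
    if not re.fullmatch(r"[A-Za-z ]+", query):
        raise ValueError("Only alphabetic letters and spaces are permitted")
    matches = []
    for line in lines:
        if query in line:
            # Show each pipe-delimited field on its own line in the browser.
            matches.append("<br>" + line.rstrip("\n").replace("|", "<br>") + "<br>")
    return matches


# Made-up stand-ins for lines of the neoself file.
sample_lines = [
    "rhabdoid tumor|example field\n",
    "lipoma|example field\n",
]
for fragment in search_terms("rhabdoid", sample_lines):
    print(fragment)
```

Keeping the search separate from the code that prints the HTML header and footer makes it possible to test the matching logic from the command line before uploading the script to the cgi-bin.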

## Analysis: CGI Script for Searching the Neoplasm Classification

In this case, the user entered the word “rhabdoid” into the Web page query box.
The output immediately appears, as another Web page, in the same user’s browser
(Figure 16.2).
Notice that when the user pushes the “submit” button, all of the transmitted information
appears in the browser’s entry box, at the top of the Web page: