#### Overview


The IR cycle looks something like this:
1. Collect documents (e.g. web crawling or retrieving specific pages)
1. Extract *structured* information from documents (i.e. convert the document's external format to match your schema)
1. Index the documents
1. Query the index

Most of the documents of interest have some structured elements and some unstructured elements. 
We will look at Seattle U computer science faculty home pages.  For example, Prof. Dingle's page:
https://www.seattleu.edu/scieng/computer-science/faculty-and-staff/profiles/adair-dingle-phd.html

Things like the name, email, office number, and phone number are structured, and the list of research interests is (probably) unstructured narrative. 

Here we concentrate on the first two steps of the cycle.  We will work on:
1. Making a service call to get an HTML document
1. Parsing the document to pull out certain fields 
1. Packaging and storing document data so it's ready for indexing

The exercise: for each faculty page, extract the following information:

1. Name
1. Phone number
1. Email address
1. Research interests
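
One way to pin down the target schema for this exercise is a small record type. The sketch below is an assumption rather than part of the notebook: the class name `FacultyRecord` and its field names are illustrative, and the optional fields reflect that a given page may not list a phone number or email.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class FacultyRecord:
    """Hypothetical schema for one faculty page (names are illustrative)."""
    name: str
    phone: Optional[str] = None
    email: Optional[str] = None
    research_interests: Optional[str] = None


# A record with a missing phone number still fits the schema.
record = FacultyRecord(name="Adair Dingle, Ph.D.", email="dingle@seattleu.edu")
print(asdict(record))
```

Converting with `asdict` gives a plain dictionary, which is the shape the rest of this section stores and indexes.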

Many sites have a "structured API" (for example, https://docs.microsoft.com/en-us/linkedin/) that takes a request (e.g. for a person or handle) and returns a data structure (e.g. containing the person's name, contacts, and employment history).
But sometimes we have to extract structured information directly from a web page. That is tricky and fragile for three reasons: the HTML is structured for display, not for meaning; the HTML can change abruptly and break all your extraction code; and there is no guarantee that every page of interest has the same structure.


#### Service Calls

Getting the HTML source for a page.

We will be making calls to an HTTP server, so we need to talk about requests and responses.  This is useful both for retrieval and because you will be making requests to SOLR, which is itself a service.
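
To see one request/response round trip without touching the network, here is a minimal sketch that starts a throwaway HTTP server on localhost and fetches from it. It deliberately uses only the standard library (`urllib` rather than `requests`), and the names `DemoHandler`, `fetch`, and `PAGE` are made up for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import urlopen

PAGE = b"<html><head><title>Demo profile</title></head><body>hello</body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: one known page, 404 for everything else."""

    def do_GET(self):
        if self.path == "/profile.html":
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=UTF-8")
            self.send_header("Content-Length", str(len(PAGE)))
            self.end_headers()
            self.wfile.write(PAGE)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(url):
    """Return (status code, headers dict, body text) for a GET request."""
    try:
        with urlopen(url, timeout=5) as resp:
            return resp.status, dict(resp.headers.items()), resp.read().decode("utf-8")
    except HTTPError as err:
        # urllib raises on 4xx/5xx; the error object still carries the response.
        return err.code, dict(err.headers.items()), ""


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

ok = fetch(base + "/profile.html")
missing = fetch(base + "/nope.html")
server.shutdown()
server.server_close()

print(ok[0], ok[1]["Content-Type"])
print(missing[0])
```

Every response has the same three parts regardless of the client library: a status code, a set of headers, and a body.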

We will use the Python requests library: http://docs.python-requests.org/en/master/


In [2]:
import requests
PROFILES = "https://www.seattleu.edu/scieng/computer-science/faculty-and-staff/profiles/"
url = PROFILES + "adair-dingle-phd.html"
response = requests.get(url)

In [3]:
type(response)

requests.models.Response

In [4]:
response.status_code

200
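
The status code is just an integer. As a small sketch, the standard library's `http.HTTPStatus` can turn it into a readable label; the helper name `describe_status` is made up for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


print(describe_status(200))
print(describe_status(404))
```

In `requests`, calling `response.raise_for_status()` raises an exception for 4xx and 5xx codes, which is an alternative to comparing `status_code` by hand.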

In [5]:
response.headers

{'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '8327', 'Connection': 'keep-alive', 'Accept-Ranges': 'bytes', 'Cache-Control': 'max-age=0, no-cache="set-cookie"', 'Content-Encoding': 'gzip', 'Date': 'Fri, 14 Jan 2022 03:34:45 GMT', 'ETag': '"8bcd-5d58276fe97b3-gzip"', 'Expires': 'Fri, 14 Jan 2022 03:34:45 GMT', 'Last-Modified': 'Fri, 14 Jan 2022 03:33:22 GMT', 'Server': 'Apache', 'Set-Cookie': 'AWSELB=F1CBAFA51E2419F9186A0F571FDB29018C4C0532CFE701F6B790FF55096CB7A0088E73F83705B6F82B0D6AB326344066EE08B245C4D0613066751D2B2B2397E6EBE38F2575;PATH=/, AWSELBCORS=F1CBAFA51E2419F9186A0F571FDB29018C4C0532CFE701F6B790FF55096CB7A0088E73F83705B6F82B0D6AB326344066EE08B245C4D0613066751D2B2B2397E6EBE38F2575;PATH=/;SECURE;SAMESITE=None', 'Strict-Transport-Security': 'max-age=0', 'Vary': 'Accept-Encoding', 'X-Cache': 'Miss from cloudfront', 'Via': '1.1 567b44ed19c8caed2570b7bcd8c70034.cloudfront.net (CloudFront)', 'X-Amz-Cf-Pop': 'SEA73-P1', 'X-Amz-Cf-Id': 'N0xjQbWuOPLEIQFtI3KoINrOBFzf5r
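
The `Content-Type` header above carries both the media type and the character set. As a sketch, the two can be separated with the standard library's `email.message.Message`; this is a convenience chosen for the example, not how `requests` handles it internally, and `parse_content_type` is a made-up helper name.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=UTF-8")
print(media_type, charset)
```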

In [6]:
response.text

'<!DOCTYPE html>\r\n<html lang="en" class="no-js">\r\n\r\n<head>\r\n\r\n    <meta charset="utf-8" />\r\n    <meta http-equiv="x-ua-compatible" content="ie=edge">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1.0" />\r\n    <title>\r\n        Profiles | Adair Dingle, Ph.D.\r\n    </title>\r\n    <meta name="id" content="199980" />\r\n    <meta name="author" content="Seattle University" />\r\n    <meta property="og:type" content="website" />\r\n    <meta property="og:site_name" content="Seattle University">\r\n    <meta name="twitter:card" content="summary" />\r\n    <meta name="twitter:site" content="@seattleu" />\r\n    <meta name="twitter:creator" content="@seattleu" />\r\n    <meta property="og:title" content="Adair Dingle, Ph.D." />\r\n    <meta name="twitter:title" content="Adair Dingle, Ph.D." />\r\n    <link rel="canonical" href="https://www.seattleu.edu/scieng/computer-science/faculty-and-staff/profiles/adair-dingle-phd.html"/><meta name="twitter:url" c

#### HTML String to Parsed HTML

Beautiful Soup package
* Documentation:  https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* Installation: pip install beautifulsoup4

In [8]:
from bs4 import BeautifulSoup as soup

In [9]:
page = soup(response.text, "html.parser")

In [10]:
type(page)

bs4.BeautifulSoup

In [11]:
print(page.title)

<title>
        Profiles | Adair Dingle, Ph.D.
    </title>


In [16]:
print(page.head)

<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport">
<title>
        Profiles | Adair Dingle, Ph.D.
    </title>
<meta content="199980" name="id"/>
<meta content="Seattle University" name="author"/>
<meta content="website" property="og:type"/>
<meta content="Seattle University" property="og:site_name"/>
<meta content="summary" name="twitter:card">
<meta content="@seattleu" name="twitter:site"/>
<meta content="@seattleu" name="twitter:creator"/>
<meta content="Adair Dingle, Ph.D." property="og:title"/>
<meta content="Adair Dingle, Ph.D." name="twitter:title"/>
<link href="https://www.seattleu.edu/scieng/computer-science/faculty-and-staff/profiles/adair-dingle-phd.html" rel="canonical"/><meta content="https://www.seattleu.edu/scieng/computer-science/faculty-and-staff/profiles/adair-dingle-phd.html" name="twitter:url"/><meta content="https://www.seattleu.edu/scieng/computer-science/fa

In [17]:
print(page.body)

<body class="fulltext bioSubPage 2908978Bio">
<nav aria-label="Skip to important sections">
<a class="sr-only sr-only-focusable" href="#zoneA">Skip to main content</a>
<a class="sr-only sr-only-focusable" href="#siteNavigation">Skip to site navigation</a>
<a class="sr-only sr-only-focusable" href="#contactInformationAnchor">Skip to contact information</a>
<a class="sr-only sr-only-focusable" href="#ctaLinksAnchor">Skip to Apply, Request Info, Jobs, Contact links</a>
</nav>
<div aria-atomic="true" class="emergencynotice" role="alert"></div>
<header class="container-fluid is-visible" data-nav-status="toggle" id="globalHeader">
<button aria-expanded="false" class="navbar-toggle collapsed" data-target=".collapseMe" data-toggle="collapse" type="button"> <span class="sr-only">Toggle navigation</span> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button>
<div class="col-xs-6 col-md-2" id="mainlogo">
<a aria-label="Link back to Seattle Universit

In [18]:
print(page.text)








        Profiles | Adair Dingle, Ph.D.
    




























 


 
 
 




.staffBioBox .staffBioPhoto1 {
    width: 105px;
}




Skip to main content
Skip to site navigation
Skip to contact information
Skip to Apply, Request Info, Jobs, Contact links



 Toggle navigation    















 





          // Add listener to submit search query on form submit
          $("form.gsc-search-box").on('submit', function() {
            var URL = "//www.seattleu.edu/search/";
            window.location.href = URL + "?q=" + encodeURIComponent($("form.gsc-search-box input.gsc-input").val());
            return false; // Stops the form from executing its default submit (which reloads the page and breaks our desired functionality)
          });
        


Visit
Apply
Give
Alumni
Student Support
 SU Resources  

Email
Canvas
mySeattleU
SU Online
Library
Quicklinks







 About 



About Seattle UInclusive ExcellenceCampus SustainabilityCenters and InstitutesFacts and F

In [19]:
print(page.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport">
   <title>
    Profiles | Adair Dingle, Ph.D.
   </title>
   <meta content="199980" name="id"/>
   <meta content="Seattle University" name="author"/>
   <meta content="website" property="og:type"/>
   <meta content="Seattle University" property="og:site_name"/>
   <meta content="summary" name="twitter:card">
    <meta content="@seattleu" name="twitter:site"/>
    <meta content="@seattleu" name="twitter:creator"/>
    <meta content="Adair Dingle, Ph.D." property="og:title"/>
    <meta content="Adair Dingle, Ph.D." name="twitter:title"/>
    <link href="https://www.seattleu.edu/scieng/computer-science/faculty-and-staff/profiles/adair-dingle-phd.html" rel="canonical"/>
    <meta content="https://www.seattleu.edu/scieng/computer-science/faculty-and-staff/profiles/adair-dingle-phd.ht

In [12]:
## This extracts the name!
page.find('div', {'id': 'zoneA'}).find('h1', {'id': 'pageTitle'}).text

'Adair Dingle, Ph.D.'

In [21]:
## This extracts the email!!
page.find('div', {'id': 'zoneA'}).\
find('div', {'class': "staffBioPageInfo"}).\
find('p', {'class': 'Email'}).\
find('a').\
text

'dingle@seattleu.edu'

In [22]:
## This extracts the phone!!
page.find('div', {'id': 'zoneA'}).\
find('div', {'class': "staffBioPageInfo"}).\
find('p', {'class': 'Phone'}).\
text.replace('Phone: ', '')

'206.296.5516'
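
Stripping the literal prefix `Phone: ` works only while the label is spelled exactly that way. Here is a sketch of a more tolerant alternative, assuming ten-digit numbers whose groups are separated by dots, dashes, or spaces. The pattern and the helper name `find_phone` are illustrative and have not been tested against the real pages.

```python
import re

# Three digits, three digits, four digits, separated by ".", "-", or whitespace.
PHONE_RE = re.compile(r"\b(\d{3})[.\-\s](\d{3})[.\-\s](\d{4})\b")


def find_phone(text):
    """Return the first phone number in text as 'ddd.ddd.dddd', or None."""
    match = PHONE_RE.search(text)
    if match is None:
        return None
    return ".".join(match.groups())


print(find_phone("Phone: 206.296.5516"))
print(find_phone("Tel 206-296-5516"))
print(find_phone("no number listed"))
```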

In [23]:
## This extracts the bio information!!!
page.find('div', {'class': "ExtendedBiography"}).text

'Biography\n\nBS Mathematics, Duke University\nMS Computer Science, Northwestern University\nPhD Computer Science, University of Texas/Dallas\n'
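
Each `find` chain above raises `AttributeError` as soon as one intermediate `find` returns `None`, which is what happens when a page lacks an element. A sketch of a more forgiving helper follows, using Beautiful Soup's CSS selector method `select_one`. The helper name `text_or_none` and the simplified sample markup are assumptions modeled on the id and class names used above, not the real page.

```python
from bs4 import BeautifulSoup


def text_or_none(page, css_selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = page.select_one(css_selector)
    return node.get_text(strip=True) if node is not None else None


# Hypothetical, simplified markup: note there is no <p class="Phone"> element.
SAMPLE = """
<div id="zoneA">
  <h1 id="pageTitle">Adair Dingle, Ph.D.</h1>
  <div class="staffBioPageInfo">
    <p class="Email">Email: <a href="mailto:dingle@seattleu.edu">dingle@seattleu.edu</a></p>
  </div>
</div>
"""

sample_page = BeautifulSoup(SAMPLE, "html.parser")
name = text_or_none(sample_page, "#zoneA #pageTitle")
email = text_or_none(sample_page, "#zoneA .staffBioPageInfo p.Email a")
phone = text_or_none(sample_page, "#zoneA .staffBioPageInfo p.Phone")
print(name, email, phone)
```

The missing phone element yields `None` instead of an exception, so one odd page does not stop a whole crawl.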

#### Packaging It Up

In [16]:
handles = ['adair-dingle-phd', 'mckee-michael', 'hanks-steve', 'khadivi-pejman', 'leblanc-richard']

In [18]:
def extract_faculty_info(handle):
    # PROFILES already ends with "/", so don't add another one
    url = f"{PROFILES}{handle}.html"
    response = requests.get(url)
    if response.status_code != 200:
        print(f"No page for {handle} {response.status_code}")
        return {}
    page = soup(response.text, "html.parser")
    zone = page.find('div', {'id': 'zoneA'})
    name = zone.find('h1', {'id': 'pageTitle'}).text
    info = zone.find('div', {'class': "staffBioPageInfo"})
    # Each field below is optional: leave it as None when the element is missing
    email = info.find('p', {'class': 'Email'})
    if email is not None:
        email = email.find('a').text
    phone = info.find('p', {'class': 'Phone'})
    if phone is not None:
        phone = phone.text.replace('Phone: ', '')
    bio = page.find('div', {'class': "ExtendedBiography"})
    if bio is not None:
        bio = bio.text
    return {'name': name, 'email': email, 'phone': phone, 'bio': bio}


In [19]:
for handle in handles:
    print(extract_faculty_info(handle))

{'name': 'Adair Dingle, Ph.D.', 'email': 'dingle@seattleu.edu', 'phone': '206.296.5516', 'bio': 'Biography\n\nBS Mathematics, Duke University\nMS Computer Science, Northwestern University\nPhD Computer Science, University of Texas/Dallas\n'}
{'name': 'Michael McKee, MSE', 'email': 'mckeem@seattleu.edu', 'phone': None, 'bio': '\xa0\nTeaching Interests:\n\nProgramming & Problem Solving\nData Structures And Algorithms\nDatabases\nSoftware Economics\nSoftware Testing\nData Analytics\n\nResearch Interests:\n\nCS Education, Databases, Data Warehousing, Computer Languages, Economics, STEM.\n'}
{'name': 'Steve Hanks, Ph.D.', 'email': 'hankssteven@seattleu.edu', 'phone': '206.296.2505', 'bio': 'Teaching interests:\n\nData science\nArtificial intelligence\nSoftware design\nText and natural language processing, and search\n\nResearch interests:\n\nApplication of data-science methodologiesConnecting machine learning and AI – how ML algorithms can learn representations useful for “general commonsen
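
The `bio` values above still contain non-breaking spaces (`\xa0`), blank lines, and trailing newlines. Here is a sketch of a cleanup step that turns a raw bio into a list of non-empty lines. The helper name `clean_bio` is made up, and the sample string is an abbreviated copy of one of the outputs above.

```python
def clean_bio(raw):
    """Collapse a scraped bio into a list of non-empty, stripped lines."""
    if raw is None:
        return []
    lines = (line.replace("\xa0", " ").strip() for line in raw.splitlines())
    return [line for line in lines if line]


raw = "\xa0\nTeaching Interests:\n\nProgramming & Problem Solving\nData Structures And Algorithms\n"
print(clean_bio(raw))
```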

#### Serializing / Storing

It is often useful or necessary to store these "documents" prior to indexing.  Usually this means storing the URL or handle and having it point to the parsed document.  That way a crawler can skip a page that has already been stored.

Two implementations:
* Quick and easy and efficient:  python "pickle" serializer.
* Still quick and easy to use, but leaves us readable text for indexing:  write a JSON string
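
To make the difference between the two options concrete, this sketch serializes the same record both ways in memory. The record is a shortened copy of one of the extraction results above.

```python
import json
import pickle

record = {"name": "Michael McKee, MSE", "email": "mckeem@seattleu.edu", "phone": None}

as_pickle = pickle.dumps(record)  # bytes in a Python-specific format
as_json = json.dumps(record)      # plain text that any tool can read

print(type(as_pickle).__name__, type(as_json).__name__)
print(as_json)

# Both formats round-trip this record; Python's None becomes JSON null.
restored_from_pickle = pickle.loads(as_pickle)
restored_from_json = json.loads(as_json)
```

One caution: unpickling data from an untrusted source can execute arbitrary code, so pickle files are best kept to data you wrote yourself.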

In [43]:
import pickle
handle = 'adair-dingle-phd'
dingle = extract_faculty_info(handle)
# Use context managers so the files are closed promptly
with open(f"{handle}.p", "wb") as f:
    pickle.dump(dingle, f)
with open(f"{handle}.p", "rb") as f:
    recovered = pickle.load(f)

In [44]:
type(recovered)

dict

#### Put some aside for next lecture

In [None]:
import os
os.makedirs("stored", exist_ok=True)  # make sure the output directory exists
for handle in handles:
    data = extract_faculty_info(handle)
    print(str(data) + "\n")
    with open(f"stored/{handle}.p", "wb") as f:
        pickle.dump(data, f)

#### Also write a plain-text version so we can use non-Python tools

In [None]:
import json
data = extract_faculty_info('adair-dingle-phd')
# str(data) is a Python repr (single quotes, None), not valid JSON, so use json.dump
with open("json/adair-dingle-phd.json", "w") as f:
    json.dump(data, f)

In [None]:
import json
for handle in handles:
    data = extract_faculty_info(handle)
    with open(f"json/{handle}.json", "w") as f:
        json.dump(data, f)
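
Once records are stored by handle, a crawler can check the store before fetching a page again. Here is a minimal sketch of that "load or extract" pattern. The helper `load_or_extract` and the stand-in `fake_extract` are hypothetical; in the notebook, `extract_faculty_info` would take the place of `fake_extract`.

```python
import json
import tempfile
from pathlib import Path


def load_or_extract(handle, extract, store_dir):
    """Return the stored record for handle, calling extract(handle) only on a miss."""
    path = Path(store_dir) / f"{handle}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = extract(handle)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data


calls = []


def fake_extract(handle):
    """Stand-in for a real extractor; records each call so we can count them."""
    calls.append(handle)
    return {"name": handle, "email": None, "phone": None, "bio": None}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_extract("hanks-steve", fake_extract, tmp)
    second = load_or_extract("hanks-steve", fake_extract, tmp)

print(calls)
```

The second call is served from the stored JSON file, so the extractor runs only once per handle.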