# ECC058 - DHLab

# Finding Web Data


#### Prof. Stephen White
#### stephen.white@unive.it

## Fetching information from the Web
There are several methods for getting data from the web.
 - Download datafiles of various formats: csv, tsv, xslx, xml, json, etc.
 - Capture response data from calling a URL to services (websites, Restful, php, asp, etcs) that returns various formats: html, xml, json, csv, text, etc.
 - With html it is often necessary to extract the data from the html using a library
 
Python includes a `urllib` module to download Web pages as raw html as seen by the browser before the browser processes it. The same as using `view source` on the context menu, only with out the pretty printing.
```Python
import urllib.request
```

In [80]:
import urllib.request

urlBase = "https://edh-www.adw.uni-heidelberg.de"
urlQueryBase = "/inschrift/erweiterteSuche?hd_nr=&tm_nr=&beleg=c&land=&fo_antik=&fo_modern=&fundstelle=&region=&compFundjahr=eq&fundjahr=&aufbewahrung=&inschriftgattung=&sprache=L&inschrifttraeger=&compHoehe=eq&hoehe=&compBreite=eq&breite=&compTiefe=eq&tiefe=&bh=&palSchreibtechnik=&dat_tag=&dat_monat=&dat_jahr_a=&dat_jahr_e=&hist_periode=&religion=&literatur=&kommentar=&p_name=&p_praenomen=&p_nomen=&p_cognomen=&p_supernomen=&p_tribus=&p_origo=&p_geschlecht=&p_status=&compJahre=eq&p_lJahre=&compMonate=eq&p_lMonate=&compTage=eq&p_lTage=&compStunden=eq&p_lStunden=&atext1=&bool=AND&atext2=&beleg89=ja&nurMitFoto=ja&sort=hd_nr&anzahl=100&addFeldMaterial=ja&addFeldDTyp=ja&addFeldIGat=ja&start="
offset = 0

url = urlBase + urlQueryBase + str(offset)

f = urllib.request.urlopen(url)
str_all = f.read() 
f.close()

print(str_all)

b'<!DOCTYPE html>\n<html>\n<head>\n<meta charset="utf-8">\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n<meta name="keywords" content="latin epigraphy,epigraphy,inscription,database,ancient,Rome,Roman Empire,fotos,images,bibliography" />\n<meta name="description" content="The Epigraphic Database Heidelberg contains the texts of Latin and bilingual (i.e. Latin-Greek) inscriptions of the Roman Empire." />\n<meta name="publisher" content="Heidelberg Academy of Sciences and Humanities" />\n<meta name="author" content="Frank Grieshaber" />\n\n<title>Inscriptions: Advanced Search - Epigraphic Database Heidelberg</title>\n<link rel="stylesheet" href="/edh-css/jquery-ui-slider-pips.css" />\n<link rel="stylesheet" href="/edh-css/jquery.tree.css" />\n<link rel="stylesheet" href="/edh-css/jq/jquery-ui.css" />\n<link rel="stylesheet" href="https://code.jquery.com/ui/1.10.1/themes/base/jquery-ui.css" />\n<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/

## Managing HTML 
HTML is fundamentally made up of XML elements delimited by open `<tagname>` and close `</tagname>` tags. Common tagnames for HTML include html, body, script, link, div, span, p, ul, li, a, img. Several of the elements can be nested creating a tree of elements. Python3 comes with `html.parser` library which has a class *HTMLParser* which is used to write a custom parser.

The easiest way to manage html is to use a library that parses raw html into a tree of html elements and provides functions for finding various parts of the tree.  

BeautifulSoup is such a library. You can install it from Jupyter Notebook by

``` python
!pip install beautifulsoup4
```

After which you can import the parser class and use it as follows:
``` python
from bs4 import BeautifulSoup
myHtml = # call code to get html here
soup = BeautifulSoup(myHtml)
```

In [None]:
#!pip install beautifulsoup4

### Viewing Formatted HTML

BeautifulSoup has a `prettify()` method to structurally indent the HTML for readability.

In [74]:
# take a quick look at the html separated
from bs4 import BeautifulSoup
myHtml = str_all # call code to get html here
soup = BeautifulSoup(myHtml,'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="latin epigraphy,epigraphy,inscription,database,ancient,Rome,Roman Empire,fotos,images,bibliography" name="keywords">
   <meta content="The Epigraphic Database Heidelberg contains the texts of Latin and bilingual (i.e. Latin-Greek) inscriptions of the Roman Empire." name="description">
    <meta content="Heidelberg Academy of Sciences and Humanities" name="publisher"/>
    <meta content="Frank Grieshaber" name="author"/>
    <title>
     Inscriptions: Advanced Search - Epigraphic Database Heidelberg
    </title>
    <link href="/edh-css/jquery-ui-slider-pips.css" rel="stylesheet"/>
    <link href="/edh-css/jquery.tree.css" rel="stylesheet"/>
    <link href="/edh-css/jq/jquery-ui.css" rel="stylesheet"/>
    <link href="https://code.jquery.com/ui/1.10.1/themes/base/jquery-ui.css" rel="stylesheet"/>
    <link href="https://cdnjs.cloudflare.com/

### Accessing HTML elements

BeautifySoup use a derefernce model much line python Dictionairies or pandas Dataframes, for example
``` python
soup.a #the first anchor element or None
soup.a.string #the first anchor element's text or None
```


In [75]:
print("1st anchor element =>  ",soup.a)
print("1st anchor element class attribute =>  ",soup.a.get("class"))
print("1st anchor element href link attribute =>  ",soup.a.get("href"))
print("1st anchor element text =>  ",soup.a.string)
#get second anchor
a2 = soup.a.find_next('a')
print("2nd anchor element =>  ",a2)
print("2nd anchor element class attribute =>  ",a2.get("class"))
print("2nd anchor element href link attribute =>  ",a2.get("href"))
print("2nd anchor element text =>  ",a2.string)
li1 = soup.li
print("1st listItem element =>  ",li1)
print("1st listItem element class attribute =>  ",li1.get("class"))
print("1st listItem element text =>  ",li1.string)


1st anchor element =>   <a href="http://www.haw.uni-heidelberg.de/index.en.html">HEIDELBERG ACADEMY OF SCIENCES AND HUMANITIES</a>
1st anchor element class attribute =>   None
1st anchor element href link attribute =>   http://www.haw.uni-heidelberg.de/index.en.html
1st anchor element text =>   HEIDELBERG ACADEMY OF SCIENCES AND HUMANITIES
2nd anchor element =>   <a href="/home">Home</a>
2nd anchor element class attribute =>   None
2nd anchor element href link attribute =>   /home
2nd anchor element text =>   Home
1st listItem element =>   <li><a href="">Project</a>
<ul>
<li><a href="/projekt/konzept">Concept</a></li>
<li><a href="/projekt/geschichte">History</a></li>
<li><a href="/projekt/kooperationen">Cooperations</a></li>
<li><a href="/projekt/mitarbeiter">Research Team Members</a></li>
<li><a href="/projekt/veranstaltungen">Events and Presentations</a></li>
</ul>
</li>
1st listItem element class attribute =>   None
1st listItem element text =>   None


In [76]:
# look at all the hyperlinks for this web page.
for hyperlink in soup.find_all('a'):
    print(hyperlink.get('href'))

http://www.haw.uni-heidelberg.de/index.en.html
/home

/projekt/konzept
/projekt/geschichte
/projekt/kooperationen
/projekt/mitarbeiter
/projekt/veranstaltungen
/inschrift/suche
/inschrift/suche
/inschrift/erweiterteSuche
/inschrift/browse
/foto/suche
/bibliographie/suche
/data
/links
/inschrift/suche
/inschrift/suche
/inschrift/suche
/fotos/suche
/foto/suche
/foto/suche
/bibliographie/suche
/bibliographie/suche
/bibliographie/suche
/bibliographie/suche
#
#
/inschrift/suche
/inschrift/erweiterteSuche
/inschrift/erweiterteSuche?hd_nr=&tm_nr=&beleg=c&land=&fo_antik=&fo_modern=&fundstelle=&region=&compFundjahr=eq&fundjahr=&aufbewahrung=&inschriftgattung=&sprache=L&inschrifttraeger=&compHoehe=eq&hoehe=&compBreite=eq&breite=&compTiefe=eq&tiefe=&bh=&palSchreibtechnik=&dat_tag=&dat_monat=&dat_jahr_a=&dat_jahr_e=&hist_periode=&religion=&literatur=&kommentar=&p_name=&p_praenomen=&p_nomen=&p_cognomen=&p_supernomen=&p_tribus=&p_origo=&p_geschlecht=&p_status=&compJahre=eq&p_lJahre=&compMonate=eq&p_

In [77]:
# show any table where None means no tables in HTML. This can happen when content is dynamically loaded.
for table in soup.find_all('table'):
    print (table.prettify(),"\n")

<table id="tabelleCont" width="100%">
 <!-- Kopf Anfang -->
 <tr id="kopfHoehe">
  <td colspan="3">
   <table width="100%">
    <tr>
     <td id="col1" rowspan="2">
     </td>
     <td id="col2_oben">
      <a href="http://www.haw.uni-heidelberg.de/index.en.html">
       HEIDELBERG ACADEMY OF SCIENCES AND HUMANITIES
      </a>
     </td>
     <td id="col3" rowspan="2">
     </td>
    </tr>
    <tr>
     <td id="col2_unten">
      <img alt="Epigraphic Database Heidelberg" src="/images/edh_en.png" style="width:584px;height:37px"/>
     </td>
    </tr>
   </table>
   <table width="100%">
    <tr>
     <td id="subHeaderLinks">
     </td>
     <td id="subHeaderMitte">
     </td>
     <td id="subHeaderRechts">
     </td>
    </tr>
   </table>
   <!-- Navigation Anfang -->
   <table width="100%">
    <tr>
     <td id="naviHome">
      <a href="/home">
       Home
      </a>
     </td>
     <td class="naviHell">
      <ul class="jsddm2">
       <li>
        <a href="">
         Project
       

In [78]:
# show any table where None means no tables in HTML. This can happen when content is dynamically loaded.
# look at all the tables with of a given class attribute this web page.
for table in soup.find_all('table'):
    tclass = table.get('class')
    # filter everything but tables that have class = "treffertabelle" which refers to another webpage information table
    if tclass != None and isinstance(tclass,list) and tclass[0] == 'treffertabelle':
        print (table.prettify())

<table class="treffertabelle">
 <thead>
  <tr>
   <th colspan="3" id="zwischenueberschrift">
    Number 1:
    <a class="linkLastUpdateDetail" href="/edh/inschrift/HD000004">
     <span style="background-color: #FFFF00">
     </span>
     <span style="background-color: #FFFF00">
     </span>
     <span style="background-color: #FFFF00">
     </span>
     <span style="background-color: #FFFF00">
     </span>
     <span style="background-color: #FFFF00">
     </span>
     <span style="background-color: #FFFF00">
     </span>
     HD000004  – open detailed view...
    </a>
   </th>
  </tr>
 </thead>
</table>
<table class="treffertabelle">
 <thead>
  <tr>
   <th colspan="3" id="zwischenueberschrift">
    Number 2:
    <a class="linkLastUpdateDetail" href="/edh/inschrift/HD000035">
     <span style="background-color: #FFFF00">
     </span>
     <span style="background-color: #FFFF00">
     </span>
     <span style="background-color: #FFFF00">
     </span>
     <span style="background-color:

In [79]:
# show any table where None means no tables in HTML. This can happen when content is dynamically loaded.
# look at all the tables with of a given class attribute this web page.
for table in soup.find_all('table'):
    tclass = table.get('class')
    if tclass != None and isinstance(tclass,list) and tclass[0] == 'treffertabelle':
        print (table.a.get('href'))

/edh/inschrift/HD000004
/edh/inschrift/HD000035
/edh/inschrift/HD000042
/edh/inschrift/HD000047
/edh/inschrift/HD000049
/edh/inschrift/HD000073
/edh/inschrift/HD000099
/edh/inschrift/HD000127
/edh/inschrift/HD000144
/edh/inschrift/HD000224
/edh/inschrift/HD000231
/edh/inschrift/HD000264
/edh/inschrift/HD000271
/edh/inschrift/HD000273
/edh/inschrift/HD000282
/edh/inschrift/HD000298
/edh/inschrift/HD000301
/edh/inschrift/HD000304
/edh/inschrift/HD000307
/edh/inschrift/HD000316
/edh/inschrift/HD000319
/edh/inschrift/HD000322
/edh/inschrift/HD000328
/edh/inschrift/HD000361
/edh/inschrift/HD000364
/edh/inschrift/HD000365
/edh/inschrift/HD000477
/edh/inschrift/HD000480
/edh/inschrift/HD000496
/edh/inschrift/HD000502
/edh/inschrift/HD000505
/edh/inschrift/HD000508
/edh/inschrift/HD000511
/edh/inschrift/HD000529
/edh/inschrift/HD000532
/edh/inschrift/HD000595
/edh/inschrift/HD000598
/edh/inschrift/HD000601
/edh/inschrift/HD000604
/edh/inschrift/HD000607
/edh/inschrift/HD000610
/edh/inschrift/H

In [81]:
# show any table where None means no tables in HTML. This can happen when content is dynamically loaded.
# look at all the tables with of a given class attribute this web page.

for table in soup.find_all('table'):
    tclass = table.get('class')
    if tclass != None and isinstance(tclass,list) and tclass[0] == 'treffertabelle':
        break
print (table.a.get('href'))
url1 = urlBase + table.a.get('href')
f = urllib.request.urlopen(url)
epiInfoHtmlString = f.read()
f.close()
epiSoup = BeautifulSoup(epiInfoHtmlString,'html.parser')
print(epiSoup.prettify())
 

/edh/inschrift/HD000004
<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="latin epigraphy,epigraphy,inscription,database,ancient,Rome,Roman Empire,fotos,images,bibliography" name="keywords">
   <meta content="The Epigraphic Database Heidelberg contains the texts of Latin and bilingual (i.e. Latin-Greek) inscriptions of the Roman Empire." name="description">
    <meta content="Heidelberg Academy of Sciences and Humanities" name="publisher"/>
    <meta content="Frank Grieshaber" name="author"/>
    <title>
     Inscriptions: Advanced Search - Epigraphic Database Heidelberg
    </title>
    <link href="/edh-css/jquery-ui-slider-pips.css" rel="stylesheet"/>
    <link href="/edh-css/jquery.tree.css" rel="stylesheet"/>
    <link href="/edh-css/jq/jquery-ui.css" rel="stylesheet"/>
    <link href="https://code.jquery.com/ui/1.10.1/themes/base/jquery-ui.css" rel="stylesheet"/>
    <link href="https

In [82]:
#import urllib.request

#url = "https://archive.4plebs.org/pol/"
url = "http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244319/Digital%20Humanities%20Tools"
#urlQueryBase = "erweiterteSuche?hd_nr=&tm_nr=&beleg=c&land=&fo_antik=&fo_modern=&fundstelle=&region=&compFundjahr=eq&fundjahr=&aufbewahrung=&inschriftgattung=&sprache=L&inschrifttraeger=&compHoehe=eq&hoehe=&compBreite=eq&breite=&compTiefe=eq&tiefe=&bh=&palSchreibtechnik=&dat_tag=&dat_monat=&dat_jahr_a=&dat_jahr_e=&hist_periode=&religion=&literatur=&kommentar=&p_name=&p_praenomen=&p_nomen=&p_cognomen=&p_supernomen=&p_tribus=&p_origo=&p_geschlecht=&p_status=&compJahre=eq&p_lJahre=&compMonate=eq&p_lMonate=&compTage=eq&p_lTage=&compStunden=eq&p_lStunden=&atext1=&bool=AND&atext2=&beleg89=ja&nurMitFoto=ja&sort=hd_nr&anzahl=100&addFeldMaterial=ja&addFeldDTyp=ja&addFeldIGat=ja&start="
#offset = 0

#url = urlBase + urlQueryBase + str(offset)

f = urllib.request.urlopen(url)
str_all = f.read() 
f.close()

print(str_all)



In [83]:
# take a quick look at the html separated
from bs4 import BeautifulSoup
myHtml = str_all # call code to get html here
soup = BeautifulSoup(myHtml,'html.parser')
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <link href="http://vs1.pbworks.com/shared/statics/packed-m-prod-v07998121.css" rel="stylesheet" type="text/css"/>
  <!--[if lte IE 6]><link rel="stylesheet" type="text/css" href="http://vs1.pbworks.com/shared/statics/ie6-m-prod-v85351277.css" /><![endif]-->
  <style>
   body, body.wikipage, body.has-ws-nav.wikipage, body #ws-nav ul.nav-tabs #ws-nav-wiki.active { background-color:#fccf2d ; }  ul#main-tools li a,ul#secondary-tools li a, div#top-content a, #ws-nav ul.nav-tabs #ws-nav-wiki.active a { color:#47375A ; }  ul#main-tools, input#input-search, body.login table.inner, body.login #standalone-login table.inner, body.login table.inner td.table-col-2, div.workspace-panel-outer { border-color:#FABD24 ; }  div#page-toolbar, body.login table.i

### Dynamic content requires user interaction. 
When a web page requires the user to select from choices to show information, you will need to automate the web page UI by using a UI driver as Selinium as shown below. 

First we should look at how to use web API's.

---
---
# Automating and accessing dynamic web pages - Selinium


Some Web pages are dynamic.

Their content is created inside the browser:
 - e.g., by Javascript code connecting to a database

In this cases, the `urllib.request` package is not able to download content of the webpage displayed to the user.

We need an actual browser simulator.

We can use **Selenium**.

See: http://selenium-python.readthedocs.io/

Selenium is a powerful tool to simulate a user interacting with a browser.

Not only it is possible to get dynamic web pages, but it is also possible to simulate users typing text, clicking, etc.

To install the Selenium python package:
 - execute `pip install selenium` in the JuWe also need to install an external tool wrapping the actual browser.

You can us Chrome and ChromeDriver.
Or Firefox and GeckoDriver.

See below for information and installation instructions:
 - https://sites.google.com/a/chromium.org/chromedriver/home
 

On Mac, you can also install ChromDriver as follows:
 - `!brew install chromedriver`
 - This requries to install `brew` first. see https://brew.sh/index_itpyter Lab Terminal
 - exectue `!pip install selenium` in your usual Jupyter Notebook (Python 3) 
 
 

In [None]:
#!pip install selenium

In [None]:
#At this point it is very easy to download the content of a Web page.


<div class="navigation__container js-nav-fixed">
            
            <div class="row">
                <div class="column large-12">

                    <nav class="navigation" data-ui-more-nav="more-nav">
                        <ul class="navigation__list showMoreEnabled">


    <li class="navigation__item js-navigation-item" data-nav-index="0">
                <a href="/video" class="navigation__link"><span class="icn sprite-tv-icon"></span>Video</a>
    </li>

    <li class="navigation__item js-navigation-item" data-nav-index="1">
                <a href="/supporters" class="navigation__link"><span class="icn sprite-tv-icon"></span>Supporters</a>
    </li>

    <li class="navigation__item js-navigation-item" data-nav-index="2">
                <a href="/tickets" class="navigation__link"><span class="icn sprite-tv-icon"></span>Tickets</a>
    </li>

    <li class="navigation__item js-navigation-item" data-nav-index="3">
                <a href="/news" class="navigation__link"><span class="icn sprite-tv-icon"></span>News</a>
    </li>

    <li class="navigation__item js-navigation-item" data-nav-index="4">
                <a href="/shop" class="navigation__link"><span class="icn sprite-tv-icon"></span>Shop</a>
    </li>

    <li class="navigation__item js-navigation-item" data-nav-index="5">
        <div tabindex="0" class="navigation__link"><span class="icn sprite--icon"></span>RWC 2019<span class="arrow sprite-caret-down-white"></span></div>
            <div class="navigation-dropdown">
                <div class="navigation-dropdown__list-container">
                        <ul>
                            ...
                        </ul>
                    ...
                        <ul>
                            ...
                        </ul>
                </div>
            </div>
            <a href="/tournament-overview" class="navigation__link dropdown-link"><span class="icn sprite-tv-icon"></span>RWC 2019</a>
    </li>

    <li class="navigation__item js-navigation-item" data-nav-index="6">
        <div tabindex="0" class="navigation__link"><span class="icn sprite-qualifying-icon"></span>Qualifying<span class="arrow sprite-caret-down-white"></span></div>
            <div class="navigation-dropdown">
                <div class="navigation-dropdown__list-container">
                        <ul>
                            ...
                        </ul>
                    ...
                        <ul>
                            ...
                        </ul>
                </div>
            </div>
            <a href="/qualifying" class="navigation__link dropdown-link"><span class="icn sprite-tv-icon"></span>Qualifying</a>
    </li>

    <li class="navigation__item js-navigation-item" data-nav-index="7">
        <div tabindex="0" class="navigation__link"><span class="icn sprite--icon"></span>Info<span class="arrow sprite-caret-down-white"></span></div>
            <div class="navigation-dropdown">
                <div class="navigation-dropdown__list-container">
                        <ul class="navigation-dropdown__list">
                            <li class="navigation-dropdown__item navigation-dropdown__item--title">
                                <a href="/archive" class="navigation-dropdown__link">RWC Archive</a>
                            </li>
                        </ul>
                        <ul class="navigation-dropdown__list">
                            <li class="navigation-dropdown__item navigation-dropdown__item--title">
                                <a href="/request-for-proposal" class="navigation-dropdown__link">2019 RFP</a>
                            </li>
                        </ul>
                        <ul class="navigation-dropdown__list">
                            <li class="navigation-dropdown__item navigation-dropdown__item--title">
                                <a href="/media" class="navigation-dropdown__link">Media Information</a>
                            </li>
                        </ul>
                        <ul class="navigation-dropdown__list">
                            <li class="navigation-dropdown__item navigation-dropdown__item--title">
                                <a href="/stats" class="navigation-dropdown__link">Statistics</a>
                            </li>
                        </ul>
                        <ul class="navigation-dropdown__list">
                            <li class="navigation-dropdown__item navigation-dropdown__item--title">
                                <a href="/sponsor-family" class="navigation-dropdown__link">Sponsor Family</a>
                            </li>
                        </ul>
                </div>
            </div>
            <a href="/media" class="navigation__link dropdown-link"><span class="icn sprite-tv-icon"></span>Info</a>
    </li>

    <li class="navigation__item js-navigation-item" data-nav-index="8">
                <a href="/social" class="navigation__link"><span class="icn sprite-tv-icon"></span>Social</a>
    </li>

    <li class="navigation__item js-navigation-item is-hidden" data-nav-index="9">
                <a href="/france2023" class="navigation__link"><span class="icn sprite-tv-icon"></span>France 2023</a>
    </li>

                        <li class="more"><div class="more-toggle" tabindex="0">More<span class="icn sprite-arrow-black-down"></span></div><ul class="more-dropdown"><li class="navigation__item js-navigation-item" data-nav-index="9">
                <a href="/france2023" class="navigation__link"><span class="icn sprite-tv-icon"></span>France 2023</a>
    </li></ul></li></ul>

                    </nav>

                </div>
            </div>
        </div>
```

In [None]:
<html>
 <body>
  <div> ...
           <div class="navigation-dropdown">
                <div class="navigation-dropdown__list-container">
                        <ul class="navigation-dropdown__list">
                            <li class="navigation-dropdown__item navigation-dropdown__item--title">
                                <a href="/archive" class="navigation-dropdown__link">RWC Archive</a>
                            </li>
                        </ul>
                        <ul class="navigation-dropdown__list">
                            <li class="navigation-dropdown__item navigation-dropdown__item--title">
                                <a href="/request-for-proposal" class="navigation-dropdown__link">2019 RFP</a>
                            </li>
                        </ul>
                        <ul class="navigation-dropdown__list">
                            <li class="navigation-dropdown__item navigation-dropdown__item--title">
                                <a href="/media" class="navigation-dropdown__link">Media Information</a>
                            </li>
                        </ul>
                        <ul class="navigation-dropdown__list">
                            <li class="navigation-dropdown__item navigation-dropdown__item--title">
                                <a href="/stats" class="navigation-dropdown__link">Statistics</a>
                            </li>
                        </ul>
                        <ul class="navigation-dropdown__list">
                            <li class="navigation-dropdown__item navigation-dropdown__item--title">
                                <a href="/sponsor-family" class="navigation-dropdown__link">Sponsor Family</a>
                            </li>
                        </ul>
                </div>
            </div>
    ...
  </div>
</body>
<html>

## html  and xhtml and xml
Standards that define a hierarchical structuring of data that most all browsers know how to display.
```
<?xml version="1.0" encoding="UTF-8"?>

<bookstore>
    <book>
      <title lang="en">Harry Potter</title>
      <price>29.99</price>
    </book>

    <book>
      <title lang="en">Learning XML</title>
      <price>39.95</price>
    </book>
</bookstore>
```
## xpath
A standard way to address various parts of the hierarchical structure.

"/bookstore/book/title" addresses all title nodes that exist under a book node as a way of getting all the titles of all the books.
xpath has various alternative notations:
//book/title

xpath also allows you to test for value matching

https://www.w3schools.com/xml/xpath_syntax.asp
 
"/html/body//div[@class='navigation-dropdown']"

In [None]:
from selenium import webdriver
driver = webdriver.Firefox(executable_path = '/Anaconda3/geckodriver')

In [None]:
driver.get("https://www.rugbyworldcup.com/stats/alltime/teams/points")

In [None]:
content = driver.find_element_by_class_name('statsSection')
content.text

In [None]:
for line in content.text.split("\n"):
    print (line)

In [None]:
content.text.split("\n")[4::3]

In [None]:
#from selenium import webdriver
#driver = webdriver.Firefox(executable_path = '/Anaconda3/geckodriver')

driver.get("https://www.rugbyworldcup.com")

# note the new browser page, place it so that you can see it and this jupyter notebook at the same time

In [None]:
#find the element with a class of 'corporate-dropdown__button' for the menu
ddmenu = driver.find_element_by_class_name('corporate-dropdown__button')

#click it
ddmenu.click()

In [None]:
# try clicking it multiple times
ddmenu.click()

In [None]:
# find the first element that has class `navigation__link` and text of "Info"
menu = driver.find_element_by_xpath("//*[@class='navigation__link'and contains(text(),'Info')]")

# click it
menu.click()

In [None]:
statsmenu = driver.find_element_by_xpath("//a[@class='navigation-dropdown__link'and contains(text(),'Statistics')]")
statsmenu.click()
#add code wait for it to show

In [None]:
#find div that has class with "multiple" and "teamVersion" that has a child div with text of "Most points scored" then get child divs
divTopTeams = driver.find_elements_by_xpath("//div[contains(@class,'multiple') and contains(@class,'teamVersion') and count(div[contains(text(),'Most points scored')]) > 0]/div")
for div in divTopTeams:
    print(div.text)

In [None]:
linkShowAll = driver.find_element_by_xpath("//div[contains(@class,'multiple') and contains(@class,'teamVersion') and count(div[contains(text(),'Most points scored')]) > 0]/div/a")
linkShowAll.click()

In [None]:
content = driver.find_element_by_class_name('statsSection')
for line in content.text.split("\n"):
    print (line)

In [None]:
topTeams = content.text.split("\n")[4::3]
topScores = content.text.split("\n")[5::3]

In [None]:
driver.close()