<font size="6"><b>WORKING WITH AND QUERYING XML AND HTML OBJECTS</b></font>

In [None]:
library(data.table)
library(tidyverse)
library(XML)
library(httr)
library(jsonlite)

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

# What is XML

(https://www.w3schools.com/XML/xml_whatis.asp)

XML is a software- and hardware-independent tool for storing and transporting data.

What is XML?

- XML stands for eXtensible Markup Language
- XML is a markup language much like HTML
- XML was designed to store and transport data
- XML was designed to be self-descriptive
- XML is a W3C Recommendation


XML Does Not DO Anything

Maybe it is a little hard to understand, but XML does not DO anything.

This note is a note to Tove from Jani, stored as XML:
```XML
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
```

The XML above is quite self-descriptive:

- It has sender information.
- It has receiver information.
- It has a heading.
- It has a message body.

But still, the XML above does not DO anything. XML is just information wrapped in tags.



The Difference Between XML and HTML

XML and HTML were designed with different goals:

- XML was designed to carry data - with focus on what data is
- HTML was designed to display data - with focus on how data looks
- XML tags are not predefined like HTML tags are

XML Does Not Use Predefined Tags

The XML language has no predefined tags.

The tags in the example above (like `<to>` and `<from>`) are not defined in any XML standard. These tags are "invented" by the author of the XML document.

HTML works with predefined tags like `<p>`, `<h1>`, `<table>`, etc.

With XML, the author must define both the tags and the document structure.

XML is Extensible

Most XML applications will work as expected even if new data is added (or removed).

Imagine an application designed to display the original version of note.xml (`<to>`, `<from>`, `<heading>`, `<body>`).

Then imagine a newer version of note.xml with added `<date>` and `<hour>` elements, and a removed `<heading>`.

Because of the way XML is constructed, an older version of the application can still work with the newer document:

```XML
<note>
  <date>2015-09-01</date>
  <hour>08:30</hour>
  <to>Tove</to>
  <from>Jani</from>
  <body>Don't forget me this weekend!</body>
</note>
```
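
A minimal R sketch of this idea, assuming the `XML` package loaded at the top of this notebook. The hypothetical `read_note()` function plays the role of the older application: it asks only for the elements it knows about, so the added `<date>` and `<hour>` elements do not disturb it, and the removed `<heading>` simply comes back as `NA`.

```R
library(XML)

# Newer version of the note: <date> and <hour> added, <heading> removed
note_v2 <- '<note>
  <date>2015-09-01</date>
  <hour>08:30</hour>
  <to>Tove</to>
  <from>Jani</from>
  <body>Don\'t forget me this weekend!</body>
</note>'

# Hypothetical "old application": reads only the fields it was written for
read_note <- function(xml_text) {
  doc <- xmlParse(xml_text, asText = TRUE)
  fields <- c("to", "from", "heading", "body")
  # An element that is absent yields NA instead of an error
  sapply(fields, function(f) {
    val <- xpathSApply(doc, paste0("/note/", f), xmlValue)
    if (length(val) == 0) NA_character_ else val[[1]]
  })
}

note_fields <- read_note(note_v2)
note_fields
```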

# Basic XML Syntax

(https://www.w3schools.com/XML/xml_syntax.asp)

The syntax rules of XML are very simple and logical. The rules are easy to learn, and easy to use.

XML Documents Must Have a Root Element

XML documents must contain one root element that is the parent of all other elements:

```XML
<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root> 
```

In this example `<note>` is the root element:

```XML
<?xml version="1.0" encoding="UTF-8"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note> 
```

All XML Elements Must Have a Closing Tag

In XML, it is illegal to omit the closing tag. All elements must have a closing tag:

```XML
<p>This is a paragraph.</p>
<br />
```

XML Tags are Case Sensitive

XML tags are case sensitive. The tag `<Letter>` is different from the tag `<letter>`.

Opening and closing tags must be written with the same case:
```XML
<message>This is correct</message> 
```

XML Elements Must be Properly Nested

In HTML, you might see improperly nested elements:
```XML
<b><i>This text is bold and italic</b></i>
```

In XML, all elements must be properly nested within each other:
```XML
<b><i>This text is bold and italic</i></b>
```

In the example above, "properly nested" simply means that since the `<i>` element is opened inside the `<b>` element, it must be closed inside the `<b>` element.

XML Attribute Values Must Always be Quoted

XML elements can have attributes in name/value pairs just like in HTML.

In XML, the attribute values must always be quoted:
```XML
<note date="12/11/2007">
  <to>Tove</to>
  <from>Jani</from>
</note>
```

Entity References

Some characters have a special meaning in XML.

If you place a character like "<" inside an XML element, it will generate an error because the parser interprets it as the start of a new element.

This will generate an XML error:
```XML
<message>salary < 1000</message>
```
    
To avoid this error, replace the "<" character with an entity reference:
```XML
<message>salary &lt; 1000</message>
```

There are 5 pre-defined entity references in XML:
```
&lt; 	< 	less than
&gt; 	> 	greater than
&amp; 	& 	ampersand 
&apos; 	' 	apostrophe
&quot; 	" 	quotation mark
```
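
The effect of these rules can be checked from R. The sketch below assumes the `XML` package and its `xmlParse` function with `asText = TRUE`: the well-formed message parses and the entity reference is decoded, while the version with a raw "<" is rejected by the parser. The `tryCatch` wrapper is only there so that the failure returns `NULL` instead of stopping the notebook.

```R
library(XML)

# Well-formed: the entity reference &lt; is decoded back to "<" on parsing
ok_doc <- xmlParse("<message>salary &lt; 1000</message>", asText = TRUE)
ok_value <- xmlValue(xmlRoot(ok_doc))
ok_value

# Not well-formed: a raw "<" inside the element makes the parser fail
bad_doc <- tryCatch(
  xmlParse("<message>salary < 1000</message>", asText = TRUE),
  error = function(e) NULL   # return NULL instead of stopping
)
is.null(bad_doc)
```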

# A real XML/HTML dataset

On the 2nd of December 2018, I scraped 994 realty listing pages of residences for sale in the Mecidiyekoy, Sisli neighbourhood from www.hurriyetemlak.com.

Some of these pages are in the databb directory of our binder repo:

In [None]:
hemlak_path <- "~/databb/html/he_sisli"

In [None]:
hemlak_files <- list.files(hemlak_path, full.names = T)

In [None]:
hemlak_files

Let's read the files as text first:

In [None]:
hemlak_text <- lapply(hemlak_files, readLines)

In [None]:
hemlak_text %>% str

But this is not so suitable for extracting data from the objects.

We should parse them:

In [None]:
hemlak_parsed <- lapply(hemlak_text, htmlParse)

In [None]:
hemlak_parsed %>% str

To navigate through the nested structure, the best option is to open the page in a web browser (preferably Chrome), press F12 and view the "Elements" pane.

This way we will get the XPath expressions for the information we want from the files.

# XPath basics

https://www.w3schools.com/xml/xpath_intro.asp

- XPath can be used to navigate through elements and attributes in an XML document.
- XPath stands for XML Path Language
- XPath uses "path like" syntax to identify and navigate nodes in an XML document
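- XPath 2.0 and later versions contain over 200 built-in functions; XPath 1.0, the version implemented by libxml2 (on which the `XML` package is built), has a much smaller core function library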
- XPath uses path expressions to select nodes or node-sets in an XML document.
- These path expressions look very much like the path expressions you use with traditional computer file systems

## XPath Nodes

https://www.w3schools.com/xml/xpath_nodes.asp

XPath Terminology

Nodes

In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes.

XML documents are treated as trees of nodes. The topmost element of the tree is called the root element.

Look at the following XML document:

```XML
<?xml version="1.0" encoding="UTF-8"?>

<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
```

Example of nodes in the XML document above:
```XML
<bookstore> (root element node)

<author>J K. Rowling</author> (element node)

lang="en" (attribute node) 
```

Atomic values

Atomic values are nodes with no children or parent.

Example of atomic values:

J K. Rowling

"en"

Items

Items are atomic values or nodes.

### Relationship of Nodes

#### Parent

Each element and attribute has one parent.

In the following example, the book element is the parent of the title, author, year, and price elements:
```XML
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
```

#### Children

Element nodes may have zero, one or more children.

In the following example, the title, author, year, and price elements are all children of the book element:
```XML
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
```

#### Siblings

Nodes that have the same parent.

In the following example, the title, author, year, and price elements are all siblings:
```XML
<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>
```

#### Ancestors

A node's parent, parent's parent, etc.

In the following example, the ancestors of the title element are the book element and the bookstore element:
```XML
<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore> 
```

#### Descendants

A node's children, children's children, etc.

In the following example, the descendants of the bookstore element are the book, title, author, year, and price elements:
```XML
<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore> 
```
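
These relationships can also be queried directly. The sketch below is an illustration, assuming the `XML` package; it uses the standard XPath axis names (`parent::`, `child::`, `following-sibling::`, `ancestor::`, `descendant::`) on the bookstore document. The helper `node_names()` is a hypothetical convenience function, not part of the package.

```R
library(XML)

bookstore <- xmlParse('<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>', asText = TRUE)

# Hypothetical helper: names of the elements selected by an XPath expression
node_names <- function(path) unlist(xpathSApply(bookstore, path, xmlName))

parent_of_title          <- node_names("//title/parent::*")
children_of_book         <- node_names("//book/child::*")
siblings_of_title        <- node_names("//title/following-sibling::*")
ancestors_of_title       <- node_names("//title/ancestor::*")
descendants_of_bookstore <- node_names("/bookstore/descendant::*")

parent_of_title
children_of_book
```

Note that `following-sibling::` returns only the siblings that come after the context node; `preceding-sibling::` covers the ones before it.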

## XPath Syntax

https://www.w3schools.com/xml/xpath_syntax.asp

### The XML Example Document

We will use the following XML document in the examples below.
```XML
<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book>
  <title lang="en">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="en">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>
```

### Selecting Nodes

XPath uses path expressions to select nodes in an XML document. The node is selected by following a path or steps. The most useful path expressions are listed below:

<table class="w3-table-all notranslate">
  <tbody><tr>
 <th style="width:25%">Expression</th>
    <th>Description</th>
  </tr>
  <tr>
    <td><i>nodename</i></td>
    <td>Selects all nodes with the name "<i>nodename</i>"</td>
    </tr>
  <tr>
    <td>/</td>
    <td>Selects from the root node</td>
    </tr>
  <tr>
    <td>//</td>
    <td>Selects nodes in the document from the current node that match the selection no matter where they are </td>
  </tr>
  <tr>
    <td>.</td>
    <td>Selects the current node</td>
  </tr>
  <tr>
    <td>..</td>
    <td>Selects the parent of the current node</td>
  </tr>
  <tr>
    <td>@</td>
    <td>Selects attributes</td>
  </tr>
</tbody></table>

In the table below we have listed some path expressions and the result of the expressions:

<table class="w3-table-all notranslate">
  <tbody><tr>
 <th style="width:25%">Path Expression</th>
    <th>Result</th>
  </tr>
  <tr>
    <td>bookstore</td>
    <td>Selects all nodes with the name "bookstore"</td>
    </tr>
  <tr>
    <td>/bookstore</td>
    <td>Selects the root element bookstore<p><b>Note:</b> If the path starts with a slash ( / ) it always represents an absolute 
path to an element!</p></td>
    </tr>
  <tr>
    <td>bookstore/book</td>
    <td>Selects all book elements that are children of bookstore</td>
  </tr>
  <tr>
    <td>//book</td>
    <td>Selects all book elements no matter where they are in the document</td>
  </tr>
  <tr>
    <td>bookstore//book</td>
    <td>Selects all book elements that are descendant of the bookstore element, no matter where they are under the bookstore element</td>
  </tr>
  <tr>
    <td>//@lang</td>
    <td>Selects all attributes that are named lang</td>
  </tr>
  </tbody></table>
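
As a rough check, the expressions in the table can be evaluated in R on the example document. This sketch assumes the `XML` package functions `xmlParse`, `xpathSApply` and `getNodeSet`; the variable names are only illustrative.

```R
library(XML)

bookstore <- xmlParse('<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>', asText = TRUE)

# Absolute path to the root element
root_name <- xpathSApply(bookstore, "/bookstore", xmlName)

# All book elements, wherever they are in the document
n_books <- length(getNodeSet(bookstore, "//book"))

# Step by step from the root down to the titles
titles <- xpathSApply(bookstore, "/bookstore/book/title", xmlValue)

# All attributes named lang
n_lang <- length(getNodeSet(bookstore, "//@lang"))

# . and .. are relative to a context node, here the first <title>
first_title <- getNodeSet(bookstore, "//title")[[1]]
parent_name <- xpathSApply(first_title, "..", xmlName)

titles
```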

### Predicates

- Predicates are used to find a specific node or a node that contains a specific value.
- Predicates are always embedded in square brackets.
- In the table below we have listed some path expressions with predicates and the result of the expressions:

<table class="w3-table-all notranslate">
  <tbody><tr>
 <th style="width:40%">Path Expression</th>
    <th>Result</th>
  </tr>
  <tr>
    <td>/bookstore/book[1] </td>
    <td>Selects the first book element that is the child of the bookstore element.
 <p><b>Note:</b> According to the W3C specification, XPath positions start at 1, so the first node is [1]. Old versions of Internet Explorer (5 to 9) used [0] unless the SelectionLanguage property was set to XPath in JavaScript: <code>xml.setProperty("SelectionLanguage","XPath");</code></p></td>
    </tr>
  <tr>
    <td>/bookstore/book[last()]</td>
    <td>Selects the last book element that is the child of the bookstore element</td>
    </tr>
  <tr>
    <td>/bookstore/book[last()-1]</td>
    <td>Selects the last but one book element that is the child of the bookstore element</td>
  </tr>
  <tr>
    <td>/bookstore/book[position()&lt;3]</td>
    <td>Selects the first two book elements that are children of the bookstore element</td>
  </tr>
  <tr>
    <td>//title[@lang]</td>
    <td>Selects all the title elements that have an attribute named lang</td>
  </tr>
  <tr>
    <td>//title[@lang='en']</td>
    <td>Selects all the title elements that have a "lang" attribute  with a value of "en"</td>
  </tr>
  <tr>
    <td>/bookstore/book[price&gt;35.00]</td>
    <td>Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00</td>
  </tr>
  <tr>
    <td>/bookstore/book[price&gt;35.00]/title</td>
    <td>Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00</td>
  </tr>
  </tbody></table>
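
A short sketch of a few of these predicates in R, assuming the `XML` package and the two-book example document from above (rebuilt here so the cell runs on its own):

```R
library(XML)

bookstore <- xmlParse('<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>', asText = TRUE)

# Positional predicates: XPath counts from 1
first_book_title <- xpathSApply(bookstore, "/bookstore/book[1]/title", xmlValue)
last_book_title  <- xpathSApply(bookstore, "/bookstore/book[last()]/title", xmlValue)

# Attribute predicate: titles whose lang attribute equals "en"
en_titles <- xpathSApply(bookstore, "//title[@lang='en']", xmlValue)

# Value predicate: titles of books with a price greater than 35.00
expensive_titles <- xpathSApply(bookstore, "/bookstore/book[price>35.00]/title", xmlValue)

expensive_titles
```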

### Selecting Unknown Nodes

XPath wildcards can be used to select unknown XML nodes.

<table class="w3-table-all notranslate">
  <tbody><tr>
 <th style="width:25%">Wildcard</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>*</td>
    <td>Matches any element node</td>
    </tr>
  <tr>
    <td>@*</td>
    <td>Matches any attribute node</td>
  </tr>
  <tr>
    <td>node()</td>
    <td>Matches any node of any kind</td>
    </tr>
  </tbody></table>

In the table below we have listed some path expressions and the result of the expressions:

<table class="w3-table-all notranslate">
  <tbody><tr>
 <th style="width:25%">Path Expression</th>
    <th>Result</th>
  </tr>
  <tr>
    <td>/bookstore/*</td>
    <td>Selects all the child element nodes of the bookstore element</td>
    </tr>
  <tr>
    <td>//*</td>
    <td>Selects all elements in the document</td>
    </tr>
  <tr>
    <td>//title[@*]</td>
    <td>Selects all title elements which have at least one attribute of any kind</td>
  </tr>
  </tbody></table>
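
The wildcard expressions can be tried the same way. This is a sketch under the same assumptions as before (the `XML` package and the two-book example document):

```R
library(XML)

bookstore <- xmlParse('<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>', asText = TRUE)

# All child elements of bookstore, whatever their name
child_names <- xpathSApply(bookstore, "/bookstore/*", xmlName)

# Every element in the document: 1 bookstore + 2 book + 2 title + 2 price
n_elements <- length(getNodeSet(bookstore, "//*"))

# Titles that carry at least one attribute of any name
titles_with_attr <- xpathSApply(bookstore, "//title[@*]", xmlValue)

child_names
n_elements
```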

### Selecting Several Paths

- By using the | operator in an XPath expression you can select several paths.
- In the table below we have listed some path expressions and the result of the expressions:

<table class="w3-table-all notranslate">
  <tbody><tr>
 <th style="width:40%">Path Expression</th>
    <th>Result</th>
  </tr>
  <tr>
    <td>//book/title | //book/price</td>
    <td>Selects all the title AND price elements of all book elements</td>
    </tr>
  <tr>
    <td>//title | //price</td>
    <td>Selects all the title AND price elements in the document</td>
    </tr>
  <tr>
    <td>/bookstore/book/title | //price</td>
    <td>Selects all the title elements of the book element of the bookstore element AND all the price elements in the document</td>
  </tr>
  </tbody></table>
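
A small sketch of the `|` operator in R, again assuming the `XML` package and the two-book example document. The result is a single node set that contains the matches of both paths.

```R
library(XML)

bookstore <- xmlParse('<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>', asText = TRUE)

# Union of two paths: all titles and all prices of all books
union_path   <- "//book/title | //book/price"
union_names  <- xpathSApply(bookstore, union_path, xmlName)
union_values <- xpathSApply(bookstore, union_path, xmlValue)

table(union_names)
union_values
```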

# XPath example

## Get the price info from listings

Now please open the `30487516.html` file under

In [None]:
hemlak_path

- Navigate to any listing, hit F12 (debug tools) and select the elements pane
- By using the element selector on the top left, click any point on the web page and see how the Elements window navigates
- Now hit the price info with the selector
- Right click the highlighted element on the right pane, and click on "Copy Element"
- The result will be something like:

```XML
<span>530.000 TL</span>
```

Now we want to get the path to this node:

- Right click again, this time click on "Copy XPath"

```XPath
/html/body/div[1]/div[2]/div[2]/div[1]/div[3]/div[1]/div[3]/div[1]/ul/li[1]/span
```

Now we can use XPath to get the value at this path, provided that the queried XML/HTML file has a similar DOM structure (hierarchy of nodes):

In [None]:
hemlak1 <- hemlak_parsed[[1]]

In [None]:
xpathSApply(hemlak1, "/html/body/div[1]/div[2]/div[2]/div[1]/div[3]/div[1]/div[3]/div[1]/ul/li[1]/span",
            xmlValue)

However, traversing using only indices might not be correct in all cases: the count of a certain element may change across similar pages.

So we will use attributes to be more robust:

In [None]:
price1 <- xpathSApply(hemlak1,
                      "//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='price-line clearfix']/span/text()",
            xmlValue)

In [None]:
price1
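
One caveat: an expression such as `@class='price-line clearfix'` compares the whole `class` string, so it stops matching if a page adds or reorders a class name. A possible alternative is the XPath `contains()` function. The sketch below uses a small made-up HTML fragment that only imitates a listing page; it is not taken from the real files.

```R
library(XML)

# Made-up fragment that imitates the structure of a listing page
snippet <- htmlParse('<div class="realty-details clearfix">
  <ul class="clearfix">
    <li class="price-line clearfix highlighted"><span>530.000 TL</span></li>
    <li class="info-line"><span>120 m2</span></li>
  </ul>
</div>', asText = TRUE)

# Exact comparison finds nothing because the class string has an extra token
exact <- xpathSApply(snippet, "//li[@class='price-line clearfix']/span", xmlValue)

# contains() matches on a distinctive part of the class string
partial <- xpathSApply(snippet, "//li[contains(@class, 'price-line')]/span", xmlValue)

length(exact)
partial
```

The trade-off is that `contains()` is a plain substring test, so the chosen fragment should be distinctive enough not to match unrelated classes.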

Of course it is better to get only the numeric value and skip the "." and "TL" parts:

In [None]:
price1 %>% parse_number(locale = locale(decimal_mark = ",", grouping_mark = "."))

Now we can traverse through three files to get price information

In [None]:
prices <- sapply(hemlak_parsed, xpathSApply,
                 "//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']/li[@class='price-line clearfix']/span/text()",
            xmlValue)

In [None]:
prices

In [None]:
prices2 <- parse_number(prices, locale = locale(decimal_mark = ",", grouping_mark = "."))

In [None]:
prices2

## Get the square meter information

Similarly we will get the square meter information from listing files:

In [None]:
sqms <- sapply(hemlak_parsed, xpathSApply,
               "//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']//span[@id='realtyGrossSqm']/following-sibling::span/text()",
            xmlValue)

In [None]:
sqms

In [None]:
sqms2 <- sqms %>% parse_number

In [None]:
sqms2

## Get loan eligibility info

Note that a listing may lack this field, so we also have to handle missing values to keep the output parallel to the previous ones:

In [None]:
kredis <- sapply(hemlak_parsed, xpathSApply,
                 "//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']//span[text()='Krediye Uygunluk']/following-sibling::span/text()",
            xmlValue)

In [None]:
kredis
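
When a field is absent, `xpathSApply` returns an empty result for that file, and the combined output is no longer a plain vector aligned with the prices. One way to handle this is sketched below, assuming the `XML` package. The helper `first_or_na()` and the two tiny HTML fragments are hypothetical and only illustrate the idea.

```R
library(XML)

# Two made-up listings; the second one has no loan eligibility row
docs <- lapply(c(
  "<ul><li><span>Krediye Uygunluk</span><span>Uygun</span></li></ul>",
  "<ul><li><span>Oda</span><span>3</span></li></ul>"
), htmlParse, asText = TRUE)

# Hypothetical helper: first matching value, or NA when nothing matches
first_or_na <- function(doc, path) {
  res <- xpathSApply(doc, path, xmlValue)
  if (length(res) == 0) NA_character_ else res[[1]]
}

kredi_path <- "//span[text()='Krediye Uygunluk']/following-sibling::span/text()"

# vapply guarantees exactly one character value per document
kredi_demo <- vapply(docs, first_or_na, character(1), path = kredi_path)
kredi_demo
```

The same helper could be applied to `hemlak_parsed` with the full expression used above, so that every listing contributes exactly one value.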

## Get property age info

Now let's get the age (Bina Yaşı) info from files and save into "ages" object.

In [None]:
ages <- sapply(hemlak_parsed, xpathSApply,
                 "//div[@class='realty-details realty-details-right clearfix']/ul[@class='clearfix']//span[text()='Bina Yaşı']/following-sibling::span/text()",
            xmlValue)

In [None]:
ages

In [None]:
ages2 <- ages %>% as.integer

In [None]:
ages2

# HTTP GET and POST requests

HTTP is the main protocol for transferring data over a network, especially between web clients and web servers.

According to [Wikipedia](https://en.wikipedia.org/wiki/HTTP):

> The Hypertext Transfer Protocol (HTTP) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web, where hypertext documents include hyperlinks to other resources that the user can easily access, for example by a mouse click or by tapping the screen in a web browser.

One of the most widely used tools for sending HTTP requests is cURL.

According to [Wikipedia](https://en.wikipedia.org/wiki/CURL):

> cURL ... is a computer software project providing a library (libcurl) and command-line tool (curl) for transferring data using various network protocols. The name stands for "Client for URL".

The main requests that we use in HTTP are GET and POST requests:

> GET
The GET method requests that the target resource transfer a representation of its state. GET requests should only retrieve data and should have no other effect. (This is also true of some other HTTP methods.) For retrieving resources without making changes, GET is preferred over POST, as they can be addressed through a URL. This enables bookmarking and sharing and makes GET responses eligible for caching, which can save bandwidth...

> POST
The POST method requests that the target resource process the representation enclosed in the request according to the semantics of the target resource. For example, it is used for posting a message to an Internet forum, subscribing to a mailing list, or completing an online shopping transaction.

So the main difference is that a GET request only retrieves data, while a POST request is processed by the server and may cause side effects or changes.

We can send these requests either by calling the `curl` command-line tool from a shell such as Bash, or by using the `GET` and `POST` functions of the `httr` package in R.

To get any detailed curl request of any action on a browser:

- First open the debug tools with F12
- Navigate to the network pane
- When the action is done on the browser, track the traffic from the pane, right click and select `Copy as cURL`

## GET request

To make a sample GET request, we will use https://httpbin.org, a test site that echoes back the requests it receives.

First we will run a `curl` command in the shell using R's `system` function:

In [None]:
requestget1 <- "curl https://httpbin.org/get"

In [None]:
returnget1 <- system(requestget1, intern = T)

This is the return value:

In [None]:
returnget1

In [None]:
fromJSON(returnget1)

Now let's add some arguments to the request. This is what happens when you select some options on a web page to get more specific results (for example, selecting a city for realty ads):

In [None]:
requestget2 <- "curl 'https://httpbin.org/get?a=1&b=2'"

In [None]:
returnget2 <- system(requestget2, intern = T)

In [None]:
fromJSON(returnget2)

Now let's add some random headers. Headers customize how the web server should handle the request. These headers are mostly created by the browser automatically:

In [None]:
requestget3 <- "curl 'https://httpbin.org/get?a=1&b=2' -H 'c: 3' -H 'd: 4'"

In [None]:
returnget3 <- system(requestget3, intern = T)

In [None]:
fromJSON(returnget3)

Now let's make the same requests and get the returned value using the `GET` function from the `httr` package:

In [None]:
returnget2b <- GET("https://httpbin.org/get?a=1&b=2")

In [None]:
returnget2b %>% content

And we can pass the headers again:

In [None]:
returnget3b <- GET("https://httpbin.org/get?a=1&b=2", add_headers(c = 3, d = 4))

In [None]:
returnget3b %>% content
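
Instead of typing the query string by hand, the parameters can be given as a named list. The sketch below assumes the `httr` functions `modify_url` and `parse_url` and runs without a network connection; it only builds and inspects the URL.

```R
library(httr)

# Build the URL from a named list instead of typing the query string
url_demo <- modify_url("https://httpbin.org/get", query = list(a = 1, b = 2))
url_demo

# parse_url() splits the URL back into its components
parsed <- parse_url(url_demo)
parsed$query

# GET() accepts the same argument, so this would send the same request:
# GET("https://httpbin.org/get", query = list(a = 1, b = 2))
```

Passing parameters as a list also lets `httr` take care of URL encoding for values that contain spaces or non-ASCII characters.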

## POST request

Now let's make a POST request with the data option `-d`, using https://httpbin.org again:

In [None]:
requestpost1 <- "curl https://httpbin.org/post -d 'a=1' -d 'b=2'"

In [None]:
returnpost1 <- system(requestpost1, intern = T)

In [None]:
returnpost1

What if we have non-alphanumeric or non-ASCII characters that have to be encoded?

In [None]:
requestpost2a <- "curl https://httpbin.org/post -d 'a=İş güç' -d 'b=2'"

In [None]:
returnpost2a <- system(requestpost2a, intern = T)

In [None]:
returnpost2a

We can have curl URL-encode the data automatically by using the `--data-urlencode` option instead of `-d`:

In [None]:
requestpost2 <- "curl https://httpbin.org/post --data-urlencode 'a=İş güç' -d 'b=2'"

In [None]:
returnpost2 <- system(requestpost2, intern = T)

In [None]:
fromJSON(returnpost2)
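
To see what URL encoding does to such a value, here is a minimal sketch using base R's `URLencode` (from the `utils` package, attached by default). The string is written with Unicode escapes so that the cell does not depend on the file encoding; `raw_value` and `encoded` are illustrative names.

```R
# "İş güç" written with Unicode escapes
raw_value <- "\u0130\u015f g\u00fc\u00e7"

# Percent-encode every byte that is not an unreserved URL character;
# enc2utf8() makes sure the UTF-8 byte sequence is the one being encoded
encoded <- URLencode(enc2utf8(raw_value), reserved = TRUE)
encoded
```

Each non-ASCII character becomes two percent-encoded bytes because it takes two bytes in UTF-8, and the space becomes `%20`.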

And let's add headers again:

In [None]:
requestpost3 <- "curl https://httpbin.org/post --data-urlencode 'a=İş güç' -d 'b=2' -H 'c: 3' -H 'd: 4'"

In [None]:
returnpost3 <- system(requestpost3, intern = T)

In [None]:
fromJSON(returnpost3)

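Now let's see how that is handled with the `POST` function from `httr`. Note that `POST` encodes a list body as multipart form data by default; passing `encode = "form"` would send URL-encoded form data, as `curl -d` does: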

In [None]:
returnpost3a <- POST("https://httpbin.org/post",
                      body = list(a = "İş güç", b = 2),
                      add_headers(c = 3, d = 4))

In [None]:
returnpost3a %>% content

# Extracting tables

Let's get the detailed schedule of Management department for a semester:

On a browser the request has many headers created automatically by the browser:

```Bash
curl 'https://registration.boun.edu.tr/scripts/sch.asp?donem=2023/2024-2&kisaadi=AD&bolum=MANAGEMENT' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
  -H 'Accept-Language: en-US,en;q=0.9,tr;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'DNT: 1' \
  -H 'Pragma: no-cache' \
  -H 'Sec-Fetch-Dest: document' \
  -H 'Sec-Fetch-Mode: navigate' \
  -H 'Sec-Fetch-Site: none' \
  -H 'Sec-Fetch-User: ?1' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Linux"'
```

However in this simple example we do not need those details:

In [None]:
requestx <- 'curl "http://registration.boun.edu.tr/scripts/sch.asp?donem=2023/2024-2&kisaadi=AD&bolum=MANAGEMENT"'

Note the part that starts after "?", which includes some parameters to be passed as key/value pairs separated by ampersand ("&") sign:

- donem=2023/2024-2
- kisaadi=AD
- bolum=MANAGEMENT

In [None]:
requestx

And now execute it in the shell and retrieve the results:

In [None]:
schedule1 <- system(requestx, intern = T)

In [None]:
schedule1 %>% str

You can automate this behaviour by parametrizing the requests with different values
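
A minimal sketch of such parametrization, using only base R. The vector `semesters` is a hypothetical example (only the first value appears in this notebook), and the remaining parameters are kept fixed.

```R
# Hypothetical list of semesters; only the first value is used in this notebook
semesters <- c("2023/2024-2", "2023/2024-1", "2022/2023-2")

base_url <- "http://registration.boun.edu.tr/scripts/sch.asp"

# sprintf() is vectorized, so one URL is built per semester
schedule_urls <- sprintf("%s?donem=%s&kisaadi=%s&bolum=%s",
                         base_url, semesters, "AD", "MANAGEMENT")
schedule_urls

# Each URL could then be requested in turn, for example:
# pages <- lapply(schedule_urls, function(u) system(paste0('curl "', u, '"'), intern = TRUE))
```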

In order to ensure reproducibility a saved version is available on our repo:

In [None]:
schedule1 <- readLines("~/databb/html/schedule1.html")

In [None]:
schedule1 %>% str

Now let's parse the html:

In [None]:
schedulep <- htmlParse(schedule1)

From the Elements pane of the debug tools on the browser, locate the `table` tag at the beginning of the table and create a suitable XPath expression to get to it. Since no class or id attributes exist, the `width` attribute will be used:

In [None]:
table1 <- xpathSApply(schedulep, "//table[@width='1300px']")

This returns a list:

In [None]:
table1 %>% str

Now the first element of the list can easily be converted into a table:

In [None]:
schedule_dt <- readHTMLTable(table1[[1]]) %>% as.data.table

In [None]:
schedule_dt