# STA 141B Data & Web Technologies for Data Analysis

### Lecture 11, 02/10/26, Scraping


### Announcements

- Midterm on Thursday, at 7:30 AM - 8:30 AM (60 minutes) @ [Young 198](https://maps.app.goo.gl/P3ohUqDXhkV6psJZ8)
- Multiple Choice, only one correct answer per question
- Please bring a pen and your student ID
- Second homework due February 13.
- First homework was graded. Please make sure to execute the Validation cells!

### Today's topics

 - Scraping Tables with `pandas`
 - HTML
 - XML
 - Parser
 - Extracting Elements

### Resources

* [`requests` documentation](http://docs.python-requests.org/en/master/)
* [`requests-html` documentation](https://html.python-requests.org/)
* [W3 Schools](https://www.w3schools.com/html/default.asp)
* [MDN HTML Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
* [XPath Diner](http://www.topswagcode.com/xpath/) - an interactive XPath tutorial
* [CSS Diner](https://flukeout.github.io/) - an interactive CSS Selector tutorial

### Scraping Tables with `pandas`

For data in a `table` element, we can use __Pandas__ instead of writing a scraper. 

Wikipedia provides lots of useful information in tables. Let's get the Wikipedia list of [US cities by area][wiki].

[wiki]: https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area

In [None]:
import pandas as pd
import requests

In [None]:
# does not work, since no custom User-Agent header is provided
tabs = pd.read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area")

The request needs a `User-Agent` header that identifies the client. You can find out more about your User-Agent [here](https://www.whatismybrowser.com/detect/what-is-my-user-agent/).

In [None]:
# N
import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area"

# Define the User-Agent header
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:140.0) Gecko/20100101 Firefox/140.0"
}
tabs = pd.read_html(url, storage_options = headers)
tabs

In [None]:
type(tabs)

In [None]:
len(tabs)

In [None]:
tabs[0] #N
# overview table to the right

In [None]:
tabs[1]

In [None]:
tbl = tabs[1]
tbl.head()

To process this information, unusable items have to be removed. We are going to do that with `regex` (recall the discussion section)!

In [None]:
from re import sub 
def remove(string):
    '''
    Removes everything inside [], a whitespace before that and *'s.
    '''
    if isinstance(string, str):
        string = sub(r'\s*\[.*\]\**', '', string)
        # \s* matches any whitespace (incl. spaces and newlines) before the brackets, \[.*\] any text between square brackets,
        # and \** any trailing asterisks; * means zero or more occurrences, . any character
        # this aims to remove footnote markers such as the [a]* after Tribune
    return string

In [None]:
tbl.iloc[4,0]

In [None]:
remove(tbl.iloc[4,0])

In [None]:
remove(1706.8)

In [None]:
remove('First text [some random text]*')

In [None]:
remove('First text*')

In [None]:
remove('First text[some random text]')

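Only the square brackets are mandatory: the leading whitespace and the trailing `*` are optional, so a string with a `*` but no brackets is left unchanged.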
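One caveat worth illustrating: `.*` is greedy, so if a string contains two bracketed notes, everything from the first `[` to the last `]` is removed. The sketch below uses a made-up string (not taken from the Wikipedia table) and a hypothetical non-greedy variant, `remove_lazy`, to contrast the two behaviours.

```python
from re import sub

def remove_greedy(string):
    """Same pattern as remove() above: .* is greedy."""
    if isinstance(string, str):
        string = sub(r'\s*\[.*\]\**', '', string)
    return string

def remove_lazy(string):
    """Hypothetical variant: .*? stops at the first closing bracket."""
    if isinstance(string, str):
        string = sub(r'\s*\[.*?\]\**', '', string)
    return string

example = 'Sitka [a] and Juneau [b]*'  # made-up string with two notes

print(remove_greedy(example))  # removes everything from the first [ to the last ]
print(remove_lazy(example))    # removes only the two bracketed notes
```

For the cities table the greedy pattern is presumably sufficient, since each cell carries at most one note; the lazy variant is only a safer default when that is not guaranteed.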

In [None]:
tbl.columns

In [None]:
tbl.columns = [remove(i) for i in tbl.columns] # remove from table columns 

In [None]:
tbl.columns

In [None]:
tbl = tbl.map(remove) #remove from all rows

In [None]:
tbl.head()

In [None]:
tbl.dtypes

Alternatively, we could define the columns by hand:

In [None]:
tbl.columns = ['City', 'State', 'Land area (mi2)', 'Land area (km2)', 'Water area (mi2)', 'Water area (km2)', 'Total area (mi2)', 'Total area (km2)', 'Population']

In [None]:
tbl.head()

In [None]:
tbl.groupby('State').count()['Population'].sort_values(ascending=False)

`pd.read_html(url, storage_options=headers)` is all you need if the data sits in a `table` element.
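To experiment with `pd.read_html` without a network request, a small sketch can parse HTML held in a string. The table below is made up for illustration; wrapping the text in `StringIO` is used here because recent pandas versions expect a file-like object for literal HTML.

```python
from io import StringIO

import pandas as pd

# Made-up table, used only so the example runs offline
html_text = """
<table>
  <tr><th>City</th><th>Area</th></tr>
  <tr><td>Alpha</td><td>12.5</td></tr>
  <tr><td>Beta</td><td>3.0</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
small_tables = pd.read_html(StringIO(html_text))
print(len(small_tables))
print(small_tables[0])
print(small_tables[0].dtypes)  # the Area column is parsed as a number
```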

### HTML

Web pages are written in _hypertext markup language_ (HTML). HTML files (`.htm` or `.html`) are plain text, just like JSON, Python scripts, and R scripts.

In HTML, we use _tags_ to create _elements_ of a web page. Elements add formatting and structure to the page.

* Tags usually come in pairs: an opening tag and a closing tag.
* Tags are written `<NAME>` for opening tags, `</NAME>` for closing tags, and `<NAME />` for singleton tags.
* Opening and singleton tags can have _attributes_ that contain additional information. Attributes are written `ATTRIBUTE=VALUE` after the tag name. 

See [here](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics) for a more detailed explanation, and [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a list of valid HTML elements.

#### Example


From now on, we will use an artificial example:

```html
<p>This page is famous and this <strong>word</strong> is emphasized.</p>
```

```html
<p>This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous and this <strong>word</strong> is emphasized.</p>
```

```html
<li>1. Something</li>
```

<p>This page is famous and this <strong>word</strong> is emphasized.</p>
<p>This <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">page</a> is famous and this <strong>word</strong> is emphasized.</p>
<li>1. Something</li>

The `p` tag marks a paragraph, the `a` tag marks a link (an _anchor_), the `strong` tag marks emphasized text,
and the `li` tag marks a list item.
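To see opening tags, closing tags, and attributes as separate pieces, here is a minimal sketch using Python's standard-library `html.parser` (not the parser these notes use later). The class name `TagLogger` and the `example.com` link are made up for illustration.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Hypothetical helper that records every opening and closing tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag, {}))

snippet = ('<p>This <a href="https://example.com">page</a> is famous '
           'and this <strong>word</strong> is emphasized.</p>')

logger = TagLogger()
logger.feed(snippet)
logger.close()

for kind, tag, attrs in logger.events:
    print(kind, tag, attrs)
```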

Here's a string that contains HTML for a simple, complete website:

In [None]:
page = """
<html> 
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with a taco &#127790;</span>

    <p>This is a new paragraph!</p>
    <p><a href="https://pudding.cool">The Pudding</a></p>
</body>

</html>
""" 

In [None]:
page

<html> 
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with a taco &#127790;</span>
</body>

<body>
    <p>This is a new paragraph!</p>
    <p><a href="https://pudding.cool">The Pudding</a></p>
</body>

</html>

The `<span>` tag is an inline container used to mark up a part of a text, or a part of a document.
    
For example, you can write the code
```
<p>My hat is <span style="color:blue">blue</span>.</p>    
```  
    
<p>My hat is <span style="color:blue">blue</span>.</p>     

### XML

_Extensible markup language_ (XML) also uses tags to create elements. We say XML is _extensible_ because you can create your own XML elements (unlike HTML). People typically use XML to describe structure and meaning of data, rather than for formatting.

We'll use the same process to extract data from both HTML and XML.
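As a small illustration of XML describing data rather than formatting, the sketch below parses a made-up document with custom `city`, `name`, and `area` elements. It uses the standard-library `xml.etree.ElementTree` so that it runs without extra packages; `lxml.etree` offers a largely compatible interface.

```python
import xml.etree.ElementTree as ET

# Made-up XML with self-defined element names
xml_text = """
<cities>
  <city state="AK"><name>Alpha</name><area unit="mi2">12.5</area></city>
  <city state="MT"><name>Beta</name><area unit="mi2">3.0</area></city>
</cities>
"""

root = ET.fromstring(xml_text)

# Turn every <city> element into a dictionary
records = [
    {
        "name": city.findtext("name"),
        "state": city.get("state"),          # attribute value
        "area": float(city.findtext("area")),
    }
    for city in root.findall("city")
]
print(records)
```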

### Parser

A _parser_ converts formatted data into familiar data structures. We've used __requests__' built-in JSON parser, but the package doesn't have a built-in HTML/XML parser. Fortunately, there are many other Python packages for parsing HTML/XML and web scraping.

HTML/XML Parsers:
* [lxml](https://lxml.de/)
* [html5lib](https://github.com/html5lib/html5lib-python)
* [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/)
* [requests-html](https://docs.python-requests.org/projects/requests-html/en/latest/)

Scraper Frameworks (_convenient after learning the basics with parsers_):
* [scrapy](https://scrapy.org/)
* [newspaper3k](https://github.com/codelucas/newspaper)

Even more [here](https://github.com/lorien/awesome-web-scraping/blob/master/python.md#web-scraping-frameworks).

We'll use __lxml__ here (check the [doc](https://lxml.de/apidoc/index.html)), but you're welcome to use other packages on assignments and the project. 

In [None]:
import lxml.html as lx

html = lx.fromstring(page)
html

<html> 
<head>
    <title>This is the Title!</title>
</head>

<body>
    <p>This is a paragraph!</p>
    <p id="best-paragraph">This is another paragraph! &#127790;</p>
    <p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
    <span>This is a span, it comes with a taco &#127790;</span>
</body>

<body>
    <p>This is a new paragraph!</p>
</body>

</html>

In [None]:
page

#### Finding Elements

Elements are nested, so an HTML document is like a tree:
```
html
├── head
│   └── title
└── body
    ├── p
    ├── p
    ├── p
    │   └── a
    ├── span
    ├── p
    └── p
        └── a
```
This is similar to the file system on your computer. The key difference is that elements at the same level can have the same tag name.
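The tree structure can also be recovered from a parsed document. The sketch below walks an `lxml` element recursively and prints one indented line per element; the function name `outline` and the tiny document are made up for illustration.

```python
import lxml.html as lx

doc = lx.fromstring(
    "<html><head><title>T</title></head>"
    "<body><p>A</p><p>B <a href='#'>link</a></p></body></html>"
)

def outline(element, depth=0):
    """Return one indented line per element, children below their parent."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(doc)))
```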

#### XPath

The _XML Path Language_ (XPath) lets us write paths to elements. XPath paths look a lot like file paths. XPath is not Python-specific!

The `.xpath()` method gets all elements at an XPath path:

In [None]:
html.xpath("/html/head/title")

In [None]:
html.xpath("/html/body/p/a")

Since there may be more than one element, the method always returns a list.

Absolute paths are not robust for scraping. An update to a web page that adds a single tag can break a scraper that uses absolute paths. In XPath, `//` means "anywhere below". We'll use `//` often because it's more robust:

In [None]:
html.xpath("//p/a")
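The robustness argument can be checked on two made-up versions of a page, where the newer one wraps the paragraph in an extra `div`. This is only a sketch; `old_page` and `new_page` are hypothetical names.

```python
import lxml.html as lx

old_page = lx.fromstring(
    "<html><body><p>Visit <a href='https://pudding.cool'>The Pudding</a>.</p></body></html>"
)
# Same content, but the paragraph now sits inside a <div>
new_page = lx.fromstring(
    "<html><body><div><p>Visit <a href='https://pudding.cool'>The Pudding</a>.</p></div></body></html>"
)

for name, doc in [("old", old_page), ("new", new_page)]:
    absolute = doc.xpath("/html/body/p/a")  # breaks once the <div> is added
    anywhere = doc.xpath("//p/a")           # still finds the link
    print(name, len(absolute), len(anywhere))
```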

What if we just want elements that satisfy a certain condition? In XPath, `[ ]` filters out elements that don't match a condition. For example:

In [None]:
html.xpath("//p[@id = 'best-paragraph']")
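A few more predicate forms, sketched on a shortened, made-up version of the page so that the cell is self-contained. All expressions are standard XPath 1.0; the variable names are only for illustration.

```python
import lxml.html as lx

doc = lx.fromstring("""<html><body>
<p>First</p>
<p id="best-paragraph">Second</p>
<p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>
</body></html>""")

second = doc.xpath("//p[2]")                           # p that is the 2nd p child of its parent
with_link = doc.xpath("//p[a]")                        # p elements that have an <a> child
by_text = doc.xpath("//p[contains(text(), 'Visit')]")  # p whose first text node contains 'Visit'
hrefs = doc.xpath("//a/@href")                         # attribute values, returned as strings
texts = doc.xpath("//p[@id='best-paragraph']/text()")  # text nodes, returned as strings

print(second[0].get("id"), len(with_link), len(by_text), hrefs, texts)
```

Note that paths ending in `@href` or `text()` return strings rather than elements.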

[XPath Diner](http://www.topswagcode.com/xpath/) is an interactive tutorial that teaches most of the XPath syntax. It takes about 20-60 minutes. Work through it to become an XPath ninja! 

You can copy the XPath of an element from the browser's developer tools, for example:

In [None]:
'//*[@id="mw-content-text"]/div[1]/table[2]/tbody/tr[7]/td[3]'

#### CSS Selectors

_Cascading Style Sheets_ (CSS) is a language for styling elements in an HTML document. CSS also provides another way to select elements, called _CSS selectors_.

CSS selectors are more concise but less flexible than XPath paths. The `.cssselect()` method gets all elements that match a CSS selector:

In [None]:
html.cssselect("a")

Check out the [CSS Diner](https://flukeout.github.io/)!

### Extracting Text and Attributes

There are two ways to get text from an element:

* `.text` gives text inside the element, but not its children
* `.text_content()` gives text inside the element and its children, with all tags removed

In [None]:
page

In [None]:
html.text_content()

In [None]:
a = html.xpath("//a")[0]

In [None]:
a.text_content()

In [None]:
a.text

In [None]:
html.text_content()

In [None]:
html.text
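The difference is easiest to see on a single paragraph with one child. The sketch below also shows `.tail`, the `lxml` attribute that holds the text following an element's closing tag; the snippet itself is a shortened line from the example page.

```python
import lxml.html as lx

p = lx.fromstring('<p>Visit <a href="https://pudding.cool">The Pudding</a>.</p>')
a = p.xpath("./a")[0]

print(repr(p.text))            # text before the first child only
print(repr(p.text_content()))  # all text inside <p>, tags removed
print(repr(a.text))            # text inside <a>
print(repr(a.tail))            # text after </a>, up to the next tag
```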

We can get values from attributes on an element with `.attrib`, which is a dictionary:

In [None]:
a.attrib["href"]

In [None]:
[x.attrib["href"] for x in html.xpath("//a")]
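Indexing `.attrib` raises a `KeyError` when an element lacks the attribute. A minimal sketch of the more forgiving `.get()` method, on a made-up fragment in which the second link has no `href`:

```python
import lxml.html as lx

fragment = lx.fromstring(
    '<p><a href="https://pudding.cool">The Pudding</a> <a>no link</a></p>'
)
links = fragment.xpath(".//a")

# .get() returns None (or a chosen default) instead of raising KeyError
hrefs = [link.get("href") for link in links]
hrefs_with_default = [link.get("href", "") for link in links]

print(hrefs)
print(hrefs_with_default)
print("href" in links[1].attrib)  # membership test works like on a dictionary
```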

### Writing Scrapers

Let's scrape the Wikipedia table ourselves. Note: we are fetching the page with `requests`, so pay attention to the HTML that is actually returned. In the browser's developer tools, inspect the `<thead>` element of the table and compare it with the response shown in the Network tab.

In [None]:
import requests

# Define the User-Agent header
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
}

result = requests.get(url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area', headers = headers)
html = lx.fromstring(result.text)

In [None]:
tables = html.xpath('//table')

In [None]:
tables

In [None]:
table = tables[1]

In [None]:
table.text_content()

In [None]:
html.xpath('//table[2]/thead')

In [None]:
html.xpath('//table[2]/tbody')

In [None]:
def retrieve_rows(html): 
    rows = html.xpath('//table[2]/tbody/tr') # get all rows of the second table
    cells = []
    for row in rows: 
        # './td|th' starts at the current row (instead of searching the whole document again) and selects its td OR th children
        cells.append([cell.text_content() for cell in row.xpath('./td|th')]) # text_content() instead of .text, as some cell text is nested inside <b>
    return cells

In [None]:
retrieve_rows(html)

In [None]:
df = pd.DataFrame(retrieve_rows(html))
df.head()

In [None]:
df.columns = df.iloc[0]
df = df.drop(index = range(2))
df.head()
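The same steps can be tried offline on a made-up table. The sketch below also shows that `lxml` keeps the HTML as delivered: the source contains no `<tbody>`, so an XPath that requires one finds nothing, even though a browser's element inspector would display one. All names (`small_doc`, `small_cells`, `df_small`) are hypothetical.

```python
import lxml.html as lx
import pandas as pd

# Made-up table so the example runs without a network request
small_doc = lx.fromstring("""<html><body><table>
<tr><th>City</th><th>Area</th></tr>
<tr><td><b>Alpha</b></td><td>12.5</td></tr>
<tr><td>Beta</td><td>3.0</td></tr>
</table></body></html>""")

# The source has no <tbody>, and lxml does not add one
print(small_doc.xpath("//table/tbody"))

# Collect the text of every cell, row by row
small_rows = small_doc.xpath("//table//tr")
small_cells = [
    [cell.text_content().strip() for cell in row.xpath("./td | ./th")]
    for row in small_rows
]

# The first row holds the header, the remaining rows hold the data
df_small = pd.DataFrame(small_cells[1:], columns=small_cells[0])
print(df_small)
```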

In [None]:
df = df.iloc[:, [True, True, True, False, True, False, True, False, True]]

In [None]:
df

In [None]:
df.dtypes

In [None]:
from re import sub 
def remove(string):
    '''
    Removes everything inside [], a whitespace before that and *'s.
    '''
    if isinstance(string, str):
        string = sub(r'\s*\[.*\]\**|\n|,|\*', '', string)
        # \s* matches any whitespace (incl. spaces and newlines) before the brackets, \[.*\] any text between square brackets,
        # and \** any trailing asterisks; the alternatives after | also remove a lone \n, comma, or *
        # * means zero or more occurrences, . any character
        # this aims to remove the [a]* after Tribune and the \n in the columns
    return string

In [None]:
df.columns = [remove(i) for i in df.columns] # remove from table columns
df = df.map(remove) #remove from all rows
df.head()

In [None]:
df

In [None]:
for col in df.columns[3:]: # convert only the columns that hold numeric values
    df[col] = df[col].astype(float)
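`astype(float)` raises an error as soon as one value cannot be converted. A possible alternative, sketched here on made-up values, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN` instead.

```python
import pandas as pd

raw = pd.Series(["1,706.8", "263.9", "n/a"])     # made-up cell values

cleaned = raw.str.replace(",", "", regex=False)   # drop thousands separators
numbers = pd.to_numeric(cleaned, errors="coerce") # "n/a" becomes NaN

print(numbers)
print(numbers.dtype)
```

Whether silent `NaN`s are preferable to an error depends on the task; coercion is convenient for exploration, while an error makes unexpected values harder to overlook.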

In [None]:
df.head()

In [None]:
df.dtypes

### Summary 

- HTML pages are structured like a tree, similar to a file system
- Use `lxml` to parse them in Python
- Navigate through HTML with XPath or CSS selectors