# Scraping I

---

## BeautifulSoup

See the official [quickstart](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start) and [this tutorial](https://realpython.com/beautiful-soup-web-scraper-python/).

If you need to load/save to your drive:

```python
import sys
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive/')

import os
os.chdir('drive/My Drive/IS53055B-DMLCP/DMLCP/python') # change to your directory
```

In [None]:
from bs4 import BeautifulSoup

In [None]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

Render straight into the notebook!

In [None]:
import IPython # https://stackoverflow.com/a/55329863
IPython.display.HTML(soup.prettify())

Now we can programmatically navigate the webpage.

In [None]:
soup.title

In [None]:
soup.title.name

In [None]:
soup.title.string

In [None]:
soup.title.parent.name

In [None]:
soup.p

In [None]:
soup.p['class']

In [None]:
soup.a

In [None]:
soup.find_all('a') # also try 'p'

In [None]:
soup.find(id="link3")
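
Searches can also be narrowed by class, or written as CSS selectors. The sketch below is self-contained, so it rebuilds a cut-down copy of the Dormouse document under the made-up names `sample_html` and `sample_soup`; `find_all(..., class_=...)`, `select` and `select_one` are Beautiful Soup methods, everything else is just illustration.

```python
from bs4 import BeautifulSoup

# A cut-down copy of the Dormouse document, so this block runs on its own.
sample_html = """<html><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# filter by tag name *and* class (class_ with an underscore: class is a Python keyword)
story_paragraphs = sample_soup.find_all('p', class_='story')

# the same kind of query written as CSS selectors
sister_names = [a.get_text() for a in sample_soup.select('a.sister')]
second_link = sample_soup.select_one('#link2')  # first match only, or None

print(len(story_paragraphs))
print(sister_names)
print(second_link['href'])
```

CSS selectors are handy because you can often copy them more or less directly from your browser's inspector.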

Extract links.

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

Extract only the text.

In [None]:
print(soup.get_text())

If you wanted to save this text to a file, you would do:

```python
with open("dormouse-story.txt", "w") as o: # open file object in write mode
    o.write(soup.get_text())               # write the text
```
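
Scraped text often contains non-ASCII characters, so it can help to state the encoding explicitly when saving and loading. A minimal sketch of a save-then-reload round trip follows; the sample string, the temporary directory and the names `out_path` and `loaded` are assumptions made for illustration, not part of the notebook.

```python
import tempfile
from pathlib import Path

# stand-in for soup.get_text(), with a non-ASCII character on purpose
text = "The Dormouse's story\nOnce upon a time… café\n"

# write into a throwaway directory so we don't clutter the working directory
out_path = Path(tempfile.mkdtemp()) / "story.txt"

with open(out_path, "w", encoding="utf-8") as o:  # "w" = write mode
    o.write(text)

with open(out_path, encoding="utf-8") as i:       # default mode is "r" (read)
    loaded = i.read()

print(loaded == text)
```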

Don't forget to read (or, more like, search for stuff in) the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)! (Or ask ChatGPT...)

---

## Important Note: Download Webpages in Python

Above we had the html code at hand. But of course IRL we want to grab it from the web!

For that we can use [requests](https://pypi.org/project/requests/).

(See the [mentioned tutorial](https://realpython.com/beautiful-soup-web-scraper-python/#step-2-scrape-html-content-from-a-page).)

In Colab it's already installed, otherwise (**in an environment!**):

```bash
conda install -c anaconda requests
```

In [None]:
import requests

In [None]:
URL = "https://en.wikipedia.org/wiki/Artificial_intelligence"
page = requests.get(URL)

In [None]:
page # response 200 means OK
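
Other status codes are worth recognising too, since a scraper will meet them sooner or later. As a rough sketch, the standard library's `http.HTTPStatus` can translate a numeric code into its usual phrase. The helper name `describe_status` is made up for this example, and it would raise a `ValueError` for a code the standard library does not know.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase   # standard reason phrase, e.g. 'OK'
    ok = 200 <= code < 300             # the 2xx range means success
    return f"{code} {phrase} ({'success' if ok else 'not a success'})"

for code in (200, 403, 404, 429):
    print(describe_status(code))
```

With a `requests` response you would pass it `page.status_code`.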

In [None]:
print(page.text)

Like before, we can display the text.

In [None]:
IPython.display.HTML(page.text) # not sure why the images didn't get fetched

We can also do that with [urllib](https://docs.python.org/3/howto/urllib2.html).

In [None]:
import urllib.request

with urllib.request.urlopen(URL) as response:
    html = response.read()

In [None]:
# IPython.display.HTML(html.decode()) # will print the page as before (with images!)

The main skill you need now in order to scrape successfully is, actually, [**HTML**](https://www.w3schools.com/html/)! That is, understanding how webpages are constructed.

## Grab images

Everything has to be done manually, by trial and error.

What we do:
- We check all `img` tags, and get their `src`
- We inspect the links, see if they work out of the box or not
- We correct the links, and try and request their contents
- We find the way to save that to files
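
For the "correct the links" step, one possible shortcut is `urllib.parse.urljoin` from the standard library, which resolves a `src` value against the URL of the page it came from. This is a sketch, not the method used in this notebook, and the three `src` values below are made up to stand for the kinds of links you are likely to meet.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

# made-up src values, one per kind of link
example_srcs = [
    "//upload.wikimedia.org/example/thumb.png",  # protocol-relative: needs a scheme
    "https://example.org/images/logo.png",       # already absolute: left untouched
    "/static/images/icon.png",                   # relative to the site root
]

absolute_links = [urljoin(page_url, src) for src in example_srcs]
for link in absolute_links:
    print(link)
```

Doing the steps by hand, as we do here, is still worthwhile: it shows you what the links on a given site actually look like.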

### Note: this process is error-prone and sometimes *painful*!

Don't underestimate the time and learning you will need to do this: it's a huge chunk of the process.

In [None]:
wiki_soup = BeautifulSoup(html, 'html.parser') # works the same with page.content from earlier
images_links = []
for img in wiki_soup.find_all('img'):
    # print(img)
    src = img.get('src')
    if src: # skip img tags that have no src attribute
        images_links.append(src)

for i in images_links:
    print(i)

In [None]:
import os
d = 'scraped-images'
os.makedirs(d, exist_ok=True) # no error if the directory already exists
os.listdir()

In [None]:
import time   # time module for pausing programme
import shutil # OS module for saving a stream of bytes

Helper functions.

In [None]:
def make_request(link):
    r = requests.get(link, stream=True)
    if r.status_code == 200:
        # print('got it!')
        return r
    else:
        # print('nope')
        return None

def make_filename(link):
    idx = link.rfind('/') # find the last /
    return link[idx+1:]
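
Taking everything after the last `/` works for simple links, but it keeps any query string (`?width=200`) and percent-encoding (`%20`) in the filename. A slightly more defensive variant, under the hypothetical name `filename_from_url`, could look like the sketch below. It uses only the standard library, and the example URLs are invented.

```python
import posixpath
from urllib.parse import urlsplit, unquote

def filename_from_url(link):
    """Return the last path component of a URL, without query or fragment."""
    path = urlsplit(link).path          # drops scheme, host, ?query and #fragment
    name = posixpath.basename(path)     # everything after the last /
    return unquote(name)                # turn %20 back into a space, etc.

print(filename_from_url("https://example.org/images/cat%20photo.png?width=200#top"))
print(filename_from_url("https://example.org/a/b/logo.svg"))
```

Note that a URL ending in `/` gives an empty string, so you would still need a fallback name in that case.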

In [None]:
wiki_url = 'https://en.wikipedia.org'

print('attempting to scrape:')
for l in images_links:
    print(f'- {l}')
                                                             # three types of links (found by trial and error!)
    if l.startswith('//'):                                   # - protocol-relative ones (//...), missing a scheme
        l = f"https:{l}"                                     #   prepend the scheme, keep the rest as is
        resp = make_request(l)
    elif l.startswith('http'):                               # - ones requiring nothing
        resp = make_request(l)
    else:                                                    # - ones requiring to add the leading wiki url
        l = f"{wiki_url}{l}"
        resp = make_request(l)

    if resp is not None:                                     # if we got something
        fname = make_filename(l)                             # get the filename
        print(f'  attempting to save {fname}')
        with open(os.path.join(d, fname), 'wb') as o:        # saving logic, see here: https://towardsdatascience.com/a-tutorial-on-scraping-images-from-the-web-using-beautifulsoup-206a7633e948
            resp.raw.decode_content = True                   #                         https://stackoverflow.com/a/29328036
            shutil.copyfileobj(resp.raw, o)                  #                         https://stackoverflow.com/a/13137873
    else:
        print(f'  could not retrieve this one')

    time.sleep(1) # BE NICE, let the server breathe and space out your calls

In [None]:
from PIL import Image
Image.open(os.path.join(d, 'wikipedia.png')) # display one of our downloaded images

---

## Next steps

BeautifulSoup will not be able to handle much interactivity on a website (for example, if you need to click on something to open the page). The next level for scraping is then to use a *headless browser* that you can automate (devilish, really).

The tool for that is [Selenium](https://selenium-python.readthedocs.io/installation.html#installing-python-bindings-for-selenium). Note that you need to install not only the library, but also the *driver* that will pass on the commands to whichever browser you wish to use (Chrome, Chromium, Firefox, etc.). The website has an intro and tutorial.