# Scraping II

---

## Selenium

See [the reference](https://selenium-python.readthedocs.io/getting-started.html) and the tutorial on [RealPython](https://realpython.com/modern-web-automation-with-python-and-selenium/).

## For Colab only!

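The cell below installs Chromium, its driver, and Selenium inside the Colab VM. See [this Stack Overflow answer](https://stackoverflow.com/a/54077842) for background.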

In [None]:
%%shell
# Ubuntu no longer distributes chromium-browser outside of snap
#
# Proposed solution: https://askubuntu.com/questions/1204571/how-to-install-chromium-without-snap

# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF

# Install chromium and chromium-driver (-y answers the confirmation prompt automatically)
apt-get update
apt-get install -y chromium chromium-driver

# Install selenium
pip install selenium

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

In [None]:
# https://stackoverflow.com/a/76432322
service = Service(executable_path='/usr/bin/chromedriver')
options = webdriver.ChromeOptions()
options.add_argument('--headless')               # no display is available in Colab
options.add_argument('--no-sandbox')             # the sandbox does not work when running as root
options.add_argument('--disable-dev-shm-usage')  # avoid crashes when /dev/shm is small
driver = webdriver.Chrome(service=service, options=options)

In [None]:
URL = "https://en.wikipedia.org/wiki/Artificial_intelligence"
driver.get(URL)

In [None]:
from IPython.display import HTML
# Render the HTML of the loaded page inline in the notebook
HTML(driver.page_source)
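
Since `driver.page_source` is an ordinary HTML string, it can also be inspected without rendering the whole page. The following is a minimal sketch using only the standard library's `html.parser`; the names `TitleAndTagCounter` and `summarize_html` and the `sample` string are hypothetical, introduced here for illustration.

```python
from html.parser import HTMLParser


class TitleAndTagCounter(HTMLParser):
    """Collect the <title> text and count start tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.tag_counts = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Count every opening tag by name
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears inside <title>...</title>
        if self._in_title:
            self.title += data


def summarize_html(html):
    """Return (title, {tag: count}) for an HTML string (hypothetical helper)."""
    parser = TitleAndTagCounter()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.tag_counts


# Stand-in for driver.page_source so the sketch runs without a browser
sample = (
    "<html><head><title>Demo page</title></head>"
    "<body><img src='a.png'><img src='b.png'><p>Hi</p></body></html>"
)
title, counts = summarize_html(sample)
print(title, counts.get("img", 0))
```

In the notebook, `summarize_html(driver.page_source)` would give a quick check that the expected page loaded.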

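Selenium locates elements much as BeautifulSoup does: `find_element` returns the first match, and `find_elements` returns a list of all matches. See the [Locating Elements](https://selenium-python.readthedocs.io/locating-elements.html#locating-elements) chapter.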

In [None]:
from selenium.webdriver.common.by import By
# All <img> elements on the page, as a list of WebElement objects
driver.find_elements(By.TAG_NAME, "img")

In [None]:
first_img = driver.find_element(By.TAG_NAME, "img") # only the first element

In [None]:
# Read the image URL from the element's src attribute
first_img.get_attribute('src')
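
The same attribute lookup extends to every matched element. Below is a minimal sketch of a helper that gathers one attribute across a list of elements and skips those that lack it. `collect_attribute` and `FakeElement` are hypothetical names; `FakeElement` only stands in for Selenium's `WebElement` so the example runs without a browser.

```python
def collect_attribute(elements, name):
    """Return the non-empty values of attribute `name` across `elements`.

    Works with any objects exposing get_attribute(name), such as the
    WebElement objects returned by driver.find_elements(...).
    """
    values = []
    for element in elements:
        value = element.get_attribute(name)
        if value:  # skip None and empty strings
            values.append(value)
    return values


class FakeElement:
    """Hypothetical stand-in for a WebElement (assumption, not Selenium API)."""

    def __init__(self, attrs):
        self._attrs = attrs

    def get_attribute(self, name):
        # Return None for a missing attribute, as WebElement.get_attribute does
        return self._attrs.get(name)


demo = [
    FakeElement({"src": "https://example.org/a.png"}),
    FakeElement({}),  # an element with no src attribute
    FakeElement({"src": "https://example.org/b.png"}),
]
print(collect_attribute(demo, "src"))
```

With the live driver from the cells above, the call would be `collect_attribute(driver.find_elements(By.TAG_NAME, "img"), "src")`.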