# Data Extraction with Selenium
This tutorial shows how to use Selenium to control a web browser from Python and extract data from web pages. See https://selenium-python.readthedocs.io for more details.

## Installation
Before using Selenium, we have to install a WebDriver for the browser of our choice, such as Chrome or Firefox. Once it is installed, we need to know the location of the driver, because it can be passed as a parameter when starting a browser. For Chrome, the easiest way to install the driver is the Python helper package chromedriver_autoinstaller.

        pip install chromedriver_autoinstaller

We also have to install the selenium package.

        pip install selenium

In [1]:
from selenium import webdriver
import chromedriver_autoinstaller
import time
import os



In [2]:
chromedriver_autoinstaller.install()

CHROME >= 115, using mac-arm64 as architecture identifier


'/Users/natawut/Library/Mobile Documents/com~apple~CloudDocs/Classes/Data Science and Data Engineering/Examples/data-extraction-examples/.venv/lib/python3.9/site-packages/chromedriver_autoinstaller/141/chromedriver'

In [3]:
browser = webdriver.Chrome()
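
The installation section mentions that the driver location can be passed as a parameter when starting a browser. A minimal sketch of how that might look, assuming Selenium 4 (where the path goes through a `Service` object) and a recent Chrome that accepts the `--headless=new` switch. The helper names `chrome_arguments` and `start_chrome` are hypothetical and not part of Selenium.

```python
def chrome_arguments(headless=True, window_size=(1280, 800)):
    """Build a list of Chrome command-line switches (no browser needed)."""
    args = []
    if headless:
        # Assumption: a recent Chrome that understands the new headless mode.
        args.append('--headless=new')
    if window_size is not None:
        width, height = window_size
        args.append('--window-size={},{}'.format(width, height))
    return args


def start_chrome(driver_path=None, headless=True):
    """Start Chrome, optionally with an explicit chromedriver location."""
    # Imported here so chrome_arguments() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless=headless):
        options.add_argument(arg)

    if driver_path:
        service = Service(executable_path=driver_path)
        return webdriver.Chrome(service=service, options=options)
    # Without a path, Selenium looks the driver up on its own.
    return webdriver.Chrome(options=options)
```

The value returned by `chromedriver_autoinstaller.install()` in the earlier cell is such a path, so `start_chrome(chromedriver_autoinstaller.install(), headless=False)` would open a visible window using that driver.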

## Browsing a webpage
Once the browser starts, we can tell it to visit a webpage.

In [4]:
url = 'https://www.google.com'

In [13]:
browser.get(url=url)

In [14]:
html = browser.execute_script("return document.documentElement.outerHTML")
html[:3000]

'<html itemscope="" itemtype="http://schema.org/WebPage" lang="th"><head><meta charset="UTF-8"><meta content="dark light" name="color-scheme"><meta content="origin" name="referrer"><link href="//www.gstatic.com/images/branding/searchlogo/ico/favicon.ico" rel="icon"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="">window._hst=Date.now();</script><script nonce="">(function(){var _g={kEI:\'RXP_aKln4_zV7w_J5fOICA\',kEXPI:\'31\',kBL:\'KYuT\',kOPI:89978449};(function(){var a;((a=window.google)==null?0:a.stvsc)?google.kEI=_g.kEI:window.google=_g;}).call(this);})();(function(){google.sn=\'webhp\';google.kHL=\'th\';google.rdn=false;})();(function(){\nvar g=this||self;function k(){return window.google&&window.google.kOPI||null};var l,m=[];function n(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||l}function p(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid"))
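
Besides running JavaScript, the driver exposes the page HTML and some metadata directly as attributes. A small sketch using `page_source`, `title`, and `current_url`, which are standard WebDriver attributes. The helper name `page_summary` and the preview length are illustrative choices, and `page_source` may differ slightly from the `outerHTML` string obtained above.

```python
def page_summary(browser, preview_chars=200):
    """Return the title, URL, and the start of the current page's HTML."""
    html = browser.page_source  # HTML of the current page as a string
    return {
        'title': browser.title,
        'url': browser.current_url,
        'html_length': len(html),
        'html_preview': html[:preview_chars],
    }
```

Calling `page_summary(browser)` after `browser.get(url=url)` gives a compact view of the page without printing thousands of characters.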

## Interact with a webpage
When the page is loaded, we can interact with all elements in the webpage.  In this example, we will perform a search for a particular keyword in Google.  We will have to locate the correct element and then send the proper keys.

In [15]:
from selenium.webdriver.common.by import By

In [16]:
q_element = browser.find_element(By.CSS_SELECTOR, '[name=q]')
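
Note that `find_element` raises `NoSuchElementException` when nothing matches, whereas `find_elements` returns an empty list. A minimal sketch of a lookup that returns `None` instead of raising. `find_first` is a hypothetical helper, not a Selenium function.

```python
def find_first(root, by, selector):
    """Return the first element under root that matches, or None."""
    matches = root.find_elements(by, selector)  # empty list when nothing matches
    return matches[0] if matches else None
```

For example, `find_first(browser, By.CSS_SELECTOR, '[name=q]')` returns the search box if it exists. The `root` argument can be the browser or another element, since both provide `find_elements`.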

In [17]:
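q_element.clear()                 # remove any text already in the search box
q_element.send_keys('ประเทศไทย')  # type the keyword ("Thailand" in Thai)
q_element.send_keys('\ue007')     # '\ue007' is the key code for Enter (the value of Keys.ENTER)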
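
After Enter is pressed, the results page takes a moment to load, so looking for result elements immediately can fail. A hedged sketch of the same search written with the named constant `Keys.ENTER` and an explicit wait. The helper name `submit_search`, the 10-second timeout, and waiting on the `#search` container (the selector prefix used in the next section) are assumptions, and Google's markup may change.

```python
def submit_search(browser, keyword, timeout=10):
    """Type a keyword into the search box, press Enter, and wait for results."""
    # Imported inside the function so that defining it needs no browser.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    box = browser.find_element(By.CSS_SELECTOR, '[name=q]')
    box.clear()
    box.send_keys(keyword, Keys.ENTER)  # send_keys accepts several values

    # Block until the results container is in the DOM; raises
    # TimeoutException if it does not appear within `timeout` seconds.
    return WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#search'))
    )
```

An explicit wait of this kind is generally preferable to a fixed `time.sleep`, because it returns as soon as the condition holds.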

## Navigate the webpage
We can navigate the elements of the current webpage much as we would with Beautiful Soup. Selenium can locate elements in several ways, including CSS selectors, XPath, ID, name, tag name, and link text.

In [20]:
all_link = browser.find_elements(By.CSS_SELECTOR, '#search a[jsname]')

In [21]:
for link in all_link:
    print('[link text]', link.text)
    print('[link href]', link.get_attribute('href'))
    print('---')

[link text] ประเทศไทย
Wikipedia
https://th.wikipedia.org › wiki › ประเทศไทย
[link href] https://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B0%E0%B9%80%E0%B8%97%E0%B8%A8%E0%B9%84%E0%B8%97%E0%B8%A2
---
[link text] Thairath
ประกาศฉบับ 1 เตือน "อากาศแปรปรวน" บริเวณประเทศไทย มีผลกระทบ 29 ต.ค. – 2 พ.ย.
3 ชั่วโมงที่ผ่านมา
[link href] https://www.thairath.co.th/news/local/2891690
---
[link text] Sanook.com
10 จังหวัดที่สงบที่สุดในประเทศไทย เหมาะแก่การพักกายพักใจ
4 ชั่วโมงที่ผ่านมา
[link href] https://www.sanook.com/campus/1430980/
---
[link text] thestandard.co
สมเด็จพระพันปีหลวง จอมพลหญิง พระองค์แรกของประเทศไทย
10 ชั่วโมงที่ผ่านมา
[link href] https://thestandard.co/queen-sirikit-first-female-field-marshal/
---
[link text] BBC
เปิดรายละเอียด ถ้อยแถลงร่วมไทย-กัมพูชา ที่มี โดนัลด์ ทรัมป์ เป็นสักขีพยาน
1 วันที่ผ่านมา
[link href] https://www.bbc.com/thai/articles/c5ypg4eqzdko
---
[link text] 
[link href] https://th.wikipedia.org/wiki/%E0%B8%88%E0%B8%B1%E0%B8%87%E0%B8%AB%E0%B8%A7%E0%B8%B1%E0%B8
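
For data extraction it is usually more convenient to store the links than to print them. A minimal sketch that converts the located elements into a list of dictionaries, skips entries without visible text (such as the last one above), and decodes percent-encoded URLs with the standard library. `links_to_records` and the dictionary keys are hypothetical names.

```python
from urllib.parse import unquote


def links_to_records(elements):
    """Convert link elements into dicts with a label and a readable URL."""
    records = []
    for element in elements:
        text = element.text.strip()
        href = element.get_attribute('href')
        if not text or not href:
            continue  # skip links with no visible text or no target
        records.append({
            'label': text.splitlines()[0],  # first line of the link text
            'href': unquote(href),          # decode %XX escapes in the URL
        })
    return records
```

With the cell above, `records = links_to_records(all_link)` would give one dictionary per visible result, ready to be written to a file or loaded into a table.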

In [22]:
all_link[0].click()
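
Clicking a link loads a new page, and element references obtained on the previous page generally stop being valid (Selenium raises `StaleElementReferenceException` when they are reused). One way around this, sketched below as an alternative to clicking rather than as a step in this walkthrough, is to collect the `href` strings first and open each with `get`. `visit_links` and its `limit` parameter are hypothetical.

```python
def visit_links(browser, hrefs, limit=3):
    """Open up to `limit` URLs and return (url, page title) pairs."""
    visited = []
    for href in hrefs[:limit]:
        browser.get(href)  # navigate directly instead of clicking
        visited.append((href, browser.title))
    return visited
```

The URLs would have to be gathered before leaving the results page, for example with `hrefs = [a.get_attribute('href') for a in all_link]`. The browser stays on the last page visited.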

In [33]:
all_headlines = browser.find_elements(By.CSS_SELECTOR, '.vector-toc-text span:not([class])')

In [38]:
for h in all_headlines:
    print('[text]', h.text)
    print('[tag]', h.tag_name)
    parent = h.find_element(By.XPATH, '..')
    print('[parent / class] {} / {}'.format(parent.tag_name, parent.get_attribute('class')))
    print('---')

[text] ชื่อเรียก
[tag] span
[parent / class] div / vector-toc-text
---
[text] ประวัติศาสตร์
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[parent / class] div / vector-toc-text
---
[text] ภูมิประเทศ
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[parent / class] div / vector-toc-text
---
[text] การเมืองการปกครอง
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[parent / class] div / vector-toc-text
---
[text] 
[tag] span
[p
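
Many of the entries above have empty text. A likely cause is that `text` only returns text that is rendered on screen, and the nested table-of-contents entries are collapsed. A minimal sketch that falls back to the DOM `textContent` property, assuming the hidden entries still carry their text in the DOM. `element_text` is a hypothetical helper.

```python
def element_text(element):
    """Return the visible text, or the raw DOM text when nothing is visible."""
    visible = element.text.strip()
    if visible:
        return visible
    # textContent includes text of hidden nodes; it may be None.
    raw = element.get_attribute('textContent') or ''
    return raw.strip()
```

Replacing `h.text` with `element_text(h)` in the loop above should then print a heading for the collapsed entries as well.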

## End browsing session

In [39]:
browser.quit()
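
If a cell raises an exception before `quit()` runs, the browser process can be left open. A sketch of a wrapper that always closes the session, where `run_with_browser`, `make_browser`, and `task` are hypothetical names.

```python
def run_with_browser(make_browser, task):
    """Create a browser, run task(browser), and always quit the browser."""
    browser = make_browser()
    try:
        return task(browser)
    finally:
        browser.quit()  # runs even if task raises
```

For example, `run_with_browser(webdriver.Chrome, lambda b: b.get('https://www.google.com') or b.title)` opens Chrome, returns the page title, and closes the browser whether or not the task succeeds.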