# Scrape Digi-Key Table Data with pandas and Selenium
* In this project we demonstrate how to use the pandas `read_html()` function to extract table data from the Digi-Key website, a large database of electronic components
* We also use Selenium to extract a piece of data that `read_html()` does not capture: the link stored in each table row
* We will extract table data from the following webpage: https://www.digikey.com/en/products/filter/accessories/159
* Scrolling down, we can see that the table appears as below:

![digikey-table.png](attachment:digikey-table.png)

## Import required libraries

In [None]:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

## Read in all tables in the url
* The `read_html()` function reads every table in the page's HTML and returns them as a list of DataFrames

In [None]:
url = 'https://www.digikey.com/en/products/filter/accessories/159'

table_dk = pd.read_html(url)

print(f'Total tables: {len(table_dk)}')
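
To make the return value concrete, here is a minimal sketch that runs `read_html()` on a small inline HTML snippet instead of the live page. The snippet and its column names are made up for illustration, and the call assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, used only for illustration
html = """
<table>
  <tr><th>Part</th><th>Stock</th></tr>
  <tr><td>ABC-1</td><td>10</td></tr>
  <tr><td>ABC-2</td><td>0</td></tr>
</table>
<table>
  <tr><th>Series</th></tr>
  <tr><td>X</td></tr>
</table>
"""

# read_html() returns one DataFrame per <table> element it finds
tables = pd.read_html(StringIO(html))

print(f'Total tables: {len(tables)}')
print(tables[0])
```

Because the result is always a list, a page with a single table still has to be indexed with `[0]` to get the DataFrame itself.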

## Initialize new pandas DataFrame to store the table data
* `table_dk` has a length of one, so only one table was extracted from the HTML
* We store the first element of the list in a new DataFrame, `df`, and drop its first row with `iloc[1:]`

In [None]:
df = table_dk[0]

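# Drop the first row; .copy() makes df an independent DataFrame, so adding or
# removing columns later does not trigger a SettingWithCopyWarning
df = df.iloc[1:].copy()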

df.head()

## Remove unneeded column
* We don't need the column 'Compare' since it doesn't hold any relevant data
    * The column can be removed using the `drop()` method
    * We specify `axis=1` to indicate that we want to remove a column and not a row

In [None]:
df.drop('Compare', inplace=True, axis=1)

df.head()
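
As a small illustration of the `axis` argument, the sketch below drops a column from a made-up DataFrame (the data is hypothetical, not taken from Digi-Key). It also shows the equivalent `columns=` spelling, which some readers find easier to scan.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table
demo = pd.DataFrame({
    'Compare': [None, None],
    'Part': ['ABC-1', 'ABC-2'],
    'Stock': [10, 0],
})

# axis=1 targets columns; columns='Compare' is an equivalent spelling
by_axis = demo.drop('Compare', axis=1)
by_name = demo.drop(columns='Compare')

print(list(by_axis.columns))
print(by_axis.equals(by_name))
```

Without `inplace=True`, `drop()` returns a new DataFrame and leaves `demo` unchanged.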

## Extract specialized link data not captured by `read_html()`
* We want to extract the link to the part data, which is also stored in the table
    * `read_html()` only extracts the text content, so we have to extract the link using another method (Selenium)
* First we need the XPath for each row element of the table, as well as the XPath for the `<a>` tag containing the link data
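
The text-only behavior can be seen on a small inline snippet. The sketch below uses made-up HTML and assumes pandas 1.5 or later with an HTML parser such as `lxml` installed. It also shows the `extract_links` argument of `read_html()` as a point of comparison; this notebook uses Selenium instead.

```python
from io import StringIO

import pandas as pd

# Hypothetical one-row table whose first cell wraps its text in a link
html = """
<table>
  <tr><th>Part</th><th>Stock</th></tr>
  <tr><td><a href="/p/abc-1">ABC-1</a></td><td>10</td></tr>
</table>
"""

# Default call: only the visible text of each cell is kept
text_only = pd.read_html(StringIO(html))[0]
print(text_only.loc[0, 'Part'])

# extract_links='body' turns each body cell into a (text, href) tuple
with_links = pd.read_html(StringIO(html), extract_links='body')[0]
print(with_links.loc[0, 'Part'])
```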

## Instantiate WebDriver

In [None]:
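from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--headless')
# Chrome expects the window size as width,height
chrome_options.add_argument('--window-size=1920,1080')
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)
driver.get(url)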

print(driver.title)

## Determine number of rows in table
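* The table should contain 25 rows in this case
    * `rows` is set to one more than the row count, because it serves as the exclusive upper bound of a loop over 1-based XPath indices

In [None]:
# Give the page time to render the table
time.sleep(3)

tr_xpath = '/html/body/div[2]/main/section/div/div[2]/div/div[2]/div/div[1]/table/tbody/tr'

# 'xpath' is the locator strategy string (the value of By.XPATH)
rows = 1 + len(driver.find_elements('xpath', tr_xpath))

# Print the number of table rows
print(rows - 1)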

## Loop through table and extract link from each row
* We will use the `find_element()` method with an XPath locator to get the `<a>` tag
* Then we will use the Selenium `get_attribute()` method to extract the data stored in the `href` attribute
* We will store the extracted links in a list and add them to the original DataFrame as a new column
    * If a row has no link we store `None`, so the list stays the same length as the DataFrame

In [None]:
links = []

for i in range(1, rows):
    try:
        # XPath indices are 1-based: tr[1] is the first table row
        elem = driver.find_element('xpath', f'{tr_xpath}[{i}]/td[2]/div/div[3]/div[1]/a')
        links.append(elem.get_attribute('href'))
    except NoSuchElementException:
        # Keep a placeholder so links stays aligned with the DataFrame rows
        links.append(None)

df['Link'] = links

df.head()
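
The loop above rebuilds an absolute XPath for every row. A possible variation, sketched here, is to fetch the row elements once and search inside each row with a relative XPath. The helper name `collect_links` and the looser default path `./td[2]//a` are assumptions for illustration, not the path used in this notebook.

```python
def collect_links(driver, row_xpath, link_xpath='./td[2]//a'):
    """Return one href (or None) per table row, in row order."""
    links = []
    for row in driver.find_elements('xpath', row_xpath):
        # find_elements() returns an empty list when nothing matches,
        # so no exception handling is needed for rows without a link
        anchors = row.find_elements('xpath', link_xpath)
        links.append(anchors[0].get_attribute('href') if anchors else None)
    return links
```

Calling `collect_links(driver, tr_xpath)` returns a list with one entry per row, which can be assigned to `df['Link']` as long as the row counts match.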

## Close the WebDriver

In [None]:
driver.quit()

## Complete scraper

In [None]:
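import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import NoSuchElementException

url = 'https://www.digikey.com/en/products/filter/accessories/159'

# Read every table in the page's HTML into a list of DataFrames
table_dk = pd.read_html(url)

print(f'Total tables: {len(table_dk)}')

# Keep the first table, drop its first row, and copy it so later edits are safe
df = table_dk[0].iloc[1:].copy()

df.drop('Compare', inplace=True, axis=1)

chrome_options = Options()
chrome_options.add_argument('--headless')
# Chrome expects the window size as width,height
chrome_options.add_argument('--window-size=1920,1080')
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('./chromedriver'), options=chrome_options)
driver.get(url)

print(driver.title)

# Give the page time to render the table
time.sleep(3)

tr_xpath = '/html/body/div[2]/main/section/div/div[2]/div/div[2]/div/div[1]/table/tbody/tr'

# One more than the row count: exclusive upper bound for the 1-based XPath index
# 'xpath' is the locator strategy string (the value of By.XPATH)
rows = 1 + len(driver.find_elements('xpath', tr_xpath))

# Print the number of table rows
print(rows - 1)

links = []

for i in range(1, rows):
    try:
        elem = driver.find_element('xpath', f'{tr_xpath}[{i}]/td[2]/div/div[3]/div[1]/a')
        links.append(elem.get_attribute('href'))
    except NoSuchElementException:
        # Keep a placeholder so links stays aligned with the DataFrame rows
        links.append(None)

df['Link'] = links

driver.quit()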