*This notebook documents my first attempts at scraping the web with Selenium.*

We can use Selenium to automate actions within a web browser. Setup was pretty easy:

1. Install the Selenium library with `pip install selenium`

2. Install WebDriver Manager with `pip install webdriver-manager`

### What is Selenium?

Selenium is a browser automation library with bindings for multiple languages, including Python, C#, Ruby, and JavaScript. Often used for testing web applications, it is popular with data scientists, developers, and software engineers who build and maintain applications.

### Why use Selenium over Beautiful Soup?

For scraping, Selenium is used to extract data and elements from websites and web applications in order to gather information about a topic or build a dataset.

Beautiful Soup is great for extracting content from websites, but it runs into trouble when a site uses JavaScript to load content after the initial HTML has been delivered.
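
To make the problem concrete, here is a minimal sketch using Python's standard `html.parser` module, which Beautiful Soup's `'html.parser'` backend builds on. The page markup is hypothetical: the list is empty in the HTML and only filled in by a script. A static parser never runs that script, so it sees no list items.

```python
from html.parser import HTMLParser

# Hypothetical page: the list is empty in the HTML and filled in by a script.
STATIC_HTML = """
<ul id="shows"></ul>
<script>
  const list = document.getElementById("shows");
  for (const name of ["Show A", "Show B"]) {
    const item = document.createElement("li");
    item.textContent = name;
    list.appendChild(item);
  }
</script>
"""


class TagCounter(HTMLParser):
    """Count opening tags by name. Script bodies are read as text, never run."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(STATIC_HTML)

# Number of <li> tags the static parser can see in the raw HTML
li_count = counter.counts.get("li", 0)
print(li_count)
```

A real browser would run the script and show two list items, which is why driving a browser with Selenium helps on pages like this.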

You can check which websites rely on JavaScript to load content by disabling JavaScript in your browser, either through Chrome DevTools ([Disable Javascript](https://developer.chrome.com/docs/devtools/javascript/disable/)) or with an extension. Some examples of what breaks:

- Bandcamp stops the music and you can't click play
- NTS lets you navigate links but not click any play buttons, and only loads about 12 shows at a time
- Amazon doesn't load products or menus at all
- Airbnb doesn't load anything!

Let's begin by importing everything we need:

In [None]:
# Standard library
import time

# Third-party: parsing, tabular data, and notebook display helpers
import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import display, clear_output

# Selenium and the driver manager
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

First, we use `webdriver_manager` to download the latest driver for the browser we would like to automate. The driver acts as a bridge between Selenium and the browser. I'm using Chrome in this case, so that's the driver I'll be installing:

In [None]:
service = Service(executable_path=ChromeDriverManager().install())

Start the Chrome browser:

In [None]:
driver = webdriver.Chrome(service=service)


Navigate to the page we want, then pause for a second to give it time to load before we issue further commands. In this case I'm searching Airbnb for a cave I can stay in, anywhere, for a weekend:

In [None]:
driver.get("https://www.airbnb.co.uk/?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&search_mode=flex_destinations_search&source=structured_search_input_header&search_type=category_change&date_picker_type=flexible_dates&flexible_trip_lengths%5B%5D=weekend_trip&location_search=MIN_MAP_BOUNDS&category_tag=Tag%3A670")
time.sleep(1)
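
A fixed `time.sleep(1)` is a guess: it can be too short on a slow connection and wasted time on a fast one. Selenium provides `WebDriverWait` for explicit waits of this kind. The sketch below only illustrates the underlying polling idea in plain Python, so it runs without a browser. The helper `wait_until`, its parameters, and the commented-out usage with `driver` are illustrative assumptions, not part of the original notebook.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    (Hypothetical helper for illustration.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Assumed usage with the notebook's driver (needs a live browser, so commented out):
# cards = wait_until(
#     lambda: driver.find_elements(By.CSS_SELECTOR, '[itemprop="itemListElement"]')
# )

# Stand-in demo: the "page" only becomes ready on the third check.
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3


ready = wait_until(page_ready, timeout=2.0, poll_interval=0.01)
```

The wait returns as soon as the condition holds, so a fast page costs almost nothing and a slow page gets up to `timeout` seconds.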

The following returns a BeautifulSoup object, which represents the HTML document as a nested data structure:

In [None]:
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup

Now we can easily extract the listings on the Airbnb page from our soup object, because each one has the `itemprop` attribute set to `itemListElement`:

In [None]:
listings = soup.select('[itemprop="itemListElement"]')
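
To see what that attribute selector does without opening a browser, here is a small sketch run against hand-written markup. The HTML, the `example.com` URLs, and the `sample_` variable names are invented for illustration. Real Airbnb pages are far larger and their structure may differ or change.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup standing in for a results page.
sample_html = """
<div itemprop="itemListElement">
  <meta itemprop="name" content="Cave in Santorini">
  <meta itemprop="url" content="www.example.com/rooms/1">
  <span>Santorini, Greece</span>
</div>
<div itemprop="itemListElement">
  <meta itemprop="name" content="Cave in Cappadocia">
  <meta itemprop="url" content="www.example.com/rooms/2">
  <span>Cappadocia, Turkey</span>
</div>
<div class="footer">Not a listing</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# [attr="value"] matches any tag whose attribute has exactly that value.
sample_listings = sample_soup.select('[itemprop="itemListElement"]')

# select_one returns the first match inside each listing; ["content"] reads
# that tag's attribute value.
sample_urls = [
    listing.select_one('[itemprop="url"]')["content"] for listing in sample_listings
]
print(len(sample_listings), sample_urls)
```

The footer `div` is skipped because it has no `itemprop` attribute, which is what makes the selector a convenient way to isolate listing cards.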

In [None]:
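def property_info(listings):
    """Visit each listing page and collect its details into a DataFrame.

    Relies on the global `driver`. The positions used below (details[1],
    details[4], and the first four <li> texts) are assumptions about the
    page layout and may break if the markup changes.
    """
    columns = ['Location', 'Price', 'Guests', 'Bedrooms', 'Beds', 'Bathrooms']
    df = pd.DataFrame(columns=columns)
    for i, listing in enumerate(listings):
        # All text fragments inside the listing card, in document order
        details = listing.find_all(string=True)
        location, price = details[1], details[4]
        # This code assumes the `content` attribute holds the URL without a scheme
        url = listing.select_one('[itemprop="url"]')['content']
        driver.get("https://" + str(url))
        time.sleep(1)  # give the listing page time to load
        page_soup = BeautifulSoup(driver.page_source, 'html.parser')
        amenities = []
        for element in page_soup.find_all('li'):
            for text in element.find_all(string=True):
                # Skip fragments that contain the ' · ' separator
                if ' · ' not in text:
                    amenities.append(text)
        row = [location, price] + amenities
        # Keep the first six values, padding with None if the page had fewer
        row = (row + [None] * len(columns))[:len(columns)]
        df.loc[len(df)] = row
        time.sleep(1)
        clear_output(wait=True)
        display(f"{100 * (i + 1) / len(listings):.0f}% complete")
    return df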
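
The function above prepends `"https://"` to every listing URL, which assumes the `content` attribute never includes a scheme. If a site ever returned a full URL, the result would start with `https://https://`. A minimal sketch of a more defensive approach follows. The helper name `ensure_scheme` and the `example.com` URLs are hypothetical.

```python
from urllib.parse import urlsplit


def ensure_scheme(url, default_scheme="https"):
    """Return `url` with a scheme, adding `default_scheme` only if one is missing."""
    url = url.strip()
    # Protocol-relative URL such as //host/path
    if url.startswith("//"):
        return f"{default_scheme}:{url}"
    # Already a full http(s) URL: leave it untouched
    if urlsplit(url).scheme in ("http", "https"):
        return url
    return f"{default_scheme}://{url}"


bare = ensure_scheme("www.example.com/rooms/1")
full = ensure_scheme("https://www.example.com/rooms/1")
relative = ensure_scheme("//www.example.com/rooms/1")
print(bare, full, relative)
```

Inside `property_info`, the call would then be `driver.get(ensure_scheme(url))` in place of the string concatenation.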

In [None]:
caves = property_info(listings)
caves
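
The notebook never closes the browser it opened, so the Chrome window and driver process stay around after scraping finishes or fails. Selenium drivers have a `quit()` method for this. One way to make sure it is always called is a small context manager, sketched below. The name `managed_driver` and the `FakeDriver` stand-in are assumptions for illustration, chosen so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver with `make_driver()` and always call `quit()` on exit."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()


# Assumed usage with the notebook's objects (needs Chrome, so commented out):
# with managed_driver(lambda: webdriver.Chrome(service=service)) as driver:
#     driver.get("https://www.airbnb.co.uk/")
#     ...


class FakeDriver:
    """Stand-in with the same `quit()` method, so the sketch runs anywhere."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as d:
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass

print(fake.closed)
```

Because `quit()` sits in a `finally` block, the browser is closed even when a listing page raises an error partway through the loop.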