# An example of how scraping works

The website we want to scrape is RM Sotheby's, one of the biggest car auction sites. Because the site is dynamically generated, the BeautifulSoup library alone is not enough: we have to use Selenium to load the page via a webdriver. For a general introduction to the project, see [ReadMe](README.md).

Imports and constants.

In [1]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver_path = "driver/chromedriver.exe"
page_url = "https://rmsothebys.com/en/search#/?SortBy=Default&SearchTerm=&Category=All%20Categories&IncludeWithdrawnLots=false&Auction=&OfferStatus=Results&AuctionYear=&Model=Model&Make=Make&FeaturedOnly=false&StillForSaleOnly=false&Collection=All%20Lots&WithoutReserveOnly=false&Day=All%20Days&CategoryTag=All%20Motor%20Vehicles&page=1&pageSize=200&ToYear=NaN&FromYear=NaN"

Let's include some webdriver options to make the whole process faster by disabling images (the value 2 in the preference below blocks them). We'll keep the preferences in a dictionary in case we want to add more options later on.

In [2]:
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

Now let's start the webdriver we provided and feed it the URL of the page we want to open. 

In [3]:
driver = webdriver.Chrome(executable_path=driver_path, options=chrome_options)
driver.get(page_url)

Please note that the chromedriver included with this repository works with Chrome version 79. You can download the latest version [here](https://sites.google.com/a/chromium.org/chromedriver/home).

From now on, I will use BeautifulSoup to scrape the content because I have used it before, but Selenium can also be used in the following steps.

In [4]:
html_soup = BeautifulSoup(driver.page_source, "html.parser")  # name the parser explicitly to avoid a warning

Upon entering the site, a popup window asks us to sign up for a newsletter. We can close it by inspecting the close button, copying its XPath, locating the button with Selenium's find_element_by_xpath method, and then clicking on it.


In [5]:
btn_cls = driver.find_element_by_xpath('//*[@id="tailoredEmailModal"]/div/div/button')
btn_cls.click()

While not necessary just to scrape the page, this must be done if we want to click any other buttons on the page.

To scrape with BeautifulSoup, we have to identify the portion of the page we want and where it is located in the HTML code. A quick look at the page structure shows that each search result on the site is nested in a `div` with the `search-result__caption` class. We'll call the list of all those elements a container. Let's access the first element of that container, i.e. the first search result.

In [6]:
result_container = html_soup.find_all('div', class_="search-result__caption")
search_result = result_container[0]

Let's see what the first search result looks like:

In [7]:
search_result

<div class="search-result__caption">
<h5 class="heading-details--bolder ellipsis ng-binding">
                                    ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
                                </h5>
<h5 class="heading-subtitle--bold line ng-binding" ng-show="item.Lot">
                                    Lot 107
                                </h5>
<h5 class="heading-subtitle--bold line ng-hide" ng-hide="item.Lot">
</h5>
<p class="heading-subtitle--bold ng-binding">2017 Jeep Wrangler Custom </p>
<p>
<span class="heading-subtitle--bold ng-binding">Sold For $57,120</span><br/>
<span class="heading-subtitle--bold ng-binding ng-hide" ng-show="item.PreSaleEstimate.length &gt; 0"></span><br/>
<span class="heading-details--bolder ng-binding"></span><br/>
<span class="heading-details--bold ng-binding"></span><br/>
<span class="heading-details--bolder ng-binding ng-hide" ng-show="item.CurrentBid">
                                        Current Bid: <br/>
</span>
<span class="headin

Now we can easily tell where the data we want to collect is. Let's access those elements using the find_all method with the appropriate HTML tags, narrowing the method's output to a single element by indexing with the tag's order of occurrence within the search result.

In [8]:
car_info = search_result.find_all('p')[0]
price = search_result.find_all('span')[0] #price can be found in the first span, on the "0th" position
additional_info = search_result.find_all('span')[3]
auction_type = search_result.find_all('span')[-1] #auction type can be found in the last span found, the "-1st" 
auction_location = search_result.find_all('h5')[0]

print(car_info)

<p class="heading-subtitle--bold ng-binding">2017 Jeep Wrangler Custom </p>
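The positional indexing used above can be tried offline, without a webdriver. The following is a minimal sketch, assuming a trimmed-down, hypothetical HTML snippet that stands in for one search result; the class names and structure are simplified and are not copied from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one search result on the page.
html = """
<div class="search-result__caption">
<h5 class="heading-details--bolder">ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019</h5>
<p class="heading-subtitle--bold">2017 Jeep Wrangler Custom </p>
<p>
<span class="heading-subtitle--bold">Sold For $57,120</span><br/>
<span class="heading-details--bold">RM | ONLINE ONLY</span>
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result, so ordinary indexing applies.
caption = soup.find_all("div", class_="search-result__caption")[0]
spans = caption.find_all("span")

first_span = spans[0]    # first <span> inside the result
last_span = spans[-1]    # last <span> inside the result
first_p = caption.find_all("p")[0]

print(first_span)
print(last_span)
print(first_p)
```

Note that find_all searches nested tags too, which is why the spans inside the second `<p>` are found. Indexing by position is brittle if the page layout changes, so it is worth re-checking the indices whenever the site is updated.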


Now we know how to obtain the values that interest us, but we are still stuck with the surrounding HTML. To get rid of it, we'll use a regular expression. The following expression uses:
* a lookbehind to determine the characters preceding our pattern,
* a capture group to make accessing our pattern easier,
* a lookahead to determine the characters following our pattern.

Because re.search expects the text it searches to be a string, we need to convert our search results first.

In [9]:
pattern = r'(?<=">)\s*(.*)\s*(?=</)'

car_re = re.search(pattern, str(car_info), re.IGNORECASE).group(1)
price_re = re.search(pattern, str(price), re.IGNORECASE).group(1)
additional_info_re = re.search(pattern, str(additional_info), re.IGNORECASE).group(1)
auct_type_re = re.search(pattern, str(auction_type), re.IGNORECASE).group(1)
auct_loc_re = re.search(pattern, str(auction_location), re.IGNORECASE).group(1)

Let's see the results:

In [10]:
print(car_re, price_re, additional_info_re, auct_type_re, auct_loc_re)

2017 Jeep Wrangler Custom  Sold For $57,120  RM | ONLINE ONLY ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019
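To see how the pattern behaves in isolation, here is a small sketch that applies it to hard-coded strings. The sample strings and the `extract_text` helper are assumptions made for illustration; they are not part of the notebook. Unlike the cell above, the helper checks for a failed match before calling `group`, since `re.search` returns None when nothing matches.

```python
import re

# Same pattern as in the notebook: lookbehind, capture group, lookahead.
pattern = r'(?<=">)\s*(.*)\s*(?=</)'

# Hypothetical stand-ins for str(tag); real strings come from the scraped page.
single_line = '<p class="heading-subtitle--bold ng-binding">2017 Jeep Wrangler Custom </p>'
multi_line = '<h5 class="line ng-binding">\n        Lot 107\n    </h5>'


def extract_text(html):
    """Return the text between '">' and '</', or None if the pattern does not match."""
    match = re.search(pattern, html)
    if match is None:
        return None
    # The greedy group can keep trailing spaces, so strip them here.
    return match.group(1).strip()


print(extract_text(single_line))
print(extract_text(multi_line))
print(extract_text('<br/>'))
```

Because `.` does not match a newline by default, the capture group stops at the end of the first text line, and the surrounding `\s*` parts absorb the indentation around it.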


Now, let's close the webdriver.

In [11]:
driver.quit()

# Conclusions 

We have successfully scraped some data using Selenium and BeautifulSoup. The next step would be to automate the whole process for multiple web pages and save the data for further analysis. To see the automation process, click [here](sothebys-scraping-automated.ipynb). To see the data cleaning process, click [here](sothebys-data-cleaning.ipynb).
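As a rough idea of what a first step toward multi-page scraping could look like, the sketch below rewrites the `page` parameter of a search URL. It is only an illustration and not necessarily what the linked notebook does: `base_url` is a shortened, hypothetical stand-in for `page_url`, and `url_for_page` is a helper name invented here.

```python
import re

# Shortened, hypothetical stand-in for page_url; only the page parameter matters here.
base_url = "https://rmsothebys.com/en/search#/?OfferStatus=Results&page=1&pageSize=200"


def url_for_page(url, page_number):
    """Return url with the value of its 'page' parameter replaced by page_number."""
    # The lookbehind requires 'page=' right after '?' or '&', so 'pageSize=' is left alone.
    return re.sub(r'(?<=[?&]page=)\d+', str(page_number), url)


# Build URLs for the first three result pages (the real page count is not known here).
urls = [url_for_page(base_url, n) for n in range(1, 4)]
for u in urls:
    print(u)
```

Each generated URL could then be passed to `driver.get` in a loop, with the parsing steps from this notebook applied to every page.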