# An example of how scraping works

The website we want to scrape is RM Sotheby's, one of the biggest car auction sites. Because the site is dynamically generated, the BeautifulSoup library alone is not enough: we have to use Selenium to load the page through a webdriver. For a general introduction to the project, see [ReadMe]().

Imports and constants.

In [24]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
from bs4 import BeautifulSoup
import re

driver_path = "driver/chromedriver.exe"
page_url = "https://rmsothebys.com/en/search#/?SortBy=Default&SearchTerm=&Category=All%20Categories&IncludeWithdrawnLots=false&Auction=&OfferStatus=Results&AuctionYear=&Model=Model&Make=Make&FeaturedOnly=false&StillForSaleOnly=false&Collection=All%20Lots&WithoutReserveOnly=false&Day=All%20Days&CategoryTag=All%20Motor%20Vehicles&page=1&pageSize=200&ToYear=NaN&FromYear=NaN"

Let's include some webdriver options to make the whole process faster by disabling image loading.

In [25]:
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

Now let's start the webdriver we provided and feed it the URL of the page we want to open. 

In [26]:
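# executable_path is the Selenium 3 form; Selenium 4 takes the driver path through a Service object instead
driver = webdriver.Chrome(executable_path=driver_path, options=chrome_options)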
driver.get(page_url)

From here on I will use BeautifulSoup, simply because I have used it before; Selenium could handle the following steps as well.

In [27]:
html_soup = BeautifulSoup(driver.page_source, "html.parser")  # naming the parser avoids bs4's "no parser specified" warning

In order to scrape with BeautifulSoup, we have to identify the portion of the page we want and where it is located in the HTML code. A quick look at the page structure shows that each search result on the site is nested in a `div` with the `search-result__caption` class. We'll call the list of all those elements a container. Let's access the fifth element of that container, i.e. the fifth search result.

In [28]:
result_container = html_soup.find_all('div', class_="search-result__caption")
search_result = result_container[4]

Let's see what the fifth search result looks like:

In [29]:
search_result

<div class="search-result__caption">
<h5 class="heading-details--bolder ng-binding">
                                    AUBURN SPRING 2019 - LOT 5003
                                </h5>
<p class="heading-subtitle--bold ng-binding">1971 Dnepr KMZ MV-750 </p>
<p>
<span class="heading-subtitle--bold ng-binding">Sold For $3,410</span><br/>
<span class="heading-subtitle--bold ng-binding ng-hide" ng-show="item.PreSaleEstimate.length &gt; 0"></span><br/>
<span class="heading-details--bolder ng-binding"></span><br/>
<span class="heading-details--bolder ng-binding ng-hide" ng-show="item.StyleClass == 'RMOnlineAuctions' &amp;&amp; item.CurrentBid"><br/></span>
<span class="heading-details--bolder ng-binding ng-hide" ng-show="item.StyleClass == 'RMOnlineAuctions' &amp;&amp; item.TimeLeft"><br/></span>
</p>
<span class="heading-details--bolder search-result__bottom-left ng-binding">RM | AUCTIONS</span>
</div>

Now we can easily tell where the data we want to collect is. Let's access those elements with the `find_all` method, passing the appropriate HTML tag and then picking a single element by its order of occurrence within the search result.

In [30]:
car_info = search_result.find_all('p')[0]
price = search_result.find_all('span')[0]  # price is in the first span (index 0)
auction_type = search_result.find_all('span')[-1]  # auction type is in the last span (index -1)
auction_location = search_result.find_all('h5')[0]
print(price)

<span class="heading-subtitle--bold ng-binding">Sold For $3,410</span>
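
Positional indexes such as `[0]` and `[-1]` depend on how many tags of that type the result happens to contain. As a hedged alternative, the sketch below selects the auction type by its `search-result__bottom-left` class instead. It runs on a trimmed, hard-coded copy of the search result printed above, so the names `sample_html` and `sample_result` are illustrative and not part of the notebook.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the search result printed above, so the sketch runs offline.
sample_html = """
<div class="search-result__caption">
<h5 class="heading-details--bolder ng-binding">
    AUBURN SPRING 2019 - LOT 5003
</h5>
<p class="heading-subtitle--bold ng-binding">1971 Dnepr KMZ MV-750 </p>
<p>
<span class="heading-subtitle--bold ng-binding">Sold For $3,410</span><br/>
<span class="heading-details--bolder ng-binding"></span><br/>
</p>
<span class="heading-details--bolder search-result__bottom-left ng-binding">RM | AUCTIONS</span>
</div>
"""

sample_result = BeautifulSoup(sample_html, "html.parser").find(
    "div", class_="search-result__caption"
)

# find() returns the first match, so it is equivalent to find_all(...)[0].
first_span = sample_result.find("span")
# Selecting by class does not depend on how many spans precede the element.
auction_type_tag = sample_result.find("span", class_="search-result__bottom-left")

print(first_span)
print(auction_type_tag)
```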


Now we know how to obtain the values we are interested in, but they are still wrapped in HTML code. To get rid of it, we'll use regular expressions. The following expression uses:
* a lookbehind to determine the characters preceding our pattern,
* a capture group to make accessing our pattern easier,
* a lookahead to determine the characters following our pattern.
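
To make the three parts concrete, here is a minimal sketch that applies such an expression to a hard-coded copy of the price tag printed earlier. The variable names `html` and `match` are only illustrative.

```python
import re

html = '<span class="heading-subtitle--bold ng-binding">Sold For $3,410</span>'

# (?<=">)     lookbehind: the match must be preceded by ">
# \s*(.*)\s*  capture group 1: the text, with optional whitespace around it
# (?=</)      lookahead: the match must be followed by </
pattern = r'(?<=">)\s*(.*)\s*(?=</)'

match = re.search(pattern, html)
print(match.group(1))
```

Neither lookaround consumes characters, so the surrounding `">` and `</` are checked but never become part of the match.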

Because re.search expects the text it searches to be a string, we need to convert our search results with str() first.

In [31]:
pattern = r'(?<=">)\s*(.*)\s*(?=</)'

car_re = re.search(pattern, str(car_info), re.IGNORECASE).group(1)
price_re = re.search(pattern, str(price), re.IGNORECASE).group(1)
auct_type_re = re.search(pattern, str(auction_type), re.IGNORECASE).group(1)
auct_loc_re = re.search(pattern, str(auction_location), re.IGNORECASE).group(1)

Let's see the results:

In [32]:
print(car_re, price_re, auct_type_re, auct_loc_re)

1971 Dnepr KMZ MV-750  Sold For $3,410 RM | AUCTIONS AUBURN SPRING 2019 - LOT 5003
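
The double space between the car name and the price comes from the trailing space inside the `<p>` tag, which the greedy `(.*)` group captures. As an alternative sketch (not what this notebook uses), BeautifulSoup's own `get_text(strip=True)` returns a tag's text with the surrounding whitespace removed, so no regular expression is needed. The HTML strings below are hard-coded copies of two tags from the search result shown earlier.

```python
from bs4 import BeautifulSoup

car_html = '<p class="heading-subtitle--bold ng-binding">1971 Dnepr KMZ MV-750 </p>'
lot_html = """<h5 class="heading-details--bolder ng-binding">
    AUBURN SPRING 2019 - LOT 5003
</h5>"""

car_tag = BeautifulSoup(car_html, "html.parser").p
lot_tag = BeautifulSoup(lot_html, "html.parser").h5

# strip=True trims leading and trailing whitespace, including newlines.
car_text = car_tag.get_text(strip=True)
lot_text = lot_tag.get_text(strip=True)
print(repr(car_text), repr(lot_text))
```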


Now, let's close the webdriver.

In [33]:
driver.quit()

# Conclusions 

We have successfully scraped some data using Selenium and BeautifulSoup. The next step would be to automate the whole process for multiple web pages and save the data, enabling further analysis. To see the automation process, click [here](). To see the data cleaning process, click [here]().
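
As a rough illustration of what covering several pages could involve (this is not the project's actual automation code), the sketch below builds one URL per results page by rewriting the `page=` parameter. It assumes that changing this parameter is enough to reach the next page of results, and it uses a shortened version of `page_url` for readability. `base_url` and `url_for_page` are hypothetical names.

```python
import re

# Shortened form of page_url, kept short for readability (assumption: the
# other query parameters can be passed through unchanged).
base_url = "https://rmsothebys.com/en/search#/?OfferStatus=Results&page=1&pageSize=200"


def url_for_page(url, page_number):
    """Return url with its page=<n> parameter set to page_number."""
    # The lookbehind keeps "page=" in place and leaves "pageSize=" untouched.
    return re.sub(r"(?<=[?&]page=)\d+", str(page_number), url)


page_urls = [url_for_page(base_url, n) for n in range(1, 4)]
for url in page_urls:
    print(url)
```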