# Automating the scraping process

There are two ways to go about this:
1. Get the search results, append them to one of the tables, construct a DataFrame object, convert the columns to strings and then apply regular expressions to the DataFrame using vectorised methods, or
2. Get the search results, convert them to strings, apply regular expressions, then append the cleaned values to one of the tables and construct a DataFrame object.

Common sense suggests that the first approach would be better, but keep in mind that the raw scrape results are long HTML tags, which makes everything hard to read and potential errors even harder to spot. This notebook therefore follows the second approach. Measuring the performance difference between the two approaches is outside the scope of this project, but it can easily be done with, for example, the `timeit` module.
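
As a rough sketch of how such a comparison could be set up with `timeit`, the snippet below runs both approaches on made-up HTML tags. The sample tags, the function names and the repeat count are assumptions for illustration only, and the timings will differ from machine to machine.

```python
import re
import timeit

import pandas as pd

# Same pattern as the one used in the scraping cell of this notebook.
PATTERN = r'(?<=">)\s*(.*)\s*(?=</)'

# Made-up tags standing in for raw scrape results (an assumption, not real data).
raw_tags = ['<p class="x">1989 Ferrari Testarossa</p>'] * 10_000


def frame_then_vectorised(tags):
    """Approach 1: build the DataFrame from raw strings, then use .str.extract."""
    frame = pd.DataFrame({"car_info": tags})
    frame["car_info"] = frame["car_info"].str.extract(PATTERN, expand=False)
    return frame


def regex_then_frame(tags):
    """Approach 2: clean every string with re first, then build the DataFrame."""
    cleaned = [re.search(PATTERN, tag).group(1) for tag in tags]
    return pd.DataFrame({"car_info": cleaned})


for func in (frame_then_vectorised, regex_then_frame):
    seconds = timeit.timeit(lambda: func(raw_tags), number=5)
    print(f"{func.__name__}: {seconds:.3f} s for 5 runs")
```

Both functions return the same cleaned column, so any difference in the printed times comes from where the regular expression is applied.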

Imports and constants.

In [1]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import re
import time

export_path = "data/sothebys_scraped.csv"
driver_path = "driver/chromedriver.exe"
page_url = "https://rmsothebys.com/en/search#/?SortBy=Default&SearchTerm=&Category=All%20Categories&IncludeWithdrawnLots=false&Auction=&OfferStatus=Results&AuctionYear=&Model=Model&Make=Make&FeaturedOnly=false&StillForSaleOnly=false&Collection=All%20Lots&WithoutReserveOnly=false&Day=All%20Days&CategoryTag=All%20Motor%20Vehicles&page=1&pageSize=200&ToYear=NaN&FromYear=NaN"
num_of_pages = 1206
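
The search settings, including `page` and `pageSize`, live in the part of `page_url` after the `#`. A minimal sketch of reading them back with the standard library is shown below; the URL is a shortened copy of the constant above, kept to a few parameters for readability.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened copy of page_url, keeping only a few of its parameters.
page_url = ("https://rmsothebys.com/en/search#/?SortBy=Default&OfferStatus=Results"
            "&page=1&pageSize=200")

# Everything after '#' is the fragment; the site keeps its query string inside it.
fragment = urlsplit(page_url).fragment
query = fragment.split("?", 1)[1]
params = parse_qs(query)

print(params["page"], params["pageSize"])
```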

Disabling image loading to make the whole process faster.

In [2]:
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)
# chrome_options.add_argument('--ignore-certificate-errors')
# chrome_options.add_argument('--incognito')
# chrome_options.add_argument('--headless')

Creating tables that will hold our data.

In [23]:
car_info_table = []
price_table = []
additional_info_table = []
auction_type_table = []
auction_location_table = []

The cell below automates the scraping of each results page. For more detail and a step-by-step guide to scraping the elements of a page, go [here](sothebys-scraping-example.ipynb).
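
To make the regular expression in the cell below easier to follow, here is a small sketch of what it captures. The HTML snippets are hypothetical, shaped like scraped tags rather than copied from the site, and `extract_text` is an illustrative helper that is not used in the scraping loop itself.

```python
import re

# Same pattern as in the scraping cell.
pattern = r'(?<=">)\s*(.*)\s*(?=</)'


def extract_text(tag_html, default=None):
    """Return the text captured by `pattern`, or `default` when nothing matches."""
    match = re.search(pattern, tag_html)
    return match.group(1) if match else default


# Text on its own line: the leading whitespace and the line break are skipped.
print(repr(extract_text('<p class="title">\n    1989 Ferrari Testarossa\n</p>')))
# Text on the same line as the tag.
print(repr(extract_text('<span class="price">Sold For $57,120</span>')))
# An empty tag still matches and yields an empty string.
print(repr(extract_text('<span class="note"></span>')))
# A tag without attributes has no '">' before its text, so nothing matches.
print(repr(extract_text('<h5>ABU DHABI 2019</h5>')))
```

The pattern relies on the opening tag ending in `">`, that is, on the tag having at least one quoted attribute. Calling `.group(1)` directly on a failed `re.search` raises an `AttributeError`, which is why the helper checks the match first.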

In [24]:
#Pattern we will look for in HTML code
pattern = r'(?<=">)\s*(.*)\s*(?=</)'

#Running the webdriver and loading our page
driver = webdriver.Chrome(executable_path=driver_path, options=chrome_options)
driver.get(page_url)

#Clicking the pop-up window when visiting the site for the first time
driver.find_element_by_xpath('//*[@id="tailoredEmailModal"]/div/div/button').click()

for page_number in range(num_of_pages):
    #Extracting page source from the driver
    current_page_source = driver.find_element_by_xpath("//*").get_attribute("outerHTML")
    #Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
    html_page = BeautifulSoup(current_page_source, "html.parser")
    result_container = html_page.find_all('div', class_="search-result__caption")

    for search_result in result_container:
        #All <span> tags of the current result, looked up once
        spans = search_result.find_all('span')

        car_info = search_result.find_all('p')[0]
        price = spans[0]
        additional_info = spans[3]
        auction_type = spans[-1]
        auction_location = search_result.find_all('h5')[0]

        car_re = re.search(pattern, str(car_info), re.IGNORECASE).group(1)
        price_re = re.search(pattern, str(price), re.IGNORECASE).group(1)
        additional_info_re = re.search(pattern, str(additional_info), re.IGNORECASE).group(1)
        auct_type_re = re.search(pattern, str(auction_type), re.IGNORECASE).group(1)
        auct_loc_re = re.search(pattern, str(auction_location), re.IGNORECASE).group(1)

        car_info_table.append(car_re)
        price_table.append(price_re)
        additional_info_table.append(additional_info_re)
        auction_type_table.append(auct_type_re)
        auction_location_table.append(auct_loc_re)

    #Clicking the next page button, unless this was the last page
    if page_number < num_of_pages - 1:
        driver.find_element_by_css_selector("a[ng-click='vm.setPage(vm.pager.currentPage + 1)']").click()
        #Delaying the next iteration by one second, it is needed for the page's source code to fully load
        time.sleep(1)

driver.quit()
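
The fixed `time.sleep(1)` in the loop is a guess at how long the next page needs to load. A more defensive option is to poll until some condition holds. Selenium provides `WebDriverWait` for this; the helper below is only a library-free sketch of the same idea, and `wait_until`, its parameters and the stand-in condition are hypothetical names that are not part of this notebook.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.25):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "the new page has loaded": becomes true on the third check.
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), timeout=2.0, poll_interval=0.01))
```

In the scraping loop, the condition could for instance compare the first search result before and after the click. That particular check is an assumption and has not been tried against the site.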

Let's combine the tables into a DataFrame object in preparation for further analysis.

In [25]:
sothebys_df = pd.DataFrame({"car_info": car_info_table, "price": price_table,
                            "additional_info": additional_info_table, "auction_type": auction_type_table, 
                            "auction_location": auction_location_table})

In [26]:
sothebys_df.head(50)

| | car_info | price | additional_info | auction_type | auction_location |
|---|---|---|---|---|---|
| 0 | 2017 Jeep Wrangler Custom | Sold For $57,120 | | RM \| ONLINE ONLY | ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019 |
| 1 | 1966 Austin-Healey 3000 Mk III BJ8 | Sold For $58,240 | | RM \| ONLINE ONLY | ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019 |
| 2 | 1989 Ferrari Testarossa | Sold After Auction | | RM \| ONLINE ONLY | ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019 |
| 3 | 2018 Audi SQ5 | Sold For $42,560 | | RM \| ONLINE ONLY | ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019 |
| 4 | 1960 Austin-Healey 3000 Mk I BN7 | Sold For $40,320 | | RM \| ONLINE ONLY | ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019 |
| 5 | 2006 Ford GT | Sold After Auction | | RM \| ONLINE ONLY | ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019 |
| 6 | 1967 Austin Mini Moke | Sold For $50,400 | | RM \| ONLINE ONLY | ONLINE ONLY: DRIVE INTO THE HOLIDAYS 2019 |
| 7 | 2009 Mercedes-Benz SL 65 AMG Black Series | Sold For $161,000 | | RM \| SOTHEBY'S | ABU DHABI 2019 |
| 8 | 2011 Porsche 911 Speedster | $300,000 - $350,000 | | RM \| SOTHEBY'S | ABU DHABI 2019 |
| 9 | 1973 Ferrari 365 GTB/4 Daytona Berlinetta by S... | Sold For $484,375 | | RM \| SOTHEBY'S | ABU DHABI 2019 |


In [28]:
sothebys_df.shape

(200, 5)

The last step is to save the data as a CSV file.

In [27]:
#to_csv returns None when given a path, so there is nothing to assign
sothebys_df.to_csv(export_path, index=False, header=True)
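
The `index` argument decides whether the row labels are written to the file. As a small illustration on a toy frame (two values borrowed from the output above, everything else assumed), writing the index and reading the CSV back adds an extra column that pandas names `Unnamed: 0`, while leaving the index out round-trips cleanly.

```python
import io

import pandas as pd

# Toy frame; the real sothebys_df has five columns.
frame = pd.DataFrame({"car_info": ["2006 Ford GT"], "price": ["Sold After Auction"]})

csv_with_index = frame.to_csv()                # default index=True
csv_without_index = frame.to_csv(index=False)  # row labels left out

print(pd.read_csv(io.StringIO(csv_with_index)).columns.tolist())
print(pd.read_csv(io.StringIO(csv_without_index)).columns.tolist())
```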

# Conclusions

We have successfully scraped nearly 245,000 data points from over 1,200 web pages. The combined dataset has more than 48,000 rows across 5 columns and is ready for data cleaning and feature engineering. To see the scraping process in detail, click [here](sothebys-scraping-example.ipynb). To see the data cleaning process, click [here](sothebys-data-cleaning.ipynb).