# Automating the scraping process

There are two ways to go about this:
1. Get the search results, append them to one of the tables, construct a DataFrame object, convert the columns to strings, and then apply regular expressions to the DataFrame using vectorised methods, or
2. Get the search results, convert them to strings, apply regular expressions, then append them to one of the tables and construct a DataFrame object.

Common sense suggests that the first approach would be better, but keep in mind that the raw scrape results are long HTML tags, which makes everything hard to read and potential errors even harder to spot. This notebook therefore follows the second approach. Measuring the performance difference between the two approaches is outside the scope of the project, but it can easily be done with, for example, the timeit module.
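
As a rough illustration of how such a measurement could look, here is a minimal sketch that times both approaches on made-up tag strings. The sample markup, `raw_tags` and the two function names are assumptions for the example, not part of the scraper; the pattern is the one used later in this notebook.

```python
import re
import timeit

import pandas as pd

# Same pattern as in the scraping loop later in this notebook.
PATTERN = r'(?<=">)\s*(.*)\s*(?=</)'

# Hypothetical stand-ins for raw scrape results (str() of a BeautifulSoup tag).
raw_tags = [
    '<p class="car">1967 Ferrari 275 GTB/4</p>',
    '<span class="price">$3,000,000 USD</span>',
] * 2000


def regex_then_collect(tags):
    """Approach 2: apply the regex to each string, then collect the results."""
    return [re.search(PATTERN, tag).group(1) for tag in tags]


def collect_then_vectorise(tags):
    """Approach 1: build a Series first, then extract with a vectorised method."""
    # dtype=object keeps plain Python strings, so extraction uses the `re` module.
    series = pd.Series(tags, dtype=object)
    return series.str.extract(PATTERN, expand=False).tolist()


loop_time = timeit.timeit(lambda: regex_then_collect(raw_tags), number=5)
vector_time = timeit.timeit(lambda: collect_then_vectorise(raw_tags), number=5)
print(f"regex per item:     {loop_time:.3f} s")
print(f"Series.str.extract: {vector_time:.3f} s")
```

The timings depend on the machine, the pandas version and the shape of the real data, so the numbers are only meaningful when measured on the actual scrape results.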

Imports and constants.

In [1]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import re

export_path = "data/sothebys_scraped.csv"
driver_path = "driver/geckodriver.exe"
page_url = "https://rmsothebys.com/en/search#/?SortBy=Default&SearchTerm=&Category=All%20Categories&IncludeWithdrawnLots=false&Auction=&OfferStatus=Results&AuctionYear=&Model=Model&Make=Make&FeaturedOnly=false&StillForSaleOnly=false&Collection=All%20Lots&WithoutReserveOnly=false&Day=All%20Days&CategoryTag=All%20Motor%20Vehicles&page=1&pageSize=200&ToYear=NaN&FromYear=NaN"

Creating the tables (plain Python lists) that will hold our data.

In [2]:
car_info_table = []
price_table = []
auction_type_table = []
auction_location_table = []

Automating the process of scraping each of the web pages. For more detail and a step-by-step guide to scraping the elements of a web page, go [here]().

In [4]:
num_of_pages = 5
# Captures the text between the `">` that closes an opening tag and the next `</`.
pattern = r'(?<=">)\s*(.*)\s*(?=</)'

for page_number in range(num_of_pages):
    page = "https://rmsothebys.com/en/search#/?SortBy=Default&SearchTerm=&Category=All%20Categories&IncludeWithdrawnLots=false&Auction=&OfferStatus=Results&AuctionYear=&Model=Model&Make=Make&FeaturedOnly=false&StillForSaleOnly=false&Collection=All%20Lots&WithoutReserveOnly=false&Day=All%20Days&CategoryTag=All%20Motor%20Vehicles&page={}&pageSize=200&ToYear=NaN&FromYear=NaN".format(page_number+1)
    # Selenium 4 deprecated `executable_path`; recent releases expect the driver path via a Service object.
    driver = webdriver.Firefox(executable_path=driver_path)
    try:
        driver.get(page)
        # Name the parser explicitly so the result does not depend on which parsers are installed.
        html_page = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()  # close the browser window even if loading the page fails
    result_container = html_page.find_all('div', class_="search-result__caption")

    for search_result in result_container:  # one caption block per car
        car_info = search_result.find_all('p')[0]
        price = search_result.find_all('span')[0]  # price is in the first span
        auction_type = search_result.find_all('span')[-1]  # auction type is in the last span
        auction_location = search_result.find_all('h5')[0]

        # Convert each tag to a string and keep only the text captured by the pattern.
        car_re = re.search(pattern, str(car_info)).group(1)
        price_re = re.search(pattern, str(price)).group(1)
        auct_type_re = re.search(pattern, str(auction_type)).group(1)
        auct_loc_re = re.search(pattern, str(auction_location)).group(1)

        car_info_table.append(car_re)
        price_table.append(price_re)
        auction_type_table.append(auct_type_re)
        auction_location_table.append(auct_loc_re)

In [5]:
len(car_info_table)

200
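
The loop above calls `.group(1)` directly on the result of `re.search`, which raises an `AttributeError` whenever the pattern does not match. The sketch below shows how the pattern behaves on a few made-up tag strings and how a small helper could return a default instead. The helper name `extract_text` and the sample markup are assumptions for illustration, not the site's actual HTML.

```python
import re

# Same pattern as in the scraping loop above.
PATTERN = r'(?<=">)\s*(.*)\s*(?=</)'


def extract_text(tag_html, default=None):
    """Return the text captured by PATTERN, or `default` when nothing matches."""
    match = re.search(PATTERN, tag_html)
    return match.group(1) if match else default


# Hypothetical tag strings, shaped like str() of a BeautifulSoup tag.
print(extract_text('<span class="price">$1,050,000 USD</span>'))  # $1,050,000 USD
print(extract_text('<h5 class="loc">\n   Monterey\n</h5>'))        # Monterey
print(extract_text('<p>1955 Jaguar D-Type</p>'))                   # None
```

The last call returns `None` because the lookbehind requires the opening tag to end in `">`, that is, to carry at least one quoted attribute. If a tag without attributes can appear in the results, reading the text with BeautifulSoup's `get_text(strip=True)` may be a simpler alternative to the regular expression.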

Let's save the tables as a DataFrame object, preparing for further analysis.

In [6]:
sothebys_df = pd.DataFrame({"car_info": car_info_table, "price": price_table, 
                            "auction_type": auction_type_table, "auction_location": auction_location_table})

The last step is to save the data as a CSV file.

In [7]:
# to_csv returns None when given a path, so there is nothing useful to assign.
sothebys_df.to_csv(export_path, index=False, header=True)
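
A quick way to check an export like this is to read the file back and compare it with what was written. The sketch below does that with two made-up rows and a temporary file, so it does not touch the real export; `sample_df`, `csv_path` and the row values are assumptions for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows with the same four columns as the scraped DataFrame.
sample_df = pd.DataFrame({
    "car_info": ["1967 Ferrari 275 GTB/4", "1955 Jaguar D-Type"],
    "price": ["$3,000,000 USD", "$1,050,000 USD"],
    "auction_type": ["Sold", "Sold"],
    "auction_location": ["Monterey", "Arizona"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sothebys_sample.csv")
    sample_df.to_csv(csv_path, index=False)  # index=False leaves out the row index
    reloaded_df = pd.read_csv(csv_path)

print(reloaded_df.shape)
print(list(reloaded_df.columns))
```

The price strings contain commas, so `to_csv` writes them in quotes and `read_csv` returns them unchanged as single values.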

# Conclusions

We have successfully scraped nearly 200,000 data points from over 1,100 web pages. The dataset combining all the data has almost 50,000 rows across 4 columns and is ready for the data cleaning process. To see the scraping process in detail, click [here](). To see the data cleaning process, click [here]().