## **What is commercial web scraping and why is it valuable?**

Web scraping definition:
https://en.wikipedia.org/wiki/Web_scraping

A large share of web scraping targets commercial webpages in order to collect data for market studies in areas such as e-commerce, real estate or service offerings.

Studying this data (prices, locations, quantities, providers, etc.) has a lot of commercial value and can yield business insights, especially when real-time data and analytics are needed.

Web-scraping bots combined with data storage and real-time analytics can provide your business with automated insights on the prices, quantities, locations and overall state of the market your business is competing in.

Imagine having a completely automated tool that runs 24/7 collecting, storing and analyzing valuable data for your business.

Tools such as Python, Cloud Computing and NoSQL databases have provided this field with huge potential.

## **What is a commercial website, what is a catalog of offers and how can we extract value from its structure?**

The definition of a commercial website is as simple as its name:

It's a [website](https://en.wikipedia.org/wiki/Website) dedicated to [e-commerce](https://en.wikipedia.org/wiki/E-commerce), or to traditional commerce and trade that also hosts its offers on the Internet as a way of reaching more clients.

As an example:

[Amazon Website](https://www.amazon.es/)

Commercial webpages share one characteristic: all of them have a catalog where they group their resources.

Inside this catalog we find the URLs that lead to the offers of each category.

As an example, at the top left of the Amazon website linked above there is a dropdown menu (a displayable) named "All". If we click on it, we are shown the different categories of products offered by Amazon, and each category has its own URL.

Once we access a specific category, we are shown a list of products and their characteristics. The bot collects the desired data and then moves to the next page of products in the category, or simply scrolls down until there are no more products, depending on the structure of the website.

This is done for each page of each category until there are no more products.

When we combine different sources of data (websites), we get an overall real-time view of a specific market that could not be achieved otherwise.

This repetitive structure of commercial websites allows us to create a code base that remains the same in most cases.

We know that we are going to get the URLs of every category, then access each category and walk through its catalog. This has to be done for every commercial website.

Given this structure, the only thing we have to change between websites is the name of the HTML elements where the desired data is located. The rest of the methodology and code can remain unchanged.

This gives great versatility and makes it cheap to scrape many different domains, providing us with a large amount of real-time, real-market data to store and analyse.

This tutorial focuses only on the collection methodology and the fixed structure used to scrape a commercial website.

## **Use-case and creation of the process of collection**

### **Particular use-case 1**

In our example we are going to scrape Fotocasa, a Spanish portal containing real estate data for Spain. It's basically a portal where owners advertise the properties they are selling.

In this case our target is data on properties for sale, not for rent.

The first step of any scraping process is to decide on the starting point.

The starting point should always be a URL that contains the catalog of the products/services we are interested in.

It might be a URL entirely dedicated to the catalog, or it could be any URL of the domain where there is a displayable giving access to the catalog.

In our example, all URLs of the section dedicated to selling properties offer a displayable with the catalog. We start at the main one, but it could be any other URL in that section.

Firstly we'll access the url: https://www.fotocasa.es/es/comprar/viviendas/espana/todas-las-zonas/l

If we access this URL, at the top left we can see a displayable named 'Provincia' (meaning province in Spanish). If we click on it, it displays a list of all the provinces in Spain. Each province is a link whose href attribute contains the URL with the offers for that province.

Our bot is going to access that URL, click on the displayable, scroll down to the bottom of the displayable so all provinces are shown, and capture the URL of each province.

At this point we have completed the first step of any commercial web scraping bot: we have all the root URLs that we are going to scrape.

As an example: https://www.fotocasa.es/es/comprar/viviendas/asturias-provincia/todas-las-zonas/l

Contains the offers for one specific province.

We are going to access it, scroll down the page and get the data of each property offered. Once we are at the bottom and there are no more advertisements, we capture the next URL of the catalog for that province (equivalent to pressing the next button to display the next page) and repeat until there are no more advertisements for the given province. We'll do this for each province.
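
The page-by-page loop described above can be sketched independently of any website. This is a minimal illustration, not the tutorial's actual code: `crawl_category` and `fetch_page` are hypothetical names, and `fetch_page` stands in for whatever function loads one catalog page and returns its rows together with the link to the next page.

```python
from typing import Callable, Iterator, Optional, Tuple

# A page fetcher returns (rows found on the page, URL of the next page or None).
PageResult = Tuple[list, Optional[str]]


def crawl_category(start_url: str,
                   fetch_page: Callable[[str], PageResult],
                   max_pages: int) -> Iterator[list]:
    """Yield the rows of each catalog page until pages run out or max_pages is reached."""
    url: Optional[str] = start_url
    pages_done = 0
    while url and pages_done < max_pages:
        rows, next_url = fetch_page(url)
        yield rows
        pages_done += 1
        url = next_url  # None or "" means there is no next page


# Usage with an in-memory stand-in for a real site (made-up data).
fake_site = {
    "/asturias/l": (["flat A", "flat B"], "/asturias/l/2"),
    "/asturias/l/2": (["flat C"], None),
}
pages = list(crawl_category("/asturias/l", fake_site.__getitem__, max_pages=10))
print(pages)  # [['flat A', 'flat B'], ['flat C']]
```

Keeping the loop separate from the fetching logic means only `fetch_page` has to change from one website to another.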

We are capturing:

- Title of the advertisement.
- Price
- Price reduction from initial price.
- Rooms
- Bathrooms
- Meters
- Province

#### **Technical process of collection**

Our bot will be coded in Python.

I am specifically using Selenium, a browser automation framework that lets us interact programmatically with webpages, JavaScript rendering included, as if we were a human using a browser.

I am also using a proxy provider for technical reasons explained in the code section.

You'll need to configure your proxy provider so that you can connect to its services using Python. In my case I whitelisted my IP in the proxy provider's management tools, so I can access my proxies without providing a user or password. This may change between providers.

For these processes to be interesting they must scrape thousands or millions of URLs, which means that the bigger your project is, the bigger your computing needs will be.

You'll need multiprocessing in order to be time and resource efficient; a multiprocessing example is included later in this tutorial. In my case, given the scale of this example, my 12 CPUs were enough: each CPU scrapes one URL, so I'm able to run 12 URL scrapes at the same time.

This could be increased using a cloud provider, where you can get virtually unlimited computing capacity and CPUs. The code could also be tweaked to be even more efficient by running multiple threads inside each process, so the scale these bots can reach is quite high, around tens of millions of URLs.

Our data will be stored in a CSV file, but this step could easily be an insert into a relational database. Don't forget that another advantage of this data collection is that our data is structured: we get exactly the fields we want, they can easily be modeled as relations, and the fields have fixed data types.
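
To illustrate what "fixed data types" can look like in practice, here is a small sketch that turns scraped strings into a typed record. It is not part of the original code: the `Listing` class and `first_int` helper are hypothetical, and the raw formats used ("250.000 €", "3 habs.", "85 m²") are assumptions about how the site might render its values.

```python
import re
from dataclasses import dataclass
from typing import Optional


def first_int(text: str) -> Optional[int]:
    """Return the first integer in text, treating '.' as a thousands separator."""
    match = re.search(r"\d[\d.]*", text)
    if not match:
        return None
    return int(match.group().replace(".", ""))


@dataclass
class Listing:
    title: str
    price_eur: Optional[int]
    rooms: Optional[int]
    bathrooms: Optional[int]
    surface_m2: Optional[int]
    province: str

    @classmethod
    def from_raw(cls, title, price, rooms, bathrooms, meters, province):
        # Missing values (for example "-") become None instead of a string.
        return cls(title.strip(), first_int(price), first_int(rooms),
                   first_int(bathrooms), first_int(meters), province)


listing = Listing.from_raw("Piso en Reus", "250.000 €", "3 habs.", "-", "85 m²", "tarragona")
print(listing)
```

With typed fields like these, the records map directly onto the columns of a relational table.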

Basically we are getting valuable structured data from unstructured public data.

The data collection in this example has no commercial purpose. It is done solely as an example, and only a small section of the domain was scraped, while being respectful of the server and without causing any problems to the host. This tutorial is also free of charge.

#### **Code structure and explanation**

Import all necessary libraries and modules.

Mainly:

- Selenium, in order to render the JavaScript and be able to scroll down in the target webpages. We also use some utilities for waiting while interacting with the webpage, so we seem more human and give the page time to render. We also import settings related to user agents and connections, which are used to make our scraper look like a human.

- BeautifulSoup will be our HTML parser, so we can get the data we are interested in from the whole HTML of the target webpage.

- Random, re, requests, time, csv, datetime and os as general utilities.

- Webdriver managers, which provide the specific drivers we want our Selenium engine to run on. Mainly Chrome.


In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import random
import re
import requests
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import csv
from datetime import datetime
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.chrome.options import Options as ChromeOptions
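# Alias the Firefox service so it is not overwritten by the Chrome one imported next
from selenium.webdriver.firefox.service import Service as FirefoxService
from selenium.webdriver.chrome.service import Service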
import os

Define a list of proxies. If you scrape from your own IP, the server will block you after 3 or 4 automated requests. To avoid this, I use a proxy provider to get a list of residential proxies, so each connection comes from a different IP and therefore a different location. The server doesn't block my scraper because it sees it as a human user.

My proxy provider is a well-known one: https://smartproxy.com/

In this case I only have to specify the proxies because my IP is whitelisted with my proxy provider: when the provider's server detects that my IP is the one trying to connect, it allows the connection. If you choose another proxy provider this might work differently.

In my case those are residential proxies from Spain, high-quality IPs that won't make you look like a robot.

In [None]:
proxy_list = [('es.smartproxy.com',10001,"CHROME"),
              ('es.smartproxy.com',10002,"CHROME"),
              ('es.smartproxy.com',10003,"CHROME"),
              ('es.smartproxy.com',10004,"CHROME"),
              ('es.smartproxy.com',10005,"CHROME"),
              ('es.smartproxy.com',10006,"CHROME"),
              ('es.smartproxy.com',10007,"CHROME"),
              ('es.smartproxy.com',10008,"CHROME"),
              ('es.smartproxy.com',10009,"CHROME"),
              ('es.smartproxy.com',10010,"CHROME")]
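
As an alternative to picking a proxy at random, the list can be walked in round-robin order so that consecutive connections never reuse the same entry. This is only a sketch: `next_proxy` is a hypothetical helper and the hostnames are placeholders with the same tuple shape as proxy_list.

```python
import itertools

# Same shape as proxy_list above: (hostname, port, driver name). Placeholder values.
proxies = [("proxy.example.com", 10001, "CHROME"),
           ("proxy.example.com", 10002, "CHROME"),
           ("proxy.example.com", 10003, "CHROME")]

# cycle() repeats the list endlessly, one element at a time.
_rotation = itertools.cycle(proxies)


def next_proxy():
    """Return the next proxy in round-robin order."""
    return next(_rotation)


ports = [next_proxy()[1] for _ in range(4)]
print(ports)  # [10001, 10002, 10003, 10001]
```

Note that with multiprocessing each worker process would hold its own copy of the iterator, so the rotation is only guaranteed within one process.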

We are going to functionally decompose the whole process, meaning that for each major step we define a function that does specifically that step. In case of errors, troubleshooting is easier because we know in which part of the process the error was produced and we can go straight to the function that handles it and fix the code.

The smartproxy function sets the options of our webdriver that will be used for the scraping.

We basically enable the rendering of JavaScript, so that JavaScript-dependent elements, such as images or content loaded while scrolling down, are displayed and we are able to get data from them.

We also set a human-looking user agent. If you do not specify one, the browser sends its own default, which in headless mode typically contains HeadlessChrome. That tells the server that we are a bot instead of a human, and the server may then reject our connection. By faking a human user agent we are able to seem human to the server.

Finally, we set the proxy options (HOSTNAME, PORT, DRIVER) through which the scraper connects to the internet.

In [None]:
def smartproxy(HOSTNAME, PORT, DRIVER):
    """
    Sets up options for a Chrome driver such as user-agent, enabled javascript or proxy details.

    Parameters:
      - HOSTNAME: Name of our proxy direction
      - PORT: Number of port through which we make the connection to the proxy.
      - DRIVER: Driver we are going to use(Chrome, Firefox, Safari)

    Returns:
      The options object already configured ready to be applied to a driver.
    """
    # Instantiate an ChromeOptions object (We are going to use Chrome driver, others could be used)
    options = ChromeOptions()
    # Activate javascript rendering
    options.add_argument("--enable-javascript")
    # Create a fake human user-agent
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
    # Specifies the mode of the Selenium engine, if headless is selected as I do in the following code line, then while scraping our browser doesn't pop up, in case of experimentation or solving errors, comment the line so
    # you are able to see what the scraper is doing visually.
    options.add_argument("--headless")
    # Disable blink features
    options.add_argument("--disable-blink-features=AutomationControlled")
    # Configure the driver to use our proxy details to connect to the internet
    proxy_str = '{hostname}:{port}'.format(hostname=HOSTNAME, port=PORT)
    options.add_argument('--proxy-server={}'.format(proxy_str))
    # We are not interested in images, so for efficiency reasons I disable the loading of images
    options.add_argument("--blink-settings=imagesEnabled=false")

    return options

The next function, webdriver_instance, creates the driver we are going to use with Selenium and applies to it the options returned by the smartproxy function, which is called inside webdriver_instance.

First, we pick a random proxy from the proxy_list defined previously and get its hostname, port and driver.

Then, we instantiate a Chrome driver and set its options to the ones returned by the smartproxy function above.

We are now able to create a driver with custom options.

In [None]:
def webdriver_instance():
    """
    Creates an instance and configures the options of a driver ready for Selenium to be used.
    """
    # Get a random value from the list of proxies.
    hostname, port, driver = random.choice(proxy_list)
    # Instantiate a Chrome driver and configure the options using the previously defined smartproxy function.
    browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                                   options=smartproxy(hostname, port, driver))

    return browser

Now, we are going to define a function that opens our starting URL, searches for the displayable, clicks on it so the provinces are rendered, and gets all the URLs that lead to each of the provinces.

The function returns a list with the URLs that we need to scrape in order to get all the data from the domain.

We need to do this for every commercial domain we want to scrape. There might be small differences between commercial domains, for example:

  - Maybe the domain you are interested in has a page dedicated specifically to the catalog. Then just access it; you won't need to open any displayable, and you can get the URLs directly from the href attributes in the HTML code.

  - If the domain you are interested in also has a displayable, change the HTML container where the displayable is located to the one used by your website, and also change the anchor tags where the category URLs are located.
  As an example, you would have to change this line:
  dropdown = driver.find_element(By.CLASS_NAME, "sui-MoleculeSelectPopover-select")
  Replace the class name where my catalog is contained, "sui-MoleculeSelectPopover-select", with your actual one.

  And then change this line:
  link_items = soup.find_all('a', class_="sui-LinkBasic")
  It gets the URLs of the categories. Replace class_="sui-LinkBasic" with the class of the anchors where your URLs are located.

In [None]:
def get_categories():
    """
    Enters a webpage and gets the urls of all categories, this one clicks a displayable and retrieves the code after
    having clicked, as explained maybe your website has not a displayable and lets you access the urls of categories directly,
    then you won't need that part.

    Returns:
      List of urls with all the categories we want to scrap.
    """
    # Instantiate and assign to a variable an already options configured driver using the previously defined function.
    driver = webdriver_instance()
    # Assign the catalog url to be scraped to a variable
    url = "https://www.fotocasa.es/es/comprar/viviendas/espana/todas-las-zonas/l"
    # Send a GET request to the url using the driver
    driver.get(url)
    # Sleep, remember that the bot must not perform actions too quickly we want to be seen as humans, so we wait 5 seconds.
    time.sleep(5)
    # Search for the displayable inside the webpage, adapt the class name to the website you want to scrap.
    # In case there is no displayable and the catalog its a page itself, just pass this step and just retrieve the whole code of the page and search for the tags
    # that contain the urls that redirect to each category.
    dropdown = driver.find_element(By.CLASS_NAME, "sui-MoleculeSelectPopover-select")
    # Click on the displayable
    dropdown.click()
    # Give the displayable a moment to render its options before reading the page
    time.sleep(2)
    # Get all the html code of the webpage, including the displayable.
    displayable_html = driver.page_source
    # Parse the html
    soup = BeautifulSoup(displayable_html, 'html.parser')
    # Search for the specific tags that contain the urls for each category.
    link_items = soup.find_all('a', class_="sui-LinkBasic")
    # Get the href (attribute that contains the url the link points to) of each tag, skipping tags without one.
    hrefs = [link.get('href') for link in link_items if link.get('href')]
    for href in hrefs:
        print(href)
    # Sleep
    time.sleep(5)
    driver.quit()
    # Get the urls
    return hrefs

The function above gets all the URLs of the catalog. It's very possible that you'll want to filter them based on certain characteristics; for example, you may not want all provinces, or you may want to exclude certain categories.

The following function filters the list of URLs we have retrieved. You can filter using whatever criteria you need; in my case I filter for the provinces I'm interested in.

Just take a look at the list of URLs retrieved by the previous function and exclude the words you are not interested in, or keep only the ones you are interested in.

Adapting this function only takes a little experimentation: retrieve your list of URLs, see what you want and what you don't, adapt the function and check the result. Once that is done, the step is completely automated for that specific website.

In [None]:
def filter_urls(hrefs, provinces):
    """
    Given a list of url and list of words, filters the list so that the urls remaining contain the words we are interested in.
    Also filters for several words we do not want the urls to have.

    Takes:
      - hrefs: list of urls.
      - provinces: list of words, we want our urls to include.

    Returns:
      Desired list of urls.
    """
    search_words = provinces
    # Define your pattern to search based in the words you want to search for.
    pattern = r'\b(?:' + '|'.join(re.escape(word) for word in search_words) + r')\b'
    # Filter the original list
    filtered_strings = [s for s in hrefs if re.search(pattern, s, re.IGNORECASE)]
    # Define an empty list to retrieve the new list of filtered urls.
    result = []
    # Check for words you dont want your urls to have just add and not re.search(r"obra-nueva", url) conditions as follows
    for url in filtered_strings:
        if not re.search(r"particulares", url) and not re.search(r"obra-nueva", url) and re.match(r"^/es/comprar/viviendas/", url):
            result.append(url)
    return result
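
To make the filtering idea concrete, here is a compact, self-contained variant that takes the words to exclude and the required prefix as parameters instead of hard-coding them. It is a sketch, not the function above: `keep_urls` is a hypothetical name and the sample hrefs are made up to resemble the partial URLs of the catalog.

```python
import re


def keep_urls(hrefs, include, exclude, prefix):
    """Keep hrefs that start with prefix, contain an included word and no excluded word."""
    include_re = re.compile(
        r"\b(?:" + "|".join(map(re.escape, include)) + r")\b", re.IGNORECASE
    )
    return [h for h in hrefs
            if h.startswith(prefix)
            and include_re.search(h)
            and not any(word in h for word in exclude)]


# Made-up partial URLs for illustration.
sample_hrefs = [
    "/es/comprar/viviendas/girona-provincia/todas-las-zonas/l",
    "/es/comprar/viviendas/madrid-provincia/todas-las-zonas/l",
    "/es/comprar/viviendas/obra-nueva/girona-provincia/todas-las-zonas/l",
    "/es/alquiler/viviendas/girona-provincia/todas-las-zonas/l",
]
kept = keep_urls(sample_hrefs,
                 include=["girona", "lleida"],
                 exclude=["obra-nueva", "particulares"],
                 prefix="/es/comprar/viviendas/")
print(kept)  # ['/es/comprar/viviendas/girona-provincia/todas-las-zonas/l']
```

Passing the criteria as arguments makes it easier to reuse the same filter for a different website.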

This is our function to store the data.

It requires a list of tuples with our data, the province/category the data belongs to, and a counter that tells us which page of that category's catalog the data belongs to.

It outputs a csv or appends to an existing csv the data we are interested in.

For bigger projects, it would be pretty easy to substitute this function for one that does inserts into a database instead of writing a csv.

In [None]:
def get_csv(data, provincia, counter):
    '''
      This functions writes a list of tuples in a csv.

      Inputs:
              - data: list of tuples.
              - provincia: name of the province to which the data belongs.
              - counter: page of the catalog in integer format.
      Output:
              Fully written the data list of tuples into a csv. Prints output of the execution.
    '''
    current_date = datetime.now().strftime("%Y_%m_%d")
    # Set to a variable the path where we want to write our csv
    csv_file = f"C:\\Users\\Usuario\\PycharmProjects\\PRA1\\fotocasa_{current_date}.csv" # Substitute for yours.

    # Check if file exists
    file_exists = os.path.isfile(csv_file)

    # Write or append to the file if it already exists. UTF-8 keeps accented characters and the euro sign intact.
    with open(csv_file, mode='a' if file_exists else 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)

        if not file_exists:
            # Create the header
            header = ["Title", "Price", "Reduction", "Rooms", "Bathrooms", "Meters", "Province"]
            writer.writerow(header)

        # write data
        writer.writerows(data)
    # Print message for traceability
    print(f'Csv data written successfully {provincia} {counter}')
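
For bigger projects, the CSV writer could be swapped for a database insert. The sketch below uses the standard library's sqlite3 module as a stand-in for whichever database you choose; `save_rows` and the `listings` table are hypothetical names, and the sample row is made up.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert scraped tuples into a 'listings' table, creating it on first use."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listings (
               title TEXT, price TEXT, reduction TEXT, rooms TEXT,
               bathrooms TEXT, meters TEXT, province TEXT)"""
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO listings VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    # Return the total number of stored rows for traceability.
    return conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]


# In-memory database and a made-up row, just to show the call.
conn = sqlite3.connect(":memory:")
rows = [("Piso en Reus", "250.000 €", "-", "3 habs.", "2 baños", "85 m²", "tarragona")]
total = save_rows(conn, rows)
print(total)  # 1
```

The tuples have the same seven fields, in the same order, as the CSV header used above.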

Finally we are going to use all the functions defined above in a main process that will do all the subprocesses.

In [None]:
def main(max_pages):
  # Get all the urls for each category
  provinces = ['tarragona','lleida','barcelona', 'girona']
  hrefs = get_categories()
  # Filter those urls for the ones we are interested in
  urls = filter_urls(hrefs, provinces)
  # Define the base url of the domain. The urls we have retrieved will probably be partial urls, meaning you need to prepend the base domain url to access the actual page.
  # If your website gives you full urls, you don't need to do this. Otherwise you'll need to figure out what you have to add to those partial urls
  # to be able to access the actual urls.
  base_url = "https://www.fotocasa.es"
  # For every partial url, we get the province/category we are going to scrap, you'll need to figure out where in your partial urls this province/category is contained
  # For each of our partial urls we are going to do the following process.
  for i in urls:
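      # Take the first url segment that contains a hyphen (e.g. "girona-provincia") and keep the part before the hyphen.
      # Fall back to 'unknown' so the variable is always defined, even if no segment has a hyphen.
      provincia = next((part.split('-')[0] for part in i.split('/') if '-' in part), 'unknown')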
      # Set up a counter to control how deep into the catalog we go. Each page of a given category increases the counter by 1 until
      # we reach the max_pages parameter, where the process stops. If you want the full catalog, just set
      # a big number as the parameter.
      counter = 0
      # While the counter is below max_pages we keep scraping pages of the catalog (pages 0 to max_pages - 1, so max_pages pages in total).
      while counter < max_pages:
        # If its the starting url for the category, we are simply getting the base_url + the partial url retrieved from the displayable or catalog starting page.
          if counter == 0:
              url = base_url + i
        # If its not the starting url, we are getting the base_url + the partial url obtained from getting the next link of the catalog
          else:
              url = base_url + next_link
          # Time the process
          start_time = time.time()
          # Create a webdriver with the webdriver_instance function defined above
          driver = webdriver_instance()
          # Access the page
          driver.get(url)
          # Sleep, adapt your sleeps so the server doesn't block you but it also doesn't take an eternity to scrap all pages
          time.sleep(5)
          # Once we are on the page we are going to scroll down to show all its content
          scroll_height = 0
          while True:
              driver.execute_script("window.scrollTo(0, arguments[0]);", scroll_height)
              time.sleep(2)
              scroll_height += 1000
              if scroll_height >= driver.execute_script("return document.body.scrollHeight"):
                  break
          # Once we have scrolled the whole page, we are retrieving the page html code
          page_source = driver.page_source
          # Sleep, again adapt to your necessities
          time.sleep(7)
          # Close the driver
          driver.quit()
          # Parse the html code
          soup = BeautifulSoup(page_source, 'html.parser')
          # Get the containers that have each advertisement. That's usually a common structure in comercial websites, offers
          # will be contained in containers of the same class. Get all of these containers, each one is an advertisement.
          info_divs = soup.find_all('div', class_='re-CardPackPremium-info') # Change the name of the class for the specific name of the class of your website
          # If we did not retrieve enough containers, we probably targeted the wrong class, so try another class where they might be.
          if len(info_divs) <= 10:
              info_divs = soup.find_all('div', class_='re-CardPackAdvance-info') # Substitute for your specific class.
          # Create a list to store data
          flats = []
          # For each container get the data we are interested in.
          for info_div in info_divs:
              # Get the title of the advertisement
              # In order for an advertisement to exist it must have a title
              title = info_div.find('span', class_="re-CardTitle re-CardTitle--big").text # Substitute for your specific class inside the container that contains the title.
              # Get the price of the advertisement, targeting the class that contains the data, if it does not exist, mark the price as null
              precio_span = info_div.find('span', class_="re-CardPrice")
              price = precio_span.text if precio_span else "-"
              # Get the reduction of the price, targetting the class that contains the data, if it does not exists, mark the reduction as null
              reduccio_span = info_div.find('span', class_="re-CardPriceReduction")
              reduccio = reduccio_span.text if reduccio_span else "-"
              # Get the rooms of the property, targetting the class that contains the data, if it does not exists, mark the rooms as null
              rooms_span = info_div.find('span', class_="re-CardFeaturesWithIcons-feature-icon re-CardFeaturesWithIcons-feature-icon--rooms")
              rooms = rooms_span.text if rooms_span else "-"
              # Get the bathrooms of the property, targetting the class that contains the data, if it does not exists, mark the bathrooms as null
              bathrooms_span = info_div.find('span', class_="re-CardFeaturesWithIcons-feature-icon re-CardFeaturesWithIcons-feature-icon--bathrooms")
              bathrooms = bathrooms_span.text if bathrooms_span else "-"
              # Get the meters of the property, targetting the class that contains the data, if it does not exists, mark the meters as null
              meters_span = info_div.find('span', class_="re-CardFeaturesWithIcons-feature-icon re-CardFeaturesWithIcons-feature-icon--surface")
              meters = meters_span.text if meters_span else "-"
              # Save this data in a list as a tuple, at the end of the for loop this list will have the tuples with the all the data of the specific page of the catalog of the category
              flats.append((title, price, reduccio, rooms, bathrooms, meters, provincia))
          # Store the data in a csv using the previously defined function
          get_csv(flats,provincia, counter)
          # Get the partial url of the next page in the catalog
          li_tags = soup.find_all('li', class_='sui-MoleculePagination-item') # Target the container that has the url redirecting to next page. usually located at the Next button of the catalog.
          # We assume it does not exists
          href = ''
          # Get the anchor tag containing the next url, extract the href, it might be contained in several containers, we check each one and get the one that is not empty
          for li_tag in li_tags:
              a_tag = li_tag.find('a', class_='sui-AtomButton sui-AtomButton--primary sui-AtomButton--outline sui-AtomButton--center sui-AtomButton--small sui-AtomButton--link sui-AtomButton--empty sui-AtomButton--rounded') # Change for your targeted class
              if a_tag:
                  href = a_tag.get('href')
              else:
                  pass
          # Save the partial url for next iteration
          next_link = href
          # Get finishing time
          end_time = time.time()
          # Get total time of 1 iteration
          execution_time = end_time - start_time
          # Print which page we have scraped
          print(f'Page {counter + 1} of {provincia} scraped')
          # Print how much time it took
          print(f"Execution time: {execution_time:.4f} seconds")
          # Print how many advertisements we have collected data from
          print(len(flats))
          # Raise the counter, so we know we have passed a page in the catalog
          counter += 1
          # Sleep, adapt to your necessities
          time.sleep(20)
          # Stop when we reach the max_pages parameter or when the catalog has no next page, then go on to the next url obtained from the catalog of categories.
          if counter > max_pages or not next_link:
              break
  # Print finish of the process
  print('Successful')

In a nutshell, this code structure will work for all commercial websites with a catalog.

You must customize certain code aspects to meet the requirements of your specific website.

Those customizations can be summarized as:

  - Does the domain have a specific URL for the catalog?
      - If yes, modify the get_categories function to skip the part where we open the displayable.
  - You must also change all the HTML classes, including those of anchors, to match the structure of your target domain. Some functionality might differ, such as getting the next link: maybe your next link is located in a particular container, in which case modify the code to fit your specific needs.
  - Adapt the data extraction to the elements you want to extract: add fields if there are more, modify them if they are different, etc.

Other than that, the structure can be applied to almost all commercial websites in order to scrape their data.
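
One way to keep these per-site customizations in a single place is a small configuration object. This is a sketch under assumptions: `SiteConfig` and its field names are hypothetical, while the class names are the ones used for Fotocasa earlier in this tutorial. It also shows `urllib.parse.urljoin`, which builds the full URL whether the site returns partial or complete links.

```python
from dataclasses import dataclass
from urllib.parse import urljoin


@dataclass(frozen=True)
class SiteConfig:
    base_url: str
    card_class: str       # class of the container holding one offer
    next_link_class: str  # class of the element holding the next-page link

    def absolute(self, href: str) -> str:
        """Turn a partial URL into a full one; full URLs are returned unchanged."""
        return urljoin(self.base_url, href)


fotocasa = SiteConfig(
    base_url="https://www.fotocasa.es",
    card_class="re-CardPackPremium-info",
    next_link_class="sui-MoleculePagination-item",
)

print(fotocasa.absolute("/es/comprar/viviendas/asturias-provincia/todas-las-zonas/l"))
print(fotocasa.absolute("https://www.fotocasa.es/es/"))
```

Adding a new website then means creating one more `SiteConfig` instead of editing class names scattered through the code.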

#### **Getting bigger in scale**

If you are planning to do this on a large scale, you will have to adapt the function for each website.

Maybe you want to monitor 10 or 15 websites. Then adapt your function for each one of them and test it to check that you are extracting the data correctly.

Once you have a function for each web domain you want to scrape, you'll probably need to run those functions with multiprocessing for time and efficiency reasons.

As I said, my computer has 12 CPUs. I always leave at least 1 or 2 CPUs for other purposes, so I can dedicate 10 CPUs to this process and run 10 different web scraping processes simultaneously.

As a code example, it would be:

In [None]:
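import multiprocessing


def run(function):
    # Module-level helper: multiprocessing must pickle the callable it sends to the workers, and lambdas cannot be pickled.
    return function()


if __name__ == "__main__":
    # Create a list of functions (func1 ... func10 are placeholders for the main function of each website, defined at module level)
    functions = [func1, func2, func3, func4, func5, func6, func7, func8, func9, func10]

    # Create a pool of worker processes
    with multiprocessing.Pool() as pool:
        # Use the pool to map the functions to the worker processes
        # This will run the functions in parallel
        pool.map(run, functions)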

We don't need to pass any arguments to the global main function we'll have for each web domain we want to scrape, so we just run all 10 processes at the same time with a multiprocessing pool.

If you want to go one step further and become more efficient, you can also multithread some of the subprocesses that happen inside each main function. Since most of the time is spent waiting for pages to load, this can make the code run considerably faster.
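As a rough illustration of threads inside a single site function, here is a minimal sketch using the standard `concurrent.futures` module. The `fetch_page` function is a stand-in that only simulates network waiting, and `scrape_site`, the urls and the thread count are hypothetical; in the real bot the worker would open a driver and return the page source.

```python
import concurrent.futures
import time


def fetch_page(url):
    """Stand-in for a real page download; it only simulates waiting on the network."""
    time.sleep(0.1)
    return f"<html>{url}</html>"


def scrape_site(urls, max_threads=5):
    """Download several pages of one site concurrently with a thread pool."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_threads) as executor:
        # executor.map keeps the results in the same order as the input urls
        return list(executor.map(fetch_page, urls))


pages = scrape_site([f"https://example.com/page/{i}" for i in range(5)])
print(len(pages))
```

Threads suit this job because the work is mostly waiting on the network. Keep the number of threads modest so you do not overload the target server or get blocked.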

You'll probably also need to handle errors with try blocks. But as I said before, it all comes down to testing each function, seeing if it works and fixing it in case it doesn't. Once you have tested each function, the bot will give you a huge data collection capacity, and as long as the target websites keep their structure you can run it for months with little or no change while extracting valuable data.
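A try block can be wrapped in a small retry helper so that one temporary failure does not kill a whole scraping run. This is a sketch under assumptions: `run_with_retries` and `flaky_scraper` are hypothetical names, and the number of attempts and the delay are arbitrary values you would tune by testing.

```python
import time


def run_with_retries(func, retries=3, delay=1.0):
    """Call func(); on an exception wait and try again, up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception as error:
            print(f"Attempt {attempt} of {retries} failed: {error}")
            if attempt == retries:
                # Give up and let the caller see the last error
                raise
            time.sleep(delay)


attempts = {"count": 0}


def flaky_scraper():
    """Stand-in for a site function that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "data"


print(run_with_retries(flaky_scraper, retries=3, delay=0))
```

Catching the broad `Exception` keeps the sketch short; in practice you may prefer to catch only the errors you expect, such as timeouts or missing elements.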

The only costs associated with this process are computing capacity and the proxy provider. With a good deal from your proxy provider, and once you have defined your functions for each target website, you'll be able to retrieve valuable data 24/7 all year long with little to no effort and at a very low economic cost. This data will probably provide you with business insights that could not be obtained otherwise, because it is real-time information.

#### **Results**

Finally, I executed this example with a depth of 4 pages into the catalog for each category/province. I filtered the provinces and only used the 4 that make up the region where I live.

The result was this csv:

https://github.com/zetag33/FotoCasa/blob/main/dataset/fotocasa_2023_11_06.csv

Structured and valuable market data at a cheap cost.



### **Particular use-case 2**

#### **Explanation**

Now, let's imagine we do not need all the data in the whole catalog of the website, and we do not even want all the info about a category; we are just interested in one single product.

This product can be searched for in the searchbox of the webpage.
We are then automatically redirected to a url that contains all the advertisements for that specific product.

This is also an interesting use-case that will repeat itself for many webpages, so let's develop the infrastructure for it in a way that requires minimal changes between domains.

We are going to test it on WallaPop, a Spanish second-hand marketplace for a wide range of products.

https://es.wallapop.com/

Our webpage has a peculiarity: once you search for a specific product, the page you are redirected to has no catalog. It shows more products by scrolling, not by clicking through to the next page of a catalog. The scrolling is also infinite: you can't reach the end, and it will let you scroll forever by repeating the products already shown.

We are going to solve the scrolling issue by setting a timer: once we have scrolled for the desired time, the bot will stop. It's our responsibility to decide how much time is optimal, which we will find out by testing.

In case our target webpage doesn't behave like that, which is quite likely, we should replace the scrolling code with the catalog iteration code shown in the real estate bot. I'll show an example of how we would do that.



#### **Code infrastructure**

We are again going to use Python and Selenium, which are a perfect fit for this use-case.

We are basically going to:

    - Enter a webpage.
    - Write in the searchbox.
    - Submit the search from the searchbox.
    - Land on the redirected results url.
    - Scroll down until the end of the page or until we are no longer interested in the data.
    - Retrieve the page code.
    - Extract our desired data.
    - Store the data.

Let's start by importing our dependencies; they are mostly the same as in the catalog-based bot.

In [None]:
from selenium import webdriver
import selenium
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import random
import re
import requests
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import csv
from datetime import datetime
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.chrome.service import Service
import os
import concurrent.futures
import threading

We are then going to define two functions we have already seen in the catalog bot: one to set up the options of our driver and another one to create the driver already configured.

There is a difference in this use-case compared to the previous one: we are not going to need proxies. We are only making two connections, one for the search and another for the redirect.

Most importantly, the url that the search redirects to doesn't have a catalog, so we don't have to access a different page each time. That's the main reason we are not using proxies here: the server won't block us for 2 connections each time we search for a specific product.

If needed, we could just take the functions defined for the first use-case bot, which include the proxy part.

In [None]:
def options():
    """
    Sets up an options object for a Chrome driver with settings such as user-agent, enabled javascript and headless mode (no proxy details in this use case).
    """
    # Instantiate an options object
    options = ChromeOptions()
    # Activate javascript rendering
    options.add_argument("--enable-javascript")
    # Use a "human" user-agent
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
    # Run without a visible browser window; comment this line out while debugging so the browser pops up on your screen and you can inspect the process
    options.add_argument("--headless")
    # Disable blink features and various
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--disable-image-loading")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-plugins")
    options.add_argument("--disable-popup-blocking")
    return options

In [None]:
def webdriver_example():
  '''
  Creates an already options configured Chrome driver and returns the object
  '''
  # Instantiate the driver using the options
  browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                              options=options())

  return browser

Now we are going to specifically create a function for the scrolling part.

In [None]:
def scroll_down_slowly(driver):
  '''
  Scrolls down a webpage once the get request of a driver is done
  '''
  # Scroll
  driver.execute_script("window.scrollBy(0, 100);")
  # Wait between scrolls to seem more human
  time.sleep(0.5)  # Adjust the sleep time based on your preference

Now, given that we don't need to access a catalog, we'll retrieve the whole webpage we are interested in with a single function. I'll also provide a version of this function adapted to the case where the search results for a specific product are presented as a catalog. As said before, that is not the case here: products are displayed as you scroll, until you decide to stop.

In [None]:
def get_code(palabra: str):
  '''
  Given a word (the product we are interested in), accesses the root url, types it into the
  searchbox, submits the search, scrolls down for as long as we are interested and returns the
  source code of the webpage with the data we want
  '''
  # Create a webdriver
  driver = webdriver_example()
  # Send the GET request to the root url of the domain
  driver.get("https://es.wallapop.com/")
  # Get the searchbox
  search_box = driver.find_element(By.ID, "searchbox-form-input")
  search_query = palabra
  # Type the query into the searchbox and submit it
  search_box.send_keys(search_query)
  search_box.send_keys(Keys.RETURN)
  time.sleep(5)
  target_id = "btn-load-more"
  # Look for the "load more" button; find_elements returns an empty list instead of raising when it is missing
  buttons = driver.find_elements(By.ID, target_id)
  # Click on it if it exists
  if buttons:
      buttons[0].click()
  time.sleep(10)
  scroll_height = 0
  start_time = time.time()
  # Scroll down
  while True:
      driver.execute_script("window.scrollTo(0, arguments[0]);", scroll_height)
      # Adapt the sleeps to your needs
      time.sleep(5)
      # Adapt how much you scroll each time to your needs
      scroll_height += 500
      # Stop scrolling when the bottom of the webpage is reached (on this site you'll never reach it, since there is no bottom)
      if scroll_height >= driver.execute_script("return document.body.scrollHeight"):
          break
      elapsed_time = time.time() - start_time
      # If we have been scrolling for more than 30 seconds, stop; adapt this to your needs
      if elapsed_time >= 30:
          break
  # Get the webpage code
  info = driver.page_source
  driver.quit()
  return info

Once we have the page source, we have to parse it in order to keep only the data we are interested in.

In [None]:
def parse_code(info):
    '''
    Given the retrieved code of the webpage parses the content to get title and price
    '''
    # Create a list to store the data
    data = []
    # Parse the content
    soup = BeautifulSoup(info, 'html.parser')
    # Get all the containers that contain the data for each product
    containers = soup.find_all('div', class_='ItemCard__data')
    # For each one
    for container in containers:
        # Extract price information
        price_span = container.find('span', class_='ItemCard__price')
        price = price_span.text.strip() if price_span else "N/A"

        # Extract title information
        title_p = container.find('p', class_='ItemCard__title')
        title = title_p.text.strip() if title_p else "N/A"
        data.append((title, price))
    return data

Now simply store this data in a csv with the same function we used in the previous use case.

In [None]:
def get_csv(data, palabra):
    '''
    Given the extracted data and our search target writes a csv with the data.
    '''
    current_date = datetime.now().strftime("%Y_%m_%d")
    # Path where you want the csv file to be.
    csv_file = f"C:\\Users\\ROG STRIX\\PycharmProjects\\pythonProject\\wallapop_{palabra}_{current_date}.csv"

    # Check if this path already exists
    file_exists = os.path.isfile(csv_file)

    # If it already exists we append to the file, else we create it
    with open(csv_file, mode='a' if file_exists else 'w', newline='') as file:
        writer = csv.writer(file)

        if not file_exists:
            # Create a header
            header = ["Title", "Price"]
            writer.writerow(header)

        # Write data
        writer.writerows(data)
    # Print for traceability
    return print(f'Csv data written successfully {palabra}{current_date}')

Finally put it all together in a single process and call for the execution.

In [None]:
def main(palabra: str):
    info = get_code(palabra)
    data = parse_code(info)
    get_csv(data, palabra)

main("secador dyson")

This will open our root url, type "secador dyson" (a specific hair dryer), get redirected to the results url and collect the data of all the results loaded until the time parameter stops the scrolling.

This way we have easily obtained real-time data about a product we might be interested in, so we can know the average price of the product in the market, the number of advertisements and, if we extend the parser, the location of the offers, etc.
Execution is easy and the results are pretty clean.
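As an illustration of getting the average price out of the scraped rows, here is a minimal sketch. It assumes the price strings use Spanish formatting ('.' for thousands, ',' for decimals, followed by the currency symbol); the `parse_price` helper and the sample rows are made up for the example, so check the format in your own csv first.

```python
import re
from statistics import mean


def parse_price(text):
    """Convert a price string such as '1.250,50 €' to a float, or None if it has no number.

    Assumes Spanish formatting: '.' separates thousands and ',' separates decimals.
    """
    match = re.search(r"\d[\d.]*(?:,\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(".", "").replace(",", "."))


# Made-up rows in the (title, price) shape returned by parse_code
rows = [("Secador A", "350 €"), ("Secador B", "1.250,50 €"), ("Secador C", "N/A")]
prices = [p for p in (parse_price(price) for _, price in rows) if p is not None]
print(len(prices), mean(prices))
```

Rows whose price could not be read (the "N/A" placeholder) are skipped, so they do not distort the average.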

If we were interested, we could multiprocess it easily, since it's a single function, and search for the various products we are interested in. We could even apply NLP to the descriptions of the offers, or some kind of computer vision to their images. It can get as complex as you want.

#### **Result**

The csv extracted by the process is the one linked below.
We extracted 81 offers of this product. The timer is there because products repeat themselves once you have scrolled for a while, so we are only interested in the first ones, which are the unique ones.
https://github.com/zetag33/WebScraping-Tutorial-Use-Case/blob/main/dataset/wallapop_secador%20dyson_2023_11_29.csv
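If the timer lets some repeated products through, they can also be removed after parsing. A minimal sketch, assuming the rows are the (title, price) tuples produced by `parse_code`; `unique_rows` is a hypothetical helper and the sample rows are made up.

```python
def unique_rows(rows):
    """Drop repeated (title, price) rows while keeping the original order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(rows))


rows = [("Secador A", "350 €"), ("Secador B", "300 €"), ("Secador A", "350 €")]
print(unique_rows(rows))
```

Note that two different offers with exactly the same title and price would be merged by this approach. If you also extract the link of each offer, it makes a more reliable key.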

#### **Catalog variation**

Now let's imagine that instead of an endless scroll you get a catalog as search results. Then we'll have to scroll down to the bottom, capture the link to the next page of the catalog and repeat the process. Let's also provide a function to do that.

In [None]:
def get_code_catalog_variation(palabra: str, next_link=''):
  '''
  Given a word (the product we are interested in), accesses the root url, types it into the
  searchbox, submits the search, scrolls to the bottom of the results page and returns its source code.
  If a next_link is given, skips the search and accesses that page of the catalog directly.
  '''
  # Create a webdriver
  driver = webdriver_example()
  if next_link != '':
    # We already have the url of a catalog page, so access it directly
    driver.get(next_link)
  else:
    ## Remains the same as in get_code
    # Send the GET request to the root url of the domain
    driver.get("https://es.wallapop.com/")
    # Get the searchbox
    search_box = driver.find_element(By.ID, "searchbox-form-input")
    # Type the query into the searchbox and submit it
    search_box.send_keys(palabra)
    search_box.send_keys(Keys.RETURN)
    time.sleep(5)
    # Look for the "load more" button; find_elements returns an empty list when it is missing
    buttons = driver.find_elements(By.ID, "btn-load-more")
    # Click on it if it exists
    if buttons:
        buttons[0].click()
    time.sleep(10)
  # Here differences start: there is no timer, because a catalog page has a real bottom
  scroll_height = 0
  # Scroll down to the end of the page
  while True:
      driver.execute_script("window.scrollTo(0, arguments[0]);", scroll_height)
      # Adapt the sleeps to your needs
      time.sleep(5)
      # Adapt how much you scroll each time to your needs
      scroll_height += 500
      # Stop scrolling when the bottom of the webpage is reached
      if scroll_height >= driver.execute_script("return document.body.scrollHeight"):
          break
  # Get the webpage code
  info = driver.page_source
  driver.quit()
  return info

Let's now create a function that returns the next link of the catalog.

In [None]:
def get_next_link(info):
  '''
  Given the webpage source code, returns the next link of the catalog, or '' if there is none
  '''
  # Parse the code
  soup = BeautifulSoup(info, 'html.parser')
  # Find the anchor of the next page. Usually it's an a tag and the link is in its href; substitute the class for the one on your site
  anchor = soup.find('a', class_='next_link')
  # Return an empty string when there is no next page, so the caller knows when to stop
  next_link = anchor.get('href', '') if anchor else ''
  return next_link

Now let's put it all together in a main function

In [None]:
def main(word, max_pages):
  counter = 0
  # Start with the first page
  info = get_code_catalog_variation(word)
  data = parse_code(info)
  get_csv(data, word)
  # Get next_link
  next_link = get_next_link(info)
  # The first page is already done, so this fetches at most max_pages additional pages
  # and stops earlier if there is no next link. Change this condition according to your needs
  while counter < max_pages and next_link != '':
    info = get_code_catalog_variation(word, next_link)
    data = parse_code(info)
    get_csv(data, word)
    next_link = get_next_link(info)
    counter += 1

This has also been an example of how easily you can change the structure of the code based on your needs. You only have to pick which parts are useful and which ones you should add or modify. There are plenty of examples in the notebook to give you a code structure for a large part of the commercial web scraping possibilities.

### **Particular use-case 3**

#### **Explanation**

Let's do our last use-case.

Now we'll scrape a catalog website again. The difference is that the catalog is not a displayable: the website has a dedicated url for the catalog, from which you get the links without needing to display anything, and then you access all those urls and get the data.

This one was quite difficult because of the web structure. The code provided here is a general guideline but, as said many times before, you'll need to adapt it to the specific structure of your website. This means that sometimes you'll have to use tricks and understand the situation.

We are going to access PAGINAS AMARILLAS, a Spanish website that offers the contact info of businesses and individuals that offer services, and we are going to get its data segmented by category.

This is our starting point: https://www.paginasamarillas.es/all_tarragona_reus.html

A catalog. We could do it on the catalog for all of Spain; I'm using this one because it is my province and it was what I was interested in at the time.

If you open the url provided, you'll see that this example is much more difficult, because we first need to get the urls for each town/village/city and then the specific url for each service in each town/village/city. This is a kind of stack of catalogs.

This will amount to more than 10,000 urls that we are going to access.

This example will be tricky because containers often change classes between different pages of the catalog. So we'll check whether we got data and, in case we didn't, we'll try another approach to get it. I have checked that when the data is not available the first way, it is available the second way. As I explained above, web scraping is all about being creative and finding tricks, as well as understanding the situation.
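The "try one approach, then another" idea can be written as a small generic helper. This is only a sketch: `extract_with_fallback` and the two extractors are hypothetical, and the page is a plain dict standing in for a parsed soup so the example runs without opening a browser.

```python
def extract_with_fallback(page, extractors):
    """Try each extractor in order and return the first non-empty result."""
    for extractor in extractors:
        result = extractor(page)
        if result:
            return result
    # None of the approaches found data
    return []


# Stand-in extractors: in the real bot each one would query the soup with a different class
def by_first_class(page):
    return page.get("box", [])


def by_second_class(page):
    return page.get("card", [])


# Stand-in page whose data lives under the second class
page = {"card": ["Plumber", "Electrician"]}
print(extract_with_fallback(page, [by_first_class, by_second_class]))
```

Ordering the extractors from most to least common keeps the typical case fast, and a new page structure only needs one more extractor in the list.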

This example will also inherently involve multiprocessing.

So let's solve this much more difficult challenge.

#### **Code Infrastructure**

First, as always, we import our dependencies. They are the same as in the other 2 use-cases, plus builtwith.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import random
import re
import requests
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import csv
from datetime import datetime
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.chrome.service import Service
import os
import builtwith
import concurrent.futures
import threading

Now we define our 2 functions to create the driver: one to define the options and another one to instantiate a driver with those options. We have already seen these functions in the previous examples.

In [None]:
def options():
    """
    Sets up an options object for a Chrome driver (user-agent, javascript, headless mode).
    """
    options = ChromeOptions()
    # Enable javascript
    options.add_argument("--enable-javascript")
    # Give the driver a "human" user-agent
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
    options.add_argument("--headless")
    # Disable the blink features
    options.add_argument("--disable-blink-features=AutomationControlled")
    # (The proxy would be configured in the driver options here; none is set in this function)
    # Disable image loading
    options.add_argument("--disable-image-loading")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-plugins")
    options.add_argument("--disable-popup-blocking")
    return options



def webdriver_example():
    """
    Creates a Chrome driver configured with the options above and returns it.
    """
    # Create the driver with the options returned by the options function
    browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                               options=options())

    return browser


Next, we know that in this case we have a dedicated url for the catalog, so let's access that catalog and extract all the urls we'll need to access. We'll write them to a file.

In [None]:
def get_links_provincia(url):
  '''
  Given a url for a whole province, extract the url's for each town/city in the province
  '''
  # Instantiate driver
  driver = webdriver_example()
  # Access webpage
  driver.get(url)
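  # Wait for the page to finish loading before reading its code
  time.sleep(10)
  # Get code
  page_source = driver.page_source
  # Close the browser, we only need the code from here on
  driver.quit()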
  # Parse it
  soup = BeautifulSoup(page_source, 'html.parser')
  # Get the containers that contain the hrefs
  menuscroll_divs = soup.find_all('div', class_="menuscroll")
  hrefs = []
  for div in menuscroll_divs:
      # Get each anchor tag
      anchor_tags = div.find_all('a', href=True)
      # Extract the text url
      for anchor in anchor_tags:
          hrefs.append(anchor['href'])
  # Check how many we extracted
  print(len(hrefs), hrefs)
  return hrefs

Once we have the link for each city/town/village, we'll need to access each one and get the urls that lead to each type of service in that city/town/village.

In [None]:
def get_links_localidad(url):
  '''
  Given a url for a specific town/village/city, extracts the url for each service in the town/village/city
  '''
  # Instantiate driver
  driver = webdriver_example()
  # Access webpage
  driver.get(url)
  # Get the code of the webpage
  page_source = driver.page_source
  # Close the browser so we don't leave one open for every town
  driver.quit()
  # Parse it
  soup = BeautifulSoup(page_source, 'html.parser')
  # Get the urls for each service in the town/city/village
  ficha_divs = soup.find_all('div', class_="ficha")
  hrefs = []
  for div in ficha_divs:
      p_tags = div.find_all('p')
      for p in p_tags:
          anchor_tags = p.find_all('a', href=True)
          for anchor in anchor_tags:
              href = anchor['href']
              if 'juzgados' not in href and 'caja-rural' not in href: # We don't want those for specific reasons of the underlying study
                  hrefs.append(href)
  return hrefs

Let's put it all together and write a txt with all the urls we'll need to access.

In [None]:
# Get the link for each village/city/town
localidades = get_links_provincia('https://www.paginasamarillas.es/all_tarragona_reus.html')
total_links = []
# For each town
for i in localidades:
    # Get the link of each service
    sectores = get_links_localidad(i)
    # Store those urls
    total_links.append(sectores)
# Put it all together in a single list
total_links = [item for sublist in total_links for item in sublist]
# Write the content of this list in a txt, so we have a document with all the url's we'll have to access.
with open('urls.txt', 'w') as file:
    for link in total_links:
        file.write(link + '\n')

Now we'll need to access each one of those urls, extract the data and store it.
Those urls can be found in the txt I generated using this chunk of code:
https://github.com/zetag33/paginas_amarillas/blob/main/urls.txt

Let's define some utility functions for this specific case. You'll need to write your own, adapting them to your needs.

In [None]:
def remove_last_number(url):
  '''
  Removes a trailing number (for example a page number) from the url, if there is one.
  The final slash is kept.
  '''
  # Search for the pattern
  pattern = r'/(\d+)$'
  match = re.search(pattern, url)
  # If found, remove it
  if match:
      last_number = match.group(1)
      url_without_last_number = url[:-len(last_number)]
      return url_without_last_number
  # Else return the original url
  else:
      return url

Get the province. By looking at the url structure, we know that the province is always located in a specific spot: it is the second-to-last segment of the url, right before the city. So we use this pattern to extract the province name from the url.

In [None]:
def get_province(url):
  '''
  Get the province name from the url
  '''
  # Remove unwanted feature
  url = remove_last_number(url)
  # Specify a pattern: the last two segments of the url are province and city
  match = re.search(r'/([^/]+)/([^/]+)/$', url)
  # If match, extract the province
  if match:
      province = match.group(1)
  # Else set the province as null
  else:
      province = None
  return province

Now we do the same for the city

In [None]:
def get_city(url):
  '''
  Get the city name from url
  '''
  # Remove unwanted feature
  url = remove_last_number(url)
  # Specify a pattern
  match = re.search(r'/([^/]+)/([^/]+)/$', url)
  # If found extract the city
  if match:
      city = match.group(2)
  # Else set it as null
  else:
      city = None
  return city

We again do the same for the category of services

In [None]:
def get_category(url):
  '''
  Given an url extract the category of the services offered
  '''
  # Remove unwanted feature
  url = remove_last_number(url)
  # Get the category part of the url
  url_parts = url.split('/')
  category = url_parts[4]
  return category
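To see what these url helpers return, here is a condensed, self-contained sketch of the same logic applied to a sample url. The url is hypothetical and its layout (`/a/<category>/<province>/<city>/<page>`) is an assumption inferred from the indices and patterns used above; `split_listing_url` is an illustrative name, not one of the notebook's functions.

```python
import re


def split_listing_url(url):
    """Return (category, province, city) from a listing url with the assumed layout."""
    # Drop a trailing page number such as '/2' but keep the final slash
    url = re.sub(r"/\d+$", "/", url)
    # The category is the fifth piece when splitting on '/'
    parts = url.split("/")
    category = parts[4]
    # Province and city are the last two segments before the final slash
    match = re.search(r"/([^/]+)/([^/]+)/$", url)
    province, city = match.groups() if match else (None, None)
    return category, province, city


# Hypothetical url, used only to illustrate the pattern
print(split_listing_url("https://www.example.com/a/fontaneros/tarragona/reus/2"))
```

If your target urls have a different number of segments, the index 4 and the regular expression are the two places to adjust.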

Access the url and extract its code.

In [None]:
def get_soup(url):
  '''
  Given an url, access it and return its code parsed as a soup object
  '''
  # Instantiate driver
  driver = webdriver_example()
  # Access the url
  driver.get(url)
  # Get its page source
  page_source = driver.page_source
  # Close the browser, we only need the code from here on
  driver.quit()
  # Parse it
  soup = BeautifulSoup(page_source, 'html.parser')
  return soup

For this particular case, our data storage function will implement csv locking.
We do this because we'll use multiprocessing and we'll have several processes writing to the same file. We lock the file while it is being written, so only one process can write to it at a time. This prevents interference between processes. The downside is that it slows our multiprocessing down a little, because processes have to wait for access to the file, but it's necessary for data integrity.

In [None]:
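# Create the lock once, at module level, so every call shares the same lock.
# A lock created inside the function would be new on every call and protect nothing.
# Note: a threading.Lock only coordinates threads inside one process.
csv_lock = threading.Lock()


def get_csv(data, province, city):
  '''
  Given the extracted data, a province and a city, appends the data to today's csv file.
  '''
  # Lock the file
  with csv_lock:
      # Get today's date
      current_date = datetime.now().strftime("%Y_%m_%d")
      # Get the file path
      csv_file = f"C:\\Users\\Usuario\\PycharmProjects\\paginas_amarillas\\paginas_amarillas_{current_date}.csv"

      # Check if it already exists.
      file_exists = os.path.isfile(csv_file)

      # Create the file if it doesn't exist, append to it otherwise
      with open(csv_file, mode='a' if file_exists else 'w', newline='') as file:
          writer = csv.writer(file)

          if not file_exists:
              # Create a header that matches the fields collected by the parsing functions
              header = ["Name", "Category", "Address", "Postal code", "Telephone", "Web", "Technology", "Description", "Province", "City"]
              writer.writerow(header)

          # Write data
          writer.writerows(data)
      # Print for traceability.
      return print(f'Csv data written successfully {province}{city}')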
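A `threading.Lock` only coordinates threads that live in the same process. When the writers are separate processes, one option is to create a `multiprocessing.Lock` in the parent and hand it to every worker through the pool initializer. The sketch below shows that pattern under assumptions: `init_worker`, `append_rows`, the file name and the sample rows are all hypothetical, and the csv is written to the temporary directory just for the demonstration.

```python
import csv
import multiprocessing
import os
import tempfile

# Filled in by init_worker inside every worker process
csv_lock = None


def init_worker(lock):
    """Pool initializer: store the shared lock in a module-level variable."""
    global csv_lock
    csv_lock = lock


def append_rows(args):
    """Append rows to the csv file while holding the shared lock."""
    path, rows = args
    with csv_lock:
        with open(path, mode="a", newline="", encoding="utf-8") as file:
            csv.writer(file).writerows(rows)
    return len(rows)


if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "lock_demo.csv")
    # One lock created in the parent and shared with all workers
    lock = multiprocessing.Lock()
    jobs = [(path, [(f"name {i}", f"city {i}")]) for i in range(4)]
    with multiprocessing.Pool(processes=2, initializer=init_worker, initargs=(lock,)) as pool:
        print(sum(pool.map(append_rows, jobs)))
```

The lock is passed through `initargs` because a lock has to be handed to the workers when they are created; it cannot be sent later as a normal task argument.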

This is a common situation in web scraping: sometimes pages inside the same domain are structured differently. That's bad for us, because we'll have to handle it in a different and more complicated way.

Luckily, the number of different structures inside the same domain is limited. In this domain, for example, the urls that showcase services can have 2 different HTML structures.

We'll check which type of structure each page has and scrape it accordingly.

Detect how many different HTML structures there are inside your target domain, then define a checker for the type of structure and a process that parses each page according to its type.

As I said, there is usually a limited number of types; in my experience it's difficult to find more than 5 types inside a domain. Decide how many types there are, how to scrape each one and whether it's worth it.

We'll now define a function to check which type of structure a given url has. We will have type 1 and type 2.

In [None]:
def type_checker(soup):
  '''
  Check which type of structure a given page has
  '''
  # Having this characteristic differentiates which class of structure a given url has: if
  # this feature is present then the url is of type 2, else it's of type 1. You'll have to discover this for your target domain by yourself by looking into
  # the structure of different urls and checking how many types there are and what the key characteristic of each type is
  # (map_div and page_type avoid shadowing the built-in names map and type)
  map_div = soup.find('div', class_="mapping")
  if map_div:
      page_type = 2
  else:
      page_type = 1
  return page_type

Now that we know which HTML structure each url has, we are going to define a parsing process for each type.

Having a different structure means our desired data will be located on different HTML tags, so we have to adapt the parsing process to correctly access the HTML code to retrieve the desired data.

Let's start with type 1.

In [None]:
def parse_one(soup, province, city):
  '''
  Given a soup object(code of the webpage) of structure type 1, a province and a city extract the desired data of the webpage
  '''
  # Find the boxes that have the data for each advertisement
  boxes = soup.find_all('div', class_="box")
  # Create a list to store data
  data = []
  # For each advertisement
  for box in boxes:
      # Get the name
      name = box.find('span', {'itemprop': 'name'})
      # If name does not exist, we do not collect data
      if name is not None:
          # Get the name
          name = name.text
          # Get the category of the advertisement, which type of service is it
          category_p = box.find('p', class_='categ')
          category = category_p.text if category_p else "-"
          # Get the address
          address_span = box.find('span', {'itemprop': 'streetAddress'})
          address = address_span.text if address_span else "-"
          # Get the postal code
          postal_code_span = box.find('span', {'itemprop': 'postalCode'})
          postal_code = postal_code_span.text if postal_code_span else "-"
          # Get the telephone
          telephone_span = box.find('span', {'itemprop': 'telephone'})
          telephone = telephone_span.text if telephone_span else "-"
          # Get the web.
          web_span = box.find('a', class_='web')
          web = web_span.get('href') if web_span else "-"
          # Get the technology with which the web was constructed if it's possible, I'll explain what's the purpose of this
          try:
              techno = builtwith.builtwith(web) if web != "-" else "-"
          except Exception:
              techno = "Unable to decode"
          # Get the description of the advertisement
          description_span = box.find('div', {'itemprop': 'description'})
          # If it exists, extract text
          if description_span:
              p_tag_span = description_span.find('p')
              description = p_tag_span.text if p_tag_span else "-"
          # else, set it as null
          else:
              description = "-"
          # Store data
          data.append((name, category, address, postal_code, telephone, web, techno, description, province, city))
      # If the advertisement has no name don't capture it
      else:
          pass
  # Write the csv with the data
  get_csv(data, province, city)
  # Print for traceability
  return print(f'{len(data)} records written')

Type 2

In [None]:
def parse_two(soup, province, city, category):
  '''
  Given a soup object (the code of the webpage) of structure type 2, a province, a city and a category, extract the desired data from the webpage
  '''
  # Get each advertisement
  boxes = soup.find_all('div', class_="box")
  # Create a data storage
  data = []
  # For each advertisement
  for box in boxes:
    # Get the name
    name = box.find('h2', {'itemprop': 'name'})
    if name is not None:
        name = name.text
        # Category
        categoria = category
        # Address
        address_span = box.find('span', {'itemprop': 'streetAddress'})
        address = address_span.text if address_span else "-"
        # Postal code
        postal_code_span = box.find('span', {'itemprop': 'postalCode'})
        postal_code = postal_code_span.text if postal_code_span else "-"
        # Phone
        telephone_span = box.find('span', {'itemprop': 'telephone'})
        telephone = telephone_span.text if telephone_span else "-"
        # Set web as null, this type of structure has no web
        web = '-'
        # Set techno as null
        techno = '-'
        # Set description as null, this type of structure has no description
        description = '-'
        # Save data
        data.append((name, categoria, address, postal_code, telephone, web, techno, description, province, city))
    # If there is no name, don't include a record
    else:
        pass
  # Store the data into a csv
  get_csv(data, province, city)
  # Print for traceability (use category, which is defined even when no advertisement was found)
  print(f'Wrote {len(data)} records of {category}')
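
Both parsers end by appending the same ten-field tuple, with "-" standing in for anything that is missing. A minimal sketch of one way to keep that shape in a single place, assuming a hypothetical `Listing` dataclass (not part of the original code) whose `to_row()` returns the fields in the order used by `data.append(...)`:

```python
from dataclasses import dataclass

# Hypothetical helper, not part of the original tutorial code.
MISSING = "-"  # same placeholder the parsers use for absent fields


@dataclass
class Listing:
    # Fields every advertisement is stored with
    name: str
    province: str
    city: str
    # Optional fields fall back to the placeholder
    category: str = MISSING
    address: str = MISSING
    postal_code: str = MISSING
    telephone: str = MISSING
    web: str = MISSING
    techno: object = MISSING  # whatever the technology lookup returned, or the placeholder
    description: str = MISSING

    def to_row(self):
        # Same column order as the tuples appended in parse_one and parse_two
        return (self.name, self.category, self.address, self.postal_code,
                self.telephone, self.web, self.techno, self.description,
                self.province, self.city)


# A type 2 advertisement has no web, techno or description (made-up values)
row = Listing(name="Example Bar", province="tarragona", city="reus",
              category="bares", telephone="000000000").to_row()
print(row)
```

With something like this, a parser would only pass the fields it actually found, and the fallback values would not need to be repeated in each function.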

This function returns the next link of a catalog, given the soup object of the current page.

In [None]:
def get_next_link(soup):
  '''
  Get next link of the catalog given the soup object of a url
  '''
  # Find the container of the pagination links
  pagination_ul = soup.find('ul', class_='pagination')
  # No pagination container means there is no next link
  if not pagination_ul:
      return ''
  # Find all the li tags inside
  li_tags = pagination_ul.find_all('li')
  # An empty pagination list has no next link either
  if not li_tags:
      return ''
  # The next link is in the last li, unless the last li is an empty placeholder
  if str(li_tags[-1]) == '<li></li>' and len(li_tags) > 1:
      last_li = li_tags[-2]
  else:
      last_li = li_tags[-1]
  # Get the anchor tag of the li
  last_a = last_li.find('a')
  # Get the href value of the anchor
  href_value = last_a.get('href') if last_a else None
  # A missing href or a javascript placeholder means there is no next link
  if not href_value or href_value == 'javascript:void()':
      return ''
  return href_value
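
`get_next_link` returns the `href` exactly as it appears in the page. If a target site used relative links in its pagination (an assumption, the site scraped in this tutorial may already use absolute ones), the value would need to be resolved against the current page before being fetched. A minimal sketch with the standard library, where `resolve_next_link` is a hypothetical helper and the urls are illustrative:

```python
from urllib.parse import urljoin


def resolve_next_link(current_url, href):
    """Return an absolute url for href, or '' when there is no next page."""
    # Keep the tutorial's convention: an empty string means "no next link"
    if not href:
        return ''
    # urljoin leaves absolute urls untouched and resolves relative ones
    return urljoin(current_url, href)


page = 'https://example.com/catalog/page-1.html'
print(resolve_next_link(page, 'page-2.html'))
print(resolve_next_link(page, 'https://example.com/other.html'))
print(resolve_next_link(page, ''))
```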

Now let's put it all together.

First, we'll generate a file with all the urls to scrape, using the functions we defined previously.

In [None]:
# Get the links for each city/town/village
localidades = get_links_provincia('https://www.paginasamarillas.es/all_tarragona_reus.html')
total_links = []
# For each city/town/village
for i in localidades:
    # Get the links of all the services offered in that given city/town/village
    sectores = get_links_localidad(i)
    # Save the data
    total_links.append(sectores)
# Convert it into a single list; the original format is a list of lists.
total_links = [item for sublist in total_links for item in sublist]
# Write the file with all the urls we have to scrape.
with open('urls.txt', 'w') as file:
    for link in total_links:
        file.write(link + '\n')
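
If the same category url were listed under more than one locality (an assumption, not something checked in this tutorial), the flattened list would contain duplicates and those pages would be scraped twice. A sketch of flattening and deduplicating in one pass while keeping the original order; `flatten_unique` is a hypothetical helper and the urls are illustrative:

```python
from itertools import chain


def flatten_unique(list_of_lists):
    """Flatten a list of lists and drop repeated items, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(chain.from_iterable(list_of_lists)))


# Illustrative urls only
nested = [
    ['https://example.com/a', 'https://example.com/b'],
    ['https://example.com/b', 'https://example.com/c'],
]
print(flatten_unique(nested))
```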

Now that we have all the urls, let's scrape each one. We'll define a function describing what to do for each url; it uses the utility and process functions defined above.

In [None]:
def micro_url(url):
  '''
  Given a url, scrape all of its data and follow the next pages of the catalog in the same url tree
  '''
  # Start from the url provided and follow the next links until none is left
  current_url = url
  while current_url != '':
      # Get the province, city and category of services to which the url belongs
      province = get_province(current_url)
      city = get_city(current_url)
      category = get_category(current_url)
      # Get the raw code of the url
      soup = get_soup(current_url)
      print(f'Scraped {current_url}')
      # Check which type of HTML structure the page has
      page_type = type_checker(soup)
      if page_type == 1:
          # Parse according to type 1 and get the next link of the catalog
          parse_one(soup, province, city)
          current_url = get_next_link(soup)
      else:
          # Parse according to type 2; this structure has no next link
          parse_two(soup, province, city, category)
          current_url = ''
  # Reaching here means the whole url tree of the original url has been scraped
  # Print for traceability
  print(url, "Scraped successfully")
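
`micro_url` keeps following the catalog's next link until it is empty. A site whose last page linked back to an earlier one would keep that loop running forever. A minimal sketch of the same traversal with a guard against revisiting pages; `follow_catalog` and `fake_next` are hypothetical names, and the page lookup is stubbed with a dictionary so the example runs without network access:

```python
def follow_catalog(start_url, get_next):
    """Visit start_url and every page reachable through get_next, once each."""
    visited = []
    seen = set()
    current = start_url
    # Stop on an empty link (end of catalog) or on a page already visited
    while current != '' and current not in seen:
        seen.add(current)
        visited.append(current)
        current = get_next(current)
    return visited


# Stub standing in for get_soup + get_next_link: page3 links back to page1
fake_pages = {'page1': 'page2', 'page2': 'page3', 'page3': 'page1'}


def fake_next(url):
    return fake_pages.get(url, '')


print(follow_catalog('page1', fake_next))
```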

We have already defined what to do with each url of the generated file.
Now let's put it all together in a main process and use multiprocessing to scrape several urls of the file at the same time, which speeds up the process.

In [None]:
def main():
  '''
  Main process of the use-case
  '''
  # Open the generated file, change the path to yours
  with open('C:\\Users\\Usuario\\PycharmProjects\\paginas_amarillas\\urls.txt', 'r') as file:
      # Read each line of it and get a list of urls to scrape, skipping blank lines
      url_list = [line.strip() for line in file if line.strip()]

  # Set the maximum number of parallel processes, usually leaving 1 or 2 CPUs for other purposes
  max_processes = 10
  # Set up a pool of processes
  with concurrent.futures.ProcessPoolExecutor(max_processes) as executor:
      # Each process executes the micro_url function on one url until there are no more in the list
      futures = [executor.submit(micro_url, url) for url in url_list]
      for completed, future in enumerate(concurrent.futures.as_completed(futures), start=1):
          # One more process has completed
          print(f"Number of completed processes: {completed}")
  # Leaving the with block waits for all processes to finish

# Call the process
if __name__ == "__main__":
    # Time it
    initial = time.time()
    # Execute
    main()
    # Record the finish time
    end_time = time.time()
    # Get and print total time of process
    elapsed_time = end_time - initial
    print(f"Processed in {elapsed_time} seconds.")
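
`main` submits every url and counts completions, but an exception raised inside `micro_url` stays hidden unless `future.result()` is called. A sketch, under the assumption that failed urls should be reported rather than stop the run; `pick_worker_count` and `run_all` are hypothetical helpers, and the executor class is a parameter so the same code can also be tried with threads:

```python
import concurrent.futures
import os


def pick_worker_count(reserved=2):
    """Use all CPUs except `reserved`, but never fewer than one worker."""
    # os.cpu_count() can return None when the count cannot be determined
    return max(1, (os.cpu_count() or 1) - reserved)


def run_all(worker, urls, max_workers=None,
            executor_cls=concurrent.futures.ProcessPoolExecutor):
    """Run worker(url) for every url and return (succeeded, failed) lists."""
    succeeded, failed = [], []
    with executor_cls(max_workers or pick_worker_count()) as executor:
        # Map each future back to its url so failures can be reported
        future_to_url = {executor.submit(worker, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                # result() re-raises any exception raised inside the worker
                future.result()
                succeeded.append(url)
            except Exception as error:
                failed.append((url, repr(error)))
    return succeeded, failed
```

With the functions of this tutorial it could be called as `run_all(micro_url, url_list)` from inside the `if __name__ == "__main__":` guard, and the `failed` list could then be written to a file for a second pass.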

#### **Result and conclusion of use-case**

The resulting csv can be found here:

https://raw.githubusercontent.com/zetag33/WebScraping-Tutorial-Use-Case/main/dataset/paginas_amarillas_2023_11_16.csv

It contains data about 110,000 businesses.

The `techno` feature records which technology each business's website was built with. It was extracted to identify businesses that might need a website built, because they have none, or an existing one improved, because it was built with an old technology.

So by extracting this data we are able to build a customer base for a business that offers website development.
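
As a sketch of how that customer base could be pulled out of the scraped rows, assuming each row keeps the column order used in `data.append(...)` (so `web` is the sixth field) and that "-" marks a missing website. `split_leads` is a hypothetical helper, the sample rows are made up, and whether the real file has a header row is not assumed here:

```python
import csv
import io

WEB_COLUMN = 5  # position of `web` in the tuples written by the parsers


def split_leads(rows):
    """Separate businesses without a website from those that have one."""
    without_web, with_web = [], []
    for row in rows:
        # "-" is the placeholder the parsers store when no website was found
        if row[WEB_COLUMN] == '-':
            without_web.append(row)
        else:
            with_web.append(row)
    return without_web, with_web


# Made-up sample in the same column order as the scraped data
sample = io.StringIO(
    'Bar A,bares,Street 1,43201,000000000,-,-,-,tarragona,reus\n'
    'Shop B,tiendas,Street 2,43202,111111111,https://example.com,-,-,tarragona,reus\n'
)
no_site, has_site = split_leads(csv.reader(sample))
print(len(no_site), len(has_site))
```

The businesses in `has_site` could then be filtered further by their `techno` value to find websites built with older technologies.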

Proxies have not been used, since the server doesn't cut off the process and faking the user-agent is enough.
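
Setting a browser-like user-agent amounts to adding one header to each request. A minimal sketch with the standard library; the tutorial's own `get_soup` may do this differently, and the user-agent string below is only an illustrative value:

```python
from urllib.request import Request

# Illustrative value, any current browser user-agent string could be used
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'


def build_request(url):
    """Build a request that identifies itself with a browser-like user-agent."""
    return Request(url, headers={'User-Agent': USER_AGENT})


request = build_request('https://example.com/')
# urllib stores header names capitalized as 'User-agent'
print(request.get_header('User-agent'))
# urllib.request.urlopen(request) would then send the request with that header
```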

## **Conclusion**

Web scraping is a diverse business.
This tutorial/guide is built with the intention of providing you with examples of real-world applications.
It's also built to provide the basic code structure and problem-solving methodology for most web scraping use cases.
We have learnt how to scrape a catalog-based website, whether the website has a page for the catalog itself or displays it at some url of the domain.
We have also learnt how to search for a specific product and retrieve the data we are interested in, either by endless scrolling or, again, by paging through a catalog.
We have learnt how to use multiprocessing to speed the process up, how to store data in a csv and how to deal with multiple processes writing to the same csv.

The use case might change and, as explained in the tutorial, you'll need to adapt some aspects of the code to your target domain, such as the location of HTML tags and intrinsic aspects of the code's infrastructure.
I have provided examples of how to do so.
I would advise using this tutorial as a guideline and as a source of functions and methodology that will be useful for your particular project.

Web scraping is a business that requires thinking outside the box, experimentation and troubleshooting problems of all kinds. You'll also have to deal with diverse firewall problems. The firewall troubleshooting done in this code is basic but effective; for larger projects you'll probably have to handle CAPTCHAs and more effective protection tools. That could be a whole topic in itself.

As a wise man once said: I did not fish for you, but I gave you the tools to do so.