##**What is commercial web scraping and why is it valuable?**

Web scraping definition:
https://en.wikipedia.org/wiki/Web_scraping

A large part of web scraping focuses on commercial webpages, with the goal of collecting data for market studies in sectors such as e-commerce, real estate or service offerings.

Studying this data (prices, locations, quantities, providers, etc.) has a lot of commercial value and can yield many business insights, especially when real-time data and analytics are needed.

Web-scraping bots combined with data storage and real-time analytics can provide your business with automated insights on the prices, quantities, locations and overall state of the market your business is competing in.

Imagine having a completely automated tool that runs 24/7 collecting, storing and analyzing valuable data for your business.

Tools such as Python, cloud computing and NoSQL databases have given this field huge potential.

##**What is a commercial website, what is a catalog of offers and how can we extract value from its structure?**

The definition of a commercial website is as simple as its own name:

It's a [website](https://en.wikipedia.org/wiki/Website) dedicated to [e-commerce](https://en.wikipedia.org/wiki/E-commerce), or simply to traditional commerce and trade that also hosts its offers on the Internet as a way of reaching more clients.

As an example:

[Amazon Website](https://www.amazon.es/)

Commercial webpages share one characteristic: all of them have a catalog where they group their resources.

This catalog holds the urls that lead to the offers of each category.

As an example, on the Amazon website linked above there is a dropdown menu named "All" at the top left. If we click on it, we are shown the different categories of products offered by Amazon, and each category has its own url.

Once we access a specific category, we are shown a list of products and their characteristics. The bot collects the desired data and then either moves to the next page of products in the category or simply scrolls down until there are no more products, depending on the structure of the website.

This is done for each page of each category until there are no more products.

When we combine different sources of data (websites), we get an overall real-time view of a specific market that could not be achieved otherwise.

This repetitive structure of commercial websites allows us to create a code base that remains the same in most cases.

We know that we are going to get the urls of every category, then access each category and walk through its catalog. This has to be done for every commercial website.

So, given this structure, the only thing we have to change between websites is the names of the HTML elements where the desired data is located. The rest of the methodology and code can remain unchanged.

This gives great versatility and keeps the effort of scraping many different domains low, providing us with a huge amount of real-time, real-market data to store and analyse.
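One way to make the "only the element names change" idea concrete is to keep everything that is specific to a website in a small configuration object. The sketch below is only an illustration: `SiteConfig`, `SITES` and `selectors_for` are hypothetical names that do not appear in the tutorial code, and the class names are the Fotocasa ones used later in this tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteConfig:
    """Everything that changes between websites: start url and HTML class names."""
    name: str
    base_url: str
    catalog_url: str
    card_class: str        # container that wraps one offer
    title_class: str       # element with the title of the offer
    price_class: str       # element with the price of the offer
    next_page_class: str   # element that holds the link to the next catalog page


# Hypothetical registry: one entry per website, consumed by the same crawling code.
SITES = {
    "fotocasa": SiteConfig(
        name="fotocasa",
        base_url="https://www.fotocasa.es",
        catalog_url="https://www.fotocasa.es/es/comprar/viviendas/espana/todas-las-zonas/l",
        card_class="re-CardPackPremium-info",
        title_class="re-CardTitle re-CardTitle--big",
        price_class="re-CardPrice",
        next_page_class="sui-MoleculePagination-item",
    ),
}


def selectors_for(site_name: str) -> SiteConfig:
    """Look up the configuration of a website, failing loudly on an unknown name."""
    try:
        return SITES[site_name]
    except KeyError:
        raise ValueError(f"No scraping configuration for site: {site_name}") from None


print(selectors_for("fotocasa").price_class)  # re-CardPrice
```

With something like this, supporting a new website would mostly mean adding one more entry to the registry instead of editing the crawling code.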

This tutorial focuses only on the collection methodology and on the fixed structure used to scrape a commercial website.

##**Use-case and creation of the process of collection**

###**Particular use-case**

In our example we are going to scrape Fotocasa, a Spanish portal containing real estate data for Spain. It's basically a portal where owners advertise the properties they are selling.

In this case our target is data about properties for sale, not for rent.

The first step of any scraping process is deciding on the starting point.

The starting point should always be a url that contains the catalog of the products/services we are interested in.

It might be a url entirely dedicated to the catalog, or it could be any url of the domain where there is a dropdown menu that gives access to the catalog.

In our example, every url of the section dedicated to selling properties offers a dropdown menu with the catalog. We start at the main one, but it could be any other url in that section.

Firstly we'll access the url: https://www.fotocasa.es/es/comprar/viviendas/espana/todas-las-zonas/l

If we access this url, at the top left we can see a dropdown menu named 'Provincia' (State/Province in Spanish). If we click on it, it displays a list of all the provinces in Spain. Each province is an anchor element whose href attribute contains the url with the offers for that province.

Our bot is going to access that url, click on the dropdown menu, scroll to the bottom of it so that all provinces are displayed, and capture the url of each province.

At this point we have completed the first step of any commercial web scraping bot: we have all the root urls that we are going to scrape.

As an example: https://www.fotocasa.es/es/comprar/viviendas/asturias-provincia/todas-las-zonas/l

This url contains the offers for one specific province.

We are going to access it, scroll down the page and get the data of each property offered. Once we are at the bottom and there are no more advertisements, we capture the url of the next catalog page for that province (equivalent to pressing the next button) and repeat until there are no more advertisements for that province. We'll do this for each province.
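The loop just described (scrape a page, follow the next link, stop when there is none) can be sketched without any browser at all. This is a minimal illustration and not the code used later in the tutorial: `crawl_category`, `fake_fetch` and `FAKE_SITE` are hypothetical names, and the fake pages stand in for what a real Selenium-based fetcher would return.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page fetcher takes a url and returns (rows found on the page, next-page url or None).
PageFetcher = Callable[[str], Tuple[List[tuple], Optional[str]]]


def crawl_category(start_url: str, fetch_page: PageFetcher, max_pages: int) -> Iterator[tuple]:
    """Yield rows page by page until there is no next page or max_pages is reached."""
    url: Optional[str] = start_url
    pages_done = 0
    while url and pages_done < max_pages:
        rows, next_url = fetch_page(url)
        yield from rows
        pages_done += 1
        url = next_url  # None or "" ends the loop


# Stand-in for the real browser-based fetcher: three fake catalog pages.
FAKE_SITE = {
    "/asturias/l": ([("Flat A", "100"), ("Flat B", "120")], "/asturias/l/2"),
    "/asturias/l/2": ([("Flat C", "90")], "/asturias/l/3"),
    "/asturias/l/3": ([("Flat D", "75")], None),
}


def fake_fetch(url: str):
    return FAKE_SITE[url]


all_rows = list(crawl_category("/asturias/l", fake_fetch, max_pages=10))
first_two_pages = list(crawl_category("/asturias/l", fake_fetch, max_pages=2))
print(len(all_rows), len(first_two_pages))  # 4 3
```

Passing the fetcher as a parameter keeps the stopping logic testable on its own; in a real run it would be replaced by a function that drives the browser.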

We are capturing:

- Title of the advertisement.
- Price
- Price reduction from initial price.
- Rooms
- Bathrooms
- Meters
- Province

### **Technical process of collection**

Our bot will be coded in Python.

I am specifically using Selenium, a browser automation framework that lets us interact with webpages programmatically as if we were a human using a browser, including pages that need JavaScript to render.

I am also using a proxy provider for technical reasons explained in the code section.

You'll need to configure your proxy provider so that you can connect to its services from Python. In my case I whitelisted my IP in the proxy provider's management tools so I can access my proxies without providing a user or password; this may change between providers.

For these processes to be interesting they must scrape thousands or millions of urls, which means that the bigger your project is, the bigger your computing needs will be.

You'll need multiprocessing in order to be time and resource efficient; an example is shown in the "Getting bigger in scale" section. In my case, given the scale of this example, my 12 CPUs were enough. Each CPU scrapes one url, so I'm able to run 12 url scrapes at the same time.

This could be increased using a cloud provider, where you can get virtually unlimited computing capacity and CPUs. The code could also be tweaked to be even more efficient by running multiple threads inside each process, so the scale these bots can reach is quite high, in the order of tens of millions of urls.

Our data will be stored in a csv, but this step could easily be an insert into a relational database. Don't forget that another advantage of this kind of data collection is that our data is structured: we get exactly the fields we want, they can easily be modeled as relations, and the fields have fixed data types.

Basically, we are getting valuable structured data from unstructured public data.
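To illustrate what "fixed data types" can mean in practice, here is a small sketch that turns scraped strings into integers. The raw formats used here ("250.000 €", "3 habs.", "85 m²") are assumptions about how the website might display its values, and `parse_int` and `to_typed_row` are hypothetical helpers that are not part of the tutorial code.

```python
import re
from typing import Optional


def parse_int(raw: str) -> Optional[int]:
    """Return the first integer found in a scraped string, or None if there is none.

    Dots are treated as thousands separators (Spanish style), so "250.000 €" gives 250000.
    """
    match = re.search(r"\d+(?:\.\d{3})*", raw)
    if match is None:
        return None
    return int(match.group().replace(".", ""))


def to_typed_row(row: tuple) -> tuple:
    """Convert one scraped row of strings into a row with numeric fields."""
    title, price, reduction, rooms, bathrooms, meters, province = row
    return (
        title,
        parse_int(price),
        parse_int(reduction),
        parse_int(rooms),
        parse_int(bathrooms),
        parse_int(meters),
        province,
    )


raw_row = ("Piso en Reus", "250.000 €", "-", "3 habs.", "1 baño", "85 m²", "tarragona")
print(to_typed_row(raw_row))
# ('Piso en Reus', 250000, None, 3, 1, 85, 'tarragona')
```

Missing values (stored as "-" in this tutorial) become None, which maps naturally to NULL in a relational database.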

The data collected in this example has no commercial purpose. It was collected solely as an example, and only a small section of the domain was scraped, while being respectful with the server and not causing any problems to the host. This tutorial is also free of charge.

### **Code structure and explanation**

Import all necessary libraries and modules.

Mainly:

- Selenium, in order to render the JavaScript and be able to scroll down in the target webpages. We also import some utilities for waiting while interacting with the webpage, so we seem more human and give the page time to render. We also import settings related to user agents and connections, which are used to make our scraper look like a human.

- BeautifulSoup will be our HTML parser, so we can get the data we are interested in from the whole HTML of the target webpage.

- random, re, requests, time, csv, datetime and os as general utilities.

- Webdriver managers, which fetch the specific drivers we want our Selenium engine to run on. Mainly Chrome.


In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options as ChromeOptions
# Service is the Chrome service used by webdriver_instance below; the Firefox one is aliased so it does not shadow it.
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup
import random
import re
import requests
import time
import csv
from datetime import datetime
import os

Define a list of proxies. If you do this using your own IP, the server will block you after three or four automated requests. To avoid that, I use a proxy provider to get a list of residential proxies, so each connection is made from a different IP and hence a different location. The server doesn't block my scraper because it sees it as a human user.

My proxy provider is a well-known one: https://smartproxy.com/

In this case I just have to specify the proxies, because my IP is whitelisted with my proxy provider: when the provider's server detects that the connection attempt comes from my IP, it allows it. If you choose another proxy provider, this might work differently.

In my case these are residential proxies from Spain, high-quality IPs that won't make you look like a robot.

In [None]:
proxy_list = [('es.smartproxy.com',10001,"CHROME"),
              ('es.smartproxy.com',10002,"CHROME"),
              ('es.smartproxy.com',10003,"CHROME"),
              ('es.smartproxy.com',10004,"CHROME"),
              ('es.smartproxy.com',10005,"CHROME"),
              ('es.smartproxy.com',10006,"CHROME"),
              ('es.smartproxy.com',10007,"CHROME"),
              ('es.smartproxy.com',10008,"CHROME"),
              ('es.smartproxy.com',10009,"CHROME"),
              ('es.smartproxy.com',10010,"CHROME")]

We are going to functionally decompose the whole process, meaning that for each major step we define a function that does specifically that step. In case of errors it's easier to troubleshoot, because we know in which part of the process the error occurred and can go straight to the function that handles it and fix the code.

The smartproxy function sets the options of the webdriver that will be used for the scraping.

We basically enable the rendering of JavaScript, so that JavaScript-dependent elements, such as images or content loaded while scrolling down, are displayed and we are able to get data from them.

We also set a human-looking user agent. If you do not specify one, headless Chrome sends its default user agent, which contains "HeadlessChrome" and shows the server that we are a bot instead of a human; the server may then reject our connection. By faking a regular browser user agent we are able to seem human to the server.

Finally, we set the proxy options (HOSTNAME, PORT, DRIVER) through which the scraper will connect to the internet.

In [None]:
def smartproxy(HOSTNAME, PORT, DRIVER):
    """
    Sets up options for a Chrome driver such as user-agent, enabled javascript or proxy details.

    Parameters:
      - HOSTNAME: Hostname of the proxy.
      - PORT: Port number through which we connect to the proxy.
      - DRIVER: Driver we are going to use (Chrome, Firefox, Safari). Currently unused: Chrome options are always built.

    Returns:
      The options object already configured ready to be applied to a driver.
    """
    # Instantiate an ChromeOptions object (We are going to use Chrome driver, others could be used)
    options = ChromeOptions()
    # Activate javascript rendering
    options.add_argument("--enable-javascript")
    # Create a fake human user-agent
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
    # Specifies the mode of the Selenium engine, if headless is selected as I do in the following code line, then while scraping our browser doesn't pop up, in case of experimentation or solving errors, comment the line so
    # you are able to see what the scraper is doing visually.
    options.add_argument("--headless")
    # Disable blink features
    options.add_argument("--disable-blink-features=AutomationControlled")
    # Configure the driver to use our proxy details to connect to the internet
    proxy_str = '{hostname}:{port}'.format(hostname=HOSTNAME, port=PORT)
    options.add_argument('--proxy-server={}'.format(proxy_str))
    # We are not interested in images, so for efficiency reasons I disable the loading of images (Blink setting)
    options.add_argument("--blink-settings=imagesEnabled=false")

    return options

The next function, webdriver_instance, creates the driver we are going to use with the Selenium engine and applies to it the options built by the smartproxy function, which is called inside webdriver_instance.

First, we pick a random proxy from the proxy_list defined earlier and get its hostname, port and driver.

Then, we instantiate a Chrome driver and set its options to the ones returned by the smartproxy function above.

We are now able to create a driver with custom options.

In [None]:
def webdriver_instance():
    """
    Creates an instance and configures the options of a driver ready for Selenium to be used.
    """
    # Get a random value from the list of proxies.
    hostname, port, driver = random.choice(proxy_list)
    # Instantiate a Chrome driver and configure the options using the previously defined smartproxy function.
    browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                                   options=smartproxy(hostname, port, driver))

    return browser
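Picking a proxy with random.choice can select the same proxy several times in a row. As an alternative (not what the function above does), a round-robin rotation guarantees that every proxy is used before any is repeated. This is a minimal sketch: `next_proxy` is a hypothetical helper and the list is a shortened copy of proxy_list.

```python
import itertools

# Same shape as proxy_list above: (hostname, port, driver name).
proxy_list = [
    ("es.smartproxy.com", 10001, "CHROME"),
    ("es.smartproxy.com", 10002, "CHROME"),
    ("es.smartproxy.com", 10003, "CHROME"),
]

# itertools.cycle repeats the list forever, in order.
_proxy_cycle = itertools.cycle(proxy_list)


def next_proxy():
    """Return the next proxy in rotation, wrapping around at the end of the list."""
    return next(_proxy_cycle)


ports = [next_proxy()[1] for _ in range(4)]
print(ports)  # [10001, 10002, 10003, 10001]
```

Note that this rotation lives inside one Python process; with multiprocessing, each worker process would keep its own independent cycle.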

Now we are going to define a function that opens our starting url, searches for the dropdown menu, clicks on it so the provinces are shown, and gets all the urls that lead to each province.

The function returns a list with the urls that we need to scrape in order to get all the data from the domain.

We need to do this for every commercial domain we want to scrape. There might be small changes between commercial domains, for example:

  - Maybe the domain you are interested in has a page specifically dedicated to the catalog. Then just access it: you won't need to open any dropdown menu, and you can get the urls directly from the hrefs contained in the html code.

  - If the domain you are interested in also uses a dropdown menu, change the html container where the dropdown is located to the actual one for your website, and also change the anchor tags where the urls of the categories are located.
  As an example, you would have to change this line:
  `dropdown = driver.find_element(By.CLASS_NAME, "sui-MoleculeSelectPopover-select")`
  Replace the class name that contains my catalog, "sui-MoleculeSelectPopover-select", with your actual one.

  And then change this line, which gets the urls of the categories:
  `link_items = soup.find_all('a', class_="sui-LinkBasic")`
  Replace the class_="sui-LinkBasic" parameter with the class of the anchors where your urls are located.

In [None]:
def get_categories():
    """
    Opens a webpage and gets the urls of all categories. This version clicks a dropdown menu and reads the page code after
    the click. As explained, your website may have no dropdown menu and let you access the category urls directly,
    in which case you won't need that part.

    Returns:
      List of urls with all the categories we want to scrap.
    """
    # Instantiate and assign to a variable an already options configured driver using the previously defined function.
    driver = webdriver_instance()
    # Assign the catalog url to be scraped to a variable
    url = "https://www.fotocasa.es/es/comprar/viviendas/espana/todas-las-zonas/l"
    # Send a GET request to the url using the driver
    driver.get(url)
    # Sleep: the bot must not perform actions too quickly because we want to be seen as humans, so we wait 5 seconds.
    time.sleep(5)
    # Search for the dropdown menu inside the webpage; adapt the class name to the website you want to scrape.
    # If there is no dropdown menu and the catalog is a page of its own, skip this step, retrieve the whole code of the page and search for the tags
    # that contain the urls that lead to each category.
    dropdown = driver.find_element(By.CLASS_NAME, "sui-MoleculeSelectPopover-select")
    # Click on the displayable
    dropdown.click()
    # Get all the html code of the webpage, including the displayable.
    displayable_html = driver.page_source
    # Parse the html
    soup = BeautifulSoup(displayable_html, 'html.parser')
    # Search for the specific tags that contain the urls for each category.
    link_items = soup.find_all('a', class_="sui-LinkBasic")
    # Get the href(attribute that contains an url that redirects) for each tag.
    hrefs = [link.get('href') for link in link_items]
    for href in hrefs:
        print(href)
    # Sleep
    time.sleep(5)
    driver.quit()
    # Get the urls
    return hrefs
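The function above always sleeps exactly 5 seconds. Since the goal of the pauses is to look less like a bot, one possible refinement is to add some randomness to each pause. The sketch below is an assumption of how that could be done, not part of the original code: `human_pause` is a hypothetical helper, and its `sleep` parameter exists only so the function can be tried without actually waiting.

```python
import random
import time


def human_pause(base_seconds: float = 5.0, jitter_seconds: float = 2.0, sleep=time.sleep) -> float:
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    Returns the delay that was used. `sleep` is injectable so the function
    can be exercised without really waiting.
    """
    delay = base_seconds + random.uniform(0.0, jitter_seconds)
    sleep(delay)
    return delay


# Dry run: record the delay instead of sleeping.
recorded = []
used = human_pause(5.0, 2.0, sleep=recorded.append)
print(5.0 <= used <= 7.0)  # True
```

In the scraper, each `time.sleep(5)` could then become `human_pause(5, 2)`, giving pauses between 5 and 7 seconds.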

The function above gets all the urls of the catalog. It's very likely that you'll want to filter them based on certain characteristics: for example, you may not want all provinces, or you may not want certain categories.

The next function filters the list of urls we have retrieved. You can filter using whatever criteria you need; in my case I keep only the provinces that I'm interested in.

Just take a look at the list of urls retrieved by the previous function and exclude the words you are not interested in, or keep only the ones you are interested in.

To adapt this function all you need is a little experimentation: retrieve your list of urls, decide what you want and what you don't, then adapt the function and check the result. Once that is done, the step is completely automated for that specific website.

In [None]:
def filter_urls(hrefs, provinces):
    """
    Given a list of url and list of words, filters the list so that the urls remaining contain the words we are interested in.
    Also filters for several words we do not want the urls to have.

    Takes:
      - hrefs: list of urls.
      - provinces: list of words, we want our urls to include.

    Returns:
      Desired list of urls.
    """
    search_words = provinces
    # Define your pattern to search based in the words you want to search for.
    pattern = r'\b(?:' + '|'.join(re.escape(word) for word in search_words) + r')\b'
    # Filter the original list (anchors without an href yield None, so those are skipped)
    filtered_strings = [s for s in hrefs if s and re.search(pattern, s, re.IGNORECASE)]
    # Define an empty list to collect the filtered urls.
    result = []
    # Exclude urls containing words you don't want: add more `and not re.search(r"obra-nueva", url)` conditions as follows
    for url in filtered_strings:
        if not re.search(r"particulares", url) and not re.search(r"obra-nueva", url) and re.match(r"^/es/comprar/viviendas/", url):
            result.append(url)
    return result
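To see the filtering logic at work, here is a self-contained variant where the excluded words and the required prefix are parameters instead of being hard-coded. It is a sketch for experimentation: `filter_category_urls` is a hypothetical name and the sample hrefs are made up in the shape of the ones the dropdown menu returns.

```python
import re


def filter_category_urls(hrefs, include_words, exclude_words, required_prefix):
    """Keep hrefs that start with required_prefix, contain an include word and no exclude word."""
    include = re.compile(
        r"\b(?:" + "|".join(re.escape(word) for word in include_words) + r")\b",
        re.IGNORECASE,
    )
    kept = []
    for href in hrefs:
        # Anchors without an href give None; skip those and urls of other sections.
        if not href or not href.startswith(required_prefix):
            continue
        if any(word in href for word in exclude_words):
            continue
        if include.search(href):
            kept.append(href)
    return kept


# Made-up hrefs for illustration only.
sample_hrefs = [
    "/es/comprar/viviendas/tarragona-provincia/todas-las-zonas/l",
    "/es/comprar/viviendas/madrid-provincia/todas-las-zonas/l",
    "/es/comprar/viviendas/particulares/girona-provincia/todas-las-zonas/l",
    "/es/alquiler/viviendas/lleida-provincia/todas-las-zonas/l",
    None,
]

selected = filter_category_urls(
    sample_hrefs,
    include_words=["tarragona", "lleida", "barcelona", "girona"],
    exclude_words=["particulares", "obra-nueva"],
    required_prefix="/es/comprar/viviendas/",
)
print(selected)
# ['/es/comprar/viviendas/tarragona-provincia/todas-las-zonas/l']
```

Only the Tarragona url survives: Madrid is not in the include list, the Girona url contains an excluded word and the Lleida url belongs to the rental section.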

This is our function to store the data.

It takes a list of tuples with our data, the province/category the data belongs to, and a counter that tells us which page of that category's catalog the data comes from.

It writes the data we are interested in to a new csv, or appends it to an existing one.

For bigger projects, it would be pretty easy to replace this function with one that inserts into a database instead of writing a csv.

In [None]:
def get_csv(data, provincia, counter):
    # This function writes the csv file; if the file already exists, it simply appends the data at the end of it
    # Get today's date
    current_date = datetime.now().strftime("%Y_%m_%d")
    # Set the file name and path, generated from the execution date. If it was already generated, we detect it and append at the end
    csv_file = f"C:\\Users\\Usuario\\PycharmProjects\\PRA1\\fotocasa_{current_date}.csv" # Replace the path with the one you want

    # Check whether the file already exists
    file_exists = os.path.isfile(csv_file)

    # Write the file or append to it
    with open(csv_file, mode='a' if file_exists else 'w', newline='') as file:
        writer = csv.writer(file)

        if not file_exists:
            # Create a header
            header = ["Title", "Price", "Reduction", "Rooms", "Bathrooms", "Meters", "Province"]
            writer.writerow(header)

        # Write the data
        writer.writerows(data)
    # Print a success message
    print(f'Csv data written successfully {provincia} {counter}')
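As mentioned, the csv writer could be swapped for database inserts. A minimal sketch of that idea with SQLite from the standard library is shown below. The table name `listings`, the all-text column types and the `save_rows` helper are assumptions made for illustration; a real project would pick its own schema and database.

```python
import sqlite3


def save_rows(conn: sqlite3.Connection, rows) -> int:
    """Create the listings table if needed and insert the rows; return how many were inserted."""
    rows = list(rows)
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute(
            "CREATE TABLE IF NOT EXISTS listings ("
            "title TEXT, price TEXT, reduction TEXT, rooms TEXT, "
            "bathrooms TEXT, meters TEXT, province TEXT)"
        )
        # Placeholders keep the scraped text from being interpreted as SQL.
        conn.executemany("INSERT INTO listings VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return len(rows)


# In-memory database for the example; a file path would make it persistent.
conn = sqlite3.connect(":memory:")
inserted = save_rows(
    conn,
    [("Piso en Reus", "150.000 €", "-", "3 habs.", "1 baño", "80 m²", "tarragona")],
)
count = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
print(inserted, count)  # 1 1
```

The tuples have the same seven fields, in the same order, as the rows written to the csv.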

Finally we are going to use all the functions defined above in a main process that will do all the subprocesses.

In [None]:
def main(max_pages):
  # Get all the urls for each category
  provinces = ['tarragona','lleida','barcelona', 'girona']
  hrefs = get_categories()
  # Filter those urls for the ones we are interested in
  urls = filter_urls(hrefs, provinces)
  # Define the base url of the domain, the url list we are retrieved will probably be partial urls, meaning you will need to add the base domain url to acces an actual url.
  # If this is different for your website and it gives you full urls, you don't need to do this, otherwise you'll need to figure out what you have to add to those partial urls
  # to be able to access the actual urls.
  base_url = "https://www.fotocasa.es"
  # For every partial url, we get the province/category we are going to scrap, you'll need to figure out where in your partial urls this province/category is contained
  # For each of our partial urls we are going to do the following process.
  for i in urls:
      parts = i.split('/')
      for part in parts:
          if '-' in part:
              element = part.split('-')[0]
              break
      # Get the specific province of that url
      provincia = element
      # Set up a counter, this will be used to monitor how many pages in the catalog we want to deep in, each page of a given category will increase the counter in 1 until
      # we reach our max_pages parameter where the process will stop because we are not interested in deepening more in the catalog, maybe you want the full catalog, then just setup
      # a big number as the parameter.
      counter = 0
      # While the counter is <= than the max_pages parameter we'll keep scraping pages of the catalog.
      while counter <= max_pages:
        # If its the starting url for the category, we are simply getting the base_url + the partial url retrieved from the displayable or catalog starting page.
          if counter == 0:
              url = base_url + i
        # If its not the starting url, we are getting the base_url + the partial url obtained from getting the next link of the catalog
          else:
              url = base_url + next_link
          # Time the process
          start_time = time.time()
          # Create a webdriver
          driver = webdriver_instance()
          # Access the page
          driver.get(url)
          # Sleep, adapt your sleeps so the server doesn't block you but it also doesn't take an eternity to scrap all pages
          time.sleep(5)
          # Once we are on the page we are going to scroll down to show all its content
          scroll_height = 0
          while True:
              driver.execute_script("window.scrollTo(0, arguments[0]);", scroll_height)
              time.sleep(2)
              scroll_height += 1000
              if scroll_height >= driver.execute_script("return document.body.scrollHeight"):
                  break
         # Once we have scrolled the whole page, we are retrieving the page html code
          page_source = driver.page_source
          # Sleep, again adapt to your necessities
          time.sleep(7)
          # Close the driver
          driver.quit()
          # Parse the html code
          soup = BeautifulSoup(page_source, 'html.parser')
          # Get the containers that have each advertisement. That's usually a common structure in comercial websites, offers
          # will be contained in containers of the same class. Get all of these containers, each one is an advertisement.
          info_divs = soup.find_all('div', class_='re-CardPackPremium-info') # Change the name of the class for the specific name of the class of your website
          # If there are enough containers just pass and apply the normal process, because that means we have targeted the right class to get the containers.
          if len(info_divs) > 10:
              pass
          # Else means that was not the right class because we have not retrieved enough containers, try with a nother class where they might be.
          else:
              info_divs = soup.find_all('div', class_='re-CardPackAdvance-info') # Substitute for your specific class.
          # Create a list to store data
          flats = []
          # For each container get the data we are interested in.
          for info_div in info_divs:
              # Get the title of the advertisement
              # In order for an advertisement to exist it must have a title
              title = info_div.find('span', class_="re-CardTitle re-CardTitle--big").text # Substitute for your specific class inside the container that contains the title.
              # Get the price of the advertisement, targeting the class that contains the data, if it does not exist, mark the price as null
              precio_span = info_div.find('span', class_="re-CardPrice")
              price = precio_span.text if precio_span else "-"
              # Get the reduction of the price, targetting the class that contains the data, if it does not exists, mark the reduction as null
              reduccio_span = info_div.find('span', class_="re-CardPriceReduction")
              reduccio = reduccio_span.text if reduccio_span else "-"
              # Get the rooms of the property, targetting the class that contains the data, if it does not exists, mark the rooms as null
              rooms_span = info_div.find('span', class_="re-CardFeaturesWithIcons-feature-icon re-CardFeaturesWithIcons-feature-icon--rooms")
              rooms = rooms_span.text if rooms_span else "-"
              # Get the bathrooms of the property, targetting the class that contains the data, if it does not exists, mark the bathrooms as null
              bathrooms_span = info_div.find('span', class_="re-CardFeaturesWithIcons-feature-icon re-CardFeaturesWithIcons-feature-icon--bathrooms")
              bathrooms = bathrooms_span.text if bathrooms_span else "-"
              # Get the meters of the property, targetting the class that contains the data, if it does not exists, mark the meters as null
              meters_span = info_div.find('span', class_="re-CardFeaturesWithIcons-feature-icon re-CardFeaturesWithIcons-feature-icon--surface")
              meters = meters_span.text if meters_span else "-"
              # Save this data in a list as a tuple, at the end of the for loop this list will have the tuples with the all the data of the specific page of the catalog of the category
              flats.append((title, price, reduccio, rooms, bathrooms, meters, provincia))
          # Store the data in a csv using the previously defined function
          get_csv(flats,provincia, counter)
          # Get the partial url of the next page in the catalog
          li_tags = soup.find_all('li', class_='sui-MoleculePagination-item') # Target the container that has the url redirecting to next page. usually located at the Next button of the catalog.
          # We assume it does not exists
          href = ''
          # Get the anchor tag containing the next url, extract the href, it might be contained in several containers, we check each one and get the one that is not empty
          for li_tag in li_tags:
              a_tag = li_tag.find('a', class_='sui-AtomButton sui-AtomButton--primary sui-AtomButton--outline sui-AtomButton--center sui-AtomButton--small sui-AtomButton--link sui-AtomButton--empty sui-AtomButton--rounded') # Change for your targeted class
              if a_tag:
                  href = a_tag.get('href')
              else:
                  pass
          # Save the partial url for next iteration
          next_link = href
          # Get finishing time
          end_time = time.time()
          # Get total time of 1 iteration
          execution_time = end_time - start_time
          # Print which page we have scraped
          print(f'Page {counter + 1} of {provincia} scraped')
          # Print how much time it took
          print(f"Execution time: {execution_time:.4f} seconds")
          # Print how many advertisements we have collected data from
          print(len(flats))
          # Raise the counter, so we know we have passed a page in the catalog
          counter += 1
          # Sleep, adapt to your necessities
          time.sleep(20)
          # Stop if there is no next page (otherwise the next iteration would request the bare base url),
          # or if we have reached the max_pages limit for this category; then go to the next category url.
          if not next_link or counter > max_pages:
              break
  # Print finish of the process
  print('Successful')
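Two small steps of the main process are easy to try in isolation: getting the province out of a partial url and turning a partial url into a full one. The sketch below mirrors the split logic used above and, as an alternative to string concatenation, uses urljoin from the standard library. `province_from_path` and `absolute_url` are hypothetical helper names.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.fotocasa.es"


def province_from_path(path: str) -> str:
    """Return the text before the first hyphen of the first hyphenated path segment."""
    for segment in path.split("/"):
        if "-" in segment:
            return segment.split("-")[0]
    raise ValueError(f"No hyphenated segment in path: {path}")


def absolute_url(partial_or_full: str) -> str:
    """Resolve a partial href against BASE_URL; full urls pass through unchanged."""
    return urljoin(BASE_URL, partial_or_full)


partial = "/es/comprar/viviendas/tarragona-provincia/todas-las-zonas/l"
print(province_from_path(partial))  # tarragona
print(absolute_url(partial))
# https://www.fotocasa.es/es/comprar/viviendas/tarragona-provincia/todas-las-zonas/l
```

One limitation of this split logic is worth knowing: a multi-word segment such as the hypothetical `santa-cruz-de-tenerife-provincia` would be truncated to `santa`, so websites with such names need a different rule.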

In a nutshell, this code structure will work for most commercial websites with a catalog.

You must customize certain aspects of the code to meet the requirements of your specific website.

Those customizations can be summarized as:

  - Does the domain have a specific url for the catalog?
      - If yes, modify the get_categories function to skip the part where we open the dropdown menu.
  - You must also change all the html classes, including those of the anchors holding the hrefs, to match the structure of your target domain. Some functionality might differ, such as getting the next link: if your next link is located in a different container, modify the code to fit your specific needs.
  - Adapt the data extraction to the elements you want to extract: add fields if there are more, modify them if they are different, etc.

Other than that, the structure can be applied to most commercial websites in order to scrape their data.

### **Getting bigger in scale**

If you are planning to do this on a big scale, you will have to adapt the function for each website.

Maybe you want to monitor 10 or 15 websites. Then adapt your function for each one of them and test it to check that you are extracting the data correctly.

Once you have a function for each web domain you want to scrape, you'll probably need to run those functions with multiprocessing for time and efficiency reasons.

As I said, my computer has 12 CPUs. I always leave at least 1 or 2 CPUs for other purposes, so I can dedicate 10 CPUs to this process.
That means I can run 10 different web scraping processes simultaneously.

As a code example, it would be:

In [None]:
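import multiprocessing


def run(function):
    # pool.map needs a picklable callable, so we use a module-level helper instead of a lambda.
    return function()


if __name__ == "__main__":
    # Create a list of functions, one main function per website (func1 ... func10 are placeholders
    # that must be defined at module level so the worker processes can find them)
    functions = [func1, func2, func3, func4, func5, func6, func7, func8, func9, func10]

    # Create a pool of worker processes
    with multiprocessing.Pool() as pool:
        # Use the pool to map the functions to the worker processes
        # This will run the functions in parallel
        pool.map(run, functions)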

Here each per-website function is assumed to take no arguments (for example, a small wrapper that calls that website's main function with a fixed max_pages), so we just run all 10 processes at the same time with a multiprocessing pool.

If you want to get tricky and become much more efficient, you can also multithread some of the subprocesses that happen inside each main function. This would make the code run much faster.
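A minimal sketch of the multithreading idea, using ThreadPoolExecutor from the standard library, could look like this. The names `scrape_many` and `fake_scrape` are hypothetical, and the fake function stands in for the real work of loading and parsing one url, which is mostly waiting on the network and therefore suits threads well.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_many(urls, scrape_one, max_workers=4):
    """Run scrape_one(url) for every url on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(scrape_one, urls))


def fake_scrape(url):
    # Stand-in for the real page download and parsing.
    return (url, len(url))


results = scrape_many(["/a", "/bb", "/ccc"], fake_scrape)
print(results)  # [('/a', 2), ('/bb', 3), ('/ccc', 4)]
```

Each per-website function could call something like this for the category urls it has collected, while the multiprocessing pool still runs one website per process.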

You'll probably also need to handle errors with try blocks. But as I said before, it all comes down to testing each function, checking that it works and fixing it if it doesn't. Once you have tested each function, the bot gives you a huge data collection capacity, and you may not have to change anything for months while you keep extracting valuable data.
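As an example of what error handling with try blocks could look like, here is a small retry wrapper. It is only a sketch under assumptions: `with_retries` is a hypothetical helper, and the number of attempts and the waiting time should be tuned to the target website.

```python
import time


def with_retries(action, attempts=3, wait_seconds=5.0, sleep=time.sleep):
    """Call action() up to `attempts` times, sleeping between failures; re-raise the last error."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            print(f"Attempt {attempt} of {attempts} failed: {error}")
            if attempt < attempts:
                sleep(wait_seconds)
    raise last_error


# Demo: an action that fails twice and then succeeds (no real waiting in the demo).
calls = {"count": 0}


def flaky_action():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("blocked by server")
    return "ok"


outcome = with_retries(flaky_action, attempts=3, sleep=lambda seconds: None)
print(outcome, calls["count"])  # ok 3
```

In the scraper, the action could be the download and parsing of one catalog page, so that one blocked request does not stop the whole run.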

The only costs associated with this process are computing capacity and the proxy provider. With a good deal from your proxy provider, and once you have defined your functions for each target website, you'll be able to retrieve valuable data 24/7 all year long with little to no effort and at a very low economic cost. This data will probably provide you with business insights that could not be obtained otherwise, because it is real-time information.

## **Results**

Finally, I ran this example with a depth of 4 catalog pages per category/province. I filtered the urls and only used the 4 provinces that make up the region where I live.

The result was this csv:

https://github.com/zetag33/FotoCasa/blob/main/dataset/fotocasa_2023_11_06.csv

Structured and valuable market data at a low cost.

