## scraper.py & scraping.ipynb

This notebook automates the extraction of reviews from Google Maps. It handles dynamic content (scrolling to load more reviews and expanding long ones) and saves both a rating summary and the detailed reviews to structured CSV files for further analysis. The approach is modular and can be adapted to other scraping needs.

The final version of the code, refactored and modularized for scalability and maintainability, is in the scraper.py file inside the src directory. That version is the one intended for use. Implement all further updates and enhancements in scraper.py to keep the code consistent.

### Setting up WebDriver

Configure the Chrome WebDriver:

- chromedriver_path: Location of the chromedriver executable. Adjust it to match your environment.
- chrome_options: Custom browser options (here, image loading is disabled to speed up page loads).
- webdriver.Chrome(): Starts the Chrome browser with these settings.

In [None]:
import time
import pandas as pd

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

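# Path to the chromedriver executable (adjust to your environment)
chromedriver_path = '../chromedriver'
service = Service(chromedriver_path)

chrome_options = Options()
# Block image loading through a Chrome content-settings preference (2 = block)
chrome_options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)
driver = webdriver.Chrome(service=service, options=chrome_options)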
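
Because the chromedriver location depends on the local environment, it can help to validate it before starting the browser. The following is a minimal sketch and not part of the original notebook: `resolve_chromedriver` is a hypothetical helper that checks the configured path first and then falls back to any `chromedriver` found on the system `PATH`.

```python
import shutil
from pathlib import Path


def resolve_chromedriver(configured_path):
    """Return a usable chromedriver path or raise FileNotFoundError.

    Hypothetical helper: tries the configured path, then the system PATH.
    """
    candidate = Path(configured_path).expanduser()
    if candidate.is_file():
        return str(candidate.resolve())

    # Fall back to whatever `chromedriver` is discoverable on PATH.
    on_path = shutil.which("chromedriver")
    if on_path is not None:
        return on_path

    raise FileNotFoundError(
        f"chromedriver not found at {configured_path!r} or on PATH"
    )
```

With this in place, the service could be created as `Service(resolve_chromedriver('../chromedriver'))`, so a misconfigured path fails early with a clear message.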

### Navigating to the Target Page and Handling Cookies

- Open the target Google Maps page with driver.get().
- If a cookie-consent popup appears, dismiss it by clicking the 'Accept all' button (labelled 'Aceptar todo' in the Spanish locale used here). If the popup does not appear, the script moves on without failing.

In [None]:
target = "https://www.google.com/maps/place/Cafeter%C3%ADa+HD/@40.4361928,-3.7182915,17.02z/data=!4m8!3m7!1s0xd4228432da1a851:0x3d7986427fb2312e!8m2!3d40.4360712!4d-3.7132937!9m1!1b1!16s%2Fg%2F1tdxx9wf?entry=ttu&g_ep=EgoyMDI0MDkxOC4xIKXMDSoASAFQAw%3D%3D"
driver.get(target)
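from selenium.common.exceptions import TimeoutException

# Wait up to 5 seconds for the cookie-consent button
wait = WebDriverWait(driver, 5)

try:
    # The aria-label depends on the browser locale ('Aceptar todo' is Spanish)
    accept_button = wait.until(
        EC.element_to_be_clickable((By.XPATH, "//button[@aria-label='Aceptar todo']"))
    )
    accept_button.click()
    print("Clicked 'Aceptar todo'")
    time.sleep(1)  # Give the popup time to close
except TimeoutException:
    # The consent button did not appear; continue without it
    pass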
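
The consent button's label depends on the browser locale, so a single hard-coded `aria-label` may not match elsewhere. As a sketch, the XPath could be built from a list of candidate labels. `build_consent_xpath` is a hypothetical helper, and `'Accept all'` is an assumed English label that has not been verified against the live page.

```python
def _xpath_literal(text):
    """Quote `text` as an XPath 1.0 string literal."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    raise ValueError(f"cannot quote a label containing both quote types: {text!r}")


def build_consent_xpath(labels):
    """Build an XPath matching a <button> whose aria-label equals any label."""
    labels = list(labels)
    if not labels:
        raise ValueError("at least one label is required")
    conditions = " or ".join(
        f"@aria-label={_xpath_literal(label)}" for label in labels
    )
    return f"//button[{conditions}]"


# 'Accept all' is an assumption; 'Aceptar todo' comes from the notebook.
consent_xpath = build_consent_xpath(["Aceptar todo", "Accept all"])
```

The resulting string can be passed to the same `By.XPATH` lookup used for the cookie button.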

### Extracting Review Summary (Stars and Review Counts)

- Wait (up to 30 seconds) until the reviews container is present in the page.
- Collect the star-rating summary: each rating level and the number of reviews at that level. The values are stored in two lists (ratings and reviews_counts), which are then combined into a DataFrame.

In [None]:
# Extract the star-rating summary
wait = WebDriverWait(driver, 30)
reviews_container = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.m6QErb.DxyBCb.kA9KIf.dS8AEf.XiKgde')))

# One table row per rating level
raw_html_stars_resumme = reviews_container.find_elements(By.CSS_SELECTOR, 'tr.BHOKXe')

ratings = []
reviews_counts = []

for star in raw_html_stars_resumme:
    # The aria-label is expected to read "<stars> <word>, <count> <word>"
    rating_text = star.get_attribute('aria-label')
    rating_parts = rating_text.split(',')
    stars = rating_parts[0].split()[0]
    # Drop the '.' thousands separator before converting to int
    num_reviews = rating_parts[1].strip().split()[0].replace('.', '')
    ratings.append(int(stars))
    reviews_counts.append(int(num_reviews))

stars_resumme = pd.DataFrame({'stars': ratings, 'reviews': reviews_counts})
display(stars_resumme)
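
The split-based parsing above raises an error if a label does not have the expected shape. A more defensive variant could look like the sketch below. `parse_star_summary` is a hypothetical helper, and the sample labels are illustrative assumptions about the `aria-label` format, not values captured from the page.

```python
import re

# Leading star count, then a comma, then a count that may use '.' as a
# thousands separator (as the notebook's parsing assumes).
_SUMMARY_RE = re.compile(r"^\s*(\d+)\D*,\s*(\d[\d.]*)")


def parse_star_summary(aria_label):
    """Return (stars, review_count), or None if the label does not match."""
    match = _SUMMARY_RE.match(aria_label or "")
    if match is None:
        return None
    stars = int(match.group(1))
    review_count = int(match.group(2).replace(".", ""))
    return stars, review_count


# Illustrative labels (assumed format)
print(parse_star_summary("5 estrellas, 1.234 reseñas"))
print(parse_star_summary("no rating here"))
```

Returning `None` instead of raising lets the caller skip unexpected rows and keep the rest of the summary.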

### Extracting Detailed Reviews

- Iterate over all visible reviews and extract the reviewer’s name, rating, review text, date, and Local Guide information.
- If a review has a "See more" button, click it to expand the full text.
- Scroll down to load more reviews, and repeat until no new reviews are loaded.

In [None]:
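# Lists that accumulate one entry per extracted review

reviewers = []
ratings = []
review_texts = []
review_dates = []
local_guides = []
text_backups = []

# IDs of reviews already processed, so they are skipped on later passes
processed_review_ids = set()

# Seconds to wait for more reviews once the container height stops changing
SCROLL_PAUSE_TIME = 120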

last_height = driver.execute_script("return arguments[0].scrollHeight", reviews_container)
while True:
    # Extract all visible reviews
    raw_html_reviews = driver.find_elements(By.CSS_SELECTOR, 'div.jftiEf.fontBodyMedium')
    
    # Extract data for each review
    for review in raw_html_reviews:
        try:
            review_id = review.get_attribute('data-review-id')
            
            # Verify if review has been processed
            if review_id in processed_review_ids:
                continue  # Omit processed review
            
            processed_review_ids.add(review_id)
            
            # Expand truncated text by clicking any visible "See more" button
            try:
                more_buttons = review.find_elements(By.CSS_SELECTOR, 'button.w8nwRe')
                for button in more_buttons:
                    if button.is_displayed():
                        button.click()
                        time.sleep(1)
            except Exception as e:
                print(f"Could not expand this review with 'See more': {e}")

            # Each field is optional; fall back to an empty string when missing
            try:
                reviewer = review.find_element(By.CSS_SELECTOR, 'div.d4r55').text
            except Exception:
                reviewer = ''

            try:
                review_text = review.find_element(By.CSS_SELECTOR, 'span.wiI7pd').text
            except Exception:
                review_text = ''

            try:
                rating = review.find_element(By.CSS_SELECTOR, 'span.kvMYJc').get_attribute('aria-label')
            except Exception:
                rating = ''

            try:
                review_date = review.find_element(By.CSS_SELECTOR, 'span.rsqaWe').text
            except Exception:
                review_date = ''

            try:
                local_guide = review.find_element(By.CLASS_NAME, "RfnDt").text
            except Exception:
                local_guide = ''

            print(reviewer + ' - ' + rating + ' - ' + review_date + ' - ' + review_text) 

            reviewers.append(reviewer)
            review_texts.append(review_text)
            ratings.append(rating)
            review_dates.append(review_date)
            local_guides.append(local_guide)

            text_backups.append(review.text)

        except Exception as e:
            print(f"Error extracting a review: {e}")

    # Scroll to the bottom of the reviews container to trigger loading
    driver.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", reviews_container)
    time.sleep(10)

    new_height = driver.execute_script("return arguments[0].scrollHeight", reviews_container)
    print(f"new height: {new_height}, last height: {last_height}")
    # If the height did not change, give slow responses extra time to load
    if new_height == last_height:
        print('Height unchanged, waiting for more reviews to load')
        time.sleep(SCROLL_PAUSE_TIME)
        new_raw_html_reviews = driver.find_elements(By.CSS_SELECTOR, 'div.jftiEf.fontBodyMedium')
        # Stop when the extra wait produced no additional review elements
        if len(new_raw_html_reviews) == len(raw_html_reviews):
            print('No more reviews loaded, exiting scroll')
            break
    else:
        last_height = new_height
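
The stopping rule of the loop above (scroll, compare heights, wait longer once, then compare element counts) can be isolated from Selenium so it is easier to reason about and test. The sketch below is one possible factoring, not the code in scraper.py. `scroll_until_stable` and its callable parameters are hypothetical names, and the review-extraction step is deliberately left out.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, count_items,
                        short_pause=10, long_pause=120, sleep=time.sleep):
    """Scroll until neither the container height nor the item count grows.

    Returns the number of scrolls performed.
    """
    scrolls = 0
    last_height = get_height()
    while True:
        seen = count_items()
        scroll_to_bottom()
        scrolls += 1
        sleep(short_pause)

        new_height = get_height()
        if new_height != last_height:
            last_height = new_height
            continue

        # Height unchanged: allow one longer grace period for slow loads.
        sleep(long_pause)
        if count_items() == seen:
            return scrolls


class FakeFeed:
    """Stand-in for the reviews container, used only to exercise the loop."""

    def __init__(self, pages):
        self.pages = pages
        self.loaded = 1

    def height(self):
        return self.loaded * 100

    def count(self):
        return self.loaded * 10

    def scroll(self):
        if self.loaded < self.pages:
            self.loaded += 1


feed = FakeFeed(pages=3)
total_scrolls = scroll_until_stable(
    feed.height, feed.scroll, feed.count, sleep=lambda seconds: None
)
```

In the notebook, `get_height` and `scroll_to_bottom` would wrap the two `driver.execute_script` calls, and `count_items` would return the length of the `find_elements` result for the review selector.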

### Saving the Extracted Data to CSV


Once all the review data is collected, it is organized into pandas DataFrames and saved as two CSV files:
- collected_reviews_hd.csv contains the detailed reviews.
- resumme_hd.csv contains the summary of star ratings and the number of reviews at each rating level.

In [None]:
collected_data = pd.DataFrame({
    'author': reviewers,
    'local_guide_info': local_guides,
    'rating': ratings,
    'review': review_texts,
    'date_text': review_dates,
    'text_backup': text_backups
})
display(collected_data)
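
The rating column above holds the raw `aria-label` text rather than a number. For analysis it may be convenient to convert it, as in this sketch. `rating_to_int` is a hypothetical helper, and it assumes the label contains the star count as its first run of digits (for example `"5 estrellas"`), which has not been confirmed against the page.

```python
import re


def rating_to_int(aria_label):
    """Return the first integer found in the label, or None if there is none."""
    match = re.search(r"\d+", aria_label or "")
    return int(match.group()) if match else None


# Illustrative labels (assumed format)
print(rating_to_int("5 estrellas"))
print(rating_to_int(""))
```

Applied to the DataFrame, this could look like `collected_data['rating'].map(rating_to_int)`, stored in a new column so the original text is preserved.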

In [None]:
# Suffix shared by both output files
name = 'hd.csv'

# Detailed reviews
csv_file_path = '../data/raw/collected_reviews_'
collected_data.to_csv(csv_file_path + name, index=False)

# Star-rating summary
stars_resumme_path = '../data/raw/resumme_'
stars_resumme.to_csv(stars_resumme_path + name, index=False)
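
Writing fails if the output directory does not exist yet. One way to guard against that is to build both paths with `pathlib` and create the directory first. This is a sketch: `build_output_paths` is a hypothetical helper, and it keeps the file prefixes used in the cell above.

```python
from pathlib import Path


def build_output_paths(raw_dir, name):
    """Create raw_dir if needed and return (reviews_path, summary_path)."""
    raw = Path(raw_dir)
    raw.mkdir(parents=True, exist_ok=True)
    reviews_path = raw / f"collected_reviews_{name}"
    summary_path = raw / f"resumme_{name}"
    return reviews_path, summary_path
```

Both returned paths can be passed directly to `DataFrame.to_csv`, which accepts path-like objects.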