Cell 1: Install Required Libraries

In [1]:
!apt-get update
!apt-get install -y chromium-chromedriver
!pip install python-docx selenium spacy textblob pandas beautifulsoup4 requests python-dateutil
!python -m spacy download en_core_web_sm


Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Hit:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:11 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1,237 kB]
Get:12 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [2,692 kB]
Get:13 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packa

Cell 2: Import Libraries and Set Up Exclusion List

In [2]:
import time
import random
import re
import urllib.parse
import requests
import pandas as pd

from bs4 import BeautifulSoup
from docx import Document
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import spacy
from textblob import TextBlob
from dateutil.parser import parse  # Added for date parsing

# Initialize spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Domains to exclude from search results (matched as substrings of each result URL).
EXCLUDED_DOMAINS = [
    "https://www.zillow.com/",
    "https://www.reco.on.ca/",
    "https://en.m.wikipedia.org/",
    "https://en.wikipedia.org",
    "https://www.linkedin.com/",
    "https://www.canada.ca/",
    "https://www.ontario.ca/",
    "https://www.condoauthorityontario.ca/",
    "https://www.toronto.ca/",
    "https://www.realtor.ca/",
    "https://condos.ca/",
    "https://property.ca/",
    "https://agents.property.ca/",
    "https://www.zolo.ca/",
    "https://www.realtor.com/",
    "https://www.investopedia.com/",
    "https://www.expedia.com/",
    "https://www.homes.com/",
    "https://www.glassdoor.ca/",
    "https://www.reddit.com/",
    "https://www.apartments.com/",
    "https://strata.ca/",
    "https://www.rentcafe.com/",
    "https://rentals.ca/",
    "https://hotpads.com/",
    "https://properties.lefigaro.com",
    "https://www.kijiji.ca/",
    "https://www.apartmenthomeliving.com/",
    "https://www.bloomberg.com/",
    "https://www.apartmentlist.com/",
    "https://www.nytimes.com",
    "https://www.justanswer.com",
    "https://www.highrises.com/",
    "https://coastalheatpumps.com/",
    "https://airtekshop.com/",
    "https://www.a-plusquality.com/",
    "https://www.quora.com/",
    "https://www.youtube.com",
    "https://www.instagram.com/",
    "https://www.imdb.com/",
    "https://www.facebook.com/",
    "https://www.wsj.com/",
    "https://www.yelp.ca/",
    "https://www.pinterest.com/",
    "https://www.flickr.com/",
    "https://ca.linkedin.com/",
    "https://twitter.com/",
    "https://www.tiktok.com/",
    "https://www.airbnb.com/",
    "https://www.movoto.com/",
    "https://x.com/",
    "https://www.rent.com/",
    "https://www.movemeto.com/",
    "https://www.ziprecruiter.com/",
    "https://vimeo.com/",
    "https://www.amazon.com/",
    "https://www.rew.ca/",
    "https://www.crunchbase.com/",
    "https://ca.hotels.com/",
    "https://hotels.com/",
    "https://www.boston.com/",
    "https://www.bostonglobe.com/",
    "https://www.ctvnews.ca/",
    "https://www.brampton.ca/",
    "https://www.pas.gov.on.ca/",
    "http://www.ontario.ca/",
    "https://www.reca.ca/",
    "https://trreb.ca/",
    "https://www.mississauga.ca/",
    "https://london.ca/",
    "https://bostonagentmagazine.com/",
    "https://www.oshawa.ca/",
    "https://www.milton.ca/",
]

# URLs on an excluded domain are still kept if they contain this path segment,
# for example a blog page hosted on a listing site.
REQUIRED_SUBSECTION = "/blog/"


Cell 3: Load Keywords from the Uploaded DOCX File

In [3]:
def load_keywords(docx_filename):
    """
    Loads keywords from a DOCX file.
    Each non-empty paragraph is considered a keyword.
    """
    document = Document(docx_filename)
    keywords = [para.text.strip() for para in document.paragraphs if para.text.strip()]
    return keywords

# Update the filename as needed; this example loads 'keywords_CAO_updated_03_22.docx'.
keywords = load_keywords('keywords_CAO_updated_03_22.docx')
print("Loaded Keywords:")
print(keywords)


Loaded Keywords:
['toronto condos issues', 'ontario condos issue', 'pipe issue in Ontario condos', 'pipe issue in Toronto condos', 'Toronto Condos high rent', 'Ontario Condos high rent', 'Ontario condo heat', 'Toronto condo heat', 'Ontario condo costs', 'Toronto condo costs', 'Ontario condo safety', 'Toronto condo safety', 'Ontario condo noise', 'Toronto condo noise', 'Ontario condo parking', 'Toronto condo parking', 'Ontario condo lifespan', 'Toronto condo lifespan', 'Ontario condo amenities', 'Toronto condo amenities', 'condominium authority ontario', 'condominium corporation ontario', 'ontario condominium corporation maintenance fee', 'ontario condominium corporation high fee', 'ontario condo corporation managers bad', 'condominium authority ontario directors', 'Condominium Authority Tribunal', 'Real Estate Council Ontario', 'Ottawa condo issues', 'Mississauga condo issues', 'Brampton condo issues', 'Hamilton ontario condo issues', 'London ontario condo issues', 'Markham ontario con

Cell 4: Define a Helper Function for Random User Agents and Delay

In [4]:
def get_random_headers():
    """
    Returns HTTP headers with a randomly chosen User-Agent.
    """
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
    ]
    headers = {
        "User-Agent": random.choice(user_agents)
    }
    return headers

def random_delay(min_seconds=3, max_seconds=7):
    """
    Sleeps for a random time between min_seconds and max_seconds.
    """
    time.sleep(random.uniform(min_seconds, max_seconds))


Cell 5: Define the Function to Scrape Google Search Results

In [5]:
def is_allowed_url(url):
    """
    Checks if the URL is allowed:
      - If the URL belongs to one of the domains in EXCLUDED_DOMAINS,
        it is allowed only if it contains the REQUIRED_SUBSECTION.
      - URLs not matching any excluded domain are allowed.
    """
    for ex in EXCLUDED_DOMAINS:
        if ex in url:
            # URL is from an aggregator site; allow only if it includes the '/blog/' subsection.
            if REQUIRED_SUBSECTION not in url:
                return False
    return True

def scrape_google_selenium(query, num_results=10):
    """
    Uses Selenium to scrape Google search results for a given query from the first two pages.
    Before extracting results, it attempts to click on "Tools" and select the "Past year" filter.
    Returns a list of result URLs that are allowed according to the following rules:
      - They do not belong to the exclusion sites,
      - Or, if they do belong to those sites, they must include the '/blog/' subsection.
    """
    all_results = []
    # We'll iterate over two pages: page 1 (start=0) and page 2 (start=10)
    for start in [0, 10]:
        # URL-encode the query text and add the start parameter for pagination.
        query_encoded = urllib.parse.quote(query)
        url = f"https://www.google.com/search?q={query_encoded}&num={num_results}&start={start}"

        # Set up Chrome options for headless browsing.
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--window-size=1920,1080")
        chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36")
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        chrome_options.add_experimental_option('useAutomationExtension', False)
        chrome_options.add_argument("--disable-blink-features=AutomationControlled")

        # Initialize the Chrome webdriver.
        driver = webdriver.Chrome(options=chrome_options)
        driver.get(url)

        # Wait for the page to load.
        time.sleep(8)

        # Attempt to click the "Tools" button.
        try:
            # First candidate: Look for an element with text "Tools"
            tools_button = driver.find_element(By.XPATH, '//div[contains(text(), "Tools")]')
            tools_button.click()
            print("Clicked Tools button.")
            time.sleep(2)
        except Exception as e:
            print("Tools button not found or click not required:", e)

        # Attempt to select the "Past year" filter using multiple candidate selectors.
        past_year_button = None
        candidate_xpaths = [
            '//span[text()="Past year"]',
            '//span[contains(text(),"Past year")]',
            '//a[contains(text(),"Past year")]',
            '//*[contains(@aria-label, "Past year")]'
        ]
        for xpath in candidate_xpaths:
            try:
                past_year_button = driver.find_element(By.XPATH, xpath)
                if past_year_button:
                    try:
                        past_year_button.click()
                        print(f"Clicked Past year filter using selector: {xpath}")
                    except Exception as click_error:
                        # Fallback: use JavaScript to click if not interactable.
                        driver.execute_script("arguments[0].click();", past_year_button)
                        print(f"JS Clicked Past year filter using selector: {xpath}")
                    time.sleep(5)
                    break  # Exit the loop once we've successfully clicked.
            except Exception as e:
                print(f"Past year candidate with XPath {xpath} not found or not clickable:", e)

        if not past_year_button:
            print("Past year filter not found or click not required.")

        # Additional wait to ensure filtered results are loaded.
        time.sleep(5)

        # Get the page source.
        html = driver.page_source
        print(f"HTML snippet from Selenium (page start={start}):", html[:500])
        driver.quit()

        # Parse the HTML with BeautifulSoup.
        soup = BeautifulSoup(html, 'html.parser')
        page_results = []

        # Method 1: Extract URLs using the CSS selector.
        link_tags = soup.select('div.yuRUbf a')
        if link_tags:
            for tag in link_tags:
                link = tag.get('href')
                # Keep every link here; is_allowed_url() applies the exclusion and '/blog/' rule after the loop.
                if link:
                    page_results.append(link)

        # Fallback: Use regex extraction if no links were found.
        if not page_results:
            print(f"Method 1 did not find links for query '{query}' on page starting at {start}, trying regex-based extraction.")
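            regex_links = re.findall(r'<a href="/url\?q=(https?://[^&]+)&', html)
            for link in regex_links:
                # Redirect targets are percent-encoded; decode them before storing.
                # is_allowed_url() applies the exclusion and '/blog/' rule after the loop.
                link = urllib.parse.unquote(link)
                if link:
                    page_results.append(link)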

        if not page_results:
            print(f"Debug: No result links found for query '{query}' on page starting at {start}. Check if the HTML structure has changed.")

        # Append results from this page.
        all_results.extend(page_results)

    # Filter URLs: if URL belongs to an aggregator site, allow only if it contains '/blog/'.
    filtered_results = [url for url in all_results if is_allowed_url(url)]
    print(f"Filtered results: {len(filtered_results)} out of {len(all_results)} results allowed (blog filter applied on aggregator sites).")

    return filtered_results

# Optional: Test the function with a single keyword.
test_query = keywords[0]
print("Test query using Selenium:", test_query)
test_results = scrape_google_selenium(test_query)
print("Test results:")
for r in test_results:
    print(r)


Test query using Selenium: toronto condos issues
Clicked Tools button.
Past year candidate with XPath //span[text()="Past year"] not found or not clickable: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//span[text()="Past year"]"}
  (Session info: chrome=134.0.6998.165); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
#0 0x572658eb7ffa <unknown>
#1 0x572658976970 <unknown>
#2 0x5726589c8385 <unknown>
#3 0x5726589c85b1 <unknown>
#4 0x572658a173c4 <unknown>
#5 0x5726589ee2bd <unknown>
#6 0x572658a1470c <unknown>
#7 0x5726589ee063 <unknown>
#8 0x5726589ba328 <unknown>
#9 0x5726589bb491 <unknown>
#10 0x572658e7f42b <unknown>
#11 0x572658e832ec <unknown>
#12 0x572658e66a22 <unknown>
#13 0x572658e83e64 <unknown>
#14 0x572658e4abef <unknown>
#15 0x572658ea6558 <unknown>
#16 0x572658ea6736 <unknown>
#17 0x572658eb6e76 <unknown>
#18 0x7f9a2d250ac3 <un
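
The exclusion rule in is_allowed_url matches each excluded entry as a substring of the full URL, so scheme variants (http vs. https) need separate entries and an excluded address that appears inside another site's query string also triggers the rule. A minimal sketch of a hostname-based variant follows. It is an alternative, not the notebook's method: EXCLUDED_HOSTS, host_of and is_allowed_url_by_host are hypothetical names, and the host set is a shortened stand-in for EXCLUDED_DOMAINS.

```python
from urllib.parse import urlparse

# Hypothetical, shortened stand-ins for the notebook's EXCLUDED_DOMAINS and
# REQUIRED_SUBSECTION; hosts are bare (no scheme, no "www.", no trailing slash).
EXCLUDED_HOSTS = {"zillow.com", "reddit.com", "ontario.ca"}
REQUIRED_SUBSECTION = "/blog/"


def host_of(url):
    """Return the lowercase hostname of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_allowed_url_by_host(url):
    """
    Same rule as is_allowed_url, but compares hostnames instead of substrings:
    a URL on an excluded host (or one of its subdomains) is allowed only if
    its path contains REQUIRED_SUBSECTION. All other URLs are allowed.
    """
    host = host_of(url)
    on_excluded_host = any(
        host == ex or host.endswith("." + ex) for ex in EXCLUDED_HOSTS
    )
    if on_excluded_host:
        return REQUIRED_SUBSECTION in urlparse(url).path
    return True


# An excluded host without the blog path, the same host with it, and an
# unrelated host that merely mentions an excluded address in its query string.
print(is_allowed_url_by_host("https://www.zillow.com/homes/toronto"))
print(is_allowed_url_by_host("https://www.zillow.com/blog/condo-fees"))
print(is_allowed_url_by_host("https://example.com/?ref=https://www.zillow.com/"))
```

With this variant the third URL is kept, because its hostname is example.com; the substring check would reject it.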

Cell 6: Define Helper Functions for Content & Date Extraction
This cell defines the helper functions that extract only the relevant sentences from a page (sentence splitting with spaCy, sentiment scoring with TextBlob) and that extract the page's publication date.

In [6]:
def extract_relevant_sentences(text, keyword, sentiment_threshold=0.2):
    """
    Uses spaCy to split text into sentences and TextBlob to analyze sentiment.
    Returns a concatenated string of sentences that contain the keyword and have
    an absolute sentiment polarity above the threshold. If none meet criteria,
    returns a fallback excerpt (first 300 characters).
    """
    doc = nlp(text)
    relevant_sentences = []
    for sent in doc.sents:
        sentence_text = sent.text.strip()
        if keyword.lower() in sentence_text.lower():
            polarity = TextBlob(sentence_text).sentiment.polarity
            if abs(polarity) >= sentiment_threshold:
                relevant_sentences.append(sentence_text)
    if relevant_sentences:
        return " ".join(relevant_sentences)
    else:
        return text[:300]  # fallback excerpt

print("Function 'extract_relevant_sentences' created successfully.")

def fetch_relevant_content(url, keyword, sentiment_threshold=0.2):
    """
    Fetches page content from the given URL, extracts text from paragraph tags,
    and returns only the relevant sentences based on the keyword and sentiment.
    """
    try:
        response = requests.get(url, headers=get_random_headers(), timeout=10)
        response.raise_for_status()
    except Exception as e:
        print(f"Error fetching URL '{url}': {e}")
        return ""

    soup = BeautifulSoup(response.text, 'html.parser')
    paragraphs = soup.find_all('p')
    full_text = " ".join([para.get_text(separator=" ", strip=True) for para in paragraphs if para.get_text(strip=True)])

    print(f"Fetched {len(full_text)} characters from URL: {url}")
    return extract_relevant_sentences(full_text, keyword, sentiment_threshold)

print("Function 'fetch_relevant_content' created successfully.")

def fetch_article_date(url):
    """
    Attempts to extract the publication date from the page.
    Checks common meta tags and <time> elements.
    Tries to parse and reformat the date in a uniform format (e.g. 'Mar 15 2023').
    If parsing fails, returns the raw extracted date for exhaustive output.
    """
    extracted_date = ""
    try:
        response = requests.get(url, headers=get_random_headers(), timeout=10)
        response.raise_for_status()
    except Exception as e:
        print(f"Error fetching URL '{url}' for date extraction: {e}")
        return ""

    soup = BeautifulSoup(response.text, 'html.parser')

    # Check meta tags for published time.
    meta_date = (soup.find("meta", property="article:published_time") or
                 soup.find("meta", attrs={"name": "pubdate"}) or
                 soup.find("meta", attrs={"name": "publication_date"}))
    if meta_date and meta_date.get("content"):
        extracted_date = meta_date.get("content")
    else:
        # Check for <time> tag with datetime attribute.
        time_tag = soup.find("time")
        if time_tag:
            datetime_val = time_tag.get("datetime")
            if datetime_val:
                extracted_date = datetime_val
            elif time_tag.text:
                extracted_date = time_tag.text.strip()
        else:
            # Fallback: Look for date-like patterns (YYYY-MM-DD or MM/DD/YYYY).
            date_regex = re.compile(r'(\d{4}-\d{2}-\d{2})|(\d{1,2}/\d{1,2}/\d{2,4})')
            match = date_regex.search(response.text)
            if match:
                extracted_date = match.group(0)

    if extracted_date:
        try:
            # Parse the date and reformat to "Mon dd yyyy" (e.g. "Mar 15 2023").
            dt = parse(extracted_date, fuzzy=True)
            return dt.strftime("%b %d %Y")
        except Exception as e:
            print(f"Error parsing date '{extracted_date}' from URL '{url}': {e}")
            # Return the raw extracted date if parsing fails.
            return extracted_date

    return ""


print("Function 'fetch_article_date' created successfully.")
print("Cell 6 executed successfully, all helper functions are defined.")


Function 'extract_relevant_sentences' created successfully.
Function 'fetch_relevant_content' created successfully.
Function 'fetch_article_date' created successfully.
Cell 6 executed successfully, all helper functions are defined.
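
The last step of fetch_article_date, turning a raw date string into the uniform 'Mar 15 2023' form, can be tried in isolation. The sketch below is a standard-library stand-in for that step, not the notebook's implementation: it replaces dateutil's fuzzy parse with a short list of explicit formats, and normalize_date and CANDIDATE_FORMATS are hypothetical names. Like the original, it returns the raw string when nothing parses.

```python
from datetime import datetime

# Formats tried in order. This is a deliberately short, hypothetical list;
# dateutil's parse() in the notebook accepts far more variants.
CANDIDATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S%z",  # ISO timestamp with offset, as in many meta tags
    "%Y-%m-%d",             # YYYY-MM-DD
    "%m/%d/%Y",             # MM/DD/YYYY
]


def normalize_date(raw):
    """
    Reformat a raw date string as 'Mon dd yyyy' (strftime '%b %d %Y').
    Returns the raw string unchanged if no candidate format matches.
    """
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%b %d %Y")
        except ValueError:
            continue  # Try the next format.
    return raw


for sample in ["2023-03-15", "3/5/2024", "2024-01-02T03:04:05+00:00", "not a date"]:
    print(repr(sample), "->", repr(normalize_date(sample)))
```

The month abbreviation produced by '%b' depends on the active locale, so the output is only English under an English or default C locale.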


Cell 7: Loop Over Keywords, Extract Data, and Generate CSV
This cell uses the helper functions from Cell 6 and the Selenium-based search function (from Cell 5) to process each keyword. It extracts the relevant content and publication date from each URL and then writes the results into a CSV file.

In [7]:
from datetime import datetime, timedelta

# List to collect rows for the CSV.
rows = []
sr_no = 1

# Define the date cutoff: current date minus 300 days.
date_cutoff = datetime.now() - timedelta(days=300)

# Loop over each keyword.
for keyword in keywords:
    print(f"\nScraping results for: {keyword}")
    urls = scrape_google_selenium(keyword)  # This function is defined in Cell 5.

    for url in urls:
        print(f"Processing URL: {url}")
        # Extract only the relevant content (filtered by keyword and sentiment).
        relevant_content = fetch_relevant_content(url, keyword, sentiment_threshold=0.2)
        # Skip if content is empty or too short.
        if not relevant_content.strip() or len(relevant_content.strip()) < 50:
            print("Skipping URL due to empty or insufficient content.")
            continue

        # Extract publication date.
        publication_date = fetch_article_date(url)
        include_row = True

        # If a publication date is present, try to filter based on recency.
        if publication_date:
            try:
                pub_date_obj = datetime.strptime(publication_date, "%b %d %Y")
            except ValueError:
                # If parsing with strptime fails, try using dateutil's parse.
                try:
                    # Drop any timezone so the comparison with the naive cutoff cannot raise.
                    pub_date_obj = parse(publication_date, fuzzy=True).replace(tzinfo=None)
                except Exception as e:
                    print(f"Error parsing publication date '{publication_date}' from URL '{url}': {e}")
                    pub_date_obj = None

            # If a valid date is obtained, compare with the cutoff.
            if pub_date_obj:
                if pub_date_obj < date_cutoff:
                    print(f"Skipping URL because publication date {pub_date_obj} is older than cutoff.")
                    include_row = False

        # Only include the row if content is present and (if date exists) it's recent.
        if include_row:
            # Calculate sentiment score using TextBlob.
            sentiment_score = TextBlob(relevant_content).sentiment.polarity

            # Determine sentiment type based on score.
            if sentiment_score > 0.25:
                sentiment_type = "positive"
            elif sentiment_score < -0.25:
                sentiment_type = "negative"
            else:
                sentiment_type = "neutral"

            rows.append({
                'Sr No.': sr_no,
                'Keyword': keyword,
                'URL': url,
                'Date': publication_date,
                'Content': relevant_content,
                'Sentiment Score': sentiment_score,
                'Sentiment Type': sentiment_type
            })
            sr_no += 1

        # Small delay after each processed URL, whether or not its row was kept.
        time.sleep(1)

    # Additional delay between keywords.
    time.sleep(3)

# Convert the collected rows to a DataFrame.
df = pd.DataFrame(rows)

# Export the DataFrame to a CSV file.
csv_filename = "CAO_data.csv"
df.to_csv(csv_filename, index=False)
print(f"CSV file '{csv_filename}' generated with {len(df)} rows.")


Streaming output truncated to the last 5000 lines.
#12 0x5ca4c1f49a22 <unknown>
#13 0x5ca4c1f66e64 <unknown>
#14 0x5ca4c1f2dbef <unknown>
#15 0x5ca4c1f89558 <unknown>
#16 0x5ca4c1f89736 <unknown>
#17 0x5ca4c1f99e76 <unknown>
#18 0x7ea77447dac3 <unknown>

JS Clicked Past year filter using selector: //a[contains(text(),"Past year")]
HTML snippet from Selenium (page start=0): <html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="en"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Toronto condo costs - Google Search</title><script type="text/javascript" id="www-widgetapi-script" src="https://www.youtube.com/s/player/69f581a5/www-widgetapi.vflset/www-widgetapi.js" async="" nonce=""></script><script nonce="">window._hst=Date.now();</s
Clicked Tools button.
Past year candidate with XPath //span[text()="Past year"] not found or not clickable: M
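
The two per-row decisions made in Cell 7, the 300-day recency cutoff and the sentiment label, can be separated from the scraping loop and checked without any network access. The sketch below assumes the same thresholds as the cell (0.25 for the label, 300 days for the cutoff). The names label_sentiment and is_recent are hypothetical, and the dateutil fallback is left out, so a date that strptime cannot read is simply kept.

```python
from datetime import datetime, timedelta


def label_sentiment(score, threshold=0.25):
    """Map a polarity score in [-1, 1] to 'positive', 'negative' or 'neutral'."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def is_recent(date_text, now, max_age_days=300):
    """
    Return True if a 'Mon dd yyyy' date is no older than max_age_days before now.
    Rows with an empty or unparseable date are kept, as in Cell 7.
    """
    if not date_text:
        return True
    try:
        published = datetime.strptime(date_text, "%b %d %Y")
    except ValueError:
        return True  # Unparseable date: keep the row rather than drop it.
    return published >= now - timedelta(days=max_age_days)


# Fixed reference time so the example does not depend on the current date.
reference = datetime(2024, 6, 1)
print(label_sentiment(0.4), label_sentiment(-0.4), label_sentiment(0.1))
print(is_recent("Jan 01 2024", reference), is_recent("Jan 01 2023", reference))
```

Passing the reference time in as an argument, instead of calling datetime.now() inside the function, keeps the check reproducible.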