# Crawl data from a VOZ thread using Selenium
Link: [đánh giá - Google pixel - xuất sắc ở tầm giá 4->7 triệu (trải nghiệm sử dụng, cách mua)](https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/)

### Setup

In [None]:
# ### python environment (venv, virtualenv)
# %pip install selenium==4.12.0
# %pip install openpyxl==3.1.2
# %pip install numpy==1.25.2
# %pip install pandas==2.1.0
# ### conda environment (anaconda, miniconda)
# %conda install selenium==4.12.0
# %conda install openpyxl==3.1.2
# %conda install numpy==1.25.2
# %conda install pandas==2.1.0

| Steps to crawl data
| ---------------------
| 1. Install and open Google Chrome.
| 2. Type ``chrome://version`` in the URL bar.
| 3. Copy the **Profile Path**.
| 4. Split the copied **Profile Path** into ``profile_path`` and ``--profile-directory``. For example, if **Profile Path** is ``C:\Users\Vinh\AppData\Local\Google\Chrome\User Data\Default``, then ``profile_path`` is ``C:\Users\Vinh\AppData\Local\Google\Chrome\User Data`` (drop the trailing ``\Default``) and ``--profile-directory`` is ``Default``.
| 5. (Important) Open the link [đánh giá - Google pixel - xuất sắc ở tầm giá 4->7 triệu (trải nghiệm sử dụng, cách mua)](https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/). This is to save cookies and avoid the Cloudflare CAPTCHA.
| 6. Close all running Google Chrome processes.
| 7. Run the code to crawl the data.
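
Step 4 can also be done in code rather than by hand. The following is a minimal sketch, assuming a Windows-style path as in the example above; `split_profile_path` is a hypothetical helper name and is not part of the notebook or of Selenium.

```python
from pathlib import PureWindowsPath


def split_profile_path(profile_path: str) -> tuple[str, str]:
    """Split Chrome's Profile Path into (user data dir, profile directory name)."""
    path = PureWindowsPath(profile_path)
    # The parent folder is the value for --user-data-dir;
    # the last path component is the value for --profile-directory.
    return str(path.parent), path.name


user_data_dir, profile_directory = split_profile_path(
    r"C:\Users\Vinh\AppData\Local\Google\Chrome\User Data\Default"
)
print(user_data_dir)
print(profile_directory)
```

The two returned values correspond to ``profile_path`` and ``--profile-directory`` in the table. `PureWindowsPath` is used so that the split behaves the same regardless of the operating system the notebook runs on.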

In [9]:
import random
from time import sleep
import re
import os

import numpy as np
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import (
    ElementClickInterceptedException,
    ElementNotInteractableException,
    NoSuchElementException,
    StaleElementReferenceException,
)
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# Create a ChromeOptions object
options = webdriver.ChromeOptions()

# Chrome user data folder (the Profile Path without its last component)
profile_path = r"C:\Users\Vinh\AppData\Local\Google\Chrome\User Data"
options.add_argument(f"--user-data-dir={profile_path}")

# Name of the profile folder to open (the last component of the Profile Path)
options.add_argument("--profile-directory=Profile 1")

# Start Chrome; Selenium 4.6+ locates a matching chromedriver through Selenium Manager
driver = webdriver.Chrome(options=options)

# crawl page
url = "https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/"
print("Crawl page:", url)
driver.get(url)
sleep(random.randint(10, 12))

Crawl page: https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/page-338
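
The logged URL ends in `page-338` although `url` in the cell has no page suffix, which suggests the run that produced this output started from a later page. A minimal sketch of building such a starting URL, assuming the `page-N` suffix seen in the log; `build_page_url` is a hypothetical helper and is not part of the notebook.

```python
def build_page_url(thread_url: str, page: int) -> str:
    """Return the URL of one page of a thread; page 1 is the bare thread URL."""
    # Normalise to exactly one trailing slash before adding the suffix.
    base = thread_url.rstrip("/") + "/"
    if page <= 1:
        return base
    return f"{base}page-{page}"


thread_url = "https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/"
print(build_page_url(thread_url, 338))
```

Passing the result to `driver.get` instead of the bare thread URL would let a crawl pick up from a chosen page.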


In [10]:
def click_element(driver, element):
    driver.execute_script(
        "arguments[0].click();",
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable(element)),
    )


def remove_element(driver, element):
    driver.execute_script("arguments[0].remove();", element)


def expand_all_hidden_elements(driver):
    hidden_elements = driver.find_elements(By.CSS_SELECTOR, ".bbCodeBlock-expandLink")
    for element in hidden_elements:
        try:
            element.click()
            sleep(random.randint(3, 5))
        except (ElementNotInteractableException, ElementClickInterceptedException):
            # skip expand links that are hidden or covered by another element
            continue


def remove_duplicate_quote_elements(driver):
    root_quotes = driver.find_elements(By.CSS_SELECTOR, "blockquote")
    for root_quote in root_quotes:
        try:
            quotes = root_quote.find_elements(By.CSS_SELECTOR, "div.bbCodeBlock-title")
            for quote in quotes:
                parent_quote = quote.find_element(By.XPATH, "..")
                remove_element(driver, parent_quote)
        except (NoSuchElementException, StaleElementReferenceException):
            # the quote was already removed together with an outer quote
            continue


def get_user_infomation(driver):
    user_elements = driver.find_elements(By.CSS_SELECTOR, "h4.message-name [href]")
    username_list = []
    usertitle_list = []
    join_date_list = []
    message_counts_list = []
    reactions_counts_list = []
    points_list = []
    for idx, user_element in enumerate(user_elements):
        # open the member tooltip and give it time to render
        click_element(driver, user_element)
        sleep(0.8)
        username_element = driver.find_elements(
            By.CSS_SELECTOR, "span.memberTooltip-nameWrapper"
        )[idx]
        usertitle_element = driver.find_elements(By.CSS_SELECTOR, "span.userTitle")[idx]
        join_date_element = driver.find_elements(
            By.CSS_SELECTOR,
            "div.memberTooltip-blurbContainer > div:nth-child(2) > dl > dd > time",
        )[idx]
        message_counts_element = driver.find_elements(
            By.CSS_SELECTOR, "dl.pairs--rows--centered:nth-child(1) > dd > a"
        )[idx]
        reactions_counts_element = driver.find_elements(
            By.CSS_SELECTOR, "dl.pairs--rows--centered:nth-child(2) > dd"
        )[idx]
        points_element = driver.find_elements(
            By.CSS_SELECTOR, "dl.pairs--rows--centered:nth-child(3) > dd > a"
        )[idx]

        username_list.append(username_element.text)
        usertitle_list.append(usertitle_element.text)
        join_date_list.append(join_date_element.text)
        message_counts_list.append(message_counts_element.text)
        reactions_counts_list.append(reactions_counts_element.text)
        points_list.append(points_element.text)

    return dict(
        username=username_list,
        usertitle=usertitle_list,
        join_date=join_date_list,
        message_counts=message_counts_list,
        reactions_counts=reactions_counts_list,
        points=points_list,
    )


def get_message_infomation(driver):
    # get message elements
    date_elements = driver.find_elements(By.CSS_SELECTOR, "ul.message-attribution-main")
    comment_elements = driver.find_elements(By.CSS_SELECTOR, ".bbWrapper")

    # get message data
    date_list = []
    comment_list = []
    for date_element, comment_element in zip(date_elements, comment_elements):
        comment = comment_element.text
        # collapse runs of newlines into one
        comment = re.sub(r"\n{2,}", "\n", comment)
        # collapse runs of spaces into one
        comment = re.sub(r" {2,}", " ", comment)

        date_list.append(date_element.text)
        # store the cleaned text, not the raw element text
        comment_list.append(comment)

    return dict(comment_date=date_list, comment=comment_list)


def get_next_page(driver):
    try:
        next_page_button = driver.find_element(By.CSS_SELECTOR, ".pageNav-jump--next")
        driver.execute_script("arguments[0].click();", next_page_button)
        sleep(random.randint(10, 12))
        return True
    except NoSuchElementException:
        return False


def get_page_number(driver):
    url = driver.current_url

    # Extract the page number from a URL ending in "page-N"
    page_number_match = re.search(r"page-(\d+)", url)

    if page_number_match:
        # return an int so both branches have the same type
        return int(page_number_match.group(1))
    # the first page has no "page-N" suffix
    return 1


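def save_to_sheet(data: dict, excel_file_path: str, sheet_name: str):
    df = pd.DataFrame(data)

    # Append mode needs the openpyxl engine; replace the sheet if the page was crawled before
    with pd.ExcelWriter(
        excel_file_path, mode="a", engine="openpyxl", if_sheet_exists="replace"
    ) as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False)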


def crawl_page(driver):
    # create new excel file
    if not os.path.exists("data.xlsx"):
        df = pd.DataFrame()
        df.to_excel("data.xlsx")
    # crawl page
    while True:
        print("crawl page:", driver.current_url)
        expand_all_hidden_elements(driver)
        remove_duplicate_quote_elements(driver)
        message_data = get_message_infomation(driver)
        user_data = get_user_infomation(driver)
        merged_data = {**message_data, **user_data}
        page_number = get_page_number(driver)
        save_to_sheet(merged_data, "data.xlsx", str(page_number))
        if not get_next_page(driver):
            print("no next page")
            break


crawl_page(driver)

crawl page: https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/page-338
crawl page: https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/page-339
crawl page: https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/page-340
crawl page: https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/page-341
crawl page: https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/page-342
crawl page: https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/page-343
crawl page: https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/page-344
crawl page: https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-trai-nghiem-su-dung-cach-mua.122469/page-345
crawl page: https://voz.vn/t/google-pixel-xuat-sac-o-tam-gia-4-7-trieu-t