# Scraping specific information from detail pages using Selenium

The objective of this section is to scrape specific information from the detail pages of all Chinese novels on the Wuxia World website. The key information includes the title, author, genres, rating, number of chapters, number of reviews, the titles of all the chapters, and details of the reviews.

For an introduction to the essence and basic process of web scraping, see https://rayobyte.com/blog/web-crawling-vs-web-scraping/

## 1. Installing Selenium and Browser Driver

We use Selenium because the Wuxia World website has an anti-scraping mechanism, so complete data can only be obtained by simulating a browser. For installation instructions and documentation, see https://selenium-python.readthedocs.io/index.html

Note: Before running the code, install the required libraries: Selenium, BeautifulSoup, and lxml (imported below for XPath parsing). Open a command prompt or terminal and run the following pip commands:

pip install selenium
pip install beautifulsoup4
pip install lxml

In [89]:
# Import the required libraries
import csv
import os
import re
import time
from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

Here, we take ChromeDriver as an example. Please follow the setup instructions at https://sites.google.com/chromium.org/driver/getting-started

## 2. Analyzing the webpage structure

Firstly, we need to browse the page structure of the target website and examine the web pages to determine their types.

Based on observation, we have noticed that all the Chinese novels are listed in alphabetical order in a table on the webpage. Each target novel's title corresponds to a hyperlink that leads to its detail page. In the table, only partial information, such as the novel's name and rating, is displayed. To access more detailed information, one needs to click on the hyperlink. Therefore, our overall web scraping approach is as follows:

(1) Obtain a complete list of links corresponding to all the novels;
(2) Iterate through the target link list to scrape specific information from each webpage and store it locally.

## 3. Obtaining the URL list

Using the developer tool, we discovered that the URL is contained within the "href" attribute of the novel's title. To obtain a complete list of URLs for the novels, we will follow the steps of locating the novel's title and extracting the "href" attribute value of the element.

Selenium provides various methods for locating elements, such as XPath, ID, and name. In this case, we will use XPath to locate the element. Once we have obtained the XPath of the corresponding element, we will use the syntax "@href" to select the element's attribute.
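
To make the "locate the element, then read its href" idea concrete without launching a browser, here is a minimal sketch on a hand-written snippet. It is an illustration only: the markup and the novel slugs are invented, and it uses the standard library's `xml.etree.ElementTree` rather than lxml. ElementTree supports only a subset of XPath and cannot return attribute values directly, so the attribute is read with `.get()` in a second step.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed snippet standing in for the listing page
# (hypothetical markup and slugs, not copied from the real site)
SAMPLE = """
<div id="loading-container-replacement">
  <div><p><a href="/novel/example-one">Example One</a></p></div>
  <div><p><a href="/novel/example-two">Example Two</a></p></div>
  <div><p><a>Title without a link</a></p></div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Step 1: locate the title links.
# ".//p/a[@href]" keeps only <a> elements that carry an href attribute.
links = root.findall(".//p/a[@href]")

# Step 2: read the attribute value from each located element.
# With lxml, both steps collapse into a single XPath ending in "/@href".
hrefs = [a.get("href") for a in links]

print(hrefs)
```

The extracted values are relative paths, which is why the later steps prepend the site's base URL before visiting each detail page.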

In [90]:
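# Specify the path of the chromedriver and launch the browser
# (Selenium 4 takes the driver path through a Service object; recent releases no longer accept executable_path)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(executable_path='/Users/wanshuo/Desktop/Master/DH_MA_thesis/Dataset/chromedriver_mac_arm64/chromedriver'))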
# Navigate to the URL of wuxiaworld
driver.get('https://www.wuxiaworld.com/novels')

# Wait for 10 seconds for the page to load
time.sleep(10)
    
# Click the 'Chinese' checkbox to show only Chinese novels
driver.find_element(By.XPATH,'//*[@id="loading-container-replacement"]/div/div[1]/div[2]/div/div/div/div[1]/div/label[2]/span[2]').click()

# Wait for 10 seconds for the page to load
time.sleep(10)
    
# Scroll down the page multiple times to load all the novel data
for i in range(60):
    # Simulate the Page Down key press using ActionChains
    ActionChains(driver).key_down(Keys.PAGE_DOWN).key_up(Keys.PAGE_DOWN).perform()
    # Wait for a short time for the page to load
    time.sleep(0.2)
        
# Create an empty list to store all Chinese novels URLs
all_url_list = []

# XPath that selects the "href" attribute of each novel title link
url_xpath = '//*[@id="loading-container-replacement"]/div/div[2]/div/div/div/div/div/div/div/div/div[2]/p/a/@href'

# Get the HTML source of the page
html = driver.page_source
# Parse the HTML source using the etree.HTML function
tree = etree.HTML(html)
# Extract the list of URLs using the XPath expression
url_list = tree.xpath(url_xpath)
# Extend the all_url_list with the extracted URL list
all_url_list.extend(url_list)

# Close the browser driver; the next step starts a fresh one
driver.quit()
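
The fixed 60 Page Down presses above work only as long as 60 is enough to reach the end of the list. A possible alternative, sketched below, is to keep scrolling until the page height stops growing. This is an assumption-laden sketch rather than the method used in this notebook: `scroll_until_stable` is a hypothetical helper, and it assumes the novel list extends the document height as it loads. It relies only on the driver's `execute_script` method.

```python
import time

def scroll_until_stable(driver, pause=0.2, max_rounds=200):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scroll rounds performed. `max_rounds` is a safety
    limit so the loop always terminates.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page to trigger lazy loading
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was appended during this round
            break
        last_height = new_height
    return rounds
```

With a real browser this would be called as `scroll_until_stable(driver)` in place of the `for i in range(60)` loop. A single unchanged reading can be premature on a slow connection, so a longer `pause` may be needed.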
    


## 4. Iterating through the URL list to scrape specific information from the detail pages

After obtaining the complete URL list of the novels, we need to iterate through the list to retrieve the details page of each novel. On this page, the key content we need to scrape includes:

(1) Title;
(2) Author;
(3) Genres;
(4) Rating;
(5) Number of chapters;
(6) Number of reviews;
(7) The titles of all the chapters;
(8) Details of the reviews.

Please note that the above content is the crucial information we need to extract from the details page of each novel.
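
One way to keep these eight fields together in memory is a small record type. The sketch below is illustrative only: `NovelRecord` and its field names are hypothetical and are not used by the cells that follow, which collect the values in separate lists.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class NovelRecord:
    """One scraped novel; every field defaults to an empty value."""
    title: str = ''
    author: str = ''
    genres: list = field(default_factory=list)
    rating: str = ''
    num_chapters: str = ''
    num_reviews: str = ''
    chapter_titles: list = field(default_factory=list)
    review_texts: list = field(default_factory=list)

# Example values are invented for illustration
record = NovelRecord(title='Example Novel', genres=['Action', 'Xianxia'])
record.chapter_titles.append('Chapter 1')

# asdict() turns the record into a plain dictionary, convenient for saving
print(asdict(record))
```

Keeping rating and counts as strings mirrors the raw text that XPath extraction returns; converting them to numbers can be done later during cleaning.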

### 4.1 Scraping the Title, Author, Genres, Rating, Number of chapters, and Number of reviews from the novel pages corresponding to each URL 

In [74]:
# Browser driver configuration and usage
# Create an instance of ChromeOptions to customize Chrome browser settings
CHROME_OPTIONS = webdriver.ChromeOptions()
# Image loading preference: 1 loads images, 2 blocks them.
# Blocking images saves time when they are not being scraped.
prefs = {"profile.managed_default_content_settings.images": 2}
# Add the image display preference to the Chrome options
CHROME_OPTIONS.add_experimental_option("prefs", prefs)
# Create a Chrome WebDriver instance with the specified driver executable path and options
# (Selenium 4 takes the driver path through a Service object; recent releases no longer accept executable_path)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(executable_path='/Users/wanshuo/Desktop/Master/DH_MA_thesis/Dataset/chromedriver_mac_arm64/chromedriver'), options=CHROME_OPTIONS)

# XPaths of the basic novel information
title_xpath = '//*[@id="loading-container-replacement"]/div/div[1]/div/div/div[2]/div[1]/div[2]/h1'
author_xpath = '//*[@id="loading-container-replacement"]/div/div[1]/div/div/div[2]/div[3]/div[1]/div[2]'
genres_xpath = '//*[@id="full-width-tabpanel-0"]/div/div[1]/div[2]/div/a/div/div'
rating_xpath = '//*[@id="loading-container-replacement"]/div/div[1]/div/div/div[2]/div[2]/div/span/span'
chapters_xpath = '//*[@id="full-width-tabpanel-0"]/div/div[1]/div[1]/div[1]/div[2]'
reviews_xpath = '//*[@id="loading-container-replacement"]/div/div[1]/div/div/div[2]/div[2]/div/div/span'

# Create empty lists to store the title, author, genres, rating, chapters and reviews
all_title = []
all_author = []
all_genres = []
all_rating = []
all_chapters = []
all_reviews = []

def first_text(tree, xpath):
    # Return the first text node matched by the XPath, or '' when the element is missing,
    # so that one incomplete page does not stop the whole loop with an IndexError
    nodes = tree.xpath(xpath + '/text()')
    return nodes[0] if nodes else ''

# Iterate through the list of URLs
for url_item in all_url_list:
    url = 'https://www.wuxiaworld.com' + url_item
    driver.get(url)
    time.sleep(5)
    # Parse the webpage
    html = driver.page_source
    tree = etree.HTML(html)

    # Extract the text nodes of novel basic information
    all_title.append(first_text(tree, title_xpath))
    all_author.append(first_text(tree, author_xpath))

    # There may be multiple genres, concatenate them with a semicolon
    genres_list = tree.xpath(genres_xpath + '/text()')
    all_genres.append(';'.join(genres_list))

    all_rating.append(first_text(tree, rating_xpath))
    all_chapters.append(first_text(tree, chapters_xpath))
    all_reviews.append(first_text(tree, reviews_xpath))

# Specify the file path and open the CSV file in write mode with UTF-8 encoding
with open('data_list.csv', 'w', encoding='utf-8', newline='') as csvfile:
    # Create a CSV writer object
    writer = csv.writer(csvfile)
    # Write the header row with column names
    writer.writerow(['title','author','genres','rating','chapters','reviews','url'])
    # Iterate through each URL and corresponding data, and write them as rows in the CSV file
    for url_item, row in zip(all_url_list, zip(all_title, all_author, all_genres, all_rating, all_chapters, all_reviews)):
        # Convert the row elements to a list and append the URL item to the end
        writer.writerow(list(row) + [f'https://www.wuxiaworld.com{url_item}'])

# Close the browser driver
driver.quit()
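
The cell above keeps six parallel lists and zips them together when writing. A possible alternative, sketched here under stated assumptions, is to build one dictionary per novel and write it with `csv.DictWriter`, so a value cannot end up in the wrong column. `make_record` and `write_records` are hypothetical helpers, the sample values are invented, and an in-memory buffer stands in for `data_list.csv`.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = 'https://www.wuxiaworld.com'
FIELDS = ['title', 'author', 'genres', 'rating', 'chapters', 'reviews', 'url']

def make_record(href, **fields):
    """Build one row; fields that were not scraped default to an empty string."""
    record = {name: fields.get(name, '') for name in FIELDS}
    # urljoin turns the relative href into an absolute URL
    record['url'] = urljoin(BASE_URL, href)
    return record

def write_records(records, fileobj):
    """Write a header row followed by one row per record."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)

# Invented sample data standing in for scraped values
records = [
    make_record('/novel/example-novel', title='Example Novel',
                author='Someone', genres='Action;Xianxia'),
    make_record('/novel/another-example', title='Another Example'),
]

buffer = io.StringIO(newline='')
write_records(records, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open('data_list.csv', 'w', encoding='utf-8', newline='')` as `fileobj`.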



### 4.2 Scraping the titles of all the chapters from the novel pages corresponding to each URL

In [88]:
# Browser driver configuration and usage
# Create an instance of ChromeOptions to customize Chrome browser settings
CHROME_OPTIONS = webdriver.ChromeOptions()
# Image loading preference: 1 loads images, 2 blocks them.
# Blocking images saves time when they are not being scraped.
prefs = {"profile.managed_default_content_settings.images": 2}
# Add the image display preference to the Chrome options
CHROME_OPTIONS.add_experimental_option("prefs", prefs)
# Create a Chrome WebDriver instance with the specified driver executable path and options
# (Selenium 4 takes the driver path through a Service object; recent releases no longer accept executable_path)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(executable_path='/Users/wanshuo/Desktop/Master/DH_MA_thesis/Dataset/chromedriver_mac_arm64/chromedriver'), options=CHROME_OPTIONS)

# Open the novel page
driver.get('https://www.wuxiaworld.com/novel/tranxending-vision')

# Wait for 10 seconds for the page to load
time.sleep(10)

# Click the 'Chapters' button
driver.find_element(By.XPATH,'//*[@id="full-width-tab-1"]').click()

# Wait for 10 seconds for the page to load
time.sleep(10)

# XPath of the chapter names
title_chapters_xpath = '//*[@id="full-width-tabpanel-1"]/div/div[2]/div/div[2]/div/div/div/div/div/div/div/a/div/div[1]/div[1]/span'

# Find the clickable fascicle headers on the Chapters tab
fascicle_elements = driver.find_elements(By.XPATH, '//*[@id="full-width-tabpanel-1"]/div/div[2]/div/div[1]/div[1]/section/div/span')
# Determine the number of fascicles by counting the elements found
num_fascicles = len(fascicle_elements)

# Create an empty list to store the chapter titles captured after each click
all_title_chapters = []

# Iterate through each fascicle
for i in range(num_fascicles):
    # Click on each fascicle
    fascicle_elements[i].click()
    time.sleep(5)
    # Parse the webpage
    html = driver.page_source
    tree = etree.HTML(html)
    # Extract the text nodes of chapter names using the specified XPath
    title_chapters = tree.xpath(title_chapters_xpath + '/text()')
    # Join the extracted chapter names into a single string separated by newline characters
    content = '\n'.join(title_chapters)
    # Append the content (chapter names) to the list of all_title_chapters
    all_title_chapters.append(content)

# Specify the file path where the chapter names will be written
file_path = os.path.join(os.getcwd(), 'all_title_chapters.txt')
# Open the file in write mode and specify the encoding as UTF-8
with open(file_path, 'w', encoding='utf-8') as f:
    # Write the content of the last loop (i.e., all chapter names) to the file
    f.write(all_title_chapters[-1])

# Close the browser driver
driver.quit()
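
The cell above handles a single novel and always writes to `all_title_chapters.txt`. If the same steps were repeated for every URL in the list, each novel would need its own output file. One hypothetical way to do that, sketched below, is to derive the file name from the last segment of the novel's URL. `slug_from_url` and `chapter_file_name` are illustrative names, not part of the original code.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def slug_from_url(url):
    """Return the last path segment of a novel URL (absolute or relative)."""
    return PurePosixPath(urlparse(url).path).name

def chapter_file_name(url, suffix='_chapters.txt'):
    """Build a per-novel output file name from the URL slug."""
    return slug_from_url(url) + suffix

print(chapter_file_name('https://www.wuxiaworld.com/novel/tranxending-vision'))
print(chapter_file_name('/novel/rmji'))
```

The resulting name can be passed to `os.path.join(os.getcwd(), ...)` in the same way as the fixed file name above.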


### 4.3 Scraping the details of the reviews from the novel pages corresponding to each URL

In [92]:
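# Specify the path of the chromedriver and launch the browser
# (Selenium 4 takes the driver path through a Service object; recent releases no longer accept executable_path)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(executable_path='/Users/wanshuo/Desktop/Master/DH_MA_thesis/Dataset/chromedriver_mac_arm64/chromedriver'))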
# Open the novel page
driver.get('https://www.wuxiaworld.com/novel/rmji')

# Wait for 10 seconds for the page to load
time.sleep(10)

# Click the 'View all' reviews link (replace this XPath if it differs for another novel)
driver.find_element(By.XPATH,'//*[@id="full-width-tabpanel-0"]/div/div[3]/div[2]/div[2]/div/div[2]/div/span').click()

# Wait for 10 seconds for the page to load
time.sleep(10)

# Create an empty list to store all the detail reviews
detail_reviews_list= []

# Get the reviews on the first page
html = driver.page_source  # Get the HTML source code of the current page
soup = BeautifulSoup(html, "html.parser")  # Create a BeautifulSoup object to parse the HTML
# Find all the review elements with the specified class
# Note: The class represents the CSS styles applied to the review elements
data = soup.find_all('div', class_="absolute top-0 -z-10 line-clamp-1 font-set-r15-h150 text-gray-t1 sm2:font-set-r16-h150")

data = data[3:]  # Remove the first three matched elements from the list of review elements
detail_reviews_list.extend(data)  # Add the remaining reviews to the detail_reviews_list

# Use a set to keep track of already retrieved reviews
seen_reviews = {r.get_text(strip=True) for r in detail_reviews_list}

# Get the reviews from the next page
# XPath of the 'next page' button (the last item in the pagination bar)
next_page_xpath = '/html/body/div[2]/div[3]/div/div/div/div[2]/div[3]/nav/ul/li[last()]/button'
next_page = driver.find_element(By.XPATH, next_page_xpath)

# Keep clicking the next page button until it's disabled
while next_page.is_enabled():
    next_page.click()
    time.sleep(5)
    html = driver.page_source  # Get the HTML source code of the current page
    soup = BeautifulSoup(html, "html.parser")  # Create a BeautifulSoup object to parse the HTML
    # Find all the review elements with the specified class
    # Note: The class represents the CSS styles applied to the review elements
    data = soup.find_all('div', class_="absolute top-0 -z-10 line-clamp-1 font-set-r15-h150 text-gray-t1 sm2:font-set-r16-h150")

    # Add the unseen reviews to the detail_reviews_list
    for review in data:
        text = review.get_text(strip=True)
        if text not in seen_reviews:
            detail_reviews_list.append(review)
            seen_reviews.add(text)
    # Re-locate the next page button, waiting up to 10 seconds for it to be present
    next_page = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, next_page_xpath)))

# Open a text file named 'detail_reviews.txt' in write mode, using UTF-8 encoding
with open('detail_reviews.txt', 'w', encoding='utf-8') as f:
    # Iterate through each detail_review in the detail_reviews_list
    for detail_review in detail_reviews_list:
        # Write the stripped text of the detail_review to the file
        f.write(detail_review.get_text(strip=True))
        # Write two newline characters to create a blank line between reviews
        f.write('\n\n')

# Close the browser driver        
driver.quit()
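
The `seen_reviews` set in the cell above implements an order-preserving de-duplication across pages. The same idea is shown below in isolation on plain strings, so it can be tried without a browser. `collect_unique` is a hypothetical helper and the review texts are invented.

```python
def collect_unique(pages):
    """Merge the reviews of several pages, keeping the first occurrence of each text."""
    seen = set()
    unique = []
    for page in pages:
        for text in page:
            if text not in seen:
                seen.add(text)
                unique.append(text)
    return unique

# Invented sample: the same review shows up again on later pages
pages = [
    ['Great read', 'Slow start'],
    ['Slow start', 'Loved the ending'],
    ['Great read'],
]

print(collect_unique(pages))
```

The set gives constant-time membership checks, while the list keeps the reviews in the order they first appeared. Two different reviewers who wrote identical text are counted once, which is also true of the cell above.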
