## Quest Data Scraper - wowhead.com

**Collects quest title, objective and description text from wowhead.com**

The following files must be in your current working directory before running a search, because the generated URLs are category-specific:

*   'wow_quest_categories.csv'
*   'wow_classic_quest_categories.csv'
*   'wow_wotlk_quest_categories.csv'
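
A quick way to confirm the files are in place before starting a run is sketched below. The helper names `category_file_for` and `missing_category_files` are hypothetical and not part of the notebook; the naming rule mirrors how the notebook later builds `file_path` from `QUEST_ERA`.

```python
import os

def category_file_for(era=""):
    # Hypothetical helper: same naming rule the notebook uses for file_path
    era_part = era + "_" if era else ""
    return "wow_{}quest_categories.csv".format(era_part)

# The three eras referenced in this notebook: retail (''), classic, wotlk
REQUIRED_FILES = [category_file_for(era) for era in ("", "classic", "wotlk")]

def missing_category_files(directory="."):
    # Return the required files that are not present in `directory`
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(directory, name))]

if __name__ == "__main__":
    missing = missing_category_files(".")
    if missing:
        print("Missing category files:", missing)
```

An empty list from `missing_category_files` means all three category files were found.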

wowhead.com/quests loads at most 1000 quests from its database at a time, so the remaining data can only be reached by filtering.
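
As a rough illustration of working around the limit, the sketch below builds one filtered listing URL per ID window. The filter string `30:30;5:1;{high}:{low}` is copied from the "Search by Index Interval" cell further down; its exact meaning on wowhead.com is an assumption here, and `filtered_listing_urls` is a hypothetical helper.

```python
def filtered_listing_urls(era="", start=0, end=3000, interval=1000):
    # Hypothetical helper: one listing URL per [low, high) window
    era_prefix = era + "/" if era else ""
    urls = []
    for low in range(start, end, interval):
        high = low + interval
        urls.append(
            "https://www.wowhead.com/{}quests?filter=30:30;5:1;{}:{}".format(
                era_prefix, high, low
            )
        )
    return urls

if __name__ == "__main__":
    for url in filtered_listing_urls("classic", 0, 2000, 1000):
        print(url)
```

Keeping each window at or below 1000 is meant to keep every listing under the page limit, assuming the windows correspond to quest ID ranges.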

This notebook uses Selenium to load each wowhead.com quest directory page, then searches the page source for URLs that match the quest URL format.
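
The URL matching step can be tried in isolation on a string of HTML. The sketch below uses the same quest URL pattern as the cells further down; `extract_quest_urls` is a hypothetical helper, and the sample HTML (including the quest ID and slug) is made up for illustration.

```python
import re

def extract_quest_urls(html, era=""):
    # Hypothetical helper: find quest URLs in page source, keeping first-seen order
    era_prefix = re.escape(era + "/") if era else ""
    pattern = r"https://www\.wowhead\.com/{}quest=\d+/[a-z0-9\-]+".format(era_prefix)
    seen = {}
    for match in re.findall(pattern, html):
        seen.setdefault(match, None)  # dict keys double as an ordered set
    return list(seen)

# Made-up sample: two links to the same quest and one non-quest link
SAMPLE_HTML = (
    '<a href="https://www.wowhead.com/classic/quest=1/example-quest">A</a>'
    '<a href="https://www.wowhead.com/classic/quest=1/example-quest">B</a>'
    '<a href="https://www.wowhead.com/classic/item=2/example-item">C</a>'
)

if __name__ == "__main__":
    print(extract_quest_urls(SAMPLE_HTML, era="classic"))
```

Only the quest link is returned, and it is returned once even though it appears twice.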

In [1]:
import os

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
os.chdir('drive/MyDrive/cmpt413_proj')

In [None]:
!pip install -r requirements.txt

Collecting jupyter_contrib_nbextensions (from -r requirements.txt (line 6))
  Downloading jupyter_contrib_nbextensions-0.7.0.tar.gz (23.5 MB)
  Preparing metadata (setup.py) ... done
Collecting jupyter_nbextensions_configurator (from -r requirements.txt (line 7))
  Downloading jupyter_nbextensions_configurator-0.6.3-py2.py3-none-any.whl (466 kB)
Collecting datasets~=2.12.0 (from -r requirements.txt (line 10))
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting sacrebleu (from -r requirements.txt (line 12))
  Downloading sacrebleu-2.4.0-py3-none-any.whl (106 kB)

## Collect URLs:

In [None]:
QUEST_ERA = ''

In [None]:
%%shell
# Set up for running selenium in Google Colab (a cell magic such as %%shell must be the first line of the cell)
# You don't need to run this cell in a Jupyter notebook or another local Python setup
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb
CHROME_DRIVER_VERSION=`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip -P /tmp/
unzip -o /tmp/chromedriver_linux64.zip -d /tmp/
chmod +x /tmp/chromedriver
mv /tmp/chromedriver /usr/local/bin/chromedriver
pip install selenium

Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:5 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease [18.1 kB]
Get:6 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Hit:7 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:9 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:10 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [1,346 kB]
Hit:11 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Get:12 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [664 kB]
Get:13 https://ppa.



In [None]:
!pip install chromedriver_autoinstaller

Collecting chromedriver_autoinstaller
  Downloading chromedriver_autoinstaller-0.6.3-py3-none-any.whl (7.6 kB)
Installing collected packages: chromedriver_autoinstaller
Successfully installed chromedriver_autoinstaller-0.6.3


In [None]:
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import chromedriver_autoinstaller

# setup chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# set path to chromedriver as per your configuration
chromedriver_autoinstaller.install()

# set up the webdriver
driver = webdriver.Chrome(options=chrome_options)

In [None]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

import pandas as pd

import re
import csv

In [None]:
driver = webdriver.Chrome(options=chrome_options)

In [None]:
def replace_characters(s):
    s = s.lower()
    # Remove apostrophes
    s = s.replace("'", "")
    # Replace spaces with hyphens
    s = s.replace(" ", "-")
    s = s.replace(".", "-")
    s = s.replace(":", "-")
    return s
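
To see what this normalization does to category names, here is a small self-contained check. The function body is restated from the cell above; the input strings are illustrative guesses at how names might appear in the CSV files, chosen so that the first two reproduce slugs visible in the printed `categories` output.

```python
def replace_characters(s):
    # Restated from the notebook cell above
    s = s.lower()
    s = s.replace("'", "")   # remove apostrophes
    s = s.replace(" ", "-")  # spaces become hyphens
    s = s.replace(".", "-")
    s = s.replace(":", "-")
    return s

if __name__ == "__main__":
    for name in ["Keeper's Respite", "9.1 Campaign", "Acherus: The Ebon Hold"]:
        print(name, "->", replace_characters(name))
```

A name containing a colon followed by a space produces a double hyphen (`acherus--the-ebon-hold`), which may be why the next cell overrides that entry by hand for `wotlk`.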

In [None]:
if (QUEST_ERA != ''):
  quest_era_str = QUEST_ERA + "_"
else:
  quest_era_str = ''

file_path = 'wow_{}quest_categories.csv'.format(quest_era_str)
df = pd.read_csv(file_path)

categories = {}

for c in df.columns:
  category_name = replace_characters(c)

  sub_categories = df[c]
  sub_categories_fixed = []
  for s in sub_categories:
      if pd.isna(s):
          continue
      s_fixed = replace_characters(s)
      sub_categories_fixed.append(s_fixed)

  categories[category_name] = sub_categories_fixed

if QUEST_ERA == 'wotlk':
  categories['northrend'][0] = 'acherus-the-ebon-hold'

In [None]:
print(categories)

{'dragonflight': ['azmerloth', 'dragon-isles', 'dragonscale-expedition', 'dream-wardens', 'dreamsurge', 'dungeon', 'emberflow', 'emerald-dream', 'engine-of-innovation', 'eons-fringe', 'evoker', 'iskaara-tuskarr', 'little-scales-daycare', 'maruuk-centaur', 'morqut-village', 'obsidian-citadel', 'ohnahran-plains', 'ohniri-springs', 'passage-of-time', 'primalist-storms', 'professions', 'raid', 'sharpscale-coast', 'special', 'suffusion-camps', 'tempest-unleashed', 'thaldraszus', 'the-azure-span', 'the-forbidden-reach', 'the-primalist-future', 'the-waking-shores', 'the-waking-shores', 'time-rifts', 'trading-post', 'trial-of-style', 'tyrhold', 'valdrakken', 'valdrakken-accord', 'world-pvp', 'zaralek-cavern', 'zskera-vaults'], 'shadowlands': ['9-1-campaign', 'abominable-stitching', 'ardenweald', 'bastion', 'covenant-assaults', 'covenant-sanctum', 'dungeon', 'ember-court', 'keepers-respite', 'korthia', 'kyrian-callings', 'maldraxxus', 'necrolord-callings', 'night-fae-callings', 'oribos', 'path-

In [None]:
#@title Search by Categories

# Specify a starting point (should be a top-level category from the .csv files)
starting_point = 'northrend'
if (QUEST_ERA != ''):
  quest_era_str = QUEST_ERA + "/"
else:
  quest_era_str = ''

id_start = 0
id_end = 9

print_debug = False  # True = print output from console in terminal ; False = no print output

urls = []

def examine_url(url):

    print("examining url: {}".format(url))
    driver.get(url)
    driver.implicitly_wait(10)  # Sets a 10 s timeout for element lookups; it does not pause execution here
    html_content = driver.page_source

    pattern = r"https://www.wowhead.com/{}quest=\d+/[a-z0-9\-]+".format(quest_era_str)
    # Use regular expressions to find all occurrences of the pattern
    matches = re.findall(pattern, html_content)

    if len(matches) == 0:
      # Define the pattern to search for
      pattern = "https://www.wowhead.com/{}quest=".format(quest_era_str)
      # Use regular expressions to find all occurrences of the pattern
      matches = re.findall(rf'{re.escape(pattern)}\d+', html_content)

    counter = 0
    for match in matches:
        url_name = str(match)
        if (url_name not in urls):
          urls.append(str(match))

    print("url counter: {}".format(len(urls)))

has_started = False

for category in categories:
  if category == starting_point and not has_started:
    has_started = True

  if not has_started:
    continue

  if len(categories[category]) > 0:
    for sub_category in categories[category]:
      for i in range(id_start, id_end+1):
        # Left-align i and pad with zeros on the right: 0 -> "000", 1 -> "100", ..., 9 -> "900"
        padded_string = "{:<03}".format(str(i))
        path_to_quest_page = "https://www.wowhead.com/{}quests/{}/{}#{}".format(
            quest_era_str, category, sub_category, padded_string
        )
        url = path_to_quest_page
        examine_url(url)
  else:
    for i in range(id_start, id_end+1):
      padded_string = "{:<03}".format(str(i))
      path_to_quest_page = "https://www.wowhead.com/{}quests/{}#{}".format(
          quest_era_str, category, padded_string
      )
      url = path_to_quest_page
      examine_url(url)


filename = "urls_raw_{}.csv".format(QUEST_ERA)
with open(filename, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    # Write each row to the CSV file
    for row in urls:
        writer.writerow([row])

print("found {} urls!".format(len(urls)))

Streaming output truncated to the last 5000 lines.
url counter: 5264
examining url: https://www.wowhead.com/quests/kalimdor/shadowglen#100
url counter: 5264
examining url: https://www.wowhead.com/quests/kalimdor/shadowglen#200
url counter: 5264
examining url: https://www.wowhead.com/quests/kalimdor/shadowglen#300
url counter: 5264
examining url: https://www.wowhead.com/quests/kalimdor/shadowglen#400
url counter: 5264
examining url: https://www.wowhead.com/quests/kalimdor/shadowglen#500
url counter: 5264
examining url: https://www.wowhead.com/quests/kalimdor/shadowglen#600
url counter: 5264
examining url: https://www.wowhead.com/quests/kalimdor/shadowglen#700
url counter: 5264
examining url: https://www.wowhead.com/quests/kalimdor/shadowglen#800
url counter: 5264
examining url: https://www.wowhead.com/quests/kalimdor/shadowglen#900
url counter: 5264
examining url: https://www.wowhead.com/quests/kalimdor/silithus#000
url counter: 5335
examining url: https://www.wowhead.com/

In [None]:
#@title Search by Index Interval
if (QUEST_ERA != ''):
  quest_era_str = QUEST_ERA + "/"
else:
  quest_era_str = ''

id_start = 0
id_end = 9

print_debug = False  # True = print output from console in terminal ; False = no print output

urls = []

def examine_url(url):
    print("examining url: {}".format(url))
    # Assuming 'driver' is already initialized and 'url' is defined
    driver.get(url)

    # Note: no explicit wait is performed here; driver.get() returns once the page has loaded

    # Get the HTML content after JavaScript has been executed
    html_content = driver.page_source

    # Define the pattern to search for
    #pattern = "https://www.wowhead.com/wotlk/quest="
    pattern = r"https://www.wowhead.com/{}quest=\d+/[a-z0-9\-]+".format(quest_era_str)
    # Use regular expressions to find all occurrences of the pattern
    #matches = re.findall(rf'{re.escape(pattern)}\d+', html_content)
    matches = re.findall(pattern, html_content)
    counter = 0

    # Print each match
    for match in matches:
        #print(match)
        url_name = str(match)
        if (url_name not in urls):
          urls.append(str(match))


    print("url counter: {}".format(len(urls)))

# Starting value
start = 0
# Ending value
end = 500000
# Interval
interval = 1000
for i in range(start, end + 1, interval):
    print(i)

    for k in range(id_start, id_end+1):
      padded_string = "{:<03}".format(str(k))
      path_to_quest_page = "https://www.wowhead.com/{}quests?filter=30:30;5:1;{}:{}#{}".format(
          quest_era_str, i+interval, i, padded_string
      )
      url = path_to_quest_page
      examine_url(url)

filename = "urls_raw.csv"
with open(filename, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    # Write each row to the CSV file
    for row in urls:
        writer.writerow([row])

print("found {} urls!".format(len(urls)))


0
examining url: https://www.wowhead.com/classic/quests?filter=30:30;5:1;1000:0#000
url counter: 100
examining url: https://www.wowhead.com/classic/quests?filter=30:30;5:1;1000:0#100
url counter: 200
examining url: https://www.wowhead.com/classic/quests?filter=30:30;5:1;1000:0#200
url counter: 300
examining url: https://www.wowhead.com/classic/quests?filter=30:30;5:1;1000:0#300
url counter: 400
examining url: https://www.wowhead.com/classic/quests?filter=30:30;5:1;1000:0#400
url counter: 500
examining url: https://www.wowhead.com/classic/quests?filter=30:30;5:1;1000:0#500
url counter: 600
examining url: https://www.wowhead.com/classic/quests?filter=30:30;5:1;1000:0#600
url counter: 700
examining url: https://www.wowhead.com/classic/quests?filter=30:30;5:1;1000:0#700
url counter: 800
examining url: https://www.wowhead.com/classic/quests?filter=30:30;5:1;1000:0#800
url counter: 900
examining url: https://www.wowhead.com/classic/quests?filter=30:30;5:1;1000:0#900
url counter: 992
1000
exa

KeyboardInterrupt: ignored

In [None]:
list_without_duplicates = list(set(urls))
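
Note that `list(set(urls))` does not keep the order in which the URLs were found. If the order matters, for example to resume a run from a known position, one option is the sketch below; `dedupe_keep_order` is a hypothetical helper and the sample URLs are made up.

```python
def dedupe_keep_order(items):
    # dict preserves insertion order, so the first occurrence of each item wins
    return list(dict.fromkeys(items))

if __name__ == "__main__":
    sample = [
        "https://www.wowhead.com/quest=2/b",
        "https://www.wowhead.com/quest=1/a",
        "https://www.wowhead.com/quest=2/b",
    ]
    print(dedupe_keep_order(sample))
```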

In [None]:
# quit the driver when necessary
driver.quit()



## Collect Data:

In [None]:
import re
import requests
from bs4 import BeautifulSoup, NavigableString, Tag
import csv
import pandas as pd

In [None]:
path_to_csv = 'data/urls_raw_kalimdor_to_professions.csv' # path to a file with just urls - see above 'Collect URLs'
df = pd.read_csv(path_to_csv)
df.columns = ['urls']

print_debug = False  # True = print output from console in terminal ; False = no print output

def get_description(tag_string, soup):
    # Find the <h2> tag with the text "Description"
    description_tag = soup.find('h2', string=tag_string)

    if description_tag:
        text_following_description = ""
        start_collecting = False

        for element in description_tag.next_elements:
            # Stop if another <h2> tag is encountered
            if isinstance(element, Tag) and element.name == "h2":
                break

            if isinstance(element, Tag) and element.name == "script":
                break

            # Skip the "Description" header itself
            if element.parent == description_tag:
                start_collecting = True
                continue

            # Collect text from NavigableString and <a> tag elements
            if start_collecting:
                if isinstance(element, NavigableString):
                    # Skip NavigableStrings that are direct children of <a> tags
                    if element.parent.name != 'a':
                        text_following_description += element.strip()
                elif element.name == "a":
                    text_following_description += " {} ".format(element.get_text(strip=True))


        return text_following_description.strip()

    return ""

def adjust_punctuation_spacing(text):
    # Remove space before punctuation
    text = re.sub(r'\s+([,.?!;:])', r'\1', text)
    # Add space after punctuation if not already there
    text = re.sub(r'([,.?!;:])(?![\s])', r'\1 ', text)
    return text

filename = 'quest_output_en' + '_test.csv'

for i, url in enumerate(df['urls']):
    # timeout keeps a stalled request from hanging the whole run
    page = requests.get(url, timeout=30)
    soup = BeautifulSoup(page.content, 'html.parser')

    # title (skip the page if the heading is missing or empty)
    title_tag = soup.find("h1", {"class": "heading-size-1"})
    title_text = title_tag.text.strip() if title_tag else ""

    if title_text == "":
        continue

    title_text = adjust_punctuation_spacing(title_text)

    if (print_debug):
        print(title_text)

    # description (fall back to the "Completion" section if "Description" is empty)
    description_text = get_description("Description", soup)
    if description_text == "":
        description_text = get_description("Completion", soup)

    description_text = adjust_punctuation_spacing(description_text)

    if description_text == "":
        continue

    if (print_debug):
        print(description_text)

    # objective: first sentence of the page's meta description
    meta_description = soup.find('meta', attrs={'name': 'description'})
    objective_text_all = meta_description.get('content', '') if meta_description else ""
    objective_text = ""  # reset per URL so a value from a previous quest is never reused
    if objective_text_all != "":
        objective_text = objective_text_all.split('.')[0] + '.'

    objective_text = adjust_punctuation_spacing(objective_text)

    if objective_text == "":
        continue

    if (print_debug):
        print(objective_text)

    # write to csv; the with-block closes the file, so no explicit close() is needed
    with open(filename, mode='a', encoding='utf-8-sig') as csv_output:
        csv_writer = csv.writer(
            csv_output, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL, lineterminator='\n',)
        csv_writer.writerow([i, title_text, objective_text, description_text])

    if (print_debug):
        print("=========")

print("DONE")
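
The behaviour of `adjust_punctuation_spacing` is easiest to see on small inputs. The sketch below restates the function from the cell above and compares it with `adjust_punctuation_spacing_strict`, a hypothetical variant that is not part of the notebook. The variant only inserts a space when the next character is neither whitespace nor a digit, which avoids a trailing space at the end of the text and leaves decimals such as `3.5` intact.

```python
import re

def adjust_punctuation_spacing(text):
    # Restated from the notebook cell above
    text = re.sub(r'\s+([,.?!;:])', r'\1', text)
    text = re.sub(r'([,.?!;:])(?![\s])', r'\1 ', text)
    return text

def adjust_punctuation_spacing_strict(text):
    # Hypothetical variant: add a space only before a non-space, non-digit character
    text = re.sub(r'\s+([,.?!;:])', r'\1', text)
    text = re.sub(r'([,.?!;:])(?=[^\s\d])', r'\1 ', text)
    return text

if __name__ == "__main__":
    sample = "Collect 3.5 items,then return."
    print(repr(adjust_punctuation_spacing(sample)))
    # 'Collect 3. 5 items, then return. '
    print(repr(adjust_punctuation_spacing_strict(sample)))
    # 'Collect 3.5 items, then return.'
```

Calling `.strip()` on the result of the original function is a smaller change if only the trailing space is a concern.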