This notebook uses Beautiful Soup to scrape the changelog forum for Valve's video game Deadlock and collect the links to all patch notes (updates). The links are found by parsing the HTML of each changelog page and selecting the anchor tags that point to individual patch note threads. The notebook then loops through those URLs, extracts the raw text of each patch note, and stores it in a file (the cells below write one JSON file per patch note). Data is first stored locally for initial development and then pushed to Google Cloud Storage in batch.

As of 24 Nov 2024, all patch notes are located in [this forum](https://forums.playdeadlock.com/forums/changelog.10/).

![Deadlock changelog menu](images/phase1-changelog-homepage.png)

In [17]:
import requests
import re
import os
import json
from datetime import datetime
from bs4 import BeautifulSoup

def get_patch_note_links(page_num):
    # determine URL for the current page
    if page_num == 1:
        url = "https://forums.playdeadlock.com/forums/changelog.10/"
    else:
        url = f"https://forums.playdeadlock.com/forums/changelog.10/page-{page_num}"
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # extract all the thread links that are patch notes
    links = []
    for a_tag in soup.find_all('a', href=True):
        link = a_tag['href']
        if '/threads/' in link and 'update' in link:  # only look for valid patch note threads
            # normalize the URL by removing '/latest' and '/' if present
            normalized_link = re.sub(r'/latest$', '', link)
            normalized_link = re.sub(r'/$', '', normalized_link)
            full_url = f"https://forums.playdeadlock.com{normalized_link}"
            links.append(full_url)

    return links


# initialize
patch_note_links = set()  # use a set to avoid duplicates
page_num = 1

# store links of the previous page
prev_page_links = None 

# loop through pages until no new patch links are found
while True:
    print(f"Checking page {page_num}...")

    # get patch note links 
    current_page_links = get_patch_note_links(page_num)
    
    # compare with the previous page content
    if current_page_links == prev_page_links:
        print(f"Page {page_num} content is the same as page {page_num - 1}. Stopping loop.")
        break
    
    # remember this page's links for comparison with the next page
    prev_page_links = current_page_links  
    
    # add new patch note links from current page
    patch_note_links.update(current_page_links)

    page_num += 1

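# sort links in reverse lexicographic order; slugs start with MM-DD-YYYY,
# so this puts the newest first only while all threads share the same year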
sorted_patch_note_links = sorted(patch_note_links, reverse=True)

# print all collected links
print("Collected Patch Note Links:")
for link in sorted_patch_note_links:
    print(link)


Checking page 1...
Checking page 2...
Checking page 3...
Checking page 4...
Page 4 content is the same as page 3. Stopping loop.
Collected Patch Note Links:
https://forums.playdeadlock.com/threads/11-21-2024-update.47476
https://forums.playdeadlock.com/threads/11-13-2024-update.46391
https://forums.playdeadlock.com/threads/11-10-2024-update.45689
https://forums.playdeadlock.com/threads/11-07-2024-update.44786
https://forums.playdeadlock.com/threads/11-01-2024-update.43705
https://forums.playdeadlock.com/threads/10-29-2024-update.42985
https://forums.playdeadlock.com/threads/10-27-2024-update.42492
https://forums.playdeadlock.com/threads/10-24-2024-update.40951
https://forums.playdeadlock.com/threads/10-18-2024-update.39630
https://forums.playdeadlock.com/threads/10-18-2024-update-2.39693
https://forums.playdeadlock.com/threads/10-15-2024-update.38925
https://forums.playdeadlock.com/threads/10-11-2024-update.37641
https://forums.playdeadlock.com/threads/10-10-2024-update.36958
https://f
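
Sorting the URL strings in reverse gives a chronological order only while every thread falls in the same year, and in the output above it places `10-18-2024-update-2` after `10-18-2024-update`. Below is a minimal sketch of a sort on the parsed date and thread ID instead. It assumes every slug follows the `MM-DD-YYYY-update[-N].ID` pattern seen above and that thread IDs grow over time; `parse_thread_url` and `sort_newest_first` are hypothetical helpers, not part of the notebook.

```python
import re
from datetime import date

# Assumed slug pattern: /threads/MM-DD-YYYY-update[-N].THREAD_ID
THREAD_RE = re.compile(r"/threads/(\d{2})-(\d{2})-(\d{4})-update(?:-\d+)?\.(\d+)$")

def parse_thread_url(url):
    """Return (post date, thread ID) for a patch note URL, or None if it does not match."""
    match = THREAD_RE.search(url)
    if match is None:
        return None
    month, day, year, thread_id = (int(group) for group in match.groups())
    return date(year, month, day), int(thread_id)

def sort_newest_first(urls):
    """Sort by (date, thread ID) descending; URLs that do not match the pattern are dropped."""
    keyed = [(parse_thread_url(url), url) for url in urls]
    matched = [(key, url) for key, url in keyed if key is not None]
    return [url for key, url in sorted(matched, reverse=True)]

sample = [
    "https://forums.playdeadlock.com/threads/10-18-2024-update.39630",
    "https://forums.playdeadlock.com/threads/11-21-2024-update.47476",
    "https://forums.playdeadlock.com/threads/10-18-2024-update-2.39693",
]
for url in sort_newest_first(sample):
    print(url)
```

Dropping unmatched URLs keeps the sketch short; a real run would probably want to log them instead, since a differently named thread would otherwise vanish silently.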

In [23]:
# let's see what one patch note looks like first

def extract_patch_note_content(patch_note_url):
    response = requests.get(patch_note_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # find the user who posted the patch note 
    # in most cases it will be Yoshi, but we want to know if there is a change
    user_div = soup.find('div', class_='message-userContent')
    if user_div and 'data-lb-caption-desc' in user_div.attrs:
        poster = user_div['data-lb-caption-desc']
    else:
        poster = "Poster information not found."

    # find the patch note content and preserve newlines
    content_div = soup.find('div', class_='bbWrapper')
    if content_div:
        patch_note_content = content_div.get_text(separator='\n', strip=True)  # Keeps newlines
    else:
        patch_note_content = "No content found."

    return poster, patch_note_content

# Example usage
first_patch_note_url = sorted_patch_note_links[0] 
poster, content = extract_patch_note_content(first_patch_note_url)

print(f"Posted by: {poster}")
print(f"Content: {content}")

Posted by: Yoshi · Nov 21, 2024 at 3:21 PM
Content: [ Matchmaking Rework ]
- This update includes a new version of the matchmaker. The matchmaking pools are no longer split between normal and ranked, there is only one primary matchmaking mode and there are no limited hours.
- Badges will update immediately whenever you gain or lose enough MMR to change badges. You no longer have to wait a week or play a certain game count. There will sometimes be monthly global maintenance updates where we readjust the global curve based on the population, cheaters banned, recalculation adjustments, etc (this will be done as needed and not necessarily every month).
- Hero MMR is now used in the Matchmaker. Each player will have a "core" MMR and your MMR per hero will be offsets of your core MMR. When you queue, we will match you based on that hero's MMR. So if you are unfamiliar or play worse with a given hero, you will be put in an easier match than your usual. Judgement on your skill and familiarity 

Confirm that the final line of the extracted patch notes matches what is on the website:
![Confirming the extracted content matches the forum post](images/phase2-confirm-extracted-content-matches.png)
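
The same check can be started in code by printing the last non-empty line of the extracted text and comparing it with the post by eye. This is a small sketch: `last_line` is a hypothetical helper, and the sample string is a placeholder standing in for the `content` value returned by `extract_patch_note_content`.

```python
def last_line(text):
    """Return the last non-empty line of text, or an empty string if there is none."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

# Placeholder text; in the notebook this would be the extracted `content`
sample_content = "[ Matchmaking Rework ]\n- First change\n- Last change\n"
print(last_line(sample_content))  # - Last change
```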

In [24]:
# the poster name and the post date-time are stored in the 'data-lb-caption-desc' attribute
# an example looks like this: "Yoshi · Nov 21, 2024 at 3:21 PM"
def extract_poster_info(poster_str):
    # split the string by '·' to separate the poster name and date-time part
    parts = poster_str.split('·')
    
    # check if the split resulted in two parts, otherwise return None
    if len(parts) != 2:
        return None  
    
    # extract poster name and strip any extra spaces
    poster = parts[0].strip()
    
    # split the date-time part by ' at ' to get date and time separately
    date_time_part = parts[1].strip()
    date, time = date_time_part.split(' at ')
    
    # format the date to yyyy-mm-dd using datetime.strptime and strftime
    date_obj = datetime.strptime(date.strip(), '%b %d, %Y')
    formatted_date = date_obj.strftime('%Y-%m-%d')
    
    # return the parsed info as a dictionary
    return {
        'poster': poster,
        'date': formatted_date,
        'time': time.strip()
    }

# example usage
poster_str = "Yoshi · Nov 21, 2024 at 3:21 PM"
info = extract_poster_info(poster_str)
print(info)

{'poster': 'Yoshi', 'date': '2024-11-21', 'time': '3:21 PM'}
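
If a single sortable timestamp is more convenient than separate date and time strings, the date-time part can be parsed in one step. The sketch below assumes the caption always has the form `Name · Mon DD, YYYY at H:MM AM/PM` and an English locale (`%b` and `%p` are locale dependent); `parse_caption` is a hypothetical name. The caption does not state a time zone, so the result is a naive `datetime`.

```python
from datetime import datetime

def parse_caption(caption):
    """Split 'Name · Mon DD, YYYY at H:MM AM/PM' into (poster, datetime); None on mismatch."""
    name, sep, when = caption.partition("·")
    if not sep:
        return None
    try:
        posted_at = datetime.strptime(when.strip(), "%b %d, %Y at %I:%M %p")
    except ValueError:
        return None
    return name.strip(), posted_at

result = parse_caption("Yoshi · Nov 21, 2024 at 3:21 PM")
print(result[0], result[1].isoformat())  # Yoshi 2024-11-21T15:21:00
```

Returning `None` for any caption that does not parse also covers the "Poster information not found." fallback string, so callers can test for one failure value.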


In [25]:
def save_patch_note(patch_note_data, patch_note_url):
    # build the file ID from the ISO-formatted date and the numeric thread ID at the end of the URL
    # assumes patch_note_data['date'] is formatted like "Nov 21, 2024" (MMM DD, YYYY)
    date_str = patch_note_data['date'] 
    patch_note_id = f"{datetime.strptime(date_str, '%b %d, %Y').strftime('%Y-%m-%d')}_{patch_note_url.split('.')[-1]}"

    # create folder if it doesn't exist
    folder = 'json-patch-notes'
    os.makedirs(folder, exist_ok=True)

    # save as JSON file
    file_path = os.path.join(folder, f"{patch_note_id}.json")
    with open(file_path, 'w') as json_file:
        json.dump(patch_note_data, json_file, indent=4)
    print(f"Patch note saved: {file_path}")

# Example patch note data
patch_note_data = {
    'poster': 'Yoshi',
    'date': 'Nov 21, 2024',
    'time': '3:21 PM',
    'content': 'Here are the updates for this patch...'
}

# Example patch note URL (numeric ID is 427)
patch_note_url = 'https://forums.playdeadlock.com/threads/05-03-2024-update.427'

save_patch_note(patch_note_data, patch_note_url)

Patch note saved: json-patch-notes\2024-11-21_427.json
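
The file name combines the ISO date with the numeric thread ID taken from the end of the URL. Below is a sketch of that derivation in isolation, followed by a write and read-back in a temporary directory so nothing in the working folder is touched. `patch_note_filename` is a hypothetical helper, and it assumes the URL ends in `.ID`.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

def patch_note_filename(date_str, patch_note_url):
    """Build 'YYYY-MM-DD_ID.json' from a 'Mon DD, YYYY' date and a thread URL."""
    iso_date = datetime.strptime(date_str, "%b %d, %Y").strftime("%Y-%m-%d")
    thread_id = patch_note_url.rstrip("/").rsplit(".", 1)[-1]
    return f"{iso_date}_{thread_id}.json"

name = patch_note_filename(
    "Nov 21, 2024",
    "https://forums.playdeadlock.com/threads/11-21-2024-update.47476",
)
print(name)  # 2024-11-21_47476.json

# Round trip: write a record, read it back, and check nothing was lost
record = {"poster": "Yoshi", "date": "Nov 21, 2024", "time": "3:21 PM", "content": "..."}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / name
    path.write_text(json.dumps(record, indent=4), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))
print(loaded == record)  # True
```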


In [33]:
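# write contents for all patch notes to the json-patch-notes folder
for link in sorted_patch_note_links:
    print(f"Extracting content from: {link}")
    poster_str, content = extract_patch_note_content(link)  # returns a (caption, text) tuple
    info = extract_poster_info(poster_str)
    if info is None:
        print(f"Skipping {link}: poster caption could not be parsed")
        continue
    info['date'] = datetime.strptime(info['date'], '%Y-%m-%d').strftime('%b %d, %Y')  # format save_patch_note expects
    save_patch_note({**info, 'content': content}, link)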

Extracting content from: https://forums.playdeadlock.com/threads/11-21-2024-update.47476
Patch note saved: json-patch-notes\2024-11-21_47476.json
Extracting content from: https://forums.playdeadlock.com/threads/11-13-2024-update.46391
Patch note saved: json-patch-notes\2024-11-13_46391.json
Extracting content from: https://forums.playdeadlock.com/threads/11-10-2024-update.45689
Patch note saved: json-patch-notes\2024-11-10_45689.json
Extracting content from: https://forums.playdeadlock.com/threads/11-07-2024-update.44786
Patch note saved: json-patch-notes\2024-11-07_44786.json
Extracting content from: https://forums.playdeadlock.com/threads/11-01-2024-update.43705
Patch note saved: json-patch-notes\2024-11-01_43705.json
Extracting content from: https://forums.playdeadlock.com/threads/10-29-2024-update.42985
Patch note saved: json-patch-notes\2024-10-29_42985.json
Extracting content from: https://forums.playdeadlock.com/threads/10-27-2024-update.42492
Patch note saved: json-patch-notes\