# Discourse Forum Data Extraction

This notebook extracts data from a Discourse forum.

Discourse is an open-source discussion platform commonly used for community forums. The notebook retrieves categories, discussion threads (topics), and posts, including the author information attached to each post, from the forum's public JSON endpoints. BeautifulSoup is used only to parse the HTML body of a post.

Reference: [Discourse API documentation](https://docs.discourse.org)

## 1. Import Libraries

In [4]:
import requests
import json
from bs4 import BeautifulSoup

## 2. Set Base Configuration

In [None]:
# replace with your Discourse forum URL
forum_url = "https://community.lsst.org"

# the endpoints used below don't require authentication.

## 3. Define Utility Functions

In [None]:
# fetch categories from the Discourse forum
def get_categories(forum_url):
    endpoint = f"{forum_url}/categories.json"
    # a timeout keeps the call from hanging indefinitely on a stalled connection
    response = requests.get(endpoint, timeout=30)
    if response.status_code == 200:
        return response.json()['category_list']['categories']
    else:
        print(f"Failed to fetch categories: {response.status_code}")
        return []
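
The fetch functions in this section all follow the same request-and-check pattern. As a minimal sketch, that pattern could be factored into one helper that also retries when the server answers with HTTP 429 (rate limited). The name `fetch_json`, its parameters, and the retry policy are hypothetical and not part of the original notebook; the sketch also assumes that a `Retry-After` header, when present, holds a whole number of seconds.

```python
import time


def fetch_json(url, get, retries=3, backoff=2.0, sleep=time.sleep):
    """Return the decoded JSON body of `url`, or None if every attempt fails.

    `get` is any callable with the shape of `requests.get(url, timeout=...)`.
    """
    for attempt in range(retries):
        response = get(url, timeout=30)
        if response.status_code == 200:
            return response.json()
        if response.status_code == 429:
            # Rate limited: use Retry-After if it is a plain number of seconds,
            # otherwise fall back to an exponentially growing delay.
            header = response.headers.get("Retry-After", "")
            delay = float(header) if header.isdigit() else backoff * 2 ** attempt
            sleep(delay)
            continue
        print(f"Failed to fetch {url}: {response.status_code}")
        return None
    print(f"Giving up on {url} after {retries} attempts")
    return None
```

In this notebook it could be called as `fetch_json(f"{forum_url}/categories.json", requests.get)`. Passing `get` and `sleep` as arguments keeps the helper easy to exercise without network access.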

In [None]:
# fetch topics for a given category (only the topics contained in a single response)
def get_topics(forum_url, category_slug):
    endpoint = f"{forum_url}/c/{category_slug}.json"
    response = requests.get(endpoint, timeout=30)
    if response.status_code == 200:
        return response.json()['topic_list']['topics']
    else:
        print(f"Failed to fetch topics for category {category_slug}: {response.status_code}")
        return []
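
`get_topics` returns only the topics contained in one response, while category listings are paginated. The sketch below walks the pages until nothing new comes back. It assumes the listing accepts a zero-based `page` query parameter, which should be verified against the target forum. `iter_all_topics` and `fetch_page` are hypothetical names introduced for illustration.

```python
def iter_all_topics(fetch_page, max_pages=50):
    """Yield topics page by page until a page adds nothing new.

    `fetch_page(page)` is a caller-supplied function that returns the list of
    topic dicts for a zero-based page number.
    """
    seen_ids = set()
    for page in range(max_pages):
        new_on_page = 0
        for topic in fetch_page(page):
            if topic["id"] in seen_ids:
                continue  # skip topics already yielded from an earlier page
            seen_ids.add(topic["id"])
            new_on_page += 1
            yield topic
        if new_on_page == 0:
            break  # empty or fully repeated page: stop paging
```

Under that assumption, `fetch_page` could wrap a request to `f"{forum_url}/c/{category_slug}.json?page={page}"` and return `['topic_list']['topics']` from the response. The `max_pages` cap guards against an endpoint that keeps returning data.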

In [None]:
# fetch posts for a given topic
# note: long topics return only the first batch of posts in 'post_stream'
def get_posts(forum_url, topic_id):
    endpoint = f"{forum_url}/t/{topic_id}.json"
    response = requests.get(endpoint, timeout=30)
    if response.status_code == 200:
        return response.json()['post_stream']['posts']
    else:
        print(f"Failed to fetch posts for topic {topic_id}: {response.status_code}")
        return []
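
For long topics, `post_stream['posts']` holds only the first batch of posts, while `post_stream['stream']` lists the ids of every post in the topic. The sketch below works out which ids are still missing and builds request URLs for them. The helper names are hypothetical, and the `/t/{topic_id}/posts.json?post_ids[]=...` URL shape is an assumption that should be checked against the Discourse API documentation before use.

```python
from urllib.parse import urlencode


def missing_post_ids(topic_json):
    """Return ids listed in post_stream['stream'] but absent from post_stream['posts']."""
    post_stream = topic_json["post_stream"]
    loaded = {post["id"] for post in post_stream["posts"]}
    return [post_id for post_id in post_stream.get("stream", []) if post_id not in loaded]


def post_batch_urls(forum_url, topic_id, post_ids, batch_size=20):
    """Build one URL per batch of post ids (the endpoint shape is an assumption)."""
    urls = []
    for start in range(0, len(post_ids), batch_size):
        batch = post_ids[start:start + batch_size]
        # urlencode percent-encodes the brackets in the repeated "post_ids[]" key
        query = urlencode([("post_ids[]", post_id) for post_id in batch])
        urls.append(f"{forum_url}/t/{topic_id}/posts.json?{query}")
    return urls
```

Each URL could then be fetched in the same way as in `get_posts`, and the returned posts appended to the first batch.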

In [None]:
# extract text and image URLs from the cooked HTML
def extract_text_and_images(cooked_html):
    soup = BeautifulSoup(cooked_html, 'html.parser')
    text = soup.get_text(separator='\n', strip=True)
    images = [img['src'] for img in soup.find_all('img') if 'src' in img.attrs]
    return text, images
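
The `src` values collected by `extract_text_and_images` are returned exactly as they appear in the HTML, so they may be relative paths or protocol-relative URLs. A minimal sketch for resolving them against the forum URL follows; `absolutize_urls` is a hypothetical helper, not part of the original notebook.

```python
from urllib.parse import urljoin


def absolutize_urls(urls, base_url):
    """Resolve relative or protocol-relative URLs against `base_url`.

    URLs that are already absolute are returned unchanged.
    """
    return [urljoin(base_url, url) for url in urls]
```

For example, `absolutize_urls(images, forum_url)` could be applied to the second value returned by `extract_text_and_images` before downloading any image.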

## 4. Data Extraction

In [29]:
# get all the categories
categories = get_categories(forum_url)

# print(json.dumps(categories, indent=4))

In [30]:
# initialize a list to hold all the data
data_dump = []

# extract information from each category
for category in categories:
    category_id = category['id']
    category_name = category['name']
    category_slug = category['slug']
    print(f"Category: {category_name} (ID: {category_id})")

    # get the topics returned for the category (one response, not every page)
    topics = get_topics(forum_url, category_slug)

    # extract information from each topic
    for topic in topics:
        topic_id = topic['id']
        topic_title = topic['title']
        # print(f"  Topic: {topic_title} (ID: {topic_id})")

        # get the posts returned for the topic (first batch only for long topics)
        posts = get_posts(forum_url, topic_id)
        for post in posts:
            # print(json.dumps(post, indent=4))
            post_data = {
                "category_id": category_id,
                "category_name": category_name,
                "category_slug": category_slug,
                "topic_id": topic_id,
                "topic_title": topic_title,
                "post_data": post
            }

            # append the post data to the list
            data_dump.append(post_data)

Category: News (ID: 7)
Category: Support (ID: 6)
Category: Science (ID: 23)
Category: Commissioning (ID: 47)
Category: Data Management (ID: 10)
Category: EPO (ID: 22)
Category: Meta (ID: 3)
Category: Archive (ID: 48)


In [31]:
# save the data dump to a JSON file

output_filename = "forum_data_dump.json"
with open(output_filename, "w") as json_file:
    json.dump(data_dump, json_file, indent=4)

print("Data saved!")

Data saved!
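
Each entry in the saved file carries the category and topic fields next to the raw post under `post_data`. As a quick sanity check, the dump can be reloaded and the entries counted per category. `load_dump` and `posts_per_category` are hypothetical helpers introduced for this sketch.

```python
import json
from collections import Counter


def load_dump(path):
    """Read a JSON dump written by the cell above and return the list of entries."""
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)


def posts_per_category(entries):
    """Count dump entries per category name."""
    return Counter(entry["category_name"] for entry in entries)
```

Calling `posts_per_category(load_dump("forum_data_dump.json")).most_common()` would list the categories ordered by the number of extracted posts.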
