<a href="https://colab.research.google.com/github/yunchengyang515/hybrid-toolbox-data-analysis/blob/main/scripts/hybrid_training_questions_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 📌 Reddit Data Extraction & Insights Notebook
### 🚀 This notebook extracts, cleans, and analyzes Reddit posts to identify frequently asked questions in hybrid training.

## Extract Posts and Analyze the Topics

Define Search Keywords and Install Dependencies

In [9]:
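# Cross-domain keywords: each list is searched for in the opposite type of subreddit.
# endurance_keywords: strength terms to look for in endurance subreddits
endurance_keywords = ["lifting", "strength", "squats", "deadlifts", "resistance training", "barbell", "weights"]
# strength_keywords: endurance terms to look for in strength subreddits
strength_keywords = ["running", "cardio", "aerobic", "zone 2", "marathon", "cycling"]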

# Define subreddit categories
endurance_subs = ["running", "ironman"]
strength_subs = ["weightlifting", "strength_training"]
hybrid_subs = ["hybridathlete", "hyrox"]


In [10]:
!pip install praw pandas nltk gensim pyLDAvis groq --quiet

Use PRAW to set up Reddit authorization

In [11]:
import praw
import pandas as pd
from google.colab import userdata

# Credentials are read from Colab secrets; store your own values under these names
reddit = praw.Reddit(
    client_id=userdata.get('REDDIT_CLIENT_ID'),
    client_secret=userdata.get('REDDIT_CLIENT_SECRET'),
    redirect_uri=userdata.get('REDDIT_REDIRECT_URL'),
    user_agent='personal use script by fabyang'
)

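# Test the connection. Without a username and password PRAW runs in
# read-only mode, so reddit.user.me() returns None rather than a user.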
try:
    print(reddit.user.me())
except Exception as e:
    print(f"Error connecting to Reddit: {e}")
    print("Please check your credentials and internet connection.")


None
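
Outside Colab, `userdata.get` is not available. A minimal sketch of reading the same credentials from environment variables follows; the helper `load_reddit_credentials` and the `REDDIT_USER_AGENT` variable are assumptions for illustration, not part of the notebook.

```python
import os

def load_reddit_credentials(env=os.environ):
    """Collect Reddit API credentials from environment variables.

    Hypothetical helper: the variable names mirror the Colab secret
    names used above, and REDDIT_USER_AGENT is an assumed optional extra.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing credentials: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env.get("REDDIT_USER_AGENT", "personal use script"),
    }
```

The returned dictionary could then be passed as `praw.Reddit(**load_reddit_credentials())`.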


### Fetch posts from subreddits

In [12]:
def fetch_posts(subreddit_name, limit=50):
    """Fetch posts from a given subreddit."""
    subreddit = reddit.subreddit(subreddit_name)
    posts = []

    for submission in subreddit.hot(limit=limit):  # Change to `.search()` if needed
        posts.append({
            "subreddit": subreddit_name,
            "title": submission.title,
            "body": submission.selftext or "",
            "upvotes": submission.score,
            "num_comments": submission.num_comments,
            "url": submission.url
        })

    return posts
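
The comment in `fetch_posts` mentions switching to `.search()`. One way to do that is to turn a keyword list into a single query string first. The sketch below assumes Reddit search accepts `OR` between terms and quoted phrases; `build_search_query` is a hypothetical helper.

```python
def build_search_query(keywords):
    """Join keywords into one OR query, quoting multi-word phrases."""
    terms = [f'"{kw}"' if " " in kw else kw for kw in keywords]
    return " OR ".join(terms)

query = build_search_query(["lifting", "resistance training", "zone 2"])
print(query)

# With an authenticated `reddit` instance, the query could be used as:
#   for submission in reddit.subreddit("running").search(query, limit=50):
#       ...
```

Searching on the server side returns only matching posts, so less of the listing limit is spent on posts that `filter_posts` would discard anyway.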


In [13]:
def filter_posts(posts, keywords):
    """Filter posts that contain at least one relevant keyword."""
    def contains_keyword(text):
        text = text.lower()
        return any(keyword in text for keyword in keywords)

    return [post for post in posts if contains_keyword(post["title"] + " " + post["body"])]
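
`filter_posts` uses substring matching, so a keyword also matches inside longer words. A possible stricter variant, sketched here with hypothetical names, matches keywords only as whole words or phrases:

```python
import re

def compile_keyword_pattern(keywords):
    """Build one case-insensitive regex matching any keyword as a whole word or phrase."""
    alternatives = "|".join(re.escape(kw) for kw in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

def filter_posts_whole_word(posts, keywords):
    """Keep posts whose title or body contains at least one whole-word keyword."""
    pattern = compile_keyword_pattern(keywords)
    return [p for p in posts if pattern.search(p["title"] + " " + p["body"])]

sample = [
    {"title": "Rerunning my old route", "body": ""},
    {"title": "Running after leg day", "body": "Any tips?"},
]
matched = [p["title"] for p in filter_posts_whole_word(sample, ["running", "cardio"])]
print(matched)
```

The trade-off is recall: whole-word matching skips inflected forms that the substring version would catch, so the keyword lists may need to grow.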


In [14]:
def process_endurance_subs():
    """Fetch & filter strength-related posts from endurance subs."""
    endurance_posts = []
    for sub in endurance_subs:
        posts = fetch_posts(sub, limit=500)
        endurance_posts.extend(filter_posts(posts, endurance_keywords))
    return endurance_posts

def process_strength_subs():
    """Fetch & filter endurance-related posts from strength subs."""
    strength_posts = []
    for sub in strength_subs:
        posts = fetch_posts(sub, limit=500)
        strength_posts.extend(filter_posts(posts, strength_keywords))
    return strength_posts

def process_hybrid_subs():
    """Fetch all posts from hybrid subs (no filtering needed)."""
    hybrid_posts = []
    for sub in hybrid_subs:
        hybrid_posts.extend(fetch_posts(sub, limit=1000))
    return hybrid_posts


In [15]:
all_filtered_posts = process_endurance_subs() + process_strength_subs() + process_hybrid_subs()

# Convert to DataFrame & Save
df = pd.DataFrame(all_filtered_posts)

print(f"✅ Extracted {len(df)} relevant posts.")
df.head()

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

✅ Extracted 2030 relevant posts.


Unnamed: 0,subreddit,title,body,upvotes,num_comments,url
0,running,Race Report - Crying in Disney (Marathon Weeke...,# Race Information\n\nName: Disney Marathon \...,80,32,https://www.reddit.com/r/running/comments/1iyv...
1,running,Austin Marathon 2025 - My Experience With 3 Tr...,### Race Information\n* **Name:** Austin Marat...,6,3,https://www.reddit.com/r/running/comments/1ivy...
2,running,Hanson's Marathon Beginner Plan Review,"I, mid 30s M, recently finished the Hanson's M...",99,15,https://www.reddit.com/r/running/comments/1iss...
3,running,Race Report: Austin Marathon 2025,**Race Information**\n\n* **Name:** Austin Mar...,24,6,https://www.reddit.com/r/running/comments/1is4...
4,running,Race Report -- First timer Austin Marathon!,* **Name:** Austin Marathon\n* **Date:** Febru...,62,13,https://www.reddit.com/r/running/comments/1ir6...


### Data Cleaning
We'll:

- ✅ Remove duplicates
- ✅ Handle missing values
- ✅ Normalize text (lowercase, remove special characters)

In [16]:
import re

# Remove duplicate posts (same title and body)
df.drop_duplicates(subset=["title", "body"], inplace=True)

# Handle missing values (fill empty body with ""). Assigning the result back
# avoids the chained inplace call that pandas 3.0 will no longer support.
df["body"] = df["body"].fillna("")

# Function to clean text
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Remove special characters
    text = text.strip()  # Trim leading and trailing whitespace
    return text

# Apply cleaning to title & body
df["clean_title"] = df["title"].apply(clean_text)
df["clean_body"] = df["body"].apply(clean_text)

# Check results
df.head()


Unnamed: 0,subreddit,title,body,upvotes,num_comments,url,clean_title,clean_body
0,running,Race Report - Crying in Disney (Marathon Weeke...,# Race Information\n\nName: Disney Marathon \...,80,32,https://www.reddit.com/r/running/comments/1iyv...,race report crying in disney marathon weekend...,race information\n\nname disney marathon \nda...
1,running,Austin Marathon 2025 - My Experience With 3 Tr...,### Race Information\n* **Name:** Austin Marat...,6,3,https://www.reddit.com/r/running/comments/1ivy...,austin marathon 2025 my experience with 3 tra...,race information\n name austin marathon\n date...
2,running,Hanson's Marathon Beginner Plan Review,"I, mid 30s M, recently finished the Hanson's M...",99,15,https://www.reddit.com/r/running/comments/1iss...,hansons marathon beginner plan review,i mid 30s m recently finished the hansons mara...
3,running,Race Report: Austin Marathon 2025,**Race Information**\n\n* **Name:** Austin Mar...,24,6,https://www.reddit.com/r/running/comments/1is4...,race report austin marathon 2025,race information\n\n name austin marathon\n da...
4,running,Race Report -- First timer Austin Marathon!,* **Name:** Austin Marathon\n* **Date:** Febru...,62,13,https://www.reddit.com/r/running/comments/1ir6...,race report first timer austin marathon,name austin marathon\n date february 16 2024\n...
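
The cleaned bodies shown above still contain newlines left over from Reddit markdown. If single-line text is preferred, a variant of `clean_text` could also collapse whitespace. This is a sketch; `clean_text_compact` is a hypothetical name, not a function from the notebook.

```python
import re

def clean_text_compact(text):
    """Variant of clean_text that also collapses runs of whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # remove URLs
    text = re.sub(r"[^a-z0-9\s]", "", text)     # remove punctuation and symbols
    text = re.sub(r"\s+", " ", text)            # collapse newlines, tabs, repeated spaces
    return text.strip()

cleaned = clean_text_compact(
    "### Race Information\n* **Name:** Austin Marathon\nSee https://example.com"
)
print(cleaned)
```

Tokenization splits on whitespace anyway, so this mainly makes the `clean_body` column easier to read and export.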


### Text Tokenization

In [17]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download required NLTK data (recent NLTK releases need punkt_tab for word_tokenize)
nltk.download("punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

# Tokenization function
def tokenize_text(text):
    tokens = word_tokenize(text)  # Tokenize
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]  # Remove stopwords & short words
    return tokens

# Apply tokenization
df["tokens"] = df["clean_title"] + " " + df["clean_body"]
df["tokens"] = df["tokens"].apply(tokenize_text)

df.head()


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,subreddit,title,body,upvotes,num_comments,url,clean_title,clean_body,tokens
0,running,Race Report - Crying in Disney (Marathon Weeke...,# Race Information\n\nName: Disney Marathon \...,80,32,https://www.reddit.com/r/running/comments/1iyv...,race report crying in disney marathon weekend...,race information\n\nname disney marathon \nda...,"[race, report, crying, disney, marathon, weeke..."
1,running,Austin Marathon 2025 - My Experience With 3 Tr...,### Race Information\n* **Name:** Austin Marat...,6,3,https://www.reddit.com/r/running/comments/1ivy...,austin marathon 2025 my experience with 3 tra...,race information\n name austin marathon\n date...,"[austin, marathon, 2025, experience, training,..."
2,running,Hanson's Marathon Beginner Plan Review,"I, mid 30s M, recently finished the Hanson's M...",99,15,https://www.reddit.com/r/running/comments/1iss...,hansons marathon beginner plan review,i mid 30s m recently finished the hansons mara...,"[hansons, marathon, beginner, plan, review, mi..."
3,running,Race Report: Austin Marathon 2025,**Race Information**\n\n* **Name:** Austin Mar...,24,6,https://www.reddit.com/r/running/comments/1is4...,race report austin marathon 2025,race information\n\n name austin marathon\n da...,"[race, report, austin, marathon, 2025, race, i..."
4,running,Race Report -- First timer Austin Marathon!,* **Name:** Austin Marathon\n* **Date:** Febru...,62,13,https://www.reddit.com/r/running/comments/1ir6...,race report first timer austin marathon,name austin marathon\n date february 16 2024\n...,"[race, report, first, timer, austin, marathon,..."
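
To see what the filter inside `tokenize_text` keeps and drops, the sketch below isolates that step. It uses a tiny hand-written stop-word set and `str.split` as stand-ins for NLTK's stop-word list and `word_tokenize`, so it runs without any downloads; both stand-ins are assumptions for illustration.

```python
# Illustrative subset only; the notebook uses NLTK's full English stop-word list.
STOP_WORDS = {"i", "my", "the", "a", "to", "for", "is", "and", "with"}

def filter_tokens(tokens, stop_words=STOP_WORDS, min_len=3):
    """Drop stop words and tokens shorter than min_len characters."""
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]

tokens = "my plan for the 5k is to run and lift".split()
kept = filter_tokens(tokens)
print(kept)
```

Note that the length rule also removes short domain terms such as "5k", which may matter for running-related posts; lowering `min_len` or keeping an allow-list would preserve them.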


### Topic Modeling (LDA)

In [18]:
from gensim import corpora, models

# Create dictionary & corpus for LDA
dictionary = corpora.Dictionary(df["tokens"])
dictionary.filter_extremes(no_below=2, no_above=0.5)  # Remove very rare & common words

corpus = [dictionary.doc2bow(text) for text in df["tokens"]]

# Train LDA Model
num_topics = 5  # Can be adjusted
lda_model = models.LdaModel(
    corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42, passes=10
)

# Print topics
for idx, topic in lda_model.print_topics(num_topics=num_topics, num_words=10):
    print(f"📝 Topic #{idx}: {topic}\n")

📝 Topic #0: 0.028*"hyrox" + 0.016*"like" + 0.013*"event" + 0.011*"know" + 0.009*"anyone" + 0.009*"first" + 0.009*"didnt" + 0.009*"also" + 0.009*"felt" + 0.007*"race"

📝 Topic #1: 0.044*"hyrox" + 0.029*"race" + 0.026*"looking" + 0.025*"would" + 0.024*"tickets" + 0.021*"partner" + 0.018*"need" + 0.016*"dont" + 0.016*"doubles" + 0.013*"gym"

📝 Topic #2: 0.032*"hyrox" + 0.020*"f45" + 0.012*"houston" + 0.011*"train" + 0.011*"open" + 0.010*"mens" + 0.009*"training" + 0.009*"transfer" + 0.009*"workouts" + 0.008*"ticket"

📝 Topic #3: 0.057*"hyrox" + 0.028*"marathon" + 0.021*"run" + 0.020*"strength" + 0.020*"training" + 0.020*"lunges" + 0.020*"like" + 0.017*"body" + 0.017*"full" + 0.016*"weeks"

📝 Topic #4: 0.016*"training" + 0.015*"running" + 0.011*"week" + 0.011*"run" + 0.011*"time" + 0.010*"day" + 0.007*"ive" + 0.006*"first" + 0.006*"like" + 0.006*"days"
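
The `filter_extremes(no_below=2, no_above=0.5)` call prunes the vocabulary by document frequency. A pure-Python sketch of that rule on a toy corpus is shown below; it is an approximation for illustration (it ignores gensim's `keep_n` cap), and `filter_by_document_frequency` is a hypothetical name.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=2, no_above=0.5):
    """Return tokens found in at least `no_below` documents and in no more
    than a `no_above` fraction of all documents."""
    # Count each token once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, freq in doc_freq.items() if no_below <= freq <= max_docs}

docs = [
    ["hyrox", "race", "tickets"],
    ["hyrox", "training", "run"],
    ["hyrox", "race", "partner"],
    ["hyrox", "run", "strength"],
]
kept_vocab = sorted(filter_by_document_frequency(docs))
print(kept_vocab)
```

In this toy corpus "hyrox" appears in every document and is dropped as too common, while words seen only once are dropped as too rare.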



### Get Topic Function

In [19]:
def get_topic(post):
    topics = lda_model[dictionary.doc2bow(post)]
    return sorted(topics, key=lambda x: x[1], reverse=True)[0][0]  # Get highest-scoring topic

df["dominant_topic"] = df["tokens"].apply(get_topic)

df.head()

Unnamed: 0,subreddit,title,body,upvotes,num_comments,url,clean_title,clean_body,tokens,dominant_topic
0,running,Race Report - Crying in Disney (Marathon Weeke...,# Race Information\n\nName: Disney Marathon \...,80,32,https://www.reddit.com/r/running/comments/1iyv...,race report crying in disney marathon weekend...,race information\n\nname disney marathon \nda...,"[race, report, crying, disney, marathon, weeke...",4
1,running,Austin Marathon 2025 - My Experience With 3 Tr...,### Race Information\n* **Name:** Austin Marat...,6,3,https://www.reddit.com/r/running/comments/1ivy...,austin marathon 2025 my experience with 3 tra...,race information\n name austin marathon\n date...,"[austin, marathon, 2025, experience, training,...",4
2,running,Hanson's Marathon Beginner Plan Review,"I, mid 30s M, recently finished the Hanson's M...",99,15,https://www.reddit.com/r/running/comments/1iss...,hansons marathon beginner plan review,i mid 30s m recently finished the hansons mara...,"[hansons, marathon, beginner, plan, review, mi...",4
3,running,Race Report: Austin Marathon 2025,**Race Information**\n\n* **Name:** Austin Mar...,24,6,https://www.reddit.com/r/running/comments/1is4...,race report austin marathon 2025,race information\n\n name austin marathon\n da...,"[race, report, austin, marathon, 2025, race, i...",4
4,running,Race Report -- First timer Austin Marathon!,* **Name:** Austin Marathon\n* **Date:** Febru...,62,13,https://www.reddit.com/r/running/comments/1ir6...,race report first timer austin marathon,name austin marathon\n date february 16 2024\n...,"[race, report, first, timer, austin, marathon,...",4
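
`get_topic` sorts the full list of `(topic_id, probability)` pairs just to read the first one. An equivalent selection with `max`, plus a guard for an empty list, might look like the sketch below; `dominant_topic` and its `default` value are illustrative assumptions.

```python
def dominant_topic(topic_probs, default=-1):
    """Return the topic ID with the highest probability, or `default` if the list is empty."""
    if not topic_probs:
        return default
    return max(topic_probs, key=lambda pair: pair[1])[0]

example = [(0, 0.05), (3, 0.70), (4, 0.25)]
best = dominant_topic(example)
print(best)
```

Inside the notebook this could be applied as `dominant_topic(lda_model[dictionary.doc2bow(post)])`.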


### Dominant Topics

In [20]:
# Count number of posts per topic
topic_counts = df["dominant_topic"].value_counts()

print("📊 Topic Distribution:")
print(topic_counts)

# Show the first five post titles for each topic (row order, not ranked by upvotes)
for topic_id in topic_counts.index:
    print(f"\n🔥 Top Questions for Topic {topic_id}:")
    print(df[df["dominant_topic"] == topic_id][["title"]].head(5))


📊 Topic Distribution:
dominant_topic
4    1605
2     216
1     127
0      53
3      23
Name: count, dtype: int64

🔥 Top Questions for Topic 4:
                                               title
0  Race Report - Crying in Disney (Marathon Weeke...
1  Austin Marathon 2025 - My Experience With 3 Tr...
2             Hanson's Marathon Beginner Plan Review
3                  Race Report: Austin Marathon 2025
4        Race Report -- First timer Austin Marathon!

🔥 Top Questions for Topic 2:
                                                title
29                   Umfrage für meine Bachelorarbeit
47                               Programming chin ups
72              First Hybrid Program, feedback needed
75  Here’s what I’ve been doing as a “tactical ath...
76  Training for 10k, feedback on current plan and...

🔥 Top Questions for Topic 1:
                                                 title
208                        New fitness event in stoke 
340     Best Shoes for Hyrox under $150 with 

## Further Analyze Topics With LLM

### Setup Groq

In [21]:
from groq import Groq
groq_client = Groq(api_key=userdata.get('GROQ_API_KEY'))

In [25]:
grouped_questions = df.groupby("dominant_topic")["title"].apply(lambda x: " | ".join(x[:10]))  # First 10 per topic
grouped_questions = grouped_questions.to_dict()

# Prepare LLM inputs
llm_inputs = [
    f"Topic: {topic}\nQuestions: {questions}\n\nSummarize the common concerns and generate common questions related to hybrid training. You may filter out unrelated information."
    for topic, questions in grouped_questions.items()
]
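
Because `llm_inputs` is a plain list, the topic ID has to be parsed back out of each prompt string later. Keeping the ID as a dictionary key avoids that step. The sketch below is one possible arrangement; `build_prompts` and `INSTRUCTION` are hypothetical names, and the instruction text is a shortened placeholder.

```python
# Placeholder instruction; substitute the notebook's full prompt text.
INSTRUCTION = "Summarize common concerns and generate common questions related to hybrid training."

def build_prompts(grouped_questions):
    """Map each topic ID to its prompt so the ID never has to be parsed from the text."""
    return {
        topic: f"Topic: {topic}\nQuestions: {questions}\n\n{INSTRUCTION}"
        for topic, questions in grouped_questions.items()
    }

prompts = build_prompts({0: "First Hyrox tips | Doubles partner", 4: "Marathon plan review"})
for topic_id, prompt in prompts.items():
    print(topic_id, "->", prompt.splitlines()[1])
```

A request loop can then iterate over `prompts.items()` and store each response under `topic_id` directly.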

In [26]:
# Store AI-generated FAQs
topic_faqs = {}

for input_text in llm_inputs:
    chat_completion = groq_client.chat.completions.create(
        model="deepseek-r1-distill-qwen-32b",  # Adjust to your preferred Groq model
        messages=[
            {"role": "system", "content": "You are an expert analyzing fitness training discussions. Identify the most common hybrid training discussions."},
            {"role": "user", "content": input_text}
        ]
    )

    # Recover the topic ID from the first prompt line ("Topic: <id>") and store the response
    topic_name = input_text.split("\n")[0].split(": ")[1]
    topic_faqs[topic_name] = chat_completion.choices[0].message.content

# Print generated FAQs
for topic, faq in topic_faqs.items():
    print(f"\n🔍 **{topic} FAQ:**\n{faq}\n")



🔍 **0 FAQ:**
<think>
Okay, so I need to figure out how to approach this query. The user is asking me to act as an expert analyzing fitness training discussions, specifically looking for the most common hybrid training discussions. They provided a list of topics and questions from a live session, and they want me to summarize the common concerns and generate related questions about hybrid training, filtering out any irrelevant information.

First, I should understand what hybrid training is. From what I know, hybrid training combines different types of exercises or training methods to enhance overall fitness. It could involve mixing strength training with endurance exercises, or using a combination of indoor and outdoor workouts, for example.

Looking at the provided topics and questions, I see a variety of subjects. Some are about specific events like the HYROX Elite 15 and World Championships in Nice 2024, others are about personal training issues like hitting a deka PR, Achilles/cal
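
The response above begins with a `<think>` block containing the model's reasoning, which ends up in the saved FAQ text. A small post-processing step could remove it before storage. This is a sketch: `strip_reasoning` is a hypothetical helper, and it assumes the block is closed with `</think>`.

```python
import re

# Non-greedy match so several separate blocks are each removed on their own.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_reasoning(text):
    """Remove <think>...</think> spans; a block with no closing tag is left untouched."""
    return THINK_BLOCK.sub("", text).strip()

raw = "<think>\nOkay, so I need to figure out...\n</think>\n\n**Common concerns:** pacing, recovery"
answer = strip_reasoning(raw)
print(answer)
```

It could be applied when storing each result, for example `topic_faqs[topic_name] = strip_reasoning(chat_completion.choices[0].message.content)`.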

In [27]:
import pandas as pd

# Convert to DataFrame for easy storage
faq_df = pd.DataFrame(topic_faqs.items(), columns=["Topic", "FAQ_Response"])

# Save to CSV/JSON for future use
faq_df.to_csv("hybrid_training_faqs.csv", index=False)
faq_df.to_json("hybrid_training_faqs.json", orient="records")

print("✅ FAQs saved successfully!")


✅ FAQs saved successfully!
