

### Introduction

In this notebook, we'll walk through building a **Reddit Summarizer and Q&A Bot** using **Retrieval-Augmented Generation (RAG)**. This application demonstrates how to use the **Reddit API** to fetch and process Reddit posts and comments, summarize lengthy discussions, and answer questions about them.

The setup combines several powerful tools and techniques:
- **PRAW** to access Reddit data programmatically,
- **LangChain** to handle prompt generation and retrieval workflows,
- **FAISS** to efficiently store and retrieve relevant text chunks, and
- **Llama 3.2**, served through **Ollama**, which powers the language model for natural language understanding and response generation.

By the end of this tutorial, you’ll have a functional RAG-based application that summarizes and answers questions about Reddit discussions. This guide serves as a foundation for more advanced applications, like conducting subreddit-wide searches, analyzing trending topics, or building a comprehensive Reddit insights tool.



### Prerequisites

Before starting with the RAG solution for summarizing Reddit posts using PRAW, ensure you have the following set up:

1. **Python 3.9+**: Make sure Python is installed; recent LangChain releases no longer support older versions. [Download it here](https://www.python.org/downloads/).

2. **PRAW Library**: A Python library for accessing the Reddit API, used to fetch posts and comments.

3. **Reddit API Credentials**: 
   - Go to [Reddit App Preferences](https://www.reddit.com/prefs/apps) and create a new application.
   - Choose **"Script"** as the app type, set a name (e.g., "Reddit Summarizer"), and use `http://localhost:8000` as the redirect URI.
   - This setup will provide your **client_id** and **client_secret**. Together with a **user_agent** string that you choose yourself (a short description identifying your app), these are needed to authenticate with the Reddit API.

4. **Key Libraries**:
   - **LangChain**: Manages prompt generation and helps structure the RAG workflow.
   - **FAISS**: Efficiently stores and retrieves similar text chunks.
   - **Hugging Face Sentence Transformers**: Provides pre-trained models for text embeddings.

5. **Llama 3.2 via Ollama**:
   - Set up **Ollama** to access **Llama 3.2** as the language model, which powers both the summarization and Q&A functionalities in the app. Visit the [Ollama website](https://ollama.com/) for setup instructions.

With these prerequisites ready, you’re all set to build a robust RAG-powered Reddit Summarizer.


In [None]:
!pip install praw langchain langchain-community faiss-cpu sentence-transformers streamlit python-dotenv


#### Configure Reddit API Credentials
To securely access Reddit, store your credentials in a `.env` file in your project directory under the keys `client_id`, `client_secret`, and `user_agent`. This allows your code to load credentials without hardcoding them. Use the `python-dotenv` package to load these credentials:

In [1]:
from dotenv import load_dotenv
import os
import praw
# Load environment variables from the .env file
load_dotenv()

# Access the variables
client_id = os.getenv("client_id")
client_secret = os.getenv("client_secret")
user_agent = os.getenv("user_agent")

# Initialize PRAW with credentials
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent
)
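
Because `os.getenv` returns `None` for a variable that is not set, a missing entry in `.env` only surfaces later, when PRAW first contacts Reddit. The sketch below shows one way to check early. The helper name `missing_credentials` and the sample values are hypothetical and not part of the original notebook.

```python
import os

# Keys the notebook reads from the .env file
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent")

def missing_credentials(env=None):
    """Return the names of required credentials that are unset or empty.

    `env` defaults to the process environment; pass a dict to test the check.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]

# Example with made-up values: the secret is empty, so it is reported
sample_env = {"client_id": "abc", "client_secret": "", "user_agent": "reddit-summarizer/0.1"}
print(missing_credentials(sample_env))  # ['client_secret']
```

Calling `missing_credentials()` with no argument right after `load_dotenv()` checks the real environment; an empty list means all three values were found.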

With your environment set up and libraries installed, you’re ready to start fetching and processing Reddit data!

### Fetch Data from Reddit
### Example 1: Fetching Posts from a Subreddit

In [15]:
# Choose a subreddit
subreddit = reddit.subreddit("MachineLearning")

# Fetch the top 5 posts from 'Hot'
for post in subreddit.hot(limit=5):
    print(f"Title: {post.title}")
    print(f"Score: {post.score}")
    print(f"ID: {post.id}")
    print(f"URL: {post.url}\n")


Title: [D] Simple Questions Thread
Score: 4
ID: 1giq4ia
URL: https://www.reddit.com/r/MachineLearning/comments/1giq4ia/d_simple_questions_thread/

Title: [D] Monthly Who's Hiring and Who wants to be Hired?
Score: 30
ID: 1ftdkmb
URL: https://www.reddit.com/r/MachineLearning/comments/1ftdkmb/d_monthly_whos_hiring_and_who_wants_to_be_hired/

Title: [R] Never Train from scratch
Score: 63
ID: 1gk7dny
URL: https://www.reddit.com/r/MachineLearning/comments/1gk7dny/r_never_train_from_scratch/

Title: [D] To what cross-entropy loss value can LLMs converge?
Score: 25
ID: 1gk92rs
URL: https://www.reddit.com/r/MachineLearning/comments/1gk92rs/d_to_what_crossentropy_loss_value_can_llms/

Title: [D] Autograd vs JAX? Both are google products aimed at gradient based methods. What’s the main difference? (GPU/TPU?)
Score: 3
ID: 1gkms4w
URL: https://www.reddit.com/r/MachineLearning/comments/1gkms4w/d_autograd_vs_jax_both_are_google_products_aimed/



### Example 2: Fetching Comments from a Post

In [16]:
# Get a specific post by ID
post = reddit.submission(id="1gk7dny")

# Print post details
print(f"Title: {post.title}")
print(f"Content: {post.selftext}")

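# Drop unexpanded "load more comments" placeholders, which have no body or author
post.comments.replace_more(limit=0)

# Fetch top-level comments
for comment in post.comments[:5]:  # limit comments if needed
    print(f"Comment by {comment.author}: {comment.body}")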


Title: [R] Never Train from scratch
Content: https://arxiv.org/pdf/2310.02980 

The authors show that when transformers are pre trained, they can match the performance with S4 on the Long range Arena benchmark. 
Comment by like_a_tensor: I don't get why this paper was accepted as an Oral. It seems obvious, and everyone already knew that pre-training improves performance. I thought the interesting question was always whether long-range performance could be achieved via architecture alone without any pre-training task.
Comment by Sad-Razzmatazz-5188: TL;DR self-supervised pre-training on the downstream task is always better than random initialization, and structured initialization is a bit better even for pretraining; fancy models are not much better than transformers when all's pretrained.


Take home message: we're still messing around because backpropagation almost always converges to a local minimum, but we ignore most of the loss landscape and how privileged regions bring to privile
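
Comment bodies fetched this way are not always usable text: deleted or removed comments typically appear with the body `[deleted]` or `[removed]`, and placeholder objects for unexpanded replies carry no `body` at all. The sketch below filters those out before the text goes any further. The helper name `clean_comment_bodies` is hypothetical, and the stand-in objects built with `SimpleNamespace` only mimic the one attribute used here; they are not PRAW objects.

```python
from types import SimpleNamespace

# Placeholder bodies Reddit shows for deleted or moderator-removed comments
PLACEHOLDER_BODIES = {"[deleted]", "[removed]"}

def clean_comment_bodies(comments, limit=None):
    """Collect usable comment text, skipping placeholders and empty bodies."""
    bodies = []
    for comment in comments:
        body = getattr(comment, "body", None)  # objects without a body are skipped
        if not body or body.strip() in PLACEHOLDER_BODIES:
            continue
        bodies.append(body.strip())
        if limit is not None and len(bodies) >= limit:
            break
    return bodies

# Stand-in objects for illustration only
fake_comments = [
    SimpleNamespace(body="First point"),
    SimpleNamespace(body="[deleted]"),
    SimpleNamespace(count=12),  # mimics a "load more comments" placeholder
    SimpleNamespace(body="  Second point  "),
]
print(clean_comment_bodies(fake_comments))  # ['First point', 'Second point']
```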

### Example 3: Searching Subreddit Posts

In [17]:
# Search for posts containing specific keywords
for post in subreddit.search("API", limit=5):
    print(f"Title: {post.title}")
    print(f"Score: {post.score}")
    print(f"ID: {post.id}")
    print(f"URL: {post.url}\n")


Title: Should r/MachineLearning join the reddit blackout to protest changes to their API?
Score: 2620
ID: 14265di
URL: https://www.reddit.com/r/MachineLearning/comments/14265di/should_rmachinelearning_join_the_reddit_blackout/

Title: [D] I feel like ever since LLM APIs have become a thing the quality of discussion regarding ML and ML products has gone down drastically.
Score: 409
ID: 1fl5be0
URL: https://www.reddit.com/r/MachineLearning/comments/1fl5be0/d_i_feel_like_ever_since_llm_apis_have_become_a/

Title: [D] New Reddit API terms effectively bans all use for training AI models, including research use.
Score: 595
ID: 12r7qi7
URL: https://www.reddit.com/r/MachineLearning/comments/12r7qi7/d_new_reddit_api_terms_effectively_bans_all_use/

Title: [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API)
Score: 576
ID: 11fbccz
URL: https://www.reddit.com/r/MachineLearning/comments/11fbccz/d_openai_introduces_chatgpt_and_whisper_apis/

Title: [News] New

## Retrieval Augmented Generation on Reddit

In [4]:

# Vector store, embeddings, and the Ollama LLM wrapper live in langchain_community
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document


### Chunk Text for Vector Storage
We'll use PRAW to get the main content and all comments from a Reddit post. The goal is to create a rich text dataset that combines the post and comment threads for more context in summaries and answers.

In [5]:

def process_reddit_post(url):
    """Fetch and process Reddit post and comments, returning chunked Document objects."""
    submission = reddit.submission(url=url)
    submission.comments.replace_more(limit=None)
    content = submission.selftext + "\n" + "\n".join([comment.body for comment in submission.comments.list()])

    # Chunk content for FAISS storage using RecursiveCharacterTextSplitter
    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
    chunks = splitter.split_text(content)

    # Create Document objects for the chunks
    documents = [Document(page_content=chunk) for chunk in chunks]

    # Ingest into FAISS vector database
    vector_db = ingest_into_vectordb(documents)

    return vector_db

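This `process_reddit_post` function fetches the post and all of its comments, splits the combined text into chunks, and wraps each chunk in a Document object. It then hands the documents to `ingest_into_vectordb`, defined in the next cell, for vector storage.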
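
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to break at separators such as paragraph and line boundaries; the function name `chunk_with_overlap` is hypothetical and only illustrates how consecutive chunks share text.

```python
def chunk_with_overlap(text, chunk_size=512, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

# Tiny example: windows of 4 characters, each sharing 1 character with the previous one
print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.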

In [8]:
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
def ingest_into_vectordb(split_docs):
    """Store split documents in FAISS vector database and save locally."""
    db = FAISS.from_documents(split_docs, embeddings)
    DB_FAISS_PATH = 'vectorstore/db_faiss'
    db.save_local(DB_FAISS_PATH)
    print("Documents are inserted into FAISS vectorstore")
    return db

# Define the PromptTemplate used for summarization
prompt_template = PromptTemplate(
    input_variables=["text"],
    template="Summarize the following Reddit discussion:\n\n{text}",
)




In [18]:
url="https://www.reddit.com/r/MachineLearning/comments/1gjoxpi/what_problems_do_large_language_models_llms/"
if url:
    # Process Reddit post if URL is provided
    vector_db = process_reddit_post(url)



Documents are inserted into FAISS vectorstore


### Using Llama 3.2 for Summarization and Q&A
After storing Reddit data in FAISS, we leverage Llama 3.2 to generate summaries and answer questions based on Reddit threads. Here's how it works in each mode:
#### Summarization Mode
We use a query such as "Summarize this reddit content." to run a similarity search in FAISS and retrieve the most relevant chunks from the Reddit post. The retrieved chunks are then passed through the prompt template to Llama 3.2, which generates a concise summary based on them.
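
The retrieval step can be pictured as a nearest-neighbor lookup: the query and every chunk are embedded as vectors, and the `k` chunks closest to the query are returned. The sketch below does this by brute force with squared Euclidean distance on toy 2-D vectors. It assumes an L2 metric, which is worth checking against your installed LangChain and FAISS versions, and the function name `top_k` is hypothetical. Real embeddings from `all-MiniLM-L6-v2` have many more dimensions.

```python
def top_k(query_vec, doc_vecs, k=5):
    """Return indices of the k vectors closest to query_vec (squared L2 distance)."""
    scored = []
    for idx, vec in enumerate(doc_vecs):
        distance = sum((q - d) ** 2 for q, d in zip(query_vec, vec))
        scored.append((distance, idx))
    scored.sort()  # smallest distance first
    return [idx for _, idx in scored[:k]]

# Toy "embeddings": vectors 0 and 2 point in nearly the same direction as the query
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k([1.0, 0.0], doc_vecs, k=2))  # [0, 2]
```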


In [19]:
llm = Ollama(model="llama3.2")




In [20]:
query = "Summarize this reddit content."
relevant_docs = vector_db.similarity_search(query, k=5)
context = " ".join([doc.page_content for doc in relevant_docs])

# Chain the prompt with the LLM (prompt | llm builds a RunnableSequence)
summary = (prompt_template | llm).invoke({"text": context})


In [21]:
summary

"Here's a summary of the key points:\n\n* LLMs excel at Natural Language Generation tasks such as summarizing text, creating coherent and grammatically correct content.\n* They can recognize when image recognition is requested and can initiate that process.\n* However, LLMs are not capable of image recognition themselves.\n* They primarily perform text generation, which is a part of structured prediction.\n* Many professionals believe in continuous improvement, leading to skepticism about the need for proof.\n\nAdditionally, some key tasks mentioned include:\n\n* Summarization\n* Coding\n* Information Retrieval\n* Spelling/grammar correction\n* Needle-in-a-haystack search (finding relevant text in large corpora)\n\nLet me know if you'd like me to expand on any of these points!"

#### Q&A Mode
When users ask a question, FAISS finds the most relevant text chunks, which are then combined with a question-specific prompt. Llama 3.2 processes this input to provide an answer based on the context and user question.

In [13]:
question = "What exactly happened?"
if question:
    # Retrieve top chunks for Q&A
    relevant_docs = vector_db.similarity_search(question, k=5)
    context = " ".join([doc.page_content for doc in relevant_docs])

    # Prepare the input for the question prompt template
    question_template = PromptTemplate(
        input_variables=["text", "question"],
        template=(
            "Here are the comments on a Reddit post.\n"
            "Answer the question based on this context:\n{text}\n\n"
            "Question: {question}"
        ),
    )
    input_data = {"text": context, "question": question}

    # Run the prompt through the LLM and print the answer
    answer = (question_template | llm).invoke(input_data)
    print(answer)
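
Long threads can produce more retrieved text than you want to send to the model in one prompt. The sketch below assembles a Q&A prompt from retrieved chunks while capping the total context length. The helper name `build_qa_prompt`, the `max_context_chars` budget, and the prompt wording are all hypothetical choices for illustration, not part of the original notebook.

```python
def build_qa_prompt(chunks, question, max_context_chars=2000):
    """Join chunks, in retrieval order, until the character budget would be exceeded."""
    selected = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break  # adding this chunk would exceed the budget
        selected.append(chunk)
        used += len(chunk)
    context = "\n---\n".join(selected)
    return (
        "Here are comments from a Reddit post.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )

# Example with a tiny budget: only the first two chunks fit
prompt = build_qa_prompt(["aaa", "bbb", "ccc"], "What is discussed?", max_context_chars=6)
print(prompt)
```

The resulting string could be passed directly to `llm.invoke(...)` in place of the templated chain.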


### Conclusion
In this notebook, we explored how to build a Reddit Summarizer and Q&A Bot using a Retrieval-Augmented Generation (RAG) approach. By combining PRAW for data access, LangChain for prompt management, FAISS for similarity search, and Llama 3.2 for language generation, we created a streamlined application that can summarize Reddit discussions and answer questions based on user input.
This app demonstrates a practical RAG application on Reddit data, providing a sample solution for extracting insights from community discussions.
#### Future Enhancements
While this version focuses on individual Reddit posts, there are several ways to expand its capabilities:
- Subreddit-Wide Searches: Extend the app to handle searches across an entire subreddit, summarizing trending discussions or providing answers based on broader topic data.
- Trending Topic Analysis: Integrate analytics to detect trending topics or sentiments within specific subreddits, offering more comprehensive insights.
- Advanced Question-Answering: Use additional LLMs or refine prompts to provide even more accurate and contextually rich answers.

This project offers a foundation for further exploration and customization, opening the door to powerful applications in community-driven content analysis. With its adaptable setup, this RAG-based Reddit bot can be modified for a wide range of use cases across social media platforms.