# RAG Basic WebBaseLoader

- Author: [Sunyoung Park (architectyou)](https://github.com/architectyou)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)

## Overview

This tutorial covers the implementation of a news article QA app that answers questions about the content of web news articles, as RAG practice. The guide builds a RAG pipeline using OpenAI chat models, OpenAI embeddings, and a FAISS vector store, with articles from Forbes and from Naver News, the most popular news website in Korea.

### 1. Pre-processing - Steps 1 to 4
![Pre-processing](./assets/12-rag-rag-basic-pdf-rag-process-01.png)
![](./assets/12-rag-rag-basic-pdf-rag-graphic-1.png)

The pre-processing stage involves four steps to load, split, embed, and store documents into a Vector DB (database).

- **Step 1: Document Load** : Load the document content.  
- **Step 2: Text Split** : Split the document into chunks based on specific criteria.  
- **Step 3: Embedding** : Generate embeddings for the chunks and prepare them for storage.  
- **Step 4: Vector DB Storage** : Store the embedded chunks in the database.  

### 2. RAG Execution (RunTime) - Steps 5 to 8
![RAG Execution](./assets/12-rag-rag-basic-pdf-rag-process-02.png)
![](./assets/12-rag-rag-basic-pdf-rag-graphic-1.png)
- **Step 5: Retriever** : Define a retriever to fetch results from the database based on the input query. Retrievers use search algorithms and are categorized as Dense or Sparse:
  - **Dense** : Similarity-based search.
  - **Sparse** : Keyword-based search.

- **Step 6: Prompt** : Create a prompt for executing RAG. The **context** in the prompt includes content retrieved from the document. Through prompt engineering, you can specify the format of the answer.  

- **Step 7: LLM** : Define the language model (e.g., GPT, Claude, Gemini).  

- **Step 8: Chain** : Create a chain that connects the prompt, LLM, and output.  
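
The difference between dense and sparse retrieval in Step 5 can be sketched with a toy example. This is an illustration only: the three-dimensional vectors are hand-written stand-ins for real embeddings, and the corpus, query, and helper names are hypothetical. The query shares no keywords with any document, so the keyword-based score is zero everywhere, while the similarity-based search still finds the closest document.

```python
import math

# Hypothetical corpus with hand-written 3-dimensional vectors standing in for
# real embeddings (axes, loosely: voice/audio, funding, hardware).
corpus = {
    "OpenAI removed a ChatGPT voice after an actress complained": [0.9, 0.1, 0.0],
    "Scale AI raised $1 billion from investors": [0.0, 0.9, 0.1],
    "Microsoft introduced PCs with built-in AI features": [0.1, 0.1, 0.9],
}

query = "startup funding round"
query_vector = [0.1, 0.9, 0.0]  # hypothetical embedding of the query


def sparse_score(query: str, document: str) -> int:
    """Keyword-based score: number of query words that occur in the document."""
    document_words = set(document.lower().split())
    return sum(word in document_words for word in query.lower().split())


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity-based score: cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


sparse_scores = {text: sparse_score(query, text) for text in corpus}
dense_scores = {
    text: cosine_similarity(query_vector, vector) for text, vector in corpus.items()
}
dense_best = max(dense_scores, key=dense_scores.get)

print(sparse_scores)
print(dense_best)
```

In a real pipeline the vectors come from an embedding model rather than being written by hand, and the vector store performs the similarity search.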


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Web News Based QA(Question-Answering) Chatbot](#web-news-based-qaquestion-answering-chatbot)

### References

- [LangChain Tutorial : QA with RAG](https://python.langchain.com/docs/how_to/#qa-with-rag)
- [LangChain WebLoader Tutorial](https://python.langchain.com/docs/integrations/document_loaders/web_base/)
- [Naver News](https://n.news.naver.com/)
- [Forbes](https://www.forbes.com/)

---

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. 
- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "bs4",
        "langsmith",
        "langchain",
        "langchain-text-splitters",
        "langchain-community",
        "langchain-core",
        "langchain-openai",
        "langchain-chroma",
        "faiss-cpu",  # use faiss-gpu instead if a GPU is available
    ],
    verbose=False,
    upgrade=False,
)


In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "RAG-Basic-WebLoader",
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True

If a warning about `USER_AGENT` not being set is displayed when using `WebBaseLoader`, add `USER_AGENT=myagent` to the `.env` file.
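
The variable can also be set from the notebook itself. The following is a minimal sketch, assuming the warning is raised when `langchain_community` is imported without `USER_AGENT` in the environment, so the assignment should run before that import. The value `myagent` is only a placeholder.

```python
import os

# Set USER_AGENT only if it is not already defined (for example via .env).
# "myagent" is a placeholder; any descriptive string can be used.
os.environ.setdefault("USER_AGENT", "myagent")

print(os.environ["USER_AGENT"])
```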

## Web News Based QA(Question-Answering) Chatbot

In this section, we implement a news article QA app that answers questions about web news articles, as RAG practice. The pipeline uses OpenAI chat models, OpenAI embeddings, and a FAISS vector store, with articles from Forbes and from Naver News, the most popular news website in Korea.

The following steps implement a simple indexing pipeline and RAG chain in roughly 20 lines of code.

**[Note]**
- `bs4` is a library for parsing web pages.
- `langchain` is a framework for building LLM applications. Here, we use its text splitter (`RecursiveCharacterTextSplitter`), document loader (`WebBaseLoader`), vector store (`FAISS`, with `Chroma` as an alternative), output parser (`StrOutputParser`), and `RunnablePassthrough`.
- Through the `langchain_openai` module, we can use OpenAI's chat model (`ChatOpenAI`) and embedding model (`OpenAIEmbeddings`).

In [5]:
import bs4
from langchain import hub
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

We implement a process that loads web page content, splits the text into chunks for indexing, and then retrieves the relevant chunks to generate an answer.

`WebBaseLoader` uses `bs4.SoupStrainer` to parse only the necessary parts from the specified web page.

[Note]

- `bs4.SoupStrainer` restricts parsing to the elements you specify, so only the matching tags of the page are kept.

(Example)

```python
bs4.SoupStrainer(
    "div",
    attrs={"class": ["newsct_article _article_body", "media_end_head_title"]},
)
```
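
To see the effect of `parse_only` without downloading a page, the strainer can be applied to a small HTML string. This is a sketch under stated assumptions: the HTML and its class names (`nav`, `headline`, `body`, `footer`) are hypothetical and do not come from Forbes or Naver.

```python
import bs4

# Hypothetical HTML standing in for a downloaded news page.
html = """
<html><body>
  <div class="nav">Home | Login</div>
  <div class="headline">Example headline</div>
  <div class="body">Example article text.</div>
  <div class="footer">Copyright notice</div>
</body></html>
"""

# Keep only <div> tags whose class is "headline" or "body".
only_article = bs4.SoupStrainer("div", attrs={"class": ["headline", "body"]})
soup = bs4.BeautifulSoup(html, "html.parser", parse_only=only_article)

text = soup.get_text(separator="\n", strip=True)
print(text)
```

The navigation and footer blocks are never added to the parsed tree, which is why `WebBaseLoader` returns only the article title and body when given the same kind of strainer through `bs_kwargs`.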

In [6]:
# Load news article content, split into chunks, and index them.

url = "https://www.forbes.com/sites/rashishrivastava/2024/05/21/the-prompt-scarlett-johansson-vs-openai/"
loader = WebBaseLoader(
    web_paths=(url,),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "div",
            attrs={"class": ["article-body fs-article fs-premium fs-responsive-text current-article font-body color-body bg-base font-accent article-subtype__masthead",
                             "header-content-container masthead-header__container"]},
        )
    ),
)
docs = loader.load()
print(f"Number of documents: {len(docs)}")
docs

Number of documents: 1


[Document(metadata={'source': 'https://www.forbes.com/sites/rashishrivastava/2024/05/21/the-prompt-scarlett-johansson-vs-openai/'}, page_content="ForbesInnovationEditors' PickThe Prompt: Scarlett Johansson Vs OpenAIPlus AI-generated kids draw predators on TikTok and Instagram. \nShare to FacebookShare to TwitterShare to Linkedin“I was shocked, angered and in disbelief,” Scarlett Johansson said about OpenAI's Sky voice for ChatGPT that sounds similar to her own.FilmMagic\nThe Prompt is a weekly rundown of AI’s buzziest startups, biggest breakthroughs, and business deals. To get it in your inbox, subscribe here.\n\n\nWelcome back to The Prompt.\n\nScarlett Johansson’s lawyers have demanded that OpenAI take down a voice for ChatGPT that sounds much like her own after she’d declined to work with the company to create it. The actress said in a statement provided to Forbes that her lawyers have asked the AI company to detail the “exact processes” it used to create the voice, which sounds eer

As the output above shows, the loader retrieves the main article from the Forbes page, including its **title** and **content**.

You can load **Naver News article pages** in the same way.

<br/>

```python
loader = WebBaseLoader(
    web_paths=("https://n.news.naver.com/article/437/0000378416",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "div",
            attrs={"class": ["newsct_article _article_body", "media_end_head_title"]},
        )
    ),
)
```

`RecursiveCharacterTextSplitter` splits documents into chunks of at most `chunk_size` characters, with up to `chunk_overlap` characters shared between neighboring chunks.

```python
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
```

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

splits = text_splitter.split_documents(docs)
len(splits)

12
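
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window splitter. This is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to cut at separators such as paragraph and line breaks); the function name `split_with_overlap` is hypothetical and exists only for this sketch.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighboring windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The last three characters of each chunk reappear at the start of the next one. The overlap reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.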

Vector stores such as `FAISS` or `Chroma` embed each chunk with the given embedding model and index the resulting vectors for similarity search.

```python
vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())
```

In [8]:
# Create a vector store.
vectorstore = FAISS.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Create a retriever that searches the indexed news chunks.
retriever = vectorstore.as_retriever()

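The retriever created with `vectorstore.as_retriever()` fetches the chunks most relevant to the question. These chunks are inserted into the prompt as context (the prompt is either written by hand or fetched with `hub.pull`), and the `ChatOpenAI` model generates the answer from it.

Finally, `StrOutputParser` converts the model's response into a plain string.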

In [9]:
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    """You are a friendly AI assistant performing Question-Answering. 
Your mission is to answer the given question based on the provided context.
Please answer the question using the following retrieved context. 
If you cannot find the answer in the given context or if you don't know the answer, please respond with 'The information related to the question cannot be found in the provided information'.
Please answer in English. 
However, keep technical terms and names in their original form without translation.

#Question: 
{question} 

#Context: 
{context} 

#Answer:"""
)

**[Note]**
<br/>
If you practice with a Naver News URL, you can instead pull the **teddynote/rag-prompt-korean** prompt (written in Korean) from the hub.

In that case, you can skip writing the prompt yourself.

```python
prompt = hub.pull("teddynote/rag-prompt-korean")
prompt
```

In [10]:
# Pull an English RAG prompt from the hub (this replaces the custom prompt defined above)

prompt = hub.pull("rlm/rag-prompt")

In [11]:
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)


# Create a chain.
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

To stream the output as it is generated, use the chain's `stream` method. Because the chain takes the question string as its input, pass the string directly.

```python
for chunk in rag_chain.stream("What is the latest news about AI?"):
    print(chunk, end="", flush=True)
```

In [12]:
answer = rag_chain.invoke("What is the latest news about AI?")
print(answer)

The latest news about AI includes Google's rollout of AI-generated summaries on search results, which have faced criticism for inaccuracies. Scale AI has raised $1 billion at a $14 billion valuation, with new investors like Amazon and Meta. Additionally, Microsoft introduced "Copilot+ PCs" with built-in AI features that operate without an internet connection.


In [13]:
answer = rag_chain.invoke("What is the main idea of latest news about?")
print(answer)


The latest news primarily focuses on advancements in AI technology, including Google's AI-generated summaries for search results and Microsoft's new line of AI-powered Windows computers. Google's AI features have faced criticism for inaccuracies, while Microsoft's "Copilot+ PCs" offer AI capabilities without internet access. Additionally, AI's role in social media platforms like TikTok and Instagram is highlighted in the context of combating child sexual abuse material.


In [14]:
answer = rag_chain.invoke("Why did OpenAI and Scarlett Johansson have a conflict?")
print(answer)

Scarlett Johansson had a conflict with OpenAI because the company used a voice for ChatGPT that sounded similar to hers without her consent. She had previously declined an offer from OpenAI to voice ChatGPT, and her lawyers demanded that OpenAI take down the voice and explain how it was created. OpenAI claimed the voice was not an imitation of Johansson's, but the situation was exacerbated by a tweet from OpenAI's CEO, Sam Altman, referencing her film "Her."
