# 1. LangChain Web Page Loaders

In [10]:
!pip -q install langchain langchain-community langchain-text-splitters \
               sentence-transformers faiss-cpu transformers accelerate \
               pypdf unstructured trafilatura \
               playwright bs4 html2text
# Install the Playwright browser (for dynamic rendering)
!playwright install --with-deps chromium
!pip -q install -U  trafilatura

- Simplest option for static pages: WebBaseLoader
 - Fetching and parsing are fast and lightweight (well suited to static HTML)

In [48]:
from langchain_community.document_loaders import WebBaseLoader
from bs4 import SoupStrainer

urls = [
    "https://example.com/",
    "https://www.pinecone.io/learn/retrieval-augmented-generation/"
]

# Optionally parse only the regions you need, e.g. <article> or <main>
only_article = SoupStrainer(["article", "main"])

loader = WebBaseLoader(
    web_paths=urls,
    requests_per_second=1,           # be polite: throttle requests
    continue_on_failure=True,        # skip URLs that fail instead of raising
    bs_kwargs={"parse_only": only_article}
)
web_docs = loader.load()
len(web_docs), web_docs[0].metadata


(2, {'source': 'https://example.com/'})
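
To see what `parse_only` does in isolation, here is a minimal offline sketch. The two HTML strings are made up for illustration, and it assumes `WebBaseLoader` passes `bs_kwargs` through to the `BeautifulSoup` constructor, which is my understanding of the loader. The second case is worth noting: a page with neither `<article>` nor `<main>` yields empty text, so it is a good idea to check `page_content` after loading.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up sample pages (not taken from the notebook).
PAGE_WITH_ARTICLE = (
    "<html><body><nav>Home | About</nav>"
    "<article><h1>Title</h1><p>Body text.</p></article>"
    "<footer>Copyright</footer></body></html>"
)
PAGE_WITHOUT_ARTICLE = "<html><body><div><p>Only a div here.</p></div></body></html>"

# Same strainer as in the cell above.
only_article = SoupStrainer(["article", "main"])


def visible_text(html: str) -> str:
    """Parse only the <article>/<main> subtrees and return their text."""
    soup = BeautifulSoup(html, "html.parser", parse_only=only_article)
    return soup.get_text(" ", strip=True)


with_article = visible_text(PAGE_WITH_ARTICLE)
without_article = visible_text(PAGE_WITHOUT_ARTICLE)
print(repr(with_article))     # nav and footer text are not parsed at all
print(repr(without_article))  # nothing matched the strainer
```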

- Asynchronous fetching + HTML → text cleanup: AsyncHtmlLoader + Html2TextTransformer
 - Fetches several URLs at once and converts the HTML markup into plain text

In [49]:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers import Html2TextTransformer

urls = [
    "https://python.langchain.com/docs/get_started/introduction",
    "https://python.langchain.com/docs/modules/data_connection/document_loaders/"
]

async_loader = AsyncHtmlLoader(urls)
html_docs = async_loader.load()

html2text = Html2TextTransformer()
clean_docs = html2text.transform_documents(html_docs)
print(len(clean_docs), clean_docs[0].page_content[:400])


Fetching pages: 100%|##########| 2/2 [00:00<00:00,  7.30it/s]

2 Skip to main content

Our new LangChain Academy Course Deep Research with LangGraph is now live!
Enroll for free.

IntegrationsAPI Reference

More

  * Contributing
  * People
  * Error reference
  * * * *

  * LangSmith
  * LangGraph
  * LangChain Hub
  * LangChain JS/TS

v0.3

  * v0.3
  * v0.2
  * v0.1

💬

Search

  * Introduction
  * Tutorials

    * Build a Question Answering application over





In [50]:
# Second html2text pass over the already-converted text (reflows links and lists)
html2text = Html2TextTransformer()
text_docs = html2text.transform_documents(clean_docs)

print("Preview after cleanup:\n", text_docs[0].page_content[:500])
print("Metadata:", text_docs[0].metadata)

Preview after cleanup:
 Skip to main content Our new LangChain Academy Course Deep Research with
LangGraph is now live! Enroll for free. IntegrationsAPI Reference More *
Contributing * People * Error reference * * * * * LangSmith * LangGraph *
LangChain Hub * LangChain JS/TS v0.3 * v0.3 * v0.2 * v0.1 💬 Search *
Introduction * Tutorials * Build a Question Answering application over a Graph
Database * Tutorials * Build a simple LLM application with chat models and
prompt templates * Build a Chatbot * Build a Retrieval Au
Metadata: {'source': 'https://python.langchain.com/docs/get_started/introduction', 'title': 'Introduction | 🦜️🔗 LangChain', 'description': 'LangChain is a framework for developing applications powered by large language models (LLMs).', 'language': 'en'}


In [51]:
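from bs4 import BeautifulSoup
from langchain_core.documents import Document

# Strip page chrome (header, nav, footer, ads) from the raw HTML, then take the text.
# Assumption: the undefined `docs` was meant to be html_docs, the raw-HTML
# documents from AsyncHtmlLoader; the selectors only match on real HTML.
cleaned_docs = []
for d in html_docs:
    soup = BeautifulSoup(d.page_content, "html.parser")
    for sel in ["header", "nav", "footer", "aside", ".ads", ".sponsored"]:
        for el in soup.select(sel):
            el.decompose()
    text = soup.get_text(" ", strip=True)
    cleaned_docs.append(Document(page_content=text, metadata=d.metadata))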



- General-purpose loader built on Unstructured: UnstructuredURLLoader
 - Useful when the sources mix PDF and HTML

In [52]:
from langchain_community.document_loaders import UnstructuredURLLoader

urls = [
    "https://example.com/",
    "https://arxiv.org/pdf/2307.XXXX.pdf"  # placeholder ID; substitute a real arXiv identifier
]
un_loader = UnstructuredURLLoader(urls=urls, continue_on_failure=True)
un_docs = un_loader.load()
len(un_docs)


2

In [53]:
un_docs[1].page_content

"No article for ''\n\nThe identifier you have specified '' appears to be invalid.\n\ninvalid arXiv identifier 2307.XXXX\n\nPlease inform help@arxiv.org if you believe that the identifier should correspond to a valid paper in arXiv."

- Combine into a vector store (connects to the pipeline built earlier)
 - Whichever loader produced the *_docs, the steps are the same: chunk → embed → FAISS index

In [54]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# ① Merge documents (concatenate the results of whichever loaders you ran)
docs_all = (web_docs if 'web_docs' in globals() else []) \
         + (clean_docs if 'clean_docs' in globals() else []) \
         + (pw_docs if 'pw_docs' in globals() else []) \
         + (site_docs if 'site_docs' in globals() else []) \
         + (tf_docs if 'tf_docs' in globals() else []) \
         + (un_docs if 'un_docs' in globals() else [])

print("Documents collected:", len(docs_all))

# ② Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
splits = splitter.split_documents(docs_all)
print("Chunks:", len(splits))

# ③ Embeddings & vector store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(splits, embeddings)

# ④ Expose a retriever for the RAG chain (the llm/prompt/rag_chain defined earlier can be reused as-is)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})


Documents collected: 6
Chunks: 50


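- Static pages: WebBaseLoader
- Many pages: SitemapLoader
- Dynamic pages, logins, or JS rendering: PlaywrightURLLoader

- Remove duplicate or near-duplicate chunks: dedupe by URL or by content hash
- Tune requests_per_second, respect robots.txt, and avoid excessive parallel requests.
- Preserve sources: metadata["source"] holds the URL, so change the prompt/format to cite the URL as the "source" instead of a file name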
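
The dedupe and source-citation tips can be sketched with the standard library alone. Everything here is illustrative: `Doc` is a hypothetical stand-in with the same `page_content` and `metadata` attributes as a LangChain document, and `content_key`, `dedupe`, and `format_sources` are names invented for this example. Hashing normalized text only catches exact duplicates after whitespace and case normalization; true near-duplicates would need a similarity measure.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def content_key(text: str) -> str:
    """Hash whitespace-normalized, lowercased text so trivial variants collide."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(chunks):
    """Keep the first chunk for each distinct content hash, preserving order."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = content_key(chunk.page_content)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def format_sources(chunks) -> str:
    """List each distinct metadata["source"] URL once, in first-seen order."""
    urls = list(dict.fromkeys(c.metadata.get("source", "unknown") for c in chunks))
    return "\n".join(f"[{i}] {url}" for i, url in enumerate(urls, start=1))


chunks = [
    Doc("RAG combines retrieval with generation.", {"source": "https://example.com/a"}),
    Doc("RAG  combines retrieval\nwith generation.", {"source": "https://example.com/b"}),
    Doc("FAISS stores dense vectors.", {"source": "https://example.com/a"}),
]
unique_chunks = dedupe(chunks)
print(len(unique_chunks))
print(format_sources(unique_chunks))
```

The same functions should work on the `splits` list from the vector store cell, since they only read `page_content` and `metadata`; the string returned by `format_sources` can be appended to the answer or injected into the prompt.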

# 2. Trafilatura
- A Python library that extracts the cleaned main-body text from web pages

In [12]:
# Install
!pip install trafilatura

# Usage example
import trafilatura
import requests

url = "https://news.ycombinator.com/"
html = requests.get(url).text

# Extract the main text
text = trafilatura.extract(html)

print(text[:500])  # print only the beginning


Hacker News
new
|
past
|
comments
|
ask
|
show
|
jobs
|
submit
login
1.
Show HN: We started building an AI dev tool but it turned into a Sims-style game
(
youtube.com
)
59 points
by
maxraven
1 hour ago
|
hide
|
38 comments
2.
Show HN: Whispering – Open-source, local-first dictation you can trust
(
github.com/epicenter-so
)
113 points
by
braden-w
3 hours ago
|
hide
|
27 comments
3.
Left to Right Programming
(
graic.net
)
92 points
by
graic
3 hours ago
|
hide
|
81 comments
4.
Show HN: I built an a


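- fetch_url(url) → downloads the HTML from a URL (a replacement for requests)
- extract(html) → extracts only the cleaned main text
- Best suited to text-centric pages such as articles and blog posts.
- For dynamically rendered (JS-based) pages, combine it with browser automation such as Playwright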