# Objective
A solution that extracts news articles from the provided URLs, generates summaries, and identifies topics using GenAI tools.

I chose [LangChain](https://python.langchain.com/docs/tutorials/) for this task because it is widely used, well documented, and backed by a large community.

In [2]:
!pip install langchain tiktoken langchain-community sentence_transformers faiss-cpu beautifulsoup4 lxml huggingface_hub langchain-huggingface -q

# Extracting news articles

In [4]:
import os
import warnings
from getpass import getpass

warnings.filterwarnings("ignore")

In [5]:
import re
from bs4 import BeautifulSoup

# extractor for page content
def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()

Many paid tools (Firecrawl, etc.) improve the quality of data extraction from websites, but they were not an option for this task, either because of limitations of the tools themselves or because they require a paid subscription.

There is also a more advanced data extraction agent, Web Voyager. However, this agent works only with multimodal LLMs, since it navigates pages through annotated browser screenshots ([link](https://langchain-ai.github.io/langgraph/tutorials/web-navigation/web_voyager/)).


Only free tools could be used, so I chose [RecursiveUrlLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html).

In [6]:
from langchain_community.document_loaders import RecursiveUrlLoader

# URL of the site to extract the news from
url = "https://www.bbc.com/"

loader = RecursiveUrlLoader(url, extractor=bs4_extractor)

# Choosing a model

I chose the "meta-llama/Meta-Llama-3-8B-Instruct" model because it gave the best results among the free models. The model is accessed through the Hugging Face API, which should improve the accuracy of the answers, since the hosted model is fully deployed without quantizing the weights (bnb, 4-bit, 2-bit, ...).

In [7]:
from langchain_huggingface import HuggingFaceEndpoint


hf_api_token = getpass(prompt="API-TOKEN for HuggingFace")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = hf_api_token


# initialize HF LLM
hf_llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct"
)



# Prompt and pipeline

In [37]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

template = """
Develop a convincing thesis that clearly articulates the main idea based on the text.
The answer should be brief, contain no arguments, contain no reflections, contain no explanations.


text: {text}


Answer: The thesis is ...
"""
prompt = ChatPromptTemplate.from_template(template)
chain = (
    prompt
    | hf_llm
    | StrOutputParser()
)
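
The objective also mentions topic identification, which the chain above does not cover. A minimal sketch of how a second prompt could be shaped in the same style is shown below. The names `TOPIC_TEMPLATE`, `parse_topics` and the `max_topics` variable are hypothetical and not part of the original notebook, and the wording of the prompt is only an assumption that may need tuning for a specific model.

```python
# Hypothetical prompt for topic identification, written in the same style
# as the thesis prompt. The wording is an assumption, not a tested prompt.
TOPIC_TEMPLATE = """
List up to {max_topics} topics covered by the text as a comma-separated list.
The answer should contain only the topics, with no explanations.


text: {text}


Answer: The topics are ...
"""


def parse_topics(raw: str, max_topics: int = 3) -> list[str]:
    """Turn a comma-separated LLM answer into a clean, de-duplicated list."""
    topics: list[str] = []
    for part in raw.split(","):
        # normalise whitespace, trailing full stops and letter case
        topic = part.strip().strip(".").lower()
        if topic and topic not in topics:
            topics.append(topic)
    return topics[:max_topics]
```

The template could be wrapped with `ChatPromptTemplate.from_template(TOPIC_TEMPLATE)` and piped into `hf_llm` and `StrOutputParser()` exactly like the thesis chain, with `parse_topics` applied to the returned string.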

**MAX_COUNT** limits the number of articles processed, so the run does not take too long.

**MAX_TEX_SIZE** is the maximum number of characters that fits into max_token. A smarter approach would track prompt_tokens and completion_tokens, but then the LLM's answer becomes illegible because of the limit on the response length. Splitting content into parts requires more intelligent partitioning than the framework provides: when a split leaves a small tail of data, hallucinations occur.
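
One possible way to avoid a small tail is to rebalance the last two parts after splitting. The sketch below is only an illustration under simple assumptions: it cuts on raw character offsets (a real splitter would prefer paragraph or sentence boundaries), and the function name `split_with_tail_merge` and the `min_tail` parameter are hypothetical, not part of LangChain.

```python
def split_with_tail_merge(text: str, max_size: int, min_tail: int) -> list[str]:
    """Split text into chunks of at most max_size characters.

    If the last chunk is shorter than min_tail, the last two chunks are
    rebalanced into two roughly equal parts, so no tiny tail is left.
    Assumes min_tail <= max_size, so rebalanced parts stay within max_size.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    chunks = [text[i:i + max_size] for i in range(0, len(text), max_size)]
    if len(chunks) >= 2 and len(chunks[-1]) < min_tail:
        combined = chunks[-2] + chunks[-1]
        half = (len(combined) + 1) // 2
        chunks[-2:] = [combined[:half], combined[half:]]
    return chunks
```

With `max_size=10` and `min_tail=4`, a 22-character text is split into parts of 10, 6 and 6 characters instead of 10, 10 and 2.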

In [39]:
# limit the number of articles
MAX_COUNT = 15
MAX_TEX_SIZE = 30000

count = 0

posts = []

# lazy_load yields documents one at a time instead of loading them all at once
for doc in loader.lazy_load():
    # skip pages whose text would exceed the model's token limit
    if len(doc.page_content) > MAX_TEX_SIZE:
        continue
    # replace the page text with the generated thesis
    doc.page_content = chain.invoke({"text": doc.page_content})
    posts.append(doc)
    # remove the counter and the check below to process every page
    count += 1
    if count >= MAX_COUNT:
        break

In [46]:
posts[6]

Document(metadata={'source': 'https://www.bbc.com/news/articles/c17dq901g28o', 'content_type': 'text/html; charset=utf-8', 'title': 'House calls for Secret Service reforms after July Trump shooting', 'description': 'A House task force urged the Secret Service to re-focus on its primary protective mission. ', 'language': 'en-GB'}, page_content='The House Task Force investigation found that the Secret Service failed to prevent the 13 July shooting in Butler, Pennsylvania, and called for reforms, including reducing the number of people it protects and reviewing its investigative role.')

# Storage

FAISS is used as the vector database.
Embeddings are computed with the "cointegrated/LaBSE-en-ru" model (it supports two languages; models covering many more languages are available for download). The retriever searches this database; for more accurate news, the AskNews tool can also be connected.
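
To make the retrieval step less of a black box, here is a toy sketch of what a nearest-neighbour search over embeddings does conceptually. It uses plain Python lists instead of FAISS, and it assumes ranking by squared Euclidean (L2) distance, which, as far as I know, is what the default FAISS index built by LangChain uses. The function names are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_nearest(query_vec: list[float], doc_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return indices of the k document vectors closest to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: squared_l2(query_vec, doc_vecs[i]),
    )
    return ranked[:k]
```

A real index avoids the full scan and sort for large collections, but the ranking idea is the same: the query is embedded with the same model as the documents, and the closest vectors are returned.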

In [47]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(
    model_name="cointegrated/LaBSE-en-ru", model_kwargs={"device": "cpu"}
)

db = FAISS.from_documents(
    posts, embeddings
)

db.save_local("faiss_news_db")

retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)

In [48]:
retriever.invoke(
    "Secret Service could not"
)

[Document(metadata={'source': 'https://www.bbc.com/news/articles/c17dq901g28o', 'content_type': 'text/html; charset=utf-8', 'title': 'House calls for Secret Service reforms after July Trump shooting', 'description': 'A House task force urged the Secret Service to re-focus on its primary protective mission. ', 'language': 'en-GB'}, page_content='The House Task Force investigation found that the Secret Service failed to prevent the 13 July shooting in Butler, Pennsylvania, and called for reforms, including reducing the number of people it protects and reviewing its investigative role.'),
 Document(metadata={'source': 'https://www.bbc.com/culture/article/20241210-a-complete-unknown-review', 'content_type': 'text/html; charset=utf-8', 'title': "A Complete Unknown review: Timothée Chalamet is 'brilliant and believable' in 'conventional' biopic", 'description': 'Despite its "kinetic performances and irresistible music" from star Timothée Chalamet and cast, James Mangold\'s Dylan biopic is di

# Conclusion

Extracting news articles from the site worked, but free extraction methods are not always good. Auxiliary pages, such as security policies and rules, end up in the results; they can be removed by filtering URLs, but such a check has to be written individually for each site. The extracted data also contains a lot of garbage, which sometimes causes the model to hallucinate. The prompt itself is not of the highest quality, but there is no single prompt that works well for many LLMs. There are also ready-made prompts on the [LangChain Hub](https://smith.langchain.com/hub), but most of them are designed for OpenAI models.
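
As an illustration of the URL filtering mentioned above, a small check based on the page's `source` metadata could look like the sketch below. The list of excluded path segments is a guess and would have to be adjusted for each site; the function name `is_article_url` is hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical path segments of auxiliary pages; adjust them per site.
EXCLUDED_SEGMENTS = {"privacy", "cookies", "terms", "accessibility", "contact"}


def is_article_url(url: str, allowed_host: str = "www.bbc.com") -> bool:
    """Return True if the URL is on the expected host and is not an auxiliary page."""
    parsed = urlparse(url)
    if parsed.netloc != allowed_host:
        return False
    # compare whole path segments to avoid matching parts of article slugs
    segments = [s for s in parsed.path.lower().split("/") if s]
    return not any(segment in EXCLUDED_SEGMENTS for segment in segments)
```

In the processing loop, such a check could be applied before calling the chain, for example `if not is_article_url(doc.metadata["source"]): continue`.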



# PS

The ready-made solution can be run in Docker.