# üß† News Chatbot ‚Äî Advanced Deep Learning Project

This project presents an **AI-powered news chatbot** developed as part of the *Advanced Deep Learning* course.  
The chatbot leverages **LangChain** and **Retrieval-Augmented Generation (RAG)** to answer user queries based on **real-world news articles** fetched via RSS feeds and APIs.

Our goal is to build an intelligent conversational agent capable of:
- Understanding user questions in natural language
- Retrieving the most relevant information from recent news
- Generating coherent, factual, and source-based answers

The project demonstrates how **LLMs (Qwen 1.5B)** can be combined with **embeddings (MiniLM-L6-v2)** and **vector databases (FAISS)** to create reliable, domain-specific agents.


### üßë‚Äçüíª Authors

- üë®‚Äçüíª Guenichi Ibrahim
- üë®‚Äçüíª Ben Mohamed Malak
- üë®‚Äçüíª Kacem Skander
- üë®‚Äçüíª Abdellaoui Malek
- üë®‚Äçüíª Bettaieb Ahmed

### üìò Notebook Structure

1. **Framework Used ‚Äî LangChain**
2. **Data Collection & Preprocessing**
3. **Embedding Creation**
4. **Vector Store (FAISS) Setup**
5. **LLM & Prompt Template**
6. **RAG Pipeline Implementation**
7. **Fine-tuning and Optimization**
8. **Model Architecture Summary**
9. **Demo of the Chatbot**


## üîß Environment Setup and Imports

This section imports all necessary frameworks and libraries used throughout the project.  
The main framework powering the chatbot is **LangChain**, which provides tools for prompt templates, chains, retrievers, and agents.  

Additional packages such as **HuggingFace Transformers** are used for the language model pipeline, while **FAISS** supports fast similarity searches for document retrieval.  
Together, these components form the foundation of our Retrieval-Augmented Generation (RAG) system.


### Install required libraries

In [1]:
# Install required Python libraries for data ingestion
# Purpose: Ensure dependencies are available in Google Colab
!pip install feedparser requests --quiet
print("Libraries installed successfully: feedparser, requests")


  Preparing metadata (setup.py) ... [?25l[?25hdone
[?25l   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m0.0/81.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m81.5/81.5 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for sgmllib3k (setup.py) ... [?25l[?25hdone
Libraries installed successfully: feedparser, requests


#### Additional preprocessing libraries
Installs HTML parsing, tokenization and data utilities used during preprocessing.

In [2]:
!pip install beautifulsoup4 nltk rouge-score datasets --quiet

  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone


In [3]:
!rm -rf /root/.config/Google

#### Imports and Drive mount
Imports core libraries and mounts Google Drive for persistent storage when running in Colab. Replace or guard the drive mount when running locally.

In [4]:
# Import libraries and mount Google Drive for persistent storage
import feedparser
import requests
import json
import os
from datetime import datetime
import logging
from google.colab import drive

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Mount Google Drive and create base directory
drive.mount('/content/drive', force_remount=True)
BASE_DIR = '/content/drive/MyDrive/news_data'
os.makedirs(BASE_DIR, exist_ok=True)
logging.info(f"Google Drive mounted. Base directory created at: {BASE_DIR}")
print(f"Environment setup complete. Directory: {BASE_DIR}")


Mounted at /content/drive
Environment setup complete. Directory: /content/drive/MyDrive/news_data


### üåê Defining Categories and Data Sources (RSS / API)

This section centralizes the configuration of all **news categories** and their corresponding **RSS feed URLs** or **API endpoints**.  

In [5]:
# Define news categories and data sources
CATEGORIES = ['world', 'politics', 'business', 'technology', 'science', 'entertainment', 'sports', 'health']

RSS_FEEDS = {
    'world': ['https://feeds.bbci.co.uk/news/world/rss.xml', 'https://www.reuters.com/arc/outboundfeeds/world/?outputType=xml', 'http://rss.cnn.com/rss/cnn_world.rss'],
    'politics': ['https://www.theguardian.com/politics/rss', 'https://www.politico.com/rss/politicopulse.xml', 'https://www.aljazeera.com/xml/rss/all.xml'],
    'business': ['https://feeds.bloomberg.com/news/business.rss', 'https://www.ft.com/rss/home', 'https://www.cnbc.com/id/100727362/device/rss/rss.html'],
    'technology': ['https://techcrunch.com/feed/', 'https://www.wired.com/feed/rss', 'https://www.theverge.com/rss/index.xml'],
    'science': ['https://www.nature.com/nature.rss', 'https://www.science.org/rss/news_current.xml', 'https://www.newscientist.com/feed/home/?cmpid=RSS%7CNSNS%7Cnews'],
    'entertainment': ['https://variety.com/feed/', 'https://www.hollywoodreporter.com/feed/', 'https://feeds.bbci.co.uk/news/entertainment_and_arts/rss.xml'],
    'sports': ['https://www.espn.com/espn/rss/news', 'https://feeds.bbci.co.uk/sport/rss.xml', 'https://www.skysports.com/rss/0,20514,11095,00.xml'],
    'health': ['https://www.who.int/feeds/entity/mediacentre/news/en/rss.xml', 'https://www.reuters.com/arc/outboundfeeds/health/?outputType=xml', 'https://newsnetwork.mayoclinic.org/feed/']
}

NEWSAPI_KEY = 'd1fc7716158e4ecf86b9bb812a3b9fcf'
NEWSDATA_KEY = 'pub_f344b9f77d3645aabb54c0e52cb051fd'
NEWSAPI_ENDPOINT = 'https://newsapi.org/v2/top-headlines'
NEWSDATA_ENDPOINT = 'https://newsdata.io/api/1/news'

print(f"Defined categories: {', '.join(CATEGORIES)}")
print(f"Configured {sum(len(feeds) for feeds in RSS_FEEDS.values())} RSS feeds and 2 APIs")


Defined categories: world, politics, business, technology, science, entertainment, sports, health
Configured 24 RSS feeds and 2 APIs


### üì° RSS Fetch Function

This function parses a single **RSS feed** and converts each entry into a **normalized article dictionary** containing:
- `title`  
- `description`  
- `content`  
- `link`  
- `pub_date`  
- `source`  

By unifying the format of all feeds, this step enables seamless integration of RSS content with API-based articles in later stages of the pipeline.


In [6]:
# Fetch articles from a single RSS feed
def fetch_rss(feed_url):
    try:
        feed = feedparser.parse(feed_url)
        articles = []
        for entry in feed.entries:
            article = {
                'title': entry.get('title', 'No Title'),
                'description': entry.get('description', 'No Description'),
                'content': entry.get('content', [{}])[0].get('value', 'No Content'),
                'link': entry.get('link', 'No Link'),
                'pub_date': entry.get('published', datetime.now().isoformat()),
                'source': feed.feed.get('title', feed_url.split('/')[2])
            }
            articles.append(article)
        logging.info(f"Fetched {len(articles)} articles from RSS: {feed_url}")
        print(f"Success: Fetched {len(articles)} articles from {feed_url}")
        return articles
    except Exception as e:
        logging.error(f"RSS fetch error for {feed_url}: {str(e)}")
        print(f"Error fetching {feed_url}: {str(e)}")
        return []

# Test RSS fetch
sample_articles = fetch_rss('https://feeds.bbci.co.uk/news/world/rss.xml')


Success: Fetched 36 articles from https://feeds.bbci.co.uk/news/world/rss.xml


### üåç NewsAPI Fetch Function

This function connects to the **NewsAPI** service to collect news articles based on a specified category.  
It standardizes the returned data into a consistent structure used across all sources.  


In [7]:
# Fetch articles from NewsAPI.org for a category
def fetch_newsapi(category):
    params = {'category': category, 'apiKey': NEWSAPI_KEY, 'pageSize': 20, 'language': 'en'}
    try:
        response = requests.get(NEWSAPI_ENDPOINT, params=params, timeout=10)
        response.raise_for_status()
        data = response.json()
        if data.get('status') != 'ok':
            raise ValueError(f"API error: {data.get('message')}")
        articles = []
        for item in data.get('articles', []):
            article = {
                'title': item.get('title', 'No Title'),
                'description': item.get('description', 'No Description'),
                'content': item.get('content', 'No Content'),
                'link': item.get('url', 'No Link'),
                'pub_date': item.get('publishedAt', datetime.now().isoformat()),
                'source': item['source'].get('name', 'NewsAPI')
            }
            articles.append(article)
        logging.info(f"Fetched {len(articles)} articles from NewsAPI for {category}")
        print(f"Success: Fetched {len(articles)} articles from NewsAPI for {category}")
        return articles
    except Exception as e:
        logging.error(f"NewsAPI fetch error for {category}: {str(e)}")
        print(f"Error fetching NewsAPI for {category}: {str(e)}")
        return []

# Test NewsAPI fetch
sample_articles = fetch_newsapi('technology')


Success: Fetched 20 articles from NewsAPI for technology


### üì∞ NewsData.io Fetch Function

This function queries the **NewsData.io API** using predefined category mappings to retrieve the latest news articles.  
It includes built-in error handling for common API response issues such as **422 (Invalid Query)** and **429 (Rate Limit Exceeded)**, ensuring reliable and continuous data collection even under network or quota constraints.


In [8]:
import requests
from datetime import datetime
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def fetch_newsdata(category, country=None):
    """
    Fetch articles from NewsData.io for a given category.

    Args:
        category (str): News category (e.g., 'technology', 'business').
        country (str, optional): Country code (e.g., 'us'; None for global).

    Returns:
        list: List of article dictionaries or empty list on error.

    Note:
        Uses known supported categories: general, business, entertainment, health,
        science, sports, technology, politics. Maps 'world' to 'general'.
        Get a personal API key from https://newsdata.io/ for higher limits (500 calls/day).
    """
    # Known supported categories (based on NewsData.io docs)
    SUPPORTED_CATEGORIES = [
        'general', 'business', 'entertainment', 'health',
        'science', 'sports', 'technology', 'politics'
    ]

    # Map input categories to API-supported values
    category_map = {
        'world': 'general',  # 'world' maps to 'general' for broad news
        'politics': 'politics',
        'business': 'business',
        'technology': 'technology',  # Corrected from 'tech'
        'science': 'science',
        'entertainment': 'entertainment',
        'sports': 'sports',
        'health': 'health'
    }
    mapped_category = category_map.get(category, 'general')

    # Validate category
    if mapped_category not in SUPPORTED_CATEGORIES:
        logging.error(f"Unsupported category '{mapped_category}'. Supported: {', '.join(SUPPORTED_CATEGORIES)}")
        print(f"Error: Category '{mapped_category}' not supported. Choose from: {', '.join(SUPPORTED_CATEGORIES)}")
        return []

    # Validate API key
    if NEWSDATA_KEY == 'YOUR_NEWSDATA_KEY' or not NEWSDATA_KEY:
        logging.error("Invalid or placeholder API key. Get a free key from https://newsdata.io/")
        print("Error: Set a valid NEWSDATA_KEY. Sign up at https://newsdata.io/ for free tier (500 calls/day).")
        return []

    params = {
        'category': mapped_category,
        'apikey': NEWSDATA_KEY,
        'language': 'en'
    }
    if country:
        params['country'] = country

    try:
        response = requests.get('https://newsdata.io/api/1/news', params=params, timeout=10)
        response.raise_for_status()
        data = response.json()

        if data.get('status') != 'success':
            error_msg = data.get('message', 'Unknown error')
            if 'invalid_category' in str(error_msg).lower():
                logging.error(f"API rejected category '{mapped_category}'. Supported: {', '.join(SUPPORTED_CATEGORIES)}")
                print(f"Error: API rejected category '{mapped_category}'. Supported: {', '.join(SUPPORTED_CATEGORIES)}")
            raise ValueError(f"API error: {error_msg}")

        articles = []
        for item in data.get('results', []):
            article = {
                'title': item.get('title', 'No Title'),
                'description': item.get('description', 'No Description'),
                'content': item.get('content', 'No Content'),
                'link': item.get('link', 'No Link'),
                'pub_date': item.get('pubDate', datetime.now().isoformat()),
                'source': item.get('source_id', 'NewsData.io')
            }
            articles.append(article)

        logging.info(f"Fetched {len(articles)} articles from NewsData.io for {category} (mapped to '{mapped_category}')")
        print(f"Success: Fetched {len(articles)} articles from NewsData.io for {category} (mapped to '{mapped_category}')")
        return articles

    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 422:
            error_detail = e.response.json() if e.response.content else {}
            invalid_cat = error_detail.get('results', {}).get('invalid_category', mapped_category)
            logging.error(f"422 Error: Unsupported category '{invalid_cat}'. Supported: {', '.join(SUPPORTED_CATEGORIES)}")
            print(f"Error (422): Unsupported category '{invalid_cat}'. Supported: {', '.join(SUPPORTED_CATEGORIES)}")
        elif e.response.status_code == 429:
            logging.error("429 Rate Limit Exceeded: Wait 24 hours or use a personal API key from https://newsdata.io/")
            print("Error (429): Rate limit exceeded. Wait 24 hours or get a personal key at https://newsdata.io/")
        else:
            logging.error(f"HTTP error: {str(e)}")
            print(f"HTTP Error: {str(e)}")
        return []
    except Exception as e:
        logging.error(f"NewsData.io fetch error for {category}: {str(e)}")
        print(f"Error fetching NewsData.io for {category}: {str(e)}")
        return []

# Set your API key
NEWSDATA_KEY = 'pub_f344b9f77d3645aabb54c0e52cb051fd'

# Test with sample category
sample_articles = fetch_newsdata('technology')
print(f"Sample article count: {len(sample_articles)}")
if sample_articles:
    print(f"First article title: {sample_articles[0]['title']}")
else:
    print("No articles fetched. Try a different category like 'business' or check API key.")

Success: Fetched 10 articles from NewsData.io for technology (mapped to 'technology')
Sample article count: 10
First article title: Why laser toning works best for Indian skin


### Save articles to Google Drive

In [9]:
# Save articles to Google Drive as JSON
def save_articles(articles, category, source):
    try:
        category_dir = os.path.join(BASE_DIR, category)
        os.makedirs(category_dir, exist_ok=True)
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        file_path = os.path.join(category_dir, f'{source}_{timestamp}.json')
        with open(file_path, 'w', encoding='utf-8') as f:
            json.dump(articles, f, ensure_ascii=False, indent=4)
        logging.info(f"Saved {len(articles)} articles to {file_path}")
        print(f"Success: Saved {len(articles)} articles to {file_path}")
    except Exception as e:
        logging.error(f"Save error for {file_path}: {str(e)}")
        print(f"Error saving to {file_path}: {str(e)}")

# Test save function
sample_articles = [{'title': 'Test Article', 'description': 'Test', 'content': 'Test content', 'link': 'http://example.com', 'pub_date': '2025-10-15T13:00:00Z', 'source': 'Test'}]
save_articles(sample_articles, 'test_category', 'test')


Success: Saved 1 articles to /content/drive/MyDrive/news_data/test_category/test_20251026_205306.json


### üîÅ Full Ingestion Loop

This section orchestrates the **entire data ingestion process**, iterating over all configured categories to collect articles from both **RSS feeds** and **API sources**.  

In [10]:
# Fetch and save articles for all categories
for category in CATEGORIES:
    print(f"Processing category: {category}")
    all_articles = []

    # RSS feeds
    for feed_url in RSS_FEEDS.get(category, []):
        articles = fetch_rss(feed_url)
        if articles:
            all_articles.extend(articles)
            save_articles(articles, category, 'rss')

    # NewsAPI
    articles = fetch_newsapi(category)
    if articles:
        all_articles.extend(articles)
        save_articles(articles, category, 'newsapi')

    # NewsData.io
    articles = fetch_newsdata(category)
    if articles:
        all_articles.extend(articles)

    # Save combined articles
    if all_articles:
        save_articles(all_articles, category, 'combined')
        print(f"Total {len(all_articles)} articles saved for {category}")
    else:
        print(f"No articles fetched for {category}")


Processing category: world
Success: Fetched 36 articles from https://feeds.bbci.co.uk/news/world/rss.xml
Success: Saved 36 articles to /content/drive/MyDrive/news_data/world/rss_20251026_205307.json
Success: Fetched 0 articles from https://www.reuters.com/arc/outboundfeeds/world/?outputType=xml
Success: Fetched 29 articles from http://rss.cnn.com/rss/cnn_world.rss
Success: Saved 29 articles to /content/drive/MyDrive/news_data/world/rss_20251026_205307.json
Success: Fetched 0 articles from NewsAPI for world


ERROR:root:422 Error: Unsupported category 'general'. Supported: general, business, entertainment, health, science, sports, technology, politics


Error (422): Unsupported category 'general'. Supported: general, business, entertainment, health, science, sports, technology, politics
Success: Saved 65 articles to /content/drive/MyDrive/news_data/world/combined_20251026_205308.json
Total 65 articles saved for world
Processing category: politics
Success: Fetched 21 articles from https://www.theguardian.com/politics/rss
Success: Saved 21 articles to /content/drive/MyDrive/news_data/politics/rss_20251026_205308.json
Success: Fetched 0 articles from https://www.politico.com/rss/politicopulse.xml
Success: Fetched 25 articles from https://www.aljazeera.com/xml/rss/all.xml
Success: Saved 25 articles to /content/drive/MyDrive/news_data/politics/rss_20251026_205308.json
Success: Fetched 18 articles from NewsAPI for politics
Success: Saved 18 articles to /content/drive/MyDrive/news_data/politics/newsapi_20251026_205308.json
Success: Fetched 10 articles from NewsData.io for politics (mapped to 'politics')
Success: Saved 74 articles to /content

### Verify saved files

In [11]:
# Verify saved files by displaying a sample article
import glob

sample_file = glob.glob(f'{BASE_DIR}/technology/combined_*.json')
if sample_file:
    try:
        with open(sample_file[0], 'r', encoding='utf-8') as f:
            sample_data = json.load(f)
        print(f"Sample article from {sample_file[0]}:")
        print(json.dumps(sample_data[0], indent=4))
    except Exception as e:
        print(f"Error reading {sample_file[0]}: {str(e)}")
else:
    print("No sample files found. Run ingestion first or check API keys.")


Sample article from /content/drive/MyDrive/news_data/technology/combined_20251026_205312.json:
{
    "title": "Ads might be coming to Apple Maps next year",
    "description": "This could be part of a larger strategy to introduce more advertising in iOS.",
    "content": "No Content",
    "link": "https://techcrunch.com/2025/10/26/ads-might-be-coming-to-apple-maps-next-year/",
    "pub_date": "Sun, 26 Oct 2025 19:38:39 +0000",
    "source": "TechCrunch"
}


## üì∞ Data Preprocessing

 This section is responsible for cleaning, and structuring the raw data into a consistent format.

Preprocessing includes:
- Removing unnecessary HTML tags and symbols  
- Normalizing whitespace and punctuation  
- Deduplicating similar content using TF-IDF similarity thresholds  
- Chunking long articles into smaller pieces to improve retrieval precision  

The final output of this step is a list of clean, concise text chunks that are ready to be embedded.


In [12]:

!pip install beautifulsoup4 nltk rouge-score datasets --quiet
import json
import os
import glob
import re
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import logging

In [13]:
# NLTK fix: ensure punkt tokenizer is available (required for NLTK >= 3.8)
print("Downloading NLTK 'punkt' data (and punkt_tab fallback) ...")
nltk.download('punkt_tab', quiet=False)
nltk.download('punkt', quiet=False)
print("NLTK downloads complete.")

# Quick sent_tokenize smoke test
try:
    test_sentences = sent_tokenize("This is a test. It has two sentences.")
    print(f"‚úÖ Tokenization OK. Example: {test_sentences}")
except LookupError as e:
    print(f"‚ùå Tokenization error: {e}")
    print("Manual fix: run nltk.download('punkt_tab') in a separate cell if needed.")
    raise e

logging.basicConfig(level=logging.INFO)

BASE_DIR = '/content/drive/MyDrive/news_data'

def load_articles(category):
    """Load all combined JSON articles for a category."""
    files = glob.glob(f'{BASE_DIR}/{category}/combined_*.json')
    all_articles = []
    for file in files:
        with open(file, 'r', encoding='utf-8') as f:
            data = json.load(f)
            all_articles.extend(data)
    logging.info(f"Loaded {len(all_articles)} articles for {category}")
    return all_articles


Downloading NLTK 'punkt' data (and punkt_tab fallback) ...


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


NLTK downloads complete.
‚úÖ Tokenization OK. Example: ['This is a test.', 'It has two sentences.']


1. **Text Cleaning**
   ```
   Raw HTML ‚Üí BeautifulSoup4 ‚Üí Clean Text ‚Üí Normalized Whitespace
   ```
   - HTML tag removal
   - Special character handling
   - Unicode normalization

In [14]:
def clean_text(text):
    """Clean text: remove HTML and normalize whitespace."""
    if not text:
        return ""
    soup = BeautifulSoup(text, 'html.parser')
    text = soup.get_text()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

2. **Deduplication**
   ```
   Articles ‚Üí TF-IDF Vectors ‚Üí Cosine Similarity ‚Üí Unique Articles
   ```
   - TF-IDF vectorization (scikit-learn)
   - Pairwise similarity computation
   - Threshold-based filtering (0.8)

In [15]:
def deduplicate_articles(articles, threshold=0.8):
    """Deduplicate articles based on TF-IDF similarity. Handles None values."""
    if len(articles) < 2:
        return articles
    # Safety: use .get() and str() to avoid None values
    texts = [clean_text(str(a.get('title', '')) + ' ' + str(a.get('description', '')) + ' ' + str(a.get('content', ''))) for a in articles]
    texts = [t for t in texts if t.strip()]  # Ignore empty texts
    if not texts:
        return articles
    vectorizer = TfidfVectorizer().fit_transform(texts)
    sim_matrix = cosine_similarity(vectorizer)
    unique_indices = [0]
    for i in range(1, len(articles)):
        if all(sim_matrix[i][j] < threshold for j in unique_indices):
            unique_indices.append(i)
    unique_articles = [articles[i] for i in unique_indices]
    logging.info(f"Deduplicated: {len(articles)} -> {len(unique_articles)}")
    return unique_articles

3. **Content Chunking**
   ```
   Clean Text ‚Üí NLTK Sentences ‚Üí Fixed-Size Chunks ‚Üí Metadata
   ```
   - Sentence boundary detection
   - Length-aware chunking (500 chars)
   - Metadata preservation

In [16]:
def chunk_text(text, max_length=500):
    """Split text into chunks of at most max_length characters."""
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""
    for sent in sentences:
        if len(current_chunk + sent) < max_length:
            current_chunk += sent + " "
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sent + " "
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

In [17]:
def preprocess_category(category):
    """Full preprocessing pipeline for a category. Handles None values."""
    articles = load_articles(category)
    articles = deduplicate_articles(articles)

    processed = []
    for art in articles:
        full_text = clean_text(str(art.get('title', '')) + ' ' + str(art.get('description', '')) + ' ' + str(art.get('content', '')))
        chunks = chunk_text(full_text)
        for i, chunk in enumerate(chunks):
            if chunk.strip():  # Ignore empty chunks
                processed.append({
                    'id': f"{art.get('link', 'unknown')}_{i}",
                    'title': art.get('title', 'No Title'),
                    'source': art.get('source', 'Unknown'),
                    'link': art.get('link', 'No Link'),
                    'content': chunk,
                    'category': category
                })

    output_file = f'{BASE_DIR}/{category}_processed.json'
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(processed, f, ensure_ascii=False, indent=2)
    logging.info(f"Preprocessing complete: {len(processed)} chunks saved to {output_file}")
    return processed

In [18]:
# Test on a category (should work now)
processed_tech = preprocess_category('technology')
if processed_tech:
    print(f"Sample chunk: {processed_tech[0]['content'][:200]}...")
else:
    print("No chunks generated ‚Äì check the source data.")

Sample chunk: Ads might be coming to Apple Maps next year This could be part of a larger strategy to introduce more advertising in iOS. No Content...


## üîπ Generating Embeddings

In this step, we transform text chunks into **numerical vector representations** using the pre-trained **Sentence-Transformer model `all-MiniLM-L6-v2`**.  

These embeddings form the foundation of our **vector database**, which powers the retrieval stage of the RAG pipeline.

In [19]:
# Install for embeddings and vector store
!pip install sentence-transformers faiss-cpu --quiet
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pickle

[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m31.4/31.4 MB[0m [31m71.6 MB/s[0m eta [36m0:00:00[0m
[?25h

### Components

1. **Embedding Model**
   ```
   Model: all-MiniLM-L6-v2
   Output: 384-dimensional vectors
   Framework: sentence-transformers
   ```
   - Optimized for semantic similarity
   - Multi-lingual support
   - Efficient inference

2. **FAISS Integration**
   ```
   Vectors ‚Üí FAISS FlatL2 ‚Üí Indexed Storage
   ```
   - Exact nearest neighbor search
   - L2 distance metric

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

def create_embeddings(chunks):
    """Generate embeddings for the chunks."""
    embeddings = model.encode([c['content'] for c in chunks])
    return np.array(embeddings).astype('float32')

def build_faiss_index(chunks, embeddings):
    """Build and save a FAISS index."""
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings)

    faiss.write_index(index, f'{BASE_DIR}/{chunks[0]["category"]}_faiss.index')
    with open(f'{BASE_DIR}/{chunks[0]["category"]}_chunks.pkl', 'wb') as f:
        pickle.dump(chunks, f)
    logging.info(f"FAISS index built: {len(chunks)} vectors")
    return index

processed_tech = json.load(open(f'{BASE_DIR}/technology_processed.json', 'r'))
embeddings_tech = create_embeddings(processed_tech)
index_tech = build_faiss_index(processed_tech, embeddings_tech)
print("FAISS index ready for retrieval!")

## üîç Building the RAG Pipeline

This section combines all components : retriever, embeddings, and LLM into a **Retrieval-Augmented Generation (RAG)** chain.  

When a user asks a question:
1. The retriever fetches the most relevant text chunks from the vector store.  
2. These chunks are inserted into the prompt as contextual evidence.  
3. The language model then generates an answer grounded in the retrieved context.

This approach minimizes hallucinations and ensures the chatbot provides fact-based responses backed by real news data.


### ‚öôÔ∏è Underlying RAG System Components

#### 1. **Retrieval Component**
- Performs **Top-k semantic search (k=5)** using FAISS  
- Applies **score normalization** to improve ranking precision  
- Attaches **metadata** (title, source, URL) to retrieved documents  

#### 2. **Generation Pipeline**
- Uses **Qwen2-1.5B-Instruct**, a multilingual instruction-tuned model  
- Hardware Optimization:
     - Float16 quantization (GPU)
     - Automatic device mapping

#### 3. **Prompt Engineering**
- Injects a structured **system context** for every query  
- Enforces **source citation** in generated responses  
- Supports **French and English** through multilingual optimization  
- Temperature control **0.7**

#### 4. **Integration Architecture**
- Built on a **LangChain RetrievalQA** chain  
- Uses **custom prompt templates** for consistent formatting  
- Tracks **source documents** for transparency  
- Includes **error handling** for robust operation  

In [None]:
def retrieve_chunks(query, index, chunks, model, k=5):
    """Retrieve top-k chunks for a query."""
    query_emb = model.encode([query]).astype('float32')
    distances, indices = index.search(query_emb, k)
    results = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < len(chunks):
            results.append({
                'chunk': chunks[idx]['content'],
                'source': chunks[idx]['source'],
                'link': chunks[idx]['link'],
                'score': 1 - dist
            })
    return results

# Test retrieval
query = "Latest AI advancements in 2025"
retrieved = retrieve_chunks(query, index_tech, processed_tech, model)
for r in retrieved:
    print(f"Score: {r['score']:.2f} | Source: {r['source']} | Chunk: {r['chunk'][:100]}...")

In [None]:
# Install LangChain and transformers
!pip install langchain langchain-community langchain-huggingface transformers accelerate bitsandbytes sentence-transformers faiss-cpu --quiet
import torch
from langchain.prompts import PromptTemplate
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain.schema import Document
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
import logging

logging.basicConfig(level=logging.INFO)

# Check GPU availability
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'GPU' if device == 0 else 'CPU'}")

# Load Qwen2-1.5B-Instruct
model_name = "Qwen/Qwen2-1.5B-Instruct"
try:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if device == 0 else torch.float32,
        device_map="auto" if device == 0 else None,
        trust_remote_code=True
    )
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        device=device
    )
    llm = HuggingFacePipeline(pipeline=pipe)
    print("‚úÖ Qwen loaded successfully!")
except Exception as e:
    print(f"‚ùå Qwen error ({e}). Falling back to GPT-2...")
    from transformers import GPT2LMHeadModel, GPT2Tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    tokenizer.pad_token = tokenizer.eos_token
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=150, device=device)
    llm = HuggingFacePipeline(pipeline=pipe)

# Key step: build LangChain vectorstore
# Convert chunks into LangChain Document objects
docs = [Document(
    page_content=chunk['content'],
    metadata={
        'title': chunk['title'],
        'source': chunk['source'],
        'link': chunk['link'],
        'category': chunk['category']
    }
) for chunk in processed_tech]

# LangChain embeddings wrapper (uses sentence-transformers model)
embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

# Build FAISS vectorstore
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

print("‚úÖ Vectorstore and retriever are ready!")

# Prompt template for Qwen
prompt_template = """<|im_start|>system
Tu es un assistant news expert. R√©ponds de mani√®re concise et factuelle en fran√ßais, bas√© uniquement sur le contexte fourni. Cite les sources via metadata (source, link).
<|im_end|>
<|im_start|>user
Contexte: {context}

Question: {question}
<|im_end|>
<|im_start|>assistant
"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

# RAG chain (with valid retriever)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

# Simplified RAG function (uses LangChain chain)
def rag_query(query):
    """Run RAG using the Qwen model and the integrated retriever."""
    result = qa_chain.invoke({"query": query})
    sources = []
    for doc in result.get('source_documents', []):
        sources.append({
            'chunk': doc.page_content,
            'source': doc.metadata['source'],
            'link': doc.metadata['link'],
            'score': 0.0
        })
    return {
        "answer": result['result'].strip(),
        "sources": sources
    }

## üß™ Demonstration

This final section showcases the chatbot‚Äôs functionality.  
By entering a natural language question, the user receives an AI-generated answer based on the most relevant news content in the dataset.  

The output includes:
- The **question** posed  
- The **AI-generated answer**  
- The **source articles** that supported the response  

This demo validates the effectiveness of our RAG pipeline and completes the full end-to-end workflow of our intelligent news chatbot.


In [23]:
# Test RAG
query = "Connaissez-vous l'√©cole d'ing√©nieurs ESPRIT en Tunisie ? "
result = rag_query(query)
print("**Question:**", query)
print("**R√©ponse (Qwen):**", result['answer'])
print("\n**Sources:**")
for src in result['sources']:
    print(f"- {src['source']}: {src['link']}")

**Question:** Connaissez-vous l'√©cole d'ing√©nieurs ESPRIT en Tunisie ? 
**R√©ponse (Qwen):** <|im_start|>system
Tu es un assistant news expert. R√©ponds de mani√®re concise et factuelle en fran√ßais, bas√© uniquement sur le contexte fourni. Cite les sources via metadata (source, link).
<|im_end|>
<|im_start|>user
Contexte: As Reliance, Meta announce Rs 855 crore AI partnership; RIL clarifies that the transaction does not ... Reliance Industries Limited and Meta Platforms are joining forces. They have formed a new company called Reliance Enterprise Intelligence Limited. This new entity will focus on developing and distributing artificial intelligence services for businesses. The partnership aims to strengthen enterprise technology capabilities and explore new AI solutions.

Meet the Palestinian Teens Trying to Win Robotics Gold Next week, five teens from Palestine will head to Panama to compete in one of the world‚Äôs largest youth robotics competitions. The goal? To win‚Äîand then te