# Md Jakaria Mashud Shahria (2431751)

**Task 2**

The second task requires implementing a semantic search tool for a selected context, using a transformer-based natural language model as a feature extractor. First, you will scrape a website (delfi.lt, lrt.lt, ebay.lt, ...) and collect at least 5,000 data entries. You will then develop methods for extracting and saving features for efficient searches based on new queries. During the assessment, the instructor will send test texts/descriptions in text format, with which you will have to demonstrate how your implemented model works. At the assessment, you will be able to explain how you implemented your task variant configuration, i.e. how the natural language model was used.

In [None]:
!pip install requests beautifulsoup4 pandas tqdm sentence-transformers faiss-cpu



I first tried LRT, but it has no pagination. Delfi has pagination in each section. The "Politics" section is used because it has more than 100 pages. If fewer than 5,050 entries are collected (the threshold used in the code), the "Lifestyle" section is scraped as a second URL.
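The section fallback described above can be sketched independently of the network code. This is a minimal illustration rather than the notebook's actual scraper: `page_url`, `collect_entries`, and the injected `fetch_page` callable are hypothetical names, `fake_fetch_page` is an offline stand-in for a real HTTP request, and the `?page=N` pattern is the one assumed for Delfi section pages.

```python
def page_url(section_url, page_num):
    """Page 1 is the bare section URL; later pages append ?page=N."""
    return section_url if page_num == 1 else f"{section_url}?page={page_num}"


def collect_entries(section_urls, fetch_page, min_entries, max_pages):
    """Scrape sections in order, moving to the next one only while
    fewer than min_entries rows have been collected."""
    entries = []
    for section_url in section_urls:
        if len(entries) >= min_entries:
            break  # enough data, skip the remaining fallback sections
        for page_num in range(1, max_pages + 1):
            rows = fetch_page(page_url(section_url, page_num))
            if not rows:
                break  # an empty page marks the end of the section
            entries.extend(rows)
    return entries


def fake_fetch_page(url):
    """Hypothetical offline fetcher: 3 pages of politics, 2 of lifestyle."""
    pages_available = 3 if "politics" in url else 2
    page_num = int(url.split("page=")[1]) if "page=" in url else 1
    if page_num > pages_available:
        return []
    return [{"title": f"Article {i}", "link": f"{url}#{i}"} for i in range(2)]


demo_entries = collect_entries(
    ["https://www.delfi.lt/en/politics", "https://www.delfi.lt/en/lifestyle"],
    fake_fetch_page,
    min_entries=8,
    max_pages=10,
)
print(len(demo_entries))
```

Passing the fetcher in as an argument keeps the pagination and fallback logic testable without hitting the website.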

In [None]:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# --- Scraper Configuration ---
base_url = "https://www.delfi.lt/en/politics"
second_url = "https://www.delfi.lt/en/lifestyle"  # fallback section if too few entries are collected
min_entries = 5050  # threshold for using the fallback (the task minimum is 5,000)
max_pages = 250  # upper bound on pages per section, high enough for 5,000+ entries


def scrape_section(section_url, max_pages):
    """Scrape article titles and links from one paginated Delfi section."""
    rows = []
    for page_num in tqdm(range(1, max_pages + 1), desc="Scraping Pages"):
        url = section_url if page_num == 1 else f"{section_url}?page={page_num}"
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            # Stop this section on the first failed request (e.g. HTTP 503)
            print(f"Error on page {page_num}: {e}")
            break

        soup = BeautifulSoup(response.content, 'html.parser')

        # The main container for each article is an 'article' tag with this class
        articles = soup.find_all('article', class_='block-type-102-headline')
        if not articles:
            print(f"\nNo more articles found on page {page_num}. Stopping.")
            break

        for article in articles:
            # The title and link sit in an 'a' tag inside the title div
            title_container = article.find('div', class_='block-type-102-headline__title')
            if not title_container:
                continue
            title_tag = title_container.find('a')
            if title_tag and title_tag.has_attr('href'):
                title = title_tag.get_text(strip=True)
                rows.append({
                    "title": title,
                    "link": "https://www.delfi.lt" + title_tag['href'],
                    # Since there's no lead text, our search text is just the title
                    "text_for_search": title,
                })

        # Be a good web citizen and wait between requests
        time.sleep(0.5)
    return rows


print(f"Starting to scrape up to {max_pages} pages from delfi.lt...")
data = scrape_section(base_url, max_pages)

if len(data) < min_entries:
    print("--------- Too few entries, scraping the second URL ---------")
    data.extend(scrape_section(second_url, max_pages))

# --- Save to DataFrame ---
df = pd.DataFrame(data)
print(f"\nScraping complete. Collected {len(df)} articles.")
if not df.empty:
    print(df.head())

# Save the scraped data
df.to_csv('delfi_articles.csv', index=False)

Starting to scrape up to 250 pages from delfi.lt...


Scraping Pages:  61%|██████    | 153/250 [08:39<05:29,  3.39s/it]

Error on page 154: 503 Server Error: Service Unavailable for url: https://www.delfi.lt/en/politics?page=154

Scraping complete. Collected 7650 articles.
                                               title  \
0  Belarus may be clearing migrants ahead of Zapa...   
1  Washington said about military aid cuts, no of...   
2  Nausėda meets with candidates for interior and...   
3  Nausėda yet to decide on SocDem’s pick for soc...   
4  TS-LKD proposes scheme easing first-home purchase   

                                                link  \
0  https://www.delfi.lt/en/politics/belarus-may-b...   
1  https://www.delfi.lt/en/politics/washington-sa...   
2  https://www.delfi.lt/en/politics/nauseda-meets...   
3  https://www.delfi.lt/en/politics/nauseda-yet-t...   
4  https://www.delfi.lt/en/politics/ts-lkd-propos...   

                                     text_for_search  
0  Belarus may be clearing migrants ahead of Zapa...  
1  Washington said about military aid cuts, no of...  
2  Nausė
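The run above stopped at page 154 on a 503 response, which is often a temporary condition. One way to tolerate such errors is to retry with an increasing delay before giving up. The sketch below is an assumption about how that could be done, not part of the original scraper: `fetch_with_retry` and `flaky_fetch` are hypothetical names, and the failure is simulated so the example runs offline.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); after a failure wait base_delay * 2**attempt and retry.

    The last exception is re-raised once all retries are used up.
    In the real scraper, catching requests.exceptions.RequestException
    would be more precise than catching Exception.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated fetcher (hypothetical): fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("503 Service Unavailable (simulated)")
    return f"<html>{url}</html>"


delays = []  # record the waits instead of actually sleeping
html = fetch_with_retry(
    flaky_fetch,
    "https://www.delfi.lt/en/politics?page=154",
    sleep=delays.append,
)
print(calls["n"], delays)
```

In the scraping loop, `fetch` would wrap `requests.get` followed by `raise_for_status()`, so a single 503 no longer ends the whole section.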




A Hugging Face token is needed to use the model; one was used for this run.

In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the data you scraped from Delfi
try:
    df = pd.read_csv('delfi_articles.csv')
except FileNotFoundError:
    print("Error: 'delfi_articles.csv' not found. Please run the scraping step first.")
    # Stop execution if the file doesn't exist.
    raise

# Ensure there are no empty text entries
df.dropna(subset=['text_for_search'], inplace=True)

# Load the pre-trained multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Get the list of sentences (in our case, article titles) to encode
sentences = df['text_for_search'].tolist()

print(f"Encoding {len(sentences)} articles from Delfi into vectors...")

# Encode the sentences.
embeddings = model.encode(sentences, show_progress_bar=True)

print("Encoding complete.")
print("Shape of the embeddings matrix:", embeddings.shape)

# Save the embeddings with a name specific to the Delfi dataset
np.save('delfi_embeddings.npy', embeddings)
print("Embeddings saved to 'delfi_embeddings.npy'")

Encoding 7650 articles from Delfi into vectors...


Batches:   0%|          | 0/240 [00:00<?, ?it/s]

Encoding complete.
Shape of the embeddings matrix: (7650, 384)
Embeddings saved to 'delfi_embeddings.npy'
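The numbers in this output can be checked with a little arithmetic. The sketch below assumes that `model.encode` was called with its default batch size of 32 and that the embeddings are stored as 32-bit floats; both are assumptions about the library defaults, not values printed by the notebook.

```python
import math

n_sentences = 7650    # articles encoded, from the output above
embedding_dim = 384   # second dimension of the embeddings matrix
batch_size = 32       # assumed default batch size of model.encode
bytes_per_value = 4   # assumed float32 storage

# Number of batches shown by the progress bar
n_batches = math.ceil(n_sentences / batch_size)

# Approximate size of the embeddings matrix in megabytes
size_mb = n_sentences * embedding_dim * bytes_per_value / 1e6

print(n_batches)           # 240
print(round(size_mb, 2))   # 11.75
```

At roughly 12 MB the whole matrix fits comfortably in memory, which is why an exact (brute-force) index is a reasonable choice for a dataset of this size.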


In [None]:
import faiss
import numpy as np

#  Load the embeddings created from the Delfi data
try:
    embeddings = np.load('delfi_embeddings.npy')
except FileNotFoundError:
    print("Error: 'delfi_embeddings.npy' not found. Please run the encoding step first.")
    raise

#  Get the dimension of the embeddings (384 for this model)
d = embeddings.shape[1]

#  Build the FAISS index. IndexFlatL2 performs an exact search.
index = faiss.IndexFlatL2(d)

#  Add our article embeddings to the index
index.add(embeddings)

print(f"Indexing complete. Total vectors in the index: {index.ntotal}")

#  Save the index for later use
faiss.write_index(index, 'delfi_index.faiss')
print("Index saved to 'delfi_index.faiss'")

Indexing complete. Total vectors in the index: 7650
Index saved to 'delfi_index.faiss'
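To make clear what an exact L2 search does, here is a minimal NumPy sketch of the same idea: compute the squared Euclidean distance from the query to every stored vector and keep the `k` smallest. It illustrates the technique only and is not how FAISS is implemented internally; `exact_l2_search` and the toy data are hypothetical.

```python
import numpy as np


def exact_l2_search(vectors, queries, k):
    """Brute-force nearest-neighbour search with squared L2 distances.

    vectors: (n, d) array of stored embeddings
    queries: (m, d) array of query embeddings
    Returns (distances, indices), each of shape (m, k), sorted ascending.
    """
    diffs = queries[:, None, :] - vectors[None, :, :]   # shape (m, n, d)
    squared = np.sum(diffs ** 2, axis=2)                # shape (m, n)
    indices = np.argsort(squared, axis=1)[:, :k]
    distances = np.take_along_axis(squared, indices, axis=1)
    return distances, indices


# Toy data: 100 random 8-dimensional vectors and a query close to vector 42
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(100, 8)).astype("float32")
toy_query = toy_vectors[42:43] + 0.01

toy_D, toy_I = exact_l2_search(toy_vectors, toy_query, k=3)
print(toy_I[0][0])  # index of the nearest stored vector
```

Like this sketch, `IndexFlatL2` reports squared L2 distances, so the values returned by a search are not square-rooted.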


In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss

# --- Load all components for the search tool ---
# This code can be run in a new session without re-calculating anything.

# Load the Delfi article data
df = pd.read_csv('delfi_articles.csv')
df.dropna(subset=['text_for_search'], inplace=True)

# Load the transformer model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Load the FAISS index for Delfi articles
index = faiss.read_index('delfi_index.faiss')


# --- The Search Function ---
def semantic_search(query, k=5):
    """
    Performs semantic search on the indexed Delfi articles.

    Args:
        query (str): The search text.
        k (int): Number of top results to return.

    Returns:
        A pandas DataFrame with the top k results.
    """
    # Encode the text query into a vector
    query_embedding = model.encode([query])

    # Search the FAISS index for the k nearest neighbors
    # D = distances, I = indices in the original dataset
    D, I = index.search(query_embedding, k)

    # Retrieve the results from the original dataframe
    results_indices = I[0]
    results_df = df.iloc[results_indices].copy()
    results_df['distance'] = D[0] # Squared L2 distance: lower = more similar

    return results_df

# --- DEMONSTRATION ---
# Test the search tool with sample queries

print("Semantic Search Tool for Delfi.lt is ready!")
print("---")

test_query = "Acting agriculture minister to be questioned by ST"
results = semantic_search(test_query)
print(f"Results for query: '{test_query}'")
display(results[['title', 'distance', 'link']])

Semantic Search Tool for Delfi.lt is ready!
---
Results for query: 'Acting agriculture minister to be questioned by ST'


| | title | distance | link |
|---|---|---|---|
| 5 | Acting agriculture min to be questioned by STT... | 12.47732 | https://www.delfi.lt/en/politics/acting-agricu... |
| 2809 | Agriculture minister to join MPs after Majausk... | 18.856428 | https://www.delfi.lt/en/politics/agriculture-m... |
| 850 | Former parlt speaker questions if president’s ... | 19.713531 | https://www.delfi.lt/en/politics/former-parlt-... |
| 2375 | Nearly all ministers deserve interpellation, F... | 19.856237 | https://www.delfi.lt/en/politics/nearly-all-mi... |
| 2984 | Farmers and Greens will address Constitutional... | 20.078333 | https://www.delfi.lt/en/politics/farmers-and-g... |
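The distances reported above are larger than 4, the maximum squared L2 distance between two unit-length vectors, so the stored embeddings are evidently not normalized. If a bounded, easier-to-read score is preferred, one option is to normalize the vectors first: for unit vectors the squared L2 distance and the cosine similarity are related by `||a - b||^2 = 2 - 2 * cos(a, b)`. The sketch below checks this identity on random vectors; `l2_normalize` and `cosine_from_squared_l2` are hypothetical helpers, not part of the notebook.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length (guarding against zero vectors)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def cosine_from_squared_l2(squared_distance):
    """Valid only for unit-length vectors: cos = 1 - d^2 / 2."""
    return 1.0 - squared_distance / 2.0


rng = np.random.default_rng(0)
a = l2_normalize(rng.normal(size=(1, 384)))
b = l2_normalize(rng.normal(size=(1, 384)))

squared_l2 = float(np.sum((a - b) ** 2))
cosine_direct = float(np.sum(a * b))
cosine_converted = cosine_from_squared_l2(squared_l2)

print(np.isclose(cosine_direct, cosine_converted))
```

With normalized embeddings, the ranking from an L2 index matches the ranking by cosine similarity, and the `distance` column could be converted to a similarity in the range -1 to 1.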
