# Md Jakaria Mashud Shahria (2431751)

**Task 2**

The second task requires implementing a semantic search tool for a selected context, using a transformer-based natural language model as a feature extractor. First, you will scrape a website (delfi.lt, lrt.lt, ebay.lt, ...) and collect at least 5,000 data entries. You will then develop methods for extracting and saving features so that new queries can be searched efficiently. During the assessment, the instructor will send test texts / descriptions in text format, with which you will have to demonstrate how your implemented model works. When presenting your work, you will be able to explain how you implemented your task variant's configuration, i.e. how the natural language model was used.

In [3]:
!pip install requests beautifulsoup4 pandas sentence-transformers faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0


LRT was tried first, but its listings have no pagination. Delfi paginates every section, so the "Politics" section is used because it has more than 100 pages. If fewer than 5,050 data points are collected, the "Lifestyle" section is scraped as a second URL.
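The number of listing pages needed for the 5,000-entry target can be estimated up front. The sketch below assumes roughly 50 articles per listing page, a figure inferred from this notebook's own run (1,000 articles from 20 page requests) rather than from any documented Delfi value. `pages_needed` and `page_url` are hypothetical helper names; the URL pattern is the one the scraper cell uses.

```python
import math

def pages_needed(target_entries: int, per_page: int) -> int:
    """Smallest number of listing pages that yields at least target_entries."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return math.ceil(target_entries / per_page)

def page_url(section_url: str, page_num: int) -> str:
    """Page 1 is the bare section URL; later pages add ?page=N, as in the scraper."""
    return section_url if page_num == 1 else f"{section_url}?page={page_num}"

# Assumption: about 50 articles per listing page.
print(pages_needed(5000, 50))
print(page_url("https://www.delfi.lt/en/politics", 3))
```

Under that assumption, `max_pages` would need to be around 100 for a single section, or about 50 per section when both sections are scraped.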

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
import time

# --- Scraper Configuration ---
base_url = "https://www.delfi.lt/en/politics"
second_url = "https://www.delfi.lt/en/lifestyle"  # fallback section if fewer than 5,050 entries are collected
data = []
# Increase max_pages to get 5,000+ entries (this run used 10 pages per section)
max_pages = 10

print(f"Starting to scrape up to {max_pages} pages from delfi.lt...")

# --- Scraping Loop ---
def scrape_section(section_url):
    """Scrape up to max_pages listing pages of one section and append the results to data."""
    for page_num in tqdm(range(1, max_pages + 1), desc="Scraping Pages"):
        url = section_url if page_num == 1 else f"{section_url}?page={page_num}"
        try:
            response = requests.get(url)
            response.raise_for_status()
        except requests.exceptions.RequestException as e:
            print(f"Error on page {page_num}: {e}")
            break

        soup = BeautifulSoup(response.content, 'html.parser')

        # The main container for each article is an 'article' tag with this class
        articles = soup.find_all('article', class_='block-type-102-headline')

        if not articles:
            print(f"\nNo more articles found on page {page_num}. Stopping.")
            break

        for article in articles:
            # The title and link are inside a dedicated title div
            title_container = article.find('div', class_='block-type-102-headline__title')
            if not title_container:
                continue

            title_tag = title_container.find('a')
            if title_tag and title_tag.has_attr('href'):
                title = title_tag.get_text(strip=True)
                data.append({
                    "title": title,
                    "link": "https://www.delfi.lt" + title_tag['href'],
                    # Since there's no lead text, our search text is just the title
                    "text_for_search": title
                })

        # Be a good web citizen and wait between requests
        time.sleep(0.5)


scrape_section(base_url)

if len(data) < 5050:
    print("---------Running Second URL if We do not get 5000 data points-----------")
    scrape_section(second_url)


# --- Save to DataFrame ---
df = pd.DataFrame(data)
print(f"\nScraping complete. Collected {len(df)} articles.")
if not df.empty:
    print(df.head())

# Save the new, accurately scraped data
df.to_csv('delfi_articles.csv', index=False)

Starting to scrape up to 10 pages from delfi.lt...


Scraping Pages: 100%|██████████| 10/10 [00:27<00:00,  2.77s/it]


---------Running Second URL if We do not get 5000 data points-----------


Scraping Pages: 100%|██████████| 10/10 [00:30<00:00,  3.01s/it]


Scraping complete. Collected 1000 articles.
                                               title  \
0  Nemunas Dawn to name new ministerial candidate...   
1  Released Belarusian political prisoners say th...   
2     Coalition agreement still valid – PM-designate   
3  SocDems call board meeting after Žemaitaitis’ ...   
4  Nemunas Dawn nominees for energy, environment ...   

                                                link  \
0  https://www.delfi.lt/en/politics/nemunas-dawn-...   
1  https://www.delfi.lt/en/politics/released-bela...   
2  https://www.delfi.lt/en/politics/coalition-agr...   
3  https://www.delfi.lt/en/politics/socdems-call-...   
4  https://www.delfi.lt/en/politics/nemunas-dawn-...   

                                     text_for_search  
0  Nemunas Dawn to name new ministerial candidate...  
1  Released Belarusian political prisoners say th...  
2     Coalition agreement still valid – PM-designate  
3  SocDems call board meeting after Žemaitaitis’ ...  
4  Nem
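Listing pages may repeat an article, for example if new items push older ones onto the next page between requests, so it can be worth checking for duplicate links before counting entries toward the 5,000 target. A minimal sketch on a made-up list shaped like the scraper's `data`; `dedupe_by_link` is a hypothetical helper and the example.com links are placeholders, not scraped values.

```python
def dedupe_by_link(entries):
    """Keep the first entry seen for each link, preserving scrape order."""
    seen = set()
    unique = []
    for entry in entries:
        if entry["link"] not in seen:
            seen.add(entry["link"])
            unique.append(entry)
    return unique

# Illustrative rows only; real rows come from the scraper cell.
sample = [
    {"title": "A", "link": "https://example.com/a", "text_for_search": "A"},
    {"title": "B", "link": "https://example.com/b", "text_for_search": "B"},
    {"title": "A", "link": "https://example.com/a", "text_for_search": "A"},
]

unique = dedupe_by_link(sample)
print(f"{len(sample)} scraped, {len(unique)} unique")
```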




Downloading the model from Hugging Face may require an access token; a Hugging Face token was used for this run.

In [5]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the data you scraped from Delfi
try:
    df = pd.read_csv('delfi_articles.csv')
except FileNotFoundError:
    print("Error: 'delfi_articles.csv' not found. Please run the scraping step first.")
    # Stop execution if the file doesn't exist.
    raise

# Ensure there are no empty text entries
df.dropna(subset=['text_for_search'], inplace=True)

# Load the pre-trained multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Get the list of sentences (in our case, article titles) to encode
sentences = df['text_for_search'].tolist()

print(f"Encoding {len(sentences)} articles from Delfi into vectors...")

# Encode the sentences.
embeddings = model.encode(sentences, show_progress_bar=True)

print("Encoding complete.")
print("Shape of the embeddings matrix:", embeddings.shape)

# Save the embeddings with a name specific to the Delfi dataset
np.save('delfi_embeddings.npy', embeddings)
print("Embeddings saved to 'delfi_embeddings.npy'")

Encoding 1000 articles from Delfi into vectors...



Encoding complete.
Shape of the embeddings matrix: (1000, 384)
Embeddings saved to 'delfi_embeddings.npy'
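The embeddings are later compared by L2 distance. For unit-length vectors, squared L2 distance and cosine similarity give the same ranking, because ||a - b||^2 = 2 - 2 cos(a, b). The distances reported later in this notebook (roughly 12 to 25) exceed 4, the largest possible squared distance between unit vectors, which suggests the stored vectors are not normalized. Normalizing before indexing is one option if cosine-style ranking is wanted. The sketch below checks the identity on random vectors that stand in for real embeddings; the random data is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the (n, 384) embedding matrix; values are random, not real embeddings.
vectors = rng.normal(size=(5, 384)).astype("float32")

# Scale every row to unit length.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

a, b = unit[0], unit[1]
squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(a @ b)

# For unit vectors the two quantities are tied together: ||a - b||^2 == 2 - 2 * cos(a, b)
print(squared_l2, 2 - 2 * cosine)
```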


In [6]:
import faiss
import numpy as np

#  Load the embeddings created from the Delfi data
try:
    embeddings = np.load('delfi_embeddings.npy')
except FileNotFoundError:
    print("Error: 'delfi_embeddings.npy' not found. Please run the embedding step first.")
    raise

#  Get the dimension of the embeddings (384 for this model)
d = embeddings.shape[1]

#  Build the FAISS index. IndexFlatL2 performs an exact search.
index = faiss.IndexFlatL2(d)

#  Add our article embeddings to the index
index.add(embeddings)

print(f"Indexing complete. Total vectors in the index: {index.ntotal}")

#  Save the index for later use
faiss.write_index(index, 'delfi_index.faiss')
print("Index saved to 'delfi_index.faiss'")

Indexing complete. Total vectors in the index: 1000
Index saved to 'delfi_index.faiss'
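`IndexFlatL2` is an exact index: it compares the query against every stored vector and reports squared Euclidean distances. The same computation can be sketched in plain NumPy to make clear what the index returns. `brute_force_search` is a hypothetical helper, and the toy 2-D vectors stand in for the 384-dimensional embeddings.

```python
import numpy as np

def brute_force_search(database: np.ndarray, query: np.ndarray, k: int):
    """Return (squared L2 distances, row indices) of the k rows nearest to query."""
    diffs = database - query            # broadcast the query against every row
    dists = np.sum(diffs ** 2, axis=1)  # squared Euclidean distance per row
    order = np.argsort(dists)[:k]       # indices of the k smallest distances
    return dists[order], order

# Toy data for illustration only.
database = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype="float32")
query = np.array([0.9, 0.1], dtype="float32")

dists, ids = brute_force_search(database, query, k=2)
print(ids, dists)
```

With 1,000 vectors this exhaustive comparison is cheap, which is why an exact flat index is a reasonable choice at this scale.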


In [8]:
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss
import re

# --- Load all components for the search tool ---
# This code can be run in a new session without re-calculating anything.

# Load the Delfi article data
df = pd.read_csv('delfi_articles.csv')
df.dropna(subset=['text_for_search'], inplace=True)

# Load the transformer model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Load the FAISS index for Delfi articles
index = faiss.read_index('delfi_index.faiss')

# --- Helper function for tokenization ---
def tokenize_text(text):
    """Basic tokenization: lowercase and split by non-alphanumeric characters."""
    return set(re.findall(r'\w+', text.lower()))

# --- The Search Function ---
def semantic_search(query, k=5):
    """
    Performs semantic search on the indexed Delfi articles and includes token overlap.

    Args:
        query (str): The search text.
        k (int): Number of top results to return.

    Returns:
        A pandas DataFrame with the top k results, including semantic distance and token overlap.
    """
    # Encode the text query into a vector
    query_embedding = model.encode([query])
    query_tokens = tokenize_text(query)

    # Search the FAISS index for the k nearest neighbors
    # D = distances, I = indices in the original dataset
    D, I = index.search(query_embedding, k)

    # Retrieve the results from the original dataframe
    results_indices = I[0]
    results_df = df.iloc[results_indices].copy()
    results_df['distance'] = D[0] # Lower distance = more similar

    # Calculate token overlap
    results_df['token_overlap'] = results_df['text_for_search'].apply(
        lambda x: len(query_tokens.intersection(tokenize_text(x))) / len(query_tokens) if len(query_tokens) > 0 else 0
    )

    return results_df

# --- The Exact Match Search Function ---
def exact_match_search(query):
    """
    Performs exact match search on the tokenized Delfi article titles.

    Args:
        query (str): The search text.

    Returns:
        A pandas DataFrame with articles that have an exact token match.
    """
    query_tokens = tokenize_text(query)
    exact_matches = []

    for _, row in df.iterrows():  # '_' avoids reusing the name of the FAISS index
        article_tokens = tokenize_text(row['text_for_search'])
        if query_tokens == article_tokens:
            exact_matches.append(row.to_dict())

    return pd.DataFrame(exact_matches)


# --- DEMONSTRATION ---
# Test the search tool with sample queries

print("Search Tool for Delfi.lt is ready!")
print("---")

# Demonstrate Semantic Search with token overlap
test_query_semantic = "Acting agriculture minister to be questioned by ST"
results_semantic = semantic_search(test_query_semantic)
print(f"Semantic Search Results for query: '{test_query_semantic}' (including token overlap)")
display(results_semantic[['title', 'distance', 'link']])
print("---")

# Demonstrate Exact Match Search with a known title from the dataset
test_query_exact = "Coalition agreement still valid – PM-designate"
results_exact = exact_match_search(test_query_exact)
print(f"Exact Match Search Results for query: '{test_query_exact}'")
if not results_exact.empty:
    display(results_exact[['title', 'link']])
else:
    print("No exact matches found.")
print("---")

Search Tool for Delfi.lt is ready!
---
Semantic Search Results for query: 'Acting agriculture minister to be questioned by ST' (including token overlap)


| | title | distance | link |
|---|---|---|---|
| 31 | Acting agriculture min to be questioned by STT... | 12.477318 | https://www.delfi.lt/en/politics/acting-agricu... |
| 55 | Candidates for agriculture and economy ministe... | 22.472071 | https://www.delfi.lt/en/politics/candidates-fo... |
| 104 | PM candidate says 'various questions' discusse... | 23.900375 | https://www.delfi.lt/en/politics/pm-candidate-... |
| 24 | PM-designate will meet with candidate for envi... | 25.368057 | https://www.delfi.lt/en/politics/pm-designate-... |
| 154 | Farmers & Greens leadership calls on PM Paluck... | 25.635052 | https://www.delfi.lt/en/politics/farmers-amp-g... |


---
Exact Match Search Results for query: 'Coalition agreement still valid – PM-designate'


| | title | link |
|---|---|---|
| 0 | Coalition agreement still valid – PM-designate | https://www.delfi.lt/en/politics/coalition-agr... |


---
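The `token_overlap` column is the fraction of distinct query tokens that also appear in a title, and the exact-match search compares token sets, so word order and punctuation do not affect it. A standalone sketch of both behaviours, reusing the notebook's `tokenize_text`; the function name `token_overlap` and the sample queries are introduced here for illustration.

```python
import re

def tokenize_text(text):
    """Basic tokenization: lowercase and split by non-alphanumeric characters."""
    return set(re.findall(r'\w+', text.lower()))

def token_overlap(query, title):
    """Share of distinct query tokens that also occur in the title (0 for an empty query)."""
    query_tokens = tokenize_text(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize_text(title)) / len(query_tokens)

title = "Coalition agreement still valid – PM-designate"

print(token_overlap("coalition agreement valid", title))  # all three query tokens appear
print(token_overlap("coalition budget", title))           # one of two query tokens appears

# Set equality ignores order and punctuation, so a reordered query still matches exactly.
reordered = "PM designate: coalition agreement still valid"
print(tokenize_text(reordered) == tokenize_text(title))
```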
