<a href="https://colab.research.google.com/github/ieg-dhr/NLP-Course4Humanities_2024/blob/main/Transformers_SemantischSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers and Semantic Search

Created by Sarah Oberbichler [![ORCID](https://info.orcid.org/wp-content/uploads/2019/11/orcid_16x16.png)](https://orcid.org/0000-0002-1031-2759)

*   Semantic search is a search technology that interprets the meaning of words and phrases. It returns content that matches the meaning of a query, as opposed to content that literally matches the words in the query.
*   Semantic search uses context clues, learned from a dataset of millions of examples, to determine the meaning of a word.
*   Semantic search also identifies which other words can be used in similar contexts.

In [None]:
!git clone https://github.com/ieg-dhr/NLP-Course4Humanities_2024.git

In [None]:
# @markdown #### Let's import the dataset "NorddeutscheZeitung_1909"
import pandas as pd

# Path to the Excel file inside the cloned course repository; adjust it if your data lives elsewhere
df = pd.read_excel('/content/NLP-Course4Humanities_2024/datasets/NorddeutscheZeitung_1909.xlsx')

# Now you can work with the DataFrame 'df'
df.head()

In [None]:
# @markdown #### Installing Sentence Transformers from Hugging Face
!pip install sentence-transformers

# Using Transformers to find similar words

Finding similar words is one of the most basic tasks in semantic search. While still operating on keywords, transformer models help us identify words with similar or related meanings, which lets us broaden the scope of a traditional keyword search. Including semantically related terms in the search helps find relevant content even when exact keyword matches aren't present.

The code below implements a semantic word similarity search using the multilingual LaBSE transformer model (https://huggingface.co/sentence-transformers/LaBSE). Transformer models generate vector embeddings dynamically based on context, unlike static word embeddings from older methods like Word2Vec. Note that this example encodes each word on its own, so no surrounding sentence context is available to the model here. The code first preprocesses the text corpus by converting all text to lowercase and removing every character that is not a letter (a-z, ä, ö, ü, ß) or whitespace, which also strips digits and punctuation.
The core functionality uses the transformer model to convert words into vector embeddings, then calculates the cosine similarity between a target word (in this case "Naturkatastrophen", German for "natural disasters") and every unique word in the corpus.
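To make the preprocessing step concrete, here is a minimal sketch of what the lowercase-and-filter rule does to a single string. The function name `preprocess_text_demo` and the sample sentence are invented for illustration, and this variant checks for `None` instead of using pandas, so it is a simplified stand-in rather than the notebook's exact function.

```python
import re

def preprocess_text_demo(text):
    """Lowercase the text and keep only a-z, ä, ö, ü, ß and whitespace."""
    if text is None:
        return ""
    text = str(text).lower()
    # Everything outside the allowed set is deleted, not replaced by a space
    return re.sub(r'[^a-zäöüß\s]', '', text)

# Invented sample sentence, not taken from the dataset
sample = "Das Erdbeben-Unglück in Messina, 1908!"
cleaned = preprocess_text_demo(sample)

print(repr(cleaned))                 # cleaned string, including leftover whitespace
print(sorted(set(cleaned.split())))  # unique words, as used for the vocabulary
```

Because removed characters are deleted rather than replaced by a space, a hyphenated compound such as "Erdbeben-Unglück" is fused into a single token. Whether that is desirable depends on the corpus.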

Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them.
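As a small illustration, cosine similarity can be computed by hand as the dot product of two vectors divided by the product of their lengths. The sketch below uses plain Python and tiny made-up vectors; real embeddings have hundreds of dimensions, and the helper name `cosine` is only an illustrative choice.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up two-dimensional vectors, chosen to show the three typical cases
same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])        # vectors at a right angle
opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # vectors pointing apart

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

The value depends only on the direction of the vectors, not on their length: parallel vectors score close to 1, vectors at a right angle score 0, and opposite vectors score close to -1.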

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import re
from sentence_transformers import SentenceTransformer
from collections import Counter

def preprocess_text(text):
    if pd.isna(text):
        return ""
    text = str(text).lower()
    text = re.sub(r'[^a-zäöüß\s]', '', text)
    return text

def get_unique_words(text):
    words = text.split()
    return list(set(words))

def find_similar_words(df, target_word, model, top_n=40):
    # Preprocess the text
    df['processed_text'] = df['plainpagefulltext'].apply(preprocess_text)

    # Get unique words from all texts
    all_words = []
    for text in df['processed_text']:
        all_words.extend(get_unique_words(text))

    # Count in how many texts each word occurs (words were de-duplicated per text)
    word_freq = Counter(all_words)
    unique_words = list(word_freq.keys())

    print(f"Number of unique words: {len(unique_words)}")

    # Encode the target word and unique words
    target_embedding = model.encode([target_word])
    word_embeddings = model.encode(unique_words)

    # Calculate similarities
    similarities = cosine_similarity(target_embedding, word_embeddings)[0]

    # Create a DataFrame with words and their similarities
    word_sim_df = pd.DataFrame({
        'word': unique_words,
        'similarity': similarities
    })

    # Sort by similarity and get top N results
    top_similar = word_sim_df.sort_values('similarity', ascending=False).head(top_n)

    return top_similar

# Load the pre-trained multilingual model
print("Loading the sentence transformer model...")
model = SentenceTransformer('sentence-transformers/LaBSE')
print("Model loaded successfully.")

target_word = "Naturkatastrophen"

print(f"\nFinding words similar to '{target_word}'...")
similar_words = find_similar_words(df, target_word, model)

print("\nMost similar words:")
print(similar_words)


# Keyword-Independent Search

Traditional keyword search only finds exact matches of words like "earthquake" or "reconstruction". Semantic search instead works with the conceptual meaning of the entire query. For example, a search for "reconstruction after earthquake" is treated as a concept involving disaster recovery, rebuilding efforts, community restoration, and infrastructure repair. It can therefore identify relevant content that discusses these themes in different terminology, such as an article about "community revival following seismic damage" or "rebuilding homes in disaster-struck areas."

The search works by transforming both the query and the searchable content into mathematical representations (vectors) that capture their meaning in a multidimensional space, where similar concepts cluster together regardless of the specific words used to express them. This captures context, synonyms, related concepts, and even cross-language connections, so the results align more closely with the user's actual information needs.
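The limitation of literal matching can be shown with a short sketch. The helper `keyword_match` is a hypothetical, deliberately naive keyword search, and the first document is invented; the other two reuse the example phrases from the text above.

```python
def keyword_match(query, text):
    """Return True only if every query word occurs literally in the text."""
    text_words = set(text.lower().split())
    return all(word in text_words for word in query.lower().split())

query = "reconstruction after earthquake"
documents = [
    "Reconstruction began soon after the earthquake",  # invented example
    "Community revival following seismic damage",
    "Rebuilding homes in disaster-struck areas",
]

# Only documents containing all three query words survive
hits = [doc for doc in documents if keyword_match(query, doc)]
print(hits)
```

Only the first document is returned, although all three describe the same theme. An embedding-based search is meant to recover the other two as well, because it compares meaning rather than surface forms.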



In [None]:
# @markdown #### Let's import the dataset "earthquake_articles"
import pandas as pd

# Path to the Excel file inside the cloned course repository; adjust it if your data lives elsewhere
articles_df = pd.read_excel('/content/NLP-Course4Humanities_2024/datasets/earthquake_articles.xlsx')

# Now you can work with the DataFrame 'articles_df'
articles_df

The code below implements a semantic search that uses a multilingual transformer model to find articles relevant to a specific query. It first cleans the input data by dropping rows without article text and casting the remaining entries to strings. It then uses the SBERT (Sentence-BERT) model 'paraphrase-multilingual-MiniLM-L12-v2' to convert the search query and all articles into numerical vectors (embeddings) that capture their semantic meaning. Using cosine similarity, it calculates how closely each article matches the query and keeps only articles with a score above 0.6. The results are sorted by similarity score in descending order, and the code prints the top 10 articles with their similarity scores, titles, and the first 1000 characters of their content. This makes it easy to identify articles that are semantically related to the query "Opfer durch Erdbeben" ("victims of earthquakes"), even if they don't contain the exact search terms.
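The filter-and-rank step described above can be sketched independently of the model. The example below uses plain Python and made-up similarity scores; `filter_and_rank` and the article labels are hypothetical names chosen for illustration, not part of the notebook.

```python
def filter_and_rank(scored_items, threshold=0.6, top_n=10):
    """Keep items scoring strictly above the threshold, best first."""
    kept = [(title, score) for title, score in scored_items if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]

# Made-up (title, similarity) pairs standing in for real model output
scores = [
    ("Article A", 0.72),
    ("Article B", 0.41),
    ("Article C", 0.88),
    ("Article D", 0.60),
]

results = filter_and_rank(scores)
for title, score in results:
    print(f"{score:.4f}  {title}")
```

Note that the comparison is strict, so an article scoring exactly 0.6 is dropped. The cut-off itself is a judgment call: similarity scores vary between models and corpora, so a threshold that works for one setup may need adjusting for another.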

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import torch

# Load the transformer model
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Query phrase for semantic search (German for "victims of earthquakes")
query = "Opfer durch Erdbeben"

# Preprocess DataFrame: drop rows where 'extracted_article_clean' is NaN or not a string
if 'extracted_article_clean' not in articles_df.columns:
    raise ValueError("The column 'extracted_article_clean' is missing in the DataFrame.")

# Drop rows with missing values in 'extracted_article_clean'
articles_df = articles_df.dropna(subset=['extracted_article_clean'])

# Ensure all entries in 'extracted_article_clean' are strings
articles_df.loc[:, 'extracted_article_clean'] = articles_df['extracted_article_clean'].astype(str)

# Encode the 'extracted_article_clean' column from the DataFrame
article_embeddings = model.encode(articles_df['extracted_article_clean'].tolist(), convert_to_tensor=True)

# Encode the query
query_embedding = model.encode(query, convert_to_tensor=True)

# Calculate cosine similarity between query and each article
similarities = util.pytorch_cos_sim(query_embedding, article_embeddings)[0]

# Add similarity scores to the DataFrame
articles_df = articles_df.copy()  # Avoid potential chained assignment warnings
articles_df.loc[:, 'similarity'] = similarities.cpu().numpy()

# Filter articles with similarity score > 0.6
high_similarity_df = articles_df[articles_df['similarity'] > 0.6].copy()

# Sort by similarity in descending order
high_similarity_df = high_similarity_df.sort_values('similarity', ascending=False)

# Print the number of articles found with similarity > 0.6
print(f"\nFound {len(high_similarity_df)} articles with similarity score > 0.6")

# Print the top 10 highest similarity scores as examples
print("\nTop 10 highest similarity scores:")
top_10_examples = high_similarity_df.head(10)
for _, row in top_10_examples.iterrows():
    print(f"\nSimilarity Score: {row['similarity']:.4f}")
    print(f"Title: {row['paper_title']}")
    print(f"First 1000 characters of article: {row['extracted_article_clean'][:1000]}...")

# Save the filtered DataFrame to a new variable
filtered_df = high_similarity_df

print(f"\nShape of filtered DataFrame: {filtered_df.shape}")