# 📘 DIS Final Exam - Fall 2024

**🎉 Welcome to the DIS final exam, which takes place on 30 January 2025.**

> Please fill in the following information:
> - Your Name: 
> - Your SCIPER:

<div style="padding:15px 15px 15px 15px;border-left:3px solid #b7b7b7ff;background-color:#eeeeeeff;border-radius: 15px;color:black;">

## Rename your notebook with your SciperNo

#### 🎯 **GOAL:** The final submitted file should have the following name: `SciperNo.ipynb`

</div>

<div style="padding:15px 15px 15px 15px;border-left:3px solid #b7b7b7ff;background-color:#eeeeeeff;border-radius: 15px;color:black;">

## THE TASK

You are given a set of documents containing relations. These statements have the form `(head, relation, tail)`, where head and tail are entities. For example, the statement "the window is part of the building" has the following form: `("the window", "is part of", "the building")`, with `"the window"` and `"the building"` being the entities and `"is part of"` being the relation pattern of the statement.

You will need to:
- Explore and understand the given documents.
- Extract entities from the documents with syntactic matching based on initial seed relations.
- Based on the extracted entities, extract new relations.
- Perform entity deduplication and normalization.
- Run a full bootstrapping pipeline for relation extraction, computing the confidence in the reliable patterns, until convergence.


## THE DATA

The columns of the provided data are the following:

- `document`: the text of a document

</div>
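
To make the statement format concrete, here is a minimal sketch that represents a statement as a plain tuple and splits a sentence around a known relation pattern with `str.partition`. The helper name `split_statement` is hypothetical and is not part of the exam template.

```python
def split_statement(sentence, relation):
    """Split a sentence into (head, relation, tail) around a relation pattern.

    Returns None when the relation pattern does not occur in the sentence.
    """
    head, found, tail = sentence.partition(relation)
    if not found:
        return None
    return (head.strip(), relation, tail.strip())


statement = split_statement("the window is part of the building", "is part of")
print(statement)  # ('the window', 'is part of', 'the building')
```

This sketch keeps the whole text on each side of the pattern; the exam tasks below restrict heads and tails to a few words.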

<div style="padding:15px 15px 15px 15px;border-left:3px solid #b7b7b7ff;background-color:#eeeeeeff;border-radius: 15px;color:black;">

#### Structure of the exam & Quick access

- [PART 1: Data descriptives](#part1)
    - [1.1 Compute the size of the dataset and the average, max and min document length in the dataset](#part11)
    - [1.2 Print the 5 tokens that appear most often](#part12)

- [PART 2: Entity and relation extraction](#part2)
    - [2.1 Based on the seed relation pattern, extract the entities for each document using string matching.](#part21)
    - [2.2 Based on the extracted entities, extract new relation patterns for each document](#part22)
    - [2.3 Obtain the unique entities and relation patterns.](#part23)

- [PART 3: Perform entity resolution and normalize the entities found](#part3)
     - [3.1 Encode entities.](#part31)
     - [3.2 Cluster entity encodings.](#part32)
     - [3.3 Normalize extracted entities based on the clusters.](#part33)

- [PART 4: Use Bootstrapping for relation extraction](#part4)
    - [4.1 Create a function that computes the log confidence of an extracted relation pattern and a function that computes the confidence of an extracted statement.](#part41)
    - [4.2 Run the bootstrap pipeline for one iteration.](#part42)
    - [4.3 Run the bootstrap pipeline until convergence (no new relation patterns detected).](#part43)
 
- [PART 5: Follow-up questions](#part5)


</div>

### Answer the **MCQ questions on Moodle** before submitting.
### You can find them here: https://moodle.epfl.ch/mod/quiz/view.php?id=1320882

### 🍀 GOOD LUCK 🍀

---

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from collections import defaultdict
import pandas as pd
import math
import re
import nltk
from nltk.corpus import stopwords
from collections import Counter

In [None]:
nltk.download('stopwords')

In [None]:
# Embeddings to be used in this exam
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
# read input file
documents = pd.read_json('documents.json')

In [None]:
len(documents)

In [None]:
documents.head(3)

<a id='part1'></a>

<div style="padding:15px 15px 15px 15px;border-left:3px solid #0b5394ff;background-color:#3d85c6ff;border-radius: 15px;color:white;">

## 1. Data Descriptives

#### 🎯 **GOAL:** Understand the dataset by exploring data statistics.

</div>

<a id='part11'></a>
<div style="padding:15px 15px 15px 15px;border-left:3px solid #0b5394ff; background-color:#eff7fe;border-radius: 15px;">

#### **1.1** Compute the size of the dataset and the average, max and min document length in the dataset.

</div>

In [None]:
def document_statistics(documents):
    """
    Compute the following statistics of the input documents:
    - average length (number of words)
    - min and max lengths
    - std of the lengths
    
    :param documents: list of str with the documents.
    return: mean and std
    """

    # --------------
    # YOUR CODE HERE

    
    # --------------

    return (avg_document_length, std_document_length)

In [None]:
document_statistics(documents['document'].values)
print('\n'*2)
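
As an illustration of the statistics requested in 1.1, the sketch below computes length statistics on two toy sentences. It assumes that document length means the number of whitespace-separated words and that the population standard deviation (NumPy's default) is wanted; the name `length_statistics` is hypothetical.

```python
import numpy as np


def length_statistics(docs):
    """Return (mean, std, min, max) of document lengths, counted in words."""
    # Length of a document = number of whitespace-separated tokens (assumption).
    lengths = np.array([len(doc.split()) for doc in docs])
    return lengths.mean(), lengths.std(), lengths.min(), lengths.max()


toy_docs = ["the table is part of the office", "the table is in the office"]
mean_len, std_len, min_len, max_len = length_statistics(toy_docs)
print(mean_len, std_len, min_len, max_len)
```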

<a id='part12'></a>
<div style="padding:15px 15px 15px 15px;border-left:3px solid #0b5394ff; background-color:#eff7fe;border-radius: 15px;">

#### **1.2** Print the 5 tokens that appear most often.

_Remove stopwords before computing the top 5 tokens._

</div>

In [None]:
def top_tokens(documents):
    """
    Compute the 5 most frequent tokens.

    :param documents: list of str with the documents.
    return: 
        - dict with the 5 most frequent tokens along with their counts
        - list of str, of all the tokens found in the documents
    """
    stop_words = list(stopwords.words('english'))
    # --------------
    # YOUR CODE HERE

    
    # --------------
    
    return top5,  tokens

In [None]:
top5, tokens = top_tokens(documents['document'].values)
print('Number of tokens: {}'.format(len(tokens)))
print()
print('5 most frequent tokens:')
for top in top5:
    print(top)

print('\n'*2)
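
One possible way to count frequent tokens is `collections.Counter`, sketched below on toy data. The stop-word set `TOY_STOP_WORDS` and the function name `most_common_tokens` are hypothetical stand-ins; the exam cell uses NLTK's English stop words, and its docstring asks for a dict rather than the list of tuples returned here.

```python
from collections import Counter

# Hypothetical mini stop-word list, used only for this illustration.
TOY_STOP_WORDS = {"the", "is", "of", "in"}


def most_common_tokens(docs, stop_words, k=5):
    """Return the k most frequent non-stop-word tokens and the full token list."""
    tokens = [
        tok
        for doc in docs
        for tok in doc.lower().split()
        if tok not in stop_words
    ]
    return Counter(tokens).most_common(k), tokens


toy_docs = ["the table is part of the office", "the table is in the office"]
top, tokens = most_common_tokens(toy_docs, TOY_STOP_WORDS, k=2)
print(top)
```

Converting the result with `dict(top)` yields the dict form described in the docstring above.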

<a id='part2'></a>

<div style="padding:15px 15px 15px 15px;border-left:3px solid #38761dff;background-color:#6aa84fff;border-radius: 15px;color:white;">

## 2. Entity and relation extraction

#### 🎯 GOAL: Extract the entities and relations from the documents based on the seed relation pattern.

</div>

<a id='part21'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #38761dff;background-color:#e4fae4;border-radius: 15px;">

#### **2.1:** Based on seed relation patterns, extract the entities for each document using string matching.

</div>

In [None]:
# Function to extract entities based on seed relation patterns
def extract_entities(documents, relations):
    """   
    Extracts the head (up to 4 words before) and tail (up to 4 words after) 
    of a given relation pattern in a sentence. The function should return the unique entities.

    :param documents: list of str, with the text of the documents (statements)
    :param relations: list of str, with the seed relations
    return: list of tuples: A list of tuples containing the head and tail as strings, or (None, None) if the relation is not found.
    """

    extracted_entities = []
    # --------------
    # YOUR CODE HERE

    # --------------
    return extracted_entities

In [None]:
# test entity extraction
test_seed = ["is part of"]

test_documents = [
    "the table is part of the office",
    "tables can be found in offices",
]
test_extracted_patterns = extract_entities(test_documents, test_seed)
test_extracted_patterns

# EXPECTED OUTPUT:
# [('the table', 'the office')]
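
A minimal sketch of the string-matching idea follows, assuming each relation pattern appears surrounded by single spaces and that only the first occurrence per document matters. The name `extract_entity_pairs` and the `window` parameter are hypothetical; this is one way to reproduce the expected output above, not necessarily the intended solution.

```python
def extract_entity_pairs(docs, relation_patterns, window=4):
    """Collect unique (head, tail) pairs around each relation pattern.

    head: up to `window` words directly before the pattern
    tail: up to `window` words directly after the pattern
    """
    pairs = []
    for doc in docs:
        for pattern in relation_patterns:
            # Pad with spaces so the pattern only matches whole words.
            before, found, after = doc.partition(f" {pattern} ")
            if not found:
                continue
            head = " ".join(before.split()[-window:])
            tail = " ".join(after.split()[:window])
            if head and tail and (head, tail) not in pairs:
                pairs.append((head, tail))
    return pairs


toy_docs = ["the table is part of the office", "tables can be found in offices"]
print(extract_entity_pairs(toy_docs, ["is part of"]))
```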

<a id='part22'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #38761dff;background-color:#e4fae4;border-radius: 15px;">

#### **2.2:** Based on the extracted entities, extract new relation patterns for each document.

</div>

In [None]:
def extract_relations(documents, entities):
    """  
    Extracts the relation given a pair of head and tail entities.
    The function should return the unique relations.

    :param documents: list of str, with the text of the documents (statements)
    :param entities: list of tuples of str, with the entity pairs.
    return: list of str: A list of the extracted relations.
    """
    # --------------
    # YOUR CODE HERE

    # --------------

    return relations

In [None]:
# test pattern extraction
test_seeds = [("the table", "the office")]

test_documents = [
    "the table is part of the office",
    "the table can be found in the office",
]
test_extracted_patterns = extract_relations(test_documents, test_seeds)
test_extracted_patterns

# EXPECTED OUTPUT:
# ['is part of', 'can be found in']
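
The reverse direction can be sketched with a regular expression that captures the text between a known head and tail. The function name `relations_between` is hypothetical, and the sketch assumes the head precedes the tail within a single document.

```python
import re


def relations_between(docs, entity_pairs):
    """Return the unique text spans found between each head and tail."""
    relations = []
    for doc in docs:
        for head, tail in entity_pairs:
            # re.escape guards against regex metacharacters in entity strings.
            pattern = re.escape(head) + r"\s+(.+?)\s+" + re.escape(tail)
            match = re.search(pattern, doc)
            if match and match.group(1) not in relations:
                relations.append(match.group(1))
    return relations


toy_docs = [
    "the table is part of the office",
    "the table can be found in the office",
]
print(relations_between(toy_docs, [("the table", "the office")]))
```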

<a id='part23'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #38761dff;background-color:#e4fae4;border-radius: 15px;">

#### **2.3:** Obtain the unique entities and relation patterns.

</div>

In [None]:
def find_unique_entities(entity_pairs):
    """
    Return a set of the unique entities found in entity pairs.

    :param entity_pairs: list of tuples of str, with the entity pairs.
    return: set of str, with the unique entities found in entity pairs
    """
    # --------------
    # YOUR CODE HERE

    # --------------
    return unique_entities


def find_unique_relations(relations):
    """
    Return a set of unique relations. 

    :param relations: list of str, with the relations
    return: set of str, with the unique relations
    """
    # --------------
    # YOUR CODE HERE

    # --------------
    return unique_relations

In [None]:
# test above functionality
test_entities = [('the table', 'the office'), ('the table', 'the building')]
test_relations = ['is part of', 'can be found in', 'is part of', '']
print(find_unique_entities(test_entities))
print(find_unique_relations(test_relations))

# EXPECTED OUTPUT:
# {'the office', 'the table', 'the building'}
# {'can be found in', 'is part of'}
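
Both deduplication steps can be written as set comprehensions, as in the sketch below. The function names are hypothetical, and dropping empty relation strings is an assumption inferred from the expected output above.

```python
def unique_entities_of(entity_pairs):
    """Flatten (head, tail) pairs into a set of entity strings."""
    return {entity for pair in entity_pairs for entity in pair}


def unique_relations_of(relations):
    """Deduplicate relation patterns, dropping empty strings."""
    return {relation for relation in relations if relation}


pairs = [("the table", "the office"), ("the table", "the building")]
print(unique_entities_of(pairs))
print(unique_relations_of(["is part of", "can be found in", "is part of", ""]))
```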

<a id='part3'></a>
<div style="padding:15px 15px 15px 15px;border-left:3px solid #b45f06ff;background-color:#e69138ff;border-radius: 15px;color:white;">

## 3. Perform entity resolution and normalize the entities found.

#### 🎯 GOAL: Create one reference string for semantically similar entities (entity normalization).

</div>

<a id='part31'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #b45f06ff;background-color:#fce5cdff;border-radius: 15px;">

#### **3.1:** Encode entities.

_Use the SentenceTransformer model instantiated above_
</div>

In [None]:
def encode_entities(entities, model):
    """
    For each entity, compute the embedding vectors.

    :param entities: list of str, with the entities. 
    :param model: SentenceTransformer model object
    return: dict with keys the entities and values the embedding vector for each entity.
    """
    # --------------
    # YOUR CODE HERE

    
    # --------------
    return encodings_map

In [None]:
# test entity encoding
test_entities = ["teacher", "teachers", "doctor", "patient"]

test_encodings = encode_entities(test_entities, model)
print(test_encodings.keys())
test_encodings['teacher'].shape

# EXPECTED OUTPUT:
# dict_keys(['teacher', 'teachers', 'doctor', 'patient'])
# (384,)

<a id='part32'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #b45f06ff;background-color:#fce5cdff;border-radius: 15px;">

#### **3.2:** Cluster entity encodings.

</div>

In [None]:
def cluster_strings_by_similarity(embeddings_dict, threshold):
    """
    Cluster strings based on their embedding similarity using a given threshold.
    
    :param embeddings_dict: Dictionary {string: embedding (numpy array)}.
    :param threshold: Similarity threshold (float, e.g., 0.8).
    :return: List of clusters, where each cluster is a list of strings.
    """
    strings = list(embeddings_dict.keys())
    embeddings = np.array(list(embeddings_dict.values()))
    
    # Compute the pairwise cosine similarity matrix
    similarity_matrix = cosine_similarity(embeddings)
    
    # Create a graph as an adjacency list
    graph = defaultdict(set)
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if similarity_matrix[i, j] > threshold:
                graph[strings[i]].add(strings[j])
                graph[strings[j]].add(strings[i])
    
    # Perform a graph traversal to find connected components (clusters)
    visited = set()
    clusters = []

    def dfs(node, cluster):
        visited.add(node)
        cluster.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                dfs(neighbor, cluster)

    for string in strings:
        if string not in visited:
            cluster = []
            dfs(string, cluster)
            clusters.append(cluster)
    return clusters

In [None]:
# test entity clustering
test_clusters = cluster_strings_by_similarity(test_encodings, 0.8)
test_clusters

# EXPECTED OUTPUT:
# [['teacher', 'teachers'], ['doctor'], ['patient']]
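
Because the clusters are connected components of the similarity graph, two strings can share a cluster even when their direct similarity is below the threshold, as long as a chain of similar strings links them. The sketch below illustrates this with hypothetical 2-D vectors standing in for real embeddings.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical 2-D "embeddings" at angles of 0, 30 and 60 degrees.
angles = np.deg2rad([0, 30, 60])
a, b, c = (np.array([np.cos(t), np.sin(t)]) for t in angles)

threshold = 0.8
print(cosine(a, b) > threshold)  # True  (cos 30° ≈ 0.866)
print(cosine(b, c) > threshold)  # True  (cos 30° ≈ 0.866)
print(cosine(a, c) > threshold)  # False (cos 60° = 0.5)
```

With edges a-b and b-c in the graph, the depth-first traversal above would place all three vectors in one cluster.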

<a id='part33'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #b45f06ff;background-color:#fce5cdff;border-radius: 15px;">

#### **3.3:** Normalize extracted entities based on the clusters.

</div>

In [None]:
def normalize(clusters, documents, entity_pairs):
    """
    Replace the entities in the same cluster with only one corresponding token. Apply normalization to both sentences and entity pairs.
    E.g.
    cluster: ['teacher', 'teachers'] 
    input: ['the teacher is working at school', 
            'teachers work the school']

    output: ['the teacher is working at school', 
             'teacher work the school']
    or
    output: ['the teachers is working at school', 
             'teachers work the school']   

    :param clusters: list of lists of clusters
    :param documents: list of str, with the text of the documents (statements)
    :param entity_pairs: list of tuples of str, with the entity pairs.
    return: 
        - list of str, with the text of the documents (statements) with normalized entities
        - list of tuples of str, with the normalized entity pairs.
    """
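    # Map every cluster member to the first string of its cluster.
    # (Named `canonical` to avoid shadowing the function name `normalize`.)
    canonical = dict()
    for c in clusters:
        w0 = c[0]
        for w in c:
            canonical[w] = w0

    # Replace cluster members in the raw sentences (plain substring replacement)
    normalized_sentences = []
    for sentence in documents:
        for key, value in canonical.items():
            sentence = sentence.replace(key, value)
        normalized_sentences.append(sentence.strip())

    # Replace heads and tails that exactly match a cluster member
    normalized_entity_pairs = []
    for entity_pair in entity_pairs:
        head = canonical.get(entity_pair[0], entity_pair[0])
        tail = canonical.get(entity_pair[1], entity_pair[1])
        normalized_entity_pairs.append((head, tail))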

    return normalized_sentences, normalized_entity_pairs

In [None]:
# test normalized entity functionality
test_documents = ['the teacher is part of the school', 
                  'teachers can be found at the schools']

test_entities = [('the teacher', 'the school'), 
                  ('teachers', 'the schools')]

normalized_test_docs, normalized_test_entities = normalize(test_clusters, test_documents, test_entities)

for d, e in zip(normalized_test_docs, normalized_test_entities):
    print(d)
    print(e)
    print()

print('\n'*1)

# EXPECTED OUTPUT:
# the teacher is part of the school
# ('the teacher', 'the school')

# teacher can be found at the schools
# ('teacher', 'the schools')

<a id='part4'></a>

<div style="padding:15px 15px 15px 15px;border-left:3px solid #351c75ff;background-color:#674ea7ff;border-radius: 15px;color:white;">

## 4. Use Bootstrapping for relation extraction.

#### 🎯 GOAL: Use the above functions to iteratively extract new relation patterns until convergence. Select relation patterns based on confirmed statements (entity pairs) and based on the log confidence threshold. 

</div>

<a id='part41'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #351c75ff;background-color:#d9d2e9ff;border-radius: 20px;">

#### **4.1:** Create a function that computes the log confidence of an extracted relation pattern and a function that computes the confidence of an extracted statement.

</div>

In [None]:
def get_confidence(new_entities, confirmed_entities):
    """
    Compute the log confidence of a relation.

    :param new_entities: list of tuples of str, with entity pairs. 
    :param confirmed_entities: list of tuples of str, with entity pairs. 
    return: float with the confidence score
    """
    # --------------
    # YOUR CODE HERE
    

    # --------------
    return confidence

In [None]:
test_link = "is part of"
test_confirmed_entities = [('the table', 'the office')]
test_new_entities = [('the table', 'the office'), ('the table', 'the kitchen')]
print(get_confidence(test_new_entities, test_confirmed_entities))
print('\n'*1)

# EXPECTED OUTPUT:
# 0.34657359027997264
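
The exam text does not spell out the confidence formula. One candidate that reproduces the expected value above is the pattern's precision weighted by the natural log of one plus the number of confirmed matches. This is an assumption: other variants fit this single test value too, so check the course material. The function name `log_confidence_sketch` is hypothetical.

```python
import math


def log_confidence_sketch(new_entities, confirmed_entities):
    """Precision of a pattern, weighted by log(1 + number of confirmed matches).

    Assumed formula: (positive / total) * ln(positive + 1).
    """
    if not new_entities:
        return 0.0
    confirmed = set(confirmed_entities)
    positive = sum(1 for pair in new_entities if pair in confirmed)
    return (positive / len(new_entities)) * math.log(positive + 1)


confirmed = [("the table", "the office")]
new = [("the table", "the office"), ("the table", "the kitchen")]
print(log_confidence_sketch(new, confirmed))
```

Here one of the two extracted pairs is confirmed, so the score is 0.5 · ln 2.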

In [None]:
def get_statement_confidence(sentences, entity_pair, relations):
    """
    Compute the confidence of a statement (entity pair and relation).

    :param sentences: list of str, with the text of the documents (statements)
    :param entity_pair: tuple of str, the entity pair for which confidence is computed
    :param relations: dict with keys the known relations and values their confidence
    return: float, the confidence of a statement
    """
    # --------------
    # YOUR CODE HERE

    
    # --------------

In [None]:
test_pair = ("the table", "the office")

test_documents = [
    "the table is part of the office",
    "the table can be found in the office",
]

test_relations = {"is part of": 0.7,"can be found in": 0.6}

get_statement_confidence(test_documents, test_pair, test_relations)

# EXPECTED OUTPUT:
# 0.88
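
The expected value 0.88 is consistent with a noisy-or combination of the confidences of the patterns that connect the pair: 1 - (1 - 0.7)(1 - 0.6) = 0.88. The sketch below shows only that combination step, assuming the matching pattern confidences have already been collected; the name `noisy_or` is hypothetical.

```python
def noisy_or(confidences):
    """Combine independent pattern confidences as 1 - prod(1 - c)."""
    remaining_doubt = 1.0
    for c in confidences:
        remaining_doubt *= 1.0 - c
    return 1.0 - remaining_doubt


print(round(noisy_or([0.7, 0.6]), 2))  # 0.88
```

Under this rule, every additional supporting pattern can only raise the statement's confidence.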

<a id='part42'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #351c75ff;background-color:#d9d2e9ff;border-radius: 20px;">

#### **4.2:** Run the bootstrap pipeline for one iteration.

</div>

In [None]:
def get_top_relations_and_statements(sentences, 
                                     entity_pairs, 
                                     relations, 
                                     t=0.5,
                                     sim=0.7,
                                     t_stat = 0.5):
    """
    :param sentences: list of str, non-normalized sentences to extract relations.
    :param entity_pairs: list of tuples of str, of the confirmed entity pairs
    :param relations: list of str, confirmed relation patterns
    :param t: float, threshold for log confidence to retain relation pattern.
    :param sim: float, threshold for entity similarity.
    :param t_stat: float, threshold for confidence in statements.
    return: 
        - dict with relations as keys and their confidence as values
        - list of tuples of str, with the entity pairs
    """
    # --------------
    # YOUR CODE HERE

    # Step 1: normalize documents and entities based on the confirmed entity pairs
    
    
    # Step 2: for the confirmed entity pairs, extract all the relation patterns in the documents
    

    # Step 3: for the extracted relation patterns, compute their confidence 
    

    # Step 4: keep only the relation patterns that exceed the confidence threshold 
    

    # Step 5: for the (non-normalized) entity pairs matching the selected relation patterns, compute confidence in the statement
    # select those entity pairs that have statement confidence above the threshold as confirmed statements

    
    # --------------
    return filtered_relations, filtered_statements

In [None]:
# Example usage:
test_seeds = ["is part of"]

test_documents = [
    "the table is part of the office",
    "the table is in the office",
    "the table is in the office",
]

entity_pairs = extract_entities(test_documents, test_seeds)
get_top_relations_and_statements(test_documents, entity_pairs, test_seeds, 0.5, 0.8, 0.5)

# EXPECTED OUTPUT:
# ({'is in': 0.6931471805599453}, [('the table', 'the office')])

<a id='part43'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #351c75ff;background-color:#d9d2e9ff;border-radius: 20px;">

#### **4.3:** Run the bootstrap pipeline until convergence (no new relation patterns detected).

</div>

In [None]:
# Iterative discovery of new relation patterns and confirmed entity pairs; don't change this cell

def run_bootstrap(sentences, seed_relation, t, sim, stat):
    top_relations = {1: [seed_relation]}
    confirmed_entity_pairs = extract_entities(sentences, [seed_relation])
    new_top_links = []
    
    i=1
    while True:
        
        old_relations = new_top_links
        old_pairs = confirmed_entity_pairs
        new_top_links_d, confirmed_entity_pairs = get_top_relations_and_statements(sentences, confirmed_entity_pairs, 
                                                                                   top_relations[i], 
                                                                                   t, 
                                                                                   sim, 
                                                                                   stat)
        new_top_links = list(new_top_links_d.keys())

        # Remove links that are already extracted
        new_items = [link for link in new_top_links if link not in top_relations[i]]

        # If no new links were added, stop the loop
        if not new_items:
            break

        # Add the new links to the extracted list
        all_links = top_relations[i] + new_items
        i += 1
        top_relations[i] = all_links

    return top_relations

#### Experiments

In [None]:
thresholds = [0.5, 0.7, 0.9] 
similarities = [0.5, 0.7, 1]
statement_thresholds = [0.5, 0.7, 0.9]

thresholds = [0.5]

res = list()
for similarity in similarities:
    for threshold in thresholds:
        for statement_threshold in statement_thresholds:
        
            run = run_bootstrap(documents['document'].values, 'is part of', t = threshold, sim = similarity, stat = statement_threshold)
            number_of_runs = len(run.keys())
            res.append({'similarity': similarity, 'threshold': threshold, 'stat_threshold': statement_threshold,
                        'number_of_runs': number_of_runs, 'extracted_relations': run[number_of_runs], 
                        'extracted_relations_count': len(run[number_of_runs])})
results = pd.DataFrame(res)
results

<a id='part5'></a>

<div style="padding:15px 15px 15px 15px;border-left:3px solid #bf9000ff;background-color:#f1c232ff;border-radius: 15px;color:white;">

## 5. Follow-up questions

Based on the previous experiments, answer the following questions.
</div>

<a id='part51'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #bf9000ff;background-color:#fff2ccff;border-radius: 20px;">

#### **5.1:** Discuss the impact of the confidence threshold. Does increasing the threshold always result in fewer detected relation patterns?

</div>

> Your answer here:


<a id='part52'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #bf9000ff;background-color:#fff2ccff;border-radius: 20px;">

#### **5.2:** Discuss the impact of the entity similarity threshold. Does decreasing the threshold always result in more detected relation patterns?

</div>

> Your answer here:


<a id='part53'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #bf9000ff;background-color:#fff2ccff;border-radius: 20px;">

#### **5.3:** With the same parameter settings, does the choice of the seed relation pattern influence the final result?

</div>

> Your answer here:


<a id='part54'></a>
<div style="padding:15px 20px 20px 20px;border-left:3px solid #bf9000ff;background-color:#fff2ccff;border-radius: 20px;">

#### **5.4:** Can you explain why it is not unreasonable that the relation "is less expensive than" can be confused with "is part of"?

</div>

> Your answer here:


## 🔚 END OF EXAM
> Don't forget to rename the file to your SciperNo before submitting.

<a id='submit'></a>
#### [SUBMIT HERE](https://moodle.epfl.ch/mod/assign/view.php?id=1321125)