Extracting the texts of the WH Management plans

In [1]:
# Text extraction uses PyMuPDF (install first with: pip install pymupdf)
# Reference: https://pymupdf.readthedocs.io/en/latest/the-basics.html
import glob

import pymupdf

# All management plans (MPs) are stored in a single folder
folder_path = r"C:\Tudelft\WG_MPs"

In [2]:
# Loading all MPs in the folder; sorted() keeps the file order, and therefore the list indices, reproducible
WG_MPs = [pymupdf.open(f) for f in sorted(glob.glob(folder_path + "/*.pdf"))]

# Extracting all document texts
# Reference: https://pymupdf.readthedocs.io/en/latest/recipes-text.html
MP_texts = [chr(12).join([page.get_text() for page in MP]) for MP in WG_MPs]

# We can now access the text of each MP by its index in the MP_texts list (from 0 to 10)
# For example, to print the text of the 11th and last MP:
print(MP_texts[10])

MANAGEMENT PLAN
Rietveld
Schröder
House
© Centraal Museum Utrecht / Photo: Hans Wilschut
2
CONTENT
1.	 Preface	
3
2.	 Description of the UNESCO-site	
6
2.1.	 Defining the World Heritage Site	
7
2.2.	 Universal values of the site	
7
2.3.	 Authenticity and comprehensiveness	
7
2.4.	 Management and conservation requirements	
8
2.5.	 Spatial definition of the site	
8
3.	 The site: preservation goals and instruments	
9
3.1.	 Preservation goals per party	
10
3.2.	 Instruments	
10
4.	 The management of the RIETVELD SCHRÖDER 	
14
HOUSE: structures, roles, tasks and powers	
4.1.	 The organisation structure	
15
4.2.	 Ownership	
16
4.3.	 Coordination: daily management and supervisors	
16
4.4.	 Conservation goals	
17
4.5.	 Future plan and implementation programme	
17
4.6.	 The current maintenance condition	
18
4.7.	 Science and research	
19
4.8.	 Monitoring and progress reports	
19
5.	 Developments, threats and preventive measures	
20
5.1.	 Incomplete definition of the site	
21
5.2.	 Rietveld P
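
Because every later step addresses a plan by its position in the list, it can help to make the file order explicit. The sketch below is a hypothetical helper (`list_pdfs` and `split_pages` are illustrative names, not part of PyMuPDF) that returns the PDF paths sorted by file name, and shows how the form-feed separator (chr(12)) used above allows a document's text to be split back into pages. It assumes the page texts themselves contain no form-feed characters.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF paths in `folder`, sorted by file name (hypothetical helper)."""
    return sorted(Path(folder).glob("*.pdf"), key=lambda p: p.name.lower())


def split_pages(document_text):
    """Split a text that was joined with form feeds (chr(12)) back into pages."""
    return document_text.split(chr(12))


# Made-up page texts, joined the same way as the entries of MP_texts
example_text = chr(12).join(["page one", "page two", "page three"])
pages = split_pages(example_text)
print(len(pages))  # 3
print(pages[1])    # page two
```

Printing `[p.name for p in list_pdfs(folder_path)]` next to the indices would show which plan sits at which position.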

In [3]:
# Chunking the texts for embedding using LangChain (install first with: pip install langchain-text-splitters)
# Reference: https://docs.langchain.com/oss/python/integrations/splitters
# and https://docs.langchain.com/oss/python/integrations/splitters/recursive_text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# A fairly standard chunk size and overlap (both measured in characters); they can be adjusted as needed
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

MP_Chunks = [text_splitter.split_text(text) for text in MP_texts]
# We now have a list of MPs, each containing a list of text chunks
# For example, to print the 5th chunk of the 8th MP:
print(MP_Chunks[7][4])


management of the Schokland World Heritage site. 
In the management plan, we also take the UNESCO 
objectives into account. They are:
• Protecting, enhancing and developing cultural-
historical, ecological and scenic values;
• Increasing the recognisability and public awareness  
of Schokland;
• Strengthening the local economy.
To achieve its objectives, UNESCO applies the 5 Cs:
• Credibility (the Outstanding Universal Value – each 
World Heritage site is unique)
• Conservation (preservation of World Heritage values)
• Communication (providing information on the World 
Heritage site)
• Capacity building (developing knowledge, economy, 
employment)
• Communities (collaboration with the environment)
We are proud of our beautiful 
World Heritage!
Our approach
Schokland was awarded the World Heritage status in 
1995. During the first decade after obtaining this special 
status, we worked on the restoration of buildings and 
other elements. The decade thereafter, we focused on
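
To make the effect of chunk_size and chunk_overlap concrete, here is a deliberately simplified sketch of overlapping fixed-size windows. It is not how RecursiveCharacterTextSplitter works internally: that splitter prefers to cut at separators such as paragraph breaks, line breaks and spaces, so its chunks are usually shorter than chunk_size. The function name `sliding_window_chunks` is an assumption made for this illustration.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into fixed-size character windows that overlap (simplified illustration)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window moves forward
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each window repeats the last four characters of the previous one, which is what keeps a sentence that straddles a boundary available in both chunks.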


Embedding the Chunks

In [4]:
# sentence-transformers: Python package for generating embeddings using pre-trained models
# Reference: https://sbert.net/docs/sentence_transformer/pretrained_models.html
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np

# We want a fast English model with good quality
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

In [None]:
# BUT this model is meant for English text, so the Dutch MP needs to be translated first!
# The MP of "Droogmakerij de Beemster" must be translated to English; it is expected to be the seventh MP (index 6)
# (CHECK this index against the downloaded files!)
dutch_index = 6
dutch_chunks = MP_Chunks[dutch_index]

from transformers import pipeline
# we will use the following translation model (NL to EN)
# Found on: https://huggingface.co/models?pipeline_tag=translation&sort=trending&search=nl-en
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-nl-en")

# Translating all chunks of the Dutch MP
translated_output = translator(dutch_chunks, batch_size=16, truncation=True)
# truncation=True is purely a safety net, because 1000 characters should be well within the limit (512 tokens for this model)

# The output is a list of dictionaries with a 'translation_text' key, so we extract the texts
english_chunks = [item['translation_text'] for item in translated_output]

print(dutch_chunks[4])
print(english_chunks[4])  # checking translated chunk

# Overwriting the Dutch chunks with their English translations
MP_Chunks[dutch_index] = english_chunks

Device set to use cuda:0


UNESCO World Heritage status to the Van Nellefabriek in 2014.
We, both site holders, would like to make our goals and agreements transparent in relation to 
the challenges this World Heritage status will entail for the next five years.
Rotterdam, January 2021
The siteholders:
The Municipality of Rotterdam,	
	
Virgata Monument BV,
A. Aboutaleb, Mayor	 	
	
J. Goetstouwers Odena, owner
 05 
Mayor Aboutaleb and Minister van Engelshoven (EC&S) in conversation 
at the UNESCO World Heritage Board Day (2018)
06
 07 
2. UNESCO status
The Van Nellefabriek derived its UNESCO status from the special and unique value recorded 
in the ‘Statement of Outstanding Universal Value (OUV). The registration in the World 
Heritage Register is further based on two of the ten criteria used by UNESCO to determine 
the exceptional and unique qualities of this complex as a World Heritage Site.
These 2 criteria are:
II.  The complex exhibits an important interaction of human values for developments in
{'transla
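
Since the position of the Dutch plan depends on the order in which the files were loaded, a hard-coded index is easy to get wrong. As a rough cross-check, the sketch below compares how many common Dutch and English function words a text contains. The word lists and the helper name `looks_dutch` are illustrative assumptions and not part of any library; a dedicated language-detection package would be more reliable.

```python
import re

# Small hand-picked function-word lists (illustrative assumption, far from exhaustive)
DUTCH_WORDS = {"de", "het", "een", "en", "van", "voor", "met", "op", "niet", "dat", "zijn"}
ENGLISH_WORDS = {"the", "a", "an", "and", "of", "for", "with", "on", "not", "that", "are"}


def looks_dutch(text):
    """Return True if `text` contains more Dutch than English function words (crude heuristic)."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    dutch_hits = sum(word in DUTCH_WORDS for word in words)
    english_hits = sum(word in ENGLISH_WORDS for word in words)
    return dutch_hits > english_hits


print(looks_dutch("De droogmakerij is een polder in het noorden van Nederland"))  # True
print(looks_dutch("The management of the site is described in the plan"))         # False
```

Applied to the notebook's data, `[i for i, text in enumerate(MP_texts) if looks_dutch(text)]` would list the candidate indices to compare against dutch_index.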

In [6]:
# Preparing the data
# MP_Chunks is a list of lists; a single flat list of all chunks is more convenient, as long as we keep track of each chunk's MP index
all_chunks = []
mp_indices = []
for mp_index, chunks in enumerate(MP_Chunks):
    all_chunks.extend(chunks)
    mp_indices.extend([mp_index] * len(chunks))
# extend() adds all elements of the given list (or any other iterable) to the end of the current list.
# Reference: https://www.w3schools.com/python/ref_list_extend.asp
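
The same flattening pattern can be traced on a toy input. The names below (`toy_chunks`, `flat_chunks`, `flat_indices`) are made up for illustration; the point is that the two flat lists stay aligned, so the chunks of any single document can be recovered later.

```python
# Three made-up "documents" with 2, 1 and 3 chunks
toy_chunks = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]

flat_chunks = []
flat_indices = []
for doc_index, chunks in enumerate(toy_chunks):
    flat_chunks.extend(chunks)                      # append every chunk of this document
    flat_indices.extend([doc_index] * len(chunks))  # one index entry per chunk

print(flat_indices)  # [0, 0, 1, 2, 2, 2]

# Recovering the chunks that belong to the last document (index 2)
chunks_of_last_doc = [c for c, i in zip(flat_chunks, flat_indices) if i == 2]
print(chunks_of_last_doc)  # ['c1', 'c2', 'c3']
```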

In [7]:
# Generating embeddings for all chunks
embeddings = model.encode(all_chunks)

# Storing embeddings in a DataFrame
df = pd.DataFrame({
    'mp_index': mp_indices,
    'text_chunk': all_chunks,
    'embedding': list(embeddings) 
})
# Displaying the DataFrame
print(df)

# Saving the DataFrame to a Parquet file for efficient storage
# Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html
df.to_parquet("WG_MPs_MiniLM_Embeddings.parquet")

# Checking the dimension of a single embedding (.iloc selects by position, independent of the index labels)
length = len(df['embedding'].iloc[0])
print(f"Dimension of a single embedding: {length}")


      mp_index                                         text_chunk  \
0            0  Kingdom of the Netherlands and Kingdom of Belg...   
1            0  Component part B: Wortel \nComponent part C: V...   
2            0  with the contact addresses provided in Chapter...   
3            0  Starting points of the Management Plan\t\n76\n...   
4            0  programme level\t\n107\n4.7\t\nOrganisation pe...   
...        ...                                                ...   
2631        10  department. This department arranges a consult...   
2632        10  around the house, taking care of minor issues ...   
2633        10  of the Rietveld Schröder House and documented ...   
2634        10  Registratie als Rijksmonument onder Monumentnu...   
2635        10  •\t Gemeente Utrecht. Ruimtelijke Strategie Ut...   

                                              embedding  
0     [0.044666335, 0.023840988, 0.069390655, -0.067...  
1     [0.03859686, 0.09956105, 0.051126897, -0.03940...