[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CCS-ZCU/EuPaC_shared/blob/master/NOSCEMUS_getting-started.ipynb)

This Jupyter notebook has been prepared for the EuPaC Hackathon and provides an easy way to start working with the NOSCEMUS dataset, with no need to clone the entire repository or download additional data. It is fully compatible with cloud platforms like Google Colaboratory (click the badge above). The only additional packages it installs are folium and geopandas, which the first code cell handles.

As such, it is intended as a starting point for EuPaC participants, including those with minimal coding experience.

In [1]:
# Phase 0A: Setup - Install Libraries
%pip install folium geopandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Phase 0B: Setup - Import Libraries
import pandas as pd
import nltk
import re
import requests
import json
import io
import folium
import geopandas as gpd
import os
import time

In [None]:
# Phase 0A: Data Exploration
# Load the metadata table from GitHub and display 2 sample DataFrame rows
noscemus_metadata = pd.read_csv("https://raw.githubusercontent.com/CCS-ZCU/noscemus_ETF/refs/heads/master/data/metadata_table_long.csv")
noscemus_metadata.head(2)

Unnamed: 0,Author,Full title,In,Year,Place,Publisher/Printer,Era,Form/Genre,Discipline/Content,Original,...,Of interest to,Transkribus text available,Written by,Library and Signature,ids,id,date_min,date_max,filename,file_year
0,"Achrelius, Daniel",Scientiarum magnes recitatus publice anno 1690...,,1690,[Turku],Wall,17th century,Oration,"Mathematics, Astronomy/Astrology/Cosmography, ...",Scientiarum magnes(Google Books),...,"MK, JL",Yes,IT,,[705665],705665,1690.0,1690.0,"Achrelius,_Daniel_-_Scientiarum_magnes__Turku_...",1690.0
1,"Acidalius, Valens","Ad Iordanum Brunum Nolanum, Italum","Poematum Iani Lernutii, Iani Gulielmi, Valenti...",1603,"Liegnitz, Wrocław","Albert, David",17th century,Panegyric poem,Astronomy/Astrology/Cosmography,Ad Iordanum Brunum (1603)(CAMENA)Ad Iordanum B...,...,"MK, IT",Yes,MK,,[801745],801745,1603.0,1603.0,Janus_Lernutius_et_al__-_Poemata__Liegnitz_160...,1603.0


In [9]:
# Phase 0B: Data Exploration
# Display DataFrame Columns

print("\nColumns in noscemus_metadata:")
print(noscemus_metadata.columns.tolist())


Columns in noscemus_metadata:
['Author', 'Full title', 'In', 'Year', 'Place', 'Publisher/Printer', 'Era', 'Form/Genre', 'Discipline/Content', 'Original', 'Digital sourcebook', 'Description', 'References', 'Cited in', 'How to cite this entry', 'Internal notes', 'Of interest to', 'Transkribus text available', 'Written by', 'Library and Signature', 'ids', 'id', 'date_min', 'date_max', 'filename', 'file_year']
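
Before picking a single column to inspect, it can help to see missing-value and unique-value counts for all columns at once. The following is a minimal sketch, not part of the original notebook: `summarize_columns` is a hypothetical helper, and `toy_metadata` is a made-up stand-in so the example runs without a network connection. In the notebook, you would pass `noscemus_metadata` instead.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column with its dtype, missing count, and unique count.

    Hypothetical helper for illustration; not part of the NOSCEMUS tooling.
    """
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isnull().sum(),
            "n_unique": df.nunique(),
        }
    )
    # Columns with the most missing values come first
    return summary.sort_values("n_missing", ascending=False)


# Toy frame standing in for noscemus_metadata (values are illustrative only)
toy_metadata = pd.DataFrame(
    {
        "Author": ["A", "B", "C"],
        "Place": ["Paris", None, "Paris"],
        "Year": [1690, 1603, 1603],
    }
)
column_summary = summarize_columns(toy_metadata)
print(column_summary)
```

Columns with few unique values and few missing entries, such as `Era` or `Place`, are usually the easiest starting points for grouping and counting.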


In [11]:
# Phase 0C: Data Exploration
# Inspect Potential Columns
# Replace 'candidate_column_name' with a column name from the list above
candidate_column_name = 'Place' # <-- CHANGE THIS VALUE 

if candidate_column_name in noscemus_metadata.columns:
    print(f"\nUnique values in '{candidate_column_name}':")
    # Display a sample of unique values and their counts
    print(noscemus_metadata[candidate_column_name].value_counts().head(30))
    print(f"\nNumber of unique values in '{candidate_column_name}': {noscemus_metadata[candidate_column_name].nunique()}")
    print(f"Number of missing values in '{candidate_column_name}': {noscemus_metadata[candidate_column_name].isnull().sum()}")
    # Show some raw examples of the data in this column
    print("\nSample raw entries (up to first 20 non-null):")
    print(noscemus_metadata[candidate_column_name].dropna().head(20).tolist())
else:
    print(f"Column '{candidate_column_name}' not found in DataFrame. Please choose from the list printed above.")


Unique values in 'Place':
Place
Paris                          69
Amsterdam                      49
Basel                          48
Venice                         48
London                         40
Leipzig                        36
Rome                           34
Zurich                         33
Leiden                         29
Frankfurt am Main              26
Göttingen                      25
Tübingen                       25
Nuremberg                      21
Bologna                        21
Strasbourg                     20
Lyon                           19
Wittenberg                     17
Innsbruck                      16
Cologne                        13
Padua                          13
Naples                         12
Florence                       12
Leiden, Stockholm, Erlangen    10
Halle                          10
Antwerp                        10
Oxford                          8
Copenhagen                      8
Vienna                          8
Bern           

In [12]:
# Phase 1: Data Extraction - Extract 'Place' column
actual_publication_place_column = 'Place'
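# Note: astype(str) turns missing values into the literal string 'nan',
# so 'nan' may be counted among the unique raw places below.
places_series = noscemus_metadata[actual_publication_place_column].astype(str).str.strip()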
unique_raw_places = places_series.unique()
print(f"Found {len(unique_raw_places)} unique raw place mentions from '{actual_publication_place_column}'.")
print("Sample of raw places (first 50):")
print(unique_raw_places[:50])

Found 174 unique raw place mentions from 'Place'.
Sample of raw places (first 50):
['[Turku]' 'Liegnitz, Wrocław' 'Salamanca' 'Heidelberg' 'London' 'Oxford'
 'Lund' 'Strasbourg' 'Basel' 'Bologna' 'Leipzig' 'Zurich' 'Venice' 'Rome'
 'Herborn' 'Frankfurt am Main' 'Turin' 'Florence' 'Alcalá de Henares'
 'Leiden' 'Innsbruck' 'London, Westminster Abbey' 'Paris' 'Cambridge'
 '[Landshut]' '[Ingolstadt]' 'Milan' 'Bergamo' 'Stuttgart' 'Perugia'
 'Lyon' 's.l.' 'Amsterdam' '[Wittenberg]' 'Copenhagen' 'Padua' '[Padua]'
 'Rimini' 'Büdingen' 'Königsberg' 'Uppsala' 'Stockholm, Uppsala, Turku'
 'Leipzig, Desau' 'Würzburg' 'Saint Petersburg' 'Antwerp' 'Graz' 'Aachen'
 'Göttingen' 'Târgu Mureș']
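
The raw values above mix several conventions: square brackets around inferred places (`'[Turku]'`), comma-separated lists of places (`'Liegnitz, Wrocław'`), and the placeholder `'s.l.'` (sine loco, no place given). One possible way to normalize them is sketched below. This is an assumption about how the cleaning could be done, not the notebook's own method: `split_raw_place` and `NO_PLACE_MARKERS` are hypothetical names, and the input is a small sample copied from the output above.

```python
import re

import pandas as pd

# Placeholders treated as "no place given": "s.l." appears in the sample
# above, and "nan" is what astype(str) produces for missing values.
NO_PLACE_MARKERS = {"", "nan", "s.l."}


def split_raw_place(raw):
    """Split one raw 'Place' string into a list of cleaned place names."""
    # Drop square brackets, e.g. '[Turku]' -> 'Turku'
    text = re.sub(r"[\[\]]", "", str(raw))
    # Split multi-place entries on commas and trim whitespace
    parts = [part.strip() for part in text.split(",")]
    # Discard placeholders that do not name a place
    return [part for part in parts if part.lower() not in NO_PLACE_MARKERS]


# Small sample copied from the raw output above
sample_places = pd.Series(
    ["[Turku]", "Liegnitz, Wrocław", "s.l.", "Padua", "[Padua]", "nan"]
)

# One row per individual place, then count occurrences
exploded_places = sample_places.apply(split_raw_place).explode().dropna()
place_counts = exploded_places.value_counts()
print(place_counts)
```

Splitting on commas is a simplification. An entry such as `'London, Westminster Abbey'` would yield `'Westminster Abbey'` as a separate place, so results from this sketch should be reviewed by hand before any geocoding or mapping step.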
