[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CCS-ZCU/EuPaC_shared/blob/master/NOSCEMUS_getting-started.ipynb)

This Jupyter notebook was prepared for the EuPaC Hackathon and provides an easy way to start working with the NOSCEMUS dataset without cloning the entire repository or downloading additional data. It is fully compatible with cloud platforms such as Google Colaboratory (click the badge above). The only extra packages it needs, `folium` and `geopandas`, are installed by the first code cell.

As such, it is intended as a starting point for EuPaC participants, including those with minimal coding experience.

In [1]:
# Phase 0A: Setup - Install Libraries
%pip install folium geopandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Phase 0B: Setup - Import Libraries
import pandas as pd
import nltk
import re
import requests
import json
import io
import folium
import geopandas as gpd
import os
import time

In [None]:
# Phase 0A: Data Exploration
# Display 2 sample DataFrame rows
noscemus_metadata = pd.read_csv("https://raw.githubusercontent.com/CCS-ZCU/noscemus_ETF/refs/heads/master/data/metadata_table_long.csv")
noscemus_metadata.head(2)

Unnamed: 0,Author,Full title,In,Year,Place,Publisher/Printer,Era,Form/Genre,Discipline/Content,Original,...,Of interest to,Transkribus text available,Written by,Library and Signature,ids,id,date_min,date_max,filename,file_year
0,"Achrelius, Daniel",Scientiarum magnes recitatus publice anno 1690...,,1690,[Turku],Wall,17th century,Oration,"Mathematics, Astronomy/Astrology/Cosmography, ...",Scientiarum magnes(Google Books),...,"MK, JL",Yes,IT,,[705665],705665,1690.0,1690.0,"Achrelius,_Daniel_-_Scientiarum_magnes__Turku_...",1690.0
1,"Acidalius, Valens","Ad Iordanum Brunum Nolanum, Italum","Poematum Iani Lernutii, Iani Gulielmi, Valenti...",1603,"Liegnitz, Wrocław","Albert, David",17th century,Panegyric poem,Astronomy/Astrology/Cosmography,Ad Iordanum Brunum (1603)(CAMENA)Ad Iordanum B...,...,"MK, IT",Yes,MK,,[801745],801745,1603.0,1603.0,Janus_Lernutius_et_al__-_Poemata__Liegnitz_160...,1603.0
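The `date_min` and `date_max` columns shown above make it easy to restrict the metadata to a period of interest. The following is a minimal sketch, not part of the original notebook: the helper name `filter_by_period` and the toy `sample` frame (two rows copied from the output above plus one hypothetical row with missing dates) are assumptions for illustration.

```python
import pandas as pd


def filter_by_period(metadata: pd.DataFrame, start: int, end: int) -> pd.DataFrame:
    """Return rows whose [date_min, date_max] interval overlaps [start, end].

    Rows with missing dates are excluded, because comparisons with NaN are False.
    """
    overlaps = (metadata["date_min"] <= end) & (metadata["date_max"] >= start)
    return metadata[overlaps]


# Toy stand-in for noscemus_metadata; the third row is hypothetical.
sample = pd.DataFrame(
    {
        "Author": ["Achrelius, Daniel", "Acidalius, Valens", "Hypothetical, Author"],
        "date_min": [1690.0, 1603.0, None],
        "date_max": [1690.0, 1603.0, None],
    }
)

late_17th = filter_by_period(sample, 1650, 1700)
print(late_17th["Author"].tolist())
```

With the real data, the same call would be `filter_by_period(noscemus_metadata, 1650, 1700)`.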


In [9]:
# Phase 0B: Data Exploration
# Display DataFrame Columns

print("\nColumns in noscemus_metadata:")
print(noscemus_metadata.columns.tolist())


Columns in noscemus_metadata:
['Author', 'Full title', 'In', 'Year', 'Place', 'Publisher/Printer', 'Era', 'Form/Genre', 'Discipline/Content', 'Original', 'Digital sourcebook', 'Description', 'References', 'Cited in', 'How to cite this entry', 'Internal notes', 'Of interest to', 'Transkribus text available', 'Written by', 'Library and Signature', 'ids', 'id', 'date_min', 'date_max', 'filename', 'file_year']


In [11]:
# Phase 0C: Data Exploration
# Inspect Potential Columns
# Replace 'candidate_column_name' with a column name from the list above
candidate_column_name = 'Place' # <-- CHANGE THIS VALUE 

if candidate_column_name in noscemus_metadata.columns:
    print(f"\nUnique values in '{candidate_column_name}':")
    # Display a sample of unique values and their counts
    print(noscemus_metadata[candidate_column_name].value_counts().head(30))
    print(f"\nNumber of unique values in '{candidate_column_name}': {noscemus_metadata[candidate_column_name].nunique()}")
    print(f"Number of missing values in '{candidate_column_name}': {noscemus_metadata[candidate_column_name].isnull().sum()}")
    # Show some raw examples of the data in this column
    print("\nSample raw entries (up to first 20 non-null):")
    print(noscemus_metadata[candidate_column_name].dropna().head(20).tolist())
else:
    print(f"Column '{candidate_column_name}' not found in DataFrame. Please choose from the list printed above.")


Unique values in 'Place':
Place
Paris                          69
Amsterdam                      49
Basel                          48
Venice                         48
London                         40
Leipzig                        36
Rome                           34
Zurich                         33
Leiden                         29
Frankfurt am Main              26
Göttingen                      25
Tübingen                       25
Nuremberg                      21
Bologna                        21
Strasbourg                     20
Lyon                           19
Wittenberg                     17
Innsbruck                      16
Cologne                        13
Padua                          13
Naples                         12
Florence                       12
Leiden, Stockholm, Erlangen    10
Halle                          10
Antwerp                        10
Oxford                          8
Copenhagen                      8
Vienna                          8
Bern           
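The counts above treat an entry such as `Leiden, Stockholm, Erlangen` as a single value, so works printed in several cities are not credited to each city. One possible way to count individual places is to split on commas and explode the result. This is a sketch under assumptions: the helper name `count_individual_places` is hypothetical, and a plain comma split will also break apart entries like `London, Westminster Abbey`, so the result should be checked by eye.

```python
import pandas as pd


def count_individual_places(places: pd.Series) -> pd.Series:
    """Split comma-separated place entries and count each place separately."""
    exploded = (
        places.dropna()      # ignore missing entries
        .astype(str)
        .str.split(",")      # "Leiden, Stockholm" -> ["Leiden", " Stockholm"]
        .explode()           # one row per place
        .str.strip()
    )
    exploded = exploded[exploded != ""]  # drop empty fragments
    return exploded.value_counts()


# Toy data modelled on the entries shown above.
sample_places = pd.Series(["Paris", "Leiden, Stockholm, Erlangen", "Leiden", None])
place_counts = count_individual_places(sample_places)
print(place_counts)
```

On the real data this would be called as `count_individual_places(noscemus_metadata["Place"])`.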

In [12]:
# Phase 1: Data Extraction - Extract 'Place' column
actual_publication_place_column = 'Place'
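# Note: depending on the pandas version, astype(str) can turn missing values into
# the literal string "nan"; Phase 2 skips such entries explicitly.
places_series = noscemus_metadata[actual_publication_place_column].astype(str).str.strip()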
unique_raw_places = places_series.unique()
print(f"Found {len(unique_raw_places)} unique raw place mentions from '{actual_publication_place_column}'.")
print("Sample of raw places (first 50):")
print(unique_raw_places[:50])

Found 174 unique raw place mentions from 'Place'.
Sample of raw places (first 50):
['[Turku]' 'Liegnitz, Wrocław' 'Salamanca' 'Heidelberg' 'London' 'Oxford'
 'Lund' 'Strasbourg' 'Basel' 'Bologna' 'Leipzig' 'Zurich' 'Venice' 'Rome'
 'Herborn' 'Frankfurt am Main' 'Turin' 'Florence' 'Alcalá de Henares'
 'Leiden' 'Innsbruck' 'London, Westminster Abbey' 'Paris' 'Cambridge'
 '[Landshut]' '[Ingolstadt]' 'Milan' 'Bergamo' 'Stuttgart' 'Perugia'
 'Lyon' 's.l.' 'Amsterdam' '[Wittenberg]' 'Copenhagen' 'Padua' '[Padua]'
 'Rimini' 'Büdingen' 'Königsberg' 'Uppsala' 'Stockholm, Uppsala, Turku'
 'Leipzig, Desau' 'Würzburg' 'Saint Petersburg' 'Antwerp' 'Graz' 'Aachen'
 'Göttingen' 'Târgu Mureș']
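If you would rather exclude missing values before they are converted to strings, one option is to drop them first. The sketch below is an alternative to the cell above, not a replacement for it: the helper name `extract_place_series` and the toy frame are assumptions, and using it on the real data may change the number of unique places reported above.

```python
import numpy as np
import pandas as pd


def extract_place_series(metadata: pd.DataFrame, column: str = "Place") -> pd.Series:
    """Return the place column as stripped strings, without missing or empty entries."""
    places = metadata[column].dropna().astype(str).str.strip()
    return places[places != ""]


# Toy data: one bracketed place, one padded with spaces, one missing, one empty.
sample = pd.DataFrame({"Place": ["[Turku]", " Paris ", np.nan, "", "s.l."]})

safe_places = extract_place_series(sample)
print(safe_places.tolist())
```

Markers such as `s.l.` (no place given) are kept here on purpose, since the cleaning step later in the notebook deals with them.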


In [15]:
# Phase 2: Geocode Raw Publication Places

GEONAMES_USERNAME = "utaysi"  # Your Geonames username
raw_geocoded_cache_file = 'raw_geocoded_places_cache.csv'

def get_coordinates(place_name, username):
    if not place_name or pd.isna(place_name):
        return None, None, None, None
    # Ensure place_name is a string for requests.utils.quote
    place_name_str = str(place_name)
    try:
        # Initial attempt: prioritize populated places (featureClass=P)
        url = f"http://api.geonames.org/searchJSON?q={requests.utils.quote(place_name_str)}&maxRows=1&featureClass=P&username={username}"
        response = requests.get(url, timeout=15)
        response.raise_for_status() # Raise HTTPError for bad responses (4xx or 5xx)
        data = response.json()
        if data.get('geonames') and len(data['geonames']) > 0:
            top_result = data['geonames'][0]
            return float(top_result['lat']), float(top_result['lng']), top_result.get('name'), top_result.get('countryName')
        else:
            # Fallback: search without featureClass if no populated place found or if initial result is empty
            # This helps with broader terms or historical names that might not be classed as 'P'
            url_fallback = f"http://api.geonames.org/searchJSON?q={requests.utils.quote(place_name_str)}&maxRows=1&username={username}"
            # print(f"Retrying without featureClass for: {place_name_str}") # Optional: for debugging
            response_fallback = requests.get(url_fallback, timeout=15)
            response_fallback.raise_for_status()
            data_fallback = response_fallback.json()
            if data_fallback.get('geonames') and len(data_fallback['geonames']) > 0:
                top_result_fallback = data_fallback['geonames'][0]
                # print(f"Fallback success for {place_name_str}: Found {top_result_fallback.get('name')}") # Optional
                return float(top_result_fallback['lat']), float(top_result_fallback['lng']), top_result_fallback.get('name'), top_result_fallback.get('countryName')
            # print(f"Place not found by Geonames (even after fallback): {place_name_str}") # Optional
            return None, None, None, None
    except requests.exceptions.Timeout:
        print(f"API request timed out for {place_name_str}")
        return None, None, None, None
    except requests.exceptions.HTTPError as http_err:
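        # Use the response attached to the exception, so the text shown belongs to the request that failed
        print(f"HTTP error occurred for {place_name_str}: {http_err} - Response: {http_err.response.text[:200]}...")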
        return None, None, None, None
    except requests.exceptions.RequestException as req_err:
        print(f"API request failed for {place_name_str}: {req_err}")
        return None, None, None, None
    except ValueError as parse_err: # Malformed JSON or non-numeric coordinates
        print(f"Could not parse the Geonames response for {place_name_str}: {parse_err}")
        return None, None, None, None

# Check for cached data first
if os.path.exists(raw_geocoded_cache_file):
    print(f"Loading raw geocoded data from cache: {raw_geocoded_cache_file}")
    raw_geocoded_df = pd.read_csv(raw_geocoded_cache_file)
    # Ensure all expected columns are present, fill with NA if not
    expected_cols = ['raw_place', 'geoname_name', 'latitude', 'longitude', 'country']
    for col in expected_cols:
        if col not in raw_geocoded_df.columns:
            raw_geocoded_df[col] = pd.NA
else:
    print(f"No cache file found ({raw_geocoded_cache_file}). Geocoding raw places...")
    raw_geocoded_data = []
    if 'places_series' in locals():
        unique_raw_places = places_series.dropna().unique() # Use dropna() before unique()
        print(f"Geocoding {len(unique_raw_places)} unique raw place names...")
        for i, place in enumerate(unique_raw_places):
            if str(place).strip() == "nan" or str(place).strip() == "": # Skip if place is 'nan' string or empty after strip
                # print(f"Skipping invalid place entry: '{place}'") # Optional
                lat, lon, geoname_name, country = None, None, None, None
            else:
                if (i+1) % 20 == 0:
                    print(f"Processed {i+1}/{len(unique_raw_places)} places...")
                lat, lon, geoname_name, country = get_coordinates(place, GEONAMES_USERNAME)
            
            raw_geocoded_data.append({'raw_place': place, 
                                      'geoname_name': geoname_name, 
                                      'latitude': lat, 
                                      'longitude': lon, 
                                      'country': country})
            time.sleep(0.1) # 100ms delay to be respectful to the API

        raw_geocoded_df = pd.DataFrame(raw_geocoded_data)
        raw_geocoded_df.to_csv(raw_geocoded_cache_file, index=False)
        print(f"Saved raw geocoded data to cache: {raw_geocoded_cache_file}")
    else:
        print("Error: 'places_series' not defined. Please ensure the previous cells (especially Phase 1: Data Extraction) have been run.")
        raw_geocoded_df = pd.DataFrame(columns=['raw_place', 'geoname_name', 'latitude', 'longitude', 'country']) # Create empty df

if not raw_geocoded_df.empty:
    print(f"\nSuccessfully geocoded {raw_geocoded_df['latitude'].notna().sum()} places out of {len(raw_geocoded_df)} unique raw names processed.")
    print("\nSample of geocoded data (first 20 rows):")
    print(raw_geocoded_df.head(20))
    
    print("\nPlaces that were NOT found by Geonames (sample):")
    not_found_sample = raw_geocoded_df[raw_geocoded_df['latitude'].isna()]['raw_place'].unique()
    print(not_found_sample[:20]) # Show up to 20 unique not found raw places
    print(f"Total unique raw places not found: {len(not_found_sample)}")
else:
    print("\nraw_geocoded_df is empty. Check for errors in previous steps or API calls.")

No cache file found (raw_geocoded_places_cache.csv). Geocoding raw places...
Geocoding 174 unique raw place names...
Processed 20/174 places...
Processed 40/174 places...
Processed 60/174 places...
Processed 80/174 places...
Processed 100/174 places...
Processed 120/174 places...
Processed 140/174 places...
Processed 160/174 places...
Saved raw geocoded data to cache: raw_geocoded_places_cache.csv

Successfully geocoded 151 places out of 174 unique raw names processed.

Sample of geocoded data (first 20 rows):
            raw_place       geoname_name  latitude  longitude          country
0             [Turku]              Turku  60.45148   22.26869          Finland
1   Liegnitz, Wrocław               None       NaN        NaN             None
2           Salamanca          Salamanca  40.96882   -5.66388            Spain
3          Heidelberg         Heidelberg  49.40768    8.69079          Germany
4              London             London  51.50853   -0.12574   United Kingdom
5         
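The geocoding function above treats every response without results as "place not found". To the best of our understanding (an assumption worth checking against the GeoNames documentation), the service reports problems such as an invalid username or an exhausted request quota as a JSON body containing a `status` object, often with HTTP status 200, so `raise_for_status()` would not catch them. The sketch below separates the two cases. The names `GeoNamesError` and `parse_geonames_search` and the sample payloads are hypothetical.

```python
from typing import Optional, Tuple

Result = Tuple[Optional[float], Optional[float], Optional[str], Optional[str]]


class GeoNamesError(RuntimeError):
    """Raised when the JSON body itself reports a service-level problem."""


def parse_geonames_search(data: dict) -> Result:
    """Turn a parsed searchJSON body into (lat, lon, name, country).

    Assumes errors arrive as {"status": {"message": ..., "value": ...}}.
    """
    status = data.get("status")
    if status:
        raise GeoNamesError(
            f"GeoNames error {status.get('value')}: {status.get('message')}"
        )
    hits = data.get("geonames") or []
    if not hits:
        return None, None, None, None  # a genuine "not found"
    top = hits[0]
    return float(top["lat"]), float(top["lng"]), top.get("name"), top.get("countryName")


# Hypothetical payloads for illustration only.
found = {"geonames": [{"lat": "60.45148", "lng": "22.26869",
                       "name": "Turku", "countryName": "Finland"}]}
print(parse_geonames_search(found))
print(parse_geonames_search({"geonames": []}))
```

Inside `get_coordinates`, such a helper could be applied to `response.json()`. Letting `GeoNamesError` stop the loop would keep a quota problem from being cached as a long list of "not found" places.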

In [16]:
# Phase 3: Analyze Geocoding Results and Conditional Cleaning

# Ensure raw_geocoded_df exists from the previous step
if 'raw_geocoded_df' in locals() and not raw_geocoded_df.empty:
    failed_raw_places_df = raw_geocoded_df[raw_geocoded_df['latitude'].isna()]
    unique_failed_raw_places = failed_raw_places_df['raw_place'].dropna().unique().tolist()
    print(f"Found {len(unique_failed_raw_places)} unique raw place names that were not geocoded in Phase 2.")
    # print("Sample of failed raw places:", unique_failed_raw_places[:20]) # Optional: for debugging

    # Define the cleaning function (similar to what was planned before, now applied conditionally)
    def clean_place_name(name):
        name_str = str(name).lower().strip() # Ensure string, lowercase, strip whitespace
        name_str = re.sub(r"\(.*?\)", "", name_str).strip() # Remove content in parentheses
        name_str = re.sub(r"\[.*?\]", "", name_str).strip() # Remove content in brackets
        replacements = {
            "lvgduni batavorvm": "leiden", "lugduni batavorum": "leiden",
            "amstelodami": "amsterdam", "amstelædami": "amsterdam", "amstelodamum": "amsterdam",
            "parisiis": "paris", "londini": "london", "franequerae": "franeker",
            "hafniae": "copenhagen", "coloniae agrippinae": "cologne",
            "antverpiae": "antwerp", "lipsiae": "leipzig", "argentorati": "strasbourg",
            "s.l.": "", "s. l.": "", "s.a.": "", "s.n.": "", "o.o.": "", "o. o.": "",
            "not indicated": "", "nan": ""
        }
        name_str = replacements.get(name_str, name_str)
        if re.search(r'[,;/&]', name_str): # Several places listed: keep only the first one
            parts = re.split(r'[,;/&]', name_str)
            name_str = parts[0].strip()
            name_str = replacements.get(name_str, name_str)
        name_str = re.sub(r"[^a-z\s'’ʻ-]", "", name_str, flags=re.UNICODE).strip()
        name_str = re.sub(r"\s+", " ", name_str).strip()
        return name_str.title() if name_str else ""

    if unique_failed_raw_places:
        print("\nApplying cleaning function to failed place names...")
        # Deduplicate, drop names that cleaned down to an empty string, and sort
        cleaned_failed_places = sorted({clean_place_name(p) for p in unique_failed_raw_places} - {""})
        print(f"\nGenerated {len(cleaned_failed_places)} unique cleaned names from the {len(unique_failed_raw_places)} failed raw names.")
        print("Sample of cleaned names to be retried for geocoding (first 30):")
        print(cleaned_failed_places[:30])
        # Store for next phase
        places_to_retry_geocoding = cleaned_failed_places
    else:
        print("\nNo failed place names to clean or retry.")
        places_to_retry_geocoding = []
else:
    print("Error: 'raw_geocoded_df' not found or is empty. Please run Phase 2 (Geocode Raw Publication Places) first.")
    places_to_retry_geocoding = [] # Initialize to prevent errors in next phase if this one fails


Found 23 unique raw place names that were not geocoded in Phase 2.

Applying cleaning function to failed place names...

Generated 17 unique cleaned names from the 23 failed raw names.
Sample of cleaned names to be retried for geocoding (first 30):
['Aga', 'Augsburg', 'Bratislava', 'Frankfurt Am Main', 'Leiden', 'Leipzig', 'Liegnitz', 'Linz', 'Lyon', 'Neostadii In Palatinate', 'Nuremberg', 'Paris', 'Philadelphia', 'Pitschen', 'Stockholm', 'Venice', 'Vienna']
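Two details of `clean_place_name` are worth knowing when reading the list above. The pattern `[^a-z\s'’ʻ-]` removes every character outside plain `a-z`, so letters such as `ö`, `ł` or `ș` disappear. The bracket rule also deletes the bracketed text itself, so `[Wittenberg]` becomes an empty string. The following sketch shows one Unicode-aware alternative. The function name `clean_place_name_unicode` and the constant `NO_PLACE_MARKERS` are hypothetical, the Latin-name replacements from the cell above are left out for brevity, and running it on the real data would give a different list from the one printed above.

```python
import re

# Entries that mean "no place given" (taken from the replacements in the cell above).
NO_PLACE_MARKERS = {"s.l.", "s. l.", "s.a.", "s.n.", "o.o.", "o. o.", "not indicated", "nan", ""}


def clean_place_name_unicode(name) -> str:
    """Clean a raw place entry while keeping accented and non-ASCII letters."""
    text = str(name).lower().strip()
    text = re.sub(r"\(.*?\)", "", text)         # drop parenthetical remarks
    text = re.sub(r"[\[\]]", "", text).strip()  # drop brackets, keep their content
    if text in NO_PLACE_MARKERS:
        return ""
    text = re.split(r"[,;/&]", text)[0].strip()  # keep the first of several places
    if text in NO_PLACE_MARKERS:
        return ""
    # Remove punctuation, digits and underscores; \w keeps letters of any script.
    text = re.sub(r"[^\w\s'’-]|[\d_]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.title()


for raw in ["[Göttingen]", "Liegnitz, Wrocław", "Târgu Mureș", "[s.l.]"]:
    print(repr(raw), "->", repr(clean_place_name_unicode(raw)))
```

Keeping only the first place of a multi-place entry mirrors the original function, and it remains a simplification.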
