EmDigIt Feature Tagging and Categorization
This Jupyter notebook processes, tags, and categorizes historical itinerary data from the EmDigIt project. It uses regex-based pattern matching to extract the geographical and economic features mentioned in the itinerary texts, together with the descriptors attached to them. The pipeline has three main components:

Data Preparation & Cleaning: The notebook loads the dataset from a shared Google Drive folder and removes the text of the cleaned column from each entry's content, so that tagging runs only on the remaining description. Rows where either value is missing are passed through unchanged.

Feature and Descriptor Extraction: A regex-based pattern matching system identifies and tags features such as castles, rivers, towns, and markets, as well as descriptors like "famous," "beautiful," and "rich". Each pattern is applied as a case-insensitive search, and the matching tags for a row are stored as a sorted, pipe-separated string.

Categorization & Data Export: Predefined rule sets, keyed on the text, feature, and descriptor columns, define categories such as Economic and Crossing, allowing researchers to analyze patterns of trade, movement, and settlement over time. Finally, the tagged dataset is exported as a CSV file for further study or visualization.

Because all patterns live in plain dictionaries, new features, descriptors, or categories can be added without changing the matching code, which keeps the notebook easy to maintain and extend.

Setup and Data Import

In [None]:
# Pin NumPy below 2.0 first, so the imports below pick up the compatible version
!pip uninstall -y numpy
%pip install "numpy<2"

# Import Required Packages
import os
import re
import numpy as np
import pandas as pd

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Load Data from the shared drive
df = pd.read_csv("/content/drive/Shareddrives/EmDigIt/Processing/full_test_df.csv")

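# Remove every case-insensitive occurrence of `substring` from `string`
# (re.escape ensures the substring is matched literally, not as a regex)
def remove_substring(substring, string):
    return re.compile(re.escape(substring), re.IGNORECASE).sub('', string)

# Strip the text of the `cleaned` column out of `content`;
# rows where either value is missing keep their original `content`
df['content_no_location'] = df.apply(
    lambda row: remove_substring(str(row['cleaned']).strip(), str(row['content']).strip())
    if pd.notnull(row['content']) and pd.notnull(row['cleaned']) else row['content'], axis=1
)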
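The effect of the removal step is easiest to see on a single string. The sketch below repeats `remove_substring` so that it runs on its own; the sample sentences are invented for illustration and are not taken from the EmDigIt dataset.

```python
import re

def remove_substring(substring, string):
    """Delete every case-insensitive occurrence of `substring` from `string`."""
    return re.compile(re.escape(substring), re.IGNORECASE).sub('', string)

# Invented entry: the place name is removed regardless of case,
# and the rest of the description is left untouched.
sample = "Bologna città grande, Studio famoso"
stripped = remove_substring("bologna", sample)
print(repr(stripped))

# re.escape makes the "." literal, so "SxPietro" is not treated as a match.
escaped_demo = remove_substring("S. Pietro", "S. Pietro castello, SxPietro")
print(repr(escaped_demo))
```

Note that the whitespace around the removed name is kept, so the result can start with a space.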


Feature and Descriptor Tagging


In [None]:
# Feature and Descriptor Dictionaries
features = {
    "Castle": r"castell", "Toll": r"si paga|si mostra|dogana| registr",
    "River": r" fium|canale|si passa il", "Lake": r" lago | lagho ",
    "Forest": r" bosc|selva ", "Inn": r"osteria|venta| host",
    "University": r" studio|, universit", "Estate": r" villa | uilla | uillet| villet",
    "Town": r"villag|uillag", "City": r"citt", "Cathedral": r"duomo",
    "Church": r" chies", "Convent/Monastery": r"convento|monasteri",
    "Village": r" borgo| borge|,borg", "Hamlet": r" borgh",
    "Mountain": r" mont|montagna", "Region": r" terra | campagna",
    "Residence": r"risiede|abitava| corte | palaz| è del", "Fort": r"fortez| rocca",
    "Baths": r"bagn", "Market": r"mercat| fiere"
}

# Note: stems are written without a trailing "*". In a regex, "famos*" means
# "famo" followed by zero or more "s", not "famos" followed by anything.
descriptors = {
    "Famous": r" famos|celebr|notab|notev", "Beautiful": r" beliss| bel | bella",
    "Rich": r" ricchiss", "Mercantile": r"mercantile", "Capital": r"metropoli| capo de",
    "Expedient": r"espedient", "Good": r"bona|buon|ottim", "Bad": r"cattiv|mal",
    "Large": r"gross", "By Boat": r"per barca|in barca|sopra barca|imbarca",
    "By Ascending": r"si ascend|salir", "By Bridge": r"sopra il ponte",
    "Border": r"primo luogo|che divide|frontera|comincia l|finisce i|confin",
    "Holy": r" santo| santa| santi", "High": r" alta ", "Abundant": r"copios",
    "Powerful": r"potentiss", "By Sedan": r"si fanno portare",
    "By Sea": r"per mare|galere|galeazza|in mare", "Devout": r"divot", "Strong": r"fortis"
}

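# Return the sorted, pipe-joined names of every pattern found in the text (None if nothing matches)
def match_patterns(content, patterns):
    if not isinstance(content, str):  # missing values (NaN) cannot be searched
        return None
    matched = [name for name, regex in patterns.items() if re.search(regex, content, re.IGNORECASE)]
    return '|'.join(sorted(matched)) or None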

# Apply Feature and Descriptor Matching
df['features'] = df['content_no_location'].apply(lambda x: match_patterns(x, features))
df['descriptors'] = df['content_no_location'].apply(lambda x: match_patterns(x, descriptors))
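As a quick check of how the tags come out, the sketch below runs the matching function on one sentence. It uses only a small subset of the dictionaries above, and the sample sentence is invented for illustration.

```python
import re

# Subset of the notebook's dictionaries, enough for a demonstration.
features = {"Castle": r"castell", "River": r" fium|canale|si passa il", "City": r"citt"}
descriptors = {"Beautiful": r" beliss| bel | bella",
               "By Boat": r"per barca|in barca|sopra barca|imbarca"}

def match_patterns(content, patterns):
    """Return sorted, pipe-joined names of all patterns found, or None."""
    matched = [name for name, regex in patterns.items()
               if re.search(regex, content, re.IGNORECASE)]
    return '|'.join(sorted(matched)) or None

# Invented sentence; the leading space matters because several patterns start with one.
sample = " città bella con castello, si passa il fiume in barca"
feature_tags = match_patterns(sample, features)        # 'Castle|City|River'
descriptor_tags = match_patterns(sample, descriptors)  # 'Beautiful|By Boat'
no_tags = match_patterns("nessuna nota", features)     # None

print(feature_tags, descriptor_tags, no_tags)
```

Sorting the names before joining them gives every row with the same set of tags the same string, which makes later grouping and filtering simpler.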


Categorization and Export

In [None]:
# Category Dictionary for Classification
categories = {
    "Economic": {
        "content_no_location": r"si fabrican|si serv|industria|popoli|vettova|bottegh|frutt",
        "features": r"Toll|Market|Inn|Baths",
        "descriptors": r"Rich|Mercantile"
    },
    "Crossing": {
        "content_no_location": r"ove passa|si passa|passaret|passate|imbarca|salire|per mare|entra|intra|barc",
        "features": r"Mountain|River",
        "descriptors": r"By Boat|By Ascending|By Sedan|By Sea|By Bridge"
    }
}

# Export Processed Data to the shared drive
df.to_csv('/content/drive/Shareddrives/EmDigIt/Processing/feature_tagged.csv', index=False)
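The cell above defines the category rules but does not show how they are applied to the rows. One possible way is sketched below, under the assumption that a row belongs to a category when any of its three columns matches the corresponding pattern. The `categorize` helper, the `categories` column name, and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

categories = {
    "Economic": {
        "content_no_location": r"si fabrican|si serv|industria|popoli|vettova|bottegh|frutt",
        "features": r"Toll|Market|Inn|Baths",
        "descriptors": r"Rich|Mercantile",
    },
    "Crossing": {
        "content_no_location": r"ove passa|si passa|passaret|passate|imbarca|salire|per mare|entra|intra|barc",
        "features": r"Mountain|River",
        "descriptors": r"By Boat|By Ascending|By Sedan|By Sea|By Bridge",
    },
}

# Tiny invented frame standing in for the tagged `df`.
df = pd.DataFrame({
    "content_no_location": [" si paga la dogana, buon mercato", " si passa il fiume in barca", None],
    "features": ["Market|Toll", "River", None],
    "descriptors": ["Good", "By Boat", None],
})

def categorize(frame, rules):
    """Pipe-joined category names per row. Assumption: ANY matching column qualifies."""
    hits = {}
    for name, column_patterns in rules.items():
        mask = pd.Series(False, index=frame.index)
        for column, pattern in column_patterns.items():
            # na=False treats missing values as "no match"
            mask |= frame[column].str.contains(pattern, case=False, regex=True, na=False)
        hits[name] = mask
    flags = pd.DataFrame(hits)
    return flags.apply(
        lambda row: '|'.join(name for name in flags.columns if row[name]) or None, axis=1
    )

df["categories"] = categorize(df, categories)
print(df[["features", "descriptors", "categories"]])
```

For the new column to appear in the exported file, a step like this would have to run before the `to_csv` call. A stricter rule, such as requiring matches in at least two columns, would only change how the masks are combined.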
