[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CCS-ZCU/EuPaC_shared/blob/master/NOSCEMUS_getting-started.ipynb)

This Jupyter notebook was prepared for the EuPaC Hackathon and provides an easy way to start working with the NOSCEMUS dataset, with no need to clone the entire repository or download additional data. It is fully compatible with cloud platforms such as Google Colaboratory (click the badge above) and runs without requiring any specialized library installations.

As such, it is intended as a starting point for EuPaC participants, including those with minimal coding experience.

In [None]:
import pandas as pd
import nltk
import re
import requests
import json
import io

In [None]:
noscemus_metadata = pd.read_csv("https://raw.githubusercontent.com/CCS-ZCU/noscemus_ETF/refs/heads/master/data/metadata_table_long.csv")
noscemus_metadata.head(5)

All mapping between the metadata and the actual textual data happens through the "id" column.
Thus, knowing its ID, you can load the full textual data (both raw and morphologically annotated) for any single text or for a subset of texts.

In [None]:
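doc_id = 1378359  # named doc_id to avoid shadowing Python's built-in id()
base_url = "https://ccs-lab.zcu.cz/noscemus_sents_data/{}.json"
response = requests.get(base_url.format(doc_id))
response.raise_for_status()  # stop here if the document could not be downloaded
sents_data = response.json()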

In [None]:
# sents_data is a list of sentences from the given document
# in addition to the raw text of each sentence, it also contains the lemmatized tokens and their POS tags
# look at a few sentences (here, sentences 110 to 114) to get an idea of the format:
sents_data[110:115]

For each sentence, you see the following elements:
* (1) ID of the source document
* (2) index of the sentence (remember that Python's indexing starts with 0)
* (3) raw text of the sentence
* (4) token data for the sentence

The token data for each token contain:
   * (a) The token as it is in the sentence
   * (b) The automatically assigned lemma corresponding to the token
   * (c) Its Part-of-Speech
   * (d) Its starting positional index within the sentence
   * (e) Its ending positional index within the sentence
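
The layout described above can be unpacked as in the following minimal sketch. The sample record is invented for illustration: only the document ID is taken from this notebook, while the sentence, lemmata, tags and offsets are hypothetical. It assumes the raw sentence text sits at position 2 and the token data at position 3, which is how the cells below access them, and it assumes the start and end indices are character offsets, which the source does not confirm.

```python
# Hypothetical sentence record mimicking the layout described above:
# [document ID, sentence index, raw sentence text, token data]
sample_sent = [
    1378359,
    0,
    "Terra movetur.",
    [
        ["Terra", "terra", "NOUN", 0, 5],
        ["movetur", "moveo", "VERB", 6, 13],
        [".", ".", "PUNCT", 13, 14],
    ],
]

# unpack the four elements of the sentence record
doc_id, sent_index, sent_text, tokens = sample_sent

for token, lemma, pos, start, end in tokens:
    # ASSUMPTION: start/end are character offsets into the raw sentence text
    print(f"{token:10} {lemma:10} {pos:6} {sent_text[start:end]!r}")

# keep only the lemmata of nouns and verbs
content_lemmata = [lemma for _, lemma, pos, _, _ in tokens if pos in {"NOUN", "VERB"}]
print(content_lemmata)
```

Unpacking into named variables makes it harder to mix up the surface token (position 0) and the lemma (position 1) than indexing with bare numbers.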

In [None]:
# if you want the raw text of the document, use the following:
rawtext = " ".join([sent_data[2] for sent_data in sents_data])
rawtext[:1000]

In [None]:
# if you want a list of lemmatized tokens, filtered by certain POS-tags, use the following:
lemmatized_sents = []
for sent_data in sents_data:
    lemmatized_sent = []
    for token in sent_data[3]:
        if token[2] in ["NOUN", "VERB", "ADJ", "PROPN"]:
            lemmatized_sent.append(token[1])  # token[1] is the lemma; token[0] is the surface form
    lemmatized_sents.append(lemmatized_sent)
lemmatized_sents[150:155]

In [None]:
# based on the metadata, you can easily focus on a subset of documents
# for instance, we want to focus on all texts from the first two decades of the 17th century
# (between() is inclusive on both ends, so this selects the years 1600 through 1620):

noscemus_subset = noscemus_metadata[noscemus_metadata["file_year"].between(1600, 1620)]
# to work with the subset, we need to know the IDs of the documents
ids = noscemus_subset["id"]
# Subsequently, we can load the data for each document by its ID and calculate the vocabulary of the texts:
# (depending on the size of the subset and your internet connection, this may take a while)
base_url = "https://ccs-lab.zcu.cz/noscemus_sents_data/{}.json"
subset_lemmatized_sentences = []
for doc_id in ids:  # for each work ID from our subset of IDs
    f_sents_data = json.load(io.BytesIO(requests.get(base_url.format(str(doc_id))).content))
    for sent_data in f_sents_data:
        sent_lemmata = [t[1] for t in sent_data[3] if t[2] in ["NOUN", "VERB", "ADJ", "PROPN"]]  # keep lemmata with specific POS tags
        sent_lemmata = [re.sub(r"[\W\d]+", "", lemma) for lemma in sent_lemmata]  # remove non-word characters and digits
        sent_lemmata = [lemma for lemma in sent_lemmata if len(lemma) > 1]  # drop empty strings and one-letter words
        sent_lemmata = [lemma.lower() for lemma in sent_lemmata]  # lowercase all lemmata
        subset_lemmatized_sentences.append(sent_lemmata)  # add this sentence's lemmata to the overall list of sentences

In [None]:
# now you have lemmatized sentences for all texts in the subset
# let's take a look at the first few sentences:
subset_lemmatized_sentences[:10]

In [None]:
# you can flatten this list of lists into a single list of lemmatized words:
subset_lemmata = [lemma for sent in subset_lemmatized_sentences for lemma in sent]
# this data can be used to calculate the vocabulary of the texts:
subset_vocab = nltk.FreqDist(subset_lemmata).most_common()
subset_vocab[:100]
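
Since `nltk.FreqDist` is a subclass of `collections.Counter`, the same frequency list can be sketched with the standard library alone. The example below uses a small invented list (`toy_lemmata`, hypothetical) instead of the downloaded subset and also derives relative frequencies, which are easier to compare across subsets of different sizes.

```python
from collections import Counter

# Hypothetical flat list of lemmata, standing in for subset_lemmata
toy_lemmata = ["stella", "terra", "stella", "sol", "terra", "stella"]

counts = Counter(toy_lemmata)   # absolute frequencies
total = sum(counts.values())    # total number of lemma tokens

# relative frequency of each lemma, ordered from most to least frequent
relative = {lemma: n / total for lemma, n in counts.most_common()}

print(counts.most_common(2))  # → [('stella', 3), ('terra', 2)]
print(relative["stella"])     # → 0.5
```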

In [None]:
# with lemmatized sentences, you can also immediately proceed to various kinds of co-occurrence analysis or word-embeddings.
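
As one possible starting point for co-occurrence analysis, the sketch below counts how often two lemmata appear in the same sentence. It is a minimal illustration on invented data (`toy_sentences` is hypothetical and stands in for `subset_lemmatized_sentences`), not a prescribed method; sentence-level pair counting is just one of several common ways to define co-occurrence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical lemmatized sentences, standing in for subset_lemmatized_sentences
toy_sentences = [
    ["stella", "fixus", "caelum"],
    ["terra", "moveo", "sol"],
    ["stella", "caelum", "moveo"],
]

cooc = Counter()
for sent in toy_sentences:
    # count each unordered pair of distinct lemmata once per sentence;
    # sorting gives every pair a single canonical order
    for pair in combinations(sorted(set(sent)), 2):
        cooc[pair] += 1

print(cooc[("caelum", "stella")])  # → 2
```

On the full subset the number of distinct pairs grows quickly, so it may be worth restricting the vocabulary to the most frequent lemmata first.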