[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/CCS-ZCU/EuPaC_shared/blob/master/NOSCEMUS_getting-started.ipynb)

This Jupyter notebook was prepared for the EuPaC Hackathon and provides an easy way to start working with the NOSCEMUS dataset, with no need to clone the entire repository or download additional data. It runs on cloud platforms such as Google Colaboratory (click the badge above) and requires no specialized library installations.

As such, it is intended as a starting point for EuPaC participants, including those with minimal coding experience.

In [17]:
import pandas as pd
import nltk
import re
import requests
import json
import io

In [39]:
noscemus_metadata = pd.read_csv("https://raw.githubusercontent.com/CCS-ZCU/noscemus_ETF/refs/heads/master/data/metadata_table_long.csv")
noscemus_metadata.head(5)

Unnamed: 0,Author,Full title,In,Year,Place,Publisher/Printer,Era,Form/Genre,Discipline/Content,Original,...,Of interest to,Transkribus text available,Written by,Library and Signature,ids,id,date_min,date_max,filename,file_year
0,"Achrelius, Daniel",Scientiarum magnes recitatus publice anno 1690...,,1690,[Turku],Wall,17th century,Oration,"Mathematics, Astronomy/Astrology/Cosmography, ...",Scientiarum magnes(Google Books),...,"MK, JL",Yes,IT,,[705665],705665,1690.0,1690.0,"Achrelius,_Daniel_-_Scientiarum_magnes__Turku_...",1690.0
1,"Acidalius, Valens","Ad Iordanum Brunum Nolanum, Italum","Poematum Iani Lernutii, Iani Gulielmi, Valenti...",1603,"Liegnitz, Wrocław","Albert, David",17th century,Panegyric poem,Astronomy/Astrology/Cosmography,Ad Iordanum Brunum (1603)(CAMENA)Ad Iordanum B...,...,"MK, IT",Yes,MK,,[801745],801745,1603.0,1603.0,Janus_Lernutius_et_al__-_Poemata__Liegnitz_160...,1603.0
2,"Acosta, José de",De natura novi orbis libri duo et De promulgat...,,1589,Salamanca,Guillelum Foquel,16th century,Monograph,"Astronomy/Astrology/Cosmography, Geography/Car...",De natura novi orbis(Biodiversity Heritage Lib...,...,DB,Yes,DB,,[713323],713323,1589.0,1589.0,"Acosta,_José_de_-_De_natura_novi_orbis__Salama...",1589.0
3,"Adam, Melchior","Vitae Germanorum medicorum, qui saeculo superi...",,1620,Heidelberg,"Rosa, Geyder",17th century,Biography,Medicine,Vitae Germanorum medicorum(MDZ)Alternative lin...,...,IT,Yes,IT,,[693148],693148,1620.0,1620.0,"Adam,_Melchior_-_Vitae_Germanorum_medicorum__H...",1620.0
4,"Addison, Joseph",Ad insignissimum virum dominum Thomam Burnettu...,"Examen poeticum duplex, sive, Musarum anglican...",1698,London,Richard Wellington I.,17th century,Panegyric poem,Meteorology/Earth sciences,Ad Burnettum sacrae theoriae telluris auctorem...,...,"MK, IT",Yes,MK,,[769230],769230,1698.0,1698.0,Examen_poeticum_duplex__London_1698_pdf.txt,1698.0


All mapping between the metadata and the textual data happens through the "id" column.
Thus, once you know its ID, you can load the full textual data (both raw and morphologically annotated) for any text or any subset of texts.

In [40]:
doc_id = 1378359  # ID of one document, taken from the "id" column of the metadata
base_url = "https://ccs-lab.zcu.cz/noscemus_sents_data/{}.json"
sents_data = requests.get(base_url.format(doc_id)).json()  # download and parse the JSON file
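
If you load documents repeatedly, it can help to wrap the download in a small function. The sketch below is only one possible way to do it: the names `sents_url` and `load_sents` are hypothetical helpers (not part of the dataset or of any library), and it uses the standard-library `urllib` instead of `requests`, so it needs no extra installation. The 30-second timeout is an arbitrary assumption.

```python
import json
import urllib.request

# URL pattern used in this notebook for the sentence-level data
BASE_URL = "https://ccs-lab.zcu.cz/noscemus_sents_data/{}.json"


def sents_url(doc_id):
    """Build the URL of the sentence-level JSON file for one document ID."""
    return BASE_URL.format(doc_id)


def load_sents(doc_id, timeout=30):
    """Download and parse the sentence data for one document (hypothetical helper).

    urllib raises urllib.error.HTTPError when the server answers with an
    error status, for example when no file exists for the given ID.
    """
    with urllib.request.urlopen(sents_url(doc_id), timeout=timeout) as response:
        return json.load(response)
```

Calling `load_sents(1378359)` should then return the same list of sentences as the cell above, provided the server is reachable.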

In [41]:
# sents_data is a list of sentences from the given document
# in addition to the raw text of each sentence, it also contains the lemmatized tokens and their POS tags
# look at a few sentences (here, those at indices 110-114) to get an idea of the format:
sents_data[110:115]

[['1378359',
  110,
  'Haec talia, anno praeterito 1587. d. 26.',
  [['Haec', 'hic', 'DET', [0, 4]],
   ['talia', 'talis', 'DET', [5, 10]],
   [',', ',', 'PUNCT', [10, 11]],
   ['anno', 'annus', 'NOUN', [12, 16]],
   ['praeterito', 'praetereo', 'VERB', [17, 27]],
   ['1587', '1587s', 'NUM', [28, 32]],
   ['.', '.', 'PUNCT', [32, 33]],
   ['d.', '', 'ADJ', [34, 36]],
   ['26', '26', 'NUM', [37, 39]],
   ['.', '.', 'PUNCT', [39, 40]]]],
 ['1378359',
  111,
  'Decemb. uesperi circa horam 9.',
  [['Decemb.', 'decemb.', 'NOUN', [0, 7]],
   ['uesperi', 'uesper', 'NOUN', [8, 15]],
   ['circa', 'circa', 'ADP', [16, 21]],
   ['horam', 'hora', 'NOUN', [22, 27]],
   ['9', '9', 'NUM', [28, 29]],
   ['.', '.', 'PUNCT', [29, 30]]]],
 ['1378359',
  112,
  'conspecta sunt.',
  [['conspecta', 'conspicio', 'VERB', [0, 9]],
   ['sunt', 'sum', 'AUX', [10, 14]],
   ['.', '.', 'PUNCT', [14, 15]]]],
 ['1378359',
  113,
  'Lumen ex So ad Nw procedebat adeo horrendum, ut spectatores omnes confiteri necessum ha

For each sentence, you see the following elements:
* (1) ID of the source document
* (2) index of the sentence within the document (remember that Python's indexing starts with 0)
* (3) raw text of the sentence
* (4) token data for the sentence

The token data for each token contain:
   * (a) The token as it appears in the sentence
   * (b) The automatically assigned lemma corresponding to the token
   * (c) Its part-of-speech (POS) tag
   * (d) Its start and end character offsets within the sentence, stored together as a two-item list (the end offset is exclusive, e.g. [0, 4] for "Haec")
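
As a minimal sketch of how this structure can be unpacked, the example below takes one sentence record copied from the output above and splits it into named parts. The variable names are only illustrative. It also checks the assumption that the offsets are character positions with an exclusive end, which holds for this sample.

```python
# One sentence record, copied from the output above.
record = ['1378359', 112, 'conspecta sunt.',
          [['conspecta', 'conspicio', 'VERB', [0, 9]],
           ['sunt', 'sum', 'AUX', [10, 14]],
           ['.', '.', 'PUNCT', [14, 15]]]]

# The four top-level elements of a sentence record.
doc_id, sent_index, sent_text, tokens = record

recovered = []
for form, lemma, pos, (start, end) in tokens:
    # Slicing the sentence text with the offsets gives back the token itself.
    recovered.append(sent_text[start:end])
    print(f"{form:<10} {lemma:<10} {pos:<6} {start}-{end}")
```

Note that in this sample the document ID inside the record is a string, while the "id" column of the metadata table holds numbers, so a conversion such as `int(doc_id)` may be needed when you join the two.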

In [42]:
# if you want the raw text of the whole document, join the texts of its sentences:
rawtext = " ".join([sent_data[2] for sent_data in sents_data])
rawtext[:1000]

"Cccxui. Obseruationes De Lumen Boreali, Ab A. Mdccxui. Ad A. Mdccxxxii. Partim A Se, Partim Ab Aliis, In Suecia Habitas, Collegit Andreas Celsius, In Acad. Upsal. Astron. Prof. Reg. Et Soc. Reg. Scien. Suec. Secr. Norimbergae. Apud Wolfg. Maur. Endteri Haeredes, Filiam, Mayeriam, hujusque Filium. Anno 1733. Lucani Phars. Lib. I. u. 526-528. Ignota obscurae uiderunt sidera noctes, Ardentemque Polum flammis, coeloque uolantes, Obliquas per inane faces, -- -- -- -- -- -- -- *** Stiernhielmii Hercul. u. 452. & 453. Dygd utan dadlige mildhet, en dunst aer; en malning i watne; Skugg' utan kropp; en fyllning af wind; ett hliom, och ett Nordblys. Praefatio. Septentrionales oras, ceu diuturnis densisque immersas tenebris, horruerunt exterorum ueterum non pauci, nescio qua mentis caligine obducti. Sed, si uerum fateri liceat, plagae Polo Arctoo uicinae, uberrima luminis copia, ceteras regiones omnes, austrum uersus positas, facile antecellunt. Taceo jam crepusculorum diuturnitatem, quae, cum ex

In [43]:
# if you want a list of tokens filtered by certain POS tags, use the following:
# (token[0] is the word form as it appears in the text, which is what the output below shows;
# append token[1] instead to collect the lemmata)
filtered_sents = []
for sent_data in sents_data:
    filtered_sent = []
    for token in sent_data[3]:
        if token[2] in ["NOUN", "VERB", "ADJ", "PROPN"]:
            filtered_sent.append(token[0])
    filtered_sents.append(filtered_sent)
filtered_sents[150:155]

[['Cl'],
 ['Kirchii', 'Descript'],
 ['loc'],
 ['cit'],
 ['taceam',
  'quamplurimas',
  'obseruationes',
  'tempore',
  'Europa',
  'factas',
  'publicatas']]
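
The difference between word forms and lemmata can be made explicit with a small helper. This is only a sketch: `select_tokens` and its `use_lemma` parameter are hypothetical names introduced here, and the sample record is sentence 110 from the output shown earlier.

```python
# POS tags treated as "content words" in this notebook.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "PROPN"}


def select_tokens(sent_record, pos_tags=CONTENT_POS, use_lemma=True):
    """Return lemmata (or word forms) of the tokens whose POS tag is in pos_tags."""
    field = 1 if use_lemma else 0  # index 1 = lemma, index 0 = word form
    return [token[field] for token in sent_record[3] if token[2] in pos_tags]


# Sentence 110 of document 1378359, copied from the output above.
sample = ['1378359', 110, 'Haec talia, anno praeterito 1587. d. 26.',
          [['Haec', 'hic', 'DET', [0, 4]],
           ['talia', 'talis', 'DET', [5, 10]],
           [',', ',', 'PUNCT', [10, 11]],
           ['anno', 'annus', 'NOUN', [12, 16]],
           ['praeterito', 'praetereo', 'VERB', [17, 27]],
           ['1587', '1587s', 'NUM', [28, 32]],
           ['.', '.', 'PUNCT', [32, 33]],
           ['d.', '', 'ADJ', [34, 36]],
           ['26', '26', 'NUM', [37, 39]],
           ['.', '.', 'PUNCT', [39, 40]]]]

forms = select_tokens(sample, use_lemma=False)
lemmata = select_tokens(sample, use_lemma=True)
print(forms)
print(lemmata)
```

In this sample the abbreviation "d." has an empty lemma, so some cleaning of the lemmata is usually needed before counting them.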

In [44]:
# based on the metadata, you can easily focus on a subset of documents
# for instance, we want to focus on all texts from the first two decades of the 17th century:

noscemus_subset = noscemus_metadata[noscemus_metadata["file_year"].between(1600, 1620)]
# to work with the subset, we need to know the IDs of the documents
ids = noscemus_subset["id"]
# Subsequently, we can load the data for each document by its ID and calculate the vocabulary of the texts:
# (depending on the size of the subset and your internet connection, this may take a while)
base_url = "https://ccs-lab.zcu.cz/noscemus_sents_data/{}.json"
subset_lemmatized_sentences = []
for doc_id in ids:  # for each work ID from our subset of IDs
    f_sents_data = requests.get(base_url.format(doc_id)).json()  # download and parse the sentence data
    for sent_data in f_sents_data:
        sent_lemmata = [t[1] for t in sent_data[3] if t[2] in ["NOUN", "VERB", "ADJ", "PROPN"]]  # keep lemmata with specific POS tags
        sent_lemmata = [re.sub(r"[\W\d]+", "", l) for l in sent_lemmata]  # strip non-word characters and digits
        sent_lemmata = [l for l in sent_lemmata if len(l) > 1]  # drop empty strings and one-letter words
        sent_lemmata = [l.lower() for l in sent_lemmata]  # lowercase all lemmata
        subset_lemmatized_sentences.append(sent_lemmata)  # add the lemmata of the current sentence to the overall list of sentences
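
The cleaning steps can also be tested in isolation before running them on a whole subset. The sketch below uses a hypothetical helper, `clean_lemmata`, with a single character class `[\W\d]+`. That pattern is a choice made for this example: an alternation of starred patterns such as `\W*|\d*` can match the empty string on its first branch and then never reach the digit branch, so digits would stay in the lemmata.

```python
import re

# Matches runs of non-word characters and digits (underscores are kept).
NON_LETTERS = re.compile(r"[\W\d]+")


def clean_lemmata(lemmata, min_length=2):
    """Strip punctuation and digits, lowercase, and drop very short items."""
    cleaned = []
    for lemma in lemmata:
        lemma = NON_LETTERS.sub("", lemma).lower()
        if len(lemma) >= min_length:
            cleaned.append(lemma)
    return cleaned


# "1587s" shrinks to "s" and is dropped; the empty lemma is dropped as well.
example = clean_lemmata(["Annus", "1587s", "decemb.", "", "praetereo"])
print(example)
```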

In [45]:
# now you have lemmatized sentences for all texts in the subset
# let's take a look at the first few sentences:
subset_lemmatized_sentences[:10]

[['poema',
  'janus',
  'janus',
  'gulielmius',
  'ualeo',
  'acidalius',
  'nouus',
  'editio'],
 ['lignix', 'impensa', 'dauidus', 'alipertus', 'bibliopolas', 'uratisl'],
 ['annus', 'cio', 'io', 'ciiius'],
 ['excello',
  'doctrina',
  'uirtus',
  'uiris',
  'jacobus',
  'consiliarius',
  'ligiobregensis',
  'daniel',
  'rindfleisch',
  'bucretiocphilosophus',
  'patritius',
  'uratislauiensibus',
  'dominus',
  'patronum',
  'obseruo',
  'caspar',
  'agna',
  'pius',
  'antiquitas',
  'poeta',
  'existimatio',
  'uiri',
  'nobilissimus',
  'lu',
  'magnus',
  'reuerentia',
  'obseruo'],
 ['primus', 'phiiosophiaa', 'magistrus', 'musons', 'nuncupo'],
 ['filius',
  'deus',
  'deus',
  'interpres',
  'propheta',
  'pater',
  'sapientia',
  'uirtus',
  'genitor',
  'plato',
  'magnus',
  'philosophia',
  'sol',
  'perhibeo'],
 ['eximius',
  'ingenium',
  'felicitas',
  'uarius',
  'exquiro',
  'doctrina',
  'instructus',
  'antesignanus',
  'eruditio',
  'uirtus',
  'uia'],
 ['graeci', 'f

In [46]:
# you can flatten this list of lists into a single list of lemmatized words:
subset_lemmata = [lemma for sent in subset_lemmatized_sentences for lemma in sent]
# from this list you can compute the vocabulary of the texts, with lemma frequencies sorted from most to least frequent:
subset_vocab = nltk.FreqDist(subset_lemmata).most_common()
subset_vocab[:100]

[('pars', 21986),
 ('dico', 20590),
 ('habeo', 17009),
 ('possum', 16985),
 ('facio', 16617),
 ('locus', 13746),
 ('liber', 13129),
 ('res', 11965),
 ('dies', 11843),
 ('primus', 11829),
 ('motus', 11645),
 ('uideo', 11555),
 ('corpus', 11122),
 ('do', 10954),
 ('terra', 10160),
 ('ratio', 9733),
 ('natura', 9200),
 ('os', 8809),
 ('annus', 8723),
 ('sol', 8350),
 ('magnus', 8162),
 ('capitulum', 8053),
 ('tempus', 7974),
 ('caput', 7261),
 ('homo', 6474),
 ('aqua', 6419),
 ('luna', 6084),
 ('medius', 6027),
 ('modus', 5968),
 ('uerus', 5850),
 ('uis', 5803),
 ('oculus', 5798),
 ('fio', 5720),
 ('materia', 5708),
 ('causa', 5655),
 ('forma', 5609),
 ('circulus', 5312),
 ('urbs', 5216),
 ('angulus', 4854),
 ('genus', 4849),
 ('stella', 4840),
 ('deus', 4826),
 ('tertius', 4770),
 ('secundus', 4726),
 ('linea', 4687),
 ('usus', 4681),
 ('similis', 4576),
 ('nomen', 4516),
 ('sequor', 4473),
 ('moueo', 4437),
 ('color', 4411),
 ('animal', 4352),
 ('anima', 4339),
 ('figura', 4303),
 ('ago
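
The same kind of frequency list can be produced with the standard-library `collections.Counter`, which also offers a `most_common()` method. The sketch below is an illustration on three short sentences copied from the output shown earlier, not a result for the whole subset. It also shows how to turn raw counts into relative frequencies, which are easier to compare across subsets of different sizes.

```python
from collections import Counter

# Three lemmatized sentences copied from the output shown earlier.
sentences = [
    ["poema", "janus", "janus", "gulielmius", "ualeo", "acidalius", "nouus", "editio"],
    ["annus", "cio", "io", "ciiius"],
    ["primus", "phiiosophiaa", "magistrus", "musons", "nuncupo"],
]

# Flatten the sentences and count each lemma.
lemma_counts = Counter(lemma for sent in sentences for lemma in sent)
total = sum(lemma_counts.values())

# Relative frequency = count divided by the total number of lemma tokens.
relative = {lemma: count / total for lemma, count in lemma_counts.items()}

top = lemma_counts.most_common(1)
print(top, total)
```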

In [47]:
# with lemmatized sentences, you can also proceed directly to various kinds of co-occurrence analysis or to training word embeddings.
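
As a first step in that direction, here is a minimal sketch of sentence-level co-occurrence counting. It assumes that two lemmata "co-occur" when they appear in the same sentence, which is only one of several possible definitions. The function name `cooccurrence_counts` is hypothetical, and the toy sentences are shortened for illustration rather than taken verbatim from the data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how many sentences contain each unordered pair of lemmata."""
    pair_counts = Counter()
    for sent in sentences:
        # Use each lemma once per sentence and sort, so that
        # ("deus", "pater") and ("pater", "deus") become the same key.
        unique = sorted(set(sent))
        pair_counts.update(combinations(unique, 2))
    return pair_counts


# Toy input in the same shape as subset_lemmatized_sentences.
toy = [
    ["deus", "pater", "sapientia"],
    ["deus", "sapientia", "uirtus"],
    ["doctrina", "uirtus"],
]

pairs = cooccurrence_counts(toy)
print(pairs.most_common(3))
```

On a real subset, the number of distinct pairs grows quickly with sentence length, so it may be worth restricting the vocabulary to frequent lemmata first.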