<a href="https://colab.research.google.com/github/riinakik/digital-humanities-technologies/blob/main/assignment_stanza.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Introduction**

For this assignment, I selected a short biographical text about the French composer Claude Debussy. The goal of the analysis was to explore the linguistic structure of the text using the Stanza natural language processing toolkit. Specifically, I examined how the text is organized at different linguistic levels: sentence structure, tokenization, parts of speech, lemmas, dependency relations, and named entities.
I wanted to understand the text’s writing style, its distribution of parts of speech, and the proportions of nouns, verbs, and other grammatical categories. In addition, I analyzed the morphological and syntactic patterns to see how biographical information is presented through language.

In [None]:
# 1. Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# 2. Change directory to your assignment folder
%cd "/content/drive/MyDrive/Digihumanitaaria tehnoloogiad/assignment_debussy"

# 3. Read the text file
with open("debussy.txt", "r", encoding="utf-8") as f:
    content = f.read()

# 4. Display the file to verify
content


Mounted at /content/drive
/content/drive/MyDrive/Digihumanitaaria tehnoloogiad/assignment_debussy


"Achille Claude Debussy (22 August 1862 – 25 March 1918) was a French composer. He is sometimes seen as the first Impressionist composer, although he rejected the term. He was one of the most influential composers of the late nineteenth and early twentieth centuries.\n\nBorn to a family of modest means and little cultural involvement, Debussy showed enough musical talent to be admitted at the age of ten to France's leading music college, the Conservatoire de Paris. He originally studied the piano, but found his vocation in innovative composition, despite the disapproval of the Conservatoire's conservative professors. He took many years to develop his mature style and was nearly forty when he achieved international fame in 1902 with the only opera he completed, Pelleas et Melisande.\n\nDebussy's orchestral works include Prelude to the Afternoon of a Faun (1894), Nocturnes (1897–1899), and Images (1905–1912). His music was in many ways a reaction against Wagner and the German musical tra

**Result**

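The cell mounts Google Drive, changes to the assignment folder, and reads debussy.txt into the string content. The displayed output is the (truncated) beginning of that string.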

In [None]:
# Install the Stanza NLP library.
!pip install stanza

# Import the Stanza module and download the English language models.
import stanza
stanza.download("en")

# Initialize the Stanza NLP pipeline for English.
nlp = stanza.Pipeline("en")

# Load the selected text file ("debussy.txt") into a Python string.
# UTF-8 encoding ensures that special characters are handled correctly.
with open("debussy.txt", "r", encoding="utf-8") as f:
    content = f.read()

# Process the text using the Stanza pipeline.
# This creates a 'doc' object that contains sentences, words, POS tags, lemmas, and more.
doc = nlp(content)

Collecting stanza
  Downloading stanza-1.11.0-py3-none-any.whl.metadata (14 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.15.0-py3-none-any.whl.metadata (5.7 kB)
Downloading stanza-1.11.0-py3-none-any.whl (1.7 MB)
Downloading emoji-2.15.0-py3-none-any.whl (608 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.15.0 stanza-1.11.0


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor    | Package                   |
--------------------------------------------
| tokenize     | combined                  |
| mwt          | combined                  |
| pos          | combined_charlm           |
| lemma        | combined_nocharlm         |
| constituency | ptb3-revised_charlm       |
| depparse     | combined_charlm           |
| sentiment    | sstplus_charlm            |
| ner          | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: constituency
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


**Result**

Stanza (version 1.11.0) was installed, and the default English models were downloaded.
The default pipeline loaded eight processors (tokenize, mwt, pos, lemma, constituency, depparse, sentiment, ner) and processed debussy.txt into a structured doc object containing sentences, tokens, lemmas, POS tags, dependency relations, and named entities.

In [None]:
# Print the number of sentences in the processed document.
len(doc.sentences)


14

**Result**

The text contains 14 sentences.

In [None]:
# Display the full text of the first sentence.
doc.sentences[0].text

'Achille Claude Debussy (22 August 1862 – 25 March 1918) was a French composer.'

**Result**

Shows the first sentence of the text: "Achille Claude Debussy (22 August 1862 – 25 March 1918) was a French composer."

In [None]:
# Extract and print all individual word tokens from the first sentence.
[w.text for w in doc.sentences[0].words]


['Achille',
 'Claude',
 'Debussy',
 '(',
 '22',
 'August',
 '1862',
 '–',
 '25',
 'March',
 '1918',
 ')',
 'was',
 'a',
 'French',
 'composer',
 '.']

**Result**

The output shows all individual tokens (words and symbols) from the first sentence of the text.

In [None]:
# Extract all word tokens from the entire text, excluding punctuation.
# The list collects every token whose universal POS tag (upos) is not "PUNCT".
# Punctuation is dropped so that the count reflects words only;
# numerals (such as the years) are still included.
words = [w.text for s in doc.sentences for w in s.words if w.upos != "PUNCT"]

# Display the total number of non-punctuation words in the text.
len(words)


276

**Result**

The text contains 276 tokens once punctuation is removed. This count still includes numerals such as dates, so it is slightly higher than the number of ordinary words.
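The two counts obtained so far (14 sentences, 276 non-punctuation tokens) can be combined into an average sentence length. The following is a minimal sketch that hard-codes the two figures reported above; in the notebook itself the same value could be computed as `len(words) / len(doc.sentences)`.

```python
# Figures copied from the notebook outputs above.
num_sentences = 14
num_words = 276

# Mean number of non-punctuation tokens per sentence.
avg_sentence_length = num_words / num_sentences

print(f"Average sentence length: {avg_sentence_length:.1f} tokens")
```

An average of roughly 20 tokens per sentence suggests fairly long, information-dense sentences, although a per-sentence breakdown would be needed to see how much the length varies.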


In [None]:
# Extract all nouns from the entire text.
# Nouns are identified by POS tags starting with "NN" (e.g., NN, NNS, NNP, NNPS).
nouns = [w.text for s in doc.sentences for w in s.words if w.xpos.startswith("NN")]

# Extract all verbs from the entire text.
# Verbs are identified by Penn Treebank tags starting with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
# Note: this also counts copulas and auxiliaries such as "was" (VBD), but not modals (MD).
verbs = [w.text for s in doc.sentences for w in s.words if w.xpos.startswith("VB")]

# Display the total number of nouns and verbs.
len(nouns), len(verbs)

(80, 33)

**Result**

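The text contains 80 nouns but only 33 verbs, a ratio of about 2.4 to 1. This noun-heavy profile is consistent with a descriptive, informational text such as a biography.
It suggests that the text concentrates on names, works, and facts rather than on narrated actions. Note that the verb count includes copulas and auxiliaries such as "was", so the number of verbs that describe actions is lower still.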
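Because the "VB" prefix also matches copulas and auxiliaries, it can be useful to separate them from lexical verbs. The sketch below shows one possible way to do this by combining the xpos tag with the dependency label. It runs on a small stand-in for the first sentence, built with `SimpleNamespace` from the (text, xpos, deprel) values shown in this notebook's outputs; the helper names and the choice of excluded relations are assumptions made for illustration, not part of Stanza.

```python
from types import SimpleNamespace

# (text, xpos, deprel) for the first sentence, copied from the notebook outputs.
SENTENCE_1 = [
    ("Achille", "NNP", "nsubj"), ("Claude", "NNP", "flat"),
    ("Debussy", "NNP", "flat"), ("(", "-LRB-", "punct"),
    ("22", "CD", "nmod:unmarked"), ("August", "NNP", "compound"),
    ("1862", "CD", "nummod"), ("–", "SYM", "case"),
    ("25", "CD", "nmod"), ("March", "NNP", "compound"),
    ("1918", "CD", "nmod:unmarked"), (")", "-RRB-", "punct"),
    ("was", "VBD", "cop"), ("a", "DT", "det"),
    ("French", "JJ", "amod"), ("composer", "NN", "root"),
    (".", ".", "punct"),
]

# Stand-in objects exposing the same attribute names as Stanza words.
sample_words = [SimpleNamespace(text=t, xpos=x, deprel=d) for t, x, d in SENTENCE_1]

# Dependency labels treated here as "not a lexical verb" (an assumption).
AUXILIARY_RELATIONS = {"aux", "aux:pass", "cop"}


def nouns_in(words):
    """Return all words whose Penn Treebank tag starts with NN."""
    return [w.text for w in words if w.xpos.startswith("NN")]


def all_verbs_in(words):
    """Return all words tagged VB*, including copulas and auxiliaries."""
    return [w.text for w in words if w.xpos.startswith("VB")]


def lexical_verbs_in(words):
    """Return VB* words that are not attached as a copula or auxiliary."""
    return [
        w.text for w in words
        if w.xpos.startswith("VB") and w.deprel not in AUXILIARY_RELATIONS
    ]


print(nouns_in(sample_words))
print(all_verbs_in(sample_words))
print(lexical_verbs_in(sample_words))
```

For the first sentence, the only VB* token is the copula "was", so the lexical-verb list is empty. Applied to all sentences of `doc`, the same filter would give a stricter verb count than the 33 reported above.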

In [None]:
# Calculate the proportion of verbs relative to all non-punctuation tokens in the text.
# This gives a rough measure of how verb-heavy the text is.
len(verbs) / len(words)


0.11956521739130435

**Result**

About 12% (33 of 276) of the non-punctuation tokens in the text are verbs.

In [None]:
# Define a dictionary that groups POS categories and their corresponding tag codes.
# This allows us to count how many times each part of speech appears in the text.
pos_tags = {
    "Conjunction": ["CC"],
    "Pronoun": ["PRP", "PRP$", "WP", "WP$"],
    "Noun": ["NN", "NNS", "NNP", "NNPS"],
    "Verb": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "Adjective": ["JJ", "JJR", "JJS"]
}

# Count how many words belong to each POS category defined above.
# The result is stored in a dictionary where each key is a POS name
# and each value is the total count of that category in the text.
results = {}
for pos_name, tags in pos_tags.items():
    results[pos_name] = len([w.text for s in doc.sentences for w in s.words if w.xpos in tags])

# Display the POS frequency counts.
results


{'Conjunction': 10, 'Pronoun': 20, 'Noun': 80, 'Verb': 33, 'Adjective': 36}

**Result**

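Nouns (80) outnumber both adjectives (36) and verbs (33) by more than two to one, which is consistent with a descriptive, fact-based biographical style. The five categories together account for 179 of the 276 non-punctuation tokens; the remainder are determiners, prepositions, numerals, adverbs, and other tags not counted here.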

In [None]:
# Sort the POS results by their frequency in descending order.
# This makes it easy to see which parts of speech appear most often in the text.
sorted(results.items(), key=lambda x: x[1], reverse=True)


[('Noun', 80),
 ('Adjective', 36),
 ('Verb', 33),
 ('Pronoun', 20),
 ('Conjunction', 10)]

**Result**

Sorted by frequency, nouns lead by a wide margin, while adjectives and verbs occur at similar rates. This supports the reading that the text presents information rather than narrating events or dialogue.
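Raw counts are easier to compare across texts when they are expressed as shares of the total. The sketch below converts the counts reported above into percentages of the 276 non-punctuation tokens. The dictionary and the total are copied from the notebook outputs; the variable names `shares` and `covered` are introduced here for illustration.

```python
# Counts and total copied from the notebook outputs above.
results = {"Conjunction": 10, "Pronoun": 20, "Noun": 80, "Verb": 33, "Adjective": 36}
total_words = 276

# Share of each category among all non-punctuation tokens.
shares = {name: count / total_words for name, count in results.items()}

# Number of tokens covered by the five categories together.
covered = sum(results.values())

for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<12}{share:6.1%}")

print(f"Covered by these categories: {covered} of {total_words} ({covered / total_words:.1%})")
```

Nouns make up about 29% of the tokens, and the five categories together cover roughly 65%, so about a third of the tokens belong to tags that this analysis does not count.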

In [None]:
# Extract the lemma (dictionary form) of every word in the text.
lemmas = [w.lemma for s in doc.sentences for w in s.words]

# Display the first 40 lemmas.
lemmas[:40]


['Achille',
 'Claude',
 'Debussy',
 '(',
 '22',
 'August',
 '1862',
 '-',
 '25',
 'March',
 '1918',
 ')',
 'be',
 'a',
 'French',
 'composer',
 '.',
 'he',
 'be',
 'sometimes',
 'see',
 'as',
 'the',
 'first',
 'impressionist',
 'composer',
 ',',
 'although',
 'he',
 'reject',
 'the',
 'term',
 '.',
 'he',
 'be',
 'one',
 'of',
 'the',
 'most',
 'influential']

**Result**

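The output shows the first 40 lemmas of the text, including punctuation.

A lemma is the base or dictionary form of a word. Examples from the output above:
"was" and "is" → "be"
"rejected" → "reject"
"Impressionist" → "impressionist" (lowercased)
"–" → "-" (the en dash is normalized to a hyphen)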

In [None]:
# Generate a list of tuples showing the dependency relations in the first sentence.
# For each word, we extract:
# 1) the word form (w.text)
# 2) the dependency label (w.deprel), showing the grammatical function
# 3) the head word it depends on (syntactic governor)
#
# If w.head == 0, the word is the ROOT of the sentence.
# Otherwise, w.head-1 gives the index of its governing word.
[(w.text, w.deprel, doc.sentences[0].words[w.head-1].text if w.head > 0 else "ROOT")
 for w in doc.sentences[0].words]


[('Achille', 'nsubj', 'composer'),
 ('Claude', 'flat', 'Achille'),
 ('Debussy', 'flat', 'Achille'),
 ('(', 'punct', 'August'),
 ('22', 'nmod:unmarked', 'Achille'),
 ('August', 'compound', '22'),
 ('1862', 'nummod', 'August'),
 ('–', 'case', '25'),
 ('25', 'nmod', 'August'),
 ('March', 'compound', '25'),
 ('1918', 'nmod:unmarked', '25'),
 (')', 'punct', 'August'),
 ('was', 'cop', 'composer'),
 ('a', 'det', 'composer'),
 ('French', 'amod', 'composer'),
 ('composer', 'root', 'ROOT'),
 ('.', 'punct', 'composer')]

**Result**

The output shows the dependency relations for every word in the first sentence.

For example:

('Achille', 'nsubj', 'composer')
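→ Achille (the head of the name "Achille Claude Debussy") is the nominal subject of the predicate headed by composer. In the Universal Dependencies scheme the predicate noun composer is the root of a copular sentence, and the copula was attaches to it with the label cop.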

('Claude', 'flat', 'Achille')
→ Claude is linked to Achille as part of a name construction.
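The flat list of (word, relation, head) tuples can be turned into a small tree by grouping each word under its governor. The sketch below does this for the first sentence. The 1-based head indices are reconstructed from the head words in the output above (each head word occurs only once in the sentence); the function name `children_by_head` is hypothetical.

```python
from collections import defaultdict

# (text, head index, deprel) for the first sentence; head 0 marks the root.
SENTENCE_1 = [
    ("Achille", 16, "nsubj"), ("Claude", 1, "flat"), ("Debussy", 1, "flat"),
    ("(", 6, "punct"), ("22", 1, "nmod:unmarked"), ("August", 5, "compound"),
    ("1862", 6, "nummod"), ("–", 9, "case"), ("25", 6, "nmod"),
    ("March", 9, "compound"), ("1918", 9, "nmod:unmarked"), (")", 6, "punct"),
    ("was", 16, "cop"), ("a", 16, "det"), ("French", 16, "amod"),
    ("composer", 0, "root"), (".", 16, "punct"),
]


def children_by_head(rows):
    """Group (text, deprel) pairs under the text of their governing word."""
    tree = defaultdict(list)
    for text, head, deprel in rows:
        # Head indices are 1-based, so subtract 1 to index into the list.
        governor = rows[head - 1][0] if head > 0 else "ROOT"
        tree[governor].append((text, deprel))
    return dict(tree)


tree = children_by_head(SENTENCE_1)
for governor, dependents in tree.items():
    print(governor, "->", dependents)
```

Grouping by head makes the structure easier to read: composer governs the subject, the copula, the determiner, the adjective, and the final period, while the two dates hang off the name as a nested sub-structure.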

In [None]:
# Create a Stanza pipeline with only the processors needed here:
# - tokenize: split text into sentences and tokens
# - pos: assign part-of-speech tags
# - lemma: reduce words to their base form
# - depparse: analyze syntactic dependency structure
# - ner: identify named entities (people, dates, organizations, works of art, etc.)
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse,ner")

# Process the text again with this reduced pipeline.
doc = nlp(content)

# Extract all named entities from the text.
# For each entity, we store:
# 1) the entity text (ent.text)
# 2) the entity type label (ent.type), such as PERSON, DATE, ORG, GPE (geopolitical entity), WORK_OF_ART, etc.
[(ent.text, ent.type) for ent in doc.ents]


INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| pos       | combined_charlm           |
| lemma     | combined_nocharlm         |
| depparse  | combined_charlm           |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


[('Achille Claude Debussy', 'PERSON'),
 ('22 August 1862', 'DATE'),
 ('25 March 1918', 'DATE'),
 ('French', 'NORP'),
 ('first', 'ORDINAL'),
 ('one', 'CARDINAL'),
 ('the late nineteenth and early twentieth centuries', 'DATE'),
 ('Debussy', 'PERSON'),
 ('ten', 'CARDINAL'),
 ("France's", 'PERSON'),
 ('the Conservatoire de Paris', 'ORG'),
 ("Conservatoire's", 'NORP'),
 ('many years', 'DATE'),
 ('1902', 'DATE'),
 ('Pelleas et Melisande', 'PERSON'),
 ('Prelude to the Afternoon of a Faun', 'WORK_OF_ART'),
 ('1894', 'DATE'),
 ('Nocturnes', 'WORK_OF_ART'),
 ('1897', 'DATE'),
 ('1899', 'DATE'),
 ('Images', 'WORK_OF_ART'),
 ('1905', 'DATE'),
 ('1912', 'DATE'),
 ('Wagner', 'PERSON'),
 ('German', 'NORP'),
 ('The Sea (La mer)', 'WORK_OF_ART'),
 ('1903', 'DATE'),
 ('1905', 'DATE'),
 ('twenty-four', 'CARDINAL'),
 ('twelve', 'CARDINAL'),
 ('Symbolist', 'NORP'),
 ('the later nineteenth century', 'DATE'),
 ('The Blessed Damozel', 'WORK_OF_ART'),
 ('The Martyrdom of Saint Sebastian', 'WORK_OF_ART'),
 ('hi

**Result**

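In this text, Stanza identifies most of the key biographical information about Claude Debussy.

For example:
“Achille Claude Debussy” → PERSON.
“22 August 1862” → DATE.
“the Conservatoire de Paris” → ORG.

Not every label is correct, however. “France's” and the opera title “Pelleas et Melisande” are tagged PERSON, and “Conservatoire's” is tagged NORP, so the entity list should be checked manually before it is used for further analysis.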
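A long entity list is easier to inspect when it is grouped by type. The sketch below groups the first fifteen (text, type) pairs from the output above; in the notebook the same grouping could be built from `[(ent.text, ent.type) for ent in doc.ents]`. The function name `group_entities` is introduced here for illustration.

```python
from collections import defaultdict

# The first fifteen entities, copied from the notebook output above.
entities = [
    ("Achille Claude Debussy", "PERSON"), ("22 August 1862", "DATE"),
    ("25 March 1918", "DATE"), ("French", "NORP"), ("first", "ORDINAL"),
    ("one", "CARDINAL"),
    ("the late nineteenth and early twentieth centuries", "DATE"),
    ("Debussy", "PERSON"), ("ten", "CARDINAL"), ("France's", "PERSON"),
    ("the Conservatoire de Paris", "ORG"), ("Conservatoire's", "NORP"),
    ("many years", "DATE"), ("1902", "DATE"),
    ("Pelleas et Melisande", "PERSON"),
]


def group_entities(pairs):
    """Map each entity type to the list of entity texts carrying that type."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


by_type = group_entities(entities)

# Print the types from most to least frequent.
for label, texts in sorted(by_type.items(), key=lambda item: len(item[1]), reverse=True):
    print(f"{label} ({len(texts)}): {texts}")
```

Listing the entities per type makes questionable labels stand out: in the PERSON group, for instance, "France's" and "Pelleas et Melisande" appear next to the two genuine references to Debussy.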

In [None]:
# For every word in the first sentence, extract three pieces of information:
# 1) w.text  → the original word form in the sentence
# 2) w.xpos  → the treebank-specific POS tag (Penn Treebank tags for English)
# 3) w.feats → morphological features, such as Number, Tense, Person, Mood, Case, Gender, etc.

# This gives a detailed grammatical profile of each word,
# allowing deeper analysis of how the sentence is structured linguistically.
[(w.text, w.xpos, w.feats) for w in doc.sentences[0].words]


[('Achille', 'NNP', 'Number=Sing'),
 ('Claude', 'NNP', 'Number=Sing'),
 ('Debussy', 'NNP', 'Number=Sing'),
 ('(', '-LRB-', None),
 ('22', 'CD', 'NumForm=Digit|NumType=Card'),
 ('August', 'NNP', 'Number=Sing'),
 ('1862', 'CD', 'NumForm=Digit|NumType=Card'),
 ('–', 'SYM', None),
 ('25', 'CD', 'NumForm=Digit|NumType=Card'),
 ('March', 'NNP', 'Number=Sing'),
 ('1918', 'CD', 'NumForm=Digit|NumType=Card'),
 (')', '-RRB-', None),
 ('was', 'VBD', 'Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin'),
 ('a', 'DT', 'Definite=Ind|PronType=Art'),
 ('French', 'JJ', 'Degree=Pos'),
 ('composer', 'NN', 'Number=Sing'),
 ('.', '.', None)]

**Result**

This output gives a token-by-token grammatical profile of the first sentence.
It shows:

Proper names (Achille, Claude, Debussy) and month names (August, March) tagged NNP with Number=Sing.

Day and year numbers tagged as cardinal numerals (CD, NumForm=Digit|NumType=Card).

The copula was tagged VBD, with features marking it as a finite, indicative, past-tense, third-person singular form.

Brackets, the final period, and the en dash (tagged SYM) carrying no morphological features (None).

The common noun composer (NN, Number=Sing) and the adjective French (JJ, Degree=Pos) tagged with their part of speech and morphological features.
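The feats value is a single string of `Name=Value` pairs separated by `|`, or `None` when a token has no features. To query individual features, it helps to split the string into a dictionary. The following is a minimal sketch; `parse_feats` is a hypothetical helper written for this notebook, not a Stanza function.

```python
def parse_feats(feats):
    """Turn 'Number=Sing|Tense=Past' into {'Number': 'Sing', 'Tense': 'Past'}.

    Returns an empty dict when feats is None or empty (e.g. for punctuation).
    """
    if not feats:
        return {}
    # Split on the first '=' only, so each pair yields exactly name and value.
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Feature strings copied from the notebook output above.
was_feats = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin")
bracket_feats = parse_feats(None)

print(was_feats["Tense"])
print(bracket_feats)
```

With the features in dictionary form, a question such as "how many finite verbs are in the past tense?" becomes a simple filter on `parse_feats(w.feats).get("Tense") == "Past"` over all words in `doc`.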

In [None]:
import os

# ------------------------------
# ANALYSIS PART
# ------------------------------

num_sentences = len(doc.sentences)
words = [w.text for s in doc.sentences for w in s.words if w.upos != "PUNCT"]
nouns = [w.text for s in doc.sentences for w in s.words if w.xpos.startswith("NN")]
verbs = [w.text for s in doc.sentences for w in s.words if w.xpos.startswith("VB")]
verb_ratio = len(verbs) / len(words)

pos_tags = {
    "Conjunction": ["CC"],
    "Pronoun": ["PRP", "PRP$", "WP", "WP$"],
    "Noun": ["NN", "NNS", "NNP", "NNPS"],
    "Verb": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
    "Adjective": ["JJ", "JJR", "JJS"]
}

results = {}
for pos_name, tags in pos_tags.items():
    results[pos_name] = len([w for s in doc.sentences for w in s.words if w.xpos in tags])

# ------------------------------
# CREATE OUTPUT FOLDER
# ------------------------------

output_folder = "/content/drive/MyDrive/Digihumanitaaria tehnoloogiad/assignment_debussy/output"
os.makedirs(output_folder, exist_ok=True)

output_file = os.path.join(output_folder, "analysis.txt")

# ------------------------------
# WRITE ANALYSIS TO FILE
# ------------------------------

with open(output_file, "w", encoding="utf-8") as out:
    out.write("Analysis of the Selected Text\n")
    out.write("------------------------------------\n\n")

    out.write(f"Number of sentences: {num_sentences}\n")
    out.write(f"Total meaningful words (no punctuation): {len(words)}\n")
    out.write(f"Noun count: {len(nouns)}\n")
    out.write(f"Verb count: {len(verbs)}\n")
    out.write(f"Proportion of verbs: {verb_ratio:.2f}\n\n")

    out.write("Part-of-Speech distribution:\n")
    for pos_name, count in results.items():
        out.write(f"  {pos_name}: {count}\n")

    out.write("\nNamed Entities:\n")
    for ent in doc.ents:
        out.write(f"  {ent.text}  -->  {ent.type}\n")

print("analysis.txt has been created successfully!")
print("Saved to:", output_file)


analysis.txt has been created successfully!
Saved to: /content/drive/MyDrive/Digihumanitaaria tehnoloogiad/assignment_debussy/output/analysis.txt


**Result**

This code creates the output folder if it does not already exist and writes a file named analysis.txt into it.

The file contains a summary of the linguistic features extracted from the Claude Debussy biography using the Stanza NLP pipeline.
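After writing a report it is worth reading it back to confirm that the contents are as expected. The sketch below illustrates this check with the figures reported in this notebook. It writes to a temporary directory rather than to Google Drive so that it can run anywhere; the function name `write_report` and the shortened report layout are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def write_report(folder, stats):
    """Write the summary statistics to analysis.txt and return the file path."""
    folder = Path(folder)
    # Create the folder (and any missing parents) if it does not exist yet.
    folder.mkdir(parents=True, exist_ok=True)
    report_path = folder / "analysis.txt"
    lines = [f"{label}: {value}" for label, value in stats.items()]
    report_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return report_path


# Figures copied from the notebook outputs above.
stats = {
    "Number of sentences": 14,
    "Total words (no punctuation)": 276,
    "Noun count": 80,
    "Verb count": 33,
    "Proportion of verbs": f"{33 / 276:.2f}",
}

with tempfile.TemporaryDirectory() as tmp:
    path = write_report(Path(tmp) / "output", stats)
    # Read the file back to verify what was written.
    report_text = path.read_text(encoding="utf-8")

print(report_text)
```

Returning the path from the function keeps the write step and the verification step separate, which makes the check easy to repeat after the analysis is rerun.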