This notebook illustrates the `NLP()` pipeline for all available languages.

If dependency parse information is available for a language, an example tree is printed as well.
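To make "dependency parse information" concrete: in the outputs further down, a parsed `Word` carries a `governor` (the index of its head token) and a `dependency_relation`. The sketch below shows how a tree could be rebuilt from those two fields alone. It is not the `DependencyTree` implementation; `ToyWord`, `has_dependency_info` and `render_tree` are hypothetical names, the sentence is made up, and the root is assumed to have a `governor` that points outside the sentence (here `-1`).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyWord:
    """Hypothetical stand-in for a CLTK `Word`; only the fields used here."""

    index_token: int
    string: str
    governor: Optional[int] = None
    dependency_relation: Optional[str] = None


def has_dependency_info(sentence: List[ToyWord]) -> bool:
    """True if the sentence is non-empty and every word has a head and a relation."""
    return bool(sentence) and all(
        w.governor is not None and w.dependency_relation is not None
        for w in sentence
    )


def render_tree(sentence: List[ToyWord]) -> List[str]:
    """Return one indented line per word, children nested under their governor."""
    indices = {w.index_token for w in sentence}
    children: Dict[int, List[ToyWord]] = {}
    roots: List[ToyWord] = []
    for w in sentence:
        # Assumption: a governor outside the sentence (e.g. -1) marks the root.
        if w.governor in indices and w.governor != w.index_token:
            children.setdefault(w.governor, []).append(w)
        else:
            roots.append(w)

    lines: List[str] = []

    def visit(word: ToyWord, depth: int) -> None:
        lines.append(f"{'  ' * depth}{word.dependency_relation} | {word.string}")
        for child in children.get(word.index_token, []):
            visit(child, depth + 1)

    for root in roots:
        visit(root, 0)
    return lines


# Made-up, hand-annotated sentence purely for illustration.
sentence = [
    ToyWord(0, "the", governor=1, dependency_relation="det"),
    ToyWord(1, "cat", governor=2, dependency_relation="nsubj"),
    ToyWord(2, "sleeps", governor=-1, dependency_relation="root"),
]

if has_dependency_info(sentence):
    print("\n".join(render_tree(sentence)))
```

A tokenization-only pipeline leaves `governor` and `dependency_relation` as `None`, so `has_dependency_info` returns `False` for it and no tree is attempted.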

In [1]:
from cltk import NLP
from cltk.dependency.tree import DependencyTree
from cltk.languages.example_texts import get_example_text
from cltk.languages.pipelines import *

In [2]:
iso_to_pipeline = {
    "akk": AkkadianPipeline,
    "ang": OldEnglishPipeline,
    "arb": ArabicPipeline,
    "arc": AramaicPipeline,
    "chu": OCSPipeline,
    "cop": CopticPipeline,
    "enm": MiddleEnglishPipeline,
    "frm": MiddleFrenchPipeline,
    "fro": OldFrenchPipeline,
    "gmh": MiddleHighGermanPipeline,
    "got": GothicPipeline,
    "grc": GreekPipeline,
    "hin": HindiPipeline,
    "lat": LatinPipeline,
    "lzh": ChinesePipeline,
    "non": OldNorsePipeline,
    "pan": PanjabiPipeline,
    "pli": PaliPipeline,
    "san": SanskritPipeline,
}

In [3]:
for lang, pipeline in iso_to_pipeline.items():
    print(f"{pipeline.language.name} ('{pipeline.language.iso_639_3_code}') ...")
    text = get_example_text(lang)
    cltk_nlp = NLP(language=lang)
    cltk_doc = cltk_nlp.analyze(text=text)
    # Inspect the first word of the first sentence.
    first_sentence = cltk_doc.sentences[0]
    word = first_sentence[0]
    print("Example `Word`:", word)
    # Non-empty `features` on every word is used as a proxy for a pipeline
    # that also produced dependency information.
    if all(w.features for w in first_sentence):
        print("Printing dependency tree of first sentence ...")
        try:
            a_tree = DependencyTree.to_tree(first_sentence)
        except Exception:
            # Avoid a bare `except`, which would also swallow KeyboardInterrupt.
            print(f"Dependency parsing Process not available for '{lang}'.")
            print("")
            continue
        a_tree.print_tree()
    print("")

Akkadian ('akk') ...
‎𐤀 CLTK version '1.1.5'.
Pipeline for language 'Akkadian' (ISO: 'akk'): `AkkadianTokenizationProcess`, `StopsProcess`.
Example `Word`: Word(index_char_start=0, index_char_stop=2, index_token=0, index_sentence=None, string=('u2-wa-a-ru', 'akkadian'), pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=False, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)

Old English (ca. 450-1100) ('ang') ...
‎𐤀 CLTK version '1.1.5'.
Pipeline for language 'Old English (ca. 450-1100)' (ISO: 'ang'): `MultilingualTokenizationProcess`, `OldEnglishLemmatizationProcess`, `OldEnglishEmbeddingsProcess`, `StopsProcess`.
This part of the CLTK depends upon models from the CLTK project.
Do you want to download 'https://github.com/cltk/ang_models_cltk' to '~/cltk_data/ang'? [Y/n] 
Y
Example `Word`: Word(index_char_start=0, index_char_stop=5, index_token=0, index_sentence=N

100%|█████████████████████████████████████| 1.61G/1.61G [01:32<00:00, 17.4MiB/s]


Example `Word`: Word(index_char_start=0, index_char_stop=5, index_token=0, index_sentence=None, string='كهيعص', pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=False, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)

Official Aramaic (700-300 BCE) ('arc') ...
‎𐤀 CLTK version '1.1.5'.
Pipeline for language 'Official Aramaic (700-300 BCE)' (ISO: 'arc'): `ArabicTokenizationProcess`, `AramaicEmbeddingsProcess`.
CLTK message: This part of the CLTK depends upon word embedding models from the Fasttext project.
Do you want to download file 'https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.arc.vec' to '/Users/kylejohnson/cltk_data/arc/embeddings/fasttext/wiki.arc.vec'? [Y/n] 
Y


100%|█████████████████████████████████████| 8.66M/8.66M [00:00<00:00, 8.69MiB/s]


Example `Word`: Word(index_char_start=0, index_char_stop=1, index_token=0, index_sentence=None, string='ܒ', pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)

Church Slavic ('chu') ...
‎𐤀 CLTK version '1.1.5'.
Pipeline for language 'Church Slavic' (ISO: 'chu'): `OCSStanzaProcess`.
Unrecognized UD `feature_name` ('Variant') with `feature_value` ('Short').
Please raise an issue at <https://github.com/cltk/cltk/issues> and include a small sample to reproduce the error.
Unrecognized UD `feature_name` ('Variant') with `feature_value` ('Short').
Please raise an issue at <https://github.com/cltk/cltk/issues> and include a small sample to reproduce the error.
Example `Word`: Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='отьчє', pos=noun, lemma='отьчь', stem=None, scansion=N



This part of the CLTK depends upon models from the CLTK project.
Do you want to download 'https://github.com/cltk/fro_models_cltk' to '~/cltk_data/fro'? [Y/n] 
Y


INFO:CLTK:Cloning 'fro_models_cltk' from 'https://github.com/cltk/fro_models_cltk.git'


Example `Word`: Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='Une', pos=determiner, lemma='Une', stem=None, scansion=None, xpos='DETndf', upos='DET', dependency_relation='det', governor=1, features={Definiteness: [indefinite], PrononimalType: [article]}, category={F: [pos], N: [pos], V: [neg]}, stop=False, named_entity=False, syllables=None, phonetic_transcription=None, definition=None)

Middle High German ('gmh') ...
‎𐤀 CLTK version '1.1.5'.
Pipeline for language 'Middle High German' (ISO: 'gmh'): `MiddleHighGermanTokenizationProcess`, `StopsProcess`.
Example `Word`: Word(index_char_start=0, index_char_stop=3, index_token=0, index_sentence=None, string='Uns', pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=False, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)

Gothic ('got') ...
‎𐤀 CLTK version '1.1.5'.
Pipeline fo

100%|█████████████████████████████████████| 6.94M/6.94M [00:01<00:00, 6.21MiB/s]


Example `Word`: Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='swa', pos=adverb, lemma='swa', stem=None, scansion=None, xpos='Df', upos='ADV', dependency_relation='advmod', governor=1, features={}, category={F: [neg], N: [pos], V: [pos]}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)

Ancient Greek ('grc') ...
‎𐤀 CLTK version '1.1.5'.
Pipeline for language 'Ancient Greek' (ISO: 'grc'): `GreekNormalizeProcess`, `GreekStanzaProcess`, `GreekEmbeddingsProcess`, `StopsProcess`.
Example `Word`: Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='ὅτι', pos=adverb, lemma='ὅτι', stem=None, scansion=None, xpos='Df', upos='ADV', dependency_relation='advmod', governor=6, features={}, category={F: [neg], N: [pos], V: [pos]}, stop=False, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)

Hindi ('hin') ...
‎𐤀 CLTK version '1.1.5'.
Pipeline for

INFO:CLTK:Cloning 'cltk_non_zoega_dictionary' from 'https://github.com/cltk/cltk_non_zoega_dictionary.git'


Example `Word`: Word(index_char_start=0, index_char_stop=5, index_token=0, index_sentence=None, string='Gylfi', pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=False, named_entity=None, syllables=None, phonetic_transcription=None, definition='')

Eastern Panjabi ('pan') ...
‎𐤀 CLTK version '1.1.5'.
Pipeline for language 'Eastern Panjabi' (ISO: 'pan'): `MultilingualTokenizationProcess`, `StopsProcess`.
Example `Word`: Word(index_char_start=0, index_char_stop=3, index_token=0, index_sentence=None, string='ਆਿਦ', pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=False, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)

Pali ('pli') ...
‎𐤀 CLTK version '1.1.5'.
Pipeline for language 'Pali' (ISO: 'pli'): `MultilingualTokenizationProcess`, `PaliEmbeddingsProcess`.
CLTK message: T

100%|█████████████████████████████████████| 5.02M/5.02M [00:00<00:00, 5.63MiB/s]


Example `Word`: Word(index_char_start=0, index_char_stop=6, index_token=0, index_sentence=None, string='Raajaa', pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)

Sanskrit ('san') ...
‎𐤀 CLTK version '1.1.5'.
Pipeline for language 'Sanskrit' (ISO: 'san'): `MultilingualTokenizationProcess`, `SanskritEmbeddingsProcess`, `StopsProcess`.
CLTK message: This part of the CLTK depends upon word embedding models from the Fasttext project.
Do you want to download file 'https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.sa.vec' to '/Users/kylejohnson/cltk_data/san/embeddings/fasttext/wiki.sa.vec'? [Y/n] 
Y


100%|███████████████████████████████████████| 129M/129M [00:08<00:00, 15.3MiB/s]


Example `Word`: Word(index_char_start=0, index_char_stop=3, index_token=0, index_sentence=None, string='ईशा', pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=False, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)
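
The `Word` reprs above are long, and most attributes are `None` for languages whose pipeline only tokenizes. One way to see at a glance what a pipeline actually annotated is to list the non-empty fields. The sketch below uses a hypothetical `ToyWord` dataclass with a handful of the attributes shown above so that it runs without any models; `populated_fields` is likewise a made-up helper. It relies on `dataclasses.fields`, so it would apply to a real `Word` only on the assumption that `Word` is itself a dataclass.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List, Optional


@dataclass
class ToyWord:
    """Hypothetical stand-in with a few of the attributes printed above."""

    string: str
    index_token: Optional[int] = None
    lemma: Optional[str] = None
    upos: Optional[str] = None
    governor: Optional[int] = None
    features: Dict[str, Any] = field(default_factory=dict)
    stop: Optional[bool] = None


def populated_fields(word: Any) -> List[str]:
    """Names of dataclass fields that are not None and not an empty container/string."""
    names: List[str] = []
    for f in fields(word):
        value = getattr(word, f.name)
        # `0` and `False` are real values, so only None and empty
        # dicts, lists, and strings are skipped.
        if value is None or value == {} or value == [] or value == "":
            continue
        names.append(f.name)
    return names


# Values copied from the Middle High German and Gothic outputs above.
tokenized_only = ToyWord(string="Uns", index_token=0, stop=False)
parsed = ToyWord(string="swa", index_token=0, lemma="swa", upos="ADV", governor=1)

print(populated_fields(tokenized_only))
print(populated_fields(parsed))
```

Treating `False` and `0` as populated matters here: `stop=False` and `index_token=0` are genuine annotations, whereas `features={}` and `definition=''` carry no information.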

