Lemmatization functionality for Georgian language.
screeve/lemmatizer
Lemmatizer

Lemmatization functionality for the Georgian language. screeve-lemmatizer is a small encoder-decoder (sequence-to-sequence) model trained on the Georgian National Corpus for lemmatizing Georgian sentences. Two variants are provided: a basic lemmatizer, and a lemmatizer that takes a POS tag as additional input.

Use the basic lemmatizer as follows:

from transformers import BartForConditionalGeneration, AutoTokenizer, pipeline

# Load the basic lemmatizer model and its tokenizer from the Hugging Face Hub.
model = BartForConditionalGeneration.from_pretrained("Nargizi/screeve-lemmatizer")
tokenizer = AutoTokenizer.from_pretrained("Nargizi/screeve-lemmatizer")

# Wrap both in a text2text-generation pipeline.
lemmatizer = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

ka_word = "დამიწერია"
result = lemmatizer(ka_word)  # a list with one dict holding "generated_text"
print(result)
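The README example passes a single word to the pipeline. As a minimal sketch, assuming the model is meant to be applied one word at a time (the README does not say how longer input is handled), the helper below lemmatizes a list of words and unwraps the `generated_text` field. The names `lemmatize_words` and `load_lemmatizer` are hypothetical and not part of this repository.

```python
from typing import Callable, Dict, List


def lemmatize_words(
    words: List[str],
    generate: Callable[[str], List[Dict[str, str]]],
) -> List[str]:
    """Lemmatize each word separately and return the plain lemma strings.

    `generate` is any callable that behaves like a text2text-generation
    pipeline called on one string: it returns [{"generated_text": "..."}].
    """
    lemmas = []
    for word in words:
        output = generate(word)
        lemmas.append(output[0]["generated_text"].strip())
    return lemmas


def load_lemmatizer():
    """Build the pipeline from the README (hypothetical convenience wrapper)."""
    # Imported lazily so lemmatize_words can be used without transformers.
    from transformers import BartForConditionalGeneration, AutoTokenizer, pipeline

    name = "Nargizi/screeve-lemmatizer"
    model = BartForConditionalGeneration.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    return pipeline("text2text-generation", model=model, tokenizer=tokenizer)


# Usage (downloads the model on first call):
#   lemmatizer = load_lemmatizer()
#   print(lemmatize_words(["დამიწერია"], lemmatizer))
```

Passing the pipeline in as an argument keeps the helper independent of how the model was loaded, so the same function works with a pipeline that is already in memory.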

To use the lemmatizer that takes a POS tag as additional input, pass the tag as the second element of a text pair:

from transformers import BartForConditionalGeneration, AutoTokenizer

# Load the POS-conditioned lemmatizer model and its tokenizer.
model = BartForConditionalGeneration.from_pretrained("Nargizi/screeve-pos-lemmatizer")
tokenizer = AutoTokenizer.from_pretrained("Nargizi/screeve-pos-lemmatizer")

ka_word = "დამიწერია"
pos_tag = "V"  # one of the abbreviations from the label table

# Encode the word and its POS tag together as a text pair.
encoded_ka_word = tokenizer(ka_word, text_pair=pos_tag, return_tensors="pt")
generated_tokens = model.generate(**encoded_ka_word)
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

print(result)

Label descriptions:

| Abbreviation | Description |
| --- | --- |
| V | Verb |
| N | Noun |
| Pron | Pronoun |
| Interj | Interjection |
| A | Person’s name |
| Adv | Adverb |
| Cj | Conjunction |
| Punct | Punctuation |
| Num | Number |
| Pp | Preposition |
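The POS-conditioned call shown earlier handles one word and one tag. The sketch below extends it to several (word, tag) pairs and rejects tags that are not in the label table. It assumes the model accepts padded batches, which the README does not confirm. `lemmatize_with_pos` and `POS_TAGS` are hypothetical names. `tokenizer` and `model` are expected to be the objects loaded from `Nargizi/screeve-pos-lemmatizer`.

```python
from typing import Iterable, List, Tuple

# Abbreviations copied from the label table in this README.
POS_TAGS = {"V", "N", "Pron", "Interj", "A", "Adv", "Cj", "Punct", "Num", "Pp"}


def lemmatize_with_pos(
    pairs: Iterable[Tuple[str, str]],
    tokenizer,
    model,
) -> List[str]:
    """Lemmatize (word, pos_tag) pairs in one batch.

    Raises ValueError if a tag is not one of the documented abbreviations.
    """
    words, tags = [], []
    for word, tag in pairs:
        if tag not in POS_TAGS:
            raise ValueError(f"unknown POS tag: {tag!r}")
        words.append(word)
        tags.append(tag)
    if not words:
        return []

    # Same calls as the single-word example, with lists and padding
    # (batching is an assumption, not something the README demonstrates).
    encoded = tokenizer(words, text_pair=tags, padding=True, return_tensors="pt")
    generated = model.generate(**encoded)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


# Usage, with tokenizer and model loaded as in the example above:
#   print(lemmatize_with_pos([("დამიწერია", "V")], tokenizer, model))
```

Checking the tag before encoding surfaces typos such as `"Verb"` as an error instead of letting them pass silently to the model as unexpected input.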
