RFC: Use stanza model for Finnish #255

Draft
wants to merge 1 commit into
base: dev
8 changes: 6 additions & 2 deletions docker/PythonDockerfileDev
@@ -8,6 +8,8 @@ RUN apt-get update -y \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

+ RUN pip install torch --index-url https://download.pytorch.org/whl/cpu
Contributor Author

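A comment is needed here explaining how this line relates to stanza and the NVIDIA drivers: stanza depends on torch, and installing torch from the CPU-only index first avoids pulling in the NVIDIA (CUDA) packages.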


RUN pip install -U --no-cache-dir \
setuptools \
wheel \
@@ -22,6 +24,8 @@ RUN pip install -U --no-cache-dir \
bottle \
#spacy
spacy \
+ #stanza integration for spacy
+ stanza \
Contributor

spacy_stanza should be installed as well

Contributor Author

I propose removing it, since I ran into at least two issues with the spacy_stanza library:

  1. "Multi-word token expansion issue, misaligned tokens --> failed NER (German)" (explosion/spacy-stanza#70). As I understand it, this hurts quality because tokenization is imprecise. It also generates verbose output that I could not suppress.
  2. The stanza version is pinned, and not to the latest release.

Also, doing lemmatization with stanza directly is straightforward, see #255 (comment).

The code in this PR needs a few changes to work. It is currently broken: I only wanted to check the image size and did not care about having a usable LinguaCafe at this stage.
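
For reference, a minimal sketch of what lemmatizing with stanza directly (without spacy_stanza) could look like. This is only an illustration, not the code from the linked comment: the helper names `load_finnish_pipeline` and `lemmatize` are hypothetical, and it assumes the Finnish model was already downloaded during the image build with the same processors as the `stanza.download` call in this PR.

```python
from functools import lru_cache
from typing import List, Tuple


@lru_cache(maxsize=1)
def load_finnish_pipeline():
    """Build the Finnish stanza pipeline once and reuse it (hypothetical helper)."""
    # Imported lazily so that lemmatize() below does not require stanza at import time.
    import stanza

    return stanza.Pipeline(
        lang="fi",
        processors="tokenize,mwt,lemma",  # assumption: same processors as the download step
        use_gpu=False,                    # the image installs the CPU-only torch build
        download_method=None,             # assumption: models were fetched at build time
    )


def lemmatize(nlp, text: str) -> List[Tuple[str, str]]:
    """Return (surface form, lemma) pairs for every word stanza finds in text."""
    doc = nlp(text)
    return [
        (word.text, word.lemma)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

Something like `lemmatize(load_finnish_pipeline(), words)` would then take the place of the spaCy `Doc` for Finnish. How the result is mapped onto the rest of `getTokenizerDoc` (sentence splitting, token attributes) is left open here.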

#chinese reading
pinyin \
#subtitle file parser
@@ -33,7 +37,6 @@ RUN python3 -m spacy download de_core_news_sm \
&& python3 -m spacy download nb_core_news_sm \
&& python3 -m spacy download es_core_news_sm \
&& python3 -m spacy download nl_core_news_sm \
- && python3 -m spacy download fi_core_news_sm \
&& python3 -m spacy download fr_core_news_sm \
&& python3 -m spacy download it_core_news_sm \
&& python3 -m spacy download sv_core_news_sm \
@@ -48,5 +51,6 @@ RUN python3 -m spacy download de_core_news_sm \
&& python3 -m spacy download pt_core_news_sm \
&& python3 -m spacy download ro_core_news_sm \
&& python3 -m spacy download sl_core_news_sm \
- && python3 -m spacy download xx_ent_wiki_sm
+ && python3 -m spacy download xx_ent_wiki_sm \
+ && python3 -c 'import stanza; stanza.download("fi", processors="tokenize,mwt,lemma")'

3 changes: 2 additions & 1 deletion tools/tokenizer.py
@@ -22,6 +22,7 @@
import shutil
import subprocess
from newspaper import Article
+ import spacy_stanza

# create emtpy sapce models
multi_nlp = None
@@ -122,7 +123,7 @@ def getTokenizerDoc(language, words):
if language == 'finnish':
global finnish_nlp
if finnish_nlp == None:
- finnish_nlp = spacy.load("fi_core_news_sm", disable = ['ner', 'parser'])
+ finnish_nlp = spacy_stanza.load_pipeline("fi", processors="tokenize,lemma")
finnish_nlp.add_pipe("custom_sentence_splitter", first=True)
doc = finnish_nlp(words)
