
CountVectorizer can't remain stop words in Chinese #10756

Closed
RingWong opened this issue Mar 6, 2018 · 12 comments

RingWong commented Mar 6, 2018

Description

CountVectorizer does not keep all the words in Chinese text.
I want to keep every word in the sentence, but some words are always dropped.

Steps/Code to Reproduce

from sklearn.feature_extraction.text import CountVectorizer
import jieba

text = ['今天是个阴天。', '为什么会这样子...']
# Segment each sentence with jieba and rejoin with spaces so that
# CountVectorizer can split the tokens on whitespace.
text_list = []
for t in text:
    text_list.append(' '.join(jieba.cut(t, HMM=False)))
print(text_list)
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=None)
word_count = vectorizer.fit_transform(text_list)
word = vectorizer.get_feature_names()
word_count_array = word_count.toarray()
print(word)
print(word_count_array)

Expected Results

text_list:
['今天 是 个 阴天 。', '为什么 会 这 样子 . . .']
word:
['为什么', '今天', '样子', '阴天', '是', '个', '会', '这']

Actual Results

text_list:
['今天 是 个 阴天 。', '为什么 会 这 样子 . . .']
word:
['为什么', '今天', '样子', '阴天']
word_count_array:
[[0 1 0 1]
[1 0 1 0]]

Versions

platform: Windows-10-10.0.14393-SP0
sys: Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
numpy: 1.13.3
scipy: 0.19.0
Scikit-Learn: 0.18.1

jnothman (Member) commented Mar 6, 2018 via email

jnothman (Member) commented Mar 6, 2018 via email

RingWong (Author) commented Mar 6, 2018

I modified the code as follows, but it did not work.

vectorizer = CountVectorizer(ngram_range=(1, 1), strip_accents=None)

RingWong (Author) commented Mar 6, 2018

I found that the missing words always have a length of 1.

jnothman (Member) commented Mar 6, 2018

Sorry, I misread. Is the problem that it seems to be cutting the document at the first 。?

RingWong (Author) commented Mar 6, 2018

Yes, I cut the document first.

RingWong (Author) commented Mar 6, 2018

I worked it out by modifying the token_pattern:

vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b', stop_words=None)

@jnothman Thank you very much!
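For anyone else hitting this, a minimal sketch contrasting the default tokenization with the modified pattern (assuming jieba-segmented input like above; jieba itself is not needed once the text is pre-segmented):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Pre-segmented text, as produced by jieba.cut in the reproduction above.
text_list = ['今天 是 个 阴天 。', '为什么 会 这 样子 . . .']

# The default token_pattern, r'(?u)\b\w\w+\b', matches only tokens of
# 2+ word characters, so single-character words (是, 个, 会, 这) are
# silently dropped.
default_vec = CountVectorizer()
print(sorted(default_vec.fit(text_list).vocabulary_))
# ['为什么', '今天', '样子', '阴天']

# r'(?u)\b\w+\b' matches tokens of 1+ word characters, keeping them all.
fixed_vec = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
print(sorted(fixed_vec.fit(text_list).vocabulary_))
# ['个', '为什么', '今天', '会', '是', '样子', '这', '阴天']
```

Punctuation (。 and .) is still discarded in both cases, because `\w` does not match punctuation characters.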

RingWong closed this as completed Mar 6, 2018
jnothman (Member) commented Mar 6, 2018

Ah, now I understand what your problem was! Sigh. The defaults are a little bit Europe-centric.

qinhanmin2014 (Member) commented
Surprising to find that scikit-learn selects tokens of 2 or more alphanumeric characters by default. This seems unfriendly from a user's perspective (the relevant information is hidden in the parameter descriptions). I think we should consider putting this information in a more prominent place, since it has caused confusion.
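The default behavior described above can be inspected directly on the estimator; a quick sketch using the public `token_pattern` attribute and `build_tokenizer` helper:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
# The default regex requires two or more word characters per token.
print(vec.token_pattern)
# (?u)\b\w\w+\b

# build_tokenizer() returns the callable actually used to split documents.
tokenize = vec.build_tokenizer()
print(tokenize('今天 是 个 阴天'))
# ['今天', '阴天']  -- the single-character tokens 是 and 个 are dropped
```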

jnothman (Member) commented Mar 6, 2018 via email

secsilm commented Aug 9, 2019

I have the same problem. Is there any progress on this? @jnothman @qinhanmin2014

jnothman (Member) commented
Not sure what progress you're looking for, @secsilm. If you think the documentation is inadequate, please submit improvements as pull requests.

4 participants