
CountVectorizer can't remain stop words in Chinese #10756

Closed
RingWong opened this issue Mar 6, 2018 · 12 comments

RingWong commented Mar 6, 2018

Description

CountVectorizer does not keep all the words in Chinese text.
I want to keep every word in the sentence, but some words are always dropped.

Steps/Code to Reproduce

from sklearn.feature_extraction.text import CountVectorizer
import jieba

text = ['今天是个阴天。', '为什么会这样子...']
# Segment each sentence with jieba and rejoin with spaces so that
# CountVectorizer can split the tokens on whitespace.
text_list = []
for t in text:
    text_list.append(' '.join(jieba.cut(t, HMM=False)))
print(text_list)
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=None)
word_count = vectorizer.fit_transform(text_list)
word = vectorizer.get_feature_names()
word_count_array = word_count.toarray()
print(word)
print(word_count_array)

Expected Results

text_list:
['今天 是 个 阴天 。', '为什么 会 这 样子 . . .']
word:
['为什么', '今天', '样子', '阴天', '是', '个', '会', '这']

Actual Results

text_list:
['今天 是 个 阴天 。', '为什么 会 这 样子 . . .']
word:
['为什么', '今天', '样子', '阴天']
word_count_array:
[[0 1 0 1]
[1 0 1 0]]

Versions

platform: Windows-10-10.0.14393-SP0
sys: Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
numpy: 1.13.3
scipy: 0.19.0
Scikit-Learn: 0.18.1

jnothman (Member) commented Mar 6, 2018 via email

jnothman (Member) commented Mar 6, 2018 via email

RingWong (Author) commented Mar 6, 2018

I modified the code as follows, but it did not work.

vectorizer = CountVectorizer(ngram_range=(1, 1), strip_accents=None)

RingWong (Author) commented Mar 6, 2018

I found that the missing words always have a length of 1.

jnothman (Member) commented Mar 6, 2018

Sorry, I misread. Is the problem that it seems to be cutting the document at the first 。?

RingWong (Author) commented Mar 6, 2018

Yes, I cut the document first.

RingWong (Author) commented Mar 6, 2018

I worked it out by modifying the token_pattern:

vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b', stop_words=None)

@jnothman Thank you very much!
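For anyone else hitting this, a minimal sketch contrasting the default tokenization with the modified pattern (assuming jieba-segmented input like above; jieba itself is not needed once the text is pre-segmented):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Pre-segmented text, as produced by jieba.cut in the reproduction above.
text_list = ['今天 是 个 阴天 。', '为什么 会 这 样子 . . .']

# The default token_pattern, r'(?u)\b\w\w+\b', matches only tokens of
# 2+ word characters, so single-character words (是, 个, 会, 这) are
# silently dropped.
default_vec = CountVectorizer()
print(sorted(default_vec.fit(text_list).vocabulary_))
# ['为什么', '今天', '样子', '阴天']

# r'(?u)\b\w+\b' matches tokens of 1+ word characters, keeping them all.
fixed_vec = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
print(sorted(fixed_vec.fit(text_list).vocabulary_))
# ['个', '为什么', '今天', '会', '是', '样子', '这', '阴天']
```

Punctuation (。 and .) is still discarded in both cases, because `\w` does not match punctuation characters.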

RingWong closed this as completed Mar 6, 2018
jnothman (Member) commented Mar 6, 2018

Ah, now I understand what your problem was! Sigh. The defaults are a little bit Europe-centric.

qinhanmin2014 (Member) commented
Surprising to find that scikit-learn selects tokens of 2 or more alphanumeric characters by default. This seems unfriendly from a user's perspective (the relevant information is hidden in the parameter descriptions). I think we should consider putting this information in a more prominent place, since it has caused confusion.
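The default behavior described above can be inspected directly on the estimator; a quick sketch using the public `token_pattern` attribute and `build_tokenizer` helper:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
# The default regex requires two or more word characters per token.
print(vec.token_pattern)
# (?u)\b\w\w+\b

# build_tokenizer() returns the callable actually used to split documents.
tokenize = vec.build_tokenizer()
print(tokenize('今天 是 个 阴天'))
# ['今天', '阴天']  -- the single-character tokens 是 and 个 are dropped
```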

jnothman (Member) commented Mar 6, 2018 via email

secsilm commented Aug 9, 2019

I have the same problem. Is there any progress on this? @jnothman @qinhanmin2014

jnothman (Member) commented
Not sure what progress you're looking for, @secsilm. If you think the documentation is inadequate, please submit improvements as pull requests.

4 participants