CountVectorizer does not retain stop words in Chinese #10756
Description

CountVectorizer does not retain stop words in Chinese. I want to retain all the words in the sentence, but some words are always dropped.
Steps/Code to Reproduce

from sklearn.feature_extraction.text import CountVectorizer
import jieba

# Segment each sentence into space-separated tokens with jieba so that
# CountVectorizer can split them on whitespace.
text = ['今天是个阴天。', '为什么会这样子...']
text_list = []
for t in text:
    text_list.append(' '.join(jieba.cut(t, HMM=False)))
print(text_list)

# stop_words=None, so no tokens should be filtered out.
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=None)
word_count = vectorizer.fit_transform(text_list)
word = vectorizer.get_feature_names()
word_count_array = word_count.toarray()
print(word)
print(word_count_array)
Expected Results

text_list:
['今天 是 个 阴天 。', '为什么 会 这 样子 . . .']
word:
['为什么', '今天', '样子', '阴天', '是', '个', '会', '这']
Actual Results
text_list:
['今天 是 个 阴天 。', '为什么 会 这 样子 . . .']
word:
['为什么', '今天', '样子', '阴天']
word_count_array:
[[0 1 0 1]
 [1 0 1 0]]
Versions
platform: Windows-10-10.0.14393-SP0
sys: Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
numpy: 1.13.3
scipy: 0.19.0
Scikit-Learn: 0.18.1
Comments

Use strip_accents=False. I am updating the documentation.
or strip_accents=None
I modified the code as suggested, but it does not work.
I found that the length of each missing word is always equal to 1.
Sorry, I misread. Is the problem that it seems to be cutting the document at the first 。?
Yes, I cut the document first.
I worked it out by modifying the token_pattern. @jnothman Thank you very much!
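For reference, here is a minimal sketch of that kind of token_pattern change (the exact pattern below is an assumption on my part, not necessarily what was used): the default pattern requires two or more word characters per token, so allowing single-character tokens keeps words like 是 and 个.

from sklearn.feature_extraction.text import CountVectorizer

# Default token_pattern r'(?u)\b\w\w+\b' keeps only tokens of two or more
# word characters; r'(?u)\b\w+\b' also keeps single-character tokens.
# (Assumed fix, sketching the change described above.)
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=None,
                             token_pattern=r'(?u)\b\w+\b')
word_count = vectorizer.fit_transform(['今天 是 个 阴天 。', '为什么 会 这 样子 . . .'])
print(vectorizer.get_feature_names())
# Now includes the single-character tokens 是, 个, 会 and 这.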
Ah, now I understand what your problem was! Sigh. The defaults are a little bit Europe-centric.
It is surprising that scikit-learn only selects tokens of 2 or more alphanumeric characters by default. That seems unfriendly to users like me, since the relevant information is hidden in the parameter descriptions. I think we should consider putting it somewhere more prominent, as it has caused confusion.
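To illustrate the default behaviour being discussed, here is a small standalone check (mine, not from the thread) of what the default token_pattern actually matches:

import re

# CountVectorizer's default token_pattern requires at least two word
# characters per token, so single-character tokens are silently dropped.
default_pattern = r'(?u)\b\w\w+\b'
print(re.findall(default_pattern, '今天 是 个 阴天 。'))
# ['今天', '阴天'] -- the single-character tokens 是 and 个 are gone.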
Yeah, it's one of those estimators where basically every parameter is highly dependent on language, if not task. PRs improving the documentation are very welcome.
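As a concrete illustration of that language dependence, one possible approach (my own sketch, not something proposed in the thread) is to bypass the regex tokenizer entirely and pass a Chinese tokenizer to CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
import jieba

# Hand CountVectorizer a language-appropriate tokenizer instead of the
# default regex; jieba.lcut returns a list of tokens for a raw sentence.
# token_pattern=None silences the unused-token_pattern warning that newer
# scikit-learn versions emit when a custom tokenizer is supplied.
vectorizer = CountVectorizer(tokenizer=lambda doc: jieba.lcut(doc, HMM=False),
                             token_pattern=None)
word_count = vectorizer.fit_transform(['今天是个阴天。', '为什么会这样子...'])

This avoids pre-segmenting the documents with ' '.join(...), though unlike the regex route it also keeps punctuation tokens such as 。 unless they are filtered separately.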
I have the same problem. Is there any progress on this? @jnothman @qinhanmin2014
Not sure what progress you're looking for, @secsilm. If you think the documentation is inadequate, please submit improvements as pull requests.