[MRG] Optional whitespace normalisation for CountVectorizer(analyzer='char') #3092

Open
robertlayton wants to merge 5 commits into scikit-learn:master from robertlayton:optwhitespace

6 participants

@robertlayton

Getting character n-grams from CountVectorizer (or its variants) currently normalises whitespace automatically, and there is no option to turn this off. This PR adds that option.

Pretty basic PR here: it adds a normalize_whitespace option to CountVectorizer (and variants), True by default, which matches the current behaviour. When set to False, the normalisation that happens in CountVectorizer._char_ngrams no longer occurs.
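
For context, here is a minimal sketch of the current behaviour (the input text and comments are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# Build just the analyzer so the effect on raw text is easy to see.
analyze = CountVectorizer(analyzer='char', ngram_range=(2, 2)).build_analyzer()

# Runs of whitespace are collapsed to a single space before the bigrams
# are taken, so the double-space bigram never appears in the output.
print("  " in analyze("double  space"))  # False with the current behaviour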

Current todo:

  • Test case
  • Narrative documentation updates if applicable
@robertlayton robertlayton changed the title from [WIP] Optional whitespace cleansing for CountVectorizer(analyzer='char') to [WIP] Optional whitespace normalisation for CountVectorizer(analyzer='char')
@coveralls

Coverage Status

Changes Unknown when pulling 58bfdc1 on robertlayton:optwhitespace into scikit-learn:master.

@coveralls

Coverage Status

Changes Unknown when pulling f6c2d4f on robertlayton:optwhitespace into scikit-learn:master.

@robertlayton robertlayton changed the title from [WIP] Optional whitespace normalisation for CountVectorizer(analyzer='char') to [MRG] Optional whitespace normalisation for CountVectorizer(analyzer='char')
@jnothman
Owner

What do you think of making this a different setting for analyzer? I think it's a good idea to avoid new parameters to text.*Vectorizer if they do not apply to most cases. Already, I think this is a somewhat niche feature, but if it's only a variant analyzer, then placing 'char+whitespace' after 'char' in the list of analyzers is, I think, even more effective documentation of the default implementation than your patch.
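
A hypothetical sketch of how the two would compare at the call site ('char+whitespace' is only the value being proposed here, it does not exist):

from sklearn.feature_extraction.text import CountVectorizer

# Current, normalising behaviour.
vec = CountVectorizer(analyzer='char', ngram_range=(2, 2))
# Proposed variant that would leave whitespace untouched (hypothetical value,
# shown commented out because it is not implemented):
# vec = CountVectorizer(analyzer='char+whitespace', ngram_range=(2, 2))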

@robertlayton

It would work, but I don't think making the analyzer parameter more complex would really make things simpler. Having a separate parameter makes it explicit.

My preference would be to remove the whitespace normalisation altogether -- nothing is gained by having it in there (unlike the tf-idf calculation, which is faster when computed all at once). It could be a separate preprocessing step, which would make the default behaviour of the class less "surprising" (it took me a long time to work out why my code in scikit-learn wasn't matching the output of my other code for character n-grams -- I assumed my code was to blame!).
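
A minimal sketch of what that separate step could look like (plain Python, names illustrative):

import re

_ws = re.compile(r"\s\s+")

def normalize_whitespace(doc):
    # the same collapsing that _char_ngrams currently does implicitly
    return _ws.sub(" ", doc)

corpus = ["This  text\thas   uneven whitespace."]
corpus = [normalize_whitespace(doc) for doc in corpus]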

@jnothman
Owner

I don't see how having another parameter is more explicit than another analyzer. And in cases like this, I'm not sure it's easy to change the current default behaviour.

@robertlayton

No, I agree with you that we can't change the default behaviour.

I think either option (analyzer/parameter) would work. A parameter makes more sense to me, but we should get someone to act as a tie-breaker?

@larsmans
Owner

I don't like yet another option for text.*Vectorizer; there are too many of those already.

@robertlayton

@larsmans Does @jnothman 's suggestion work for you?

Generally I think, as a long-term thing, that the Vectorizers are trying to do too much. Given that we have pipelines, I think we should probably simplify the base classes and put some effort into creating good examples of setting up pipelines. That's an issue for another time/place though.

@larsmans
Owner

We also have DictVectorizer and FeatureHasher. Those take arbitrary data and their feature extraction logic can be (must be!) specified by the user:

from sklearn.feature_extraction import DictVectorizer

# some_function maps each raw sample to a dict of feature name -> value
dv = DictVectorizer()
X_train = dv.fit_transform(some_function(x) for x in iterable_over_training_data)
X_test = dv.transform(some_function(x) for x in iterable_over_test_data)

I don't see why we don't push this as an alternative for more complex feature extraction. It could fit your problem as well.
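
For instance, character bigrams with whitespace left exactly as-is might look roughly like this (function and variable names are illustrative):

from collections import Counter
from sklearn.feature_extraction import DictVectorizer

def char_bigrams(text):
    # raw character bigrams, no whitespace normalisation of any kind
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

docs_train = ["double  spaces  kept", "single spaces here"]
dv = DictVectorizer()
X_train = dv.fit_transform(char_bigrams(d) for d in docs_train)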

@robertlayton

That's a good option too. My main reasoning behind the PR is that the class does an undocumented "surprising" thing by default, and there is no way to turn that off.

@larsmans
Owner

Shouldn't we just document it? I know nobody reads the docs, but giving support is easier if we can read it to people :)

@robertlayton

Yeah, that would work. Don't you think it's weird though that preprocessing is done in a transformer, and that can't be turned off?

@jnothman
Owner
@larsmans
Owner

Also because not every problem is a bag of words problem :)

@robertlayton

OK, so is everyone happy with this alternate proposal:

  • Properly document that the 'char' analyzer does whitespace normalisation
  • Note in there that if this is not what you want, see DictVectorizer
  • Create an example using DictVectorizer.
@larsmans
Owner

I am, but I would like to hear @ogrisel's opinion too.

@kastnerkyle
Owner

I like this idea - the text preprocessing stuff has a lot of options, and there is no way we can handle every use case. Clear docs and a nice example showing the alternative is better IMO.

@ogrisel
Owner

+1 for documenting it rather than introducing a new option.

14 doc/modules/feature_extraction.rst
@@ -294,9 +294,9 @@ reasonable (please see the :ref:`reference documentation
charset_error=None, decode_error=...'strict',
dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
- ngram_range=(1, 1), preprocessor=None, stop_words=None,
- strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
- tokenizer=None, vocabulary=None)
+ ngram_range=(1, 1), normalize_whitespace=True, preprocessor=None,
+ stop_words=None, strip_accents=None,
+ token_pattern=...'(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
Let's use it to tokenize and count the word occurrences of a minimalistic
corpus of text documents::
@@ -653,6 +653,14 @@ the local structure of sentences and paragraphs should thus be taken
into account. Many such models will thus be casted as "Structured output"
problems which are currently outside of the scope of scikit-learn.
+By default, whitespace is normalized for character n-gram extraction: any
+sequence of whitespace characters is replaced with a single space. If
+this is not what you want, setting ``normalize_whitespace=False`` when
+creating the ``CountVectorizer`` instance stops this from happening. This is
+particularly useful in contexts such as authorship analysis, where
+differences in whitespace usage can themselves be an informative feature.
+
.. _hashing_vectorizer:
11 sklearn/feature_extraction/tests/test_text.py
@@ -210,6 +210,17 @@ def test_char_ngram_analyzer():
text = StringIO("This is a test with a file-like object!")
expected = ['thi', 'his', 'is ', 's i', ' is']
assert_equal(cnga(text)[:5], expected)
+ # Whitespace should be normalised by default
+ cnga = CountVectorizer(analyzer='char',
+ ngram_range=(2, 2)).build_analyzer()
+ text = "This text contains  double spaces."
+ assert_false("  " in cnga(text))
+ # Whitespace normalization turned off
+ cnga = CountVectorizer(analyzer='char', ngram_range=(2, 2),
+ normalize_whitespace=False).build_analyzer()
+ text = "This text contains  double spaces."
+ assert_true("  " in cnga(text))
+
def test_char_wb_ngram_analyzer():
20 sklearn/feature_extraction/text.py
@@ -94,6 +94,7 @@ class VectorizerMixin(object):
"""Provides common code for text vectorizers (tokenization logic)."""
_white_spaces = re.compile(r"\s\s+")
+ normalize_whitespace = True
def decode(self, doc):
"""Decode the input into a string of unicode symbols
@@ -133,7 +134,8 @@ def _word_ngrams(self, tokens, stop_words=None):
def _char_ngrams(self, text_document):
"""Tokenize text_document into a sequence of character n-grams"""
# normalize white spaces
- text_document = self._white_spaces.sub(" ", text_document)
+ if self.normalize_whitespace:
+ text_document = self._white_spaces.sub(" ", text_document)
text_len = len(text_document)
ngrams = []
@@ -580,6 +582,10 @@ class CountVectorizer(BaseEstimator, VectorizerMixin):
dtype : type, optional
Type of the matrix returned by fit_transform() or transform().
+
+ normalize_whitespace : boolean, True by default
+ If True, whitespace in the document is normalized. If False, no such
+ normalization occurs.
Attributes
----------
@@ -604,7 +610,8 @@ def __init__(self, input='content', encoding='utf-8', charset=None,
stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
ngram_range=(1, 1), analyzer='word',
max_df=1.0, min_df=1, max_features=None,
- vocabulary=None, binary=False, dtype=np.int64):
+ vocabulary=None, binary=False, dtype=np.int64,
+ normalize_whitespace=True):
self.input = input
self.encoding = encoding
self.decode_error = decode_error
@@ -665,6 +672,7 @@ def __init__(self, input='content', encoding='utf-8', charset=None,
self.fixed_vocabulary = False
self.binary = binary
self.dtype = dtype
+ self.normalize_whitespace = normalize_whitespace
def _sort_features(self, X, vocabulary):
"""Sort features by name
@@ -1144,6 +1152,10 @@ class TfidfVectorizer(CountVectorizer):
sublinear_tf : boolean, optional
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
+
+ normalize_whitespace : boolean, True by default
+ If True, whitespace in the document is normalized. If False, no such
+ normalization occurs.
See also
--------
@@ -1165,7 +1177,7 @@ def __init__(self, input='content', encoding='utf-8', charset=None,
ngram_range=(1, 1), max_df=1.0, min_df=1,
max_features=None, vocabulary=None, binary=False,
dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
- sublinear_tf=False):
+ sublinear_tf=False, normalize_whitespace=True):
super(TfidfVectorizer, self).__init__(
input=input, charset=charset, charset_error=charset_error,
@@ -1175,7 +1187,7 @@ def __init__(self, input='content', encoding='utf-8', charset=None,
stop_words=stop_words, token_pattern=token_pattern,
ngram_range=ngram_range, max_df=max_df, min_df=min_df,
max_features=max_features, vocabulary=vocabulary, binary=binary,
- dtype=dtype)
+ dtype=dtype, normalize_whitespace=normalize_whitespace)
self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
smooth_idf=smooth_idf,