skift


scikit-learn wrappers for Python fastText.

>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]

1   Installation

Dependencies:

  • numpy
  • scipy
  • scikit-learn
  • fastText Python package

pip install skift

NOTE: Installing skift will not install fasttext itself, as the official Python bindings are not currently maintained on PyPI.

To install the version of fasttext (and its official Python bindings) which skift is tested against, run:

pip install git+https://github.com/facebookresearch/fastText.git@ca8c5face7d5f3a64fff0e4dfaf58d60a691cb7c

Additionally, the official Python bindings do not forward the pretrainedVectors argument from the Python interface to the underlying library. A one-line change restores this forwarding. If you would rather not clone the entire repository to make that change, you can install my fork of the fastText repository, which contains this fix and nothing else and is kept up to date. To install the fork, run:

pip install git+https://github.com/shaypal5/fastText.git@fdbc22b18c44fd223da844f10afdfbaa3e956219
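
Because skift does not pull in fasttext on its own, it can help to confirm that everything is importable before using the wrappers. The following is a minimal sketch, not part of skift: the helper name missing_modules is hypothetical, and the module names passed to it are assumptions (in particular, the bindings may be importable as either fastText or fasttext depending on the installed version).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names below are assumptions; adjust them to your environment.
required = ["numpy", "scipy", "sklearn"]
print("missing required modules:", missing_modules(required))

# The bindings may be importable under either capitalization.
if len(missing_modules(["fastText", "fasttext"])) == 2:
    print("fastText bindings not found; see the pip commands above")
```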

2   Features

  • Adheres to the scikit-learn classifier API, including predict_proba.
  • Also caters to the common use case of pandas.DataFrame inputs.
  • Enables easy stacking of fastText with other types of scikit-learn-compliant classifiers.
  • Pickle-able classifier objects.
  • Built around the official fasttext Python bindings.
  • Pure Python.
  • Supports Python 3.5+.
  • Fully tested.
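
To illustrate what supporting predict_proba involves: the official bindings' predict method returns labels (prefixed with `__label__` by default) together with their probabilities, ordered by probability and not by class. A minimal sketch of rearranging such output into a scikit-learn-style probability row follows. The helper proba_row is hypothetical, and this is not necessarily how skift implements it.

```python
def proba_row(labels, probs, classes, prefix="__label__"):
    """Arrange fastText-style (labels, probs) output in a fixed class order.

    Classes that do not appear in the output get probability 0.0.
    """
    by_class = {label[len(prefix):]: float(p) for label, p in zip(labels, probs)}
    return [by_class.get(str(c), 0.0) for c in classes]


# Shaped like the output of model.predict(text, k=2); the values are made up.
labels = ("__label__1", "__label__0")
probs = (0.75, 0.25)
row = proba_row(labels, probs, classes=[0, 1])
print(row)  # [0.25, 0.75]
```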

3   Wrappers

fastText works only on text data, which means that it will only use a single column from a dataset which might contain many feature columns of different types. As such, a common use case is to have the fastText classifier use a single column as input, ignoring other columns. This is especially true when fastText is to be used as one of several classifiers in a stacking classifier, with other classifiers using non-textual features.

skift includes several scikit-learn-compatible wrappers (for the official fastText Python bindings) which cater to these use cases.
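
As a rough illustration of the single-column idea: fastText's supervised mode trains from a text file in which each line holds a label (prefixed with `__label__` by default) followed by the text. A wrapper therefore has to pick one column and serialize it together with the labels. The sketch below shows that step under these assumptions; to_fasttext_lines and its text_ix parameter are hypothetical names, not skift's API.

```python
def to_fasttext_lines(rows, labels, text_ix=0, prefix="__label__"):
    """Build fastText supervised-format lines from one column of 2D input."""
    lines = []
    for row, label in zip(rows, labels):
        # A newline inside the text would split one sample across several lines.
        text = str(row[text_ix]).replace("\n", " ")
        lines.append(f"{prefix}{label} {text}")
    return lines


rows = [[5, "woof"], [83, "meow"]]
for line in to_fasttext_lines(rows, labels=[0, 1], text_ix=1):
    print(line)
# __label__0 woof
# __label__1 meow
```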

NOTICE: Any additional keyword arguments provided to the classifier constructor, besides those required, will be forwarded to the fastText.train_supervised method on every call to fit.
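
The forwarding behavior described in this notice can be pictured with the sketch below. It is a simplified stand-in, not skift's source: ForwardingClassifier and fake_train are hypothetical, with fake_train playing the role of the training function so that the example runs without fastText installed.

```python
class ForwardingClassifier:
    """Stores extra constructor kwargs and passes them on at every fit."""

    def __init__(self, train_fn, **kwargs):
        self.train_fn = train_fn
        self.kwargs = kwargs

    def fit(self, path):
        # Every call to fit forwards the same stored keyword arguments.
        self.model_ = self.train_fn(path, **self.kwargs)
        return self


def fake_train(path, **kwargs):
    """Stand-in for a training function; it only echoes what it received."""
    return {"input": path, "params": kwargs}


clf = ForwardingClassifier(fake_train, lr=0.3, epoch=10).fit("train.txt")
print(clf.model_["params"])  # {'lr': 0.3, 'epoch': 10}
```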

3.1   Standard wrappers

These wrappers make no assumptions about their input beyond those commonly made by scikit-learn classifiers, e.g. that the input is a 2d ndarray object.

  • FirstColFtClassifier - An sklearn classifier adapter for fasttext that takes the first column of input ndarray objects as input.
>>> import pandas
>>> from skift import FirstColFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict([['woof']])
[0]
  • IdxBasedFtClassifier - An sklearn classifier adapter for fasttext that takes input by column index. This is set on object construction by providing the input_ix parameter to the constructor.
>>> import pandas
>>> from skift import IdxBasedFtClassifier
>>> df = pandas.DataFrame([[5, 'woof', 0], [83, 'meow', 1]], columns=['count', 'txt', 'lbl'])
>>> sk_clf = IdxBasedFtClassifier(input_ix=1, lr=0.4, epoch=6)
>>> sk_clf.fit(df[['count', 'txt']], df['lbl'])
>>> sk_clf.predict([[5, 'woof']])  # the text must sit at index 1, as in training
[0]

3.2   pandas-dependent wrappers

These wrappers assume the X parameter given to fit, predict, and predict_proba methods is a pandas.DataFrame object:

  • FirstObjFtClassifier - An sklearn adapter for fasttext using the first column of dtype == object as input.
>>> import pandas
>>> from skift import FirstObjFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = FirstObjFtClassifier(lr=0.2)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]
  • ColLblBasedFtClassifier - An sklearn adapter for fasttext taking input by column label. This is set on object construction by providing the input_col_lbl parameter to the constructor.
>>> import pandas
>>> from skift import ColLblBasedFtClassifier
>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
>>> sk_clf = ColLblBasedFtClassifier(input_col_lbl='txt', epoch=8)
>>> sk_clf.fit(df[['txt']], df['lbl'])
>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))
[0]
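
The "first column of dtype == object" rule used in this subsection can be sketched with plain pandas. The helper first_object_column is hypothetical and only mirrors the described behavior. The text column is created with an explicit object dtype so that the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd


def first_object_column(df):
    """Return the label of the first column whose dtype is object."""
    for label, dtype in df.dtypes.items():
        if dtype == object:
            return label
    raise ValueError("no column of dtype object found")


df = pd.DataFrame({
    "count": [5, 83],
    "txt": pd.Series(["woof", "meow"], dtype=object),
})
label = first_object_column(df)
print(label)               # txt
print(df[label].tolist())  # ['woof', 'meow']
```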

4   Contributing

The package author and current maintainer is Shay Palachy (shay.palachy@gmail.com). You are more than welcome to approach him for help. Contributions are very welcome.

4.1   Installing for development

Clone:

git clone git@github.com:shaypal5/skift.git

Install in development mode, including test dependencies:

cd skift
pip install -e '.[test]'

To also install fasttext, see instructions in the Installation section.

4.2   Running the tests

To run the tests use:

cd skift
pytest

4.3   Adding documentation

The project is documented using the numpy docstring conventions, which were chosen because they are perhaps the most widespread conventions that are both supported by common tools such as Sphinx and result in human-readable docstrings. When documenting code you add to this project, follow these conventions.

Additionally, if you update this README.rst file, use python setup.py checkdocs to validate it compiles.
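
For contributors unfamiliar with the numpy docstring conventions, a short example of the layout follows. The function count_tokens is hypothetical and exists only to show the section structure.

```python
def count_tokens(text, sep=None):
    """Count the tokens in a piece of text.

    Parameters
    ----------
    text : str
        The text to split into tokens.
    sep : str, optional
        The delimiter to split on. By default, any whitespace is used.

    Returns
    -------
    int
        The number of tokens found in ``text``.

    Examples
    --------
    >>> count_tokens('woof woof meow')
    3
    """
    return len(text.split(sep))
```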

5   Credits

Created by Shay Palachy (shay.palachy@gmail.com).