fastText

fastText - efficient text classification and representation learning - for Ruby

Installation

Add this line to your application’s Gemfile:

gem 'fasttext'
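Then run bundle install to install the gem.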

Getting Started

fastText has two primary use cases:

- text classification
- word representations

Text Classification

Prep your data

# documents
x = [
  "text from document one",
  "text from document two",
  "text from document three"
]

# labels
y = ["ham", "ham", "spam"]

Use an array if a document has multiple labels
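For example, with an invented second label ("urgent") on the first document:

y = [["ham", "urgent"], ["ham"], ["spam"]]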

Train a model

model = FastText::Classifier.new
model.fit(x, y)

Get predictions

model.predict(x)

Save the model to a file

model.save_model("model.bin")

Load the model from a file

model = FastText.load_model("model.bin")

Evaluate the model

model.test(x_test, y_test)

Get words and labels

model.words
model.labels

Use include_freq: true to get their frequency
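For example:

model.words(include_freq: true)
model.labels(include_freq: true)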

Search for the best hyperparameters

model.fit(x, y, autotune_set: [x_valid, y_valid])

Compress the model - significantly reduces size at a small cost in accuracy

model.quantize
model.save_model("model.ftz")
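A quantized model loads the same way:

model = FastText.load_model("model.ftz")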

Word Representations

Prep your data

x = [
  "text from document one",
  "text from document two",
  "text from document three"
]

Train a model

model = FastText::Vectorizer.new
model.fit(x)

Get nearest neighbors

model.nearest_neighbors("asparagus")

Get analogies

model.analogies("berlin", "germany", "france")
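This asks: "berlin" is to "germany" as what is to "france"? With a well-trained model, "paris" should rank near the top of the results.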

Get a word vector

model.word_vector("carrot")

Get words

model.words

Save the model to a file

model.save_model("model.bin")

Load the model from a file

model = FastText.load_model("model.bin")

Use continuous bag-of-words

model = FastText::Vectorizer.new(model: "cbow")

Parameters

Text classification

FastText::Classifier.new(
  lr: 0.1,                    # learning rate
  dim: 100,                   # size of word vectors
  ws: 5,                      # size of the context window
  epoch: 5,                   # number of epochs
  min_count: 1,               # minimal number of word occurrences
  min_count_label: 1,         # minimal number of label occurrences
  minn: 0,                    # min length of char ngram
  maxn: 0,                    # max length of char ngram
  neg: 5,                     # number of negatives sampled
  word_ngrams: 1,             # max length of word ngram
  loss: "softmax",            # loss function {ns, hs, softmax, ova}
  bucket: 2000000,            # number of buckets
  thread: 3,                  # number of threads
  lr_update_rate: 100,        # change the rate of updates for the learning rate
  t: 0.0001,                  # sampling threshold
  label_prefix: "__label__",  # label prefix
  verbose: 2,                 # verbose
  pretrained_vectors: nil,    # pretrained word vectors (.vec file)
  autotune_metric: "f1",      # autotune optimization metric
  autotune_predictions: 1,    # autotune predictions
  autotune_duration: 300,     # autotune search time in seconds
  autotune_model_size: nil    # autotune model size, like 2M
)
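
For example, a sketch that raises the epoch count and adds bigram word features (illustrative values, not tuned recommendations):

model = FastText::Classifier.new(epoch: 25, word_ngrams: 2, lr: 0.5)
model.fit(x, y)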

Word representations

FastText::Vectorizer.new(
  model: "skipgram",          # unsupervised fasttext model {cbow, skipgram}
  lr: 0.05,                   # learning rate
  dim: 100,                   # size of word vectors
  ws: 5,                      # size of the context window
  epoch: 5,                   # number of epochs
  min_count: 5,               # minimal number of word occurrences
  minn: 3,                    # min length of char ngram
  maxn: 6,                    # max length of char ngram
  neg: 5,                     # number of negatives sampled
  word_ngrams: 1,             # max length of word ngram
  loss: "ns",                 # loss function {ns, hs, softmax, ova}
  bucket: 2000000,            # number of buckets
  thread: 3,                  # number of threads
  lr_update_rate: 100,        # change the rate of updates for the learning rate
  t: 0.0001,                  # sampling threshold
  verbose: 2                  # verbose
)
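
Similarly, a sketch of a CBOW vectorizer with larger word vectors (again, illustrative values only):

model = FastText::Vectorizer.new(model: "cbow", dim: 300, epoch: 10)
model.fit(x)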

Input Files

Input can be read directly from files

model.fit("train.txt", autotune_set: "valid.txt")
model.test("test.txt")

Each line should be a document

text from document one
text from document two
text from document three

For text classification, lines should start with a list of labels prefixed with __label__

__label__ham text from document one
__label__ham text from document two
__label__spam text from document three
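
A line may carry several labels; list them all before the text (the "urgent" label is invented for illustration):

__label__ham __label__urgent text from document one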

Pretrained Models

There are a number of pretrained models you can download

Language Identification

Download one of the pretrained models and load it

model = FastText.load_model("lid.176.ftz")

Get language predictions

model.predict("bon appétit")
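The top prediction here should be French; depending on the model and gem version, the label may appear as "fr" or "__label__fr".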

rbenv

This library uses Rice to interface with the fastText C++ library. Rice and earlier versions of rbenv don’t play nicely together. If you encounter an error during installation, upgrade ruby-build and reinstall your Ruby version.

brew upgrade ruby-build
rbenv install [version]

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:

git clone https://github.com/ankane/fastText.git
cd fastText
bundle install
bundle exec rake compile
bundle exec rake test
