Natural Language Processing (NLP) and Machine Learning (ML) library for Elixir. Penelope provides a scikit-learn-inspired interface to the LIBSVM, LIBLINEAR, and CRFsuite C/C++ libraries in Elixir, which can be used for many ML/NLP applications.
The API reference is available here.
First, initialize and fetch the project's git submodules:

```sh
git submodule update --init
```
This package requires an implementation of BLAS for efficient matrix math. It can be installed on each platform as follows:

- OSX: BLAS is built into OSX; no installation is needed.
- Alpine: install `openblas-dev` via apk.

  ```sh
  sudo apk add openblas-dev
  ```

- Debian/Ubuntu: install `libblas-dev` via apt.

  ```sh
  sudo apt install libblas-dev
  ```
Then add `penelope` to the dependencies in your `mix.exs`:

```elixir
def deps do
  [
    {:penelope, "~> 0.4"}
  ]
end
```
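After updating the dependency list, fetch the package as usual:

```sh
mix deps.get
```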
Penelope can be used to build a machine learning model for classifying natural language utterances and extracting parameters from them. The `Penelope.NLP.IntentClassifier` module uses a classifier pipeline for recognizing intents and a recognizer pipeline for extracting named entities from the utterance. The following is a contrived example that classifies intents based on the token length of the utterance.
```elixir
alias Penelope.NLP.IntentClassifier

pipeline = %{
  tokenizer: [{:ptb_tokenizer, []}],
  classifier: [{:count_vectorizer, []},
               {:linear_classifier, [probability?: true]}],
  recognizer: [{:crf_tagger, []}]
}

x = [
  "you have four pears",
  "three hundred apples would be a lot"
]
y = [
  {"intent_1", ["o", "o", "b_count", "b_fruit"]},
  {"intent_2", ["b_count", "i_count", "b_fruit", "o", "o", "o", "o"]}
]

classifier = IntentClassifier.fit(%{}, x, y, pipeline)

{intents, params} = IntentClassifier.predict_intent(
  classifier,
  %{},
  "I have three bananas"
)
```
```elixir
pipeline = %{
  tokenizer: [{:ptb_tokenizer, []}],
  classifier: [{:count_vectorizer, []},
               {:linear_classifier, [probability?: true]}],
  recognizer: [{:crf_tagger, []}]
}
```
This block configures the tokenizer, classifier, and recognizer pipelines used by the intent classifier. A pipeline in Penelope is a list of components and their configuration, used to fit a machine learning model and to make predictions, with an interface similar to scikit-learn's.
The tokenizer converts a string utterance into a sequence of tokens. In this example, we use the Penn Treebank tokenizer (`:ptb_tokenizer`). The tokenizer pipeline runs before either of the other two pipelines, so that they can share its output.
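For simple utterances like the ones in this example, PTB tokenization reduces to whitespace splitting, so the output looks roughly like this (an illustration in plain Elixir, not the library's implementation):

```elixir
# Illustrative only: approximates the tokenizer's output for simple input.
# A real PTB tokenizer also splits punctuation and contractions,
# e.g. "don't" -> ["do", "n't"].
tokens = String.split("you have four pears")
# => ["you", "have", "four", "pears"]
```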
The classifier pipeline receives a tokenized utterance (x) and class labels (y) and learns a model that can predict the label from the utterance. In this example, we use a simple token count vectorizer (number of tokens in the utterance) and a logistic regression classifier to predict the class labels.
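Concretely, the only feature the count vectorizer contributes here is the number of tokens, which can be sketched in plain Elixir (an illustration, not the library's code):

```elixir
# Illustrative sketch: a token-count "vectorizer" maps a token list
# to a one-element feature vector.
count_vectorize = fn tokens -> [length(tokens) * 1.0] end

count_vectorize.(["you", "have", "four", "pears"])
# => [4.0]
```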
Finally, the recognizer pipeline receives a tokenized sequence (x) and sequence tags (y) to learn a model that can predict the tag of each token in the sequence. This allows the recognizer to extract slot values from natural language utterances. This example uses a Conditional Random Field (CRF) model, which can be thought of as a sequence extension of logistic regression, to tag the tokens in the utterance.
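To make the tagging concrete, the sketch below (plain Elixir, not part of the library) shows how IOB tags line up with tokens and how contiguous `b_`/`i_` spans collapse into named parameters:

```elixir
# Illustrative sketch: collapse IOB-tagged tokens into a params map.
# "b_x" begins entity x, "i_x" continues it, "o" is outside any entity.
collapse = fn tokens, tags ->
  Enum.zip(tokens, tags)
  |> Enum.reduce(%{}, fn
    {token, "b_" <> name}, acc -> Map.put(acc, name, token)
    {token, "i_" <> name}, acc -> Map.update(acc, name, token, &(&1 <> " " <> token))
    {_token, "o"}, acc -> acc
  end)
end

collapse.(
  ~w(three hundred apples would be a lot),
  ["b_count", "i_count", "b_fruit", "o", "o", "o", "o"]
)
# => %{"count" => "three hundred", "fruit" => "apples"}
```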
```elixir
x = [
  "you have four pears",
  "three hundred apples would be a lot"
]
y = [
  {"intent_1", ["o", "o", "b_count", "b_fruit"]},
  {"intent_2", ["b_count", "i_count", "b_fruit", "o", "o", "o", "o"]}
]

classifier = IntentClassifier.fit(%{}, x, y, pipeline)
```
Inputs (x) to the intent classifier are simple natural language utterances. These inputs are tokenized and converted to feature vectors/maps as needed by the classifier/recognizer.
Each label (y) is a tuple of `{intent, tags}`, where `intent` is the class label of the intent for the corresponding x value. `tags` is a list of token tags, each of which is a label for the corresponding token in the utterance x. Tag labels are expressed using the Inside-Outside-Beginning (IOB) format. In the above snippet, the following are the token tags for the first utterance:
| token | tag     |
|-------|---------|
| you   | o       |
| have  | o       |
| four  | b_count |
| pears | b_fruit |
```elixir
{intents, params} = IntentClassifier.predict_intent(
  classifier,
  %{},
  "I have three bananas"
)
```
The snippet above returns the `intents` and `params` maps shown below, which classify the utterance. The `intents` map contains the posterior probability of each intent; these probabilities sum to 1.0. The `params` map contains the entity names and values extracted from the utterance, based on the names specified in the training examples.
```elixir
{
  %{
    "intent_1" => 0.6666666661872298,
    "intent_2" => 0.3333333338127702
  },
  %{
    "count" => "three",
    "fruit" => "bananas"
  }
}
```
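Given the `intents` map, picking the most likely intent is a standard `Enum` one-liner:

```elixir
# Select the intent with the highest posterior probability.
{intent, probability} = Enum.max_by(intents, fn {_intent, p} -> p end)
# => {"intent_1", 0.6666666661872298}
```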
Obviously, using the token count as the only feature to try to predict an intent is silly, and using only the input tokens to train the entity recognizer will not generalize well. For better classification/recognition, Penelope includes several feature generation components/vectorizers, including support for pretrained embeddings (word vectors) and regexes. Examples of these can be found in the API reference.
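As a purely hypothetical sketch of what a richer pipeline could look like (the vectorizer names below are placeholders, not necessarily Penelope's actual component names; consult the API reference for the real ones):

```elixir
# Hypothetical sketch only: the feature-generation component names are
# placeholders standing in for whatever the API reference documents.
pipeline = %{
  tokenizer: [{:ptb_tokenizer, []}],
  classifier: [
    {:word_vectorizer, []},   # placeholder: pretrained embedding features
    {:regex_vectorizer, []},  # placeholder: regex match features
    {:linear_classifier, [probability?: true]}
  ],
  recognizer: [{:crf_tagger, []}]
}
```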
Copyright 2017 Pylon, Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.