Penelope

Natural Language Processing (NLP) and Machine Learning (ML) library for Elixir. Penelope provides a scikit-learn-inspired interface to the LIBSVM, LIBLINEAR, and CRFsuite C/C++ libraries in Elixir, which can be used for many ML/NLP applications.

Status

The API reference is available here.

Installation

Dependencies

First, clone the project's submodules.

git submodule update --init

This package requires an implementation of BLAS for efficient matrix math. It can be installed on each platform as follows:

OSX

BLAS is built into OSX.

Alpine

Install openblas-dev via apk.

sudo apk add openblas-dev

Ubuntu

Install libblas-dev via apt.

sudo apt install libblas-dev

Hex

def deps do
  [
    {:penelope, "~> 0.4"}
  ]
end

Usage

Intent Classification/Entity Recognition

Penelope can be used to build a machine learning model for classifying natural language utterances and extracting parameters from them. The Penelope.NLP.IntentClassifier module uses a classifier pipeline for recognizing intents and a recognizer pipeline for extracting named entities from the utterance. The following is a contrived example that classifies intents based on the number of tokens in the utterance.

alias Penelope.NLP.IntentClassifier

pipeline = %{
  tokenizer: [{:ptb_tokenizer, []}],
  classifier: [{:count_vectorizer, []},
               {:linear_classifier, [probability?: true]}],
  recognizer: [{:crf_tagger, []}],
}
x = [
  "you have four pears",
  "three hundred apples would be a lot"
]
y = [
  {"intent_1", ["o", "o", "b_count", "b_fruit"]},
  {"intent_2", ["b_count", "i_count", "b_fruit", "o", "o", "o", "o"]}
]
classifier = IntentClassifier.fit(%{}, x, y, pipeline)

{intents, params} = IntentClassifier.predict_intent(
  classifier,
  %{},
  "I have three bananas"
)

Pipeline Definition

pipeline = %{
  tokenizer: [{:ptb_tokenizer, []}],
  classifier: [{:count_vectorizer, []},
               {:linear_classifier, [probability?: true]}],
  recognizer: [{:crf_tagger, []}],
}

This block configures the tokenizer, classifier, and recognizer pipelines used by the intent classifier. A pipeline in Penelope is a list of components, each paired with its configuration, that is used to train a machine learning model and to predict with it. The interface is similar to that used in scikit-learn.

The tokenizer converts a string utterance into a sequence of tokens. In this example, we use the Penn Treebank tokenizer (:ptb_tokenizer). The tokenizer pipeline is run before either of the other two pipelines, so that they can share its output.

The classifier pipeline receives a tokenized utterance (x) and class labels (y) and learns a model that can predict the label from the utterance. In this example, we use a simple token count vectorizer (number of tokens in the utterance) and a logistic regression classifier to predict the class labels.

Finally, the recognizer pipeline receives a tokenized sequence (x) and sequence tags (y) to learn a model that can predict the tag of each token in the sequence. This allows the recognizer to extract slot values from natural language utterances. This example uses a Conditional Random Field (CRF) model, which can be thought of as a sequence extension of logistic regression, to tag the tokens in the utterance.
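
To make the classifier's single feature concrete, here is a minimal sketch in plain Elixir of what "number of tokens in the utterance" means for the two training utterances. The TokenCountSketch module and its functions are hypothetical stand-ins written for illustration; they are not Penelope's :ptb_tokenizer or :count_vectorizer, and whitespace splitting is only a rough approximation of Penn Treebank tokenization.

```elixir
defmodule TokenCountSketch do
  # Hypothetical helper: split on runs of whitespace.
  # This is a crude stand-in for a real PTB tokenizer.
  def tokenize(utterance) when is_binary(utterance) do
    String.split(utterance)
  end

  # Hypothetical helper: reduce a token list to a one-element
  # feature vector holding the token count.
  def count_feature(tokens) when is_list(tokens) do
    [length(tokens)]
  end
end

x = [
  "you have four pears",
  "three hundred apples would be a lot"
]

features =
  x
  |> Enum.map(&TokenCountSketch.tokenize/1)
  |> Enum.map(&TokenCountSketch.count_feature/1)

# charlists: :as_lists keeps small integers from being printed as a charlist.
IO.inspect(features, charlists: :as_lists)
# prints [[4], [7]]
```

With only two training examples, a classifier trained on this feature can do little more than separate short utterances from long ones, which is why the example is described as contrived.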

Training

x = [
  "you have four pears",
  "three hundred apples would be a lot"
]
y = [
  {"intent_1", ["o", "o", "b_count", "b_fruit"]},
  {"intent_2", ["b_count", "i_count", "b_fruit", "o", "o", "o", "o"]}
]
classifier = IntentClassifier.fit(%{}, x, y, pipeline)

Inputs (x) to the intent classifier are simple natural language utterances. These inputs are tokenized and converted to feature vectors/maps as needed by the classifier/recognizer.

Each label (y) is a tuple of {intent, tags}, where intent is the class label of the intent for the corresponding x value. tags is a list of token tags, each of which is a label for the corresponding token in the utterance x. Tag labels are expressed using the Inside-Outside-Beginning (IOB) format. In the above snippet, the following are the token tags for the first utterance.

token	tag
you	o
have	o
four	b_count
pears	b_fruit
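
The relationship between the token tags and the extracted parameters can be sketched in plain Elixir. The IobSketch module below is a hypothetical illustration, not part of Penelope: it assumes tags of the form "b_<name>", "i_<name>", or "o", and joins the tokens of each tagged span with a single space. How Penelope itself assembles entity values may differ.

```elixir
defmodule IobSketch do
  # Hypothetical helper: collapse parallel token/tag lists into a map of
  # entity name => entity text. The accumulator tracks the params built so
  # far and the name of the currently open span (or nil).
  def to_params(tokens, tags) do
    tokens
    |> Enum.zip(tags)
    |> Enum.reduce({%{}, nil}, fn
      {token, "b_" <> name}, {params, _open} ->
        # A "b_" tag begins a new span for this entity.
        {Map.put(params, name, token), name}

      {token, "i_" <> name}, {params, open} when name == open ->
        # An "i_" tag continues the span opened by the matching "b_" tag.
        {Map.update!(params, name, &(&1 <> " " <> token)), name}

      {_token, _tag}, {params, _open} ->
        # "o" (or an "i_" tag with no open span) closes any open span.
        {params, nil}
    end)
    |> elem(0)
  end
end

tokens = String.split("three hundred apples would be a lot")
tags = ["b_count", "i_count", "b_fruit", "o", "o", "o", "o"]

params = IobSketch.to_params(tokens, tags)
IO.inspect(params)
```

For the second training utterance, this groups "three hundred" under "count" and "apples" under "fruit"; for the first, it yields "four" and "pears".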

Prediction

{intents, params} = IntentClassifier.predict_intent(
  classifier,
  %{},
  "I have three bananas"
)

The snippet above returns an intents map and a params map for the utterance, as shown below. The intents map contains the posterior probability of each intent; these probabilities sum to 1.0. The params map maps each entity name, taken from the tags in the training examples, to the text extracted from the utterance.

{
    %{
        "intent_1" => 0.6666666661872298,
        "intent_2" => 0.3333333338127702
    },
    %{
        "count" => "three",
        "fruit" => "bananas"
    }
}
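
A caller will usually want a single decision from the intents map. The following is a minimal sketch in plain Elixir that takes the map shown above, picks the most probable intent, and applies a confidence threshold. The min_confidence value and the {:ok, intent} / :unsure return shape are assumptions made for illustration, not part of Penelope's API.

```elixir
# The intents map from the example output above.
intents = %{
  "intent_1" => 0.6666666661872298,
  "intent_2" => 0.3333333338127702
}

# Pick the entry with the highest posterior probability.
{top_intent, probability} = Enum.max_by(intents, fn {_name, p} -> p end)

# The posteriors should sum to 1.0, up to floating point error.
total = intents |> Map.values() |> Enum.sum()

# Hypothetical threshold: treat low-confidence predictions as undecided.
min_confidence = 0.5

decision =
  if probability >= min_confidence do
    {:ok, top_intent}
  else
    :unsure
  end

IO.inspect({decision, total})
```

Rejecting predictions below a threshold is a common way to avoid acting on an utterance that matches no trained intent well; the right threshold depends on the application.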

Improvements

Obviously, using the token count as the only feature to try to predict an intent is silly, and using only the input tokens to train the entity recognizer will not generalize well. For better classification/recognition, Penelope includes several feature generation components/vectorizers, including support for pretrained embeddings (word vectors) and regexes. Examples of these can be found in the API reference.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

