Merge 3143d8a into 977c44a
ltbringer committed Sep 3, 2021
2 parents 977c44a + 3143d8a commit c5f8972
Showing 10 changed files with 87 additions and 289 deletions.
128 changes: 47 additions & 81 deletions README.md
@@ -1,113 +1,79 @@
# Dialogy

[![Build Status](https://travis-ci.com/Vernacular-ai/dialogy.svg?branch=master)](https://travis-ci.com/Vernacular-ai/dialogy)
[![Coverage Status](https://coveralls.io/repos/github/Vernacular-ai/dialogy/badge.svg?branch=master)](https://coveralls.io/github/Vernacular-ai/dialogy?branch=master)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/03ab1c93c9354def81de73ba04b0d94c)](https://www.codacy.com/gh/Vernacular-ai/dialogy/dashboard?utm_source=github.com&utm_medium=referral&utm_content=Vernacular-ai/dialogy&utm_campaign=Badge_Grade)
[![Build Status](https://app.travis-ci.com/skit-ai/dialogy.svg?branch=master)](https://app.travis-ci.com/skit-ai/dialogy)
[![Coverage Status](https://coveralls.io/repos/github/skit-ai/dialogy/badge.svg?branch=master)](https://coveralls.io/github/skit-ai/dialogy?branch=master)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/03ab1c93c9354def81de73ba04b0d94c)](https://www.codacy.com/gh/skit-ai/dialogy/dashboard?utm_source=github.com&utm_medium=referral&utm_content=Vernacular-ai/dialogy&utm_campaign=Badge_Grade)
[![PyPI version](https://badge.fury.io/py/dialogy.svg)](https://badge.fury.io/py/dialogy)

Dialogy is a batteries-included 🔋 opinionated framework to build machine-learning solutions for speech applications.

- Plugin-based: Makes it easy to import/export components to other projects. 🔌
- Stack-agnostic: No assumptions made on ML stack; your choice of machine learning library will not be affected by using Dialogy. 👍
- Progressive: Minimal boilerplate writing to let you focus on your machine learning problems. 🤏

[Documentation](https://vernacular-ai.github.io/dialogy/)
Dialogy is a library for building SLU applications.
[Documentation](https://skit-ai.github.io/dialogy/)

## Installation

```shell
pip install dialogy
```

## Test

```shell
make test
```

## Examples

Using `dialogy` to run a classifier on a new input.

```python
import pickle
from dialogy.workflow import Workflow
from dialogy.preprocessing import merge_asr_output


def access(workflow):
return workflow.input
Dialogy's CLI supports building and migration of projects.

```txt
dialogy -h
usage: dialogy [-h] {create,update,train} ...
def mutate(workflow, value):
workflow.input = value
positional arguments:
{create,update,train}
Dialogy project utilities.
create Create a new project.
update Migrate an existing project to the latest template version.
train Train a workflow.

def vectorizer(workflow):
vectorizer = TfidfVectorizer()
workflow.input = vectorizer.transform(workflow.input)


class TfidfMLPClfWorkflow(Workflow):
def __init__(self):
super(TfidfMLPClfWorkflow).__init__()
self.model = None

def load_model(self, model_path):
with open(model_path, "rb") as binary:
self.model = binary.load()

def inference(self):
self.output = self.model.predict(self.input)


preprocessors = [merge_asr_output(access=access, mutate=mutate), vectorizer]
workflow = TfidfMLPClfWorkflow(preprocessors=preprocessors, postprocessors=[])
output = workflow.run([[{"transcript": "hello world", "confidence": 0.97}]]) # output -> _greeting_
optional arguments:
-h, --help show this help message and exit
```

Refer to the source for [`merge_asr_output`](https://vernacular-ai.github.io/dialogy/dialogy/preprocess/text/merge_asr_output.html) and [`Plugins`](https://vernacular-ai.github.io/dialogy/dialogy/plugin/plugin.html) to understand this example better.
## Project Creation

## Note
```txt
dialogy create -h
usage: dialogy create [-h] [--template TEMPLATE] [--dry-run] [--namespace NAMESPACE] [--master] project
- Popular workflow sub-classes will be accepted after code-review.
## FAQs
positional arguments:
project A directory with this name will be created at the root of command invocation.
### Training boilerplate
optional arguments:
-h, --help show this help message and exit
--template TEMPLATE
--dry-run Make no change to the directory structure.
--namespace NAMESPACE
The github/gitlab user or organization name where the template project lies.
--master Download the template's master branch (HEAD) instead of the latest tag.
```

❌. This is not an end-to-end automated model training framework. That said, no one wants to write boilerplate code,
unfortunately, it doesn't align with the objectives of this project. Also, it is hard to accomodate for different needs
like:
## Concepts

- library dependencies
- hardware support
- Need for visualizations/reports during/after training.
There are a few key concepts to build a machine-learning `Workflow`(s) using Dialogy.
All the effects comprising pre-processing, classification, scoring, ranking, etc are governed by `Plugin`(s).

Any rigidity here would lead to distractions both to maintainers and users of this project. [`Plugins`](https://vernacular-ai.github.io/dialogy/dialogy/plugin/plugin.html) and custom
[Workflow](https://vernacular-ai.github.io/dialogy/dialogy/workflow/workflow.html) are certainly welcome and can take care of recipe-based needs.
### Workflow

### Common Evaluation Plans
A workflow has these objectives.

❌. Evaluation of models is hard to standardize. if you have a fairly common need, feel free to contribute your `workflow`, `plugins`.
1. Allow interactions between different plugins.

### Benefits
2. Isolating plugins from each other. A plugin can't access another in the chain.

- ✅. This project offers a conduit for an untrained model. This means once a [Workflow](https://vernacular-ai.github.io/dialogy/dialogy/workflow/workflow.html) is ready you can use it anywhere:
evaluation scripts, serving your models via an API, custom training/evaluation scripts, combining another workflow, etc.
3. Storing the information till the end of the execution of plugin chain.

- ✅. If your application needs spoken language understanding, you should find little need to write data processing functions.
### Plugin

- ✅. Little to no learning curve, if you know python you know how to use this project.
A plugin transforms data. Depending on its utility, a plugin may have phases of operation.

## Contributions
- Bulk transformation during training phase.

- Go through the docs.
- Inference or Transformation

- Name your branch as "purpose/short-description". examples:
- "feature/hippos_can_fly"
- "fix/breathing_water_instead_of_fire"
- "docs/chapter_on_mighty_sphinx"
- "refactor/limbs_of_mummified_pharao"
- "test/my_patience"
## Test

- Make sure tests are added are passing. Run `make test lint` at project root. Coverage is important aim for 100% unless reviewer agrees.
```shell
make test
```
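
The three workflow objectives listed under Concepts (plugins interact through the workflow, plugins stay isolated from each other, and the workflow keeps state until the chain finishes) can be illustrated with a minimal sketch. Every name here (`SketchWorkflow`, `lowercase`, `keyword_intent`, the `state` dictionary) is hypothetical and chosen for illustration; this is not Dialogy's actual `Workflow` or `Plugin` API.

```python
from typing import Any, Callable, Dict, Iterable

# Hypothetical plugin type: takes a snapshot of workflow state, returns updates.
PluginFn = Callable[[Dict[str, Any]], Dict[str, Any]]


class SketchWorkflow:
    """Illustrative stand-in for a workflow; not Dialogy's real class."""

    def __init__(self, plugins: Iterable[PluginFn]) -> None:
        self.plugins = list(plugins)
        self.state: Dict[str, Any] = {}

    def run(self, input_: Any) -> Dict[str, Any]:
        # State lives on the workflow until the whole chain has run.
        self.state = {"input": input_}
        for plugin in self.plugins:
            # Each plugin receives a copy of the state, never another plugin.
            updates = plugin(dict(self.state))
            self.state.update(updates)
        return self.state


def lowercase(state: Dict[str, Any]) -> Dict[str, Any]:
    return {"text": state["input"].lower()}


def keyword_intent(state: Dict[str, Any]) -> Dict[str, Any]:
    intent = "_greeting_" if "hello" in state["text"] else "_unknown_"
    return {"intent": intent}


result = SketchWorkflow([lowercase, keyword_intent]).run("Hello World")
print(result["intent"])
```

In this sketch `keyword_intent` relies on the `text` key written by `lowercase`, yet neither function holds a reference to the other; the workflow is the only channel between them.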
17 changes: 1 addition & 16 deletions dialogy/cli/__init__.py
@@ -99,24 +99,9 @@ def project_command_parser(command_string: Optional[str]) -> argparse.Namespace:
     create_parser = add_project_command_arguments(create_parser)
     update_parser = add_project_command_arguments(update_parser)
     train_workflow_parser = parser.add_parser(const.TRAIN, help="Train a workflow.")
-    test_workflow_parser = parser.add_parser(const.TEST, help="Test a workflow.")
     train_workflow_parser = add_workflow_command_arguments(
         train_workflow_parser, const.TRAIN
     )
-    test_workflow_parser = add_workflow_command_arguments(
-        test_workflow_parser, const.TEST
-    )
-    test_workflow_parser.add_argument(
-        "--out",
-        help="model",
-        default="The directory where the artifacts must be stored.",
-        required=True,
-    )
-    test_workflow_parser.add_argument(
-        "--join-id",
-        help="Join prediction dataframe and (truth) labeled dataframe by id.",
-        required=True,
-    )

     command = command_string.split() if command_string else None
     return argument_parser.parse_args(args=command)
@@ -129,5 +114,5 @@ def main(command_string: Optional[str] = None) -> None:
     args = project_command_parser(command_string=command_string)
     if args.command in [const.CREATE, const.UPDATE]:
         project.project_cli(args)
-    elif args.command in [const.TRAIN, const.TEST]:
+    elif args.command in [const.TRAIN]:
         workflow.workflow_cli(args)
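
With the `test` sub-command removed, the CLI accepts only `create`, `update` and `train`. The following is a minimal standalone sketch of that sub-command layout using only `argparse` from the standard library. The names `build_parser` and `dispatch` and the program name `dialogy-sketch` are assumptions made for illustration; the real parser adds further arguments through helpers that this hunk does not show.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the three remaining sub-commands.
    parser = argparse.ArgumentParser(prog="dialogy-sketch")
    subparsers = parser.add_subparsers(dest="command", required=True)
    subparsers.add_parser("create", help="Create a new project.")
    subparsers.add_parser(
        "update",
        help="Migrate an existing project to the latest template version.",
    )
    subparsers.add_parser("train", help="Train a workflow.")
    return parser


def dispatch(command_string: Optional[str] = None) -> str:
    # Same split-or-None convention as the hunk above.
    command = command_string.split() if command_string else None
    args = build_parser().parse_args(args=command)
    if args.command in ("create", "update"):
        return "project"
    # Only "train" can reach this branch; argparse rejects anything else.
    return "workflow"
```

Under this layout, `dispatch("test")` never reaches the dispatch branches: `argparse` reports an invalid choice and raises `SystemExit`.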
35 changes: 0 additions & 35 deletions dialogy/cli/workflow.py
@@ -36,39 +36,6 @@ def train_workflow(args: argparse.Namespace) -> None:
         raise AttributeError from error


-def test_workflow(args: argparse.Namespace) -> None:
-    """
-    Test a workflow.
-    """
-    module = args.module
-    get_workflow_fn = args.fn
-    data = args.data
-    output_dir = args.out
-    join_id = args.join_id
-    lang = args.lang
-    project = args.project
-    kwargs = {"lang": lang, "project": project}
-
-    workflow: Workflow = get_workflow(module, get_workflow_fn, args.command, **kwargs)
-    test_df = pd.read_csv(data)
-    try:
-        result_df = workflow.prediction_labels(test_df, join_id)
-    except AttributeError as error:
-        logger.error(f"{workflow=} doesn't have a prediction_labels method? 🤔")
-        raise AttributeError from error
-
-    result_df = pd.merge(test_df, result_df, on=join_id)
-    report = pd.DataFrame(
-        classification_report(
-            result_df[const.LABELS],
-            result_df[const.INTENT],
-            zero_division=0,
-            output_dict=True,
-        )
-    ).T
-    report.to_csv(create_timestamps_path(output_dir, "report.csv"), index=False)
-
-
 def workflow_cli(args: argparse.Namespace) -> None:
     """
     CLI entry point for workflows.
@@ -85,8 +52,6 @@ def workflow_cli(args: argparse.Namespace) -> None:
     try:
         if command == const.TRAIN:
             train_workflow(args)
-        elif command == const.TEST:
-            test_workflow(args)
     except ModuleNotFoundError as error:
         logger.error(
             f"Could not import module {args.module} is "
1 change: 1 addition & 0 deletions dialogy/constants/__init__.py
@@ -117,3 +117,4 @@ class SIGNAL:
 CREATE = "create"
 UPDATE = "update"
 ERROR_LABEL = "_error_"
+TEXT = "text"
17 changes: 13 additions & 4 deletions dialogy/plugins/text/classification/xlmr.py
@@ -38,6 +38,7 @@ def __init__(
         data_column: str = const.DATA,
         label_column: str = const.LABELS,
         args_map: Optional[Dict[str, Any]] = None,
+        skip_labels: Optional[List[str]] = None,
         kwargs: Optional[Dict[str, Any]] = None,
     ) -> None:
         try:
@@ -62,6 +63,7 @@ def __init__(
             self.model_dir, const.LABELENCODER_FILE
         )
         self.threshold = threshold
+        self.skip_labels = set(skip_labels or set())
         self.purpose = purpose
         self.round = score_round_off
         if args_map and (
@@ -206,15 +208,22 @@ def train(self, training_data: pd.DataFrame) -> None:
             )
             return

+        skip_labels_filter = training_data[self.label_column].isin(self.skip_labels)
+        training_data = training_data[~skip_labels_filter].copy()
+
         encoder = self.labelencoder.fit(training_data[self.label_column])
         sample_size = 5 if len(training_data) > 5 else len(training_data)
-        training_data.rename(columns={self.data_column: "text"}, inplace=True)
-        training_data.loc[:, self.label_column] = encoder.transform(
-            training_data[self.label_column]
+        training_data.rename(
+            columns={self.data_column: const.TEXT, self.label_column: const.LABELS},
+            inplace=True,
+        )
+        training_data.loc[:, const.LABELS] = encoder.transform(
+            training_data[const.LABELS]
         )
+        training_data = training_data[[const.TEXT, const.LABELS]]
         self.init_model(len(encoder.classes_))
         logger.debug(
-            f"Displaying a few samples (this goes into the model):\n{training_data.sample(sample_size)}"
+            f"Displaying a few samples (this goes into the model):\n{training_data.sample(sample_size)}\nLabels: {len(encoder.classes_)}."
         )
         self.model.train_model(training_data)
         self.save()
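
The new `train` steps are: drop rows whose label is in `skip_labels`, encode the remaining labels as integers, then keep only the text and label columns. A dependency-free sketch of that sequence follows, with plain lists and dictionaries standing in for the DataFrame and a sorted-label mapping standing in for the fitted label encoder. The function name `prepare_training_rows`, the sample column names and the `"labels"` key are assumptions (the diff shows `const.TEXT = "text"` but not the value of `const.LABELS`).

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Any]


def prepare_training_rows(
    rows: Iterable[Row],
    data_column: str,
    label_column: str,
    skip_labels: Optional[List[str]] = None,
) -> Tuple[List[Row], List[str]]:
    # Same normalisation as the hunk: None becomes an empty set.
    skip = set(skip_labels or set())

    # Step 1: discard rows carrying a skipped label.
    kept = [row for row in rows if row[label_column] not in skip]

    # Step 2: map each remaining label to an integer id.
    # Sorted order is an assumption standing in for the fitted encoder.
    classes = sorted({row[label_column] for row in kept})
    label_id = {label: index for index, label in enumerate(classes)}

    # Step 3: keep only the two columns the model consumes.
    prepared = [
        {"text": row[data_column], "labels": label_id[row[label_column]]}
        for row in kept
    ]
    return prepared, classes


rows = [
    {"id": 1, "data": "hi", "tag": "greet"},
    {"id": 2, "data": "static noise", "tag": "_oos_"},
    {"id": 3, "data": "bye", "tag": "farewell"},
]
prepared, classes = prepare_training_rows(rows, "data", "tag", skip_labels=["_oos_"])
```

Filtering before the encoder is fitted matters: a skipped label never becomes a class, so the class count passed to the model reflects only the labels that remain.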
14 changes: 7 additions & 7 deletions dialogy/plugins/text/merge_asr_output/__init__.py
@@ -109,19 +109,19 @@ def transform(self, training_data: pd.DataFrame) -> pd.DataFrame:
             try:
                 asr_output = json.loads(row[self.data_column])
                 if asr_output:
-                    training_data.loc[i, self.data_column] = merge_asr_output(
-                        asr_output
-                    )[0]
+                    merged_asr_ouptut = merge_asr_output(asr_output)
+                    training_data.loc[i, self.data_column] = merged_asr_ouptut[0]
                 else:
                     training_data.loc[i, "use"] = False
             except Exception as error:  # pylint: disable=broad-except
                 training_data.loc[i, "use"] = False
-                logger.error(f"{error}\n{traceback.format_exc()}")
+                logger.error(f"{error} -- {asr_output}\n{traceback.format_exc()}")

         training_data_ = training_data[training_data.use].copy()
         training_data_.drop("use", axis=1, inplace=True)
         discarded_data = len(training_data) - len(training_data_)
-        logger.debug(
-            f"Discarding {discarded_data} samples because the alternatives couldn't be parsed."
-        )
+        if discarded_data:
+            logger.debug(
+                f"Discarding {discarded_data} samples because the alternatives couldn't be parsed."
+            )
         return training_data_
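
The reworked `transform` parses each row's JSON, keeps rows that yield a merged transcript, and now logs the discard count only when it is non-zero. Below is a standard-library sketch of that control flow over a list of raw JSON strings. `first_transcript` is a hypothetical stand-in for `merge_asr_output(...)[0]`, and the logger name is arbitrary. The sketch also sets `asr_output = None` before the `try`, which is a deviation from the hunk: there, the error log refers to `asr_output` even when `json.loads` is the call that failed.

```python
import json
import logging
from typing import Any, List

logger = logging.getLogger("merge_asr_sketch")


def first_transcript(asr_output: Any) -> str:
    # Hypothetical stand-in for merge_asr_output(asr_output)[0].
    return asr_output[0][0]["transcript"]


def transform(raw_rows: List[str]) -> List[str]:
    kept: List[str] = []
    for raw in raw_rows:
        asr_output = None  # defined up front so the error log can always use it
        try:
            asr_output = json.loads(raw)
            if asr_output:
                kept.append(first_transcript(asr_output))
        except Exception as error:  # broad on purpose, mirroring the hunk
            logger.error("%s -- %s", error, asr_output)

    discarded = len(raw_rows) - len(kept)
    if discarded:
        # Stay silent when nothing was dropped.
        logger.debug(
            "Discarding %d samples because the alternatives couldn't be parsed.",
            discarded,
        )
    return kept


samples = [
    '[[{"transcript": "hello world", "confidence": 0.97}]]',
    "not json",
    "[]",
]
usable = transform(samples)
```

Here the second sample fails to parse and the third parses to an empty list, so only the first contributes a transcript.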
40 changes: 0 additions & 40 deletions dialogy/workflow/workflow.py
@@ -203,43 +203,3 @@ def train(self, training_data: pd.DataFrame) -> None:
             transformed_data = plugin.transform(training_data)
             if transformed_data is not None:
                 training_data = transformed_data
-
-    def prediction_labels(
-        self, testing_data: pd.DataFrame, id_: Union[str, int]
-    ) -> pd.DataFrame:
-        """
-        Evaluate the workflow with all the embedded plugins.
-        Plugins can be evaluated individually for fine-tuning but since there are interactions
-        between them, we need to evaluate them all together. This helps in cases where these interactions
-        are a cause of a model's poor performance.
-        This method doesn't mutate the given test dataset, instead we produce results with the same `id_`
-        so that they can be joined and studied together.
-        :param testing_data: A pandas DataFrame containing the testing data.
-        :type testing_data: pd.DataFrame
-        :param id_: The join parameter, which is used to join the testing data with the ground truth data.
-        :type id_: Union[str, int]
-        :return: A pandas DataFrame containing the workflow output.
-        :rtype: pd.DataFrame
-        """
-        results = []
-        for _, row in tqdm(testing_data.iterrows(), total=len(testing_data)):
-            output = self.run(
-                input_={
-                    const.CLASSIFICATION_INPUT: json.loads(row[const.DATA])[
-                        const.ALTERNATIVES
-                    ]
-                }
-            )
-            intents = output.get(const.INTENTS, [])
-            if intents:
-                results.append(
-                    {
-                        id_: row[id_],
-                        const.INTENT: intents[0].name,
-                        const.SCORE: intents[0].score,
-                    }
-                )
-        return pd.DataFrame(results)