Merge 3143d8a into 977c44a

skit-ai · Sep 3, 2021 · c5f8972
2 parents 977c44a + 3143d8a
commit c5f8972
Showing 10 changed files with 87 additions and 289 deletions.
diff --git a/README.md b/README.md
@@ -1,113 +1,79 @@
 # Dialogy
 
-[![Build Status](https://travis-ci.com/Vernacular-ai/dialogy.svg?branch=master)](https://travis-ci.com/Vernacular-ai/dialogy)
-[![Coverage Status](https://coveralls.io/repos/github/Vernacular-ai/dialogy/badge.svg?branch=master)](https://coveralls.io/github/Vernacular-ai/dialogy?branch=master)
-[![Codacy Badge](https://app.codacy.com/project/badge/Grade/03ab1c93c9354def81de73ba04b0d94c)](https://www.codacy.com/gh/Vernacular-ai/dialogy/dashboard?utm_source=github.com&utm_medium=referral&utm_content=Vernacular-ai/dialogy&utm_campaign=Badge_Grade)
+[![Build Status](https://app.travis-ci.com/skit-ai/dialogy.svg?branch=master)](https://app.travis-ci.com/skit-ai/dialogy)
+[![Coverage Status](https://coveralls.io/repos/github/skit-ai/dialogy/badge.svg?branch=master)](https://coveralls.io/github/skit-ai/dialogy?branch=master)
+[![Codacy Badge](https://app.codacy.com/project/badge/Grade/03ab1c93c9354def81de73ba04b0d94c)](https://www.codacy.com/gh/skit-ai/dialogy/dashboard?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=Vernacular-ai/dialogy&amp;utm_campaign=Badge_Grade)
 [![PyPI version](https://badge.fury.io/py/dialogy.svg)](https://badge.fury.io/py/dialogy)
 
-Dialogy is a batteries-included 🔋 opinionated framework to build machine-learning solutions for speech applications. 
-
--   Plugin-based: Makes it easy to import/export components to other projects. 🔌
--   Stack-agnostic: No assumptions made on ML stack; your choice of machine learning library will not be affected by using Dialogy. 👍
--   Progressive: Minimal boilerplate writing to let you focus on your machine learning problems. 🤏
-
-[Documentation](https://vernacular-ai.github.io/dialogy/)
+Dialogy is a library for building SLU applications.
+[Documentation](https://skit-ai.github.io/dialogy/)
 
 ## Installation
 
 ```shell
 pip install dialogy
 ```
 
-## Test
-
-```shell
-make test
-```
-
-## Examples
-
-Using `dialogy` to run a classifier on a new input.
-
-```python
-import pickle
-from dialogy.workflow import Workflow
-from dialogy.preprocessing import merge_asr_output
-
-
-def access(workflow):
-    return workflow.input
+Dialogy's CLI supports building and migration of projects.
 
+```txt
+dialogy -h
+usage: dialogy [-h] {create,update,train} ...
 
-def mutate(workflow, value):
-    workflow.input = value
+positional arguments:
+  {create,update,train}
+                        Dialogy project utilities.
+    create              Create a new project.
+    update              Migrate an existing project to the latest template version.
+    train               Train a workflow.
 
-
-def vectorizer(workflow):
-    vectorizer = TfidfVectorizer()
-    workflow.input = vectorizer.transform(workflow.input)
-
-
-class TfidfMLPClfWorkflow(Workflow):
-    def __init__(self):
-        super(TfidfMLPClfWorkflow).__init__()
-        self.model = None
-
-    def load_model(self, model_path):
-        with open(model_path, "rb") as binary:
-            self.model = binary.load()
-
-    def inference(self):
-        self.output = self.model.predict(self.input)
-
-
-preprocessors = [merge_asr_output(access=access, mutate=mutate), vectorizer]
-workflow = TfidfMLPClfWorkflow(preprocessors=preprocessors, postprocessors=[])
-output = workflow.run([[{"transcript": "hello world", "confidence": 0.97}]]) # output -> _greeting_
+optional arguments:
+  -h, --help            show this help message and exit
 ```
 
-Refer to the source for [`merge_asr_output`](https://vernacular-ai.github.io/dialogy/dialogy/preprocess/text/merge_asr_output.html) and [`Plugins`](https://vernacular-ai.github.io/dialogy/dialogy/plugin/plugin.html) to understand this example better.
+## Project Creation
 
-## Note
+```txt
+dialogy create -h
+usage: dialogy create [-h] [--template TEMPLATE] [--dry-run] [--namespace NAMESPACE] [--master] project
 
--   Popular workflow sub-classes will be accepted after code-review.
-## FAQs
+positional arguments:
+  project               A directory with this name will be created at the root of command invocation.
 
-### Training boilerplate
+optional arguments:
+  -h, --help            show this help message and exit
+  --template TEMPLATE
+  --dry-run             Make no change to the directory structure.
+  --namespace NAMESPACE
+                        The github/gitlab user or organization name where the template project lies.
+  --master              Download the template's master branch (HEAD) instead of the latest tag.
+```
 
-❌. This is not an end-to-end automated model training framework. That said, no one wants to write boilerplate code,
-unfortunately, it doesn't align with the objectives of this project. Also, it is hard to accomodate for different needs 
-like: 
+## Concepts
 
--   library dependencies 
--   hardware support
--   Need for visualizations/reports during/after training.
+There are a few key concepts to build a machine-learning `Workflow`(s) using Dialogy.
+All the effects comprising pre-processing, classification, scoring, ranking, etc are governed by `Plugin`(s).
 
-Any rigidity here would lead to distractions both to maintainers and users of this project. [`Plugins`](https://vernacular-ai.github.io/dialogy/dialogy/plugin/plugin.html) and custom
-[Workflow](https://vernacular-ai.github.io/dialogy/dialogy/workflow/workflow.html) are certainly welcome and can take care of recipe-based needs. 
+### Workflow
 
-### Common Evaluation Plans
+A workflow has these objectives.
 
-❌. Evaluation of models is hard to standardize. if you have a fairly common need, feel free to contribute your `workflow`, `plugins`.
+1. Allow interactions between different plugins.
 
-### Benefits
+2. Isolating plugins from each other. A plugin can't access another in the chain.
 
--   ✅. This project offers a conduit for an untrained model. This means once a [Workflow](https://vernacular-ai.github.io/dialogy/dialogy/workflow/workflow.html) is ready you can use it anywhere:
-    evaluation scripts, serving your models via an API, custom training/evaluation scripts, combining another workflow, etc. 
+3. Storing the information till the end of the execution of plugin chain.
 
--   ✅. If your application needs spoken language understanding, you should find little need to write data processing functions.
+### Plugin
 
--   ✅. Little to no learning curve, if you know python you know how to use this project.
+A plugin transforms data. Depending on its utility, a plugin may have phases of operation.
 
-## Contributions
+- Bulk transformation during training phase.
 
--   Go through the docs.
+- Inference or Transformation
 
--   Name your branch as "purpose/short-description". examples:
-    -   "feature/hippos_can_fly"
-    -   "fix/breathing_water_instead_of_fire"
-    -   "docs/chapter_on_mighty_sphinx"
-    -   "refactor/limbs_of_mummified_pharao"
-    -   "test/my_patience"
+## Test
 
--   Make sure tests are added are passing. Run `make test lint` at project root. Coverage is important aim for 100% unless reviewer agrees.
+```shell
+make test
+```
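Reviewer sketch (not part of the commit): the new "Concepts" section describes a workflow that stores state for the whole plugin chain while keeping plugins isolated from each other. Below is a minimal illustration of that idea. It assumes the `access`/`mutate` convention from the README example this commit removes, and `MiniWorkflow`, `MiniPlugin`, and the `state` dictionary are hypothetical names, not Dialogy's API.

```python
from typing import Any, Callable, Dict, List


class MiniWorkflow:
    """Hypothetical stand-in for a workflow: runs plugins in order, keeps state."""

    def __init__(self, plugins: List["MiniPlugin"]) -> None:
        self.plugins = plugins
        self.state: Dict[str, Any] = {}

    def run(self, text: str) -> Dict[str, Any]:
        # State lives on the workflow until the plugin chain has finished.
        self.state = {"text": text}
        for plugin in self.plugins:
            plugin(self)
        return self.state


class MiniPlugin:
    """Hypothetical plugin: reads via `access`, writes via `mutate`.

    A plugin never holds a reference to another plugin; the workflow is
    the only channel through which plugins interact.
    """

    def __init__(
        self,
        transform: Callable[[Any], Any],
        access: Callable[[MiniWorkflow], Any],
        mutate: Callable[[MiniWorkflow, Any], None],
    ) -> None:
        self.transform = transform
        self.access = access
        self.mutate = mutate

    def __call__(self, workflow: MiniWorkflow) -> None:
        value = self.access(workflow)
        self.mutate(workflow, self.transform(value))


# Two plugins that only communicate through workflow.state.
lowercase = MiniPlugin(
    str.lower,
    access=lambda w: w.state["text"],
    mutate=lambda w, v: w.state.update(text=v),
)
tokenize = MiniPlugin(
    str.split,
    access=lambda w: w.state["text"],
    mutate=lambda w, v: w.state.update(tokens=v),
)

result = MiniWorkflow([lowercase, tokenize]).run("Hello World")
```

Here `tokenize` sees the lowercased text only because `lowercase` wrote it back to the workflow, which is one way to read "interaction without direct access".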
diff --git a/dialogy/cli/__init__.py b/dialogy/cli/__init__.py
@@ -99,24 +99,9 @@ def project_command_parser(command_string: Optional[str]) -> argparse.Namespace:
     create_parser = add_project_command_arguments(create_parser)
     update_parser = add_project_command_arguments(update_parser)
     train_workflow_parser = parser.add_parser(const.TRAIN, help="Train a workflow.")
-    test_workflow_parser = parser.add_parser(const.TEST, help="Test a workflow.")
     train_workflow_parser = add_workflow_command_arguments(
         train_workflow_parser, const.TRAIN
     )
-    test_workflow_parser = add_workflow_command_arguments(
-        test_workflow_parser, const.TEST
-    )
-    test_workflow_parser.add_argument(
-        "--out",
-        help="model",
-        default="The directory where the artifacts must be stored.",
-        required=True,
-    )
-    test_workflow_parser.add_argument(
-        "--join-id",
-        help="Join prediction dataframe and (truth) labeled dataframe by id.",
-        required=True,
-    )
 
     command = command_string.split() if command_string else None
     return argument_parser.parse_args(args=command)
@@ -129,5 +114,5 @@ def main(command_string: Optional[str] = None) -> None:
     args = project_command_parser(command_string=command_string)
     if args.command in [const.CREATE, const.UPDATE]:
         project.project_cli(args)
-    elif args.command in [const.TRAIN, const.TEST]:
+    elif args.command in [const.TRAIN]:
         workflow.workflow_cli(args)
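Reviewer sketch (not part of the commit): after this change the CLI accepts only `create`, `update`, and `train`. The standalone `argparse` example below mirrors that dispatch under stated assumptions. `build_parser` and `dispatch` are hypothetical names, the `create`/`update` flags are copied from the `dialogy create -h` output in the README diff, and `train` is left without arguments because its options are not visible in this diff.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three subcommands that remain after this commit."""
    parser = argparse.ArgumentParser(prog="dialogy")
    subparsers = parser.add_subparsers(
        dest="command", help="Dialogy project utilities."
    )
    create = subparsers.add_parser("create", help="Create a new project.")
    update = subparsers.add_parser(
        "update",
        help="Migrate an existing project to the latest template version.",
    )
    subparsers.add_parser("train", help="Train a workflow.")

    # create and update share the same project arguments.
    for project_parser in (create, update):
        project_parser.add_argument("project")
        project_parser.add_argument("--template")
        project_parser.add_argument("--dry-run", action="store_true")
        project_parser.add_argument("--namespace")
        project_parser.add_argument("--master", action="store_true")
    return parser


def dispatch(command_string: str) -> Optional[str]:
    """Return which handler group a command would be routed to."""
    args = build_parser().parse_args(command_string.split())
    if args.command in ("create", "update"):
        return "project"
    if args.command in ("train",):
        return "workflow"
    return None
```

With the `test` subparser removed, `dispatch("test")` no longer reaches a handler: `argparse` rejects the unknown subcommand and exits before any dispatch happens.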
diff --git a/dialogy/cli/workflow.py b/dialogy/cli/workflow.py
@@ -36,39 +36,6 @@ def train_workflow(args: argparse.Namespace) -> None:
         raise AttributeError from error
 
 
-def test_workflow(args: argparse.Namespace) -> None:
-    """
-    Test a workflow.
-    """
-    module = args.module
-    get_workflow_fn = args.fn
-    data = args.data
-    output_dir = args.out
-    join_id = args.join_id
-    lang = args.lang
-    project = args.project
-    kwargs = {"lang": lang, "project": project}
-
-    workflow: Workflow = get_workflow(module, get_workflow_fn, args.command, **kwargs)
-    test_df = pd.read_csv(data)
-    try:
-        result_df = workflow.prediction_labels(test_df, join_id)
-    except AttributeError as error:
-        logger.error(f"{workflow=} doesn't have a prediction_labels method? 🤔")
-        raise AttributeError from error
-
-    result_df = pd.merge(test_df, result_df, on=join_id)
-    report = pd.DataFrame(
-        classification_report(
-            result_df[const.LABELS],
-            result_df[const.INTENT],
-            zero_division=0,
-            output_dict=True,
-        )
-    ).T
-    report.to_csv(create_timestamps_path(output_dir, "report.csv"), index=False)
-
-
 def workflow_cli(args: argparse.Namespace) -> None:
     """
     CLI entry point for workflows.
@@ -85,8 +52,6 @@ def workflow_cli(args: argparse.Namespace) -> None:
     try:
         if command == const.TRAIN:
             train_workflow(args)
-        elif command == const.TEST:
-            test_workflow(args)
     except ModuleNotFoundError as error:
         logger.error(
             f"Could not import module {args.module} is "

diff --git a/dialogy/constants/__init__.py b/dialogy/constants/__init__.py
@@ -117,3 +117,4 @@ class SIGNAL:
 CREATE = "create"
 UPDATE = "update"
 ERROR_LABEL = "_error_"
+TEXT = "text"
diff --git a/dialogy/plugins/text/classification/xlmr.py b/dialogy/plugins/text/classification/xlmr.py
@@ -38,6 +38,7 @@ def __init__(
         data_column: str = const.DATA,
         label_column: str = const.LABELS,
         args_map: Optional[Dict[str, Any]] = None,
+        skip_labels: Optional[List[str]] = None,
         kwargs: Optional[Dict[str, Any]] = None,
     ) -> None:
         try:
@@ -62,6 +63,7 @@ def __init__(
             self.model_dir, const.LABELENCODER_FILE
         )
         self.threshold = threshold
+        self.skip_labels = set(skip_labels or set())
         self.purpose = purpose
         self.round = score_round_off
         if args_map and (
@@ -206,15 +208,22 @@ def train(self, training_data: pd.DataFrame) -> None:
             )
             return
 
+        skip_labels_filter = training_data[self.label_column].isin(self.skip_labels)
+        training_data = training_data[~skip_labels_filter].copy()
+
         encoder = self.labelencoder.fit(training_data[self.label_column])
         sample_size = 5 if len(training_data) > 5 else len(training_data)
-        training_data.rename(columns={self.data_column: "text"}, inplace=True)
-        training_data.loc[:, self.label_column] = encoder.transform(
-            training_data[self.label_column]
+        training_data.rename(
+            columns={self.data_column: const.TEXT, self.label_column: const.LABELS},
+            inplace=True,
+        )
+        training_data.loc[:, const.LABELS] = encoder.transform(
+            training_data[const.LABELS]
         )
+        training_data = training_data[[const.TEXT, const.LABELS]]
         self.init_model(len(encoder.classes_))
         logger.debug(
-            f"Displaying a few samples (this goes into the model):\n{training_data.sample(sample_size)}"
+            f"Displaying a few samples (this goes into the model):\n{training_data.sample(sample_size)}\nLabels: {len(encoder.classes_)}."
         )
         self.model.train_model(training_data)
         self.save()
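Reviewer sketch (not part of the commit): the `skip_labels` filter added to `train` can be tried in isolation. The helper name `drop_skipped_labels`, the sample frame, and the `_oos_` label are assumptions for illustration. Only the `isin` mask, the negation, and the `.copy()` come from the hunk above.

```python
from typing import Iterable, Optional

import pandas as pd


def drop_skipped_labels(
    training_data: pd.DataFrame,
    label_column: str,
    skip_labels: Optional[Iterable[str]] = None,
) -> pd.DataFrame:
    """Return a copy of training_data without rows whose label is skipped."""
    skip = set(skip_labels or set())
    skip_labels_filter = training_data[label_column].isin(skip)
    # .copy() avoids chained-assignment warnings when the caller later
    # renames or overwrites columns on the filtered frame.
    return training_data[~skip_labels_filter].copy()


# Hypothetical sample data; "_oos_" is an invented label name.
frame = pd.DataFrame(
    {
        "data": ["hello", "static noise", "goodbye"],
        "labels": ["greeting", "_oos_", "farewell"],
    }
)
filtered = drop_skipped_labels(frame, "labels", skip_labels=["_oos_"])
```

Because the filter runs before `labelencoder.fit`, skipped labels never become classes, so the count passed to `init_model` reflects only the labels that remain.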

diff --git a/dialogy/plugins/text/merge_asr_output/__init__.py b/dialogy/plugins/text/merge_asr_output/__init__.py
@@ -109,19 +109,19 @@ def transform(self, training_data: pd.DataFrame) -> pd.DataFrame:
             try:
                 asr_output = json.loads(row[self.data_column])
                 if asr_output:
-                    training_data.loc[i, self.data_column] = merge_asr_output(
-                        asr_output
-                    )[0]
+                    merged_asr_ouptut = merge_asr_output(asr_output)
+                    training_data.loc[i, self.data_column] = merged_asr_ouptut[0]
                 else:
                     training_data.loc[i, "use"] = False
             except Exception as error:  # pylint: disable=broad-except
                 training_data.loc[i, "use"] = False
-                logger.error(f"{error}\n{traceback.format_exc()}")
+                logger.error(f"{error} -- {asr_output}\n{traceback.format_exc()}")
 
         training_data_ = training_data[training_data.use].copy()
         training_data_.drop("use", axis=1, inplace=True)
         discarded_data = len(training_data) - len(training_data_)
-        logger.debug(
-            f"Discarding {discarded_data} samples because the alternatives couldn't be parsed."
-        )
+        if discarded_data:
+            logger.debug(
+                f"Discarding {discarded_data} samples because the alternatives couldn't be parsed."
+            )
         return training_data_
diff --git a/dialogy/workflow/workflow.py b/dialogy/workflow/workflow.py
@@ -203,43 +203,3 @@ def train(self, training_data: pd.DataFrame) -> None:
             transformed_data = plugin.transform(training_data)
             if transformed_data is not None:
                 training_data = transformed_data
-
-    def prediction_labels(
-        self, testing_data: pd.DataFrame, id_: Union[str, int]
-    ) -> pd.DataFrame:
-        """
-        Evaluate the workflow with all the embedded plugins.
-
-        Plugins can be evaluated individually for fine-tuning but since there are interactions
-        between them, we need to evaluate them all together. This helps in cases where these interactions
-        are a cause of a model's poor performance.
-
-        This method doesn't mutate the given test dataset, instead we produce results with the same `id_`
-        so that they can be joined and studied together.
-
-        :param testing_data: A pandas DataFrame containing the testing data.
-        :type testing_data: pd.DataFrame
-        :param id_: The join parameter, which is used to join the testing data with the ground truth data.
-        :type id_: Union[str, int]
-        :return: A pandas DataFrame containing the workflow output.
-        :rtype: pd.DataFrame
-        """
-        results = []
-        for _, row in tqdm(testing_data.iterrows(), total=len(testing_data)):
-            output = self.run(
-                input_={
-                    const.CLASSIFICATION_INPUT: json.loads(row[const.DATA])[
-                        const.ALTERNATIVES
-                    ]
-                }
-            )
-            intents = output.get(const.INTENTS, [])
-            if intents:
-                results.append(
-                    {
-                        id_: row[id_],
-                        const.INTENT: intents[0].name,
-                        const.SCORE: intents[0].score,
-                    }
-                )
-        return pd.DataFrame(results)