Merge pull request #5 from rcrowe-google/Kshitijaa/base/boilerplate
Added boilerplate code from Hello World example
deutranium committed Jun 14, 2021
2 parents c81eac4 + 8f4b1e8 commit 6ffb17b
Showing 12 changed files with 499 additions and 2 deletions.
44 changes: 44 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,44 @@
# Contribution Guidelines

## Directory Structure
The repo contains three main directories as follows:
- **[Component](./component):** Contains the main component code with a separate file for the executor code
- **[Data](./data):** Contains the sample data to be used for testing
- **[Example](./example):** Contains example code to test our component with the CSVs present in [data](./data)

## A few Git and GitHub practices

### Commits
Commits serve as checkpoints during your workflow and can be used to **revert** in case something gets messed up.
- **When to commit:** Try not to pile up many changes in a single commit, and avoid making too many commits for a small fix.
- **Commit messages:** Commit messages should be descriptive enough for an external person to get an idea of what the commit accomplished, without exceeding 50 characters.

Check out [this gist](https://gist.github.com/turbo/efb8d57c145e00dc38907f9526b60f17) for more information about good practices.
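
As a rough illustration of the 50-character guideline above, the sketch below checks the first line of a commit message. Applying the limit to the subject line only is an assumption, and `subject_within_limit` and `MAX_SUBJECT_LENGTH` are hypothetical names, not part of Git or this repo.

```python
MAX_SUBJECT_LENGTH = 50  # limit suggested by the guideline above


def subject_within_limit(message: str, limit: int = MAX_SUBJECT_LENGTH) -> bool:
    """Return True if the first line of a commit message is non-empty and fits the limit."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    return 0 < len(subject) <= limit


# The description of this very commit fits within the limit.
commit_ok = subject_within_limit("Added boilerplate code from Hello World example")
```

A check like this could be wired into a local `commit-msg` hook, though that is left as an option rather than a requirement.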

### Branches
Branches are a good way to work on different features simultaneously. Check out [git-scm](https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging) to learn more about the concepts involved.

For descriptive branch names, it is a good idea to use the following format:
**`name/keyword/short-description`**
- **Name:** Name of the person(s) working on the branch. This can be omitted if many people (>2) are expected to work on it.
- **Keyword:** This describes the "type" of work the branch is meant for. Typical keywords are:
  - `feature`: Adding/expanding a feature
  - `base`: Adding boilerplate/readme/templates etc.
  - `bug`: Fixes a bug
  - `junk`: Throwaway branch created to experiment
- **Short description:** As the name suggests, this contains a short description of the branch, usually no longer than 2-3 words separated by hyphens (`-`).

P.S. If multiple branches are being used to work on the same issue (say issue `#n`), they can be named as `name/keyword/#n-short-description`
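
The naming convention above can be sketched as a regular expression. This is only one possible reading of the format: the allowed characters in the name and description are assumptions, and `BRANCH_PATTERN` and `parse_branch_name` are hypothetical names introduced for illustration.

```python
import re
from typing import Dict, Optional

KEYWORDS = ("feature", "base", "bug", "junk")

# name/ is optional, keyword is one of KEYWORDS, "#n-" is an optional issue prefix,
# and the description is hyphen-separated words (character sets are an assumption).
BRANCH_PATTERN = re.compile(
    r"^(?:(?P<name>[A-Za-z0-9_-]+)/)?"
    r"(?P<keyword>" + "|".join(KEYWORDS) + r")/"
    r"(?:#(?P<issue>\d+)-)?"
    r"(?P<description>[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*)$"
)


def parse_branch_name(branch: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a branch name into its parts, or return None if it does not match."""
    match = BRANCH_PATTERN.match(branch)
    return match.groupdict() if match else None


# The branch merged in this commit follows the convention.
parsed = parse_branch_name("Kshitijaa/base/boilerplate")
```

For the branch in this commit, `parsed` holds the name `Kshitijaa`, the keyword `base`, and the description `boilerplate`, with no issue number.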

### Issues
The following points should be considered while creating new issues:
- Use relevant labels like `bug`, `feature` etc.
- If the team has decided who will work on it, the issue should be **assigned** to that person as soon as possible to prevent the same work from being done twice.
- The issue should be linked in the **project** if needed, and its status should be kept up to date as the work progresses.

### Pull Requests
It is always a good idea to ensure the following are present in your Pull Request description:
- Relevant issue/s
- What it accomplished
- Mention `[WIP]` in title and make it a `Draft Pull Request` if it is a work in progress
- Once the pull request is final, it should be **requested for review** from the concerned people
39 changes: 39 additions & 0 deletions PROPOSAL.md
@@ -0,0 +1,39 @@
#### SIG TFX-Addons
# Project Proposal

**Your name:** Pratishtha Abrol

**Your email:** pratishthaabrol@gmail.com

**Your company/organization:** Outreachy

**Project name:** [Schema curation custom component](https://github.com/tensorflow/tfx-addons/issues/8)

## Project Description
This project applies Python user code from a user-supplied module file to a schema produced by SchemaGen, to curate the schema based on domain knowledge.

## Project Category
Component

## Project Use-Case(s)
This project will allow the user to add a custom component that modifies the schema generated by SchemaGen component according to user knowledge, for example, fixing domain limits that were inferred wrongly by the SchemaGen component.

## Project Implementation
Implementation of the Schema Curation Custom Component can be done using the following approach:
- Get the base Schema using SchemaGen component of TFX
- User supplies a module file with a fully-custom component that defines the additions/changes to the initially generated schema through SchemaGen.
- An execution script would run on the module file, which sets and modifies variables accordingly.
- The base schema gets modified according to the module file and used further along the pipeline
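
The steps above can be sketched with the standard library alone. This is a simplified illustration, not the proposed component: a plain dict stands in for the schema that SchemaGen would produce, the hook name `curate_schema` is a hypothetical convention for the user module, and the domain values are illustrative. A real implementation would operate on a TFDV schema object instead.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical user-supplied module file. The function name `curate_schema`
# is an assumption made for this sketch.
USER_MODULE_SOURCE = '''
def curate_schema(schema):
    """Widen the inferred domain of trip_start_hour to all valid clock hours."""
    schema["trip_start_hour"]["domain"] = {"min": 0, "max": 23}
    return schema
'''


def load_module(path):
    """Import a Python file from an arbitrary path and return the module."""
    spec = importlib.util.spec_from_file_location("user_module", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run_curation(schema, module_path):
    """Apply the user's curation function to the base schema."""
    return load_module(module_path).curate_schema(schema)


# Plain-dict stand-in for a schema inferred from a small data sample.
base_schema = {"trip_start_hour": {"type": "INT", "domain": {"min": 1, "max": 12}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    module_file = Path(tmp_dir) / "user_module.py"
    module_file.write_text(USER_MODULE_SOURCE)
    curated_schema = run_curation(base_schema, str(module_file))
```

Loading the file by path keeps the user's domain knowledge outside the component code, which matches the module-file approach described in the list.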

## Project Dependencies
The implementation will use the [TFDV library](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv) for validation and modification of schema objects according to the module file provided by the user. The following two methods would be of special focus:
- [tfdv.set_domain](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/set_domain)
- [tfdv.write_schema_text](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/write_schema_text)

A similar implementation can be seen in the [Transform library](https://github.com/tensorflow/transform). Particularly, the [schema_utils](https://github.com/tensorflow/transform/blob/master/tensorflow_transform/tf_metadata/schema_utils.py) module could be useful.

## Project Team
**Project Leader** : Pratishtha Abrol, pratishtha-abrol, pratishthaabrol@gmail.com
1. Fatimah Adwan, FatimahAdwan, akilahafaf72@gmail.com
2. Kshitijaa Jaglan, deutranium, jaglan.kshitijaa2@gmail.com
3. Nirzari Gupta, nirzu97, nirzu97@gmail.com
19 changes: 17 additions & 2 deletions README.md
@@ -1,2 +1,17 @@
# schemacomponent
Outreachy TFX custom component project
# Schema Curation Custom Component

> Outreachy TFX custom component project
This repo contains the code for Schema Curation Custom Component made as a part of [TFX-Addons](https://github.com/tensorflow/tfx-addons/) through the [Outreachy](https://www.outreachy.org/outreachy-may-2021-internship-round/communities/tensorflow/#create-custom-components-and-tools-for-tensorflow-) program. You may view the linked Pull Request in TFX-Addons [here](https://github.com/tensorflow/tfx-addons/pull/32) and the issue [here](https://github.com/tensorflow/tfx-addons/issues/8) for relevant discussions related to the project.

## The Team:
### Mentors:
- Robert Crowe
- Thea Lamkin
- Josh Gordon

### Interns:
- [Fatimah Adwan](https://github.com/FatimahAdwan/FatimahAdwan)
- [Kshitijaa Jaglan](https://github.com/deutranium/)
- [Nirzari Gupta](https://github.com/Nirzu97)
- [Pratishtha Abrol](https://github.com/pratishtha-abrol)
Empty file added __init__.py
Empty file.
Empty file added component/__init__.py
Empty file.
88 changes: 88 additions & 0 deletions component/component.py
@@ -0,0 +1,88 @@
# Lint as: python3
# Copyright 2019 Google LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Example of a Hello World TFX custom component.
This custom component simply reads tf.Examples from input and passes through as
output. This is meant to serve as a kind of starting point example for creating
custom components.
This component along with other custom component related code will only serve as
an example and will not be supported by TFX team.
"""

from typing import Optional, Text

from tfx import types
from tfx.dsl.components.base import base_component
from tfx.dsl.components.base import executor_spec
from tfx.examples.custom_components.hello_world.hello_component import executor
from tfx.types import channel_utils
from tfx.types import standard_artifacts
from tfx.types.component_spec import ChannelParameter
from tfx.types.component_spec import ExecutionParameter


class HelloComponentSpec(types.ComponentSpec):
  """ComponentSpec for Custom TFX Hello World Component."""

  PARAMETERS = {
      # These are parameters that will be passed in the call to
      # create an instance of this component.
      'name': ExecutionParameter(type=Text),
  }
  INPUTS = {
      # This will be a dictionary with input artifacts, including URIs
      'input_data': ChannelParameter(type=standard_artifacts.Examples),
  }
  OUTPUTS = {
      # This will be a dictionary which this component will populate
      'output_data': ChannelParameter(type=standard_artifacts.Examples),
  }


class HelloComponent(base_component.BaseComponent):
  """Custom TFX Hello World Component.

  This custom component class consists of only a constructor.
  """

  SPEC_CLASS = HelloComponentSpec
  EXECUTOR_SPEC = executor_spec.ExecutorClassSpec(executor.Executor)

  def __init__(self,
               input_data: types.Channel = None,
               output_data: types.Channel = None,
               name: Optional[Text] = None):
    """Construct a HelloComponent.

    Args:
      input_data: A Channel of type `standard_artifacts.Examples`. This will
        often contain two splits: 'train', and 'eval'.
      output_data: A Channel of type `standard_artifacts.Examples`. This will
        usually contain the same splits as input_data.
      name: Optional unique name. Necessary if multiple Hello components are
        declared in the same pipeline.
    """
    # output_data will contain a list of Channels for each split of the data,
    # by default a 'train' split and an 'eval' split. Since HelloComponent
    # passes the input data through to output, the splits in output_data will
    # be the same as the splits in input_data, which were generated by the
    # upstream component.
    if not output_data:
      output_data = channel_utils.as_channel([standard_artifacts.Examples()])

    spec = HelloComponentSpec(input_data=input_data,
                              output_data=output_data, name=name)
    super(HelloComponent, self).__init__(spec=spec)
52 changes: 52 additions & 0 deletions component/component_test.py
@@ -0,0 +1,52 @@
# Lint as: python3
# Copyright 2019 Google LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for HelloComponent."""

import json

import tensorflow as tf

from tfx.examples.custom_components.hello_world.hello_component import component
from tfx.types import artifact
from tfx.types import channel_utils
from tfx.types import standard_artifacts


class HelloComponentTest(tf.test.TestCase):

  def setUp(self):
    super(HelloComponentTest, self).setUp()
    self.name = 'HelloWorld'

  def testConstruct(self):
    input_data = standard_artifacts.Examples()
    input_data.split_names = json.dumps(artifact.DEFAULT_EXAMPLE_SPLITS)
    output_data = standard_artifacts.Examples()
    output_data.split_names = json.dumps(artifact.DEFAULT_EXAMPLE_SPLITS)
    this_component = component.HelloComponent(
        input_data=channel_utils.as_channel([input_data]),
        output_data=channel_utils.as_channel([output_data]),
        name=u'Testing123')
    self.assertEqual(standard_artifacts.Examples.TYPE_NAME,
                     this_component.outputs['output_data'].type_name)
    artifact_collection = this_component.outputs['output_data'].get()
    for artifacts in artifact_collection:
      split_list = json.loads(artifacts.split_names)
      # list.sort() sorts in place and returns None, so comparing its results
      # would always pass. Compare sorted copies instead.
      self.assertEqual(sorted(artifact.DEFAULT_EXAMPLE_SPLITS),
                       sorted(split_list))


if __name__ == '__main__':
  tf.test.main()
88 changes: 88 additions & 0 deletions component/executor.py
@@ -0,0 +1,88 @@
# Lint as: python3
# Copyright 2019 Google LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Example of a Hello World TFX custom component.
This custom component simply passes examples through. This is meant to serve as
a kind of starting point example for creating custom components.
This component along with other custom component related code will only serve as
an example and will not be supported by TFX team.
"""

import json
import os
from typing import Any, Dict, List, Text


from tfx import types
from tfx.dsl.components.base import base_executor
from tfx.dsl.io import fileio
from tfx.types import artifact_utils
from tfx.utils import io_utils


class Executor(base_executor.BaseExecutor):
  """Executor for HelloComponent."""

  def Do(self, input_dict: Dict[Text, List[types.Artifact]],
         output_dict: Dict[Text, List[types.Artifact]],
         exec_properties: Dict[Text, Any]) -> None:
    """Copy the input_data to the output_data.

    For this example that is all that the Executor does. For a different
    custom component, this is where the real functionality of the component
    would be included.

    This component both reads and writes Examples, but a different component
    might read and write artifacts of other types.

    Args:
      input_dict: Input dict from input key to a list of artifacts, including:
        - input_data: A list of type `standard_artifacts.Examples` which will
          often contain two splits, 'train' and 'eval'.
      output_dict: Output dict from key to a list of artifacts, including:
        - output_data: A list of type `standard_artifacts.Examples` which will
          usually contain the same splits as input_data.
      exec_properties: A dict of execution properties, including:
        - name: Optional unique name. Necessary iff multiple Hello components
          are declared in the same pipeline.

    Returns:
      None

    Raises:
      OSError and its subclasses
    """
    self._log_startup(input_dict, output_dict, exec_properties)

    input_artifact = artifact_utils.get_single_instance(
        input_dict['input_data'])
    output_artifact = artifact_utils.get_single_instance(
        output_dict['output_data'])
    output_artifact.split_names = input_artifact.split_names

    split_to_instance = {}

    for split in json.loads(input_artifact.split_names):
      uri = artifact_utils.get_split_uri([input_artifact], split)
      split_to_instance[split] = uri

    for split, instance in split_to_instance.items():
      input_dir = instance
      output_dir = artifact_utils.get_split_uri([output_artifact], split)
      for filename in fileio.listdir(input_dir):
        input_uri = os.path.join(input_dir, filename)
        output_uri = os.path.join(output_dir, filename)
        io_utils.copy_file(src=input_uri, dst=output_uri, overwrite=True)
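
The pass-through logic of `Do` can be tried without TFX using a standard-library sketch. It assumes each split lives in a directory named after the split under a root directory, which is a simplification: in the executor, `artifact_utils.get_split_uri` resolves the real locations. `copy_splits` is a hypothetical helper, not a TFX function.

```python
import os
import shutil
import tempfile


def copy_splits(input_root, output_root, split_names):
    """Copy every file of each split directory from input_root to output_root."""
    copied = []
    for split in split_names:
        input_dir = os.path.join(input_root, split)
        output_dir = os.path.join(output_root, split)
        os.makedirs(output_dir, exist_ok=True)
        for filename in sorted(os.listdir(input_dir)):
            src = os.path.join(input_dir, filename)
            dst = os.path.join(output_dir, filename)
            shutil.copyfile(src, dst)  # overwrites dst if it already exists
            copied.append(dst)
    return copied


with tempfile.TemporaryDirectory() as root:
    in_root = os.path.join(root, "in")
    out_root = os.path.join(root, "out")
    # Create one small file per split to stand in for the Examples files.
    for split in ("train", "eval"):
        os.makedirs(os.path.join(in_root, split))
        with open(os.path.join(in_root, split, "data.txt"), "w") as handle:
            handle.write(split)
    copied_files = copy_splits(in_root, out_root, ["train", "eval"])
    copied_names = [os.path.relpath(path, out_root) for path in copied_files]
```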
5 changes: 5 additions & 0 deletions data/data.csv
@@ -0,0 +1,5 @@
pickup_community_area,fare,trip_start_month,trip_start_hour,trip_start_day,trip_start_timestamp,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,trip_miles,pickup_census_tract,dropoff_census_tract,payment_type,company,trip_seconds,dropoff_community_area,tips
60,27.05,10,2,3,1380593700,41.836150155,-87.648787952,,,12.6,,,Cash,Taxi Affiliation Services,1380,,0.0
10,5.85,10,1,2,1382319000,41.985015101,-87.804532006,,,0.0,,,Cash,Taxi Affiliation Services,180,,0.0
14,16.65,5,7,5,1369897200,41.968069,-87.721559063,,,0.0,,,Cash,Dispatch Taxi Affiliation,1080,,0.0
13,16.45,11,12,3,1446554700,41.983636307,-87.723583185,,,6.9,,,Cash,,780,,0.0
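
To see why schema curation matters for a sample this small, the sketch below parses the four rows above with the standard `csv` module. It lists the columns that are empty in every row and the observed range of `trip_start_hour`. The variable names are illustrative only.

```python
import csv
import io

SAMPLE_CSV = """\
pickup_community_area,fare,trip_start_month,trip_start_hour,trip_start_day,trip_start_timestamp,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude,trip_miles,pickup_census_tract,dropoff_census_tract,payment_type,company,trip_seconds,dropoff_community_area,tips
60,27.05,10,2,3,1380593700,41.836150155,-87.648787952,,,12.6,,,Cash,Taxi Affiliation Services,1380,,0.0
10,5.85,10,1,2,1382319000,41.985015101,-87.804532006,,,0.0,,,Cash,Taxi Affiliation Services,180,,0.0
14,16.65,5,7,5,1369897200,41.968069,-87.721559063,,,0.0,,,Cash,Dispatch Taxi Affiliation,1080,,0.0
13,16.45,11,12,3,1446554700,41.983636307,-87.723583185,,,6.9,,,Cash,,780,,0.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Columns that have no value in any of the sample rows.
always_empty = [col for col in rows[0] if all(row[col] == "" for row in rows)]

# Observed range of trip_start_hour in the sample.
hours = [int(row["trip_start_hour"]) for row in rows]
hour_range = (min(hours), max(hours))
```

Here `hour_range` is `(1, 12)`, so a domain inferred from this sample alone would miss valid hours, which is the kind of case the curation component is meant to correct.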
Empty file added example/__init__.py
Empty file.