Refactor transforms
pauldraper committed Aug 23, 2021
1 parent 213ba52 commit 454f500
Showing 27 changed files with 64,576 additions and 492 deletions.
3 changes: 2 additions & 1 deletion Makefile
@@ -41,14 +41,15 @@ schema: $(SCHEMA_TGT)
# Format
###
FORMAT_SRC := $(shell find . $(TARGET:%=-not \$(LPAREN) -name % -prune \$(RPAREN)) -name '*.py')
PRETTIER_SRC := $(shell find . $(TARGET:%=-not \$(LPAREN) -name % -prune \$(RPAREN)) -name '*.md')

.PHONY: format
format: target/format.target

.PHONY: test-format
test-format: target/format-test.target

target/format.target: $(FORMAT_SRC) target/node_modules.target
target/format.target: $(FORMAT_SRC) $(PRETTIER_SRC) target/node_modules.target
	mkdir -p $(@D)
	isort --profile black $(FORMAT_SRC)
	black -t py37 $(FORMAT_SRC)
74 changes: 15 additions & 59 deletions README.md
@@ -117,74 +117,30 @@ Replacements are deterministic for a given pepper. By default, the pepper is
randomly generated each run. You may specify it as `--pepper`. Note that
possession of the pepper makes the data guessable.
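
As a rough illustration of why a fixed pepper gives repeatable output, and why knowing the pepper makes the data guessable, the sketch below picks a replacement from an HMAC of the input keyed by the pepper. The helper name `pick_replacement`, the choice of HMAC-SHA256, and the candidate list are assumptions made for this example; the diff does not show how slicedb actually derives replacements.

```python
import hashlib
import hmac


def pick_replacement(value: str, pepper: bytes, candidates: list) -> str:
    """Deterministically map `value` to one of `candidates` (hypothetical helper)."""
    digest = hmac.new(pepper, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]


names = ["Alice", "Bob", "Carol", "Dave"]
pepper = b"example-pepper"

# Same pepper and same input always select the same candidate.
first = pick_replacement("Eve", pepper, names)
second = pick_replacement("Eve", pepper, names)
print(first == second)  # True
```

Anyone holding the pepper can run the same mapping over guessed inputs, which is the sense in which possession of the pepper weakens the anonymization.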

Transformation may operate an existing slice, or happen during the dump.
Transformation may operate on an existing slice (TODO), or happen during the
dump.

### Transforms
### Configuration

#### alphanumeric
Transforms are specified by:

Replace alphanumeric characters, preserve the type and case of characters.
- `class`, the Python class
- `config`, transform-specific options
- `module`, defaults to `slice_db.transforms`

- `caseInsensitive` - Whether the value is case-insensitive
- `unique` - Whether to generate a unique value
The name given to the transform is appended to the global pepper.
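
A minimal sketch of how such an entry might be resolved, assuming the class is looked up by name on the given module and that "appended" means plain byte concatenation. The helper names `resolve_transform_class` and `transform_pepper` are hypothetical and not taken from slicedb.

```python
import importlib

DEFAULT_MODULE = "slice_db.transforms"  # default stated in the text above


def resolve_transform_class(spec: dict):
    """Import the class named by a transform entry (hypothetical helper)."""
    module = importlib.import_module(spec.get("module", DEFAULT_MODULE))
    return getattr(module, spec["class"])


def transform_pepper(global_pepper: bytes, name: str) -> bytes:
    """Append the transform's name to the global pepper (assumed: concatenation)."""
    return global_pepper + name.encode("utf-8")


# A standard-library class stands in for a real transform so the sketch runs anywhere.
spec = {"class": "OrderedDict", "module": "collections", "config": None}
cls = resolve_transform_class(spec)
print(cls.__name__)
print(transform_pepper(b"secret", "email"))
```

Deriving a per-transform pepper this way would mean two transforms with different names produce unrelated replacements for the same input value.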

### composite
### Custom transforms

Parse as a PostgreSQL composite, with suboptions (TODO).
To create custom transforms, implement `slice_db.transform.Transform`, expose
the class on a module, and install the module so that it is accessible by
`slicedb`.
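
The interface of `slice_db.transform.Transform` is not shown in this diff, so the sketch below uses a stand-in base class with an assumed constructor (`config`) and an assumed single method (`transform`). Treat every name here as hypothetical; it only illustrates the general shape of a custom transform.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Transform(ABC):
    """Hypothetical stand-in for slice_db.transform.Transform (real API not shown)."""

    @abstractmethod
    def transform(self, text: Optional[str]) -> Optional[str]:
        ...


class MaskEmailTransform(Transform):
    """Example custom transform: hide the local part of an email address."""

    def __init__(self, config=None):
        # `config` mirrors the transform-specific options from the configuration.
        self._mask = (config or {}).get("mask", "user")

    def transform(self, text):
        if text is None:
            return None  # leave NULLs alone
        local, sep, domain = text.partition("@")
        # Values without an "@" are returned unchanged.
        return f"{self._mask}{sep}{domain}" if sep else text


masked = MaskEmailTransform().transform("jane@example.com")
print(masked)
```

If the class lived in an installed module named, say, `my_transforms`, the corresponding configuration entry would presumably set `class` to `MaskEmailTransform` and `module` to `my_transforms`.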

#### const
### Built-in transforms

Const value
The `slice_db.transforms` package has many common transforms.

Params are that value

#### date_year

Change date by up to one year.

### geozip

Replace zip code, preserving the first three digits.

Uses [https://simplemaps.com/data/us-zips](https://simplemaps.com/data/us-zips).

### given_name

Replace given name.

Uses
[https://www.ssa.gov/cgi-bin/popularnames.cgi](https://www.ssa.gov/cgi-bin/popularnames.cgi).

- `caseInsensitive` - Whether the value is case-insensitive

### json_object

Operation on json_object, with options.

- `properties` - Object of properties to transforms

### json_string

Operation on json_object, with options.

- Inner transform

### null

Null value.

### person_name

Replace name.

### surname

Replace surname

Uses
[https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/surnames.csv](https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/surnames.csv)

- `caseInsensitive` - Whether the value is case-insensitive
See [transforms.md](doc/transforms.md) for the full list.

## Restore

104 changes: 104 additions & 0 deletions doc/transforms.md
@@ -0,0 +1,104 @@
# Built-in transforms

## Alphanumeric

Replace alphanumeric characters, preserving the type of characters.
Non-alphanumeric characters are left unchanged.

Class: `AlphanumericTranform`

Config:

- `unique` - Whether to generate a unique value
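
A minimal sketch of the behavior described above, assuming that "type" covers digit, uppercase letter, and lowercase letter, and that replacements are derived from a keyed hash. The function name and the HMAC-based scheme are assumptions for illustration, not the class's actual implementation; the `unique` option is not modeled.

```python
import hashlib
import hmac
import string


def scramble_alphanumeric(text: str, pepper: bytes) -> str:
    """Replace ASCII letters and digits, keeping each character's class."""
    stream = hmac.new(pepper, text.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(text):
        byte = stream[i % len(stream)]
        if ch in string.digits:
            out.append(string.digits[byte % 10])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[byte % 26])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[byte % 26])
        else:
            out.append(ch)  # punctuation and spaces stay as-is
    return "".join(out)


result = scramble_alphanumeric("AB-12 cd", b"example-pepper")
print(result)
```

This simplified version also leaves non-ASCII letters untouched, which may differ from the real transform.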

## City

Replace with a US city.

Class: `CityTransform`

Uses [plotly/datasets](https://raw.githubusercontent.com/plotly/datasets/master/us-cities-top-1k.csv).

## Compose

Class: `ComposeTransform`

## Composite

Class: TODO

Parse as a PostgreSQL composite, with suboptions.

## Constant

Replace non-null values with a string.

Class: `ConstantTransform`

Config: The value to use.
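
Read literally, this means NULLs pass through and everything else becomes the configured string. A one-function sketch of that reading, with a hypothetical function name:

```python
from typing import Optional


def constant_transform(value: Optional[str], config: str) -> Optional[str]:
    """Return `config` for any non-null input; keep None as None."""
    return None if value is None else config


print(constant_transform("secret note", "REDACTED"))
print(constant_transform(None, "REDACTED"))
```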

## Date (year)

Change date by up to one year.

Class: `DateYearTransform`
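
A minimal sketch of a bounded date shift, assuming "up to one year" means at most 365 days in either direction and that the offset is derived from a keyed hash. Both the interpretation and the function name are assumptions; the diff does not show the real logic.

```python
import hashlib
import hmac
from datetime import date, timedelta


def shift_date(value: date, pepper: bytes) -> date:
    """Shift a date by a pepper-derived offset of at most 365 days either way."""
    digest = hmac.new(pepper, value.isoformat().encode("ascii"), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % 731 - 365  # range -365..365
    return value + timedelta(days=offset)


original = date(2021, 8, 23)
shifted = shift_date(original, b"example-pepper")
print(abs((shifted - original).days) <= 365)  # True
```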

## Geozip

Replace zip code, preserving the first three digits.

Class: `GeopzipTransform`

Uses [simplemaps.com](https://simplemaps.com/data/us-zips).
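
A simplified sketch of prefix-preserving replacement. The real transform draws on the simplemaps dataset mentioned above, presumably to return zip codes that exist; this version just derives the last two digits from a keyed hash, and its function name is hypothetical.

```python
import hashlib
import hmac


def replace_zip(zip_code: str, pepper: bytes) -> str:
    """Keep the three-digit prefix and derive the last two digits (illustrative)."""
    prefix = zip_code[:3]
    digest = hmac.new(pepper, zip_code.encode("ascii"), hashlib.sha256).digest()
    suffix = int.from_bytes(digest[:2], "big") % 100
    return f"{prefix}{suffix:02d}"


new_zip = replace_zip("84101", b"example-pepper")
print(new_zip[:3])  # 841
```

Keeping the first three digits preserves the coarse geographic region while hiding the exact delivery area.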

## Given name

Replace given name.

Class: `GivenNameTransform`

Uses
[www.ssa.gov](https://www.ssa.gov/cgi-bin/popularnames.cgi).

## JSONPath

Class: `JsonPathTransform`

Config: Array of entries where each entry is

- `path` - JSONPath expression
- `transform` - Name of transform
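
For illustration, a configuration of this shape could look like the following, built as a Python dict and printed as JSON. It assumes that `transform` refers to the name of another entry in the same transforms mapping; the entry names `surname` and `profile` and the path are made up for the example.

```python
import json

transforms = {
    "surname": {"class": "SurnameTransform"},
    "profile": {
        "class": "JsonPathTransform",
        "config": [
            # Each entry pairs a JSONPath expression with a transform name.
            {"path": "$.name.last", "transform": "surname"},
        ],
    },
}

print(json.dumps(transforms, indent=2))
```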

## Null

Null value.

Class: `NullTransform`

## Surname

Replace surname.

Class: `SurnameTransform`

Uses
[fivethirtyeight/data](https://raw.githubusercontent.com/fivethirtyeight/data/master/most-common-name/surnames.csv).

## US state

Class: `UsStateTransform`

Config:

- `abbr` - Whether to use abbreviation (default false)

Based on [rogerallen/1583593](https://gist.github.com/rogerallen/1583593).

## Words

Replace with random words.

Class: `WordsTransform`

Uses
[first20hours/google-10000-english](https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-no-swears.txt).
36 changes: 25 additions & 11 deletions schema/transform.yml
@@ -6,17 +6,7 @@ definitions:
  column:
    description: Column
    title: Column
    properties:
      params:
        default:
        description: Parameters for transform
        title: Transform parameters
      transform:
        description: Type of transform
        title: Transform
        type: string
    required: [transform]
    type: object
    type: string
  table:
    description: Table
    properties:
@@ -27,10 +17,34 @@ definitions:
        type: object
    title: Table
    type: object
  transform:
    description: Transform
    properties:
      class:
        description: Class name.
        title: Class
        type: string
      config:
        default: null
        description: Configuration given to transform
        title: Configuration
      module:
        default: "slice_db.transforms"
        description: Module name.
        title: Module
        type: string
    required: [class]
    title: Transform
properties:
  tables:
    additionalProperties: { $ref: "#/definitions/table" }
    description: Tables. Omitted tables are left untouched.
    title: Tables
    type: object
  transforms:
    additionalProperties: { $ref: "#/definitions/transform" }
    description: Transforms.
    title: Transforms
    type: object
required: [tables, transforms]
type: object
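
The defaults and the `required` constraint in the `transform` definition above can be sketched in plain Python as follows. This is a hand-written illustration of what the schema expresses, not how slicedb validates input (`jsonschema` is listed in `setup.py`, so real validation presumably goes through it). The function name and the sample document are hypothetical.

```python
def normalize_transform(entry: dict) -> dict:
    """Apply the schema's defaults to one transform entry (illustrative only)."""
    if "class" not in entry:
        raise ValueError("transform entry requires 'class'")
    return {
        "class": entry["class"],
        "config": entry.get("config"),  # schema default: null
        "module": entry.get("module", "slice_db.transforms"),
    }


document = {
    "tables": {},
    "transforms": {"given": {"class": "GivenNameTransform"}},
}
normalized = {
    name: normalize_transform(entry)
    for name, entry in document["transforms"].items()
}
print(normalized["given"]["module"])
```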
1 change: 1 addition & 0 deletions setup.py
@@ -33,6 +33,7 @@
    install_requires=[
        "asyncpg",
        "dataclasses_json==0.3.7",
        "jsonpath-ng",
        "jsonschema",
        "numpy",
        "pg-sql",
9 changes: 9 additions & 0 deletions slice_db/cli/common.py
@@ -1,6 +1,15 @@
import argparse
import json
import sys


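def json_type(string: str):
    try:
        return json.loads(string)
    except json.decoder.JSONDecodeError as e:
        # argparse expects ArgumentTypeError from a type callable;
        # ArgumentError needs (argument, message) and fails with one argument.
        raise argparse.ArgumentTypeError(str(e))


def open_bytes_read(path):
    return open(path, "rb") if path != "-" else sys.stdin.buffer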

31 changes: 17 additions & 14 deletions slice_db/cli/main.py
@@ -7,6 +7,7 @@

from ..log import TRACE
from ..version import __version__
from .common import json_type

warnings.filterwarnings("ignore")

@@ -79,17 +80,17 @@ def create_parser():
        description="Provide one of the subcommands for more specific help.",
    )

    add_dump_command(subparsers)
    add_restore_command(subparsers)
    add_schema_command(subparsers)
    add_schema_filter_command(subparsers)
    add_transform_command(subparsers)
    add_transform_field_command(subparsers)
    _add_dump_command(subparsers)
    _add_restore_command(subparsers)
    _add_schema_command(subparsers)
    _add_schema_filter_command(subparsers)
    _add_transform_command(subparsers)
    _add_transform_field_command(subparsers)

    return parser


def add_dump_command(subparsers):
def _add_dump_command(subparsers):
    parser = subparsers.add_parser(
        "dump",
        description="Dump data from database.",
@@ -148,7 +149,7 @@ def add_dump_command(subparsers):
    )


def add_restore_command(subparsers):
def _add_restore_command(subparsers):
    parser = subparsers.add_parser(
        "restore", description="Restore data.", formatter_class=ArgumentFormatter
    )
@@ -192,7 +193,7 @@ def add_restore_command(subparsers):
    )


def add_schema_command(subparsers):
def _add_schema_command(subparsers):
    parser = subparsers.add_parser(
        "schema",
        description="Collect schema from database.",
@@ -207,7 +208,7 @@ def add_schema_command(subparsers):
    )


def add_schema_filter_command(subparsers):
def _add_schema_filter_command(subparsers):
    parser = subparsers.add_parser("schema-filter")
    parser.add_argument("-i", "--input", default="-", help="Input")
    parser.add_argument("-o", "--output", default="-", help="Output")
@@ -229,24 +230,26 @@ def add_schema_filter_command(subparsers):
    children_parser.add_argument("table", nargs="*")


def add_transform_command(subparsers):
def _add_transform_command(subparsers):
    parser = subparsers.add_parser(
        "transform", description="Transform slice", formatter_class=ArgumentFormatter
    )
    update_help(parser)
    parser.add_argument("--transform", required=True)


def add_transform_field_command(subparsers):
def _add_transform_field_command(subparsers):
    parser = subparsers.add_parser(
        "transform-field",
        description="Transform field",
        formatter_class=ArgumentFormatter,
    )
    update_help(parser)
    parser.add_argument(
        "--transforms", help="Transform JSON", required=True, type=json_type
    )
    parser.add_argument("--name", default="", help="Name of transform")
    parser.add_argument("--pepper", help="Pepper.")
    parser.add_argument("--transform", required=True)
    parser.add_argument("--params", default="null")
    parser.add_argument("field")
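
To show how the reworked `transform-field` arguments fit together, here is a self-contained sketch that rebuilds only that subcommand with the standard `argparse` module and parses a sample command line. The `ArgumentFormatter` and `update_help` helpers from the real module are left out, the program name `slicedb` and the sample values are assumptions, and `json_type` is re-declared here using `argparse.ArgumentTypeError`, the exception argparse documents for `type` callables.

```python
import argparse
import json


def json_type(string: str):
    """argparse `type=` callable that parses a JSON argument."""
    try:
        return json.loads(string)
    except json.JSONDecodeError as e:
        raise argparse.ArgumentTypeError(str(e))


parser = argparse.ArgumentParser(prog="slicedb")
subparsers = parser.add_subparsers(dest="command")

field_parser = subparsers.add_parser("transform-field", description="Transform field")
field_parser.add_argument(
    "--transforms", help="Transform JSON", required=True, type=json_type
)
field_parser.add_argument("--name", default="", help="Name of transform")
field_parser.add_argument("--pepper", help="Pepper.")
field_parser.add_argument("field")

args = parser.parse_args(
    [
        "transform-field",
        "--transforms",
        '{"last": {"class": "SurnameTransform"}}',
        "--name",
        "last",
        "Smith",
    ]
)
print(args.transforms["last"]["class"])  # SurnameTransform
print(args.field)  # Smith
```

With `type=json_type`, the JSON is decoded during argument parsing, so malformed input is reported as a normal usage error before any transform runs.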

