Generate feature column eagerly by simba-git · Pull Request #10 · streamsql-io/streamsql

simba-git · 2020-06-22T19:03:59Z

When generating a training dataset, we need all the values in the
feature column. We used to generate each value only at lookup time;
however, it's easier to generate the column eagerly so that it can be
used natively in pandas.

Simba added 18 commits June 22, 2020 11:55

Generate feature column eagerly

600de79

When generating a training dataset, we need all the values in the feature column. We used to generate each value only at lookup time; however, it's easier to generate the column eagerly so that it can be used natively in pandas.
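To illustrate the difference between the two approaches, here is a minimal sketch in pandas. The table, the column names, and the `double` transformation are all hypothetical and are not taken from the streamsql code base.

```python
import pandas as pd

# Hypothetical source table; the column names are illustrative only.
table = pd.DataFrame({"user": ["a", "b", "c"], "spend": [10.0, 20.0, 30.0]})


def double(value):
    """Hypothetical feature transformation."""
    return value * 2


def lookup_lazy(entity):
    """Lazy approach: compute one feature value at lookup time."""
    row = table.loc[table["user"] == entity, "spend"]
    return double(row.iloc[0])


# Eager approach: compute the whole feature column once and keep it as a
# pandas Series indexed by entity, so it can be used like any other column.
feature_column = table.set_index("user")["spend"].apply(double)
```

With the eager version, a training set can pull every value from `feature_column` in one vectorized operation instead of calling a lookup per row.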

Keep row-number indices as numbers

9279a6a

Table indices are cast to strings. This is a reasonable default for our current use cases, except when there is no index column. In that case, the row number should be used as the index and should not be cast to a string.
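A rough sketch of the rule described in this commit, assuming tables are backed by pandas DataFrames. The `load_table` helper and its `index_column` parameter are hypothetical names used only for illustration.

```python
import pandas as pd


def load_table(frame, index_column=None):
    """Hypothetical helper: apply the index rule from the commit message."""
    if index_column is None:
        # No index column: keep the row number as an integer index.
        return frame.reset_index(drop=True)
    indexed = frame.set_index(index_column)
    # An explicit index column is cast to strings.
    indexed.index = indexed.index.astype(str)
    return indexed
```

For example, `load_table(frame, "id")` would yield string index values such as `"1"`, while `load_table(frame)` keeps `0, 1, 2, ...` as integers.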

Rename eagerly generated transformed column

1fdcb76

The column should have the same name as the feature.

Use streamsql.errors in local.py

f722aa5

This avoids manual importing of every error in errors.py.

Add egg-info directories to .gitignore

8ab2e9b

Set primary key default to None on a Table

df2483d

This clarifies that no column is being set to the primary key.

Rename feature_values to entity_values in local.py

68644c5

entity_values is an array of the value of each entity, not the value of each feature.

Add training set generation

3f8cc6a

Training sets can be configured by providing labels, features, and entity mappings. The entities mapping is used to get the actual value of each feature per label.
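As a sketch of how labels, features, and an entity mapping could fit together, the snippet below builds a small training set with pandas. The function name `build_training_set`, the data, and the shape of the mapping (feature name to label column) are assumptions made for illustration, not the API added in this PR.

```python
import pandas as pd

# Label rows reference an entity through the "user" column (hypothetical data).
labels = pd.DataFrame({"user": ["a", "b", "a"], "clicked": [1, 0, 1]})

# Each feature is assumed to be a Series indexed by entity value.
features = {"avg_spend": pd.Series({"a": 12.5, "b": 3.0})}

# Assumed mapping: feature name -> label column holding the entity value.
entity_mapping = {"avg_spend": "user"}


def build_training_set(labels, label_column, features, entity_mapping):
    """Hypothetical helper: one row per label, one column per feature."""
    training = pd.DataFrame(index=labels.index)
    for name, column in features.items():
        entity_values = labels[entity_mapping[name]]
        # Look up the feature value for the entity on each label row.
        training[name] = entity_values.map(column)
    training[label_column] = labels[label_column]
    return training


training_set = build_training_set(labels, "clicked", features, entity_mapping)
```

Here the entity mapping tells the builder which label column to read when looking up each feature, so the same feature value is reused for every label row that refers to the same entity.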

Test feature set generation

0ce99e5

Adds two tests, one to check entity mapping across two CSVs and one where the features are in the same file as the labels.

Format python client test files

88e34e7

Set Python client version to 0.0.1

5886002

It's cleaner not to have any letters in the version.

Copies README into client/

74d0629

This allows python3 setup.py sdist to behave properly.

Add client dist/ to git ignore

24e23b4

This dist/ directory is generated when pushing a new version to PyPI.

Fix client setup.py

731a09e

It was missing a comma.

Bump Python Client version to 2.0.0

f878411

Turn name into a private variable in Column

583ecbc

Name should only be changed via the rename method. This removes any confusion about this.

Make Column immutable

ceda5c0

This simplifies the API and expectations. Currently, renaming only happens when a column is transformed.
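One way to read the last two commits together is sketched below: the name is stored privately, exposed read-only, and any rename or transformation returns a new object. This `Column` class is a hypothetical stand-in written for illustration; the real class in the client may differ in its methods and storage.

```python
class Column:
    """Hypothetical immutable column; not the actual streamsql class."""

    def __init__(self, name, values):
        self._name = name            # private: no public setter is provided
        self._values = tuple(values)  # a tuple cannot be modified in place

    @property
    def name(self):
        return self._name

    def rename(self, new_name):
        # Return a new Column instead of mutating this one.
        return Column(new_name, self._values)

    def transform(self, fn, new_name):
        # A transformation yields a new, renamed Column.
        return Column(new_name, (fn(value) for value in self._values))

    def to_list(self):
        return list(self._values)
```

Because `name` is a property without a setter, an assignment such as `column.name = "x"` raises `AttributeError`, which leaves `rename` as the only way to obtain a column with a different name.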

Add more tests for Column

582f1b2

This adds a series of simple tests for a column containing all integers from 1 to 100 (inclusive).
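The tests themselves are not shown on this page. As a rough idea of what such checks could look like, here is a sketch using `unittest`, with a plain Python list standing in for the column; the test names and the assertions chosen are assumptions.

```python
import unittest


class IntegerColumnTest(unittest.TestCase):
    """Hypothetical checks on a column holding the integers 1 to 100."""

    def setUp(self):
        # A plain list stands in for the real Column object.
        self.values = list(range(1, 101))

    def test_length(self):
        self.assertEqual(len(self.values), 100)

    def test_bounds_are_inclusive(self):
        self.assertEqual(min(self.values), 1)
        self.assertEqual(max(self.values), 100)

    def test_sum(self):
        # Sum of 1..n is n * (n + 1) / 2.
        self.assertEqual(sum(self.values), 100 * 101 // 2)

    def test_transform_keeps_length(self):
        doubled = [value * 2 for value in self.values]
        self.assertEqual(len(doubled), len(self.values))
        self.assertEqual(doubled[-1], 200)
```

Saved to a file, these can be run with `python -m unittest`.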

simba-git force-pushed the feature/simba/training-api branch from 8b46a59 to 582f1b2 June 24, 2020 02:04

simba-git merged commit 3443fcd into develop Jun 24, 2020

simba-git deleted the feature/simba/training-api branch June 24, 2020 02:21

simba-git added this to the v0.0.0a1 milestone Jun 24, 2020
