
Unable to create features dynamically. #228

Closed
shivam1892 opened this issue Feb 18, 2019 · 10 comments
shivam1892 commented Feb 18, 2019

Describe the bug
I was trying to run the sample examples given here. In the examples we need to define a schema file (a case class) for the data supplied to the TransmogrifAI reader, e.g. in the Boston housing price example. I would like to know whether I can work with TransmogrifAI without defining a schema for the data, similar to inferSchema in Spark, which automatically infers the schema from the data provided, for example from a CSV file.

To Reproduce
Minimal set of steps or code snippet to reproduce the behavior

Expected behavior
I just need to provide the CSV file (or any other file); TransmogrifAI should create a DataFrame by inferring the schema and run the algorithms on top of that.


Additional context
I know there is functionality where we can use an Avro schema to create a schema class, which generates a Java class for the schema, but again we need to define FeatureBuilders on top of that, e.g. the Iris data multi-classification example.

Can it be done without defining these schemas, i.e. by inferring the schema, creating the FeatureBuilders automatically, and running the algorithms on top of that? Example of inferSchema from a CSV file in Spark Scala:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(sampleData)
shivam1892 (Author)

I have also tried this example, where the feature engineering is done automatically. Here I removed the case class (Passenger) from the code and tried to generate it dynamically with Avro, but I am getting a compile-time error as shown:
[screenshot of compile-time error, 2019-02-18]

Kindly provide the solution to achieve this functionality.


tovbinm commented Feb 19, 2019

Yes, we have a CSV reader that allows inferring schema automatically. See CSVAutoReader.

Here is an example way to use it:

val autoReader = new CSVAutoReader[GenericRecord](readPath, _.get("id").toString)

Notes:

  1. You can use any other more specific type instead of GenericRecord if you'd like.
  2. You can customize CSV parsing options using the CSVOptions argument.
  3. You can also customize the field names by passing in headers.
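As a minimal sketch of notes 1 and 3 above (the key and headers parameter names follow the snippets in this thread; the column names are illustrative, and the exact constructor signature should be checked against the CSVAutoReader scaladoc):

```scala
import org.apache.avro.generic.GenericRecord
import com.salesforce.op.readers.CSVAutoReader

// Infer the schema from the CSV at readPath, keying each record by its "id"
// field and overriding the inferred field names with explicit headers.
val autoReader = new CSVAutoReader[GenericRecord](
  readPath,
  key = _.get("id").toString,
  headers = Seq("id", "survived", "age") // illustrative column names
)
```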

Please let me know if it works for you.


shivam1892 commented Feb 19, 2019

Hi, thanks for the help.

I am trying to read with CSVAutoReader; for that I replaced

val passengersData = DataReaders.Simple.csvCase[Passenger](pathToData, key = _.id.toString).readDataset()(spark, newProductEncoder).toDF()

with

val passengersData = new CSVAutoReader[GenericRecord](pathToData, key = _.get("id").toString, headers =
      Seq("id", "survived", "pClass", "name", "sex", "age", "sibSp", "parCh", "ticket", "fare", "cabin", "embarked")).read()(spark)

in this example.

but the next step

val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
val featureVector = features.transmogrify()

creates the features from a DataFrame, but the result of CSVAutoReader's read() method is not convertible to a DataFrame.
This gives a compile-time error as shown:
[screenshot of compile-time error, 2019-02-19]

It would be great if you could provide a working example for this, as I am not able to find one in the hello-world examples.


tovbinm commented Feb 20, 2019

In general, readers only allow reading typed data into an RDD using the read() method, or materializing a DataFrame for a provided set of features using the generateDataFrame() method.

What I understand from your case is that you want both the feature definitions and the data reader to be created automatically. The only way to do that currently is to use our cli codegen tool, as explained here. You can try the --auto flag to automatically detect the schema prior to generating features (though it's experimental).
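For reference, a CLI invocation could look roughly like this (the gen command and --auto flag are per the reply above; the input path, id column, project name, and binary location are assumptions, so check the tool's --help for the exact usage):

```shell
# Build the codegen CLI, then generate a project with an auto-detected schema.
./gradlew cli:installDist
./cli/build/install/cli/bin/transmogrifai gen \
  --input data/passengers.csv \
  --id id \
  --response survived \
  --auto Passenger \
  Titanic
```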

shivam1892 (Author)

Yes. So I would like to raise a feature request: the ability to convert a Spark DataFrame to a TransmogrifAI DataFrame, which we could directly pass to

val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](transmogrifConvertedDFfromSpark, response = "survived")

and to the workflow

val model = new OpWorkflow().setInputDataset(transmogrifConvertedDFfromSpark).setResultFeatures(prediction).train()(spark)


tovbinm commented Feb 20, 2019

FeatureBuilder.fromDataFrame allows you to pass in a DataFrame and infer primitive feature types, but we haven't implemented it for more advanced types yet (#64).

shivam1892 (Author)

Here is a snapshot of the code I am trying to run using a Spark DataFrame (I have added column names to the CSV file):

val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(pathToData)
val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](df, response = "survived")

df is the Spark DataFrame, and I get the following error. (I had assumed the primitive data type RealNN was handled, but according to the error below, TransmogrifAI does not convert the Integral type to RealNN.)

Exception in thread "main" java.lang.RuntimeException: Response feature 'survived' is of type com.salesforce.op.features.types.Integral, but expected com.salesforce.op.features.types.RealNN


tovbinm commented Feb 21, 2019

You would need to apply a column transformation on the DataFrame using a udf (or using cast) to convert survived into a Double, as in the example here. Then apply FeatureBuilder.fromDataFrame on the resulting DataFrame.
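Applied to the earlier snippet, the cast could look like this (a sketch; the column and path names follow the code above):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(pathToData)
  // cast the integer 0/1 label to Double so it is picked up as RealNN
  .withColumn("survived", col("survived").cast(DoubleType))

val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](df, response = "survived")
```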

shivam1892 (Author)

Yes, now it's running. Thanks @tovbinm!

One more thing I want to know: for multi-class classification, e.g. three classes ("a", "b", "c"), I want to follow the same procedure. Should I provide string indexes as the response, e.g. (1, 2, 3), or should it be one-hot encoded, e.g. ([1,0,0], [0,1,0], [0,0,1])?
When I give string indexes as the response, the algorithm works fine and produces output, so:

  1. Does TransmogrifAI internally do the one-hot encoding step? If it does not, the algorithm will treat the numerical ordering of the indexes as meaningful, which is bad for the model.
  2. If not, how should the response variable be provided in a multi-class classification problem?


tovbinm commented Feb 25, 2019

We don't modify the label automatically, and yes, you should apply an indexer on the response feature for multiclass. E.g.

val response: FeatureLike[PickList] = ...
val indexed: FeatureLike[RealNN] = response.indexed()
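The indexed response could then be fed into a multiclass model selector; a sketch (featureVector stands for the transmogrified feature vector from earlier in the thread, and the selector name assumes TransmogrifAI's MultiClassificationModelSelector):

```scala
import com.salesforce.op.stages.impl.classification.MultiClassificationModelSelector

// indexed: FeatureLike[RealNN] produced by response.indexed() above;
// the selector picks a multiclass model via cross-validation.
val prediction = MultiClassificationModelSelector
  .withCrossValidation()
  .setInput(indexed, featureVector)
  .getOutput()
```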

@tovbinm tovbinm closed this as completed Feb 25, 2019
@tovbinm tovbinm mentioned this issue Jul 11, 2019