
Unable to create features dynamically. #228

Closed
shivam1892 opened this issue Feb 18, 2019 · 10 comments
shivam1892 commented Feb 18, 2019

Describe the bug
I was trying to run the sample examples given here. In the examples we need to define a schema file (a case class) for the data supplied to the TransmogrifAI reader, e.g. in the Boston housing price example. I would like to know whether I can work with TransmogrifAI without defining a schema for the data, similar to inferSchema in Spark, which automatically infers the schema from the data provided, for example from a CSV file.

To Reproduce
Minimal set of steps or code snippet to reproduce the behavior

Expected behavior
I just need to provide the CSV file (or any other file); TransmogrifAI should create a DataFrame by inferring the schema and run the algorithms on top of that.


Additional context
I know there is functionality where we can use an Avro schema to create a schema class, which generates a Java class for the schema, but again we need to define FeatureBuilders on top of that, e.g. the Iris data multi-classification example.

Can it be done without defining these schemas, i.e. by inferring the schema, creating the FeatureBuilders automatically, and running the algorithms on top of that? Example of inferSchema from a CSV file in Spark Scala:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(sampleData)
shivam1892 (Author)

I have also tried this example, where the feature engineering is done automatically. Here I removed the case class (Passenger) from the code and tried to generate it dynamically with Avro, but I am getting a compile-time error as shown:
[screenshot of compile-time error, 2019-02-18]

Kindly provide the solution to achieve this functionality.


tovbinm commented Feb 19, 2019

Yes, we have a CSV reader that allows inferring schema automatically. See CSVAutoReader.

Here is an example way to use it:

val autoReader = new CSVAutoReader[GenericRecord](readPath, _.get("id").toString)

Notes:

  1. You can use any other more specific type instead of GenericRecord if you'd like.
  2. You can customize CSV parsing options using the CSVOptions argument.
  3. You can also customize the field names by passing in headers.
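As a minimal sketch of notes 1 and 3 above (the key and headers parameter names follow the snippets in this thread; the column names are illustrative, and the exact constructor signature should be checked against the CSVAutoReader scaladoc):

```scala
import org.apache.avro.generic.GenericRecord
import com.salesforce.op.readers.CSVAutoReader

// Infer the schema from the CSV at readPath, keying each record by its "id"
// field and overriding the inferred field names with explicit headers.
val autoReader = new CSVAutoReader[GenericRecord](
  readPath,
  key = _.get("id").toString,
  headers = Seq("id", "survived", "age") // illustrative column names
)
```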

Please let me know if it works for you.


shivam1892 commented Feb 19, 2019

Hi, thanks for the help.

I am trying to read with CSVAutoReader; for that I replaced

val passengersData = DataReaders.Simple.csvCase[Passenger](pathToData, key = _.id.toString).readDataset()(spark, newProductEncoder).toDF()

with

val passengersData = new CSVAutoReader[GenericRecord](pathToData, key = _.get("id").toString, headers =
      Seq("id", "survived", "pClass", "name", "sex", "age", "sibSp", "parCh", "ticket", "fare", "cabin", "embarked")).read()(spark)

in this example.

but the next step

val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
val featureVector = features.transmogrify()

creates the features from a DataFrame, but the result of CSVAutoReader's read() method is not convertible to a DataFrame.
This gives a compile-time error as shown:
[screenshot of compile-time error, 2019-02-19]

It would be great if you could provide a working example for this, as I am not able to find one in the hello-world examples.


tovbinm commented Feb 20, 2019

In general, readers only allow reading typed data into an RDD using the read() method, or materializing a DataFrame for a provided set of features using the generateDataFrame() method.

What I understand from your case is that you want both the feature definitions and the data reader to be created automatically. The only way to do that currently is to use our cli codegen tool, as explained here. You can try the --auto flag to automatically detect the schema prior to generating features (though it's experimental).
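For reference, a CLI invocation could look roughly like this (the gen command and --auto flag are per the reply above; the input path, id column, project name, and binary location are assumptions, so check the tool's --help for the exact usage):

```shell
# Build the codegen CLI, then generate a project with an auto-detected schema.
./gradlew cli:installDist
./cli/build/install/cli/bin/transmogrifai gen \
  --input data/passengers.csv \
  --id id \
  --response survived \
  --auto Passenger \
  Titanic
```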

shivam1892 (Author)

Yes. So I would like to raise a feature request: the ability to convert a Spark DataFrame to a TransmogrifAI DataFrame, which we could directly pass to

val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](transmogrifConvertedDFfromSpark, response = "survived")

and to the workflow

val model = new OpWorkflow().setInputDataset(transmogrifConvertedDFfromSpark).setResultFeatures(prediction).train()(spark)


tovbinm commented Feb 20, 2019

FeatureBuilder.fromDataFrame allows you to pass in a DataFrame and infer primitive feature types, but we haven't implemented it for more advanced types yet (#64).

shivam1892 (Author)

Here is a snapshot of the code I am trying to run using a Spark DataFrame (I have added column names to the CSV file):

val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(pathToData)
val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](df, response = "survived")

df is the Spark DataFrame, and I get the following error. (I had assumed the primitive data type RealNN was handled, but according to the error below, TransmogrifAI does not convert the Integral type to RealNN.)

Exception in thread "main" java.lang.RuntimeException: Response feature 'survived' is of type com.salesforce.op.features.types.Integral, but expected com.salesforce.op.features.types.RealNN


tovbinm commented Feb 21, 2019

You would need to apply a column transformation on the DataFrame using a udf (or using cast) to convert survived into a Double, as in the example here. Then apply FeatureBuilder.fromDataFrame on the resulting DataFrame.
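Applied to the earlier snippet, the cast could look like this (a sketch; the column and path names follow the code above):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(pathToData)
  // cast the integer 0/1 label to Double so it is picked up as RealNN
  .withColumn("survived", col("survived").cast(DoubleType))

val (survived, features) = FeatureBuilder.fromDataFrame[RealNN](df, response = "survived")
```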

shivam1892 (Author)

Yes, now it's running. Thanks @tovbinm!

One more thing I want to know: for multi-class classification, e.g. three classes ("a", "b", "c"), I want to follow the same procedure. Should I provide string indexes as the response, e.g. (1, 2, 3), or should it be one-hot encoded, e.g. ([1,0,0], [0,1,0], [0,0,1])?
When I give string indexes as the response, the algorithm works fine and produces output, so:

  1. Does TransmogrifAI internally do the one-hot encoding step? If it does not, the algorithm will treat the numerical ordering of the indexes as meaningful, which is bad for the model.
  2. If not, how should the response variable be provided in a multi-class classification problem?


tovbinm commented Feb 25, 2019

We don't modify the label automatically, and yes, you should apply an indexer on the response feature for multiclass. E.g.

val response: FeatureLike[PickList] = ...
val indexed: FeatureLike[RealNN] = response.indexed()
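The indexed response could then be fed into a multiclass model selector; a sketch (featureVector stands for the transmogrified feature vector from earlier in the thread, and the selector name assumes TransmogrifAI's MultiClassificationModelSelector):

```scala
import com.salesforce.op.stages.impl.classification.MultiClassificationModelSelector

// indexed: FeatureLike[RealNN] produced by response.indexed() above;
// the selector picks a multiclass model via cross-validation.
val prediction = MultiClassificationModelSelector
  .withCrossValidation()
  .setInput(indexed, featureVector)
  .getOutput()
```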

@tovbinm tovbinm closed this as completed Feb 25, 2019
@tovbinm tovbinm mentioned this issue Jul 11, 2019