Added new sample for HousingPrices #365

rajdeepd · 2019-07-12T08:52:43Z

Related issues
#211

Describe the proposed solution
A new, more complex regression example as a Jupyter notebook

Describe alternatives you've considered
none

Additional context
Adding more examples to TransmogrifAI stack

codecov · 2019-07-12T09:15:22Z

Codecov Report

Merging #365 into master will decrease coverage by 57.28%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master     #365       +/-   ##
===========================================
- Coverage   86.05%   28.77%   -57.29%     
===========================================
  Files         336      336               
  Lines       10950     8849     -2101     
  Branches      351      433       +82     
===========================================
- Hits         9423     2546     -6877     
- Misses       1527     6303     +4776

Impacted Files	Coverage Δ
...sforce/op/stages/base/binary/BinaryEstimator.scala	`0% <0%> (-100%)`	⬇️
...la/com/salesforce/op/aggregators/Geolocation.scala	`0% <0%> (-100%)`	⬇️
...ala/com/salesforce/op/testkit/InfiniteStream.scala	`0% <0%> (-100%)`	⬇️
.../scala/com/salesforce/op/test/FeatureAsserts.scala	`0% <0%> (-100%)`	⬇️
...la/com/salesforce/op/utils/io/avro/AvroInOut.scala	`0% <0%> (-100%)`	⬇️
.../salesforce/op/aggregators/FeatureAggregator.scala	`0% <0%> (-100%)`	⬇️
...cala/com/salesforce/op/features/types/OPList.scala	`0% <0%> (-100%)`	⬇️
...n/scala/com/salesforce/op/readers/CSVReaders.scala	`0% <0%> (-100%)`	⬇️
...stages/base/sequence/BinarySequenceEstimator.scala	`0% <0%> (-100%)`	⬇️
.../op/stages/impl/feature/TextMapNullEstimator.scala	`0% <0%> (-100%)`	⬇️
... and 238 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 088e041...e687dc0.

leahmcguire · 2019-07-16T03:08:36Z

@Jauntbox can you take a look? you have done more with notebooks than I have.

leahmcguire

This is awesome thank you so much for the contribution!

I made a couple minor comments. Mostly expansions to the descriptions, once those are done LGTM :-)

leahmcguire · 2019-07-19T02:58:21Z

helloworld/notebooks/OpHousingPrices.ipynb

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%classpath add mvn com.salesforce.transmogrifai transmogrifai-core_2.11 0.5.1"


please update to latest version 0.6.0

leahmcguire · 2019-07-19T02:59:36Z

helloworld/notebooks/OpHousingPrices.ipynb

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%classpath add mvn org.apache.spark spark-mllib_2.11 2.3.0"


please update to current spark dep 2.4.3

I thought version 0.6.0 was still on Spark 2.3.x, and the upgrade to Spark 2.4.3 is in master but hasn't been released yet.

@leahmcguire @Jauntbox so which Spark version should I go with?

@Jauntbox is correct, please go with Spark 2.3.3

@leahmcguire ok..

leahmcguire · 2019-07-19T03:02:29Z

helloworld/notebooks/OpHousingPrices.ipynb

+    "\n",
+    "When defining raw features, specify the extract logic to be applied to the raw data, and also annotate the features as either predictor or response variables via the FeatureBuilders.\n",
+    "\n",
+    "SalesType encoder from text to numeric."


"SalesType encoder from text to numeric." What?

perhaps expand a bit as in the Iris notebook:

"#### Feature Engineering\n",
"\n",
"We then define the set of raw features that we would like to extract from the data. The raw features are defined using [FeatureBuilders](https://docs.transmogrif.ai/Developer-Guide#featurebuilders), and are strongly typed. TransmogrifAI supports the following basic feature types: `Text`, `Numeric`, `Vector`, `List`, `Set`, `Map`. \n",
"In addition it supports many specific feature types which extend these base types: Email extends Text; Integral, Real and Binary extend Numeric; Currency and Percentage extend Real. For a complete view of the types supported see the Type Hierarchy and Automatic Feature Engineering section in the Documentation.\n",
"\n",
"Basic `FeatureBuilders` will be created for you if you use the TransmogrifAI CLI to bootstrap your project as described here. However, it is often useful to edit this code to customize feature generation and take full advantage of the Feature types available (selecting the appropriate type will improve automatic feature engineering steps).\n",
"\n",
"When defining raw features, specify the extract logic to be applied to the raw data, and also annotate the features as either predictor or response variables via the FeatureBuilders:"

leahmcguire · 2019-07-19T03:04:45Z

helloworld/notebooks/OpHousingPrices.ipynb

+   "source": [
+    "**Create a feature sequence and transmogrify it**\n",
+    "\n",
+    "The `.transmogrify()` shortcut is a special AutoML Estimator that applies a default set of transformations to all the specified inputs and combines them into a single vector. This is in essence the automatic feature engineering Stage of TransmogrifAI. This stage can be discarded in favor of hand-tuned feature engineering and manual vector creation followed by combination using the VectorsCombiner Transformer (short-hand Seq(....).combine()) if the user desires to have complete control over feature engineering.\n",


It is a TransmogrifAI shortcut to many estimators.

leahmcguire · 2019-07-19T03:07:21Z

helloworld/notebooks/OpHousingPrices.ipynb

+    "\n",
+    "The `.transmogrify()` shortcut is a special AutoML Estimator that applies a default set of transformations to all the specified inputs and combines them into a single vector. This is in essence the automatic feature engineering Stage of TransmogrifAI. This stage can be discarded in favor of hand-tuned feature engineering and manual vector creation followed by combination using the VectorsCombiner Transformer (short-hand Seq(....).combine()) if the user desires to have complete control over feature engineering.\n",
+    "\n",
+    "The next stage applies another powerful AutoML Estimator — the SanityChecker. The SanityChecker applies a variety of statistical tests to the data based on Feature types and discards predictors that are indicative of label leakage or that show little to no predictive power. This is in essence the automatic feature selection Stage of TransmogrifAI:"


TransmogrifAI rather than AutoML

leahmcguire · 2019-07-19T03:09:59Z

helloworld/notebooks/OpHousingPrices.ipynb

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create an OpWorkflow and call train() on it to create a model."


"Workflow for TransmogrifAI. Takes the final features that the user wants to generate as inputs and constructs the full DAG needed to generate them from those features' lineage. Then fits any estimators in the pipeline DAG to create a sequence of transformations that are saved in a workflow model."
"When we now call ‘train’ on this workflow, it automatically computes and executes the entire DAG of Stages needed to compute the features, fitting all the estimators on the training data in the process. Calling score on the fitted workflow then transforms the underlying training data to produce a DataFrame with all the features manifested. The score method can optionally be passed an evaluator that produces metrics.\n",
"The workflow.train() method fits all of the estimators in the pipeline and returns a pipeline model of only transformers. It uses data loaded as specified by the data reader to generate the initial data set."
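
To make the train/score distinction in the suggested text concrete, here is a minimal plain-Scala sketch. It is a toy model, not TransmogrifAI code: `Stage`, `Transformer`, `Estimator`, and `ToyWorkflow` are hypothetical names invented for illustration, and the "DAG" is reduced to a linear sequence over a single numeric column.

```scala
// Toy stand-ins for illustration only; these are not TransmogrifAI classes.
sealed trait Stage
final case class Transformer(name: String, f: Double => Double) extends Stage
final case class Estimator(name: String, fit: Seq[Double] => Transformer) extends Stage

object ToyWorkflow {
  /** Walks the stages in order, fitting each estimator on the data as
    * transformed so far, and returns a model made of transformers only. */
  def train(stages: Seq[Stage], data: Seq[Double]): Seq[Transformer] =
    stages.foldLeft((Vector.empty[Transformer], data)) {
      case ((fitted, current), stage) =>
        val t = stage match {
          case ready: Transformer => ready
          case est: Estimator     => est.fit(current)
        }
        (fitted :+ t, current.map(t.f))
    }._1

  /** Applies the fitted transformers in order; nothing is re-fitted. */
  def score(model: Seq[Transformer], data: Seq[Double]): Seq[Double] =
    model.foldLeft(data)((d, t) => d.map(t.f))

  // Example stages: an estimator that learns the mean, then a fixed transformer.
  val center: Estimator = Estimator("center", xs => {
    val mean = xs.sum / xs.size
    Transformer("center", _ - mean)
  })
  val doubler: Transformer = Transformer("double", _ * 2)
}
```

Training on `Seq(1.0, 2.0, 3.0)` fixes the learned mean inside the returned transformer, so scoring new data reuses that mean instead of recomputing it. That is the sense in which the fitted model contains only transformers.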

Jauntbox

Please clean up the extract functions, then it looks good!

Jauntbox · 2019-07-19T18:57:46Z

helloworld/notebooks/OpHousingPrices.ipynb

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "val lotShape = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",


I think something like

FeatureBuilder.Integral[HousingPrices].extract(_.lotShape match { case "IR1" => 1 case _ => 0 }.toIntegral).asPredictor

would be more readable

Jauntbox · 2019-07-19T18:59:57Z

helloworld/notebooks/OpHousingPrices.ipynb

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "val yrSold = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",


Please don't use vars in here. Something like this is better:

FeatureBuilder.Integral[HousingPrices].extract(x => (2019 - x.yrSold).toIntegral).asPredictor

Jauntbox · 2019-07-19T19:01:26Z

helloworld/notebooks/OpHousingPrices.ipynb

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    " val saleType = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",


No need for intermediate vals here:

FeatureBuilder.Integral[HousingPrices].extract(x => saleTypeEncoder.get(x.saleType).toIntegral).asPredictor

Jauntbox · 2019-07-19T19:02:16Z

helloworld/notebooks/OpHousingPrices.ipynb

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "val saleCondition = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",


Same here, cleaner as

FeatureBuilder.Integral[HousingPrices].extract(x => saleConditionEncoder.get(x.saleCondition).toIntegral).asPredictor
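
The extract logic suggested in these comments can also be checked outside the notebook as plain Scala functions, without Spark or TransmogrifAI on the classpath. The sketch below is an assumption-laden illustration: the `HousingPrices` case class is reduced to the four fields mentioned in this review with guessed types, the encoder map holds only the entries visible in the diff, and `Extractors` is a hypothetical name.

```scala
// Assumed shape: only the fields referenced in the review, with guessed types.
final case class HousingPrices(
  lotShape: String,
  yrSold: Int,
  saleType: String,
  saleCondition: String
)

object Extractors {
  // Only the entries visible in the diff; the real map is longer.
  val saleTypeEncoder: Map[String, Int] =
    Map("COD" -> 1, "CWD" -> 2, "Con" -> 3, "ConLD" -> 4)

  /** 1 for the "IR1" lot shape, 0 for anything else. */
  val lotShape: HousingPrices => Int = x =>
    x.lotShape match {
      case "IR1" => 1
      case _     => 0
    }

  /** Years between the sale and 2019, as a single expression with no var. */
  val yearsSinceSale: HousingPrices => Int = x => 2019 - x.yrSold

  /** Encoded sale type; None when the raw value is missing from the map. */
  val saleType: HousingPrices => Option[Int] = x => saleTypeEncoder.get(x.saleType)
}
```

Because `Map.get` returns an `Option`, an unseen sale type becomes `None` rather than throwing, which is why no intermediate val or null check is needed before the conversion to a feature type.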

leahmcguire · 2019-07-29T17:27:07Z

helloworld/notebooks/OpHousingPrices.ipynb

+   "source": [
+    "import org.apache.spark.sql.{Encoders}\n",
+    "implicit val srEncoder = Encoders.product[HousingPrices]\n",
+    "val saleTypeEncoder = Map(\"COD\" -> 1, \"CWD\" -> 2, \"Con\" -> 3, \"ConLD\" -> 4,\n",


Why convert these to integers rather than treat them as picklists?

leahmcguire · 2019-07-29T17:28:23Z

helloworld/notebooks/OpHousingPrices.ipynb

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "val yrSold = FeatureBuilder.Integral[HousingPrices].extract(x => (2019 - x.yrSold).toIntegral).asPredictor"


why not just treat this as a date?

leahmcguire · 2019-07-29T17:29:32Z

helloworld/notebooks/OpHousingPrices.ipynb

+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "val saleConditionEncoder = Map(\"Abnorml\" -> 1, \"AdjLand\" -> 2, \"Alloca\" -> 3, \"Family\" -> 4,\n",


same here why are you converting to int? this is a picklist.

If they work better in the model as encoded variables, can you use the string indexer before calling .transmogrify()?

https://github.com/salesforce/TransmogrifAI/blob/master/core/src/test/scala/com/salesforce/op/stages/impl/feature/OpStringIndexerNoFilterTest.scala#L75

"val saleTypeInd = saleType.indexed()"
"val saleCondInd = saleCondition.indexed()"
"val features = Seq(lotFrontage,area,lotShape, yrSold, saleTypeInd, saleCondInd).transmogrify()\n",

This is preferred because model insights will have the correct names if you use our internal indexer

With encoded variables I am not able to use indexed()

<console>:149: error: value indexed is not a member of com.salesforce.op.features.Feature[com.salesforce.op.features.types.Integral]
val saleCondition = FeatureBuilder.Integral[HousingPrices].extract(x => saleConditionEncoder.get(x.saleCondition).toIntegral).asPredictor.indexed()

The indexer is for strings: put the original string in, not the mapped integer value. The indexer will then do the mapping for you. Unlike the map you are creating, the name-to-index mapping will be saved in the feature metadata, so any feature insights will have the name associated with them. It will also not be treated as an ordered integer in modeling, which would be inappropriate.
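
As a rough illustration of why keeping the name-to-index mapping matters, here is a small plain-Scala sketch of a string indexer that remembers its labels. It is not the TransmogrifAI or Spark implementation: `LabelIndex` is a hypothetical helper, and the frequency-descending ordering with alphabetical tie-breaking is an assumption chosen for the example.

```scala
// Hypothetical helper, not the TransmogrifAI or Spark API: a string indexer
// that keeps its label list so indices can be mapped back to names.
final case class LabelIndex(labels: Vector[String]) {
  private val toIndex: Map[String, Int] = labels.zipWithIndex.toMap

  /** Index of a label, or None for a value not seen during fitting. */
  def index(label: String): Option[Int] = toIndex.get(label)

  /** Original name for an index; a hand-written encoder map loses this downstream. */
  def name(i: Int): Option[String] = labels.lift(i)
}

object LabelIndex {
  /** Assumed ordering: most frequent label first, ties broken alphabetically. */
  def fit(values: Seq[String]): LabelIndex = {
    val counts = values.groupBy(identity).map { case (k, v) => (k, v.size) }
    LabelIndex(counts.toVector.sortBy { case (label, n) => (-n, label) }.map(_._1))
  }
}
```

Fitting on `Seq("COD", "CWD", "COD", "Con", "COD", "CWD")` yields the labels `COD`, `CWD`, `Con` in that order, and `name(0)` recovers `"COD"`. With a hand-written `Map("COD" -> 1, ...)` applied inside `extract`, only the integer reaches later stages, so insights cannot report the category name and the model may read the codes as an ordered quantity.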

leahmcguire

LGTM! Thank you for the contribution!

leahmcguire · 2019-08-07T17:07:26Z

@Jauntbox can you take another look? your review is blocking

rajdeepd · 2019-08-08T04:16:40Z

@Jauntbox cleaned extract functions, please check

Jauntbox · 2019-08-10T20:46:15Z

Sorry for the delay - LGTM!

rajdeepd · 2019-08-14T02:53:15Z

@tovbinm please approve

leahmcguire · 2019-08-14T16:40:11Z

Matthew is on vacation.

rajdeepd · 2019-08-15T09:21:59Z

@leahmcguire thanks!

Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandadrdScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:
- Update tika version [#382](#382)

salesforce-cla · 2020-10-16T10:26:28Z

Thanks for the contribution! Before we can merge this, we need @rajdeepd to sign the Salesforce.com Contributor License Agreement.

salesforce-cla · 2020-12-03T00:56:47Z

Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

rajdeepd requested review from leahmcguire and tovbinm as code owners July 12, 2019 08:52

leahmcguire requested a review from Jauntbox July 15, 2019 21:56

leahmcguire requested changes Jul 19, 2019

Jauntbox suggested changes Jul 19, 2019

rajdeepd force-pushed the housing branch from 9f48ecb to 9dc4c4b Compare July 27, 2019 11:32

leahmcguire reviewed Jul 29, 2019

Added new sample for HousingPrices

367665d

rajdeepd force-pushed the housing branch from 9dc4c4b to 367665d Compare August 7, 2019 06:15

leahmcguire approved these changes Aug 7, 2019

Merge branch 'master' into housing

4e3baa1

Merge branch 'master' into housing

d465660

Jauntbox approved these changes Aug 10, 2019

Merge branch 'master' into housing

e687dc0

leahmcguire merged commit 1f9fdd6 into salesforce:master Aug 14, 2019

gerashegalov mentioned this pull request Sep 8, 2019

0.6.1 release #403

Merged

salesforce-cla bot added the cla:missing label Oct 16, 2020
