Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added new sample for HousingPrices #365

Merged
merged 4 commits into from
Aug 14, 2019
Merged

Conversation

rajdeepd
Copy link
Contributor

Related issues
[Issues]#211

Describe the proposed solution
New more complex regression example on Jupyter

Describe alternatives you've considered
none

Additional context
Adding more examples to TransmogrifAI stack

@codecov
Copy link

codecov bot commented Jul 12, 2019

Codecov Report

Merging #365 into master will decrease coverage by 57.28%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #365       +/-   ##
===========================================
- Coverage   86.05%   28.77%   -57.29%     
===========================================
  Files         336      336               
  Lines       10950     8849     -2101     
  Branches      351      433       +82     
===========================================
- Hits         9423     2546     -6877     
- Misses       1527     6303     +4776
Impacted Files Coverage Δ
...sforce/op/stages/base/binary/BinaryEstimator.scala 0% <0%> (-100%) ⬇️
...la/com/salesforce/op/aggregators/Geolocation.scala 0% <0%> (-100%) ⬇️
...ala/com/salesforce/op/testkit/InfiniteStream.scala 0% <0%> (-100%) ⬇️
.../scala/com/salesforce/op/test/FeatureAsserts.scala 0% <0%> (-100%) ⬇️
...la/com/salesforce/op/utils/io/avro/AvroInOut.scala 0% <0%> (-100%) ⬇️
.../salesforce/op/aggregators/FeatureAggregator.scala 0% <0%> (-100%) ⬇️
...cala/com/salesforce/op/features/types/OPList.scala 0% <0%> (-100%) ⬇️
...n/scala/com/salesforce/op/readers/CSVReaders.scala 0% <0%> (-100%) ⬇️
...stages/base/sequence/BinarySequenceEstimator.scala 0% <0%> (-100%) ⬇️
.../op/stages/impl/feature/TextMapNullEstimator.scala 0% <0%> (-100%) ⬇️
... and 238 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 088e041...e687dc0. Read the comment docs.

@leahmcguire
Copy link
Collaborator

@Jauntbox can you take a look? you have done more with notebooks than I have.

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome thank you so much for the contribution!

I made a couple minor comments. Mostly expansions to the descriptions, once those are done LGTM :-)

"metadata": {},
"outputs": [],
"source": [
"%classpath add mvn com.salesforce.transmogrifai transmogrifai-core_2.11 0.5.1"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update to latest version 0.6.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

"metadata": {},
"outputs": [],
"source": [
"%classpath add mvn org.apache.spark spark-mllib_2.11 2.3.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update to current spark dep 2.4.3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought version 0.6.0 was still on Spark 2.3.x, and the upgrade to Spark 2.4.3 is in master but hasn't been released yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leahmcguire @Jauntbox so what is the spark version i should go with?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jauntbox is correct please go with spark 2.3.3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"\n",
"When defining raw features, specify the extract logic to be applied to the raw data, and also annotate the features as either predictor or response variables via the FeatureBuilders.\n",
"\n",
"SalesType encoder from text to numeric."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"SalesType encoder from text to numeric." What?

perhaps expand a bit as in the Iris notebook:

"#### Feature Engineering\n",
"\n",
"We then define the set of raw features that we would like to extract from the data. The raw features are defined using [FeatureBuilders](https://docs.transmogrif.ai/Developer-Guide#featurebuilders), and are strongly typed. TransmogrifAI supports the following basic feature types: `Text`, `Numeric`, `Vector`, `List` , `Set`, `Map`. \n",
"In addition it supports many specific feature types which extend these base types: Email extends Text; Integral, Real and Binary extend Numeric; Currency and Percentage extend Real. For a complete view of the types supported see the Type Hierarchy and Automatic Feature Engineering section in the Documentation.\n",
"\n",
"Basic `FeatureBuilders` will be created for you if you use the TransmogrifAI CLI to bootstrap your project as described here. However, it is often useful to edit this code to customize feature generation and take full advantage of the Feature types available (selecting the appropriate type will improve automatic feature engineering steps).\n",
"\n",
"When defining raw features, specify the extract logic to be applied to the raw data, and also annotate the features as either predictor or response variables via the FeatureBuilders:"```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

"source": [
"**Create a feature sequence and transmogrify it**\n",
"\n",
"The `.transmogrify()` shortcut is a special AutoML Estimator that applies a default set of transformations to all the specified inputs and combines them into a single vector. This is in essence the automatic feature engineering Stage of TransmogrifAI. This stage can be discarded in favor of hand-tuned feature engineering and manual vector creation followed by combination using the VectorsCombiner Transformer (short-hand Seq(....).combine()) if the user desires to have complete control over feature engineering.\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a Transmografai shortcut to many estimators

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

"\n",
"The `.transmogrify()` shortcut is a special AutoML Estimator that applies a default set of transformations to all the specified inputs and combines them into a single vector. This is in essence the automatic feature engineering Stage of TransmogrifAI. This stage can be discarded in favor of hand-tuned feature engineering and manual vector creation followed by combination using the VectorsCombiner Transformer (short-hand Seq(....).combine()) if the user desires to have complete control over feature engineering.\n",
"\n",
"The next stage applies another powerful AutoML Estimator — the SanityChecker. The SanityChecker applies a variety of statistical tests to the data based on Feature types and discards predictors that are indicative of label leakage or that show little to no predictive power. This is in essence the automatic feature selection Stage of TransmogrifAI:"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transmografai rather than AutoML

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

"cell_type": "markdown",
"metadata": {},
"source": [
"Create an OpWorkflow and call train() on it to create a model."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

" Workflow for TransmogrifAI. Takes the final features that the user wants to generate as inputs and constructs the full DAG needed to generate them from those features lineage. Then fits any estimators in the pipeline dag to create a sequence of transformations that are saved in a workflow model."
"When we now call ‘train’ on this workflow, it automatically computes and executes the entire DAG of Stages needed to compute the features fitting all the estimators on the training data in the process. Calling score on the fitted workflow then transforms the underlying training data to produce a DataFrame with the all the features manifested. The score method can optionally be passed an evaluator that produces metrics.\n",
"workflow.train() methods fits all of the estimators in the pipeline and return a pipeline model of only transformers. Uses data loaded as specified by the data reader to generate the initial data set."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

Copy link
Contributor

@Jauntbox Jauntbox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clean up the extract functions, then it looks good!

"metadata": {},
"outputs": [],
"source": [
"val lotShape = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something like

FeatureBuilder.Integral[HousingPrices].extract(_.lotShape match {
    case "IR1" => 1
    case _ => 0
}.toIntegral).asPredictor

would be more readable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

"metadata": {},
"outputs": [],
"source": [
"val yrSold = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't use vars in here. Something like this is better:

FeatureBuilder.Integral[HousingPrices].extract(x => (2019 - x.yrSold).toIntegral).asPredictor)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

"metadata": {},
"outputs": [],
"source": [
" val saleType = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for intermediate vals here:

FeatureBuilder.Integral[HousingPrices].extract(x => saleTypeEncoder.get(x.saleType).toIntegral).asPredictor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

"metadata": {},
"outputs": [],
"source": [
"val saleCondition = FeatureBuilder.Integral[HousingPrices].extract(x =>\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, cleaner as

FeatureBuilder.Integral[HousingPrices].extract(x => 
   saleConditionEncoder.get(x.saleCondition).toIntegral).asPredictor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

"source": [
"import org.apache.spark.sql.{Encoders}\n",
"implicit val srEncoder = Encoders.product[HousingPrices]\n",
"val saleTypeEncoder = Map(\"COD\" -> 1, \"CWD\" -> 2, \"Con\" -> 3, \"ConLD\" -> 4,\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why convert these to integers rather than treat them as picklists?

"metadata": {},
"outputs": [],
"source": [
"val yrSold = FeatureBuilder.Integral[HousingPrices].extract(x => (2019 - x.yrSold).toIntegral).asPredictor"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just treat this as a date?

"metadata": {},
"outputs": [],
"source": [
"val saleConditionEncoder = Map(\"Abnorml\" -> 1, \"AdjLand\" -> 2, \"Alloca\" -> 3, \"Family\" -> 4,\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here why are you converting to int? this is a picklist.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if they work better in the model as encoded variable can you use the string indexer before calling .transmografai()

https://github.com/salesforce/TransmogrifAI/blob/master/core/src/test/scala/com/salesforce/op/stages/impl/feature/OpStringIndexerNoFilterTest.scala#L75

"val saleTypeInd = saleType.indexed()"
val saleCondInd = saleCondition.indexed()"
"val features = Seq(lotFrontage,area,lotShape, yrSold, saleTypeInd, saleCondInd).transmogrify()\n",

This is preferred because model insights will have the correct names if you use our internal indexer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With encoded variables i am not able to use indexed()

<console>:149: error: value indexed is not a member of com.salesforce.op.features.Feature[com.salesforce.op.features.types.Integral]
       val saleCondition = FeatureBuilder.Integral[HousingPrices].extract(x => saleConditionEncoder.get(x.saleCondition).toIntegral).asPredictor.indexed()
       

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indexer is for strings - put the original string in - not the mapped integer value. The indexer will then do the mapping for you. But unlike in the map you are creating the name to index mapping will be saved in the feature metadata so any feature insights will have the name associated with it. It will also not treat it as an ordered integer in modeling which would be inappropriate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for the contribution!

@leahmcguire
Copy link
Collaborator

@Jauntbox can you take another look? your review is blocking

@rajdeepd
Copy link
Contributor Author

rajdeepd commented Aug 8, 2019

@Jauntbox cleaned extract functions, please check

@Jauntbox
Copy link
Contributor

Sorry for the delay - LGTM!

@rajdeepd
Copy link
Contributor Author

@tovbinm please approve

@leahmcguire
Copy link
Collaborator

Matthew is on vacation.

@leahmcguire leahmcguire merged commit 1f9fdd6 into salesforce:master Aug 14, 2019
@rajdeepd
Copy link
Contributor Author

@leahmcguire thanks!

@gerashegalov gerashegalov mentioned this pull request Sep 8, 2019
gerashegalov added a commit that referenced this pull request Sep 11, 2019
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandadrdScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:   
- Update tika version [#382](#382)
@salesforce-cla
Copy link

Thanks for the contribution! Before we can merge this, we need @rajdeepd to sign the Salesforce.com Contributor License Agreement.

@salesforce-cla
Copy link

salesforce-cla bot commented Dec 3, 2020

Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants