Added new sample for HousingPrices #365

Merged
merged 4 commits on Aug 14, 2019
383 changes: 383 additions & 0 deletions helloworld/notebooks/OpHousingPrices.ipynb
@@ -0,0 +1,383 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Op Housing Prices Sample\n",
"Here we describe a very simple TransmogrifAI workflow for predicting the housing prices based on the features of house on sale. The code for building and applying the Titanic model can be found here: Titanic Code, and the data can be found here: Titanic Data.\n",
"\n",
"First we need to load transmogrifai and Spark Mllib jars"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%classpath add mvn com.salesforce.transmogrifai transmogrifai-core_2.11 0.6.0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%classpath add mvn org.apache.spark spark-mllib_2.11 2.3.3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Import the classes**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import org.apache.spark.SparkConf\n",
"import org.apache.spark.sql.SparkSession\n",
"import org.apache.spark.SparkContext\n",
"import org.apache.spark.sql.functions.udf\n",
"\n",
"import com.salesforce.op._\n",
"import com.salesforce.op.features._\n",
"import com.salesforce.op.features.types._\n",
"import com.salesforce.op.evaluators.Evaluators"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import com.salesforce.op.OpWorkflow\n",
"import com.salesforce.op.evaluators.Evaluators\n",
"import com.salesforce.op.readers.DataReaders"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instantiate a SparkSession"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val conf = new SparkConf().setMaster(\"local[*]\").setAppName(\"HousingPricesPrediction\")\n",
"implicit val spark = SparkSession.builder.config(conf).getOrCreate()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Schema class\n",
"Let us create a case class to describe the schema for the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"case class HousingPrices(\n",
" lotFrontage: Double,\n",
" area: Integer,\n",
" lotShape: String,\n",
" yrSold : Integer,\n",
" saleType: String,\n",
" saleCondition: String,\n",
" salePrice: Double)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Feature Engineering\n",
"\n",
"We then define the set of raw features that we would like to extract from the data. The \n",
"raw features are defined using [FeatureBuilders](https://docs.transmogrif.ai/Developer-Guide#featurebuilders), \n",
"and are strongly typed. TransmogrifAI supports the following basic feature types: `Text`, \n",
"`Numeric`, `Vector`, `List` , `Set`, `Map`. In addition it supports many specific feature \n",
"types which extend these base types: Email extends Text; Integral, Real and Binary extend \n",
"Numeric; Currency and Percentage extend Real. For a complete view of the types supported \n",
"see the Type Hierarchy and Automatic Feature Engineering section in the Documentation.\n",
"\n",
"Basic `FeatureBuilders` will be created for you if you use the TransmogrifAI CLI to bootstrap \n",
"your project as described here. However, it is often useful to edit this code to customize \n",
"feature generation and take full advantage of the Feature types available (selecting the \n",
"appropriate type will improve automatic feature engineering steps).\n",
"When defining raw features, specify the extract logic to be applied to the raw data, and \n",
"also annotate the features as either predictor or response variables via the FeatureBuilders:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import org.apache.spark.sql.{Encoders}\n",
"implicit val srEncoder = Encoders.product[HousingPrices]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val lotFrontage = FeatureBuilder.Real[HousingPrices].extract(_.lotFrontage.toReal).asPredictor\n",
"val area = FeatureBuilder.Integral[HousingPrices].extract(_.area.toIntegral).asPredictor"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val lotShape = FeatureBuilder.Integral[HousingPrices].extract(_.lotShape match {\n",
" case \"IR1\" => 1.toIntegral\n",
" case _ => 0.toIntegral\n",
"}).asPredictor"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val yrSold = FeatureBuilder.Integral[HousingPrices].extract(_.yrSold.toIntegral).asPredictor"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val saleType = FeatureBuilder.Text[HousingPrices].extract(_.saleType.toText).asPredictor.indexed()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val saleCondition = FeatureBuilder.Text[HousingPrices]\n",
" .extract(_.saleCondition.toText).asPredictor.indexed()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val salePrice = FeatureBuilder.RealNN[HousingPrices].extract(_.salePrice.toRealNN).asResponse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
" val trainFilePath = \"/home/beakerx/helloworld/src/main/resources/HousingPricesDataset/train_lf_la_ls_ys_st_sc.csv\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a training data reader from the `trainFilePath` using `DataReaders.Simple`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val trainDataReader = DataReaders.Simple.csvCase[HousingPrices](\n",
" path = Option(trainFilePath)\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a feature sequence and transmogrify it\n",
"\n",
"`.transmogrify()` is a Transmografai shortcut to many estimators. This is in essence the automatic feature engineering Stage of TransmogrifAI. This stage can be discarded in favor of hand-tuned feature engineering and manual vector creation followed by combination using the VectorsCombiner Transformer (short-hand Seq(....).combine()) if the user desires to have complete control over feature engineering.\n",
"\n",
"The next stage applies another powerful transmogrifai Estimator — the SanityChecker. The SanityChecker applies a variety of statistical tests to the data based on Feature types and discards predictors that are indicative of label leakage or that show little to no predictive power. This is in essence the automatic feature selection Stage of TransmogrifAI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import com.salesforce.op.stages.impl.tuning.{DataCutter, DataSplitter}\n",
"val features = Seq(lotFrontage,area,lotShape, yrSold, saleType, saleCondition).transmogrify()\n",
"val randomSeed = 42L\n",
"val splitter = DataSplitter(seed = randomSeed)"
]
},
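{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the alternatives mentioned above (the shortcuts follow the pattern in the TransmogrifAI docs; `normedFrontage`, `areaGroup`, `manualFeatures` and `checkedFeatures` are illustrative names, and the 10,000 sq ft cutoff is arbitrary), one could hand-tune and combine feature vectors manually, and run the SanityChecker explicitly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import com.salesforce.op.stages.impl.preparators.SanityChecker\n",
"\n",
"// Hand-tuned alternative to .transmogrify() (illustrative): engineer\n",
"// individual features, then combine them into one vector with .combine()\n",
"val normedFrontage = lotFrontage.fillMissingWithMean().zNormalize()\n",
"val areaGroup = area.map[PickList](_.value.map(v => if (v > 10000) \"large\" else \"small\").toPickList).pivot()\n",
"val manualFeatures = Seq(normedFrontage, areaGroup).combine()\n",
"\n",
"// Explicit automatic feature selection: the SanityChecker takes the label\n",
"// and a feature vector, and returns a vector with flagged predictors removed\n",
"val checkedFeatures = new SanityChecker()\n",
"  .setCheckSample(1.0)\n",
"  .setInput(salePrice, features)\n",
"  .getOutput()"
]
},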
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model selector\n",
"Create a prediction(model) based on RegressionModelSelector. We are using Gradient Boosted Trees and Random Forest. Notice how input is applied of `salesPrice` and `features`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import com.salesforce.op.stages.impl.regression.RegressionModelSelector\n",
"import com.salesforce.op.stages.impl.regression.RegressionModelsToTry.{OpGBTRegressor, OpRandomForestRegressor}\n",
"\n",
"val prediction1 = RegressionModelSelector\n",
" .withCrossValidation(\n",
" dataSplitter = Some(splitter), seed = randomSeed,\n",
" modelTypesToUse = Seq(OpGBTRegressor, OpRandomForestRegressor)\n",
" ).setInput(salePrice,features).getOutput()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create an evaluator of type Regression and call setLabelCol and setPredictionCol"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val evaluator = Evaluators.Regression().setLabelCol(salePrice).setPredictionCol(prediction1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Workflow and WorkflowModel\n",
"\n",
"Workflow for TransmogrifAI. Takes the final features that the user wants to generate as \n",
"inputs and constructs the full DAG needed to generate them from those features lineage. \n",
"Then fits any estimators in the pipeline dag to create a sequence of transformations that \n",
"are saved in a workflow model.\n",
"When we now call `train` on this workflow, it automatically computes and executes the \n",
"entire DAG of Stages needed to compute the features fitting all the estimators on the training data in the process. \n",
"Calling score on the fitted workflow then transforms the underlying training data to \n",
"produce a DataFrame with the all the features manifested. The score method can optionally \n",
"be passed an evaluator that produces metrics.\n",
"`workflow.train()` methods fits all of the estimators in the pipeline and return a \n",
"pipeline model of only transformers. Uses data loaded as specified by the data reader to \n",
"generate the initial data set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val workflow = new OpWorkflow().setResultFeatures(prediction1, salePrice).setReader(trainDataReader)\n",
"val workflowModel = workflow.train()"
]
},
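{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted above, the fitted model can also produce scores without an evaluator; a minimal sketch (`dfScore` is an illustrative name):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Score only: returns a DataFrame with the result features manifested\n",
"val dfScore = workflowModel.score()\n",
"dfScore.show(false)"
]
},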
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Score and evaluate the model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val dfScoreAndEvaluate = workflowModel.scoreAndEvaluate(evaluator)\n",
"dfScoreAndEvaluate._1.show(false)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val dfEvaluate = dfScoreAndEvaluate._2\n",
"dfEvaluate.toString()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Scala",
"language": "scala",
"name": "scala"
},
"language_info": {
"codemirror_mode": "text/x-scala",
"file_extension": ".scala",
"mimetype": "",
"name": "Scala",
"nbconverter_exporter": "",
"version": "2.11.12"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": false,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": false,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}