jupyter notebooks for transmogrify samples #231

Merged
merged 2 commits on Mar 11, 2019
339 changes: 339 additions & 0 deletions helloworld/notebooks/OpIris.ipynb
@@ -0,0 +1,339 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Op Iris Sample\n",
"\n",
"The following code illustrates how TransmogrifAI can be used to do classify multiple classes over the Iris dataset.\n",
"\n",
"First we need to load transmogrifai and Spark Mllib jars using maven"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%classpath add mvn com.salesforce.transmogrifai transmogrifai-core_2.11 0.5.1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%classpath add mvn org.apache.spark spark-mllib_2.11 2.3.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define features\n",
"\n",
"Let us first define the case Class which describes the schema for the data. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"case class Iris\n",
"(\n",
" sepalLength: Double,\n",
" sepalWidth: Double,\n",
" petalLength: Double,\n",
" petalWidth: Double,\n",
" irisClass: String\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Feature Engineering\n",
"\n",
"We then define the set of raw features that we would like to extract from the data. The raw features are defined using [FeatureBuilders](https://docs.transmogrif.ai/Developer-Guide#featurebuilders), and are strongly typed. TransmogrifAI supports the following basic feature types: `Text`, `Numeric`, `Vector`, `List` , `Set`, `Map`. \n",
"In addition it supports many specific feature types which extend these base types: Email extends Text; Integral, Real and Binary extend Numeric; Currency and Percentage extend Real. For a complete view of the types supported see the Type Hierarchy and Automatic Feature Engineering section in the Documentation.\n",
"\n",
"Basic `FeatureBuilders` will be created for you if you use the TransmogrifAI CLI to bootstrap your project as described here. However, it is often useful to edit this code to customize feature generation and take full advantage of the Feature types available (selecting the appropriate type will improve automatic feature engineering steps).\n",
"\n",
"When defining raw features, specify the extract logic to be applied to the raw data, and also annotate the features as either predictor or response variables via the FeatureBuilders:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import com.salesforce.op.features.FeatureBuilder\n",
"import com.salesforce.op.features.types._\n",
"\n",
"val sepalLength = FeatureBuilder.Real[Iris].extract(_.sepalLength.toReal).asPredictor\n",
"val sepalWidth = FeatureBuilder.Real[Iris].extract(_.sepalWidth.toReal).asPredictor\n",
"val petalLength = FeatureBuilder.Real[Iris].extract(_.petalLength.toReal).asPredictor\n",
"val petalWidth = FeatureBuilder.Real[Iris].extract(_.petalWidth.toReal).asPredictor\n",
"val irisClass = FeatureBuilder.Text[Iris].extract(_.irisClass.toText).asResponse"
]
},
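{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a purely illustrative aside (not needed for the Iris sample), the more specific feature types mentioned above are declared in the same way. The `Visit` case class below is hypothetical, and the `toIntegral` conversion is assumed to be provided by `com.salesforce.op.features.types._`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Hypothetical record type, for illustration only -- not part of the Iris dataset\n",
"case class Visit(count: Long, durationSec: Double)\n",
"\n",
"// Choosing the more specific Integral type (rather than Real) tells TransmogrifAI the value\n",
"// is a whole number, which changes the automatic feature engineering that gets applied\n",
"val visitCount = FeatureBuilder.Integral[Visit].extract(_.count.toIntegral).asPredictor\n",
"val visitDuration = FeatureBuilder.Real[Visit].extract(_.durationSec.toReal).asPredictor"
]
},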
{
"cell_type": "markdown",
"metadata": {},
"source": [
"import the classes from Spark library for initiating the `SparkSession`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import org.apache.spark.SparkConf\n",
"import org.apache.spark.sql.SparkSession\n",
"import org.apache.spark.SparkContext\n",
"import org.apache.spark.sql.functions.udf"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val conf = new SparkConf().setMaster(\"local[*]\").setAppName(\"TitanicPrediction\")\n",
"implicit val spark = SparkSession.builder.config(conf).getOrCreate()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import com.salesforce.op._\n",
"import com.salesforce.op.evaluators.Evaluators\n",
"import com.salesforce.op.readers.DataReaders\n",
"import com.salesforce.op.stages.impl.classification.MultiClassificationModelSelector\n",
"import com.salesforce.op.stages.impl.tuning.DataCutter\n",
"import org.apache.spark.sql.Encoders"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next step is to encode the case class using `org.apache.spark.sql.Encoders`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"implicit val irisEncoder = Encoders.product[Iris]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a DataRead which will load csv and map to schema of type Iris"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val irisReader = DataReaders.Simple.csvCase[Iris]()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Feature Engineering\n",
"See [Creating Shortcuts for Transformers and Estimators](https://docs.transmogrif.ai/en/stable/developer-guide#creating-shortcuts-for-transformers-and-estimators) for more documentation on how shortcuts for stages can be created. We now define a Feature of type `Vector`, that is a vector representation of all the features we would like to use as predictors in our workflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val labels = irisClass.indexed()\n",
"val features = Seq(sepalLength, sepalWidth, petalLength, petalWidth).transmogrify()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next step is to create a DataCutter : Creates instance that will split data into training and test set filtering out any labels that don't meet the minimum fraction cutoff or fall in the top N labels specified"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val randomSeed = 42L\n",
"val cutter = DataCutter(reserveTestFraction = 0.2, seed = randomSeed)"
]
},
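{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the label filtering described above can be tuned when constructing the `DataCutter`. The parameter names below (`maxLabelCategories`, `minLabelFraction`) are given for illustration only and should be checked against the `DataCutter` API of the TransmogrifAI version in use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Illustrative only: keep at most the top 3 label categories and drop labels making up\n",
"// less than 1% of the data (parameter names assumed from the DataCutter API)\n",
"val tunedCutter = DataCutter(\n",
"  reserveTestFraction = 0.2,\n",
"  maxLabelCategories = 3,\n",
"  minLabelFraction = 0.01,\n",
"  seed = randomSeed\n",
")"
]
},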
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a MultiClassModelSelector and specify splitter created above. Then set the input - labels and features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val prediction = MultiClassificationModelSelector\n",
" .withCrossValidation(splitter = Option(cutter), seed = randomSeed)\n",
" .setInput(labels, features).getOutput()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val evaluator = Evaluators.MultiClassification.f1().setLabelCol(labels).setPredictionCol(prediction)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Everything we’ve done so far has been purely at the level of definitions. We have defined how we would like to extract our raw features from data of type `Iris`, and we have defined how we would like to manipulate them. In order to actually manifest the data described by these features, we need to add them to a workflow and attach a data source to the workflow.\n",
"\n",
"Please note the `trainFilePath` is the derived path from folder where host folder is mounted as a volume (/home/beakerx/helloworld) in this case. This can be changed as well depending on the location and volume director you are mounting the data from. You can also create a new DataReader with a new path."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"implicit val spark = SparkSession.builder.config(conf).getOrCreate()\n",
"import spark.implicits._ // Needed for Encoders for the Passenger case class\n",
"import com.salesforce.op.readers.DataReaders\n",
"\n",
"val trainFilePath = \"/home/beakerx/helloworld/src/main/resources/IrisDataset/iris.data\"\n",
" // Define a way to read data into our Passenger class from our CSV file\n",
"val trainDataReader = DataReaders.Simple.csvCase[Iris](\n",
" path = Option(trainFilePath)\n",
" //key = _.id.toString\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Workflow for TransmogrifAI. Takes the final features that the user wants to generate as inputs and constructs the full DAG needed to generate them from those features lineage. Then fits any estimators in the pipeline dag to create a sequence of transformations that are saved in a workflow model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val workflow = new OpWorkflow().setResultFeatures(prediction, labels).setReader(trainDataReader)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we now call ‘train’ on this workflow, it automatically computes and executes the entire DAG of Stages needed to compute the features fitting all the estimators on the training data in the process. Calling score on the fitted workflow then transforms the underlying training data to produce a DataFrame with the all the features manifested. The score method can optionally be passed an evaluator that produces metrics.\n",
"`workflow.train()` methods fits all of the estimators in the pipeline and return a pipeline model of only transformers. Uses data loaded as specified by the data reader to generate the initial data set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val fittedWorkflow = workflow.train()\n",
"println(s\\\"Summary: ${fittedWorkflow.summaryPretty()"
]
},
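{
"cell_type": "markdown",
"metadata": {},
"source": [
"As noted above, calling `score` on the fitted workflow transforms the underlying training data and returns a DataFrame with all the generated features manifested. A minimal sketch (the evaluator-based scoring shown below is usually what you want):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"// Score without an evaluator: returns only the transformed DataFrame\n",
"val scoredData = fittedWorkflow.score()\n",
"scoredData.show(5, truncate = false)"
]
},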
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After model has been fitted we use scoreAndEvaluate() function to evaluate the metrics"
[Review comment, Collaborator] it would be nice to mention that you can change out the data before doing this by either setting a new path or a new reader
[Reply, Contributor Author] changed
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"println(\"Scoring the model:\\n=================\")\n",
"val (dataframe, metrics) = fittedWorkflow.scoreAndEvaluate(evaluator = evaluator)\n",
"\n",
"println(\"Transformed dataframe columns:\\n--------------------------\")\n",
"dataframe.columns.foreach(println)\n",
"\n",
"println(\"Metrics:\\n------------\")\n",
"println(metrics)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Scala",
"language": "scala",
"name": "scala"
},
"language_info": {
"codemirror_mode": "text/x-scala",
"file_extension": ".scala",
"mimetype": "",
"name": "Scala",
"nbconverter_exporter": "",
"version": "2.11.12"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": false,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": false,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}