# What is DataRobot
The DataRobot automated machine learning platform helps data scientists and business analysts discover the best predictive models for every situation, and then deploy them so they can consistently make smarter and faster business decisions that impact their company's bottom line.

## Why use DataRobot with Databricks
DataRobot brings the power of auto-modeling to Databricks users, allowing them to quickly determine and use the best machine learning model for their problem. Within minutes DataRobot can iterate on thousands of combinations of models, data preparation steps, and parameters that would take days or weeks to do manually

## Before you start: Pre-requisistes

To experience the Power of DataRobot+Databricks you'll need a DataRobot account. If your company already deployed DataRobot please get an account from your administrator. Otherwise, please [contact us](https://www.datarobot.com/contact-us/) here: https://www.datarobot.com/contact-us/

## Getting your DataRobot API Endpoint

1. While logged in the DataRobot interface, click on the _profile_ icon on the top right corner of the screen.

    ![profile icon](https://s3.amazonaws.com/datarobot_public/databricksNotebookAssets/user_avatar.png)

2. Select `Profile` from the drop down menu:

    ![profile link](https://s3.amazonaws.com/datarobot_public/databricksNotebookAssets/user_dropdown.png)

3. Your API Token will be in the top section of your profile, copy to insert in your notebooks.

    ![profile page](https://s3.amazonaws.com/datarobot_public/databricksNotebookAssets/profile_page.png)

# Overview of the Modeling Example

Statistics on whether a flight was delayed and for how long are available from government databases for all the major carriers. It would be useful to be able to predict before scheduling a flight whether or not it was likely to be delayed. In the example notebooks below, we will use DataRobot to try to model whether a flight will be delayed, based on information such as the scheduled departure time and whether it rained the day of the flight.

# About this Notebook

This notebook provides an introduction to the `datarobot` Python package, highlighting the following key details of its use:

* connecting to the DataRobot modeling engine from a Python session
* creating a new modeling project in the DataRobot modeling engine
* retrieving the results from a DataRobot modeling project
* generating predictions from any DataRobot model

To illustrate this, we will focus on the problem of predicting airline delays.


---

# Table of Contents

1. [The DataRobot Modeling Engine](#The-DataRobot-Modeling-Engine)
2. [Connecting to DataRobot](#Connecting-to-DataRobot)
3. [Sample Data for this Exercise](#Sample-Data-for-this-Exercise)
4. [Creating a new Project](#Creating-a-new-Project)
5. [Making Predictions with DataRobot](#Making-Predictions-with-DataRobot)
6. [Conclusion](#Conclusion)

---
## The DataRobot Modeling Engine

The DataRobot modeling engine is a commercial product that supports massively parallel modeling applications, building and optimizing models of many different types, and evaluating and ranking their relative performance. This modeling engine exists in a variety of implementations, some cloud-based, accessed via the Internet, and others residing in customer-specific on-premises computing environments.

The DataRobot modeling engine is organized around *modeling projects*, each based on a single data source, a single target variable to be predicted, and a single metric to be optimized in fitting and ranking project models. This information is sufficient to create a project, identified by a unique alphanumeric **project_id** label, and start the DataRobot Autopilot.

## The `datarobot` Python package.
This notebook uses our official Python client package that wraps the DataRobot REST API in an easy to use library. The Python package supports Python 2 and 3 and is hosted on [PyPi](https://pypi.python.org/pypi/datarobot). 

### Installing the `datarobot` package in databricks 
Databricks makes it easy to manage python dependencies for use in your clusters. To install the `datarobot` package, open the `Workspace` using the panel on the left of your databricks homepage and click on `Shared`.

![databricks_shared_pane](https://s3.amazonaws.com/datarobot_public/databricksNotebookAssets/databricks_shared_pane.png)


Right-click in the pane for shared resources, then Select `Create`, then `Library`. 

![databricks_shared_create](https://s3.amazonaws.com/datarobot_public/databricksNotebookAssets/databricks_shared_create.png)

On this dialog page, specify you are installing a python package from PyPI, and then specify `datarobot` as the package you want to install.

![databricks_install_pypi](https://s3.amazonaws.com/datarobot_public/databricksNotebookAssets/databricks_install_pypi.png)

This tutorial makes heavy use of this library so we will take a moment to make sure it is installed:

In [0]:
try:
    import datarobot
except ImportError:
    print('Make sure to install the datarobot package from PyPi into your cluster before running this notebook! See the instructions in the cell above.')
    raise

<div class="alert alert-box alert-info">
<em>Note</em>: if are <strong>not</strong> using DataRobot Cloud, we recommend you pin the package version to correspond to the version that matches your DataRobot Enterprise install version. 
Your support representative can help you determine what the recommended version you should use (as the package version does <strong>not</strong> have the same versioning scheme as the DataRobot modeling engine).

</div>

---
## Connecting to DataRobot

To access the DataRobot modeling engine, it is necessary to establish an authenticated connection. The necessary information is an **endpoint** - the URL address of the specific DataRobot server being used - and a **token**, a previously validated access token.

### Endpoint 
**endpoint** depends on the DataRobot modeling engine installation (cloud-based or on-prem) you are using. Contact your DataRobot admin for the correct endpoint to use. 

Please update the variable below accordingly.
<div class="alert alert-box alert-info">
<em>Note</em>: If you are using DataRobot Cloud you do <strong>not</strong> need to make any changes
</div>

In [0]:
endpoint = 'https://app.datarobot.com/api/v2'

### Token
**token** is unique for each DataRobot modeling engine account and can be obtained by logging into the DataRobot webapp browsing to the account profile section. It looks like a string of letters and numbers. Your token can be found in the DataRobot UI at "API Token", in "Settings", in the person icon (top right). You can also go to https://app.datarobot.com/account/me to see it, replacing _app.datarobot.com_ with your endpoint if you are not hosted on the cloud.

Enter your token below.

<div style='color: white; background-color: #33b5e5; border-radius: 5px; padding: 10px'>Note: You need to change the value of `token` in the cell below in order to continue running this notebook.</div>

In [0]:
token = 'YOUR_TOKEN_HERE'

assert token != 'YOUR_TOKEN_HERE'

### Setting the configuration in code
The next cell will configure the datarobot package to use the provided endpoint and token. This only needs to be done once per session.

In [0]:
import datarobot as dr 
dr.Client(endpoint=endpoint, token=token)

The connection to DataRobot should now be ready. 

<div class="alert alert-box alert-info">
<em>Note</em>: Setting these values in code is not the only way to configure the DataRobot package. See the package documentation for additional configuration options.
</div>

## Sample Data for this Exercise

### Background

Statistics on whether a flight was delayed and for how long are available from government databases for all the major carriers. It would be useful to be able to predict before scheduling a flight whether or not it was likely to be delayed. In this example, DataRobot will try to model whether a flight will be delayed, based on information such as the scheduled departure time and whether it rained the day of the flight.

---

Information on flights and flight delays is made available by the Bureau of Transportation Statistics at https://www.transtats.bts.gov/ONTIME/Departures.aspx. To narrow down the amount of data involved, the datasets assembled for this use case are limited to US Airways flights out of Boston Logan in 2013 and 2014, although the script for interacting with DataRobot is sufficiently general that any dataset with the correct format could be used. A flight was declared to be delayed if it ultimately took off at least fifteen minutes after its scheduled departure time.

In additional to flight information, each record in the prepared dataset notes the amount of rain and whether it rained on the day of the flight. This information came from the National Oceanic and Atmospheric Administration’s Quality Controlled Local Climatological Data, available at http://www.ncdc.noaa.gov/qclcd/QCLCD. By looking at the recorded daily summaries of the water equivalent precipitation at the Boston Logan station, the daily rainfall for each day in 2013 and 2014 was measured. For some days, the QCLCD reports trace amounts of rainfall, which was recorded as 0 inches of rain.

We have collected and stored this data in Amazon S3, and we can read that data into a Pandas dataframe for analysis.

## Getting the Data for this Exercise
In the following cell we will download the datasets for this exercise and save them both to DBFS on your cluster.
This will allow us to go through the process of getting a Spark DataFrame and then serializing it to a Python
DataFrame to then send it to DataRobot to begin the automated modeling process.

This process of using a Spark DataFrame isn't strictly necessary for the computation we want to accomplish
inside this notebook. But almost certainly you have pipelines connected to more interesting datasets than
just CSV files, like SQL Databases and data stored in S3 or Azure, which you will be manipulating as Spark Dataframes.

At that point you can follow the recipe provided in this notebook for getting your data into DataRobot for
both model training and generating predictions.

In [0]:
import errno
import os

import requests


files = ['logan-US-2013.csv',
         'logan-US-2014.csv']

s3_location = 'https://s3.amazonaws.com/datarobot-public-datasets-redistributable'
storage_directory = '/dbfs/FileStore/DataRobotQuickStart'

try:
  os.mkdir(storage_directory)
except OSError as err:
  if err.errno != errno.EEXIST:
    raise

for filename in files:
  s3_path = '/'.join((s3_location, filename))
  storage_path = os.path.join(storage_directory, filename)
  response = requests.get(s3_path, stream=True)
  print('Retrieving {}'.format(filename))
  with open(storage_path, 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
      f.write(chunk)
  print('Saved {}'.format(filename))

## Read the Data into Spark DataFrames

In [0]:
# This is the data we will use for training
logan_2013 = (spark.read.format('csv')
                        .option('header', 'true')
                        .option('inferSchema', 'true')
                        .load('/FileStore/DataRobotQuickStart/logan-US-2013.csv'))

# This is the data we will use to evaluate the model
logan_2014= (spark.read.format('csv')
                       .option('header', 'true')
                       .option('inferSchema', 'true')
                       .load('FileStore/DataRobotQuickStart/logan-US-2014.csv'))


logan_2013.createOrReplaceTempView('logan_2013')
logan_2014.createOrReplaceTempView('logan_2014')

In [0]:
%sql
SELECT * FROM logan_2013 LIMIT 10;

was_delayed,daily_rainfall,did_rain,Carrier Code,Date (MM/DD/YYYY),Flight Number,Tail Number,Destination Airport,Scheduled Departure Time
False,0.0,False,US,02/01/2013,225,N662AW,PHX,16:20
False,0.0,False,US,02/01/2013,280,N822AW,PHX,06:00
False,0.0,False,US,02/01/2013,303,N653AW,CLT,09:35
True,0.0,False,US,02/01/2013,604,N640AW,PHX,09:55
False,0.0,False,US,02/01/2013,722,N715UW,PHL,18:30
False,0.0,False,US,02/01/2013,754,N952UW,PHL,15:30
False,0.0,False,US,02/01/2013,897,N956UW,PHL,09:30
False,0.0,False,US,02/01/2013,967,N741UW,PHL,07:30
True,0.0,False,US,02/01/2013,1029,N176UW,CLT,08:00
False,0.0,False,US,02/01/2013,1097,N948UW,PHL,20:30


### Dataset Structure

Each row in the assembled dataset contains the following columns:

- was_delayed
    - boolean
    - whether the flight was delayed
- daily_rainfall
    - float
    - the amount of rain, in inches, on the day of the flight
- did_rain
    - bool
    - whether it rained on the day of the flight
- Carrier Code
    - str
    - the carrier code of the airline - US for all entries in assembled dataset
- Date
    - str (MM/DD/YYYY format)
    - the date of the flight
- Flight Number
    - str
    - the flight number for the flight
- Tail Number
    - str
    - the tail number of the aircraft
- Destination Airport
    - str
    - the three-letter airport code of the destination airport
- Scheduled Departure Time
    - str
    - the 24-hour scheduled departure time of the flight, in the origin airport's timezone

## Transform the Date Column
We want to be able to make predictions for future data, so the “date” column should be transformed in a way that avoids values that won’t be populated for future data.

We will use this transformation for both the training data and the data we later predict on.

In [0]:
from pyspark.sql.functions import to_timestamp, month, date_format

def prepare_modeling_dataset(df):
    """
    Enrich the dataframe by adding in the day of week and month extracted from the date column.
    Then drop the date column.
    
    Parameters
    ----------
    df : pyspark.sql.DataFrame
      The dataset
    """
    date_column_name = 'Date (MM/DD/YYYY)'
    modeling_df = logan_2013.withColumn('Timestamp', to_timestamp(logan_2013[date_column_name], 'MM/dd/yyyy'))
    modeling_df = modeling_df.withColumn('day_of_week', date_format(modeling_df.Timestamp, 'EEE'))
    modeling_df = modeling_df.withColumn('month', month(modeling_df.Timestamp))
    modeling_df = modeling_df.drop(date_column_name, 'Timestamp')
    return modeling_df

modeling_2013 = prepare_modeling_dataset(logan_2013)
display(modeling_2013)

was_delayed,daily_rainfall,did_rain,Carrier Code,Flight Number,Tail Number,Destination Airport,Scheduled Departure Time,day_of_week,month
False,0.0,False,US,225,N662AW,PHX,16:20,Fri,2
False,0.0,False,US,280,N822AW,PHX,06:00,Fri,2
False,0.0,False,US,303,N653AW,CLT,09:35,Fri,2
True,0.0,False,US,604,N640AW,PHX,09:55,Fri,2
False,0.0,False,US,722,N715UW,PHL,18:30,Fri,2
False,0.0,False,US,754,N952UW,PHL,15:30,Fri,2
False,0.0,False,US,897,N956UW,PHL,09:30,Fri,2
False,0.0,False,US,967,N741UW,PHL,07:30,Fri,2
True,0.0,False,US,1029,N176UW,CLT,08:00,Fri,2
False,0.0,False,US,1097,N948UW,PHL,20:30,Fri,2


---
## Creating a new Project

One of the most common and important uses of the **datarobot** package is the creation of a new modeling project. This task is supported by the following three functions:

* __dr.Project.start__ creates a new project, generating a unique alphanumeric project identifier (__projectId__), uploading the modeling data, and allowing the specification of a project name and the target to model with;
* __project.wait_for_autopilot__ lets us wait for DataRobot to finish building models.
* __project.get_models__ lets us retrieve information on the models DataRobot made, once autopilot is complete.

The **DataRobot Autopilot** builds, evaluates, and summarizes a collection of models. While the Autopilot is running, intermediate results are saved in a list that is updated until the project completes. The last stage of the modeling process constructs *blender* models, ensemble models that combine two or more of the best-performing individual models in various different ways. These models are ranked in the same way as the individual models and are included in the final project list.

We convert the Spark DataFrame to a Pandas DataFrame to be able to send it to the DataRobot server.

In [0]:
project_name = 'Compute Airline Delay from Databricks'
pandas_2013 = modeling_2013.toPandas()                  # Convert to Pandas DataFrame
project = dr.Project.start(pandas_2013,                 # Specify the dataframe we want to model with. (This can also be a path to a CSV or URL.)
                           project_name=project_name,   # Give the project a name.
                           target='was_delayed')        # Give the name of the variable specifying the target (the value we want to predict).
project.id

# Configuring the number of workers
The DataRobot platform can run multiple modeling tasks in parallel by leveraging our pool of workers. You can set the number of workers which should be utilized by a project

<div class="alert alert-box alert-warning">
Depending on the number of workers available to you per your license with DataRobot, the following cell might not succeed. It may also not be using all of your available workers.
</div>

In [0]:
# Set the worker count to the max your account allows to speed up training
project.set_worker_count(4)

You can view the progress of auto-pilot in from the DataRobot Web UI or explore more aspects of the project you just created. The cell below should output a link to the newly created project:

In [0]:
displayHTML('<a href="{0}">{0}</a>'.format(project.get_leaderboard_ui_permalink()))

You can also wait for the modeling to finish via the API...

In [0]:
%%time
project.wait_for_autopilot()

### Retrieving project results

We can then use the API to interact with the `project` object to get data, such as the list of models built.

In [0]:
models = project.get_models()
len(models)

Here we can see DataRobot built approximately 60 different predictive models automatically in under 30 minutes. Cool! Let's get some more information on what happened, such as getting all the models, their unique IDs, the logarithmic loss for each model on the cross validation segment, the percent of train data the model was trained on, and the type of model.

Some models don't have a root mean squared error for cross validation, as DataRobot only does cross validation for the best models found via a simple holdout set. `LogLoss CV` tells us the logarithmic loss on a five-fold cross validation and `LogLoss` tells us the logarithmic loss on just the validation set (first fold). Below we will look at the top 10 models in the _leaderboard_.

In [0]:
import pandas as pd
model_list = [{'id': m.id,
               'LogLoss CV': m.metrics['LogLoss']['crossValidation'],
               'LogLoss': m.metrics['LogLoss']['validation'],
               'type': m.model_type,
               'samplePct': m.sample_pct} for m in models]
model_df = pd.DataFrame(model_list)
model_df.sort_values(by='LogLoss CV', inplace=True)
model_df.reset_index(drop=True, inplace=True)
display(sqlContext.createDataFrame(model_df.head(10)))

LogLoss,LogLoss CV,id,samplePct,type
0.27244,0.27147,5af206578cf9824d606c96d0,64.0035,ENET Blender
0.27244,0.271474,5af206568cf9824d606c96ca,64.0035,AVG Blender
0.27281,0.271706,5af206578cf9824d606c96ce,64.0035,ENET Blender
0.27206,0.27185,5af206578cf9824d606c96cc,64.0035,Advanced AVG Blender
0.27228,0.272182,5af2058b02f05c6bd989f296,64.0035,Light Gradient Boosted Trees Classifier with Early Stopping
0.27357,0.2722819999999999,5af2058a02f05c6bd989f295,64.0035,eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features
0.27417,0.272626,5af2058a02f05c6bd989f291,64.0035,eXtreme Gradient Boosted Trees Classifier with Early Stopping
0.27421,0.273586,5af2058a02f05c6bd989f293,64.0035,eXtreme Gradient Boosted Trees Classifier with Early Stopping
0.27501,0.274132,5af2058b02f05c6bd989f297,64.0035,Light Gradient Boosting on ElasticNet Predictions
0.27678,0.274976,5af2058a02f05c6bd989f294,64.0035,Gradient Boosted Trees Classifier with Early Stopping


In [0]:
best_model_id = model_df.iloc[0, :]['id']
best_model_id

In [0]:
best_model = dr.Model.get(project.id, best_model_id)
best_model

### Feature Impact
Feature Impact measures how important a feature is in generating predictions by the model. That is, it measures how much the accuracy of a model would decrease if that feature were removed.

Feature Impact is available for all model types. It is an on-demand feature, meaning that you must initiate a calculation to see the results. Once you have had DataRobot compute the feature impact for a model, that information is saved with the project. Let's give it a try:

In [0]:
try:
    feature_impacts = best_model.get_feature_impact()  # check if they've already been computed
except dr.errors.ClientError as e:
    assert e.status_code == 404  # it hasn't been computed yet
    impact_job = best_model.request_feature_impact()
    feature_impacts = impact_job.get_result_when_complete(4*60)  # wait a few minutes to complete

In [0]:
impact_df = pd.DataFrame(feature_impacts)
impact_df.sort_values(by='impactNormalized', ascending=False, inplace=True)
display(sqlContext.createDataFrame(impact_df))

featureName,impactNormalized,impactUnnormalized
Scheduled Departure Time,1.0,0.0464782796917205
daily_rainfall,0.657886722430346,0.0305774430905869
month,0.4126256803782759,0.019178131780608
Tail Number,0.3694132179657475,0.0171696908664305
Flight Number,0.2465271808607253,0.0114581592636561
day_of_week,0.2367906881638498,0.0110056238328744
did_rain,0.2258445207459558,0.0104968648020731
Scheduled Departure Time (Hour of Day),0.2008007378662268,0.0093328728568503
Destination Airport,0.1114577771162008,0.0051803657386242


As we said before, DataRobot uses built-in cross-validation and holdout to judge models. Prior to predicting with our holdout set, we will want to train the DataRobot model on the maximum amount of data it has. To do this, we unlock the holdout set using `project.unlock_holdout()` and then we retrain the model on 100% of the data given to DataRobot using `model.train`. Once we start a training, we use `dr.models.modeljob.wait_for_async_model_creation` to pause until the model has been built.

In [0]:
%%time
project.unlock_holdout()
job_id_for_retraining_best_model = best_model.train(sample_pct=100)
best_model = dr.models.modeljob.wait_for_async_model_creation(project.id, job_id_for_retraining_best_model)

### ROC Curve

We also can plot the ROC curve. The receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

In [0]:
roc = best_model.get_roc_curve('crossValidation')
roc_df = pd.DataFrame(roc.roc_points)
display(sqlContext.createDataFrame(roc_df))

accuracy,f1_score,false_negative_score,false_positive_rate,false_positive_score,matthews_correlation_coefficient,negative_predictive_value,positive_predictive_value,threshold,true_negative_rate,true_negative_score,true_positive_rate,true_positive_score
0.9023907103825136,0.0,1429,0.0,0,0.0,0.9023907103825136,0.0,1.0,1.0,13211,0.0,0
0.9025273224043716,0.0027952480782669,1427,0.0,0,0.035540690262069,0.9025140046454434,1.0,0.9177740089813656,1.0,13211,0.0013995801259622,2
0.9026639344262296,0.0055826936496859,1425,0.0,0,0.0502655602110401,0.9026373326045368,1.0,0.9129712462403462,1.0,13211,0.0027991602519244,4
0.9028688524590164,0.0097493036211699,1422,0.0,0,0.0665019018458001,0.9028223877537074,1.0,0.8474107619082456,1.0,13211,0.0048985304408677,7
0.9033469945355193,0.0234644582470669,1412,0.0002270834910302021,3,0.0937652362775674,0.9034199726402188,0.85,0.7892104339013508,0.9997729165089698,13208,0.0118964310706787,17
0.9041666666666668,0.0462270564242012,1395,0.0006055559760805389,8,0.1286657430161998,0.9044389642416768,0.8095238095238095,0.7362427099330165,0.9993944440239194,13203,0.0237928621413575,34
0.9047131147540984,0.0606060606060606,1384,0.0008326394671107411,11,0.1473988896481662,0.9051014810751508,0.8035714285714286,0.6971792329741485,0.9991673605328892,13200,0.0314905528341497,45
0.9056010928961749,0.0919842312746386,1359,0.0017409734312315,23,0.1764840535916544,0.9065786760156732,0.7526881720430108,0.6402927060780538,0.9982590265687684,13188,0.0489853044086774,70
0.9077868852459016,0.1434010152284264,1316,0.0025736128983422,34,0.227731290394409,0.909197543641758,0.7687074829931972,0.5919772207828347,0.9974263871016578,13177,0.0790762771168649,113
0.9077185792349728,0.1433100824350031,1316,0.0026493073953523,35,0.2267438835524065,0.9091912779464532,0.7635135135135135,0.5918701664157944,0.9973506926046476,13176,0.0790762771168649,113


---
## Making Predictions with DataRobot

Now that we have some basic information on all the models, we can make predictions with the best model.

Now let's load some test data. We'll predict for data from the year 2014. We will need to munge the predict data the same way as the train data, and then we can upload it to DataRobot and get predictions.

In [0]:
%%time
test_df = prepare_modeling_dataset(logan_2014).toPandas()  # Use our same transformation steps
test_df.drop('was_delayed', inplace=True, axis=1)
prediction_dataset = project.upload_dataset(test_df)
predict_job = best_model.request_predictions(prediction_dataset.id)
predictions = predict_job.get_result_when_complete()

In [0]:
results_df = pd.DataFrame(predictions)
display(sqlContext.createDataFrame(results_df))

positive_probability,prediction,row_id,class_0.0,class_1.0
0.073626832337718,0.0,0,0.926373167662282,0.073626832337718
0.0193093038174744,0.0,1,0.9806906961825256,0.0193093038174744
0.0238237953366585,0.0,2,0.9761762046633414,0.0238237953366585
0.0654510859919305,0.0,3,0.9345489140080696,0.0654510859919305
0.0747755252419655,0.0,4,0.9252244747580344,0.0747755252419655
0.0489604655208469,0.0,5,0.9510395344791532,0.0489604655208469
0.0352042025377301,0.0,6,0.9647957974622698,0.0352042025377301
0.0390085166712699,0.0,7,0.96099148332873,0.0390085166712699
0.0542841403000421,0.0,8,0.945715859699958,0.0542841403000421
0.1081716910887458,0.0,9,0.8918283089112542,0.1081716910887458


Predictions come back letting us know the probability of both the binary classes, the overall positive probability (same as the probability of class = 1.0), and the overall class prediction (0 or 1, based on the probability). We can then combine the class prediction with the original data.

In [0]:
predictions = pd.concat([results_df['prediction'].reset_index(), test_df.reset_index()],
                        axis=1).drop('index', axis=1)
display(sqlContext.createDataFrame(predictions))

prediction,daily_rainfall,did_rain,Carrier Code,Flight Number,Tail Number,Destination Airport,Scheduled Departure Time,day_of_week,month
0.0,0.0,False,US,225,N662AW,PHX,16:20,Fri,2
0.0,0.0,False,US,280,N822AW,PHX,06:00,Fri,2
0.0,0.0,False,US,303,N653AW,CLT,09:35,Fri,2
0.0,0.0,False,US,604,N640AW,PHX,09:55,Fri,2
0.0,0.0,False,US,722,N715UW,PHL,18:30,Fri,2
0.0,0.0,False,US,754,N952UW,PHL,15:30,Fri,2
0.0,0.0,False,US,897,N956UW,PHL,09:30,Fri,2
0.0,0.0,False,US,967,N741UW,PHL,07:30,Fri,2
0.0,0.0,False,US,1029,N176UW,CLT,08:00,Fri,2
0.0,0.0,False,US,1097,N948UW,PHL,20:30,Fri,2


---
## Conclusion
This concludes our brief overview of the `datarobot` Python client. You've seen how to serialize a Spark dataframe to Pandas in order to send it to the DataRobot client, for both training and predictions. At this point you should be able to use DataRobot in combination with your existing ETL pipelines in Spark. Good luck!