<table style="border: none" align="left">
    <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/pmservice/cars-4-you/master/static/images/logo.png" width="200" alt="Icon"></th>
       <th style="border: none"><font face="verdana" size="5" color="black"><b>Action Recommendation</b></font></th>
   </tr>
</table>

<img align=left src="https://github.com/pmservice/cars-4-you/raw/master/static/images/action.png" width="550" alt="Icon">

Contents
- [0. Setup](#setup)
- [1. Introduction](#introduction)
- [2. Load and explore data](#load)
- [3. Create an Apache Spark machine learning model](#model)
- [4. Store the model in the Watson Machine Learning repository](#persistence)
- [5. Deploy the model in the IBM Cloud](#deploy)
- [6. Configure continuous learning system](#learning)

**Note:** This notebook works correctly with the `Python 3.5 with Spark 2.1` kernel. Please **do not change the kernel**.

<a id="setup"></a>
## 0. Setup

Use the cell below to upgrade the `watson-machine-learning-client` package.

In [1]:
!rm -rf $PIP_BUILD
!pip install --upgrade watson-machine-learning-client==1.0.260

Requirement already up-to-date: watson-machine-learning-client==1.0.260 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s861-ffd6e6d0334337-ed352cd4c457/.local/lib/python3.5/site-packages (1.0.260)
Requirement not upgraded as not directly required: tqdm in /usr/local/src/conda3_runtime.v43/home/envs/DSX-Python35-Spark/lib/python3.5/site-packages (from watson-machine-learning-client==1.0.260) (4.19.4)
Requirement not upgraded as not directly required: tabulate in /usr/local/src/conda3_runtime.v43/home/envs/DSX-Python35-Spark/lib/python3.5/site-packages (from watson-machine-learning-client==1.0.260) (0.8.2)
Requirement not upgraded as not directly required: urllib3 in /usr/local/src/conda3_runtime.v43/home/envs/DSX-Python35-Spark/lib/python3.5/site-packages (from watson-machine-learning-client==1.0.260) (1.22)
Requirement not upgraded as not directly required: certifi in /usr/local/src/conda3_runtime.v43/home/envs/DSX-Python35-Spark/lib/python3.5/site-packages (from watson-machine-learn

**Note**: Please restart the kernel (Kernel -> Restart)

<a id="introduction"></a>
## 1. Introduction

This notebook defines, trains, and deploys a model that recommends a specific action for unsatisfied customers.

<a id="load"></a>
## 2. Load and explore data

In this section you will load the data as an Apache Spark DataFrame and perform a basic exploration.
You will also use the **bias detection** library to evaluate your data.

Read the data from the DB2 database into a Spark DataFrame and show a sample record.

### Load data

**TIP:** If needed, put your service credentials here.

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# @hidden_cell
# The following code is used to access your data and contains your credentials.
# You might want to remove those credentials before you share your notebook.

properties_db2 = {
    'driver': 'com.ibm.db2.jcc.DB2Driver',
    'jdbcurl': '***',
    'user': '***',
    'password': '***'
}

table_name = 'CAR_RENTAL_TRAINING'
df_data = spark.read.jdbc(properties_db2['jdbcurl'], table='.'.join([properties_db2['user'], table_name]), properties=properties_db2)
df_data.head()


Py4JJavaError: An error occurred while calling o91.jdbc.
: java.lang.NullPointerException
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:72)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
	at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:90)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
	at java.lang.reflect.Method.invoke(Method.java:508)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:811)
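
The credentials in the cell above are masked as `***`, so they must be filled in before the JDBC read can succeed. The following is a minimal sketch of a guard that lists unfilled values before Spark is called. The helper name `find_placeholder_keys` is hypothetical and not part of any library used in this notebook.

```python
def find_placeholder_keys(props, placeholder="***"):
    """Return the keys whose values are still masked or empty."""
    return sorted(
        key for key, value in props.items()
        if value is None or str(value).strip() in ("", placeholder)
    )


# Same shape as properties_db2 above, with the masked values left in place.
example_props = {
    "driver": "com.ibm.db2.jcc.DB2Driver",
    "jdbcurl": "***",
    "user": "***",
    "password": "***",
}

missing = find_placeholder_keys(example_props)
if missing:
    print("Fill in these credentials before reading from DB2:", missing)
```

Running a check like this before `spark.read.jdbc` gives a readable message instead of a Java stack trace when a value has been left unset.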


### Explore data

In [None]:
df_data.printSchema()

**Tip:** The code that loads the data can be inserted using the Data menu: select the `Insert SparkSession DataFrame` option.

**Note:** The inserted code has been modified to work with the code in the cells below.

As you can see, the data contains eleven fields. The `Action` field is the one you want to predict, using the feedback data in the `Customer_Service` field.

In [None]:
print("Number of records: " + str(df_data.count()))

In [None]:
df_data.select('Business_area').groupBy('Business_area').count().show(truncate=False)

In [None]:
df_data.select('Action').groupBy('Action').count().show(truncate=False)

### Bias detection

A data set exhibits bias if its content supports or opposes particular objects, ideologies, person groups, or beliefs in an unfair way, allowing opinions to influence unbiased judgment. If such data is used to build a machine learning model, then it is very likely that the generated model will also exhibit bias. 

The code below shows you how to use the `ibm_bias_detection` library to detect potential bias in a data set about car rental usage. For example, if you create a model to determine whether an unsatisfied customer is eligible for a voucher or a free upgrade based on different criteria, you want to be sure that your training data does not exhibit bias towards any of the chosen criteria.

Use the code below to import the bias detection library, `ibm_bias_detection`:

In [None]:
from ibm_bias_detection import data_bias_checker

Create a pandas DataFrame from the Spark DataFrame.

In [None]:
import pandas as pd

pd_data = df_data.toPandas()
pd_data.head()

#### Input Parameters for the Bias Checker

Before you can call the function that detects bias in a data set, you must characterize the bias you want to detect. You do this by choosing the input parameters for the bias checker function call based on your sample data set.

You pass the input parameters to the bias checker function in a helper map. You can add the following parameters:

 - `class_label`: the name of the column in the data frame that you have designated as the classified column. This is the column whose value will be predicted by the machine learning model.

 This parameter is **mandatory**.

 An example is a column called `Action`.
 - `protected_attributes`: an array of one or more column names that are likely to show bias towards the class label.

 This parameter is **mandatory**.

 Examples are `Gender` and `Age`.
 - `favourable_class`: an array of one or more values of the `class_label` column that represent a favorable outcome for the end user.

 This parameter is **optional**. If not specified, the library finds all of the distinct values of the `class_label` column and runs the bias detection algorithm multiple times on those extracted values. During each run it assumes one value from the set of distinct values as the favorable outcome and the rest as unfavorable. The library reports the top three biases found across all these runs.

 For a class label called `Action`, examples of favorable outcomes are `On-demand pickup location`, `Voucher`, `Free Upgrade`, and `Premium features`, while `NA` is an unfavorable outcome.
 - `majority`: a map of expressions describing the majority group for each protected attribute.

 This parameter is **optional**.

 An example is taking `Age` as the protected attribute and specifying the majority group as `{'majority' : 'Age' : '[26,60]' }` or `{'majority' : 'Age' : '<60'}`.
 - `minority`: a map of expressions describing the minority group for each protected attribute.

 This parameter is **optional**.

 An example is taking `Age` as the protected attribute and specifying the minority group as `{'minority' : 'Age' : '[60,70]' }` or `{'minority' : 'Age' : '>=60' }`.
 - `threshold`: the decisive factor in determining the presence of bias. This value lets organizations set their own criteria for bias.

 This parameter is **optional**. If not specified, the default is 0.8. If a bias score is below the threshold (0.8 by default), the presence of bias is confirmed.
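
To make the role of `threshold` concrete, here is a minimal sketch of one common bias score, the disparate impact ratio: the favorable-outcome rate of the minority group divided by that of the majority group. Whether `ibm_bias_detection` computes its score in exactly this way is an assumption. The function names and the sample rows are hypothetical, and treating `Female` as the minority group is only for illustration.

```python
def favourable_rate(rows, attribute, group, class_label, favourable_class):
    """Share of rows in `group` whose label is a favourable outcome."""
    members = [r for r in rows if r[attribute] == group]
    if not members:
        return 0.0
    hits = sum(1 for r in members if r[class_label] in favourable_class)
    return hits / len(members)


def disparate_impact(rows, attribute, minority, majority, class_label, favourable_class):
    """Ratio of minority to majority favourable rates (1.0 means parity)."""
    majority_rate = favourable_rate(rows, attribute, majority, class_label, favourable_class)
    if majority_rate == 0:
        return float("inf")
    minority_rate = favourable_rate(rows, attribute, minority, class_label, favourable_class)
    return minority_rate / majority_rate


# Hypothetical sample: 4 of 5 "Male" rows and 2 of 5 "Female" rows are favourable.
rows = (
    [{"Gender": "Male", "Action": "Voucher"}] * 4
    + [{"Gender": "Male", "Action": "NA"}] * 1
    + [{"Gender": "Female", "Action": "Free Upgrade"}] * 2
    + [{"Gender": "Female", "Action": "NA"}] * 3
)
favourable = ["On-demand pickup location", "Voucher", "Free Upgrade", "Premium features"]

score = disparate_impact(rows, "Gender", "Female", "Male", "Action", favourable)
threshold = 0.8
print("score:", score, "bias detected:", score < threshold)
```

In this sample the favorable rates are 0.4 and 0.8, so the score falls below the 0.8 threshold and the data would be flagged.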
    

Define input parameters.

In [None]:
inputs={'class_label': "Action",
'protected_attributes': ["Gender"],
'threshold': 0.8,
'favourable_class': ['On-demand pickup location', 'Voucher', 'Free Upgrade', 'Premium features'],
'source_bias': True}

Run the bias checker.

In [None]:
biasD = data_bias_checker()
bias_result = biasD.data_checker(pd_data, inputs)

Based on the above report (no bias found), we can continue with model creation.

<a id="model"></a>
## 3. Create an Apache Spark machine learning model

In this section you will learn how to:

- [3.1 Prepare data for training a model](#prep)
- [3.2 Create an Apache Spark machine learning pipeline](#pipe)
- [3.3 Train a model](#train)

<a id="prep"></a>
### 3.1 Prepare data for training a model

In this subsection you will split your data into training and test data sets.

In [None]:
# Split roughly 80/20; the second argument (24) is the random seed.
train_data, test_data = df_data.randomSplit([0.8, 0.2], 24)

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))
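
The `randomSplit` call above assigns rows by random sampling, so the resulting counts are close to 80/20 but not exact, and a fixed seed makes the split repeatable. The following pure-Python sketch illustrates that idea under simplifying assumptions. It is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(records, weights, seed):
    """Assign each record to a split by an independent weighted draw."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for record in records:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed earlier.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(record)
                break
    return splits


records = list(range(1000))
train, test = random_split(records, [0.8, 0.2], seed=24)
print("Number of training records:", len(train))
print("Number of testing records :", len(test))
```

Calling the function again with the same seed returns the same two lists, which is why the notebook passes a seed to keep results comparable between runs.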

### 3.2 Create the pipeline<a id="pipe"></a>

In this section you will create an Apache Spark machine learning pipeline and then train the model.

In [None]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler, HashingTF, IDF, Tokenizer
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

In the following step, use the StringIndexer transformer to convert all the string fields to numeric ones.

In [None]:
string_indexer_gender = StringIndexer(inputCol="Gender", outputCol="gender_ix")
string_indexer_customer_status = StringIndexer(inputCol="Customer_Status", outputCol="customer_status_ix")
string_indexer_status = StringIndexer(inputCol="Status", outputCol="status_ix")
string_indexer_owner = StringIndexer(inputCol="Car_Owner", outputCol="owner_ix")
string_business_area = StringIndexer(inputCol="Business_Area", outputCol="area_ix")
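
By default, `StringIndexer` orders labels by descending frequency, so the most frequent value gets index `0.0`. The sketch below imitates that mapping in plain Python to show what the indexed columns contain. The helper name and the sample values are hypothetical, and the alphabetical tie-break is a choice made for this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


genders = ["Female", "Male", "Female", "Female", "Male"]
mapping = fit_string_index(genders)
indexed = [mapping[g] for g in genders]
print(mapping)
print(indexed)
```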

In [None]:
assembler = VectorAssembler(inputCols=["gender_ix", "customer_status_ix", "status_ix", "owner_ix", "area_ix", "Children", "Age", "Satisfaction"], outputCol="features")

In [None]:
# Fit the label indexer on the full data set so that every Action value gets an index.
string_indexer_action = StringIndexer(inputCol="Action", outputCol="label").fit(df_data)

In [None]:
# Map numeric predictions back to the original Action labels.
label_action_converter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=string_indexer_action.labels)

In [None]:
dt_action = DecisionTreeClassifier()

In [None]:
pipeline_action = Pipeline(stages=[string_indexer_gender, string_indexer_customer_status, string_indexer_status, string_indexer_action, string_indexer_owner, string_business_area, assembler, dt_action, label_action_converter])

<a id="train"></a>
### 3.3 Train a model

In [None]:
model_action = pipeline_action.fit(train_data)

In [None]:
predictions_action = model_action.transform(test_data)
predictions_action.select('Business_Area','Action','probability','predictedLabel').show(2)

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions_action)

print("Accuracy = %g" % accuracy)
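
The `accuracy` metric used above is the fraction of test records whose predicted label index equals the true label index. A minimal plain-Python sketch of that calculation follows. The sample labels and predictions are made up for illustration.

```python
def accuracy_score(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one record is required")
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


# Hypothetical label indices: four of the five predictions are correct.
labels = [0.0, 1.0, 2.0, 1.0, 0.0]
predictions = [0.0, 1.0, 1.0, 1.0, 0.0]
acc = accuracy_score(labels, predictions)
print("Accuracy = %g" % acc)
```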

<a id="persistence"></a>
## 4. Store the model in the repository

In this section you will store the trained model in the Watson Machine Learning repository. Some of the metadata is optional when a model is stored, but it is provided here so that the Continuous Learning System can be configured.

In [None]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

We need Watson Machine Learning credentials to store the model in the repository.

**TIP:** If needed, put your service credentials here.

In [None]:
# @hidden_cell
# How to get associated service credentials

wml_credentials = {
  "apikey": "***",
  "instance_id": "***",
  "password": "***",
  "url": "https://us-south.ml.cloud.ibm.com",
  "username": "***"
}

In [None]:
client = WatsonMachineLearningAPIClient(wml_credentials)

In [None]:
client.version

### 4.2 Save the pipeline and model<a id="save"></a>

In [None]:
db2_service_credentials = {
  "hostname": "***",
  "password": "***",
  "https_url": "***",
  "port": 50000,
  "ssldsn": "***",
  "host": "***",
  "jdbcurl": "***",
  "uri": "***",
  "db": "BLUDB",
  "dsn": "***",
  "username": "***",
  "ssljdbcurl": "***"
}

training_data_reference = {
 "name": "CARS4U training reference",
 "connection": db2_service_credentials,
 "source": {
  "tablename": table_name,
  "type": "dashdb"
 }
}

In [None]:
model_props = {
    client.repository.ModelMetaNames.NAME: "CARS4U - Action Recommendation Model",
    client.repository.ModelMetaNames.TRAINING_DATA_REFERENCE: training_data_reference,
    client.repository.ModelMetaNames.EVALUATION_METHOD: "multiclass",
    client.repository.ModelMetaNames.EVALUATION_METRICS: [
        {
           "name": "accuracy",
           "value": accuracy,
           "threshold": 0.7
        }
    ]
}

**Tip**: Use `client.repository.ModelMetaNames.show()` to get the list of available meta names.

In [None]:
published_model_details = client.repository.store_model(model=model_action, meta_props=model_props, training_data=train_data, pipeline=pipeline_action)

In [None]:
model_uid = client.repository.get_model_uid(published_model_details)
print(model_uid)

<a id="deploy"></a>
## 5. Deploy model in the IBM Cloud

You can use the following command to create an online deployment in the cloud.

In [None]:
deployment_details = client.deployments.create(model_uid=model_uid, name='CARS4U - Action Model Deployment')

You can use the deployed model to score new data through the scoring endpoint.

In [None]:
scoring_url = client.deployments.get_scoring_url(deployment_details)
print(scoring_url)
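
As a sketch of what a scoring request might contain, the code below builds a payload with `fields` and `values` keys. The field list is assumed from the `VectorAssembler` inputs in section 3.2, so the deployed model's actual input schema may require other columns. The sample record is made up, and the commented `client.deployments.score` call is an assumption about the client version used here.

```python
import json

# Assumed from the feature columns used in section 3.2.
FEATURE_FIELDS = [
    "Gender", "Customer_Status", "Status", "Car_Owner",
    "Business_Area", "Children", "Age", "Satisfaction",
]


def build_scoring_payload(records, fields=FEATURE_FIELDS):
    """Convert a list of dict records into a fields/values payload."""
    values = [[record[field] for field in fields] for record in records]
    return {"fields": list(fields), "values": values}


# Hypothetical customer record; the values are illustrative only.
sample = {
    "Gender": "Female", "Customer_Status": "Active", "Status": "M",
    "Car_Owner": "Yes", "Business_Area": "Product: Information",
    "Children": 2, "Age": 48, "Satisfaction": 0,
}

payload = build_scoring_payload([sample])
print(json.dumps(payload, indent=2))

# With a live deployment, the payload could then be sent, for example:
# predictions = client.deployments.score(scoring_url, payload)
```

Keeping payload construction in a small function makes it easy to check the column order before any request is sent.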

<a id="learning"></a>
## 6. Continuous Learning System

### 6.1 Setup

**TIP:** If needed, put your service credentials here.

In [None]:
# @hidden_cell

spark_credentials = {
  "tenant_id": "***",
  "tenant_id_full": "***",
  "cluster_master_url": "https://spark.bluemix.net",
  "tenant_secret": "***",
  "instance_id": "***",
  "plan": "ibm.SparkService.PayGoPersonal"
}

In [None]:
feedback_data_reference = {
 "name": "Cars4You feedback data",
 "connection": db2_service_credentials,
 "source": {
  "tablename": "CAR_RENTAL_FEEDBACK",
  "type": "dashdb"
 }
}

In [None]:
system_config = {
    client.learning_system.ConfigurationMetaNames.FEEDBACK_DATA_REFERENCE: feedback_data_reference,
    client.learning_system.ConfigurationMetaNames.MIN_FEEDBACK_DATA_SIZE: 10,
    client.learning_system.ConfigurationMetaNames.SPARK_REFERENCE: spark_credentials,
    client.learning_system.ConfigurationMetaNames.AUTO_RETRAIN: "never",
    client.learning_system.ConfigurationMetaNames.AUTO_REDEPLOY: "never"
}

**Note:** You can set the `AUTO_RETRAIN` option to either `always` or `conditionally`. The `AUTO_REDEPLOY` option can also be set to `always` or `conditionally`. `conditionally` means that the action happens only if the new model version is better than the previously used one.
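
The three policy values can be read as a simple decision rule. The sketch below illustrates the semantics described in the note and is not the service's implementation. The function name `should_act` and the use of a single "higher is better" metric are assumptions.

```python
def should_act(policy, new_metric, current_metric):
    """Decide whether to retrain or redeploy under the given policy."""
    if policy == "never":
        return False
    if policy == "always":
        return True
    if policy == "conditionally":
        # Act only when the new model version beats the current one.
        return new_metric > current_metric
    raise ValueError("unknown policy: %r" % policy)


# Hypothetical accuracies for the current and the newly trained model.
print(should_act("never", 0.82, 0.78))
print(should_act("always", 0.70, 0.78))
print(should_act("conditionally", 0.82, 0.78))
print(should_act("conditionally", 0.75, 0.78))
```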

In [None]:
learning_system_details = client.learning_system.setup(model_uid=model_uid, meta_props=system_config)

### 6.2 Run learning system iteration

In [None]:
run_details = client.learning_system.run(model_uid, asynchronous=False)

In [None]:
client.learning_system.list()

In [None]:
client.learning_system.list_runs(model_uid)

In [None]:
client.learning_system.list_metrics(model_uid)

---