<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="5" color="black"><b>Tutorial: Build, Save and Deploy Model to IBM Watson Machine Learning (WML)</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
</table>

In this notebook you will learn how to build a predictive model with the Spark machine learning API (SparkML) and deploy it for scoring in the Watson Machine Learning (WML) service.

This notebook walks you through these steps:
- Build a model with the SparkML API
- Save the model in the WML repository
- Create a deployment in WML
- Invoke the deployed model with a REST client to test it

### Use Case

The analytics use case implemented in this notebook is telco churn. While it's a simple use case, it implements all steps from the CRISP-DM methodology, which is the recommended best practice for implementing predictive analytics.
![CRISP-DM](https://raw.githubusercontent.com/yfphoon/dsx_demo/master/crisp_dm.png)

The analytics process starts with defining the business problem and identifying the data that can be used to solve the problem. For telco churn, we use demographic and historical transaction data. We also know which customers have churned, which is the critical information for building predictive models. In the next step, we use visual APIs for data understanding and complete some data preparation tasks. In a typical analytics project, data preparation includes more steps (for example, formatting data or deriving new variables).

Once the data is ready, we can build a predictive model. In our example we use the SparkML Random Forest classification model. Classification is a statistical technique that assigns a "class" to each customer record (for our use case, "churn" or "no churn"). Classification models use historical data to learn the logic for predicting the class; this process is called model training. After the model is created, it is usually evaluated using another data set.

Finally, if the model's accuracy meets the expectations, it can be deployed for scoring. Scoring is the process of applying the model to a new set of data. For example, when we receive new transactional data, we can score the customer for the risk of churn.   

### Working with Notebooks

If you are new to Notebooks, here's a quick overview of how to work in this environment.

1. To run the notebook, it must be in edit mode. If you don't see the menu in the notebook, it is not in edit mode; click the pencil icon to switch.
2. The notebook has two types of cells: markdown (text) and code.
3. Code cells can be executed one at a time or all together (see the options under the Cell menu). In this notebook we run one cell at a time because some cells need code changes first.
4. To run a cell, place the cursor in the code cell and click the Run (arrow) icon. The cell is running while you see a * next to it. Some cells print output.
5. Work through this notebook by reading the instructions and executing the code cell by cell. Some cells require modifications before you run them.

### Step 1: Load data 

In this notebook we are loading data from a URL. Other data sources can be used - flat files from Object Storage, databases, Hadoop, etc. 

In [1]:
#Run this cell only once to install the wget library
!pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Running setup.py bdist_wheel for wget ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [4]:
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/ee/2f/709df6e8dc00624689aa0a11c7a4c06061a7d00037e370584b9f011df44c/pyspark-2.3.1.tar.gz (211.9MB)

In [None]:
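from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Run Spark locally with one worker thread; 'demo1' is the application name
conf = SparkConf().setAppName('demo1').setMaster("local")
sc = SparkContext(conf=conf)
# SQLContext is used below to read the CSV files into DataFrames
sqlContext = SQLContext(sc)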

In [5]:
import wget
url_churn='https://raw.githubusercontent.com/yfphoon/dsx_demo/master/data/customer_churn/churn.csv'
url_customer='https://raw.githubusercontent.com/yfphoon/dsx_demo/master/data/customer_churn/customer.csv'

#remove existing files before downloading
!rm -f churn.csv
!rm -f customer.csv

churnFilename=wget.download(url_churn)
customerFilename=wget.download(url_customer)

customer_churn= sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load(churnFilename)
customer= sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load(customerFilename)

customer.take(5)

[Row(ID=1, Gender='F', Status='S', Children=1.0, Est Income=38000.0, Car Owner='N', Age=24.393333, LongDistance=23.56, International=0.0, Local=206.08, Dropped=0.0, Paymethod='CC', LocalBilltype='Budget', LongDistanceBilltype='Intnl_discount', Usage=229.64, RatePlan=3.0),
 Row(ID=6, Gender='M', Status='M', Children=2.0, Est Income=29616.0, Car Owner='N', Age=49.426667, LongDistance=29.78, International=0.0, Local=45.5, Dropped=0.0, Paymethod='CH', LocalBilltype='FreeLocal', LongDistanceBilltype='Standard', Usage=75.29, RatePlan=2.0),
 Row(ID=8, Gender='M', Status='M', Children=0.0, Est Income=19732.8, Car Owner='N', Age=50.673333, LongDistance=24.81, International=0.0, Local=22.44, Dropped=0.0, Paymethod='CC', LocalBilltype='FreeLocal', LongDistanceBilltype='Standard', Usage=47.25, RatePlan=3.0),
 Row(ID=11, Gender='M', Status='S', Children=2.0, Est Income=96.33, Car Owner='N', Age=56.473333, LongDistance=26.13, International=0.0, Local=32.88, Dropped=1.0, Paymethod='CC', LocalBilltype

In [18]:
type(customer)

pyspark.sql.dataframe.DataFrame

In [19]:
type(customer.toPandas())

pandas.core.frame.DataFrame

In [12]:
customer.toPandas().head()

| | ID | Gender | Status | Children | Est Income | Car Owner | Age | LongDistance | International | Local | Dropped | Paymethod | LocalBilltype | LongDistanceBilltype | Usage | RatePlan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | F | S | 1.0 | 38000.0 | N | 24.393333 | 23.56 | 0.0 | 206.08 | 0.0 | CC | Budget | Intnl_discount | 229.64 | 3.0 |
| 1 | 6 | M | M | 2.0 | 29616.0 | N | 49.426667 | 29.78 | 0.0 | 45.5 | 0.0 | CH | FreeLocal | Standard | 75.29 | 2.0 |
| 2 | 8 | M | M | 0.0 | 19732.8 | N | 50.673333 | 24.81 | 0.0 | 22.44 | 0.0 | CC | FreeLocal | Standard | 47.25 | 3.0 |
| 3 | 11 | M | S | 2.0 | 96.33 | N | 56.473333 | 26.13 | 0.0 | 32.88 | 1.0 | CC | Budget | Standard | 59.01 | 1.0 |
| 4 | 14 | F | M | 2.0 | 52004.8 | N | 25.14 | 5.03 | 0.0 | 23.11 | 0.0 | CH | Budget | Intnl_discount | 28.14 | 1.0 |


In [13]:
customer_churn.toPandas().head()

| | ID | CHURN |
|---|---|---|
| 0 | 1 | T |
| 1 | 6 | F |
| 2 | 8 | F |
| 3 | 11 | F |
| 4 | 14 | F |


If the first step ran successfully (you saw the output), then continue reviewing the notebook and running each code cell step by step. Note that not every cell has a visual output. The cell is still running if you see a * in the brackets next to the cell. 

If the first step didn't finish successfully, check with the instructor. 

### Step 2: Merge Files

In [8]:
merged=customer.join( customer_churn, customer['ID'] == customer_churn['ID']).select(customer['*'],customer_churn['CHURN'])
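
The join above keeps only the customers whose `ID` appears in both DataFrames (an inner join) and adds the `CHURN` column to the customer columns. The following is a minimal plain-Python sketch of that idea, not Spark code. The toy rows and the `inner_join` helper are illustrative assumptions, and the helper assumes each `ID` occurs at most once on the right-hand side.

```python
# Toy rows for illustration only (not the real data set)
customers = [
    {"ID": 1, "Gender": "F"},
    {"ID": 6, "Gender": "M"},
    {"ID": 7, "Gender": "M"},  # no churn record, so the inner join drops it
]
churn = [
    {"ID": 1, "CHURN": "T"},
    {"ID": 6, "CHURN": "F"},
]


def inner_join(left, right, key, right_cols):
    """Keep left rows whose key appears in right and add the chosen right columns."""
    lookup = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **{col: match[col] for col in right_cols}})
    return joined


merged_rows = inner_join(customers, churn, "ID", ["CHURN"])
for row in merged_rows:
    print(row)
```

Customers without a churn record never reach the model, so it is worth comparing row counts before and after the merge.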

### Step 3: Rename some columns
This step removes spaces from column names. It is an example of the data preparation that you may have to do before creating a model.

In [14]:
merged = merged.withColumnRenamed("Est Income", "EstIncome").withColumnRenamed("Car Owner","CarOwner")
merged.toPandas().head()

| | ID | Gender | Status | Children | EstIncome | CarOwner | Age | LongDistance | International | Local | Dropped | Paymethod | LocalBilltype | LongDistanceBilltype | Usage | RatePlan | CHURN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | F | S | 1.0 | 38000.0 | N | 24.393333 | 23.56 | 0.0 | 206.08 | 0.0 | CC | Budget | Intnl_discount | 229.64 | 3.0 | T |
| 1 | 6 | M | M | 2.0 | 29616.0 | N | 49.426667 | 29.78 | 0.0 | 45.5 | 0.0 | CH | FreeLocal | Standard | 75.29 | 2.0 | F |
| 2 | 8 | M | M | 0.0 | 19732.8 | N | 50.673333 | 24.81 | 0.0 | 22.44 | 0.0 | CC | FreeLocal | Standard | 47.25 | 3.0 | F |
| 3 | 11 | M | S | 2.0 | 96.33 | N | 56.473333 | 26.13 | 0.0 | 32.88 | 1.0 | CC | Budget | Standard | 59.01 | 1.0 | F |
| 4 | 14 | F | M | 2.0 | 52004.8 | N | 25.14 | 5.03 | 0.0 | 23.11 | 0.0 | CH | Budget | Intnl_discount | 28.14 | 1.0 | F |
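
When a data set has many columns, renaming them one at a time gets tedious. Below is a small sketch of a helper that works out the renames for any list of column names. The function name `space_free_names` is hypothetical and not part of the notebook.

```python
def space_free_names(columns):
    """Map each column name that contains spaces to the same name without them."""
    return {name: name.replace(" ", "") for name in columns if " " in name}


renames = space_free_names(["ID", "Est Income", "Car Owner", "Age"])
print(renames)

# With a Spark DataFrame `df`, the mapping could be applied like this:
# for old, new in renames.items():
#     df = df.withColumnRenamed(old, new)
```

Columns without spaces are left out of the mapping, so they keep their original names.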


### Step 4: Data understanding

Data preparation and data understanding are the most time-consuming tasks in the data mining process. The data scientist needs to review and evaluate the quality of data before modeling.

Visualization is one of the ways to review data.

The Brunel Visualization Language is a highly succinct and novel language that defines interactive data visualizations based on tabular data. The language is well suited for both data scientists and business users. 
More information about Brunel Visualization: https://github.com/Brunel-Visualization/Brunel/wiki

Try Brunel visualization here: http://brunel.mybluemix.net/gallery_app/renderer

In [15]:
import brunel
Merged = merged.toPandas()
%brunel data('Merged') bar x(CHURN) y(EstIncome) mean(EstIncome) color(LocalBilltype) stack tooltip(EstIncome) | x(LongDistance) y(Usage) point color(Paymethod) tooltip(LongDistance, Usage) :: width=1100, height=400 

<IPython.core.display.Javascript object>
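
As we read the Brunel command, the first chart plots the mean of `EstIncome` for each `CHURN` value, stacked by `LocalBilltype`. The following plain-Python sketch shows that kind of grouped mean, using only the five merged rows displayed in Step 3. The helper name `mean_by_group` is an assumption for illustration.

```python
from collections import defaultdict

# (CHURN, LocalBilltype, EstIncome) for the five rows shown in Step 3
rows = [
    ("T", "Budget", 38000.0),
    ("F", "FreeLocal", 29616.0),
    ("F", "FreeLocal", 19732.8),
    ("F", "Budget", 96.33),
    ("F", "Budget", 52004.8),
]


def mean_by_group(records):
    """Average the income for each (churn, bill type) combination."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for churn, billtype, income in records:
        bucket = totals[(churn, billtype)]
        bucket[0] += income
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


group_means = mean_by_group(rows)
for group, mean_income in sorted(group_means.items()):
    print(group, round(mean_income, 2))
```

Five rows are far too few to draw conclusions; the sketch only shows what the chart aggregates.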

### Step 5: Build the Spark pipeline and the Random Forest model
"Pipeline" is an API in SparkML that's used for building models.
Additional information on SparkML: https://spark.apache.org/docs/2.0.2/ml-guide.html

In [16]:
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Prepare string variables so that they can be used by the decision tree algorithm
stringIndexer1 = StringIndexer(inputCol='Gender', outputCol='GenderEncoded')
stringIndexer2 = StringIndexer(inputCol='Status',outputCol='StatusEncoded')
stringIndexer3 = StringIndexer(inputCol='CarOwner',outputCol='CarOwnerEncoded')
stringIndexer4 = StringIndexer(inputCol='Paymethod',outputCol='PaymethodEncoded')
stringIndexer5 = StringIndexer(inputCol='LocalBilltype',outputCol='LocalBilltypeEncoded')
stringIndexer6 = StringIndexer(inputCol='LongDistanceBilltype',outputCol='LongDistanceBilltypeEncoded')
stringIndexer7 = StringIndexer(inputCol='CHURN', outputCol='label')

# The Pipelines API requires that input variables are passed in a vector
assembler = VectorAssembler(
    inputCols=["GenderEncoded", "StatusEncoded", "CarOwnerEncoded", "PaymethodEncoded",
               "LocalBilltypeEncoded", "LongDistanceBilltypeEncoded", "Children", "EstIncome",
               "Age", "LongDistance", "International", "Local", "Dropped", "Usage", "RatePlan"],
    outputCol="features")


# instantiate the algorithm, take the default settings
rf=RandomForestClassifier(labelCol="label", featuresCol="features")

#pipeline = Pipeline(stages=[stringIndexer1, stringIndexer2, stringIndexer3, assembler, rf])
pipeline = Pipeline(stages=[stringIndexer1,stringIndexer2,stringIndexer3,stringIndexer4,stringIndexer5,stringIndexer6,stringIndexer7, assembler, rf])
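
It helps to know which number each string becomes. By default, `StringIndexer` gives index 0.0 to the most frequent value, 1.0 to the next most frequent, and so on. The sketch below imitates that ordering in plain Python. It is not Spark's implementation, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Assign 0.0 to the most frequent string, 1.0 to the next, and so on.

    Ties are broken alphabetically; that tie rule is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(position) for position, value in enumerate(ordered)}


# CHURN values of the five rows shown in Step 3
churn_index = fit_string_index(["T", "F", "F", "F", "F"])
encoded = [churn_index[value] for value in ["T", "F", "F"]]
print(churn_index)
print(encoded)
```

Here "F" is the more frequent value, so it maps to 0.0 and "T" (churn) maps to 1.0. If churners were the majority in the training data, the mapping would flip, which matters when you interpret the `label` and `prediction` columns.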

In [17]:
# Split data into train and test datasets
train, test = merged.randomSplit([0.8,0.2], seed=6)
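
`randomSplit` assigns rows to the splits at random according to the (normalized) weights, so the resulting sizes are close to 80/20 rather than exact, and the seed makes the split repeatable. The following plain-Python sketch shows the concept. It is not how Spark implements the split, and the `random_split` helper is an illustrative assumption.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by an independent random draw."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train_rows, test_rows = random_split(list(range(1000)), [0.8, 0.2], seed=6)
print(len(train_rows), len(test_rows))
```

Running the function again with the same seed returns the same split, which is why the notebook fixes `seed=6`.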

In [24]:
# Build models
model = pipeline.fit(train)

### Step 6: Score the test data set

In [25]:
results = model.transform(test)

### Step 7: Model Evaluation 

In [27]:
# Fraction of test rows where the predicted label matches the true label (accuracy)
print('Accuracy model1 = {:.2f}.'.format(results.filter(results.label == results.prediction).count() / float(results.count())))

Accuracy model1 = 0.94.
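
The ratio computed above is the share of test rows where `label` equals `prediction`, which is usually called accuracy. Precision and recall answer different questions, and for churn they are often the more interesting numbers. Below is a minimal sketch that computes all three from (label, prediction) pairs. The toy pairs and the function name are assumptions for illustration.

```python
def classification_metrics(pairs, positive=1.0):
    """Compute accuracy, precision and recall from (label, prediction) pairs."""
    tp = sum(1 for label, pred in pairs if label == positive and pred == positive)
    fp = sum(1 for label, pred in pairs if label != positive and pred == positive)
    fn = sum(1 for label, pred in pairs if label == positive and pred != positive)
    correct = sum(1 for label, pred in pairs if label == pred)
    return {
        "accuracy": correct / len(pairs),               # all correct / all rows
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # correct among predicted positives
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # found among actual positives
    }


# Toy (label, prediction) pairs, not taken from the real results
toy_pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0)]
metrics = classification_metrics(toy_pairs)
print(metrics)
```

With imbalanced classes, a model can reach high accuracy while missing most churners, so it is worth checking recall for the churn class as well.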


In [28]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

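# Evaluate model
# Note: rawPredictionCol is set to the hard 0/1 "prediction" column here, so the ROC curve
# is built from a single threshold; pass "rawPrediction" to rank rows by the model's raw scores.
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label", metricName="areaUnderROC")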
print('Area under ROC curve = {:.2f}.'.format(evaluator.evaluate(results)))

Area under ROC curve = 0.93.
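
The area under the ROC curve can be read as the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative row, with ties counting as half. The evaluator above is given the hard `prediction` column, which contains only 0 and 1. The sketch below uses toy numbers (an assumption, not the model's output) to show that the same model can receive a different AUC when scores are replaced by hard predictions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative row")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0, 0]
probabilities = [0.9, 0.4, 0.6, 0.2, 0.1]
hard_predictions = [1.0 if p >= 0.5 else 0.0 for p in probabilities]

auc_from_scores = auc(labels, probabilities)
auc_from_hard = auc(labels, hard_predictions)
print(round(auc_from_scores, 3), round(auc_from_hard, 3))
```

The pairwise loop is quadratic, so it is only suitable for small examples like this one.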


We have finished building and testing a predictive model. The next step is to deploy it for real-time scoring.

### Step 8: Save Model in WML repository

In this section you will store your model in the Watson Machine Learning (WML) repository by using Python client libraries.
* <a href="https://console.ng.bluemix.net/docs/services/PredictiveModeling/index.html">WML Documentation</a>
* <a href="http://watson-ml-api.mybluemix.net/">WML API</a> 
<br/>

First, you must import client libraries.

In [29]:
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact

ImportError: No module named 'repository'

Put your authentication information from your instance of the Watson Machine Learning service in <a href="https://console.ng.bluemix.net/dashboard/apps/" target="_blank">Bluemix</a> in the next cell. You can find your information on the **Service Credentials** tab of your service instance in Bluemix. 

<span style="color:red">Replace the service_path and credentials with your own information.</span> [Helper video](https://raw.githubusercontent.com/ibm-cloud-architecture/refarch-data-science/master/videos/LookupService.mp4). *Click "View Raw"* 

service_path=[your url]<br/>
username=[your username]<br/>
password=[your password]<br/>

In [None]:
# Replace these placeholders with the values from the Service Credentials tab
service_path = 'https://ibm-watson-ml.mybluemix.net'
username = '[your username]'
password = '[your password]'

Authorize the repository client to invoke the service:

In [None]:
ml_repository_client = MLRepositoryClient(service_path)
ml_repository_client.authorize(username, password)

Create the model artifact (abstraction layer).

<b>Tip:</b> The MLRepositoryArtifact method expects a trained model object, training data, and a model name. (It is this model name that is displayed by the Watson Machine Learning service).


In [None]:
model_artifact = MLRepositoryArtifact(model, training_data=train, name="Predict Customer Churn")

Save pipeline and model artifacts to your Watson Machine Learning instance:

In [None]:
saved_model = ml_repository_client.models.save(model_artifact)

In [None]:
# Print the saved model properties
print("modelType: " + saved_model.meta.prop("modelType"))
print("creationTime: " + str(saved_model.meta.prop("creationTime")))
print("modelVersionHref: " + saved_model.meta.prop("modelVersionHref"))
print("label: " + saved_model.meta.prop("label"))

### Step 9:  Generate Authorization Token for Invoking the model
[Helper video for steps 9, 10, 11](https://raw.githubusercontent.com/ibm-cloud-architecture/refarch-data-science/master/videos/RESTClient.mp4). *Click "View Raw"* 

In [None]:
import urllib3, requests, json

# Exchange the service credentials (HTTP basic auth) for a token
headers = urllib3.util.make_headers(basic_auth='{}:{}'.format(username, password))
url = '{}/v2/identity/token'.format(service_path)
response = requests.get(url, headers=headers)
mltoken = json.loads(response.text).get('token')
print(mltoken)
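
For readers curious about what the `basic_auth` option produces: HTTP basic authentication sends `username:password` Base64-encoded in an `Authorization` header. The sketch below builds an equivalent header with the standard library only. The helper name `basic_auth_header` is hypothetical, and the UTF-8 encoding is an assumption that makes no difference for plain ASCII credentials.

```python
import base64


def basic_auth_header(user, secret):
    """Build an HTTP basic-auth header from a username and password."""
    raw = "{}:{}".format(user, secret).encode("utf-8")
    return {"authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


example_header = basic_auth_header("user", "pass")
print(example_header)
```

Base64 is an encoding, not encryption, so credentials sent this way are only protected by the HTTPS connection.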

#### Step 9.1 Copy the generated token into your notepad

### Step 10:  Go to WML in Bluemix to create a Deployment Endpoint and Test the Deployed model

* In your <a href="https://console.ng.bluemix.net/dashboard/apps/" target="_blank">Bluemix</a> dashboard, click your WML service and then click the **Launch Dashboard** button under Watson Machine Learning.
![WML Launch Dashboard](https://raw.githubusercontent.com/yfphoon/dsx_demo/master/WML_Launch_Dashboard.png)

<br/>
* You should see your saved model in the **Models** tab


* Under *Actions*, click the ellipsis (three dots) and click ***Create Deployment***. Give your deployment configuration a unique name, e.g. "Predict Customer Churn Deploy", accept the defaults and click **Save**.
<br/>
<br/>
* In the *Deployments tab*, under *Actions*, click **View Details**
<br/>
<br/>
* Scroll down to **API Details** and copy the value of the **Scoring Endpoint** into your notepad (e.g. https://ibm-watson-ml.mybluemix.net/v2/published_models/64fd0462-3f8a-4b42-820b-59a4da9b7dc6/deployments/7d9995ed-7daf-4cfd-b40f-37cb8ab3d88f/online).



### Step 11:  Invoke the model with a REST Client, e.g. https://client.restlet.com/

In the REST client interface enter the following information:

1. Protocol:  **HTTPS**
<br/>
<br/>

2. URI: **your scoring endpoint**  (Step 10)
<br/>
<br/>
3. method: **POST**
<br/>
<br/>
4. Authorization:  **your generated token** (Step 9). Hint: Add "Basic authorization" with a dummy value of 1 in the userid field. Then replace the value with the token. 
<br/>
<br/>
5. Content Type: **application/json**
<br/>
<br/>
6. JSON Body:<br/>**{
  "fields": [
    "ID","Gender","Status","Children","EstIncome","CarOwner","Age","LongDistance","International","Local","Dropped","Paymethod","LocalBilltype","LongDistanceBilltype","Usage","RatePlan"
  ],
  "values": [ 
  [999,"F","M",2.0,77551.100000,"Y",33.600000,20.530000,0.000000,41.890000,1.000000,"CC","Budget","Intnl_discount",62.420000,2.000000]
  ]
}**
<br/>
<br/>
7. Click **Send**

Scroll down to the **RESPONSE** section to see the scored results

**Note:** The values in the JSON body do not include the label.
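
The same request can also be prepared from Python. The sketch below only builds the request with the standard library and does not send it. The endpoint URL and token are placeholders, the helper name `build_scoring_request` is hypothetical, and the `Bearer` prefix in the `Authorization` header is an assumption; check what your service instance expects.

```python
import json
import urllib.request


def build_scoring_request(scoring_endpoint, token, fields, values):
    """Prepare (but do not send) a POST request with the scoring payload."""
    payload = json.dumps({"fields": fields, "values": values}).encode("utf-8")
    return urllib.request.Request(
        scoring_endpoint,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # assumption: bearer-token scheme
        },
    )


fields = ["ID", "Gender", "Status", "Children", "EstIncome", "CarOwner", "Age",
          "LongDistance", "International", "Local", "Dropped", "Paymethod",
          "LocalBilltype", "LongDistanceBilltype", "Usage", "RatePlan"]
values = [[999, "F", "M", 2.0, 77551.1, "Y", 33.6, 20.53, 0.0, 41.89, 1.0,
           "CC", "Budget", "Intnl_discount", 62.42, 2.0]]

# Placeholder endpoint and token; replace with the values from Steps 9 and 10
request = build_scoring_request(
    "https://example.invalid/your-scoring-endpoint", "your-token", fields, values)
print(request.get_method(), request.full_url)

# To send it: response_body = urllib.request.urlopen(request).read()
```

Each row in `values` must list its entries in the same order as `fields`, which is easy to get wrong when editing the JSON by hand.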

**Sample REST Client Input**
![Rest Client Input](https://github.com/ibm-cloud-architecture/refarch-data-science/blob/master/static/imgs/RestRequest.PNG?raw=true)

You have come to the end of this notebook.

### Summary

In this notebook you created a model using the SparkML API, deployed it in the Watson Machine Learning service for online (real-time) scoring and tested it using a REST API client.

### Lab Verification
Replace `<name>` with your name, run the cell, and take a screenshot. Lab facilitators will provide instructions for submitting the screenshot.

In [None]:
print("<name> finished this lab!")


Created by **Sidney Phoon**
<br/>
yfphoon@us.ibm.com
<br/>
April 25, 2017