## Customer Churn Model Scoring
#### The objectives of this lab are to:
- score **new** customer data against a pre-built model
- schedule the notebook to run via the Notebook scheduler

### Step 1: Download new customer data

In [2]:
import wget
url_customer='https://raw.githubusercontent.com/yfphoon/dsx_demo/master/data/new_customer_churn_data.csv'

#remove existing files before downloading
!rm -f new_customer_churn_data.csv

customerFilename=wget.download(url_customer)

!ls -l new_customer_churn_data.csv

-rw------- 1 sf94-47271e4efc25ab-4c4827746caf users 27597 Aug 11 15:12 new_customer_churn_data.csv


### Step 2: Read data into a Spark DataFrame
**Note**: the new dataset does not contain the label column

In [3]:
newData= sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load(customerFilename)

In [4]:
newData = newData.withColumnRenamed("Est Income", "EstIncome").withColumnRenamed("Car Owner","CarOwner")
newData.toPandas().head()

Unnamed: 0,ID,Gender,Status,Children,EstIncome,CarOwner,Age,LongDistance,International,Local,Dropped,Paymethod,LocalBilltype,LongDistanceBilltype,Usage,RatePlan
0,2048,F,S,1,13576.5,N,39.426667,14.83,0,25.66,0,CC,Budget,Standard,40.49,1
1,2054,F,M,2,84166.1,N,54.013333,3.28,0,11.74,1,CC,Budget,Standard,15.02,2
2,2075,F,S,0,68427.4,N,42.393333,23.76,0,50.05,0,Auto,FreeLocal,Standard,73.81,3
3,2095,F,M,2,77551.1,Y,33.6,20.53,0,41.89,1,CC,Budget,Intnl_discount,62.42,2
4,2108,F,S,1,13109.1,N,62.606667,22.38,0,40.48,0,Auto,Budget,Standard,62.87,1


### Step 3: Load Saved Model
Load the saved model from Object Storage.

In [5]:
from pyspark.ml import PipelineModel
model1_loaded = PipelineModel.load("PredictChurn.churnModel")

### Step 4: Score the new data
**Note**: the scored output contains the predicted values (`prediction` and `predictedLabel`) and the class probabilities (`probability`)

In [6]:
result = model1_loaded.transform(newData)

### Step 5: Export the scores to a CSV file

In [7]:
#Select ID, prediction and probability fields from the result dataframe

r1=result.select(result["ID"],result["predictedLabel"],result["prediction"],result["probability"])
r1.show(5,False)

+----+--------------+----------+------------------------------------------+
|ID  |predictedLabel|prediction|probability                               |
+----+--------------+----------+------------------------------------------+
|2048|T             |1.0       |[0.019912822658724294,0.9800871773412757] |
|2054|T             |1.0       |[0.27537695698434417,0.7246230430156559]  |
|2075|F             |0.0       |[1.0,0.0]                                 |
|2095|F             |0.0       |[0.9365642172419282,0.0634357827580719]   |
|2108|T             |1.0       |[0.0018529893529893534,0.9981470106470107]|
+----+--------------+----------+------------------------------------------+
only showing top 5 rows
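
To help read the output above: in these five rows, `prediction` is the index of the larger entry in `probability`, and `predictedLabel` is the label for that index (`1.0` maps to `T`, `0.0` maps to `F`). The plain-Python sketch below checks this against the rows shown. The helper name `most_likely_class` is hypothetical, and the code only illustrates how to interpret the columns. It is not how Spark computes them internally.

```python
# Rows copied from the r1.show(5, False) output above:
# (ID, predictedLabel, prediction, probability vector)
rows = [
    (2048, "T", 1.0, [0.019912822658724294, 0.9800871773412757]),
    (2054, "T", 1.0, [0.27537695698434417, 0.7246230430156559]),
    (2075, "F", 0.0, [1.0, 0.0]),
    (2095, "F", 0.0, [0.9365642172419282, 0.0634357827580719]),
    (2108, "T", 1.0, [0.0018529893529893534, 0.9981470106470107]),
]


def most_likely_class(probability):
    """Return the index of the largest probability as a float (hypothetical helper)."""
    return float(max(range(len(probability)), key=lambda i: probability[i]))


# Derive a class index from each probability vector and compare it
# with the prediction column reported by the model.
derived = [most_likely_class(probability) for _, _, _, probability in rows]
for (customer_id, label, prediction, _), d in zip(rows, derived):
    print(customer_id, label, prediction, d, prediction == d)
```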



#### Decompose the probability column
The `probability` column contains a vector for each record. Extract its elements into separate numeric columns before exporting the scores.

In [8]:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf

# UDFs that each pull one element out of the probability vector as a double
udf_0 = udf(lambda vector: float(vector[0]), DoubleType())
udf_1 = udf(lambda vector: float(vector[1]), DoubleType())

r2 = (r1.select(r1["ID"], r1["prediction"],r1["probability"])
    .withColumn('probability_0', udf_0(r1.probability))
    .withColumn('probability_1', udf_1(r1.probability))
    .drop("probability"))

r2.show(10, False)

+----+----------+---------------------+--------------------+
|ID  |prediction|probability_0        |probability_1       |
+----+----------+---------------------+--------------------+
|2048|1.0       |0.019912822658724294 |0.9800871773412757  |
|2054|1.0       |0.27537695698434417  |0.7246230430156559  |
|2075|0.0       |1.0                  |0.0                 |
|2095|0.0       |0.9365642172419282   |0.0634357827580719  |
|2108|1.0       |0.0018529893529893534|0.9981470106470107  |
|2124|0.0       |0.995243288590604    |0.004756711409395973|
|2154|1.0       |0.13510869565217393  |0.8648913043478261  |
|2218|0.0       |0.9820768344696615   |0.01792316553033841 |
|2267|0.0       |0.9756357043893127   |0.024364295610687297|
|2284|1.0       |0.06202317290552585  |0.937976827094474   |
+----+----------+---------------------+--------------------+
only showing top 10 rows
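
As a rough illustration of what `udf_0` and `udf_1` do to each record, the plain-Python sketch below flattens two of the rows shown above. The function name `decompose` is hypothetical and no Spark is involved. It only mirrors the per-row logic of the two lambdas.

```python
def decompose(record):
    """Flatten one scored record, mirroring udf_0 and udf_1 (hypothetical helper)."""
    customer_id, prediction, probability = record
    # Index 0 and index 1 of the vector become two separate float fields.
    return (customer_id, prediction, float(probability[0]), float(probability[1]))


# Two records copied from the output above: (ID, prediction, probability vector)
scored = [
    (2048, 1.0, [0.019912822658724294, 0.9800871773412757]),
    (2075, 0.0, [1.0, 0.0]),
]

flat = [decompose(record) for record in scored]
for row in flat:
    print(row)
```

Each flattened row holds only plain numbers, and its two probabilities sum to 1 up to floating-point rounding.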



#### Connect to Object Storage
To write the scores to Object Storage, specify the credentials for your Object Storage instance.  The easiest way to do that is:
- If you do not already have a file in Object Storage, load a file into it using the **Files** interface
- Choose "*Insert SparkSession DataFrame*" to generate the credentials and code to connect to Object Storage

![Load Files](https://raw.githubusercontent.com/yfphoon/IntroToNotebooks/master/images/upload_files.png)

- Comment out or edit the generated code that reads the file.  The edited code cell should look like this:

![credentials](https://raw.githubusercontent.com/yfphoon/IntroToNotebooks/master/images/generated_credentials.png)



In [None]:
# insert code here



#### Write the scores to a .csv file

The code cell below specifies the options for saving the csv file.  Check that **TARGET_CONTAINER** is set to your project's container.

In [28]:
from ingest.Connectors import Connectors

objectstoresaveOptions = {
        Connectors.BluemixObjectStorage.AUTH_URL          : credentials['auth_url'],
        Connectors.BluemixObjectStorage.USERID            : credentials['user_id'],
        Connectors.BluemixObjectStorage.PASSWORD          : credentials['password'],
        Connectors.BluemixObjectStorage.PROJECTID         : credentials['project_id'],
        Connectors.BluemixObjectStorage.REGION            : credentials['region'],
        Connectors.BluemixObjectStorage.TARGET_CONTAINER  : 'IntroToNotebooks',
        Connectors.BluemixObjectStorage.TARGET_FILE_NAME  : 'churn_scores.csv',
        Connectors.BluemixObjectStorage.TARGET_WRITE_MODE : 'write'}


r2.write.format("com.ibm.spark.discover").options(**objectstoresaveOptions).save()

In [33]:
r3 = spark.read\
  .format('csv')\
  .load(bmos.url('IntroToNotebooks', 'churn_scores.csv'))
r3.select(r3["_c0"].alias("ID"), r3["_c1"].alias("prediction"), r3["_c2"].alias("probability_0"), r3["_c3"].alias("probability_1")).show(5, False)

+----+----------+---------------------+------------------+
|ID  |prediction|probability_0        |probability_1     |
+----+----------+---------------------+------------------+
|2048|1.0       |0.019912822658724294 |0.9800871773412757|
|2054|1.0       |0.27537695698434417  |0.7246230430156559|
|2075|0.0       |1.0                  |0.0               |
|2095|0.0       |0.9365642172419282   |0.0634357827580719|
|2108|1.0       |0.0018529893529893534|0.9981470106470107|
+----+----------+---------------------+------------------+
only showing top 5 rows
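
The read-back cell aliases `_c0` through `_c3` because the file is read without a `header` option, so Spark assigns those default column names. Without `inferSchema`, the values also come back as strings. The plain-Python sketch below shows the same idea of attaching names and types to headerless CSV rows. The in-memory text is a hypothetical stand-in for the first two lines of `churn_scores.csv`, using values copied from the output above.

```python
import csv
import io

# Hypothetical stand-in for the first lines of churn_scores.csv (no header row).
raw = io.StringIO(
    "2048,1.0,0.019912822658724294,0.9800871773412757\n"
    "2054,1.0,0.27537695698434417,0.7246230430156559\n"
)

# Column names and types to apply by position, matching the aliases used above.
names = ["ID", "prediction", "probability_0", "probability_1"]
casts = [int, float, float, float]

# csv.reader yields every field as a string, so cast each one explicitly.
records = [
    {name: cast(value) for name, cast, value in zip(names, casts, line)}
    for line in csv.reader(raw)
]

for record in records:
    print(record)
```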



### Step 6: Schedule this notebook to run at a time and frequency of your choice
Click the "clock" icon at the top right.

You have come to the end of this notebook.

**Sidney Phoon** <br/>
yfphoon@us.ibm.com<br/>
Aug, 2017