# Logistic Regression Consulting Project

## Binary Customer Churn

A marketing agency has many customers that use its service to produce ads for their websites. The agency has noticed quite a bit of churn among its clients. Account managers are currently assigned more or less at random, but the agency wants you to build a machine learning model that predicts which customers will churn (stop buying the service), so that account managers can be assigned to the customers most at risk. Luckily, the agency has some historical data. Can you help them out? Create a classification algorithm that classifies whether or not a customer churned. The company can then apply it to incoming data for future customers, predict which of them will churn, and assign those customers an account manager.

The data is saved as customer_churn.csv. Here are the fields and their definitions:

    Names: Name of the latest contact at the company
    Age: Customer age
    Total_Purchase: Total ads purchased
    Account_Manager: Binary, 0 = no manager, 1 = account manager assigned
    Years: Total years as a customer
    Num_Sites: Number of websites that use the service
    Onboard_date: Date that the latest contact was onboarded
    Location: Client HQ address
    Company: Name of the client company
    
Once you've created the model and evaluated it, test out the model on some new data (you can think of this almost like a hold-out set) that your client has provided, saved under new_customers.csv. The client wants to know which customers are most likely to churn given this data (they don't have the label yet).

# Start

The first step is to start a new Spark session. Let's call it churn, since the problem is predicting customer churn:

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('churn').getOrCreate()

Next, read the data, which is stored in a CSV file:

In [2]:
df = spark.read.csv('input data/customer_churn.csv', header=True, inferSchema=True)

Before actually diving into the data let's check the schema:

In [3]:
df.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



To get a sense of the available data, the describe command gives summary statistics for each column:

In [4]:
df.describe().show()

+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+--------------------+-------------------+
|summary|        Names|              Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|            Location|             Company|              Churn|
+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+--------------------+-------------------+
|  count|          900|              900|              900|               900|              900|               900|                 900|                 900|                900|
|   mean|         null|41.81666666666667|10062.82403333334|0.4811111111111111| 5.27315555555555| 8.587777777777777|                null|                null|0.16666666666666666|
| stddev|         null|6.127560416916251|2408.644531858096|0.4999208935073339|1.274449013194616|1.764835592035
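
The mean of the Churn column deserves a second look, because for a 0/1 column the mean is simply the share of customers who churned. A minimal sketch in plain Python (no Spark needed), using only the count and mean reported above, shows how imbalanced the two classes are. The variable names are illustrative:

```python
# Values copied from the describe() output above.
n_customers = 900
churn_rate = 0.16666666666666666  # mean of the 0/1 Churn column

# The mean of a 0/1 column times the row count is the number of ones.
n_churned = round(churn_rate * n_customers)
n_retained = n_customers - n_churned

# Accuracy of a trivial model that always predicts "no churn".
baseline_accuracy = n_retained / n_customers

print(n_churned, n_retained)        # 150 750
print(round(baseline_accuracy, 4))  # 0.8333
```

A classifier therefore has to beat roughly 83% accuracy just to do better than always guessing that nobody churns, which is one reason to look at metrics other than plain accuracy.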

The columns of the DataFrame are listed below:

In [5]:
df.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

Or better yet, one can check the top rows:

In [6]:
for s in df.head(3):
    print(s)
    print('-------')
    print('\n')

Row(Names='Cameron Williams', Age=42.0, Total_Purchase=11066.8, Account_Manager=0, Years=7.22, Num_Sites=8.0, Onboard_date=datetime.datetime(2013, 8, 30, 7, 0, 40), Location='10265 Elizabeth Mission Barkerburgh, AK 89518', Company='Harvey LLC', Churn=1)
-------


Row(Names='Kevin Mueller', Age=41.0, Total_Purchase=11916.22, Account_Manager=0, Years=6.5, Num_Sites=11.0, Onboard_date=datetime.datetime(2013, 8, 13, 0, 38, 46), Location='6157 Frank Gardens Suite 019 Carloshaven, RI 17756', Company='Wilson PLC', Churn=1)
-------


Row(Names='Eric Lozano', Age=38.0, Total_Purchase=12884.75, Account_Manager=0, Years=6.67, Num_Sites=12.0, Onboard_date=datetime.datetime(2016, 6, 29, 6, 20, 7), Location='1331 Keith Court Alyssahaven, DE 90114', Company='Miller, Johnson and Wallace', Churn=1)
-------




Let's choose the features for the logistic regression model that will be used.

The Age, Total_Purchase, Account_Manager, Years and Num_Sites columns are selected. Columns such as Names hold arbitrary values that add little to the prediction. The same could be argued for Account_Manager, since managers are currently assigned more or less at random, but it is already numerical and needs no further feature engineering, so it is included on the assumption that it does not hurt the model.

In [7]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['Age',
                                 'Total_Purchase',
                                 'Account_Manager',
                                 'Years',
                                 'Num_Sites'], outputCol='features')
output = assembler.transform(df)
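
To make the assembler step concrete, here is a plain-Python analogue of what VectorAssembler does to a single row: it concatenates the chosen columns, in order, into one numeric vector. This is only an illustration based on the first row shown earlier; the assemble_row helper is hypothetical and is not part of the Spark API:

```python
input_cols = ['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']

# First row from df.head(3); a few of its columns are omitted for brevity.
row = {
    'Names': 'Cameron Williams',
    'Age': 42.0,
    'Total_Purchase': 11066.8,
    'Account_Manager': 0,
    'Years': 7.22,
    'Num_Sites': 8.0,
    'Company': 'Harvey LLC',
}

def assemble_row(row, cols):
    """Hypothetical helper: collect the chosen columns into one list of floats."""
    return [float(row[c]) for c in cols]

features = assemble_row(row, input_cols)
print(features)  # [42.0, 11066.8, 0.0, 7.22, 8.0]
```

Columns that are not listed in input_cols, such as Names and Company, simply do not appear in the resulting vector.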

The dataset can now be created:

In [8]:
data = output.select('features', 'churn')

I'll split the dataset into train and test sets:

In [9]:
train_data, test_data = data.randomSplit([0.7, 0.3])

And now a Logistic Regression model is created to predict customer churn:


In [10]:
from pyspark.ml.classification import LogisticRegression

log_reg = LogisticRegression(labelCol='churn')

lr_model = log_reg.fit(train_data)

After fitting the model, the training summary can be inspected to compare the labels with the predictions on the training set:

In [11]:
lr_model.summary.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              churn|         prediction|
+-------+-------------------+-------------------+
|  count|                642|                642|
|   mean|0.16199376947040497|0.12305295950155763|
| stddev| 0.3687323817945704|  0.328754127642659|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+
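
Because both columns hold 0/1 values, their means can be turned back into counts. A small plain-Python sketch using only the figures from the summary above (the variable names are illustrative):

```python
# Values copied from the training summary above.
n_train = 642
mean_label = 0.16199376947040497       # share of actual churners
mean_prediction = 0.12305295950155763  # share of predicted churners

actual_churners = round(mean_label * n_train)
predicted_churners = round(mean_prediction * n_train)

print(actual_churners, predicted_churners)  # 104 79
```

On the training set the model flags fewer customers than actually churned, which suggests that at its default threshold it leans toward predicting "no churn".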



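Next, it would be nice to look at some evaluation metrics. The fitted model's evaluate method produces predictions on the test set, and the BinaryClassificationEvaluator imported here is used in the following step: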

In [12]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

pred = lr_model.evaluate(test_data)
pred.predictions.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[25.0,9672.03,0.0...|    0|[5.38269718505798...|[0.99542561066989...|       0.0|
|[26.0,8787.39,1.0...|    1|[1.31855325377138...|[0.78894090494836...|       0.0|
|[28.0,8670.98,0.0...|    0|[8.46751801319574...|[0.99978985823362...|       0.0|
|[28.0,9090.43,1.0...|    0|[2.13470173852087...|[0.89423053449346...|       0.0|
|[29.0,9378.24,0.0...|    0|[5.34081749819450...|[0.99523090159244...|       0.0|
|[29.0,13255.05,1....|    0|[4.87746598493005...|[0.99244128001150...|       0.0|
|[30.0,7960.64,1.0...|    1|[3.81858186353097...|[0.97851291363285...|       0.0|
|[30.0,10744.14,1....|    1|[2.24045104481495...|[0.90382367315183...|       0.0|
|[30.0,11575.37,1....|    1|[4.62279635050318...|[0.99027031423655...|       0.0|
|[30.0,12788.37,
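
The rawPrediction and probability columns are closely related. For binary logistic regression the probability is the logistic (sigmoid) function of the raw score, and the first entry of each vector refers to class 0 (no churn). The plain-Python check below compares the first two rows shown above. The values are truncated exactly as in the table, so agreement is only approximate:

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# (first entry of rawPrediction, first entry of probability) from the table.
rows = [
    (5.38269718505798, 0.99542561066989),
    (1.31855325377138, 0.78894090494836),
]

for raw, shown in rows:
    p_no_churn = sigmoid(raw)
    # The predicted label is 0 as long as P(no churn) stays above 0.5.
    label = 0.0 if p_no_churn > 0.5 else 1.0
    print(round(p_no_churn, 6), round(shown, 6), label)
```

The second row is an actual churner (churn = 1) that the model scores at about 79% "no churn". Misses like this are what a lower decision threshold could trade against more false alarms.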

The BinaryClassificationEvaluator can then be used to compute the AUC (area under the ROC curve):

In [13]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='churn')

auc = evaluator.evaluate(pred.predictions)

auc

0.7590237899917965

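This is not a fantastic value, but it is not bad either. Keep in mind that it was computed from the hard 0/1 prediction column rather than from rawPrediction.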
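
The evaluator above was given rawPredictionCol='prediction', that is, the hard 0/1 labels. To illustrate how much that choice can matter, the following plain-Python sketch computes AUC twice on a tiny made-up dataset: once from continuous scores and once from the same scores thresholded at 0.5. The labels, scores, and the auc helper are all hypothetical and unrelated to the churn data, and the helper uses the pairwise definition of AUC rather than Spark's implementation:

```python
def auc(labels, scores):
    """Pairwise AUC: chance that a random positive outranks a random negative.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted churn probabilities.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.4, 0.6, 0.3, 0.9]
hard = [1.0 if s > 0.5 else 0.0 for s in scores]

print(auc(labels, scores))  # 0.75
print(auc(labels, hard))    # 0.625
```

With hard labels there is only one threshold, so the result reduces to the average of the true-positive rate and the true-negative rate. Leaving rawPredictionCol at its default, rawPrediction, would score the model across all thresholds instead.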

Let's now refit the model on the full labeled dataset and prepare the new customers' data for prediction:

In [14]:
log_reg_model = log_reg.fit(data)

new_customers = spark.read.csv('input data/new_customers.csv', header=True, inferSchema=True)

new_customers_test = assembler.transform(new_customers)

The result should now include the features column, as the schema confirms:

In [15]:
new_customers_test.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



Finally, it is time to transform the new customers' data and check the final results:

In [16]:
result = log_reg_model.transform(new_customers_test)
result.select('Company', 'prediction').show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+



The new data has only six customers, and the model predicts that four of them will churn, so these are the customers who should be assigned an account manager!
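
The companies to hand over to the client can be read straight off the table. As a plain-Python sketch, with the rows copied by hand from the output above:

```python
# (Company, prediction) pairs copied from the result table.
predictions = [
    ('King Ltd', 0.0),
    ('Cannon-Benson', 1.0),
    ('Barron-Robertson', 1.0),
    ('Sexton-Golden', 1.0),
    ('Wood LLC', 0.0),
    ('Parks-Robbins', 1.0),
]

# Keep only the customers predicted to churn.
at_risk = [company for company, pred in predictions if pred == 1.0]
print(at_risk)
# ['Cannon-Benson', 'Barron-Robertson', 'Sexton-Golden', 'Parks-Robbins']
```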

Thank you!