# A marketing agency has many customers that use its service to produce ads for the customers' websites
The agency has noticed quite a bit of churn among its clients.

They currently assign account managers at random, but want you to create a machine learning model that predicts which customers will churn (stop buying their service), so that they can assign an account manager to the customers most at risk of churning.

They have some historical data. Can you help them out?

## Create a classification algorithm that will help classify whether or not a customer churned.
The company can then test it against incoming data for future customers to predict which customers will churn and assign them an account manager.


1. Names: Name of the latest contact at the company
2. Age: Customer age
3. Total_Purchase: Total ads purchased
4. Account_Manager: Binary, 0 = no manager, 1 = account manager assigned
5. Years: Total years as a customer
6. Num_Sites: Number of websites that use the service
7. Onboard_date: Date that the latest contact was onboarded
8. Location: Client HQ address
9. Company: Name of the client company

10. Churn: 0 or 1, indicating whether the customer has churned

Your goal is to create a model that predicts whether a customer will churn (0 or 1) based on the features.
Remember that the account manager is currently assigned at random.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("logregconsult").getOrCreate()

In [3]:
data = spark.read.csv('customer_churn.csv',inferSchema=True, header=True)

In [4]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [5]:
data.describe().show()

+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+--------------------+--------------------+-------------------+
|summary|        Names|              Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|       Onboard_date|            Location|             Company|              Churn|
+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+--------------------+--------------------+-------------------+
|  count|          900|              900|              900|               900|              900|               900|                900|                 900|                 900|                900|
|   mean|         null|41.81666666666667|10062.82403333334|0.4811111111111111| 5.27315555555555| 8.587777777777777|               null|                null|                null|0.16666666666666666|
| stddev| 

In [6]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

In [7]:
from pyspark.ml.feature import VectorAssembler

In [10]:
assembler = VectorAssembler(inputCols=['Age',
                                      'Total_Purchase',
                                      'Account_Manager',
                                      'Years',
                                      'Num_Sites'],outputCol='features')

In [11]:
output = assembler.transform(data)

In [12]:
final_data = output.select('features','churn')

In [13]:
train_churn, test_churn = final_data.randomSplit([0.7,0.3])
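`randomSplit` assigns rows to the splits at random, so the 70/30 proportions are approximate and the rows in each split change from run to run unless a `seed` is passed (for example `final_data.randomSplit([0.7, 0.3], seed=42)`; the value 42 is an arbitrary choice, not from this notebook). The sketch below illustrates the idea in plain Python. It is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split by drawing a uniform number per row."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point remainder.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(900))  # stand-in for the 900 customer rows
train_a, test_a = random_split(rows, [0.7, 0.3], seed=42)
train_b, test_b = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_a), len(test_a))  # close to 630 / 270, but not guaranteed exact
print(train_a == train_b)         # same seed, same split
```

Fixing the seed makes the reported metrics reproducible across runs.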

In [14]:
from pyspark.ml.classification import LogisticRegression

In [15]:
lr_churn = LogisticRegression(labelCol='churn')

In [16]:
fitted_churn_model = lr_churn.fit(train_churn)

In [17]:
training_sum = fitted_churn_model.summary

In [18]:
training_sum.predictions.describe().show()

+-------+------------------+-------------------+
|summary|             churn|         prediction|
+-------+------------------+-------------------+
|  count|               630|                630|
|   mean| 0.173015873015873|0.13015873015873017|
| stddev|0.3785615604806714| 0.3367453504422738|
|    min|               0.0|                0.0|
|    max|               1.0|                1.0|
+-------+------------------+-------------------+
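The means in this summary are fractions of the 630 training rows, so multiplying them back gives counts. The small calculation below uses only the numbers printed above; it suggests that, at the default 0.5 threshold, the model flags fewer customers than actually churned in the training split.

```python
n_train = 630
mean_churn = 0.173015873015873         # mean of 'churn' from the summary above
mean_prediction = 0.13015873015873017  # mean of 'prediction' from the summary above

actual_churners = round(n_train * mean_churn)
predicted_churners = round(n_train * mean_prediction)
print(actual_churners, predicted_churners)  # 109 82
```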



In [19]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [20]:
pred_and_labels = fitted_churn_model.evaluate(test_churn)

In [21]:
pred_and_labels.predictions.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[25.0,9672.03,0.0...|    0|[4.15933013419811...|[0.98462215504283...|       0.0|
|[28.0,11128.95,1....|    0|[3.89797818361482...|[0.98012033848093...|       0.0|
|[29.0,5900.78,1.0...|    0|[3.58224677329476...|[0.97293949926017...|       0.0|
|[29.0,11274.46,1....|    0|[4.25508271020556...|[0.98600667522147...|       0.0|
|[30.0,10744.14,1....|    1|[1.44506063173800...|[0.80923709620359...|       0.0|
|[30.0,10960.52,1....|    0|[2.13536165798201...|[0.89429293494806...|       0.0|
|[31.0,5304.6,0.0,...|    0|[2.81598090279714...|[0.94353331749454...|       0.0|
|[31.0,10058.87,1....|    0|[4.15730464239268...|[0.98459145620494...|       0.0|
|[31.0,12264.68,1....|    0|[3.34666644972021...|[0.96599550468753...|       0.0|
|[32.0,8011.38,0
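For binary logistic regression, the `probability` column is the logistic (sigmoid) function applied to `rawPrediction`. The first element of each vector refers to class 0 (no churn) and the second to class 1 (churn). A minimal check in plain Python, using the first displayed row (values are truncated in the output, so agreement is only to the displayed precision):

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued margin to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# First element of rawPrediction in the first displayed row (truncated in the output).
raw_class0 = 4.15933013419811
p_class0 = sigmoid(raw_class0)  # compare with the first element of 'probability'
p_class1 = 1.0 - p_class0       # churn probability for that row

print(p_class0)
print(p_class1)
```

The same reading explains the fifth displayed row: the customer did churn, but the class-0 probability is about 0.81, so the prediction at the default 0.5 threshold is 0.0.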

In [25]:
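# Note: 'prediction' holds hard 0/1 labels, so the AUC below is computed from
# thresholded predictions. Passing rawPredictionCol='rawPrediction' instead
# would score the full ROC curve.
churn_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                           labelCol='churn')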

In [26]:
auc = churn_eval.evaluate(pred_and_labels.predictions)

In [27]:
auc

0.8139844498881671
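Because the evaluator was given the `prediction` column, this AUC is based on hard 0/1 labels. In that case the ROC curve has a single interior point, and the area reduces to the average of the true positive rate and the true negative rate. The sketch below shows this in plain Python on a small made-up data set (the labels and probabilities are illustrative, not taken from the churn data); `roc_auc` is a hypothetical helper, not a Spark function.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win, which matches the trapezoidal ROC area.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and class-1 probabilities for illustration only.
labels = [0, 0, 0, 0, 1, 1, 1, 0]
probs = [0.02, 0.10, 0.30, 0.45, 0.40, 0.80, 0.90, 0.60]
hard = [1.0 if p >= 0.5 else 0.0 for p in probs]

auc_scores = roc_auc(labels, probs)  # uses the full ranking
auc_hard = roc_auc(labels, hard)     # uses thresholded labels only

tpr = sum(1 for y, h in zip(labels, hard) if y == 1 and h == 1.0) / labels.count(1)
tnr = sum(1 for y, h in zip(labels, hard) if y == 0 and h == 0.0) / labels.count(0)

print(auc_scores, auc_hard, (tpr + tnr) / 2)
```

On this toy data the score-based AUC is higher than the label-based one, which is why an AUC computed from `rawPrediction` or `probability` is usually the more informative number.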

### Predict on new Data

In [28]:
final_lr_model = lr_churn.fit(final_data)

In [31]:
new_customers = spark.read.csv('customer.csv', inferSchema=True, header=True)

In [32]:
new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: integer (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [33]:
test_new_customers = assembler.transform(new_customers)

In [34]:
test_new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: integer (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)
 |-- features: vector (nullable = true)



In [36]:
final_result = final_lr_model.transform(test_new_customers)

In [38]:
final_result.show()

+-------------------+---+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+----------+
|              Names|Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|            features|       rawPrediction|         probability|prediction|
+-------------------+---+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+----------+
|   Cameron Williams| 42|       11066.8|              0| 7.22|        8|2013-08-30 07:00:40|10265 Elizabeth M...|          Harvey LLC|    1|[42.0,11066.8,0.0...|[2.63872820904874...|[0.93331285179943...|       0.0|
|      Kevin Mueller| 41|      11916.22|              0|  6.5|       11|2013-08-13 00:38:46|6157 Frank Garden...|          Wilson PLC|    1|

In [39]:
final_result.select('company','prediction').show()

+--------------------+----------+
|             company|prediction|
+--------------------+----------+
|          Harvey LLC|       0.0|
|          Wilson PLC|       1.0|
|Miller, Johnson a...|       1.0|
|           Smith Inc|       0.0|
|          Love-Jones|       0.0|
|        Kelly-Warren|       0.0|
|   Reynolds-Sheppard|       1.0|
|          Singh-Cole|       0.0|
|           Lopez PLC|       1.0|
|       Reed-Martinez|       1.0|
|Briggs, Lamb and ...|       0.0|
|    Figueroa-Maynard|       1.0|
|     Abbott-Thompson|       1.0|
|Smith, Kim and Ma...|       1.0|
|Snyder, Lee and M...|       0.0|
|      Sanders-Pierce|       1.0|
|Andrews, Adams an...|       1.0|
|Morgan, Phillips ...|       1.0|
|      Villanueva LLC|       0.0|
|Berry, Orr and Ca...|       0.0|
+--------------------+----------+
only showing top 20 rows
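If account managers are a limited resource (an assumption; the brief does not say), it can help to rank customers by predicted churn probability instead of relying on the 0/1 label alone. A minimal sketch in plain Python with made-up company names and probabilities (not taken from the output above); `top_at_risk` is a hypothetical helper.

```python
def top_at_risk(churn_probs, k):
    """Return the k (company, probability) pairs with the highest churn probability."""
    ranked = sorted(churn_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical churn probabilities (second element of the 'probability' vector).
churn_probs = {
    "Company A": 0.07,
    "Company B": 0.81,
    "Company C": 0.55,
    "Company D": 0.32,
}

for company, prob in top_at_risk(churn_probs, k=2):
    print(company, prob)
```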



In [40]:
test_new_customers.describe().show()

+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+--------------------+--------------------+-------------------+
|summary|        Names|              Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|       Onboard_date|            Location|             Company|              Churn|
+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+--------------------+--------------------+-------------------+
|  count|          900|              900|              900|               900|              900|               900|                900|                 900|                 900|                900|
|   mean|         null|41.81666666666667|10062.82403333334|0.4811111111111111| 5.27315555555555| 8.587777777777777|               null|                null|                null|0.16666666666666666|
| stddev| 