# Logistic Regression: Customer Churn

## Task: Create a machine learning model that predicts which customers will churn, so that the customers most at risk can be assigned an account manager. Then apply the model to the new, unlabeled data provided (new_customers.csv). Which of these customers are most likely to churn?

## Results: The logistic regression model reaches an accuracy of about .91 (0.905) on the test data. Of the six companies in the new customers data set, Cannon-Benson, Barron-Robertson, Sexton-Golden, and Parks-Robbins are predicted to churn.

In [71]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('churn').getOrCreate()

In [72]:
# Use Spark to read in the customer churn csv file.
data = spark.read.csv("spark_master/Spark_for_Machine_Learning/Logistic_Regression/customer_churn.csv",inferSchema=True,header=True)

In [73]:
# Print the Schema of the DataFrame
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [74]:
for item in data.head(5):
    print(item)
    print('\n')

Row(Names='Cameron Williams', Age=42.0, Total_Purchase=11066.8, Account_Manager=0, Years=7.22, Num_Sites=8.0, Onboard_date=datetime.datetime(2013, 8, 30, 7, 0, 40), Location='10265 Elizabeth Mission Barkerburgh, AK 89518', Company='Harvey LLC', Churn=1)


Row(Names='Kevin Mueller', Age=41.0, Total_Purchase=11916.22, Account_Manager=0, Years=6.5, Num_Sites=11.0, Onboard_date=datetime.datetime(2013, 8, 13, 0, 38, 46), Location='6157 Frank Gardens Suite 019 Carloshaven, RI 17756', Company='Wilson PLC', Churn=1)


Row(Names='Eric Lozano', Age=38.0, Total_Purchase=12884.75, Account_Manager=0, Years=6.67, Num_Sites=12.0, Onboard_date=datetime.datetime(2016, 6, 29, 6, 20, 7), Location='1331 Keith Court Alyssahaven, DE 90114', Company='Miller, Johnson and Wallace', Churn=1)


Row(Names='Phillip White', Age=42.0, Total_Purchase=8010.76, Account_Manager=0, Years=6.71, Num_Sites=10.0, Onboard_date=datetime.datetime(2014, 4, 22, 12, 43, 12), Location='13120 Daniel Mount Angelabury, WY 30645-4695',

In [75]:
# Before Spark can train on the data, it needs to be arranged
# as two columns: a label column and a single vector column of features
# ("label","features")

# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [76]:
#List the column names in the data set
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

In [77]:
#Examine the Location variable. Nearly every value is unique, so it cannot be treated as a categorical factor and will not be used to train the model.
data.groupBy('Location').count().show()

+--------------------+-----+
|            Location|count|
+--------------------+-----+
|062 Trevor Falls ...|    1|
|066 Jenkins Walks...|    1|
|45946 Day Springs...|    1|
|143 Andrea Flat L...|    1|
|Unit 2093 Box 153...|    1|
|399 Herbert Key P...|    1|
|104 Ruben Rapid A...|    1|
|930 Carrie Harbor...|    1|
|8202 Jade Unions ...|    1|
|USCGC Bailey FPO ...|    1|
|893 Carla Trace S...|    1|
|446 Rodney Ridge ...|    1|
|30668 Isabella Fr...|    1|
|911 Kent Point An...|    1|
|078 Nunez Haven S...|    1|
|PSC 5667, Box 831...|    1|
|4972 Michael Vill...|    1|
|567 Ian Loop Lamb...|    1|
|482 Wells Mountai...|    1|
|7259 Brown Street...|    1|
+--------------------+-----+
only showing top 20 rows



In [78]:
#Examine the Company variable. Nearly every value is unique, so it cannot be treated as a categorical factor and will not be used to train the model.
data.groupBy('Company').count().show()

+--------------------+-----+
|             Company|count|
+--------------------+-----+
|Miller, Johnson a...|    1|
|Hunter, Reyes and...|    1|
|          Obrien PLC|    1|
|            Soto PLC|    2|
|            Todd LLC|    1|
|Smith, Marshall a...|    1|
|           Smith PLC|    1|
|          Hall Group|    1|
|Freeman, Lam and ...|    1|
|       Smith-Carroll|    1|
|Hall, Hernandez a...|    1|
|          Cannon Inc|    1|
|        White-Dennis|    1|
|Wilson, Collins a...|    1|
|Jennings, Gates a...|    1|
|     Campbell-Willis|    1|
|    Martinez-Roberts|    1|
|        Robinson PLC|    1|
|          Barton Inc|    1|
|Hernandez, Middle...|    1|
+--------------------+-----+
only showing top 20 rows
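
The reason for dropping `Location` and `Company` can be made concrete with a distinct-value ratio. The sketch below is plain Python rather than PySpark, the helper name `distinct_ratio` is hypothetical, and the sample is a handful of company names taken from the output above. A ratio close to 1 means almost every row has its own value, so the column is of little use as a categorical feature.

```python
from collections import Counter

def distinct_ratio(values):
    """Return the share of distinct values among all values (1.0 = all unique)."""
    counts = Counter(values)
    return len(counts) / len(values)

# Sample taken from the groupBy output above ('Soto PLC' appears twice there).
companies = [
    "Obrien PLC", "Soto PLC", "Soto PLC", "Todd LLC",
    "Smith PLC", "Hall Group", "Smith-Carroll", "Cannon Inc",
]

ratio = distinct_ratio(companies)
print(f"{len(set(companies))} distinct values in {len(companies)} rows, ratio {ratio:.3f}")
```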



In [79]:
#The following attributes were omitted: 'Names', 'Account_Manager' (account managers were assigned randomly),
#'Onboard_date', 'Location', 'Company'

assembler = VectorAssembler(
    inputCols=['Age', 'Total_Purchase', 'Years', 'Num_Sites'],
    outputCol="features")

In [80]:
#Use the assembler to add a "features" vector column to the data
output = assembler.transform(data)
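
What the assembler does to each row can be sketched in plain Python: it gathers the selected numeric columns, in order, into one vector. This is only an illustration, not how `VectorAssembler` is implemented; the `assemble` helper is hypothetical, and the row values are copied from the `data.head(5)` output above.

```python
# Columns chosen as model inputs, in the same order as the assembler's inputCols.
INPUT_COLS = ["Age", "Total_Purchase", "Years", "Num_Sites"]

def assemble(row, input_cols=INPUT_COLS):
    """Collect the selected numeric fields of one row into a single feature list."""
    return [float(row[col]) for col in input_cols]

# First row of the data set, copied from the data.head(5) output above.
first_row = {
    "Names": "Cameron Williams",
    "Age": 42.0,
    "Total_Purchase": 11066.8,
    "Account_Manager": 0,
    "Years": 7.22,
    "Num_Sites": 8.0,
    "Company": "Harvey LLC",
    "Churn": 1,
}

features = assemble(first_row)
print(features)  # [42.0, 11066.8, 7.22, 8.0]
```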

In [81]:
#Show the output dataframe
output.select("features","Churn").show()

+--------------------+-----+
|            features|Churn|
+--------------------+-----+
|[42.0,11066.8,7.2...|    1|
|[41.0,11916.22,6....|    1|
|[38.0,12884.75,6....|    1|
|[42.0,8010.76,6.7...|    1|
|[37.0,9191.58,5.5...|    1|
|[48.0,10356.02,5....|    1|
|[44.0,11331.58,5....|    1|
|[32.0,9885.12,6.9...|    1|
|[43.0,14062.6,5.4...|    1|
|[40.0,8066.94,7.1...|    1|
|[30.0,11575.37,5....|    1|
|[45.0,8771.02,6.6...|    1|
|[45.0,8988.67,4.8...|    1|
|[40.0,8283.32,5.1...|    1|
|[41.0,6569.87,4.3...|    1|
|[38.0,10494.82,6....|    1|
|[45.0,8213.41,7.3...|    1|
|[43.0,11226.88,8....|    1|
|[53.0,5515.09,6.8...|    1|
|[46.0,8046.4,5.69...|    1|
+--------------------+-----+
only showing top 20 rows



In [82]:
#Keep only the features and label columns in a new dataframe
final_data = output.select("features",'Churn')

In [83]:
#Split the data into train & test: 70% train, 30% test
train_churn,test_churn = final_data.randomSplit([0.7,0.3])

In [84]:
#View train data
train_churn.describe().show()

+-------+-------------------+
|summary|              Churn|
+-------+-------------------+
|  count|                626|
|   mean|0.17412140575079874|
| stddev| 0.3795170968969129|
|    min|                  0|
|    max|                  1|
+-------+-------------------+



In [85]:
#View test data
test_churn.describe().show()

+-------+-------------------+
|summary|              Churn|
+-------+-------------------+
|  count|                274|
|   mean|0.14963503649635038|
| stddev| 0.3573660434685389|
|    min|                  0|
|    max|                  1|
+-------+-------------------+
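
The counts above (626 and 274 of 900 rows) are close to a 70/30 split but not exact. That is expected when each row is assigned to a split by its own random draw, which is how `randomSplit` behaves as far as I understand it. The plain-Python sketch below illustrates the idea; the `random_split` helper and the seed value are assumptions made for the example.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to a split by an independent uniform draw.

    Split sizes follow the weights only on average, so they vary from run
    to run unless a seed is fixed.
    """
    total = sum(weights)
    # Cumulative upper bounds of each split's interval in [0, 1).
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(900))  # stand-in for the 900 rows of final_data
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

In PySpark, `randomSplit` also accepts a `seed` argument, which makes the split (and therefore the metrics that depend on it) reproducible between runs.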



In [86]:
#Import LogisticRegression
from pyspark.ml.classification import LogisticRegression

In [87]:
# Create a Logistic Regression Model object
churn_lr = LogisticRegression(labelCol='Churn')

In [88]:
# Fit the model to the training data and call this model churn_lrModel
churn_lrModel = churn_lr.fit(train_churn)

In [89]:
# Create an object with summary model data 
training_sum = churn_lrModel.summary

In [90]:
# Summarize the labels and predictions on the training data
training_sum.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              Churn|         prediction|
+-------+-------------------+-------------------+
|  count|                626|                626|
|   mean|0.17412140575079874|0.12300319488817892|
| stddev| 0.3795170968969129|0.32870352354329324|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+



In [109]:
#Import BinaryClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [110]:
#Apply the model churn_lrModel to the test data
pred_and_labels = churn_lrModel.evaluate(test_churn)

In [111]:
#Show the results from applying the model to the test data
pred_and_labels.predictions.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|Churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[22.0,11254.38,4....|    0|[4.29340632296435...|[0.98652571480195...|       0.0|
|[28.0,8670.98,3.9...|    0|[7.14143718163282...|[0.99920901257036...|       0.0|
|[28.0,11204.23,3....|    0|[1.09075590062981...|[0.74852403589785...|       0.0|
|[28.0,11245.38,6....|    0|[3.22120108594342...|[0.96162436269381...|       0.0|
|[29.0,10203.18,5....|    0|[3.70803782670405...|[0.97606150634386...|       0.0|
|[30.0,10960.52,5....|    0|[2.33571362166432...|[0.91179195090912...|       0.0|
|[30.0,13473.35,3....|    0|[1.88886504197959...|[0.86862606920732...|       0.0|
|[31.0,8829.83,4.5...|    0|[4.37510481124901...|[0.98756963585975...|       0.0|
|[31.0,9574.89,7.3...|    0|[2.97533524147837...|[0.95144733457579...|       0.0|
|[32.0,6367.22,2
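
The `rawPrediction` and `probability` columns above are linked by the logistic function. For a binary model, the first component of each vector refers to class 0 (no churn). The sketch below uses only the standard library and checks this against two rows of the output; the printed values are truncated, so the agreement is approximate.

```python
import math

def sigmoid(margin):
    """Logistic function: maps a raw margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

# First components of rawPrediction for two rows of the output above
# (the printed values are truncated, so the match is approximate).
raw_first = [4.29340632296435, 1.09075590062981]

probs = [sigmoid(m) for m in raw_first]
for m, p in zip(raw_first, probs):
    # With a 0.5 threshold, the label is 0 unless the churn probability exceeds 0.5.
    label = 0.0 if p >= 0.5 else 1.0
    print(f"raw={m:.4f} -> P(no churn)={p:.6f} -> prediction={label}")
```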

In [118]:
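#Evaluate the accuracy of churn_lrModel on the test data. Assumption: evaluator is not defined in the cells shown; a MulticlassClassificationEvaluator with metricName='accuracy' would be consistent with the output below.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol='Churn', predictionCol='prediction', metricName='accuracy')
acc = evaluator.evaluate(pred_and_labels.predictions)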

In [119]:
#Print out accuracy
acc

0.9051094890510949
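
To put the accuracy in context, it helps to compare it with the score of a trivial model that always predicts "no churn". The sketch below uses only numbers from the outputs above: 274 test rows and a test-set churn rate of about 0.1496, which corresponds to 41 churners. The count of 248 correct predictions is inferred from the reported accuracy, not read from the notebook, and the `accuracy` helper is hypothetical.

```python
def accuracy(n_correct, n_total):
    """Fraction of rows whose predicted label equals the true label."""
    return n_correct / n_total

n_test = 274  # test-set row count from test_churn.describe()

# 248 correct predictions is the count that reproduces the reported value;
# it is inferred from the output, not read from the notebook.
model_acc = accuracy(248, n_test)

# The test-set churn rate of about 0.1496 corresponds to 41 churners, so always
# predicting "no churn" would already be right for the remaining 233 rows.
baseline_acc = accuracy(n_test - 41, n_test)

print(f"model accuracy:    {model_acc:.4f}")
print(f"majority baseline: {baseline_acc:.4f}")
```

Because churners are the minority class, accuracy alone can look flattering; the gap to the majority baseline is the more informative figure.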

## Predict on New Data

In [99]:
# Fit the model to all of the data and call this model final_lr_model
final_lr_model = churn_lr.fit(final_data)

In [101]:
# Use Spark to read in the new_customers csv file.
new_customers = spark.read.csv("spark_master/Spark_for_Machine_Learning/Logistic_Regression/new_customers.csv",inferSchema=True,header=True)

In [102]:
#Print schema
new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



In [121]:
#Use assembler to transform new_customer data set
test_new_customers = assembler.transform(new_customers)

In [122]:
#Print schema
test_new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



In [120]:
#Show the test_new_customers info; Only 6 customers
test_new_customers.describe().show()

+-------+-------------+------------------+-----------------+------------------+-----------------+------------------+--------------------+----------------+
|summary|        Names|               Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|            Location|         Company|
+-------+-------------+------------------+-----------------+------------------+-----------------+------------------+--------------------+----------------+
|  count|            6|                 6|                6|                 6|                6|                 6|                   6|               6|
|   mean|         null|35.166666666666664|7607.156666666667|0.8333333333333334|6.808333333333334|12.333333333333334|                null|            null|
| stddev|         null| 15.71517313511584|4346.008232825459| 0.408248290463863|3.708737880555414|3.3862466931200785|                null|            null|
|    min|Andrew Mccall|              22.0|            100.0|          

In [123]:
#Apply model to the transformed new_customers data set
final_results = final_lr_model.transform(test_new_customers)

In [127]:
final_results.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [128]:
final_results.select('Company','prediction').show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+
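
Answering the task question comes down to keeping the rows whose predicted label is 1.0. The sketch below does this in plain Python on the six (Company, prediction) pairs copied from the output above; on the Spark side, a `filter` on the `prediction` column of `final_results` should give the same selection.

```python
# (Company, prediction) pairs copied from the output above.
results = [
    ("King Ltd", 0.0),
    ("Cannon-Benson", 1.0),
    ("Barron-Robertson", 1.0),
    ("Sexton-Golden", 1.0),
    ("Wood LLC", 0.0),
    ("Parks-Robbins", 1.0),
]

# Keep only the companies whose predicted label is 1.0 (churn).
at_risk = [company for company, prediction in results if prediction == 1.0]
print(at_risk)
```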



## Results: The logistic regression model reaches an accuracy of about .91 on the test data; for comparison, always predicting "no churn" would score about .85 on the same test set. Of the six companies in the new customers data set, Cannon-Benson, Barron-Robertson, Sexton-Golden, and Parks-Robbins are predicted to churn.