# Problem Statement
A marketing agency produces ads for its clients' websites and has noticed a high rate of client
churn. Account managers are currently assigned more or less at random, and the agency wants a
machine learning model that predicts which customers will churn (stop buying the service) so that
account managers can be assigned to the customers most at risk. Historical data is available.
The task is to build a classification model that predicts whether or not a customer churned. The
company can then apply it to incoming data for future customers, predict which of them will churn,
and assign those customers an account manager.

## Solution
We will fit a **Logistic Regression** model on the given data and then apply it to a new dataset.

In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName('churn').getOrCreate()

In [5]:
# Load data as spark dataframe
data = spark.read.table('churn')

In [6]:
data.printSchema()

In [7]:
from pyspark.ml.feature import (VectorAssembler,VectorIndexer,
                                OneHotEncoder,StringIndexer)

In [8]:
data.groupBy('Company').count().show()

In [9]:
data.columns

We ignore the **Name**, **Company**, **Onboard_date** and **Location** columns. Their values are mostly unique, so they provide hardly any information for the model.
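One way to back up the "mostly unique" observation is to compare the number of distinct values in a column with the number of rows. The sketch below does this in plain Python on a few made-up rows (not the churn data); the helper name `distinct_ratio` and the sample values are hypothetical. On a Spark dataframe, a similar check could be built from `pyspark.sql.functions.countDistinct`.

```python
def distinct_ratio(rows, column):
    """Return distinct values / total rows for one column (1.0 means every value is unique)."""
    values = [row[column] for row in rows]
    return len(set(values)) / len(values)

# Made-up rows for illustration only.
rows = [
    {"Name": "Customer A", "Company": "Acme Ltd", "Num_Sites": 8},
    {"Name": "Customer B", "Company": "Bolt Inc", "Num_Sites": 8},
    {"Name": "Customer C", "Company": "Crest LLC", "Num_Sites": 10},
    {"Name": "Customer D", "Company": "Delta Co", "Num_Sites": 8},
]

name_ratio = distinct_ratio(rows, "Name")        # every name is different
sites_ratio = distinct_ratio(rows, "Num_Sites")  # values repeat across rows
print(name_ratio, sites_ratio)
```

A ratio close to 1.0 suggests the column behaves like an identifier and is unlikely to generalise as a feature.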

In [11]:
#Create a feature vector
assembler = VectorAssembler(inputCols=[
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites'],outputCol='features')

In [12]:
output = assembler.transform(data)

In [13]:
from pyspark.ml.classification import LogisticRegression

In [14]:
# The target variable is already binary; we index it anyway for consistency and practice
indexer = StringIndexer(inputCol="Churn", outputCol="ChurnIndexed")
output_fixed = indexer.fit(output).transform(output)
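One detail worth keeping in mind: by default `StringIndexer` orders labels by descending frequency, so index 0.0 goes to the most common value, not necessarily to the original 0. The plain-Python sketch below mimics that ordering on made-up labels; the helper name `frequency_desc_index` is hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each distinct label (as a string) to an index, most frequent label first.

    Ties are broken alphabetically here; that tie-break is an assumption of this sketch.
    """
    counts = Counter(str(v) for v in values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up labels: non-churners (0) are the majority, so 0 keeps index 0.0.
mapping = frequency_desc_index([0, 0, 0, 1, 0, 1])

# If churners (1) were the majority, the indices would flip.
flipped = frequency_desc_index([1, 1, 1, 0])
print(mapping, flipped)
```

If the indexed label could come out flipped, it is safer to check the fitted indexer's `labels` attribute before interpreting `prediction` values.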

In [15]:
log_reg = LogisticRegression(featuresCol='features',labelCol='ChurnIndexed', maxIter=500) 

In [16]:
final_data = output_fixed.select("features",'ChurnIndexed')

In [17]:
#splitting the data into train and test 
train_data, test_data = final_data.randomSplit([0.8,.2])

In [18]:
model = log_reg.fit(train_data)

In [19]:
results = model.transform(test_data)
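For a logistic regression model, `transform` adds `rawPrediction`, `probability` and `prediction` columns. Conceptually, the probability is the sigmoid of a weighted sum of the features, and the predicted label comes from comparing it with a threshold (0.5 by default in Spark). A minimal plain-Python sketch of that computation follows; the weights, intercept and feature values are made up, not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, intercept, threshold=0.5):
    """Return (probability of class 1, predicted label) for one feature vector."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, (1.0 if p > threshold else 0.0)

# Hypothetical coefficients for two features.
weights = [0.5, -0.25]
intercept = 0.0

p_low, label_low = predict([2.0, 4.0], weights, intercept)    # z = 0, so p = 0.5
p_high, label_high = predict([4.0, 4.0], weights, intercept)  # z = 1, so p > 0.5
print(p_low, label_low, p_high, label_high)
```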

In [20]:
#Evaluating the result 
from pyspark.ml.evaluation import BinaryClassificationEvaluator
my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                       labelCol='ChurnIndexed')


In [21]:
results.select('ChurnIndexed','prediction').show()

In [22]:
AUC = my_eval.evaluate(results)

In [23]:
AUC

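The evaluator reports an area under the ROC curve (AUC) of about 0.78 on the held-out test data. This is AUC, not accuracy, and because `rawPredictionCol` is set to `prediction`, it is computed from the hard 0/1 predictions rather than from the model's scores.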
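`BinaryClassificationEvaluator` defaults to the area under the ROC curve, which is a different quantity from accuracy. The plain-Python sketch below shows how far the two can diverge on imbalanced data. The labels and predictions are made up (not the churn data), and the helper names are hypothetical. With hard 0/1 predictions, AUC reduces to the average of the true-positive and true-negative rates.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the label."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)

def auc(labels, scores):
    """AUC as the chance that a random positive scores above a random negative.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up, imbalanced example: 8 non-churners, 2 churners, one churner missed.
labels = [0] * 8 + [1] * 2
preds = [0] * 8 + [1, 0]

acc = accuracy(labels, preds)   # 9 of 10 correct
auc_hard = auc(labels, preds)   # (TPR + TNR) / 2 = (0.5 + 1.0) / 2
print(acc, auc_hard)
```

Passing the model's `rawPrediction` or `probability` scores to the evaluator instead of hard labels would use the full ROC curve and generally gives a different AUC.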

### Now let's apply our model to new data

In [26]:
test_data1 = spark.read.table('newdata')

In [27]:
test_data2=assembler.transform(test_data1)

In [28]:
results1 = model.transform(test_data2)

In [29]:
new_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction')

In [30]:
results1.select('prediction').show()

The new data has no target variable, so we cannot evaluate these predictions.
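Even without labels, the predictions can still be used for the original goal of deciding who gets an account manager. One option, sketched below in plain Python, is to rank customers by predicted churn probability and take the top few. The customer records, the `churn_probability` field and the helper `top_at_risk` are all hypothetical; in Spark, the equivalent values would come from the `probability` column of the transformed dataframe.

```python
def top_at_risk(customers, k):
    """Return the k customers with the highest predicted churn probability."""
    return sorted(customers, key=lambda c: c["churn_probability"], reverse=True)[:k]

# Made-up scored customers for illustration only.
scored = [
    {"customer": "Customer A", "churn_probability": 0.12},
    {"customer": "Customer B", "churn_probability": 0.87},
    {"customer": "Customer C", "churn_probability": 0.55},
    {"customer": "Customer D", "churn_probability": 0.31},
]

priority = [c["customer"] for c in top_at_risk(scored, 2)]
print(priority)
```

Ranking by probability rather than filtering on the 0/1 prediction lets the agency match the number of flagged customers to the number of account managers available.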