#                                       Predicting Telecom cutomer churn

## Churn is the largest problem faced by most of businesses. Customer churn means loss of clients or customers, this is also known as customer attrition. Cutomer churn analysis is a key business metric in telecom service providers, Internet service providers, Insurance firms etc. As the cost of retaining an existing customer is far less than acquiring a new customer( According to Harvard Business Review, it costs between 5 times and 25 times as much to find a new customer than to retain an existing one). 

## Preventing customer churn is an important function of business. By building predictive models using machine learning algorithms, we can predict whether a customer is willing to leave the carrier service. By identifying the customers who are in a risk of churning, network provider can provide better offers and services to retain their client.

In [1]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import Row
from pyspark.sql import functions

#conf = SparkConf().setMaster("local").setAppName("SentenceEmotion")
#sc = SparkContext(conf = conf)

lines = sc.textFile("Telecomchurn.csv")

## Data

### We have used dataset from Kaggle. This data consists of various customers covering almost all the states in the united states. Each data entry involves the information about the customer which has features such as: 

### categorical variables 1) State, 2)International Plan 3) Voice mail plan 4) churn(Dependent variable)
### continuous variables 1) number vmail messages	2)total day minutes	3)total day calls	4)total day charge	5)total eve minutes	5)total eve calls	6)total eve charge	7)total night minutes	8)total night calls	9) total night charge	10)total intl minutes	11)total intl calls	12)total intl charge 13) customer service calls


### For the purpose of the project, the dependent variable is whether the customer will 'Churn' or not.

In [2]:
#checking top 5 lines
lines.take(5)

[u'state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn',
 u'KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10,3,2.7,1,FALSE',
 u'OH,107,415,371-7191,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,FALSE',
 u'NJ,137,415,358-1921,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,FALSE',
 u'OH,84,408,375-9999,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,FALSE']

In [5]:
header = lines.first()
sent = lines.filter(lambda x: x!=header)
sent.take(5)

In [6]:
#creating dataframe
from numpy import array
cleanData =  sent.map(lambda line: [x for x in line.split(',')])
cleanData.take(5)

In [7]:
#splitting the file with ','
from numpy import array
data =  sent.map(lambda line: array([x for x in line.split(',')]))
data.take(5)

In [8]:
#Creating features
X_features=data.map(lambda x: array([float(x[1]),float(x[2]),float(x[6]),float(x[7]),float(x[8]),float(x[9]),float(x[10]),float(x[11]),float(x[12]),float(x[13]),float(x[14]),float(x[15]),float(x[16]),float(x[17]),float(x[18]),float(x[19])]))
X_features.take(5)

In [9]:
#importing Kmeans, KMeans modelling from MLlib and creating clusters.
from pyspark.mllib.clustering import KMeans, KMeansModel
clusters = KMeans.train(X_features, 2, maxIterations=150, runs=10,initializationMode="random",seed=42)

In [10]:
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

In [11]:
from math import sqrt
WSSSE = X_features.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

In [12]:
# Import the necessary modules 
from pyspark.sql import Row

# Map the RDD to a DF
df = cleanData.map(lambda x: Row(state=x[0], 
                              account_length=x[1], 
                              area_code=x[2],
                              phone_number=x[3],
                              international_plan=x[4],
                              voice_mail_plan=x[5], 
                              voice_mail_msgs=x[6],
                              day_mins=x[7],
                              day_calls=x[8],
                              day_charges=x[9],
                              eve_mins=x[10],
                              eve_calls=x[11],
                              eve_charges=x[12],
                              nit_mins=x[13],
                              nit_calls=x[14],
                              nit_charges=x[15],
                              intl_mins=x[16],
                              intl_calls=x[17],
                              intl_charges=x[18],
                              customer_service_calls=x[19],
                              churn=x[20])).toDF()

In [13]:
df.show()

In [14]:
df.printSchema()

In [15]:
from pyspark.sql.types import *
df = df.withColumn("account_length", df["account_length"].cast(FloatType())) \
     .withColumn("voice_mail_msgs", df["voice_mail_msgs"].cast(FloatType())) .withColumn("day_mins", df["day_mins"].cast(FloatType())) .withColumn("day_calls", df["day_calls"].cast(FloatType())) .withColumn("day_charges", df["day_charges"].cast(FloatType())) .withColumn("eve_mins", df["eve_mins"].cast(FloatType())) .withColumn("eve_calls", df["eve_calls"].cast(FloatType())) .withColumn("eve_charges", df["eve_charges"].cast(FloatType())) .withColumn("nit_mins", df["nit_mins"].cast(FloatType())) .withColumn("nit_calls", df["nit_calls"].cast(FloatType())) .withColumn("nit_charges", df["nit_charges"].cast(FloatType())) .withColumn("intl_mins", df["intl_calls"].cast(FloatType())) .withColumn("intl_calls", df["intl_calls"].cast(FloatType())) .withColumn("intl_charges", df["intl_charges"].cast(FloatType())) .withColumn("customer_service_calls", df["customer_service_calls"].cast(FloatType()))
     #.withColumn("churn", df["churn"].cast(FloatType())) 

In [16]:
#Checking schema of variables
df.printSchema()

In [17]:
df.select('account_length','voice_mail_msgs').show(10)

In [18]:
df.printSchema()

In [19]:
#Checking the count of churned and non-churned customers if they have international plan.
df.filter(df.international_plan == True).groupby(df.churn).count().show()

### From the above results we could see that if the customer is having an international plan, number of customers who are willing to churn are more. 73% of the people with international plan are churned.

In [21]:
#Checking the same for the customers that churned that doesnot have international plan
df.filter(df.international_plan == False).groupby(df.churn).count().show()

### If the customer doesn't have an international plan, the percentage of customer's being churned are 12%. This can be interpretted in different ways. 

1) We could see that number of people with international plans are less compared to the number of people without international plan.
2) As we checked the percentage of churned to non-churned in both the cases. we can see that churn is more for international plan customers. We can assume that the international plans are not good or the service is not good that might caused the churn.

In [23]:
#Checkin the relation between customer service calls and churn
df.groupBy("churn").agg({'customer_service_calls': 'sum'}).show()

### We assume that the people that make more customer service calls would churn. As per the results we could see that the people with more customer service calls churned. As there might be issues with the service or the service calls couldn't help to resolve the issue.

In [25]:
# Exploring the customer churn with respect to area code.
df.filter(df.churn == True).groupby(df.area_code).count().orderBy('count', ascending=False).show()

### As per the above results we could see that the customers with area_code '415' churned the most with count of 256. followed by the area codes of '510' and '408'. Company could increase the service in that Area code or they might offer special packages for the customers with these area codes.

In [27]:
df.describe().show()

In [28]:
#Created string indexes to a categorical variable in order to convert each category into a specific index(neumerical value)
from pyspark.ml.feature import StringIndexer
indexer_state = StringIndexer(inputCol="state", outputCol="stateIndex")
indexed_state_df = indexer_state.fit(df).transform(df)
indexed_state_df.show()

In [29]:
#Further expanding the same logic to other categorical columns.
from pyspark.ml.feature import StringIndexer
indexer_intplan = StringIndexer(inputCol="international_plan", outputCol="intlplanIndex")
indexed_state_intplan_df = indexer_intplan.fit(indexed_state_df).transform(indexed_state_df)
indexed_state_intplan_df.show()

In [30]:
#Expanding indexes to further to dependent variables.
from pyspark.ml.feature import StringIndexer
indexer_churn = StringIndexer(inputCol="churn", outputCol="churnIndex")
indexed_c_s_ip_df = indexer_churn.fit(indexed_state_intplan_df).transform(indexed_state_intplan_df)
indexed_c_s_ip_df.show()

In [31]:
from pyspark.ml.feature import StringIndexer
indexer_vp = StringIndexer(inputCol="voice_mail_plan", outputCol="voiceplanIndex")
indexed_c_s_ip_vp_df = indexer_vp.fit(indexed_c_s_ip_df).transform(indexed_c_s_ip_df)
indexed_c_s_ip_vp_df.show()

In [32]:
#Applying onehot encoder to convert all the categorical variables into sparse matrices.
from pyspark.ml.feature import OneHotEncoderEstimator
encoder = OneHotEncoderEstimator(inputCols=["stateIndex","intlplanIndex","voiceplanIndex","churnIndex"],
                                 outputCols=["stateVec1","intlplanvec1","voiceplanvec1","churnvec1"])
model = encoder.fit(indexed_c_s_ip_vp_df)
encoded = model.transform(indexed_c_s_ip_vp_df)
encoded.show()

In [33]:
# Assigning the variables to apply encoder.
df1=encoded.select([c for c in encoded.columns if c not in {'state','area_code','phone_number','churn','stateIndex','intlplanIndex','voiceplanIndex','international_plan','voice_mail_plan'}])

In [34]:
df1.show(2)

In [35]:
#Selecting the required columns for features.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["account_length","customer_service_calls","day_calls","day_charges","day_mins","eve_calls","eve_charges","eve_mins","intl_calls","intl_charges","intl_mins","nit_calls","nit_charges","nit_mins","voice_mail_msgs","stateVec1","intlplanvec1","voiceplanvec1"],
    outputCol="features")
output = assembler.transform(df1)
output.show(2)
#output.select("features").show(truncate=False)

In [36]:
df1.show(2)

In [37]:
#Applying features and lables to provide the same to model.
df_classifier = output.selectExpr("features as features","churnIndex as label")
df_classifier.show(2)

In [38]:
df_log_reg = output.selectExpr("features as features","churnvec1 as label")
df_log_reg.show(2)

In [39]:
df_log_reg.select("features").show(1)

In [40]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.5)
# Fit the model
lrModel = lr.fit(df_classifier)
# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))
mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")

# Fit the model
mlrModel = mlr.fit(df_classifier)

# Print the coefficients and intercepts for logistic regression with multinomial family
print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix))
print("Multinomial intercepts: " + str(mlrModel.interceptVector))

In [41]:
#Evaluation metrices
trainingSummary = lrModel.summary
# for multiclass, we can inspect metrics on a per-label basis

print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s\nareaUnderROC: %s "
      % (trainingSummary.accuracy, trainingSummary.weightedFalsePositiveRate, trainingSummary.weightedTruePositiveRate, trainingSummary.weightedFMeasure(), trainingSummary.weightedPrecision, trainingSummary.weightedRecall,trainingSummary.areaUnderROC ))

In [42]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator,BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = df_classifier.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

# Chain indexers and tree in a Pipeline
#pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = dt.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
#predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g\naccuracy = %g " % ((1.0 - accuracy) ,accuracy))

# summary only
print(model)

evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [2, 4, 6])
             .addGrid(dt.maxBins, [20, 60])
             .build())
crossval = CrossValidator(estimator=dt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

cvModel = crossval.fit(trainingData)
prediction = cvModel.transform(testData)
evaluator.evaluate(predictions)

In [43]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
model = rf.fit(trainingData)
# Make predictions.
predictions = model.transform(testData)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g\naccuracy = %g " % ((1.0 - accuracy) ,accuracy))
treeModel = model
# summary only
print(treeModel)
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

In [44]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50)

# Chain indexers and GBT in a Pipeline
#pipeline = Pipeline(stages=[labelIndexer, featureIndexer, gbt])

# Train model.  This also runs the indexers.
model = gbt.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g\naccuracy = %g " % ((1.0 - accuracy) ,accuracy))

print(model)  # summary only

evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

In [45]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [2, 4, 6])
             .addGrid(gbt.maxBins, [20, 60])
             .addGrid(gbt.maxIter, [10, 20])
             .build())
cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
# Run cross validations.  This can take about 6 minutes since it is training over 20 trees!
cvModel = cv.fit(trainingData)
predictions = cvModel.transform(testData)
evaluator.evaluate(predictions)

In [46]:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
#https://towardsdatascience.com/machine-learning-with-pyspark-and-mllib-solving-a-binary-classification-problem-96396065d2aa
# Load training data
#data = spark.read.format("libsvm")\.load("data/mllib/sample_multiclass_classification_data.txt")

# Split the data into train and test
splits = df_classifier.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# specify layers for the neural network:
# input layer of size 4 (features), two intermediate of size 5 and 4
# and output of size 2 (classes)
layers = [4, 5, 4, 2]

# create the trainer and set its parameters
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

# train the model
model = trainer.fit(trainingData)

# compute accuracy on the test set
result = model.transform(testData)
predictionAndLabels = result.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g\naccuracy = %g " % ((1.0 - accuracy) ,accuracy))
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

In [47]:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures')
scaler_model = scaler.fit(df_classifier)
scaled_data = scaler_model.transform(df_classifier)
type(scaled_data)

In [48]:
scaled_data.show()

In [49]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = scaled_data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="label", featuresCol="scaledFeatures")

# Chain indexers and tree in a Pipeline
#pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = dt.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
#predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g\naccuracy = %g " % ((1.0 - accuracy) ,accuracy))
print(model)
evaluator = BinaryClassificationEvaluator()
print("Test Area Under ROC: " + str(evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})))

In [50]:
from pyspark.ml.classification import LinearSVC

# Load training data
#training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

lsvc = LinearSVC(maxIter=5, regParam=0.1)

# Fit the model
lsvcModel = lsvc.fit(trainingData)

# Print the coefficients and intercept for linearsSVC
print("Coefficients: " + str(lsvcModel.coefficients))
print("Intercept: " + str(lsvcModel.intercept))

In [51]:
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(df_classifier)

# rescale each feature to range [min, max].
scaledData = scalerModel.transform(df_classifier)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
scaledData.select("features", "scaled_features").show()

In [52]:
scaledData.show()

In [53]:
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

#data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
#f = spark.createDataFrame(data, ["features"])
#f.show()
pca = PCA(k=5, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(scaledData)
#print(model.pc)
pcaData = model.transform(scaledData).select("pcaFeatures")
pcaData.show(1)

In [54]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = scaledData.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
dt = DecisionTreeClassifier(labelCol="label", featuresCol="scaled_features")

# Chain indexers and tree in a Pipeline
#pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model.  This also runs the indexers.
model = dt.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
#predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g\naccuracy = %g " % ((1.0 - accuracy) ,accuracy))
print(model)

In [55]:
from pyspark.ml.clustering import KMeans

# Loads data.
#dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(df_classifier)

# Evaluate clustering by computing Within Set Sum of Squared Errors.
wssse = model.computeCost(df_classifier)
print("Within Set Sum of Squared Errors = " + str(wssse))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

## Results:
### Below are the results according to the accuracy and ROC learning rates of different models, 

####                            Accuracy	ROC
#### LogisticRegression	        0.85	    --
#### Decion tree	            0.92	    0.37
#### Random Forest	            0.85        0.89
#### Gradient boosted                   
#### tree                       0.94	    0.95
#### Multilayer                         
#### Perceptron      	        0.94	    0.946

#### As per the above results we could see that all models perform better. With multilayer perceptron neural network and gradient boosted tree working better with accuracy of 94% and ROC  score of 0.95. As these are test accuracies obtained, which are calculated with the test split of 30%. Almost all the models have test accuracies of more than 85%. ROC scores with highest of 0.95 and lowest of 0.37, inferes that the model's of more ROC score learn the data faster than the models like decision trees and logistical regression with lesser ROC rates compared to other models. 

#### Also we can see that K-means algorithm under performs and couldnot learn any specific patterns in the dataset. 

#### Multilayer perceptron is a neural network that predicts well. 

#### After experimenting on different models and considering our problem statement we can infere that Tree based models works better for this situation. Tree based models show better predictability(Though the ROC score is less for decision tree) we can say that gradient boosted trees with highest accuracies and explainabilities perform better. These tree based models can be interpreted and Explained better way compared to the Multilayer perceptron and neural networks.

## Companies can make use of these models and predict whether the customer will churn or not. By leveraging the data customer can make new offers for these clients and on a whole improve the service to retain the customers.

#### References:
#### 1) https://spark.apache.org/docs/1.5.2/api/python/pyspark.ml.html
#### 2) https://towardsdatascience.com/predict-customer-churn-with-r-9e62357d47b4
#### 3) https://towardsdatascience.com/cutting-the-cord-predicting-customer-churn-for-a-telecom-company-268e65f177a5
#### 4) https://stackoverflow.com/
#### 5) Frank-Kanes-Taming-Big-Data-with-Apache-Spark-and-Python

### Challenges faced :
### We faced some challenges while implementing the models in mulltilayer perceptron and PCA analysis as the data frame transformation had to be done and also we weren't able to get the evaluation metrices for different  models. Later we had to explore various sources and finally have implemented those metrics using BinaryEvaluator
