This project studies classification, one of the predictive methods of data mining. The application is customer churn prediction for a telecommunications company.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv",sep=',',decimal='.')
df.head(10000)


Our data set consists of 7043 rows and 21 columns.

Let's remove the customerID column from our data set because it is not needed for the analysis.

In [None]:
df.drop('customerID', axis=1, inplace=True)
df.info()

Let's convert the SeniorCitizen column from 0/1 to the categorical labels "No"/"Yes".

In [None]:
df["SeniorCitizen"] = df["SeniorCitizen"].replace({0: "No", 1: "Yes"})


The "TotalCharges" column is stored as object when it should be float. Let's fix that.

In [None]:
# errors='coerce' turns entries that cannot be parsed as numbers into NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(value=0)
df['SeniorCitizen'] = df['SeniorCitizen'].astype('object')
df.describe()
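
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up Series (the values are illustrative assumptions, not rows from the data set). An entry that cannot be parsed as a number becomes NaN, and `fillna` then replaces it with 0.

```python
import pandas as pd

# Hypothetical sample values; the blank string mimics an unparseable entry.
sample = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors='coerce' turns anything that is not a number into NaN
numeric = pd.to_numeric(sample, errors="coerce")
n_missing = int(numeric.isna().sum())

# Replace the NaN values with 0, as the cell above does
filled = numeric.fillna(0)
print(numeric.dtype, n_missing, filled.tolist())
```

Filling with 0 is one possible choice; whether 0 is a sensible stand-in for a missing total depends on what the missing entries represent.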

The number of churned and retained customers is visualized with a count plot of the Churn column.

In [None]:
sns.countplot(x = "Churn", data = df)
df.loc[:, 'Churn'].value_counts()

There is no missing data.

In [None]:
df.isnull().sum()

The columns are split into categorical and numerical groups. Our target variable, Churn, is excluded from the categorical list.

In [None]:
Categorical = df.select_dtypes(include='object').drop('Churn', axis=1).columns.tolist()
numerical = df.select_dtypes(exclude='object').columns.tolist()
for c in Categorical:
    print('Column {} unique values: {}'.format(c, len(df[c].unique())))

Let's look for outliers.

In [None]:

sns.boxplot(x=df['tenure'],y=df['Churn'])


In [None]:
sns.boxplot(x=df['TotalCharges'],y=df['Churn'])

We see many outliers in TotalCharges among the customers who churned. This already hints at why customers leave, but let's do a more detailed analysis. Some outliers also appear in tenure (the subscription period). Now let's remove the outliers. First, every categorical column is converted to numeric codes with LabelEncoder.

In [None]:
from sklearn.preprocessing import LabelEncoder
encoded = df.apply(lambda x: LabelEncoder().fit_transform(x) if x.dtype == 'object' else x)
encoded.head(8000)
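
As a small illustration of what LabelEncoder does to one column, the sketch below encodes a hypothetical list of contract labels (made up for this example). The codes follow the sorted order of the labels, so they carry no business meaning of their own.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical contract values, used only to illustrate the mapping
contracts = ["Two year", "Month-to-month", "One year", "Month-to-month"]

encoder = LabelEncoder()
codes = encoder.fit_transform(contracts)

# classes_ holds the sorted labels; a label's code is its index in classes_
print(list(encoder.classes_))
print(codes.tolist())
```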

In [None]:
customerlost=encoded.loc[encoded['Churn'].abs()>0]
customerlost

In [None]:
Q1 = customerlost['TotalCharges'].quantile(0.25)
Q3 = customerlost['TotalCharges'].quantile(0.75)
IQR = Q3 - Q1
IQR

In [None]:
Q=Q3+(1.5*IQR)
Q
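
The upper fence Q3 + 1.5 * IQR computed above can be tried on a toy example. This is a minimal sketch on made-up charges (assumed values, not taken from the data set) showing which points the rule flags.

```python
import pandas as pd

# Made-up charges; 9000 is far from the rest
charges = pd.Series([100, 200, 300, 400, 500, 9000])

q1 = charges.quantile(0.25)
q3 = charges.quantile(0.75)
iqr = q3 - q1

# Values above this fence are treated as outliers
upper_fence = q3 + 1.5 * iqr
outliers = charges[charges > upper_fence]
print(upper_fence, outliers.tolist())
```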

In [None]:
# Churned customers whose TotalCharges is not below the upper fence Q3 + 1.5 * IQR
outlier_mask = ~(encoded['TotalCharges'] < (Q3 + 1.5 * IQR)) & (encoded['Churn'] > 0)
encoded_out = encoded[outlier_mask]
encoded.drop(encoded_out.index, inplace=True)
encoded.head(5000)

The outliers in TotalCharges have been removed. Next, the same is done for tenure (the subscription period).

In [None]:
Q1_A = customerlost['tenure'].quantile(0.25)
Q3_A = customerlost['tenure'].quantile(0.75)
IQR_A = Q3_A - Q1_A
IQR_A

In [None]:
Q_A=Q3_A+(1.5*IQR_A)
Q_A


The tenure outliers among churned customers are selected and removed.

In [None]:
# Churned customers whose tenure is not below the upper fence Q3_A + 1.5 * IQR_A
outlier_mask_A = ~(encoded['tenure'] < (Q3_A + 1.5 * IQR_A)) & (encoded['Churn'] > 0)
encoded_A_out = encoded[outlier_mask_A]
encoded.drop(encoded_A_out.index, inplace=True)
encoded.head(8000)

**Preparation of Test and Training Data**
At this stage the data, with its target variable defined and in its final form, is divided into test and training sets before the algorithms are applied. An 85-15 split was found suitable for this process. Note that test_size = 0.85 in the code below holds out 85% of the rows for testing and leaves 15% for training.

For demonstration, the split is shown below on the raw data. The same split is repeated before each algorithm, after the preprocessing that algorithm requires.

In [None]:
x = df.drop('Churn', axis = 1)              
y = df['Churn'] 
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.85, random_state = 400)
x_test.head(8000)
x_train.head(8000)

In [None]:
y_test.head(5000)

In [None]:
y_train.head(500)
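
To check how test_size translates into row counts, here is a minimal sketch on 100 dummy rows (the arrays are made up for illustration). It contrasts test_size=0.85 with test_size=0.15.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows, only to show how test_size controls the split
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# test_size is the fraction of rows held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.85, random_state=0)
print(len(X_tr), len(X_te))

X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=0)
print(len(X_tr2), len(X_te2))
```

With test_size=0.85 the model sees only the smaller part of the data during training; with test_size=0.15 it sees the larger part.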

**Application of Classification Algorithms and Performance Analysis**
We will predict customer losses with classification algorithms. In our application, four classification algorithms will be applied to the data and the performance outputs will be analyzed. These performance outputs will be compared in section 5 and the classification algorithm to be used will be determined.
These four classification algorithms are:
Logistic Regression
Naive Bayes
Decision Tree
K-NN (Nearest Neighbor)

**Logistic Regression**
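Logistic regression applies when the output variable takes discrete values.
Our target, Churn, has two classes. The code below uses the multinomial (softmax) formulation of logistic regression, which also handles the two-class case; the number of categories in the feature columns does not affect this choice.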
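
As a rough illustration of how logistic regression turns a linear score into a probability, the sketch below applies the logistic (sigmoid) function to two scores. The intercept and the tenure coefficient are hypothetical numbers chosen for the example, not values fitted on this data.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for illustration only
intercept, coef_tenure = 0.5, -0.05

# Probability of churn for a short and a long subscription period
p_short = sigmoid(intercept + coef_tenure * 2)
p_long = sigmoid(intercept + coef_tenure * 60)
print(round(p_short, 3), round(p_long, 3))
```

A predicted class is obtained by comparing the probability with a threshold, commonly 0.5.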

Before starting logistic regression, let's do a short correlation analysis and drop the columns that are only weakly correlated with Churn.

In [None]:
# Correlation (in percent) between each feature and Churn
columns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
           'MultipleLines', 'tenure', 'InternetService', 'OnlineSecurity',
           'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
           'StreamingMovies', 'Contract', 'PaperlessBilling', 'MonthlyCharges',
           'TotalCharges']
for col in columns:
    print('{}:'.format(col), encoded[col].corr(encoded['Churn']) * 100)

In [None]:
encoded.drop('gender', axis=1, inplace=True)
encoded.drop('PhoneService', axis=1, inplace=True)
encoded.drop('MultipleLines', axis=1, inplace=True)
encoded.drop('InternetService', axis=1, inplace=True)
encoded.drop('StreamingTV', axis=1, inplace=True)
encoded.drop('StreamingMovies', axis=1, inplace=True)

We dropped the columns whose correlation with Churn was below 10 (in percent, by magnitude). The remaining correlations show the effect of contract type and subscription period (tenure) on customer loss.
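
The same kind of screening can be sketched in a few lines with `DataFrame.corrwith`. The frame below is made up for illustration (the column names mirror the notebook, the values do not), and the 10% threshold is applied to the magnitude of the correlation.

```python
import pandas as pd

# Tiny made-up frame; column names mirror the notebook, values do not
demo = pd.DataFrame({
    "tenure": [1, 2, 3, 40, 50, 60],
    "gender": [0, 1, 0, 0, 1, 0],
    "Churn":  [1, 1, 1, 0, 0, 0],
})

# Correlation of every feature with the target, in percent
corr_pct = demo.drop(columns="Churn").corrwith(demo["Churn"]) * 100

# Columns whose correlation magnitude is below the 10% threshold
weak = corr_pct[corr_pct.abs() < 10].index.tolist()
print(corr_pct.round(1))
print(weak)
```

Keep in mind that a linear correlation on label-encoded categories is only a coarse filter, since the integer codes of a multi-category column have no natural order.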

In [None]:
x = encoded.drop('Churn', axis=1)
y = encoded['Churn']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.85, random_state=43)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit the scaler on the training data only, then reuse its statistics on the test data
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
from sklearn.linear_model import LogisticRegression
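
A common convention is to learn the scaling statistics from the training split only and to reuse them for the test split. Here is a minimal sketch on made-up numbers showing the difference between `fit_transform` and `transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_, test_scaled.ravel())
```

The test value 2.0 equals the training mean, so it maps to 0. Calling `fit_transform` on the test data instead would rescale it with its own mean and standard deviation.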

Multinomial logistic regression is not supported by the liblinear solver. The newton-cg solver used here supports it, and with newton-cg the only available penalty is l2 (or no penalty).

In [None]:
Logistic_Regression = LogisticRegression(C=0.5,tol=0.1,multi_class='multinomial',solver='newton-cg',penalty='l2',max_iter=100)
Logistic_Regression.fit(x_train, y_train)

In [None]:
y_pred=Logistic_Regression.predict(x_test)
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
print(classification_report(y_true=y_test, y_pred=y_pred))

In [None]:
accuracy_score(y_test, y_pred)*100

In [None]:
confusion_matrix(y_test, y_pred)
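
To connect the confusion matrix with the scores in the classification report, the sketch below derives accuracy, precision and recall from the four cells. The labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = churn)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of real churners that were found
print(accuracy, precision, recall)
```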

**Decision Tree**
A decision tree is a classification method that builds a model in the form of a tree of decision nodes and leaf nodes from the features and the target.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix as Confusion_Matrix

# Bin each continuous column into four quantile-based ordinal bins
for col in ['tenure', 'MonthlyCharges', 'TotalCharges']:
    est = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
    encoded[col] = est.fit_transform(encoded[[col]]).ravel()

# Rebuild x and y so that the binned columns are the ones the tree is trained on
x = encoded.drop('Churn', axis=1)
y = encoded['Churn']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.85, random_state=42)
tree_Decision = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_Decision.fit(x_train, y_train)

In [None]:
predictions = tree_Decision.predict(x_test)
score = round(accuracy_score(y_test, predictions), 2)
# Use a separate name so the imported Confusion_Matrix function is not overwritten
cm = Confusion_Matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt=".0f")
plt.xlabel('estimated value')
plt.ylabel('real value')
plt.title('Score: {0}'.format(score), size = 15)
plt.show()

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions, target_names=['Not Lost customer', 'Lost customer']))

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth = 4, random_state=42,min_weight_fraction_leaf=0.0)
clf = clf.fit(x, y)
tree.plot_tree(clf,fontsize=10) 
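
When the plotted tree is hard to read, the rules can also be printed as text with `sklearn.tree.export_text`. The sketch below is an assumption-laden example: it trains a small tree on synthetic data from `make_classification`, and the feature names `f0` to `f3` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data, used only to have something to fit
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# One line per decision node or leaf, indented by depth
rules = export_text(demo_tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

For the notebook's own tree, passing the column names of the training frame as `feature_names` would make the printed rules refer to the real features.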

**K-NN (Nearest Neighbor)**

K-NN classifies an observation by computing its distance to the training observations, selecting the k observations with the smallest distance (k is a parameter chosen beforehand), and assigning the majority class among them.

The loop below is used to choose the value of K by comparing the error rate obtained for each candidate.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
error = []

# Calculating the error for K values from 1 to 39
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    error.append(np.mean(pred_i != y_test))
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',markerfacecolor='blue', markersize=10)
plt.title('Error rate for K')
plt.xlabel('K value')
plt.ylabel('Average Error')
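
Choosing K by its error on the test set tends to make the final test score look better than it would on new data. One alternative, shown here as a sketch and not as the method used in this notebook, is to pick K by cross-validation on the training data. The data below is synthetic and the range of K values is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

k_values = list(range(1, 16))
cv_error = []
for k in k_values:
    # Mean accuracy over 5 folds, converted to an error rate
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_error.append(1 - scores.mean())

best_k = k_values[int(np.argmin(cv_error))]
print(best_k, round(min(cv_error), 3))
```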

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train, y_train)

In [None]:
y_pred = knn.predict(x_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

**Naive Bayes**

The algorithm calculates the probability of each class for an observation and assigns the class with the highest probability.
Three Naive Bayes variants are applied here:
GaussianNB
MultinomialNB
BernoulliNB
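
The rule "pick the class with the highest probability" can be worked through by hand. In the sketch below, the prior and the per-feature likelihoods are made-up numbers and the feature names are hypothetical; the point is only to show the arithmetic behind a Naive Bayes prediction.

```python
# Made-up prior churn rate and P(feature | class) for two binary features
prior = {"churn": 0.3, "stay": 0.7}
likelihood = {
    "churn": {"month_to_month": 0.9, "no_tech_support": 0.8},
    "stay":  {"month_to_month": 0.4, "no_tech_support": 0.5},
}

# Naive assumption: features are independent given the class,
# so the per-feature likelihoods are simply multiplied
unnormalized = {
    c: prior[c] * likelihood[c]["month_to_month"] * likelihood[c]["no_tech_support"]
    for c in prior
}

# Normalize so the posteriors sum to 1, then pick the largest
total = sum(unnormalized.values())
posterior = {c: v / total for c, v in unnormalized.items()}
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

The three variants differ in how the likelihood term is modeled: GaussianNB assumes normally distributed features, MultinomialNB works with counts, and BernoulliNB with binary features.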

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# Split first, then fit and predict on that split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.85, random_state=42)
NBG = GaussianNB()
NBG.fit(x_train, y_train)
y_forcast = NBG.predict(x_test)
print("x_train: ", x_train.shape)
print("x_test: ", x_test.shape)
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)
print("Naive Bayes Gaussian Score :",accuracy_score(y_test, y_forcast))
print("Confusion Matrix :",confusion_matrix(y_test, y_forcast))
print("Classification Report :",classification_report(y_test, y_forcast))

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# Split first, then fit and predict on that split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.85, random_state=42)
NBM = MultinomialNB()
NBM.fit(x_train, y_train)
y_forcast = NBM.predict(x_test)
print("x_train: ", x_train.shape)
print("x_test: ", x_test.shape)
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)
print("Naive Bayes Multinomial Score :",accuracy_score(y_test, y_forcast))
print("Confusion Matrix :",confusion_matrix(y_test, y_forcast))
print("Classification Report :",classification_report(y_test, y_forcast))

In [None]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# Split first, then fit and predict on that split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.85, random_state=42)
NBB = BernoulliNB()
NBB.fit(x_train, y_train)
y_forcast = NBB.predict(x_test)
print("x_train: ", x_train.shape)
print("x_test: ", x_test.shape)
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)
print("Naive Bayes Bernoulli Score :",accuracy_score(y_test, y_forcast))
print("Confusion Matrix :",confusion_matrix(y_test, y_forcast))
print("Classification Report :",classification_report(y_test, y_forcast))

**Comparison of the Performance of Classification Algorithms**
The performance of a classification model can be examined in detail through the confusion matrix.
However, for the comparison we will only use the ROC curve and its AUC.
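
As a quick reminder of what the AUC measures, here is a minimal sketch on four made-up scores. The AUC is the share of (positive, negative) pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class
y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true_demo, scores_demo)

# The ROC curve plots the true positive rate against the false positive rate
fpr, tpr, thresholds = roc_curve(y_true_demo, scores_demo)
print(auc)
```

Three of the four positive/negative pairs are ranked correctly here (0.35 is below 0.4), which gives an AUC of 0.75.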

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Compare the models on the churn data itself (x, y defined above) rather than on
# a synthetic make_classification sample, so the curves describe this problem.
trainx, testx, trainy, testy = train_test_split(x, y, test_size=0.85, random_state=42)

models = {
    'KNN': knn,
    'Naive Bayes Gaussian': NBG,
    'Decision Tree': tree_Decision,
    'Logistic Regression': Logistic_Regression,
}

for name, model in models.items():
    model.fit(trainx, trainy)
    # Predicted probability of the positive class (churn)
    forecast = model.predict_proba(testx)[:, 1]
    print('%s: ROC AUC=%.3f' % (name, roc_auc_score(testy, forecast)))
    fpr, tpr, _ = roc_curve(testy, forecast)
    plt.plot(fpr, tpr, marker='.', label=name)

# ROC axes: false positive rate against true positive rate
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

**Final Report**
It is possible to predict customer loss using various analyses and thus to warn the business before the loss occurs.
The most suitable algorithm for the data set we analyzed is Logistic Regression.
Our logistic regression analysis achieved a prediction accuracy of 81.3%.
The main factor associated with customer loss is high total charges.
Total charges are followed by subscription period (tenure) and monthly charges. We can also say that customers who do not have online security and technical support have a higher loss rate.