<h2 id="about_dataset">Customer Churn with Logistic Regression</h2>

In this notebook, we build a Logistic Regression model for a telecommunications company to predict which customers are likely to leave for a competitor, so that the company can act to retain them.
We use a historical telecommunications dataset to train and evaluate the churn predictions.


In [None]:
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline
import matplotlib.pyplot as plt

###  Load the Churn data 
Telco Churn is a hypothetical data file from IBM Object Storage.

In [None]:
!wget -O ChurnData.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/ChurnData.csv

## Load Data From CSV File  


In [None]:
# The file name must match the one written by wget above (case-sensitive on Linux/macOS).
churndf = pd.read_csv("ChurnData.csv")
churndf.head()

<h2 id="preprocessing">Data pre-processing and selection</h2>


In [None]:
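# Keep the columns of interest; .copy() avoids a SettingWithCopyWarning on the assignment below.
churndf = churndf[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless', 'churn']].copy()
# scikit-learn expects an integer class label for the target.
churndf['churn'] = churndf['churn'].astype('int')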
churndf.head()

In [None]:
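# Feature matrix: one row per customer, one column per feature.
x = np.asanyarray(churndf[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
# Target as a 1-D array; a 2-D column vector triggers a DataConversionWarning in fit().
y = np.asanyarray(churndf['churn'])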

print(x[0:5],"\n\n", y[0:5])

In [None]:
from sklearn import preprocessing
x = preprocessing.StandardScaler().fit(x).transform(x)
x[0:5]
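
To see what the scaling step does, here is a minimal sketch on a small made-up matrix (the values in `toy` are illustrative assumptions, not rows from the churn data). It checks that `StandardScaler` gives the same result as subtracting each column's mean and dividing by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values, not taken from ChurnData.csv).
toy = np.array([[11.0, 33.0],
                [33.0, 33.0],
                [23.0, 30.0],
                [38.0, 35.0]])

scaled = StandardScaler().fit_transform(toy)

# Same transformation by hand: subtract the column mean and divide by the
# column standard deviation (population form, ddof=0).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so features measured on different scales contribute comparably to the model.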

## Train/Test dataset


In [None]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=4)
print(xtrain.shape, ytrain.shape, xtest.shape, ytest.shape)

<h2 id="modeling">Modeling (Logistic Regression with Scikit-learn)</h2>


__Logistic Regression__ in Scikit-learn can use several numerical solvers to find the model parameters, including ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, and ‘saga’.

Scikit-learn's Logistic Regression also supports regularization. The __C__ parameter is the inverse of the regularization strength, so smaller values of C mean stronger regularization.


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
logmodel = LogisticRegression(C = 0.01, solver='liblinear')
logmodel.fit(xtrain, ytrain)
logmodel
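
As a rough illustration of how the regularization parameter behaves, the sketch below fits two models on synthetic data (the arrays `X_demo` and `y_demo` and the value `C=100.0` are assumptions made for this example, not part of the churn dataset) and compares the size of the learned coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends mostly on the first feature, plus noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Small C = strong regularization, large C = weak regularization.
strong = LogisticRegression(C=0.01, solver='liblinear').fit(X_demo, y_demo)
weak = LogisticRegression(C=100.0, solver='liblinear').fit(X_demo, y_demo)

strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
print(strong_norm, weak_norm)
```

The strongly regularized model should end up with a smaller coefficient norm, which is the shrinkage effect that a small C is meant to produce.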

<h2 id="modeling">Prediction using test set</h2>


In [None]:
yhat = logmodel.predict(xtest)
yhat

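__predict_proba__ returns the estimated probability of each class for every sample, with columns ordered by class label: the first column is P(churn = 0 | x) and the second is P(churn = 1 | x).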

In [None]:
yhatprob = logmodel.predict_proba(xtest)
yhatprob
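
The column ordering of the probabilities can be checked on a tiny example. This is a sketch on made-up one-feature data (`X_demo` and `y_demo` are assumptions for illustration); it shows that the columns follow `classes_`, that each row sums to 1, and that `predict` agrees with the most probable class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, one-feature dataset with two classes.
X_demo = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)

print(clf.classes_)        # column order of proba
print(proba.sum(axis=1))   # each row sums to 1

# Picking the most probable class per row reproduces predict().
pred_from_proba = clf.classes_[proba.argmax(axis=1)]
print(np.array_equal(pred_from_proba, clf.predict(X_demo)))
```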

<h2 id="evaluation">Evaluation</h2>


### jaccard index
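The Jaccard index is the size of the intersection of the predicted and true label sets divided by the size of their union. A score of 1 means the predictions match the true labels exactly, and a score of 0 means they have nothing in common. With `pos_label=0`, the score is computed for the churn = 0 class.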

In [None]:
from sklearn.metrics import jaccard_score
jaccard_score(ytest, yhat, pos_label=0)
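
For a binary problem, the Jaccard index for one class reduces to TP / (TP + FP + FN). The sketch below works this out by hand on made-up labels (`y_true` and `y_pred` are illustrative assumptions) and compares it with `jaccard_score`.

```python
import numpy as np
from sklearn.metrics import jaccard_score

# Made-up labels, only for illustration.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# Treat class 0 as the "positive" class, as in the cell above.
tp = np.sum((y_true == 0) & (y_pred == 0))  # predicted 0, truly 0
fp = np.sum((y_true != 0) & (y_pred == 0))  # predicted 0, truly 1
fn = np.sum((y_true == 0) & (y_pred != 0))  # predicted 1, truly 0

manual = tp / (tp + fp + fn)
print(manual, jaccard_score(y_true, y_pred, pos_label=0))
```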

### confusion matrix

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools

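def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion Matrix',
                          cmap=plt.cm.Blues):
    """Print and plot the confusion matrix `cm`.

    `classes` gives the tick labels in the same order as the matrix rows
    and columns. Set `normalize=True` to show row-wise proportions.
    """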
    if normalize:
        cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
        print("Normalized matrix")
    else:
        print('without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()
    tickmarks = np.arange(len(classes))
    plt.xticks(tickmarks, classes, rotation = 45)
    plt.yticks(tickmarks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i,j], fmt), 
                 horizontalalignment = 'center', 
                 color = 'white' if cm[i,j] > thresh else 'black')
    
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Rows are true labels and columns are predicted labels, both ordered [1, 0].
print(confusion_matrix(ytest, yhat, labels=[1,0]))

In [None]:
cnfmatrix = confusion_matrix(ytest, yhat, labels=[1,0])
np.set_printoptions(precision=2)

plt.figure()
plot_confusion_matrix(cnfmatrix, classes=['churn = 1', 'churn = 0'], normalize = False)

In [None]:
print(classification_report(ytest, yhat))
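
The precision and recall values in the report can be read directly off a confusion matrix. The following sketch uses made-up labels (`y_true` and `y_pred` are assumptions for illustration) with the same `labels=[1, 0]` ordering as above, so the first row holds the true churn = 1 cases.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels, only for illustration.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])

# With labels=[1, 0]: row 0 = true class 1, row 1 = true class 0;
# column 0 = predicted class 1, column 1 = predicted class 0.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]

precision = tp / (tp + fp)  # of the predicted churners, how many really churned
recall = tp / (tp + fn)     # of the real churners, how many were found
print(cm)
print(precision, recall)
```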

### log loss

In [None]:
from sklearn.metrics import log_loss
log_loss(ytest, yhatprob)
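
Log loss is the average negative log-likelihood of the true labels under the predicted probabilities, so confident wrong predictions are penalized heavily. Here is a minimal sketch of the binary formula on made-up values (`y_true` and `p_churn` are assumptions for illustration), compared against `log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels and predicted probabilities of class 1.
y_true = np.array([1, 0, 1, 0])
p_churn = np.array([0.9, 0.2, 0.6, 0.4])

# Binary log loss: -mean( y*log(p) + (1-y)*log(1-p) )
manual = -np.mean(y_true * np.log(p_churn) + (1 - y_true) * np.log(1 - p_churn))
print(manual, log_loss(y_true, p_churn))
```

Lower values are better; a model that assigns probability 1 to every correct label would have a log loss of 0.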

<h2 id="practice">Alternate model parameters</h2>
Next, we rebuild the Logistic Regression model on the same dataset, this time with a different __solver__ and a different __regularization__ value, and compare the resulting log loss with the one above.

In [None]:
model = LogisticRegression(C=0.008, solver='sag')
model.fit(xtrain, ytrain)
yhat1 = model.predict(xtest)
yhatprob1 = model.predict_proba(xtest)

log_loss(ytest, yhatprob1)