# Logistic Regression

A telecommunications company is concerned about the number of customers leaving its land-line business for cable competitors, and it needs to understand who is leaving and why. In this notebook, we create a Logistic Regression model to predict which customers are likely to leave for a competitor, so that the company can take action to retain them.

## Difference between Linear and Logistic Regression

While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it is not the best tool for predicting the class of an observed data point. To estimate the class of a data point, we need some measure of which class is most probable for that data point. For this, we use Logistic Regression.

Let's recall linear regression. Linear regression finds a function that relates a continuous dependent variable, $y$, to some predictors (independent variables $x_1$, $x_2$, etc.). For example, multiple linear regression assumes a function of the form $y = \theta_0 + \theta_1  x_1 + \theta_2  x_2 + \cdots$, and finds the values of the parameters $\theta_0, \theta_1, \theta_2$, etc. In vector form, with a constant feature $x_0 = 1$ for the intercept, it can be written as: $h_\theta(x) = \theta^TX$.

Logistic Regression is a variation of Linear Regression, useful when the observed dependent variable, $y$, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables. Logistic regression fits an S-shaped curve by taking the linear regression estimate and transforming it into a probability with the following function, called the sigmoid function $\sigma$:
<br>
<br>
$$h_\theta(x) = \sigma({\theta^TX}) =  \frac {1}{1 + e^{-(\theta_0 + \theta_1  x_1 + \theta_2  x_2 +\cdots)}}$$
    
$h_\theta(x)$ can be interpreted as the probability that an observation belongs to a given class. Logistic Regression passes the linear combination of the inputs through the logistic (sigmoid) function, whose S-shaped curve maps any real number to a value between 0 and 1, and treats the result as a probability.
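
As a minimal sketch of this transformation, the snippet below evaluates the sigmoid on a linear combination of inputs. The parameter vector `theta` and the observation `x` are made-up values chosen only for illustration; they are not fitted to the churn data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: theta[0] is the intercept, theta[1:] are feature weights
theta = np.array([-1.0, 0.5, 2.0])

# One observation with a leading 1 so that theta[0] acts as the intercept
x = np.array([1.0, 2.0, 0.25])

z = theta @ x      # linear part, theta^T x
p = sigmoid(z)     # interpreted as the probability of the positive class

print('linear estimate:', z)
print('probability    :', p)
```

A linear estimate of 0 maps to a probability of 0.5; large positive estimates approach 1 and large negative estimates approach 0.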

<img
src="https://ibm.box.com/shared/static/kgv9alcghmjcv97op4d6onkyxevk23b1.png" width="400" align="center">

The objective of Logistic Regression is to find the best parameters $\theta_i$ for $h_\theta(x) = \sigma({\theta^TX})$, in such a way that the model best predicts the class of each case.


## Data Pre-Processing

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

We use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. Typically, it is less expensive to keep customers than to acquire new ones, so the focus of this analysis is to predict which customers will stay with the company.

This data set provides information to predict what behavior will help to retain customers. We analyze all relevant customer data and develop focused customer retention programs. The dataset includes information about:

- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

In [None]:
df = pd.read_csv('churnData.csv')

In [None]:
df.head()

Let's select some features for modeling. We also convert the target column to an integer type, as expected by the scikit-learn algorithm:

In [None]:
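# Keep only the columns of interest; .copy() avoids a SettingWithCopyWarning on the assignment below
df = df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless', 'churn']].copy()
df['churn'] = df['churn'].astype(int)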

In [None]:
df.head()

Let's define the feature matrix X and the target vector y:

In [None]:
X = np.asarray(df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
y = np.asarray(df['churn'])

Let's standardize the features so that each one has zero mean and unit variance:

In [None]:
from sklearn import preprocessing

X = preprocessing.StandardScaler().fit(X).transform(X)
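
To make the effect of this step concrete, here is a small sketch that standardizes a toy matrix by hand. The values in `X_toy` are invented for illustration; the arithmetic (subtract the column mean, divide by the population standard deviation) is what `StandardScaler` applies with its default settings.

```python
import numpy as np

# Toy feature matrix (made-up values): two columns on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0],
                  [4.0, 700.0]])

col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)    # population standard deviation (ddof = 0)

X_toy_scaled = (X_toy - col_mean) / col_std

print('column means after scaling:', X_toy_scaled.mean(axis=0))
print('column stds after scaling :', X_toy_scaled.std(axis=0))
```

After scaling, both columns are on the same scale, so no single feature dominates the fit merely because of its units.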

Now, we split our dataset into training and test sets:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 4)

print('Train set : ', X_train.shape, y_train.shape)
print('Test set : ', X_test.shape, y_test.shape)

## Logistic Regression

We now build our model using __LogisticRegression__ from the Scikit-learn package. This class implements logistic regression and can use different numerical optimizers to find the parameters, including the ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, and ‘saga’ solvers. Scikit-learn's version of Logistic Regression supports regularization, a technique used to reduce overfitting in machine learning models. The parameter __C__ is the inverse of the regularization strength and must be a positive float. Smaller values specify stronger regularization.
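
The effect of __C__ can be sketched on synthetic data. The example below is an illustration only: `X_demo` and `y_demo` are randomly generated (not the churn data), and it uses the default solver rather than the one chosen in this notebook. It fits the model with three values of C and records the size of the learned coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption for illustration): the label depends on the sum of two features plus noise
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 2)
y_demo = (X_demo[:, 0] + X_demo[:, 1] + 0.5 * rng.randn(200) > 0).astype(int)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C = C).fit(X_demo, y_demo)
    # Euclidean norm of the fitted weights: smaller means the weights were shrunk more
    coef_norms[C] = np.linalg.norm(model.coef_)

for C, norm in coef_norms.items():
    print('C =', C, ' coefficient norm =', round(norm, 3))
```

Smaller values of C pull the coefficients toward zero, which is the stronger regularization described above.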

Now let's fit our model on the training set:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

lr = LogisticRegression(C = 0.01, solver = 'liblinear').fit(X_train, y_train)

Now, we can use the fitted model to make predictions on our test set:

In [None]:
yhat = lr.predict(X_test)
yhat

The __predict_proba__ method returns probability estimates for all classes, ordered by class label. So, the first column is the probability of class 0, P(Y=0|X), and the second column is the probability of class 1, P(Y=1|X):

In [None]:
yhat_prob = lr.predict_proba(X_test)
yhat_prob
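
The column order of `predict_proba` can be checked against the fitted model's `classes_` attribute. The sketch below uses a small synthetic, one-feature dataset (an assumption for illustration, not the churn data) in which positive feature values belong to class 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data: the label is 1 when the feature is positive
rng = np.random.RandomState(1)
X_demo = rng.randn(100, 1)
y_demo = (X_demo[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# One clearly negative and one clearly positive input
proba = clf.predict_proba(np.array([[-2.0], [2.0]]))

print('column order:', clf.classes_)
print(proba)
```

Each row sums to 1, and the column at position `i` holds the probability of the label `clf.classes_[i]`.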

Let's try the **Jaccard index** for accuracy evaluation. We can define Jaccard as the size of the intersection divided by the size of the union of two label sets, here the predicted labels and the true labels. The index is 1.0 when the predicted labels match the true labels exactly, and 0.0 when the two sets have nothing in common.

In [None]:
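# jaccard_similarity_score was removed in scikit-learn 0.23; jaccard_score is its replacement
from sklearn.metrics import jaccard_score

# Jaccard index for the positive class (churn = 1)
jaccard_score(y_test, yhat, pos_label = 1)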
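
For a binary problem, the Jaccard index of the positive class reduces to TP / (TP + FP + FN). The sketch below works this out by hand on a pair of made-up label vectors, `y_true_toy` and `y_pred_toy`, which are hypothetical and unrelated to the churn data.

```python
import numpy as np

# Made-up labels for illustration only
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_toy = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Intersection: cases that are positive in both the truth and the prediction
tp = np.sum((y_true_toy == 1) & (y_pred_toy == 1))
# The union additionally contains cases that are positive in only one of the two
fp = np.sum((y_true_toy == 0) & (y_pred_toy == 1))
fn = np.sum((y_true_toy == 1) & (y_pred_toy == 0))

jaccard = tp / (tp + fp + fn)
print('TP =', tp, 'FP =', fp, 'FN =', fn, 'Jaccard =', jaccard)
```

True negatives do not appear in the formula, so the index is not inflated by a large majority class.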

Another way of looking at the accuracy of a classifier is the **confusion matrix**:

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools

def plot_confusion_matrix(cm, classes, normalize = False, title = 'Confusion Matrix', cmap = plt.cm.Blues):
    """This function prints/plots the confusion  matrix. Normalization can be applied by setting normalize = True"""
    
    if normalize:
        cm = cm.astype('float')/cm.sum(axis = 1)[:, np.newaxis]
        print('Normalized confusion matrix')
    else:
        print('Confusion matrix, without normalization')
        
    print(cm)
    
    plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation = 45)
    plt.yticks(tick_marks, classes)
    
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max()/2.
    
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment = 'center',
                 color = 'white' if cm[i,j] > thresh else 'black')

    plt.tight_layout()
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    
print(confusion_matrix(y_test, yhat, labels = [1,0]))

In [None]:
# compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels = [1,0])
np.set_printoptions(precision = 2)

# plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes = ['churn=1', 'churn=0'], normalize = False, title = 'Confusion matrix')

The first and second rows are for customers whose actual churn value in the test set is 1 and 0, respectively. As we can see, out of 40 customers, the churn value of 15 of them is 1, and for the remaining 25 it is 0.

Out of these 15 customers (actual churn value 1), the classifier correctly predicted 6 as 1 and wrongly predicted 9 as 0. In other words, for 6 customers the actual churn value was 1 and the classifier also predicted 1. However, for 9 customers the actual label was 1 but the classifier predicted 0, which is not very good. We can consider these the model's errors for the first row.

For the remaining 25 customers (actual churn value 0), the classifier correctly predicted 24 as 0 and wrongly predicted one as 1. So, it has done a good job in predicting the customers with churn value 0.

A good thing about the confusion matrix is that it shows the model’s ability to correctly predict or separate the classes. In the specific case of a binary classifier, such as this example, we can interpret these numbers as the counts of true positives, false positives, true negatives, and false negatives.

In [None]:
print(classification_report(y_test, yhat))

Based on the count of each section, we can calculate precision and recall of each label:

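- __Precision__ is a measure of the accuracy provided that a class label has been predicted, that is, the fraction of cases predicted as a class that truly belong to it. It is defined by: Precision = TP / (TP + FP)

- __Recall__ is the true positive rate, that is, the fraction of cases truly belonging to a class that are predicted as that class. It is defined as: Recall = TP / (TP + FN)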

So, we can calculate precision and recall of each class.
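
As a worked example, the sketch below applies these two formulas to the counts read off the confusion matrix discussed earlier (6 and 9 in the first row, 1 and 24 in the second). The variable names are chosen here for illustration.

```python
# Counts from the confusion matrix above, with churn = 1 as the positive class
tp, fn = 6, 9     # actual churn = 1: predicted 1, predicted 0
fp, tn = 1, 24    # actual churn = 0: predicted 1, predicted 0

# Label 1 (churn)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Label 0 (no churn): the roles of the cells swap, so TN plays the part of TP
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print('churn=1: precision = %.2f, recall = %.2f' % (precision_1, recall_1))
print('churn=0: precision = %.2f, recall = %.2f' % (precision_0, recall_0))
```

The low recall for churn = 1 reflects the 9 churners that the classifier missed.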

__F1 score:__ Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label.

The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0. It is a good way to show that a classifier has a good value for both recall and precision.

And finally, we can summarize this classifier with the weighted average of the F1 scores of both labels (each label weighted by its number of test cases), which is 0.72 in our case.
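
The 0.72 figure can be reproduced from the confusion matrix counts quoted earlier, assuming it is the support-weighted average of the per-label F1 scores. The helper `f1_from_counts` below is a hypothetical name introduced for this sketch.

```python
# Counts from the confusion matrix above, with churn = 1 as the positive class
tp, fn, fp, tn = 6, 9, 1, 24

def f1_from_counts(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall for one label."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

f1_churn = f1_from_counts(tp, fp, fn)   # label 1
f1_stay = f1_from_counts(tn, fn, fp)    # label 0: the cell roles swap

# Support = number of test cases that truly carry each label
support_churn = tp + fn
support_stay = tn + fp

weighted_f1 = (f1_churn * support_churn + f1_stay * support_stay) / (support_churn + support_stay)

print('F1 (churn=1):', round(f1_churn, 2))
print('F1 (churn=0):', round(f1_stay, 2))
print('weighted F1 :', round(weighted_f1, 2))
```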

**Log loss:** In logistic regression, the output can be interpreted as the probability that a customer churns (churn equals 1). This probability is a value between 0 and 1. Log loss (logarithmic loss) measures the performance of a classifier whose predicted output is such a probability; the lower the log loss, the better the predictions.

In [None]:
from sklearn.metrics import log_loss

log_loss(y_test, yhat_prob)
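
To show what this number measures, here is a minimal sketch of binary log loss computed by hand as the negative mean of y·log(p) + (1 − y)·log(1 − p). The labels and probabilities are made up for illustration, and `binary_log_loss` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def binary_log_loss(y_true, p_pos, eps = 1e-15):
    """Average negative log-likelihood of the true labels under the predicted probabilities."""
    p = np.clip(p_pos, eps, 1 - eps)    # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Made-up true labels and predicted probabilities of class 1
y_true_toy = np.array([1, 0, 1, 0])
p_good = np.array([0.9, 0.2, 0.6, 0.4])    # mostly leaning the right way
p_bad = np.array([0.1, 0.8, 0.4, 0.6])     # leaning the wrong way

loss_good = binary_log_loss(y_true_toy, p_good)
loss_bad = binary_log_loss(y_true_toy, p_bad)

print('log loss (good predictions):', loss_good)
print('log loss (bad predictions) :', loss_bad)
```

Confident but wrong probabilities are penalized heavily, which is why the second set of predictions scores much worse.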