# Logistic Regression with Python

### First things first. 
1. What is the difference between linear and logistic regression?

> Linear regression is suited to predicting continuous values (house prices, for example), but it is not the best tool for predicting the class of an observed data point.

> To estimate the class of a data point, we need some guidance on which class is most probable for it. For this, we use logistic regression.

> As we know, linear regression finds a function that relates a continuous dependent variable y to some predictors (independent variables x1, x2, etc.). For example, linear regression with several predictors assumes a function of the form:

$$y = a + b x_1 + c x_2 + d x_3 + \dots$$

> and then finds the values of the coefficients and the intercept. In vector notation, with parameter vector θ and feature vector X, we write it as:

$$h_\theta(X) = \theta^T X$$
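
As a minimal sketch of the vector form, the snippet below evaluates the linear hypothesis for a single observation. The parameter values and features are made up for illustration, and the leading 1 in the feature vector is an assumed convention so that the intercept becomes part of the dot product.

```python
import numpy as np

# Hypothetical parameters: intercept a, then coefficients b and c
theta = np.array([1.0, 2.0, -0.5])

# One observation with x1 = 3 and x2 = 4; the leading 1 multiplies the intercept
x = np.array([1.0, 3.0, 4.0])

# h(x) = theta^T x = 1 + 2*3 - 0.5*4
h = theta @ x
print(h)  # 5.0
```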

2. Now, some nice details about logistic regression.

> Logistic regression is a variation of linear regression, useful when the observed dependent variable y is categorical rather than continuous. It produces a formula that predicts the probability of a class label as a function of the independent variables.

> Logistic regression fits a special S-shaped curve by taking the linear regression estimate and transforming it into a probability with the following function, usually called the sigmoid function (θ is the parameter vector and X is the feature vector):

$$h_\theta(X) = P(Y=1 \mid X) = \sigma(\theta^T X) = \frac{e^{\theta^T X}}{1 + e^{\theta^T X}}$$

1. In brief, logistic regression passes the linear combination of the inputs through the logistic (sigmoid) function and treats the result as a probability.

2. The objective of the logistic regression algorithm is to find the best parameter vector for the sigmoid equation above, so that the model predicts the class of each case as accurately as possible.
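
The idea above can be sketched in a few lines of NumPy. The parameter vector and feature vector here are hypothetical values chosen for illustration, and the 0.5 decision threshold is an assumed default. The code uses the form 1 / (1 + e^(-z)), which is algebraically the same as e^z / (1 + e^z).

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.8, 0.5])   # hypothetical parameter vector (intercept first)
x = np.array([1.0, 2.0, -1.0])       # feature vector with a leading 1 for the intercept

z = theta @ x            # linear score: -1 + 1.6 - 0.5 = 0.1
p = sigmoid(z)           # interpreted as P(Y=1 | x)
label = int(p >= 0.5)    # assumed threshold of 0.5

print(round(p, 3), label)
```

A score of exactly 0 maps to a probability of 0.5, so the sign of the linear score decides the predicted class under this threshold.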

In [4]:
import pandas as pd
import numpy as np
import pylab as pl
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline
import matplotlib.pyplot as plt

In [5]:
!wget -O ChurnData.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv

--2019-10-29 20:14:28--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36144 (35K) [text/csv]
Saving to: ‘ChurnData.csv’


2019-10-29 20:14:32 (20.5 KB/s) - ‘ChurnData.csv’ saved [36144/36144]



## Loading data from CSV file

In [7]:
churn_df = pd.read_csv('ChurnData.csv')
churn_df.head()

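| | tenure | age | address | income | ed | employ | equip | callcard | wireless | longmon | ... | pager | internet | callwait | confer | ebill | loglong | logtoll | lninc | custcat | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.0 | 33.0 | 7.0 | 136.0 | 5.0 | 5.0 | 0.0 | 1.0 | 1.0 | 4.4 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.482 | 3.033 | 4.913 | 4.0 | 1.0 |
| 1 | 33.0 | 33.0 | 12.0 | 33.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.45 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.246 | 3.24 | 3.497 | 1.0 | 1.0 |
| 2 | 23.0 | 30.0 | 9.0 | 30.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 6.3 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.841 | 3.24 | 3.401 | 3.0 | 0.0 |
| 3 | 38.0 | 35.0 | 5.0 | 76.0 | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 6.05 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.8 | 3.807 | 4.331 | 4.0 | 0.0 |
| 4 | 7.0 | 35.0 | 14.0 | 80.0 | 2.0 | 15.0 | 0.0 | 1.0 | 0.0 | 7.1 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.96 | 3.091 | 4.382 | 3.0 | 0.0 |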


## Data pre-processing and selection

In [10]:
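# Keep a subset of columns; .copy() avoids pandas' SettingWithCopyWarning on the assignment below
churn_df = churn_df[['tenure','age','address','income','ed','employ','equip','callcard','wireless','churn']].copy()
# Cast the target to integer class labels
churn_df['churn'] = churn_df['churn'].astype('int')
churn_df.head()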

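| | tenure | age | address | income | ed | employ | equip | callcard | wireless | churn |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.0 | 33.0 | 7.0 | 136.0 | 5.0 | 5.0 | 0.0 | 1.0 | 1.0 | 1 |
| 1 | 33.0 | 33.0 | 12.0 | 33.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | 23.0 | 30.0 | 9.0 | 30.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | 38.0 | 35.0 | 5.0 | 76.0 | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 0 |
| 4 | 7.0 | 35.0 | 14.0 | 80.0 | 2.0 | 15.0 | 0.0 | 1.0 | 0.0 | 0 |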


In [14]:
churn_df.info()
churn_df.shape
churn_df.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 10 columns):
tenure      200 non-null float64
age         200 non-null float64
address     200 non-null float64
income      200 non-null float64
ed          200 non-null float64
employ      200 non-null float64
equip       200 non-null float64
callcard    200 non-null float64
wireless    200 non-null float64
churn       200 non-null int64
dtypes: float64(9), int64(1)
memory usage: 15.7 KB


Index(['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip',
       'callcard', 'wireless', 'churn'],
      dtype='object')

In [15]:
# Let's define the feature matrix X and the target vector Y.
X = np.asarray(churn_df[['tenure','age','address','income','employ','equip']])
X[0:5]

array([[ 11.,  33.,   7., 136.,   5.,   0.],
       [ 33.,  33.,  12.,  33.,   0.,   0.],
       [ 23.,  30.,   9.,  30.,   2.,   0.],
       [ 38.,  35.,   5.,  76.,  10.,   1.],
       [  7.,  35.,  14.,  80.,  15.,   0.]])

In [17]:
Y = np.asarray(churn_df['churn'])
Y[0:5]

array([1, 1, 0, 0, 0])

In [20]:
# Standardize the features so each column has zero mean and unit variance.
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-1.13518441, -0.62595491, -0.4588971 ,  0.4751423 , -0.58477841,
        -0.85972695],
       [-0.11604313, -0.62595491,  0.03454064, -0.32886061, -1.14437497,
        -0.85972695],
       [-0.57928917, -0.85594447, -0.261522  , -0.35227817, -0.92053635,
        -0.85972695],
       [ 0.11557989, -0.47262854, -0.65627219,  0.00679109, -0.02518185,
         1.16316   ],
       [-1.32048283, -0.47262854,  0.23191574,  0.03801451,  0.53441472,
        -0.85972695]])
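
To make the scaling step concrete, here is a small sketch of the same transformation done by hand in NumPy. The array `X_demo` is made-up data, not the churn dataset. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Made-up feature matrix: 3 samples, 2 features on very different scales
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)            # population std (ddof=0)
X_scaled = (X_demo - mean) / std    # z-score per column

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

A common refinement is to fit the scaler on the training split only and reuse its mean and standard deviation on the test split, so that no test-set statistics leak into training.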

## Train/Test split

In [22]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=4)
print("Train set: ",X_train.shape,Y_train.shape)
print("Test set: ",X_test.shape,Y_test.shape)

Train set:  (160, 6) (160,)
Test set:  (40, 6) (40,)


## Modeling (Logistic Regression with Scikit-learn)

1. Let's build our model using LogisticRegression from the scikit-learn package. The version of logistic regression in scikit-learn supports regularization, a technique used to reduce overfitting in machine learning models.

2. The C parameter is the inverse of the regularization strength and must be a positive float. Smaller values mean stronger regularization.
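
The effect of C can be illustrated with a short experiment. This is a sketch on synthetic data from `make_classification` (an assumption made for illustration, not the churn dataset): as C shrinks, the L2 penalty pulls the learned coefficients toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, solver='liblinear').fit(X_demo, y_demo)
    norms[C] = np.linalg.norm(model.coef_)   # overall size of the coefficient vector
    print(f"C={C}: coefficient norm = {norms[C]:.3f}")
```

With a very small C, such as the 0.01 used in this notebook, the coefficients stay small, which tends to keep predicted probabilities close to 0.5.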

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01,solver='liblinear').fit(X_train,Y_train)
LR

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

## Prediction

In [31]:
yhat = LR.predict(X_test)
print(yhat)
yhat_prob = LR.predict_proba(X_test)
yhat_prob

[0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1
 0 0 0]


array([[0.55084393, 0.44915607],
       [0.58894579, 0.41105421],
       [0.54561547, 0.45438453],
       [0.63735317, 0.36264683],
       [0.55707784, 0.44292216],
       [0.53654226, 0.46345774],
       [0.53504402, 0.46495598],
       [0.58883148, 0.41116852],
       [0.41941097, 0.58058903],
       [0.62707511, 0.37292489],
       [0.59030247, 0.40969753],
       [0.61799371, 0.38200629],
       [0.46776399, 0.53223601],
       [0.43580494, 0.56419506],
       [0.64728729, 0.35271271],
       [0.53118273, 0.46881727],
       [0.5269285 , 0.4730715 ],
       [0.49454596, 0.50545404],
       [0.49927727, 0.50072273],
       [0.53708295, 0.46291705],
       [0.60993916, 0.39006084],
       [0.51572106, 0.48427894],
       [0.63298113, 0.36701887],
       [0.52124923, 0.47875077],
       [0.49454026, 0.50545974],
       [0.71104927, 0.28895073],
       [0.54195079, 0.45804921],
       [0.51186228, 0.48813772],
       [0.51913451, 0.48086549],
       [0.7160412 , 0.2839588 ],
       [0.
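
The relationship between `predict` and `predict_proba` can be sketched as follows. The three rows are copied from the probability output above (the first, second, and ninth test samples). The columns follow the order of `LR.classes_`, so column 0 is P(churn = 0) and column 1 is P(churn = 1).

```python
import numpy as np

# Rows copied from the predict_proba output above (test samples 1, 2 and 9)
probs = np.array([[0.55084393, 0.44915607],
                  [0.58894579, 0.41105421],
                  [0.41941097, 0.58058903]])

# For a binary problem, the predicted label is the class with the larger probability
labels = np.argmax(probs, axis=1)
print(labels)  # [0 0 1]
```

These labels agree with the corresponding entries of `yhat` printed above.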

## Evaluation

1. Jaccard Index
2. Confusion matrix
3. Log loss
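
As a quick illustration of the second item, the sketch below builds a confusion matrix with `sklearn.metrics.confusion_matrix` on made-up labels (not the churn predictions). Passing `labels=[1, 0]` is a choice made here so that the positive class comes first. Rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1]

# Row order and column order both follow labels=[1, 0]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 0])
print(cm)
# [[2 1]    true 1: 2 predicted as 1, 1 predicted as 0
#  [1 2]]   true 0: 1 predicted as 1, 2 predicted as 0
```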

## Jaccard Index

> The Jaccard index is defined as the size of the intersection divided by the size of the union of two label sets. If the entire set of predicted labels for a sample strictly matches the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.

In [34]:
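# Note: jaccard_similarity_score was deprecated in scikit-learn 0.21 and removed in 0.23;
# newer releases provide sklearn.metrics.jaccard_score instead.
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(Y_test,yhat)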

0.725
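
To show what "intersection over union" means for binary labels, here is a hand-computed sketch on made-up labels (not the churn predictions). It computes the Jaccard index for the positive class, which is what `jaccard_score` reports by default in newer scikit-learn releases. As far as I can tell, the older `jaccard_similarity_score` behaved like plain accuracy for binary labels, so the two numbers can differ.

```python
import numpy as np

# Made-up labels for illustration only
y_true_demo = np.array([1, 0, 1, 1, 0, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1])

# Treat the samples labeled 1 as a set, for the truth and for the prediction
intersection = np.sum((y_true_demo == 1) & (y_pred_demo == 1))   # 2 samples
union = np.sum((y_true_demo == 1) | (y_pred_demo == 1))          # 4 samples
jaccard = intersection / union

# Plain accuracy on the same labels, for comparison
accuracy = np.mean(y_true_demo == y_pred_demo)

print(jaccard)   # 0.5
print(accuracy)
```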

## Log loss
> Now, let's try __log loss__ for evaluation. In logistic regression, the output can be interpreted as the probability that a customer churns (churn = 1). This probability is a value between 0 and 1.

> Log loss (logarithmic loss) measures the performance of a classifier whose predicted output is a probability between 0 and 1. Lower values are better, and confident wrong predictions are penalized heavily.
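
As a sketch of how this metric is computed, the snippet below evaluates the standard binary log loss formula by hand: the negative mean of y·log(p) + (1 − y)·log(1 − p). The labels and probabilities are made up for illustration. scikit-learn offers `sklearn.metrics.log_loss` for the same purpose.

```python
import numpy as np

# Made-up true labels and predicted probabilities of class 1
y_true_demo = np.array([1, 0, 1, 0])
p_pred_demo = np.array([0.9, 0.2, 0.6, 0.4])

# Binary log loss: average negative log-likelihood of the true labels
log_loss_manual = -np.mean(
    y_true_demo * np.log(p_pred_demo)
    + (1 - y_true_demo) * np.log(1 - p_pred_demo)
)
print(round(log_loss_manual, 4))

# A confident but wrong prediction costs far more than a confident correct one
loss_right = -np.log(0.9)   # true label 1, predicted 0.9
loss_wrong = -np.log(0.1)   # true label 1, predicted 0.1
print(loss_right < loss_wrong)
```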


In [None]:
from sklearn