# Logistic regression for binary classification

The <b>logistic model</b> is used to model the probability of a certain class or event, such as pass/fail, win/lose, alive/dead or healthy/sick.

A <b>binary logistic model</b> has a dependent variable with two possible values, such as pass/fail, represented by the labels 1 and 0.

<div align="right">   Reference: Wikipedia </div> 


In <b>logistic classification</b>, the classification is based on the hypothesis $ h_\theta(x) $:

If $ h_\theta(x) \geq 0.5 $ predict $ y = 1 $.

If $ h_\theta(x) < 0.5 $ predict $ y = 0 $. 

Here $ h_\theta(x) $ is the estimated probability that $ y = 1 $, so $ 0 \leq h_\theta(x) \leq 1 $.
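
The hypothesis is not written out above. In standard logistic regression it is the sigmoid of a linear score, $ h_\theta(x) = 1 / (1 + e^{-\theta^T x}) $. The sketch below shows that function together with the 0.5 decision rule. The names `sigmoid` and `predict_label` and the values in `theta` are illustrative assumptions, not part of this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map any real score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(theta, x, threshold=0.5):
    """Return 1 if h_theta(x) >= threshold, else 0."""
    h = sigmoid(np.dot(theta, x))
    return int(h >= threshold)

# Hypothetical parameters: intercept -4.0, slope 2.5 for a single feature.
theta = np.array([-4.0, 2.5])

# x[0] = 1 is the bias term; x[1] is the feature value.
print(sigmoid(0.0))                                # 0.5
print(predict_label(theta, np.array([1.0, 2.0])))  # score =  1.0 -> 1
print(predict_label(theta, np.array([1.0, 1.0])))  # score = -1.5 -> 0
```

A score of exactly zero gives a probability of 0.5, which is why the threshold on $ h_\theta(x) $ corresponds to the sign of the linear score.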
    
    

 ![title](images/LogisticRegression.png)

<div align="right">   Reference: techdifferences.com </div>  

## 1.  Import necessary libraries

In [2]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## 2. Know and prepare the dataset

The Iris dataset consists of 150 observations.

The features or attributes (columns) are Sepal Length, Sepal Width, Petal Length and Petal Width.

The observations (rows) belong to 3 different iris species: Setosa, Versicolour, and Virginica. Each class has 50 observations.

The Iris dataset can also be downloaded from Kaggle as a CSV file: https://www.kaggle.com/uciml/iris

In [3]:
# Load Iris dataset from the library

iris = load_iris()
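
The description of the dataset can be checked against the loaded object. This is a small sketch that uses only the keys `data`, `feature_names`, `target_names` and `target` of the object returned by `load_iris`.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris["data"].shape)           # (rows, columns)
print(iris["feature_names"])        # column names
print(iris["target_names"])         # species names, indexed by class label
print(np.bincount(iris["target"]))  # number of observations per class
```

The class labels 0, 1 and 2 index into `target_names`, so class 2 is the third species listed there.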


In [4]:
# Consider only the fourth feature or column - petal width

X = iris["data"][:,3:]  

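# Build a binary target: 1 for class 2 (Virginica), 0 for the other two species.
# This is the one-vs-all setup for a single class.

y = (iris["target"]==2).astype(int)  # np.int was removed in NumPy 1.24; use the built-in int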

In [5]:
# Split data into train and test

from sklearn.model_selection import train_test_split

(X_train, X_test, y_train, y_test) = train_test_split(X, y, stratify=y, test_size= 0.3)

# test set = 30 % of the data
# the remaining 70 % is used for training
# stratify=y keeps the class proportions the same in the train and test sets

print(X_train.shape)
print(X_test.shape)

(105, 1)
(45, 1)
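
The effect of `stratify` can be seen by comparing the share of positive labels in each subset. The sketch below repeats the split with `random_state=0`, which is an assumption added here for reproducibility and is not used in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris["data"][:, 3:]
y = (iris["target"] == 2).astype(int)

# random_state=0 is an assumption added here for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

# Fraction of positive (class 2) labels in the full, train and test sets.
print(y.mean(), y_train.mean(), y_test.mean())

# Absolute counts of positive labels in train and test.
print(y_train.sum(), y_test.sum())
```

With 50 positives out of 150 rows, a stratified 70/30 split places 35 positives in the training set and 15 in the test set, so all three fractions are one third.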


## 3. Create and train the machine learning model

In [6]:
# build the model

logistic_model = LogisticRegression(solver='lbfgs') # default hyperparameters with the L-BFGS solver

logistic_model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
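
Once fitted, the model exposes its parameters as `coef_` and `intercept_`. With a single feature, the petal width at which the predicted probability crosses 0.5 can be read off directly. The sketch below fits on all 150 rows to stay short, which is an assumption that differs from the train-only fit in this notebook. The names `w`, `b` and `boundary` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris["data"][:, 3:]                 # petal width (cm)
y = (iris["target"] == 2).astype(int)   # 1 = class 2, 0 = other

# Fitted on all rows to keep the sketch short (an assumption).
model = LogisticRegression(solver="lbfgs").fit(X, y)

w = model.coef_[0, 0]     # slope for petal width
b = model.intercept_[0]   # intercept

# h(x) = 0.5 exactly where w * x + b = 0.
boundary = -b / w
print(w, b, boundary)

# The predicted probability at the boundary should be about 0.5.
p_at_boundary = model.predict_proba([[boundary]])[0, 1]
print(p_at_boundary)
```

Flowers with a petal width above `boundary` are predicted as class 1, and those below it as class 0.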

## 4. Predict the y values

In [7]:
# predict class labels (0 or 1) for the test data
# note: despite its name, y_prob holds labels, not probabilities
y_prob  = logistic_model.predict(X_test)
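
# `predict` returns hard labels. To obtain the probabilities that the 0.5 rule
# is applied to, scikit-learn provides `predict_proba`. The sketch below is
# self-contained and uses `random_state=0` as an assumption for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris["data"][:, 3:]
y = (iris["target"] == 2).astype(int)

# random_state=0 is an assumption for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)
model = LogisticRegression(solver="lbfgs").fit(X_train, y_train)

# Column 0 holds P(y = 0 | x), column 1 holds P(y = 1 | x).
proba = model.predict_proba(X_test)
p_positive = proba[:, 1]

# Applying the 0.5 threshold by hand should reproduce predict().
manual_labels = (p_positive >= 0.5).astype(int)
labels = model.predict(X_test)
print(np.array_equal(manual_labels, labels))
```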


## 5. Performance measures

In [8]:
# compute the accuracy

acc = accuracy_score(y_test, y_prob)
print(acc * 100)

97.77777777777777


In [9]:
# build the confusion matrix

from sklearn.metrics import confusion_matrix 

conf_matrix = confusion_matrix(y_test, y_prob) 
print('Confusion Matrix :')
print(conf_matrix)


Confusion Matrix :
[[29  1]
 [ 0 15]]


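Here $TP$, $TN$, $FP$ and $FN$ are the counts of true positives, true negatives, false positives and false negatives. $P = TP + FN$ is the number of actual positives and $N = TN + FP$ is the number of actual negatives.

$$ \text{Accuracy} = \frac{TP + TN}{P + N} $$

$$ \text{Precision} = \frac{TP}{TP + FP} $$

$$ \text{Recall} = \text{Sensitivity} = \frac{TP}{P} $$

$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$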
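
As a worked example, the formulas can be applied by hand to the confusion matrix printed above. In scikit-learn's layout the rows are true labels and the columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`. The variable names are illustrative.

```python
# Values copied from the confusion matrix printed above.
tn, fp = 29, 1
fn, tp = 0, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))   # 0.9778
print(round(precision, 4))  # 0.9375
print(round(recall, 4))     # 1.0
print(round(f1, 4))         # 0.9677
```

These values are for the positive class (label 1) and agree with the accuracy of 97.78 % reported earlier.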

In [10]:
# Classification Report - precision, recall

from sklearn.metrics import classification_report 

print('Classification Report: ')
print(classification_report(y_test, y_prob))


Classification Report: 
              precision    recall  f1-score   support

           0       1.00      0.97      0.98        30
           1       0.94      1.00      0.97        15

   micro avg       0.98      0.98      0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

