# Logistic Regression
-----
1.	Logistic regression, despite its name, is a linear model for `classification` rather than regression.
2.	Logistic regression is also known in the literature as `logit regression`, `maximum-entropy classification (MaxEnt)` or the `log-linear classifier`. 
3.	In this model, the probabilities describing the possible outcomes of a single trial are modelled using a `logistic function`.
4.	A logistic function or logistic curve is a common "S" shape (sigmoid curve), with equation

    <b>$f(x) =\frac {L}{1+e^{-k(x-x_{0})}}$</b>
    ----
    
    ![image](https://github.com/rritec/datahexa/blob/master/images/ml/logreg.png?raw=true)
    
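    Here $L$ is the curve's maximum value, $k$ its steepness and $x_{0}$ its midpoint. Setting $L = 1$, $k = 1$ and $x_{0} = 0$ gives the standard logistic (sigmoid) function:

    <b>$f(x) =\frac {1}{1+e^{-x}}$</b>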
    
    
5. For the maths behind it, skim through the links below. Don't panic if you don't understand everything.

    a.	https://en.wikipedia.org/wiki/Logistic_function
    
    b.	https://en.wikipedia.org/wiki/Logistic_regression
    
    c.  http://scikit-learn.org/dev/modules/linear_model.html#logistic-regression
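
To make the logistic function above concrete, here is a minimal sketch in plain Python. The function names `logistic` and `sigmoid` are chosen for illustration only; this is not how scikit-learn implements the model internally.

```python
import math

def logistic(x, L=1.0, k=1.0, x0=0.0):
    """General logistic curve: f(x) = L / (1 + exp(-k * (x - x0)))."""
    return L / (1.0 + math.exp(-k * (x - x0)))

def sigmoid(x):
    """Standard logistic function: the special case L=1, k=1, x0=0."""
    return logistic(x)

# The output always lies between 0 and 1, so it can be read as a probability.
for x in (-6, -2, 0, 2, 6):
    print(x, round(sigmoid(x), 4))
```

Note that `sigmoid(0)` is exactly 0.5 and that `sigmoid(x) + sigmoid(-x)` equals 1, which is why 0.5 is the natural cut-off between the two classes.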




**Business Problem Statement:**
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has **diabetes**, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.    

## Step 1: Import required modules

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

## Step 2: Load and understand the data

In [None]:
import os 
os.chdir("C:\\Users\\Hi\\Google Drive\\01 DS ML DL NLP and AI With Python Lab Copy\\02 Lab Data\\Python")

In [None]:
os.getcwd()

In [None]:
import pandas as pd
df = pd.read_csv("pima-indians-diabetes.txt")

In [None]:
df.info()

In [None]:
X = df[["pregnancies","glucose","diastolic","triceps","insulin","bmi","dpf","age"]]

In [None]:
X.info()

In [None]:
y=df["diabetes"] 

In [None]:
type(y)

In [None]:
type(X)

## Step 3: Split into Train Test data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.4,    # hold out 40% of the rows for testing
    random_state=42,  # seed, so the split is reproducible
)

In [None]:
X_train.shape

In [None]:
type(X_train)

In [None]:
768* 0.6

In [None]:
type(X_test)

In [None]:
X_test.shape

In [None]:
type(y_train)

In [None]:
y_train.shape

In [None]:
y_train[2:14]

In [None]:
X_train[2:14]

## Step 4: Fit the algorithm

In [None]:
LogisticRegression?

In [None]:
log = LogisticRegression()

In [None]:
log.fit(X_train,y_train)

## Step 5: Predict

In [None]:
y_pred = log.predict(X_test)

In [None]:
y_test[2:14]

In [None]:
y_pred[2:14]

## Step 6: Check accuracy

In [None]:
accuracy_score(y_test,y_pred)

## Step 7: Confusion Matrix
--------

In [None]:
confusion_matrix?

In [None]:
confusion_matrix(y_test,y_pred)

In [None]:
174+32+36+66

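- T (true) means the predicted value matches the actual value; F (false) means it does not.
- P (positive) or N (negative) is the label the model predicted. For example, FP is a patient predicted yes who is actually no.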

![image](https://github.com/rritec/datahexa/blob/dev/images/ds000003.png?raw=true)

[Read about confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix)
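
As a rough illustration of how the four cells are counted, the sketch below tallies TN, FP, FN and TP by hand. The helper `confusion_counts` and the toy labels are hypothetical, not the diabetes data; the result uses the same layout that `confusion_matrix` returns for binary labels (rows are actual values, columns are predicted values).

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (0 = no, 1 = yes)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # predicted yes, and it is right
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted yes, but it is wrong
        elif predicted == 0 and actual == 1:
            fn += 1  # predicted no, but it is wrong
        else:
            tn += 1  # predicted no, and it is right
    return tn, fp, fn, tp

# Hypothetical toy labels for illustration
y_true_toy = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true_toy, y_pred_toy)
matrix = [[tn, fp], [fn, tp]]  # rows: actual no/yes, columns: predicted no/yes
print(matrix)  # [[3, 1], [1, 3]]
```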


## Step 8: Classification Report
--------

- `Precision:` When it predicts yes, how often is it correct? When it predicts No, how often is it correct? 

    - Precision (yes) = TP/predicted yes = 66/98 = 0.673
    - Precision (no) = TN/predicted no = 174/210 = 0.829
    
- `True Positive Rate/Sensitivity/Recall`: When it's actually yes, how often does it predict yes? For the no class, the same ratio is the true negative rate (specificity): when it's actually no, how often does it predict no?

    - Recall (yes) = TP/actual yes = 66/(36+66) = 0.647
    - Recall (no) = TN/actual no = 174/(174+32) = 0.845
    
- `F1 Score`: This is the harmonic mean of precision and the true positive rate (recall)

    - $F_1 = \frac {2 \times precision \times recall}{precision + recall}$
    
- Support: 
    - Actual No label patients count =174 + 32 = 206
    - Actual Yes label patients count =36+66 = 102
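
The arithmetic above can be checked with a few lines of plain Python. The four counts are the ones quoted in this section (TN = 174, FP = 32, FN = 36, TP = 66); the variable names and the helper `f1` are chosen for illustration.

```python
# Counts quoted in the text: rows are actual no/yes, columns are predicted no/yes
tn, fp, fn, tp = 174, 32, 36, 66

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision_yes = tp / (tp + fp)  # 66 / 98
recall_yes = tp / (tp + fn)     # 66 / 102
precision_no = tn / (tn + fn)   # 174 / 210
recall_no = tn / (tn + fp)      # 174 / 206

f1_yes = f1(precision_yes, recall_yes)
f1_no = f1(precision_no, recall_no)

support_yes = tp + fn  # actual yes patients
support_no = tn + fp   # actual no patients
accuracy = (tp + tn) / (tn + fp + fn + tp)

print("no :", round(precision_no, 3), round(recall_no, 3), round(f1_no, 3), support_no)
print("yes:", round(precision_yes, 3), round(recall_yes, 3), round(f1_yes, 3), support_yes)
print("accuracy:", round(accuracy, 3))
```

These are the same quantities that `classification_report` prints per class, rounded to fewer decimals.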


In [None]:
classification_report?

In [None]:
print(classification_report(y_test,y_pred))

## ROC Curve
-----------------

- ROC stands for Receiver Operating Characteristic
- The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings
- In machine learning, the true positive rate is also known as sensitivity, recall or probability of detection
- The false positive rate is also known as the fall-out or probability of false alarm

[Read about ROC Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
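
To show what "at various threshold settings" means, here is a minimal hand-rolled sketch that sweeps a threshold over predicted probabilities and records one (FPR, TPR) point per threshold. The helper `roc_points` and the toy data are hypothetical; `roc_curve` in scikit-learn does the equivalent work more efficiently and may drop some intermediate points.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per candidate threshold, highest threshold first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted yes
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted yes
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical toy labels and predicted probabilities of the yes class
y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

for fpr_value, tpr_value in roc_points(y_true_toy, scores_toy):
    print(fpr_value, tpr_value)
```

Lowering the threshold can only add predicted yes cases, so both rates move from (0, 0) towards (1, 1) as the sweep proceeds.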


In [None]:
# Import necessary modules
from sklearn.metrics import roc_curve

In [None]:
roc_curve?

In [None]:
y_pred[:10]

In [None]:
log.predict_proba(X_test)[10:15]

In [None]:
# Compute predicted probabilities: y_pred_prob
y_pred_prob = log.predict_proba(X_test)[:,1]

In [None]:
y_pred_prob[10:15]

In [None]:
roc_curve?

In [None]:
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

In [None]:
# Plot ROC curve
import matplotlib.pyplot as plt 
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

## AUC computation
----------
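
AUC is the area under the ROC curve. As a hedged sketch of what that number measures, the code below computes it two ways on hypothetical toy data: with the trapezoidal rule over ROC points, and as the fraction of (yes, no) pairs in which the yes example gets the higher score. The helper names and the data are illustrative assumptions; the ROC points listed are the ones obtained by thresholding these toy scores.

```python
def auc_trapezoid(points):
    """Area under a curve given as (fpr, tpr) points sorted by increasing fpr."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0
    return area

def auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy labels, scores, and the ROC points those scores produce
y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
roc_points_toy = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

print(auc_trapezoid(roc_points_toy))         # 0.75
print(auc_pairwise(y_true_toy, scores_toy))  # 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal (random guessing) and 1.0 to a perfect ranking.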

In [None]:
# Import necessary modules
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

In [None]:
y_pred

In [None]:
log.predict_proba(X_test)[:5]

In [None]:
# Compute predicted probabilities of the positive class (label 1, "yes"): y_pred_prob
y_pred_prob = log.predict_proba(X_test)[:,1]


In [None]:
y_pred_prob

In [None]:
roc_auc_score?

In [None]:
# Compute and print AUC score
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))


In [None]:
accuracy_score(y_test,y_pred)

In [None]:
# Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(log, X, y, cv=5, scoring='roc_auc')
# Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))

## Hyperparameter tuning with GridSearchCV

In [None]:
# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np
# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}
# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()
# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
# Fit it to the data
logreg_cv.fit(X, y)

In [None]:
# Print the tuned parameter and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))

> **Homework:** [Repeat these steps with one more dataset](https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8)