<a href="https://colab.research.google.com/github/stephenfrein/logistic_regression_python/blob/master/LogisticRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression in Python
---
This is the code for the "Logistic Regression in Python" Peer Learning Session.

# Load and Examine Data

To get things started, we'll load some CSV data about diabetes from a URL into a data frame (good for holding tabular data).

In [0]:
# it's conventional to alias pandas as pd once imported
import pandas as pd
url="https://drive.google.com/uc?export=download&id=1ZbLGkNQwMfvVNu0WaxNj8HOBRS1kbLDS"
# pandas will read this data into a DataFrame, the typical pandas data structure
diabetes_raw=pd.read_csv(url)
diabetes_raw.head()


In [0]:
# examine statistics about data
diabetes_raw.describe(include='all')

In [0]:
# base rates of Outcome (1 = has diabetes, 0 = does not)
print(diabetes_raw['Outcome'].value_counts(normalize=True))

In [0]:
# seaborn is a visualization library
import seaborn as sn
# create a matrix of correlations between variables
corrMatrix = diabetes_raw.corr()
print(corrMatrix)
# generate a heatmap based on correlation matrix
sn.heatmap(corrMatrix, annot=True)

<img src="https://i.ya-webdesign.com/images/start-race-runner-icon-png-1.png" width="100" />

# Exercise 1

Age is most correlated with what other variable? (Put your answer in Chat.)

# Training/Test Split

We will break our data into **training** and **test** sets:

*   The training set is used to build the model: which X values explain our y?
*   The test set lets us check the model against data it has never “seen”, which gives an estimate of its performance on future data.

Other approaches use cross-validation and validation sets so we can tune models without compromising the independence of the test data (but we won’t go there today).
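
As a rough illustration of what a 70/30 split does, the sketch below shuffles row indices with a fixed seed and slices them. It uses only the standard library. The `split_indices` helper and its parameters are hypothetical names for this example, and this is not how scikit-learn implements `train_test_split` internally.

```python
import random

def split_indices(n_rows, test_size=0.3, seed=123):
    """Shuffle row indices and cut them into training and test groups."""
    indices = list(range(n_rows))
    # a fixed seed makes the shuffle repeatable
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    # the first n_test shuffled rows become the test set, the rest train the model
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print("train rows:", sorted(train_idx))
print("test rows:", sorted(test_idx))
```

Every row lands in exactly one of the two groups, so the model never trains on a row it is later tested on.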

![Training and Test Data](https://lh3.googleusercontent.com/proxy/kLwjNB7rdsIHbpQpKdkRLxIF1tD6zep857pMB0HGLE5qCEunPajE0in6FtQkoYwVniZyWPyzVu5YdEI6omPflvOOf-fH2vB4lhfF7pKU0X96Bn5YgVrfv9wX)


In [0]:
# predictor variables - all but column called Outcome
X = diabetes_raw.drop(columns="Outcome")
# target variable
y = diabetes_raw.Outcome

# need subsetting tools 
from sklearn.model_selection import train_test_split
# split into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

print(X_train.head())
print(X_test.head())
print(y_train.head())



<img src="https://i.ya-webdesign.com/images/start-race-runner-icon-png-1.png" width="100" />

# Exercise 2

Did the random split create data sets that are very different? Since we know glucose seems to be related to diabetes diagnosis, check the average glucose value for both X_train and X_test. Are the two pretty close together (say, within a few points)?

In [0]:
# check the average glucose values for your training and test sets - remember how we saw stats for a data frame earlier
# write your code in this box below

# Build a Logistic Regression Model

In [0]:
# import Logistic Regression classifier
from sklearn.linear_model import LogisticRegression
# create new instance of the classifier
logistic_regresssion = LogisticRegression(max_iter=200)
# fit the model to the training data
logistic_regresssion.fit(X_train,y_train)
# check the model output
print(X_train.columns)
print(logistic_regresssion.coef_)
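
Each coefficient printed above is a weight on the log-odds scale: a positive weight pushes the predicted probability of diabetes up as that variable increases. A minimal sketch of how a weighted sum becomes a probability through the logistic (sigmoid) function follows. The weights, intercept, and inputs are made-up numbers for illustration, not output from the fitted model.

```python
import math

def sigmoid(z):
    """Map any real number (log-odds) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical weights for two predictors, plus a hypothetical intercept
weights = [0.03, 0.09]
intercept = -7.0
patient = [141, 31.4]  # e.g. a glucose value and a BMI value

# weighted sum of the inputs = log-odds of the positive class
log_odds = intercept + sum(w * x for w, x in zip(weights, patient))
probability = sigmoid(log_odds)
print("log-odds:", round(log_odds, 3))
print("probability of class 1:", round(probability, 3))
```

Log-odds of 0 correspond to a probability of exactly 0.5, which is why a positive weighted sum leads to a prediction of class 1.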

# Make Predictions and Evaluate Your Model

In [0]:
# make predictions
y_pred = logistic_regresssion.predict(X_test)
print(y_pred)

In [0]:
# create confusion matrix
from sklearn import metrics
import matplotlib.pyplot as plt
conf_matrix=metrics.confusion_matrix(y_test, y_pred)
sn.heatmap(conf_matrix, annot=True)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Remember: 1 = diabetes, 0 = no diabetes
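
To read the heatmap, it helps to know the layout: with labels 0 and 1, rows are actual classes and columns are predicted classes, so the cells are [[true negatives, false positives], [false negatives, true positives]]. The sketch below rebuilds that table by hand from two short, made-up label lists (hypothetical data, not the diabetes test set).

```python
# made-up actual and predicted labels (1 = diabetes, 0 = no diabetes)
actual    = [0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 1, 1, 0, 1, 0, 1, 0]

# rows = actual class, columns = predicted class
matrix = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    matrix[a][p] += 1

tn, fp = matrix[0]
fn, tp = matrix[1]
print(matrix)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```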

# Accuracy Metrics

**Accuracy:** How often were you right?

**Precision:** When you predicted your target class (diabetes), how often were you right?

**Recall:** Of all the examples of your target class (diabetes), how many did you find?

**F1 Score:** Harmonic mean of precision and recall - a kind of "average" that inclines toward the lower number (so you raise it by balancing precision and recall).
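
To make the definitions concrete, here is a worked example that computes all four metrics from confusion-matrix counts. The counts are hypothetical, chosen for easy arithmetic rather than taken from the diabetes model.

```python
# hypothetical counts from a confusion matrix
tp, fp, fn, tn = 30, 10, 20, 40

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that were right
precision = tp / (tp + fp)                  # share of predicted positives that were right
recall = tp / (tp + fn)                     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", round(f1, 3))
```

Here precision (0.75) and recall (0.6) have an arithmetic mean of 0.675, but the F1 score is about 0.667, slightly closer to the lower of the two.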

In [0]:
# see accuracy metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("F1 Score:",metrics.f1_score(y_test, y_pred))

# Predict Single Example


In [0]:
#             Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	  DiabetesPedigreeFunction	Age
new_case = [  3,          141,    74,           31,           38,     31.4, 0.592,                    48]
import numpy as np
new_case = np.array(new_case)
print(logistic_regresssion.predict(new_case.reshape(1, -1)))


In [0]:
# show probability of 0 (no diabetes) and 1 (diabetes)
print(logistic_regresssion.predict_proba(new_case.reshape(1, -1)))

<img src="https://i.ya-webdesign.com/images/start-race-runner-icon-png-1.png" width="100" />

# Exercise 3

What if the patient above had a glucose level of 81? What prediction would we make for that person? What is their probability of having diabetes?

In [0]:
# make prediction for the same patient if their glucose level were 81

# what is their probability of having diabetes


# Set a Different Threshold

The current model uses a prediction threshold of 0.50. If the predicted probability of class 1 (diabetes) is greater than this, the person is predicted to be in that class.

What if we want fewer false negatives? Lowering the threshold gives higher recall (we find more of the true cases) at the expense of lower precision (we flag more people who do not have diabetes).

In [0]:
new_threshold = 0.45
preds_new_threshold = np.where(logistic_regresssion.predict_proba(X_test)[:,1] > new_threshold, 1, 0)
# see accuracy metrics
print("Accuracy:",metrics.accuracy_score(y_test, preds_new_threshold))
print("Precision:",metrics.precision_score(y_test, preds_new_threshold))
print("Recall:",metrics.recall_score(y_test, preds_new_threshold))
print("F1 Score:",metrics.f1_score(y_test, preds_new_threshold))
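
To see the trade-off across several thresholds at once, the sketch below sweeps three cut-offs over a small set of made-up probabilities and labels (hypothetical data, not the model's output). The `precision_recall` helper is a name introduced for this example.

```python
# made-up true labels and predicted probabilities of class 1
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
probs  = [0.05, 0.15, 0.25, 0.35, 0.40, 0.55, 0.60, 0.70, 0.45, 0.90]

def precision_recall(threshold):
    """Classify as 1 when the probability exceeds the threshold, then score."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    # guard against dividing by zero when nothing is flagged or nothing is positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.30, 0.50, 0.65):
    precision, recall = precision_recall(threshold)
    print(threshold, "precision:", round(precision, 2), "recall:", round(recall, 2))
```

On this toy data, raising the threshold trades recall for precision: recall falls from 1.0 to 0.5 while precision rises from about 0.57 to 1.0.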

<img src="https://i.ya-webdesign.com/images/start-race-runner-icon-png-1.png" width="100" />

# Exercise 4

Imagine that patients who are believed to have diabetes will be put through a rigorous and expensive treatment program, so you want to be very sure that flagged patients really have diabetes. Change your threshold to increase your precision.

In [0]:
# make your predictions with higher precision