# Logistic Regression Model with Real Data

# Predicting Diabetes

In [1]:
# Import our Dependencies
from pathlib import Path
import pandas as pd

In [2]:
# Load our data from the CSV file
data = Path('../Resources/diabetes.csv')
df = pd.read_csv(data)
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


We can see from the preview of the DataFrame that multiple variables (also called features), such as the number of previous pregnancies, blood glucose level, and age, can be used to predict the outcome: whether a person has diabetes (1) or does not have diabetes (0).

A common task in machine learning is data preparation. In previous examples, we assigned the label X to the input variables and used them to predict y, the output. With this diabetes dataset, we need to separate the features from the target. We can do so by splitting the Outcome column off from the other columns.

## Separate the Features (X) from the Target (y)
The terms features and variables are synonymous, as are target and output.

In [3]:
y = df["Outcome"]
X = df.drop(columns="Outcome")

The Outcome column is defined as y, or the target.

X, or features, is created by dropping the Outcome column from the DataFrame.
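
As a quick check of what this separation produces, here is a minimal sketch on a tiny stand-in DataFrame. The three rows are copied from the preview above and only a few columns are kept; the names toy, X_toy, and y_toy are hypothetical and used only for illustration.

```python
import pandas as pd

# Tiny stand-in for the diabetes DataFrame (three rows from the preview above).
toy = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Age": [50, 31, 32],
    "Outcome": [1, 0, 1],
})

y_toy = toy["Outcome"]               # target: a Series with one value per row
X_toy = toy.drop(columns="Outcome")  # features: every column except the target

print(X_toy.shape)                 # (3, 3)
print(y_toy.shape)                 # (3,)
print("Outcome" in X_toy.columns)  # False
```

Note that drop returns a new DataFrame, so the original still contains the Outcome column.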

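## Split Our Data into Training and Testing Sets
By default, train_test_split holds out 25% of the rows for testing and keeps the rest for training. The stratify=y argument keeps the proportion of each outcome class roughly the same in both sets. Examining the shape of the training set with X_train.shape returns (576, 8), meaning that there are 576 samples (rows) and eight features (columns).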

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state=1, 
                                                    stratify=y)
X_train.shape
# There are 576 samples (rows) and eight features (columns)

(576, 8)
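
To see what the default split size and stratify do, the following sketch runs train_test_split on synthetic labels rather than the diabetes data. The arrays X_demo and y_demo are made up for illustration: 100 samples, 40% of them positive.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 100 samples, 2 features, 40% positives.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# With the default test_size of 0.25, 75 rows go to training and 25 to testing.
print(X_tr.shape, X_te.shape)

# Stratification keeps the share of positives the same in both sets.
print(y_tr.mean(), y_te.mean())
```

Without stratify, the share of positives in each set would vary with the random shuffle, which matters more when one class is rare.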

 ## Create a Logistic Regression Model
 The next step is to create a logistic regression model with the specified arguments for solver, max_iter, and random_state:

In [1]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='lbfgs',
                                max_iter=200, # sets an upper limit on the number of iterations used by the solver
                                random_state=1)

## Fit (Train) Our Model Using the Training Data

In [6]:
classifier.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=1, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Make Predictions
To create predictions for the y-values, we use the X_test set:

In [7]:
y_pred = classifier.predict(X_test) 
results = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results.head(20)

,Prediction,Actual
0,0,0
1,1,1
2,0,0
3,1,1
4,0,0
5,0,0
6,1,1
7,1,0
8,1,1
9,0,0


When the predicted y-values (y_pred) are compared with the actual y-values (y_test), we see that most of the predictions are correct, but there are also some missed predictions, such as row 7 in the output above, where the model predicted 1 but the actual outcome was 0.
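
Each 0 or 1 prediction comes from an estimated probability. The sketch below, which uses synthetic data from make_classification rather than the diabetes dataset, suggests that for a binary problem predict agrees with thresholding the class-1 column of predict_proba at 0.5. The names X_syn, y_syn, and clf are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the diabetes dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=1)

clf = LogisticRegression(solver="lbfgs", max_iter=200, random_state=1)
clf.fit(X_syn, y_syn)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba = clf.predict_proba(X_syn)[:, 1]
labels_from_proba = (proba > 0.5).astype(int)

# predict returns the class labels directly.
labels = clf.predict(X_syn)
print(np.array_equal(labels, labels_from_proba))
```

Looking at the probabilities can help explain a missed prediction: a sample with a probability near 0.5 is one the model is unsure about.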

The final step is to answer an important question: how well does our logistic regression model predict? We do so with sklearn.metrics.accuracy_score:

In [8]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))


0.7760416666666666


The accuracy_score function compares the actual outcome values from the test set (y_test) against the model's predicted values (y_pred). In other words, y_test holds the outcomes (whether or not a woman has diabetes) from the original dataset that were set aside for testing, and y_pred holds the model's predictions for the same rows. The accuracy score is simply the fraction of predictions that are correct. In this case, the score is 0.776, meaning that the model was correct on about 77.6% of the test samples.
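
To make the calculation concrete, the sketch below computes accuracy by hand for the ten rows shown in the results preview above and compares it with accuracy_score. The variable names are hypothetical, and the resulting value applies to these ten rows only, not to the full test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# The ten (Prediction, Actual) pairs shown in the results preview above.
pred_preview = np.array([0, 1, 0, 1, 0, 0, 1, 1, 1, 0])
actual_preview = np.array([0, 1, 0, 1, 0, 0, 1, 0, 1, 0])

# Accuracy is the number of matching rows divided by the total number of rows.
manual_accuracy = (pred_preview == actual_preview).mean()
library_accuracy = accuracy_score(actual_preview, pred_preview)

print(manual_accuracy)   # 0.9 (nine of the ten rows match)
print(library_accuracy)  # 0.9
```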