## Logistic Regression

Let's build a diabetes prediction model.

Here, you will predict diabetes using a logistic regression classifier.

You can find the dataset [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

In [2]:
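# import pandas
import pandas as pd

# column names to use in place of the header row stored in the file
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# load dataset; with header=None the file's own header row is read as data (row 0)
pima = pd.read_csv('data/diabetes.csv', header=None, names=col_names)

# drop that header row; the remaining values are still strings (dtype object)
pima = pima.drop(0)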

In [3]:
pima.head()

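| | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 2 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 3 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 4 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 5 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |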
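
The load step above reads the file's own header row as data and then drops it, which leaves every column as strings (dtype object). Here is a minimal sketch of an alternative, assuming the CSV begins with a header row: passing header=0 lets pandas replace that row with col_names and parse the values as numbers. The inline sample stands in for data/diabetes.csv and reuses the first rows shown above; its header names (c1 to c9) and the name pima_numeric are placeholders.

```python
import io
import pandas as pd

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# Small inline stand-in for data/diabetes.csv (placeholder header + first rows shown above).
sample_csv = io.StringIO(
    "c1,c2,c3,c4,c5,c6,c7,c8,c9\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

# header=0 marks row 0 as a header to be replaced by `names`,
# so no drop(0) is needed and the columns are parsed as numbers.
pima_numeric = pd.read_csv(sample_csv, header=0, names=col_names)

print(pima_numeric.dtypes)
```

With this variant the row index starts at 0 rather than 1, and the labels and predictions would be integers instead of the strings shown in the outputs of this notebook.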


**Selecting Features**

Let's divide the given columns into the target variable and the feature variables.

In [4]:
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

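**Splitting Data**

Remember that we need to split the data into training and testing sets. The call below holds out 25% of the rows for testing (test_size=0.25) and fixes random_state=0 so that the split is reproducible.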

In [5]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
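
To make the effect of test_size=0.25 and random_state concrete, here is a small sketch on synthetic data with 768 rows and 7 features, the same shape as the feature matrix used here. The arrays X_demo and y_demo hold arbitrary values and are not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 768 rows, 7 features, binary target (values are arbitrary).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 7))
y_demo = rng.integers(0, 2, size=768)

split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 576 training rows, 192 test rows

# The same random_state yields the identical split on every call.
same_split = np.array_equal(X_te, split_b[1])
print(same_split)
```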

**Model Development and Prediction**

First, import the LogisticRegression class and create a classifier object by calling LogisticRegression().

Then, fit the model on the training set using fit() and make predictions on the test set using predict().

In [6]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train, y_train)

# predict class labels for the test set
y_pred = logreg.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
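
The message above is the tail of scikit-learn's convergence warning: the solver hit its iteration limit before converging. Here is a minimal sketch of the two remedies it suggests, scaling the features and raising max_iter, shown on synthetic data rather than the diabetes set. The names scaled_logreg, X_demo and y_demo and the value max_iter=1000 are illustrative assumptions, not settings from this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (7 features, binary target).
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=0)

# Standardize each feature, then fit logistic regression with a higher iteration cap.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver actually used.
iterations_used = scaled_logreg.named_steps['logisticregression'].n_iter_[0]
print(iterations_used)
```

In this notebook, the pipeline would be fitted with X_train and y_train in place of the synthetic arrays, and then used for predict() and score() exactly like logreg.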


In [7]:
y_pred

array(['1', '0', '0', '1', '0', '0', '1', '1', '0', '0', '1', '1', '0',
       '0', '0', '0', '1', '0', '0', '0', '1', '0', '0', '0', '0', '0',
       '0', '1', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0',
       '1', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '0',
       '1', '0', '0', '0', '0', '1', '0', '0', '1', '0', '0', '1', '1',
       '1', '1', '0', '0', '0', '0', '0', '0', '1', '1', '0', '0', '1',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0',
       '0', '0', '0', '1', '0', '0', '1', '1', '0', '0', '0', '0', '0',
       '1', '0', '0', '0', '0', '1', '0', '0', '1', '0', '1', '1', '0',
       '1', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0',
       '0', '0', '0', '1', '0', '0', '0', '0', '1', '0', '0', '1', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '1', '0',
       '1', '0', '1', '1', '1', '1', '0', '0', '1', '0', '0', '0', '0',
       '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0

In [8]:
# class probabilities: column 0 is P(label '0'), column 1 is P(label '1')
y_proba = logreg.predict_proba(X_test)
y_proba

array([[0.04919636, 0.95080364],
       [0.83865688, 0.16134312],
       [0.89274456, 0.10725544],
       [0.37371214, 0.62628786],
       [0.87265026, 0.12734974],
       [0.96215065, 0.03784935],
       [0.25919649, 0.74080351],
       [0.14772333, 0.85227667],
       [0.54048289, 0.45951711],
       [0.59484374, 0.40515626],
       [0.428596  , 0.571404  ],
       [0.03678218, 0.96321782],
       [0.69155128, 0.30844872],
       [0.76780816, 0.23219184],
       [0.86179699, 0.13820301],
       [0.83939765, 0.16060235],
       [0.16179634, 0.83820366],
       [0.9727441 , 0.0272559 ],
       [0.60589102, 0.39410898],
       [0.73481005, 0.26518995],
       [0.37117664, 0.62882336],
       [0.5567617 , 0.4432383 ],
       [0.75382506, 0.24617494],
       [0.93081654, 0.06918346],
       [0.9458559 , 0.0541441 ],
       [0.66514475, 0.33485525],
       [0.95522462, 0.04477538],
       [0.08841859, 0.91158141],
       [0.90158063, 0.09841937],
       [0.8605942 , 0.1394058 ],
       [0.
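
For a binary model, predict() returns the class with the larger probability, which amounts to thresholding the second column of predict_proba at 0.5. Here is a small check using the first five probability rows printed above (NumPy only; the variable names are illustrative):

```python
import numpy as np

# First five rows of y_proba as printed above: [P(class '0'), P(class '1')].
proba_head = np.array([
    [0.04919636, 0.95080364],
    [0.83865688, 0.16134312],
    [0.89274456, 0.10725544],
    [0.37371214, 0.62628786],
    [0.87265026, 0.12734974],
])
classes = np.array(['0', '1'])  # label order used for the columns

# Pick the more probable class in each row.
labels_from_proba = classes[proba_head.argmax(axis=1)]
print(labels_from_proba)  # ['1' '0' '0' '1' '0']
```

The result agrees with the first five entries of y_pred shown earlier.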

In [9]:
logreg.score(X_train, y_train)

0.7673611111111112

In [10]:
logreg.score(X_test, y_test)

0.8072916666666666

In [11]:
X_test.shape

(192, 7)

In [12]:
y_test

662    1
123    0
114    0
15     1
530    0
      ..
367    1
302    1
383    0
141    0
464    0
Name: label, Length: 192, dtype: object

In [15]:
y_df = pd.DataFrame(y_test)


In [16]:
y_df

| | label |
|---|---|
| 662 | 1 |
| 123 | 0 |
| 114 | 0 |
| 15 | 1 |
| 530 | 0 |
| ... | ... |
| 367 | 1 |
| 302 | 1 |
| 383 | 0 |
| 141 | 0 |


In [17]:
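# attach the true labels to the test features by row index,
# then move the original index into a column so rows are numbered 0..191
X_df = X_test.merge(y_df, left_index=True, right_index=True).reset_index()
X_df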

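| | index | pregnant | insulin | bmi | age | glucose | bp | pedigree | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 662 | 1 | 0 | 42.9 | 22 | 199 | 76 | 1.394 | 1 |
| 1 | 123 | 2 | 100 | 33.6 | 23 | 107 | 74 | 0.404 | 0 |
| 2 | 114 | 4 | 0 | 34 | 25 | 76 | 62 | 0.391 | 0 |
| 3 | 15 | 5 | 175 | 25.8 | 51 | 166 | 72 | 0.587 | 1 |
| 4 | 530 | 0 | 0 | 24.6 | 31 | 111 | 65 | 0.66 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 187 | 367 | 6 | 0 | 27.6 | 29 | 124 | 72 | 0.368 | 1 |
| 188 | 302 | 2 | 135 | 31.6 | 25 | 144 | 58 | 0.422 | 1 |
| 189 | 383 | 1 | 182 | 25.4 | 21 | 109 | 60 | 0.947 | 0 |
| 190 | 141 | 3 | 0 | 21.1 | 55 | 128 | 78 | 0.268 | 0 |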


In [22]:
# predictions as a one-column frame; its default index 0..191 follows the row order of X_test
y_df2 = pd.DataFrame(y_pred, columns=['Predictions'])


In [23]:
# both frames are indexed 0..191, so merging on the index pairs rows by position
X_df = X_df.merge(y_df2, left_index=True, right_index=True)

In [29]:
# keep only the probability of the positive class (second column of y_proba)
y_df3 = pd.DataFrame(y_proba[:, 1], columns=['Proba'])
y_df3.shape

(192, 1)

In [30]:
X_df = X_df.merge(y_df3, left_index=True, right_index=True)
X_df

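| | index | pregnant | insulin | bmi | age | glucose | bp | pedigree | label | Predictions | Proba |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 662 | 1 | 0 | 42.9 | 22 | 199 | 76 | 1.394 | 1 | 1 | 0.950804 |
| 1 | 123 | 2 | 100 | 33.6 | 23 | 107 | 74 | 0.404 | 0 | 0 | 0.161343 |
| 2 | 114 | 4 | 0 | 34 | 25 | 76 | 62 | 0.391 | 0 | 0 | 0.107255 |
| 3 | 15 | 5 | 175 | 25.8 | 51 | 166 | 72 | 0.587 | 1 | 1 | 0.626288 |
| 4 | 530 | 0 | 0 | 24.6 | 31 | 111 | 65 | 0.66 | 0 | 0 | 0.127350 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 187 | 367 | 6 | 0 | 27.6 | 29 | 124 | 72 | 0.368 | 1 | 0 | 0.239255 |
| 188 | 302 | 2 | 135 | 31.6 | 25 | 144 | 58 | 0.422 | 1 | 0 | 0.397294 |
| 189 | 383 | 1 | 182 | 25.4 | 21 | 109 | 60 | 0.947 | 0 | 0 | 0.145913 |
| 190 | 141 | 3 | 0 | 21.1 | 55 | 128 | 78 | 0.268 | 0 | 0 | 0.180799 |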
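
The three merges above can also be collapsed into a single step. Here is a minimal sketch, assuming the predictions and probabilities are in the same row order as X_test (which is how predict and predict_proba return them). It uses a three-row stand-in with values taken from the table above; the names ending in _demo and the name results are illustrative.

```python
import numpy as np
import pandas as pd

# Three-row stand-in for X_test / y_test, keeping the original row labels.
X_test_demo = pd.DataFrame(
    {'glucose': [199, 107, 76], 'bmi': [42.9, 33.6, 34.0]},
    index=[662, 123, 114],
)
y_test_demo = pd.Series(['1', '0', '0'], index=[662, 123, 114], name='label')
y_pred_demo = np.array(['1', '0', '0'])
proba_demo = np.array([0.950804, 0.161343, 0.107255])

# assign() aligns the Series by index and places the arrays positionally.
results = (
    X_test_demo
    .assign(label=y_test_demo, Predictions=y_pred_demo, Proba=proba_demo)
    .reset_index()
)
print(results)
```

Because nothing is merged on a regenerated index, this form does not depend on reset_index() having been called before the predictions are attached.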


**Model Evaluation using Confusion Matrix**  
A *confusion matrix* is a table used to evaluate the performance of a classification model. It sums up the number of correct and incorrect predictions for each class, and it can also be visualized. In scikit-learn's output, rows correspond to the true classes and columns to the predicted classes.

In [20]:
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

array([[117,  13],
       [ 24,  38]])
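
To connect this matrix to the test score reported earlier: the true classes are in the rows and the predicted classes in the columns, so the diagonal holds the correct predictions. Here is a short sketch of that arithmetic with the counts above (plain Python; the variable names are illustrative):

```python
# Confusion matrix from above: rows = true class, columns = predicted class.
cnf = [[117, 13],
       [24, 38]]

correct = cnf[0][0] + cnf[1][1]        # predictions on the diagonal
total = sum(sum(row) for row in cnf)   # all test samples

accuracy = correct / total
print(correct, total, round(accuracy, 4))  # 155 192 0.8073
```

The total matches the 192 rows of X_test, and the accuracy matches the value returned by logreg.score(X_test, y_test).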

*There are more metrics in the tutorial. You do not need to work through them for now, as we will cover them next week.*