# Objective 04 - Use scikit-learn to fit and interpret logistic regression models
## Overview
So far, we've looked at the function and coefficients used to fit a logistic regression. In this objective, we'll go into more detail about how to use the scikit-learn `LogisticRegression` predictor. We'll fit the model first with one feature and then with both features in the dataset, and look at how to interpret the results.

## Follow Along
Let's load the geyser dataset we used earlier and go through the steps to fit a logistic regression model.

In [1]:
# Import seaborn and load the data
import seaborn as sns

geyser = sns.load_dataset("geyser")

# Convert target labels to 0 or 1

# Import the label encoder and instantiate
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Create a new column with 0=long and 1=short class labels
geyser['kind_binary'] = le.fit_transform(geyser['kind'])
display(geyser.head())

   duration  waiting   kind  kind_binary
0     3.600       79   long            0
1     1.800       54  short            1
2     3.333       74   long            0
3     2.283       62  short            1
4     4.533       85   long            0


Now that we have our geyser class encoded, we can follow the usual model fitting procedure. First, we create the feature matrix and target array. Then we import the `LogisticRegression` model, instantiate the predictor class, and fit the model.

In [3]:
# Import logistic regression predictor
from sklearn.linear_model import LogisticRegression

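# Prepare the feature matrix (we'll begin with one feature)
import numpy as np

# Convert the Series to a NumPy array before adding a column axis;
# multi-dimensional indexing on a pandas Series (x[:, np.newaxis]) is deprecated
x = geyser['duration'].to_numpy()
X = x[:, np.newaxis]

# Assign the target variable to y
y = geyser['kind_binary']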

# Fit the model using the default parameters
model = LogisticRegression()
model.fit(X, y)


LogisticRegression()

In [4]:
# Import the cross validation method
from sklearn.model_selection import cross_val_score

# Run a cross-validation with k=5 once and reuse the scores
cv_scores = cross_val_score(model, X, y, cv=5)
print(cv_scores)

# Calculate the mean of the cross-validation scores
score_mean = cv_scores.mean()
print('The mean CV score is: ', score_mean)

[0.94545455 1.         1.         0.94444444 1.        ]
The mean CV score is:  0.977979797979798
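
It may help to see what `cv=5` does under the hood. For a classifier and an integer `cv`, scikit-learn splits the data with stratified k-fold and scores each held-out fold with the estimator's default `score` method, which is accuracy for `LogisticRegression`. The sketch below reproduces that loop by hand. To stay self-contained it uses a small synthetic stand-in dataset (an assumption for illustration, not the geyser measurements), and the names `X_syn`, `y_syn`, and `cv_model` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (NOT the geyser measurements): two 1-D clusters
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(2.0, 0.5, 60), rng.normal(4.0, 0.5, 90)])
X_syn = values[:, np.newaxis]
y_syn = np.array([1] * 60 + [0] * 90)

cv_model = LogisticRegression()

# Manual 5-fold loop: fit a fresh copy on four folds, score on the fifth
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_syn, y_syn):
    fold_model = clone(cv_model)
    fold_model.fit(X_syn[train_idx], y_syn[train_idx])
    manual_scores.append(fold_model.score(X_syn[test_idx], y_syn[test_idx]))
manual_scores = np.array(manual_scores)

# The library call performs the same splitting, fitting, and scoring
library_scores = cross_val_score(cv_model, X_syn, y_syn, cv=5)

print(manual_scores)
print(library_scores)
```

Because `cross_val_score` fits a fresh clone of the estimator on each fold, the earlier `model.fit(X, y)` call does not influence the cross-validation scores.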


This model is quite accurate: the mean cross-validation accuracy is about 98%. Recall from earlier in this module that our baseline accuracy was 63%, so the model improves on the baseline by roughly 35 percentage points.
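
As a reminder of where a majority-class baseline comes from, here is a minimal sketch using scikit-learn's `DummyClassifier`. The class counts (172 long, 100 short) are an assumption about the seaborn geyser data rather than something shown in this notebook; in the notebook itself you would pass `geyser['kind_binary']` directly instead of building `y_counts`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Assumed class counts for the geyser data: 172 "long" (0) and 100 "short" (1)
y_counts = np.array([0] * 172 + [1] * 100)

# The dummy classifier ignores the features, so a placeholder column is enough
X_placeholder = np.zeros((len(y_counts), 1))

# Always predict the most frequent class and measure the resulting accuracy
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_counts)
baseline_accuracy = baseline.score(X_placeholder, y_counts)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Under those assumed counts the baseline is 172/272, or about 0.63, which is the figure any fitted model needs to beat.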

Now, how much can we improve the model by adding a second feature? We still have the `waiting` column, which is the amount of time that passes between eruptions. Let's add that feature to the feature matrix, fit the model, and calculate the cross-validation score.

In [5]:
# Create new feature matrix
features = ['duration', 'waiting']
X_two = geyser[features]

# Fit the model using the default parameters
model_two = LogisticRegression()
model_two.fit(X_two, y)

# Run a cross-validation with k=5 once and reuse the scores
cv_scores_two = cross_val_score(model_two, X_two, y, cv=5)
print(cv_scores_two)

# Calculate the mean of the cross-validation scores
score_mean = cv_scores_two.mean()
print('The mean CV score is (two features): ', score_mean)

[1. 1. 1. 1. 1.]
The mean CV score is (two features):  1.0


The two-feature model scores perfectly on every cross-validation fold. This is likely because the two classes are very clearly separated. It's important to remember that not all datasets will be so easy to model with such accurate results!
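
To interpret a fitted logistic regression, we can read its parameters from the `coef_` and `intercept_` attributes. The sketch below is a hedged illustration, not the notebook's own analysis: to stay self-contained it fits a model on only the five rows shown in the `head()` output above, so the numbers it prints will differ from those of `model` fitted on the full data. The names `sample`, `demo_model`, and `new_point` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The five rows shown in the head() output above (0 = long, 1 = short)
sample = pd.DataFrame({
    "duration": [3.6, 1.8, 3.333, 2.283, 4.533],
    "waiting": [79, 54, 74, 62, 85],
    "kind_binary": [0, 1, 0, 1, 0],
})

demo_model = LogisticRegression()
demo_model.fit(sample[["duration"]], sample["kind_binary"])

coef = demo_model.coef_[0][0]
intercept = demo_model.intercept_[0]

# exp(coef) is the multiplicative change in the odds of class 1 (short)
# for each one-unit increase in duration
odds_ratio = np.exp(coef)

# Decision boundary: the duration where P(class 1) = 0.5,
# that is, where coef * duration + intercept = 0
boundary = -intercept / coef

print(f"coefficient: {coef:.3f}, intercept: {intercept:.3f}")
print(f"odds ratio per unit of duration: {odds_ratio:.3f}")
print(f"decision boundary at duration = {boundary:.3f}")

# Predicted probabilities, ordered as [class 0, class 1], for a new duration
new_point = pd.DataFrame({"duration": [2.0]})
proba = demo_model.predict_proba(new_point)
print(proba)
```

A negative coefficient here means that longer eruptions lower the odds of the short class, which matches the encoding where 0 is long and 1 is short. The same attributes are available on `model` and `model_two`; with two features, `coef_` holds one coefficient per feature.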

## Challenge
For this challenge, try to plot the two features on the same plot. Instead of a plot with the feature on the x-axis and the class on the y-axis, plot one feature on each axis. Are the two classes distinct as visualized on the plot? If you use two different colors for the classes, there should be a clear division between them. Think about where you would draw the decision boundary.

## Additional Resources
- [Scikit-learn: Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)