
# Logistic Regression Sprint Challenge

Objectives:
* Create a training set and train a Logistic Regression model with it
* Predict values for $\hat{y}$ using a test set
* Calculate the sum of squared errors $SSE(y,\hat{y})$
* Calculate the error rate of a model as a percentage

Dataset: https://www.dropbox.com/s/bnwfu81bjpf22hp/logistic_regression.csv?raw=1

### 1. Compute Logistic Regression Model

Create a training set and train a Logistic Regression model with it.

In [0]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [0]:
data = pd.read_csv('https://www.dropbox.com/s/bnwfu81bjpf22hp/logistic_regression.csv?raw=1')

# Be sure to check the data.
# This dataset has a dummy column that doubles the indices, so we drop that column.
data = data.drop('Unnamed: 0', axis=1)
data.head()

Unnamed: 0,x1,x2,y
0,2.903104,3.281307,0.0
1,3.838055,2.758941,0.0
2,1.407508,1.485069,0.0
3,0.332565,1.473001,0.0
4,2.756526,2.390291,0.0
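Where the `Unnamed: 0` column comes from can be sketched with the five rows shown above. This assumes the CSV stores the row index in a first column with an empty header, which is the usual reason pandas invents that name; the `csv_text` string is a hypothetical stand-in for the real file. Reading with `index_col=0` is an alternative to dropping the column afterwards.

```python
import io
import pandas as pd

# Hypothetical stand-in for the file: the five rows shown above,
# with the index stored in a first column that has no header.
csv_text = """,x1,x2,y
0,2.903104,3.281307,0.0
1,3.838055,2.758941,0.0
2,1.407508,1.485069,0.0
3,0.332565,1.473001,0.0
4,2.756526,2.390291,0.0
"""

# Default read: the headerless first column becomes a regular column named 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
dropped = raw.drop('Unnamed: 0', axis=1)

# Alternative: tell pandas up front that the first column is the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

Both routes leave the same three columns (`x1`, `x2`, `y`), so either is fine for the modeling steps.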


In [0]:
# Use train_test_split to create a training set

# We can wrangle the data into the shape train_test_split needs in several ways. Here are two:*

# Way One: NumPy arrays (as_matrix() was removed in pandas 1.0, so use .values)
X = data[['x1', 'x2']].values
Y = data['y'].values

# Way Two:
X = data.drop('y', axis=1)
Y = data.y

# Now we can split the data. test_size sets the fraction of the data held out for testing,
# and random_state fixes the random seed so our results are reproducible.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# *not an exhaustive list...
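The effect of `random_state` can be checked with a small sketch. The 40-row array below is hypothetical stand-in data, not the challenge dataset, and the names `X_demo`, `y_demo`, and `split_*` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 40 rows, two features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float)

# Two splits with the same seed select exactly the same rows.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
same_seed_matches = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# A different seed will generally select different rows.
split_c = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)
different_seed_matches = np.array_equal(split_a[1], split_c[1])

print('Test rows:', len(split_a[1]), 'of', len(X_demo))
print('Same seed, same split:', same_seed_matches)
print('Different seed, same split:', different_seed_matches)
```

Each call returns the four pieces in the order `x_train, x_test, y_train, y_test`, which is why `split_a[1]` is the test feature array.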

In [0]:
# Create and train (fit) the model
model = LogisticRegression()
model.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### 2. Predict values for $\hat{y}$ for the test set

In [0]:
# Using the model we just trained, we predict classifications for the test data.
y_hat = model.predict(x_test)
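For a binary problem, the hard labels from `predict` correspond to thresholding the class-1 probability from `predict_proba` at 0.5. The sketch below illustrates this on hypothetical stand-in data (two Gaussian clusters), not on the challenge dataset; `demo_model`, `X_demo`, and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: two Gaussian clusters labelled 0.0 and 1.0.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(1.5, 1.0, size=(20, 2)),
                    rng.normal(4.5, 1.0, size=(20, 2))])
y_demo = np.array([0.0] * 20 + [1.0] * 20)

demo_model = LogisticRegression()
demo_model.fit(X_demo, y_demo)

# predict() returns hard class labels...
labels = demo_model.predict(X_demo)

# ...while predict_proba() returns one probability per class (columns follow classes_).
proba_class_1 = demo_model.predict_proba(X_demo)[:, 1]

# Thresholding the class-1 probability at 0.5 reproduces the labels.
labels_from_proba = demo_model.classes_[(proba_class_1 > 0.5).astype(int)]

print(np.array_equal(labels, labels_from_proba))
```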

### 3. Calculate SSE for the test-set

To calculate this, we use the definition for the sum of squared errors:

$\qquad \sum (y_i-\hat{y}_i)^2$

Because every $y_i$ and $\hat{y}_i$ is either 0 or 1, each squared difference is 1 for a misclassified point and 0 otherwise. The SSE therefore simply counts the number of times our trained logit model misclassifies a point in the test data.
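That the SSE of 0/1 labels equals the number of misclassified points can be verified on a toy example. The label arrays here are hypothetical and are not taken from the challenge data.

```python
import numpy as np

# Hypothetical labels, not the challenge data.
y_true = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0])

sse_demo = np.sum((y_true - y_pred) ** 2)   # sum of squared errors
n_wrong = np.sum(y_true != y_pred)          # number of misclassified points

print(sse_demo, n_wrong)  # prints 2.0 2
```

The two predictions that disagree with the true labels each contribute $(\pm 1)^2 = 1$ to the sum; the rest contribute 0.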

In [0]:
# Calculate SSE
sse = np.sum(np.power((y_hat - y_test), 2))

# Print SSE
print('SSE:', sse)

SSE: 1.0


### 4. Calculate the error rate of the model as a percentage.

In [0]:
# Print the number of points in the test set
print(len(y_test))

# Calculate the error rate as a fraction of the test set
error = sse / len(y_test)

# Print percentage
print('Percent error: {:.0f}%'.format(error*100))

12
Percent error: 8%


The percent error is another way of saying that our logit classifier misclassified 1 of the 12 points in the test set: $1/12 \approx 8.3\%$, which rounds to 8%.
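The same error rate can also be read off as one minus the accuracy. A minimal sketch, assuming a hypothetical 12-point test set with exactly one mistake (the arrays below are made up to mirror the counts above, not the actual predictions), and using scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical test set of 12 labels with exactly one misclassification.
y_true = np.array([0.0] * 6 + [1.0] * 6)
y_pred = y_true.copy()
y_pred[3] = 1.0

# Error rate from the SSE, as in the cell above.
error_from_sse = np.sum((y_true - y_pred) ** 2) / len(y_true)

# Error rate as the complement of accuracy.
error_from_accuracy = 1 - accuracy_score(y_true, y_pred)

print('Percent error: {:.0f}%'.format(error_from_sse * 100))  # prints Percent error: 8%
```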