# Logistic Regression Code-Along

Work through the code cells below in order to build a logistic regression classifier. We will use the `breast_cancer` toy dataset, which can be loaded directly from scikit-learn.

In [19]:
from sklearn.datasets import load_breast_cancer

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import time

In [20]:
# load dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# view first 5 rows of predictors
X.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [21]:
# view 5 random samples of the target data
y.sample(5)

437    1
248    1
380    1
494    1
379    0
Name: target, dtype: int64

In [None]:
# TODO: select only the "mean radius", "mean texture", and "mean perimeter" predictors for your predictors
X = ...

# TODO: split into test and training sets
X_train, X_test, y_train, y_test = ...

# view first 5 rows of training data
X_train.head()
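
One possible way to fill in the two TODOs above is sketched here. The 80/20 split, `stratify=y`, and `random_state=42` are assumptions of this sketch, not values prescribed by the exercise, so the scores printed later in this notebook may not be reproduced exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# load the full dataset as a DataFrame (predictors) and a Series (target)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# keep only the three predictors named in the exercise
X = X[["mean radius", "mean texture", "mean perimeter"]]

# assumed settings: 20% held out for testing, class balance preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Stratifying keeps the share of each class roughly equal in the training and test sets, which makes the accuracy comparison between the two less sensitive to how the split happens to fall.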

In [None]:
# Randomly search for the best hyperparameters on a logistic regression model
param_dist = {
    'penalty': ['l1', 'l2'],
    'C': np.linspace(0.01, 1, 100),
    'solver': ['saga'], 
    'max_iter': [10000]
}

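# RandomizedSearchCV samples n_iter=10 candidates by default
random_search = RandomizedSearchCV(LogisticRegression(), param_distributions=param_dist, cv=5, scoring='accuracy', random_state=42)

# time the search so it can be compared with GridSearchCV later
start = time.time()
random_search.fit(X_train, y_train)
elapsed_random = time.time() - start

# Best model from random search
best_params_random = random_search.best_params_
best_score_random = random_search.best_score_

print(f"RandomizedSearchCV - Best Params: {best_params_random}")
print(f"RandomizedSearchCV - Cross-Val Accuracy: {best_score_random:.2f}")
print(f"RandomizedSearchCV - Time elapsed: {elapsed_random:.2f} seconds")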

RandomizedSearchCV - Best Params: {'solver': 'saga', 'penalty': 'l1', 'max_iter': 10000, 'C': np.float64(0.8)}
RandomizedSearchCV - Cross-Val Accuracy: 0.89
RandomizedSearchCV - Time elapsed: 8.52 seconds


In [None]:
# Use the best model found from RandomizedSearchCV to predict on unseen test data

# extract the best estimator
best_log = random_search.best_estimator_

# predict on testing data
log_predictions = best_log.predict(X_test)

# evaluate its accuracy
test_score = accuracy_score(y_test, log_predictions)

print(f"RandomizedSearchCV - Coefficients: {best_log.coef_}")
print(f"RandomizedSearchCV - Test Accuracy: {test_score:.2f}")

RandomizedSearchCV - Coefficients: [[ 3.2243442   0.01382344 -0.53723209]]
RandomizedSearchCV - Test Accuracy: 0.89


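The coefficients above correspond, in order, to the predictors "mean radius", "mean texture", and "mean perimeter".

Note that scikit-learn encodes this dataset's target as 0 = malignant and 1 = benign, and the model predicts the log-odds of class 1. A positive coefficient therefore raises the log-odds that a tumor is benign, while a negative coefficient raises the log-odds that it is malignant.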
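
To make the log-odds interpretation concrete, the sketch below exponentiates the coefficients printed above to get odds ratios: the factor by which the odds of the class labeled 1 are multiplied when a predictor rises by one unit and the others are held fixed. The numbers are copied from the output above, and the helper name `odds_ratio` is hypothetical. Because the predictors are unscaled and measured in different units, the magnitudes are not directly comparable across predictors.

```python
import math

# coefficients copied from the RandomizedSearchCV output above
coefficients = {
    "mean radius": 3.2243442,
    "mean texture": 0.01382344,
    "mean perimeter": -0.53723209,
}

def odds_ratio(coef):
    """Multiplicative change in the odds of class 1 per one-unit increase."""
    return math.exp(coef)

for name, coef in coefficients.items():
    direction = "raises" if coef > 0 else "lowers"
    print(f"{name}: odds ratio {odds_ratio(coef):.3f} ({direction} the odds of class 1)")
```

An odds ratio above 1 means the predictor pushes predictions toward class 1, and a ratio below 1 pushes them toward class 0.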

Which predictor seems to indicate a higher probability that a tumor is malignant? 

In [None]:
# Now we will see why we prefer to use RandomizedSearchCV
# Search the grid for the best hyperparameters on a logistic regression model (this will take at least 3 minutes!)
grid_search = GridSearchCV(LogisticRegression(), param_grid=param_dist, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best model from grid search
best_params_grid = grid_search.best_params_
best_score_grid = grid_search.best_score_

print(f"GridSearchCV - Best Params: {best_params_grid}")
print(f"GridSearchCV - Cross-Val Accuracy: {best_score_grid:.2f}")

GridSearchCV - Best Params: {'C': np.float64(0.73), 'max_iter': 10000, 'penalty': 'l1', 'solver': 'saga'}
GridSearchCV - Cross-Val Accuracy: 0.89
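
The run-time gap between the two searches follows from how many models each one fits. Below is a rough count, assuming the `param_dist` grid defined earlier, `cv=5`, and scikit-learn's default of `n_iter=10` for `RandomizedSearchCV`. The final refit on the full training set is ignored.

```python
import math

# number of values listed for each hyperparameter in param_dist
grid_sizes = {"penalty": 2, "C": 100, "solver": 1, "max_iter": 1}
cv_folds = 5
n_iter = 10  # RandomizedSearchCV default number of sampled candidates

grid_candidates = math.prod(grid_sizes.values())  # every combination
grid_fits = grid_candidates * cv_folds
random_fits = n_iter * cv_folds

print(f"GridSearchCV: {grid_candidates} candidates, {grid_fits} fits")
print(f"RandomizedSearchCV: {n_iter} candidates, {random_fits} fits")
print(f"Grid search fits {grid_fits // random_fits}x as many models")
```

Both searches reported the same cross-validated accuracy (0.89) in the outputs above, so the exhaustive search spent the extra fits without a visible gain at two decimal places.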


# Challenge

Create a logistic regression model on the `hotel` dataset. Your target variable will be the `is_canceled` column, and your only predictor will be the `lead_time` column.

After creating your model, extract the best hyperparameters using both `RandomizedSearchCV` and `GridSearchCV`. Evaluate the coefficients and the accuracy obtained from each search method (we will learn more about classification accuracy metrics tomorrow).

Answer the analytical questions listed below as well.

In [None]:
# TODO: Implement the logistic regression model on the `hotel.csv` dataset
...

## Writeup

Answer the analytical questions below using the metrics you've calculated.

1) What accuracy did the model achieve with the hyperparameters selected by `GridSearchCV`? Was it any different from the accuracy with those selected by `RandomizedSearchCV`? Did the longer run-time of `GridSearchCV` produce a meaningful increase in accuracy?

...

2) According to your coefficients, how did `lead_time` influence the probability that a booking would be canceled?

...
