# Logistic Regression

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_theme()  # seaborn's default styling; the 'seaborn' style name is no longer available in recent matplotlib

from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    confusion_matrix, 
    classification_report, 
    roc_auc_score, 
    roc_curve,
)
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

<b> Goals </b>

- Build a logistic regression classification model using the scikit-learn library
- Describe the sigmoid function, odds, and odds ratios as well as how they relate to logistic regression
- Evaluate a model using metrics such as classification accuracy/error, the confusion matrix, ROC curves / AUC, and loss functions

## Overview

- Logistic regression adapts the linear regression model to classification problems
- Very popular because it's fast and interpretable
- Less prone to overfitting when the number of features is small relative to the number of observations
- In linear regression, we use a set of quantitative feature variables to predict a continuous response variable. In logistic regression, we use a set of quantitative feature variables to predict probabilities of class membership.
- Named for the function used at the core of the method, the logistic function (aka the sigmoid function)
- Logistic regression fits a linear model between our features, X, and the log-odds of an observation belonging to a certain class, which we will call the positive (true) class

### Pros:

- Highly interpretable
- Model training and prediction are fast
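- Little tuning is required (most of the time)
- Features don't need scaling for an unregularized model (scikit-learn applies L2 regularization by default, in which case scaling can matter)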
- Can perform well with a small number of observations
- Outputs well-calibrated predicted probabilities

### Cons:

- Presumes a linear relationship between the features and the log-odds of the response
- Performance is (generally) not competitive with the best supervised learning methods
- Sensitive to irrelevant features

### Logit Formula:
![w](http://faculty.cas.usf.edu/mbrannick/regression/gifs/lo8.gif)

$a$ = intercept <br>
$b$ = coefficient value
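
In case the image above does not load, the standard single-feature form of the logit model can be written with the $a$ and $b$ defined above. Here $p$ is an assumed symbol for the probability of the positive class and $x$ is the feature value:

$$\ln\left(\frac{p}{1 - p}\right) = a + bx \quad\Longleftrightarrow\quad p = \frac{1}{1 + e^{-(a + bx)}}$$

The left-hand form says the log-odds are linear in $x$; the right-hand form is the same relationship solved for $p$, which is the S-shaped logistic function.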

### Logit Model:
![logit](https://camo.githubusercontent.com/0b115390d4832bfca4c423d6b9c3acdaa1ff01b3/68747470733a2f2f7170682e65632e71756f726163646e2e6e65742f6d61696e2d71696d672d3035656463313837336430313033653336303634383632613435353636646261)

The graph above shows how the logistic function maps a continuous input, x, to a smooth probability curve: it starts near probability 0 on the left and, as x increases, rises smoothly toward probability 1.


In other words:

- Logistic regression outputs the probability that a specific class is true
- Those probabilities can be converted into class predictions: for example, if $p \geq 0.5$, the model returns 1 and if $p < 0.5$, it returns 0
- The logistic function is S-shaped and always produces values greater than 0 and less than 1
- As you know, not all relationships are linear, so logistic regression is not always the right model
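
The points above can be sketched in a few lines. This is a minimal illustration, not part of the lesson's dataset: the `log_odds` values are made up, and `sigmoid` is a helper name chosen here.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical log-odds values (a + b*x) for five observations
log_odds = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(log_odds)

# Convert probabilities into class predictions with a 0.5 threshold
labels = (probs >= 0.5).astype(int)

print(probs.round(3))
print(labels)
```

A log-odds of exactly 0 gives a probability of 0.5, which is why the 0.5 threshold corresponds to the point where the linear part of the model changes sign.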

### Key difference in use of coefficients in linear regression vs. logistic regression


**Linear Regression:** Betas / coefficients represent the change in the **response variable** for a unit change in x

**Logistic Regression:** Betas / coefficients represent the change in the **log-odds** of the response for a unit change in x
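
To see why, assume the single-feature model in which the log-odds equal $a + bx$, with $a$ the intercept and $b$ the coefficient, and write $p(x)$ (a symbol introduced here) for the predicted probability at $x$. Increasing $x$ by one unit changes the log-odds by exactly $b$:

$$\ln\left(\frac{p(x+1)}{1 - p(x+1)}\right) - \ln\left(\frac{p(x)}{1 - p(x)}\right) = \big(a + b(x+1)\big) - \big(a + bx\big) = b$$

The change in the probability itself is not constant: it depends on where on the S-curve $x$ sits.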

## Modeling Time!

In [None]:
# Use sklearn to create fake data
data = make_classification(n_samples=800,
                           n_features=2,
                           class_sep=.89,
                           n_informative=2,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=42)

df = pd.DataFrame(data[0], columns=['feature_one', 'feature_two'])
df['target'] = data[1]

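# Assign red to class 0 and blue to class 1 (for plotting purposes)
color = df['target'].map({0: 'red', 1: 'blue'})

# Draw a scatter plot
plt.scatter(df['feature_one'], df['feature_two'], c=color, alpha=0.5)
plt.xlabel('Feature One')
plt.ylabel('Feature Two')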

**If you had to draw a straight line that best separates the two classes, where would you put the line?**


Let's focus on feature_two and plot it against the target variable.

In [None]:
# Draw a scatter plot of feature_two vs. target
plt.scatter(df['feature_two'], df['target'], alpha=0.3)


**Imagine a logit (or S-curve) modeling the relationship between the x and y axes.**

Let's fit a logistic regression model on the data above and plot the predicted labels and the probabilities

In [None]:
# Assign X and y
features = ['feature_two']
target = 'target'
X = df[features]
y = df[target]

In [None]:
# Initialize and fit the logistic regression model
lr = LogisticRegression().fit(X, y)


In [None]:
# Score the model 
score = lr.score(X, y)
print("The accuracy score is {:.1f}%.".format(score*100))

In [None]:
# Generate label predictions
lr.predict(X)


Plot the probabilities and the predictions.

In [None]:
# Assign predictions to pred_labels
pred_labels = lr.predict(X)


In [None]:
# Assign probability of class 1 to pred_probs
pred_probs = lr.predict_proba(X)[:, 1]

In [None]:
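# Combine X, y, pred_labels and pred_probs together (sorted so the line plots draw left to right)
results = X.assign(target=y, pred_labels=pred_labels, pred_probs=pred_probs).sort_values('feature_two')

In [None]:
# Plot the data
ax = plt.gca()
plt.xlabel('Feature Two')
plt.ylabel('Target')

# Plot feature_two vs. target as a scatter plot
ax.scatter(results['feature_two'], results['target'], alpha=0.2, label='Target')

# Plot feature_two vs. labels as a line plot
ax.plot(results['feature_two'], results['pred_labels'], label='Predicted label')

# Plot feature_two vs pred_probs as a line plot
ax.plot(results['feature_two'], results['pred_probs'], label='Predicted probability')

plt.legend(loc='right', fontsize='x-large')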

**What do you see? What is the graph showing us?**

Go back to the original dataset with two features and visualize the linear boundary.

In [None]:
# Plot visualizing function
def plot_decision_boundary(model, X, y):
    
    X_max = X.max(axis=0)
    X_min = X.min(axis=0)
    
    xticks = np.linspace(X_min[0], X_max[0], 100)
    yticks = np.linspace(X_min[1], X_max[1], 100)
    
    xx, yy = np.meshgrid(xticks, yticks)
    ZZ = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    
    Z = ZZ >= 0.5
    Z = Z.reshape(xx.shape)
    
    fig, ax = plt.subplots()
    ax.contourf(xx, yy, Z, cmap=plt.cm.bwr, alpha=0.2)
    ax.scatter(X[:, 0], X[:, 1], c=y, s=40, alpha=0.4)

    ax.set_xlabel('Feature One')
    ax.set_ylabel('Feature Two')

In [None]:
# Create X and y variables from data using df
features = ['feature_one', 'feature_two']
target = 'target'
X = df[features]
y = df[target]

# Color code y
color = y.map({0: 'red', 1: 'blue'})


In [None]:
# Initialize model and fit it to X and y
lr = LogisticRegression().fit(X, y)


**Imagine what the boundary would look like in this plot**

In [None]:
plot_decision_boundary(lr, X.values, color)

**This graph demonstrates the linearity of the logistic regression algorithm.**

In [None]:
# Print out the model intercept and coefficients
print(lr.intercept_, lr.coef_)


### How do we interpret logistic regression coefficients?

$$probability = \frac {one\ outcome} {all\ outcomes}$$

$$odds = \frac {one\ outcome} {all\ other\ outcomes}$$

**Examples:**

- Dice roll of 1: $probability = 1/6$, $odds = 1/5$
- Even dice roll: $probability = 3/6$, $odds = 3/3 = 1$
- Dice roll less than 5: $probability = 4/6$, $odds = 4/2 = 2$

$$odds = \frac {probability} {1 - probability}$$
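
As a quick check, the dice examples above can be reproduced from this formula. The sketch below uses exact fractions so the results match the hand calculations; `odds_from_probability` and `dice_events` are names chosen here for illustration.

```python
from fractions import Fraction

def odds_from_probability(p):
    """Convert a probability into odds: p / (1 - p)."""
    return p / (1 - p)

# Probabilities of the three dice events listed above
dice_events = {
    "roll of 1": Fraction(1, 6),
    "even roll": Fraction(3, 6),
    "roll less than 5": Fraction(4, 6),
}

dice_odds = {name: odds_from_probability(p) for name, p in dice_events.items()}

for name, odds in dice_odds.items():
    print(f"{name}: probability = {dice_events[name]}, odds = {odds}")
```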

In [None]:
# Create a table of probability vs. odds
table = pd.DataFrame({'probability': np.arange(start=0.01, stop=1.0, step=0.01)})
table['odds'] = table['probability'] / (1 - table['probability'])

# Plot the probability vs. odds
table.plot(x='probability', y='odds')


In [None]:
# Add the log-odds to the table by taking the **natural log** of the odds
table['logodds'] = np.log(table['odds'])

# Plot the probability vs. log odds
table.plot(x='probability', y='logodds')


**The log odds are passed through the logistic function.**

Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).

### Visualize how the coefficients and intercept can affect the probabilities

![logit](http://nbviewer.jupyter.org/github/justmarkham/DAT8/blob/master/notebooks/images/logistic_betas.png)

Changing the $\beta_0$ (or intercept) value shifts the curve horizontally, whereas changing the $\beta_1$ (or coefficient) value changes the slope of the curve.
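
If the image does not load, the same effect can be checked numerically. The sketch below assumes a one-feature model with probability $1 / (1 + e^{-(\beta_0 + \beta_1 x)})$; the function names and the example values of `b0` and `b1` are hypothetical.

```python
import numpy as np

def predicted_probability(x, b0, b1):
    """Probability from a one-feature logistic model with intercept b0 and coefficient b1."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

def midpoint(b0, b1):
    """x value where the predicted probability is 0.5 (where the log-odds equal 0)."""
    return -b0 / b1

def slope_at_midpoint(b1):
    """Derivative of the curve at its midpoint: b1 * p * (1 - p) evaluated at p = 0.5."""
    return b1 / 4.0

# Changing the intercept moves the midpoint left or right
for b0 in (-2.0, 2.0):
    print(f"b0={b0:+.1f}, b1=1.0 -> midpoint at x={midpoint(b0, 1.0):+.1f}")

# Changing the coefficient makes the curve steeper or flatter
for b1 in (0.5, 1.0, 2.0):
    print(f"b1={b1:.1f} -> slope at midpoint={slope_at_midpoint(b1):.3f}")
```

A negative coefficient flips the curve so that the probability falls as x increases.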

## <b> Can you use Spotify data to predict whether or not I will like a song? </b>

In [None]:
# Example getting data using APIs
import spotipy
import json
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify() 

with open('../../../spotify_credentials.json') as f:
    creds = json.load(f)
    client_id = creds['client_id']
    secret = creds['client_secret']

client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=secret) 
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager) 

playlist = sp.user_playlist(1255971084, '4oKlWPG8WIMhemePCfCyxn') 
songs = playlist["tracks"]["items"] 
ids = [song["track"]["id"] for song in songs]

features = sp.audio_features(ids) 
df = pd.DataFrame(features)

In [None]:
df.head()

## Attributes


- **Acousticness:** A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- **Danceability:** Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- **Instrumentalness:** Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- **Loudness:** The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- **Mode:** Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- **Valence:** A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- **Tempo:** The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- **Energy:** Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
    
More information here https://developer.spotify.com/web-api/get-audio-features/

In [None]:
# Load the spotify dataset
df = pd.read_pickle("../../data/Spotify_Data.pkl")

Quick EDA: Summary stats and correlations

In [None]:
# Print summary stats
df.describe()


In [None]:
# Print summary stats for each label
df.groupby('target').mean()


In [None]:
# Print correlation matrix
df.corr()


**Train a logistic regression model on the data to predict whether or not the user will like a certain song**

In [None]:
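# Assign X and y
features = df.drop('target', axis=1).columns
target = 'target'
X = df[features]
y = df[target]

# Initialize, fit and score the model
# (max_iter is raised as a precaution because the features are unscaled)
lr = LogisticRegression(max_iter=1000)
lr.fit(X, y)
score = lr.score(X, y)

print("The model produces an accuracy score of {:.1f}%".format(score * 100))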

Is that a good or bad score? To find out, let's compare it to the null accuracy.

In [None]:
# Find the null accuracy (aka the benchmark score): the share of the most frequent class
y.value_counts(normalize=True).max()


Let's make a table of the coefficients and odds

In [None]:
# Create a dataframe of coefficients and their values
pd.DataFrame({'feature': features, 'coefficient': lr.coef_[0]})


**Odds ratio**: the ratio of the odds after increasing $X_i$ by 1 to the odds before increasing $X_i$ by 1. Therefore, $(odds\_ratio - 1) \times 100$ can be interpreted as the percentage change in the odds for a 1 unit change in $X_i$, holding the other features fixed.
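
A small numeric sketch may help here. It assumes a one-feature model whose odds are $e^{b_0 + b_1 x}$; the values of `b0`, `b1` and `x` are made up for illustration. The odds ratio for a one-unit increase equals $e^{b_1}$ no matter which `x` you start from.

```python
import numpy as np

def odds(x, b0, b1):
    """Odds implied by a one-feature logistic model: exp(b0 + b1 * x)."""
    return np.exp(b0 + b1 * x)

# Hypothetical intercept, coefficient and starting feature value
b0, b1 = -1.0, 0.4
x = 2.0

# Ratio of the odds after a one-unit increase to the odds before it
odds_ratio = odds(x + 1, b0, b1) / odds(x, b0, b1)
percent_change = (odds_ratio - 1) * 100

print(f"odds ratio:             {odds_ratio:.3f}")
print(f"exp(b1):                {np.exp(b1):.3f}")
print(f"percent change in odds: {percent_change:.1f}%")
```

This is why exponentiating a fitted model's coefficients gives one odds ratio per feature.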

In [None]:
# Calculate the odds ratio for each feature by exponentiating its coefficient
pd.Series(np.exp(lr.coef_[0]), index=features)


## Model Evaluation Techniques

![s](http://www.dataschool.io/content/images/2015/01/confusion_matrix2.png)

**True Positives (TP):** Number of correct positive predictions

**True Negatives (TN):** Number of correct negative predictions

**False Positives (FP):** Number of incorrect positive predictions

**False Negatives (FN):** Number of incorrect negative predictions

**Recall** (also known as *sensitivity* or the *true positive rate*): Out of all the positive labels, what percentage were predicted correctly?

**Precision:** Out of all the positive predictions, what percentage have a positive label?

**False Positive Rate:** The number of incorrect positive predictions divided by number of negative labels

**True Negative Rate** (also known as *specificity*): The number of correct negative predictions divided by number of negative labels 
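
The definitions above can be computed directly from the four confusion-matrix counts. The sketch below uses ten made-up labels and predictions (`y_true` and `y_pred` are hypothetical, not the Spotify data) and compares two of the results against scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions for ten observations
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary 0/1 labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)               # true positive rate / sensitivity
precision = tp / (tp + fp)
false_positive_rate = fp / (fp + tn)  # fall-out
true_negative_rate = tn / (tn + fp)   # specificity

print(f"recall={recall:.2f}, precision={precision:.2f}")
print(f"FPR={false_positive_rate:.2f}, TNR={true_negative_rate:.2f}")

# The hand-computed values agree with scikit-learn's helpers
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred))
```

The false positive rate and the true negative rate share the same denominator, so they always sum to 1.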

### Formula Table
![a](http://www.chioka.in/wp-content/uploads/2013/08/Metrics-Table.png)

### Confusion Matrix with Metrics

![s](https://eus-www.sway-cdn.com/s/4YEmvTlyess2YF1M/images/VfcIF1yrYJrvLl?quality=1071&allowAnimation=true)

Create confusion matrix for the Spotify data and calculate recall and precision scores

In [None]:
# Pass the targets and predictions into a confusion matrix
cm = confusion_matrix(y, lr.predict(X))
sns.heatmap(cm, annot=True, fmt='d')  # fmt='d' shows the counts as whole numbers

If you were a Spotify data scientist, would you want a model that produces more false negatives or false positives?

In [None]:
# Calculate precision and recall scores with sklearn
precision = precision_score(y, lr.predict(X))
recall = recall_score(y, lr.predict(X))

print("The precision is {:.1f}% and the recall is {:.1f}%.".format(precision * 100, recall * 100))

scikit-learn has no dedicated function for the false positive rate (fall-out), but it can be computed from the confusion matrix as FP / (FP + TN).

### Area Under the ROC Curve (AUC)

![w](https://chrisalbon.com/images/machine_learning_flashcards/Receiver_Operating_Characteristic_print.png)

In [None]:
y_prob = lr.predict_proba(X)[:,1]
false_positive_rate, true_positive_rate, threshold = roc_curve(y, y_prob)

# Plot ROC curve
plt.figure(figsize=(10,8))
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0, 1], ls='--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

In [None]:
# Calculate the area under the curve using roc_auc_score
roc_auc_score(y, y_prob)


**What is the relationship between the thresholds and FPR and TPR?**

In [None]:
# Plot ROC_curve again but this time annotate the curve with the threshold value
plt.figure(figsize=(10,8))
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate)
plt.plot([0, 1], ls='--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

# Use loop names that do not overwrite the target variable y
for thr, fpr, tpr in zip(threshold[::25], false_positive_rate[::25], true_positive_rate[::25]):
    plt.annotate("{0:.2f}".format(thr), xy=(fpr, tpr + .04))

Let's plot threshold vs. FPR / TPR on the same plot

In [None]:
plt.figure(figsize=(10,8))

plt.plot(threshold, false_positive_rate, linewidth=5, label='False Positive Rate')
plt.plot(threshold, true_positive_rate, linewidth=5, label='True Positive Rate')

plt.xlabel('Thresholds')
plt.ylabel("True / False Positive Rate")
plt.legend()

What do you see here? Why do both lines fall as the threshold increases?
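
One way to explore the question numerically is sketched below. It uses a small synthetic dataset rather than the Spotify data, and `rates_at_threshold`, `X_demo` and `y_demo` are names invented for this example. Raising the threshold can only turn positive predictions into negative ones, so the counts of true positives and false positives can only shrink.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Synthetic two-feature data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)
probs = model.predict_proba(X_demo)[:, 1]

def rates_at_threshold(y_true, y_prob, threshold):
    """Return (TPR, FPR) when predicting positive for probabilities >= threshold."""
    predicted_positive = y_prob >= threshold
    tpr_value = (predicted_positive & (y_true == 1)).sum() / (y_true == 1).sum()
    fpr_value = (predicted_positive & (y_true == 0)).sum() / (y_true == 0).sum()
    return tpr_value, fpr_value

for t in (0.2, 0.5, 0.8):
    tpr_t, fpr_t = rates_at_threshold(y_demo, probs, t)
    print(f"threshold={t:.1f}: TPR={tpr_t:.2f}, FPR={fpr_t:.2f}")

# roc_curve reports thresholds from high to low, so both rates rise along its output
fpr, tpr, thresholds = roc_curve(y_demo, probs)
```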

## Resources

Logistic regression:
- http://www.dataschool.io/guide-to-logistic-regression/
- https://onlinecourses.science.psu.edu/stat504/node/149
- https://www.youtube.com/watch?v=_Po-xZJflPM
- https://www.youtube.com/watch?v=gNhogKJ_q7U
- https://www.youtube.com/watch?v=fJ53tIDbvTM

Evaluation:
- http://www.dataschool.io/roc-curves-and-auc-explained/
- http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf
