# Logistic Regression & Model Evaluation Techniques

<b> Goals </b>

- Learn the ins and outs of the Logistic Regression model
- The pros and cons of LR and how it compares to the two other models we've learned so far
- Model evaluation beyond accuracy score: sensitivity (recall), specificity, precision, ROC AUC, and more
- Cross validating and plotting with new model evaluation techniques

## Logistic Regression

- Logistic regression is a generalization of the linear regression model adapted to classification problems.
 
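- Very popular because it's very fast and interpretable. It needs little tuning, and feature scaling matters less than for distance-based models (scaling can still matter when the model is regularized, which is scikit-learn's default).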

- Less prone to overfitting when you don't have many features.
 
- In linear regression, we use a set of quantitative feature variables to predict a continuous response variable. In logistic regression, we use a set of quantitative feature variables to predict probabilities of class membership.

- Named for the function used at the core of the method, the logistic function aka the sigmoid function. 

- Logistic regression is a linear regression between our features, X, and the log-odds of an observation belonging to a certain class, which we will call the true (positive) class for the sake of generalization.

Pros:

- Highly interpretable
- Model training and prediction are fast
- No tuning is required (most of the time)
- Features don't need scaling (unless the model is regularized)
- Can perform well with a small number of observations
- Outputs well-calibrated predicted probabilities

Cons:

- Presumes a linear relationship between the features and the log-odds of the response
- Performance is (generally) not competitive with the best supervised learning methods
- Sensitive to irrelevant features

Logit formula:
![w](http://faculty.cas.usf.edu/mbrannick/regression/gifs/lo8.gif)

a = intercept

b = coefficient value

Logit model:
![logit](https://camo.githubusercontent.com/0b115390d4832bfca4c423d6b9c3acdaa1ff01b3/68747470733a2f2f7170682e65632e71756f726163646e2e6e65742f6d61696e2d71696d672d3035656463313837336430313033653336303634383632613435353636646261)

The preceding graph represents the logistic function's ability to map our continuous input, x, to a smooth probability curve that begins at the left, near probability 0, and as we increase x, our probability of belonging to a certain class rises naturally and smoothly up to probability 1. 


In other words:

    • Logistic regression gives an output of the probabilities of a specific class being true
    
    • Those probabilities can be converted into class predictions: if p >= 0.5 the model returns 1 and if p < 0.5 it returns 0
    
    • The logistic function is S-shaped and will always produce values > 0 and < 1.
    
    • Not all relationships as you know are linear, so LR is not always the right model.
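
The points above can be sketched in a few lines of NumPy. This is only an illustration: the `sigmoid` helper and the input values are made up for the example and are not part of the lesson data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical log-odds values, chosen only for illustration
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)

# Convert probabilities into class labels with the 0.5 cutoff
labels = (probs >= 0.5).astype(int)

print(np.round(probs, 3))
print(labels)
```

Inputs far below zero give probabilities near 0, inputs far above zero give probabilities near 1, and an input of exactly 0 gives 0.5.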
    



### Key difference in use of coefficients in linear vs logistic
<br>
Linear Regression: Betas/coefficients represent the change in the response variable for a unit change in x. 

Logistic Regression: Coefficients represent the change in the log-odds for a unit change in x. This means that e^β is the factor by which the odds are multiplied for a unit change in x.
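
A quick numeric check of the odds interpretation, using a made-up intercept and coefficient (both are assumptions for illustration, not fitted values): increasing x by one unit multiplies the odds by e^β no matter where you start.

```python
import math

beta = 0.7          # hypothetical coefficient: change in log-odds per unit of x
intercept = -1.0    # hypothetical intercept

def odds_at(x):
    """Odds implied by the linear log-odds model intercept + beta * x."""
    return math.exp(intercept + beta * x)

# Odds after a one-unit increase divided by odds before
ratio = odds_at(3.0) / odds_at(2.0)
print(ratio, math.exp(beta))
```

The two printed numbers agree, and the same ratio appears for any other pair of x values one unit apart.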

Coding time

In [None]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight")
#sklearn.cross_validation was removed in scikit-learn 0.20; these functions now live in sklearn.model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report, roc_auc_score, roc_curve, log_loss
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

In [None]:
#Use Sklearn to create fake data
data = make_classification(n_samples=800,n_features=2,class_sep=.89,n_informative=2,
                         n_redundant=0, n_repeated=0,n_classes=2, random_state=3)
df = pd.DataFrame(data[0], columns=["feature_one", "feature_two"])
df["target"] = data[1]
#assign red to class 0 and blue to class 1. For plotting purposes.
colors = pd.Series(df["target"]).map({0:"red", 1:"blue"})
plt.figure(figsize=(15,11))
plt.scatter(df.feature_one, df.feature_two, c=colors, s=180, alpha = .6)
plt.xlabel("Feature One")
plt.ylabel("Feature Two");

If you had to draw a straight line that best separates the two classes, where would you put the line?
<br><br>
Let's focus on Feature One and plot it against the target variable

In [None]:
plt.figure(figsize=(12,8))
plt.scatter(df.feature_one, df.target, s=260, alpha=.8)
plt.xlabel("Feature One")
plt.ylabel("Target");

Imagine a logit or S-curve modeling the relationship between the x and y axes.

Let's fit a logistic regression model on the data above and plot the predicted labels and the probabilities

In [None]:
#Sort the dataframe by the feature one and create a new data frame from that.
df2 = df.sort_values("feature_one").copy();

In [None]:
#Assign X and y
X = 
y = 

In [None]:
#Initialize the logistic regression model

;

In [None]:
#Score the model 
score = 
print ("The accuracy score is {:.2f} percent".format(score*100))

Plot the probabilities and the predictions

In [None]:
#Assign label predictions to pred_labels
pred_labels = 

In [None]:
#Assign probability of class 1 to pred_probs
pred_probs = 

In [None]:
plt.figure(figsize=(14,9))
plt.xlabel("Feature One")
plt.ylabel("Target")
plt.scatter(X.values,y, s=70, c= "blue", alpha=1, label="Scatter Plot Data")
plt.plot(X, pred_labels, c="y", linewidth=8, alpha=.5, label = "Logistic Regression Label Predictions")
plt.plot(X, pred_probs, c="g", linewidth=8, alpha=.5, label = "Logistic Regression Probability Predictions")
plt.legend(loc="right", fontsize="x-large");

What do you see? What is the graph showing us?

Go back to the original dataset with two features and visualize the linear boundary

In [None]:
#Plot visualizing function
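def plot_decision_boundary(model, X, y):
    #Convert to a NumPy array so positional indexing like X[:, 0] also works for DataFrames
    X = np.asarray(X)
    X_max = X.max(axis=0)
    X_min = X.min(axis=0)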
    xticks = np.linspace(X_min[0], X_max[0], 100)
    yticks = np.linspace(X_min[1], X_max[1], 100)
    xx, yy = np.meshgrid(xticks, yticks)
    ZZ = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = ZZ >= 0.5
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(12, 9))
    plt.contourf(xx, yy, Z, cmap="RdBu", alpha=0.2)
    plt.scatter(X[:,0], X[:,1],cmap = "RdBu", c=y,s=60, alpha=0.4)
    plt.xlabel("Feature One")
    plt.ylabel("Feature Two")

In [None]:
#Create X and y variables from data using df
X = 
y = 

In [None]:
#Initialize model and fit it to X and y



Imagine what the boundary would look like in this plot

This graph demonstrates the linearity of the logistic regression algorithm.
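
Why the boundary is a straight line: with two features, the predicted probability equals 0.5 exactly where the log-odds are zero, and that condition is a linear equation in the two features. The sketch below uses made-up parameter values (not coefficients fitted to the data above) to check this.

```python
import numpy as np

# Hypothetical fitted parameters, for illustration only
intercept = 0.5
b1, b2 = 1.5, -2.0

def predict_proba(x1, x2):
    """Probability of class 1 under the hypothetical model."""
    return 1.0 / (1.0 + np.exp(-(intercept + b1 * x1 + b2 * x2)))

# Points on the boundary satisfy intercept + b1*x1 + b2*x2 = 0,
# i.e. x2 = -(intercept + b1*x1) / b2, which is a straight line in x1.
x1 = np.linspace(-3, 3, 7)
x2_boundary = -(intercept + b1 * x1) / b2

boundary_probs = predict_proba(x1, x2_boundary)
print(boundary_probs)   # all equal to 0.5, up to floating-point error
```

Moving off the line in either direction pushes the probability above or below 0.5, which is what the shaded regions in the plot show.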

## <b> Can you use Spotify data to predict whether or not I will like a song? </b>

<b> Attributes </b>


    Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    
    Danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

    Instrumentalness: Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    
    Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
    
    Mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

    Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
    
    Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

    Energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
    
More information here https://developer.spotify.com/web-api/get-audio-features/

My article detailing my process and findings: https://opendatascience.com/blog/a-machine-learning-deep-dive-into-my-spotify-data/

In [None]:
#Load the datafile "Spotify_Data.pkl" and check it out
df = pd.read_pickle("../../data/Spotify_Data.pkl")
df.head()

Quick EDA: Summary stats grouped by class and correlations

In [None]:
#Summary stats


In [None]:
#Difference between class 0 and class 1


Thoughts? Things of interest? Which variables stick out to you?

Train a logistic regression model on the data to predict whether or not I will like a certain song

In [None]:
#Create X and y variables
X = 
y = 

#Initialize, fit, and score the model
lr = 


score = 

print ("The model produces an accuracy score of {:.2f} percent".format(score*100))

Is that a good or bad score? To find out let's compare it to the null accuracy.
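
Null accuracy is the accuracy you would get by always predicting the most frequent class. A minimal sketch on made-up labels (not the Spotify data):

```python
import numpy as np

# Hypothetical labels: 1 = liked the song, 0 = did not
y_demo = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Count each class, then take the share of the most frequent one
class_counts = np.bincount(y_demo)
null_accuracy = class_counts.max() / len(y_demo)

print(null_accuracy)
```

A model is only useful if its accuracy beats this number.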

In [None]:
#Find the null accuracy aka the benchmark score


### Class Exercise

#### <b> Training/testing  </b>
1. Split the data into train/test splits
2. Fit the model on the training set
3. Make predictions on the test set with the trained model
4. Calculate accuracy score by comparing predicted labels of the test set to its actual labels
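
The four steps can be sketched on synthetic data as follows. Everything here is an assumption for illustration: the data comes from `make_classification`, and the `*_demo`, `X_tr`, and `X_te` names are chosen so they don't clash with the variables used in the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Spotify data)
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=42)

# Step 1: split into training and testing sets (default test size is 25%)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Step 2: fit on the training set only
demo_model = LogisticRegression()
demo_model.fit(X_tr, y_tr)

# Step 3: predict labels for the held-out test set
demo_preds = demo_model.predict(X_te)

# Step 4: compare predictions with the true test labels
demo_test_acc = accuracy_score(y_te, demo_preds)
print("Testing accuracy: {:.2f} percent".format(demo_test_acc * 100))
```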

In [None]:
#Step 1. Use random_state = 42

#Step 2


#Step 3



#Step 4
testing_score =

print ("The model accurately classified {:.2f} percent of the testing data".format(testing_score*100))

How does the testing accuracy compare to the first one?
<br><br><br><br>
Use cross validation to derive a truer testing accuracy score
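
A single train/test split depends on which rows happen to land in the test set. Cross validation repeats the split several times and averages the scores. A small sketch on synthetic data (an assumption for illustration, not the Spotify data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=42)

# cv=5 gives five accuracy scores, one per held-out fold
cv_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5, scoring="accuracy")

print(cv_scores)
print("Mean CV accuracy: {:.2f} percent".format(cv_scores.mean() * 100))
```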

In [None]:
#Use cross_val_score method to generate the average accuracy score for 5 CVs
mean_cv_score = 

print ("The cross validated accuracy score is {:.2f} percent".format(mean_cv_score*100))

### Probability, odds, e, log, log-odds. How to interpret logistic regression coefficients
<br>
Quick stats and probability detour

$$probability = \frac {one\ outcome} {all\ outcomes}$$

$$odds = \frac {one\ outcome} {all\ other\ outcomes}$$

Examples:

- Dice roll of 1: probability = 1/6, odds = 1/5
- Even dice roll: probability = 3/6, odds = 3/3 = 1
- Dice roll less than 5: probability = 4/6, odds = 4/2 = 2

$$odds = \frac {probability} {1 - probability}$$

In [None]:
# create a table of probability versus odds
table = pd.DataFrame({'probability':[0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table.probability/(1 - table.probability)
table

What is e? It is the base rate of growth shared by all continually growing processes:

In [None]:
# exponential function: e^1
np.exp(1)

In [None]:
# time needed to grow 1 unit to 2.718 units
np.log(2.718)

The natural log (np.log) is the inverse of the exponential function:

In [None]:
np.log(np.exp(5))

In [None]:
# add log-odds to the table
table['logodds'] = np.log(table.odds)
table

The log odds are what is passed through the logistic function
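
As a short derivation, here is how the log-odds turn into a probability. It uses a for the intercept and b for the coefficient, as in the logit formula earlier, and the odds definition from this section.

$$log\ odds = a + bx$$

$$odds = e^{a + bx}$$

$$probability = \frac {odds} {1 + odds} = \frac {e^{a + bx}} {1 + e^{a + bx}} = \frac {1} {1 + e^{-(a + bx)}}$$

The last expression is the logistic (sigmoid) function applied to the log-odds.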

Train model using one feature: "valence"

In [None]:
V = 
lr_V = 


In [None]:
# compute predicted log-odds for valence_value=0.5...
#by multiplying it by the coefficient and then adding the intercept to it

valence_value = 0.5
logodds = 

logodds

In [None]:
# convert log-odds to odds
odds = 
odds

In [None]:
# convert odds to probability
prob = 
prob

In [None]:
# compute predicted probability for valence_value using the predict_proba method



Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
<br><br><br><br><br><br><br><br>
Re-fit the model using all the variables and make a table of the coefficients and odds

In [None]:
X = 
y = 
lr = 


In [None]:
#Table of coefficients and their values
coef = pd.DataFrame(list(zip(X.columns, np.transpose(lr.coef_[0]))), columns=["coef", "value"])
coef

The odds ratio is the odds after increasing X_i by 1 divided by the odds before increasing X_i by 1.

In [None]:
coef_odds = np.e**(coef["value"])
coef["odds_ratio"] = coef_odds
coef

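The increase in probability is hard to quantify because it depends on the starting probability. For a given odds ratio, the relative increase is larger when p(before) is low, while the absolute increase is largest when p(before) is near 0.5.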
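
A rough numeric check of how the same odds ratio moves different starting probabilities. The odds ratio of 2 and the `apply_odds_ratio` helper are made up for illustration.

```python
def apply_odds_ratio(p_before, odds_ratio):
    """Probability after multiplying the odds of p_before by odds_ratio."""
    odds_after = odds_ratio * p_before / (1 - p_before)
    return odds_after / (1 + odds_after)

odds_ratio = 2.0  # hypothetical odds ratio for a one-unit increase in a feature

for p_before in [0.05, 0.5, 0.9]:
    p_after = apply_odds_ratio(p_before, odds_ratio)
    print("before={:.2f} after={:.3f} absolute change={:.3f} relative change={:.2f}x".format(
        p_before, p_after, p_after - p_before, p_after / p_before))
```

The odds double in every row, but the change in probability differs from row to row, which is why the odds ratio is the easier quantity to report.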

Visualize how the beta and intercept can affect the probabilities:

![logit](http://nbviewer.jupyter.org/github/justmarkham/DAT8/blob/master/notebooks/images/logistic_betas.png)

Changing the $\beta_0$ or intercept value shifts the curve horizontally, whereas changing the $\beta_1$ or coefficient value changes the slope of the curve.
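
A small numeric sketch of both effects, with made-up parameter values. The curve crosses probability 0.5 where $\beta_0 + \beta_1 x = 0$, so the crossing point is $-\beta_0 / \beta_1$. The `logistic` and `midpoint` helpers are hypothetical names used only here.

```python
import numpy as np

def logistic(x, b0, b1):
    """Probability curve for intercept b0 and coefficient b1."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

def midpoint(b0, b1):
    """x value where the predicted probability is 0.5."""
    return -b0 / b1

# Changing the intercept moves the crossing point left or right
print(midpoint(b0=1.0, b1=1.0))
print(midpoint(b0=-2.0, b1=1.0))

# Changing the coefficient changes the steepness:
# compare the rise in probability over the same window around x = 0
x = np.array([-0.5, 0.5])
rise_shallow = np.diff(logistic(x, 0.0, 1.0))[0]
rise_steep = np.diff(logistic(x, 0.0, 3.0))[0]
print(rise_shallow, rise_steep)
```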

## <b> Model Evaluation techniques </b>

![s](http://www.dataschool.io/content/images/2015/01/confusion_matrix2.png)

True Positives (TP): Number of correct positive predictions

True Negatives (TN): Number of correct negative predictions

False Positives (FP): Number of incorrect positive predictions

False Negatives (FN): Number of incorrect negative predictions

Recall aka sensitivity aka the True Positive Rate: The number of correct positive predictions divided by number of positive instances

Precision: The number of correct positive predictions divided by number of positive predictions

False Positive Rate aka Fall Out: The number of incorrect positive predictions divided by number of negative instances

True Negative Rate aka Specificity: The number of correct negative predictions divided by number of negative instances 
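
The definitions above, worked through on made-up counts (the four numbers are assumptions for illustration only):

```python
# Hypothetical confusion-matrix counts
tp, fn = 40, 10   # 50 actual positives
fp, tn = 5, 45    # 50 actual negatives

recall = tp / (tp + fn)          # true positive rate / sensitivity
precision = tp / (tp + fp)       # correct share of the positive predictions
fall_out = fp / (fp + tn)        # false positive rate
specificity = tn / (tn + fp)     # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(recall, precision, fall_out, specificity, accuracy)
```

Note that the false positive rate and the true negative rate share a denominator, so they always sum to 1.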

Formula table:
![a](http://www.chioka.in/wp-content/uploads/2013/08/Metrics-Table.png)

Confusion matrix with metrics:

![s](https://eus-www.sway-cdn.com/s/4YEmvTlyess2YF1M/images/VfcIF1yrYJrvLl?quality=1071&allowAnimation=true)

Super confusion matrix:
![q](https://image.ibb.co/bXkGxm/Screen_Shot_2017_11_28_at_12_03_48_PM.png)

Think about how these metrics can tell us more about the efficacy of a model as opposed to accuracy score.

Is one metric more useful than the others? In which context would it make sense to evaluate a model based on FPR vs FNR?

Create confusion matrix for the Spotify data and calculate recall and precision scores

In [None]:
#Make a train test split of the spotify data and train logistic regression model
X_train, X_test, y_train, y_test = 

lr = 
preds = 
probs = 

In [None]:
#Null accuracy of y_test



In [None]:
#Pass the predictions and y_test into a confusion matrix
cm = 
cm

Let's try calculating the TPR, TNR, FPR, and FNR rates manually

In [None]:
#TPR
cm[1,1]/float(cm.sum(axis=1)[1])

In [None]:
#TNR
cm[0,0]/float(cm.sum(axis=1)[0])

In [None]:
#FPR
cm[0,1]/float(cm.sum(axis=1)[0])

In [None]:
#FNR
cm[1,0]/float(cm.sum(axis=1)[1])

If you were a spotify data scientist would you want a model that produces more false negatives or false positives?

In [None]:
#Calculate precision and recall scores with sklearn
ps = 
rs = 

print ("The precision score is {:.2f} percent and the recall score is {:.2f} percent".format(ps*100, rs*100))

sklearn has no dedicated scoring function for the false positive rate (fall-out). It can be computed from the confusion matrix as FP / (FP + TN).
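
One possible way to fill that gap is a small helper built on `confusion_matrix`. The name `false_positive_rate` is hypothetical, and the labels below are made up for the example.

```python
from sklearn.metrics import confusion_matrix

def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN) for binary labels coded as 0 and 1."""
    # With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp / (fp + tn)

# Hypothetical true labels and predictions
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 1, 0, 0, 1, 1, 0, 1]

print(false_positive_rate(y_true_demo, y_pred_demo))
```

One of the four actual negatives is predicted positive, so the helper returns 0.25.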

Cross validate with precision and recall

In [None]:
#Precision


In [None]:
#Recall


### Log Loss

![s](images/log_loss_2.png)

![w](http://wiki.fast.ai/images/4/43/Log_loss_graph.png)
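
To get a feel for the numbers, here is a hand-rolled version of binary log loss in NumPy. It is a sketch of the standard definition with made-up labels and probabilities; `binary_log_loss` is a hypothetical helper, not the sklearn function.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(class 1)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_demo = [1, 1, 0, 0]
confident_right = [0.9, 0.9, 0.1, 0.1]
hedged = [0.6, 0.6, 0.4, 0.4]
confident_wrong = [0.1, 0.1, 0.9, 0.9]

for name, p in [("confident and right", confident_right),
                ("hedged", hedged),
                ("confident and wrong", confident_wrong)]:
    print(name, round(binary_log_loss(y_demo, p), 3))
```

Lower is better. Confident wrong predictions are penalized far more heavily than hedged ones, which accuracy alone cannot show.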

In [None]:
#evaluate on test set



In [None]:
#cross validate 



Now let's add some context to the log loss value by using the null accuracy
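
One way to build such a reference value is to predict the same probability for every row and compute the log loss of that constant guess. The sketch below uses made-up labels and assumes the constant is the share of class 1, which equals the null accuracy only when class 1 is the majority class.

```python
import numpy as np

y_demo = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])   # hypothetical test labels
base_rate = y_demo.mean()                            # share of class 1

# Constant prediction: the same probability for every row
baseline_probs = np.full(len(y_demo), base_rate)

baseline_log_loss = -np.mean(
    y_demo * np.log(baseline_probs) + (1 - y_demo) * np.log(1 - baseline_probs)
)
print(baseline_log_loss)
```

A fitted model should produce a log loss below this baseline to be worth using.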

In [None]:
#Repeat null_accuracy for each row in y_test



#Pass into log_loss function



### ROC_AUC curve

![w](https://chrisalbon.com/images/machine_learning_flashcards/Receiver_Operating_Characteristic_print.png)

ROC (receiver operating characteristic) curve is a commonly used way to visualize the performance of a binary classifier.

AUC (area under the curve) is arguably the best way to summarize a model's performance in a single number.
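
A useful way to read the AUC: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up scores. `pairwise_auc` is a hypothetical helper written for this illustration, not an sklearn function.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs every negative
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y_demo = [0, 0, 1, 1]
print(pairwise_auc(y_demo, [0.1, 0.4, 0.35, 0.8]))   # 0.75: one pair is ranked the wrong way
print(pairwise_auc(y_demo, [0.1, 0.2, 0.7, 0.9]))    # 1.0: perfect ranking
print(pairwise_auc(y_demo, [0.5, 0.5, 0.5, 0.5]))    # 0.5: no ranking information
```

This is why 0.5 corresponds to a useless model and 1.0 to a perfect one, regardless of the threshold chosen.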

In [None]:
#Derive probabilities of class 1 from the test set
test_probs = 

#Pass in the test_probs variable and the true test labels aka y_test in the roc_curve function
fpr, tpr, thres = 

#Outputs the fpr, tpr, for varying thresholds

In [None]:
#Plotting False Positive Rates vs the True Positive Rates
#Dotted line represents a useless model
plt.figure(figsize=(10,8))
plt.plot(fpr, tpr, linewidth=8)
#Line of randomness
plt.plot([0,1], [0,1], "--", alpha=.7)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()

How do you rate this model?

In [None]:
#Calculate the area under the curve score using roc_auc_score



In [None]:
#Cross validated roc_auc score



What is the relationship between the thresholds and FPR and TPR?
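
To explore the question by hand, the sketch below computes FPR and TPR at a few thresholds on made-up labels and probabilities. The `rates_at` helper and the data are assumptions for illustration only.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of class 1
y_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
p_demo = np.array([0.05, 0.2, 0.3, 0.45, 0.5, 0.6, 0.65, 0.8, 0.9, 0.95])

def rates_at(threshold):
    """Return (FPR, TPR) when every probability >= threshold is labelled 1."""
    pred = p_demo >= threshold
    tpr = np.sum(pred & (y_demo == 1)) / np.sum(y_demo == 1)
    fpr = np.sum(pred & (y_demo == 0)) / np.sum(y_demo == 0)
    return fpr, tpr

for t in [0.25, 0.5, 0.75]:
    fpr_t, tpr_t = rates_at(t)
    print("threshold={:.2f}  FPR={:.2f}  TPR={:.2f}".format(t, fpr_t, tpr_t))
```

Raising the threshold means fewer rows are labelled positive, so neither rate can go up as the threshold increases.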

In [None]:
#Plot ROC_curve again but this time annotate the curve with the threshold value
plt.figure(figsize=(12,9))
plt.plot(fpr, tpr, linewidth=8)
plt.plot([0,1], [0,1], "--", alpha=.7)
for label, x, y in zip(thres[::10], fpr[::10], tpr[::10]):
    plt.annotate("{0:.2f}".format(label), xy=(x, y + .04))
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()

Plotting threshold vs FPR/TPR on the same plot

In [None]:
plt.figure(figsize=(11,8))
plt.plot(thres, fpr, linewidth=5, label = "FPR Line")
plt.plot(thres, tpr, linewidth=5, label = "TPR line")
plt.xlabel("Thresholds")
plt.ylabel("Rate")
plt.legend()
plt.show()

What do you see here? Why is there a negative correlation with the threshold in both lines?


<b>Thresholds and model performance: Does tweaking the threshold give us a better model?</b>
<br><br>
Unfortunately, sklearn's LogisticRegression has no threshold parameter to configure. 
<br>For example, lr = LogisticRegression(threshold=n) is not valid.
<br>
<br>So we need to create our own threshold function using the np.where function
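
A minimal sketch of thresholding with `np.where`, using made-up probabilities rather than the model's output:

```python
import numpy as np

demo_probs = np.array([0.15, 0.55, 0.6, 0.72, 0.4])   # hypothetical P(class 1)

# np.where(condition, value_if_true, value_if_false)
labels_default = np.where(demo_probs >= 0.5, 1, 0)
labels_strict = np.where(demo_probs >= 0.6, 1, 0)

print(labels_default)   # [0 1 1 1 0]
print(labels_strict)    # [0 0 1 1 0]
```

The stricter cutoff flips the 0.55 prediction from 1 to 0 and leaves the rest unchanged.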

In [None]:
#Assign 1 to all the values in test_probs >= 0.6 and 0 to the rest
#First argument is condition
#Second argument is the value you use to replace all the values that satisfy the condition
#Third argument is the value you use to replace all the values that don't satisfy the condition
labels_60 = 
labels_60[:20]

Does this give a better accuracy score?

In [None]:
#Put this in function form

def thres_acc(t, yt, tp):
    labels = np.where(tp>=t, 1, 0)
    return accuracy_score(yt, labels)

Plot various thresholds vs their accuracy scores

In [None]:

thresholds = np.linspace(0,1, 30)
acc_scores = [thres_acc(i, y_test, test_probs) for i in thresholds]

In [None]:
#Plot thresholds vs accuracy scores
plt.figure(figsize=(12,9))
plt.plot(thresholds, acc_scores, linewidth=5)
plt.xlabel("Thresholds")
plt.ylabel("Accuracy Scores")
plt.show()

In [None]:
#Which threshold produces the best accuracy score?
thres_score_dict = dict(zip(thresholds,acc_scores))
sorted(thres_score_dict.items(), key = lambda x:x[1], reverse=True)[0][0]

## Bonus Section: KNN vs DT vs LR

In this section, let's compare and contrast the three algorithms we've learned so far by visualizing them on varying fake data from sklearn.

In [None]:
#Imports
from itertools import product
from sklearn.datasets import make_circles, make_moons, make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

Dataset 1: circles

In [None]:
X, y = make_circles(n_samples=500,noise=.05, random_state=30,factor=.6)

In [None]:
# Training classifiers
clf1 = DecisionTreeClassifier(max_depth=3)
clf2 = KNeighborsClassifier(n_neighbors=5)
clf3 = LogisticRegression()

clf1.fit(X, y)
clf2.fit(X, y)
clf3.fit(X, y)


# Plotting decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(10, 8))

for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        [clf1, clf2, clf3],
                        ['Decision Tree (depth=3)', 'KNN (k=5)',
                         'Logistic Regression']):

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z,cmap = "RdBu", alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y, cmap = "RdBu",
                                  s=20, edgecolor='k')
    axarr[idx[0], idx[1]].set_title(tt)

plt.show()

Dataset 2: Moons

In [None]:
X, y = make_moons(n_samples=500,noise=.15, random_state=30)
# Training classifiers
clf1 = DecisionTreeClassifier(max_depth=3)
clf2 = KNeighborsClassifier(n_neighbors=5)
clf3 = LogisticRegression()

clf1.fit(X, y)
clf2.fit(X, y)
clf3.fit(X, y)


# Plotting decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(10, 8))

for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        [clf1, clf2, clf3],
                        ['Decision Tree (depth=3)', 'KNN (k=5)',
                         'Logistic Regression']):

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z, cmap = "RdBu", alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y, cmap = "RdBu",
                                  s=20, edgecolor='k')
    axarr[idx[0], idx[1]].set_title(tt)

plt.show()

Dataset 3: Blobs

In [None]:
X, y = make_blobs(n_samples=500, n_features=2, random_state=30)
# Training classifiers
clf1 = DecisionTreeClassifier(max_depth=3)
clf2 = KNeighborsClassifier(n_neighbors=5)
clf3 = LogisticRegression()

clf1.fit(X, y)
clf2.fit(X, y)
clf3.fit(X, y)


# Plotting decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(10, 8))

for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        [clf1, clf2, clf3],
                        ['Decision Tree (depth=3)', 'KNN (k=5)',
                         'Logistic Regression']):

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y,
                                  s=20, edgecolor='k')
    axarr[idx[0], idx[1]].set_title(tt)

plt.show()

## Resources

Logistic regression:
- http://www.dataschool.io/guide-to-logistic-regression/
- https://onlinecourses.science.psu.edu/stat504/node/149
- https://www.youtube.com/watch?v=_Po-xZJflPM
- https://www.youtube.com/watch?v=gNhogKJ_q7U
- https://www.youtube.com/watch?v=fJ53tIDbvTM

Evaluation:
- http://www.dataschool.io/roc-curves-and-auc-explained/
- http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf


## In class work
<br><br>
For the rest of class work on modeling one of the following datasets: primary, spotify, employee churn (HR_comma_sep.csv), iris, titanic, pokemon, or use fake data from sklearn. Create roc_curves for your models.
<br><br>
Compare and contrast logistic regression, decision trees, and k-nearest neighbors using the new metrics we learned in this class. Which algorithm is better for FPR or FNR? 
<br><br>