## Why do you need logistic regression?
The datasets up to this point all had a numeric response variable, but sometimes the response variable is binary. An example is a customer database where 0 means the customer did not churn and 1 means the customer churned. If you run a linear model with x as the time since the customer's last purchase and y as the churn data, some predictions fall above 1 or below 0. Those are impossible probabilities, because a probability has to lie between 0 and 1 and a customer can only churn (1) or not (0).
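
To see the problem concretely, here is a minimal sketch in plain Python. The ten data points are made up for illustration and are not the churn dataset. The line is fitted with the textbook least-squares formulas rather than with statsmodels.

```python
# Toy data, invented for illustration: larger x means a longer time since the last purchase
time_since_last_purchase = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
has_churned = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(time_since_last_purchase)
mean_x = sum(time_since_last_purchase) / n
mean_y = sum(has_churned) / n

# Ordinary least squares for a single explanatory variable
slope = (
    sum((x - mean_x) * (y - mean_y)
        for x, y in zip(time_since_last_purchase, has_churned))
    / sum((x - mean_x) ** 2 for x in time_since_last_purchase)
)
intercept = mean_y - slope * mean_x

# Fitted values of the straight line at each x
predictions = [intercept + slope * x for x in time_since_last_purchase]

print("smallest prediction:", round(min(predictions), 3))
print("largest prediction:", round(max(predictions), 3))
```

With this toy data the straight line predicts about -0.13 at x = 0 and about 1.13 at x = 9, so both ends fall outside the range a probability can take.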

A solution to this problem is to use a logistic regression model. Logistic regression is another type of generalized linear model, used when the response variable is logical (binary). Its predictions follow a logistic (S-shaped) curve. To run a logistic regression, use the logit() function from statsmodels.
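
The S-shaped curve can be sketched in a few lines of plain Python. The intercept and slope below are hypothetical values chosen for illustration, not coefficients fitted to the churn data, and `logistic` is a helper name introduced here.

```python
import math


def logistic(x, intercept, slope):
    """Map a linear predictor onto the interval (0, 1) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


# Hypothetical coefficients, chosen only to show the shape of the curve
intercept, slope = -1.0, 0.5

xs = [-10, -2, 0, 2, 4, 10]
probabilities = [logistic(x, intercept, slope) for x in xs]

for x, p in zip(xs, probabilities):
    print(f"x = {x:>3}: probability = {p:.3f}")
```

However large or small x gets, the result stays strictly between 0 and 1, and it equals 0.5 exactly where intercept + slope * x is 0 (here at x = 2).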

In [None]:
from statsmodels.formula.api import logit
mdl_churn_vs_recency_logit = logit("has_churned ~ time_since_last_purchase", 
                                   data=churn).fit()
# fitting works much like a linear regression, but the interpretation differs
# the result has two coefficients: the intercept and the slope for
# the numeric explanatory variable
print(mdl_churn_vs_recency_logit.params)

In [None]:
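# seaborn and matplotlib are needed for the plots
import matplotlib.pyplot as plt
import seaborn as sns

# add the logistic regression predictions to the plot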
sns.regplot(x="time_since_last_purchase",
            y="has_churned",
            data=churn,
            ci=None,
            logistic=True)
# the logistic curve never goes below 0 or above 1
# to interpret it: when the time since the last purchase is small, the
# probability of churning is close to 0; when the time since the last
# purchase is very high, the probability of churning is close to 1
# so customers who bought something recently are less likely to churn

## Exploring the Explanatory Variables
When the response variable is logical, all the points lie on the 0 and 1 lines, which makes it difficult to see what is happening. You want to see how the explanatory variable is distributed on each line. A histogram of the explanatory variable, split by the response, solves this.

In [None]:
# create the histograms of time_since_last_purchase split by has_churned
# (only x is set: passing y as well would draw a bivariate histogram instead)
sns.displot(data=churn,
            x="time_since_last_purchase",
            col="has_churned")

plt.show()
# you'll see two histograms, one for each value of has_churned

## Visualizing Linear and Logistic Models Together
You can draw both trend lines on the same plot to see how they differ. The linear trend line is straight and the logistic trend line is curved (S-shaped).

In [None]:
# Draw a scatter plot of has_churned versus time_since_first_purchase with a red
# linear regression trend line (without a standard error ribbon)
sns.regplot(data=churn,
            x="time_since_first_purchase",
            y="has_churned", 
            ci=None,
            line_kws={"color": "red"})

# Draw a logistic regression trend line and a scatter plot of time_since_first_purchase vs. has_churned
sns.regplot(x="time_since_first_purchase",
            y="has_churned",
            data=churn, 
            ci=None,
            line_kws={"color": "blue"},
            logistic=True)

plt.show()

## Making Predictions and Odds Ratios
To make predictions (calculate probabilities) with a logistic model, use the same techniques as for linear models:
* create a DataFrame of explanatory variable values
* add a response column calculated using the predict() method
* add those predictions to the plot by drawing a scatter plot of the prediction data

A simpler prediction you can make is the most likely response. If the probability of churning is less than 0.5, the most likely outcome is that the customer won't churn. If it is more than 0.5, the customer is more likely to churn.
* to calculate it, round the predicted probabilities
* plot the most likely outcome using the prediction data with the rounded values
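
Rounding is equivalent to comparing each probability with a threshold of 0.5. A minimal sketch of that idea in plain Python follows. The probabilities are made-up values, and `most_likely_outcome` is a helper name introduced here for illustration.

```python
def most_likely_outcome(probability, threshold=0.5):
    """Return 1 (churn) when the probability exceeds the threshold, else 0."""
    return 1 if probability > threshold else 0


# Made-up predicted probabilities, for illustration only
predicted_probabilities = [0.08, 0.35, 0.62, 0.91]

outcomes = [most_likely_outcome(p) for p in predicted_probabilities]
rounded = [round(p) for p in predicted_probabilities]

print(outcomes)  # [0, 0, 1, 1]
print(rounded)   # [0, 0, 1, 1]
```

Writing the threshold out makes the tie at exactly 0.5 explicit: this helper assigns it to 0, which is also what Python's round() and np.round() do, because both round halves to the nearest even number.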

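Another way to talk about binary responses is with odds ratios (also used in gambling). The odds ratio is the probability of something happening divided by the probability that it doesn't: odds_ratio = probability / (1 - probability). Strictly speaking this quantity is called the odds, and an odds ratio is a ratio of two odds, but the name odds ratio is used throughout here.
* calculate the value from the predicted probabilities
* create a new plot (visualizing odds together with the original points doesn't make sense)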

A nice feature of logistic regression odds ratios is that, on a log scale, they change linearly with the explanatory variable. For that reason the log odds ratio is another common way of describing logistic regression predictions. The log odds ratio is also known as the logit (that's the name of the function, too!).
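
The linear behaviour on the log scale can be checked numerically. The sketch below uses hypothetical coefficients, not values fitted to the churn data, and the helper names `probability` and `odds` are introduced here for illustration.

```python
import math

# Hypothetical coefficients, for illustration only
intercept, slope = -1.5, 0.5


def probability(x):
    """Predicted probability from a logistic model with the coefficients above."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def odds(p):
    """Probability of the event divided by the probability of no event."""
    return p / (1.0 - p)


xs = [0, 1, 2, 3, 4, 5, 6]
log_odds = [math.log(odds(probability(x))) for x in xs]

# Differences between consecutive log odds: constant if the relationship is linear
steps = [later - earlier for earlier, later in zip(log_odds, log_odds[1:])]

for x, value in zip(xs, log_odds):
    print(f"x = {x}: log odds = {value:.3f}")
print("steps:", [round(step, 3) for step in steps])
```

Each one-unit increase in x adds the slope to the log odds, and the log odds at x = 0 equal the intercept. In other words, the log odds are exactly the linear part of the model, intercept + slope * x.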

Each way of describing responses has different benefits:
* the most likely outcome is the easiest to understand because it's either "yes" or "no", but it lacks precision
* probabilities and odds ratios are both fairly easy to understand, but the non-linear predictions make it harder to reason about how changes in the explanatory variable will change the response variable
* the log odds ratio is difficult to interpret for individual values, but its linear relationship with the explanatory variable makes it easy to reason about changes

In [None]:
# calculating probabilities of a response
import numpy as np
import pandas as pd

mdl_recency = logit("has_churned ~ time_since_last_purchase",
                    data=churn).fit()
# create the DataFrame of explanatory variable values
# (the column name must match the name used in the formula)
explanatory_data = pd.DataFrame(
    {"time_since_last_purchase": np.arange(-1, 6.25, 0.25)})

# add a response column calculated using the predict method
prediction_data = explanatory_data.assign(has_churned = mdl_recency.predict(explanatory_data))

# you can then add those predictions onto the plot by creating a scatter plot of the pred data
sns.regplot(x="time_since_last_purchase",
            y="has_churned",
            data=churn,
            ci=None,
            logistic=True)

sns.scatterplot(x="time_since_last_purchase",
                y="has_churned",
                data=prediction_data,
                color="red")

plt.show()

# the points will follow the trend line

In [None]:
# round the predicted probabilities
prediction_data = explanatory_data.assign(
                  has_churned = mdl_recency.predict(explanatory_data))
prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])

# plot the most likely outcome by using the prediction data with the numbers just calculated
sns.regplot(x="time_since_last_purchase",
            y="has_churned",
            data=churn,
            ci=None,
            logistic=True)

sns.scatterplot(x="time_since_last_purchase",
                y="most_likely_outcome",
                data=prediction_data,
                color="red")

plt.show()

# the results show that for the most recently active customers the most likely outcome is
# to not churn; otherwise the most likely outcome is to churn

In [None]:
# calculate the odds ratio (the outer parentheses let the expression span two lines)
prediction_data["odds_ratio"] = (prediction_data["has_churned"] /
                                 (1 - prediction_data["has_churned"]))
    
# create a new plot using seaborn's lineplot function to create a continuous line
sns.lineplot(x="time_since_last_purchase", 
             y="odds_ratio", 
             data=prediction_data)

# add a horizontal line for where odds ratio equals 1
plt.axhline(y=1, linestyle="dotted")


plt.show()

# the dotted line is at 1 and shows where churning and not churning are equally likely
# in the bottom left the odds ratio is below 1: churning is less likely than not churning
# in the top right the odds ratio is above 1: churning is more likely than not churning

In [None]:
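# you can see the odds ratios on a log scale too: redraw the plot, then switch
# the y-axis to a logarithmic scale (a new cell starts with an empty figure)
sns.lineplot(x="time_since_last_purchase",
             y="odds_ratio",
             data=prediction_data)
plt.axhline(y=1, linestyle="dotted")
plt.yscale("log")
plt.show()

# calculate the log odds ratio
prediction_data["log_odds_ratio"] = np.log(prediction_data["odds_ratio"])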

## Quantifying Logistic Regression Fit
This section shows how to assess the performance of logistic regression models. The diagnostic plots drawn for linear models are less useful for logistic models, so confusion matrices are used instead.

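A logical response variable leads to four possible outcomes, one for each combination of actual and predicted value:
* actual false, predicted false (true negative)
* actual false, predicted true (false positive)
* actual true, predicted false (false negative)
* actual true, predicted true (true positive)

If we predicted churn and the customer churned, or predicted no churn and the customer didn't churn, the model did well. False positives and false negatives are errors. The counts of each of these four outcomes are called a confusion matrix.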

Model fit can be quantified with these performance metrics:
* accuracy: the proportion of correct predictions, (tn+tp)/(tn+fn+fp+tp), the proportion of customers where the model correctly predicted whether or not they churned
* sensitivity: the true positive rate, tp/(fn+tp), the proportion of customers who churned where the model correctly predicted that they churned
* specificity: the true negative rate, tn/(tn+fp), the proportion of customers who didn't churn where the model correctly predicted that they didn't churn

In general, higher is better, but there is often a trade-off: increasing specificity will decrease sensitivity, and increasing sensitivity will decrease specificity.
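
One way to see the trade-off is to change the threshold that turns probabilities into predictions. The sketch below uses made-up responses and probabilities, not output from the churn model. The helpers `confusion_counts` and `metrics` are names introduced here for illustration.

```python
def confusion_counts(actual, probabilities, threshold):
    """Count true negatives, false positives, false negatives and true positives."""
    tn = fp = fn = tp = 0
    for observed, p in zip(actual, probabilities):
        predicted = 1 if p > threshold else 0
        if observed == 1 and predicted == 1:
            tp += 1
        elif observed == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def metrics(tn, fp, fn, tp):
    """Return (accuracy, sensitivity, specificity) from the four counts."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    sensitivity = tp / (fn + tp)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Made-up actual responses and predicted probabilities, for illustration only
actual = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
probabilities = [0.1, 0.2, 0.3, 0.4, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9]

results = {}
for threshold in (0.3, 0.5, 0.7):
    results[threshold] = metrics(*confusion_counts(actual, probabilities, threshold))
    acc, sens, spec = results[threshold]
    print(f"threshold {threshold}: accuracy={acc:.2f}, "
          f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With this toy data, raising the threshold from 0.3 to 0.7 lowers sensitivity from 1.0 to 0.4 while specificity rises from 0.6 to 1.0.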



In [None]:
# confusion matrix: counts of outcomes

# first get the actual responses from the has_churned column of the dataset
actual_response = churn["has_churned"]
# get the predicted responses by calling predict() on the fitted logistic regression
# (with no arguments, predict() uses the data the model was fitted on)
# these predicted values are probabilities
# to get the most likely outcome, round the values to 0 or 1
predicted_response = np.round(mdl_recency.predict())

# combine actual and predicted responses in a dataframe
outcomes = pd.DataFrame({"actual_response": actual_response,
                         "predicted_response": predicted_response})
# use the value_counts method to get the counts of each combination of values
print(outcomes.value_counts(sort=False))

# rows where actual and predicted match are correct predictions; the others are errors

In [None]:
# the confusion matrix can also be created automatically with pred_table (returns an array)
conf_matrix = mdl_recency.pred_table()
print(conf_matrix)

# rows are actual values, columns are predicted values, so the layout is
#                     true neg,  false pos
#                     false neg, true pos

In [None]:
# plot the confusion matrix by using the mosaic function
from statsmodels.graphics.mosaicplot import mosaic

mosaic(conf_matrix)
plt.show()
# the column widths show the fraction of observations in each actual response category
# the split within each column shows the fraction of those observations predicted as each value

In [None]:
# performance metrics
# extract values from the conf_matrix
TN = conf_matrix[0,0]
TP = conf_matrix[1,1]
FN = conf_matrix[1,0]
FP = conf_matrix[0,1]

# accuracy 
acc = (TN + TP) / (TN + TP + FN + FP)
print(acc)

# sensitivity
sens = TP / (FN + TP)
print(sens)

# specificity
spec = TN / (TN + FP)
print(spec)
