# Logistic regression with two explanatory variables

Logistic regression also supports multiple explanatory variables, and the formula syntax for including them is the same as for linear regression.

Here you'll fit a model of churn status with both of the explanatory variables from the dataset: the length of customer relationship and the recency of purchase, and their interaction.

`churn` is available.

In [None]:
# # Import logit
# from statsmodels.formula.api import logit

# # Fit a logistic regression of churn status vs. length of relationship, recency, and an interaction
# mdl_churn_vs_both_inter = logit("has_churned ~ time_since_first_purchase + time_since_last_purchase + time_since_first_purchase : time_since_last_purchase", data = churn).fit()

# # Print the coefficients
# print(mdl_churn_vs_both_inter.params)
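To make the interaction term concrete, here is a minimal sketch of how a prediction could be computed by hand from the four coefficients of such a model. The function name `churn_probability` and the coefficient values are hypothetical, chosen only for illustration; they are not the values the fitted model returns.

```python
import math

def churn_probability(first, last, intercept, b_first, b_last, b_inter):
    """Predicted churn probability for one customer under an interaction model."""
    # Linear predictor: the two main effects plus the interaction term,
    # which is a coefficient times the product of the two variables
    linear_predictor = (
        intercept
        + b_first * first
        + b_last * last
        + b_inter * first * last
    )
    # The logistic function maps the linear predictor to a probability
    return 1 / (1 + math.exp(-linear_predictor))

# Hypothetical coefficients, NOT taken from mdl_churn_vs_both_inter
coeffs = dict(intercept=-0.15, b_first=-0.6, b_last=0.4, b_inter=0.1)

prob = churn_probability(first=1.0, last=2.0, **coeffs)
print(prob)
```

With real coefficients, the same arithmetic should agree with what `.predict()` returns for the same pair of inputs.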

# Logistic regression prediction

As with linear regression, the joy of logistic regression is that you can make predictions. Let's step through the prediction flow one more time!

`churn` and `mdl_churn_vs_both_inter` are available; `itertools.product` is loaded.

In [1]:
# # Create time_since_first_purchase
# time_since_first_purchase = np.arange(-2, 4.1, 0.1)

# # Create time_since_last_purchase
# time_since_last_purchase = np.arange(-1, 6.1, 0.1)

# # Create p as all combinations of values of time_since_first_purchase and time_since_last_purchase
# p = product(time_since_first_purchase, time_since_last_purchase)

# # Transform p to a DataFrame and name the columns
# explanatory_data = pd.DataFrame(p, columns=["time_since_first_purchase",
#                                             "time_since_last_purchase"])

# # Create prediction_data
# prediction_data = explanatory_data.assign(
#     has_churned = mdl_churn_vs_both_inter.predict(explanatory_data)
# )

# # Create most_likely_outcome
# prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])

# # See the result
# print(prediction_data)
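The two less obvious steps in this flow are building every combination of the explanatory values and turning probabilities into a most likely outcome. The sketch below shows both on a tiny grid, using only the standard library; the grid values and the probabilities are made up for illustration.

```python
from itertools import product

# Hypothetical, deliberately tiny grids for the two explanatory variables
first_values = [0.0, 1.0]
last_values = [0.0, 0.5, 1.0]

# product() pairs every value of the first list with every value of the second
grid = list(product(first_values, last_values))
print(len(grid))  # 2 * 3 combinations
print(grid[:3])

# Hypothetical predicted probabilities; rounding gives the most likely outcome
predicted = [0.12, 0.49, 0.51, 0.88]
most_likely = [round(p) for p in predicted]
print(most_likely)
```

The number of rows grows as the product of the grid sizes, so the full grid in the cell above has 61 * 71 rows.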

# Visualizing multiple explanatory variables

Plotting runs into the same problem as in the linear regression case: it quickly becomes difficult to include more numeric variables in the plot. Here you'll look at the case of two numeric explanatory variables, and the solution is basically the same as before: use color to denote the response.

Here there are only two possible values of response (zero and one), both in the actual dataset and the predicted dataset.

`churn` and `prediction_data` are available.

In [2]:
# # Using churn, plot recency vs. length of relationship, colored by churn status
# sns.scatterplot(x="time_since_first_purchase",
#                 y="time_since_last_purchase",
#                 data=churn, 
#                 hue="has_churned")

# # Using prediction_data, plot recency vs. length of relationship, colored by most_likely_outcome
# sns.scatterplot(x="time_since_first_purchase",
#                 y="time_since_last_purchase",
#                 data=prediction_data, 
#                 hue="most_likely_outcome",
#                 alpha=0.2,
#                 legend=False)

# # Show the plot
# plt.show()

# Confusion matrix


When the response variable has just two outcomes, like the case of churn, the measures of success for the model are "how many cases where the customer churned did the model correctly predict?" and "how many cases where the customer didn't churn did the model correctly predict?". These can be found by generating a confusion matrix and calculating summary metrics on it.

Recall the following definitions:

- Accuracy is the proportion of predictions that are correct.
- Sensitivity is the proportion of true observations that are correctly predicted by the model as being true.
- Specificity is the proportion of false observations that are correctly predicted by the model as being false.
 

`churn` and `mdl_churn_vs_both_inter` are available.

In [3]:
# # Create conf_matrix
# conf_matrix = mdl_churn_vs_both_inter.pred_table()

# # Extract TN, TP, FN and FP from conf_matrix
# TN = conf_matrix[0,0]
# TP = conf_matrix[1,1]
# FN = conf_matrix[1,0]
# FP = conf_matrix[0,1]

# # Calculate and print the accuracy
# accuracy = (TP + TN) / (TP + TN + FP + FN)
# print("accuracy", accuracy)

# # Calculate and print the sensitivity
# sensitivity = TP / (TP + FN)
# print("sensitivity", sensitivity)

# # Calculate and print the specificity
# specificity = TN / (TN + FP)
# print("specificity", specificity)
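The three definitions can be checked by hand on a small matrix. The counts below are hypothetical, and the layout (rows are actual values, columns are predicted values) is assumed to match the indexing used in the cell above.

```python
# Hypothetical confusion matrix: rows = actual (0, 1), columns = predicted (0, 1)
conf_matrix = [
    [50, 10],  # actual 0: 50 true negatives, 10 false positives
    [5, 35],   # actual 1: 5 false negatives, 35 true positives
]

TN = conf_matrix[0][0]
FP = conf_matrix[0][1]
FN = conf_matrix[1][0]
TP = conf_matrix[1][1]

# Proportion of all predictions that are correct
accuracy = (TP + TN) / (TP + TN + FP + FN)
# Proportion of actual positives that were predicted positive
sensitivity = TP / (TP + FN)
# Proportion of actual negatives that were predicted negative
specificity = TN / (TN + FP)

print(accuracy, sensitivity, specificity)
```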

# Cumulative distribution function

Understanding the logistic distribution is key to understanding logistic regression. Like the normal (Gaussian) distribution, it is a probability distribution of a single continuous variable. Here you'll visualize the cumulative distribution function (CDF) for the logistic distribution. That is, if you have a logistically distributed variable, x, and a possible value, xval, that x could take, then the CDF gives the probability that x is less than xval.

The logistic distribution's CDF is calculated with the logistic function (hence the name). The plot of this has an S-shape, known as a sigmoid curve. An important property of this function is that it takes an input that can be any number from minus infinity to infinity, and returns a value between zero and one.

In [4]:
# # Import logistic
# from scipy.stats import logistic

# # Create x ranging from minus ten to ten in steps of 0.1
# x = np.arange(-10, 10.1, 0.1)

# # Create logistic_dist
# logistic_dist = pd.DataFrame({"x": x,
#                               "log_cdf": logistic.cdf(x),
#                               "log_cdf_man": 1 / (1 + np.exp(-x))})

# # Using logistic_dist, plot log_cdf vs. x
# sns.lineplot(x = "x", y= "log_cdf", data = logistic_dist)

# # Show the plot
# plt.show()
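As a quick check of the properties described above, the logistic function can be written by hand with the standard library. This is a minimal sketch; the helper name `logistic_cdf` is an assumption for illustration, not the `scipy.stats` API.

```python
import math

def logistic_cdf(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-x))

inputs = [-10, -1, 0, 1, 10]
values = [logistic_cdf(x) for x in inputs]

for x, v in zip(inputs, values):
    print(x, v)
```

The outputs rise steadily from near zero to near one and pass through 0.5 at `x = 0`, which is the S-shape seen in the plot.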

# Inverse cumulative distribution function

The logistic function (logistic distribution CDF) has another important property: each x input value is transformed to a unique value. That means that the transformation can be reversed. The logit function is the name for the inverse logistic function, which is also the logistic distribution inverse cumulative distribution function. (All three terms mean exactly the same thing.)

The logit function takes values between zero and one, and returns values between minus infinity and infinity.

`logistic` is available from `scipy.stats`.

In [5]:
# # Create p ranging from 0.001 to 0.999 in steps of 0.001
# p = np.arange(0.001, 1, 0.001)

# # Create logistic_dist_inv
# logistic_dist_inv = pd.DataFrame({"p": p,
#                                   "logit": logistic.ppf(p),
#                                   "logit_man": np.log(p / (1 - p))})

# # Using logistic_dist_inv, plot logit vs. p
# sns.lineplot(x = "p", y = "logit", data = logistic_dist_inv)

# # Show the plot
# plt.show()
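One way to convince yourself that the logit really reverses the logistic function is a round trip: apply one, then the other, and compare with the starting value. The sketch below uses hand-written helpers (`logistic_cdf` and `logit` are illustrative names, not library functions).

```python
import math

def logistic_cdf(x):
    """Logistic function: real number -> probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))

def logit(p):
    """Inverse of the logistic function: probability in (0, 1) -> real number."""
    return math.log(p / (1 - p))

for x in (-3.0, 0.0, 2.5):
    recovered = logit(logistic_cdf(x))
    print(x, recovered)
```

Up to floating-point rounding, each recovered value should equal its input.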

# Logistic distribution parameters

The logistic CDF is not just a single curve. In the same way that the normal distribution has mean and standard deviation parameters that affect the CDF curve, the logistic distribution has location and scale parameters. Here, you'll visualize how changing those parameters changes the CDF curve.

How do changes to the parameters change the CDF curve?

<center><img src="images/04.091.jpg"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/04.092.jpg"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/04.093.jpg"  style="width: 400px, height: 300px;"/></center>


- As `location` increases, the logistic CDF curve moves rightwards. As `scale` increases, the steepness of the slope decreases.
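The effect of the two parameters can also be checked numerically. The sketch below assumes the usual parameterization, in which the input is shifted by `loc` and divided by `scale` before the logistic function is applied; the helper names are hypothetical.

```python
import math

def logistic_cdf(x, loc=0.0, scale=1.0):
    """Logistic CDF with location and scale parameters."""
    return 1 / (1 + math.exp(-(x - loc) / scale))

def slope_at_midpoint(loc, scale, h=1e-6):
    """Central-difference estimate of the CDF's slope at x = loc."""
    upper = logistic_cdf(loc + h, loc, scale)
    lower = logistic_cdf(loc - h, loc, scale)
    return (upper - lower) / (2 * h)

# The curve crosses 0.5 at x = loc, so a larger loc shifts it rightwards
print(logistic_cdf(2.0, loc=2.0))

# Doubling the scale halves the slope at the midpoint
print(slope_at_midpoint(loc=0.0, scale=1.0))
print(slope_at_midpoint(loc=0.0, scale=2.0))
```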

# Likelihood & log-likelihood

Linear regression tries to optimize a "sum of squares" metric in order to find the best fit. That metric isn't applicable to logistic regression. Instead, logistic regression tries to optimize a metric called likelihood, or a related metric called log-likelihood.

The dashboard shows churn status versus time since last purchase from the `churn` dataset. The blue dotted line is the logistic regression prediction line. (That is, it's the "best fit" line.) The black solid line shows a prediction line calculated from the intercept and slope coefficients you specify as `logistic.cdf(intercept + slope * time_since_last_purchase)`.

Change the intercept and slope coefficients and watch how the likelihood and log-likelihood values change.

As you get closer to the best fit line, what statement is true about likelihood and log-likelihood?

<center><img src="images/04.111.jpg"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/04.112.jpg"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/04.113.jpg"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/04.114.jpg"  style="width: 400px, height: 300px;"/></center>


- Both likelihood and log-likelihood increase to a maximum value. Logistic regression chooses the prediction line that gives you the maximum likelihood value. It also gives maximum log-likelihood.
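The same behavior can be reproduced without the dashboard. The sketch below scores two candidate lines on a tiny made-up dataset (not the `churn` data); the function name and all numbers are hypothetical.

```python
import math

# Hypothetical data: explanatory values and 0/1 responses
x_values = [-1.0, 0.0, 1.0, 2.0]
y_values = [0, 1, 0, 1]

def likelihood_metrics(intercept, slope):
    """Return (likelihood, log-likelihood) of a candidate prediction line."""
    likelihood = 1.0
    log_likelihood = 0.0
    for x, y in zip(x_values, y_values):
        p = 1 / (1 + math.exp(-(intercept + slope * x)))
        # Probability the model assigns to the outcome that actually occurred
        contribution = p if y == 1 else 1 - p
        likelihood *= contribution
        log_likelihood += math.log(contribution)
    return likelihood, log_likelihood

# Candidate A ignores x entirely; candidate B follows the upward trend
lik_a, ll_a = likelihood_metrics(intercept=0.0, slope=0.0)
lik_b, ll_b = likelihood_metrics(intercept=-0.5, slope=1.0)

print(lik_a, ll_a)
print(lik_b, ll_b)
```

Because the logarithm is an increasing function, the candidate with the higher likelihood always has the higher log-likelihood too, so both metrics pick out the same line.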

# Logistic regression algorithm

Let's dig into the internals and implement a logistic regression algorithm. Since `statsmodels`'s `logit()` function is very complex, you'll stick to implementing simple logistic regression for a single dataset.

Rather than using sum of squares as the metric, we want to use likelihood. However, log-likelihood is more computationally stable, so we'll use that instead. Actually, there is one more change: since we want to maximize log-likelihood, but `minimize()` defaults to finding minimum values, it is easier to calculate the negative log-likelihood.
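The stability claim is easy to demonstrate. Likelihood is a product of many probabilities, and with enough observations that product underflows to zero in floating point, while the sum of logs stays an ordinary number. The sketch below uses 2000 hypothetical observations that each contribute a probability of 0.5.

```python
import math

# Hypothetical per-observation likelihood contributions
probs = [0.5] * 2000

# Likelihood: the product of the contributions underflows to exactly 0.0
likelihood = 1.0
for p in probs:
    likelihood *= p

# Log-likelihood: the sum of the logs remains finite and usable
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)
print(log_likelihood)
```

Once the likelihood has collapsed to zero, an optimizer can no longer tell good coefficients from bad ones, which is why the log version is preferred.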

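The log-likelihood value for each observation is

$$
\log(y_{\text{pred}}) \cdot y_{\text{actual}} + \log(1 - y_{\text{pred}}) \cdot (1 - y_{\text{actual}})
$$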

The metric to calculate is the negative sum of these log-likelihood contributions.

The explanatory values (the `time_since_last_purchase` column of `churn`) are available as `x_actual`. The response values (the `has_churned` column of `churn`) are available as `y_actual`. `logistic` is imported from `scipy.stats`, and `logit()` and `minimize()` are also loaded.

In [6]:
# # Complete the function
# def calc_neg_log_likelihood(coeffs):
#     # Unpack coeffs
#     intercept, slope = coeffs
#     # Calculate predicted y-values
#     y_pred = logistic.cdf(intercept + slope * x_actual)
#     # Calculate log-likelihood
#     log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)
#     # Calculate negative sum of log_likelihood
#     neg_sum_ll = -np.sum(log_likelihood)
#     # Return negative sum of log_likelihood
#     return neg_sum_ll
  
# # Call minimize on calc_neg_log_likelihood
# print(minimize(fun=calc_neg_log_likelihood,
#                x0=[0,0]))

# # Compare the output with the logit() call.
# print(logit("has_churned ~ time_since_last_purchase", data=churn).fit().params)