### Multiple Logistic Regression

time to switch from linear regression to logistic regression

to perform a logistic regression in statsmodels, use logit instead of ols

when the response variable has 2 possible values, there are four possible outcomes for the model: true positive (predicted true, actual true), false positive (predicted true, actual false), false negative (predicted false, actual true), and true negative (predicted false, actual false)
these 4 outcomes can be quantified and visualized using a confusion matrix
the confusion matrix lets you calculate metrics like model accuracy, sensitivity, and specificity
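
as a rough sketch of how those metrics fall out of the confusion matrix: the counts below are made up for illustration, and the layout (rows = actual, columns = predicted) is assumed to match what pred_table() returns

```python
import numpy as np

# Hypothetical counts, not from a real model.
# Assumed layout: rows = actual (0, 1), columns = predicted (0, 1)
conf_matrix = np.array([[141.0, 59.0],
                        [111.0, 89.0]])

tn, fp = conf_matrix[0]  # actual 0: true negatives, false positives
fn, tp = conf_matrix[1]  # actual 1: false negatives, true positives

# accuracy: proportion of all predictions that were correct
accuracy = (tn + tp) / conf_matrix.sum()
# sensitivity: proportion of actual positives that were predicted positive
sensitivity = tp / (tp + fn)
# specificity: proportion of actual negatives that were predicted negative
specificity = tn / (tn + fp)

print(accuracy, sensitivity, specificity)
```

higher is better for all three, but there is usually a trade-off between sensitivity and specificity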

prediction flow is also similar to before
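
a small self-contained sketch of that flow: the value ranges and the coefficients are invented for illustration, and logistic.cdf with made-up coefficients stands in for a fitted model's predict() call

```python
from itertools import product

import numpy as np
import pandas as pd
from scipy.stats import logistic

# Hypothetical value ranges for two explanatory variables
explanatory1 = np.arange(0, 3)         # 0, 1, 2
explanatory2 = np.arange(10, 30, 10)   # 10, 20

# Every combination of the two variables, one row each
explanatory_data = pd.DataFrame(list(product(explanatory1, explanatory2)),
                                columns=["explanatory1", "explanatory2"])

# Stand-in for mdl_logit.predict(): made-up coefficients, not a fitted model
intercept, slope1, slope2 = -1.0, 0.8, 0.02
linear_part = (intercept
               + slope1 * explanatory_data["explanatory1"]
               + slope2 * explanatory_data["explanatory2"])
prediction_data = explanatory_data.assign(response=logistic.cdf(linear_part))

# Round the predicted probability to get the most likely outcome (0 or 1)
prediction_data["most_likely_outcome"] = np.round(prediction_data["response"])
print(prediction_data)
```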

In [None]:
# use logit to perform logistic regression
from statsmodels.formula.api import logit

mdl_logit = logit("response ~ explanatory", data=dataset).fit()

# to extend logistic regression to multiple explanatory variables, change the formula, just like linear regression
# use a plus to leave out interactions
mdl_logit = logit("response ~ explanatory1 + explanatory2", data=dataset).fit()
# or a times to include interactions
mdl_logit = logit("response ~ explanatory1 * explanatory2", data=dataset).fit()

# use a confusion matrix to visualize and quantify the four outcomes
# pred_table() returns the counts with rows = actual and columns = predicted
conf_matrix = mdl_logit.pred_table()
print(conf_matrix)

# prediction flow
from itertools import product
import numpy as np
import pandas as pd

explanatory1 = some_values
explanatory2 = some_values
# create the combinations of the explanatory variables
p = product(explanatory1, explanatory2)
# store the combinations in a dataframe
explanatory_data = pd.DataFrame(p, columns=["explanatory1", "explanatory2"])
# assign a new column of predictions, named after the response variable (has_churned in the churn example)
prediction_data = explanatory_data.assign(has_churned = mdl_logit.predict(explanatory_data))
# for visualization purposes, you can also create a column with the most likely outcomes
# it holds the rounded churn predictions, so below 0.5 becomes 0 (won't churn) and above 0.5 becomes 1 (churned)
prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])
# draw two scatter plots, one for actual and one for prediction data
# scatter plot for the actual churn data
sns.scatterplot(...
                data=churn,
                hue="has_churned",
                ...)
# scatter plot for the prediction data, colored by most likely outcome
sns.scatterplot(...
                data=prediction_data,
                hue="most_likely_outcome",
                ...)
# probably want to include legend=False, and the example showed alpha=0.2 too
# the data points lie on or to the right of the diagonal line because of the nature of the data:
# the time since the last purchase can't exceed the time since the first purchase. Also notice the pattern of the
# prediction data: longer customer relationships and shorter purchase recency predict a lower chance of churn.

#### The Logistic Distribution

in order to understand logistic regression, you need to know about the logistic distribution 

the bell curve is the probability density function (PDF) of the Gaussian, or normal, distribution; it is made with scipy's norm.pdf() function

for regression, we care more about the area under this curve: integrating the norm.pdf function (calculating the area underneath it) gives another curve called the cumulative distribution function

to get the cumulative distribution function (cdf), call norm.cdf instead of pdf

cdf can be thought of as the transformation from the values of x to probabilities 

if you have a logistically distributed variable, x, and a possible value, xval, that x could take, then the CDF gives the probability that x is less than xval
The plot of this has an S-shape, known as a sigmoid curve. An important property of this function is that it takes an input that can be any number from minus infinity to infinity, and returns a value between zero and one

the CDF for the logistic distribution is also known as the logistic function, the terms are interchangeable
its equation is: cdf(x) = 1 / (1 + exp(-x))
the inverse CDF is sometimes called the logit function, the terms are interchangeable; logit is also known as the log odds, another scale for describing predictions
its equation is the logarithm of p divided by 1 minus p: inverse_cdf(p) = log(p / (1 - p))

The logistic function (logistic distribution CDF) has another important property: each x input value is transformed to a unique value. That means that the transformation can be reversed. The logit function is the name for the inverse logistic function, which is also the logistic distribution inverse cumulative distribution function. (All three terms mean exactly the same thing.)
The logit function takes values between zero and one, and returns values between minus infinity and infinity.
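
a quick sanity check of the inverse relationship, written as a sketch: the x values are arbitrary, and the manual formulas are compared against scipy's logistic.cdf and logistic.ppf

```python
import numpy as np
from scipy.stats import logistic

# Arbitrary x values for illustration
x = np.linspace(-6, 6, 25)

# logistic function: any real number -> probability between 0 and 1
p = 1 / (1 + np.exp(-x))
# logit function: probability -> log odds
x_back = np.log(p / (1 - p))

# the round trip recovers the original x values
print(np.allclose(x_back, x))
# the manual formulas agree with scipy's CDF and inverse CDF
print(np.allclose(p, logistic.cdf(x)))
print(np.allclose(x_back, logistic.ppf(p)))
```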

the logistic distribution consists of a whole family of curves specified by the location and scale parameters
this allows logistic model prediction curves to have different positions or steepness
The logistic CDF is not just a single curve. In the same way that the normal distribution has mean and standard deviation parameters that affect the CDF curve, the logistic distribution has location and scale parameters.
How do changes to the parameters change the CDF curve?
As location increases, the logistic CDF curve moves rightwards. As scale increases, the steepness of the slope decreases.
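
to see both effects in numbers, here is a minimal sketch using the loc and scale arguments of scipy's logistic.cdf and logistic.pdf; the parameter values 2 and 3 are arbitrary choices for illustration

```python
import numpy as np
from scipy.stats import logistic

x = np.arange(-10, 10.1, 0.1)

base = logistic.cdf(x)              # loc=0, scale=1 (the defaults)
shifted = logistic.cdf(x, loc=2)    # same shape, moved 2 units to the right
flatter = logistic.cdf(x, scale=3)  # same midpoint, gentler slope

# the midpoint (probability 0.5) always sits at x = loc
print(logistic.cdf(0), logistic.cdf(2, loc=2))

# the slope of the CDF is the PDF; at the midpoint it equals 1 / (4 * scale)
print(logistic.pdf(0), logistic.pdf(0, scale=3))
```

plotting base, shifted, and flatter against x with sns.lineplot shows the same thing visually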

In [None]:
# gaussian cumulative distribution function, CDF
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.arange(-4, 4.05, 0.05)

gauss_dist = pd.DataFrame({"x": x, 
                           "gauss_pdf": norm.pdf(x),
                           "gauss_cdf": norm.cdf(x)})

sns.lineplot(x="x",
             y="gauss_cdf",
             data=gauss_dist)
# the left side will be close to zero and the right side will be close to one
# as x goes to its minimum value (-infinity) y approaches 0, as x goes to its maximum value (infinity) y approaches 1
# when x is at 1 the CDF curve is at 0.84, so for a normally distributed variable x, the probability that x is less than 1 is 84%

# you'll also need a way to get back from probabilities to x values, so there's the Gaussian
# inverse CDF, also called the percent point function (PPF) or quantile function
# this example uses a new dataset with probabilities from nearly 0 to nearly 1
p = np.arange(0.001, 1, .001)
gauss_dist_inv = pd.DataFrame({"p": p,
                               "gauss_inv_cdf": norm.ppf(p)})
sns.lineplot(x="p",
             y="gauss_inv_cdf",
             data=gauss_dist_inv)
# the line plot will be the same as the CDF plot above but with the x and y axes flipped

# logistic probability density function
from scipy.stats import logistic

x = np.arange(-4, 4.05, 0.05)
logistic_dist = pd.DataFrame({"x": x,
                              "log_pdf": logistic.pdf(x)})
sns.lineplot(x="x",
             y="log_pdf",
             data=logistic_dist)
# this will look similar to the Gaussian PDF but the tails at the extreme left and right of the plot are fatter


# cumulative distribution function (exercise code)
from scipy.stats import logistic
# Create x ranging from minus ten to ten in steps of 0.1
x = np.arange(-10, 10.1, 0.1)
# Create logistic_dist
logistic_dist = pd.DataFrame({"x": x,
                              "log_cdf": logistic.cdf(x),
                              "log_cdf_man": 1 / (1 + np.exp(-x))})
# Using logistic_dist, plot log_cdf vs. x
sns.lineplot(x="x", 
             y="log_cdf",
             data=logistic_dist)
# Show the plot
plt.show()


# inverse cumulative distribution function
# Create p ranging from 0.001 to 0.999 in steps of 0.001
p = np.arange(0.001, 1, 0.001)
# Create logistic_dist_inv
logistic_dist_inv = pd.DataFrame({"p": p,
                                  "logit": logistic.ppf(p),
                                  "logit_man": np.log(p / (1 - p))})
# Using logistic_dist_inv, plot logit vs. p
sns.lineplot(x="p",
             y="logit",
             data=logistic_dist_inv)
# Show the plot
plt.show()

#### How Logistic Regression Works

basically the same way as linear regression!
choose a metric that measures how far the predicted responses are from the actual responses and then optimize that metric
for linear regression, the metric to optimize was the sum of squares
that doesn't work well for logistic regression, though: the actual response is always 0 or 1 while the predicted response is a probability between those two values, so the sum of squares performs poorly

the likelihood metric is a better bet
each observation contributes y_pred * y_actual + (1 - y_pred) * (1 - y_actual), and the likelihood of the whole dataset is the product of those terms: np.prod(y_pred * y_actual + (1 - y_pred) * (1 - y_actual))
with likelihood you want to find the maximum value
since the actual response only has two possible values, the per-observation term simplifies in two ways
    --when the actual response is 1, the term simplifies to the predicted response, y_pred
        --y_pred * 1 + (1 - y_pred) * (1 - 1) = y_pred
        --as y_pred increases the metric increases too, and the maximum likelihood occurs when y_pred is 1, the same as the actual value
    --when the actual response is 0, the term simplifies to 1 - y_pred
        --y_pred * 0 + (1 - y_pred) * (1 - 0) = 1 - y_pred
        --as y_pred decreases the metric increases, and the maximum likelihood occurs when y_pred is 0
in either case, you get a higher likelihood score when the predicted response is close to the actual response

computing likelihood involves multiplying many numbers between 0 and 1, so the result becomes tiny and prone to numerical errors; log-likelihood is easier and more efficient to compute
the only difference is that you take the logarithm of the predicted response terms, which turns the product into a sum
log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)
because the logarithm is an increasing function, maximizing the log-likelihood gives the same coefficients as maximizing the likelihood
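
a small numeric sketch of both metrics: the data and predictions are made up, and the dataset likelihood is taken to be the product of the per-observation terms (that assumption is what makes the log version a sum)

```python
import numpy as np

# Made-up actual responses and two sets of made-up predictions
y_actual = np.array([1, 0, 1, 1, 0])
good_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])  # close to the actual values
bad_pred = np.array([0.4, 0.6, 0.5, 0.3, 0.7])   # far from the actual values

def likelihood(y_pred, y_actual):
    # product of the per-observation terms
    return np.prod(y_pred * y_actual + (1 - y_pred) * (1 - y_actual))

def log_likelihood(y_pred, y_actual):
    # sum of the logged per-observation terms
    return np.sum(np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual))

# predictions closer to the actual values score higher on both metrics
print(likelihood(good_pred, y_actual), likelihood(bad_pred, y_actual))
print(log_likelihood(good_pred, y_actual), log_likelihood(bad_pred, y_actual))

# with many observations the product underflows, the sum of logs does not
many_actual = np.ones(2000)
many_pred = np.full(2000, 0.5)
print(likelihood(many_pred, many_actual))
print(log_likelihood(many_pred, many_actual))
```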

we want to maximize the log-likelihood, but the optimize package can only minimize functions
so calculate the negative log-likelihood instead
maximizing log-likelihood is the same as minimizing negative log-likelihood
-np.sum(log_likelihood)

now you can write the logistic regression algorithm
def calc_neg_log_likelihood(coeffs):
    intercept, slope = coeffs
    # more calculations
the metric function takes the coefficients as its argument; extract the intercept and slope from it, then do the remaining calculations

find the coefficients that minimize the metric
from scipy.optimize import minimize

minimize(fun=calc_neg_log_likelihood, x0=[0, 0])
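
putting the pieces together, here is a self-contained sketch that runs without the churn data: the dataset is synthetic (generated with a true intercept of -1 and a true slope of 2, both invented), and the np.clip guard against log(0) is an addition that is not part of the notes above

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import logistic

# Synthetic data, made up for illustration: true intercept -1, true slope 2
rng = np.random.default_rng(42)
x_actual = rng.normal(size=2000)
y_actual = rng.binomial(1, logistic.cdf(-1 + 2 * x_actual))

def calc_neg_log_likelihood(coeffs):
    # Unpack the coefficients
    intercept, slope = coeffs
    # Predicted probabilities from the logistic function
    y_pred = logistic.cdf(intercept + slope * x_actual)
    # Added safeguard: keep predictions away from exactly 0 or 1 to avoid log(0)
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    # Per-observation log-likelihood
    log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)
    # Negative sum, so that minimizing it maximizes the log-likelihood
    return -np.sum(log_likelihood)

result = minimize(fun=calc_neg_log_likelihood, x0=[0, 0])
# The estimates should land near the true intercept and slope
print(result.x)
```

result.x holds the fitted intercept and slope, and result.fun holds the minimized negative log-likelihood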

In [None]:
#EXERCISE
#since we want to maximize log-likelihood, but minimize() finds minimum values, it is easier to calculate the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import logistic
from statsmodels.formula.api import logit

# assumed setup, based on the formula in the logit() call below
x_actual = churn["time_since_last_purchase"]
y_actual = churn["has_churned"]

def calc_neg_log_likelihood(coeffs):
    # Unpack coeffs
    intercept, slope = coeffs
    # Calculate predicted y-values
    y_pred = logistic.cdf(intercept + slope * x_actual)
    # Calculate log-likelihood
    log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)
    # Calculate negative sum of log_likelihood
    neg_sum_ll = -np.sum(log_likelihood)
    # Return negative sum of log_likelihood
    return neg_sum_ll
  
# Call minimize on calc_neg_log_likelihood
print(minimize(fun=calc_neg_log_likelihood,
               x0=[0, 0]))

# Compare the output with the logit() call.
print(logit("has_churned ~ time_since_last_purchase", data=churn).fit().params)
