# Logistic Regression Introduction

Logistic regression is a statistical method used for binary classification tasks, where the goal is to predict a categorical outcome that has two classes (e.g., "yes" or "no", "spam" or "not spam"). Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of a given input belonging to a certain class.

The output of logistic regression is constrained to be between 0 and 1, representing the probability that an observation belongs to a particular class.


The sigmoid (logistic) function, $\sigma(z) = \frac{1}{1 + e^{-z}}$, maps any real-valued input to a value between 0 and 1. Logistic regression uses it to convert the output of a linear equation into a probability.
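
As a quick illustration of this squashing behaviour, the sketch below assumes the standard sigmoid form $1 / (1 + e^{-z})$. The name `sigmoid` and the sample inputs are chosen here for illustration and are not part of the original notebook.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical sample inputs standing in for the output of a linear equation
for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"z = {z:5.1f}  ->  sigmoid(z) = {sigmoid(z):.4f}")
```

An input of 0 maps to exactly 0.5, large negative inputs approach 0, and large positive inputs approach 1.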

We will use the famous **Iris Dataset** to classify whether a flower belongs to a specific species (Setosa or not).

Let's begin by loading the necessary libraries and the dataset.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

import statsmodels.api as sm


We are going to start with some data on irises (the flowering plant).  This data set has four features (predictors), all of which are measurements of an iris flower in centimeters.
These measurements, in order, are: _sepal length_, _sepal width_, _petal length_ and _petal width_.

The data also record the species of each flower. There are three species, _Iris Setosa_, _Iris Versicolour_, and _Iris Virginica_, coded as 0, 1, and 2, respectively.

In [None]:
# Load the Iris dataset
iris = load_iris()
print(iris)

Next we define two functions: `logit(p)`, which takes a probability and returns the log odds, and `ilogit(logodds)`, which takes log odds and returns a probability.  The latter is the inverse of the former.

In [None]:

def logit(p):
    """
    Calculates the logit of a probability p.

    Args:
        p (float or numpy.ndarray): Probability value(s) between 0 and 1.

    Returns:
        float or numpy.ndarray: Logit value(s).
    """
    p = np.asarray(p)
    return np.log(p / (1 - p))

def ilogit(logodds):
    """
    Calculates the inverse logit (inverse log odds) of a value.

    Args:
        logodds (float or numpy.ndarray): Log-odds value(s).

    Returns:
        float or numpy.ndarray: Probability value(s) between 0 and 1.
    """
    logodds = np.asarray(logodds)
    return np.exp(logodds) / (1 + np.exp(logodds))

In [None]:
# here are some sample values from these two functions.
print(logit(0.4))
print(logit(0.8))
print(ilogit(0))
print(ilogit(-2))
print(ilogit(2))
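
To make the "inverse" relationship concrete, here is a small self-contained check, assuming the same two formulas as the cells above. The probabilities used are arbitrary illustrative values.

```python
import numpy as np

def logit(p):
    """Log odds of a probability p (same formula as in the notebook)."""
    p = np.asarray(p)
    return np.log(p / (1 - p))

def ilogit(logodds):
    """Probability corresponding to the given log odds."""
    logodds = np.asarray(logodds)
    return np.exp(logodds) / (1 + np.exp(logodds))

# Illustrative probabilities, chosen arbitrarily
probs = np.array([0.1, 0.4, 0.5, 0.8, 0.95])
round_trip = ilogit(logit(probs))

print(logit(probs))                    # log odds; a probability of 0.5 maps to 0
print(np.allclose(round_trip, probs))  # True if ilogit undoes logit
```

Probabilities below 0.5 give negative log odds and probabilities above 0.5 give positive log odds.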

Next we load the iris data again, keep only the sepal length as our predictor, and recode the species as a binary target.

In [None]:
# Load the Iris dataset
iris = load_iris()
# we will start with just the sepal length
X = iris.data[:,0]
print(X)
y = iris.target

# For simplicity, let's classify whether the flower is of the species 'Setosa' or not.
y = (y == 0).astype(int)  # 1 if Setosa, 0 otherwise





Now, let's create a logistic regression model trying to predict our target, _y_, using the
_sepal length_.

In [None]:
print(X.mean())
# let's center X by subtracting the mean from each value.
X = X - X.mean()

# statsmodels does not add an intercept automatically, so we need to add a
# column of 1's (a constant) to the predictor array

X = sm.add_constant(X)

# Fit logistic regression model using statsmodels
model_sm = sm.Logit(y, X).fit()

# Print the summary table, which includes p-values
print(model_sm.summary())
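
The centering and "column of 1's" steps in the cell above can be pictured with plain NumPy. This is only a sketch: the sepal lengths are hypothetical, and `X_design` is a name introduced here. It mirrors the layout that `sm.add_constant` produces for a one-dimensional input, with the constant column first.

```python
import numpy as np

# Hypothetical sepal lengths in cm, for illustration only
x = np.array([5.1, 4.9, 6.3, 5.8])

# Center the predictor by subtracting its mean
x_centered = x - x.mean()

# Prepend a column of 1's so the model can estimate an intercept
X_design = np.column_stack((np.ones(len(x_centered)), x_centered))

print(X_design.shape)
print(X_design)
```

With this layout, the first fitted coefficient is the intercept and the second is the slope for the centered predictor.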

Let's interpret these coefficients.  The interpretations are similar to those for linear regression, but the response is the log odds of the event, in this case the log odds of being an 'iris setosa'.

So -2.41 is the predicted log odds that an iris with the mean sepal length (5.84 cm) is of the species 'setosa'.

For each additional cm of sepal length, the predicted log odds of an iris being of the 'setosa' species decrease by 5.18 (the coefficient is -5.18).
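
Log odds can be hard to picture, so here is a minimal sketch that re-expresses the coefficients quoted above as an odds ratio and as probabilities. It assumes the rounded values -2.41 and -5.18 from the text; `prob_from_log_odds` is a helper name introduced here.

```python
import math

# Coefficients as quoted in the text above (rounded)
intercept = -2.41   # log odds at the mean sepal length
slope = -5.18       # change in log odds per additional cm

# A change of `slope` in log odds multiplies the odds by exp(slope)
odds_ratio = math.exp(slope)

def prob_from_log_odds(log_odds):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Predicted probability of 'setosa' 1 cm below, at, and 1 cm above the mean
for offset in (-1.0, 0.0, 1.0):
    log_odds = intercept + slope * offset
    print(f"offset {offset:+.0f} cm: log odds = {log_odds:6.2f}, "
          f"probability = {prob_from_log_odds(log_odds):.4f}")

print(f"odds ratio per cm: {odds_ratio:.4f}")
```

An odds ratio far below 1 means the odds of being a 'setosa' shrink sharply with each additional cm of sepal length.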


Next we find the predicted probability of being a 'setosa' for the mean sepal length.

In [None]:
# for a log-odds of -2.41, find the probability
ilogit(-2.41)
# here that is 0.0824.

Next let's get the predicted probabilities for all of the data.  

By default, the `predict` method of our statsmodels model returns probabilities.  Logistic regression implementations in other Python packages
may instead return predicted log odds, probabilities, or the predicted category (here 0 or 1) as their output.

In [None]:
# get predicted probabilities of being a 1 
# which here represents being an 'iris setosa'
pred_probs=np.round(model_sm.predict(X),3)
print(pred_probs)

# convert probabilities to ones and zeros
pred_cat = (pred_probs>0.5).astype(int)
print(pred_cat)
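
The three kinds of output mentioned above (log odds, probabilities, and categories) are related by simple transformations. The sketch below assumes the rounded coefficients quoted earlier (-2.41 and -5.18) and a few hypothetical centered sepal lengths; it does not refit the model.

```python
import numpy as np

# Coefficients as quoted in the text (rounded); lengths are hypothetical
intercept, slope = -2.41, -5.18
centered_length = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # cm from the mean

log_odds = intercept + slope * centered_length   # linear predictor
probs = 1.0 / (1.0 + np.exp(-log_odds))          # probabilities
labels = (probs > 0.5).astype(int)               # predicted categories

print(np.round(log_odds, 2))
print(np.round(probs, 3))
print(labels)
```

Thresholding the probability at 0.5 gives the same categories as thresholding the log odds at 0.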

Next we'll look at a new dataset about whether or not students have overdrawn a checking account (taken out more money than the account holds).

In [None]:
# read in the overdrawn data
overdrawn = pd.read_csv("https://webpages.charlotte.edu/mschuck1/classes/DTSC2302/Overdrawn.csv")
# drop rows with missing values, then get info about these data
overdrawn.dropna(inplace=True)
overdrawn.info()

The data is about students.  Here are the details on the variables in these data:

_Age_ =	Age of the student (in years)

_Sex_ =	0=male or 1=female

_DaysDrink_ = Number of days drinking alcohol (in past 30 days)

_Overdrawn_ = Has student overdrawn a checking account? 0=no or 1=yes

The first regression that we will do will use _DaysDrink_ as a predictor.

In [None]:
overdrawn_X = overdrawn[['DaysDrink']]
overdrawn_X = sm.add_constant(overdrawn_X)

overdrawn_y = overdrawn['Overdrawn']
print(overdrawn_X)
model2_sm = sm.Logit(overdrawn_y, overdrawn_X).fit()

# Print the summary table, which includes p-values
print(model2_sm.summary())

We get a summary similar to the one we had previously for linear regression, and again we have coefficients to interpret.

So for a student who drank on zero days in the last 30 days, the predicted log odds of being overdrawn are -2.33.

For each additional day that a student drank in the last 30 days, we would predict that the log odds of being overdrawn increase by 0.0541.
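
To see what these log-odds statements imply on the probability scale, here is a minimal sketch assuming the rounded coefficients quoted above (-2.33 and 0.0541). The function name `prob_overdrawn` and the chosen values of _DaysDrink_ are illustrative.

```python
import math

# Coefficients as quoted in the text above (rounded)
intercept = -2.33    # log odds at DaysDrink = 0
slope = 0.0541       # change in log odds per additional drinking day

def prob_overdrawn(days_drink):
    """Predicted probability of being overdrawn under the quoted coefficients."""
    log_odds = intercept + slope * days_drink
    return 1.0 / (1.0 + math.exp(-log_odds))

# Each extra day multiplies the odds by exp(slope)
odds_ratio_per_day = math.exp(slope)
print(f"odds ratio per day: {odds_ratio_per_day:.4f}")

# Illustrative values of DaysDrink
for days in (0, 10, 30):
    print(f"{days:2d} days -> probability {prob_overdrawn(days):.3f}")
```

An odds ratio slightly above 1 means each additional drinking day raises the odds of being overdrawn by a few percent.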

Why might we not want to center the data here?

In [None]:

overdrawn_X = overdrawn[['Sex','Age']]
overdrawn_X = sm.add_constant(overdrawn_X)

overdrawn_y = overdrawn['Overdrawn']
print(overdrawn_X)
model2_sm = sm.Logit(overdrawn_y, overdrawn_X).fit()

# Print the summary table, which includes p-values
print(model2_sm.summary())

Let's interpret these coefficients.  

First, our intercept, -7.05, is the predicted log odds of a male of age zero being overdrawn.

Next, the coefficient for _Sex_ is 0.9671, which means that a female has a predicted log odds of being overdrawn that is 0.97 higher than that of a male of the same age.

For each additional year of age, the predicted log odds of being overdrawn increase by 0.23, assuming that _Sex_ remains the same.
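
With two predictors, the predicted log odds are the intercept plus each coefficient times its predictor value. The sketch below assumes the rounded coefficients quoted above and two hypothetical 20-year-old students; `prob_overdrawn` is a name introduced here.

```python
import math

# Coefficients as quoted in the text above (rounded)
b_const, b_sex, b_age = -7.05, 0.9671, 0.23

def prob_overdrawn(sex, age):
    """Predicted probability; sex is 0 for male and 1 for female."""
    log_odds = b_const + b_sex * sex + b_age * age
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical 20-year-old students
p_male = prob_overdrawn(sex=0, age=20)
p_female = prob_overdrawn(sex=1, age=20)
print(f"male, age 20:   {p_male:.3f}")
print(f"female, age 20: {p_female:.3f}")

# The gap in log odds equals the Sex coefficient at any age
gap = math.log(p_female / (1 - p_female)) - math.log(p_male / (1 - p_male))
print(f"difference in log odds: {gap:.4f}")
```

The difference in probabilities depends on age, but the difference in log odds between a female and a male of the same age is always the _Sex_ coefficient.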

Below we center age so that our intercept coefficient is more interpretable.

In [None]:
print(overdrawn['Age'].mean())
overdrawn['Age']=overdrawn['Age']-overdrawn['Age'].mean()
overdrawn_X = overdrawn[['Sex','Age']]
overdrawn_X = sm.add_constant(overdrawn_X)
overdrawn_y = overdrawn['Overdrawn']
print(overdrawn_X)
model2_sm = sm.Logit(overdrawn_y, overdrawn_X).fit()

# Print the summary table, which includes p-values
print(model2_sm.summary())

Note that after centering _Age_ the coefficients for _Sex_ and _Age_ remain the same, but the coefficient for _const_ changes.  

From the above code, we can see that the average age of students in these data was 19.6 years.  Having subtracted that from _Age_, we now interpret the intercept as follows:
the predicted log odds that a 19.6-year-old male has overdrawn their bank account is -2.56.
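
The change in the intercept can be checked by hand: centering replaces _Age_ with _Age_ minus its mean, so the new intercept should be the old intercept plus the _Age_ coefficient times the mean age. The sketch below assumes the rounded values quoted in the text, so only approximate agreement is expected.

```python
# Values as quoted in the text above (rounded)
intercept_uncentered = -7.05   # log odds at Age = 0, Sex = 0
b_age = 0.23                   # Age coefficient (unchanged by centering)
mean_age = 19.6                # sample mean of Age
intercept_reported = -2.56     # intercept after centering, from the summary

# Centering replaces Age with (Age - mean_age), so the intercept becomes
# the log odds at Age = mean_age: old intercept + b_age * mean_age
intercept_centered = intercept_uncentered + b_age * mean_age

print(f"implied centered intercept:  {intercept_centered:.3f}")
print(f"reported centered intercept: {intercept_reported:.2f}")
```

The two values differ only slightly, which is consistent with the coefficients and the mean having been rounded.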

### Tasks

1. Using the overdrawn data, fit a model that has _Age_, _Sex_ and an interaction term between _Age_ and _Sex_.  Interpret all of the coefficients.

2. Again with the overdrawn data, fit a model with all of the predictors, including the interaction term from the previous task.  Which of the coefficients are discernibly different from zero?

3. Using the model from Task 2, remove any predictors whose coefficients are not discernibly different from zero and get the predicted probabilities of being overdrawn.

4. Convert the predicted probabilities in the previous task to whether or not a student was overdrawn.  Is the predicted number of overdrawn students the same as the actual number of overdrawn students?