# Logistic Regression

In [1]:
# import numpy, pandas, and the statsmodels formula API
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

## Loading Data



In [2]:
# load the data and keep only the columns used in the model
titanic = pd.read_csv("titanic.csv")
titanic = titanic[["survived", "pclass", "sex", "age", "embark_town"]]
# drop rows with missing values
titanic = titanic.dropna()

In [3]:
# features: every column except the response
X = titanic.drop(columns=["survived"])
y = titanic['survived']

We print the shape of X, and inspect the top 5 rows.

In [4]:
print(X.shape)
X.head()

| |pclass|sex|age|embark_town|
| ------------- |-------------|-------------|-------------|-------------|
|0|3|male|22.0|Southampton|
|1|1|female|38.0|Cherbourg|
|2|3|female|26.0|Southampton|
|3|1|female|35.0|Southampton|
|4|3|male|35.0|Southampton|


We also print the first 5 rows of y.

In [5]:
y.head()

0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64

## Limitations for Linear Regression

In linear regression, we allow the response to take on any real number. But what if the range of the response is restricted?

1. Positive values: river flow. 
    - Lower limit: 0
2. Percent/proportion data: proportion of income spent on housing in Vancouver. 
    - Lower limit: 0
    - Upper limit: 1. 
3. Binary data: success/failure data.
    - Only take values of 0 and 1.
4. Count data: number of male crabs near a nesting female.
    - Only take count values (0, 1, 2, ...)

The problem in all four cases: **the regression line (i.e. the values predicted by linear regression) can extend beyond the possible range of the response**.

This is *mathematically incorrect*: the expected value of Y cannot lie outside the range of Y.

The *practical* consequences:
- A linear regression model fitted to a range-restricted response cannot safely be used to extrapolate, because its predictions eventually fall outside the values the response can actually take.

- However, a linear regression model can still be useful in these settings, as long as the linear trend fits well within the range of the observed data.
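
The extrapolation problem can be illustrated with a minimal sketch. The toy data below is hypothetical (it is not the Titanic data), and the straight line is fitted with `np.polyfit` purely for illustration.

```python
import numpy as np

# Hypothetical toy data: a binary response that tends to switch from 0 to 1 as x grows.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Ordinary least-squares line: y_hat = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# One point inside the observed range of x and two points far outside it.
x_new = np.array([-5.0, 4.5, 15.0])
y_hat = slope * x_new + intercept

for xi, yi in zip(x_new, y_hat):
    print(f"x = {xi:5.1f} -> predicted value = {yi:.3f}")
```

Inside the observed range the prediction stays between 0 and 1, but the two extrapolated predictions fall below 0 and above 1, which is impossible for a binary response.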


## Classification and Regression Problems (Quick Review)


![img](https://static.javatpoint.com/tutorial/machine-learning/images/regression-vs-classification-in-machine-learning.png)

## Logistic Regression Models

Logistic regression was used in the biological sciences in the early twentieth century and was later adopted in many social science applications. It is used when the dependent variable (target) is categorical. For example, it is used to predict whether an email is spam (1) or not (0), or whether a tumor is malignant (1) or not (0).

Regression analysis is a set of statistical processes used to estimate the relationship between a dependent variable and one or more independent variables.

The logistic regression model is given by:
![logistic regression](img/logistic.png)
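
In standard notation, the model for a binary response can be written as follows. The symbols are assumptions of this sketch rather than taken from the figure: $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients, and $x_1, \dots, x_p$ are the features.

$$
P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}
$$

Equivalently, the log-odds of the positive class are linear in the features:

$$
\log \frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
$$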


Key properties of logistic regression:

- It is a linear model for classification.
- It learns a weight for each feature, plus a bias (intercept) term.
- Its decision boundary is a hyperplane that splits the feature space into two regions, which is why it is called a linear classifier.
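
The link between the probability and the linear decision boundary can be sketched in a few lines. The weights, bias, and points below are hypothetical values chosen for illustration, not estimates from the Titanic model.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a model with two features.
weights = np.array([1.5, -2.0])
bias = 0.5

# A few illustrative points in feature space.
points = np.array([[0.0, 0.0], [2.0, 1.0], [1.0, 3.0], [-1.0, 0.5]])

scores = points @ weights + bias      # linear part: w . x + b
probs = sigmoid(scores)               # predicted P(y = 1 | x)
labels = (probs >= 0.5).astype(int)   # classify with a 0.5 threshold

# Thresholding the probability at 0.5 is the same as checking the sign of the score,
# so the boundary between the two classes is the hyperplane w . x + b = 0.
same_as_linear_rule = np.array_equal(labels, (scores >= 0).astype(int))
print(labels, same_as_linear_rule)
```

Because the sigmoid is monotonic, a probability of at least 0.5 corresponds exactly to a non-negative linear score, which is what makes the classifier linear.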

## Creating a Simple Logistic Regression Model (using the `statsmodels` package)


In [6]:
logit = smf.logit("survived ~ sex + age + embark_town + pclass", data=titanic)

In [7]:
model = logit.fit()
print_model = model.summary()
print(print_model)

Optimization terminated successfully.
         Current function value: 0.451367
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               survived   No. Observations:                  712
Model:                          Logit   Df Residuals:                      706
Method:                           MLE   Df Model:                            5
Date:                Sun, 07 Apr 2024   Pseudo R-squ.:                  0.3311
Time:                        01:35:04   Log-Likelihood:                -321.37
converged:                       True   LL-Null:                       -480.45
Covariance Type:            nonrobust   LLR p-value:                 1.247e-66
                                 coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                      5.2698      0.523     10.071      0.000      

## Accessing model parameters

In statsmodels, the `fit()` method returns a results object. The model coefficients, standard errors, p-values, etc., are all available from this results object.

Conveniently, these are stored as pandas Series with the parameter names as the index.

In [8]:
# Inspect parameters

model.params

Intercept                     5.269797
sex[T.male]                  -2.518666
embark_town[T.Queenstown]    -0.813194
embark_town[T.Southampton]   -0.478896
age                          -0.036276
pclass                       -1.211837
dtype: float64
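
One way to read these coefficients is on the odds scale. The sketch below copies the values printed above into a plain dictionary and assumes, based on the parameter names, that `female` and `Cherbourg` are the reference levels and that `pclass` enters the model as a number.

```python
import numpy as np

# Coefficients copied from the model.params output above.
params = {
    "Intercept": 5.269797,
    "sex[T.male]": -2.518666,
    "embark_town[T.Queenstown]": -0.813194,
    "embark_town[T.Southampton]": -0.478896,
    "age": -0.036276,
    "pclass": -1.211837,
}

# Exponentiating a coefficient gives an odds ratio: the multiplicative change
# in the odds of survival for a one-unit increase in that predictor.
odds_ratios = {name: np.exp(value) for name, value in params.items() if name != "Intercept"}

# Predicted survival probability for the first passenger in X:
# male, age 22, 3rd class, embarked at Southampton.
log_odds = (
    params["Intercept"]
    + params["sex[T.male]"]
    + params["embark_town[T.Southampton]"]
    + params["age"] * 22.0
    + params["pclass"] * 3
)
prob = 1.0 / (1.0 + np.exp(-log_odds))

for name, ratio in odds_ratios.items():
    print(f"{name:28s} odds ratio = {ratio:.3f}")
print(f"predicted survival probability = {prob:.3f}")
```

An odds ratio below 1 means the predictor lowers the odds of survival, holding the other predictors fixed. All five odds ratios here are below 1 because every non-intercept coefficient is negative.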

Here are some of the most useful attributes and methods of the results object for a logistic regression.


|Attr/func|Description|
| ------------- |-------------|
|params|Estimated model coefficients. Shown as `coef` when calling `summary()` on a fitted model|
|bse|Standard error of each coefficient|
|tvalues|Test statistic of each coefficient (coefficient divided by its standard error). For a Logit model this is the `z` column of `summary()`|
|pvalues|p-value of each coefficient|
|conf_int(alpha)|Method that returns the confidence interval of each coefficient at level 1 - alpha. Example: `model.conf_int(0.05)` gives 95% intervals|
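
The attributes in the table can be collected into a single summary frame. The sketch below fits a small model on hypothetical synthetic data (the names `demo`, `x`, and `y` are made up for illustration) so that it runs without `titanic.csv`. The same attribute calls should apply to the `model` object fitted above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical synthetic data: one numeric feature x and a noisy binary response y.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x)))
demo = pd.DataFrame({"x": x, "y": rng.binomial(1, p)})

# disp=0 suppresses the optimizer's convergence message.
result = smf.logit("y ~ x", data=demo).fit(disp=0)

estimates = result.params            # coefficient estimates (pandas Series)
std_errors = result.bse              # standard error of each coefficient
z_stats = result.tvalues             # coefficient / standard error
p_values = result.pvalues            # p-value of each coefficient
conf = result.conf_int(alpha=0.05)   # 95% intervals (DataFrame with columns 0 and 1)

overview = pd.DataFrame({
    "coef": estimates,
    "std err": std_errors,
    "z": z_stats,
    "p-value": p_values,
    "lower": conf[0],
    "upper": conf[1],
})
print(overview)
```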



## Evaluating Logistic Regression Model

There are two ways to evaluate the model:

1. Examine the model output
2. Use the `.pred_table()` method

In [9]:
model.pred_table()

array([[357.,  67.],
       [ 79., 209.]])
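
The table returned by `pred_table()` is a confusion matrix: entry `[i, j]` counts the passengers whose observed class is `i` and whose predicted class is `j`, using the default threshold of 0.5. As a minimal sketch, the usual summary metrics can be computed from the numbers printed above, which are copied here into a NumPy array.

```python
import numpy as np

# Confusion matrix copied from the model.pred_table() output above.
# Rows are the observed classes (0, 1); columns are the predicted classes (0, 1).
table = np.array([[357.0, 67.0],
                  [79.0, 209.0]])

tn, fp = table[0]   # observed 0: correctly / incorrectly predicted
fn, tp = table[1]   # observed 1: incorrectly / correctly predicted

accuracy = (tp + tn) / table.sum()   # share of all predictions that are correct
sensitivity = tp / (tp + fn)         # share of survivors predicted as survivors
specificity = tn / (tn + fp)         # share of non-survivors predicted as non-survivors
precision = tp / (tp + fp)           # share of predicted survivors who survived

print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"precision   = {precision:.3f}")
```

Correct predictions lie on the diagonal, so the model classifies 566 of the 712 passengers correctly, an accuracy of about 79%.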