# Logistic regression

## Lecture 4

### GRA 4160
### Predictive modelling with machine learning

#### Lecturer: Vegard H. Larsen

Logistic regression is a widely used supervised learning algorithm for binary classification.
It models the relationship between a set of input features and a binary outcome (0 or 1) by passing a linear combination of the features through the logistic function.

The logistic regression model can be represented mathematically by the following equation:

$$ \hat{y} = \frac{1}{1 + e^{-z}} $$

where $z$ is a linear combination of the input features and parameters (weights), represented as:

$$ z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m $$
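
The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the lecture's Titanic example: the parameter values `beta_0` and `beta` and the observation `x` are made up so that the arithmetic is easy to follow by hand.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters (assumed for illustration only)
beta_0 = -1.0                 # intercept
beta = np.array([0.5, 2.0])   # one weight per input feature

# One observation with two features
x = np.array([2.0, 0.0])

z = beta_0 + x @ beta         # linear combination: -1 + 0.5*2 + 2.0*0 = 0
y_hat = sigmoid(z)            # sigmoid(0) = 0.5
print(z, y_hat)               # 0.0 0.5
```

Large positive values of `z` push the output towards 1, and large negative values push it towards 0.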

The parameters (weights) are learned during training, where the model is fit to the training data by minimizing a loss function. For logistic regression this loss is the negative log-likelihood, also known as log loss or cross-entropy.

The likelihood function of a logistic regression model gives the probability of the observed class labels as a function of the model parameters.
The parameters are chosen to maximize this probability, that is, to make the observed training data as probable as possible under the model.

Mathematically, the likelihood function for a logistic regression model with $n$ data points and $m$ input features is given by:

$$ L(\beta) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i} $$

where $\beta = (\beta_0, \beta_1, \dots, \beta_m)$ is the vector of model parameters, $y_i$ is the binary class label for the $i$-th data point, and $p_i$ is the predicted probability of the positive class for the $i$-th data point. Each factor equals $p_i$ when $y_i = 1$ and $1 - p_i$ when $y_i = 0$. The probability $p_i$ is given by the logistic function:

$$ p_i = \frac{1}{1 + e^{-z_i}} $$

and $z_i$ is the linear combination of the input features and the model parameters for the $i$-th data point:

$$ z_i = \beta_0 + \sum_{j=1}^m \beta_j x_{ij} $$

The goal of logistic regression is to find the values of the model parameters that maximize the likelihood function, given the observed data.
This is typically done using optimization algorithms, such as gradient descent or L-BFGS.
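
In practice the optimizer usually works with the logarithm of the likelihood, which turns the product into a sum and avoids numerical underflow. The sketch below evaluates that log-likelihood on a tiny made-up data set; the function name `log_likelihood` and the data are assumptions for illustration only.

```python
import numpy as np

def log_likelihood(beta_0, beta, X, y):
    """Log of the Bernoulli likelihood for a logistic regression model."""
    z = beta_0 + X @ beta
    p = 1.0 / (1.0 + np.exp(-z))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny made-up data set: 4 observations, 1 feature
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

# With all parameters at zero every p_i is 0.5, so the value is 4 * log(0.5)
ll_zero = log_likelihood(0.0, np.array([0.0]), X, y)

# A positive slope assigns higher probability to the observed labels
ll_slope = log_likelihood(0.0, np.array([1.0]), X, y)

print(ll_zero, ll_slope)
```

The second parameter vector gives a higher log-likelihood, so it fits these four points better than the all-zero vector.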

The predicted outcome, $\hat{y}$, is the estimated probability of the positive class (1), so it always lies between 0 and 1.
To turn this probability into a class label, a threshold is applied: with the common choice of 0.5, observations with $\hat{y} > 0.5$ are classified as 1 and the rest as 0.
Other thresholds can be used when the two types of misclassification have different costs.
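
As a minimal sketch of the thresholding step, the snippet below converts a handful of made-up probabilities into class labels. Assigning a probability of exactly 0.5 to class 0 is a choice made in this sketch, and the alternative threshold of 0.3 is arbitrary.

```python
import numpy as np

# Made-up predicted probabilities of the positive class
probs = np.array([0.10, 0.49, 0.50, 0.51, 0.90])

# Default rule: class 1 only if the probability is strictly above 0.5
labels_default = (probs > 0.5).astype(int)

# A lower threshold labels more observations as positive
labels_low = (probs > 0.3).astype(int)

print(labels_default)  # [0 0 0 1 1]
print(labels_low)      # [0 1 1 1 1]
```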

When there are more than two outcomes, logistic regression can be extended to multiclass classification using one of the following techniques:

1. **One vs All (OvA)**: Also called one-vs-rest. One binary classifier is trained per class, treating that class as positive and all other classes as negative. The class whose classifier gives the highest predicted probability is chosen as the final prediction.
2. **Softmax Regression (Multinomial Logistic Regression)**: This involves directly modeling the probability distribution over all classes using the softmax function. The softmax function computes the exponential of each input and then normalizes the result to produce a probability distribution over the classes.

Both of these methods allow logistic regression to be applied to multiclass classification problems, and the choice between them often depends on the size and structure of the data, as well as computational considerations.
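
The softmax step described above can be sketched as follows. The class scores are made up, and subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of real-valued class scores into probabilities."""
    shifted = scores - np.max(scores)   # avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Made-up linear scores z_k for three classes
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)

print(probs)           # one probability per class
print(probs.sum())     # the probabilities sum to 1
print(probs.argmax())  # index of the predicted class
```

With two classes, softmax reduces to the logistic function applied to the difference between the two scores.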

## Predicting the survival of passengers on the Titanic using logistic regression


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('../data/titanic/train.csv')

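# Preprocess the data
# dropna() removes every row with a missing value in ANY column,
# including columns that are not used as features below
df = df.dropna()
# Encode Sex as a binary indicator: 1 = male, 0 = female
df['Sex'] = (df['Sex'] == 'male').astype(int)

# Select the features and the target, then split into training and test sets
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)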

## Training a logistic regression model

Training a logistic regression model means finding the coefficients that maximize the likelihood of observing the training data, given the model.
There is no closed-form solution, so the coefficients are estimated using optimization algorithms, such as gradient descent or the limited-memory BFGS method.
The optimization problem is solved iteratively, with the algorithm updating the coefficients at each iteration based on the gradient of the log-likelihood (or, equivalently, of the loss function).
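
To make the iterative update concrete, here is a minimal sketch of plain gradient descent on the average negative log-likelihood. It is not how scikit-learn fits the model. The synthetic data, the learning rate `lr`, and the number of iterations are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from a known logistic model (assumed values)
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([1.5, -2.0])
p_true = 1.0 / (1.0 + np.exp(-(0.5 + X @ true_beta)))
y = (rng.random(n) < p_true).astype(float)

# Add a column of ones so the intercept is estimated like any other weight
X1 = np.column_stack([np.ones(n), X])

def neg_log_lik(beta):
    """Average negative log-likelihood (log loss)."""
    p = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

beta = np.zeros(3)   # start from all-zero coefficients
lr = 0.5             # learning rate (step size)
loss_start = neg_log_lik(beta)

for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
    grad = X1.T @ (p - y) / n    # gradient of the average log loss
    beta -= lr * grad            # step against the gradient

loss_end = neg_log_lik(beta)
print("Loss before:", loss_start, "after:", loss_end)
print("Estimated coefficients:", beta)
```

The gradient has the simple form $X^\top (p - y) / n$, which is why each iteration only needs the current predicted probabilities.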

In scikit-learn, the following solvers can be used for training logistic regression models:

- "newton-cg" - Uses the Newton-CG method.
- "lbfgs" - Uses the limited-memory BFGS method. This is the default solver.
- "liblinear" - Uses the LIBLINEAR library, which is based on coordinate descent. A good choice for small data sets.
- "sag" - Uses Stochastic Average Gradient descent. Fast on large data sets when the features are on a similar scale.
- "saga" - Uses a variant of SAG that also supports L1 and elastic-net penalties.

The choice of solver will depend on the size and characteristics of your data and the specific requirements of your problem.
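
As a rough illustration, the loop below fits the same model with each solver listed above and records the training accuracy. The synthetic data from `make_classification`, the feature scaling, and `max_iter=1000` are assumptions of this sketch; on a small, well-behaved problem like this one the solvers are expected to give very similar results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem (not the Titanic data)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# "sag" and "saga" converge much faster when features share a similar scale
X = StandardScaler().fit_transform(X)

solvers = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
train_accuracy = {}
for solver in solvers:
    model = LogisticRegression(solver=solver, max_iter=1000)
    model.fit(X, y)
    train_accuracy[solver] = model.score(X, y)

for solver, acc in train_accuracy.items():
    print(f"{solver:>10}: {acc:.3f}")
```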

Once the coefficients have been estimated, the model can be used to make predictions on new data.
To make a prediction for a new data point, each input feature is multiplied by its estimated coefficient, the products are summed together with the intercept, and the result is passed through the logistic function.
The output of the logistic function represents the predicted probability of the positive class.
A threshold is then applied to the predicted probability to determine the final binary class prediction.
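
The prediction steps above can be reproduced by hand from a fitted scikit-learn model using its `coef_` and `intercept_` attributes. The sketch below uses synthetic data rather than the Titanic data and checks that the manual calculation matches `predict_proba` and `predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to have a fitted model to inspect
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
model = LogisticRegression(solver="lbfgs").fit(X, y)

x_new = X[:3]  # treat the first three rows as "new" observations

# Step 1: linear combination of features and coefficients, plus intercept
z = model.intercept_ + x_new @ model.coef_.ravel()
# Step 2: logistic function gives the probability of the positive class
p_manual = 1.0 / (1.0 + np.exp(-z))
# Step 3: threshold at 0.5 to get the class label
labels_manual = (p_manual > 0.5).astype(int)

# Compare with scikit-learn's own predictions
p_sklearn = model.predict_proba(x_new)[:, 1]
print(np.allclose(p_manual, p_sklearn))
print(np.array_equal(labels_manual, model.predict(x_new)))
```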

In [2]:
# Train a logistic regression model
log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train, y_train)

# Predict the outcome on the test set
y_pred = log_reg.predict(X_test)

# Evaluate the model's accuracy
score = accuracy_score(y_test, y_pred)
# Use the built-in round(), which works whether the score is a Python or a NumPy float
print("Accuracy:", round(score, 3))
print("Survival rate in test set:", round(y_test.mean(), 3))

Accuracy: 0.892
Survival rate in test set: 0.541


In [5]:
# Let's look at the model's parameters

pd.DataFrame(log_reg.coef_)

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | -0.325814 | -2.136436 | -0.021277 | 0.136485 | -0.242407 | 0.002133 |


In [12]:
# Access the probability of the positive class

In [6]:
# Fit the Logistic Regression model
X = np.array([[1, 2], [2, 4], [3, 6], [4, 8]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression(solver='lbfgs').fit(X, y)

# Obtain predicted probabilities for a new observation
# Each row of the result is [P(y=0), P(y=1)], following the order in clf.classes_
x_new = np.array([[5, 10]])
probabilities = clf.predict_proba(x_new)
print(probabilities)

[[0.00632882 0.99367118]]
