# Logistic Regression

## Lecture 4

### GRA 4160: Predictive Modelling with Machine Learning

#### Lecturer: Vegard H. Larsen

---
### Overview
- Basic theory of Logistic Regression
- Binary classification interpretation
- Multi-class extension
- Titanic survival prediction example

In this notebook, we will revisit the fundamental concept of **Logistic Regression**, a popular classification algorithm, and demonstrate how to apply it to a well-known dataset: **Titanic** passenger survival data.


## Logistic Regression Fundamentals

**Logistic Regression** is commonly used for binary classification (e.g., 0 or 1, "yes" or "no"). Despite its name, it is used as a **classification** technique: the model estimates the probability of class membership, and a threshold turns that probability into a class label.

1. **Model Form**:
    $$
    \hat{y} = \frac{1}{1 + e^{-z}}\quad \text{where}\quad z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p.
    $$
    
    - $\hat{y}$ represents the **predicted probability** of the positive class (labelled as 1).
    - $z$ is the linear combination of input features and parameters (weights).

2. **Interpreting $\beta_j$**:
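    - Each coefficient $\beta_j$ is the change in the **log-odds** of the outcome per unit increase in the associated feature $x_j$, holding the other features fixed. Equivalently, $e^{\beta_j}$ is the factor by which the odds are multiplied.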
    - "Log-odds" means $\ln\left(\frac{p}{1-p}\right)$, where $p$ is the probability of the positive class.

3. **Likelihood Maximization**:
    - Logistic Regression is typically fit by **maximizing** the **likelihood** (or equivalently, **minimizing** the **negative log-likelihood**).
    - Common optimization methods include **Gradient Descent**, **Newton-CG**, **L-BFGS**, etc.

4. **Decision Threshold**:
    - By default, an observation is classified as 1 if $\hat{y} > 0.5$ and as 0 otherwise.
    - This threshold can be adjusted based on problem context (e.g., wanting fewer false positives vs. fewer false negatives).

5. **Multi-class Extensions**:
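    - **One-vs-All (OvA)**: Train one binary classifier per class, each separating that class from all the others, and predict the class whose classifier gives the highest score.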
    - **Multinomial (Softmax) Regression**: Model all classes simultaneously with a softmax output.
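
The model form, the log-odds, the decision threshold, and the negative log-likelihood described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names (`sigmoid`, `log_odds`, `negative_log_likelihood`), the toy data, and the parameter values in `beta_toy` are all made up for the example and are not fitted to anything.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

def negative_log_likelihood(beta, X, y):
    """Average negative log-likelihood for binary labels y in {0, 1}.

    beta[0] is the intercept; beta[1:] are the feature weights.
    """
    z = beta[0] + X @ beta[1:]
    p = sigmoid(z)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: one feature, four observations
X_toy = np.array([[0.5], [1.5], [2.5], [3.5]])
y_toy = np.array([0, 0, 1, 1])
beta_toy = np.array([-4.0, 2.0])  # illustrative values, not fitted

z_toy = beta_toy[0] + X_toy @ beta_toy[1:]   # linear combination z
p_toy = sigmoid(z_toy)                       # predicted probabilities

print("Scores z:       ", z_toy)
print("Probabilities:  ", np.round(p_toy, 3))
print("Recovered z:    ", np.round(log_odds(p_toy), 3))
print("Predicted class:", (p_toy > 0.5).astype(int))
print("Average NLL:    ", round(negative_log_likelihood(beta_toy, X_toy, y_toy), 3))
```

Applying `log_odds` to the probabilities recovers the original scores, which is what "linear in the log-odds" means in practice. Fitting the model amounts to searching for the `beta` that makes the average negative log-likelihood as small as possible.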

Next, we will explore an example using the Titanic dataset, aiming to predict passenger survival (1) or death (0).

## Dataset: Predicting Titanic Survival

We will:
- Load the dataset
- Perform basic preprocessing
- Split the data into training and test sets
- Fit a Logistic Regression model and evaluate accuracy

### About the Titanic dataset
Each row corresponds to a passenger, with columns such as:
- **Survived** (1 = yes, 0 = no)
- **Pclass** (Passenger class, 1 = upper, 2 = middle, 3 = lower)
- **Sex** (male or female)
- **Age** (Passenger age)
- **SibSp** (Number of siblings/spouses aboard)
- **Parch** (Number of parents/children aboard)
- **Fare** (Ticket fare cost)

Our goal: **Predict whether a passenger survived** (`Survived`) using their class, sex, age, fare, etc.

In [None]:
# Data loading and preprocessing
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')  # Suppress some warning messages

# Load Titanic data
df = pd.read_csv('../data/titanic/train.csv')

# Let's take a quick look at the dataset structure
print("Data shape:", df.shape)
df.head()

### Data Cleaning and Feature Engineering
To simplify, we:
- **Drop** rows with missing values (note: in practice, we might want a more sophisticated approach to missing data)
- Convert `Sex` to a numeric variable (1 = male, 0 = female)
- Select relevant features for our model

Finally, we **split** the data into training and test sets.
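
As an aside on the "more sophisticated approach" to missing data mentioned above, one common option is to impute missing values instead of dropping rows. The sketch below uses scikit-learn's `SimpleImputer` with the median strategy on a tiny, made-up DataFrame; the values are hypothetical and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical mini-sample with one missing age (values are made up)
toy = pd.DataFrame({
    "Age":  [22.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 8.05, 51.86],
})

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```

In a real workflow the imputer would normally be fitted on the training set only and then applied to the test set, so that no information from the test data leaks into the model.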

In [None]:
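# Drop rows with missing values in the columns we actually use.
# A bare df.dropna() would also drop every row with a missing value in an
# unused column (e.g. `Cabin`, which is missing for most passengers in the
# standard Kaggle file), discarding most of the data.
model_cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
df = df.dropna(subset=model_cols).copy()

# Convert 'Sex' to numeric: male=1, female=0
df['Sex'] = (df['Sex'] == 'male').astype(int)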

# Define our features (X) and target (y)
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = df['Survived']

# Split the data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=15
)

print("Training set size:", X_train.shape)
print("Test set size:\t", X_test.shape)

### Train a Logistic Regression Model
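We use **scikit-learn**'s `LogisticRegression` class with the `lbfgs` solver for optimization. Note that this class applies L2 regularization by default (`C=1.0`), so the fitted coefficients are shrunk somewhat compared with an unpenalized maximum-likelihood fit.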

After fitting, we:
- **Predict** on the test set
- Compare the predictions to the **true** labels using **accuracy**.
  - Accuracy = Number of correct predictions / Total predictions.


In [None]:
log_reg = LogisticRegression(solver='lbfgs', max_iter=500)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
score = accuracy_score(y_test, y_pred)

print("Accuracy:", round(score, 3))
print("Survival rate in test set:", round(y_test.mean(), 3))

# Baseline: if we predicted 'did not survive' for everyone,
# accuracy would equal the share of non-survivors in the test set
no_survival_acc = round((1 - y_test.mean()), 3)
print("If we predicted no survivors always, Accuracy would be:", no_survival_acc)

### A Quick Look at Coefficients
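In Logistic Regression, `coef_` reflects the effect of each feature on the **log-odds** of survival. A **positive** coefficient increases the log-odds (and thus the probability of survival), while a **negative** coefficient lowers it. Because the features are measured on different scales (compare `Fare` with `Sex`), coefficient magnitudes are not directly comparable across features.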

Remember:
- `Sex` is coded as 1 for male, 0 for female, so a **negative** coefficient for `Sex` means being male **reduces** the log-odds of survival compared to female.
- The intercept is stored separately in `intercept_`.
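
To make the log-odds interpretation concrete, the sketch below fits a model on synthetic data and converts the coefficients to odds ratios with `np.exp`. Everything here is hypothetical: the data are simulated from an assumed logistic model, and the names `X_demo`, `y_demo`, and `model` are illustrative and unrelated to the Titanic fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): two features, labels drawn from a known logistic model
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
true_z = 0.5 + 1.5 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
y_demo = (rng.random(500) < 1 / (1 + np.exp(-true_z))).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# exp(coefficient) is the multiplicative change in the odds per unit increase
odds_ratios = np.exp(model.coef_[0])
print("Coefficients:", np.round(model.coef_[0], 3))
print("Odds ratios: ", np.round(odds_ratios, 3))

# Check: raise the first feature by one unit and compare the odds directly
x0 = np.array([[0.2, -0.3]])
x1 = x0 + np.array([[1.0, 0.0]])
p0 = model.predict_proba(x0)[0, 1]
p1 = model.predict_proba(x1)[0, 1]
ratio_from_probs = (p1 / (1 - p1)) / (p0 / (1 - p0))
print("Odds ratio from probabilities:", round(ratio_from_probs, 3))
```

The ratio computed from the predicted probabilities should match `exp` of the first coefficient, regardless of the starting point `x0`. An odds ratio above 1 corresponds to a positive coefficient, and one below 1 to a negative coefficient.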


In [None]:
# Coefficients and intercept
coef_df = pd.DataFrame(
    log_reg.coef_,
    columns=X_train.columns
)
coef_df['Intercept'] = log_reg.intercept_
coef_df

### Model Evaluation: Confusion Matrix & Classification Report
Besides accuracy, it's helpful to see **where** mistakes occur.

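- **Confusion Matrix**: Cross-tabulates actual vs. predicted classes, giving the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).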
- **Classification Report**: Provides precision, recall, and F1-score for each class.

In many real-world problems, especially those with **imbalanced** data, these metrics can be more informative than simple accuracy.
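
As a small illustration of how these metrics relate to the confusion matrix, the sketch below computes precision, recall, and F1 by hand from the four counts. The labels and predictions are hypothetical and do not come from the Titanic model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels and predictions (not from the Titanic model)
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# For binary labels {0, 1}, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print("Precision:", precision, "| sklearn:", precision_score(y_true_demo, y_pred_demo))
print("Recall:   ", recall, "| sklearn:", recall_score(y_true_demo, y_pred_demo))
print("F1:       ", f1, "| sklearn:", f1_score(y_true_demo, y_pred_demo))
```

The hand-computed values agree with scikit-learn's functions, which by default report these metrics for the positive class (label 1).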

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

print("\nClassification Report:\n", classification_report(y_test, y_pred))

### Questions/Extensions
1. **Interpretation**: Based on the coefficients, which features appear most influential for survival?
2. **Threshold Adjustment**: By default, predictions are 1 if $\hat{y} > 0.5$. What happens if we change this threshold (e.g., to 0.3 or 0.7)? How does that affect the confusion matrix and the accuracy?
3. **Feature Engineering**: Are there other variables you could create (e.g., traveling alone vs. with family) that might improve accuracy?
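
As a starting point for the threshold question, a custom threshold can be applied to the class-1 column of `predict_proba` instead of calling `predict`. The sketch below does this on synthetic data; the data, the variable names, and the choice to evaluate on the training sample (for brevity) are all assumptions of the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic data (hypothetical): one feature, labels drawn from a logistic model
rng = np.random.default_rng(1)
X_thr = rng.normal(size=(400, 1))
y_thr = (rng.random(400) < 1 / (1 + np.exp(-2.0 * X_thr[:, 0]))).astype(int)

clf_thr = LogisticRegression().fit(X_thr, y_thr)
proba_pos = clf_thr.predict_proba(X_thr)[:, 1]  # estimated P(class = 1)

# Evaluated on the training data for brevity; use a held-out set in practice
results = {}
for threshold in (0.3, 0.5, 0.7):
    y_hat = (proba_pos > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_thr, y_hat, labels=[0, 1]).ravel()
    results[threshold] = (tn, fp, fn, tp)
    print(f"threshold={threshold}: TN={tn} FP={fp} FN={fn} TP={tp}")
```

Raising the threshold can only reduce the number of predicted positives, so false positives fall while false negatives rise; lowering it has the opposite effect.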


## Example: Predicting Probability on Synthetic Data
Below is a small demonstration of how `predict_proba` returns the estimated probability of each class for a new observation.

In [None]:
# Example to illustrate predict_proba
X_synth = np.array([[1, 2], [2, 4], [3, 6], [4, 8]])
y_synth = np.array([0, 0, 1, 1])

clf_synth = LogisticRegression(solver='lbfgs').fit(X_synth, y_synth)

# Let's create a new data point
x_new = np.array([[5, 10]])
probabilities = clf_synth.predict_proba(x_new)

print("Predicted probabilities for the new point:")
print(probabilities)
print("Predicted class:", clf_synth.predict(x_new))

### Questions/Extensions
1. How do these probabilities correspond to the logistic function (sigmoid) we covered theoretically?
2. Compare `predict_proba(x)` with `decision_function(x)` in scikit-learn. What does `decision_function` return?
3. Change the input to `[10, 20]`. How do you expect the probabilities to shift?
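
As a hint for the first two questions, the sketch below refits the same toy model and checks numerically that, for a binary problem, applying the sigmoid to `decision_function` reproduces the class-1 column of `predict_proba`. The names `z_new`, `z_manual`, `p_from_z`, and `p_direct` are introduced here only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as in the cell above
X_synth = np.array([[1, 2], [2, 4], [3, 6], [4, 8]])
y_synth = np.array([0, 0, 1, 1])
clf_synth = LogisticRegression(solver='lbfgs').fit(X_synth, y_synth)

x_new = np.array([[5, 10]])

# decision_function returns the linear score z = intercept + coef . x
z_new = clf_synth.decision_function(x_new)
z_manual = (clf_synth.intercept_ + x_new @ clf_synth.coef_.T).ravel()

# Applying the sigmoid to z gives the probability of class 1
p_from_z = 1 / (1 + np.exp(-z_new))
p_direct = clf_synth.predict_proba(x_new)[:, 1]

print("z from decision_function:", z_new)
print("z computed by hand:      ", z_manual)
print("sigmoid(z):              ", p_from_z)
print("predict_proba, class 1:  ", p_direct)
```

A larger input such as `[10, 20]` gives a larger score `z` when the fitted coefficients are positive, which pushes the class-1 probability closer to 1.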


## Summary

- Logistic Regression provides **probabilistic** outputs for binary classification, mapping linear combinations of features to probabilities via the **sigmoid** function.
- **Coefficients** represent **log-odds** contributions of each feature.
- Looking at the **confusion matrix**, **precision**, **recall**, and **F1-score** alongside **accuracy** gives a more complete picture than accuracy alone.
- Real-world performance may depend on data quality, balanced vs. imbalanced classes, and thoughtful **feature engineering**.

With these ideas in mind, you should now be more comfortable using Logistic Regression on datasets like Titanic and beyond!