

## Chapter 6: Building Logistic Regression from Scratch

A Logistic Regression model is essentially a mathematical machine that takes in features (X), multiplies them by weights (W), adds a bias (b), and pushes the result through a "Squashing Function" to get a probability.

### 1. The Prediction (The Sigmoid Function)

Unlike Linear Regression, which can predict any number from negative to positive infinity, Logistic Regression must predict a probability between 0 and 1. We use the Sigmoid Function to achieve this:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

Where $z = X \cdot W + b$. If $\sigma(z)$ is 0.85, the model is 85% sure the passenger survived.
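To make the formula concrete, here is a minimal sketch that scores a single passenger. The feature values, weights, and bias are made up for illustration; they are not the values the model learns later in this chapter.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Hypothetical passenger with three features: [Pclass, Sex_binary, Age]
x = np.array([1.0, 1.0, 0.5])
# Hypothetical weights and bias, chosen by hand for this example
w = np.array([-0.4, 2.0, -0.2])
b = 0.1

z = np.dot(x, w) + b      # linear score: can be any real number
probability = sigmoid(z)  # squashed into a probability

print(f"z = {z:.2f}, probability = {probability:.3f}")
```

A score of exactly 0 maps to a probability of 0.5. Large positive scores approach 1 and large negative scores approach 0.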

### 2. The Penalty (Log Loss)

How do we know if the model is doing a bad job? We use a Cost Function called Log Loss.

*   If the passenger survived (1) but the model predicted 0.01, the "Penalty" is very high.
*   If the model predicted 0.99, the "Penalty" is near zero.
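The two cases above can be checked numerically. This sketch uses the standard per-sample log loss; the helper name `sample_log_loss` is introduced here for illustration only.

```python
import numpy as np

def sample_log_loss(y_true, y_pred):
    """Log loss for a single prediction (y_true is 0 or 1)."""
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# The passenger survived (y_true = 1) in both cases
confident_wrong = sample_log_loss(1, 0.01)
confident_right = sample_log_loss(1, 0.99)

print(f"Predicted 0.01 for a survivor: penalty = {confident_wrong:.3f}")
print(f"Predicted 0.99 for a survivor: penalty = {confident_right:.3f}")
```

The confident wrong guess costs about 4.6, while the confident correct guess costs about 0.01. That asymmetry is what pushes the model away from overconfident mistakes.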

### 3. The Teacher (Gradient Descent)

This is the "learning" part. The model calculates the slope (gradient) of the error and takes a small step in the opposite direction to reduce the penalty. It repeats this thousands of times until it finds the "best" weights for Pclass, Sex, and Age.
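Before applying this to the Titanic features, the idea can be seen on a toy problem. The sketch below minimizes a made-up one-variable cost, f(w) = (w - 3)^2, whose minimum is at w = 3. The cost function, starting point, and step size are assumptions chosen for illustration.

```python
def gradient(w):
    """Slope of the toy cost f(w) = (w - 3)**2."""
    return 2 * (w - 3)

w = 0.0              # starting guess
learning_rate = 0.1  # size of each step

for step in range(100):
    # Step in the direction opposite to the slope
    w -= learning_rate * gradient(w)

print(f"w after 100 steps: {w:.4f}")
```

Each step shrinks the distance to the minimum by a constant factor, so `w` settles very close to 3. Logistic Regression follows the same recipe with one slope per weight.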

#### Step 1: Initialize the Modeling Notebook

In your new notebook, start by importing the clean data and setting up the mathematical foundations.


In [32]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('../data/titanic_model_ready.csv')

# Separate features and target variable
# 'Survived' is the target; every other column is treated as a feature.
# The bias (b) is kept as a separate scalar rather than as an extra column of 1s in X.
X = df.drop('Survived', axis=1).astype(float).values
y = df['Survived'].astype(float).values.reshape(-1, 1)

# Initialize weights and bias
weights = np.zeros((X.shape[1], 1))
bias = 0.0

print(f"Ready to train model with {X.shape[0]} samples and {X.shape[1]} features.")

Ready to train model with 891 samples and 9 features.


In this section, we build the three core functions of Logistic Regression: the Sigmoid (prediction), the Log Loss (error measurement), and Gradient Descent (learning).

#### The Sigmoid Function

This function takes any real number and "squashes" it into a probability between 0 and 1.

In [33]:
# The sigmoid activation function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))


#### The Cost Function (Log Loss)

This measures how "wrong" the model's guesses are. A perfect prediction has a cost of 0, while a confident but wrong prediction has a very high cost.

$$ J(W,b) = -\frac{1}{m} \sum [y \log(\hat{y}) + (1-y) \log(1-\hat{y})] $$
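The formula translates almost directly into NumPy. The sketch below is one way to write it as a reusable function. The name `log_loss` and the `eps` clipping value are choices made here, not part of the original notebook, which computes the loss inline and guards the logarithm by adding a small constant inside it instead.

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-9):
    """Mean log loss J over all samples.

    eps is an assumed safety margin: predictions are clipped away from
    exactly 0 and 1 so that np.log never receives 0.
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Tiny made-up example: three passengers, every prediction is 0.5
y = np.array([[1.0], [0.0], [1.0]])
all_half = np.full_like(y, 0.5)

print(f"Loss when every prediction is 0.5: {log_loss(y, all_half):.4f}")
```

When every prediction is 0.5, the loss equals ln 2, about 0.6931, regardless of the labels. A model with all-zero weights and zero bias predicts 0.5 for everyone, so this is the natural starting loss.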

#### The Gradient Descent Algorithm

This is the training loop. In each "epoch" (round of learning), the model:

*   **Predicts:** Guesses survival probabilities.
*   **Calculates Error:** Finds the difference between its guess and the real answer.
*   **Updates Weights:** Adjusts the importance of features like Sex or Pclass to reduce future errors.
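For reference, the update in the last step comes from differentiating the cost $J$ with respect to the weights and bias. Writing $\hat{y} = \sigma(X \cdot W + b)$ and using $\alpha$ for the learning rate (a symbol introduced here, not used elsewhere in this chapter), a sketch of the result is:

$$ \frac{\partial J}{\partial W} = \frac{1}{m} X^{T} (\hat{y} - y), \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum (\hat{y} - y) $$

$$ W \leftarrow W - \alpha \frac{\partial J}{\partial W}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b} $$

The two gradients correspond to `dw` and `db` in the training loop.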


In [34]:
# Hyperparameters
learning_rate = 0.01
epochs = 1000
m = X.shape[0]  # Number of samples

# Initial weights (9 features) and bias
weights = np.zeros((X.shape[1], 1))
bias = 0

# The Training Loop
for i in range(epochs):
    # 1. Forward Pass: Calculate z and prediction (y_hat)
    z = np.dot(X, weights) + bias
    y_hat = sigmoid(z)
    
    # 2. Backward Pass: Calculate the Gradients (The Calculus)
    dw = (1/m) * np.dot(X.T, (y_hat - y))
    db = (1/m) * np.sum(y_hat - y)
    
    # 3. Update Weights and Bias (The Learning Step)
    weights -= learning_rate * dw
    bias -= learning_rate * db
    
    # Optional: Print progress every 100 epochs
    if i % 100 == 0:
        loss = -np.mean(y * np.log(y_hat + 1e-9) + (1 - y) * np.log(1 - y_hat + 1e-9))
        print(f"Epoch {i}: Loss = {loss:.4f}")

print("\n--- Training Complete ---")

Epoch 0: Loss = 0.6931
Epoch 100: Loss = 0.6131
Epoch 200: Loss = 0.5887
Epoch 300: Loss = 0.5705
Epoch 400: Loss = 0.5559
Epoch 500: Loss = 0.5436
Epoch 600: Loss = 0.5332
Epoch 700: Loss = 0.5242
Epoch 800: Loss = 0.5163
Epoch 900: Loss = 0.5093

--- Training Complete ---


In [35]:
feature_importance = pd.DataFrame({
    'Feature': df.drop('Survived', axis=1).columns,
    'Weight': weights.flatten()
}).sort_values(by='Weight', ascending=False)

print("\nFeature Importance:")
print(feature_importance)


Feature Importance:
      Feature    Weight
1  Sex_binary  0.921427
6    HasCabin  0.308134
3        Fare  0.295043
8      Port_C  0.145387
7      Port_S -0.107551
4  FamilySize -0.133154
5     IsAlone -0.180992
2         Age -0.246106
0      Pclass -0.379951


*   **The Gender Dominance (Sex_binary: 0.92):** This is by far your strongest positive predictor. Because we mapped females to 1, this high positive weight confirms that being female was the single most significant factor in increasing survival probability.

*   **The Status Proxy (HasCabin: 0.31 & Fare: 0.30):** These two features move together. Their positive weights suggest that having a recorded cabin and paying a higher fare significantly boosted survival odds, likely due to better lifeboat access.

*   **The Class Penalty (Pclass: -0.38):** This is your strongest negative predictor. As the class number increases (from 1st to 3rd), the survival probability drops sharply. This mathematically captures the tragedy of the lower decks.

*   **The Age Factor (Age: -0.25):** The negative weight suggests that, generally, as age increased, the chance of survival decreased. This aligns with the "Children First" priority we saw during our EDA.

*   **Social Support (IsAlone: -0.18 & FamilySize: -0.13):** Interestingly, both carry negative weights. This suggests that being entirely alone or having a very large family size (which we standardized earlier) actually hindered survival compared to being in a small, cohesive family unit.
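One way to read these weights more concretely: in Logistic Regression each weight is the change in log-odds for a one-unit increase in its feature, so exponentiating a weight gives an odds multiplier. The sketch below applies that to three of the weights printed above. It assumes the other features are held fixed, and for any standardized feature "one unit" means one standard deviation.

```python
import numpy as np

# Weights copied from the Feature Importance table above
weights_by_feature = {
    "Sex_binary": 0.921427,
    "Pclass": -0.379951,
    "Age": -0.246106,
}

# exp(weight) = factor by which the odds of survival are multiplied
odds_ratios = {name: float(np.exp(w)) for name, w in weights_by_feature.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.2f} per unit increase")
```

A multiplier above 1 raises the odds of survival and a multiplier below 1 lowers them. For Sex_binary the factor is roughly 2.5.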

#### Evaluating the Scratch Model

To calculate accuracy, we need to convert the continuous output of our sigmoid function (which ranges from 0 to 1) into a binary output (0 or 1). We use a Threshold of 0.5:

*   If the probability is ≥ 0.5, we predict Survived (1).
*   If the probability is < 0.5, we predict Not Survived (0).


In [36]:
def predict(X, weights, bias, threshold=0.5):
    # Convert probabilities into hard 0/1 predictions using the threshold
    z = np.dot(X, weights) + bias
    probabilities = sigmoid(z)
    return (probabilities >= threshold).astype(int).flatten()

# Make predictions on the training set
predictions = predict(X, weights, bias)

# Calculate accuracy as the share of predictions that match the true labels
correct_predictions = np.sum(predictions == y.flatten())
accuracy = correct_predictions / len(y) * 100

print(f"Total Correct Predictions: {correct_predictions} out of {len(y)}")
print(f"\nTraining Accuracy: {accuracy:.2f}%")


Total Correct Predictions: 681 out of 891

Training Accuracy: 76.43%


### The Professional Baseline (Scikit-Learn)

We will now fit LogisticRegression from the sklearn library to see whether its optimized solver can improve on our 76.43% training accuracy.

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create and train the logistic regression model
sklearn_model = LogisticRegression(max_iter=1000)
sklearn_model.fit(X, y.ravel()) # .ravel() flattens y to 1D array for sklearn

# Make predictions
sklearn_predictions = sklearn_model.predict(X)

# Calculate accuracy
sklearn_accuracy = accuracy_score(y, sklearn_predictions) * 100

print(f"\nSklearn Logistic Regression Training Accuracy: {sklearn_accuracy:.2f}%")


Sklearn Logistic Regression Training Accuracy: 80.58%


In [38]:
# Compare weights side-by-side
comparison_df = pd.DataFrame({
    'Feature': df.drop('Survived', axis=1).columns,
    'Scratch_Weight': weights.flatten(),
    'Sklearn_Weight': sklearn_model.coef_.flatten()
}).sort_values(by='Sklearn_Weight', ascending=False)

print(comparison_df)

      Feature  Scratch_Weight  Sklearn_Weight
1  Sex_binary        0.921427        2.572020
6    HasCabin        0.308134        0.625443
3        Fare        0.295043        0.060970
8      Port_C        0.145387        0.041350
7      Port_S       -0.107551       -0.297884
2         Age       -0.246106       -0.488666
4  FamilySize       -0.133154       -0.582355
5     IsAlone       -0.180992       -0.637918
0      Pclass       -0.379951       -0.818242
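The two columns agree in sign but differ noticeably in magnitude. Two likely contributors: the scratch model's printed loss was still falling at epoch 900, so it had not yet converged, and sklearn's LogisticRegression uses a different solver and applies L2 regularization by default. The sketch below illustrates only the first point. It runs on synthetic stand-in data (not the Titanic set), and the `train` helper is a hypothetical wrapper around the same update rule as the training loop above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, learning_rate=0.01, epochs=1000):
    """Batch gradient descent; returns weights, bias, and final log loss."""
    m, n = X.shape
    weights = np.zeros((n, 1))
    bias = 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ weights + bias)
        weights -= learning_rate * (X.T @ (y_hat - y)) / m
        bias -= learning_rate * np.sum(y_hat - y) / m
    y_hat = np.clip(sigmoid(X @ weights + bias), 1e-9, 1 - 1e-9)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return weights, bias, loss

# Synthetic data for illustration: 200 samples, 2 features, noisy labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = ((X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise) > 0).astype(float).reshape(-1, 1)

w_short, _, loss_short = train(X_demo, y_demo, epochs=100)
w_long, _, loss_long = train(X_demo, y_demo, epochs=2000)

print(f"Loss after 100 epochs:  {loss_short:.4f}")
print(f"Loss after 2000 epochs: {loss_long:.4f}")
print(f"Largest |weight|: {np.abs(w_short).max():.3f} vs {np.abs(w_long).max():.3f}")
```

With a small learning rate, stopping early leaves the weights closer to their zero starting point. Running the scratch model for more epochs is one way to check how much of the gap to the sklearn weights is due to incomplete training.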


In [39]:
from sklearn.metrics import classification_report

# Generate the professional report
report = classification_report(y, sklearn_predictions, target_names=['Died', 'Survived'])
print("--- Final Model Health Report ---")
print(report)

--- Final Model Health Report ---
              precision    recall  f1-score   support

        Died       0.83      0.85      0.84       549
    Survived       0.76      0.73      0.74       342

    accuracy                           0.81       891
   macro avg       0.80      0.79      0.79       891
weighted avg       0.80      0.81      0.81       891
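To read the report, it helps to see how precision, recall, and F1 are derived from raw counts. The sketch below computes them by hand for the positive class using a small set of toy labels. These are not the Titanic predictions, and the variable names are chosen here for illustration.

```python
import numpy as np

# Toy labels for illustration only
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted 1, actually 1
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted 1, actually 0
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted 0, actually 1

precision = tp / (tp + fp)  # of those predicted to survive, how many did
recall = tp / (tp + fn)     # of those who survived, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

In the report above, the Survived row has lower recall (0.73) than the Died row (0.85), which means the model misses survivors more often than it misses non-survivors.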

