Week 1 – Day 5: Your First ML Model

Objective of the day
Train your first machine learning model (logistic regression) to predict Titanic survival. By the end, you’ll know how to:

Prepare data for ML

Train a model

Evaluate it

🟢 Labels (y)

The answer we want the model to predict.

In the Titanic dataset, that’s the column survived (0 = no, 1 = yes).

It’s also called the target variable, dependent variable, or output.

Think of it as:
👉 “What do I want to know?”

🟡 Features (X)

The information we give the model so it can make a guess.

In Titanic, features could be sex, age, pclass, fare, etc.

These are also called independent variables or inputs.

Think of it as:
👉 “What do I know?”

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)

# Convert categorical column 'sex' to numbers
df["sex"] = df["sex"].map({"male": 0, "female": 1})

# Drop rows with missing age
df = df.dropna(subset=["age"])

X = df[["sex", "pclass", "age", "fare"]]
y = df["survived"]

# Split into train/test sets (80% train, 20% test)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate on the held-out test set

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))


# Inspect the learned weights; coefficients follow the column order of X
print(model.coef_)
print(model.intercept_)



Accuracy: 0.7552447552447552
Confusion matrix:
 [[68 19]
 [16 40]]
[[ 2.53047810e+00 -1.24068794e+00 -4.24723266e-02  2.27399583e-04]]
[2.61171493]


A confusion matrix is just a table that compares:

What the model predicted vs

What the true answers (labels) were

For binary classification (like Titanic survival: 0 = died, 1 = survived), it looks like this, with one row per actual class and one column per predicted class:

|               | Predicted: 0 (died) | Predicted: 1 (survived) |
| ------------- | ------------------- | ----------------------- |
| **Actual: 0** | True Negative (TN)  | False Positive (FP)     |
| **Actual: 1** | False Negative (FN) | True Positive (TP)      |
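
To connect the table to the numbers printed earlier, here is a small sketch that reads the four cells out of the confusion matrix from the first run and recomputes accuracy by hand. Precision and recall are not used elsewhere in this lesson; they are included only as an illustration of what else the four cells can tell you.

```python
# Confusion matrix copied from the first model's output.
# Rows = actual class, columns = predicted class (scikit-learn's layout).
cm = [[68, 19],
      [16, 40]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

# Accuracy: share of all test passengers that were classified correctly
accuracy = (tn + tp) / total

# Precision: of the passengers predicted to survive, how many really did
precision = tp / (tp + fp)

# Recall: of the passengers who really survived, how many the model found
recall = tp / (tp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}, total={total}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
```

The accuracy computed this way should match the value reported by accuracy_score for the same run.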


🟢 What are model coefficients?

When you train a logistic regression model, it learns a mathematical formula to predict survival.

That formula looks like this:

z = b_0 + b_1*x_1 + b_2*x_2 + ... + b_n*x_n

b_0 = intercept (the baseline value when all features are 0).

b_1, b_2, ... = coefficients (weights) that multiply each feature (x_1, x_2, ...).

Then logistic regression passes z through the sigmoid function, p = 1 / (1 + e^(-z)), to squash it into a probability between 0 and 1.
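
As a rough illustration of the formula, the sketch below plugs the intercept and coefficients printed by the first model into z and applies the sigmoid by hand. The passenger values are made up for the example (a hypothetical 30-year-old woman in first class who paid a fare of 80), and the coefficient values are copied from the output above, so your own run may differ slightly.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Values copied from the printed model.intercept_ and model.coef_
intercept = 2.61171493
coefs = {
    "sex": 2.53047810,       # 0 = male, 1 = female
    "pclass": -1.24068794,
    "age": -0.0424723266,
    "fare": 0.000227399583,
}

# Hypothetical passenger, invented for illustration only
passenger = {"sex": 1, "pclass": 1, "age": 30, "fare": 80.0}

# z = b_0 + b_1*x_1 + b_2*x_2 + ...
z = intercept + sum(coefs[name] * passenger[name] for name in coefs)
p = sigmoid(z)

print(f"z = {z:.3f}")
print(f"Estimated survival probability = {p:.3f}")
```

For a fitted model, model.predict_proba(X_test) performs the same calculation for every row; the hand computation here is only meant to show what happens inside.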

-Model coefficients

Each model coefficient shows how much a one-unit increase in that feature pushes the log-odds of survival up or down, with the other features held fixed.

-Interpreting coefficients

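Positive value -> feature increases chance of survival.
Negative value -> feature decreases chance of survival.
Bigger absolute value -> stronger influence per unit of that feature. The features are on different scales (sex is 0 or 1, while fare spans a wide range), so compare magnitudes across features with care.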
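
One common way to make coefficients easier to read is to exponentiate them, which turns each one into an odds ratio: the factor by which the odds of survival are multiplied when that feature goes up by one unit. The sketch below does this for the coefficients printed by the first model; the values are copied from that output, so treat them as specific to that run.

```python
import math

# Coefficients copied from the printed model.coef_, in the column order of X
feature_names = ["sex", "pclass", "age", "fare"]
coefficients = [2.53047810, -1.24068794, -0.0424723266, 0.000227399583]

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratios = {
    name: math.exp(coef) for name, coef in zip(feature_names, coefficients)
}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name:>6}: odds ratio = {ratio:.4f} ({direction} the odds of survival)")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient. The ratio for fare is very close to 1 because it describes a change of a single fare unit, which illustrates why raw magnitudes are hard to compare across features with different scales.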


🌟 Mini-Challenge

Add one more feature: “alone” (0 = not alone, 1 = alone). In this dataset the column is stored as False/True, which scikit-learn treats as 0/1.

Retrain the model with this extra column.

Did accuracy improve, stay the same, or get worse?

What does that suggest about being alone vs with family?

In [5]:
# Same features as before, plus 'alone'
X = df[["sex", "pclass", "age", "fare", "alone"]]
y = df["survived"]

# Split into train/test sets (same split settings as before)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate on the held-out test set

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.7342657342657343
Confusion matrix:
 [[68 19]
 [19 37]]
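
To put the two results side by side, here is a small sketch that compares the confusion matrices from both runs. The numbers are copied from the outputs above; the helper name summarize is made up for this example.

```python
# Confusion matrices copied from the two runs (rows = actual, columns = predicted)
baseline = [[68, 19],
            [16, 40]]
with_alone = [[68, 19],
              [19, 37]]

def summarize(cm):
    """Return (correct predictions, total predictions, accuracy)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    correct = tn + tp
    return correct, total, correct / total

base_correct, total, base_acc = summarize(baseline)
alone_correct, _, alone_acc = summarize(with_alone)

print(f"Without 'alone': {base_correct}/{total} correct ({base_acc:.4f})")
print(f"With 'alone':    {alone_correct}/{total} correct ({alone_acc:.4f})")
print(f"Difference: {base_correct - alone_correct} passengers")
```

The top row is identical in both runs; the only change is that three actual survivors moved from True Positive to False Negative. A gap of three passengers on a single train/test split is small, so it is safer to read this as "alone did not clearly help" than as evidence that the feature hurts.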
