# Baseline Modeling for Titanic Survival Prediction with Logistic Regression
Establish a simple, interpretable baseline model for Titanic survival prediction using logistic regression. This provides a benchmark performance metric and helps identify key predictive features before exploring more complex algorithms.<br>

**Approach**
- Use encoded categorical features and numerical variables
- Apply logistic regression for binary classification (survived/did not survive)
- Evaluate performance using accuracy, precision, recall, and F1-score
- Analyze feature coefficients to understand survival factors<br><br>
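
To make the four evaluation metrics concrete, here is a minimal sketch on hand-made labels. The `y_true` and `y_pred` lists are hypothetical values chosen for illustration, not results from the Titanic model.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels for eight passengers (1 = survived, 0 = did not survive)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# Counts for these lists: TP = 3, FN = 1, FP = 2, TN = 2
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 5 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Precision and recall differ here because the toy predictions contain more false positives than false negatives, which accuracy alone would not reveal.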

**Expected Outcomes**
- Baseline accuracy score for comparison with advanced models
- Identification of most influential survival predictors
- Foundation for iterative model improvement

### About Logistic Regression
Logistic regression is a statistical learning algorithm for binary classification problems. It predicts the probability of an event occurring (survival in this case) by passing a linear combination of the features through the logistic (sigmoid) function.<br><br>

**Key Characteristics**
- Binary Output: Well suited to yes/no predictions (survived/not survived)
- Probability Estimates: Provides likelihood scores between 0 and 1
- Interpretable: Coefficient signs show the direction of each feature's effect on the log-odds; magnitudes are directly comparable only when features share a scale
- Linear Decision Boundary: Assumes a linear relationship between features and log-odds<br><br>
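
The relationship between the linear score, the log-odds, and the predicted probability can be sketched in a few lines of NumPy. The intercept, coefficients, and feature values below are made up for illustration and are not fitted to the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for a model with two features
intercept = -1.0
coef = np.array([2.0, -0.5])
features = np.array([1.0, 0.0])

# The linear part of the model is the log-odds of the positive class
log_odds = intercept + coef @ features   # -1.0 + 2.0*1.0 - 0.5*0.0 = 1.0
probability = sigmoid(log_odds)          # about 0.731

# Inverting the sigmoid recovers the log-odds, which is why the model is linear in log-odds
recovered_log_odds = np.log(probability / (1.0 - probability))

print(log_odds, round(float(probability), 3), round(float(recovered_log_odds), 3))
```

A positive coefficient raises the log-odds (and therefore the probability) as its feature increases, while a negative coefficient lowers it.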

**Why Use for Titanic?**
- Simple & Fast: Quick to implement and train
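- Little Preprocessing Needed: Trains directly on the encoded features, although scikit-learn's default L2 regularization is sensitive to feature scale, so standardizing wide-ranging columns such as `fare` can help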
- Interpretable Results: Easy to explain which factors influence survival
- Robust Baseline: Provides reliable performance benchmark<br>
Logistic regression serves as an excellent starting point for understanding the Titanic dataset before advancing to more complex modeling techniques.
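
If feature scale turns out to matter (scikit-learn's `LogisticRegression` applies an L2 penalty by default), one option is to standardize inside a pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in; the variable names and the inflated first column are assumptions for illustration only, not part of this notebook's workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the first column is inflated to mimic a wide-ranging feature
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_demo[:, 0] *= 100.0

# Scaling lives inside the pipeline, so it is fitted only on the data passed to fit()
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

train_accuracy = scaled_model.score(X_demo, y_demo)
scaled_coefficients = scaled_model[-1].coef_  # one row, one coefficient per feature

print(train_accuracy)
print(scaled_coefficients)
```

With standardized inputs, coefficient magnitudes are on a common scale, which makes them easier to compare across features.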

## Dataset
A preprocessed version of seaborn's built-in `titanic` dataset is used for this modeling.

## Tasks
- Load and Prepare Dataset
    - Split Dataset into Features and Target
    - Split Dataset into Training and Testing Sets
- Train Model
- Apply Model to Make Prediction
- Evaluate Model Performance
- Summarize Model Evaluation
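
The task list above can be sketched end to end as follows. This is a minimal outline on synthetic data from `make_classification`, not the Titanic file; names such as `X_toy` and `y_hat` are placeholders chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# 1. Load and prepare: synthetic features and binary target as a stand-in
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=42)

# 2. Split into training and testing sets (80% train, 20% test)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# 3. Train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

# 4. Apply the model to make predictions on unseen data
y_hat = model.predict(X_te)

# 5. Evaluate and summarize performance
cm = confusion_matrix(y_te, y_hat)
report = classification_report(y_te, y_hat)
print(cm)
print(report)
```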

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# model and evaluation libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Load and prepare data for modeling

## Load data

In [2]:
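# load clean data
# (if the CSV was saved together with its index, passing index_col=0 to read_csv
#  keeps that index from appearing as an extra "Unnamed: 0" column)
df = pd.read_csv("data/titanic_clean.csv")

# verify the load
df.head()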

Unnamed: 0,survived,pclass,fare,alone,sex_male,who_man,who_woman,embark_town_Queenstown,embark_town_Southampton
0,0,3,7.25,False,True,True,False,False,True
1,1,1,71.2833,False,False,False,True,False,False
2,1,3,7.925,True,False,False,True,False,True
3,1,1,53.1,False,False,False,True,False,True
4,0,3,8.05,True,True,True,False,False,True


## Split the dataset into input features (x) and target variable (y)

In [3]:
# features
x = df.drop('survived', axis=1)

# target variable
y = df['survived']

In [4]:
# check data load
x

Unnamed: 0,pclass,fare,alone,sex_male,who_man,who_woman,embark_town_Queenstown,embark_town_Southampton
0,3,7.2500,False,True,True,False,False,True
1,1,71.2833,False,False,False,True,False,False
2,3,7.9250,True,False,False,True,False,True
3,1,53.1000,False,False,False,True,False,True
4,3,8.0500,True,True,True,False,False,True
...,...,...,...,...,...,...,...,...
886,2,13.0000,True,True,True,False,False,True
887,1,30.0000,True,False,False,True,False,True
888,3,23.4500,False,False,False,True,False,True
889,1,30.0000,True,True,True,False,False,False


In [5]:
# target variable
y

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: survived, Length: 891, dtype: int64

## Split the dataset into training and testing sets

In [6]:
# split data into training and testing sets (80% train, 20% test)
# train_test_split returns the feature splits first, then the target splits
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
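
`train_test_split` returns its outputs in the order train features, test features, train target, test target, so a quick shape check is a cheap way to catch an accidental mix-up when unpacking. The sketch below uses a small hypothetical array rather than the Titanic data. It also passes `stratify`, an optional argument not used in the cell above, which keeps the class ratio similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, balanced binary target
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Feature splits keep the column count; target splits are one-dimensional
print(X_train_demo.shape, X_test_demo.shape)
print(y_train_demo.shape, y_test_demo.shape)

# Row counts of features and targets must agree within each split
assert X_train_demo.shape[0] == y_train_demo.shape[0]
assert X_test_demo.shape[0] == y_test_demo.shape[0]
```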