# Titanic Survival Analysis — Homework Project

Author: Milzon  
Dataset: Titanic (Kaggle)

In [None]:
import os
print(os.getcwd())


d:\AIEngineering-8\Milzon


# Titanic Survival Analysis - Step 1: Data Loading and Initial Exploration

In this step, we load the Titanic dataset and perform initial data exploration to understand the data structure, missing values, and data types. This will help us decide on necessary cleaning and preprocessing steps.


**Caption:**  
The next cell prints the first five rows of the Titanic dataset and a column summary. We use both to explore the features and look for missing values or anomalies.


In [1]:
import pandas as pd

# Load Titanic dataset (adjust path if needed)
df = pd.read_csv('../data/train.csv')

# Show first 5 rows
print(df.head())

# Show info summary including non-null counts and dtypes.
# df.info() prints directly and returns None, so it is not wrapped in print().
df.info()


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
<c

## Initial Data Exploration Insights

- The dataset contains **891 passengers** with **12 columns** describing various features such as passenger ID, survival status, class, name, gender, age, family relations, ticket info, fare, cabin, and embarkation port.
- Several columns have **missing values**, for example:
  - `Cabin` has many missing entries (NaN).
  - `Age` has missing entries, which are filled with the median before modeling.
  - `Embarked` has a few missing values (889 non-null out of 891).
- Data types vary:
  - `Age` and `Fare` are floats with decimal values, consistent with continuous variables.
  - Categorical columns like `Sex`, `Ticket`, `Cabin`, `Embarked` are objects (strings).
- The dataset appears well-structured and ready for initial cleaning and feature engineering.
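
One way to put numbers on the missing-value observations above is to count NaNs per column. The sketch below uses a small made-up frame (`toy`) rather than the real `train.csv`, so the counts it prints are illustrative only; with the notebook's data, the same two calls could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Count and percentage of missing entries per column.
missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary.sort_values("missing", ascending=False))
```

Sorting by the count puts the columns that need the most attention (here `Cabin`) at the top.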

### Next steps based on this exploration:
- Handle missing values, especially in `Age` and `Embarked`.
- Consider encoding categorical features (`Sex`, `Embarked`, `Cabin`) to numeric for modeling.
- Explore feature creation such as family size or title extraction from names.
- Visualize distributions of key variables and survival rates to find patterns.
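
The next steps listed above could be sketched roughly as follows. This is a minimal illustration on a four-row toy frame, not the notebook's final preprocessing: the names come from the preview above, but some `Age` and `Embarked` values are altered to create gaps, and the `Title` column, the regular expression, and the mode-based fill for `Embarked` are assumptions chosen for the example.

```python
import pandas as pd

# Toy rows; Age and Embarked values are altered here to create missing entries.
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 35.0, 35.0],
    "SibSp": [1, 0, 1, 0],
    "Parch": [0, 0, 0, 0],
    "Embarked": ["S", "C", None, "S"],
})

# Handle missing values: median for Age, most frequent value for Embarked.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Feature creation: family size and the title between the comma and the period.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Encoding: binary map for Sex, one-hot columns for Embarked.
toy["Sex_numeric"] = toy["Sex"].map({"male": 0, "female": 1})
embarked_dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(toy[["Title", "FamilySize", "Age", "Sex_numeric"]])
print(embarked_dummies)
```

`Cabin` is left out of the sketch because most of its values are missing, so encoding it directly would mostly encode the absence of data.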

---

This initial exploration sets the foundation for effective preprocessing and model building.


## Task 1: Feature Detective 

Goal: Identify which features have the greatest impact on model accuracy by removing one feature at a time.

Procedure:
- Train a baseline model with all features.
- Remove one feature at a time ('Sex', 'Pclass', 'Age', or 'FamilySize') and retrain the model on the remaining features.
- Measure test accuracy for each variation, using the same train/test split as the baseline.
- Compare accuracies to determine feature importance.

Questions to answer:
- Which feature removal hurt accuracy the most?
- Which feature seems least important?
- Why might 'Sex' or 'Pclass' be important for survival prediction?


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Preprocessing: example - create FamilySize
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Define features and target
features = ['Pclass', 'Sex', 'Age', 'FamilySize']
target = 'Survived'

# Convert categorical feature 'Sex' to numeric
df['Sex_numeric'] = df['Sex'].map({'male': 0, 'female': 1})

# Prepare dataset function for modeling
def prepare_data(feature_list):
    X = df[feature_list].copy()
    # Replace 'Sex' with 'Sex_numeric' if present
    if 'Sex' in X.columns:
        X['Sex'] = df['Sex_numeric']
    # Fill missing age with median
    if 'Age' in X.columns:
        X['Age'] = X['Age'].fillna(df['Age'].median())
    return X

# Baseline model with all features
X_all = prepare_data(features)
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
baseline_acc = accuracy_score(y_test, y_pred)
print(f'Baseline accuracy with all features: {baseline_acc:.4f}')

# Test removing each feature
results = {}
for feature in features:
    features_subset = [f for f in features if f != feature]
    X_sub = prepare_data(features_subset)
    X_train, X_test, y_train, y_test = train_test_split(X_sub, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    results[feature] = acc
    print(f'Accuracy without {feature}: {acc:.4f}')

# Show comparison
print("\nAccuracy impact of removing each feature:")
print(f"Baseline (all features): {baseline_acc:.4f}")
for feat, acc in results.items():
    print(f"Without {feat}: {acc:.4f}")


Baseline accuracy with all features: 0.8045
Accuracy without Pclass: 0.7821
Accuracy without Sex: 0.7542
Accuracy without Age: 0.7877
Accuracy without FamilySize: 0.8101

Accuracy impact of removing each feature:
Baseline (all features): 0.8045
Without Pclass: 0.7821
Without Sex: 0.7542
Without Age: 0.7877
Without FamilySize: 0.8101
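
To answer the Task 1 questions from these numbers, the accuracies printed above can be turned into drops relative to the baseline and ranked. The sketch below copies the reported values by hand. The test-set size `n_test = 179` is an assumption (20% of 891 rows, rounded up, which is how `train_test_split` sizes the test set) and is used only to express each drop as an approximate number of test rows.

```python
# Accuracies copied from the output above.
baseline_acc = 0.8045
results = {"Pclass": 0.7821, "Sex": 0.7542, "Age": 0.7877, "FamilySize": 0.8101}

# Assumption: 891 rows with test_size=0.2 gives ceil(178.2) = 179 test rows.
n_test = 179

# Positive drop = removing the feature hurt accuracy; negative = accuracy rose.
drops = {feat: baseline_acc - acc for feat, acc in results.items()}
ranked = sorted(drops.items(), key=lambda kv: kv[1], reverse=True)

for feat, drop in ranked:
    rows = round(drop * n_test)
    print(f"{feat:<10} drop = {drop:+.4f}  (~{rows:+d} test rows)")
```

On this split, removing `Sex` hurts the most (about 9 test rows), followed by `Pclass` and `Age`. Removing `FamilySize` changes the result by roughly one test row in the other direction, so it looks like the least important feature here, although a one-row difference on a single split is too small to say that it harms the model.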
