# Titanic Survival Analysis — Homework Project

Author: Milzon  
Dataset: Titanic (Kaggle)

## Step 1. Data Loading and Initial Exploration

In [None]:
import pandas as pd

# Load data
df = pd.read_csv('data/train.csv')
df.head()

**Caption:**  
First five rows of the Titanic dataset. Let's explore the features and look for missing values or anomalies.

## Step 2. Exploratory Data Analysis (EDA)

In [None]:
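df.info()                  # prints dtypes and non-null counts itself
print(df.describe())       # summary statistics for numeric columns
print(df.isnull().sum())   # missing values per column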

**Caption:**  
Checking data types, summary statistics, and missing values for initial EDA.
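The missing-value check is often easier to read as a table of counts and percentages. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the real Titanic data); the same expression should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (made-up values, not the real data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and percentage of missing values per column, most affected first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```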

In [None]:
import matplotlib.pyplot as plt

# Visualize survival counts
plt.figure(figsize=(5,3))
df['Survived'].value_counts().plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Survival Distribution')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()

**Figure 1.**  
Survival counts in the training set. The bar for `Survived = 0` (did not survive) is taller than the bar for `Survived = 1`, so most passengers in this data did not survive.

In [None]:
# Survival rate by gender
df.groupby('Sex')['Survived'].mean().plot(kind='bar', color=['pink', 'blue'])
plt.title('Survival Rate by Gender')
plt.ylabel('Survival Rate')
plt.show()

**Figure 2.**  
Female passengers had a much higher survival rate compared to male passengers.
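A natural follow-up is to break the survival rate down by sex and passenger class together. This is a minimal sketch on a made-up frame (`toy` is hypothetical and its rates say nothing about the real data); it assumes the column names `Sex`, `Pclass`, and `Survived` used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows with the same column names as the Titanic data (values are made up).
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Mean of the 0/1 Survived column = survival rate per (Sex, Pclass) cell.
rates = toy.pivot_table(index="Sex", columns="Pclass",
                        values="Survived", aggfunc="mean")
print(rates)
```

On the real data, `rates.plot(kind='bar')` would give a grouped bar chart in the same style as Figure 2.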

## Step 3. Data Preprocessing

In [None]:
# Fill missing values and encode categorical data
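# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])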
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Embarked'] = le.fit_transform(df['Embarked'])

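**Caption:**  
Missing values are filled (median for `Age`, most frequent value for `Embarked`) and categorical features are encoded as integers for modeling. Note that integer codes for `Embarked` imply an ordering among ports that the data does not have.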
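One-hot encoding is a common alternative to integer codes for nominal features such as `Embarked`. The sketch below shows it with `pd.get_dummies` on a made-up column; it is an alternative to the `LabelEncoder` step above, not the approach this notebook uses.

```python
import pandas as pd

# Toy column with the three port codes (made-up rows).
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port; drop_first removes the first category ("C")
# so the remaining columns are not perfectly collinear in a linear model.
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)
print(encoded)
```

If this were adopted, the `features` list in Step 4 would need the new `Embarked_*` column names in place of `Embarked`.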

## Step 4. Model Training and Evaluation

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = df[features]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

**Caption:**  
Logistic Regression evaluated on the held-out 20% test set: accuracy, confusion matrix, and per-class precision, recall, and F1.
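A single 80/20 split on a dataset of this size can give a noisy accuracy estimate. Cross-validation is one way to get a more stable number. The sketch below uses synthetic data from `make_classification` as a stand-in (`X_demo` and `y_demo` are hypothetical); on the real data, `X` and `y` from the cell above would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with 7 features, like the feature list above (not Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=42)

# 5 stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```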

In [None]:
# Feature importance (for logistic regression, use coefficients)
importance = pd.Series(model.coef_[0], index=features)
importance.sort_values().plot(kind='barh')
plt.title('Feature Importance')
plt.show()

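**Figure 3.**  
Logistic regression coefficients, used here as a rough proxy for feature importance. The features are not standardized, so coefficient magnitudes are not directly comparable between features on different scales (e.g., `Fare` vs. `Sex`).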
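One way to make the coefficients more comparable is to standardize the features before fitting. This is a minimal sketch on synthetic data (`toy_X` and `toy_y` are made up, with the target deliberately driven by `Sex`); it assumes a `StandardScaler` plus `LogisticRegression` pipeline, which is not what the cell above does.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two made-up features on very different scales.
toy_X = pd.DataFrame({
    "Fare": rng.uniform(0, 500, size=200),
    "Sex": rng.integers(0, 2, size=200),
})
# Synthetic target that depends on Sex plus noise, and not on Fare.
toy_y = ((toy_X["Sex"] + rng.normal(0, 0.3, size=200)) > 0.5).astype(int)

# Scaling inside the pipeline puts both features on unit variance before fitting.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(toy_X, toy_y)

coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0],
                  index=toy_X.columns)
print(coefs.sort_values())
```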

## Step 5. Conclusions and Next Steps

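- The majority of passengers in the training data did not survive (Figure 1).
- Sex is strongly associated with survival (Figure 2); judging by the coefficients in Figure 3, sex and passenger class appear to be the most influential features.
- Logistic Regression with only basic preprocessing gives a reasonable baseline (see the accuracy printed in Step 4).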

**Next steps:**  
- Try other models (Random Forest, SVM)
- Advanced feature engineering (e.g., family size, title extraction)
- Hyperparameter tuning
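
The feature engineering ideas above (family size, title extraction) can be sketched as follows. The rows are a few illustrative examples in the dataset's `Surname, Title. Given names` name format, and the column names `FamilySize` and `Title` are hypothetical choices, not part of the original data.

```python
import pandas as pd

# Illustrative rows in the Kaggle column layout (Name, SibSp, Parch).
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    ],
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 0],
})

# Family size = siblings/spouses + parents/children + the passenger.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Title = the text between the comma and the first period in the name.
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(toy[["FamilySize", "Title"]])
```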