# 1. Problem Understanding
The goal is to predict whether a passenger survived the Titanic disaster based on:

* Gender

* Age

* Passenger class

* Fare and family information

This is a binary classification problem.

# 2. Dataset Description

The columns used from the Kaggle Titanic dataset (`train.csv`):

| Column   | Description |
|----------|------------|
| Survived | Target (0 = No, 1 = Yes) |
| Pclass   | Passenger class (1, 2, 3) |
| Sex      | Gender |
| Age      | Age |
| SibSp    | Siblings/Spouses aboard |
| Parch    | Parents/Children aboard |
| Fare     | Ticket fare |
| Embarked | Port of embarkation |


# 3. Load the Dataset

In [None]:
import pandas as pd

data = pd.read_csv(r"C:\Users\telug\Downloads\train.csv")
data.head()


# 4. Data Cleaning & Preparation

## Handle Missing Values

In [None]:

data["Age"] = data["Age"].fillna(data["Age"].median())

data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])
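
A quick way to confirm the imputation worked is to count missing values before and after filling. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the real `train.csv`) so it runs on its own; on the real data the same `isnull().sum()` call can be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Missing values per column before imputation
missing_before = toy.isnull().sum()

# Same strategy as above: median for Age, most frequent value for Embarked
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Missing values per column after imputation
missing_after = toy.isnull().sum()

print("Before:\n", missing_before, sep="")
print("After:\n", missing_after, sep="")
```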


# 5. Exploratory Data Analysis (EDA)

## Survival by Gender

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x="Sex", hue="Survived", data=data)
plt.title("Survival by Gender")
plt.show()


## Survival by Passenger Class

In [None]:
sns.countplot(x="Pclass", hue="Survived", data=data)
plt.title("Survival by Class")
plt.show()


## Survival by Age

In [None]:
sns.histplot(data=data, x="Age", hue="Survived", bins=30, kde=True)
plt.title("Survival by Age")
plt.show()


## Insights

* Females survived more than males

* First-class passengers had higher survival rates

* Children had better survival chances
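
These insights can be backed with numbers by computing the survival rate per group, since the mean of a 0/1 column is the share of survivors. The sketch below is a minimal illustration on a made-up frame: the values in `toy` are invented and are not Titanic statistics, and the `Age < 12` cutoff for "child" is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical toy frame (made-up rows, not real Titanic figures)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Age": [22, 38, 4, 54, 2, 27],
    "Survived": [0, 1, 1, 0, 1, 1],
})

# Mean of a 0/1 target within each group = survival rate of that group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# Assumed cutoff: treat passengers younger than 12 as children
toy["IsChild"] = toy["Age"] < 12
rate_by_child = toy.groupby("IsChild")["Survived"].mean()

print(rate_by_sex, rate_by_class, rate_by_child, sep="\n\n")
```

On the real data, replacing `toy` with `data` (before the encoding step, while `Sex` is still a text column) gives the actual rates.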

# 6. Encode Categorical Variables

In [None]:
data = pd.get_dummies(data, columns=["Sex", "Embarked"], drop_first=True)
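
To see what `drop_first=True` does, the sketch below encodes a tiny made-up frame (`toy` is hypothetical). Each categorical column loses its first category in sorted order, which becomes the implicit baseline, so `Sex` yields one column and `Embarked` yields two.

```python
import pandas as pd

# Hypothetical toy frame with the two categorical columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One-hot encode; drop_first removes the first category of each column
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)

print(list(encoded.columns))
```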


# 7. Feature Selection

In [None]:
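# Drop the target plus identifier and free-text columns:
# they are strings (and Cabin is mostly missing), so scaling and model fitting would fail on them
X = data.drop(columns=["Survived", "PassengerId", "Name", "Ticket", "Cabin"])
y = data["Survived"]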


# 8. Train–Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
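
One optional variation, not used in the split above, is to pass `stratify=y` so that the train and test sets keep the same class proportions as the full data. The sketch below shows the effect on made-up arrays (`X_toy` and `y_toy` are hypothetical) with a 60/40 class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 60 of class 0 and 40 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy preserves the 60/40 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test:", y_te.mean())
```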


# 9. Build Models

## 🔹 Scale the data + Logistic Regression

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression
lr = LogisticRegression(max_iter=2000)
lr.fit(X_train_scaled, y_train)

y_pred_lr = lr.predict(X_test_scaled)
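
Because the features are standardized, the fitted coefficients can be compared with each other to see which features push the prediction most. The sketch below is only an illustration on synthetic data: the `signal` and `noise` columns and the frame `X_toy` are hypothetical, chosen so that one feature clearly matters and the other does not.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends on "signal" but not on "noise"
rng = np.random.default_rng(0)
n = 300
X_toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_toy = (X_toy["signal"] + 0.3 * rng.normal(size=n) > 0).astype(int)

# Scale, then fit, mirroring the steps above
X_scaled = StandardScaler().fit_transform(X_toy)
model = LogisticRegression(max_iter=2000).fit(X_scaled, y_toy)

# Pair each coefficient with its feature name, largest magnitude first
coefs = pd.Series(model.coef_[0], index=X_toy.columns)
coefs = coefs.sort_values(key=abs, ascending=False)
print(coefs)
```

With the notebook's own objects, the equivalent would be `pd.Series(lr.coef_[0], index=X_train.columns)`.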


## 🔹 Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)
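
A fully grown tree can memorize the training set, so comparing training and test accuracy is a simple overfitting check. The sketch below uses synthetic data from `make_classification` (an assumption made only so the example is self-contained) and contrasts an unconstrained tree with one limited to `max_depth=3`, a value picked arbitrarily for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Unconstrained tree: grows until every training sample is classified correctly
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
train_acc_full = accuracy_score(y_tr, full_tree.predict(X_tr))
test_acc_full = accuracy_score(y_te, full_tree.predict(X_te))

# Depth-limited tree: less able to memorize noise
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
train_acc_shallow = accuracy_score(y_tr, shallow_tree.predict(X_tr))
test_acc_shallow = accuracy_score(y_te, shallow_tree.predict(X_te))

print("Full tree    train/test:", train_acc_full, test_acc_full)
print("Shallow tree train/test:", train_acc_shallow, test_acc_shallow)
```

A large gap between training and test accuracy for the full tree is the usual sign of overfitting.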


# 10. Model Evaluation

## Accuracy Score

In [None]:
from sklearn.metrics import accuracy_score

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))


## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
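
The four cells of the confusion matrix also give precision and recall, which say more than accuracy when the classes are unbalanced. The sketch below works through a small hand-made example (`y_true` and `y_pred` are hypothetical labels, not model output from this notebook).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm_toy = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm_toy.ravel()

# Precision = TP / (TP + FP), recall = TP / (TP + FN)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(cm_toy)
print("Precision:", precision, "Recall:", recall)
```

For the notebook's models, the same calls can be made with `y_test` and `y_pred_dt` or `y_pred_lr`.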


# 11. Conclusion (Write This in Your Project)

* Gender and passenger class strongly influence survival.

* Logistic Regression provides stable and interpretable results.

* Decision Trees capture complex patterns but may overfit.

* The models achieve good accuracy in predicting survival on the held-out test set.

# 12. Optional Improvements

* Feature engineering (FamilySize = SibSp + Parch)

* Hyperparameter tuning

* Random Forest or XGBoost

* Cross-validation
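
Several of these ideas can be combined in a few lines. The sketch below adds the `FamilySize` feature and scores a Random Forest with 5-fold cross-validation. It runs on randomly generated data: the frame `toy`, its made-up `Survived` rule, and the hyperparameters are all assumptions for illustration, so the printed score says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Randomly generated stand-in data (not the Titanic dataset)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "SibSp": rng.integers(0, 4, n),
    "Parch": rng.integers(0, 3, n),
    "Fare": rng.uniform(5, 100, n),
})
# Made-up target rule, only so the model has something to learn
toy["Survived"] = (toy["Fare"] + rng.normal(0, 20, n) > 50).astype(int)

# Feature engineering: FamilySize = SibSp + Parch
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

X_toy = toy.drop(columns="Survived")
y_toy = toy["Survived"]

# 5-fold cross-validation: five train/validation splits, one accuracy each
rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv_scores = cross_val_score(rf, X_toy, y_toy, cv=5, scoring="accuracy")

print("Fold accuracies:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
```

Averaging over folds gives a steadier estimate than a single train/test split, which is useful when comparing models or hyperparameter settings.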