# Titanic Survival Prediction with XGBoost 🚢

This notebook builds a machine learning model to predict which passengers survived the Titanic shipwreck. We will use the XGBoost algorithm for this classification task.

**Project Steps:**
1.  **Load and Inspect Data**: Import the dataset and get a first look at its structure.
2.  **Data Preprocessing & Feature Engineering**: Clean the data by handling missing values and create new features to improve model performance.
3.  **Model Building**: Split the data and train an XGBoost classifier.
4.  **Model Evaluation**: Assess the model's performance using metrics like accuracy and a confusion matrix.
5.  **Feature Importance**: Analyze which features were most influential in the model's predictions.

## 1. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Set plot style
sns.set_style('whitegrid')

## 2. Load and Inspect the Data

In [None]:
# Load the Titanic dataset from a public CSV hosted on GitHub
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Display the first 5 rows of the dataframe
print("First 5 rows of the dataset:")
display(df.head())

In [None]:
# Get a concise summary of the dataframe
print("\nDataset Information:")
df.info()
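
`df.info()` reports non-null counts, so a per-column tally of missing values can make the gaps easier to read before preprocessing. The sketch below is illustrative only: `toy` is a hypothetical frame with made-up values, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing entries per column and express them as a share of all rows
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Running the same two calls on `df` would show which columns need imputation and which are sparse enough to drop.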

## 3. Data Preprocessing and Feature Engineering

Here, we'll fill in missing values, derive the `FamilySize` and `Title` features, and encode the categorical columns as integers so the model can use them.

In [None]:
# Handle missing 'Age' values by filling with the median
# (assign the result back: fillna(inplace=True) on a column selection is deprecated in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing 'Embarked' values with the most frequent value (mode)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop the 'Cabin' column due to a large number of missing values
df.drop('Cabin', axis=1, inplace=True)

# --- Feature Engineering ---

# Create a 'FamilySize' feature from 'SibSp' and 'Parch'
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Create a 'Title' feature by extracting titles from the 'Name' column
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
# Consolidate rare titles into a single 'Rare' category
df['Title'] = df['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df['Title'] = df['Title'].replace('Mlle', 'Miss')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')

# --- Convert Categorical Features to Numerical ---

# Map 'Sex' to 0 and 1
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Map 'Embarked' to numerical values
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# Map 'Title' to numerical values
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
df['Title'] = df['Title'].map(title_mapping)
df['Title'] = df['Title'].fillna(0) # Fill any remaining NaNs in Title

print("Data after preprocessing and feature engineering:")
display(df.head())
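
To make the `Title` step concrete, here is a minimal sketch of what the extraction pattern and the numeric mapping do on a few names in the "Surname, Title. Given names" layout. The last name, with the title `Prof`, is a hypothetical entry added to show how an unmapped title falls back to 0.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Doe, Prof. Jane",  # hypothetical name with a title missing from the mapping
])

# Pattern: a space, then a run of letters (captured), then a literal period
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Same mapping as in the cell above; unmapped titles become NaN, then 0
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
codes = titles.map(title_mapping).fillna(0).astype(int)

print(titles.tolist())
print(codes.tolist())
```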

## 4. Model Building

Now we'll prepare the data for training and build the XGBoost model.

In [None]:
# Select features for the model
# Identifier-like columns (PassengerId, Name, Ticket) are left out
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'FamilySize', 'Title']
X = df[features]
y = df['Survived']

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
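
The split above is purely random. As an optional variation (not what the notebook does), `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both subsets. The sketch below uses hypothetical toy labels to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 negatives, 40 positives
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify=y_toy, both subsets keep the 40% positive rate
print(y_tr.mean(), y_te.mean())
```

Stratifying matters most when the test set is small, since a random split can then over- or under-represent survivors by chance.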

In [None]:
# Initialize and train the XGBoost classifier
# objective='binary:logistic': binary classification with probability output
# eval_metric='logloss': evaluation metric for binary classification
# (use_label_encoder is omitted: it is deprecated, and recent XGBoost releases ignore it)
model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss')

print("Training the XGBoost model...")
model.fit(X_train, y_train)
print("Model training complete.")
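
For intuition about the two settings above: with a logistic objective, the model's raw score is treated as log-odds and passed through a sigmoid to get a probability, and log loss is the mean negative log-likelihood of the true labels under those probabilities. The following is a small NumPy sketch of both formulas, with hypothetical scores and labels; it is not XGBoost's internal code.

```python
import numpy as np

def sigmoid(margin):
    """Map a raw score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-margin))

def log_loss_binary(y_true, p, eps=1e-15):
    """Mean negative log-likelihood of binary labels under probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

margins = np.array([2.0, -1.0, 0.0])  # hypothetical raw scores
labels = np.array([1, 0, 1])          # hypothetical true labels

probs = sigmoid(margins)
loss = log_loss_binary(labels, probs)
print(probs)
print(loss)
```

Log loss penalizes confident mistakes heavily: predicting 0.1 for a true positive costs far more than predicting 0.9.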

## 5. Model Evaluation

Let's see how well our model performs on the unseen test data.

In [None]:
# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Print the classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Did not Survive', 'Survived']))
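
The precision, recall, and F1 columns in the report follow directly from counts of true and false positives and negatives. Here is a hand-computed sketch for the positive class, using hypothetical labels rather than the model's actual predictions.

```python
# Hypothetical true labels and predictions (1 = survived)
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
pred_labels = [1, 0, 0, 1, 0, 1, 1, 1]

pairs = list(zip(true_labels, pred_labels))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # survivors predicted as survivors
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-survivors predicted as survivors
fn = sum(t == 1 and p == 0 for t, p in pairs)  # survivors the model missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-survivors predicted correctly

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of predicted survivors, how many survived
recall = tp / (tp + fn)     # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(f1, 4))
```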

In [None]:
# Generate and plot the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Did not Survive', 'Survived'], 
            yticklabels=['Did not Survive', 'Survived'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
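
When reading the heatmap, it helps to remember that scikit-learn puts true labels on the rows and predicted labels on the columns, so for binary labels the flattened matrix is ordered TN, FP, FN, TP. The sketch below uses hypothetical labels to unpack the four cells and to row-normalize the matrix, an optional extra that shows per-class rates instead of raw counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = survived)
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 1])

cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()  # row-major order: [[tn, fp], [fn, tp]]

# Divide each row by its total so every row sums to 1
cm_rates = cm / cm.sum(axis=1, keepdims=True)

print(cm)
print(cm_rates)
```

Passing `cm_rates` to `sns.heatmap` with `fmt='.2f'` would plot the normalized version.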

## 6. Feature Importance

Finally, let's visualize which features the model found most important for making predictions.

In [None]:
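# Plot feature importance
# importance_type='weight' counts how often each feature is used in a split;
# 'gain' and 'cover' are alternative measures
fig, ax = plt.subplots(figsize=(12, 8))
xgb.plot_importance(model, ax=ax, importance_type='weight')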
plt.title('Feature Importance')
plt.show()