# 🚢 Predicting Titanic Survival Using Naive Bayes ⚡
### A Machine Learning Approach to Disaster Analysis  



## Business Problem:  
**Can we predict survival outcomes using passenger demographics and socio-economic data?**  

This model isn't just historical; it mirrors modern use cases such as:  
- **Emergency Preparedness**: Prioritizing vulnerable groups (children, elderly) in evacuation plans.  
- **Bias Investigation**: Quantifying how *class, gender, and age* affected survival.  
- **Safety Benchmarking**: Evaluating if "women and children first" was truly followed.  

## Why Naive Bayes?  
- **Efficiency**: Handles large datasets with minimal computational power.  
- **Interpretability**: Outputs probabilities for clear decision-making.  
- **Baseline Model**: Perfect for establishing a performance benchmark before trying complex algorithms.  
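
To make the "outputs probabilities" point concrete, here is a minimal sketch of how a Gaussian Naive Bayes classifier could combine a class prior with per-feature Gaussian likelihoods. The per-class means, variances, and priors in `class_stats` are made-up numbers for illustration, not values estimated from the Titanic data, and the helper names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class statistics for two features (Age, Fare).
# These numbers are invented for illustration only.
class_stats = {
    0: {"prior": 0.6, "mean": [30.0, 20.0], "var": [200.0, 400.0]},
    1: {"prior": 0.4, "mean": [28.0, 50.0], "var": [220.0, 2500.0]},
}

def posterior(x):
    """Return P(class | x), treating features as independent given the class."""
    scores = {}
    for label, stats in class_stats.items():
        likelihood = 1.0
        for xi, m, v in zip(x, stats["mean"], stats["var"]):
            likelihood *= gaussian_pdf(xi, m, v)
        scores[label] = stats["prior"] * likelihood
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

probs = posterior([25.0, 80.0])
print(probs)
```

The "naive" part is the product over features: each feature contributes its own likelihood, and interactions between features are ignored.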

## Key Workflow Steps:  
1. **Feature Selection**: Focused on `Pclass`, `Sex`, `Age`, and `Fare` (critical factors proven in historical analysis).  
2. **Data Prep**:  
   - Convert categorical data (e.g., `Sex` to 0/1).  
   - Handle missing values (e.g., mean imputation for `Age`).  
3. **Model Evaluation**: Metrics like accuracy, precision, and recall to assess real-world usability.  

**Challenge to Audience**: *Could a passenger's fare price indirectly reveal their survival odds? Let's find out!*  

### 🚢 Titanic Survival Dataset: Key Features for ML
### 📊 Passenger Demographics & Survival Patterns

The Titanic dataset contains information about passengers aboard the RMS Titanic, with the following key features:

- **PassengerId**: Unique identifier for each passenger
- **Survived**: Binary indicator (0 = No, 1 = Yes)
- **Pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- **Name**: Passenger name (including title and family information)
- **Sex**: Gender (male/female)
- **Age**: Age in years (fractional for infants)
- **SibSp**: Number of siblings/spouses aboard
- **Parch**: Number of parents/children aboard
- **Ticket**: Ticket number
- **Fare**: Passenger fare
- **Cabin**: Cabin number (contains missing values)
- **Embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

This dataset is commonly used for predictive modeling tasks, particularly to predict passenger survival based on various attributes.

In [1]:
# Step 1: Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB


from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from sklearn.impute import SimpleImputer  # To handle missing values


In [3]:
#📥 Step 2: Load the Titanic Dataset

# Load the dataset from a local CSV file

df = pd.read_csv('titanic.csv')

# Preview
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# 🧹 Step 3: Select and Prepare Features  

### **Why These Features?**  
To predict survival, we need features that logically influence a passenger's chance of survival. Here's why we selected these:  
- **Pclass**: First-class passengers, who were generally wealthier, survived at a higher rate than those in lower classes.  
- **Sex**: The infamous *"women and children first"* policy affected survival rates.  
- **Age**: Children were prioritized during evacuation.  
- **Fare**: Correlates with class and potentially better deck locations.  

We dropped less impactful columns (like `PassengerId`, `Name`, or `Ticket`) to simplify our model.  
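
One way to sanity-check a candidate feature is to compare survival rates across its groups. The sketch below does this on only the five preview rows shown by `df.head()` above, so the resulting rates are far too coarse to draw conclusions from; the same `groupby` calls would be applied to the full `df`. The variable names are illustrative.

```python
import pandas as pd

# The five preview rows from df.head(); too few rows for any real conclusion.
preview = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# The mean of a 0/1 column is the survival rate within each group.
rate_by_sex = preview.groupby("Sex")["Survived"].mean()
rate_by_class = preview.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```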

### **Key Preparation Step:**  
- Convert **categorical data (Sex)** to numeric (male=0, female=1) because scikit-learn's `GaussianNB` requires numerical inputs.  
- Missing values? We'll handle them next! *(Hint: Age has gaps!)*  

In [4]:
#🧹 Step 3: Select and Prepare Features
# Select relevant columns; .copy() gives an independent DataFrame so the
# assignment below does not trigger pandas' SettingWithCopyWarning
df_model = df[["Survived", "Pclass", "Sex", "Age", "Fare"]].copy()

# Convert 'Sex' to numeric values
df_model["Sex"] = df_model["Sex"].map({"male": 0, "female": 1})
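
A caveat with `.map` and a dictionary: any label that is not a key in the dictionary silently becomes `NaN`. The sketch below shows this on a hypothetical column with an inconsistent label; the Titanic `Sex` column shown here has no such values, so this is only a defensive check one might add.

```python
import pandas as pd

# Hypothetical column with inconsistent labels, for illustration only.
sex = pd.Series(["male", "female", "Female", None])

mapping = {"male": 0, "female": 1}
encoded = sex.map(mapping)

# Values absent from the mapping (or already missing) come back as NaN.
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist(), n_unmapped)
```

Counting `NaN` values right after the mapping catches typos or unexpected categories before they reach the model.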


In [5]:
df_model.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare
0,0,3,0,22.0,7.25
1,1,1,1,38.0,71.2833
2,1,3,1,26.0,7.925
3,1,1,1,35.0,53.1
4,0,3,0,35.0,8.05


In [6]:
df_model.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
Fare          0
dtype: int64

In [7]:
df_model.shape

(891, 5)
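
Raw missing counts are easier to judge as a share of the rows. The sketch below shows the pandas idiom on a toy frame (illustrative values only) and then repeats the arithmetic for the counts reported above: 177 missing ages out of 891 rows.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap in Age; values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.1],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_share = toy.isnull().mean()
print(missing_share)

# Same arithmetic for the notebook's counts: 177 missing ages in 891 rows.
age_missing_share = 177 / 891
print(round(age_missing_share, 3))
```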

In [10]:
#🩹 Step 4: Handle Missing Values (Impute Age)

# Separate features (X) and target (y)
X = df_model.drop("Survived", axis=1)
y = df_model["Survived"]
X.head()

Unnamed: 0,Pclass,Sex,Age,Fare
0,3,0,22.0,7.25
1,1,1,38.0,71.2833
2,3,1,26.0,7.925
3,1,1,35.0,53.1
4,3,0,35.0,8.05


In [11]:
y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [14]:

# Use SimpleImputer to fill missing Age values with the column mean.
# SimpleImputer is a scikit-learn class that replaces missing values (NaN)
# using a chosen strategy: mean, median, most_frequent, or constant.
# Here, strategy='mean' replaces each missing Age with the mean of the observed ages.
imputer = SimpleImputer(strategy='mean')

X_imputed = imputer.fit_transform(X)

# Show the first 10 rows of the imputed array
X_imputed[:10]

array([[ 3.        ,  0.        , 22.        ,  7.25      ],
       [ 1.        ,  1.        , 38.        , 71.2833    ],
       [ 3.        ,  1.        , 26.        ,  7.925     ],
       [ 1.        ,  1.        , 35.        , 53.1       ],
       [ 3.        ,  0.        , 35.        ,  8.05      ],
       [ 3.        ,  0.        , 29.69911765,  8.4583    ],
       [ 1.        ,  0.        , 54.        , 51.8625    ],
       [ 3.        ,  0.        ,  2.        , 21.075     ],
       [ 3.        ,  1.        , 27.        , 11.1333    ],
       [ 2.        ,  1.        , 14.        , 30.0708    ]])
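
In the output above, the row with Age `29.69911765` is one where the missing value was replaced by the mean of the observed ages. A minimal NumPy sketch of the same idea, on a toy column with illustrative values, might look like this:

```python
import numpy as np

# Toy Age column with gaps; values are illustrative, not taken from the dataset.
age = np.array([22.0, 38.0, np.nan, 35.0, np.nan])

# nanmean ignores NaN entries, so the mean uses observed values only.
mean_age = np.nanmean(age)

# Replace only the missing entries, leaving observed values untouched.
age_filled = np.where(np.isnan(age), mean_age, age)
print(mean_age, age_filled)
```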

In [None]:
# .fit(X)

# Computes the mean (or other chosen statistic) of each column but does not modify the data yet.

# Stores these computed values (e.g., mean age) for later use.

# .transform(X)

# Actually replaces missing values (NaN) in X with the computed mean (or other strategy).

# Returns a new array (X_imputed) where all missing values are filled.

# .fit_transform(X)

# A shortcut that combines fit() and transform() in one step.

# Computes the required statistics (mean) and applies them immediately.
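
The split between `fit` and `transform` matters when data is divided into training and test sets. This notebook calls `fit_transform` on all of `X` before splitting, so test rows contribute to the imputed mean. A common alternative, sketched here on toy arrays with hypothetical names, is to fit on the training rows only and reuse that statistic for the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature data; the split is arbitrary and for illustration only.
X_train_toy = np.array([[20.0], [np.nan], [40.0]])
X_test_toy = np.array([[np.nan], [60.0]])

imputer_toy = SimpleImputer(strategy="mean")
imputer_toy.fit(X_train_toy)                        # mean learned from training rows only
X_test_filled = imputer_toy.transform(X_test_toy)   # test gaps filled with the training mean

print(imputer_toy.statistics_, X_test_filled.ravel())
```

For a single mean over a few hundred rows the difference is usually small, but fitting on the training set alone keeps the test set fully unseen.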

In [15]:
# Now X_imputed is a NumPy array with no missing values

import numpy as np
np.isnan(X_imputed).any()

False

In [16]:
#✂️ Step 5: Split into Train and Test Sets
# Train-test split (80% train, 20% test)

X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.2, random_state=42)
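
The split above is purely random. When classes are imbalanced, one option is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The sketch below uses toy labels with a 60/40 balance, not the Titanic data, and is not what this notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 60/40 class balance; not the Titanic data.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, the 20-row test set keeps the 60/40 balance.
print(len(y_te), y_te.mean())
```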

In [19]:
# 🤖 Step 6: Train Naive Bayes Classifier
# Initialize the classifier
nb_model = GaussianNB()

# Train on the training data
nb_model.fit(X_train, y_train)
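
After fitting, a `GaussianNB` model exposes what it learned: the class priors in `class_prior_` and the per-class feature means in `theta_`. The sketch below inspects them on toy data (one feature, illustrative values); the same attributes could be read from `nb_model`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: one feature, two well-separated classes; values are illustrative only.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model_toy = GaussianNB().fit(X_toy, y_toy)

print(model_toy.class_prior_)            # fraction of training rows in each class
print(model_toy.theta_)                  # per-class mean of each feature
print(model_toy.predict_proba([[2.5]]))  # class probabilities for a new point
```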

In [17]:
X_test

array([[  3.        ,   0.        ,  29.69911765,  15.2458    ],
       [  2.        ,   0.        ,  31.        ,  10.5       ],
       [  3.        ,   0.        ,  20.        ,   7.925     ],
       [  2.        ,   1.        ,   6.        ,  33.        ],
       [  3.        ,   1.        ,  14.        ,  11.2417    ],
       [  1.        ,   1.        ,  26.        ,  78.85      ],
       [  3.        ,   1.        ,  29.69911765,   7.75      ],
       [  3.        ,   0.        ,  16.        ,  18.        ],
       [  3.        ,   1.        ,  16.        ,   7.75      ],
       [  1.        ,   1.        ,  19.        ,  26.2833    ],
       [  1.        ,   0.        ,  37.        ,  53.1       ],
       [  3.        ,   0.        ,  44.        ,   8.05      ],
       [  3.        ,   1.        ,  29.69911765,  25.4667    ],
       [  3.        ,   0.        ,  30.        ,   7.225     ],
       [  2.        ,   0.        ,  36.        ,  13.        ],
       [  1.        ,   1

In [20]:
# 📊 Step 7: Make Predictions and Evaluate
# Predict on test set
y_pred = nb_model.predict(X_test)

# Evaluation metrics
print("✅ Accuracy:", accuracy_score(y_test, y_pred))
print("\n🧾 Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\n📄 Classification Report:\n", classification_report(y_test, y_pred))

✅ Accuracy: 0.7597765363128491

🧾 Confusion Matrix:
 [[83 22]
 [21 53]]

📄 Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.79      0.79       105
           1       0.71      0.72      0.71        74

    accuracy                           0.76       179
   macro avg       0.75      0.75      0.75       179
weighted avg       0.76      0.76      0.76       179
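
The figures in the report can be reproduced by hand from the confusion matrix, where rows are actual classes and columns are predicted classes. A short sketch for class 1 (survived), with illustrative variable names:

```python
# Confusion matrix from the output above: rows = actual, columns = predicted.
tn, fp = 83, 22
fn, tp = 21, 53

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # of predicted survivors, how many actually survived
recall_1 = tp / (tp + fn)      # of actual survivors, how many the model found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```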



In [None]:
# Summary

# 📢 Out of 179 test predictions, about 76% (136) were correct.
# That's a decent accuracy for a simple probabilistic model like Naive Bayes.

# Performance is fairly balanced across both classes.

# There is some room for improvement, especially for class 1 (survived),
# where precision (0.71) and recall (0.72) trail class 0 (0.80 and 0.79).
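
To put the 76% in context, it helps to compare it with a trivial baseline that always predicts the majority class. The sketch below uses only numbers already printed above: the support column (105 and 74) and the reported accuracy. The variable names are illustrative.

```python
# Test-set class counts from the classification report's support column.
n_died, n_survived = 105, 74
model_accuracy = 0.7597765363128491  # accuracy printed by the evaluation cell

# A trivial classifier that always predicts the majority class (did not survive).
baseline_accuracy = max(n_died, n_survived) / (n_died + n_survived)
lift = model_accuracy - baseline_accuracy

print(round(baseline_accuracy, 3), round(lift, 3))
```

On this split, the Naive Bayes model is roughly 17 percentage points ahead of always guessing "did not survive".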