### Titanic Survival Prediction Using Naïve Bayes

Build a Naïve Bayes classifier on the attached Titanic data set to predict whether a passenger survived.
This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner "Titanic", summarized according to survival (target variable with 1 = survived and 0 = died) and the explanatory variables: Name, Pclass (passenger class), Sex, Age, SibSp (number of siblings and spouses traveling with the passenger), Parch (number of parents and children traveling with the passenger), Ticket, Fare, Cabin, and Embarked (port where the traveler boarded: Southampton, Cherbourg, or Queenstown).

## 1. Import the data set into a pandas data frame.

In [7]:
# 1. Import the dataset into pandas
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [8]:
# Load dataset (change path if needed)
df = pd.read_csv("train.csv")

print("Top rows of the dataset:")
print(df.head())
print("\n")

Top rows of the dataset:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8

## 2. Split the data into training and test sets.

In [9]:
# 2. Split the data into training and test sets

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
target = 'Survived'

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (712, 7)
Shape of X_test: (179, 7)
Shape of y_train: (712,)
Shape of y_test: (179,)
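The split above passes stratify=y, which keeps the share of survivors roughly equal in the training and test sets. A minimal sketch of that effect, using hypothetical toy labels (60 zeros and 40 ones) rather than the Titanic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 60 zeros and 40 ones (not the Titanic data)
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the share of class 1 is about the same in all three sets
print("overall:", y_demo.mean())
print("train:  ", y_tr.mean())
print("test:   ", y_te.mean())
```

Without stratify, a small test set can end up with a noticeably different class balance, which makes the evaluation scores harder to compare with the training data.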


## 3. Select one or more explanatory variables you would like to use.

I use the seven features selected in step 2 as explanatory variables for the Naive Bayes model: Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked. PassengerId, Name, Ticket, and Cabin are not used.

## 4. Figure out if there are any missing values in the explanatory variables you want to use and either delete those passengers from the data set or fill in the missing values. If a numerical variable has missing values, you might fill those in with the average or median of that variable. If a categorical variable has missing values, you might fill those in using the most common value. You can create your own script for missing values or you can use sklearn SimpleImputer.

In [11]:
# 4. Handle missing values (numerical -> median, categorical -> most frequent)
print("Missing values in X_train before imputation:")
print(X_train.isnull().sum())
print("\nMissing values in X_test before imputation:")
print(X_test.isnull().sum())

numeric_features = ['Age', 'SibSp', 'Parch', 'Fare', 'Pclass']
categorical_features = ['Sex', 'Embarked']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Apply the transformations to X_train and X_test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print("\nShape of X_train_processed:", X_train_processed.shape)
print("Shape of X_test_processed:", X_test_processed.shape)


Missing values in X_train before imputation:
Pclass        0
Sex           0
Age         137
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

Missing values in X_test before imputation:
Pclass       0
Sex          0
Age         40
SibSp        0
Parch        0
Fare         0
Embarked     0
dtype: int64

Shape of X_train_processed: (712, 10)
Shape of X_test_processed: (179, 10)
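The preprocessor is fitted on X_train only and then applied to X_test, so the test rows are filled with the training median. A small sketch of that behaviour with SimpleImputer, on hypothetical toy values (not the Titanic data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy columns (not the Titanic data)
train_demo = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test_demo = pd.DataFrame({"Age": [np.nan, 80.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_demo)  # learns the median of 20, 30, 40
test_filled = imputer.transform(test_demo)        # reuses the training median

print(imputer.statistics_)   # learned median per column
print(test_filled.ravel())   # the missing test value is filled with that median
```

Fitting the imputer on the training rows only keeps information from the test set out of the model.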


## 5. Convert the categorical variables to numerical using encoding. You can create your own script or use sklearn LabelEncoder.

Done above: Sex and Embarked were one-hot encoded by the OneHotEncoder inside the ColumnTransformer in step 4.

## 6. Build a model on the training data. You can create your own code or use sklearn Naïve Bayes. If you use a mix of continuous and categorical explanatory variables, think of how you can build the model.

In [17]:
# 6. Build Naive Bayes model using the training data
# The features were already preprocessed in step 4 (X_train_processed),
# so a plain GaussianNB is fitted on the processed array.
# Note: GaussianNB treats the one-hot encoded columns as continuous values.

# Initialize Gaussian Naive Bayes classifier
model = GaussianNB()

# Train the model
model.fit(X_train_processed, y_train)

print("Naive Bayes model trained successfully.")

Naive Bayes model trained successfully.
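The model above applies a Gaussian likelihood to every column, including the one-hot encoded ones. One possible alternative for mixed data, sketched here on hypothetical synthetic data and not used for the results in this notebook, is to fit GaussianNB on the continuous columns and CategoricalNB on integer-coded categorical columns, then combine their log posteriors. The helper name mixed_nb_proba is my own, and the class prior is subtracted once because both models include it.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Hypothetical toy data (not the Titanic data): two continuous columns and
# two categorical columns that are already integer-coded (0, 1, 2, ...)
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)
X_num = rng.normal(loc=y_toy[:, None], scale=1.0, size=(200, 2))
X_cat = np.column_stack([
    (rng.random(200) < 0.3 + 0.4 * y_toy).astype(int),
    rng.integers(0, 3, size=200),
])

gnb = GaussianNB().fit(X_num, y_toy)     # Gaussian likelihood for continuous columns
cnb = CategoricalNB().fit(X_cat, y_toy)  # categorical likelihood for coded columns


def mixed_nb_proba(X_num, X_cat):
    """Combine both posteriors; the prior appears in each, so subtract it once."""
    log_post = (
        gnb.predict_log_proba(X_num)
        + cnb.predict_log_proba(X_cat)
        - np.log(gnb.class_prior_)
    )
    # Renormalize so that each row sums to 1
    log_post -= logsumexp(log_post, axis=1, keepdims=True)
    return np.exp(log_post)


proba = mixed_nb_proba(X_num, X_cat)
pred = gnb.classes_[proba.argmax(axis=1)]
print(proba[:3])
print("training accuracy:", (pred == y_toy).mean())
```

This relies on the Naive Bayes independence assumption: given the class, the continuous and categorical likelihoods multiply, so their log terms add.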


## 7. Inspect the evaluation measures (accuracy score, confusion matrix, classification report).

In [19]:
# Make predictions on the test set
y_pred = model.predict(X_test_processed)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {accuracy:.4f}")

# Generate Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

# Generate Classification Report
class_report = classification_report(y_test, y_pred)
print("\nClassification Report:\n", class_report)

Accuracy Score: 0.7877

Confusion Matrix:
 [[93 17]
 [21 48]]

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.85      0.83       110
           1       0.74      0.70      0.72        69

    accuracy                           0.79       179
   macro avg       0.78      0.77      0.77       179
weighted avg       0.79      0.79      0.79       179
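The scores in the classification report follow directly from the confusion matrix. A short check that recomputes accuracy and the class-1 (survived) precision, recall, and F1 from the four counts printed above:

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class
cm = np.array([[93, 17],
               [21, 48]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)  # of predicted survivors, the share who survived
recall_1 = tp / (tp + fn)     # of actual survivors, the share predicted to survive
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:    {accuracy:.4f}")
print(f"precision_1: {precision_1:.2f}")
print(f"recall_1:    {recall_1:.2f}")
print(f"f1_1:        {f1_1:.2f}")
```

The lower recall for class 1 reflects the 21 survivors that the model predicted as not surviving.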



## 8. Take some values for the explanatory variables and use your model to predict if that person would have survived or not

In [23]:
# 8. Predict survival for new passengers

sample_passengers = pd.DataFrame([
    {'Pclass': 1, 'Sex': 'female', 'Age': 30, 'SibSp': 0, 'Parch': 0, 'Fare': 100, 'Embarked': 'C'},
    {'Pclass': 3, 'Sex': 'male', 'Age': 25, 'SibSp': 0, 'Parch': 0, 'Fare': 7.25, 'Embarked': 'S'},
    {'Pclass': 2, 'Sex': 'male', 'Age': 60, 'SibSp': 1, 'Parch': 0, 'Fare': 20, 'Embarked': 'Q'}
])

# Apply the same preprocessor to the sample passengers
sample_passengers_processed = preprocessor.transform(sample_passengers)

predictions = model.predict(sample_passengers_processed)
probabilities = model.predict_proba(sample_passengers_processed)

print("\nSample passengers:")
print(sample_passengers)

print("\nPredicted Survival (1=survived, 0=died) for preprocessed samples:")
print(predictions)

print("\nPredicted Probabilities [P(died), P(survived)] for preprocessed samples:")
print(probabilities)


Sample passengers:
   Pclass     Sex  Age  SibSp  Parch    Fare Embarked
0       1  female   30      0      0  100.00        C
1       3    male   25      0      0    7.25        S
2       2    male   60      1      0   20.00        Q

Predicted Survival (1=survived, 0=died) for preprocessed samples:
[1 0 0]

Predicted Probabilities [P(died), P(survived)] for preprocessed samples:
[[2.29451196e-05 9.99977055e-01]
 [9.93680084e-01 6.31991609e-03]
 [8.31802635e-01 1.68197365e-01]]
