# Predicting Survival Rate on the Titanic Dataset Using Various Classification Models

This project predicts whether a passenger on the Titanic survived, comparing seven classification models: Support Vector Classifier (SVC), Linear Support Vector Classifier (LinearSVC), Random Forest, Logistic Regression, K-Nearest Neighbors (KNN), Gaussian Naive Bayes, and Decision Tree. Each model is evaluated with 5-fold cross-validation, and the one with the highest mean accuracy is selected.

## Requirements

- Python 3.x
- pandas
- scikit-learn (1.3 or later; the notebook uses `LinearSVC(dual='auto')`)
- numpy

## Installation

1. Install Python from the official [website](https://www.python.org/).
2. Install the required libraries using pip:
    ```bash
    pip install pandas scikit-learn numpy
    ```

## Dataset

The data comes from Kaggle's Titanic dataset. After preprocessing (see Data Analytics), the file used here contains the following columns:
- Unnamed: 0 (index column)
- Survived (target variable)
- Pclass
- Sex
- SibSp
- Parch
- Cabin
- Embarked
- relative
- not_alone
- Agerange
- Faregroup

## Data Analytics

The data analytics part, including data cleaning, feature engineering, and handling missing values, was performed in R by Sattaya. The processed dataset is saved as `titanic.csv`.


In [14]:
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np


### Load and preprocess the data

In [15]:
dataset = pd.read_csv('./dataset/titanic.csv')

In [16]:
dataset.head()

,Unnamed: 0,Survived,Pclass,Sex,SibSp,Parch,Cabin,Embarked,relative,not_alone,Agerange,Faregroup
0,1,0,3,1,1,0,0,0,1,0,3,1
1,2,1,1,0,1,0,3,1,1,0,4,1
2,3,1,3,0,0,0,0,0,0,1,3,1
3,4,1,1,0,1,0,3,0,1,0,4,1
4,5,0,3,1,0,0,0,0,0,1,4,1
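
The rows shown above suggest how two of the engineered columns relate to the raw counts: `relative` appears to equal `SibSp + Parch`, and `not_alone` appears to be 1 when `relative` is 0 (so the flag marks passengers travelling alone, despite its name). The sketch below checks both relationships on the five rows printed above. This is an inference from the sample, not a statement of how the R preprocessing was written.

```python
import pandas as pd

# The five rows printed by dataset.head(), re-typed by hand for illustration.
sample = pd.DataFrame({
    "SibSp":     [1, 1, 0, 1, 0],
    "Parch":     [0, 0, 0, 0, 0],
    "relative":  [1, 1, 0, 1, 0],
    "not_alone": [0, 0, 1, 0, 1],
})

# Assumed relationship 1: relative is the count of siblings/spouses plus parents/children.
relative_ok = (sample["relative"] == sample["SibSp"] + sample["Parch"]).all()

# Assumed relationship 2: not_alone is 1 exactly when relative is 0.
flag_ok = (sample["not_alone"] == (sample["relative"] == 0).astype(int)).all()

print(relative_ok, flag_ok)
```

Running the same two comparisons on the full `dataset` would show whether the pattern holds beyond these five rows.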


In [17]:
dataset.describe()

,Unnamed: 0,Survived,Pclass,Sex,SibSp,Parch,Cabin,Embarked,relative,not_alone,Agerange,Faregroup
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,0.647587,0.523008,0.381594,0.776655,0.361392,0.904602,0.602694,3.338945,1.113356
std,257.353842,0.486592,0.836071,0.47799,1.102743,0.806057,1.590899,0.635673,1.613459,0.489615,0.939542,0.482911
min,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
25%,223.5,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0
50%,446.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,1.0
75%,668.5,1.0,3.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,4.0,1.0
max,891.0,1.0,3.0,1.0,8.0,6.0,8.0,2.0,10.0,1.0,6.0,6.0


In [18]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  891 non-null    int64
 1   Survived    891 non-null    int64
 2   Pclass      891 non-null    int64
 3   Sex         891 non-null    int64
 4   SibSp       891 non-null    int64
 5   Parch       891 non-null    int64
 6   Cabin       891 non-null    int64
 7   Embarked    891 non-null    int64
 8   relative    891 non-null    int64
 9   not_alone   891 non-null    int64
 10  Agerange    891 non-null    int64
 11  Faregroup   891 non-null    int64
dtypes: int64(12)
memory usage: 83.7 KB


In [19]:
dataset = dataset.drop(columns=['Unnamed: 0'])  # Drop the Unnamed: 0 column

# Separate features and target variable
X = dataset.drop(columns=['Survived'])  # Features
y = dataset['Survived']  # Target variable
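
The `Unnamed: 0` column is what pandas produces when a CSV that was saved with its row index is read back without declaring that index. The sketch below uses an in-memory CSV with made-up values to show two ways of avoiding the manual drop: pass `index_col=0` to `read_csv`, or filter out every column whose name starts with `Unnamed`. The variable names are illustrative only.

```python
import io
import pandas as pd

# A tiny CSV saved with its row index, so the first header cell is empty.
csv_text = ",Survived,Pclass,Sex\n1,0,3,1\n2,1,1,0\n3,1,3,0\n"

# Default read: the unnamed first column shows up as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: declare the first column as the index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop every column whose name starts with 'Unnamed'.
filtered = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]

print(list(raw.columns))
print(list(indexed.columns))
print(list(filtered.columns))
```

Option 2 also handles the case where the same file has been saved and reloaded more than once, which would leave several such columns.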

### Define the models

In [20]:
models = [
    ('SVC', SVC()),
    ('LinearSVC', LinearSVC(dual='auto', max_iter=2000)),  # dual='auto' lets scikit-learn pick the formulation; higher max_iter helps convergence
    ('RandomForest', RandomForestClassifier()),
    ('LogisticRegression', LogisticRegression(max_iter=1000)),
    ('KNeighbors', KNeighborsClassifier()),
    ('GaussianNB', GaussianNB()),
    ('DecisionTree', DecisionTreeClassifier())
]
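
`SVC` and `KNeighborsClassifier` depend on distances between samples, so features with larger numeric ranges can dominate. The notebook feeds the integer-coded features in directly. One possible variation, not part of the original workflow, is to wrap such models in a pipeline with `StandardScaler`, so that the scaling is refit inside each cross-validation fold. The sketch below uses synthetic data from `make_classification` as a stand-in for `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features (assumption: 10 numeric columns).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaler and classifier are fit together on each training fold only.
scaled_svc = make_pipeline(StandardScaler(), SVC())

scaled_scores = cross_val_score(scaled_svc, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Scaled SVC: Mean Accuracy = {scaled_scores.mean():.4f}")
```

Whether scaling helps on the real data is an open question here; the features are already binned into small integer ranges, so the effect may be minor.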

### Perform cross-validation


In [21]:
results = {}
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')  # 5-fold cross-validation
    results[name] = scores
    print(f"{name}: Mean Accuracy = {scores.mean():.4f}, Std Dev = {scores.std():.4f}")

SVC: Mean Accuracy = 0.8092, Std Dev = 0.0117
LinearSVC: Mean Accuracy = 0.7946, Std Dev = 0.0124
RandomForest: Mean Accuracy = 0.8002, Std Dev = 0.0197
LogisticRegression: Mean Accuracy = 0.8002, Std Dev = 0.0109
KNeighbors: Mean Accuracy = 0.7902, Std Dev = 0.0230
GaussianNB: Mean Accuracy = 0.7857, Std Dev = 0.0358
DecisionTree: Mean Accuracy = 0.7991, Std Dev = 0.0185
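
For context, `describe()` reports a mean of 0.383838 for `Survived`, which corresponds to 342 survivors among 891 passengers. A classifier that always predicts the majority class (did not survive) would therefore score about 549/891 ≈ 0.616. The sketch below reproduces that baseline with `DummyClassifier` on a label vector built from those counts. The feature matrix is a placeholder, because the dummy model ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Labels rebuilt from the counts implied by describe(): 342 survivors, 549 non-survivors.
y_demo = np.array([1] * 342 + [0] * 549)

# Placeholder features; DummyClassifier never looks at them.
X_demo = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Majority baseline: Mean Accuracy = {baseline_scores.mean():.4f}")
```

All seven models above clear this baseline by a wide margin, while the differences among them are of the same order as their fold-to-fold standard deviations.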


### Select the best model based on mean accuracy

In [22]:
# Select the best model based on mean accuracy
best_model_name = max(results, key=lambda k: results[k].mean())
best_model_score = results[best_model_name].mean()

print(f"\nBest Model: {best_model_name} with Mean Accuracy = {best_model_score:.4f}")



Best Model: SVC with Mean Accuracy = 0.8092


In [23]:
# Look up the already-instantiated estimator for the best-scoring model
best_model = dict(models)[best_model_name]

In [24]:
best_model  # Display the selected estimator (not yet fitted)

### Train the best model on the entire training set


In [None]:
best_model.fit(X, y)
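
This explicit `fit` is needed because `cross_val_score` trains clones of the estimator it is given and leaves the original object unfitted. The sketch below illustrates this on synthetic data by checking for the `classes_` attribute, which `SVC` sets during `fit`. The data and variable names are stand-ins, not the notebook's own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for X and y.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

model = SVC()

# Cross-validation fits internal clones; 'model' itself is left untouched.
cross_val_score(model, X_demo, y_demo, cv=5)
fitted_after_cv = hasattr(model, "classes_")

# Only an explicit fit makes the object usable for predict().
model.fit(X_demo, y_demo)
fitted_after_fit = hasattr(model, "classes_")

print(fitted_after_cv, fitted_after_fit)
```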

### Use the model to predict on the test dataset

In [70]:
test_data = pd.read_csv('test.csv')

In [76]:
test_data

,Unnamed: 0,PassengerId,Pclass,Sex,SibSp,Parch,Cabin,Embarked,relative,not_alone,Agerange,Faregroup
0,1,892,3,1,0,0,0,2,0,1,4,1.0
1,2,893,3,0,1,0,0,0,1,0,5,1.0
2,3,894,2,1,0,0,0,2,0,1,6,1.0
3,4,895,3,1,0,0,0,0,0,1,3,1.0
4,5,896,3,0,1,1,0,0,2,0,3,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
413,414,1305,3,1,0,0,0,0,0,1,3,1.0
414,415,1306,1,0,0,0,3,1,0,1,4,2.0
415,416,1307,3,1,0,0,0,0,0,1,4,1.0
416,417,1308,3,1,0,0,0,0,0,1,3,1.0


In [71]:
X_test = test_data.drop(columns=['Unnamed: 0', 'PassengerId'])

In [72]:
X_test

,Pclass,Sex,SibSp,Parch,Cabin,Embarked,relative,not_alone,Agerange,Faregroup
0,3,1,0,0,0,2,0,1,4,1.0
1,3,0,1,0,0,0,1,0,5,1.0
2,2,1,0,0,0,2,0,1,6,1.0
3,3,1,0,0,0,0,0,1,3,1.0
4,3,0,1,1,0,0,2,0,3,1.0
...,...,...,...,...,...,...,...,...,...,...
413,3,1,0,0,0,0,0,1,3,1.0
414,1,0,0,0,3,1,0,1,4,2.0
415,3,1,0,0,0,0,0,1,4,1.0
416,3,1,0,0,0,0,0,1,3,1.0


In [81]:
rows_with_nan = X_test[X_test.isnull().any(axis=1)]
rows_with_nan

,Pclass,Sex,SibSp,Parch,Cabin,Embarked,relative,not_alone,Agerange,Faregroup
152,3,1,0,0,0,0,0,1,5,NaN


In [86]:
df_filled = X_test.fillna(2)  # Fill the one missing Faregroup value (row 152) with 2
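
The constant 2 used above is not explained in the notebook; `describe()` shows that the training-set median of `Faregroup` is 1. One common alternative, offered here only as a sketch, is to learn the fill value from the training features with `SimpleImputer` and apply it to the test features. The two small frames below are made-up stand-ins for `X` and `X_test`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-ins for the training and test feature frames.
X_train_demo = pd.DataFrame({"Agerange": [3.0, 4.0, 3.0, 4.0], "Faregroup": [1.0, 1.0, 1.0, 2.0]})
X_test_demo = pd.DataFrame({"Agerange": [5.0, 3.0], "Faregroup": [np.nan, 1.0]})

# Learn the most frequent value of each column from the training data only.
imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(X_train_demo)

# Apply the learned fill values to the test data, keeping the column labels.
X_test_filled = pd.DataFrame(
    imputer.transform(X_test_demo),
    columns=X_test_demo.columns,
    index=X_test_demo.index,
)
print(X_test_filled)
```

With a single missing value among 418 test rows, the choice of fill value affects at most one prediction, so this is mainly about keeping the preprocessing traceable to the training data.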

In [87]:
new_dataframe = test_data[['PassengerId']].copy()

In [88]:
new_dataframe['Survived'] = best_model.predict(df_filled)
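
Estimators fitted on a DataFrame record the column names they saw, and recent scikit-learn versions expect the same names in the same order at prediction time. A hedged sketch of guarding against a mismatch is to reindex the test features to the training column order before calling `predict`. The frames and the `LogisticRegression` model below are illustrative stand-ins for `X`, `df_filled`, and `best_model`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative training frame; column names mirror two of the real features.
X_train_demo = pd.DataFrame({"Pclass": [1, 3, 3, 1, 2, 3], "Sex": [0, 1, 1, 0, 0, 1]})
y_train_demo = pd.Series([1, 0, 0, 1, 1, 0])

# Test frame with the same columns in a different order.
X_test_demo = pd.DataFrame({"Sex": [0, 1], "Pclass": [1, 3]})

model = LogisticRegression().fit(X_train_demo, y_train_demo)

# Reorder the test columns to match training; a missing column raises KeyError.
X_test_aligned = X_test_demo[X_train_demo.columns]

predictions = model.predict(X_test_aligned)
print(list(X_test_aligned.columns), predictions)
```

In the notebook the equivalent line would be `df_filled[X.columns]`, assuming the training and test files share the same feature names.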

In [89]:
new_dataframe

,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0
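
The notebook stops at displaying the predictions. If they are to be written out in the two-column `PassengerId,Survived` layout, passing `index=False` to `to_csv` keeps the row index out of the file and avoids creating another `Unnamed: 0` column on the next read. The sketch writes to an in-memory buffer; the values are copied from the first rows of the output above, and the variable name `submission` is hypothetical.

```python
import io
import pandas as pd

# First three predictions from the output above, re-typed for illustration.
submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# index=False omits the row index, so no extra unnamed column is written.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields exactly the two expected columns.
round_trip = pd.read_csv(io.StringIO(csv_text))
```

To write a real file, the buffer can be replaced with a path of your choosing.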
