
### Example for Classification
In this example, a modified Titanic dataset is used to build two classifiers, k-Nearest Neighbours and a Decision Tree, that predict whether a passenger survived.
The modified Titanic dataset consists of the following columns:
- Pclass - ticket class
- Sex - gender of the passenger
- Age - age of passenger
- SibSp - number of siblings / spouses aboard the Titanic
- Parch - number of parents / children aboard the Titanic
- Fare - passenger fare
- Survived - survival (0 = No, 1 = Yes)

In [None]:
# Import the necessary modules and packages
import pandas as pd
from sklearn.model_selection import train_test_split as split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Load the dataset from the CSV file
df = pd.read_csv('titanic_demo.csv')

In [None]:
# Check the number of rows and columns, the column data types and the non-null counts
df.info()

In [None]:
# Randomly view 5 data samples from the dataset
df.sample(5)

In [None]:
# Count the missing values in each column
# Any missing values must be handled before a model is trained
df.isna().sum()

In [None]:
# Replace missing values in the "Age" column with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())

# Check whether there is any more missing data
df.isna().sum()
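
The median imputation above can be illustrated on a tiny column. This is a minimal sketch on made-up ages (the values are hypothetical and not taken from `titanic_demo.csv`); it shows that `median()` skips missing values and that the median is less affected by an extreme age than the mean is.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not taken from titanic_demo.csv
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 80.0], name="Age")

# Both statistics ignore NaN by default
median_age = ages.median()
mean_age = ages.mean()

# Fill the gaps with the median, as done for df['Age'] above
filled = ages.fillna(median_age)

print(f"median = {median_age}, mean = {mean_age}")  # median = 32.0, mean = 41.5
print(filled.tolist())  # [22.0, 32.0, 38.0, 26.0, 32.0, 80.0]
```

The single passenger aged 80 pulls the mean up to 41.5, while the median stays at 32.0, which is one common reason to prefer the median for skewed columns.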

In [None]:
# Calculate descriptive statistics
df.describe()

In [None]:
# Apply one-hot encoding to convert nominal categorical data to numerical data
df2 = pd.get_dummies(df, drop_first=True)
df2.sample(5)
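
To see what `drop_first=True` does, the sketch below encodes a small hypothetical frame (made-up rows, assuming `Sex` is the only text column) with and without the option. With two categories, one indicator column already carries all the information, so the first one is dropped.

```python
import pandas as pd

# Hypothetical toy rows, not taken from titanic_demo.csv
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

full = pd.get_dummies(toy)                      # one indicator column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category (female) is dropped

print(list(full.columns))     # ['Age', 'Sex_female', 'Sex_male']
print(list(reduced.columns))  # ['Age', 'Sex_male']
```

In the reduced frame, a row where `Sex_male` is false (or 0) is a female passenger, so nothing is lost by dropping `Sex_female`.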

In [None]:
# Extract the "Survived" column (targets) into y
y = df2['Survived'].values

# Delete the "Survived" column so that only the features remain
del df2['Survived']

# Extract the remaining columns (features) into X
X = df2.values

# Print the dimensions of X and y
print(f"Dimension of X: {X.shape}")
print(f"Dimension of y: {y.shape}")
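
As an alternative to `del`, the features and targets can be separated without modifying the original DataFrame. This is a minimal sketch on a hypothetical toy frame (made-up rows and the name `toy` are assumptions); `drop` returns a new frame and leaves the original intact, which makes the cell safe to re-run.

```python
import pandas as pd

# Hypothetical toy rows, not taken from titanic_demo.csv
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex_male": [1, 0, 0, 1],
    "Survived": [0, 1, 1, 0],
})

# Targets: the "Survived" column as a NumPy array
y_toy = toy["Survived"].to_numpy()

# Features: every column except "Survived"; toy itself is not changed
X_toy = toy.drop(columns="Survived").to_numpy()

print(X_toy.shape, y_toy.shape)   # (4, 2) (4,)
print("Survived" in toy.columns)  # True
```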

In [None]:
# Split 75% of the dataset for training and the remaining 25% for testing
X_train, X_test, y_train, y_test = split(X, y, test_size=0.25, random_state=42)

# Print the number of data samples for training and testing
print(f"Number of data samples for training: {X_train.shape[0]}")
print(f"Number of data samples for testing: {X_test.shape[0]}")
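
The sketch below, on hypothetical toy arrays, shows two properties of the split: a fixed `random_state` makes it reproducible, and `test_size=0.25` sends a quarter of the rows to the test set. The `stratify` argument at the end is an optional addition that this notebook does not use; it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 12 of class 0 and 8 of class 1
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# The same random_state gives the same split every time
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

print(a_train.shape[0], a_test.shape[0])  # 15 5
print(np.array_equal(a_test, b_test))     # True

# Optional (not used in this notebook): preserve the 60/40 class ratio
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
print(y_test_strat.mean())  # 0.4
```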

In [None]:
# Train a k-NN model (default k = 5) on the training data to predict passenger survival
knn = KNeighborsClassifier().fit(X_train, y_train)

# Evaluate the k-NN model with the testing data and print the accuracy
print(f"knn accuracy: {knn.score(X_test, y_test)}")
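
For a classifier, `score` returns the accuracy: the fraction of test samples whose predicted label matches the true label. The sketch below checks this by hand on synthetic data (the arrays and the names `X_demo`, `knn_demo` are made up for illustration), with `n_neighbors=5` written out because that is the default used above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature data, purely illustrative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Simple hold-out: first 30 rows to fit, last 10 to evaluate
X_fit, y_fit = X_demo[:30], y_demo[:30]
X_eval, y_eval = X_demo[30:], y_demo[30:]

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)

# Accuracy computed by hand: share of correct predictions
manual_accuracy = np.mean(knn_demo.predict(X_eval) == y_eval)

# The two numbers printed here are the same
print(manual_accuracy, knn_demo.score(X_eval, y_eval))
```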

In [None]:
# Train a decision tree model on the training data to predict passenger survival
dtc = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Evaluate the decision tree model with the testing data and print the accuracy
print(f"Decision tree accuracy: {dtc.score(X_test, y_test)}")
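
A decision tree with no depth limit keeps splitting until it fits the training data, so its training accuracy says little about how it will do on unseen passengers. The sketch below uses synthetic noisy data (not the Titanic data) to compare an unrestricted tree with one limited by `max_depth=3`; the value 3 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels, purely illustrative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = ((X_demo[:, 0] + 0.5 * rng.normal(size=200)) > 0).astype(int)

X_fit, y_fit = X_demo[:150], y_demo[:150]
X_eval, y_eval = X_demo[150:], y_demo[150:]

full_tree = DecisionTreeClassifier(random_state=42).fit(X_fit, y_fit)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_fit, y_fit)

# Report depth plus training and held-out accuracy for each tree
for name, tree in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train accuracy={tree.score(X_fit, y_fit):.2f}, "
        f"test accuracy={tree.score(X_eval, y_eval):.2f}"
    )
```

The unrestricted tree reaches a training accuracy of 1.00 here because it memorises the noisy labels. Comparing training and test accuracy in the same way for `dtc` is a quick check for overfitting.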