# Advanced Learning Algorithms

## Import Modules

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy, SparseCategoricalCrossentropy
from tensorflow.keras.activations import relu,linear
from tensorflow.keras.optimizers import Adam

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from xgboost import XGBClassifier, XGBRegressor



## Data Processing

We use data from the Kaggle Titanic competition.

In [2]:
data = pd.read_csv("./titanic/train.csv")
test = pd.read_csv("./titanic/test.csv")

In [3]:
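# Preprocessing: fill missing values.
# Assign the result back instead of calling fillna(..., inplace=True) on a column,
# which pandas warns will no longer modify the original DataFrame in 3.0.
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
data['Fare'] = data['Fare'].fillna(data['Fare'].median())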

X = data[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked']]
y = data['Survived']

# One-hot encoding and scaling
categorical_features = ['Pclass', 'Sex', 'Embarked']
numerical_features = ['Age', 'Fare']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)

# Fill missing values in the test set the same way
test['Age'] = test['Age'].fillna(test['Age'].median())
test['Embarked'] = test['Embarked'].fillna(test['Embarked'].mode()[0])
test['Fare'] = test['Fare'].fillna(test['Fare'].median())

X_test = test[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked']]

X = preprocessor.fit_transform(X)
X_test = preprocessor.transform(X_test)
y = np.array(y)



## Main Content

### Neural Network, Neural Network Training

A neural network is defined with the `Sequential()` class, the most basic Keras model: a linear stack of layers.

Each element of the network is a layer. Here the layers are fully connected and are created with `Dense()`; other layer types exist as well.

Each layer has a number of neurons, set by the `units` argument of `Dense()`, and an activation function. Common activation functions are:
- Linear function: $g(z) = z$ -- equivalent to no activation function
- Sigmoid function: $g(z) = \frac{1}{1 + e^{-z}}$  -- for binary classification outputs
- Softmax function: $g(z)_j = \frac{e^{z_j}}{\sum_i e^{z_i}}$  -- for multiclass classification outputs
- Rectified linear unit (ReLU): $g(z) = \max(0, z)$ -- the most common choice for hidden layers
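The four activation functions listed above can be written in a few lines of NumPy. This is only an illustrative sketch: the function names (`linear_fn`, `sigmoid_fn`, `softmax_fn`, `relu_fn`) and the sample input are made up for this example and are not part of Keras.

```python
import numpy as np

def linear_fn(z):
    """Linear (identity) activation: g(z) = z."""
    return z

def sigmoid_fn(z):
    """Sigmoid activation: g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax_fn(z):
    """Softmax over the last axis: g(z)_j = e^{z_j} / sum_i e^{z_i}."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def relu_fn(z):
    """ReLU activation: g(z) = max(0, z)."""
    return np.maximum(0.0, z)

# Hypothetical pre-activation values for three neurons
z_example = np.array([-2.0, 0.0, 3.0])
for fn in (linear_fn, sigmoid_fn, softmax_fn, relu_fn):
    print(fn.__name__, fn(z_example))
```

Sigmoid squashes each value independently into (0, 1), whereas softmax normalizes the whole vector so that its entries sum to 1.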

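This is a binary classification problem, so the output layer would normally use a sigmoid activation. To reduce round-off error, the output layer is instead given a linear activation so that it emits logits; the sigmoid is then applied inside the loss function during training and explicitly at prediction time.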


In [4]:
model = Sequential([
    Dense(units=64, activation='relu'),
    Dense(units=32, activation='relu'),
    Dense(units=1, activation='linear')
])


Then compile the model with a loss function. Common loss functions are:
- BinaryCrossentropy: logistic loss, for binary classification
- SparseCategoricalCrossentropy: for multiclass classification; expects each target to be the integer index of the class
- CategoricalCrossentropy: for multiclass classification; expects each target to be one-hot encoded (1 at the target index, 0 elsewhere)
- MeanSquaredError: for regression

For this problem, BinaryCrossentropy is selected as the loss function. Because the output layer emits logits (its activation is linear), pass `from_logits=True` so that the loss applies the sigmoid internally, in a numerically stable way, when computing the loss.
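To see why working from logits helps, the NumPy sketch below compares two ways of computing the binary cross-entropy for a single example. The function names are hypothetical, and Keras's internal implementation may differ in detail; the point is only that the rearranged formula never takes the logarithm of a value that has been rounded to 0.

```python
import numpy as np

def bce_via_probability(z, y):
    """Binary cross-entropy computed by first converting the logit z to a probability."""
    p = 1.0 / (1.0 + np.exp(-z))
    # For large |z|, p rounds to exactly 0 or 1, so one of the logs becomes log(0).
    with np.errstate(divide="ignore", invalid="ignore"):
        return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def bce_from_logit(z, y):
    """Algebraically equivalent form that works on the logit directly."""
    # max(z, 0) - z*y + log(1 + exp(-|z|)) never evaluates log(0).
    return np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))

# A moderate logit: both forms agree.
moderate = (bce_via_probability(0.5, 1.0), bce_from_logit(0.5, 1.0))
# A confident but wrong prediction (logit 40, true label 0): the first form overflows.
extreme = (bce_via_probability(40.0, 0.0), bce_from_logit(40.0, 0.0))
print(moderate)
print(extreme)
```

In the extreme case the probability-based form returns infinity, while the logit-based form returns a finite loss of about 40.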

In addition, an optimizer must be specified in `compile()`. Adam (adaptive moment estimation) improves on plain gradient descent: it keeps running estimates of the mean and the squared magnitude of each parameter's gradient and uses them to adapt the step size per parameter, which usually makes training more efficient.
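As a rough illustration of the idea, here is a minimal sketch of the standard Adam update rule applied to a one-dimensional toy problem. The helper `adam_minimize` and the objective $f(w) = (w - 3)^2$ are assumptions made for this example; the hyperparameter defaults other than the learning rate follow the commonly quoted values.

```python
import numpy as np

def adam_minimize(grad_fn, w0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a scalar function with Adam, given its gradient grad_fn."""
    w = float(w0)
    m = 0.0  # running mean of the gradient (first moment)
    v = 0.0  # running mean of the squared gradient (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        # The step is scaled by the gradient's recent magnitude.
        w -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```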

In [5]:
model.compile(
    loss=BinaryCrossentropy(from_logits=True),
    optimizer=Adam(learning_rate=0.01)
)



Finally, train the model on the data to minimize the cost function.

In [6]:
model.fit(X, y, epochs = 1000)

Epoch 1/1000
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 0.5492
Epoch 2/1000
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.4178 
Epoch 3/1000
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.4038 
Epoch 4/1000
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step - loss: 0.4104 
Epoch 5/1000
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.4070 
Epoch 6/1000
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.3872 
Epoch 7/1000
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.4594 
Epoch 8/1000
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.3965 
Epoch 9/1000
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.4021 
Epoch 10/1000
[1m28/28[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/s

<keras.src.callbacks.history.History at 0x7e8e1fa74440>

Now predict on new examples. The trained model outputs logits, so convert them to probabilities with the sigmoid function and then apply a 0.5 threshold to obtain class labels.

In [7]:
logits = model(X_test)
yhat_test = tf.nn.sigmoid(logits)
yhat_test_labels = np.where(yhat_test.numpy() >= 0.5, 1, 0).flatten()
yhat_test_labels


array([0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
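Because the sigmoid is monotonic and maps 0 to exactly 0.5, thresholding the probability at 0.5 gives the same labels as thresholding the logit at 0. The short NumPy sketch below checks this on a few made-up logits (the values and variable names are hypothetical).

```python
import numpy as np

# Hypothetical logits, including the boundary case 0.0
example_logits = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])
example_probs = 1.0 / (1.0 + np.exp(-example_logits))

labels_from_probs = (example_probs >= 0.5).astype(int)   # threshold the probability
labels_from_logits = (example_logits >= 0.0).astype(int)  # threshold the logit
print(labels_from_probs)
print(labels_from_logits)
```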

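Check the model's predictions against `gender_submission.csv`. Note that this file is Kaggle's sample submission, which predicts that every female passenger survives and every male passenger does not. It does not contain the true test labels, so the figure below measures agreement with that baseline rather than true accuracy.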

In [8]:
result = pd.read_csv("./titanic/gender_submission.csv")
y_test = np.array(result.iloc[:, 1])

In [9]:
(sum(yhat_test_labels == y_test)/y_test.shape * 100)

array([83.01435407])

### Decision Tree, Random Forest, XGBoost

Reload the data

In [10]:
data = pd.read_csv("./titanic/train.csv")
test = pd.read_csv("./titanic/test.csv")

Preprocess the data again (note that decision trees do not need feature scaling, so `StandardScaler` is omitted)

In [11]:
# Fill missing values (assign back instead of using chained inplace=True)
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])
data['Fare'] = data['Fare'].fillna(data['Fare'].median())

X = data[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked']]
y = data['Survived']

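# One-hot encoding only; tree-based models do not need scaling
categorical_features = ['Pclass', 'Sex', 'Embarked']
numerical_features = ['Age', 'Fare']

# Note: ColumnTransformer drops unlisted columns by default (remainder='drop'),
# so as written 'Age' and 'Fare' are not passed to the models below.
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features)
    ]
)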

# Fill missing values in the test set the same way
test['Age'] = test['Age'].fillna(test['Age'].median())
test['Embarked'] = test['Embarked'].fillna(test['Embarked'].mode()[0])
test['Fare'] = test['Fare'].fillna(test['Fare'].median())

X_test = test[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked']]

X = preprocessor.fit_transform(X)
X_test = preprocessor.transform(X_test)
y = np.array(y)
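One detail of the cell above is worth checking: a `ColumnTransformer` that lists only the categorical columns discards every other column, because its default is `remainder='drop'`. The sketch below, on a small made-up DataFrame, contrasts that behavior with passing the numerical columns through unscaled via the `'passthrough'` transformer. The toy values and variable names are hypothetical, and whether to keep `Age` and `Fare` for the tree models is a modeling choice not made in the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row sample with one categorical and two numerical columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

# Only the categorical column is listed: Age and Fare are dropped.
drop_rest = ColumnTransformer([("cat", OneHotEncoder(), ["Sex"])])

# Numerical columns are listed with 'passthrough': kept as-is, without scaling.
keep_rest = ColumnTransformer([
    ("cat", OneHotEncoder(), ["Sex"]),
    ("num", "passthrough", ["Age", "Fare"]),
])

shape_dropped = drop_rest.fit_transform(toy).shape
shape_kept = keep_rest.fit_transform(toy).shape
print(shape_dropped)  # 2 one-hot columns only
print(shape_kept)     # 2 one-hot columns + Age + Fare
```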


#### Decision Tree Learning

- Start with all examples at the root node
- Calculate the information gain for every candidate feature and pick the one with the highest information gain
- Split the dataset on the selected feature, creating the left and right branches of the tree
- Keep repeating the splitting process until a stopping criterion is met:
    - A node is 100% one class
    - Splitting a node would make the tree exceed the maximum depth
    - The information gain from an additional split is below a threshold
    - The number of examples in a node is below a threshold
By default, the criterion used by `DecisionTreeClassifier` is Gini impurity, which has the formula:
\begin{equation}
\text{Gini Impurity} = 1 - \sum_{i=1}^C p_i^2
\end{equation}

where
- $p_i$ is the proportion of instances belonging to class $i$
- $C$ is the number of classes
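The formula translates directly into NumPy. The helper name `gini_impurity` and the sample label lists are made up for illustration; this is not scikit-learn's internal implementation.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum_i p_i^2 of a collection of class labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0  # convention chosen here for an empty node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()  # proportion of each class
    return 1.0 - np.sum(p**2)

pure_node = gini_impurity([1, 1, 1, 1])       # a single class
balanced_binary = gini_impurity([0, 1, 0, 1])  # two classes, 50/50
balanced_three = gini_impurity([0, 1, 2])      # three classes, equal shares
print(pure_node, balanced_binary, balanced_three)
```

A pure node has impurity 0, and the impurity grows as the classes become more evenly mixed.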

The lecture notes instead use information gain based on Shannon entropy, which for a binary split has the formula:
\begin{equation}
\text{Information Gain} = H(p_1^{root}) - (w^{left}H(p_1^{left}) + w^{right}H(p_1^{right}))
\end{equation}

where
- $p_1$ is the fraction of positive examples in a node, and $H(p_1) = -p_1 \log_2 p_1 - (1 - p_1) \log_2 (1 - p_1)$ is the entropy of that node
- $w$ is the weight of a branch: $w = \frac{\text{\# examples sent to the branch}}{\text{\# examples in the root node}}$
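The following sketch computes the entropy and information gain defined above and uses them to pick the better of two binary features, which is one step of the tree-building procedure. The helper names and the toy data are hypothetical, and only 0/1 features and 0/1 labels are handled.

```python
import numpy as np

def entropy(p1):
    """Binary entropy H(p1) in bits; H(0) = H(1) = 0 by convention."""
    if p1 == 0.0 or p1 == 1.0:
        return 0.0
    return -p1 * np.log2(p1) - (1.0 - p1) * np.log2(1.0 - p1)

def information_gain(y_root, y_left, y_right):
    """Entropy at the root minus the weighted entropy of the two branches."""
    y_root, y_left, y_right = map(np.asarray, (y_root, y_left, y_right))
    w_left = len(y_left) / len(y_root)
    w_right = len(y_right) / len(y_root)
    h_left = entropy(y_left.mean()) if len(y_left) else 0.0
    h_right = entropy(y_right.mean()) if len(y_right) else 0.0
    return entropy(y_root.mean()) - (w_left * h_left + w_right * h_right)

def best_feature(X_bin, y):
    """Index of the binary (0/1) feature whose split gives the highest gain."""
    gains = [information_gain(y, y[X_bin[:, j] == 1], y[X_bin[:, j] == 0])
             for j in range(X_bin.shape[1])]
    return int(np.argmax(gains)), gains

# Hypothetical toy data: 10 examples, 2 binary features, binary labels
y_toy = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
X_toy = np.column_stack([
    [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],  # feature 0: mostly agrees with the label
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # feature 1: close to uninformative
])
best_j, gains = best_feature(X_toy, y_toy)
print(best_j, gains)
```

Feature 0 splits the labels into an 80/20 and a 20/80 branch, so it yields a much larger gain than feature 1 and is selected.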

Practical Usage

Gini Impurity: Often preferred in practice (e.g., in CART) due to computational efficiency, especially for large datasets.

Entropy (Information Gain): May be more informative in certain datasets with many classes or when fine distinctions in uncertainty are needed.

In [18]:
model = DecisionTreeClassifier(random_state=42, max_depth=5)
model.fit(X, y)

In [19]:
yhat_test = model.predict(X_test)

result = pd.read_csv("./titanic/gender_submission.csv")
y_test = result.iloc[:, 1]

sum(yhat_test == y_test)/y_test.shape[0]

0.9019138755980861

#### Random Forest

A random forest applies the principle of a tree ensemble:
- Given a training set of size $m$
- Repeat $B$ times:
    - Use sampling with replacement to create a new training set of size $m$
    - Train a decision tree on the new dataset
- Final prediction: the $B$ trees vote, and the majority class wins

In addition, at each node, when choosing a feature to split on, if $n$ features are available, pick a random subset of $k < n$ features and allow the algorithm to choose only from that subset.
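The procedure above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`, using `max_features="sqrt"` to restrict each split to a random subset of features. The helper names, the synthetic data, and the hyperparameter values are assumptions made for this illustration; in practice `RandomForestClassifier` does all of this internally, and its details may differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, n_trees=25, max_depth=3, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the training set."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, m, size=m)  # sample m indices with replacement
        tree = DecisionTreeClassifier(
            max_depth=max_depth,
            max_features="sqrt",  # random feature subset at each split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_majority(trees, X):
    """Binary majority vote over the trees' 0/1 predictions."""
    votes = np.stack([tree.predict(X) for tree in trees])  # shape (B, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Hypothetical synthetic data: the label depends on the first two of four features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

trees = fit_bagged_trees(X_toy, y_toy)
y_pred = predict_majority(trees, X_toy)
train_agreement = (y_pred == y_toy).mean()
print(train_agreement)
```

An odd number of trees is used so that the binary vote cannot tie.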

In [22]:
model = RandomForestClassifier(random_state = 42, max_depth = 5)

model.fit(X, y)

In [24]:
yhat_test = model.predict(X_test)
result = pd.read_csv("./titanic/gender_submission.csv")
y_test = result.iloc[:, 1]

sum(yhat_test == y_test)/y_test.shape[0]

0.9019138755980861

#### XGBoost

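Boosting differs slightly from the tree ensemble above:

Instead of picking from all examples with equal probability $1/m$ when building each new training set, make it more likely to pick examples that the previously trained trees misclassified.

This is the intuition given in the lecture. XGBoost itself implements gradient boosting: each new tree is fit to the errors (gradients of the loss) of the current ensemble rather than to a literally resampled dataset.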
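The resampling intuition can be sketched as follows. This is deliberately a toy version of the idea of favoring misclassified examples and is not the algorithm XGBoost implements; the helper name, the `boost` factor, and the synthetic data are all assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_with_focused_resampling(X, y, n_trees=9, max_depth=2, boost=3.0, seed=0):
    """Train trees in sequence, oversampling examples the ensemble gets wrong."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    probs = np.full(m, 1.0 / m)  # start with equal 1/m sampling probability
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(m, size=m, replace=True, p=probs)
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        trees.append(tree.fit(X[idx], y[idx]))
        # Majority vote of the trees trained so far (binary 0/1 labels)
        votes = np.mean([t.predict(X) for t in trees], axis=0)
        wrong = (votes >= 0.5).astype(int) != y
        # Misclassified examples become `boost` times more likely to be drawn.
        weights = np.where(wrong, boost, 1.0)
        probs = weights / weights.sum()
    return trees, probs

# Hypothetical synthetic data: the label depends on the first two of four features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

trees, final_probs = fit_with_focused_resampling(X_toy, y_toy)
votes = np.mean([t.predict(X_toy) for t in trees], axis=0)
train_agreement = ((votes >= 0.5).astype(int) == y_toy).mean()
print(train_agreement)
```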

In [27]:
model = XGBClassifier(n_estimators = 500, learning_rate = 0.1, random_state = 42)

model.fit(X, y)

In [28]:
yhat_test = model.predict(X_test)
result = pd.read_csv("./titanic/gender_submission.csv")
y_test = result.iloc[:, 1]

sum(yhat_test == y_test)/y_test.shape[0]

0.9019138755980861