In [1]:
#importing modules
import time
import torch
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate, RepeatedStratifiedKFold, StratifiedKFold
from sklearn.preprocessing import StandardScaler
import sklearn.metrics as metrics
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, accuracy_score, balanced_accuracy_score
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from imblearn import FunctionSampler

# Portfolio-Exam
# Part 1

## Exercise 1.
(Perceptron. – 5 points)

Given a perceptron with weights (0.2, 0.3, 0.6, 0.6) and bias 0.15, compute the output for the tensor {...} of dimensions (dataset, features).

In [2]:
# configuring x
x = torch.tensor(
    [[1, 0, 1, 0],  
    [.3, .2, .3, .2]]
    )

# configuring weights
w = torch.tensor([0.2, 0.3, 0.6, 0.6])

# configuring bias
b = torch.tensor(0.15)

# writing perceptron function
def perceptron(x: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Computes the weighted sum of the features of each sample, then adds b.

    Args:
        x (torch.Tensor): input tensor of shape (dataset, features)
        w (torch.Tensor): weight tensor of shape (features,)
        b (torch.Tensor): bias

    Returns:
        torch.Tensor: pre-activation value for each sample, shape (dataset,)
    """
    return (x * w).sum(dim=1) + b
    # possible activation function
    # return torch.sign((x * w).sum(dim=1) + b)

# calling perceptron to calculate y
y = perceptron(x, w, b)
y

tensor([0.9500, 0.5700])

The result is a one-dimensional tensor with two elements, one per sample. No activation function was applied because none was specified; a possible implementation is shown as a comment in the code.
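
As a cross-check, the same computation can be reproduced by hand in plain Python, together with one possible sign activation. This is only a sketch: the names `inputs`, `weights`, `bias`, `pre_activation` and `sign` are illustrative, and the choice of the sign function is an assumption because the exercise does not name an activation.

```python
# Plain-Python cross-check of the perceptron output from Exercise 1.
inputs = [[1.0, 0.0, 1.0, 0.0],
          [0.3, 0.2, 0.3, 0.2]]
weights = [0.2, 0.3, 0.6, 0.6]
bias = 0.15

def pre_activation(row, weights, bias):
    """Weighted sum of one sample's features plus the bias."""
    return sum(x_i * w_i for x_i, w_i in zip(row, weights)) + bias

def sign(value):
    """One possible activation (an assumption; the exercise names none)."""
    return 1 if value > 0 else (-1 if value < 0 else 0)

outputs = [pre_activation(row, weights, bias) for row in inputs]
activated = [sign(value) for value in outputs]

print([round(value, 4) for value in outputs])  # [0.95, 0.57]
print(activated)                               # [1, 1]
```

Both pre-activation values are positive, so a sign activation would map both samples to the class +1.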
_______________
_______________

## Exercise 2.
### 2.1
*List the entities that the tensor should contain.*
- product identifier
- user identifier
- tags
- if a tag was assigned to a product by a user or not


### 2.2
*Describe a tensor to model the tagging data of Yangtze. Describe its dimensionality.*
- Dimension 0: user identifier with ordinal encoding, cardinality: 1,989,345
- Dimension 1: product identifier with ordinal encoding, cardinality: 1,214,467
- Dimension 2: tags with ordinal encoding, cardinality: 56,892
- Data type: boolean
- Meaning of the values:
  - 0 means a given user did not assign a given tag to a given product
  - 1 means a given user did assign a given tag to a given product

### 2.3
*How many entries does the tensor have? How many are 1?*

We multiply the length of each dimension:
 
  n Products * n dist. Tags * n Users

= 1,214,467 * 56,892 * 1,989,345

= 137,450,722,348,310,580

Of the 137,450,722,348,310,580 entries in total, 1,199,438 are 1. This is a fraction of about 8.7 × 10⁻¹² (about 8.7 × 10⁻¹⁰ %).

Any single user buys only a few products, and even fewer users assign tags, so the vast majority of entries are 0.

With so few nonzero entries, a sparse encoding of the data is advisable.
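
The arithmetic and the idea of a sparse encoding can be sketched in a few lines of plain Python. The dimension sizes and the number of ones come from the exercise; the set `tagged`, the helper `is_tagged` and the index triples in it are made-up illustration data, not Yangtze data.

```python
# Dimension sizes taken from the exercise text.
n_users = 1_989_345
n_products = 1_214_467
n_tags = 56_892
n_ones = 1_199_438

total_entries = n_users * n_products * n_tags
fraction_ones = n_ones / total_entries

print(f"{total_entries:,}")    # 137,450,722,348,310,580
print(f"{fraction_ones:.2e}")  # 8.73e-12

# Sparse (coordinate) storage: keep only the index triples whose value is 1.
# The triples below are made-up illustration data.
tagged = {(0, 17, 3), (0, 17, 8), (42, 5, 3)}

def is_tagged(user_idx, product_idx, tag_idx):
    """Every coordinate that is not stored is implicitly 0."""
    return (user_idx, product_idx, tag_idx) in tagged

print(is_tagged(0, 17, 3))  # True
print(is_tagged(1, 17, 3))  # False
```

With this kind of storage the memory footprint grows with the number of ones rather than with the product of the three dimension sizes.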

### 2.4
*How does the tensor change when a user tags a product with a tag that has been used in Yangtze before?*
- The system looks up the user, the product and the tag in the encoding tables to find the corresponding position in the tensor, and changes that entry from 0 to 1. The shape of the tensor stays the same.

### 2.5
*How does the tensor change when a user tags a product with a tag that has not previously been used in Yangtze before?*
- The system will attempt to look up the user, the product and the tag in the encoding tables but fail to find the tag, because there is no entry for it yet
- The encoding table for tags has to be extended by a new entry for the new tag
- The tag dimension will grow by one for all users and products, and the system will fill the new cells with zeros because no other user has used this tag yet
- Then it will navigate to the cell given by the user, product and tag encoding and write a 1 there
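
The update logic of 2.4 and 2.5 can be sketched with a small sparse container. This is a hypothetical illustration, not a prescribed implementation: the class `TagTensor`, its attributes and the example identifiers are all assumptions. For brevity the sketch also extends the user and product tables on first use, which goes beyond what the exercise asks.

```python
class TagTensor:
    """Sparse boolean tensor indexed by (user, product, tag).

    Hypothetical illustration: names and example identifiers are assumptions.
    """

    def __init__(self):
        self.user_index = {}     # user identifier    -> position on dimension 0
        self.product_index = {}  # product identifier -> position on dimension 1
        self.tag_index = {}      # tag                -> position on dimension 2
        self.ones = set()        # coordinates whose entry is 1

    @staticmethod
    def _lookup_or_add(table, key):
        """Return the position of key, extending the table if key is new."""
        if key not in table:
            table[key] = len(table)  # the dimension grows by one
        return table[key]

    def assign(self, user, product, tag):
        """Set the entry for (user, product, tag) to 1."""
        coordinate = (
            self._lookup_or_add(self.user_index, user),
            self._lookup_or_add(self.product_index, product),
            self._lookup_or_add(self.tag_index, tag),
        )
        self.ones.add(coordinate)

    @property
    def shape(self):
        return (len(self.user_index), len(self.product_index), len(self.tag_index))


tensor = TagTensor()
tensor.assign("user_a", "product_x", "waterproof")
tensor.assign("user_b", "product_x", "waterproof")   # known tag: tag dimension unchanged
print(tensor.shape)  # (2, 1, 1)
tensor.assign("user_b", "product_x", "lightweight")  # new tag: tag dimension grows by one
print(tensor.shape)  # (2, 1, 2)
```

Because only the ones are stored, the zeros for the new tag do not have to be written explicitly; they are implied by the absence of a coordinate.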

_______________
_______________

## Exercise 3.
### 3.1
*In which situations can it be useful (explain in general and provide three examples)?*
- SMOTE is useful for dealing with imbalanced datasets, where the cost of misclassifying an abnormal example as a normal example is higher than the cost of the reverse error.

- Useful examples would be:
    - fraudulent telephone calls: the vast majority of phone calls are legitimate, only a small fraction is fraudulent
    - detection of oil spills in satellite images: an oil spill will only cover a small part of a satellite image
    - classification of pixels in mammogram images as possibly cancerous: 98% of pixels are normal and 2% of pixels are abnormal

### 3.2
*What is its fundamental idea?*
- The fundamental idea is to improve the accuracy of classifiers for a minority class. This is achieved through a combination of over-sampling the minority class by generating additional synthetic examples while also under-sampling the majority class.

### 3.3
*How is SMOTE different from oversampling with replacement?*
- Oversampling with replacement raises the chance of drawing from the minority class by repeating existing minority examples. SMOTE differs from that as follows:
    - in the original proposal, over-sampling of the minority class is combined with under-sampling of the majority class, which shifts the class balance further than oversampling with replacement alone
    - instead of picking the same data points repeatedly, SMOTE generates synthetic examples: for a minority class example it randomly selects some of its k nearest minority class neighbors and places a new example at a random position on the line segment between the original example and each selected neighbor
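
The difference between the two strategies can be sketched with NumPy. This is a simplified illustration and not the implementation used by imbalanced-learn: the functions `random_oversample` and `smote_like`, the toy points in `minority` and the parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.3], [1.1, 1.4]])

def random_oversample(points, n_new, rng):
    """Draw existing points with replacement: only exact duplicates."""
    picks = rng.integers(0, len(points), size=n_new)
    return points[picks]

def smote_like(points, n_new, k, rng):
    """Simplified SMOTE-style interpolation (not imbalanced-learn's code)."""
    synthetic = np.empty((n_new, points.shape[1]))
    for i in range(n_new):
        base = points[rng.integers(0, len(points))]
        # Euclidean distances from the base point to all minority points.
        distances = np.linalg.norm(points - base, axis=1)
        # Skip position 0 of the ordering, which is the base point itself
        # (this assumes the toy points are distinct).
        neighbour_ids = np.argsort(distances)[1:k + 1]
        neighbour = points[rng.choice(neighbour_ids)]
        gap = rng.random()  # random position on the connecting segment
        synthetic[i] = base + gap * (neighbour - base)
    return synthetic

duplicates = random_oversample(minority, 5, rng)
synthetic = smote_like(minority, 5, k=2, rng=rng)
print(duplicates)
print(synthetic)
```

Every row of `duplicates` is an existing minority point, whereas the rows of `synthetic` lie between existing minority points.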

_______________
_______________

## Exercise 4.
### 4.1
Create a Python class Perceptron

In [3]:
class Perceptron:
    # margin and random seed are 1, unless overridden
    def __init__(self, input_size, margin=1, random_seed=1):
        super().__init__()
        # seeds NumPy's global random number generator for reproducible initial weights
        np.random.seed(random_seed)
        self.w = np.random.rand(input_size)
        self.b = np.random.rand(1)
        self.t = margin

    # method for computing the pre-activation values
    def pre_activation_value(self, X):
        return np.dot(X, self.w) + self.b

    # method for training that takes the training data, the learning rate and the number of epochs
    def train(self, X, y, learning_rate, num_epochs):
        for epoch in range(num_epochs):
            for xi, yi in zip(X, y):
                y_pre = self.pre_activation_value(xi)
                # update when the sample is misclassified or lies inside the margin
                if self.t - yi * y_pre > 0:
                    self.w = self.w + learning_rate * (yi * xi)
                    self.b = self.b + learning_rate * yi

    # fit() is required for the pipeline API, takes data as X and targets as y and calls train()
    # in this implementation learning rate and number of epochs are hard coded
    def fit(self, X, y):
        self.train(X, y, 0.1, 20)
        # returning self follows the scikit-learn estimator convention
        return self

    # method to compute the predictions
    def predict(self, X):
        return np.sign(self.pre_activation_value(X))
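
The update rule inside `train()` can be isolated into a single step to make the role of the margin visible. The following is a minimal sketch: the helper `margin_update` and the sample values are hypothetical and only mirror the condition used in the class above, with labels assumed to be in {-1, +1}.

```python
import numpy as np

def margin_update(w, b, x_i, y_i, learning_rate, margin):
    """One margin-perceptron step for a single sample with label y_i in {-1, +1}."""
    score = np.dot(x_i, w) + b
    if margin - y_i * score > 0:  # sample is misclassified or inside the margin
        w = w + learning_rate * y_i * x_i
        b = b + learning_rate * y_i
    return w, b

x_i, y_i = np.array([1.0, 1.0]), 1  # made-up sample

# Case 1: score = 0.0, so margin - y_i * score = 1 > 0 and the weights move towards x_i.
w_near, b_near = margin_update(np.array([0.5, -0.5]), 0.0, x_i, y_i,
                               learning_rate=0.1, margin=1)
print(w_near, b_near)

# Case 2: score = 4.0, so the sample is correct and outside the margin: no update.
w_far, b_far = margin_update(np.array([2.0, 2.0]), 0.0, x_i, y_i,
                             learning_rate=0.1, margin=1)
print(w_far, b_far)
```

A larger margin makes the condition fire more often, so correctly classified samples close to the boundary still move the weights.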

### 4.2
Load and arrange the dataset portfolio_data_wise_2022.csv

In [4]:
# loading data
raw_data = pd.read_csv('portfolio_data_wise_2022.csv')

# splitting variables and target
data, target = raw_data[['feature_1', 'feature_2']], raw_data[['target']]

In [5]:
# data prep for perceptron
data = data.to_numpy()
target = target.replace(0, -1)
target = target.to_numpy()

# shuffling data
np.random.seed(1)
shuffled_index = np.random.permutation(len(data))
X = data[shuffled_index, :]
y = target[shuffled_index]
y = y.ravel()

### 4.3
Describe the class distribution.

In [6]:
# determining class distribution.
raw_data.target.value_counts(normalize=True)

0    0.99
1    0.01
Name: target, dtype: float64

99% of the data points belong to class 0 and 1% of the data points belong to class 1. 

In a binary classification problem, we would achieve 99% accuracy already by always betting on the majority class.
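
The effect of such a trivial majority-class classifier on accuracy and balanced accuracy can be worked out directly. The sketch below assumes 10,000 samples with the 99% / 1% split shown above; balanced accuracy is taken as the mean of the per-class recalls.

```python
# Class counts for a 99% / 1% split over 10,000 samples.
n_majority, n_minority = 9_900, 100

# A trivial classifier that always predicts the majority class (0).
tn, fp = n_majority, 0  # every majority sample is predicted correctly
fn, tp = n_minority, 0  # every minority sample is missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_majority = tn / (tn + fp)
recall_minority = tp / (tp + fn)
balanced_accuracy = (recall_majority + recall_minority) / 2

print(accuracy)           # 0.99
print(balanced_accuracy)  # 0.5
```

The balanced accuracy of 0.5 shows that this classifier carries no information about the minority class, even though its plain accuracy looks excellent.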

### 4.4
Plot the data

In [7]:
# data preparation for plotting
df_viz = raw_data.copy()
df_viz.target = df_viz.target.astype(str)

# plotting the data
px.scatter(df_viz, x='feature_1', y='feature_2', title='Scatter Plot of Data Points', color='target')

### 4.5
Compare the performance of the perceptron with a margin of 1 in three different settings:
- a) trained directly on the plain data,
- b) trained using SMOTE (default configuration, only the oversampling algorithm, no under-sampling of the majority class), and
- c) trained using random oversampling.

In [8]:
# configuring pipeline for scaler and estimator
def get_pipe(estimator, random_seed, sampling=None):
    if sampling is None:
        return Pipeline([
            ('scaler', StandardScaler()),
            ('estimator', estimator)])
    elif sampling == 'SMOTE':
        return Pipeline([
            # oversampling the minority class with SMOTE, no under-sampling of the majority class
            # n_jobs=-1 to improve run time
            ('SMOTE', SMOTE(sampling_strategy='minority', random_state=random_seed, n_jobs=-1)),
            ('scaler', StandardScaler()),
            ('estimator', estimator)])
    elif sampling == 'Oversampling':
        return Pipeline([
            # using random oversampling
            ('Random Oversampling', RandomOverSampler(sampling_strategy='minority', random_state=random_seed)),
            ('scaler', StandardScaler()),
            ('estimator', estimator)])
    else:
        raise NotImplementedError()

In [9]:
# configuring repeated stratified cross-validation (NUM_TRIALS repetitions of NUM_OUTER_SPLITS folds)
NUM_TRIALS = 5
NUM_OUTER_SPLITS = 5

def nested_cv(input_size, features, targets, sampling, margin=1):

    start = time.time()
    accs = np.zeros((NUM_TRIALS, NUM_OUTER_SPLITS))
    baccs = np.zeros((NUM_TRIALS, NUM_OUTER_SPLITS))
    fit_times = np.zeros((NUM_TRIALS, NUM_OUTER_SPLITS))
    test_times = np.zeros((NUM_TRIALS, NUM_OUTER_SPLITS))
    tn = np.zeros((NUM_TRIALS, NUM_OUTER_SPLITS))
    tp = np.zeros((NUM_TRIALS, NUM_OUTER_SPLITS))
    fn = np.zeros((NUM_TRIALS, NUM_OUTER_SPLITS))
    fp = np.zeros((NUM_TRIALS, NUM_OUTER_SPLITS))

    # use custom scorer to save confusion matrix values
    # adapted from: https://stackoverflow.com/a/46796347/348501
    def custom_scorer(clf, X, y):
        y_pred = clf.predict(X)
        cm = confusion_matrix(y, y_pred)
        return {'accuracy': accuracy_score(y, y_pred),
                'balanced_accuracy': balanced_accuracy_score(y, y_pred),
                'tn': cm[0, 0],
                'fp': cm[0, 1],
                'fn': cm[1, 0],
                'tp': cm[1, 1]}

    for i in range(NUM_TRIALS):
        print('Running Outer CV in iteration: ', i, ' at ', time.time() - start)
        estimator = Perceptron(input_size, margin=margin, random_seed=i)
        pipe = get_pipe(estimator, i, sampling)
        outer_cv = StratifiedKFold(n_splits=NUM_OUTER_SPLITS, shuffle=True, random_state=i)

        cv_results = cross_validate(
            pipe, 
            X = features,
            y = targets,
            cv = outer_cv,
            scoring = custom_scorer,
            n_jobs = -1)

        accs[i] = cv_results['test_accuracy']
        baccs[i] = cv_results['test_balanced_accuracy']
        tn[i] = cv_results['test_tn']
        tp[i] = cv_results['test_tp']
        fn[i] = cv_results['test_fn']
        fp[i] = cv_results['test_fp']
        fit_times[i] = cv_results['fit_time']
        test_times[i] = cv_results['score_time']

    print('Total time: ', (time.time()-start), 'sec.')
    return accs, baccs, tn, tp, fn, fp, fit_times, test_times

In [10]:
# function to store mean, standard deviation, minimum and maximum values of performance measures
def add_result(results, name, accs, baccs, tns, tps, fns, fps, fit_times, test_times):

    row = {
        'name': name,
        'acc_mean': accs.mean(), 
        'acc_std': accs.std(), 
        'acc_min': accs.min(), 
        'acc_max': accs.max(), 
        'bacc_mean': baccs.mean(), 
        'bacc_std': baccs.std(), 
        'bacc_min': baccs.min(), 
        'bacc_max': baccs.max(), 
        'tn_mean': tns.mean(), 
        'tp_mean': tps.mean(), 
        'fn_mean': fns.mean(), 
        'fp_mean': fps.mean(), 
        'fit_time': fit_times.mean(),
        'test_time': test_times.mean()
        }
    results.append(row)

# instantiating an empty list to store results of classification runs
results = []

### 4.5 a)
Trained directly on the plain data

In [11]:
# calling nested_cv function using sampling=None to train on plain data
res = nested_cv(2, X, y, None)

# appending results to list
add_result(results, 'Plain Data', *res)

Running Outer CV in iteration:  0  at  1.5020370483398438e-05
Running Outer CV in iteration:  1  at  2.8873109817504883
Running Outer CV in iteration:  2  at  3.8904600143432617
Running Outer CV in iteration:  3  at  4.925734043121338
Running Outer CV in iteration:  4  at  5.73088812828064
Total time:  6.422816038131714 sec.


### 4.5 b)
Trained using SMOTE

In [None]:
# calling nested_cv function using sampling='SMOTE'
res = nested_cv(2, X, y, 'SMOTE')

# appending results to list
add_result(results, 'SMOTE', *res)

### 4.5 c)
Trained using random oversampling

In [None]:
# calling nested_cv function using sampling='Oversampling'
res = nested_cv(2, X, y, 'Oversampling')

# appending results to list
add_result(results, 'Random Oversampling', *res)

### Performance Evaluation

In [None]:
# converting list of results into a dataframe
results = pd.DataFrame(results)

# store results of best run based on the highest balanced accuracy
best_results = results.sort_values(by='bacc_mean', ascending=False).iloc[0]

# displaying the data frame sorted descending by balanced accuracy
results.sort_values(by='bacc_mean', ascending=False)

Comparing the mean accuracy over five runs for each of the three approaches, we find that training the classifier on the plain data yields vastly better results:

Accuracies (mean, 5 runs):
- Plain Data:     99.5%
- SMOTE:           9.3%
- Oversampling:    8.6%

Because of the class imbalance in the dataset, always predicting the majority class already yields 99% accuracy. We therefore inspect the balanced accuracy to get a better understanding of the actual performance of the different approaches.

Balanced Accuracies (mean, 5 runs):
- Plain Data:     79.3%
- SMOTE:          54.1%
- Oversampling:   53.7%

In terms of mean balanced accuracy, training the classifier on the plain data still beats SMOTE and random oversampling.

Overall, over-sampling the minority class with SMOTE and random over-sampling of the minority class perform very similarly.

### Averaged Confusion Matrices

In [None]:
# preparing confusion matrix based on results table
plain = results[['tn_mean', 'tp_mean', 'fn_mean', 'fp_mean']].iloc[0]
plain_cm = np.zeros((2, 2))
plain_cm[0, 0] = plain['tn_mean']
plain_cm[0, 1] = plain['fp_mean']
plain_cm[1, 0] = plain['fn_mean']
plain_cm[1, 1] = plain['tp_mean']
plain_cm = plain_cm * len(X) / plain_cm.sum()

# plotting confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=plain_cm.astype(int),
                              display_labels=['Majority', 'Minority'])
disp.plot()
disp.ax_.set_title('Plain Data')
plt.show()

Trained on the plain data, out of 10,000 cases on average, the Perceptron was able to correctly identify:
- 9887 cases of the majority class, while misclassifying 13 cases
- 58 cases of the minority class, while misclassifying 41 cases

While it misclassified only 13 cases of the majority class (0.13% of that class), it also misclassified 41 cases of the minority class, which is 41% of that class.

With regard to reliably identifying the minority class, this approach is only marginally better than a random guess.
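
The relation between these counts and the reported metrics can be checked by hand. The sketch below uses the averaged counts stated above (13 misclassified majority cases, 41 misclassified minority cases) and computes balanced accuracy as the mean of the per-class recalls; the variable names are illustrative.

```python
# Averaged counts reported above for the perceptron trained on the plain data.
tn, fp = 9887, 13  # majority class: correct / misclassified
fn, tp = 41, 58    # minority class: misclassified / correct

accuracy = (tn + tp) / (tn + fp + fn + tp)
recall_majority = tn / (tn + fp)
recall_minority = tp / (tp + fn)
balanced_accuracy = (recall_majority + recall_minority) / 2

print(f"accuracy:          {accuracy:.3f}")           # 0.995
print(f"minority recall:   {recall_minority:.3f}")    # 0.586
print(f"balanced accuracy: {balanced_accuracy:.3f}")  # 0.792
```

The high accuracy is driven almost entirely by the majority class, while the minority recall of roughly 59% pulls the balanced accuracy down to about 79%.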

In [None]:
# preparing confusion matrix based on results table
smote = results[['tn_mean', 'tp_mean', 'fn_mean', 'fp_mean']].iloc[1]
smote_cm = np.zeros((2, 2))
smote_cm[0, 0] = smote['tn_mean']
smote_cm[0, 1] = smote['fp_mean']
smote_cm[1, 0] = smote['fn_mean']
smote_cm[1, 1] = smote['tp_mean']
smote_cm = smote_cm * len(X) / smote_cm.sum()

# plotting confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=smote_cm.astype(int),
                              display_labels=['Majority', 'Minority'])
disp.plot()
disp.ax_.set_title('Minority Class Over-Sampled using SMOTE')
plt.show()

Trained on the data with the minority class over-sampled using SMOTE, out of 10,000 cases on average, the Perceptron was able to correctly identify:
- 760 cases of the majority class, while misclassifying 9139 cases
- 99 cases of the minority class, while misclassifying none

While misclassifying 9139 cases (92%) of the majority class, it correctly predicted all cases of the minority class.

Each correctly classified minority class case comes at a cost of approximately 92 misclassifications of the majority class.

In [None]:
# preparing confusion matrix based on results table
oversampling = results[['tn_mean', 'tp_mean', 'fn_mean', 'fp_mean']].iloc[2]
oversampling_cm = np.zeros((2, 2))
oversampling_cm[0, 0] = oversampling['tn_mean']
oversampling_cm[0, 1] = oversampling['fp_mean']
oversampling_cm[1, 0] = oversampling['fn_mean']
oversampling_cm[1, 1] = oversampling['tp_mean']
oversampling_cm = oversampling_cm * len(X) / oversampling_cm.sum()

# plotting confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=oversampling_cm.astype(int),
                              display_labels=['Majority', 'Minority'])
disp.plot()
disp.ax_.set_title('Minority Class Random Over-Sampled')
plt.show()

Trained on the data after random over-sampling of the minority class, out of 10,000 cases on average, the Perceptron was able to correctly identify:
- 832 cases of the majority class, while misclassifying 9067 cases
- 99 cases of the minority class, while misclassifying none

While misclassifying 9067 cases (91%) of the majority class, it correctly predicted all cases of the minority class.

Each correctly classified minority class case comes at a cost of approximately 91 misclassifications of the majority class.

In a binary classification problem with an imbalanced class distribution, the challenge usually lies in correctly identifying the minority class rather than the majority class. In such cases, comparing the performance of different approaches merely based on accuracy is misleading.

In this case the data points of the two classes overlap (see plot above). Therefore not all misclassifications can be avoided. We can only decide in which of the two classes a higher percentage of misclassifications is more acceptable.

The accepted ratio is scenario dependent and will differ from case to case.
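
One way to make this trade-off explicit is to attach a cost to each type of error and compare the approaches by total cost. The sketch below uses the averaged error counts reported above; the cost values and the helpers `total_cost` and `cheapest` are purely hypothetical and would have to be chosen per scenario.

```python
# Averaged false positive / false negative counts reported above.
confusion = {
    "Plain Data": {"fp": 13, "fn": 41},
    "SMOTE": {"fp": 9139, "fn": 0},
    "Random Oversampling": {"fp": 9067, "fn": 0},
}

def total_cost(counts, cost_fp, cost_fn):
    """Total misclassification cost for one approach."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn

def cheapest(cost_fp, cost_fn):
    """Name of the approach with the lowest total cost."""
    return min(confusion, key=lambda name: total_cost(confusion[name], cost_fp, cost_fn))

# Hypothetical cost settings: a missed minority case costs 10x or 1000x a false alarm.
print(cheapest(cost_fp=1, cost_fn=10))    # Plain Data
print(cheapest(cost_fp=1, cost_fn=1000))  # Random Oversampling
```

With these counts, the resampled models only become cheaper once a missed minority case costs more than roughly 220 times a false alarm.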
_______________
_______________

## Exercise 5. 
### 5.1
*Compare your expectations after Exercise 3 with the results of Exercise 4.*

I expected training the Perceptron on synthetically over-sampled data to improve the balanced accuracy compared to using the plain data.

However, the results showed that SMOTE and random over-sampling decreased the balanced accuracy compared to training the Perceptron on the plain data. There seems to be no significant difference in balanced accuracy between SMOTE and random over-sampling.

Nevertheless, Perceptrons trained using SMOTE or random over-sampling were able to identify all minority class cases correctly. This comes at the cost of misclassifying almost all of the majority class cases.

### 5.2
*Visualize each scenario*

In [None]:
# code was copied from: https://imbalanced-learn.org/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html

def plot_decision_function(X, y, clf, ax, title=None):
    plot_step = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)
    )

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.4)
    ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor='k')
    if title is not None:
        ax.set_title(title, fontsize = 35)

In [None]:
# code was copied from: https://imbalanced-learn.org/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html

fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(35, 35))

clf_1 = Perceptron(2)
clf_2 = Perceptron(2)
sampler = SMOTE(sampling_strategy='minority', random_state=1)

models = {
    'Original Data': clf_1,
    'SMOTE Resampled': clf_2}

X_res, y_res = sampler.fit_resample(X, y)

for ax, (title, model) in zip(axs, models.items()):
    X_plot, y_plot = X, y
    if title == 'SMOTE Resampled':
        X_plot, y_plot = X_res, y_res
    model.fit(X_plot, y_plot)
    plot_decision_function(X_plot, y_plot, model, ax=ax, title=title)

fig.tight_layout()

### 5.3
*Explain the situation. How did the application of SMOTE or random oversampling
lead to such results?*

In [None]:
# original number of cases per class
unique, counts = np.unique(y, return_counts=True)
print(np.asarray((unique, counts)).T)

In [None]:
# cases per class after SMOTE resampling
X_res, y_res = sampler.fit_resample(X, y)
unique, counts = np.unique(y_res, return_counts=True)
print(np.asarray((unique, counts)).T)

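SMOTE provides additional synthetic minority class samples to learn from. After resampling, both classes are equally frequent, so minority class samples can trigger far more weight updates than before. This moves the decision boundary away from the data points of the minority class and towards those of the majority class, which increases the coverage of the minority class at the expense of the majority class.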

*What would have to be changed to yield better results?*

Several elements of the approach contribute to this result. Firstly, the Perceptron is configured with a margin of 1; lowering it might reduce the misclassification of the majority class. Secondly, there is the over-sampling performed by SMOTE, which created an additional 9,800 synthetic examples for the minority class. A further approach could be to treat the degree of over-sampling as a hyperparameter and optimize it to maximize the balanced accuracy.

This might decrease the misclassification rate of the majority class whilst continuing to predict the minority class cases correctly.
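
Treating the degree of over-sampling as a hyperparameter could look roughly like the sketch below. It is only an illustration under several assumptions: the helper `oversample_to_ratio` and the demo data are made up, and plain random duplication stands in for SMOTE to keep the example short. As far as I know, imbalanced-learn's `SMOTE` accepts a float `sampling_strategy` for binary problems that plays the same role as `ratio` here.

```python
import numpy as np

def oversample_to_ratio(X, y, ratio, rng, minority_label=1):
    """Randomly duplicate minority samples until n_minority / n_majority >= ratio.

    Stand-in for a tunable resampler; random duplication is used here
    instead of SMOTE to keep the sketch short.
    """
    minority_ids = np.flatnonzero(y == minority_label)
    n_majority = len(y) - len(minority_ids)
    n_target = int(np.ceil(ratio * n_majority))
    n_extra = max(n_target - len(minority_ids), 0)
    extra_ids = rng.choice(minority_ids, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_ids])
    return X[keep], y[keep]

rng = np.random.default_rng(1)
# Made-up data with the same 99:1 imbalance and labels in {-1, +1}.
X_demo = rng.normal(size=(10_000, 2))
y_demo = np.where(np.arange(10_000) < 100, 1, -1)

for ratio in (0.05, 0.25, 1.0):  # candidate values for the hyperparameter
    _, y_res = oversample_to_ratio(X_demo, y_demo, ratio, rng)
    print(ratio, int((y_res == 1).sum()), int((y_res == -1).sum()))
```

Each candidate ratio would then be evaluated with the same cross-validation procedure as before, and the ratio with the highest mean balanced accuracy would be kept.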
_______________
_______________