In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Task 1: Statistics

## Conditional Probabilities
The University of Bremen has 18,631 students, of whom 6,671 study natural sciences and engineering (see https://www.uni-bremen.de/en/university/profile/facts-figures). From personal experience, three-quarters of your friends in the natural sciences like mate (a beverage).
You are curious whether you can determine how likely it is that someone studies in this field, given that they like mate. You therefore run a quick experiment in the mensa: at randomly chosen tables, you ask people for their field of study and how much they like mate.

The following matrix holds your data. The first column records whether the person studies natural sciences, and the second how much they like mate (on a scale from -2 to 2, where higher means they like it more; a neutral answer of 0 is not allowed).

In [19]:
questionaire_mate = np.array([[True, 1], [False, -1], [False, 1], [False, -1], [True, 1], [False, 1], [False, -2], [False, -1]])
questionaire_mate

array([[ 1,  1],
       [ 0, -1],
       [ 0,  1],
       [ 0, -1],
       [ 1,  1],
       [ 0,  1],
       [ 0, -2],
       [ 0, -1]])

Given a person likes mate, how likely are they to study in the natural sciences?

In [26]:
# reduce the second column to a boolean flag: 1 = likes mate, 0 = does not like mate
questionaire_mate[:,1] = questionaire_mate[:,1] > 0

# calculate using Bayes' theorem (https://en.wikipedia.org/wiki/Bayes%27_theorem):
# P(ns | mate) = P(ns) * P(mate | ns) / P(mate)
p_ns = 6671/18631  # prior P(ns), from the university statistics
print(f"Chance that a student studies natural sciences: {p_ns}")
p_mate = sum(questionaire_mate[:,1])/len(questionaire_mate)  # P(mate), estimated from the survey
print(f"Chance that student likes mate: {p_mate}")
p_ns_mate = 3/4  # P(mate | ns), taken from personal experience
print(f"Chance that student that studies natural sciences likes mate: {p_ns_mate}")
p_mate_ns = p_ns * p_ns_mate / p_mate  # posterior P(ns | mate)
print(f"Chance that student that likes mate studies natural sciences: {p_mate_ns}")

Chance that a student studies natural sciences: 0.3580591487306103
Chance that student likes mate: 0.5
Chance that student that studies natural sciences likes mate: 0.75
Chance that student that likes mate studies natural sciences: 0.5370887230959154
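
As a cross-check, the same conditional probability can also be estimated from the survey rows alone, without the university-wide prior or the three-quarters figure. The sketch below is only an illustration: the names `survey`, `studies_ns` and `likes_mate` are introduced here and are not part of the original solution. With only eight respondents the estimate is very noisy, so a deviation from the Bayes result above is to be expected.

```python
import numpy as np

# Same survey data as above: column 0 = studies natural sciences, column 1 = mate rating
survey = np.array([[1, 1], [0, -1], [0, 1], [0, -1], [1, 1], [0, 1], [0, -2], [0, -1]])

studies_ns = survey[:, 0] == 1
likes_mate = survey[:, 1] > 0

# Empirical P(natural sciences | likes mate):
# fraction of the respondents who like mate that also study natural sciences
p_ns_given_mate = (studies_ns & likes_mate).sum() / likes_mate.sum()
print(f"Empirical P(natural sciences | likes mate): {p_ns_given_mate}")
```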


## Maximum Likelihood Estimation
A Gaussian normal distribution can be fitted by applying maximum likelihood estimation to determine the parameters that best explain a given dataset. This is equivalent to calculating the mean (and variance) of the dataset directly. Why?

The Gaussian normal distribution is given as follows:

$$
\mathcal{N}(x \mid \mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}} 
e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}
$$

Hint: The partial derivative is easier to compute when using a log likelihood.
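
Before working through the derivation, the claim can be checked numerically. The sketch below is a sanity check, not a proof: it draws a hypothetical sample, evaluates the Gaussian log-likelihood at the sample mean and the sample standard deviation (dividing by $N$), and compares it against slightly perturbed parameters. The function name `gaussian_log_likelihood` and all numbers are assumptions made for illustration.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """Sum of log N(x_i | mu, sigma) over all samples."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # hypothetical data set

mu_hat = x.mean()
sigma_hat = x.std()  # ddof=0, i.e. dividing by N

ll_hat = gaussian_log_likelihood(x, mu_hat, sigma_hat)

# Perturb both parameters a little; no combination should beat the sample statistics
perturbed = [
    gaussian_log_likelihood(x, mu_hat + dm, sigma_hat * ds)
    for dm in (-0.1, 0.0, 0.1)
    for ds in (0.9, 1.0, 1.1)
]
print(ll_hat >= max(perturbed))
```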

## Kullback-Leibler Divergence

$$D_{KL}(P|Q) = \sum_x P(x)\log\left(\frac{P(x)}{Q(x)}\right)$$

a) Calculate the KL divergence for two discrete distributions $P$ and $Q$ over events $A,B,C$. 
Calculate $D_{KL}(P|Q)$ and $D_{KL}(Q|P)$ and compare! 

| Distribution | A | B | C |
| --- | --- | --- | --- |
| P | 0.5 | 0.3 | 0.2 |
| Q | 0.4 | 0.2 | 0.4 |

In [3]:
p = [0.5,0.3,0.2]
q = [0.4,0.2,0.4]
# implement or calculate
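
One possible way to approach part a) is a direct translation of the definition into numpy. The sketch below assumes the natural logarithm and strictly positive probabilities; the helper name `kl_divergence` is introduced here for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P|Q) = sum_x P(x) * log(P(x) / Q(x)), natural log; assumes all entries > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.2, 0.4]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(f"D_KL(P|Q) = {kl_pq:.4f}")
print(f"D_KL(Q|P) = {kl_qp:.4f}")
```

Comparing the two printed values shows whether the divergence is symmetric in its arguments.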

b) For this task, assume for simplicity that $P$ and $Q$ are discrete distributions over two events $A,B$. 

i) For a given $P$, what $Q_{min}$ minimizes $D_{KL}(P|Q)$? Justify your answer!

ii) For a given $P$, show that there is no upper bound for $D_{KL}(P|Q)$.

In [4]:
# implement or calculate
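
A numerical experiment can build intuition for part b) before writing the justification. The sketch below fixes a hypothetical $P(A) = 0.7$, scans $Q(A)$ over a grid to find where the divergence is smallest, and then lets $Q(A)$ shrink towards 0. It illustrates the behaviour for one example only and does not replace the argument asked for.

```python
import numpy as np

def kl_two_events(p_a, q_a):
    """D_KL(P|Q) for P = (p_a, 1 - p_a) and Q = (q_a, 1 - q_a), natural log."""
    return p_a * np.log(p_a / q_a) + (1 - p_a) * np.log((1 - p_a) / (1 - q_a))

p_a = 0.7  # hypothetical P(A), chosen only for illustration

# i) Scan Q(A) over a grid and locate the smallest divergence
q_grid = np.linspace(0.01, 0.99, 99)
kl_values = kl_two_events(p_a, q_grid)
q_min = q_grid[np.argmin(kl_values)]
print(f"Grid minimum at Q(A) = {q_min:.2f}, D_KL = {kl_values.min():.6f}")

# ii) Let Q(A) shrink towards 0 and watch the divergence grow
small_q = np.array([1e-2, 1e-4, 1e-8, 1e-16])
growing = kl_two_events(p_a, small_q)
print(growing)
```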

c) What is the relationship between KL divergence and cross-entropy?

# Task 2: Feature Compression

Assume you want to perform classification with two classes, $A$ and $B$, in the feature space $\mathbb{R}^{2n}$. We can assume that the two classes follow a normal distribution, with $\mu_A = (\mu_1, \mu_2)$ and $\mu_B = (\mu_1, \mu_3)$, where $\mu_1, \mu_2, \mu_3 \in \mathbb{R}^n$. The covariance matrix $\Sigma$ is identical for both distributions, see below ($\sigma \in [0,1]$, $\alpha \approx 1$). You perform a Principal Component Analysis (PCA) for feature space transformation.

$$\Sigma =
\left(
  \begin{array}{ccc}
  \begin{array}{cc} 
\sigma & \alpha\\
\alpha & \sigma
\end{array} & \dots & 0  \\
  \vdots & \ddots & \vdots  \\
  0 & \dots & \begin{array}{cc} 
\sigma & \alpha\\
\alpha & \sigma
\end{array} 
\end{array} \right) \in \mathbb{R}^{2n \times 2n} $$

a) Without calculating the result, predict what the sorted sequence of eigenvalues will look like. You do not need to give exact numbers; instead, sketch the graph of eigenvalues by eigenvector index. How many components do you expect to keep in order to retain most of the variance in the data?

Hint: you may choose to implement and plot an example for this task.
If so, np.random.multivariate_normal and sklearn.decomposition might come in handy.
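
A minimal sketch of such an example, using numpy only, could look as follows. Several choices here are assumptions made for illustration: $n = 3$, $\sigma = 1$, $\alpha = 0.9$ (kept below $\sigma$ so that the matrix is a valid covariance), and a single class with zero mean, so only the within-class covariance is examined. The eigenvalues of the sample covariance are the quantities PCA sorts by; `sklearn.decomposition.PCA` should report the same spectrum in its `explained_variance_` attribute.

```python
import numpy as np

n = 3         # hypothetical: the feature space has 2n = 6 dimensions
sigma = 1.0   # diagonal entry (assumed value)
alpha = 0.9   # off-diagonal entry, close to 1 but below sigma (assumed value)

# Block-diagonal covariance: n copies of [[sigma, alpha], [alpha, sigma]]
block = np.array([[sigma, alpha], [alpha, sigma]])
cov = np.kron(np.eye(n), block)

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(2 * n), cov=cov, size=5000)

# Eigenvalues of the sample covariance, sorted in descending order
eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(samples, rowvar=False)))[::-1]
explained = np.cumsum(eigenvalues) / eigenvalues.sum()

print(np.round(eigenvalues, 3))
print(np.round(explained, 3))  # cumulative share of variance per number of components
```

Plotting `eigenvalues` against their index, for example with `plt.plot`, gives the sketch asked for in part a).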

b) Is the number of features you answered for part a) representative of the minimum number of features required for discriminating the two classes? Justify your answer!

# Task 3: Logistic Regression


Logistic regression is a simple, but important classification technique (despite the name, it is not used for regression) for binary classification tasks. 

To classify a sample $x$, we:

1. Calculate $z(x) = \theta^Tx$ (to include a bias term, add a constant feature 1 to $x$).
2. Apply $h(x)=\sigma(z(x))$ with $\sigma(s)=\frac{1}{1+e^{-s}}$  
3. Apply a threshold $t$ to $h(x)$ to discriminate between the two classes (i.e., assign class 0 to $x \iff h(x) < t$)
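
The three classification steps can be sketched in a few lines of numpy. The weights `theta`, the samples `X` and the threshold `t = 0.5` below are hypothetical values chosen for illustration.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

theta = np.array([-1.0, 2.0, 0.5])   # hypothetical weights; theta[0] acts as the bias
X = np.array([[0.2, 1.0],
              [1.5, -0.3],
              [0.0, 0.0]])

# Step 1: prepend the constant feature 1 and compute z(x) = theta^T x for every row
X_b = np.hstack([np.ones((X.shape[0], 1)), X])
z = X_b @ theta

# Step 2: squash z into the interval (0, 1)
h = sigmoid(z)

# Step 3: class 0 if h(x) < t, otherwise class 1
t = 0.5
labels = (h >= t).astype(int)
print(labels)
```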

For training, we initialize $\theta$ randomly and perform gradient descent, i.e., loop over the following steps:

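1. Calculate the loss $J(\theta)$ on the $m$ training samples with $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \cdot \log(p_i) + (1-y_i) \cdot \log(1-p_i)\right]$, where $p_i = h(x_i)$ and $y_i \in \{0,1\}$ is the label of sample $x_i$
2. Adjust the weights $\theta$ in the direction of the negative gradient $-\frac{\partial J}{\partial \theta}$ with a learning rate of $l$, i.e., $\theta \leftarrow \theta - l \cdot \frac{\partial J}{\partial \theta}$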
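
When deriving the gradient by hand, a finite-difference check is a cheap way to catch sign or indexing mistakes. The sketch below assumes the loss is averaged over all training samples and compares a candidate gradient, $X^T(h - y)/m$, against central differences on hypothetical toy data. All function names are introduced here for illustration.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss(theta, X, y):
    """Mean binary cross-entropy of h = sigmoid(X @ theta)."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(theta, X, y):
    """Candidate gradient X^T (h - y) / m of the mean cross-entropy loss."""
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

def numeric_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one coordinate of theta at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (loss(theta + step, X, y) - loss(theta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # hypothetical toy data with bias column
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_gradient(theta, X, y) - numeric_gradient(theta, X, y)))
print(f"Largest deviation between analytic and numeric gradient: {max_diff:.2e}")
```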

a) Argue why logistic regression can be considered a special case of a neural network.

b) Assume logistic regression is used to detect a target class among non-targets. Describe how you can adjust the algorithm depending on whether high recall or high precision is more important in your application.
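
To explore this question empirically, precision and recall can be computed for several thresholds $t$. The labels and predicted probabilities below are hypothetical, and `precision_recall` is a helper introduced for this sketch; it treats class 1 as the target class.

```python
import numpy as np

def precision_recall(y_true, probs, t):
    """Precision and recall for the target class 1 when predicting 1 iff prob >= t."""
    pred = probs >= t
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    # Convention chosen here: precision is 1.0 if nothing is predicted positive
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical labels and predicted probabilities h(x)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
probs = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.55, 0.45])

for t in (0.3, 0.5, 0.7):
    precision, recall = precision_recall(y_true, probs, t)
    print(f"t = {t}: precision = {precision:.2f}, recall = {recall:.2f}")
```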

c) Program a classifier class MyLogisticRegression with methods fit(X, y) and predict(X, threshold) that implements training and classification as described above. While you should use PyTorch in all following programming tasks, use only elementary Python and numpy methods here. For this purpose, you will need to determine the partial derivative of $J(\theta)$. Fill out the following skeleton class.

In [5]:
EPS = 1e-12

class MyLogisticRegression:
    def __init__(self, lr=0.01, num_iter=100000, verbose=False):
        self.lr = lr
        self.num_iter = num_iter
        self.verbose = verbose
    
    def __add_intercept(self, X):
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)
    
    def __sigmoid(self, z):
        z = np.clip(z, -500, 500)  # limit z so that np.exp(-z) cannot overflow
        return 1 / (1 + np.exp(-z))
    
    def __loss(self, h, y):
        h = np.clip(h, EPS, 1-EPS)
        return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
    
    def fit(self, X, y):
        # implement
        pass
     
    def predict_prob(self, X):
        # implement
        pass
    
    def predict(self, X, threshold):
        # implement
        pass

d) Evaluate your logistic regression classifier on the breast cancer data set (available in scikit-learn). The optimization problem during training of logistic regression is convex, i.e., gradient descent with a suitable learning rate converges towards a global minimum. How can you verify this empirically?

Hint: if you had trouble implementing the logistic regression earlier, you may use the sklearn version here.
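
One way to probe convexity empirically is to train from several random starting points and compare the final losses. The sketch below does this on hypothetical, non-separable toy data so that it runs without scikit-learn; the learning rate, iteration count and initialization scale are assumptions. The same loop could be applied to the breast cancer features, which would probably need to be standardized first.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss(theta, X, y):
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_descent(theta, X, y, lr=0.5, num_iter=5000):
    for _ in range(num_iter):
        h = sigmoid(X @ theta)
        theta = theta - lr * (X.T @ (h - y)) / len(y)
    return theta

rng = np.random.default_rng(0)

# Hypothetical toy data with a bias column; labels are sampled, so classes overlap
features = rng.normal(size=(200, 2))
X = np.hstack([np.ones((200, 1)), features])
true_theta = np.array([0.5, 1.0, -1.0])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

# Train from several random starting points and compare the final losses
final_losses = []
for _ in range(5):
    theta0 = rng.normal(scale=3.0, size=3)
    final_losses.append(loss(gradient_descent(theta0, X, y), X, y))

print(np.round(final_losses, 6))
print(f"Spread of final losses: {max(final_losses) - min(final_losses):.2e}")
```

If all runs end at practically the same loss, this is consistent with a single global minimum, although a finite number of runs cannot prove it.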

In [6]:
from sklearn.datasets import load_breast_cancer

(X,y) = load_breast_cancer(return_X_y=True)
# implement