## Instance based learning

- systems that learn the training examples by heart and then generalize to new instances using a similarity measure
- called instance-based because it builds the hypotheses from the training instances
- also called lazy learning, memory-based learning, or case-based reasoning
- prediction cost for a brute-force search is $O(n)$ distance computations per query, where $n$ is the number of training instances; training itself only stores the data

### k-Nearest Neighbors (kNN) 
- non-parametric method, supervised learning
- used for classification
    - object is classified by a majority vote of its $k$ nearest neighbors
    - if $k=1$, then the object is simply assigned to the class of that single nearest neighbor
- used for regression
    - object's value is estimated by the average of its $k$ nearest neighbors
- $k$ is a hyperparameter that is usually chosen by cross-validation
- kNN is sensitive to the local structure of the data
    - if the data is not uniformly sampled, then the nearest neighbors will not be representative of the entire data set
    - in this case, the decision boundary will be irregular
- kNN is sensitive to the distance metric used (Euclidean, Manhattan, Minkowski, etc.)
- drawback when class distribution is skewed
    - if one class is much more frequent than the others, then the nearest neighbors will be dominated by the most frequent class
    - solution: use weighted voting, where the weights are the inverse of the distance to the query point
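
The effect of inverse-distance weighting on a skewed data set can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `knn_predict`, the toy data, and the small constant added to the distance are all assumptions made for the example.

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x_query, k=3, weighted=False):
    """Classify x_query by a plain or inverse-distance-weighted vote of its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    votes = defaultdict(float)
    for idx in nearest:
        # Inverse-distance weight; the small constant (an assumption) avoids division by zero
        votes[y_train[idx]] += 1.0 / (dists[idx] + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical skewed data: class "a" is four times as frequent as class "b"
X_train = np.array([[0.0], [2.0], [2.2], [2.4], [2.6]])
y_train = ["b", "a", "a", "a", "a"]
x_query = np.array([0.5])

print(knn_predict(X_train, y_train, x_query, k=3))                 # plain majority vote
print(knn_predict(X_train, y_train, x_query, k=3, weighted=True))  # inverse-distance vote
```

Here the query sits closest to the single "b" point, but two of its three nearest neighbors are "a", so the plain vote and the weighted vote disagree.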

### Locally Weighted Regression (LWR)
- memory-based method that performs a regression around a point of interest using only training data that are local to that point
- non-parametric method (parameters are computed individually for each query point)
- cost function is modified as $$J(\theta) = \frac{1}{2m} \sum_{i=1}^m w^{(i)} (h_\theta(x^{(i)}) - y^{(i)})^2$$ where $w^{(i)} = \exp \left( - \frac{(x^{(i)} - x)^2}{2\tau^2} \right)$ and $\tau$ is the bandwidth parameter that controls the degree of smoothing
    - if $(x^{(i)} - x)^2$ is small, then $w^{(i)}$ is close to 1, and vice versa
- training data must be available at the time of prediction
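
A minimal sketch of LWR for scalar inputs, assuming a linear hypothesis with an intercept and solving the weighted least-squares problem in closed form (the closed-form solve and the name `lwr_predict` are choices made for this example, not something the notes prescribe):

```python
import numpy as np

def lwr_predict(x_train, y_train, x_query, tau=0.5):
    """Fit a weighted linear model around x_query and return its prediction there."""
    X = np.column_stack([np.ones_like(x_train), x_train])   # add an intercept column
    w = np.exp(-(x_train - x_query) ** 2 / (2 * tau ** 2))  # Gaussian weights w^(i)
    W = np.diag(w)
    # Minimizing the weighted cost J(theta) leads to (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    return np.array([1.0, x_query]) @ theta

# Hypothetical noiseless data from a non-linear target
x_train = np.linspace(0, 6, 30)
y_train = np.sin(x_train)
pred = lwr_predict(x_train, y_train, x_query=2.0, tau=0.3)
print(pred, np.sin(2.0))
```

A new $\theta$ is solved for every query point, which is why the training data must be kept around at prediction time.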

### Kernel function
- for linear regression, the hypothesis is $h_\theta(x) = \theta^T x$, so the dot product is central to the prediction
- suppose the vectors are not linearly separable; then we can use a function $\phi(x)$ to map the vectors to a higher dimensional space where they may become linearly separable
- a kernel function computes, directly from the vectors in the original space, the dot product of their images in the higher dimensional space
- Let $\phi(x) : \mathbb{R}^2 \rightarrow \mathbb{R}^3$ be a function that maps $x$ from 2D to 3D space, and the kernel function is defined as $K(x, x^*) = \phi(x) \cdot \phi(x^*)$
- Consider $x = [x_1, x_2]$,  $x^* = [x^*_1, x^*_2]$, then $\phi(x) = [x_1^2, \sqrt{2}x_1x_2, x_2^2]$ and $\phi(x^*) = [x_1^{*2}, \sqrt{2}x_1^*x_2^*, x_2^{*2}]$
- Here $\phi(x) \cdot \phi(x^*) = x_1^2x_1^{*2} + 2x_1x_2x_1^*x_2^* + x_2^2x_2^{*2} = (x_1x_1^* + x_2x_2^*)^2 = (x \cdot x^*)^2$
- So $K(x, x^*) = (x \cdot x^*)^2$ is a kernel function: it gives the dot product in the higher dimensional space without explicitly computing $\phi(x)$ and $\phi(x^*)$, which would be the computationally expensive part
- $x \rightarrow \phi(x)$, $x^* \rightarrow \phi(x^*)$, then $x \cdot x^* \rightarrow K(x, x^*) = \phi(x) \cdot \phi(x^*)$
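
The identity above can be checked numerically. The vectors below are arbitrary values picked for the check:

```python
import numpy as np

def phi(x):
    """Explicit feature map from 2D to 3D used in the derivation."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, x_star):
    """Kernel computed in the original 2D space."""
    return np.dot(x, x_star) ** 2

x = np.array([1.0, 2.0])
x_star = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(x_star))  # dot product after mapping to 3D
implicit = K(x, x_star)                 # same value without computing phi
print(explicit, implicit)
```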

### Radial Basis Functions (RBF) 
- The basic idea behind RBFs is to model the data using a set of basis functions, where each basis function represents a localized influence on the data.
- a real-valued function $\varphi$ whose value depends only on the distance from a fixed point
    - point is either the origin, so that $\varphi(\mathbf{x}) = \hat{\varphi}(\|\mathbf{x}\|)$
    - a fixed point $\mathbf{c}$, called a center, so that $\varphi_c(\mathbf{x}) = \hat{\varphi}(\|\mathbf{x} - \mathbf{c}\|)$
- examples of RBFs:
    - Gaussian: $\varphi(r) = e^{-(\epsilon r)^2}$ where $r = \|\mathbf{x} - \mathbf{c}\|$ and $\epsilon$ is a shape parameter
    - Multiquadric: $\varphi(r) = \sqrt{1 + (\epsilon r)^2}$
    - Inverse quadratic: $\varphi(r) = \frac{1}{1 + (\epsilon r)^2}$
    - Inverse multiquadric: $\varphi(r) = \frac{1}{\sqrt{1 + (\epsilon r)^2}}$
    - Thin plate spline: $\varphi(r) = r^2 \ln(r)$
    - Cubic: $\varphi(r) = r^3$
    - Wendland: $\varphi(r) = (1 - \epsilon r)^4_+ (4\epsilon r + 1)$
- used to build function approximations of the form $$y(\mathbf{x}) = \sum_{i=1}^N w_i \varphi(\|\mathbf{x} - \mathbf{x}_i\|)$$
    - approximating function is represented as a sum of $N$ radial basis functions, each associated with a different center $\mathbf{x}_i$, and weighted by an appropriate coefficient $w_i$
    - weights $w_i$ can be estimated using the matrix methods of linear least squares, because the approximating function is linear in the weights $w_i$
- numerical: finding target value for a new point
    - say the data points are [1, 2, 3, 6, 7] and their targets [4, 6, 2, 10, 8]
    - new point is $x_{new} = 4$
    - choosing Gaussian RBF, $\varphi(r) = e^{-(\epsilon r)^2}$
    - for simplicity, let $\epsilon = 1$
    - choose the data points themselves as the centers, so $\mathbf{x}_i = [1, 2, 3, 6, 7]$
    - RBF for each center using the formula is $\varphi(x_{new}, c_i) = e^{-1 \times (x_{new} - c_i)^2}$
    - $predicted\_target = \frac{\sum_{i=1}^N target(c_i) \times \varphi(x_{new}, c_i)}{\sum_{i=1}^N \varphi(x_{new}, c_i)}$
- numerical: interpolation function fitted to the data points
    - say the data points are (x, y): (1, 2) (2, 3) (3, 4) (4, 5) (5, 6)
    - choose 3 centers: $c_1 = 1, c_2 = 3, c_3 = 5$
    - choose Gaussian RBF, $\varphi(x, c_i) = e^{-(\epsilon \|x - c_i\|)^2}$ where $\epsilon = 1$
    - then interpolation function, $F(x) = \sum_{i=1}^3 w_i \varphi(x, c_i)$
    - to find the weights, we write one linear equation per data point
        1. $2 = w_1 \varphi(1, 1) + w_2 \varphi(1, 3) + w_3 \varphi(1, 5)$
        2. $3 = w_1 \varphi(2, 1) + w_2 \varphi(2, 3) + w_3 \varphi(2, 5)$ and so on...
    - this gives 5 equations in 3 unknowns, so the weights are found by linear least squares; in general $F$ passes exactly through every data point only when there are as many centers as data points
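
Both numerical examples can be worked through in a few lines of NumPy. This is a sketch under the stated choices ($\epsilon = 1$, Gaussian RBF); the helper names `gaussian_rbf` and `F`, and the use of `np.linalg.lstsq` for the overdetermined system, are assumptions of this example.

```python
import numpy as np

def gaussian_rbf(x, c, eps=1.0):
    """Gaussian RBF of the distance between scalar inputs x and centers c."""
    return np.exp(-(eps * (x - c)) ** 2)

# Example 1: normalized RBF prediction at x_new = 4, using the data points as centers
centers = np.array([1.0, 2.0, 3.0, 6.0, 7.0])
targets = np.array([4.0, 6.0, 2.0, 10.0, 8.0])
phi = gaussian_rbf(4.0, centers)
predicted_target = np.sum(targets * phi) / np.sum(phi)

# Example 2: weights for 3 centers fitted to 5 data points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
fit_centers = np.array([1.0, 3.0, 5.0])
Phi = gaussian_rbf(x[:, None], fit_centers[None, :])  # one row per equation, shape (5, 3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # least-squares solution for the weights

def F(x_query):
    """Fitted RBF approximation evaluated at a scalar query point."""
    return gaussian_rbf(x_query, fit_centers) @ w

print(predicted_target, w, F(3.0))
```

In the first example the prediction lands close to the target of the nearest center ($x = 3$, target 2), pulled up slightly by the neighbors at 2 and 6.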


### RBF Network
- fundamental idea is that an item's predicted target value is likely to be about the same as that of other items with similar values of the predictor variables
- places one or many RBF neurons in the space described by the predictor variables
- space has multiple dimensions corresponding to the number of predictor variables present
- calculates the Euclidean distance from the evaluated point to the center of each neuron
- RBF (kernel function) is applied to the distance to calculate every neuron's weight (influence)
- the greater the distance of a neuron from the point being evaluated, the less influence (weight) it has
- the prediction for a new point is the weighted sum of the neurons' RBF outputs, where each RBF is evaluated at the distance between the new point and that neuron's center
- RBF network is a three-layer neural network
    - input layer: neurons that receive the input values
    - hidden layer: neurons that apply the RBF function to the distance between the input values and the center of each neuron
    - output layer: neurons that sum the output values of the hidden layer neurons multiplied by the weight of each neuron $$ y(\mathbf{x}) = \sum_{i=1}^N w_i \varphi(\|\mathbf{x} - \mathbf{x}_i\|)$$
- the approximant $y(\mathbf{x})$ is differentiable with respect to the weights $w_i$, hence the weights can be estimated using any of the standard iterative methods for neural networks
- numerical: RBF network, you have output value for a few 2D points
    - method is same as interpolation (pg. 700)

## Support Vector Machines (SVM) [🔗](https://drive.google.com/file/d/12KgpHBHalf4WFJHsVfbim1dQDdVyPk8d/view?usp=drive_link)
- maps training examples to points in space so as to maximise the width of the gap between the two categories
- new examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall
- SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces
- [Link](https://shuzhanfan.github.io/2018/05/understanding-mathematics-behind-support-vector-machines/)

The functional margin of a training example $(x_i, y_i)$ is represented as $\hat{\gamma}$ and the geometric margin as $\gamma$, with $\hat{\gamma} = y_i(w^Tx_i + b)$ and $\gamma = \frac{\hat{\gamma}}{||w||}$, where $w$ is the weight vector (the vector orthogonal to the hyperplane). The sign of the functional margin says whether the prediction is correct. Its size can be read as confidence only while $||w||$ is held constant, because rescaling $(w, b)$ changes $\hat{\gamma}$ without moving the hyperplane.

The geometric margin is the functional margin scaled by $\frac{1}{||w||}$ and gives the distance between a given training example and the given hyperplane. It is invariant to the scaling of the vector orthogonal to the hyperplane.
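
A small numerical check of the scaling behaviour, using an arbitrary hyperplane and point chosen for illustration:

```python
import numpy as np

def functional_margin(w, b, x, y):
    return y * (w @ x + b)

def geometric_margin(w, b, x, y):
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

# Hypothetical hyperplane and labelled point
w = np.array([3.0, 4.0])
b = -5.0
x = np.array([3.0, 4.0])
y = 1

# Rescaling (w, b) by 10 leaves the hyperplane where it is
print(functional_margin(w, b, x, y), functional_margin(10 * w, 10 * b, x, y))
print(geometric_margin(w, b, x, y), geometric_margin(10 * w, 10 * b, x, y))
```

The functional margin grows with the scale factor while the geometric margin stays the same.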

The hard margin SVM maximizes the margin between the two classes subject to the constraint that all data points are classified correctly. The margin is the distance between the two parallel hyperplanes that bound the classes; when $(w, b)$ is scaled so that the closest points satisfy $y_i(w^Tx_i + b) = 1$, this distance is $\frac{2}{||w||}$. Maximizing it is therefore equivalent to minimizing $\frac{1}{2}||w||^2$ subject to the constraint $y_i(w^Tx_i + b) \geq 1$ for all data points $i$. [Video](https://www.youtube.com/watch?v=vNt_WCM1M3M&list=PLAoF4o7zqskR7U98D799FKHkZ4YrHKPqs&index=82)

The Lagrangian of the primal problem, with multipliers $\alpha_i \geq 0$, is $$ L(w, b, \alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^n \alpha_i[y_i(w^Tx_i + b) - 1] $$

The dual function for SVM is 
$$ \Theta_D(\alpha) = \min_{w, b} L(w, b, \alpha) $$

Setting the derivatives of $L$ with respect to $w$ and $b$ to zero gives $w = \sum_{i=1}^n \alpha_i y_i x_i$ and $\sum_{i=1}^n \alpha_i y_i = 0$. Substituting these back gives $$ \Theta_D(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j $$

This gives us the dual problem, which is a quadratic optimization problem in $\alpha$. Both the primal and the dual are convex; the dual is often preferred because the data enter only through the dot products $x_i^T x_j$ and the constraints are simple. It is given by $$ \max_{\alpha} \Theta_D(\alpha) $$ subject to $ \sum_{i=1}^n \alpha_i y_i = 0 $ and $ \alpha_i \geq 0 $ for all $i$.

So we can first solve the dual problem for $\alpha$, then recover $w = \sum_{i=1}^n \alpha_i y_i x_i$ and obtain $b$ from any support vector (a point with $\alpha_i > 0$), for which $y_i(w^Tx_i + b) = 1$.

#### Kernel Trick

A feature map $\phi$ sends data points from a low-dimensional space to a higher-dimensional space, where the data may become linearly separable. A kernel function returns the dot product between two transformed data points: $$ K(x_i, x_j) = \phi(x_i)^T \phi(x_j) $$ where $\phi(x_i)$ is the transformed data point. Because the dual problem uses the data only through dot products, replacing $x_i^T x_j$ with $K(x_i, x_j)$ trains the SVM in the higher-dimensional space without computing $\phi$ explicitly.

- Equation of linear kernel function is $$ K(x_i, x_j) = x_i^T x_j $$
- Equation of polynomial kernel function is $$ K(x_i, x_j) = (x_i^T x_j + c)^d $$
- Equation of radial basis function kernel function is $$ K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2) $$ with $\gamma > 0$
    - small value of $\gamma$ will make the model behave like a linear SVM
    - large value of $\gamma$ will make the model heavily impacted by the support vector examples
- Equation of sigmoid kernel function is $$ K(x_i, x_j) = \tanh(\beta x_i^T x_j + \theta) $$
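
The four kernels can be written as plain functions. This is a sketch only: the default parameter values are arbitrary, and the RBF kernel is written with a negative exponent so that it decays with distance.

```python
import numpy as np

def linear_kernel(a, b):
    return a @ b

def polynomial_kernel(a, b, c=1.0, d=2):
    return (a @ b + c) ** d

def rbf_kernel(a, b, gamma=0.5):
    # Negative exponent: the kernel is 1 for identical points and decays with distance
    return np.exp(-gamma * np.sum((a - b) ** 2))

def sigmoid_kernel(a, b, beta=0.1, theta=0.0):
    return np.tanh(beta * (a @ b) + theta)

# Hypothetical input vectors
a = np.array([1.0, 2.0])
b = np.array([2.0, 0.0])
print(linear_kernel(a, b), polynomial_kernel(a, b), rbf_kernel(a, b), sigmoid_kernel(a, b))
```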

#### Soft margin SVM

The optimization problem is given by $$ \min_{w, b} \frac{1}{2}||w||^2 + C \sum_{i=1}^n \xi_i $$ subject to $ y_i(w^Tx_i + b) \geq 1 - \xi_i $ and $ \xi_i \geq 0 $ for all $i$.

In soft margin SVM, a slack variable $\xi_i$ is introduced for every data point $x_i$. The value of $\xi_i$ measures how far $x_i$ lies on the wrong side of its class's margin, and is zero otherwise. This allows some misclassifications to happen while keeping the margin as wide as possible so that other points can still be classified correctly.
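
For a fixed $(w, b)$, the smallest slack that satisfies both constraints is $\xi_i = \max(0, 1 - y_i(w^Tx_i + b))$. The sketch below evaluates the slack values and the soft margin objective for a hypothetical hyperplane and three points; all values are made up for illustration.

```python
import numpy as np

def slack(w, b, X, y):
    """Smallest xi_i satisfying y_i (w^T x_i + b) >= 1 - xi_i and xi_i >= 0."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

def soft_margin_objective(w, b, X, y, C):
    return 0.5 * (w @ w) + C * slack(w, b, X, y).sum()

# Hypothetical hyperplane x_1 = 0 and three points of class +1
w = np.array([1.0, 0.0])
b = 0.0
X = np.array([[2.0, 0.0],    # outside the margin: no slack
              [0.5, 0.0],    # inside the margin but correctly classified
              [-0.5, 0.0]])  # misclassified
y = np.array([1, 1, 1])

xi = slack(w, b, X, y)
print(xi, soft_margin_objective(w, b, X, y, C=1.0))
```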

#### Regularization parameter C
- determines how heavily the slack variables $\xi_i$ are penalized
    - smaller $C$ penalizes slack less, so more margin violations are tolerated
    - larger $C$ penalizes slack more, so fewer margin violations are tolerated
- controls how the SVM will handle errors
    - as $C$ goes to positive infinity, no violation is tolerated and we get the same result as the hard margin SVM (when the data are separable)
    - if $C$ is 0, slack costs nothing, so the constraints no longer bind and we end up with a hyperplane not classifying anything
- small values of $C$ will result in a wider margin, at the cost of some misclassifications
- large values of $C$ approach the hard margin classifier, which tolerates no constraint violations



## Naive Bayes classifier

- Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions between the features.
- We have a set of features $X = \{X_1, X_2, ..., X_n\}$ and a class variable $Y$.
- We want to find the class $Y$ that maximizes the posterior probability $P(Y|X)$.
- Then $$ P(Y = y_k|X_1, X_2, ..., X_n) = \frac{P(Y = y_k)P(X_1, X_2, ..., X_n|Y = y_k)}{\sum\limits_{j} P(Y = y_j)P(X_1, X_2, ..., X_n|Y = y_j)} $$
- Assuming conditional independence, we have $P(X_1, X_2, ..., X_n|Y = y_k) = \prod\limits_{i=1}^n P(X_i|Y = y_k)$. Therefore, $$P(Y = y_k|X_1, X_2, ..., X_n) = \frac{P(Y = y_k)\prod\limits_{i=1}^n P(X_i|Y = y_k)}{\sum\limits_{j} P(Y = y_j)\prod\limits_{i=1}^n P(X_i|Y = y_j)}$$
- Pick the most probable class: $$\hat{y} = \arg\max\limits_{y_k} P(Y = y_k)\prod\limits_{i} P(X_i|Y = y_k)$$

Steps to apply Naive Bayes classifier, given a table like this:
|Weather|Play|
|---|---|
|Sunny|No|
|...|...|
|Rainy|Yes|

We convert it into a frequency table like this:
|Weather|No|Yes|Total|Prob|
|---|---|---|---|---|
|Sunny|2|3|5| P(Sunny) = $\frac{5}{14}$|
|Overcast|0|4|4| P(Overcast) = $\frac{4}{14}$|
|Rainy|3|2|5| P(Rainy) = $\frac{5}{14}$|
|Total|5|9|14|1|
|Prob|P(No) = $\frac{5}{14}$|P(Yes) = $\frac{9}{14}$|1| |

Then we can calculate the posterior probability of each class, given the evidence (weather), for example, $P(Yes|Sunny)$:
$$ P(Yes|Sunny) = \frac{P(Sunny|Yes)P(Yes)}{P(Sunny)} = \frac{\frac{3}{9}\frac{9}{14}}{\frac{5}{14}} = \frac{3}{5}$$
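
The same calculation can be done directly from the frequency table with exact fractions. The dictionary layout and the function name `posterior` are choices made for this sketch.

```python
from fractions import Fraction

# Counts copied from the frequency table above
counts = {
    "Sunny":    {"No": 2, "Yes": 3},
    "Overcast": {"No": 0, "Yes": 4},
    "Rainy":    {"No": 3, "Yes": 2},
}

def posterior(cls, weather):
    """P(cls | weather) via Bayes' theorem, using the table counts."""
    total = sum(sum(row.values()) for row in counts.values())
    class_total = sum(row[cls] for row in counts.values())
    prior = Fraction(class_total, total)                        # P(cls)
    likelihood = Fraction(counts[weather][cls], class_total)    # P(weather | cls)
    evidence = Fraction(sum(counts[weather].values()), total)   # P(weather)
    return likelihood * prior / evidence

print(posterior("Yes", "Sunny"), posterior("No", "Sunny"))
```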

If there are multiple features, we can calculate the posterior probability of each class, given the evidence (weather and temperature), for example, $P(Yes|Sunny, Cool)$:
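$$ P(Yes|Sunny, Cool) = \frac{P(Sunny, Cool|Yes)P(Yes)}{P(Sunny, Cool)} = \frac{P(Sunny|Yes)P(Cool|Yes)P(Yes)}{P(Sunny, Cool)} $$
The denominator $P(Sunny, Cool)$ is the same for every class, so it can be obtained by summing the numerators over the classes, or dropped when only the most probable class is needed.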

## Naive Bayes for text classification
You need a document $d$, a set of classes $C = \{c_1, c_2, ..., c_n\}$, and a set of $m$ hand-labelled documents $(d_1, c_1), (d_2, c_2), ..., (d_m, c_m)$. Then, for a document $d$, we want to find the class $c$ that maximizes the posterior probability $P(c|d)$.
$$ P(c|d) = \frac{P(c)P(d|c)}{P(d)} = \frac{P(c)\prod\limits_{i=1}^n P(w_i|c)}{P(d)}$$
Here, there are two assumptions : bag of words (position doesn't matter) and conditional independence.
Then, we pick the most probable class: $$c_{MAP} = \arg\max\limits_{c} P(c)\prod\limits_{i=1}^n P(w_i|c)$$
Here, $$ P(c_j) = \frac{docCount(C = c_j)}{N_{doc}} $$ and $$ P(w_i|c_j) = \frac{wordCount(w_i, C = c_j)}{\sum\limits_{w \in V} wordCount(w, C = c_j)} $$, where $V$ is the vocabulary.
This has a problem of zero probability, so we use Laplace smoothing: $$ P(w_i|c_j) = \frac{wordCount(w_i, C = c_j) + 1}{\sum\limits_{w \in V} wordCount(w, C = c_j) + |V|} $$

[Example](https://www.fi.muni.cz/~sojka/PV211/p13bayes.pdf):
<img src="https://i.imgur.com/p3nZUNM.png" width="500" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">
<img src="https://i.imgur.com/kcNsCro.png" width="500" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

Therefore, $$P(C|d_5) = \frac{3}{4} {(\frac{3}{7})}^3 \frac{1}{14} \frac{1}{14} \frac{1}{P(d_5)}$$
and $$P(\bar{C} | d_5) = \frac{1}{4} {(\frac{2}{9})}^3 \frac{2}{9} \frac{2}{9} \frac{1}{P(d_5)}$$

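$P(d_5)$ is the same for both classes, so we can ignore it. The first product is about $0.0003$ and the second about $0.0001$, so $d_5$ is assigned to class $C$.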
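
The two unnormalized scores can be compared with exact arithmetic, using only the fractions that appear in the two expressions above:

```python
from fractions import Fraction

# Prior times the product of smoothed word likelihoods, without the shared 1 / P(d_5)
score_c = Fraction(3, 4) * Fraction(3, 7) ** 3 * Fraction(1, 14) * Fraction(1, 14)
score_not_c = Fraction(1, 4) * Fraction(2, 9) ** 3 * Fraction(2, 9) * Fraction(2, 9)

predicted = "C" if score_c > score_not_c else "not C"
print(float(score_c), float(score_not_c), predicted)
```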

## Ensemble learning

- Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the same problem, and then combined to get better results.
- types of ensemble methods:
1. bagging - decrease variance
    - building multiple models (typically of the same type) from different subsamples of the training dataset (with replacement) 
    - considers homogeneous weak learners, learns them independently from each other in parallel and combines them following some kind of deterministic averaging process
    - eg. random forest, extra trees
2. boosting - decrease bias
    - building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the chain
    - considers homogeneous weak learners, learns them sequentially in a very adaptive way (a base model depends on the previous ones) and combines them following a deterministic strategy
3. stacking  - increase predictive power
    - building multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions of the base models
    - considers heterogeneous weak learners, learns them in parallel and combines them by training a meta-model to output a prediction based on the different weak models' predictions

### Random forest vs Extra trees
- Random forest is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. The steps for random forest are:
    1. Sample $n$ cases at random with replacement to create a subset of the data, called the bag used to build a tree
    2. At each node, randomly select $d$ features without replacement
    3. Calculate the best split point for the selected features
    4. Split the node into two daughter nodes
    5. Repeat steps 1 to 4 $k$ times
    6. Aggregate the prediction by each tree to assign the class label by majority vote (classification) or average (regression)
- Extra trees (extremely randomized trees) is an ensemble learning method for classification, regression and other tasks that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Extra trees differ from random forests in how splits are chosen: instead of searching for the best threshold for each candidate feature, a random split is drawn for each of the max_features randomly selected features and the best split among those is chosen. When max_features is set to 1, this amounts to building a totally random decision tree.

### AdaBoost vs Gradient Boosting vs XGBoost
- AdaBoost is an ensemble learning method for classification and regression. It is a meta-algorithm, and can be used in conjunction with many other learning algorithms to improve their performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers.

<img src="https://i.imgur.com/uYsoCL2.png" width="400" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

- Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. Gradient boosting is a greedy algorithm and can overfit if run for too many iterations.

<img src="https://i.imgur.com/Fz2HzoG.png" width="400" style="display: block; margin-left: auto; margin-right: auto; padding-top: 10px; padding-bottom: 10px;">

- XGBoost is short for eXtreme Gradient Boosting. It is an optimized distributed gradient boosting library. It provides a parallel tree boosting (also known as GBDT, GBM) that solves many ML problems quickly and accurately.

## Clustering

- Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

### K-means clustering
- K-means clustering aims to partition $n$ observations into $k$ clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
    - K-means clustering minimizes within-cluster variances (squared Euclidean distances)
    - Given a set of observations $(x_1, x_2, ..., x_n)$, where each observation is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k$ sets $S = \{S_1, S_2, ..., S_k\}$ so as to minimize the within-cluster sum of squares (WCSS) $$ \sum_{i=1}^k \sum_{x \in S_i} ||x - \mu_i||^2 $$ where $\mu_i$ is the mean of points in $S_i$.

#### Evaluation metrics
- Distortion
    - the average of the squared distances from the cluster centers of the respective clusters. Typically, the Euclidean distance metric is used.
    - The distortion is given by $$ J = \sum_{i=1}^k \frac{1}{|S_i|} \sum_{x \in S_i} ||x - \mu_i||^2 $$
- Inertia
    - the sum of squared distances of samples to their closest cluster center.
    - The inertia is given by $$ I = \sum_{i=1}^k \sum_{x \in S_i} ||x - \mu_i||^2 $$

- Dunn index
    - the ratio between the minimum inter-cluster distance to maximum intra-cluster distance. The higher the value of Dunn index, the better the clustering.
    - The Dunn index is defined as $$ D = \frac{\min\limits_{1 \leq i < j \leq k} d(i, j)}{\max\limits_{1 \leq l \leq k} \Delta(l)} = \frac{\min(\text{inter-cluster distance})}{\max(\text{intra-cluster distance})}$$ where $k$ is the number of clusters, $d(i, j)$ is the distance between clusters $i$ and $j$, and $\Delta(l)$ is the diameter of cluster $l$.
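
The three metrics can be computed for a small two-cluster data set as follows. This is a sketch with assumptions: the data and labels are hypothetical, the inter-cluster distance is taken as the smallest distance between points of different clusters, and the diameter as the largest distance between points of the same cluster. Other definitions of both exist.

```python
import numpy as np
from itertools import combinations

# Hypothetical data with a fixed cluster assignment
data = np.array([[1, 1], [2, 2], [2, 3], [1, 2],
                 [5, 6], [5, 7], [6, 7], [6, 6]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clusters = [data[labels == c] for c in np.unique(labels)]
centers = [c.mean(axis=0) for c in clusters]

# Squared distance of every point to its own cluster center, per cluster
sq_dists = [np.sum((c - mu) ** 2, axis=1) for c, mu in zip(clusters, centers)]
inertia = sum(d.sum() for d in sq_dists)      # sum of squared distances
distortion = sum(d.mean() for d in sq_dists)  # sum of per-cluster averages, as in J above

def pairwise(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

min_inter = min(pairwise(a, b).min() for a, b in combinations(clusters, 2))
max_diam = max(pairwise(c, c).max() for c in clusters)
dunn = min_inter / max_diam

print(inertia, distortion, dunn)
```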

```python
import numpy as np

# Define the number of clusters and the number of iterations
K = 2
max_iterations = 3

# Generate some sample data
# data = np.random.randint(0, 10, size=(5, 2))
data = np.array([[1, 1], [2, 2], [2, 3], [1, 2], [5,6], [5, 7], [6, 7], [6, 6]])
print(f"Points: {data}")

# Initialize the centroids by randomly selecting K data points
# centroids = data[np.random.choice(data.shape[0], K, replace=False)]
centroids = np.array([[1, 1], [5, 6]])
centroids = centroids.astype(float)

# Iterate the k-means algorithm
for i in range(max_iterations):
    # Assign each point to the nearest centroid
    distances = np.sqrt(np.sum((data[:, np.newaxis, :] - centroids) ** 2, axis=2))
    labels = np.argmin(distances, axis=1)

    # Print the centroids and the distances at each iteration
    print(f"Iteration {i+1}:")
    print(f"Centroids: {centroids}")
    print(f"Distances: {distances}")

    # Update the centroids to the mean of the assigned points
    for k in range(K):
        centroids[k] = np.mean(data[labels == k], axis=0)
```
Output:

```
Points: [[1 1]
 [2 2]
 [2 3]
 [1 2]
 [5 6]
 [5 7]
 [6 7]
 [6 6]]
Iteration 1:
Centroids: [[1. 1.]
 [5. 6.]]
Distances: [[0.         6.40312424]
 [1.41421356 5.        ]
 [2.23606798 4.24264069]
 [1.         5.65685425]
 [6.40312424 0.        ]
 [7.21110255 1.        ]
 [7.81024968 1.41421356]
 [7.07106781 1.        ]]
Iteration 2:
Centroids: [[1.5 2. ]
 [5.5 6.5]]
Distances: [[1.11803399 7.1063352 ]
 [0.5        5.70087713]
 [1.11803399 4.94974747]
 [0.5        6.36396103]
 [5.31507291 0.70710678]
 [6.10327781 0.70710678]
 [6.72681202 0.70710678]
 [6.02079729 0.70710678]]
Iteration 3:
Centroids: [[1.5 2. ]
 [5.5 6.5]]
Distances: [[1.11803399 7.1063352 ]
 [0.5        5.70087713]
 [1.11803399 4.94974747]
 [0.5        6.36396103]
 [5.31507291 0.70710678]
 [6.10327781 0.70710678]
 [6.72681202 0.70710678]
 [6.02079729 0.70710678]]
```


### K-means++ algorithm

- K-means++ algorithm is an algorithm for choosing the initial values (or "seeds") for the k-means clustering algorithm
- The algorithm is as follows:
    1. Choose one center uniformly at random from among the data points.
    2. For each data point $x$, compute $D(x)$, the distance between $x$ and the nearest center that has already been chosen.
    3. Choose one new data point at random as a new center, using a weighted probability distribution where a point $x$ is chosen with probability proportional to $D(x)^2$.
    4. Repeat Steps 2 and 3 until $k$ centers have been chosen.
    5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
- With k-means++ seeding, the expected cost (within-cluster sum of squares) of the resulting clustering is provably within a factor of $O(\log k)$ of the optimal k-means cost.
- main advantage is that it helps avoid poor local minima, by ensuring a more spread out initial set of centers, leading to faster convergence and better final solution
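
Steps 1 to 4 of the seeding procedure can be sketched as below. The function name `kmeans_pp_init` and the sample data are assumptions; the sketch also assumes the data points are not all identical, since otherwise every $D(x)^2$ is zero and the probabilities are undefined.

```python
import numpy as np

def kmeans_pp_init(data, k, rng):
    """Pick k initial centers from the rows of data using k-means++ seeding."""
    # Step 1: first center uniformly at random
    centers = [data[rng.integers(len(data))]]
    while len(centers) < k:
        # Step 2: squared distance from each point to its nearest chosen center
        diffs = data[:, None, :] - np.array(centers)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=2), axis=1)
        # Step 3: sample the next center with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centers.append(data[rng.choice(len(data), p=probs)])
    return np.array(centers)

# Hypothetical data: two well-separated groups
data = np.array([[1, 1], [2, 2], [2, 3], [1, 2],
                 [5, 6], [5, 7], [6, 7], [6, 6]], dtype=float)
rng = np.random.default_rng(0)
seeds = kmeans_pp_init(data, 2, rng)
print(seeds)
```

An already chosen center has $D(x) = 0$ and so is never picked twice, while points far from the existing centers are the most likely to be picked next.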

### K Medoids clustering
- k-medoids chooses datapoints as centers (medoids or exemplars) and forms clusters around them. It is similar to k-means clustering, but the difference is that the center of the cluster is always a data point. In k-means, the center of the cluster is the mean of the data points in the cluster.
- The algorithm is as follows:
    1. Initialize: randomly select $k$ of the $n$ data points as the medoids
    2. Associate each data point to the closest medoid. (Thus forming $k$ clusters of data points.)
    3. For each cluster $k$ and its medoid $m$:
        - For each non-medoid data point $o$ in the cluster:
            - Swap $o$ and $m$ and compute the total cost of the configuration
        - Select the configuration with the lowest cost.
    4. Repeat Steps 2 and 3 until there is no change in the medoid.
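
A compact sketch of these steps is given below. It is one possible reading of the procedure rather than a definitive implementation: on every pass it tries each swap of a medoid with a non-medoid point of its own cluster and keeps the single best swap, stopping when no swap lowers the total cost. The names `total_cost` and `k_medoids` and the sample data are assumptions.

```python
import numpy as np

def total_cost(dist, medoids):
    """Sum of distances from every point to its closest medoid."""
    return dist[:, medoids].min(axis=1).sum()

def k_medoids(data, k, rng, max_iter=100):
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    # Step 1: random initial medoids (stored as indices into data)
    medoids = [int(i) for i in rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest medoid
        labels = dist[:, medoids].argmin(axis=1)
        best, best_cost = medoids, total_cost(dist, medoids)
        # Step 3: try swapping each medoid with each non-medoid point of its cluster
        for j in range(k):
            for o in np.flatnonzero(labels == j):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[j] = int(o)
                cost = total_cost(dist, candidate)
                if cost < best_cost:
                    best, best_cost = candidate, cost
        # Step 4: stop when no swap improves the cost
        if best is medoids:
            break
        medoids = best
    return medoids, dist[:, medoids].argmin(axis=1)

# Hypothetical data: two well-separated groups
data = np.array([[1, 1], [2, 2], [2, 3], [1, 2],
                 [5, 6], [5, 7], [6, 7], [6, 6]], dtype=float)
medoids, labels = k_medoids(data, 2, np.random.default_rng(0))
print(data[medoids], labels)
```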

### Fuzzy C-means clustering
- Fuzzy C-means clustering is a method of clustering which allows one piece of data to belong to two or more clusters
- The algorithm is as follows:
    1. Specify the number of clusters $c$ and the fuzzy parameter $m$.
    2. Initialize the cluster centers randomly, $v_j \in \mathbb{R}^d$ for $j = 1, 2, ..., c$.
    3. For each data point, compute the degree of membership of that point to each cluster center: $$\mu_{ij} = \frac{1}{\sum\limits_{k=1}^c \left( \frac{d_{ij}}{d_{ik}} \right)^{\frac{2}{m-1}}}$$ where $d_{ij}$ is the distance between the $i^{th}$ data point and the $j^{th}$ cluster center.
    4. Recompute the cluster centers: $$v_j = \frac{\sum\limits_{i=1}^n \mu_{ij}^m x_i}{\sum\limits_{i=1}^n \mu_{ij}^m}$$
    5. Repeat Steps 3 and 4 until the membership coefficients $\mu_{ij}$ do not change.
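
The loop can be sketched in NumPy as follows. Several details are assumptions of this sketch: the initial centers are random data points, distances are clipped away from zero so that a point sitting on a center does not cause a division by zero, and iteration stops when the largest change in membership falls below a tolerance.

```python
import numpy as np

def fuzzy_c_means(data, c, m=2.0, rng=None, max_iter=100, tol=1e-6):
    rng = np.random.default_rng() if rng is None else rng
    # Step 2: random initial centers (here: c distinct data points)
    centers = data[rng.choice(len(data), size=c, replace=False)].astype(float)
    u = np.zeros((len(data), c))
    for _ in range(max_iter):
        # d[i, j] = distance between data point i and center j
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)
        # Step 3: mu_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
        ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        u_new = 1.0 / ratio.sum(axis=2)
        # Step 4: centers are membership-weighted means of the data
        um = u_new ** m
        centers = um.T @ data / um.sum(axis=0)[:, None]
        # Step 5: stop once the memberships barely change
        converged = np.abs(u_new - u).max() < tol
        u = u_new
        if converged:
            break
    return centers, u

# Hypothetical data: two well-separated groups
data = np.array([[1, 1], [2, 2], [2, 3], [1, 2],
                 [5, 6], [5, 7], [6, 7], [6, 6]], dtype=float)
centers, u = fuzzy_c_means(data, c=2, rng=np.random.default_rng(0))
print(centers)
print(u.round(3))
```

Each row of `u` holds the memberships of one data point across the clusters and sums to 1.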


## Expectation Maximization (EM) algorithm
- Expectation Maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between:
    - Expectation step (E-step): create a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters
        $$ Q(\theta | \theta^{(t)}) = E_{Z|X, \theta^{(t)}} [\log L(\theta; X, Z)] $$
    - Maximization step (M-step): compute parameters maximizing the expected log-likelihood found on the E-step
        $$ \theta^{(t+1)} = \arg\max\limits_{\theta} Q(\theta | \theta^{(t)}) $$
- Each EM iteration never decreases the likelihood, and the algorithm typically converges to a local maximum (more precisely, a stationary point), not necessarily to the global maximum of the likelihood. In practice, EM can be susceptible to getting stuck in local maxima, so multiple restarts are used. The EM algorithm can also be generalized to maximize incomplete-data likelihood functions.
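
As an illustration, here is a minimal sketch of EM for a two-component one-dimensional Gaussian mixture, where the latent variable is the component each sample came from. The model choice, the initialization, the synthetic data and the name `em_gmm_1d` are all assumptions made for this example.

```python
import numpy as np

def em_gmm_1d(x, n_iter=200):
    """EM for a 2-component 1D Gaussian mixture; returns parameters and the log-likelihood trace."""
    # Crude initialization (an assumption): means at the extremes, shared variance
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample under current parameters
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        log_liks.append(np.log(dens.sum(axis=1)).sum())
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: parameters that maximize the expected complete-data log-likelihood
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi, log_liks

# Synthetic data: two Gaussians centered at 0 and 5
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
mu, var, pi, log_liks = em_gmm_1d(x)
print(mu, var, pi)
```

The recorded log-likelihood values should never go down from one iteration to the next, which is a quick sanity check for any EM implementation.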

## Model evaluation / Comparison

### Confidence Interval for Accuracy
$$ CI = \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} $$
where $\hat{p}$ is the observed accuracy, $z_{\alpha/2}$ is the critical value of the normal distribution at $\alpha/2$ (e.g. for 95% confidence interval, $\alpha = 0.05$ and $z_{\alpha/2} = 1.96$), and $n$ is the number of test instances.
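
A short sketch of the formula, using a hypothetical classifier with 85% observed accuracy on 100 test instances:

```python
import math

def accuracy_confidence_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for an observed accuracy p_hat on n test instances."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical numbers: 85 of 100 test instances classified correctly, 95% confidence
low, high = accuracy_confidence_interval(0.85, 100)
print(round(low, 3), round(high, 3))
```

Because the half-width shrinks with $\sqrt{n}$, quadrupling the test set halves the width of the interval.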