# Clustering Algorithms

- 13515089 - Vincent Hendryanto Halim
- 13515099 - Mikhael Artur Darmakesuma
- 13515107 - Roland Hartanto

## 1. Brief Description

### 1.1 K-Means
K-Means looks for centroids that serve as the center points of the clusters. Each data point is assigned to the cluster whose centroid is nearest. After the assignment, each centroid is recomputed as the mean of its members, and the data points are reassigned to the updated centroids. The algorithm stops when the cluster assignments no longer change.

```javascript
// K-Means pseudocode
initialize_centroid()
cluster_new = pair_data_with_centroid()
do    
    cluster_old = cluster_new
    find_new_centroid()
    cluster_new = pair_data_with_centroid()    
while(cluster_new != cluster_old)

```
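
The pseudocode above can be turned into a short runnable sketch. The version below is a minimal NumPy illustration rather than the implementation from Section 2: the function name `k_means`, its parameters, and the toy data are assumptions made for this example, and an empty cluster simply keeps its previous centroid.

```python
import numpy as np


def k_means(data, initial_centroids, max_iteration=300):
    """Lloyd-style K-Means on an (n_samples, n_features) array."""
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    labels = None
    for _ in range(max_iteration):
        # Euclidean distance from every point to every centroid: shape (n, k)
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop when no point changes cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid
        for k in range(len(centroids)):
            members = data[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return labels, centroids


# Hypothetical toy data: two well-separated groups
points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
labels, centroids = k_means(points, initial_centroids=[[0.0, 0.0], [6.0, 6.0]])
print(labels)
print(centroids)
```

On this toy data the first two points end up in cluster 0 and the last two in cluster 1, and the loop stops in the second pass because the assignment no longer changes.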

### 1.2 K-Medoids
K-Medoids uses medoids, which are actual data points, as cluster centers. Each data point is assigned to the cluster whose medoid is nearest. After the assignment, the algorithm tries to move each medoid to another data point in its cluster. The absolute error of the resulting clustering is then computed and compared with the error of the current clustering:

$E = \sum_{j=1}^{k}\sum_{p\in C_j}\left|p-o_j\right|$

where $k$ is the number of clusters, $p$ is a data point in cluster $C_j$, and $o_j$ is the medoid of cluster $C_j$.

If the error decreases, the medoid moves to the candidate data point.
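
As a worked example of the absolute error $E$, the sketch below evaluates it for two candidate medoid sets on a made-up dataset. The helper name `absolute_error` and the data are assumptions for this illustration; the distance is the sum of absolute attribute differences, matching the formula above.

```python
import numpy as np


def absolute_error(data, medoid_indices):
    """Sum of Manhattan distances from each point to its nearest medoid."""
    data = np.asarray(data, dtype=float)
    medoids = data[list(medoid_indices)]
    # distances[i, j] = |p_i - o_j| summed over the attributes
    distances = np.abs(data[:, None, :] - medoids[None, :, :]).sum(axis=2)
    return float(distances.min(axis=1).sum())


# Hypothetical toy data
points = [[1, 1], [2, 1], [4, 1], [8, 8], [9, 8]]
current_error = absolute_error(points, [2, 3])  # medoids: (4, 1) and (8, 8)
swap_error = absolute_error(points, [1, 3])     # try (2, 1) instead of (4, 1)
print(current_error, swap_error)  # 6.0 4.0
```

Because the error drops from 6 to 4, this swap would be accepted and the first medoid would move to the point (2, 1).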

```js
// K-Medoids pseudocode
initialize_medoid()
cluster_new = pair_data_with_medoid()
do    
    cluster_old = cluster_new
    old_error = count_error(cluster_old)
    for each (swappable_pair_of_medoid)
        cluster_swap = swap_cluster()
        swap_error = count_error(cluster_swap)
        if(old_error > swap_error)
            cluster_new = cluster_swap
while(cluster_new != cluster_old)

```
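
A runnable sketch of the swap loop is shown below. It is an illustration under stated assumptions, not the class from Section 2: the names `k_medoids` and `total_error` are made up for this example, Manhattan distance is used, and every non-medoid point is tried as a swap candidate (a PAM-style search), with only the best improving swap applied per iteration.

```python
import numpy as np


def total_error(distance_matrix, medoids):
    """Absolute error: each point contributes its distance to the nearest medoid."""
    return float(distance_matrix[:, medoids].min(axis=1).sum())


def k_medoids(data, initial_medoids, max_iteration=300):
    data = np.asarray(data, dtype=float)
    # Pairwise Manhattan distances, shape (n, n)
    distance_matrix = np.abs(data[:, None, :] - data[None, :, :]).sum(axis=2)
    medoids = list(initial_medoids)
    error = total_error(distance_matrix, medoids)
    for _ in range(max_iteration):
        best_error, best_medoids = error, None
        # Try replacing each medoid with each non-medoid point
        for position in range(len(medoids)):
            for candidate in range(len(data)):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[position] = candidate
                trial_error = total_error(distance_matrix, trial)
                if trial_error < best_error:
                    best_error, best_medoids = trial_error, trial
        if best_medoids is None:  # no swap lowers the error
            break
        medoids, error = best_medoids, best_error
    labels = distance_matrix[:, medoids].argmin(axis=1)
    return labels, medoids, error


# Hypothetical toy data; medoids are given as row indices
points = [[1, 1], [2, 1], [4, 1], [8, 8], [9, 8]]
labels, medoids, error = k_medoids(points, initial_medoids=[2, 4])
print(labels, medoids, error)
```

Working on a copy of the medoid list for every trial keeps a rejected swap from altering the current medoids.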

### 1.3 Agglomerative
The agglomerative algorithm groups data points based on how close they are to one another.

First, a distance matrix is computed for the $N$ data points. The two closest points are then found and merged into a cluster:

$D(C_i,C_j)= \min_{1\leq m,l\leq N,m\neq l}  D(C_m, C_l)$

The distance matrix is then updated to hold the distances between the new cluster and the remaining data. The process repeats until the desired number of clusters is reached or all data form one large cluster.

```js
// Agglomerative pseudocode
dist_matrix = count_distance_matrix()
for [n_data..n_cluster]
    a,b = find_closest_pair(dist_matrix)
    merge_cluster(a, b)
    dist_matrix = update_dist_matrix(dist_matrix, a, b)

```
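
The merge loop can be sketched in runnable form as follows. This is a deliberately simple illustration, not the class from Section 2: the function name `single_link_agglomerative` and the toy data are assumptions, only single linkage with Manhattan distance is covered, and cluster distances are recomputed from the point-to-point matrix instead of being updated incrementally.

```python
import numpy as np


def single_link_agglomerative(data, n_cluster):
    data = np.asarray(data, dtype=float)
    # Start with every point in its own cluster
    clusters = [[i] for i in range(len(data))]
    # Pairwise Manhattan distances between points
    point_dist = np.abs(data[:, None, :] - data[None, :, :]).sum(axis=2)
    while len(clusters) > n_cluster:
        best = (float("inf"), None, None)
        # Single link: cluster distance = smallest point-to-point distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = point_dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a (b > a, so popping b keeps a valid)
        clusters[a] = clusters[a] + clusters.pop(b)
    labels = np.empty(len(data), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Hypothetical toy data: two tight pairs and one isolated point
points = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
labels = single_link_agglomerative(points, n_cluster=3)
print(labels)
```

Recomputing the closest pair from scratch in every round is slower than updating a shrinking distance matrix, but it avoids index bookkeeping and makes the linkage rule easy to read.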

### 1.4 DBSCAN
DBSCAN groups points using two parameters, $\epsilon$ and $minpts$. A core point is a point that has at least $minpts$ neighbors within distance $\epsilon$. Points that are connected to other points in this way form a cluster.

```js
// DBSCAN pseudocode
dist_matrix = count_distance_matrix()
for each data
    n_neighbors = count_neighbors(eps)
    if (n_neighbors >= min_pts)
        give_label_to_data()
        give_label_to_neighbors()
        adjust_density_reachable_data_label()
```
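
For comparison, the sketch below follows the textbook formulation of DBSCAN, in which a cluster grows outward from a core point through a queue of neighbors. It is not the implementation from Section 2: the function name `dbscan`, the `NOISE` constant, and the toy data are assumptions for this example, and a point counts itself among its neighbors.

```python
import numpy as np

NOISE = -1


def dbscan(data, eps, min_pts):
    data = np.asarray(data, dtype=float)
    n = len(data)
    # Pairwise Euclidean distances
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    # neighbors[i] includes i itself, so min_pts counts the point too
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, NOISE)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue  # not a core point; stays noise unless reached later
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])  # j is core: keep expanding
        cluster += 1
    return labels


# Hypothetical toy data: two dense groups and one isolated point
points = [[0, 0], [0, 0.3], [0.3, 0], [5, 5], [5, 5.3], [5.3, 5], [10, 10]]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

The isolated point has no neighbor within `eps`, so it is never reached from a core point and keeps the label -1.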

## 2. Source Code and Explanation

### 2.1 K-Means

In [None]:
import math
import random
import sys

from pandas import DataFrame

# Classifier is the project's base class; its definition is not shown in this report.

class KMeans(Classifier):
    '''
    Parameters
    ----------
    
    num_clusters : int, number of centroid/means/cluster
    init : {'random', 'np.ndarray or list(user defined)'}
           initial centroids
    max_iteration : maximum iteration limit

    Attributes
    ----------
    means : cluster centers/centroids
    labels : labels of each instance
    num_clusters : number of cluster
    init : initial means
    max_iteration : maximum iteration limit
    
    '''
    def __init__(self, num_clusters=3, init='random', max_iteration=300,  **kwargs):
        self.num_clusters = num_clusters
        self.init = init
        self.max_iteration = max_iteration

    def fit(self, X : DataFrame):
        '''
        Cluster the rows of X; the result is stored in self.labels and self.means.
        '''
        num_of_instances = X.shape[0]
        
        if self.num_clusters > num_of_instances:
            raise ValueError(
                "num_of_instances = %d must be at least num_clusters = %d" % (num_of_instances, self.num_clusters))
        else:
            if (self.num_clusters == 1):
                self.labels = [0 for i in range(0, num_of_instances)]
                # Use self.labels here; a bare `labels` is undefined in this scope
                self.means = self.update_means(self.labels, X)
            else:
                init_type = type(self.init)
                if (init_type == str):
                    if (self.init == 'random'):
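                        # Sample distinct row indices so the same row is never picked twice
                        initial_idxs = random.sample(range(num_of_instances), self.num_clusters)
                        means = [list(X.iloc[idx]) for idx in initial_idxs]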
                        self.k_means(means, X)
                    else:
                        raise ValueError("Init type 'random' or ndarray expected, found %s" % (self.init))
                else:
                    if (len(self.init) == self.num_clusters):
                        self.k_means(self.init, X)
                    else:
                        raise ValueError(
                            "number of initial means = %d must be equal to num_clusters = %d" % (len(self.init), self.num_clusters))

    def predict(self, X):
        distances = self.count_distances(self.means, X)
        return self.assign_labels(distances)

    def fit_predict(self, X):
        # fit() takes only X; clustering is unsupervised, so no target is passed
        self.fit(X)
        return self.labels
    
    def k_means(self, initial_means, data):
        means = initial_means
        prev_means = [[0 for j in range(0, len(means[0]))] for i in range(0, len(means))]
        iteration = 0
        while ((not self.is_means_equal(means, prev_means)) and (iteration < self.max_iteration)):
            distances = self.count_distances(means, data)
            # print(means)
            labels = self.assign_labels(distances)
            # print(labels)
            prev_means = means
            means = self.update_means(labels, data)
            iteration += 1
        
        self.labels = labels
        self.means = means

    def count_distances(self, means, data):
        num_of_instances = data.shape[0]
        distances = [[-1 for j in range(0, len(means))] for i in range(0, num_of_instances)]
        #distances = [[-1 for j in range(0, num_of_instances)] for i in range(0, len(means))]
        
        for instance_idx in range(0, num_of_instances):
            instance_data = [x for x in data.iloc[instance_idx]]
            # print(instance_data)
            for means_idx in range(0, len(means)):
                distances[instance_idx][means_idx] = self.calculate_euclidean_dist(means[means_idx], instance_data)
        
        return distances

    def calculate_euclidean_dist(self, attribute1, attribute2):
        squared_distance = 0

        if (len(attribute1) == len(attribute2)):
            for i in range(0, len(attribute1)):
                squared_distance += pow((attribute1[i] - attribute2[i]), 2)
        else:
            raise ValueError(
                "number of attributes must be equal, attribute1 = %d, attribute2 = %d" % (len(attribute1), len(attribute2)))
        
        return math.sqrt(squared_distance)
    
    def is_means_equal(self, means, prev_means):
        is_equal = True
        num_of_attributes = len(means[0])

        for i in range(0, len(means)):
            for j in range(0, num_of_attributes):
                if(not (means[i][j] == prev_means[i][j])):
                    is_equal = False
                    break
            if (not is_equal):
                break
        
        return is_equal
    
    def assign_labels(self, distances):
        labels = [-1 for i in range(0, len(distances))]
        for i in range(0, len(distances)):
            idx, val = self.min_val(distances[i])
            labels[i] = idx
        
        return labels
        
    def min_val(self, list_elements):
        index = -1
        value = sys.maxsize
        for i in range(0, len(list_elements)):
            if (list_elements[i] < value):
                index = i
                value = list_elements[i]

        return index, value
    
    def update_means(self, labels, data):
        means = [[0 for j in range(0, data.shape[1])] for i in range(0, self.num_clusters)]
        sums = [[0 for j in range(0, data.shape[1])] for i in range(0, self.num_clusters)]
        n_cluster_elmt = [0 for i in range(0, self.num_clusters)]

        for j in range(0, data.shape[1]): # access by column
            for i in range(0, len(labels)):
                for k in range(0, self.num_clusters):
                    if (k == labels[i]):
                        sums[k][j] += data.iloc[i, j]  # positional access, independent of index/column labels
                        if (j == 0):
                            n_cluster_elmt[k] += 1
        
        for cluster_idx in range(0, self.num_clusters):
            for attr_idx in range(0, data.shape[1]):
                means[cluster_idx][attr_idx] = sums[cluster_idx][attr_idx] / n_cluster_elmt[cluster_idx]

        return means


### 2.2 K-Medoids

In [None]:
import random
import sys

from pandas import DataFrame

# Classifier is the project's base class; its definition is not shown in this report.

class KMedoids(Classifier):
    '''
    Parameters
    ----------
    
    num_clusters : int, number of centroid/means/cluster
    init : {'random', 'np.ndarray or list(user defined)'}
           initial medoids
    max_iteration : maximum iteration limit
    swap_medoid : {'optimized', 'random'}
                  swap medoid method, optimized = search for the minimum possible swap, 
                  random = random swap

    Attributes
    ----------
    medoids : cluster centers
    labels : labels of each instance
    num_clusters : number of cluster
    init : initial medoids (row indices into the training data)
    max_iteration : maximum iteration limit
    
    '''
    def __init__(self, num_clusters=3, init='random', max_iteration=300, swap_medoid='optimized', **kwargs):
        self.num_clusters = num_clusters
        self.init = init
        self.max_iteration = max_iteration
        self.swap_medoid = swap_medoid

    def fit(self, X : DataFrame):
        self.train_data = X
        num_of_instances = X.shape[0]
        
        if self.num_clusters > num_of_instances:
            raise ValueError(
                "num_of_instances = %d must be larger than num_clusters = %d" % (num_of_instances, self.num_clusters))
        else:
            if (self.num_clusters == 1):
                self.labels = [0 for i in range(0, num_of_instances)]
                # self.medoids = self.update_medoids(labels, X)
            else:
                init_type = type(self.init)
                if (init_type == str):
                    if (self.init == 'random'):
                        # Sample distinct row indices so the same instance is never used as two medoids
                        medoids = random.sample(range(num_of_instances), self.num_clusters)
                        self.k_medoids(medoids, X)
                    else:
                        raise ValueError("Init type 'random' or ndarray expected, found %s" % (self.init))
                else:
                    if (len(self.init) == self.num_clusters):
                        # init : list of index of selected initial medoids
                        self.k_medoids(self.init, X)
                    else:
                        raise ValueError(
                            "number of initial means = %d must be equal to num_clusters = %d" % (len(self.init), self.num_clusters))

    def predict(self, X):
        distances = self.count_distances(self.medoids, X)
        return self.assign_labels(distances)

    def fit_predict(self, X):
        self.fit(X)
        return self.labels
    
    def k_medoids(self, initial_medoids, data):
        medoids = initial_medoids
        iteration = 0
        while (iteration < self.max_iteration):
            # print(medoids)
            distances = self.count_distances(medoids, data)
            labels = self.assign_labels(distances)
            error = self.calculate_error(distances)
            # print(error)
            
            new_distances = []
            new_medoids = []
            new_error = 0
            if (self.swap_medoid == 'optimized'):
                # error of the swap for each instance: [cluster to swap, error]
                swap_errors = [[0, 0] for i in range(0, data.shape[0])]
                # try swaps for each cluster i
                for i in range(0, self.num_clusters):
                    swap_candidate_idxs = []
                    for j in range(0, len(labels)):
                        if (labels[j] == i):
                            swap_candidate_idxs.append(j)
                    
                    for j in range(0, len(swap_candidate_idxs)):
                        new_medoids = []
                        new_medoids = [x for x in medoids]
                        new_medoids[i] = swap_candidate_idxs[j]
                        new_distances = self.count_distances(new_medoids, data)
                        new_error = self.calculate_error(new_distances)
                        swap_errors[swap_candidate_idxs[j]] = [i, new_error]
                
                min_error = sys.maxsize
                min_swap_idx = -1
                medoid_to_swap = -1
                for i in range(0, len(swap_errors)):
                    if (swap_errors[i][1] < min_error):
                        min_error = swap_errors[i][1]
                        min_swap_idx = i
                        medoid_to_swap = swap_errors[i][0]
                # print(error, min_error)
                if(min_error - error < 0):
                    medoids[medoid_to_swap] = min_swap_idx
                    # print(self.calculate_error(self.count_distances(medoids, data)))
                else:
                    break
            elif (self.swap_medoid == 'random'):
                temp_medoid_to_swap = random.randrange(0, self.num_clusters)
                swap_candidate_idxs = []
                for i in range(0, len(labels)):
                    if (labels[i] == temp_medoid_to_swap):
                        swap_candidate_idxs.append(i)
                new_medoids = list(medoids)  # copy, so a rejected swap leaves medoids unchanged
                new_medoids[temp_medoid_to_swap] = random.choice(swap_candidate_idxs)
                new_distances = self.count_distances(new_medoids, data)
                new_error = self.calculate_error(new_distances)
                if (new_error - error < 0):
                    medoids = new_medoids
                else:
                    break
            else:
                raise ValueError(
                    "swap_medoid must be equal to 'optimized' or 'random', found %s" % (self.swap_medoid))

            iteration += 1
        
        self.labels = labels
        self.medoids = medoids

    def count_distances(self, medoids, data):
        num_of_instances = data.shape[0]
        distances = [[-1 for j in range(0, len(medoids))] for i in range(0, num_of_instances)]
        
        for instance_idx in range(0, num_of_instances):
            instance_data = [x for x in data.iloc[instance_idx]]
            for medoid_idx in range(0, len(medoids)):
                medoid_instance = [x for x in self.train_data.iloc[medoids[medoid_idx]]]
                distances[instance_idx][medoid_idx] = self.calculate_absolute_dist(medoid_instance, instance_data)
        
        return distances
    
    def calculate_absolute_dist(self, attribute1, attribute2):
        absolute_distance = 0

        if (len(attribute1) == len(attribute2)):
            for i in range(0, len(attribute1)):
                absolute_distance += abs(attribute1[i] - attribute2[i])
        else:
            raise ValueError(
                "number of attributes must be equal, attribute1 = %d, attribute2 = %d" % (len(attribute1), len(attribute2)))
        
        return absolute_distance
    
    def assign_labels(self, distances):
        labels = [-1 for i in range(0, len(distances))]
        for i in range(0, len(distances)):
            idx, val = self.min_val(distances[i])
            labels[i] = idx
        
        return labels
    
    def calculate_error(self, distances):
        total_error = 0
        for i in range(0, len(distances)):
            idx, val = self.min_val(distances[i])
            total_error += val
        
        return total_error

    def min_val(self, list_elements):
        index = -1
        value = sys.maxsize
        for i in range(0, len(list_elements)):
            if (list_elements[i] < value):
                index = i
                value = list_elements[i]

        return index, value

### 2.3 Agglomerative

In [None]:
import collections
import math

import numpy as np
from pandas import DataFrame

# Classifier is the project's base class; its definition is not shown in this report.

class Agglomerative(Classifier):
    '''
    Agglomerative Clustering class for clustering

    Parameters
    ----------
    link = ["single", "complete", "average", "average-group"]
        Link type for distance between cluster
    distance = ["manhattan", "euclidean"]
        Distance for distance matrix

    Examples
    --------
    
    Fit and predict in separate process
    >>> classifier = Agglomerative()
    >>> classifier.fit(X)
    >>> classifier.predict()

    Using fit_predict
    >>> classifier = Agglomerative()
    >>> classifier.fit_predict(X)
    
    Take note that X is a dataframe
    '''
    
    def __init__(self, link="single", n_cluster=2, distance="manhattan", **kwargs):
        link_type = ["single", "complete", "average", "average-group"]
        dist_type = ["euclidean", "manhattan"]
        if(link in link_type):
            self.link_type = link
        else:
            raise ValueError("Link type %s not supported" % link)
        if(distance in dist_type):
            self.distance = distance
        else:
            raise ValueError("Distance type %s not supported" % distance)
        if(n_cluster > 0):
            self.n_cluster = n_cluster
        else:
            raise ValueError("n_cluster must be > 0")

    def fit(self, X : DataFrame):
        self.__data = X.copy()
        n_row, n_col = X.shape
        self.__n_elmt = n_row
        
        self.__createDistMatrix(self.__data)
        for i in range(self.__n_elmt-self.n_cluster):
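            # Select the smallest distance index
            min_idx = np.argmin(self.dist_matrix)
            # The matrix shrinks after every merge, so convert with its current size
            n_current = self.dist_matrix.shape[0]
            min_row, min_col = self.__idxToRowCol(min_idx, n_current, n_current)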
            
            # Generate new distance matrix
            combined_node = self.__generateNode(min_row, min_col)
            self.__removeNode(min_row, min_col)
            self.__mergeNode(combined_node)

    def predict(self):       
        flatten_cluster = self.__flattenCluster()

        result_array = np.array([-1 for n in range(self.__n_elmt)])
        for cluster in range(len(flatten_cluster)):
            for i in flatten_cluster[cluster]:
                result_array[i] = cluster

        return result_array

    def fit_predict(self, X : DataFrame):
        self.fit(X)
        return self.predict() 

    def __flattenCluster(self):
        return [self.__flatten(f) for f in self.cluster_tree]

    def __generateNode(self, a, b):
        if(self.link_type == "single"):
            row_a = self.dist_matrix[a,]
            row_b = self.dist_matrix[b,]
            new_row = np.minimum(row_a, row_b)
            new_row = np.delete(new_row, a)
            # Handle row shift
            if(a<b):
                new_row = np.delete(new_row, b-1)
            else:
                new_row = np.delete(new_row, b)
            new_row = np.append(new_row, float('inf'))
        elif(self.link_type == "complete"):
            row_a = self.dist_matrix[a,]
            row_b = self.dist_matrix[b,]
            new_row = np.maximum(row_a, row_b)
            new_row = np.delete(new_row, a)
            # Handle row shift
            if(a<b):
                new_row = np.delete(new_row, b-1)
            else:
                new_row = np.delete(new_row, b)
            new_row = np.append(new_row, float('inf'))
        elif(self.link_type == "average"):
            # Get list of element for cluster and non cluster
            cluster_el = [self.cluster_tree[a], self.cluster_tree[b]]
            opposite_cluster = []
            
            for idx in range(len(self.cluster_tree)):
                if(idx != a and idx != b):
                    opposite_cluster.append(self.cluster_tree[idx])
            
            cluster_el = self.__flatten(cluster_el)

            # Count distance between cluster and other cluster:
            # mean of all pairwise distances between the two clusters
            new_row = np.array([])
            for opposite_cluster_el in opposite_cluster:
                flat_op_elmt = self.__flatten(opposite_cluster_el)
                sum_dist = 0
                for op_el in flat_op_elmt:
                    for el in cluster_el:
                        sum_dist += self.dist_matrix_orig[el,op_el]

                new_row = np.append(new_row, [sum_dist/(len(cluster_el)*len(flat_op_elmt))])
            new_row = np.append(new_row, float('inf'))

        elif(self.link_type == "average-group"):
            # Get list of element for cluster and non cluster
            cluster_el = [self.cluster_tree[a], self.cluster_tree[b]]
            opposite_cluster = []
            
            for idx in range(len(self.cluster_tree)):
                if(idx != a and idx != b):
                    opposite_cluster.append(self.cluster_tree[idx])
            
            cluster_el = self.__flatten(cluster_el)
            # Get cluster center
            cluster_center = self.__data.iloc[cluster_el].mean()            

            new_row = np.array([])            
            # Count distance between cluster center and other cluster center
            for opposite_cluster_el in opposite_cluster:
                flat_op_elmt = self.__flatten(opposite_cluster_el)
                op_cluster_center = self.__data.iloc[flat_op_elmt].mean()
                dist = math.fabs(op_cluster_center.subtract(cluster_center,fill_value=0).abs().sum()) 
                new_row = np.append(new_row, dist)
            new_row = np.append(new_row, float('inf'))
        return new_row

    def __removeNode(self, a, b):
        a_val = self.cluster_tree.pop(a)
        self.dist_matrix = np.delete(self.dist_matrix,a,0)
        self.dist_matrix = np.delete(self.dist_matrix,a,1)
        if(a<b):
            self.dist_matrix = np.delete(self.dist_matrix,b-1,0)
            self.dist_matrix = np.delete(self.dist_matrix,b-1,1)
            b_val = self.cluster_tree.pop(b-1)
        else:
            self.dist_matrix = np.delete(self.dist_matrix,b,0)
            self.dist_matrix = np.delete(self.dist_matrix,b,1)
            b_val = self.cluster_tree.pop(b)
        self.cluster_tree.append([a_val,b_val])

    def __mergeNode(self, node):
        n_row, n_col = self.dist_matrix.shape        
        
        b = np.zeros((n_row+1, n_col+1))
        b[:-1,:-1] = self.dist_matrix
        self.dist_matrix = b

        for i in range(n_row+1):
            self.dist_matrix[i,n_col] = node[i]
            self.dist_matrix[n_col,i] = node[i]

    def __createDistMatrix(self, X : DataFrame):
        self.dist_matrix = np.ndarray(shape=(self.__n_elmt, self.__n_elmt))
        self.cluster_tree = [n for n in range(self.__n_elmt)]
        for index, row in X.iterrows():
            for index_2, row_2 in X.iterrows():
                if(index_2 == index):
                    distance = float('inf')
                else:
                    if(self.distance == "manhattan"):
                        distance = math.fabs(row_2.subtract(row,fill_value=0).abs().sum())
                    elif(self.distance=="euclidean"):
                        distance = (row_2.subtract(row,fill_value=0).pow(2).sum()) ** 0.5
                self.dist_matrix[index, index_2] = distance  # ndarray.itemset is not available in NumPy 2.x
        
        # Create copy for average - averagegroup
        self.dist_matrix_orig = self.dist_matrix.copy()

    def __idxToRowCol(self, idx, n_row, n_col):
        return (int(idx/n_row),idx%n_col)

    def __flatten(self, x):
        if isinstance(x, list):  # cluster nodes are nested lists of row indices
            return [a for i in x for a in self.__flatten(i)]
        else:
            return [x]

### 2.4 DBSCAN

In [None]:
import math

import numpy as np
from pandas import DataFrame

# Classifier is the project's base class; its definition is not shown in this report.

class DBSCAN(Classifier):
    '''
    DBSCAN class for clustering

    Parameters
    ----------

    eps : float, maximum distance between two points for them to count as neighbors
    min_pts: int, minimal neighbors within eps distance, including itself
    distance: {"euclidean", "manhattan"}, distance measuring algorithm

    Attributes
    ----------
    eps : float, maximum distance between two points for them to count as neighbors
    min_pts: int, minimal neighbors within eps distance, including itself
    distance: {"euclidean", "manhattan"}, distance measuring algorithm
    __data: DataFrame, data to cluster
    __n_elmt: int, number of data
    distance_matrix: ndarray, distance matrix
    labels : list of int, label of each instance (-1 = not assigned to any cluster)


    '''

    def __init__(self, eps=0.5, min_pts=5, distance="euclidean", **kwargs):
        self.eps = eps
        self.min_pts = min_pts
        distance_type = ["euclidean", "manhattan"]
        if (distance in distance_type):
            self.distance = distance
        else:
            raise ValueError("Distance type %s not supported" % distance)

    def fit(self, X: DataFrame):
        # Get data's information
        self.__data = X.copy()
        n_row, n_col = X.shape
        self.__n_elmt = n_row

        # Create distance matrix
        self.__createDistanceMatrix(self.__data)

        self.labels = [-1 for i in range(self.__n_elmt)]
        for i in range (self.__n_elmt):
            # Count neighbors with distance <= eps, including itself
            n_neighbor = 1 # Count itself
            neighbor_labels = []
            for j in range (self.__n_elmt):
                if (i != j):
                    if (self.distance_matrix[i, j] <= self.eps):
                        n_neighbor += 1
                        # Check for density reachable data's label
                        if ((self.labels[j] != -1) and not(self.labels[j] in neighbor_labels)):
                            neighbor_labels.append(self.labels[j])

            # Data has at least min_pts neighbors, so it is a core point
            if (n_neighbor >= self.min_pts):
                # Initiate label for current data
                if (neighbor_labels):
                    self.labels[i] = min(neighbor_labels)
                    # Change density reachable data's label to match this data's label
                    self.__replace_label(neighbor_labels, self.labels[i])
                else:
                    self.labels[i] = self.__generate_new_label()
                # Change neighbor's label
                for j in range(self.__n_elmt):
                    if (i != j):
                        if (self.distance_matrix[i, j] <= self.eps):
                            self.labels[j] = self.labels[i]

    def fit_predict(self, X: DataFrame):
        self.fit (X)
        return self.labels

    def __createDistanceMatrix(self, X: DataFrame):
        self.distance_matrix = np.ndarray(shape=(self.__n_elmt, self.__n_elmt))
        self.cluster_tree = [n for n in range(self.__n_elmt)]
        for index, row in X.iterrows():
            for index_2, row_2 in X.iterrows():
                if (index_2 == index):
                    distance = float('inf')
                else:
                    if (self.distance == "manhattan"):
                        distance = math.fabs(row_2.subtract(row, fill_value=0).abs().sum())
                    elif (self.distance == "euclidean"):
                        distance = (row_2.subtract(row, fill_value=0).pow(2).sum()) ** 0.5
                self.distance_matrix[index, index_2] = distance  # ndarray.itemset is not available in NumPy 2.x

    def __generate_new_label(self):
        label = 0
        while (label in self.labels):
            label += 1
        return label

    def __replace_label(self, old_labels, new_label):
        for i in range (self.__n_elmt):
            if (self.labels[i] in old_labels):
                self.labels[i] = new_label

## 3. Clustering Results

In [52]:
from sklearn import datasets
from clustering import kmeans as kmean
from clustering import kmedoids as kmed
from clustering import agglomerative as agglo
from clustering import dbscan as dbsc
import numpy as np
import pandas as pd

In [53]:
iris_data = pd.DataFrame(datasets.load_iris().data)

In [54]:
# K-Means
# kmeans = kmean.KMeans(num_clusters=3, init='random')
# change init if needed
kmeans = kmean.KMeans(num_clusters=3, init=[[5.1, 3.5, 1.4, 0.2], [5.0, 2.0, 3.5, 1.0], [5.9, 3.0, 5.1, 1.8]])
test_data = np.array([[5.8,  3.1 ,  6.1,  0.4], [0.8,  1.1 ,  2.1,  0.1]])
kmeans.fit(iris_data)
print('Labels:', kmeans.labels)
print('Centroid/Means:', kmeans.means)

prediction = kmeans.predict(kmean.ndarray_to_dataframe(test_data))
print('Prediction:', prediction)

Labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1]
Centroid/Means: [[5.0059999999999993, 3.4180000000000006, 1.464, 0.24399999999999991], [5.8836065573770497, 2.7409836065573772, 4.3885245901639349, 1.4344262295081966], [6.8538461538461526, 3.0769230769230766, 5.7153846153846146, 2.0538461538461532]]
Prediction: [2, 0]


In [55]:
# K-Medoids
kmedoids = kmed.KMedoids(num_clusters=3, init='random')
# change swap_medoid if needed, default is optimized
# kmedoids = kmed.KMedoids(num_clusters=3, init='random', swap_medoid='random')

test_data = np.array([[5.8,  3.1 ,  6.1,  0.4], [0.8,  1.1 ,  2.1,  0.1]])
kmedoids.fit(iris_data)
print('Labels:', kmedoids.labels)
print('Medoids:', kmedoids.medoids)

prediction = kmedoids.predict(kmean.ndarray_to_dataframe(test_data))
print('Prediction:', prediction)

Labels: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 2, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
Medoids: [94, 7, 147]
Prediction: [2, 1]


In [56]:
# Agglomerative
agglomerative = agglo.Agglomerative(link="single", n_cluster=3, distance="manhattan")
agglomerative.fit(iris_data)
labels = agglomerative.predict()
print('Labels:', labels)

IndexError: index 148 is out of bounds for axis 0 with size 147
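
The `IndexError` above is the kind of failure one would expect when a flat index from `np.argmin` is converted to a row and column with a size that no longer matches the matrix, for example the original number of rows after the distance matrix has shrunk through merges. The snippet below is a small, self-contained illustration on a made-up 3x3 matrix (the variable names are assumptions for this example), showing NumPy's `np.unravel_index` next to a conversion with a stale size.

```python
import numpy as np

# A 3x3 distance matrix whose smallest entry first occurs at row 1, column 2
dist_matrix = np.array([
    [np.inf, 4.0, 7.0],
    [4.0, np.inf, 1.0],
    [7.0, 1.0, np.inf],
])
flat_idx = np.argmin(dist_matrix)  # first minimum in row-major order

# Correct: convert using the matrix's current shape
row, col = np.unravel_index(flat_idx, dist_matrix.shape)

# Incorrect: convert as if the matrix still had 4 rows and columns
stale_n = 4
stale_row, stale_col = flat_idx // stale_n, flat_idx % stale_n

print(row, col)              # 1 2
print(stale_row, stale_col)  # 1 1
```

With the stale size the result points at a diagonal cell instead of the closest pair; on larger matrices the same mismatch can also produce indices outside the matrix.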

In [60]:
# DBSCAN
dbscan = dbsc.DBSCAN(eps=0.5, min_pts=4, distance="euclidean")
labels = dbscan.fit_predict(iris_data)
print(labels)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


## Task Distribution

```
13515089 : Agglomerative
13515099 : DBSCAN
13515107 : K-Means, K-Medoids
```