# Research on Parallelization of Algorithms for Finding the K Value
The objective of this project is to compare different algorithms for finding the optimal value of K, and to measure how much parallelization changes the performance of each algorithm.

## The K Means Algorithm
Clustering is the process of dividing a set of physical or abstract objects into groups of similar objects. A cluster is a collection of data objects that are similar to one another and dissimilar to the objects in other clusters.
### K Means Clustering
1. Specify the number of clusters, K.
2. Randomly select K data points. These data points are the initial centroids.
3. Measure the distance from the first data point to every centroid.
4. Assign the data point to the cluster of its nearest centroid.
5. Repeat steps 3 and 4 for all data points.
6. Calculate the mean of each cluster.
7. Make these means the new centroids.
8. Repeat steps 3 to 7 until recomputing the means no longer affects the clusters.
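The assignment and update steps (3 to 7) can be sketched in vectorized NumPy form. This is a minimal illustration under stated assumptions, not the implementation used in this project: the helper name `kmeans_step` and the toy data are made up for the example, and an empty cluster is assumed to keep its previous centroid.

```python
import numpy as np

def kmeans_step(points, centroids):
    """Run one assignment + update iteration (hypothetical helper)."""
    # Distance from every point to every centroid, shape (n_points, k)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # Steps 3-5: index of the nearest centroid for each point
    labels = distances.argmin(axis=1)
    # Steps 6-7: the mean of each cluster becomes its new centroid;
    # an empty cluster keeps its previous centroid (assumption)
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = points[labels == j]
        if len(members) > 0:
            new_centroids[j] = members.mean(axis=0)
    return labels, new_centroids

# Toy data: two well-separated groups (illustrative values only)
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
labels, new_centroids = kmeans_step(points, centroids)
```

Calling `kmeans_step` repeatedly, feeding `new_centroids` back in until they stop changing, corresponds to step 8.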



## K Means Implementation

#### Importing Dependencies

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Making Class for K Means

In [4]:
class K_Means:
    #Initialization
    def __init__(self, k=2, tol=0.001, max_iter=300):
        self.k = k
        self.tol = tol
        self.max_iter = max_iter
        self.centroids = {}
        self.classifications = {}

    def fit(self,data):
        #Initialize the centroids with the first k data points (not a random selection)
        for i in range(self.k):
            self.centroids[i] = data[i]

        for i in range(self.max_iter):
            #Making an empty list for each cluster
            for j in range(self.k):
                self.classifications[j] = []

            for feature_set in data:
                #Finding distance between Features and Centroid
                distances = [np.linalg.norm(feature_set-self.centroids[centroid]) for centroid in self.centroids]
                #Selecting the cluster with minimum distance
                classification = distances.index(min(distances))
                self.classifications[classification].append(feature_set)

            prev_centroids = dict(self.centroids)

            #Reassigning Centroids (an empty cluster keeps its previous centroid)
            for classification in self.classifications:
                if self.classifications[classification]:
                    self.centroids[classification] = np.average(self.classifications[classification],axis=0)

            optimized = True

            #Comparing Centroid
            for c in self.centroids:
                original_centroid = prev_centroids[c]
                current_centroid = self.centroids[c]
                #Total absolute percentage change; absolute values stop opposite moves from cancelling out
                change = np.sum(np.abs((current_centroid-original_centroid)/original_centroid*100.0))
                if change > self.tol:
                    print(change)
                    optimized = False

            if optimized:
                break

    def predict(self,data):
        distances = [np.linalg.norm(data-self.centroids[centroid]) for centroid in self.centroids]
        classification = distances.index(min(distances))
        return classification
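A convergence test based on a signed sum of coordinate changes can report zero movement when coordinates shift in opposite directions. The sketch below illustrates the effect and one possible alternative, the Euclidean distance each centroid has moved. The helper `has_converged` and the toy values are assumptions made for this example and are not part of the class above; note that `tol` here is an absolute distance, not a percentage.

```python
import numpy as np

def has_converged(prev_centroids, centroids, tol=0.001):
    """Return True when no centroid moved farther than tol (hypothetical helper)."""
    # Euclidean distance moved by each centroid, one value per row
    shifts = np.linalg.norm(np.asarray(centroids) - np.asarray(prev_centroids), axis=1)
    return bool(np.all(shifts <= tol))

# One centroid that moves +0.5 in one coordinate and -0.5 in the other
prev = np.array([[1.0, 1.0]])
curr = np.array([[1.5, 0.5]])

# Signed percentage changes cancel: +50% and -50% sum to zero
signed_sum = np.sum((curr - prev) / prev * 100.0)

# The Euclidean shift still registers the movement
converged = has_converged(prev, curr)
```

Here `signed_sum` is zero even though the centroid has clearly moved, while `has_converged` returns `False`.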

#### Loading Dataset

In [5]:
iris = sns.load_dataset("iris")
iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [9]:
x=iris.iloc[:,:4] #all parameters
y=iris["species"] #class labels
x.head()

|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [11]:
# Shape of Dataset
m=x.shape[0] #number of training examples
n=x.shape[1] #number of features
print("Training Examples: " + str(m) + " Features: "+ str(n))

Training Examples: 150 Features: 4


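In [8]:
clf = K_Means()
# fit() indexes rows positionally (data[i]), so pass a NumPy array;
# passing the DataFrame x directly raises KeyError: 0
clf.fit(x.to_numpy())

# Plot each centroid using its first two features (sepal_length, sepal_width)
for centroid in clf.centroids:
    plt.scatter(clf.centroids[centroid][0], clf.centroids[centroid][1], marker="o", color="k", s=150, linewidths=5)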