# KMeans Clustering
## Varieties of the wheat seed dataset
This is a real dataset that provides **measurements of the geometrical properties of kernels belonging to three different varieties of wheat**. A soft X-ray technique and the GRAINS package were used to construct all seven real-valued attributes. The original dataset is available at the UCI Machine Learning Repository: [Seed dataset](https://archive.ics.uci.edu/ml/datasets/seeds). You can download the file and use it. <br>
However, I recommend using the file "**Seed_Data.csv**".
This file has already been processed: the column names are set and the separators (which are longer than one character and vary in form) are handled, so it reads cleanly. The data file contains the following 7 features and 1 target class.

Features are:
* A: Area 
* P: Perimeter  
* C: Compactness {C = 4*pi*A/P^2} 
* LK: Length of Kernel 
* WK: Width of Kernel
* A_Coef: Asymmetry Coefficient 
* LKG: Length of Kernel Groove<br>
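To make the compactness formula concrete, here is a minimal sketch that evaluates `C = 4*pi*A/P^2` for two reference shapes. The helper name `compactness` and the shapes are illustrative assumptions, not part of the dataset.

```python
import math

def compactness(area, perimeter):
    """Compactness C = 4*pi*A / P^2, as defined in the feature list."""
    return 4 * math.pi * area / perimeter ** 2

# A circle of radius 1 (A = pi, P = 2*pi) is the most compact shape, so C = 1.
circle = compactness(math.pi, 2 * math.pi)

# A unit square (A = 1, P = 4) is less compact: C = pi/4.
square = compactness(1.0, 4.0)

print(circle, square)
```

Values of `C` close to 1 therefore indicate rounder kernels.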

Target Class is:
* target: target class (0, 1, 2)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

**Read the file `Seed_Data.csv` and show its first rows.**

In [None]:
df = pd.read_csv('../input/Seed_Data.csv')
df.head()

#####  Let's use the info() function to get a broader view of the dataset

In [None]:
print('Number of rows {} and number of columns {} '.format(df.shape[0], df.shape[1]))
print('\n')
df.info()

### Let's display the basic statistics: mean, std, max, etc.
* `describe()` generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

In [None]:
df.describe()

## Exploratory Data Analysis

Let's do some EDA here, always good to know our data!

**How is the area 'A' related to the compactness 'C'?**
#### Luckily, we have the target values in the column `target`

In [None]:
import warnings
warnings.filterwarnings("ignore")

sns.set(style="darkgrid")
# Recent seaborn versions require x/y as keywords and use `height` instead of `size`
sns.lmplot(x='A', y='C', data=df, hue='target',
           palette='Set1', height=7, aspect=1.2, fit_reg=False);

**Let's see how the area 'A' is related to A_Coef using a scatter plot.** Hint: `hue = target`

In [None]:
# Recent seaborn versions require x/y as keywords and use `height` instead of `size`
sns.lmplot(x='A', y='A_Coef', data=df, hue='target',
           palette='Set1', height=7, aspect=1.2, fit_reg=False);

Here we will generate a histogram of the area 'A' to visualize the data by class.

In [None]:
g = sns.FacetGrid(data = df, hue='target', palette='Set2', height=7, aspect=3)
g = g.map(plt.hist,'A',bins=22,alpha=0.6)
plt.legend();

## KMeans Clustering

Time for machine learning using the unsupervised KMeans clustering algorithm.<br>

Clustering is a machine learning technique that groups data points. Given a set of data points, we can use a clustering algorithm to assign each data point to a specific group. In theory, data points in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields.


##### Set 3 clusters

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)

K-Means is probably the most well-known clustering algorithm. It’s taught in a lot of introductory data science and machine learning classes, and it’s easy to understand and implement in code!

**Fitting the model to all the data except for the `'target'`.**
* We can do this using drop()

In [None]:
kmeans.fit(df.drop('target',axis=1))

K-Means has the advantage that it’s pretty fast, as all we’re really doing is computing the distances between points and group centers: the cost of each iteration grows roughly linearly with the number of points.
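The distance-and-update loop described above can be sketched in a few lines of NumPy. This is a simplified illustration (plain Lloyd iteration with random initial centers), not scikit-learn's actual implementation. The function name `kmeans_sketch` and the synthetic two-blob data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign to nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center: shape (n, k).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # New center = mean of assigned points; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs (made-up data, not the seed measurements).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
labels, centers = kmeans_sketch(X, k=2)
print(centers)
```

Each pass only needs the point-to-center distances and a per-cluster mean, which is why the algorithm scales well.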

In [None]:
centers = kmeans.cluster_centers_
centers


##### Let's add a new column called klabels that holds the cluster labels assigned by the KMeans algorithm

In [None]:
df['klabels'] = kmeans.labels_
df.head()

Below are two plots: on the left, the clusters generated by our KMeans model; on the right, the correct labels from the dataset. Cluster ids are arbitrary, so the colors on the two sides need not match.

In [None]:
f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True,figsize = (12,8) )

# Clusters fitted with KMeans
ax1.set_title('K Means (K = 3)')
ax1.scatter(x = df['A'], y = df['A_Coef'], 
            c = df['klabels'], cmap='rainbow')
# Cluster centers: column 0 is 'A' and column 5 is 'A_Coef'
ax1.scatter(x=centers[:, 0], y=centers[:, 5],
            c='black',s=300, alpha=0.5);

# True labels from the dataset
ax2.set_title("Original")
ax2.scatter(x = df['A'], y = df['A_Coef'], 
            c = df['target'], cmap='rainbow')

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

In [None]:
X = df.iloc[:, [0,1,2,3,4,5,6]].values

scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
# Ward linkage always uses Euclidean distance, so no distance argument is needed
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')
previsoes = hc.fit_predict(X)  # predicted cluster label for each row

In [None]:
fig = plt.figure(figsize=(12,9))
# Build the linkage matrix from the scaled features X, not from the predicted labels
dendograma = dendrogram(linkage(X, method= 'ward'), color_threshold=1, show_leaf_counts=True,
                        truncate_mode='lastp')

In [None]:
df.klabels.value_counts()

In [None]:
df.target.value_counts()
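Because cluster ids are arbitrary, the two `value_counts()` outputs cannot be matched by id alone. A cross-tabulation shows which true class dominates each cluster. The sketch below uses a small made-up frame (the values in `demo` are invented for illustration, and `purity` is a name chosen here); the same call could plausibly be applied to `df['target']` and `df['klabels']`.

```python
import pandas as pd

# Made-up labels for illustration only (not taken from Seed_Data.csv).
demo = pd.DataFrame({
    'target':  [0, 0, 0, 1, 1, 1, 2, 2, 2],
    'klabels': [2, 2, 1, 0, 0, 0, 1, 1, 1],
})

# Rows are true classes, columns are cluster ids; each cell counts co-occurrences.
table = pd.crosstab(demo['target'], demo['klabels'])
print(table)

# Purity: for each cluster take the size of its dominant class, sum, divide by n.
purity = table.max(axis=0).sum() / len(demo)
print(round(purity, 3))
```

A purity of 1.0 would mean every cluster contains a single class. Lower values mean the classes are mixed across clusters.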

## Elbow point 
**Estimate the elbow point to see if our selection for K was right!**

In [None]:
sum_square = {}

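# Let's test for K from 1 to 10
for k in range(1, 11):
    # Drop the true labels and the klabels column added earlier,
    # so that only the 7 features are clustered
    kmeans = KMeans(n_clusters=k).fit(df.drop(['target', 'klabels'], axis=1))

    sum_square[k] = kmeans.inertia_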

In [None]:
plt.plot(list(sum_square.keys()), list(sum_square.values()),
         linestyle ='-', marker = 'H', color = 'g',
         markersize = 8,markerfacecolor = 'b')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (within-cluster sum of squares)');
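The quantity on the vertical axis, `inertia_`, is the sum of squared distances from each sample to its closest cluster center. As a rough illustration of why it drops sharply until K matches the natural grouping, the sketch below computes it by hand on four made-up points. The helper `inertia` and the data are assumptions for this example.

```python
import numpy as np

def inertia(X, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    diffs = X - centers[labels]
    return float((diffs ** 2).sum())

# Four made-up 1-D points forming two tight pairs.
X = np.array([[0.0], [2.0], [10.0], [12.0]])

# K = 1: a single center at the overall mean (6.0).
one = inertia(X, np.zeros(4, dtype=int), np.array([[6.0]]))

# K = 2: one center per pair, at 1.0 and 11.0.
two = inertia(X, np.array([0, 0, 1, 1]), np.array([[1.0], [11.0]]))

print(one, two)  # 104.0 4.0
```

Going from one to two centers removes most of the inertia here. Adding further centers could only remove the small remainder, and that flattening is the "elbow" to look for in the plot.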