<a href="https://colab.research.google.com/github/raj-vijay/ml/blob/master/03_Transforming_data_for_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Transformation for Clustering**

The fish measurement dataset was sourced from the Journal of Statistics Education. 

http://jse.amstat.org/jse_data_archive.htm

**The Fish Dataset**

159 fish of 7 species were caught and measured. Altogether there are
8 variables. All the fish were caught from the same lake
(Laengelmavesi) near Tampere in Finland.



**VARIABLE DESCRIPTIONS**

1  Obs:       Observation number ranges from 1 to 159

2  Species   (Numeric)

        Code Finnish  Swedish    English        Latin      
         1   Lahna    Braxen     Bream          Abramis brama
         2   Siika    Iiden      Whitewish      Leusiscus idus
         3   Saerki   Moerten    Roach          Leuciscus rutilus
         4   Parkki   Bjoerknan  ?              Abramis bjrkna
         5   Norssi   Norssen    Smelt          Osmerus eperlanus
         6   Hauki    Jaedda     Pike           Esox lucius
         7   Ahven    Abborre    Perch          Perca fluviatilis

3  Weight:      Weight of the fish (in grams)

4  Length1:     Length from the nose to the beginning of the tail (in cm)

5  Length2:     Length from the nose to the notch of the tail (in cm)

6  Length3:     Length from the nose to the end of the tail (in cm)

7  Height%:     Maximal height as % of Length3

8  Width%:      Maximal width as % of Length3

9  Sex:         1 = male 0 = female



          ___/////___                  _
         /           \    ___          |
       /\             \_ /  /          H
     <   )            __)  \           |
       \/_\\_________/   \__\          _

     |------- L1 -------|
     |------- L2 ----------|
     |------- L3 ------------|


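The measurements, such as weight in grams, length in centimeters, and the percentage ratio of height to length, have very different scales. Because k-means relies on Euclidean distances, the feature with the largest scale (here, weight) would otherwise dominate the clustering. To cluster this data effectively, we need to standardize these features first.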

Here, we build a pipeline to standardize and cluster the data.
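
As a minimal sketch of what standardization does, the snippet below scales a small, hypothetical array whose columns mimic weight (g), length (cm), and height (% of length). The values and the names `toy` and `scaled` are illustrative assumptions, not part of the notebook's data flow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements: weight (g), length (cm), height as % of length
toy = np.array([
    [ 242.0, 30.0, 38.4],
    [ 600.0, 37.2, 40.2],
    [1000.0, 42.6, 44.5],
    [ 120.0, 21.3, 39.4],
])

# Subtract each column's mean and divide by its standard deviation
scaled = StandardScaler().fit_transform(toy)

print(toy.std(axis=0))      # raw spreads differ widely between columns
print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, every column contributes on a comparable footing to the distances that k-means computes.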

In [0]:
# Download the fish catch data using wget (Linux)
!wget http://jse.amstat.org/datasets/fishcatch.dat.txt

--2020-01-01 17:59:55--  http://jse.amstat.org/datasets/fishcatch.dat.txt
Resolving jse.amstat.org (jse.amstat.org)... 107.180.48.28
Connecting to jse.amstat.org (jse.amstat.org)|107.180.48.28|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10740 (10K) [text/plain]
Saving to: ‘fishcatch.dat.txt.4’


2020-01-01 17:59:55 (149 MB/s) - ‘fishcatch.dat.txt.4’ saved [10740/10740]



Import NumPy and Pandas

In [0]:
import numpy as np
import pandas as pd

Load the dataset into an array 'data', then extract an array 'samples' containing the measurements of each fish (weight, the three lengths, and height as a percentage of length).

In [0]:
data = np.genfromtxt('fishcatch.dat.txt')
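
The sketch below illustrates how `np.genfromtxt` treats missing entries, assuming they appear in the file as a non-numeric token such as `NA`. The three rows are a hypothetical excerpt in the dataset's column order, and `toy`, `features`, `keep`, `toy_samples`, and `toy_codes` are illustrative names. Note that this variant builds the row mask from the feature columns only and reuses it for the species codes, so both arrays stay aligned.

```python
import io
import numpy as np

# Hypothetical three-row excerpt in the dataset's column order:
# Obs, Species, Weight, Length1, Length2, Length3, Height%, Width%, Sex
text = io.StringIO(
    "1 1 242.0 23.2 25.4 30.0 38.4 13.4 NA\n"
    "2 1 600.0 29.4 32.0 37.2 40.2 13.9 1\n"
    "3 1 NA 29.5 32.0 37.3 37.3 13.6 1\n"
)
toy = np.genfromtxt(text)   # fields that cannot be parsed as floats become nan

features = toy[:, 2:-2]                  # Weight, Length1-3, Height%
keep = ~np.isnan(features).any(axis=1)   # True for rows with complete features

toy_samples = features[keep]
toy_codes = toy[keep, 1]                 # species codes for the same rows

print(toy_samples.shape, toy_codes.shape)
```

Applying one mask to both the measurements and the species codes keeps them the same length, which matters later when they are combined in a DataFrame.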

In [0]:
# Keep the five measurement columns (Weight, Length1-3, Height%) and drop rows with a missing value in any column
samples = data[:,2:-2][~np.isnan(data).any(axis=1)]
samples

array([[ 600. ,   29.4,   32. ,   37.2,   40.2],
       [ 700. ,   30.4,   33. ,   38.3,   38.8],
       [ 575. ,   31.3,   34. ,   39.5,   38.3],
       [ 725. ,   31.8,   35. ,   40.9,   40. ],
       [1000. ,   33.5,   37. ,   42.6,   44.5],
       [ 920. ,   35. ,   38.5,   44.1,   40.9],
       [ 925. ,   36.2,   39.5,   45.3,   41.4],
       [ 975. ,   37.4,   41. ,   45.9,   40.6],
       [ 800. ,   33.7,   36.4,   39.6,   29.7],
       [ 110. ,   19.1,   20.8,   23.1,   26.7],
       [ 120. ,   19.4,   21. ,   23.7,   25.8],
       [ 150. ,   20.4,   22. ,   24.7,   23.5],
       [ 145. ,   20.5,   22. ,   24.3,   27.3],
       [ 160. ,   20.5,   22.5,   25.3,   27.8],
       [ 160. ,   21.1,   22.5,   25. ,   25.6],
       [ 200. ,   22.1,   23.5,   26.8,   27.6],
       [ 272. ,   25. ,   27. ,   30.6,   28. ],
       [  60. ,   14.3,   15.5,   17.4,   37.8],
       [  90. ,   16.3,   17.7,   19.8,   37.4],
       [ 120. ,   17.5,   19. ,   21.3,   39.4],
       [ 170. ,   19

In [0]:
labels = data[:,1]
labels

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 2., 2., 2., 2., 2., 2., 3., 3., 3., 3., 3., 3., 3., 3., 3., 3.,
       3., 3., 3., 3., 3., 3., 3., 3., 3., 3., 4., 4., 4., 4., 4., 4., 4.,
       4., 4., 4., 4., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.,
       5., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6., 6.,
       6., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7.,
       7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7.,
       7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7., 7.,
       7., 7., 7., 7., 7., 7.])

In [0]:
# Map the numeric species codes to names (unknown codes are kept as text)
code_to_name = {1: 'Bream', 2: 'Whitewish', 3: 'Roach', 4: 'Abramis',
                5: 'Smelt', 6: 'Pike', 7: 'Abborre'}
# Stored as np.species because later cells refer to it by that name
np.species = np.array([code_to_name.get(int(code), str(code)) for code in labels])

np.species

Import StandardScaler from sklearn.preprocessing, make_pipeline from sklearn.pipeline, and the KMeans class from sklearn.cluster.

In [0]:
# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [0]:
# Create scaler: scaler
scaler = StandardScaler()

In [0]:
# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters = 4)

In [0]:
# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)
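
`make_pipeline` names each step after its class, lowercased. The short sketch below shows how those names can be used to reach an individual step; `demo_pipeline` is an illustrative name chosen to avoid clobbering the notebook's own `pipeline`.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=4))

# Step names are derived from the class names
print(list(demo_pipeline.named_steps))   # ['standardscaler', 'kmeans']

# Each step remains reachable by its name
print(demo_pipeline.named_steps['kmeans'].n_clusters)   # 4
```

When the pipeline is fitted, the scaler is fitted and applied first, and KMeans then sees only the standardized values.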

**Clustering the fish data**

Use the standardization and clustering pipeline built above to cluster the fish by their measurements, and then create a cross-tabulation to compare the cluster labels with the fish species.

In [0]:
# Fit the pipeline to samples
pipeline.fit(samples)

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('kmeans',
                 KMeans(algorithm='auto', copy_x=True, init='k-means++',
                        max_iter=300, n_clusters=4, n_init=10, n_jobs=None,
                        precompute_distances='auto', random_state=None,
                        tol=0.0001, verbose=0))],
         verbose=False)

In [0]:
# Calculate the cluster labels: labels
labels = pipeline.predict(samples)

In [0]:
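# Create a DataFrame with labels and species as columns: df
# samples excludes rows with missing values, so apply the same row mask to the species names
df = pd.DataFrame({'labels': labels, 'species': np.species[~np.isnan(data).any(axis=1)]})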

In [0]:
# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['species'])

In [0]:
# Display ct
print(ct)

species  Abborre  Abramis  Bream  Pike  Roach  Smelt  Whitewish
labels                                                         
1.0            0        0     35     0      0      0          0
2.0            0        0      0     0      0      0          6
3.0            0        0      0     0     20      0          0
4.0            0       11      0     0      0      0          0
5.0            0        0      0     0      0     14          0
6.0            0        0      0    17      0      0          0
7.0           56        0      0     0      0      0          0
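
To show how such a cross-tabulation reads when the row labels come from a fitted pipeline, here is a minimal end-to-end sketch on synthetic data. The two groups, their scales, the fixed `random_state`, and all `toy_*` names are assumptions made for illustration; they are not the fish data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two hypothetical, well-separated groups on very different scales:
# column 0 is in the hundreds, column 1 is in single digits
group_a = rng.normal(loc=[200.0, 2.0], scale=[10.0, 0.1], size=(20, 2))
group_b = rng.normal(loc=[800.0, 8.0], scale=[10.0, 0.1], size=(20, 2))
toy_samples = np.vstack([group_a, group_b])
toy_species = np.array(['A'] * 20 + ['B'] * 20)

toy_pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

# Fit the scaler and KMeans, then return one cluster label per row
toy_labels = toy_pipeline.fit_predict(toy_samples)

toy_ct = pd.crosstab(
    pd.Series(toy_labels, name='labels'),
    pd.Series(toy_species, name='species'),
)
print(toy_ct)
```

Cluster numbers are arbitrary, so the thing to look for is whether each row concentrates on a single species column rather than which number a cluster received.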
