## Applications of unsupervised learning

* **Recommender systems**, which group users with similar viewing patterns so that content liked by one user can be recommended to the others in the group.
* **Customer segmentation**, or understanding different customer groups around which to build marketing or other business strategies.
* **Genetics**, for example clustering DNA patterns to analyze evolutionary biology.
* **Anomaly detection**, including fraud detection or detecting defective mechanical parts (e.g., for predictive maintenance).
* **Outlier detection** within a data science / data analytics workflow.

# Let's do some unsupervised learning!

**We're doing KMeans clustering**

In [1]:
# let's get some data

In [2]:
import pandas as pd

In [3]:
from sklearn import datasets

In [4]:
data = datasets.load_wine()

In [5]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])

In [6]:
print(data['DESCR'])

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

In [15]:
data['target']

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

In [16]:
data['data']

array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

In [17]:
data['data'].shape

(178, 13)

In [20]:
X = pd.DataFrame(data['data'], columns=data['feature_names'])

y = pd.Series(data['target'])
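As an aside, the `'frame'` key listed by `data.keys()` suggests this scikit-learn version supports the `as_frame` option, which returns pandas objects directly. The sketch below is an alternative to building `X` and `y` by hand; the names `wine`, `X_alt`, and `y_alt` are illustrative and not part of the original notebook.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to return pandas objects instead of arrays
wine = datasets.load_wine(as_frame=True)

X_alt = wine.data      # DataFrame with the 13 feature columns
y_alt = wine.target    # Series holding the class labels
full = wine.frame      # features and target together in one DataFrame

print(X_alt.shape)
print(full.shape)
```

Either route should give the same features; the manual construction above just makes the array-to-DataFrame step explicit.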

In [22]:
X.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |


In [23]:
y.unique()

array([0, 1, 2])
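`unique()` only shows which labels exist. The target array printed earlier suggests the three classes are not the same size, so a quick count per class is a reasonable sanity check. This is a small sketch that reloads the data so it runs on its own; `class_counts` is an illustrative name.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_wine()
y = pd.Series(data['target'])

# number of samples per class, ordered by class label
class_counts = y.value_counts().sort_index()
print(class_counts)
```

Keep in mind that k-means never sees these labels; they are only useful afterwards, for checking how the clusters line up with the known classes.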

In [25]:
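from sklearn.preprocessing import StandardScaler

# standardize each feature to zero mean and unit variance so that
# large-valued columns (e.g. proline) do not dominate the distances
X_prep = StandardScaler().fit_transform(X)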

In [28]:
# dataframe of scaled features
X_prep_df = pd.DataFrame(X_prep, columns=data['feature_names'])
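To see what the scaling step does, the sketch below compares the spread of each column before and after `StandardScaler`. It rebuilds the data so it can run independently; `X_scaled` is an illustrative name standing in for `X_prep_df`. Note that `StandardScaler` uses the population standard deviation, hence `ddof=0` in the check.

```python
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

data = datasets.load_wine()
X = pd.DataFrame(data['data'], columns=data['feature_names'])

X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# before scaling: standard deviations differ widely between columns
print(X.std(ddof=0).round(2))

# after scaling: every column has mean ~0 and population std ~1
print(X_scaled.mean().round(6))
print(X_scaled.std(ddof=0).round(6))
```

Because k-means relies on Euclidean distance, unscaled columns with large numeric ranges would otherwise carry most of the weight in the clustering.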

In [29]:
# import display helpers from IPython.display (IPython.core.display is deprecated for this)
from IPython.display import HTML, Image

In [30]:
Image("k_means.gif")

<IPython.core.display.Image object>
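Assuming the animation shows the standard k-means iteration (often called Lloyd's algorithm), the two alternating steps can be sketched in plain NumPy: assign every point to its nearest centroid, then move each centroid to the mean of its points. The function `lloyd_kmeans` and the synthetic `toy_points` are hypothetical and only for illustration; this is not how scikit-learn implements `KMeans` internally.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Toy k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # initialise centroids as k distinct points drawn at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: squared distance from every point to every centroid
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# two well-separated synthetic blobs, made up for this example
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
toy_points = np.vstack([blob_a, blob_b])

toy_centroids, toy_labels = lloyd_kmeans(toy_points, k=2)
print(toy_centroids.round(2))
```

The result depends on the random initialisation, which is why library implementations typically restart from several initialisations and keep the best run.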

In [33]:
from sklearn.cluster import KMeans

# n_clusters=8 is scikit-learn's default; random_state makes the run reproducible
kmeans = KMeans(n_clusters=8, random_state=1234)
kmeans.fit(X_prep_df)

KMeans(random_state=1234)
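The fit above uses `n_clusters=8`, while the wine data has three known classes. One common heuristic for choosing the number of clusters is the "elbow" method: fit k-means for a range of `k` and look at where the inertia (within-cluster sum of squared distances) stops dropping quickly. The sketch below is one way to do that, not a step taken in the original notebook; `k_values`, `inertias`, and the explicit `n_init=10` are choices made for this example.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = datasets.load_wine()
X = pd.DataFrame(data['data'], columns=data['feature_names'])
X_scaled = StandardScaler().fit_transform(X)

# inertia_ = sum of squared distances from each sample to its closest centroid
k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=1234)
    model.fit(X_scaled)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `k_values` and looking for the bend in the curve gives a rough guide; the elbow is a heuristic, so it is worth cross-checking with another measure such as the silhouette score.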

In [38]:
kmeans.cluster_centers_

array([[-4.84417110e-01, -7.00351821e-01, -1.21040122e+00,
        -8.00011646e-01, -1.65489870e-01, -7.62520085e-01,
        -7.59964629e-01,  3.19770238e-01, -1.17142800e+00,
        -6.09430162e-01,  2.08957908e-01, -8.50932705e-01,
        -5.57436408e-01],
       [-7.35477484e-01,  8.13356482e-02, -4.06837479e-01,
        -7.94280943e-02, -5.28297935e-01,  7.54657455e-01,
         6.68152851e-01, -7.01603842e-01,  7.77721643e-01,
        -6.80753574e-01,  3.25934043e-01,  6.16637895e-01,
        -5.92945111e-01],
       [-1.05800485e+00, -4.04894729e-01, -1.96912466e-01,
         6.33883161e-01, -7.33550744e-01, -3.90866171e-01,
        -1.53109902e-01,  3.61882301e-01, -1.96365437e-01,
        -1.02690913e+00,  4.30568144e-01,  3.25804721e-01,
        -8.84574695e-01],
       [ 6.77376912e-02,  1.36609385e+00,  2.08791281e-01,
         5.61140326e-01, -3.93560477e-01, -1.05043553e+00,
        -1.40664557e+00,  1.24500923e+00, -1.08945987e+00,
         2.98906303e-01, -9.60007074e