# Clustering Exploration
Let's play with clustering models using our recipe vectors.
This notebook depends on:
* `recipe_vecs.h5`, which is generated by `recipe2vec.py`
* `all_recipes.h5`, which is generated by `converter.py`

In [96]:
import pandas as pd
import sklearn.cluster
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler, StandardScaler

In [10]:
import numpy as np

## Import data

In [17]:
with pd.HDFStore('recipe_vecs.h5', 'r') as store:
    recipes = store.get('vecs')

In [18]:
recipes.head()

name,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,783.0,784.0,785.0,786.0,787.0,788.0,789.0,790.0,791.0,boil_time
recipe_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,60.0
1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,60.0
5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.024016,0.0,0.0,0.0,0.0,90.0
7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,60.0
8.0,0.103101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,60.0


In [28]:
with pd.HDFStore('all_recipes.h5', 'r') as store:
    recipe_catalog = store.get('core')
    recipe_catalog = recipe_catalog.loc[recipes.index]

## Scale data

In [109]:
scaler = StandardScaler()

In [110]:
recipes_scaled = scaler.fit_transform(recipes.values)

In [111]:
recipes_scaled

array([[-0.20650155, -0.34672294, -0.24489874, ..., -0.01341072,
        -0.06554427, -0.01674973],
       [-0.20650155, -0.34672294, -0.24489874, ..., -0.01341072,
        -0.06554427, -0.01674973],
       [-0.20650155, -0.34672294, -0.24489874, ..., -0.01341072,
        -0.06554427,  0.1074797 ],
       ...,
       [-0.20650155, -0.34672294, -0.24489874, ..., -0.01341072,
        -0.06554427, -0.01674973],
       [-0.20650155, -0.34672294, -0.24489874, ..., -0.01341072,
        -0.06554427, -0.01674973],
       [ 0.86734716, -0.34672294,  1.8023968 , ..., -0.01341072,
        -0.06554427, -0.01674973]])
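To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` does to each column: subtract the column mean and divide by the column's population standard deviation. The array `X` is a toy stand-in with made-up numbers, not the real recipe vectors, and `manual` is a name introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for recipes.values (hypothetical numbers, not real recipe data).
X = np.array([[0.0, 60.0],
              [0.1, 60.0],
              [0.0, 90.0],
              [0.3, 75.0]])

scaled = StandardScaler().fit_transform(X)

# Same computation by hand: center each column, then divide by its
# population standard deviation (ddof=0, NumPy's default).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

Because most recipe-vector columns are zero for most recipes, a scaled column tends to repeat the same small negative value wherever the original entry was zero, which is the pattern visible in the array above.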

## KMeans

In [8]:
KMeans = sklearn.cluster.KMeans

As a first guess, let's assume that beers fit into these 5 categories:
* IPAs
* APAs
* Stouts and porters
* Lagers
* Belgian beers
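The five-way split is only a guess. One hedged way to sanity-check a cluster count, which this notebook does not do, is to compare silhouette scores across a few candidate values. The sketch below runs on synthetic blobs rather than the recipe data; `scores` and `best_k` are names introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic data: three tight, well-separated blobs (not recipe vectors).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest score is the best-separated clustering.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On the full recipe matrix, `silhouette_score` would be expensive because it relies on pairwise distances; its `sample_size` parameter can score a random subset instead.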

In [100]:
num_clusters = 5
random_state = 0

In [101]:
kmeans = KMeans(n_clusters=num_clusters, random_state=random_state)

In [102]:
kmeans.fit(recipes_scaled)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [103]:
clusters = kmeans.predict(recipes_scaled)

In [104]:
recipe_catalog['cluster'] = clusters

In [105]:
gb_c = recipe_catalog.groupby('cluster')

In [106]:
dist = gb_c.style_name.value_counts()

In [107]:
with pd.option_context('display.max_rows', None):  # show every row instead of a truncated view
    display(dist)

cluster  style_name                                    
0        american ipa                                      26874
         american pale ale                                 22344
         specialty beer                                     9871
         imperial ipa                                       5890
         american amber ale                                 5249
         saison                                             4453
         american wheat or rye beer                         4333
         american brown ale                                 3854
         robust porter                                      3463
         blonde ale                                         3323
         weizen/weissbier                                   3005
         extra special/strong bitter (english pale ale)     2879
         american stout                                     2823
         russian imperial stout                             2438
         irish red ale            

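Oh boy. These clusters aren't very meaningful: cluster 0 alone lumps together IPAs, pale ales, saisons, porters, and stouts. It looks like we're trying to cluster on too many features for the number of samples we have. How many of each are there?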

In [47]:
recipes.shape

(171699, 793)
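Reading the raw counts per cluster by eye is tedious. As a rough sketch, the mix can be summarized as the share of each cluster taken up by its most common style. The example below uses a toy frame with the same `cluster` and `style_name` columns as `recipe_catalog`; the rows are invented, and `toy_catalog`, `shares`, and `purity` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for recipe_catalog (hypothetical rows, not real data).
toy_catalog = pd.DataFrame({
    'cluster': [0, 0, 0, 0, 1, 1, 1],
    'style_name': ['american ipa', 'american ipa', 'american stout', 'saison',
                   'american stout', 'american stout', 'robust porter'],
})

# Fraction of each cluster's recipes that belong to each style.
shares = toy_catalog.groupby('cluster')['style_name'].value_counts(normalize=True)

# Purity: the fraction held by the single most common style per cluster.
purity = shares.groupby(level='cluster').max()
print(purity)
```

A purity near 1 would mean a cluster is dominated by one style; values well below that indicate the kind of mixing seen in cluster 0.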

## Principal Component Analysis
Let's see if we can get a better clustering result by reducing the number of features.

In [112]:
try_components = [13, 21, 34, 55, 89, 144, 233, 377]

In [114]:
for n_components in try_components:
    pca = PCA(n_components=n_components)
    pca.fit(recipes_scaled)
    print(f'Number of components: {n_components}, explained variance ratio: {sum(pca.explained_variance_ratio_)}')

Number of components: 13, explained variance ratio: 0.05797128862534642
Number of components: 21, explained variance ratio: 0.0804569429023829
Number of components: 34, explained variance ratio: 0.11170134873426918
Number of components: 55, explained variance ratio: 0.15693579857670242
Number of components: 89, explained variance ratio: 0.2164864269842909
Number of components: 144, explained variance ratio: 0.309654031745537
Number of components: 233, explained variance ratio: 0.44342012168416517
Number of components: 377, explained variance ratio: 0.638453919760694


Even with 377 components, nearly half of our 793 features, PCA captures only about 64% of the variance.
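The loop above refits PCA once per candidate. A single fit at the largest candidate yields the whole curve, because the cumulative sum of `explained_variance_ratio_` gives the variance captured by the first n components. This is a sketch on synthetic data, assuming the exact `svd_solver='full'` so that the prefix sums match separate fits; the candidate list is shortened to suit the toy matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))  # synthetic stand-in for recipes_scaled

try_components = [3, 5, 8, 13]  # shortened list for the toy data

# Fit once with the largest candidate, using the exact solver.
pca = PCA(n_components=max(try_components), svd_solver='full')
pca.fit(X)

# cumulative[n - 1] is the variance ratio explained by the first n components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

for n in try_components:
    print(f'Number of components: {n}, '
          f'cumulative explained variance ratio: {cumulative[n - 1]:.4f}')
```

On the real data the default solver may pick a randomized method for a matrix this size, in which case the numbers can differ slightly from the per-candidate fits.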