# Naive Engine with Clustering

## Use $k$-Means Clustering for Better Recommendation

We can reduce the search space by clustering users with similar ratings and then looking for recommendations only within a user's own cluster. Here we use the $k$-means clustering algorithm.

In [1]:
import pandas as pd

In [2]:
wine_ratings = pd.read_csv('data/reviews.csv')

In [3]:
wine_ratings.head()

Unnamed: 0,id,username,wine,rating,comment
0,0,jadianes,Manzanilla La Gitana,4,Beautiful Manzanilla. Great price.
1,1,jadianes,Pol Roger Rose 1998,3,Classy Rose. Not great.
2,2,jadianes,Molino Real 2002,4,This can be great with time.
3,3,jadianes,Le Grappin Bagnum Rose 2013,2,Drinkable...
4,4,jadianes,La Bota de Amontillado 1,5,A treasure of a wine


In [4]:
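# One row per user, one column per wine (keyword arguments are required in recent pandas)
wine_ratings_pivoted = wine_ratings.pivot(index='username', columns='wine', values='rating')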

In [5]:
# Treat wines a user has not rated as a rating of 0
wine_ratings_pivoted = wine_ratings_pivoted.fillna(0)
wine_ratings_pivoted

username,Chateau Latour 1982,JL Chave Hermitage 2001,La Bota de Amontillado 1,Le Grappin Bagnum Rose 2013,Manzanilla La Gitana,Molino Real 2002,Pol Roger Rose 1998,Raveneu Le Clos 1996,Rosseau Chambertin 2001,Vega Sicilia Unico 1989,Viña Tondonia Blanco Reserva 1981
carlos,0.0,5.0,0.0,0.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0
jadianes,0.0,0.0,5.0,2.0,4.0,4.0,3.0,0.0,0.0,0.0,0.0
john,0.0,4.0,2.0,3.0,2.0,2.0,4.0,0.0,0.0,0.0,0.0
lluis,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,4.0,0.0,5.0
mari,0.0,0.0,0.0,0.0,3.0,5.0,2.0,5.0,0.0,0.0,0.0
pepe,0.0,0.0,5.0,0.0,4.0,0.0,2.0,0.0,4.0,4.0,0.0
teus,0.0,0.0,5.0,0.0,4.0,5.0,0.0,0.0,0.0,0.0,4.0
yasset,0.0,0.0,0.0,0.0,4.0,1.0,2.0,4.0,0.0,5.0,0.0
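
To make the reshaping step concrete, here is a minimal sketch on a four-row stand-in for the reviews file. The rows are copied from the tables above; the variable names `reviews` and `pivoted` are made up for this illustration.

```python
import pandas as pd

# A tiny stand-in for reviews.csv; the ratings are taken from the tables above.
reviews = pd.DataFrame({
    "username": ["jadianes", "jadianes", "carlos", "carlos"],
    "wine": ["Manzanilla La Gitana", "Molino Real 2002",
             "Manzanilla La Gitana", "JL Chave Hermitage 2001"],
    "rating": [4, 4, 4, 5],
})

# One row per user, one column per wine; missing (user, wine) pairs become NaN,
# which fillna(0) then turns into a rating of 0.
pivoted = reviews.pivot(index="username", columns="wine", values="rating").fillna(0)
print(pivoted)
```

Note that after `fillna(0)` a 0 stands for "not rated", not for a bad rating, which matters when the ratings are interpreted later.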


In [6]:
# import
from sklearn import cluster

# instantiate
k = 3
kmeans = cluster.KMeans(n_clusters=k, random_state=1)

# fit
kmeans.fit(wine_ratings_pivoted)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=1, tol=0.0001,
    verbose=0)

In [7]:
kmeans.cluster_centers_

array([[ 0. ,  0. ,  2.5,  0. ,  4. ,  0.5,  2. ,  2. ,  2. ,  4.5,  0. ],
       [ 0. ,  1.8,  2.4,  1. ,  3.4,  4. ,  1.8,  1. ,  0. ,  0. ,  0.8],
       [ 4. ,  0. ,  0. ,  0. ,  4. ,  0. ,  0. ,  0. ,  4. ,  0. ,  5. ]])

In [8]:
kmeans.labels_

array([1, 1, 1, 2, 1, 0, 1, 0], dtype=int32)

In [9]:
# predict

kmeans.predict([wine_ratings_pivoted.loc['teus', :]])

array([1], dtype=int32)

In [10]:
my_taste = [5.0, 0.0, 1.0, 0.0, 4.0, 1.0, 2.0, 4.0, 0.0, 5.0, 5.0]

We can map my own ratings to one of the clusters as follows.

In [11]:
kmeans.predict([my_taste])

array([0], dtype=int32)
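
Assigning a point to a cluster amounts to picking the closest cluster centre. The sketch below reproduces that step by hand with NumPy, using the centres as printed by `kmeans.cluster_centers_` above and assuming plain Euclidean distance; the names `sq_dist` and `nearest` are only for this illustration.

```python
import numpy as np

# Cluster centres copied from the kmeans.cluster_centers_ output above.
centers = np.array([
    [0.0, 0.0, 2.5, 0.0, 4.0, 0.5, 2.0, 2.0, 2.0, 4.5, 0.0],
    [0.0, 1.8, 2.4, 1.0, 3.4, 4.0, 1.8, 1.0, 0.0, 0.0, 0.8],
    [4.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 4.0, 0.0, 5.0],
])
my_taste = np.array([5.0, 0.0, 1.0, 0.0, 4.0, 1.0, 2.0, 4.0, 0.0, 5.0, 5.0])

# Squared Euclidean distance from my ratings to each centre.
sq_dist = ((centers - my_taste) ** 2).sum(axis=1)

# The index of the smallest distance is the assigned cluster.
nearest = int(sq_dist.argmin())
print(sq_dist, nearest)
```

The distances come out at roughly 60.75, 92.24 and 64.0, so cluster 0 is the closest, in agreement with the prediction above.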

We label each user by adding a new column "label" to the data frame.

In [12]:
wine_ratings_pivoted['label'] = kmeans.labels_

In [13]:
wine_ratings_pivoted

username,Chateau Latour 1982,JL Chave Hermitage 2001,La Bota de Amontillado 1,Le Grappin Bagnum Rose 2013,Manzanilla La Gitana,Molino Real 2002,Pol Roger Rose 1998,Raveneu Le Clos 1996,Rosseau Chambertin 2001,Vega Sicilia Unico 1989,Viña Tondonia Blanco Reserva 1981,label
carlos,0.0,5.0,0.0,0.0,4.0,4.0,0.0,0.0,0.0,0.0,0.0,1
jadianes,0.0,0.0,5.0,2.0,4.0,4.0,3.0,0.0,0.0,0.0,0.0,1
john,0.0,4.0,2.0,3.0,2.0,2.0,4.0,0.0,0.0,0.0,0.0,1
lluis,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,4.0,0.0,5.0,2
mari,0.0,0.0,0.0,0.0,3.0,5.0,2.0,5.0,0.0,0.0,0.0,1
pepe,0.0,0.0,5.0,0.0,4.0,0.0,2.0,0.0,4.0,4.0,0.0,0
teus,0.0,0.0,5.0,0.0,4.0,5.0,0.0,0.0,0.0,0.0,4.0,1
yasset,0.0,0.0,0.0,0.0,4.0,1.0,2.0,4.0,0.0,5.0,0.0,0


List the users who fall in the same cluster as me, that is, the users whose taste is closest to mine.

In [16]:
my_taste = [5.0, 0.0, 1.0, 0.0, 4.0, 1.0, 2.0, 4.0, 0.0, 5.0, 5.0]
pred = kmeans.predict([my_taste])
pred

array([0], dtype=int32)

In [17]:
wine_ratings_pivoted[wine_ratings_pivoted.label == pred[0]]

username,Chateau Latour 1982,JL Chave Hermitage 2001,La Bota de Amontillado 1,Le Grappin Bagnum Rose 2013,Manzanilla La Gitana,Molino Real 2002,Pol Roger Rose 1998,Raveneu Le Clos 1996,Rosseau Chambertin 2001,Vega Sicilia Unico 1989,Viña Tondonia Blanco Reserva 1981,label
pepe,0.0,0.0,5.0,0.0,4.0,0.0,2.0,0.0,4.0,4.0,0.0,0
yasset,0.0,0.0,0.0,0.0,4.0,1.0,2.0,4.0,0.0,5.0,0.0,0


### Find the wines from Pepe's list that I have never tasted

In [18]:
# Mask the wines I have not rated (0); drop 'label' so the mask matches the 11 wine columns
wine_ratings_pivoted.drop(columns='label').loc['pepe', [r == 0 for r in my_taste]]

wine
JL Chave Hermitage 2001        0.0
Le Grappin Bagnum Rose 2013    0.0
Rosseau Chambertin 2001        4.0
Name: pepe, dtype: float64

We should therefore recommend "Rosseau Chambertin 2001": it is the only wine in this list that Pepe has tasted, and he rated it 4.

### Find the wines from Yasset's list that I have never tasted

In [19]:
# Same mask as before, applied to Yasset's ratings
wine_ratings_pivoted.drop(columns='label').loc['yasset', [r == 0 for r in my_taste]]

wine
JL Chave Hermitage 2001        0.0
Le Grappin Bagnum Rose 2013    0.0
Rosseau Chambertin 2001        0.0
Name: yasset, dtype: float64

Here we have nothing to recommend: Yasset has not rated any of the wines in the list above (a rating of 0 means the wine was never rated).
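
The two per-user lookups can be folded into a single step. The following is a minimal sketch, not part of the original engine: `recommend` is a hypothetical helper that takes the ratings of my cluster neighbours, keeps only the wines I have not tasted, and ranks them by the neighbours' average rating, treating 0 as "not rated". The data is copied from the tables above.

```python
import pandas as pd

wines = [
    "Chateau Latour 1982", "JL Chave Hermitage 2001", "La Bota de Amontillado 1",
    "Le Grappin Bagnum Rose 2013", "Manzanilla La Gitana", "Molino Real 2002",
    "Pol Roger Rose 1998", "Raveneu Le Clos 1996", "Rosseau Chambertin 2001",
    "Vega Sicilia Unico 1989", "Viña Tondonia Blanco Reserva 1981",
]

# Ratings of the two users in my cluster (cluster 0), without the 'label' column.
neighbours = pd.DataFrame(
    [[0, 0, 5, 0, 4, 0, 2, 0, 4, 4, 0],
     [0, 0, 0, 0, 4, 1, 2, 4, 0, 5, 0]],
    index=["pepe", "yasset"], columns=wines, dtype=float,
)
my_taste = pd.Series([5, 0, 1, 0, 4, 1, 2, 4, 0, 5, 5], index=wines, dtype=float)


def recommend(neighbours, my_taste):
    """Hypothetical helper: rank wines I have not tasted by neighbours' mean rating."""
    # Keep only the wines I have not rated.
    untried = neighbours.loc[:, my_taste == 0]
    # Treat 0 as "not rated" so it does not drag the average down.
    mean_rating = untried.replace(0, float("nan")).mean()
    # Drop wines nobody in the cluster has rated, best first.
    return mean_rating.dropna().sort_values(ascending=False)


recommendations = recommend(neighbours, my_taste)
print(recommendations)
```

With the data above the only surviving candidate is "Rosseau Chambertin 2001", the same recommendation reached by inspecting Pepe's and Yasset's lists one at a time.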