### **In this notebook I will try to do an exploratory data analysis and clustering for the travel review ratings dataset, found in the UCI machine learning repository.**

<h1 id="Exploratory_Data_Analysis">
1. Exploratory Data Analysis
</h1>

In [None]:
import numpy as np
import pandas as pd 
import os

            
data = pd.read_csv('../input/travel-review-ratings/google_review_ratings.csv') 



Let's get a first overview of our data with info().

In [None]:
data.info()

We will rename the columns for ease of understanding

In [None]:
column_names = ['user_id', 'churches', 'resorts', 'beaches',
                'parks', 'theatres', 'museums', 'malls', 'zoo',
                'restaurants', 'pubs_bars', 'local_services',
                'burger_pizza_shops', 'hotels_other_lodgings',
                'juice_bars', 'art_galleries', 'dance_clubs',
                 'swimming_pools', 'gyms', 'bakeries', 'beauty_spas',
                'cafes', 'view_points', 'monuments', 'gardens', 'Unnamed: 25']

data.columns = column_names

Let's check how many null values we have

In [None]:
data.isnull().sum()

We will drop the entire 'Unnamed: 25' column and fill the 2 null values (in categories 12 and 24) with zeros, on the assumption that the user did not give any rating to these categories.
We will also drop the 'User' column (renamed user_id), as it is of no use to us. Finally, we will convert the local_services column from object to float.

In [None]:
data.drop('Unnamed: 25', axis = 1, inplace = True)

data.drop('user_id', axis = 1, inplace = True)

data = data.fillna(0)

# Try the conversion; it fails if any entry is not a valid number
data['local_services'].astype('float')



So, we have one string in our local_services column. Let's impute it with the column's mean.

In [None]:
# Mask of the single malformed entry
bad = data['local_services'] == '2\t2.'

# Mean of the valid entries, assigned with .loc to avoid chained assignment
local_services_mean = data.loc[~bad, 'local_services'].astype('float').mean()
data.loc[bad, 'local_services'] = local_services_mean

data['local_services'] = data['local_services'].astype('float')
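
The cell above hard-codes the malformed value. As an alternative sketch (not what this notebook does), `pd.to_numeric` with `errors='coerce'` turns any unparseable entry into NaN, which can then be filled with the column mean. The toy Series `raw` is a made-up stand-in for `data['local_services']`.

```python
import pandas as pd

# Hypothetical stand-in for the object-typed local_services column
raw = pd.Series(['1.5', '2.5', '2\t2.', '3.5'])

# Unparseable strings become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors='coerce')

# Replace the NaN entries with the mean of the valid values
cleaned = numeric.fillna(numeric.mean())
print(cleaned.tolist())
```

This variant handles any number of malformed entries without listing them one by one.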



In [None]:
data.info()

Great. All of our data is now of type 'float'. Let's check some descriptive statistics.

In [None]:
pd.set_option('display.max_columns', 30)
data.describe()

Let's visualize our first plot: we will examine the number of reviews under each category

In [None]:
import warnings

import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline

# Render sharper (retina) figures to avoid blurry images
%config InlineBackend.figure_format = 'retina'

# Ignore warnings
warnings.filterwarnings('ignore')

# Enable multiple cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'


column_names = ['churches', 'resorts', 'beaches', 'parks', 'theatres', 'museums', 'malls', 'zoo',
                'restaurants', 'pubs_bars', 'local_services',
                'burger_pizza_shops', 'hotels_other_lodgings', 'juice_bars', 
                'art_galleries', 'dance_clubs', 'swimming_pools',
                'gyms', 'bakeries', 'beauty_spas', 'cafes', 'view_points', 'monuments', 'gardens']



# Number of non-zero ratings per category, sorted ascending
counts = data[column_names].astype(bool).sum(axis=0).sort_values()

# Tidy frame so that the axis labels can be set by column name
counts_df = counts.rename_axis('category').reset_index(name='total_ratings')

fig = px.bar(counts_df,
             x='total_ratings',
             y='category',
             color='total_ratings',
             labels={
                 'total_ratings': 'Total ratings',
                 'category': 'Category'
             },
             height=800,
             title="Number of reviews under each category")

fig.show()



As we can observe, bakeries, gyms and beauty spas are the venues that users rate least often.
Let's see how many categories each user reviewed.


In [None]:
reviews = data[column_names[:]].astype(bool).sum(axis=1).value_counts()


fig = px.bar(reviews, 
             x=reviews.index, 
             y=reviews.values,
             color=reviews.values,
                height = 800,
                title="Number of categories VS Number of reviews")

fig.show()



We can see that a total of 3725 users reviewed all 24 categories, while only 6 users reviewed just 15 of them.
Let's now check the average rating for each category.

In [None]:
avg_rating = data[column_names[:]].mean().sort_values()



fig = px.bar(avg_rating,
            x = avg_rating.index,
            y = avg_rating.values,
            color = avg_rating.values,
            height = 800,
            title = "Average rating for each category")

fig.show()

We clearly see that gyms are users' least favorite venue, with an average rating of 0.82, while malls lead with an average rating of 3.35.

Let's analyze the outliers

In [None]:

fig = px.box(data, y = ['churches', 'resorts', 'beaches', 'parks', 'theatres', 'museums', 'malls', 'zoo',
                'restaurants', 'pubs_bars', 'local_services',
                'burger_pizza_shops', 'hotels_other_lodgings', 'juice_bars', 
                'art_galleries', 'dance_clubs', 'swimming_pools',
                'gyms', 'bakeries', 'beauty_spas', 'cafes', 'view_points', 'monuments', 'gardens'] )
fig.show()

In [None]:
df = pd.DataFrame(data)

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

pd.set_option('display.max_info_rows', 30)

((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).sum()



Outliers are expected for large sample sizes. Since the reviews on Google do not arise from human error, we should not discard them.

Outlier handling is even more difficult in unsupervised learning, since we are trying to learn both what the clusters are and which data points belong to no cluster at all.

In [None]:
import seaborn as sns

plt.figure(figsize=(15,10))
cor = data.corr() #Calculate the correlation of the above variables
sns.heatmap(cor, square = True) #Plot the correlation as heat map

In [None]:
data.shape

In [None]:
data = data.drop_duplicates()

In [None]:
data.shape

The picture so far is not clear at all.

It would help us significantly, both in terms of analysis and clustering, to group the various categories into higher-level baskets.

In [None]:
entertainment = ['theatres', 'dance_clubs', 'malls']
food_travel = ['restaurants', 'pubs_bars', 'burger_pizza_shops', 'juice_bars', 'bakeries', 'cafes']
places_of_stay = ['hotels_other_lodgings', 'resorts']
historical = ['churches', 'museums', 'art_galleries', 'monuments']
nature = ['beaches', 'parks', 'zoo', 'view_points', 'gardens']
services = ['local_services', 'swimming_pools', 'gyms', 'beauty_spas']

In [None]:
df_categories = pd.DataFrame(columns = ['entertainment', 'food_travel', 'places_of_stay', 'historical', 'nature', 'services'])

In [None]:
df_categories['entertainment'] = data[entertainment].mean(axis=1)
df_categories['food_travel'] = data[food_travel].mean(axis = 1)
df_categories['places_of_stay'] = data[places_of_stay].mean(axis = 1)
df_categories['historical'] = data[historical].mean(axis = 1)
df_categories['nature'] = data[nature].mean(axis = 1)
df_categories['services'] = data[services].mean(axis = 1)

In [None]:
df_categories.describe()

In [None]:

fig = px.box(df_categories, y = ['entertainment', 'food_travel', 'places_of_stay', 'historical', 'nature', 'services'] )
fig.show()

<h1 id="Clustering">
2. Clustering
</h1>

Let's assess the clusterability of the dataset using the Hopkins statistic. According to the pyclustertend library, the score lies on a scale from 0 to 1, and the lower the score, the better the clusterability of the dataset.

In [None]:
!pip install pyclustertend

from pyclustertend import hopkins 

hopkins(df_categories, df_categories.shape[0])
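
To make the statistic more concrete, here is a minimal NumPy sketch that follows the convention described above (lower means more clusterable). It is an illustration under stated assumptions, not the pyclustertend implementation: `hopkins_sketch` and `nearest_distances` are hypothetical helpers, and the two toy datasets are synthetic.

```python
import numpy as np

def nearest_distances(queries, points, skip_self=False):
    """Distance from each query to its nearest point (optionally ignoring itself)."""
    d = np.linalg.norm(queries[:, None, :] - points[None, :, :], axis=2)
    d.sort(axis=1)
    # When the queries are members of `points`, column 0 is the zero self-distance
    return d[:, 1] if skip_self else d[:, 0]

def hopkins_sketch(X, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # w: distances from sampled real points to their nearest other real point
    idx = rng.choice(len(X), size=n_samples, replace=False)
    w = nearest_distances(X[idx], X, skip_self=True)
    # u: distances from uniformly drawn points (in the bounding box) to the nearest real point
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_samples, X.shape[1]))
    u = nearest_distances(U, X)
    return w.sum() / (w.sum() + u.sum())

rng = np.random.default_rng(42)
blobs = np.vstack([rng.normal(0, 0.1, (200, 2)), rng.normal(10, 0.1, (200, 2))])
uniform = rng.uniform(0, 1, (400, 2))

h_blobs = hopkins_sketch(blobs, 100)
h_uniform = hopkins_sketch(uniform, 100)
print(h_blobs, h_uniform)
```

Tightly clustered data should score close to 0, while uniformly scattered data should land around 0.5.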

<h1 id="agglomerative">
2.1 Agglomerative Clustering
</h1>

We begin the clustering process with the agglomerative clustering algorithm for one simple reason: it is a hierarchical clustering algorithm, so we do not have to fix the number of clusters before fitting the model. Hierarchical clustering does not remove the need to choose the number of clusters, though. It simply constructs a tree spanning all samples, which shows which samples (and, later on, clusters) merge together to create a bigger cluster. This happens recursively until only two clusters remain (this is why the default number of clusters is 2), and these are finally merged into the whole dataset.

Firstly, we are going to determine which linkage method to use. In order to do that we will calculate the cophenet index. Cophenet index is a measure of the correlation between the distance of points in feature space and distance on the dendrogram. If the distance between these points increases as the dendrogram distance between the clusters does then the Cophenet index is closer to 1. So, values closer to 1 mean a better linkage method.
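
To illustrate the idea on a small scale, the following sketch computes the cophenetic correlation for a few linkage methods on synthetic data. The array `X` is a made-up stand-in for `df_categories`, and using Euclidean distances for both the linkage and the comparison is an assumption of this sketch.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Synthetic stand-in for df_categories: two well-separated blobs
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

# Use the same (Euclidean) distances for the linkage and for the comparison
dists = pdist(X, 'euclidean')

scores = {}
for method in ['single', 'complete', 'average', 'ward']:
    Z = hierarchy.linkage(X, method)
    # cophenet returns the correlation and the condensed cophenetic distances
    c, _ = hierarchy.cophenet(Z, dists)
    scores[method] = c
    print(method, round(c, 3))
```

Values near 1 indicate that the dendrogram preserves the original pairwise distances well.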

In [None]:
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import pdist
import plotly.graph_objects as go

methods = ['ward', 'average', 'single', 'complete', 'weighted', 'centroid', 'median']

# Pairwise distances that the dendrogram distances are compared against.
# Note: linkage() below uses Euclidean distances by default, while these are Hamming distances.
dists = pdist(df_categories, 'hamming')

# Cophenetic correlation coefficient for each linkage method
coph = {}
for method in methods:
    Z = hierarchy.linkage(df_categories, method)
    c, coph_dists = hierarchy.cophenet(Z, dists)
    coph[method] = c

fig = go.Figure([go.Bar(x=methods, y=[coph[m] for m in methods])])
fig.show()


We clearly see that the average linkage method is the preferred one.

Let's calculate some useful metrics that will help us decide the number of clusters. Since the ground truth labels are not known, we will use metrics such as the silhouette coefficient, the Davies-Bouldin score and the Calinski-Harabasz index.

In [None]:
from sklearn.metrics import silhouette_score
for n_clusters in range(2,10):
    clusterer = AgglomerativeClustering (n_clusters=n_clusters, distance_threshold = None)
    preds = clusterer.fit_predict(df_categories)
    

    score = silhouette_score (df_categories, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

From the silhouette scores we can see that, although 2 clusters would be the best choice for our data, the score itself is pretty low.
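
For reference, the silhouette of a point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below spells this out in NumPy on four made-up points; `silhouette_sketch` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np

def silhouette_sketch(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the own cluster
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(D[i, labels == k].mean() for k in np.unique(labels) if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_sketch(X, [0, 0, 1, 1])  # groups the nearby points together
bad = silhouette_sketch(X, [0, 1, 0, 1])   # splits the nearby points apart
print(good, bad)
```

A sensible grouping yields a score close to 1, while a poor one can drop below 0.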

Let's check the Davies-Bouldin score. We want a value as close to 0 as possible.

In [None]:
from sklearn.metrics import davies_bouldin_score

for n_clusters in range(2,10):
    clusterer = AgglomerativeClustering (n_clusters=n_clusters, distance_threshold = None)
    preds = clusterer.fit_predict(df_categories)
    
    score = davies_bouldin_score (df_categories , preds)
    print("For n_clusters = {}, the Davies-Bouldin score is {}".format(n_clusters, score))

Here, 6 clusters have the lowest score within a reasonable range of cluster counts.

Next, the Calinski-Harabasz index. We want as high a score as possible.

In [None]:
from sklearn.metrics import calinski_harabasz_score

for n_clusters in range(2,10):
    clusterer = AgglomerativeClustering (n_clusters=n_clusters, distance_threshold = None)
    preds = clusterer.fit_predict(df_categories)
    
    score = calinski_harabasz_score(df_categories, preds)
    print("For n_clusters = {}, the Calinski-Harabasz score is {}".format(n_clusters, score))

Here the metric shows that 2 clusters would be better than 3.

In [None]:
model = AgglomerativeClustering(distance_threshold=0, n_clusters = None)
model = model.fit(df_categories)

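# Caution: linkage() is applied to model.children_ (pairs of merged node indices),
# not to the observations, so the heights below are not feature-space distances
Z = hierarchy.linkage(model.children_, 'average')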

plt.figure(figsize=(20,10))

dn = hierarchy.dendrogram(Z)

Looking at the dendrogram above, we observe 3 distinct colors, but the coloring alone does not determine how many clusters are formed. Following the main criterion of cutting the dendrogram at an appropriate height, we find that there are basically 5 clusters. Observing the height of each dendrogram division, we decided to draw the cut line at 4000.
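
Cutting a dendrogram at a given height can also be done programmatically. The sketch below uses SciPy's `fcluster` with `criterion='distance'` on six made-up points; the cut height of 5 is chosen for this toy data only and has no relation to the value used above.

```python
import numpy as np
from scipy.cluster import hierarchy

# Three tight pairs of points, far away from each other
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]], dtype=float)

Z = hierarchy.linkage(X, 'average')

# Every merge above height 5 is undone; what remains are the flat clusters
labels = hierarchy.fcluster(Z, t=5, criterion='distance')
n_clusters = len(np.unique(labels))
print(labels, n_clusters)
```

Each pair merges at height 1 and the pairs only join far above 5, so the cut leaves three clusters.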

Now, let's plot our data using the labels that the algorithm generated. We are going to make a scatterplot.

In [None]:
# Fit the final model with 3 clusters and get a label for every user
model = AgglomerativeClustering(distance_threshold=None, n_clusters=3)
y_agg = model.fit_predict(df_categories)

df_agg = df_categories.copy()
df_agg["AggLabels"] = y_agg

# Warning: the pairplot is slow on the full dataset; comment it out if needed
sns.pairplot(df_agg, hue="AggLabels")

df_agg["AggLabels"].value_counts()




<h1 id="DBSCAN">
2.2 DBSCAN
</h1>

Next, we apply the DBSCAN algorithm to our dataset. DBSCAN works by running a connected components algorithm across the different core points. If two core points share border points, or a core point is a border point in another core point’s neighborhood, then they’re part of the same connected component, which forms a cluster.
A low min_samples parameter means it will build more clusters from noise, so we shouldn't choose it too small.
The DBSCAN paper suggests choosing minPts based on the dimensionality, and eps based on the elbow in the k-distance graph.
For eps, we can try plotting a k-NN distance histogram and choosing a "knee" there, but there might be no visible knee, or several.
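
The roles of core, border and noise points can be sketched in a few lines of NumPy. This is only an illustration on six made-up points, assuming scikit-learn's convention that min_samples counts the point itself; `classify_points` is a hypothetical helper and does not perform the full clustering.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = D <= eps  # neighborhood of every point, including the point itself
    # Core points have at least min_samples points in their neighborhood
    core = neighbors.sum(axis=1) >= min_samples
    # Border points are not core but lie in the neighborhood of some core point
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    return np.where(core, 'core', np.where(border, 'border', 'noise'))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 1], [10, 10]])
kinds = classify_points(X, eps=1.0, min_samples=3)
print(kinds)
```

Here the four corners of the unit square are core points, (2, 1) is a border point attached to (1, 1), and (10, 10) is noise.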

In the more recent publication

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN.
ACM Transactions on Database Systems (TODS), 42(3), 19.


the authors suggest using a larger minPts for large and noisy data sets, and adjusting epsilon depending on whether you get clusters that are too large (decrease epsilon) or too much noise (increase epsilon). Clustering requires iteration.

After some trial and error, the min_samples value that gives the least noise is 43.

Let's apply the knee method using KNN to find the optimal eps value for our model. Since KNN is a supervised learning algorithm and our data is not labeled, we will apply a general rule of thumb popularized by the "Pattern Classification" book by Duda et al., which says that the optimal K is usually around the square root of N, where N is the total number of samples.

In [None]:
from sklearn.neighbors import NearestNeighbors

nearest_neighbors = NearestNeighbors(n_neighbors=73)  # sqrt(5456) ≈ 73.9, rounded down to 73
nearest_neighbors.fit(df_categories)
distances, indices = nearest_neighbors.kneighbors(df_categories)
# Sort each column; column 1 is the distance to the nearest neighbor other than the point itself
distances = np.sort(distances, axis=0)[:, 1]
plt.figure(figsize=(20,10))
plt.plot(distances)
plt.show()

Optimal value for eps where a 'knee' is formed is 0.6.
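
The plot above uses each point's distance to its closest neighbor. A common variant of the k-distance graph instead plots the distance to the k-th neighbor. The sketch below shows that variant on synthetic data; `k_distances` is a hypothetical helper and k=5 is an arbitrary choice for illustration.

```python
import numpy as np

def k_distances(X, k):
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D.sort(axis=1)           # per point: column 0 is the point itself (distance 0)
    return np.sort(D[:, k])  # distance to the k-th neighbor, in ascending order

rng = np.random.default_rng(0)
# One dense blob plus a few scattered points that should stand out as noise
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.uniform(-5, 5, (10, 2))])

kd = k_distances(X, k=5)
print(kd[:3], kd[-3:])
# plt.plot(kd) would draw the curve; the sharp bend marks a candidate eps
```

Points inside the blob have small k-distances, while the scattered points form the steep tail of the curve.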

In [None]:
from sklearn.cluster import DBSCAN

model2 = DBSCAN(eps = 0.6, min_samples = 43)

model2 = model2.fit(df_categories)

np.unique(model2.labels_)

y_db=model2.labels_

df_db = df_categories.copy()
df_db["DBLabels"] = y_db

df_db["DBLabels"].value_counts(0)


sns.pairplot( df_db, hue="DBLabels")  




In [None]:
df_db

Let's remove the noise

In [None]:
# Get the index labels of the rows where DBLabels equals -1 (noise)
indexNames = df_db[ df_db['DBLabels'] == -1 ].index
# Delete these rows from the DataFrame
df_db.drop(indexNames , inplace=True)

df_db['DBLabels'].unique()

In [None]:
df_db

In [None]:
sns.pairplot( df_db, hue="DBLabels")  

In [None]:
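y_db = df_db['DBLabels']
# Score on the features only; the label column must not enter the distance computation
silhouette_score(df_db.drop(columns='DBLabels'), y_db)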

<h1 id="EM">
2.3 EM using GMM
</h1>

At its simplest, GMM is also a type of clustering algorithm. As its name implies, each cluster is modelled according to a different Gaussian distribution. This flexible and probabilistic approach to modelling the data means that rather than having hard assignments into clusters like k-means, we have soft assignments. This means that each data point could have been generated by any of the distributions with a corresponding probability.
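
The soft assignments can be illustrated with a hand-written one-dimensional mixture. The weights, means and standard deviations below are made up for the example; a fitted `GaussianMixture` would estimate them from the data.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical mixture of two 1-D Gaussians
weights = np.array([0.5, 0.5])
means = np.array([0.0, 4.0])
stds = np.array([1.0, 1.0])

x = np.array([0.0, 2.0, 4.0])

# Weighted density of every point under every component, shape (3, 2)
weighted = weights * gaussian_pdf(x[:, None], means, stds)

# Normalizing each row gives the probability that a component generated the point
resp = weighted / weighted.sum(axis=1, keepdims=True)
print(resp.round(3))
```

The point halfway between the two means is split evenly between the components, whereas the points at the means belong almost entirely to one of them.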

In [None]:
from sklearn.mixture import GaussianMixture as GMM

n_components = np.arange(1, 21)
models = [GMM(n, covariance_type='full', random_state=0).fit(df_categories)
          for n in n_components]

plt.plot(n_components, [m.bic(df_categories) for m in models], label='BIC')
plt.plot(n_components, [m.aic(df_categories) for m in models], label='AIC')
plt.legend(loc='best')
plt.xlabel('n_components');

The number of components chosen this way measures how well the GMM works as a density estimator, not how well it works as a clustering algorithm. The plot above suggests that the optimal number of components is 13 (where the gradient stops decreasing).

In [None]:
from matplotlib.patches import Ellipse

df_gmm = df_categories.copy()

df_gmm['gmm'] = GMM(n_components=6, random_state=42).fit_predict(df_categories)

sns.pairplot( df_gmm, hue="gmm")  

<h1 id="kmeans">
2.4 KMeans
</h1>

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group.
It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
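
The iteration described above can be sketched in plain NumPy. This is a simplified illustration on synthetic blobs, assuming random initial centroids and a fixed number of iterations; `kmeans_sketch` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: every point goes to its closest centroid
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: every centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Inertia: sum of squared distances to the closest centroid
    inertia = float((d.min(axis=1) ** 2).sum())
    return labels, centers, inertia

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])

labels, centers, inertia_2 = kmeans_sketch(X, k=2)
_, _, inertia_1 = kmeans_sketch(X, k=1)
print(inertia_1, inertia_2)
```

Going from one to two clusters sharply reduces the inertia here, which is the drop the elbow method looks for.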

In [None]:
from sklearn.cluster import KMeans

for n_clusters in range(2,10):
    clusterer = KMeans(n_clusters = n_clusters)
    preds = clusterer.fit_predict(df_categories)
    

    score = silhouette_score (df_categories, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

In [None]:
from scipy.spatial.distance import cdist

distortions = []  # mean distance of each point to its closest centroid
inertias = []     # within-cluster sum of squared distances
K = range(1, 10)

for k in K:
    # Build and fit the model
    kmeanModel = KMeans(n_clusters=k).fit(df_categories)

    distortions.append(np.min(cdist(df_categories, kmeanModel.cluster_centers_,
                                    'euclidean'), axis=1).sum() / df_categories.shape[0])
    inertias.append(kmeanModel.inertia_)

for k, inertia in zip(K, inertias):
    print(str(k) + ' : ' + str(inertia))

plt.plot(K, inertias, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()

We'll go with 2 clusters.

In [None]:
model4 = KMeans(n_clusters = 2, n_init = 40)

df_kmeans = df_categories.copy()

# fit_predict fits the model and returns the cluster label of every row
df_kmeans['kmeans'] = model4.fit_predict(df_categories)

sns.pairplot( df_kmeans, hue = 'kmeans')