## My first analysis: PCA and K-means

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

#### In this analysis, I'll show how to cluster a set of restaurant reviews into n segments based on users' preferences.


![image.png](https://searchengineland.com/figz/wp-content/seloads/2017/08/restaurant-seo-featured-800x450.gif)

In [None]:
reviews = pd.read_csv('../input/trip-advisor-reviews/clean_full.csv')
reviews.head()

As we can see below, this dataset contains many missing values.

In [None]:
reviews.describe()

In [None]:
reviews['rest_name'].value_counts()

Our analysis focuses on three different ratings:
- review_rating_service
- review_rating_atmosphere
- review_rating_food



In [None]:
reviews_info = ['rest_id', 'review_rating_service', 'review_rating_atmosphere', 'review_rating_food']
metrics = ['review_rating_service', 'review_rating_atmosphere', 'review_rating_food']

In [None]:
pd.isnull(reviews[reviews_info]).sum()

In [None]:
rev_mt_ = reviews[reviews_info].dropna()
rev_mt_.shape

In [None]:
# Keep only restaurants with more than 5 complete reviews
rest_counts = rev_mt_['rest_id'].value_counts()
rev_mt = rev_mt_[rev_mt_['rest_id'].isin(rest_counts[rest_counts > 5].index)].set_index("rest_id")
rev_mt

In [None]:
X = np.asarray(rev_mt)
scale = StandardScaler()
X = scale.fit_transform(X)
X

In [None]:
pca = PCA(n_components=3)
pca.fit(X)
pca_samples = pca.transform(X)


plt.plot([0,1,2], pca.explained_variance_ratio_, 'ro-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')
plt.show()
print("Variance by PCA:")
print(pca.explained_variance_ratio_)

In [None]:
ps = pd.DataFrame(pca_samples)
ps.head()

PCA is a dimensionality reduction method; its component loadings also give an insight into which features drive the variance in the data.
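
To make the idea concrete, here is a minimal sketch on synthetic data. The three columns, the sample size, and the random seed are made up for illustration and are not taken from the review dataset. It standardizes three correlated "ratings", fits PCA, and prints the share of variance each component explains.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three ratings (assumption: not the real data).
rng = np.random.default_rng(0)
base = rng.normal(size=500)  # a shared "overall satisfaction" signal
toy = np.column_stack([
    base + 0.3 * rng.normal(size=500),
    base + 0.3 * rng.normal(size=500),
    base + 0.3 * rng.normal(size=500),
])

# Standardize, then fit PCA with as many components as features.
toy_scaled = StandardScaler().fit_transform(toy)
toy_pca = PCA(n_components=3).fit(toy_scaled)

ratios = toy_pca.explained_variance_ratio_
print(ratios)           # one value per component, in decreasing order
print(ratios.cumsum())  # cumulative share of variance explained
```

Because the three toy columns share one strong signal, most of the variance should land on the first component. When all components are kept, the ratios add up to 1.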

The first component's loadings are roughly constant across the three variables, so it captures the variation shared by all of them.

In [None]:
first_component = pd.DataFrame(pca.components_, columns=metrics)[:1]
first_component.transpose().sort_values(by=[0], ascending =False)

The second component captures the largest share of the remaining variance, orthogonal to the first. As we can see, **atmosphere** has the highest loading.

In [None]:
second_component = pd.DataFrame(pca.components_, columns=metrics)[1:2]
second_component.transpose().sort_values(by=[1], ascending =False)

The third component captures the variance left over by the other two. As we can see, **service** has the highest loading.

In [None]:
third_component = pd.DataFrame(pca.components_, columns=metrics)[2:3]
third_component.transpose().sort_values(by=[2], ascending =False)

What does this mean? The distribution of user reviews varies significantly with atmosphere and level of service.

As we wanted to show, the quality of the food is not a determining factor in the variation within this dataset.
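
One way to double-check this kind of reading is to look at all loadings side by side instead of one component at a time. The sketch below again uses synthetic data; the column names `service`, `atmosphere`, `food` and the row labels `PC1` to `PC3` are hypothetical stand-ins, not the real dataset. It builds a labeled table from `components_` and reports the feature with the largest absolute loading on each component.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic ratings with a shared signal (assumption: illustrative only).
rng = np.random.default_rng(1)
shared = rng.normal(size=400)
toy = pd.DataFrame({
    'service': shared + 0.5 * rng.normal(size=400),
    'atmosphere': shared + 0.5 * rng.normal(size=400),
    'food': shared + 0.5 * rng.normal(size=400),
})

toy_pca = PCA(n_components=3).fit(StandardScaler().fit_transform(toy))

# Rows are components, columns are the original features.
loadings = pd.DataFrame(toy_pca.components_,
                        columns=toy.columns,
                        index=['PC1', 'PC2', 'PC3'])
print(loadings.round(2))

# The sign of a component is arbitrary, so compare absolute values.
print(loadings.abs().idxmax(axis=1))
```

Comparing absolute values matters because PCA can flip the sign of a whole component without changing its meaning, so sorting by the signed value alone can hide a large negative loading.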

In [None]:
ps = pd.DataFrame(normalize(pca_samples))

# Zero-based column indices: cluster on the second and third principal components
a = 1
b = 2

tocluster = pd.DataFrame(ps[[a,b]])

fig = plt.figure(figsize=(8,8))
plt.plot(tocluster[a], tocluster[b], 'o', markersize=2, color='blue', alpha=0.5, label='')

plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.show()

In [None]:
n_clusters=15
cost=[]
for i in range(1,n_clusters):
    kmean= KMeans(i)
    kmean.fit(tocluster)
    cost.append(kmean.inertia_) 
    
# cost function (inertia): the sum of squared Euclidean distances from each point to its nearest centroid
plt.plot(range(1, n_clusters), cost, 'bx-')
# I choose 3 clusters because the inertia decreases more slowly beyond that point (the "elbow")

In [None]:
clusterer = KMeans(n_clusters=3, random_state=42).fit(tocluster)
centers = clusterer.cluster_centers_
c_preds = clusterer.predict(tocluster)

fig = plt.figure(figsize=(8,8))
colors = ['orange','blue','green']
colored = [colors[k] for k in c_preds]

plt.scatter(tocluster[a],tocluster[b],  color = colored)
for ci, c in enumerate(centers):
    plt.plot(c[0], c[1], 'X', markersize=15, color='red', alpha=0.9, label=''+str(ci))
    
plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.show()

In [None]:
final_cluster = rev_mt.copy()

final_cluster['cluster'] = c_preds

final_cluster.head(10)

In [None]:
c0_count = len(final_cluster[final_cluster['cluster']==0])

c0 = final_cluster[final_cluster['cluster']==0].drop('cluster',axis=1).mean()

c1 = final_cluster[final_cluster['cluster']==1].drop('cluster',axis=1).mean()

c2 = final_cluster[final_cluster['cluster']==2].drop('cluster',axis=1).mean()

In [None]:
print("First Cluster: People who love atmosphere and good food")
print(c0.sort_values(ascending=False))
print("")

print("Second Cluster: People who love just food and good service")
print(c1.sort_values(ascending=False))
print("")

print("Third Cluster: People who prefer service and atmosphere rather than food")
print(c2.sort_values(ascending=False))
print("")

In [None]:
# Next step: we have created clusters based on user reviews. But do restaurants invest well according to this insight?