In this project I explore a dataset from a Starbucks Malaysia customer survey, analyse its main features and findings, predict whether customers will visit Starbucks again using their ratings, and use a k-means clustering algorithm to segment customers into groups.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

In [None]:
Starbucks=pd.read_csv("../input/starbucks-customer-retention-malaysia-survey/Starbucks satisfactory survey.csv")

In [None]:
Starbucks.head(10)

In [None]:
Starbucks.info()

We first drop the columns that do not contribute to the analysis of customer behaviour and rename the remaining columns.


In [None]:
Starbucks=Starbucks.drop('Timestamp',axis=1)
Starbucks.columns=['Gender','Age','Status','Income','Frequency','Method','timepervisit','nearest','membership','fequencyofpurchase','spending','comparerate','pricerate','promotion','rateambiance','Wifi','rateservice','situational','source','loyalty']

In [None]:
Starbucks.info()

In [None]:
Starbucks=Starbucks.dropna()


In [None]:
Starbucks.describe()

We can see that the average price rating is 3, the rating compared with other coffee retailers is 3.6, the Wi-Fi rating is 3.2, and the ambiance and service ratings are both 3.7.
Overall the ratings are decent. We can examine how closely each rating is associated with whether the customer will visit again, but first we transform the loyalty column to 1 and 0, with 1 indicating that the customer will visit again and 0 that they will not.
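
The yes/no encoding described above can be sketched with an explicit mapping. The small DataFrame below is made-up illustrative data, not the survey, and the label spellings (`"Yes"`, `"yes "`) are assumptions chosen to show why normalising the text first can help.

```python
import pandas as pd

# Hypothetical toy answers; the real survey's label spelling may differ.
toy = pd.DataFrame({"loyalty": ["Yes", "No", "yes ", "No"]})

# Normalise case and whitespace first, so "Yes" and "yes " map the same way.
cleaned = toy["loyalty"].str.strip().str.lower()
toy["loyalty_flag"] = cleaned.map({"yes": 1, "no": 0})

# map() leaves NaN for any label missing from the dictionary, so this
# count reveals answers that were not encoded.
unmapped = int(toy["loyalty_flag"].isna().sum())
print(toy)
print("unmapped answers:", unmapped)
```

Checking the unmapped count before modelling is a cheap way to catch labels that were silently left as text.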


In [None]:
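# Assign the result back instead of calling inplace=True on an attribute-accessed column
Starbucks['loyalty']=Starbucks['loyalty'].replace(('yes','no'),(1,0))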

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix
log=LogisticRegression()
x=Starbucks[['comparerate','pricerate','promotion','rateambiance','Wifi','rateservice']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x)
y=Starbucks['loyalty']
model=log.fit(X_scaled,y)
np.array(model.coef_)



We can see that the price rating has the largest coefficient, followed by the rating relative to other coffee sellers, while the Wi-Fi and service ratings have, surprisingly, negative coefficients. Because the features were standardised, the coefficient magnitudes can be compared directly, although they describe an association with the log-odds of returning rather than a simple correlation. This suggests that Starbucks' pricing strategy is crucial and that Starbucks should continue to strengthen its competitive advantage in order to gain more support from customers. While pricing appears particularly important, its average rating is the lowest of all the rating variables, so Starbucks may need to reconsider its pricing strategy and its focus.
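
To make the coefficient reading above more concrete, here is a minimal sketch of turning logistic regression coefficients into odds ratios. It uses synthetic data (not the survey), and the names `rating_a`, `rating_b` and the assumed effect size are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for two 1-5 ratings (NOT the survey data).
ratings = rng.integers(1, 6, size=(500, 2)).astype(float)
# Assumed ground truth: only the first rating drives the outcome.
logit = 1.2 * (ratings[:, 0] - 3.0)
visits_again = (rng.random(500) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_std = StandardScaler().fit_transform(ratings)
clf = LogisticRegression().fit(X_std, visits_again)

# exp(coefficient) = multiplicative change in the odds of y=1 for a
# one-standard-deviation increase in that feature, other features held fixed.
odds_ratios = np.exp(clf.coef_[0])
print(dict(zip(["rating_a", "rating_b"], odds_ratios.round(2))))
```

An odds ratio near 1 means the feature barely shifts the odds, while a value below 1 corresponds to a negative coefficient.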


Now let's look at the age, status and income factors.


In [None]:
Starbucks.Age.value_counts()
Starbucks['Age'].value_counts().plot(kind='barh')

In [None]:
Starbucks['Status'].value_counts().plot(kind='barh')

In [None]:
Starbucks['Income'].value_counts().plot(kind='barh')

We can see that most customers are aged 20-29, are mostly employed, and have an income of less than RM 25,000. This might mean that Starbucks should tailor its prices towards lower-income groups and maintain its unique selling point of being fast and convenient, serving the majority of its customers, who are employed and fairly young.

Now I would like to see which of these three factors has the most influence on the frequency of purchase and the amount spent each time.

In [None]:
Starbucks.Age.value_counts()

In [None]:
Starbucks['Age']=Starbucks['Age'].replace(('From 20 to 29','From 30 to 39','Below 20','40 and above'),(2,3,1,4))

In [None]:
Starbucks.Status.value_counts()

In [None]:
Starbucks['Status']=Starbucks['Status'].replace(('Employed','Student','Self-employed','Housewife'),(2,1,3,4))

In [None]:
Starbucks.Income.value_counts()

In [None]:
# Codes follow income order: 'RM100,000 - RM150,000' is 4 and 'More than RM150,000' is 5
Starbucks['Income']=Starbucks['Income'].replace(('Less than RM25,000','RM25,000 - RM50,000','RM50,000 - RM100,000','More than RM150,000','RM100,000 - RM150,000'),(1,2,3,5,4))

In [None]:
Starbucks.head()

In [None]:
Starbucks.spending.value_counts()

In [None]:
Starbucks['spending']=Starbucks['spending'].replace(('Less than RM20','Around RM20 - RM40','Zero','More than RM40'),(2,3,1,4))

In [None]:
from sklearn.linear_model import LinearRegression
lm=LinearRegression()
x1=Starbucks[['Age','Status','Income']]
X1_scaled = scaler.fit_transform(x1)
y1=Starbucks[['spending']]
Y1_scaled=scaler.fit_transform(y1)
model2=lm.fit(X1_scaled,Y1_scaled)
model2.coef_

There is a weak positive relationship between amount spent and age: as age increases, the amount spent increases, but the effect is not very strong. Moreover, this only measures the linear relationship, whereas intuitively we might expect the middle age groups (20-29 and 30-39) to spend more at Starbucks, so it is useful to look at the relationship in plots.
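
One quick numerical check for a non-linear pattern of the kind discussed above is to compare mean spending per age group. The sketch below uses hypothetical toy rows with the same integer codes as this notebook, not the survey data.

```python
import pandas as pd

# Hypothetical toy rows using the notebook's integer codes
# (Age: 1-4, spending: 1-4); NOT the survey data.
toy = pd.DataFrame({
    "Age":      [1, 1, 2, 2, 2, 3, 3, 4, 4],
    "spending": [1, 2, 2, 2, 3, 3, 2, 3, 4],
})

# Mean and count per age group; a non-monotonic pattern across groups
# would be missed by a single linear slope.
summary = toy.groupby("Age")["spending"].agg(["mean", "count"])
print(summary)
```

The count column matters as much as the mean, since a group with only a handful of respondents gives an unreliable average.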

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.boxplot(x=Starbucks['Age'],y=Starbucks['spending'])

The plot still shows a weak relationship between age and spending, though it may not be very significant. Using OLS to do some statistical inference on age, income, status and spending gives the following result.

In [None]:
model3=sm.OLS.from_formula("spending~Income+Age+Status",data=Starbucks)
result3=model3.fit()
result3.summary()

With the large F statistic we can infer that at least one variable is significant, and the summary shows that the p-value for age is 0.009, which is significant at the 10% level (and indeed at the 1% level), with a 95% confidence interval of 0.089 to 0.529. Thus, although most of Starbucks' customers are between 20 and 30 years old, the positive coefficient suggests that older customers tend to spend more per visit, with the 40-and-above group spending the most under this linear fit.
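
The model above treats the age code as a single linear term. As a hedged alternative, the sketch below treats age as categorical using dummy variables, so each group gets its own effect. It swaps in scikit-learn's `LinearRegression` for brevity and uses hypothetical toy rows, not the survey data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy rows with the notebook's integer codes; NOT the survey data.
toy = pd.DataFrame({
    "Age":      [1, 1, 2, 2, 3, 3, 4, 4],
    "spending": [1, 2, 2, 3, 3, 3, 2, 3],
})

# One dummy column per age group except the first, which becomes the baseline.
dummies = pd.get_dummies(toy["Age"], prefix="age", drop_first=True).astype(float)
fit = LinearRegression().fit(dummies, toy["spending"])

# Each coefficient is the difference in mean spending code between that
# age group and the baseline group (Age == 1).
effects = pd.Series(fit.coef_, index=dummies.columns)
print("baseline mean:", fit.intercept_)
print(effects)
```

If the group effects do not rise steadily with age, a single slope for age is a poor summary of the relationship.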


Now we will divide the data into training and testing sets and try to predict whether the customer will visit again using the ratings.

In [None]:
from sklearn.model_selection import train_test_split


In [None]:
xtrain,xtest,ytrain,ytest=train_test_split(X_scaled,y,test_size=0.3,random_state=0)
model4=log.fit(xtrain,ytrain)
np.array(model4.coef_)
prediction=model4.predict(xtest)
from sklearn import metrics
print(metrics.accuracy_score(ytest,prediction))




The accuracy is 78%, which is quite decent. However, since the promotion and Wi-Fi ratings did not appear to be very significant in the earlier output, we might try running the model again without these two variables.

In [None]:
xless=Starbucks[['comparerate','pricerate','rateambiance','rateservice']]
Xless_scaled = scaler.fit_transform(xless)
xlesstrain,xlesstest,ytrain,ytest=train_test_split(Xless_scaled,y,test_size=0.3,random_state=0)
model5=log.fit(xlesstrain,ytrain)
np.array(model5.coef_)
predictionless=model5.predict(xlesstest)
# Score the reduced model's own predictions, not those of the earlier full model
print(metrics.accuracy_score(ytest,predictionless))



If the accuracy of the reduced model stays close to that of the full model, this suggests, though it does not prove, that customers' attention to promotions and their Wi-Fi ratings add little to predicting a return visit, and that the model can predict reasonably well whether a customer will visit Starbucks again from their ratings of price, ambiance and service and their rating relative to other coffee retailers.
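
A single train/test split can make two models look identical or different by chance. The following sketch, on synthetic data rather than the survey, compares a full and a reduced feature set with 5-fold cross-validation; the feature layout and effect sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in: 4 rating-like features, only the first two matter.
X = rng.integers(1, 6, size=(400, 4)).astype(float)
logit = 1.0 * (X[:, 0] - 3) + 0.8 * (X[:, 1] - 3)
y = (rng.random(400) < 1 / (1 + np.exp(-logit))).astype(int)

# Putting the scaler inside the pipeline refits it on each training fold,
# so no information from the held-out fold leaks into the scaling.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

full_scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
reduced_scores = cross_val_score(pipe, X[:, :2], y, cv=5, scoring="accuracy")

print("full   : %.3f +/- %.3f" % (full_scores.mean(), full_scores.std()))
print("reduced: %.3f +/- %.3f" % (reduced_scores.mean(), reduced_scores.std()))
```

If the two means differ by less than the fold-to-fold spread, the dropped features are probably not contributing much.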

To segment customers based on the rating figures we have, we can first apply PCA and then k-means clustering:

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X_scaled)

cluster_range = range(1, 10,1)
cluster_errors = []

for num_clusters in cluster_range:
  clusters = KMeans( num_clusters )
  clusters.fit( principalComponents )
  cluster_errors.append( clusters.inertia_ )


clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors": cluster_errors } )
clusters_df[0:10]



In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.lineplot(x=clusters_df['num_clusters'], y=clusters_df['cluster_errors'])


Using the elbow method on this plot, 2 to 4 clusters look most suitable; since we only have 6 rating features, we choose 3 clusters.
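
Reading an elbow by eye is subjective, so it can help to print how much the inertia falls at each step. The sketch below does this on synthetic points generated with `make_blobs` (not the survey data); fixing `random_state` and `n_init` is an assumption made here for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points with 3 groups (NOT the survey data).
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

ks = list(range(1, 8))
inertias = []
for k in ks:
    # Fixed random_state and explicit n_init keep the curve reproducible.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(km.inertia_)

# Relative drop in inertia when going from k-1 to k clusters; the elbow is
# roughly where these drops become small.
drops = -np.diff(inertias) / np.array(inertias[:-1])
for k, d in zip(ks[1:], drops):
    print(f"k={k}: inertia falls by {d:.0%}")
```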

In [None]:
kmeans = KMeans(n_clusters=3)
model= kmeans.fit(principalComponents)
y_kmeans = kmeans.predict(principalComponents)
y_kmeans


In [None]:
table=pd.DataFrame(data=[principalComponents[:, 0],principalComponents[:,1],y_kmeans])
table=table.transpose()
table.columns=['x','y','group']
table.head()

In [None]:

sns.lmplot(data=table,x='x',y='y',hue='group',fit_reg=False,legend=True)


In [None]:
pca.components_
df = pd.DataFrame(pca.components_,columns=('comparerate','pricerate','promotion','rateambiance','Wifi','rateservice'))
df

From the k-means plot we can see that the main determinant of a customer's group is the "x" value, which corresponds to the first PCA component, while the "y" value (the second PCA component) has very little influence on the grouping. Looking more closely at the PCA result, the ambiance and service ratings have the largest loadings in component 1, while the rating for how much the customer's decision depends on promotions has very little weight in it. We may therefore infer that customers could potentially be separated by how highly they rate the ambiance of Starbucks stores and the service Starbucks provides.
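
The loading interpretation above can be illustrated with a small sketch. The data are synthetic (two correlated columns and one independent column), and the column names are borrowed from this notebook purely as labels, so the numbers say nothing about the survey itself.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic stand-in for three ratings (NOT the survey data): the first two
# share a common factor, the third is independent noise.
common = rng.normal(size=300)
data = pd.DataFrame({
    "rateambiance": common + 0.3 * rng.normal(size=300),
    "rateservice":  common + 0.3 * rng.normal(size=300),
    "promotion":    rng.normal(size=300),
})

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(data))

# Rows are components, columns are the original features; a large absolute
# loading means the feature contributes strongly to that component.
loadings = pd.DataFrame(pca.components_, columns=data.columns, index=["PC1", "PC2"])
print(loadings.round(2))
# Share of total variance captured by each component.
print(pca.explained_variance_ratio_.round(2))
```

Checking `explained_variance_ratio_` alongside the loadings shows how much of the overall variation the first component actually captures before leaning on it for segmentation.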