# Starbucks Customer Survey Data Analysis

Starbucks is a multinational coffee chain, the largest in the world, that originated in Seattle, Washington. With a brand recognized worldwide, Starbucks menus offer a variety of hot drinks, like whole-bean coffee, espresso and latte, and cold drinks, like cold brew coffee and the famous Frappuccinos, alongside other types of beverages and snacks.

---
## Objectives

In this notebook, a dataset containing answers to a Starbucks customer survey is analysed, with the main objective of understanding the characteristics of the recurrent customers of Starbucks and how they differ from the non-recurrent ones. To this end, some secondary objectives are defined in the form of questions:

* Who are the recurrent customers of Starbucks? And who are the non-recurrent customers?
* How do these two classes of customers differ from one another?
* What are the strong points of Starbucks according to the recurrent customers? And the weakest ones?

---
## Libraries

First, the `prince` library, which provides the MCA implementation used later, is installed.

In [None]:
!pip install prince

Here, all the libraries and frameworks used throughout the notebook are imported.

In [None]:
import numpy as np  
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import prince
from wordcloud import WordCloud, STOPWORDS
from PIL import Image

sns.set_style('darkgrid')

---
## Read Data

First, let's read the data and make an initial check of its format and the presence of null values.

In [None]:
data = pd.read_csv('../input/starbucks-customer-retention-malaysia-survey/Starbucks satisfactory survey.csv')

In [None]:
data.head()

In [None]:
data.info()

Despite being a relatively small dataset, the data appears to be almost complete, with few missing values.
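
The completeness check above can also be made explicit by counting missing values per column. The following is a minimal sketch on a toy dataframe; the column names and values are hypothetical stand-ins, not the real survey data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Preferred Form': ['Dine in', np.nan, 'Take away', 'Dine in'],
    'Promotions': ['Social Media', 'Emails', np.nan, np.nan],
})

# Number of missing values per column, keeping only columns that have any
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]

# Share of rows affected, as a percentage
missing_share = (missing_counts / len(toy) * 100).round(1)
print(pd.DataFrame({'missing': missing_counts, 'percent': missing_share}))
```

Applied to `data`, the same two lines would list exactly which survey questions have unanswered rows.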

---
## Pre-Processing

At first glance, the data seems well formatted, but a few steps are still worth taking to make the later sections easier.

The first thing to notice is that the column names are too long. To make dataframe visualizations easier to read later on, shorter names are preferred.

In [None]:
rename_dict = {
    '1. Your Gender' : 'Gender',
    '2. Your Age' : 'Age',     
    '3. Are you currently....?' : 'Profession',
    '4. What is your annual income?' : 'Income',
    '5. How often do you visit Starbucks?' : 'Visit Frequency',
    '6. How do you usually enjoy Starbucks?' : 'Prefered Form of Consumption',
    '7. How much time do you normally  spend during your visit?' : 'Time Spent on Visit',
    "8. The nearest Starbucks's outlet to you is...?" : 'Distance to Store',
    '9. Do you have Starbucks membership card?' : 'Membership',
    '10. What do you most frequently purchase at Starbucks?' : 'Most consumed Product',
    '11. On average, how much would you spend at Starbucks per visit?' : 'Spend per Visit',
    '12. How would you rate the quality of Starbucks compared to other brands (Coffee Bean, Old Town White Coffee..) to be:' : 'Quality Rate',
    '13. How would you rate the price range at Starbucks?' : 'Price Range Rate',
    '14. How important are sales and promotions in your purchase decision?' : 'Sales and Promotion Importance',
    '15. How would you rate the ambiance at Starbucks? (lighting, music, etc...)' : 'Ambiance Rate',
    '16. You rate the WiFi quality at Starbucks as..' : 'WiFi Quality Rate',
    '17. How would you rate the service at Starbucks? (Promptness, friendliness, etc..)' : 'Service Rate',
    '18. How likely you will choose Starbucks for doing business meetings or hangout with friends?' : 'Likely for Meetings or Hangouts',
    '19. How do you come to hear of promotions at Starbucks? Check all that apply.' : 'Form of communication to Promotions',
    '20. Will you continue buying at Starbucks?' : 'Recurrent Costumer'
}

In [None]:
data = data.rename(rename_dict,axis=1)
data.head()

To check that answers are formatted consistently and that the same answer doesn't appear under several spellings, let's count the unique values in each column. An unreasonably large number of distinct values in a column could indicate inconsistent formatting.

In [None]:
for column in data.columns:
    print('{} column: {} unique values'.format(column,data[column].nunique()))

Checking the columns with more than 5 unique values reveals some redundancy. Let's fix it.

In [None]:
data['Prefered Form of Consumption'] = data['Prefered Form of Consumption'].replace({'Never':'None','never':'None','Never buy':'None','Never ':'None','I dont like coffee':'None',np.nan:'None'})
data['Most consumed Product'] = data['Most consumed Product'].replace({'Never buy any':'Nothing','never':'Nothing','Never':'Nothing','Jaws chip ':'Jaws Chip','cake ':'Cake'})
data['Form of communication to Promotions'] = data['Form of communication to Promotions'].replace({np.nan:'Never hear'})
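
Many of the duplicates handled above differ only in letter case or trailing spaces. As a hedged alternative to listing every variant by hand, a generic normalization can collapse them; the sketch below uses hypothetical raw answers rather than the actual survey values.

```python
import pandas as pd

# Hypothetical raw answers with the kind of inconsistencies fixed above
raw = pd.Series(['Never', 'never', 'Never ', 'Cake', 'cake ', 'Coffee'])

# Strip surrounding whitespace and unify letter case before counting
normalized = raw.str.strip().str.capitalize()

print(raw.nunique(), '->', normalized.nunique())
print(normalized.value_counts())
```

Note that `str.capitalize` lowercases everything after the first letter, so a label such as "Jaws Chip" would become "Jaws chip". Answers that differ in wording (for example "Never buy any") still need an explicit mapping like the one used in the notebook.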

In the next step, the rating columns, which are stored as integers, are cast to the `object` dtype. The reason is that the clustering framework used in later sections is not compatible with integer columns. Note that this changes the dtype of the column rather than turning each value into a string.

In [None]:
int_columns = ['Quality Rate','Price Range Rate','Ambiance Rate','WiFi Quality Rate','Service Rate','Likely for Meetings or Hangouts','Sales and Promotion Importance']

In [None]:
for column in int_columns:
    data[column] = data[column].astype(object)
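
To illustrate why the dtype matters, the sketch below shows how one-hot encoding with `pd.get_dummies` treats the same ratings differently depending on dtype. That the clustering library encodes its input in a similar way is an assumption here, and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical ratings, stored as integers
toy = pd.DataFrame({'Service Rate': [5, 4, 5, 3]})

# Integer column: get_dummies leaves it as a single numeric column
as_int = pd.get_dummies(toy)

# Object column: each rating becomes its own indicator column
as_obj = pd.get_dummies(toy.astype(object))

print(list(as_int.columns))
print(list(as_obj.columns))
```

With the `object` dtype, each rating value is treated as a separate category, which is what a categorical method like MCA needs.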

---
## Exploratory Analysis

After pre-processing the data, an initial exploratory analysis can give a first understanding of it. Let's focus on the more general characteristics of the customers, starting with Gender.

For these initial plots, a simple function is created to avoid repeated code.

In [None]:
def plot_bar(feature,figsize):
    plot_data = organize_plot_data(feature)
    generate_plot(plot_data,feature,figsize)

def organize_plot_data(feature):
    plot_data = data[['Timestamp',feature]] #Timestamp is used only for counting the occurrences
    plot_data = plot_data.groupby(feature).count()
    plot_data = plot_data.reset_index()
    plot_data.columns = [feature,'Counts'] 
    return plot_data

def generate_plot(plot_data,feature,figsize):
    plt.figure(figsize=figsize)
    plt.bar(x = plot_data[feature], height = plot_data['Counts'])
    plt.title('Customers by {}'.format(feature))
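
The counting step in `organize_plot_data` relies on the `Timestamp` column being filled in for every row. A possible alternative, sketched here on hypothetical toy data, counts the answers directly with `value_counts` and produces the same two-column layout.

```python
import pandas as pd

# Hypothetical answers for a single feature
toy = pd.DataFrame({'Gender': ['Female', 'Male', 'Female', 'Female', 'Male']})

# Count each answer, sort by label and return a [feature, 'Counts'] dataframe
counts = (toy['Gender']
          .value_counts()
          .sort_index()
          .rename_axis('Gender')
          .reset_index(name='Counts'))
print(counts)
```

One difference to keep in mind: `value_counts` ignores rows where the feature itself is missing, while the `groupby` version counts rows with a non-null `Timestamp`.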

In [None]:
plot_bar(feature='Gender',figsize=(6,8))

The data seems well balanced in terms of Gender, which is a good sign for the representativeness of the dataset. Next, let's check the age distribution.

To make the Age answers sort in their natural order, let's prefix each possible answer with a number.

In [None]:
data['Age'] = data['Age'].replace({'Below 20' : '1. Below 20',
                                    'From 20 to 29' : '2. From 20 to 29',
                                    'From 30 to 39' : '3. From 30 to 39',
                                    '40 and above' : '4. 40 and Above'})
data = data.sort_values('Age')
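
Prefixing the labels works, but it changes the text shown on the plots. An alternative worth considering is an ordered categorical, which keeps the original labels and still sorts them in the intended order. This is only a sketch on toy data; the category list is assumed to match the survey answers.

```python
import pandas as pd

# Hypothetical answers, in arbitrary order
toy = pd.DataFrame({'Age': ['From 20 to 29', '40 and above', 'Below 20',
                            'From 30 to 39', 'From 20 to 29']})

# Declare the intended order once, then let pandas sort by it
age_order = ['Below 20', 'From 20 to 29', 'From 30 to 39', '40 and above']
toy['Age'] = pd.Categorical(toy['Age'], categories=age_order, ordered=True)

toy = toy.sort_values('Age')
counts = toy['Age'].value_counts().sort_index()
print(counts)
```

The same idea would apply to the Income and Spend per Visit columns, each with its own ordered list of answers.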

In [None]:
plot_bar(feature ='Age', figsize = (10,8))

The majority of Starbucks customers clearly fall in the 20-29 age range. Let's now check the Profession and Income distributions.

In [None]:
plot_bar(feature ='Profession', figsize = (10,8))

The Income values suffer from the same sorting problem as the Age data. Let's relabel them in the same way.

In [None]:
data['Income'] = data['Income'].replace({'Less than RM25,000' : '1. Less than RM25,000',
                                        'RM25,000 - RM50,000' : '2. RM25,000 - RM50,000',
                                        'RM50,000 - RM100,000' : '3. RM50,000 - RM100,000',
                                        'RM100,000 - RM150,000' : '4. RM100,000 - RM150,000',
                                        'More than RM150,000' : '5. More than RM150,000'})
data = data.sort_values('Income')

In [None]:
plot_bar(feature ='Income', figsize = (15,8))

It seems most of the customers are Employees and Students with an income below RM 25,000 (RM stands for Malaysian Ringgit, the currency of Malaysia). This may have a direct impact on features like how much a customer spends per visit and how many customers hold a membership card, and an indirect influence on features like the importance of sales and promotions in the purchase decision and the price range rating. Let's briefly plot the first two.

In [None]:
data['Spend per Visit'] = data['Spend per Visit'].replace({'Zero' : '1. Zero',
                                        'Less than RM20' : '2. Less than RM20',
                                        'Around RM20 - RM40' : '3. Around RM20 - RM40',
                                        'More than RM40' : '4. More than RM40'})
data = data.sort_values('Spend per Visit')

In [None]:
plot_bar(feature ='Spend per Visit', figsize = (10,8))

In [None]:
plot_bar(feature ='Membership', figsize = (6,8))

In fact, the concentration of customers in the lower income ranges doesn't seem to affect their consumption behaviour. Customers tend to spend typical amounts for a coffee shop, and the number of members is almost balanced with the number of non-members, which seems reasonable.

---
## Customer Clustering

To try to answer the questions defined in the Objectives section, especially the first two, this section uses a technique called Multiple Correspondence Analysis (MCA). As the name suggests, MCA extends a simpler technique called Correspondence Analysis, whose objective is to find the association between two categorical variables, to more than two variables. MCA computes the associations between the categories of multiple columns and shows the results as points in a reduced space, generally two-dimensional, where categories close to each other are probably strongly associated and categories on opposite sides are probably negatively associated.

With this technique, we may find associations between features, both to better understand the characteristics of customers and to understand which of these features are more related to recurrent customers.
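
To make the idea more concrete, the sketch below computes the category coordinates by hand with NumPy: it applies plain correspondence analysis to the one-hot (indicator) matrix, which is the textbook formulation of MCA. The toy answers are hypothetical, and the library used in this notebook may apply additional scaling or corrections, so the numbers are not expected to match its output exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers (not taken from the survey)
toy = pd.DataFrame({
    'Age': ['20-29', '20-29', '30-39', 'Below 20', '30-39', '20-29'],
    'Profession': ['Employed', 'Student', 'Employed', 'Student', 'Employed', 'Employed'],
})

# 1. Indicator (one-hot) matrix: one column per category
Z = pd.get_dummies(toy).astype(float)

# 2. Correspondence matrix with its row and column masses
P = Z.values / Z.values.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)

# 3. Standardized residuals with respect to the independence model r * c^T
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# 4. SVD of the residuals; squared singular values are the principal inertias
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# 5. Principal coordinates of the categories, keeping the first two axes
col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
coords = pd.DataFrame(col_coords[:, :2], index=Z.columns, columns=['Dim 1', 'Dim 2'])
print(coords.round(3))

# Total inertia of an indicator matrix is (J - Q) / Q for J categories and Q variables
total_inertia = (sigma ** 2).sum()
print(round(total_inertia, 3))
```

Categories that tend to be chosen by the same respondents end up with similar coordinates, which is why nearby points in the plots are read as associated.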

In [None]:
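def plot_mca(selected_columns,rows,columns,figsize):
    # Fit an MCA on the selected categorical columns
    mca_data = data[selected_columns]
    mca = prince.MCA()
    mca.fit(mca_data)

    # Plot the categories (and optionally the respondents) on the first two components.
    # Note: plot_coordinates belongs to older prince releases; newer releases may expose a different plotting API.
    mca.plot_coordinates(mca_data,
                     show_row_points=rows,
                     show_column_points=columns,
                     show_column_labels=True,
                     figsize=figsize,
                    );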

First, let's plot the columns related to demographic information about the customers.

In [None]:
selected_columns = ['Gender', 'Age', 'Profession', 'Income']

In [None]:
plot_mca(selected_columns,rows=False,columns=True,figsize=(10,10))

Besides reaffirming some insights from the previous section, this plot shows an expected but still interesting relation between columns: income seems closely related to the profession and age of the customers. For example, employed customers aged 20 to 39 are closer to the higher income points in the graph.

Next, let's plot the features related to the customers' visit experience.

In [None]:
selected_columns = ['Visit Frequency','Prefered Form of Consumption', 'Time Spent on Visit','Spend per Visit','Distance to Store']

In [None]:
plot_mca(selected_columns,rows=False,columns=True,figsize=(10,10))

This plot presents a lot of interesting information. Let's go through it point by point:

* First, close to the center, there are customers who visit Starbucks daily and spend up to one hour on each visit.

* Close to them, there's a group who rarely visit Starbucks, spend less than RM 20 and generally only take away.

* Next, a monthly visit group can be found, with visits of more than 3 hours, spending of RM 20 - RM 40 and a preference for the Dine-In or Drive-Thru experience.

* Farther from the center, in the upper region, there's a group of weekly visitors who spend between 1 and 3 hours on each visit and actually spend the most among the customers.

* Finally, on the right there's the group that never visits a Starbucks store.

Also, there's an interesting pattern in the distance to the store: the monthly visitors group is associated with customers who live farther from a store than the rare visitors, who live within 1 km of one.

Next, let's plot some rating features related to the service itself.

In [None]:
selected_columns = ['Quality Rate','Price Range Rate','Service Rate','Ambiance Rate', 'WiFi Quality Rate','Likely for Meetings or Hangouts']

In [None]:
plot_mca(selected_columns,rows=False,columns=True,figsize=(10,10))

It's interesting to see that similar ratings are generally close together (5's are close to 5's, 2's are close to 2's and so on). Unfortunately, this pattern doesn't tell us much.

Now, let's redo the first two MCA plots, which were the most informative, this time also including the "Recurrent Costumer" column. The objective is to discover the characteristics of the customers who are likely to come back to Starbucks and of those who don't think they will return.

In [None]:
selected_columns = ['Gender', 'Age', 'Profession', 'Income','Recurrent Costumer']

In [None]:
plot_mca(selected_columns,rows=False,columns=True,figsize=(10,10))

In this plot, the recurrent customer is associated with employed people in the 20 to 29 age range with an income of RM 25,000 - RM 100,000. In contrast, the younger student customers, with smaller incomes, are less likely to come back to Starbucks.

Next, let's check the visit experience features under the same conditions.

In [None]:
selected_columns = ['Visit Frequency','Prefered Form of Consumption', 'Time Spent on Visit','Spend per Visit','Recurrent Costumer']

In [None]:
plot_mca(selected_columns,rows=False,columns=True,figsize=(10,10))

As expected, recurrent customers are more closely related to daily and monthly visitors. They are also related to short visits, below 30 minutes and usually by drive-thru, indicating good customer retention in this format. In contrast, customers who spend more time on each visit are generally less likely to come back.
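
Since proximity in an MCA plot is only suggestive, a normalized cross-tabulation is a simple way to double-check an association read from it. The sketch below uses hypothetical toy data; the "Yes"/"No" answer labels are an assumption about how the "Recurrent Costumer" column is coded.

```python
import pandas as pd

# Hypothetical answers; the 'Yes'/'No' labels are assumed
toy = pd.DataFrame({
    'Visit Frequency': ['Daily', 'Monthly', 'Rarely', 'Rarely', 'Monthly', 'Never'],
    'Recurrent Costumer': ['Yes', 'Yes', 'No', 'Yes', 'Yes', 'No'],
})

# Share of each answer within every visit-frequency group (each row sums to 1)
shares = pd.crosstab(toy['Visit Frequency'], toy['Recurrent Costumer'], normalize='index')
print(shares.round(2))
```

Running the same call on `data` for any pair of columns shows whether the groups that look close in the plot really differ in their share of returning customers.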

---
## Quality Rates Analysis

As the quality ratings didn't yield many insights in the last section, this section analyses these features using the raw data itself. For a better visualization, the plot_bar function used before is adapted so that several plots can share one figure.

In [None]:
def plot_bar(feature):
    plot_data = organize_plot_data(feature)
    generate_plot(plot_data,feature)

def organize_plot_data(feature):
    plot_data = data[['Timestamp',feature]] #Timestamp is used only for counting the occurrences
    plot_data = plot_data.groupby(feature).count()
    plot_data = plot_data.reset_index()
    plot_data.columns = [feature,'Counts'] 
    return plot_data

def generate_plot(plot_data,feature):
    plt.bar(x = plot_data[feature], height = plot_data['Counts'])
    plt.title('Customers by {}'.format(feature))

In [None]:
plt.figure(figsize=(20,8))
plt.subplot(1,5,1)
plot_bar(feature ='Quality Rate')
plt.subplot(1,5,2)
plot_bar(feature ='Price Range Rate')
plt.subplot(1,5,3)
plot_bar(feature ='Ambiance Rate')
plt.subplot(1,5,4)
plot_bar(feature ='WiFi Quality Rate')
plt.subplot(1,5,5)
plot_bar(feature ='Service Rate')

Most of the ratings show distributions skewed towards higher values (3 to 5), with the exception of the Price Range Rate, which has a roughly symmetric distribution and more 1 and 2 ratings than the other features evaluated by the customers. This may be related to the fact that most of the customers have lower incomes, as discussed before. It may also be related to the Sales and Promotion Importance feature. Let's check that.

In [None]:
plt.figure(figsize=(6,8))
plot_bar('Sales and Promotion Importance')

Compared to the distributions of the other ratings, the Sales and Promotion Importance seems even more skewed towards higher values, indicating its considerable importance to customers.
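
The visual comparison of the rating distributions can be backed by two simple numbers per column: the mean rating and the share of high ratings (4 or 5). This is a minimal sketch on hypothetical toy ratings; in this notebook the rating columns were cast to `object`, so they would need `.astype(int)` before such a summary.

```python
import pandas as pd

# Hypothetical ratings on the 1-5 scale
toy = pd.DataFrame({
    'Price Range Rate': [1, 2, 3, 3, 4, 5],
    'Service Rate': [3, 4, 4, 5, 5, 5],
})

# Mean rating and share of ratings that are 4 or 5, per column
summary = pd.DataFrame({
    'mean': toy.mean(),
    'share_4_or_5': (toy >= 4).mean(),
})
print(summary.round(2))
```

A column whose share of high ratings is clearly lower than the others stands out the same way the Price Range Rate does in the bar plots.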

---
## Sales and Promotions Form of Communication Analysis

Continuing on the topic of Sales and Promotions, let's check how they are being communicated to customers by creating a Word Cloud of the most cited communication channels.

In [None]:
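# Join the answers with a space so that words from consecutive answers do not fuse together
cloud_data = ' '.join(data['Form of communication to Promotions'].astype(str))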

In [None]:
cloud = WordCloud(background_color = "white", max_words = 200, stopwords = set(STOPWORDS))
wordcloud = cloud.generate(cloud_data)

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
ax.imshow(wordcloud, interpolation='bilinear')
ax.set_axis_off()

Clearly, communication between friends and word of mouth (that's what the big "word" on the plot stands for) are important ways of communicating promotions to customers.
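
A word cloud splits multi-word answers into single words, which is why "word" appears on its own. To count whole channels instead, the answers can be split on their separator and tallied. The sketch below uses hypothetical answers, and the `;` separator between channels is an assumption about how the "check all that apply" question was exported.

```python
from collections import Counter
import pandas as pd

# Hypothetical answers; the ';' separator between channels is an assumption
answers = pd.Series([
    'Social Media;Through friends and word of mouth',
    'Through friends and word of mouth',
    'Starbucks Website/Apps;Social Media',
    'Never hear',
])

# Split every answer into its channels and count each channel once per answer
channel_counts = Counter(
    channel.strip()
    for answer in answers
    for channel in answer.split(';')
)
print(channel_counts.most_common())
```

The resulting counts could be fed to a bar plot to rank the channels explicitly.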

---
## Conclusion

In this notebook, the Malaysian customers of Starbucks were analysed through a dataset from a survey about buying behaviour. For this task, the data was explored visually using bar plots and a dimensionality reduction technique called Multiple Correspondence Analysis, which was used to create a two-dimensional space where the associations between different categories of data could be easily seen.

With this, some useful insights about the customers could be obtained. First, from the initial exploratory analysis, we could conclude that:

* The customers are well balanced between Male and Female
* A large portion of the customers are younger people, between 20 and 29 years old, and are generally Employees and Students with incomes lower than RM 25,000
* The spend per visit generally involves typical values, less than RM 40
* The data is also well balanced in terms of Membership

From the MCA plots, we could obtain:

* The recurrent customers are clustered in three well-defined groups of Daily, Weekly and Monthly Visitors
* Daily Visitors tend to make short visits, up to one hour
* Monthly Visitors tend to consume by Drive-Thru, spending typical amounts on each visit
* Weekly Visitors, in turn, tend to make longer visits, between 1 and 3 hours, and spend more money on each visit
* Interestingly, the more regular visitors are associated with people living farther from a Starbucks store, as opposed to rare visitors, who live nearby

From the Quality Rates Analysis, we could learn that:
* The ratings related to Price Range are not as skewed towards higher values, indicating some dissatisfaction among the customers
* Possibly as a consequence, the Importance of Sales and Promotions to customers receives higher ratings

And finally, from the Word Cloud of Sales and Promotions communication, we could clearly see that the most relevant ways of advertising are through Friends and Word of Mouth.

From these conclusions, it is possible to make some suggestions, in the form of concise questions, with the objective of inspiring ideas to grow the sales of Starbucks in Malaysia:
* Considering that a large part of the customers are younger people, and that Sales and Promotions are very important to them, why not use more modern ways of communication, like social networks, which are heavily used by this audience?
* Considering this same portion of the customers, why not create stores with an ambience friendlier to students, like stores close to universities with quiet places where customers could study and drink coffee?
* Considering the clusters of customers according to the Frequency of Visits, would it be a good idea to create different Memberships for these different forms of consumption?
* Part of the recurrent customers seem to live far away from the stores. Is there any way to target this audience with promotions like Drive-Thru Sales, delivery services or promotions on home products like coffee capsules or instant coffee?
* And also, for the customers who live closer, who, as we saw, are associated with rare visits, why not promote Take-Away Sales that could target them?

---
## References

[How Correspondence Analysis Works (A Simple Explanation)](https://www.displayr.com/how-correspondence-analysis-works/) and [How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think)](https://www.displayr.com/interpret-correspondence-analysis-plots-probably-isnt-way-think/), awesome articles by [Tim Bock](https://www.displayr.com/author/tim-bock/) found on [DisplayR Blog](https://www.displayr.com/).