# DataCamp Certification Case Study

### Project Brief

You are on the data science team for a coffee company that is looking to expand their business into Ukraine. They want to get an understanding of the existing coffee shop market there.

You have a dataset from Google businesses. It contains information about coffee shops in Ukraine. The marketing manager wants to identify the key coffee shop segments. They will use this to construct their marketing plan. In their current location, they split the market into 5 segments. The marketing manager wants to know how many segments are in this new market, and their key features.

You will be presenting your findings to the Marketing Manager, who has no data science background.

The data you will use for this analysis can be accessed here: `"data/coffee_shops.csv"`

## Introduction

* After identifying **Ukraine** as a potential place to expand our coffee business into, let's **identify and analyze clusters** of similar coffee shops.
* This analysis will help us in:
    * **understanding the current landscape** of coffee shops in Ukraine
    * **learning what local customers** are attracted to
    * **positioning ourselves** to best attract customers
* Clustering analysis will be done on a dataset from Google businesses.

## Importing and cleaning the data

In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# read data
df = pd.read_csv('data/coffee_shops.csv')
# sort by region
df = df.sort_values(by='Region').reset_index(drop=True)
# inspect data
display(df.head())
df.info()

In [None]:
# features of coffee shop completeness by region
print('Before imputing missing values')
display(df.groupby('Region').count())

# numerical features
NUM_FEAT = ['Rating','Reviews']
# Rating, Reviews
# don't have many missing values --> impute with mean
# assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Rating'] = df['Rating'].fillna(df['Rating'].mean())
df['Reviews'] = df['Reviews'].fillna(df['Reviews'].mean())

# categorical features
CAT_FEAT = ['Region', 'Place name', 'Place type', 'Price', 'Delivery option', 'Dine in option', 'Takeout option']
# Region, Place name, Place type
# no missing values

# Price, Delivery option, Dine in option, Takeout option
# some missing values --> impute nulls with N/A
for cat in CAT_FEAT[3:]:
    # display(df.groupby('Region')[cat].value_counts())
    # print()
    # assign the result back instead of using inplace=True on a selected column
    df[cat] = df[cat].fillna('N/A')

print()
print('After imputing missing values')
display(df.groupby('Region').count())
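
The imputation strategy used above can be sketched on a small frame. The values below are hypothetical and are not taken from `coffee_shops.csv`; the sketch only shows that numerical nulls become the column mean and categorical nulls become an explicit `'N/A'` label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from coffee_shops.csv
toy = pd.DataFrame({
    'Rating': [4.5, np.nan, 4.9],
    'Reviews': [100.0, 300.0, np.nan],
    'Price': ['$$', np.nan, '$$'],
})

# numerical columns: replace nulls with the column mean
for col in ['Rating', 'Reviews']:
    toy[col] = toy[col].fillna(toy[col].mean())

# categorical column: replace nulls with an explicit 'N/A' label
toy['Price'] = toy['Price'].fillna('N/A')

print(toy)
print(toy.isna().sum())
```

Keeping `'N/A'` as its own category makes it visible in later counts how many places simply have no data for a feature.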

## EDA

### Dashboard

* Key insights
    * 200 coffee shops, 20 from each region
    * In every region, coffee shops that don't offer delivery outnumber those that do
    * Lviv and Kiev are the top two regions with the most coffee shop reviews
        * Most reviewed coffee shop, "Lviv Coffee Manufacture", has an overwhelming 17,937 reviews.
        * Distant second place, "Svit Kavy" also in Lviv, has 2,931 reviews.
    * All regions' coffee shops have median ratings around 4.6-4.8
        * These ratings may be susceptible to response bias

In [None]:
# Quick overview of the data

fig, ax = plt.subplots(2, 2, figsize=(20,10))

# plot number of coffee shops by region
sns.countplot(ax=ax[0][0], data=df, x='Region')
ax[0][0].set_title('Number of Coffee Shops by Region')

# plot count of shops with each delivery option in each region 
sns.countplot(ax=ax[0][1], data=df, x='Region', hue='Delivery option')
ax[0][1].set_title('Number of Coffee Shops with Delivery option')

# plot number of customer reviews by region
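# errorbar=None replaces the deprecated ci argument (seaborn >= 0.12; use ci=None on older versions)
sns.barplot(ax=ax[1][0], data=df, x='Region', y='Reviews', estimator=np.sum, errorbar=None)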
ax[1][0].set_title('Number of reviews by Region')

# plot distribution of ratings by region
sns.boxplot(ax=ax[1][1], data=df, x='Region', y='Rating')
ax[1][1].set_title('Boxplot of coffee shops ratings by region')

plt.show()

### Exploring other features

* Most places are categorized as "Coffee Shops" and "Cafes" with 97 and 59 establishments respectively.
* Looking at the place names, there are no major franchises in this dataset.
* 116 of the 122 places with a listed price range are in the middle range (\$\$).
* For dine-in and takeout options, all 140 places with data present offer both.
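
The counts above can also be expressed as shares of the total. As a minimal sketch, assuming a hypothetical series of place types (not the real data), `value_counts(normalize=True)` returns proportions directly:

```python
import pandas as pd

# Hypothetical place types, for illustration only
place_type = pd.Series(['Coffee shop', 'Coffee shop', 'Cafe', 'Espresso bar'])

# normalize=True returns fractions instead of raw counts
shares = place_type.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Applied to `df['Place type']`, the same call would give the percentage behind each count without manual division.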

In [None]:
# Top 5 most reviewed
print('Top 5 Most reviewed coffee shops')
display(df.sort_values(by='Reviews', ascending=False).head(5))

# Top 5 rated
print('Top 5 rated coffee shops')
display(df.sort_values(by='Rating', ascending=False).head(5))

In [None]:
# look at place types
print('Coffee shop types')
display(df['Place type'].value_counts().sort_values(ascending=False))
# 48.5% 'Coffee shop' (97 of 200)
# 29.5% 'Cafe' (59 of 200)

print()

# look at place names: are there any franchises?
print('Coffee shop names')
display(df['Place name'].value_counts().sort_values(ascending=False).head(10))

# NOTE: Some similar coffee shop names probably refer to the same establishment, registered differently.
# example:  'Dom Kofe' ---- 'Dom Kofe, Mah.'
#            'Don Marco' ---- 'Don Marco coffee shop'
#         Will not try to merge them: I'm not an expert in Ukrainian, and they don't appear to be major franchises.

In [None]:
for cat in CAT_FEAT[3:]:
    print('Count of category: {}'.format(cat))
    display(df[cat].value_counts().sort_values(ascending=False))
    print()
# All places with data offer both dine-in and takeout, which is plausible
# Almost all are in the middle price point
# Delivery option: 94 of the 129 places with data do not offer delivery

## Clustering

Compute clusters based on rating and number of reviews. <br>
These are the two numerical features available to us that indicate how customers feel about each place:
* rating - a higher value suggests that customers love the place
* reviews - a higher value suggests more customer traffic and a stronger social media presence <br>

Then we'll explore the characteristics of each cluster.<br>
We will use the K-means clustering algorithm. <br>


### Preprocessing

* remove one outlier, the extremely popular coffee shop (Lviv Coffee Manufacture)
* standardize the features so that they are on the same scale
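
To see why the outlier is removed before scaling, here is a minimal sketch with hypothetical review counts (not the real data). With one extreme value present, standardization squeezes all the ordinary shops into a very narrow band, so k-means can hardly tell them apart on that feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical review counts: four ordinary shops and one extreme outlier
reviews = np.array([[100.0], [200.0], [300.0], [400.0], [18000.0]])

# standardize with and without the outlier
scaled_all = StandardScaler().fit_transform(reviews)
scaled_trimmed = StandardScaler().fit_transform(reviews[:-1])

# spread of the four ordinary shops after scaling
spread_with_outlier = float(scaled_all[:-1].max() - scaled_all[:-1].min())
spread_without_outlier = float(scaled_trimmed.max() - scaled_trimmed.min())

print(spread_with_outlier, spread_without_outlier)
```

The spread of the ordinary shops is far smaller when the outlier takes part in the scaling, which is the distortion that dropping it avoids.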

In [None]:
# Inspect num features
fig, ax = plt.subplots(1, 2, figsize=(10,5))

sns.boxplot(ax=ax[0], x=df['Rating'])
# no extreme outliers
sns.boxplot(ax=ax[1], x=df['Reviews'])
# one extreme outlier present: Lviv Coffee Manufacture has far more reviews than any other shop

print('Boxplots of numerical features')
plt.show()

In [None]:
# Drop the single outlier since it would distort the k-means clustering
# we can study this very popular shop in depth separately if we want

print('Before dropping outlier')
display(df[NUM_FEAT].describe())
sns.scatterplot(data=df, x='Rating', y='Reviews')
plt.show()

# keep every row except the one with the maximum number of reviews
# (this would drop all rows tied at the maximum; here there is only one)

df_c = df[df.Reviews != df.Reviews.max()].reset_index(drop=True)

print('After dropping outlier')
display(df_c[NUM_FEAT].describe())
sns.scatterplot(data=df_c, x='Rating', y='Reviews')
plt.show()

In [None]:
# scale both features so that they are on the same scale
from sklearn.preprocessing import StandardScaler

X = df_c[NUM_FEAT]
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=['Scaled_Rating', 'Scaled_Reviews'])
display(X_scaled.describe())
sns.scatterplot(data=X_scaled, x='Scaled_Rating', y='Scaled_Reviews')

plt.show()

display(X.head())
display(X_scaled.head())

### Build model

* find an appropriate number of clusters by applying the elbow method
    * the elbow plot suggests that Ukrainian coffee shops fall into 5 clusters, the same number as in our home market
* visualize the clusters
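
Reading an elbow plot is a judgment call, so a second check can help. The sketch below uses the silhouette score, a different technique from the elbow method used in this notebook, on hypothetical toy data generated with `make_blobs`. It is only an illustration of how such a check could look, not part of the original analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three tight, well-separated groups
X_toy, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# silhouette score needs at least 2 clusters, so start at k = 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# the k with the highest average silhouette is the best-separated choice
best_k = max(scores, key=scores.get)
print(best_k)
```

A higher silhouette score means points sit closer to their own cluster than to neighboring ones. Running the same loop on `X_scaled` would show whether 5 clusters is also supported by this measure.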

In [None]:
# fit k-means clustering and apply elbow method to find appropriate value for k
from sklearn.cluster import KMeans

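# fit k-means for k = 1 to 9 and record the inertia of each fit
trial_k = list(range(1,10))
inertias = []
for i in trial_k:
    # fix random_state so the elbow plot is reproducible between runs
    km = KMeans(n_clusters=i, random_state=69)
    km.fit(X_scaled)
    inertias.append(km.inertia_)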

# plot elbow plot
plt.plot(trial_k, inertias, marker='.')
plt.title('Elbow method')
plt.xlabel('n_clusters')
plt.ylabel('inertia')

plt.show()

In [None]:
# choose n_clusters = 5 from the elbow plot and compute kmeans clusters
km = KMeans(n_clusters=5, random_state=69)
# replace cluster numbers with color names
# (which color a group receives depends on random_state and the scikit-learn version)
colors = ['Red', 'Blue', 'Green', 'Cyan', 'Magenta']
y = pd.Series(km.fit_predict(X_scaled)).map(dict(enumerate(colors)))
# add cluster label to df_c (indexes align because df_c was reset after dropping the outlier)
df_c['Cluster'] = y

#Visualizing all the clusters 
plt.figure(figsize=(10,10))

for i in range(0,5):
    c = colors[i]
    plt.scatter(df_c[df_c.Cluster == c]['Rating'],
                df_c[df_c.Cluster == c]['Reviews'],
                c=c,
                label='{} cluster'.format(c))

plt.title('Clusters of Ukrainian coffee shops')
plt.xlabel('Rating')
plt.ylabel('Reviews')
plt.legend()
plt.show()
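
Because the model is fitted on scaled features, its cluster centers are in scaled units. A minimal sketch, assuming hypothetical toy data and a scaler object that is kept around (the notebook above does not keep one), shows how `inverse_transform` could convert the centers back into ratings and review counts that are easier to present:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two low-rated quiet shops, two high-rated busy shops
toy = pd.DataFrame({'Rating': [4.0, 4.1, 4.9, 5.0],
                    'Reviews': [50, 60, 900, 1000]})

# keep the fitted scaler so the transformation can be undone later
scaler = StandardScaler()
X_toy = scaler.fit_transform(toy)

km_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# convert the cluster centers back to the original units
centers = pd.DataFrame(
    scaler.inverse_transform(km_toy.cluster_centers_),
    columns=toy.columns,
).sort_values('Rating').reset_index(drop=True)

print(centers)
```

Each row is then the "typical" shop of one cluster, expressed in the same units the marketing team already knows.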

## Insights and next steps

Interesting places to look at:

* Lviv Coffee Manufacture - Most reviewed coffee shop
    * suggests highest customer base and social media presence
    * model of what the local customers love
* Green cluster - Highly rated and reviewed
    * suggests very popular places that locals love
    * delivery option may be the key to success
        * highest proportion of places offering delivery among all clusters (~50%)
    * opportunities arise in regions where there are no coffee shops from this cluster
        * Kherson and Khrivoy Rog do not have any of these well-performing coffee shops
        * at least 14 out of 20 coffee shops in Kherson do not offer delivery

Next steps:
* Conduct in-depth research on Lviv Coffee Manufacture to see what makes it successful in the Ukrainian market
* Conduct a feasibility study on coffee delivery in the Kherson region to try to capitalize on this opportunity

In [None]:
# Inspect Lviv Coffee Manufacture
df[df['Place name'] == 'Lviv Coffee Manufacture']

# probably online review data and in-house review data, respectively

In [None]:
# Number of coffee shops in each cluster
sns.countplot(data=df_c, x='Cluster')
plt.show()

In [None]:
# Inspect regions in each cluster
plt.figure(figsize=(10,10))
sns.countplot(data=df_c, x='Cluster', hue='Region')
plt.show()

In [None]:
# Inspect delivery option in each cluster
sns.countplot(data=df_c, x='Cluster', hue='Delivery option')
plt.show()
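
The count plot shows raw numbers, while the insight about delivery is a proportion. As a minimal sketch, assuming hypothetical cluster labels and hypothetical `'Yes'`/`'No'`/`'N/A'` delivery values (the real column may be encoded differently), `pd.crosstab` with `normalize='index'` gives the share of each delivery option within a cluster:

```python
import pandas as pd

# Hypothetical labels and delivery values, for illustration only
toy = pd.DataFrame({
    'Cluster': ['Green', 'Green', 'Green', 'Green', 'Red', 'Red', 'Red', 'Red'],
    'Delivery option': ['Yes', 'Yes', 'No', 'N/A', 'No', 'No', 'Yes', 'N/A'],
})

# normalize='index' makes each row (cluster) sum to 1
share = pd.crosstab(toy['Cluster'], toy['Delivery option'], normalize='index')
print(share)
```

Running the same call on `df_c` would put a number on each cluster's delivery share instead of reading it off the bar heights.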