# The Kopi Latte Ratio Project: Data Preparation

This notebook cleans the location and review data collected from Google Maps and prepares a dataset for training a review text classification model.

## Load Extracted Data

In [1]:
import os
os.chdir('..')

In [28]:
import pandas as pd

places_df = pd.read_csv('data/raw/places.csv', index_col=0)[['name', 'place_id', 'formatted_address']]
places_df.head()

Unnamed: 0,name,place_id,formatted_address
0,Percolate,ChIJV6HD-Eo92jERjhfY7NEDrOM,"136 Bedok North Ave 3, #01-152, Singapore 460136"
1,Generation Coffee Roasters (Bedok),ChIJhbwWY-I92jERxtB-gF22sL0,"216 Bedok North Street 1, #01-32, Singapore 46..."
2,Refuel Cafe,ChIJcf_SpPk82jERM28p3SYNBnI,"744 Bedok Reservoir Rd, #01-3029 Reservoir Vil..."
3,Marie's Lapis Cafe,ChIJ3Vc6OY092jERhObI1bZ_4Sk,"537 Bedok North Street 3, #01-575, Singapore 4..."
4,COFFEESARANG,ChIJOb_8OAwj2jERPK-QVelr5Vk,"311 New Upper Changi Rd #01-78 Bedok Mall, Sin..."


## Data Exploration

In [29]:
places_df['place_id'].nunique()

4884

With 4,884 relevant locations and 21,685 reviews pulled from Google Maps, I designed a system to classify each location as a cafe or a kopitiam efficiently. It uses three layers of classification to minimise the manual effort required:
1. Heuristic classification - Classify the locations based on popular coffee franchises and keywords in names. This requires the least effort and can classify the greatest number of locations accurately. It will also provide labelled examples for training the review text classification model.
2. Review text classification - Train a model to classify the location based on its reviews. This will automate the classification process.
3. Manual human classification - For locations that are still ambiguous, I will manually classify them. This requires the most effort and should be limited to a reasonable number of locations.
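
The three layers can be read as a cascade in which each location falls through to the next layer only when the previous one cannot decide. The sketch below is a minimal illustration of that idea, not the notebook's implementation: the keyword lists, the `model_proba` argument (an assumed probability that a location is a cafe) and the `threshold` value are all hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical keyword lists, for illustration only
CAFE_KEYWORDS = ["starbucks"]
KOPI_KEYWORDS = ["kopitiam", "toast box"]


def heuristic_label(name: str) -> Optional[str]:
    """Layer 1: label by name keywords; return None when no single rule fires."""
    lowered = name.lower()
    is_cafe = any(k in lowered for k in CAFE_KEYWORDS)
    is_kopi = any(k in lowered for k in KOPI_KEYWORDS)
    if is_cafe == is_kopi:  # neither matched, or both matched (ambiguous)
        return None
    return "cafe" if is_cafe else "kopitiam"


def classify(
    name: str,
    model_proba: Optional[float] = None,  # assumed P(cafe) from a review model
    threshold: float = 0.8,               # assumed confidence cut-off
) -> Tuple[Optional[str], str]:
    """Return (label, layer that produced it)."""
    label = heuristic_label(name)
    if label is not None:
        return label, "heuristic"
    # Layer 2: accept the model only when it is confident either way
    if model_proba is not None:
        if model_proba >= threshold:
            return "cafe", "model"
        if model_proba <= 1 - threshold:
            return "kopitiam", "model"
    # Layer 3: everything else is queued for manual review
    return None, "manual"
```

For example, `classify("Percolate", model_proba=0.5)` falls through both automated layers and is flagged for manual review.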

## Heuristic Classification

In [30]:
# many coffee franchise locations are in the format "<franchise name> - <location>"
# extract the franchise name by splitting on '-' and '(' and keeping the first part
franchise_names = places_df['name'].str.split('-',expand=True)[0].str.split('(', expand=True)[0].rename('franchise_name')

# get top 20 franchise names
franchise_names_top = franchise_names.value_counts().head(20)
franchise_names_top

franchise_name
Toast Box                        54
The Coffee Bean & Tea Leaf       34
Ya Kun Kaya Toast                34
luckin coffee                    25
Kopitiam                         24
Kimly Coffeeshop                 24
Fun Toast                        19
The Coffee Bean and Tea Leaf     14
Tiong Bahru Bakery               14
Killiney Kopitiam                14
S                                13
Huggs Coffee                     12
Happy Hawkers                    11
Kimly Coffeeshop                 10
Kopi & Tarts                     10
Baker & Cook                      9
Han's Cafe                        8
Kopitiam Corner                   8
Koufu                             8
De Tian Coffee House              8
Name: count, dtype: int64
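
Two features of the output above are worth a second look: `Kimly Coffeeshop` appears twice and there is a bare `S` entry. A plausible explanation, not verified against the raw data, is that the split leaves trailing whitespace on some names and truncates names such as `S-11` at the hyphen. The sketch below reproduces both effects on made-up names and shows one possible variant that only cuts at a spaced hyphen or an opening bracket, then trims whitespace.

```python
import pandas as pd

# Made-up place names, for illustration only
names = pd.Series([
    "Kimly Coffeeshop",
    "Kimly Coffeeshop (Chai Chee 29 FoodHouse)",
    "S-11 Food House",
    "Toast Box - Bedok Mall",
])

# Same approach as the cell above: split on every "-" and "(", no trimming
raw = names.str.split("-", expand=True)[0].str.split("(", expand=True)[0]

# Variant: drop everything from " - " or "(" onwards, then strip whitespace
clean = names.str.replace(r"\s+-\s+.*$|\s*\(.*$", "", regex=True).str.strip()

print(raw.tolist())
print(clean.tolist())
```

With the variant, both Kimly names collapse into one value and `S-11 Food House` is left intact.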

In [31]:
import numpy as np

# Define list of cafe & kopi franchises and classify locations accordingly
cafe_franchises = ['Starbucks', 'The Coffee Bean & Tea Leaf', 'luckin coffee', 'The Coffee Bean and Tea Leaf', 'Tiong Bahru Bakery', 'Huggs Coffee', 'Baker & Cook']
kopi_franchises = ['Ya Kun Kaya Toast', 'Toast Box', 'Kopitiam', 'Kimly Coffeeshop', 'Fun Toast', 'Kopi Kiosk', 'S-11', 'Killiney Kopitiam', 'Happy Hawkers', 'Kopi & Tarts', 'De Tian Coffee House']

places_df['is_cafe'] = np.where(places_df['name'].str.contains('|'.join(cafe_franchises), regex=True, case=False), 1,0)
places_df['is_kopitiam'] = np.where(places_df['name'].str.contains('|'.join(kopi_franchises), regex=True, case=False), 1,0)
places_df[['is_cafe', 'is_kopitiam']].value_counts()

is_cafe  is_kopitiam
0        0              4268
         1               418
1        0               197
         1                 1
Name: count, dtype: int64
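
The franchise lists above are joined with `|` and passed to `str.contains` as a regular expression. That works here because none of the names contain regex metacharacters that change the match, but a name containing `+`, `.` or brackets would be misread. A minimal sketch of a safer pattern builder follows; `build_pattern` and the sample names are hypothetical.

```python
import re

import pandas as pd


def build_pattern(terms):
    """Join literal terms into one alternation, escaping regex metacharacters."""
    return "|".join(re.escape(term) for term in terms)


# Hypothetical names: "+" means "one or more" when left unescaped
names = pd.Series(["A+ Coffee", "AA Coffee"])
terms = ["A+ Coffee"]

unescaped = names.str.contains("|".join(terms), regex=True, case=False)
escaped = names.str.contains(build_pattern(terms), regex=True, case=False)

print(unescaped.tolist())  # matches "AA Coffee" but not the literal name
print(escaped.tolist())    # matches only the literal "A+ Coffee"
```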

In [32]:
# check location that was labelled as both kopitiam & cafe
places_df.loc[(places_df['is_cafe'] == 1) & (places_df['is_kopitiam'] == 1)]

Unnamed: 0,name,place_id,formatted_address,is_cafe,is_kopitiam
9966,Starbucks Kopitiam City,ChIJ3UHR8T0X2jERdBuJJ2ly4EY,"277C Compassvale Link, #01-13 (Shop 2A Kopitia...",1,1


In [33]:
places_df.loc[places_df['place_id'] == 'ChIJ3UHR8T0X2jERdBuJJ2ly4EY', 'is_kopitiam'] = 0
places_df[['is_cafe', 'is_kopitiam']].value_counts()

is_cafe  is_kopitiam
0        0              4268
         1               418
1        0               198
Name: count, dtype: int64

In [34]:
# Get top 20 words in the place names
pd.Series(' '.join(places_df['name']).lower().split()).value_counts()[:20]

coffee        709
cafe          487
@             340
&             326
the           292
-             272
house         246
food          230
kopitiam      186
shop          149
toast         147
coffeeshop    141
tea           138
eating        125
kopi           99
centre         92
mall           87
restaurant     84
café           84
bar            80
Name: count, dtype: int64

In [35]:
# Label locations whose names contain kopitiam keywords, then check number of samples for each label
kopi_word_list = ['kopitiam', 'toast', 'kopi']
places_df.loc[places_df['name'].str.contains('|'.join(kopi_word_list), regex=True, case=False), 'is_kopitiam'] = 1
places_df[['is_cafe', 'is_kopitiam']].value_counts()

is_cafe  is_kopitiam
0        0              4144
         1               542
1        0               197
         1                 1
Name: count, dtype: int64

After classifying the locations based on coffee franchises and keywords, we can quickly identify roughly 540 kopitiams and 200 cafes. This also gives us a good base for creating a training dataset for our review text classification model.

In [36]:
conditions = (places_df['is_cafe'] == 1) & (places_df['is_kopitiam'] == 1)
places_df.loc[conditions]

Unnamed: 0,name,place_id,formatted_address,is_cafe,is_kopitiam
9966,Starbucks Kopitiam City,ChIJ3UHR8T0X2jERdBuJJ2ly4EY,"277C Compassvale Link, #01-13 (Shop 2A Kopitia...",1,1


In [37]:
# starbucks kopitiam location is wrongly classified as both cafe & kopitiam, remove kopitiam label
places_df.loc[places_df['place_id'] == 'ChIJ3UHR8T0X2jERdBuJJ2ly4EY', 'is_kopitiam'] = 0
places_df.loc[conditions]

Unnamed: 0,name,place_id,formatted_address,is_cafe,is_kopitiam
9966,Starbucks Kopitiam City,ChIJ3UHR8T0X2jERdBuJJ2ly4EY,"277C Compassvale Link, #01-13 (Shop 2A Kopitia...",1,0
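
The fix above targets a single `place_id`, which is fine for one conflict. If more conflicts appeared, a rule-based tie-break could be applied instead. The sketch below assumes, purely for illustration, that a cafe franchise match should win over a generic kopitiam keyword; the toy data frame is made up. It also shows why `places_df.loc[conditions]` still returns the row after the fix: the boolean mask is computed once and is not refreshed when the data changes.

```python
import pandas as pd

# Toy stand-in for places_df
df = pd.DataFrame({
    "name": ["Starbucks Kopitiam City", "Toast Box", "Huggs Coffee"],
    "is_cafe": [1, 0, 1],
    "is_kopitiam": [1, 1, 0],
})

# Mask of rows carrying both labels, evaluated before the fix
both = (df["is_cafe"] == 1) & (df["is_kopitiam"] == 1)

# Assumed tie-break rule: the cafe franchise label wins
df.loc[both, "is_kopitiam"] = 0

# The old mask still selects the same row; a fresh mask finds no conflicts
still_selected = int(both.sum())
remaining = int(((df["is_cafe"] == 1) & (df["is_kopitiam"] == 1)).sum())
print(still_selected, remaining)
```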


In [38]:
places_labelled = places_df.loc[places_df[['is_cafe', 'is_kopitiam']].sum(axis=1) > 0] 
places_labelled.head()


Unnamed: 0,name,place_id,formatted_address,is_cafe,is_kopitiam
9,Happy Hawkers Coffeeshop,ChIJxQb7Ulo92jERes4lOO9WDso,"739 Bedok Reservoir Rd, Singapore 470741",0,1
24,Baker & Cook - Swan Lake,ChIJfYtFcq4i2jERVe8R8VhEmrg,"1 Swan Lake Ave, Opera Estate, Singapore 455700",1,0
29,Kimly Coffeeshop (Chai Chee 29 FoodHouse),ChIJN2zPCK0i2jERyJsDXfiesho,"29B Chai Chee Ave, #01-60, Singapore 462029",0,1
31,The Coffee Bean & Tea Leaf,ChIJL-rUh7kj2jEReVwjCg1YIW8,"311 New Upper Changi Rd, #01-01, Singapore 467360",1,0
36,Fun Toast,ChIJ840QRIE92jERlGb5nG8QtPE,"900 Bedok North Rd, #01-01 HomeTeamNS, Singapo...",0,1


In [39]:
places_unlabelled = places_df.loc[places_df[['is_cafe', 'is_kopitiam']].sum(axis=1) == 0] 
places_unlabelled.head()

Unnamed: 0,name,place_id,formatted_address,is_cafe,is_kopitiam
0,Percolate,ChIJV6HD-Eo92jERjhfY7NEDrOM,"136 Bedok North Ave 3, #01-152, Singapore 460136",0,0
1,Generation Coffee Roasters (Bedok),ChIJhbwWY-I92jERxtB-gF22sL0,"216 Bedok North Street 1, #01-32, Singapore 46...",0,0
2,Refuel Cafe,ChIJcf_SpPk82jERM28p3SYNBnI,"744 Bedok Reservoir Rd, #01-3029 Reservoir Vil...",0,0
3,Marie's Lapis Cafe,ChIJ3Vc6OY092jERhObI1bZ_4Sk,"537 Bedok North Street 3, #01-575, Singapore 4...",0,0
4,COFFEESARANG,ChIJOb_8OAwj2jERPK-QVelr5Vk,"311 New Upper Changi Rd #01-78 Bedok Mall, Sin...",0,0


## Prepare Reviews Dataset for Text Classification

In [42]:
# load reviews data and drop potentially identifying columns
reviews_df = pd.read_csv('data/raw/reviews.csv')[['place_id', 'text']]
reviews_df.head()

Unnamed: 0,place_id,text
0,ChIJV6HD-Eo92jERjhfY7NEDrOM,It’s great to find a cafe that serves good cof...
1,ChIJV6HD-Eo92jERjhfY7NEDrOM,Nice little independent cafe in Bedok. Quite a...
2,ChIJV6HD-Eo92jERjhfY7NEDrOM,Really blessed this is just around the neighbo...
3,ChIJV6HD-Eo92jERjhfY7NEDrOM,Crowded on a PH morning and I can see why! Cof...
4,ChIJV6HD-Eo92jERjhfY7NEDrOM,This place have very delicious garlic cheese c...


In [43]:
len(reviews_df)

21685

In [44]:
# check character count of longest review
reviews_df['char_count'] = reviews_df['text'].str.len()
reviews_df['char_count'].max()

4085.0
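
The maximum is reported as a float (`4085.0`) rather than an integer, which usually indicates that some reviews have no text: `str.len()` returns `NaN` for missing values. A small sketch on made-up reviews shows how to count the missing entries and get an integer maximum.

```python
import pandas as pd

# Made-up reviews, one of them missing
text = pd.Series(["good kopi", None, "nice cafe latte"])

char_count = text.str.len()        # missing text yields a missing length
missing = int(text.isna().sum())   # number of reviews without text
longest = int(char_count.max())    # max() skips missing values by default

print(missing, longest)
```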

In [45]:
# Join reviews with places
reviews_train_df = places_labelled.merge(reviews_df, on='place_id', how='inner')[['text', 'place_id', 'name', 'is_cafe', 'is_kopitiam']]

# Convert is_cafe and is_kopitiam to single binary label. 1 for cafe, 0 for kopitiam
reviews_train_df['label'] = np.where(reviews_train_df['is_cafe'] == 1, 1, 0)

# drop unnecessary columns and missing reviews  
reviews_train_df = reviews_train_df.drop(columns=['place_id', 'name', 'is_cafe', 'is_kopitiam']).dropna()
reviews_train_df.head()

Unnamed: 0,text,label
0,Junyi fortune fish. 吉祥鱼\nDabao the salted veg ...,0
1,Five star for Fish So Nice shop selling grille...,0
2,Mixed Rice stall- i don’t even know how they c...,0
3,Crazy drink lady charged me $1 for this cup of...,0
4,Dim sum stall needs improvement.\nManagement s...,0


In [46]:
reviews_train_df['label'].value_counts(normalize=True)

label
0    0.705687
1    0.294313
Name: proportion, dtype: float64

A clean labelled dataset has been prepared to train the text classification model. The labels are moderately imbalanced (about 71% kopitiam reviews to 29% cafe reviews), so the train-test split should be stratified by label.
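
As a minimal sketch of a stratified split, the example below samples the same fraction from each label group using pandas alone. The toy data frame, the 20% test fraction and the random seed are assumptions for illustration; scikit-learn's `train_test_split` with its `stratify` argument is a common alternative.

```python
import pandas as pd

# Toy stand-in for reviews_train_df with a 70/30 label mix
df = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "label": [0] * 70 + [1] * 30,
})

# Sample 20% of each label group for the test set (assumed fraction and seed)
test_df = df.groupby("label").sample(frac=0.2, random_state=42)

# Everything not sampled becomes the training set
train_df = df.drop(test_df.index)

# Both splits keep the original label proportions
print(train_df["label"].mean(), test_df["label"].mean())
```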

In [27]:
# reviews_train_df.to_csv('data/interim/reviews_train.csv', index=False)