# Navigating the Gym Landscape: Analyzing Customer Reviews in Danish Facilities
## Notebook code
This notebook provides the code needed to reproduce the results presented in the report "Navigating the Gym Landscape: Analyzing Customer Reviews in Danish Facilities" by the Data-Wild-West group. If in doubt, consult the README file.

# 1. Introduction
Customer reviews have become a major research topic for brands across industries and are, without a doubt, a significant driver of consumers’ decisions. This notebook provides a complete framework, from data collection to data insights.

##### Imports

In [1]:
# Google Maps API
import googlemaps

# Basic libraries
import pandas as pd
import numpy as np
import os

# Custom util functions
import sys; sys.path.append("./libraries/")
from utils import *

# Classification models
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer

# Statistics
from scipy.stats import tukey_hsd, f_oneway


# Visualization
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
import seaborn as sns

# Maps
import folium
from folium.plugins import MarkerCluster

import warnings; warnings.filterwarnings('ignore')



### Settings

##### Reproducibility settings

In [2]:
# Random seed for experiments
np.random.seed(7)

# Relative Paths
GOOGLE_API_TOKEN = "./Google_API_key.txt"
RAW_DATA = "../data/raw_data/"
PROCESSED_DATA = "../data/processed_data/"
ANNOTATIONS_DATA = "../annotations/"

# Some style settings (one call, because a later sns.set() would reset the style and context)
sns.set(style="whitegrid", context="paper", font_scale=1.5, font="Arial")

# Flags
collect = False # Flag to collect data or load existent raw_data
process = False # Flag to process data or load existent processed data

##### Google API

In [3]:
# Read the key, strip the trailing newline and close the file
with open(GOOGLE_API_TOKEN) as f:
    key = f.readline().strip()
gmaps = googlemaps.Client(key=key)

# 2. Methods
The following section introduces the steps taken to collect and process the data.

## 2.1 Data collection

We start by building a list of search queries. We are mainly interested in reviews (plus some metadata) for specific fitness facilities, namely popular chains, in the main cities of Denmark. The query list is therefore every combination of fitness chain and city.

In [4]:
# List of cities
cities = ['Copenhagen', 'Aalborg', 'Arhus', 'Odense']
 
# Popular fitness chains
gyms = ["PureGym", "SATS", "Vesterbronx"]

# Query list
query_list = [g + " " + c for g in gyms for c in cities]

print(query_list)

['PureGym Copenhagen', 'PureGym Aalborg', 'PureGym Arhus', 'PureGym Odense', 'SATS Copenhagen', 'SATS Aalborg', 'SATS Arhus', 'SATS Odense', 'Vesterbronx Copenhagen', 'Vesterbronx Aalborg', 'Vesterbronx Arhus', 'Vesterbronx Odense']


### 2.1.1 Google Maps API

The Google Maps API takes a single query string to search for results (similar to the search box in the user interface). Therefore, we combine popular fitness facilities with the main Danish cities as our query keys.

In [5]:
# Get responses for all the queries from the API
if collect:
    # Get response for queries
    dfs = []

    # For each query in the query list
    for query in query_list:  
        # Get the response using our custom made querier
        dfs.append(google_querier(gmaps, query))

    google_reviews = pd.concat(dfs)

    # Save to disk
    google_reviews.to_csv(RAW_DATA + "google_reviews.csv", index=False, encoding="utf-8")

else:
    google_reviews = pd.read_csv(RAW_DATA + "google_reviews.csv")
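`google_querier` lives in `./libraries/utils.py` and is not shown in this notebook. As a rough illustration of the flattening step such a helper needs, the sketch below turns one Place Details-style response into one flat record per review. The function name `flatten_place_reviews` and the sample payload are hypothetical; the field names (`result`, `geometry.location`, `reviews`, `author_name`, `rating`, `text`) follow the Google Places Details response format, and the `type` field stores the search query, as in our dataset.

```python
from typing import Any


def flatten_place_reviews(details: dict[str, Any], query: str) -> list[dict[str, Any]]:
    """Return one flat record per review in a Place Details-style response."""
    result = details.get("result", {})
    location = result.get("geometry", {}).get("location", {})
    rows = []
    for review in result.get("reviews", []):
        rows.append({
            "place_id": result.get("place_id"),
            "type": query,  # the search query that led to this place
            "name": result.get("name"),
            "lat": location.get("lat"),
            "lng": location.get("lng"),
            "author_name": review.get("author_name"),
            "rating": review.get("rating"),
            "text": review.get("text"),
        })
    return rows


# Hypothetical payload, shaped like a Place Details response (not real data)
sample = {
    "result": {
        "place_id": "example-place-id",
        "name": "Example Gym",
        "geometry": {"location": {"lat": 55.67, "lng": 12.55}},
        "reviews": [
            {"author_name": "A", "rating": 4, "text": "Nice gym"},
            {"author_name": "B", "rating": 2, "text": "Too crowded"},
        ],
    }
}

rows = flatten_place_reviews(sample, "PureGym Copenhagen")
```

Passing `rows` to `pd.DataFrame` would give a frame that can be appended to `dfs` in the loop above.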

Check the results.

In [6]:
check_dataframe_results(google_reviews)

Resulting dataframe has shape (360, 9)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360 entries, 0 to 359
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   place_id       360 non-null    object 
 1   type           360 non-null    object 
 2   name           360 non-null    object 
 3   lat            360 non-null    float64
 4   lng            360 non-null    float64
 5   author_name    360 non-null    object 
 6   rating         360 non-null    int64  
 7   text           360 non-null    object 
 8   opening_hours  360 non-null    object 
dtypes: float64(2), int64(1), object(6)
memory usage: 25.4+ KB
None


Unnamed: 0,place_id,type,name,lat,lng,author_name,rating,text,opening_hours
0,ChIJh3mB6UxSUkYREbiH4JDK-7M,PureGym Copenhagen,PureGym,55.669812,12.54739,madi sharp,4,"Sweet small gym, staff are kind when you see t...","{'Monday': '05:00AM - 12:00AM', 'Tuesday': '05..."
1,ChIJh3mB6UxSUkYREbiH4JDK-7M,PureGym Copenhagen,PureGym,55.669812,12.54739,Lewis Atkins,2,"Just a very bad gym. Staff don’t really care, ...","{'Monday': '05:00AM - 12:00AM', 'Tuesday': '05..."
2,ChIJh3mB6UxSUkYREbiH4JDK-7M,PureGym Copenhagen,PureGym,55.669812,12.54739,Eric,1,"terrible facilities\nbathrooms are gross, dirt...","{'Monday': '05:00AM - 12:00AM', 'Tuesday': '05..."
3,ChIJh3mB6UxSUkYREbiH4JDK-7M,PureGym Copenhagen,PureGym,55.669812,12.54739,Rune Perstrup,1,An Unhygienic Coronavirus Petri Dish.\n\nI hav...,"{'Monday': '05:00AM - 12:00AM', 'Tuesday': '05..."
4,ChIJh3mB6UxSUkYREbiH4JDK-7M,PureGym Copenhagen,PureGym,55.669812,12.54739,Mario Piazza,1,In a huge gym there is only one hair dryer and...,"{'Monday': '05:00AM - 12:00AM', 'Tuesday': '05..."


### 2.1.2 Trustpilot WebScraper

Trustpilot is a Danish consumer review website that is very popular in Denmark. It is publicly available and easy to access, but it does not provide any API integration. Therefore, we use a simple web crawler to extract the reviews of interest.

In [7]:
if collect:
    dfs = []

    # Reuse the gyms
    for g in gyms:
        df = trustpilot_crawler(key=g, verbose=False)

        # Append the facility DF to main df
        dfs.append(df)

    # Join all DFs
    trustpilot_reviews = pd.concat(dfs)

    # Save to disk
    trustpilot_reviews.to_csv(RAW_DATA + "trustpilot_reviews.csv", index=False, encoding="utf-8")

else:
    trustpilot_reviews = pd.read_csv(RAW_DATA + "trustpilot_reviews.csv", encoding="utf-8")

Check the results.

In [8]:
check_dataframe_results(trustpilot_reviews)

Resulting dataframe has shape (2802, 7)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2802 entries, 0 to 2801
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   datetime    2802 non-null   object
 1   name        2802 non-null   object
 2   rating      2802 non-null   int64 
 3   title       2802 non-null   object
 4   review      2802 non-null   object
 5   event_time  2802 non-null   object
 6   enterprise  2802 non-null   object
dtypes: int64(1), object(6)
memory usage: 153.4+ KB
None


Unnamed: 0,datetime,name,rating,title,review,event_time,enterprise
0,2023-11-13T14:03:40.000Z,Jan Winther,4,Godt fitness-center,Gennemgående er jeg godt tilfreds med mit fitn...,13. november 2023,PureGym
1,2023-11-14T13:07:20.000Z,Tina Holst,5,Syntes altid det er dejligt at komme i…,Syntes altid det er dejligt at komme i centret...,14. november 2023,PureGym
2,2023-11-13T09:22:36.000Z,Pfændtner,5,Jeg har gået i Fitness centeret i 22år…,Jeg har gået i Fitness centeret i 22år og efte...,12. november 2023,PureGym
3,2023-11-13T17:18:33.000Z,Gitte,5,Puregym Ikast,Puregym Ikast er et fantastisk center. Man føl...,13. november 2023,PureGym
4,2023-11-13T10:01:35.000Z,GITTE MIKKELSEN,2,Der mangler Stram op hold,Der mangler Stram op hold (eller ligende fx Pu...,11. november 2023,PureGym


### 2.1.3 Københavns Kommune Scraper

The Københavns Kommune website provides an extensive list of training facilities, both indoors and outdoors. Since this is a dynamic site built on JavaScript, the traditional web crawler approach is not suitable, so we use Selenium to simulate human-like interactions with the page.

In [9]:
if collect:

    # Create crawler instance
    kbh_scraper = KBHFacilitiesWebScraper()
    # Get dataframe with entries
    kbh_facilities = kbh_scraper.get()

    # Save to disk
    kbh_facilities.to_csv(RAW_DATA + "kbh_facilities.csv", index=False, encoding="utf-16") # Stored as utf-16; the loading branch below reads it with the same encoding
    

else:
    kbh_facilities = pd.read_csv(RAW_DATA + "kbh_facilities.csv", encoding="utf-16")

Check the results.

In [10]:
check_dataframe_results(kbh_facilities)

Resulting dataframe has shape (515, 9)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 515 entries, 0 to 514
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  515 non-null    int64 
 1   type        515 non-null    object
 2   activity    515 non-null    object
 3   location    508 non-null    object
 4   website     514 non-null    object
 5   gender      515 non-null    object
 6   age         515 non-null    object
 7   special     106 non-null    object
 8   address     499 non-null    object
dtypes: int64(1), object(8)
memory usage: 36.3+ KB
None


Unnamed: 0.1,Unnamed: 0,type,activity,location,website,gender,age,special,address
0,0,gym,Styrke- og grundtræning,SOS Motion,http://www.sosmotion.dk/,both,all,,"Sundhedshus Østerbro, Randersgade 60, 4 sal, 2..."
1,3,gym,Nærgymnastik,LOFskolen,https://lofskolen.dk/kurser/motion-og-sundhed/...,both,all,Målrettet personer der har brug for træning me...,"Østerbrogade 240, 2100 København Ø"
2,4,ball_sports,Floorball for kvinder 65+ år,BK Skjold,https://www.bkskjold.dk/klub/boldklubben-skjol...,women,seniors,,"Nørrebrogade 208, 2200 Kbh. N"
3,5,gym,Fitness og styrketræning,Nabo Østerbro,https://www.naboosterbro.dk/styrketr%c3%a6ning...,both,seniors,,"Nyborggade 9, 2100 Kbh Ø"
4,6,gym,KOL-/hjerte træningshold,FOF,https://www.fof.dk/da/kbh/kurser/samarbejde-me...,both,all,,"Vesterbrogade 121, 1620 Kbh. V"


#### 2.1.3.1 Lookup reviews for KBH Facilities

We observe that this dataset only contains addresses, but not geolocation (latitude and longitude) or reviews for the places. We then try to collect that missing data from the Google Maps API.

In [11]:
if collect:
    # Use custom function to iterate through the facilities and retrieve coordinates and reviews for the places.
    kbh_facilities_reviews = review_finder(gmaps, kbh_facilities)

    # Save to disk
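    # (written to its own file, so that the facilities file is not overwritten and the loading branch below finds it)
    kbh_facilities_reviews.to_csv(RAW_DATA + "kbh_facilities_reviews.csv", index=False, encoding="utf-16") # Stored as utf-16; the loading branch below reads it with the same encoding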

else:
    kbh_facilities_reviews = pd.read_csv(RAW_DATA + "kbh_facilities_reviews.csv", encoding="utf-16")

Check the results.

In [12]:
check_dataframe_results(kbh_facilities_reviews)

Resulting dataframe has shape (1841, 13)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1841 entries, 0 to 1840
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   type      1841 non-null   object 
 1   activity  1841 non-null   object 
 2   location  1756 non-null   object 
 3   website   1477 non-null   object 
 4   gender    1841 non-null   object 
 5   age       1841 non-null   object 
 6   special   250 non-null    object 
 7   address   1481 non-null   object 
 8   lat       1460 non-null   float64
 9   lng       1460 non-null   float64
 10  author    1841 non-null   object 
 11  review    1841 non-null   object 
 12  rating    1841 non-null   float64
dtypes: float64(3), object(10)
memory usage: 187.1+ KB
None


Unnamed: 0,type,activity,location,website,gender,age,special,address,lat,lng,author,review,rating
0,outdoors,Træningspavillion,,,both,all,,"Kvægtorvsgade, 1710 KBH V",55.669719,12.56313,Ximena Ramos,This was the first time that we ordered this f...,3.0
1,outdoors,Træningspavillion,,,both,all,,"Kvægtorvsgade, 1710 KBH V",55.669719,12.56313,David Olafsson,My wife and I have been coming here with our d...,5.0
2,outdoors,Træningspavillion,,,both,all,,"Kvægtorvsgade, 1710 KBH V",55.669719,12.56313,Rune Madsen,Amazing new Chinese food in the area. We had M...,5.0
3,outdoors,Træningspavillion,,,both,all,,"Kvægtorvsgade, 1710 KBH V",55.669719,12.56313,Richard Grieg Higginson,Nice food and staff,4.0
4,outdoors,Træningspavillion,,,both,all,,"Kvægtorvsgade, 1710 KBH V",55.669719,12.56313,Hjalte Christiansen,We ordered lunch takeaway. But they had forgot...,1.0


## Join the dataset

We are interested in constructing a dataset that includes the enterprise, rating and review text, so we need to ensure those attributes are accessible across the different data sources.

### Extract enterprise for Google reviews

In [13]:
# Extract enterprise for Google reviews
_enterprises_ = []
# Look at each row
for ix, row in google_reviews.iterrows():
    # If not one of the main chains, default to "OTHER"
    result = "OTHER"
    # Search for the enterprise in either "type" or "name" columns
    for enterprise in gyms:
        if (enterprise.lower() in row["type"].lower()) or (enterprise.lower() in row["name"].lower()):
            result = enterprise
            break
    _enterprises_.append(result)

google_reviews["enterprise"] = _enterprises_

### Extract enterprise for KBH Facilities reviews

In [14]:
if process:
    # Extract enterprise for KBH Facilities reviews
    _enterprises_ = []

    # Since some attributes are NaN, we replace them with the string "nan"
    filled = kbh_facilities_reviews.fillna("nan")

    # Look at each row
    for ix, row in filled.iterrows():
        # If not one of the main chains, default to "OTHER"
        result = "OTHER"
        # Search for the enterprise in either "type" or "location" columns
        for enterprise in gyms:
            if (enterprise.lower() in row["type"].lower()) or (enterprise.lower() in row["location"].lower()):
                result = enterprise
                break
        _enterprises_.append(result)

    # Add the enterprise to the dataset
    kbh_facilities_reviews["enterprise"] = _enterprises_

    # Save the results to disk
    kbh_facilities_reviews.to_csv(PROCESSED_DATA + "kbh_facilities_reviews.csv", index=False, encoding="utf-8")

else:
    kbh_facilities_reviews = pd.read_csv(PROCESSED_DATA + "kbh_facilities_reviews.csv")
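Both extraction cells above apply the same rule: the first chain name found, case-insensitively, in any of the inspected fields wins, and everything else becomes "OTHER". A minimal sketch of that rule as one reusable function follows. The name `match_enterprise` is hypothetical, and skipping non-string values (such as NaN) is an assumption that replaces the `fillna("nan")` step.

```python
def match_enterprise(candidates, *fields, default="OTHER"):
    """Return the first candidate found (case-insensitively) in any text field."""
    # Missing values arrive as float NaN or None, so only strings are searched
    haystack = [field.lower() for field in fields if isinstance(field, str)]
    for candidate in candidates:
        needle = candidate.lower()
        if any(needle in text for text in haystack):
            return candidate
    return default


gyms = ["PureGym", "SATS", "Vesterbronx"]

google_example = match_enterprise(gyms, "PureGym Copenhagen", "PureGym")
kbh_example = match_enterprise(gyms, "outdoors", float("nan"))
```

Because this is substring matching, a short name such as `SATS` would also match inside a longer word, which is worth keeping in mind when reading the "OTHER" counts.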

## 2.2 Translation of Danish reviews
Our Trustpilot dataset contains content in both English and Danish. We want to translate everything to English, to work with a monolingual dataset.
To accomplish the translation task, we use a translation model from Hugging Face: Helsinki-NLP/opus-mt-da-en.

In [15]:
if process:
    # First, remove all emojis to facilitate translation
    trustpilot_reviews["review"] = trustpilot_reviews["review"].apply(lambda x: remove_emojis(x))

    # Use custom function to translate the Danish reviews
    trustpilot_reviews = translate(df = trustpilot_reviews, text_colname = "review", translation_colname="translated_review")
else:
    trustpilot_reviews = pd.read_csv(PROCESSED_DATA + "trustpilot_reviews.csv")

### 2.2.1 Translation Assessment
We assess the quality of the model's translations by computing the word error rate (WER) against human translations.

In [16]:
# Translations folder
filepath = "../translations/human_translations.csv"

# We load the human translations and strip the emojis
human = pd.read_csv(filepath)
human["review"] = human.review.apply(lambda x: remove_emojis(x))
human.rename(columns={"review": "text", "translation": "human"}, inplace=True)

# We extract the model translations (rename returns a new frame, which avoids writing to a slice)
machine = trustpilot_reviews[["review", "translated_review"]].rename(columns={"review": "text", "translated_review": "machine"})

# We match the translations to their human counterpart
translations = human.merge(machine, on="text", how="inner")

# We pass the text, human and machine translations to our custom WER class
# (the instance gets its own name so that re-running the cell does not shadow the class)
wer = WER(translations.text, translations.human, translations.machine)

# We use our custom method to compute the average word error rate for the whole sample
print(f"The WER for the translations sample is {wer.mean():.3f}")

The WER for the translations sample is 0.388


We can also inspect the instances with the highest (worst) and lowest (best) WER.

In [17]:
display(wer.ranking().head())
display(wer.ranking().tail())

Unnamed: 0,Text,Human,Machine,WER
21,Den dårligste Santa jeg har set alt den styrke...,The worst sats I have seen on Nørrebro,The worst Santa I've seen all the strength you...,1.078947
7,Jeg syntes dør er et godt trænings center jeg ...,"I think this is a good gym, however there shou...",I think door is a good training center. I just...,0.633136
28,Udemærket træningscenter med stort set hvad ma...,Fine training center with basically everything...,Excellent gym with practically what to use. To...,0.579882
41,God stemning i Sats og et utal af træningsmul...,Good vibes in Sats and countless ways of worki...,Good atmosphere in Sats and numerous training ...,0.535714
5,(anmeldelsen er skrevet efter 8 besøg i center...,The review is written after 8 visits to the ce...,(the review is written after 8 visits to the c...,0.528662


Unnamed: 0,Text,Human,Machine,WER
15,"Forfærdlig forløb hos sats, Sats sender min ti...","Horrible course at sats, Sets sent mine to deb...","Terrible course with the rate, Sats sends mine...",0.234286
16,Skal betale for en ekstra måned selvom jeg ops...,I had to pay for an extra month even though I ...,Must pay for an extra month even though I quit...,0.19797
19,Det er altid en stor fornøjelse at træne på Cl...,It is always a great pleassure to train on Cla...,It's always a great pleasure to train on Clara...,0.192
24,"Jeg er ikke medlem, jeg var på vej til at meld...","I'm not a member, I was about to become one at...","I'm not a member, I was about to sign up today...",0.187192
40,"SATS Glostrup's sauna er altid i stykker, 2-3 ...","Sats Glostrup's sauna is always broken, 2-3 ti...","SATS Glostrup's sauna is always broken, 2-3 ti...",0.170732
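The custom WER class is defined in our utils and not shown here. The sketch below assumes the standard definition: the word-level edit distance (substitutions, deletions and insertions) between the machine and the human translation, divided by the number of words in the human translation. Whitespace tokenization and lowercasing are assumptions of this sketch, and the function name `word_error_rate` is hypothetical. Under this definition a WER above 1, as in row 21 above, is possible when the machine translation is much longer than the human reference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


identical = word_error_rate("the gym is clean", "the gym is clean")
one_substitution = word_error_rate("the gym is clean", "the gym was clean")
long_hypothesis = word_error_rate("good gym", "a very good gym indeed")
```

Here `long_hypothesis` needs three insertions against a two-word reference, which is how the ratio can exceed 1.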


### Select the attributes to keep

In [18]:
# Rename columns to match across datasets
google_reviews = google_reviews.rename(columns={"author_name": "author", "text": "review"})
# For Trustpilot, the translated text replaces the original review text
# (the column is named "translated_review", so we drop the original "review" before renaming)
trustpilot_reviews = trustpilot_reviews.drop(columns="review").rename(columns={"name": "author", "translated_review": "review"})

# Columns to keep
cols = ['enterprise', 'author', 'rating', 'review']

# Keep useful columns (copies, so that adding "platform" does not write to a slice)
google_reviews = google_reviews[cols].copy()
google_reviews["platform"] = "Google"
trustpilot_reviews = trustpilot_reviews[cols].copy()
trustpilot_reviews["platform"] = "Trustpilot"
kbh_facilities_reviews = kbh_facilities_reviews[cols].copy()
kbh_facilities_reviews["platform"] = "Google"

# Merge all reviews
reviews = pd.concat([google_reviews, trustpilot_reviews, kbh_facilities_reviews]).reset_index(drop=True)

# Drop duplicates
reviews = reviews.drop_duplicates(ignore_index=True, keep="first")

Check the resulting dataframe.

In [19]:
check_dataframe_results(reviews)

Resulting dataframe has shape (3586, 5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3586 entries, 0 to 3585
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   enterprise  3586 non-null   object 
 1   author      3586 non-null   object 
 2   rating      3586 non-null   float64
 3   review      3583 non-null   object 
 4   platform    3586 non-null   object 
dtypes: float64(1), object(4)
memory usage: 140.2+ KB
None


Unnamed: 0,enterprise,author,rating,review,platform
0,PureGym,madi sharp,4.0,"Sweet small gym, staff are kind when you see t...",Google
1,PureGym,Lewis Atkins,2.0,"Just a very bad gym. Staff don’t really care, ...",Google
2,PureGym,Eric,1.0,"terrible facilities\nbathrooms are gross, dirt...",Google
3,PureGym,Rune Perstrup,1.0,An Unhygienic Coronavirus Petri Dish.\n\nI hav...,Google
4,PureGym,Mario Piazza,1.0,In a huge gym there is only one hair dryer and...,Google


## 2.3 Annotations

The goal is to identify the specific elements or topics mentioned in each review, to support a finer-grained classification of customer satisfaction. Each annotation has two dimensions: sentiment and object/topic. Sentiment uses three labels: “Positive”, “Negative” and “Neutral”. The “Neutral” label covers reviews that express both sentiments toward a single object or only a subtle sentiment.

### Annotation distribution
To avoid introducing bias to the task, we remove all columns except for the text to annotate, and we randomly distribute the samples across annotators.

In [None]:
# Keep only the review text
review_text = pd.DataFrame(reviews["review"])

# Shuffle reviews (fixed seed for reproducibility)
review_text = review_text.sample(frac=1, random_state=7)

# Give unique ID to reviews
review_text["ID"] = np.arange(1, len(reviews)+1)

# Drop the index
review_text.reset_index(drop=True, inplace=True)

# Size of sample annotated by all annotators
size = 100

# Keep a list of not assigned IDs
remaining_ids = list(review_text.ID)

# Randomly select some IDs
common_ids = np.random.choice(remaining_ids, size=size, replace=False)
# Assign those instances to "all" annotators
review_text.loc[review_text.ID.isin(common_ids), "annotator"] = "all"
# Remove the selected IDs from the remaining not assigned IDs
remaining_ids = [x for x in remaining_ids if x not in common_ids]

# List of annotators
annotators = ["Bogdan", "Chrisanna", "Christian", "Gino", "Veron"]

# Size of the samples
size = 202
# Assign to each annotator
for a in annotators:
    # Randomly select some IDs
    selected_ids = np.random.choice(remaining_ids, size=size, replace=False)
    # Assign those instances to the specific annotator
    review_text.loc[review_text.ID.isin(selected_ids), "annotator"] = a
    # Remove the selected IDs from the remaining not assigned IDs
    remaining_ids = [x for x in remaining_ids if x not in selected_ids]

# Show number of instances per annotator
display(review_text.groupby("annotator").size())
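The assignment above can also be expressed as a single seeded shuffle followed by slicing, which gives the same kind of split (one shared sample plus disjoint per-annotator samples) without repeatedly filtering the remaining IDs. This is a sketch using only the standard library; the function name `assign_annotators` and the use of `None` for unassigned reviews are assumptions of the sketch, not part of our utils.

```python
import random
from collections import Counter


def assign_annotators(ids, annotators, common_size, per_annotator, seed=7):
    """Map each review ID to "all", one annotator, or None (unassigned)."""
    needed = common_size + per_annotator * len(annotators)
    if needed > len(ids):
        raise ValueError("not enough reviews for the requested sample sizes")

    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)

    assignment = {review_id: None for review_id in ids}
    # The first slice is shared by every annotator
    for review_id in shuffled[:common_size]:
        assignment[review_id] = "all"
    # The following disjoint slices go to one annotator each
    start = common_size
    for annotator in annotators:
        for review_id in shuffled[start:start + per_annotator]:
            assignment[review_id] = annotator
        start += per_annotator
    return assignment


annotators = ["Bogdan", "Chrisanna", "Christian", "Gino", "Veron"]
assignment = assign_annotators(range(1, 3587), annotators, common_size=100, per_annotator=202)
counts = Counter(assignment.values())
```

With 3586 reviews, 100 shared and 202 per annotator, 1110 reviews are assigned and the remaining 2476 are left for the auto-labeller.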

We can now distribute the samples to annotate across annotators.

In [None]:
if process:
    # For each annotator, create a file
    for a in annotators:
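        # Get the samples for the specific annotator (the shared sample goes to everyone)
        annotators_sample = review_text.loc[(review_text.annotator == a) | (review_text.annotator == "all"), ["ID", "review"]]
        # The annotation files use "text" as the column name
        annotators_sample = annotators_sample.rename(columns={"review": "text"})
        annotators_sample.to_csv(ANNOTATIONS_DATA + f"annotators_samples/{a}.csv", index=False)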

### Load the annotation responses

In [None]:
# Container for individual annotation responses datasets
dfs = []

# Look at the JSON files, parse and join
for file in os.listdir(ANNOTATIONS_DATA + "annotators_results"):
    if file.endswith(".json"):
        # Use our custom function to parse the response file
        df = parse_label_studio_file(ANNOTATIONS_DATA + "annotators_results/" + file)
        # Append to the container
        dfs.append(df)

# Join all files
annotations = pd.concat(dfs).reset_index(drop=True)

print(f"A total of {annotations.shape[0]} annotations are now joined.")

### 2.3.1 Calculate IAA
To assess the reliability of the annotations we calculate Fleiss' kappa inter-annotator agreement.

In [None]:
# The categories are in the columns (except the first two: "ID" and "text")
categories = annotations.columns[2:]
# The possible labels are 1.0 (Positive), 0.0 (Neutral), -1.0 (Negative) or NAN (if no sentiment)
labels = [1.0, 0.0, -1.0, np.nan]
# Select the common annotations for IAA
common_annotations = annotations.groupby("ID").filter(lambda x: len(x) == 5)

IAA = fleiss_kappa(common_annotations, categories, labels=labels)

print(f"The Fleiss Kappa for IAA is {IAA:.2f}.")
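Our `fleiss_kappa` helper is defined in utils and not shown here. For reference, the sketch below computes the standard Fleiss' kappa for a single category: the observed share of agreeing rater pairs, corrected by the agreement expected from the overall label frequencies. The function name `fleiss_kappa_sketch` is hypothetical, and the sketch uses `None` instead of NaN for "no sentiment", because NaN does not compare equal to itself in plain Python.

```python
from collections import Counter


def fleiss_kappa_sketch(ratings):
    """Fleiss' kappa for `ratings`: one list of labels per item, same length each."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two ratings per item are required")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    label_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Share of rater pairs that agree on this item
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    expected = sum((total / (n_items * n_raters)) ** 2 for total in label_totals.values())
    if expected == 1.0:
        return 1.0  # only one label was ever used, so agreement is trivially perfect
    return (observed - expected) / (1 - expected)


# Toy example with five raters per item (illustrative labels, not our annotations)
toy = [
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [-1.0, -1.0, -1.0, 0.0, -1.0],
    [None, None, 1.0, None, None],
]
toy_kappa = fleiss_kappa_sketch(toy)
```

A value of 1 means perfect agreement and a value around 0 means agreement no better than chance.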

We can also look at each category separately.

In [None]:
for cat in categories:

    IAA = fleiss_kappa(common_annotations, [cat], labels=labels)

    print(f"The Fleiss Kappa for IAA for the {cat} category is {IAA:.2f}.")

### Decide on a golden label
We now decide on a golden label for the common annotations. We will use majority voting.

In [None]:
# Select the golden label by majority voting
common_annotations = common_annotations.groupby("ID").agg(lambda x: pd.Series.mode(x, dropna=False)[0]).reset_index()

# Join to individual annotations
annotations = pd.concat([common_annotations, annotations[~annotations.ID.isin(common_annotations.ID)]])
annotations.head()
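To make the voting rule explicit, here is a small standard-library sketch of majority voting over the five labels of one review and one category. The function name `majority_label` is hypothetical. Ties are broken by taking the smallest label, which is intended to mirror the `[0]` in the cell above, since `Series.mode` returns its modes sorted; placing `None` (no sentiment) after every number is an assumption of the sketch.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent vote; ties go to the smallest label, None last."""
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    # Sort numerically, with None (no sentiment) after every number
    tied.sort(key=lambda label: (label is None, label if label is not None else 0))
    return tied[0]


clear_majority = majority_label([1.0, 1.0, -1.0, 0.0, 1.0])
tie_between_two = majority_label([1.0, 1.0, -1.0, -1.0, 0.0])
mostly_missing = majority_label([None, None, None, 1.0, 1.0])
```

With five annotators and four possible labels, ties such as two against two against one can occur, so the tie-breaking rule does influence some golden labels.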

# 3. Experiments

## 3.1 Basic Data Exploration

The purpose of this section is to conduct some initial exploration of the reviews.

In [None]:
## PLACEHOLDER FOR GRAPH FUNCTION

## 3.2 Auto-labeller classifier
Train a classifier model using Multinomial Naive Bayes.

### Pre-Processing steps

In [None]:
# Make a copy of the annotations
df = annotations.copy()
# Convert the labels into text
df.iloc[:,2:] = df.iloc[:,2:].applymap(lambda x: num_to_sent(x))
# Correct spelling mistakes
clean_text = grammar_corrector(df["text"])
df["text"] = clean_text
# Lemmatize
df["text"] = df["text"].apply(lambda x: lemmatize_with_postag(x))

# Use vectorizer to encode text as features. TfidfVectorizer takes care of lowercasing and tokenizing the text
vectorizer = TfidfVectorizer(stop_words="english", strip_accents="unicode")
X_tfidf = vectorizer.fit_transform(df.text)

### Training and testing the classifier

In [None]:
# Choose your classifier
classifier = MultinomialNB()

# Set up cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics
scoring_metrics = ['precision_macro', 'recall_macro', 'f1_macro', "accuracy"]

# Evaluate category-wise
for cat in categories:
    print(cat)
    # Perform cross-validation
    scores = cross_validate(classifier, X_tfidf, df[cat], cv=cv, scoring=scoring_metrics, return_train_score=False, error_score="raise")

    # Display the results
    print("Cross-Validation Scores:")
    scores = pd.DataFrame(scores)
    scores.loc['mean'] = scores.mean()
    scores = scores.round(decimals=2)
    display(scores)
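One caveat about the evaluation above: the `TfidfVectorizer` is fitted on all annotated texts before cross-validation, so the vocabulary and IDF weights used in each fold also reflect that fold's held-out reviews. A possible alternative, sketched below on toy data, wraps the vectorizer and the classifier in a scikit-learn pipeline so that the vectorizer is re-fitted on the training part of every fold. The texts and labels are illustrative stand-ins for `df.text` and one category column, not our annotations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for df.text and df[cat] (illustrative only)
texts = [
    "clean gym friendly staff", "great equipment helpful trainers",
    "lovely sauna nice classes", "spacious clean changing rooms",
    "dirty showers broken machines", "rude staff crowded floor",
    "broken sauna terrible service", "crowded dirty locker rooms",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# The vectorizer is now part of the model, so it is fitted inside each fold
model = make_pipeline(
    TfidfVectorizer(stop_words="english", strip_accents="unicode"),
    MultinomialNB(),
)

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_validate(model, texts, labels, cv=cv, scoring=["accuracy", "f1_macro"])
```

In the notebook, the same pipeline could be passed to `cross_validate` with `df.text` in place of `X_tfidf`; whether the scores change noticeably on our data is something we have not measured here.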

### Labelling all reviews

In [None]:
# Fit the vectorizer on all reviews so that training and prediction share one vocabulary
vectorizer = TfidfVectorizer(stop_words='english', strip_accents="unicode")
# We can't predict on missing text reviews, so we keep only the rows that have text
# (a copy, so that the assignments below do not write to a slice of `reviews`)
reviews_w_NA = reviews[~reviews.review.isna()].copy()
reviews_w_NA["review"] = reviews_w_NA["review"].apply(lambda x: lemmatize_with_postag(x))
X_tfidf = vectorizer.fit_transform(reviews_w_NA.review)

# Train one classifier per category
for cat in categories:
    # Train the classifier for a given category
    classifier = MultinomialNB()
    X_train = vectorizer.transform(df.text)
    y_train = df[cat]
    classifier.fit(X_train, y_train)

    # Predict on all instances of the reviews
    y_pred = classifier.predict(X_tfidf)
    reviews_w_NA[cat] = y_pred

# Look at the final results
reviews = pd.concat([reviews_w_NA, reviews[reviews.review.isna()]])
reviews.head()

## 3.3 Visualizing the facilities with Folium

In [None]:
# Set the starting point for the map
lat = 56
lng = 9

denmark_map = folium.Map(location=[lat,lng], zoom_start=7)

In [None]:
geolocations_google = pd.read_csv(RAW_DATA + "google_reviews.csv") 
geolocations_kbh = pd.read_csv(PROCESSED_DATA + "kbh_facilities_reviews.csv") 

In [None]:
# check if the location column is entirely null if so group by address instead
if geolocations_kbh['location'].isnull().all():
    kbh_grouping_field = 'address'
else:
    kbh_grouping_field = 'location'

# group and aggregate kbh_facilities_reviews
kbh_grouped_data = geolocations_kbh.groupby(kbh_grouping_field).agg({
    'lat': 'first', 'lng': 'first', 'activity': 'first', 
    'type': 'first', 'rating': 'mean'}).reset_index()
kbh_grouped_data['rating'] = kbh_grouped_data['rating'].round(1)

# group and aggregate google_reviews
google_grouped_data = geolocations_google.groupby(['lat', 'lng']).agg({
    'type': 'first', 'rating': 'mean'}).reset_index()
google_grouped_data['rating'] = google_grouped_data['rating'].round(1)

marker_cluster = MarkerCluster().add_to(denmark_map)

add_kbh_markers(kbh_grouped_data, marker_cluster)
add_google_markers(google_grouped_data, marker_cluster)

denmark_map.add_child(marker_cluster)
# denmark_map.save('../visualizations/denmark_map.html')
denmark_map

## 3.4 Statistical Analysis
Is there a statistically significant difference in the ratings from the main fitness brands in Denmark's main cities?

### Plot of rating average per brand

In [None]:
# Collect the ratings per gym
ratings = []
labels = []

for gym in reviews.enterprise.unique():
    labels.append(gym)
    ratings.append(list(reviews.loc[reviews.enterprise == gym, "rating"].values))

In [None]:
# Boxplot of ratings
fig, ax = plt.subplots(1, 1)
sns.boxplot(ratings, palette="deep", color=".8", linewidth=.75)
ax.set_xticklabels(labels,fontsize=14) 
ax.set_ylabel("Rating", fontsize=14) 

sns.despine(offset=10, trim=True)
fig.tight_layout()
plt.show()

### Difference between brands: ANOVA Test and Tukey's HSD
How significant are these differences in mean rating?

In [None]:
# Set the significance level
alpha = 0.05

# ANOVA one way test
stat, p_value = f_oneway(*ratings)

if p_value < alpha:
    print(f"With a significance level of {alpha}, we can reject the null hypothesis (P-value = {p_value:.3f})")
else:
    print(f"With a significance level of {alpha}, we cannot reject the null hypothesis (P-value = {p_value:.3f})")
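For readers who want to see what the test statistic measures, the sketch below computes the one-way ANOVA F statistic by hand: the variance of the group means around the grand mean (between-group mean square) divided by the pooled variance inside the groups (within-group mean square). The function name `one_way_f_statistic` is hypothetical and the groups are toy numbers, not our ratings; the p-value additionally requires the F distribution, which `f_oneway` handles for us.

```python
def one_way_f_statistic(groups):
    """F statistic of a one-way ANOVA: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
toy_f = one_way_f_statistic(toy_groups)
```

A large F means the brand averages differ by more than the spread of ratings within each brand would explain.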

If the ANOVA test allows us to reject the null hypothesis of equal means, we can perform Tukey's HSD test to see which pairs of brands differ.

In [None]:
if p_value < alpha:
    # Print the index-to-brand legend for the Tukey's HSD table
    print(" | ".join(f"{i}: {l}" for i, l in enumerate(labels)))
    res = tukey_hsd(*ratings)
    print(res)

### Platform effect on review ratings: Mixed Effect Model

In [None]:
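import statsmodels.api as sm

# Fit the model: platform as fixed effect, random intercept per enterprise
# (assumption: `data` is not defined in this notebook, so we pass the merged `reviews` dataframe)
model = sm.MixedLM.from_formula("rating ~ platform", groups="enterprise", data=reviews)
result = model.fit()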

# Print the model summary
print(result.summary())