Why does Tokyo have low review scores in the "location" category?

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
city_names = ["tokyo", "sydney", "melbourne", "singapore", "hongkong", "taipei", "bangkok"]

In [3]:
cities_df = dict()
for city in city_names:
    cities_df[city] = pd.read_csv("./data/listings_{c}.csv".format(c=city))

In [4]:
reviews_df = dict()
for city in city_names:
    reviews_df[city] = pd.read_csv("./data/reviews_{c}.csv".format(c=city))

In [5]:
df_full_tokyo = pd.merge(cities_df["tokyo"], reviews_df["tokyo"], left_on='id', right_on='listing_id')

df_full_tokyo['all_reviews'] = df_full_tokyo.apply(lambda row: f"Score: {row['review_scores_location']}<br>Review: {row['comments']}", axis=1)

# This step concatenates all reviews and scores for each listing into a single string.
df_full_tokyo['all_reviews'] = df_full_tokyo.groupby('listing_id')['all_reviews'].transform(lambda x: '<br><br>'.join(x))

# Drop duplicate listings, keeping only the first entry (which now contains all reviews and scores)
df_full = df_full_tokyo.drop_duplicates(subset='listing_id')
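
The concatenation step above can be checked on a toy frame. This is only a sketch: the three rows are made up, and the label strings are built with vectorized string operations rather than a row-wise `apply`, which is an alternative and not the method used in the cell above.

```python
import pandas as pd

# Made-up data: listing 1 has two reviews, listing 2 has one.
toy = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "review_scores_location": [4.8, 4.8, 4.2],
    "comments": ["Close to the station", "Quiet street", "Far from everything"],
})

# Build one "Score/Review" string per review row.
toy["all_reviews"] = (
    "Score: " + toy["review_scores_location"].astype(str)
    + "<br>Review: " + toy["comments"].fillna("").astype(str)
)

# Join all strings belonging to the same listing; every row of a listing
# then carries the full concatenated text.
toy["all_reviews"] = toy.groupby("listing_id")["all_reviews"].transform(
    lambda s: "<br><br>".join(s)
)

# Keep one row per listing.
per_listing = toy.drop_duplicates(subset="listing_id").reset_index(drop=True)
print(per_listing[["listing_id", "all_reviews"]])
```

After the last step each listing appears once, and its `all_reviews` cell holds every review separated by `<br><br>`.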


## Plotting listing locations on a map

Plot the listings whose location score is above the 85th percentile (green) or below the 10th percentile (blue).

In [7]:
import folium

tokyo_coords = [35.682839, 139.759455]

m = folium.Map(location=tokyo_coords, zoom_start=12)


# 10th percentile: listings scoring below this form the low-score group
low_cutoff = df_full['review_scores_location'].quantile(0.10)
# 85th percentile: no listings score above the 90th percentile, so 85% is used
high_cutoff = df_full['review_scores_location'].quantile(0.85)

# Plot each point on the map
for idx, row in df_full.iterrows():
    if row['review_scores_location'] < low_cutoff:
        color = 'blue'
    elif row['review_scores_location'] > high_cutoff:
        color = 'green'
    else:
        continue

    marker = folium.CircleMarker(location=[row['latitude'], row['longitude']],
                                 radius=5,
                                 color=color,
                                 fill=True,
                                 fill_color=color,
                                 fill_opacity=0.6)

    marker.add_child(folium.Popup(row['all_reviews'], max_width=300))  # adjust max_width to your preference
    marker.add_to(m)

m.save('./map/bad_and_good_location_{c}.html'.format(c="tokyo"))
print("Map saved to 'bad_and_good_location_tokyo.html'. Open it in your browser to view the map.")

Map saved to 'bad_and_good_location_tokyo.html'. Open it in your browser to view the map.
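
Because review scores are heavily tied near the top of the scale, a strict `>` comparison against a high percentile can select very few listings, or none. The sketch below shows this on made-up scores; the values are hypothetical stand-ins for `df_full['review_scores_location']`.

```python
import pandas as pd

# Hypothetical location scores, including one missing value.
scores = pd.Series([4.2, 4.5, 4.7, 4.8, 4.8, 4.9, 4.9, 5.0, 5.0, 5.0, None])

# quantile() skips missing values.
low_cut = scores.quantile(0.10)
high_cut = scores.quantile(0.85)

# Count listings strictly below / strictly above the cutoffs.
n_low = int((scores < low_cut).sum())
n_high = int((scores > high_cut).sum())

print(f"cutoffs: {low_cut:.2f} / {high_cut:.2f}; below: {n_low}, above: {n_high}")
```

Here the upper cutoff lands on the maximum score, so no listing is strictly above it. Counting the two groups before drawing the map is a cheap way to confirm that the chosen percentiles leave enough points to plot.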


Plot the locations of all listings in Tokyo.
Red marks scores above the median and blue marks scores below it; a stronger colour means a larger deviation from the median.

In [9]:
import numpy as np

df_sampled = df_full.copy()

quantile_50 = df_sampled['review_scores_location'].quantile(0.50)

m = folium.Map(location=[35.6895, 139.6917], zoom_start=12)

df_sampled['deviation'] = df_sampled['review_scores_location'] - quantile_50

def determine_color(deviation):
    if deviation is None or np.isnan(deviation):
        return "#ffffff"
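    # Use logarithmic scaling to emphasize the deviation, clipped to [-1, 1]
    # so that the colour channel computed below stays within 0..255
    scaled_deviation = np.clip(np.sign(deviation) * np.log1p(abs(deviation)), -1, 1)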

    if scaled_deviation > 0:
        # Red color for scores above the median
        intensity = int(255 * (1 - scaled_deviation))
        return f"#{255:02x}{intensity:02x}{intensity:02x}"
    else:
        # Blue color for scores below the median
        intensity = int(255 * (1 + scaled_deviation))
        return f"#{intensity:02x}{intensity:02x}ff"


# Plot each point on the map using the sampled dataframe
for idx, row in df_sampled.iterrows():
    color = determine_color(row['deviation'])
    marker = folium.CircleMarker(location=[row['latitude'], row['longitude']],
                                 radius=5,
                                 color=color,
                                 fill=True,
                                 fill_color=color,
                                 fill_opacity=0.6)
    marker.add_to(m)

m.save('./map/location_scores_heatmap_tokyo_sampled.html')
print("Map saved to 'location_scores_heatmap_tokyo_sampled.html'. Open it in your browser to view the map.")


Map saved to 'location_scores_heatmap_tokyo_sampled.html'. Open it in your browser to view the map.
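
To see how the colour mapping behaves at the extremes, here is a standalone sketch. The function name `deviation_to_hex` is hypothetical, and the clipping to [-1, 1] is an assumption added so that large deviations (where `log1p` exceeds 1) still produce a valid hex colour.

```python
import numpy as np

def deviation_to_hex(deviation):
    """Map a deviation from the median to a hex colour (hypothetical helper)."""
    if deviation is None or np.isnan(deviation):
        return "#ffffff"  # missing score: white
    # Log-scale the deviation, keep its sign, and clip to [-1, 1].
    scaled = float(np.clip(np.sign(deviation) * np.log1p(abs(deviation)), -1.0, 1.0))
    # The further from the median, the less white is mixed in.
    fade = int(255 * (1 - abs(scaled)))
    if scaled > 0:
        return f"#ff{fade:02x}{fade:02x}"  # red shades above the median
    return f"#{fade:02x}{fade:02x}ff"      # blue shades at or below the median

for d in (-4.0, -0.2, 0.0, 0.2, float("nan")):
    print(d, deviation_to_hex(d))
```

A deviation of exactly zero falls into the second branch and comes out white, so listings at the median are nearly invisible on a light map tile.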


### Relation between location and neighbourhood

I created a table of average location scores for each neighbourhood.
The table is sorted by the average location score.

The neighbourhoods with the highest average location scores seem to fall into two groups:
1. Those surrounded by nature, far from the city center. For example, "Fussa Shi" and "Ome Shi" are located in the mountains.
2. Those in upscale areas. For example, "Minato Ku", "Chiyoda Ku", and "Chuo Ku" are business districts, and "Setagaya Ku" is a residential area with many expensive houses.

The neighbourhoods with the lowest average location scores are mostly residential areas without notable natural surroundings.

In [13]:
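# Note: df_full_tokyo has one row per review, so each listing's score is
# weighted by its number of reviews in this mean.
neighbourhood_scores = df_full_tokyo.groupby('neighbourhood_cleansed')['review_scores_location'].mean()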

neighbourhood_scores_sorted = neighbourhood_scores.reset_index().sort_values('review_scores_location', ascending=False)
neighbourhood_scores_sorted.reset_index(drop=True, inplace=True)

print("Top 10 neighbourhoods with the highest average location score:")
print(neighbourhood_scores_sorted.head(10))

print("\nBottom 10 neighbourhoods with the lowest average location score:")
print(neighbourhood_scores_sorted.tail(10).sort_values('review_scores_location', ascending=True))

Top 10 neighbourhoods with the highest average location score:
  neighbourhood_cleansed  review_scores_location
0              Fussa Shi                4.966265
1              Minato Ku                4.809081
2                Ome Shi                4.795043
3            Setagaya Ku                4.777035
4             Chiyoda Ku                4.776826
5              Chofu Shi                4.756662
6            Koganei Shi                4.754881
7             Shibuya Ku                4.749732
8                Chuo Ku                4.744929
9            Shinjuku Ku                4.744700

Bottom 10 neighbourhoods with the lowest average location score:
   neighbourhood_cleansed  review_scores_location
46          Tachikawa Shi                4.146311
45    Musashimurayama Shi                4.239318
44               Hino Shi                4.286992
43              Inagi Shi                4.330000
42             Hamura Shi                4.380460
41            Akiruno Shi       
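
The averages above are computed on the merged table, which has one row per review. A minimal sketch with made-up rows shows how that differs from averaging once per listing; the neighbourhood name "A Ku" and all values are hypothetical.

```python
import pandas as pd

# Hypothetical merged table: listing 1 has three reviews, listing 2 has one.
reviews = pd.DataFrame({
    "listing_id": [1, 1, 1, 2],
    "neighbourhood_cleansed": ["A Ku", "A Ku", "A Ku", "A Ku"],
    "review_scores_location": [5.0, 5.0, 5.0, 4.0],
})

# Mean over review rows: listings with more reviews count more.
per_review = reviews.groupby("neighbourhood_cleansed")["review_scores_location"].mean()

# Mean over listings: each listing counts once.
per_listing = (
    reviews.drop_duplicates(subset="listing_id")
    .groupby("neighbourhood_cleansed")["review_scores_location"]
    .mean()
)

print(per_review["A Ku"], per_listing["A Ku"])
```

Which weighting is appropriate depends on the question: the per-review mean describes what a typical guest experienced, while the per-listing mean describes a typical listing.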

To test whether location review scores differ across neighbourhoods and across property types, I ran a Kruskal-Wallis test on each grouping.

I used the Kruskal-Wallis test instead of ANOVA because the location review scores are not normally distributed.
The null hypothesis is that the distribution of location review scores is the same across neighbourhoods (or property types).

The extremely low p-values in both tests indicate that the differences in location scores between groups are unlikely to be due to chance.

In [24]:
from scipy.stats import kruskal

# One array of non-missing location scores per neighbourhood
data_arrays = [
    df_full.loc[df_full['neighbourhood_cleansed'] == neighbourhood, 'review_scores_location'].dropna().values
    for neighbourhood in df_full['neighbourhood_cleansed'].unique()
]

# Conduct the Kruskal-Wallis H test
stat, p = kruskal(*data_arrays)

print(f"Kruskal-Wallis H statistic: {stat: .10f}")
print(f"P-value: {p :.10f}")


Kruskal-Wallis H statistic:  340.4728430923
P-value: 0.0000000000
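
A p-value printed with ten fixed decimals shows as `0.0000000000` whenever it is smaller than about 5e-11, which hides its magnitude. The sketch below runs the same test on made-up groups and prints the p-value in scientific notation; the toy data, the `groupby`-based construction of the arrays, and the filter for empty groups are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import kruskal

# Made-up scores for three hypothetical neighbourhoods, with one missing value.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "review_scores_location": [4.9, 5.0, 4.8, 4.95, 4.85,
                               4.2, 4.3, 4.1, 4.25, 4.15,
                               4.6, 4.5, 4.55, None, 4.65],
})

# One array per neighbourhood, dropping missing scores and skipping
# neighbourhoods that have no scores at all.
groups = [
    g.dropna().to_numpy()
    for _, g in toy.groupby("neighbourhood_cleansed")["review_scores_location"]
    if g.notna().any()
]

stat, p = kruskal(*groups)
print(f"Kruskal-Wallis H statistic: {stat:.4f}")
print(f"P-value: {p:.3e}")
```

The `.3e` format keeps three significant digits regardless of how small the value is.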
