# Sorting place results

The goal is to determine a sensible ordering of place results for the end user.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

import sys
sys.path.append('../../../../')

In [None]:
data_dir = '../../../../data/wikivoyage/'
# folder where data should live for flask API
api_dir = '../../../../api/data/'

input_path = data_dir + 'processed/wikivoyage_destinations.csv'
output_path1 = data_dir + 'enriched/wikivoyage_destinations.csv'
output_path2 = api_dir + 'wikivoyage_destinations.csv'

In [None]:
from stairway.utils.utils import add_normalized_column

### Read data

In [None]:
df = pd.read_csv(input_path)
df.head()

In [None]:
df['nr_tokens'].describe()

In [None]:
df.columns

In [None]:
columns = ['country', 'id', 'name']

df[columns].head()

### Remove destinations with no tokens

This has to be done before resampling. Observations with a weight of 0 are never sampled, so without this step you cannot 'sort' the *entire* data set: some observations would simply never be drawn.
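
To make this concrete, here is a minimal sketch on made-up toy data (the frame `toy` and its values are assumptions, not part of our data set). It draws two rows without replacement for many seeds and collects which names were ever drawn:

```python
import pandas as pd

# Toy data (made up): destination "b" has no tokens, so its weight is 0.
toy = pd.DataFrame({"name": ["a", "b", "c"], "nr_tokens": [10, 0, 30]})

# Draw two rows without replacement for many different seeds
# and remember every name that was drawn at least once.
drawn = set()
for seed in range(200):
    sample = toy.sample(n=2, weights="nr_tokens", random_state=seed)
    drawn.update(sample["name"])

print(sorted(drawn))
```

The zero-weight destination never shows up, no matter the seed.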

In [None]:
df = df.loc[lambda df: df['nr_tokens'] > 0]

## Biased sorting

To get some randomness while making sure the more important destinations are oversampled, we use `nr_tokens` as a weight in the sampling method.

First, let's look at the overall distribution of `nr_tokens` in our data. It is strongly skewed towards destinations with very few tokens:

In [None]:
(
    df
#     .loc[lambda df: df['country'] == 'Netherlands']
    .assign(nr_tokens_bins = lambda df: pd.cut(df['nr_tokens'], bins = list(range(0, 10000, 500)) + [99999]))
    ['nr_tokens_bins']
    .value_counts()
    .sort_index()
    .plot(kind='bar')
);

You can imagine that you don't want to sample uniformly at random from this distribution. It would mean that you mostly show very obscure destinations to the user.

Let's compare 3 different ways of sampling:
1. without weights (so random)
2. weighting by `nr_tokens`
3. weighting by `nr_tokens` to the power `X`

The stronger the weighting, the more places with a large number of tokens are drawn early on.
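
As a small illustration of these three options, here is a sketch on made-up toy data (990 small and 10 large destinations, and the helper `count_large_in_top`, are assumptions for the example). It counts how many large destinations end up in the first page of results for each weighting:

```python
import pandas as pd

power_factor = 1.5
n_results = 16

# Toy data (made up): 990 small destinations and 10 large ones.
toy = pd.DataFrame({"nr_tokens": [10] * 990 + [10_000] * 10})
toy["nr_tokens_powered"] = toy["nr_tokens"] ** power_factor

def count_large_in_top(weights=None, seed=1234):
    # sample(frac=1) returns every row exactly once, in (weighted) random order
    shuffled = toy.sample(frac=1, weights=weights, random_state=seed)
    return int((shuffled.head(n_results)["nr_tokens"] == 10_000).sum())

n_random = count_large_in_top()
n_weighted = count_large_in_top("nr_tokens")
n_powered = count_large_in_top("nr_tokens_powered")
print(n_random, n_weighted, n_powered)
```

With uniform sampling the first page is almost entirely small destinations, while both weighted variants pull most of the large ones to the front.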

In [None]:
n_results = 16 # number of fetched results per API call
power_factor = 1.5 # exponent applied to nr_tokens, so that bigger documents are sampled more often

fig, axes = plt.subplots(nrows=8, ncols=3, figsize=(16, 8*4))

df_bins = (
    df
    .assign(nr_tokens_bins = lambda df: pd.cut(df['nr_tokens'], bins = list(range(0, 10000, 500)) + [99999]))
    .assign(nr_tokens_powered = lambda df: df['nr_tokens']**power_factor)
) 

for i, row in enumerate(axes):
    for weights, ax in zip(['random', 'nr_tokens', 'nr_tokens^{}'.format(power_factor)], row):
        
        n = (i+1)*n_results
        
        # depending on weights type, sample differently
        if weights == 'random':
            df_plot = df_bins.sample(frac=1, random_state=1234)
        elif weights == 'nr_tokens':
            df_plot = df_bins.sample(frac=1, random_state=1234, weights='nr_tokens')
        else: 
            df_plot = df_bins.sample(frac=1, random_state=1234, weights='nr_tokens_powered')
        
        # plot
        (
            df_plot
            .head(n)
            ['nr_tokens_bins']
            .value_counts()
            .sort_index()
            .plot(kind='bar', ax=ax)
        )
        # prettify plot
        if i < 7:
            ax.get_xaxis().set_ticks([])
        ax.set_title('{} - {} obs'.format(weights, n))
        
fig.tight_layout()
plt.show()

A power factor of 1.5 seems to work nicely. A higher power depletes the places with the most tokens very quickly. For the user this means that they first get all the well-known destinations, and only then the rest. The aim of our app is to surprise and inspire, so we also want to show lesser-known destinations.
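
To get a feeling for what the power does, here is a rough sketch that computes the probability of each destination being drawn first for a few powers. The token counts and the helper name `first_draw_probabilities` are made up for illustration:

```python
import numpy as np

def first_draw_probabilities(nr_tokens, power):
    # Probability of each destination being drawn first when the
    # sampling weight is nr_tokens ** power
    weights = np.asarray(nr_tokens, dtype=float) ** power
    return weights / weights.sum()

# Made-up token counts for a small, a medium and a large destination.
nr_tokens = [100, 1_000, 10_000]

probs = {power: first_draw_probabilities(nr_tokens, power) for power in [1.0, 1.5, 3.0]}
for power, p in probs.items():
    print(power, np.round(p, 4))
```

The higher the power, the closer the largest destination gets to being a certain first pick, which is the quick depletion described above.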

## Write to CSV

Add the sampling weight feature and write the final data set to be used by the frontend.

In [None]:
from stairway.wikivoyage.feature_engineering import add_sample_weight

In [None]:
power_factor = 1.5

output_df = (
    df
    # add the feature
    .pipe(add_sample_weight)
    # other hygiene
    .drop(columns=['nr_tokens', 'ispartof', 'parentid'])
    .set_index('id', drop=False)
    # need to do this to convert numpy int and float to native data types
    .astype('object')
)
output_df.head()

In [None]:
# write 'approved' file to the data and api folders
# output_df.to_csv(output_path1, index=False)
# output_df.to_csv(output_path2, index=False)

## Sorting based on profiles

We want to allow the user to sort based on profiles like 'Nature', 'Culture', 'Beach'. To do this, we have identified which features are part of a profile. For the sorting, we then select the features in scope and sum their BM25 scores to get the final score for the sorting.
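
The idea can be sketched as follows. This is not the code in `api.resources.utils.features`, just a toy version under assumptions: the frames, the column names `feature` and `profile`, and the helper `profile_score` are all hypothetical.

```python
import pandas as pd

# Hypothetical BM25 scores per destination (rows) and feature (columns).
toy_features = pd.DataFrame(
    {"hiking": [0.8, 0.0, 0.3], "museums": [0.1, 0.9, 0.2], "forest": [0.5, 0.1, 0.0]},
    index=pd.Index([1, 2, 3], name="id"),
)
# Hypothetical mapping from feature to profile.
toy_feature_types = pd.DataFrame(
    {"feature": ["hiking", "forest", "museums"], "profile": ["nature", "nature", "culture"]}
)

def profile_score(features, feature_types, profiles):
    # Select the features that belong to the requested profiles and sum their scores
    in_scope = feature_types.loc[feature_types["profile"].isin(profiles), "feature"]
    return features[list(in_scope)].sum(axis=1)

scores = profile_score(toy_features, toy_feature_types, ["nature"])
ranking = scores.sort_values(ascending=False)
print(ranking)
```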

The question is: do these BM25 scores bias towards smaller destinations? If yes, do we want to apply some kind of weighting with the number of tokens, as demonstrated above?

### Imports and data

In [None]:
file_name = 'wikivoyage_destinations.csv'
features_file_name = 'wikivoyage_features.csv'
features_types = 'wikivoyage_features_types.csv'

In [None]:
df_places = pd.read_csv(data_dir + 'enriched/' + file_name).set_index("id", drop=False)
df_features = pd.read_csv(api_dir + features_file_name).set_index("id")
df_feature_types = pd.read_csv(api_dir + features_types)

### Do a sort

In [None]:
from api.resources.utils.features import add_sorting_weight_by_profiles, sort_places_by_profiles

In [None]:
profiles = ['nature']

In [None]:
sort_places_by_profiles(df_places, profiles, df_features, df_feature_types).head()

### Visualize

In [None]:
n_results = 16 # number of fetched results per API call
profiles = ['nature', 'city', 'culture', 'active', 'beach']

fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(16, 8*4))

df_bins = (
    df_places
    .assign(nr_tokens_bins = lambda df: pd.cut(df['nr_tokens'], bins = list(range(0, 10000, 500)) + [99999]))
) 

i = 0
for profile, row in zip(profiles, axes):
    i += 1
    for j, ax in enumerate(row):
        
        n = (j+1)*n_results
        
        # depending on profile, sort differently
        df_sorted = df_bins.pipe(sort_places_by_profiles, [profile], df_features, df_feature_types)
        
        # plot
        (
            df_sorted
            .head(n)
            ['nr_tokens_bins']
            .value_counts()
            .sort_index()
            .plot(kind='bar', ax=ax)
        )
        # prettify plot
        if i < len(profiles):
            ax.get_xaxis().set_ticks([])
        ax.set_title('{} - {} obs'.format(profile, n))
        
fig.tight_layout()
plt.show()

This confirms our hypothesis that sorting on BM25 scores heavily skews the top results towards destinations with few tokens. So let's experiment a little, and scale the profile score with the number of tokens:

In [None]:
n_results = 16 # number of fetched results per API call
power_factor = 1.5
profiles = ['nature', 'city', 'culture', 'active', 'beach']

fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(16, 8*4))

df_bins = (
    df_places
    .assign(nr_tokens_bins = lambda df: pd.cut(df['nr_tokens'], bins = list(range(0, 10000, 500)) + [99999]))
    .assign(nr_tokens_powered = lambda df: df['nr_tokens']**power_factor)
) 

i = 0
for profile, row in zip(profiles, axes):
    i += 1
    for j, ax in enumerate(row):
        
        n = (j+1)*n_results
        
        # depending on profile, sort differently
        df_sorted = (
            df_bins
            .pipe(add_sorting_weight_by_profiles, [profile], df_features, df_feature_types)
            .assign(weight = lambda df: df['nr_tokens'] * df['profile_weight'])
            .sort_values('weight', ascending=False)
        )
        
        # plot
        (
            df_sorted
            .head(n)
            ['nr_tokens_bins']
            .value_counts()
            .sort_index()
            .plot(kind='bar', ax=ax)
        )
        # prettify plot
        if i < len(profiles):
            ax.get_xaxis().set_ticks([])
        ax.set_title('{} - {} obs'.format(profile, n))
        
fig.tight_layout()
plt.show()

That helps, although we seem to be overshooting a bit, and that is without even using the power factor.

This could be because `nr_tokens` is several orders of magnitude bigger than `profile_weight`. Let's therefore try normalizing both first and then adding them.
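
A small sketch of the difference between multiplying the raw values and adding the min-max normalized values. The toy numbers and the helper `min_max_normalize` are made up for illustration:

```python
import pandas as pd

def min_max_normalize(series):
    # Rescale to the range [0, 1]; assumes the series is not constant
    return (series - series.min()) / (series.max() - series.min())

# Toy data (made up): row 0 is small but very relevant for the profile.
toy = pd.DataFrame(
    {
        "nr_tokens": [100, 5_000, 20_000, 12_000],
        "profile_weight": [0.9, 0.3, 0.4, 0.8],
    }
)

# Multiplying raw values: nr_tokens dominates because of its scale
product = toy["nr_tokens"] * toy["profile_weight"]
# Adding normalized values: both columns contribute on the same scale
combined = min_max_normalize(toy["nr_tokens"]) + min_max_normalize(toy["profile_weight"])

print(list(product.sort_values(ascending=False).index))
print(list(combined.sort_values(ascending=False).index))
```

In the product, the small but relevant destination (row 0) ends up last. In the normalized sum it moves ahead of row 1, which is larger but barely relevant.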

In [None]:
n_results = 16 # number of fetched results per API call
power_factor = 1.5
profiles = ['nature', 'city', 'culture', 'active', 'beach']

fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(16, 8*4))

df_bins = (
    df_places
    .assign(nr_tokens_bins = lambda df: pd.cut(df['nr_tokens'], bins = list(range(0, 10000, 500)) + [99999]))
    .assign(nr_tokens_norm = lambda df: (df['nr_tokens'] - df['nr_tokens'].min()) / (df['nr_tokens'].max() 
                                                                                     - df['nr_tokens'].min()))
) 

i = 0
for profile, row in zip(profiles, axes):
    i += 1
    for j, ax in enumerate(row):
        
        n = (j+1)*n_results
        
        # depending on profile, sort differently
        df_sorted = (
            df_bins
            .pipe(add_sorting_weight_by_profiles, [profile], df_features, df_feature_types)
            .assign(profile_weight_norm = lambda df: (df['profile_weight'] - df['profile_weight'].min()) / 
                    (df['profile_weight'].max() - df['profile_weight'].min()))
            .assign(weight = lambda df: df['nr_tokens_norm'] + df['profile_weight_norm'])
            .sort_values('weight', ascending=False)
        )
        
        # plot
        (
            df_sorted
            .head(n)
            ['nr_tokens_bins']
            .value_counts()
            .sort_index()
            .plot(kind='bar', ax=ax)
        )
        # prettify plot
        if i < len(profiles):
            ax.get_xaxis().set_ticks([])
        ax.set_title('{} - {} obs'.format(profile, n))
        
fig.tight_layout()
plt.show()

Better :) 

Now give the profile scores an even higher weight by multiplying that score by a factor before adding it.

Tuning it a bit suggests that `multiplication_factor = 2` is too high: it favors too many of the unknown destinations. More importantly, the right multiplication factor varies quite a bit per profile. We will need to spend time on making this more robust.
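
To see how the factor shifts the balance, here is a toy sketch with two made-up destinations (the values and the helper `top_result` are assumptions for the example):

```python
import pandas as pd

# Toy data (made up): a large, moderately relevant destination
# and a tiny, highly relevant one.
toy = pd.DataFrame(
    {"nr_tokens_norm": [1.0, 0.05], "profile_weight_norm": [0.4, 1.0]},
    index=["well_known", "obscure"],
)

def top_result(multiplication_factor):
    # Same weight formula as in the cell below, applied to the toy frame
    weight = toy["nr_tokens_norm"] + toy["profile_weight_norm"] * multiplication_factor
    return weight.idxmax()

for factor in [1.0, 1.5, 2.0]:
    print(factor, top_result(factor))
```

With a factor of 1 or 1.5 the well-known destination stays on top; with a factor of 2 the obscure one takes over.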

In [None]:
n_results = 16 # number of fetched results per API call
multiplication_factor = 1.5
profiles = ['nature', 'city', 'culture', 'active', 'beach']

fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(16, 8*4))

df_bins = (
    df_places
    .assign(nr_tokens_bins = lambda df: pd.cut(df['nr_tokens'], bins = list(range(0, 10000, 500)) + [99999]))
    .pipe(add_normalized_column, 'nr_tokens')
) 

i = 0
for profile, row in zip(profiles, axes):
    i += 1
    for j, ax in enumerate(row):
        
        n = (j+1)*n_results
        
        # depending on profile, sort differently
        df_sorted = (
            df_bins
            .pipe(add_sorting_weight_by_profiles, [profile], df_features, df_feature_types)
            .pipe(add_normalized_column, 'profile_weight')
            .assign(weight = lambda df: df['nr_tokens_norm'] + (df['profile_weight_norm']*multiplication_factor))
            .sort_values('weight', ascending=False)
        )
        
        # plot
        (
            df_sorted
            .head(n)
            ['nr_tokens_bins']
            .value_counts()
            .sort_index()
            .plot(kind='bar', ax=ax)
        )
        # prettify plot
        if i < len(profiles):
            ax.get_xaxis().set_ticks([])
        ax.set_title('{} - {} obs'.format(profile, n))
        
fig.tight_layout()
plt.show()

In [None]:
df_sorted.head()

Let's just go for that for now, but add the requirement that a destination needs a `feature_score` of at least 0.1 (before normalisation), so that we know no destinations are added that have nothing to do with the profile.
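
Such a requirement could look roughly like the sketch below. It assumes the raw, un-normalised profile score lives in a column called `feature_score`; that column name, the variable `min_feature_score` and the toy data are assumptions, not the actual implementation:

```python
import pandas as pd

min_feature_score = 0.1  # threshold from the text, applied before normalisation

# Toy data (made up): only the last destination is relevant enough for the profile.
toy = pd.DataFrame(
    {
        "name": ["big_city", "small_town", "national_park"],
        "nr_tokens": [20_000, 300, 4_000],
        "feature_score": [0.0, 0.05, 0.8],
    }
)

# Keep only destinations that have at least some relation to the profile
eligible = toy.loc[lambda df: df["feature_score"] >= min_feature_score]
print(eligible["name"].tolist())
```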

Done.