# Preprocessing Wikivoyage

Assumes the Wikivoyage data has already been parsed into a dataframe with metadata and a generator for creating the tokens.

In [None]:
data_dir = '../../../data/wikivoyage/processed/'
api_dir = '../../../api/data/'

path_wiki_metadata_in  = data_dir + 'wikivoyage_metadata_all.csv'
path_wiki_metadata_out = data_dir + 'wikivoyage_destinations.csv'
path_api_out = api_dir + 'wikivoyage_destinations.csv'

In [None]:
import pandas as pd

## Requirements for base product

Essentials (= scope of this notebook):
- Structured metadata:
    * Destination name
    * Geolocation
    * Parent, needed for retrieving country (but could also be done on geolocation?)
- Unstructured content:
    * Embeddings for retrieving activities

Also retrieved in the past, but not yet needed (= not in scope):
- Redirect names so that people can search by name?
- Links to other datasets: DMOZ, Commons, Wikipedia
- Full hierarchy
- Number of direct children & sum of all children destinations
- Number of parents, and whether the parent is 'odd' (parent is park or city)
- Continent

### Load data

In [None]:
data = (
    # converter is not needed if converted earlier at extraction
    pd.read_csv(path_wiki_metadata_in, converters={'status' : str.lower})
    # throw away one odd case in which the title is missing
    .loc[lambda df: ~df['title'].isnull()]
)
data.shape

## Preprocessing

### Setting some of the parents manually

Note: setting "World" as the parent of the continents is effectively undone later, because "World" redirects to "Destinations".

In [None]:
# set `ispartof` for continents that have no parent listed
data.loc[(data['ispartof'].isnull()) & (data['articletype'] == 'continent'), 'ispartof'] = 'World'

Some destinations have a missing parent that cannot be fixed programmatically:
* "Sonoma County" doesn't have any parent listed in its XML text.

So, for these let's also set `ispartof` manually:

In [None]:
data.loc[lambda df: df['title'] == 'Sonoma County', 'ispartof'] = 'North Coast (California)'

### Scope data: throw away most irrelevant content

In [None]:
df = (
    data.copy()  
    .loc[lambda df: ~df['title'].isin(['Space', 'Moon'])]  # these are of type 'park' so need to excl. them by name
    .loc[lambda df: ~df['title'].str.contains('disambiguation')]
    .loc[lambda df: df['disambiguation'] == False]
    .loc[lambda df: df['historical'] == False]
    .loc[lambda df: ~df['articletype'].isnull()]
    .loc[lambda df: ~df['ispartof'].isnull()]
)
print(df.shape)

### Getting the parent path for each destination

Before we can get each parent, we need to replace every `title` and `ispartof` with its redirect, if one is available. This way we avoid 'broken chains' in which an `ispartof` refers to a redirect title instead of an actual title.

To do this we create a lookup dataframe and apply a function to replace the title if there is a redirect.

In [None]:
def replace_title_with_redirect_if_possible(title, lookup_df):
    redirect_title = lookup_df.loc[lookup_df['title'] == title, 'redirect']
    return redirect_title.iat[0] if len(redirect_title) > 0 else title

redirect_table = (
    data
    .loc[lambda df: ~df['redirect'].isnull()]
    [['pageid', 'title', 'redirect']]
    .copy()
)

# filter away all redirects
df = df.loc[lambda df: df['redirect'].isnull()]

df['title'] = df.apply(lambda x: replace_title_with_redirect_if_possible(x['title'], redirect_table), axis=1)
df['ispartof'] = df.apply(lambda x: replace_title_with_redirect_if_possible(x['ispartof'], redirect_table), axis=1)
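
The row-wise `apply` above scans the lookup table once per row. As a possible alternative, here is a minimal sketch that builds the title-to-redirect mapping once and applies it with `Series.map`. The toy data, `redirect_map` and `resolve_redirects` are hypothetical names used only for illustration; they are not part of this notebook's pipeline.

```python
import pandas as pd

# Toy stand-in for the redirect table (values are made up for illustration).
toy_redirects = pd.DataFrame({
    'title': ['Bombay', 'Peking'],
    'redirect': ['Mumbai', 'Beijing'],
})

# Build the mapping once: old title -> redirect target.
# drop_duplicates keeps the first match, mirroring the `.iat[0]` lookup.
redirect_map = (
    toy_redirects
    .drop_duplicates('title')
    .set_index('title')['redirect']
)

def resolve_redirects(titles, redirect_map):
    """Replace each title by its redirect target; keep it unchanged if there is none."""
    return titles.map(redirect_map).fillna(titles)

toy_titles = pd.Series(['Bombay', 'Paris', 'Peking'])
resolved = resolve_redirects(toy_titles, redirect_map)
print(resolved.tolist())
```

On the real data this would be called as `resolve_redirects(df['title'], ...)` and `resolve_redirects(df['ispartof'], ...)`, assuming the mapping is built from `redirect_table`.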

Now we do a left join of the dataframe with itself to get the `parentid`. We need lowercased helper columns for the join, because the capitalization of `title` and `ispartof` sometimes doesn't match. For example, "Geraldton (Ontario)" lists "northern Ontario" as its parent, whereas the actual record starts with a capital N: "Northern Ontario".

In [None]:
lower_case_matching_df = (
    df[['pageid', 'title']]
    # lowercase for better matching
    .assign(title_lower = lambda df: df['title'].str.lower())
    .drop('title', axis=1)
    # rename columns for matching
    .rename({'pageid' : 'parentid'}, axis=1)
)

df = (
    df
    .assign(ispartof_lower = lambda df: df['ispartof'].str.lower())
    .merge(lower_case_matching_df, how='left', left_on='ispartof_lower', right_on='title_lower')
    .drop(['title_lower', 'ispartof_lower'], axis=1)
#     .assign(parentid = lambda df: df['parentid'].astype(int))
)
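
To see what this case-insensitive self-join does, here is a small sketch on made-up data. The `validate` and `indicator` arguments of `merge` are additions that the cell above does not use: `validate='many_to_one'` raises an error if two titles differ only by case (which would silently duplicate rows), and `indicator=True` marks the rows that found no parent. Whether the real data contains such duplicates is not known here.

```python
import pandas as pd

# Made-up records; only the column names follow the notebook.
toy = pd.DataFrame({
    'pageid': [1, 2, 3],
    'title': ['Northern Ontario', 'Geraldton (Ontario)', 'Atlantis'],
    'ispartof': ['Ontario', 'northern Ontario', 'Nowhere'],
})

# Lookup side: lowercased title -> parentid
parents = (
    toy[['pageid', 'title']]
    .assign(title_lower=lambda d: d['title'].str.lower())
    .drop('title', axis=1)
    .rename({'pageid': 'parentid'}, axis=1)
)

joined = (
    toy
    .assign(ispartof_lower=lambda d: d['ispartof'].str.lower())
    .merge(
        parents,
        how='left',
        left_on='ispartof_lower',
        right_on='title_lower',
        validate='many_to_one',  # fail loudly if a lowercased title is not unique
        indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
    )
    .drop(['title_lower', 'ispartof_lower'], axis=1)
)
print(joined[['title', 'ispartof', 'parentid', '_merge']])
```

Rows with `_merge == 'left_only'` have no matching parent, so their `parentid` is missing; this is also why the column cannot be converted to `int` at this point.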

Set `parentid` for "Destinations" and "Other destinations" to 0.

In [None]:
df.loc[df['ispartof'] == 'Destinations', 'parentid'] = 0
df.loc[df['ispartof'] == 'Other destinations', 'parentid'] = 0

print(df.shape)

### Scope data: require good articletype and having a parent

Focus on the core destination content here.

In [None]:
df_scoped = (
    df.copy()  
    .loc[lambda df: df['articletype'].isin(['district', 'city', 'region', 'park', 'country', 'continent'])]
)
print(df_scoped.shape)

Check which records drop out because of the parent matching. There are so few at this point that we simply ignore/delete them.

In [None]:
uitval_parent = (
    df_scoped
    .copy()
    .loc[lambda df: df['parentid'].isnull()]
#     .loc[lambda df: ~df['ispartof'].isin(['Destinations', 'Other destinations'])]
)
print(uitval_parent.shape)
uitval_parent

In [None]:
df_scoped = (
    df_scoped.loc[lambda df: ~df['parentid'].isnull()]
    # finally convert `parentid` into int now that it's always available
    .assign(parentid = lambda df: df['parentid'].astype(int))
)
print(df_scoped.shape)

### Save country as feature

In [None]:
def find_record(pageid, lookup_df):
    return lookup_df.loc[lookup_df['pageid'] == pageid].iloc[0]

def find_parent(pageid, lookup_df):
    current_record = find_record(pageid, lookup_df)
    articletype, country, parentid = current_record['articletype'], current_record['title'], current_record['parentid']
    
    # loop until country found, or no other possibilities left
    while (current_record['articletype'] != 'country') and (current_record['parentid'] != 0):
        
        # lookup parent record and get type
        current_record = find_record(current_record['parentid'], lookup_df)
        articletype, country, parentid = current_record['articletype'], current_record['title'], current_record['parentid']
                
    # when done with the loop, return the country name if found
    return country if articletype == 'country' else None


lookup_df = df_scoped.copy()
df_scoped['country'] = df_scoped['pageid'].apply(lambda x: find_parent(x, lookup_df))
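
The walk up the parent chain can also be sketched with a plain dictionary lookup. The version below is an illustrative variant, not the function used above: it adds two guards that are assumptions on my part, namely stopping when a `parentid` is not present in the table and stopping when a page is visited twice (a cycle). The toy data and the name `find_country` are hypothetical.

```python
import pandas as pd

# Made-up hierarchy; only the column names follow the notebook.
toy = pd.DataFrame({
    'pageid':      [1, 2, 3, 4, 5],
    'title':       ['Europe', 'Netherlands', 'North Holland', 'Amsterdam', 'Orphan'],
    'articletype': ['continent', 'country', 'region', 'city', 'city'],
    'parentid':    [0, 1, 2, 3, 99],  # 99 does not exist in the table
})

# One dict lookup per step instead of a dataframe scan per step.
records = toy.set_index('pageid')[['title', 'articletype', 'parentid']].to_dict('index')

def find_country(pageid, records):
    """Follow parentid links until a record of type 'country' is found.

    Returns None when the chain ends (parent 0 or unknown parent)
    or when a cycle is detected.
    """
    seen = set()
    current = pageid
    while current in records and current not in seen:
        seen.add(current)
        record = records[current]
        if record['articletype'] == 'country':
            return record['title']
        current = record['parentid']
    return None

toy['country'] = toy['pageid'].map(lambda pid: find_country(pid, records))
print(toy[['title', 'country']])
```

As in the function above, a country record resolves to its own title, and records above country level (such as continents) end up without a country.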

There are quite a few destinations for which a country couldn't be found. Many of these are special regions belonging to bigger countries, like many of the Caribbean islands:

- Puerto Rico
- Cayman Islands
- U.S. Virgin Islands
- Bonaire
- French Guiana (doesn't have its own flag, but is part of France - could set France as parentid)

However, many of these islands have their own flag. Need to solve that by matching with some flag dataset in the future.

In [None]:
uitval_country = df_scoped.loc[(df_scoped['country'].isnull()) & (df_scoped['articletype'] != "region")].copy()
print(uitval_country.shape)
uitval_country.sample(10)

To make sure every destination has a value for the country feature, set it to `ispartof` when `country` is missing:

In [None]:
df_scoped.loc[lambda df: df['country'].isnull(), 'country'] = df_scoped.loc[lambda df: df['country'].isnull(), 'ispartof']

### Scope data: select only end destinations

Keep cities and parks only.

In [None]:
df_dest = (
    df_scoped
    .loc[lambda df: df['articletype'].isin(['city', 'park'])]
    .drop(['redirect', 'disambiguation', 'historical'], axis=1)
)
print(df_dest.shape)

### Scope data: require geo location

Make sure all destinations have a geolocation.

In [None]:
uitval_geo = df_dest.loc[lambda df: (df['lat'].isnull()) | (df['lon'].isnull())].copy()

print(uitval_geo.shape)
uitval_geo.sample(3)

In [None]:
df_final = df_dest.loc[lambda df: (~df['lat'].isnull()) & (~df['lon'].isnull())].copy()
print(df_final.shape)

## Write to CSV

TODO: make sure the longitude column of the input dataframe is renamed from `lon` to `lng`

In [None]:
(
    df_final
    .rename(columns={'pageid': 'id', 'title': 'name', 'articletype': 'type', 'lon': 'lng'})
    .to_csv(path_api_out, index=False)
#     .to_csv(path_wiki_metadata_out, index=False)
)

Done.