# Activities with BM25 - Old style


Steps taken previously:
1. Select a list of activities (we had 87; from Booking?)
2. Train Word2Vec model on entire Wikivoyage corpus (no filtering!)
3. For each activity: 
    - get vector with 50 most similar words
    - manually remove words that are not relevant for a topic (output in [gsheet](https://docs.google.com/spreadsheets/d/1aucwUbyvVzBQ39lz4ipzKeBE30VADS_nFc2ey_iYo-8/edit#gid=0))
    - what is left is the search query for the activity
4. Get texts for all destinations in scope.
5. Use BM25 to create a score for each place/activity pair.
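
Step 3 can be sketched as follows. This is only an illustration: the candidate words and similarity values are made-up toy data, and `build_query` is a hypothetical helper, not the code that produced the gsheet. In the original workflow the candidates would come from the trained Word2Vec model (for instance via gensim's `most_similar(..., topn=50)`).

```python
def build_query(candidates, rejected):
    """Keep candidate words, in similarity order, that were not manually rejected."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked if word not in rejected]

# Toy stand-in for the "most similar words" of one activity (made-up values).
candidates = [("vineyard", 0.81), ("winery", 0.88), ("brewery", 0.74), ("cellar", 0.69)]
# Words removed by hand because they do not fit the topic.
rejected = {"brewery"}

query = build_query(candidates, rejected)
print(query)
```

The remaining list of words is what later gets passed to BM25 as the tokenised query for that activity.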


In [None]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
source_dir = '../../../../'
feature_input_dir = 'src/stairway/wikivoyage/feature_engineering/features_input_data/'
feature_terms_file = 'feature_terms.csv'
feature_mapping_file = 'feature_profiles.csv'

## Destinations

In [None]:
queries_df = pd.read_csv(source_dir + feature_input_dir + feature_terms_file, header=None, index_col=0)

In [None]:
queries = queries_df.apply(
#     lambda x: ','.join(x.dropna().astype(str)),
    lambda x: x.dropna().astype(str).tolist(),
    axis=1
)
queries.head()

Nice texts
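
To make the row-to-list step concrete, here is a small sketch on made-up file contents. It assumes the terms file has the activity name in the first column and a variable number of terms after it, padded with empty fields; the names `csv_text`, `toy_df` and `toy_queries` are hypothetical.

```python
import io
import pandas as pd

# Toy file contents (made up): first column is the activity, the rest are terms.
csv_text = (
    "art galleries,gallery,museum,exhibition\n"
    "wineries,wine,vineyard,\n"
)

toy_df = pd.read_csv(io.StringIO(csv_text), header=None, index_col=0)

# Empty fields are read as NaN; drop them and keep the rest as a list of strings.
toy_queries = toy_df.apply(lambda row: row.dropna().astype(str).tolist(), axis=1)
print(toy_queries["wineries"])
```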

In [None]:
types = pd.read_csv(source_dir + feature_input_dir + feature_mapping_file)
types.head()

## Place texts

In [None]:
%%time
path_wiki_in  = source_dir + 'data/wikivoyage/raw/enwikivoyage-20191001-pages-articles.xml.bz2'

from gensim.corpora import WikiCorpus

wiki = WikiCorpus(path_wiki_in, article_min_tokens=0) 

In [None]:
%%time
corpus = list(wiki.get_texts())
print(len(corpus))

Get index for places in scope

In [None]:
df = pd.read_csv(source_dir + 'data/wikivoyage/enriched/wikivoyage_destinations.csv')
df.shape

In [None]:
df_all = pd.read_csv(source_dir + 'data/wikivoyage/clean/wikivoyage_metadata_all.csv')
df_all.shape

Check that `df_all` has as many rows as the corpus. Otherwise the positional indexing below would not work.

In [None]:
assert len(corpus) == len(df_all)

In [None]:
# get indices from df_all that are in scope
scope = df_all.loc[df_all['pageid'].isin(df['wiki_id']), ['pageid']]
scope.shape

In [None]:
# get texts for places in scope
corpus_scope = [corpus[i] for i in scope.index]
len(corpus_scope)

## BM25

[Explanation of BM25](https://turi.com/learn/userguide/feature-engineering/bm25.html) including a Python example/library. The transformed output is a column of type float with the BM25 score for each document.

This implementation seems easiest to use: https://pypi.org/project/rank-bm25/
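
For intuition, here is a minimal pure-Python sketch of the textbook BM25 score. It is an assumption-laden illustration, not the code inside `rank_bm25`: the IDF smoothing used here is one common variant and the library may handle IDF differently, so absolute values need not match. `bm25_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Score every tokenised document in `corpus` against the tokenised `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        # Documents longer than average get a larger denominator (lower score).
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF that stays positive (one common variant).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Made-up toy corpus of three tokenised "places".
toy_corpus = [
    ["art", "gallery", "museum"],
    ["beach", "surf", "island"],
    ["museum", "history", "castle", "gallery", "gallery"],
]
toy_scores = bm25_scores(toy_corpus, ["gallery", "museum"])
print(toy_scores)
```

A document that contains none of the query terms gets a score of exactly 0, which is why counting scores above 0 is a useful sanity check.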

In [None]:
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi(corpus_scope)

### Try one

In [None]:
query = queries['art galleries']
query

In [None]:
# apply bm25
doc_scores = bm25.get_scores(query)


In [None]:
# print min, max scores and how many documents got a score bigger than 0
print('min:', min(doc_scores), 'max:', max(doc_scores), '>0:', sum(doc_scores > 0))

In [None]:
top_5 = np.argsort(doc_scores)[-5:]
print(top_5)
print(doc_scores[top_5])

In [None]:
df.iloc[top_5]
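
Note that `np.argsort` sorts ascending, so the best match is the last entry, and that `df.iloc[top_5]` assumes the rows of `df` are in the same order as the documents in `corpus_scope`. If that order is not guaranteed, a safer lookup goes through the page ids. The sketch below uses made-up toy frames (`toy_scope`, `toy_df`, `toy_doc_scores` are hypothetical names) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made up): scores are in the order of `scope`, not of `df`.
toy_scope = pd.DataFrame({"pageid": [11, 22, 33, 44]})
toy_df = pd.DataFrame({"wiki_id": [44, 33, 22, 11], "name": ["D", "C", "B", "A"]})
toy_doc_scores = np.array([0.0, 2.5, 0.7, 1.9])

# argsort is ascending, so reverse it to list the best match first.
top_2 = np.argsort(toy_doc_scores)[::-1][:2]

# Translate score positions to page ids, then look those ids up in the destination table.
top_pageids = toy_scope["pageid"].to_numpy()[top_2]
top_places = toy_df.set_index("wiki_id").loc[top_pageids]
print(top_places)
```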

The ranking seems heavily biased towards places with relatively little text that contains a couple of the query terms.

**TODO**: investigate how longer documents could still end up high in the ranking.
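
One possible starting point for this TODO is the length normalisation parameter `b`. The sketch below isolates the term-frequency part of the textbook BM25 formula and compares a short and a long document with the same term count; all numbers are made up and `tf_component` is a hypothetical helper. The `BM25Okapi` constructor in rank-bm25 takes `k1` and `b` keyword arguments, so a lower `b` could be tried on the real corpus.

```python
def tf_component(tf, doc_len, avg_len, k1=1.5, b=0.75):
    """Term-frequency part of BM25 for one term in one document."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

avg_len = 1000  # made-up average document length in tokens

for b in (0.75, 0.25, 0.0):
    # Same term count (2) in a short and in a long document.
    short_doc = tf_component(tf=2, doc_len=100, avg_len=avg_len, b=b)
    long_doc = tf_component(tf=2, doc_len=5000, avg_len=avg_len, b=b)
    print(f"b={b}: short={short_doc:.2f}, long={long_doc:.2f}")
```

With `b=0` document length is ignored entirely, so both documents get the same value. With larger `b` the long document is penalised more.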

### Loop over all queries

In [None]:
%%time
# iterate over the query lists directly; integer lookups on a label-indexed Series are deprecated
scores = np.array([bm25.get_scores(query) for query in queries]).T
print(scores.shape)

In [None]:
df_scores = pd.DataFrame(scores, columns=queries.index)
df_scores.shape

In [None]:
df_scores.head()

Add proper column names

In [None]:
df_scores.columns = types['feature_name']
df_scores.head()

#### Examine a top 5:

In [None]:
df.iloc[np.argsort(df_scores['Whale watching'])[-5:]]

## Compare to old scores

Note that the old scores had more places in scope, so exact counts don't match. Also, the old BM25 implementation was written manually by Bram instead of imported from a library.

In [None]:
df_scores_old = pd.read_csv(source_dir + "data/old-sql-database/destination_scores.csv")
df_scores_old.shape

Compare distributions for some features:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16,6))

df_scores['Museums'].hist(bins=30, ax=axes[0])
axes[0].set_title('New scores. Count = {}'.format(sum(df_scores['Museums'] > 0)), size=15)

df_scores_old['museums'].hist(bins=30, ax=axes[1]);
axes[1].set_title('Old scores. Count = {}'.format(sum(df_scores_old['museums'] > 0)), size=15)

fig.tight_layout()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16,6))

df_scores['Islands'].hist(bins=30, ax=axes[0])
axes[0].set_title('New scores. Count = {}'.format(sum(df_scores['Islands'] > 0)), size=15)

df_scores_old['islands'].hist(bins=30, ax=axes[1]);
axes[1].set_title('Old scores. Count = {}'.format(sum(df_scores_old['islands'] > 0)), size=15)

fig.tight_layout()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16,6))

# this time plot only > 0
df_scores['Wineries'].loc[df_scores['Wineries'] > 0].hist(bins=30, ax=axes[0])
axes[0].set_title('New scores. Count = {}'.format(sum(df_scores['Wineries'] > 0)), size=15)

df_scores_old['wineries'].loc[df_scores_old['wineries'] > 0].hist(bins=30, ax=axes[1]);
axes[1].set_title('Old scores. Count = {}'.format(sum(df_scores_old['wineries'] > 0)), size=15)

fig.tight_layout()

Distributions are quite different. Possible reasons:

* New text data: content might have changed in Wikivoyage
* Different sizes and possibly different places in scope
* Different implementations of BM25 (package vs. manual)
* Different hyperparameters for BM25

However, the total counts per category and the overall distributions are similar enough to accept the new scores as feature scores.
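
A more quantitative comparison could restrict both score sets to the places they share and compare rankings instead of raw values. The sketch below is one possible approach on made-up data; it assumes both tables can be keyed by a common place id, which is not verified here, and the names `new`, `old` and `both` are hypothetical.

```python
import pandas as pd

# Toy stand-ins (made up) for new and old scores of one feature, keyed by place id.
new = pd.Series([0.0, 3.2, 1.1, 0.0, 5.4], index=[1, 2, 3, 4, 5], name="new")
old = pd.Series([0.4, 2.0, 0.0, 9.1], index=[2, 3, 4, 6], name="old")

# Compare only places that occur in both sets (inner join on the index).
both = pd.concat([new, old], axis=1, join="inner")

# Share of places with a non-zero score in each set.
share_nonzero = (both > 0).mean()

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
rank_corr = both["new"].rank().corr(both["old"].rank())
print(share_nonzero)
print(rank_corr)
```

Rank correlation is insensitive to differences in scale, so it would not be thrown off by different BM25 hyperparameters alone.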


## Write to csv

In [None]:
output_path = source_dir + 'data/wikivoyage/enriched/wikivoyage_features.csv'

df_final = pd.concat([df[['id']], df_scores], axis=1)
df_final.to_csv(output_path, index=False)

In [None]:
api_path = 'api/data/wikivoyage_features.csv'

df_final.to_csv(source_dir + api_path, index=False)

In [None]:
api_path_types = 'api/data/wikivoyage_features_types.csv'

types.to_csv(source_dir + api_path_types, index=False)

Done.