In [None]:
%load_ext autoreload
%autoreload 2

# Wikivoyage

The latest dump of the English Wikivoyage content can be downloaded at:
https://dumps.wikimedia.org/enwikivoyage/latest/

A specific dump can also be downloaded by putting its date (YYYYMMDD) in the URL:
https://dumps.wikimedia.org/enwikivoyage/20191001/
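
As a small sketch, the URL of a dump file can be assembled from its date. The helper name `dump_url` is hypothetical; the file name pattern is assumed to follow the `enwikivoyage-20191001-pages-articles.xml.bz2` file used in this notebook.

```python
BASE_URL = 'https://dumps.wikimedia.org/enwikivoyage/'

def dump_url(date='latest'):
    """Return the URL of the pages-articles dump for a date such as '20191001'."""
    # Assumption: the file name repeats the date that is used in the folder name.
    filename = 'enwikivoyage-{}-pages-articles.xml.bz2'.format(date)
    return BASE_URL + date + '/' + filename

print(dump_url('20191001'))
```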

In [None]:
data_dir = '../../../data/wikivoyage/'

path_wiki_in  = data_dir + 'raw/enwikivoyage-20191001-pages-articles.xml.bz2'
path_wiki_out = data_dir + 'clean/wikivoyage_metadata_all.csv'

In [None]:
import re
import pandas as pd

from itertools import islice

## Requirements for base product

structured:
* destination name
* parent (incl. hierarchy) -> country, continent
* geolocation
* (possibly: synonyms?)

text:
* activities



## Gensim

Gensim has a `WikiCorpus` class that can be used to parse the Wikivoyage dump.

In [None]:
from gensim.corpora import WikiCorpus

wiki = WikiCorpus(path_wiki_in, article_min_tokens=10) 
wiki.metadata = True

`wiki.metadata = True` adds `pageid` and `title` to each tokenized document.

Some arguments to play with: 
- Only articles of sufficient length are returned (short articles, redirects, etc. are ignored). This is controlled by `article_min_tokens` on the class instance.
- Set `token_min_len`, `token_max_len` as thresholds for token lengths that are returned (default to 2 and 15).
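
To get a feel for what the token length thresholds do, here is a minimal sketch. It only illustrates the effect of the two thresholds and is not Gensim's actual implementation; the function name `filter_tokens` is made up for this example.

```python
def filter_tokens(tokens, token_min_len=2, token_max_len=15):
    """Keep only tokens whose length lies within [token_min_len, token_max_len]."""
    return [t for t in tokens if token_min_len <= len(t) <= token_max_len]

# 'a' is too short and the 16-character token is too long for the defaults.
tokens = ['a', 'is', 'amsterdam', 'x' * 16]
print(filter_tokens(tokens))
```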

Finally, `wiki.get_texts()` can be used to retrieve the parsed contents:

In [None]:
for (tokens, (pageid, title)) in islice(wiki.get_texts(), 5):
    print(pageid, title)

To see how many documents were parsed in total:

In [None]:
all_pages = [pageid for (tokens, (pageid, title)) in wiki.get_texts()]
len(all_pages)

## Extending Gensim to parse text content

Cleaning steps on text taken previously:
1. lower case
2. extracting type (city, park, region, country, continent)
3. extracting status (outline, usable, guide, star)
4. remove empty texts (maybe also throw away ones with very little text?)
5. get geo coordinates
6. get wikipedia link
7. get parent
8. get commons name (reference to other dataset)
9. get DMOZ folder (reference to other dataset)
10. add size of text
11. set parents of continent to 'world'
12. get parent ids by string matching ... follows from 'ispartof' ...
13. throw away some specific stuff with parents like moon and space

That's a lot!

Instead of running the R scripts I built a long time ago, it's probably better to adapt the Gensim code to parse this info on the fly. Let's make a copy of the Gensim code and create our own module.

In [None]:
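# The cells below refer to the module as `wikivoyage`, so alias it here
# (assumption: `parsing` is the module that holds our adapted `WikiCorpus`).
from stairway.wikivoyage import parsing as wikivoyage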

Let's begin with retrieving the following data from the text:

```
{{IsPartOf|North Brabant}}
{{guidecity}}
{{geo|51.69014|5.29897|zoom=15}}
```

The logic of the class lives in `get_texts()`:
1. `extract_pages()` yields the text, pageid, and title of each page. This is where the **metadata** comes from.
2. `process_article()` runs in parallel over multiple worker processes and converts texts into tokens. This is the part to adapt for parsing **text**:
    1. `filter_wiki()` filters out wiki markup from `raw`, leaving only text:
        1. convert to unicode
        2. decode HTML entities
        3. `remove_markup()` filters out wiki markup from `text`, leaving only text.
            1. `remove_template()` is, finally, the function that removes our fields of interest
    2. `lemmatize()` lemmatizes the text, if wanted.
    3. `tokenize()` tokenizes the text.

The `remove_markup()` function contains a lot of regex parsing. Let's adjust this function so that, instead of only removing the strings matched by these regexes, it also returns them (together with the text). Its input is a text, so let's get one example text to work with first.

In [None]:
wiki_new = wikivoyage.WikiCorpus(path_wiki_in, article_min_tokens=10)
wiki_new.metadata = True

In [None]:
example_texts = []
for (pageid, title, redirect, nr_tokens, patterns, text) in islice(wiki_new.get_texts(), 10):
    example_texts.append(text)
    
example_texts[0]

Now let's further examine Gensim's logic by looking into the `remove_markup()` function.  It looks like the part that we look for is between `{{ ... }}` and is removed by the `remove_template()` function in it. Let's check what that does to our text:

In [None]:
from gensim.corpora.wikicorpus import remove_markup, remove_template

# remove_markup(example_texts[0], promote_remaining=True, simplify_links=True)
# remove_template(example_texts[0])

Indeed, `remove_template()` is what strips these fields. So we need to alter something here.

Also, one can look at the [template documentation](https://meta.wikimedia.org/wiki/Help:Template) to understand it a bit better.

Now, let's try to adapt the function. Or instead of adapting it, let's add a function that retrieves the desired output and then apply the `remove_template()` after it to clean up as usual.

First let's examine how the regex works in the Gensim code:

In [None]:
RE_P3 = re.compile(r'{{([^}{]*)}}', re.DOTALL | re.UNICODE)
re.search(RE_P3, example_texts[0]).groups()

Check out the documentation on [regex syntax](https://docs.python.org/3/library/re.html?highlight=dotall#regular-expression-syntax) to break down this regular expression:
- `[^}{]`:
    - Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
    - If the first character of the set is '^', all the characters that are not in the set will be matched.
    - In normal words: match anything that is not a `{` or `}`
- `*` Causes the resulting RE to match 0 or more repetitions of the preceding RE
- `(...)` Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group

Great. So basically this boils down to 'match anything between `{{ ... }}`', as long as it contains no further braces.
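
To make this concrete, here is a small sketch that applies the same pattern to the template snippet shown earlier (used here as a stand-in for a real article). The second example shows that, for nested templates, only the innermost one matches, because the character set excludes braces.

```python
import re

RE_P3 = re.compile(r'{{([^}{]*)}}', re.DOTALL | re.UNICODE)

# Stand-in for an article: the three templates we are interested in.
sample = '{{IsPartOf|North Brabant}}\n{{guidecity}}\n{{geo|51.69014|5.29897|zoom=15}}'
templates = RE_P3.findall(sample)
print(templates)

# Nested templates: only the innermost template is matched.
nested = '{{outer|{{inner}}}}'
print(RE_P3.findall(nested))
```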

Now let's create our own pattern specific to one of our use cases:

In [None]:
RE_P_Geo = re.compile(r'{{(geo|mapframe)\|([-]?[0-9]+[.]?[0-9]*)\|([-]?[0-9]+[.]?[0-9]*)([^}{]*)}}', 
                      re.DOTALL | re.UNICODE)

match = re.search(RE_P_Geo, example_texts[0])
match.groups()

Awesome! Now see if we can feed this back to the final output. First make a function.

In [None]:
def extract_patterns(s, pattern):

    # get geo coordinates if available
    match = re.search(pattern, s)
    if match:
        lat = match.group(2)
        lon = match.group(3)
        return lat, lon
    else: 
        return None

In [None]:
extract_patterns(example_texts[0], RE_P_Geo)

In [None]:
extract_patterns(example_texts[1], RE_P_Geo)
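
A possible variation, sketched below, returns the coordinates as floats instead of strings. The function name `extract_geo` is hypothetical and the input string is the geo template from the snippet above, not a full article.

```python
import re

RE_P_Geo = re.compile(
    r'{{(geo|mapframe)\|([-]?[0-9]+[.]?[0-9]*)\|([-]?[0-9]+[.]?[0-9]*)([^}{]*)}}',
    re.DOTALL | re.UNICODE)

def extract_geo(s, pattern=RE_P_Geo):
    """Return (lat, lon) as floats for the first geo tag in s, or None."""
    match = pattern.search(s)
    if match is None:
        return None
    # Group 1 is the template name, groups 2 and 3 hold latitude and longitude.
    return float(match.group(2)), float(match.group(3))

print(extract_geo('{{geo|51.69014|5.29897|zoom=15}}'))
print(extract_geo('no geo tag here'))
```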

Sometimes we have to find the last occurrence of a match. For example, a page can have multiple `IsPartOf`s, as is the case for "Azores":

```
{{isPartOf|Islands of the Atlantic Ocean}}
{{isPartOf|Portugal}}
```

Or more often, there are multiple GEO tags. 

We resolve this by adding `(?s:.*)` in front of the regex. Because `.*` is greedy, it first consumes the whole text and then gradually backs off, so the rest of the pattern matches the last occurrence. However, this slows down the parsing of the Wikivoyage data considerably, so remember to fix this if you want to speed up the code.
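
The sketch below shows the greedy prefix on the "Azores" example, next to one possible alternative that collects all matches and keeps the last one. The pattern names are made up for this example, and whether the alternative is actually faster on the full dump has not been measured here.

```python
import re

text = '{{isPartOf|Islands of the Atlantic Ocean}}\n{{isPartOf|Portugal}}'

# Greedy prefix: .* first consumes everything and then backs off,
# so the capturing group ends up on the last occurrence.
RE_LAST = re.compile(r'(?s:.*){{ispartof\|([^}{]*)}}', re.IGNORECASE)
last_greedy = RE_LAST.search(text).group(1)
print(last_greedy)

# Possible alternative: find all occurrences and keep the last one.
RE_ALL = re.compile(r'{{ispartof\|([^}{]*)}}', re.IGNORECASE)
matches = RE_ALL.findall(text)
last_findall = matches[-1] if matches else None
print(last_findall)
```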

Another tweak is cleaning the parsed strings. For example, we trim whitespace and replace `_` with spaces in `ispartof` titles to avoid mismatches like:
* `ispartof`: Lhasa_(prefecture) vs. `title`: Lhasa (prefecture)
* `ispartof`: West_Yorkshire vs. `title`: West Yorkshire
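
A minimal sketch of this cleaning step, with a hypothetical helper name `clean_title`:

```python
def clean_title(s):
    """Replace underscores with spaces and trim surrounding whitespace."""
    return s.replace('_', ' ').strip()

print(clean_title(' Lhasa_(prefecture) '))
print(clean_title('West_Yorkshire'))
```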

Now that we have this, all we need to do is add this output to our own version of the WikiCorpus class.

In [None]:
wikivoyage.remove_markup(example_texts[0], extract_features=True)

Sweet, so now we just need to create similar functions for the other features of interest and pass the results on so that it all finally ends up in the output of the `WikiCorpus.get_texts()` function.

Note: let's leave out links to Wikipedia, DMOZ and Commons databases for now.
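
As a rough sketch of what such extra functions could look like, the patterns below pull the status, type and parent out of the template snippet shown earlier. The pattern and function names are hypothetical, and the lists of statuses and types are only the ones mentioned in the cleaning steps above, so real pages may use values that are not covered.

```python
import re

# Assumption: the status/type template is '<status><type>', e.g. 'guidecity'.
RE_P_STATUS = re.compile(
    r'{{(outline|usable|guide|star)(city|park|region|country|continent)}}',
    re.IGNORECASE)
RE_P_PARENT = re.compile(r'{{ispartof\|([^}{]*)}}', re.IGNORECASE)

def extract_status_type(s):
    """Return (status, type) in lower case, or (None, None) if not found."""
    match = RE_P_STATUS.search(s)
    if match is None:
        return None, None
    return match.group(1).lower(), match.group(2).lower()

def extract_parent(s):
    """Return the first IsPartOf value with whitespace trimmed, or None."""
    match = RE_P_PARENT.search(s)
    return match.group(1).strip() if match else None

sample = '{{IsPartOf|North Brabant}}\n{{guidecity}}\n{{geo|51.69014|5.29897|zoom=15}}'
print(extract_status_type(sample))
print(extract_parent(sample))
```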

In [None]:
for (pageid, title, redirect, nr_tokens, patterns, text) in islice(wiki_new.get_texts(), 10):
    print(pageid, title, redirect, nr_tokens, patterns)

Ok bam! Let's get all data!

#### Write to CSV

Write it to a CSV so it can be preprocessed further in another piece of code. Set `article_min_tokens=0` to get all redirects.

In [None]:
wiki_new = wikivoyage.WikiCorpus(path_wiki_in, article_min_tokens=0)
wiki_new.metadata = True

wiki_new.write_to_csv(path_wiki_out)
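
For reference, a method like `write_to_csv()` could look roughly like the sketch below. This is not the actual implementation in our module: the function name `write_rows_to_csv`, the column names and the example row are all made up for illustration, and it writes to an in-memory buffer instead of a file.

```python
import csv
import io

def write_rows_to_csv(rows, fileobj):
    """Write (pageid, title, redirect, nr_tokens, patterns, text) tuples as CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(['pageid', 'title', 'redirect', 'nr_tokens', 'patterns'])
    for pageid, title, redirect, nr_tokens, patterns, _text in rows:
        # The full text is left out here; only the metadata is written.
        writer.writerow([pageid, title, redirect, nr_tokens, patterns])

# Hypothetical example row, not taken from the dump.
rows = [('1', 'Example page', None, 120, 'lat=51.69;lon=5.30', 'full text ...')]
buffer = io.StringIO()
write_rows_to_csv(rows, buffer)
print(buffer.getvalue())
```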

#### Extract all texts

This can be useful for debugging, when you want to look up a specific destination and see what happens to it during parsing.

In [None]:
all_texts = []
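# with metadata enabled, each item has six fields (including `redirect`), as in the loop above
for (pageid, title, redirect, nr_tokens, patterns, text) in wiki_new.get_texts():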
    all_texts.append(text)

In [None]:
all_texts[1130]

### Parse text into tokens

To just parse the text and not return metadata, set `metadata` to `False` (the default option):

In [None]:
wiki_new = wikivoyage.WikiCorpus(path_wiki_in, article_min_tokens=10)
wiki_new.metadata = False

In [None]:
example_tokens = []

for tokens in islice(wiki_new.get_texts(), 2):
    example_tokens.append(tokens)
    
example_tokens[0]

## Capturing redirects

We need to capture redirects to make sure we can complete the full hierarchy for each destination. 

For example:
* "Madeira" and "Saint Helena, Ascension and Tristan da Cunha" have `{{IsPartOf|Islands of the Atlantic Ocean}}`
* However, "Islands of the Atlantic Ocean" in itself is a redirect to "South Atlantic Islands":
```xml
    <title>South Atlantic islands</title>
    <ns>0</ns>
    <id>33370</id>
    <redirect title="Islands of the Atlantic Ocean" />
```
* This means we could not traverse the tree any further if we did not have this redirect information.
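
The idea can be sketched as follows: resolve every parent through a redirect lookup before moving one level up. The function names and the toy data are hypothetical and only meant to show why the redirect information is needed.

```python
def resolve(title, redirects):
    """Follow redirects until a title is reached that is not a redirect."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title

def hierarchy(title, parents, redirects):
    """Return the chain of ancestors of a page, resolving redirects on the way."""
    chain = []
    visited = set()  # guards against cycles in the parent links
    current = resolve(title, redirects)
    while current in parents and current not in visited:
        visited.add(current)
        current = resolve(parents[current], redirects)
        chain.append(current)
    return chain

# Hypothetical toy data: 'Old region name' is a redirect to 'Region'.
parents = {'Town': 'Old region name', 'Region': 'Country', 'Country': 'Continent'}
redirects = {'Old region name': 'Region'}

print(hierarchy('Town', parents, redirects))  # full chain
print(hierarchy('Town', parents, {}))         # stops at the redirect title
```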

To capture the redirect information, the following line of code was added to our own `Wikivoyage` class:

```python
redirect = node.find(ns+'redirect').attrib.get('title')
```
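
One thing to keep in mind: `Element.find()` returns `None` when a page has no `<redirect>` child, so chaining `.attrib` directly raises an `AttributeError` for pages that are not redirects. Below is a minimal None-safe sketch. The helper name `get_redirect` and the tiny XML snippets are made up for this example, and the snippets have no XML namespace, whereas the real dump needs the `ns` prefix.

```python
import xml.etree.ElementTree as ET

def get_redirect(node, ns=''):
    """Return the redirect title of a <page> element, or None if it has none."""
    redirect_node = node.find(ns + 'redirect')
    if redirect_node is None:
        return None
    return redirect_node.attrib.get('title')

# Hypothetical pages without namespace, one with and one without a redirect.
page_with = ET.fromstring('<page><title>A</title><redirect title="B" /></page>')
page_without = ET.fromstring('<page><title>C</title></page>')

print(get_redirect(page_with))
print(get_redirect(page_without))
```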

## TODO: check out `find_interlinks`

`find_interlinks()` is a function in Gensim's `wikicorpus` module.

Done.