In [None]:
%load_ext autoreload
%autoreload 2

# Wikivoyage

Latest version of the English source Wikivoyage content can be downloaded at:
https://dumps.wikimedia.org/enwikivoyage/latest/

A specific dump can also be downloaded by adding its date (`YYYYMMDD`) to the URL:
https://dumps.wikimedia.org/enwikivoyage/20191001/
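A minimal sketch of how the two URL variants above relate to the file names used in this notebook. The helper `dump_url` is hypothetical (not part of any library), and the file name pattern is assumed from the `enwikivoyage-20191001-pages-articles.xml.bz2` and `enwikivoyage-latest-pages-articles.xml.bz2` names used further down.

```python
BASE_URL = "https://dumps.wikimedia.org/enwikivoyage"


def dump_url(date="latest"):
    """Build the URL of the pages-articles dump for 'latest' or a date like '20191001'."""
    # assumption: the file name repeats the date that is used as the folder name
    return f"{BASE_URL}/{date}/enwikivoyage-{date}-pages-articles.xml.bz2"


print(dump_url())
print(dump_url("20191001"))
```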

In [None]:
data_dir = '../../../data/wikivoyage/'

path_wiki_in = data_dir + 'raw/enwikivoyage-20191001-pages-articles.xml.bz2'

path_wiki_out = data_dir + 'processed/wikivoyage_dest_list.csv'

In [None]:
import re
import pandas as pd

from itertools import islice

## requirements for base product

structured:
* destination name
* parent (incl. hierarchy) -> country, continent
* geolocation
* (possibly: synonyms?)

text:
* activities



## Gensim

Gensim has a `WikiCorpus` class that can be used to parse the Wikivoyage dump.

In [None]:
from gensim.corpora import WikiCorpus

wiki = WikiCorpus(path_wiki_in, article_min_tokens=10) 
wiki.metadata = True

`wiki.metadata = True` adds `pageid` and `title` to each tokenized document.

Some arguments to play with: 
- Only articles of sufficient length are returned (short articles, redirects, etc. are ignored). This is controlled by the `article_min_tokens` argument of the class.
- Set `token_min_len` and `token_max_len` as thresholds for the lengths of the tokens that are returned (they default to 2 and 15).
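To get a feeling for what the token length thresholds do, here is a rough stand-in for the tokenizer. This is not Gensim's implementation: `simple_tokenize` is a hypothetical helper that only mimics the length filtering with the default bounds of 2 and 15.

```python
import re


def simple_tokenize(text, token_min_len=2, token_max_len=15, lower=True):
    """Keep alphabetic tokens whose length lies within the given bounds."""
    if lower:
        text = text.lower()
    # runs of letters only (no digits, underscores or punctuation)
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if token_min_len <= len(t) <= token_max_len]


example = "A day in 's-Hertogenbosch is a supercalifragilisticexpialidocious idea"
print(simple_tokenize(example))
# ['day', 'in', 'hertogenbosch', 'is', 'idea']
```

Single-letter tokens and the very long word are dropped; everything in between is kept.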

Finally, `wiki.get_texts()` can be used to retrieve the parsed contents:

In [None]:
for (tokens, (pageid, title)) in islice(wiki.get_texts(), 5):
    print(pageid, title)

To see how many documents were parsed in total:

In [None]:
all_pages = [pageid for (tokens, (pageid, title)) in wiki.get_texts()]
len(all_pages)

## Extending Gensim to parse text content

Cleaning steps on text taken previously:
1. lower case
2. extracting type (city, park, region, country, continent)
3. extracting status (outline, usable, guide, star)
4. remove empty texts (maybe also throw away ones with very little text?)
5. get geo coordinates
6. get wikipedia link
7. get parent
8. get commons name (reference to other dataset)
9. get DMOZ folder (reference to other dataset)
10. add size of text
11. set parents of continent to 'world'
12. get parent ids by string matching ... follows from 'ispartof' ...
13. throw away some specific stuff with parents like moon and space

shit. that's a lot..!

Instead of running the R scripts I built a long time ago, it's probably better to adapt the Gensim code to parse this info on the fly. Let's make a copy of the Gensim code and create our own module.

In [None]:
from stairway.preprocessing import wikivoyage

Let's begin with retrieving the following data from the text:

```
{{IsPartOf|North Brabant}}
{{guidecity}}
{{geo|51.69014|5.29897|zoom=15}}
```

The logic of the class lives in `get_texts()`:
1. `extract_pages()` yields the text, pageid, and title of each page. This covers the **metadata**.
2. `process_article()` runs in parallel and converts texts into tokens. This is the part to adapt for parsing the **text**.
    1. `filter_wiki()` filters out wiki markup from `raw`, leaving only text:
        1. to unicode
        2. decode html
        3. `remove_markup()` filters out wiki markup from `text`, leaving only text.
            1. `remove_template()` is finally the function that removes our fields of interest
    2. `lemmatize()`. If wanted: lemmatizes text.
    3. `tokenize()`.  Tokenizes text.
   
The `remove_markup()` function contains a lot of regex parsing. Let's adjust this function and see whether, instead of only removing the matched strings, we can return them together with the text. It takes a text as input, so let's first get one example text to work with.

In [None]:
wiki_new = wikivoyage.WikiCorpus(path_wiki_in, article_min_tokens=10)
wiki_new.metadata = True

In [None]:
example_texts = []
for (pageid, title, patterns, text) in islice(wiki_new.get_texts(), 3):
    example_texts.append(text)
    
example_texts[0]

Now let's further examine Gensim's logic by looking into the `remove_markup()` function.  It looks like the part that we look for is between `{{ ... }}` and is removed by the `remove_template()` function in it. Let's check what that does to our text:

In [None]:
from gensim.corpora.wikicorpus import remove_markup, remove_template

# remove_markup(example_texts[0], promote_remaining=True, simplify_links=True)
# remove_template(example_texts[0])

Indeed, `remove_template()` is doing this. So we need to alter something here. 

Also, one can look at the [template documentation](https://meta.wikimedia.org/wiki/Help:Template) to understand it a bit better.

Now, let's try to adapt the function. Or instead of adapting it, let's add a function that retrieves the desired output and then apply the `remove_template()` after it to clean up as usual.

First let's examine how the regex works in the Gensim code:

In [None]:
RE_P3 = re.compile(r'{{([^}{]*)}}', re.DOTALL | re.UNICODE)
re.search(RE_P3, example_texts[0]).groups()

Check out the documentation on [regex syntax](https://docs.python.org/3/library/re.html?highlight=dotall#regular-expression-syntax) to break down this regular expression:
- `[^}{]`:
    - Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
    - If the first character of the set is '^', all the characters that are not in the set will be matched.
    - In normal words: match anything that is not a `{` or `}`
- `*` Causes the resulting RE to match 0 or more repetitions of the preceding RE
- `(...)` Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group

Great. So basically this boils down to 'match anything between `{{ ... }}` that contains no further braces'.
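A quick check of that reading on a made-up snippet (the `sample` and `nested` strings are illustrative, not taken from the dump). Because braces are excluded from the character set, a nested template only matches on its innermost part:

```python
import re

RE_P3 = re.compile(r'{{([^}{]*)}}', re.DOTALL | re.UNICODE)

sample = "{{IsPartOf|North Brabant}}\n{{guidecity}}\n{{geo|51.69014|5.29897|zoom=15}}"
flat = RE_P3.findall(sample)
print(flat)
# ['IsPartOf|North Brabant', 'guidecity', 'geo|51.69014|5.29897|zoom=15']

nested = "{{outer|{{inner}}}}"
inner_only = RE_P3.findall(nested)
print(inner_only)
# ['inner']
```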

Now let's create our own pattern specific to one of our use cases:

In [None]:
RE_P_Geo = re.compile(r'{{(geo|mapframe)\|([-]?[0-9]+[.]?[0-9]*)\|([-]?[0-9]+[.]?[0-9]*)([^}{]*)}}', 
                      re.DOTALL | re.UNICODE)

match = re.search(RE_P_Geo, example_texts[0])
match.groups()

Awesome! Now see if we can feed this back to the final output. First make a function.

In [None]:
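def extract_patterns(s, pattern):
    """Return (lat, lon) as strings from the first geo template in `s`, or None."""
    match = re.search(pattern, s)
    if match:
        # group 1 is the template name (geo/mapframe), groups 2 and 3 are the coordinates
        lat = match.group(2)
        lon = match.group(3)
        return lat, lon
    else: 
        return None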

In [None]:
extract_patterns(example_texts[0], RE_P_Geo)

In [None]:
extract_patterns(example_texts[1], RE_P_Geo)
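The function above returns the coordinates as strings. A possible variant, sketched here under the assumption that floats plus a basic range check are what we want downstream (`extract_geo` is a hypothetical name, not part of the `wikivoyage` module):

```python
import re

RE_P_Geo = re.compile(
    r'{{(geo|mapframe)\|([-]?[0-9]+[.]?[0-9]*)\|([-]?[0-9]+[.]?[0-9]*)([^}{]*)}}',
    re.DOTALL | re.UNICODE)


def extract_geo(s):
    """Return (lat, lon) as floats, or None if absent or out of range."""
    match = RE_P_Geo.search(s)
    if match is None:
        return None
    lat, lon = float(match.group(2)), float(match.group(3))
    # discard values that cannot be valid coordinates
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon


print(extract_geo("{{geo|51.69014|5.29897|zoom=15}}"))
print(extract_geo("text without a geo template"))
```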

Now that we have this, all we need to do is add this output to our own version of the WikiCorpus class.

In [None]:
wikivoyage.remove_markup(example_texts[0], extract_features=True)

Sweet, so now we just need to create similar functions for the other features of interest and pass the results on so that it all finally ends up in the output of the `WikiCorpus.get_texts()` function.

Note: let's leave out links to Wikipedia, DMOZ and Commons databases for now.
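A sketch of what such extra functions could look like for the parent and the status/type templates. The pattern and function names are hypothetical, and the list of types is just the one from the cleaning steps above; the real dump may well contain more template variants.

```python
import re

RE_IS_PART_OF = re.compile(r'\{\{\s*IsPartOf\s*\|\s*([^}{|]*?)\s*\}\}', re.IGNORECASE)

STATUSES = 'outline|usable|guide|star'
TYPES = 'city|park|region|country|continent'  # assumption: not exhaustive
RE_STATUS_TYPE = re.compile(
    r'\{\{\s*(' + STATUSES + r')(' + TYPES + r')\s*\}\}', re.IGNORECASE)


def extract_parent(s):
    """Return the name inside {{IsPartOf|...}}, or None."""
    match = RE_IS_PART_OF.search(s)
    return match.group(1) if match else None


def extract_status_type(s):
    """Return (status, type) from a template like {{guidecity}}, or None."""
    match = RE_STATUS_TYPE.search(s)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


sample = "{{IsPartOf|North Brabant}}\n{{guidecity}}\n{{geo|51.69014|5.29897|zoom=15}}"
print(extract_parent(sample))
print(extract_status_type(sample))
```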

In [None]:
for (pageid, title, patterns, text) in islice(wiki_new.get_texts(), 5):
    print(pageid, title, patterns)

Ok bam! Let's get all data!

#### Write to CSV

In [None]:
wiki_new.write_to_csv(path_wiki_out)

TODO: add number of tokens as feature!

In [None]:
data = pd.read_csv(path_wiki_out)
data.shape

In [None]:
data.tail(10)

TODO: remove non-relevant articles

In [None]:

  ## CLEANING ON destination TITLES
  dest_clean <- dest[!grepl('disambiguation', dest$title, ignore.case = TRUE), ] # clean disambiguation ('aberdeen')
  dest_clean <- dest_clean[!grepl('wikivoyage', dest_clean$title, ignore.case = TRUE), ] # clean joke articles ('Mordor')
  dest_clean <- dest_clean[!grepl('template', dest_clean$title, ignore.case = TRUE), ] # clean templates
  dest_clean <- dest_clean[!grepl('mediawiki', dest_clean$title, ignore.case = TRUE), ] # clean MediaWiki
  dest_clean <- dest_clean[!grepl('phrasebook', dest_clean$title, ignore.case = TRUE), ] # clean phrasebooks ('Ainu')
  dest_clean <- dest_clean[!grepl('file:', dest_clean$title, ignore.case = TRUE), ] # clean files
  dest_clean <- dest_clean[!grepl('category:', dest_clean$title, ignore.case = TRUE), ] # clean categories.
  dest <- dest_clean
  rm(dest_clean)
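The R snippet above comes from the old scripts. A possible pandas equivalent is sketched below, assuming the CSV has a `title` column; `drop_non_destinations` and the toy data are made up for illustration.

```python
import pandas as pd

# substrings in the title that mark a page as a non-destination (same list as the R code)
NON_DESTINATION_PATTERNS = ['disambiguation', 'wikivoyage', 'template',
                            'mediawiki', 'phrasebook', 'file:', 'category:']


def drop_non_destinations(df, column='title'):
    """Drop rows whose title contains any of the patterns, ignoring case."""
    pattern = '|'.join(NON_DESTINATION_PATTERNS)
    is_noise = df[column].str.contains(pattern, case=False, regex=True, na=False)
    return df[~is_noise].reset_index(drop=True)


toy = pd.DataFrame({'title': ['Amsterdam', 'Aberdeen (disambiguation)', 'Template:Geo',
                              'Dutch phrasebook', 'Category:Europe', 'Utrecht']})
clean = drop_non_destinations(toy)
print(clean['title'].tolist())
# ['Amsterdam', 'Utrecht']
```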

## Check out `find_interlinks`

`find_interlinks()` is a function in Gensim's `wikicorpus` module.

## 1. Capturing metadata

In [None]:
import xml.etree.ElementTree as etree
import csv
import bz2

In [None]:
file_name_dest = data_dir + 'processed/dest_list.csv'

In [None]:
# NOTE difference in writing csv between python 2 and 3!
# open file with encoding that can handle all characters!
f = open(file_name_dest, "w", newline="", encoding='utf-8') 
writer = csv.writer(f, delimiter=';', quoting=csv.QUOTE_NONNUMERIC)
#writer.writerow( ('page_nr', 'title', 'id', 'redirect_title') )
writer.writerow( ('title', 'id', 'redirect_title') )


Namespace is considered to be unstable according to the [Gensim documentation](https://github.com/RaRe-Technologies/gensim/blob/3e027c252eac3cf7e613f425ad8b070e8fe88065/gensim/corpora/wikicorpus.py#L411). Follow their code for a more flexible approach. 
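A sketch of that more flexible approach: read the namespace from the first tag instead of hard-coding it. The helper `get_namespace` here is a simplified re-implementation for illustration (not Gensim's code), and the XML is a toy document, not the real dump. It also shows how a redirect can be captured.

```python
import io
import re
import xml.etree.ElementTree as etree


def get_namespace(tag):
    """Return the '{...}' namespace prefix of an element tag, or '' if it has none."""
    match = re.match(r'^\{.*?\}', tag)
    return match.group(0) if match else ''


toy_xml = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Den Bosch</title><id>1</id><redirect title="'s-Hertogenbosch" /></page>
  <page><title>'s-Hertogenbosch</title><id>2</id></page>
</mediawiki>"""

rows = []
ns = None
for event, node in etree.iterparse(io.BytesIO(toy_xml), events=['start', 'end']):
    if ns is None:
        ns = get_namespace(node.tag)  # taken from the first (root) element
    if event == 'end' and node.tag == ns + 'page':
        redirect = node.find(ns + 'redirect')
        rows.append((
            node.find(ns + 'title').text,
            int(node.find(ns + 'id').text),
            None if redirect is None else redirect.attrib.get('title'),
        ))
        node.clear()  # free memory of the processed page

print(rows)
```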

## capture redirects too!

In [None]:

# the dump is bz2-compressed, so open it as a decompressed stream
# (assumption: `wiki_xml` was meant to point at `path_wiki_in`)
wiki_xml = bz2.open(path_wiki_in)

ns = '{http://www.mediawiki.org/xml/export-0.10/}'  # namespace of the XML document
# events=['end'] hands over each node once it is read completely, so its children are available
for (event, node) in etree.iterparse(wiki_xml, events=['end']):
    if node.tag == ns + 'page':  # only parse page nodes; tags are prefixed with the namespace
        title = node.find(ns + 'title').text  # contents of the title node
        wikiv_id = int(node.find(ns + 'id').text)  # contents of the id node, as integer
        redirect_node = node.find(ns + 'redirect')
        # target title if the page is a redirect, else None
        redirect = None if redirect_node is None else redirect_node.attrib.get('title')
        writer.writerow((title, wikiv_id, redirect))  # write destination info
        node.clear()  # free the memory of the processed page
wiki_xml.close()
f.close()

In [None]:


# libraries used
import sys
import xml.etree.ElementTree as etree
import csv

# Get the arguments passed in
wikidump_name = sys.argv[1]
file_name_dest = sys.argv[2]
file_name_corp = sys.argv[3]


#####------------------------       Code Body      ------------------------#####


## parsing destination information 

# NOTE difference in writing csv between python 2 and 3!
# open file with encoding that can handle all characters!
f = open(file_name_dest, "w", newline="", encoding='utf-8') 
writer = csv.writer(f, delimiter=';', quoting=csv.QUOTE_NONNUMERIC)
#writer.writerow( ('page_nr', 'title', 'id', 'redirect_title') )
writer.writerow( ('title', 'id', 'redirect_title') )

#count = 0
ns = '{http://www.mediawiki.org/xml/export-0.10/}' # set the namespace of the XML document
# events=['end'] makes sure that a node is only processed once it has been read completely
for (event, node) in etree.iterparse(wikidump_name, events=['end']):
    if node.tag == ns+'page': # only parse page nodes, add namespace before because of long string otherwise
        #count = count+1
        title = node.find(ns+'title').text # find title node and retrieve its contents
        wikiv_id = int(node.find(ns+'id').text) # find id node and convert contents to numeric
        if node.find(ns+'redirect') is None : # check if redirect exists, if not set it to None
            redirect = None
        else: 
            redirect = node.find(ns+'redirect').attrib.get('title')
        writer.writerow( (title, wikiv_id, redirect ) ) # write destination info 
        #writer.writerow( (count, title, wikiv_id, redirect ) ) # write destination info 
f.close()

## parsing text information

# NOTE difference in writing csv between python 2 and 3!
# open file with encoding that can handle all characters!
f = open(file_name_corp, "w", newline="", encoding='utf-8') 
writer = csv.writer(f, delimiter=';', quoting=csv.QUOTE_NONNUMERIC)
writer.writerow( ('id', 'text') )

ns = '{http://www.mediawiki.org/xml/export-0.10/}' # set the namespace of the XML document
# events=['end'] makes sure that a node is only processed once it has been read completely
for (event, node) in etree.iterparse(wikidump_name, events=['end']):
    if node.tag == ns+'page': # only parse page nodes, add namespace before because of long string otherwise
        #title = node.find(ns+'title').text # find title node and retrieve its contents
        wikiv_id = int(node.find(ns+'id').text) # find id node and convert contents to numeric
        text = node.find(ns+'revision').find(ns+'text').text # find text node within the revision node 
        writer.writerow( (wikiv_id, text) ) # write text info 
f.close()


## 2. Capturing the text

I suggest using Gensim, which has a lot of text parsing built in. Alternatively, go straight to vectors, depending on what you want to do.

For activity types, going straight to vectors might just be perfect.

Cleaning steps on text taken previously:
1. lower case
2. extracting type (city, park, region, country, continent)
3. extracting status (outline, usable, guide, star)
4. remove empty texts (maybe also throw away ones with very little text?)
5. get geo coordinates
6. get wikipedia link
7. get parent
8. get commons name (reference to other dataset)
9. get DMOZ folder (reference to other dataset)
10. add size of text
11. set parents of continent to 'world'
12. get parent ids by string matching ... follows from 'ispartof' ...
13. throw away some specific stuff with parents like moon and space

shit. that's a lot..!

It is probably better to rerun my existing scripts than to adapt the Gensim functions to parse this info, and then just run Gensim to get the cleaned text. On the other hand, the Gensim code is well organized: we could adjust the `remove_markup()` function, which cleans a lot, so that it returns the info we need instead of discarding it. Maybe try that!

## Reading and parsing raw wikivoyage corpus

At the spaCy conference I tried to label activities in the Wikivoyage corpus.

I documented the way of working in OneNote. There is also a video
tutorial online that describes the steps pretty well.

Some notes on what I did in that training:

Parse the latest version of the Wikivoyage dataset with Bram's code:
```
python process_wiki.py enwikivoyage-latest-pages-articles.xml.bz2 data/wiki.processed.csv
```

However, for the labelling in Prodigy this is not good enough. Actually
we need the data on a per sentence level, because labelling entire
texts of a destination is too much text. Therefore the next step would
be to adjust Bram's parser so that it doesn't remove punctuation.

The to-be flow would then be something like:
* Parse the Wikivoyage corpus in `.bz2` format and output `.jsonl` with one line per sentence.
* Add metadata on the source page to the parsed lines in `.jsonl`.
* The `.jsonl` format is necessary for classification per sentence.

example format:
```
{"text":"Uber\u2019s Lesson: Silicon Valley\u2019s Start-Up Machine Needs Fixing","meta":{"source":"The New York Times"}}
{"text":"Pearl Automation, Founded by Apple Veterans, Shuts Down","meta":{"source":"The New York Times"}}
```
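The to-be flow could be sketched roughly like this. Everything here is an assumption for illustration: the sentence splitter is deliberately naive, `write_sentences_jsonl` is a hypothetical helper, and the `pages` input stands in for whatever the adjusted parser ends up yielding.

```python
import io
import json
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def write_sentences_jsonl(pages, out):
    """Write one JSON object per sentence; `pages` yields (pageid, title, text)."""
    n = 0
    for pageid, title, text in pages:
        for sentence in split_sentences(text):
            record = {"text": sentence, "meta": {"source": title, "pageid": pageid}}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            n += 1
    return n


pages = [(1, "Utrecht", "Climb the Dom Tower. Rent a boat on the canals!")]
buffer = io.StringIO()  # a real run would write to a file opened with encoding='utf-8'
count = write_sentences_jsonl(pages, buffer)
print(buffer.getvalue())
```

A proper sentence segmenter (for example from spaCy) would be the better choice for real text, since abbreviations break this simple rule.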

Then use `textcat.teach` with the `source` argument.

possibly build own custom corpus loader for wiki:
https://support.prodi.gy/t/template-for-prodigy-corpus-and-api-loaders/331/4

For the labelling we have to choose ourselves which LABELS to use.

The model to start training from would best be `en_vectors_web_lg`, as
this has the best text representation (vectors), without the NER
and dependency crap.

When we tried Prodigy with entire destination texts at a time, we
noted that `textcat.teach` goes through the texts alphabetically.
You might want to change this too, so that it selects the sentences
it is most uncertain about.
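The idea of preferring uncertain examples, in plain Python. This is only a sketch of the ordering, not Prodigy's implementation; the scores are made up and `most_uncertain_first` is a hypothetical helper.

```python
def most_uncertain_first(scored_sentences):
    """Order (sentence, probability) pairs by how close the probability is to 0.5."""
    return sorted(scored_sentences, key=lambda pair: abs(pair[1] - 0.5))


scored = [
    ("Visit the museum.", 0.95),
    ("The town is old.", 0.52),
    ("Buy a ticket.", 0.10),
    ("Go hiking nearby.", 0.40),
]
ordered = most_uncertain_first(scored)
for sentence, score in ordered:
    print(score, sentence)
```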


## Let's go!

In [None]:
wiki_in = '../../data/wikivoyage/raw/enwikivoyage-latest-pages-articles.xml.bz2'

If pattern package is installed, use fancier shallow parsing to get token lemmas. Otherwise, use simple regexp tokenization. You can override this automatic logic by forcing the lemmatize parameter explicitly. self.metadata if set to true will ensure that serialize will write out article titles to a pickle file.

https://www.pydoc.io/pypi/gensim-3.2.0/autoapi/corpora/wikicorpus/index.html

In [None]:

# import logging
# import os.path
# import sys
# import csv

from gensim.corpora import WikiCorpus


wiki = WikiCorpus(wiki_in, lemmatize=False)
wiki.metadata = True


In [None]:
type(wiki)

In [None]:
from itertools import islice

In [None]:
type(wiki.get_texts())

In [None]:
for i in islice(wiki.get_texts(), 1):
    print(i)

We will need a custom tokenizer, as currently all punctuation is removed and thus you cannot label per sentence.

The current parser also doesn't yield important metadata like the parent, and removes all structure from the file (i.e. the sections).

Also, structure terms like 'buy' need to be removed; see the screenshot of the tagger.

Only the code from `corpora.wikicorpus` should be needed. See if we can adjust the `process_article` or `tokenizer_func` methods.

https://www.pydoc.io/pypi/gensim-3.2.0/autoapi/corpora/wikicorpus/index.html#module-corpora.wikicorpus

source code: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/corpora/wikicorpus.py
* It seems we can replace the `tokenizer` to prevent the `.` from being removed: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/utils.py
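A sketch of a tokenizer that keeps sentence-ending punctuation as separate tokens. The name `tokenize_keep_punct` is hypothetical. The argument order mirrors what I assume `WikiCorpus` passes to a custom `tokenizer_func` (text, minimum length, maximum length, lowercase flag); check this against the installed Gensim version before plugging it in.

```python
import re

# either a run of letters, or a single sentence-ending punctuation mark
TOKEN_OR_PUNCT = re.compile(r"[^\W\d_]+|[.!?]")


def tokenize_keep_punct(content, token_min_len=2, token_max_len=15, lower=True):
    """Tokenize like before, but keep '.', '!' and '?' so sentences can be rebuilt."""
    if lower:
        content = content.lower()
    tokens = []
    for tok in TOKEN_OR_PUNCT.findall(content):
        # punctuation is always kept; words are filtered on length
        if tok in ".!?" or token_min_len <= len(tok) <= token_max_len:
            tokens.append(tok)
    return tokens


print(tokenize_keep_punct("Rent a bike. See the old town!"))
# ['rent', 'bike', '.', 'see', 'the', 'old', 'town', '!']
```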

### Do something here to write to jsonl?

In [None]:
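# assumption: write to the CSV that is read back in below; adjust the path as needed
import csv
import logging

logger = logging.getLogger(__name__)

with open('../../data/wikivoyage/processed/wiki-processed.csv', 'w', newline='', encoding='utf-8') as output:
    csv_out = csv.writer(output, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL)
    for i, (tokens, (pageid, title)) in enumerate(wiki.get_texts(), start=1):
        # depending on the Gensim version, tokens are str or utf-8 encoded bytes
        if type(tokens[0]) == str:
            row = [title, ' '.join(tokens)]
        else:
            row = [title, b' '.join(tokens).decode('utf-8')]
        csv_out.writerow(row)
        if i % 10000 == 0:
            logger.info("Saved " + str(i) + " articles")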


Or just import the output after running Bram's script:

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('../../data/wikivoyage/processed/wiki-processed.csv', header=None)

In [None]:
df[0].head()

In [None]:
import pandas as pd

# Wrap title (column 0) and text (column 1) in a dictionary per row
df["json"] = df.apply(lambda x: {"text": x[1], "meta" : {"place": x[0]}}, axis=1)

# Output in JSONL format
df['json'].to_json('../../data/wikivoyage/processed/wiki-processed-prodigy.jsonl', orient='records', lines=True)

### Write to jsonl

Possibly use Prodigy's built-in functionality: https://support.prodi.gy/t/jsonl-format/783/2

or write to `jsonl` from pandas: https://stackoverflow.com/questions/51775175/pandas-dataframe-to-jsonl-json-lines-conversion

In [None]:
import ujson
from pathlib import Path

def read_jsonl(file_path):
    """Read a .jsonl file and yield its contents line by line.
    file_path (unicode / Path): The file path.
    YIELDS: The loaded JSON contents of each line.
    """
    with Path(file_path).open('r', encoding='utf8') as f:
        for line in f:
            try:  # hack to handle broken jsonl
                yield ujson.loads(line.strip())
            except ValueError:
                continue


def write_jsonl(file_path, lines):
    """Create a .jsonl file and dump contents.
    file_path (unicode / Path): The path to the output file.
    lines (list): The JSON-serializable contents of each line.
    """
    data = [ujson.dumps(line, escape_forward_slashes=False) for line in lines]
    # use a context manager so the file is flushed and closed
    with Path(file_path).open('w', encoding='utf-8') as f:
        f.write('\n'.join(data))

In [None]:
write_jsonl('test.jsonl', ["abra", "cadabra"])


In [None]:
# Notebook version of Bram's script: no command-line arguments, paths are set by hand

import logging
import os.path
import csv

from gensim.corpora import WikiCorpus

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
logger = logging.getLogger('process_wiki')

# input is the dump defined above; the output path is the one from the
# `process_wiki.py` command earlier in these notes
inp = wiki_in
outp = 'data/wiki.processed.csv'

i = 0
wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
wiki.metadata = True

if not os.path.exists("data"):
    os.makedirs("data")

with open(outp, 'w', newline='', encoding='utf-8') as output:
    csv_out = csv.writer(output, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL)

    for (tokens, (pageid, title)) in wiki.get_texts():
        # depending on the Gensim version, tokens are str or utf-8 encoded bytes
        if type(tokens[0]) == str:
            row = [title, ' '.join(tokens)]
        else:
            row = [title, b' '.join(tokens).decode('utf-8')]
        csv_out.writerow(row)
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

# the `with` block has already closed the output file
logger.info("Finished. Saved " + str(i) + " articles")


In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys
import csv

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        # no module docstring is defined here, so print an explicit usage message
        print("usage: %s <wikidump.xml.bz2> <output.csv>" % program)
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0
    # output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    wiki.metadata = True

    if not os.path.exists("data"):
        os.makedirs("data")

    with open(outp, 'w', encoding='utf-8') as output:
        csv_out = csv.writer(output, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL)

        for (tokens, (pageid, title)) in wiki.get_texts():
            # print (tokens)
            # output.write(b' '.join(title).decode('utf-8') + '\n')
            # output.write("\"" + title + '\",\"' + b' '.join(tokens).decode('utf-8') + '\"\n')
            # print(type(b' '.join(tokens).decode('utf-8')))
            # row = [title, b' '.join(tokens).decode('utf-8')]
            if type(tokens[0]) == str:
                row = [title, ' '.join(tokens)]
            else:
                row = [title, b' '.join(tokens).decode('utf-8')]
            csv_out.writerow(row)
            i = i + 1
            if (i % 10000 == 0):
                logger.info("Saved " + str(i) + " articles")

    # the `with` block has already closed the output file
    logger.info("Finished. Saved " + str(i) + " articles")
