## Introduction

This tutorial introduces the basics of parsing a Wikipedia page and storing the results in Elasticsearch.

> Wikipedia is a free online encyclopedia, created and edited by volunteers around the world.

People used to gain new knowledge by reading encyclopedias. But the amount of information generated every day is now tremendous, and a traditional printed encyclopedia cannot be updated at that speed, which is why Wikipedia emerged. Wiki pages can be edited by volunteers around the world at any time. For data scientists, this information can be very useful if we can parse it and analyze it with machine learning techniques such as NLP.

So, the first step is to understand the format of a Wikipedia page.

## Wikipage markup 

You can view the source of each page by clicking the *Edit* or *View source* tab in the top right corner. There are three basic kinds of markup:

### Template

Enclosed by ``{{ }}``. A template can hold any number of arguments, separated by ``|``. The first argument is the name (type) of the template. The names and formats of templates are defined by Wikipedia; look up their documentation for details. Example:

```
{{Infobox settlement
|name = {{raise|0.2em|Taipei}}
|official_name = Taipei City
}}
```
This is an infobox template with two parameters, ``name`` and ``official_name``. Note that markup can be nested: here the value of ``name`` is itself a template.
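
To make the nesting concrete, here is a minimal sketch that finds the outermost ``{{ }}`` spans by counting how deep we are inside braces. The helper name `top_level_templates` is made up for illustration, and the code only handles plain ``{{`` and ``}}`` pairs; a real parser deals with many more cases.

```python
def top_level_templates(text):
    """Return the outermost {{ ... }} spans in text, found by tracking nesting depth."""
    spans = []
    depth = 0       # how many templates we are currently inside
    start = None    # index where the current outermost template began
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 2])
            i += 2
        else:
            i += 1
    return spans


source = "{{Infobox settlement\n|name = {{raise|0.2em|Taipei}}\n|official_name = Taipei City\n}}"
found = top_level_templates(source)
print(len(found))  # 1: the inner {{raise|...}} template is part of the outer one
```

The inner ``{{raise|0.2em|Taipei}}`` template does not show up as a separate result because it sits at depth 2 inside the infobox.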

### Wiki Link

Enclosed by ``[[ ]]``. This represents a link to another Wikipedia page. Example:

```
 [[Macintosh|Mac]] 
```

The two arguments are separated by ``|``: the first is the name of the target Wikipedia page, and the second is the text displayed for the link. In this example, the link is displayed as ``Mac`` and points to the page ``Macintosh``. The second argument is optional; when it is omitted, the page name itself is displayed.
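
As a rough illustration of this rule, the sketch below splits a wiki link into its target and its optional display text using only string operations. The function name `split_wikilink` is hypothetical, and the sketch ignores edge cases that real wiki markup allows.

```python
def split_wikilink(markup):
    """Split '[[Target|Text]]' into (target, text); text is None when omitted."""
    inner = markup.strip()
    if not (inner.startswith("[[") and inner.endswith("]]")):
        raise ValueError("not a wiki link: %r" % markup)
    # Split on the first '|' only; anything after it is the display text
    target, sep, text = inner[2:-2].partition("|")
    return target, (text if sep else None)


print(split_wikilink("[[Macintosh|Mac]]"))    # ('Macintosh', 'Mac')
print(split_wikilink("[[OS X El Capitan]]"))  # ('OS X El Capitan', None)
```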

### External Link

Enclosed by ``[ ]``. This is a link to an external website. Example:

```
[http://www.nogizaka46.com Official website of Nogizaka46]
```

The URL is required. The optional link text follows the URL, separated from it by a space.
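
The difference from a wiki link is the separator: the first space ends the URL, and everything after it is the link text. A minimal sketch, with the hypothetical helper name `split_external_link`:

```python
def split_external_link(markup):
    """Split '[url text]' into (url, text); text is None when omitted."""
    inner = markup.strip()
    is_bracketed = inner.startswith("[") and inner.endswith("]")
    if not is_bracketed or inner.startswith("[["):
        raise ValueError("not an external link: %r" % markup)
    # Only the first space separates the URL from the link text
    url, sep, text = inner[1:-1].partition(" ")
    return url, (text if sep else None)


print(split_external_link("[http://www.nogizaka46.com Official website of Nogizaka46]"))
# ('http://www.nogizaka46.com', 'Official website of Nogizaka46')
```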

## Parsing 

You can certainly write your own parser, as we did in homework one. Here, though, I'm going to introduce a third-party Python library called [mwparserfromhell](https://github.com/earwig/mwparserfromhell/tree/develop). Installation is pretty easy:

```
$ pip install mwparserfromhell
```

### Basic usages

In [50]:
import mwparserfromhell
wikicode = mwparserfromhell.parse("Some text")
print type(wikicode)
print wikicode

<class 'mwparserfromhell.wikicode.Wikicode'>
Some text


This library is pretty easy to use. Just pass a string containing Wikipedia markup to the parser, and it will parse the entire string into a tree-like structure. The result is a `mwparserfromhell.wikicode.Wikicode` object that wraps the parsed nodes (text, templates, links, and so on).

### Filter

The most powerful feature of this library is the `filter` family of functions. We can use them to get all nodes of a specific type. Let's use some examples to explain how they work.

In [51]:
text = """
This is a wikipedia page
Hello World!
{{Infobox settlement
|name = {{raise|0.2em|Taipei}}
|official_name = Taipei City
}}
This is a Wikipedia page.
I'm using [[Macintosh|Mac]] running [[OS X El Capitan]].
Reference links
[http://www.nogizaka46.com Official website of Nogizaka46]
End of page
"""
# Parse the text above
wikicode = mwparserfromhell.parse(text)
# Get all templates
templates = wikicode.filter_templates()
print "Templates: ", templates
print "Length: ", len(templates)
print "Type", type(templates[0])

Templates:  [u'{{Infobox settlement\n|name = {{raise|0.2em|Taipei}}\n|official_name = Taipei City\n}}', u'{{raise|0.2em|Taipei}}']
Length:  2
Type <class 'mwparserfromhell.nodes.template.Template'>


As you can see, the return value of `filter_templates` is a list of all templates in the text, and each element of the list is a `mwparserfromhell.nodes.template.Template` object.
In the example text, there is a template inside another template, and the parser still gives us both of them. There is also a more general function called `filter`. Let me introduce it.

In [52]:
# Get the top level templates only
template_type = mwparserfromhell.nodes.template.Template
templates = wikicode.filter(forcetype=template_type, recursive=False)
print "Templates: ", templates
print "Length: ", len(templates)
print "Type", type(templates[0])

Templates:  [u'{{Infobox settlement\n|name = {{raise|0.2em|Taipei}}\n|official_name = Taipei City\n}}']
Length:  1
Type <class 'mwparserfromhell.nodes.template.Template'>


In this case, we use the more general `filter` function instead of `filter_templates`. We set the `recursive` flag to `False`, so we only get the top-level templates; in this example, the result has a single element. The `forcetype` argument restricts the result to nodes of the given type. In fact, `filter_templates()` is equivalent to `filter(forcetype=mwparserfromhell.nodes.template.Template)`. Furthermore, we can pass more than one type in this argument.

In [53]:
# Get both wikilinks and external links together in the same list
wikilink_type = mwparserfromhell.nodes.wikilink.Wikilink
external_link_type = mwparserfromhell.nodes.external_link.ExternalLink
links = wikicode.filter(forcetype=(wikilink_type, external_link_type), recursive=False)
for i, link in enumerate(links):
    print i, link, type(link)

0 [[Macintosh|Mac]] <class 'mwparserfromhell.nodes.wikilink.Wikilink'>
1 [[OS X El Capitan]] <class 'mwparserfromhell.nodes.wikilink.Wikilink'>
2 [http://www.nogizaka46.com Official website of Nogizaka46] <class 'mwparserfromhell.nodes.external_link.ExternalLink'>


Here we passed in a tuple of types (wikilink and external link) so we got a list of both wikilinks and external links together.

### Object Attributes

Now that we have the parsed nodes, the next step is to read the values inside each node.

In [54]:
# Get all wikilinks
wikilinks = wikicode.filter(forcetype=(wikilink_type))
for i, link in enumerate(wikilinks):
    # title is the canonical name of the target page, text is the displayed text
    print i, link.text, link.title

0 Mac Macintosh
1 None OS X El Capitan


In [55]:
# Get all external links
externallinks = wikicode.filter(forcetype=(external_link_type))
for i, link in enumerate(externallinks):
    # title is the displayed link text, url is ... the url
    print i, link.title, link.url

0 Official website of Nogizaka46 http://www.nogizaka46.com


In [56]:
# Get the first template
template = templates[0]
# Print template name
print template.name
# Print all the name and value of each parameter in this template
for param in template.params:
    print param.name, ":", param.value
    
print template.params[0].value.filter_templates()[0].params[1].value
# We can keep chaining calls because each parameter value is itself a Wikicode object

Infobox settlement

name  :  {{raise|0.2em|Taipei}}

official_name  :  Taipei City

Taipei


## Get a real Wikipedia page

In [57]:
import requests
import json

r = requests.get("https://en.wikipedia.org/w/api.php?action=query&titles=Apple%20Inc.&prop=revisions&rvprop=content&format=json")
res = r.text
obj = json.loads(res)
page = obj['query']['pages'].keys()[0]
content = obj['query']['pages'][page]['revisions'][0]['*']

wiki = mwparserfromhell.parse(content, skip_style_tags=True)
templates = wiki.filter(forcetype=template_type)
wiki_links = wiki.filter(forcetype=wikilink_type)
external_links = wiki.filter(forcetype=external_link_type)
print "# of templates:", len(templates)
print "# of wiki links:", len(wiki_links)
print "# of external links:", len(external_links)

# of templates: 493
# of wiki links: 911
# of external links: 495
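
The lookup chain `obj['query']['pages'][page]['revisions'][0]['*']` above follows the shape of the JSON that the API returns: pages are keyed by page id, and the wikitext sits under `'*'` in the first revision. The sketch below shows the same two steps (building the URL and digging out the content) using only the standard library, written for Python 3. The sample response and its page id `12345` are made up for illustration, so no network access is needed.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build the same kind of query URL as above, letting urlencode escape the title."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def extract_content(response_text):
    """Return the wikitext of the first page in an API response."""
    pages = json.loads(response_text)["query"]["pages"]
    page_id = next(iter(pages))  # works on Python 3, where keys() is not a list
    return pages[page_id]["revisions"][0]["*"]


# Hypothetical, heavily shortened response with the same nesting as the real one
sample = json.dumps(
    {"query": {"pages": {"12345": {"revisions": [{"*": "'''Example''' text"}]}}}}
)

print(build_query_url("Apple Inc."))
print(extract_content(sample))  # '''Example''' text
```

Note that `urlencode` escapes the space in the title as `+` rather than `%20`; both forms are valid in a query string.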


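Because the source of a Wikipedia page contains a lot of markup, you need to strip it out if you only want the content text. Use the ```strip_code()``` function to do this. Note that because we parsed with `skip_style_tags=True`, bold and italic quote marks such as `'''` are treated as plain text and remain in the output.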

In [58]:
print wiki.strip_code()

'''Apple Inc.''' is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services. Its hardware products include the iPhone smartphone, the iPad tablet computer, the Mac personal computer, the iPod portable media player, the Apple Watch smartwatch, and the Apple TV digital media player. Apple's consumer software includes the macOS and iOS operating systems, the iTunes media player, the Safari web browser, and the iLife and iWork creativity and productivity suites. Its online services include the iTunes Store, the iOS App Store and Mac App Store, Apple Music, and iCloud.

Apple was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976 to develop and sell personal computers. It was incorporated as '''Apple Computer, Inc.''' in January 1977, and was renamed as Apple Inc. in January 2007 to reflect its shifted focus toward consumer electronics. Apple () joine

Each page is also divided into sections. mwparserfromhell has a function called ```get_sections()``` that you can use to split the entire page into its sections. The return value is a list of wikicode objects.
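
To show what splitting into sections means, here is a standard-library sketch that cuts wikitext at heading lines such as `== History ==`. It is not how ```get_sections()``` is implemented and it flattens all heading levels into one list; the helper name `split_sections` and the sample text are made up for illustration.

```python
import re

# A heading line: 2 to 6 '=' signs, the title, then the same number of '=' signs
HEADING = re.compile(r"^(={2,6})[ \t]*(.*?)[ \t]*\1[ \t]*$", re.MULTILINE)


def split_sections(source):
    """Return a list of (title, body) pairs; the lead section has title None."""
    sections = []
    title, start = None, 0
    for match in HEADING.finditer(source):
        # Close the section that ends where this heading begins
        sections.append((title, source[start:match.start()].strip()))
        title, start = match.group(2), match.end()
    sections.append((title, source[start:].strip()))
    return sections


sample = "Lead text.\n== History ==\nFounded in 1976.\n== Products ==\nPhones.\n"
for title, body in split_sections(sample):
    print("%s: %s" % (title, body))
```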

With this parser and the functions above, you can parse pages and extract whatever information you want. For the second half of the tutorial, I want to introduce another tool called **Elasticsearch**. It works like a database system, so we can store all our parsed results in it. I'll explain it in the next section.

## Elasticsearch

> Elasticsearch is a distributed RESTful search engine built for the cloud.

Elasticsearch is a search engine designed for high availability and distributed deployment. It is built on top of *Lucene*, so it has strong full-text search capabilities. It provides both native Java APIs and HTTP RESTful APIs. The HTTP RESTful APIs are designed for modern web architectures: all requests and responses are in JSON format, so they are really easy to use.

### Get Started

First you need to install it. [Download](https://www.elastic.co/downloads/elasticsearch) the official distribution and unzip it. Run ```bin/elasticsearch``` on Unix or macOS, or ```bin\elasticsearch.bat``` on Windows. Then open your browser and go to ```localhost:9200```. If the server is running correctly, you should see something like this:
```
{
  "name" : "E8Y98x4",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "XiF921xjSSGfbkKb2ZU56w",
  "version" : {
    "number" : "5.0.0",
    "build_hash" : "253032b",
    "build_date" : "2016-10-26T04:37:51.531Z",
    "build_snapshot" : false,
    "lucene_version" : "6.2.0"
  },
  "tagline" : "You Know, for Search"
}
```

### Basics

Elasticsearch organizes data at several levels: ```index``` and ```type```. An ```index``` is like a ```database``` in SQL, and a ```type``` is like a ```table```. Each row of data is a ```document```. We can pre-define the document schema using a ```mapping```. Because documents are in JSON format, the design is very flexible: a document is basically a dictionary whose values can be of any data type, including arrays. Documents can also be nested (a dictionary inside a dictionary), but this makes queries more complicated. There is no *join* operation in Elasticsearch, so you cannot reuse the same design you would use in SQL.
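
As a sketch of what such a mapping could look like, the dictionary below describes a `page` type for the wiki documents built at the end of this tutorial. The field types are my own assumptions (they use the Elasticsearch 5.x mapping syntax, matching the server version shown above) and are not something the tutorial itself defines; without a mapping, Elasticsearch infers the types automatically.

```python
import json

# Hypothetical mapping for a "page" type, in Elasticsearch 5.x syntax
wiki_mapping = {
    "mappings": {
        "page": {
            "properties": {
                "title": {"type": "keyword"},  # stored as one exact value
                "text": {"type": "text"},      # analyzed for full-text search
                "wiki_links": {                # list of dictionaries (nested)
                    "type": "nested",
                    "properties": {
                        "title": {"type": "keyword"},
                        "text": {"type": "text"},
                    },
                },
                "external_links": {
                    "type": "nested",
                    "properties": {
                        "url": {"type": "keyword"},
                        "title": {"type": "text"},
                    },
                },
            }
        }
    }
}

print(json.dumps(wiki_mapping, indent=2))
```

A dictionary like this would be passed as the request body when the index is created, for example through the `body` argument of `es.indices.create`.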

Here I want to introduce another Python package called ```elasticsearch-py```. It is a high-level client for Python, so you don't need to deal with the low-level HTTP RESTful APIs directly. To install the package, just type
```
$ pip install elasticsearch
```
in your terminal and you are all set!

### Usage

In [59]:
from elasticsearch import Elasticsearch
# Connect to the server
es = Elasticsearch()

# Create an index for testing
es.indices.create(index='test', ignore=400)

{u'acknowledged': True, u'shards_acknowledged': True}

There is an app called ```elasticsearch-head``` that helps us manage the data on our server through a graphical UI. Type the following commands in your terminal to install and start it.
```
$ git clone https://github.com/mobz/elasticsearch-head.git
$ cd elasticsearch-head
$ npm install
$ grunt server
```
Go back to the Elasticsearch folder, find the ```elasticsearch.yml``` file under the ```config``` folder, add these two lines to the file, and then restart the server. They enable cross-origin requests so that the browser app can talk to the server.
```
http.cors.enabled: true
http.cors.allow-origin: "*"
```
Open your browser then go to ```localhost:9100``` and you should see the index ```test``` we just created.

Let's try to put something into Elasticsearch.

In [60]:
es.index(index="test", doc_type="test-type", id=1, body={
        "field-1": "Hello World",
        "field-2": 2,
        "field-3": True,
        "field-4": [1,2,3],
        "field-5":{
            "inner-1": "YES",
            "inner-2": 100
        }
    })

{u'_id': u'1',
 u'_index': u'test',
 u'_shards': {u'failed': 0, u'successful': 1, u'total': 2},
 u'_type': u'test-type',
 u'_version': 1,
 u'created': True,
 u'result': u'created'}

The ```index``` function indexes this document into the index *test* under the type *test-type*. You can use the app we just installed to see the data: go to the *Browser* tab to see all the data on the server. As you can see, the structure of each document is really flexible. Let's add another document so that we can learn how to run queries.

In [61]:
es.index(index="test", doc_type="test-type", id=2, body={
        "field-1": "This is another test data",
        "field-2": 5,
        "field-3": False,
        "field-4": [2,4,6],
        "field-5":{
            "inner-1": "FALSE",
            "inner-2": 50
        }
    })

{u'_id': u'2',
 u'_index': u'test',
 u'_shards': {u'failed': 0, u'successful': 1, u'total': 2},
 u'_type': u'test-type',
 u'_version': 1,
 u'created': True,
 u'result': u'created'}

In [62]:
# Query by id
res = es.get(index="test", doc_type="test-type", id=2)
# Result is a dict
# Content is inside "_source"
content = res['_source']
print content
print content['field-1']

{u'field-4': [2, 4, 6], u'field-5': {u'inner-1': u'FALSE', u'inner-2': 50}, u'field-1': u'This is another test data', u'field-2': 5, u'field-3': False}
This is another test data


Queries in Elasticsearch are written in a JSON-based language called *Query DSL*. Here are some basic examples; for further usage, please check the [official website](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

In [63]:
# Get all data
res = es.search(index='test', doc_type='test-type', body={
        "query": {
            "match_all": {}
        }
    })
# print res
# Result is a dict, content is inside "hits" under "hits"
hits = res['hits']['hits']
print hits
# It's a list; each entry contains the relevance score and some other fields, and the data itself is under "_source"
for hit in hits:
    print hit['_source']

[{u'_score': 1.0, u'_type': u'test-type', u'_id': u'2', u'_source': {u'field-4': [2, 4, 6], u'field-5': {u'inner-1': u'FALSE', u'inner-2': 50}, u'field-1': u'This is another test data', u'field-2': 5, u'field-3': False}, u'_index': u'test'}, {u'_score': 1.0, u'_type': u'test-type', u'_id': u'1', u'_source': {u'field-4': [1, 2, 3], u'field-5': {u'inner-1': u'YES', u'inner-2': 100}, u'field-1': u'Hello World', u'field-2': 2, u'field-3': True}, u'_index': u'test'}]
{u'field-4': [2, 4, 6], u'field-5': {u'inner-1': u'FALSE', u'inner-2': 50}, u'field-1': u'This is another test data', u'field-2': 5, u'field-3': False}
{u'field-4': [1, 2, 3], u'field-5': {u'inner-1': u'YES', u'inner-2': 100}, u'field-1': u'Hello World', u'field-2': 2, u'field-3': True}


In [31]:
# Find all the entries that have term "hello" in "field-1"
res = es.search(index='test', doc_type='test-type', body={
        "query": {
            "term": {
                "field-1": "hello"
            }
        }
    })
print res['hits']['total']
# The total hits should be 1

1
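
Why does the lowercase term `hello` match a document whose field contains `Hello World`? A `term` query looks up the term exactly as given, but text fields are analyzed when they are indexed, and the default analyzer splits the text into words and lowercases them. The sketch below imitates that behaviour with a rough stand-in for the analyzer. The functions `analyze` and `term_matches` are made up for illustration and are far simpler than what Elasticsearch really does.

```python
import re


def analyze(text):
    """Rough stand-in for the default analyzer: split into words and lowercase them."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def term_matches(field_value, term):
    """A term query compares the raw term against the indexed (analyzed) tokens."""
    return term in analyze(field_value)


print(analyze("Hello World"))                # ['hello', 'world']
print(term_matches("Hello World", "hello"))  # True
print(term_matches("Hello World", "Hello"))  # False: the term itself is not lowercased
print(term_matches("This is another test data", "hello"))  # False
```

This is why the query above returns exactly one hit, and why searching for the term `Hello` with a capital letter would return none. A `match` query analyzes the query text as well, so it is usually the friendlier choice for full-text fields.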


## Putting it all together

We can now parse pages, extract the data we want, and store it in Elasticsearch so that we can run simple queries against it later.

In [67]:
es.indices.create(index='wiki', ignore=400)

{u'acknowledged': True, u'shards_acknowledged': True}

In [68]:
def parse_page(url, title):
    r = requests.get(url)
    res = r.text
    obj = json.loads(res)
    page = obj['query']['pages'].keys()[0]
    content = obj['query']['pages'][page]['revisions'][0]['*']

    wiki = mwparserfromhell.parse(content, skip_style_tags=True)
    
    wiki_links = wiki.filter(forcetype=wikilink_type)
    wiki_links = map(lambda a: {"title": unicode(a.title), "text": unicode(a.text) if a.text is not None else ""}, wiki_links)
    
    external_links = wiki.filter(forcetype=external_link_type)
    external_links = map(lambda a: {"url": unicode(a.url), "title": unicode(a.title) if a.title is not None else ""}, external_links)
    text = wiki.strip_code()
    return {"title": title, "text": text, "wiki_links": wiki_links, "external_links": external_links}

apple = parse_page("https://en.wikipedia.org/w/api.php?action=query&titles=Apple%20Inc.&prop=revisions&rvprop=content&format=json", "Apple Inc.")
es.index(index="wiki", doc_type="page", id=1, body=apple)
google = parse_page("https://en.wikipedia.org/w/api.php?action=query&titles=Google&prop=revisions&rvprop=content&format=json", "Google")
es.index(index="wiki", doc_type="page", id=2, body=google)

{u'_id': u'2',
 u'_index': u'wiki',
 u'_shards': {u'failed': 0, u'successful': 1, u'total': 2},
 u'_type': u'page',
 u'_version': 1,
 u'created': True,
 u'result': u'created'}

In [69]:
# Find all the entries that have term "ipod" in "text"
res = es.search(index='wiki', doc_type='page', body={
        "query": {
            "term": {
                "text": "ipod"
            }
        }
    })
print res['hits']['total']
# The total hits should be 1

1


## Reference

+ http://mwparserfromhell.readthedocs.io/en/latest/api/mwparserfromhell.html
+ https://elasticsearch-py.readthedocs.io/en/master/
+ https://github.com/earwig/mwparserfromhell/tree/develop
+ https://github.com/elastic/elasticsearch
+ https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
+ https://mobz.github.io/elasticsearch-head/